This started with a question someone posted online:
$pattern='/^\w+$/'; $str="人1994"; $ret = preg_match($pattern, $str, $matches);
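He expected \w to match the Chinese character “人”, but when the code actually runs, the match fails. Commenters offered various explanations and fixes, which boil down to two workable approaches:
- Use the /u pattern modifier [1]
- Use the Unicode escape representation of the character
Either approach resolves the problem above.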
The code and its output are as follows:
    <?php
    $pattern = '/^\w+$/';
    $str = "人1994";
    $ret = preg_match($pattern, $str, $matches);
    var_dump($ret);
    var_dump($matches);

    $pattern = '/^\w+$/u';
    $str = "人1994";
    $ret = preg_match($pattern, $str, $matches);
    var_dump($ret);
    var_dump($matches);

    // From PHP 7.0
    $pattern = '/^\w+$/u';
    $str = "\u{4eba}1994"; // 人 => \u4eba
    $ret = preg_match($pattern, $str, $matches);
    var_dump($ret);
    var_dump($matches);

Output for hhvm-3.18.5 - 3.22.0, 7.1.0 - 7.2.4

    int(0)
    array(0) {
    }
    int(1)
    array(1) {
      [0]=>
      string(7) "人1994"
    }
    int(1)
    array(1) {
      [0]=>
      string(7) "人1994"
    }

Output for 5.6.0 - 5.6.30

    int(0)
    array(0) {
    }
    int(1)
    array(1) {
      [0]=>
      string(7) "人1994"
    }
    int(0)
    array(0) {
    }
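Note: the syntax used in the third example is supported from PHP 7.0 onward. [2] [3] On PHP 5.6 the sequence \u{4eba} inside a double-quoted string is not interpreted and stays as literal text, which is why the third match fails on those versions.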
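To make the note above concrete, here is a minimal sketch of what the codepoint escape produces. It assumes PHP 7.0 or later and that the source file is saved as UTF-8; the variable names are only illustrative.

```php
<?php
// On PHP 7.0+, "\u{4eba}" in a double-quoted string is expanded into
// the UTF-8 encoding of U+4EBA (the character 人).
$escaped = "\u{4eba}1994";
$literal = "人1994"; // assumes this file is saved as UTF-8

var_dump($escaped === $literal); // same byte string if the file is UTF-8
echo bin2hex("\u{4eba}"), "\n";  // e4baba
echo strlen($escaped), "\n";     // 7 (3 bytes for 人 plus 4 ASCII digits)
```

Because both strings contain the same bytes on PHP 7.0+, the third example in the listing behaves like the second one: the match still depends on the /u modifier.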
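Another commenter offered a different view [4]:
This is not PHP's fault; it comes from how the PCRE library is configured. Here is what the official documentation says about what \w matches: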
A “word” character is any letter or digit or the underscore character, that is, any character which can be part of a Perl “word”. The definition of letters and digits is controlled by PCRE’s character tables, and may vary if locale-specific matching is taking place. For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
In other words, the configuration of the PCRE library's character tables affects what \w matches.
I have not yet verified whether this claim is correct.
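One way to probe the claim is to ask PCRE directly which single byte values \w accepts when the /u modifier is absent. The following is only a sketch: `word_bytes` is a hypothetical helper name, and the result depends on the PHP build, the PCRE version, and the active locale, so no particular output is promised here.

```php
<?php
// Hypothetical helper: collect every byte value (0-255) that \w accepts
// when the pattern is used without the /u modifier.
function word_bytes(): array
{
    $matched = [];
    for ($byte = 0; $byte <= 255; $byte++) {
        if (preg_match('/^\w$/', chr($byte)) === 1) {
            $matched[] = $byte;
        }
    }
    return $matched;
}

$bytes = word_bytes();
echo count($bytes), " byte values match \\w\n";

// Entries above 127 would indicate locale-specific character tables at work.
$high = array_filter($bytes, function ($b) {
    return $b > 127;
});
echo "bytes above 127 that match: ", count($high), "\n";
```

If no byte above 127 is reported, none of the three UTF-8 bytes of “人” can match \w in byte mode, which would be consistent with the failed match in the first example.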
For now, the best approach appears to be adding the /u pattern modifier whenever you write a pattern.
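As a small illustration of that recommendation, the check can be wrapped in a helper. This is a sketch, not a definitive implementation: `is_word_string` is a hypothetical name, the file is assumed to be saved as UTF-8, and the PCRE build is assumed to have Unicode support, as in the outputs shown earlier.

```php
<?php
// Hypothetical helper: true when $s consists only of "word" characters,
// with the subject treated as UTF-8 because of the /u modifier.
function is_word_string(string $s): bool
{
    // preg_match returns 1 on a match, 0 on no match, and false on error
    // (for example, when /u is set and the subject is not valid UTF-8).
    return preg_match('/^\w+$/u', $s) === 1;
}

var_dump(is_word_string("人1994"));    // bool(true)
var_dump(is_word_string("人 1994"));   // bool(false), contains a space
var_dump(is_word_string("\xff1994"));  // bool(false), invalid UTF-8
```

Comparing against 1 with `===` keeps the error case (false) from being confused with a successful match. Note that `$` also matches just before a trailing newline, so a stricter check could use `\z` or the D modifier instead.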
References:
[1] https://secure.php.net/manual/zh/reference.pcre.pattern.modifiers.php
[2] PHP 7.0 has introduced the “Unicode codepoint escape” syntax.
[3] To support large Unicode ranges (ie: [\x{E000}-\x{FFFD}] or \x{10FFFFF}) you must use the modifier ‘/u’ at the end of your expression.
https://secure.php.net/manual/en/function.preg-match.php#90771
[4] https://www.v2ex.com/t/262150
https://stackoverflow.com/questions/1330693/validate-username-as-alphanumeric-with-underscores