CHAPTER 9 – REGULAR EXPRESSIONS – Lazy Matching
Suppose you have the following string and you want to match the text inside the first <a> tag:

    <a href="http://php.net/">PHP</a> has an <a href="http://php.net/manual">excellent</a> manual.

The following pattern looks like it will work:

    '@<a.*>(.*)</a>@'
However, when you run the following example, you see that it outputs the wrong result:

    <?php
    $str = '<a href="http://php.net/">PHP</a> has an '.
           '<a href="http://php.net/manual">excellent</a> manual.';
    $pattern = '@<a.*>(.*)</a>@';
    preg_match($pattern, $str, $matches);
    print_r($matches);
    ?>

outputs

    Array
    (
        [0] => <a href="http://php.net/">PHP</a> has an <a href="http://php.net/manual">excellent</a>
        [1] => excellent
    )

The example fails because the * and the + are greedy quantifiers. They try to match as many characters as possible. In this case, <a.*> matches everything up to and including manual">. You can tell the PCRE engine not to do this by appending a ? to the quantifier. If the ? is added, the PCRE engine tries to match as few characters or sub-patterns as possible, which is what we want here. When the pattern @<a.*?>(.*?)</a>@ is used, the output is correct:

    Array
    (
        [0] => <a href="http://php.net/">PHP</a>
        [1] => PHP
    )

However, this is not the most efficient way. It's usually better to use the pattern @<a[^>]+>([^<]+)</a>@, which requires less processing by the PCRE engine: the negated character classes cannot run past the end of the tag, so the engine has far less backtracking to do.
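The difference between the three patterns discussed above can be checked with a short script. This is only a minimal sketch: it reuses the subject string from the example, and the $patterns and $captured arrays are names introduced here purely for illustration.

```php
<?php
// Subject string from the example above.
$str = '<a href="http://php.net/">PHP</a> has an '
     . '<a href="http://php.net/manual">excellent</a> manual.';

// The three variants discussed in the text (array keys are illustrative).
$patterns = array(
    'greedy'  => '@<a.*>(.*)</a>@',
    'lazy'    => '@<a.*?>(.*?)</a>@',
    'negated' => '@<a[^>]+>([^<]+)</a>@',
);

// Collect what the first capturing group holds for each variant.
$captured = array();
foreach ($patterns as $name => $pattern) {
    preg_match($pattern, $str, $matches);
    $captured[$name] = $matches[1];
    echo $name, ': ', $matches[1], "\n";
}
// prints:
// greedy: excellent
// lazy: PHP
// negated: PHP
```

The lazy and the negated-class patterns capture the same text here; they differ in how much work the engine does to get there, not in the result.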
Modifiers

The modifiers "modify" the behavior of the pattern matching engine. Table 9.3 lists them all with descriptions and examples.

Table 9.3 Modifiers

Modifier  Description

i         Makes the PCRE engine match in a case-insensitive way.
          /[a-z]/ matches a letter in the range a..z.
          /[a-z]/i matches a letter in the ranges A..Z and a..z.
m         Changes the behavior of ^ and $ in such a way that ^ also matches just after a newline character, and $ also matches just before a newline character.

              <?php
              $str = "ABC\nDEF\nGHI";
              preg_match('@^DEF@', $str, $matches1);
              preg_match('@^DEF@m', $str, $matches2);
              print_r($matches1);
              print_r($matches2);
              ?>

          outputs

              Array
              (
              )
              Array
              (
                  [0] => DEF
              )

s         With this modifier set, the . (dot) also matches the newline character; without this modifier set (the default), it does not match the newline character.

              <?php
              $str = "ABC\nDEF\nGHI";
              preg_match('@BC.DE@', $str, $matches1);
              preg_match('@BC.DE@s', $str, $matches2);
              print_r($matches1);
              print_r($matches2);
              ?>

          outputs

              Array
              (
              )
              Array
              (
                  [0] => BC
              DE
              )
x         If this modifier is set, you can put arbitrary whitespace inside your pattern, except of course in character classes.

              <?php
              $str = "ABC\nDEF\nGHI";
              preg_match('@A B C@', $str, $matches1);
              preg_match('@A B C@x', $str, $matches2);
              print_r($matches1);
              print_r($matches2);
              ?>

          outputs

              Array
              (
              )
              Array
              (
                  [0] => ABC
              )

e         Only has an effect on the preg_replace() function. When it is set, preg_replace() performs the normal replacement of back references and then evaluates the replacement string as PHP code. For an example, see the section "Replacement Functions."

A         Setting this modifier has the same effect as using ^ as the first character in your pattern, unless the m modifier is set.

              <?php
              $str = "ABC";
              preg_match('@BC@', $str, $matches1);
              preg_match('@BC@A', $str, $matches2);
              print_r($matches1);
              print_r($matches2);
              ?>

          outputs

              Array
              (
                  [0] => BC
              )
              Array
              (
              )
D         Makes $ match only at the very end of the subject string, and not just before a trailing newline character.

              <?php
              $str = "ABC\n";
              preg_match('@BC$@', $str, $matches1);
              preg_match('@BC$@D', $str, $matches2);
              print_r($matches1);
              print_r($matches2);
              ?>

          outputs

              Array
              (
                  [0] => BC
              )
              Array
              (
              )

U         Swaps the "greediness" of the PCRE engine. Quantifiers become ungreedy by default, and the ? character turns on greediness. This makes the pattern we saw in an earlier example ('@<a.*?>(.*?)</a>@') an equivalent of '@<a.*>(.*)</a>@U'.

              <?php
              $str = '<a href="http://php.net/">PHP</a> has an '.
                     '<a href="http://php.net/manual">'.
                     'excellent</a> manual.';
              $pattern = '@<a.*>(.*)</a>@U';
              preg_match($pattern, $str, $matches);
              print_r($matches);
              ?>

          outputs

              Array
              (
                  [0] => <a href="http://php.net/">PHP</a>
                  [1] => PHP
              )
X         Turns on extra features in the PCRE engine. At the moment, the only feature it turns on is that the engine raises an error when it detects an unknown escape sequence. Normally, such a sequence is just treated as a literal. (Notice that we still have to escape the backslash for PHP itself.)

              <?php
              $str = '\\h';
              preg_match('@\\h@', $str, $matches1);
              preg_match('@\\h@X', $str, $matches2);
              ?>

          outputs

              Warning: preg_match(): Compilation failed: unrecognized character follows \ at offset 1 in /dat/docs/book/prenticehall/php5powerprogramming/chapters/draft/10-mainstream-extensions/pcre/mod-X.php on line 4

u         Turns on UTF-8 mode. In UTF-8 mode, the PCRE engine treats the pattern as UTF-8 encoded. This means, for example, that the . (dot) matches a whole multi-byte character. (The next example assumes that the script is saved in UTF-8, so that the é in Dérick is stored as two bytes.)

              <?php
              $str = 'Dérick';
              preg_match('@D.rick@', $str, $matches1);
              preg_match('@D.rick@u', $str, $matches2);
              print_r($matches1);
              print_r($matches2);
              ?>

          outputs

              Array
              (
              )
              Array
              (
                  [0] => Dérick
              )
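Modifiers can be combined by listing them one after another behind the closing delimiter. The following sketch is not part of Table 9.3; it illustrates the i modifier (which has no listing of its own in the table), a combination of i and m, a commented pattern using x, and the inverted meaning of ? under U. All variable names are chosen here for illustration. It also relies on the PCRE behavior that, with x set, an unescaped # outside a character class starts a comment that runs to the end of the line.

```php
<?php
$str = '<a href="http://php.net/">PHP</a> has an '
     . '<a href="http://php.net/manual">excellent</a> manual.';

// i: the same character class with and without case-insensitivity.
$sensitive   = preg_match('/^[a-z]+$/', 'PHP');   // 0, no match
$insensitive = preg_match('/^[a-z]+$/i', 'PHP');  // 1, match

// i and m combined: ^ and $ match at line boundaries, case is ignored.
preg_match('@^def$@im', "ABC\nDEF\nGHI", $line);  // $line[0] is "DEF"

// x: whitespace in the pattern is ignored and # starts a comment.
$verbose = '@
    <a [^>]+ >    # opening tag with its attributes
    ( [^<]+ )     # link text, captured as group 1
    </a>          # closing tag
@x';
preg_match($verbose, $str, $link);                // $link[1] is "PHP"

// U: quantifiers are lazy by default, and a trailing ? makes them greedy.
preg_match('@<a.*>(.*)</a>@U', $str, $lazy);      // $lazy[1] is "PHP"
preg_match('@<a.*?>(.*?)</a>@U', $str, $greedy);  // $greedy[1] is "excellent"

var_dump($sensitive, $insensitive, $line[0], $link[1], $lazy[1], $greedy[1]);
```

Because the delimiter ends the pattern at its first unescaped occurrence, the comments inside the x pattern must not contain the delimiter character (@ in this sketch).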