CHAPTER 9 – REGULAR EXPRESSIONS
Although regular expressions are very powerful, they are difficult to use, especially if you're new to them. So, instead of jumping straight to the functions that PHP supports for dealing with regular expressions, we cover the pattern matching syntax first. If PCRE is enabled, the section shown in Figure 9.3 should show up in the phpinfo() output.

Fig. 9.3 PCRE phpinfo() output.
Syntax

PCRE functions check whether a text string matches a pattern. The syntax of a pattern always has the following format:

    <delimiter> <pattern> <delimiter> [<modifiers>]

The modifiers are optional. The delimiter separates the pattern from the modifiers. PCRE uses the first character of the expression as the delimiter. You should use a character that does not exist in the pattern itself. Or, you can use a character that exists in your expression, but then you must escape it with the backslash (\). Traditionally, the / is used as the delimiter, but other common delimiters are | or @. It's your choice. Personally, in most cases, we would pick the @, unless we need to do matching on an email or similar pattern that contains the @, in which case we would use the /.

The PHP function preg_match() is used to match regular expressions. The first parameter passed to the function is the pattern. The second parameter is the string to be matched against the pattern and is also called the subject. The function returns 1 if the pattern matches and 0 if it does not; in a condition these behave as TRUE and FALSE. It returns FALSE itself only if an error occurred. You can also pass a third parameter, a variable. The text that matches is stored in this variable, which is passed by reference, as an array. If you don't need the matching text but just want to know whether there is a match, you can leave out the third parameter. In short, the format is as follows, with $matches being optional:

    $result = preg_match($pattern, $subject, $matches);

Note: The examples in this section will not use the <?php and ?> tags, but of course, they are required.

Pattern Syntax

PCRE's matching syntax is very complex. A full discussion of all its details would exceed the scope of this book. We cover just the basics here, which is enough to be very useful.
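As a short aside, the delimiter rules and the preg_match() call format from the Syntax section can be sketched as follows. The subject string and the patterns are made up for illustration. The sketch assumes a current PHP version, where a successful match returns the integer 1 and a failed match returns 0.

```php
<?php
// Sketch only: the subject and patterns are hypothetical illustrations.
$subject = 'derick@php.net';

// The same pattern written with three different delimiters.
// With / or | the @ needs no escaping; with @ as the delimiter,
// the @ inside the pattern has to be escaped with a backslash.
$withSlash = preg_match('/k@p/', $subject);
$withPipe  = preg_match('|k@p|', $subject);
$withAt    = preg_match('@k\@p@', $subject);
var_dump($withSlash, $withPipe, $withAt);   // each is int(1)

// The optional third parameter receives the matched text by reference.
$result = preg_match('/@[a-z]+/', $subject, $matches);
var_dump($result);       // int(1)
var_dump($matches[0]);   // string(4) "@php"

// No match: the return value is 0 and $matches becomes an empty array.
$noMatch = preg_match('/@[0-9]+/', $subject, $empty);
var_dump($noMatch);      // int(0)
var_dump($empty);        // empty array
```

Because 1 and 0 convert to TRUE and FALSE, the result can be used directly in an if statement.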
On most UNIX systems with the PCRE library installed, you can use man pcrepattern to read about the whole pattern matching language, or have a look at the (somewhat outdated) PHP Manual page at http://www.php.net/manual/en/pcre.pattern.syntax.php. But here we start with the simple things:

Metacharacters

The characters in Table 9.1 are special characters: they are used to construct patterns.

Table 9.1 Metacharacters

Character    Description
\
    The general escape character. You need it whenever you want to use one of the metacharacters, or the delimiter, as a literal character in your pattern. The backslash can also be used to specify other special characters, which you can find in the next table.

.
    Matches exactly one character, except a newline character.
    preg_match('/./', 'PHP 5', $matches);
    $matches now contains Array ( [0] => P )

?
    Marks the preceding character or sub-pattern as optional.
    preg_match('/PHP.?5/', 'PHP 5', $matches);
    This pattern matches both PHP5 and PHP 5.

+
    Matches the preceding character or sub-pattern one or more times. '/a+b/' matches 'ab', 'aab', and 'aaaaaaaab', but not 'b'. preg_match() also reports a match in the following example, but $matches does not contain the surrounding characters.
    preg_match('/a+b/', 'caaabc', $matches);
    $matches now contains Array ( [0] => aaab )

*
    Matches the preceding character or sub-pattern zero or more times. '/de*f/' matches 'df', 'def', and 'deeeef'. Again, surrounding characters are not part of the matched substring, but do not cause the match to fail.
{m}, {m,n}
    Matches the preceding character or sub-pattern exactly m times if the {m} variant is used, or m to n times if the {m,n} variant is used. '/tre{1,2}f/' matches 'tref' and 'treef', but not 'treeef'. If the number after the comma is missing, there is no upper boundary. If the number in front of the comma is missing, the lower boundary is meant to be 0, but older PCRE versions do not recognize {,n} as a quantifier and treat it as literal text, so writing {0,n} is the safer form. '/fo{2,}ba{0,2}r/' matches 'foobar', 'fooooooobar', and 'fooobaar', but not 'foobaaar'.

^
    Marks the beginning of the subject. '/^ghi/' matches 'ghik' and 'ghi', but not 'fghi'.

$
    Marks the end of the subject, unless the last character is a newline (\n) character. In that case, it matches just before that newline character. '/Derick$/' matches "Rethans, Derick" and "Rethans, Derick\n" but not "Derick Rethans".

[ ... ]
    Makes a character class out of the characters between the opening and closing bracket. You can use this to create a group of characters to match. Using a hyphen inside the character class creates a range of characters. If you want the hyphen itself to be part of the class, put it last in the class. The caret (^) has a special meaning if it is used as the first character in the class. In this case, it negates the character class, which means that it matches any character that is not listed.
    Example 1:
    preg_match('/[0-9]+/', 'PHP is released in 2005.', $matches);
    $matches now contains Array ( [0] => 2005 )
    Example 2:
    preg_match('/[^0-9]+/', 'PHP is released in 2005.', $matches);
    $matches now contains Array ( [0] => PHP is released in )
    Note that $matches does not include the dot from the subject, because a pattern always matches a consecutive string of characters.
    Inside a character class, the metacharacters from this table lose their special meaning, except for ^ (to negate the character class), - (to create a range), ] (to end the character class), and \ (to escape special characters).
( ... )
    Creates a sub-pattern, which can be used to group certain elements in a pattern. For example, if we had the string 'PHP in 2005.' and we wanted to extract both the century and the year as two separate entries in the $matches array, we would use the following regexp:
    '/([12][0-9])([0-9]{2})/'
    This creates two sub-patterns:
        ([12][0-9]) to match all centuries from 10 to 29.
        ([0-9]{2}) to match the year in the century.
    preg_match('/([12][0-9])([0-9]{2})/', 'PHP in 2005.', $matches);
    $matches now contains Array ( [0] => 2005 [1] => 20 [2] => 05 )
    The element with index 0 is always the fully matched string, and all sub-patterns are assigned a number in the order in which they occur in the pattern.

(?: ...)
    Creates a sub-pattern that is not captured in the output. You can use this to require that part of the pattern is followed by something, without getting that something as a separate element.
    preg_match('@([A-Za-z ]+)(?:hans)@', 'Derick Rethans', $matches);
    $matches now contains Array ( [0] => Derick Rethans [1] => Derick Ret )
    As you can see, the full match string still includes the fully matched part of the subject, but there is only one extra element for the sub-pattern matches. Without the ?: in the second sub-pattern, there would also have been an element containing hans.
(?P<name>...)
    Creates a named sub-pattern. It is the same as a normal sub-pattern, but it generates additional elements in the $matches array, keyed by the name.
    preg_match('/(?P<century>[12][0-9])(?P<year>[0-9]{2})/', 'PHP in 2005.', $matches);
    $matches now contains Array ( [0] => 2005 [century] => 20 [1] => 20 [year] => 05 [2] => 05 )
    This is useful if you have a complex pattern and don't want to bother finding out the correct index number in the $matches array.

9.3.1.3 Example 1

Let's dissect some useful complex regular expressions that we can create with the metacharacters from Table 9.1:

    $pattern = "/^([0-9a-f][0-9a-f]:){5}[0-9a-f][0-9a-f]$/";

This pattern matches a MAC address (a unique number bound to a network card) with the format 00:04:23:7c:5d:01. The pattern is bound to the start and end of our subject string with ^ and $, and it contains two parts:

    ([0-9a-f][0-9a-f]:){5}. Matches the first five two-character groups, each followed by a colon.
    [0-9a-f][0-9a-f]. Matches the sixth group of two hexadecimal digits.

This regexp could also have been written as /^([0-9a-f]{2}:){5}[0-9a-f]{2}$/, which would have been a bit shorter. To test the text against the pattern, use the following code:

    preg_match($pattern, '00:04:23:7c:5d:01', $matches);
    print_r($matches);
With either pattern, the output would be the same, as follows:

    Array
    (
        [0] => 00:04:23:7c:5d:01
        [1] => 5d:
    )

9.3.1.4 Example 2

    "/([^<]+)<([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+)>/"

This pattern is used to match email addresses in the following format:

    'Derick Rethans <derick@php.net>'

This pattern is not good enough to match all email addresses, and it accepts some addresses that should not be matched. It only serves as a simple example.

The first part is ([^<]+)<, as follows:

    /. The delimiter used in this pattern.
    ([^<]+). A subpattern that matches one or more characters that are not the < character.
    <. The < character, which is not part of any sub-pattern.

The second part is ([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+), which is used to match the email address itself:

    [a-zA-Z0-9_-]+. This matches everything until the @ and consists of one or more characters from the specified character class.
    @. The @ sign.
    ([a-zA-Z0-9_-]+\\.)+. A subpattern that matches one or more levels of subdomains. Notice that the . in the pattern is escaped with a \, and that this \ is itself escaped with another \. The extra backslash is there because the pattern is enclosed in double quotes ("), where the backslash is also PHP's own escape character. You need to be careful with this. It would usually be better to use single quotes for the pattern.
    [a-zA-Z0-9_-]+. The top-level domain name (as in .com). As you can see, the regexp is not correct here; the last part should have been simply [a-z]{2,4}.

Then there is the trailing > and the delimiter.
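The warning about double quotes can be made concrete with a small sketch. The patterns below are simplified illustrations rather than the email pattern itself. They show that a doubled backslash in double quotes produces the same pattern string as a single backslash in single quotes, and that an escape that PHP itself recognizes, such as \$, silently changes the pattern.

```php
<?php
// Sketch only: simplified, hypothetical patterns for illustration.

// Single quotes pass \. through unchanged; in double quotes, \\ yields
// one backslash. Both strings hold the same pattern.
$single = '/([a-z]+\.)+[a-z]+/';
$double = "/([a-z]+\\.)+[a-z]+/";
var_dump($single === $double);                 // bool(true)
var_dump(preg_match($single, 'www.php.net'));  // int(1)

// Where double quotes bite: "\$" is PHP's escape for a plain dollar sign,
// so the backslash never reaches PCRE and $ becomes the end-of-subject anchor.
$literalDollar = '/\$[0-9]+/';   // a literal $ followed by digits
$brokenDollar  = "/\$[0-9]+/";   // PCRE receives /$[0-9]+/
var_dump(preg_match($literalDollar, 'costs $25'));  // int(1)
var_dump(preg_match($brokenDollar, 'costs $25'));   // int(0)
```

With single quotes, the pattern that PCRE receives is the pattern as written, which is why they are the safer default.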
The following example shows the contents of the $matches array after running the preg_match() function:

    <?php
        $string = 'Derick Rethans <derick@php.net>';
        preg_match(
            "/([^<]+)<([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+)>/",
            $string,
            $matches
        );
        print_r($matches);
    ?>

The output is

    Array
    (
        [0] => Derick Rethans <derick@php.net>
        [1] => Derick Rethans
        [2] => derick@php.net
        [3] => php.
    )

The fourth element is there because a capturing subpattern was used for the (sub)domain part of the pattern. It could be left out by using the non-capturing (?: ...) form from Table 9.1, but of course, it doesn't hurt to have it.
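If the extra (sub)domain element is unwanted, one option is to make that group non-capturing with the (?: ...) syntax from Table 9.1. The following is a minimal sketch of that variation, not part of the original example. It also writes the pattern in single quotes, so that only one backslash is needed in front of the dot.

```php
<?php
// Sketch only: the pattern from Example 2 with the subdomain group
// changed from ( ... ) to (?: ... ) and single quotes around the pattern.
$string  = 'Derick Rethans <derick@php.net>';
$pattern = '/([^<]+)<([a-zA-Z0-9_-]+@(?:[a-zA-Z0-9_-]+\.)+[a-zA-Z0-9_-]+)>/';

preg_match($pattern, $string, $matches);
print_r($matches);
// $matches now has three elements: the full match, the name part
// (including its trailing space), and the email address.
```

The match itself is unchanged; only the number of captured elements differs.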