CHAPTER 9 – Extra Functions Dealing with Multi-Byte Character Sets

A few extra functions in both the mbstring and iconv extensions are surrogates for some of the standard string functions. For example, iconv_strlen() (and mb_strlen()) returns the number of "characters" (not bytes) in the string passed to the function:

<?php
/* This script file is assumed to be saved as ISO-8859-1 (see $from) */
$string = "Må jeg bytte tog?";
$from = 'iso-8859-1';
$to = 'utf-8';
iconv_set_encoding('internal_encoding', $to);

echo $string . "\n";
echo "strlen: " . strlen($string) . "\n";

$string = iconv($from, $to, $string);

echo $string . "\n";
echo "strlen: " . strlen($string) . "\n";
echo "iconv_strlen: " . iconv_strlen($string) . "\n";
?>

outputs

Må jeg bytte tog?
strlen: 17
MÃ¥ jeg bytte tog?
strlen: 18
iconv_strlen: 17

iconv_strlen() takes into account the multi-byte character Ã¥ (which is UTF-8 for "å").

Replacement functions for strpos() and strrpos() also exist. With these and the replacement for substr(), you can safely find a multi-byte string inside another multi-byte string. While trying to come up with an example that shows why it is important to use the multi-byte variants of those functions, we realized that it does not matter at all if UTF-8 is used as the encoding. The common problem that we were trying to illustrate is that a single-byte character (like ") could also be part of a multi-byte character in the same string. However, for UTF-8 encoded strings this is not possible, because all bytes of a multi-byte character have ordinal values of 128 or greater, while single-byte characters always have ordinal values below 128.

iconv_substr() is still useful for a multi-byte version of a "shorten" function, which in the following example cuts a string down to a given number of characters (not bytes!) and appends an ellipsis.

<?php
header("Content-Type: text/html; charset=UTF-8");
iconv_set_encoding('internal_encoding', 'utf-8');
$text = "Ceci est un texte en français, il n'a pas de sense si ce n'est celui de vous montrez comment nous pouvons utiliser ces fonctions afin de réduire ce texte à une taille acceptable.";

echo "<p>$text</p>\n";

echo '<p>' . substr($text, 0, 26) . "...</p>\n";
echo '<p>' . iconv_substr($text, 0, 26) . "...</p>\n";
?>

Note: The character set in which this example is shown is UTF-8 and not ISO-8859-15.

When this script is run, the output in a browser will be similar to Figure 9.16.

Fig. 9.16 Broken UTF-8 characters.

As you can see, the normal substr() function doesn't care about character sets. It cuts the two-byte "ç" in half, leaving an invalid UTF-8 sequence, which is rendered as the black square with the question mark in it. iconv_substr() does a much better job. It "knows" that the "ç" is a multi-byte character and counts it as one. For this to work, the internal encoding needs to be set to "UTF-8."

To demonstrate the use of iconv_strpos(), we use UCS-2BE (which doesn't really encode anything, but simply stores the 16 least significant bits of a UCS character, most significant byte first), rather than UTF-8. The following script shows why you need to use iconv_strpos() and cannot simply use strpos():

<pre>
<?php
$internal = 'UCS-2BE';
$output = 'UTF-8';
$space = ' ';
/* "\xA4" is the Euro sign in ISO-8859-15 */
$text = iconv('iso-8859-15', $internal, "\xA4 12.50");

Because there is no convenient way to type UCS-2BE encoded text directly into a script, we "create" a UCS-2BE encoded text from an ISO-8859-15 encoded string consisting of the Euro sign, a space, and the text 12.50. The Euro sign is especially interesting, because its UCS-2 encoding is 0x20 0xac (in hexadecimal). A single space in any ISO-8859-* encoding has the code 0x20, the same value as the first byte of the Euro sign. In Figure 9.17, you see the hexadecimal representation of the UCS-2 encoded string after Original.

/* Initialize the output buffering mechanism */
iconv_set_encoding('output_encoding', $output);
ob_start('ob_iconv_handler');
echo "Original: ", bin2hex($text), "\n";

We initialize the output buffer and set the output encoding to UTF-8. Then, we output the hexadecimal representation of our string, which will be converted to UTF-8 by the output buffer mechanism.
/* The "wrong" way */
$amount = substr($text, strpos($text, $space) + 1);

With strpos(), we locate the first space in the string. Then with substr(), we obtain everything following this first space and assign it to the $amount variable. However, this code doesn't do what we expected.

echo "After substr(): ", bin2hex($amount), "\n";
ob_flush();

We print the hexadecimal representation of the new string and flush the output buffer. The flush is needed so that all data in the buffer is sent to the iconv output handler before we reset the internal encoding to UCS-2BE. Without this flush, the output handler does not correctly encode the output (because it normally operates in blocks of 4096 bytes only). As you can see in Figure 9.17, following After substr():, the "space" was matched in the wrong location. The normal strpos() and substr() functions don't know a thing about character sets: strpos() matched the byte 0x20 that is the first half of the Euro sign, and thus the $amount variable does not contain valid UCS-2BE encoded text.

iconv_set_encoding('internal_encoding', $internal);
echo $amount;
ob_flush();

We need to set the internal iconv encoding to UCS-2BE, echo the (broken) $amount string, and flush the output buffer so that we can change the internal encoding again.
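The byte-level mismatch described above can be reproduced without the output handler. The following is only a minimal sketch, assuming an iconv implementation that knows the names ISO-8859-15 and UCS-2BE; the variable names $bytePos and $naive are ours and do not appear in the original script.

```php
<?php
// "\xA4" is the Euro sign in ISO-8859-15; convert the string to UCS-2BE
$text = iconv('ISO-8859-15', 'UCS-2BE', "\xA4 12.50");

// Seven characters of two bytes each: 20ac 0020 0031 0032 002e 0035 0030
echo chunk_split(bin2hex($text), 4, ' '), "\n";

// strpos() compares single bytes, so it stops at the first 0x20 byte,
// which is the high byte of the Euro sign, not the space character
$bytePos = strpos($text, ' ');
echo "byte position: ", $bytePos, "\n";

// Cutting after that byte leaves an odd number of bytes,
// which cannot be valid UCS-2BE (two bytes per character)
$naive = substr($text, $bytePos + 1);
echo "remaining bytes: ", strlen($naive), "\n";
```

The odd byte count is the quickest way to see that the result is broken: every UCS-2 character occupies exactly two bytes.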

/* Convert space character to UCS-2BE and match again */
$space = iconv('iso-8859-1', $internal, $space);
$amount = iconv_substr($text, iconv_strpos($text, $space) + 1);

Now, we convert our space character into UCS-2BE too, so that we can use iconv_strpos() to find the first (real) occurrence in the string. iconv_strpos() uses the internal encoding setting to determine whether a character is found inside the string. Just like the normal strpos(), it returns the position where the needle was found, or false if it wasn't found. Therefore, because 0 can be returned if the needle was found in the first position, you need to compare with === false to see whether the needle was actually found. In our example, it doesn't matter whether the needle is found at position 0 or not found at all: false evaluates to 0 in the addition, so iconv_substr() starts copying at position 1 in either case.

iconv_set_encoding('internal_encoding', 'iso-8859-1');
echo "\nAfter iconv_substr(): ", bin2hex($amount), "\n";
ob_flush();

We temporarily set the internal encoding to ISO-8859-1 so that we can safely output the hexadecimal representation of the string. We flush the output buffer because we next want to output the $amount variable, which is encoded in UCS-2BE.

iconv_set_encoding('internal_encoding', $internal);
echo $amount;
?>

With these final statements, the full output is displayed, as shown in Figure 9.17. Notice that the first match (space = 0x20) is wrong. In the second attempt, the correct 0x0020 was found and the string was chopped up accordingly.

Fig. 9.17 Problems without iconv_strpos().
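When the difference between "found at position 0" and "not found" does matter, the strict comparison can be wrapped in a small function. This is a sketch under our own assumptions: text_after() is a hypothetical helper, not part of the iconv extension, and it passes the encoding explicitly to each call instead of relying on the internal encoding setting.

```php
<?php
/**
 * Hypothetical helper (not part of the iconv extension): returns the text
 * following the first occurrence of $needle, or null if $needle is absent.
 */
function text_after($haystack, $needle, $encoding = 'UTF-8')
{
    $pos = iconv_strpos($haystack, $needle, 0, $encoding);
    if ($pos === false) {
        // Strict check: a plain "if (!$pos)" would also reject position 0
        return null;
    }
    $start = $pos + iconv_strlen($needle, $encoding);
    // The full character count is always a sufficient length
    $length = iconv_strlen($haystack, $encoding);
    return iconv_substr($haystack, $start, $length, $encoding);
}

// "\xE2\x82\xAC" is the Euro sign in UTF-8
var_dump(text_after("\xE2\x82\xAC 12.50", ' ')); // needle at position 1
var_dump(text_after(' 12.50', ' '));             // needle at position 0
var_dump(text_after('12.50', ' '));              // needle absent
```

The same helper should also work for the UCS-2BE strings used in the script above, provided both arguments are already converted and 'UCS-2BE' is passed as the third argument.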

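Returning to the earlier observation that in UTF-8 every byte of a multi-byte character has an ordinal value of 128 or greater: the sketch below checks this for the Danish sample string and shows one practical consequence. The variable names are ours. An ASCII needle is matched reliably by plain strpos(), but the offset it returns counts bytes, whereas iconv_strpos() counts characters, so each offset should only be handed to the matching substring function.

```php
<?php
// "\xC3\xA5" is the UTF-8 encoding of "å"
$text = "M\xC3\xA5 jeg bytte tog?";

// Collect every byte with an ordinal value of 128 or greater
$high = array();
foreach (str_split($text) as $offset => $byte) {
    if (ord($byte) >= 128) {
        $high[$offset] = ord($byte);
    }
}
print_r($high); // only the two bytes that make up "å"

// Both functions find the needle, but they count in different units
$bytePos = strpos($text, 'jeg');                    // offset in bytes
$charPos = iconv_strpos($text, 'jeg', 0, 'UTF-8');  // offset in characters
echo "strpos: $bytePos, iconv_strpos: $charPos\n";

// Pair each offset with the substring function that uses the same unit
echo substr($text, $bytePos, 3), "\n";
echo iconv_substr($text, $charPos, 3, 'UTF-8'), "\n";
```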