CHAPTER 9 – Character Set Conversions
PHP 5 supports character encoding conversion and multi-byte string handling through two extensions: iconv and mbstring. The main difference between the two is that iconv makes use of an external library (or the C library functions, if available), while the mbstring extension has its library bundled with PHP. Although iconv (at least in recent Linux distributions) supports many more encodings, mbstring might be the better choice for a script that has to be portable. In addition to character encoding conversions, the mbstring extension includes a multi-byte regular expression library.

The mbstring extension is enabled with the --enable-mbstring option. The additional regular expression support is enabled by default when mbstring is enabled, but it can be turned off with --disable-mbregex. The iconv extension is enabled with the --with-iconv switch. Figures 9.13 and 9.14 show the corresponding sections in phpinfo() for mbstring and iconv.

The examples cover both extensions whenever possible, and the character set used in the example scripts and output is ISO-8859-15, unless otherwise noted.

Note: Some of these examples require OS support for the character set used. If a character set is not supported, you might see different output for the example scripts.
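Because the presence of either extension depends on how PHP was built, a portable script may want to check at run time before calling their functions. The following is a minimal sketch; the function name available_converters() is an illustration only and is not part of PHP or of this chapter.

```php
<?php
// Hypothetical helper: report which conversion extensions this build offers.
function available_converters()
{
    $result = array();
    foreach (array('mbstring', 'iconv') as $extension) {
        // extension_loaded() returns TRUE when the named extension is active.
        $result[$extension] = extension_loaded($extension);
    }
    return $result;
}

foreach (available_converters() as $extension => $loaded) {
    echo $extension, ': ', ($loaded ? 'available' : 'not available'), "\n";
}
```

The phpinfo() sections shown in the figures carry the same information, along with each extension's settings.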
Fig. 9.13 mbstring phpinfo() output.

Fig. 9.14 iconv phpinfo() output.

In the first example, we convert ISO-8859-15 (Latin 9) text to UTF-8:

<?php
$string = "Kan De være så vennlig å hjelpe meg?\n\n";

echo "ISO-8859-15: $string";

echo 'UTF-8: '. mb_convert_encoding($string, 'UTF-8', 'ISO-8859-15');
echo 'UTF-8: '. iconv('ISO-8859-15', 'UTF-8', $string);
?>

When the script runs, the output looks like this. Each non-ASCII character becomes a two-byte sequence in UTF-8, which a display set to ISO-8859-15 renders as two separate characters:

ISO-8859-15: Kan De være så vennlig å hjelpe meg?

UTF-8: Kan De vÃŠre sÃ¥ vennlig Ã¥ hjelpe meg?

UTF-8: Kan De vÃŠre sÃ¥ vennlig Ã¥ hjelpe meg?
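Since both extensions perform the same conversion here, a script that must run on varied installations could pick whichever one is loaded. The sketch below assumes a helper named to_utf8(), which is hypothetical and not part of either extension. It inspects the result as bytes, so the output does not depend on the encoding of the terminal.

```php
<?php
// Hypothetical helper (not from the chapter): convert $text from $from to
// UTF-8 with whichever extension is loaded.
function to_utf8($text, $from)
{
    if (extension_loaded('mbstring')) {
        return mb_convert_encoding($text, 'UTF-8', $from);
    }
    if (extension_loaded('iconv')) {
        return iconv($from, 'UTF-8', $text);
    }
    return false; // neither extension is available
}

// "være" in ISO-8859-15; the byte escape keeps the example independent of
// the encoding this source file is saved in.
$latin9 = "v\xE6re";
$utf8 = to_utf8($latin9, 'ISO-8859-15');

echo strlen($latin9), ' bytes -> ', strlen($utf8), " bytes\n";
echo bin2hex($utf8), "\n";
```

With either extension loaded, the single byte E6 for "æ" should become the two bytes C3 A6, so the string grows from four bytes to five.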
Sometimes, it's not possible to convert text from one encoding to another, as shown in the following example:

<?php
error_reporting(E_ALL & ~E_NOTICE);
$from = 'ISO-8859-1'; // Latin 1: West European
$to = 'ISO-8859-2'; // Latin 2: Central and East European
$string = "Denna text är på svenska.";

echo "$from: $string\n\n";

echo "$to: ". mb_convert_encoding($string, $to, $from). "\n\n";
echo "$to: ". iconv($from, $to, $string). "\n\n";
echo "$to: ". iconv($from, "$to//TRANSLIT", $string). "\n\n";
?>

We try to convert the text "Denna text är på svenska." from ISO-8859-1 to ISO-8859-2, but the "å" does not exist in ISO-8859-2. mb_convert_encoding() replaces the offending character (by default) with a "?", whereas iconv() just aborts the conversion at that point. However, you can append the //TRANSLIT modifier to the target encoding parameter to tell iconv() to replace the offending character with a "?". //TRANSLIT also tries to convert a character to an approximate representation where one exists, such as converting "©" to "(C)" when converting from ISO-8859-1 to ISO-8859-2.

You can use the mb_substitute_character() function to tell the mbstring extension to do something different with an offending character, as shown here:

<?php
error_reporting(E_ALL & ~E_NOTICE);
$from = 'ISO-8859-1'; // Latin 1: West European
$to = 'ISO-8859-4'; // Latin 4: Scandinavian/Baltic
$string = "Ce texte est en français.";

echo "$from: $string\n\n";

// Default
echo "$to: ". mb_convert_encoding($string, $to, $from). "\n";

// No output for offending characters:
mb_substitute_character('none');
echo "$to: ". mb_convert_encoding($string, $to, $from). "\n";

// Unicode value output for offending characters:
mb_substitute_character('long');
echo "$to: ". mb_convert_encoding($string, $to, $from). "\n";
?>

outputs

ISO-8859-1: Ce texte est en français.

ISO-8859-4: Ce texte est en fran?ais.
ISO-8859-4: Ce texte est en franais.
ISO-8859-4: Ce texte est en franU+E7ais.
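Besides 'none' and 'long', mb_substitute_character() also accepts an integer Unicode code point to use as the replacement. Because the setting is global, it can be worth restoring the previous value after a conversion. The following sketch wraps both steps in a function; the name convert_with_marker() and the choice of "_" as the marker are assumptions made for illustration.

```php
<?php
// Hypothetical helper: convert with mbstring, marking unconvertible
// characters with $marker (a Unicode code point; 0x5F is "_").
function convert_with_marker($text, $to, $from, $marker = 0x5F)
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character($marker);
    $converted = mb_convert_encoding($text, $to, $from);
    mb_substitute_character($previous);      // restore it for other code
    return $converted;
}

// "français" in ISO-8859-1, written with a byte escape for the "ç".
echo convert_with_marker("fran\xE7ais", 'ISO-8859-4', 'ISO-8859-1'), "\n";
```

Since "ç" has no equivalent in ISO-8859-4, this should print fran_ais, and later conversions in the same script still use whatever substitute character was configured before the call.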
Tip: The web site http://www.eki.ee/letter/ is a useful tool that shows you what happens during character conversions. It provides lists of the special characters needed to write a certain language, including a list of encodings that support this set.

The mbstring extension also features a pseudo-encoding, html, which might be useful in some cases:

<?php
error_reporting(E_ALL & ~E_NOTICE);
$from = 'ISO-8859-1'; // Latin 1: West European
$to = 'html'; // Pseudo encoding
$string = "Esto texto es Español.";

echo "$from: $string\n";

echo "$to: ". mb_convert_encoding($string, $to, $from). "\n";
?>

outputs

ISO-8859-1: Esto texto es Español.
html: Esto texto es Espa&ntilde;ol.

The third parameter to the mb_convert_encoding() function is optional and defaults to the "internal encoding" that you can set with the function mb_internal_encoding(). If you pass an encoding name to mb_internal_encoding(), it returns either TRUE, if the encoding is supported, or FALSE and a warning if the encoding is not supported. If no parameters are passed, the function simply returns the current setting:

<?php
echo mb_internal_encoding(). "\n";
if (@mb_internal_encoding('UTF-8')) {
    echo mb_internal_encoding(). "\n";
}
if (@mb_internal_encoding('ISO-8859-17')) {
    echo mb_internal_encoding(). "\n";
}
echo mb_internal_encoding(). "\n";
?>

outputs

ISO-8859-1
UTF-8
UTF-8

Tip: You can see a list of supported encodings by using the function mb_list_encodings().

The iconv extension has similar possibilities. The function iconv_set_encoding() can be used to set the internal encoding and the output encoding:

<?php
iconv_set_encoding('internal_encoding', 'UTF-8');
iconv_set_encoding('output_encoding', 'ISO-8859-1');

echo iconv_get_encoding('internal_encoding'). "\n";
echo iconv_get_encoding('output_encoding'). "\n";
?>

outputs

UTF-8
ISO-8859-1

The internal encoding setting has an effect on a couple of string functions (which we cover in a bit). The output encoding option doesn't have any effect on those functions, but it can be used in combination with the ob_iconv_handler output buffering handler. With this handler enabled, PHP automatically converts the text sent to the browser from the internal encoding to the output encoding. It also adjusts the Content-type header, provided that the script did not set one itself and that the current Content-type starts with text/.

This example changes the output encoding to UTF-8 and activates the output handler. The result is a UTF-8 encoded output page (see Figure 9.15):

<?php
ob_start("ob_iconv_handler");
iconv_set_encoding("internal_encoding", "ISO-8859-1");
iconv_set_encoding("output_encoding", "UTF-8");

$text = <<<END
PHP, est un acronyme récursif, qui signifie "PHP: Hypertext
Preprocessor": c'est un langage de script HTML, exécuté coté serveur.
L'essentiel de sa syntaxe est emprunté aux langages C, Java et Perl, avec
des améliorations spécifiques. L'objet de ce langage est de permettre aux
développeurs web d'écrire des pages dynamiques rapidement.

END;

echo $text;
?>

Fig. 9.15 UTF-8 encoded output.

The other way around is a bit more useful. It makes more sense to store all of your data in UTF-8 (for example, in a database) and convert to the correct encoding for the language you're currently serving.
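One way to picture this arrangement is a lookup from the language being served to the encoding used for it, with the conversion from UTF-8 applied just before output. The sketch below is only an illustration: the function render_for_language(), the language codes, and the encoding chosen for each language are all assumptions, not something the chapter prescribes.

```php
<?php
// Hypothetical mapping from language code to the encoding served for it.
// Both the language codes and the choices are assumptions for illustration.
$encodingFor = array(
    'fr' => 'ISO-8859-15',
    'pl' => 'ISO-8859-2',
);

// Convert UTF-8 text for one language; unknown languages stay in UTF-8.
// Returns the Content-type header value and the converted body.
function render_for_language($utf8Text, $language, $encodingFor)
{
    $encoding = isset($encodingFor[$language]) ? $encodingFor[$language] : 'UTF-8';
    return array(
        'content_type' => "text/html; charset=$encoding",
        'body' => mb_convert_encoding($utf8Text, $encoding, 'UTF-8'),
    );
}

// "exécuté" as stored in UTF-8, written with byte escapes.
$page = render_for_language("ex\xC3\xA9cut\xC3\xA9", 'fr', $encodingFor);
echo $page['content_type'], "\n";
echo strlen($page['body']), " bytes\n";
```

In a web script, the content_type value would be sent with header('Content-type: ' . $page['content_type']) before the body is echoed. The same effect can be had with ob_iconv_handler by setting the internal encoding to UTF-8 and the output encoding to the target character set.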