CHAPTER 9 – MULTI-BYTE STRINGS AND CHARACTER SETS
Not all languages use the same character set, not even in the western world.
For example, the Š is only part of ISO-8859-2, not of ISO-8859-1. Because these character sets have only 8 bits per character, each can represent at most 256 different characters. That is a problem for languages such as Chinese, which have thousands of characters. That's why Chinese (and other Asian scripts) must use another encoding, such as BIG5 or GB2312. Japanese uses yet other encodings: EUC-JP, JIS, SJIS, and so on. All those different character sets are a problem to work with because some map the same character number to a different character (such as © and Š, which caused our problem at the end of the preceding section). That's one of the reasons the Unicode project was started.
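To make the clash concrete, here is a minimal sketch in Python (chosen for illustration; it is not the language of this chapter's other examples). It decodes the single byte 0xA9 under both character sets and then checks that Š cannot be encoded in ISO-8859-1. The variable and function names are hypothetical.

```python
# One byte value, two different characters, depending on the assumed charset.
raw = bytes([0xA9])

as_latin1 = raw.decode("iso-8859-1")  # Western European (Latin 1)
as_latin2 = raw.decode("iso-8859-2")  # Central European (Latin 2)


def code_point(ch):
    """Format a single character as its Unicode number, e.g. U+00A9."""
    return f"U+{ord(ch):04X}"


print(code_point(as_latin1))  # U+00A9, the copyright sign
print(code_point(as_latin2))  # U+0160, S with caron

# S with caron has no slot in ISO-8859-1, so encoding it there fails.
try:
    "\u0160".encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(fits_in_latin1)  # False
```

The byte itself carries no hint about which interpretation is correct; that information has to travel alongside the text.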
Unicode solves the problem by assigning a number to every unique character, just like the ISO 10646 standard. This standard reserves 31 bits for characters, which should be more than enough room for every script out there (including "fictional" scripts like Tolkien's Tengwar and the Egyptian hieroglyphs). The characters in the range 0-127 are the same as in the good old ASCII standard, and the range 0-255 is the same as ISO-8859-1 (Latin 1). All "normal" script characters are encoded in the range 0-65535, a subset called the Basic Multilingual Plane (BMP). Because Unicode only assigns numbers to characters, it is not by itself a way to store text; for that, an encoding is needed. The simplest encodings are UCS-2 and UCS-4, which store each character as a 2- or 4-byte sequence. They are not really useful because the text can contain NULL bytes and because it takes up too much space, even when the characters are only in the ASCII range. UTF-8, which solves these problems, is used more often. Characters in a UTF-8 encoded string can be 1 to 6 bytes long and can represent all 2^31 characters from UCS. This section of the chapter deals mainly with UTF-8 and conversions to other encodings (such as ISO-8859-1).
Tip: For more information on Unicode, see the excellent FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html.
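The trade-offs between the fixed-width encodings and UTF-8 can be sketched in a few lines of Python (used here only for illustration). One assumption to note: Python has no codec named UCS-2 or UCS-4, so the sketch uses `utf-16-be` and `utf-32-be` as stand-ins, which produce the same bytes as big-endian UCS-2 and UCS-4 for the characters shown. The sample characters and variable names are hypothetical.

```python
# Sample characters from different ranges, written as escapes for clarity.
samples = {
    "A": "\u0041",           # ASCII letter
    "e-acute": "\u00e9",     # in the Latin 1 range
    "euro": "\u20ac",        # elsewhere in the BMP
    "g-clef": "\U0001d11e",  # outside the BMP
}

# UTF-8 is variable width: the byte count grows with the character number.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)  # {'A': 1, 'e-acute': 2, 'euro': 3, 'g-clef': 4}

# Fixed-width encodings pad ASCII characters with zero (NULL) bytes.
two_byte = "A".encode("utf-16-be")   # stand-in for UCS-2
four_byte = "A".encode("utf-32-be")  # stand-in for UCS-4
print(two_byte)   # b'\x00A'
print(four_byte)  # b'\x00\x00\x00A'

# In UTF-8, ASCII text is byte-for-byte unchanged and has no NULL bytes.
ascii_unchanged = "A".encode("utf-8") == b"A"

# Converting UTF-8 bytes to ISO-8859-1: decode first, then re-encode.
utf8_bytes = "caf\u00e9".encode("utf-8")
latin1_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-1")
print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'
```

The conversion at the end only succeeds when every character exists in the target set; a character such as the euro sign would raise an error when encoded as ISO-8859-1.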