Robert's Blog


Tuesday, May 13, 2008

DB2 Code Pages: Don't Get Lost in Translation

For a long time, many of us in the DB2 community didn't pay a whole lot of attention to code pages (the standard mappings between characters - letters, digits, and other symbols - and the numeric values that represent them within computer systems). There didn't seem to be much of a reason to think about this particular topic. Data went into the database and came out of the database just fine, so what was the big deal about coded character set translation? Oh, sure, everyone knew that an ASCII-based system could exchange data with an EBCDIC-based server, and translation had to happen in such cases, but it worked all right. In any case, there were more pressing concerns like database design, performance tuning, and availability planning.

Nowadays, coded character set translation matters A LOT to many DB2-using organizations. The one-word reason: internationalization. Whereas many companies once did business solely within a single country, to an increasing degree those same companies now serve clients located all over the world. This trend (which is accelerating) has caused coded character set translation to become a matter of significant importance. Getting it wrong might - if you're lucky - simply annoy a customer who sees that his name is displayed incorrectly on an invoice. If you're less fortunate, mistranslation could directly result in lost business if an incoming order is bounced from your system because some required match on a text field didn't happen as it was supposed to.

Internationalization is far more important than any ASCII-EBCDIC translation issue (EBCDIC being the character encoding scheme historically used on mainframe servers). Why? Because in the days of purely domestic markets, ASCII-EBCDIC translation occurred within a given language, such as from French EBCDIC (code page 297) to French ASCII (code page 819, the ISO-8859-1 "Latin-1" page used on Linux systems), or from Czech EBCDIC (code page 870) to Czech ASCII (code page 912). This kind of translation worked fine because, with 8 bits available for character representation (EBCDIC was always an 8-bit encoding scheme, and 8-bit ASCII code pages are widely used, though the original ASCII specification defined a 7-bit encoding), all of the "special" characters particular to a given language could be represented - for most languages, that is (see below). And I mean "special" from the U.S. English perspective, referring to such characters as ô, an "o" topped with a circumflex.

[8 bits are not enough to represent all of the characters used in the written forms of languages such as Chinese and Japanese; consequently, double-byte ASCII and EBCDIC code sets were developed to address this limitation.]
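Back to the single-language case: a quick way to see why same-language translation was painless is to notice that when two code pages cover the same character repertoire, translation is just a remapping of byte values, with nothing lost. Here's a minimal Python sketch of the idea (one assumption to flag: Python doesn't ship the French EBCDIC code page 297, so cp500 - an EBCDIC code page with the same Western European repertoire - stands in for it here):

    # The same character exists in both code pages; only the byte value differs.
    ch = "ô"
    ebcdic_bytes = ch.encode("cp500")    # one byte value under EBCDIC
    ascii_bytes = ch.encode("latin-1")   # a different byte value (0xF4) under Latin-1
    print(ebcdic_bytes, ascii_bytes)

    # Translation is just: decode with one code page, re-encode with the other.
    assert ebcdic_bytes.decode("cp500") == ascii_bytes.decode("latin-1") == ch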

It is in going from language A to language B that one is more likely to encounter coded character translation problems when ASCII and/or EBCDIC encoding is in use. For example, suppose you're using the ISO-8859-1 (aka "Latin-1") ASCII code set (which corresponds to ASCII code page 819), and you're exchanging data with an organization that uses the ISO-8859-2 code set (code page 912, covering Central and Eastern European languages). Code point 241 in your code page represents ñ (an "n" with a tilde on top), but code point 241 in the other organization's code page represents ń (an "n" with an acute accent). Oh, and there is no representation of ń in your code page, nor is there a representation of ñ in the code page used by the other organization. Things get even more fun if you go between languages with significantly dissimilar alphabets, such as Greek (ISO-8859-7 ASCII code set) and Turkish (ISO-8859-9).
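To make that collision concrete, here's a small Python sketch (Python stands in here for any environment that can decode raw bytes under a chosen code page; "iso8859-1" and "iso8859-2" are Python's names for those code sets):

    # One byte value, two different characters, depending on the code page.
    raw = bytes([241])
    print(raw.decode("iso8859-1"))   # prints: ñ
    print(raw.decode("iso8859-2"))   # prints: ń

    # And the reverse problem: Latin-1 simply has no code point for ń.
    try:
        "ń".encode("iso8859-1")
    except UnicodeEncodeError as err:
        print(err)   # the character can't be represented in this code page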

All of this would present a really intimidating problem, were it not for the Unicode character encoding scheme. Unicode uses up to four bytes (32 bits) to encode a character, and its code space provides well over a million code points (1,114,112, to be exact) to which characters can be mapped. The great thing about Unicode is that there isn't a Greek Unicode and a Spanish Unicode and a Japanese Unicode - there's just Unicode. Similarly, there isn't one Unicode for mainframe systems and another for Linux, UNIX, and Windows servers - there's just Unicode. If data is sent from one Unicode-using system to another Unicode-using system, no translation takes place (if the data-requesting system uses an ASCII or EBCDIC code page, data coming from a Unicode-based server will be translated as needed). Remember the ñ I mentioned previously (the "tilde-n")? In Unicode it's code point U+00F1. And the ń (the "n" with the acute accent)? That's U+0144 in Unicode. When there's room for everything, everything can have its own code point (in contrast to the situation in which, for example, different characters map to code point 241, depending on which ASCII code page is in use).
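You can check those code points in any Unicode-aware environment. In Python 3, for instance, strings are Unicode, so ord() returns the Unicode code point directly:

    # Each character has exactly one Unicode code point, on every platform.
    for ch in ("ñ", "ń"):
        print(ch, "U+%04X" % ord(ch))
    # prints: ñ U+00F1
    #         ń U+0144

    # Escaping by code point always yields the same character back:
    print("\u00f1", "\u0144")   # prints: ñ ń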

Both DB2 for z/OS and DB2 for Linux, UNIX, and Windows (LUW) can store data in Unicode format (in fact, Unicode is the default character encoding scheme for DB2 for LUW), and this capability grows in importance as more and more organizations go global in their business dealings. A couple of things to keep in mind:
  • If you are accustomed to working with ASCII code pages, you may have used the DB2 built-in scalar function CHR to get from DB2 the character that maps to a particular ASCII code point (e.g., CHR(241) to get ñ IF you are using the ISO-8859-1 ASCII coded character set). If you store data in a DB2 database in Unicode format, why use CHR to get a particular character from DB2? Ask DB2, instead, to give you the character that maps to a specific Unicode code point (e.g., SELECT U&'\00F1' FROM table-name to get ñ). A script-based sketch of this appears after the list below.
  • Changing the Coded Character Set Identifier (CCSID) for an existing DB2 system (e.g., from ASCII or EBCDIC to Unicode) is a non-trivial matter. If you are considering making such a change in your DB2 environment, consult the DB2 for z/OS Installation Guide or the DB2 for LUW Internationalization Guide.
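If you'd like to try the Unicode-literal approach from a script, here is a minimal sketch using the ibm_db Python driver. The connection-string values (database name, host, port, credentials) are placeholders, and support for the U& literal syntax depends on your DB2 version, so treat this as an illustration rather than a recipe:

    import ibm_db

    # Placeholder connection details - substitute your own.
    conn = ibm_db.connect(
        "DATABASE=SAMPLE;HOSTNAME=localhost;PORT=50000;"
        "PROTOCOL=TCPIP;UID=db2user;PWD=secret", "", "")

    # Ask DB2 for the character at Unicode code point U+00F1 (ñ).
    stmt = ibm_db.exec_immediate(
        conn, "SELECT U&'\\00F1' FROM SYSIBM.SYSDUMMY1")
    row = ibm_db.fetch_tuple(stmt)
    print(row[0])   # expect: ñ

    ibm_db.close(conn)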
So go Unicode, and don't sweat the ñ (or the ń or the ä or the Ř or the ů or the Σ or the...).
