Language Integration Issues

From XMBdocs

This document discusses some of the benefits and potential pitfalls of installing multiple languages into the XMB1 system.

English and its Close Cousins

One of XMB's greatest features is its support for several languages that share the same character set. Translations at various times have included Albanian, Croatian, Dutch, Estonian, French, German, Italian, Portuguese, Spanish, and Swedish. These languages are all compatible with one another, and with several others, because XMB itself is written in English and they all use the same character encoding, ISO 8859-1.

All Other Languages

XMB has at various times included translations in Chinese, Finnish, Hungarian, Polish, Russian, and Ukrainian. Each of these languages can be installed on its own and is expected to work smoothly. However, these languages are not natively compatible with English (or with each other, for that matter) because they do not use the ISO 8859-1 character encoding.
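The incompatibility comes down to the same stored byte meaning different characters under different encodings. The following is a minimal sketch of that effect. The choice of ISO 8859-2 and Windows-1251 as comparison encodings is an assumption made for illustration; this document does not say which encodings the XMB translations actually used.

```python
# Decode one and the same byte under several single-byte encodings.
# The comparison encodings are assumptions chosen for illustration only.
ENCODINGS = ["iso-8859-1", "iso-8859-2", "cp1251"]


def decode_as(data: bytes, encodings: list[str]) -> dict[str, str]:
    """Return the text that `data` represents under each encoding."""
    return {name: data.decode(name) for name in encodings}


results = decode_as(b"\xf5", ENCODINGS)
for name, text in results.items():
    # Print the code point so the output does not depend on the terminal font.
    print(f"{name}: {text!r} (U+{ord(text):04X})")
```

The byte never changes; only the interpretation does, which is why text stored under one encoding looks wrong when displayed under another.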

Database Character Encoding

The XMB1 database is required to use a single-byte character encoding such as ISO 8859-1. One benefit of this design is that it is binary safe, meaning there is a one-to-one relationship between each byte being stored and retrieved. This makes the storage engine itself relatively blind to the type of data or language being used on the message board.
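The one-to-one property can be checked directly. This sketch uses Python's built-in codecs rather than a real database, so it only illustrates the encoding behaviour and not the storage engine itself; the helper name `round_trips` is hypothetical.

```python
def round_trips(data: bytes, encoding: str) -> bool:
    """True if decoding and re-encoding returns exactly the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeDecodeError:
        # Some byte sequences are simply not valid in this encoding.
        return False


all_bytes = bytes(range(256))  # every possible single byte value

print(round_trips(all_bytes, "iso-8859-1"))  # every byte maps to one character
print(round_trips(all_bytes, "utf-8"))       # bytes above 0x7F are not valid alone
```

Under ISO 8859-1 every byte value survives the round trip, whereas an arbitrary byte string is not necessarily valid UTF-8.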

One potential pitfall of this design is that the original authors of the XMB1 software did not specify a character encoding in the database schema. As a result, all language and related localization issues have to be treated as though the database is using an unknown default character encoding, which might not be ISO 8859-1. So long as the default encoding is a single-byte encoding, this is a trivial concern.

Database Connection Character Encoding

As of version 5 of both MySQL and PHP, it is possible to set the default connection encoding to something other than what the database itself uses. This can be a major problem, as was discovered by Paulo from postcrossing.com. His database was using the ISO 8859-1 character set, but the default connection encoding on the server was UTF-8. As a result, depending on which language the XMB end user was transmitting, some of the binary input data were being treated as multi-byte characters and narrowed to single-byte characters before storage. It is therefore imperative that the XMB database connection encoding exactly match the database (table) character encoding.
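The narrowing effect described above can be simulated without a database. This sketch assumes a server that decodes incoming bytes using the connection encoding and re-encodes them into the table encoding. The function `store_via_connection` is hypothetical, the sample text assumes a Windows-1251 client, and the real MySQL handling of invalid byte sequences may differ from Python's `errors="replace"`.

```python
def store_via_connection(client_bytes: bytes, connection: str, table: str) -> bytes:
    """Simulate conversion from the connection encoding to the table encoding."""
    text = client_bytes.decode(connection, errors="replace")
    return text.encode(table, errors="replace")


# Two Cyrillic letters as sent by a (hypothetical) Windows-1251 client.
sent = "\u0413\u0451".encode("cp1251")  # two bytes: 0xC3 0xB8

# Connection says UTF-8, table is ISO 8859-1: the two bytes happen to form
# one valid UTF-8 character, which is then narrowed to a single byte.
mismatched = store_via_connection(sent, "utf-8", "iso-8859-1")

# Connection matches the table: the bytes pass through untouched.
matched = store_via_connection(sent, "iso-8859-1", "iso-8859-1")

print(len(sent), len(mismatched), len(matched))
```

With matching encodings the stored bytes are identical to what the client sent; with the mismatch, one of the two bytes is lost and the original text cannot be recovered.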

Integrating All Languages

Some webmasters have a great desire to allow their users to interact in any language at any time. This configuration is possible, but it is not officially supported by XMB1. The following pitfalls need to be considered before attempting this configuration:

Language File Character Encoding - All of the XMB language files would have to be completely re-encoded into a single common encoding such as UTF-8. Without doing this, the XMB output would be meaningless binary garbage for any non-ISO 8859-1 language.

Existing Data Encoding - The biggest pitfall is for boards that have been in operation with members posting messages containing non-English characters. This is not a limitation of any particular encoding, but a major problem when changing between them. UTF-8 is only backward-compatible with the US ASCII character set, which in turn includes only half of the characters in ISO 8859-1. Any non-ASCII characters that were stored by users will be represented by two or more bytes in UTF-8, and the existing data are therefore incompatible unless they are completely re-encoded. This is a difficult task considering users may change their language setting at any time.

Existing Encrypted Data - Another major pitfall in language integration has to do with passwords. Any change to the website's character encoding will cause password inputs to change in the same way as all other character data. Only US ASCII passwords will remain valid. Passwords are stored as a non-reversible cryptographic hash, making it impossible to re-encode the saved data. Users will have to request a password reset if the encoding of their password has changed.

Database Character Encoding - Again, the database must use a single-byte encoding such as ISO 8859-1.

Database Connection Character Encoding - Again, the connection encoding must match the database encoding.
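The existing-data and password pitfalls listed above can be illustrated with a short sketch. The helpers `reencode` and `digest` are hypothetical, and SHA-256 is used only as a stand-in hash, since this document does not state which hash function XMB uses.

```python
import hashlib


def reencode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert stored bytes from a known source encoding to the target."""
    return data.decode(source).encode(target)


def digest(password: str, encoding: str) -> str:
    """Hash the password bytes as they would arrive under `encoding`."""
    return hashlib.sha256(password.encode(encoding)).hexdigest()


ascii_text = "hello".encode("iso-8859-1")
latin_text = "caf\u00e9".encode("iso-8859-1")  # ends with e-acute, one byte

# US ASCII data are unchanged by the conversion; other data grow in size.
print(reencode(ascii_text, "iso-8859-1") == ascii_text)
print(len(latin_text), len(reencode(latin_text, "iso-8859-1")))

# The same typed password hashes differently once its bytes change.
ascii_same = digest("secret", "iso-8859-1") == digest("secret", "utf-8")
latin_same = digest("s\u00e9cret", "iso-8859-1") == digest("s\u00e9cret", "utf-8")
print(ascii_same, latin_same)
```

Note that `reencode` only works when the source encoding of each stored value is known, which is exactly what is hard to establish on a board where users could switch languages at any time. The stored hashes cannot be converted at all, which is why affected users need a password reset.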