[development] ALL Contrib maintainers: UTF-8 update

Steven Wittens steven at acko.net
Sat Jan 21 19:54:44 UTC 2006


>In order to solve this, you either have to run the update for the
>specific module, or you can do it manually for every table like so:
>
>ALTER TABLE vocabulary_node_types CONVERT TO CHARACTER SET utf8;
>  
>

As I explained in the original issue, do /not/ run a query like this.

This would do an actual character set conversion, e.g. from Latin1 to UTF-8.

Drupal has been storing UTF-8 data all along; we simply never told MySQL
about it, so MySQL assumed it was something else (e.g. Latin1 data).

Example: You had the character 'é' (Unicode U+00E9) in a node. Its UTF-8
representation is (in hexadecimal bytes) "C3 A9". If the database 
character set was Latin1, then MySQL thought this was 2 characters, 
because in Latin1 each byte is one character.

So, if you do a character set conversion, you would end up with the
UTF-8 encoding for character U+00C3 ('Ã') followed by the UTF-8 encoding
for character U+00A9 ('©'), which is "C3 83 C2 A9".

What we want MySQL to do instead is realize that the byte stream is
already UTF-8 and treat the byte sequence "C3 A9" as a single character.
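
The difference between converting and re-interpreting can be reproduced
outside the database. The following is only a small Python sketch of the
byte arithmetic described above, not something Drupal or MySQL runs; the
variable names are made up for illustration.

```python
# The bytes Drupal stored for 'é' (U+00E9): already UTF-8 encoded.
stored = "\u00e9".encode("utf-8")
assert stored == b"\xc3\xa9"

# What a real character set conversion effectively does: read the bytes
# as Latin1 (one character per byte), then encode those characters as UTF-8.
as_latin1 = stored.decode("latin-1")   # two characters: U+00C3, U+00A9
converted = as_latin1.encode("utf-8")  # double-encoded result

# What we actually want: leave the bytes alone and read them as UTF-8.
reinterpreted = stored.decode("utf-8")  # one character: U+00E9

print(" ".join(f"{b:02X}" for b in converted))  # C3 83 C2 A9
print(len(as_latin1), len(reinterpreted))       # 2 1
```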

The only way to do this is to convert all columns to a binary type and
then to UTF-8. This means no actual conversion is done, only
re-interpretation.
See: http://dev.mysql.com/doc/refman/4.1/en/charset-conversion.html
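
As a rough illustration of that two-step approach, here is a small Python
helper that only builds the two ALTER TABLE statements for one column. It
is a sketch under assumptions, not the actual Drupal update code: the
function name, the table "example_table" and the column "body" are
hypothetical, and BLOB/TEXT are just one matching pair of types (a
VARCHAR(n) column would need VARBINARY(n) and VARCHAR(n) instead). Any
other column attributes (NOT NULL, defaults) would have to be repeated in
the column definition, so check the generated statements against your
schema and back up the data before running anything.

```python
def reinterpret_as_utf8(table, column, binary_type="BLOB", text_type="TEXT"):
    """Build the two statements that re-label a column's bytes as UTF-8.

    Step 1 turns the column into a binary type, so MySQL drops its
    character set assumption without touching the bytes.
    Step 2 turns it back into a text type declared as utf8, so the same
    bytes are now read as UTF-8.
    """
    return [
        f"ALTER TABLE {table} MODIFY {column} {binary_type};",
        f"ALTER TABLE {table} MODIFY {column} {text_type} CHARACTER SET utf8;",
    ]


# Hypothetical table and column names, for illustration only.
for statement in reinterpret_as_utf8("example_table", "body"):
    print(statement)
```

Going through the binary type is the important part: a direct change from
one character set to another is what triggers the real conversion
described above.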

Steven


