Re: [drupal-devel] multibyte/mbstring in Drupal

27 May 2005

I updated the comment for that function recently... it is about bytes, 
not characters. It is a tough situation: UTF-8 has to be supported not 
only by PHP but also by the MySQL database, and most people have no 
control over the MySQL settings of their host. If MySQL is not 
configured for UTF-8, then strings will still work, but the column sizes 
will be in bytes. We assume the worst situation right now (no mbstring, 
no MySQL UTF-8).

I think it is a reasonable expectation that hosts with mbstring also use 
UTF-8 for the database. People who don't have a utf-8 database will have 
to disable mbstring. We can provide a section for this in settings.php, 
and we can enable mbstring by default if present, and force the 
character set to UTF-8.

So, for every "utf8" function, we make an mbstring version, which works 
on characters, and a non-mbstring version which works on bytes (but is 
UTF-8 aware). I'd like a global toggle between the two modes, set in 
settings.php. That way we can control encodings inside Drupal. We can 
provide specific UTF-8-aware APIs, and separate text processing from 
byte array processing. They are different in many ways, but PHP does not 
provide a distinction. Text processing in Drupal has been handicapped 
ever since we went to UTF-8. This allows us to rectify things and 
reintroduce things like upper/lowercasing for all languages (needed for 
search).
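
To make the two modes concrete, here is a minimal sketch of one "utf8" 
function with a character-based mbstring version and a UTF-8-aware 
byte-based version, selected by a single toggle. The constant names, the 
$example_unicode_mode variable and example_utf8_strlen() are hypothetical 
names made up for this illustration, not existing Drupal APIs.

```php
<?php
// Hypothetical mode constants; the names are illustrative only.
define('EXAMPLE_UNICODE_BYTES', 0);
define('EXAMPLE_UNICODE_MBSTRING', 1);

// Assumption: the mode defaults to mbstring when the extension is
// present. A settings.php override could simply reassign this variable.
$example_unicode_mode = function_exists('mb_strlen')
  ? EXAMPLE_UNICODE_MBSTRING
  : EXAMPLE_UNICODE_BYTES;

/**
 * Counts the characters in a UTF-8 string in either mode.
 */
function example_utf8_strlen($text, $mode) {
  if ($mode == EXAMPLE_UNICODE_MBSTRING) {
    // mbstring version: works on characters directly.
    return mb_strlen($text, 'UTF-8');
  }
  // Byte-based version: strip UTF-8 continuation bytes (10xxxxxx) and
  // count what is left, which is one byte per character.
  return strlen(preg_replace('/[\x80-\xBF]/', '', $text));
}
```

Callers would pass the global mode (or the function could read it 
itself), so the rest of the code never has to know which implementation 
is active.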

However, it is hard to use mbstring overload in Drupal. Drupal has to 
handle all sorts of encodings, and once the functions are overloaded it 
is impossible to call the originals, so you have to keep switching the 
internal encoding back and forth. It would also complicate the code a 
lot in ugly ways. For example, in drupal_xml_parser_create(), we use 
ereg() on a string in an as yet unknown encoding. If the internal 
encoding is UTF-8, the 
operation cannot be performed because most likely the string is not 
valid UTF-8. We would need to switch to a byte-based encoding, like 
ISO-8859-1.
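
As a rough illustration of the encoding switching this would force on 
us, the sketch below wraps a byte-oriented operation in a temporary 
switch to ISO-8859-1 and restores the previous internal encoding 
afterwards. example_with_byte_encoding() is a hypothetical helper 
invented for this example; mb_internal_encoding() is the real mbstring 
call.

```php
<?php
/**
 * Runs $callback($data) with a single-byte internal encoding.
 *
 * Hypothetical helper: only needed when the plain string functions have
 * been overloaded by mbstring.
 */
function example_with_byte_encoding($callback, $data) {
  if (!function_exists('mb_internal_encoding')) {
    // No mbstring, so the string functions are already byte-based.
    return call_user_func($callback, $data);
  }
  // Remember the current encoding, switch to a byte-based one, run the
  // operation, then switch back.
  $previous = mb_internal_encoding();
  mb_internal_encoding('ISO-8859-1');
  $result = call_user_func($callback, $data);
  mb_internal_encoding($previous);
  return $result;
}
```

Every byte-oriented call site would need a wrapper like this, which is 
the kind of clutter I would like to avoid.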

There are also several places where it is important that we count in 
bytes, not characters. For example, MIME header encoding: the output has 
to be wrapped at no more than 80 bytes per line. If mbstring overload is 
enabled, we again need to switch the internal encoding to ensure we can 
count in bytes.
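
Here is a minimal sketch of the kind of byte counting I mean: it splits 
a UTF-8 string into pieces of at most a given number of bytes without 
cutting a multi-byte character in half. It is not the actual MIME header 
code; example_utf8_byte_chunks() is a hypothetical function, and it 
assumes strlen() and substr() are not overloaded and so count bytes.

```php
<?php
/**
 * Splits UTF-8 text into chunks of at most $max_bytes bytes each.
 */
function example_utf8_byte_chunks($text, $max_bytes) {
  // A UTF-8 character is at most 4 bytes, so never go below that.
  $max_bytes = max(4, $max_bytes);
  $chunks = array();
  $length = strlen($text);
  $start = 0;
  while ($start < $length) {
    $end = min($start + $max_bytes, $length);
    // If the cut lands on a continuation byte (10xxxxxx), back up to
    // the lead byte so the character moves to the next chunk whole.
    // The $start + 1 guard keeps invalid input from stalling the loop.
    while ($end < $length && $end > $start + 1
        && (ord($text[$end]) & 0xC0) == 0x80) {
      $end--;
    }
    $chunks[] = substr($text, $start, $end - $start);
    $start = $end;
  }
  return $chunks;
}
```

With mbstring overload enabled, strlen() and substr() here would count 
characters instead, and the byte limit would silently stop holding.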

That's why I'd say that we do not support mbstring overload in Drupal, 
but update our UTF-8 APIs so they can take advantage of mbstring if it 
is present. We cannot use the non-overloaded versions of strtolower() 
and friends, because they will mess up UTF-8.

We can provide wrappers around strtolower(), ucfirst(), etc. If mbstring 
is present, they use that (and support all of Unicode), otherwise we use 
a poor man's ASCII version which leaves Unicode alone. I can provide 
optimized routines; I've written dozens of UTF-8 processors in PHP.
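
A minimal sketch of such a wrapper for lowercasing, assuming the names 
example_strtolower() and example_ascii_strtolower() (both made up for 
this mail, not existing functions):

```php
<?php
/**
 * Poor man's lowercasing: maps A-Z to a-z and nothing else.
 *
 * All bytes of a multi-byte UTF-8 character are 0x80 or higher, so a
 * byte-wise translation of the ASCII letters cannot damage them.
 */
function example_ascii_strtolower($text) {
  return strtr($text, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    'abcdefghijklmnopqrstuvwxyz');
}

/**
 * Lowercases UTF-8 text, using mbstring when it is available.
 */
function example_strtolower($text) {
  if (function_exists('mb_strtolower')) {
    // Full Unicode case mapping.
    return mb_strtolower($text, 'UTF-8');
  }
  // Fallback: ASCII only, non-ASCII characters are left as they are.
  return example_ascii_strtolower($text);
}
```

Without mbstring the result is incomplete for non-ASCII text, but it is 
always still valid UTF-8, unlike what the native strtolower() can 
produce.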

We already have several utf8 utility functions which resemble mb_ API 
calls anyway (string_length, truncate_utf8, ...). I wrote some more in 
my recent access keys patch.

Steven Wittens