[drupal-devel] multibyte/mbstring in Drupal
Steven Wittens
steven at acko.net
Fri May 27 06:07:36 UTC 2005
I updated the comment for that function recently... it is about bytes,
not characters. It is a tough situation... UTF-8 support depends not
only on PHP, but also on the MySQL database, and most people have no
control over the MySQL settings of their host. If MySQL is not
configured for UTF-8, then strings will still work, but the column sizes
will be in bytes. We assume the worst situation right now (no mbstring,
no MySQL UTF-8).

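
To make the byte/character gap concrete, here is a minimal PHP sketch.
The helper name example_char_count() is made up for illustration and is
not an existing Drupal function. It assumes mbstring function overloading
is off, so strlen() counts bytes.

```php
<?php
// Hypothetical helper: count UTF-8 characters without mbstring.
function example_char_count($text) {
  // Every UTF-8 character contains exactly one byte that is not a
  // continuation byte (10xxxxxx), so counting those bytes counts characters.
  return preg_match_all('/[\x00-\x7F\xC0-\xFF]/', $text, $unused);
}

// "café", written with byte escapes so the file encoding does not matter.
$name = "caf\xC3\xA9";

$bytes = strlen($name);              // 5: what a byte-sized column sees
$chars = example_char_count($name);  // 4: what the user sees
```

Under the worst-case assumption, a limit of N on a column has to be read
as N bytes, so this 4-character string already uses 5 of them.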
I think it is a reasonable expectation that hosts with mbstring also use
UTF-8 for the database. People who don't have a UTF-8 database will have
to disable mbstring. We can provide a section for this in settings.php,
and we can enable mbstring by default if present, and force the
character set to UTF-8.

So, for every "utf8" function, we make an mbstring version, which works
on characters, and a non-mbstring version which works on bytes (but is
UTF-8 aware). I'd like a global toggle between the two modes, set in
settings.php. That way we can control encodings inside Drupal. We can
provide specific UTF-8-aware APIs, and separate text processing from
byte array processing. They are different in many ways, but PHP does not
distinguish between them. Text processing has been handicapped in
Drupal ever since we went to UTF-8. This allows us to rectify things and
reintroduce things like upper/lowercasing for all languages (needed for
search).

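
A rough sketch of what such a two-mode function could look like, using
truncation as the example. The names $GLOBALS['example_multibyte'] and
example_truncate() are hypothetical, and the toggle here simply defaults
to whether mbstring is loaded; in the proposal it would be set in
settings.php.

```php
<?php
// Hypothetical global toggle (would live in settings.php):
// TRUE = mbstring mode, lengths are characters;
// FALSE = byte mode, lengths are bytes.
$GLOBALS['example_multibyte'] = extension_loaded('mbstring');

// Hypothetical two-mode truncate.
function example_truncate($text, $len) {
  if ($GLOBALS['example_multibyte']) {
    // mbstring version: $len counts characters.
    return mb_substr($text, 0, $len, 'UTF-8');
  }
  // Byte version: $len counts bytes, but stays UTF-8 aware.
  if (strlen($text) <= $len) {
    return $text;
  }
  // Back up while the first dropped byte is a continuation byte
  // (10xxxxxx), so a multi-byte character is never cut in half.
  while ($len > 0 && (ord($text[$len]) & 0xC0) == 0x80) {
    $len--;
  }
  return substr($text, 0, $len);
}
```

In byte mode the result can be shorter than $len bytes, because a
character that does not fit completely is dropped rather than split.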
However, it is hard to use mbstring overload in Drupal. Drupal has to
handle all sorts of encodings, and once the string functions are
overloaded it is impossible to call the original functions, so you have
to change the internal encoding back and forth. It would also complicate
the code a lot in ugly ways. For example, in drupal_xml_parser_create(),
we use ereg() on a string in an as yet unknown encoding. If the internal
encoding is UTF-8, the operation cannot be performed because most likely
the string is not valid UTF-8. We would need to switch to a byte-based
encoding, like ISO-8859-1.

There are also several places where it is important that we count in
bytes, not characters. For example, MIME header encoding: the output has
to be wrapped at no more than 80 bytes per line. If mbstring overload is
enabled, we again need to switch the internal encoding to ensure we can
count in bytes.

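
As an illustration of why byte counts matter here, below is a simplified
sketch of a Base64 header encoder. It is not Drupal's actual
implementation; example_mime_header_encode() is a made-up name, the
80-byte limit is taken from the paragraph above, and the code assumes
mbstring overload is off so that strlen() and substr() work on bytes.

```php
<?php
// Hypothetical helper, for illustration only.
function example_mime_header_encode($text, $line_limit = 80) {
  $prefix = '=?UTF-8?B?';
  $suffix = '?=';
  // Room left for Base64 data on a line, after the prefix, the suffix and
  // the single space that starts each continuation line.
  $room = $line_limit - strlen($prefix) - strlen($suffix) - 1;
  // Base64 turns every 3 raw bytes into 4 output bytes.
  $chunk_bytes = max(3, (int) floor($room / 4) * 3);
  $lines = array();
  while ($text !== '') {
    $end = min($chunk_bytes, strlen($text));
    // Back up while the first excluded byte is a UTF-8 continuation byte
    // (10xxxxxx), so no character is split across two lines.
    while ($end > 0 && $end < strlen($text) && (ord($text[$end]) & 0xC0) == 0x80) {
      $end--;
    }
    if ($end == 0) {
      // Input is not valid UTF-8; cut on raw bytes to guarantee progress.
      $end = min($chunk_bytes, strlen($text));
    }
    $lines[] = $prefix . base64_encode(substr($text, 0, $end)) . $suffix;
    $text = (string) substr($text, $end);
  }
  return implode("\n ", $lines);
}
```

Every length in this function is a byte count. If strlen() and substr()
were overloaded to count characters, the chunk arithmetic would be wrong
for any non-ASCII input.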
That's why I'd say that we should not support mbstring overload in
Drupal, but update our UTF-8 APIs so they can take advantage of mbstring
if it is present. We cannot use the non-overloaded versions of
strtolower() and friends, because they will mess up UTF-8.

We can provide wrappers around strtolower(), ucfirst(), etc. If mbstring
is present, they use that (and support all of Unicode); otherwise we use
a poor man's ASCII version which leaves Unicode alone. I can provide
optimized routines; I've written dozens of UTF-8 processors in PHP.

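
A minimal sketch of such a wrapper for strtolower(). The function names
are hypothetical; mb_strtolower() and strtr() are the standard PHP
functions.

```php
<?php
// Hypothetical fallback: lowercase only A-Z and leave every other byte,
// including all multi-byte UTF-8 sequences, untouched.
function example_ascii_strtolower($text) {
  return strtr($text, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// Hypothetical wrapper: use mbstring when it is available.
function example_strtolower($text) {
  if (function_exists('mb_strtolower')) {
    // Full Unicode case mapping.
    return mb_strtolower($text, 'UTF-8');
  }
  return example_ascii_strtolower($text);
}
```

With mbstring, "ÉCOLE" becomes "école". Without it, the result is
"École": the accented letter keeps its case, but the string is still
valid UTF-8.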
We already have several utf8 utility functions which resemble mb_* API
calls anyway (string_length, truncate_utf8, ...). I wrote some more in
my recent access keys patch.

Steven Wittens