[drupal-devel] [bug] Add mbstring support to Drupal

Fri Jul 15 16:09:18 UTC 2005

Issue status update for 
http://drupal.org/node/26688
Post a follow up: 
http://drupal.org/project/comments/add/26688

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  Poetro
 Updated by:   Steven
 Status:       patch

It doesn't make sense to replicate every PHP string function... a lot of
them are not useful, and several wouldn't do anything differently.

For example, if you use strpos, you'll probably want to do a substr
later. Or you just want to check if the result is FALSE or not. As this
is done on known character boundaries with a known substring, the
regular PHP functions (with byte offsets) can be used safely. Yes, I
could write UTF-8 compatible versions which count characters, but they
would be a lot slower and they would bloat the codebase unnecessarily.

And functions like split(), and substr_count() behave exactly the same
on bytes or characters. There is nothing to be gained from wrapping
them.

A lot of the multibyte problems stem from the annoying properties of
Japanese and Chinese multibyte encodings.  UTF-8 does not suffer from
them (which is why it is relatively easy to implement).

Steven

Previous comments:
------------------------------------------------------------------------

July 14, 2005 - 12:40 : Poetro

The search module contains some bugs when using native UTF-8 database
connection. To correct this some of the strtolower function should be
changed to some UTF-8 compatible versions of the function like
mb_strtolower.
The following lines should be changed:
398:
-            $word = strtolower($word);
+            $word = mb_strtolower($word, 'UTF-8');
493:
-      $arguments[] = str_replace('?', '%', strtolower($word));
+      $arguments[] = str_replace('?', '%', mb_strtolower($word,
'UTF-8'));
497:
-      $arguments[] = strtolower($word);
+      $arguments[] = mb_strtolower($word, 'UTF-8');

------------------------------------------------------------------------

July 15, 2005 - 04:33 : Steven

Attachment: http://drupal.org/files/issues/multibyte.patch (30.04 KB)

This minor issue is part of a bigger problem: lack of clear string
handling in Drupal. There was a discussion a while ago on the mailing
list (someone can dig it up if interested) about mbstring. We agreed on
a plan of action, so now, here's the patch.

Sorry for the long text, but it does not make sense to review this
patch without knowing the how's and why's. ;)

Situation
Because PHP's string handling does not support Unicode out of the box,
we need the mbstring extension to do stuff like case conversion. But,
mbstring is not a required extension (and considering its state [1], it
never will be) so we can't rely on it. But it would be nice to use it if
it's there.

Another important thing to consider is that PHP 5.1 will be getting
real Unicode support (see this comment for more info [2]). While the
functionality looks great, it introduces some new syntax elements and
changes string handling in significant ways. Because we don't want to
mess up Drupal's code too much when it happens, I decided that this was
a good opportunity to move all the encoding-specific stuff into a
unicode.inc file. That way the transition later will be easier. Also,
the encoding-stuff has been growing steadily, so it's not a bad idea to
separate it from the rest regardless.

My patch performs various checks on startup (in the common.inc stage,
i.e. after bootstrap) to see if mbstring is present. Aside from the
literal check, we also have to do some configuration checks. I also
took the liberty of introducing a check for the
search/PCRE-compatibilty issue ("Characters > 255 are not yet
supported..."). It's quite simple. Errors are reported on
admin/settings, and I also added a small status item there with info
about which string handling is used. It suggests that people install
mbstring for better Unicode support if it's not yet present.

Now, in order to use mbstring, there is the option of overloading the
standard string functions automatically. This is not really appropriate
for Drupal because it prevents access to the original string APIs.
Furthermore, if mbstring can already handle UTF-8, then it does not
make sense for us to call custom fallback routines. So, I decided to
make wrappers for the string handling operations that need them
(substr, strtolower, strtoupper, ucfirst). All of the wrappers deal in
characters and character indices only, so for all intents and purposes
they give you Unicode support already. This means the transition to PHP
Unicode will be easier for module authors. Of course, when appropriate,
you can still use the plain PHP string APIs. For example, when doing a
strpos + substr.

I also added a call to setlocale() to do away with any remaining
vagueries in non-mbstring string handling (which was locale dependant
before). As a bonus, I was able to code a very simple routine for doing
case conversion on the Latin-1 character set. This means the
capabilities of non-mbstring string handling are back to the level of
before we switched to UTF-8. Of course most non-english sites will
still benefit from mbstring.

Finally I fixed some possible case conversion issues: in a few places,
we compared the SQL function LOWER() with PHP's strtolower. Because the
behaviour of either depends on the setup, we should never mix them. So
now, you either put drupal_strtolower'd data in the database, and
strtolower your condition before you select it,  or you use LOWER() on
both the column and the value.

Note: In theory I could alter routines like truncate_utf8() and
mime_header_encode() to take more advantage of mbstring APIs... this
might result in a (minor) speed up, but it would complicate the code
further without additional features, and would mean more dependance on
the PHP configuration. So I didn't do this.

Remaining questions
One issue to deal with is database encodings. If the database encoding
is UTF-8, then the (VAR)CHAR column lengths are expressed in
characters. In that case, strings should be chopped off based on
character counts (i.e. using drupal_substr). However, if the database
encoding is an old 8-bit one like Latin-1 (ISO-8859-1), then the column
lengths are expressed in bytes. In that case, strings should be chopped
off based on byte counts (i.e. using truncate_utf8).

To make matters worse, PHP only recently added a function call to get
the database character set from MySQL to PHP HEAD, so it is not usable
for us. If we continue to use truncate_utf8, then we might chop off too
much. If we switch to drupal_substr, then we might chop off too little.

Also, if the database character set is not UTF-8, then it means any
Drupal function that relies on LOWER() will not work on non-ASCII data.
Unfortunately we cannot rely on drupal_strtolower everywhere without
wasting database space: often, we want to do a case-insensitive
comparison, but still keep the original case of the data (e.g. a
username).

Any ideas here?

Also, we got rid of ucfirst() in the page title a long time ago.
Browser support for capitalization is still shady, so it might not be a
bad idea to add it back. On the other hand, we could just capitalize the
menu item titles directly.

[1] http://www.acko.net/blog/unicode-in-php#mbstring-sucks
[2] http://www.acko.net/blog/unicode-in-php#comment-274

------------------------------------------------------------------------

July 15, 2005 - 04:53 : Steven

By the way, even if we don't end up using ucfirst in core, the function
(and the other case conversions) are used a lot in contrib. In the
interests of l10n, the idea of identically behaving, pluggable
Unicode-aware replacements of PHP's borky functions is nice.

------------------------------------------------------------------------

July 15, 2005 - 13:31 : Poetro

I would appreciate these wrappers like drupal_strtolower and for module
developers this would uniform the using of string conversion.
I think what are needed:
strtolower
strtoupper
ucfirst,
ucwords
strpos
strrpos
stripos
substr
strlen
split
substr_count
And the wrapper should deside which function to call in the
unicode.inc.
For Developers it would mean a search and replace for the previous
functions to the drupal_* function. This would extend the usability of
previous modules to get full usage of the UTF-8 characters even though
they don't have to know how it works, what converison do the called
function do, they only get the result they wanted. In times the
mb_string functions is not available there should be functions that
would convert the text in a way that is most acceptable. In my language
(Hungarian) the LOWER and UPPER SQL functions work fine (assumed that
UTF-8 is nativly used in the database [SET NAMES 'utf-8';SET CHARACTER
SET utf8] mentioned in http://drupal.org/node/26549), which is I think
is advised if You are using MySQL 4.1+, but in this case the wrappers
should be used insted the PHPs embeded functions, to prevent the
problems using them.
A good codebase I think would extend the API, and would bring some help
to those who want to use their special native characters using UTF-8 to
store data.