[drupal-devel] PHP string functions overloading for multibyte support
Hello.
My name is Piotr Szotkowski and I'm a developer responsible for internationalisation, and thus UTF-8 support, in the CiviCRM module.
One of the problems I've encountered is that we'd like PHP to overload the string manipulation functions with their multibyte counterparts (as per [1]). Unfortunately, the overloading is possible only at the directory level, which means we'd have to overload all of Drupal's calls, not only CiviCRM's.
After adding
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
to the end of my /var/www/drupal/.htaccess, every Drupal page starts with the following warning:
warning: mb_strrpos() [function.mb-strrpos]: Empty haystack in /var/www/drupal/includes/menu.inc on line 974.
which, of course, breaks any following header() calls.
What do you think about the whole idea of forcing this overload on Drupal? Is it feasible, and should I pursue the warning (and any others that follow), or should I simply drop this idea, as it will break Drupal and/or any popular modules anyway?
// We could rewrite all of our substr(), split(), strpos() calls to
// their mb_* counterparts, the catch is we'd also have to rewrite
// parts of third-party code (like Smarty), and we'd rather try the
// overloading first.
We're using CiviCRM inside Drupal 4.6.0, running on PHP 5 and Apache 2.
[1] http://www.php.net/manual/en/ref.mbstring.php#mbstring.overload
Cheers,
-- Shot --
Anarchism is founded on the observation that since few men are wise enough to rule themselves, even fewer are wise enough to rule others. -- Edward Abbey
====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
to the end of my /var/www/drupal/.htaccess, every Drupal page starts with the following warning:
warning: mb_strrpos() [function.mb-strrpos]: Empty haystack in /var/www/drupal/includes/menu.inc on line 974.
which, of course, breaks any following header() calls.
What do you think about the whole idea of forcing this overload on Drupal? Is it feasible, and should I pursue the warning (and any others that follow), or should I simply drop this idea, as it will break Drupal and/or any popular modules anyway?
IMHO one should hunt down and fix the empty haystack errors. Goba
Hello. Thanks for your quick reply! Gabor Hojtsy:
Piotr Szotkowski:
What do you think about the whole idea of forcing this overload on Drupal? Is it feasible, and should I pursue the warning (and any others that follow), or should I simply drop this idea, as it will break Drupal and/or any popular modules anyway?
IMHO one should hunt down and fix the empty haystack errors.
I agree that these warnings should be fixed anyway. I'd give them a go without any hesitation, but I'd like to know whether Drupal is fully UTF-8-aware and whether UTF-8 support is a requirement for the modules - i.e., whether my work would be useful in the end.
Adding
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
affects everything just for CiviCRM's use, and if it's known that Drupal would break (in some non-obvious way, perhaps), or there's a popular module that could stop working when we're overloading all of the string manipulating functions under its feet, then I'd rather concentrate on other solutions to my problem...
Please excuse me if my questions are simple and/or obvious; I'm very new to Drupal development and even Drupal in general.
Cheers,
-- Shot --
/* I'd just like to take this moment to point out that C has all the expressive power of two dixie cups and a string. */ -- Jamie Zawinski's xkeycaps source
====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
I agree that these warnings should be fixed anyway. I'd give them a go without any hesitation, but I'd like to know whether Drupal is fully UTF-8-aware and whether UTF-8 support is a requirement for the modules - i.e., whether my work would be useful in the end.
Adding
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
affects everything just for CiviCRM's use, and if it's known that Drupal would break (in some non-obvious way, perhaps), or there's a popular module that could stop working when we're overloading all of the string manipulating functions under its feet then I'd rather concentrate on other solutions to my problem...
Please excuse me if my questions are simple and/or obvious; I'm very new to Drupal development and even Drupal in general.
We will see if someone jumps up and informs us that this was already tried. Nothing is supposed to break, of course, if you use mbstring, but I never tried it myself.
Goba
On Wed, May 25, 2005 at 02:22:32PM +0200, Piotr Szotkowski wrote:
Gabor Hojtsy:
Piotr Szotkowski:
What do you think about the whole idea of forcing this overload on Drupal? Is it feasible, and should I pursue the warning (and any others that follow), or should I simply drop this idea, as it will break Drupal and/or any popular modules anyway?
IMHO one should hunt down and fix the empty haystack errors.
I agree that these warnings should be fixed anyway. I'd give them a go without any hesitation, but I'd like to know whether Drupal is fully UTF-8-aware and whether UTF-8 support is a requirement for the modules - i.e., whether my work would be useful in the end.
Drupal works with UTF-8 only. It is a requirement for contributed modules as well; but, like other contributed module requirements, it is a loose requirement, as there isn't a formal review process.
Adding
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
affects everything just for CiviCRM's use, and if it's known that Drupal would break (in some non-obvious way, perhaps), or there's a popular module that could stop working when we're overloading all of the string manipulating functions under its feet then I'd rather concentrate on other solutions to my problem...
Since Drupal is meant to run on many platforms, we can't depend on the libraries we want being present. So Drupal has implemented specific UTF-8 functionality as needed. For example, truncate_utf8() is implemented using pure PHP, and drupal_convert_to_utf8() is implemented by attempting to use any of the potentially available libraries.
-Neil
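The "use whichever library happens to be available" approach described above can be sketched roughly as follows. This is only an illustration, not Drupal's drupal_convert_to_utf8(): the function name sketch_convert_to_utf8() is hypothetical, and the choice and order of libraries (iconv first, then mbstring) is an assumption.

```php
<?php
// Hypothetical sketch; the name and the library order are assumptions.
// Converts $data from $encoding to UTF-8 with the first library that is
// loaded and succeeds. Returns false when no library can do the conversion.
function sketch_convert_to_utf8(string $data, string $encoding)
{
    if (function_exists('iconv')) {
        // iconv() returns false (and raises a notice or warning) on failure.
        $out = @iconv($encoding, 'UTF-8', $data);
        if ($out !== false) {
            return $out;
        }
    }
    if (function_exists('mb_convert_encoding')) {
        try {
            // Newer PHP versions throw on an unknown encoding name.
            $out = @mb_convert_encoding($data, 'UTF-8', $encoding);
            if ($out !== false) {
                return $out;
            }
        } catch (\Throwable $e) {
            // Fall through to the failure return below.
        }
    }
    return false;
}
```

A caller would have to treat a false return as "leave the data alone or reject it", since in that case neither extension could do the conversion.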
Piotr Szotkowski wrote:
Hello.
My name is Piotr Szotkowski and I'm a developer responsible for internationalisation, and, thus, UTF-8 support in the CiviCRM module.
One of the problems I've encountered is that we'd like PHP to overload the string manipulation functions with their multibyte counterparts (as per [1]). Unfortunately, the overloading is possible only at the directory level, which means we'd have to overload all of the Drupal's calls, not only CiviCRM's ones.
After adding
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
Actually you will run into some more problems. Drupal is designed to not use mbstring, because it is not available everywhere. We have our own functions for handling basic UTF-8 stuff (string truncation, mime header encode, etc). These functions assume they get direct access to the string's bytes.
I have no idea how thorough mbstring override is, but you will certainly run into problems. Perhaps the best solution is to make Drupal explicitly check for mbstring and use it if present, otherwise use its own routines.
truncate_utf8() is an excellent example of this. I believe that with the above PHP settings, it will actually perform excessive truncation, where the Unicode character codepoints are treated with their meaning as UTF-8 bytes. Furthermore, if it counts characters, not bytes, then there is no guarantee that the returned string is short enough.
Steven
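As a rough sketch of the "check for mbstring and use it if present, otherwise use our own routine" idea, applied to truncation: the function names below are hypothetical and this is not Drupal's truncate_utf8(). Both paths count bytes, so the result never exceeds the limit, and both back off to a character boundary. mb_strcut() is a real mbstring function that cuts by byte count; the fallback assumes its input is valid UTF-8.

```php
<?php
// Hypothetical pure-PHP fallback: truncate to at most $maxBytes bytes
// without cutting a multibyte UTF-8 character in half.
function sketch_truncate_utf8_bytes(string $string, int $maxBytes): string
{
    $maxBytes = max(0, $maxBytes);
    if (strlen($string) <= $maxBytes) {
        return $string;
    }
    $len = $maxBytes;
    // Continuation bytes look like 10xxxxxx. If the first dropped byte is
    // one of them, the cut is inside a character, so move the cut back.
    while ($len > 0 && (ord($string[$len]) & 0xC0) === 0x80) {
        $len--;
    }
    return substr($string, 0, $len);
}

// Hypothetical wrapper: prefer mbstring when it is loaded.
function sketch_truncate_utf8(string $string, int $maxBytes): string
{
    if (function_exists('mb_strcut')) {
        // mb_strcut() takes a length in bytes and respects character
        // boundaries for the given encoding.
        return mb_strcut($string, 0, max(0, $maxBytes), 'UTF-8');
    }
    return sketch_truncate_utf8_bytes($string, $maxBytes);
}
```

Note that the fallback relies on strlen() and substr() working on bytes, which is exactly the assumption that function overloading would break.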
Hello. Steven Wittens:
Actually you will run into some more problems. Drupal is designed to not use mbstring, because it is not available everywhere. We have our own functions for handling basic UTF-8 stuff (string truncation, mime header encode, etc). These functions assume they get direct access to the string's bytes.
Ok, I think I get it. Drupal's *_utf8 functions use the "classic" (non-multibyte) string manipulation functions as something that operates on bytes, not characters, and *depend* on such behaviour, right?
I have no idea how thorough mbstring override is, but you will certainly run into problems. Perhaps the best solution is to make Drupal explicitly check for mbstring and use it if present, otherwise use its own routines.
That seems to make sense. Do all such Drupal wrapper functions contain "utf8" in their names? If not, is there some place that lists all these functions, so I won't miss any? If I come up with a patch for this, should I send it here or someplace else?
Cheers,
-- Shot --
It is a book about a Spanish guy called Manual. You should read it. -- Dilbert
====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
Hello. Steven Wittens:
Actually you will run into some more problems. Drupal is designed to not use mbstring, because it is not available everywhere. We have our own functions for handling basic UTF-8 stuff (string truncation, mime header encode, etc). These functions assume they get direct access to the string's bytes.
I'll take a look at mime_header_encode() later (at last I'll have a reason to thoroughly research how non-ASCII characters should be encoded in MIME headers...), but am I right in thinking the below should bulletproof truncate_utf8()?
diff -ur drupal-4.6.0/includes/common.inc drupal/includes/common.inc
--- drupal-4.6.0/includes/common.inc 2005-04-12 00:50:41.000000000 +0200
+++ drupal/includes/common.inc 2005-05-26 14:08:03.994239172 +0200
@@ -1707,10 +1707,12 @@
   if ($wordsafe) {
     while (($string[--$len] != ' ') && ($len > 0)) {};
   }
-  if ((ord($string[$len]) < 0x80) || (ord($string[$len]) >= 0xC0)) {
-    return substr($string, 0, $len);
+  if (!(ini_get('mbstring.func_overload') & 2)) {
+    if ((ord($string[$len]) < 0x80) || (ord($string[$len]) >= 0xC0)) {
+      return substr($string, 0, $len);
+    }
+    while (ord($string[--$len]) < 0xC0) {};
   }
-  while (ord($string[--$len]) < 0xC0) {};
   return substr($string, 0, $len);
 }
I guess that the truncate_utf8() function is used (among other places) for VARCHAR fields in MySQL. Does Drupal require MySQL 4.1 and assume it's properly configured (so MySQL's VARCHAR length is in characters, not bytes), or should this function ensure that the return value is at most $len *bytes* long?
I have no idea how thorough mbstring override is,
http://www.php.net/manual/en/ref.mbstring.php#AEN80049 - the thoroughness depends on the mbstring.func_overload setting.
but you will certainly run into problems. Perhaps the best solution is to make Drupal explicitly check for mbstring and use it if present, otherwise use its own routines.
That's what I'm trying to do. I think patching Drupal (and having the "classic" functions work as multibyte ones) is still better than patching Smarty and all the other third-party packages to use mb_substr() instead of substr(), etc. Of course I'm open to any comments on this approach. Cheers, -- Shot -- I hate leaving Windows 95 boxes publically accessible, so shifting even to NT is a blessing in some ways. At least I can reboot them remotely in a sane manner, rather than having to send them malformed packets. -- _BOFHJournal_ ====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
I updated the comment for that function recently... it is about bytes, not characters.
It is a tough situation... not only can PHP support UTF-8, but also the MySQL database. And for most people, they have no control over the MySQL settings of their host. If MySQL is not configured for UTF-8, then strings will still work, but the column sizes will be in bytes.
We assume the worst situation right now (no mbstring, no MySQL UTF-8). I think it is a reasonable expectation that hosts with mbstring also use UTF-8 for the database. People who don't have a UTF-8 database will have to disable mbstring. We can provide a section for this in settings.php, and we can enable mbstring by default if present, and force the character set to UTF-8.
So, for every "utf8" function, we make an mbstring version, which works on characters, and a non-mbstring version which works on bytes (but is UTF-8 aware). I'd like a global toggle between the two modes, set in settings.php. That way we can control encodings inside Drupal. We can provide specific UTF-8-aware APIs, and separate text processing from byte array processing. They are different in many ways, but PHP does not provide a distinction.
Text processing in Drupal has been handicapped ever since we went to UTF-8. This allows us to rectify things and reintroduce things like upper/lowercasing for all languages (needed for search).
However, it is hard to use mbstring overload in Drupal. Drupal has to handle all sorts of encodings, and once overloaded it is impossible to call the original functions and you have to change the internal encoding back and forth. It would also complicate the code a lot in ugly ways. For example, in drupal_xml_parser_create(), we use ereg() on a string in an as yet unknown encoding. If the internal encoding is UTF-8, the operation cannot be performed because most likely the string is not valid UTF-8. We would need to switch to a byte-based encoding, like ISO-8859-1.
There are also several places where it is important that we count in bytes, not characters. For example, the mime header encode: it has to be wrapped at no more than 80 bytes per line. If mbstring overload is enabled, we again need to switch the internal encoding to ensure we can count in bytes.
That's why I'd say that we do not support mbstring overload in Drupal, but update our UTF-8 APIs so they can take advantage of mbstring if it is present. We cannot use the non-overloaded versions of strtolower() and friends, because they will mess up UTF-8. We can provide wrappers around strtolower(), ucfirst(), etc. If mbstring is present, they use that (and support all of Unicode), otherwise we use a poor man's ASCII version which leaves Unicode alone. I can provide optimized routines; I've written dozens of UTF-8 processors in PHP.
We already have several utf8 utility functions which resemble mb_ API calls anyway (string_length, truncate_utf8, ...). I wrote some more in my recent access keys patch.
Steven Wittens
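The wrapper idea for strtolower() described above might look roughly like this. It is only a sketch: the function names are hypothetical and nothing here is taken from Drupal's code. With mbstring loaded, the wrapper lowercases all of Unicode; without it, the "poor man's" version only maps the ASCII letters A-Z and leaves every byte of a multibyte sequence untouched.

```php
<?php
// Hypothetical ASCII-only fallback. strtr() with an explicit A-Z map is
// byte-based and never touches bytes >= 0x80, so UTF-8 sequences survive.
function sketch_ascii_strtolower(string $text): string
{
    return strtr(
        $text,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz'
    );
}

// Hypothetical wrapper: full Unicode lowercasing when mbstring is loaded,
// ASCII-only lowercasing otherwise.
function sketch_strtolower(string $text): string
{
    if (function_exists('mb_strtolower')) {
        // The encoding is passed explicitly so the result does not depend
        // on mbstring.internal_encoding.
        return mb_strtolower($text, 'UTF-8');
    }
    return sketch_ascii_strtolower($text);
}
```

Passing the encoding on every call is one way to avoid switching the internal encoding back and forth; a global toggle in settings.php, as suggested above, would be layered on top of wrappers like these.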
Hello. Steven Wittens:
I updated the comment for that function recently... it is about bytes, not characters. It is a tough situation...not only can PHP support UTF-8, but also the MySQL database. And for most people, they have no control over the MySQL settings of their host.
Just a side note: They don't have to. The database character set can be set at database creation (and this is done by Drupal), and everything else (client, connection and results character sets) can be set at runtime with the SQL query SET NAMES 'utf8' (best run just after establishing the connection, of course).
However, it is hard to use mbstring overload in Drupal. Drupal has to handle all sorts of encodings, and once overloaded it is impossible to call the original functions and you have to change the internal encoding back and forth.
It's possible with some of them; you just call them with ISO-8859-1:
if ($overloaded) {
  $byteLength = strlen($string, 'ISO-8859-1');
} else {
  $byteLength = strlen($string);
}
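A slightly more defensive variant of the same byte-length check is sketched below. The helper name is hypothetical. It assumes that bit 2 of mbstring.func_overload is the one covering the string functions, and it uses mbstring's '8bit' encoding, which makes mb_strlen() count bytes. On PHP versions where the func_overload setting no longer exists, ini_get() returns false and the plain strlen() branch is used.

```php
<?php
// Hypothetical helper: byte length of $string whether or not strlen()
// has been overloaded by mbstring.
function sketch_byte_length(string $string): int
{
    // (int) false === 0, so a missing setting counts as "not overloaded".
    $overloaded = ((int) ini_get('mbstring.func_overload') & 2) !== 0;
    if ($overloaded && function_exists('mb_strlen')) {
        // '8bit' treats every byte as one character.
        return mb_strlen($string, '8bit');
    }
    return strlen($string);
}
```

Calling mb_strlen() by its own name, rather than strlen() with a second argument, keeps the code valid on installations where overloading is off.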
It would also complicate the code a lot in ugly ways. For example, in drupal_xml_parser_create(), we use ereg() on a string in an as of yet unknown encoding. If the internal encoding is UTF-8, the operation cannot be performed because most likely the string is not valid UTF-8. We would need to switch to a byte-based encoding, like ISO-8859-1.
if ($overloaded) {
  mb_regex_encoding('ISO-8859-1');
  $result = ereg($pattern, $string);
  mb_regex_encoding('UTF-8');
} else {
  $result = ereg($pattern, $string);
}
But I agree, it's uglier than the previous example - although most probably mb_regex_encoding could simply be set once for the whole drupal_xml_parser_create.
I'm not trying to argue here; if you decide on a "no overloading" policy in the end, then so be it - it's just that all of the problems raised so far seem solvable. :o)
The prospect of having the string functions overloaded is very tempting, as otherwise we'll have to make our own wrappers/overloads for Smarty functions, and for any other pieces of third-party packages we use. String functions overloading would take care of everything at once and should be relatively future-proof, while each new version of Smarty (and other packages) could require rewriting of the fixes.
Cheers,
-- Shot --
We're the technical experts. We were hired so that management could ignore our recommendations and tell us how to do our jobs. -- Mike Andrews
====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
I'm not trying to argue here; if you decide on a "no overloading" policy in the end, then so be it - it's just that all of the problems raised so far seem solvable. :o)
The prospect of having the string functions overloaded is very tempting, as otherwise we'll have to make our own wrappers/overloads for Smarty functions, and for any other pieces of third-party packages we use. String functions overloading would take care of everything at once and should be relatively future-proof, while each new version of Smarty (and other packages) could require rewriting of the fixes.
We need to create our own wrappers anyway, since we should not make mbstring a requirement for Drupal! Supporting mbstring too might turn out to be more performant in some cases than user-defined PHP functions, but then again the special case checks all around Drupal might complicate our code significantly and might hurt performance.
Goba
Hello. Gabor Hojtsy:
We need to create our own wrappers anyway, since we should not make mbstring a requirement for Drupal!
Of course we need. All I'm saying is, "let's make the wrappers work even when the string functions are overloaded" - which basically means, "let's check whether strlen() is overloaded or not before using it for obtaining *byte* length of a string and act accordingly." Cheers, -- Shot -- It's clear that whoever set up the font colorings for most programming modes has seen too many Peter Max posters, or did more acid than I did in the 60s. -- Charles R. Martin, gnu.emacs.help ====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
I'm not trying to argue here; if you decide on a "no overloading" policy in the end, then so be it - it's just that all of the problems raised so far seem solvable. :o)
What I'm trying to avoid is the situation where every call to a string function has to have luggage around it. A lot of the standard string functions work perfectly fine on UTF-8; it is not necessary to use mbstring for them. Only some functions like strtolower() and strtoupper() mess up UTF-8 when you don't have mbstring. Another unsafe example is substr(): you cannot use it without mbstring in most cases, because you need to split on a character boundary. So we need the wrappers for at least these functions to ensure Drupal is still UTF-8-safe without mbstring.
Furthermore, mbstring behaves subtly differently for some functions, e.g. throwing warnings when the standard ones don't complain. It seems to me that limiting our usage of mbstring to a few well-known and tested cases is much less likely to cause problems than overloading everywhere.
I also believe that because of PHP's lack of distinguishing characters vs bytes, mbstring overloading is a bad idea regardless. If we allow mbstring overload, there is no guarantee for a simple PHP API call anymore.
Steven Wittens
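To illustrate the distinction drawn above between byte-safe functions and substr(), here is a small sketch. The helper name is hypothetical, and the validity check relies on preg_match() with the /u modifier failing on malformed UTF-8. Splitting or searching on an ASCII delimiter keeps every piece valid, while cutting at an arbitrary byte offset may not.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
// preg_match() with the /u modifier fails on malformed UTF-8 subjects.
function sketch_is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// "naive,cafe" with i-diaeresis and e-acute, written as UTF-8 bytes.
$text = "na\xC3\xAFve,caf\xC3\xA9";

// Splitting on an ASCII delimiter is byte-safe: both pieces stay valid.
$parts = explode(',', $text);

// strpos() returns a byte offset, which is still a usable position
// for other byte-based functions.
$comma = strpos($text, ',');

// substr() with an arbitrary byte count can cut a character in half:
// this keeps "na" plus only the first byte of the two-byte character.
$broken = substr($text, 0, 3);
```

Here $broken fails the validity check, which is the reason a boundary-aware wrapper is needed for truncation but not for explode() or strpos().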
Hello. Steven Wittens:
Furthermore, mbstring behaves subtly differently for some functions, e.g. throwing warnings when the standard ones don't complain. It seems to me that limiting our usage of mbstring to a few well-known and tested cases is much less likely to cause problems than overloading everywhere.
I also believe that because of PHP's lack of distinguishing characters vs bytes, mbstring overloading is a bad idea regardless. If we allow mbstring overload, there is no guarantee for a simple PHP API call anymore.
Ok, I can understand this. So, is Drupal going to override any system-wide setting, thus guaranteeing that the string functions are not overloaded? This would make sense...
If I may ask for an opinion: What would be your approach to our problem, e.g. Smarty's truncate function not being multibyte-aware? Would you patch Smarty, write a mb_truncate Smarty wrapper, or what?
Cheers,
-- Shot --
Anyway, the :// part is an 'emoticon' representing a man with a strip of sticky tape across his mouth. -- R. Douglas, asr
====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
On 26 May 2005, at 00:53, Steven Wittens wrote:
After adding
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
Actually you will run into some more problems. Drupal is designed to not use mbstring, because it is not available everywhere. We have our own functions for handling basic UTF-8 stuff (string truncation, mime header encode, etc). These functions assume they get direct access to the string's bytes.
Maybe we should document this in settings.php and override the PHP settings (in the unlikely event that someone enabled mbstring globally)?
-- Dries Buytaert :: http://www.buytaert.net/
Hello. Dries Buytaert:
Maybe we should document this in settings.php and override the PHP settings (in the unlikely event that someone enabled mbstring globally)?
Shouldn't we rather go the other way, i.e. use strlen() for getting the character length, not the byte length of strings? As far as I understand, Drupal is internally UTF-8 only; wouldn't it be logical to have the string functions consider the strings as UTF-8 ones? As far as I understand, Drupal requires modules to use the truncate_utf8 function anyway, and fixing it to check whether it's overloaded or not (and react accordingly) is easy... Cheers, -- Shot -- She was good at playing abstract confusion in the same way a midget is good at being short. -- Clive James on Marilyn Monroe ====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
participants (5)
- Dries Buytaert
- Gabor Hojtsy
- neil@civicspacelabs.org
- Piotr Szotkowski
- Steven Wittens