[drupal-devel] Search.module and minimum version of PHP
Some of you may have noticed that the 4.6 RC announcements have some comments about the PHP version in them. The cause is the new search.module, which requires Unicode/UTF-8 support in the Perl-compatible regular expressions (PCRE) library. The PHP documentation says UTF-8 compatibility was added to PCRE with PHP 4.1 on Unix and PHP 4.2.3 on Windows (and this is why I thought it was reasonable to use it). But it seems that the /real/ support wasn't added until PHP 4.3.3. I've done some testing with Gerhard, and the UTF-8 support in the PCRE in PHP 4.3.2 (or earlier) is pretty broken as far as I can tell, on both Windows and Linux.

The reason behind all this is that the search now supports characters in the entire Unicode range when it splits up text into words. This is in fact important for every language, as more and more 'high' Unicode characters are used every day (anything outside ISO-8859-1/Latin-1, e.g. smart/curly quotes, the euro sign, math symbols, and any language that does not use the accented Latin script).

In theory we could convert the character-based regular expression into a byte-based one, but this would require an insane amount of coding to do the conversion programmatically. The result would be a truly monstrous regular expression, so I really don't see it happening in practice. Or we could write our own UTF-8 compatible tokenizer, but again this would be a large piece of code that is slow to boot.

An alternative is to ignore high Unicode characters in the searching. This means that sites with Western European content will still be indexed in a somewhat working fashion (search would just behave badly around curly quotes and euro signs), but any other language will be broken. It is certainly not something we can implement in Drupal core, but I can make a patch which does this for those who are stuck on an old PHP install.

Or we could just say "search.module and thus Drupal requires PHP 4.3.3".

What do you guys think about it?

Steven
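To make the trade-off above concrete, here is a minimal sketch contrasting Unicode-aware word splitting with the "ignore high characters" fallback. It is written in Python purely as an illustration; it is not PHP and not the actual search.module code, and the function names and sample text are hypothetical.

```python
import re

# Unicode-aware: in Python 3 str patterns, \w matches letters and digits
# from any script (plus the underscore).
UNICODE_WORD = re.compile(r"\w+")

# Fallback that ignores "high" characters: only ASCII letters and digits
# count as word characters; everything else acts as a separator.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")


def split_words_unicode(text):
    """Split text into words using the full Unicode range."""
    return UNICODE_WORD.findall(text)


def split_words_ascii(text):
    """Split text into words, treating non-ASCII characters as separators."""
    return ASCII_WORD.findall(text)


# "café costs 5€ in Ελλάδα", written with escapes to avoid encoding ambiguity.
sample = "caf\u00e9 costs 5\u20ac in \u0395\u03bb\u03bb\u03ac\u03b4\u03b1"

print(split_words_unicode(sample))
print(split_words_ascii(sample))
```

With this sample, the ASCII-only splitter truncates "café" to "caf" and drops the Greek word entirely, while the Unicode-aware one keeps both words whole. That is roughly the "somewhat working for Western European content, broken for everything else" behaviour described above.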
On Mon, 07 Mar 2005 01:11:22 +0100 Steven Wittens <steven@acko.net> wrote: [...]
Or we could just say "search.module and thus Drupal requires PHP 4.3.3".
Does this only affect "search.module"? I.e., could a site use Drupal 4.6 on pre-PHP 4.3.3 if they disable the search module? (I realize this isn't the point, I'm just wanting to understand...)

Thanks,
-Jeremy
Does this only affect "search.module"? I.e., could a site use Drupal 4.6 on pre-PHP 4.3.3 if they disable the search module? (I realize this isn't the point, I'm just wanting to understand...)
Yes, I am not aware of any problems with the current minimum PHP requirement (4.1.0?) other than search.module.

By the way, I did some more testing, and it seems that the pre-4.3.3 UTF-8 support was more broken than I thought: the only option is in fact a byte-based regular expression, as even using ISO-8859-1 characters is problematic with anything other than simple character matches. So the workaround would not only be broken on anything non-Western-European, it would also be very difficult to write the patch in the first place.

Steven
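To give a sense of why the byte-based route is laborious, here is a sketch of what a single small character range looks like once it is rewritten over UTF-8 bytes. Again this is Python used as an illustration, not PHP's PCRE; the range U+00C0 to U+00FF and the helper name are chosen for the example only.

```python
import re

# Character-based: one class covers U+00C0..U+00FF
# (accented Latin-1 letters plus the multiplication and division signs).
CHAR_PATTERN = re.compile("[\u00c0-\u00ff]+")

# Byte-based equivalent over UTF-8 input: every code point in that range
# encodes as the lead byte 0xC3 followed by one continuation byte 0x80..0xBF.
BYTE_PATTERN = re.compile(rb"(?:\xc3[\x80-\xbf])+")


def find_runs_bytes(data):
    """Find runs of U+00C0..U+00FF in UTF-8 bytes and decode them to text."""
    return [match.decode("utf-8") for match in BYTE_PATTERN.findall(data)]


# "naïve été", written with escapes to avoid encoding ambiguity.
text = "na\u00efve \u00e9t\u00e9"

print(CHAR_PATTERN.findall(text))
print(find_runs_bytes(text.encode("utf-8")))
```

Both patterns find the same three characters here, but only because this range happens to sit under a single lead byte. A range that spans several lead bytes, or mixes two-, three- and four-byte encodings, needs one alternative per segment, which is how the expression grows so quickly.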
On Mon, 7 Mar 2005, Steven Wittens wrote:
Does this only affect "search.module"? I.e., could a site use Drupal 4.6 on pre-PHP 4.3.3 if they disable the search module? (I realize this isn't the point, I'm just wanting to understand...)
Yes, I am not aware of any problems with the current minimum PHP requirement (4.1.0?) other than search.module.
I don't think there are any.
By the way, I did some more testing, and it seems that the pre-4.3.3 UTF-8 support was more broken than I thought: the only option is in fact a byte-based regular expression, as even using ISO-8859-1 characters is problematic with anything other than simple character matches. So the workaround would not only be broken on anything non-Western-European, it would also be very difficult to write the patch in the first place.
The problem with PHP 4.1 is that it is still the standard for the stable Debian release. This release is probably still used by a lot of webhosters (and by myself). Do we want to take away the possibility to use the Drupal search from all their clients? There is no backported .deb for PHP from a reliable source.

I'd really like some "fix" to be included and executed for PHP versions below 4.3.3, even if it only works for a limited set of character sets (or for nothing beyond ISO-8859-1).

Cheers,
Gerhard
The problem with PHP 4.1 is that it is still the standard for the stable Debian release. This release is probably still used by a lot of webhosters (and by myself). Do we want to take away the possibility to use the Drupal search from all their clients? There is no backported .deb for PHP from a
So, what happens if I went ahead and ran 4.6 on a non-4.3.3 system anyways? Will only UTF8 text searching be broken? Will all of search be broken? If I don't use non-US ASCII characters, will search be fine?

--
Morbus Iff ( i'm wearing footsie jammies here )
Technical: http://www.oreillynet.com/pub/au/779
Culture: http://www.disobey.com/ and http://www.gamegrene.com/
icq: 2927491 / aim: akaMorbus / yahoo: morbus_iff / jabber.org: morbus
So, what happens if I went ahead and ran 4.6 on a non-4.3.3 system anyways? Will only UTF8 text searching be broken? Will all of search be broken? If I don't use non-US ASCII characters, will search be fine?
Nope, you'll get an error and it will be broken. The Unicode characters are used in the regular expressions as well. Steven Wittens
Steven Wittens wrote:
Or we could just say "search.module and thus Drupal requires PHP 4.3.3".
What do you guys think about it?
Release "as is", saying that it requires PHP 4.3.3. It's not unreasonable in a PHP 5.x world.

And (if you or someone feels up to it) release a contributed module that works on 4.1 and call it the "4.1 Semi Compatible Search" module (with no support for high Unicode characters in search).

And/or release an updated version of the old search module as well: the "Guaranteed to work, but search results might be lacking" module.

Just my $0.02.

andre
participants (6)
- Andre Molnar
- Gabor Hojtsy
- Gerhard Killesreiter
- Jeremy Andrews
- Morbus Iff
- Steven Wittens