[drupal-devel] [bug] Fix XML UTF-8 bom issue. Parse according to specs or not?

Steven drupal-devel at drupal.org
Thu Jun 2 07:29:34 UTC 2005


Issue status update for http://drupal.org/node/24141

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  Steven
 Updated by:   Steven
 Status:       patch

Better title.




Steven



Previous comments:
------------------------------------------------------------------------

June 2, 2005 - 08:46 : Steven

Attachment: http://drupal.org/files/issues/bom.patch (1.26 KB)

As reported in this forum topic [1] (near the bottom), PHP5 cannot parse
UTF-8 encoded XML feeds that start with the so-called "byte order mark"
(BOM, the bytes EF BB BF), which most Microsoft applications prepend to
UTF-8 encoded files as a signature.


E.g. http://msdn.microsoft.com/rss.xml


The XML spec allows a BOM in UTF-x encoded feeds. PHP4's parser is
smart enough to strip it away, while PHP5 reports an "Empty document"
error. The attached patch explicitly strips the BOM if present.
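
For illustration only, here is a minimal sketch of the stripping idea,
written in Python rather than the PHP of the attached patch. The name
strip_utf8_bom() is hypothetical and does not come from bom.patch.

```python
# The UTF-8 byte order mark: the three bytes EF BB BF.
UTF8_BOM = b"\xef\xbb\xbf"


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


# Example: a feed as a BOM-prefixing application might save it.
feed = UTF8_BOM + b'<?xml version="1.0" encoding="utf-8"?><rss/>'
clean = strip_utf8_bom(feed)
```

Only a BOM at the very start of the data is removed. Data without a BOM
passes through unchanged, so the check is safe to run on every feed.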


Note that even after this patch, Drupal still doesn't parse XML 100%
according to spec. Most notably, the following situations will fail:
- XML requires that every parser handle UTF-16 encoded XML. PHP doesn't
support this, so we would need to check for the UTF-16 BOMs (little and
big endian), strip them out, then convert the data to UTF-8.
- XML says that external encoding information (like the HTTP header
Content-Type: text/xml; charset=utf-8) takes precedence over the
encoding="" declaration inside the document. We currently don't check
the HTTP headers at all in aggregator.module, nor do we allow external
encoding info to be passed to drupal_xml_parser_create().
- XML says that if the detected and declared encodings are not equal, an
error should be thrown.
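
To make the three cases above concrete, here is a rough sketch in
Python (not PHP, and not part of the patch). All function names and the
http_charset parameter are hypothetical. It assumes that a document
without a BOM has an ASCII-compatible prolog, and it does not rewrite
the encoding="" declaration after converting, which a real
implementation would also have to deal with.

```python
import codecs
import re

# BOM signatures and the encodings they indicate.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def declared_encoding(head: str):
    """Read encoding="..." from the XML declaration, if there is one."""
    m = re.match(r'<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']', head)
    return m.group(1).lower() if m else None


def family(name: str) -> str:
    """Normalize names so that utf-16-le and utf-16 compare as equal."""
    name = name.lower().replace("_", "-")
    if name == "utf8":
        return "utf-8"
    return "utf-16" if name.startswith("utf-16") else name


def to_utf8(data: bytes, http_charset=None) -> bytes:
    """Strip any BOM and convert the document to UTF-8."""
    detected, skip = detect_bom(data)
    body = data[skip:]
    # Decode just enough of the prolog to read the declaration.
    head = body[:200].decode(detected or "latin-1", errors="replace")
    # External (HTTP) information takes precedence over the declaration.
    declared = http_charset or declared_encoding(head)
    if detected and declared and family(detected) != family(declared):
        raise ValueError("detected %s but declared %s" % (detected, declared))
    return body.decode(detected or declared or "utf-8").encode("utf-8")
```

For example, to_utf8() accepts a little-endian UTF-16 feed that declares
encoding="utf-16", but raises ValueError for a feed with a UTF-8 BOM that
declares encoding="iso-8859-1", unless the caller passes
http_charset="utf-8" to override the declaration.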


In theory I could cook up a patch to make Drupal's parser 100%
compliant, but aside from this issue it handles pretty much every feed
out there. I very much doubt anyone would make a UTF-16 encoded feed,
as it would certainly break every other PHP-based parser out there.
Same for external encoding information: it's just not used, as no-one
out there supports it. Heck, even MagpieRSS, the most popular parsing
library, didn't support encodings at all not so long ago.


The only argument in favor is "standards compliance", but I'm reluctant
to write code that will never be executed except by some weird masochist
who wants to break XML parsers.
[1] http://drupal.org/node/18788






