[drupal-devel] [bug] Fix XML UTF-8 bom issue. Parse according to specs or not?

Morbus Iff drupal-devel at drupal.org
Thu Jun 2 11:28:23 UTC 2005

Issue status update for http://drupal.org/node/24141

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  Steven
 Updated by:   Morbus Iff
 Status:       patch

Patch looks good to me. I had to do this once for AmphetaDesk many many
moons ago, as versions of expat prior to 1.95.2 had the same problem (in
that case, they'd cause the script to segfault, not just fail to parse).
The regexp I ended up using is identical to the one of this patch. +1.

Morbus Iff

Previous comments:

June 2, 2005 - 01:46 : Steven

Attachment: http://drupal.org/files/issues/bom.patch (1.26 KB)

As reported in this forum topic [1] (near the bottom), PHP5 cannot parse
UTF-8 encoded XML feeds that start with the so-called "byte order mark"
which most Microsoft apps fondly prefix UTF-8 encoded files with as a

E.g. http://msdn.microsoft.com/rss.xml

The XML specs allow for BOMs in UTF-/x/ encoded feeds. PHP4's parser
is smart enough to strip it away, while PHP5 reports an "Empty
document" error. The attached patch explicitly strips the BOM if
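
For illustration only, here is a minimal sketch of the BOM-stripping idea.
It is not the attached patch (which is PHP and uses a regexp); the function
name strip_utf8_bom is hypothetical, and Python is used just to keep the
example short and self-contained.

```python
# The UTF-8 byte order mark is the three bytes EF BB BF.
UTF8_BOM = b"\xef\xbb\xbf"


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper: the idea is to run this on the raw feed bytes
    before handing them to the XML parser.
    """
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data
```

Only a BOM at the very start of the data is removed; the same three bytes
appearing later in the document are left alone.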

Note that even after this patch, Drupal still doesn't parse XML 100%
according to spec... most notably, the following situations will fail:
- XML requires that any parser handle UTF-16 encoded XML... PHP doesn't
support this, so we would need to check for the UTF-16 BOMs (little and
big endian), strip them out, then convert to UTF-8.
- XML says that external encoding information (like the HTTP
Content-Type: text/xml; charset=utf-8) takes precedence over the
encoding="" stuff inside the document. We currently don't check the
HTTP headers at all in aggregator.module or allow the passing of
external encoding info to drupal_xml_parser_create().
- XML says that if the detected and declared encoding are not equal, an
error should be thrown.
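
To make the first item above concrete, here is a rough sketch of what
"check for the UTF-16 BOMs, strip them out, then convert to UTF-8" could
look like. It is an assumption-laden illustration, not proposed Drupal
code: the function name to_utf8 is hypothetical, Python stands in for PHP,
and the sketch neither rewrites the encoding="" declaration inside the
document nor handles UTF-32 (whose little-endian BOM begins with the same
two bytes as the UTF-16 one).

```python
import codecs


def to_utf8(data: bytes) -> bytes:
    """Normalize BOM-prefixed UTF-8 or UTF-16 input to BOM-less UTF-8.

    Hypothetical helper. Input without a recognized BOM is returned
    unchanged.
    """
    # UTF-8 with a BOM: just drop the three marker bytes.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]

    # UTF-16, little or big endian: drop the BOM, decode, re-encode.
    for bom, encoding in (
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ):
        if data.startswith(bom):
            return data[len(bom):].decode(encoding).encode("utf-8")

    return data
```

Malformed UTF-16 input makes decode() raise UnicodeDecodeError, so a real
caller would have to decide whether to report that as a parse failure.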

In theory I could cook up a patch to make Drupal's parser 100%
compliant, but aside from this issue it handles pretty much every feed
out there. I very much doubt anyone would make a UTF-16 encoded feed,
as it would certainly break every other PHP-based parser out there.
Same for external encoding information: it's just not used, as no-one
out there supports it. Heck, even MagpieRSS, the most popular parsing
library, didn't support encodings at all not so long ago.

The only argument pro is "standards compliance", but I'm reluctant to
write code that will not be executed except by some weird masochist who
wants to break XML parsers.

[1] http://drupal.org/node/18788


June 2, 2005 - 02:29 : Steven

Better title.


June 2, 2005 - 02:30 : Steven

And for those who feel the need to induce their brain to seep out of
their ears and run off to a dark and safe place, here's the relevant
portion of the XML specs:
