[drupal-devel] [bug] Make Drupal parse XML 100% according to specs
Steven
drupal-devel at drupal.org
Thu Jun 2 07:26:49 UTC 2005
Issue status update for http://drupal.org/node/24141
Project: Drupal
Version: cvs
Component: base system
Category: bug reports
Priority: normal
Assigned to: Anonymous
Reported by: Steven
Updated by: Steven
Status: patch
Attachment: http://drupal.org/files/issues/bom.patch (1.26 KB)
As reported in this forum topic [1] (near the bottom), PHP5 cannot parse
UTF-8 encoded XML feeds that start with the so-called "byte order mark",
which most Microsoft apps fondly prepend to UTF-8 encoded files as a
signature.
E.g. http://msdn.microsoft.com/rss.xml
The XML specs allow for BOMs in UTF-/x/ encoded feeds. PHP4's parser
is smart enough to strip the BOM away, while PHP5 reports an "Empty
document" error. The attached patch explicitly strips the BOM if
present.
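As a rough illustration of the idea (not the attached patch itself, which is PHP), stripping a leading UTF-8 BOM before handing the data to a parser could look like the Python sketch below. The function name strip_utf8_bom is hypothetical and only used for this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A feed as a Microsoft tool might save it: BOM first, then the XML declaration.
feed = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?><rss/>'
clean = strip_utf8_bom(feed)
```

Data without a BOM passes through unchanged, so the check is safe to apply to every feed.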
Note that even after this patch, Drupal still doesn't parse XML 100%
according to spec... most notably, the following situations will fail:
- XML requires that any parser handle UTF-16 encoded XML... PHP doesn't
support this, so we would need to check for the UTF-16 BOMs (little and
big endian), strip them out, then convert to UTF-8.
- XML says that external encoding information (like the HTTP
Content-Type: text/xml; charset=utf-8) takes precedence over the
encoding="" stuff inside the document. We currently don't check the
HTTP headers at all in aggregator.module or allow the passing of
external encoding info to drupal_xml_parser_create().
- XML says that if the detected and declared encodings do not match, an
error should be thrown.
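To make the three gaps above concrete, here is a minimal Python sketch of what fuller handling might involve: detect a UTF-8 or UTF-16 BOM, let an externally supplied charset (e.g. from the HTTP Content-Type header) take precedence, raise an error when the BOM and the encoding="" declaration disagree, and hand UTF-8 to the parser. All names here (to_utf8, EncodingMismatch, http_charset) are hypothetical, UTF-32 is deliberately ignored, and this is an assumption about how one could approach it, not how Drupal does or should.

```python
import codecs
import re


class EncodingMismatch(ValueError):
    """Raised when the BOM and the XML declaration name different encodings."""


# Only UTF-8 and UTF-16 BOMs are handled; UTF-32 is out of scope for this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)
_DECL = re.compile(r'^(<\?xml[^>]*?encoding=["\'])([A-Za-z][A-Za-z0-9._-]*)(["\'])')


def _family(name):
    """Lower-case an encoding name and drop any byte-order suffix."""
    name = name.lower()
    for suffix in ("-le", "-be"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


def to_utf8(data, http_charset=None):
    """Return the feed re-encoded as UTF-8, with its declaration updated."""
    detected, skip = None, 0
    for bom, name in _BOMS:
        if data.startswith(bom):
            detected, skip = name, len(bom)
            break
    body = data[skip:]

    # Read the declaration from the first bytes; latin-1 never fails to decode.
    head = body[:200].decode(detected or "latin-1", errors="ignore")
    match = _DECL.match(head)
    declared = match.group(2).lower() if match else None

    if detected and declared and _family(detected) != _family(declared):
        raise EncodingMismatch(f"BOM says {detected}, declaration says {declared}")

    if http_charset:
        # External information wins over the in-document declaration.
        chosen = http_charset.lower()
        if detected and _family(chosen) == _family(detected):
            chosen = detected  # the BOM tells us the byte order
    else:
        chosen = detected or declared or "utf-8"

    text = body.decode(chosen)
    # The output is UTF-8 now, so the declaration has to say so as well.
    text = _DECL.sub(r"\g<1>utf-8\g<3>", text, count=1)
    return text.encode("utf-8")


utf16_feed = codecs.BOM_UTF16_LE + '<?xml version="1.0" encoding="utf-16"?><rss/>'.encode("utf-16-le")
converted = to_utf8(utf16_feed)
```

Note that the sketch rewrites the encoding declaration after converting; otherwise the parser would see UTF-8 bytes labelled as UTF-16.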
In theory I could cook up a patch to make Drupal's parser 100%
compliant, but aside from this issue it handles pretty much every feed
out there. I very much doubt anyone would make a UTF-16 encoded feed,
as it would certainly break every other PHP-based parser out there.
Same for external encoding information: it's just not used, as no-one
out there supports it. Heck, even MagpieRSS, the most popular parsing
library, didn't support encodings at all not so long ago.
The only argument in favour is "standards compliance", but I'm reluctant
to write code that will not be executed except by some weird masochist
who wants to break XML parsers.
[1] http://drupal.org/node/18788
Steven