Issue status update for http://drupal.org/node/24141

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  Steven
 Updated by:   Steven
 Status:       patch
 Attachment:   http://drupal.org/files/issues/bom.patch (1.26 KB)

As reported in this forum topic [1] (near the bottom), PHP5 cannot parse UTF-8 encoded XML feeds that start with the so-called "byte order mark" (BOM), which most Microsoft apps fondly prefix to UTF-8 encoded files as a signature. E.g. http://msdn.microsoft.com/rss.xml

The XML spec allows BOMs in UTF-/x/ encoded feeds. PHP4's parser is smart enough to strip the BOM away, while PHP5 reports an "Empty document" error.

The attached patch explicitly strips the BOM if present.

Note that even after this patch, Drupal still doesn't parse XML 100% according to spec... most notably, the following situations will fail:

- XML requires that any parser handle UTF-16 encoded XML... PHP doesn't support this, so we would need to check for the UTF-16 BOMs (little and big endian), strip them out, then convert to UTF-8.
- XML says that external encoding information (like the HTTP Content-Type: text/xml; charset=utf-8) takes precedence over the encoding="" declaration inside the document. We currently don't check the HTTP headers at all in aggregator.module, nor do we allow external encoding info to be passed to drupal_xml_parser_create().
- XML says that if the detected and declared encodings are not equal, an error should be thrown.

In theory I could cook up a patch to make Drupal's parser 100% compliant, but aside from this issue it handles pretty much every feed out there. I very much doubt anyone would make a UTF-16 encoded feed, as it would certainly break every other PHP-based parser out there. Same for external encoding information: it's just not used, as no-one out there supports it. Heck, even MagpieRSS, the most popular parsing library, didn't support encodings at all not so long ago.
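
For illustration, the BOM handling described above (strip a UTF-8 BOM, and for the UTF-16 case detect the BOM, strip it and convert to UTF-8) can be sketched roughly as follows. This is a Python sketch, not the contents of bom.patch, which is PHP and is not reproduced here. The names strip_bom and _BOMS are hypothetical. It assumes the whole feed is already in memory as bytes, ignores UTF-32, and does not rewrite a stale encoding="" declaration after conversion.

```python
import codecs
from typing import Optional, Tuple

# BOM signatures this sketch recognises (hypothetical table name).
# UTF-32 is deliberately ignored: its little-endian BOM begins with the
# same two bytes as the UTF-16 LE BOM, so it would be misdetected here.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def strip_bom(data: bytes) -> Tuple[bytes, Optional[str]]:
    """Return (UTF-8 bytes without a BOM, encoding implied by the BOM or None).

    Input without a recognised BOM is returned unchanged.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            body = data[len(bom):]
            if encoding != "utf-8":
                # UTF-16 input: decode it, then re-encode as UTF-8 so a
                # UTF-8-only parser can handle it.
                body = body.decode(encoding).encode("utf-8")
            return body, encoding
    return data, None
```

The second return value is the encoding implied by the BOM. A stricter parser could compare it against the declared encoding and raise an error on a mismatch, as the spec asks.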

The only argument in favor is "standards compliance", but I'm reluctant to write code that will never be executed except by some weird masochist who wants to break XML parsers.

[1] http://drupal.org/node/18788

Steven