[development] PHP 5 > aggregator.module rewrite to XML API?

Larry Garfield larry at garfieldtech.com
Tue Jun 19 18:15:24 UTC 2007


Disclaimer: I am not an RSS guru, just a pedant. :-)

RSS is XML.  The XML spec explicitly says that a file that is not well-formed must be rejected, not guessed at the way HTML is.  Trying to make sense of a broken RSS feed is explicitly contrary to the spec.  So, er, why are we spending so much time trying to sanitize?  If it doesn't parse correctly, report an error: "this site's RSS feed is f*ed up, tell 'em to fix it".  Am I missing something here?
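
To make the "reject it and report an error" idea concrete, here is a minimal sketch using PHP 5's SimpleXML and libxml error handling. The function name feed_wellformedness_error() and the sample feeds are made up for illustration; nothing like this exists in aggregator.module as far as I know, and a real implementation would need more than a well-formedness check.

```php
<?php
// Hypothetical helper (not part of Drupal or aggregator.module):
// returns NULL if $xml is well-formed XML, otherwise an error message.
function feed_wellformedness_error($xml) {
  // Collect libxml errors ourselves instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);
  $doc = simplexml_load_string($xml);
  $errors = libxml_get_errors();
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  if ($doc !== FALSE) {
    return NULL;
  }
  if ($errors) {
    // Report only the first parse error; that is usually the useful one.
    return sprintf('Feed is not well-formed XML (line %d): %s',
      $errors[0]->line, trim($errors[0]->message));
  }
  return 'Feed is not well-formed XML.';
}

// Illustrative inputs: the second one never closes <title>.
$good = '<rss version="2.0"><channel><title>ok</title></channel></rss>';
$bad  = '<rss version="2.0"><channel><title>broken</channel></rss>';

var_dump(feed_wellformedness_error($good) === NULL);
echo feed_wellformedness_error($bad), "\n";
```

The point of the sketch is that the parser already does the rejecting for us; all the module would have to do is pass the message on to the administrator rather than try to repair the feed.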

--Larry Garfield

On Tue, 19 Jun 2007 20:12:45 +0300, "Ashraf Amayreh" <mistknight at gmail.com> wrote:
> I'm not really sure about the argument to sanitize data. Can't we sanitize
> it in a little less than 11 seconds? Also, isn't there a possibility the
> user wants this HTML code to come in as HTML code rather than plain text?
> 
> I would guess that my module does lack many sanity checks, but at the same
> time, I do assume that administrators should be responsible for what feeds
> they add to their sites.
> 
> By the way, any sanity gurus who would like to check on my module's sanity
> checks and help me with additional sanity checks are very welcome and have
> my full gratitude. Just drop me a line off-list.
> 
> On 6/19/07, Morbus Iff <morbus at disobey.com> wrote:
>>
>> > Unfortunately, we can't take these statistics as canon:
>> >
>> >   * there are no instructions on how to reproduce them.
>> >
>> >   * the SimplePie result is an estimate ("At SimplePie I have to
>> >     do an estimate, because the feed download time was accumulated
>> >     to the measure.")
>> >
>> >   * it is unknown whether the other feed parsers are doing the
>> >     same sanitization that SimplePie does, which, again, adds
>> >     more time to the results.
>>
>> I have done some quick tests, using the same URL as Aron:
>>
>>   http://www.christiannewswire.com/rss/catfeed_2.xml
>>
>> I downloaded this file to my desktop. I will be passing this string into
>> SimplePie instead of allowing SimplePie to download it. The file is 1M:
>>
>>   1027320 Jun 19 11:50 catfeed_2.xml
>>
>> This is the script I used with SimplePie 1.0 b3.2 (20061124):
>>
>>    <?php
>>      // Read the saved feed from disk so download time is not measured.
>>      $contents = file_get_contents('./catfeed_2.xml');
>>
>>      require './simplepie.inc';
>>      $feed = new SimplePie();
>>      // Hand SimplePie the raw XML string instead of a URL.
>>      $feed->set_raw_data($contents);
>>      $feed->init();
>>      $parsed = $feed->get_items();
>>    ?>
>>
>> With this command line:
>>
>>    ~/Desktop > date && php simplepie.php && date
>>    Tue Jun 19 12:26:10 EDT 2007
>>    Tue Jun 19 12:26:22 EDT 2007
>>
>> As you can see, this does confirm the 10 to 12 second parse time -- and
>> that is with all the sanitization that SimplePie does by default.
>> However, SimpleFeed and FeedParser both ship with the latest development
>> version of SimplePie, which includes an option to turn this sanitization off:
>>
>>    $feed->set_stupidly_fast(TRUE);
>>
>> I grabbed today's development version, added the above
>> line before the ->init() in the above script, and reran:
>>
>>    ~/Desktop > date && php simplepie.php && date
>>    Tue Jun 19 12:28:54 EDT 2007
>>    Tue Jun 19 12:28:55 EDT 2007
>>
>> You'll notice that it is only 1 second, which removes any worry in my
>> mind that SimplePie is a bad thing comparatively (since one would assume
>> you'd sanitize the data as necessary within Drupal).
>>
>> --
>> Morbus Iff ( and think about the bad things that I didn't do )
>> Technical: http://www.oreillynet.com/pub/au/779
>> Culture: http://www.disobey.com/ and http://www.gamegrene.com/
>> aim: akaMorbus / skype: morbusiff / icq: 2927491 / jabber.org: morbus
>>
> 
> 


