[development] PHP 5 > aggregator.module rewrite to XML API?

Alexander Barth alex at developmentseed.org
Tue Jun 19 19:07:34 UTC 2007


Tuesday, June 19, 2007, 2:15:24 PM, you wrote:

> Disclaimer: I am not an RSS guru, just a pedant. :-)

> RSS is XML.  The XML spec explicitly says that invalid files should
> be discarded, not guessed at the way HTML is.  Trying to make sense
> of a broken RSS feed is explicitly contrary to the spec.  So, er,
> why are we spending so much time trying to sanitize?  If it doesn't
> parse correctly, report an error "this site's RSS feed is f*ed up,
> tell 'em to fix it".  Am I missing something here?

True. That's one side of the coin. The other side is a world of
non-compliant feeds that all turn up on your issue queue to haunt you
if your parser complains about them :)
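
For what it's worth, the strict side of the coin is easy to sketch with
PHP 5's SimpleXML and libxml functions. This is only a rough illustration,
not code from aggregator.module or any existing parser:
strict_parse_feed() is a made-up name, and the two feed strings are
invented for the example. It rejects anything that is not well-formed and
hands back libxml's messages so they could be shown to the admin:

```php
<?php
// Hypothetical helper (not part of aggregator.module): parse a feed
// strictly and return the parsed document plus any libxml error messages.
function strict_parse_feed($xml) {
  // Collect libxml errors internally instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);
  libxml_clear_errors();

  // Returns FALSE when the document is not well-formed.
  $doc = simplexml_load_string($xml);

  $errors = array();
  foreach (libxml_get_errors() as $error) {
    $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
  }

  // Restore the previous error-handling mode.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return array('doc' => ($doc === FALSE ? NULL : $doc), 'errors' => $errors);
}

// Invented sample feeds: one well-formed, one with an unescaped ampersand.
$good = strict_parse_feed('<rss version="2.0"><channel><title>ok</title></channel></rss>');
$bad = strict_parse_feed('<rss version="2.0"><channel><title>AT&T news</title></channel></rss>');
```

Here $good parses and $bad is rejected because of the bare ampersand in
the title, which is exactly the kind of feed that ends up in the issue
queue.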

Alex

> --Larry Garfield

> On Tue, 19 Jun 2007 20:12:45 +0300, "Ashraf Amayreh" <mistknight at gmail.com> wrote:
>> I'm not really sure about the argument to sanitize data. Can't we sanitize
>> it in a little less than 11 seconds? Also, isn't there a possibility the
>> user wants this HTML code to come in as HTML code rather than plain text?
>> 
>> I would guess that my module does lack many sanity checks, but at the same
>> time, I do assume that administrators should be responsible as to what feeds
>> they add to their sites.
>> 
>> By the way, any sanity gurus who would like to check on my module's sanity
>> checks and help me with additional sanity checks are very welcome and have
>> my full gratitude. Just drop me a line off-list.
>> 
>> On 6/19/07, Morbus Iff <morbus at disobey.com> wrote:
>>>
>>> > Unfortunately, we can't take these statistics as canon:
>>> >
>>> >   * there are no instructions on how to duplicate.
>>> >
>>> >   * the SimplePie result is an estimate ("At SimplePie I have to
>>> >     do an estimate, because the feed download time was accumulated
>>> >     to the measure.")
>>> >
>>> >   * it is unknown whether the other feed parsers are doing the
>>> >     same sanitization that SimplePie does, again, which adds
>>> >     more time to the results.
>>>
>>> I have done some quick tests, using the same URL as Aron:
>>>
>>>   http://www.christiannewswire.com/rss/catfeed_2.xml
>>>
>>> I downloaded this file to my desktop. I will be passing this string into
>>> SimplePie instead of allowing SimplePie to download it. The file is 1M:
>>>
>>>   1027320 Jun 19 11:50 catfeed_2.xml
>>>
>>> This is the script I used with SimplePie 1.0 b3.2 (20061124):
>>>
>>>    <?php
>>>      // Read the saved feed into a string so download time is excluded.
>>>      $handle = fopen('./catfeed_2.xml', "r");
>>>      $contents = fread($handle, filesize('./catfeed_2.xml'));
>>>
>>>      // Hand the raw string to SimplePie and parse it.
>>>      require './simplepie.inc';
>>>      $feed = new SimplePie();
>>>      $feed->set_raw_data($contents);
>>>      $feed->init();
>>>      $parsed = $feed->get_items();
>>>    ?>
>>>
>>> With this command line:
>>>
>>>    ~/Desktop > date && php simplepie.php && date
>>>    Tue Jun 19 12:26:10 EDT 2007
>>>    Tue Jun 19 12:26:22 EDT 2007
>>>
>>> As you can see, this does confirm the 10 or 12 second parse time -- it
>>> is also using all the sanitization that SimplePie does by default.
>>> However, SimpleFeed and FeedParser both ship with the latest development
>>> version of SimplePie which includes an option to stop this sanitization:
>>>
>>>    $feed->set_stupidly_fast(TRUE);
>>>
>>> I grabbed today's development version, added the above
>>> line before the ->init() in the above script, and reran:
>>>
>>>    ~/Desktop > date && php simplepie.php && date
>>>    Tue Jun 19 12:28:54 EDT 2007
>>>    Tue Jun 19 12:28:55 EDT 2007
>>>
>>> You'll notice that it is only 1 second, which removes any worry in my
>>> mind that SimplePie is a bad thing comparatively (since one would assume
>>> you'd sanitize the data as necessary within Drupal).
>>>
>>> --
>>> Morbus Iff ( and think about the bad things that I didn't do )
>>> Technical: http://www.oreillynet.com/pub/au/779
>>> Culture: http://www.disobey.com/ and http://www.gamegrene.com/
>>> aim: akaMorbus / skype: morbusiff / icq: 2927491 / jabber.org: morbus
>>>
>> 

-- 
Alexander Barth
Development Seed
http://www.developmentseed.org
http://www.developmentseed.org/blog
lx_barth(skype)
alex_b(drupal.org)
Tel. 202.250.3633 
Fax. 806.214.6218 


