[development] PHP 5 > aggregator.module rewrite to XML API?
Alexander Barth
alex at developmentseed.org
Tue Jun 19 19:07:34 UTC 2007
Tuesday, June 19, 2007, 2:15:24 PM, you wrote:
> Disclaimer: I am not an RSS guru, just a pedant. :-)
> RSS is XML. The XML spec explicitly says that invalid files should
> be discarded, not guessed at the way HTML is. Trying to make sense
> of a broken RSS feed is explicitly contrary to the spec. So, er,
> why are we spending so much time trying to sanitize? If it doesn't
> parse correctly, report an error "this site's RSS feed is f*ed up,
> tell 'em to fix it". Am I missing something here?
True. That's one side of the coin. The other side is a world of
non-compliant feeds that all turn up in your issue queue to haunt you
if your parser complains about them :)
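
For reference, the strict "reject it and report an error" side of the coin
could look roughly like the sketch below. It assumes PHP 5.1+ with the
bundled libxml and SimpleXML extensions enabled; parse_feed_strict() is a
made-up name for illustration, not something in aggregator.module or
SimplePie, and the broken feed string is invented sample input.

```php
<?php
// Hypothetical helper (not part of any existing module): parse a feed
// strictly. Returns array($doc, $errors), where $doc is NULL if the XML is
// not well-formed and $errors lists what libxml complained about.
function parse_feed_strict($xml) {
  // Collect libxml errors ourselves instead of letting PHP emit warnings.
  $previous = libxml_use_internal_errors(TRUE);
  libxml_clear_errors();

  $doc = simplexml_load_string($xml);

  $errors = array();
  foreach (libxml_get_errors() as $error) {
    $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
  }

  // Restore the previous error handling mode.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return array($doc === FALSE ? NULL : $doc, $errors);
}

// Invented sample input: <title> is never closed, so this is not XML.
list($doc, $errors) = parse_feed_strict('<rss><channel><title>Broken</channel></rss>');
if ($doc === NULL) {
  print "Feed rejected:\n" . implode("\n", $errors) . "\n";
}
```

The lenient side would be everything you do instead of printing that
message, which is exactly where the sanitizing time goes.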
Alex
> --Larry Garfield
> On Tue, 19 Jun 2007 20:12:45 +0300, "Ashraf Amayreh" <mistknight at gmail.com> wrote:
>> I'm not really sure about the argument to sanitize data. Can't we sanitize
>> it in a little less than 11 seconds? Also, isn't there a possibility the
>> user wants this HTML code to come in as HTML code rather than plain text?
>>
>> I would guess that my module does lack many sanity checks, but at the same
>> time, I do assume that administrators should be responsible as to what
>> feeds they add to their sites.
>>
>> By the way, any sanity gurus who would like to check on my module's sanity
>> checks and help me with additional sanity checks are very welcome and have
>> my full gratitude. Just drop me a line off-list.
>>
>> On 6/19/07, Morbus Iff <morbus at disobey.com> wrote:
>>>
>>> > Unfortunately, we can't take these statistics as canon:
>>> >
>>> > * there are no instructions on how to reproduce them.
>>> >
>>> > * the SimplePie result is an estimate ("At SimplePie I have to
>>> > do an estimate, because the feed download time was accumulated
>>> > to the measure.")
>>> >
>>> > * it is unknown whether the other feed parsers are doing the
>>> > same sanitization that SimplePie does, which, again, adds
>>> > more time to the results.
>>>
>>> I have done some quick tests, using the same URL as Aron:
>>>
>>> http://www.christiannewswire.com/rss/catfeed_2.xml
>>>
>>> I downloaded this file to my desktop. I will be passing this string into
>>> SimplePie instead of allowing SimplePie to download it. The file is 1M:
>>>
>>> 1027320 Jun 19 11:50 catfeed_2.xml
>>>
>>> This is the script I used with SimplePie 1.0 b3.2 (20061124):
>>>
>>> <?php
>>> $handle = fopen('./catfeed_2.xml', "r");
>>> $contents = fread($handle, filesize('./catfeed_2.xml'));
>>>
>>> require './simplepie.inc';
>>> $feed = new SimplePie();
>>> $feed->set_raw_data($contents);
>>> $feed->init();
>>> $parsed = $feed->get_items();
>>> ?>
>>>
>>> With this command line:
>>>
>>> ~/Desktop > date && php simplepie.php && date
>>> Tue Jun 19 12:26:10 EDT 2007
>>> Tue Jun 19 12:26:22 EDT 2007
>>>
>>> As you can see, this does confirm the 10 to 12 second parse time, and it
>>> is also using all the sanitization that SimplePie does by default.
>>> However, SimpleFeed and FeedParser both ship with the latest development
>>> version of SimplePie, which includes an option to stop this sanitization:
>>>
>>> $feed->set_stupidly_fast(TRUE);
>>>
>>> I grabbed today's development version, added the above
>>> line before the ->init() in the above script, and reran:
>>>
>>> ~/Desktop > date && php simplepie.php && date
>>> Tue Jun 19 12:28:54 EDT 2007
>>> Tue Jun 19 12:28:55 EDT 2007
>>>
>>> You'll notice that it now takes only 1 second, which dispels any notion
>>> in my mind that SimplePie is a bad thing comparatively (since one would
>>> assume you'd sanitize the data as necessary within Drupal).
>>>
>>> --
>>> Morbus Iff ( and think about the bad things that I didn't do )
>>> Technical: http://www.oreillynet.com/pub/au/779
>>> Culture: http://www.disobey.com/ and http://www.gamegrene.com/
>>> aim: akaMorbus / skype: morbusiff / icq: 2927491 / jabber.org: morbus
>>>
>>
>>
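
A note on the timings quoted above: "date && php ... && date" only resolves
whole seconds and also counts PHP startup, so a "1 second" result could be
anywhere from a fraction of a second to nearly two. A finer-grained
measurement could be sketched as below, assuming PHP 5 (for
microtime(TRUE)). time_call() is a hypothetical helper, and strlen() on a
generated string is only a stand-in workload; in the real benchmark the
callback would wrap the SimplePie set_raw_data()/init()/get_items()
sequence from the quoted script.

```php
<?php
// Hypothetical helper: run a callback and measure its wall-clock time with
// sub-second resolution. Returns array($result, $elapsed_seconds).
function time_call($callback, $args = array()) {
  $start = microtime(TRUE);
  $result = call_user_func_array($callback, $args);
  return array($result, microtime(TRUE) - $start);
}

// Stand-in workload, NOT the real parser: measure strlen() on a 1 MB string.
list($length, $seconds) = time_call('strlen', array(str_repeat('x', 1000000)));
printf("processed %d bytes in %.3f seconds\n", $length, $seconds);
```

Timing inside the script like this leaves out interpreter startup, so the
sanitizing and non-sanitizing runs can be compared on the parse step alone.
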
--
Alexander Barth
Development Seed
http://www.developmentseed.org
http://www.developmentseed.org/blog
lx_barth(skype)
alex_b(drupal.org)
Tel. 202.250.3633
Fax. 806.214.6218