[development] PHP 5 > aggregator.module rewrite to XML API?

Sean Robertson seanr at ngpsoftware.com
Wed Jun 20 17:10:52 UTC 2007


Curl is frequently not available.  I've had to have it installed a few
times, but a LOT of people won't even have the option of installing it.
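
To make the concern concrete, here is a rough sketch of how a module could pick a transport depending on whether the curl extension is loaded. The function name pick_feed_transport() and its return values are hypothetical, made up for illustration; they are not part of Drupal or the proposed aggregator. The only assumption about PHP itself is that function_exists('curl_init') is true when the curl extension is present.

```php
<?php
// Hypothetical helper, not part of Drupal or the aggregator module.
// Returns the name of the transport to use for $url, or NULL if none fits.
// $have_curl can be passed explicitly (handy for testing); by default it
// is detected from the running PHP build.
function pick_feed_transport($url, $have_curl = NULL) {
  if ($have_curl === NULL) {
    $have_curl = function_exists('curl_init');
  }
  $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));

  if ($have_curl) {
    // Assumption: curl is preferred whenever it is installed.
    return 'curl';
  }
  if ($scheme === 'http' || $scheme === 'https') {
    // Fall back to Drupal's built-in HTTP client.
    return 'drupal_http_request';
  }
  // e.g. an ftp:// feed on a host without curl.
  return NULL;
}
```

With something like this, hosts without curl would still get plain HTTP feeds, and only the extras (FTP and the like) would be unavailable there.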



Scott Trudeau wrote:
> Re: curl vs. drupal_http_request
> 
> In addition to handling FTP URLs and HTTP authentication, curl
> can optionally follow redirects and handle things like content
> disposition headers (with custom code), which I've found to be
> important when dealing with enclosures.  Not sure how flexible/capable
> drupal_http_request is on those kinds of issues.
> 
> Scott
> 
> On 6/20/07, Ashraf Amayreh <mistknight at gmail.com> wrote:
>> > Are you saying the header Connection: Close is ignored?
>> >
>> > Any reason why Drupal should not use HTTP/1.1?
>>
>> Examining drupal_http_request, I found that it doesn't actually use
>> HTTP/1.1 in the first place, so I guess there's no problem in using it
>> at all. But I would still like to keep the ability to transparently
>> accept feeds over HTTP or FTP, as well as giving users the option to
>> access authenticated URLs. I don't know how widely used these features
>> are, but I'd hate to remove a feature that could help a user out.
>>
>> Seems we were guilty of assuming what SimplePie did during these 11
>> seconds, although I still think it's going about it the wrong way. 11
>> seconds is suicide. I sanitize the extracted data, rather than the
>> feed string as a whole. That's what I presume SimplePie is doing. I
>> wish I could check it out for myself, but my sleep indicators are
>> overloaded.
>>
>> Morbus' suggestion to pass along the string as a whole sounds logical;
>> I'll see what I can do about that. I had really assumed that
>> aggregation happens from XML only, though, so the module would need a
>> considerable amount of change to accommodate non-XML strings. I'll
>> study the option and see what I can do. Anyone care to give me a patch
>> for my next birthday? :-P
>>
>>
>> AA
>>
>> On 6/20/07, Morbus Iff <morbus at disobey.com> wrote:
>> > > opinion the second is not sanitization and no aggregator needs to
>> > > waste the code and time on trying to handle non-XML or
>> > > non-standards compliant
>> >
>> > It depends entirely on your definition of "aggregator". In your module,
>> > you have only one parser, really - PHP's SimpleXML (or whatever it's
>> > called) that then sends the loaded data structure to the smaller "do
>> > things with it" (i.e., RSS20.inc, etc.) subparsers. However, I'd think
>> > that it'd be far more flexible to send the raw strings around /as well/
>> > - then one could support, for example, non-XML documents (or, in my
>> > particular case, I could write scrapers for sites that don't support
>> > feeds [or feeds that contain useful data]) so that I'd be able to hook
>> > into the generic aggregating process. Aggregation != just XML, IMO.
>> >
>> > I'd love, for example, to be able to add a "feed" that points to (pff,
>> > making crap outta my ass here) some comic site's "latest comic" HTML,
>> > choose a custom-made parser that expects that HTML, and return the same
>> > data structure that the aggregation API expects as legit. This /is/
>> > aggregation - pulling disparate sources together.
>> >
>> > > I would be very surprised if I found that SimplePie is wasting 11
>> > > seconds out of 12 in preventing XSS or SQL injection attacks
>> > > alone. But hey, what do I know about SimplePie. Does anyone know
>> > > what SimplePie actually does within these 11 seconds?
>> >
>> > SimplePie's set_stupidly_fast is a wrapper around:
>> >
>> >    $this->enable_order_by_date(false);
>> >    $this->remove_div(false);
>> >    $this->strip_comments(false);
>> >    $this->strip_htmltags(false);
>> >    $this->strip_attributes(false);
>> >    $this->set_image_handler(false);
>> >
>> > None of those are "fix broken XML". I reran the initial test like so:
>> >
>> >    $feed->set_stupidly_fast(TRUE);
>> >    $feed->enable_order_by_date(TRUE);
>> >
>> > i.e. first shutting everything off, then enabling one command:
>> >
>> >    $feed->enable_order_by_date(TRUE);      2 seconds
>> >    $feed->remove_div(TRUE);                1 second
>> >    $feed->strip_comments(TRUE);            2 seconds
>> >    $feed->strip_htmltags(TRUE);            2 seconds
>> >    $feed->strip_attributes(TRUE);          2 seconds
>> >    $feed->set_image_handler(TRUE);         1 second
>> >
>> > --
>> > Morbus Iff ( if god is my witness, god must be blind )
>> > Technical: http://www.oreillynet.com/pub/au/779
>> > Culture: http://www.disobey.com/ and http://www.gamegrene.com/
>> > aim: akaMorbus / skype: morbusiff / icq: 2927491 / jabber.org: morbus
>> >
>>
>>

-- 
Sean Robertson
Web Developer
NGP Software, Inc.
seanr at ngpsoftware.com
(202) 686-9330
http://www.ngpsoftware.com


