cURL and drupal_http_request do not properly download certain Google News feeds
After getting a report that

http://news.google.com/news?pz=1&hl=ar&q=سوريا&cf=all&output=rss

is not properly downloading with Feeds module, I dug deep and discovered that cURL and drupal_http_request() return an RSS feed with no items, while wget and PHP stream_get_contents() do return a full RSS feed with a number of items.

Details here: http://drupal.org/node/689552

I am unsure what is actually causing this peculiar behavior and I would appreciate people's input. The issue affects not just Feeds but any other Drupal module that downloads and processes Google News RSS feeds - including core aggregator.

- This seems to be an issue where Google News decides, based on some request parameters, what content to return and what not - or am I missing something?
- The user agent is the same in cases where the issue occurs and where it doesn't, I am using the same machine for all tests - what else could Google use to distinguish my requests?
- Any tips on an 'HTTP monitor' I could be using to actually monitor outgoing HTTP requests from my local machine?

Alex Barth
http://www.developmentseed.org/blog
tel (202) 250-3633
For an HTTP monitor, I use the Net tab in Firebug or the Tamper Data Firefox extension.

On Tue, Jan 19, 2010 at 2:15 PM, Alex Barth <alex@developmentseed.org> wrote:
- Any tips on an 'HTTP monitor' I could be using to actually monitor outgoing HTTP requests from my local machine?
On Tue, Jan 19, 2010 at 2:15 PM, Alex Barth <alex@developmentseed.org> wrote:
After getting a report that
http://news.google.com/news?pz=1&hl=ar&q=سوريا&cf=all&output=rss

Could this be a UTF-8 issue? The q= parameter has "Syria" (in Arabic) in it. Is that stripped out somewhere in some layer in Drupal?

--
Khalid M. Baheyeldin
2bits.com, Inc.
http://2bits.com
Drupal optimization, development, customization and consulting.
Simplicity is prerequisite for reliability. -- Edsger W. Dijkstra
Simplicity is the ultimate sophistication. -- Leonardo da Vinci
On Jan 19, 2010, at 4:04 PM, Khalid Baheyeldin wrote:
Could be a UTF-8 issue? The q= has "Syria" (in Arabic) in it. Is that stripped out somewhere in some layer in Drupal?
bangpound pointed that out in the issue queue, too. Indeed, URL-encoding the Arabic string fixes the behavior I described; my guess that Google News might require special request parameters was simply not on the right track. What I am not clear about now is whether wget and PHP streams do better URL sanitization before making the request, or whether non-ASCII characters are allowed in an HTTP URL and cURL just doesn't support them.
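
For illustration, here is a minimal sketch of the fix in Python, used only because it is easy to run standalone. In PHP, rawurlencode() applied to the q= value would play the same role. The helper name encode_query is hypothetical, not part of Feeds or Drupal. It percent-encodes the query string as UTF-8 so that the URL handed to the HTTP client contains only ASCII characters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def encode_query(url: str) -> str:
    """Return url with its query string percent-encoded as UTF-8.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Split the query into (name, value) pairs; already-encoded values are
    # decoded here, so running the helper twice does not double-encode.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # urlencode() percent-encodes non-ASCII characters using UTF-8 by default.
    query = urlencode(pairs)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, query, parts.fragment)
    )


feed_url = "http://news.google.com/news?pz=1&hl=ar&q=سوريا&cf=all&output=rss"
encoded = encode_query(feed_url)
print(encoded)
# http://news.google.com/news?pz=1&hl=ar&q=%D8%B3%D9%88%D8%B1%D9%8A%D8%A7&cf=all&output=rss
```

The encoded form is what should be passed to cURL or drupal_http_request(). Whether a given client encodes such characters itself is not something this sketch tries to settle.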
Alex Barth http://www.developmentseed.org/blog tel (202) 250-3633
participants (3)
- Alex Barth
- Khalid Baheyeldin
- Moshe Weitzman