[development] Drupal module for scraping information from an HTML/XML document

Tue Nov 30 18:56:09 UTC 2010

I've finally got round to doing some serious work on Drupalversity, an open,
web-based Drupal education project I've had in mind for a year or so.

People who use Drupalversity to learn can add Resources to the site, i.e.,
links to posts at Lullabot, Chapter3, etc. that explain how to do specific
things with Drupal. A Resource is a custom content type with a link field
pointing to the resource and a text field describing it.

Once a Resource has been added to the site, I'd like to scrape certain
information from the page it links to: at this point I'm thinking of the
page's title and the provider of the resource, e.g., which Drupal shop
originally created it. What's the best way to go about this? I'm fairly sure
there isn't a Drupal module that solves the problem out of the box.

So far I've considered:

   - QueryPath - http://drupal.org/project/querypath
   - Drupal's built-in drupal_http_request() -
     http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
   - curl
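
To make the goal concrete, here is a rough sketch of the two pieces of
information I'm after, using only PHP's bundled DOM extension. The function
names are hypothetical (nothing here is an existing Drupal API), and treating
the link's host name as the "provider" is just my working assumption:

```php
<?php
// Hypothetical helper names; these are not part of Drupal or any module.

/**
 * Returns the text of the first <title> element in $html, or NULL if none.
 */
function drupalversity_extract_title($html) {
  // DOMDocument::loadHTML() rejects empty input, so bail out early.
  if (trim($html) === '') {
    return NULL;
  }
  // Real-world markup is rarely valid, so collect parse warnings quietly.
  $previous = libxml_use_internal_errors(TRUE);
  $doc = new DOMDocument();
  $doc->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  $titles = $doc->getElementsByTagName('title');
  if ($titles->length === 0) {
    return NULL;
  }
  // Collapse runs of whitespace, including line breaks inside the tag.
  return trim(preg_replace('/\s+/', ' ', $titles->item(0)->textContent));
}

/**
 * Guesses the provider from the link's host name, e.g. "lullabot.com".
 * Assumption: the host is a good enough stand-in for the Drupal shop.
 */
function drupalversity_provider_from_url($url) {
  $host = parse_url($url, PHP_URL_HOST);
  if (!is_string($host) || $host === '') {
    return NULL;
  }
  // Lowercase the host and drop a leading "www." so variants compare equal.
  return preg_replace('/^www\./', '', strtolower($host));
}
```

In a module, the HTML would presumably come from the data property of the
object that drupal_http_request() returns, and the results would be saved to
fields on the Resource node. Character encoding is not handled here.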

Thanks,

--Jim
--
My IM and Skype details are at http://state68.com/contact