Re: [development] Drupal module for scraping information from an HTML/XML document

30 Nov 2010

I second the recommendation of QueryPath. I use it almost exclusively,
together with drupal_http_request; I only use curl in a few places (if you
do use curl, I recommend http://drupal.org/project/curl as a dependency
check). What I'd really recommend, though, is creating a custom module that
uses the above and holds your own filtering logic. I've done this for about
a dozen modules now.
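
To make the "your own filtering logic" part concrete, here is a rough sketch
of the two values Jim asked about: the page title and the provider. It is
written in plain Python with only the standard library so it runs anywhere.
It is not Drupal code, and it is not how any of my modules are written. In a
real module the HTML would come from drupal_http_request() and the parsing
would be done with QueryPath. The names TitleParser, scrape_title and
guess_provider are made up for this example, and treating the link's host
name as the "provider" is purely my assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. inside SVG) are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and newlines into single spaces.
        text = " ".join("".join(self._parts).split())
        return text or None


def scrape_title(html):
    """Return the page title from an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def guess_provider(url):
    """Assumption: the provider is the link's host name without 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host or None
```

Fetching the page is left out on purpose so the sketch stays free of network
calls. You would run something like this when a Resource node is saved and
store the two results in fields on that node.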

That said, there are more modules available nowadays, such as
http://drupal.org/project/feeds_xpathparser used together with Feeds
(http://drupal.org/project/feeds). There are about a dozen more modules that
would accomplish the goal. I haven't relied on them myself, though I did go
through and try most of these methods for some recent projects.

Cheers,

Kevin O'Brien
Drupal Developer
http://www.coderintherye.com
415-754-0112

On Tue, Nov 30, 2010 at 11:26 AM, <development-request@drupal.org> wrote:
...
----------------------------------------------------------------------
Message: 1
Date: Tue, 30 Nov 2010 18:56:09 +0000
From: James Benstead <james.benstead@gmail.com>
Subject: [development] Drupal module for scraping information from an
       HTML/XML document
To: development <development@drupal.org>
Message-ID:
       <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV@mail.gmail.com>
...
Content-Type: text/plain; charset="iso-8859-1"
I've finally got round to doing some serious work on Drupalversity, an open,
web-based Drupal education project I've had in mind for a year or so.

People who use Drupalversity to learn have the option of adding Resources to
the site - i.e., links to posts at Lullabot, Chapter3 etc. that explain how
to do specific things with Drupal. A Resource is a custom content type that
includes a link to the resource and a text field containing a description of
that resource.

What I'd like to do once a Resource has been added to the site is to scrape
certain information from it: at this point I'm thinking the Title of the
page the link points to and the provider of the resource - e.g., which
Drupal shop originally created the resource. What's the best way to go about
doing this? I'm pretty sure there's not a Drupal module that solves the
problem out of the box.
So far I've considered:
  - http://drupal.org/project/querypath
  - Drupal's built-in drupal_http_request() -
    http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_r...
  - curl
Thanks,
--Jim
--
My IM and Skype details are at http://state68.com/contact