I second the recommendation to use QueryPath. I use it almost exclusively, together with drupal_http_request; I use curl in only a few places (if you use curl, I recommend http://drupal.org/project/curl for a dependency check).

That said, I'd really recommend creating a custom module that uses the above and contains your own filtering logic. I've done this for about a dozen modules now.

There are also more contributed modules available nowadays, such as http://drupal.org/project/feeds_xpathparser used together with Feeds (http://drupal.org/project/feeds). There are about a dozen more modules that would accomplish the goal. I haven't used them, though I did go through and try most of the methods out for some recent projects.

Cheers,
Kevin O'Brien
Drupal Developer
http://www.coderintherye.com
415-754-0112

On Tue, Nov 30, 2010 at 11:26 AM, <development-request@drupal.org> wrote:
Message: 1
Date: Tue, 30 Nov 2010 18:56:09 +0000
From: James Benstead <james.benstead@gmail.com>
Subject: [development] Drupal module for scraping information from an HTML/XML document
To: development <development@drupal.org>
Message-ID: <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
I've finally got round to doing some serious work on Drupalversity, an open, web-based Drupal education project I've had in mind for a year or so.
People who use Drupalversity to learn have the option of adding Resources to the site, i.e., links to posts at Lullabot, Chapter3, etc. that explain how to do specific things with Drupal. A Resource is a custom content type that includes a link to the resource and a text field containing a description of that resource.
What I'd like to do once a Resource has been added to the site is to scrape certain information from it: at this point I'm thinking the Title of the page the link points to and the provider of the resource - e.g., which Drupal shop originally created the resource. What's the best way to go about doing this? I'm pretty sure there's not a Drupal module that solves the problem out of the box.
So far I've considered:
- http://drupal.org/project/querypath
- Drupal's built-in drupal_http_request() - http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_r...
- curl
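For illustration only, the scraping step described above (pull the page title and guess the provider from the link) can be sketched outside Drupal. This is a minimal Python sketch using only the standard library, not Drupal or QueryPath code. The names `TitleParser` and `scrape_resource` are hypothetical, fetching the page is left out (in Drupal that would be the job of drupal_http_request() or curl), and treating the link's hostname as the "provider" is an assumption rather than anything the thread specifies.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def scrape_resource(url, html):
    """Return the page title and a rough provider name for a Resource link.

    Assumption: the provider is approximated by the hostname of the link,
    with a leading "www." removed.
    """
    parser = TitleParser()
    parser.feed(html)
    host = urlparse(url).hostname or ""
    provider = host[4:] if host.startswith("www.") else host
    return {"title": parser.title, "provider": provider}
```

In a Drupal module the same two steps would more likely be a QueryPath selector for the title plus PHP's parse_url() for the host, run when the Resource node is saved; the sketch only shows the shape of the logic.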
Thanks,
--Jim

--
My IM and Skype details are at http://state68.com/contact