[development] Drupal module for scraping information from an HTML/XML document

Kevin O'Brien nowarninglabel at gmail.com
Tue Nov 30 19:48:43 UTC 2010


I second the recommendation to use QueryPath. I use it almost exclusively,
together with drupal_http_request, and fall back to curl in only a few
places (if you use curl, I recommend http://drupal.org/project/curl for a
dependency check). Still, I'd really recommend creating a custom module that
uses the above and holds your filtering logic; I've done this for about a
dozen modules now.
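
As a rough illustration of the "fetch, then filter" logic such a custom
module would hold, here is a minimal sketch. It is written in Python with
only the standard library, purely to keep the example self-contained; in a
Drupal module the fetch would go through drupal_http_request() and the
parsing through QueryPath. The names TitleParser and scrape_resource are
hypothetical, and treating the link's hostname as the "provider" is an
assumption, not something the thread specifies.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._seen_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._seen_title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._seen_title = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def scrape_resource(url, html):
    """Return the page title and a provider guess for an already-fetched page.

    Assumption: the provider is approximated by the hostname of the link,
    with a leading "www." removed.
    """
    parser = TitleParser()
    parser.feed(html)
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return {"title": parser.title, "provider": host}
```

Keeping the filtering in one function like this means the fetching method
(drupal_http_request or curl) can change without touching the extraction
logic.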

That said, there are more modules available nowadays, such as
http://drupal.org/project/feeds_xpathparser used together with Feeds
(http://drupal.org/project/feeds). About a dozen other modules will
accomplish the goal; I haven't used them beyond going through and trying
most of the methods out for some recent projects.

Cheers,

Kevin O'Brien
Drupal Developer
http://www.coderintherye.com
415-754-0112



On Tue, Nov 30, 2010 at 11:26 AM, <development-request at drupal.org> wrote:

> Today's Topics:
>
>   1. Drupal module for scraping information from an HTML/XML
>      document (James Benstead)
>   2. Re: Drupal module for scraping information from an HTML/XML
>      document (John Fiala)
>   3. Easter problem (Ámon Tamás)
>   4. Re: Easter problem (Carl Wiedemann)
>   5. Re: Easter problem (larry at garfieldtech.com)
>   6. Re: Easter problem (jeff at ayendesigns.com)
>   7. Re: Easter problem (larry at garfieldtech.com)
>   8. Re: Easter problem (Jennifer Hodgdon)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 30 Nov 2010 18:56:09 +0000
> From: James Benstead <james.benstead at gmail.com>
> Subject: [development] Drupal module for scraping information from an
>        HTML/XML document
> To: development <development at drupal.org>
> Message-ID:
>        <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I've finally got round to doing some serious work on Drupalversity, an
> open, web-based Drupal education project I've had in mind for a year or so.
>
> People who use Drupalversity to learn have the option of adding Resources
> to the site - i.e., links to posts at Lullabot, Chapter3 etc that explain
> how to do specific things with Drupal. A Resource is a custom content type
> that includes a link to the resource and a text field containing a
> description of that resource.
>
> What I'd like to do once a Resource has been added to the site is to scrape
> certain information from it: at this point I'm thinking the Title of the
> page the link points to and the provider of the resource - e.g., which
> Drupal shop originally created the resource. What's the best way to go
> about doing this? I'm pretty sure there's not a Drupal module that solves
> the problem out of the box.
>
> So far I've considered:
>
>   - http://drupal.org/project/querypath
>   - Drupal's built-in drupal_http_request() -
>     http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
>   - curl
>
> Thanks,
>
> --Jim
> --
> My IM and Skype details are at http://state68.com/contact
>
> ------------------------------
>
> Message: 2
> Date: Tue, 30 Nov 2010 12:06:33 -0700
> From: John Fiala <jcfiala at gmail.com>
> Subject: Re: [development] Drupal module for scraping information from
>        an HTML/XML document
> To: development at drupal.org
> Message-ID:
>        <AANLkTi=N6WxHfigUC4ZopfxswMBv8bj7BZZJErHmko_T at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> These days, if I'm going to be trying to extract data from html/xml,
> I'd use querypath.  Give it a try!
>
> --
> John Fiala
> www.jcfiala.net
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 30 Nov 2010 20:14:04 +0100
> From: Ámon Tamás <amont at 5net.hu>
> Subject: [development] Easter problem
> To: development at drupal.org
> Message-ID:
>        <AANLkTikmKoVkedks2FkWUbHRq9sNTe6r0iX+iMjmBtvy at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello,
>
> I have the nameday module (http://drupal.org/project/nameday) and I got a
> feature request for Greek namedays. As I see it, they are based on Easter,
> whose date is not easy to calculate.
>
> So I want to find an algorithm for Easter and similar days, whose results
> can be stored somehow. Maybe it should be a hook, or something else that
> can be stored in the database.
>
>
> Thanks
>
> --
> Ámon Tamás
> Site developer and programmer
>
> ------------------------------
>
> Message: 4
> Date: Tue, 30 Nov 2010 12:22:42 -0700
> From: Carl Wiedemann <carl.wiedemann at gmail.com>
> Subject: Re: [development] Easter problem
> To: development at drupal.org
> Message-ID:
>        <AANLkTinD9Xz=3inJj2GraAuqde_=3yshJDwxCJzu12zr at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-2"
>
> Does this help? http://php.net/manual/en/function.easter-days.php
>
> ------------------------------
>
> Message: 5
> Date: Tue, 30 Nov 2010 13:24:07 -0600
> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
> Subject: Re: [development] Easter problem
> To: development at drupal.org
> Message-ID: <4CF54F57.2030602 at garfieldtech.com>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> There's no need for a hook here at all.  You can either code in the
> algorithm for defining when Easter is (which sounds like it is in fact
> rather complicated) or just pre-store known, pre-calculated dates for it
> for the next decade or so.  (10 records, one per year; totally easy.)
>
> Both options are described here, including the different mechanisms for
> defining when Easter is in different calendars:
>
> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
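
Both options mentioned in this message can be sketched briefly. The code
below is in Python only to keep the illustration self-contained; a Drupal
module would need the same arithmetic in PHP. It uses two published
computus methods: the anonymous Gregorian algorithm for Western Easter and
Meeus's Julian algorithm for Orthodox (Greek) Easter. The function names
are hypothetical, and the fixed 13-day Julian-to-Gregorian offset is an
assumption that holds only for the years 1900 to 2099.

```python
from datetime import date, timedelta


def western_easter(year):
    """Western Easter Sunday (Gregorian calendar), anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def orthodox_easter(year):
    """Orthodox Easter Sunday as a Gregorian date, via Meeus's Julian algorithm."""
    if not 1900 <= year <= 2099:
        # Assumption: the 13-day calendar offset below is only valid here.
        raise ValueError("supported years are 1900 to 2099")
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month, day = divmod(d + e + 114, 31)
    julian_style = date(year, month, day + 1)  # date in the Julian calendar
    return julian_style + timedelta(days=13)   # shift to the Gregorian calendar


def easter_table(first_year, count=10):
    """Pre-calculated Orthodox Easter dates, one record per year."""
    return {y: orthodox_easter(y) for y in range(first_year, first_year + count)}
```

Calling easter_table(2011) gives the "10 records, one per year" variant:
the dates can be computed once and stored, so the site does not depend on
the algorithm at runtime.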
>
> --Larry Garfield
>
>
> ------------------------------
>
> Message: 6
> Date: Tue, 30 Nov 2010 14:23:56 -0500
> From: jeff at ayendesigns.com
> Subject: Re: [development] Easter problem
> To: development at drupal.org
> Message-ID: <4CF54F4C.2060409 at ayendesigns.com>
> Content-Type: text/plain; charset="utf-8"
>
> You can google it, but I believe this is one of those things that cannot
> be reduced to an equation or algorithm. It's something like the first
> Sunday after the first full moon after the spring equinox.
>
> ------------------------------
>
> Message: 7
> Date: Tue, 30 Nov 2010 13:26:23 -0600
> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
> Subject: Re: [development] Easter problem
> To: development at drupal.org
> Message-ID: <4CF54FDF.7070506 at garfieldtech.com>
> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
>
> The Calendar PHP module is not enabled by default in a stock PHP, so I
> don't know that you can rely on it (unfortunately).  It does have some
> cool stuff in it, though.
>
> --Larry Garfield
>
> On 11/30/10 1:22 PM, Carl Wiedemann wrote:
> > Does this help? http://php.net/manual/en/function.easter-days.php
>
>
> ------------------------------
>
> Message: 8
> Date: Tue, 30 Nov 2010 11:21:08 -0800
> From: Jennifer Hodgdon <yahgrp at poplarware.com>
> Subject: Re: [development] Easter problem
> To: development at drupal.org
> Message-ID: <4CF54EA4.1050502 at poplarware.com>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> http://php.net/manual/en/function.easter-date.php
>
> --
> Jennifer Hodgdon * Poplar ProductivityWare
> www.poplarware.com
> Drupal web sites and custom Drupal modules
>
>
>
> ------------------------------
>
> --
> [ Drupal development list | http://lists.drupal.org/ ]
>
> End of development Digest, Vol 95, Issue 58
> *******************************************
>