[development] Drupal module for scraping information from an HTML/XML document

Balazs Dianiska csillagasz at gmail.com
Wed Dec 1 10:46:50 UTC 2010


Sadly, some older legacy sites are just not available as RSS; I had
such a scraping request recently. I have to say that with
drupal_http_request() you don't even have to look at curl. You can do
all sorts of things with it, even faking logins.

To parse the HTML, use QueryPath. A trick that we use is to first run
some sort of HTML tidy-up library on the downloaded page; otherwise
QueryPath runs away crying. The beautify module can help you a great
deal with that.
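
To make the fetch-then-parse idea concrete, here is a minimal sketch in
Python using only the standard library. It is not QueryPath or
drupal_http_request(); the names TitleParser, scrape_title and
provider_from_url are hypothetical, the sample markup and URL are
invented, and treating the link's host name as the "provider" is just
an assumption. It shows a title being pulled out of untidy HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TITLE> matches as well.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and newlines into single spaces.
        return " ".join("".join(self._chunks).split())


def scrape_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def provider_from_url(url):
    """Guess the provider from the host name (an assumption, not a rule)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


if __name__ == "__main__":
    # Deliberately sloppy markup: upper-case tag, unclosed elements.
    page = "<html><head><TITLE>  Theming\n 101 </TITLE><body><p>unclosed<br><div>"
    print(scrape_title(page))
    print(provider_from_url("http://www.example.com/articles/theming-101"))
```

The parser here is event-based and tolerant of broken markup, which is
why no tidy-up pass is needed in this sketch; a selector-based library
usually wants a well-formed tree first.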

Balazs

On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <cweagans at gmail.com> wrote:
> Most of the time, you can get to the posts via RSS. The Aggregator module
> does a pretty good job of pulling stuff in, and the author of the post
> that's displayed is whatever you tell it to display (see Drupal Planet for
> an example).
> Thanks,
> Cameron
>
>
>
> On Tue, Nov 30, 2010 at 12:48, Kevin O <nowarninglabel at gmail.com> wrote:
>>
>> I second the recommendation of using QueryPath. I use it almost
>> exclusively along with drupal_http_request, and I use curl in only a few
>> places (if you use curl, I recommend http://drupal.org/project/curl for a
>> dependency check). I'd really recommend, though, creating a custom module
>> that uses the above and contains your filtering logic. I've done this
>> for about a dozen modules now.
>> That said, there are some more modules available nowadays, such
>> as http://drupal.org/project/feeds_xpathparser used together with Feeds
>> (http://drupal.org/project/feeds). There are about a dozen more modules
>> that will accomplish the goal, though I haven't used them; I did go
>> through and try most of the methods out for some recent projects.
>> Cheers,
>> Kevin O'Brien
>> Drupal Developer
>> http://www.coderintherye.com
>> 415-754-0112
>>
>>
>> On Tue, Nov 30, 2010 at 11:26 AM, <development-request at drupal.org> wrote:
>>>
>>> Today's Topics:
>>>
>>>   1. Drupal module for scraping information from an HTML/XML
>>>      document (James Benstead)
>>>   2. Re: Drupal module for scraping information from an HTML/XML
>>>      document (John Fiala)
>>>   3. Easter problem (Ámon Tamás)
>>>   4. Re: Easter problem (Carl Wiedemann)
>>>   5. Re: Easter problem (larry at garfieldtech.com)
>>>   6. Re: Easter problem (jeff at ayendesigns.com)
>>>   7. Re: Easter problem (larry at garfieldtech.com)
>>>   8. Re: Easter problem (Jennifer Hodgdon)
>>>
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Tue, 30 Nov 2010 18:56:09 +0000
>>> From: James Benstead <james.benstead at gmail.com>
>>> Subject: [development] Drupal module for scraping information from an
>>>        HTML/XML document
>>> To: development <development at drupal.org>
>>> Message-ID:
>>>        <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV at mail.gmail.com>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> I've finally got round to doing some serious work on Drupalversity, an
>>> open, web-based Drupal education project I've had in mind for a year or
>>> so.
>>>
>>> People who use Drupalversity to learn have the option of adding Resources
>>> to the site - i.e., links to posts at Lullabot, Chapter3 etc that explain
>>> how to do specific things with Drupal. A Resource is a custom content type
>>> that includes a link to the resource and a text field containing a
>>> description of that resource.
>>>
>>> What I'd like to do once a Resource has been added to the site is to
>>> scrape certain information from it: at this point I'm thinking the Title
>>> of the page the link points to and the provider of the resource - e.g.,
>>> which Drupal shop originally created the resource. What's the best way to
>>> go about doing this? I'm pretty sure there's not a Drupal module that
>>> solves the problem out of the box.
>>>
>>> So far I've considered:
>>>
>>>   - http://drupal.org/project/querypath
>>>   - Drupal's built-in drupal_http_request() -
>>>     http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
>>>   - curl
>>>
>>> Thanks,
>>>
>>> --Jim
>>> --
>>> My IM and Skype details are at http://state68.com/contact
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Tue, 30 Nov 2010 12:06:33 -0700
>>> From: John Fiala <jcfiala at gmail.com>
>>> Subject: Re: [development] Drupal module for scraping information from
>>>        an HTML/XML document
>>> To: development at drupal.org
>>> Message-ID:
>>>        <AANLkTi=N6WxHfigUC4ZopfxswMBv8bj7BZZJErHmko_T at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> These days, if I'm going to be trying to extract data from html/xml,
>>> I'd use querypath.  Give it a try!
>>>
>>> On Tue, Nov 30, 2010 at 11:56 AM, James Benstead
>>> <james.benstead at gmail.com> wrote:
>>> > What I'd like to do once a Resource has been added to the site is to
>>> > scrape certain information from it: at this point I'm thinking the
>>> > Title of the page the link points to and the provider of the resource -
>>> > e.g., which Drupal shop originally created the resource. What's the
>>> > best way to go about doing this? I'm pretty sure there's not a Drupal
>>> > module that solves the problem out of the box.
>>>
>>> --
>>> John Fiala
>>> www.jcfiala.net
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 3
>>> Date: Tue, 30 Nov 2010 20:14:04 +0100
>>> From: Ámon Tamás <amont at 5net.hu>
>>> Subject: [development] Easter problem
>>> To: development at drupal.org
>>> Message-ID:
>>>        <AANLkTikmKoVkedks2FkWUbHRq9sNTe6r0iX+iMjmBtvy at mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Hello,
>>>
>>> I have the nameday module (http://drupal.org/project/nameday) and I got a
>>> feature request for Greek namedays. As I see it, they are based on
>>> Easter, which is not an easy thing to calculate.
>>>
>>> So I want to find an algorithm for Easter and similar days that can be
>>> stored somehow. Maybe it should be a hook, or something else that can be
>>> stored in the database.
>>>
>>>
>>> Thanks
>>>
>>> --
>>> Ámon Tamás
>>> Site developer and programmer
>>>
>>> ------------------------------
>>>
>>> Message: 4
>>> Date: Tue, 30 Nov 2010 12:22:42 -0700
>>> From: Carl Wiedemann <carl.wiedemann at gmail.com>
>>> Subject: Re: [development] Easter problem
>>> To: development at drupal.org
>>> Message-ID:
>>>        <AANLkTinD9Xz=3inJj2GraAuqde_=3yshJDwxCJzu12zr at mail.gmail.com>
>>> Content-Type: text/plain; charset="iso-8859-2"
>>>
>>> Does this help? http://php.net/manual/en/function.easter-days.php
>>>
>>> On Tue, Nov 30, 2010 at 12:14 PM, Ámon Tamás <amont at 5net.hu> wrote:
>>>
>>> > Hello,
>>> >
>>> > I have the nameday module (http://drupal.org/project/nameday) and I got
>>> > a feature request for Greek namedays. As I see it, they are based on
>>> > Easter, which is not an easy thing to calculate.
>>> >
>>> > So I want to find an algorithm for Easter and similar days that can be
>>> > stored somehow. Maybe it should be a hook, or something else that can
>>> > be stored in the database.
>>> >
>>> >
>>> > Thanks
>>> >
>>> > --
>>> > Ámon Tamás
>>> > Site developer and programmer
>>> >
>>> >
>>>
>>> ------------------------------
>>>
>>> Message: 5
>>> Date: Tue, 30 Nov 2010 13:24:07 -0600
>>> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
>>> Subject: Re: [development] Easter problem
>>> To: development at drupal.org
>>> Message-ID: <4CF54F57.2030602 at garfieldtech.com>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>
>>> There's no need for a hook here at all.  You can either code in the
>>> algorithm for defining when Easter is (which sounds like it is in fact
>>> rather complicated) or just pre-store known, pre-calculated dates for it
>>> for the next decade or so.  (10 records, one per year; totally easy.)
>>>
>>> Both options are described here, including the different mechanisms for
>>> defining when Easter is in different calendars:
>>>
>>> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
>>>
>>> --Larry Garfield
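
Both options above can be sketched briefly. What follows is a rough
illustration in Python rather than PHP, and it is an assumption on my
part, not something proposed in the thread: it uses Meeus's Julian
algorithm for Orthodox Easter (the one relevant to Greek namedays) and
then shifts the result by 13 days to get the Gregorian date, which only
holds for the years 1900 to 2099. The names orthodox_easter and
EASTER_TABLE are hypothetical.

```python
from datetime import date, timedelta


def orthodox_easter(year):
    """Gregorian date of Orthodox Easter for 1900 <= year <= 2099."""
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    # Month and zero-based day of Easter in the Julian calendar.
    month, day = divmod(d + e + 114, 31)
    # The Julian calendar runs 13 days behind the Gregorian one in this range.
    return date(year, month, day + 1) + timedelta(days=13)


# The "pre-store" option: ten records, one per year, computed once.
EASTER_TABLE = {year: orthodox_easter(year) for year in range(2010, 2020)}

if __name__ == "__main__":
    for year, easter in sorted(EASTER_TABLE.items()):
        print(year, easter.isoformat())
```

The table could just as well be written to a database once and read
back, so the arithmetic never has to run at request time.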
>>>
>>> On 11/30/10 1:14 PM, Ámon Tamás wrote:
>>> > Hello,
>>> >
>>> > I have the nameday module (http://drupal.org/project/nameday) and I got
>>> > a feature request for Greek namedays. As I see it, they are based on
>>> > Easter, which is not an easy thing to calculate.
>>> >
>>> > So I want to find an algorithm for Easter and similar days that can be
>>> > stored somehow. Maybe it should be a hook, or something else that can
>>> > be stored in the database.
>>> >
>>> >
>>> > Thanks
>>> >
>>> > --
>>> > Ámon Tamás
>>> > Site developer and programmer
>>> >
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 6
>>> Date: Tue, 30 Nov 2010 14:23:56 -0500
>>> From: jeff at ayendesigns.com
>>> Subject: Re: [development] Easter problem
>>> To: development at drupal.org
>>> Message-ID: <4CF54F4C.2060409 at ayendesigns.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> You can google it, but I believe this is one of those things that cannot
>>> be reduced to an equation or algorithm. It's something like the first
>>> Sunday after the first full moon after the spring equinox.
>>>
>>> On 11/30/2010 02:14 PM, Ámon Tamás wrote:
>>> > Hello,
>>> >
>>> > I have the nameday module (http://drupal.org/project/nameday) and I got
>>> > a feature request for Greek namedays. As I see it, they are based on
>>> > Easter, which is not an easy thing to calculate.
>>> >
>>> > So I want to find an algorithm for Easter and similar days that can be
>>> > stored somehow. Maybe it should be a hook, or something else that can
>>> > be stored in the database.
>>> >
>>> >
>>> > Thanks
>>> >
>>> > --
>>> > Ámon Tamás
>>> > Site developer and programmer
>>> >
>>>
>>> ------------------------------
>>>
>>> Message: 7
>>> Date: Tue, 30 Nov 2010 13:26:23 -0600
>>> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
>>> Subject: Re: [development] Easter problem
>>> To: development at drupal.org
>>> Message-ID: <4CF54FDF.7070506 at garfieldtech.com>
>>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
>>>
>>> The Calendar PHP module is not enabled by default in a stock PHP, so I
>>> don't know that you can rely on it (unfortunately).  It does have some
>>> cool stuff in it, though.
>>>
>>> --Larry Garfield
>>>
>>> On 11/30/10 1:22 PM, Carl Wiedemann wrote:
>>> > Does this help? http://php.net/manual/en/function.easter-days.php
>>> >
>>> > On Tue, Nov 30, 2010 at 12:14 PM, Ámon Tamás <amont at 5net.hu> wrote:
>>> >
>>> >     Hello,
>>> >
>>> >     I have the nameday module (http://drupal.org/project/nameday) and I
>>> >     got a feature request for Greek namedays. As I see it, they are
>>> >     based on Easter, which is not an easy thing to calculate.
>>> >
>>> >     So I want to find an algorithm for Easter and similar days that can
>>> >     be stored somehow. Maybe it should be a hook, or something else
>>> >     that can be stored in the database.
>>> >
>>> >
>>> >     Thanks
>>> >
>>> >     --
>>> >     Ámon Tamás
>>> >     Site developer and programmer
>>> >
>>> >
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 8
>>> Date: Tue, 30 Nov 2010 11:21:08 -0800
>>> From: Jennifer Hodgdon <yahgrp at poplarware.com>
>>> Subject: Re: [development] Easter problem
>>> To: development at drupal.org
>>> Message-ID: <4CF54EA4.1050502 at poplarware.com>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>
>>> http://php.net/manual/en/function.easter-date.php
>>>
>>> On 11/30/2010 11:14 AM, Ámon Tamás wrote:
>>> > I have the nameday module (http://drupal.org/project/nameday) and I got
>>> > a feature request for Greek namedays. As I see it, they are based on
>>> > Easter, which is not an easy thing to calculate.
>>> >
>>> > So I want to find an algorithm for Easter and similar days that can be
>>> > stored somehow. Maybe it should be a hook, or something else that can
>>> > be stored in the database.
>>>
>>> --
>>> Jennifer Hodgdon * Poplar ProductivityWare
>>> www.poplarware.com
>>> Drupal web sites and custom Drupal modules
>>>
>>>
>>>
>>> ------------------------------
>>>
>>> --
>>> [ Drupal development list | http://lists.drupal.org/ ]
>>>
>>> End of development Digest, Vol 95, Issue 58
>>> *******************************************
>>
>
>

