[development] Drupal module for scraping information from an HTML/XML document

Victor Kane victorkane at gmail.com
Wed Dec 1 11:42:47 UTC 2010


Well, Python has "Beautiful Soup".

http://www.crummy.com/software/BeautifulSoup/

"You didn't write that awful page. You're just trying to get some data out
of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."

In PHP I have used http://simplehtmldom.sourceforge.net/ as a way of parsing
badly formed HTML.
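
To give a feel for what a forgiving parser buys you, here is a minimal sketch in Python. It deliberately uses only the standard library's html.parser rather than Beautiful Soup or simplehtmldom, so it is a stand-in for those tools and not their API. The class name TitleGrabber, the helper extract_title() and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collect the text of the first <title> element, tolerating broken markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so the result is a clean one-line string.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleGrabber()
    parser.feed(html)
    parser.close()
    return parser.title


# Hypothetical sample: unclosed <head>, <p>, <b> and <div> tags.
messy = "<html><head><title>Views  tutorial</title><body><p>Unclosed <b>tags<div>"
print(extract_title(messy))
```

The parser never raises on the unclosed tags; it just reports what it sees, which is usually all you need when the goal is pulling one field out of someone else's page.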

I wrote a script to import nodes using the latter and then saved them with
"node_save()".

An alternative could be to parse to CSV, then import using the node_export
or node_import modules.
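
A rough sketch of the CSV route, again in Python with only the standard library. The column names (title, url, provider) and the helper rows_to_csv() are assumptions for the sake of the example; the columns an import module expects depend on how you map fields in its configuration.

```python
import csv
import io


def rows_to_csv(rows, fieldnames=("title", "url", "provider")):
    """Serialize scraped records (a list of dicts) to CSV text.

    The field names are hypothetical; adjust them to whatever the
    import step on the Drupal side is configured to read.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Made-up sample record; the comma in the title forces quoting.
scraped = [
    {
        "title": "Views, explained",
        "url": "http://example.com/a",
        "provider": "Example Shop",
    }
]
print(rows_to_csv(scraped), end="")
```

Letting the csv module handle quoting matters here, because scraped titles routinely contain commas and quotation marks.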

Hope that helps,

Victor Kane
http://awebfactory.com.ar
http://projectflowandtracker.com

On Wed, Dec 1, 2010 at 7:46 AM, Balazs Dianiska <csillagasz at gmail.com> wrote:

> Sadly some of the older legacy sites are just not available in rss, I
> had such a scraping request recently. I have to say that with
> drupal_http_request you don't even have to look at curl. You can do
> all sorts of things, even faking logins.
>
> To parse the HTML, use QueryPath. A trick that we use is to first run
> some sort of HTML tidy-up library on the downloaded page; otherwise
> QueryPath runs away crying. The beautify module can help you a great deal
> with that.
>
> Balazs
>
> On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <cweagans at gmail.com> wrote:
> > Most of the time, you can get to the posts via RSS. Aggregator module does a
> > pretty good job of pulling stuff in, and the author of the post that's
> > displayed is whatever you tell it to display (see Drupal Planet for an
> > example).
> > Thanks,
> > Cameron
> >
> >
> >
> > On Tue, Nov 30, 2010 at 12:48, Kevin O <nowarninglabel at gmail.com> wrote:
> >>
> >> I second the recommendation of using QueryPath. I use it almost
> >> exclusively along with drupal_http_request, though I use curl only in a few
> >> places (if you use curl, I recommend http://drupal.org/project/curl for a
> >> dependency check). I'd really recommend, though, creating a custom module
> >> that uses the above and holds your filtering logic. I've done this
> >> for about a dozen modules now.
> >> That said, there are some more modules available out there nowadays, such
> >> as http://drupal.org/project/feeds_xpathparser used with Feeds
> >> (http://drupal.org/project/feeds). There are about a dozen more modules that
> >> will accomplish the goal. I haven't used them, but I went through and
> >> tried most of the methods out for some recent projects.
> >> Cheers,
> >> Kevin O'Brien
> >> Drupal Developer
> >> http://www.coderintherye.com
> >> 415-754-0112
> >>
> >>
> >> On Tue, Nov 30, 2010 at 11:26 AM, <development-request at drupal.org> wrote:
> >>>
> >>> Today's Topics:
> >>>
> >>>   1. Drupal module for scraping information from an HTML/XML
> >>>      document (James Benstead)
> >>>   2. Re: Drupal module for scraping information from an HTML/XML
> >>>      document (John Fiala)
> >>>   3. Easter problem (Ámon Tamás)
> >>>   4. Re: Easter problem (Carl Wiedemann)
> >>>   5. Re: Easter problem (larry at garfieldtech.com)
> >>>   6. Re: Easter problem (jeff at ayendesigns.com)
> >>>   7. Re: Easter problem (larry at garfieldtech.com)
> >>>   8. Re: Easter problem (Jennifer Hodgdon)
> >>>
> >>>
> >>> ----------------------------------------------------------------------
> >>>
> >>> Message: 1
> >>> Date: Tue, 30 Nov 2010 18:56:09 +0000
> >>> From: James Benstead <james.benstead at gmail.com>
> >>> Subject: [development] Drupal module for scraping information from an
> >>>        HTML/XML document
> >>> To: development <development at drupal.org>
> >>> Message-ID:
> >>>        <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV at mail.gmail.com>
> >>> Content-Type: text/plain; charset="iso-8859-1"
> >>>
> >>> I've finally got round to doing some serious work on Drupalversity, an
> >>> open, web-based Drupal education project I've had in mind for a year or so.
> >>>
> >>> People who use Drupalversity to learn have the option of adding Resources
> >>> to the site - i.e., links to posts at Lullabot, Chapter3 etc that explain
> >>> how to do specific things with Drupal. A Resource is a custom content type
> >>> that includes a link to the resource and a text field containing a
> >>> description of that resource.
> >>>
> >>> What I'd like to do once a Resource has been added to the site is to
> >>> scrape certain information from it: at this point I'm thinking the Title of
> >>> the page the link points to and the provider of the resource - e.g., which
> >>> Drupal shop originally created the resource. What's the best way to go
> >>> about doing this? I'm pretty sure there's not a Drupal module that solves the
> >>> problem out of the box.
> >>>
> >>> So far I've considered:
> >>>
> >>>   - http://drupal.org/project/querypath
> >>>   - Drupal's built-in drupal_http_request() -
> >>>     http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
> >>>   - curl
> >>>
> >>> Thanks,
> >>>
> >>> --Jim
> >>> --
> >>> My IM and Skype details are at http://state68.com/contact
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 2
> >>> Date: Tue, 30 Nov 2010 12:06:33 -0700
> >>> From: John Fiala <jcfiala at gmail.com>
> >>> Subject: Re: [development] Drupal module for scraping information from
> >>>        an HTML/XML document
> >>> To: development at drupal.org
> >>> Message-ID:
> >>>        <AANLkTi=N6WxHfigUC4ZopfxswMBv8bj7BZZJErHmko_T at mail.gmail.com>
> >>> Content-Type: text/plain; charset=ISO-8859-1
> >>>
> >>> These days, if I'm going to be trying to extract data from html/xml,
> >>> I'd use querypath.  Give it a try!
> >>>
> >>> On Tue, Nov 30, 2010 at 11:56 AM, James Benstead
> >>> <james.benstead at gmail.com> wrote:
> >>> > What I'd like to do once a Resource has been added to the site is to
> >>> > scrape certain information from it: at this point I'm thinking the Title of
> >>> > the page the link points to and the provider of the resource - e.g., which
> >>> > Drupal shop originally created the resource. What's the best way to go
> >>> > about doing this? I'm pretty sure there's not a Drupal module that solves
> >>> > the problem out of the box.
> >>>
> >>> --
> >>> John Fiala
> >>> www.jcfiala.net
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 3
> >>> Date: Tue, 30 Nov 2010 20:14:04 +0100
> >>> From: Ámon Tamás <amont at 5net.hu>
> >>> Subject: [development] Easter problem
> >>> To: development at drupal.org
> >>> Message-ID:
> >>>        <AANLkTikmKoVkedks2FkWUbHRq9sNTe6r0iX+iMjmBtvy at mail.gmail.com>
> >>> Content-Type: text/plain; charset="utf-8"
> >>>
> >>> Hello,
> >>>
> >>> I have the nameday module (http://drupal.org/project/nameday) and I get a
> >>> feature request for the Greek namedays. How I see it is based on the
> >>> Easter, what is not an easy thing to count.
> >>>
> >>> Well, I want to find some algorithm for Easter, and similar days, what is
> >>> can be stored somehow. Maybe it should be a hook or some other think what
> >>> can be stored in database.
> >>>
> >>>
> >>> Thanks
> >>>
> >>> --
> >>> Ámon Tamás
> >>> Sitefejlesztő és programozó
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 4
> >>> Date: Tue, 30 Nov 2010 12:22:42 -0700
> >>> From: Carl Wiedemann <carl.wiedemann at gmail.com>
> >>> Subject: Re: [development] Easter problem
> >>> To: development at drupal.org
> >>> Message-ID:
> >>>        <AANLkTinD9Xz=3inJj2GraAuqde_=3yshJDwxCJzu12zr at mail.gmail.com>
> >>> Content-Type: text/plain; charset="iso-8859-2"
> >>>
> >>> Does this help? http://php.net/manual/en/function.easter-days.php
> >>>
> >>> On Tue, Nov 30, 2010 at 12:14 PM, Ámon Tamás <amont at 5net.hu> wrote:
> >>>
> >>> > Hello,
> >>> >
> >>> > I have the nameday module (http://drupal.org/project/nameday) and I get
> >>> > a feature request for the Greek namedays. How I see it is based on the
> >>> > Easter, what is not an easy thing to count.
> >>> >
> >>> > Well, I want to find some algorithm for Easter, and similar days, what
> >>> > is can be stored somehow. Maybe it should be a hook or some other think
> >>> > what can be stored in database.
> >>> >
> >>> >
> >>> > Thanks
> >>> >
> >>> > --
> >>> > Ámon Tamás
> >>> > Sitefejlesztő és programozó
> >>> >
> >>> >
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 5
> >>> Date: Tue, 30 Nov 2010 13:24:07 -0600
> >>> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
> >>> Subject: Re: [development] Easter problem
> >>> To: development at drupal.org
> >>> Message-ID: <4CF54F57.2030602 at garfieldtech.com>
> >>> Content-Type: text/plain; charset=UTF-8; format=flowed
> >>>
> >>> There's no need for a hook here at all.  You can either code in the
> >>> algorithm for defining when Easter is (which sounds like it is in fact
> >>> rather complicated) or just pre-store known pre-calculated dates for it
> >>> for the next decade or so.  (10 records, one per year; totally easy.)
> >>>
> >>> Both options are described here, including the different mechanisms for
> >>> defining when Easter is in different calendars:
> >>>
> >>> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
> >>>
> >>> --Larry Garfield
> >>>
> >>> On 11/30/10 1:14 PM, Ámon Tamás wrote:
> >>> > Hello,
> >>> >
> >>> > I have the nameday module (http://drupal.org/project/nameday) and I get
> >>> > a feature request for the Greek namedays. How I see it is based on the
> >>> > Easter, what is not an easy thing to count.
> >>> >
> >>> > Well, I want to find some algorithm for Easter, and similar days, what
> >>> > is can be stored somehow. Maybe it should be a hook or some other think
> >>> > what can be stored in database.
> >>> >
> >>> >
> >>> > Thanks
> >>> >
> >>> > --
> >>> > Ámon Tamás
> >>> > Sitefejlesztő és programozó
> >>> >
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 6
> >>> Date: Tue, 30 Nov 2010 14:23:56 -0500
> >>> From: jeff at ayendesigns.com
> >>> Subject: Re: [development] Easter problem
> >>> To: development at drupal.org
> >>> Message-ID: <4CF54F4C.2060409 at ayendesigns.com>
> >>> Content-Type: text/plain; charset="utf-8"
> >>>
> >>> You can google it, but I believe this is one of those things that cannot
> >>> be reduced to an equation or algorithm. It's something like the first
> >>> Sunday after the first full moon after the spring equinox.
> >>>
> >>> On 11/30/2010 02:14 PM, Ámon Tamás wrote:
> >>> > Hello,
> >>> >
> >>> > I have the nameday module (http://drupal.org/project/nameday) and I
> >>> > get a feature request for the Greek namedays. How I see it is based on
> >>> > the Easter, what is not an easy thing to count.
> >>> >
> >>> > Well, I want to find some algorithm for Easter, and similar days, what
> >>> > is can be stored somehow. Maybe it should be a hook or some other
> >>> > think what can be stored in database.
> >>> >
> >>> >
> >>> > Thanks
> >>> >
> >>> > --
> >>> > Ámon Tamás
> >>> > Sitefejlesztő és programozó
> >>> >
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 7
> >>> Date: Tue, 30 Nov 2010 13:26:23 -0600
> >>> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
> >>> Subject: Re: [development] Easter problem
> >>> To: development at drupal.org
> >>> Message-ID: <4CF54FDF.7070506 at garfieldtech.com>
> >>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
> >>>
> >>> The Calendar PHP module is not enabled by default in a stock PHP, so I
> >>> don't know that you can rely on it (unfortunately).  It does have some
> >>> cool stuff in it, though.
> >>>
> >>> --Larry Garfield
> >>>
> >>> On 11/30/10 1:22 PM, Carl Wiedemann wrote:
> >>> > Does this help? http://php.net/manual/en/function.easter-days.php
> >>> >
> >>> > On Tue, Nov 30, 2010 at 12:14 PM, Ámon Tamás <amont at 5net.hu> wrote:
> >>> >
> >>> >     Hello,
> >>> >
> >>> >     I have the nameday module (http://drupal.org/project/nameday) and I
> >>> >     get a feature request for the Greek namedays. How I see it is based
> >>> >     on the Easter, what is not an easy thing to count.
> >>> >
> >>> >     Well, I want to find some algorithm for Easter, and similar days,
> >>> >     what is can be stored somehow. Maybe it should be a hook or some
> >>> >     other think what can be stored in database.
> >>> >
> >>> >
> >>> >     Thanks
> >>> >
> >>> >     --
> >>> >     Ámon Tamás
> >>> >     Sitefejlesztő és programozó
> >>> >
> >>> >
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 8
> >>> Date: Tue, 30 Nov 2010 11:21:08 -0800
> >>> From: Jennifer Hodgdon <yahgrp at poplarware.com>
> >>> Subject: Re: [development] Easter problem
> >>> To: development at drupal.org
> >>> Message-ID: <4CF54EA4.1050502 at poplarware.com>
> >>> Content-Type: text/plain; charset=UTF-8; format=flowed
> >>>
> >>> http://php.net/manual/en/function.easter-date.php
> >>>
> >>> On 11/30/2010 11:14 AM, Ámon Tamás wrote:
> >>> > I have the nameday module (http://drupal.org/project/nameday) and I get
> >>> > a feature request for the Greek namedays. How I see it is based on the
> >>> > Easter, what is not an easy thing to count.
> >>> >
> >>> > Well, I want to find some algorithm for Easter, and similar days, what
> >>> > is can be stored somehow. Maybe it should be a hook or some other think
> >>> > what can be stored in database.
> >>>
> >>> --
> >>> Jennifer Hodgdon * Poplar ProductivityWare
> >>> www.poplarware.com
> >>> Drupal web sites and custom Drupal modules
> >>>
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> --
> >>> [ Drupal development list | http://lists.drupal.org/ ]
> >>>
> >>> End of development Digest, Vol 95, Issue 58
> >>> *******************************************
> >>
> >
> >
>