[development] Drupal module for scraping information from an HTML/XML document

James Benstead james.benstead at gmail.com
Mon Dec 6 19:04:24 UTC 2010


Thanks guys - looks like QueryPath is the way forward :)

--Jim
--
My IM and Skype details are at http://state68.com/contact


On 1 December 2010 11:42, Victor Kane <victorkane at gmail.com> wrote:

> Well, Python has "Beautiful Soup".
>
> http://www.crummy.com/software/BeautifulSoup/
>
> "You didn't write that awful page. You're just trying to get some data out
> of it. Right now, you don't really care what HTML is supposed to look like.
>
> Neither does this parser."
>
> In PHP I have used http://simplehtmldom.sourceforge.net/ as a way of
> parsing badly formed HTML.
>
> I wrote a script to import nodes using the latter and then saved them with
> "node_save()".
>
> An alternative could be to parse to CSV, then import using the node_export
> or node_import modules.
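
A minimal sketch of the "parse to CSV" step, written in Python rather than PHP and using only the standard library. The column names (title, url, provider) and the function name are assumptions made for this illustration; what node_import actually expects depends on how the import is configured.

```python
import csv
import io


def resources_to_csv(rows):
    """Serialize scraped rows (dicts) to CSV text with a header line.

    The field names are hypothetical; adjust them to whatever the
    import step is configured to read.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["title", "url", "provider"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example row; the URL is a placeholder, not a real resource.
rows = [
    {
        "title": "Views, explained",
        "url": "http://example.com/a",
        "provider": "example.com",
    }
]
csv_text = resources_to_csv(rows)
```

The csv module quotes any field that contains a comma, so titles with punctuation survive the round trip into whichever importer reads the file.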
>
> Hope that helps,
>
> Victor Kane
> http://awebfactory.com.ar
> http://projectflowandtracker.com
>
>
> On Wed, Dec 1, 2010 at 7:46 AM, Balazs Dianiska <csillagasz at gmail.com> wrote:
>
>> Sadly, some older legacy sites are just not available as RSS; I
>> had such a scraping request recently. I have to say that with
>> drupal_http_request you don't even have to look at curl. You can do
>> all sorts of things, even faking logins.
>>
>> To parse the HTML, use QueryPath. A trick that we use is to first run
>> some sort of HTML tidy-up library on the downloaded page; otherwise
>> QueryPath runs away crying. The beautify module can help you a great
>> deal with that.
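
A rough sketch of the parse step for the original question (page title plus provider), in Python with the standard library only. It is not QueryPath or drupal_http_request code: fetching is left out, the HTML is assumed to be already downloaded, and treating the URL's host name as the "provider" is an assumption made for this example. The class name, function name, and sample URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and newlines into single spaces.
        return " ".join("".join(self._chunks).split())


def scrape_resource(url, html):
    """Return the page title and a guessed provider for a Resource link.

    Assumption: the provider is simply the host name of the URL.
    """
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "provider": urlparse(url).hostname}


# Deliberately sloppy markup: unclosed <p> and <div>, no </head> or </html>.
sample_html = "<html><head><title>  Theming a\n node form </title><body><p>unclosed<div>"
info = scrape_resource("http://www.lullabot.com/articles/example", sample_html)
```

html.parser is event-based and does not build a tree, so unclosed tags elsewhere in the page do not stop it from reaching the title. A tree-based parser may still want the tidy-up pass described above.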
>>
>> Balazs
>>
>> On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <cweagans at gmail.com> wrote:
>> > Most of the time, you can get to the posts via RSS. Aggregator module
>> > does a pretty good job of pulling stuff in, and the author of the post
>> > that's displayed is whatever you tell it to display (see Drupal Planet
>> > for an example).
>> > Thanks,
>> > Cameron
>> >
>> >
>> >
>> > On Tue, Nov 30, 2010 at 12:48, Kevin O <nowarninglabel at gmail.com> wrote:
>> >>
>> >> I second the recommendation of using QueryPath. I use it almost
>> >> exclusively along with drupal_http_request, though I use curl in a
>> >> few places (if you use curl, I recommend http://drupal.org/project/curl
>> >> for a dependency check). I'd really recommend, though, creating a
>> >> custom module that uses the above and contains your filtering logic;
>> >> I've done this for about a dozen modules now.
>> >>
>> >> That said, there are more modules available nowadays, such as
>> >> http://drupal.org/project/feeds_xpathparser used with Feeds
>> >> (http://drupal.org/project/feeds). There are about a dozen more
>> >> modules that will accomplish the goal, though I haven't used them;
>> >> I did go through and try most of the methods for some recent projects.
>> >> Cheers,
>> >> Kevin O'Brien
>> >> Drupal Developer
>> >> http://www.coderintherye.com
>> >> 415-754-0112
>> >>
>> >>
>> >> On Tue, Nov 30, 2010 at 11:26 AM, <development-request at drupal.org> wrote:
>> >>>
>> >>> Today's Topics:
>> >>>
>> >>>   1. Drupal module for scraping information from an HTML/XML
>> >>>      document (James Benstead)
>> >>>   2. Re: Drupal module for scraping information from an HTML/XML
>> >>>      document (John Fiala)
>> >>>   3. Easter problem (Ámon Tamás)
>> >>>   4. Re: Easter problem (Carl Wiedemann)
>> >>>   5. Re: Easter problem (larry at garfieldtech.com)
>> >>>   6. Re: Easter problem (jeff at ayendesigns.com)
>> >>>   7. Re: Easter problem (larry at garfieldtech.com)
>> >>>   8. Re: Easter problem (Jennifer Hodgdon)
>> >>>
>> >>>
>> >>> ----------------------------------------------------------------------
>> >>>
>> >>> Message: 1
>> >>> Date: Tue, 30 Nov 2010 18:56:09 +0000
>> >>> From: James Benstead <james.benstead at gmail.com>
>> >>> Subject: [development] Drupal module for scraping information from an
>> >>>        HTML/XML document
>> >>> To: development <development at drupal.org>
>> >>> Message-ID:
>> >>>        <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV at mail.gmail.com>
>> >>> Content-Type: text/plain; charset="iso-8859-1"
>> >>>
>> >>> I've finally got round to doing some serious work on Drupalversity, an
>> >>> open, web-based Drupal education project I've had in mind for a year
>> >>> or so.
>> >>>
>> >>> People who use Drupalversity to learn have the option of adding
>> >>> Resources to the site - i.e., links to posts at Lullabot, Chapter3 etc
>> >>> that explain how to do specific things with Drupal. A Resource is a
>> >>> custom content type that includes a link to the resource and a text
>> >>> field containing a description of that resource.
>> >>>
>> >>> What I'd like to do once a Resource has been added to the site is to
>> >>> scrape certain information from it: at this point I'm thinking the
>> >>> Title of the page the link points to and the provider of the resource -
>> >>> e.g., which Drupal shop originally created the resource. What's the
>> >>> best way to go about doing this? I'm pretty sure there's not a Drupal
>> >>> module that solves the problem out of the box.
>> >>>
>> >>> So far I've considered:
>> >>>
>> >>>   - http://drupal.org/project/querypath
>> >>>   - Drupal's built-in drupal_http_request() -
>> >>>     http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
>> >>>   - curl
>> >>>
>> >>> Thanks,
>> >>>
>> >>> --Jim
>> >>> --
>> >>> My IM and Skype details are at http://state68.com/contact
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 2
>> >>> Date: Tue, 30 Nov 2010 12:06:33 -0700
>> >>> From: John Fiala <jcfiala at gmail.com>
>> >>> Subject: Re: [development] Drupal module for scraping information from
>> >>>        an HTML/XML document
>> >>> To: development at drupal.org
>> >>> Message-ID:
>> >>>        <AANLkTi=N6WxHfigUC4ZopfxswMBv8bj7BZZJErHmko_T at mail.gmail.com>
>> >>> Content-Type: text/plain; charset=ISO-8859-1
>> >>>
>> >>> These days, if I'm going to be trying to extract data from html/xml,
>> >>> I'd use querypath.  Give it a try!
>> >>>
>> >>> --
>> >>> John Fiala
>> >>> www.jcfiala.net
>> >>>
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 3
>> >>> Date: Tue, 30 Nov 2010 20:14:04 +0100
>> >>> From: Ámon Tamás <amont at 5net.hu>
>> >>> Subject: [development] Easter problem
>> >>> To: development at drupal.org
>> >>> Message-ID:
>> >>>        <AANLkTikmKoVkedks2FkWUbHRq9sNTe6r0iX+iMjmBtvy at mail.gmail.com>
>> >>> Content-Type: text/plain; charset="utf-8"
>> >>>
>> >>> Hello,
>> >>>
>> >>> I have the nameday module (http://drupal.org/project/nameday) and I
>> >>> got a feature request for Greek namedays. As I see it, they are based
>> >>> on Easter, which is not an easy thing to calculate.
>> >>>
>> >>> So I want to find an algorithm for Easter and similar days whose
>> >>> results can be stored somehow. Maybe it should be a hook, or something
>> >>> else that can be stored in the database.
>> >>>
>> >>>
>> >>> Thanks
>> >>>
>> >>> --
>> >>> Ámon Tamás
>> >>> Site developer and programmer
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 4
>> >>> Date: Tue, 30 Nov 2010 12:22:42 -0700
>> >>> From: Carl Wiedemann <carl.wiedemann at gmail.com>
>> >>> Subject: Re: [development] Easter problem
>> >>> To: development at drupal.org
>> >>> Message-ID:
>> >>>        <AANLkTinD9Xz=3inJj2GraAuqde_=3yshJDwxCJzu12zr at mail.gmail.com>
>> >>> Content-Type: text/plain; charset="iso-8859-2"
>> >>>
>> >>> Does this help? http://php.net/manual/en/function.easter-days.php
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 5
>> >>> Date: Tue, 30 Nov 2010 13:24:07 -0600
>> >>> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
>> >>> Subject: Re: [development] Easter problem
>> >>> To: development at drupal.org
>> >>> Message-ID: <4CF54F57.2030602 at garfieldtech.com>
>> >>> Content-Type: text/plain; charset=UTF-8; format=flowed
>> >>>
>> >>> There's no need for a hook here at all.  You can either code in the
>> >>> algorithm for defining when Easter is (which sounds like it is in fact
>> >>> rather complicated) or just pre-store known, pre-calculated dates for
>> >>> the next decade or so.  (10 records, one per year; totally easy.)
>> >>>
>> >>> Both options are described here, including the different mechanisms
>> >>> for defining when Easter is in different calendars:
>> >>>
>> >>> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
>> >>>
>> >>> --Larry Garfield
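
For what it's worth, both calendar rules can be computed with integer arithmetic. The sketch below, in Python with the standard library only, uses the widely published anonymous Gregorian algorithm for Western Easter and Meeus's Julian algorithm for Orthodox Easter (the one Greek namedays would follow), then converts the Julian result to a Gregorian date. The function names are invented for this example, and it does not use PHP's easter_days() or easter_date().

```python
from datetime import date, timedelta


def gregorian_easter(year):
    """Western Easter Sunday (anonymous Gregorian algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def orthodox_easter(year):
    """Orthodox Easter Sunday, returned as a Gregorian calendar date."""
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    # Month and day in the Julian calendar.
    month, day = divmod(d + e + 114, 31)
    # Days between the Julian and Gregorian calendars (13 for 1900-2099).
    offset = year // 100 - year // 400 - 2
    return date(year, month, day + 1) + timedelta(days=offset)
```

Under these formulas the two dates coincide in 2010 (4 April) and 2011 (24 April), while in 2013 they fall on 31 March and 5 May respectively. The results could equally be pre-computed once and stored, one record per year.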
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 6
>> >>> Date: Tue, 30 Nov 2010 14:23:56 -0500
>> >>> From: jeff at ayendesigns.com
>> >>> Subject: Re: [development] Easter problem
>> >>> To: development at drupal.org
>> >>> Message-ID: <4CF54F4C.2060409 at ayendesigns.com>
>> >>> Content-Type: text/plain; charset="utf-8"
>> >>>
>> >>> You can google it, but I believe this is one of those things that
>> >>> cannot be reduced to an equation or algorithm. It's something like the
>> >>> first Sunday after the first full moon after the spring equinox.
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 7
>> >>> Date: Tue, 30 Nov 2010 13:26:23 -0600
>> >>> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
>> >>> Subject: Re: [development] Easter problem
>> >>> To: development at drupal.org
>> >>> Message-ID: <4CF54FDF.7070506 at garfieldtech.com>
>> >>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
>> >>>
>> >>> The Calendar PHP module is not enabled by default in a stock PHP, so I
>> >>> don't know that you can rely on it (unfortunately).  It does have some
>> >>> cool stuff in it, though.
>> >>>
>> >>> --Larry Garfield
>> >>>
>> >>> On 11/30/10 1:22 PM, Carl Wiedemann wrote:
>> >>> > Does this help? http://php.net/manual/en/function.easter-days.php
>> >>>
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 8
>> >>> Date: Tue, 30 Nov 2010 11:21:08 -0800
>> >>> From: Jennifer Hodgdon <yahgrp at poplarware.com>
>> >>> Subject: Re: [development] Easter problem
>> >>> To: development at drupal.org
>> >>> Message-ID: <4CF54EA4.1050502 at poplarware.com>
>> >>> Content-Type: text/plain; charset=UTF-8; format=flowed
>> >>>
>> >>> http://php.net/manual/en/function.easter-date.php
>> >>>
>> >>> --
>> >>> Jennifer Hodgdon * Poplar ProductivityWare
>> >>> www.poplarware.com
>> >>> Drupal web sites and custom Drupal modules
>> >>>
>> >>>
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> --
>> >>> [ Drupal development list | http://lists.drupal.org/ ]
>> >>>
>> >>> End of development Digest, Vol 95, Issue 58
>> >>> *******************************************
>> >>
>> >
>> >
>>
>
>