[development] Drupal module for scraping information from an HTML/XML document

Cameron Eagans cweagans at gmail.com
Wed Dec 1 05:27:09 UTC 2010


Most of the time, you can get to the posts via RSS. The Aggregator module
does a pretty good job of pulling items in, and the author displayed for each
post is whatever you tell it to display (see Drupal Planet for an example).
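
For what it's worth, here is a rough sketch of the feed-level idea: read
the title, link and author of each post straight from the RSS. It is written
in Python rather than PHP so it runs without a Drupal install. The sample
feed, the parse_feed() name and the choice of <dc:creator> as the author
field are all assumptions for illustration; this is not how the Aggregator
module itself is implemented.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core <dc:creator> element that many feeds use.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical sample feed; a real one would be fetched from the site's RSS URL.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Drupal shop blog</title>
    <item>
      <title>Theming a node form</title>
      <link>http://example.com/blog/theming-a-node-form</link>
      <dc:creator>Example Author</dc:creator>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return (channel_title, [item dicts]) from an RSS 2.0 document."""
    channel = ET.fromstring(xml_text).find("channel")
    items = []
    for item in channel.findall("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Empty string when the feed does not name an author.
            "author": item.findtext("dc:creator", default="", namespaces=NS),
        })
    return channel.findtext("title", default=""), items


source, posts = parse_feed(SAMPLE_FEED)
for post in posts:
    print(source, "|", post["title"], "|", post["author"])
```

The channel title doubles as a reasonable guess at the "provider" of each
post, which is the part you would otherwise have to scrape.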

Thanks,
Cameron



On Tue, Nov 30, 2010 at 12:48, Kevin O <nowarninglabel at gmail.com> wrote:

> I second the recommendation of QueryPath. I use it almost exclusively,
> together with drupal_http_request(); I use curl in only a few places (if
> you use curl, I recommend http://drupal.org/project/curl for a dependency
> check). I'd really recommend creating a custom module that uses the above
> and holds your filtering logic. I've done this for about a dozen modules
> now.
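
A minimal sketch of the "fetch the page, then filter it in your own code"
approach described above, again in Python so it is runnable on its own. It
assumes the HTML has already been fetched (in Drupal that would be the job
of drupal_http_request() or curl) and uses the standard-library HTML parser
as a stand-in for QueryPath. The TitleParser and scrape_resource() names,
the sample URL and the rule "provider = host name without www." are
hypothetical choices, not anything from the thread.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def scrape_resource(url, html):
    """Return the page title and a guessed provider for a Resource link."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Collapse newlines and runs of spaces inside the title.
    title = " ".join("".join(parser.parts).split())
    # Assumption: the provider is identified by the host name of the link.
    host = urlparse(url).hostname or ""
    provider = host[4:] if host.startswith("www.") else host
    return {"title": title, "provider": provider}


page = "<html><head><title>\n  Views tutorial | Lullabot </title></head></html>"
print(scrape_resource("http://www.lullabot.com/articles/views-tutorial", page))
```

Keeping the extraction in one small function makes it easy to call when a
Resource is saved, and to swap in QueryPath selectors later.
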
>
> That said, there are more modules available nowadays, such as
> http://drupal.org/project/feeds_xpathparser used with Feeds
> (http://drupal.org/project/feeds). About a dozen more modules will
> accomplish the goal, though I haven't used them beyond trying most of the
> methods out for some recent projects.
>
> Cheers,
>
> Kevin O'Brien
> Drupal Developer
> http://www.coderintherye.com
> 415-754-0112
>
>
>
> On Tue, Nov 30, 2010 at 11:26 AM, <development-request at drupal.org> wrote:
>
>> Today's Topics:
>>
>>   1. Drupal module for scraping information from an HTML/XML
>>      document (James Benstead)
>>   2. Re: Drupal module for scraping information from an HTML/XML
>>      document (John Fiala)
>>   3. Easter problem (Ámon Tamás)
>>   4. Re: Easter problem (Carl Wiedemann)
>>   5. Re: Easter problem (larry at garfieldtech.com)
>>   6. Re: Easter problem (jeff at ayendesigns.com)
>>   7. Re: Easter problem (larry at garfieldtech.com)
>>   8. Re: Easter problem (Jennifer Hodgdon)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Tue, 30 Nov 2010 18:56:09 +0000
>> From: James Benstead <james.benstead at gmail.com>
>> Subject: [development] Drupal module for scraping information from an
>>        HTML/XML document
>> To: development <development at drupal.org>
>> Message-ID:
>>        <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV at mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>>
>> I've finally got round to doing some serious work on Drupalversity, an
>> open, web-based Drupal education project I've had in mind for a year or
>> so.
>>
>> People who use Drupalversity to learn have the option of adding Resources
>> to the site - i.e., links to posts at Lullabot, Chapter3 etc that explain
>> how to do specific things with Drupal. A Resource is a custom content type
>> that includes a link to the resource and a text field containing a
>> description of that resource.
>>
>> What I'd like to do once a Resource has been added to the site is to
>> scrape certain information from it: at this point I'm thinking the Title
>> of the page the link points to and the provider of the resource - e.g.,
>> which Drupal shop originally created the resource. What's the best way
>> to go about doing this? I'm pretty sure there's not a Drupal module that
>> solves the problem out of the box.
>>
>> So far I've considered:
>>
>>   - http://drupal.org/project/querypath
>>   - Drupal's built-in drupal_http_request():
>>     http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
>>   - curl
>>
>> Thanks,
>>
>> --Jim
>> --
>> My IM and Skype details are at http://state68.com/contact
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Tue, 30 Nov 2010 12:06:33 -0700
>> From: John Fiala <jcfiala at gmail.com>
>> Subject: Re: [development] Drupal module for scraping information from
>>        an HTML/XML document
>> To: development at drupal.org
>> Message-ID:
>>        <AANLkTi=N6WxHfigUC4ZopfxswMBv8bj7BZZJErHmko_T at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>>
>> These days, if I'm going to be trying to extract data from html/xml,
>> I'd use querypath.  Give it a try!
>>
>> On Tue, Nov 30, 2010 at 11:56 AM, James Benstead
>> <james.benstead at gmail.com> wrote:
>> > What I'd like to do once a Resource has been added to the site is to
>> > scrape certain information from it: at this point I'm thinking the Title
>> > of the page the link points to and the provider of the resource - e.g.,
>> > which Drupal shop originally created the resource. What's the best way
>> > to go about doing this? I'm pretty sure there's not a Drupal module that
>> > solves the problem out of the box.
>>
>> --
>> John Fiala
>> www.jcfiala.net
>>
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Tue, 30 Nov 2010 20:14:04 +0100
>> From: Ámon Tamás <amont at 5net.hu>
>> Subject: [development] Easter problem
>> To: development at drupal.org
>> Message-ID:
>>        <AANLkTikmKoVkedks2FkWUbHRq9sNTe6r0iX+iMjmBtvy at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello,
>>
>> I have the nameday module (http://drupal.org/project/nameday) and I got a
>> feature request for Greek namedays. As far as I can see, they are based on
>> Easter, which is not an easy thing to calculate.
>>
>> So I want to find an algorithm for Easter and similar days whose results
>> can be stored somehow. Maybe it should be a hook, or something else that
>> can be stored in the database.
>>
>>
>> Thanks
>>
>> --
>> Ámon Tamás
>> Site developer and programmer
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Tue, 30 Nov 2010 12:22:42 -0700
>> From: Carl Wiedemann <carl.wiedemann at gmail.com>
>> Subject: Re: [development] Easter problem
>> To: development at drupal.org
>> Message-ID:
>>        <AANLkTinD9Xz=3inJj2GraAuqde_=3yshJDwxCJzu12zr at mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-2"
>>
>> Does this help? http://php.net/manual/en/function.easter-days.php
>>
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Tue, 30 Nov 2010 13:24:07 -0600
>> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
>> Subject: Re: [development] Easter problem
>> To: development at drupal.org
>> Message-ID: <4CF54F57.2030602 at garfieldtech.com>
>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>
>> There's no need for a hook here at all.  You can either code in the
>> algorithm for defining when Easter is (which sounds like it is in fact
>> rather complicated) or just pre-store known, pre-calculated dates for it
>> for the next decade or so.  (10 records, one per year; totally easy.)
>>
>> Both options are described here, including the different mechanisms for
>> defining when Easter is in different calendars:
>>
>> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
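
To make the two options above concrete, here is a small sketch in Python
(not PHP, so it runs without the Calendar extension). It assumes the Greek
namedays follow Orthodox Easter, and it uses the Julian-calendar algorithm
usually attributed to Meeus, then shifts the result onto the Gregorian
calendar. The orthodox_easter() name and the 2011-2020 range are
illustrative choices only; check the output against a published table
before relying on it.

```python
from datetime import date, timedelta


def orthodox_easter(year):
    """Gregorian date of Orthodox Easter, via the Julian-calendar computus."""
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month = (d + e + 114) // 31        # 3 = March, 4 = April (Julian)
    day = (d + e + 114) % 31 + 1
    # Days the Julian calendar lags behind the Gregorian one (13 for 1900-2099).
    offset = year // 100 - year // 400 - 2
    # Treat the Julian month/day as a plain date, then add the calendar gap.
    return date(year, month, day) + timedelta(days=offset)


# Option 2 from the message: pre-calculate a decade and store one row per year.
easter_table = {year: orthodox_easter(year) for year in range(2011, 2021)}

for year, when in sorted(easter_table.items()):
    print(year, when.isoformat())
```

Either form works for the module: call the function on demand, or store
the ten rows once and look them up.
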
>>
>> --Larry Garfield
>>
>>
>>
>> ------------------------------
>>
>> Message: 6
>> Date: Tue, 30 Nov 2010 14:23:56 -0500
>> From: jeff at ayendesigns.com
>> Subject: Re: [development] Easter problem
>> To: development at drupal.org
>> Message-ID: <4CF54F4C.2060409 at ayendesigns.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> You can google it, but I believe this is one of those things that cannot
>> be reduced to an equation or algorithm. It's something like the first
>> Sunday after the first full moon after the spring equinox.
>>
>>
>> ------------------------------
>>
>> Message: 7
>> Date: Tue, 30 Nov 2010 13:26:23 -0600
>> From: "larry at garfieldtech.com" <larry at garfieldtech.com>
>> Subject: Re: [development] Easter problem
>> To: development at drupal.org
>> Message-ID: <4CF54FDF.7070506 at garfieldtech.com>
>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
>>
>> PHP's Calendar extension is not enabled by default in a stock PHP build,
>> so I don't know that you can rely on it (unfortunately).  It does have
>> some cool stuff in it, though.
>>
>> --Larry Garfield
>>
>> On 11/30/10 1:22 PM, Carl Wiedemann wrote:
>> > Does this help? http://php.net/manual/en/function.easter-days.php
>> >
>>
>>
>> ------------------------------
>>
>> Message: 8
>> Date: Tue, 30 Nov 2010 11:21:08 -0800
>> From: Jennifer Hodgdon <yahgrp at poplarware.com>
>> Subject: Re: [development] Easter problem
>> To: development at drupal.org
>> Message-ID: <4CF54EA4.1050502 at poplarware.com>
>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>
>> http://php.net/manual/en/function.easter-date.php
>>
>>
>> --
>> Jennifer Hodgdon * Poplar ProductivityWare
>> www.poplarware.com
>> Drupal web sites and custom Drupal modules
>>
>>
>>
>> ------------------------------
>>
>> --
>> [ Drupal development list | http://lists.drupal.org/ ]
>>
>> End of development Digest, Vol 95, Issue 58
>> *******************************************
>>
>
>