Thanks guys - looks like QueryPath is the way forward :)

--Jim
--
My IM and Skype details are at http://state68.com/contact


On 1 December 2010 11:42, Victor Kane <victorkane@gmail.com> wrote:
Well, Python has "Beautiful Soup". 

http://www.crummy.com/software/BeautifulSoup/

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."

In PHP I have used http://simplehtmldom.sourceforge.net/ as a way of parsing badly formed HTML.

I wrote a script to import nodes using the latter and then saved them with "node_save()". 

An alternative could be to parse to CSV, then import using the node_export or node_import modules.
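
A minimal sketch of the "parse to CSV" step, in Python rather than PHP and using only the standard library. The function name `rows_to_csv` and the `title`/`url` columns are hypothetical; the columns an import module actually expects would depend on your content type.

```python
import csv
import io


def rows_to_csv(rows, fieldnames=("title", "url")):
    """Serialize scraped records (a list of dicts) to CSV text.

    The column names are illustrative assumptions, not a format
    required by any particular import module.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames))
    writer.writeheader()      # first line: the column names
    writer.writerows(rows)    # values containing commas are quoted for us
    return buffer.getvalue()
```

Writing through the csv module instead of joining strings by hand means titles containing commas or quotes stay intact.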

Hope that helps,

Victor Kane
http://awebfactory.com.ar
http://projectflowandtracker.com


On Wed, Dec 1, 2010 at 7:46 AM, Balazs Dianiska <csillagasz@gmail.com> wrote:
Sadly, some of the older legacy sites are just not available via RSS; I
had such a scraping request recently. I have to say that with
drupal_http_request you don't even have to look at curl. You can do
all sorts of things, even faking logins.

To parse the HTML, use QueryPath. A trick that we use is to first run
some sort of HTML tidy-up library on the downloaded page; otherwise
QueryPath runs away crying. The beautify module can help you a great deal
with that.
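
As a rough illustration of the parse step for the page title, here is a minimal sketch in Python rather than PHP. It uses the standard library's html.parser instead of QueryPath; that parser is lenient with malformed markup, so this sketch skips the tidy-up pass. `TitleParser` and `extract_title` are hypothetical names, and the HTML string is assumed to have been downloaded already.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore any later <title> elements.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    title = " ".join("".join(parser.parts).split())
    return title or None
```

The same shape (fetch, optionally tidy, then select one element) should carry over to drupal_http_request plus QueryPath, though the calls themselves will differ.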

Balazs

On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <cweagans@gmail.com> wrote:
> Most of the time, you can get to the posts via RSS. Aggregator module does a
> pretty good job of pulling stuff in, and the author of the post that's
> displayed is whatever you tell it to display (see Drupal Planet for an
> example)
> Thanks,
> Cameron
>
>
>
> On Tue, Nov 30, 2010 at 12:48, Kevin O <nowarninglabel@gmail.com> wrote:
>>
>> I second the recommendation of using QueryPath. I use it almost
>> exclusively along with drupal_http_request, and I use curl in only a few
>> places (if you use curl, I recommend http://drupal.org/project/curl for a
>> dependency check). I'd really recommend creating a custom module that
>> uses the above and holds your filtering logic; I've done this
>> for about a dozen modules now.
>> That said, there are more modules available nowadays, such as
>> http://drupal.org/project/feeds_xpathparser used with Feeds
>> (http://drupal.org/project/feeds). There are about a dozen more modules that
>> will accomplish the goal; I haven't used them, but I did go through and
>> try most of the methods for some recent projects.
>> Cheers,
>> Kevin O'Brien
>> Drupal Developer
>> http://www.coderintherye.com
>> 415-754-0112
>>
>>
>> On Tue, Nov 30, 2010 at 11:26 AM, <development-request@drupal.org> wrote:
>>>
>>> Today's Topics:
>>>
>>>   1. Drupal module for scraping information from an HTML/XML
>>>      document (James Benstead)
>>>   2. Re: Drupal module for scraping information from an HTML/XML
>>>      document (John Fiala)
>>>   3. Easter problem (Ámon Tamás)
>>>   4. Re: Easter problem (Carl Wiedemann)
>>>   5. Re: Easter problem (larry@garfieldtech.com)
>>>   6. Re: Easter problem (jeff@ayendesigns.com)
>>>   7. Re: Easter problem (larry@garfieldtech.com)
>>>   8. Re: Easter problem (Jennifer Hodgdon)
>>>
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Tue, 30 Nov 2010 18:56:09 +0000
>>> From: James Benstead <james.benstead@gmail.com>
>>> Subject: [development] Drupal module for scraping information from an
>>>        HTML/XML document
>>> To: development <development@drupal.org>
>>> Message-ID:
>>>        <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV@mail.gmail.com>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> I've finally got round to doing some serious work on Drupalversity, an
>>> open, web-based Drupal education project I've had in mind for a year or so.
>>>
>>> People who use Drupalversity to learn have the option of adding Resources
>>> to the site - i.e., links to posts at Lullabot, Chapter3 etc. that explain
>>> how to do specific things with Drupal. A Resource is a custom content type
>>> that includes a link to the resource and a text field containing a
>>> description of that resource.
>>>
>>> What I'd like to do once a Resource has been added to the site is to
>>> scrape certain information from it: at this point I'm thinking the Title of
>>> the page the link points to and the provider of the resource - e.g., which
>>> Drupal shop originally created the resource. What's the best way to go
>>> about doing this? I'm pretty sure there's not a Drupal module that solves
>>> the problem out of the box.
>>>
>>> So far I've considered:
>>>
>>>   - http://drupal.org/project/querypath
>>>   - Drupal's built-in drupal_http_request() -
>>>     http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
>>>   - curl
>>>
>>> Thanks,
>>>
>>> --Jim
>>> --
>>> My IM and Skype details are at http://state68.com/contact
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Tue, 30 Nov 2010 12:06:33 -0700
>>> From: John Fiala <jcfiala@gmail.com>
>>> Subject: Re: [development] Drupal module for scraping information from
>>>        an HTML/XML document
>>> To: development@drupal.org
>>> Message-ID:
>>>        <AANLkTi=N6WxHfigUC4ZopfxswMBv8bj7BZZJErHmko_T@mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> These days, if I'm going to be trying to extract data from html/xml,
>>> I'd use querypath.  Give it a try!
>>>
>>> On Tue, Nov 30, 2010 at 11:56 AM, James Benstead
>>> <james.benstead@gmail.com> wrote:
>>> > What I'd like to do once a Resource has been added to the site is to
>>> > scrape certain information from it: at this point I'm thinking the
>>> > Title of the page the link points to and the provider of the resource -
>>> > e.g., which Drupal shop originally created the resource. What's the best
>>> > way to go about doing this? I'm pretty sure there's not a Drupal module
>>> > that solves the problem out of the box.
>>>
>>> --
>>> John Fiala
>>> www.jcfiala.net
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 3
>>> Date: Tue, 30 Nov 2010 20:14:04 +0100
>>> From: Ámon Tamás <amont@5net.hu>
>>> Subject: [development] Easter problem
>>> To: development@drupal.org
>>> Message-ID:
>>>        <AANLkTikmKoVkedks2FkWUbHRq9sNTe6r0iX+iMjmBtvy@mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Hello,
>>>
>>> I have the nameday module (http://drupal.org/project/nameday) and I got a
>>> feature request for Greek namedays. As I see it, they are based on
>>> Easter, whose date is not easy to calculate.
>>>
>>> I want to find an algorithm for Easter and similar days whose results
>>> can be stored somehow. Maybe it should be a hook, or something else that
>>> can be stored in the database.
>>>
>>>
>>> Thanks
>>>
>>> --
>>> Ámon Tamás
>>> Site developer and programmer
>>>
>>> ------------------------------
>>>
>>> Message: 4
>>> Date: Tue, 30 Nov 2010 12:22:42 -0700
>>> From: Carl Wiedemann <carl.wiedemann@gmail.com>
>>> Subject: Re: [development] Easter problem
>>> To: development@drupal.org
>>> Message-ID:
>>>        <AANLkTinD9Xz=3inJj2GraAuqde_=3yshJDwxCJzu12zr@mail.gmail.com>
>>> Content-Type: text/plain; charset="iso-8859-2"
>>>
>>> Does this help? http://php.net/manual/en/function.easter-days.php
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 5
>>> Date: Tue, 30 Nov 2010 13:24:07 -0600
>>> From: "larry@garfieldtech.com" <larry@garfieldtech.com>
>>> Subject: Re: [development] Easter problem
>>> To: development@drupal.org
>>> Message-ID: <4CF54F57.2030602@garfieldtech.com>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>
>>> There's no need for a hook here at all.  You can either code in the
>>> algorithm for defining when Easter is (which sounds like it is in fact
>>> rather complicated) or just pre-store known, pre-calculated dates for it
>>> for the next decade or so.  (10 records, one per year; totally easy.)
>>>
>>> Both options are described here, including the different mechanisms for
>>> defining when Easter is in different calendars:
>>>
>>> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
>>>
>>> --Larry Garfield
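
Both options can be sketched briefly. The following is a minimal example in Python rather than PHP, assuming the Greek namedays follow Orthodox Easter. It uses the well-known Meeus Julian algorithm and then shifts the result by 13 days to get the Gregorian date, which is only valid for 1900 to 2099. The function name `orthodox_easter` is hypothetical.

```python
from datetime import date, timedelta


def orthodox_easter(year):
    """Gregorian date of Orthodox Easter (sketch, valid 1900-2099).

    Computes the Julian-calendar date with the Meeus Julian algorithm,
    then adds the 13-day Julian/Gregorian offset for those two centuries.
    """
    if not 1900 <= year <= 2099:
        raise ValueError("the 13-day offset only holds for 1900-2099")
    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month = (d + e + 114) // 31       # 3 = March, 4 = April (Julian)
    day = (d + e + 114) % 31 + 1
    # The Julian month/day is always a valid Gregorian month/day too,
    # so it can be built as a date and then shifted.
    return date(year, month, day) + timedelta(days=13)
```

The pre-stored variant is then a one-off table, for example `{year: orthodox_easter(year) for year in range(2011, 2021)}`, whose ten rows could be saved in the database.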
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 6
>>> Date: Tue, 30 Nov 2010 14:23:56 -0500
>>> From: jeff@ayendesigns.com
>>> Subject: Re: [development] Easter problem
>>> To: development@drupal.org
>>> Message-ID: <4CF54F4C.2060409@ayendesigns.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> You can google it, but I believe this is one of those things that cannot
>>> be reduced to an equation or algorithm. It's something like the first
>>> Sunday after the first full moon after the spring equinox.
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 7
>>> Date: Tue, 30 Nov 2010 13:26:23 -0600
>>> From: "larry@garfieldtech.com" <larry@garfieldtech.com>
>>> Subject: Re: [development] Easter problem
>>> To: development@drupal.org
>>> Message-ID: <4CF54FDF.7070506@garfieldtech.com>
>>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
>>>
>>> The Calendar PHP module is not enabled by default in a stock PHP, so I
>>> don't know that you can rely on it (unfortunately).  It does have some
>>> cool stuff in it, though.
>>>
>>> --Larry Garfield
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 8
>>> Date: Tue, 30 Nov 2010 11:21:08 -0800
>>> From: Jennifer Hodgdon <yahgrp@poplarware.com>
>>> Subject: Re: [development] Easter problem
>>> To: development@drupal.org
>>> Message-ID: <4CF54EA4.1050502@poplarware.com>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>
>>> http://php.net/manual/en/function.easter-date.php
>>>
>>> --
>>> Jennifer Hodgdon * Poplar ProductivityWare
>>> www.poplarware.com
>>> Drupal web sites and custom Drupal modules
>>>
>>
>
>