Re: [development] Drupal module for scraping information from an HTML/XML document

1 Dec 2010

Well, Python has "Beautiful Soup".

http://www.crummy.com/software/BeautifulSoup/

"You didn't write that awful page. You're just trying to get some data out
of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."

In PHP I have used http://simplehtmldom.sourceforge.net/ as a way of parsing
badly formed HTML.

I wrote a script to import nodes using the latter and then saved them with
"node_save()".

An alternative could be to parse to CSV, then import using the node_export
or node_import modules.
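
A rough sketch of the parse-to-CSV route, written in Python since Beautiful Soup came up above. It uses only the standard library (html.parser and csv) rather than Beautiful Soup or simplehtmldom. The class and function names, the sample input, and the "title"/"body" column names are assumptions for illustration; the columns a given import module expects may differ.

```python
import csv
import io
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and the visible text of a (possibly messy) page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def page_to_row(html):
    """Turn one HTML page into a dict with hypothetical 'title' and 'body' columns."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    title = " ".join("".join(parser.title_parts).split())
    return {"title": title, "body": " ".join(parser.text_parts)}


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "body"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling rows_to_csv([page_to_row(html) for html in pages]) gives CSV text that could be saved to a file and handed to an import module. The parser tolerates unclosed tags, but it is not a substitute for a real tidy pass on badly broken markup.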

Hope that helps,

Victor Kane
http://awebfactory.com.ar
http://projectflowandtracker.com

On Wed, Dec 1, 2010 at 7:46 AM, Balazs Dianiska <csillagasz@gmail.com> wrote:
...
Sadly, some of the older legacy sites are just not available via RSS; I
had such a scraping request recently. I have to say that with
drupal_http_request you don't even have to look at curl. You can do
all sorts of things, even faking logins.
To parse the HTML, use QueryPath. A trick that we use is to first run
some sort of HTML tidy-up library on the downloaded page; otherwise
QueryPath runs away crying. The beautify module can help you a great deal
with that.
Balazs
On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <cweagans@gmail.com> wrote:
...
Most of the time, you can get to the posts via RSS. The Aggregator module
does a pretty good job of pulling stuff in, and the author displayed for
each post is whatever you tell it to display (see Drupal Planet for an
example).
Thanks,
Cameron
On Tue, Nov 30, 2010 at 12:48, Kevin O <nowarninglabel@gmail.com> wrote:
...
I second the recommendation of using QueryPath. I use it almost
exclusively along with drupal_http_request, though I use curl only in a few
places (if you use curl, I recommend http://drupal.org/project/curl for a
dependency check). I'd really recommend, though, creating a custom module
that uses the above and holds your filtering logic. I've done this for
about a dozen modules now.
That said, there are some more modules available out there nowadays, such
as http://drupal.org/project/feeds_xpathparser used with Feeds
(http://drupal.org/project/feeds). There are about a dozen more modules
that will accomplish the goal. I haven't used them myself, though I did go
through and try most of the methods for some recent projects.
Cheers,
Kevin O'Brien
Drupal Developer
http://www.coderintherye.com
415-754-0112
On Tue, Nov 30, 2010 at 11:26 AM, <development-request@drupal.org>
wrote:
...
Today's Topics:
  1. Drupal module for scraping information from an HTML/XML
     document (James Benstead)
  2. Re: Drupal module for scraping information from an HTML/XML
     document (John Fiala)
  3. Easter problem (Ámon Tamás)
  4. Re: Easter problem (Carl Wiedemann)
  5. Re: Easter problem (larry@garfieldtech.com)
  6. Re: Easter problem (jeff@ayendesigns.com)
  7. Re: Easter problem (larry@garfieldtech.com)
  8. Re: Easter problem (Jennifer Hodgdon)
----------------------------------------------------------------------
Message: 1
Date: Tue, 30 Nov 2010 18:56:09 +0000
From: James Benstead <james.benstead@gmail.com>
Subject: [development] Drupal module for scraping information from an
       HTML/XML document
To: development <development@drupal.org>
Message-ID:
       <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV@mail.gmail.com>
...
...
Content-Type: text/plain; charset="iso-8859-1"
I've finally got round to doing some serious work on Drupalversity, an open,
web-based Drupal education project I've had in mind for a year or so.
People who use Drupalversity to learn have the option of adding Resources to
the site - i.e., links to posts at Lullabot, Chapter3 etc. that explain how
to do specific things with Drupal. A Resource is a custom content type that
includes a link to the resource and a text field containing a description of
that resource.
What I'd like to do once a Resource has been added to the site is to scrape
certain information from it: at this point I'm thinking the Title of the
page the link points to and the provider of the resource - e.g., which
Drupal shop originally created the resource. What's the best way to go about
doing this? I'm pretty sure there's not a Drupal module that solves the
problem out of the box.
So far I've considered:
- http://drupal.org/project/querypath
- Drupal's built-in drupal_http_request() -
  http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_r...
...
- curl
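
Whichever of the options above does the fetching, the extraction step itself is small. Here is a minimal sketch in Python (standard library only, not Drupal code) that pulls the page title and guesses the provider from the link's hostname. Treating the hostname as the provider is an assumption, since the thread does not say how the provider should be determined, and all function names here are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.seen_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.seen_title:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.seen_title = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with runs of whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def provider_from_url(url):
    """Assumption: the provider is the link's hostname without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def describe_resource(url, timeout=10):
    """Fetch the page (needs network access) and return its title and provider."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return {"title": extract_title(html), "provider": provider_from_url(url)}
```

In a Drupal module the same two steps would sit behind whatever fetches the page (for example drupal_http_request()), with the results saved onto the Resource node.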
Thanks,
--Jim
--
My IM and Skype details are at http://state68.com/contact