Well, Python has "Beautiful Soup": http://www.crummy.com/software/BeautifulSoup/

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."

In PHP I have used http://simplehtmldom.sourceforge.net/ as a way of parsing badly formed HTML. I wrote a script to import nodes using the latter and then saved them with node_save().

An alternative could be to parse to CSV, then import using the node_export or node_import modules.

Hope that helps,

Victor Kane
http://awebfactory.com.ar
http://projectflowandtracker.com

On Wed, Dec 1, 2010 at 7:46 AM, Balazs Dianiska <csillagasz@gmail.com> wrote:
Sadly, some of the older legacy sites are just not available in RSS; I had such a scraping request recently. I have to say that with drupal_http_request you don't even have to look at curl. You can do all sorts of things, even faking logins.
To parse the HTML, use QueryPath. A trick that we use is to first run some sort of HTML tidy-up library on the downloaded page; otherwise QueryPath runs away crying. The beautify module can help you a great deal with that.
Balazs
On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <cweagans@gmail.com> wrote:
Most of the time, you can get to the posts via RSS. The Aggregator module does a pretty good job of pulling stuff in, and the author of the post that's displayed is whatever you tell it to display (see Drupal Planet for an example).

Thanks,
Cameron
On Tue, Nov 30, 2010 at 12:48, Kevin O <nowarninglabel@gmail.com> wrote:
I second the recommendation of using QueryPath. I use it almost exclusively along with drupal_http_request, though I use curl in only a few places (if you use curl, I recommend http://drupal.org/project/curl for a dependency check).

That said, I'd really recommend creating a custom module that uses the above and holds your filtering logic. I've done this for about a dozen modules now.

There are also more modules available out there nowadays, such as http://drupal.org/project/feeds_xpathparser used with feeds, http://drupal.org/project/feeds. There are about a dozen more modules that will accomplish the goal. I haven't used them all, but I went through and tried most of the methods out for some recent projects.

Cheers,
Kevin O'Brien
Drupal Developer
http://www.coderintherye.com
415-754-0112
On Tue, Nov 30, 2010 at 11:26 AM, <development-request@drupal.org> wrote:
Message: 1
Date: Tue, 30 Nov 2010 18:56:09 +0000
From: James Benstead <james.benstead@gmail.com>
Subject: [development] Drupal module for scraping information from an HTML/XML document
To: development <development@drupal.org>
Message-ID: <AANLkTi=AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
I've finally got round to doing some serious work on Drupalversity, an open, web-based Drupal education project I've had in mind for a year or so.
People who use Drupalversity to learn have the option of adding Resources to the site, i.e., links to posts at Lullabot, Chapter3, etc. that explain how to do specific things with Drupal. A Resource is a custom content type that includes a link to the resource and a text field containing a description of that resource.
What I'd like to do once a Resource has been added to the site is to scrape certain information from it: at this point I'm thinking the Title of the page the link points to and the provider of the resource - e.g., which Drupal shop originally created the resource. What's the best way to go about doing this? I'm pretty sure there's not a Drupal module that solves the problem out of the box.
So far I've considered:
- http://drupal.org/project/querypath
- Drupal's built-in drupal_http_request() - http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_r...
- curl
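
To make the goal concrete, here is a rough sketch of the two values I'm after. It is in Python rather than Drupal code and uses only the standard library's lenient html.parser, not any of the options listed above. TitleParser, extract_title, provider_host and the sample URL path are made-up names for illustration, and reading the "provider" off the link's hostname is only an assumption about how it could be derived.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._chunks).split())


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def provider_host(url):
    """Assumption: treat the link's hostname (minus a leading www.) as the provider."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host
```

In use, extract_title() would be fed the body of whatever the HTTP request returns, and provider_host() the Resource's link, e.g. provider_host("http://www.lullabot.com/articles/example") for a made-up path. The parser does not insist on well-formed markup, so unclosed tags elsewhere in the page should not stop the title from being found.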
Thanks,
--Jim

--
My IM and Skype details are at http://state68.com/contact