Well, Python has "Beautiful Soup". <div><br></div><div><a href="http://www.crummy.com/software/BeautifulSoup/">http://www.crummy.com/software/BeautifulSoup/</a></div>
<div><br></div><div>"<span class="Apple-style-span" style="font-family: 'Times New Roman'; font-size: medium; ">You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.</span></div>
<p style="font-family: 'Times New Roman'; font-size: medium; ">Neither does this parser"</p><div>In PHP I have used <a href="http://simplehtmldom.sourceforge.net/">http://simplehtmldom.sourceforge.net/</a> as a way of parsing badly formed HTML.</div>
<div><br></div><div>I wrote a script to import nodes using the latter and then saved them with "node_save()". </div><div><br></div><div>An alternative could be to parse to CSV, then import using the node_export or node_import modules.</div>
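<div><br></div><div>For the simplest case in the original question (just the title of the linked page), here is a minimal sketch in Python. It assumes only the standard library's html.parser rather than Beautiful Soup or simplehtmldom, and the names TitleParser and extract_title are made up for illustration. html.parser does not repair markup, but it tolerates unclosed tags well enough to pull out a title.</div>
<pre>
```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first title element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False  # True while inside the title element
        self._done = False      # True once the first title has been closed
        self._parts = []        # text fragments seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so line breaks inside the title vanish.
        return " ".join("".join(self._parts).split())


def extract_title(html_text):
    """Return the page title, or an empty string if none is found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```
</pre>
<div>Calling extract_title() on the body of a fetched page would give the text to store on the node; fetching the page and saving the node are left out of the sketch.</div>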
<div><br></div><div>Hope that helps,</div><div><br></div><div>Victor Kane</div><div><a href="http://awebfactory.com.ar">http://awebfactory.com.ar</a></div><div><a href="http://projectflowandtracker.com">http://projectflowandtracker.com</a><br>
<br><div class="gmail_quote">On Wed, Dec 1, 2010 at 7:46 AM, Balazs Dianiska <span dir="ltr"><<a href="mailto:csillagasz@gmail.com">csillagasz@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Sadly, some older legacy sites are just not available via RSS; I<br>
had such a scraping request recently. I have to say that with<br>
drupal_http_request you don't even have to look at curl. You can do<br>
all sorts of things, even faking logins.<br>
<br>
To parse the HTML, use QueryPath. A trick that we use is to first run<br>
some sort of HTML tidy-up library on the downloaded page; otherwise<br>
QueryPath runs away crying. The beautify module can help you a great deal<br>
with that.<br>
<font color="#888888"><br>
Balazs<br>
</font><div><div></div><div class="h5"><br>
On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <<a href="mailto:cweagans@gmail.com">cweagans@gmail.com</a>> wrote:<br>
> Most of the time, you can get to the posts via RSS. Aggregator module does a<br>
> pretty good job of pulling stuff in, and the author of the post that's<br>
> displayed is whatever you tell it to display (see Drupal Planet for an<br>
> example)<br>
> Thanks,<br>
> Cameron<br>
><br>
><br>
><br>
> On Tue, Nov 30, 2010 at 12:48, Kevin O <<a href="mailto:nowarninglabel@gmail.com">nowarninglabel@gmail.com</a>> wrote:<br>
>><br>
>> I second the recommendation of using QueryPath. I use it almost<br>
>> exclusively along with drupal_http_request, and I use curl in only a few<br>
>> places (if you use curl I recommend <a href="http://drupal.org/project/curl" target="_blank">http://drupal.org/project/curl</a> for a<br>
>> dependency check). I'd really recommend, though, creating a custom module that<br>
>> uses the above and holds your filtering logic; I've done this<br>
>> for about a dozen modules now.<br>
>> That said, there are some more modules available out there nowadays, such<br>
>> as <a href="http://drupal.org/project/feeds_xpathparser" target="_blank">http://drupal.org/project/feeds_xpathparser</a> used with Feeds<br>
>> (<a href="http://drupal.org/project/feeds" target="_blank">http://drupal.org/project/feeds</a>). There are about a dozen more modules that<br>
>> will accomplish the goal; I haven't used them, but I did go through and<br>
>> try most of the methods out for some recent projects.<br>
>> Cheers,<br>
>> Kevin O'Brien<br>
>> Drupal Developer<br>
>> <a href="http://www.coderintherye.com" target="_blank">http://www.coderintherye.com</a><br>
>> 415-754-0112<br>
>><br>
>><br>
>> On Tue, Nov 30, 2010 at 11:26 AM, <<a href="mailto:development-request@drupal.org">development-request@drupal.org</a>> wrote:<br>
>>><br>
>>> Today's Topics:<br>
>>><br>
>>> 1. Drupal module for scraping information from an HTML/XML<br>
>>> document (James Benstead)<br>
>>> 2. Re: Drupal module for scraping information from an HTML/XML<br>
>>> document (John Fiala)<br>
>>> 3. Easter problem (Ámon Tamás)<br>
>>> 4. Re: Easter problem (Carl Wiedemann)<br>
>>> 5. Re: Easter problem (<a href="mailto:larry@garfieldtech.com">larry@garfieldtech.com</a>)<br>
>>> 6. Re: Easter problem (<a href="mailto:jeff@ayendesigns.com">jeff@ayendesigns.com</a>)<br>
>>> 7. Re: Easter problem (<a href="mailto:larry@garfieldtech.com">larry@garfieldtech.com</a>)<br>
>>> 8. Re: Easter problem (Jennifer Hodgdon)<br>
>>><br>
>>><br>
>>> ----------------------------------------------------------------------<br>
>>><br>
>>> Message: 1<br>
>>> Date: Tue, 30 Nov 2010 18:56:09 +0000<br>
>>> From: James Benstead <<a href="mailto:james.benstead@gmail.com">james.benstead@gmail.com</a>><br>
>>> Subject: [development] Drupal module for scraping information from an<br>
>>> HTML/XML document<br>
>>> To: development <<a href="mailto:development@drupal.org">development@drupal.org</a>><br>
>>> Message-ID:<br>
>>> <AANLkTi=<a href="mailto:AFhBkvyURzgwNB54Z%2Bq-rRj_B_uRLZbUUd3UV@mail.gmail.com">AFhBkvyURzgwNB54Z+q-rRj_B_uRLZbUUd3UV@mail.gmail.com</a>><br>
>>> Content-Type: text/plain; charset="iso-8859-1"<br>
>>><br>
>>> I've finally got round to doing some serious work on Drupalversity, an open,<br>
>>> web-based Drupal education project I've had in mind for a year or so.<br>
>>><br>
>>> People who use Drupalversity to learn have the option of adding Resources to<br>
>>> the site - i.e., links to posts at Lullabot, Chapter3 etc. that explain how<br>
>>> to do specific things with Drupal. A Resource is a custom content type that<br>
>>> includes a link to the resource and a text field containing a description of<br>
>>> that resource.<br>
>>><br>
>>> What I'd like to do once a Resource has been added to the site is to scrape<br>
>>> certain information from it: at this point I'm thinking the Title of the<br>
>>> page the link points to and the provider of the resource - e.g., which<br>
>>> Drupal shop originally created the resource. What's the best way to go about<br>
>>> doing this? I'm pretty sure there's not a Drupal module that solves the<br>
>>> problem out of the box.<br>
>>><br>
>>> So far I've considered:<br>
>>><br>
>>> - <a href="http://drupal.org/project/querypath" target="_blank">http://drupal.org/project/querypath</a><br>
>>> - Drupal's built-in drupal_http_request() - <a href="http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6" target="_blank">http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6</a><br>
>>> - curl<br>
>>><br>
>>> Thanks,<br>
>>><br>
>>> --Jim<br>
>>> --<br>
>>> My IM and Skype details are at <a href="http://state68.com/contact" target="_blank">http://state68.com/contact</a><br>
>>><br>
>>> ------------------------------<br>
>>><br>
>>> Message: 2<br>
>>> Date: Tue, 30 Nov 2010 12:06:33 -0700<br>
>>> From: John Fiala <<a href="mailto:jcfiala@gmail.com">jcfiala@gmail.com</a>><br>
>>> Subject: Re: [development] Drupal module for scraping information from<br>
>>> an HTML/XML document<br>
>>> To: <a href="mailto:development@drupal.org">development@drupal.org</a><br>
>>> Message-ID:<br>
>>> <AANLkTi=<a href="mailto:N6WxHfigUC4ZopfxswMBv8bj7BZZJErHmko_T@mail.gmail.com">N6WxHfigUC4ZopfxswMBv8bj7BZZJErHmko_T@mail.gmail.com</a>><br>
>>> Content-Type: text/plain; charset=ISO-8859-1<br>
>>><br>
>>> These days, if I'm going to be trying to extract data from html/xml,<br>
>>> I'd use querypath. Give it a try!<br>
>>><br>
>>><br>
>>> --<br>
>>> John Fiala<br>
>>> <a href="http://www.jcfiala.net" target="_blank">www.jcfiala.net</a><br>
>>><br>
>>><br>
>>> ------------------------------<br>
>>><br>
>>> Message: 3<br>
>>> Date: Tue, 30 Nov 2010 20:14:04 +0100<br>
>>> From: Ámon Tamás <<a href="mailto:amont@5net.hu">amont@5net.hu</a>><br>
>>> Subject: [development] Easter problem<br>
>>> To: <a href="mailto:development@drupal.org">development@drupal.org</a><br>
>>> Message-ID:<br>
>>> <<a href="mailto:AANLkTikmKoVkedks2FkWUbHRq9sNTe6r0iX%2BiMjmBtvy@mail.gmail.com">AANLkTikmKoVkedks2FkWUbHRq9sNTe6r0iX+iMjmBtvy@mail.gmail.com</a>><br>
>>> Content-Type: text/plain; charset="utf-8"<br>
>>><br>
>>> Hello,<br>
>>><br>
>>> I maintain the nameday module (<a href="http://drupal.org/project/nameday" target="_blank">http://drupal.org/project/nameday</a>) and I got a<br>
>>> feature request for Greek namedays. As I see it, they are based on<br>
>>> Easter, which is not an easy thing to calculate.<br>
>>><br>
>>> I want to find an algorithm for Easter and similar days whose results<br>
>>> can be stored somehow. Maybe it should be a hook, or some other thing that<br>
>>> can be stored in the database.<br>
>>><br>
>>><br>
>>> Thanks<br>
>>><br>
>>> --<br>
>>> Ámon Tamás<br>
>>> Site developer and programmer<br>
>>><br>
>>> ------------------------------<br>
>>><br>
>>> Message: 4<br>
>>> Date: Tue, 30 Nov 2010 12:22:42 -0700<br>
>>> From: Carl Wiedemann <<a href="mailto:carl.wiedemann@gmail.com">carl.wiedemann@gmail.com</a>><br>
>>> Subject: Re: [development] Easter problem<br>
>>> To: <a href="mailto:development@drupal.org">development@drupal.org</a><br>
>>> Message-ID:<br>
>>> <AANLkTinD9Xz=3inJj2GraAuqde_=<a href="mailto:3yshJDwxCJzu12zr@mail.gmail.com">3yshJDwxCJzu12zr@mail.gmail.com</a>><br>
>>> Content-Type: text/plain; charset="iso-8859-2"<br>
>>><br>
>>> Does this help? <a href="http://php.net/manual/en/function.easter-days.php" target="_blank">http://php.net/manual/en/function.easter-days.php</a><br>
>>><br>
>>><br>
>>> ------------------------------<br>
>>><br>
>>> Message: 5<br>
>>> Date: Tue, 30 Nov 2010 13:24:07 -0600<br>
>>> From: "<a href="mailto:larry@garfieldtech.com">larry@garfieldtech.com</a>" <<a href="mailto:larry@garfieldtech.com">larry@garfieldtech.com</a>><br>
>>> Subject: Re: [development] Easter problem<br>
>>> To: <a href="mailto:development@drupal.org">development@drupal.org</a><br>
>>> Message-ID: <<a href="mailto:4CF54F57.2030602@garfieldtech.com">4CF54F57.2030602@garfieldtech.com</a>><br>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed<br>
>>><br>
>>> There's no need for a hook here at all. You can either code in the<br>
>>> algorithm for defining when Easter is (which sounds like it is in fact<br>
>>> rather complicated) or just pre-store known, pre-calculated dates for it<br>
>>> for the next decade or so. (10 records, one per year; totally easy.)<br>
>>><br>
>>> Both options are described here, including the different mechanisms for<br>
>>> defining when Easter is in different calendars:<br>
>>><br>
>>> <a href="http://en.wikipedia.org/wiki/Easter#Date_of_Easter" target="_blank">http://en.wikipedia.org/wiki/Easter#Date_of_Easter</a><br>
>>><br>
>>> --Larry Garfield<br>
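<br>
As an illustration of the "code in the algorithm" option, here is a minimal sketch in Python (not PHP, and not part of the nameday module). It assumes the Orthodox date is the one wanted for Greek namedays, uses the Julian-calendar formula commonly attributed to Meeus, and applies a fixed 13-day shift to the Gregorian calendar that only holds for 1900 to 2099. The function name orthodox_easter is made up for illustration.<br>
<pre>
```python
from datetime import date, timedelta


def orthodox_easter(year):
    """Return the Gregorian date of Orthodox Easter (hypothetical helper).

    Computes the Julian-calendar date first, then shifts it by 13 days,
    which is the Julian/Gregorian difference for the years 1900 to 2099.
    """
    if year not in range(1900, 2100):
        raise ValueError("13-day calendar offset only holds for 1900-2099")
    a = year % 4
    b = year % 7
    c = year % 19                      # position in the 19-year lunar cycle
    d = (19 * c + 15) % 30             # offset of the Paschal full moon
    e = (2 * a + 4 * b - d + 34) % 7   # offset to the following Sunday
    month = (d + e + 114) // 31        # 3 = March, 4 = April (Julian calendar)
    day = (d + e + 114) % 31 + 1
    # The day and month are Julian; adding 13 days gives the Gregorian date.
    return date(year, month, day) + timedelta(days=13)
```
</pre>
The results could equally be pre-computed once and stored, one record per year, which matches the second option above.<br>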
>>><br>
>>><br>
>>><br>
>>> ------------------------------<br>
>>><br>
>>> Message: 6<br>
>>> Date: Tue, 30 Nov 2010 14:23:56 -0500<br>
>>> From: <a href="mailto:jeff@ayendesigns.com">jeff@ayendesigns.com</a><br>
>>> Subject: Re: [development] Easter problem<br>
>>> To: <a href="mailto:development@drupal.org">development@drupal.org</a><br>
>>> Message-ID: <<a href="mailto:4CF54F4C.2060409@ayendesigns.com">4CF54F4C.2060409@ayendesigns.com</a>><br>
>>> Content-Type: text/plain; charset="utf-8"<br>
>>><br>
>>> You can google it, but I believe this is one of those things that cannot<br>
>>> be reduced to an equation or algorithm. It's something like the first<br>
>>> Sunday after the first full moon after the spring equinox.<br>
>>><br>
>>><br>
>>> ------------------------------<br>
>>><br>
>>> Message: 7<br>
>>> Date: Tue, 30 Nov 2010 13:26:23 -0600<br>
>>> From: "<a href="mailto:larry@garfieldtech.com">larry@garfieldtech.com</a>" <<a href="mailto:larry@garfieldtech.com">larry@garfieldtech.com</a>><br>
>>> Subject: Re: [development] Easter problem<br>
>>> To: <a href="mailto:development@drupal.org">development@drupal.org</a><br>
>>> Message-ID: <<a href="mailto:4CF54FDF.7070506@garfieldtech.com">4CF54FDF.7070506@garfieldtech.com</a>><br>
>>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed<br>
>>><br>
>>> The Calendar PHP module is not enabled by default in a stock PHP, so I<br>
>>> don't know that you can rely on it (unfortunately). It does have some<br>
>>> cool stuff in it, though.<br>
>>><br>
>>> --Larry Garfield<br>
>>><br>
>>> On 11/30/10 1:22 PM, Carl Wiedemann wrote:<br>
>>> > Does this help? <a href="http://php.net/manual/en/function.easter-days.php" target="_blank">http://php.net/manual/en/function.easter-days.php</a><br>
>>><br>
>>><br>
>>> ------------------------------<br>
>>><br>
>>> Message: 8<br>
>>> Date: Tue, 30 Nov 2010 11:21:08 -0800<br>
>>> From: Jennifer Hodgdon <<a href="mailto:yahgrp@poplarware.com">yahgrp@poplarware.com</a>><br>
>>> Subject: Re: [development] Easter problem<br>
>>> To: <a href="mailto:development@drupal.org">development@drupal.org</a><br>
>>> Message-ID: <<a href="mailto:4CF54EA4.1050502@poplarware.com">4CF54EA4.1050502@poplarware.com</a>><br>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed<br>
>>><br>
>>> <a href="http://php.net/manual/en/function.easter-date.php" target="_blank">http://php.net/manual/en/function.easter-date.php</a><br>
>>><br>
>>><br>
>>> --<br>
>>> Jennifer Hodgdon * Poplar ProductivityWare<br>
>>> <a href="http://www.poplarware.com" target="_blank">www.poplarware.com</a><br>
>>> Drupal web sites and custom Drupal modules<br>
>>><br>
>>><br>
>>><br>
>>> ------------------------------<br>
>>><br>
>>> --<br>
>>> [ Drupal development list | <a href="http://lists.drupal.org/" target="_blank">http://lists.drupal.org/</a> ]<br>
>>><br>
>>> End of development Digest, Vol 95, Issue 58<br>
>>> *******************************************<br>
>><br>
><br>
><br>
</div></div></blockquote></div><br></div>