[development] Programmatic data importing from JSON source

Paul Hoza paulhoza at gmail.com
Wed Feb 18 10:42:29 UTC 2009


Hi David,

Sorry for the delayed reply... I chose to extend my trip another day. A 
much-needed break that's paying off in general attitude adjustments and 
coffee consumption (way down).

So, I was likely not clear about my skill at programming PHP and/or 
Drupal modules: "not fantastic." I didn't mean to say that the feed is 
necessarily "so complicated", but more specifically, it's botching every 
parser I've tried -- and each tweak I've tried to the given parsers. 
It's a predictable XML structure, but there are several different types 
of elements that defy simple parsing, IMO.

First, the main issues I always run into (with the FeedAPI parsers, 
mainly) are numbered arrays. Any time there's a numeric array name, most 
parsers seem to ditch the whole tree below that element. The only 
exceptions are elements where the parser already expects an array, 
like "tags" or such.

I did get around this with an admittedly-crude hack, where I told the 
parser to ignore sub-arrays after a certain point for a specific element 
that was causing me trouble (in this case, an enclosure element) while 
using the FeedAPI & Simplepie parser combo -- detailed here if anyone 
wants to read further... I apologize again for the hacky nature of my 
solution:
http://www.thisworked.com/content/drupal-feedapi-feed-element-mapper-missing-unnamed-elements-array

The thing is, since that's of course a very bad hack and only 
specifically "solves" one exact problem with a known feed, there are 
several other places in the feed where nested arrays with numeric names 
are also ignored. I gave up on this ugly approach at this point, not 
wanting to further butcher the parser without just writing my own.
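
For illustration only, here is a minimal sketch of the general idea of 
recursing into numerically-keyed sub-arrays instead of dropping them. It 
is written in Python rather than PHP, it is not FeedAPI or SimplePie 
code, and the flatten() helper and the sample item are hypothetical. 
Each nested value ends up under a dotted path such as "enclosure.0.url", 
which a mapper could then treat as an ordinary flat field:

```python
def flatten(value, prefix=""):
    """Walk dicts and lists recursively; list positions become numeric path parts."""
    flat = {}
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        # Leaf value: store it under the path built so far.
        flat[prefix] = value
        return flat
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


# Hypothetical parsed feed item with a numerically-indexed enclosure list.
item = {
    "title": "Example",
    "enclosure": [
        {"url": "http://example.com/a.jpg", "type": "image/jpeg"},
        {"url": "http://example.com/b.jpg", "type": "image/jpeg"},
    ],
}

flat = flatten(item)
for path in sorted(flat):
    print(path, "=", flat[path])
```

The trade-off is that the mapper has to know the dotted paths in 
advance, but nothing below a numeric key is silently lost.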


So... the main reason I failed so miserably with the XML version of this 
particular feed is that they (the feed authors) have embedded a great 
deal of important data in the <summary> and <description> elements, 
but nested inside a bunch of <dl><dt><dd> tags which seem like they 
should be parsable with proper care. In the JSON version, these are all 
very nicely extracted out into the root of the feed as separate elements.

I realize now, after much wrestling with RSS parsers, that the use of 
<summary>/<description> is done solely to get around the limitations of 
the RSS specs. Any RSS parser will strip off all sorts of special 
elements, so there's not much point in including them, I suppose. In my 
case, I could have dug the info out, but I understand their logic there 
-- typical RSS readers/parsers would butcher the data.

Here's a quick sample of what the <summary> tag looks like and why I 
again failed to figure out how to parse through it and get the data out 
of all the nested tags into mappable elements for Feed Element Mapper:

<summary type="xhtml">
 <div xmlns="http://www.w3.org/1999/xhtml">
  <dl>
    <dt>Recommended</dt>
    <dd class="recommended">FALSE</dd>
    <dt>Width</dt>
    <dd class="width">640</dd>
    <dt>Height</dt>
    <dd class="height">480</dd>
    <dt>Categories</dt>
    <dd class="categories">First, Second, Other</dd>
    (...)
  </dl>
 </div>
</summary>

The point of this weak example is that there are a bunch of very useful 
bits buried in the <summary> that I haven't been able to extract. I 
understand that an array recursion expert might swim right through 
there, but I didn't figure it out before giving up on that path.

I do really think that a parser with FeedAPI /should/ be able to dig 
through that element and pull out all sorts of niceties, but I don't 
understand how.
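
As a rough sketch of what that extraction could look like (in Python 
rather than PHP, and not tied to FeedAPI in any way), the <dd> elements 
can be collected into a flat mapping keyed by their class attribute. The 
summary_to_fields() name is made up for this example, and it assumes the 
<summary> content is well-formed XHTML like the sample above:

```python
import xml.etree.ElementTree as ET

# The <div> declares the XHTML namespace, so dl/dt/dd inherit it.
XHTML = "{http://www.w3.org/1999/xhtml}"

SAMPLE = """<summary type="xhtml">
 <div xmlns="http://www.w3.org/1999/xhtml">
  <dl>
    <dt>Recommended</dt>
    <dd class="recommended">FALSE</dd>
    <dt>Width</dt>
    <dd class="width">640</dd>
    <dt>Height</dt>
    <dd class="height">480</dd>
    <dt>Categories</dt>
    <dd class="categories">First, Second, Other</dd>
  </dl>
 </div>
</summary>"""


def summary_to_fields(summary_xml):
    """Return {class attribute: text} for every <dd> inside the summary."""
    root = ET.fromstring(summary_xml)
    fields = {}
    for dd in root.iter(XHTML + "dd"):
        key = dd.get("class")
        if key:
            fields[key] = "".join(dd.itertext()).strip()
    return fields


fields = summary_to_fields(SAMPLE)
for name, value in fields.items():
    print(name, "=", value)
```

Keying on the class attribute rather than the <dt> label is a choice 
made for this sketch, on the assumption that the class names are the 
more stable of the two.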


...so, this email is getting way too long. Sorry. :P

Hopefully this gives enough of a picture. I will go answer other email now.

(Seriously sorry about the lack of brevity.)

Paul





David Metzler wrote:
> Hey Paul,
>
> I haven't done this with JSON, but have written some XML to nested 
> array conversion stuff that might help, but might not. Could you shoot 
> an example of the feed and what makes it so complicated so that we 
> don't shower you with irrelevant solutions :). Is it that you're 
> trying to parse data that's inside XML that makes it so nasty or 
> something else?
>
> You give me the impression that the JSON feed contains more 
> information than the other feed. Is that true?
>
> Dave
> On Feb 15, 2009, at 6:33 AM, Paul Hoza wrote:
>
>> Hello folks,
>>
>> I've been struggling for a long time (way too long) to get a data 
>> feed imported into CCK nodes. I've attempted a plethora of different 
>> strategies, including but not limited to: FeedAPI, Feed Element 
>> Mapper, custom parsers, hacking parsers, tweaking mappers, Yahoo! 
>> Pipes to create custom feed versions of the original feed, serialized 
>> PHP exports, etc.
>>
>> I'm tired of the project, but I have to find a way to get this to work.
>>
>> So, I found a few articles on programmatically creating CCK nodes and 
>> I'm hoping to connect with anyone who's had experience doing this 
>> with JSON data. There is an XML/RSS version of this feed I need, but 
>> it sucks compared to the JSON version, with respect to how much data 
>> is in there and how it's formatted. The XML version hides a lot of 
>> crucial info in a <summary> element... which I might be able to 
>> parse through separately, but RSS feed aggregators just ignore stuff 
>> in there. Again, I'd have to make a custom parser to get in there.
>>
>> Here's an article that hits about as close as I've seen yet. I am 
>> leaving for a couple days, so I'll try to get something like this 
>> working when I get back, but I hoped to hear from anyone who's done 
>> the same thing. Information on using JSON data to create nodes is 
>> sparse, but this article hits pretty close to the mark:
>> https://secure.prolucid.com/node/43
>> I had read other posts about doing similar methods using 
>> drupal_execute(), et al., but they all talk only about XML as data 
>> source. I haven't found anything talking about JSON or (un)serialized 
>> PHP sources.
>>
>> What I really need to do is do an initial import of the JSON feed 
>> into my CCK node (which is a huge feed of 6,200+ items). After that, 
>> I want to check the feed every day for changes and create new daily 
>> nodes accordingly -- which is why FeedAPI really seemed like the 
>> ticket, aside from my massive struggles with making my own parser. 
>> For now, I'd be happy with a PHP script that I could call daily with 
>> cron.
>>
>>
>> Thanks for any feedback... sorry for the long post. Part rant, part 
>> plea. :)
>>
>> Cheers,
>> Paul Hoza
>>


