Re: [development] Programmatic data importing from JSON source

18 Feb 2009

Hi David,

Sorry for the delayed reply... I chose to extend my trip by another day. A 
much-needed break that's paying off in general attitude adjustments and 
coffee consumption (way down).

So, I was likely not clear about my skill at programming PHP and/or 
Drupal modules: "not fantastic." I didn't mean to say that the feed is 
necessarily "so complicated"; more specifically, every parser I've tried 
botches it, as does each tweak I've made to those parsers. 
It's a predictable XML structure, but there are several different types 
of elements that defy simple parsing, IMO.

First, the main issue I always run into (mainly with the FeedAPI 
parsers) is numbered arrays. Any time there's a numeric array name, most 
parsers seem to ditch the whole tree below that element. The only 
exceptions are elements where the parser knows there's likely to be an 
array, like "tags" or such.

I did get around this with an admittedly-crude hack, where I told the 
parser to ignore sub-arrays after a certain point for a specific element 
that was causing me trouble (in this case, an enclosure element) while 
using the FeedAPI & Simplepie parser combo -- detailed here if anyone 
wants to read further... I apologize again for the hacky nature of my 
solution:
http://www.thisworked.com/content/drupal-feedapi-feed-element-mapper-missing...

The thing is, since that's of course a very bad hack and only 
specifically "solves" one exact problem with a known feed, there are 
several other places in the feed where nested arrays with numeric names 
are also ignored. I gave up on this ugly approach at this point, not 
wanting to further butcher the parser without just writing my own.
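
For what it's worth, the general idea I was missing is probably a 
recursive walk that keeps numeric keys instead of dropping them. Below is 
a minimal sketch of that idea in Python rather than PHP, purely to 
illustrate; the flatten() helper, the sample item and its field names are 
made up for the example and are not taken from the real feed or from any 
FeedAPI parser.

```python
def flatten(node, prefix=""):
    """Recursively flatten nested dicts/lists into {path: value} pairs.

    List positions (the "numeric array names") become part of the path
    instead of causing the subtree to be dropped.
    """
    flat = {}
    if isinstance(node, dict):
        children = node.items()
    elif isinstance(node, list):
        children = enumerate(node)
    else:
        # Leaf value: store it under the path built so far.
        flat[prefix] = node
        return flat
    for key, child in children:
        path = f"{prefix}/{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


# Hypothetical item, invented for illustration only.
item = {
    "title": "Example item",
    "enclosures": [
        {"url": "http://example.com/a.jpg", "type": "image/jpeg"},
        {"url": "http://example.com/b.jpg", "type": "image/jpeg"},
    ],
}

flat = flatten(item)
for path, value in flat.items():
    print(path, "=", value)
```

Each leaf ends up under a path such as "enclosures/1/url", which is the 
sort of flat name a mapper could presumably be pointed at.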

So... the main reason I failed so miserably with the XML version of this 
particular feed is the fact that they (the feed authors) have embedded a 
great deal of important data in the <summary> and <description> elements, 
nested inside a bunch of <dl><dt><dd> tags which seem like they 
should be parsable with proper care. In the JSON version, these are all 
very nicely extracted out into the root of the feed as separate elements.

I realize now, after much festering with RSS parsers, that the use of 
<summary>/<description> is done solely to get around the limitations of 
the RSS specs. Any RSS parser will strip off all sorts of special 
elements, so there's not much point in including them, I suppose. In my 
case, I could have dug the info out, but I understand their logic there 
-- typical RSS readers/parsers would butcher the data.

Here's a quick sample of what the <summary> tag looks like and why I 
again failed to figure out how to parse through it and get the data out 
of all the nested tags into mappable elements for Feed Element Mapper:

<summary type="xhtml">
 <div xmlns="http://www.w3.org/1999/xhtml">
  <dl>
    <dt>Recommended</dt>
        <dd class="recommended">FALSE</dd>
    <dt>Width</dt>
        <dd class="width">640</dd>
    <dt>Height</dt>
        <dd class="height">480</dd>
    <dt>Categories</dt>
        <dd class="categories">First, Second, Other</dd>
    (...)
  </dl>
 </div>
</summary>

The point of this weak example is that there are a bunch of very useful 
bits buried in the <summary> that I haven't been able to extract. I 
understand that an array recursion expert might swim right through 
there, but I didn't figure it out before giving up on that path.
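
To make the goal concrete, here is a rough sketch of the extraction I 
have in mind, written in Python with the standard library's 
xml.etree.ElementTree rather than as a FeedAPI parser. It assumes the 
<dd> elements always carry a class attribute naming the field, as in the 
sample above; the parse_summary() name is my own, and the elided "(...)" 
line is left out of the test input.

```python
import xml.etree.ElementTree as ET

# The <div> declares the XHTML namespace, so its children are namespaced.
XHTML = "{http://www.w3.org/1999/xhtml}"

SUMMARY = """<summary type="xhtml">
 <div xmlns="http://www.w3.org/1999/xhtml">
  <dl>
    <dt>Recommended</dt>
        <dd class="recommended">FALSE</dd>
    <dt>Width</dt>
        <dd class="width">640</dd>
    <dt>Height</dt>
        <dd class="height">480</dd>
    <dt>Categories</dt>
        <dd class="categories">First, Second, Other</dd>
  </dl>
 </div>
</summary>"""


def parse_summary(xml_text):
    """Map each <dd class="name">value</dd> to a {name: value} entry."""
    root = ET.fromstring(xml_text)
    fields = {}
    for dd in root.iter(XHTML + "dd"):
        name = dd.get("class")
        if name:
            fields[name] = (dd.text or "").strip()
    return fields


fields = parse_summary(SUMMARY)
categories = [c.strip() for c in fields["categories"].split(",")]
print(fields)
print(categories)
```

The values come out as plain strings, so anything like the width, the 
height or the TRUE/FALSE flag would still need converting before mapping.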

I do really think that a parser with FeedAPI /should/ be able to dig 
through that element and pull out all sorts of niceties, but I don't 
understand how.

...so, this email is getting way too long. Sorry. :P

Hopefully this gives enough of a picture. I will go answer other email now.

(Seriously sorry about the lack of brevity.)

Paul

David Metzler wrote:
Hey Paul,
I haven't done this with JSON, but I have written some XML-to-nested-array 
conversion code that might or might not help. Could you send 
an example of the feed and what makes it so complicated, so that we 
don't shower you with irrelevant solutions :). Is it that you're 
trying to parse data that's inside XML that makes it so nasty or 
something else?
You give me the impression that the JSON feed contains more 
information than the other feed. Is that true?
Dave
On Feb 15, 2009, at 6:33 AM, Paul Hoza wrote:
Hello folks,
I've been struggling for a long time (way too long) to get a data 
feed imported into CCK nodes. I've attempted a plethora of different 
strategies, including but not limited to: FeedAPI, Feed Element 
Mapper, custom parsers, hacking parsers, tweaking mappers, Yahoo! 
Pipes to create custom feed versions of the original feed, serialized 
PHP exports, etc.
I'm tired of the project, but I have to find a way to get this to work.
So, I found a few articles on programmatically creating CCK nodes and 
I'm hoping to connect with anyone who's had experience doing this 
with JSON data. There is an XML/RSS version of this feed I need, but 
it sucks compared to the JSON version, with respect to how much data 
is in there and how it's formatted. The XML version hides a lot of 
crucial info in a <summary> element... which I might be able to 
parse through separately, but RSS feed aggregators just ignore stuff 
in there. Again, I'd have to make a custom parser to get in there.
Here's an article that hits about as close as I've seen yet. I am 
leaving for a couple days, so I'll try to get something like this 
working when I get back, but I hoped to hear from anyone who's done 
the same thing. Information on using JSON data to create nodes is 
sparse, but this article hits pretty close to the mark:
https://secure.prolucid.com/node/43
I have read other posts about similar methods using 
drupal_execute(), et al., but they all talk only about XML as the data 
source. I haven't found anything talking about JSON or (un)serialized 
PHP sources.
What I really need is to do an initial import of the JSON feed 
into my CCK node type (it's a huge feed of 6,200+ items). After that, 
I want to check the feed every day for changes and create new daily 
nodes accordingly -- which is why FeedAPI really seemed like the 
ticket, aside from my massive struggles with making my own parser. 
For now, I'd be happy with a PHP script that I could call daily with 
cron.
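
The daily check itself could be as simple as comparing item IDs against 
the ones already imported. Here is a small Python sketch of that step 
only. It assumes the feed is a JSON array of objects that each have a 
unique "id" field, which is a guess and not something confirmed about the 
real feed; new_items() and the sample data are invented for illustration, 
and the actual node creation is left out.

```python
import json


def new_items(feed_json, seen_ids):
    """Return the feed items whose id has not been imported yet."""
    items = json.loads(feed_json)
    return [item for item in items if item["id"] not in seen_ids]


# Made-up sample feed; the real one has 6,200+ items.
feed = json.dumps([
    {"id": 1, "title": "First"},
    {"id": 2, "title": "Second"},
    {"id": 3, "title": "Third"},
])

seen = {1, 2}                  # ids stored by earlier runs
fresh = new_items(feed, seen)  # only these would become new nodes
seen.update(item["id"] for item in fresh)
print(fresh)
```

The initial import is then the same run with an empty set of seen IDs, 
and each later cron run only touches whatever is new.
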
Thanks for any feedback... sorry for the long post. Part rant, part 
plea. :)
Cheers,
Paul Hoza