Re: [development] Data import and synchronization to nodes
Quoting Larry Garfield <larry@garfieldtech.com>:
So, I toss the brain-teaser out there: Is there a good way to have my nodes and import them too, or are these cases where nodes are simply the wrong tool and the direct-import-and-cache mechanisms described above are the optimal solutions?
Not that I've found, and I've spent several hours recently researching this. Chris Mellor and I have begun collaborating on this issue at http://portallink.linkmatics.com/gdf and have development staging at http://datafeed.progw.org. Help is welcome; we want to be able to feed all types of external data. Goals are being established and documented on the http://portallink.linkmatics.com/gdf pages. *Note* we are aware of all the existing modules and API and our plans are to use the existing things as well as create what is missing.
I've found http://drupal.org/project/feedparser, which will accept RSS, RDF or Atom feeds and create nodes or aggregated lists. I am successfully using that module, with a change documented in issue http://drupal.org/node/169865, at http://give-me-an-offer.com.
Earnie
@Earnie: Take a look at the FeedAPI SoC project; it includes pluggable XML parsing and node creation hooks.
@Larry:
My preference here is the old 'lazy instantiation' trick. [1]
Import the data and write a callback that will present the table view of courses, etc. You're dealing with structured data, so your callbacks should make it easy for people to browse the data (think MVC).
Keep a {data_node} lookup table.
When writing links for individual items, check the {data_node} table. If a record is found, write the link to node/NID; otherwise, write it to a node-generating callback that also inserts a record into {data_node}.
This way, you only create nodes when your users want to interact with them. Saves lots of processing overhead.
I have some sample code if you need it.
One drawback: if you want the data to be searchable, you either have to implement your own hook_search or wait for the nodes to be created.
- Ken
[1] http://barcelona2007.drupalcon.org/node/58
For the most part, what Ken said, particularly wrt the FeedAPI project, which looks incredibly strong.
WRT courses, I'd actually recommend creating nodes on import. This obviously depends on your use case, but in most situations where I've had to deal with this type of data, people are interacting with the courses almost immediately.
The other challenge will be to determine what constitutes an updated course and what constitutes a new course. Toward this end, capture as much specific information as you can in the import (short, of course, of a specific ID for each course, which makes it all so much easier :) ). Can you get semester info (e.g., fall 2007), instructor info, room location, description, etc.?
So, with this in mind, you'll need to determine when a course is new, and whether a change merits updating the existing node or creating a new one. This also gets at your data structure for courses and how granular it is: how much info is stored along with a course, and how much is stored as separate nodes or within separate tables? For example, does a course contain semester info? Room info?
Anyways, I look forward to hearing the solution you choose.
Cheers,
Bill
-- Bill Fitzgerald http://www.funnymonkey.com Tools for Teachers 503.897.7160
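One way to picture the new-versus-updated decision Bill raises, as a rough Python sketch. It assumes the feed has no course ID, so a natural key is built from a few fields and a hash of the remaining fields is used to spot changes. The field names and the choice of key fields are hypothetical; picking them is exactly the judgment call described above.

```python
import hashlib

# Assumption: these fields together identify a course when no ID is supplied.
KEY_FIELDS = ("title", "term")


def natural_key(record):
    """Build a comparison key from the identifying fields."""
    return tuple(record[name].strip().lower() for name in KEY_FIELDS)


def fingerprint(record):
    """Hash the non-key fields so that any change in them is detectable."""
    payload = "\x1f".join(
        f"{name}={record[name]}"
        for name in sorted(record)
        if name not in KEY_FIELDS
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(record, known):
    """Return 'new', 'updated' or 'unchanged', and remember the record.

    known maps natural keys to the fingerprint seen in the previous import.
    """
    key = natural_key(record)
    current = fingerprint(record)
    if key not in known:
        known[key] = current
        return "new"
    if known[key] != current:
        known[key] = current
        return "updated"
    return "unchanged"
```

With this key, a room change counts as an update to the existing course, while the same title in a different term counts as a new course. Moving "term" out of KEY_FIELDS would flip that second decision.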
Hi Ken. Yes, I'm using your lazy-import method now for another project and it's working well. The problem here is that not just the data but the structure of the data may change, and I need to be able to change it on demand, including en masse.
Lazy-import could work for the DocBook case, but what happens when a section is moved to another chapter? How does the code know that the new section is actually the old node, so that it has to be moved in the page tree and regenerated, and every page around it has to be regenerated (for next/prev links to be rebuilt)? The only unique identifier for it would be its XPath address, but that just changed. I don't know if I can require an ID on every element, as that would run into huge collision problems in some cases (e.g., /installation/gettingstarted vs. /writingmodule/gettingstarted; that causes me grief in the current DocBook system I hope to replace this way).
For courses, I have no guarantee that a course will even exist in the new import. While it does have a Course ID, and sections have a Section ID, Term, etc. (and joins without a surrogate key get ugly when there are 4 values in the primary key, which is the case in the system I am replacing with Drupal), I would need to detect and immediately delete courses not in the new CSV, so I'd be parsing the whole thing anyway.
That's why I don't think the usual lazy-import method would work here.
--Larry Garfield
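The full-pass requirement Larry describes for courses can be sketched as a set comparison between the new CSV and what the site already holds. This is an illustrative Python sketch, not Larry's code: the column names are hypothetical, and only three key columns are used here although the message mentions four values in the real primary key.

```python
import csv
import io

# Assumption: hypothetical column names standing in for the real composite key.
KEY_COLUMNS = ("course_id", "section_id", "term")


def row_key(row):
    """Composite key for one CSV row."""
    return tuple(row[column] for column in KEY_COLUMNS)


def plan_sync(csv_text, existing):
    """Compare a complete CSV export with the nodes already on the site.

    existing maps composite keys to node IDs. Returns the keys to create,
    the keys to update, and the node IDs to delete because their course
    no longer appears in the export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    incoming = {row_key(row): row for row in reader}
    return {
        "create": sorted(key for key in incoming if key not in existing),
        "update": sorted(key for key in incoming if key in existing),
        "delete": sorted(
            nid for key, nid in existing.items() if key not in incoming
        ),
    }
```

The delete list can only be computed after every row has been read, which is why an import that must remove missing courses ends up parsing the whole file regardless of how node creation is scheduled.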
Questions like these are why I'm no good with edge cases or race conditions :-)
- Ken (not a CS guy, just a problem solver)
Quoting Ken Rickard <agentrickard@gmail.com>:
@Earnie: Take a look at the FeedAPI SoC project, it includes pluggable XML parsing and node creation hooks.
*Note* we are aware of all the existing modules and API and our plans are to use the existing things as well as create what is missing.
XML is only one data format. GDF hopes to provide a means of handling input and output for more formats than XML or RSS. There is a huge commercial need to be able to create nodes out of any format of feed.
Earnie
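To make the "any format of feed" idea concrete, here is a small Python sketch of a format-agnostic parser registry: each parser turns raw text into the same list of item dictionaries, so the node-creation step never needs to know the source format. This is not the GDF design and not FeedAPI's actual API; the registry, the function names, and the "title"/"description" field names are all assumptions for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

PARSERS = {}  # format name -> parser function


def parser(fmt):
    """Decorator that registers a parser function for one feed format."""
    def register(func):
        PARSERS[fmt] = func
        return func
    return register


@parser("csv")
def parse_csv(text):
    rows = csv.DictReader(io.StringIO(text))
    return [{"title": r["title"], "body": r["description"]} for r in rows]


@parser("json")
def parse_json(text):
    return [{"title": i["title"], "body": i["description"]} for i in json.loads(text)]


@parser("rss")
def parse_rss(text):
    root = ET.fromstring(text)
    return [
        {
            "title": item.findtext("title", default=""),
            "body": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


def parse_feed(fmt, text):
    """Dispatch to the registered parser; every parser returns the same shape."""
    try:
        parse = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}") from None
    return parse(text)
```

Supporting another format then means registering one more function; nothing downstream of parse_feed() changes.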
Cool. Check with Aron, but I'm pretty sure his parsing abstraction does not require XML. It's just the obvious choice for most uses.
- Ken
Yes, the FeedAPI parser abstraction is independent of XML.
Aron
Hi Earnie,
I am warming up an old thread here and I may be overlooking something, so please excuse me if I ask something that's already answered in the thread.
What's GDF? Sounds a lot like FeedAPI. Have you checked it out? http://drupal.org/project/feedapi
We are about to iron out the details on it; reviews, suggestions and patches are more than welcome at this moment.
Alex
-- Alexander Barth Development Seed http://www.developmentseed.org http://www.developmentseed.org/blog lx_barth(skype) alex_b(drupal.org) Tel. 202.250.3633 Fax. 806.214.6218
Quoting Alexander Barth <alex@developmentseed.org>:
What's GDF? Sounds a lot like FeedAPI. Have you checked it out? http://drupal.org/project/feedapi
See http://portallink.linkmatics.com/gdf for an explanation of GDF. You'll see that FeedAPI is one of the options.
We are about to iron out the details on it, reviews, suggestions and patches are more than welcome at this moment.
I am having great success with the Feedparser module, which is a replacement for the aggregator module (it cannot be used with aggregator activated), plus some Q&D PHP that maps data in to RSS out, with the description preformatted in a table layout to display the image with the text. I need to use the Full HTML filter for the data display. Actually, one of the feeds I receive worked with Feedparser with no manipulation required, and I patterned the Q&D after it.
I tried out FeedAPI with its predefined node processor and one of my feeds. It refused to use Full HTML even though I specified it in two places. Since I had something working, I didn't look into what I would need to make the processor work, but I added that to my round tuit list.
The data I am parsing and processing is of the affiliate publishing nature. The data can be in any format, and the description data may need filters as well. The description filter would be the same for the feeds coming from the same affiliate program.
I also see a need for category filters. The data provided contains the taxonomy that each provider uses, and that needs to be mapped to the site taxonomy structure.
The feed pull management also needs a filter for time of day. I have one program that provides data between the hours of x and y, so there is no need to try outside of that time. But I would like to try every hour between x and y. We also need to limit the number of pulls in that hour.
And I just remembered: the FeedAPI node processor doesn't allow all of the data in the feed to be populated, only x number. I need 100% of the data.
Earnie
-- http://for-my-kids.com/ -- http://give-me-an-offer.com/
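The time-of-day filter and per-hour pull limit Earnie asks for could look roughly like the Python sketch below. It is an assumption-laden illustration, not part of Feedparser, FeedAPI or GDF: the function name and parameters are hypothetical, the window is assumed not to cross midnight, and "that hour" is read as the current clock hour.

```python
from datetime import datetime


def may_pull(now, window_start, window_end, pulls, max_per_hour):
    """Decide whether a feed may be pulled right now.

    now           -- current time as a datetime
    window_start  -- first hour (0-23) in which the provider publishes data
    window_end    -- hour at which the window closes (exclusive)
    pulls         -- datetimes of earlier pulls of this feed
    max_per_hour  -- maximum number of pulls allowed within one clock hour
    """
    # Outside the provider's publishing window there is nothing to fetch.
    if not (window_start <= now.hour < window_end):
        return False
    # Count only the pulls made since the start of the current clock hour.
    hour_start = now.replace(minute=0, second=0, microsecond=0)
    recent = [moment for moment in pulls if moment >= hour_start]
    return len(recent) < max_per_hour
```

A cron-driven puller would call may_pull() before each attempt and append the current time to the feed's pull history after a successful fetch.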
participants (6)
- Alexander Barth
- Bill Fitzgerald
- Earnie Boyd
- Ken Rickard
- Larry Garfield
- Novák Áron