[development] Data import and synchronization to nodes

Wim Leers work at wimleers.com
Tue Aug 28 13:51:17 UTC 2007


I'm facing a similar problem.

I have to import data from CSV files in a non-standardized format
(i.e. the columns may differ, and more or less information can be
specified per row, depending on the format used). The files can
contain a couple hundred nodes, but also a couple thousand. So
performance is critical. Even more so because these nodes are
actually the data I have to analyze, these analyses have to be
available immediately, they have to be updated daily, and the user
should be notified when certain thresholds are crossed. So lazy
instantiation is not an option for me.
When the imported data is updated, by uploading an updated CSV file,
all derived data should be updated as well.
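
To give an idea, here is a minimal sketch of the header-driven
parsing this needs (the file name is made up and validation is
omitted):

  // Use the header row to map whatever columns this particular file has.
  $handle = fopen('import.csv', 'r');
  $header = fgetcsv($handle);  // the first row names the columns
  while (($row = fgetcsv($handle)) !== FALSE) {
    // A row may have fewer fields than the header; pad the rest with NULL.
    $row = array_pad($row, count($header), NULL);
    $record = array_combine($header, $row);
    // ... validate $record and hand it to the SQL conversion step ...
  }
  fclose($handle);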

I'd have to use 4 node types, with various node references and other
CCK fields. drupal_execute() is a no-go because of the performance
requirements.
The Import/Export API does not allow easy conversion of one format to
another, so there was no point in using it; not to mention its lousy
performance (and memory usage). I do think the documentation is OK,
just a bit hard to digest. No other module comes close, so I wrote my
own.

I'm NOT using nodes. So I lose Views support and every other module
that operates on nodes. A pity, but I need the performance.
I've set up 4 tables with some UNIQUE indexes and am converting the
CSV files to SQL files. Every query is of this style:
"INSERT INTO ... ON DUPLICATE KEY UPDATE ...". If the data is new, it
gets inserted; if it duplicates an existing row on any unique index,
the existing row is updated. Performance is excellent with this
approach.
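
For example, with a UNIQUE index on (source, external_id) (all table
and column names here are made up), the conversion emits one
statement per CSV row:

  // Sketch of the CSV-to-SQL conversion step. Escaping is simplified;
  // a real converter must escape values properly for MySQL.
  $sql = sprintf(
    "INSERT INTO measurement (source, external_id, value, updated) "
    . "VALUES ('%s', '%s', %f, NOW()) "
    . "ON DUPLICATE KEY UPDATE value = VALUES(value), updated = NOW();",
    addslashes($record['source']),
    addslashes($record['external_id']),
    (float) $record['value']
  );
  fwrite($out, $sql . "\n");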

I'm not sure yet whether the Generic Data Feed project could fulfill
my requirements. I think defining a hook that can say "this
combination of fields marks a unique entity" would be necessary if
you want "smart updates". My biggest concern is: how could a generic
solution that also allows you to use CCK node types be performant
enough?
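
Purely hypothetically, such a hook could look like this (module,
type, and field names invented):

  // Hypothetical hook: declare which combination of fields identifies
  // a unique entity, so the importer can update rows in place instead
  // of creating duplicates.
  function mymodule_feed_unique_keys($type) {
    switch ($type) {
      case 'course':
        return array('course_code', 'term');
      case 'section':
        return array('course_code', 'term', 'section_number');
    }
  }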

Wim

mail     work at wimleers.com
mob.    0032 (0)495 83.63.68


On Aug 28, 2007, at 07:08, Larry Garfield wrote:

> Hi all.  Algorithmic question.  Actually, two questions that I feel  
> have
> related or complementary solutions, so I am tossing them out together.
>
> I have two scenarios I am looking at where I need to be able to  
> pull data in
> from an external source and make a node out of it, *and* be able to  
> fully
> update the entire dataset at any time from the external source.
>
> Scenario 1: Course listings for a school.  Courses come in as a CSV  
> file.  The
> CSV includes duplicate data for courses and sections of courses, as  
> well as
> fields like the instructor, which is an Instructor node type  
> elsewhere in the
> system.  It is not linked by node ID, however, but by instructor  
> name, and
> then a lookup needs to be done on import.  Typically imports are  
> done a few
> times a year, but at those times will be done easily a dozen times  
> in a few
> days.  Each import is several hundred records.
>
> Scenario 2: A DocBook source tree, such as the one sepeck is  
> building for the
> new Drupal.org documentation.  A given book needs to be made  
> available via
> Drupal pages, with the structure of the pages on the site matching  
> up with
> the DocBook tree.  Whether pages break at chapters/articles, sections,
> sub-sections, etc. should be admin controlled.  Each book could  
> potentially
> be hundreds of pages (for some admin-defined definition of page).
>
> In neither scenario does the incoming data have any knowledge of  
> "nodes".
> Also, in neither scenario do I need round-tripping, so if the data  
> is not
> editable via Drupal, or edits are always overwritten by a new  
> import, that is
> perfectly fine.
>
> Now, the non-node solution to both of these is reasonably  
> straightforward.
>
> In the first scenario, simply read course data into two separate  
> tables (one
> for courses, one for sections) with the appropriate foreign keys,  
> then set up
> custom menu callbacks for courses and sections and lists thereof  
> that display
> the desired information.  When a new file is imported, flush those  
> tables and
> rebuild.  You probably wouldn't even use auto-increment IDs for them.
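>
> A rough Drupal 5-style sketch of such a callback (all table, path,
> and column names invented):
>
> function course_menu($may_cache) {
>   $items = array();
>   if ($may_cache) {
>     $items[] = array(
>       'path' => 'courses',
>       'title' => t('Courses'),
>       'callback' => 'course_list_page',
>       'access' => user_access('access content'),
>     );
>   }
>   return $items;
> }
>
> function course_list_page() {
>   $result = db_query("SELECT code, title FROM {course} ORDER BY code");
>   $rows = array();
>   while ($course = db_fetch_object($result)) {
>     $rows[] = array(l($course->title, 'courses/' . $course->code));
>   }
>   return theme('table', array(t('Course')), $rows);
> }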
>
> In the second case, have a single menu callback that corresponds to  
> the root
> of the DocBook tree.  Arguments to that callback map to the
> structure of the
> tree.  Scan the tree once and build a menu based on the outline, then
> lazy-load page data and cache it in rendered (via XSLT or whatever)  
> form for
> later display.  If the tree structure changes or the admin changes the
> page-break settings, flush the cache and rebuild the spine menu.
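>
> In Drupal 6 terms, roughly (the cache IDs and the render helper are
> invented):
>
> function docbook_page($path) {
>   $cache = cache_get('docbook:' . $path, 'cache');
>   if ($cache) {
>     return $cache->data;
>   }
>   $html = docbook_render_source($path);  // the XSLT step, not shown
>   cache_set('docbook:' . $path, $html, 'cache');
>   return $html;
> }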
>
> The problem with both of those methods is that in neither case is
> the data a node.
> That means you do not get any of the benefits of data being nodes  
> (comments,
> CCK capability, nodereference, a node/$nid "permalink" that doesn't  
> change
> when the book structure is refactored, Views support, etc.)  On the  
> other
> hand, both cases need flush/rebuild ability.  That means creating and
> destroying nodes by the hundreds on a regular basis -- which would  
> be a very
> slow operation and would result in the loss of any of that additional
> metadata -- or building some additional mechanism for tracking what  
> part of
> the original source data maps to what once-created node.  The  
> former is quite
> undesirable, while the latter is potentially quite complex  
> (especially when,
> of course, you don't have absolute control over the incoming data  
> so can't
> guarantee that it has a unique ID you can reference).
>
> Since I can think of at least two places I would want to use each  
> of those
> (Drupal.org being one use case for scenario 2), both seem like  
> natural cases
> for generalized modules.  I want to solve the import/sync problem  
> first,
> however.
>
> I also do not believe that the importexportapi module would be  
> useful here.  I
> looked into it last year for a similar task, and from the  
> documentation
> determined that it was either too over-engineered or too
> under-documented for
> my uses.  By the time I figured out how to use it, I could probably  
> just have
> written it myself. :-/
>
> The DocBook scenario I'm assuming is a Drupal 6-based module only,  
> and PHP
> 5-only.  The courses scenario may be Drupal 5 or Drupal 6,  
> depending on
> scheduling.
>
> So, I toss the brain-teaser out there: Is there a good way to have  
> my nodes
> and import them too, or are these cases where nodes are simply the  
> wrong tool
> and the direct-import-and-cache mechanisms described above are the  
> optimal
> solutions?
>
> -- 
> Larry Garfield			AIM: LOLG42
> larry at garfieldtech.com		ICQ: 6817012
>
> "If nature has made any one thing less susceptible than all others of
> exclusive property, it is the action of the thinking power called  
> an idea,
> which an individual may exclusively possess as long as he keeps it to
> himself; but the moment it is divulged, it forces itself into the  
> possession
> of every one, and the receiver cannot dispossess himself of it."   
> -- Thomas
> Jefferson


