[development] Data import and synchronization to nodes

Tue Aug 28 05:08:01 UTC 2007

Hi all.  Algorithmic question.  Actually, two questions that I feel have 
related or complementary solutions, so I am tossing them out together.

I have two scenarios I am looking at where I need to be able to pull data in 
from an external source and make a node out of it, *and* be able to fully 
update the entire dataset at any time from the external source.

Scenario 1: Course listings for a school.  Courses come in as a CSV file.  The 
CSV includes duplicate data for courses and sections of courses, as well as 
fields like the instructor, which is an Instructor node type elsewhere in the 
system.  The instructor is linked not by node ID but by name, so a lookup 
needs to be done on import.  Imports typically happen only a few times a 
year, but each time the file is easily re-imported a dozen times within a 
few days.  Each import is several hundred records.

Scenario 2: A DocBook source tree, such as the one sepeck is building for the 
new Drupal.org documentation.  A given book needs to be made available via 
Drupal pages, with the structure of the pages on the site matching up with 
the DocBook tree.  Whether pages break at chapters/articles, sections, 
sub-sections, etc. should be admin controlled.  Each book could potentially 
be hundreds of pages (for some admin-defined definition of page).

In neither scenario does the incoming data have any knowledge of "nodes".  
Also, in neither scenario do I need round-tripping, so if the data is not 
editable via Drupal, or edits are always overwritten by a new import, that is 
perfectly fine.

Now, the non-node solution to both of these is reasonably straightforward.  

In the first scenario, simply read course data into two separate tables (one 
for courses, one for sections) with the appropriate foreign keys, then set up 
custom menu callbacks for courses and sections and lists thereof that display 
the desired information.  When a new file is imported, flush those tables and 
rebuild.  You probably wouldn't even use auto-increment IDs for them.
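
A minimal sketch of that flush-and-rebuild import, written in Python with 
SQLite rather than Drupal's PHP purely to keep it self-contained.  The CSV 
column names and the table layout are assumptions for illustration, not the 
school's real format.  The natural keys (course code, course code + section) 
stand in for auto-increment IDs.

```python
import csv
import io
import sqlite3

# Hypothetical CSV layout: one row per section, course data repeated.
SAMPLE_CSV = """course_code,course_title,section,instructor
CS101,Intro to Computing,01,Ada Lovelace
CS101,Intro to Computing,02,Alan Turing
MA201,Linear Algebra,01,Emmy Noether
"""

SCHEMA = """
CREATE TABLE IF NOT EXISTS course (
    code  TEXT PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS section (
    course_code TEXT NOT NULL REFERENCES course(code),
    section     TEXT NOT NULL,
    instructor  TEXT NOT NULL,
    PRIMARY KEY (course_code, section)
);
"""


def rebuild(conn, csv_text):
    """Flush both tables and reload them from the CSV in one transaction."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.executescript(SCHEMA)
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM section")
        conn.execute("DELETE FROM course")
        # Duplicate course rows in the CSV collapse onto the primary key.
        conn.executemany(
            "INSERT OR IGNORE INTO course (code, title) VALUES (?, ?)",
            [(r["course_code"], r["course_title"]) for r in rows],
        )
        conn.executemany(
            "INSERT INTO section (course_code, section, instructor) "
            "VALUES (?, ?, ?)",
            [(r["course_code"], r["section"], r["instructor"]) for r in rows],
        )
```

Calling rebuild() again with a fresh file simply replaces everything, which 
is exactly why nothing else can safely hang off these rows.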

In the second case, have a single menu callback that corresponds to the root 
of the DocBook tree.  Arguments to that callback map to the structure of the 
tree.  Scan the tree once and build a menu based on the outline, then 
lazy-load page data and cache it in rendered (via XSLT or whatever) form for 
later display.  If the tree structure changes or the admin changes the 
page-break settings, flush the cache and rebuild the spine menu.
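
Roughly what I have in mind, again sketched in Python for brevity.  The 
sample book, the set of "structural" tags, the BookPages class, and the 
render() stand-in (which only returns the title instead of running XSLT) are 
all assumptions for illustration.  Paths are tuples of child indexes, which 
is what the callback arguments would map to.

```python
import xml.etree.ElementTree as ET

SAMPLE_BOOK = """<book><title>Handbook</title>
  <chapter id="install"><title>Installing</title>
    <section id="reqs"><title>Requirements</title><para>PHP.</para></section>
    <section id="steps"><title>Steps</title><para>Unpack.</para></section>
  </chapter>
  <chapter id="admin"><title>Administering</title><para>Overview.</para></chapter>
</book>"""

# Assumed set of elements at which a page is allowed to break.
STRUCTURAL = {"book", "chapter", "article", "section"}


def build_spine(root, break_depth):
    """Map paths (tuples of child indexes) to the element starting each page."""
    spine = {}

    def walk(elem, path, depth):
        spine[path] = elem
        if depth >= break_depth:
            return  # deeper structure stays inside this page
        children = [c for c in elem if c.tag in STRUCTURAL]
        for i, child in enumerate(children):
            walk(child, path + (i,), depth + 1)

    walk(root, (), 0)
    return spine


def render(elem):
    """Placeholder for the real rendering step (XSLT or whatever)."""
    return elem.findtext("title", default="")


class BookPages:
    def __init__(self, xml_text, break_depth):
        self.root = ET.fromstring(xml_text)
        self.set_break_depth(break_depth)

    def set_break_depth(self, break_depth):
        # Changing the page-break setting rebuilds the spine and flushes the cache.
        self.spine = build_spine(self.root, break_depth)
        self.cache = {}

    def page(self, path):
        # Lazy-load: render on first request, serve from cache afterwards.
        if path not in self.cache:
            self.cache[path] = render(self.spine[path])
        return self.cache[path]
```

With a break depth of 1 the sample book yields three pages (the book and two 
chapters); at depth 2 the two sections of the first chapter become pages of 
their own.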

The problem with both of those methods is that in neither case is the data a 
node.  That means you do not get any of the benefits of data being nodes 
(comments, CCK capability, nodereference, a node/$nid "permalink" that doesn't 
change when the book structure is refactored, Views support, etc.).  On the 
other hand, both cases need flush/rebuild ability.  That means creating and 
destroying nodes by the hundreds on a regular basis -- which would be a very 
slow operation and would result in the loss of any of that additional 
metadata -- or building some additional mechanism for tracking what part of 
the original source data maps to what once-created node.  The former is quite 
undesirable, while the latter is potentially quite complex (especially when, 
of course, you don't have absolute control over the incoming data and so can't 
guarantee that it has a unique ID you can reference).  
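
To make the "tracking mechanism" option concrete, here is one possible shape 
for it, sketched in Python.  Everything in it is an assumption: the key is 
derived from fields I am guessing would identify a record (course + section), 
the mapping is a plain dict where a real module would use a table, and 
create/update/delete are stand-ins for the node save and delete calls.  It 
does nothing about the hard part, which is source data with no stable 
identifying fields at all.

```python
import hashlib
import itertools


def fingerprint(record):
    """Stable hash of a record's content, used to detect changes."""
    joined = "\x1f".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def sync(records, key_fields, mapping, create, update, delete):
    """Reconcile records against mapping: {source_key: (nid, fingerprint)}."""
    seen = set()
    for record in records:
        key = tuple(record[f] for f in key_fields)
        seen.add(key)
        digest = fingerprint(record)
        if key not in mapping:
            mapping[key] = (create(record), digest)   # new in the source
        elif mapping[key][1] != digest:
            nid = mapping[key][0]
            update(nid, record)                       # changed: keep the nid
            mapping[key] = (nid, digest)
    for key in set(mapping) - seen:
        delete(mapping.pop(key)[0])                   # gone from the source


# Hypothetical in-memory "nodes" so the sketch runs on its own.
nodes = {}
next_nid = itertools.count(1)


def create(record):
    nid = next(next_nid)
    nodes[nid] = dict(record)
    return nid


def update(nid, record):
    nodes[nid] = dict(record)


def delete(nid):
    del nodes[nid]


mapping = {}
first = [
    {"course": "CS101", "section": "01", "instructor": "Ada Lovelace"},
    {"course": "CS101", "section": "02", "instructor": "Alan Turing"},
]
second = [
    {"course": "CS101", "section": "01", "instructor": "Grace Hopper"},
]
sync(first, ("course", "section"), mapping, create, update, delete)
sync(second, ("course", "section"), mapping, create, update, delete)
```

After the second import, section 01 keeps its original node ID with the new 
instructor, and the node for section 02 is deleted.  Unchanged records are 
never touched, so whatever was attached to their nodes survives.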

Since I can think of at least two places I would want to use each of those 
(Drupal.org being one use case for scenario 2), both seem like natural cases 
for generalized modules.  I want to solve the import/sync problem first, 
however.  

I also do not believe that the importexportapi module would be useful here.  I 
looked into it last year for a similar task, and from the documentation 
determined that it was either too over-engineered or too under-documented for 
my uses.  By the time I figured out how to use it, I could probably just have 
written it myself. :-/

The DocBook scenario I'm assuming is a Drupal 6-based module only, and PHP 
5-only.  The courses scenario may be Drupal 5 or Drupal 6, depending on 
scheduling.

So, I toss the brain-teaser out there: Is there a good way to have my nodes 
and import them too, or are these cases where nodes are simply the wrong tool 
and the direct-import-and-cache mechanisms described above are the optimal 
solutions?

-- 
Larry Garfield			AIM: LOLG42
larry at garfieldtech.com		ICQ: 6817012

"If nature has made any one thing less susceptible than all others of 
exclusive property, it is the action of the thinking power called an idea, 
which an individual may exclusively possess as long as he keeps it to 
himself; but the moment it is divulged, it forces itself into the possession 
of every one, and the receiver cannot dispossess himself of it."  -- Thomas 
Jefferson