[development] Data import and synchronization to nodes
larry at garfieldtech.com
Tue Aug 28 05:08:01 UTC 2007
Hi all. Algorithmic question. Actually, two questions that I feel have
related or complementary solutions, so I am tossing them out together.
I have two scenarios I am looking at where I need to be able to pull data in
from an external source and make a node out of it, *and* be able to fully
update the entire dataset at any time from the external source.
Scenario 1: Course listings for a school. Courses come in as a CSV file. The
CSV includes duplicate data for courses and sections of courses, as well as
fields like the instructor, which is an Instructor node type elsewhere in the
system. It is not linked by node ID, however, but by instructor name, so a
lookup needs to be done on import. Imports typically happen only a few times
a year, but at those times they will easily be run a dozen times in a few
days. Each import is several hundred records.
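(To make that lookup concrete: in Drupal 5/6 terms I'm picturing something
like the snippet below. The 'instructor' type name and the assumption that
the CSV's name matches the node title exactly are just illustrative.)

  // Hypothetical helper: map an instructor name from the CSV to an
  // existing Instructor node, assuming the node title is the name.
  function course_import_instructor_nid($name) {
    return db_result(db_query("SELECT nid FROM {node} WHERE type = 'instructor' AND title = '%s'", $name));
  }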
Scenario 2: A DocBook source tree, such as the one sepeck is building for the
new Drupal.org documentation. A given book needs to be made available via
Drupal pages, with the structure of the pages on the site matching up with
the DocBook tree. Whether pages break at chapters/articles, sections,
sub-sections, etc. should be admin controlled. Each book could potentially
be hundreds of pages (for some admin-defined definition of page).
In neither scenario does the incoming data have any knowledge of "nodes".
Also, in neither scenario do I need round-tripping, so if the data is not
editable via Drupal, or if edits are always overwritten by a new import, that
is fine.
Now, the non-node solution to both of these is reasonably straightforward.
In the first scenario, simply read course data into two separate tables (one
for courses, one for sections) with the appropriate foreign keys, then set up
custom menu callbacks for courses and sections and lists thereof that display
the desired information. When a new file is imported, flush those tables and
rebuild. You probably wouldn't even use auto-increment IDs for them.
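(Roughly like this, as a sketch; the table, column, and function names are
just invented for illustration:)

  // Hypothetical flush-and-rebuild CSV import for the non-node approach.
  // {course} and {course_section} are custom tables keyed by the codes
  // from the CSV rather than by serial IDs.
  function course_import_csv($filename) {
    db_query("DELETE FROM {course}");
    db_query("DELETE FROM {course_section}");
    $seen = array();
    $handle = fopen($filename, 'r');
    while (($row = fgetcsv($handle)) !== FALSE) {
      // Assuming a simple course_code, title, section_code, instructor layout.
      list($course_code, $title, $section_code, $instructor) = $row;
      // The CSV repeats the course data on every section row, so only
      // insert each course once.
      if (!isset($seen[$course_code])) {
        db_query("INSERT INTO {course} (code, title) VALUES ('%s', '%s')", $course_code, $title);
        $seen[$course_code] = TRUE;
      }
      db_query("INSERT INTO {course_section} (course_code, section_code, instructor) VALUES ('%s', '%s', '%s')", $course_code, $section_code, $instructor);
    }
    fclose($handle);
  }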
In the second case, have a single menu callback that corresponds to the root
of the DocBook tree. Arguments to that callback map to the structure of the
tree. Scan the tree once and build a menu based on the outline, then
lazy-load page data and cache it in rendered (via XSLT or whatever) form for
later display. If the tree structure changes or the admin changes the
page-break settings, flush the cache and rebuild the spine menu.
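(Again just a sketch, with invented names; docbook_render_fragment() is a
stand-in for whatever does the actual XSLT work, and I'm assuming the
Drupal 6 cache API:)

  // Hypothetical page callback for the non-node DocBook approach. The path
  // arguments after the root mirror the tree, e.g. docbook/ch02/sect1.
  function docbook_page_view() {
    $path = implode('/', func_get_args());
    $cid = 'docbook:' . $path;
    if ($cached = cache_get($cid)) {
      return $cached->data;
    }
    // Lazy-load: render the matching DocBook fragment only when it is
    // first requested, then cache the result for later display.
    $html = docbook_render_fragment($path);
    cache_set($cid, $html);
    return $html;
  }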
The problem with both of those methods is that in neither case is the data a node.
That means you do not get any of the benefits of data being nodes (comments,
CCK capability, nodereference, a node/$nid "permalink" that doesn't change
when the book structure is refactored, Views support, etc.) On the other
hand, both cases need flush/rebuild ability. That means creating and
destroying nodes by the hundreds on a regular basis -- which would be a very
slow operation and would result in the loss of any of that additional
metadata -- or building some additional mechanism for tracking what part of
the original source data maps to what once-created node. The former is quite
undesirable, while the latter is potentially quite complex (especially when,
of course, you don't have absolute control over the incoming data so can't
guarantee that it has a unique ID you can reference).
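(For what it's worth, the tracking mechanism I have in mind for that second
option is a small mapping table from some source key -- whatever can be
derived from the incoming data, e.g. a course/section code or a DocBook file
path -- to an nid, plus an update-or-create loop on import. A rough sketch,
with invented table/function names, and assuming such a key even exists:)

  // Hypothetical update-or-create sync keyed on a source identifier.
  // {import_map} holds (source_key, nid) pairs created on first import.
  function import_sync_record($source_key, $fields) {
    $nid = db_result(db_query("SELECT nid FROM {import_map} WHERE source_key = '%s'", $source_key));
    $node = $nid ? node_load($nid) : NULL;
    if (!$node) {
      // No mapping yet: start a new node and record where it went below.
      $node = new stdClass();
      $node->type = 'course';
    }
    // Overwrite only the imported fields; comments, extra CCK data, the
    // nid/permalink, etc. survive the re-import.
    foreach ($fields as $key => $value) {
      $node->$key = $value;
    }
    node_save($node);
    if (!$nid) {
      db_query("INSERT INTO {import_map} (source_key, nid) VALUES ('%s', %d)", $source_key, $node->nid);
    }
  }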
Since I can think of at least two places I would want to use each of those
(Drupal.org being one use case for scenario 2), both seem like natural cases
for generalized modules. I want to solve the import/sync problem first.
I also do not believe that the importexportapi module would be useful here. I
looked into it last year for a similar task, and from the documentation
determined that it was either too over-engineered or too under-documented for
my uses. By the time I figured out how to use it, I could probably just have
written it myself. :-/
The DocBook scenario, I'm assuming, would be a Drupal 6-only module, and PHP
5-only. The courses scenario may be Drupal 5 or Drupal 6, depending on timing.
So, I toss the brain-teaser out there: Is there a good way to have my nodes
and import them too, or are these cases where nodes are simply the wrong tool
and the direct-import-and-cache mechanisms described above are the optimal
solution?
Larry Garfield AIM: LOLG42
larry at garfieldtech.com ICQ: 6817012
"If nature has made any one thing less susceptible than all others of
exclusive property, it is the action of the thinking power called an idea,
which an individual may exclusively possess as long as he keeps it to
himself; but the moment it is divulged, it forces itself into the possession
of every one, and the receiver cannot dispossess himself of it." -- Thomas
Jefferson