[development] Everything's a node != Everything's in 1 table

Tue Jan 23 01:59:08 UTC 2007

In the discussion of whether everything should be a node, there are two
recurring objections. I believe both can be addressed. The 3 numbered
paragraphs below describe a data model that responds to the objections.
Then, the paragraphs following look at additional implications.

First objection is that database performance would be a nightmare if
everything were stored in one table. I agree. However, it is not
necessary to keep everything in one table.

The second objection is stated in different ways, so I'll synthesize:
data nodes, taxonomy vocabularies and terms, and other things like
comments and users, are all different _kinds_of_things_. It doesn't make
sense to treat them the same. I respond by saying that they *are* the
same thing: they are data structures with most fields and operations in
common. The fact that they don't _mean_ the same thing to us meatpeople
isn't relevant to the question of storage and low-level API's, only to
the implementation of each type's handler functions.

Here's what I have in mind. 

1. Put each node type in its own table, and name the table for the type,
e.g. {node_footype}. Each node-type table will have columns for the
fields that are absolutely universal to all nodes, at a minimum the
globally unique node ID, plus timestamps, etc. Note that the node type
is not stored in a column in this table, as it is invariant across rows;
the typename is implicit in the table name. (In memory, of course, a
node's type identifier would need to be recorded.)

2. Each "node type component" that bundles its own fields and defines
its own handlers will also have its own table. For example, Case Tracker
and Category implement fields and behavior that can be attached to any
node type. This one will only have the fields that pertain to the add-on
functionality. Again, the type is not recorded in the table data; it is
implicit in the table's name, e.g. {typecomp_casetracker}.

3. The fields that are added to the node type's definition individually
would be an implicit node type component, so we'll put those in
{typecomp_footype}, corresponding to {nodetype_footype}. For example,
the fields we would add with CCK, or in a module that defines a complete
and independent node type.

[Note this implies that node types and type components share a
namespace. I don't believe that's a problem. It makes sense for a node
type component to have dibs on the base node type of the same name. (If
a node type name is already taken, you need to choose a different name
for your module anyway.)]

So, the definition of a node type is recorded as a type-name and a list
of the associated add-ons, the type components. A type component is
defined with a name and a list of field names and types, presumably from
a catalog of field type definitions recorded elsewhere.

For any node type, regardless of the meaning of its data, whether those
are content or categories or whatever, the database operations are the
same. For example, I want to retrieve all of one node's fields. I the
programmer don't even necessarily know what they all are, and I don't
need to. (Please excuse my sloppy syntax.)

$querytext = 'select * from node_'. $nodetype. ' ';
$jointype = 'inner';
foreach( list_type_components($nodetype) as $comp) {
	$querytext .= "$jointype join typecomp_$comp 
		on node_$nodetype.node_id 
		= typecomp_$comp.node_id". ' ';
}
$querytext .= 'where blah blah whatever'

If the fieldnames are encoded from the type component's name, e.g.
my_component_name.text_field1, the developer will never need to know or
care about the storage mechanism, as all he'll ever give or get is an
associative array. 

Each type component's handler would add/remove/whatever the node's ID
and type to an index table owned by that type component. This takes care
of the case where we want to manipulate a node or nodes from the
database without knowing their type, and that's when we want to work
with all of one metatype (taxo term, content, etc.). For each metatype,
create a node type component that's exclusive to it, and *POOF!*. All
the semantical differences among node types take care of themselves in
the naming conventions, and where necessary in the handler functions.

I believe this mechanism might allow us to avoid the major database
performance issues, particularly the self-join problem. It should also
simplify the engine by merging all our current data-metatypes into only
one basic abstract type with one API, without introducing confusion for
the module-level developers.

-Edgar