Database schema abstraction and *reflection* (was: Referential integrity -- finally?)
I raised the issue of database schema abstraction last summer, and I know that other people raised it before me. The arguments this time seem to be the same as the used to be (though apparently Dries is closer to being persuaded :-). I would like to elaborate on my primary reason for supporting a database schema abstraction layer (not for data manipulation (SELECT, etc.), just for CREATE TABLE, ALTER, etc): database schema reflection. Schema reflection will provide Drupal the ability to understand and manipulate its own data schema (in the flavor of the Forms API). Every table's structure will be defined in a data structure (just like a form), indicating its columns and data types, keys, etc. I envision something like this (I wrote a perl script to automatically generate this structure by parsing 4.7's database.4.0.mysql. Other methods are possible, but it only has to be done once): function system_install() { // // Table structure for table 'access' // $tables['access'] = array ('#cols' => array (array('aid', 'tinyint', NOT_NULL, '#disp_width' => 10), array('mask', 'varchar(255)', NOT_NULL, '#default' => ''), array('type', 'varchar(255)', NOT_NULL, '#default' => ''), array('status', 'tinyint', NOT_NULL, '#disp_width' => 2, '#default' => '0')), '#keys' => array (array('', 'aid', 'PRIMARY'))); // // other tables defined here // return $tables; } Every column has a name, type, and NULL/NOT NULL setting, so those array elements are positional, and any additional/optional attributes are set with #name elements (this is to save space and typing, though all elements could obviously all be keyed to mimic Forms API). Similarly, all keys have a name, column list, and index type, with possible optional attributes. Every module, including system, can return an array like this. When _install hooks are called, the caller merges the result and hands the array off to a DBMS engine driver that translates the schema into appropriate CREATE TABLE, ALTER, etc. statements. I believe this is much better than db_create_table(), db_alter_table(), etc. for the same reason that the Forms API is better than pre-4.7's form_button(), form_checkbox(), etc. A data-driven approach will prove a better design than a procedural approach. Nothing prevents db_create_table() etc. from existing, of course; in fact they probably will, but usually only be called by the DBMS driver. Any module author that does not want to play the game just continues to call db_query('CREATE TABLE') in hook_install(); it is not like we can prevent it. Similarly, _update_n() functions can remain unchanged. If they don't return a structured array, they won't be involved in the abstraction layer. This is not a least-common-denominator approach. Nothing prevents DBMS-specific settings: $tables['some_table'] = array ('#mssql_proprietary_setting' => 'value', ...); Schema version updates, at least the normal cases (add, rename, remove columns) can be handled just by adding version info into the table. For example: $tables['some_table'] = array ('#version' => 2, '#cols' => array (// all the original columns would be here, and now ... array('newcol', 'int', NOT_NULL, '#version_added' => 2), For version updates that require custom code, we can add $tables['some_table'] = array ('#version' => 4, '#version_custom_updates' => array(2, 6), ...); so when the schema updates from 1 to 2 or 5 to 6, module_update_some_table_2 or _6 is called. (As I write this, I think I realize that the #version keys should be in $tables, not $tables['some_table'].) What does this buy us? - Several modules (e.g. Views) already need schema reflection, this just centralizes the information. It makes it much easier to create new modules that need reflection. A module that wants to support Views or other similar modules can hang any necessary data off its primary $tables array. In fact, many modules will probably not have to do anything more to support such reflection-modules. - Code size reduction and elimination of duplicate code. My automatically-generated schema for core (with all comments) is 110% the size of database.4.0.mysql, but it obsoletes database.4.1.mysql, database.psql, etc. The same reduction would happen in 5.0's system_install(). - A DBMS driver can use inspection statements (SHOW TABLES, SHOW COLUMNS) to actually validate the database layout against Drupal's internal schema, reporting errors. - It is a huge step towards enabling partial data migration across Drupal systems (this is a whole other topic). - hook_table_alter. - The Forms API made all kinds of things possible that weren't before, not all of them foreseen. This will, too. Yes, this is somewhat more complicated than writing a single MySQL CREATE TABLE statement. Supporting Views is extra effort, too, but has a big payoff for module developers; so will this. I assert that it is easier to create a structure like this once than it is to write AND MAINTAIN separate statements for MySQL, Postgres, Oracle, Sybase, MS SQL Server, and god knows what else. And module authors can keep using CREATE TABLE if they really want to. Anyway, that's my $.02. Thanks, Barry
This is a really great proposal. I agree that Drupal needs a DB schema abstraction system, for all the reasons that you listed, Barry. I'm not suggesting that we should do more than a DB schema abstraction system. I realise that this is all that it's practical to try and get into core. However, I am pointing out that there is the possibility to do more. The entity definition system in the importexportapi module is rich enough to support automated CRUD operations throughout Drupal, and of course to support automated import/export of data to various text formats (e.g. XML, CSV). However, it is also much heavier than the proposed DB schema abstraction system, and obviously needs a lot of thought and a lot of work before it could be put into core (if it ever will be). Cheers, Jaza. On 1/26/07, Barry Jaspan <barry@jaspan.org> wrote:
I raised the issue of database schema abstraction last summer, and I know that other people raised it before me. The arguments this time seem to be the same as the used to be (though apparently Dries is closer to being persuaded :-).
I would like to elaborate on my primary reason for supporting a database schema abstraction layer (not for data manipulation (SELECT, etc.), just for CREATE TABLE, ALTER, etc): database schema reflection.
Schema reflection will provide Drupal the ability to understand and manipulate its own data schema (in the flavor of the Forms API). Every table's structure will be defined in a data structure (just like a form), indicating its columns and data types, keys, etc. I envision something like this (I wrote a perl script to automatically generate this structure by parsing 4.7's database.4.0.mysql. Other methods are possible, but it only has to be done once):
function system_install() { // // Table structure for table 'access' // $tables['access'] = array ('#cols' => array (array('aid', 'tinyint', NOT_NULL, '#disp_width' => 10), array('mask', 'varchar(255)', NOT_NULL, '#default' => ''), array('type', 'varchar(255)', NOT_NULL, '#default' => ''), array('status', 'tinyint', NOT_NULL, '#disp_width' => 2, '#default' => '0')), '#keys' => array (array('', 'aid', 'PRIMARY')));
// // other tables defined here //
return $tables; }
Every column has a name, type, and NULL/NOT NULL setting, so those array elements are positional, and any additional/optional attributes are set with #name elements (this is to save space and typing, though all elements could obviously all be keyed to mimic Forms API). Similarly, all keys have a name, column list, and index type, with possible optional attributes.
Every module, including system, can return an array like this. When _install hooks are called, the caller merges the result and hands the array off to a DBMS engine driver that translates the schema into appropriate CREATE TABLE, ALTER, etc. statements.
I believe this is much better than db_create_table(), db_alter_table(), etc. for the same reason that the Forms API is better than pre-4.7's form_button(), form_checkbox(), etc. A data-driven approach will prove a better design than a procedural approach. Nothing prevents db_create_table() etc. from existing, of course; in fact they probably will, but usually only be called by the DBMS driver.
Any module author that does not want to play the game just continues to call db_query('CREATE TABLE') in hook_install(); it is not like we can prevent it. Similarly, _update_n() functions can remain unchanged. If they don't return a structured array, they won't be involved in the abstraction layer.
This is not a least-common-denominator approach. Nothing prevents DBMS-specific settings:
$tables['some_table'] = array ('#mssql_proprietary_setting' => 'value', ...);
Schema version updates, at least the normal cases (add, rename, remove columns) can be handled just by adding version info into the table. For example:
$tables['some_table'] = array ('#version' => 2, '#cols' => array (// all the original columns would be here, and now ... array('newcol', 'int', NOT_NULL, '#version_added' => 2),
For version updates that require custom code, we can add
$tables['some_table'] = array ('#version' => 4, '#version_custom_updates' => array(2, 6), ...);
so when the schema updates from 1 to 2 or 5 to 6, module_update_some_table_2 or _6 is called. (As I write this, I think I realize that the #version keys should be in $tables, not $tables['some_table'].)
What does this buy us?
- Several modules (e.g. Views) already need schema reflection, this just centralizes the information. It makes it much easier to create new modules that need reflection. A module that wants to support Views or other similar modules can hang any necessary data off its primary $tables array. In fact, many modules will probably not have to do anything more to support such reflection-modules.
- Code size reduction and elimination of duplicate code. My automatically-generated schema for core (with all comments) is 110% the size of database.4.0.mysql, but it obsoletes database.4.1.mysql, database.psql, etc. The same reduction would happen in 5.0's system_install().
- A DBMS driver can use inspection statements (SHOW TABLES, SHOW COLUMNS) to actually validate the database layout against Drupal's internal schema, reporting errors.
- It is a huge step towards enabling partial data migration across Drupal systems (this is a whole other topic).
- hook_table_alter.
- The Forms API made all kinds of things possible that weren't before, not all of them foreseen. This will, too.
Yes, this is somewhat more complicated than writing a single MySQL CREATE TABLE statement. Supporting Views is extra effort, too, but has a big payoff for module developers; so will this. I assert that it is easier to create a structure like this once than it is to write AND MAINTAIN separate statements for MySQL, Postgres, Oracle, Sybase, MS SQL Server, and god knows what else. And module authors can keep using CREATE TABLE if they really want to.
Anyway, that's my $.02.
Thanks,
Barry
When doing a Firebird 1.x port of Drupal, I met a limitation of the underlying DBMS with the node_access table, which can't be created with a structure equivalent to the MySQL/PGSQL versions (see http://cvs.drupal.org/viewcvs/drupal/contributions/docs/developer/database/c... for the workaround). Would that level of abstraction enable us to use this type of solution (an additional table f-keyed to a base table) ? It looks like iti wouldn't be sufficient yet. ----- Original Message ----- From: "Barry Jaspan" <barry@jaspan.org> To: <development@drupal.org> Sent: Friday, January 26, 2007 1:03 AM Subject: [development] Database schema abstraction and *reflection* (was: Referential integrity -- finally?)
I raised the issue of database schema abstraction last summer, and I know that other people raised it before me. The arguments this time seem to be the same as the used to be (though apparently Dries is closer to being persuaded :-).
I would like to elaborate on my primary reason for supporting a database schema abstraction layer (not for data manipulation (SELECT, etc.), just for CREATE TABLE, ALTER, etc): database schema reflection.
[...]
"FGM" <fgm@osinet.fr> writes:
When doing a Firebird 1.x port of Drupal, I met a limitation of the underlying DBMS with the node_access table, which can't be created with a structure equivalent to the MySQL/PGSQL versions (see http://cvs.drupal.org/viewcvs/drupal/contributions/docs/developer/database/c... for the workaround).
I do not claim to be wise in all the ways of databases, nor have I ever touched Firebird, and I am only somewhat familiar with the node_access table. What I see in your diagram is that you took the realm column out of node_access into its own table, keyed by a realm id. It looks to me like this is just normalizing the node_access table. Am I correct?
Would that level of abstraction enable us to use this type of solution (an additional table f-keyed to a base table) ?
I certainly plan to support foreign-key information in the schema data structure even if it is not represented/enforced by the underlying DBMS. This is crucial to achieving "Drupal understanding its own schema." I'm not sure I understand your question, but I think you are asking: Could this abstraction allow the Drupal-Firebird driver to convert the current node_access table into the separate node_access and node_access_realm tables as shown in your diagram? If that is the question, the answer is no. My proposal is only for data-definition statements, not for abstracting data-manipulation statements. Yes, one could implement the hypothetical hook_schema_alter (I'm making this up on the fly): function firebird_schema_alter(&$schema) { unset($schema['node_access']['cols']['realm']); $schema['node_access']['cols'][] = array('rid', ...); $schema['node_access_realm'] = array(/* table definition goes here */); } However, that would break all the code that calls the CRUD functions on node_access. I do NOT want to put an abstraction layer on top of SELECT. For this change, the schema would just need to be changed and code using node_access would have to be patched. Am I misunderstanding your question? Thanks, Barry
Barry, You're indeed correct in evaluating the change and understanding the question: this was necessary because of the stringent FB limitations on the width of indexes, and is an example where a given DBMS cannot be used to port Drupal "as such" without an actual DB abstraction (instead of a DB API abstraction as we have currently). The hook_schema_alter is interesting/perplexing, BTW. I wonder where this could take us. More generally, as you say, such ports would need the specific module to be patched, instead of having a function implementing the change in the DB indendence layer, which looks like a Bad Thing. But since we're not (yet ?) considering DML abstraction, I'm not sure we can go much further... except (maybe) by some very clever hook_db_rewrite_sql implementations. (very ?). Frederic. ----- Original Message ----- From: "Barry Jaspan" <barry@jaspan.org> To: <development@drupal.org> Sent: Friday, January 26, 2007 8:03 PM Subject: Re: [development] Database schema abstraction and *reflection*(was: Referential integrity -- finally?)
"FGM" <fgm@osinet.fr> writes:
[...]
What I see in your diagram is that you took the realm column out of node_access into its own table, keyed by a realm id. It looks to me like this is just normalizing the node_access table. Am I correct? [...]
I'm not sure I understand your question, but I think you are asking: Could this abstraction allow the Drupal-Firebird driver to convert the current node_access table into the separate node_access and node_access_realm tables as shown in your diagram?
If that is the question, the answer is no. My proposal is only for data-definition statements, not for abstracting data-manipulation statements. Yes, one could implement the hypothetical hook_schema_alter (I'm making this up on the fly):
function firebird_schema_alter(&$schema) { unset($schema['node_access']['cols']['realm']); $schema['node_access']['cols'][] = array('rid', ...); $schema['node_access_realm'] = array(/* table definition goes here */); [...]
i'm pretty excited about this direction. we certainly can borrow ideas from the definitions in importexportapi module.
- Several modules (e.g. Views) already need schema reflection, this just centralizes the information. It makes it much easier to create new modules that need reflection. A module that wants to support Views or other similar modules can hang any necessary data off its primary $tables array. In fact, many modules will probably not have to do anything more to support such reflection-modules.
aha, so *this* is how we get Views in core :) actually, drupal core itself uses basic reflection on the users table. see user_fields(). an admin can add columns to the users table and these columns will be loaded and saved into the $user object. this table definition API would help us extend that feature to other objects like $node and $term. Very nice for performance.
- Code size reduction and elimination of duplicate code. My automatically-generated schema for core (with all comments) is 110% the size of database.4.0.mysql, but it obsoletes database.4.1.mysql, database.psql, etc. The same reduction would happen in 5.0's system_install().
- A DBMS driver can use inspection statements (SHOW TABLES, SHOW COLUMNS) to actually validate the database layout against Drupal's internal schema, reporting errors.
- It is a huge step towards enabling partial data migration across Drupal systems (this is a whole other topic).
if possible, please document your thoughts on this topic too. building out this vision will support the need for this table definition API
Moshe Weitzman <weitzman@tejasa.com> writes:
actually, drupal core itself uses basic reflection on the users table. see user_fields(). ... this table definition API would help us extend that feature to other objects like $node and $term. Very nice for performance.
Yes. Just to spell this out, a hypothetical hook_schema_alter for foo.module: function foo_schema_alter(&$schema) { $schema['node']['cols'][] = array(/* define the foo column for nodes */); } Now, node_load and node_save automatically handle $node->foo, so foo_nodeapi doesn't have to do its own queries. Of course this only makes sense for data that applies to every node (a node type's custom columns should not (?) exist in the node table as they'd be NULL most of the time and we'd get a very wide table). Barry
etc. I envision something like this (I wrote a perl script to automatically generate this structure by parsing 4.7's database.4.0.mysql. Other methods are possible, but it only has to be done once):
Awesome. Let's do it! You have a Perl script... and maybe some Drupal code to deal with the structure? Or that needs to be written?
"Karoly Negyesi" <karoly@negyesi.net> writes:
Awesome. Let's do it! You have a Perl script... and maybe some Drupal code to deal with the structure? Or that needs to be written?
A lot more thought and design is needed before it is time to write real code. I'll start when I hear from the core maintainers that they like the idea; I don't want to waste my time on something that has no chance of being committed. Barry
On 26 Jan 2007 16:14:56 -0500, Barry Jaspan <barry@mit.edu> wrote:
"Karoly Negyesi" <karoly@negyesi.net> writes:
Awesome. Let's do it! You have a Perl script... and maybe some Drupal code to deal with the structure? Or that needs to be written?
A lot more thought and design is needed before it is time to write real code. I'll start when I hear from the core maintainers that they like the idea; I don't want to waste my time on something that has no chance of being committed.
Karoly is not a core maintainer, but certainly a core developer. Generally, maintainers listen to the will of the community. You're the community, and there are some general +1s going around. Maybe touch base with Karoly and go over it directly. -- Boris Mann Vancouver 778-896-2747 San Francisco 415-367-3595 Skype borismann http://www.bryght.com
participants (7)
-
Barry Jaspan -
Barry Jaspan -
Boris Mann -
FGM -
Jeremy Epstein -
Karoly Negyesi -
Moshe Weitzman