Why do we avoid auto-incrementing db columns?
Could someone either clue me in or point to a prior discussion about why Drupal avoids using auto-incrementing database columns? If it is database compatibility, which versions of which databases that we support do not have it? Thanks, Barry
Separate counters are far more standard. MySQL is weird by having auto-increment columns but not counters. Now, there are a few things we could do to speed up Drupal's mechanism for emulating counters in MySQL... Barry Jaspan wrote:
Could someone either clue me in or point to a prior discussion about why Drupal avoids using auto-incrementing database columns? If it is database compatibility, which versions of which databases that we support do not have it?
Thanks,
Barry
Here's my guess: When inserting a node, there are several operations that take place, and you want them all to use a persistent $node->nid . Since few node creations can occur in the same time, you want to know which one is YOUR node. Fetching an nid/tid/... from the sequence table makes sure each one works in a safe environment, where all IDs are known. ...But that's just a guess ;-) Barry Jaspan wrote:
Could someone either clue me in or point to a prior discussion about why Drupal avoids using auto-incrementing database columns? If it is database compatibility, which versions of which databases that we support do not have it?
Thanks,
Barry
On Apr 25, 2007, at 10:15 AM, Barry Jaspan wrote:
Could someone either clue me in or point to a prior discussion about why Drupal avoids using auto-incrementing database columns? If it is database compatibility, which versions of which databases that we support do not have it?
The point isn't the databases we support, it's the databases we want to support.
Barry, I happened to discuss this with Greg Knaddison yesterday. MySQL 4 does not support auto-incrementing columns. I believe Drupal.org uses MySQL 4, and one reason it has not been upgraded could be to avoid complications caused by changes in the way information, particularly international characters, are stored between 4 and and later versions. Ezra Gildesgame Developer pingVision 1350 Pine Street, Suite 1 Boulder, CO 80302-4847 303.415.2559 On Apr 25, 2007, at 8:15 AM, Barry Jaspan wrote:
Could someone either clue me in or point to a prior discussion about why Drupal avoids using auto-incrementing database columns? If it is database compatibility, which versions of which databases that we support do not have it?
Thanks,
Barry
On Wednesday 25 April 2007 10:29, Ezra B. Gildesgame wrote:
I happened to discuss this with Greg Knaddison yesterday. MySQL 4 does not support auto-incrementing columns.
I must respectfully disagree. I happened to have a copy of the MySQL reference book from version 3.22 sitting on my bookshelf, and AUTO_INCREMENT was supported even then. :-) Never mind the question of why I still have such an old book.... I also checked a PostgreSQL reference book, and they offer a SERIAL column type that works very similarly to the AUTO_INCREMENT in MySQL. I don't know how far back this support goes; my book is for PG version 8.something and was copyright 2006. I can't comment on the Microsoft SQL Server implementation of this, if any, because I don't have any manual for that here. Scott -- ------------------------------------------------------------------------------- Syscrusher (Scott Courtney) Drupal page: http://drupal.org/user/9184 syscrusher at 4th dot com Home page: http://4th.com/
I believe one of the reasons is that many times you want to know the ID of your new record. For example, you inserted a new file to the 'files' table, and you would like to store a reference to it in your xyz table. For mysql there is mysql_insert_id(), but not for others. On 4/25/07, Syscrusher <syscrusher@4th.com> wrote:
On Wednesday 25 April 2007 10:29, Ezra B. Gildesgame wrote:
I happened to discuss this with Greg Knaddison yesterday. MySQL 4 does not support auto-incrementing columns.
I must respectfully disagree. I happened to have a copy of the MySQL reference book from version 3.22 sitting on my bookshelf, and AUTO_INCREMENT was supported even then. :-) Never mind the question of why I still have such an old book....
I also checked a PostgreSQL reference book, and they offer a SERIAL column type that works very similarly to the AUTO_INCREMENT in MySQL. I don't know how far back this support goes; my book is for PG version 8.somethingand was copyright 2006.
I can't comment on the Microsoft SQL Server implementation of this, if any, because I don't have any manual for that here.
Scott
--
------------------------------------------------------------------------------- Syscrusher (Scott Courtney) Drupal page: http://drupal.org/user/9184 syscrusher at 4th dot com Home page: http://4th.com/
I have implemented random_id generation which solves 2 problems: (Link to explination: http://earnestberry.com/node/13 ) 1. Tables are NEVER locked when generating an id 2. ID, and there fore node (and id) generation is MUCH faster 3. It is non-sequential, so no issues with the sequencing getting out of sync. The only draw back is on the DB side.if you're writing stored procedures, you don't have a sequence or counter table to look at. However, one can easily solve this by implementing the methodology in a sp (stored-proc). _____ From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Amr Mostafa Sent: Wednesday, April 25, 2007 11:21 AM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns? I believe one of the reasons is that many times you want to know the ID of your new record. For example, you inserted a new file to the 'files' table, and you would like to store a reference to it in your xyz table. For mysql there is mysql_insert_id(), but not for others. On 4/25/07, Syscrusher <syscrusher@4th.com > wrote: On Wednesday 25 April 2007 10:29, Ezra B. Gildesgame wrote:
I happened to discuss this with Greg Knaddison yesterday. MySQL 4 does not support auto-incrementing columns.
I must respectfully disagree. I happened to have a copy of the MySQL reference book from version 3.22 sitting on my bookshelf, and AUTO_INCREMENT was supported even then. :-) Never mind the question of why I still have such an old book.... I also checked a PostgreSQL reference book, and they offer a SERIAL column type that works very similarly to the AUTO_INCREMENT in MySQL. I don't know how far back this support goes; my book is for PG version 8.something and was copyright 2006. I can't comment on the Microsoft SQL Server implementation of this, if any, because I don't have any manual for that here. Scott -- ---------------------------------------------------------------------------- --- Syscrusher (Scott Courtney) Drupal page: http://drupal.org/user/9184 syscrusher at 4th dot com Home page: http://4th.com/ <http://4th.com/>
I'd like to point out this is a really discussion about mysql, not about other database. Postgresql Drupal installations haven't used a sequences table ever, IIRC. As long as each database can implement some equivalent of db_next_id() -- be it by a sequences table, auto incremented values, Postgres's CURVAL() function, whatever -- Drupal's db abstraction will work. So let's rephrase: do the currently supported versions of mysql all support some sort of db_next_id() functionality? What about Drupal 6? If so, the sequences table can be dropped in favor of auto_increment. -M
Hmm...I'm still more of a fan of controlling the IDs application level Than using a db generated serial number, especially when dealing with referential integrity. I do believe in db generated id's for non-application/non-information columns. E.g. the watchdog table; information is just inserted into the database, and then read later, so the id field isn't really used can be blind to the application. However, say an NID, of the txid (from e-commerce) gets passed around drupal and used by other modules, sometimes in loops, and sometimes for other modules to ref. the node; this I'd rather have some control of. Also, the implementations of "next_val" per-database get a little hairy also. You also then have the issue of if you are in a transaction, the transaction level, etc. that all starts to play in to the db's implementation of "next_val" or "last_insert_id". -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Mark Fredrickson Sent: Wednesday, April 25, 2007 11:31 AM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns? I'd like to point out this is a really discussion about mysql, not about other database. Postgresql Drupal installations haven't used a sequences table ever, IIRC. As long as each database can implement some equivalent of db_next_id() -- be it by a sequences table, auto incremented values, Postgres's CURVAL() function, whatever -- Drupal's db abstraction will work. So let's rephrase: do the currently supported versions of mysql all support some sort of db_next_id() functionality? What about Drupal 6? If so, the sequences table can be dropped in favor of auto_increment. -M
Hi Earnest and others, On 4/25/07, Earnest Berry III <earnest.berry@gmail.com> wrote:
Hmm...I'm still more of a fan of controlling the IDs application level Than using a db generated serial number, especially when dealing with referential integrity.
I have the opposite standpoint. The more referential integrity that is in the app layer, the harder it it is to integrate and share data from the database. Use case: I have a another (non-PHP) system that needs to insert nodes. If it can't generate a valid ID for that node, this process is halted in it's tracks. I would like to see more data modeling logic moved to the db layer (stored procedures, triggers, views, etc), and leave PHP to do the theming and IO on top of that. E.g. CCK would be a great candidate for going to the db layer.
However, say an NID, of the txid (from e-commerce) gets passed around drupal and used by other modules, sometimes in loops, and sometimes for other modules to ref. the node; this I'd rather have some control of.
I do not see why these behaviors indicate that referential integrity should be handled in the app. If you are saying that within a loop, one must generate N IDs, each with a database call, there is some merit to argument that app created IDs would be better. But there are going to be other tradeoffs.
Also, the implementations of "next_val" per-database get a little hairy also. You also then have the issue of if you are in a transaction, the transaction level, etc. that all starts to play in to the db's implementation of "next_val" or "last_insert_id".
I'll defer to others' experience on this subject, but I'd be surprised if the problems were intractable. Considering that Drupal currently supports all of 2 databases (I'm crossing my fingers for SQLite support - that'd be cool) unifying the implementations shouldn't be that hard. -M
Mark Fredrickson wrote:
theming and IO on top of that. E.g. CCK would be a great candidate for going to the db layer. Putting more logic into the DB layer requires that more logic be ported to each database version with each version having to be maintained. By having logic such as CCK use Drupal's database API this extra work is avoided.
Mike Cantelon
On Wed, 25 Apr 2007 09:58:48 -0700, Mike Cantelon <m_cantelon@straight.com> wrote:
Mark Fredrickson wrote:
theming and IO on top of that. E.g. CCK would be a great candidate for going to the db layer. Putting more logic into the DB layer requires that more logic be ported to each database version with each version having to be maintained. By having logic such as CCK use Drupal's database API this extra work is avoided.
Mike Cantelon
As long as Drupal supports MySQL MyISAM tables, we have to treat the database as a glorified card catalog since that's what MyISAM tables are. Relying on triggers, referential integrity, foreign keys, cascade-on-delete, and so on and so on would be great if we were actually using a database that supported them. As long as MyISAM is supported, and since a *lot* of shared hosts don't offer anything else we will have to keep doing so for a long time to come, the database is simply a data sink. (And before anyone thinks I'm suggesting we drop MyISAM support, I am not saying anything of the sort. <g>) --Larry Garfield
Larry Garfield wrote:
(And before anyone thinks I'm suggesting we drop MyISAM support, I am not saying anything of the sort. <g>)
Not wanting to add fuel to any fire, but I think this is an interesting discussion to have in the context of "the drop is always moving". I don't want to make a judgment either way, but if Drupal is to be compatible with arguably an old way of "doing" databases then it definitely complicates the questions of: should it work on PHP3? should Drupal be more indulgent towards backwards-compatibility? Or the converse argument: that Drupal's development methodology works well, but instead calls into question support for MyISAM tables. Cheers, J-P
On 4/26/07, J-P Stacey <jp.stacey@torchbox.com> wrote:
Larry Garfield wrote:
(And before anyone thinks I'm suggesting we drop MyISAM support, I am not saying anything of the sort. <g>)
Not wanting to add fuel to any fire, but I think this is an interesting discussion to have in the context of "the drop is always moving". I don't want to make a judgment either way, but if Drupal is to be compatible with arguably an old way of "doing" databases then it definitely complicates the questions of: should it work on PHP3? should Drupal be more indulgent towards backwards-compatibility? Or the converse argument: that Drupal's development methodology works well, but instead calls into question support for MyISAM tables.
I don't think you're going to get a useful conclusion comparing MyISAM (default choice on the majority of current webhosts) and PHP3 (old old old software that is only used on lazy hosts or hosts who value "stability" over security). The drop is moving, but it's not like the drop can veer away from the current most popular pieces of the stack...yet. Not yet. But now that die welt runs drupal perhaps it can soon ;) Greg About Die Welt: http://drupal.org/node/139174
Better than dropping support for MyISAM tables would be dropping support for hosts which don't allow locking. :-) Pardon my rant for the moment, but WTF is up with that? What's the point of even having an RDBMS if you can't guarantee any of your data changing operations with a lock? (Rhetorical question, you all -- I can think of some reasons, really.) I think the best argument against stored-procedures and triggers is the amount of extra effort it will require a volunteer open-source effort to maintain across multiple databases. CRUD SQL is the most "portable" of all SQL, but even that has its problems as we know. Then comes DDL (table creation) SQL. Stuff like stored procedures is likely way on the incompatible between DB vendors scale. Are the gains worth the very real increases in labor required?
questions of: should it work on PHP3? should Drupal be more indulgent towards backwards-compatibility?
Its not matter of backward compatibility rather availability. Most shared hosting services only offer MyISAM. So do you think Drupal should ignore a majority of user who are forced to use MyISAM ? __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
Vivek Puri wrote:
questions of: should it work on PHP3? should Drupal be more indulgent towards backwards-compatibility?
Its not matter of backward compatibility rather availability. Most shared hosting services only offer MyISAM. So do you think Drupal should ignore a majority of user who are forced to use MyISAM ?
No: sorry if it came across that way, but I don't think I implied that at all. I just wanted to raise the question in a wider context rather than answer it either way: I don't know the answer! Best regards, J-P
Guys/Gals We are not discussing MySQL's MyISAM here. If we want a CMS that usable by the masses, then we have to keep it. If we want to appeal to only an elite few with InnoDB (or Falcon or Oracle), then we can drop it. I don't think we want to latter. At least not yet. Again, the discussion is about auto incrementing vs. db_next_id() vs. locking ... NOT MyISAM. Dries has chimed in supporting the move to auto increment, so let us keep the patches rolling.
On 4/26/07, Khalid Baheyeldin <kb@2bits.com> wrote:
We are not discussing MySQL's MyISAM here. If we want a CMS that usable by the masses, then we have to keep it. If we want to appeal to only an elite few with InnoDB (or Falcon or Oracle), then we can drop it.
I don't think we want to latter. At least not yet.
Oh, and feel free to post about what we should obsolete and not obsolete in separate threads with the appropriate subject line. -- 2bits.com http://2bits.com Drupal development, customization and consulting.
Mark, Good points. However, I think there are 2 types of Drupal users: Enterprise, and non-enterprise. (read: large/small). Meaning, not everyone has a company large-enough where they can have the dedicated DB staff to maintain, create, and update the sp's, triggers, etc. However, on the flip side, companies that do have this type of setup and resources balk at Drupal because they can't leverage their DB staff to truly separate the db from the app-layer. It's a catch 22 either way you think of it. By and large though, most Drupal installs do not have the dedicated DB staff to truly separate the layers; and I do mean dedicated. Maintaining and optimizing triggers, sp's, etc. in a production environment is a full time job and not really something to be looked at as a 'hobby'. The other flip side to stored procs, is PHP's arch. And what sp's were made for. Each request is it's own thread. (this is knowledge from SQL Server, I'll have to read about MySQL's sp implementation to see if it's the same). But, depending on the DB, the sp is compiled upon first use with each connection. If the same connection calls the sp again, the cached query plan is used. However, if the query plan is dynamic (e.g. WHERE clause changes each time), performance actually goes down greatly. If the WITH RECOMPILE option is used (meaning re-do the query plan with each execution), performance only goes down slightly. More over, sp's are made for 2 things: 1. Reduce network traffic. 2. Place the processing on the database server and not the app. So, let's look at sp's; they send a small RPC call to the DB, and the db sends back the results. Where currently we send SQL statements to the DB, and get back results. The catch is though, that Drupal sends relatively small statements, and statements that are not really processor intensive, compared to say a manager's analytical application that sends a 40 line SQL statement that goes across 3 databases and takes about 15 seconds to process. I guess what I'm saying is that I don't think sp's would give Drupal much of a performance boost. However, this is a null point because I have no benchmarks to back this up. Regarding ref. integrity, I probably generalized and mis-spoke; when I said it, I meant the ID generation, meaning module x can generate an id for table y and work with it before actual insertion...hmm....which upon thinking about it should all be handled in node_save function either way. However, I have many modules that do not create a node, thus creating a node and doing a node_save wouldn't give me an nid, if say I needed the eid (an arbitrary id I made up) because the item is not a node/content-type. All in all, there are tradeoffs either way...its just which one are we willing to accept? -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Mark Fredrickson Sent: Wednesday, April 25, 2007 12:06 PM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns? Hi Earnest and others, On 4/25/07, Earnest Berry III <earnest.berry@gmail.com> wrote:
Hmm...I'm still more of a fan of controlling the IDs application level Than using a db generated serial number, especially when dealing with referential integrity.
I have the opposite standpoint. The more referential integrity that is in the app layer, the harder it it is to integrate and share data from the database. Use case: I have a another (non-PHP) system that needs to insert nodes. If it can't generate a valid ID for that node, this process is halted in it's tracks. I would like to see more data modeling logic moved to the db layer (stored procedures, triggers, views, etc), and leave PHP to do the theming and IO on top of that. E.g. CCK would be a great candidate for going to the db layer.
However, say an NID, of the txid (from e-commerce) gets passed around drupal and used by other modules, sometimes in loops, and sometimes for other modules to ref. the node; this I'd rather have some control of.
I do not see why these behaviors indicate that referential integrity should be handled in the app. If you are saying that within a loop, one must generate N IDs, each with a database call, there is some merit to argument that app created IDs would be better. But there are going to be other tradeoffs.
Also, the implementations of "next_val" per-database get a little hairy also. You also then have the issue of if you are in a transaction, the transaction level, etc. that all starts to play in to the db's implementation of "next_val" or "last_insert_id".
I'll defer to others' experience on this subject, but I'd be surprised if the problems were intractable. Considering that Drupal currently supports all of 2 databases (I'm crossing my fingers for SQLite support - that'd be cool) unifying the implementations shouldn't be that hard. -M
More over, sp's are made for 2 things:
1. Reduce network traffic. 2. Place the processing on the database server and not the app.
These are two reasons for stored procedures. Wikipedia lists two more: 3. Simplification of Data Management 4. Security Of all of these, I really only care about 3. I think the wikipedia article does a good job on that, so I'll just link to it: http://en.wikipedia.org/wiki/Stored_procedure#Simplification_of_data_managem...
I guess what I'm saying is that I don't think sp's would give Drupal much of a performance boost. However, this is a null point because I have no benchmarks to back this up.
Agreed. But I actually don't really care that much about performance. :-) That said, I think we could remove 50% of nodeapi overhead if we had triggers and views doing most of the load/insert/delete stuff. node_load is a pretty expensive operation right now. I'd like to see that improved. I think this is one of Barry's goals with Schema.
All in all, there are tradeoffs either way...its just which one are we willing to accept?
True. We could probably go around in circles on this all day. :-) I would encourage people to attempt db level logic where possible and we'll see what happens. Happy hunting, -Mark
FYI: SQL Server's version is the IDENTITY field. E.g. CREATE TABLE x ( id INT IDENTITY(1,1) PRIMARY KEY); -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Syscrusher Sent: Wednesday, April 25, 2007 10:52 AM To: development@drupal.org Cc: Ezra B. Gildesgame Subject: Re: [development] Why do we avoid auto-incrementing db columns? On Wednesday 25 April 2007 10:29, Ezra B. Gildesgame wrote:
I happened to discuss this with Greg Knaddison yesterday. MySQL 4 does not support auto-incrementing columns.
I must respectfully disagree. I happened to have a copy of the MySQL reference book from version 3.22 sitting on my bookshelf, and AUTO_INCREMENT was supported even then. :-) Never mind the question of why I still have such an old book.... I also checked a PostgreSQL reference book, and they offer a SERIAL column type that works very similarly to the AUTO_INCREMENT in MySQL. I don't know how far back this support goes; my book is for PG version 8.something and was copyright 2006. I can't comment on the Microsoft SQL Server implementation of this, if any, because I don't have any manual for that here. Scott -- ---------------------------------------------------------------------------- --- Syscrusher (Scott Courtney) Drupal page: http://drupal.org/user/9184 syscrusher at 4th dot com Home page: http://4th.com/
On Wednesday 25 April 2007 11:40, Earnest Berry III wrote:
FYI: SQL Server's version is the IDENTITY field. E.g. CREATE TABLE x ( id INT IDENTITY(1,1) PRIMARY KEY);
Thanks! I assumed there was some equivalent function, but didn't have the reference info to look it up. Scott -- ------------------------------------------------------------------------------- Syscrusher (Scott Courtney) Drupal page: http://drupal.org/user/9184 syscrusher at 4th dot com Home page: http://4th.com/
Hi everyone, I'm PostgreSQL maintainer for Drupal and since people are up in the air about what is going on down in the DB layer re autoincrement and serial fields I figured i'd shed some light. Many people have already stated this stuff in earlier mails and hopefully getting the info in one big pile will make it more concrete. PostgreSQL does implement atomic sequencing with the serial and bigserial types. MySQL does the same with the auto_increment field property. These are used in Drupal to produce a guaranteed unique ID for tables using them. Implementing this in PHP will certainly generate non-unique IDs unless you lock tables thus introducing a massive performance penalty. We use db_next_id() in Drupal to provide a db-independent method of getting the next unique ID into the application for use just after the insert operation on the DB. The performance hit of using a sequence is next to nothing considering all that's involved is a mutex lock, read, increment and mutex unlock. Making this more efficient is simply not possible. Sure there are two different use cases for insertions: those where you need the ID afterwards (e.g node) and those you don't (e.g watchdog). If you want to write code without the use of db_next_id() then do an INSERT and omit the id column. I don't see anyone wanting to spend effort patching Drupal core to gain 0.01% efficiency across the application. /me turns his torch off -- Sammy Spets Synerger Pty Ltd http://synerger.com On 25-Apr-07 10:15, Barry Jaspan wrote:
Could someone either clue me in or point to a prior discussion about why Drupal avoids using auto-incrementing database columns? If it is database compatibility, which versions of which databases that we support do not have it?
Thanks,
Barry
Sammy Spets <sammys-drupal@synerger.com> writes:
PostgreSQL [has] serial and bigserial types. MySQL [has] auto_increment field property. These are used in Drupal to produce a guaranteed unique ID for tables using them.
We use db_next_id() in Drupal to provide a db-independent method of getting the next unique ID into the application for use just after the insert operation on the DB.
I interpret this to mean, in your opinion at least, that (a) Drupal does officially condone and support the use of auto-incrementing columns and (b) it also provides db_next_id() for those situations where having it is simply more convenient or efficient than letting the database do the id generation. Is that correct? Is this the official core policy on this subject?
I'm PostgreSQL maintainer for Drupal...
FYI, the Schema module now supports MySQL and PostgeSQL for table and index creation from the same schema data structure. I discovered the pgsql's use of the serial type during the implementation which is what led to my initiating this thread. Schema now supports a 'serial' column type which maps to auto_increment for MySQL or {,big}serial for PostgreSQL. I'd be very happy to have you review the pgsql support. Schema is currently in contrib. Thanks, Barry
On 25-Apr-07 13:55, Barry Jaspan wrote:
Sammy Spets <sammys-drupal@synerger.com> writes:
PostgreSQL [has] serial and bigserial types. MySQL [has] auto_increment field property. These are used in Drupal to produce a guaranteed unique ID for tables using them.
We use db_next_id() in Drupal to provide a db-independent method of getting the next unique ID into the application for use just after the insert operation on the DB.
I interpret this to mean, in your opinion at least, that (a) Drupal does officially condone and support the use of auto-incrementing columns and (b) it also provides db_next_id() for those situations where having it is simply more convenient or efficient than letting the database do the id generation.
db_next_id() DOES use the DB to generate the next ID in PostgreSQL. I don't remember what it does for MySQL (not my area of care anyway). If PHP generates the ID for MySQL then start using a proper DBMS! db_next_id() is a DB independent wrapper for acquiring the next value from the DB's auto-increment implementation. It's never more efficient because there is a (albeit) small data transfer back to the app and the query to the DB. Considering the size of this data transfer and the fact there is no other way to implement insertion of a record with the ID as primary key (no other unique key in table) followed by records with the ID as foreign key, this is the best you'll get. Auto-increment columns are already used in Drupal core so your a) is correct.
FYI, the Schema module now supports MySQL and PostgeSQL for table and index creation from the same schema data structure. I discovered the pgsql's use of the serial type during the implementation which is what led to my initiating this thread. Schema now supports a 'serial' column type which maps to auto_increment for MySQL or {,big}serial for PostgreSQL.
I'd be very happy to have you review the pgsql support. Schema is currently in contrib.
Thanks,
Barry
-- Sammy Spets Synerger Pty Ltd http://synerger.com
Hi Sammy, There's only on thing I disagree on; the performance hit of the mutex lock. If you do not have a relatively large number of insertions, than this is not a large performance hit; however, if you do, this does become a performance hit. I admit, I'm on the extreme end of the spectrum as I was inserting some 20 million nodes (which, expanded means 20 million nodes, 20 million node_revisions, and 7-14 million other insertions in diff. tables for the application). After implementing random-id generation with out locking the table ever at all, the performance increase was very noticeable, on the order to 70-100% more throughput of node creation. I would think removing the bottle neck though would benefit sites that probably have a lot of comments being added and content, as there is no contention for the sequences table anymore. -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Sammy Spets Sent: Wednesday, April 25, 2007 12:54 PM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns? Hi everyone, I'm PostgreSQL maintainer for Drupal and since people are up in the air about what is going on down in the DB layer re autoincrement and serial fields I figured i'd shed some light. Many people have already stated this stuff in earlier mails and hopefully getting the info in one big pile will make it more concrete. PostgreSQL does implement atomic sequencing with the serial and bigserial types. MySQL does the same with the auto_increment field property. These are used in Drupal to produce a guaranteed unique ID for tables using them. Implementing this in PHP will certainly generate non-unique IDs unless you lock tables thus introducing a massive performance penalty. We use db_next_id() in Drupal to provide a db-independent method of getting the next unique ID into the application for use just after the insert operation on the DB. The performance hit of using a sequence is next to nothing considering all that's involved is a mutex lock, read, increment and mutex unlock. Making this more efficient is simply not possible. Sure there are two different use cases for insertions: those where you need the ID afterwards (e.g node) and those you don't (e.g watchdog). If you want to write code without the use of db_next_id() then do an INSERT and omit the id column. I don't see anyone wanting to spend effort patching Drupal core to gain 0.01% efficiency across the application. /me turns his torch off -- Sammy Spets Synerger Pty Ltd http://synerger.com On 25-Apr-07 10:15, Barry Jaspan wrote:
Could someone either clue me in or point to a prior discussion about why Drupal avoids using auto-incrementing database columns? If it is database compatibility, which versions of which databases that we support do not have it?
Thanks,
Barry
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Earnest Berry III schrieb:
Hi Sammy,
There's only on thing I disagree on; the performance hit of the mutex lock. If you do not have a relatively large number of insertions, than this is not a large performance hit; however, if you do, this does become a performance hit. I admit, I'm on the extreme end of the spectrum as I was inserting some 20 million nodes (which, expanded means 20 million nodes, 20 million node_revisions, and 7-14 million other insertions in diff. tables for the application).
After implementing random-id generation with out locking the table ever at all, the performance increase was very noticeable, on the order to 70-100% more throughput of node creation.
http://drupal.org/node/55516 You'll find this interesting. Cheers, Gerhard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFGL4vHfg6TFvELooQRAjpKAKCBPK+yW2CfpFGdTxmYyrFJdoL8egCfenj8 vR8BKlJxyTdTyb0Aip1T1AU= =HooF -----END PGP SIGNATURE-----
Stored procedures would be nice to have in some cases, but remember that many people use drupal for small sites hosted on shared servers. As far as I know you can't create mysql stored procedures via PHP (try that in phpmyadmin), so unless you have access to the database server via ssh or some other non-php interface, you won't be able to create any sp's. Florian
Earnest, Insertion of 20 million nodes in a short space of time implies 10 million users are inserting 2 nodes each in a short space of time. It also implies many other combinations. Looking at the plausibility that this situation become true.... pretty much no chance you'll get that many insertions in the same table at the same time except for the watchdog table. Sure, you've found a way to improve performance for an unrealistic boundary case and that's fantastic. Thanks for working it out! How does your technique of randomly allocating IDs avoid duplicate ID allocation? Cheers, -- Sammy Spets Synerger Pty Ltd http://synerger.com On 25-Apr-07 13:59, Earnest Berry III wrote:
Hi Sammy,
There's only on thing I disagree on; the performance hit of the mutex lock. If you do not have a relatively large number of insertions, than this is not a large performance hit; however, if you do, this does become a performance hit. I admit, I'm on the extreme end of the spectrum as I was inserting some 20 million nodes (which, expanded means 20 million nodes, 20 million node_revisions, and 7-14 million other insertions in diff. tables for the application).
After implementing random-id generation with out locking the table ever at all, the performance increase was very noticeable, on the order to 70-100% more throughput of node creation.
I would think removing the bottle neck though would benefit sites that probably have a lot of comments being added and content, as there is no contention for the sequences table anymore.
-----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Sammy Spets Sent: Wednesday, April 25, 2007 12:54 PM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns?
Hi everyone,
I'm PostgreSQL maintainer for Drupal and since people are up in the air about what is going on down in the DB layer re autoincrement and serial fields I figured i'd shed some light. Many people have already stated this stuff in earlier mails and hopefully getting the info in one big pile will make it more concrete.
PostgreSQL does implement atomic sequencing with the serial and bigserial types. MySQL does the same with the auto_increment field property. These are used in Drupal to produce a guaranteed unique ID for tables using them. Implementing this in PHP will certainly generate non-unique IDs unless you lock tables thus introducing a massive performance penalty.
We use db_next_id() in Drupal to provide a db-independent method of getting the next unique ID into the application for use just after the insert operation on the DB. The performance hit of using a sequence is next to nothing considering all that's involved is a mutex lock, read, increment and mutex unlock. Making this more efficient is simply not possible.
Sure there are two different use cases for insertions: those where you need the ID afterwards (e.g node) and those you don't (e.g watchdog). If you want to write code without the use of db_next_id() then do an INSERT and omit the id column. I don't see anyone wanting to spend effort patching Drupal core to gain 0.01% efficiency across the application.
/me turns his torch off
-- Sammy Spets Synerger Pty Ltd http://synerger.com
On 25-Apr-07 10:15, Barry Jaspan wrote:
Could someone either clue me in or point to a prior discussion about why Drupal avoids using auto-incrementing database columns? If it is database compatibility, which versions of which databases that we support do not have it?
Thanks,
Barry
On 4/29/07, Sammy Spets <sammys-drupal@synerger.com> wrote:
Earnest,
Insertion of 20 million nodes in a short space of time implies 10 million users are inserting 2 nodes each in a short space of time. It also implies many other combinations. Looking at the plausibility that this situation become true.... pretty much no chance you'll get that many insertions in the same table at the same time except for the watchdog table.
Uh. Except for high performance external sources of data feeding a Drupal front end. Remember what assumptions make out of you and me... Earnest figured it out because he needed to implement it for a project.
How does your technique of randomly allocating IDs avoid duplicate ID allocation?
I believe it checks for dupes and re-randomizes, reading his write up. -- Boris Mann Office 604-682-2889 Skype borismann http://www.bryght.com
"pretty much no chance" === extremely rare yet not impossible -- Sammy Spets Synerger Pty Ltd http://synerger.com On 29-Apr-07 23:55, Boris Mann wrote:
On 4/29/07, Sammy Spets <sammys-drupal@synerger.com> wrote:
Earnest,
Insertion of 20 million nodes in a short space of time implies 10 million users are inserting 2 nodes each in a short space of time. It also implies many other combinations. Looking at the plausibility that this situation become true.... pretty much no chance you'll get that many insertions in the same table at the same time except for the watchdog table.
Uh. Except for high performance external sources of data feeding a Drupal front end. Remember what assumptions make out of you and me...
Earnest figured it out because he needed to implement it for a project.
How does your technique of randomly allocating IDs avoid duplicate ID allocation?
I believe it checks for dupes and re-randomizes, reading his write up.
-- Boris Mann Office 604-682-2889 Skype borismann http://www.bryght.com
Sammy, A user insertion rate is probably unrealistic, how ever there is more to the issue. Sequences are a headache when they get "out" of sync (sequence). By randomizing them, this becomes a non-issue. On the case of the situation being impossible, take the case of about 30,000 logged in users be logged in at the same time. Let's say about %50 of them are replying to comments, thus why they're logged in, and the other %50 are just browsing. At any given time, this ration could flip on the imaginary %50 line; regardless, you have 15,000 inserting information into the system. Those 15,000 entries create a new comment (thus 15,000 new ids). Now, add to that a CCK type, you now have about 3 ids per entry, thus 75,000 new ids. Now add to that your own custom module that's listening to nodeapi for an insertion of a certain type, and let say about %50 of those insertions are caught by your module, thus you now have 75,000 * 1.5, or 112,500 new ids. This can quickly turn into a geometric problem. That said, I made most of those stats up as I don't have time to go through some logs and take actual metrics, however, the theory should be sound enough that ID generation can be resource consuming if not handled well. Look at http://buytaert.net/drupal-site-statistics, it's a geometric growth function. And that's just the topical "node"/"content" entity, not all the functions, ids, and quires underneath the creation of each one. As far as id collision, I kind of do a re-hash, but not really. The way I handle that is there is a "buffer" stack. I took the "impossibility" that 2 numbers could be generated in the same second. Thus, in that second, T1 (thread 1) could be still working with ID 5, and thread 2 would be issues ID 5 because the function wouldn't see ID 5 as taken because T1 hasn't finished/committed yet. Thus a "buffer" stack is used to hold ID's that haven't been used. This is flushed every so often. It's in the write up. This setup should bring the possibility of a number collision down to nearly 0. My problem also came about because I am feeing in large amounts of information from other sources on the back-end/underneath. So it's not so much the users I'm worried about as much the sheer escalation of the content. Some of the data is dropped, some added, some updated, the basic principle though is that the information is always changing, adding, and subtracting, thus I needed to build a better system to handle this. -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Sammy Spets Sent: Monday, April 30, 2007 3:22 AM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns? "pretty much no chance" === extremely rare yet not impossible -- Sammy Spets Synerger Pty Ltd http://synerger.com On 29-Apr-07 23:55, Boris Mann wrote:
On 4/29/07, Sammy Spets <sammys-drupal@synerger.com> wrote:
Earnest,
Insertion of 20 million nodes in a short space of time implies 10 million users are inserting 2 nodes each in a short space of time. It also implies many other combinations. Looking at the plausibility that this situation become true.... pretty much no chance you'll get that many insertions in the same table at the same time except for the watchdog table.
Uh. Except for high performance external sources of data feeding a Drupal front end. Remember what assumptions make out of you and me...
Earnest figured it out because he needed to implement it for a project.
How does your technique of randomly allocating IDs avoid duplicate ID allocation?
I believe it checks for dupes and re-randomizes, reading his write up.
-- Boris Mann Office 604-682-2889 Skype borismann http://www.bryght.com
Most of that makes sense and kudos for putting in the time. I can't see how sequences can get out of sync. Is this in MySQL only? /me goes into the stupid basket. -- Sammy Spets Synerger Pty Ltd http://synerger.com On 30-Apr-07 09:54, Earnest Berry III wrote:
Sammy,
A user insertion rate is probably unrealistic, how ever there is more to the issue. Sequences are a headache when they get "out" of sync (sequence). By randomizing them, this becomes a non-issue.
On the case of the situation being impossible, take the case of about 30,000 logged in users be logged in at the same time. Let's say about %50 of them are replying to comments, thus why they're logged in, and the other %50 are just browsing. At any given time, this ration could flip on the imaginary %50 line; regardless, you have 15,000 inserting information into the system. Those 15,000 entries create a new comment (thus 15,000 new ids). Now, add to that a CCK type, you now have about 3 ids per entry, thus 75,000 new ids. Now add to that your own custom module that's listening to nodeapi for an insertion of a certain type, and let say about %50 of those insertions are caught by your module, thus you now have 75,000 * 1.5, or 112,500 new ids.
This can quickly turn into a geometric problem. That said, I made most of those stats up as I don't have time to go through some logs and take actual metrics, however, the theory should be sound enough that ID generation can be resource consuming if not handled well. Look at http://buytaert.net/drupal-site-statistics, it's a geometric growth function. And that's just the topical "node"/"content" entity, not all the functions, ids, and quires underneath the creation of each one.
As far as id collision, I kind of do a re-hash, but not really. The way I handle that is there is a "buffer" stack. I took the "impossibility" that 2 numbers could be generated in the same second. Thus, in that second, T1 (thread 1) could be still working with ID 5, and thread 2 would be issues ID 5 because the function wouldn't see ID 5 as taken because T1 hasn't finished/committed yet. Thus a "buffer" stack is used to hold ID's that haven't been used. This is flushed every so often. It's in the write up. This setup should bring the possibility of a number collision down to nearly 0.
My problem also came about because I am feeing in large amounts of information from other sources on the back-end/underneath. So it's not so much the users I'm worried about as much the sheer escalation of the content. Some of the data is dropped, some added, some updated, the basic principle though is that the information is always changing, adding, and subtracting, thus I needed to build a better system to handle this.
-----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Sammy Spets Sent: Monday, April 30, 2007 3:22 AM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns?
"pretty much no chance" === extremely rare yet not impossible
-- Sammy Spets Synerger Pty Ltd http://synerger.com
On 29-Apr-07 23:55, Boris Mann wrote:
On 4/29/07, Sammy Spets <sammys-drupal@synerger.com> wrote:
Earnest,
Insertion of 20 million nodes in a short space of time implies 10 million users are inserting 2 nodes each in a short space of time. It also implies many other combinations. Looking at the plausibility that this situation become true.... pretty much no chance you'll get that many insertions in the same table at the same time except for the watchdog table.
Uh. Except for high performance external sources of data feeding a Drupal front end. Remember what assumptions make out of you and me...
Earnest figured it out because he needed to implement it for a project.
How does your technique of randomly allocating IDs avoid duplicate ID allocation?
I believe it checks for dupes and re-randomizes, reading his write up.
-- Boris Mann Office 604-682-2889 Skype borismann http://www.bryght.com
If everything works like its supposed to, then no, sequences should never get out of sync. However, bad code, or a random error can cause them to get out of sync. And besides...when's the last time code acted "exactly" as "we" told it to :). In Drupal's case though, the error would have to happen in the dbnext function. Which is prob. Rare at best. -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Sammy Spets Sent: Thursday, May 03, 2007 3:50 AM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns? Most of that makes sense and kudos for putting in the time. I can't see how sequences can get out of sync. Is this in MySQL only? /me goes into the stupid basket. -- Sammy Spets Synerger Pty Ltd http://synerger.com On 30-Apr-07 09:54, Earnest Berry III wrote:
Sammy,
A user insertion rate is probably unrealistic, how ever there is more to the issue. Sequences are a headache when they get "out" of sync (sequence). By randomizing them, this becomes a non-issue.
On the case of the situation being impossible, take the case of about 30,000 logged in users be logged in at the same time. Let's say about %50 of them are replying to comments, thus why they're logged in, and the other %50 are just browsing. At any given time, this ration could flip on the imaginary %50 line; regardless, you have 15,000 inserting information into the system. Those 15,000 entries create a new comment (thus 15,000 new ids). Now, add to that a CCK type, you now have about 3 ids per entry, thus 75,000 new ids. Now add to that your own custom module that's listening to nodeapi for an insertion of a certain type, and let say about %50 of those insertions are caught by your module, thus you now have 75,000 * 1.5, or 112,500 new ids.
This can quickly turn into a geometric problem. That said, I made most of those stats up as I don't have time to go through some logs and take
actual
metrics, however, the theory should be sound enough that ID generation can be resource consuming if not handled well. Look at http://buytaert.net/drupal-site-statistics, it's a geometric growth function. And that's just the topical "node"/"content" entity, not all the functions, ids, and quires underneath the creation of each one.
As far as id collision, I kind of do a re-hash, but not really. The way I handle that is there is a "buffer" stack. I took the "impossibility" that 2 numbers could be generated in the same second. Thus, in that second, T1 (thread 1) could be still working with ID 5, and thread 2 would be issues ID 5 because the function wouldn't see ID 5 as taken because T1 hasn't finished/committed yet. Thus a "buffer" stack is used to hold ID's that haven't been used. This is flushed every so often. It's in the write up. This setup should bring the possibility of a number collision down to nearly 0.
My problem also came about because I am feeing in large amounts of information from other sources on the back-end/underneath. So it's not so much the users I'm worried about as much the sheer escalation of the content. Some of the data is dropped, some added, some updated, the basic principle though is that the information is always changing, adding, and subtracting, thus I needed to build a better system to handle this.
-----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Sammy Spets Sent: Monday, April 30, 2007 3:22 AM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns?
"pretty much no chance" === extremely rare yet not impossible
-- Sammy Spets Synerger Pty Ltd http://synerger.com
On 29-Apr-07 23:55, Boris Mann wrote:
On 4/29/07, Sammy Spets <sammys-drupal@synerger.com> wrote:
Earnest,
Insertion of 20 million nodes in a short space of time implies 10 million users are inserting 2 nodes each in a short space of time. It also implies many other combinations. Looking at the plausibility that this situation become true.... pretty much no chance you'll get that many insertions in the same table at the same time except for the watchdog table.
Uh. Except for high performance external sources of data feeding a Drupal front end. Remember what assumptions make out of you and me...
Earnest figured it out because he needed to implement it for a project.
How does your technique of randomly allocating IDs avoid duplicate ID allocation?
I believe it checks for dupes and re-randomizes, reading his write up.
-- Boris Mann Office 604-682-2889 Skype borismann http://www.bryght.com
On 25 Apr 2007, at 18:53, Sammy Spets wrote:
Sure there are two different use cases for insertions: those where you need the ID afterwards (e.g node) and those you don't (e.g watchdog). If you want to write code without the use of db_next_id() then do an INSERT and omit the id column. I don't see anyone wanting to spend effort patching Drupal core to gain 0.01% efficiency across the application.
Overall, I think it would be good if we could get rid of db_next_id (). Related patches: + http://drupal.org/node/49836 + http://drupal.org/node/118066 * Drupal's db_next_id() was introduced in 2002 by Marco Molinari, who used to be a top-contributor. Parts of the relevant discussion can be found at: http://lists.drupal.org/archives/development/2002-08/msg00222.html http://lists.drupal.org/archives/development/2002-09/msg00185.html We introduced it because it was hard to reproduce the (correct) last ID on non MySQL databases -- we felt this was a more portable way to get the last ID. As this has been a while, the original motivation might no longer be relevant, or it might have been incorrect all the way. We had less database expertise/experts than we have now, and lack of experience probably made db_next_id() feel like a portable and safe choice ... * I don't think it has been mentioned already but the fact we do not use auto_increment make it difficult/impossible to setup Drupal in a master-master configuration. MySQL 5 introdeced a "auto_increment_increment" and "auto_increment_offset" which can't be used until we get rid of db_next_id(). See http://dev.mysql.com/doc/ refman/5.0/en/replication-auto-increment.html. * The table lock in db_next_id() has been shown to be a problem in certain setups. * table locking, as used in db_next_id(), is not always supported by cheap webhosts. * http://drupal.org/node/49836#comment-74275 suggests that most/all database support something like auto_increment. That's 5 reasons why we could remove db_next_id() ... ;-) -- Dries Buytaert :: http://www.buytaert.net/
And to clarify, what i've been explaining above is different to what is explained in this post. This post is about modifing db_next_id() to stop using the sequences table for MySQL. My post was about PostgreSQL. The use of db_next_id() or db_insert_id() can't be removed for obvious reasons. The implementation can be changed. Thanks Dries for bringing up these posts. -- Sammy Spets Synerger Pty Ltd http://synerger.com On 26-Apr-07 07:53, Dries Buytaert wrote:
On 25 Apr 2007, at 18:53, Sammy Spets wrote:
Sure there are two different use cases for insertions: those where you need the ID afterwards (e.g node) and those you don't (e.g watchdog). If you want to write code without the use of db_next_id() then do an INSERT and omit the id column. I don't see anyone wanting to spend effort patching Drupal core to gain 0.01% efficiency across the application.
Overall, I think it would be good if we could get rid of db_next_id (). Related patches:
+ http://drupal.org/node/49836 + http://drupal.org/node/118066
* Drupal's db_next_id() was introduced in 2002 by Marco Molinari, who used to be a top-contributor. Parts of the relevant discussion can be found at:
http://lists.drupal.org/archives/development/2002-08/msg00222.html http://lists.drupal.org/archives/development/2002-09/msg00185.html
We introduced it because it was hard to reproduce the (correct) last ID on non MySQL databases -- we felt this was a more portable way to get the last ID. As this has been a while, the original motivation might no longer be relevant, or it might have been incorrect all the way. We had less database expertise/experts than we have now, and lack of experience probably made db_next_id() feel like a portable and safe choice ...
* I don't think it has been mentioned already but the fact we do not use auto_increment make it difficult/impossible to setup Drupal in a master-master configuration. MySQL 5 introdeced a "auto_increment_increment" and "auto_increment_offset" which can't be used until we get rid of db_next_id(). See http://dev.mysql.com/doc/ refman/5.0/en/replication-auto-increment.html.
* The table lock in db_next_id() has been shown to be a problem in certain setups.
* table locking, as used in db_next_id(), is not always supported by cheap webhosts.
* http://drupal.org/node/49836#comment-74275 suggests that most/all database support something like auto_increment.
That's 5 reasons why we could remove db_next_id() ... ;-)
-- Dries Buytaert :: http://www.buytaert.net/
participants (21)
-
Amr Mostafa -
Barry Jaspan -
Boris Mann -
Chris Johnson -
Darren Oh -
David Strauss -
Dries Buytaert -
Earnest Berry III -
Ezra B. Gildesgame -
Florian Loretan -
Gerhard Killesreiter -
Greg Knaddison - GVS -
J-P Stacey -
Khalid Baheyeldin -
Larry Garfield -
Mark Fredrickson -
Mike Cantelon -
Sammy Spets -
Syscrusher -
Vivek Puri -
Zohar Stolar