[development] Why do we avoid auto-incrementing db columns?

Earnest Berry III earnest.berry at gmail.com
Mon Apr 30 13:54:16 UTC 2007


Sammy,

A user insertion rate that high is probably unrealistic; however, there is more
to the issue. Sequences are a headache when they get out of sync. By
randomizing the IDs, that becomes a non-issue.

As for the situation being impossible, take the case of about 30,000 users
logged in at the same time. Let's say about 50% of them are replying to
comments (which is why they're logged in) and the other 50% are just browsing.
At any given time, that ratio could flip around the imaginary 50% line;
regardless, you have 15,000 users inserting information into the system. Those
15,000 entries each create a new comment (thus 15,000 new ids). Now add a CCK
type to that, at about 3 ids per entry, and you're at roughly 75,000 new ids.
Now add your own custom module listening to nodeapi for insertions of a certain
type, and let's say about 50% of those insertions are caught by your module;
you now have 75,000 * 1.5, or 112,500 new ids.
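To put the arithmetic in one place, here is the same back-of-envelope in rough
PHP terms, taking the figures above as given (none of them are measured):

    <?php
    // Back-of-envelope estimate only; every figure here is a guess, not a metric.
    $comment_ids   = 15000;   // one new comment id per inserting user
    $with_cck      = 75000;   // rough total once the CCK field ids are added
    $module_factor = 1.5;     // ~50% of inserts also caught by a custom module
    $total_ids     = (int) ($with_cck * $module_factor);
    print $total_ids . "\n";  // ~112,500 new ids in that window
    ?>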

This can quickly turn into a geometric problem. That said, I made most of those
stats up, as I don't have time to go through logs and take actual metrics;
however, the theory should be sound enough that ID generation can be
resource-consuming if not handled well. Look at
http://buytaert.net/drupal-site-statistics: it's a geometric growth function.
And that's just the top-level "node"/"content" entity, not all the functions,
ids, and queries underneath the creation of each one.

As far as id collisions go, I kind of do a re-hash, but not really. The way I
handle it is with a "buffer" stack. I took on the "impossibility" that two
identical numbers could be generated in the same second: in that second, T1
(thread 1) could still be working with ID 5, and thread 2 would be issued ID 5,
because the function wouldn't see ID 5 as taken since T1 hasn't
finished/committed yet. So a "buffer" stack is used to hold IDs that have been
issued but not yet committed, and it's flushed every so often. It's in the
write-up. This setup should bring the possibility of a number collision down to
nearly zero.
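To make the idea concrete, here's a rough PHP sketch of that buffer-stack
approach. It is not the code from the write-up: id_exists_in_db() is just a
placeholder for whatever lookup you run against the committed rows, and in
practice the buffer would live somewhere shared (a table or cache), not in
per-request memory.

    <?php
    // Rough sketch only -- not the actual implementation from the write-up.

    // Placeholder for a lookup against the table of committed ids.
    function id_exists_in_db($id) {
      return FALSE; // real code would query the relevant table here
    }

    // $reserved maps id => timestamp of when it was handed out but not yet committed.
    function allocate_random_id(array &$reserved) {
      do {
        $candidate = mt_rand(1, mt_getrandmax());
      } while (isset($reserved[$candidate]) || id_exists_in_db($candidate));
      // Park the id in the buffer so a second request in the same second
      // cannot be issued the same number before the first one commits.
      $reserved[$candidate] = time();
      return $candidate;
    }

    // Flush reservations older than $max_age seconds; by then the insert has
    // either committed (the id is visible in the table) or been abandoned.
    function flush_reserved(array &$reserved, $max_age = 60) {
      foreach ($reserved as $id => $issued_at) {
        if (time() - $issued_at > $max_age) {
          unset($reserved[$id]);
        }
      }
    }
    ?>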

My problem also came about because I am feeding large amounts of information in
from other sources on the back end, underneath Drupal. So it's not so much the
users I'm worried about as the sheer escalation of the content. Some of the
data is dropped, some is added, some is updated; the basic principle, though,
is that the information is always changing, growing, and shrinking, so I needed
to build a better system to handle it.

-----Original Message-----
From: development-bounces at drupal.org [mailto:development-bounces at drupal.org]
On Behalf Of Sammy Spets
Sent: Monday, April 30, 2007 3:22 AM
To: development at drupal.org
Subject: Re: [development] Why do we avoid auto-incrementing db columns?

"pretty much no chance" === extremely rare yet not impossible

-- 
Sammy Spets
Synerger Pty Ltd
http://synerger.com

On 29-Apr-07 23:55, Boris Mann wrote:
> On 4/29/07, Sammy Spets <sammys-drupal at synerger.com> wrote:
> >Earnest,
> >
> >Insertion of 20 million nodes in a short space of time implies 10
> >million users are inserting 2 nodes each in a short space of time. It
> >also implies many other combinations. Looking at the plausibility that
> >this situation become true.... pretty much no chance you'll get that
> >many insertions in the same table at the same time except for the
> >watchdog table.
> 
> Uh. Except for high performance external sources of data feeding a
> Drupal front end. Remember what assumptions make out of you and me...
> 
> Earnest figured it out because he needed to implement it for a project.
> 
> >How does your technique of randomly allocating IDs avoid duplicate ID
> >allocation?
> 
> I believe it checks for dupes and re-randomizes, reading his write up.
> 
> -- 
> Boris Mann
> Office 604-682-2889
> Skype borismann
> http://www.bryght.com


