Most of that makes sense and kudos for putting in the time. I can't see how sequences can get out of sync. Is this in MySQL only? /me goes into the stupid basket. -- Sammy Spets Synerger Pty Ltd http://synerger.com On 30-Apr-07 09:54, Earnest Berry III wrote:
Sammy,
A user insertion rate is probably unrealistic, how ever there is more to the issue. Sequences are a headache when they get "out" of sync (sequence). By randomizing them, this becomes a non-issue.
On the case of the situation being impossible, take the case of about 30,000 logged in users be logged in at the same time. Let's say about %50 of them are replying to comments, thus why they're logged in, and the other %50 are just browsing. At any given time, this ration could flip on the imaginary %50 line; regardless, you have 15,000 inserting information into the system. Those 15,000 entries create a new comment (thus 15,000 new ids). Now, add to that a CCK type, you now have about 3 ids per entry, thus 75,000 new ids. Now add to that your own custom module that's listening to nodeapi for an insertion of a certain type, and let say about %50 of those insertions are caught by your module, thus you now have 75,000 * 1.5, or 112,500 new ids.
This can quickly turn into a geometric problem. That said, I made most of those stats up as I don't have time to go through some logs and take actual metrics, however, the theory should be sound enough that ID generation can be resource consuming if not handled well. Look at http://buytaert.net/drupal-site-statistics, it's a geometric growth function. And that's just the topical "node"/"content" entity, not all the functions, ids, and quires underneath the creation of each one.
As far as id collision, I kind of do a re-hash, but not really. The way I handle that is there is a "buffer" stack. I took the "impossibility" that 2 numbers could be generated in the same second. Thus, in that second, T1 (thread 1) could be still working with ID 5, and thread 2 would be issues ID 5 because the function wouldn't see ID 5 as taken because T1 hasn't finished/committed yet. Thus a "buffer" stack is used to hold ID's that haven't been used. This is flushed every so often. It's in the write up. This setup should bring the possibility of a number collision down to nearly 0.
My problem also came about because I am feeing in large amounts of information from other sources on the back-end/underneath. So it's not so much the users I'm worried about as much the sheer escalation of the content. Some of the data is dropped, some added, some updated, the basic principle though is that the information is always changing, adding, and subtracting, thus I needed to build a better system to handle this.
-----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Sammy Spets Sent: Monday, April 30, 2007 3:22 AM To: development@drupal.org Subject: Re: [development] Why do we avoid auto-incrementing db columns?
"pretty much no chance" === extremely rare yet not impossible
-- Sammy Spets Synerger Pty Ltd http://synerger.com
On 29-Apr-07 23:55, Boris Mann wrote:
On 4/29/07, Sammy Spets <sammys-drupal@synerger.com> wrote:
Earnest,
Insertion of 20 million nodes in a short space of time implies 10 million users are inserting 2 nodes each in a short space of time. It also implies many other combinations. Looking at the plausibility that this situation become true.... pretty much no chance you'll get that many insertions in the same table at the same time except for the watchdog table.
Uh. Except for high performance external sources of data feeding a Drupal front end. Remember what assumptions make out of you and me...
Earnest figured it out because he needed to implement it for a project.
How does your technique of randomly allocating IDs avoid duplicate ID allocation?
I believe it checks for dupes and re-randomizes, reading his write up.
-- Boris Mann Office 604-682-2889 Skype borismann http://www.bryght.com