Unique/Random IDs and drupal
Gentlebeings, I've read the recent thread on devel->staging->deployment, and I wanted to share what I've done in the area. My main interest lies in moving away from monotonically incrementing integers as id values so I have a greater chance of being able to "shard" the drupal DB for high performance. As a first step, I am working on moving drupal core to using 64 bit integers. It was relatively trivial to change schema to create 64 bit tables, but right now there's nothing in schema that marks "foreign keys" as also being special. Would anyone object to schema requiring foreign ID references to be marked specially? Anyway, the direction I'm interested in heading is a 64 bit int + creation date timestamp for every row; the two can be combined to form a valid UUID if there are useful reasons for doing so. Note that Postgresql has UUID creating functions and MySQL 5.1 has uuid_short() which generates a 64 bit random int based on the UUID algorithm. The UUID() algorithm in mysql 5.0 isn't viable for scaling purposes because it's not cluster-safe. I do have some preliminary performance data on the speed at which one can create UUIDs: http://mindlace.net/archives/2008/06/23/generating-uuids-in-php-for-drupal/ I'm very interested in any feedback about the feasibility of at least widening the ids to 64 bits in D7. ~ethan
On Sunday 10 August 2008 2:49:58 pm Ethan Fremen wrote:
Gentlebeings,
I've read the recent thread on devel->staging->deployment, and I wanted to share what I've done in the area.
My main interest lies in moving away from monotonically incrementing integers as id values so I have a greater chance of being able to "shard" the drupal DB for high performance.
As a first step, I am working on moving drupal core to using 64 bit integers. It was relatively trivial to change schema to create 64 bit tables, but right now there's nothing in schema that marks "foreign keys" as also being special.
Would anyone object to schema requiring foreign ID references to be marked specially?
Anyway, the direction I'm interested in heading is a 64 bit int + creation date timestamp for every row; the two can be combined to form a valid UUID if there are useful reasons for doing so.
Note that Postgresql has UUID creating functions and MySQL 5.1 has uuid_short() which generates a 64 bit random int based on the UUID algorithm. The UUID() algorithm in mysql 5.0 isn't viable for scaling purposes because it's not cluster-safe.
I do have some preliminary performance data on the speed at which one can create UUIDs: http://mindlace.net/archives/2008/06/23/generating-uuids-in-php-for-drupal/
I'm very interested in any feedback about the feasibility of at least widening the ids to 64 bits in D7.
~ethan
There was discussion of including foreign keys in Schema API in Drupal 6, but it was dropped after we determined that we couldn't actually do anything with that information at the time. We have to still support MySQL MyISAM tables, which are far and away the most common, and those don't support foreign keys. (Therefore we can't rely on integrity checking, cascading delete/update, etc.) I believe there was consideration of adding FK support in Drupal 7 to allow add-on functionality in places, and I'm certainly open to doing so, but not until the existing 300 KB database overhaul patch has landed. :-) Serial fields are actually very useful, and there's nothing wrong with them. In fact, they are a requirement if we want even remotely intelligible URLs. nids, uids, tids, etc. are all used in URLs, and most Drupal nodes do not in fact have a URL alias on them AFAIK. If we add some form of GUID to the system (which I am not against, and Greg Dunlap has made good arguments for) it will have to be in addition to existing serial fields. We also can't rely on MySQL 5.1 at this point, as it's not even fully stable to say nothing of widely deployed. Given that, I don't see much advantage for 99% of sites to using 64-bit unique IDs over 32-bit. It wouldn't break anything I suppose, but how many Drupal sites have the multiple millions of nodes required to run out of the 32-bit space? I can't actually think of any. -- Larry Garfield larry@garfieldtech.com
Given that, I don't see much advantage for 99% of sites to using 64- bit unique IDs over 32-bit. It wouldn't break anything I suppose, but how many Drupal sites have the multiple millions of nodes required to run out of the 32-bit space? I can't actually think of any.
32 bit allow for more than 4 billion unique IDs. If any site goes over it, they can easily change their columns (or rethink their data model). Konstantin
On Aug 10, 2008, at 5:11 PM, Konstantin Käfer wrote:
32 bit allow for more than 4 billion unique IDs. If any site goes over it, they can easily change their columns (or rethink their data model).
I would definitely accept random 32 bit identifiers as a step in the right direction. That being said, on any 64 bit machine, the 32 bit identifier is represented internally as a 64 bit identifier anyway. It is totally conceivable to me that the set of drupal sites in existence could quickly encounter collisions with a 32 bit identifier. Again, a driving use case is "two drupal sites that were founded independently of any knowledge of each other would like to merge their databases". Whenever you're using a hash function to generate identifiers, having a key space that is close to the size of the number of keys dramatically increases the chance that a collision will occur. By using a 64 bit identifier, we minimize that chance. Thanks for your feedback, ~ethan fremen
Again, a driving use case is "two drupal sites that were founded independently of any knowledge of each other would like to merge their databases".
To me, two even more compelling use cases are: a) sharing global taxonomy vocabularies (the countries of the world each have globally unique ids shared across all sites) b) import/export. This is like the merge example from Ethan but perhaps a little more concrete in terms of what people are already doing. I could export my blog posts from one site and import them into another site and keep the id. c) global user ids. I always have the same user id on every Drupal site.
OK, but please all take into account that it is not necessarily the best thing to tie the creation of the UUID to the database at all. That would be a big mistake and bad design decision and continue the kind of dependencies that need to be overcome. There exist many independent UUID libraries (in Java, PHP, etc) as described in the other thread. Once created, the UUID of course will be persisted, so the rest of the discussion is exceedingly fruitful, especially the use cases. Victor Kane http://awebfactory.com.ar On Sun, Aug 10, 2008 at 7:52 PM, Robert Douglass <rob@robshouse.net> wrote:
Again, a driving use case is "two drupal sites that were founded independently of any knowledge of each other would like to merge their databases".
To me, two even more compelling use cases are:
a) sharing global taxonomy vocabularies (the countries of the world each have globally unique ids shared across all sites) b) import/export. This is like the merge example from Ethan but perhaps a little more concrete in terms of what people are already doing. I could export my blog posts from one site and import them into another site and keep the id. c) global user ids. I always have the same user id on every Drupal site.
On Aug 11, 2008, at 7:20 AM, Victor Kane wrote:
OK, but please all take into account that it is not necessarily the best thing to tie the creation of the UUID to the database at all.
100% agreed. Right now I'm most strongly in favor of using the OSSP UUID library (aka php5-uuid in debian/ubuntu), which is, btw, *much* faster than the DIY random-64-bit identifiers I tried. ~ethan fremen
I'm all in favor of supporting 64bit ints in the database, whether we're going for UUID's or not. I'm personally experimenting with 64bit timestamps instead of using DateTime strings for storing dates. It is silly to not use it because "Noone could concievably use that many ID's in the near future." Platform native word length is a valid argument. UUIDs are 128bit... so I don't see how 64bit values helps you implement UUIDs either, If it's 64bits it's not a UUID per spec it's something else. You probably need a char(36) and hexadecimal representation if you wish to store UUIDs in a single column unless, or you have a plan for reducing the significant data which needs to be stored in the index to 64bits, or you plan to spread the UUID over two integer indexes? I don't quite follow how the sharding is intended to be implemented. I'm more curious about your plans for that. I've read up on how flickr does their user centric sharding, but how do you see the concept being mapped to Drupal? On Tue, Aug 12, 2008 at 8:37 AM, Ethan Fremen <ethan@acquia.com> wrote:
On Aug 11, 2008, at 7:20 AM, Victor Kane wrote:
OK, but please all take into account that it is not necessarily the
best thing to tie the creation of the UUID to the database at all.
100% agreed. Right now I'm most strongly in favor of using the OSSP UUID library (aka php5-uuid in debian/ubuntu), which is, btw, *much* faster than the DIY random-64-bit identifiers I tried.
~ethan fremen
Quoting Robert Douglass <rob@robshouse.net>:
I could export my blog posts from one site and import them into another site and keep the id.
You could "keep the id" but IMO you shouldn't. You don't know if there may be an existing id in the receiving DB. The ID is unique to each DB and existing values should not be merged from one DB into another. That said, the only correct method for merging data from one DB into another is to use the API for the receiving DB so that the foreign key constraints match appropriately (even for those DB engines not supporting foreign key constraints). In Drupal's case the nid should be removed so that a new nid is created or the uid should be removed so that a new uid is created. An import/export API should take into consideration this fact and allow for the removal of the ID columns on import or export. Earnie -- http://for-my-kids.com/ -- http://give-me-an-offer.com/
On Aug 11, 2008, at 2:38 PM, Earnie Boyd wrote:
Quoting Robert Douglass <rob@robshouse.net>:
I could export my blog posts from one site and import them into another site and keep the id.
You could "keep the id" but IMO you shouldn't. You don't know if there may be an existing id in the receiving DB. The ID is unique to each DB and existing values should not be merged from one DB into another. That said, the only correct method for merging data from one DB into another is to use the API for the receiving DB so that the foreign key constraints match appropriately (even for those DB engines not supporting foreign key constraints). In Drupal's case the nid should be removed so that a new nid is created or the uid should be removed so that a new uid is created. An import/export API should take into consideration this fact and allow for the removal of the ID columns on import or export.
Earnie -- http://for-my-kids.com/ -- http://give-me-an-offer.com/
No, this is the entire point of Ethan's suggestion. The id's *are* globally (meaning universe-wide) unique. Note also that there is no inherent suggestion that one *wouldn't* use Drupal APIs. On the contrary, this discussion is about extending Drupal APIs (at least taking the first incremental step - moving to 64 bit integers). I really don't think you can satisfy the use cases discussed thus far easily with existing Drupal APIs. http://en.wikipedia.org/wiki/UUID "A Universally Unique Identifier (UUID) is an identifier standard used in software construction, standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE). The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. Thus, anyone can create a UUID and use it to identify something with reasonable confidence that the identifier will never be unintentionally used by anyone for anything else. Information labeled with UUIDs can therefore be later combined into a single database without needing to resolve name conflicts. The most widespread use of this standard is in Microsoft's Globally Unique Identifiers (GUIDs). Other significant users include Linux's ext2/ext3 filesystem, LUKS encrypted partitions, GNOME, KDE, and Mac OS X, all of which use implementations derived from the uuid library found in the e2fsprogs package."
right! but mustn't be database generated, must be generated on business logic level to avoid dependence on any given persistence On Mon, Aug 11, 2008 at 9:49 AM, Robert Douglass <rob@robshouse.net> wrote:
On Aug 11, 2008, at 2:38 PM, Earnie Boyd wrote:
Quoting Robert Douglass <rob@robshouse.net>:
I could export my blog posts from one site and import them into another site and keep the id.
You could "keep the id" but IMO you shouldn't. You don't know if there may be an existing id in the receiving DB. The ID is unique to each DB and existing values should not be merged from one DB into another. That said, the only correct method for merging data from one DB into another is to use the API for the receiving DB so that the foreign key constraints match appropriately (even for those DB engines not supporting foreign key constraints). In Drupal's case the nid should be removed so that a new nid is created or the uid should be removed so that a new uid is created. An import/export API should take into consideration this fact and allow for the removal of the ID columns on import or export.
Earnie -- http://for-my-kids.com/ -- http://give-me-an-offer.com/
No, this is the entire point of Ethan's suggestion. The id's *are* globally (meaning universe-wide) unique. Note also that there is no inherent suggestion that one *wouldn't* use Drupal APIs. On the contrary, this discussion is about extending Drupal APIs (at least taking the first incremental step - moving to 64 bit integers). I really don't think you can satisfy the use cases discussed thus far easily with existing Drupal APIs.
http://en.wikipedia.org/wiki/UUID
"A Universally Unique Identifier (UUID) is an identifier standard used in software construction, standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE). The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. Thus, anyone can create a UUID and use it to identify something with reasonable confidence that the identifier will never be unintentionally used by anyone for anything else. Information labeled with UUIDs can therefore be later combined into a single database without needing to resolve name conflicts. The most widespread use of this standard is in Microsoft's Globally Unique Identifiers (GUIDs). Other significant users include Linux's ext2/ext3 filesystem, LUKS encrypted partitions, GNOME, KDE, and Mac OS X, all of which use implementations derived from the uuid library found in the e2fsprogs package."
On Aug 10, 2008, at 4:57 PM, Larry Garfield wrote:
There was discussion of including foreign keys in Schema API in Drupal 6, but it was dropped after we determined that we couldn't actually do anything with that information at the time. We have to still support MySQL MyISAM tables, which are far and away the most common, and those don't support foreign keys.
I don't actually care about schema supporting foreign keys. I just want it that when someone puts a reference to a foreign key in their schema, they do so using a type that is distinct from 'int'.
Serial fields are actually very useful, and there's nothing wrong with them.
There are several things wrong with them: a.) it makes it impossible to shard the DB because you have to coordinate what the next sequence ID is, a huge problem with scaling. b.) it makes it difficult to merge two (or more) existing DBs, whether development/production or otherwise. c.) it means that every drupal site has namespace collisions all over the place with every other drupal site.
In fact, they are a requirement if we want even remotely intelligible URLs.
If by "intelligible" you mean "a small number" then you are correct.
nids, uids, tids, etc. are all used in URLs, and most Drupal nodes do not in fact have a URL alias on them AFAIK.
I was under the impression that pathauto was in wide use.
If we add some form of GUID to the system (which I am not against, and Greg Dunlap has made good arguments for) it will have to be in addition to existing serial fields.
Which will not, in fact, address any of the issues I've outlined above.
We also can't rely on MySQL 5.1 at this point, as it's not even fully stable to say nothing of widely deployed.
I don't disagree. I was sharing the state of my understanding of DB support for these sorts of things. I think it is best for drupal to generate them itself at the moment.
Given that, I don't see much advantage for 99% of sites to using 64- bit unique IDs over 32-bit.
Use cases for the 99% of sites using a 64 bit unique identifier: a.) Their content is globally unique across the set of Drupal sites. This makes many syndication and federation tasks easier. b.) They ever wish to join with another site selected at random.
It wouldn't break anything I suppose, but how many Drupal sites have the multiple millions of nodes required to run out of the 32-bit space? I can't actually think of any.
The set of Drupal sites does. This is like the "we'll never run out of a 32 bit identifier" idea with ipv4. Plus, there's exactly 0 performance difference when using a 64 bit machine. I hope this helps clarify some of the use cases, ~ethan fremen
Quoting Ethan Fremen <ethan@acquia.com>:
a.) it makes it impossible to shard the DB because you have to coordinate what the next sequence ID is, a huge problem with scaling. b.) it makes it difficult to merge two (or more) existing DBs, whether development/production or otherwise. c.) it means that every drupal site has namespace collisions all over the place with every other drupal site.
I think you go about the problem incorrectly. Any merging of data would need to use the Drupal API to do that merge. You therefore do not need to know what the next id is because you simply shouldn't care. When merging the data from DB-2 into DB-1 you do not supply the id's to the data from DB-2 and allow Drupal to create new id's in DB-1. As for your supposed namespace collisions between sites, that is the nature of all data when two merge to one. It is the job of the one doing the bridging to overcome that issue but overcoming that issue shouldn't involve needing to make your id's 64bit. What do you do with users who have subscribed to both sites? What do you do with nodes with matching titles? The dilemma you describe can only be resolved by the rules set forth by the business acquiring DB-2. Earnie -- http://for-my-kids.com/ -- http://give-me-an-offer.com/
participants (7)
-
Darrel O'Pry -
Earnie Boyd -
Ethan Fremen -
Konstantin Käfer -
Larry Garfield -
Robert Douglass -
Victor Kane