One-to-one tables considered harmful
== What's bad about the current approach == The arguments for splitting table into one-to-one relationships are: * Improving performance on systems that can only use table-level locking * Minimizing invalidations of the query cache These are not good reasons considering their impacts. First, I'd like to adopt a development philosophy on the Drupal project: screw scalability if the database running Drupal doesn't support row-level locking. Sites using only table-level locks are doomed to scale poorly anyway because of the over-aggressive locking. We won't be able to prevent that disaster with the tiny improvements in lock granularity afforded by splitting tables into one-to-one table pairs. Drupal should still run on hosts allowing only table-level locks. We just need to stop weighing their unique performance concerns so heavily. Second, the query cache is a luxury, not a key part of database design. A site needs to run fast without the query cache because -- at some point -- the database will need to be restarted, which will dump the query cache. If the site can't survive without the query cache, it will enter a cycle of crashing/overloading the server before it can populate the cache. The query cache provides minimal support for the large variety of queries hitting the system. For example, a cached tracker query for me does nothing to help a cached track query for you. The query cache only works after the first run. If a query takes 45 seconds to run, the page load is in jeopardy. A failed page load to populate the query cache is unacceptable. Minimizing invalidations to the query cache by splitting tables yields far worse performance overall because we can't have keys span the tables. This will be discussed in detail in the next section. If we rely on the query cache, we're also being hypocrites with our stance on the first reason because many low-end hosts don't run the query cache. Either we care about scalability on low-end configurations or we don't. I'm suggesting "don't," but if people truly do want to care about low-end scalability, we can't use the query cache argument. == Why avoiding one-to-one joins is a good idea == Many of the slowest queries in Drupal (as seen on Drupal.org) are products of the following query plan: 1. Join the two one-to-one tables, maybe after a WHERE restriction on one. This creates a temporary table. 2. Finish processing the WHERE restrictions on the temporary table. 3. Filesort the results to meet ORDER BY criteria. An example of such a query would be finding all published nodes in order of most recent comments. Another would be finding all published nodes in order of most views today. These are extremely common queries in Drupal. The problem is caused by an insurmountable restriction in MySQL (and probably PostgreSQL): we cannot create a key that spans tables. Because we cannot create such a key, any query with WHERE or ORDER BY criteria spanning two one-to-one tables is doomed to a temp table and a filesort. To have these queries run without temp tables and filesorts, we need to have a key that has all WHERE criteria as a prefix and then has the ORDER BY criteria left over. The greatest paradox to me is that many of the worst one-to-one offenders are denormalized tables designed to improve performance. Because these denormalized tables contain calculated values that many queries use for WHERE and ORDER BY, the JOINs with them inevitably sink the query into temp table/filesort hell. == Where one-to-one is good == It's fine to have one-to-one tables if one table is only used in the JOIN and doesn't have WHERE and ORDER BY restrictions on it. For example, the one-to-one tables created by many content types that JOIN with {node_revisions} are great. You simply INNER JOIN {node_revisions} with the content type's table at little performance penalty. For this same reason, there's not much argument for denormalizing user names into other tables unless we're allowing people to sort on them. Even the forums don't allow sorting by user name. A simple JOIN from {blah}.uid to {users}.uid places almost no burden on the query. == What I'm doing about it == I'm working on a patch: Merge {node_comment_statistics} and {node_counter} into {node} http://drupal.org/node/148849 It also makes more intelligent choices for which calculated values to manage, which removes GREATEST() from numerous queries.
David Strauss wrote:
== What's bad about the current approach ==
The arguments for splitting table into one-to-one relationships are:
* Improving performance on systems that can only use table-level locking * Minimizing invalidations of the query cache
You *completely* missed the real argument: Drupal is modular. Modules tend to have their own tables. Often these are 1::1 tables. I fail to understand how it is MySQL is unable to optimize for the 1::1 join case, especially when that 1::1 join is on the primary key. What am I missing that makes this join worse than other basic joins?
Earl Miles wrote:
David Strauss wrote:
== What's bad about the current approach ==
The arguments for splitting table into one-to-one relationships are:
* Improving performance on systems that can only use table-level locking * Minimizing invalidations of the query cache
You *completely* missed the real argument:
Drupal is modular. Modules tend to have their own tables. Often these are 1::1 tables.
You misinterpreted my post. I said the 1::1 tables created by content types are fine. We're talking about core tables here, like {node_comment_statistics} that have columns often used in WHERE or ORDER BY criteria.
I fail to understand how it is MySQL is unable to optimize for the 1::1 join case, especially when that 1::1 join is on the primary key. What am I missing that makes this join worse than other basic joins?
I also explained this in my original post. Indexes cannot span JOINs. If you don't understand precisely how databases use indexes to fulfill WHERE and ORDER BY criteria, I'm afraid a full explanation is beyond what I could include in an email. I'll try to give a brief one. But here's what happens when you have criteria spanning a JOIN: 1. The database tries to minimize the rows involved in the JOIN using as many criteria as it can. 2. The database performs a Cartesian JOIN on the remaining rows. This creates a giant temporary table on disk. 3. The database runs remaining restrictions on the results. 4. The database sorts the results. This uses a filesort. 5. The database returns a range of the sorted results. For a site like Drupal.org, the running time is about 30-45 seconds. Here's what happens when all criteria (WHERE and ORDER BY are in an index on one table): 1. The database uses the index to fully satisfy WHERE and ORDER BY criteria. This does not use a temp table or a filesort. 2. The database runs JOINs on other tables to collect data for remaining columns. 3. The results are already filtered and sorted, so the database returns a range of the results. For a site like Drupal.org, the running time is 30-45 milliseconds. The difference is really that order of magnitude, and the difference only increases as the site grows in size.
For a common example of a user noticeable slow query, go to modules and look at all the contrib modules available for download. Or perhaps all the time is formating that much information. -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of David Strauss Sent: Sunday, June 03, 2007 5:12 PM To: development@drupal.org Subject: Re: [development] One-to-one tables considered harmful Earl Miles wrote:
David Strauss wrote:
== What's bad about the current approach ==
The arguments for splitting table into one-to-one relationships are:
* Improving performance on systems that can only use table-level locking * Minimizing invalidations of the query cache
You *completely* missed the real argument:
Drupal is modular. Modules tend to have their own tables. Often these are 1::1 tables.
You misinterpreted my post. I said the 1::1 tables created by content types are fine. We're talking about core tables here, like {node_comment_statistics} that have columns often used in WHERE or ORDER BY criteria.
I fail to understand how it is MySQL is unable to optimize for the 1::1 join case, especially when that 1::1 join is on the primary key. What am I missing that makes this join worse than other basic joins?
I also explained this in my original post. Indexes cannot span JOINs. If you don't understand precisely how databases use indexes to fulfill WHERE and ORDER BY criteria, I'm afraid a full explanation is beyond what I could include in an email. I'll try to give a brief one. But here's what happens when you have criteria spanning a JOIN: 1. The database tries to minimize the rows involved in the JOIN using as many criteria as it can. 2. The database performs a Cartesian JOIN on the remaining rows. This creates a giant temporary table on disk. 3. The database runs remaining restrictions on the results. 4. The database sorts the results. This uses a filesort. 5. The database returns a range of the sorted results. For a site like Drupal.org, the running time is about 30-45 seconds. Here's what happens when all criteria (WHERE and ORDER BY are in an index on one table): 1. The database uses the index to fully satisfy WHERE and ORDER BY criteria. This does not use a temp table or a filesort. 2. The database runs JOINs on other tables to collect data for remaining columns. 3. The results are already filtered and sorted, so the database returns a range of the results. For a site like Drupal.org, the running time is 30-45 milliseconds. The difference is really that order of magnitude, and the difference only increases as the site grows in size.
David Strauss wrote:
Earl Miles wrote:
David Strauss wrote:
== What's bad about the current approach ==
The arguments for splitting table into one-to-one relationships are:
* Improving performance on systems that can only use table-level locking * Minimizing invalidations of the query cache You *completely* missed the real argument:
Drupal is modular. Modules tend to have their own tables. Often these are 1::1 tables.
You misinterpreted my post. I said the 1::1 tables created by content types are fine. We're talking about core tables here, like {node_comment_statistics} that have columns often used in WHERE or ORDER BY criteria.
No, you misinterpreted mine. Drupal is modular. comment.module is a module. Now, it's a poor module -- it has some rather nasty integration with the rest of the site which, IMO, it should not. But it worries me that one of the solutions is to re-integrate rather than to make it properly modular.
I fail to understand how it is MySQL is unable to optimize for the 1::1 join case, especially when that 1::1 join is on the primary key. What am I missing that makes this join worse than other basic joins?
I also explained this in my original post. Indexes cannot span JOINs. If you don't understand precisely how databases use indexes to fulfill WHERE and ORDER BY criteria, I'm afraid a full explanation is beyond what I could include in an email. I'll try to give a brief one.
But here's what happens when you have criteria spanning a JOIN: 1. The database tries to minimize the rows involved in the JOIN using as many criteria as it can. 2. The database performs a Cartesian JOIN on the remaining rows. This creates a giant temporary table on disk. 3. The database runs remaining restrictions on the results. 4. The database sorts the results. This uses a filesort. 5. The database returns a range of the sorted results.
For a site like Drupal.org, the running time is about 30-45 seconds.
Here's what happens when all criteria (WHERE and ORDER BY are in an index on one table): 1. The database uses the index to fully satisfy WHERE and ORDER BY criteria. This does not use a temp table or a filesort. 2. The database runs JOINs on other tables to collect data for remaining columns. 3. The results are already filtered and sorted, so the database returns a range of the results.
So to me, instead of merging comment data into the node table, it sounds like the node_comment_statistics table needs to reflect other data that is being filtered in some way. IMO, this would be preferable. Doing it any other way might end up creating a requirement to have comment.module on (or make a site very, very useless if comment.module is off).
Earl Miles wrote:
No, you misinterpreted mine. Drupal is modular. comment.module is a module. Now, it's a poor module -- it has some rather nasty integration with the rest of the site which, IMO, it should not. But it worries me that one of the solutions is to re-integrate rather than to make it properly modular.
The schema for Drupal core is not modular and never has been (at least looking back to 4.7).
So to me, instead of merging comment data into the node table, it sounds like the node_comment_statistics table needs to reflect other data that is being filtered in some way. IMO, this would be preferable. Doing it any other way might end up creating a requirement to have comment.module on (or make a site very, very useless if comment.module is off).
There is no debate here, no matter what theoretical model you prefer. The database requires that all criteria for WHERE and ORDER BY exist in the same table to achieve acceptable performance. That table is {node}. The only way to realize the same performance goals without combining the fields in {node} is to create another table that contains the same fields, but just isn't called node. That would be pointless.
It seems to me you might be over-generalizing. Do you have any data to back up this assertion (URL ?) ? From what I remember from examining Firebird and Oracle query plans, they can apply criteria on several tables in a join without significantly impacting performance. ----- Original Message ----- From: "David Strauss" <david@fourkitchens.com> To: <development@drupal.org> Sent: Monday, June 04, 2007 8:19 AM Subject: Re: [development] One-to-one tables considered harmful [...] There is no debate here, no matter what theoretical model you prefer. The database requires that all criteria for WHERE and ORDER BY exist in the same table to achieve acceptable performance. That table is {node}. The only way to realize the same performance goals without combining the fields in {node} is to create another table that contains the same fields, but just isn't called node. That would be pointless.
You can efficiently apply criteria to several tables in a JOIN, but then you generally cannot use an index for sorting. Considering that an enormous number of Drupal queries follow a pattern of joining, sorting, and paginating, the efficiency of the sort is paramount to query performance. The database "eats" keys by satisfying WHERE criteria first. If removing the prefix used for the WHERE criteria leaves the ORDER BY criteria, then the key can also be used for sorting. http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html == Examples == Key: nid * Fast: WHERE nid = %d * Fast: ORDER BY nid Key: nid, vid * Fast: WHERE nid = %d * Fast: WHERE nid = %d AND vid = %d * Fast: WHERE nid = %d ORDER BY vid FGM wrote:
It seems to me you might be over-generalizing. Do you have any data to back up this assertion (URL ?) ? From what I remember from examining Firebird and Oracle query plans, they can apply criteria on several tables in a join without significantly impacting performance.
----- Original Message ----- From: "David Strauss" <david@fourkitchens.com> To: <development@drupal.org> Sent: Monday, June 04, 2007 8:19 AM Subject: Re: [development] One-to-one tables considered harmful
[...] There is no debate here, no matter what theoretical model you prefer. The database requires that all criteria for WHERE and ORDER BY exist in the same table to achieve acceptable performance. That table is {node}. The only way to realize the same performance goals without combining the fields in {node} is to create another table that contains the same fields, but just isn't called node. That would be pointless.
Cross posts ...
The schema for Drupal core is not modular and never has been (at least looking back to 4.7).
How so? Offhand, in terms of optional components, I can only think of the unfortunate existence of the comment field in the node table.
Karthik wrote:
The schema for Drupal core is not modular and never has been (at least looking back to 4.7).
How so? Offhand, in terms of optional components, I can only think of the unfortunate existence of the comment field in the node table.
Drupal's schema is managed by system.module, even for optional core modules. There's a bit of change in Drupal 6 to move some schema to other core, required modules, but the overall core schema is monolithic, not modular.
On 04 Jun 2007, at 09:04, Karthik wrote:
How so? Offhand, in terms of optional components, I can only think of the unfortunate existence of the comment field in the node table.
$ grep -r node_comment_statistics * | awk -F: '{print $1}' | grep ".module" | sort | uniq modules/comment/comment.module modules/forum/forum.module modules/node/node.module modules/search/search.module modules/tracker/tracker.module Five core modules directly query the node_comment_statistics table. In my view of the world, comments are first class citizens in Drupal. We should optimize how they interact with nodes. In other words, I agree that it makes sense to merge the node_comment_statistics table into the node table (assuming it does improve performance). It would also help eliminate bugs. -- Dries Buytaert :: http://www.buytaert.net/
Quoting Dries Buytaert <dries.buytaert@gmail.com>:
Five core modules directly query the node_comment_statistics table. In my view of the world, comments are first class citizens in Drupal. We should optimize how they interact with nodes. In other words, I agree that it makes sense to merge the node_comment_statistics table into the node table (assuming it does improve performance). It would also help eliminate bugs.
I have been reading the posts and wondering what is so different about a comment. A comment is nothing more than content very much similar to a node. As content it has a specialized content type which at the moment isn't visible to the administrator. So, I think to myself that perhaps a comment should be treated as a node and the comment table is nothing more than a one to many relationship mapper. I'll give this some more thought toward D-7 I can see this idea as enhancing many modules. Earnie
Earnie Boyd wrote:
I have been reading the posts and wondering what is so different about a comment. A comment is nothing more than content very much similar to a node. As content it has a specialized content type which at the moment isn't visible to the administrator. So, I think to myself that perhaps a comment should be treated as a node and the comment table is nothing more than a one to many relationship mapper. I'll give this some more thought toward D-7 I can see this idea as enhancing many modules.
Please to check out the nodecomment module which does just what you describe: http://drupal.org/project/nodecomment
Also, if comments are nodes, then you can modify them through CCK. Many new things would be possible that currently are not (or not as easily) - or am I seeing that wrong? But then the need to have per-content type "comment content types" could become a requirement, simply because you need different CCK fields per content type. So this brings me to: why don't we allow for any content type to be chosen as the "comment content type"? Of course this is completely off-topic, sorry for that. Wim On Jun 4, 2007, at 16:50 , Robert Douglass wrote:
Earnie Boyd wrote:
I have been reading the posts and wondering what is so different about a comment. A comment is nothing more than content very much similar to a node. As content it has a specialized content type which at the moment isn't visible to the administrator. So, I think to myself that perhaps a comment should be treated as a node and the comment table is nothing more than a one to many relationship mapper. I'll give this some more thought toward D-7 I can see this idea as enhancing many modules.
Please to check out the nodecomment module which does just what you describe: http://drupal.org/project/nodecomment
Quoting Robert Douglass <rob@robshouse.net>:
Earnie Boyd wrote:
I have been reading the posts and wondering what is so different about a comment. A comment is nothing more than content very much similar to a node. As content it has a specialized content type which at the moment isn't visible to the administrator. So, I think to myself that perhaps a comment should be treated as a node and the comment table is nothing more than a one to many relationship mapper. I'll give this some more thought toward D-7 I can see this idea as enhancing many modules.
Please to check out the nodecomment module which does just what you describe: http://drupal.org/project/nodecomment
Sounds like a starting place to add functionality to the core comment module. Since nodecomment is a replacement for the current comment module maybe it even makes sense to just replace it with nodecomment; renaming nodecomment to comment of course. I would like to hear from Dries on this idea. I don't see why we need a special table for comment content when we already have a building block for content wrapped up in a node. We only need to map child nids to a parent nid. Earnie
On 6/4/07, Earnie Boyd <earnie@users.sourceforge.net> wrote:
I have been reading the posts and wondering what is so different about a comment. A comment is nothing more than content very much similar to a node.
Well, not quite. There is one very distinct difference that makes a comment very different from nodes as we treat them today. And it's also the one thing that makes it hard to clearly call a comment a first-class object, in a modeling sense. That difference is: * nodes are "new" objects. They "start" or "instantiate" a piece of content. They are independent, typically. * comments are "follow on" objects. They are a "reaction" to a piece of content. They are *always* dependent upon the node to which they refer. So much for the model. In a practical implementation sense, the questions are different, but need to keep the model in mind.
Quoting Chris Johnson <cxjohnson@gmail.com>:
On 6/4/07, Earnie Boyd <earnie@users.sourceforge.net> wrote:
I have been reading the posts and wondering what is so different about a comment. A comment is nothing more than content very much similar to a node.
Well, not quite. There is one very distinct difference that makes a comment very different from nodes as we treat them today. And it's also the one thing that makes it hard to clearly call a comment a first-class object, in a modeling sense.
That difference is:
* nodes are "new" objects. They "start" or "instantiate" a piece of content. They are independent, typically.
* comments are "follow on" objects. They are a "reaction" to a piece of content. They are *always* dependent upon the node to which they refer.
So much for the model. In a practical implementation sense, the questions are different, but need to keep the model in mind.
How is what you describe different from a child node dependent on a parent node? If I have a content type relationship mapping table that maps a parent to a child then the children become the ``follow on'' objects. Further more we can design the content type relationship mapping table such that it maps any content type to another content type. So I can have one content type represent albums and the content be generally about the album and another content type represent songs whose content is about the song and a child of the album. Yes, this is different thinking and I am asking that we think of a comment as nothing more than more content. I am asking that all content be thought of as having possible relationships to each other. IMO, it will make for a more robust API and potentially less code. Obviously others think it is more robust because they build replacement modules for the existing core comment module. Earnie
Not many people are addressing David's actual technical argument about sorting and writing temporary tables and file sorting. So far his argument has made perfect sense to me, and I would not begrudge seeing a comment count column in the node table. I am glad Dries sees this as well.
Quoting Robert Douglass <rob@robshouse.net>:
Not many people are addressing David's actual technical argument about sorting and writing temporary tables and file sorting. So far his argument has made perfect sense to me, and I would not begrudge seeing a comment count column in the node table. I am glad Dries sees this as well.
I agree that this makes sense for the time being; we can redesign it later in D7. Earnie
Earnie Boyd wrote:
Quoting Robert Douglass <rob@robshouse.net>:
Not many people are addressing David's actual technical argument about sorting and writing temporary tables and file sorting. So far his argument has made perfect sense to me, and I would not begrudge seeing a comment count column in the node table. I am glad Dries sees this as well.
I agree that this makes sense for the time being; we can redesign it later in D7.
Maybe that will give you time to learn why we have the current approach and why I'm slightly modifying that approach for performance. Sorry for my snarky tone, but it comes off as quite rude to say, "Oh, I guess we'll just replace your hard work with my ideas /next/ version," when you haven't provided anything more than some abstract wishes about comments being nodes. Why don't you just use Robert Douglass's module and see if it suits your needs? I'm sure it's well-written.
Flame war aside, I think there are some good points in this thread. Perhaps someone can clarify two points: 1. Is it kosher to add columns to other modules' tables? E.g. Is it ok for comment to add it's statistics to the node table. Similarly, I could imagine the stats module would want to add summary information, and perhaps other modules as well. 2. Do David's arguments support more Single Table Inheritance[1] in Drupal (e.g. adding columns to node that don't necessarily apply to every node type)? For example, should CCK add columns to the node table that only apply to some types of nodes. -M [1] http://www.martinfowler.com/eaaCatalog/singleTableInheritance.html
On 6/5/07, Mark Fredrickson <mark.m.fredrickson@gmail.com> wrote:
Flame war aside, I think there are some good points in this thread. Perhaps someone can clarify two points:
1. Is it kosher to add columns to other modules' tables? E.g. Is it ok for comment to add it's statistics to the node table. Similarly, I could imagine the stats module would want to add summary information, and perhaps other modules as well.
If one module is core and one is contrib, then it is not kosher (IMO). Actually it is not the sanctioned way of doing things, but no one stops a module author from doing this or a site from having their own "schema fork". There are consequences when upgrading, and they should be prepared to live with their decisions. If both modules are from core, then why not? We avoid the performance issues caused by having two tables, and are actually more normalized that way. -- 2bits.com http://2bits.com Drupal development, customization and consulting.
Quoting Khalid Baheyeldin <kb@2bits.com>:
On 6/5/07, Mark Fredrickson <mark.m.fredrickson@gmail.com> wrote:
Flame war aside, I think there are some good points in this thread. Perhaps someone can clarify two points:
1. Is it kosher to add columns to other modules' tables? E.g. Is it ok for comment to add it's statistics to the node table. Similarly, I could imagine the stats module would want to add summary information, and perhaps other modules as well.
If one module is core and one is contrib, then it is not kosher (IMO). Actually it is not the sanctioned way of doing things, but no one stops a module author from doing this or a site from having their own "schema fork". There are consequences when upgrading, and they should be prepared to live with their decisions.
But it may be a reason to begin to collaborate with each other and agree to combine module functionality if both modules are contrib modules.
If both modules are from core, then why not? We avoid the performance issues caused by having two tables, and are actually more normalized that way.
I agree. Earnie
Mark Fredrickson wrote:
Flame war aside, I think there are some good points in this thread. Perhaps someone can clarify two points:
1. Is it kosher to add columns to other modules' tables? E.g. Is it ok for comment to add it's statistics to the node table. Similarly, I could imagine the stats module would want to add summary information, and perhaps other modules as well.
The problem with dynamically modifying other module's tables has one glaring problem: the upgrade path. Now let's take menu module or locale module. Locale module had the locales_meta table in Drupal 5. If contrib modules added columns to this table, the column and the data will be completely lost, since the Drupal 6 upgrade creates a new "languages" table, gets the data **which is expected by locale module to be there** in the old locales_meta table and finally drops the locales_meta table. So only the data handled by the module will be saved. Similar issues can arise when data is updated only, if you would do updates in your custom columns in those algorithms too. Or the module update might try to add a column named as the one which was put there by some of your other modules. So as long as you modify someone else's (either core module or contrib, does not matter) table in a contrib module, you fork Drupal, and you are on your own to handle the situations that can possibly arise. Optimizing *core* functionality by putting related data together is a different question though.
2. Do David's arguments support more Single Table Inheritance[1] in Drupal (e.g. adding columns to node that don't necessarily apply to every node type)? For example, should CCK add columns to the node table that only apply to some types of nodes.
Hm, the node table already has a newly added column in Drupal 6 to store the language code, and this is not managed in node module. Dries also requested to add node translation related fields directly into the node table underlining that this is a core feature (even if you don't have the node translation module enabled). [Node translation is still in the works, look here: http://drupal.org/node/142280 ] I would not advise any contrib module to modify anyone else's core or contrib table, but feel free to play along with their own tables as they wish. I don't think it is a good idea maintanance wise for CCK to modify the node table directly, although it might look performance wise attractive. Gabor
Mark Fredrickson wrote:
Flame war aside, I think there are some good points in this thread. Perhaps someone can clarify two points:
1. Is it kosher to add columns to other modules' tables? E.g. Is it ok for comment to add it's statistics to the node table. Similarly, I could imagine the stats module would want to add summary information, and perhaps other modules as well.
2. Do David's arguments support more Single Table Inheritance[1] in Drupal (e.g. adding columns to node that don't necessarily apply to every node type)? For example, should CCK add columns to the node table that only apply to some types of nodes.
I don't think so. I discussed my reasons in my original post.
[1] http://www.martinfowler.com/eaaCatalog/singleTableInheritance.html
Mark Fredrickson wrote:
Flame war aside, I think there are some good points in this thread. Perhaps someone can clarify two points:
1. Is it kosher to add columns to other modules' tables? E.g. Is it ok for comment to add it's statistics to the node table. Similarly, I could imagine the stats module would want to add summary information, and perhaps other modules as well.
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored. Then we could justify never ever having a 1::1 relationship table for 'node'. This was taken out when the node_revisions table was added. Which leads me to another question: Should the 'current' node_revision be reflected in the node table, to eliminate that join if we need to sort/query on something in that table?
Earl Miles wrote:
Mark Fredrickson wrote:
Flame war aside, I think there are some good points in this thread. Perhaps someone can clarify two points:
1. Is it kosher to add columns to other modules' tables? E.g. Is it ok for comment to add it's statistics to the node table. Similarly, I could imagine the stats module would want to add summary information, and perhaps other modules as well.
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored.
What about the upgrade path arguments I mentied above in the thread? Or if those are taken care of somehow, why only have this for nodes? :P Gabor
1. Is it kosher to add columns to other modules' tables? E.g. Is it ok for comment to add it's statistics to the node table. Similarly, I could imagine the stats module would want to add summary information, and perhaps other modules as well.
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored.
What about the upgrade path arguments I mentied above in the thread? Or if those are taken care of somehow, why only have this for nodes? :P
That's a very valid point. It can be solved, especially with the introduction of $schema, it becomes a variation on the *_alter theme, but at install/upgrade and module enable/disable times. The downside that it becomes far more complicated to track relations and conflicts between modules. A compromise can be achieved by having index caches - tables which serve only for the purpose of maintaining fast indexes, so the various where and order clases are executed on these tables only, so that unnessesary temp tables are avoided. Again, it complicates the design, and special attention has to be paid for the consistency of the caches, which can very brittle.Thespace overhead can be significant. The problem we have is where to compromise. We have a very dynamic system, where the types can be changed at different points in time, and static db schemas. How do we maintain the mapping between the two and keep it efficient without major sacrifices in either?
The problem we have is where to compromise. We have a very dynamic system, where the types can be changed at different points in time, and static db schemas. How do we maintain the mapping between the two and keep it efficient without major sacrifices in either?
For my own two cents, I think the long term solution lies in making the schemas dynamic (maybe not core schemas but definitely module schemas). If we finish implementing a solid data api then adding fields to a user_profiles table would simply execute the appropriate alter column statements. RAILS sort of works this way as well. The API is really in charge of the schema. I sort of have the feeling that when we try and store to much in these generic field extending (like profile fields) that we usually give up performance and make it really harder to get at the data. And certainly hard to optimize performance it with indexes. In my dream drupal, we would have node_<content_type> tables that would store all the added fields to a content type, and I as a conrib developer could get at those tables with simple SQL queries. I've just been bit to many times by scalability problems with the one table fits all approach.
Earl Miles wrote:
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored. Then we could justify never ever having a 1::1 relationship table for 'node'. This was taken out when the node_revisions table was added.
Content types adding tables with 1::1 relationships to {node_revisions} aren't a problem unless we use fields from the content type's table in WHERE or ORDER BY criteria. I'm not aware of that happening anywhere in core.
Which leads me to another question: Should the 'current' node_revision be reflected in the node table, to eliminate that join if we need to sort/query on something in that table?
We could also mark rows in {node_revisions} with an is_current field that's set to 1 for the current revision and NULL for older revisions. With a UNIQUE index on (nid, is_current), we could even have the database enforce the restriction. UNIQUE indexes ignore combinations with NULL values, so only a single row per nid could have a non-null is_current value. From the MySQL manual:
In MySQL, a UNIQUE index is one in which all values in the index must be distinct. An error occurs if you try to add a new row with a key that matches an existing row. The exception to this is that if a column in the index is allowed to contain NULL values, it can contain multiple NULL values. This exception does not apply to BDB tables, for which an indexed column allows only a single NULL.
The UNIQUE index on (nid, is_current) would allow rapid lookups of a given nid's current revision. That said, the current relationship between {node} and {node_revisions} isn't too bad. We tend to not put WHERE or ORDER BY criteria on {node_revisions} when joining.
David Strauss wrote:
Earl Miles wrote:
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored. Then we could justify never ever having a 1::1 relationship table for 'node'. This was taken out when the node_revisions table was added.
Content types adding tables with 1::1 relationships to {node_revisions} aren't a problem unless we use fields from the content type's table in WHERE or ORDER BY criteria. I'm not aware of that happening anywhere in core.
Ahh, but I bring this up for things like gsitemap or things that add flags to nodes that will get queried by that module. Possibly with queries similar to the tracker query. Some of these may be important. If we had a 'kosher', acceptable way for a module to add fields to the node table, and comment.module in core used that as a shining example, then any and all of my objections to this on principle would vanish, melt, disappear and be gone. And we'd get useful functionality, to boot. I'd quickly turn out a 'node flags' module that let the administrator create set arbitrary flags on a module, control who can set them, and offer those flags up to Views, if the 'field' op for nodeapi were restored.
That said, the current relationship between {node} and {node_revisions} isn't too bad. We tend to not put WHERE or ORDER BY criteria on {node_revisions} when joining.
Fair enough, that's a moot point then.
On Jun 5, 2007, at 10:36 AM, Earl Miles wrote:
David Strauss wrote:
Earl Miles wrote:
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored. Then we could justify never ever having a 1::1 relationship table for 'node'. This was taken out when the node_revisions table was added. Content types adding tables with 1::1 relationships to {node_revisions} aren't a problem unless we use fields from the content type's table in WHERE or ORDER BY criteria. I'm not aware of that happening anywhere in core.
Ahh, but I bring this up for things like gsitemap or things that add flags to nodes that will get queried by that module. Possibly with queries similar to the tracker query. Some of these may be important.
If we had a 'kosher', acceptable way for a module to add fields to the node table, and comment.module in core used that as a shining example, then any and all of my objections to this on principle would vanish, melt, disappear and be gone. And we'd get useful functionality, to boot.
I'd quickly turn out a 'node flags' module that let the administrator create set arbitrary flags on a module, control who can set them, and offer those flags up to Views, if the 'field' op for nodeapi were restored.
And, as Gabor said, we would need to carefully make sure the upgrade of this modified table was handled sanely. Perhaps, a db schema API for table deletion/upgrade that allows modules to handle it's own columns. That's an EXTREMELY rough idea, but I'm sure there are smarter people who could shape it into something useful. I hope no one minds that I moved this to a new thread.
For example I wish the node table had a last edited by user field as well as the created by user field. In an enviroment where multiple people can modify nodes it is more interesting to know the last modifier than the creator. Yes, I know that can be gotten if you turn revisions on but I don't need that functionality. -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of John Wilkins Sent: Tuesday, June 05, 2007 1:48 PM To: development@drupal.org Subject: [development] module-specific columns in core tables On Jun 5, 2007, at 10:36 AM, Earl Miles wrote:
David Strauss wrote:
Earl Miles wrote:
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored. Then we could justify never ever having a 1::1 relationship table for 'node'. This was taken out when the node_revisions table was added. Content types adding tables with 1::1 relationships to {node_revisions} aren't a problem unless we use fields from the content type's table in WHERE or ORDER BY criteria. I'm not aware of that happening anywhere in core.
Ahh, but I bring this up for things like gsitemap or things that add flags to nodes that will get queried by that module. Possibly with queries similar to the tracker query. Some of these may be important.
If we had a 'kosher', acceptable way for a module to add fields to the node table, and comment.module in core used that as a shining example, then any and all of my objections to this on principle would vanish, melt, disappear and be gone. And we'd get useful functionality, to boot.
I'd quickly turn out a 'node flags' module that let the administrator create set arbitrary flags on a module, control who can set them, and offer those flags up to Views, if the 'field' op for nodeapi were restored.
And, as Gabor said, we would need to carefully make sure the upgrade of this modified table was handled sanely. Perhaps, a db schema API for table deletion/upgrade that allows modules to handle it's own columns. That's an EXTREMELY rough idea, but I'm sure there are smarter people who could shape it into something useful. I hope no one minds that I moved this to a new thread. -- No virus found in this incoming message. Checked by AVG Free Edition. Version: 7.5.472 / Virus Database: 269.8.9/832 - Release Date: 6/4/2007 6:43 PM
Use {node_revisions}. Walt Daniels wrote:
For example I wish the node table had a last edited by user field as well as the created by user field. In an enviroment where multiple people can modify nodes it is more interesting to know the last modifier than the creator. Yes, I know that can be gotten if you turn revisions on but I don't need that functionality.
-----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of John Wilkins Sent: Tuesday, June 05, 2007 1:48 PM To: development@drupal.org Subject: [development] module-specific columns in core tables
On Jun 5, 2007, at 10:36 AM, Earl Miles wrote:
David Strauss wrote:
Earl Miles wrote:
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored. Then we could justify never ever having a 1::1 relationship table for 'node'. This was taken out when the node_revisions table was added. Content types adding tables with 1::1 relationships to {node_revisions} aren't a problem unless we use fields from the content type's table in WHERE or ORDER BY criteria. I'm not aware of that happening anywhere in core. Ahh, but I bring this up for things like gsitemap or things that add flags to nodes that will get queried by that module. Possibly with queries similar to the tracker query. Some of these may be important.
If we had a 'kosher', acceptable way for a module to add fields to the node table, and comment.module in core used that as a shining example, then any and all of my objections to this on principle would vanish, melt, disappear and be gone. And we'd get useful functionality, to boot.
I'd quickly turn out a 'node flags' module that let the administrator create set arbitrary flags on a module, control who can set them, and offer those flags up to Views, if the 'field' op for nodeapi were restored.
And, as Gabor said, we would need to carefully make sure the upgrade of this modified table was handled sanely.
Perhaps, a db schema API for table deletion/upgrade that allows modules to handle it's own columns. That's an EXTREMELY rough idea, but I'm sure there are smarter people who could shape it into something useful.
I hope no one minds that I moved this to a new thread.
-- No virus found in this incoming message. Checked by AVG Free Edition. Version: 7.5.472 / Virus Database: 269.8.9/832 - Release Date: 6/4/2007 6:43 PM
I'll just add that the users module has explicit support for adding custom columns to it. those columns get magically loaded into user object and saved into the DB. please don't confuse this fine feature with the the horrid users.data column. i am speaking of new dedicated columns, not an existing column which is a catch-all. so a similar feature for nodes is consistent and proper IMO. Earl Miles wrote:
David Strauss wrote:
Earl Miles wrote:
At one point we had a 'field' op to nodeapi that let modules add fields to the node table. I would like to see this restored. Then we could justify never ever having a 1::1 relationship table for 'node'. This was taken out when the node_revisions table was added.
Content types adding tables with 1::1 relationships to {node_revisions} aren't a problem unless we use fields from the content type's table in WHERE or ORDER BY criteria. I'm not aware of that happening anywhere in core.
Ahh, but I bring this up for things like gsitemap or things that add flags to nodes that will get queried by that module. Possibly with queries similar to the tracker query. Some of these may be important.
If we had a 'kosher', acceptable way for a module to add fields to the node table, and comment.module in core used that as a shining example, then any and all of my objections to this on principle would vanish, melt, disappear and be gone. And we'd get useful functionality, to boot.
I'd quickly turn out a 'node flags' module that let the administrator create set arbitrary flags on a module, control who can set them, and offer those flags up to Views, if the 'field' op for nodeapi were restored.
That said, the current relationship between {node} and {node_revisions} isn't too bad. We tend to not put WHERE or ORDER BY criteria on {node_revisions} when joining.
Fair enough, that's a moot point then.
On 5-Jun-07, at 12:53 PM, Earl Miles wrote:
Which leads me to another question: Should the 'current' node_revision be reflected in the node table, to eliminate that join if we need to sort/query on something in that table?
"Back in the day," I made a patch that completely split off the node revisions system into a separate, turn-off-able module, and part of that was storing the current revision's title, body, etc. in the node table. I'm sure it doesn't apply anymore and still needs a bit of work, but if there's interest... http://drupal.org/node/120967
Quoting David Strauss <david@fourkitchens.com>:
Earnie Boyd wrote:
Quoting Robert Douglass <rob@robshouse.net>:
Not many people are addressing David's actual technical argument about sorting and writing temporary tables and file sorting. So far his argument has made perfect sense to me, and I would not begrudge seeing a comment count column in the node table. I am glad Dries sees this as well.
I agree that this makes sense for the time being; we can redesign it later in D7.
Maybe that will give you time to learn why we have the current approach and why I'm slightly modifying that approach for performance.
I am excitedly looking at just that. Yes performance matters as does elegance in coding or an API.
Sorry for my snarky tone, but it comes off as quite rude to say, "Oh, I guess we'll just replace your hard work with my ideas /next/ version," when you haven't provided anything more than some abstract wishes about comments being nodes.
Please accept my apologies for offending you; it wasn't intended. You have good ideas, my comments weren't meant to flame you.
Why don't you just use Robert Douglass's module and see if it suits your needs? I'm sure it's well-written.
As I have already said, Robert's module is a good starting point for my ideas and I will be looking into it. Earnie
Robert Douglass wrote:
Not many people are addressing David's actual technical argument about sorting and writing temporary tables and file sorting. So far his argument has made perfect sense to me, and I would not begrudge seeing a comment count column in the node table. I am glad Dries sees this as well.
I spoke with David a little bit about this via IRC. I did offer an alternate solution, but he wasn't too thrilled with it. (Which is, instead of combining the tables, the 1::1 table reflects the data we'll be filtering on. Which in the case of tracker query turns out to be just status = 0). It does have other problems though, and I can see why we might not want to do that.
Earnie Boyd wrote:
Yes, this is different thinking and I am asking that we think of a comment as nothing more than more content. I am asking that all content be thought of as having possible relationships to each other.
That's already possible: * Taxonomies * Books
IMO, it will make for a more robust API and potentially less code. Obviously others think it is more robust because they build replacement modules for the existing core comment module.
Code simplicity does not trump performance.
Quoting David Strauss <david@fourkitchens.com>:
Earnie Boyd wrote:
Yes, this is different thinking and I am asking that we think of a comment as nothing more than more content. I am asking that all content be thought of as having possible relationships to each other.
That's already possible: * Taxonomies * Books
That is good. You forgot to mention the forums module.
IMO, it will make for a more robust API and potentially less code. Obviously others think it is more robust because they build replacement modules for the existing core comment module.
Code simplicity does not trump performance.
Code performance doesn't trump simplicity either they tend to go hand in hand. Earnie
On 6/5/07, Earnie Boyd <earnie@users.sourceforge.net> wrote:
How is what you describe different from a child node dependent on a parent node? If I have a content type relationship mapping table that maps a parent to a child then the children become the ``follow on'' objects.
A comment is dependent in a modeling sense on a node. That is, if a comment says "Hey, I think that's a great idea" in response to an idea posted in a blog, that comment BY ITSELF without the node makes no sense. It is meaningless. One can certainly construct relationships between nodes containing custom data which exhibit the same dependency, of not making sense without their parent node. Likewise, one could construct relationships where there is no dependency, where the nodes make sense by themselves. That does not alter the basic fact that comments are always dependent objects. Most nodes today are independent objects. Since we are trying to optimize comment behavior specifically, and dependent data in general, it's good to keep that in mind. Creating dependent relationships makes it harder to optimize, especially when selection and sorting criteria are in a joined table.
Chris Johnson wrote:
That difference is:
* nodes are "new" objects. They "start" or "instantiate" a piece of content. They are independent, typically.
* comments are "follow on" objects. They are a "reaction" to a piece of content. They are *always* dependent upon the node to which they refer.
So much for the model. In a practical implementation sense, the questions are different, but need to keep the model in mind.
Well said. * We also have to track complex statistics for all of a node's comments, including the replies to the replies to the replies of the comments for the node. * Treating comments as nodes implies the creation of nice modularity. Unfortunately, the performance of many comment-related queries depends on denormalization.
David Strauss wrote:
Earl Miles wrote:
No, you misinterpreted mine. Drupal is modular. comment.module is a
module. Now, it's a poor module -- it has some rather nasty integration with the rest of the site which, IMO, it should not. But it worries me that one of the solutions is to re-integrate rather than to make it properly modular. I was training someone on some basics of developing a Drupal module last week and we needed to create a module tying data to a node. His initial instinct was to add a new column to {node} and I said "That is not the Drupal way. Drupal is modular so we must create our own table." So we created our own table JOINed to nid.
So, I don't pretend to be a master of all things DB, but what about getting closer to how CCK does a bit heavier manipulation of the DB schema. For example, if you have a CCK field used in more than one content type, CCK creates it's own dedicated table, if just one content type uses that CCK field, those columns go in that node type's table. Especially with the new Schema API, couldn't we allow a module to add/remove/update its own fields to {node}, for example, if it would really improve performance in certain circumstances and as long as we maintain our module's namespace? It would still be modular as we have .install files to do all the magic on install/uninstall. -Rob
Rob Barreca wrote:
I was training someone on some basics of developing a Drupal module last week and we needed to create a module tying data to a node. His initial instinct was to add a new column to {node} and I said "That is not the Drupal way. Drupal is modular so we must create our own table." So we created our own table JOINed to nid.
That is generally appropriate for a contributed module. My one-to-one tables post is mostly about *core* tables.
So, I don't pretend to be a master of all things DB, but what about getting closer to how CCK does a bit heavier manipulation of the DB schema. For example, if you have a CCK field used in more than one content type, CCK creates it's own dedicated table, if just one content type uses that CCK field, those columns go in that node type's table.
Especially with the new Schema API, couldn't we allow a module to add/remove/update its own fields to {node}, for example, if it would really improve performance in certain circumstances and as long as we maintain our module's namespace? It would still be modular as we have .install files to do all the magic on install/uninstall.
Yes, that is more possible than ever with Schema API. I believe there is already a hook to alter the schema for another module.
David Strauss wrote:
Earl Miles wrote:
No, you misinterpreted mine. Drupal is modular. comment.module is a module. Now, it's a poor module -- it has some rather nasty integration with the rest of the site which, IMO, it should not. But it worries me that one of the solutions is to re-integrate rather than to make it properly modular.
The schema for Drupal core is not modular and never has been (at least looking back to 4.7).
Wow. This is so wrong that it's a total waste of my time to be even looking this shit up. The following modules install their own tables: book.module aggregator.module poll.module search.module contact.module color.module Your argument about core being monolithic is bullshit. It's not. It's modular. In some cases, it's badly modular. Dries comments that 4 modules (other than comment.module) utilize node_comment_statistics table: node.module forum.module tracker.module search.module node.module has dependencies on comment.module that I think are wrong. tracker.module and forum.module both depend upon comment.module already search.module -- not sure what's up with that. In a world where some people have actually written an alternative comment.module because the existing comment.module isn't adequate, I beg, please, do not change Drupal's fundamental architecture.
Earl Miles wrote:
David Strauss wrote:
Earl Miles wrote:
No, you misinterpreted mine. Drupal is modular. comment.module is a module. Now, it's a poor module -- it has some rather nasty integration with the rest of the site which, IMO, it should not. But it worries me that one of the solutions is to re-integrate rather than to make it properly modular.
The schema for Drupal core is not modular and never has been (at least looking back to 4.7).
Wow. This is so wrong that it's a total waste of my time to be even looking this shit up.
Sorry, I'll admit to being wrong here. Several of the optional core Drupal modules do manage their own schemas. I've been so consumed in comment and tracker world so much that I wrongly assumed the other optional core modules functioned the same way. Clearly, that's not the case.
On 04 Jun 2007, at 11:39, Earl Miles wrote:
In a world where some people have actually written an alternative comment.module because the existing comment.module isn't adequate, I beg, please, do not change Drupal's fundamental architecture.
Maybe at some point we can talk about this more in another thread. I think the proper solution is to improve the comment module in core, not to write a replacement. As I said, the comment functionality is a first class citizen of core, and therefore, we're allowed to optimize Drupal's node system to integrate closely with the comment system. -- Dries Buytaert :: http://www.buytaert.net/
Dries comments that 4 modules (other than comment.module) utilize node_comment_statistics table:
node.module forum.module tracker.module search.module
node.module has dependencies on comment.module that I think are wrong. tracker.module and forum.module both depend upon comment.module already search.module -- not sure what's up with that.
search.module: It comes from a doxygen comment for do_search(). The only real culprit is node.module.
Earl Miles wrote:
search.module -- not sure what's up with that.
It is only a code comment... false positive in the grep.
In a world where some people have actually written an alternative comment.module because the existing comment.module isn't adequate, I beg, please, do not change Drupal's fundamental architecture.
The node_comment module uses the node_comment_statistics table as well. From that point of view, it would make no difference at all whether the column were in the node table... in fact the node_comment module would undoubtedly benefit from moving it since we perform the same 1:1 join and sort that David is complaining about. This isn't my full endorsement, but just the perspective of the node_comment maintainer. -Robert
You *completely* missed the real argument:
Drupal is modular. Modules tend to have their own tables. Often these are 1::1 tables.
You misinterpreted my post. I said the 1::1 tables created by content types are fine. We're talking about core tables here, like {node_comment_statistics} that have columns often used in WHERE or ORDER BY criteria.
I think Earl does not want you to make any such discrimination. IMO, you're compromising design integrity for performance. The comment module is, whilst still a core module, an *optional* component. Some installations/distributions might use it. Some installations might not. Some installations might have a drop in replacement for it. For e.g., the project module, IIRC, does not use the comment module and still uses its own table. I believe that Weitzman has an existing issue for this as well. There is also another issue or mailman thread proposing that the url_aliases table be merged into the node table for similar reasons. Let's just invite everybody in, shall we? How is the comment module different from other modules - core or contrib - in similar situations? Should this (inserting/removing fields into/from the node/taxonomy/user/x tables in .install) become the rule of thumb to streamline any queries that result in temporary tables? -K
Karthik wrote:
You *completely* missed the real argument:
Drupal is modular. Modules tend to have their own tables. Often these are 1::1 tables.
You misinterpreted my post. I said the 1::1 tables created by content types are fine. We're talking about core tables here, like {node_comment_statistics} that have columns often used in WHERE or ORDER BY criteria.
I think Earl does not want you to make any such discrimination. IMO, you're compromising design integrity for performance. The comment module is, whilst still a core module, an *optional* component. Some installations/distributions might use it. Some installations might not. Some installations might have a drop in replacement for it. For e.g., the project module, IIRC, does not use the comment module and still uses its own table.
Please explore how Drupal core functions before you make such claims. The tables exist for all core modules regardless of whether they're enabled or not. So, it is irrelevant that comment is an optional module.
I believe that Weitzman has an existing issue for this as well. There is also another issue or mailman thread proposing that the url_aliases table be merged into the node table for similar reasons. Let's just invite everybody in, shall we?
Yes, if it improves performance without compromising design.
How is the comment module different from other modules - core or contrib - in similar situations? Should this (inserting/removing fields into/from the node/taxonomy/user/x tables in .install) become the rule of thumb to streamline any queries that result in temporary tables?
Well, the comment and node tables are different in that they're system-owned tables.
Please explore how Drupal core functions before you make such claims. The tables exist for all core modules regardless of whether they're enabled or not. So, it is irrelevant that comment is an optional module.
These are being addressed as system.install is being split into appropriate install files. In other words, the fact that these tables are being created regardless of their use, is wrong and is being fixed.
Karthik wrote:
Please explore how Drupal core functions before you make such claims. The tables exist for all core modules regardless of whether they're enabled or not. So, it is irrelevant that comment is an optional module.
These are being addressed as system.install is being split into appropriate install files. In other words, the fact that these tables are being created regardless of their use, is wrong and is being fixed.
More specifically, different .schema files. Still, I see nothing wrong with putting a few goodies for comment.module in node.schema. Denormalization isn't pretty, but it's necessary.
+1 on proper normalization as the base rule, and having exceptions only in crucial cases. -1 on extra joins caused by 1:1 relationships being placed in more than one table. Not that also Drupal.org has already moved to InnoDB, so table locks should not be happening as they have been on MyISAM, aside from the table lock on autoincrement. -- 2bits.com http://2bits.com Drupal development, customization and consulting.
On Sunday 03 June 2007, David Strauss wrote:
== What's bad about the current approach ==
*snip*
First, I'd like to adopt a development philosophy on the Drupal project: screw scalability if the database running Drupal doesn't support row-level locking. Sites using only table-level locks are doomed to scale poorly anyway because of the over-aggressive locking. We won't be able to prevent that disaster with the tiny improvements in lock granularity afforded by splitting tables into one-to-one table pairs.
*snip*
If we rely on the query cache, we're also being hypocrites with our stance on the first reason because many low-end hosts don't run the query cache. Either we care about scalability on low-end configurations or we don't. I'm suggesting "don't," but if people truly do want to care about low-end scalability, we can't use the query cache argument.
Scalability on "low-end" hosts is very important for a number of reasons. 1) Most sites run on them. How many web sites run on a shared host vs. a dedicated farm? I'd venture to say most sites, Drupal or otherwise, run on a shared host. That means we can't ignore that use case. 2) That's where people start. If Drupal is slow and crappy unless you're using InnoDB, a dedicated server, and an opcode cache, then no one is going to give it a second thought. Most people new to Drupal will try it out first in a shared host, or a private dev box that no one bothered to optimize. If Drupal sucks in that case, then those people will never bother installing it on high-end servers with carefully-tuned databases. True story: When I was first looking for a framework or CMS, I tried Typo3 before I tried Drupal. I never actually got it installed because at the time just running the installer died on my system (a stock Debian Sid PHP configuration with no customization, at least at the time) because it hit the default PHP memory limit. Not knowing then what I do now about PHP configuration and optimization, my response was "wtf? What a memory hog, it can't even install on a default setup! Screw this, I'm trying Drupal." Now, I can certainly accept the argument that we shouldn't try to bend over backward to get every last bit of performance out of MyISAM. There comes a point where a site really does need to have a dedicated box with hand-tuned databases. But that doesn't mean "don't care" about smaller hosts. The longer those small hosts hold out, the less expensive it is to run Drupal and the more people use it. (The menu handler split / split-mode-redux stuff I'm doing is aimed primarily at shared hosts, but will have an impact on larger sites, too.) -- Larry Garfield AIM: LOLG42 larry@garfieldtech.com ICQ: 6817012 "If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea, which an individual may exclusively possess as long as he keeps it to himself; but the moment it is divulged, it forces itself into the possession of every one, and the receiver cannot dispossess himself of it." -- Thomas Jefferson
Larry Garfield wrote:
On Sunday 03 June 2007, David Strauss wrote:
== What's bad about the current approach ==
*snip*
First, I'd like to adopt a development philosophy on the Drupal project: screw scalability if the database running Drupal doesn't support row-level locking. Sites using only table-level locks are doomed to scale poorly anyway because of the over-aggressive locking. We won't be able to prevent that disaster with the tiny improvements in lock granularity afforded by splitting tables into one-to-one table pairs.
*snip*
If we rely on the query cache, we're also being hypocrites with our stance on the first reason because many low-end hosts don't run the query cache. Either we care about scalability on low-end configurations or we don't. I'm suggesting "don't," but if people truly do want to care about low-end scalability, we can't use the query cache argument.
Scalability on "low-end" hosts is very important for a number of reasons.
1) Most sites run on them. How many web sites run on a shared host vs. a dedicated farm? I'd venture to say most sites, Drupal or otherwise, run on a shared host. That means we can't ignore that use case.
I disagree. If we optimize for large-scale use on InnoDB or PostgreSQL, most optimizations will still translate to small installations on inexpensive shared hosts. In addition, most optimizations target scalability issues associated with large sets of data. Smaller sites simply don't encounter these issues because they're small.
2) That's where people start. If Drupal is slow and crappy unless you're using InnoDB, a dedicated server, and an opcode cache, then no one is going to give it a second thought. Most people new to Drupal will try it out first in a shared host, or a private dev box that no one bothered to optimize. If Drupal sucks in that case, then those people will never bother installing it on high-end servers with carefully-tuned databases.
Drupal won't suck in these cases, it just won't have fully realized performance, and the difference between ideal tuning and no tuning for a small installation is tiny. It's more important for Drupal to reach its performance peak with large installations than with small ones.
True story: When I was first looking for a framework or CMS, I tried Typo3 before I tried Drupal. I never actually got it installed because at the time just running the installer died on my system (a stock Debian Sid PHP configuration with no customization, at least at the time) because it hit the default PHP memory limit. Not knowing then what I do now about PHP configuration and optimization, my response was "wtf? What a memory hog, it can't even install on a default setup! Screw this, I'm trying Drupal."
That's a PHP issue, and PHP changes relatively less than databases as sites grow. Even huge sites run PHP with memory limits.
Now, I can certainly accept the argument that we shouldn't try to bend over backward to get every last bit of performance out of MyISAM. There comes a point where a site really does need to have a dedicated box with hand-tuned databases. But that doesn't mean "don't care" about smaller hosts. The longer those small hosts hold out, the less expensive it is to run Drupal and the more people use it. (The menu handler split / split-mode-redux stuff I'm doing is aimed primarily at shared hosts, but will have an impact on larger sites, too.)
I think you're misinterpreting my suggestion. Everything in Drupal can be classified in some sort of big-O notation, like O(q*n^p + m) (and probably other variables), where n is the cardinality of the dataset and the other variables are constants. I'm arguing that we should care more about a smaller p than a smaller q. If q is too big, we certainly have to address that problem on big sites, but p dominates the running time. Splitting tables into 1::1 relationships is akin to optimizing q. Even if we completely focus on big-site performance, small-site performance will generally continue to improve. Every big-site optimization I'm currently pushing would create, at worst, a negligible decline in performance on small sites. When n is very small, m dominates performance considerations. Optimizing m is generally separate from optimizing p or q.
On Sunday 03 June 2007, David Strauss wrote:
Scalability on "low-end" hosts is very important for a number of reasons.
1) Most sites run on them. How many web sites run on a shared host vs. a dedicated farm? I'd venture to say most sites, Drupal or otherwise, run on a shared host. That means we can't ignore that use case.
I disagree. If we optimize for large-scale use on InnoDB or PostgreSQL, most optimizations will still translate to small installations on inexpensive shared hosts.
In addition, most optimizations target scalability issues associated with large sets of data. Smaller sites simply don't encounter these issues because they're small.
2) That's where people start. If Drupal is slow and crappy unless you're using InnoDB, a dedicated server, and an opcode cache, then no one is going to give it a second thought. Most people new to Drupal will try it out first in a shared host, or a private dev box that no one bothered to optimize. If Drupal sucks in that case, then those people will never bother installing it on high-end servers with carefully-tuned databases.
Drupal won't suck in these cases, it just won't have fully realized performance, and the difference between ideal tuning and no tuning for a small installation is tiny. It's more important for Drupal to reach its performance peak with large installations than with small ones.
True story: When I was first looking for a framework or CMS, I tried Typo3 before I tried Drupal. I never actually got it installed because at the time just running the installer died on my system (a stock Debian Sid PHP configuration with no customization, at least at the time) because it hit the default PHP memory limit. Not knowing then what I do now about PHP configuration and optimization, my response was "wtf? What a memory hog, it can't even install on a default setup! Screw this, I'm trying Drupal."
That's a PHP issue, and PHP changes relatively less than databases as sites grow. Even huge sites run PHP with memory limits.
Yes it is. It's also an extreme case. I'm just pointing out that say "feh, small hosts" in general is going to bite us in the ass sooner or later.
Now, I can certainly accept the argument that we shouldn't try to bend over backward to get every last bit of performance out of MyISAM. There comes a point where a site really does need to have a dedicated box with hand-tuned databases. But that doesn't mean "don't care" about smaller hosts. The longer those small hosts hold out, the less expensive it is to run Drupal and the more people use it. (The menu handler split / split-mode-redux stuff I'm doing is aimed primarily at shared hosts, but will have an impact on larger sites, too.)
I think you're misinterpreting my suggestion. Everything in Drupal can be classified in some sort of big-O notation, like O(q*n^p + m) (and probably other variables), where n is the cardinality of the dataset and the other variables are constants. I'm arguing that we should care more about a smaller p than a smaller q. If q is too big, we certainly have to address that problem on big sites, but p dominates the running time. Splitting tables into 1::1 relationships is akin to optimizing q.
Even if we completely focus on big-site performance, small-site performance will generally continue to improve. Every big-site optimization I'm currently pushing would create, at worst, a negligible decline in performance on small sites.
Yeah, that's what we said about the 4.6->4.7 path alias change. :-)
When n is very small, m dominates performance considerations. Optimizing m is generally separate from optimizing p or q.
I think we're talking at cross purposes. :-) I'm not saying "feh, big sites", or that we shouldn't consider retuning how we structure the database schema. I'm saying that optimizing for big, dedicated hosts at the expense of the $20/month hosts is a losing proposition. It sounded like one could easily interpret your "screw table-level-locking setups" comment that way, which I believe would be a very bad way to go. I wasn't commenting on the specific changes you propose. -- Larry Garfield AIM: LOLG42 larry@garfieldtech.com ICQ: 6817012 "If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea, which an individual may exclusively possess as long as he keeps it to himself; but the moment it is divulged, it forces itself into the possession of every one, and the receiver cannot dispossess himself of it." -- Thomas Jefferson
I think we're talking at cross purposes. :-) I'm not saying "feh, big sites", or that we shouldn't consider retuning how we structure the database schema. I'm saying that optimizing for big, dedicated hosts at the expense of the $20/month hosts is a losing proposition. It sounded like one could easily interpret your "screw table-level-locking setups" comment that way, which I believe would be a very bad way to go.
David didn't say that - *you are now saying that*. When you retort something that wasn't actually said, you create fuzzy conversation. David has been lucid here, lets not introduce fuzz. Drupal is never going to run poorly for stock web hosts. Noone here wants that, or would tolerate it. Lets not discuss that strawman. I wasn't commenting on the specific
changes you propose.
Please do - that would be helpful. I have no objections to David's post. It makes sense to me. As usual, it needs be confirmed with a patch and benchmarks.
On Sunday 03 June 2007, Moshe Weitzman wrote:
I think we're talking at cross purposes. :-) I'm not saying "feh, big sites", or that we shouldn't consider retuning how we structure the database schema. I'm saying that optimizing for big, dedicated hosts at the expense of the $20/month hosts is a losing proposition. It sounded like one could easily interpret your "screw table-level-locking setups" comment that way, which I believe would be a very bad way to go.
David didn't say that - *you are now saying that*. When you retort something that wasn't actually said, you create fuzzy conversation. David has been lucid here, lets not introduce fuzz.
OK, either I'm not saying this well or people are really misinterpreting me. From the original post: "screw scalability if the database running Drupal doesn't support row-level locking." I am saying that we should not take that to an extreme of ignoring single-server MyISAM performance entirely. Yes, there are web host out there that don't support InnoDB. No, drupal.org is never going to run on a shared host at this point nor should we expect it to, but while optimizing for that case we need to make sure we don't undermine performance for the shared hosts that run the majority of domains out there. *Not* all high-end optimizations have no impact or minor negative impact on low-end sites. We saw that with the path alias switch in 4.7. That is all I was saying. -- Larry Garfield AIM: LOLG42 larry@garfieldtech.com ICQ: 6817012 "If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea, which an individual may exclusively possess as long as he keeps it to himself; but the moment it is divulged, it forces itself into the possession of every one, and the receiver cannot dispossess himself of it." -- Thomas Jefferson
Larry Garfield wrote:
I am saying that we should not take that to an extreme of ignoring single-server MyISAM performance entirely.
Don't let my dramatization catch you off-guard. Everything has context:
Drupal should still run on hosts allowing only table-level locks. We just need to stop weighing their unique performance concerns so heavily.
It's just more fun to say "screw table-level locks."
On Jun 3, 2007, at 5:20 PM, Larry Garfield wrote:
I'm saying that optimizing for big, dedicated hosts at the expense of the $20/month hosts is a losing proposition.
I just have to back Moshe up here. If you optimize for the large scale cases where you need scalability - well, that's just smart, and you'll have absolutely minimal impact on the vast majority of users. If anything, optimizing for this case raises all performance across the board, including on lower end accounts - where performance and the need to lessen impact to shared infrastructure can be especially important. In all honesty, I can't see a valid argument in anything you wrote, and I'm looking for it. If you're saying that we're going to lose performance or the ability to run on low end hosts, that's not a valid argument. Not only would innodb be supported *just fine* on cpanel, plesk, isp config, and other control panels (we are talking about MySQL 4.x and up, which all of these systems now ship with - even Plesk), but if anything memory limits would actually *drop* due to optimization for large scale setups. InnoDB will run out of the box on any of these (in fact, I've got InnoDB setups running on all three of them right now). This issue is a non-starter - there is no possible argument here, other than a gentle reminder to keep things compatible and friendly for new people in the community, and that's an issue I think everyone involved in this community bears in mind and takes to heart - we want to see new users. If we were talking about stored procedures, triggers, transactions, and other more costly and complex database-layer problems, there might be an argument here. But we generally avoid those to try to keep things compatible (there's a reason; hint: you mentioned it). ;-) I admire you enthusiasm for a good debate though. ;-) Let's let this one die - it's a somewhat inane line of argument, and will lead us nowhere. -- Jonathan Lambert Principal | FireBright, Inc. Email: j@firebright.com -------------------------------------------------- "Grove giveth and Gates taketh away." - Bob Metcalfe
participants (21)
-
Angela Byron -
Chris Johnson -
David Metzler -
David Strauss -
Dries Buytaert -
Earl Miles -
Earnie Boyd -
FGM -
Gabor Hojtsy -
John Wilkins -
Jonathan Lambert -
Karthik -
Khalid Baheyeldin -
Larry Garfield -
Mark Fredrickson -
Moshe Weitzman -
Rob Barreca -
Robert Douglass -
Vladimir Zlatanov -
Walt Daniels -
Wim Leers