[development] One-to-one tables considered harmful

Sun Jun 3 19:52:58 UTC 2007

== What's bad about the current approach ==

The arguments for splitting table into one-to-one relationships are:

* Improving performance on systems that can only use table-level locking
* Minimizing invalidations of the query cache

These are not good reasons considering their impacts.

First, I'd like to adopt a development philosophy on the Drupal project:
screw scalability if the database running Drupal doesn't support
row-level locking. Sites using only table-level locks are doomed to
scale poorly anyway because of the over-aggressive locking. We won't be
able to prevent that disaster with the tiny improvements in lock
granularity afforded by splitting tables into one-to-one table pairs.

Drupal should still run on hosts allowing only table-level locks. We
just need to stop weighing their unique performance concerns so heavily.

Second, the query cache is a luxury, not a key part of database design.

A site needs to run fast without the query cache because -- at some
point -- the database will need to be restarted, which will dump the
query cache. If the site can't survive without the query cache, it will
enter a cycle of crashing/overloading the server before it can populate
the cache.

The query cache provides minimal support for the large variety of
queries hitting the system. For example, a cached tracker query for me
does nothing to help a cached track query for you.

The query cache only works after the first run. If a query takes 45
seconds to run, the page load is in jeopardy. A failed page load to
populate the query cache is unacceptable.

Minimizing invalidations to the query cache by splitting tables yields
far worse performance overall because we can't have keys span the
tables. This will be discussed in detail in the next section.

If we rely on the query cache, we're also being hypocrites with our
stance on the first reason because many low-end hosts don't run the
query cache. Either we care about scalability on low-end configurations
or we don't. I'm suggesting "don't," but if people truly do want to care
about low-end scalability, we can't use the query cache argument.

== Why avoiding one-to-one joins is a good idea ==

Many of the slowest queries in Drupal (as seen on Drupal.org) are
products of the following query plan:

1. Join the two one-to-one tables, maybe after a WHERE restriction on
one. This creates a temporary table.
2. Finish processing the WHERE restrictions on the temporary table.
3. Filesort the results to meet ORDER BY criteria.

An example of such a query would be finding all published nodes in order
of most recent comments. Another would be finding all published nodes in
order of most views today. These are extremely common queries in Drupal.

The problem is caused by an insurmountable restriction in MySQL (and
probably PostgreSQL): we cannot create a key that spans tables. Because
we cannot create such a key, any query with WHERE or ORDER BY criteria
spanning two one-to-one tables is doomed to a temp table and a filesort.

To have these queries run without temp tables and filesorts, we need to
have a key that has all WHERE criteria as a prefix and then has the
ORDER BY criteria left over.

The greatest paradox to me is that many of the worst one-to-one
offenders are denormalized tables designed to improve performance.
Because these denormalized tables contain calculated values that many
queries use for WHERE and ORDER BY, the JOINs with them inevitably sink
the query into temp table/filesort hell.

== Where one-to-one is good ==

It's fine to have one-to-one tables if one table is only used in the
JOIN and doesn't have WHERE and ORDER BY restrictions on it. For
example, the one-to-one tables created by many content types that JOIN
with {node_revisions} are great. You simply INNER JOIN {node_revisions}
with the content type's table at little performance penalty.

For this same reason, there's not much argument for denormalizing user
names into other tables unless we're allowing people to sort on them.
Even the forums don't allow sorting by user name. A simple JOIN from
{blah}.uid to {users}.uid places almost no burden on the query.

== What I'm doing about it ==

I'm working on a patch:

Merge {node_comment_statistics} and {node_counter} into {node}

http://drupal.org/node/148849

It also makes more intelligent choices for which calculated values to
manage, which removes GREATEST() from numerous queries.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 186 bytes
Desc: OpenPGP digital signature
Url : http://lists.drupal.org/pipermail/development/attachments/20070603/77dde3da/attachment.pgp