[support] how cache works: on the fly content creation (not node) and caching

Jean-Michel Pouré jmpoure at free.fr
Sat Jan 2 09:51:01 UTC 2010


> I'm already on pg. I'm already using my box. I'm already monitoring
> slow queries.
> It's just that even fast queries add up and most of the content is
> "nearly static". Using drupal cache system let me skip a lot of
> queries and php code so that cpu cycles are left to things that are
> more expensive to make faster.

Good. I feel relieved for you.

> One solution would be to use a functional index on the
> cache_page.cid that compute from the "url" the pk of the record that
> generated it so I could quickly delete cache entries when the record
> is newer than cache_page.created.

This is a good solution. When you monitor queries in the PostgreSQL log,
it is impressive how many times Drupal runs the very same query within a
single page.

For me it is a design flaw in Drupal. There should be nearly automatic
caching within a page. If you could fix this issue, I would be
delighted.
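For reference, the functional index you describe might look like this in
PostgreSQL. This is only a sketch: `nid_from_cache_cid` is a hypothetical
IMMUTABLE helper you would have to write yourself to parse the node id out
of a Drupal cache id, and I am assuming the standard Drupal `cache_page`
and `node` columns:

```sql
-- Hypothetical helper: parse the node id out of a cache id such as
-- 'http://example.com/node/123'. It must be IMMUTABLE so it can be
-- used in an index definition; it returns NULL when nothing matches.
CREATE FUNCTION nid_from_cache_cid(cid text) RETURNS integer AS $$
  SELECT substring(cid FROM 'node/([0-9]+)')::integer;
$$ LANGUAGE sql IMMUTABLE;

-- Functional index on cache_page.cid, so the cache entries of a given
-- node can be found (and deleted) without a sequential scan:
CREATE INDEX cache_page_nid_idx ON cache_page (nid_from_cache_cid(cid));

-- Invalidate stale cache entries for node 123 in one statement:
DELETE FROM cache_page
 WHERE nid_from_cache_cid(cid) = 123
   AND created < (SELECT changed FROM node WHERE nid = 123);
```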

> Another solution could be to just add a table that save the mapping
> record pk -> url. If there wasn't anything in the cache... I'd hit
> the function that generate a fresh page and save the mapping. If
> there was something in the cache I wouldn't even have to check if
> there was already an entry for pk -> url. So no need to use an
> UPSERT.

May I propose a third solution: database views and materialized views.

This would be in conjunction with your proposed solutions, as Drupal
needs a better awareness of queries within a page.

Views in a database are very interesting for several reasons:

* They exist on every database system: MySQL, PostgreSQL, DB2, Oracle,
etc. So it is easy to add them to the Drupal database layer.

* Views have their own query plan. After a number of executions of the
view, the PostgreSQL optimizer is able to find and store the right query
plan. This lets the database optimize queries, especially the large
number of queries using ORDER BY, LIMIT, COUNT, etc. that make
PostgreSQL hit disk instead of memory.

Presently I doubt that any optimizer can optimize the Drupal database
well, because it relies on a giant table called node and the optimizer
is not aware of the different situations. But this is just a guess; I
have no proof.

* We should be able to declare two kinds of database views: 

+ The first kind would be normal views, always processed on the fly by
Drupal. There is nothing special to say about them, other than that
these views have their own query plan.

+ The second kind would be materialized views, so here is a short
presentation of them:

In a web system, more than 99% of queries are SELECTs and less than 1%
are UPDATE and DELETE queries. If you consider a store where 0.5% of
users buy goods and the other 99.5% only view pages, roughly 99.5% of
queries are SELECTs. The same applies to free software: in my projects,
I notice there is ONE developer for roughly 1000 users or more. It is
probably the same for the Drupal web site.

For Drupal the proportion of SELECTs is even higher, as one page can run
50 underlying SELECT queries, while a page with UPDATE queries may run
ONE or TWO updates, not more.

So let us calculate, assuming 200 read-only page views (50 SELECTs each)
for every single page view that writes (2 UPDATEs):
* SELECT : 200 * 50 = 10000
* UPDATE : 1 * 2 = 2

So the ratio of UPDATE/DELETE to SELECT may be as low as ONE to 5000.

It means that if we can find a way to speed up SELECT, even at the price
of slowing down UPDATE, it could boost Drupal.

This kind of solution exists: it is called server-side materialized
views. A materialized view is simply a static snapshot of a view, stored
in a table.

A simple example is a normal view called foo:
CREATE VIEW foo AS
SELECT ... FROM a, b, c, d
WHERE ... ORDER BY ... LIMIT ...

Then a simple materialized view would be:
CREATE TABLE foo_materialized AS SELECT * FROM foo;
... and add indexes.

Jonathan Gardner wrote a detailed paper on materialized views for
PostgreSQL:
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views

This article is the cornerstone of any work being done at the database
level.
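Following that article, the simplest refresh strategy is only a few lines
of PL/pgSQL. This is a sketch, assuming the view foo and the snapshot
table foo_materialized from the example above:

```sql
-- Simplest refresh strategy from the article: throw the snapshot
-- away and rebuild it entirely from the view.
CREATE OR REPLACE FUNCTION refresh_foo_materialized() RETURNS void AS $$
BEGIN
  TRUNCATE foo_materialized;
  INSERT INTO foo_materialized SELECT * FROM foo;
END;
$$ LANGUAGE plpgsql;

-- Call it from cron, or from the rare paths that write:
SELECT refresh_foo_materialized();
```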

A materialized view is a simple flat table. The advantage is that it is
kept in shared memory, used by the database, so there is no impact on
Apache speed and memory usage. When the database has enough shared
memory (as is the case on newer systems with 8 GB to 64 GB of memory) or
uses SSD disks, the database acts as a super-fast caching system. Even
on old hardware, materialized views are super-fast.

For example, if your master table has 3 million records and your
materialized view only has 10 records, a SELECT on the view only has 10
records to scan. Simple, and there is no better way to speed up queries.
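A concrete Drupal-flavoured illustration of this (hypothetical table and
column names, following the Drupal 6 node schema) might be a snapshot of
the ten newest published nodes:

```sql
-- Hypothetical snapshot: the ten newest published nodes.
CREATE TABLE frontpage_materialized AS
SELECT nid, title, created
  FROM node
 WHERE status = 1
 ORDER BY created DESC
 LIMIT 10;

CREATE INDEX frontpage_materialized_created_idx
    ON frontpage_materialized (created);

-- This scans at most 10 rows, however large node grows:
SELECT nid, title FROM frontpage_materialized ORDER BY created DESC;
```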

Oracle and DB2 have native support for materialized views.

The only drawbacks of materialized views are that:

* MySQL will never ever implement materialized views. In MySQL, a
materialized view would simply execute as a view.

* In PostgreSQL, UPDATE and DELETE have to be wired up using server-side
PL/pgSQL code. But this is only a generic layer. I am confident that we
can emulate a native system in PostgreSQL using only PL/pgSQL code. And
if we ask the PostgreSQL hackers, they may work on a native
implementation.
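To give an idea of that wiring, here is a sketch of a trigger that keeps
the snapshot in sync row by row instead of rebuilding it. It assumes foo
and foo_materialized share a key column named "id", and that the view
reads from table a; both names would have to be adapted to the real
schema:

```sql
-- Sketch: incremental maintenance of foo_materialized via a trigger.
CREATE OR REPLACE FUNCTION foo_materialized_sync() RETURNS trigger AS $$
BEGIN
  -- Remove the stale snapshot row, if any...
  IF TG_OP <> 'INSERT' THEN
    DELETE FROM foo_materialized WHERE id = OLD.id;
  END IF;
  -- ...then re-insert the fresh row from the view.
  IF TG_OP <> 'DELETE' THEN
    INSERT INTO foo_materialized SELECT * FROM foo WHERE id = NEW.id;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- Fire it after the (rare) writes on the underlying table:
CREATE TRIGGER foo_materialized_sync_trg
AFTER INSERT OR UPDATE OR DELETE ON a
FOR EACH ROW EXECUTE PROCEDURE foo_materialized_sync();
```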

In conclusion: a lot of people are looking for a reliable and scalable
CMS, and in my opinion it cannot be done without materialized views. A
materialized view is able to create a snapshot of 10 records out of 3
million, and there is no other way to be fast. You can always work at
the PHP level, but you will never reach the speed of materialized views.

Kind regards,
Jean-Michel




