I'm already on pg. I'm already using my box. I'm already monitoring slow queries. It's just that even fast queries add up, and most of the content is nearly static. Using Drupal's cache system lets me skip a lot of queries and PHP code, so that CPU cycles are left for the things that are more expensive to make faster.
Good. I feel relieved for you.
One solution would be a functional index on cache_page.cid that computes, from the URL, the primary key of the record that generated the page, so I could quickly delete cache entries whenever the record is newer than cache_page.created.
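A minimal sketch of that functional index; url_to_pk() is a hypothetical helper that does not exist in Drupal or PostgreSQL, and the 'node/123' path pattern is only an assumption about how the cache ids look:

```sql
-- Hypothetical: extract the node id from a cache id such as 'node/123'.
-- Must be IMMUTABLE so it can be used in an index.
CREATE FUNCTION url_to_pk(cid text) RETURNS integer AS $$
    SELECT substring($1 from 'node/([0-9]+)')::integer;
$$ LANGUAGE sql IMMUTABLE;

-- Functional index on the computed primary key.
CREATE INDEX cache_page_pk_idx ON cache_page (url_to_pk(cid));

-- Stale entries for node 123 can then be dropped without a full scan:
-- DELETE FROM cache_page WHERE url_to_pk(cid) = 123 AND created < ...;
```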
This is a good solution. When you monitor queries in the PostgreSQL log, it is impressive how often Drupal runs the same query several times within a single page.
For me it is a design flaw in Drupal. There should be nearly automatic caching within a page. If you could fix this issue, I would be delighted.
Another solution could be to just add a table that saves the mapping record pk -> url. If there wasn't anything in the cache, I'd hit the function that generates a fresh page and save the mapping. If there was something in the cache, I wouldn't even have to check whether an entry for pk -> url already existed. So no need to use an UPSERT.
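The mapping table could be sketched like this; the table and column names are illustrative, not part of Drupal's real schema:

```sql
-- One row per (record, cached page) pair; written only on a cache miss,
-- so a plain INSERT is enough and no UPSERT is needed.
CREATE TABLE cache_page_map (
    nid integer NOT NULL,       -- pk of the record that generated the page
    cid varchar(255) NOT NULL,  -- URL / cache id used in cache_page
    PRIMARY KEY (nid, cid)
);

-- After building a fresh page for node 123:
-- INSERT INTO cache_page_map (nid, cid) VALUES (123, 'node/123');

-- When node 123 changes, invalidate all pages it produced:
-- DELETE FROM cache_page
--   WHERE cid IN (SELECT cid FROM cache_page_map WHERE nid = 123);
```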
May I propose a third solution: database views and materialized views.
This would work in conjunction with your proposed solutions, as Drupal needs a better awareness of the queries within a page.
Views in a database are very interesting for several reasons:
* They exist on every database system: MySQL, PostgreSQL, DB2, Oracle, etc. So it is easy to add them to the Drupal database layer.
* Views have their own query plan. After a number of executions of the view, the PostgreSQL optimizer is able to find and store the right query plan. This lets the database optimize queries, especially the large number of queries using ORDER BY, LIMIT, COUNT, etc. that make PostgreSQL hit the disk instead of memory.
At present I doubt that any optimizer can optimize Drupal's database well, because it relies on a giant table called nodes and the optimizer is not aware of the different situations. But this is just a guess; I have no proof.
* We should be able to declare two kinds of database views:
+ The first kind would be normal views, always processed on the fly by Drupal. There is nothing special to say about them, other than that these views have their own query plan.
+ The second kind would be materialized views, so here is a short presentation of materialized views:
In a web system, more than 99% of queries are SELECTs and less than 1% are UPDATE and DELETE queries. If you consider a store where 0.5% of users buy goods and the other 99.5% just view pages, roughly 99.5% of queries are likely SELECTs. The same applies to free software: in my projects, I notice there is ONE developer for nearly 1000 users or more. It is probably the same for the Drupal web site.
For Drupal the proportion of SELECTs is even higher, as one page can have 50 underlying SELECT queries, while a page that writes may run ONE OR TWO UPDATE queries, not more.
So let us calculate, assuming 200 read-only pages for every page that writes:
* SELECT: 200 x 50 = 10000
* UPDATE: 1 x 2 = 2
So the ratio of UPDATE/DELETE vs. SELECT may be as small as ONE to 5000.
It means that if we can find a way to speed up SELECTs, even at the price of slowing down UPDATEs, it could boost Drupal.
This kind of solution exists; it is called server-side database materialized views. A materialized view is simply a static snapshot of a view, stored in a table.
A simple example is this normal view, called view_foo:

```sql
CREATE VIEW view_foo AS
    SELECT ... FROM a, b, c, d
    WHERE ...
    ORDER BY ...
    LIMIT ...;
```

Then a simple materialized view would just be:

```sql
CREATE TABLE foo_materialized AS SELECT * FROM view_foo;
```

... and add indexes.
Jonathan Gardner wrote a detailed paper on materialized views for PostgreSQL: http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
This article is the cornerstone of any work being done at the database level.
A materialized view is a simple flat table. The advantage is that it is kept in the shared memory used by the database, so there is no impact on Apache's speed and memory usage. When the database has enough shared memory (as is the case on newer systems with 8 GB to 64 GB of memory) or uses SSD disks, the database acts as a super-fast caching system. Even on old hardware, materialized views are super-fast.
For example, if your master table has 3 million records and your materialized view only has 10 records, a SELECT on the view only has 10 records to scan. Simple, and there is no better way to speed up queries.
Oracle and DB2 have native support for materialized views.
The only drawbacks of materialized views are that:
* MySQL will never ever implement materialized views. In MySQL, a materialized view would simply execute as a normal view.
* In PostgreSQL, UPDATE and DELETE have to be wired up using server-side PL/pgSQL code. But this is only a generic layer. I am confident that we can emulate a native system in PostgreSQL using only PL/pgSQL code. And if we ask the PostgreSQL hackers, they may work on a native implementation.
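Following the approach in Gardner's article, the PL/pgSQL wiring could be sketched like this; view_foo, foo_materialized and the base table a are illustrative names under the assumptions above, not a real Drupal schema, and a full rebuild on every write is the crudest possible refresh strategy:

```sql
-- Rebuild the snapshot from the underlying view.
CREATE OR REPLACE FUNCTION refresh_foo_materialized() RETURNS trigger AS $$
BEGIN
    TRUNCATE foo_materialized;
    INSERT INTO foo_materialized SELECT * FROM view_foo;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

-- Re-run the snapshot whenever a base table changes (once per statement,
-- not per row). One such trigger is needed on each base table of the view.
CREATE TRIGGER foo_materialized_refresh
    AFTER INSERT OR UPDATE OR DELETE ON a
    FOR EACH STATEMENT EXECUTE PROCEDURE refresh_foo_materialized();
```

Smarter variants would update only the affected rows of foo_materialized instead of truncating, which is exactly the kind of generic layer mentioned above.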
In conclusion: a lot of people are looking for a reliable and scalable CMS, and in my opinion it cannot be done without materialized views. A materialized view can create a snapshot of 10 records out of 3 million, and there is no other way to be that fast. You can always work at the PHP level, but you will never reach the speed of materialized views.
Kind regards, Jean-Michel