I'm already on pg. I'm already using my box. I'm already monitoring slow queries. It's just that even fast queries add up, and most of the content is nearly static. Using Drupal's cache system lets me skip a lot of queries and PHP code, so that CPU cycles are left for the things that are more expensive to make faster.
Good. I feel relieved for you.
One solution would be a functional index on cache_page.cid that computes, from the URL, the primary key of the record that generated the page, so I could quickly delete cache entries whenever the record is newer than cache_page.created.
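A minimal sketch of that functional index; url_to_pk() is a hypothetical helper that does not exist in Drupal or PostgreSQL, and the 'node/123' path pattern is only an assumption about how the cache ids look:

```sql
-- Hypothetical: extract the node id from a cache id such as 'node/123'.
-- Must be IMMUTABLE so it can be used in an index.
CREATE FUNCTION url_to_pk(cid text) RETURNS integer AS $$
    SELECT substring($1 from 'node/([0-9]+)')::integer;
$$ LANGUAGE sql IMMUTABLE;

-- Functional index on the computed primary key.
CREATE INDEX cache_page_pk_idx ON cache_page (url_to_pk(cid));

-- Stale entries for node 123 can then be dropped without a full scan:
-- DELETE FROM cache_page WHERE url_to_pk(cid) = 123 AND created < ...;
```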
This is a good solution. When you monitor queries in the PostgreSQL log, it is impressive how often Drupal runs the same query several times within a single page.
For me it is a design flaw in Drupal. There should be nearly automatic caching within a page. If you could fix this issue, I would be delighted.
Another solution could be to just add a table that saves the mapping record pk -> url. If there wasn't anything in the cache, I'd hit the function that generates a fresh page and save the mapping. If there was something in the cache, I wouldn't even have to check whether an entry for pk -> url already existed. So no need to use an UPSERT.
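The mapping table could be sketched like this; the table and column names are illustrative, not part of Drupal's real schema:

```sql
-- One row per (record, cached page) pair; written only on a cache miss,
-- so a plain INSERT is enough and no UPSERT is needed.
CREATE TABLE cache_page_map (
    nid integer NOT NULL,       -- pk of the record that generated the page
    cid varchar(255) NOT NULL,  -- URL / cache id used in cache_page
    PRIMARY KEY (nid, cid)
);

-- After building a fresh page for node 123:
-- INSERT INTO cache_page_map (nid, cid) VALUES (123, 'node/123');

-- When node 123 changes, invalidate all pages it produced:
-- DELETE FROM cache_page
--   WHERE cid IN (SELECT cid FROM cache_page_map WHERE nid = 123);
```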
May I propose a third solution: database views and materialized views.
This would work in conjunction with your proposed solutions, as Drupal needs a better awareness of the queries within a page.
Views in a database are very interesting for several reasons:
* They exist on every database system: MySQL, PostgreSQL, DB2, Oracle, etc. So it is easy to add them to the Drupal database layer.
* Views have their own query plan. After a number of executions of the view, the PostgreSQL optimizer is able to find and store the right query plan. This lets the database optimize queries, especially the large number of queries using ORDER BY, LIMIT, COUNT, etc. that make PostgreSQL hit the disk instead of memory.
At present I doubt that any optimizer can optimize Drupal's database well, because it relies on a giant table called nodes and the optimizer is not aware of the different situations. But this is just a guess; I have no proof.
* We should be able to declare two kinds of database views:
+ The first kind would be normal views, always processed on the fly by Drupal. There is nothing special to say about them, other than that these views have their own query plan.
+ The second kind would be materialized views, so here is a short presentation of materialized views:
In a web system, more than 99% of queries are SELECTs and less than 1% are UPDATE and DELETE queries. If you consider a store where 0.5% of users buy goods and the other 99.5% just view pages, roughly 99.5% of queries are likely SELECTs. The same applies to free software: in my projects, I notice there is ONE developer for nearly 1000 users or more. It is probably the same for the Drupal web site.
For Drupal the proportion of SELECTs is even higher, as one page can have 50 underlying SELECT queries, while a page that writes may run ONE OR TWO UPDATE queries, not more.
So let us calculate, assuming 200 read-only pages for every page that writes:
* SELECT: 200 x 50 = 10000
* UPDATE: 1 x 2 = 2
So the ratio of UPDATE/DELETE vs. SELECT may be as small as ONE to 5000.
It means that if we can find a way to speed up SELECTs, even at the price of slowing down UPDATEs, it could boost Drupal.
This kind of solution exists; it is called server-side database materialized views. A materialized view is simply a static snapshot of a view, stored in a table.
A simple example is this normal view, called view_foo:

```sql
CREATE VIEW view_foo AS
    SELECT ... FROM a, b, c, d
    WHERE ...
    ORDER BY ...
    LIMIT ...;
```

Then a simple materialized view would just be:

```sql
CREATE TABLE foo_materialized AS SELECT * FROM view_foo;
```

... and add indexes.
Jonathan Gardner wrote a detailed paper on materialized views for PostgreSQL: http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
This article is the cornerstone of any work being done at the database level.
A materialized view is a simple flat table. The advantage is that it is kept in the shared memory used by the database, so there is no impact on Apache's speed and memory usage. When the database has enough shared memory (as is the case on newer systems with 8 GB to 64 GB of memory) or uses SSD disks, the database acts as a super-fast caching system. Even on old hardware, materialized views are super-fast.
For example, if your master table has 3 million records and your materialized view only has 10 records, a SELECT on the view only has 10 records to scan. Simple, and there is no better way to speed up queries.
Oracle and DB2 have native support for materialized views.
The only drawbacks of materialized views are that:
* MySQL will never ever implement materialized views. In MySQL, a materialized view would simply execute as a normal view.
* In PostgreSQL, UPDATE and DELETE have to be wired up using server-side PL/pgSQL code. But this is only a generic layer. I am confident that we can emulate a native system in PostgreSQL using only PL/pgSQL code. And if we ask the PostgreSQL hackers, they may work on a native implementation.
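Following the approach in Gardner's article, the PL/pgSQL wiring could be sketched like this; view_foo, foo_materialized and the base table a are illustrative names under the assumptions above, not a real Drupal schema, and a full rebuild on every write is the crudest possible refresh strategy:

```sql
-- Rebuild the snapshot from the underlying view.
CREATE OR REPLACE FUNCTION refresh_foo_materialized() RETURNS trigger AS $$
BEGIN
    TRUNCATE foo_materialized;
    INSERT INTO foo_materialized SELECT * FROM view_foo;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

-- Re-run the snapshot whenever a base table changes (once per statement,
-- not per row). One such trigger is needed on each base table of the view.
CREATE TRIGGER foo_materialized_refresh
    AFTER INSERT OR UPDATE OR DELETE ON a
    FOR EACH STATEMENT EXECUTE PROCEDURE refresh_foo_materialized();
```

Smarter variants would update only the affected rows of foo_materialized instead of truncating, which is exactly the kind of generic layer mentioned above.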
In conclusion: a lot of people are looking for a reliable and scalable CMS, and in my opinion it cannot be done without materialized views. A materialized view can create a snapshot of 10 records out of 3 million, and there is no other way to be that fast. You can always work at the PHP level, but you will never reach the speed of materialized views.
Kind regards, Jean-Michel