Hi,
I have a lot of content that gets created automatically, on the fly, from a DB.
Mapping this content to nodes is not practical (custom search, only a minority of the content actually gets displayed, etc.).
I need a way to check whether the cache is still valid at *request* time. I can't know in advance when the cache *will* expire, but at request time I can tell whether it has already expired.
Considering that the pk for the page cache is the URL, and that only a minority of the pages will actually be cached, periodically invalidating all the URLs by checking whether the underlying DB record is younger is highly inefficient.
Before I just build my own cache tables and my own cache system on top of the cache API, is there an alternative approach?
On Friday, 01 January 2010 at 14:08 +0100, Ivan Sergio Borgonovo wrote:
Before I just build my own cache tables and my own cache system on top of the cache API, is there an alternative approach?
You may turn off caching, turn on slow-query logging, and study the SQL query log server-side, in the database itself. Find the slow queries and fix them. Of course, this may not be possible if you are on a shared server without access to the logs.
I wrote these guidelines if you are interested: http://drupal.org/node/559302
Most Drupal developers write PHP code and take advantage of the Drupal caching system. The drawback is that we developers have no idea how long the SQL queries actually take to run.
If you are running MySQL on a dedicated server, consider dropping MySQL, because it is fairly slow, and look at PostgreSQL: when configured correctly, it can be faster than the caching system. But this is my own opinion; people with no SQL background may consider MySQL a nice tool just because it runs queries.
Two months ago, I agreed to work on MySQL for a Drupal project. When queries started to flow on the production server, I saw a number of red lines indicating that MySQL was not keeping up and was dropping transactions. When I migrated back from MySQL to PostgreSQL, some tables had duplicate entries, just because users had clicked the Submit button several times. I looked at the MySQL tables and they had unique indexes. This is what MySQL does: it is simply a disaster. It can work for small sites, but for large sites you need a minimum of two servers, one for the web server and the other for the MySQL database. And if you go beyond 1000 simultaneous users, you probably need a farm, which will ruin you.
Now my contracts include a "No MySQL" clause. My opinion.
Kind regards and happy new year, Jean-Michel
On Friday, 01 January 2010 at 18:54 +0100, Jean-Michel Pouré wrote:
Before I just build my own cache tables and my own cache system on top of the cache API, is there an alternative approach?
Of course, before that you may test your Drupal installation on a recent dedicated server with 8 GB of RAM and an SSD disk. You can rent this kind of server for €70/month in France, or build your own and host it for €40/month in Germany. If developing websites is your daily work, you should consider using a dedicated host.
I monitor my websites and keep statistics on query execution time and memory. Pages are very rarely served in more than 0.2 seconds. I compared with shared hosting, where the same site took 6 seconds per page.
Kind regards, Jean-Michel
On Fri, 01 Jan 2010 18:54:58 +0100 Jean-Michel Pouré jmpoure@free.fr wrote:
On Friday, 01 January 2010 at 14:08 +0100, Ivan Sergio Borgonovo wrote:
Before I just build my own cache tables and my own cache system on top of the cache API, is there an alternative approach?
You may turn off caching, turn on slow-query logging, and study the SQL query log server-side, in the database itself. Find the slow queries and fix them. Of course, this may not be possible if you are on a shared server without access to the logs.
I'm already on pg. I'm already using my own box. I'm already monitoring slow queries. It's just that even fast queries add up, and most of the content is "nearly static". Using the Drupal cache system lets me skip a lot of queries and PHP code, so that CPU cycles are left for the things that are more expensive to make faster.
I've found 2 ways... but I'd like to hear something more drupal-halal.
To summarise the problem:
- I can't forecast when a certain page will expire.
- I can know when a certain page is stale.
- Without any tweaking it is "expensive" to relate the page cache pk (cid/url) to the records that generated it; URLs are a function of the record content.
One solution would be to use a functional index on cache_page.cid that computes, from the "url", the pk of the record that generated it, so I could quickly delete cache entries when the record is newer than cache_page.created.
Another solution could be to just add a table that saves the mapping record pk -> url. If there wasn't anything in the cache I'd hit the function that generates a fresh page and save the mapping. If there was something in the cache I wouldn't even have to check whether there was already an entry for pk -> url, so there would be no need for an UPSERT.
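To make the second idea concrete, a rough sketch in D6 terms (the table {mymodule_page_map}, its columns and the helper functions are all invented names, and the map is assumed to store the same cid string the page cache uses):

  // Called only on a cache miss, right after the fresh page has been
  // built, so a plain INSERT is enough -- no UPSERT needed.
  function mymodule_remember_mapping($record_id, $cid) {
    db_query("INSERT INTO {mymodule_page_map} (record_id, cid) VALUES (%d, '%s')",
      $record_id, $cid);
  }

  // When record $record_id changes at time $changed (unix timestamp),
  // drop every page that was cached before the change.
  function mymodule_invalidate_record($record_id, $changed) {
    db_query("DELETE FROM {cache_page}
              WHERE cid IN (SELECT cid FROM {mymodule_page_map} WHERE record_id = %d)
              AND created < %d", $record_id, $changed);
    // Cleaning up map rows whose pages Drupal itself has already wiped is
    // deliberately left out here; that is the weak point of this approach.
  }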
Still really not convinced and looking for a better solution.
I just noticed that cache_page.cid is too short for my urls...
BTW, ISAM shouldn't be bad for caching; I'm not sure whether it still performs well during cache invalidation (record deletion).
I'm already on pg. I'm already using my own box. I'm already monitoring slow queries. It's just that even fast queries add up, and most of the content is "nearly static". Using the Drupal cache system lets me skip a lot of queries and PHP code, so that CPU cycles are left for the things that are more expensive to make faster.
Good. I feel relieved for you.
One solution would be to use a functional index on cache_page.cid that computes, from the "url", the pk of the record that generated it, so I could quickly delete cache entries when the record is newer than cache_page.created.
This is a good solution. When you monitor queries in the PostgreSQL log, it is impressive how many times Drupal can run the same query within a single page.
For me it is a design flaw in Drupal. There should be nearly automatic caching within a page. If you could fix this issue, I would be delighted.
Another solution could be to just add a table that saves the mapping record pk -> url. If there wasn't anything in the cache I'd hit the function that generates a fresh page and save the mapping. If there was something in the cache I wouldn't even have to check whether there was already an entry for pk -> url, so there would be no need for an UPSERT.
May I propose a third solution: database views and materialized views?
This would be in conjunction with your proposed solutions, as Drupal needs a better awareness of queries within a page.
Views in a database are very interesting for several reasons:
* They exist on every database system: MySQL, PostgreSQL, DB2, Oracle, etc. So it is easy to add them to the Drupal database layer.
* Views have their own query plan. After a number of executions of the view, the PostgreSQL optimizer will be able to find and keep the right query plan. This lets the database optimize queries, especially the large number of queries using ORDER BY, LIMIT, COUNT, etc. that make PostgreSQL hit the disk rather than memory.
Presently I doubt that any optimizer can do a good job on the Drupal database, because it relies on a giant node table and the optimizer is not aware of the different situations. But this is just a guess; I have no proof.
* We should be able to declare two kinds of database views:
+ The first kind would be normal views, always processed on the fly. There is nothing special to say about them, other than that they have their own query plan.
+ The second kind would be materialized views, so here is a short presentation of materialized views:
In a web system, more than 99% of queries are SELECTs and less than 1% are UPDATE and DELETE queries. If you consider a store where 0.5% of users buy goods and the other 99.5% just view pages, roughly 99.5% of queries are SELECTs. The same applies to free software sites: in my projects I notice there is ONE developer for nearly 1000 users or more. It is probably the same for the Drupal web site.
For Drupal the proportion of SELECTs is even higher, as one page can have 50 underlying SELECT queries, while pages that write may have ONE OR TWO UPDATE queries, not more.
So let us calculate (say 200 pages are viewed for every one page that writes):
* SELECT: 200 * 50 = 10000
* UPDATE: 1 * 2 = 2
So the ratio of UPDATE/DELETE to SELECT may be as little as ONE to 5000.
It means that if we can find a way to speed up SELECTs, even at the price of slowing down UPDATEs, it could boost Drupal.
Such a solution exists: server-side materialized views. A materialized view is simply a static snapshot of a view, stored in a table.
A simple example is this normal view called view_foo:
  CREATE VIEW view_foo AS SELECT ... FROM a, b, c, d WHERE ... ORDER BY ... LIMIT ...;
Then a simple materialized view would just be:
  CREATE TABLE foo_materialized AS SELECT * FROM view_foo;
... plus whatever indexes it needs.
Jonathan Gardner wrote a detailed paper on materialized views for PostgreSQL: http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
This article is the cornerstone of any work done at the database level.
A materialized view is a simple flat table. The advantage is that it is kept in the database's shared memory, so there is no impact on Apache speed or memory usage. When the database has enough shared memory (as is the case on newer systems with 8 GB to 64 GB of memory) or uses SSD disks, the database acts as a super-fast caching system. Even on old hardware, materialized views are very fast.
For example, if your master table has 3 million records and your materialized view only has 10, a SELECT on the view only has 10 records to scan. Simple, and there is no better way to speed up queries.
Oracle and DB2 have native support for materialized views.
The only drawbacks of materialized views are:
* MySQL will never ever implement materialized views. In MySQL, a materialized view would simply execute as a normal view.
* In PostgreSQL, refreshing on UPDATE and DELETE has to be wired up with server-side PL/pgSQL code. But that is only a generic layer: I am confident that we can emulate a native system in PG using only PL/pgSQL code, and if we ask the PostgreSQL hackers, they may work on a native implementation. See the small refresh sketch below.
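To give the refresh step some shape, a bare-bones version could look like the two statements below. This is my own hand-written sketch, not code from Gardner's article; the names view_foo and foo_materialized come from the example above, and a real setup would drive this from triggers or a scheduled job rather than from ad-hoc PHP.

  // Rebuild the snapshot from the live view; afterwards every SELECT can
  // hit the small flat table instead of the expensive view.
  db_query('TRUNCATE foo_materialized');
  db_query('INSERT INTO foo_materialized SELECT * FROM view_foo');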
In conclusion: a lot of people are looking for a reliable and scalable CMS, and in my opinion it cannot be done without materialized views. A materialized view can hold a snapshot of 10 records out of 3 million, and there is no other way to be that fast. You can always work at the PHP level, but you will never reach the speed of materialized views.
Kind regards, Jean-Michel
On Friday, 01 January 2010 at 21:21 +0100, Ivan Sergio Borgonovo wrote:
I've found 2 ways... but I'd like to hear something more drupal-halal.
I found this interesting passage in: http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
Kind regards, Jean-Michel
***********************************************************************
Snapshot with Lazy Refreshes

Rather than have scheduled refreshes, you can simply mark the data as wrong and then have whoever is interested in seeing the correct data rebuild it and store the result. If you are very clever, you can combine this with the "Snapshot with Windows" method to include records that will be correct in the future.
Many caching methods I have seen rely on the processes who use the cache to keep it organized and fresh. This isn't unreasonable in the database world, either.
However, there is a significant development cost for this kind of behavior. A typical session would look like:
1. Examine records you are interested in looking at.
2. If there are old records, refresh them.
3. Actually run the query you were interested in again with the fresh data.
On Friday 01 January 2010 11:54:58 am Jean-Michel Pouré wrote:
I looked at the MySQL tables and they had unique indexes. This is what MySQL does: it is simply a disaster. It can work for small sites, but for large sites you need a minimum of two servers, one for the web server and the other for the MySQL database. And if you go beyond 1000 simultaneous users, you probably need a farm, which will ruin you.
Please do not spread FUD on the support list. MySQL can be an extremely effective database and very fast and scalable, if you know how to configure and maintain it properly. There are plenty of people who have such knowledge, and MySQL runs several very high-end sites. (Slashdot and, oh yeah, Drupal.org come to mind.) If you don't know how to maintain a MySQL database and try to throw something that heavy at it, sure, it's going to crumble. The same is true of any other database, or any web app, Drupal included.
By the same token, Postgres in its default configuration on most Linux distros is a complete and total performance disaster. You'd be lucky to run a low-traffic blog on it without your system breaking down in tears. No one in their right mind would run the default Postgres config for a Drupal site.
'course, if you know how to maintain a Postgres database then sure, you can make it sing on low or high traffic sites. And someone who knows MySQL can do the same with MySQL. Knowing your tool matters, no matter what the tool.
For "install and forget" setups, yes, MySQL is better suited out of the box. But for complex or high-end setups, either DB will require knowledge of the tool. That does not mean that only "people with no SQL background may consider MySQL", and it is rather disingenuous to suggest that is the case.
Besides, what the OP asked about was how to selectively invalidate the cache to avoid running queries at all. A rant about how Postgres would be faster than MySQL really doesn't add anything to the conversation.
--Larry Garfield
On Fri, 1 Jan 2010 16:19:05 -0600 Larry Garfield larry@garfieldtech.com wrote:
On Friday 01 January 2010 11:54:58 am Jean-Michel Pouré wrote:
I looked at the MySQL tables and they had unique indexes. This is what MySQL does: it is simply a disaster. It can work for small sites, but for large sites you need a minimum of two servers, one for the web server and the other for the MySQL database. And if you go beyond 1000 simultaneous users, you probably need a farm, which will ruin you.
Please do not spread FUD on the support list. MySQL can be an extremely effective database and very fast and scalable, if you know how to configure and maintain it properly. There are plenty of people who have such knowledge, and MySQL runs several very high-end sites. (Slashdot and, oh yeah, Drupal.org come to mind.) If you don't know how to maintain a MySQL database and try
Skype, Yahoo, Apple, etc... for PostgreSQL...
to throw something that heavy at it, sure, it's going to crumble. The same is true of any other database, or any web app, Drupal included.
I wouldn't put it this way. It's like saying that a train can take you from Hyde Park to Piccadilly as well as a motorbike if you know how to drive them.
And yeah, people who have been riding a motorbike for the past 5 years would put up a bit of resistance before learning to drive a train, and will hardly understand how a train can be "faster". Still, a train is not a motorbike, and it looks sluggish when you put it to tasks more suited to a motorbike.
Actually, for a few tenths of a second I was thinking of using ISAM just for the cache[1]... I wonder if this will become easier in D7. Yeah, I'm aware of fastpath... but I still meant *easier*.
Anyway, even on pure speed, if you compare pg with InnoDB, PostgreSQL has become quite competitive in the past few years; if you include factors other than "pure speed" when comparing pg with InnoDB, pg is more than just competitive. So maybe you're right... other factors come into play, not just performance. Prejudice?
'course, if you know how to maintain a Postgres database then sure, you can make it sing on low or high traffic sites. And someone who knows MySQL can do the same with MySQL. Knowing your tool matters, no matter what the tool.
Besides, what the OP asked about was how to selectively invalidate the cache to avoid running queries at all. A rant about how Postgres would be faster than MySQL really doesn't add anything to the conversation.
URLs were actually a very good pk for the page cache (BTW, many URL columns in Drupal are too short (255 chars), including cache_page.cid). So I can't complain about having to do some extra work to invalidate the cache efficiently in my case. I was looking at whether I could postpone a real solution to the problem by passing the $expire parameter at page creation time... so that I could assign longer/shorter cache times according to the probability that a certain page will become stale. But I can't see any way to pass $expire at content creation time. This seems to be true even for D7.
What is the mechanism governing "next general cache wipe"?
thanks
[1] I abandoned the idea so quickly not because I think ISAM is not suited for anything... but just because making it work properly is not trivial, and right now I have neither the time nor the need.
On Saturday 02 January 2010 12:14:26 pm Ivan Sergio Borgonovo wrote:
to throw something that heavy at it, sure, it's going to crumble. The same is true of any other database, or any web app, Drupal included.
I wouldn't put it this way. It's like saying that a train can take you from Hyde Park to Piccadilly as well as a motorbike if you know how to drive them.
And yeah, people who have been riding a motorbike for the past 5 years would put up a bit of resistance before learning to drive a train, and will hardly understand how a train can be "faster". Still, a train is not a motorbike, and it looks sluggish when you put it to tasks more suited to a motorbike.
I'm not sure I'd agree with that analogy anymore. As you say below, both Postgres and MySQL have improved considerably in recent years so previous street knowledge about "MySQL is fast and lame, Postgres is powerful but slow" is outdated and wrong these days, on both sides. The skillz of your admin now make a much bigger difference in most cases.
Actually, for a few tenths of a second I was thinking of using ISAM just for the cache[1]... I wonder if this will become easier in D7. Yeah, I'm aware of fastpath... but I still meant *easier*.
Actually, I'd go the other way. In Drupal, the cache tables are among the busier ones. Cache, session, etc. are some of the heavier-write tables that you don't want table-level locking on. watchdog or search indexes are good candidates for MyISAM, but not cache.
Anyway, even on pure speed, if you compare pg with InnoDB, PostgreSQL has become quite competitive in the past few years; if you include factors other than "pure speed" when comparing pg with InnoDB, pg is more than just competitive. So maybe you're right... other factors come into play, not just performance. Prejudice?
Exactly my point. Old prejudices about MySQL *and* Postgres are no longer really accurate, and to continue to spread them is simply FUD. (Note that I am not badmouthing Postgres here; just pointing out that "MySQL == too slow for anything more than a blog" is simply wrong.)
Besides, what the OP asked about was how to selectively invalidate the cache to avoid running queries at all. A rant about how Postgres would be faster than MySQL really doesn't add anything to the conversation.
URLs were actually a very good pk for the page cache (BTW, many URL columns in Drupal are too short (255 chars), including cache_page.cid). So I can't complain about having to do some extra work to invalidate the cache efficiently in my case. I was looking at whether I could postpone a real solution to the problem by passing the $expire parameter at page creation time... so that I could assign longer/shorter cache times according to the probability that a certain page will become stale. But I can't see any way to pass $expire at content creation time. This seems to be true even for D7.
What is the mechanism governing "next general cache wipe"?
Any time cache_clear_all() is called, anything that's not marked as CACHE_PERMANENT gets cleared. Unfortunately it gets called a little more often than many people realize, because the cache is not as fine-grained as it should be.
Perhaps what you could do is just use your own cache table, and don't identify that cache table to the cache system. That way it gets ignored by the drupal_flush_all_caches() command, but you can still use cache_get()/cache_set() for the body of the page in your own code, then disable page caching on those pages using the CacheExclude module. Turn on block caching. That will make individual pages a little more expensive, but you can then control the cache-clearing logic for the body however you want to, and since you control the cid, you can make it whatever you want.
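A minimal sketch of that, assuming D6, a module called mymodule and a private bin cache_mymodule created in hook_schema() from the standard cache table definition (all of those names are made up here):

  // Build or fetch the expensive body for one record, using a cache bin
  // that the global cache flushes never touch.
  function mymodule_page_body($record_id) {
    $cid = 'body:' . $record_id;
    $cached = cache_get($cid, 'cache_mymodule');
    if ($cached && isset($cached->data)) {
      return $cached->data;
    }
    $body = mymodule_build_body($record_id);  // hypothetical expensive builder
    // CACHE_PERMANENT: the entry lives until we explicitly remove it.
    cache_set($cid, $body, 'cache_mymodule', CACHE_PERMANENT);
    return $body;
  }

  // Invalidation stays entirely under your control:
  function mymodule_record_updated($record_id) {
    cache_clear_all('body:' . $record_id, 'cache_mymodule');
  }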
--Larry Garfield
On Sat, 2 Jan 2010 14:26:02 -0600 Larry Garfield larry@garfieldtech.com wrote:
Actually, for a few tenths of a second I was thinking of using ISAM just for the cache[1]... I wonder if this will become easier in D7. Yeah, I'm aware of fastpath... but I still meant *easier*.
Actually, I'd go the other way. In Drupal, the cache tables are among the busier ones. Cache, session, etc. are some of the heavier-write tables that you don't want table-level locking on. watchdog or search indexes are good candidates for MyISAM, but not cache.
I was thinking about partitioning... but that makes things much more complex.
Any time cache_clear_all() is called, anything that's not marked
When is cache_clear_all() called?
Perhaps what you could do is just use your own cache table, and
But this can't exploit the fact that when a whole page is cached, much less PHP code is run.
I think I'll test the functional index and saving (url, external data pk) in another table... and also collect more stats about how pages are accessed, to get an idea of where caching is needed.
Still, being able to explicitly set $expire would be a big help. The pattern I'm expecting may actually be better served by tuning the expire time rather than by flushing the cache when there is an update... or even by not caching some pages at all.
On Sat, 2 Jan 2010 14:26:02 -0600 Larry Garfield larry@garfieldtech.com wrote:
Actually, for a few tenths of a second I was thinking of using ISAM just for the cache[1]... I wonder if this will become easier in D7. Yeah, I'm aware of fastpath... but I still meant *easier*.
Actually, I'd go the other way. In Drupal, the cache tables are among the busier ones. Cache, session, etc. are some of the heavier-write tables that you don't want table-level locking on. watchdog or search indexes are good candidates for MyISAM, but not cache.
Mmm, I just noticed that a db_lock_table($table); call has disappeared between D5 and D6. Why was it needed, and what made it possible to remove it?
The code is pretty different in D7. I noticed that data is retrieved from the DB and then checked for validity in prepareItem. But it seems that all the conditions could be checked in the DB, discarding stale data earlier. What am I missing? I couldn't see any place where $user->cache is set to anything other than 0 except when the cache is cleared... So yeah, if most users have $user->cache == 0 then filtering in the DB may waste "secondary cache" space (e.g. cached DB queries), but then... what's the use of $user->cache?
Anyway, even on pure speed, if you compare pg with InnoDB, PostgreSQL has become quite competitive in the past few years; if you include factors other than "pure speed" when comparing pg with InnoDB, pg is more than just competitive. So maybe you're right... other factors come into play, not just performance. Prejudice?
Exactly my point. Old prejudices about MySQL *and* Postgres are no longer really accurate, and to continue to spread them is
I readily admit I don't have much experience tweaking MySQL... still, since it is so popular, I have had to solve some problems with it. I have never run into a corrupted DB on a default Debian installation of pg, but I regularly experience corrupted DBs on default installations of MySQL/ISAM. Considering I'm a "humble programmer" and not an admin or DBA, this is far from constituting statistical evidence... but... it builds up prejudices for sure.
Still, as a programmer I appreciate it when things have consistent, predictable behaviour and return understandable error messages, rather than silently trimming or casting, etc... I still feel more comfortable writing and *debugging* code for pg.
What is the mechanism governing "next general cache wipe"?
Any time cache_clear_all() is called, anything that's not marked as CACHE_PERMANENT gets cleared. Unfortunately it gets called a little more often than many people realize, because the cache is not as fine-grained as it should be.
So... it looks like it gets cleared when:
- feeds are refreshed
- blocks are touched
- comments are touched
- menus are touched
- a single node is touched
... etc... etc... etc... but most importantly, it seems to happen on each cron run. That's for D6.
When does cache_page get cleaned in D5 when cache_lifetime == 0?
How am I going to influence when a cached page is wiped, other than by setting a minimum lifetime? Even in D7, $expire is not directly accessible/configurable, if at all. A trick that won't always work could be to set $GLOBALS['conf']['cache_lifetime'] /* didn't check whether it is the right place, but I guess you get the idea */, but in many places $expire is hard-coded.
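For what it's worth, the trick would amount to something like this (purely illustrative, and as said above I haven't checked that this is even the right variable or the right place):

  // Early in the request, before the page cache entry is written:
  // override the minimum cache lifetime for this request only.
  $GLOBALS['conf']['cache_lifetime'] = 300;  // seconds, made-up value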
Furthermore, there is no way I can see to "late invalidate" a cached page. E.g., let's say I saved something like "from this moment on, don't use the cache for A, B and C" in the session... since hook_init fires after the cache has already been loaded, I'm obliged to return the cached version.
What method am I going to use to offer custom content to anonymous users without invalidating the cache for everyone else?
Suppose I would just like to add a greeting at the top of all pages for all (anonymous) users who took the time to fill in their name (I bet you can find more useful examples)... I set a cookie/flag in the session... but whatever I write comes into play too late to decide whether I can serve a cached page or not.
How am I going to handle this? Even with block and page caches separated, this forces me to turn off the page cache for everyone, even though the page cache reduces the cost of completely regenerating content. There should be a way to disable cache use at least according to $_SESSION content.
Perhaps what you could do is just use your own cache table, and don't identify that cache table to the cache system. That way it gets ignored by the drupal_flush_all_caches() command, but you can
I went for a functional index... it's not portable... but it is very, very cheap to implement and test. Then, when the underlying DB data changes... I can selectively invalidate the cached pages reasonably quickly. Another way would be to create a (cid, pk) table and use a join to delete stale pages, but then when a cached page gets deleted I'd have to do some clean-up of the (cid, pk) table, and that's not easy or quick if you can't use triggers, which would make the solution equally non-portable.
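For the record, the non-portable bit looks roughly like this (PostgreSQL only; extract_record_pk() is a made-up IMMUTABLE helper whose body depends entirely on how the URLs are built -- here it just grabs a trailing number):

  // One-off setup, e.g. from hook_install() or an update function.
  db_query('CREATE FUNCTION extract_record_pk(text) RETURNS int AS $f$
              SELECT substring($1 from \'[0-9]+$\')::int
            $f$ LANGUAGE sql IMMUTABLE');
  db_query('CREATE INDEX cache_page_record_pk_idx
            ON {cache_page} (extract_record_pk(cid))');

  // Whenever record $record_id changes at time $changed, drop the pages
  // that were cached before the change; the DELETE can use the index, so
  // it stays cheap even with a big cache_page table.
  db_query('DELETE FROM {cache_page}
            WHERE extract_record_pk(cid) = %d AND created < %d',
    $record_id, $changed);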
One option would be to have a hook_cache_clear(), a poor man's trigger, for all the content that is passed to Drupal without going through its "objects".
On Mon, 4 Jan 2010 14:58:01 +0100 Ivan Sergio Borgonovo mail@webthatworks.it wrote:
How am I going to influence when a cached page is wiped, other than by setting a minimum lifetime? Even in D7, $expire is not directly accessible/configurable, if at all. A trick that won't always work could be to set $GLOBALS['conf']['cache_lifetime'] /* didn't check whether it is the right place, but I guess you get the idea */, but in many places $expire is hard-coded.
Furthermore, there is no way I can see to "late invalidate" a cached page.
or just skip cache...
E.g., let's say I saved something like "from this moment on, don't use the cache for A, B and C" in the session... since hook_init fires after the cache has already been loaded, I'm obliged to return the cached version.
What method am I going to use to offer custom content to anonymous users without invalidating the cache for everyone else?
Suppose I would just like to add a greeting at the top of all pages for all (anonymous) users who took the time to fill in their name (I bet you can find more useful examples)... I set a cookie/flag in the session... but whatever I write comes into play too late to decide whether I can serve a cached page or not.
How am I going to handle this? Even with block and page caches separated, this forces me to turn off the page cache for everyone, even though the page cache reduces the cost of completely regenerating content. There should be a way to disable cache use at least according to $_SESSION content.
It seems the problem is nearly solved in D7:
function _drupal_bootstrap($phase) {
  ...
  $cache = drupal_page_get_cache();
  // If there is a cached page, display it.
  if (is_object($cache)) {
    // If the skipping of the bootstrap hooks is not enforced, call
    // hook_boot.
    if (variable_get('page_cache_invoke_hooks', TRUE)) {
      bootstrap_invoke_all('boot');
      ...
But if you put something into $_SESSION, then the page is marked as non-cacheable in drupal_session_initialize() (I think drupal_session_initialize() is called before the cache phase).
Still, if for any reason someone would like to turn off the cache based on some other kind of input (IP?), that won't be possible.
Wouldn't it be better to call hook_boot before the cache is loaded?
I'm going to hack the D5/D6 core to switch page cacheability according to $_SESSION, since this is going to come in very handy.
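In case it helps anyone else, the shape of the hack I have in mind is roughly this (a sketch against the D6 bootstrap flow, not a tested patch; the session key name is invented, and it relies on the session phase running before the late page-cache phase, which is how I read bootstrap.inc):

  // Any module can flag the current visitor's session at any time:
  $_SESSION['mymodule_bypass_page_cache'] = TRUE;

  // Then, in the bootstrap phase that serves the page cache, the existing
  // "serve the cached copy" branch gets one extra condition, e.g.:
  //   if (is_object($cache) && empty($_SESSION['mymodule_bypass_page_cache'])) {
  //     // ... serve $cache as core does today ...
  //   }
  // so flagged sessions fall through to a normal, uncached page build.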