Caching, caching, caching...
Hi there, you may have noticed my recent interest in better caching. This is due to me monitoring the drupal.org hardware. Luckily, the problems we've had (namely the CPU on the db server being saturated) were resolved by more mundane means (optimze your database on occassion...), but since we continue to experience a growing number of visitors on drupal.org, better caching is no waste of time. I've taken two different approaches: 1) to splt up the cache table. The idea is that parts of the cache table don't change that often (the cached menu structures for each user) while others do (page cache, filter cache, ...). The idea is to put the not often changing parts into their own table to allow the mysql query cache to effectively cache this table. The cached queries will be invalidated and removed once one of the underlying tables changes. This cache has been tried on drupal.org, the results were inconclusive, more tests need to be done. 2) to cache expensive queries in Drupal's cache table. I've chosen the forum module as an example and cached the two blocks it provides and the queries in forum_get_forums (forum_get_topics would be another candidate). During initial benchmarking, I've found that the results with and without my patch hardly differed. Why? The mysql query cache cached the queries. Without the mysql query cache my patch improved the time for delivering /forum with both blocks enabled by about 20%. But of course nobody would run mysql without the query cache so the result is hardly interesting. The question is now: Under which circumstances can my caching inside Drupal be better than the mysql query cache? Let's look at the tables involved: {node} {node_comment_statistics} {term_node} {users} Any time one of these tables changes, the mysql cache for queries using the table in question will be invalidated. The approach to caching I've chosen is more fine grained: Only if the node involved is of type forum, the cached query results will be invalidated. Changes to the users table don't matter at all, but we should probably invalidate the cache once a user change his name. On a site such as drupal.org where the majority of comments are against forum nodes, the results of my patch will probably be hardly noticable. If however we convert the project module to use real comments, the situation will change. This discussion seems to make it fairly obvious that caching results of queries to tables that hardly ever change (think taxonomy (unless you use tagging)) is most likely a waste of code. Caching the result of taxonomy_get_tree does seem to confirm this. Now the question is if we want to go down this road. To properly invalidate the cache requires to implement all sorts of hooks (forum module needs _comment and _user for example) which you would not need otherwise. This complicates the code quite a bit. I think that the answer should depend on whether the mysql query cache is as readily available as I assumed it is. It is possible that it isn't available on shared hosting, for example. It is also possible that the example I've chosen was a bad one and caching other structures (especially those where a fair amount of php processing is done) might even provide advantages for people who have the query cache at their disposal. The tests I've done were all done on my laptop while on battery power which is hardly a valid test case. Better tests are needed to arrive at a conclusive result. Cheers, Gerhard
On Wed, 2006-07-12 at 00:01 +0200, Gerhard Killesreiter wrote:
Hi there,
you may have noticed my recent interest in better caching. This is due to me monitoring the drupal.org hardware. Luckily, the problems we've had (namely the CPU on the db server being saturated) were resolved by more mundane means (optimze your database on occassion...), but since we continue to experience a growing number of visitors on drupal.org, better caching is no waste of time.
I've taken two different approaches:
1) to splt up the cache table. The idea is that parts of the cache table don't change that often (the cached menu structures for each user) while others do (page cache, filter cache, ...). The idea is to put the not often changing parts into their own table to allow the mysql query cache to effectively cache this table. The cached queries will be invalidated and removed once one of the underlying tables changes.
This cache has been tried on drupal.org, the results were inconclusive, more tests need to be done.
2) to cache expensive queries in Drupal's cache table. I've chosen the forum module as an example and cached the two blocks it provides and the queries in forum_get_forums (forum_get_topics would be another candidate).
During initial benchmarking, I've found that the results with and without my patch hardly differed. Why? The mysql query cache cached the queries. Without the mysql query cache my patch improved the time for delivering /forum with both blocks enabled by about 20%. But of course nobody would run mysql without the query cache so the result is hardly interesting. The question is now: Under which circumstances can my caching inside Drupal be better than the mysql query cache?
Let's look at the tables involved:
{node} {node_comment_statistics} {term_node} {users}
Any time one of these tables changes, the mysql cache for queries using the table in question will be invalidated. The approach to caching I've chosen is more fine grained: Only if the node involved is of type forum, the cached query results will be invalidated. Changes to the users table don't matter at all, but we should probably invalidate the cache once a user change his name.
On a site such as drupal.org where the majority of comments are against forum nodes, the results of my patch will probably be hardly noticable. If however we convert the project module to use real comments, the situation will change.
This discussion seems to make it fairly obvious that caching results of queries to tables that hardly ever change (think taxonomy (unless you use tagging)) is most likely a waste of code. Caching the result of taxonomy_get_tree does seem to confirm this.
Now the question is if we want to go down this road. To properly invalidate the cache requires to implement all sorts of hooks (forum module needs _comment and _user for example) which you would not need otherwise. This complicates the code quite a bit.
I think that the answer should depend on whether the mysql query cache is as readily available as I assumed it is. It is possible that it isn't available on shared hosting, for example. It is also possible that the example I've chosen was a bad one and caching other structures (especially those where a fair amount of php processing is done) might even provide advantages for people who have the query cache at their disposal.
The tests I've done were all done on my laptop while on battery power which is hardly a valid test case. Better tests are needed to arrive at a conclusive result.
Cheers, Gerhard
I personally am of the opinion to trust my RDMS's query caching mechanism. That stuff has been being worked on for a long time. For catering to the LCD shared hosting sites I think caching could be helpful, after all things can get pushed out of the query cache much more easily when there are 50 sites all querying the same local db. I prefer the idea of caching more output items, rendered blocks, nodes(teaser and full), than trying to beat out the query level caching provided by the db server. I think with the new file caching code that could be a good avenue to explore.
Hi,
Now the question is if we want to go down this road. To properly invalidate the cache requires to implement all sorts of hooks (forum module needs _comment and _user for example) which you would not need otherwise. This complicates the code quite a bit.
I agree, personally i would create a "cache API", which would take care of following objectives: 1) let modules invalidate specific cache records 2) let modules define a treshold for own content, which instructs cache API how often automatically recache it's content Jakub Suchy
Gerhard, One more note here, is that all this is MySQL centric. What about PostgreSQL? Its caching is different, and optimizing for one type of database will not apply. I think even Inno-DB (or the upcoming SOLID) caching will be different than MyISAM. I think two things can be explored further. 1. Revisit the file caching patch by Jeremy Andrews. 2. Look into object caches, like memcached, if the database is a bottleneck.
On 12 Jul 2006, at 00:01, Gerhard Killesreiter wrote:
Now the question is if we want to go down this road. To properly invalidate the cache requires to implement all sorts of hooks (forum module needs _comment and _user for example) which you would not need otherwise. This complicates the code quite a bit.
Personally, I'd not optimize with MySQL's query cache in mind. If the folks at MySQL improve their query cache implementation, our efforts might have been moot. I'd focus my energy on high-level improvements, rather than low-level improvements. That is, I'd focus architectural changes that reduce the number of SQL queries or that simplify complex queries (eg. less joins). For example, the node table and node_comment_statistics table have a one-on-one relation and could potentially be merged. I'm not sure what it would buy us, but it might be worth it. Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are? You know, accumulate results over multiple page views. Do we have recent data query data from drupal.org? Let's start by identifying some key bottlenecks. -- Dries Buytaert :: http://www.buytaert.net/
Dries Buytaert wrote:
On 12 Jul 2006, at 00:01, Gerhard Killesreiter wrote:
Now the question is if we want to go down this road. To properly invalidate the cache requires to implement all sorts of hooks (forum module needs _comment and _user for example) which you would not need otherwise. This complicates the code quite a bit.
Personally, I'd not optimize with MySQL's query cache in mind. If the folks at MySQL improve their query cache implementation, our efforts might have been moot.
True. The most important aspect of mysql's query cache is that it isn't that well suited for a dynamic application such as Drupal the hit:miss ratio is only about 3.5 on drupal.org.
I'd focus my energy on high-level improvements, rather than low-level improvements. That is, I'd focus architectural changes that reduce the number of SQL queries or that simplify complex queries (eg. less joins). For example, the node table and node_comment_statistics table have a one-on-one relation and could potentially be merged. I'm not sure what it would buy us, but it might be worth it.
I think that the potential for such changes is rather small if we still want to keep a reasonably sane DB schema.
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
You know, accumulate results over multiple page views. Do we have recent data query data from drupal.org? Let's start by identifying some key bottlenecks.
The forum queries that I cached in my patch are also among the more prominent ressource hogs on drupal.org, IIRC. Caching them might give us an immediate benefit. This of course only applies if the cache hit:miss ratio is significant. Maybe I should try them on drupal.org... Cheers, Gerhard
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
Are there reasonable assumptions we can make about certain URLs? Like, can we assume that people don't want to alias '?q=qcomment/ reply/123'? Or ?q=contact? Or q=user/register and ?q=user/ password? If we can make some simple assumptions, we might be able to save a helluvalot database queries (per page view) ... Food for thought. -- Dries Buytaert :: http://www.buytaert.net/
Dries Buytaert wrote:
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
Are there reasonable assumptions we can make about certain URLs? Like, can we assume that people don't want to alias '?q=qcomment/reply/123'?
Maybe.
Or ?q=contact? Or q=user/register and ?q=user/password?
I always alias those for sites in German.
If we can make some simple assumptions, we might be able to save a helluvalot database queries (per page view) ... Food for thought.
Frankly, I don't think we will get a lot of agreement on what can be aliased and what not. We might be able to devise a solution that works similar to the translations cache on a per page basis. Cheers, Gerhard
Op 14-jul-2006, om 20:22 heeft Gerhard Killesreiter het volgende geschreven:
Dries Buytaert wrote:
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
Are there reasonable assumptions we can make about certain URLs? Like, can we assume that people don't want to alias '?q=qcomment/ reply/123'?
Maybe.
Or ?q=contact? Or q=user/register and ?q=user/password?
I always alias those for sites in German.
If we can make some simple assumptions, we might be able to save a helluvalot database queries (per page view) ... Food for thought.
Frankly, I don't think we will get a lot of agreement on what can be aliased and what not. We might be able to devise a solution that works similar to the translations cache on a per page basis.
Cheers, Gerhard
What if we alias by default? So if we do instead of ?q=node/$nid ?q=$node->type/$node->title or even ?q=$node->tite? Looking through the maasive use of the path.module and the mass_url.module we can say people do want fancy _and_ easy to remember url's.. Just some thought... Stefan
On Jul 14, 2006, at 2:55 PM, Stefan Nagtegaal wrote:
What if we alias by default?
So if we do instead of ?q=node/$nid ?q=$node->type/$node->title or even ?q=$node->tite?
Looking through the maasive use of the path.module and the mass_url.module we can say people do want fancy _and_ easy to remember url's..
that would be a complete nightmare for issue nodes, since the titles can change all the time, yet people need easy to paste, unique URLs to refer to issues with. also, titles can contain a lot of characters that aren't valid URLs... there's a reason path.module and mass_url.module are in contrib... not everyone wants that behavior (for good reason). personally, leaving things as /node/X by default (with optional aliases where desired) is sort of like having your site automatically filtered through tinyurl.com. ;) -1 from me... -derek
that would be a complete nightmare for issue nodes, since the titles can change all the time, yet people need easy to paste, unique URLs to refer to issues with.
Hi, starting new thread as i believe there is another issue. fancy URLs are not suitable for users only. they are designed to be good for search engines (SEO) too. Drupals approach to fancy URLs is one of the best between CMS. But there are some drawbacks: 1) it allows duplicates. a duplicate URL for same content is nor good for search engines neither for users. 2) also node/123 has an alias "my-test-node". mysite.com/my-test-node displays correct content, but mysite.com/node/123 displays it too. this is duplicate again. maybe this is a part of the solution? displaying: http://example.com/node/123: if (alias_exists("node/123")) { ... redirect 301 moved permanently to alias ... } else { ... display content ... } we should also remember: when redirecting to new alias, we should everytime do 301 redirect, not ordinary PHP's Location:, which is 302. any ideas how to solve 1) ? Jakub Suchy
On 7/18/06, Derek Wright <drupal@dwwright.net> wrote:
Looking through the maasive use of the path.module and the mass_url.module we can say people do want fancy _and_ easy to remember url's..
that would be a complete nightmare for issue nodes, since the titles can change all the time, yet people need easy to paste, unique URLs to refer to issues with.
How often do titles change? So asks the guy who avoids using the title in URL aliases because he changes titles.
also, titles can contain a lot of characters that aren't valid URLs...
there's a reason path.module and mass_url.module are in contrib... not everyone wants that behavior (for good reason). personally, leaving things as /node/X by default (with optional aliases where desired) is sort of like having your site automatically filtered through tinyurl.com. ;)
path.module is in core, mass_url hasn't had a checkin for over a year: http://cvs.drupal.org/viewcvs/drupal/contributions/modules/mass_url/mass_url... Did you mean the Pathauto module? I've been trying to think of a way to have short e-mail-friendly URLs redirect automatically to their long, informative bretheren. I think TinyURL and their clones are useful only (and yes, only) when saying URLs in voice conversation. Otherwise we say "here, I'm sending you a link" while in voice conversation or just URL-slap in IM or IRC, and it's the full link while TinyURL-like services don't tell you where you're going, making the reader wary of clicking on it. (Of course it takes a discussion about URL aliases to get me to de-lurk.) -- Richard Eriksson http://www.bryght.com/
On Jul 18, 2006, at 5:17 PM, Richard Eriksson wrote:
How often do titles change?
in the issue queues (e.g. http://drupal.org/project/issues), all the time. partly because people don't realize that the title of their follow-up changes the title of the issue itself, and partly because sometimes an issue evolves over time and people provide better, more accurate titles to reflect what the problem/feature really is about... -derek
On 14 Jul 2006, at 20:15, Dries Buytaert wrote:
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
Are there reasonable assumptions we can make about certain URLs? Like, can we assume that people don't want to alias '?q=qcomment/ reply/123'? Or ?q=contact? Or q=user/register and ?q=user/ password? If we can make some simple assumptions, we might be able to save a helluvalot database queries (per page view) ... Food for thought.
Any thoughts here? -- Dries Buytaert :: http://www.buytaert.net/
Dries Buytaert wrote:
On 14 Jul 2006, at 20:15, Dries Buytaert wrote:
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
Are there reasonable assumptions we can make about certain URLs? Like, can we assume that people don't want to alias '?q=qcomment/reply/123'? Or ?q=contact? Or q=user/register and ?q=user/password? If we can make some simple assumptions, we might be able to save a helluvalot database queries (per page view) ... Food for thought.
Any thoughts here?
i can't think of any reasonable assumptions. to me, the waste is not that we have to perform so many queries, but rather that we do so on every page view even though the data rarely changes. to address that, one should use mysql query cache or even memcached. mysqli provides another possibility: prepared statements. queries that recur frequently benefit from prepared statements because mysql does not need to construct a query plan for every query. see http://dev.mysql.com/tech-resources/articles/4.1/prepared-statements.html fyi, devel.module just got the ability to do query logging to the DB. killes and i just need to quash one silly bug and it will be ready.
On Jul 21, 2006, at 6:15 AM, Moshe Weitzman wrote:
Dries Buytaert wrote:
On 14 Jul 2006, at 20:15, Dries Buytaert wrote:
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
Are there reasonable assumptions we can make about certain URLs? Like, can we assume that people don't want to alias '?q=qcomment/ reply/123'? Or ?q=contact? Or q=user/register and ?q=user/ password? If we can make some simple assumptions, we might be able to save a helluvalot database queries (per page view) ... Food for thought. Any thoughts here?
i can't think of any reasonable assumptions. to me, the waste is not that we have to perform so many queries, but rather that we do so on every page view even though the data rarely changes. to address that, one should use mysql query cache or even memcached.
mysqli provides another possibility: prepared statements. queries that recur frequently benefit from prepared statements because mysql does not need to construct a query plan for every query. see http://dev.mysql.com/tech-resources/articles/4.1/prepared- statements.html
I am not sure about mysqli but many db engines cache queries that are precompiled as well as their results. This reduces parsing the query over and over again. It's an optimization that can reduce some of the benefits of stored procedures.
fyi, devel.module just got the ability to do query logging to the DB. killes and i just need to quash one silly bug and it will be ready.
Very nice! Kieran
On 21 Jul 2006, at 15:15, Moshe Weitzman wrote:
i can't think of any reasonable assumptions. to me, the waste is not that we have to perform so many queries, but rather that we do so on every page view even though the data rarely changes. to address that, one should use mysql query cache or even memcached.
A query cache only eliminates part of the overhead. Most of the overhead comes from Drupal's database abstraction layer, not from execution the actual SQL query: http://buytaert.net/drupal-database-interaction We gain most if we eliminate calls to db_query(). We use a MySQL query cache on drupal.org, and the one query that retrieves the URL alias is still to most expensive query (accumulated cost).
fyi, devel.module just got the ability to do query logging to the DB. killes and i just need to quash one silly bug and it will be ready.
Does it (still) accumulate data across different page views? -- Dries Buytaert :: http://www.buytaert.net/
Does it (still) accumulate data across different page views?
yes. if interested, see http://drupal.org/node/47755
Dries Buytaert wrote:
On 21 Jul 2006, at 15:15, Moshe Weitzman wrote:
i can't think of any reasonable assumptions. to me, the waste is not that we have to perform so many queries, but rather that we do so on every page view even though the data rarely changes. to address that, one should use mysql query cache or even memcached.
A query cache only eliminates part of the overhead. Most of the overhead comes from Drupal's database abstraction layer, not from execution the actual SQL query:
http://buytaert.net/drupal-database-interaction
We gain most if we eliminate calls to db_query(). We use a MySQL query cache on drupal.org, and the one query that retrieves the URL alias is still to most expensive query (accumulated cost).
good point. i guess you have looked at opportunities for improvement in db_query() and friends? I suspect that the slowdown is due to regular expressions. I'm not fluent enough to evaluate which are slow, but here is the list of DB functions that contain a regex. db_query db_query_range db_rewrite_sql db_escape_table db_query_temporary - only used for search i think Also, we call mysqlescapestring() an awful lot, to sanitize data in the query. We do so even when the data is known safe. I wonder if this is expensive or not? @Dries - did you have the devel.module query log enabled when you did your analysis? That log calls debug_backtrace() for every query and could conceivably slow things down. So, someone please look at these regex and see if they are expensive, and if we can improve them. Or identify some other cause of delay in our DB abstraction. The timer API might be useful here. -moshe
On 22 Jul 2006, at 18:52, Moshe Weitzman wrote:
Dries Buytaert wrote:
On 21 Jul 2006, at 15:15, Moshe Weitzman wrote:
i can't think of any reasonable assumptions. to me, the waste is not that we have to perform so many queries, but rather that we do so on every page view even though the data rarely changes. to address that, one should use mysql query cache or even memcached. A query cache only eliminates part of the overhead. Most of the overhead comes from Drupal's database abstraction layer, not from execution the actual SQL query: http://buytaert.net/drupal-database-interaction We gain most if we eliminate calls to db_query(). We use a MySQL query cache on drupal.org, and the one query that retrieves the URL alias is still to most expensive query (accumulated cost).
good point. i guess you have looked at opportunities for improvement in db_query() and friends? I suspect that the slowdown is due to regular expressions. I'm not fluent enough to evaluate which are slow, but here is the list of DB functions that contain a regex.
db_query db_query_range db_rewrite_sql db_escape_table db_query_temporary - only used for search i think
My post (http://buytaert.net/drupal-database-interaction) tells you _exactly_ what functions to focus on. Check the figures. :) -- Dries Buytaert :: http://www.buytaert.net/
@Dries - did you have the devel.module query log enabled when you did your analysis? That log calls debug_backtrace() for every query and could conceivably slow things down.
let me revise that - if you turned on the query log and subsequently disabled devel.module, you would still pay the performance penalty because the variable_get('dev_query', 0) will still be true. to fix this, we should introduce an uninstall hook and advise module authors when they should and should not use it (this is an example of *should use it*).
I notice that we sanitize $path every time l() is called [note: l() calls url()]. So these 100+ alias queries also imply 100+ calls to mysql_real_escape_string(). note that db_escape_string() is on the list of offenders at dries' figures. maybe we need a parameter on url() where a developer can declare that his input $path is safe. consider the many links which to "node/$nid" - these get sanitized even though $nid comes from an integer field in the DB. it isn't totally clear how xdebug does its accounting, but i i acknowledge that avoiding output filtering is a bit scary but probably acceptable in this case.
On 7/21/06, Dries Buytaert <dries@buytaert.net> wrote:
On 14 Jul 2006, at 20:15, Dries Buytaert wrote:
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
Are there reasonable assumptions we can make about certain URLs? Like, can we assume that people don't want to alias '?q=qcomment/ reply/123'? Or ?q=contact? Or q=user/register and ?q=user/ password? If we can make some simple assumptions, we might be able to save a helluvalot database queries (per page view) ... Food for thought.
Any thoughts here?
I have to agree with Moshe on this. This just increases the "special cases" we have, and that can be trouble down the road. Moreover, the list can vary from one site to the other ... More generic solutions are better.
Dries Buytaert wrote:
On 14 Jul 2006, at 20:15, Dries Buytaert wrote:
Maybe we should run the modified devel.module on drupal.org to see what the most expensive queries are?
SELECT * FROM url_alias, I'd be surprized if that had changed. The individual query is very short but there are probably 120 per page on average.
Are there reasonable assumptions we can make about certain URLs? Like, can we assume that people don't want to alias '?q=qcomment/ reply/123'? Or ?q=contact? Or q=user/register and ?q=user/ password? If we can make some simple assumptions, we might be able to save a helluvalot database queries (per page view) ... Food for thought.
Any thoughts here?
I still like my limited url alias cache.
What Ken Rickard said makes me think of another solution: If we provide a generic way to exclude certain paths, that can cut down the processed aliases. For example, a site may only want to alias "taxonomy/term/*" and "node/*" and nothing else. Another site may only want aliases for "node/*" and "user/*" The code this way is generic and customizable on a per site basis with no patches, ...etc. You get the idea ...
Khalid B wrote:
What Ken Rickard said makes me think of another solution:
If we provide a generic way to exclude certain paths, that can cut down the processed aliases.
For example, a site may only want to alias "taxonomy/term/*" and "node/*" and nothing else.
Another site may only want aliases for "node/*" and "user/*"
The code this way is generic and customizable on a per site basis with no patches, ...etc.
You get the idea ...
Interesting idea. We could look at the system urls that actually are aliased on a particular site, try to abstract some regexp from them (mainly by replacing numbers by a regexp matching numbers, I'd think. Maybe also by truncating them.) and store the resulting regexps in a variable. Then, before we'd look for a matching alias for a particular system url we can first match that urls against our regexps. I somehow doubt that this would be faster, though. Cheers, Gerhard
On 7/21/06, Gerhard Killesreiter <gerhard@killesreiter.de> wrote:
Khalid B wrote:
What Ken Rickard said makes me think of another solution:
If we provide a generic way to exclude certain paths, that can cut down the processed aliases.
For example, a site may only want to alias "taxonomy/term/*" and "node/*" and nothing else.
Another site may only want aliases for "node/*" and "user/*"
The code this way is generic and customizable on a per site basis with no patches, ...etc.
You get the idea ...
Interesting idea. We could look at the system urls that actually are aliased on a particular site, try to abstract some regexp from them (mainly by replacing numbers by a regexp matching numbers, I'd think. Maybe also by truncating them.) and store the resulting regexps in a variable. Then, before we'd look for a matching alias for a particular system url we can first match that urls against our regexps.
I somehow doubt that this would be faster, though.
Actually what I meant was sort of a filter for the admin (or modules) on what should get aliased and what should be ignored. This way, we cut down on the number of useless aliases (for that particular site). If a site restricts aliasing to $_GET['q'] of ^taxonomy/term/\d$ ^taxonomy/term\d/feed$ ^node/\d$ ^node/\d/feed$ Then menus become a non issue. They do not take up space in the database, they are not queried either. The above pattern is enough for the majority of sites (perhaps with user/* as well). Modules like pathauto can use that to add or remove patterns too. So, the speed in limiting what gets aliased, and what doesn't and what gets queried and what doesn't.
On 22 Jul 2006, at 03:18, Khalid B wrote:
This way, we cut down on the number of useless aliases (for that particular site). If a site restricts aliasing to $_GET['q'] of ^taxonomy/term/\d$ ^taxonomy/term\d/feed$ ^node/\d$ ^node/\d/feed$ Then menus become a non issue. They do not take up space in the database, they are not queried either.
The above pattern is enough for the majority of sites (perhaps with user/* as well). Modules like pathauto can use that to add or remove patterns too.
So we have a number of viable ideas here. Let's try to summarize them: 1. Build a caching algorithm that uses an heuristic to pre-load frequently used URL aliases. * Advantages: transparent, no configuration required * Disadvantages: it a heuristic, we don't know how it would perform, it might be tricky to implement, and MySQL does this implicitly (but not as aggressive). 2. Provide a textarea that allows administrators to _white_list_ URL patterns. * Advantages: fine-grained control, easy to implement * Disadvantages: users usually don't like messing with regular expressions, it might take a lot of effort to get the list Just Right, and it takes a certain amount of familiarity with Drupal's URL scheme (learning curve for new Drupal users). The behavior might be confusing: you add an alias, and it doesn't work because you forgot about the list of URL patterns. 3. Provide a textarea that allows administrators to _black_list_ URL patterns. * Advantages: fine-grained control, easy to implement * Disadvantages: users usually don't like messing with regular expressions, it might take a lot of effort to get the list Just Right, and it takes a certain amount of familiarity with Drupal's URL scheme (learning curve for new Drupal users). The behavior might be confusing: you add an alias, and it doesn't work because you forgot about the list of URL patterns. 4. Stop doing SQL queries when you cached all possible URL aliases. * Advantages: transparent, no configuration required, can co- exist with (1), (2), (3) and (5). * Disadvantages: only works for a subset of all Drupal sites, not a solution for larger Drupal sites. 5. Improve Drupal's high-level page caching so we have to rebuild pages less frequently. * Advantages: no configuration required, can co-exist with (1), (2), (3) and (4), eliminates many more SQL queries. * Disadvantages: doesn't work for authenticated users. Thoughts? -- Dries Buytaert :: http://www.buytaert.net/
Dries Buytaert wrote:
Thoughts?
Can I suggest a #6? 6. Cache by role, not by user. We already cache for one role (Anonymous User). In the vast majority of cases (we'd need to no-cache the exceptions), pages which are not cached at all for logged-in users actually vary only by role, not uid. If we break a cached page down into constituent parts: 1. $comments, $content, $links, most blocks, insert-others-here 2. per-user blocks, insert-others-here ...then we can cache anything in group 1, once for each role. This would *seriously* speed up high-volume interactive sites, and can be implemented at the theme-function level. Please feel free to shoot me down in flames now. :) -- ------------------------------------------- John Handelaar E john@handelaar.org T +353 21 427 9033 M +353 85 748 3790 http://handelaar.org ------------------------------------------- Work in progress: http://dev.vocalvoter.com -------------------------------------------
On 22 Jul 2006, at 18:35, John Handelaar wrote:
Dries Buytaert wrote:
Thoughts?
Can I suggest a #6?
6. Cache by role, not by user.
Please feel free to shoot me down in flames now. :)
Cache by role doesn't work because there are many per-user settings (eg. language preference, the username in the navigation-block, etc). -- Dries Buytaert :: http://www.buytaert.net/
Dries Buytaert wrote:
Cache by role doesn't work because there are many per-user settings (eg. language preference, the username in the navigation-block, etc).
To be fair, I did *explicitly* mention that would be some elements which could not be cached by role. There are, however, a whole bunch which can - I'm confident in my original assertion of 'vast majority', in fact. That there are some places it can't work doesn't sound like a good reason to not consider the places where it can - unless we're going to keep thinking of caches as a full-page- only thing. i18n/l10n, as you say, would probably have to be another "vary" switch. This implies that we should maybe be thinking about a new core hook, which modules can use to introduce/manage switching cases. How about stepping back and thinking about what variables can affect caching on any given theme() function, and in which order they do so? eg 1. (core) Role, and if not role then 2. (core) User, and if not user then 3. (i18n module) $some_i18n_variable, else 4. ...what else? Actually, that list might be upside down in practice, but hopefully I've managed to get my point across. Cached values would be returned for any function which returns theme output... like... [pseudo-code] function node_page_default() { global $user; if (node_page_default_cache($user->uid)) { return node_page_default_cache($user-uid); } else { // original function here } } ...where node_page_default_cache() would be an implementation of hook_cache(), which in turn contains that order-of-preference logic mentioned above. $user->uid gives us the ability to check roles, uids, and anything else we might need. This would involve changing *lots* of functions, I realise, but consider how many DB queries we'd be culling. On the other hand, a well-written hook allows modules to start using it incrementally - modules which haven't implemented it yet don't fail. -- ------------------------------------------- John Handelaar E john@handelaar.org T +353 21 427 9033 M +353 85 748 3790 http://handelaar.org ------------------------------------------- Work in progress: http://dev.vocalvoter.com -------------------------------------------
Hi, Remember also modules like organic groups which has no relation to roles. Maybe a caching flag on theme('page'), so the modules can determine tell the system to cache this page for this user, or groups of users. I know that we just return something, so we would need to make the return method to allow the returning of an array which is past to theme('page'); Gordon. John Handelaar wrote:
Dries Buytaert wrote:
Cache by role doesn't work because there are many per-user settings (eg. language preference, the username in the navigation-block, etc).
To be fair, I did *explicitly* mention that would be some elements which could not be cached by role.
There are, however, a whole bunch which can - I'm confident in my original assertion of 'vast majority', in fact.
That there are some places it can't work doesn't sound like a good reason to not consider the places where it can - unless we're going to keep thinking of caches as a full-page- only thing.
i18n/l10n, as you say, would probably have to be another "vary" switch. This implies that we should maybe be thinking about a new core hook, which modules can use to introduce/manage switching cases.
How about stepping back and thinking about what variables can affect caching on any given theme() function, and in which order they do so?
eg
1. (core) Role, and if not role then 2. (core) User, and if not user then 3. (i18n module) $some_i18n_variable, else 4. ...what else?
Actually, that list might be upside down in practice, but hopefully I've managed to get my point across.
Cached values would be returned for any function which returns theme output... like...
[pseudo-code]
function node_page_default() { global $user; if (node_page_default_cache($user->uid)) { return node_page_default_cache($user-uid); } else {
// original function here
} }
...where node_page_default_cache() would be an implementation of hook_cache(), which in turn contains that order-of-preference logic mentioned above. $user->uid gives us the ability to check roles, uids, and anything else we might need.
This would involve changing *lots* of functions, I realise, but consider how many DB queries we'd be culling.
On the other hand, a well-written hook allows modules to start using it incrementally - modules which haven't implemented it yet don't fail.
On 7/22/06, Dries Buytaert <dries.buytaert@gmail.com> wrote:
On 22 Jul 2006, at 03:18, Khalid B wrote:
This way, we cut down on the number of useless aliases (for that particular site). If a site restricts aliasing to $_GET['q'] of ^taxonomy/term/\d$ ^taxonomy/term\d/feed$ ^node/\d$ ^node/\d/feed$ Then menus become a non issue. They do not take up space in the database, they are not queried either.
The above pattern is enough for the majority of sites (perhaps with user/* as well). Modules like pathauto can use that to add or remove patterns too.
So we have a number of viable ideas here. Let's try to summarize them:
1. Build a caching algorithm that uses an heuristic to pre-load frequently used URL aliases.
* Advantages: transparent, no configuration required
* Disadvantages: it a heuristic, we don't know how it would perform, it might be tricky to implement, and MySQL does this implicitly (but not as aggressive).
2. Provide a textarea that allows administrators to _white_list_ URL patterns.
* Advantages: fine-grained control, easy to implement
* Disadvantages: users usually don't like messing with regular expressions, it might take a lot of effort to get the list Just Right, and it takes a certain amount of familiarity with Drupal's URL scheme (learning curve for new Drupal users). The behavior might be confusing: you add an alias, and it doesn't work because you forgot about the list of URL patterns.
3. Provide a textarea that allows administrators to _black_list_ URL patterns.
* Advantages: fine-grained control, easy to implement
* Disadvantages: users usually don't like messing with regular expressions, it might take a lot of effort to get the list Just Right, and it takes a certain amount of familiarity with Drupal's URL scheme (learning curve for new Drupal users). The behavior might be confusing: you add an alias, and it doesn't work because you forgot about the list of URL patterns.
4. Stop doing SQL queries when you cached all possible URL aliases.
* Advantages: transparent, no configuration required, can co- exist with (1), (2), (3) and (5).
* Disadvantages: only works for a subset of all Drupal sites, not a solution for larger Drupal sites.
5. Improve Drupal's high-level page caching so we have to rebuild pages less frequently.
* Advantages: no configuration required, can co-exist with (1), (2), (3) and (4), eliminates many more SQL queries.
* Disadvantages: doesn't work for authenticated users.
Thoughts?
For 2 and 3, giving the end user (non technical site admin) regexp is suboptimal. I was thinking more of a using regexp internally, perhaps via an API, and modules would provide a user interface on top of it, such as pathauto, where they can select caching for nodes, ...etc.
On Saturday 22 July 2006 10:30, Dries Buytaert wrote:
1. Build a caching algorithm that uses an heuristic to pre-load frequently used URL aliases.
* Advantages: transparent, no configuration required
* Disadvantages: it a heuristic, we don't know how it would perform, it might be tricky to implement, and MySQL does this implicitly (but not as aggressive).
I really like LRU conceptually, but I don't know how we'd implement it. If done in the database, we'd have to write the last-access-time back to the database each time an alias is accessed, doubling the number of queries (unless someone know of a portable update-on-access field in SQL?). If done in the system cache, then we're back to the patch someone already submitted (moshe, I think?). If done in the session, it would be very simple to implement but as the same person explained to me when I suggested it to him, that could get memory intensive very quickly. Another possible guideline is "precache anything that's in a menu", as that would include primary and secondary links and the majority of always-used links, but probably wouldn't be more than two dozen links on most sites. The trick here would be a fast and efficient way of defining "in a menu".
2. Provide a textarea that allows administrators to _white_list_ URL patterns.
*snip*
3. Provide a textarea that allows administrators to _black_list_ URL patterns.
* Advantages: fine-grained control, easy to implement
* Disadvantages: users usually don't like messing with regular expressions, it might take a lot of effort to get the list Just Right, and it takes a certain amount of familiarity with Drupal's URL scheme (learning curve for new Drupal users). The behavior might be confusing: you add an alias, and it doesn't work because you forgot about the list of URL patterns.
Why would you need a textarea and regexes? Just add a "pre-cache" checkbox to the edit-alias screen. Then the first time the alias lookup is called, it does a quick "SELECT ... FROM ... WHERE precache=1". That gets you what the admin thinks is the most common aliases, and both the UI and code couldn't get any simpler. Disadvantage: That's assuming the admin has any idea what the most common aliases are. :-) On the subject of black-listing, though, does anyone ever alias a path that's under admin/? The biggest drain from the aliasing now that I see is all of the queries to look up paths that aren't aliased in the first place.
4. Stop doing SQL queries when you cached all possible URL aliases.
* Advantages: transparent, no configuration required, can co- exist with (1), (2), (3) and (5).
* Disadvantages: only works for a subset of all Drupal sites, not a solution for larger Drupal sites.
Also doesn't take into account the order that the page is built. If you only have 5 aliases, but they're all primary links, those are built rather late (I think?). So the system wouldn't finish loading all aliases until it was nearly done with the page anyway.
5. Improve Drupal's high-level page caching so we have to rebuild pages less frequently.
* Advantages: no configuration required, can co-exist with (1), (2), (3) and (4), eliminates many more SQL queries.
* Disadvantages: doesn't work for authenticated users.
I'll leave this one to the cache experts. -- Larry Garfield AIM: LOLG42 larry@garfieldtech.com ICQ: 6817012 "If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea, which an individual may exclusively possess as long as he keeps it to himself; but the moment it is divulged, it forces itself into the possession of every one, and the receiver cannot dispossess himself of it." -- Thomas Jefferson
On 22 Jul 2006, at 19:32, Larry Garfield wrote:
On Saturday 22 July 2006 10:30, Dries Buytaert wrote:
1. Build a caching algorithm that uses an heuristic to pre-load frequently used URL aliases.
* Advantages: transparent, no configuration required
* Disadvantages: it a heuristic, we don't know how it would perform, it might be tricky to implement, and MySQL does this implicitly (but not as aggressive).
I really like LRU conceptually, but I don't know how we'd implement it. If done in the database, we'd have to write the last-access-time back to the database each time an alias is accessed, doubling the number of queries (unless someone know of a portable update-on-access field in SQL?).
You'd only to that once in a while ... like, at most once every 10 minutes, you'd update the 'frequency counter' (or whatnot).
Why would you need a textarea and regexes? Just add a "pre-cache" checkbox to the edit-alias screen. Then the first time the alias lookup is called, it does a quick "SELECT ... FROM ... WHERE precache=1". That gets you what the admin thinks is the most common aliases, and both the UI and code couldn't get any simpler.
That would solve the 'user complexity' problem.
Disadvantage: That's assuming the admin has any idea what the most common aliases are. :-)
That might actually be a very fair assumption. :)
On the subject of black-listing, though, does anyone ever alias a path that's under admin/? The biggest drain from the aliasing now that I see is all of the queries to look up paths that aren't aliased in the first place.
I don't but - apparently - people who localize Drupal also choose to localize admin/* URLs. I think that, by now, we all agree on the fact that we can't make any assumption about how people use the URL alias functionality. Let's keep that fact in mind during the remainder of this discussion.
4. Stop doing SQL queries when you cached all possible URL aliases.
* Advantages: transparent, no configuration required, can co- exist with (1), (2), (3) and (5).
* Disadvantages: only works for a subset of all Drupal sites, not a solution for larger Drupal sites.
Also doesn't take into account the order that the page is built. If you only have 5 aliases, but they're all primary links, those are built rather late (I think?). So the system wouldn't finish loading all aliases until it was nearly done with the page anyway.
That is correct. It would only stop executing SQL queries once drupal_lookup_path()'s local cache is 'complete'. If this is a simple change (changing a couple lines of code), this might be well worth implementing. If it gets tricky, this optimization probably isn't worth it. It's worth investigating. In a default Drupal install there is exactly one URL alias: modules/system/system.install: $ret[] = update_sql("INSERT INTO {url_alias} (src, dst) VALUES ('node/feed', 'rss.xml')"); modules/system/system.install: $ret[] = update_sql("INSERT INTO {url_alias} (src, dst) VALUES ('rss.xml', 'node/feed')"); $count = db_result(db_query('SELECT COUNT(pid) FROM {url_alias}')); Combined with the above SQL query taken from drupal_lookup_path(), this means that the check "$count > 0" in drupal_lookup_path() will _always_ fail, and that we always query the database for each URL. If $count == 0, Drupal will not query the url_alias table. But unless you actually remove that one alias, you'll never take advantage of that check. As an extra optimization, I suggest that we get rid of that rss.xml URL alias, so that by default, Drupal doesn't generates hundreds of SQL queries, AND that for people who don't use path aliases, drupal_lookup_path() will never trigger any SQL queries. To get rid of the path alias, we could hardcode 'rss.xml' in the code (rather than much uglier 'node/feed') and extend the legacy.module to make the old URL work. It's also not clear why we insert two aliases ... looks like a bug in system.install. I'm wondering -- before we continue discussion this for hours and hours -- is someone willing to work on this? We've come at a point where some things could actually be implemented/fixed. -- Dries Buytaert :: http://www.buytaert.net/
Dries Buytaert wrote:
On 22 Jul 2006, at 19:32, Larry Garfield wrote:
On Saturday 22 July 2006 10:30, Dries Buytaert wrote:
1. Build a caching algorithm that uses an heuristic to pre-load frequently used URL aliases.
* Advantages: transparent, no configuration required
* Disadvantages: it a heuristic, we don't know how it would perform, it might be tricky to implement, and MySQL does this implicitly (but not as aggressive).
I really like LRU conceptually, but I don't know how we'd implement it. If done in the database, we'd have to write the last-access-time back to the database each time an alias is accessed, doubling the number of queries (unless someone know of a portable update-on-access field in SQL?).
You'd only to that once in a while ... like, at most once every 10 minutes, you'd update the 'frequency counter' (or whatnot).
Why would you need a textarea and regexes? Just add a "pre-cache" checkbox to the edit-alias screen. Then the first time the alias lookup is called, it does a quick "SELECT ... FROM ... WHERE precache=1". That gets you what the admin thinks is the most common aliases, and both the UI and code couldn't get any simpler.
That would solve the 'user complexity' problem.
Disadvantage: That's assuming the admin has any idea what the most common aliases are. :-)
That might actually be a very fair assumption. :)
On the subject of black-listing, though, does anyone ever alias a path that's under admin/? The biggest drain from the aliasing now that I see is all of the queries to look up paths that aren't aliased in the first place.
I don't but - apparently - people who localize Drupal also choose to localize admin/* URLs. I think that, by now, we all agree on the fact that we can't make any assumption about how people use the URL alias functionality. Let's keep that fact in mind during the remainder of this discussion.
4. Stop doing SQL queries when you cached all possible URL aliases.
* Advantages: transparent, no configuration required, can co- exist with (1), (2), (3) and (5).
* Disadvantages: only works for a subset of all Drupal sites, not a solution for larger Drupal sites.
Also doesn't take into account the order that the page is built. If you only have 5 aliases, but they're all primary links, those are built rather late (I think?). So the system wouldn't finish loading all aliases until it was nearly done with the page anyway.
That is correct. It would only stop executing SQL queries once drupal_lookup_path()'s local cache is 'complete'. If this is a simple change (changing a couple lines of code), this might be well worth implementing. If it gets tricky, this optimization probably isn't worth it. It's worth investigating.
In a default Drupal install there is exactly one URL alias:
modules/system/system.install: $ret[] = update_sql("INSERT INTO {url_alias} (src, dst) VALUES ('node/feed', 'rss.xml')"); modules/system/system.install: $ret[] = update_sql("INSERT INTO {url_alias} (src, dst) VALUES ('rss.xml', 'node/feed')");
$count = db_result(db_query('SELECT COUNT(pid) FROM {url_alias}'));
Combined with the above SQL query taken from drupal_lookup_path(), this means that the check "$count > 0" in drupal_lookup_path() will _always_ fail, and that we always query the database for each URL. If $count == 0, Drupal will not query the url_alias table. But unless you actually remove that one alias, you'll never take advantage of that check.
As an extra optimization, I suggest that we get rid of that rss.xml URL alias, so that by default, Drupal doesn't generates hundreds of SQL queries, AND that for people who don't use path aliases, drupal_lookup_path() will never trigger any SQL queries. To get rid of the path alias, we could hardcode 'rss.xml' in the code (rather than much uglier 'node/feed') and extend the legacy.module to make the old URL work.
It's also not clear why we insert two aliases ... looks like a bug in system.install.
I'm wondering -- before we continue discussion this for hours and hours -- is someone willing to work on this? We've come at a point where some things could actually be implemented/fixed.
Please look at http://drupal.org/node/40860 Many of you have looked at it, but it's been awhile. I wrote this patch six months ago, because I identified the url alias query as an obvious optimization problem. The patch I wrote for this is quite a bit simpler than what Dries is proposing here, but I found that it worked extremely well, and really reduced the url_alias queries that were being used. Most of the url_alias queries were coming from the menu system (surprise). In my patch, the system is vaguely heuristic -- it datestamps the alias cache and when the cache is 'full' it removes items based on what hasn't been checked recently. All the patch needs is some code cleanup (there were some objections to my use of a function reference, though I think it's perfectly valid, and variable naming) and it needs to update the cache when url aliases are actually changed in path.module. All of these are easy. The hard part is probably updating the patch to HEAD, but even that may not be all that bad, I've never tried.
Regarding pre-caching of aliases, the problem is that this does not scale. Up until 4.6 we read all the aliases in one query, and the expensive part was not the SQL query but the building of the associative array for sites with a large number of aliases. If we cache a large number of aliases, I gues they have to be serialized/unserialized, and then we will be back to the same problem as before, that was solved by individual queries.
Khalid B wrote:
Regarding pre-caching of aliases, the problem is that this does not scale. Up until 4.6 we read all the aliases in one query, and the expensive part was not the SQL query but the building of the associative array for sites with a large number of aliases.
That's why my patch caches only a limited number of aliases, not the entire tree.
On Saturday 22 July 2006 12:57, Dries Buytaert wrote:
I really like LRU conceptually, but I don't know how we'd implement it. If done in the database, we'd have to write the last-access-time back to the database each time an alias is accessed, doubling the number of queries (unless someone know of a portable update-on-access field in SQL?).
You'd only to that once in a while ... like, at most once every 10 minutes, you'd update the 'frequency counter' (or whatnot).
And otherwise you'd keep it where? -- Larry Garfield AIM: LOLG42 larry@garfieldtech.com ICQ: 6817012 "If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea, which an individual may exclusively possess as long as he keeps it to himself; but the moment it is divulged, it forces itself into the possession of every one, and the receiver cannot dispossess himself of it." -- Thomas Jefferson
In a default Drupal install there is exactly one URL alias:
modules/system/system.install: $ret[] = update_sql("INSERT INTO {url_alias} (src, dst) VALUES ('node/feed', 'rss.xml')"); modules/system/system.install: $ret[] = update_sql("INSERT INTO {url_alias} (src, dst) VALUES ('rss.xml', 'node/feed')");
I think the lines that you quote are from old updates. A fresh site does not have any url aliases anymore. see http://drupal.org/node/29326.
On Jul 11, 2006, at 3:01 PM, Gerhard Killesreiter wrote:
Luckily, the problems we've had (namely the CPU on the db server being saturated) were resolved by more mundane means (optimze your database on occassion...), but since we continue to experience a growing number of visitors on drupal.org, better caching is no waste of time.
http://api.drupal.org/api/HEAD/function/drupal_page_header does not issue an authenticated header. If it did, we could use Apache 2.2 cache in memory or file caching. I suspect that serving cached pages from an apache cache in memory will avoid both a Apache HTTPD thread with PHP running at 20-30MB as well as avoid MySQL thread which can be massive if large buffers are configured for(frequently unnecessarily for both MyISAM and InnoDB by default) MySQL. This seems the cheapest an most effective place to cache short of getting another server for a Squid proxy. http://httpd.apache.org/docs/2.2/caching.html In the "What can be cached section?" # If the request contains an "Authorization:" header, the response will not be cached. # If the response contains an "Authorization:" header, it must also contain an "s-maxage", "must-revalidate" or "public" option in the "Cache-Control:" header. This should be relatively straight forward. Anyone sitting on a patch for this already? Anyone care to share experience using Apache 2.2 caching? Cheers, Kieran
participants (15)
-
Darrel O'Pry -
Derek Wright -
Dries Buytaert -
Dries Buytaert -
Earl Miles -
Gerhard Killesreiter -
Gordon Heydon -
Jakub Suchy -
John Handelaar -
Khalid B -
Kieran Lal -
Larry Garfield -
Moshe Weitzman -
Richard Eriksson -
Stefan Nagtegaal