[drupal-devel] Drupal.org cache statistics
Hello world,

In November last year, I profiled drupal.org's cache behavior. Yesterday, Moshe asked me to profile it again so we could evaluate the usefulness of Jeremy's "loose caching" mechanism.

Over the past 20 hours, I logged 93,000 unique page requests using the patch at http://buytaert.net/temporary/cache-statistics.patch. Loose caching was enabled.

Results:

1. Last year we found that authenticated users were responsible for 15.8% of all page views. A year later, we see that authenticated users are responsible for 14.9% of all page views.

2. Last year we found that only 27.9% of the page requests actually benefited from the cache system. That is, for more than two thirds of the page requests, we had to generate a page dynamically. A year later, using "loose caching" rather than "strict caching", we see that 30.7% of the page requests benefit from the cache system. Read: we still have a lot of page cache misses! :(

3. Last year, we found that the cache got flushed once every 207 page requests. A year later, we observe that the cache got flushed once every 190 page requests.

We conclude that:

1. Loose caching does not significantly -- or not necessarily -- improve the behavior of drupal.org's page cache (though I'd like to believe that it does when there are sudden traffic spikes/bursts).

2. When writing code, we can NOT assume that a page will benefit from being cached.

-- Dries Buytaert :: http://www.buytaert.net/
Thanks Dries.

Wow. This is disappointing and a bit confusing. How can we interpret the seemingly contradictory changes described in 2 vs. 3?

I have to believe we can get more than 30% cache hits when our user base is 85% anonymous. Part of me wonders if we are getting skewed results because of web crawlers. Those crawlers spider all of our obscure pages and thus cause cache misses.

I agree that we need to work on the non-cached case.

-moshe
I assume that we are talking about Jeremy's patches.

Jeremy, you must have tried this new caching on kerneltrap.org. Did you do any comparison of before and after to see how it fares? Does that fit with Dries' findings?

Perhaps two different sites mean two different usage patterns. I don't know, but it is weird.
On Thu, 26 May 2005, Dries Buytaert wrote:
> We conclude that:
>
> 1. Loose caching does not significantly -- or not necessarily -- improve the behavior of drupal.org's page cache (though I'd like to believe that it does when there are sudden traffic spikes/bursts).
>
> 2. When writing code, we can NOT assume that a page will benefit from being cached.
My conclusion is that our cache needs to be more finely grained (right, that's not new).

For example: Comment.module calls cache_clear_all() after a comment has been added. Granted, thanks to our block system it can happen that some comment (or comment count or "last updated" ...) is displayed on just any page. But I'd prefer modules that provide such blocks to build their own block cache and invalidate cached pages according to the blocks' path settings.

That should not be too difficult. The same should be done for blocks that display content based on newly created nodes. Those modules should take the cache setting (none, loose, strict) into account.

For comment.module that would mean that it only sets variables (comment_last_timestamp, comment_last_uid, comment_last_nid, comment_last_cid) to new values instead of invalidating the cache. forum.module would then check the variable against its own variable (forum_last_time_page_rebuilt) and invalidate pages as appropriate.

Same for tracker and other modules that deal with comments in one way or the other.

Generally, we should investigate whether a page cache makes sense for a community site at all. For Drupal.org it might make more sense to build up a page from pre-cached pieces of content (nodes, blocks, ...) than to just deliver a complete page from the cache. There is just too much new content added for a global cache to be useful.

Cheers,
Gerhard
On Thu, 26 May 2005, Gerhard Killesreiter wrote:
> My conclusion is that our cache needs to be more finely grained (right, that's not new).
>
> For example: Comment.module calls cache_clear_all() after a comment has been added. Granted, thanks to our block system it can happen that some comment (or comment count or "last updated" ...) is displayed on just any page. But I'd prefer modules that provide such blocks to build their own block cache and invalidate cached pages according to the blocks' path settings.

Note that the block caching is not a prerequisite; we could do without. Most blocks would need per-user caches.

> That should not be too difficult. The same should be done for blocks that display content based on newly created nodes. Those modules should take the cache setting (none, loose, strict) into account.
>
> For comment.module that would mean that it only sets variables (comment_last_timestamp, comment_last_uid, comment_last_nid, comment_last_cid) to new values instead of invalidating the cache. forum.module would then check the variable against its own variable (forum_last_time_page_rebuilt) and invalidate pages as appropriate.

That's of course nonsense; the modules would need to implement the comment API.

> Same for tracker and other modules that deal with comments in one way or the other.

Most modules that display nodes or listings of nodes deal with comments too, by showing "n new comments". Nodes can be shown in multiple contexts:

  /node/nid
  /foobar/nid (not often)
  /node
  /tracker
  /forum
  /taxonomy/....

I propose that we let comment module clear only /node/nid and let modules that show nodes in other contexts delete the appropriate pages. Deleting taxonomy pages might be a bit difficult; I haven't checked yet. If it is too difficult to find all pages that should be deleted, we could adopt a different policy for them, i.e. only delete all taxonomy pages after the nth comment has been committed.

Cheers,
Gerhard
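The proposal above (comment module clears only /node/nid, and every other module clears the pages it is responsible for) can be sketched roughly as follows. This is a minimal illustration in Python rather than Drupal code: the PageCache class, the listener registration, and the tracker_paths and forum_paths helpers are all hypothetical names invented for this sketch, not part of Drupal's cache or comment API.

```python
class PageCache:
    """Toy in-memory page cache keyed by path (hypothetical, not Drupal's API)."""

    def __init__(self):
        self._pages = {}
        self._comment_listeners = []

    def set(self, path, html):
        self._pages[path] = html

    def get(self, path):
        return self._pages.get(path)

    def clear(self, path):
        # Remove a single cached page; ignore paths that are not cached.
        self._pages.pop(path, None)

    def register_comment_listener(self, listener):
        # A listener takes a node id and returns the paths it wants cleared.
        self._comment_listeners.append(listener)

    def comment_added(self, nid):
        # The comment module itself only clears the node's own page ...
        self.clear(f"/node/{nid}")
        # ... and each interested module clears the pages it knows about.
        for listener in self._comment_listeners:
            for path in listener(nid):
                self.clear(path)


def tracker_paths(nid):
    # Hypothetical: the tracker listing shows comment counts for recent nodes.
    return ["/tracker"]


def forum_paths(nid):
    # Hypothetical: the forum overview shows "n new comments" per topic.
    return ["/forum"]


cache = PageCache()
all_paths = ("/node/1", "/node/2", "/tracker", "/forum", "/about")
for path in all_paths:
    cache.set(path, f"<html>{path}</html>")

cache.register_comment_listener(tracker_paths)
cache.register_comment_listener(forum_paths)

# A new comment on node 1 leaves unrelated pages in the cache.
cache.comment_added(1)
still_cached = sorted(p for p in all_paths if cache.get(p) is not None)
print(still_cached)
```

In this sketch a comment on node 1 evicts three pages and leaves the rest alone, whereas a global cache_clear_all() would evict all five.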
> any page. But I'd prefer modules that provide such blocks to build their own block cache and invalidate cached pages according to the blocks' path settings. That should not be too difficult. The same should be done

Block system renewal as per http://drupal.org/node/16216#comment-29282 is on its way. Bear with me. Patience please. Cache will be included.

Regards
NK
I haven't heard any new caching ideas for a while, so I sat down and brainstormed a bit. I began by re-phrasing Dries's statistics:

Assuming that caching applies only to anonymous visitors for the sake of approximation, these numbers tell me that 35% of all anonymous requests (30 / 85) are identical to one of the last 170 anonymous requests (85% of 200). Then, every 2.5 minutes or so (200 pages / 93,000 pages * 20 hours), the entire cache is reset. But roughly 65% of the anonymous requests (55 / 85) are unique within a 2.5-minute span, which means that at least 65% of drupal.org's pages wait more than 2.5 minutes between anonymous page requests.

So the goal is obviously to extend the lifetime of the cache for anonymous visitors. Wiping the entire cache seems overkill, but what else can be done when every page may contain dynamic information? Other people have said this already.

So what options do we have to improve caching performance?

1) Perhaps we could use statistics, like a Poisson distribution, to predict how much time is likely to pass between requests to a certain page given the activity we recently recorded [1]. If the predicted wait is greater than 2.5 minutes, don't immediately clear the cache for that page. Instead, wait until there's at least a 50% chance someone will look again. After all, it's likely that no one is looking at it, so why keep the page up-to-date?

2) Wait more than 2.5 minutes before clearing the cache for anonymous users. If most of the visitors have authenticated, this scheme won't help much.

3) Cache page elements as other people have suggested. Assuming most of the effort to create a page is spent rendering page elements, this scheme should work too.

Feel free to point out mistakes!

Nic

----
[1] Using a Poisson distribution, there is a 50% chance of waiting less than t seconds between requests, where

    t = -1 * number_of_seconds_in_sample_period / number_of_requests_in_sample_period * log(.5)
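The back-of-the-envelope figures and the footnote formula in the message above can be reproduced with a short Python sketch. The numbers are the rounded ones used in the message; the variable and function names are made up for illustration, and log is taken to be the natural logarithm, which is what the median of exponentially distributed waiting times requires. The example page with 10 requests is an assumed value, not a measurement from drupal.org.

```python
import math

# Rounded figures from the thread; names are illustrative only.
total_requests = 93_000      # page requests logged
sample_hours = 20            # length of the logging window
requests_per_flush = 200     # rounded from the observed 190
anonymous_share = 0.85       # rounded share of anonymous page views
hit_share = 0.30             # rounded share of all requests served from cache

sample_seconds = sample_hours * 3600

# Average time between two full cache flushes, in minutes.
flush_interval_minutes = requests_per_flush / total_requests * sample_hours * 60

# Cache hit rate if only anonymous requests can be served from the cache.
anonymous_hit_rate = hit_share / anonymous_share


def median_wait_seconds(seconds_in_sample, requests_in_sample):
    """Time t such that there is a 50% chance the next request arrives
    within t seconds, assuming Poisson arrivals (footnote [1])."""
    return -1 * seconds_in_sample / requests_in_sample * math.log(0.5)


# Median gap between any two requests to the site as a whole.
site_median = median_wait_seconds(sample_seconds, total_requests)

# Median gap for a hypothetical page requested only 10 times in the window.
rare_page_median = median_wait_seconds(sample_seconds, 10)

print(f"flush interval: {flush_interval_minutes:.2f} minutes")
print(f"anonymous hit rate: {anonymous_hit_rate:.1%}")
print(f"site-wide median wait: {site_median:.2f} seconds")
print(f"rare page median wait: {rare_page_median / 60:.1f} minutes")
```

Under these assumptions, a page requested only a handful of times per day has a median wait far longer than the interval between cache flushes, which is the intuition behind idea 1.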
Caching idea #4: Use feedback to control the maximum number of non-cached pages per minute by adjusting the cache delay from idea #2 in my previous message. Site administrators could define the maximum; when site load passes that threshold, the cache increases its delay until the number of rendered pages is no more than the maximum setting. Of course, when the load is low, act normally -- don't try to increase the load. ;-)

Nic
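Idea #4 amounts to a simple feedback loop, which might look something like the Python sketch below. Everything here is an assumption made for illustration: the function name, the fixed 30-second step, the 10-minute ceiling, and the sample load figures are not from the thread, and a real controller would probably need smoothing to avoid oscillation.

```python
def adjust_cache_delay(current_delay, rendered_per_minute, max_rendered_per_minute,
                       base_delay=0.0, step=30.0, max_delay=600.0):
    """Return the new cache-flush delay in seconds.

    Hypothetical controller: lengthen the delay while the site renders more
    non-cached pages per minute than the administrator allows, and relax it
    back toward the normal (base) delay when the load is low.
    """
    if rendered_per_minute > max_rendered_per_minute:
        return min(current_delay + step, max_delay)
    return max(current_delay - step, base_delay)


# Simulated load: non-cached pages rendered in each successive minute.
observed_load = [50, 120, 130, 90, 40]
max_allowed = 100

delay = 0.0
history = []
for rendered in observed_load:
    delay = adjust_cache_delay(delay, rendered, max_allowed)
    history.append(delay)

print(history)
```

The delay only grows while the threshold is exceeded and returns to the base value once the load drops, matching the "when the load is low, act normally" remark.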
I'm pretty sure I got this wrong, particularly the part about clearing the cache before the page is reloaded. That defeats the purpose of a cache. Hmm ...

Nic

On May 26, 2005, at 6:06 PM, Nicholas Ivy wrote:
> So what options do we have to improve caching performance?
>
> 1) Perhaps we could use statistics, like a Poisson distribution, to predict how much time is likely to occur between requests to a certain page given the activity we recently recorded [1]. If the predicted wait is greater than 2.5 minutes, don't immediately clear the cache for that page. Instead, wait until there's at least a 50% chance someone will look again. After all, it's likely that no one is looking at it, so why keep the page up-to-date?
> 1. Last year we found that authenticated users were responsible for 15.8% of all page views. A year later, we see that authenticated users are responsible for 14.9% of all page views.
>
> 2. Last year we found that only 27.9% of the page requests actually benefited from the cache system. That is, for more than two thirds of the page requests, we had to generate a page dynamically. A year later, using "loose caching" rather than "strict caching", we see that 30.7% of the page requests benefit from the cache system. Read: we still have a lot of page cache misses! :(

You have individual numbers for anonymous / authenticated, and for cached / not cached. But authenticated users do not get cached pages. Is that 27.9% / 30.7% figure the share of cache hits among anonymous users only, or among all users in general?

If it is for all users in general, then we can easily calculate the real page cache hit rate:

    27.9 / (1 - 0.158) = 33.13%
    30.7 / (1 - 0.149) = 36.07%

If my assumption is correct, these numbers no longer depend on how many anonymous / authenticated users we have. Not much difference; we can assume our distribution of anonymous users did not change much.

Steven Wittens
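The correction above is easy to check with a few lines of Python. This sketch assumes, as the message does, that the reported hit percentages are measured over all requests and that only anonymous requests can ever be served from the page cache; the function name is invented for illustration.

```python
def anonymous_hit_rate(overall_hit_pct, authenticated_share):
    """Cache hit percentage among anonymous requests only.

    Assumes overall_hit_pct is measured over all requests and that
    authenticated requests never hit the page cache.
    """
    return overall_hit_pct / (1 - authenticated_share)


last_year = anonymous_hit_rate(27.9, 0.158)
this_year = anonymous_hit_rate(30.7, 0.149)

print(f"last year: {last_year:.2f}%")
print(f"this year: {this_year:.2f}%")
print(f"change: {this_year - last_year:.2f} percentage points")
```

Either way the figures are read, the anonymous hit rate stays in the mid-thirties, so roughly two out of three anonymous requests still miss the cache.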
participants (7)
- Dries Buytaert
- Gerhard Killesreiter
- K B
- Karoly Negyesi
- Moshe Weitzman
- Nicholas Ivy
- Steven Wittens