[drupal-devel] caching issues
jeremy at kerneltrap.org
Wed Mar 9 03:49:51 UTC 2005
I ran into a problem today with the way Drupal caches data. A spam bot
has been crawling my site for the past 36 hours or so, posting dozens of
comments every minute (frequently several a second). Combined with normal
traffic, my site was serving 300-400 pages every 60 seconds. Because the
spam comments were being posted at such a high rate, the cache was being
flushed too quickly to do any good. I may as well have disabled the
cache. The site became sluggish. (It has handled that large a load
before, just not with comments being posted so quickly.)
I am planning to patch my site to modify how the cache is flushed, and
perhaps to work again on file-based caching. No matter how optimized a
CMS, there will come a time when the limitations of the hardware prevent a
site from updating in real time. Large websites (e.g. Slashdot) have to
rebuild their caches every n minutes, instead of every time a new comment
is posted. Like it or not, this is significantly more efficient.
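
To put rough numbers on that claim, here is a small sketch in Python (not
Drupal code). It models a single cached page and counts how often it has to
be rebuilt in one minute under each policy. The traffic figures (5 views and
2 posts per second) and the 5-minute flush interval are assumptions chosen to
resemble the load described above, and count_rebuilds is a hypothetical
helper, not anything in core.

```python
def count_rebuilds(events, flush_interval_ms=None):
    """Count cache rebuilds for one page.

    events: list of (time_ms, kind) with kind "view" or "post".
    flush_interval_ms=None models the current behaviour (every post
    flushes the cache); a number models flushing every n milliseconds
    and ignoring posts in between.
    """
    cached_at = None  # when the cached copy was built; None = nothing cached
    rebuilds = 0
    for t, kind in sorted(events):
        if kind == "post":
            if flush_interval_ms is None:
                cached_at = None  # flush-on-post: the cached copy is discarded
            continue
        expired = (
            flush_interval_ms is not None
            and cached_at is not None
            and t - cached_at >= flush_interval_ms
        )
        if cached_at is None or expired:
            rebuilds += 1  # this view has to regenerate the page
            cached_at = t
    return rebuilds


# Assumed load for one minute: a view every 200 ms, a post every 500 ms.
views = [(t, "view") for t in range(0, 60_000, 200)]
posts = [(t, "post") for t in range(0, 60_000, 500)]
events = views + posts

on_post = count_rebuilds(events)                         # current behaviour
timed = count_rebuilds(events, flush_interval_ms=300_000)  # flush every 5 min
print(on_post, timed)
```

Under these assumed numbers the flush-on-post policy rebuilds the page 120
times in the minute, while the timed policy rebuilds it once. A real site
flushes many pages at a time, so the gap would presumably be wider.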
I would like to target the effort for core inclusion into 4.7. Thus, I
would like to brainstorm now and work out an acceptable design. If nobody
is interested and this has no chance of getting into 4.7, I'll do it on my
own anyway, but I'd much prefer to get something into core so I don't have
to redo it with every release.
Here are some proposals. I personally would like to see one or more of
these available _in addition_ to the current method, i.e., most sites would
leave the cache as we're all used to. Busier sites would enable one of
these alternative caching mechanisms (in order of coding complexity):
0) Current Drupal caching. What is in the cache is always valid. If new
content is posted, the cache is potentially invalid so it is flushed and
everything is rebuilt. (Some stuff sticks around, but that stuff will be
unmodified by my proposals.)
1) Time-based caching. Simply flush the cache every n minutes. When new
content is posted, a message such as "your comment will become visible in
n minutes" would need to be displayed. (This would have saved me most
recently with the spambot problem I had today...)
2) Fuzzy time-based caching. Patches against 4.2 exist in CVS if you want
to see what I'm referring to (patches apply in order: 1, then 2, then 3...).
It's similar to idea #1, but slightly more complicated. The cache becomes
"dirty" every n minutes. When a "dirty" cache page is requested, it may
or may not be rebuilt by the requester (a call to random makes the
determination). If after n+x minutes the cache entry still hasn't been
rebuilt, it is flushed (forcing a rebuild). (The idea is to soften the
effect of flushing the cache. In example 1, there will be a CPU spike
every n minutes. In example 2, the CPU load is distributed randomly.) In
other words, there's a soft timeout, and a hard timeout. After the soft
timeout, the cache entry may be rebuilt. After the hard timeout, the
cache entry has to be rebuilt.
3) File-based caching. Patches against 4.0 exist in CVS if you want to see
what I'm referring to. The simplest mechanism would be like #1 above, but with
files stored in the filesystem instead of in the database. When I
utilized this in 4.0, the performance boost was phenomenal. Additionally,
the site could continue to serve pages with the database stopped.
4) Fuzzy file-based caching. This is actually how I implemented
file-based caching against 4.0 long ago. If you got this far and
understood examples 1, 2, and 3, then no further explanation is needed.
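
For anyone who prefers code to prose, here is a minimal sketch of the soft
and hard timeout idea from #2, stored in files as in #3 and #4. It is
illustrative Python, not the 4.0 or 4.2 patches in CVS, and every name in it
(FuzzyFileCache, rebuild_chance, the clock and rng hooks) is hypothetical.
It assumes one file per page and uses the file's modification time as the
time the entry was built.

```python
import hashlib
import os
import random
import time
from pathlib import Path


class FuzzyFileCache:
    """File-backed page cache with a soft and a hard timeout (sketch only)."""

    def __init__(self, directory, soft_timeout, hard_timeout,
                 rebuild_chance=0.1, clock=time.time, rng=random.random):
        self.directory = Path(directory)
        self.soft_timeout = soft_timeout      # the "n minutes", in seconds
        self.hard_timeout = hard_timeout      # the "n+x minutes", in seconds
        self.rebuild_chance = rebuild_chance  # odds a request rebuilds a dirty page
        self.clock = clock                    # injectable so it can be tested
        self.rng = rng

    def _path(self, key):
        # Hash the key so any URL maps to a safe file name.
        return self.directory / hashlib.sha1(key.encode("utf-8")).hexdigest()

    def get(self, key, build):
        """Return the cached page for key, calling build() when a rebuild is due."""
        path = self._path(key)
        try:
            age = self.clock() - path.stat().st_mtime
        except FileNotFoundError:
            return self._rebuild(path, build)   # never cached
        if age >= self.hard_timeout:
            return self._rebuild(path, build)   # hard timeout: must rebuild
        if age >= self.soft_timeout and self.rng() < self.rebuild_chance:
            return self._rebuild(path, build)   # dirty: this requester rebuilds
        return path.read_text(encoding="utf-8") # fresh, or dirty but served anyway

    def _rebuild(self, path, build):
        page = build()
        tmp = path.with_suffix(".tmp")
        tmp.write_text(page, encoding="utf-8")
        os.replace(tmp, path)                   # swap in the new copy atomically
        now = self.clock()
        os.utime(path, (now, now))              # record the build time as mtime
        return page
```

Setting soft_timeout equal to hard_timeout turns this into the plain
time-based flush of #1 and #3, since the random branch is then never reached.
The database is only touched inside build(), so pages that are still within
the hard timeout can be served with the database stopped.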
Thoughts? Feedback? Suggestions? Dries?
I'll work up patches. But if someone has better design ideas, now is a
good time to suggest them.