[drupal-devel] caching issues
Hello,

I ran into a problem with the way Drupal caches data today. A spam bot has been crawling my site for the past 36 hours or so, posting dozens of comments every minute (frequently several a second). Combined with normal traffic, my site was serving 300-400 pages every 60 seconds. Because the spam comments were being posted at such a high speed, the cache was being flushed too quickly to do any good. I may as well have disabled the cache. The site became sluggish. (It has handled that large a load before, just not with comments being posted so quickly.)

I am planning to patch my site to modify how the cache is flushed, and perhaps to work again on file-based caching. No matter how optimized a CMS, there will come a time when the limitations of the hardware prevent a site from updating in real time. Large websites (e.g. Slashdot) have to rebuild their caches every n minutes, instead of every time a new comment is posted. Like it or not, this is significantly more efficient.

I would like to target the effort for core inclusion into 4.7. Thus, I would like to brainstorm now and work out an acceptable design. If nobody is interested and this has no chance of getting into 4.7, I'll do it on my own anyway, but I'd much prefer to get something into core so I don't have to redo it with every release.

Here are some proposals. I personally would like to see one or more of these available _in addition_ to the current method. That is, most sites would leave the cache as we're all used to. Busier sites would enable one of these alternative caching mechanisms (in order of coding complexity):

0) Current Drupal caching. What is in the cache is always valid. If new content is posted, the cache is potentially invalid so it is flushed and everything is rebuilt. (Some stuff sticks around, but that stuff will be unmodified by my proposals.)

1) Time-based caching. Simply flush the cache every n minutes. When new content is posted, a message such as "your comment will become visible in n minutes" would need to be displayed. (This would have saved me most recently with the spambot problem I had today...)

2) Fuzzy time-based caching. Patches against 4.2 exist in CVS [1] if you want to see what I'm referring to (patches apply in order: 1, then 2, then 3...). It's similar to idea #1, but slightly more complicated. The cache becomes "dirty" every n minutes. When a "dirty" cache page is requested, it may or may not be rebuilt by the requester (a call to random makes the determination). If after n+x minutes the cache entry still hasn't been rebuilt, it is flushed (forcing a rebuild). (The idea is to soften the effect of flushing the cache. In example 1, there will be a CPU spike every n minutes. In example 2, the CPU load is distributed randomly.) In other words, there's a soft timeout and a hard timeout. After the soft timeout, the cache entry may be rebuilt. After the hard timeout, the cache entry has to be rebuilt.

3) File-based caching. Patches against 4.0 exist in CVS [2] if you want to see what I'm referring to. The simplest mechanism would be like #1 above, but with files stored in the filesystem instead of in the database. When I utilized this in 4.0, the performance boost was phenomenal. Additionally, the site could continue to serve pages with the database stopped.

4) Fuzzy file-based caching. This is actually how I implemented file-based caching against 4.0 long ago. If you got this far and understood examples 1, 2, and 3, then no further explanation is needed here.

Thoughts? Feedback? Suggestions? Dries?

I'll work up patches. But if someone has better design ideas, now is a good time to suggest them.

Thanks,
-Jeremy

[1] http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/jeremy/4.2.0/cach...
[2] http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/jeremy/4.0.0/file...
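For readers following along, the soft and hard timeouts of proposal 2 can be illustrated with a small sketch. This is not the code from the CVS patches referenced above, and it is written in Python rather than PHP purely for illustration. The class name `FuzzyCache`, the `rebuild_chance` parameter, and the injectable `clock` and `rng` arguments are all assumptions made for this example.

```python
import random
import time


class FuzzyCache:
    """Cache with a soft timeout (may rebuild) and a hard timeout (must rebuild)."""

    def __init__(self, soft_ttl, hard_ttl, rebuild_chance=0.1,
                 clock=time.time, rng=random.random):
        if hard_ttl < soft_ttl:
            raise ValueError("hard_ttl must be >= soft_ttl")
        self.soft_ttl = soft_ttl          # "n minutes": the entry becomes dirty
        self.hard_ttl = hard_ttl          # "n+x minutes": the entry must be rebuilt
        self.rebuild_chance = rebuild_chance  # hypothetical tuning knob
        self.clock = clock
        self.rng = rng
        self.entries = {}                 # key -> (created_at, value)

    def get(self, key, build):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None:
            created_at, value = entry
            age = now - created_at
            if age < self.soft_ttl:
                return value              # fresh: always served from the cache
            if age < self.hard_ttl and self.rng() >= self.rebuild_chance:
                return value              # dirty, but this requester lost the draw
        # Missing, past the hard timeout, or picked to rebuild a dirty entry.
        value = build()
        self.entries[key] = (now, value)
        return value
```

With a small `rebuild_chance`, rebuilds of dirty entries are spread over the window between the two timeouts instead of all landing at the same moment, which is the CPU-spike softening described in proposal 2.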
Thoughts? Feedback? Suggestions? Dries?
Dries and I did a bunch of cache tests a while ago; the conclusions were posted on the list. I think it was shortly after the 4.5 release. Various ideas for keeping the cache alive longer were posted as well. Ones I can remember:

- Enforcing a minimum cache lifetime for pages is pretty easy with the timestamp/expiration parameter for cache.
- It is important that a user sees his/her changes reflected immediately, otherwise they might think an error occurred and post twice. Possible solution: disable caching for a user's session as soon as they have posted something. For your case this would mean the spambot gets fresh pages all the time, but the rest don't.
- Clearing the cache selectively is difficult because sidebar blocks like "active forum topics" change easily. Still, clearing out the "main" page for a certain item (e.g. a node view) is doable.
- The cache has a much higher miss rate than expected, on drupal.org at least, because the site is constantly being crawled by spiders. Pages that are visited often get re-cached quickly after a wipe, but this doesn't happen for the random access pattern that is common for spiders and also for posts reached through searching.
- Any aggressive caching should be implemented as an optional feature as it is useless for small sites. Perhaps we could change the cache option into three states: "No caching", "Mild caching", "Aggressive caching".

Still, it sounds to me like your problem could be fixed by imposing a throttle on submissions (we used to have this, but it got lost in one of the node system rewrites) or by trying to detect spammy behaviour and imposing a (temporary) ban.

If you dig around the mailing list archive some more, you might find some more things.

Steven
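The first two ideas in the list above (a minimum cache lifetime, and bypassing the cache for a session that has just posted) can be combined in one small sketch. It is only an illustration in Python, not Drupal code and not the timestamp/expiration mechanism Steven refers to. `PageCache`, `min_lifetime`, `bypass_for`, and the `session_id` argument are hypothetical names chosen for this example.

```python
import time


class PageCache:
    """Page cache with a minimum lifetime and a per-session bypass after posting."""

    def __init__(self, min_lifetime, bypass_for, clock=time.time):
        self.min_lifetime = min_lifetime  # seconds a page is served even if outdated
        self.bypass_for = bypass_for      # seconds a poster keeps getting fresh pages
        self.clock = clock
        self.pages = {}                   # path -> (cached_at, html)
        self.bypass_until = {}            # session id -> time the bypass ends
        self.last_post = float("-inf")    # time of the most recent new content

    def content_posted(self, session_id):
        """Record new content instead of flushing the whole cache at once."""
        now = self.clock()
        self.last_post = now
        self.bypass_until[session_id] = now + self.bypass_for

    def get(self, path, session_id, render):
        now = self.clock()
        if self.bypass_until.get(session_id, float("-inf")) > now:
            return render()               # recent poster: always a fresh page
        cached = self.pages.get(path)
        if cached is not None:
            cached_at, html = cached
            outdated = cached_at <= self.last_post
            # An outdated page is still served until it reaches the minimum lifetime.
            if not outdated or now - cached_at < self.min_lifetime:
                return html
        html = render()
        self.pages[path] = (now, html)
        return html
```

In this sketch a flood of posts cannot force more than one rebuild per page per `min_lifetime` seconds for ordinary visitors, while each posting session pays for its own fresh pages.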
On Wednesday 09 March 2005 05.16, Steven Wittens wrote:
Thoughts? Feedback? Suggestions? Dries?
Dries and I did a bunch of cache tests a while ago, the conclusions were posted on the list. I think it was shortly after the 4.5 release.
I remember that letter, too. To help further discussions, I've looked it up:

http://lists.drupal.org/archives/drupal-devel/2004-11/msg00234.html

Regards
Karoly Negyesi
On Wed, 09 Mar 2005 05:16:22 +0100 Steven Wittens <steven@acko.net> wrote: [...]
- Enforcing a minimum cache lifetime for pages is pretty easy with the timestamp/expiration parameter for cache.
This is my proposal #1? Simple time-based caching.
- It is important that a user sees his/her changes reflected immediately, otherwise they might think an error occurred and post twice.
Possible solution: disable caching for a user's session as soon as they have posted something. For your case this would mean the spambot gets fresh pages all the time, but the rest don't.
I disagree that this is important. There are many sites on the internet where once you post something, you get a message that says something like "your comment will be visible within n minutes". However, invalidating the cache for specific users is an interesting idea. This would be per-IP... I will look into this.
- Clearing the cache selectively is difficult because sidebar blocks like "active forum topics" change easily. Still, clearing out the "main" page for a certain item (e.g. a node view) is doable.
I don't plan to pursue selective caching. It introduces more complexity than I am interested in maintaining as a patch against core.
- The cache has a much higher miss rate than expected on drupal.org at least because the site is constantly being crawled by spiders. Pages that are visited often get re-cached quickly after a wipe, but this doesn't happen for the random access pattern that is common for spiders and also for posts reached through searching.
Yes, this is true. And I remember the data you and Dries came up with. However I still get a large boost from the cache. I believe this is because much of the anonymous traffic is due to links from other news sites, and thus the same page is loaded many times in rapid succession. Thus, only a percentage of page loads benefit from the cache, but it's a significant enough percentage to have a noticeable effect on performance. Disabling the cache (or flushing it every second) has a negative effect on performance.
- Any aggressive caching should be implemented as an optional feature as it is useless for small sites. Perhaps we could change the cache option into three states: "No caching" "Mild caching" "Aggressive caching".
Yes, I agree. But is there interest in merging an aggressive type of caching mechanism?
Still, it sounds to me like your problem could be fixed by imposing a throttle on submissions (we used to have this, but it got lost in one of the node system rewrites) or by trying to detect spammy behaviour and imposing a (temporary) ban.
I neglected to mention that the spam bots use an obscenely large number of proxies. Each comment submission is made from a different IP address. Thus, as far as Drupal is concerned they are each a different user.
If you dig around the mailing list archive some more, you might find some more things.
I have followed such discussions with much interest in the past. If anyone else has practical ideas, please speak up. I am adding "temporarily disable cache for specific users" to my list of potential improvements. That should be simple enough, and may be all I need. Thanks, -Jeremy
Jeremy Andrews wrote:
- The cache has a much higher miss rate than expected on drupal.org at least because the site is constantly being crawled by spiders. Pages that are visited often get re-cached quickly after a wipe, but this doesn't happen for the random access pattern that is common for spiders and also for posts reached through searching.
Yes, this is true. And I remember the data you and Dries came up with. However I still get a large boost from the cache. I believe this is because much of the anonymous traffic is due to links from other news sites, and thus the same page is loaded many times in rapid succession. Thus, only a percentage of page loads benefit from the cache, but it's a significant enough percentage to have a noticeable effect on performance. Disabling the cache (or flushing it every second) has a negative effect on performance.
Caching helps a lot. That said, developers should _not_ rely on pages being cached. The cache is no substitute for fixing badly performing code, because 50% of the time the badly performing code is executed anyway.
- Any aggressive caching should be implemented as an optional feature as it is useless for small sites. Perhaps we could change the cache option into three states: "No caching" "Mild caching" "Aggressive caching".
Yes, I agree. But is there interest in merging an aggressive type of caching mechanism?
Yes, there is. -- Dries Buytaert :: http://www.buytaert.net/
forgive my ignorance, but would it not solve the problem to disable anonymous commenting? surely these spiders don't check their email and get new passwords that way ... I know that some sites really really want anonymous commenting, but I think spammers have ruined that. There are so many other benefits to being a logged in user. i don't really want to debate the merits of anonymous commenting, just wondering if my proposal is sufficient.
On Wed, 9 Mar 2005 08:12:59 -0500 Moshe Weitzman <weitzman@tejasa.com> wrote: [...]
forgive my ignorance, but would it not solve the problem to disable anonymous commenting? surely these spiders don't check their email and get new passwords that way ... I know that some sites really really want anonymous commenting, but I think spammers have ruined that. There are so many other benefits to being a logged in user.
i don't really want to debate the merits of anonymous commenting, just wondering if my proposal is sufficient.
Yes and no. Yes, this most recent flood I suffered would have been prevented by disabling anonymous commenting. No, spammers have registered user accounts on my site before and used them to post spam, so even with anonymous commenting disabled it would be possible to get spammed. That said, it is much easier to control offending user accounts... Regardless, disabling anonymous comments is not an option for me. A significant portion of my content comes from anonymous comments. -Jeremy
On Wed, 9 Mar 2005 08:12:59 -0500 Moshe Weitzman <weitzman@tejasa.com> wrote:
[...]
forgive my ignorance, but would it not solve the problem to disable anonymous commenting? surely these spiders don't check their email and get new passwords that way ... I know that some sites really really want anonymous commenting, but I think spammers have ruined that. There are so many other benefits to being a logged in user.
i don't really want to debate the merits of anonymous commenting, just wondering if my proposal is sufficient.
Yes and no. Yes, this most recent flood I suffered would have been prevented by disabling anonymous commenting. No, spammers have registered user accounts on my site before and used them to post spam, so even with anonymous commenting disabled it would be possible to get spammed. That said, it is much easier to control offending user accounts...
Regardless, disabling anonymous comments is not an option for me. A significant portion of my content comes from anonymous comments.
Have you considered using captcha for anonymous comments? -Mark
Jeremy Andrews wrote:
Yes and no. Yes, this most recent flood I suffered would have been prevented by disabling anonymous commenting.
Just a general comment on this for everyone's interest. Spam-bots are easier to stop by requiring registration.
spammers have registered user accounts on my site before and used them to post spam
But as Jeremy has experienced - spamming is an outsourced industry with people being paid to do it. If site registration is quick and easy (as it is with Drupal - because it is such a great CMS) - then you are still faced with the spam issue. Even with sighted-human-input-validation - or what others annoyingly refer to as captchas - paid humans are good at getting past them.

And let's not forget the point of this thread - improved caching - which has the pleasant side effect of keeping your site running smoothly under a hailstorm of spam-bot posts.

I would love to see an optional file-cache system for blogging or brochureware sites that use Drupal. And in a perfect world, improved traditional Drupal caching for community features (like forums) and file-based caching for mainly static modules like books or plain pages (those with low-volume comments).

Or in other words, a configuration option to choose the caching mechanism based on the node type. (I can hear Bèr grumbling already ;-).

andre
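As a rough illustration of the file-based caching discussed in this thread (time-based expiry, with pages stored in the filesystem instead of the database), here is a minimal sketch. It is written in Python for illustration only and is not taken from the 4.0 patches mentioned earlier. `FileCache`, its `ttl` parameter, and the hashed file naming are assumptions made for this example.

```python
import hashlib
import os
import tempfile
import time


class FileCache:
    """Stores rendered pages as files and serves them until they are ttl seconds old."""

    def __init__(self, directory, ttl):
        self.directory = directory
        self.ttl = ttl
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so any URL maps to a safe, fixed-length file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, name + ".html")

    def get(self, key, render, now=None):
        now = time.time() if now is None else now
        path = self._path(key)
        try:
            # The file's modification time doubles as the cache timestamp.
            if now - os.path.getmtime(path) < self.ttl:
                with open(path, encoding="utf-8") as handle:
                    return handle.read()
        except OSError:
            pass                          # no cached file yet, or it vanished
        html = render()
        # Write to a temporary file and rename it over the old one,
        # so a concurrent reader never sees a half-written page.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(html)
        os.replace(tmp_path, path)
        return html
```

Because a cache hit touches only the filesystem, a layer like this could in principle keep answering requests for already-cached pages while the database is unavailable, which matches the behaviour Jeremy reported for his 4.0 patches.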
And let's not forget the point of this thread - improved caching - which has the pleasant side effect of keeping your site running smoothly under a hailstorm of spam-bot posts.
I would love to see an optional file-cache system for blogging or brochureware sites that use Drupal. And in a perfect world improved traditional drupal caching for community features (like forums) and file-based caching for mainly static modules like books or plain pages (those with low volume comments).
Or in other words a configuration option to choose the caching mechanism based on the node type. (I can hear Bèr grumbling already
Just curious, has the topic of using shared memory been brought up? I've used shmop functionality to store large data trees which were imported from XML. Parsing the XML doc would take 500-800ms, but unserializing a large object from shared mem took like 6-7ms. This was for a drupal 4.2 module, but I would imagine the code could easily be adapted. I realize that shmop is not available on all platforms, but probably most server-grade environments would/should support it. http://www.php.net/manual/en/ref.shmop.php -Mark
A lot of web hosts "forget" to compile PHP with shmop support.

Carl McDade

Mark Howell wrote:
And let's not forget the point of this thread - improved caching - which has the pleasant side effect of keeping your site running smoothly under a hailstorm of spam-bot posts.
I would love to see an optional file-cache system for blogging or brochureware sites that use Drupal. And in a perfect world improved traditional drupal caching for community features (like forums) and file-based caching for mainly static modules like books or plain pages (those with low volume comments).
Or in other words a configuration option to choose the caching mechanism based on the node type. (I can hear Bèr grumbling already
Just curious, has the topic of using shared memory been brought up? I've used shmop functionality to store large data trees which were imported from XML. Parsing the XML doc would take 500-800ms, but unserializing a large object from shared mem took like 6-7ms. This was for a drupal 4.2 module, but I would imagine the code could easily be adapted. I realize that shmop is not available on all platforms, but probably most server-grade environments would/should support it. http://www.php.net/manual/en/ref.shmop.php
-Mark
Jeremy Andrews wrote:
I ran into a problem with the way Drupal caches data today. A spam bot started crawling my site for the past 36 hours or so, posting dozens of comments every minute (frequently several a second). Combined with normal traffic, my site was serving 3-400 pages every 60 seconds. Because the spam comments were being posted at such a high speed, the cache was being flushed too quickly to do any good. I may as well have disabled the cache. The site became sluggish. (It has handled that large a load before, just not with comments being posted so quickly)
As an extra solution you could leverage the new flood protection mechanism I introduced in Drupal 4.6. See flood_register_event() and flood_is_allowed(), and check the contact.module in CVS HEAD to see how it can be used. -- Dries Buytaert :: http://www.buytaert.net/
On Wed, 09 Mar 2005 08:27:38 +0100 Dries Buytaert <dries@buytaert.net> wrote: [...]
As an extra solution you could leverage the new flood protection mechanism I introduced in Drupal 4.6. See flood_register_event() and flood_is_allowed(), and check the contact.module in CVS HEAD to see how it can be used.
Unfortunately this will not help, as your flood protection mechanism is per IP. The spammers are using an unbelievably large number of proxies, and each comment seems to come from a new IP address. What's more, the comments contain random data, so simply comparing new comments against old ones to look for duplication often isn't even possible. -Jeremy
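To make the per-IP limitation concrete, here is a small sketch of an event limiter keyed by an identifier such as an IP address. It is a Python illustration only and does not reproduce the implementation behind Drupal's flood_register_event() and flood_is_allowed(). The class `FloodLimiter` and its `limit` and `window` parameters are hypothetical names for this example.

```python
import time
from collections import defaultdict, deque


class FloodLimiter:
    """Allows at most `limit` events per identifier within a sliding `window` of seconds."""

    def __init__(self, limit, window, clock=time.time):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.events = defaultdict(deque)  # identifier -> timestamps of recent events

    def is_allowed(self, identifier):
        now = self.clock()
        recent = self.events[identifier]
        while recent and now - recent[0] >= self.window:
            recent.popleft()              # forget events that have left the window
        return len(recent) < self.limit

    def register_event(self, identifier):
        self.events[identifier].append(self.clock())


def try_post(limiter, ip_address):
    """Hypothetical comment handler: refuse the post once the address is over its limit."""
    if not limiter.is_allowed(ip_address):
        return False
    limiter.register_event(ip_address)
    return True
```

A single address is cut off after `limit` posts, but a bot that submits every comment through a different proxy presents a new identifier each time and is never refused, which is the problem described above.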
participants (8)
- Andre Molnar
- Carl McDade
- Dries Buytaert
- Jeremy Andrews
- Mark Howell
- Moshe Weitzman
- Negyesi Karoly
- Steven Wittens