[development] Cpu usage: something is very wrong (caching).

Gerhard Killesreiter gerhard at killesreiter.de
Tue Aug 1 16:38:34 UTC 2006

Augustin (Beginner) wrote:

> I just reviewed the Caching, Caching, Caching thread, but I didn't see that it 
> addressed an issue that I recently learned about and that I find somewhat 
> shocking: the whole {cache} is being emptied whenever a node is created or 
> edited,

Just for the record: This behaviour isn't new and has not changed over 
the last Drupal versions.

> meaning that even a medium-traffic website with many nodes but only 
> one or two added every day will see a very high cpu usage, because the 100s 
> of nodes will always be purged from {cache} before Drupal has a chance to 
> serve the cached page to a second visitor.
> This approach defeats the whole caching purpose.

That's not 100% accurate: it is still better than no caching at all. You 
should be able to track your own site's cache hit ratio by adding some 
counters in bootstrap.inc where the cache gets retrieved.
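
As a rough illustration of what such counters would measure, here is a 
minimal sketch in Python rather than PHP. The names CountingCache, 
clear_all and serve are hypothetical; none of this is Drupal's actual 
cache API or the code in bootstrap.inc. It only models a page cache that 
is emptied completely whenever a node is saved.

```python
class CountingCache:
    """Toy page cache that counts hits and misses (not Drupal's API)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        # Count every lookup as either a hit or a miss.
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def set(self, key, value):
        self._store[key] = value

    def clear_all(self):
        # Models the whole cache being emptied when a node is saved.
        self._store.clear()

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def serve(cache, path):
    """Return the cached page, rendering and storing it on a miss."""
    page = cache.get(path)
    if page is None:
        page = "<html>rendered " + path + "</html>"
        cache.set(path, page)
    return page


cache = CountingCache()
for path in ["node/1", "node/1", "node/2", "node/1"]:
    serve(cache, path)
cache.clear_all()          # a node was created or edited
serve(cache, "node/1")     # has to be rendered again
print(cache.hits, cache.misses, round(cache.hit_ratio(), 2))  # 2 3 0.4
```

On a real site the two counters would have to be stored somewhere that 
survives the request, for example a log file or a database table.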

> I may have missed other threads addressing the issues I mention below. Please 
> accept my apologies if I did.
> Below is the kind of cpu usage  I get for my sites.
> I must give you some background information first, so that you can understand 
> the figures.
> I am hosted at a nice little hosting-cooperative (i.e. I am host, and hosted, 
> I "own" part of the hosting coop -- en français: http://ouvaton.coop/).
> I think there are about 5~6000 sites hosted, and I have two sites there. 
> My sites are very low traffic (~20 pages views a day, only).
> Almost all CMS are represented in those 4000 sites (phpBB, MediaWiki, phpNuke, 
> you name it...). 
> A large proportion of sites use Spip, though, which has an excellent on-file 
> caching method (bypassing php and SQL).

This is of course the kind of competition Drupal cannot face: even for 
serving a cached page we need to fire up PHP and request data from 
MySQL. How is PHP run on your site: as a module, FastCGI, CGI, ...?

There is, however, Jeremy's patch for file-based caching; you might want 
to try that.

> Drupal is under-represented (I wonder if I am the only one using Drupal, 
> there).
> Now, I have an interesting bit of data in my panel, which is the CPU usage, 
> the Bandwidth, and the number of hits IN PERCENTAGE of the total of all the 
> sites co-hosted there.
> What I found out is that since the beginning, my CPU usage is way above the 
> average compared to the other sites.
> Here is what I have today for one of my two sites (I have very similar figures 
> for the other site):
> (use fixed fonts)
> ---------------------------------------------
>           |  For the week  | For the day    |
>           |  rank  -  %    |  rank  -  %    |
> --------------------------------------------|
> cpu       | 127th - 0.153% |  38th - 0.425% |
> hit       | 538th - 0.032% | 577th - 0.030% |
> Bandwidth | 449th - 0.041% | 496th - 0.037% |
> ---------------------------------------------

That's very interesting data.
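
To make the comparison explicit, the following Python sketch divides the 
CPU share by the hit share, using only the four percentages from the 
table above. The helper name cpu_per_hit is made up for this example. 
The assumption is that a ratio of 1 would mean the site uses CPU in 
proportion to its hits, relative to the host-wide totals.

```python
# Figures copied from the quoted table (percent of the host-wide total).
shares = {
    "week": {"cpu": 0.153, "hit": 0.032},
    "day": {"cpu": 0.425, "hit": 0.030},
}


def cpu_per_hit(period):
    """CPU share divided by hit share for the given period."""
    return shares[period]["cpu"] / shares[period]["hit"]


for period in shares:
    print(period, round(cpu_per_hit(period), 1))
# week 4.8
# day 14.2
```

So on these figures the site used roughly 5 times its proportional share 
of CPU over the week and roughly 14 times on the day in question.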

> I don't have any gallery nor much graphics, so it could have explained the low 
> Bandwidth ranking, but it is on par with the Hit ranking. 
> The CPU ranking however is constantly one order of magnitude above the average 
> of the other 5-6000 sites.
> What's worse, the CPU ranking is very low for the last day. There is not new 
> content every day, and I observe such a spike each time content is added. 
> Out of the 576 sites that have a HIGHER hit figure than my site, 539 sites 
> needed LESS CPU power than I did!

Do you have, or can you get, data about the percentages of the systems used?

> Now, I understand why the cluster of high-end drupal.org servers is having 
> troubles to keep up when a new node is created every couple of minutes!

I think our servers are still doing fine for now.

> I have observed that my own stats have been getting worse and worse as the 
> number of nodes increases. At the beginning, with a dozen nodes or so, the 
> CPU usage never got very high, because the ratio of pages served from the 
> cache was very high. Now, with a meager ~150 nodes, I find out that my CPU 
> usage never goes below a certain level because all nodes have to be 
> recomputed every couple of days, whether they have changed or not. 

If the other content on the server is mostly static, then of course this 
is, by comparison, a problem. Then again, if you only add a few nodes 
each day, this should not be a big problem. Displaying a node isn't a 
number-crunching operation. ;)

> While I am looking forward to have more visitors, and especially more 
> contributors, I shudder at the thought of the potential cost in CPU power.

The nice thing about the Drupal cache is that the number of visitors 
doesn't matter that much: the first visitor generates the cached copy 
and all the others see it.

> The formidable strength of Drupal is its flexibility. Everything (everything!) 
> can be customized, down to form elements, and link attributes! There is no 
> HTML in core and everything is abstracted. 
> Of course, it comes at a price, which is the price of CPU computing power. 
> Instead of printing a link(<a href="">click</a>) directly, Drupal has to 
> parse large arrays, check the hooks, etc... before printing this simple piece 
> of html. Same for forms and everything else...
> I would have thought that this was acceptable if the computing was done once 
> and the cache served many times, if not from file, at least from the DB.
> The algorithm can be improved in places, yes.
> SQL can be streamlined, of course.
> But all of this still requires CPU crunching.
> So, I would like to officially join the ranks behind Gerhard for an improved 
> caching system. 
> I also think we should seriously consider a file-caching mechanism (there was 
> a patch proposed and used on a site, using mod_rewrite magic... we could 
> think of a per-block file/DB caching for those blocks that are often updated 
> --e.g. active forum topics-- while block rarely changing -- e.g. primary 
> links, syndicate block, etc. -- can be hard coded within the cached page.)

As you may have seen from my earlier post today, this isn't as easy as 
it may sound. Adding such granular caching requires more code to do 
proper invalidation. More code will lead to more bugs, it will make it 
harder to see what is going on, etc. And then the patch probably did not 
gain us that much on drupal.org because it jeopardized the MySQL 
query cache.
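
To show the kind of bookkeeping that finer invalidation needs, here is a 
minimal sketch in Python. It is not the patch discussed here and not 
Drupal code; TaggedCache, invalidate_node and the node_ids argument are 
hypothetical names. Each cached page records which nodes it was built 
from, so that saving one node drops only the pages that depend on it.

```python
from collections import defaultdict


class TaggedCache:
    """Toy cache that remembers which nodes each cached page depends on."""

    def __init__(self):
        self._pages = {}
        self._deps = defaultdict(set)  # node id -> cache keys built from it

    def set(self, key, page, node_ids):
        self._pages[key] = page
        for nid in node_ids:
            self._deps[nid].add(key)

    def get(self, key):
        return self._pages.get(key)

    def invalidate_node(self, nid):
        # Drop only the pages that were built from this node.
        for key in self._deps.pop(nid, set()):
            self._pages.pop(key, None)

    def clear_all(self):
        # The simple approach: throw everything away.
        self._pages.clear()
        self._deps.clear()


cache = TaggedCache()
cache.set("node/1", "page for node 1", [1])
cache.set("node/2", "page for node 2", [2])
cache.set("frontpage", "teasers of nodes 1 and 2", [1, 2])

cache.invalidate_node(1)  # node 1 was edited
remaining = [k for k in ("node/1", "node/2", "frontpage")
             if cache.get(k) is not None]
print(remaining)  # ['node/2']
```

Even this small version has a gap: a newly created node has no 
registered dependants, so a listing page that ought to show it would stay 
stale unless more code handles that case. That is the sort of extra 
complexity meant above.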

> The bottom line is that the cache should be more durable, and the ratio 
> page-served-from-cache/page-computed-on-the-fly should be much higher than 
> what it is now.

Frankly, I don't think we can get it any better than it is in your case. 
Regenerating the cache twice a day (after your two nodes) seems entirely 
OK to me.

> What can I do?
> I am not a coder as experienced as you are, so I don't really know where to 
> start.
> But if someone makes a start, I can follow the issue and test the patches.

You really should test Jeremy's file caching patch. I think that that 
one can help you a lot. Personally, I don't like file based caching 
because it only works for sites where the content is public, but in your 
case it seems to be the best approach.

> Once a fairly complete and stable patch comes to light, I can test it on my 
> live site, and see what difference it makes when comparing my own little 
> Drupal site against the thousands of non-Drupal sites hosted at the same 
> place.
> Also, why not introduce some amount of caching (at least for some blocks, or 
> the body of old nodes) for registered users?

My patch introduces caching for all users. It even works if you use 
node_access based modules (but then it generates cache entries per user, 
not sure how useful that is).
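
The doubt about per-user entries can be illustrated with a small Python 
sketch. It is not the patch itself; cache_key, count_entries, uid and 
per_user are names invented for this example. It counts how many cache 
entries and cache hits result when several users view the same page.

```python
def cache_key(path, uid, per_user):
    """Build a cache key; uid and per_user are hypothetical parameters."""
    return "user:%d:%s" % (uid, path) if per_user else path


def count_entries(views, per_user):
    """Return (entries, hits) after serving views, a list of (uid, path)."""
    store = {}
    hits = 0
    for uid, path in views:
        key = cache_key(path, uid, per_user)
        if key in store:
            hits += 1
        else:
            store[key] = "rendered " + path
    return len(store), hits


# Three users view the same node; user 1 comes back once.
views = [(1, "node/1"), (2, "node/1"), (3, "node/1"), (1, "node/1")]
print(count_entries(views, per_user=False))  # (1, 3)
print(count_entries(views, per_user=True))   # (3, 1)
```

With shared entries every view after the first is a hit. With per-user 
entries a view is only a hit when the same user returns before the cache 
is emptied, so the benefit depends on how often that happens.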

