At one point I ran into an issue where my server kept hanging. The load average didn't go up quite as quickly as yours, but it would climb. The problem turned out to be that the version of MySQL I was using had a bug where queries involving inner joins with views (especially when those views were themselves made of 20 or more inner joins) would never return. Do you have any odd modules enabled? Any you wrote yourself? I solved it by upgrading MySQL (well, my first solution was to change the database structure to materialize the view into a table every hour using cron, but when I upgraded and the bug went away, I went back to the view).
HTH,
Ricky
On Oct 11, 2006, at 9:02 AM, Tamir Khason wrote:
We have sysstat installed (the sar tool), and we see nothing in it besides a high load average until everything gets stuck. It doesn't manage to log the real situation when that happens, since the server has no resources left... No iowait noticed at all. Everything is between PHP and MySQL. The server has good disks in RAID 1, and iowait reaches 10% at most in bad times. I also run iostat from time to time to see what is going on. I have also thought about changing the I/O scheduler, but IMO the current problem has nothing to do with I/O. dmesg doesn't point to anything like that either. We are trying to dump the Apache/MySQL status once there is a problem (through mytop, server-status and other tools), but in most cases it's too late and nothing "cooperates" with us.
---------- Forwarded message ----------
From: Khalid B <kb@2bits.com>
Date: Wed, 11 Oct 2006 07:05:58 -0400
Subject: Re: [development] 2k qps
To: development@drupal.org
I have another client site that shoots up to 3,000 QPS, but I don't have page access statistics for it. This one has four 3.2 GHz Xeons and 4 GB of RAM.
Write a simple script to run vmstat every 15 seconds or so and log the output to a file. Keep it running until the "hang" happens, and then see what it tells you about the system. Are you out of CPU? Do you have too many blocked or runnable processes? Do you have a lot of wait on I/O? Are you swapping to the point of thrashing?
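Something along these lines might do as a starting point. It is only a rough sketch in Python: the column list assumes the default output layout of the Linux procps vmstat, and the names `parse_vmstat_line`, `sample_vmstat`, `log_forever` and the `vmstat.log` path are made up for illustration.

```python
import subprocess
import time
from datetime import datetime

# Column names in the default Linux (procps) vmstat output. Assumed layout:
# newer versions append extra columns such as "st", which are ignored here.
VMSTAT_COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]


def parse_vmstat_line(line):
    """Map one numeric vmstat data line onto the column names above."""
    values = [int(v) for v in line.split()]
    return dict(zip(VMSTAT_COLUMNS, values))


def sample_vmstat():
    """Run `vmstat 1 2` and return its last line.

    The first data line vmstat prints is an average since boot, so the
    second one (measured over a one-second interval) is the one kept.
    """
    result = subprocess.run(["vmstat", "1", "2"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip().splitlines()[-1]


def log_forever(path="vmstat.log", interval=15):
    """Append a timestamped vmstat sample to `path` every `interval` seconds."""
    with open(path, "a") as log:
        while True:
            log.write(f"{datetime.now().isoformat()} {sample_vmstat()}\n")
            log.flush()  # hand each sample to the OS before the next sleep
            time.sleep(interval)

# To start logging, call log_forever() and leave it running until the hang.
```

After the hang, the last few lines of the log are the interesting ones: `r` and `b` are the runnable and blocked process counts, `si`/`so` show swapping, and `wa` is the I/O wait percentage. `parse_vmstat_line` turns a logged sample (without the timestamp) back into named fields.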
Only after you do this analysis should you decide whether to upgrade the RAM or just tune things.
Regarding decreasing the number of Apache processes: yes, that means you can only serve so many users at once, but you prevent the chaos caused by swapping. Think of it as a popular restaurant. If they let in more people than they have tables and seats for, would the customers be happy? Would the staff cope with it?
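To put a rough number on the "seats", one back-of-the-envelope approach is to divide the memory left over for Apache by the size of a single Apache/PHP process. The sketch below assumes that simple model, and every figure in it is hypothetical, not a measurement from your server. The function name `max_apache_processes` is made up for illustration.

```python
def max_apache_processes(total_ram_mb, reserved_mb, per_process_mb):
    """Estimate how many Apache processes fit in RAM without swapping.

    total_ram_mb:   physical memory in the machine
    reserved_mb:    memory set aside for MySQL, the OS and everything else
    per_process_mb: typical resident size of one Apache/PHP process
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    return max(available_mb // per_process_mb, 0)


# Hypothetical figures: 4096 MB of RAM, 1536 MB reserved for MySQL and the
# OS, and 25 MB per Apache/PHP process.
print(max_apache_processes(4096, 1536, 25))  # → 102
```

The result would be an upper bound for a setting such as Apache's MaxClients. Measure the real per-process size on the live server before trusting it.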