Hello All,

Following the thread on the CDN, and knowing that I'll soon be working on a site with nearly 7 million users per month and a database nearly 20 GB in size, I'd like to ask the gurus out there about technologies and modules, within Drupal or the LAMP architecture, aimed at very high-load sites. I want to avoid the problem of "early implementation", where decisions made up front make some of these technologies unfeasible or hard to adopt later.

I know about memcache (which I haven't tried yet) and CDNs. I know about load balancing (in theory, as I haven't needed it yet), and I know that you can make a server more powerful with more hardware. I have personally tuned servers to squeeze out every last bit of performance, but past a certain level of traffic, squeezing a single server just doesn't cut it.

What else can you pitch in? Comments? Recommendations? Thoughts?

Regards,
AA
From Acquia:
This week's Webinar will feature Kieran Lal, Acquia's Drupal Community Guide, as our guest speaker. He will review the latest techniques for ensuring scalability based on Acquia's expertise in supporting customer sites. Drupal site performance is often a combination of monitoring and tuning several underlying technologies.

Dec 17th - Best Practices for Building a High Performance and Scalable Drupal Site
Time: 1:00 PM ET

Nancy E. Wichmann, PMP
Injustice anywhere is a threat to justice everywhere. -- Dr. Martin L. King, Jr.
On Wed, Dec 16, 2009 at 11:21 AM, Nancy Wichmann <nan_wich@bellsouth.net> wrote:
> From Acquia:
>
> This week's Webinar will feature Kieran Lal, Acquia's Drupal Community Guide, as our guest speaker. He will review the latest techniques for ensuring scalability based on Acquia's expertise in supporting customer sites. Drupal site performance is often a combination of monitoring and tuning several underlying technologies.
>
> *Dec 17th - Best Practices for Building a High Performance and Scalable Drupal Site*
> Time: 1:00 PM ET

Thanks Nancy, I sent it privately as well. But since you brought it up! Registration for the webinar is here: https://www2.gotomeeting.com/register/752581675

Cheers,
Kieran
Sounds wonderful! Is there any way I can check my settings beforehand to avoid last-minute technical issues? I tried to join early, hoping I could test before the designated time, but it seems I can't. I haven't attended a webinar before, so I apologize if this is a silly question or is answered elsewhere. I use Firefox 3.5.5 on Kubuntu 9.10, and Flash works fine on YouTube. I'll make sure the headset is configured OK. Any tips in this regard?

Thanks!

On Wed, Dec 16, 2009 at 9:46 PM, Kieran Lal <kieran@acquia.com> wrote:
> Thanks Nancy, I sent it privately as well. But since you brought it up! Registration for the webinar is here: https://www2.gotomeeting.com/register/752581675
>
> Cheers, Kieran
-- Ashraf Amayreh http://aamayreh.org
On Wed, Dec 16, 2009 at 10:37 AM, Ashraf Amayreh <mistknight@gmail.com> wrote:
> tuned servers to squeeze every last bit of performance, but that falls short of what happens when you reach a certain level when squeezing a server just doesn't cut it. What else can you pitch in? Comments? Recommendation? Thoughts?

If you haven't already, consider joining http://groups.drupal.org/high-performance and reading the content there. Lots of great advice from people managing similar sites.

Regards,
Greg

--
Greg Knaddison | 303-800-5623 | http://growingventuresolutions.com
Mastering Drupal - http://www.masteringdrupal.com
On Wed, Dec 16, 2009 at 11:07 PM, Ashraf Amayreh <mistknight@gmail.com> wrote:
> Hello All,
>
> Following the thread on the CDN, and knowing that I'll soon be working on a site with nearly 7 million users per month and a database nearly 20 GB in size, I'd like to ask the gurus out there about technologies and modules, within Drupal or the LAMP architecture, aimed at very high-load sites. I want to avoid the problem of "early implementation", where decisions made up front make some of these technologies unfeasible or hard to adopt later.

I would suggest having a look at the Pantheon project [1]. Pantheon is an open source project to make Drupal more scalable. It uses the Pressflow distribution with Varnish, Apache, APC and Solr in the cloud. You can either get a Mercury AMI instance or create a similar environment on your own server [2].

[1] http://getpantheon.com/
[2] http://groups.drupal.org/pantheon/mercurywiki
http://groups.drupal.org/pantheon
https://launchpad.net/projectmercury

--
Thanks
Sivaji
On Thu, Dec 17, 2009 at 3:31 AM, sivaji j.g <sivaji2009@gmail.com> wrote:
> I would suggest having a look at the Pantheon project [1]. Pantheon is an open source project to make Drupal more scalable. It uses the Pressflow distribution with Varnish, Apache, APC and Solr in the cloud. You can either get a Mercury AMI instance or create a similar environment on your own server [2].
>
> [1] http://getpantheon.com/
> [2] http://groups.drupal.org/pantheon/mercurywiki
> http://groups.drupal.org/pantheon
> https://launchpad.net/projectmercury
Hi, the Pantheon stuff is awesome. But putting a site with 7M unique visitors, probably 75-100 million page views per month, on a single virtual web server, even an X-Large, is both very risky and limiting if the site gets even bigger.

A more appropriate approach for a site of that size is to build a cluster of servers in a high-availability configuration, which provides more flexibility to use various web scaling technologies. You'll see that approach taken with even moderately sized Drupal sites. I'll be covering all of this in quite a bit of detail in my presentation in 2.5 hours.

Cheers,
Kieran
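
For a rough sense of scale, the monthly page-view estimate above can be converted into request rates. This is only a back-of-the-envelope sketch in Python: the 30-day month and the peak factor of 5 are assumptions made for illustration, not figures from this thread, and the function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def average_per_second(page_views_per_month: int) -> float:
    """Average page views per second over a 30-day month."""
    return page_views_per_month / SECONDS_PER_MONTH


def peak_per_second(page_views_per_month: int, peak_factor: float = 5.0) -> float:
    """Estimated peak rate; peak_factor is a hypothetical multiplier."""
    return average_per_second(page_views_per_month) * peak_factor


if __name__ == "__main__":
    for monthly in (75_000_000, 100_000_000):
        print(f"{monthly:>11,} views/month: "
              f"avg {average_per_second(monthly):.1f}/s, "
              f"assumed peak {peak_per_second(monthly):.1f}/s")
```

The averages hide daily and weekly peaks, which is one reason the real sizing decision should come from benchmarks rather than from this arithmetic.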
On 12/17/2009 09:31 AM, Kieran Lal wrote:
> A more appropriate approach for a site of that size is to build a cluster of servers in a high-availability configuration, which provides more flexibility to use various web scaling technologies. You'll see that approach taken with even moderately sized Drupal sites. I'll be covering all of this in quite a bit of detail in my presentation in 2.5 hours.

Unfortunately, I missed it due to a client meeting... is there a transcript or recording of this anywhere?

--Susan

--
"We all declare for liberty; but in using the same word we do not all mean the same thing. With some the word liberty may mean for each man to do as he pleases with himself, and the product of his labor; while with others, the same word may mean for some men to do as they please with other men, and the product of other men's labor. Here are two, not only different, but incompatible things, called by the same name - liberty. And it follows that each of the things is, by the respective parties, called by two different and incompatible names - liberty and tyranny."
--Abraham Lincoln
On Thu, Dec 17, 2009 at 7:10 PM, Susan Stewart <hedgemage@binaryredneck.net> wrote:
> Unfortunately, I missed it due to a client meeting... is there a transcript or recording of this anywhere?

The recorded video will be posted here: http://acquia.com/community/resources/recorded_webinars

Keep in mind this was a one-hour introductory webinar covering scalability and performance for Drupal. I covered a lot of material quickly, and tried to touch on a lot of relevant performance and scalability technologies and techniques.

Cheers,
Kieran
I listened to the presentation and found it interesting, but it was missing some of what I wanted. My site is too big for shared hosting, but I cannot afford to go beyond one dedicated machine. "Cache the hell out of everything" is probably the best advice, but perhaps there are other tweaks that should be looked at as well.

A question I submitted before the talk did not get covered. I would like to see a graph, perhaps a nomogram, of something like max hits per hour vs. appropriate technology (both hardware and software).

-----Original Message-----
From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Kieran Lal
Sent: Friday, December 18, 2009 12:09 AM
To: development
Subject: Re: [development] development with scalability in mind

> The recorded video will be posted here: http://acquia.com/community/resources/recorded_webinars
>
> Keep in mind this was a one-hour introductory webinar covering scalability and performance for Drupal. I covered a lot of material quickly, and tried to touch on a lot of relevant performance and scalability technologies and techniques.
>
> Cheers, Kieran
A lot is going to depend on exactly what you are planning to do with the site. Will there be a lot of logged-in users? How often will data be changing? Are you going to have a lot of complex queries (e.g. searches)?

You said yesterday that the DB size would be about 20 GB. That alone will cost performance: the tables won't really fit into memory, and the database will take up a huge chunk of that single server.

If you don't have many logged-in users and the data isn't changing all that much, then you *may* be able to get by using Boost. If you are going to have a bunch of logged-in users, then I would seriously look at using alternative caching like Cacherouter. With a 20 GB database, the more you can keep off of it the better.

If you have a lot of images and static content, then I would seriously look at pushing that off to a CDN to remove some of the burden on the server as well.

From a development standpoint, MySQL's slow query log is your friend, plus the devel and performance logging modules. Make sure none of your common queries are doing nasty things like resorting to filesorts on thousands of rows, and that all your queries are indexed properly.

Also, be very careful when dealing with caching. One thing I have seen a lot is people doing "on demand" refreshing of expired caches: they check the expiration (or some other metric) when the cache is pulled, and if the check fails they run the query or code to regenerate it. This is usually used on very server-intensive queries. The problem shows up in this example. You have a query that takes 4 seconds to run:

- User A hits the site at 00:00:00.00 and the cache needs to be refreshed, so the query is run.
- User B hits the site at 00:00:01.00. The query from User A is still running, so the cache has not been updated yet; User B's request can't know that, so it runs the query again.

On a high-traffic site you can see how that will snowball into a bunch of people running the same query.

From a development standpoint, it's best to put these kinds of routines into a cron job so that the following happens:

- User A hits the site at 00:00:00.00 and the cache needs to be refreshed. You have a special "cron" table in the DB, and a record is written saying that this item needed recomputing as of 00:00:00.00. User A is served the stale data.
- User B hits the site at 00:00:01.00 and the cache is still expired. The code sees the existing record in that cron table and moves on, just serving the stale cache data.

Running cache refreshes like this on cron removes the possibility of the query being run multiple times.

On a cost comparison, sometimes two servers are cheaper than one. With the size of your database and your traffic predictions, you will probably end up having to dump a lot of extra hardware into that single server to make one "super server", whereas with one web server and one database server you could possibly get by with medium or large servers, since each would be tuned specifically for its job.

Jamie Holly
http://www.intoxination.net
http://www.hollyit.net

On 12/18/2009 10:47 AM, Walt Daniels wrote:
> I listened to the presentation and found it interesting, but it was missing some of what I wanted. My site is too big for shared hosting, but I cannot afford to go beyond one dedicated machine. "Cache the hell out of everything" is probably the best advice, but perhaps there are other tweaks that should be looked at as well.
>
> A question I submitted before the talk did not get covered. I would like to see a graph, perhaps a nomogram, of something like max hits per hour vs. appropriate technology (both hardware and software).
Hello! Comments inline.

On Fri, 18 Dec 2009 11:09 -0500, "Jamie Holly" <hovercrafter@earthlink.net> wrote:
> A lot is going to depend on exactly what you are planning to do with the site. Will there be a lot of logged-in users? How often will data be changing? Are you going to have a lot of complex queries (e.g. searches)?

With performance-related problems, finding out where the bottlenecks are is important. It's easy to just say it's the DB, but it could also be the web server or the operating system. How do you do this before launching? Benchmark, based upon the URLs that are most expensive to generate and the URLs that are most frequently requested. Once a full system benchmark has been run, you'll be able to collect queries using the general query log, and then use mysqlreport and mk-query-digest to see exactly what your DB is doing with sample data.
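
As a starting point for choosing benchmark targets, the most frequently requested URLs can be pulled from an existing access log. The sketch below is a minimal Python example that assumes the log is in Common or Combined Log Format; the function name `top_urls` is hypothetical. It only covers request frequency, so finding the URLs that are most expensive to generate would still need timing data from elsewhere.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Matches the request part of a Common/Combined Log Format line,
# e.g. "GET /node/1 HTTP/1.1". The log format is an assumption.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def top_urls(log_lines: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Return the most frequently requested paths with their hit counts."""
    counts: Counter = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [18/Dec/2009:10:00:00 -0500] "GET /node/1 HTTP/1.1" 200 512',
        '1.2.3.5 - - [18/Dec/2009:10:00:01 -0500] "GET /search/node/drupal HTTP/1.1" 200 2048',
        '1.2.3.6 - - [18/Dec/2009:10:00:02 -0500] "GET /node/1 HTTP/1.1" 200 512',
    ]
    for path, hits in top_urls(sample):
        print(hits, path)
```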
> You said yesterday that the DB size would be about 20 GB. That alone will cost performance: the tables won't really fit into memory, and the database will take up a huge chunk of that single server.

A 20 GB MySQL DB really isn't all that large. When the entire DB is larger than available memory, you need to look at the working set of all that data. Not all data in your DB is of equal value: you may have data that's now considered out of date and is rarely accessed, while current data is accessed much more frequently. Consider archiving your least accessed data.

An important question to ask: exactly what data is being requested, and how frequently? You may find that a small query that runs very frequently is causing more problems than a very complex query that's not run all that often.

Once you know which queries are giving you problems, it's off to investigate index and data buffers to make sure MySQL is configured properly. Is MySQL creating too many temporary tables from these queries? Drupal has many core tables with TEXT or BLOB fields, so temporary tables that include those columns always go to disk. You really want to avoid on-disk temp tables as much as possible. One approach is to do sorting/grouping at the application level instead of letting MySQL do it.

<snip>
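
One way to read the last suggestion: fetch plain rows with a simple indexed query, then group and sort them in application code so MySQL does not need a temporary table for GROUP BY or ORDER BY. The following is a minimal Python sketch; the row layout (type, title, comment count) is hypothetical and stands in for a real result set.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each row is (node_type, title, comment_count). The columns are hypothetical
# and stand in for the result of a simple SELECT without GROUP BY / ORDER BY.
Row = Tuple[str, str, int]


def group_and_sort(rows: Iterable[Row]) -> Dict[str, List[Row]]:
    """Group rows by type, sorting each group by comment count, highest first."""
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    for members in groups.values():
        members.sort(key=lambda r: r[2], reverse=True)
    return dict(groups)
```

This trades database work for memory and CPU on the web server, so it is only likely to help when the result set is of modest size; benchmarking both variants is the way to find out.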
> If you have a lot of images and static content, then I would seriously look at pushing that off to a CDN to remove some of the burden on the server as well.

CDNs solve throughput problems, but not database problems. If your bottleneck is web server throughput, you may need operating system or web server tuning to overcome it. Maybe changing to different web server software is the answer. One possible approach is serving dynamic pages with Apache and static content with lighttpd.
> From a development standpoint, MySQL's slow query log is your friend, plus the devel and performance logging modules. Make sure none of your common queries are doing nasty things like resorting to filesorts on thousands of rows, and that all your queries are indexed properly.

The MySQL slow query log has a serious limitation if you're using the versions of MySQL that ship with most distros. If you have a query that takes less than a second to complete but is called very frequently, it'll never show up in the log, and you'll never know that this frequent, fast-running query is causing a performance problem in your DB. To get finer detail on query performance, use the community version of MySQL, or the OurSQL which contains the query profiler patch. The Percona versions of MySQL also have microsecond patches for the slow query log.
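
The reason a fast query can matter more than a slow one is that total load is roughly calls multiplied by time per call. The Python sketch below ranks queries by total time rather than by individual duration, which is the kind of summary a query digest tool produces. The sample workload and query labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def total_time_by_query(samples: Iterable[Tuple[str, float]]) -> List[Tuple[str, int, float]]:
    """Sum execution time per query and rank by total time, largest first.

    Each sample is (query_label, seconds).
    Returns a list of (query_label, calls, total_seconds).
    """
    calls: Dict[str, int] = defaultdict(int)
    total: Dict[str, float] = defaultdict(float)
    for query, seconds in samples:
        calls[query] += 1
        total[query] += seconds
    ranked = sorted(total, key=total.get, reverse=True)
    return [(query, calls[query], total[query]) for query in ranked]


if __name__ == "__main__":
    # Hypothetical workload: one 4-second report query and
    # 1,000 lookups at 50 ms each, none of which reach a 1-second threshold.
    samples = [("report_query", 4.0)] + [("lookup_query", 0.05)] * 1000
    for query, n, seconds in total_time_by_query(samples):
        print(f"{seconds:8.1f}s total  {n:5d} calls  {query}")
```

In this made-up workload the 50 ms lookup accounts for about 50 seconds of database time against 4 seconds for the report, yet only the report would appear in a slow query log with a one-second threshold.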
> Also, be very careful when dealing with caching. One thing I have seen a lot is people doing "on demand" refreshing of expired caches: they check the expiration (or some other metric) when the cache is pulled, and if the check fails they run the query or code to regenerate it. This is usually used on very server-intensive queries. The problem shows up in this example.
>
> You have a query that takes 4 seconds to run:
>
> - User A hits the site at 00:00:00.00 and the cache needs to be refreshed, so the query is run.
> - User B hits the site at 00:00:01.00. The query from User A is still running, so the cache has not been updated yet; User B's request can't know that, so it runs the query again.
>
> On a high-traffic site you can see how that will snowball into a bunch of people running the same query. From a development standpoint, it's best to put these kinds of routines into a cron job so that the following happens:
>
> - User A hits the site at 00:00:00.00 and the cache needs to be refreshed. You have a special "cron" table in the DB, and a record is written saying that this item needed recomputing as of 00:00:00.00. User A is served the stale data.
> - User B hits the site at 00:00:01.00 and the cache is still expired. The code sees the existing record in that cron table and moves on, just serving the stale cache data.
>
> Running cache refreshes like this on cron removes the possibility of the query being run multiple times.
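
The cron-based refresh described in the quoted text can be sketched as follows. This is a minimal Python illustration, not Drupal's cache API: the in-memory `cache` and `refresh_queue` stand in for the cache store and the special "cron" table, all names are hypothetical, and it assumes the cache has been primed at least once.

```python
import time
from typing import Any, Callable, Dict, Optional, Set, Tuple

# In-memory stand-ins (assumptions for illustration): a real site would use
# its cache backend and a database table for the refresh queue.
cache: Dict[str, Tuple[Any, float]] = {}   # key -> (value, expires_at)
refresh_queue: Set[str] = set()            # keys flagged for recomputation


def cache_get(key: str, now: Optional[float] = None) -> Any:
    """Serve cached data; if it has expired, flag it for refresh and serve it stale."""
    now = time.time() if now is None else now
    value, expires_at = cache[key]  # assumes the key was primed earlier
    if now >= expires_at:
        # Adding to a set is idempotent, so User B's request does not
        # queue (or run) the expensive query a second time.
        refresh_queue.add(key)
    return value


def cron_refresh(rebuilders: Dict[str, Callable[[], Any]], ttl: float,
                 now: Optional[float] = None) -> int:
    """Run from cron: rebuild each flagged item once and return how many were rebuilt."""
    now = time.time() if now is None else now
    refreshed = 0
    while refresh_queue:
        key = refresh_queue.pop()
        cache[key] = (rebuilders[key](), now + ttl)
        refreshed += 1
    return refreshed
```

The trade-off is that visitors may see stale data until the next cron run, so the cron interval sets the upper bound on staleness.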
Warming up the query cache can be one approach to improving DB performance, provided the underlying tables are not changing all that much once users are logged in. This approach can also be used to warm up data and index buffers. If a query is in the MySQL query cache, then when that exact same query shows up again, MySQL finds the cached entry and sends back those results. If any table in that query has been updated, the cached entry is invalidated and removed. It's entirely possible for the overhead of checking each query, combined with tables changing just frequently enough, to turn the query cache into a bottleneck. Having users run queries with a cron job on top could exacerbate the situation. In these cases, turning the query cache off has improved performance. Using mysqlreport to show the query cache stats would be very helpful here.
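
To judge whether the query cache is earning its keep, a couple of ratios can be derived from the server's status counters. The Python sketch below assumes the counters have already been read from `SHOW GLOBAL STATUS` into a dictionary; `Qcache_hits`, `Qcache_inserts` and `Com_select` are MySQL status variables, while the function name and the choice of ratios are illustrative rather than a reproduction of mysqlreport's output.

```python
from typing import Dict, Mapping


def query_cache_stats(status: Mapping[str, int]) -> Dict[str, float]:
    """Derive query cache ratios from SHOW GLOBAL STATUS counters.

    Com_select does not count queries answered from the query cache,
    so total SELECTs is approximated as Com_select + Qcache_hits.
    """
    hits = status["Qcache_hits"]
    inserts = status["Qcache_inserts"]
    selects = status["Com_select"] + hits
    return {
        "hit_ratio": hits / selects if selects else 0.0,
        # Few hits per insert suggests entries are invalidated before reuse.
        "hits_per_insert": hits / inserts if inserts else 0.0,
    }
```

A low hit ratio together with few hits per insert would be consistent with the situation described above, where invalidation overhead outweighs the benefit of caching.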
> On a cost comparison, sometimes two servers are cheaper than one. With the size of your database and your traffic predictions, you will probably end up having to dump a lot of extra hardware into that single server to make one "super server", whereas with one web server and one database server you could possibly get by with medium or large servers, since each would be tuned specifically for its job.

Scaling up versus scaling out, that's the question. Benchmarking your site will help provide the data needed to make that decision. Performance tuning is an additive process, meaning you can't simply do one thing and expect a 20% improvement. You improve performance by making configuration changes across the LAMP stack based upon benchmarking data: essentially, make a change, benchmark it, and see what the results are. It's also a process of diminishing returns. So if you've already done as much as you can to your current server (for example, you're seeing high CPU utilization and low disk I/O), your single server is probably already running as well as it's going to.
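
The "make a change, benchmark it, see what the results are" loop can be made concrete by comparing response-time samples taken before and after each change. This is a minimal Python sketch; the 2% threshold used to ignore benchmark noise and the function names are hypothetical.

```python
from statistics import median
from typing import Sequence


def percent_change(before: Sequence[float], after: Sequence[float]) -> float:
    """Percent change in median response time; negative means faster."""
    base = median(before)
    return (median(after) - base) / base * 100.0


def keep_change(before: Sequence[float], after: Sequence[float],
                min_gain_percent: float = 2.0) -> bool:
    """Keep a configuration change only if the median improved by the threshold.

    min_gain_percent is a hypothetical cut-off to ignore benchmark noise.
    """
    return percent_change(before, after) <= -min_gain_percent
```

Changing one setting at a time and re-running the same benchmark keeps each measured gain attributable to a single change, and shrinking gains are a sign that the diminishing-returns point has been reached.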
> Jamie Holly
> http://www.intoxination.net
> http://www.hollyit.net

Regards,
Mark Schoonover

--
http://www.thetajoin.com - The Drupal Hosting & Performance Company
Email: mark@thetajoin.com :: Voice: 619-928-4473 :: Fax: 619-374-3130
Participants (9): Ashraf Amayreh, Greg Knaddison, Jamie Holly, Kieran Lal, Mark Schoonover, Nancy Wichmann, sivaji j.g, Susan Stewart, Walt Daniels