[development] development with scalability in mind

Mark Schoonover mark at thetajoin.com
Fri Dec 18 19:01:26 UTC 2009

Hello! Comments inline.

On Fri, 18 Dec 2009 11:09 -0500, "Jamie Holly"
<hovercrafter at earthlink.net> wrote:
> A lot is going to depend on exactly what you are planning on doing with 
> the site. Will there be a lot of logged in users? How often will data be 
> changing? Are you going to have a lot of complex queries (ie: searches, 
> etc.)?

With performance related problems, finding out where the bottlenecks are
is important. It's easy to just say it's the DB, but it could also be
the webserver or operating system too. How do you do this before
launching? Benchmarking based upon your most expensive URLs to create,
and most frequently requested URLs. Once a full system benchmark has
been done, you'll be able to  collect queries using the general query
log, then one can use mysqlreport and mk-query-digest to see exactly
what your DB is doing with sample data. 

> You said yesterday that the DB size would be about 20gb. Well that there 
> will present a performance hit alone, with tables not being able to 
> really fit into memory and means the database will take up a huge chunk 
> of that single server.

A 20GB MySQL DB really isn't all that large. When the entire DB is
larger than available memory, then you need to look at what's the
working set of all that data. Not all data in your DB is of equal value
- you may have data that's now considered out of date and really not
accessed much. On the other hand, you'll have current data that's
accessed more frequently. Consider archiving your least accessed data.

An important question to ask - exactly what data is getting requested,
and how frequently? You may find out that you'll have a small query that
runs very frequently is causing problems, compared to a very complex
query that's not run all that frequently. Once you know what queries are
giving you problems, then it's off to investigate index & data buffers
to make sure MySQL is configured properly. Is MySQL creating too many
temporary tables from these queries? Drupal has many core tables that
have TEXT or BLOB fields in them, so they always go to disk. You really
want to avoid temp tables to disk as much as possible. One approach is
to do sorting/grouping at the application level instead of letting MySQL
do it.

> If you got a lot of images and static content then I would seriously 
> look at pushing that off to a CDN to remove some of the burden on the 
> server also.

CDNs solve throughput problems, but not database problems. If your
bottleneck is webserver throughput, you may require operating
system/webserver tuning to overcome. Maybe changing to another webserver
softwore is the answer. Possibly serving dynamic data with Apache, then
lighttpd server static content could be an approach.

>  From a development stand point, the MySQL's slow query log is your 
> friend, plus the devel+performance logging module. Make sure none of 
> your common queries are doing nasty things like resorting to filesorts 
> on thousands of rows and that all your queries are indexed properly.

The MySQL Slow Query log has a serious limitation if you're using
versions of MySQL that typically ship with most distros. If you have a
query that takes less than a second to complete, but this query is
called very frequently, it'll never get in the log. You'll never know
that this frequently, fast running query is causing a performance
problem with your DB. To find out greater details in query performance,
use the community version of MySQL, or the OurSQL which contains the
query profiler patch. The Percona versions of MySQL also have
microsecond patches for the slow query log.

> Also when dealing with caching be very careful. One thing I have seen a 
> lot of is people who do "on demand" refreshing of expired caches. What 
> happens is that they check the expiration or some other metric when the 
> cache is pulled and if it fails they run the query or code to regenerate 
> it. This is usually used on very server intensive queries. The problem 
> lies in this example.
> You have a query that takes 4 seconds to run
> - User A hits the site at 00:00:00.00 and the cache needs refreshed so 
> the query is run
> - User B hits the site at 00:00:01.00. The query from user A is still 
> running so the cache is updated and user B doesn't know this, so the 
> query is running again.
> On a high traffic site you can see how that will snowball into a bunch 
> of people running the same query. From a development stand point, it's 
> best to put these kind of routines into a cron job so the following
> happens:
> - User A hits the site at 00:00:00.00 and the cache needs refreshed. You 
> have a special "cron" table in the DB and a record is written saying 
> that this item needs recomputed at 00:00:00.00 and User A is hit with 
> the stale data.
> - User B hits the site at 00:00:01.00 and the cache is still expired. 
> The code checks for the record in that cron table and moves on, just 
> serving the stale cache data.
> Running cache refreshes like this on cron removes the possibilities of 
> the queries being called multiple times.

Warming up the querycache can be an approach to improving DB
performance, provided the underlying tables are not changing all that
much once users are logged in. This approach can also be used to warm up
data and index buffers too. 

If a query is cached in the MySQL querycache, when that same exact query
shows up again, MySQL will check to see if that query already exists,
then send those results. If any table in that query has been updated,
the querycache get's invalidated and that query is removed. It's
entirely possible for the overhead of checking the query, combined with
tables changing just frequently enough that the querycache becomes a
bottleneck. Having users run queries, then a cron job on top could
exacerbate the situation. In these cases, turning the querycache off has
improved performance. Using mysqlreport to show the querycache stats
would be very helpful in this case.

> On a cost comparison, sometimes two servers is cheaper than one. With 
> the size of your database and traffic predictions you will probably end 
> up having to dump a lot of extra hardware into that single server to 
> make one "super server", where as if you have one web server and one 
> database server you could possible get by with a medium or large server, 
> since each would be tuned specifically to their job.

Scaling up versus scaling out, that's the question. Benchmarking your
site will help provide the data needed to make that decision.
Performance tuning is an additive process - meaning, you can't simply do
one thing and expect 20% improvement in performance. Improving
performance by making configuration changes across the LAMP stack based
upon benchmarking data. Essentially, make a change, benchmark it and see
what the results are. It's also a diminishing returns process too. So,
if you're already at the point of you've done as much as you can do to
your current server, as an example, you're seeing high CPU utilization
and low disk I/O, your single server is probably already the best it's
going to run.

> Jamie Holly
> http://www.intoxination.net
> http://www.hollyit.net

Mark Schoonover

http://www.thetajoin.com - The Drupal Hosting & Performance Company
Email: mark at thetajoin.com :: Voice: 619-928-4473 :: Fax: 619-374-3130

More information about the development mailing list