[drupal-devel] [task] Performance issues with path aliases

mathias drupal-devel at drupal.org
Sat May 14 20:15:25 UTC 2005


Issue status update for http://drupal.org/node/22035

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     tasks
 Priority:     normal
 Assigned to:  mathias
 Reported by:  mikeryan
 Updated by:   mathias
 Status:       patch
 Attachment:   http://drupal.org/files/issues/alias-query-on-demand2.patch (1.03 KB)

Database schema updates for the applied patch [1].
[1] http://drupal.org/cvs?commit=16021




mathias



Previous comments:
------------------------------------------------------------------------

May 4, 2005 - 19:10 : mikeryan

See Investigate use of conf_url_rewrite [2] for context...


The current core support for translating incoming path aliases to the
internal Drupal path (drupal_get_path_alias) and substituting aliases
when generating links to internal paths (drupal_get_normal_path) does
not scale well with many aliases. The ease with which pathauto enables
site administrators to generate large numbers of aliases exposes this
issue, but it is inherent in the core implementation, because
drupal_get_path_map reads in the entire url_alias table at bootstrap
time. I'd like to discuss ideas on how to improve the performance...


Most obviously, why not simply query the url_alias table as needed
instead of loading the whole table? In the incoming case, only a single
simple SELECT is necessary, which will always be more efficient than
reading the table. In the outgoing case, there might be some slight
performance advantage to caching the table with a small number of
aliases, but the disadvantage can become huge as the alias table grows.


One note - I've noticed that the src column in url_alias is not
indexed, I think adding an index should significantly help the
performance in the outgoing case if we were to do individual SELECTs.


Any other thoughts?
[2] http://drupal.org/node/21938




------------------------------------------------------------------------

May 4, 2005 - 20:02 : mikeryan

Attachment: http://drupal.org/files/issues/bootstrap.inc_2.patch (1.06 KB)

What the hell, I went ahead and gave it a shot... On my home system,
where I'm testing out 4.6.0, page loads have been taking several
seconds, which I attributed to the fact that it's an old computer and
I'm multi-tasking like crazy. I implemented my own suggestion, and now
pages load in about one second, it makes an incredible difference
(FWIW, my url_alias table has over 4000 rows).


Patches attached, go give it a try...




------------------------------------------------------------------------

May 4, 2005 - 20:02 : mikeryan

Attachment: http://drupal.org/files/issues/common.inc_11.patch (723 bytes)

One attachment per note, that's tedious...




------------------------------------------------------------------------

May 4, 2005 - 20:03 : mikeryan

Attachment: http://drupal.org/files/issues/path.module_0.patch (1.34 KB)




------------------------------------------------------------------------

May 4, 2005 - 20:03 : mikeryan

Attachment: http://drupal.org/files/issues/database.mysql_4.patch (553 bytes)




------------------------------------------------------------------------

May 4, 2005 - 20:04 : mikeryan

Attachment: http://drupal.org/files/issues/database.pgsql_2.patch (502 bytes)




------------------------------------------------------------------------

May 4, 2005 - 20:07 : mikeryan

Note: I just noticed that the database.*sql patches showed an extra diff
(not mine) to the location field in locales_source. I did my diffs
against a freshly-updated DRUPAL-4-6, and had made the edits against a
release from a few days ago... Those locales_source changes should
probably be removed from my patches before applying.




------------------------------------------------------------------------

May 7, 2005 - 17:13 : mikeryan

I upgraded Fenway Views [3] to Drupal 4.6.0 today, incorporating these
patches. Performance is very noticeably improved.
[3] http://fenway-views.com/




------------------------------------------------------------------------

May 8, 2005 - 07:31 : killes at www.drop.org

Attachment: http://drupal.org/files/issues/pathalias.patch (4.98 KB)

I've merged the patch into one. Much more convenient. I also changed it
to cvs as only there new features will be added.
Upgrade path needs to be added.


Mike, can you run some more tests? We are especially interested in
"hard" data ie numbers with possibly error bars. it would be
interesting to know how this patch affects sites that have only a few
path aliases vs those with a lot of them. Also, how pages with a lot of
node links (tracker) are affected.


It might be worthwhile to add static caching in drupal_get_normal_path
if it gets calle dmore than once for a link on one page view.




------------------------------------------------------------------------

May 8, 2005 - 11:21 : mikeryan

Thanks for merging the patches, I wasn't aware you could patch multiple
files at once. If this is accepted, of course, adding the index on src
should be incorporated into updates.inc.


In terms of hard data, I've never profiled PHP code - what tool(s) do
you use? I couldn't find anything referenced in the PHP manual...


re: pages with lots of links - on Fenway Views the big test is the
calendar [4], and the performance improvement seems even more dramatic
here. I don't know why - the url_alias table is read once per page, so
I would expect as you do that preloading the table would tend to look
better when you have a hundred internal links on the page. Maybe we're
both underestimating exactly how fast a simple MySQL query on an
indexed key can be (I'm using MySQL 4.0.20, BTW)...
[4] http://fenway-views.com/calendar




------------------------------------------------------------------------

May 8, 2005 - 11:26 : killes at www.drop.org

As we speak, Mathias is doing some tests. The tool of choice is usually
apache bench. If you have a lot of url aliases, the generated array
will be huge. Part of the improvement could be due to the fact that you
need less memory now.




------------------------------------------------------------------------

May 8, 2005 - 11:28 : mikeryan

Great, thanks!




------------------------------------------------------------------------

May 8, 2005 - 11:34 : Dries

On drupal.org it takes 800 ms or more to generate a page.  Of these 800
ms, only 4 ms is spent building the path alias map (incl. the SQL query
time which takes about 1 ms).  That is, building the map takes 0.5% of
the total time which is neglible.  In our url_alias table are only 321
aliases though.  How many aliases do you have?


What is more, when we build drupal.org's main page, we query this map
215 times.  I believe your patch would introduce 215 SQL queries ...


I'm afraid that if we'd apply your patch, we'd pay a serious
performance penalty (unless we have many more aliases).


Can you provide some numbers/measurements?




------------------------------------------------------------------------

May 8, 2005 - 11:37 : Dries

Note: my measurements did not take into account the time spent searching
the map.  On the main page, we search this array 215 times.




------------------------------------------------------------------------

May 8, 2005 - 12:08 : puregin

I believe that it's felt that adding path aliases would improve the
Drupal documentation.  This may well add 400+ paths to the URL aliases
table on Drupal.org




------------------------------------------------------------------------

May 8, 2005 - 12:27 : Dries

I spent some more time investigating this.  


The relevant code in bootstrap.inc is this:



function drupal_get_path_alias($path) {
  if (($map = drupal_get_path_map()) && ($newpath = array_search($path,
$map))) {
    return $newpath;
  }
  ...

Turns out that the time to build the map is neglible (< 5ms, see
previous comments), however the total time spent on 'path aliasing' it
about 70ms per page!  50ms of these 70ms are spent on $map =
drupal_get_path_map().  The remaining 20ms is spent on $newpath =
array_search($path, $map).  


The first call to drupal_get_path_map() takes 3 to 5 ms, each
subsequent call takes about 0.3 ms.  Searching the array costs 0.1 ms. 
However, if you have to do this 130+ times, this adds up to a whopping
70 ms.  Remind that we have 321 aliases in our map.


As drupal_get_path_alias() is typically called from code>url(), which
in turn is typically called from l() I set out to investigate the time
spent in l().  Looks like l() easily gets called 120 times/page, and
that each call to l() costs about 0.9 ms.  That is about 2 or 3 times
the cost of the /average SQL query/.  The total time spent in l() is
115 ms!




------------------------------------------------------------------------

May 8, 2005 - 16:25 : adrian

This is a complete aside, and probably off-topic, but I would like to
mention that I was actually
looking at rewriting url() to be able to handle 'external urls' ,
because I want to do aliases using the site subdomain, like for
instance :
http://category.sitename.com/article/title_goes_here -> node/23


Rewriting url to allow this would also allow us to use l() for external
links of any kind, getting rid of a bunch of inline html.




------------------------------------------------------------------------

May 8, 2005 - 16:46 : chx

adrian, I have written a patch for l to support external links but Ber
said that a module like weblink shall handle those.




------------------------------------------------------------------------

May 8, 2005 - 17:03 : mikeryan

I've got over 4000 aliases on Fenway Views - another pathauto user
reported over 6000.


I don't have a profiling tool to give timing data, but adding a quick
counter shows that for the Fenway Views calendar [5],
drupal_get_path_alias() is called 404 times and drupal_get_normal_path
is called 227 times. And trust me, this page loads MUCH faster with my
patch than it does with the path map.


I think those SQL queries are a lot cheaper than one might expect...
[5] http://fenway-views.com/calendar




------------------------------------------------------------------------

May 8, 2005 - 22:24 : mathias

Attachment: http://drupal.org/files/issues/pathalias-with-caching.patch (5.62 KB)

Some benchmarks, with MySQL and Drupal caching disabled.


2 SETS OF TESTS
=======================================
  1) 500 nodes and aliases
  2) 20,000 nodes and aliases


3 TYPES OF TESTS PER SET
=======================================
  1) Baseline unmodified Drupal


  2) The following change inside drupal_get_path_map():


-  if (is_null($map)) {
+  if ($map === NULL) {
      $map = array();  // Make $map non-null in case no aliases
are defined.
 


  3) Pull aliases only when needed rather than loading the entire alias
table.


SET 1 - 500 Nodes and 500 Aliases
        using: ab -c 10 -n 100 [homepage]
=======================================


Baseline
  Time taken for tests:   13.00 seconds
  Requests per second:    3.85 [#/sec] (mean)
  Transfer rate:          56.35 [Kbytes/sec] received


is_null() VS NULL
  Time taken for tests:   12.463 seconds
  Requests per second:    4.01 [#/sec] (mean)
  Transfer rate:          57.63 [Kbytes/sec] received


Per Instance Alias Lookup (32 additional queries)
  Time taken for tests:   11.502 seconds
  Requests per second:    4.35 [#/sec] (mean)
  Transfer rate:          63.30 [Kbytes/sec] received


SET 2 - 20,000 Nodes and 20,000 Aliases
        using: ab -c 5 -n 50 [homepage]
=======================================


Baseline
  Time taken for tests:   204.583 seconds
  Requests per second:    0.24 [#/sec] (mean)
  Transfer rate:          3.34 [Kbytes/sec] received


is_null() VS NULL
  Time taken for tests:   127.629 seconds
  Requests per second:    0.39 [#/sec] (mean)
  Transfer rate:          5.35 [Kbytes/sec] received


Per Instance Alias Lookup (32 additional queries)
  Time taken for tests:   28.788 seconds
  Requests per second:    1.74 [#/sec] (mean)
  Transfer rate:          23.59 [Kbytes/sec] received


ANALYSIS
=======================================
Converting 'is_null()' to '=== NULL' is a no brainer and results in a
nice performance gain. And while per instance alias lookups also give a
huge boost for site with thousands of aliases, they're benefits are
entirely dependent on the number of system URLs and menu items visible
per request.  As noted above I tested the standard homepage which added
32 additional queries.  A request for something like the admin interface
would presumably show less benefits. I would've tested this, but I
didn't know how to invoke an authenticated page request with 'ab'. A
positive of this approach is we no longer would be storing all the
aliases in memory.


What other concerns do we have with this last approach?  Am I properly
testing the strain on the database?




------------------------------------------------------------------------

May 8, 2005 - 23:08 : Dries

Can you try making the following changes?


1. To common.inc:



 function drupal_rebuild_path_map() {
-  drupal_get_path_map('rebuild');
+  drupal_get_path_map(TRUE);
 }

2. To bootstrap.inc:



-function drupal_get_path_map($action = '') {
+function drupal_get_path_map($rebuild = FALSE) {
   static $map = NULL;

-  if ($action == 'rebuild') {
+  if ($rebuild) {
     $map = NULL;
   }

That is, replace the string comparison with a boolean comparison.  Not
sure it is going to make a significant difference but it might be
another micro-improvement.  


I'll test your patch shortly.




------------------------------------------------------------------------

May 9, 2005 - 06:21 : mathias

Dries. Those changes had no real performance gain.


20,000 Nodes and 20,000 Aliases
 using: ab -c 5 -n 50 [homepage]
=======================================
Baseline
 Time taken for tests: 204.583 seconds
 Requests per second: 0.24 [#/sec] (mean)
 Transfer rate: 3.34 [Kbytes/sec] received


String VS Boolean
  Time taken for tests:   204.190 seconds
  Requests per second:    0.24 [#/sec] (mean)
  Transfer rate:          3.34 [Kbytes/sec] received




------------------------------------------------------------------------

May 9, 2005 - 07:01 : Dries

Before this patch can be committed, we need to do more testing.  I'd
like to know how this behaves on sites with few or no path aliases, and
on sites like drupal.org, with a modest amount of path aliases.  


The reason I'm asking is because MySQL is often the bottleneck and not
PHP/Apache.  This is the case on drupal.org, for example.  The proposed
patch moves some of the processing costs from PHP to MySQL.  On
drupal.org, the amount of SQL queries/page would double.  Needless to
say, this is somewhat scary. ;)


More numbers would be appreciated.




------------------------------------------------------------------------

May 9, 2005 - 20:56 : mikeryan

I'm curious, why test with MySQL caching disabled? Since much of the
issue is the expense of making (potentially many) more queries, this
wouldn't seem to reflect the performance gain in practice.


The caching (compared to making the query each time) seems like
overkill to me - with MySQL caching enabled, I would think this
complicates the code for  little (if any) performance gain. Did you do
any profiling using my original (cacheless) patch?


Thanks.




------------------------------------------------------------------------

May 9, 2005 - 23:23 : mathias

In an ideal world everyone would be running a query cache but I wanted
to see how this would hold up under the worst-case scenario.


I tested the original patch which didn't use static variables and the
amount of queries doubled. For example, to load the front page, Drupal
queried the url_alias table 22 times looking for an alias for 'node'.
There's no need to send duplicate queries to the DB since it seems
that's where most bottlenecks seem to lie.




------------------------------------------------------------------------

May 10, 2005 - 16:41 : mikeryan

re: worst-case scenario - I understand that.


re: # queries - I would not want to assume adding to the number of
queries is necessarily a significant performance hit (with database
caching on, of course). I understand the suspicion - I've been around
long enough to remember when you did everything you could to avoid
making one single SQL query more than you absolutely had to. But modern
databases (including MySQL) are very well optimized for this kind of
application (lots and lots of small queries). And, I have seen
real-life cases where adding caching of data that's already being
cached somewhere upstream actually degrades performance.


As long as we're examining this particular area of performance under a
microscope, let's make sure we squeeze every millisecond of savings we
can out of it - I'd just like to see the numbers for a cached
database/no Drupal caching combination for comparison.


Thanks.




------------------------------------------------------------------------

May 10, 2005 - 23:12 : mathias

Dries asked me to do some tests using the drupal.org database. Once
again it appears that plucking aliases out of the database as needed
performed better than grabbing the entire alias table at the beginning
of the request lifecycle, even if the alias table is small. Here's the
numbers.



using: ab -c 5 -n 50 [homepage] 

Drupal.org Baseline
=============
Time taken for tests:   18.081 seconds
Requests per second:    2.77 [#/sec] (mean)
Transfer rate:          52.69 [Kbytes/sec] received

1 page request was 83 queries in 60ms for a cycle execution time of
255ms


Drupal.org Per Instance Query
==================
Time taken for tests:   16.536 seconds
Requests per second:    3.02 [#/sec] (mean)
Transfer rate:          56.63 [Kbytes/sec] received

1 page request was 125 queries in 65ms for a cycle execution time of
230ms

Notice that the database does work a little harder to retrieve the
queries one by one.  And it also doesn't make sense to run 40 extra
queries on a site with only 50 aliases. 


I think at this point we have a couple of options to explore:


1. Continue to try and squeeze out any other optimizations we can in
the original path aliasing code.  Maybe look at output buffering or a
mechnism that'll allow us to start working the database resultset as
soon data is fed into the pipe rather than waiting for the request to
finish. 


2. Try to identify the point (e.g., the number aliases) at which it
becomes more effecient to fetch aliases and change patterns on the fly.


As with most optimizations, the end result will probably be a
combination of both.




------------------------------------------------------------------------

May 10, 2005 - 23:59 : Dries

Looks like in all cases, performance actually improves.  I've run a
couple tests myself using the drupal.org database and measured
performance improvements up to 15%.  Of course, this was done on a
machine where the MySQL database wasn't hammered on by 50+ concurrent
requests.


It would be appreciated if you could extend your patch with a couple
lines of documentation to document the behavior (so people understand
why we are doing it this way), the code and the API.  Maybe add an
entry to CHANGELOG.txt?  Do we need an update for updates.inc?


As for future work, I think we're better off looking at l().  As said,
one call to l() takes about 0.9 ms and we easily do 100+ such calls per
page.  Caching at the level of l() is likely to buy us more but I don't
know whether it is feasible.  Maybe we can think of other optimizations
too. If anything, it might be a good idea to add a TODO/NOTE to the
PHPdoc of l().  Or, if you feel like messing with l() first, by all
means!  Count me in -- it's fun. :-)




------------------------------------------------------------------------

May 11, 2005 - 20:29 : mathias

Attachment: http://drupal.org/files/issues/alias-query-on-demand.patch (5.72 KB)

I think this patch is now commit-worthy.


I did some more profiling using xdebug 2.0 [6] and kcachegrind [7] and
sure enough, drupal_get_path_map() was a resource hog on sites with
many aliases. This should be quite an improvement.


I also found a couple other small optimizations such as replacing
array_key_exists() with a NULL check inside the arg() function (before
[8] / after [9]) and calling isset() in order to avoid calls to
error_handler.
[6] http://xdebug.org
[7]
http://kcachegrind.sourceforge.net/cgi-bin/show.cgi/KcacheGrindIndex
[8] http://www.asitis.org/tmp/before_arg.png
[9] http://www.asitis.org/tmp/after_arg.png




------------------------------------------------------------------------

May 12, 2005 - 02:07 : Dries

That last patch doens't seem to work.  None of my URLs are being
substituted by their path alias.  Also, mind to rename the src action
to source?




------------------------------------------------------------------------

May 12, 2005 - 13:00 : mathias

Attachment: http://drupal.org/files/issues/alias-query-on-demand_0.patch (7.5 KB)

Doh! The previous version only worked when the path module was enabled.
Here's a new one that works independent of path module. It's also a
little faster now.



20,000 node and aliases: 
  without patch: 0.24 requests per second
  with patch: 1.87 requests per second (was 1.74)

The gains on drupal_org and sites with under 400 or so aliases is
small. Actually those sites will spend more time in the DB and less
time churning through PHP code.  Further optimizations could be made
here by defaulting to the old method of grabbing all aliases and
looking for further PHP optimizations. Should we look into this or is
that overkill?




------------------------------------------------------------------------

May 12, 2005 - 15:24 : Dries

Attachment: http://drupal.org/files/issues/alias.patch (7.97 KB)

I upgraded the patch to HEAD (didn't apply) but it still doesn't work
... my URLs are not replaced by their aliases.  I attached the updated
version of your last patch.  


No need to optimize this more so let's evaluate this first.  I'll test
it on drupal.org, which only has a few aliases.  Let's hope other
people will test this too.







More information about the drupal-devel mailing list