[development] The Drupal Diet - Making bootstrap faster

3 May 2007

This is more of an RFC than a DEP, so please forgive the looser format and 
trademark verbosity. :-)

My current push for Drupal 6 is to make it faster for non-opcode non-cached 
users.  Drupal 5 was, according to Dries' benchmarks, a slight step backwards 
from Drupal 4.7 in that regard, so let's reverse that trend with a vengeance.

Far and away the slowest part of the Drupal life-cycle is bootstrapping.  
According to Rasmus' keynote at Drupalcon, we spend just over 50% of the 
entire process just pulling code off disk and parsing it.  A typical page 
load, however, uses only a small fraction of that.  Thus, the biggest target 
for optimization is "load less code", but without violating the 
corollary, "load fewer files".

As an example, I recently tried breaking up some core modules and loading page 
callbacks and forms only when needed[1].  Even with that primitive breakup, we 
were able to get an 8-18% improvement in page load time and a 23% decrease in 
memory usage.  I hate sayings like "the numbers speak for themselves", but in 
this case they do.  On-demand loading of lesser-used module code has the 
potential to be a huge win, and the extra code required to make it possible 
is minimal.  (The code linked in that issue only adds ~10 lines of code; the 
rest of the patch is just moving code around.)

That of course raises the question: how do we split up the code in a module?  
In general, I see 5 logical divisions of code within a module:

1) Rare hooks.  hook_install() and hook_update_N() are the classic cases here, 
although I think hook_menu() in Drupal 6 may be moving in that direction.  
These are hooks called only at very specific, rare times.  The other 99% of 
the time they're dead weight.

2) Common hooks.  This is basically every hook that isn't one of the few rare 
hooks.  These may be called at any time, more or less often.  (For the time 
being I am going to lump hook_load(), hook_update(), etc. in here, even 
though they're technically not hooks as we've discussed previously.)

3) Page handlers.  These are functions whose primary purpose in life is to be 
called from menu_execute_active_handler().  They serve no other serious 
function.

4) Form builders.  These are the form definition, validation, and submission 
functions, as well as their sub-call helpers.

5) API functions.  These are functions specifically exposed to other modules 
to do stuff, for some definition of stuff.

6 (nobody expects the Spanish inquisition!)) Utility functions.  These 
functions are mostly intended for internal use but can sometimes be useful to 
other modules.  The line between a utility function and an API function is 
very blurry since PHP functions have no concept of namespace or visibility.  

There aren't that many Rare Hooks, and most of them are already in .install 
files so I will ignore those for now.  Common Hooks by nature need to be 
readily-available at any time.  It is possible to dynamically load those, 
too, but that's a more complex issue, and one that merlinofchaos[2] and 
chx[3] have already started to address.  I am therefore not going to deal 
with those, either.  API functions also have to be readily-available, and 
utility functions probably should, too.

So now that we've said we need to load most types of code, what does that 
leave us?  That leaves us the two types of code that are used the least but 
take up the most lines of code.  merlin and webchick recently went through a 
few core modules and cataloged what functions were of what type[4], and the 
results are clear: We spend most of our code on page and form handling, and 
yet only one page is ever handled per page load and, generally, only 1-2 
forms!  In terms of actual lines of code, all four modules in question 
(system, user, comment, block) are majority pages and forms, in some cases by 
over 2/3.  That means page handlers and form handlers are both the safest code 
to factor out into separate on-demand files and the biggest win from doing 
so.  It's nice how that works out.

So now we need a mechanism for on-demand loading of page handlers and form 
handlers, subject to the following conditions:

1) It should be an optional optimization.  We don't want to force all modules 
to break up, because many, I'd say the majority, are small and simple enough 
that it would be a case of over-optimization to require, say, every form to 
be in a separate file or every module to have a .pages file.  We also don't 
want to make module authoring an overly-difficult process with a dozen magic 
files.  The degenerate case should be exactly how things work now.

2) It should be flexible.  Different modules need to be optimized differently.  
Putting all page handlers into a single .pages file for a module could still 
mean loading 10x as much code as we really need.  Module authors need to be 
able to factor their own modules in the way that makes the most sense for 
that module, which could mean one on-demand file or several.

3) It is impossible to determine the module that provides a function from the 
function name alone.  Sure all functions (should) use $modulename_<something> 
as their format, but many modules have an underscore in their name.  Given a 
function named "foo_bar_baz", is that the "bar_baz" function of the "foo" 
module, or the "baz" function of the "foo_bar" module?  We can't tell.  
Therefore, unless we are going to simply exclude modules with such names from 
this system (and I think that's a really bad idea) we will have to explicitly 
specify the module or path for a given auxiliary file.  

4) Modules may call page handlers and form handlers from other modules.  Core 
does this in places (node.module calls a page handler from system.module, for 
instance) as do various contribs, so we can't assume that the calling module 
is the providing module.

5) Page handlers are called from the menu system; therefore, the logical place 
to decide if additional code is needed is the menu system.  Since we can't 
presume or deduce a module from the handler, that means it has to be 
specified explicitly in hook_menu().  

6) Form handlers are called from drupal_get_form(), or from drupal_execute().  
Many are parameters to drupal_get_form() being used as a page handler, but 
not all.  drupal_execute() may be called from anywhere at any time, too, so 
forms need to be either already loaded or loadable on-demand at any time.
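
To make the naming problem from condition 3 concrete, here is a small sketch.  
The helper example_candidate_modules() and the module list are made up for 
illustration; nothing like this exists in core.  It only shows that prefix 
matching on the function name can return more than one plausible owner.

```php
<?php
// Hypothetical helper: returns every enabled module whose name, followed by
// an underscore, is a prefix of the given function name.
function example_candidate_modules($function, $modules) {
  $candidates = array();
  foreach ($modules as $module) {
    if (strpos($function, $module . '_') === 0) {
      $candidates[] = $module;
    }
  }
  return $candidates;
}

// Assume these three modules are enabled.
$modules = array('foo', 'foo_bar', 'baz');

// Both 'foo' and 'foo_bar' match, so the name alone cannot tell us which
// module's directory holds the file that defines foo_bar_baz().
$candidates = example_candidate_modules('foo_bar_baz', $modules);
```

With two candidates there is no safe guess, which is why the file location has 
to be stated explicitly.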

I therefore propose (finally I get to this part!) to split off page handlers 
and form handlers in similar ways.  

== Page Handlers == 

Only one page handler is called per page load, so we only need to worry about 
a page handler becoming available in menu_execute_active_handler().  Modules 
provide information to the menu system via hook_menu(), so each menu item can 
optionally specify information on what file to load in order to make the 
handler available.  That could be one of two ways: Pass a full path (e.g., 
drupal_get_path('module', 'foo') . '/foo.pages.inc' ) or specify a file name 
and module name separately.  For simple flexibility I favor the former.  It's 
simple and effective and works for cross-module calls.  It's also what's 
already implemented in the patch I mentioned earlier[1].  
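
As a rough illustration of the idea (not the code from the patch), here is a 
self-contained sketch.  The 'file' key, the foo_example_* functions, and 
example_execute_handler() are all assumptions made up for this example; the 
last one is only a simplified stand-in for the relevant part of 
menu_execute_active_handler().  The on-demand file is written to a temporary 
directory so the snippet runs without a Drupal install.

```php
<?php
// Stand-in for a module's on-demand file (think foo.pages.inc), created in a
// temporary directory so this example is runnable on its own.
$pages_file = sys_get_temp_dir() . '/foo_example.pages.inc';
file_put_contents($pages_file, '<?php
function foo_example_admin_page() {
  return "foo admin page";
}
');

// What a module might return from hook_menu().  The 'file' key is the
// hypothetical addition; items without it behave exactly as they do now.
function foo_example_menu_items($pages_file) {
  return array(
    'admin/foo' => array(
      'callback' => 'foo_example_admin_page',
      'file' => $pages_file,
    ),
    'foo' => array(
      'callback' => 'foo_example_front_page',
    ),
  );
}

// Always loaded, as if it lived in foo.module itself.
function foo_example_front_page() {
  return "foo front page";
}

// Simplified dispatcher: include the extra file only if the item names one,
// then call the page callback.
function example_execute_handler($item) {
  if (!empty($item['file'])) {
    include_once $item['file'];
  }
  return call_user_func($item['callback']);
}

$items = foo_example_menu_items($pages_file);
$loaded_before = function_exists('foo_example_admin_page');
$front = example_execute_handler($items['foo']);
$admin = example_execute_handler($items['admin/foo']);
$loaded_after = function_exists('foo_example_admin_page');
```

The admin page function does not exist until its menu item is actually 
dispatched, and the item with no 'file' key is the degenerate case: it works 
the way things work now.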

== Form Handlers ==

Forms are nearly always accessed via drupal_get_form() or drupal_execute().  
We can therefore do the same sort of centralized improvement for the form 
system in those functions as we can for page handlers using 
menu_execute_active_handler().  That is, add a key to hook_forms() to specify 
a file in which the form lives.  Here we can safely presume that the module 
implementing hook_forms() is also the home of the form functions in question, 
so we need specify only a file and not a module or path.  If a module author 
wishes to split off one or more forms to another file, hook_forms() becomes a 
requirement just as it does for specifying an alternate callback function.  
drupal_get_form() and drupal_execute() then simply check for the existence of 
that key and include_once() the file if necessary.  The total code involved 
should, like the page handler, be quite limited.
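
A minimal sketch of how that might look, under the same caveats as any 
untested idea: the 'file' key, the bar_example module, and example_get_form() 
are hypothetical names for illustration, and example_get_form() stands in for 
only the loading step of drupal_get_form() or drupal_execute().  The module 
directory is faked with a temporary directory so the snippet runs on its own.

```php
<?php
// Pretend module directory, with the form builder split into its own file.
$module_dir = sys_get_temp_dir();
file_put_contents($module_dir . '/bar_example.forms.inc', '<?php
function bar_example_settings_form() {
  return array("title" => array("#type" => "textfield"));
}
');

// Hypothetical hook_forms() implementation.  Only a file name is given,
// because the module implementing the hook is assumed to own the form.
function bar_example_forms() {
  return array(
    'bar_example_settings_form' => array(
      'file' => 'bar_example.forms.inc',
    ),
  );
}

// Simplified loader: look up the form, include its file if one is named,
// then call the builder function to get the form array.
function example_get_form($form_id, $module, $module_dir) {
  $forms = call_user_func($module . '_forms');
  if (!empty($forms[$form_id]['file'])) {
    include_once $module_dir . '/' . $forms[$form_id]['file'];
  }
  return call_user_func($form_id);
}

$form = example_get_form('bar_example_settings_form', 'bar_example', $module_dir);
```

Because the lookup happens inside the loader, a caller never needs to know 
which file a form lives in, whether it arrives through a page request or a 
programmatic submission.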

Note: I will likely want to wait on implementing the forms part until the FAPI 
3 patch lands, because I really don't want to tangle with both eaton and chx 
on that. :-)

The nice thing about this approach, too, is that it doesn't have to be 
implemented all at once.  Because the degenerate case still works, the 
initial implementation can work on only one or two core modules as a 
demonstration.  The rest of core can be optimized module-at-a-time.  That 
makes the patch easier to review as well as easier to maintain with the rest 
of core still being actively developed.  Given the benchmarks that merlin 
found with the initial attempt, I'd say that whatever the total performance 
gain is, it should be substantial.

To the potential problem of module authors "over-factoring" and hiding useful 
utility functions in a page handler when they shouldn't, I believe that 
really is solved by best practice guidelines.  As a worst-case, a module 
author can manually include_once() a file out of another module's directory 
at no worse a cost than the extra parse time.  It's still a net-win overall 
since even if one module gets sloppy-loaded the rest of the system is still 
well-factored, so there's still a net-reduction in the amount of code 
involved.

Sooo...  Now that the three of you who made it all the way through this email 
have gotten here, thoughts on this approach?  Any caveats I'm missing?  Any 
use cases I don't know about?  Does this have a snowball's chance in hell of 
being accepted?

<dons flame-retardant suit>

[1] http://drupal.org/node/140218
[2] http://drupal.org/node/116165
[3] http://drupal.org/node/140218#comment-236614
[4] http://drupal.org/node/116165#comment-229856

-- 
Larry Garfield			AIM: LOLG42
larry@garfieldtech.com		ICQ: 6817012

"If nature has made any one thing less susceptible than all others of 
exclusive property, it is the action of the thinking power called an idea, 
which an individual may exclusively possess as long as he keeps it to 
himself; but the moment it is divulged, it forces itself into the possession 
of every one, and the receiver cannot dispossess himself of it."  -- Thomas 
Jefferson