[development] why is locale import slow and a memory eater?

Fri Dec 30 22:13:08 UTC 2005

Hi,

Dries asked me to look into this for 4.7.0, since those with low memory
settings or slow servers experience problems importing PO files. The
issue is quite simple:

 - the import script loads the complete file into a string
   (in an uberly complicated way :)

 - then it splits the string into $lines (into another variable)
   on which it does the processing

 - while processing the file, it builds up a $strings array
   which contains all the parsed information from the $lines array

 @ at this point, there are three copies of the PO file in memory,
   taking up increasingly more space:

    single string < line-by-line array < parsed array

 - all the above is done in a function, so when the function
   returns, the three copies of the file reduce to one (the
   returned $strings array), which is then processed element
   by element, and the database is modified accordingly
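
To make the three copies concrete, here is a minimal sketch of the
pattern described above. It is written in Python for illustration, not
taken from the PHP locale code; the function names are made up, and the
parsing is deliberately simplified (single-line msgid/msgstr pairs only,
no escapes, no continuation lines).

```python
def unquote(text):
    """Strip one pair of surrounding double quotes (escapes are not handled)."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def parse_po_all_at_once(path):
    """Hypothetical stand-in for the current approach: three copies in memory."""
    with open(path, encoding="utf-8") as handle:
        content = handle.read()        # copy 1: the whole file as one string
    lines = content.split("\n")        # copy 2: one list element per line
    strings = []                       # copy 3: the parsed (msgid, msgstr) pairs
    msgid = None
    for line in lines:
        line = line.strip()
        if line.startswith("msgid "):
            msgid = unquote(line[len("msgid "):])
        elif line.startswith("msgstr ") and msgid is not None:
            strings.append((msgid, unquote(line[len("msgstr "):])))
            msgid = None
    # content and lines stay alive until this point; only strings survives
    return strings
```

All three variables are alive at the same time while the loop runs,
which is where the peak memory use comes from.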

It would be really straightforward to read the file line by line, which
should take a lot of burden off PHP (at least a megabyte of text, plus
the overhead of the $lines array structure). That eliminates the first
two steps. But the best would be to process the PO tokens directly as
they are parsed (i.e. hand the information over to the database as soon
as possible, and do not keep a big array of this data in memory).
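
A rough sketch of what I mean, again in Python rather than PHP and under
the same simplifying assumptions (single-line entries only). The names
iter_po_entries, import_po and the store callback are hypothetical;
store stands in for whatever writes one translation to the database.

```python
def unquote(text):
    """Strip one pair of surrounding double quotes (escapes are not handled)."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def iter_po_entries(lines):
    """Yield (msgid, msgstr) pairs one at a time from any iterable of lines."""
    msgid = None
    for line in lines:
        line = line.strip()
        if line.startswith("msgid "):
            msgid = unquote(line[len("msgid "):])
        elif line.startswith("msgstr ") and msgid is not None:
            if msgid:  # skip the PO header entry, whose msgid is empty
                yield msgid, unquote(line[len("msgstr "):])
            msgid = None


def import_po(path, store):
    """Read the file line by line and hand each entry to store() right away."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        # A file object yields one line per iteration, so neither the whole
        # file nor a full list of lines is ever held in memory.
        for msgid, msgstr in iter_po_entries(handle):
            store(msgid, msgstr)
            count += 1
    return count
```

With this shape only the current line and the current entry are in
memory at any time, regardless of how big the PO file is.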

And here comes the question. Currently, if a PO file is broken for any
reason, the real SQL import process is skipped. But we can only tell
that the file is broken at some point by parsing it up to that point.
If the SQL changes are made incrementally, a broken PO file would end
up half-imported. The question is whether this is a problem or not. I
think it is a minor one, since you are supposed to upload a translation
which you are about to use, so if your PO file is broken, it is not a
problem that it is half-imported: you should fix it and try again.
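
To illustrate the half-imported case, here is a small self-contained
sketch. It uses Python with an in-memory SQLite database purely as a
stand-in; the translations table, the PoSyntaxError class and both
function names are invented for the example and do not match the real
locale tables or code.

```python
import sqlite3


class PoSyntaxError(ValueError):
    """Raised when the simplified parser meets a line it does not understand."""


def iter_po_entries(lines):
    """Yield (msgid, msgstr) pairs; raise PoSyntaxError at the first bad line."""
    msgid = None
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        if line.startswith("msgid "):
            msgid = line[len("msgid "):].strip()[1:-1]
        elif line.startswith("msgstr ") and msgid is not None:
            yield msgid, line[len("msgstr "):].strip()[1:-1]
            msgid = None
        else:
            raise PoSyntaxError(f"unexpected content on line {number}")


def import_incrementally(lines, db):
    """Write each entry as soon as it is parsed; stop at the first error."""
    imported = 0
    try:
        for msgid, msgstr in iter_po_entries(lines):
            db.execute("INSERT INTO translations VALUES (?, ?)", (msgid, msgstr))
            db.commit()
            imported += 1
    except PoSyntaxError as error:
        # Everything before the broken line is already in the database.
        return imported, str(error)
    return imported, None


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE translations (msgid TEXT, msgstr TEXT)")
sample = ['msgid "Home"', 'msgstr "Startseite"', 'garbage here',
          'msgid "Search"', 'msgstr "Suche"']
print(import_incrementally(sample, db))  # (1, 'unexpected content on line 3')
```

The first entry stays in the table even though the import stopped at
line 3, which is exactly the behaviour the question above is about.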

I cannot provide an estimate of when I will have the time to work on
this, but the file reading methods seem to be perfect low-hanging fruit,
even if the more direct SQL import takes a bit more time to do. Let us
first clear up whether the incremental SQL import is sufficient.

BTW, to be able to upload files on a hosted environment in the first
place, this open_basedir compatibility patch needs to get in:
http://drupal.org/node/5961

Happy New Year everybody,
Gabor