Why is locale import slow and a memory eater?
Hi,

Dries asked me to look into this for 4.7.0, since those with low memory settings or slow servers experience problems importing PO files. The issue is quite simple:

- the import script loads the complete file into a string (in an overly complicated way :)
- then it splits the string into $lines (another variable), on which it does the processing
- while processing the file, it builds up a $strings array which contains all the parsed information from the $lines array
- at this point, there are three copies of the PO file in memory, each taking up more space than the last: single string < line-by-line array < parsed array
- all of the above is done in a function, so when the function returns, the three copies of the file are reduced to one (the returned $strings array), which is then processed element by element, and the database is modified accordingly

It would be really straightforward to read the file line by line, which should take a lot of burden off PHP (at least a megabyte of text, plus the overhead of the $lines array structure). That eliminates the first two steps. But the best option would be to process the PO tokens directly as they are parsed (i.e. hand information over to the database as soon as possible, and not keep a big array of this data in memory).

And here comes the question. Currently, if a PO file is broken for any reason, the real SQL import process is skipped. But we can only tell whether the file is broken at some point by parsing it up to that point. If the SQL changes are made incrementally, then a broken PO file would end up half-imported. The question is whether this is a problem or not. I think it is a minor problem, since you are supposed to upload a translation which you are about to use, so if your PO file is broken, it is not a problem that it is half-imported; you should fix it and try again.
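To make the line-by-line idea concrete, here is a minimal sketch of a streaming reader. It is written in Python purely for illustration and is not the Drupal import code; the names `unquote` and `iter_po_entries` are made up for this example. It assumes a simplified PO file with only `msgid`, `msgstr`, continuation strings and comments (no plural forms or contexts) and handles only the `\n` and `\"` escapes.

```python
import io


def unquote(text):
    """Strip the surrounding double quotes and decode \\n and \\" only."""
    text = text.strip()
    if len(text) < 2 or text[0] != '"' or text[-1] != '"':
        raise ValueError("expected a quoted string, got: %r" % text)
    return text[1:-1].replace('\\n', '\n').replace('\\"', '"')


def iter_po_entries(lines):
    """Yield (msgid, msgstr) pairs one at a time from an iterable of lines.

    Only the entry currently being assembled is held in memory; the file
    itself is never loaded as one string or as one array of lines.
    """
    msgid, msgstr, current = None, None, None
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # blank lines and comments carry no translation data
        if line.startswith('msgid '):
            if msgid is not None:
                if msgstr is None:
                    raise ValueError("line %d: msgid without msgstr" % number)
                yield msgid, msgstr  # previous entry is complete
            msgid, msgstr, current = unquote(line[6:]), None, 'msgid'
        elif line.startswith('msgstr '):
            if current != 'msgid':
                raise ValueError("line %d: msgstr without msgid" % number)
            msgstr, current = unquote(line[7:]), 'msgstr'
        elif line.startswith('"'):
            # Continuation line: append to whichever token is open.
            if current == 'msgid':
                msgid += unquote(line)
            elif current == 'msgstr':
                msgstr += unquote(line)
            else:
                raise ValueError("line %d: string outside an entry" % number)
        else:
            raise ValueError("line %d: unrecognised line" % number)
    if msgid is not None:
        if msgstr is None:
            raise ValueError("end of file: msgid without msgstr")
        yield msgid, msgstr


# Usage: any iterable of lines works, including an open file object.
sample = io.StringIO('# comment\nmsgid "Home"\nmsgstr "Címlap"\n')
for source, translation in iter_po_entries(sample):
    print(source, '=>', translation)
```

Because the function is a generator, a caller can hand each pair to the database as soon as it is produced, which is the second improvement described above.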
I cannot estimate when I will have time to work on this, but the file reading changes seem to be perfect low-hanging fruit, even if the more direct SQL import takes a bit more time to do. Let us first establish that the incremental SQL import is sufficient.

BTW, before files can be uploaded in a hosted environment, this open_basedir compatibility patch needs to get in: http://drupal.org/node/5961

Happy New Year everybody,
Gabor
Gabor Hojtsy wrote:
If the SQL changes are made incrementally, then the PO file would end up half-imported. Question is if it is a problem or not. I think this is a minor problem, since you are supposed to upload a translation which you are about to use, so if your po file is broken it is not a problem that it is half imported, you should fix and try again.
I agree, the problem is minor. We could add a test to the packaging script and not create tarballs for broken PO files (not sure we currently do that). Cheers, Gerhard
On 30 Dec 2005, at 23:13, Gabor Hojtsy wrote:
And here comes the question. Now, if a PO file is broken for any reason, the real SQL import process is skipped. But whether the file is broken at some point is only possible to tell, if we parse the file up to that point. If the SQL changes are made incrementally, then the PO file would end up half-imported. Question is if it is a problem or not. I think this is a minor problem, since you are supposed to upload a translation which you are about to use, so if your po file is broken it is not a problem that it is half imported, you should fix and try again.
Good analysis; it matches my own observations. I don't think a partial import is a problem. Quite the contrary; I'd rather have 99% of the strings translated than 0% of the strings. I could translate the offending strings manually, or if I'm lucky, the offending strings are only accessible for the administrator, or they are part of some disabled feature(s).
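A small sketch of what such a partial import could look like, again in Python for illustration only. The `translations` table, the `import_incrementally` function and the sample strings are all hypothetical and do not reflect Drupal's locale schema. It assumes the parser signals a broken file by raising `ValueError`.

```python
import sqlite3


def import_incrementally(entries, connection):
    """Store each (source, translation) pair as soon as it is parsed.

    Returns (stored_count, error). If the parser raises ValueError midway,
    everything stored before that point stays in the database.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS translations "
        "(source TEXT PRIMARY KEY, translation TEXT)"
    )
    stored = 0
    try:
        for source, translation in entries:
            connection.execute(
                "INSERT OR REPLACE INTO translations VALUES (?, ?)",
                (source, translation),
            )
            connection.commit()  # commit per entry so earlier rows survive
            stored += 1
    except ValueError as error:
        return stored, error
    return stored, None


def broken_entries():
    """Made-up stand-in for a parser that hits a broken line after two entries."""
    yield ("Home", "Címlap")
    yield ("Log in", "Belépés")
    raise ValueError("line 9: unterminated string")


connection = sqlite3.connect(":memory:")
stored, error = import_incrementally(broken_entries(), connection)
print("stored %d strings, stopped by: %s" % (stored, error))
```

Committing after every entry is the simplest way to show the behaviour; batching the commits would be faster but changes how much is kept when the file breaks.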
BTW, before files can be uploaded in a hosted environment, this open_basedir compatibility patch needs to get in: http://drupal.org/node/5961
I had a quick look at this but it might need some more work; see issue. Thanks for the carefully written patches and analysis, Goba. -- Dries Buytaert :: http://www.buytaert.net/
Dries Buytaert wrote:
And here comes the question. Now, if a PO file is broken for any reason, the real SQL import process is skipped. But whether the file is broken at some point is only possible to tell, if we parse the file up to that point. If the SQL changes are made incrementally, then the PO file would end up half-imported. Question is if it is a problem or not. I think this is a minor problem, since you are supposed to upload a translation which you are about to use, so if your po file is broken it is not a problem that it is half imported, you should fix and try again.
Good analysis; it matches my own observations. I don't think a partial import is a problem. Quite the contrary; I'd rather have 99% of the strings translated than 0% of the strings. I could translate the offending strings manually, or if I'm lucky, the offending strings are only accessible for the administrator, or they are part of some disabled feature(s).
Ok. Just some quick data on how much memory an import eats now, using the upcoming Drupal 4.7.0 Hungarian translation (currently 357899 bytes).

In _locale_import_read_po():

memory before read:                 5923056
memory after read:                  6281152
memory after split to lines:        7299384
memory after $strings generation:   8616160

Back in _locale_import_po(), which calls the above function:

memory after return:                7285416

This shows that reading in the ~350k file results in a memory increase of roughly that size, but then splitting it adds another megabyte of memory usage (this is the overhead of the array), and the structured storage then takes roughly 1.3M more (it is a somewhat more complicated array). After the return, it is clearly visible that the parsed structure is kept while the others are freed, leaving usage at about 7.3M.

Goba
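For anyone who wants to check the arithmetic, the step-by-step increases can be recomputed from the readings above. This is a throwaway Python sketch; the labels and the `deltas` helper are made up for this example, and only the byte counts come from the measurements.

```python
# Byte counts copied from the measurements; labels are descriptive only.
readings = [
    ("before read", 5923056),
    ("after read", 6281152),
    ("after split to lines", 7299384),
    ("after $strings generation", 8616160),
    ("after return", 7285416),
]


def deltas(readings):
    """Return (label, change in bytes) for each step relative to the previous one."""
    return [
        (label, value - previous)
        for (_, previous), (label, value) in zip(readings, readings[1:])
    ]


for label, change in deltas(readings):
    print("%-28s %+10d bytes" % (label, change))

# Net amount still held after the function has returned.
retained = readings[-1][1] - readings[0][1]
print("%-28s %+10d bytes" % ("retained after return", retained))
```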
Dries Buytaert wrote:
On 30 Dec 2005, at 23:13, Gabor Hojtsy wrote:
And here comes the question. Now, if a PO file is broken for any reason, the real SQL import process is skipped. But whether the file is broken at some point is only possible to tell, if we parse the file up to that point. If the SQL changes are made incrementally, then the PO file would end up half-imported. Question is if it is a problem or not. I think this is a minor problem, since you are supposed to upload a translation which you are about to use, so if your po file is broken it is not a problem that it is half imported, you should fix and try again.
Good analysis; it matches my own observations. I don't think a partial import is a problem. Quite the contrary;
This analysis has now been turned into a patch for review. The decrease in memory usage is fantastic; see the attached chart. http://drupal.org/node/47610

Goba
participants (3)
- Dries Buytaert
- Gabor Hojtsy
- Gerhard Killesreiter