Limitations of the source-string-centric approach (was: proper way to translate complicated ...)


Cog Rusty

19 Jul 2007

11:54 a.m.

On 7/19/07, Gabor Hojtsy gabor@hojtsy.hu wrote:

...

Frederik 'Freso' S. Olesen wrote:

...
2007/7/14, Derek Wright drupal@dwwright.net:

...
[...] That's what I was hoping to hear, since I agree the split approach is better for everyone.

Except that for some language or other, it might make sense to switch the order of the paragraphs, take a paragraph out, add another paragraph, or in some other way manipulate the paragraphs to get the proper meaning. Splitting the string also risks confusion during translation in the interface. I agree that shorter text blurbs are both easier to translate and easier to get people to translate (not as much work), but they might also degrade the final outcome.

Yes, it is unfortunately not a win-win situation. It seemed to be a good compromise to have one t() per paragraph/list item.

Gabor

It is certainly a compromise. Taking the idea to an extreme, suppose we broke everything down to single words. Words are shared between many strings, so the translator would have far fewer strings to deal with, but a serious translation would obviously be impossible: single words are translated differently in different contexts, have genders, and are often part of grammatical structures and idioms.

Several times I have needed to abandon a good translation of an English term string to my language and use an awkward one instead, because in English the term happens to be common in two different contexts while in my language it is not and I have to accommodate both cases. And contributed modules join the party later to exacerbate such a problem.

I suspect that at the center of these issues is the gettext system, which allows an English string to have only one translation. (Perhaps the lack of context parameters in t() as well. I am not sure about that.)

Some time ago, for example, there was an unsolved issue with the month "May", which in English happens to be the same as its abbreviation, so it can't have an abbreviation different from the name of the month in any other language. Or a string as simple as "Active" can have genders (masculine, feminine, neuter) depending on what it refers to, but here we have to choose one gender for everything (which wouldn't impress the occasional intellectual visitor of a site).
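The "May" collision can be reproduced with a small sketch. It is written in Python rather than Drupal's PHP, and the catalog, the helper names and the Hungarian sample values are illustrative assumptions, not Drupal's or gettext's actual data structures. It only shows what happens when the English source string is the sole lookup key.

```python
# Hypothetical catalog keyed only by the English source string,
# i.e. a lookup that carries no context information at all.
catalog = {}

def add_translation(source, translation):
    # A later entry for the same source string silently replaces the earlier one.
    catalog[source] = translation

def t(source):
    # Fall back to the English source when no translation is stored.
    return catalog.get(source, source)

# "March" and "Mar" are different English strings, so both survive.
add_translation("March", "március")
add_translation("Mar", "márc.")

# "May" is both the full name and the abbreviation in English,
# so the two intended translations compete for a single key.
add_translation("May", "május")   # intended for the full month name
add_translation("May", "máj.")    # intended for the abbreviation

print(t("March"), t("Mar"))
print(t("May"))  # only the last stored value is available for both uses
```

Whichever translation is stored last wins, so one of the two uses of "May" is necessarily wrong in the target language.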

The rest of the time, slightly bigger chunks of text, case sensitivity or even occasional differences in the HTML markup of a string allow us to hack our way to a better translation.

Does it seem a viable idea for the future to automatically add optional hidden text to the original English strings to identify different contexts, and then filter it out in the output? At least for common strings which come from different t() calls.

I am also curious whether anyone knows of any project anywhere in the world which might one day enhance or replace the gettext system to better address the problem of a source string having a single translation.


Konstantin Käfer

19 Jul 2007

3 p.m.

New subject: Limitations of the source-string-centric approach (was: proper way to translate complicated ...)

On 19.07.2007, at 13:54, Cog Rusty wrote:

...

Several times I have needed to abandon a good translation of an English term string to my language and use an awkward one instead, because in English the term happens to be common in two different contexts while in my language it is not and I have to accommodate both cases. And contributed modules join the party later to exacerbate such a problem.

This is indeed a common problem. For the same reason, we changed the generic "Submit" to "Save" for Drupal 6, but that only solves one small problem.

...

I suspect that at the center of these issues is the gettext system, which allows an English string to have only one translation. (Perhaps the lack of context parameters in t() as well. I am not sure about that.)

Maybe we could add such context parameters (plain strings). For the abbreviated word "May" the context parameter could be "month abbreviation"; for the full month "May" the parameter would be "month full" (or something in that direction). Translators would have to advise programmers to use contexts for strings so that homographs can be separated in a clean fashion.
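A minimal sketch of this idea, in Python rather than PHP: the lookup key becomes the pair (source string, context) instead of the source string alone. The function signature and the catalog layout are assumptions made for illustration; the context names are the ones suggested above, and the Hungarian values are sample data.

```python
# Hypothetical catalog keyed by (source string, context).
catalog = {
    ("May", "month full"): "május",
    ("May", "month abbreviation"): "máj.",
}

def t(source, context=""):
    # Strings without a context use the empty context, so existing calls
    # keep working; untranslated strings fall back to the English source.
    return catalog.get((source, context), source)

print(t("May", "month full"))
print(t("May", "month abbreviation"))
print(t("May"))  # no context given and no entry stored: English fallback
```

The two homographs no longer collide, at the cost that programmers have to supply the context at every call site where it matters.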

...

Some time ago, for example, there was an unsolved issue with the month "May", which in English happens to be the same as its abbreviation, so it can't have an abbreviation different from the name of the month in any other language. Or a string as simple as "Active" can have genders (masculine, feminine, neuter) depending on what it refers to, but here we have to choose one gender for everything (which wouldn't impress the occasional intellectual visitor of a site).

That's also an issue I have discovered, for example with the string "Your @type has been created." (with @type being the name of a content type).

I'll explain the problem for those not familiar with it: content type names have a specific gender in most languages. Let's take "story", which translates to "Artikel" in German. The word "Artikel" is masculine, thus the sentence should be "Ihr Artikel wurde erstellt." (Ihr = Your). However, if we use page = Seite, we end up with "Ihr Seite wurde erstellt." The problem here is that "Seite" is feminine, thus requiring "Ihre" in German. This results in grammatically incorrect sentences. A wrong gender is not a minor issue for a native speaker; it really disrupts the reading flow and may cast a bad light on the site creator.
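The effect is easy to reproduce. The following Python sketch uses a simplified stand-in for placeholder substitution (not Drupal's actual t() implementation); the dictionaries and function names are hypothetical, while the German strings are the ones from the explanation above.

```python
# One translated template must serve every content type.
translations_de = {
    "Your @type has been created.": "Ihr @type wurde erstellt.",
}
type_names_de = {"story": "Artikel", "page": "Seite"}

def format_string(template, args):
    # Plain placeholder substitution; it knows nothing about grammar.
    for placeholder, value in args.items():
        template = template.replace(placeholder, value)
    return template

def created_message(content_type):
    template = translations_de["Your @type has been created."]
    return format_string(template, {"@type": type_names_de[content_type]})

print(created_message("story"))  # Ihr Artikel wurde erstellt. (correct)
print(created_message("page"))   # Ihr Seite wurde erstellt. (should be "Ihre Seite")
```

Because the possessive is fixed in the template while the noun arrives later as a variable, the translator has no place to make the two agree.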

A possible solution to that problem could be to:
a) Remove variables that are embedded in a sentence (strings like "Do you really want to delete %title?" are perfectly fine, since the % indicates that this is user-supplied text that drops out of the regular reading flow).
b) Provide a way to override translations for specific variable contents. The site administrator could, for example, override "Your @type has been created." for @type = 'Seite' by replacing it with "Ihre @type wurde erstellt."
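Option b) could look roughly like the Python sketch below. The override table, its key layout and the function signature are assumptions for illustration only; nothing like this exists in Drupal as described in this thread.

```python
# Default translation for the template.
defaults = {
    "Your @type has been created.": "Ihr @type wurde erstellt.",
}

# Hypothetical per-value overrides, keyed by
# (source string, placeholder name, substituted value).
overrides = {
    ("Your @type has been created.", "@type", "Seite"): "Ihre @type wurde erstellt.",
}

def t(source, args):
    template = defaults.get(source, source)
    # Use an override if one is registered for any of the supplied values.
    for placeholder, value in args.items():
        override = overrides.get((source, placeholder, value))
        if override is not None:
            template = override
            break
    for placeholder, value in args.items():
        template = template.replace(placeholder, value)
    return template

print(t("Your @type has been created.", {"@type": "Artikel"}))  # Ihr Artikel wurde erstellt.
print(t("Your @type has been created.", {"@type": "Seite"}))    # Ihre Seite wurde erstellt.
```

The table grows with every combination of template and value that needs special treatment, which is one reason this approach may be hard to maintain on a real site.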

...

The rest of the time, slightly bigger chunks of text, case sensitivity or even occasional differences in the HTML markup of a string allow us to hack our way to a better translation.

I can't agree with that. If we split up long texts into paragraph-length texts, the context is absolutely clear, and the likelihood that an entire paragraph should be translated differently in different contexts approaches zero.

...

Does it seem a viable idea for the future to automatically add optional hidden text to the original English strings to identify different contexts, and then filter it out in the output? At least for common strings which come from different t() calls.

I am also curious whether anyone knows of any project anywhere in the world which might one day enhance or replace the gettext system to better address the problem of a source string having a single translation.

_______________________________________________
translations mailing list
translations@drupal.org
http://lists.drupal.org/mailman/listinfo/translations

Konstantin Käfer – http://kkaefer.com/

Gerhard Killesreiter

3:09 p.m.

New subject: Limitations of the source-string-centric approach


Konstantin Käfer schrieb:

...

On 19.07.2007, at 13:54, Cog Rusty wrote:

...
I suspect that at the center of these issues is the gettext system, which allows an English string to have only one translation. (Perhaps the lack of context parameters in t() as well. I am not sure about that.)

Maybe we could add such context parameters (plain strings). For the abbreviated word "May" the context parameter could be "month abbreviation"; for the full month "May" the parameter would be "month full" (or something in that direction). Translators would have to advise programmers to use contexts for strings so that homographs can be separated in a clean fashion.

I am not sure that this would work. The whole gettext stuff can't deal with this May/May situation.

...

...
Some time ago, for example, there was an unsolved issue with the month "May", which in English happens to be the same as its abbreviation, so it can't have an abbreviation different from the name of the month in any other language. Or a string as simple as "Active" can have genders (masculine, feminine, neuter) depending on what it refers to, but here we have to choose one gender for everything (which wouldn't impress the occasional intellectual visitor of a site).

That's also an issue I have discovered, for example with the string "Your @type has been created." (with @type being the name of a content type).

I'll explain the problem for those not familiar with it: content type names have a specific gender in most languages. Let's take "story", which translates to "Artikel" in German. The word "Artikel" is masculine, thus the sentence should be "Ihr Artikel wurde erstellt." (Ihr = Your). However, if we use page = Seite, we end up with "Ihr Seite wurde erstellt." The problem here is that "Seite" is feminine, thus requiring "Ihre" in German. This results in grammatically incorrect sentences. A wrong gender is not a minor issue for a native speaker; it really disrupts the reading flow and may cast a bad light on the site creator.

A possible solution to that problem could be to: a) Remove variables that are embedded in a sentence (strings like "Do you really want to delete %title?" are perfectly fine, since the % indicates that this is user-supplied text that drops out of the regular reading flow).

I think this is the cleanest solution. However, it could be difficult to implement programming-wise.

...

b) Provide a way to override translations for specific variable contents. The site administrator could, for example, override "Your @type has been created." for @type = 'Seite' by replacing it with "Ihre @type wurde erstellt."

Seems ugly to me.

...

...
The rest of the time, slightly bigger chunks of text, case sensitivity or even occasional differences in the HTML markup of a string allow us to hack our way to a better translation.

I can't agree with that. If we split up long texts into paragraph-length texts, the context is absolutely clear, and the likelihood that an entire paragraph should be translated differently in different contexts approaches zero.

I agree.

Cheers, Gerhard


Gabor Hojtsy

5:33 p.m.

New subject: Limitations of the source-string-centric approach

Konstantin Käfer wrote:

...

On 19.07.2007, at 13:54, Cog Rusty wrote:

...
Several times I have needed to abandon a good translation of an English term string to my language and use an awkward one instead, because in English the term happens to be common in two different contexts while in my language it is not and I have to accommodate both cases. And contributed modules join the party later to exacerbate such a problem.

This is indeed a common problem. For the same reason, we changed the generic "Submit" to "Save" for Drupal 6, but that only solves one small problem.

Well, that improvement was a good change on its own; "Submit" was an awkward button label anyway.

...

...
I suspect that at the center of these issues is the gettext system, which allows an English string to have only one translation. (Perhaps the lack of context parameters in t() as well. I am not sure about that.)

Maybe we could add such context parameters (plain strings). For the abbreviated word "May" the context parameter could be "month abbreviation"; for the full month "May" the parameter would be "month full" (or something in that direction). Translators would have to advise programmers to use contexts for strings so that homographs can be separated in a clean fashion.

There are basically two ways to go about the 'track' (as in audio module and tracker module) and the May/May problem:

- Add more context information to the t() call, which ends up in the Gettext files (maybe even in the strings themselves). This could be trim(t('~shortmonth May', array('~shortmonth' => ''))), but of course this can be automated (by stripping the ~ markers from strings).

- Use 'constants' or 'symbolic strings' instead of actual strings: t(T_MONTHS_MAY), t(T_SYSTEM_HELP_14, array('@url' => url(....))) and t(T_HOME) and friends. This would slow down English sites, as they too would need to look up the strings for the 'constants'. (Also, we should not store all strings as constants in memory.)

Both ways lead to a solution, and both are ugly unfortunately.
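Both approaches can be sketched side by side. This is a Python illustration rather than Drupal code: the function names, the "~longmonth" marker, the T_MONTHS_MAY_SHORT key and the Hungarian sample values are hypothetical, while "~shortmonth" and T_MONTHS_MAY are taken from the examples above.

```python
import re

# Approach 1: a context marker embedded in the source string.
CONTEXT_PREFIX = re.compile(r"^~\w+\s+")

marked_translations = {
    "~shortmonth May": "máj.",
    "~longmonth May": "május",
}

def t_marked(source):
    # The full string, marker included, is the lookup key.
    if source in marked_translations:
        return marked_translations[source]
    # No translation: strip the marker so English output stays clean.
    return CONTEXT_PREFIX.sub("", source)

# Approach 2: symbolic keys; even English needs a lookup table.
ENGLISH = {"T_MONTHS_MAY": "May", "T_MONTHS_MAY_SHORT": "May"}
HUNGARIAN = {"T_MONTHS_MAY": "május", "T_MONTHS_MAY_SHORT": "máj."}

def t_symbolic(key, catalog):
    # Fall back to the English table when the catalog has no entry.
    return catalog.get(key, ENGLISH[key])

print(t_marked("~shortmonth May"), t_marked("~longmonth May"))
print(t_marked("~shortmonth June"))  # untranslated: marker is stripped
print(t_symbolic("T_MONTHS_MAY_SHORT", HUNGARIAN))
print(t_symbolic("T_MONTHS_MAY", {}))  # English site still performs a lookup
```

The second half makes the performance concern visible: with symbolic keys there is no longer a usable string in the code itself, so every site, including English ones, pays for a table lookup.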

...

That's also an issue I have discovered, for example with the string "Your @type has been created." (with @type being the name of a content type).

I'll explain the problem for those not familiar with it: content type names have a specific gender in most languages. Let's take "story", which translates to "Artikel" in German. The word "Artikel" is masculine, thus the sentence should be "Ihr Artikel wurde erstellt." (Ihr = Your). However, if we use page = Seite, we end up with "Ihr Seite wurde erstellt." The problem here is that "Seite" is feminine, thus requiring "Ihre" in German. This results in grammatically incorrect sentences. A wrong gender is not a minor issue for a native speaker; it really disrupts the reading flow and may cast a bad light on the site creator.

The Hungarian team works around this by actually translating it as '@type has been created'.

...

A possible solution to that problem could be to: a) Remove variables that are embedded in a sentence (strings like "Do you really want to delete %title?" are perfectly fine, since the % indicates that this is user-supplied text that drops out of the regular reading flow).

Unfortunately "Do you really want to delete %title?" does not work well either, as in Hungarian we need an article before %title, effectively translating "Do you really want to delete the %title post?". Anyway, how do you expect these strings to be modified? There are *lots*, and this type of string construction makes the interface so much more friendly, even if the translation is not 100% accurate.

...

b) Provide a way to override translations for specific variable contents. The site administrator could, for example, override "Your @type has been created." for @type = 'Seite' by replacing it with "Ihre @type wurde erstellt."

Well, things like articles and genders would need programmatic backends.
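One way to read "programmatic backend" is a small piece of language-specific code that picks the grammatical form from a table of noun genders. The Python sketch below is only an assumption about what such a backend might look like; the tables and the function are hypothetical, and "Buch" is added purely as a neuter example that does not appear in this thread.

```python
# Hypothetical language-specific data for German.
NOUN_GENDER_DE = {"Artikel": "m", "Seite": "f", "Buch": "n"}
POSSESSIVE_YOUR_DE = {"m": "Ihr", "f": "Ihre", "n": "Ihr"}

def created_message_de(type_name):
    # Choose the possessive that agrees with the noun's gender.
    gender = NOUN_GENDER_DE[type_name]
    return f"{POSSESSIVE_YOUR_DE[gender]} {type_name} wurde erstellt."

print(created_message_de("Artikel"))  # Ihr Artikel wurde erstellt.
print(created_message_de("Seite"))    # Ihre Seite wurde erstellt.
```

This handles one sentence in one language in the nominative case; a general backend would need comparable rules per language, plus gender data for site-defined content type names.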

...

...
I am also curious whether anyone knows of any project anywhere in the world which might one day enhance or replace the gettext system to better address the problem of a source string having a single translation.

Mozilla and many others (also in Java) use .properties files, which is essentially the constants method explained above. Other PHP CMSes use plain PHP constants.

Gabor
