Towards a localised desktop

What does it take to get a complete, localised, free software desktop?

Free software has long been considered much better suited for localisation, since anyone is free to undertake the job of adjusting it to their own locale preferences.

Any division of the entire l10n process is going to be artificial, but we're still going to make one. We're going to make our system first recognize our locale, then enable it to display localised content properly, and finally let us input such content. With those basics done, we'll turn to describing the translation framework from both the developers' and the users' point of view.

To at least have some direction here, we're going to base our free software desktop on GNU/Linux, X.Org or XFree86 and Gnome. To a varying extent, anything we say will be applicable to other free software systems.

We should also note that any system paths we refer to are based on common installation paths, and hope that readers will be able to correctly deduce their own paths if their packages are installed with differing prefix, datadir, sysconfdir, etc.

Related standards

There are many international and local standards which govern the usage and technicalities of l10n support. Let's list only a few of the globally useful ones, which we'll refer to in the text.

ISO 639 lists two-letter ("alpha-2", ISO 639-1) and three-letter ("alpha-3", ISO 639-2) codes for all the world's languages. Three-letter codes are easier to register, and the requirements are fairly small: if there are at least 50 documents in a language, it's entitled to a three-letter code. Two-letter codes are reserved for "established" languages (see [1] for what is required: lots of documents using the language and official support by an institution). One can apply for an ISO 639-1 or 639-2 code at the web pages of the US Library of Congress [2].

The ISO 3166 standard lists two-letter, three-letter and three-digit country codes. One can consult the ISO Maintenance Agency web pages [3] directly to find the relevant code. New codes are assigned only through United Nations procedures.

ISO 10646 and Unicode [4] are synchronised standards which define the "Universal Character Set": a set of all characters useful for encoding texts written in any language. They also go as far as to define several transformation formats which are directly usable on computers, with UTF-8 (also described in RFC 3629) being the most interesting to free software desktop users.

ISO/IEC 9945 (available online at [5], commonly dubbed "POSIX", and a subset of the "Single Unix Specification") is a standard describing portable operating systems and their interfaces, which include interfaces relevant to internationalisation and localisation. Along with ISO 14652 ("Cultural Conventions Technologies", though we'll commonly use the term "locale") and ISO 14651 (defining a commonly usable collation table), both available in PDF format on the web, this standard presents the core of localisation support on free software systems.

Now, anyone can be put off by the sheer number of standards one would need to consider just to get her locale supported. But fear not: we're going to take shortcuts so we can avoid reading thousands of dull specification pages.

Letting the system know about our locale

The first step is to make the system acknowledge that what we need is another locale. If it already knows about our locale, then we're all set (and most likely, it does). But for those unfortunate ones whose locale the system doesn't know about, or has wrong information for, let's see how to introduce our locale to the system.

libc locales

The Standard C library (defined by the ISO C and POSIX standards), or "libc" for short, is what provides many of the features for handling special, localised data. It does so through the concept of "locales": collections of data representing a language as used in a certain region (commonly a country), with some optional features (eg. using the euro as the currency instead of whatever is the default).

Locales provide information such as the measurement system used (metric/international or "imperial"), currency, date and time formats, collation tables (used for sorting), address, telephone, numeric and paper formats, etc.

comment_char %
escape_char  /

LC_TIME
day     "<U0044><U0061><U0079><U0020><U0031>";/
        "<U0044><U0061><U0079><U0020><U0032>";/
        "<U0044><U0061><U0079><U0020><U0033>";/
        "<U0044><U0061><U0079><U0020><U0034>";/
        "<U0044><U0061><U0079><U0020><U0035>";/
        "<U0044><U0061><U0079><U0020><U0036>";/
        "<U0044><U0061><U0079><U0020><U0037>"
END LC_TIME

LC_COLLATE
copy "iso14651_t1"
END LC_COLLATE

% Repeat for CTYPE, MESSAGES, MONETARY, NUMERIC, PAPER, 
% 	     MEASUREMENT, NAME, TELEPHONE, ADDRESS
LC_IDENTIFICATION
copy "de_DE"
END LC_IDENTIFICATION

...

Here, we base our locale on the de_DE locale (using the copy "de_DE" directive), and only replace the LC_TIME and LC_COLLATE definitions. Since our LC_TIME is very much incomplete, we're going to get many warnings from localedef, but we can ignore them by using -c to force compilation. As for collation, we're using the global table from the ISO 14651 standard. ISO 14652 describes some mechanisms for extending it by providing "deltas", but we're not going to go into details.

We can save this as /usr/share/i18n/locales/testlocale, and then test it using the following commands:

# localedef -c -f UTF-8 -i testlocale testlocale.UTF-8
# LANG=testlocale.UTF-8 date +%A
Day 4

Petter Reinholdtsen keeps a web page [*] related to GNU libc locales, where you can learn more about writing your own locales. The easiest way is to start from other locales.

Working with locales

Apart from working with the localedef command, it's important to understand the basics of setting one's locale. POSIX systems make use of several environment variables, among them LANG, LC_ALL and one LC_category variable for each of the categories in a locale. While you can set each of them separately, one most commonly sets just LANG or LC_ALL, and tests her settings with the locale command.

On GNU systems, one can also use the more sophisticated LANGUAGE variable, which can hold several fallback languages to try if the first one is not available. It also takes precedence over all the other variables on GNU systems.
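As a minimal, runnable sketch (we use the built-in "C" and "POSIX" locales here only because they are guaranteed to exist everywhere; a real desktop would use something like de_DE.UTF-8):

```shell
# Default locale for all categories:
export LANG=C
# Override a single category, leaving the others at the LANG default:
export LC_NUMERIC=POSIX
# LC_ALL would override everything above, so make sure it is unset:
unset LC_ALL
# 'locale' with no arguments prints the effective value of every category:
locale
```

Categories explicitly set in the environment are printed as-is; the remaining ones are shown in quotes, indicating they were derived from LANG.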

To fully integrate our newly created locale into the system, we want to make X11 aware of its existence; we do that by adjusting the /usr/X11R6/lib/X11/locale/locale.dir and locale.alias files.

Finally, we want our default Gnome display manager (GDM) to list our locale as one of the available settings. That can be done by adding an appropriate entry to /etc/gdm/locale.conf, but it's better to ask for it to be integrated directly into GDM via Gnome's Bugzilla: this has the added advantage of allowing translation of the locale name.

Problems of display

Once our system accepts that our language and locale exist, we want it to be able to display them. For some languages (eg. those based on the Latin script), this is mostly trivial. But in many cases, you're going to end up with either missing fonts or insufficiently capable renderers.

Fonts

In most cases, we should be able to find good-enough free software fonts which cover our desired subset of the UCS. It's easiest for simpler scripts, such as Latin, Cyrillic and Greek, among others. For Latin scripts, many prefer the Bitstream Vera fonts, made available thanks to Bitstream Inc. and the Gnome Foundation. There are also quite a few Vera derivatives aiming for better coverage, among them DejaVu (covers the entire Latin Extended regions and Cyrillic) and Arev (only Sans faces at the time of this writing; adds Greek and Cyrillic). There are other excellent font packages as well, such as URW-CYR (also repackaged as gsfonts 8.11) and Computer Modern Unicode.

However, if your language requires more special features, you'll probably need to look for dedicated fonts for your language and/or script. This is the case with many complex scripts, so you're better off looking for a font that exactly matches your script.

If you're unable to find a suitable font, you're basically left with only one option: develop your own font. While it may seem daunting at first, there's an excellent tool called FontForge [6] which should make it much easier to do.

Drawing your text

Finally, once you've got your font, any X program will be able to render glyphs from it, thanks to the FreeType2 library. But that's hardly enough, and we can observe several problems.

First off, we need our system to recognize that this font is suitable for rendering text in our language. On most new free software systems, this is done using the fontconfig library (the previous "base X fonts" mechanisms were based on the concept of encodings, and on long, strange "patterns" describing each font and which encodings it supports). Fontconfig tries hard to get as close as possible to the application-requested font, but if it can't match exactly, it will resort to giving out the best font for the desired language. It does its work using a mapping from sets of UCS codepoints to languages, and if it doesn't work correctly for you, you can adjust this mapping yourself.

Fontconfig also allows one to tune the rendering provided by the FreeType2 library, such as turning auto-hinting on or off for certain faces and sizes. It also allows setting "aliases" so that a request for a font which doesn't exist on a system yields another suitable font.
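For illustration, a personal fontconfig file might combine both features; this is only a sketch, and the font names in it are purely illustrative:

```
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <!-- Requests for Helvetica yield Bitstream Vera Sans instead -->
  <alias>
    <family>Helvetica</family>
    <prefer><family>Bitstream Vera Sans</family></prefer>
  </alias>
  <!-- Disable auto-hinting for one face only -->
  <match target="font">
    <test name="family"><string>DejaVu Sans</string></test>
    <edit name="autohint" mode="assign"><bool>false</bool></edit>
  </match>
</fontconfig>
```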

Since we're basing our free software desktop on Gnome technology, the final step in the rendering process is going to be Pango: the framework for drawing international text used in Gtk+. Pango includes "shapers" (modules designed for rendering particular languages and/or scripts) for many languages, including Arabic, Hangul, Hebrew, Indic, Thai, Tibetan and Syriac at the time of this writing.

Right-To-Left

Right-to-left rendering requires special care in many applications. While Pango will properly render any properly tagged RTL text, there are more issues to be resolved.

First, we want our entire UI to be "mirrored", which includes menus, toolbars, scrollbars and other UI elements. This is relegated to the toolkit, and Gtk+ handles it through the translation of the message "default:LTR": you have to translate it to "default:RTL" in order to get the desired behaviour for RTL languages (see below for how we do the translation itself).
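In practice this is just an ordinary message in Gtk+'s translation catalog; a sketch of what the translated entry looks like for an RTL language:

```
msgid "default:LTR"
msgstr "default:RTL"
```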


Untranslated/English Gedit in RTL rendering.

Most problems arise only when we want to combine RTL and LTR text in a single paragraph. However, Gnome and Gtk+ have a solid foundation for bidirectional text, so that is unlikely to be a problem in the general case, even though it may become a problem in a particular case, depending on the application. For instance, the Evolution HTML mail renderer/editor, GtkHTML 3, has only recently had most of its RTL problems resolved.

For cases where automatic decision making doesn't work, you can manually add specific direction markers by right-clicking the text field, choosing "Insert Unicode control character" from the menu, and selecting the appropriate direction mark. This allows you, for instance, to start your RTL text with an otherwise LTR word (such as "GNOME").

Problems of input

For most of the world's languages, inputting appropriate text is a matter of simply typing the keys engraved on one's keyboard. The X Window System has long offered basic remapping support through xmodmap. Still, using the XKB extension brings many advantages, especially for i18n. Furthermore, XFree86 4.3.0 introduced "multi-layout" layouts (each layout can be combined with any of the others).

When plain keyboard remapping isn't sufficient, we'll need to look into compositing and input methods.

Basics of XKB

Keyboards communicate with the operating system by sending out "keycodes" which indicate what key has been pressed. The mapping between short key designators and numeric keycodes is done in the /etc/X11/xkb/keycodes/ files: we won't discuss modifying these, since we're assuming that sufficient mappings already exist. You might need to look into them to find out which key you want to assign certain functions to.

X (and XKB) communicates with applications in terms of "keysyms" ("key symbols"). Internally, these are just integers assigned a certain meaning. For your copy of X.Org or XFree86, you can find all keysym definitions in /usr/X11R6/include/X11/keysymdef.h (you need to strip the leading "XK_" from the names). For Unicode symbols which are not assigned a keysym, you can use the form "U" + hexadecimal Unicode codepoint to refer to the symbol. There are also several special keysyms, such as "any" and "NoSymbol" (these two are equivalent, though). Most common keysyms have names that directly say what they are (eg. "a", "A", "0", "Cyrillic_a", etc.).

Finally, each key can have a different type: alphabetic keys behave differently from numeric or control keys, and that is reflected in their type. For instance, alphabetic keys are affected by "Shift Lock" (better known as "Caps Lock"), while numeric keys are not. Also, each key can have multiple "levels": a trivial example is a key with two levels, basic-level and shift-level, the latter reached by pressing and holding down the Shift key. How many levels there are, and how to reach them, is defined in /etc/X11/xkb/types/. We'll use the FOUR_LEVEL and FOUR_LEVEL_ALPHABETIC types for our purposes, which provide four levels per key and the appropriate Shift Lock behaviour.

XKB symbol definitions

Multiple different keyboard layouts are termed "groups" in X terminology. One is limited to a maximum of 4 groups loaded at any one time. In the past, keymaps themselves defined several groups per key, which led to a lot of duplication. Starting with XFree86 4.3.0, all keymaps were reorganised to be "multi-layout": you can now easily merge more than one keymap to fill the 4 available group slots however you please. For compatibility reasons, though, this meant moving all keymaps from /etc/X11/xkb/symbols/ into a new hierarchy at /etc/X11/xkb/symbols/pc/ (note the pc subdirectory).

Each symbol file is specific to a single language or country, or is shared between several other symbol files through inclusion mechanisms. One symbol file can contain several "variants", the first of which is commonly called "basic" and is used when no variant is specified.

For demonstration, let's define a map which outputs Z when one presses the AC01 key (usually the "A" key on English keyboards; "AC01" means the 1st [01] key in the 3rd [C] row from the bottom of the keyboard, though this is a guideline rather than a strict rule). We also want to merge this with the basic Latin keyboard, and separately define dead and combining accents on the third and fourth levels.

default partial alphanumeric_keys
xkb_symbols "basic" {
  name[Group1]= "Test keyboard layout";

  include "pc/latin(basic)"
  include "pc/test(atoz)"
  include "pc/test(level3)"
};

partial alphanumeric_keys
xkb_symbols "atoz" {
  key.type[Group1] = "FOUR_LEVEL_ALPHABETIC";
  key <AC01> {   [ z,        Z,   any,any ]   };
};

partial alphanumeric_keys
xkb_symbols "level3" {
  key <AC01> {   [ any,any,     dead_acute,          U301 ]   };
};

If we save this as /etc/X11/xkb/symbols/pc/test, we can test it using setxkbmap test.

Let's take a look at what we've got here. First, we define a default variant (to be loaded if no variant is specified). We name it "Test keyboard layout", include the basic variant from pc/latin into it, and follow with the inclusion of our two subvariants, "atoz" and "level3". In subvariant "atoz", we ensure our key type is FOUR_LEVEL_ALPHABETIC (meaning Shift Lock will affect it), and put z and Z on key AC01. On the third and fourth levels we put any, which means that we don't modify the definitions there. Finally, in the "level3" subvariant, we do the opposite: leave levels 1 and 2 alone, and define only a dead_acute and a combining acute (Unicode 0x301) on levels 3 and 4.

We also want our new map to show up in the appropriate layout selection tools, such as the one integrated in Gnome. To do that, we need to list it in /etc/X11/xkb/rules/xorg.xml or /etc/X11/xkb/rules/xfree86.xml (depending on whether we're using X.Org or XFree86). For some applications you might need to update the respective .lst files instead.
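The entry itself is a small XML fragment. A sketch for our test layout, modelled on the existing entries in those files (copy a neighbouring layout element and adjust):

```
<layout>
  <configItem>
    <name>test</name>
    <shortDescription>tst</shortDescription>
    <description>Test keyboard layout</description>
  </configItem>
</layout>
```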

There's also a project which aims to become the central place for all keyboard layouts shared between X.Org and XFree86: xkeyboard-config [7]. To provide translations for the names and descriptions of all available layouts, one should use the Translation Project [8].

Compose mechanisms

When we need accented characters, it's easiest if they already exist in the UCS. We call such accented characters "precomposed", since the process of attaching an accent to a character is known as "composing". However, in many cases our keyboards are not big enough for all the accented characters we might need, or we don't want to waste precious space on rarely used characters. So we introduce so-called "dead" keys, which modify the following letter we type.

We can find all the available "dead" keys in keysymdef.h. Next, we need to define a mapping from a dead key and a regular character to a precomposed character or a sequence of characters. This can be done in /usr/X11R6/lib/X11/locale/en_US.UTF-8/Compose, or we can define our own X11 Compose file (which we'd need to add to compose.dir as well).

Lines in Compose files consist of whitespace-separated keysyms in angle brackets, followed by a colon and a quoted UTF-8 string to use as the replacement. Common initial characters are either Multi_key or any of the dead_* keys.

<dead_acute> <i>          : "í"
<Multi_key> <acute> <i>   : "í"
<dead_acute> <I>          : "Í"
<Multi_key> <acute> <I>   : "Í"

With the above additions or changes to our Compose file, whenever we press any of the combinations on the left side, we'll get the text on the right side of the colon. We would commonly use a precomposed character there, but if there is no precomposed character we need, we can put a base character and a combining diacritic in the string.

However, none of this will work right away in Gtk+ programs. Gtk+ keeps its table of compose combinations compiled in, so you would need to recompile it to add anything, with the important caveat that Gtk+ only supports many-to-one mappings, not many-to-many (i.e. the right side must be a single character). However, if you choose "XIM - X Input Method" as the input method when right-clicking on any Gtk+ text field, you will be able to use the compose sequences provided by X. You can also set this using the GTK_IM_MODULE environment variable (the value xim selects the X Input Method).
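For instance, to start a single Gtk+ program (here Gedit, as an example) with the X Input Method preselected:

```
$ GTK_IM_MODULE=xim gedit
```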

Another alternative is to include combining characters directly in the symbols file (as we also did in our example above). The differences are that we won't get any precomposed forms without further processing, and that we need to input accents after the character, instead of before it.

Complex input: input methods

For most languages, the above mechanisms suffice. Yet there are languages which are not suited to regular input via keyboard. Typical examples are ideographic languages, such as Chinese, Japanese and Korean.

There are several input method libraries available, and most of them can be used with Gtk+ software. Gtk+ also provides internal, table-based input methods, and there are several very simple examples in the modules/input subdirectory of the Gtk+ source code.

Of the popular input method libraries, one should mention XCIN [x] for Chinese, and the universal SCIM framework [y].

Problem of translation

If we have gotten this far, our free desktop is finally able to display and accept input in the desired language. Still, we're left with mostly English text all around our system.

The most common localisation library in free software is GNU gettext. Besides being a developers' library (and integrated into GNU libc as such), it is also a set of tools for working with translation data: xgettext for extracting strings from source code, msgfmt for compiling translations, and msgmerge, msgcat, msgfilter and other tools for managing translations.
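A typical round trip with these tools might look like the following (the file and domain names here are only illustrative):

```
$ xgettext -k_ -o testapp.pot main.c       # extract marked strings
$ msginit -i testapp.pot -l sr -o sr.po    # start a new Serbian translation
$ msgfmt -c sr.po -o testapp.mo            # compile it, checking validity
```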

Translation catalogs: PO files

PO files are the de facto standard translation file format for free software systems. They are very simple plain-text files, which allows anyone to use either a plain text editor or any of the sophisticated PO editing tools (such as Gtranslator, KBabel, POEdit, Emacs po-mode, etc.) on them.

All of the tools provided by GNU gettext either produce or work with PO files.

PO files consist of blank-line-separated entries which can take two different forms of "messages" for translation: regular and pluralised messages. The difference is that regular forms have only single msgid and msgstr values (the original string and its translation), while pluralised forms have two original strings (msgid, and msgid_plural for plurality) and any number of translated strings (msgstr[i], with i starting at 0) determined by the "Plural-Forms" header field.

The initial message in a PO file, a "translation" for the empty string "", is called the PO file header. It consists of metadata such as the message encoding, last translator and revision date, plural forms description, language, etc.
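A minimal sketch of a PO file with a header, one regular and one pluralised message (the Serbian translations and the translator details are only illustrative):

```
msgid ""
msgstr ""
"Project-Id-Version: testapp 0.1\n"
"Last-Translator: Translator Name <translator@example.com>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;\n"

msgid "Open a file"
msgstr "Otvori datoteku"

msgid "%d file"
msgid_plural "%d files"
msgstr[0] "%d datoteka"
msgstr[1] "%d datoteke"
msgstr[2] "%d datoteka"
```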

Another important aspect of translation is compendia: sets of commonly repeated messages, in an easily reusable form. PO files can serve as compendia as well, through the -C option to msgmerge. This way, translators avoid having to re-translate the same messages over and over again, by keeping them separated in a single shared PO file.
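For instance, to start a brand new translation with all messages known to a shared compendium already filled in (the file names here are illustrative):

```
$ msgmerge --compendium=gnome-compendium.po /dev/null testapp.pot -o sr.po
```

Passing /dev/null as the existing translation tells msgmerge to start from scratch, taking translations only from the compendium.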

UI translation

The user interface of Gnome Desktop programs comes from several sources: source code (compiled or interpreted), miscellaneous data files (such as .desktop files and MIME databases), Glade UI files (though these can be considered source code as well), etc.

xgettext from the GNU gettext tools is able to extract strings from only a subset of the necessary files (mostly source code). For other files, we would need to manage our translations manually. Or perhaps not, since there's intltool, which unifies the translation experience by providing extraction and merging facilities for the various file formats used in Gnome and elsewhere.

intltool supports many file formats, among them GConf schemas, .desktop files, preprocessed XML files, Scheme source code and Glade UI files.

It is very simple for developers to integrate, and even easier for translators to use: they end up caring only about the intltool-update command, which is their entire interface to the UI translation of any software program.
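For a translator, the whole interaction usually happens in the po/ subdirectory of a package using intltool (sr is our illustrative language code):

```
$ intltool-update --pot    # regenerate the translation template
$ intltool-update sr       # merge any new strings into sr.po
$ intltool-update -m       # report translatable files missing from POTFILES.in
```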

One big problem with UI translation of Gnome-based desktops is that many messages are passed directly from lower-level components to higher-level components. This usually means that they end up untranslated. Though it's possible to end up with libc messages in the UI, that's not as likely as ending up with Mozilla/Gecko messages appearing in Epiphany (the Gnome web browser). This means that for full translation coverage, you need to work on translating many of the other, non-Gnome components.

However, several issues still remain. GNU gettext merges all identical original strings into one message, which can be a problem since translations may differ depending on the context. This needs to be resolved by developers providing some context for the message (glib provides the Q_ macro for such purposes).

Documentation translation

Gnome documentation is written in the DocBook XML format. Translating XML files manually is not very hard initially, but it becomes a nightmare when you try to keep up with changes in the original documents.

The solution to that problem is to extract only the portions of XML documents that are relevant for translation, ignoring structure and layout. There are several ways to do this: one is to use proprietary software which handles it, another is to try poxml from the KDE project, but the recommended way is to use gnome-doc-utils and xml2po. They are designed especially with Gnome in mind, so they work best for Gnome documentation.

xml2po provides tools for extracting messages for translation into PO files, and for later merging those translations back into XML files. gnome-doc-utils integrates xml2po with the build system, and provides localisation stylesheets for use in Yelp (the Gnome help viewer).
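By hand, the two halves of the xml2po round trip might look like this (the paths follow the common Gnome layout, with the original document in C/ and the translation in sr/):

```
$ xml2po -o gedit.sr.po C/gedit.xml                    # extract messages
$ xml2po -p gedit.sr.po -o sr/gedit.xml C/gedit.xml    # merge translations back
```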

Other problems

Resolving the problems we have described so far will get us very close to the completely internationalised and localised desktop we set as our goal. However, there's always more to be done. One important task is introducing spell checking, where it makes sense. Another problem is the integration of otherwise incompatible software based on entirely different frameworks. Accessibility for the international market is another big issue which is not easily resolved.

Spell-checking

Spell checking on free software systems is best done with GNU Aspell [9]. It provides extensive support and documentation for developing your own dictionaries and checking rules.
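For instance, assuming a dictionary for our language (say, Serbian, with code sr) is installed, checking a document interactively is a single command:

```
$ aspell --lang=sr check document.txt
$ aspell dicts       # list the dictionaries Aspell can find
```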

Unfortunately, some widespread tools (eg. OpenOffice) don't use Aspell, but provide their own engines and databases instead. This means that a lot of data gets duplicated on our systems. For integration with Gtk+ programs, one should consider GtkSpell [10].

Conclusion

We have gone through most of the steps in getting our free desktop, based on GNU, Linux, XFree86 or X.Org and Gnome, localised. From time to time we will be confronted with additional problems, but most of them will be simpler to solve once our foundation is right.

And without knowing all of the world's languages and cultural needs, we can only hope our foundation is right. If it's not, everyone is invited to contribute, and that is the power free software gives to local groups.

Appendix: working in restricted environments

It's not uncommon to end up in a very restricted environment: you are just a regular user, so you can't touch anything outside your $HOME. This does not mean that we should be constrained to only the locales provided by the system.

Locales and translations

Of greatest interest here are the LOCPATH and I18NPATH environment variables. LOCPATH points at the directory which contains locales generated with localedef (with the last argument being a directory name inside $LOCPATH, instead of a locale name), and I18NPATH points at the input data to be used for generating locales.

~/locales $ export LOCPATH=~/locales/
~/locales $ export I18NPATH=~/locales/
~/locales $ localedef -f UTF-8 -i test ./test@locale
~/locales $ export LC_ALL=test@locale

The translations path can be set using the NLSPATH environment variable, but this won't affect gettext-style translations (it's used only for the older "catgets" approach). Unfortunately, there doesn't seem to be a way to do this for applications using GNU gettext. The only alternative is to recompile the software on your own, setting the locale path by passing a parameter to the configure script.

Fonts & rendering

Xft2 and fontconfig are commonly set up to allow users to install fonts for themselves in $HOME/.fonts/. It is as simple as dragging and dropping your favourite fonts there.

You can also set the PANGO_RC_FILE environment variable to a Pango RC file in your home directory, and point ModuleFiles to a directory inside your $HOME. This should even allow you to use personal Pango shapers.

Keyboard input

setxkbmap is basically only a front-end to xkbcomp, which does the actual work of merging a keyboard map into the X server. To see what setxkbmap passes to xkbcomp, we can add the -print argument:

$ setxkbmap test -print
xkb_keymap {
        xkb_keycodes  { include "xfree86+aliases(qwerty)"       };
        xkb_types     { include "complete"      };
        xkb_compat    { include "complete+leds(num)+leds(caps)" };
        xkb_symbols   { include "pc/pc(pc105)+pc/test+level3(ralt_switch_multikey)"       };
        xkb_geometry  { include "pc(pc104)"     };
};

This is very useful when we're testing our symbol files, since setxkbmap isn't very verbose in its error reporting. We can instead ask it to print the keymap definition, and pass that to xkbcomp -v:

$ setxkbmap test -print | xkbcomp -v -

At the same time, instead of merging our keymap into the X server, we can store it in a compiled (.xkm) or source (.xkb) file format. For that, we use the -xkm or -xkb options to xkbcomp.

If we're unable to modify files in /etc/X11/xkb, we can either keep the xkb and xkm files in our home directory and later load them using xkbcomp, or we can create a definition based on the setxkbmap -print output, use relative paths where appropriate, and add our own directory with the xkbcomp -I argument.
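For instance, assuming we saved a keymap definition as ~/xkb/mymap.xkb, we can compile it against our private include directory and upload it straight into the running server:

```
$ xkbcomp -I$HOME/xkb $HOME/xkb/mymap.xkb $DISPLAY
```

Giving a display name as the output argument makes xkbcomp send the compiled map to the server instead of writing a file.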

Finally, now that we're able to load XKB maps, we also want our own personal Compose file. That is very easy starting with XFree86 4.4.0 and recent X.Org versions: by default, a file named ~/.Xcompose is loaded as the user's compose file (alternatively, one can point the XCOMPOSEFILE variable at any other location).

However, this wouldn't be useful enough if one weren't able to base her compose file on already existing compose files. For instance, I'd most likely have something like the following:

$ cat ~/.Xcompose
include "%L"
<dead_grave> <Cyrillic_a>         : "а̀"
<combining_grave> <Cyrillic_a>    : "а̀"
...

Instead of include "%L", which includes the default compose file for the active locale, one can put the full path to any of the Compose files. Note that the same caveats hold for using this with Gtk+ programs as for the regular Compose files.