.. -*- mode: rst; encoding: utf-8 -*-

=============================
Working with Message Catalogs
=============================

.. contents:: Contents
   :depth: 3
.. sectnum::


Introduction
============

The ``gettext`` translation system enables you to mark any strings used in your
application as subject to localization, by wrapping them in functions such as
``gettext(str)`` and ``ngettext(singular, plural, num)``. For brevity, the
``gettext`` function is often aliased to ``_(str)``, so you can write:

.. code-block:: python

    print(_("Hello"))

instead of just:

.. code-block:: python

    print("Hello")

to make the string "Hello" localizable.

Message catalogs are collections of translations for such localizable messages
used in an application. They are commonly stored in PO (Portable Object) and MO
(Machine Object) files, the formats of which are defined by the GNU `gettext`_
tools and the GNU `translation project`_.

 .. _`gettext`: http://www.gnu.org/software/gettext/
 .. _`translation project`: http://sourceforge.net/projects/translation

The general procedure for building message catalogs looks something like this:

 * use a tool (such as ``xgettext``) to extract localizable strings from the
   code base and write them to a POT (PO Template) file
 * make a copy of the POT file for a specific locale (for example, "en_US")
   and start translating the messages
 * use a tool such as ``msgfmt`` to compile the locale PO file into a binary
   MO file
 * later, when code changes make it necessary to update the translations,
   regenerate the POT file and merge the changes into the various
   locale-specific PO files, for example using ``msgmerge``

Python provides the `gettext module`_ as part of the standard library, which
enables applications to work with appropriately generated MO files.

 .. _`gettext module`: http://docs.python.org/lib/module-gettext.html

As ``gettext`` provides a solid and well supported foundation for translating
application messages, Babel does not reinvent the wheel, but rather reuses this
infrastructure, and makes it easier to build message catalogs for Python
applications.


Message Extraction
==================

Babel provides functionality similar to that of the ``xgettext`` program,
except that only extraction from Python source files is built-in, while support
for other file formats can be added using a simple extension mechanism.

Unlike ``xgettext``, which is usually invoked once for every file, the routines
for message extraction in Babel operate on directories. While the per-file
approach of ``xgettext`` works nicely with projects using a ``Makefile``,
Python projects rarely use ``make``, and thus a different mechanism is needed
for extracting messages from the heterogeneous collection of source files that
many Python projects are composed of.

When message extraction is based on directories instead of individual files,
there needs to be a way to configure which files should be treated in which
manner. For example, while many projects may contain ``.html`` files, some of
those files may be static HTML files that don't contain localizable messages,
while others may be `Django`_ templates, and still others may contain `Genshi`_
markup templates. Some projects may even mix HTML files for different template
languages (for whatever reason). Therefore the way in which messages are
extracted from source files cannot depend on the file extension alone, but
needs to be controllable in a precise manner.

.. _`Django`: http://www.djangoproject.com/
.. _`Genshi`: http://genshi.edgewall.org/

Babel accepts a configuration file to specify this mapping of files to
extraction methods, which is described below.


.. _`frontends`:

----------
Front-Ends
----------

Babel provides two different front-ends to access its functionality for working
with message catalogs:

 * A `Command-line interface <cmdline.html>`_, and
 * `Integration with distutils/setuptools <setup.html>`_

Which one you choose depends on the nature of your project. For most modern
Python projects, the distutils/setuptools integration is probably more
convenient.


.. _`mapping`:

-------------------------------------------
Extraction Method Mapping and Configuration
-------------------------------------------

The mapping of extraction methods to files in Babel is done via a configuration
file. This file maps extended glob patterns to the names of the extraction
methods, and can also set various options for each pattern (which options are
available depends on the specific extraction method).

For example, the following configuration extracts messages from Python source
files, from Genshi markup templates and text templates, and from JavaScript
files:

.. code-block:: ini

    # Extraction from Python source files

    [python: **.py]

    # Extraction from Genshi HTML and text templates

    [genshi: **/templates/**.html]
    ignore_tags = script,style
    include_attrs = alt title summary

    [genshi: **/templates/**.txt]
    template_class = genshi.template:TextTemplate
    encoding = ISO-8859-15

    # Extraction from JavaScript files

    [javascript: **.js]
    extract_messages = $._, jQuery._

The configuration file syntax is based on the format commonly found in ``.INI``
files on Windows systems, and as supported by the ``ConfigParser`` module in
the Python standard library. Section names (the strings enclosed in square
brackets) specify both the name of the extraction method, and the extended glob
pattern to specify the files that this extraction method should be used for,
separated by a colon. The options in the sections are passed to the extraction
method. Which options are available is specific to the extraction method used.
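The structure of such a file can be explored with the standard library alone. The sketch below is not Babel's own parser; ``parse_mapping`` is a hypothetical helper that uses ``configparser`` (the Python 3 name of the ``ConfigParser`` module) and assumes every section name has the ``method: pattern`` form.

```python
import configparser

MAPPING = """\
# Extraction from Python source files
[python: **.py]

# Extraction from Genshi HTML templates
[genshi: **/templates/**.html]
ignore_tags = script,style
include_attrs = alt title summary
"""


def parse_mapping(text):
    """Return a list of (pattern, method, options) tuples in file order."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    mapping = []
    for section in parser.sections():
        # The section name holds the method and the pattern, split on ":".
        method, pattern = (part.strip() for part in section.split(":", 1))
        mapping.append((pattern, method, dict(parser.items(section))))
    return mapping


for pattern, method, options in parse_mapping(MAPPING):
    print(method, pattern, options)
```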

The extended glob patterns used in this configuration are similar to the glob
patterns provided by most shells. A single asterisk (``*``) is a wildcard for
any number of characters (except for the pathname component separator "/"),
while a question mark (``?``) only matches a single character. In addition,
two consecutive asterisks (``**``) can be used to make the wildcard match
across any number of directory levels, so the pattern ``**.txt`` matches any
file with the extension ``.txt`` in any directory.
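One way to picture these rules is to translate a pattern into a regular expression. The following sketch is an approximation written for this purpose, not Babel's actual matching code, and edge cases may behave differently; ``glob_to_regex`` is a hypothetical name, and treating ``?`` as not matching "/" is an assumption.

```python
import re


def glob_to_regex(pattern):
    """Translate an extended glob pattern into a compiled regex."""
    pieces = []
    # Split on the wildcards, trying the longest ones first.
    for part in re.split(r"(\*\*/|\*\*|\*|\?)", pattern):
        if part == "**/":
            pieces.append(r"(?:.+/)?")   # zero or more leading directories
        elif part == "**":
            pieces.append(r".*")         # anything, including "/"
        elif part == "*":
            pieces.append(r"[^/]*")      # anything within one path component
        elif part == "?":
            pieces.append(r"[^/]")       # exactly one character
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces) + r"\Z")


def matches(pattern, path):
    return glob_to_regex(pattern).match(path) is not None


print(matches("**.txt", "docs/notes/readme.txt"))
print(matches("*.txt", "docs/readme.txt"))
print(matches("**/templates/**.html", "app/templates/admin/index.html"))
```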

Lines that start with a ``#`` or ``;`` character are ignored and can be used
for comments. Empty lines are ignored, too.

.. note:: If you're performing message extraction using the command Babel
          provides for integration into ``setup.py`` scripts, you can also
          provide this configuration in a different way, namely as a keyword
          argument to the ``setup()`` function. See `Distutils/Setuptools
          Integration`_ for more information.

.. _`distutils/setuptools integration`: setup.html


Default Extraction Methods
--------------------------

Babel comes with a few built-in extractors: ``python`` (which extracts
messages from Python source files), ``javascript``, and ``ignore`` (which
extracts nothing).

The ``python`` extractor is by default mapped to the glob pattern ``**.py``,
meaning it'll be applied to all files with the ``.py`` extension in any
directory. If you specify your own mapping configuration, this default mapping
is discarded, so you need to explicitly add it to your mapping (as shown in the
example above).


.. _`referencing extraction methods`:

Referencing Extraction Methods
------------------------------

To be able to use short extraction method names such as ``genshi``, you need to
have `pkg_resources`_ installed, and the package implementing that extraction
method needs to have been installed with its metadata (the `egg-info`_).

If this is not possible for some reason, you need to map the short names to
fully qualified function names in an ``[extractors]`` section in the mapping
configuration. For example:

.. code-block:: ini

    # Some custom extraction method

    [extractors]
    custom = mypackage.module:extract_custom

    [custom: **.ctm]
    some_option = foo

Note that the built-in extraction methods ``python`` and ``ignore`` are
available by default, even if `pkg_resources`_ is not installed. You should
never need to explicitly define them in the ``[extractors]`` section.

.. _`egg-info`: http://peak.telecommunity.com/DevCenter/PythonEggs
.. _`pkg_resources`: http://peak.telecommunity.com/DevCenter/PkgResources
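A fully qualified name of the form ``module:function`` can be resolved with a plain import. The sketch below only illustrates the idea and is not taken from Babel's source; ``resolve`` is a hypothetical helper, and a standard library function stands in for a real extractor such as ``mypackage.module:extract_custom``.

```python
import importlib


def resolve(reference):
    """Import and return the object named by a ``module:attribute`` string."""
    module_name, sep, attr_name = reference.partition(":")
    if not sep:
        raise ValueError("expected 'module:attribute', got %r" % reference)
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# A standard library function is used here in place of an extraction method.
basename = resolve("os.path:basename")
print(basename("project/templates/index.html"))
```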


--------------------------
Writing Extraction Methods
--------------------------

Adding new methods for extracting localizable messages is easy. First, you'll
need to implement a function that complies with the following interface:

.. code-block:: python

    def extract_xxx(fileobj, keywords, comment_tags, options):
        """Extract messages from XXX files.

        :param fileobj: the file-like object the messages should be extracted
                        from
        :param keywords: a list of keywords (i.e. function names) that should
                         be recognized as translation functions
        :param comment_tags: a list of translator tags to search for and
                             include in the results
        :param options: a dictionary of additional options (optional)
        :return: an iterator over ``(lineno, funcname, message, comments)``
                 tuples
        :rtype: ``iterator``
        """

.. note:: Any strings in the tuples produced by this function must be either
          ``unicode`` objects, or ``str`` objects using plain ASCII characters.
          That means that if sources contain strings using other encodings, it
          is the job of the extractor implementation to do the decoding to
          ``unicode`` objects.
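As an illustration of this interface, here is a deliberately simple extractor. It is a sketch under several assumptions: the name ``extract_simple`` and the regular expression are made up for this example, the file object is assumed to yield bytes, only calls with a single double-quoted argument are recognized, and translator comments are not collected.

```python
import io
import re

# Matches calls such as _("Hello") with a single double-quoted argument.
CALL_RE = re.compile(r'([A-Za-z_]\w*)\(\s*"([^"]*)"\s*\)')


def extract_simple(fileobj, keywords, comment_tags, options):
    """Yield (lineno, funcname, message, comments) for each matching call."""
    encoding = options.get("encoding", "utf-8")
    for lineno, raw_line in enumerate(fileobj, 1):
        # Decode here, so that only text strings are handed back to the caller.
        line = raw_line.decode(encoding)
        for funcname, message in CALL_RE.findall(line):
            if funcname in keywords:
                # Translator comments are not collected in this sketch.
                yield lineno, funcname, message, []


source = io.BytesIO(b'title = _("Hello")\nlog(debug("not extracted"))\n')
for result in extract_simple(source, ["_"], [], {}):
    print(result)
```

Writing the extractor as a generator keeps it compatible with the requirement to return an iterator, and avoids building the full result list for large files.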

Next, you should register that function as an entry point. This requires your
``setup.py`` script to use `setuptools`_, and your package to be installed with
the necessary metadata. If that's taken care of, add something like the
following to your ``setup.py`` script:

.. code-block:: python

    setup(...

        entry_points = """
        [babel.extractors]
        xxx = your.package:extract_xxx
        """,

That is, add your extraction method to the entry point group
``babel.extractors``, where the name of the entry point is the name that people
will use to reference the extraction method, and the value is the module and
the name of the function (separated by a colon) implementing the actual
extraction.

.. note:: As shown in `Referencing Extraction Methods`_, declaring an entry
          point is not strictly required, as users can still reference the
          extraction function directly. But whenever possible, the entry point
          should be declared to make configuration more convenient.

.. _`setuptools`: http://peak.telecommunity.com/DevCenter/setuptools


-------------------
Translator Comments
-------------------

Translator comments are source code comments that are passed along to the
translators. They are identified by comment tags: short strings that Babel
searches for in comments, and only in comments, that appear right before a
`python gettext`_ call, as shown in the following example:

 .. _`python gettext`: http://docs.python.org/lib/module-gettext.html

.. code-block:: python

    # NOTE: This is a comment about `Foo Bar`
    _('Foo Bar')

The comment tag for the above example would be ``NOTE:``, and the translator
comment for that tag would be ``This is a comment about `Foo Bar```.

The resulting output in the catalog template would be something like::

    #. This is a comment about `Foo Bar`
    #: main.py:2
    msgid "Foo Bar"
    msgstr ""
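The relationship between the tagged comment and the catalog entry can be sketched in a few lines. Both helpers below are hypothetical and exist only for this illustration; they do not reflect how Babel is implemented, and the formatter does not escape special characters in the message.

```python
def translator_comment(line, comment_tags):
    """Return the comment text if ``line`` is a tagged comment, else None."""
    stripped = line.strip()
    if not stripped.startswith("#"):
        return None
    text = stripped[1:].strip()
    for tag in comment_tags:
        if text.startswith(tag):
            # Drop the tag itself, keeping only the text meant for translators.
            return text[len(tag):].strip()
    return None


def format_entry(filename, lineno, message, comments):
    """Render one catalog template entry (no escaping of the message)."""
    lines = ["#. " + comment for comment in comments]
    lines.append("#: %s:%d" % (filename, lineno))
    lines.append('msgid "%s"' % message)
    lines.append('msgstr ""')
    return "\n".join(lines)


comment = translator_comment("# NOTE: This is a comment about `Foo Bar`",
                             ["NOTE:"])
print(format_entry("main.py", 2, "Foo Bar", [comment]))
```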

Now, you might ask, why would I need that?

Consider this simple case: you have a menu item called “manual”. You know what
it means, but a translator who sees it will wonder whether you meant:

1. a document or help manual, or
2. a manual process?

This is the simplest case where a translator comment such as
“The installation manual” helps to clarify the situation and makes a translator
more productive.

.. note:: Whether translator comments can be extracted depends on the extraction
          method in use. The Python extractor provided by Babel does implement
          this feature, but others may not.