XML Technologies and the Localization Process

Why a standard markup method is needed for working with multilingual documents

By Yves Savourel - October 2000.

XPath, XPointer, XLink, XML-Namespaces, XML-Schema, RDF, XForms, XQuery, XML-Signature, XSL, XSLT: XML has blossomed into a tangled bush of names and acronyms that is hard to keep track of. On the positive side, such a flurry of new standards shows how quickly the technologies associated with XML are progressing. It opens the door to many new ways of dealing with data in general and localizable text in particular.

There are two areas where XML and localization interact: when XML is used to help with the localization process and when XML itself is the data to localize.

Examples of XML-based implementations in localization environments are simple file formats such as OpenTag, TMX and TBX. But now XML offers additional avenues to interact with its applications. One of them is XSL.

Using XSL Templates

XSL (Extensible Stylesheet Language) is, in short, the XML-based equivalent of cascading style sheets (CSS). It allows you to associate presentation styles with XML files. One of its components, XSLT (XSL Transformations), allows you to modify an XML document according to a set of template rules.

Let's try to apply this mechanism to a concrete example. We want to offer the translator who is working with an OpenTag file a way to view it so that the source and target text really stand out, as in a two-column table. For this, we can create an XSL template that any XSL-enabled browser will be able to apply when opening the OpenTag document.

The template is itself an XML document where you basically write a skeleton of HTML elements and use the XSL commands to provide the text.

The XSL template with XML elements in bold:

<?xml version="1.0" ?>
 <xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  <xsl:template match="/">
   <HTML>
    <BODY>
     <TABLE WIDTH="100%" BORDER="0" CELLSPACING="2" CELLPADDING="4">
      <THEAD>
       <TH BGCOLOR="beige" WIDTH="50%">English</TH>
       <TH BGCOLOR="beige" WIDTH="50%">French</TH>
      </THEAD>
      <TBODY>
       <xsl:for-each select="opentag/file/*">
        <TR>
         <TD BGCOLOR="skyblue"><xsl:value-of select="p[@lc='EN-US']"/></TD>
         <TD BGCOLOR="springgreen"><xsl:value-of select="p[@lc='FR-FR']"/></TD>
        </TR>
       </xsl:for-each>
      </TBODY>
     </TABLE>
    </BODY>
   </HTML>
  </xsl:template>
 </xsl:stylesheet>

The OpenTag document with the XSL template applied.

In the OpenTag file itself, very few changes are needed: the addition of the XSL command at the top of the file specifies which template to use. In our example, the text consists of a few strings extracted from a nonstandard resource-type file.

The OpenTag document with the XSL declaration in bold:

<?xml version="1.0" encoding="windows-1252" ?>
 <?xml-stylesheet type="text/xsl" href="otf-table.xsl"?>
 <opentag version="1.2">
  <file lc="EN-US" ts="skl:950220774" datatype="sunbirdres" ws="1"
   tool="TotalRecall" original="C:\Hobbit\MainApp.sbrc" 
   reference="C:\Hobbit\MainApp.skl">
   <grp id="1" ts="ctx:4e3e" rid="ID_FILE_PRINT">
    <p lc="EN-US">&amp;Print...\tCtrl+P</p>
    <p lc="FR-FR">&amp;Imprimer...\tCtrl+P</p>
   </grp>
   <grp id="2" ts="ctx:4dd4" rid="ID_MAIN1">
    <p lc="EN-US">&amp;File</p>
    <p lc="FR-FR">&amp;Fichier</p>
   </grp>
   <grp id="3" ts="ctx:506d" rid="ID_MAIN2">
    <p lc="EN-US">&amp;View</p>
    <p lc="FR-FR">&amp;Vues</p>
   </grp>
  </file>
 </opentag>
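
The pairing that the template performs can also be sketched outside the browser. The following Python snippet is only an illustration using the standard library: it walks the children of <file> and reads the two <p> elements by their lc attribute, as the xsl:for-each and xsl:value-of commands do. The function name pair_segments and the abbreviated sample are assumptions made for this example, and the ampersands in the sample are written as &amp; so that the text parses as XML.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modeled on the OpenTag document above (illustrative only).
SAMPLE = """<opentag version="1.2">
 <file lc="EN-US" datatype="sunbirdres">
  <grp id="2" rid="ID_MAIN1">
   <p lc="EN-US">&amp;File</p>
   <p lc="FR-FR">&amp;Fichier</p>
  </grp>
  <grp id="3" rid="ID_MAIN2">
   <p lc="EN-US">&amp;View</p>
   <p lc="FR-FR">&amp;Vues</p>
  </grp>
 </file>
</opentag>"""


def pair_segments(xml_text, source_lc, target_lc):
    """Return one (source, target) text pair per child of <file>."""
    root = ET.fromstring(xml_text)
    pairs = []
    # Same selection as select="opentag/file/*" in the template.
    for group in root.findall("file/*"):
        source = group.find(f"p[@lc='{source_lc}']")
        target = group.find(f"p[@lc='{target_lc}']")
        pairs.append((
            (source.text or "") if source is not None else "",
            (target.text or "") if target is not None else "",
        ))
    return pairs


for source_text, target_text in pair_segments(SAMPLE, "EN-US", "FR-FR"):
    print(f"{source_text:<12} | {target_text}")
```

A missing target paragraph simply yields an empty cell, which is roughly what the browser shows when xsl:value-of selects nothing.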

As you can see, parts of the template file refer to the locales to display (for example, "EN-US") and other variables. You can adjust these parameters manually, but this quickly becomes time-consuming and cumbersome. The solution is to automate the construction of the XSL files with a simple tool, such as Rainbow.

This Windows utility, available at www.opentag.com, lets you select the OpenTag and TMX files you want to view and creates the relevant templates and temporary files.

Rainbow: a quick way to view OpenTag and TMX documents using XSL templates.

To process a document, add the file to the main list box, select the template to use, choose the locales to work with and click Display. The XSL template is created with the appropriate parameters, the XSL style-sheet command is inserted in the XML document and the temporary result is opened in the browser associated with the .xml extension.
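
The two automated steps described here, filling in the template parameters and inserting the style-sheet command, can be sketched in a few lines of Python. This is not Rainbow's actual implementation, which the article does not show. The skeleton is a shortened version of the two-column template, and the names XSL_SKELETON, build_stylesheet and add_stylesheet_pi are hypothetical.

```python
import re
from string import Template
from xml.sax.saxutils import escape

# Shortened two-column template; ${...} marks the values that change per job.
XSL_SKELETON = Template("""<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
 <xsl:template match="/">
  <HTML><BODY><TABLE>
   <THEAD><TH>${source_label}</TH><TH>${target_label}</TH></THEAD>
   <TBODY>
    <xsl:for-each select="opentag/file/*">
     <TR>
      <TD><xsl:value-of select="p[@lc='${source_lc}']"/></TD>
      <TD><xsl:value-of select="p[@lc='${target_lc}']"/></TD>
     </TR>
    </xsl:for-each>
   </TBODY>
  </TABLE></BODY></HTML>
 </xsl:template>
</xsl:stylesheet>
""")


def build_stylesheet(source_lc, source_label, target_lc, target_label):
    """Fill the skeleton; assumes plain locale codes without quote characters."""
    return XSL_SKELETON.substitute(
        source_lc=source_lc,
        target_lc=target_lc,
        source_label=escape(source_label),
        target_label=escape(target_label),
    )


def add_stylesheet_pi(xml_text, href):
    """Insert the xml-stylesheet instruction after the XML declaration, if any."""
    pi = f'<?xml-stylesheet type="text/xsl" href="{href}"?>'
    declaration = re.match(r"<\?xml\s[^?]*\?>", xml_text)
    if declaration:
        end = declaration.end()
        return xml_text[:end] + "\n" + pi + xml_text[end:]
    return pi + "\n" + xml_text
```

Calling build_stylesheet("EN-US", "English", "FR-FR", "French") gives a template equivalent to the one shown earlier, apart from the omitted colors and widths.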

Rainbow is just an exercise to demonstrate a few of the things you can accomplish with templates, but you can extend the same principle to real production work. For example, Sykes has developed an in-house utility called Shadow to verify tagged files during the localization process.

The translator and editor work in Word where they can enjoy the powerful features of a word-processor. With a few keystrokes, however, they can use Shadow to look at the WYSIWYG display of their files. HTML documents are rendered as they would be in a browser, while the files in OpenTag or other XML formats are assigned XSL templates and are shown with a user-friendly layout.

Shadow uses the Microsoft Internet Explorer engine and takes full advantage of its error-checking capabilities. For instance, the codes in the translated files may be altered or deleted by accident. The engine catches and displays those problems and where they occur, allowing the user to make the necessary corrections quickly. This side benefit of using XSL allows for a significant reduction of the work to be done after translation.
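
The kind of check involved can be illustrated without a browser. The sketch below uses Python's standard XML parser as a stand-in for the Internet Explorer engine, so it shows the principle rather than how Shadow works. The function name check_well_formed is an assumption.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# A translated group where the closing </p> tag was deleted by accident.
damaged = '<grp id="2"><p lc="FR-FR">Fichier</grp>'
problem = check_well_formed(damaged)
if problem:
    print("line %d, column %d: %s" % problem)
```

A parser only catches damage that breaks well-formedness. A tag that was moved to the wrong place but remains balanced would pass this test.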

Syntax checking of a translated OpenTag document in Shadow.

Developing XSL templates is not always straightforward, and you definitely need a good understanding of XML as well as XPath if you want to create anything more than a basic layout. The task can also be challenging if the documents to display have a somewhat complex structure. However, as with other XML-related technologies, more and more tools are available. You can find many of them listed on the W3C Web site at www.w3.org.

XSL still has some way to go before it can be widely used. Nevertheless, at the pace most XML-related efforts are going, it should not take very long, and, as we have seen, in some cases you can already take advantage of it.

Using Namespaces

We have seen that the XSL template was using both HTML and XSL tags at the same time. This is made possible by one of the most powerful features of XML: Namespaces.

Simply put, this mechanism consists of adding a prefix to the names of elements and attributes that are not part of the main format. For example, you can mix two elements called <prop> belonging to two different formats in the same file. To do this, you simply prefix them with their respective namespace identifiers: <abc:prop> and <xyz:prop>. Each prefix is declared with an xmlns attribute (for example, xmlns:abc) that associates it with the URI identifying the namespace.

For instance, you can include parts of TMX markup in an XHTML file.

An example of using Namespaces: TMX markup in an XHTML document

<?xml version="1.0" ?>
 <html xmlns="http://www.w3.org/TR/xhtml1"
       xmlns:tmx="http://www.lisa.org/tmx">
  <head>
   <title><tmx:seg>Quote from "The Hobbit" by JRR Tolkien</tmx:seg></title>
  </head>
  <body>
   <p><tmx:seg>He put the ring in his pocket almost without thinking:</tmx:seg>
      <tmx:seg>certainly it did not seem of any particular use at the moment.</tmx:seg></p>
  </body>
 </html>

XML-Namespaces is described in full at www.w3.org/TR/REC-xml-names. There are different ways to use it, depending on where you include the xmlns attribute. Overall, it provides a very efficient method to reuse an existing set of tags in any XML-based document. This leads to an interesting possibility when it comes to making XML files easier to localize: a common set of tags for localization information.
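
To see how an application tells the two tag sets apart, here is a minimal Python sketch using the standard library. It parses a document like the one above and selects elements by namespace URI rather than by prefix. The namespace URIs are the ones used in the example. The variable names are assumptions.

```python
import xml.etree.ElementTree as ET

TMX_NS = "http://www.lisa.org/tmx"
NS = {"xhtml": "http://www.w3.org/TR/xhtml1", "tmx": TMX_NS}

DOC = """<html xmlns="http://www.w3.org/TR/xhtml1"
      xmlns:tmx="http://www.lisa.org/tmx">
 <head>
  <title><tmx:seg>Quote from "The Hobbit" by JRR Tolkien</tmx:seg></title>
 </head>
 <body>
  <p><tmx:seg>He put the ring in his pocket almost without thinking:</tmx:seg>
     <tmx:seg>certainly it did not seem of any particular use at the moment.</tmx:seg></p>
 </body>
</html>"""

root = ET.fromstring(DOC)

# The parser expands every name to {namespace-URI}local-name,
# so the prefix chosen in the file does not matter.
segments = [seg.text for seg in root.iter("{%s}seg" % TMX_NS)]

# XHTML elements are found the same way, here through a prefix map.
paragraphs = root.findall(".//xhtml:p", NS)

for text in segments:
    print(text)
```

Because matching is done on the URI, a file that used a different prefix, such as <t:seg>, bound to the same URI would give the same result.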

Localization Information Markup

Localization can be done more efficiently if, in your format, you make provision for information such as the following: what text is translatable and what is not; language (or, more exactly, locale) of the text item, especially in multilingual documents; explanations and important guidelines from the author to the translator about specific parts of a text, such as what an acronym stands for, whether in a short phrase a given word is a verb or adjective and so on; requirements for the text, such as maximum length, character class restriction (that is, should the text remain in ASCII or in lower case?) and so on; unique identification of each text item to allow for reuse of source and translated text in subsequent versions of the same document or in other documents. A correct implementation of this ID mechanism can be infinitely more efficient than the classic translation database system.

The nature of this information is always the same regardless of the data. For years now, many developers have been coding such information sets in software-related data using specialized comments such as OpenNotes. XML formats could also easily include specialized elements for this. Richard Ishida from Xerox Corp. has made very interesting presentations on this topic at various conferences. To go further, as Steven Forth from DNAMedia Inc. has proposed, it should be possible to standardize this kind of tag set, instead of having each format specify its own. Then, thanks to the Namespace mechanism, any XML format could simply take advantage of the standard set.

At the time of writing, it seems that such an endeavor will be undertaken as one of the tasks of the new XML special interest group at LISA (www.lisa.org) proposed by Jörg Schütz from IAI Saarbrücken and coordinated with the Internationalization working group of the W3C (www.w3c.org) led by Martin J. Dürst. This is an important effort. Any localization provider and tools vendor working with XML should participate in this initiative or at least follow it closely.

Let's try to create an example of such a set of tags to see how it would work. Obviously, this is just an experiment. Neither the names of the tags nor the functions defined here are the ones that will ultimately constitute the official tag set. They just show how the principle would operate.

First, we need a Namespace definition. We can dub our imaginary tag set LIME (Localization Information Markup Extension; hopefully the SIG will come up with a much better name). Now we can assign a URI (Uniform Resource Identifier) to the namespace, for example, xmlns:lime="http://www.lisa.org/lime".

Both the main tag set and LIME can coexist in the same flow of text. Depending on what application processes the data, the different tag sets will be treated differently: a localization tool will use the LIME tags to enhance its knowledge of how to handle the text, and a terminology tool will look for term-related tags, while your browser will interpret only the HTML-related elements and display the same output as if the LIME tags were not there.

HTML file of Example 1 viewed in a browser. The LIME tags are ignored.

In addition to the tags to intersperse within the XML document, LIME should define a markup to specify general rules for what elements and attributes are translatable or not, as well as their default localization properties.

Example 1 - HTML file with the LIME tags in bold:

<?xml version="1.0" ?>
<html xmlns="http://www.w3.org/TR/xhtml1"
      xmlns:lime="http://www.lisa.org/lime">
 <body>
  <p>Start the <strong><lime:term>Surrogate Services</lime:term></strong>
module after the <lime:notrans>Kernel</lime:notrans> is installed.
<lime:break/>You must run version <lime:span var="number">1.5</lime:span>
for this.</p>
 </body>
</html>
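
How different applications might read the same file can be sketched in Python. LIME is imaginary, so everything here is hypothetical: the namespace URI and the element names are simply the ones from Example 1, and the snippet only shows that each consumer can pick out the tags it cares about.

```python
import xml.etree.ElementTree as ET

LIME = "http://www.lisa.org/lime"  # hypothetical namespace from Example 1

DOC = """<html xmlns="http://www.w3.org/TR/xhtml1"
      xmlns:lime="http://www.lisa.org/lime">
 <body>
  <p>Start the <strong><lime:term>Surrogate Services</lime:term></strong>
module after the <lime:notrans>Kernel</lime:notrans> is installed.
<lime:break/>You must run version <lime:span var="number">1.5</lime:span>
for this.</p>
 </body>
</html>"""

root = ET.fromstring(DOC)

# A terminology tool: only the term-related tags matter.
terms = [el.text for el in root.iter("{%s}term" % LIME)]

# A localization tool: text that must be protected from translation.
protected = [el.text for el in root.iter("{%s}notrans" % LIME)]

# A LIME-unaware browser: all character data, with the unknown tags ignored.
visible = " ".join("".join(root.itertext()).split())

print(terms)      # ['Surrogate Services']
print(protected)  # ['Kernel']
print(visible)
```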

The principle does not stop at documentation-type files. It can be applied to resource-type data as well.

At some point we should see the emergence of XML-based formats for describing user interfaces: dialog boxes, menus, string tables and so forth. In short, XML-based "RC files".

At least one format like this is already in the making: XUL (pronounced "zool"). XUL was created to code the UI resources of Mozilla (Netscape's browser). You can find more information about XUL at www.mozilla.org/xpfe/xulref.

Such formats could take advantage of our LIME tags as well.

Example 2 - XUL resources with LIME tags in bold:

<?xml version="1.0" ?>
<?xml-stylesheet href="chrome://global/skin/xul.css" type="text/css" ?>
<!DOCTYPE window>
<window xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"
 xmlns:html="http://www.w3.org/1999/xhtml"
 xmlns:lime="http://www.lisa.org/lime"
 id="main-window">
 <menubar>
  <lime:span maxlen="15">
   <menu name="File">
    <lime:next note="Scale is a verb"/>
    <menuitem name="Scale..." onclick="doSomething();"/>
   </menu>
  </lime:span>
 </menubar>
</window>
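
A localization tool could combine such tags with the resource strings roughly as follows. This Python sketch assumes one possible reading of the imaginary markup: maxlen on <lime:span> applies to every string inside the span, and <lime:next> attaches its note to the element that immediately follows it. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

LIME = "{http://www.lisa.org/lime}"  # hypothetical namespace from Example 2

DOC = """<window xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"
 xmlns:lime="http://www.lisa.org/lime" id="main-window">
 <menubar>
  <lime:span maxlen="15">
   <menu name="File">
    <lime:next note="Scale is a verb"/>
    <menuitem name="Scale..." onclick="doSomething();"/>
   </menu>
  </lime:span>
 </menubar>
</window>"""


def collect_strings(element, maxlen=None, out=None):
    """Collect (text, maxlen, note) for each name attribute below element."""
    if out is None:
        out = []
    if element.tag == LIME + "span" and "maxlen" in element.attrib:
        maxlen = int(element.get("maxlen"))  # inherited by everything inside
    note = None  # set by <lime:next>, used by the following sibling only
    for child in element:
        if child.tag == LIME + "next":
            note = child.get("note")
            continue
        if not child.tag.startswith(LIME) and "name" in child.attrib:
            out.append((child.get("name"), maxlen, note))
        note = None
        collect_strings(child, maxlen, out)
    return out


def too_long(translation, maxlen):
    """True when a translation exceeds the limit given by the markup."""
    return maxlen is not None and len(translation) > maxlen


for text, limit, note in collect_strings(ET.fromstring(DOC)):
    print(text, limit, note)
```

With this information in hand, the tool can show the note to the translator and reject a translation of "Scale..." that runs past 15 characters.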

Thinking about software-type data and localization information, it may be a good idea to extend whatever mechanism is devised for XML to non-XML data. After all, even XML documents have text stored in non-XML layers such as scripts in XHTML files. Using comment markers as containers, the same localization information could be placed inside resource files and source code. This could help in dealing with some of the "embedded formats" problems, like HTML fragments inside a script, itself stored in a database record (developers can be highly creative sometimes).

Example of non-XML data marked up with a LIME-like mechanism in bold:

<script language="Jscript" runat="server">
function AlertIsVisible()
{
   /*_lime:span translate="no" */
   x="<";
   response.write(x + "SCRIPT Language=JavaScript>");
   if (Grid1.isVisible())
   {
      /*_lime:next translate="yes" datatype="js_script"*/
      response.write("alert('The grid is visible');");
   }
   else
   {
      /*_lime:next translate="yes" datatype="html" */
      response.write("The grid is <b>not visible</b>");
   }
   response.write(x + "/SCRIPT>");
   /*_lime:/span */
}
</script>

Similar methods have been used for years already, but each one has its own set of markers. Using a standard set would be beneficial in the long run. Such non-XML markup may have to be adapted a little bit for resource-type data, but most of the functions should be the same as for XML documents.
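
An extraction tool could read such comment markers with a couple of regular expressions. The Python sketch below handles only the _lime:next marker from the example above and takes the first string literal on the line that follows it. The marker syntax is the imaginary one used in this article, and the function name extract_marked_strings is an assumption.

```python
import re

SOURCE = '''function AlertIsVisible()
{
   /*_lime:span translate="no" */
   x="<";
   response.write(x + "SCRIPT Language=JavaScript>");
   if (Grid1.isVisible())
   {
      /*_lime:next translate="yes" datatype="js_script"*/
      response.write("alert('The grid is visible');");
   }
   else
   {
      /*_lime:next translate="yes" datatype="html" */
      response.write("The grid is <b>not visible</b>");
   }
   response.write(x + "/SCRIPT>");
   /*_lime:/span */
}'''

# A marker comment, then the rest of the next non-blank line of code.
MARKER = re.compile(r'/\*_lime:next\s+(?P<attrs>.*?)\s*\*/\s*(?P<line>[^\n]*)')
ATTR = re.compile(r'(\w+)="([^"]*)"')
STRING = re.compile(r'"((?:[^"\\]|\\.)*)"')  # double-quoted literal


def extract_marked_strings(source):
    """Return (string, attributes) for each literal flagged as translatable."""
    results = []
    for match in MARKER.finditer(source):
        attrs = dict(ATTR.findall(match.group("attrs")))
        literal = STRING.search(match.group("line"))
        if literal and attrs.get("translate") == "yes":
            results.append((literal.group(1), attrs))
    return results


for text, attrs in extract_marked_strings(SOURCE):
    print(attrs["datatype"], "->", text)
```

The datatype attribute tells the tool which filter to apply next, so the second string can be handed to an HTML parser and the first to a script-aware one.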

A Powerful Cocktail

The elements presented in our imaginary LIME tag set are just a few obvious ideas, but you can see how such a mechanism could streamline many localization tasks. It would improve various areas of our processes: terminology extraction and verification, validation of strings for specific requirements, translation database matching, conveying information to the translators, leveraging, automatic translation operations, management of the "not-to-be-translated" text, sentence breaking and so forth.

The availability of a well-defined localization markup would permit tools that offer the writers and the developers efficient ways to annotate their data for the benefit of the localizers. It would place more power in their hands by allowing them to have a direct channel of communication with the translators.

It is clear now that XML will rule the data side of many applications in the coming years. The benefits of using XML in localization tools are easily understood, but there is a great deal of work left to integrate localization information as a component of XML itself. An effort to come up with a standard set of tags for localization information would be instrumental in the success of globalizing data repositories and ultimately in making digital resources more accessible to the world. XML and its additions constitute a colorful and powerful cocktail of standards, already showing strength and flexibility. Some sort of LIME on top could make it taste even better for the localization industry.

This article is reprinted from #35 Volume 11 Issue 7 (October/November 2000) of MultiLingual Computing & Technology published by MultiLingual Computing, Inc., 319 North First Ave., Sandpoint, Idaho, USA, 208-263-8178, Fax: 208-263-6310.