Copyright © 2000
Translation Research Group
Send comments to comments@ttt.org
Last updated: Friday, January 28, 2000
SALT: Standards-based Access service to multilingual Lexicons and Terminologies
Alan K. Melby
A password-protected limited version of the NSF SALT proposal is available. If you are a SALT partner and wish to view this file please contact Arle Lommel to receive the password.
SALT has begun! For information on the SALT Project in Europe please visit http://www.loria.fr/projets/SALT.
What is SALT?
SALT is a consortium of academic, government, association, and commercial groups in the U.S. and Europe who are working together to test, refine, and implement a universal interchange format for terminology databases and machine translation lexicons. This universal "lex/term" format is based on the recently adopted MARTIF standard (ISO 12200, which is in turn based on ISO 12620) for the exchange of human-oriented terminology databases ("term" for short; for further information see www.ttt.org) and on the OLIF format for the exchange of machine-translation dictionaries and other NLP lexicons ("lex" for short; see www.otelo.lu), along with some Unicode and meta-markup features of the TMX standard for translation-memory database exchange (see www.lisa.unige.ch in the SIG section). Finally, SALT coordinates with results from other related projects, such as Transterm and Geneter. The Transterm project has been finished for some time, but we intend to coordinate the MARTIF conceptual data model with the similar Transterm conceptual data model. Another project with a similar conceptual data model is the University of Rennes project with the Geneter format, which is used in the Inesterm project. We are working on an automatic conversion between MARTIF and a version of the Geneter format.
As stated, the SALT project itself involves:
- testing and refining an XML-based lex/term-data interchange format combining MARTIF and OLIF and called XLT,
- development of a website for people to try out various XLT utilities, and
- development of an XLT toolkit for lex/term-related product developers
The utilities will include conversion routines between OLIF and XLT, between Geneter and XLT, and between several other formats and XLT, as well as guidelines for those who want to develop their own conversion routines.
For many people in the language industries, the benefits of having one widely used term-data interchange format are obvious. Indeed, the following typical comment was made at the Localization Industry Standards Association Forum in Boston (February 1999) in response to the idea of combining MARTIF and OLIF: "This is what we have been waiting twenty years for!"
However, to facilitate your discussions with colleagues who have not been thinking about these issues continually over the past twenty years, let me repeat some of the projected benefits of a universal term-data exchange format:
- Faster Insertion of New Terms into a Database
The language industries are rapidly embracing the use of translation tools such as automatic terminology lookup, terminology mining, terminology consistency checkers, and machine translation. Authoring tools that provide access to a termbase are also appearing, at least in the context of controlled language, but will hopefully soon be applied to the control of terminology in the authoring process even when the syntax is less controlled. Each of these technologies either draws on or feeds a database of terms. No longer is a paper glossary sufficient, nor even a word-processing file that is usable only by a human. With each database potentially using a different internal format, receiving term information from multiple sources and incorporating that information into your database can require either hand re-keying (ugly!) or custom programming of format-to-format filters (expensive!). Once one lex/term-data exchange format (say, for example, XLT) is widely accepted, every translation tool developer can include just one import/export filter (to and from XLT) in their application. Then, everyone can request term data in the XLT format appropriate to their user group and incorporate it into their database without re-keying or writing custom filters.
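The economics behind the "one import/export filter" argument can be made concrete with a little arithmetic. The sketch below (the function names are my own, not part of any SALT deliverable) compares the number of converters needed for direct pairwise conversion among n formats with the number of filters needed when all conversion passes through a single hub format such as XLT:

```python
# Sketch of the "hub format" idea: with a shared interchange format,
# each tool ships two filters (import from and export to the hub)
# instead of a converter for every ordered pair of formats.

def converters_needed_pairwise(n_formats: int) -> int:
    """Direct converters between every ordered pair of n formats."""
    return n_formats * (n_formats - 1)

def converters_needed_hub(n_formats: int) -> int:
    """Filters needed when all conversion goes through one hub format."""
    return 2 * n_formats

if __name__ == "__main__":
    for n in (3, 6, 12):
        print(n, converters_needed_pairwise(n), converters_needed_hub(n))
```

With a dozen tools in play, the pairwise approach needs 132 converters while the hub approach needs 24 filters, and the gap widens quadratically as the industry grows.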
- More Consistency Across Documents
It is well known that, given professional competence on the part of an author, translator, or reviewer, the single most significant controllable factor in documentation/translation quality is consistency in the use of terms. Whenever multiple authors or translators are working on a large document or set of related documents, there is the chance for inconsistency in the use of terms, even if each person in the document production chain is using some kind of translation tool that includes a termbase. XLT will facilitate the dissemination of current versions of the constantly updated master term database to multiple authors/translators at multiple locations around the globe using multiple translation tools, so long as they all include an appropriate XLT import filter. Without a universal exchange format, expensive custom programming may be required to maintain consistency.
- Synchronization of Human and Machine Translation
An increasingly common scenario in large organizations is the use of both human translation (with technology assistance such as translation memory lookup and automatic terminology lookup) and machine translation (with human revision of raw output). It is imperative that the human and machine translation sides of such an operation use terms consistently. That is why XLT includes both a human-translation aspect (MARTIF) and a machine-translation aspect (OLIF) that have been integrated into a single format framework (XLT). The SALT project will include the development of freeware tools for merging human-translation term data and machine-translation term data into a single database, with automatic reporting of potential holes on either side and of potential conflicts (such as the same concept in the same domain being designated by different terms on the human and machine translation sides), to be brought to the attention of a human terminologist for evaluation. The merged database could even serve as the master repository for noun and noun-noun compound terms for both the human translator tools and the machine translation system in your organization. The cost of synchronizing may be greatly reduced by using the SALT utilities; the cost of not synchronizing may be enormous, given the deleterious effects of inconsistency. The internal cost of developing and implementing a human/machine translation synchronization system in one very large organization reportedly ran into the millions of dollars, yet the system was expected to pass the break-even point within a few years. When they heard about the SALT project, they wished it had existed sooner so that they could have reaped a return on investment sooner.
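The merging-and-reporting behavior described above can be sketched in a few lines. This is a deliberately simplified illustration, not the actual SALT utility: the dict-based entry layout and the hole/conflict labels are assumptions of mine, and a real tool would work on full XLT concept entries rather than bare concept-to-term maps.

```python
# Toy sketch of merging human-translation (HT) and machine-translation
# (MT) term data, reporting holes (a concept present on only one side)
# and conflicts (same concept, different terms) for human review.

def merge_term_data(ht: dict, mt: dict):
    """Merge two {concept_id: term} maps; return (merged, holes, conflicts)."""
    merged, holes, conflicts = {}, [], []
    for concept in sorted(set(ht) | set(mt)):
        in_ht, in_mt = concept in ht, concept in mt
        if in_ht and in_mt:
            if ht[concept] != mt[concept]:
                # Same concept designated by different terms on each side:
                # flag it for a human terminologist rather than guessing.
                conflicts.append((concept, ht[concept], mt[concept]))
            merged[concept] = ht[concept]  # keep the HT term pending review
        else:
            holes.append((concept, "MT missing" if in_ht else "HT missing"))
            merged[concept] = ht.get(concept, mt.get(concept))
    return merged, holes, conflicts
```

Run on two small maps, the function fills the merged database from whichever side has the entry and returns the hole and conflict lists as its review report.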
Who is SALT?
The US coordinator is Alan K. Melby (Brigham Young University (BYU) and LinguaTech International); the US team also includes others from academia, e.g. a terminologist, Sue Ellen Wright at Kent State University, and Deryle Lonsdale at BYU, who will head up the ontology aspects of SALT. The European coordinator is Gerhard Budin at the University of Vienna. The following commercial translation tools developers have been contacted and have expressed interest in the project: Trados, Star, EP, Logos, Systran, and L&H. Additional potential main partners that have expressed interest are: the University of Applied Sciences (Cologne), IAI (Saarbrücken), the European Academy (Bolzano, Italy), the Institute for Business Informatics (Kolding, Denmark), Loria Labs and Termisti (France and Belgium), the University of Surrey (UK), and SAP [representing Otelo] (Germany). [Clearly, the participation of key members of the Otelo project, particularly those focused on OLIF, is important. The Otelo final user's group meeting includes possible SALT collaboration as an agenda item.] An advisory group is being formed, consisting of governmental and non-governmental organizations that have a vested interest in term-data sharing. So far, we have received positive responses from LISA (headquartered near Geneva), Infoterm (Vienna), and AMTA (Ed Hovy). We will also be coordinating with two relevant EC agencies that cannot be formally listed in the proposal but are willing to advise informally. We have also contacted various LISA member companies for letters of support and have arranged additional corporate partners, such as Medtronic and HP. A number of other companies, including Microsoft, are involved through LISA, as IT developers, localization service providers, and LISA-OSCAR Steering Committee members.
What are the duties and benefits of being a SALT partner?
There are essentially two levels of participation possible in the SALT project:
- main partner:
- perform data collection and analysis
- develop/test demo website featuring utilities for validation, conversion, merging, etc.
- develop and test the software development toolkit (which will be used in the website utilities and made available to developers for integration into their applications)
- advisory partner:
- provide sample term data to data collectors
- test website as it is developed
- provide end-user feedback
The principal benefit from being a SALT partner is not funding. Partners should be motivated by a desire to satisfy the strongly-felt need within the language industry for a universal term data interchange format. Should NSF/EU funding not be approved (which would be extremely unfortunate), the project will proceed on a much smaller scale anyway, especially with the industrial partners, since it is in their interest to be involved in the refinement and promotion of the primary interchange format for human-translator and machine-translation term data. However, a strong show of support now will increase the chances of funding. Specific benefits to partners include the prestige of early adoption, the advantage of influencing the refinement process, and the potential for eventual consulting work to help others implement XLT.
Of course, the partners will also share in the same benefits that will be reaped by everyone in the language industries. These benefits are listed in section 1.
How do MARTIF, OLIF, Geneter, TBX, XML, and XLT fit together?
Many projects over the years have worked on the problem of term-data exchange. There are two sides: human-oriented, concept-oriented terminology databases (termbases, hence "term") and lexicons for machine translation and other NLP applications, which are word-oriented and lexicographically organized (hence "lex").
On the termbase side, we have seen the MATER project, the MicroMATER project, and the terminological-data aspects of the TEI project, which culminated in the MARTIF standard (ISO 12200, based on ISO 12620), both of which were published as ISO standards in the third quarter of 1999. At the August 1999 ISO meetings in Berlin it was decided to pursue
- an application of MARTIF called MARTIF with Standardized Constraints (MSC)
- another intermediate format called Geneter, which, like MARTIF, is based on ISO 12620
- a meta-model encompassing both MARTIF and Geneter
In addition, a resolution was passed mandating an effort to make MSC and Geneter interoperable. An expected outcome of this effort is that Geneter will be brought into the family of MARTIF-compatible formats, in view of market needs for a single lex/term interchange format.
On the machine-translation lexicon side, we have seen a series of EC projects, including the Eurotra-7, Multilex/MLEXd, and Genelex projects and, more recently, the Transterm project, along with several commercial exchange formats, such as MLIF (from METAL) and LEF (from Logos). The most recent MT-lexicon exchange format, OLIF, emerged from the Otelo project, which ended in the second quarter of 1999. One OLIF paper specifically acknowledges the following previous formats: MARTIF, Transterm, Interval, MLIF, and LEF.
Another important historical thread in the development of term-data exchange standards is LISA (the Localization Industry Standards Association). This important trade association started work in 1997 on a standard format for exchanging translation memory database data. The result, called TMX, was one of the first applications of XML and is being implemented by major commercial translation technology vendors. The next data exchange standards project of LISA is to define TBX for term database exchange.
Suddenly these various threads (MARTIF, OLIF, and LISA) came together starting in February 1999. While chairing an OSCAR meeting in Boston in February (OSCAR is the LISA data exchange standards body), Alan Melby got feedback that the previous OSCAR plan to look at termbase and MT-lexicon data exchange separately was unacceptable to the localization industry (and probably the wider language industries). An integrated standard was needed now. While working on an integrated MARTIF-OLIF format for the TBX proposal during February and March, he noticed a call for joint USA-EU proposals from the National Science Foundation on the USA side and from the 5th Framework/IST/HLT program on the EU side. The title of the call is "Multilingual Access and Management", and the description of the call includes terminology management, human and machine translation, and data exchange standards. It seemed a natural step to propose a project that combines further work on MARTIF and OLIF, along the lines of the TBX proposal but going beyond LISA.
That step was taken, and the SALT project and its XLT format have been launched. XLT is an XML-compliant framework for defining a family of closely related term-data exchange formats tailored to specific user groups. MARTIF is an SGML application that has been adapted to the XML world in anticipation of the adoption of an XML-Schema standard and has become the heart of XLT. The essence of OLIF, which is a tagged format but not an SGML application, has been integrated into XLT by inserting the OLIF header into the XLT header, merging the OLIF Central Entry into the corresponding element of XLT taken from MARTIF, and adding to each XLT concept entry an optional NLP feature-value pair list that corresponds exactly to the feature-value pair list of OLIF but is recast in MARTIF-style XML. In addition, the TMX method of documenting user-defined Unicode characters and the TMX meta-markup method of including presentational markup in running text (for contextual examples, etc.) have been incorporated into XLT. The resulting format, XLT, is described in a paper presented at the TKE conference in August 1999. That draft is available through the www.ttt.org homepage as a PDF file called "Leveraging Terminological Data for Use in Conjunction with Lexicographical Resources".
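To make the integration concrete, the sketch below builds a toy concept entry carrying an optional OLIF-derived feature-value list recast as XML, in the spirit described above. The element and attribute names (termEntry, tig, nlpFeatures, feature) are illustrative guesses of mine, not the actual XLT DTD; consult the TKE paper for the real markup.

```python
# Toy illustration: a MARTIF-style concept entry with an optional
# NLP feature-value list (the OLIF-derived part) recast as XML.
# Element/attribute names are hypothetical, not the real XLT DTD.
import xml.etree.ElementTree as ET

def make_entry(concept_id, term, nlp_features=None):
    entry = ET.Element("termEntry", id=concept_id)
    tig = ET.SubElement(entry, "tig")        # term information group
    ET.SubElement(tig, "term").text = term
    if nlp_features:                         # optional NLP section,
        fvpl = ET.SubElement(tig, "nlpFeatures")  # omitted for pure
        for name, value in nlp_features.items():  # human-oriented data
            f = ET.SubElement(fvpl, "feature", name=name)
            f.text = value
    return ET.tostring(entry, encoding="unicode")

print(make_entry("c42", "hard disk", {"pos": "noun", "number": "singular"}))
```

The point of the design is visible even in the toy: a purely human-oriented termbase can omit the NLP block entirely, while an MT lexicon populates it, so both communities share one entry structure.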
What are the Objectives, Goals, and Relevance of SALT?
The objective of SALT is to build on the work that has been done in several projects dealing with sharing what we call lex/term-data (including Otelo, Transterm, and Martif). Specific goals include (a) the testing and refinement of a unified XML-based format called XLT (of which TBX is the LISA subset), (b) the development of a demonstration website for end-users to submit files in various formats and validate them, merge them, and get them back in another format, using XLT as the intermediate format, and (c) the development of a toolkit for translation technology developers who want to integrate XLT filters into their software applications.
A specific research goal for SALT is to investigate the difficult problem of mapping positions from one ontology into another. This problem necessarily arises when attempting to minimize information loss in going from one termbase to another when the two termbases do not use the same ontologies (classification systems, concept systems, and thesauri). Other less challenging but useful goals are the tasks of extracting a concept system from an existing termbase and grafting an existing ontology onto a termbase that does not yet have links to one.
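A deliberately naive sketch of one sub-problem named above may help fix ideas: mapping positions from one ontology into another by exact label match, flagging concepts with no counterpart so the information loss can at least be measured. Real ontology mapping requires far more than label matching (structural and semantic comparison at minimum), and the flat label-to-parent data layout here is an assumption of mine, not a SALT data structure.

```python
# Naive baseline for ontology position mapping: match concept labels
# exactly, and report source concepts with no counterpart in the
# target so the potential information loss is visible.

def map_positions(source: dict, target: dict):
    """source/target: {concept_label: parent_label_or_None}.
    Return (mapping, unmapped_source_labels)."""
    mapping = {label: label for label in source if label in target}
    unmapped = sorted(label for label in source if label not in target)
    return mapping, unmapped
```

Anything in the unmapped list is exactly where the hard research questions begin: does the concept exist in the target under another name, under a broader node, or not at all?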
The overall goal of SALT, however, is extremely practical. It is to reach "critical mass" with XLT so that tools developers, such as Star, Trados, EP, Logos, Systran, L&H, and Xerox, will incorporate some level of XLT support in their products and so that various companies will provide ongoing consulting services to anyone who wants to get their proprietary lex/term-data into XLT format or XLT data into their proprietary format. The demonstration website will of course use the XLT toolkit. Developers and consultants will all use the detailed specifications, sample files, and tools for XLT that will be made universally available as freeware, with the only restriction being strict adherence to the standard in order to use the SALT/XLT logo.
Without such a "jump start" I fear that widespread use of data exchange standards for lexi[cographical]/term[inological]-data will be unnecessarily delayed. Given the work that has been done to date on lex/term-data exchange, we do not need to search for the "ideal" or absolutely perfect format. The OLIF and MARTIF formats are good enough and the need is growing. Let's get them widely enough known in their integrated form as XLT (with various user-group-specific subsets such as TBX) so that no fragmentation can take place. The language industries need one format that is good enough, not multiple competing formats. And they need it now.
As for the timetable, SALT has essentially already begun: the XLT framework is ready for testing, and initial data collection is under way. We hope to hold a major SALT conference in conjunction with the TAMA 2000 conferences.
Please direct comments and questions concerning SALT to:
Alan K. Melby <akm@byu.edu> (+1 801 378-2144)
with a cc: to Arle Lommel <fenevad@ttt.org> (+1 801 378-4414)
Brief letters of support for the SALT project should be written on organization letterhead and mailed to:
Alan K. Melby, Dept. of Linguistics
2129 JKHB BYU
Provo, Utah 84602
USA
and faxed to +1 801 377-3704