Machine Translation Technologies

By Diana Sagarna, Ane Alaña, Nerea Basterretxea, Laura Gravina

Abstract

Introduction

Definition of machine translation

Main problems of machine translation

        Problems of ambiguity

        Problems that arise from structural and lexical differences between languages

        Multiword units like idioms and collocations

Description of CAT and its main functions

Translation Memories

References to Translation Memories

References

Conclusion

 

Abstract

This report, written as an exercise for the subject English Language and New Technologies, covers the importance of machine translation in our society. We also present and try to define several tools and programs that help machines in the difficult task of translating texts, words, etc., and we point out the main problems of machine translation. Nowadays the importance of these machines is incalculable, and several programs allow them to translate texts more precisely. Our main source of information is the material provided in class by Professor Joseba Abaitua.


Introduction

The use of machine translation is more important than it may seem at first sight. It is a tool that gives people access to information about a variety of subjects in different languages. For instance, we can have English expressions or idioms translated into our own language. This allows us to obtain, quickly and effectively, the meaning of a word or phrase that is crucial to our understanding of a conversation.

Machine translation is often understood as "automatic linguistic translation", that is, a word-by-word (verbatim) translation. In this paper we argue that this is not the case: it is a translation from one human language to another that takes into account all the difficulties and differences between the two languages. In translating between human languages, linguistic, semantic and lexical differences must be taken into consideration; otherwise the translation will not be reliable.

Thus, MT research covers all these fields (working together with linguists), trying to determine the differences between one language and another and moving away from verbatim translation.

There are two main problems with machine translation. The first is ambiguity: a word that has more than one meaning is lexically ambiguous, and a phrase or sentence that can have more than one structure is structurally ambiguous. The second arises from structural and lexical differences between languages: languages sometimes use different structures for the same purpose and, at other times, the same structure for different purposes. Obviously, a verbatim translation cannot solve these problems.

Another problem is that of idiomatic phrases. Some of them can be quite ambiguous, and in a literal translation the meaning of an idiom is lost, largely because idioms are culturally bound. Example: "Does this ring a bell?" Literal reading: is a bell ringing because of this? True meaning: is this familiar to you?

Finally, there are three types of MT. In Fully Automated Machine Translation the computer does all the work. In Human-Assisted Machine Translation the computer does most of the work but is assisted by a human (corrections, clarifications, etc.). In Computer Aided Translation, whose main representative is the translation memory, the human translator is supported by tools; a translation memory is a database of earlier translations that the translator can consult, and it is nowadays nearly compulsory for professional translators.


I.    Machine translation

Definition of Machine Translation:

The term machine translation (MT) is normally taken in its restricted and precise meaning of fully automatic translation. However, in this chapter we consider the whole range of tools that may support translation and document production in general, which is especially important when considering the integration of other language processing techniques and resources with MT. We therefore define Machine Translation to include any computer-based process that transforms (or helps a user to transform) written text from one human language into another. We define Fully Automated Machine Translation (FAMT) to be MT performed without the intervention of a human being during the process. Human-Assisted Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a (mono- or bilingual) human for help. Machine-Aided Translation (MAT) is the style of translation in which a human does most of the work but uses one or more computer systems, mainly as resources such as dictionaries and spelling checkers, as assistants.

Traditionally, two very different classes of MT have been identified. Assimilation refers to the class of translation in which an individual or organization wants to gather material written by others in a variety of languages and convert them all into his or her own language. Dissemination refers to the class in which an individual or organization wants to broadcast his or her own material, written in one language, in a variety of languages to the world. A third class of translation has also recently become evident. Communication refers to the class in which two or more individuals are in more or less immediate interaction, typically via email or otherwise online, with an MT system mediating between them. Each class of translation has very different features, is best supported by different underlying technology, and is to be evaluated according to somewhat different criteria.

See Machine Translation, by Bente Maegaard


II.     Main problems of machine translation

Problems of ambiguity

In the best of all possible worlds (as far as most Natural Language Processing is concerned, anyway) every word would have one and only one meaning. But, as we all know, this is not the case. When a word has more than one meaning, it is said to be lexically ambiguous. When a phrase or sentence can have more than one structure it is said to be structurally ambiguous.

Ambiguity is a pervasive phenomenon in human languages. It is very hard to find words that are not at least two ways ambiguous, and sentences which are (out of context) several ways ambiguous are the rule, not the exception. This is not only problematic because some of the alternatives are unintended (i.e. represent wrong interpretations), but because ambiguities 'multiply'. In the worst case, a sentence containing two words, each of which is two ways ambiguous, may be four ways ambiguous (2 × 2), one with three such words may be 2^3 = 8 ways ambiguous, etc. One can, in this way, get very large numbers indeed. For example, a sentence consisting of ten words, each two ways ambiguous, and with just two possible structural analyses could have 2^10 × 2 = 2,048 different analyses. The number of analyses can be problematic, since one may have to consider all of them, rejecting all but one.
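
The way ambiguities multiply can be illustrated with a short Python sketch. The function names and the toy word senses below are hypothetical and chosen only for illustration; the sketch simply multiplies the number of senses of each word by the number of structural analyses, which is the worst case described above.

```python
from itertools import product
from math import prod


def count_analyses(senses_per_word, structural_analyses=1):
    """Worst-case number of analyses of a sentence: the product of the
    number of senses of every word, times the number of structures."""
    return prod(senses_per_word) * structural_analyses


def enumerate_readings(word_senses):
    """List every combination that picks exactly one sense per word."""
    return list(product(*word_senses))


# Hypothetical two-word example, each word two ways ambiguous.
TOY_SENSES = [
    ["bank (river side)", "bank (financial institution)"],
    ["note (written message)", "note (musical tone)"],
]

if __name__ == "__main__":
    print(count_analyses([2, 2]))        # two ambiguous words
    print(count_analyses([2, 2, 2]))     # three ambiguous words
    print(count_analyses([2] * 10, 2))   # ten words, two structures
    for reading in enumerate_readings(TOY_SENSES):
        print(reading)
```

A system that cannot rule out readings early has to examine every tuple that enumerate_readings produces, which is why the growth in the number of analyses matters in practice.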

Problems that arise from structural and lexical differences between languages

At the start of the previous section we said that, in the best of all possible worlds for NLP, every word would have exactly one sense. While this is true for most NLP, it is an exaggeration as regards MT. It would be a better world, but not the best of all possible worlds, because we would still be faced with difficult translation problems. Some of these problems are to do with lexical differences between languages: differences in the ways in which languages seem to classify the world, what concepts they choose to express by single words, and which they choose not to lexicalize. We will look at some of these directly. Other problems arise because different languages use different structures for the same purpose, and the same structure for different purposes. In either case, the result is that we have to complicate the translation process. In this section we will look at some representative examples.

Examples of lexical differences of this kind are familiar to translators, but colour terms and the Japanese verbs corresponding to English wear and put on are particularly striking. The latter show how languages may differ not only with respect to the fineness or 'granularity' of the distinctions they make, but also with respect to the basis for the distinction: English chooses different verbs for the action/event of putting on, and the action/state of wearing. Japanese does not make this distinction, but differentiates according to the object that is worn. In the case of English to Japanese, a fairly simple test on the semantics of the NPs that accompany a verb may be sufficient to decide on the right translation. Some of the colour examples are similar, but more generally, investigation of colour vocabulary indicates that languages actually carve up the spectrum in rather different ways, and that deciding on the best translation may require knowledge that goes well beyond what is in the text, and may even be undecidable. In this sense, the translation of colour terminology begins to resemble the translation of terms for cultural artifacts (e.g. words like English cottage, Russian dacha, French château, etc. for which no adequate translation exists, and for which the human translator must decide between straight borrowing, neologism, and providing an explanation). In this area, translation is a genuinely creative act, which is well beyond the capacity of current computers.
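
The "fairly simple test on the semantics of the NPs" mentioned for English to Japanese can be sketched as a lookup on the semantic class of the object noun. This is only an illustration under stated assumptions: the semantic classes, the noun lexicon and the function name are invented here, and the romanised Japanese verbs (kiru, haku, kaburu, kakeru) are given in dictionary form without any inflection.

```python
# Hypothetical semantic classes for a few English object nouns.
NOUN_CLASS = {
    "shirt": "upper_body",
    "jacket": "upper_body",
    "trousers": "lower_body",
    "shoes": "lower_body",
    "hat": "head",
    "glasses": "eyewear",
}

# Japanese verb (romanised) selected by the class of the thing worn.
WEAR_VERB = {
    "upper_body": "kiru",
    "lower_body": "haku",
    "head": "kaburu",
    "eyewear": "kakeru",
}


def translate_wear(object_noun):
    """Choose a Japanese verb for English 'wear' / 'put on' from the
    semantic class of its object. Returns None when the noun is not
    covered by the toy lexicon."""
    sem_class = NOUN_CLASS.get(object_noun)
    return WEAR_VERB.get(sem_class)


if __name__ == "__main__":
    for noun in ("hat", "shoes", "shirt", "ring"):
        print(noun, "->", translate_wear(noun))
```

The same lookup does not help in the other direction: choosing between English wear and put on depends on whether an action or a state is described, which the object noun alone does not reveal.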

Multiword units like idioms and collocations

Roughly speaking, idioms are expressions whose meaning cannot be completely understood from the meanings of the component parts. For example, whereas it is possible to work out the meaning of a sentence such as If Sam mends the bucket, her children will be rich on the basis of knowledge of English grammar and the meaning of words, this would not be sufficient to work out that If Sam kicks the bucket, her children will be rich can mean something like `If Sam dies, her children will be rich'. This is because kick the bucket is an idiom.

One problem with sentences which contain idioms is that they are typically ambiguous, in the sense that either a literal or idiomatic interpretation is generally possible (i.e. the phrase kick the bucket can really be about buckets and kicking). However, the possibility of having a variety of interpretations does not really distinguish them from other sorts of expression. Another problem is that they need special rules, in addition to the normal rules for ordinary words and constructions. However, in this they are no different from ordinary words, for which one also needs special rules. The real problem with idioms is that they are not generally fixed in their form, and that the variation of forms is not limited to variations in inflection (as it is with ordinary words). Thus, there is a serious problem in recognising idioms.

This problem does not arise with all idioms. Some are completely frozen forms whose parts always appear in the same form and in the same order. Examples are phrases like in fact, or in view of. However, such idioms are by far the exception. A typical way in which idioms can vary is in the form of the verb, which changes according to tense, as well as person and number. For example, with bury the hatchet (`to cease hostilities and become reconciled'), one gets He buries/buried/will bury the hatchet, and They bury/buried/shall bury the hatchet. Notice that the variation in form one gets here is exactly what one would get if no idiomatic interpretation was involved, i.e. by and large idioms are syntactically and morphologically regular; it is only their interpretations that are surprising.

A second common form of variation is in the form of the possessive pronoun in expressions like to burn one's bridges (meaning `to proceed in such a way as to eliminate all alternative courses of action'). This varies in a regular way with the subject of the verb: for example, He has burned his bridges, but She has burned her bridges.

In other cases, only the syntactic category of an element in an idiom can be predicted. Thus, the idiom pull X's leg (`tease') contains a genitive NP, such as Sam's, or the king of England's. Another common form of variation arises because some idioms allow adjectival modifiers. Thus in addition to keep tabs on (meaning observe) one has keep close tabs on (`observe closely'), or put a political cat among the pigeons (meaning `do or say something that causes a lot of argument politically'). Some idioms appear in different syntactic configurations, just like regular non-idiomatic expressions. Thus, bury the hatchet appears in the passive, as well as the active voice.

Of course, not all idioms allow these variations (e.g. one cannot passivize kick the bucket meaning `die'), and, as noted, some do not allow any variation in form. But where variation in form is allowed, there is clearly a problem. In particular, notice that it will not be possible to recognise idioms simply by looking for sequences of particular words in the input. Recognising some of these idioms will require a rather detailed syntactic analysis. For example, despite the variation in form for bury the hatchet, the idiomatic interpretation only occurs when the hatchet is the DEEP OBJECT of bury. Moreover, the rules that translate idioms or which replace them by single lexical items may have to be rather complex. Some idea of this can be gained from considering what must happen to pull Sam's leg in order to produce something equivalent to tease Sam, or the French translation involving taquiner (`tease'). Such a rule may assume that the input and output of transfer are representations of grammatical relations, but the principles are the same if semantic representations are involved, or if the reduction of pull X's leg to a single word takes place during English analysis.
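
The recognition problem can be made concrete with a small Python sketch comparing a fixed word-sequence search with a pattern that allows some variation in form. The list of verb forms and the two patterns are hand-written assumptions for this one idiom; they are not taken from any MT system, and they are deliberately shallow.

```python
import re

# Hypothetical, hand-written list of inflected forms of the verb "bury".
BURY_FORMS = r"(?:bury|buries|buried|burying)"

# Active voice: any form of "bury" directly followed by "the hatchet".
ACTIVE = re.compile(rf"\b{BURY_FORMS}\s+the\s+hatchet\b", re.IGNORECASE)

# Passive voice: "the hatchet" + a form of "be" + optional adverb + "buried".
PASSIVE = re.compile(
    r"\bthe\s+hatchet\s+(?:is|was|has\s+been|will\s+be)\s+(?:\w+\s+)?buried\b",
    re.IGNORECASE,
)


def naive_match(sentence):
    """Look for one fixed sequence of words only."""
    return "bury the hatchet" in sentence.lower()


def pattern_match(sentence):
    """Allow verb inflection and a simple passive construction."""
    return bool(ACTIVE.search(sentence) or PASSIVE.search(sentence))


if __name__ == "__main__":
    examples = [
        "They will bury the hatchet",
        "He buried the hatchet",
        "The hatchet was finally buried",
        "He buried the old hatchet in the garden",
    ]
    for s in examples:
        print(f"{s!r}: naive={naive_match(s)} pattern={pattern_match(s)}")
```

Even the more flexible patterns only look at surface word order. They cannot check that the hatchet is the deep object of bury, and they cannot tell a literal burial from the idiomatic reading, which is why a syntactic analysis is needed.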

See Translation problems, by D J Arnold

Machine translation has interesting and useful complements such as CAT or MAT. For many professionals these complements are a good alternative to machine translation. Those professionals prefer CAT and MAT because their role is maintained and their productivity and capacity are considerably increased.


 Description of CAT and its main functions

Computer Aided Translation (CAT) is intended for professional translators who are already fluent in the two languages they are translating. CAT tools often include "terminology management tools" and "translation memory" to enhance the efficiency and accuracy of translations.

For more information: http://www.traduzioni-inglese.it/computer-aided-translation.html

http://www.hltcentral.org/htmlengine.shtml?id=1091

CAT tools are programs that help to translate by creating databases of previous translations, lexicons, etc.

  Definition of Translation Memories and their advantages

Translation memory (TM) applications are computer-aided translation tools that use database and code-protection features to simplify the translation process. They are designed to improve the quality and efficiency of the human translation process, not to replace it.

The systems basically consist of a database in which each source sentence of a translation is stored together with the target sentence (this is called a translation memory "unit"). Any new source sentences will be searched for in the database and a match value is calculated.

When the match value is 100%, the translation of the source sentence from the database is inserted into the text being translated. If the match value is below 100% and above a certain user-definable percentage (i.e., "fuzzy match"), the old translation will be inserted as a translation proposal for the translator to review and edit. Sentences with match values below that margin have to be translated from scratch. New and changed translation proposals will then be stored in the database for future use.
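
The lookup procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular TM product works: the match value is approximated with the standard library's difflib.SequenceMatcher, the 75% threshold is an arbitrary stand-in for the user-definable percentage, and the function names and sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def match_value(a, b):
    """Similarity of two sentences as a percentage (0 to 100).
    SequenceMatcher.ratio() is only a stand-in for a real TM score."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup(memory, source, fuzzy_threshold=75.0):
    """Return (kind, translation, score) for a new source sentence.
    `memory` maps stored source sentences to their target sentences."""
    if source in memory:
        return ("exact", memory[source], 100.0)
    best_src, best_score = None, 0.0
    for stored in memory:
        score = match_value(source, stored)
        if score > best_score:
            best_src, best_score = stored, score
    if best_src is not None and best_score >= fuzzy_threshold:
        return ("fuzzy", memory[best_src], best_score)
    return ("none", None, best_score)


def store(memory, source, target):
    """Save a new or edited translation unit for future use."""
    memory[source] = target


# Hypothetical English-Spanish translation memory.
memory = {
    "Press the red button.": "Pulse el botón rojo.",
    "Close the window.": "Cierre la ventana.",
}

if __name__ == "__main__":
    for sentence in ("Press the red button.",
                     "Press the green button.",
                     "Where is the station?"):
        print(sentence, "->", lookup(memory, sentence))
```

An "exact" result would be inserted directly, a "fuzzy" result would be shown to the translator as a proposal to edit, and a "none" result means translating from scratch and then calling store so the new unit is available next time.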

For more information: http://www.multilingualwebmaster.com/library/trmemories.html

TMX: Translation Memory eXchange Standard. TMX is a vendor-neutral, open standard for storing and exchanging translation memories created by Computer Aided Translation (CAT) and localization tools. The purpose of TMX is to allow easier exchange of translation memory data between tools and/or translation vendors with little or no loss of critical data during the process.
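
To give an idea of what exchanged data looks like, the following Python sketch reads translation units from a very small TMX document using the standard library. The element names (tmx, header, body, tu, tuv, seg) follow TMX 1.4 as we understand it; the header attribute values, the sample sentences and the function name read_units are placeholders invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# A tiny sample document; attribute values are placeholders.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example-tool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Close the window.</seg></tuv>
      <tuv xml:lang="es"><seg>Cierre la ventana.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_units(tmx_text):
    """Return one dict per translation unit, mapping language code to
    segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            unit[tuv.get(XML_LANG)] = seg.text if seg is not None else ""
        units.append(unit)
    return units


if __name__ == "__main__":
    print(read_units(SAMPLE_TMX))
```

Because every tool can read and write the same unit structure, a memory built in one CAT tool can in principle be loaded into another.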


  Some references to Translation Memories

Some of the translation memory systems listed here are more widely used than others.

CATALYST

Corel's localisation system.


Déjà Vu

Low cost system of Atril.


Eurolang Optimizer

LANT. Languages: English, French, German, Spanish, Italian, Dutch, Portuguese, Swedish, Danish, Finnish, Norwegian


IBM Translation Manager (IBM)

60-day evaluation copy available for download.

 

Trans Suite 2000 (Cypresoft)
Includes a TM system (TRANS Suite 2000 Editor), an aligner and a dictionary setup tool, as well as a management module. Trial version available.

 

IBM TransLexis

"TransLexis is a company-wide Terminology Management System which was developed for different NLP applications: First as a lexicon for a Machine Translation System and second to provide terminologists and human translators with a tool for managing lexical and terminological data"


Loc@le 2.0

System of Accent Software.


MetaTexis

MetaTexis runs under Microsoft Word and comprises all functions of a professional CAT tool like TRADOS or DejaVu. It is comparable to Wordfast. However, MetaTexis follows a different technological approach and puts special emphasis on ease of use and detailed statistical information for translators.


MultiTrans of MultiCorpora

MultiTrans™ is a user-friendly second generation CAT tool. Contrary to traditional Translation Memory systems, MultiTrans does not build a laborious database of pre-aligned sentences. It indexes previously translated documents, creating a bilingual reference corpus that allows the user to make full-text search and retrieval of words, expressions and sentences.


Passolo

Software localization tool with integrated dialog and bitmap editor, fuzzy matching, glossaries and checking functions. Professional Edition with VBA compatible scripting engine and automation. Works with standard Windows resources, Delphi, Visual Basic, Java, Databases and others. Unicode version for Asian and Right-to-Left languages included. Optional interfaces for TRADOS Workbench and STAR Transit available


SDLX 2.0 Interactive Translation System

Beta versions available for testing.


TR-Aid

A translation memory system developed by the Institute for Language and Speech Processing in Greece. Tr·AID compares new sentences to previously translated material and locates existing sentences similar or even identical to the new ones, in order to propose them as candidate translations of the originals. This is the first, unrivalled translation memory product that has been developed by a Greek organisation.


TRANSIT & TermStar

Computer-aided translation system & multilingual terminology manager


Translator’s Workbench

Trados Corporation. Supported languages are Czech, Danish, Dutch, English, French, German, Finnish, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish.

For more information: http://www.foreignword.com/Technology/tm/tm.htm


To illustrate one of the most important problems of MT, we have translated some Spanish idioms into English using three different machine translators (SYSTRAN, WorldLingo and FreeTranslation):

| Idioms (Spanish) | Idioms (English) | SYSTRAN | WORLDLINGO | FREETRANSLATION |
|---|---|---|---|---|
| De tal palo, tal astilla | A chip off the old block | Of such wood, such chip | Of such wood, such chip | Of such stick, such chip |
| De la subida más alta es la caída más lastimosa | The bigger they are the further they fall | Of the highest ascent is the most pitiful fall | Of the highest ascent it is the most pitiful fall | Of the highest ascent is the most pitiful fall |
| En caliente y de repente | Strike while the iron's hot | In hot and suddenly | In hot and suddenly | In hot and suddenly |
| En casa del herrero, cuchillo de palo | In the blacksmith's house, a wooden knife | In house of the blacksmith, wood knife | In house of the blacksmith, wood knife | At home of the blacksmith, knife of stick |
| En menos que canta un gallo | In the shake of a lamb's tail | In less than sings a rooster | In less than a rooster sings | In less than it sings a rooster |
| Eso es harina de otro costal | That's a different story | That is horse of another color | That is horse of another color | That is flour of another sack |
| Está pensando en las musarañas | He or she is daydreaming | Is thinking about musarañas | It is thinking about musarañas | Is thinking about the musarañas |
| Haz bien y no mires a quien | Mind your own business | Haz well and you do not watch to that | I Affluent beam and you do not watch to that | You do well and do not look at to whom |
| La carne de burro no es transparente | I can't see through you | the donkey meat is not transparent | The donkey meat is not transparent | The meat of donkey is not transparent |
| Yo te conozco bacalao, aunque vengas disfrazado | I know your game | I know codfish you, although come disguised | I know codfish you, although you come disguised | I know you cod, although you avenge disguised |
| Le patina el coco | He has a screw loose | slides the Coco to Him | The Coco slides to him | Skates it the coconut |
| Más vale pájaro en mano que ciento volando | A bird in the hand is worth two in the bush | It is worth bird in hand that one hundred flying | It is worth bird in hand that one hundred flying | It is worth bird in hand that hundred flying |
| Otro gallo cantaría | That's a horse of a different color | Another rooster would sing | Another rooster would sing | Another rooster would sing |
| Saberlo de buena fuente | To hear it straight from the horse's mouth | Knowing how it of good source | To know it of good source | To know about good source |
| Tener más lana que un borrego | To have money to burn | To have more wool than a lamb | To have more wool than a lamb | To have more wool than a borrego |
| Vivito y coleando | Alive and kicking | Vivito and fishtailing | Vivito and fishtailing | Vivito and coleando |

As shown, none of the translators (we used FAMT systems) recognized any of the idioms. They translated them automatically word by word, so the result is a literal translation and not the real meaning of the idioms.

There are other translators that can be used to translate from one human language to another, but, as shown above, they are not reliable because of the MT problems already discussed (ambiguity, structure, etc.). Most of them use Systran software, so their translations would probably be the same as the ones above.

References

Machine Translation, by Bente Maegaard

Translation problems, by D J Arnold

http://www.translation.net/cat.html

http://www.multilingualwebmaster.com/library/trmemories.html

Conclusion

We think that we have clearly shown the great importance that MT has in society nowadays. Apart from translating texts, MT systems give people access to information in many languages, helping them to understand it without knowing the language.

As we have already explained, Fully Automated Machine Translation is not reliable yet (HLT will have to address this issue, making human communication with machines possible and making machines think in human language).

As we said in this report, there are structural and ambiguity problems when working with MT, and those problems are also familiar to us. A clear example is translation from Spanish to Basque: apart from ambiguity problems, there are structural problems, because Spanish and Basque are structurally completely different.

However, MT provides translators with useful tools (such as TMs) that help them do their job faster and more efficiently. There is still much to improve in this field, and some important problems remain to be solved, as mentioned above.