Mixed techniques for the recognition of translation units in large multilingual corpora

Abstract

There is more than meaning in translation. The growing success of analogy-based MT shows an alternative view. A reconciliation of the two strategies derives from a redefinition of translation unit, based on proposals of traductologists such as (Hatim and Mason 1990). The problem then becomes one of dimension, since dictionaries will not only contain word-concept associations, but endless collections of bitextual chunks which we call translation units. Mixed techniques for the automatic recognition of such entities will become essential.

There is more than meaning in translation, and meaning representations may be misleading. The equation of translation unit with conceptual entity is an erroneous assumption, and it has been at the base of much rule-based MT effort. The growing success of analogy-based MT shows an alternative view, although few know why. A reconciliation of the two strategies derives from a redefinition of translation unit, based on proposals of traductologists such as (Hatim and Mason 1990), (Nord 1991) or (Sager 1993). The problem then becomes one of dimension, since dictionaries will not only contain word-concept associations, but endless collections of bitextual chunks which we call translation units. Mixed techniques for the automatic recognition of such entities will become essential.

Introduction

A prevailing assumption in MT is that translation is primarily a problem of meaning equivalence. For this reason research has predominantly focused on techniques such as semantic networks (Simmons and Slocum 1972), preference semantics (Wilks 1973), case and valency grammars (Somers 1987), knowledge representations (Carbonell et al 1981, Nirenburg et al 1985, Dorr 1987), lexical transfer (Melby 1986, Alonso 1990) and word sense disambiguation (Masterman 1957, Amsler and White 1979). These approaches rest on the widespread idea, owing to Montague, Frege or Leibniz, that although different on the surface, languages share a deep logical substance. A large part of the MT community still holds the view that the discovery and formalization of such underlying conceptual structure would amount to a major breakthrough in the field.

Translation studies, however, have for some time shown that meaning is only one aspect of translingual equivalence. (Hatim and Mason 1991), for example, suggest that semiotic equivalence is more appropriate than correspondence of meaning. (Nord 1997) also argues for the consideration of pragmatic and stylistic factors as well as semantics.

This paper explores the identification of semiotic text segments in large multilingual corpora and their reutilization in translation memories.

Thanks to the successful experiments of (Nagao 1984) and (Brown 1990), new paradigms of MT research have emerged which avoid semantic considerations. These new approaches shed light on a problem that has not been correctly addressed, as first (Kay 1980/1997) and later, more openly, (Melby 1995) have pointed out, although many experts in the MT community still endorse semantic approaches (Maegaard 1999).

This paper discusses our experience with parallel text segmentation in a project that has lasted for eight years. We have seen that the more open hypothesis of semiotic-based text segmentation has important advantages. We will first focus on a classification of text segments, which rests on the notions of compositionality, collocation and text categorization. Second, we will show some of the problems that we have encountered in identifying the adequate text segment. We will define an algorithm that .... Then we will conclude...

In order to account for the continuum in natural languages from free combining words to more fixed expressions, such as idioms kick the bucket / estirar la pata.

Background

The main lesson we have learned in our project is the importance of recognizing adequate text segments in large bilingual corpora. The bigger a recognized segment is, the clearer the perception of the text containing it will be, and the less effort is spent on senseless segmentation into smaller units. This also connects with the fact that text categorization is a major issue in translation studies. There are some types of text that need no segmentation, because their equivalence in a target language must not be established at a lower level than the whole text. This is persistently the case in legal documentation, for translations between texts belonging to different legal systems. Contrastive analysis of texts belonging to such domains shows sharp structural and typological differences; still, translation is possible (for example, of memoranda of association) if the equivalence is placed at the highest level of segmentation (Borja 1999?).

It seems that the MT community is slowly waking up and becoming aware of the relevance of considering bigger segments (Bennett 1994). Collocational issues have been on stage for several years (Nirenburg ), but many different phenomena are mixed up under that label. Viegas et al 1997 establish one classification that we will consider, and we will extend it with complex lexemes, idioms, locutions and formulae.

Text segmentation has been perceived as a major issue in translation (Larose 1989, Toury 1995). However, there is no general agreement as to the nature of these text segments, which in translation studies are often called translation units.

Not important for descriptive linguists, even less so for explanatory ("real" theoretical) linguists; but crucial for translators and traductologists. Also of interest for contrastive linguistics, stylistics and pragmatics in general. Register, style, genre detection, classification, lexical density, etc.

From the view of translation as a problem of meaning equivalence,

Vinay and Darbelnet, Vazquez Ayora, etc.
Translation atom Vinay&Darbelnet (1958:16,36ss) "the smallest segment of the utterance where the cohesion of signs is such that they cannot be translated separately". "UTs are, rather, lexicological units where lexical elements converge in the expression of a single element of thought" (:37). "Multi-word UTs include idioms, various kinds of collocations, and what might now be called support verb constructions (e.g. faire une promenade 'to take a walk')". UTs are not linguistic units. "This first sense of UT, then, is really a translation atom, the smallest segment that must be translated as a whole".

Bennett
"Hatim&Mason90 is the standard reference on the text as a UT, though the authors acknowledge that a text need not constitute an entire stretch of discourse (178)". :15 "In this picture [of a transfer module], it is the lexical units which constitute the translation atoms, i.e. the UTs in transfer. Such units will ordinarily be words, but may be at a higher or lower level". "It is the elements of lexical transfer rules which constitute the translation atoms". "The claim that translation is essentially compositional, viz. that the translation of complex expressions is a function of the translation of its parts and the way they are combined, implies a fairly simple approach to the larger linguistic units, i.e. larger translation foci". "In the best case, there should be nothing to say in transfer about higher than the translation atom". :16 "The simple-transfer methodology in MT results, in the best case, in purely lexical transfer -though 'lexical' here does not simply mean word or morpheme". "Some lexical unit (or listeme) is transferred with as little attention to the context as possible, and there is a bare minimum of transfer at higher levels of the linguistic hierarchy. Considerations of system design undoubtedly point in this direction". "The higher linguistic levels, however, stop at the sentence, which for MT is the translation macro-level". "A text for MT is simply the concatenation of independently-translated sentences, with no consideration of such matters as textual cohesion or rhetorical structure (note 5. As pointed out by Hatim&Mason90:24. Their point actually concerns early, pre-ALPAC MT, but little has changed since then)". "Just as simple lexical equivalences are criticised by translation theorists as being taken out of context, so the same applies to individual sentences: Hatim&Mason90:32 claim that decontextualized utterances such as John is eager to please cannot form the basis for useful discussion of translation".
"But this is to elevate discourse matters to far too all-encompassing a status, and the detailed study in the MT literature of translation problems made in the framework of single example sentences shows that such discussion can be insightful".

This, together with the idea that meaning is essentially compositional, i.e. that the translation of large text fragments is a function of the translation of their parts and the way they are combined,

Bennett 1994 reviews the notion of translation unit in translation studies and its possible application to machine translation. His conclusion is that for linguistically motivated rule-based systems, and more specifically in transfer-based systems, it is the lexical units which constitute the translation atoms, or basic translation unit.

Vinay and Darbelnet (1956) is an early and commonly cited reference for meaning as the main guide to text segmentation: "a translation unit is the smallest segment of the utterance where the cohesion of signs is such that they cannot be translated separately". In this way, translation units are put on the level of lexicological units "where lexical elements converge in the expression of a single element of thought".

"Multi-word UTs include idioms, various kinds of collocations, and what might now be called support verb constructions (e.g. faire une promenade 'to take a walk')". UTs are not linguistic units. "This first sense of UT, then, is really a translation atom, the smallest segment that must be translated as a whole".

Hatim and Mason (1990) advocate the semiotic entity as the unit of translation. Such an entity may consist of a discrete sign, but it ranges from simple linguistic units to entire texts: "One-line slogans (e.g. Salford, the Enterprising City), and entire political speeches in favour of the 'enterprise culture' are, each in their own way, a manifestation of a particular semiotic sign".

Martinez et al, Abaitua et al

Text segmentation, alignment, reviews

Textual entities, why? Text sections, divisions, paragraphs, sentences? In translation, different things may be deemed to represent a unit, a unit of translation. Anything bigger than the sentence we will call a formula, and that will be our biggest text chunk. The classification of these text chunks is a task for pragmatics.

Translation of meaning is just one part, according to Nord's notion of equivalence. There is more than meaning in translation, and meaning representations may be misleading. The example of an Escritura de constitución is extreme, but it shows the rationale of the argument.

In meaning-based translation, then, semantic theories play an important role, as well as lexicological transfer. Lexicological units are related to conceptual entities, which are deemed to be basically interlingual.

The idea has thus prevailed that lexicological units were translation units. Do these lexicological categories combine freely, and is their combination compositional?

Much attention has been paid to subcategorization and collocations. Case and valency (Somers), and collocational (Nirenburg).

Next we offer a classification of constraints on the free combination of lexicological units (LUs).

Classification of bitextual entities and lexical density in parallel corpora

Introduction

A series of combined advances over the last decade suggests that the dream of high-quality machine translation may be within reach. The key lies in the availability of a growing number of bitextual elements and in the effective exploitation of the concept of translation unit. If the goal is not achieved, it will not be because the means or the necessary techniques are unknown, but for lack of collective will and resolve.

Statement of the problem

Translation unit

If we follow the TMX proposal, the definition of a translation unit is extremely simple: it is a character string between the tags <TU>...</TU>. This may of course seem frivolous, but I will show that it is less so than it seems, and that the key to the success of machine translation lies in extending this notation to every fragment of bitext within our reach.
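As a minimal sketch of what such a notation looks like in practice, the following Python fragment builds one translation unit with the standard library. The lower-case element names tu, tuv and seg and the xml:lang attribute follow our reading of TMX 1.4 and are an assumption here; the helper name make_tu is hypothetical, and the example pair is taken from the complex lexemes listed later in this paper.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute (the XML namespace is predefined).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def make_tu(segments):
    """Build one <tu> element from a {language_code: text} mapping.

    Hypothetical helper: one <tuv> per language, each holding one <seg>.
    """
    tu = ET.Element("tu")
    for lang, text in segments.items():
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        seg = ET.SubElement(tuv, "seg")
        seg.text = text
    return tu

tu = make_tu({
    "en": "Memorandum of Association",
    "es": "Escritura de constitución",
})
print(ET.tostring(tu, encoding="unicode"))
```

Nothing in the structure depends on the size of the segment: the same element could hold a single term, a sentence or a whole document.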

Progress of the work

Achievements of LEGEBiDUNA, XTRA-Bi, XML-Bi

Algorithms

Segmentation into bitextual entities

Text segmentation is a major issue in translation (Larose 1989, Toury 1995). The problem is to determine an adequate unit for this process. Putting forward the translation unit (TU) as a candidate is only a trivial solution, because we then find the same difficulty in defining this concept. Early proposals by Vinay and Darbelnet and by Vazquez Ayora equated the TU with the lexicological unit.

In order to account for the continuum in natural languages from free combining words to frozen expressions, such as idioms kick the bucket / estirar la pata.

Either transfer of lexicological units, as in LMT, or mapping of lexical-conceptual structures (Dorr 1993, Nirenburg et al. 1992).

Syntagmatic relations, also known as collocations, are understood differently by lexicographers, linguists and statisticians, who use the term to denote similar but not identical classes of expressions.

Church and Hanks (1989) and Smadja (1993) use statistics in their algorithms to extract collocations from texts. Ahrenberg et al. (1998) align lexicological units, which include multiword expressions, in a parallel corpus.
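To make the statistical approach concrete, here is a minimal sketch of a mutual-information score for word pairs, in the spirit of the association ratio of Church and Hanks. It is a simplification under our own assumptions: only adjacent word pairs are counted, probabilities are plain relative frequencies, and the function name bigram_pmi and the min_count threshold are hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from raw counts.
    Pairs seen fewer than min_count times are ignored.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = "strong coffee and strong coffee but weak tea and weak tea".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A positive score means that the pair occurs together more often than the individual frequencies of its words would predict, which is what makes it a collocation candidate.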

In recent years much progress has been made in the area of bilingual alignment.

Dagan and Church 1994 report that their Termight system helped double the speed at which terminology lists could be compiled at the AT&T Business Translation Services.

The ability to handle multi-word units is crucial in many language pairs (Jones and Alexa 1997, Ahrenberg et al. 1998). Some languages, such as English, deploy multi-word compounds to express technical concepts, while other languages use a single term. Examples from Ahrenberg et al.

Kitamura and Matsumoto 1996 present results from aligning multi-word and single word expressions.

Viegas et al. 1998...

The problem is how to detect them automatically. All but (1) are problematic, if indeed any category is really contextually unbounded.

In the past, this work has been done mainly manually. Recent experiments for the automatic recognition of what have been generally called collocations include:

If compositional, Pustejovsky argues, it is arbitrary to create a separate word sense for a lexical item just because the meaning of a predicate varies depending on the argument being modified. Lexicons would typically require an enumeration of all the different senses. But such sense ambiguity can be generated rather than listed. The ambiguity exhibited by some predicates is the result of a phenomenon of type coercion. This is achieved by enriching the lexical semantic representation for lexical items while also allowing a word's semantic type to shift or be coerced in particular contexts. With verbs and nouns allowed to shift in type, the semantic load can be spread in the lexicon more evenly, while still capturing the ways in which words can extend their meaning, the creative use of words (:73).

The notion of context enforcing a certain reading of a word, selecting a particular word sense, is central to dictionary entry design (breaking a word into word senses) and local composition of individual sense definitions.

(Wilks et al. 1993) believe semantics to be based on the notion of word sense as used by traditional lexicography in constructing dictionaries. The inability of programs to cope with lexical ambiguity has been a major reason for the failure of early computational linguistics tasks such as machine translation (:344).

Compositional

1) Freer word combinations: Part of Speech categories
Phrase structure grammars. These are based upon Part of Speech entities. Categories such as N, A, V, P, Det, Adv, Conj, etc. are abstractions over large numbers of lexical items. Phrase structure rules (PSRs) can be written on top of them: Det + N = NP, etc. But how free are they? Not very.

2) Subcategorizations
Verbs, deverbal nouns, adjectives and prepositions vary in that they exhibit different syntagmatic relations. Since Fillmore, but also ... Case, valency, semantic roles, selectional restrictions, etc.

like and gustar: Manning 1993, Brent 1993, Monedero 1995, Kokkinakis 1996, Arriola 1999

3) Semantic collocations
Pustejovsky 1993: fast waltz, fast car, fast typist, fast book, fast reader

4) Lexical co-occurrences

Viegas et al 1997: rancid butter (mantequilla rancia), sour milk (leche cortada), overripe fruit (fruta pasada)

Semi-compositional

5) Restricted semantic co-occurrences
Viegas et al 1997: strong coffee (café muy cargado), strong wine (vino peleón), heavy smoker (fumador empedernido)
The collocate does not have this sense in its lexical entry; it is the base that predicts the combination.

Non-compositional

6) Complex lexemes
Include referential expressions, such as proper names, etc.
dictámenes jurídicos / opinions of law; Escritura de constitución / Memorandum of Association
operating system / operativsystem
fruit salad (macedonia)
Dagan and Church 1994, Jones and Alexa 1997, Melamed 1997, Ahrenberg et al 1998.

7) Idioms

as nutty as a fruitcake / más loco que una cabra
I am... / estoy más loco que una cabra
She is... / está más loca que una cabra
They are... / están más locos que una cabra

mañana cumple 20 años / she’ll be 20 tomorrow
¿cuándo cumples años? / when’s your birthday?
¡que cumplas muchos más! / many happy returns!
¡que los cumplas muy feliz! / have a very happy birthday!
ése ya no cumple los cuarenta / he won’t see forty again
mañana cumplimos 20 años de casados / tomorrow we’ll have been married 20 years; tomorrow is our 20th wedding anniversary
la huelga cumple hoy su tercer día / this is the third day of the strike
un acto que cumple su decimotercer aniversario / an event that is now celebrating its thirteenth anniversary

<1:3> Every Sunday
<2:1> between 10:30am and six pm
<3:4> in the plaza Thorbecke
<4:2> there is
<5:5> an art market
<6:6> that is now celebrating its thirteenth anniversary.

Desde las 10.30 hasta las 18
tiene lugar,
cada domingo,
en la plaza Thorbecke
un mercado de arte
que cumple su decimotercer aniversario.

Between 10:30am and six pm
there is,
every Sunday,
in the plaza Thorbecke
an art market
that is now celebrating its thirteenth anniversary.

Cada domingo,
desde las 10.30 hasta las 18
en la plaza Thorbecke
tiene lugar
un mercado de arte
que cumple su decimotercer aniversario.
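The position markers used in the first English version above can be read mechanically. The sketch below assumes, as our own interpretation, that a marker <i:j> means "chunk i of the English text corresponds to chunk j of the first Spanish version"; the function names parse_marked and pair_segments are hypothetical.

```python
import re

# A marked line looks like "<1:3> Every Sunday".
MARK = re.compile(r"<(\d+):(\d+)>\s*(.*)")

def parse_marked(lines):
    """Return (source_position, target_position, text) triples."""
    triples = []
    for line in lines:
        match = MARK.match(line.strip())
        if match:
            triples.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return triples

def pair_segments(marked, target_segments):
    """Pair each source chunk with the target chunk at its stated position."""
    return [(text, target_segments[tgt - 1]) for _, tgt, text in marked]

english = [
    "<1:3> Every Sunday",
    "<2:1> between 10:30am and six pm",
    "<3:4> in the plaza Thorbecke",
    "<4:2> there is",
    "<5:5> an art market",
    "<6:6> that is now celebrating its thirteenth anniversary.",
]
spanish = [
    "Desde las 10.30 hasta las 18",
    "tiene lugar,",
    "cada domingo,",
    "en la plaza Thorbecke",
    "un mercado de arte",
    "que cumple su decimotercer aniversario.",
]

pairs = pair_segments(parse_marked(english), spanish)
for source, target in pairs:
    print(f"{source} / {target}")
```

Each resulting pair is a candidate bitextual chunk, independent of the order in which the two texts present it.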

8) Locutions
Prepositional verbs, conjunctions, prepositions
in spite of / trots (SW)
after all / när allt kommer omkring
however / sin embargo

still / a pesar de todo
el poder adquisitivo es cada vez menor / purchasing power decreases every day
el medio ambiente está cada vez peor / environment gets worse every day

9) Formulae

<1:3> Every <nameDay>
<2:1> between <time_1> and <time_2>
<3:4> in the <place>
<4:2> there is
<5:5> an <event>

Desde las <time_1> hasta las <time_2>
tiene lugar,
cada <nameDay>,
en la <place>
un <event>
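A formula of this kind can be derived from a concrete chunk by replacing its referential elements with placeholders. The following is only a sketch under strong assumptions: the two regular expressions are hypothetical, cover English day names and a few time formats only, and leave <place> and <event> untouched, since recognizing those would need more than a pattern.

```python
import re

# Hypothetical patterns; a real system would need far richer recognizers.
DAY_RE = re.compile(
    r"\b(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b")
TIME_RE = re.compile(
    r"\b(?:\d{1,2}[:.]\d{2}(?:\s?(?:am|pm))?"
    r"|(?:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve)"
    r"\s(?:am|pm))\b")

def generalize(chunk):
    """Replace day names with <nameDay> and times with <time_1>, <time_2>, ..."""
    chunk = DAY_RE.sub("<nameDay>", chunk)
    count = 0

    def number_time(match):
        # Number the time expressions in order of appearance.
        nonlocal count
        count += 1
        return f"<time_{count}>"

    return TIME_RE.sub(number_time, chunk)

print(generalize("Every Sunday"))
print(generalize("between 10:30am and six pm"))
```

Two chunks that differ only in their circumstantial elements then reduce to the same formula and can be matched as one entity.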

Bennett 1994 provides a proposal which is adequate for traditional rule-based MT designs. Interlingual and transfer systems rely on lexicological units, since they are heavily oriented towards linguistic knowledge.

However, analogy-based MT, and more clearly, example-based MT designs allow for a more varied collection of textual entities to be considered. In particular if we think of memory-based systems.

Text chunks may be of different size and nature. Compositionality and non-compositionality. Viegas et al 1997 establish a hierarchy which can be extended crucially if we consider the bottom line, that is, less compositional or non-compositional elements first. As an example consider a religious prayer, such as PN.

In fact, considering bottom text chunks first may release other higher elements from consideration. I will argue that this bottom-up consult?? in the dictionary will have important benefits. We will study this.

Less responsibility for the syntactic component, in some cases even none. And much more for the lexical and the pragmatic modules.

Algorithms for automatic detection of big text chunks, such as the PN. No problem for n-grams of 2 to 9 words, but exponential growth when dealing with strings longer than 10. Solutions.

If a chunk occurs twice in a corpus, then we have detected a potential entity, regardless of the size of the string or the size or representativeness of the corpus.
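The repetition criterion can be sketched directly for short chunks. The code below is an illustration under our own assumptions, not the algorithm used in the project: it enumerates every word n-gram between min_len and max_len (both hypothetical parameters), keeps those seen at least twice, and then retains only the chunks that are not contained in a longer repeated chunk.

```python
from collections import defaultdict

def repeated_chunks(tokens, min_len=2, max_len=9):
    """Map each n-gram that occurs at least twice to its start positions."""
    seen = defaultdict(list)
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            seen[tuple(tokens[i:i + n])].append(i)
    return {chunk: pos for chunk, pos in seen.items() if len(pos) >= 2}

def contains(longer, shorter):
    """True if shorter appears as a contiguous run inside longer."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def maximal_chunks(chunks):
    """Drop repeated chunks that are part of a longer repeated chunk."""
    return [c for c in chunks
            if not any(len(o) > len(c) and contains(o, c) for o in chunks)]

tokens = "there is an art market and there is an art fair".split()
repeated = repeated_chunks(tokens)
print(maximal_chunks(repeated))
```

Enumerating all lengths in this way is affordable for short n-grams; for much longer strings some other strategy, such as comparing whole structural blocks, is needed.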

Hatim & Mason
:178 "Text is a coherent and cohesive unit, realised by one or more than one sequence of mutually relevant elements, and serving some overall rhetorical purpose". :105 "The semiotic entity as a unit of translation. [...] The translator identifies a source-system semiotic entity. This will be a constituent element of a certain cultural (sub-)system: :107 [This] semiotic entity consisted of a discrete sign. But semiotic entities may be much larger, ranging from complete entities to entire text. One-line slogans (e.g. Salford, the Enterprising City), and entire political speeches in favour of the 'enterprise culture' are, each in their own way, a manifestation of a particular sign". :57 "Three dimensions of context. (Communicative, pragmatic and semiotic). Semiotic dimension: treating a communicative item, including its pragmatic value, as a sign within a system of signs". Semiotic theory of translation :113 "What this [semiotic transformation] implies for a semiotic theory of translating is that the concept of 'sign' is gradually giving way to that of 'semiotic entity' and, as in some recent formulations, to 'sign function' (Silverman 1983). This arises from what happens when a given portion of reality (Hjelmslev's 'content plane') is subjected by the 'expression plane' to a process of segmentation. The resulting sign-functions are semantic units which, singly or collectively, constitute the filters through which a culture thinks, develops or decays".

Algorithms

The underlying assumption is that, given a big enough corpus, semiotic entities reappear as text chunks more than once. Then the problem becomes of

N-grams, co-occurrences

Ahrenberg et al. 1998 describe an iterative algorithm that repeats the process of generating translation pairs from the bitext, and then reducing the bitext by removing pairs that have been found before the next iteration starts. The algorithm stops when no more pairs can be generated, or when a given number of iterations have been completed.
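The control flow of such an iterative procedure can be sketched as follows. Only the loop structure (generate pairs, reduce the bitext, stop when nothing new is found or after a fixed number of iterations) is taken from the description above. The pair generator one_to_one_pairs is a deliberately naive stand-in of our own and not the method of Ahrenberg et al.; all names and the toy data are hypothetical.

```python
def one_to_one_pairs(bitext):
    """Toy generator: a segment pair with one token left on each side yields a pair."""
    pairs = []
    for src, tgt in bitext:
        if len(src) == 1 and len(tgt) == 1 and (src[0], tgt[0]) not in pairs:
            pairs.append((src[0], tgt[0]))
    return pairs

def reduce_bitext(bitext, pairs):
    """Remove the tokens of already found pairs from every segment."""
    src_found = {s for s, _ in pairs}
    tgt_found = {t for _, t in pairs}
    return [([w for w in src if w not in src_found],
             [w for w in tgt if w not in tgt_found])
            for src, tgt in bitext]

def iterative_extraction(bitext, generate_pairs, max_iterations=10):
    """Alternate pair generation and bitext reduction until nothing new appears."""
    found = []
    for _ in range(max_iterations):
        pairs = generate_pairs(bitext)
        if not pairs:
            break  # no more pairs can be generated
        found.extend(pairs)
        bitext = reduce_bitext(bitext, pairs)
    return found

bitext = [
    (["coffee"], ["café"]),
    (["strong", "coffee"], ["café", "cargado"]),
]
found = iterative_extraction(bitext, one_to_one_pairs)
print(found)
```

The point of the reduction step is visible even in this toy case: once the first pair is removed, the second segment becomes unambiguous and yields a new pair in the next iteration.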

Evaluation and future work

Algorithm

The hypothesis is that in large corpora semiotic entities reappear as text blocks more than once. The question is to find a strategy that makes it possible, on the one hand, to recognize text blocks in the corpus as candidate semiotic units and, on the other, to store them at a reasonable computational cost so that they can be compared with whatever new candidate blocks appear.

In a hierarchy of:

Document (complete work)
Sections (chapters)
Subsections and sub-subsections
Paragraphs

it should be possible to reconcile a tractable size with the ideal candidate.

Paragraphs are the last resort. This is in fact the proposal made by (Hatim and Mason 199?). The question is where the point of balance lies that allows us to move up levels to the maximum number of tractable paragraphs.
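One way to keep the cost of storage and comparison reasonable is to reduce each block to a fingerprint and to descend the hierarchy only when the larger block has not been seen before. The sketch below is our own illustration of that idea, not a description of an implemented system. A block is represented as a (text, children) tuple, exact matching after whitespace and case normalization is assumed, and the names fingerprint, ChunkStore and largest_repeated are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Hash a block after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

class ChunkStore:
    """Keeps one fingerprint per block seen so far."""

    def __init__(self):
        self.seen = {}

    def add(self, text):
        """Store a block; return True if the same block was stored before."""
        key = fingerprint(text)
        repeated = key in self.seen
        self.seen.setdefault(key, text)
        return repeated

def largest_repeated(node, store):
    """Return the largest already known blocks of a (text, children) tree.

    The search starts at the document and moves down to sections and
    paragraphs only when the larger block is new.
    """
    text, children = node
    if store.add(text):
        return [text]
    found = []
    for child in children:
        found.extend(largest_repeated(child, store))
    return found

doc1 = ("Clause A. Clause B.", [("Clause A.", []), ("Clause B.", [])])
doc2 = ("Clause A. Clause C.", [("Clause A.", []), ("Clause C.", [])])

store = ChunkStore()
first = largest_repeated(doc1, store)   # nothing known yet
second = largest_repeated(doc2, store)  # shares one paragraph with doc1
third = largest_repeated(doc1, store)   # the whole document is now known
print(first, second, third)
```

With this arrangement paragraphs are indeed the last resort: they are consulted only when no higher level of the hierarchy matches.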

Generalization: remove circumstantial referential elements.

Future work

In our corpus we have detected that.... Casillas (2000) obtained ..., but later revisions have resolved...

Simplify the comparison by removing frequent words and stop words. Weigh similarity percentages between similar distinctive vocabulary.
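A possible reading of this simplified comparison is sketched below: stop words are removed and the remaining vocabulary of two blocks is compared as sets. The stop-word list is purely illustrative, and the use of the Jaccard coefficient as the similarity percentage is our assumption; the function names are hypothetical.

```python
# Illustrative stop-word list; a real one would be language-specific and longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "that", "and", "there"}

def content_words(text):
    """Lower-cased words of a block, without punctuation or stop words."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return words - STOP_WORDS - {""}

def similarity(a, b):
    """Jaccard coefficient over the content words of two blocks."""
    words_a, words_b = content_words(a), content_words(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

print(similarity("There is an art market", "there is an art fair"))
print(similarity("the art market", "An art market."))
```

Blocks whose score exceeds some threshold could then be treated as variants of the same candidate entity; choosing that threshold is left open here.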

Categorization, classification, genre detection, etc.

Tuning segmentation in to text type in Machine Translation

Joseba Abaitua Universidad de Deusto

Abstract

Human translators have long recognised the importance of text type in translation (cf. Catford 1965:83, Wilss 1977:135-9). There is also a fruitful tradition in translation studies around the issue of segmenting the text into translation units. Identifying the translation unit may be crucial to achieving a good translation. For some time there has been some agreement that the translation unit is the lexicological unit (Bennett 1994). This paper describes an algorithm for automatically adapting the segmentation of text chunks to the text type and genre.

Relevance of text type and genre detection in translation.

Translation strategies according to text type. Segmentation and translation unit.

Quotes

Sager 97:30 Text types have evolved as patterns of messages for specific communicative situations. When we write a message we first think of the text type that is suitable for the occasion and the content, and formulate our text accordingly. Regular repetitions of messages in particular circumstances have created expectations of recognisable structural and rhetorical features which condition our modes of reading a message. When we receive a message, we first think of the text type because it permits us to tune in to the appropriate mode of reception.

Sager 97:31 Translation studies have accepted the need for analysing text types. Neubert (1985:125) offers an interesting definition of text type by calling them "socially effective, efficient, and appropriate moulds into which the linguistic material available in the system of a language is recast". Wilss (1977:135-9) surveyed the role that text types have increasingly been playing in translation theory, but also concluded that text types were mainly studied in order to determine translation methods or degrees of translatability.

Trosborg 97:i Since Catford (65:83), the desire to have a framework of categories for the classification of varieties or "sub-languages" within a language has been acknowledged. Genre analysis has been concerned with establishing characteristics of particular types of text, but whereas the concepts of genre have a long tradition in literary studies, interest in the analysis of non-literary genres is of more recent date (eg Swales 1990, Bhatia 1993).

The book attempts to demonstrate the value of text typology for translation purposes, emphasizing the importance of genre analysis, analysis of communicative functions and text types in a broad sense as a means of studying spoken and written discourse. Sonnets, sagas, fairy tales, novels and feature films, sermons, political speeches, international treaties, instruction leaflets, business letters, academic lectures, academic articles, medical research articles, technical brochures and legal documents are but some of the texts treated in this volume. It is argued that text typology involving genre analysis can help the translator develop strategies that facilitate his/her work and provide awareness of various options as well as constraints. In this book, text type is used in a broad sense to refer to any distinct type of text and the notion includes genre.

This book addresses the central questions: In what ways are translations affected by text types? To what extent and in what areas are text types identical across languages?

References