Mixed techniques for the recognition of translation units in large multilingual corpora
There is more than meaning in translation. The growing success of analogy-based MT shows an alternative view. A reconciliation of the two strategies derives from a redefinition of translation unit, based on proposals of traductologists such as (Hatim and Mason 1990). The problem then becomes one of dimension, since dictionaries will not only contain word-concept associations, but endless collections of bitextual chunks which we call translation units. Mixed techniques for the automatic recognition of such entities will become essential.
There is more than meaning in translation, and meaning representations may be misleading. The equation of translation unit with conceptual entity is an erroneous assumption, yet it has been at the base of much rule-based MT effort. The growing success of analogy-based MT shows an alternative view, although few know why. A reconciliation of the two strategies derives from a redefinition of translation unit, based on proposals of traductologists such as (Hatim and Mason 1990), (Nord 1991) or (Sager 1993). The problem then becomes one of dimension, since dictionaries will not only contain word-concept associations, but endless collections of bitextual chunks which we call translation units. Mixed techniques for the automatic recognition of such entities will become essential.
A prevailing assumption in MT is that translation is primarily a problem of meaning equivalence. For this reason research has predominantly focused on techniques such as semantic networks (Simmons and Slocum 1972), preference semantics (Wilks 1973), case and valency grammars (Somers 1987), knowledge representations (Carbonell et al. 1981, Nirenburg et al. 1985, Dorr 1987), lexical transfer (Melby 1986, Alonso 1990) and word sense disambiguation (Masterman 1957, Amsler and White 1979). These approaches rest on the widespread idea, owing to Montague, Frege or Leibniz, that although different on the surface, languages share a deep logical substance. A large part of the MT community still holds the view that the discovery and formalization of such underlying conceptual structure would constitute a major breakthrough in the field.
Translation studies, however, have for some time shown that meaning is only one aspect of translingual equivalence. (Hatim and Mason 1991), for example, suggest that semiotic equivalence is more appropriate than correspondence of meaning. (Nord 1997) also argues for the consideration of pragmatic and stylistic factors as well as of semantics.
This paper explores the identification of semiotic text segments in large multilingual corpora and their reutilization in translation memories.
Thanks to the successful experiments of (Nagao 1984) or (Brown 1990), new paradigms of MT research have emerged which avoid semantic considerations. These new approaches shed light on a problem that has not been correctly addressed, as first (Kay 1980/1997) and then, more openly, (Melby 1995) have pointed out, although many experts in the MT community still endorse semantic approaches (Maegaard 1999).
This paper discusses our experience with parallel text segmentation in a project that has lasted for eight years. We have seen that the more open hypothesis of semiotic-based text segmentation has important advantages. We will first focus on a classification of text segments, which rests on the notions of compositionality, collocations and text categorization. Second, we will show some of the problems that we have encountered in identifying the adequate text segment. We will define an algorithm that .... Then we will conclude...
In order to account for the continuum in natural languages from freely combining words to more fixed expressions, such as the idioms kick the bucket / estirar la pata.
The main lesson we have learned in our project is the importance of recognizing adequate text segments in large bilingual corpora. The bigger a recognized segment is, the clearer the perception of the text containing it will be, and the less effort will be spent in senseless segmentation into smaller units. This also connects with the fact that text categorization is a major issue in translation studies. There are some types of texts that need no segmentation, because their equivalence in a target language must not be established at a lower level than the whole text. This is consistently the case with legal documentation, in translations between texts belonging to different legal systems. Contrastive analysis of texts belonging to such domains shows sharp structural and typological differences; still, translation is possible (for example, of memoranda of association) if the equivalence is placed at the highest level of segmentation (Borja 1999?).
It seems as if the MT community is slowly waking up and becoming aware of the relevance of considering bigger segments (Bennett 1994). Collocational issues have been on stage for several years (Nirenburg ), but many different things are mixed up. Viegas et al 1997 establish one classification that we will consider, but we will extend it to cover complex lexemes, idioms, locutions and formulae.
Text segmentation has been perceived as a major issue in translation (Larose 1989, Toury 1995). However, there is no general agreement as to the nature of these text segments, which in translation studies are often called translation units.
Not important for descriptive linguists, even less so for explanatory ("real" theoretical) linguists; but crucial for translators and traductologists. Also of interest for contrastive linguistics, stylistics and pragmatics in general. Register, style, genre detection, classification, lexical density, etc.
From the view of translation as a problem of meaning equivalence,
Vinay and Darbelnet, Vazquez Ayora, etc.
This together with the idea that meaning is essentially compositional, i.e. that the translation of large text fragments is a function of the translation of their parts and the way they are combined,
Bennett 1994 reviews the notion of translation unit in translation studies and its possible application to machine translation. His conclusion is that for linguistically motivated rule-based systems, and more specifically for transfer-based systems, it is the lexical units which constitute the translation atoms, or basic translation units.
Vinay and Darbelnet (1956) is an early and commonly cited reference for meaning as the main guide for text segmentation: "a translation unit is the smallest segment of the utterance where the cohesion of signs is such that they cannot be translated separately". In this way, translation units are put on the level of lexicological units "where lexical elements converge in the expression of a single element of thought".
"Multi-word UTs include idioms, various kinds of collocations, and what might now be called support verb constructions (e.g. faire une promenade 'to take a walk')". UTs are not linguistic units. "This first sense of UT, then, is really a translation atom, the smallest segment that must be translated as a whole".
Hatim and Mason (1990) advocate a semiotic entity consisting of a discrete sign, ranging from simple linguistic units to entire texts: "One-line slogans (e.g. Salford, the Enterprising City), and entire political speeches in favour of the 'enterprise culture' are, each in their own way, a manifestation of a particular semiotic sign".
Martinez et al, Abaitua et al
Text segmentation, alignment, reviews
Textual entities, why? Text sections, divisions, paragraphs, sentences? In translation, different things may be deemed to represent a unit, a unit of translation. Anything bigger than the sentence we will call a formula, and that will be our biggest text chunk. The classification of these text chunks is a task for pragmatics.
Translation of meaning is just one part, according to Nord's notion of equivalence. There is more than meaning in translation, and meaning representations may be misleading. The example of an Escritura de constitución (deed of incorporation) is extreme, but it shows the rationale of the argument.
In meaning-based translation, semantic theories play an important role, as does lexicological transfer. Lexicological units are related to conceptual entities, which are deemed to be basically interlingual.
Hence the idea has prevailed that lexicological units are translation units. Do these lexicological categories combine freely, and is their combination compositional?
Much attention has been paid to subcategorization and collocations. Case and valency (Somers), and collocational (Nirenburg).
Next we offer a classification of constraints on the free combination of LUs.
Classification of bitextual entities and lexical density in parallel corpora
A series of combined advances over the last decade suggests that the dream of high-quality machine translation may be within reach. The key lies in the availability of a growing number of bitextual elements and in the effective exploitation of the concept of translation unit. If it is not achieved, this will not be because the means or the necessary techniques are unknown, but for lack of collective will and determination.
Statement of the problem
Translation unit
If we follow the TMX proposal, the definition of a translation unit is extremely simple: it is a string of characters between the tags <TU>...</TU>. This may seem frivolous, but I will show that it is not, and that the key to the success of machine translation lies in extending this notation to every fragment of bitext within our reach.
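As a minimal sketch of what such a tagged unit might look like, the following Python fragment builds a single bitextual entry. It assumes the lowercase element names of the published TMX format (tu, tuv, seg) and the xml:lang attribute; the helper names make_tu and tu_to_string and the sample idiom pair are illustrative only, not part of any project tool.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tu(segments):
    """Build one <tu> element from a {language_code: text} mapping."""
    tu = ET.Element("tu")
    for lang, text in segments.items():
        # One <tuv> (translation unit variant) per language
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        seg = ET.SubElement(tuv, "seg")
        seg.text = text
    return tu


def tu_to_string(tu):
    """Serialize the element to a plain string."""
    return ET.tostring(tu, encoding="unicode")


# Hypothetical entry: a non-compositional idiom stored as one unit
example_tu = make_tu({"en": "kick the bucket", "es": "estirar la pata"})
print(tu_to_string(example_tu))
```

The same wrapper applies unchanged whether the segment is a single term, an idiom or a whole paragraph, which is the point of extending the notation to every fragment of bitext.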
Evolution of the work
Achievements of LEGEBiDUNA, XTRA-Bi, XML-Bi
Segmentation into bitextual entities
Text segmentation is a major issue in translation (Larose 1989, Toury 1995). The problem is to determine an adequate unit for this process. Putting forward the translation unit (TU) as a candidate is only a trivial solution, because we then find the same difficulty in defining this concept. Early proposals by Vinay and Darbelnet and by Vazquez Ayora compared the TU with the lexicological unit.
In order to account for the continuum in natural languages from freely combining words to frozen expressions, such as the idioms kick the bucket / estirar la pata.
Either transfer of lexicological units, as in LMT, or mapping of lexical-conceptual structures (Bonnie Dorr 1993 or Sergei Nirenburg et al. 1992).
Syntagmatic relations are also known as collocations, a term that lexicographers, linguists and statisticians use differently, denoting similar but not identical classes of expressions.
Church and Hanks (1989) and Smadja (1993) use statistics in their algorithms to extract collocations from texts. Ahrenberg et al. (1998) align lexicological units, which include multiword expressions, in a parallel corpus.
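To make the statistical idea concrete, here is a minimal sketch of an association score in the spirit of the mutual-information measure used by Church and Hanks. It is a simplification and not their implementation: it counts only adjacent word pairs, with no window and no frequency threshold, and the function name bigram_pmi is an assumption made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): pointwise mutual information} for adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # probability of the pair
        p_x = unigrams[w1] / n_uni   # probability of the first word
        p_y = unigrams[w2] / n_uni   # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = "a b a b c d".split()
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Scores of this kind are known to be unreliable for pairs seen only once or twice, so a practical extractor would normally add a minimum frequency cut-off before ranking candidates.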
In recent years much progress has been made in the area of bilingual alignment.
Dagan and Church 1994 report that their Termight system helped double the speed at which terminology lists could be compiled at the AT&T Business Translation Services.
The ability to handle multi-word units is crucial in many language pairs (Jones and Alexa 1997, Ahrenberg et al. 1998). Some languages, such as English, use multi-word compounds to express technical concepts, while other languages use a single term. Examples from Ahrenberg et al.
Kitamura and Matsumoto 1996 present results from aligning multi-word and single word expressions.
Viegas et al. 1998...
The problem is how to detect them automatically. All but (1) are problematic, if there really is any contextually unbounded category.
In the past, this work has mainly been done manually. Recent experiments for the automatic recognition of what has generally been called collocations include:
If compositional, Pustejovsky argues, it is arbitrary to create a separate word sense for a lexical item just because the meaning of a predicate varies depending on the argument being modified. Lexicons would typically require an enumeration of all the different senses. But such sense ambiguity can be generated rather than listed. The ambiguity exhibited by some predicates is the result of a phenomenon of type coercion. This is achieved by enriching the lexical semantic representation for lexical items while also allowing a word's semantic type to shift or be coerced in particular contexts. With verbs and nouns allowed to shift in type, the semantic load can be spread more evenly in the lexicon, while still capturing the ways in which words can extend their meaning, the creative use of words (:73).
The notion of context enforcing a certain reading of a word, selecting a particular word sense, is central to dictionary entry design (breaking a word into word senses) and local composition of individual sense definitions.
(Wilks et al. 1993) believe semantics to be based on the notion of word sense as used by traditional lexicography in constructing dictionaries. The inability of programs to cope with lexical ambiguity has been a major reason for the failure of early computational linguistics tasks such as machine translation (:344).
Bennett 1994 provides a proposal which is adequate for traditional rule-based MT designs. Interlingual and transfer systems rely on lexicological units, since they are heavily oriented towards linguistic knowledge.
However, analogy-based MT, and more clearly example-based MT designs, allow for a more varied collection of textual entities to be considered, in particular if we think of memory-based systems.
Text chunks may be of different size and nature. Compositionality and non-compositionality. Viegas et al 1997 establish a hierarchy which can be crucially extended if we start from the bottom line, that is, with the less compositional or non-compositional elements first. As an example, consider a religious prayer, such as PN.
In fact, considering bottom text chunks first may release other higher elements from consideration. I will argue that this bottom-up consult?? in the dictionary will have important benefits. We will study this.
Less responsibility falls on the syntactic component, in some cases even none, and much more on the lexical and the pragmatic modules.
Algorithms for automatic detection of big text chunks, such as the PN. No problem for n-grams of length 2 to 9, but big exponential growth when dealing with strings longer than 10. Solutions.
If a chunk occurs twice in a corpus, then we have detected a potential entity, regardless of the size of the string or the size or representativeness of the corpus.
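The "occurs twice" criterion can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's algorithm: the text is assumed to be already tokenized into words, the function name repeated_chunks is hypothetical, and the length is grown one step at a time so that the search stops as soon as no chunk of the current length repeats (a longer chunk cannot repeat if none of its sub-chunks does).

```python
from collections import Counter


def repeated_chunks(tokens, min_len=2):
    """Return maximal token sequences that occur at least twice."""
    repeated = {}
    n = min_len
    while True:
        counts = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        found = {gram: c for gram, c in counts.items() if c >= 2}
        if not found:
            break  # no repeats at this length, so none at longer lengths
        repeated[n] = found
        n += 1

    maximal = []
    for n, grams in repeated.items():
        longer = repeated.get(n + 1, {})
        for gram in grams:
            # Drop a chunk if a longer repeated chunk wholly contains it
            contained = any(
                gram == big[:-1] or gram == big[1:] for big in longer
            )
            if not contained:
                maximal.append(gram)
    return maximal


text = "our father who art in heaven x our father who art in heaven y"
print(repeated_chunks(text.split()))
```

Each pass rescans the whole token list, so this sketch is only suitable for small samples; for large corpora an index structure such as a suffix array would be the more usual choice.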
Hatim & Mason
The underlying assumption is that, given a big enough corpus, semiotic entities reappear as text chunks more than once. Then the problem becomes of
Ahrenberg et al. 1998 describe an iterative algorithm that repeats the process of generating translation pairs from the bitext, and then reducing the bitext by removing pairs that have been found before the next iteration starts. The algorithm stops when no more pairs can be generated, or when a given number of iterations have been completed.
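The control flow of such a generate-and-reduce loop can be sketched as below. Only the loop structure follows the description above; the pair generator singleton_pairs is a toy stand-in invented for illustration (it pairs words in segments that have exactly one word left on each side) and is not the method of Ahrenberg et al. All function names are assumptions.

```python
def singleton_pairs(bitext):
    """Toy generator: pair up segments with one word left on each side."""
    pairs = []
    for src, tgt in bitext:
        if len(src) == 1 and len(tgt) == 1:
            pair = (src[0], tgt[0])
            if pair not in pairs:
                pairs.append(pair)
    return pairs


def reduce_bitext(bitext, pairs):
    """Remove every word that belongs to an already found pair."""
    src_found = {s for s, _ in pairs}
    tgt_found = {t for _, t in pairs}
    return [
        ([w for w in src if w not in src_found],
         [w for w in tgt if w not in tgt_found])
        for src, tgt in bitext
    ]


def iterative_extraction(bitext, generate_pairs, max_iterations=10):
    """Alternate pair generation and bitext reduction until nothing is found."""
    found = []
    for _ in range(max_iterations):
        pairs = generate_pairs(bitext)
        if not pairs:
            break  # stop: no more pairs can be generated
        found.extend(pairs)
        bitext = reduce_bitext(bitext, pairs)
    return found


bitext = [
    (["house"], ["casa"]),
    (["big", "house"], ["casa", "grande"]),
    (["big", "car"], ["coche", "grande"]),
]
print(iterative_extraction(bitext, singleton_pairs))
```

In this toy run each reduction exposes a new unambiguous segment, which is the benefit the iteration is meant to provide: pairs found early simplify what is left for later passes.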
Evaluation and future work
The hypothesis is that in large corpora semiotic entities reappear as text blocks more than once. The question is to find a strategy that makes it possible, on the one hand, to recognize text blocks in the corpus as candidate semiotic units and, on the other, to store them at a reasonable computational cost so that they can be compared with any new candidate blocks that appear.
In a hierarchy of:
it should be possible to reconcile a tractable size with the ideal candidate.
Paragraphs are the last resort. This is in fact the proposal made by (Hatim and Mason 199?). The question is where the point of balance lies that allows us to climb levels up to the maximum number of tractable paragraphs.
Generalization: remove incidental referential elements.
In our corpus we have detected that.... Casillas (2000) obtained ..., but later reviews have resolved...
Simplify the comparison by removing frequent words and stop words. Weigh similarity percentages between similar differential vocabulary.
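One possible reading of this note is sketched below: candidate blocks are reduced to their content words and then compared by vocabulary overlap, expressed as a percentage. Everything here is an assumption made for illustration: the stop-word list is a tiny hypothetical sample, the overlap measure is the Jaccard coefficient (the note does not name a measure), and tokenization is plain whitespace splitting with no punctuation handling.

```python
# Hypothetical stop-word sample; a real list would be language specific
STOP_WORDS = frozenset({"the", "of", "a", "in", "and", "to"})


def content_words(text, stop_words=STOP_WORDS):
    """Lowercase the text and keep only words outside the stop list."""
    return {w for w in text.lower().split() if w not in stop_words}


def similarity(block_a, block_b, stop_words=STOP_WORDS):
    """Percentage of shared content vocabulary (Jaccard coefficient x 100)."""
    a = content_words(block_a, stop_words)
    b = content_words(block_b, stop_words)
    if not a and not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


print(similarity("The memorandum of association",
                 "A memorandum of the association"))
print(similarity("red house", "red car"))
```

A threshold on this percentage could then decide whether a new candidate block is treated as a recurrence of a stored one; where that threshold should sit is an empirical question.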
Categorization, classification, genre detection, etc.
Human translators have long recognised (cf. Catford 1965:83, Wilss 1977:135-9) the importance of the text type, and of categorization more generally, in translation. There is also a fruitful tradition in translation studies around the issue of segmenting the text into translation units. Identifying the translation unit may be crucial to achieving a good translation. For some time there has been some agreement that the translation unit is the lexicological unit (Bennett 1994). This paper describes an algorithm for automatically adapting the segmentation of text chunks to the text type and genre.
Relevance of text type and genre detection in translation.
Translation strategies according to text type. Segmentation and translation unit.
Sager 97:30 Text types have evolved as patterns of messages for specific communicative situations. When we write a message we first think of the text type that is suitable for the occasion and the content, and formulate our text accordingly. Regular repetitions of messages in particular circumstances have created expectations of recognisable structural and rhetorical features which condition our modes of reading a message. When we receive a message, we first think of the text type because it permits us to tune in to the appropriate mode of reception.
Sager 97:31 Translation studies have accepted the need for analysing text types. Neubert (1985:125) offers an interesting definition of text type by calling them "socially effective, efficient, and appropriate moulds into which the linguistic material available in the system of a language is recast". Wilss (1977:135-9) surveyed the role that text types have increasingly been playing in translation theory, but also concluded that text types were mainly studied in order to determine translation methods or degrees of translatability.
Trosborg 97:i Since Catford (65:83), the desire to have a framework of categories for the classification of varieties or "sub-languages" within a language has been acknowledged. Genre analysis has been concerned with establishing characteristics of particular types of text, but whereas the concept of genre has a long tradition in literary studies, interest in the analysis of non-literary genres is of more recent date (e.g. Swales 1990, Bhatia 1993).
The book attempts to demonstrate the value of text typology for translation purposes, emphasizing the importance of genre analysis, analysis of communicative functions and text types in a broad sense as a means of studying spoken and written discourse. Sonnets, sagas, fairy tales, novels and feature films, sermons, political speeches, international treaties, instruction leaflets, business letters, academic lectures, academic articles, medical research articles, technical brochures and legal documents are but some of the texts treated in this volume. It is argued that text typology involving genre analysis can help the translator develop strategies that facilitate his/her work and provide awareness of various options as well as constraints. In this book, text type is used in a broad sense to refer to any distinct type of text and the notion includes genre.
This book addresses two central questions: In what ways are translations affected by text types? To what extent and in what areas are text types identical across languages?