Advantages and difficulties with TEI tagging: Experiences from an aided document composition and translation tool

Arantza Casillas

Departamento de Automática, Universidad de Alcalá

arantza@aut.alcala.es

Joseba Abaitua

Facultad de Filosofía y Letras Universidad de Deusto, Bilbao

abaitua@fil.deusto.es

Raquel Martínez

Departamento de Sis. Informáticos y Programación, Facultad de Informática, Universidad Complutense de Madrid

raquel@eucmos.sim.ucm.es

Keywords: SGML; TEI; Machine Translation; Translation Memory

Abstract:

Translation memories and SGML authoring can be hybridized to produce substantial machine translation coverage. Based on the idea of using DTDs as document-generation grammars, we present an interactive editing tool that integrates source document composition and translation into the target language. The tool benefits from a collection of complementary language databases automatically derived from a TEI-conformant tagged and aligned parallel corpus.

Introduction

Translation memories (TM) have been a successful technology for promoting the quality of multilingual documentation and improving workflow in large corporations and institutions. TM products ease the work of technical writers and translators because they facilitate constant recycling of previously translated and validated material. However, these products neglect one of the most interesting aspects of the coding language they employ internally, SGML. Based on the possibility of using DTDs as document-generation grammars, we present an experiment that explores the hybridization of translation memories and SGML authoring systems.

A corpus of official publications from three bilingual institutions in Spain was compiled and analyzed (Martínez, 1997). Documents in the corpus had been composed by Administration clerks and translated by translators. Both clerks and translators use a wide variety of word processors, but have never been able to resort to any SGML authoring tool. Still, administrative documentation shows a regular structure and is rich in recurrent textual patterns. For each document type, different document tokens share a common global distribution of elements. Our main goal in tagging the corpus was to make all of these elements explicit. This was carried out by means of TEI-conformant (Burnard, 1995) SGML markup. As we will see, some difficulties were encountered. Nevertheless, the markup helped disclose the underlying logical structure of documents. DTDs were later induced from the annotated documentation, and these DTDs served as generation grammars to produce new documents.

As a result of this process of automatic tagging, a TEI/SGML-conformant annotated corpus was produced, as yet with no corresponding overt DTD. Section 2 discusses the application of TEI tags to the corpus as well as some difficulties encountered. In Section 3 we explain how DTDs were later induced from the annotations.

Once the corpus was segmented, the next step was to align it. From the aligned corpus we extracted translation memories (see Section 3). Section 4 describes the composition strategy. Finally, we show some provisional results obtained by the prototype.

Application and difficulties with TEI tags

Textual units were identified and segmented at different levels (Martínez, 1998a):

Domain-independent elements, such as paragraphs (p) or sentences (s), but also others such as dates (date) and numbers (num).
Structural elements. These reflect the division of documents into structural units (opener, div0, div1, dateline, closer).
Elements that mark up textual units which are domain-dependent. These elements help define the structure of the document too.

Both structural and domain-dependent elements will compose the DTD. Figure 1 shows an annotated text in Spanish and its aligned counterpart in Basque (the attributes id and corresp identify the source element and its corresponding target translation).

Documents can be broken up into different segment types. The crucial issue is which of these smaller units may be adequately treated as a translation unit. The whole document, as well as full document sections and divisions, paragraphs, sentences, proper nouns, dates and numbers can all be considered as such. In practical terms, these segments become translation units the moment they are marked as such, that is, from the moment alignment attributes are introduced in their corresponding tags on both sides of the parallel corpus (see Figure 1).
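
The role of the alignment attributes can be illustrated with a minimal Python sketch. It is not the tool described in this paper: the element names (div1, s) and the attributes id and corresp follow the text, but the identifier values and the placeholder strings are invented, and the fragments are assumed to be well-formed XML so that a standard parser can read them.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: identifiers and text are invented placeholders.
SOURCE_XML = """<div1 id="ES1" corresp="EU1">
  <s id="ES1.1" corresp="EU1.1">source text one</s>
  <s id="ES1.2" corresp="EU1.2">source text two</s>
</div1>"""

TARGET_XML = """<div1 id="EU1" corresp="ES1">
  <s id="EU1.1" corresp="ES1.1">target text one</s>
  <s id="EU1.2" corresp="ES1.2">target text two</s>
</div1>"""


def translation_units(source_xml, target_xml):
    """Pair each source element with the target element named by its corresp."""
    source_root = ET.fromstring(source_xml)
    target_root = ET.fromstring(target_xml)
    # Index target elements by id so that corresp lookups are direct.
    target_by_id = {el.get("id"): el for el in target_root.iter() if el.get("id")}
    pairs = []
    for el in source_root.iter():
        partner = target_by_id.get(el.get("corresp"))
        if partner is not None:
            pairs.append((el.tag, el.get("id"), partner.get("id")))
    return pairs


units = translation_units(SOURCE_XML, TARGET_XML)
```

Under these assumptions, both the division and each sentence inside it come out as translation units, which reflects the point that a segment of any size qualifies once it carries alignment attributes.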

Once the corpus was segmented, the next step was to align it. This was conducted at different levels: general document elements, as well as sentential and intra-sentential elements (Martínez, 1998b). Once the corpus has been appropriately aligned (see Figure 1), it becomes a rich source of material that can be constantly recycled for future translations.

We followed the TEI-P3 Guidelines (Burnard, 1995) to mark up domain-independent as well as structural elements, but domain-dependent textual units fell outside the scope of TEI coverage. The general-purpose encoding scheme suggested by TEI is of little use when specialized documentation has to be thoroughly accounted for. We envisaged two solutions to this problem. One was to develop a particular set of tags that could adequately describe the nature of all domain-dependent elements in the legal documentation. This would have forced us to think about the semantic value and pragmatic interpretation of such textual units within the legal domain. The alternative was to use a semantically empty tag name with a numeric counter that would be assigned to each new occurrence of a domain-dependent unit. This was a less risky and more economical option, and was therefore the one we adopted. The seg-number generic identifier was used for this purpose.
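
The counter-based naming can be sketched in a few lines of Python. This is only an illustration: the helper make_seg_namer and the idea of identifying a unit by a key are hypothetical, and the sketch assumes that a unit seen again keeps the identifier it was first given, which the text does not state explicitly.

```python
from itertools import count


def make_seg_namer():
    """Return a function that maps each new unit key to seg1, seg2, ..."""
    counter = count(1)
    assigned = {}

    def name_for(unit_key):
        # Assumption: a unit that reappears keeps its first identifier.
        if unit_key not in assigned:
            assigned[unit_key] = f"seg{next(counter)}"
        return assigned[unit_key]

    return name_for


name_for = make_seg_namer()
first = name_for("unit A")    # a new unit gets the next free number
second = name_for("unit B")
again = name_for("unit A")    # a known unit reuses its identifier
```

The names carry no meaning of their own, which is the point of the option adopted: no semantic analysis of the legal domain is needed to assign them.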

Connected to this problem were the limitations of the TEI tagset for building DTDs that could guide the generation process. Tags such as p, s, or seg are irrelevant for generation (Figure 2) unless some sort of distinguishing character is added to the name (Figure 3). This workaround is not part of TEI's original design goals; however, it is necessary if we want to make any practical use of DTDs in generation.

Figure 5, as opposed to Figure 4, illustrates how the added numbers single out the identity of segments as well as their order in the text.

Resource Generation
Automatic DTD abstraction

In the domain of official documentation, one of the most desired properties is consistency, that is, that all instances of a single document type share the same logical structure. Attaining this property is one of the best spin-offs of the formal constraints that an SGML DTD imposes on new documents. Our aim is to provide writers and translators of official documentation with an authoring environment that takes advantage of this property, that is, an editing environment in which the process of generating new bilingual documents is directed by paired DTDs.

There is one initial problem with this proposal. As Ahonen (1995) has noted for other cases, we have no previously applicable DTD to start from. None of the DTD modules proposed within the TEI community is completely satisfactory for our purposes. There is no common DTD suitable for legal or administrative documentation, nor are tagsets available.

Our DTD generator (Casillas, 1999) is similar to that of (Shafer, 1995), but with some small modifications that produce more abstract DTDs. These are the changes introduced:

In the rule reduction process, we add a group reduction rule as well as an & operator reduction rule. In the case of the group reduction rule, we detect the repetition of elements and then apply the + operator.
As for the & operator rule, it is applied between two instances when the number and names of the occurrences are identical but the order in which they appear is different.
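
The two reduction rules can be sketched as follows. This is a simplified illustration under stated assumptions, not the generator of (Casillas, 1999): a document instance is modeled as a flat list of child element names, the group rule is shown only for a single repeated element rather than for repeated groups, and the function names are hypothetical.

```python
def group_reduce(children):
    """Collapse consecutive repetitions of one element name into name+."""
    reduced = []
    for name in children:
        if reduced and reduced[-1].rstrip("+") == name:
            # The same element again: mark the previous entry as repeatable.
            reduced[-1] = name + "+"
        else:
            reduced.append(name)
    return reduced


def and_reduce(first, second):
    """Join two instances with & when they hold the same names in another order."""
    if first != second and sorted(first) == sorted(second):
        return " & ".join(sorted(first))
    return None  # the rule does not apply


repeated = group_reduce(["title", "p", "p", "p", "closer"])
unordered = and_reduce(["seg1", "seg2"], ["seg2", "seg1"])
```

Here repeated becomes the sequence title, p+, closer, and unordered becomes the content model seg1 & seg2, which in SGML accepts both elements in either order.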

Given that source and target documents show some syntactic and structural mismatches, two different DTDs are induced, one for each language, and are paired through a correspondence table. Correspondences in this table can be updated or deleted. At present, we have six DTDs, one for each document type in each language (there are three document types; Figure 8 shows part of one of these DTDs). By means of these paired DTDs, document elements in each language are appropriately placed. In the process of generating the bilingual document, a document type must first be selected. Each document type has an associated DTD, which specifies which elements are obligatory and which are optional. With the aid of this DTD, the source document is generated. The target document is then generated with the aid of the corresponding target DTD.

TM Generation

Aligned in this way, the corpus becomes an important resource for translation. Four complementary language databases may be obtained at any time from the annotated corpus: three translation memory databases (TM1, TM2, and TM3) as well as a proper noun database (proper-noun-base). The three TMs differ in the nature of the translation units they contain. TM1 consists of aligned sentences that can feed commercial TM software. TM2 contains translation segments ranging from whole sections of a document or multi-sentence paragraphs to smaller units, such as short phrases or proper names. TM3 simply hosts the whole collection of aligned bilingual documents, where the whole document may be considered the translation unit. TM3 can be construed as a bilingual document database. Much redundancy originates from this TM collection, although it should be noted that they are all by-products derived from the same annotated bitext, which subsumes them all. Good software packages for TM1 and TM3 already exist on the market (Trados Translator's Workbench, Star's Transit, SDLX, Déjà Vu, and IBM's Translation Manager for TM1; any SGML browsing tool for TM3), and hence their exploitation is beyond our interest.

The originality of our editing tool lies in a design that combines the potential of DTDs with the elements in TM2, as will be shown in Sections 4 and 5. TM2 specifically stores a class of translation segments, which we have tagged seg1, seg2 ... segn, and which is relevant to the DTD.

All translation memories are managed in the form of a relational database where segments are stored as records. Each record in the database consists of four fields: the segment string, a counter for the occurrences of that string in the corpus, the tag, and the attributes (type, id and corresp).
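
A record layout of this kind can be sketched with an in-memory SQLite database. The sketch is only an approximation of the idea: the table and column names, the JSON encoding of the attributes, and the rule that a repeated string with the same tag increments the counter are assumptions, and the sample values are placeholders rather than corpus data.

```python
import json
import sqlite3


def open_tm():
    """Create an in-memory table with the four fields described in the text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tm (segment TEXT, occurrences INTEGER, tag TEXT, attributes TEXT)"
    )
    return conn


def add_segment(conn, segment, tag, attributes):
    """Insert a segment, or increment its counter if it is already stored."""
    row = conn.execute(
        "SELECT rowid FROM tm WHERE segment = ? AND tag = ?", (segment, tag)
    ).fetchone()
    if row:
        conn.execute(
            "UPDATE tm SET occurrences = occurrences + 1 WHERE rowid = ?", row
        )
    else:
        # Assumption: type, id and corresp are kept together as JSON text.
        conn.execute(
            "INSERT INTO tm VALUES (?, 1, ?, ?)",
            (segment, tag, json.dumps(attributes)),
        )


conn = open_tm()
attrs = {"type": "placeholder", "id": "ES1", "corresp": "EU1"}
add_segment(conn, "placeholder segment", "seg1", attrs)
add_segment(conn, "placeholder segment", "seg1", attrs)
rows = conn.execute("SELECT segment, occurrences, tag FROM tm").fetchall()
```

After the two calls the table holds a single record whose counter is 2, which is the information later used to rank alternatives by frequency.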

Figure 9 shows how the text fragment inside the /div1 ... /div0 tags of Figure 1 renders three records in the database.

Composition Strategy

Every phase in the process is guided by the markup contained in TM2 and the paired DTDs that control the application of this markup. The composition process follows two main steps, which correspond to the traditional source document generation and translation into the target document. The markup and the paired DTDs guide the process in the following manner:

Before starting to write the source document, the user must select a document type, i.e., a DTD. This has two consequences. On the one hand, the selected DTD produces a source document template that contains the logical structure of the document and some of its contents. On the other hand, the selected source DTD triggers a paired target DTD, which will be used later to translate the document. There are three different types of elements in the source document template:
Some elements are mandatory and are provided to the user, who need only choose their content from among some alternative usages (the user is given a list of alternatives ordered by frequency, for example for title). Other obligatory elements, such as dates and numbers, are also generated automatically.
Some other elements in the template are optional (e.g., seg9). Again, a list of alternatives is offered to the user. These optional elements are sensitive to the context (document or division type), and the markup is also responsible for constraining the valid options given to the user. Obligatory and optional elements are retrieved from TM2, and make up a considerable part of the source document.
All documents have an important part of their content which is not determined by the DTD (div1). It is the most variable part, and the system lets the writer input text freely. It is when TM2 has nothing to offer that TM1 and TM3 may provide useful material. Given the recurrent style of legal documentation, it is quite likely that the user will be using many of the bilingual text choices already aligned and available in TM1 and TM3.
Once the source document has been completed, the system derives its particular logical structure, which, with the aid of the target DTD, is projected into the resulting target logical structure.
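
The first step of the strategy above, building a template whose elements come with alternatives ranked by frequency, can be sketched as follows. All names are hypothetical: the content model is reduced to a list of (element, required) pairs, TM2 records are plain dictionaries, and the sample strings are placeholders rather than corpus data.

```python
def alternatives(records, tag):
    """List the stored strings for one tag, most frequent first."""
    matching = [r for r in records if r["tag"] == tag]
    matching.sort(key=lambda r: r["occurrences"], reverse=True)
    return [r["segment"] for r in matching]


def build_template(content_model, records):
    """Attach the ranked alternatives to every element of the content model."""
    return [
        {"element": name, "required": required, "choices": alternatives(records, name)}
        for name, required in content_model
    ]


# Placeholder TM2 records and a toy content model (title mandatory, seg9 optional).
tm2 = [
    {"segment": "title variant B", "occurrences": 3, "tag": "title"},
    {"segment": "title variant A", "occurrences": 7, "tag": "title"},
    {"segment": "optional clause", "occurrences": 2, "tag": "seg9"},
]
template = build_template([("title", True), ("seg9", False), ("div1", True)], tm2)
```

In this sketch the div1 slot ends up with an empty list of choices, which corresponds to the free-text part of the document that the DTD does not determine.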

Evaluation

These are some provisional results obtained by the prototype:

Table 1 shows the number of words that make up the segments stored in TM2 from the source documents. There is a line for each document size considered. We can see that, on average, 31.8% of the words belong to segments contained in TM2, with values ranging from 34.91% down to only 3.01%. The amount of text dealt with in this way largely depends on the size of the document. Short documents (90.21) have about 35% of their text composed in this way. This figure goes down to 3% in documents larger than 1,000 words. This is understandable, in the sense that the larger the document, the smaller the proportion of fixed sections it will contain.
Table 2 shows the number of words that are proposed for the target document. These translations are obtained from what is stored in TM2, complemented by algorithms designed to translate dates and numbers. We can see that, on average, 34% of a document is translated. Short documents have 36% of their text translated, falling to above 11% in the case of large documents.
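
The percentages discussed above are ratios of words supplied by the system to the total number of words in a document. A minimal sketch of that arithmetic, with a hypothetical function name and no data from the tables:

```python
def coverage_percent(words_supplied, total_words):
    """Share of a document's words supplied by the system, as a percentage."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * words_supplied / total_words
```

The same function applies to both tables: for composition, words_supplied counts the words coming from TM2 segments; for translation, it also counts the words produced by the date and number routines.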

Conclusions

One problem has arisen with the application of the TEI guidelines in our project: the problem of generic identifier assignment. The TEI guidelines have proved insufficient to satisfy all our markup needs, and in some cases the textual properties that we were unable to mark were highly relevant to the aims of our project. Our solution to this problem has been to create new generic identifiers following the underlying design of the existing ones.

We have introduced a novel methodology for the creation of authoring environments for editing and translating bilingual documents in specialized domains. This environment takes advantage of all the outcomes of bitext mining: identification and tagging of translation segments, their alignment and allocation in translation memories, and their retrieval and reuse in new documents.

We have shown how DTDs derived from descriptive markup can be employed to ease the process of generating bilingual dedicated documentation.

One of the clear targets for the future is to extend the coverage of the corpus and to test structural taggers against other document types. A big challenge we face is to develop tools that automatically perform the recognition of documents from less restricted and more open text types.

H. Ahonen. Automatic Generation of SGML Content Models. Electronic Publishing, 8(2-3):195-206, 1995.

L. Burnard, C. M. Sperberg-McQueen. TEILite: An Introduction to Text Encoding for Interchange. URL: http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei, 1995.

Casillas A., Abaitua J., Martínez R. Extracción y aprovechamiento de DTDs emparejadas en corpus paralelos. Procesamiento del Lenguaje Natural, 25:33-41, 1999.

Martínez R., Abaitua J., Casillas A. Bilingual parallel text segmentation and tagging for specialized documentation. Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP'97), 369-372, 1997.

Martínez R., Abaitua J., Casillas A. Bitext Correspondences through Rich Mark-up. 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL'98), 812-818, 1998.

Martínez R., Abaitua J., Casillas A. Aligning tagged bitexts. Sixth Workshop on Very Large Corpora, 102-109, 1998.

Shafer K.: Fred: the SGML Grammar Builder. http://www.oclc.org/fred, 1994.