MULTEXT/EAGLES - Document LSD 2. Part 1-4. Version 0.5. Last modified 28 April 1996.


GLOSIX Part 1-4.
Linguistic annotation




This section provides recommendations for the representation of linguistic annotation within GLOSIX environments. By linguistic annotation, we mean information derived from primary data (text or speech), usually as the result of linguistic analyses such as tokenization, morpho-syntactic tagging, or prosody tagging.

Recommendations for interchange

All linguistic annotation intended for interchange should follow the MULTEXT/EAGLES Corpus Encoding Standard (CES). The CES is a Text Encoding Initiative (TEI)-based application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language).

At present, the CES provides the following:

Encodings for other types of linguistic annotation, such as prosody, are under development.

The CES follows the GLOSIX recommendations for character sets (GLOSIX Part 1.1. Characters).

Recommendations for local processing

Import/Export formats

Import/export formats for tools should follow the SGML-based recommendations from the MULTEXT/EAGLES Corpus Encoding Standard (CES) described above.

Direct interface formats

The CES SGML format will also be used as the direct interface format for textual data. However, full SGML is in some cases problematic for processing, especially by small Unix-based tools. In such cases, it may be desirable to use other formats, such as:

Other formats will be defined in the future.
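As an illustration of why a simpler format can suit small Unix-based tools, the following minimal sketch flattens SGML-like token markup into one tab-separated line per token. It is not a format defined by GLOSIX or the CES: the element name `tok`, the attributes `base` and `ctag`, the tag values, and the function name `sgml_to_tabular` are all hypothetical and chosen only for the example. The regular expression is not a real SGML parser and handles only this simplified markup.

```python
import re

# Hypothetical SGML-like token markup. The element and attribute names
# (tok, base, ctag) are illustrative assumptions, not taken from the CES.
TOKEN_RE = re.compile(
    r'<tok\s+base="(?P<base>[^"]*)"\s+ctag="(?P<ctag>[^"]*)">'
    r'(?P<form>[^<]*)</tok>'
)


def sgml_to_tabular(sgml_text):
    """Return one 'form<TAB>base<TAB>ctag' line per token in sgml_text."""
    lines = []
    for match in TOKEN_RE.finditer(sgml_text):
        fields = (match.group("form"), match.group("base"), match.group("ctag"))
        lines.append("\t".join(fields))
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented sample sentence with invented tag values.
    sample = (
        '<s><tok base="the" ctag="DET">The</tok>'
        '<tok base="cat" ctag="NOUN">cats</tok>'
        '<tok base="sleep" ctag="VERB">sleep</tok></s>'
    )
    print(sgml_to_tabular(sample))
```

Each output line holds one token, so standard line-oriented utilities such as `cut`, `grep`, and `sort` can process the result without any knowledge of SGML.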


Copyright (c) Centre National de la Recherche Scientifique, 1995-1996.