MULTEXT/EAGLES - Document LSD 2. Part 1-2. Version 0.5. Last modified 28 April 1996.


GLOSIX Part 1-2.




This section provides recommendations for the representation of textual data within GLOSIX environments. By textual data, we mean "unannotated" fragments of discourse, most often originally created for non-linguistic purposes such as publishing or broadcasting. Textual data does not include linguistic annotation such as tokenization or morpho-syntactic tagging (see GLOSIX Part 1.4), or linguistic resources such as lexicons (see GLOSIX Part 1.5).

LSD tools will process text in two formats: one for interchange and one for local processing.

Recommendations for interchange

All text intended for interchange should follow the MULTEXT/EAGLES Corpus Encoding Standard (CES). The CES is a Text Encoding Initiative (TEI)-based application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language).

The CES provides the following:

In particular, the CES provides a series of Document Type Definitions (DTDs) that support increasingly refined levels of encoding. This set of DTDs accommodates the importation of "legacy data" (data previously encoded in alternative formats) into the CES format for the purposes of language engineering.

The CES follows the GLOSIX recommendations for character sets (GLOSIX Part 1.1. Characters).

Recommendations for local processing

Import/Export formats

Import/export formats for tools should follow the SGML-based recommendations from the MULTEXT/EAGLES Corpus Encoding Standard (CES) described above.

Direct interface formats

The CES SGML format will also be used as a direct interface format for textual data. However, full SGML is in some cases problematic for processing, especially by small Unix-based tools. In such cases, it may be desirable to use other formats, such as:

Other formats will be defined in the future.
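To illustrate why a simpler format can help small line-oriented tools, here is a minimal Python sketch that flattens SGML-like text into one paragraph per line. It is not one of the formats defined by GLOSIX or the CES. The one-paragraph-per-line output, the names `ParagraphFlattener` and `flatten`, and the choice of `p` as the paragraph element are assumptions made for illustration only. The sketch relies on the lenient tag scanner in Python's standard `html.parser` module rather than on a validating SGML parser, so it ignores the DTD entirely.

```python
from html.parser import HTMLParser


class ParagraphFlattener(HTMLParser):
    """Collect the text of each <p> element as a single line (hypothetical format)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs, one string each
        self._buffer = None    # text pieces of the paragraph currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # SGML permits omitted end tags, so a new <p> closes any open one.
            self._flush()
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Text outside a paragraph is ignored; inline markup inside one is dropped.
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Collapse line breaks and runs of spaces into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def close(self):
        super().close()
        self._flush()  # emit a final paragraph that was never explicitly closed


def flatten(sgml_text):
    """Return the paragraphs of sgml_text as a list of single-line strings."""
    parser = ParagraphFlattener()
    parser.feed(sgml_text)
    parser.close()
    return parser.paragraphs
```

Writing each returned string on its own line yields a file that standard line-based Unix utilities can read directly, at the cost of discarding all markup other than paragraph boundaries.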


Copyright (c) Centre National de la Recherche Scientifique, 1995-1996.