GLOSIX Part 1.4

MULTEXT/EAGLES - Document LSD 2. Part 1-4. Version 0.5. Last modified 28 April 1996.

GLOSIX Part 1-4.
Linguistic annotation

Scope
Recommendations for interchange
Recommendations for local processing
- Import/Export formats
- Direct interface formats

Scope

This section provides recommendations for the representation of linguistic annotation within GLOSIX environments. By linguistic annotation, we mean information derived from primary data (text or speech), usually resulting from linguistic analyses such as tokenization, morphosyntactic tagging, prosody tagging, etc.

Recommendations for interchange

All linguistic annotation intended for interchange should follow the MULTEXT/EAGLES Corpus Encoding Standard (CES). The CES is a Text Encoding Initiative (TEI)-based application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language).

At present, the CES provides the following:

a set of metalanguage level recommendations (particular profile of SGML use, character sets, etc.);
tagsets for documentation of the encoded data;
tagsets and recommendations for:
- segmentation of text;
- morphosyntactic tagging;
- parallel text alignment.

Encodings for other types of linguistic annotation, such as prosody, are under development.

The CES follows the GLOSIX recommendations for character sets (GLOSIX Part 1.1. Characters).

Recommendations for local processing

Import/Export formats

Import/export formats for tools should follow the SGML-based recommendations from the MULTEXT/EAGLES Corpus Encoding Standard (CES) described above.

Direct interface formats

The CES SGML format will also be used as the direct interface format for textual data. However, full SGML is in some cases problematic for processing, especially by small Unix-based tools. In such cases, it may be desirable to use other formats, such as:

sgmls/nsgmls/SP output format;
MULTEXT tabular formats (information will be available soon).

Other formats will be defined in the future.
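
To illustrate why the sgmls/nsgmls/SP output format suits small tools, the following is a minimal sketch of a reader for that format. It assumes the line-oriented conventions documented for sgmls and nsgmls: each line starts with a one-character command code, "A" lines give the attributes of the next start-tag, "(" and ")" open and close an element, and "-" carries character data with backslash escapes. The names parse_esis and unescape, the element W and the attributes POS and LEMMA are hypothetical and are not taken from the CES. Only a subset of command codes is handled.

```python
import re


def unescape(data):
    # Decode the data escapes handled in this sketch: backslash-n (record
    # end), doubled backslash, backslash-bar (SDATA delimiter) and a
    # three-digit octal character code.
    def repl(match):
        token = match.group(1)
        if token == "n":
            return "\n"
        if token == "\\":
            return "\\"
        if token == "|":
            return ""  # assumption: SDATA delimiters are simply dropped
        return chr(int(token, 8))

    return re.sub(r"\\(n|\\|\||[0-7]{3})", repl, data)


def parse_esis(lines):
    # Turn sgmls/nsgmls output lines into a flat list of
    # (kind, value, attributes) events.
    events = []
    pending = {}  # attributes collected for the next start-tag
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            parts = rest.split(" ", 2)  # name, type, optional value
            if len(parts) < 2 or parts[1] == "IMPLIED":
                continue  # malformed or unspecified attribute
            pending[parts[0]] = unescape(parts[2]) if len(parts) > 2 else ""
        elif code == "(":
            events.append(("start", rest, pending))
            pending = {}
        elif code == ")":
            events.append(("end", rest, None))
        elif code == "-":
            events.append(("data", unescape(rest), None))
        # other command codes (e.g. "?", "&", "C") are ignored here
    return events


if __name__ == "__main__":
    # Hypothetical parser output for a single tagged word.
    sample = ["APOS CDATA noun", "ALEMMA IMPLIED", "(W", "-dogs", ")W", "C"]
    for event in parse_esis(sample):
        print(event)
```

Because every record occupies one line and carries its type in the first character, a tool can consume this format with ordinary line-based processing and needs no SGML parser of its own.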