GLOSIX Part 1.2

MULTEXT/EAGLES - Document LSD 2. Part 1-2. Version 0.5. Last modified 28 April 1996.

GLOSIX Part 1-2. Text

Scope
Recommendations for interchange
Recommendations for local processing
- Import/Export formats
- Direct interface formats

Scope

This section provides recommendations for the representation of textual data within GLOSIX environments. By textual data, we mean "unannotated" fragments of discourse, most often originally created for non-linguistic purposes such as publishing, broadcasting, etc. This does not include linguistic annotation such as tokenization or morpho-syntactic tagging (see GLOSIX Part 1.4) nor linguistic resources such as lexicons (see GLOSIX Part 1.5).

LSD tools will process text in two formats:

"plain" text, i.e., a sequence of characters, none of which constitutes markup indicating rendition or any other identification of the textual elements (apart from the obvious carriage returns, spaces, and tabs);
structured text, which, as opposed to plain text, includes markup that is not part of the content proper but instead provides information about the type, rendering, etc. of elements of the text.
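
The distinction between the two formats can be illustrated with a rough check for SGML-style markup. This is only a heuristic sketch: the pattern and the function name classify_text are assumptions made for illustration, not part of the GLOSIX recommendations, and a real tool would rely on an SGML parser instead.

```python
import re

# Heuristic (an assumption, not a GLOSIX rule): treat SGML-style tags such as
# <p>, </p>, <doc id="x"> and entity references such as &amp; as markup.
MARKUP = re.compile(r"</?[A-Za-z][^<>]*>|&#?[A-Za-z0-9.-]+;")


def classify_text(text: str) -> str:
    """Return "structured" if the text appears to contain markup, else "plain"."""
    return "structured" if MARKUP.search(text) else "plain"
```

Carriage returns, spaces, and tabs do not match the pattern, so a text containing only those layout characters is still classified as plain.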

Recommendations for interchange

All text intended for interchange should follow the MULTEXT/EAGLES Corpus Encoding Standard (CES). The CES is a Text Encoding Initiative (TEI)-based application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language).

The CES provides the following:

a set of metalanguage level recommendations (particular profile of SGML use, character sets, etc.);
tagsets for documentation of the encoded data;
tagsets and recommendations for encoding textual data, including written texts across all genres, for the purposes of corpus-based work in language engineering.

In particular, the CES provides a series of Document Type Definitions (DTDs), which provide for increasingly refined levels of encoding. This set of DTDs accommodates the importation of "legacy data" (data previously encoded in alternative formats) into the CES format for the purposes of language engineering.
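
As a rough illustration of importing legacy data at the coarsest level of encoding, the sketch below wraps blank-line-separated blocks of plain text in paragraph elements. The element names text, body, and p follow common TEI usage and are assumptions here; they have not been checked against the CES DTDs. The function names are hypothetical, and the sketch also assumes that the target DTD declares the entities amp and lt.

```python
import re


def escape_sgml(text: str) -> str:
    """Escape characters that would otherwise be read as markup.

    Assumes the target DTD declares the entities "amp" and "lt".
    """
    return text.replace("&", "&amp;").replace("<", "&lt;")


def wrap_paragraphs(plain: str) -> str:
    """Wrap blank-line-separated blocks in <p> elements (hypothetical names)."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", plain) if b.strip()]
    body = "\n".join(f"<p>{escape_sgml(b)}</p>" for b in blocks)
    return f"<text>\n<body>\n{body}\n</body>\n</text>"
```

A finer level of encoding would add further elements inside each paragraph. This sketch stops at paragraph boundaries because that is all that plain text reliably indicates.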

The CES follows the GLOSIX recommendations for character sets (GLOSIX Part 1.1. Characters).

Recommendations for local processing

sgmls/nsgmls/SP output format
unstructured ("plain") text

Other formats will be defined in the future.
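
The sgmls/nsgmls output format is line-oriented: the first character of each line identifies the kind of event, e.g. "(" for an element start, ")" for an element end, "-" for character data, and "A" for an attribute of the next element to start. A minimal reader for this format might look as follows. It is a sketch covering only a subset of the commands, and the function names parse_esis and text_content are hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Subset of the sgmls/nsgmls output commands handled by this sketch.
KINDS = {
    "(": "start",       # (GI   start of element GI
    ")": "end",         # )GI   end of element GI
    "-": "data",        # -text character data
    "A": "attribute",   # Aname type value
    "?": "pi",          # ?text processing instruction
    "C": "conforming",  # document was a conforming SGML document
}

# Escapes in data: \\ (backslash), \n (record end), \| (SDATA bracket), \nnn (octal).
ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def _unescape(match: "re.Match[str]") -> str:
    code = match.group(1)
    if code == "n":
        return "\n"
    if code == "\\":
        return "\\"
    if code == "|":
        return ""  # drop SDATA brackets in this sketch
    return chr(int(code, 8))


def parse_esis(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, payload) pairs; unknown commands are reported as "other"."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line:
            yield KINDS.get(line[0], "other"), line[1:]


def text_content(lines: Iterable[str]) -> str:
    """Concatenate all character data, decoding the escapes listed above."""
    return "".join(
        ESCAPE.sub(_unescape, payload)
        for kind, payload in parse_esis(lines)
        if kind == "data"
    )
```

Discarding every event except character data, as text_content does, is one simple way to recover unstructured text from a structured document.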
