GLOSIX Part 1-2.
This section provides recommendations for the representation of textual data within GLOSIX environments. By textual data, we mean "unannotated" fragments of discourse, most often originally created for non-linguistic purposes such as publishing, broadcasting, etc. This does not include linguistic annotation such as tokenization or morpho-syntactic tagging (see GLOSIX Part 1.4) nor linguistic resources such as lexicons (see GLOSIX Part 1.5).
LSD tools will process text in two formats:
All text intended for interchange should follow the MULTEXT/EAGLES Corpus Encoding Standard (CES). The CES is a Text Encoding Initiative (TEI)-based application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language).
The CES provides the following:
In particular, the CES provides a series of Document Type Definitions (DTDs) that support increasingly refined levels of encoding. This set of DTDs accommodates the importation of "legacy data" (data previously encoded in alternative formats) into the CES format for the purposes of language engineering.
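As a rough illustration of importing legacy data at the coarsest level of encoding, the sketch below wraps blank-line-separated plain text in paragraph-level markup. The function name `wrap_legacy_text` is hypothetical, and the element names (`cesDoc`, `text`, `body`, `p`) are assumed here rather than taken from this section; they should be checked against the CES DTDs. The header is omitted, so the output is not claimed to validate against any CES DTD.

```python
import re
from html import escape


def wrap_legacy_text(raw: str) -> str:
    """Wrap plain-text paragraphs in a minimal CES-style skeleton.

    Hypothetical helper: element names are assumptions, and the
    document header is deliberately left out of this sketch.
    """
    # Treat one or more blank lines as a paragraph boundary.
    blocks = re.split(r"\n\s*\n", raw.strip())
    # Collapse line breaks and runs of spaces inside each paragraph.
    paragraphs = [" ".join(block.split()) for block in blocks]
    paragraphs = [p for p in paragraphs if p]

    lines = ["<cesDoc>", "<text>", "<body>"]
    for p in paragraphs:
        # Replace markup-significant characters (&, <, >) with entities.
        lines.append("<p>" + escape(p, quote=False) + "</p>")
    lines += ["</body>", "</text>", "</cesDoc>"]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "First  line\ncontinues.\n\nSecond & last <para>."
    print(wrap_legacy_text(sample))
```

Finer levels of encoding (sentence boundaries, quotations, and so on) would be added in later passes; this sketch only marks paragraphs.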
The CES follows the GLOSIX recommendations for character sets (GLOSIX Part 1.1. Characters).
Import/export formats for tools should follow the SGML-based recommendations from the MULTEXT/EAGLES Corpus Encoding Standard (CES) described above.
Other formats will be defined in the future.