RESOURCES SPECIFICATIONS

WRITTEN RESOURCES SPECIFICATIONS

Corpus

British National Corpus (BNC)

The BNC consists of extracts from 4124 modern British English texts of all kinds, both spoken and written. Each text is segmented into orthographic sentence units, and each word is automatically assigned a part-of-speech code. There are 6,250,000 sentences and over 100 million words.

The corpus was produced by a consortium of leading dictionary publishers (OUP, Longman, Chambers-Harrap) and academic research centres (Oxford University Computing Services, Unit for Computer Research in the English Language at Lancaster University, British Library Research and Development). It provides a unique and authoritative view of the state of the English language today, with carefully balanced representation of as many different varieties of English as possible. It can be used to exercise NLP systems of all kinds, as a fertile source of real-life examples for language learners, or simply to explore the way the language is currently used.

The first release of the BNC, packaged as 3 CD-ROMs, comprises:

the full text of the 100 million word corpus
printed and online documentation
a full word index to the whole corpus
ANSI C source code for a server program and a basic client program.

The BNC is an SGML document complying with ISO 8879.

Contemporary Portuguese Corpus

This corpus covers the Portuguese language as spoken in Portugal, Brazil, Angola, Mozambique, Guinea, Macao, etc. It consists of about 1.5 million words of spoken language and more than 40 million words of Portuguese texts extracted from fiction, technical, scientific, journalistic, legal, and political material. Some of the corpus is raw data, but some of it has been encoded according to an in-house formalism developed within the internal project "Corpus de Referência do Português Contemporâneo".

CRATER Multi-Lingual Aligned Corpus

The Corpus Resources and Terminology Extraction project (MLAP-93 20) has extended the bilingual annotated English-French International Telecommunications Union corpus to include Spanish, and has also debugged the existing corpus. In addition, a Spanish tagger has been developed, along with a set of retrieval tools for browsing the trilingual aligned corpus, and examining the proposed term or word alignments. The offer consists of the 3 x 1,000,000 token corpora of English, French and Spanish, morphosyntactic annotations (human-edited), lemmatisation and term extraction routines for English, French and Spanish.

ECI - European Corpus Initiative

The European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual corpus, and supports existing and projected national and international efforts to carefully design, collect and publish large-scale multilingual written and spoken corpora. ECI has produced the Multilingual Corpus I (ECI/MCI) of over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more. The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material.

Just a sampling of the contents of the CD-ROM:

German newspaper texts from the Frankfurter Rundschau from July 1992 to March 1993. Provided by Universität Gesamthochschule, Paderborn, Germany. Approximately 34 million words.
French newspaper texts from Le Monde, consisting of material from September 1989, October 1989, and January 1990. Provided by LIMSI CNRS, France. Approximately 4.1 million words.
Extracts from the Leiden Corpus of Dutch, consisting of newspapers, transcribed speech, etc. Provided by Instituut voor Nederlandse Lexicologie, Leiden, Holland. Approximately 5.5 million words.
International Labor Organisation (ILO) "Official Bulletin, B Series". Vols LXVII(1984) - LXXII(1989). Parallel texts in English, French and Spanish provided by the International Labor Organisation. Approximately 5 million words.

The ECI/MCI is available from ELSNET.

ECI - ELSNET Italian&German tagged sub-corpus

The objective is to provide a small but fine-grained morphosyntactically tagged corpus of 50,000 running words for each of the two languages (Italian and German), to be used in research work on tagging methods and models. The German text comes from the Frankfurter Rundschau as extracted from the ECI corpus; the Italian material comes from the Italian corpus of ILC - CNR. For German the data concerns several domains including Economy (17,000 word forms), Politics (14,000 word forms), Culture (18,000 word forms), Sports (9,000 word forms), and Local Events (8,500 word forms). The situation for Italian is comparable. Word occurrences are tagged with very fine-grained tagsets based on the EAGLES morphosyntactic guidelines.

The tagging, done automatically, has been manually checked. The CD-ROM contains the text in SGML format, the DBT software, which supports browsing and other operations on the annotated text, and the EAGLES morphosyntactic guidelines.

Multilingual Corpora for CO-OPERATION - MLCC

The MLCC text corpus has two main components - one set to allow comparable studies to be carried out in different languages and one set as the basis for translation studies.

The first set is referred to as the Polylingual Document Collection (ELRA-W0006), a collection of newspaper articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:

Dutch - Het Financieele Dagblad - 1992-1993 (Samples)

The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.

English - The Financial Times - 1993 (Samples)

The corpus contains articles from the British financial newspaper The Financial Times editions from the year 1993. The corpus contains around 30 million words.

French - Le Monde - 1992-1993 (Samples)

A corpus of articles from the French newspaper Le Monde, consisting of two years' worth (1992-1993) of articles on financial subjects, approximately 10 million words.

German - Handelsblatt - 1986-1988 (Samples)

This subcorpus consists of articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.

Italian - Il Sole 24 Ore - 1992-1993 (Samples)

The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML-markup was done by the University of Edinburgh.

Spanish - Expansion - 1994 (Samples)

This subcorpus contains articles from the Spanish financial newspaper Expansion editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.

The second set is a Multilingual Parallel Corpus (ELRA-W0007) consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:

Official Journal of the European Commission, C Series: Written Questions 1993

Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).

Samples: Danish, German, English, Spanish, Greek, Italian, Dutch, Portuguese.

Official Journal of the European Commission, Annex: Debates of the European Parliament 1992-1994

This parallel corpus comprises the records of Parliamentary sittings published as an annex to the Official Journal of the European Community, Debates of the European Parliament. The Parliamentary Debates are a record of what was said by members at the sittings as well as written input provided to them. The original data from which the translations are produced consist of a transcript of the sittings, each member speaking in the language of his or her choice. The final version consists of nine parallel versions of the material. The texts delivered comprise the Debates of Parliament from January 1992 to July 1994. This sub-corpus contains some 5 to 8 million words per language.

Samples: Danish, German, English, Spanish, Greek, Italian, French, Dutch, Portuguese.

Monolingual Greek corpus (ILSP - Institute for Language and Speech Processing)

Monolingual Greek corpus of 1 million words. The corpus consists of articles written in 1996 from the Greek daily newspaper ELEFTHEROTIPIA. Each file contains annotated text with SGML mark-up accompanied by a text header.

MTP Annotated Corpus of German

This morphosyntactically annotated 500,000 word German corpus was developed as part of the Münster Tagging Project (MTP). It comprises a collection of SGML-formatted texts from two German newspapers, "Die Frankfurter Allgemeine Zeitung" and "Die Zeit", for the years 1990 to 1992. The articles reflect the typical distribution of newspaper topics, including economics, regional, national and international politics, the arts, sport, literature, history, science and modern life.

The text was segmented into sentence units and word tokens, and tagged for morphosyntactic POS markers. Two tagsets, which mainly differed in the granularity of the noun and verb tags, and which comprised 137 and 52 tags respectively, were used. Users may obtain annotated versions using either set, each of which comes with documentation and an instruction manual for tag application. A suite of tools, including the MTP taggers and the Xlex workbench for text handling, textual analysis and lexicography, is also available.

PRESS 65 (Swedish corpus)

Språkdata has made available the first of its many Swedish corpora, PRESS 65. It consists of one million running words taken from Swedish newspapers from the year 1965. It has been categorised according to text type and is annotated down to the sentence level.

Tagged text in French (MEMODATA)

More than 170 books (classical novels, legal texts, etc.) are tagged with or without rules of morphological disambiguation. A tagged corpus of 50 books is available for research. It covers several 19th-century authors (Balzac, Hugo, Stendhal).

Test Suites for Natural Language Processing (TSNLP)

The TSNLP project (LRE 62-089) has produced a database of test suites for English, French and German containing over 4,000 test items (sentences or sentence fragments) per language. The items were constructed for evaluating natural language processing systems, but may also be useful for other purposes. The examples have been systematically constructed with detailed annotations about grammatical and other information, and are relevant to developers or users of systems with grammatical components who wish to test, benchmark, or evaluate them. A three-volume user manual documents major project results, including a description of the test data, the underlying methodology and the tools developed to aid test suite construction and use.

Text corpus of "Le Monde"

Electronic archiving of "Le Monde" articles started on 1 January 1987. Some 200 articles are added every day, and as of October 1997 the database contains more than 500,000 articles, making it the biggest of its kind for all French daily newspapers.

The corpus is available in an ASCII text format. Each month consists of some 10 MB of data (circa 120 MB per year).

Data ranging from 1987 until present date are available through ELRA (each buyer may purchase up to 5 years of data).

Karl-May-Korpus (KMK corpus)

Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May from 1993 to 1997 and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each).

Each word form is tagged with word class (1 out of 43 classes) and appropriate lemma.

File format: Text
Standard in use: SGML
Character set: 8-bit ASCII

MULTEXT JOC Corpus

This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains approx. 5 million words in English, French, German, Italian and Spanish (approx. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.

The JOC corpus is delivered in Corpus Encoding Standard conformant format at each level of treatment:

paragraph annotation level, conformant to the CESDOC specifications (1 M words * 5 languages);
morpho-syntactic annotation level (PoS Tagging), conformant to CESANA specifications (200,000 words * 4 languages);
parallel text alignment at sentence level, conformant to CESALIGN specifications (200,000 words * 4 languages).

Additional information: http://www.lpl.univ-aix.fr/projects/multext

ARCADE/ROMANSEVAL corpus

The ARCADE/ROMANSEVAL corpus was used as a reference corpus in two international competitions:

ARCADE, an exercise on multilingual text alignment financed by AUPELF-UREF
ROMANSEVAL, part of the SENSEVAL exercise sponsored by ACL-SIGLEX and EURALEX, on word sense disambiguation.

The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission).

The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3,700 contexts altogether, and comprises:

semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian;
word-level alignment of all the occurrences of the test words between French and English.

Additional information:
http://www.lpl.univ-aix.fr/projects/arcade
http://www.lpl.univ-aix.fr/projects/romanseval

Dutch PAROLE Distributable Corpus

The Dutch PAROLE Distributable Corpus is a 3-million-word selection from the 20-million-word Dutch PAROLE Reference corpus.

The Dutch corpus annotation and checking were carried out according to the common core PAROLE tagset. The Dutch data were also checked for type.

The Dutch PAROLE Distributable Corpus contains the following texts:

MEDIUM	SOURCE	TIMESPAN	TOTAL NUMBER OF WORDS
BOOKS	Van Sterkenburg: Wdlijst tot wdboek	1984	65,344
	Van Sterkenburg: Taal vt Journaal	1989	56,215
	Van Sterkenburg: WNT-portret	1992	60,133
NEWSPAPERS	Short newspaper texts: MN_Collection	1986-1988	19,537
	Short newspaper texts: CVNP(S)-Collection	1983-1990	179,220
PERIODICAL	Short texts from local papers	1985-1988	47,019
	Short texts from magazines	1985-1989	164,589
MISCELLANEOUS	Texts to be read out in TV-news broadcasts for a general audience	1992-1995	1,285,824
	Texts to be read out in TV-news broadcasts for youth	1991-1995	1,008,658
	Short texts from ephemera	1985-1986	131,692
TOTAL			3,018,231

Over 250,000 words of corpus texts have been PoS-tagged automatically. A total of 59,798 running words has been manually corrected and checked at least two times with respect to maximal granularity, according to a lexicographer's manual. The extra 9,000 words over the required 50,000 words compensate for the occurrence of ca. 5,300 "keywords" in the original texts. The fully corrected material has been subjected to an automated post-control operation, checking the pertinence relations between the various feature values, and instantiating default values in case a mismatch (indicating a correction error) was found. Ca. 200,000 words have been checked once for PoS and type. In addition to the required PoS, type was checked for reasons of quality. This material has been subjected to an automated correction procedure addressing the feature slots (positions) beyond the first two for PoS and type so as to solve discrepancies between the manually corrected PoS and type, and the possibly erroneous, automatically assigned values of the remaining slots.

Monolingual Lexicon

DICO-MORPH_lemme. (MEMODATA)

Entries: more than 400 000
Language: French
Format: ASCII with separators
Medium: CD-ROM
French reusable lexicon for morphological works which produces the canonical form from the inflexional form. This lexicon is divided into the following lexical categories: nouns (55,000), verbs (8,000), adjectives (16,850), adverbs (2,000), other words (30,000).

DICO-MORPH_Collocation. (MEMODATA)

Entries: up to 35000
Language: French
Format: ASCII
Medium: Floppy disk

This is an add-on to the French lexicon for morphological work (referenced herein as DICO-MORPH_lemme, MEMODATA).

DICO-SYNT. (MEMODATA)

Entries: 90 000
Language: French
Format: ASCII
Medium: Floppy disk

This resource gives the morpho-syntactic information for DICO-MORPH_lemme: proper noun, transitive verb, etc. There are around 800 categories of verbs. The lexical categories are: nouns (25,000), verbs (8,000 that generate 25,000 verb/models), adjectives (1,000), adverbs (1,500).

Dutch Lexicon (LanTmark)

General vocabulary
Entries: 64000
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The Dutch LanTmark lexicon is divided into the following categories: nouns (50,000), verbs (7,000), adjectives (6,000), adverbs (1,000).

Each entry contains morphological information (morphological flexes, comparative and superlative markers), syntactic information (such as positional features, gender, complement markers and verb arguments), semantic information (lexical semantics for nouns, adverbs and adjectives).

French Lexicon (LanTmark)

General vocabulary
Entries: 50000
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The French LanTmark lexicon is divided into the following categories: nouns (36,000), verbs (6,000), adjectives (7,000), adverbs (1,000).

Each entry contains morphological information (morphological flexes, comparative and superlative markers), syntactic information (such as positional features, gender, complement markers and verb arguments), semantic information (lexical semantics for nouns, adverbs and adjectives).

ILC Italian Morphological Lexicon

The ILC Italian Morphological Lexicon consists of a set of lemmas/lexical entries (about 60,000) with the corresponding inflected word-forms, and a morphological engine for morphological analysis and generation. Lemmas and word-forms are encoded with grammatical codes compatible with the EAGLES recommendations for lexicon encoding at the morphosyntactic level.

LEXin 1:e (Swedish Lexicon)

The first edition of LEXin 1:e, a Swedish database used as the basis for a lexicon for immigrants, is now available via ELRA. Produced by Språkdata in Göteborg, Sweden, it consists of approximately 17,000 headwords and 21,000 senses, and contains explicit morphological information for every headword and syntactical information for all verbs and many adjectives. Each sense is illustrated by a paraphrase, as opposed to a formal definition. Derivational forms, phrases and idioms are also included. The format is flexible and can be customised to individual wishes within reasonable limits.

Monolingual Danish lexicon

(Institut for Erhvervsinformatik)

Entries: 25000
Format: ASCII
This dictionary was developed for machine translation. Each lexeme contains the word class, inflection, semantic features, syntactic frames (for verbs), and complement (for nouns and adjectives).

Monolingual Portuguese lexicon

(Centro de Linguistica da Universidade de Lisboa)
Entries: 60 000
Monolingual Portuguese lexicon with morphological information, with a software engine, written in C, for generating all inflected forms, including adj-adverb derivation.

MULTEXT LEXICONS

This CD-ROM contains a set of lexicons developed in the MULTEXT project financed by the European Commission (LRE 62-050). The set contains the following languages: English, French, German, Italian and Spanish.

English 66,214 Word forms
French 306,795 Word forms
German 233,861 Word forms
Italian 145,530 Word forms
Spanish 510,710 Word forms

The MULTEXT lexicons are three-column, tab-separated tables: the first column contains the word-form, the second column contains the lemma, and the third column contains the morpho-syntactic information associated with that form. This information conforms to the MULTEXT/EAGLES specifications.
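The three-column layout described above can be read with a few lines of code. The following is only a minimal sketch: the function name parse_lexicon, the sample rows and the tag strings in the third column are made-up placeholders for illustration, not actual MULTEXT data or tags.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Index (lemma, morpho-syntactic info) pairs by word-form.

    Assumes one entry per line with three tab-separated columns:
    word-form, lemma, morpho-syntactic information.
    """
    index = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        form, lemma, msd = line.split("\t", 2)
        index[form].append((lemma, msd))
    return index


# Hypothetical sample rows; the tags are placeholders, not real MULTEXT codes.
sample = [
    "houses\thouse\tTAG-NOUN-PL\n",
    "houses\thouse\tTAG-VERB-3SG\n",
    "house\thouse\tTAG-NOUN-SG\n",
]

lexicon = parse_lexicon(sample)
for lemma, msd in lexicon["houses"]:
    print(lemma, msd)
```

Because one word-form can have several analyses, the sketch keeps a list of (lemma, information) pairs per form rather than a single value.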

Additional information: http://www.lpl.univ-aix.fr/projects/multext

Portuguese morphological lexicon PALAVROSO, (INESC)

Entries: 60 000

Monolingual Portuguese lexicon with a rule-based morphological analysis which also handles enclitics, compounds, diminutives and augmentatives.

PALAVROSO is a European Portuguese lexicon and consists of a set of about 60,000 lexical entries (lemmas), and a rule-based morphological engine for morphological analyses that recognises more than 1,300,000 word-forms. The rule set also allows enclitics, compound words, diminutives and augmentatives to be handled correctly. Information encoded is compatible with the EAGLES recommendations for lexicon encoding at the morpho-syntactic level.

Spanish gilcUB-M-Dictionary

General vocabulary
Entries: 60000
Format: ASCII format with ISO 8859-1 character set. Available versions include attribute-value pairs and tag-style encoding.
Medium: QIC 150 MB Cartridge Tape

The Spanish gilcUB-M-Dictionary is a full-form lexicon derived from 60,000 lemmas of general vocabulary (9,700 verbs, 35,500 nouns, 14,300 adjectives and 120 adverbs). Adverbs that can be derived from adjectival forms are also included as full forms, amounting to about 10,000 forms. Each form is encoded with its associated lemma and with morphosyntactic information compatible with the EAGLES recommendations for morphosyntactic encoding.

THAMUS. Generic Italian dictionary

(Consorzio per la linguistica computazionale)
Entries: 116000

A generic monolingual Italian dictionary. Its morphological coding can generate all full forms by means of a software engine written in C. Multi-word terms contain morphological coding for the head word.

Dictionary of French verbs - CORA

This dictionary contains 25,610 verbs with usage domains, level of language (familiar, popular, literary, Quebec and Swiss terms, etc.), conjugation, auxiliary, verbal adjectives in -able, -ant or -é, encoded syntactical constructions (subject, direct & indirect object, adverb), sample phrases, synonyms, operators enabling semantic-syntactic classification, encoding of derived forms in -age, -ment, -tion, -oir, -ure, deverbal nouns, base words from which verbs can be derived, a scale of usage ranging from 1 to 6, like those used by commercial dictionaries (basic vocabulary, extended, specialised, etc.).

Codes enable automatic production of conjugation forms, derived nouns and adjectives and, if necessary, the production of potential forms.

Dictionary of words - CORA

This dictionary is composed of 126,844 words, with usage domains, grammatical category, gender, number, uncountable, collective, adjectival, nominal, verbal, adverbial derived forms according to the type of words.

Dictionary of affixes - CORA

4,286 suffixes and prefixes, plus information on their verbal, nominal or adjectival bases or on the verbal basis of greco-latin items. This dictionary does not include the suffixes contained in the dictionary of French verbs (ELRA-L0021) and words (ELRA-L0022) such as -age, -ment, -if, -oir.

Dictionary of verb phrases - CORA

Dictionary of 3,480 entries based on the model of the dictionary of French verbs (ELRA-L0021).

Dictionary of invariable forms and phrases - CORA

Dictionary of 4,783 entries based on the model of the dictionary of words (ELRA-L0022).

Dictionary of exclamatory stereotyped phrases - CORA

Dictionary of 1,901 entries based on the model of the dictionary of invariable forms and phrases (ELRA-L0025).

Dictionary of French local authorities - CORA

38,965 entries in lower case with accents, checked against the Michelin guide, without named places ("lieux-dits"). A link can be made to the dictionary of words (ELRA-L0022), which contains inhabitants' names and their correspondence with town names.

Dictionary of noun phrases and plural-only words - CORA

2,138 compound names and 1,397 entries of plural-only words.

CELEX Dutch lexical database

The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch Version 3.1, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995.

Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4m Dutch Instituut voor Nederlandse Lexicologie text corpora.

To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files.

This database can be divided into different subsets:

orthography: with or without diacritics, with or without word division positions, alternative spellings, number of letters/syllables;
phonology: phonetic transcriptions with syllable boundaries or primary and secondary stress markers, consonant-vowel patterns, number of phonemes/syllables, alternative pronunciations, frequency per phonetic syllable within words;
morphology: division into stems and affixes, flat or hierarchical representations, stems and their inflections;
syntax: word class, subcategorisations per word class;
frequency of the entries: disambiguated for homographic lemmata.
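As noted above, the information is spread over plain ASCII files that share unique identity numbers. The sketch below shows one way such files could be linked. It is only an illustration under stated assumptions: the backslash field separator, the position of the identity number in the first field, the function names and the sample records are all assumed here and are not taken from the CELEX documentation.

```python
def read_table(lines, sep="\\"):
    """Map each record's identity number (assumed to be the first field)
    to the list of its remaining fields."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        table[fields[0]] = fields[1:]
    return table


def join_on_id(left, right):
    """Combine the fields of records that share an identity number."""
    return {key: left[key] + right[key] for key in left if key in right}


# Hypothetical records standing in for two of the ASCII files.
orthography = ["1\\aap\n", "2\\boek\n", "3\\fiets\n"]
syntax = ["1\\N\n", "2\\N\n"]

joined = join_on_id(read_table(orthography), read_table(syntax))
for identity, fields in joined.items():
    print(identity, fields)
```

Records present in only one file (identity number 3 in this sample) are dropped, which corresponds to an inner join; a different policy may suit other uses.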

Bulgarian Morphological Dictionary

This dictionary contains 67,500 entries divided into 242 inflectional types (including proper nouns), morphosyntactic information for each entry, and a morphological engine (MS-DOS and Windows 95/NT) for morphological analysis and generation. The data may be used for morphological analysis and synthesis.

Structure of entries: Local linguistic variant
File format: ASCII; lowercase letters
Standard in use: ISO
Character set: 8-bit ASCII (ASCII codes alphabetically: 160-191)
Medium: Floppy disk

Dutch PAROLE lexicon

The entry list of the lexicon consists of about 20,200 entries distributed over 13 parts of speech (POS). The entries have been described along the dimensions of morphosyntax and syntax. Morphosyntactic information consists of various lexical properties, like gender, number, case, person, inflection, etc. Syntactic descriptions consist of typical complementation patterns associated with the various lemmata.

The composition of the entry list of the lexicon is based on 3 corpora from the Instituut voor Nederlandse Lexicologie (INL) and 2 lexica. The corpora contain a total of about 54 million words and have been automatically annotated for part-of-speech and lemma. The lexica contain morphosyntactic information of various kinds. For verbs, nouns, adjectives and adverbs, lemmata that were covered by at least 2 corpora and the 2 lexica were selected on the basis of cumulative frequency, coverage (distribution over sources) and inflected forms. For the smaller parts of speech, these selection requirements appeared to be too strict. Entry selection for these parts of speech was based on ranked frequency.

The entries, uniquely defined by the combination of part of speech (e.g. noun) and subtype (e.g. common vs. proper noun), are provided with morphosyntactic information according to the Dutch set of PAROLE categories and features, and, where available, with syntactic information. Morphosyntactic information is automatically extracted from the INL lexica. Syntactic data have been collected manually, by inspection of corpus data and - where necessary - consultation of reference works. The corpus consulted consists of the newspaper component and the varied component of the 38 Million Words Corpus 1996.

Word forms in the Dutch PAROLE lexicon are not inflected according to general paradigms, but are related to their lemma by a set of string procedures. These procedures are not unique. They can be shared by many other word forms. An example is suffixation with -e for adjectives, which produces "goede"/good from "goed". Inflected forms can be derived directly by applying the string procedures to the lemma they are connected with.
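The idea of shared string procedures can be sketched as follows. This is a hypothetical illustration only: the procedure identifier "suffix_e", the dictionary layout and the extra adjectives are assumptions made for the example and do not reflect the actual SGML encoding of the lexicon.

```python
def add_suffix(suffix):
    """Build a string procedure that appends a suffix to a lemma."""
    def procedure(lemma):
        return lemma + suffix
    return procedure


# One procedure can be shared by many word forms (hypothetical identifier).
PROCEDURES = {"suffix_e": add_suffix("e")}

# Each lemma points to the procedure that derives its inflected form.
ENTRIES = {"goed": "suffix_e", "klein": "suffix_e", "mooi": "suffix_e"}


def inflect(lemma):
    """Derive the inflected form by applying the procedure linked to the lemma."""
    return PROCEDURES[ENTRIES[lemma]](lemma)


for lemma in ENTRIES:
    print(lemma, "->", inflect(lemma))
```

Storing a reference to a procedure instead of a full paradigm keeps each entry small, since the same procedure serves every lemma that inflects the same way.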

The lexicon is set up as an SGML file (over 30 MB of plain ASCII). Its contents have been encoded in a distributed manner: all formative entities (like lemmata, syntactic phrases, feature bundles) are SGML entities, related by a pointer mechanism to other entities.

The lexicon contains the following categories: adjectives (3,298 entries), adpositions (80 entries), adverbs (554 entries), articles (3 entries), conjunctions (70 entries), determiners (59 entries), interjections (235 entries), nouns (12,279 entries), numerals (77 entries), pronouns (85 entries), residuals (186 entries), unique (1 entry), verbs (3,274 entries).

Multilingual Lexicon

Basic multilingual lexicon (MEMODATA)

Entries: 30 000 each language
Languages: French, English, Italian, German, Spanish
Format: ASCII or ANSI with separators between entries
Medium: CD-ROM

Words are linked across the five languages by meaning. The lexical categories are: nouns (5 * 18 000), verbs (5 * 8 000), adjectives (5 * 6 000), adverbs (5 * 1 500).

Bilingual Spanish-English and English-Spanish Lexicons (INCYTA)

Technical domains

Economics, law and Business management:          10,640 entries
Leisure, Tourism, Sports, Food:                   3,140 entries
Geography, History, Arts:                         4,110 entries
Sociology, Psychology, Pedagogy:                  4,080 entries
Natural and medical sciences:                    10,530 entries
Exact sciences, Physics, Chemistry, Geology:     10,610 entries
Data Processing, Electronics, Telecommunications: 4,900 entries
Technology, Engineering and Construction:        11,950 entries
Economics:                                        1,320 entries
Data Processing:                                  3,560 entries
Telecommunications:                               3,730 entries
Electrical Engineering:                           1,760 entries
Plastics and Chemistry:                           9,020 entries
Aeronautics, Navigation, Mechanical Engineering: 23,170 entries

The entries contain morphological information for part-of-speech and inflectional class. The information on multi-word terms is provided by the headword.

Danish - German dictionary

(Institut for Erhvervsinformatik)
General vocabulary
Entries: 10 000
Format: ASCII

This dictionary was developed for machine translation. It gives the German lexeme with word class and the Danish equivalent with word class, subject area, and an indication of structural changes from Danish to German.

Dutch-French Lexicon (LanTmark)

General and Specialised vocabularies for transfer
Transfer Entries:
General Vocabulary (26 000), Administrative (32 000), Data processing (10 000).
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The general Dutch-French LanTmark lexicon is divided into the following categories: nouns (14,000), verbs (6,000), adjectives (5,000), adverbs (1,000).

The administrative vocabulary is divided into the following categories: nouns (30,000), verbs (2,000).
The data processing vocabulary has 10,000 transfer nouns.
Each entry contains domain information, source language disambiguation, features, and target language actions.
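
As a rough illustration of the file layout described above (feature-value pairs on each line, separators between entries), the sketch below reads such a file into a list of dictionaries. The actual LanTmark syntax is not specified here, so the `feature=value` notation, the blank-line separator, the function name `parse_lexicon` and the feature names in the sample are all assumptions made for illustration only.

```python
from typing import Dict, List


def parse_lexicon(text: str, pair_sep: str = "=") -> List[Dict[str, str]]:
    """Parse entries made of one feature-value pair per line.

    Assumptions (not taken from the LanTmark documentation): pairs are
    written as feature=value and a blank line separates entries.
    """
    entries: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the entry collected so far.
            if current:
                entries.append(current)
                current = {}
            continue
        feature, _, value = line.partition(pair_sep)
        current[feature.strip()] = value.strip()
    if current:
        # The last entry may not be followed by a separator.
        entries.append(current)
    return entries


# Hypothetical sample; the feature names are invented for the example.
SAMPLE = """\
source=huis
target=maison
category=noun
domain=general

source=lopen
target=marcher
category=verb
domain=general
"""

if __name__ == "__main__":
    for entry in parse_lexicon(SAMPLE):
        print(entry["source"], "->", entry["target"])
```

A file delivered in the ISO 8859-1 character set could be read with `open(path, encoding="iso-8859-1")` before being passed to the function.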

English-French Lexicon (LanTmark)

General vocabulary for transfer
Transfer Entries: 27000
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The English-French LanTmark lexicon is divided into the following lexical categories: nouns (14,000), verbs (7,000), adjectives (5,000), adverbs (1,000).
Each entry contains domain information, source language disambiguation, features, and target language actions.

French-Dutch Lexicon (LanTmark)

General and Specialised vocabularies for transfer
Transfer Entries:
General Vocabulary (34 000), Administrative (18 000), Data processing (10 000).
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy disk, QIC 150 MB cartridge tape

The general French-Dutch LanTmark lexicon is divided into the following categories: nouns (25,000), verbs (3,000), adjectives (5,000), adverbs (1,000).
The administrative vocabulary is divided into the following categories: nouns (16,000), verbs (2,000).
The data processing vocabulary has 10,000 transfer nouns.
Each entry contains domain information, source language disambiguation, features, and target language actions.

French-English Lexicon (LanTmark)

General vocabulary for transfer
Transfer Entries: 34 000
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The French-English LanTmark lexicon is divided into the following lexical categories: nouns (21,000), verbs (9,000), adjectives (3,000), adverbs (1,000).
Each entry contains domain information, source language disambiguation, features, and target language actions.

German-Danish dictionaries

(Institut for Erhvervsinformatik)
Technical and General vocabulary
Entries: 6800 (technical) - 15500 (general)
Format: ASCII

This dictionary was developed for machine translation. It gives the German lexeme with word class and Danish equivalent with word class, subject area, indication of structural changes from G-DK (e.g. direct object → PP (Prep 'xxx')).

THAMUS. Bilingual dictionaries

(Consorzio per la linguistica computazionale)
Technical domains
Languages: German/Italian - Italian/German
Computer Science 35.000 entries
Construction 7.000 entries

Technical bilingual Italian dictionaries with morphological coding from which a software engine written in C can generate all full forms. Multi-word terms carry morphological coding for the headword.

THAMUS. Bilingual dictionaries

(Consorzio per la linguistica computazionale)
Technical domains
Languages: English - Italian
Format: ASCII format with ISO 8859-1 character set
Medium: QIC 150 MB Cartridge Tape

Aeronautics        6.500 entries
Law               18.000 entries
Computer Science  31.000 entries
Medicine          20.000 entries
Economics         82.000 entries
Engineering       27.000 entries

Bilingual Collocational Dictionary (Horst Bogatz)

The bilingual English-German collocational dictionary consists of around 40,000 English headwords, including concepts expressed by more than one word (e.g. "environmental awareness" or "lame duck") and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword, as well as the German equivalents for the headwords and their English synonyms.

The corpus on which the dictionary is based consists of a representative group of written (British) English texts (books, magazines, and the quality press) and runs to about two million words. All entries are based on contemporary evidence, and are typical of words that appear at least once in a two-million word corpus. The examples and phrases are a major feature of this dictionary.

A global search will provide all collocations that can possibly be associated with the search word. A search engine, the Advanced Reader's Collocation Searcher (ARCS), is supplied with the data and provides all possible German equivalents of the headwords. All entries are sorted according to part-of-speech categories. The latter feature makes it possible for searches to yield different useful combinations of words, e.g. noun + verb + adjective + examples extracted from the corpus + synonyms. A global search will also locate all words semantically connected with the search word in both English and German.

More information ?

Bilingual dictionaries (Translation Experts Ltd.)

Bilingual dictionaries for demonstration and commercial use containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features. The level of information in each entry varies with the word or phrase and with the dictionary, but all of the above are present to some degree in every dictionary. These dictionaries may be of particular interest for spell-checking, thesaurus, hyphenation and translation of natural languages. A Level 2 translation engine, also available via ELRA, takes input words and phrases in LOCAL-UCS format and returns exact translations in LOCAL-UCS format, based on the vocabulary stored in a compressed translation file.

Each pair of languages may be purchased as different sets or subsets, corresponding to the indicated number of entries. All pairs consist of English to and from another language. The following groups of languages are available:

GROUP 1 (English <=> Language A):

Language A = Spanish (25,000, 60,000, 100,000 and 200,000 entries), French (40,000, 80,000, 100,000 and 200,000 entries), German (40,000, 80,000 and 126,000 entries), Italian (20,000 and 40,000 entries), Brazilian Portuguese (40,000, 80,000 and 400,000 entries), Portuguese (40,000, 80,000, 110,000 and 234,000 entries), Dutch (40,000, 80,000 and 110,000 entries).

GROUP 2 (English <=> Language B):

Language B = Danish (40,000, 80,000 and 110,000 entries), Swedish (40,000, 80,000 and 110,000 entries), Finnish (30,000 entries), Icelandic (40,000, 80,000 and 100,000 entries).

GROUP 3 (English <=> Language C):

Language C = Russian (40,000, 72,000 and 120,000 entries), Russian Business (60,000 entries), Russian Aerospace (60,000 entries), Russian Automotive (40,000 entries), Russian Minerals & Mining (60,000 entries), Polish (30,000, 80,000, 124,000 and 150,000 entries), Hungarian (30,000, 80,000 and 124,000 entries), Czech (40,000 entries), Romanian Starter (10,000 entries).

GROUP 4 (English <=> Language D):

Language D = Croatian (30,000 entries), Bosnian (30,000 entries), Serbian (Latin or Cyrillic) (30,000 entries).

GROUP 5 (English <=> Language E):

Language E = Japanese (40,000 entries).

GROUP 6 (English <=> Language F):

Language F = Greek (60,000 entries).

File format: Text
Standard in use: ISO
Character set: 8-bit ASCII and UNICODE
Means of delivery: CD-ROM, floppy disk or downloaded from the Web.
Related tools: Word Translator^®, NeuroTran^®, InterTran^®, MobileTran^®.

Please see http://www.tranexp.com for more information

EUROWORDNET

The EUROWORDNET DATA consists of the following modules:

A. Available Wordnets

B. LR(1) Common Components

C. LR(2) Language-Specific Components

D. LR(3) Software

E. Prices

F. Technical support

Available Wordnets

Following the announcement of the EuroWordNet databases in the last issue of the ELRA Newsletter (Vol.4 N.2), we are happy to announce that the list of EuroWordNet languages has grown. The following wordnets are now available via ELRA:

ELRA ref.	Language	Synsets	Word Meanings	Language Internal Relations	Equivalence Relations
ELRA-M0015	English Addition to English WordNet	16361	40588	42140	0
ELRA-M0016	Dutch	44015	70201	111639	53448
ELRA-M0017	Spanish	23370	50526	55163	21236
ELRA-M0018	Italian	40428	48499	117068	71789
ELRA-M0019	German	15132	20453	34818	16347
ELRA-M0020	French	22745	32809	49494	22730
ELRA-M0021	Czech	12824	19949	26259	12824
ELRA-M0022	Estonian	7678	13839	16318	9004

LR(1) Common Components (All Foreground - Data of layer 1)

A.	The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created. An ILI-record contains:
	A.1 synset: set of synonymous words or phrases (mostly from WordNet1.5)
	A.2 part-of-speech
	A.3 one or more Top-Concept classifications (optional)
	A.4 one or more Domain labels (optional)
	A.5 a gloss in English (mostly from WordNet1.5)
	A.6 a unique ID linking the synset to its source (mostly WordNet1.5)
B.	Top-Ontology: an ontology of 63 basic semantic classes based on fundamental distinctions. By means of the Top-Ontology all the wordnets can be accessed using a single language-independent classification-scheme. Top-Concepts are only assigned to ILI-records.
C.	Domain-ontology: an ontology of subject-domains optionally assigned to ILI-records.
D.	A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets. These Base-Concepts form the core of all the wordnets. All the Base-Concepts are classified in terms of the Top-Concepts that apply to them.
E.	WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.

LR(2) Language-Specific Components (Data of layer 2- partly Foreground and partly Background)

Wordnets produced in the first project (LE2-4003):

F.	Dutch wordnet
G.	English wordnet (additional relations which are missing in WordNet1.5)
H.	Italian wordnet
I.	Spanish wordnet

After extension of the project (LE4-8328):

J.	German wordnet
K.	French wordnet
L.	Czech wordnet
M.	Estonian wordnet

The specific wordnets are language-internal structures, minimally containing:

set of variants or synonyms making up the synset
part-of-speech
language-internal relations to other synsets
equivalence relations with ILI-records
a unique-id linking the synset to its source
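
The minimal synset contents listed above, together with the ILI-record fields described under LR(1), can be pictured as two small record types. This is only a sketch: the class names, field names, identifiers, relation labels and the sample gloss are assumptions chosen for illustration, and they do not reflect the EuroWordNet import format or the Polaris database layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ILIRecord:
    """One Inter-Lingual-Index record (field names are hypothetical)."""
    ili_id: str                      # unique ID linking to the source
    synset: List[str]                # synonymous words or phrases
    pos: str                         # part of speech
    gloss: str                       # English gloss
    top_concepts: List[str] = field(default_factory=list)  # optional
    domains: List[str] = field(default_factory=list)       # optional


@dataclass
class Synset:
    """One language-specific synset (field names are hypothetical)."""
    source_id: str                   # unique ID linking to the source
    variants: List[str]              # variants or synonyms in the synset
    pos: str
    # (relation label, target synset ID) pairs inside the same wordnet
    internal_relations: List[Tuple[str, str]] = field(default_factory=list)
    # (relation label, ILI-record ID) pairs
    equivalence_relations: List[Tuple[str, str]] = field(default_factory=list)


def linked_via_ili(a: Synset, b: Synset) -> bool:
    """True if two synsets point to at least one common ILI-record."""
    ids_a = {ili_id for _, ili_id in a.equivalence_relations}
    ids_b = {ili_id for _, ili_id in b.equivalence_relations}
    return bool(ids_a & ids_b)


# Invented sample data; IDs, labels and the gloss are placeholders.
ili_dog = ILIRecord("ili-0001", ["dog", "domestic dog"], "n",
                    "a domesticated canine")
dutch = Synset("nl-0001", ["hond"], "n",
               equivalence_relations=[("equivalent", ili_dog.ili_id)])
spanish = Synset("es-0001", ["perro"], "n",
                 equivalence_relations=[("equivalent", ili_dog.ili_id)])

if __name__ == "__main__":
    print(linked_via_ili(dutch, spanish))
```

Comparing equivalence relations in this way mirrors the idea that synsets from different wordnets are connected only through the shared index, not through direct word-to-word translations.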

Each wordnet will be distributed with LR1 and will include documentation on LR1 and the distributed wordnet. All the data will be distributed as text-files in the EuroWordNet import format and as Polaris database files (see below LR3). The EuroWordNet viewer (Periscope, see below LR3) can be used to access the database version. Polaris has to be licensed to modify and extend the database version.

The wordnets are distributed without:

glosses
usage labels
morpho-syntactic properties
examples
word-to-word translations

LR(3) Software

The multilingual EUROWORDNET Database (partly Foreground, partly Background) consists of three components:

The actual wordnets in Flaim database format: an indexing and compression format of Novell.
Polaris (Louw 1997): a wordnet editing tool for creating, editing and exporting wordnets.
Periscope (Cuypers and Adriaens 1997): a graphical database viewer for viewing and exporting wordnets.

The Polaris tool is a re-implementation of the Novell ConceptNet toolkit (Díez-Orzas et al 1995) adapted to the EuroWordNet architecture. Polaris can import new wordnets or wordnet fragments from ASCII files with the correct import format and it creates an indexed EUROWORDNET Database. Furthermore, it allows a user to edit and add relations in the wordnets and to formulate queries. The Polaris toolkit makes it possible to visualise the semantic relations as a tree-structure that can directly be edited. These trees can be expanded and shrunk by clicking on word-meanings and by specifying so-called TABs indicating the kind and depth of relations that need to be shown. Expanded trees or sub-trees can be stored as a set of synsets, which can be manipulated, saved or loaded. Additionally, it is possible to access the ILI or the ontologies, and to switch between the wordnets and ontologies via the ILI. Finally, it contains an interface to project sets of synsets across wordnets.

The Periscope program is a public viewer that can be used to look at wordnets created by the Polaris tool and to compare them in a graphical interface. Word meanings can be looked up and trees can be expanded. Individual meanings or complete branches can be projected on another wordnet or wordnet structures can be compared via the equivalence relations with the Inter-Lingual-Index. Selected trees can be exported to text files. The Periscope program cannot be used for importing or changing wordnets.

N.	The Polaris program is partly Background and partly Foreground. It is the property of Lernout & Hauspie and can be licensed as a EuroWordNet result, either directly from Lernout & Hauspie or from ELRA.
O.	The Periscope viewer is the property of Lernout & Hauspie and is Foreground.

Prices

The prices are based on the number of synsets in each wordnet and differ for the kind of usage and ELRA-membership:

	Price per 1K Synsets (KS) in EUROs
VAR-C	250 EURO/1KS
VAR-I (Internal use only)	150 EURO/1KS
VAR-E (Evaluation licence)	20 EURO/1KS
End-User (Academic institution - for research only)	10 EURO/1KS

	Prices per license
Ksynsets per wordnet	VAR-C	VAR-I	VAR-E	End-User	Reduction	ELRA membership factor
1	250	150	20	10	0%	2
10	2500	1500	200	100	0%	2
20	5000	3000	400	200	0%	2
30	7500	4500	600	300	0%	2
40	10000	6000	800	400	0%	2
50	12500	7500	1000	500	0%	2
60	15000	9000	1200	600	5%	2
70	17500	10500	1400	700	5%	2
80	20000	12000	1600	800	5%	2
90	22500	13500	1800	900	5%	2
100	25000	15000	2000	1000	10%	2
120	30000	18000	2400	1200	10%	2
140	35000	21000	2800	1400	10%	2
150	37500	22500	3000	1500	10%	2
160	40000	24000	3200	1600	20%	2
170	42500	25500	3400	1700	20%	2
180	45000	27000	3600	1800	20%	2
190	47500	28500	3800	1900	20%	2
200	50000	30000	4000	2000	20%	2

From 60 Ksynsets a reduction of 5% is offered, from 100 Ksynsets a reduction of 10% and from 160 Ksynsets a reduction of 20%. If multiple wordnets are obtained, the sizes are added up and the reduction is based on the cumulative total. The percentage reduction is deducted from the price of each wordnet. For example, if one obtains 3 wordnets of 10KS, 20KS and 40KS, the total amount is 70KS. The prices for an ELRA member are then as follows:

	Prices in EURO for ELRA members without reduction				Prices in EURO for ELRA members with reduction of 5%
	10KS wordnet	20KS wordnet	40KS wordnet	Total 70KS	10KS wordnet	20KS wordnet	40KS wordnet	Total 70KS
VAR-C	2500	5000	10000	17500	2375	4750	9500	16625
VAR-I	1500	3000	6000	10500	1425	2850	5700	9975
VAR-E	200	400	800	1400	190	380	760	1330
End-User	100	200	400	700	95	190	380	665

Since the total is between 60 and 100KS, there will be a 5% reduction. The reduction is applied to the price of each wordnet. Non-members of ELRA pay double.
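
The pricing rules can be summarised in a few lines of code. The sketch below is an illustration rather than an official price calculator: the function names are invented, the per-KS rates and the membership factor of 2 are taken from the tables above, and the thresholds are read as inclusive (60, 100 and 160 Ksynsets), as the "Reduction" column of the price table suggests.

```python
from typing import Dict, List

# Price per 1K synsets in EURO for ELRA members, from the table above.
PRICE_PER_KS: Dict[str, int] = {
    "VAR-C": 250,
    "VAR-I": 150,
    "VAR-E": 20,
    "End-User": 10,
}


def reduction_percent(total_ks: int) -> int:
    """Reduction in percent for a cumulative total of Ksynsets."""
    if total_ks >= 160:
        return 20
    if total_ks >= 100:
        return 10
    if total_ks >= 60:
        return 5
    return 0


def licence_prices(sizes_ks: List[int], licence: str,
                   elra_member: bool = True) -> List[float]:
    """Price in EURO of each wordnet in one order.

    The reduction depends on the cumulative size of all wordnets and is
    deducted from each wordnet; non-members pay double.
    """
    percent = reduction_percent(sum(sizes_ks))
    factor = 1 if elra_member else 2
    rate = PRICE_PER_KS[licence]
    return [ks * rate * factor * (100 - percent) / 100 for ks in sizes_ks]


if __name__ == "__main__":
    order = [10, 20, 40]  # three wordnets, 70KS in total
    prices = licence_prices(order, "VAR-C")
    print(prices, sum(prices))
```

Working with whole percentages and dividing by 100 at the end keeps the arithmetic exact for the round figures used in the price list.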

Below are two examples for a wordnet with 30KSynsets and 40KSynsets.

Wordnet (30Ksynsets)	Price in EUROs for ELRA Member	Price in EUROs for non-Member
VAR-C	7,500 EURO	15,000 EURO
VAR-I (Internal use only)	4,500 EURO	9,000 EURO
VAR-E (Evaluation licence)	600 EURO	1,200 EURO
End-User (Academic institution - for research only)	300 EURO	600 EURO

Wordnet (40Ksynsets)	Price in EUROs for ELRA Member	Price in EUROs for non-Member
VAR-C	10,000 EURO	20,000 EURO
VAR-I (Internal use only)	6,000 EURO	12,000 EURO
VAR-E (Evaluation licence)	800 EURO	1,600 EURO
End-User (Academic institution - for research only)	400 EURO	800 EURO

Technical support

Technical support may be provided by members of the consortium. It will be arranged through bilateral agreements between the User and the consortium member responsible for the data acquired by the User. As an indication, the support contract will run on a yearly basis and cost 10-20 KEURO per year.

http://www.hum.uva.nl/~ewn

Tools (Grammar Software)

ALEP

ALEP is a flexible, fully configurable platform, designed to facilitate the description of linguistic phenomena, the compilation of these descriptions into an executable form and the application of the resulting code in a number of processes.

ALEP comes with a rule formalism that offers an expressive, yet concise and simple means to describe linguistic phenomena, a compiler and an engine, called the virtual machine, that uses the compiled linguistic rules in analysis, transfer or synthesis of texts.

LS-GRAM

Please download LS-GRAM gzipped tar-files from THIS SITE

The Large-Scale Grammars for EU Languages project (LRE-1 61029) is making its resources - language modules for Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish - available via ELRA. All modules have a text handling component, a two-level morphology, a word structure component (inflection only), and a grammar. They are based on the same principle and semantic descriptions, and have a common format. The linguistic basis for the grammatical part of the modules, which were developed via corpus investigation, is HPSG, with some revisions. Some of the grammars come close to the corpus in coverage. Efficiency played a decisive role: some of the modules can analyse paragraphs of several sentences comprising up to fifty words in less than a minute on an Ultra-Sparc. Last but not least, a large body of test material and very detailed documentation are available for all grammars.