
WRITTEN RESOURCES SPECIFICATIONS



CORPUS , Lexicon, Multil. lexicon , Tools


British National Corpus (BNC)

The BNC consists of extracts from 4124 modern British English texts of all kinds, both spoken and written. Each text is segmented into orthographic sentence units, and each word is automatically assigned a part-of-speech code. There are 6,250,000 sentences and over 100 million words.

The corpus was produced by a consortium of leading dictionary publishers (OUP, Longman, Chambers-Harrap) and academic research centres (Oxford University Computing Services, Unit for Computer Research in the English Language at Lancaster University, British Library Research and Development). It provides a unique and authoritative view of the state of the English language today, with carefully balanced representation of as many different varieties of English as possible. It can be used to exercise NLP systems of all kinds, as a fertile source of real-life examples for language learners, or simply to explore the way the language is currently used.

The first release of the BNC is packaged as 3 CD-ROMs.

The BNC is an SGML document complying with ISO 8879.

Contemporary Portuguese Corpus

This corpus covers the Portuguese language as spoken in Portugal, Brazil, Angola, Mozambique, Guinea, Macao, etc. It consists of about 1.5 million words of spoken language and more than 40 million words of Portuguese text extracted from fiction, technical, scientific, journalistic, legal, and political material. Some of the corpus is raw data, but some of it has been encoded according to an in-house formalism developed within the internal project "Corpus de Referência do Português Contemporâneo".

CRATER Multi-Lingual Aligned Corpus

The Corpus Resources and Terminology Extraction project (MLAP-93 20) has extended the bilingual annotated English-French International Telecommunications Union corpus to include Spanish, and has also debugged the existing corpus. In addition, a Spanish tagger has been developed, along with a set of retrieval tools for browsing the trilingual aligned corpus, and examining the proposed term or word alignments. The offer consists of the 3 x 1,000,000 token corpora of English, French and Spanish, morphosyntactic annotations (human-edited), lemmatisation and term extraction routines for English, French and Spanish.


ECI - European Corpus Initiative

The European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual corpus, and supports existing and projected national and international efforts to carefully design, collect and publish large-scale multilingual written and spoken corpora. ECI has produced the Multilingual Corpus I (ECI/MCI) of over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more. The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material.

Just a sampling of the contents of the CD-ROM:

The ECI/MCI is available from ELSNET.

ECI - ELSNET Italian & German tagged sub-corpus

The objective is to provide a small but fine-grained morphosyntactically tagged corpus of 50,000 running words for each of the two languages (Italian and German), to be used in research work on tagging methods and models. The German text comes from the Frankfurter Rundschau material in the ECI corpus; the Italian material comes from the Italian corpus of ILC - CNR. For German, the data covers several domains, including Economy (17,000 word forms), Politics (14,000 word forms), Culture (18,000 word forms), Sports (9,000 word forms), and Local Events (8,500 word forms). The situation for Italian is comparable. Word occurrences are tagged with very fine-grained tagsets based on the EAGLES morphosyntactic guidelines.

The tagging was done automatically and then checked manually. The CD-ROM contains the text in SGML format, the DBT software for browsing and operating on the annotated text, and the EAGLES morphosyntactic guidelines.


Multilingual Corpora for CO-OPERATION - MLCC

The MLCC text corpus has two main components - one set to allow comparable studies to be carried out in different languages and one set as the basis for translation studies.

The first set is referred to as the Polylingual Document Collection (ELRA-W0006), a collection of newspaper articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:

The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.

The corpus contains articles from the British financial newspaper The Financial Times editions from the year 1993. The corpus contains around 30 million words.

A corpus of articles from the French newspaper Le Monde, consisting of two years' worth (1992-1993) of articles on financial subjects, approximately 10 million words.

This German subcorpus consists of Handelsblatt articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.

The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML-markup was done by the University of Edinburgh.

This subcorpus contains articles from the Spanish financial newspaper Expansion editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.

The second set is a Multilingual Parallel Corpus (ELRA-W0007) consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:

Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).

Samples: Danish, German, English, Spanish, Greek, Italian, Dutch, Portuguese.

This parallel corpus consists of the records of Parliamentary sittings published as an annex to the Official Journal of the European Community, Debates of the European Parliament. The Parliamentary Debates are a record of what was said by members at the meeting as well as written input provided to the meeting. The original data from which the translations are produced consist of a transcript of the sittings, each member speaking in the language of his or her choice. The final version consists of nine parallel versions of the material. The texts delivered comprise the Debates of Parliament from January 1992 to July 1994. This sub-corpus contains some 5 to 8 million words per language.

Samples: Danish, German, English, Spanish, Greek, Italian, French, Dutch, Portuguese.

Monolingual Greek corpus (ILSP - Institute for Language and Speech Processing)

Monolingual Greek corpus of 1 million words. The corpus consists of articles written in 1996 from the Greek daily newspaper ELEFTHEROTIPIA. Each file contains annotated text with SGML mark-up accompanied by a text header.


MTP Annotated Corpus of German

This morphosyntactically annotated 500,000 word German corpus was developed as part of the Münster Tagging Project (MTP). It comprises a collection of SGML-formatted texts from two German newspapers, "Die Frankfurter Allgemeine Zeitung" and "Die Zeit", for the years 1990 to 1992. The articles reflect the typical distribution of newspaper topics, including economics, regional, national and international politics, the arts, sport, literature, history, science and modern life.

The text was segmented into sentence units and word tokens, and tagged for morphosyntactic POS markers. Two tagsets, which mainly differed in the granularity of the noun and verb tags, and which comprised 137 and 52 tags respectively, were used. Users may obtain annotated versions using either set, each of which comes with documentation and an instruction manual for tag application. A suite of tools, including the MTP taggers and the Xlex workbench for text handling, textual analysis and lexicography, is also available.


PRESS 65 (Swedish corpus)

Språkdata has made available the first of its many Swedish corpora, PRESS 65. It consists of one million running words taken from Swedish newspapers from the year 1965. It has been categorised according to text type and is annotated down to the sentence level.

Tagged text in French (MEMODATA)

More than 170 books (classical novels, legal texts...) are tagged with or without rules of morphological disambiguation. A tagged corpus of 50 books is available for research. It covers several 19th-century authors (Balzac, Hugo, Stendhal).



Test Suites for Natural Language Processing (TSNLP)

The TSNLP project (LRE 62-089) has produced a database of test suites for English, French and German containing over 4,000 test items (sentences or sentence fragments) per language. The items were constructed for evaluating natural language processing systems, but may also be useful for other purposes. The examples have been systematically constructed with detailed annotations about grammatical and other information, and are relevant to developers or users of systems with grammatical components who wish to test, benchmark, or evaluate them. A three-volume user manual documents major project results, including a description of the test data, the underlying methodology and the tools developed to aid test suite construction and use.

Text corpus of "Le Monde"

Electronic archiving of "Le Monde" articles started on 1 January 1987. Some 200 articles are added every day, and as of October 1997 the database contains more than 500,000 articles, making it the biggest of its kind for all French daily newspapers.

The corpus is available in an ASCII text format. Each month consists of some 10 MB of data (circa 120 MB per year).

Data ranging from 1987 until present date are available through ELRA (each buyer may purchase up to 5 years of data).


Karl-May-Korpus (KMK corpus)

Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May from 1993 to 1997 and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each).

Each word form is tagged with word class (1 out of 43 classes) and appropriate lemma.

File format: Text
Standard in use: SGML
Character set: 8-bit ASCII


MULTEXT JOC Corpus

This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains approx. 5 million words in English, French, German, Italian and Spanish (approx. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.

The JOC corpus is delivered in Corpus Encoding Standard conformant format at each level of treatment :

Additional information: http://www.lpl.univ-aix.fr/projects/multext


ARCADE/ROMANSEVAL corpus

The ARCADE/ROMANSEVAL corpus was used as a reference corpus in two international competitions:

The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Community).

The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3,700 contexts altogether, and comprises:

Additional information:
http://www.lpl.univ-aix.fr/projects/arcade
http://www.lpl.univ-aix.fr/projects/romanseval


Dutch PAROLE Distributable Corpus

The Dutch PAROLE Distributable Corpus is a 3-million-word selection from the 20-million-word Dutch PAROLE Reference corpus.

The Dutch corpus annotation and checking were carried out according to the common core PAROLE tagset. The Dutch data were also checked for type.

The Dutch PAROLE Distributable Corpus contains the following texts:

MEDIUM         SOURCE                             TIMESPAN    TOTAL NUMBER OF WORDS

BOOKS          Van Sterkenburg:
               Wdlijst tot wdboek                 1984                       65,344
               Taal vt Journaal                   1989                       56,215
               WNT-portret                        1992                       60,133

NEWSPAPERS     Short Newspaper texts:
               MN_Collection                      1986-1988                  19,537
               CVNP(S)-Collection                 1983-1990                 179,220

PERIODICAL     Short texts from
               - Local Papers                     1985-1988                  47,019
               - Magazines                        1985-1989                 164,589

MISCELLANEOUS  Texts to be read out in
               TV-news broadcasts for:
               - General audience                 1992-1995               1,285,824
               - Youth                            1991-1995               1,008,658
               Short texts from
               Ephemera                           1985-1986                 131,692

TOTAL                                                                     3,018,231
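
As a quick consistency check, the per-source word counts listed above can be summed and compared with the stated total. The snippet below is only an illustrative sketch; the dictionary keys are shorthand labels chosen here, not identifiers from the corpus distribution.

```python
# Word counts per source, copied from the listing above.
# The keys are shorthand labels chosen for this sketch.
word_counts = {
    "Wdlijst tot wdboek": 65_344,
    "Taal vt Journaal": 56_215,
    "WNT-portret": 60_133,
    "MN_Collection": 19_537,
    "CVNP(S)-Collection": 179_220,
    "Local Papers": 47_019,
    "Magazines": 164_589,
    "TV news, general audience": 1_285_824,
    "TV news, youth": 1_008_658,
    "Ephemera": 131_692,
}

total = sum(word_counts.values())
print(total)  # 3018231, the total stated for the corpus
```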

Over 250,000 words of corpus texts have been PoS-tagged automatically. A total of 59,798 running words has been manually corrected and checked at least two times with respect to maximal granularity, according to a lexicographer's manual. The extra 9,000 words over the required 50,000 words compensate for the occurrence of ca. 5,300 "keywords" in the original texts. The fully corrected material has been subjected to an automated post-control operation, checking the pertinence relations between the various feature values, and instantiating default values in case a mismatch (indicating a correction error) was found. Ca. 200,000 words have been checked once for PoS and type. In addition to the required PoS, type was checked for reasons of quality. This material has been subjected to an automated correction procedure addressing the feature slots (positions) beyond the first two for PoS and type so as to solve discrepancies between the manually corrected PoS and type, and the possibly erroneous, automatically assigned values of the remaining slots.

More info on the Parole project.



Monolingual Lexicon


DICO-MORPH_lemme. (MEMODATA)

Entries: more than 400 000
Language: French
Format: ASCII with separators
Medium: CD-ROM
Reusable French lexicon for morphological processing, which gives the canonical form for each inflected form. This lexicon is divided into the following lexical categories: nouns (55,000), verbs (8,000), adjectives (16,850), adverbs (2,000), other words (30,000).

DICO-MORPH_Collocation. (MEMODATA)

Entries: up to 35000
Language: French
Format: ASCII
Medium: Floppy disk

This is an add-on to the French lexicon for morphological processing (listed above as DICO-MORPH_lemme, MEMODATA).



DICO-SYNT. (MEMODATA)

Entries: 90 000
Language: French
Format: ASCII
Medium: Floppy disk

This resource gives the morpho-syntactic information for DICO-MORPH_lemme: proper noun, transitive verb, ... There are around 800 categories of verbs. The lexical categories are: nouns (25,000), verbs (8,000, which generate 25,000 verb/models), adjectives (1,000), adverbs (1,500).



Dutch Lexicon (LanTmark) General vocabulary

Entries: 64000
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The Dutch LanTmark lexicon is divided into the following categories: nouns (50,000), verbs (7,000), adjectives (6,000), adverbs (1,000).

Each entry contains morphological information (inflections, comparative and superlative markers), syntactic information (such as positional features, gender, complement markers and verb arguments), semantic information (lexical semantics for nouns, adverbs and adjectives).
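
The format description above says only that a lexicon file holds feature-value pairs on each line, with separators between entries; the concrete syntax is not given here. The following is a minimal sketch of reading such a file, assuming purely for illustration that each line carries one `feature=value` pair and that a line containing only `--` separates entries. The function name, the feature names and the sample data are all hypothetical.

```python
def parse_lexicon(text, entry_separator="--"):
    """Split text into entries; each entry is a dict of feature -> value.

    Assumed layout (hypothetical): one 'feature=value' pair per line,
    with a line consisting only of entry_separator between entries.
    """
    entries, current = [], {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == entry_separator:
            # Separator reached: store the entry collected so far.
            if current:
                entries.append(current)
            current = {}
        elif "=" in line:
            feature, value = line.split("=", 1)
            current[feature.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


# Invented sample data; the feature names are placeholders.
sample = """lemma=huis
cat=noun
gender=neuter
--
lemma=lopen
cat=verb
"""

entries = parse_lexicon(sample)
print(entries[0]["lemma"], entries[1]["cat"])
```

The actual separator and pair syntax should be taken from the documentation delivered with the lexicon.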


French Lexicon (LanTmark)

General vocabulary
Entries: 50000
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The French LanTmark lexicon is divided into the following categories: nouns (36,000), verbs (6,000), adjectives (7,000), adverbs (1,000).

Each entry contains morphological information (inflections, comparative and superlative markers), syntactic information (such as positional features, gender, complement markers and verb arguments), semantic information (lexical semantics for nouns, adverbs and adjectives).


ILC Italian Morphological Lexicon

The ILC Italian Morphological Lexicon consists of a set of lemmas/lexical entries (about 60,000) with the corresponding inflected word-forms, and a morphological engine for morphological analysis and generation. Lemmas and word-forms are encoded with grammatical codes compatible with the EAGLES recommendations for lexicon encoding at the morphosyntactic level.

LEXin 1:e (Swedish Lexicon)

The first edition of LEXin 1:e, a Swedish database used as the basis for a lexicon for immigrants, is now available via ELRA. Produced by Språkdata in Göteborg, Sweden, it consists of approximately 17,000 headwords and 21,000 senses, and contains explicit morphological information for every headword and syntactical information for all verbs and many adjectives. Each sense is illustrated by a paraphrase, as opposed to a formal definition. Derivational forms, phrases and idioms are also included. The format is flexible and can be customised to individual wishes within reasonable limits.

Monolingual Danish lexicon

(Institut for Erhvervsinformatik)

Entries: 25000
Format: ASCII
This dictionary was developed for machine translation. Each lexeme contains the word class, inflection, semantic features, syntactic frames (for verbs), and complement (for nouns and adjectives).


Monolingual Portuguese lexicon

(Centro de Linguistica da Universidade de Lisboa)
Entries: 60 000
Monolingual Portuguese lexicon with morphological information, with a software engine, written in C, for generating all inflected forms, including adj-adverb derivation.

MULTEXT LEXICONS

This CD-ROM contains a set of lexicons developed in the MULTEXT project financed by the European Commission (LRE 62-050). The set contains the following languages: English, French, German, Italian and Spanish.

English 66,214 Word forms
French 306,795 Word forms
German 233,861 Word forms
Italian 145,530 Word forms
Spanish 510,710 Word forms

The MULTEXT lexicons are three-column, tab-separated tables: the first column contains the word-form, the second column contains the lemma, and the third column contains the morpho-syntactic information associated with that form. This information conforms to the MULTEXT/EAGLES specifications.
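
A three-column, tab-separated table of this kind can be read with a few lines of code. The sketch below is a minimal illustration, not part of the distribution: the function names are chosen here, and the sample lines (including the tag strings in the third column) are invented for the example rather than taken from the CD-ROM.

```python
from collections import defaultdict


def read_lexicon(lines):
    """Yield (word_form, lemma, msd) triples from tab-separated lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        word_form, lemma, msd = fields
        yield word_form, lemma, msd


def index_by_form(triples):
    """Group (lemma, msd) analyses under each word-form."""
    index = defaultdict(list)
    for word_form, lemma, msd in triples:
        index[word_form].append((lemma, msd))
    return dict(index)


# Invented sample lines in the three-column layout described above.
sample_lines = [
    "chats\tchat\tNcmp\n",
    "porte\tporte\tNcfs\n",
    "porte\tporter\tVmip3s\n",
]

lexicon = index_by_form(read_lexicon(sample_lines))
print(lexicon["porte"])  # [('porte', 'Ncfs'), ('porter', 'Vmip3s')]
```

Grouping by word-form keeps ambiguous forms together, which is convenient when the lexicon is used for lookup during tagging.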

Additional information: http://www.lpl.univ-aix.fr/projects/multext


Portuguese morphological lexicon PALAVROSO, (INESC)

Entries: 60 000

Monolingual Portuguese lexicon with a rule-based morphological analysis which also handles enclitics, compounds, diminutives and augmentatives.

PALAVROSO is a European Portuguese lexicon consisting of a set of about 60,000 lexical entries (lemmas) and a rule-based morphological engine for morphological analysis that recognises more than 1,300,000 word-forms. The rule set also allows enclitics, compound words, diminutives and augmentatives to be handled correctly. The information encoded is compatible with the EAGLES recommendations for lexicon encoding at the morpho-syntactic level.


Spanish gilcUB-M-Dictionary

General vocabulary
Entries: 60000
Format: ASCII format with ISO 8859-1 character set. Available versions include attribute-value pairs and tag-style encoding.
Medium: QIC 150 MB Cartridge Tape

The Spanish gilcUB-M-Dictionary is a full-form lexicon derived from 60,000 lemmas of general vocabulary (9,700 verbs, 35,500 nouns, 14,300 adjectives and 120 adverbs). Adverbs that can be derived from adjectival forms are also included as full forms (about 10,000 forms). Each form is given with its associated lemma and with morphosyntactic information compatible with the EAGLES recommendations for morphosyntactic encoding.



THAMUS. Generic Italian dictionary

(Consorzio per la linguistica computazionale)
Entries: 116000

A generic monolingual Italian dictionary. Its morphological coding allows all full forms to be generated by means of a software engine written in C. Multi-word terms contain morphological coding for the head word.


Adverbial Equivalence Dictionary (CORA)

Entries: 1,200
Language: French
Format : Word Processing file (Word...)
Medium: floppy disk
Simplified equivalents for fixed expressions.

Nominalisation dictionary (CORA)

Entries: 2,300
Language: French
Format : Word Processing file (Word...)
Medium: Floppy disk
Corresponding substantives for verbs

Tri-, quadri-, pentagrams dictionaries (CORA)

Sequences: 5,487
Format : ASCII
Medium: Floppy disk
The dictionaries consist of a list of sequences of 3, 4 or 5 characters which follow each other in French words. In particular, they enable users to locate misspelt sequences.
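
The idea of locating misspelt sequences with such lists can be sketched as follows. This is a minimal illustration under stated assumptions: the reference set here is built from three French words chosen for the example, whereas in practice it would be loaded from the dictionaries themselves, whose file layout is not described here. The function names are hypothetical.

```python
def char_ngrams(word, n):
    """Return all sequences of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def unknown_sequences(word, known, sizes=(3, 4, 5)):
    """List the 3-, 4- and 5-character sequences of word not found in known."""
    return [
        gram
        for n in sizes
        for gram in char_ngrams(word, n)
        if gram not in known
    ]


# Tiny stand-in for the dictionaries: sequences collected from a few words.
reference_words = ["maison", "raison", "saison"]
known = {
    gram
    for word in reference_words
    for n in (3, 4, 5)
    for gram in char_ngrams(word, n)
}

print(unknown_sequences("maison", known))  # [] : every sequence is attested
print(unknown_sequences("maiosn", known))  # non-empty: transposed letters
```

A word whose sequences are all attested is not necessarily correct, so a check of this kind flags likely errors rather than proving correctness.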

"N de N" Dictionary (compound nouns) (CORA)

This dictionary contains 21,000 compound nouns of uninflected "N de N" groups, classified in 1,000 human entries (divided into job, group, animate), 4,200 concrete entries (divided into clothes, dishes, furniture...), 6,000 abstract entries (divided into tables of auxiliary verbs such as "avoir", "donner", etc.), plus syntactic/semantic information about determiners, verbs, etc.
Language: French
Format : Tagged ASCII
Medium: Floppy disk



German lexicon (CORA)

Entries: 466,300 inflected forms. The same word can be represented in one or more files and thus may count as several entries.
Language: German
Format: ASCII
Medium: Floppy disk
This lexicon is divided into 7 main syntactic categories: nouns (97,000), verbs (236,200), adjectives and some adverbs (130,500), grammatical words (1,700), punctuation (40), prefixes (400), and suffixes (370). Each file consists of a word list corresponding to syntactical and morphological categories. The lexicon does not include lemmas.

English lexicon (CORA)

Entries: 160,000.
Language: English
Format: ASCII
Medium: Floppy disk
The dictionary is divided into 4 main syntactic categories: nouns (93,500), verbs (35,800), adjectives (46,600), grammatical words (8,865). The lexicon contains a list of inflected words with corresponding syntactic categories and lemmas. Each entry is tagged with specific separators. A single word corresponds to a single entry in the lexicon.

DST Dictionary (CORA)

Entries: 550,000 inflected forms.
Language: French
Format : ASCII
Medium: tape, CD-ROM
Simple forms are divided into: 43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 noun-adjectives, 6,800 verbs, 6,200 compound nouns, 4,680 adverbs and adverbial phrases, 3,292 unelided words, 903 prefixes, 682 abbreviations and measures, 218 pronouns, 248 conjunctions and subordinating conjunction phrases, 186 prepositions and prepositional phrases, 86 determiners, 16 predeterminers, 14 co-ordinating conjunction phrases, as well as all possible homographs. The DST includes semantic (cars, places, wines, etc.), syntactic (gender, number, tense, etc.), morphological (lemma), lexicological (homographs) and more specific syntactical information (prepositions followed by an infinitive form, intransitive verbs with " avoir " or " être ", etc.).



Dictionary of French verbs - CORA

This dictionary contains 25,610 verbs with usage domains, level of language (familiar, popular, literary, Quebec and Swiss terms, etc.), conjugation, auxiliary, verbal adjectives in -able, -ant or -é, encoded syntactical constructions (subject, direct & indirect object, adverb), sample phrases, synonyms, operators enabling semantic-syntactic classification, encoding of derived forms in -age, -ment, -tion, -oir, -ure, deverbal nouns, base words from which verbs can be derived, a scale of usage ranging from 1 to 6, like those used by commercial dictionaries (basic vocabulary, extended, specialised, etc.).

Codes enable automatic production of conjugation forms, derived nouns and adjectives and, if necessary, the production of potential forms.


Dictionary of words - CORA

This dictionary is composed of 126,844 words, with usage domains, grammatical category, gender, number, uncountable and collective markers, and adjectival, nominal, verbal and adverbial derived forms, depending on the type of word.

Dictionary of affixes - CORA

4,286 suffixes and prefixes, plus information on their verbal, nominal or adjectival bases or on the verbal basis of Greco-Latin items. This dictionary does not include the suffixes contained in the dictionary of French verbs (ELRA-L0021) and the dictionary of words (ELRA-L0022), such as -age, -ment, -if, -oir.

Dictionary of verb phrases - CORA

Dictionary of 3,480 entries based on the model of the dictionary of French verbs (ELRA-L0021).

Dictionary of invariable forms and phrases - CORA

Dictionary of 4,783 entries based on the model of the dictionary of words (ELRA-L0022).

Dictionary of exclamatory stereotyped phrases - CORA

Dictionary of 1,901 entries based on the model of the dictionary of invariable forms and phrases (ELRA-L0025).

Dictionary of French local authorities - CORA

38,965 entries in lower case with accents, checked against the Michelin guide, excluding named localities ("lieux-dits"). A link can be made to the dictionary of words (ELRA-L0022), which contains inhabitants' names and their correspondence with town names.

Dictionary of noun phrases and plural-only words - CORA

2,138 compound names and 1,397 entries of plural-only words.

CELEX Dutch lexical database

The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch Version 3.1, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995.

Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4-million-word text corpora of the Dutch Instituut voor Nederlandse Lexicologie.

To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files.
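
Linking information across files through the identity numbers can be done with any scripting tool. The sketch below uses Python rather than AWK or ICON and rests on assumptions made only for illustration: that fields are separated by a backslash, that the identity number is the first field of each line, and that the two sample files hold spellings and frequencies. The sample values are invented and the function name is hypothetical; the real file layouts are documented with the database.

```python
def load_table(text, separator="\\"):
    """Map identity number -> remaining fields for one plain-text file."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        id_num, *fields = line.split(separator)
        table[id_num] = fields
    return table


# Invented sample content for two files that share identity numbers.
orthography_file = "1\\huis\n2\\lopen\n"
frequency_file = "1\\5210\n2\\3187\n"

orthography = load_table(orthography_file)
frequency = load_table(frequency_file)

# Join the two tables on the identity numbers present in both.
linked = {
    id_num: orthography[id_num] + frequency[id_num]
    for id_num in orthography.keys() & frequency.keys()
}
print(linked["1"])  # ['huis', '5210']
```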

This database can be divided into different subsets:


Bulgarian Morphological Dictionary

This dictionary contains 67500 entries divided into 242 inflectional types (including proper nouns), morphosyntactic information for each entry, and a morphological engine (MS DOS and WINDOWS 95/NT) for morphological analysis and generation. The data may be used for morphological analysis and synthesis.

Structure of entries: Local linguistic variant
File format: ASCII; lowercase letters
Standard in use: ISO
Character set: 8-bit ASCII; ASCII codes alphabetically: 160-191
Medium: Floppy disk


Dutch PAROLE lexicon

The entry list of the lexicon consists of about 20,200 entries distributed over 13 parts of speech (POS). The entries have been described along the dimensions of morphosyntax and syntax. Morphosyntactic information consists of various lexical properties, like gender, number, case, person, inflection, etc. Syntactic descriptions consist of typical complementation patterns associated with the various lemmata.

The composition of the entry list of the lexicon is based on 3 corpora from the Instituut voor Nederlandse Lexicologie (INL) and 2 lexica. The corpora contain a total of about 54 million words and have been automatically annotated for part-of-speech and lemma. The lexica contain morphosyntactic information of various kinds. For verbs, nouns, adjectives and adverbs, lemmata that were covered by at least 2 corpora and the 2 lexica were selected on the basis of cumulative frequency, coverage (distribution over sources) and inflected forms. For the smaller parts of speech, these selection requirements appeared to be too strict. Entry selection for these parts of speech was based on ranked frequency.

The entries, uniquely defined by the combination of part of speech (e.g. noun) and subtype (e.g. common vs. proper noun), are provided with morphosyntactic information according to the Dutch set of PAROLE categories and features, and, where available, with syntactic information. Morphosyntactic information is automatically extracted from the INL lexica. Syntactic data have been collected manually, by inspection of corpus data and - where necessary - consultation of reference works. The corpus consulted consists of the newspaper component and the varied component of the 38 Million Words Corpus 1996.

Word forms in the Dutch PAROLE lexicon are not inflected according to general paradigms, but are related to their lemma by a set of string procedures. These procedures are not unique. They can be shared by many other word forms. An example is suffixation with -e for adjectives, which produces "goede"/good from "goed". Inflected forms can be derived directly by applying the string procedures to the lemma they are connected with.
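
The notion of a shared string procedure can be illustrated with a small sketch. It assumes, for illustration only, that a procedure is a pair "remove this suffix, then append that suffix"; the lexicon's actual SGML encoding of procedures is not shown here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringProcedure:
    """A reusable rule: drop `remove` from the end of a lemma, then add `append`."""
    remove: str
    append: str


def apply_procedure(lemma, procedure):
    """Derive an inflected form by applying a string procedure to a lemma."""
    if procedure.remove and not lemma.endswith(procedure.remove):
        raise ValueError(f"{lemma!r} does not end in {procedure.remove!r}")
    stem = lemma[: len(lemma) - len(procedure.remove)]
    return stem + procedure.append


# Suffixation with -e for adjectives, as in the example in the text.
ADD_E = StringProcedure(remove="", append="e")

# One procedure is shared by several lemmata.
print(apply_procedure("goed", ADD_E))   # goede
print(apply_procedure("klein", ADD_E))  # kleine
```

Because a procedure is stored once and referenced by many word forms, the lexicon does not need to list a full paradigm for every lemma.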

The lexicon is set up as an SGML file (over 30 MB of plain ASCII). Its contents have been encoded in a distributed manner: all formative entities (like lemmata, syntactic phrases, feature bundles) are SGML entities, related by a pointer mechanism to other entities.

The lexicon contains the following categories: adjectives (3,298 entries), adpositions (80 entries), adverbs (554 entries), articles (3 entries), conjunctions (70 entries), determiners (59 entries), interjections (235 entries), nouns (12,279 entries), numerals (77 entries), pronouns (85 entries), residuals (186 entries), unique (1 entry), verbs (3,274 entries).

More info on the Parole project.



Multilingual Lexicon


Basic multilingual lexicon (MEMODATA)

Entries: 30 000 per language
Languages: French, English, Italian, German, Spanish
Format: ASCII or ANSI with separators between entries
Medium: CD-ROM

Words are linked across the languages by meaning. The lexical categories are: nouns (5 * 18 000), verbs (5 * 8 000), adjectives (5 * 6 000), adverbs (5 * 1 500).



Bilingual Spanish-English and English-Spanish Lexicons (INCYTA)

Technical domains
Economics, law and Business management:          10,640 entries
Leisure, Tourism, Sports, Food:                   3,140 entries
Geography, History, Arts:                         4,110 entries
Sociology, Psychology, Pedagogy:                  4,080 entries
Natural and medical sciences:                    10,530 entries
Exact sciences, Physics, Chemistry, Geology:     10,610 entries
Data Processing, Electronics, Telecommunications: 4,900 entries
Technology, Engineering and Construction:        11,950 entries
Economics                                         1,320 entries
Data Processing                                   3,560 entries
Telecommunications                                3,730 entries
Electrical Engineering                            1,760 entries
Plastics and Chemistry                            9,020 entries
Aeronautics, Navigation, Mechanical Engin.       23,170 entries
The entries contain morphological information for part-of-speech and inflectional class. The information on multi-word terms is provided by the headword.

Danish - German dictionary

(Institut for Erhvervsinformatik)
General vocabulary
Entries: 10 000
Format: ASCII

This dictionary was developed for machine translation. It gives the German lexeme with word class and Danish equivalent with word class, subject area, indication of structural changes from DK-G.


Dutch-French Lexicon (LanTmark)

General and Specialised vocabularies for transfer
Transfer Entries:
General Vocabulary (26 000), Administrative (32 000), Data processing (10 000).
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The general Dutch-French LanTmark lexicon is divided into the following categories: nouns (14,000), verbs (6,000), adjectives (5,000), adverbs (1,000).

The administrative vocabulary is divided into the following categories: nouns (30,000), verbs (2,000).
The data processing vocabulary has 10,000 transfer nouns.
Each entry contains domain information, source language disambiguation, features, and target language actions.
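
As a rough illustration of the file layout described above (feature-value pairs on each line, separators between entries, ISO 8859-1 character set), the following Python sketch reads such a file into a list of dictionaries. The catalogue does not specify the concrete syntax, so the `=` delimiter between feature and value, the blank-line entry separator, and the feature names in the sample are all assumptions made purely for illustration.

```python
from typing import Dict, List


def parse_entries(text: str, pair_sep: str = "=") -> List[Dict[str, str]]:
    """Split lexicon text into entries of feature-value pairs.

    Assumptions (not taken from the catalogue): one 'feature=value' pair
    per line, and one or more blank lines separating entries.
    """
    entries: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the entry collected so far.
            if current:
                entries.append(current)
                current = {}
            continue
        feature, _, value = line.partition(pair_sep)
        current[feature.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def load_lexicon(path: str) -> List[Dict[str, str]]:
    # The catalogue states that the files use the ISO 8859-1 character set.
    with open(path, encoding="iso-8859-1") as handle:
        return parse_entries(handle.read())


# Hypothetical sample; the feature names are invented for this sketch.
SAMPLE = (
    "source=huis\ncat=noun\ntarget=maison\n"
    "\n"
    "source=lopen\ncat=verb\ntarget=marcher\n"
)
```

Calling `parse_entries(SAMPLE)` yields two dictionaries, one per entry. A real LanTmark file would need the delimiter and separator adjusted to whatever its documentation specifies.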


English-French Lexicon (LanTmark)

General vocabulary for transfer
Transfer Entries: 27000
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The English-French LanTmark lexicon is divided into the following lexical categories: nouns (14,000), verbs (7,000), adjectives (5,000), adverbs (1,000).
Each entry contains domain information, source language disambiguation, features, and target language actions.


French-Dutch Lexicon (LanTmark)

General and Specialised vocabularies for transfer
Transfer Entries:
General Vocabulary (34 000), Administrative (18 000), Data processing (10 000).
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy disk, QIC 150 MB cartridge tape

The general French-Dutch LanTmark lexicon is divided into the following categories: nouns (25,000), verbs (3,000), adjectives (5,000), adverbs (1,000).
Administrative vocabulary is divided into the following categories: nouns (16,000), verbs (2,000).
Data processing vocabulary has 10,000 transfer nouns.
Each entry contains domain information, source language disambiguation, features, and target language actions.


French-English Lexicon (LanTmark)

General vocabulary for transfer
Transfer Entries: 34 000
Format: ASCII format with ISO 8859-1 character set. A lexicon file contains entries with feature-value pairs on each line and separators between entries.
Medium: Floppy Disk, QIC 150 MB Cartridge Tape

The French-English LanTmark lexicon is divided into the following lexical categories: nouns (21,000), verbs (9,000), adjectives (3,000), adverbs (1,000).
Each entry contains domain information, source language disambiguation, features, and target language actions.


German-Danish dictionaries

(Institut for Erhvervsinformatik)
Technical and General vocabulary
Entries: 6800 (technical) - 15500 (general)
Format: ASCII

This dictionary was developed for machine translation. It gives the German lexeme with word class and Danish equivalent with word class, subject area, indication of structural changes from G-DK (e.g. direct object → PP (Prep 'xxx')).


THAMUS. Bilingual dictionaries

(Consorzio per la linguistica computazionale)
Technical domains
Languages: German/Italian - Italian/German
Computer Science 35.000 entries
Construction 7.000 entries

Technical bilingual Italian dictionaries with a morphological coding which can generate all full forms using a software engine written in C. Multi-word terms contain morphological coding for the head word.


THAMUS. Bilingual dictionaries

(Consorzio per la linguistica computazionale)
Technical domains
Languages: English - Italian
Format: ASCII format with ISO 8859-1 character set
Medium: QIC 150 MB Cartridge Tape
Aeronautics        6.500 entries
Law               18.000 entries
Computer Science  31.000 entries
Medicine          20.000 entries
Economics         82.000 entries
Engineering       27.000 entries
Technical bilingual Italian dictionaries with a morphological coding which can generate all full forms using a software engine written in C. Multi-word terms contain morphological coding for the head word.

Bilingual Collocational Dictionary (Horst Bogatz)

The bilingual English-German collocational dictionary consists of around 40,000 English headwords, including concepts expressed by more than one word (e.g. "environmental awareness" or "lame duck") and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword, as well as the German equivalents for the headwords and their English synonyms.

The corpus on which the dictionary is based consists of a representative group of written (British) English texts - books, magazines, and the quality press - which runs to about two million words. All entries are based on contemporary evidence, and are typical of words that appear at least once in a two-million word corpus. The examples and phrases are a major feature of this dictionary.

A global search will provide all collocations that can possibly be associated with the search word. A search engine, the Advanced Reader's Collocation Searcher (ARCS), is supplied with the data and provides all possible German equivalents of the headwords. All entries are sorted according to part-of-speech categories. The latter feature makes it possible for searches to yield different useful combinations of words, e.g. noun + verb + adjective + examples extracted from the corpus + synonyms. A global search will also locate all words semantically connected with the search word in both English and German.

More information ?

Bilingual dictionaries (Translation Experts Ltd.)

Bilingual dictionaries for demonstration and commercial use containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features. The level of information in each entry varies depending on the word/phrase and on the dictionary. However, all of the above are present in varying degrees in the dictionaries. These dictionaries may be of particular interest for spell-checking, thesaurus, hyphenation and translation of natural languages. A Level 2 translation engine, also available via ELRA, provides exact translations, output in LOCAL-UCS format, for input words and phrases, input in LOCAL-UCS format, based on the vocabulary stored in a compressed translation file.

Each pair of languages may be purchased as different sets or subsets, corresponding to the indicated number of entries. All pairs consist of English to and from another language. The following groups of languages are available:

GROUP 1 (English <=> Language A):

Language A = Spanish (25,000, 60,000, 100,000 and 200,000 entries), French (40,000, 80,000, 100,000 and 200,000 entries), German (40,000, 80,000 and 126,000 entries), Italian (20,000 and 40,000 entries), Brazilian Portuguese (40,000, 80,000 and 400,000 entries), Portuguese (40,000, 80,000, 110,000 and 234,000 entries), Dutch (40,000, 80,000 and 110,000 entries).

GROUP 2 (English <=> Language B):

Language B = Danish (40,000, 80,000 and 110,000 entries), Swedish (40,000, 80,000 and 110,000 entries), Finnish (30,000 entries), Icelandic (40,000, 80,000 and 100,000 entries).

GROUP 3 (English <=> Language C):

Language C = Russian (40,000, 72,000 and 120,000 entries), Russian Business (60,000 entries), Russian Aerospace (60,000 entries), Russian Automotive (40,000 entries), Russian Minerals & Mining (60,000 entries), Polish (30,000, 80,000, 124,000 and 150,000 entries), Hungarian (30,000, 80,000 and 124,000 entries), Czech (40,000 entries), Romanian Starter (10,000 entries).

GROUP 4 (English <=> Language D):

Language D = Croatian (30,000 entries), Bosnian (30,000 entries), Serbian (Latin or Cyrillic) (30,000 entries).

GROUP 5 (English <=> Language E):

Language E = Japanese (40,000 entries).

GROUP 6 (English <=> Language F):

Language F = Greek (60,000 entries).

File format: Text
Standard in use: ISO
Character set: 8-bit ASCII and UNICODE
Means of delivery: CD-ROM, floppy disk or downloaded from the Web.
Related tools: Word Translator®, NeuroTran®, InterTran®, MobileTran®.

Please see http://www.tranexp.com for more information


EUROWORDNET

The EUROWORDNET DATA consists of the following modules:

A. Available Wordnets

B. LR(1) Common Components

C. LR(2) Language-Specific Components

D. LR(3) Software

E. Prices

F. Technical support

 

  Available Wordnets

    Following the announcement of the EuroWordNet databases in the last issue of the ELRA Newsletter (Vol.4 N.2), we are happy to announce that the list of EuroWordNet languages has grown. The following wordnets are now available via ELRA:

    ELRA ref.   Language                             Synsets  Word Meanings  Language Internal Relations  Equivalence Relations
    ELRA-M0015  English Addition to English WordNet    16361          40588                        42140                      0
    ELRA-M0016  Dutch                                  44015          70201                       111639                  53448
    ELRA-M0017  Spanish                                23370          50526                        55163                  21236
    ELRA-M0018  Italian                                40428          48499                       117068                  71789
    ELRA-M0019  German                                 15132          20453                        34818                  16347
    ELRA-M0020  French                                 22745          32809                        49494                  22730
    ELRA-M0021  Czech                                  12824          19949                        26259                  12824
    ELRA-M0022  Estonian                                7678          13839                        16318                   9004



  LR(1) Common Components (All Foreground - Data of layer 1)

    A. The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created. An ILI-record contains:

       A.1 synset: set of synonymous words or phrases (mostly from WordNet1.5)
       A.2 part-of-speech
       A.3 one or more Top-Concept classifications (optional)
       A.4 one or more Domain labels (optional)
       A.5 a gloss in English (mostly from WordNet1.5)
       A.6 a unique ID linking the synset to its source (mostly WordNet1.5)

    B. Top-Ontology: an ontology of 63 basic semantic classes based on fundamental distinctions. By means of the Top-Ontology all the wordnets can be accessed using a single language-independent classification-scheme. Top-Concepts are only assigned to ILI-records.

    C. Domain-ontology: an ontology of subject-domains optionally assigned to ILI-records.

    D. A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets. These Base-Concepts form the core of all the wordnets. All the Base-Concepts are classified in terms of the Top-Concepts that apply to them.

    E. WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
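
The ILI-record fields A.1 to A.6 listed above can be pictured as a small record type. The Python sketch below is only an illustration of that field list, not the EuroWordNet import format: the class name, attribute names, and all sample values (including the ID and the Top-Concept label) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ILIRecord:
    """One Inter-Lingual-Index record, mirroring fields A.1 to A.6.

    Attribute names are invented for this sketch.
    """
    synset: List[str]   # A.1 set of synonymous words or phrases
    pos: str            # A.2 part-of-speech
    gloss: str          # A.5 gloss in English
    source_id: str      # A.6 unique ID linking the synset to its source
    top_concepts: List[str] = field(default_factory=list)  # A.3 (optional)
    domains: List[str] = field(default_factory=list)       # A.4 (optional)

    def has_top_concept(self) -> bool:
        # True when at least one Top-Concept classification is attached.
        return bool(self.top_concepts)


# Hypothetical record; none of these values come from the actual ILI.
example = ILIRecord(
    synset=["car", "automobile"],
    pos="noun",
    gloss="an invented gloss for illustration",
    source_id="hypothetical-0001",
    top_concepts=["ExampleTopConcept"],
)
```

Keeping the two optional fields as lists that default to empty reflects the "one or more ... (optional)" wording of A.3 and A.4.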



  LR(2) Language-Specific Components (Data of layer 2 - partly Foreground and partly Background)

    Wordnets produced in the first project (LE2-4003):

    F. Dutch wordnet
    G. English wordnet (additional relations which are missing in WordNet1.5)
    H. Italian wordnet
    I. Spanish wordnet

    After extension of the project (LE4-8328):

    J. German wordnet
    K. French wordnet
    L. Czech wordnet
    M. Estonian wordnet

    The specific wordnets are language-internal structures, minimally containing:

    Each wordnet will be distributed with LR1 and will include documentation on LR1 and the distributed wordnet. All the data will be distributed as text-files in the EuroWordNet import format and as Polaris database files (see below LR3). The EuroWordNet viewer (Periscope, see below LR3) can be used to access the database version. Polaris has to be licensed to modify and extend the database version.

    The wordnets are distributed without:

     

  LR(3) Software

    The multilingual EUROWORDNET Database (partly Foreground, partly Background) consists of three components:

    • The actual wordnets in Flaim database format: an indexing and compression format of Novell.
    • Polaris (Louw 1997): a wordnet editing tool for creating, editing and exporting wordnets.
    • Periscope (Cuypers and Adriaens 1997): a graphical database viewer for viewing and exporting wordnets.

    The Polaris tool is a re-implementation of the Novell ConceptNet toolkit (Díez-Orzas et al 1995) adapted to the EuroWordNet architecture. Polaris can import new wordnets or wordnet fragments from ASCII files with the correct import format and it creates an indexed EUROWORDNET Database. Furthermore, it allows a user to edit and add relations in the wordnets and to formulate queries. The Polaris toolkit makes it possible to visualise the semantic relations as a tree-structure that can directly be edited. These trees can be expanded and shrunk by clicking on word-meanings and by specifying so-called TABs indicating the kind and depth of relations that need to be shown. Expanded trees or sub-trees can be stored as a set of synsets, which can be manipulated, saved or loaded. Additionally, it is possible to access the ILI or the ontologies, and to switch between the wordnets and ontologies via the ILI. Finally, it contains an interface to project sets of synsets across wordnets.

    The Periscope program is a public viewer that can be used to look at wordnets created by the Polaris tool and to compare them in a graphical interface. Word meanings can be looked up and trees can be expanded. Individual meanings or complete branches can be projected on another wordnet or wordnet structures can be compared via the equivalence relations with the Inter-Lingual-Index. Selected trees can be exported to text files. The Periscope program cannot be used for importing or changing wordnets.

    N. The Polaris program is partly Background and partly Foreground. It is property of Lernout & Hauspie and can be licensed as a EuroWordNet result, either directly from Lernout & Hauspie or from ELRA.

    O. The Periscope viewer is property of Lernout & Hauspie and is Foreground.

     

  Prices

    The prices are based on the number of synsets in each wordnet and differ according to the kind of usage and ELRA membership:

    Price per 1K Synsets (KS) in EUROs

    VAR-C                                                250 EURO/1KS
    VAR-I (Internal use only)                            150 EURO/1KS
    VAR-E (Evaluation licence)                            20 EURO/1KS
    End-User (Academic institution - for research only)   10 EURO/1KS


    Prices per license

    Wordnet Ksynsets  VAR-C  VAR-I  VAR-E  End-User  Reduction  ELRA membership factor
                   1    250    150     20        10         0%                       2
                  10   2500   1500    200       100         0%                       2
                  20   5000   3000    400       200         0%                       2
                  30   7500   4500    600       300         0%                       2
                  40  10000   6000    800       400         0%                       2
                  50  12500   7500   1000       500         0%                       2
                  60  15000   9000   1200       600         5%                       2
                  70  17500  10500   1400       700         5%                       2
                  80  20000  12000   1600       800         5%                       2
                  90  22500  13500   1800       900         5%                       2
                 100  25000  15000   2000      1000        10%                       2
                 120  30000  18000   2400      1200        10%                       2
                 140  35000  21000   2800      1400        10%                       2
                 150  37500  22500   3000      1500        10%                       2
                 160  40000  24000   3200      1600        20%                       2
                 170  42500  25500   3400      1700        20%                       2
                 180  45000  27000   3600      1800        20%                       2
                 190  47500  28500   3800      1900        20%                       2
                 200  50000  30000   4000      2000        20%                       2

    Above 60Ksynsets a reduction of 5% is offered, above 100Ksynsets a reduction of 10% and above 160Ksynsets a reduction of 20%. If multiple wordnets are obtained, the total is cumulated and the reduction is based on the cumulative total. The percentage reduction is deducted from each wordnet. For example, if one obtains 3 wordnets of 10KS, 20KS and 40KS, the total amount is 70KS. The prices for an ELRA member are then as follows:

    Prices in EURO for ELRA members

              Without reduction                  With reduction of 5%
               10KS   20KS   40KS  Total 70KS     10KS   20KS   40KS  Total 70KS
    VAR-C      2500   5000  10000       17500     2250   4500   9000       15750
    VAR-I      1500   3000   6000       10500     1350   2700   5400        9450
    VAR-E       200    400    800        1400      180    360    720        1260
    End-User    100    200    400         700       90    180    360         630

    Since the total is between 60 and 100KS, there will be a 5% reduction. The reduction will be distributed over each wordnet. Non-ELRA members pay a double price.

    Below are two examples for a wordnet with 30KSynsets and 40KSynsets.

    Wordnet (30Ksynsets)                                 Price in EUROs for ELRA Member  Price in EUROs for non-Member
    VAR-C                                                7,500 EURO                      15,000 EURO
    VAR-I (Internal use only)                            4,500 EURO                      9,000 EURO
    VAR-E (Evaluation licence)                           600 EURO                        1,200 EURO
    End-User (Academic institution - for research only)  300 EURO                        600 EURO


    Wordnet (40Ksynsets)                                 Price in EUROs for ELRA Member  Price in EUROs for non-Member
    VAR-C                                                10,000 EURO                     20,000 EURO
    VAR-I (Internal use only)                            6,000 EURO                      12,000 EURO
    VAR-E (Evaluation licence)                           800 EURO                        1,600 EURO
    End-User (Academic institution - for research only)  400 EURO                        800 EURO
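
The pricing rules above (a rate per 1K synsets that depends on the licence type, a volume reduction based on the cumulative total, and a doubled price for non-members) can be sketched in Python as follows. This is an illustrative reading of the rules, not an official ELRA calculator. It assumes that sizes are given in whole Ksynsets, that the reduction tiers start at exactly 60, 100 and 160 Ksynsets as the licence table suggests, and that the non-member factor of 2 is applied to the reduced price. It applies the stated percentage literally and does not attempt to reproduce the figures of the 70KS worked table.

```python
from decimal import Decimal
from typing import Dict, List

# ELRA-member price per 1K synsets (KS) in EURO, as listed in the text.
RATE_PER_KS: Dict[str, int] = {
    "VAR-C": 250,
    "VAR-I": 150,
    "VAR-E": 20,
    "End-User": 10,
}


def reduction_percent(total_ks: int) -> int:
    """Volume reduction for a cumulative total given in Ksynsets.

    Inclusive thresholds (60, 100, 160) are an assumption based on the
    licence price table.
    """
    if total_ks >= 160:
        return 20
    if total_ks >= 100:
        return 10
    if total_ks >= 60:
        return 5
    return 0


def wordnet_prices(sizes_ks: List[int], licence: str,
                   member: bool = True) -> List[Decimal]:
    """Price in EURO for each wordnet of an order.

    The reduction is derived from the cumulative size and then deducted
    from every wordnet; non-members pay double (assumed to apply after
    the reduction).
    """
    pct = reduction_percent(sum(sizes_ks))
    factor = 1 if member else 2
    rate = RATE_PER_KS[licence]
    return [Decimal(ks * rate * factor * (100 - pct)) / 100 for ks in sizes_ks]
```

For a single 30KS wordnet, `wordnet_prices([30], "VAR-C")` gives 7500 EURO for a member and `wordnet_prices([30], "VAR-C", member=False)` gives 15000 EURO, matching the 30Ksynsets example. `Decimal` is used so that fractional results from the percentage reduction are not subject to binary floating-point rounding.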

     

  Technical support

    Technical support may be provided by members of the consortium. It will be arranged through bilateral agreements between the User and the consortium member responsible for the data acquired by the User. As an indication, the support contract will be on a yearly basis and will cost 10-20 KEURO/year.

    For more information about the EuroWordNet project: http://www.hum.uva.nl/~ewn


    Tools (Grammar Software)


    ALEP

    The CEC decided in 1991, within the Linguistic Research & Engineering (LRE) programme, to set in motion the development of ALEP, a generic formal and computational environment, which will be made widely available to European companies and research institutions involved in language engineering projects. With this initiative, the CEC wants to overcome the lack of a professionally designed, widely available, non-proprietary platform for linguistic engineering, thus reducing duplication of effort and speeding up the transition from research results to laboratory prototypes and from prototypes to marketable products.

    ALEP is a flexible, fully configurable platform, designed to facilitate the description of linguistic phenomena, the compilation of these descriptions into an executable form and the application of the resulting code in a number of processes.

    ALEP comes with a rule formalism that offers an expressive, yet concise and simple means to describe linguistic phenomena, a compiler and an engine, called the virtual machine, that uses the compiled linguistic rules in analysis, transfer or synthesis of texts.


    LS-GRAM

    Please download LS-GRAM gzipped tar-files from THIS SITE

    The Large-Scale Grammars for EU Languages project (LRE-1 61029) is making its resources - language modules for Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish - available via ELRA. All modules have a text handling component, a two-level morphology, a word structure component (inflection only), and a grammar. They are based on the same principle and semantic descriptions, and have a common format. The linguistic basis for the grammatical part of the modules, which were developed via corpus investigation, is HPSG, with some revisions. Some of the grammars come close to the corpus in coverage. Efficiency played a decisive role, with some of the modules being able to analyse paragraphs of several sentences comprising up to fifty words in less than a minute on an Ultra-Sparc. Last but not least, a large body of test material and very detailed documentation is available for all grammars.



    URL: http://www.icp.grenet.fr/ELRA/cata/text_det.html- Copyright © 1996-99 ELRA - All rights reserved.
    Last update 9 December, 1999. Comments are welcome: zeiliger@icp.inpg.fr