
WRITTEN RESOURCES



The descriptions of language resources (LRs) given here are brief summaries, kept short for readability. Further information is available by following the links.

The ELRA Catalogue

R : For academic use
RC : For research use by a commercial organisation
C : For commercial use
If none of these abbreviations (R, C or RC) appears, there are no restrictions on the type of use.
Discounts on non-member prices are offered to members of organisations with which ELRA has entered into special agreements (e.g. ELSNET).
*** : At cost
ELRA : Please contact the ELRA office.
--- : Price under discussion
WWW : Please download this free resource from the Web (follow the links)
The prices below are given in euros (1 EUR ≈ 1.2 USD). Some prices that were negotiated in local currency have been readjusted according to the exchange rate.
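Many prices in this catalogue use a decimal comma (e.g. 1194,56) and some are quoted per entry. The following is a minimal sketch, not an official ELRA tool, of how such prices could be parsed and converted using the approximate rate quoted above. The function names are hypothetical, and the sketch assumes that price strings never contain a thousands separator (entry counts such as 10,642 do, so they are passed in as plain integers).

```python
from decimal import Decimal

# Approximate rate quoted in this catalogue (1 EUR ~= 1.2 USD); not a live rate.
EUR_TO_USD = Decimal("1.2")


def parse_price(text: str) -> Decimal:
    """Parse a catalogue price such as '1194,56', '0.06' or '12000' (in EUR).

    Assumption: a comma in a price is always a decimal separator.
    """
    return Decimal(text.strip().replace(",", "."))


def eur_to_usd(amount_eur: Decimal) -> Decimal:
    """Convert euros to US dollars at the quoted approximate rate, rounded to cents."""
    return (amount_eur * EUR_TO_USD).quantize(Decimal("0.01"))


def per_entry_total(price_per_entry: str, entries: int) -> Decimal:
    """Total price in EUR for a resource priced per entry."""
    return (parse_price(price_per_entry) * entries).quantize(Decimal("0.01"))


if __name__ == "__main__":
    one_year = parse_price("238,91")
    print(one_year, "EUR is roughly", eur_to_usd(one_year), "USD")
    print("10642 entries at 0,12 EUR/entry:", per_entry_total("0,12", 10642), "EUR")
```

Decimal is used instead of float so that amounts round predictably to two places.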


CORPORA

Ref. ELRA | Name | Type & No of entries | Language | M (members) | Non-M (non-members) | Date
W0001 | BRITISH NATIONAL CORPUS - BNC (OTA) | 100 million words | English | M: R 175 | Non-M: R 254 | 01/09/96
W0002 | CONTEMPORARY PORTUGUESE CORPUS | 1.5 million words | Portuguese | M: --- | Non-M: ---
W0003 | CRATER Multi-lingual aligned corpus | 1 million tokens | English, French, Spanish | M: 20 | Non-M: 100 | 23/01/97
W0004 | ECI/MCI European Corpus Initiative Multilingual Corpus | 98 million words | Major European languages + Turkish, Japanese, Russian, Chinese, Malay, etc. | M: R 45 | Non-M: R 45 | 01/09/96
W0005 | ECI-ELSNET Italian & German tagged sub-corpus | Economy 17,000 words; Politics 14,000 words; Culture 18,000 words; Sports 9,000 words; Local Events 8,500 words | Italian & German | M: R 20 | Non-M: R 45 | 01/09/96
W0006 | MLCC - Multi-lingual corpus | Het Financieele Dagblad (8.5 million words); The Financial Times (30 million words); Le Monde (10 million words); Handelsblatt (33 million words); Il sole 24 Ore (1.88 million words); Expansion (10 million words) | Dutch, English, French, German, Italian, Spanish | M: R 360, C 1500 | Non-M: R 750, C 3200 | 01/09/96
W0007 | MLCC - Office of Official Publications of the European Communities (Parliamentary Debates + OJ) | Parallel corpus of translated documents in the nine European official languages, divided into 2 sub-corpora: written questions and parliamentary debates | Multilingual | M: R 120, C 480 | Non-M: R 200, C 800 | 01/09/96
W0008 | MTP annotated German Corpus (500,000 words from FAZ / Die Zeit) | 500,000 words | German | M: untagged 2000, tagged 8000 | Non-M: untagged 3500, tagged 12000 | 01/09/96
W0009 | MULTEXT / MULTEXT East (Data/Tools) | Written Lexicon and Corpora | Multilingual | M: *** | Non-M: ***
W0010 | Swedish Corpus PRESS 65 (Corpus of over 1m Words) | 1 million words | Swedish | M: R 12000 | Non-M: R 20000 | 23/01/97
W0011 | Tagged text in French (MEMODATA) | Typographic tagging, 170 books | French | M: R 1723, C 2154 | Non-M: R 2154, C 2692 | 23/01/97
W0012 | Tagged text in French (MEMODATA) | Morphologic tagging, 170 books | French | M: R 2461, C 3077 | Non-M: R 3077, C 3846 | 23/01/97
W0013 | TSNLP (Test Suites for NLP Testing) | 4,000 test items | Multilingual | M: *** | Non-M: *** | 01/09/96
W0014 | Monolingual Greek corpus | 1 million words | Greek | M: R 360 | Non-M: R 600 | 17/02/97
W0015 | Text corpus of "Le Monde" | Corpus from the "Le Monde" newspaper. From 1 to 5 years of data are available. Each tape/year contains some 10 Mbytes of data per month (circa 120 Mbytes per year). | French | 15/09/97
  1 year | M: R 238,91 | Non-M: R 310,59
  2 years | M: R 477,83 | Non-M: R 621,17
  3 years | M: R 716,74 | Non-M: R 931,76
  4 years | M: R 955,65 | Non-M: R 1242,35
  5 years | M: R 1194,56 | Non-M: R 1552,93
W0016 | Karl May Korpus (KMK) | Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each). | German | M: R 400, C 2500 | Non-M: R 800, C 3500 | 28/11/97
W0017 | MULTEXT JOC Corpus | This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains ca. 5 million words in English, French, German, Italian and Spanish (ca. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level. | English, French, German, Italian, Spanish | M: R 45, C 2000 | Non-M: R 100, C 5000 | 23/11/98
W0018 | ARCADE/ROMANSEVAL corpus | The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission). The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3700 contexts altogether. It comprises semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian, and word-level alignment of all the occurrences of the test words between French and English. | English, French, Italian | M: R 45, C 2000 | Non-M: R 100, C 5000 | 23/11/98
W0019 | Dutch PAROLE Distributable Corpus | This Dutch corpus is a 3-million-word selection built according to the specifications of the PAROLE project. Over 250,000 words of corpus texts have been PoS-tagged automatically. A total of 59,798 running words have been manually corrected and checked. | Dutch | M: R 270, RC 800, C 1600 | Non-M: R 300, RC 1300, C 2500 | 12/07/99
* Special price for academic users from the Netherlands and Belgium: 150 EURO (the data will be supplied directly by the Instituut voor Nederlandse Lexicologie, http://www.inl.nl)


MONOLINGUAL LEXICONS

Ref. ELRA | Name | Type & No of entries | Language | M (members) | Non-M (non-members) | Date
L0001 | DICO-MORPH_lemme. MEMODATA | Morpho-syntactic information, 400,000 entries | French | M: R 12090, C 15112 | Non-M: R 15112, C 18890 | 23/01/97
L0002 | DICO-MORPH_Collocation. MEMODATA | Collocation lexicon, 35,000 entries | French | M: R 6992, C 8740 | Non-M: R 8740, C 10925 | 23/01/97
L0003 | DICO-SYNT. MEMODATA | 90,000 inflexional forms | French | M: R 8861, C 11077 | Non-M: R 11077, C 13846 | 23/01/97
L0004 | Dutch Lexicon. (LanTmark) | General vocabulary, 64,000 entries | Dutch | M: R 9360, C 32400 | Non-M: R 15600, C 54000 | 23/01/97
L0005 | French Lexicon (LanTmark) | General vocabulary, 50,000 entries | French | M: R 7440, C 25440 | Non-M: R 12400, C 42400 | 23/01/97
L0006 | ILC Italian Morphological lexicon | Lexicon, about 60,000 lemmas/lexical entries | Italian | M: R 4000, C 12000 | Non-M: R 8000, C 20000 | 15/09/97
L0007 | LexIn 1:e Swedish Lexicon | Lexicon, 17,000 headwords and 21,000 senses | Swedish | M: R 1200, C 12000 | Non-M: R 2000, C 20000 | 23/01/97
L0008 | Monolingual Danish lexicon. (Institut for Erhvervsinformatik) | Lexicon, 25,000 entries | Danish | M: R 1,2/entry, C 2,4/entry | Non-M: R 2/entry, C 4/entry | 13/05/97
L0009 | Monolingual Portuguese lexicon. (Centro de Linguistica da Universidade de Lisboa) | Lexicon, 60,000 entries | Portuguese | M: --- | Non-M: ---
L0010 | MULTEXT lexicons | This CD-ROM contains a set of lexicons developed in the MULTEXT project financed by the European Commission (LRE 62-050). The set contains the following languages: English (66,214 word forms), French (306,795 word forms), German (233,861 word forms), Italian (145,530 word forms) and Spanish (510,710 word forms). | English, French, German, Italian, Spanish | M: R 45, C 2000 | Non-M: R 100, C 5000 | 23/11/98
L0011 | Portuguese morphological lexicon PALAVROSO (INESC) | Lexicon, 60,000 entries | Portuguese | M: --- | Non-M: ---
L0012 | Spanish gilcUB-M-Dictionary | General vocabulary, 60,000 entries | Spanish | M: R 6500, C 8250 | Non-M: R 8225, C 10300 | 23/01/97
L0013 | THAMUS. Generic Italian dictionary (Consorzio per la linguistica computazionale) | Items i) to iv) below | Italian | 13/05/97
  i) Generic (canonical forms) 87,000 | M: R 19140, C 47850 | Non-M: R 20880, C 52200
  ii) Generic (inflected forms) 612,000 | M: R 135080, C 336600 | Non-M: R 147360, C 367200
  iii) Technical (canonical forms) 48,000 | M: R 10560, C 26400 | Non-M: R 11520, C 28800
  iv) Technical (inflected forms) 96,000 | M: R 21120, C 52800 | Non-M: R 23040, C 57600
L0014 | Adverbial Equivalence Dictionary (CORA) | Generic Dictionary, 1,200 entries | French | M: C 243,92 | Non-M: C 304,90 | 23/01/97
L0015 | Nominalisation Dictionary (CORA) | Generic Dictionary, 2,300 entries | French | M: C 365,88 | Non-M: C 457,35 | 23/01/97
L0016 | Tri-quadri-pentagrams Dictionary (CORA) | Generic Dictionary, 5,487 entries | French | M: C 365,88 | Non-M: C 457,35 | 23/01/97
L0017 | N de N Dictionary (CORA) | Generic Dictionary, 10,000 entries | French | M: C 1219,59 | Non-M: C 1524,49 | 23/01/97
L0018 | German lexicon (CORA) | Lexicon, 466,300 | German | M: C 4878,37 | Non-M: C 6097,96 | 23/01/97
L0019 | English lexicon (CORA) | Lexicon, 160,000 entries | English | M: C 4878,37 | Non-M: C 6097,96 | 23/01/97
L0020 | DST Dictionary (CORA) | Generic Dictionary, 550,000 inflected forms | French | 23/01/97
  1) String dictionary | M: C 4878,37 | Non-M: C 6097,96
  2) Optional extra sets:
    i) Part of speech (optional) | M: C 2439,18 | Non-M: C 3048,98
    ii) Gender, number, conjugation (optional) | M: C 1219,59 | Non-M: C 1524,49
    iii) Lemma (optional) | M: C 1219,59 | Non-M: C 1524,49
    iv) Semantical information (optional) | M: C 1219,59 | Non-M: C 1524,49
    v) Syntactical information (optional) | M: C 609,80 | Non-M: C 762,25
    vi) Prep/adv. phrases (optional) | M: C 609,80 | Non-M: C 762,25
    vii) Compound nouns (optional) | M: C 1219,59 | Non-M: C 1524,49
  3) The whole dictionary | M: C 12195,92 | Non-M: C 15244,90
L0021 | Dictionary of French verbs (CORA - Jean Dubois) | >25,610 verbs | French | M: C 7317,55 | Non-M: C 9146,94 | 21/05/97
L0022 | Dictionary of words (CORA - Jean Dubois) | 126,844 words | French | M: C 4878,35 | Non-M: C 6097,96 | 21/05/97
L0023 | Dictionary of affixes (CORA) | 4,286 suffixes and prefixes | French | M: C 609,80 | Non-M: C 762,25 | 21/05/97
L0024 | Dictionary of verb phrases (CORA) | 3,480 entries based on the model of the dictionary of French verbs (ELRA-L0021) | French | M: C 487,84 | Non-M: C 609,80 | 21/05/97
L0025 | Dictionary of invariable forms and phrases (CORA) | 4,783 entries based on the model of the dictionary of words (ELRA-L0022) | French | M: C 243,92 | Non-M: C 304,90 | 21/05/97
L0026 | Dictionary of exclamatory stereotyped phrases (CORA) | 1,901 entries based on the model of the dictionary of invariable forms and phrases (ELRA-L0025) | French | M: C 243,92 | Non-M: C 304,90 | 21/05/97
L0027 | Dictionary of French local authorities (CORA) | 38,965 entries in lower case with accents, checked against the Michelin guide, without localities | French | M: C 243,92 | Non-M: C 304,90 | 21/05/97
L0028 | Dictionary of noun phrases and plural-only words (CORA) | 2,138 compound names and 1,397 entries of plural-only words | French | M: C 243,92 | Non-M: C 304,90 | 21/05/97
L0029 | CELEX - Dutch lexical database | Dutch lexical database containing lemmas (124,136 entries), wordforms (381,292 entries), abbreviations (1,622 entries) and syllables (31,358 entries). The database is divided into different subsets. | Dutch | 15/09/97
  i) Complete set of data | M: C 56087,32 | Non-M: C 93478,72
  ii) Subset Orthography | M: C 5989,90 | Non-M: C 9983,16
  iii) Subset Phonology | M: C 12252,07 | Non-M: C 20420,11
  iv) Subset Morphology Infl. | M: C 5989,90 | Non-M: C 9983,16
  v) Subset Morphology Der. | M: C 13613,41 | Non-M: C 22689,01
  vi) Subset Syntax | M: C 5989,90 | Non-M: C 9983,16
  vii) Subset Frequency | M: C 12252,07 | Non-M: C 20420,11
  Research use (R) | M: ELRA | Non-M: ELRA
L0030 | Bulgarian Morphological Dictionary | 67,500 entries divided into 242 inflectional types (including proper nouns), morphosyntactic information for each entry, and a morphological engine (MS DOS and WINDOWS 95/NT) for morphological analysis and generation | Bulgarian | M: R 45, C 6000 | Non-M: R 100, C 12000 | 16/04/98
L0031 | Dutch PAROLE lexicon | The entry list of the lexicon consists of about 20,200 entries distributed over 13 parts of speech (POS). The entries have been described along the dimensions of morphosyntax and syntax, according to the specifications of the PAROLE project. The lexicon is set up as an SGML file. | Dutch | M: R 300, RC 1600, C 8000 | Non-M: R 400, RC 3000, C 10000 | 12/07/99
* Special price for academic users from the Netherlands and Belgium: 200 EURO (the data will be supplied directly by the Instituut voor Nederlandse Lexicologie, http://www.inl.nl)


MULTILINGUAL LEXICONS

Ref. ELRA | Name | Type & No of entries | Language | M (members) | Non-M (non-members) | Date
M0001 | Basic multilingual lexicon (MEMODATA) | Lexicon, 30 000 each language | French, English, Italian, German, Spanish | M: R 8861, C 11077 | Non-M: R 11077, C 13846 | 23/01/97
M0002 | Bilingual Spanish-English and English-Spanish Lexicons (INCYTA) | Technical domains (entry counts below) | Spanish-English, English-Spanish | M: R 0,12/entry, C 0,96/entry | Non-M: R 0,2/entry, C 1,6/entry | 23/01/97
  Economics, law & business management 10,642
  Leisure, Tourism, Sports, Food 3,144
  Geography, History, Arts 4,116
  Sociology, Psychology, Pedagogy 4,089
  Natural and medical sciences 10,535
  Exact sciences, Phys., Chemistry, Geology 10,616
  Data Processing, Electronics, Telecoms 4,904
  Technology, Engineering & Construction 11,953
  Economics 1,320
  Data Processing 3,565
  Telecommunications 3,733
  Electrical Engineering 1,760
  Plastics and Chemistry 9,022
  Aeronaut., Navigat., Mechanic. Engin. 23,170
M0003 | Danish-German dictionary (Institut for Erhvervsinformatik) | General vocabulary, 10,000 | Danish-German | M: R 1,2/entry, C 2,4/entry | Non-M: R 2/entry, C 4/entry | 23/01/97
M0004 | Dutch-French Lexicon (LanTmark) | Vocabularies for transfer | Dutch-French | 23/01/97
  i) General Vocabulary 26,000 | M: R 7800, C 17760 | Non-M: R 12800, C 29600
  ii) Administrative 32,000 | M: R 8160, C 19920 | Non-M: R 13600, C 23200
  iii) Data processing 10,000 | M: R 2400, C 6000 | Non-M: R 4000, C 10000
M0005 | English-French Lexicon (LanTmark) | General vocabulary for transfer, 27,000 entries | English-French | M: R 8160, C 18720 | Non-M: R 13600, C 31200 | 23/01/97
M0006 | French-Dutch Lexicon (LanTmark) | Vocabularies for transfer | French-Dutch | 23/01/97
  i) General Vocabulary 34,000 | M: R 8880, C 21480 | Non-M: R 14800, C 35800
  ii) Administrative 18,000 | M: R 4800, C 11520 | Non-M: R 8000, C 19200
  iii) Data processing 10,000 | M: R 2400, C 6000 | Non-M: R 4000, C 10000
M0007 | French-English Lexicon (LanTmark) | General vocabulary for transfer, 34,000 entries | French-English | M: R 10320, C 23640 | Non-M: R 17200, C 39400 | 23/01/97
M0008 | German-Danish dictionaries (Institut for Erhvervsinformatik) | Technical 6,800; General 15,500 | German-Danish | M: R 1,2/entry, C 2,4/entry | Non-M: R 2/entry, C 4/entry | 23/01/97
M0009 | THAMUS Bilingual dictionaries (Consorzio per la linguistica computazionale) | Technical domains: Computer Science | German-Italian or Italian-German | 13/05/97
  i) canonical forms 17,800 | M: R 3916, C 19580 | Non-M: R 4272, C 21360
  ii) inflected forms 35,000 | M: R 7700, C 38500 | Non-M: R 8400, C 42000
M0010 | THAMUS Bilingual dictionaries (Consorzio per la linguistica computazionale) | Technical domains | English-Italian or Italian-English | 13/05/97
  i) Aeronautics 6,300 | M: R 1386, C 6930 | Non-M: R 1512, C 7560
  ii) Law (canonical forms) 8,900 | M: R 1958, C 9790 | Non-M: R 2136, C 10680
  iii) Law (inflected forms) 18,000 | M: R 3960, C 19800 | Non-M: R 4320, C 21600
  iv) Computer Science (canonical forms) 15,700 | M: R 3454, C 17270 | Non-M: R 3768, C 18840
  v) Computer Science (inflected forms) 32,000 | M: R 7040, C 35200 | Non-M: R 7680, C 38400
  vi) Medicine (canonical forms) 20,000 | M: R 4400, C 22000 | Non-M: R 4800, C 24000
  vii) Economics (canonical forms) 50,000 | M: R 11000, C 55000 | Non-M: R 12000, C 60000
  viii) Economics (inflected forms) 86,000 | M: R 18920, C 94600 | Non-M: R 20640, C 103200
  ix) Engineering (canonical forms) 13,000 | M: R 2860, C 14300 | Non-M: R 3120, C 15600
  x) Engineering (inflected forms) 27,000 | M: R 5940, C 29700 | Non-M: R 6480, C 32400
M0013 | Bilingual Collocational Dictionary | The bilingual English-German collocational dictionary consists of around 40,000 English headwords, including concepts expressed with more than one word and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword. It provides the German equivalents for the headwords as well as their English synonyms. | English, German | M: 210 | Non-M: 300 | 28/11/97
M0014 | Bilingual Dictionaries | Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features. See description. | 16/04/98
  GROUP 1: English <=> Spanish, French, German, Italian, Brazilian Portuguese, Portuguese, Dutch | M: R 0.06/ent., C 0.25/ent. | Non-M: R 0.12/ent., C 0.5/ent.
  GROUP 2: English <=> Danish, Swedish, Finnish, Icelandic | M: R 0.03/ent., C 0.18/ent. | Non-M: R 0.06/ent., C 0.36/ent.
  GROUP 3: English <=> Russian, Russian Business, Russian Aerospace, Russian Automotive, Russian Minerals & Mining, Polish, Hungarian, Czech, Romanian Starter | M: R 0.04/ent., C 0.2/ent. | Non-M: R 0.08/ent., C 0.4/ent.
  GROUP 4: English <=> Croatian, Bosnian, Serbian (Latin or Cyrillic) | M: R 0.04/ent., C 0.2/ent. | Non-M: R 0.08/ent., C 0.4/ent.
  GROUP 5: English <=> Japanese | M: R 0.5/ent., C 1/ent. | Non-M: R 1/ent., C 2/ent.
  GROUP 6: English <=> Greek | M: R 0.12/ent., C 0.54/ent. | Non-M: R 0.24/ent., C 1/ent.
M0015 | English EuroWordNet | English | M: More info | Non-M: More info | 30/08/99
Each EuroWordNet database is composed of the following:
- The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
- A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
- A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
- A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
- WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
M0016 | Dutch EuroWordNet | See ELRA-M0015 | Dutch-English | M: More info | Non-M: More info | 30/08/99
M0017 | Spanish EuroWordNet | See ELRA-M0015 | Spanish-English | M: More info | Non-M: More info | 30/08/99
M0018 | Italian EuroWordNet | See ELRA-M0015 | Italian-English | M: More info | Non-M: More info | 15/10/99
M0019 | German EuroWordNet | See ELRA-M0015 | German-English | M: More info | Non-M: More info | 15/10/99
M0020 | French EuroWordNet | See ELRA-M0015 | French-English | M: More info | Non-M: More info | 15/10/99
M0021 | Czech EuroWordNet | See ELRA-M0015 | Czech-English | M: More info | Non-M: More info | 15/10/99
M0022 | Estonian EuroWordNet | See ELRA-M0015 | Estonian-English | M: More info | Non-M: More info | 15/10/99


URL: http://www.icp.grenet.fr/ELRA/cata/tabtext.html - Copyright © 1996-99 ELRA - All rights reserved.
Last update 12 November, 1999. Comments are welcome: zeiliger@icp.inpg.fr