WRITTEN RESOURCES SPECIFICATIONS
The corpus was produced by a consortium of leading dictionary publishers (OUP, Longman, Chambers-Harrap) and academic research centres (Oxford University Computing Services, Unit for Computer Research in the English Language at Lancaster University, British Library Research and Development). It provides a unique and authoritative view of the state of the English language today, with carefully balanced representation of as many different varieties of English as possible. It can be used to exercise NLP systems of all kinds, as a fertile source of real-life examples for language learners, or simply to explore the way the language is currently used.
The first release of the BNC comprises (packaged as 3 CD-ROMs):
The Corpus Resources and Terminology Extraction project (MLAP-93 20) has extended the bilingual annotated English-French International Telecommunications Union corpus to include Spanish, and has also debugged the existing corpus. In addition, a Spanish tagger has been developed, along with a set of retrieval tools for browsing the trilingual aligned corpus, and examining the proposed term or word alignments. The offer consists of the 3 x 1,000,000 token corpora of English, French and Spanish, morphosyntactic annotations (human-edited), lemmatisation and term extraction routines for English, French and Spanish.
Just a sampling of the contents of the CD-ROM:
The tagging, done automatically, has been manually checked. The CD-ROM contains the text in SGML format, the DBT software, which supports browsing and other operations on the annotated text, and the EAGLES guidelines for morphosyntactic annotation.
The first set is referred to as the Polylingual Document Collection (ELRA-W0006), a collection of articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:
The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.
The corpus contains articles from the British financial newspaper The Financial Times editions from the year 1993. The corpus contains around 30 million words.
A corpus of articles from the French newspaper Le Monde, consisting of two years' worth (1992-1993) of articles on financial subjects, approximately 10 million words.
This subcorpus consists of articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.
The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML-markup was done by the University of Edinburgh.
This subcorpus contains articles from the Spanish financial newspaper Expansion editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.
The second set is a Multilingual Parallel Corpus (ELRA-W0007) consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:
Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).
Samples: Danish, German, English, Spanish, Greek, Italian, Dutch, Portuguese.
This parallel corpus consists of the records of parliamentary sittings published as an annex to the Official Journal of the European Community, Debates of the European Parliament. The Parliamentary Debates are a record of what was said by members of the meeting as well as written input provided to the meeting. The original data from which the translations are produced consist of a transcript of the sittings, each member speaking in the language of his choice. The final version consists of nine parallel versions of the material. The texts delivered comprise the Debates of Parliament from January 1992 to July 1994. This sub-corpus contains some 5 to 8 million words per language.
Samples: Danish, German, English, Spanish, Greek, Italian, French, Dutch, Portuguese.
Monolingual Greek corpus of 1 million words. The corpus consists of articles written in 1996 from the Greek daily newspaper ELEFTHEROTIPIA. Each file contains annotated text with SGML mark-up accompanied by a text header.
The text was segmented into sentence units and word tokens, and tagged for morphosyntactic POS markers. Two tagsets, which mainly differed in the granularity of the noun and verb tags, and which comprised 137 and 52 tags respectively, were used. Users may obtain annotated versions using either set, each of which comes with documentation and an instruction manual for tag application. A suite of tools, including the MTP taggers and the Xlex workbench for text handling, textual analysis and lexicography, is also available.
The corpus is available in an ASCII text format. Each month consists of some 10 MB of data (circa 120 MB per year).
Data ranging from 1987 until present date are available through ELRA (each buyer may purchase up to 5 years of data).
Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May from 1993 to 1997 and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each).
Each word form is tagged with word class (1 out of 43 classes) and appropriate lemma.
File format: Text
Standard in use: SGML
Character set: 8-bit ASCII
This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains approx. 5 million words in English, French, German, Italian and Spanish (approx. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.
The JOC corpus is delivered in Corpus Encoding Standard conformant format at each level of treatment :
Additional information: http://www.lpl.univ-aix.fr/projects/multext
The ARCADE/ROMANSEVAL corpus was used as a reference corpus in two international competitions:
The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Question and Answers from the Official Journal of the European Commission).
The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3,700 contexts altogether, and comprises:
Additional information:
http://www.lpl.univ-aix.fr/projects/arcade
http://www.lpl.univ-aix.fr/projects/romanseval
The Dutch PAROLE Distributable Corpus is a 3-million-word selection from the 20-million-word Dutch PAROLE Reference Corpus.
The annotation and checking of the Dutch corpus were carried out according to the common core PAROLE tagset. The Dutch data were also checked for type.
The Dutch PAROLE Distributable Corpus contains the following texts:
MEDIUM | SOURCE | TIMESPAN | TOTAL NUMBER
BOOKS | Van Sterkenburg: | |
NEWSPAPERS | Short Newspaper texts: | |
PERIODICAL | Short texts from | |
MISCELLANEOUS | Texts to be read out in | |
TOTAL | | | 3,018,231
Over 250,000 words of corpus texts have been PoS-tagged automatically. A total of 59,798 running words has been manually corrected and checked at least two times with respect to maximal granularity, according to a lexicographer's manual. The extra 9,000 words over the required 50,000 words compensate for the occurrence of ca. 5,300 "keywords" in the original texts. The fully corrected material has been subjected to an automated post-control operation, checking the pertinence relations between the various feature values, and instantiating default values in case a mismatch (indicating a correction error) was found. Ca. 200,000 words have been checked once for PoS and type. In addition to the required PoS, type was checked for reasons of quality. This material has been subjected to an automated correction procedure addressing the feature slots (positions) beyond the first two for PoS and type so as to solve discrepancies between the manually corrected PoS and type, and the possibly erroneous, automatically assigned values of the remaining slots.
More info on the Parole project.
This is an add-on to the French lexicon for morphological work (referred to here as DICO-MORPH_Lemme, MEMODATA).
This resource gives the morphosyntactic information for DICO-MORPH_Lemme (proper noun, transitive verb, etc.). There are around 800 categories of verbs. The lexical categories are: nouns (25,000), verbs (8,000, which generate 25,000 verb/models), adjectives (1,000), adverbs (1,500).
The Dutch LanTmark lexicon is divided into the following categories: nouns (50,000), verbs (7,000), adjectives (6,000), adverbs (1,000).
Each entry contains morphological information (inflections, comparative and superlative markers), syntactic information (such as positional features, gender, complement markers and verb arguments), and semantic information (lexical semantics for nouns, adverbs and adjectives).
The French LanTmark lexicon is divided into the following categories: nouns (36,000), verbs (6,000), adjectives (7,000), adverbs (1,000).
Each entry contains morphological information (inflections, comparative and superlative markers), syntactic information (such as positional features, gender, complement markers and verb arguments), and semantic information (lexical semantics for nouns, adverbs and adjectives).
Entries: 25000
Format: ASCII
This dictionary was developed for machine translation.
Each lexeme contains the word class, inflection, semantic features, syntactic frames (for verbs), and complement (for nouns and adjectives).
This CD-ROM contains a set of lexicons developed in the MULTEXT project financed by the European Commission (LRE 62-050). The set contains the following languages: English, French, German, Italian and Spanish.
English: 66,214 word forms
The MULTEXT lexicons are three-column, tab-separated tables: the first column contains the word form, the second the lemma, and the third the morphosyntactic information associated with that form. This information conforms to the MULTEXT/EAGLES specifications.
Additional information: http://www.lpl.univ-aix.fr/projects/multext
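The three-column layout described above can be read with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the sample rows and tag strings are invented placeholders, not actual MULTEXT data or MULTEXT/EAGLES codes.

```python
import io
from collections import defaultdict


def read_lexicon(stream):
    """Yield (word_form, lemma, msd) triples from a tab-separated lexicon."""
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        yield tuple(fields)


def index_by_form(entries):
    """Group the (lemma, msd) analyses under each word form."""
    index = defaultdict(list)
    for form, lemma, msd in entries:
        index[form].append((lemma, msd))
    return index


# Invented sample rows; the third column holds placeholder tags only.
SAMPLE = "walks\twalk\tVERB-TAG\nwalks\twalk\tNOUN-TAG\nmice\tmouse\tNOUN-TAG\n"

lexicon = index_by_form(read_lexicon(io.StringIO(SAMPLE)))
print(lexicon["walks"])
```

Indexing by word form keeps every analysis of an ambiguous form together, which is what a tagger or lemmatiser would typically look up.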
Monolingual Portuguese lexicon with a rule-based morphological analysis which also handles enclitics, compounds, diminutives and augmentatives.
PALAVROSO is a European Portuguese lexicon consisting of a set of about 60,000 lexical entries (lemmas) and a rule-based engine for morphological analysis that recognises more than 1,300,000 word forms. The rule set also allows enclitics, compound words, diminutives and augmentatives to be handled correctly. The information encoded is compatible with the EAGLES recommendations for lexicon encoding at the morphosyntactic level.
The Spanish gilcUB-M-Dictionary is a full-form lexicon derived from 60,000 lemmas of general vocabulary (9,700 verbs, 35,500 nouns, 14,300 adjectives and 120 adverbs). Adverbs that can be derived from adjectival forms are also included as full forms and amount to about 10,000 forms. Each form is encoded with its associated lemma and with morphosyntactic information compatible with the EAGLES recommendations for morphosyntactic encoding.
A generic monolingual Italian dictionary with morphological coding from which all full forms can be generated by means of a software engine written in C. Multi-word terms contain morphological coding for the head word.
Codes enable automatic production of conjugation forms, derived nouns and adjectives and, if necessary, the production of potential forms.
Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4m Dutch Instituut voor Nederlandse Lexicologie text corpora.
To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files.
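As an illustration of linking information from different files through the identity numbers, the following sketch joins two small in-memory tables. It rests on assumptions: the field separator (a backslash here), the column layout, the file contents and the function names are all invented for the example and should be checked against the database documentation.

```python
import io


def load_table(stream, sep="\\"):
    """Map the identity number in the first field to the remaining fields."""
    table = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, *fields = line.split(sep)
        table[int(ident)] = fields
    return table


def link(left, right):
    """Join two tables on the identity numbers present in both."""
    return {i: left[i] + right[i] for i in left.keys() & right.keys()}


# Invented stand-ins for two plain ASCII files sharing identity numbers.
ORTHOGRAPHY = io.StringIO("1\\goed\n2\\huis\n")
WORD_CLASS = io.StringIO("1\\A\n2\\N\n3\\V\n")

linked = link(load_table(ORTHOGRAPHY), load_table(WORD_CLASS))
print(linked)
```

The join keeps only identity numbers found in both tables, so an entry missing from one file is dropped rather than padded with empty fields.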
This database can be divided into different subsets:
This dictionary contains 67500 entries divided into 242 inflectional types (including proper nouns), morphosyntactic information for each entry, and a morphological engine (MS DOS and WINDOWS 95/NT) for morphological analysis and generation. The data may be used for morphological analysis and synthesis.
Structure of entries: Local linguistic variant
The entry list of the lexicon consists of about 20,200 entries distributed over 13 parts of speech (POS). The entries have been described along the dimensions of morphosyntax and syntax. Morphosyntactic information consists of various lexical properties, like gender, number, case, person, inflection, etc. Syntactic descriptions consist of typical complementation patterns associated with the various lemmata.
The composition of the entry list of the lexicon is based on 3 corpora from the Instituut voor Nederlandse Lexicologie (INL) and 2 lexica. The corpora contain a total of about 54 million words and have been automatically annotated for part-of-speech and lemma. The lexica contain morphosyntactic information of various kinds. For verbs, nouns, adjectives and adverbs, lemmata that were covered by at least 2 corpora and the 2 lexica were selected on the basis of cumulative frequency, coverage (distribution over sources) and inflected forms. For the smaller parts of speech, these selection requirements appeared to be too strict. Entry selection for these parts of speech was based on ranked frequency.
The entries, uniquely defined by the combination of part of speech (e.g. noun) and subtype (e.g. common vs. proper noun), are provided with morphosyntactic information according to the Dutch set of PAROLE categories and features, and, where available, with syntactic information. Morphosyntactic information is automatically extracted from the INL lexica. Syntactic data have been collected manually, by inspection of corpus data and - where necessary - consultation of reference works. The corpus consulted consists of the newspaper component and the varied component of the 38 Million Words Corpus 1996.
Word forms in the Dutch PAROLE lexicon are not inflected according to general paradigms, but are related to their lemma by a set of string procedures. These procedures are not unique. They can be shared by many other word forms. An example is suffixation with -e for adjectives, which produces "goede"/good from "goed". Inflected forms can be derived directly by applying the string procedures to the lemma they are connected with.
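The idea of shared string procedures can be sketched as follows. This is not the encoding used in the lexicon itself: the class, the procedure name "suffix-e" and the small lemma table are hypothetical, and only the "goed"/"goede" pair comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringProcedure:
    """A reusable rewrite: strip an ending from the lemma, then add one."""
    strip: str = ""
    add: str = ""

    def apply(self, lemma):
        if self.strip and not lemma.endswith(self.strip):
            raise ValueError(f"{lemma!r} does not end in {self.strip!r}")
        stem = lemma[: len(lemma) - len(self.strip)]
        return stem + self.add


# One procedure object, shared by every lemma that points to it.
PROCEDURES = {"suffix-e": StringProcedure(add="e")}

# Hypothetical lemma table: each lemma lists the procedures it is linked to.
LEMMA_PROCEDURES = {"goed": ["suffix-e"], "klein": ["suffix-e"]}


def inflected_forms(lemma):
    """Derive word forms by applying the lemma's string procedures."""
    return [PROCEDURES[name].apply(lemma) for name in LEMMA_PROCEDURES[lemma]]


print(inflected_forms("goed"))
```

Because the procedure is stored once and referenced by name, adding a new adjective only requires a new pointer, not a new rule.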
The lexicon is set up as an SGML file (over 30 MB of plain ASCII). Its contents have been encoded in a distributed manner: all formative entities (like lemmata, syntactic phrases, feature bundles) are SGML entities, related by a pointer mechanism to other entities.
The lexicon contains the following categories: adjectives (3,298 entries), adpositions (80 entries), adverbs (554 entries), articles (3 entries), conjunctions (70 entries), determiners (59 entries), interjections (235 entries), nouns (12,279 entries), numerals (77 entries), pronouns (85 entries), residuals (186 entries), unique (1 entry), verbs (3,274 entries).
More info on the Parole project.
The words are linked by meaning. The lexical categories are: nouns (5 x 18,000), verbs (5 x 8,000), adjectives (5 x 6,000), adverbs (5 x 1,500).
Economics, law and Business management: 10,640 entries
Leisure, Tourism, Sports, Food: 3,140 entries
Geography, History, Arts: 4,110 entries
Sociology, Psychology, Pedagogy: 4,080 entries
Natural and medical sciences: 10,530 entries
Exact sciences, Physics, Chemistry, Geology: 10,610 entries
Data Processing, Electronics, Telecommunications: 4,900 entries
Technology, Engineering and Construction: 11,950 entries
Economics: 1,320 entries
Data Processing: 3,560 entries
Telecommunications: 3,730 entries
Electrical Engineering: 1,760 entries
Plastics and Chemistry: 9,020 entries
Aeronautics, Navigation, Mechanical Engin.: 23,170 entries
The entries contain morphological information for part-of-speech and inflectional class. The information on multi-word terms is provided by the headword.
This dictionary was developed for machine translation. It gives the German lexeme with word class and Danish equivalent with word class, subject area, indication of structural changes from DK-G.
The general Dutch-French LanTmark lexicon is divided into the following categories: nouns (14,000), verbs (6,000), adjectives (5,000), adverbs (1,000).
Administrative vocabulary is divided into the following categories: nouns (30,000), verbs (2,000).
Data processing vocabulary has 10,000 transfer nouns.
Each entry contains domain information, source language disambiguation, features, and target language actions.
The English-French LanTmark lexicon is divided into the following lexical categories: nouns (14,000), verbs (7,000), adjectives (5,000), adverbs (1,000).
Each entry contains domain information, source language disambiguation, features, and target language actions.
The general French-Dutch LanTmark lexicon is divided into the following categories: nouns (25,000), verbs (3,000), adjectives (5,000), adverbs (1,000).
Administrative vocabulary is divided into the following categories: nouns (16,000), verbs (2,000).
Data processing vocabulary has 10,000 transfer nouns.
Each entry contains domain information, source language disambiguation, features, and target language actions.
The French-English LanTmark lexicon is divided into the following lexical categories: nouns (21,000), verbs (9,000), adjectives (3,000), adverbs (1,000).
Each entry contains domain information, source language disambiguation, features, and target language actions.
This dictionary was developed for machine translation. It gives the German lexeme with word class and Danish equivalent with word class, subject area, indication of structural changes from G-DK (e.g. direct object → PP (Prep 'xxx')).
Technical bilingual Italian dictionaries with a morphological coding which can generate all full forms using a software engine written in C. Multi-word terms contain morphological coding for the head word.
Aeronautics: 6,500 entries
Law: 18,000 entries
Computer Science: 31,000 entries
Medicine: 20,000 entries
Economics: 82,000 entries
Engineering: 27,000 entries
The bilingual English-German collocational dictionary consists of around 40,000 English headwords, including concepts expressed by more than one word (e.g. "environmental awareness" or "lame duck") and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword, as well as the German equivalents for the headwords and their English synonyms.
The corpus on which the dictionary is based consists of a representative group of written (British) English texts - books, magazines, and quality Press - which runs to about two million words. All entries are based on contemporary evidence, and are typical of words that appear at least once in a two-million word corpus. The examples and phrases are a major feature of this dictionary.
A global search will provide all collocations that can possibly be associated with the search word. A search engine, the Advanced Reader's Collocation Searcher (ARCS), is supplied with the data and provides all possible German equivalents of the headwords. All entries are sorted according to part-of-speech categories. The latter feature makes it possible for searches to yield different useful combinations of words, e.g. noun + verb + adjective + examples extracted from the corpus + synonyms. A global search will also locate all words semantically connected with the search word in both English and German.
Bilingual dictionaries for demonstration and commercial use containing local linguistic variant, local spelling variant, word frequency, usage (familiar, old, slang, etc.) and semantic features. The level of information in each entry varies depending on the word/phrase and on the dictionary. However, all of the above are present in varying degrees in the dictionaries. These dictionaries may be of particular interest for spell-checking, thesaurus, hyphenation and translation of natural languages. A Level 2 translation engine, also available via ELRA, provides exact translations, output in LOCAL-UCS format, for input words and phrases, input in LOCAL-UCS format, based on the vocabulary stored in a compressed translation file.
Each pair of languages may be purchased as different sets or subsets, corresponding to the indicated number of entries. All pairs consist of English to and from another language. The following groups of languages are available:
GROUP 1 (English <=> Language A):
Language A = Spanish (25,000, 60,000, 100,000 and 200,000 entries), French (40,000, 80,000, 100,000 and 200,000 entries), German (40,000, 80,000 and 126,000 entries), Italian (20,000 and 40,000 entries), Brazilian Portuguese (40,000, 80,000 and 400,000 entries), Portuguese (40,000, 80,000, 110,000 and 234,000 entries), Dutch (40,000, 80,000 and 110,000 entries).
GROUP 2 (English <=> Language B):
Language B = Danish (40,000, 80,000 and 110,000 entries), Swedish (40,000, 80,000 and 110,000 entries), Finnish (30,000 entries), Icelandic (40,000, 80,000 and 100,000 entries).
GROUP 3 (English <=> Language C):
Language C = Russian (40,000, 72,000 and 120,000 entries), Russian Business (60,000 entries), Russian Aerospace (60,000 entries), Russian Automotive (40,000 entries), Russian Minerals & Mining (60,000 entries), Polish (30,000, 80,000, 124,000 and 150,000 entries), Hungarian (30,000, 80,000 and 124,000 entries), Czech (40,000 entries), Romanian Starter (10,000 entries).
GROUP 4 (English <=> Language D):
Language D = Croatian (30,000 entries), Bosnian (30,000 entries), Serbian (Latin or Cyrillic) (30,000 entries).
GROUP 5 (English <=> Language E):
Language E = Japanese (40,000 entries).
GROUP 6 (English <=> Language F):
Language F = Greek (60,000 entries).
File format: Text
Please see http://www.tranexp.com for more information.
C. LR(2) Language-Specific Components
Following the announcement of the EuroWordNet databases in the last issue of the ELRA Newsletter (Vol.4 N.2), we are happy to announce that the list of EuroWordNet languages has grown. The following wordnets are now available via ELRA:
ELRA ref. | Language | Synsets | Word Meanings | Language Internal Relations | Equivalence Relations
ELRA-M0015 | English Addition to English WordNet | 16361 | 40588 | 42140 | 0
ELRA-M0016 | Dutch | 44015 | 70201 | 111639 | 53448
ELRA-M0017 | Spanish | 23370 | 50526 | 55163 | 21236
ELRA-M0018 | Italian | 40428 | 48499 | 117068 | 71789
ELRA-M0019 | German | 15132 | 20453 | 34818 | 16347
ELRA-M0020 | French | 22745 | 32809 | 49494 | 22730
ELRA-M0021 | Czech | 12824 | 19949 | 26259 | 12824
ELRA-M0022 | Estonian | 7678 | 13839 | 16318 | 9004
A. The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created. An ILI-record contains:
A.1 synset: set of synonymous words or phrases (mostly from WordNet1.5)
B. Top-Ontology: an ontology of 63 basic semantic classes based on fundamental distinctions. By means of the Top-Ontology all the wordnets can be accessed using a single language-independent classification-scheme. Top-Concepts are only assigned to ILI-records.
C. Domain-ontology: an ontology of subject-domains optionally assigned to ILI-records.
D. A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets. These Base-Concepts form the core of all the wordnets. All the Base-Concepts are classified in terms of the Top-Concepts that apply to them.
E. WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
Wordnets produced in the first project (LE2-4003):
F. Dutch wordnet
G. English wordnet (additional relations which are missing in WordNet1.5)
H. Italian wordnet
I. Spanish wordnet
After extension of the project (LE4-8328):
J. German wordnet
K. French wordnet
L. Czech wordnet
M. Estonian wordnet
The specific wordnets are language-internal structures, minimally containing:
Each wordnet will be distributed with LR1 and will include documentation on LR1 and the distributed wordnet. All the data will be distributed as text-files in the EuroWordNet import format and as Polaris database files (see below LR3). The EuroWordNet viewer (Periscope, see below LR3) can be used to access the database version. Polaris has to be licensed to modify and extend the database version.
The wordnets are distributed without:
The multilingual EUROWORDNET Database (partly Foreground, partly Background) consists of three components:
The Polaris tool is a re-implementation of the Novell ConceptNet toolkit (Díez-Orzas et al 1995) adapted to the EuroWordNet architecture. Polaris can import new wordnets or wordnet fragments from ASCII files with the correct import format and it creates an indexed EUROWORDNET Database. Furthermore, it allows a user to edit and add relations in the wordnets and to formulate queries. The Polaris toolkit makes it possible to visualise the semantic relations as a tree-structure that can directly be edited. These trees can be expanded and shrunk by clicking on word-meanings and by specifying so-called TABs indicating the kind and depth of relations that need to be shown. Expanded trees or sub-trees can be stored as a set of synsets, which can be manipulated, saved or loaded. Additionally, it is possible to access the ILI or the ontologies, and to switch between the wordnets and ontologies via the ILI. Finally, it contains an interface to project sets of synsets across wordnets.
The Periscope program is a public viewer that can be used to look at wordnets created by the Polaris tool and to compare them in a graphical interface. Word meanings can be looked up and trees can be expanded. Individual meanings or complete branches can be projected on another wordnet or wordnet structures can be compared via the equivalence relations with the Inter-Lingual-Index. Selected trees can be exported to text files. The Periscope program cannot be used for importing or changing wordnets.
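Projection across wordnets through the Inter-Lingual-Index can be pictured as two look-ups: from a synset to the ILI-records it is equivalent to, and from those records to synsets of the other wordnet. The sketch below only illustrates that idea. The function, the identifier scheme and the two tiny mappings are invented for the example and do not reflect the EuroWordNet import format or the Polaris and Periscope internals.

```python
def project(synset_id, source_to_ili, ili_to_target):
    """Return target-wordnet synsets that share an ILI-record with synset_id."""
    targets = set()
    for ili_record in source_to_ili.get(synset_id, ()):
        targets.update(ili_to_target.get(ili_record, ()))
    return sorted(targets)


# Invented equivalence relations, for illustration only.
DUTCH_TO_ILI = {"nl:hond-1": ["ili:0001"]}
ILI_TO_ITALIAN = {"ili:0001": ["it:cane-1"]}

print(project("nl:hond-1", DUTCH_TO_ILI, ILI_TO_ITALIAN))
```

A synset without an equivalence relation simply projects to an empty list, which mirrors the fact that not every word meaning is linked to the index.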
N. The Polaris program is partly Background and partly Foreground. It is property of Lernout & Hauspie and can be licensed as a EuroWordNet result, either directly from Lernout & Hauspie or from ELRA.
O. The Periscope viewer is property of Lernout & Hauspie and is Foreground.
The prices are based on the number of synsets in each wordnet and differ for the kind of usage and ELRA-membership:
Price per 1K Synsets (KS) in EUROs
VAR-C | 250 EURO/1KS
VAR-I (Internal use only) | 150 EURO/1KS
VAR-E (Evaluation licence) | 20 EURO/1KS
End-User (Academic institution - for research only) | 10 EURO/1KS
Prices per license
Ksynsets Wordnet | VAR-C | VAR-I | VAR-E | End-User | Reduction | ELRA Membership factor
1 | 250 | 150 | 20 | 10 | 0% | 2
10 | 2500 | 1500 | 200 | 100 | 0% | 2
20 | 5000 | 3000 | 400 | 200 | 0% | 2
30 | 7500 | 4500 | 600 | 300 | 0% | 2
40 | 10000 | 6000 | 800 | 400 | 0% | 2
50 | 12500 | 7500 | 1000 | 500 | 0% | 2
60 | 15000 | 9000 | 1200 | 600 | 5% | 2
70 | 17500 | 10500 | 1400 | 700 | 5% | 2
80 | 20000 | 12000 | 1600 | 800 | 5% | 2
90 | 22500 | 13500 | 1800 | 900 | 5% | 2
100 | 25000 | 15000 | 2000 | 1000 | 10% | 2
120 | 30000 | 18000 | 2400 | 1200 | 10% | 2
140 | 35000 | 21000 | 2800 | 1400 | 10% | 2
150 | 37500 | 22500 | 3000 | 1500 | 10% | 2
160 | 40000 | 24000 | 3200 | 1600 | 20% | 2
170 | 42500 | 25500 | 3400 | 1700 | 20% | 2
180 | 45000 | 27000 | 3600 | 1800 | 20% | 2
190 | 47500 | 28500 | 3800 | 1900 | 20% | 2
200 | 50000 | 30000 | 4000 | 2000 | 20% | 2
Above 60Ksynsets a reduction of 5% is offered, above 100Ksynsets a reduction of 10% and above 160Ksynsets a reduction of 20%. If multiple wordnets are obtained, their sizes are added up and the reduction is based on the cumulative total. The percentage reduction is deducted from the price of each wordnet. For example, if one obtains 3 wordnets of 10KS, 20KS and 40KS, the total amount is 70KS. The prices for an ELRA member are then as follows:
Prices in EURO for ELRA members without reduction
Licence | 10KS wordnet | 20KS wordnet | 40KS wordnet | Total 70KS
VAR-C | 2500 | 5000 | 10000 | 17500
VAR-I | 1500 | 3000 | 6000 | 10500
VAR-E | 200 | 400 | 800 | 1400
End-User | 100 | 200 | 400 | 700
Prices in EURO for ELRA members with reduction of 5%
Licence | 10KS wordnet | 20KS wordnet | 40KS wordnet | Total 70KS
VAR-C | 2250 | 4500 | 9000 | 15750
VAR-I | 1350 | 2700 | 5400 | 9450
VAR-E | 180 | 360 | 720 | 1260
End-User | 90 | 180 | 360 | 630
Since the total is between 60 and 100KS, there will be a 5% reduction. The reduction will be distributed over each wordnet. Non-ELRA members pay a double price.
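The pricing rules can be written down as a small calculation. This is a sketch under stated assumptions rather than an official price calculator: the function names are hypothetical, and the thresholds are taken as inclusive (5% from 60KS, 10% from 100KS, 20% from 160KS) because the price list shows 5% at exactly 60KS, whereas the text says "above". Note that the sketch applies the stated 5% rate to the 70KS example; the discounted figures printed in the example table correspond to a 10% deduction instead.

```python
# Member price per 1K synsets, in EURO, from the price list.
RATE_PER_KS = {"VAR-C": 250, "VAR-I": 150, "VAR-E": 20, "End-User": 10}
NON_MEMBER_FACTOR = 2  # non-ELRA members pay double


def reduction_percent(total_ks):
    """Reduction for a cumulative order size (inclusive thresholds assumed)."""
    if total_ks >= 160:
        return 20
    if total_ks >= 100:
        return 10
    if total_ks >= 60:
        return 5
    return 0


def quote(wordnet_sizes_ks, licence, member=True):
    """Price of each wordnet after the reduction based on the cumulative total."""
    rate = RATE_PER_KS[licence] * (1 if member else NON_MEMBER_FACTOR)
    keep = 100 - reduction_percent(sum(wordnet_sizes_ks))
    return [size * rate * keep / 100 for size in wordnet_sizes_ks]


prices = quote([10, 20, 40], "VAR-C")
print(prices, sum(prices))
```

Working in whole percentages and dividing once at the end avoids rounding noise from multiplying by fractions such as 0.95.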
Below are two examples for a wordnet with 30KSynsets and 40KSynsets.
Wordnet (30Ksynsets) | Price in EUROs for ELRA Member | Price in EUROs for non-Member
VAR-C | 7,500 EURO | 15,000 EURO
VAR-I (Internal use only) | 4,500 EURO | 9,000 EURO
VAR-E (Evaluation licence) | 600 EURO | 1,200 EURO
End-User (Academic institution - for research only) | 300 EURO | 600 EURO
Wordnet (40Ksynsets) | Price in EUROs for ELRA Member | Price in EUROs for non-Member
VAR-C | 10,000 EURO | 20,000 EURO
VAR-I (Internal use only) | 6,000 EURO | 12,000 EURO
VAR-E (Evaluation licence) | 800 EURO | 1,600 EURO
End-User (Academic institution - for research only) | 400 EURO | 800 EURO
Technical support may be provided by members of the consortium. It will be implemented through bilateral agreements between the User and the member of the consortium responsible for the data acquired by User. As an indication the support contract will be on a yearly basis and will cost 10-20 KEURO/Year.
For more information about the EuroWordNet project: http://www.hum.uva.nl/~ewn
ALEP is a flexible, fully configurable platform, designed to facilitate the description of linguistic phenomena, the compilation of these descriptions into an executable form and the application of the resulting code in a number of processes.
ALEP comes with a rule formalism that offers an expressive, yet concise and simple means to describe linguistic phenomena, a compiler and an engine, called the virtual machine, that uses the compiled linguistic rules in analysis, transfer or synthesis of texts.
The Large-Scale Grammars for EU Languages project (LRE-1 61029) is making its resources - language modules for Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish - available via ELRA. All modules have a text handling component, a two-level morphology, a word structure component (inflection only), and a grammar. They are based on the same principle and semantic descriptions, and have a common format. The linguistic basis for the grammatical part of the modules, which were developed via corpus investigation, is HPSG, with some revisions. Some of the grammars come close to the corpus in coverage. Efficiency played a decisive role, with some of the modules being able to analyse paragraphs of several sentences comprising up to fifty words in less than a minute on an Ultra-Sparc. Last but not least, a large body of test material and very detailed documentation is available for all grammars.