CORPUS HOLDINGS

UCREL has a wide variety of machine-readable corpora held in file storage or on CD-ROM. Some corpora are held only as plain orthographic text, whilst others are held with several kinds of annotation.

Some of the corpora listed below are also available via ICAME in Bergen, Norway, and information on how to obtain some of the others is available at the same site. A selection of the corpus manuals is on-line too.

The following overview summarizes what is available to use at Lancaster.

The British National Corpus (BNC)
The Lancaster/Oslo-Bergen Corpus (LOB)
The Brown University Corpus
The Kolhapur Corpus
The Longman-Lancaster Corpus
The Lancaster/IBM Spoken English Corpus (SEC)
The London-Lund Corpus
The ET10-63 Corpus
The International Telecommunications Union (ITU) or CRATER Corpus
The Helsinki Corpus (Diachronic Part)
The Lampeter Corpus of Early Modern English Tracts
The Lancaster-Leeds Treebank
The Lancaster Parsed Corpus (LPC)
The American Printing House for the Blind Treebank (APHB)
The Associated Press Treebank (AP)
The Canadian Hansard Treebank
The IBM Manuals Treebank
The Anaphoric Treebank
The ACL/DCI CD-ROM
The WordCruncher Disk

Mixed-Channel Corpora

The British National Corpus (BNC)

The BNC is a 100,000,000-word corpus of written and spoken British English from the early 1990s. Approximately 90% of the corpus is written material and approximately 10% is spoken material. The corpus is tagged for part of speech.
Full details of the corpus can be found on the BNC web page.

Written Corpora

The Lancaster/Oslo-Bergen Corpus (LOB)

Approximately 1,000,000 words of British written English dating from 1960. The corpus is made up of 15 different genre categories. Available as orthographic text, and tagged with the CLAWS1 part-of-speech tagging system. The Lancaster-Leeds Treebank and the Lancaster Parsed Corpus are analyzed subsamples of the LOB corpus. For further information see the corpus manual (1978) and the tagged corpus manual (1986). (There is a local on-line copy of the tagged corpus manual at Lancaster.)

The Brown University Corpus

Approximately 1,000,000 words of American written English dating from 1960. The genre categories are parallel to those of the LOB corpus. Available as orthographic text only. For further information, see the Brown Corpus bibliography or the corpus manual.

The Kolhapur Corpus

Approximately 1,000,000 words of Indian written English dating from 1978. Again, the genre categories are parallel to those of the LOB corpus. Available as orthographic text only.

The Longman-Lancaster Corpus

Approximately 14.5 million words of written English from various geographical locations in the English-speaking world and of various dates and text types. Orthographic text only.

Speech Corpora

The Lancaster/IBM Spoken English Corpus (SEC)

Approximately 53,000 words of British spoken English, mainly taken from radio broadcasts dating from the mid-1980s. Available as orthographic text, tagged with the CLAWS2 part-of-speech tagging system, parsed, and prosodically annotated. There are also tapes of a standard suitable for the instrumental analysis of F0 values.

The London-Lund Corpus

Approximately 500,000 words of spoken British English, with dates ranging from the 1960s to the mid-1970s. Prosodically annotated version only.

Multilingual Corpora

The ET10-63 Corpus

The ET10-63 corpus is a bilingual parallel corpus of English and French, containing official EC documents on telecommunications. The corpus is part-of-speech tagged and also lemmatized.
Approximately 1,250,000 words of each language.

The International Telecommunications Union (ITU) or CRATER Corpus

A 1,000,000-word trilingual corpus of Spanish, French and English, aligned at the sentence level. The corpus is made up of texts from the telecommunications domain. It has been part-of-speech tagged in all three languages.
The corpus can be accessed on-line.

Historical Corpora

The Helsinki Corpus (Diachronic Part)

Plain orthographic text. The Helsinki corpus contains samples from texts covering the Old, Middle, and Early Modern English periods. 1,500,000 words in total.

The Lampeter Corpus of Early Modern English Tracts

A corpus of approx. 1,000,000 words of English pamphlet literature covering the years 1640-1740. Text samples are taken from each decade within this century and several genres are represented. This corpus contains the whole text of pamphlets, rather than sub-samples. It is being tagged for part-of-speech and lemmatized at the TU Chemnitz-Zwickau's REAL Centre, in association with Lancaster.
A full list of the corpus texts is available.

Parsed Corpora (`Treebanks')

The Lancaster-Leeds Treebank

A manually parsed subsample of the LOB corpus showing the surface phrase structure of each sentence, prepared by Professor Geoffrey Sampson. Approximately 45,000 words taken from all the genre categories of the LOB corpus.

The Lancaster Parsed Corpus (LPC)

A subsample of the LOB corpus, parsed by computer and manually corrected by several researchers. Approximately 140,000 words with samples from each of the 15 categories in the LOB corpus.

The American Printing House for the Blind Treebank (APHB)

A skeleton-parsed corpus of a wide range of English texts. 200,000 words.

The Associated Press Treebank (AP)

A skeleton-parsed corpus of American newswire reports. 1,000,000 words.

The Canadian Hansard Treebank

A skeleton-parsed corpus of proceedings in the Canadian Parliament. 750,000 words.

The IBM Manuals Treebank

A skeleton-parsed corpus of computer manuals. 800,000 words.

The Anaphoric Treebank

A subsample of the AP corpus, annotated to show the reference of pronouns and lexical cohesion. Approximately 100,000 words.

CD-ROMs

The ACL/DCI CD-ROM

This contains plain orthographic text collected by the Association for Computational Linguistics' Data Collection Initiative. It consists of: the Collins English Dictionary; selections from the Wall Street Journal; the `Penn Treebank' of skeleton-parsed data compiled by Mitch Marcus and his team at the University of Pennsylvania (Marcus and Santorini, 1992); and a database of scientific abstracts.

The WordCruncher Disk

A CD-ROM containing a varied selection of texts indexed for use with WordCruncher. These include the complete works of Shakespeare, two versions of the Bible (the Authorized King James Version and the New International Version), and a variety of American literature. The Bible texts are also stored on the Linguistics LAN server and hence these two texts can be searched without having the WordCruncher disk in a local CD-ROM drive.
