UCREL has a wide variety of machine-readable corpora held in file
storage or on CD-ROM. Some corpora are held only as plain orthographic
text, whilst others are held with several kinds of annotation.
Some of the corpora listed below are also available via
ICAME in Bergen, Norway, and
information on how to obtain some of the others is available at the
A selection of the corpus manuals
are on-line too.
The following overview summarizes what is available to use:
The British National Corpus (BNC)
The BNC is a 100,000,000 word corpus of written and spoken
British English from the early 1990s. Approximately 90% of
the corpus is made up of written material and approximately
10% is made up of spoken material. The corpus is tagged
for part of speech.
Full details of the corpus can be
found on the BNC web page.
The Lancaster/Oslo-Bergen Corpus (LOB)
Approximately 1,000,000 words
of British written English dating from 1961. The corpus is made up of
15 different genre categories. Available as orthographic text, and
tagged with the CLAWS1 part-of-speech tagging system. The
Lancaster-Leeds Treebank and Lancaster Parsed Corpus are analyzed
subsamples of the LOB corpus.
For further information
see the corpus manual (1978)
and the tagged corpus manual (1986).
(There is a local on-line copy of the tagged corpus manual at Lancaster.)
The Brown University Corpus
Approximately 1,000,000 words of American
written English dating from 1961. The genre categories are parallel to
those of the LOB corpus. Available as orthographic text only.
(For further information, see the Brown Corpus bibliography
or the corpus manual.)
The Kolhapur Corpus
Approximately 1,000,000 words of Indian
written English dating from 1978. Again, the genre categories are
parallel to those of the LOB corpus. Available as orthographic text.
The Longman-Lancaster Corpus
Approximately 14.5 million words of written English from
various geographical locations in the English-speaking world and of
various dates and text types. Orthographic text only.
The Lancaster/IBM Spoken English Corpus (SEC)
Approximately 53,000 words of British spoken English,
mainly taken from radio broadcasts dating from the mid-1980s. Available
as orthographic text, tagged with the CLAWS2 part-of-speech tagging
system, parsed, and prosodically annotated. There are also tapes of a
standard suitable for the instrumental analysis of F0 values.
The London-Lund Corpus
Approximately 500,000 words of
spoken British English. Various dates from the 1960s to the mid-1970s.
Prosodically annotated version only.
The ET10-63 Corpus
The ET10-63 corpus is a bilingual parallel corpus of English and
French, containing EC official documents on telecommunications. The
corpus is part-of-speech tagged and also lemmatized.
Approximately 1,250,000 words of each language.
The International Telecommunications Union (ITU) or CRATER Corpus
A 1,000,000-word trilingual corpus of Spanish, French
and English, aligned at the sentence level. The
corpus is made up of texts from the telecommunications domain.
It has been part-of-speech tagged in all three languages.
The corpus can be accessed on-line.
The Helsinki Corpus (Diachronic Part)
Orthographic text. The Helsinki corpus contains samples from texts
covering the Old, Middle, and Early Modern English periods. 1,500,000
words in total.
The Lampeter Corpus of Early Modern English Tracts
A corpus of approx. 1,000,000 words of English pamphlet literature
covering the years 1640-1740. Text samples are taken from each
decade within this century and several genres are represented.
This corpus contains the whole text of pamphlets, rather than
sub-samples. It is being tagged for part-of-speech and lemmatized
at the TU Chemnitz-Zwickau's REAL Centre, in association with Lancaster.
A full list of the corpus
texts is available.
Parsed Corpora (`Treebanks')
The Lancaster-Leeds Treebank
A manually parsed subsample of the LOB
corpus showing the surface phrase structure of each sentence, prepared
by Professor Geoffrey Sampson. Approximately 45,000 words taken from
all the genre categories of the LOB corpus.
The Lancaster Parsed Corpus (LPC)
A subsample of the LOB corpus, parsed by computer and
manually corrected by several researchers. Approximately 140,000 words
with samples from each of the 15 categories in the LOB corpus.
The American Printing House for the Blind Treebank (APHB)
A skeleton-parsed corpus of a wide range of
English texts. 200,000 words.
The Associated Press Treebank (AP)
A skeleton-parsed corpus of American newswire reports.
The Canadian Hansard Treebank
A skeleton-parsed corpus of proceedings in the Canadian Parliament. 750,000 words.
The IBM Manuals Treebank
A skeleton-parsed corpus of computer manuals. 800,000 words.
The Anaphoric Treebank
A subsample of the AP corpus, annotated to show the reference of
pronouns and lexical cohesion. Approximately
The ACL/DCI CD-ROM
This contains plain
orthographic text collected by the Association for Computational
Linguistics' Data Collection Initiative. It consists of: the Collins
English Dictionary; selections from the Wall Street Journal; the `Penn
Treebank' of skeleton-parsed data compiled by Mitch Marcus and his team
at the University of Pennsylvania (Marcus and Santorini, 1992); and a
database of scientific abstracts.
The WordCruncher Disk
A CD-ROM containing a varied selection of texts indexed for use with
WordCruncher. These include the complete works of Shakespeare, two
versions of the Bible (the Authorized King James Version and the New
International Version), and a variety of American literature. The Bible
texts are also stored on the Linguistics LAN server and
hence these two texts can be searched without having the WordCruncher
disk in a local CD-ROM drive.