Text Collection and Processing

Most of the text is newswire captured from satellite or leased-line feeds, though some of it is delivered by email, through direct-dial connections, on CD-ROM, or even on diskette by information providers with whom the LDC has negotiated license agreements. Some of the data also comes from NNTP-accessed newsgroups. For sources that provide continuous or periodic feeds, the text is spooled onto the LDC's server by capture scripts written for each text source. Other programs then condition the text by transforming or removing non-standard mark-up, normalizing character sets to an appropriate standard, and introducing a standard form of SGML mark-up that is as consistent as possible across text sources.
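The conditioning step described above can be illustrated with a minimal sketch. It is not the LDC's actual software: the function names, the set of control characters treated as feed mark-up, and the `DOC`/`DOCNO`/`TEXT` tag names are all assumptions chosen for illustration, and the real scripts differ for each source.

```python
import re

# Assumption: the feed embeds low control characters as its own mark-up.
# Tab, newline and carriage return are deliberately left out of this set.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_sgml(text: str) -> str:
    """Escape characters that would otherwise be read as SGML mark-up."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def condition_text(raw: bytes, source_encoding: str, doc_id: str) -> str:
    """Decode one captured story, clean it, and wrap it in SGML-style tags.

    The tag names used here are hypothetical placeholders.
    """
    # Normalize the character set by decoding from the source's encoding.
    text = raw.decode(source_encoding)
    # Normalize line endings to a single convention.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove the (assumed) feed-specific control codes.
    text = _CONTROL_CHARS.sub("", text)
    # Protect literal '&', '<' and '>' before adding our own tags.
    text = escape_sgml(text.strip())
    return (
        "<DOC>\n"
        f"<DOCNO> {doc_id} </DOCNO>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )
```

A call such as `condition_text(raw_story, "iso-8859-1", "AFP-0001")` would handle a Latin-1 source, while passing `"gb2312"` or `"big5"` would handle Chinese text in those encodings. Keeping the source encoding as a parameter reflects the point that each text source needs its own handling while the output mark-up stays the same.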

Currently, most text and broadcast data collection at the LDC is project-driven: text is collected as needed for various research programs. The bulk of our text acquisition has been motivated by sponsored projects, including language modeling for speech recognition, collections for information retrieval and text understanding, and materials for language teaching. However, we are eager to see text collections made available for languages that currently lack such resources, and (if asked) will offer advice and assistance in developing, publishing, or distributing such collections, regardless of their relation to commercially viable technology development.

For more information on the bodies of text listed below, see the LDC's catalog listing by database, or the "Main List," which includes the items in the database listing by year along with information on data to be published in the coming months. We apologize for the currently incomplete inventory; many of the processing jobs were recently taken offline when new disks were added to the system, and we are only gradually getting them started again. The newswires and other data we spool or have on hand are as follows:

| Provider | Data Type | Corpus Name | Language | In Hand | Character Encoding | Media Type | Collection Date |
|---|---|---|---|---|---|---|---|
| Agence France Presse | Text-Newswire | LDC95T9, LDC95T11 | Arabic, English, French, German, Portuguese, and Spanish | Data | Arabic: ISO 8859-6; others: ISO Latin-1 | CD-ROM | 1994-97 |
| Associated Press | Text-Newswire | LDC95T9, LDC93T3, LDC95T11 | Dutch, English, French, German, Spanish, Swedish | Data | ISO Latin-1 | CD-ROM, On-Line (Eng.) | 1996- |
| Reuters Latin American Business Report | Text-Newswire | LDC95T9 | English, Spanish | 96MB, 240MB | ISO Latin-1 | CD-ROM | Sept. 1993-Dec. 1996 |
| Reuters Latin American Business Report II | Text-Newswire | LDC95T9 | English | | | | |
| Reuters Spanish Language News Service | Text-Newswire | LDC95T9 | Spanish | 326MB | ISO Latin-1 | CD-ROM | Sept. 1993-Dec. 1996 |
| Reuters North American News | Text-Newswire | LDC95T6 | English | | ISO Latin-1 | | |
| Reuters Financial Report | Text-Newswire | LDC95T6 | English | | ISO Latin-1 | | |
| New York Times News Service | Text-Newswire | LDC95T6 | English | Data | ISO Latin-1 | CD-ROM, On-Line (Eng.) | NA |
| LA Times-Washington Post News Service | Text-Newswire | LDC95T6 | English | Data | ISO Latin-1 | CD-ROM, On-Line (Eng.) | 1996- |
| Deutsche Presse Agentur | Text-Newswire | LDC95T11 | German, English | Data | ISO Latin-1 | CD-ROM | NA |
| Dinamani | Text-Newswire | NA | Tamil | NA | NA | NA | NA |
| Xinhua News Agency | Text-Newswire | LDC95T13 | Mandarin Chinese, English | Data | GB | NA | NA |
| China Broadcasting | Text-Scripted | LDC95T13 | Mandarin Chinese | Data | GB | NA | NA |
| People's Daily | Text-Newswire | LDC95T13 | Mandarin Chinese | Data | GB | NA | NA |
| Kyodo News Service | Text-Newswire | LDC95T8 | Japanese | Data | JIS | CD-ROM | 1994-95 |
| Wall Street Journal | Text-Newswire | LDC95T6, LDC93T1, LDC95T7, LDC93T3, LDC93S6A, LDC94S13A, LDC95S24 | English | NA | ISO Latin-1 | CD-ROM, On-Line (Eng.) | NA |
| Journal Graphics | Text-Transcripts | NA | English | 2GB | ISO Latin-1 | CD-ROM | NA |
| YONHAP Korean National News Agency | Text-Newswire | NA | Korean | NA | Trigem Johab, EUC-KR | NA | NA |
| Izvestia, Financial Izvestia, Zakon | Text-Newswire | NA | Russian | Data | NA | NA | NA |
| Internet News Groups | Text-Internet | NA | Japanese, Russian, German | NA | JIS, KOI8, ISO Latin-1 | NA | NA |
| Central News Agency | | NA | Taiwanese | NA | Big5 | NA | NA |
| Nasa Borba | | | Serbo-Croatian | | ISO Latin-1 | | |
| Krungthep Thurakej | | | Thai | | TIS-620 | | |


(c) 1996-1999 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.
Please send technical questions to online-service@ldc.upenn.edu, Member sales questions to ldc@ldc.upenn.edu.