Text Collection and Processing

Most of the text is newswire captured from satellite or leased-line feeds, though some is delivered via email, direct-dial sources, CD-ROM, or even diskette by information providers with whom the LDC has negotiated license agreements. Some of the data also comes from newsgroups accessed over NNTP. For sources that provide continuous or periodic feeds, the text is spooled onto the LDC's server by capture scripts written for each text source. Other programs then condition the text by transforming or removing non-standard mark-up, normalizing character sets to an appropriate standard, and introducing a standard form of SGML mark-up that is as consistent as possible across text sources.
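
The conditioning steps just described can be sketched roughly as follows. This is only an illustration under stated assumptions, not the LDC's actual scripts: the function name `condition`, the choice of which control characters to strip, and the `<DOC>`, `<DOCNO>`, and `<TEXT>` tag names are all hypothetical, chosen to show the shape of the process.

```python
import re
import unicodedata

# Control characters other than tab and newline. Stripping them stands in
# for "removing non-standard mark-up"; the real rules differ per source.
_CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def condition(raw: bytes, source_encoding: str, doc_id: str) -> str:
    """Decode one captured story, clean it, and wrap it in SGML-style tags.

    The tag names used here are assumptions made for this sketch.
    """
    # 1. Normalize the character set: decode from the feed's own encoding.
    text = raw.decode(source_encoding, errors="replace")
    text = unicodedata.normalize("NFC", text)
    # 2. Unify line endings, then drop stray control characters.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = _CONTROL.sub("", text)
    # 3. Escape characters that would otherwise be read as SGML mark-up.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    # 4. Introduce the same mark-up regardless of the source.
    return (
        "<DOC>\n"
        f"<DOCNO> {doc_id} </DOCNO>\n"
        f"<TEXT>\n{text.strip()}\n</TEXT>\n"
        "</DOC>\n"
    )
```

A capture script for a given source would call something like `condition(story_bytes, "latin-1", "X1")` on each story it spools, so that every source ends up in the same wrapper whatever its original encoding.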

Currently, most text and broadcast data collection at the LDC is project-driven: text is collected as needed for various research programs. The bulk of our text acquisition has been motivated by sponsored projects, including language modeling for speech recognition, collections for information retrieval and text understanding, and materials for language teaching. However, we are eager to see text collections made available for languages that currently lack such resources, and (if asked) will offer advice and assistance in developing, publishing, or distributing such collections, regardless of their relation to commercially viable technology development.

For more information on the bodies of text listed below, see the LDC's catalog listing by database, or the "Main List," which includes various items in the database listing by year and information on data to be published in the coming months. We apologize that the inventory is currently incomplete; many of the processing jobs were recently taken offline when new disks were added to the system, and we are only gradually getting them started again. The newswires and other data that we spool or have on hand are as follows:

Provider	Data Type	Corpus Name	Language	In Hand	Character Encoding	Media Type	Collection Date
`Agence France Presse`	`Text- Newswire`	`LDC95T9, LDC95T11`	`Arabic, English, French, German, Portuguese, and Spanish`	`Data`	`Arabic ISO 8859-6, others ISO Latin-1`	`CD-ROM`	`1994-97`
`Associated Press`	`Text- Newswire`	`LDC95T9, LDC93T3, LDC95T11`	`Dutch, English, French, German, Spanish, Swedish`	`Data`	`ISO Latin-1`	`CD-ROM, On-Line(Eng.)`	`1996-`
`Reuters Latin American Business Report`	`Text- Newswire`	`LDC95T9`	`English, Spanish`	`96MB, 240MB`	`ISO Latin-1`	`CD-ROM`	`Sept. 1993-Dec. 1996`
`Reuters Latin American Business Report II`	`Text- Newswire`	`LDC95T9`	`English`
`Reuters Spanish Language News Service`	`Text- Newswire`	`LDC95T9`	`Spanish`	`326MB`	`ISO Latin-1`	`CD-ROM`	`Sept. 1993-Dec. 1996`
`Reuters North American News`	`Text- Newswire`	`LDC95T6`	`English`		`ISO Latin-1`
`Reuters Financial Report`	`Text- Newswire`	`LDC95T6`	`English`		`ISO Latin-1`
`New York Times News Service`	`Text- Newswire`	`LDC95T6`	`English`	`Data`	`ISO Latin-1`	`CD-ROM, On-Line(Eng.)`	`NA`
`LA Times-Washington Post News Service`	`Text- Newswire`	`LDC95T6`	`English`	`Data`	`ISO Latin-1`	`CD-ROM, On-Line(Eng.)`	`1996-`
`Deutsche Presse Agentur`	`Text- Newswire`	`LDC95T11`	`German, English`	`Data`	`ISO Latin-1`	`CD-ROM`	`NA`
`Dinamani`	`Text- Newswire`	`NA`	`Tamil`	`NA`	`NA`	`NA`	`NA`
`Xinhua News Agency`	`Text- Newswire`	`LDC95T13`	`Mandarin Chinese, English`	`Data`	`GB`	`NA`	`NA`
`China Broadcasting`	`Text- Scripted`	`LDC95T13`	`Mandarin Chinese`	`Data`	`GB`	`NA`	`NA`
`People's Daily`	`Text- Newswire`	`LDC95T13`	`Mandarin Chinese`	`Data`	`GB`	`NA`	`NA`
`Kyodo News Service`	`Text- Newswire`	`LDC95T8`	`Japanese`	`Data`	`JIS`	`CD-ROM`	`1994-95`
`Wall Street Journal`	`Text- Newswire`	`LDC95T6, LDC93T1, LDC95T7, LDC93T3, LDC93S6A, LDC94S13A, LDC95S24`	`English`	`NA`	`ISO Latin-1`	`CD-ROM, On-Line(Eng.)`	`NA`
`Journal Graphics`	`Text- Transcripts`	`NA`	`English`	`2GB`	`ISO Latin-1`	`CD-ROM`	`NA`
`YONHAP Korean National News Agency`	`Text- Newswire`	`NA`	`Korean`	`NA`	`Trigem Johab, EUC-KR`	`NA`	`NA`
`Izvestia, Financial Izvestia, Zakon`	`Text- Newswire`	`NA`	`Russian`	`Data`	`NA`	`NA`	`NA`
`Internet News Groups`	`Text-Internet`	`NA`	`Japanese, Russian, German`	`NA`	`JIS, KOI8, ISO Latin-1`	`NA`	`NA`
`Central News Agency`		`NA`	`Taiwanese`	`NA`	`Big5`	`NA`	`NA`
`Nasa Borba`			`Serbo-Croatian`		`ISO Latin-1`
`Krungthep Thurakej`			`Thai`		`TIS-620`
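
For readers who want to decode these sources themselves, the encoding labels in the table can be mapped to codecs in Python's standard library, roughly as below. This is a sketch under assumptions: the table does not say which variant is meant by "GB", "JIS", or the Russian KOI8 encoding, so those entries are guesses and are marked as such; the names `CODEC_FOR_LABEL` and `codec_for` are hypothetical.

```python
import codecs

# Table label -> Python codec name. Entries marked "assumed" are guesses,
# because the table does not name the exact variant.
CODEC_FOR_LABEL = {
    "ISO Latin-1": "latin-1",
    "ISO 8859-6": "iso8859_6",
    "GB": "gb2312",          # assumed: "GB" taken to mean GB 2312
    "Big5": "big5",
    "JIS": "iso2022_jp",     # assumed: could also be Shift JIS or EUC-JP
    "EUC-KR": "euc_kr",
    "Trigem Johab": "johab",
    "KOI8": "koi8_r",        # assumed: the KOI8-R variant
    "TIS-620": "tis_620",
}


def codec_for(label: str) -> str:
    """Return the canonical name of the codec registered for a table label."""
    # codecs.lookup raises LookupError if the codec is not available.
    return codecs.lookup(CODEC_FOR_LABEL[label]).name
```

A capture script could then decode a story with `raw.decode(codec_for("Big5"))`, checking the result by eye for any source where the variant was assumed.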