Most of the text is newswire captured from satellite or leased-line feeds, though some of it is delivered via email, direct-dial sources, CD-ROM, or even diskette by information providers with whom the LDC has negotiated license agreements. Some of the data also comes from NNTP-accessed newsgroups. For sources that provide continuous or periodic feeds, the text is spooled onto the LDC's server by capture scripts written for each text source. Other programs then condition the text by transforming or removing non-standard mark-up, normalizing character sets to an appropriate standard, and introducing a standard form of SGML mark-up that is as consistent as possible across text sources.
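The conditioning steps described above (character-set normalization, removal of non-standard mark-up, and wrapping in SGML) might be sketched roughly as follows. This is a minimal illustration, not the LDC's actual scripts: the function name `condition`, the tag names `DOC` and `TEXT`, and the rule that vendor mark-up looks like angle-bracket tags are all assumptions made for the example.

```python
import re

# Hypothetical SGML wrapper; the source does not specify the LDC's tag set.
DOC_TEMPLATE = '<DOC id="{doc_id}">\n<TEXT>\n{body}\n</TEXT>\n</DOC>\n'

# Assumption: vendor mark-up is anything shaped like an angle-bracket tag.
TAG_RE = re.compile(r"<[^>]+>")


def condition(raw: bytes, source_encoding: str, doc_id: str) -> str:
    """Decode one raw feed item, strip vendor mark-up, and wrap it in SGML."""
    # 1. Normalize the character set by decoding from the provider's encoding.
    text = raw.decode(source_encoding, errors="replace")
    # 2. Remove vendor mark-up and normalize line endings.
    text = TAG_RE.sub("", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Trim trailing whitespace and drop empty lines.
    lines = [line.rstrip() for line in text.split("\n")]
    body = "\n".join(line for line in lines if line)
    # 3. Escape characters that are significant in SGML ("&" first).
    body = body.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    # 4. Introduce the standard wrapper.
    return DOC_TEMPLATE.format(doc_id=doc_id, body=body)
```

A call such as `condition(raw_bytes, "latin_1", "afp-0001")` would return one wrapped document as a string; a real pipeline would need per-source rules in place of the single regular expression used here.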
Currently, most text and broadcast data collection at the LDC is project-driven: text is collected as needed for various research programs. The bulk of our text acquisition has been motivated by sponsored projects, including language modeling for speech recognition, collections for information retrieval and text understanding, and materials for language teaching. However, we are eager to see text collections made available for languages that do not yet have such resources, and on request will offer advice and assistance in developing, publishing, or distributing such collections, regardless of their relation to commercially viable technology development.
For more information on the bodies of text listed below, see the LDC's
catalog listing by database, or the "Main List," which includes
various items in the database listing by year and information on data
to be published in the coming months. We apologize for the currently
incomplete inventory; many of the processing jobs were recently taken
offline when new disks were added to the system, and we are only
gradually getting them started again. The newswires and other data we
spool or have on hand are as follows:
Provider | Data Type | Corpus Name | Language | In Hand | Character Encoding | Media Type | Collection Date |
---|---|---|---|---|---|---|---|
Agence France Presse | Text- Newswire | LDC95T9, LDC95T11 | Arabic, English, French, German, Portuguese, and Spanish | Data | Arabic ISO 8859-6, others ISO Latin-1 | CD-ROM | 1994-97 |
Associated Press | Text- Newswire | LDC95T9, LDC93T3, LDC95T11 | Dutch, English, French, German, Spanish, Swedish | Data | ISO Latin-1 | CD-ROM, On-Line(Eng.) | 1996- |
Reuters Latin American Business Report | Text- Newswire | LDC95T9 | English, Spanish | 96MB, 240MB | ISO Latin-1 | CD-ROM | Sept. 1993-Dec. 1996 |
Reuters Latin American Business Report II | Text- Newswire | LDC95T9 | English | ||||
Reuters Spanish Language News Service | Text- Newswire | LDC95T9 | Spanish | 326MB | ISO Latin-1 | CD-ROM | Sept. 1993-Dec. 1996 |
Reuters North American News | Text- Newswire | LDC95T6 | English | | ISO Latin-1 | | |
Reuters Financial Report | Text- Newswire | LDC95T6 | English | | ISO Latin-1 | | |
New York Times News Service | Text- Newswire | LDC95T6 | English | Data | ISO Latin-1 | CD-ROM, On-Line(Eng.) | NA |
LA Times-Washington Post News Service | Text- Newswire | LDC95T6 | English | Data | ISO Latin-1 | CD-ROM, On-Line(Eng.) | 1996- |
Deutsche Presse Agentur | Text- Newswire | LDC95T11 | German, English | Data | ISO Latin-1 | CD-ROM | NA |
Dinamani | Text- Newswire | NA | Tamil | NA | NA | NA | NA |
Xinhua News Agency | Text- Newswire | LDC95T13 | Mandarin Chinese, English | Data | GB | NA | NA |
China Broadcasting | Text- Scripted | LDC95T13 | Mandarin Chinese | Data | GB | NA | NA |
People's Daily | Text- Newswire | LDC95T13 | Mandarin Chinese | Data | GB | NA | NA |
Kyodo News Service | Text- Newswire | LDC95T8 | Japanese | Data | JIS | CD-ROM | 1994-95 |
Wall Street Journal | Text- Newswire | LDC95T6, LDC93T1, LDC95T7, LDC93T3, LDC93S6A, LDC94S13A, LDC95S24 | English | NA | ISO Latin-1 | CD-ROM, On-Line(Eng.) | NA |
Journal Graphics | Text- Transcripts | NA | English | 2GB | ISO Latin-1 | CD-ROM | NA |
YONHAP Korean National News Agency | Text- Newswire | NA | Korean | NA | Trigem Johab, EUC-KR | NA | NA |
Izvestia, Financial Izvestia, Zakon | Text- Newswire | NA | Russian | Data | NA | NA | NA |
Internet News Groups | Text- Internet | NA | Japanese, Russian, German | NA | JIS, KOI8, ISO Latin-1 | NA | NA |
Central News Agency | | NA | Taiwanese | NA | Big5 | NA | NA |
Nasa Borba | | | Serbo-Croatian | | ISO Latin-1 | | |
Krungthep Thurakej | | | Thai | | TIS-620 | | |
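
Because the sources above arrive in many different character encodings, a conversion step needs some mapping from each encoding label to a concrete codec. The sketch below shows one way this could look using Python's standard codecs. The dictionary `CODEC_FOR_LABEL` and the function `to_utf8` are hypothetical names, and the choices for the ambiguous labels "GB", "JIS", and "KOI8" are guesses rather than facts about the LDC data.

```python
import codecs

# Assumed mapping from encoding labels to Python standard codec names.
# "GB", "JIS", and "KOI8" are ambiguous labels; the codecs chosen for
# them here are guesses, not something the inventory specifies.
CODEC_FOR_LABEL = {
    "ISO Latin-1": "latin_1",
    "ISO 8859-6": "iso8859_6",
    "GB": "gb2312",
    "JIS": "iso2022_jp",
    "EUC-KR": "euc_kr",
    "Trigem Johab": "johab",
    "Big5": "big5",
    "TIS-620": "tis_620",
    "KOI8": "koi8_r",
}


def to_utf8(raw: bytes, label: str) -> bytes:
    """Re-encode raw bytes from the labelled source encoding to UTF-8."""
    codec = CODEC_FOR_LABEL[label]  # KeyError for an unknown label
    codecs.lookup(codec)            # LookupError if the codec is unavailable
    return raw.decode(codec).encode("utf-8")
```

Decoding here is strict, so malformed input raises `UnicodeDecodeError` rather than being silently replaced; whether that is the right policy depends on how noisy a given feed is.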