Text typology and automatic text categorization


Text typology

EAGLES Preliminary Recommendations on Text Typology (June 1996)
This Report and its recommendations should be read in conjunction with EAGLES interim recommendations on Corpus Typology (EAGLES, 1996a).
Text Linguistics, Translation Theory and Interpreting
The purpose of this project is to develop models for empirical and theoretical study of the simultaneous interpreting process, including assessment of theories about text understanding and text production, text/discourse types, and the applicability of translation theory to simultaneous interpreting of expert (LSP) discourse.

References

Robert-Alain de Beaugrande & Wolfgang Ulrich Dressler. 1981. Introduction to Text Linguistics. Longman.

Vijay K. Bhatia. 1993. Analysing Genre. Language use in professional settings. Longman.

Douglas Biber. 1989. A Typology of English Texts. Linguistics 27: 3-43.

Douglas Biber. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press. (Review by Nigel Armstrong, cached copy)

Douglas Biber & Edward Finegan. 1986. An initial typology of English text types. In Jan Aarts & Willem Meijs (Eds.). Corpus Linguistics II: New Studies in the Analysis and Exploitation of Computer Corpora. Rodopi: 19-46.

Douglas Biber, S. Conrad, and R. Reppen. 1998. Corpus linguistics: Investigating language structure and use. Cambridge University Press.

Douglas Biber, S. Johansson, G. Leech, S. Conrad, E. Finegan. 1999. The Longman grammar of spoken and written English. Longman.

Philip R. Cohen & C. Raymond Perrault. 1979. Elements of a Plan-Based Theory of Speech Acts. Cognitive Science 3: 177-212.

S. Conrad & Douglas Biber (eds.). 2001. Variation in English: Multi-Dimensional studies. Longman.

EAGLES. 1996. Preliminary Recommendations on Text Typology. http://www.ilc.pi.cnr.it/EAGLES/texttyp/texttyp.html

James L. Kinneavy. 1980. A Theory of Discourse. Norton.

M.A.K. Halliday & R. Hasan. 1976. Cohesion in English. Longman.

Sara Laviosa. 1998. The English Comparable Corpus: A resource and a methodology. In Lynne Bowker, Michael Cronin, Dorothy Kenny & Jennifer Pearson (Eds.). Unity in Diversity? Current Trends in Translation Studies. St. Jerome Publishing.

Junsaku Nakamura. 1991. The relationships among genres in the LOB corpus based upon the distribution of grammatical tags. Jacet Bulletin 22: 55-74.

Christiane Nord. 1997. A Functional Typology of Translations. In Trosborg: 43-65.

Roel Popping. 2000. Computer-assisted Text Analysis. Sage.

Roda P. Roberts. 1995. Towards a Typology of Translations. Hieronymus Complutensis 1: 69-78.

Michael Stubbs. 1996. Text and Corpus Analysis. Blackwell.

John M. Swales. 1990. Genre Analysis. English in academic and research settings. Cambridge University Press.

Anna Trosborg. 1997. Text Typology: Register, Genre and Text Type. In Trosborg: 3-23. (notes)

Anna Trosborg (Ed.). 1997. Text Typology and Translation. John Benjamins.

Automatic text categorization

A Tutorial on Automated Text Categorization by Fabrizio Sebastiani
Full text in various formats from ResearchIndex.
Machine Learning for Automated Text Categorization by Fabrizio Sebastiani (copy)
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early '60s. Until the late '80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories.
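
The knowledge-engineering approach described above can be pictured with a minimal sketch. The categories, keywords, and the function name `classify_by_rules` below are hypothetical illustrations, not rules taken from any real system: an expert writes, by hand, the words that trigger each category.

```python
import re

# Hypothetical hand-written rules: a category applies when the text
# contains at least one of its trigger keywords.
RULES = {
    "sports": {"match", "goal", "league"},
    "finance": {"bank", "shares", "market"},
}


def classify_by_rules(text, rules=RULES):
    """Return the sorted list of categories whose keywords occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, keywords in rules.items() if tokens & keywords)


print(classify_by_rules("The bank sold its shares after the match"))
```

Every new category or domain needs new rules written by an expert, which is the cost that the machine learning approach aims to avoid.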

In the '90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed renewed interest. This prompted the emergence of the machine learning paradigm for automatic classifier construction, which has definitively superseded the knowledge-engineering approach.

Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by learning, from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm.
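
As a rough illustration of a learner that builds a classifier from previously classified documents, here is a minimal sketch. It assumes a multinomial naive Bayes model with add-one smoothing, which is only one of many possible learners and is not named in the abstract; the function names `train` and `classify` and the tokenizer are likewise hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Learner: collect per-category statistics from (text, category) pairs."""
    doc_counts = Counter()              # number of documents per category
    word_counts = defaultdict(Counter)  # token frequencies per category
    vocabulary = set()
    for text, category in labelled_docs:
        doc_counts[category] += 1
        for token in tokenize(text):
            word_counts[category][token] += 1
            vocabulary.add(token)
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Classifier: return the category with the highest log-probability."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore tokens never seen during training
            count = word_counts[category][token]
            # Add-one smoothing keeps unseen (token, category) pairs non-zero.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

A usage example with a tiny invented training set:

```python
training_set = [
    ("the team scored a late goal", "sports"),
    ("the league match ended in a draw", "sports"),
    ("the bank raised interest rates", "finance"),
    ("shares fell on the stock market", "finance"),
]
model = train(training_set)
print(classify(model, "a goal in the match"))
```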

Issues pertaining to document indexing, classifier construction, and classifier evaluation will be discussed in detail. A final section will be devoted to the techniques that have specifically been devised for an emerging application: the automatic classification of Web pages into "Yahoo!-like" hierarchically structured sets of categories.
Machine Learning for Categorization of Text Documents and Web Pages (Fabrizio Sebastiani & Alessandro Sperduti, July 2001)
In this tutorial we look at the main approaches that have been taken towards automatic text categorization within the general machine learning paradigm. A general presentation of the basic issues in document categorization will be followed by the presentation of basic machine learning concepts and techniques (such as linear separators and decision trees) and advanced ones (such as boosting and support vector machines). Then issues pertaining to document indexing, classifier construction, and classifier evaluation will be discussed in detail, and a review of the most relevant current research in text categorization by machine learning tools will be presented. Finally, the special case of automatic classification of Web pages is considered, and the concepts and techniques specifically devised for this case are discussed.
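
Classifier evaluation is mentioned above without detail. One common way to carry it out, assumed here rather than taken from the tutorial description, is to compare predicted and true categories on held-out documents and compute precision, recall, and F1 per category. The function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Compute precision, recall and F1 for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    # Guard against division by zero when a category is never predicted
    # or never occurs in the true labels.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


true_labels = ["sports", "sports", "finance", "finance"]
predicted = ["sports", "finance", "finance", "finance"]
print(precision_recall_f1(true_labels, predicted, "finance"))
```

In this invented example one sports document is wrongly assigned to finance, so finance has perfect recall but precision of two thirds.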
Text Classification (LGT, Sep. 1995)
Text categorization and text routing both involve taking a text, and assigning keywords to it, to reflect its content. The applications of categorization and routing are many and varied. For example, large companies sometimes use a text routing tool to scan incoming telexes and assign a keyword to them, typically the name of the department or of the person the telex should go to.
Bibliography on Automated Text Categorization, by Fabrizio Sebastiani

Cataloguing

Open Directory Project
Reference > Libraries > Library and Information Science > Technical Services > Cataloguing
Cataloguer's Toolbox, homepage for the Bibliographic Control Services of the Queen Elizabeth II Library at Memorial University of Newfoundland.

MARC

MARC Record Contents
MARC 21 formats
XML/MARC, at Stanford; examples: record 1, record 2, record 3 (without XSL stylesheet). MARC and SGML.
Cataloging Internet Resources (The codes and tags follow OCLC MARC)

UDC

El sistema de clasificación decimal universal (The Universal Decimal Classification System), Miguel Benito
Outline of the UDC

Grupo DELi, Universidad de Deusto, Feb. 2002.