Machine Learning for Automated Text Categorization

A Tutorial

Fabrizio Sebastiani

Istituto di Elaborazione dell'Informazione

Consiglio Nazionale delle Ricerche

56100 Pisa, Italy


The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early '60s. Until the late '80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. In the '90s, with the booming production and availability of on-line documents, automated text categorisation has attracted renewed and increased interest; as a result, the machine learning paradigm of automatic classifier construction has emerged and definitively superseded the knowledge-engineering approach. Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by learning, from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are very good effectiveness, considerable savings in terms of expert manpower, and domain independence. In this tutorial we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues pertaining to document indexing, classifier construction, and classifier evaluation will be discussed in detail. A final section will be devoted to the techniques that have been devised specifically for an emerging application: the automatic classification of Web pages into "Yahoo!-like" hierarchically structured sets of categories.

Detailed Contents of the Tutorial

This section details the contents of the tutorial, including approximate timing information. A preliminary version of the slides on which the tutorial will be based is available for download.

Introduction [15 min.]

  • A definition of the text categorisation task
  • Single-label and multi-label categorisation
  • Category-pivoted and document-pivoted categorisation

Applications of document categorisation [30 min.]

  • Automatic indexing for Boolean information retrieval systems
  • Document organisation
  • Document filtering
  • Resolution of linguistic ambiguities
  • Yahoo!-style search space categorisation

The machine learning approach to text categorisation [20 min.]

Indexing and dimensionality reduction [40 min.]

Methods for the inductive construction of a classifier [150 min.]

Determining thresholds [20 min.]

Evaluation issues for text categorisation [40 min.]

Automatic categorisation of Web pages [30 min.]

Conclusion [15 min.]

Biographical sketch of the tutor

Fabrizio Sebastiani (born 1960) graduated in Computer Science summa cum laude from the University of Pisa, Italy, in 1986. From 1986 to 1988 he worked as a researcher at the Department of Linguistics of the University of Pisa; since 1988 he has been a member of the research staff of CNR-IEI. In 1989/90 he was a Visiting Scientist at the Department of Computer Science, University of Toronto, Canada, where he worked on non-monotonic reasoning; in 1993/94 he was a Visiting Scientist at the Department of Computing Science, University of Glasgow, UK, where he worked on the application of logic and probability to information retrieval; in 1998 he was a Visiting Scientist at the Department of Computing Science, University of Dortmund, Germany, where he worked on automated text categorization. He is currently involved in the CEC-funded ESPRIT LTR Project EUROSEARCH, which deals with the design of a European, multilingual federation of search engines. He has published several papers in international journals and conferences in the areas of natural language processing, logic-based knowledge representation, and information retrieval. His main current interest is the application of machine learning to automated text categorization.

Other information of interest: