The automated categorisation (or classification) of texts
into topical categories has a long history, dating back at least to the
early '60s. Until the late '80s, the most effective approach to the problem
seemed to be that of manually building automatic classifiers by means of
knowledge-engineering techniques, i.e. manually defining a set of rules
encoding expert knowledge on how to classify documents under a given set
of categories. In the '90s, with the booming production and availability
of on-line documents, automated text categorisation witnessed an increased
and renewed interest, which prompted the machine learning approach to
automatic classifier construction to emerge and definitively supersede
the knowledge-engineering approach. Within the machine learning paradigm, a
general inductive process (called the learner) automatically builds a classifier
(also called the rule, or the hypothesis) by learning, from
a set of previously classified documents, the characteristics of one or
more categories. The advantages of this approach are very good effectiveness,
considerable savings in terms of expert manpower, and domain independence.
In this tutorial we look at the main approaches that have been taken towards
automatic text categorisation within the general machine learning paradigm.
Issues pertaining to document indexing, classifier construction, and classifier
evaluation will be discussed in detail. A final section will be devoted
to the techniques that have been devised specifically for emerging applications
such as the automatic classification of Web pages into "Yahoo!-like" hierarchically
structured sets of categories.
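The learner/classifier distinction described above can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes learner with add-one smoothing, which is only one of many possible inductive processes and is not claimed to be the method favoured by the tutorial. The names `learn`, `classify`, `tokenize` and `TRAINING_SET`, as well as the toy documents and categories, are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very crude indexing step (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def learn(training_set):
    """The learner: builds a classifier from (text, category) pairs."""
    term_counts = defaultdict(Counter)  # category -> term -> frequency
    doc_counts = Counter()              # category -> number of training documents
    vocabulary = set()
    for text, category in training_set:
        doc_counts[category] += 1
        for term in tokenize(text):
            term_counts[category][term] += 1
            vocabulary.add(term)
    total_docs = sum(doc_counts.values())

    def classify(text):
        """The classifier (the 'hypothesis'): assigns exactly one category."""
        best_category, best_score = None, -math.inf
        for category in doc_counts:
            # Log prior of the category.
            score = math.log(doc_counts[category] / total_docs)
            total_terms = sum(term_counts[category].values())
            for term in tokenize(text):
                if term not in vocabulary:
                    continue  # terms unseen during training are ignored
                # Add-one (Laplace) smoothed log likelihood of the term.
                score += math.log(
                    (term_counts[category][term] + 1)
                    / (total_terms + len(vocabulary))
                )
            if score > best_score:
                best_category, best_score = category, score
        return best_category

    return classify


# Hypothetical set of previously classified documents.
TRAINING_SET = [
    ("the team won the match", "sport"),
    ("the striker scored a goal", "sport"),
    ("the bank raised interest rates", "finance"),
    ("shares fell on the stock market", "finance"),
]

classifier = learn(TRAINING_SET)
print(classifier("the goal decided the match"))
print(classifier("the market reacted to interest rates"))
```

No classification rule is written by hand here: replacing `TRAINING_SET` with documents from another domain yields a classifier for that domain, which is the domain independence mentioned above.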
This section details the contents of the tutorial,
including approximate timing information. A preliminary version of the
slides on which the tutorial will be based can be downloaded for inspection.
A definition of the text categorisation task; Single-label and multi-label categorisation; Category-pivoted and document-pivoted categorisation
Automatic indexing for Boolean information retrieval systems; Document organisation; Document filtering; Resolution of linguistic ambiguities; Yahoo!-style search space categorisation
Fabrizio Sebastiani (born 1960) graduated in Computer Science summa cum laude
at the University of Pisa, Italy, in 1986. From 1986 to 1988 he
worked as a researcher at the Department of Linguistics of the University
of Pisa; since 1988 he has been a member of the research staff
of CNR-IEI. In 1989/90 he was a Visiting Scientist at the Department
of Computer Science, University of Toronto, Canada, where he worked
on non-monotonic reasoning; in 1993/94 he was a Visiting Scientist
at the Department of Computing Science, University of Glasgow, UK, where
he worked on the application of logic and probability to information retrieval;
in 1998 he was a Visiting Scientist at the Department of Computing
Science, University of Dortmund, Germany, where he worked on automated
text categorisation. He is currently involved in the CEC-funded ESPRIT
LTR Project EUROSEARCH, dealing with the design of a European, multilingual
federation of search engines. He has published several papers in international
journals and conferences in the areas of natural language processing, logic-based
knowledge representation, and information retrieval. His main current
interest is the application of machine learning to automated text categorisation.
Other information of interest: