Multilingual Information Management:

Chapter 3. Cross-lingual Information Extraction and Automated Text Summarization

Editor: Eduard Hovy

Contributors:

Ralph Grishman

Jerry Hobbs

Eduard Hovy

Antonio Sanfilippo

Yorick Wilks

Abstract

Information Extraction (IE) and Text Summarization are two methods of extracting relevant portions of the input text. IE produces templates, whose slots are filled with the important information, while Summarization produces one of various types of summary. Over the past 15 years, IE systems have come a long way, with commercial applications being around the corner. Summarization, in contrast, is a much younger enterprise. At present, it borrows techniques from IR and IE, but still requires a considerable amount of research before its unique aspects will be clearly understood.

3.1 Definitions: Information Extraction and Text Summarization

The world of text is huge and expanding. As illustrated by the World Wide Web, important information will continue to become available as text. Two factoids highlight the importance of systems that can accurately and quickly identify the relevant portions of texts automatically: in five years’ time, the major use of computers will be for business and government intelligence, and a large percentage of the data available electronically is in the form of natural language text. A successful information extraction technology therefore has a central role to play in the future of computing.

In this chapter we discuss both Information Extraction (IE) and Automated Text Summarization. At a high level, their goal is the same: find those portion(s) of the given text(s) that are relevant to the user’s task, and deliver that information to the user in the form most useful for further (human or machine) processing. Considering them more closely reveals that IE and Summarization are two sides of a coin, and that a different emphasis of output and techniques results in two quite different-looking branches of technology. In both cases, the input is either a single document or a (huge) collection of documents.

The differences between IE and Summarization lie mainly in the techniques used to identify the relevant information and in the ways that information is delivered to the user. Information Extraction is the process of identifying relevant information where the criteria for relevance are predefined by the user in the form of a template that is to be filled. Typically, the template pertains to events or situations, and contains slots that denote who did what to whom, when, and where, and possibly why. The template builder has to predict what will be of interest to the user and define its slots and selection criteria accordingly. If successful, IE delivers the template, filled with the appropriate values, as found in the text(s). Figure 1 contains three filled templates for the given text.

The Financial Times.

A breakthrough into Eastern Europe was achieved by McDonalds, the American fast food restauranteur, recently through an agreement with Hungary’s most successful agricultural company, Babolna, which is to provide most of the raw materials. Under the joint venture, 5 McDonalds "eateries" are being opened in Budapest which, until now at least, has been the culinary capital of Eastern Europe.

<ENTITY-1375-12> :=

NAME: McDonalds

NATIONALITY: U.S. (COUNTRY)

TYPE: Company

<ENTITY-1375-13> :=

NAME: Babolna

NATIONALITY: Hungary (COUNTRY)

TYPE: Company

<EVENT-12-19007> :=

TYPE: Financial-expansion

PARENT-COMPANY: <ENTITY-1375-12>

SUBSIDIARY-COMPANY: <ENTITY-1375-13>

LOCATION: Hungary (COUNTRY)

SIZE: 5

Figure 1. Example text and templates for Information Extraction.

In contrast, Text Summarization does not necessarily start with a predefined set of criteria of interest; when it does, they are not specified as a template, but at a higher granularity (i.e., expressed in keywords or even whole paragraphs), and hence are less computationally precise. The benefit is that the user can specify dynamically, at run time, what he or she is interested in, but cannot so easily pinpoint exact entities or events or interrelationships. In this, Summarization resembles Information Retrieval (see Chapter 2). Summarization delivers either an Extract (a verbatim rendition of some portions of the text) or an Abstract (a compressed and reformulated version of the contents of some portions of the text).

Generally, from the user’s perspective, IE can be glossed as "I know what specific pieces of information I want–just find them for me!", while Summarization can be glossed as "What’s in the text that is interesting?". Technically, from the system builder’s perspective, the two applications blend into each other. The most pertinent technical aspects are:

Are the criteria of interestingness specified at run-time or by the system builder?
Is the input a single document or multiple documents?
Is the extracted information manipulated, either by simple content delineation routines or by complex inferences, or just delivered verbatim?
What is the grain size of the extracted units of information–individual entities and events, or blocks of text?
Is the output formulated in language, or in a computer-internal knowledge representation?

Thus, although IE and Summarization blend into one another, the processing performed by IE engines generally involves finite state machines and NLP techniques, while Summarization systems tend to employ IR-like processing.

3.2 Relationships with Other Areas

Both Information Extraction and Text Summarization are related to other language processing applications. For example, Information Retrieval (IR; see Chapter 2) systems return sets of relevant documents in response to a query; hopefully the answer is contained in the documents. Thus IR can be used to locate strings within fixed corpus windows, producing Summarization (and, in the limit, IE-like) results. This is true mostly for query-based Extract summaries.

IE is not the same as Question Answering (QA) by computer, because QA (usually) operates over databases and provides answers to specific queries, one at a time. It is clearly useful in QA applications, however.

Similarly, both Summarization and IE can fruitfully be linked to Machine Translation (MT; see Chapter 4) to perform multilingual information access. One can, for example, translate a document and then perform IE on it as a whole, or one can first perform IE on it and then just translate the parts that IE returns.

Despite such differences, it is becoming apparent that IE, QA, IR, Summarization, and MT form a complex of interrelated information access methods. In a typical application, IR may be performed before IE or summarization, to cut down text search; the database of templates that IE subsequently produces can then be searched with IR or QA, or can be summarized; the results can then be translated by MT. This ordering is not the only one, obviously, but reflects the relative speeds and costs of the different tasks.

Overall, at the present time, Information Extraction and Summarization must be distinguished, on the one hand, from IR (that locates documents or parts of documents, generally using simple keyword techniques), and on the other, from full text understanding (which, if it existed, would be able to process all the information, relevant or not, and determine implicit nuances of meaning and intent, using semantics and inference). Mere document retrieval is inadequate for our needs. Full text understanding does not yet exist. Information extraction and summarization occupy a middle ground, providing needed functionality while at the same time being computationally feasible.

3.3 Where We Were Five Years Ago

3.3.1 Origins and Development of IE

An early instance of what is today called an IE system was FRUMP, the Ph.D. thesis of DeJong at Yale University (DeJong, 1979). Given a newspaper text, its task was to recognize which of approximately seven event templates (earthquake, visit of state, terrorism event, etc.) to employ, and then to fill in the template’s slots with relevant information. Similar work was performed at NYU (Sager, 1970) and other locations. But IE became a serious large-scale research effort in the late 1980s, with the onset of the Message Understanding Conference (MUC) series (MUC, 1996). This series, promoted by the US Department of Defense (through DARPA), has had the beneficial effects of:

gradually and systematically increasing the complexity of the input texts (from a navy sublanguage in 1989 to general newspaper text in1997);
gradually increasing the topic range of the input texts (from a single narrow topics in 1989 to several topics in 1997);
encouraging increasing sophistication of the template definition notation, ranging over military, intelligence, and recently also commercial templates;
supporting the identification of the various core subtasks inherent in IE subtask (from a single undifferentiated task in 1989 to at least 4 tasks in 1997), and supporting the creation of distinct evaluations for each;
developing clearly articulated evaluation measures for each subtask;
helping establish typical baseline performance scores for each subtask.

As a result, in just under twenty years, an endeavor that was a fledgling dream in 1979 has started coming to market in the late 1990s. Example systems were developed by General Electric and Lockheed.

3.3.2 Origin and Types of Summarization

Automated text summarization is an old dream–the earliest work dates back to the 1950s–that has lain dormant for almost three decades. Only in the last five years has large-scale interest in summarization resurfaced, partly as a result of the information explosion on the Web, but also thanks to faster computers, larger corpora and text storage capacity, and the emergence in Computational Linguistics of statistics-based learning techniques.

Still, little enough is known about summarization per se. Even the (relatively few) studies in Text Linguistics do not provide an exhaustive categorization of the types of summaries that exist. One can distinguish at least the following:

an Extract is a selection of some of the material of the original, while an Abstract is a condensation and reformulation of the original;
a Generic summary provides the author’s point of view, while a Query-based summary focuses on material of interest to the user;
an Informative summary reflects the content of the original text, possibly spelling out the arguments, while an Indicative summary merely provides an indication of what the original was about;
a Just-the-News summary provides just the newest facts, assuming the reader is familiar with the topic, while a Background summary teaches about the topic;
a Neutral summary tries to be objective, while a Biased summary extracts and formulates the content from some point of view.

The precise differences between these various types is not yet known, nor the places or tasks for which each is most suitable. The genre-specificity of these types is not known either (for example, biased summaries are probably more relevant to editorials than travel reports). However, as described below, recent research has established some important initial methods, baseline performances, and standards.

3.4 Where We Are Now

3.4.1 IE Today

Information extraction research has been rather successful in the past five or six years. The name recognition components of the leading systems have achieved near-human performance in English and Japanese, and are approaching that in Chinese and Spanish. For the task of event recognition (who did what to whom, when, and where), this technology has achieved about 60% recall and 70% precision, in both English and Japanese (human inter-annotator agreement on this task ranged between 65% and 80% in one study). Both these tasks are approaching human-level performance.

Over the past few years, Information Extraction has developed beyond the initial task, which was simply the extraction of certain types of information from a rather artificial sublanguage of the navy, into a set of distinct subtasks, each one concentrating on one core aspect of IE:

Recognition of named entities (both the identification of such entities and their classification as persons, companies, organization, locations, etc.). Experiences in the MUC series over the past few years have indicated an improved understanding of how to train named entity recognizers systems automatically from annotated corpora.
Identification of template elements (those entities that are filled into the template slots). In the seventh MUC conference (Grishman and Sundheim, 1996), improved performance on the template element task was demonstrated.
Recognition of template relations (the interrelationships among entities that make them somehow pertinent to be included in the template). MUC-7 results also demonstrated good performance on the (new) template relation task. Perhaps most interesting, BBN has provided evidence that good performance on these two tasks can be obtained through corpus-based training methods.
Recognition of events (the selection of an appropriate scenario template depending on what is encountered in the input). Event recognition performance, however, seems stuck at 50%-60% accuracy.

In several of these subtasks, the 60%—70% performance barrier has been notoriously difficult to break through. The scores in the top group in every MUC evaluation since 1993 have been roughly the same (bearing in mind however that the MUC tasks have become more complex). The primary advance has been that more and more sites are able to perform at this level, because the techniques used have converged. Moreover, building systems that perform at this level currently requires a great investment in time and expertise. In addition, the vast bulk of the research so far has been done only on written text and only in English and a few other major languages.

What developments will ensure higher performance? The following aspects deserve further investigation:

more comprehensive treatment of linguistic phenomena (e.g., aspectuals, reference phenomena, etc.);
better modeling of domain semantics, including more general world knowledge;
better automated learning methods to acquire background knowledge and to induce selection criteria for template slots.

Current State and Research Questions for IE

The dominant technology in Information Extraction is finite-state transducers, frequently cascaded (connected in serial) to break a complex problem into a sequence of easier sub-problems; a nice example is provided in (Knight and Graehl, 1997), the transliteration of proper names from Japanese to English. Such transducers have shown their worth in recognizing low-level syntactic constructions, such as noun groups and verb groups, and identifying higher-level, domain-relevant, clausal patterns. A key feature of these transducers is their automatic trainability; they do not require hand-crafted rules, which are difficult and expensive to produce, only enough training examples on which to learn input-output behaviors.

Present-day Information Extraction systems are far from perfect. A deep question is whether system performance (measured, say, by Recall and Precision rates) can be made good enough for serious investment. A second question is whether new applications can be found for which template-like relevance criteria are appropriate; the fact that templates have to be constructed by the system builder, prior to run-time, remains a bottleneck. An important issue is scalability–if the cost of producing templates flexibly and fast for new domains cannot be made acceptable, IE will never enjoy large-scale use. Further questions pertain to improving the effectiveness of results by employing models of the user’s likes and dislikes and tuning lexicons to domains. Finally, although much of the research on IE has focused on scanning news articles and filling templates with the relevant event types and participants, this is by no means the only application. This core technology could be applied to a wide range of natural language applications.

3.4.2 Summarization Today

Before the recent TIPSTER program, North America and Europe combined had fewer than ten research efforts, all small-scale, devoted to the problem. A notable exception was the pioneering experiments of (Jacobs and Rau, 1990). Three of them were part of larger commercial efforts, namely the systems of Lexis-Nexis, Oracle, and Microsoft. No system was satisfactory, and no measures of evaluation were commonly recognized.

Given the youth of summarization research, the past five years has witnessed some rapid growth. Most systems developed today perform simple extraction of the most relevant sentences or paragraphs of a given (single) document, using a variety of methods, many of them versions of those used in IR engines. Where IR systems identify the good documents out of a large set of documents, Extraction Summarizers identify the good passages out of a single document’s large set of passages. Various methods of scoring the relevance of sentences or passages and combining the scores are described in (Miike et al., 1994; Kupiec et al., 1995; Aone et al., 1997; Strzalkowski et al., 1998; Hovy and Lin, 1998).

Naturally, however, there is more to summarization than extraction. Some concept fusion techniques are explored in (Hovy and Lin, 1998) and in (Hahn, 1999). Since they require significant world knowledge (the system requires knowledge not explicitly in the text in order to be able to decide how to fuse selected concepts into a more general, abstract, or encompassing concept), it is not likely that practical-use Abstraction Summarizers will be built in the near future.

Evaluation of Summarization Systems

We focus here on the developments in summarization evaluation, since they express current capabilities. Not counting the evaluation of three systems in China in 1996 and the work at Cambridge University in recent years (Sparck Jones, 1998), there has been one formal Text Summarization evaluation of competing systems performed by a neutral agency, to date. The SUMMAC evaluation (Firmin Hand and Sundheim, 1998; Mani et al., 1998), part of the TIPSTER program in the USA, announced its results in May 1998.

The SUMMAC results show that it is hard to make sweeping statements about the performance of summarization systems, (a) because they are so new; (b) because there are so many kinds of summaries; and (c) because there are so many ways of measuring performance. Generally speaking, however, one must measure two things of a summary: the Compression Ratio (how much shorter is the summary than the original?) and the Omission Ratio (how much information have you retained)? Measuring length is easy, but measuring information (especially relevant information) is hard. Several approximations have been suggested (Hovy and Lin, 1999):

The Shannon Game: a variant of Shannon’s measures in Information Theory, one can ask people to reconstruct the original having seen either the full text, or a summary, or (as control) no text. Preliminary experiments have found an order of magnitude difference across the three levels–a phenomenal result.
The Question Game: you ask assessors to answer questions that have been previously drawn up about the original, comparing how they score after reading the original or after reading the summary. A version of this test was run as part of SUMMAC; see below for results.
The Classification Game: you ask assessors to classify the texts (either the originals or summaries) into one of N categories, and measure the correspondence of classification of summaries to originals. A good summary should be classified in the same bin as its original. Two versions of this test were run in SUMMAC; see below.

More work is required to understand the best ways of implementing these measures.

The SUMMAC evaluations were applied to 16 participating systems, unfortunately without humans to provide baselines. All systems produced Extracts only. SUMMAC was no small operation; it took some systems over a full day to produce the several thousand summaries, and it took a battery of assessors over two months to do the judgements. The SUMMAC measures were selected partly because they followed the use of IR measures (Recall and Precision). It has been argued that this biased the proceedings toward IR-like systems.

In the Ad Hoc Task (one variant of the classification game), 20 topics were selected, and for each topic, 50 texts had to be summarized, with respect to the topic. This test was supposed to measure how well the system can identify in the originals just the material relevant to the user. To evaluate, human assessors read the summaries and decided if they were relevant to the query topic or not. The more relevant summaries the system produced, the better it was considered to be. It occurred that relevant summaries were produced out of non-relevant originals, a fact not surprising post hoc but something no-one quite knows how to interpret.

In the Categorization Task (another variant of the classification game), 10 topics were selected, and 100 texts per topic. Here the systems did not know the topics, and simply had to produce a generic summary, which human assessors then classified into one of the 10 topic bins. The more its summaries classified in the same bins as their originals, the better the system was considered to be.

In the Q&A Task, 3 topics were selected, and systems received 90 articles per topic to summarize. Of these, 30 summaries were read by assessors, who answered a predefined question set of 4 or 5 (presumably) relevant questions (the same questions for each summary, in each topic) for each summary. The more questions the assessors could answer correctly, the better the system’s summaries were considered.

The Ad Hoc results partitioned the systems into three classes, ranging from F-score (average Recall and Precision) of 73% down to 60%. The Categorization results showed no significant difference between systems, all at approx. 53%. The Q&A results were very length-sensitive, with systems scoring between 45% and 20% (scores normalized by summary length).

Unfortunately, since the evaluation did not include human summaries as baselines, it is impossible to say how well the systems fared in a general way. One can say though that:

by simply extracting the first 20% of a newspaper article, one can do as well as any system did on Categorization (this is due to the way newspaper articles are written in English newspapers);
shorter summaries did less well consistently; summaries shorter than 20% of the text length were not successful. This is partly due to the fact that systems produced Extracts only; a ‘true’ summary, an Abstract, is about 1/3 of the length of its corresponding Extract (Marcu, 1999);
for any text, it is easy to say which sentences are definitely good ones and which are definitely bad when it comes to making a summary. The trouble is that generally about 60% of the sentences are not clearly good or bad, and it is extremely difficult to explain why one should or should not include them in a consistent way.

3.5 Where We Will Be in Five Years

The current state of affairs for IE and Summarization indicates that we must focus on seven critical areas of research in the near future.

1. Multilinguality (Going beyond English): The IE and Summarization technology must be extended to other languages. As illustrated in MUC-5, the success of the approach in languages as different as English and Japanese is strongly suggestive of the universality of the approach. In the EU, the current Language Engineering projects ECRAN, AVENTINUS, SPARKLE, TREE, and FACILE all address more than one language, while in the US, the MUC-7 task of named entity recognition addressed English, Chinese, Japanese, and Spanish.

Additional work is required on other languages to see which unique problems arise. For example, fundamentally different techniques may be required for languages that make very heavy use of morphology (e.g., Finnish) or have a much freer word order than English. It should not be difficult to get a good start in a large number of languages, since our experience with English and other larger European languages is that a significant level of performance can be gained with rather small grammars for noun groups and verb groups augmented by sets of abstract clause-level patterns. A giant stride toward translingual systems could be achieved by supporting many small projects for doing just this for a large number of languages.

To support such work, it would be useful to develop an automated extraction architecture that works across languages. It should have the following features:

accommodate different levels of initial sentence analysis,
perform at least rudimentary segmentation, morphology, low-level syntax,
employ a common representation at the level of event patterns,
if possible, use functional syntactic labels.

Initial work in this regard is promising; for example, experiments on multilingual extraction have been an excellent basis for international cooperative efforts for NYU, working with other universities on extraction of Spanish, Swedish, and Japanese (Grishman, 1998).

Initial experiments on multilingual text summarization at USC/ISI are also highly promising. To the extent that the summarization engines employ language-neutral methods derived from IR, or to the extent that language-specific methods can be simplified and easily ported to other languages (for example, simple part of speech tagging), it appears that a summarizer producing Extracts for one language can fairly quickly be adapted to work in other languages. The capability to produce extract summaries of Indonesian was added to ISI’s SUMMARIST system in less than two person-months (Lin, 1999), given the fortunate facts that online dictionaries were already at hand and Bahasa Indonesia is not a highly inflected language.

Two possibilities exist for configuring multilingual IE systems. The first is a system that does monolingual IE in multiple languages, one evocation of the system for each language. Here translation occurs twice: once of the template patterns into their language-specific forms, and once after extraction of the extracted information back into the user’s language. The second is a system that does monolingual IE, operating over the documents once they (or some portion of them) have been translated into the user’s language. The tradeoffs here, between translation time/effort, accuracy, and coverage, are exactly the same as those of cross-language Information Retrieval, discussed in Chapter 2.

An additional benefit of multilingual Information Extraction and Summarization is their utility for other applications, such as Machine Translation. Machine Translation is very hard because it is so open-ended. But an IE engine could be applied to the source language text to extract only relevant information, and then only the relevant information would need to be translated. Such an MT system would be much more tractable to build, since its input would be pre-filtered, and in many instances would provide exactly the required functionality; see Chapter 4.

2. Cross-Document Event Tracking (Going beyond Single Documents): Most research in Information Extraction and Summarization has focused on gleaning information from single documents at a time. This has involved recognizing the coreference of entities and events when they are described or referred to in different areas of the text. These techniques could be used to identify the same entities or events when they are talked about in different documents as well. This would allow analysts to track the development of an event as it unfolds across a period of time. Many events of interest–revolutions, troop buildups, hostile takeovers, lawsuits, product developments–do not happen all at once, and if the information from multiple documents can be fused into a coherent picture of the event, the analyst’s job of tracking the event is made much easier. Recent work (the SUMMONS system, (Radev, 1998)) provides some valuable heuristics for identifying cross-document occurrences of the same news, and for recognizing conflicts and extensions of the information at hand.

3. Adaptability (Going beyond Templates): One of the major limitations of current IE systems is that template slots and their associated filling criteria must be anticipated and encoded by the system builder. Much IE research in the past has focused on producing templates that encode the structure of relevant events. This was a useful focus since the kind of information encoded in templates is central in many applications, and the task is easily evaluated. Similarly, the need for run-time user specification of importance criteria was underlined in the SUMMAC Ad Hoc summarization task. But we must shift our focus now more specifically to the ways the technology is to be embedded in real-world applications, useful also to non-Government users.

One approach is to develop methods that recognize internal discourse structure and partition text accordingly. Ongoing work on discourse-level analysis (Marcu, 1997) and text segmentation (Hearst, 1993) holds promise for the future.

Going beyond structural criteria, one can begin to address text meaning itself. A simple approach has been developed for IR and adapted for Summarization systems. Lexical cohesion is one of the most popular basic techniques used in text analysis for the comparative assessment of saliency and connectivity of text fragments. Extending this technique to include simple thesaural relations such as synonymy and hyponymy can help to capture word similarity in order to assess lexical cohesion among text units, although they do not provide thematic characterizations of text units. This problem can be addressed by using a dictionary database providing information about the thematic domain of words (e.g., business, politics, sport). Lexical cohesion can then computed with reference to discourse topics rather than (or in addition to) the orthographic form of words. Such an application of lexical cohesion makes it possible to detect the major topics of a document automatically and to assess how well each text unit represents these topics. Both template extensions for IE and query-based indicative summaries can then be obtained by choosing one or more domain codes, specifying a summary ratio and retrieving the wanted portion of the text that best represents the topic(s) selected. Deriving and storing the required world knowledge is a topic addressed under Ontologies in Chapter 1.

4. Portability and Greater Ease of Use (Going beyond Computational Linguists): We need to achieve high levels of performance with less effort and less expertise. One aspect of this is simply building better interfaces to existing systems, both for the developer and for the end-user. But serious research is also required on the automatic acquisition of template filler patterns, which will enable systems for much larger domains than is typical in today’s MUC evaluations. An underlying ontology and a large library of common, modifiable patterns in the business news and the geopolitical domains would be very useful for analysts seeking to make specially tailored information extraction systems

Several means exist by which such large libraries of patterns can be acquired. Obviously, an analysis of a user’s annotations of texts, performed using a suitable interface, is one way. The recent application of statistical learning techniques to several problems in Computational Linguistics (see Chapter 6) is another. For such methods, we need ‘smarter’ learning techniques–ones that are sensitive to linguistic structures and semantic relations, and so can learn from a smaller set of examples. In order not to have to learn everything anew, it is important that the systems be able to build upon and adapt prior knowledge.

5. Using Semantics (Going beyond Word-Level Processing): Ultimately, we have to transcend the 60%—70% level of performance for IE. As the amount of information available expands, so will the demands for greater coverage and greater accuracy. There are several reasons for this barrier. On the one hand, there are a large number of linguistic problems that must be solved which are infrequent enough that solving any one of them will not have a significant impact on performance. In addition, several problems are pervasive and require general methods going beyond the finite-state or, in some cases, ad hoc approaches that are in use today. These problems include the MUC tasks of entity and event coreference.

More generally, significantly better performance on natural language tasks will require us to tackle seriously the problem of (semantic) inference, or knowledge-based NLP. The primary problem with this as a research program is that there is a huge start-up time, with no immediate payoffs: a very large knowledge base encoding commonsense knowledge must be built up. In order to have a viable research program, it will be necessary to devise a sequence of increments toward full knowledge-based processing, in which each increment yields improved functionality. One possibility for getting this research program started is to experiment with the construction and extension of knowledge bases such as WordNet, SENSUS, and CYC to enable serious natural language processing problems, such as resolving ambiguities, coreference, and metonymies. Some exploratory work has been done in this area as well; see Chapter 1.

The lack of semantic knowledge is a serious shortcoming for Text Summarization. Almost every Text Summarization system today produces Extracts only. The problem is that to produce Abstracts, a system requires world knowledge to perform concept fusion: somewhere it must have recorded that menu+waiter+order+eat+pay can be glossed as visit a restaurant and that he bought apples, pears, bananas, and oranges can be summarized as he bought fruit. The knowledge required is obviously not esoteric; the problem simply is that we do not yet have adequately large collections of knowledge, appropriately organized.

While the query expansion lists of IR are a beginning in this direction, effort should be devoted to the (semi-automated) creation of large knowledge bases. Such knowledge can serve simultaneously to help disambiguate word meanings during semantic analysis, expand queries accurately for IR, determine correct slot filling during IE, enable appropriate concept fusion for Summarization, and allow appropriate word translation in MT.

6. Standardized Evaluation of Summarization (Going beyond IR Measures): While no-one will deny the importance of text summarization, the current absence of standardized methods for evaluating them is a serious shortcoming. Quite clearly, different criteria of measurement apply in the different types of summary; for example, an adequate Extract is generally approximately three 3 times as long as its equivalent Abstract for newspaper texts (Marcu, 1999); and a Query-based summary might seem inadequately slanted from the author’s perspective.

Much NLP evaluation makes the distinction between black-box and glass-box evaluation. For the former, the system–however it may work internally, and whatever its output quality–is evaluated in its capacity to assist users with real tasks. For the latter, some or all of the system’s internal modules and processing are evaluated, piece by piece, using appropriate measures.

A similar approach can obviously be taken for text summarization systems. Jones and Galliers (1996), for example, formulate a version of this distinction as intrinsic vs. extrinsic, the former measuring output quality (only) and the latter measuring assistance with task performance. Most existing evaluations of summarization systems are intrinsic. Typically, the evaluators create a set of ideal summaries, one for each test text, and then compare the output of the summarization engine to it, measuring content overlap in some way (often by sentence or phrase recall and precision, but sometimes by simple word overlap). Since there is no single ‘correct’ ideal summary, some evaluators use more than one ideal per test text, and average the score of the system across the set of ideals. Extrinsic evaluation, on the other hand, is much easier to motivate. The major problem is to ensure that the metric applied does in fact correlate well with task performance efficiency.

Recognizing the problems inherent to summary evaluation, Jing et al. (1998) performed a series of tests with very interesting results. Instead of selecting a single evaluation method, they applied several methods, both intrinsic and extrinsic, to the same (extract only) summaries. Their work addressed two vexing questions: the agreement among human summarizers and the effect of summary length on summary rating. With regard to inter-human agreement, Jing et al. found fairly high consistency in the news genre, but there is some evidence that other genres will deliver less consistency. With regard to summary length, Jing et al. found great sensitivity for both recall and precision, and concluded that precision and recall are not ideal measures, partly due to the interchangeability of some sentences. They find no correlation between summary length and task performance, and recommend that mandating a fixed length can be detrimental to system performance.

The complexity of the problem and the bewildering variety of plausible evaluation measures makes the topic an interesting but far from well-understood one.

7. Multimedia (Going beyond Written Text): Information Extraction and Text Summarization techniques must be applied to other media, including speech, OCR, and mixed media such as charts and tables. Corresponding information must be extracted from visual images, and the information in the various media must be fused, or integrated to form a coherent overall account. Chapter 9 discusses the cross-relationships of information in different media.

With respect to speech and OCR, the input to the Information Extraction or Summarization system is more noisy and more ambiguous. But it could be that extraction technology could be used to reduce this ambiguity, for example, by choosing the reading that is richest in domain-relevant information.

Speech lacks the capitalization and punctuation that provide important information in written text, but it has intonation that provides similar or even richer information. We need to learn how to exploit this to the full.

In analyzing broadcast news, a wide variety of media come into play. A scene of men in suits walking through a door with flash bulbs going off may be accompanied by speech that says Boris Yeltsin attended a meeting with American officials yesterday, over a caption that says "Moscow". All of this information contributes to an overall account of the scene, and must be captured and fused.

The creation of non-text summaries out of textual material, such as the tabulation of highly parallel information in lists and tables, is a research topic for Text Summarization with high potential payoff.

3.6 Conclusion

IE and Summarization technologies, even at the current level of performance, have many useful applications. Two possibilities are data mining in large bodies of text and the improvement of precision in document retrieval applications such as web searches. The strong influence of the US Department of Defense on IE development has somewhat obscured the fact that commercial IE will have to be automatically adaptable to new domains. Important IE application areas include searching patents, searching shipping news, searching financial news, searching for terrorist or drug related news reports, searching entertainment information, searching (foreign) Internet material (see Chapter 2). Applications for Summarization include handling of specialized domains and genres (such as legal and medical documents), summarization for educational purposes, news watch for business (say, for tracking competition) or intelligence (for tracking events in foreign countries), and so on.

In both cases, more work will greatly improve the utility of current technology. Although early IE and Summarization systems can be found on the market, they do not yet perform at levels useful to the average person, whether for business or education. The ability to tailor results to the user’s current purpose is central, as well as the ability to merge information from multiple languages and multiple sources. A longer-term goal is the ability to merge and fuse information into abstractions, generalizations, and possibly even judgments.

At present, funding for IE and Summarization is at an all-time low. The US umbrella for IE and summarization, TIPSTER, ended in 1998. The joint NSF-EU call for cross-Atlantic collaboration on Natural Language Processing issues holds some hope that research will commence again in 2000. Given the importance of these applications, the progress made in the past five years, and the many unanswered questions remaining about IE and Text Summarization, further research on these topics is likely to be highly beneficial.

3.7 References

Aone, C., M.E. Okurowski, J. Gorlinsky, B. Larsen. 1997. A Scalable Summarization System using Robust NLP. Proceedings of the Workshop on Intelligent Scalable Text Summarization, 66—73. ACL/EACL Conference, Madrid, Spain.

DeJong, G.J. 1979. FRUMP: Fast Reading and Understanding Program. Ph.D. dissertation, Yale University.

Firmin Hand, T. and B. Sundheim. 1998. TIPSTER-SUMMAC Summarization Evaluation. Proceedings of the TIPSTER Text Phase III Workshop. Washington.

Grishman, R. and B. Sundheim (eds). 1996. Message Understanding Conference 6 (MUC-6): A Brief History. Proceedings of the COLING-96 Conference. Copenhagen, Denmark (466—471).

Hovy, E.H. and C-Y. Lin. 1998. Automating Text Summarization in SUMMARIST. In I. Mani and M. Maybury (eds), Advances in Automated Text Summarization. Cambridge: MIT Press.

Hovy, E.H. and C-Y. Lin. 1999. Automated Multilingual Text Summarization and its Evaluation. Submitted.

Jing, H., R. Barzilay, K. McKeown, and M. Elhadad. 1998. Summarization Evaluation Methods: Experiments and Results. In E.H. Hovy and D. Radev (eds), Proceedings of the AAAI Spring Symposium on Intelligent Text Summarization (60—68).

Jones, K.S. and J.R.Galliers. 1996. Evaluating Natural Language Processing Systems: An Analysis and Review. New York: Springer.

Knight, K. and J. Graehl. 1997. Machine Transliteration. Proceedings of the 35^th ACL-97 Conference. Madrid, Spain, (128—135).

Lin, C-Y. 1999. Training a Selection Function for Extraction in SUMMARIST. Submitted.

Marcu, D. 1997. The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. Ph.D. dissertation, University of Toronto.

Marcu, D. 1999. The Automatic Construction of Large-scale Corpora for Summarization Research. Forthcoming.

Jacobs, P.S. and L.F. Rau. 1990. SCISOR: Extracting Information from On-Line News. Communications of the ACM 33(11): 88—97.

Kupiec, J., J. Pedersen, and F. Chen. 1995. A Trainable Document Summarizer. In Proceedings of the 18th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), 68—73. Seattle, WA.

Mani, I. et al. 1998. The TIPSTER Text Summarization Evaluation: Initial Report.

Miike, S., E. Itoh, K. Ono, and K. Sumita. 1994. A Full-Text Retrieval System with Dynamic Abstract Generation Function. Proceedings of the 17th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR-94), 152—161.

Radev, D. 1998. Generating Natural Language Summaries from Multiple On-Line Sources: Language Reuse and Regeneration. Ph.D. dissertation, Columbia University.

Reimer, U. and U. Hahn. 1998. A Formal Model of Text summarization Based on Condensation Operators of a Terminological Logic. In I. Mani and M. Maybury (eds), Advances in Automated Text Summarization. Cambridge: MIT Press.

Sager, N. 1970. The Sublanguage Method in String Grammars. In R.W. Ewton, Jr. and J. Ornstein (eds.), Studies in Language and Linguistics (89—98).

Sparck Jones, K. 1998. Introduction to Text Summarisation. In I. Mani and M. Maybury (eds), Advances in Automated Text Summarization. Cambridge: MIT Press.

Strzalkowski, T. et al., 1998. ? In I. Mani and M. Maybury (eds), Advances in Automated Text Summarization. Cambridge: MIT Press.

[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter4.html .]

[Please send any comments to Robert Frederking (ref+@cs.cmu.edu, Web document maintainer) or Ed Hovy or Nancy Ide.]