Multilingual Information Management:

Current Levels and Future Abilities

[This report is available as http://www.cs.cmu.edu/people/ref/mlim/index.html .]

[It is also available as a single [very large] page: http://www.cs.cmu.edu/~ref/mlim/index.shtml .]

[Please send any comments to Robert Frederking (ref+@cs.cmu.edu, Web document maintainer) or Ed Hovy or Nancy Ide.]

A report

Commissioned by the US National Science Foundation

and also delivered to

the European Commission’s Language Engineering Office

and the US Defense Advanced Research Projects Agency

April 1999

Editors:

Eduard Hovy, USC Information Sciences Institute (co-chair)

Nancy Ide, Vassar College (co-chair)

Robert Frederking, Carnegie Mellon University

Joseph Mariani, LIMSI-CNRS

Antonio Zampolli, University of Pisa

Foreword

Gary W. Strong, DARPA and NSF, USA

The Internet is rapidly bringing to the foreground the need for people to be able to access and manage information in many different languages. Even in cases where people have been lucky enough to learn several languages, they will still need help in effectively participating in the global information society. There are simply too many different languages, and all of them are important to somebody.

While machine translation has a long history (over 50 years), computer technology now appears ready for the next great push in multilingual information access and management, particularly over the World Wide Web. The European Commission and several US agencies are taking bold steps to encourage research and development in multilingual information technologies. The EC and the US National Science Foundation, for example, have recently issued a joint call for Multilingual Information Access and Management research. The US Defense Advanced Research Projects Agency is supporting a new effort in Translingual Information Detection, Extraction, and Summarization research. Both are direct results of international planning efforts, and of this Granada effort in particular.

No one was more surprised than the Granada workshop participants at the rapid growth of interest in Multilingual Information Management research. Attendees of the workshop in Granada, Spain, had hardly unpacked their bags when they were asked to present the results in Washington, DC, at a National Academy of Sciences workshop on international research cooperation. The US White House expressed interest in the topic as a groundbreaking effort for a new US-EU Science Cooperation Agreement. Now, DARPA has decided to invest in a multi-year, large-scale effort to push the envelope on rapid development of multilingual capability in new language pairs.

The world is surely shrinking as advances in communication and computation proceed at a breathtaking pace. On the other hand, there is no doubt that people will continue to hold on to the values and beliefs of their native cultures. This includes holding on to the language of their families and ancestors. This is a treasure, a cultural knowledge base that must not be weakened even as pressures to be able to speak common languages increase. Therefore, efforts in multilingual technology not only allow us to share the knowledge and resources of the world, they also allow us to preserve the individual human qualities that have allowed us to progress and solve problems that we all share.

I thank all whose efforts have gone into this workshop report and the resource that it represents for future efforts in the field. Those who proceed to carry on the needed research and development being called for from around the world will surely find this report to be of great value.

Introduction: The Goals of the Report

Over the past 50 years, a variety of language-related capabilities has been developed in machine translation, information retrieval, speech recognition, text summarization, and so on. These applications rest upon a set of core techniques such as language modeling, information extraction, parsing, generation, and multimedia planning and integration; and they involve methods using statistics, rules, grammars, lexicons, ontologies, training techniques, and so on.

It is a puzzling fact that although all of this work deals with language in some form or other, the major applications have each developed a separate research field. For example, there is no reason why speech recognition techniques involving n-grams and hidden Markov models could not have been used in machine translation 15 years earlier than they were, nor is there any good reason why some of the lexical and semantic insights from the subarea called Computational Linguistics are still not used in information retrieval.

This picture will rapidly change. The twin challenges of massive information overload via the web and ubiquitous computers present us with an unavoidable task: developing techniques to handle multilingual and multi-modal information robustly and efficiently, with the highest quality possible.

The most effective way for us to address such a mammoth task, and to ensure that our various techniques and applications fit together, is to start talking across the artificial research boundaries. Extending the current technologies will require integrating the various capabilities into multi-functional and multi-lingual natural language systems.

However, at this time there is no clear vision of how these technologies could or should be assembled into a coherent framework. What would be involved in connecting a speech recognition system to an information retrieval engine, and then using machine translation and summarization software to process the retrieved text? How can traditional parsing and generation be enhanced with statistical techniques? What would be the effect of carefully crafted lexicons on traditional information retrieval? At which points should machine translation be interleaved within information retrieval systems to enable multilingual processing?

The purpose of this study is to address these questions, in an attempt to identify the most effective future directions of computational linguistics research and, in particular, how to address the problems of handling multilingual and multi-modal information. To gather information, a workshop was held in Granada, Spain, immediately following the First International Conference on Linguistic Resources and Evaluation (LREC) at the end of May 1998. Experts in various subfields from Europe, Asia, and North America were invited to present their views regarding the following fundamental questions:

What is the current level of capability in each of the major areas of the field dealing with language and related media of human communication?
How can (some of) these functions be integrated in the near future, and what kind of systems will result?
What are the major considerations for extending these functions to handle multi-lingual and multi-modal information, particularly in integrated systems of the type envisioned?

The experts were invited to represent the following areas:

multilingual resources (lexicons, ontologies, corpora, etc.)
information retrieval, especially cross-lingual and cross-modal
machine translation
automated (cross-lingual) summarization and information extraction
multimedia communication, in conjunction with text
speech processing, especially multilingual
evaluation and assessment techniques for each of these areas
methods and techniques (both statistics-based and linguistics-based) of pre-parsing, parsing, generation, information acquisition, etc.
government: funding and development policy

In a series of ten sessions, one session per topic, the experts explained their perspectives and participated in panel discussions that attempted to structure the material and hypothesize about where we can expect to be in a few years’ time. Their presentations, comments, and notes were collected and synthesized into ten chapters by a team of chapter editors.

A second workshop, this one open to the general computational linguistics public, was held immediately after the COLING-ACL conference in Montreal in August 1998. This workshop provided a forum for public discussion and critique of the material gathered at the first meeting. Subsequently, the chapter editors updated and refined the ten chapters.

This report is formed out of the presentations and discussions of a wide range of experts in computational linguistics research, at the workshops and later. We are proud and happy to present it to representatives and funders of the US and European Governments and other relevant associations and agencies.

We hope that this study will be useful to anyone interested in assessing the future of multilingual language processing.

We would like to thank the US National Science Foundation and the Language Engineering division of the European Commission for their generous support of this study.

Eduard Hovy and Nancy Ide, Editorial Board Co-chairs

Contributors

Nuria Bel, GILCUB, Spain

Christian Boitet, GETA, France

Nicoletta Calzolari, ILC-CNR, Italy

George Carayannis, ILSP, Greece

Lynn Carlson, Department of Defense, USA

Jean-Pierre Chanod, XEROX-Europe, France

Khalid Choukri, ELRA, France

Ron Cole, Colorado State University, USA

Bonnie Dorr, University of Maryland, USA

Christiane Fellbaum, Princeton University, USA

Christian Fluhr, CEA, France

Robert Frederking, Carnegie Mellon University, USA

Ralph Grishman, New York University, USA

Lynette Hirschman, MITRE Corporation, USA

Jerry Hobbs, SRI International, USA

Eduard Hovy, USC Information Sciences Institute, USA

Nancy Ide, Vassar College, USA

Hitoshi Iida, ATR, Japan

Kai Ishikawa, NEC, Japan

Frederick Jelinek, Johns Hopkins University, USA

Judith Klavans, Columbia University, USA

Kevin Knight, USC Information Sciences Institute, USA

Kamran Kordi, Entropic, England

Gianni Lazzari, ITC, Italy

Bente Maegaard, Center for Sprogteknologi, Denmark

Joseph Mariani, LIMSI-CNRS, France

Alvin Martin, NIST, USA

Mark Maybury, MITRE Corporation, USA

Giorgio Micca, CSELT, Italy

Wolfgang Minker, LIMSI-CNRS, France

Doug Oard, University of Maryland, USA

Akitoshi Okumura, NEC, Japan

Martha Palmer, University of Pennsylvania, USA

Patrick Paroubek, CIRIL, France

Martin Rajman, EPFL, Switzerland

Roni Rosenfeld, Carnegie Mellon University, USA

Antonio Sanfilippo, Anite Systems, Luxembourg

Kenji Satoh, NEC, Japan

Oliviero Stock, IRST, Italy

Gary Strong, National Science Foundation, USA

Beth Sundheim, SPAWAR/NCCOSC, USA

Nino Varile, European Commission, Luxembourg

Charles Wayne, Department of Defense, USA

John White, Litton PRC, USA

Yorick Wilks, University of Sheffield, England

Antonio Zampolli, University of Pisa, Italy

Table of Contents

Chapter 1. Multilingual Resources (lexicons, ontologies, corpora, etc.)

Editor: Martha Palmer

Chapter 2. Cross-lingual and Cross-modal Information Retrieval

Editors: Judith Klavans and Eduard Hovy

Chapter 3. Automated Cross-lingual Information Extraction and Summarization

Editor: Eduard Hovy

Chapter 4. Machine Translation

Editor: Bente Maegaard

Chapter 5. Multilingual Speech Processing

Editor: Joseph Mariani

Chapter 6. Methods and Techniques of Processing

Editor: Nancy Ide

Chapter 7. Speaker/Language Identification, Speech Translation

Editor: Gianni Lazzari

Chapter 8. Evaluation and Assessment Techniques

Editor: John White

Chapter 9. Multimedia Communication, in Conjunction with Text

Editors: Mark Maybury and Oliviero Stock

Chapter 10. Government: Policies and Funding

Editors: Antonio Zampolli and Eduard Hovy

[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter1.html .]
