Chapter 6. Methods and Techniques of Processing

 

Editor: Nancy Ide

Contributors:

Jean-Pierre Chanod

Jerry Hobbs

Eduard Hovy

Frederick Jelinek

Martin Rajman

 

 

Abstract

Almost every subarea of Language Processing has seen research conducted in two major paradigms. The symbolic and statistical approaches to language processing are often regarded as (at best) complementary and (at worst) at odds with one another, although the line between them can be blurry. This chapter outlines the nature and history of the two methodologies and shows why and how they necessarily complement one another.

 

6.1 Statistical vs. Symbolic: Complementary or at War?

In the history of Language Processing, two principal paradigms came into conflict during some period in almost every major branch–Information Retrieval (see Chapter 2) in the 1960s, Automated Speech Recognition (Chapter 5) in the 1970s, Machine Translation (Chapter 4) in the 1990s. In all cases, this time was rather traumatic, sometimes leading to completely separate professional organizations, journals, and conference series. The two paradigms can be called the "symbolic" and the "statistical" approaches to automatic language processing (see, for example, Klavans and Resnik, 1997). This distinction is made roughly on the following basis:

One can view Language Processing as a process of transformation between input (the language sample, in speech or text) and output (the desired result–translation, summary, query result, etc., depending on the application). In this view, the symbolic approach is primarily concerned with how to decompose the transformation process into stages: which stages are to be created, and what notations are to be invented for them? Traditional answers to these questions include morphology, syntactic analysis (parsing), semantic analysis, discourse analysis, text planning, and so on. In contrast, the statistical approach is primarily concerned with how automatically to construct systems (or rules for systems) that effect the necessary transformations among stages. Traditional techniques include vector spaces, collecting alternatives and then ranking them using various metrics, counting frequencies, and measuring information content in various ways.
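
To make the statistical style concrete, here is a minimal sketch (in Python, with invented counts and a hypothetical part-of-speech task; nothing here is drawn from any specific system) of "collecting alternatives and then ranking them" by corpus frequency:

    from collections import Counter

    # Hypothetical corpus-derived counts of how often each candidate analysis
    # (here, a part-of-speech tag) was observed for a given word.
    observed = Counter({("book", "NOUN"): 320, ("book", "VERB"): 45})

    def rank_candidates(word, candidates):
        """Return the candidate analyses for `word`, most frequent first."""
        return sorted(candidates, key=lambda c: observed[(word, c)], reverse=True)

    print(rank_candidates("book", ["VERB", "NOUN"]))   # ['NOUN', 'VERB']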

The common wisdom is that the symbolic approach, based on deep analysis of the phenomena in question, delivers higher quality than the statistical approach, which considers only relatively superficial phenomena. Furthermore, symbolic methods do not require massive amounts of data, nor the often intensive human effort needed to annotate that data appropriately. On the other hand, statistics-based methods are generally considered superior because they tend to be more robust in the face of unexpected types of input, where rule-based systems simply break down. Furthermore, since statistics-based systems use automated methods to identify common patterns and create transformations or rules for them, they are well suited to phenomena that do not exhibit simple and clear regularity, and can produce many rules rapidly. In contrast, symbolic approaches are limited by slow manual analysis and rule-building, which is generally costly, difficult, and often incomplete.

A good example is provided by grammars of languages, one of the most obvious candidates for human analysis and rule construction, and a favorite subject of syntacticians for four decades. But even for grammars, the tendency of natural language toward exceptions and complexity bedevils the symbolic rule-building approach–no complete or even fully adequate grammar of any human language has yet been built, despite decades (or even centuries) of effort! Typically, for example, it requires about 2 person-years to build an adequate grammar for a commercial-quality machine translation system, and eventually involves approximately 500—800 grammar rules, giving roughly 80% coverage of arbitrary input sentences. In contrast, recent automated grammar learning systems produce on the order of 2000—25000 grammar rules after a few months of human-guided training, and produce results with over 90% coverage (Collins, 1996, 1997; Hermjakob and Mooney, 1997; Hermjakob, 1999). While the latter set of rules is rarely as elegant as the humans’ rules, one cannot argue with the results.

One cannot conclude that statistics wins, however. Rather, this example uncovers a more subtle relationship between the two approaches, one that illustrates their necessary complementarity. Without some knowledge of syntactic categories and phenomena, no automated rule-learning system would be able to learn any grammar at all. The learning systems have to be told what it is that they must learn: their training corpora have to be annotated according to some theory. The better the theory, the more powerful the eventual result, and the more elegant and parsimonious, generally speaking, the learned rules. This example highlights a harmonious and productive balance between human analyst and learning system: it is the human’s job (possibly aided by automated tools that discover patterns) to decide on the appropriate level(s) of representation and the appropriate representational notations and terms; it is the learning system’s job to learn the rules that transform the input into the desired notation as accurately as possible.

This view leads to the fundamental questions surrounding the supposed dichotomy between the two approaches: Can statistics tell us anything about language? Can it contribute to the development of linguistic models and theories? On the other hand, do we need linguistic theory to do language processing, or, like Orville and Wilbur Wright, can we build an airplane that flies with little or no understanding of aerodynamics?

6.2 Where We Are Coming From

The history of natural language processing research is most conveniently dated from efforts in the early 1950s to achieve automatic translation. Although quantitative/statistical methods were embraced in the early machine translation work, interest in statistical treatment of language waned among linguists in the mid-1960s, due to the trend toward generative linguistics sparked by the theories of Zellig Harris (1951) and bolstered most notably by the transformational theories of Noam Chomsky (1957). In Language Processing, attention then turned toward deeper linguistic analysis and hence toward sentences rather than whole texts, and toward contrived examples and artificially limited domains instead of general language.

As described in more detail in Chapter 5, the history of Automated Speech Recognition was a typical example. After a considerable amount of research based on phonemes, word models, and the human articulatory channel, a new paradigm involving Hidden Markov Models (HMMs) was introduced by F. Jelinek and others in the 1970s (Baker, 1975). This paradigm required data to statistically train an Acoustic Model to capture typical sound sequences and a Language Model to capture typical word sequences, and produced results that were far more accurate and robust than the traditional methods. This work was heavily influenced by the information theoretic tradition of Shannon and Weaver (1949). The US Department of Defense DARPA Human Language Technology program, which started in 1984, fostered an evaluation-driven comparative research program that clearly demonstrated the advantages of the statistical approach (DARPA, 1989—94). Gradually, the HMM statistical approach became more popular, both in the US and abroad. The problem was seen as simply mapping from sound sequences to word sequences.
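
In the standard noisy-channel formulation (stated here for orientation; the notation is ours rather than the chapter's), recognition searches for the word sequence W that is most probable given the acoustic evidence A, with the Acoustic Model supplying P(A | W) and the Language Model supplying P(W):

    \hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)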

During this time, the speech community worked almost entirely independently of the other Language Processing communities (machine translation, information retrieval, computational linguistics). The two communities’ respective approaches to language analysis were generally regarded as incompatible: the speech community relied on training data to induce statistical models, independent of theoretical considerations, while computational linguists relied on rules derived from linguistic theory.

In the machine translation community, the so-called Statistics Wars occurred during the period 1990—1994. Before this time, machine translation systems were exclusively based on symbolic principles (Chapter 4), including large research efforts such as Eurotra (Johnson et al., 1985). In the late 1980s, again under the influence of F. Jelinek, the CANDIDE research project at IBM took a strictly non-linguistic, purely statistical approach to MT (Brown et al., 1990). Following the same approach as the speech recognition systems, they automatically trained a French-English correspondence model (the Translation Model) on 3 million sentences of parallel French and English from the Canadian Parliamentary records, and also trained a Language Model for English production from Wall Street Journal data. To translate, CANDIDE used the former model to replace French words or phrases by the most likely English equivalents, and then used the latter model to order the English words and phrases into the most likely sequences to form output sentences. DARPA sponsored a four-year competitive research and evaluation program (see Chapter 8 for details on MTEval (White and O’Connell, 1992—94)), pitting CANDIDE against a traditional symbolic MT system (Frederking et al., 1994) and a hybrid system (Yamron et al., 1994). The latter system was built by a team led by the same J. Baker who performed the 1975 speech recognition work.
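
The following toy sketch illustrates that two-stage strategy: a translation table proposes likely English equivalents, and a language model chooses the most fluent ordering. The word-for-word table, the bigram probabilities, and the exhaustive reordering are invented simplifications for illustration, not the IBM models themselves.

    from itertools import permutations

    # Toy translation model: P(english | french) for single words (invented numbers).
    t_model = {"chien": {"dog": 0.9, "hound": 0.1},
               "le":    {"the": 0.95, "him": 0.05}}

    # Toy bigram language model: P(next | previous), with <s> marking sentence start.
    lm = {("<s>", "the"): 0.4, ("the", "dog"): 0.3, ("<s>", "dog"): 0.05, ("dog", "the"): 0.01}

    def lm_score(words):
        p, prev = 1.0, "<s>"
        for w in words:
            p *= lm.get((prev, w), 1e-6)   # tiny floor for unseen bigrams
            prev = w
        return p

    def translate(french_words):
        # Step 1: the translation model picks the most likely equivalent of each word.
        picks = [max(t_model[f], key=t_model[f].get) for f in french_words]
        # Step 2: the language model selects the most probable ordering.
        return max(permutations(picks), key=lm_score)

    print(translate(["chien", "le"]))   # ('the', 'dog')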

Unlike the case with speech recognition, the evaluation results were not as clear-cut. Certainly, CANDIDE’s ability to produce translations at the same level as SYSTRAN’s (one of the oldest and best commercial systems for French to English) was astounding. Yet CANDIDE was not able to outperform SYSTRAN or other established MT systems; its main contribution was recognized to be a method for rapidly creating a new MT system up to competitive performance levels. The reasons for this performance ceiling are not clear, but a certain amount of consensus has emerged. As discussed in the next section, it has to do with the fact that, unlike speech recognition, translation cannot operate adequately at the word level, but must involve more abstract constructs such as syntax.

The introduction of statistical processing into machine translation was paralleled by its introduction into the Computational Linguistics community. In the late 1980s, the situation changed quite rapidly, due largely to the increased availability of large amounts of electronic text. This development enabled, for the first time, the full-scale use of data-driven methods to attack generic problems in computational linguistics, such as part-of-speech identification, prepositional phrase attachment, parallel text alignment, word sense disambiguation, etc. The success in treating at least some of these problems with statistical methods led to their application to others, and by the mid-1990s, statistical methods had become a staple of computational linguistics work.

The timing of this development was fortuitous. The explosion in the 1990s of the Internet created opportunities and needs for computational searching, filtering, summarization, and translation of real-world quantities of online text in a variety of domains. It was clear that the purely symbolic approach of the previous 30 years had not produced applications that were robust enough to handle the new environments. As a result, computational linguists began mining large corpora for information about language in actual use, in order to objectively evaluate linguistic theory and provide the basis for the development of new models. Instead of applications that worked very well on domain specific or "toy" data, computational linguists began working on applications that worked only reasonably well on general text, using models that incorporated notions of variability and ambiguity.

While symbolic methods continue to hold their own, in some areas the balance has clearly shifted to statistical methods. For example, as described in Chapter 3, the information extraction community largely abandoned full-sentence parsing in the early 1990s in favor of "light parsing", generally using a cascade of finite-state transducers. This was a result of an inability to resolve syntactic ambiguities with any reliability using the methods then available. In the last few years, apparently significant advances have been made in statistical parsing, particularly in the work of Magerman (1995), Collins (1996, 1997), Hermjakob and Mooney (1997), Hermjakob (1999), and Charniak (1997). Charniak reports a labeled bracketing recall score of 87% on sentences shorter than 40 words. By contrast, in the Parseval evaluation of September 1992, the second and last evaluation for hand-crafted grammars, the best system’s labeled bracketing recall rate was 65% on sentences that were all shorter than 30 words. These results led to a general impression that statistical parsing is the clearly superior approach. As a result, research on handcrafted grammars in computational linguistics has virtually ceased in the US.

6.3 Where We Are Now

6.3.1 Current Status for Parsing

Despite the recent success of statistical methods in areas such as parsing, doubts remain about their superiority. One of the major objections arises from the lack of means to compare results directly between symbolic and statistical methods. For example, in the case of parsing, it has been noted that correctly labeled and bracketed syntax trees are not in and of themselves a useful product, and that full-sentence parsing becomes feasible only when a large percentage of complete sentences receive a correct or nearly correct parse. If, for example, incorrect labeled brackets are uniformly distributed over sentences, then the percentage of complete sentences parsed entirely correctly is very low indeed. Thus systems that tend to produce fully correct parses some of the time, and fail completely when they don’t, may outperform systems that succeed partway on all sentences. Unfortunately, papers on statistical parsing generally do not report the percentage of sentences parsed entirely correctly; in particular, they typically do not report the number of sentences parsed without any crossings. A crossing occurs when the TreeBank correct key and the parser’s output bracket strings of words differently, so that they overlap but neither is fully subsumed by the other, as in

Correct key: [She [gave [her] [dog biscuits]]]

Parser output: [She [gave [her dog] biscuits]].
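
A minimal sketch of how such crossings can be counted, representing each bracketing as a set of word-index spans (the index scheme and the toy spans below are ours, not those of any particular scoring script):

    def count_crossings(gold_spans, test_spans):
        """Count parser spans that overlap a gold span without either containing the other."""
        def crosses(a, b):
            (s1, e1), (s2, e2) = a, b
            return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1
        return sum(any(crosses(t, g) for g in gold_spans) for t in test_spans)

    # "She gave her dog biscuits": words 0..4, spans given as (start, end), end exclusive.
    gold = {(0, 5), (1, 5), (2, 3), (3, 5)}   # [She [gave [her] [dog biscuits]]]
    test = {(0, 5), (1, 5), (2, 4)}           # [She [gave [her dog] biscuits]]
    print(count_crossings(gold, test))        # 1: the span "her dog" crosses "dog biscuits"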

Charniak reports a zero-crossing score of 62%, again on sentences of 40 words or fewer. No precisely comparable measure is available for a hand-crafted grammar, but Hobbs et al. (1992) determined in one evaluation using a hand-crafted grammar that 75% of all sentences under 30 morphemes parsed with three or fewer attachment mistakes, a measure that is at least related to the zero-crossings measure. In that same analysis, Hobbs et al. found that 58% of all sentences under 30 morphemes parsed entirely correctly. To compare that with Charniak’s results, Hobbs obtained a printout of the labeled bracketings his system produced and inspected 50 sentences by hand. The sentences ranged in length between 6 and 38 words, with most between 15 and 30. Of these 50 sentences, the TreeBank key was correct on 46. Irrespective of the key, Charniak’s parser was substantially correct on 23 sentences, or 46%. If these results were to stand up under a more direct comparison, they would cast serious doubt on the presumed superiority of statistical methods.

An analysis of the errors made by Charniak’s parser shows that about one third of the errors are attachment and part-of-speech mistakes of the sort that any hand-crafted parser would make; these ironically are just the ones we would expect statistical parsing to eliminate. About a third involve a failure to recognize parallelisms, and consequently conjoined phrases are bracketed incorrectly; a richer treatment of parallelism would help statistical and handcrafted grammars equally. The remaining third are simply bizarre bracketings that would be filtered out by any reasonable handcrafted grammar. This suggests that a hybrid of statistical and rule-based parsing, augmented by a lexically-based treatment of parallelism, could greatly improve on parsers using only one of the two approaches, and thereby bring performance into a range that would make robust full-sentence parsing feasible.

6.3.2 Current Status for Word Sense Disambiguation

Statistical methods now dominate other areas as well, such as word sense disambiguation. In the 1970s and 1980s, several researchers attempted to handcraft disambiguation rules tailored to individual lexical items (e.g., Small and Rieger, 1982). Although their results for individual words were impressive, the sheer amount of effort involved in creating the so-called "word experts" prevented large-scale application to free text. Since the late 1980s, statistical approaches to word sense disambiguation, typically relying on information drawn from large corpora and other sources such as dictionaries, thesauri, etc. (see, for example, Wilks et al., 1990; Ide and Véronis, 1990), have dominated the field. The recent Senseval evaluation for sense disambiguation (Kilgarriff and Palmer, forthcoming) demonstrated that statistics-based systems, with or without the use of external knowledge sources, top out at about 80% accuracy. Although most systems in the competition were strictly statistics-based, the "winner" was a hybrid system that included both statistics and rules handcrafted for individual words in the evaluation exercise. This suggests that statistical methods alone cannot accomplish word sense disambiguation with complete accuracy. Some hybrid of methods, taking into account the long history of work on lexical semantic theory, is undoubtedly necessary to achieve highly reliable results.
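
As an illustration of the statistical style (a toy sketch with invented counts, not a description of any Senseval system), a simple corpus-based disambiguator can score each sense of an ambiguous word by how often the surrounding words co-occurred with that sense in sense-tagged training text:

    from collections import defaultdict

    # Invented co-occurrence counts: how often each context word appeared near
    # each sense of "bank" in hypothetical sense-tagged training data.
    cooc = {"bank/FINANCE": {"money": 52, "loan": 31, "account": 40, "river": 1},
            "bank/RIVER":   {"water": 38, "river": 45, "fishing": 12, "money": 2}}

    def disambiguate(context_words):
        """Pick the sense whose training contexts best match the given context."""
        scores = defaultdict(int)
        for sense, counts in cooc.items():
            for w in context_words:
                scores[sense] += counts.get(w, 0)
        return max(scores, key=scores.get)

    print(disambiguate(["she", "deposited", "money", "in", "the", "account"]))
    # bank/FINANCE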

6.3.3 Current Status for Machine Translation

Except for the speech translation system Verbmobil (Niemann et al., 1997), no large-scale research project in machine translation is currently being funded anywhere in the EU or the US. Smaller projects within the EU are devoted to constructing the support environment for machine translation, such as lexicons, web access, etc.; these include the projects Otello, LinguaNet, Aventinus, Transrouter, and others. Within the US, small research projects are divided between the purely symbolic approach, such as UNITRAN (Dorr et al., 1994), the symbolic tools-building work at NMSU, and Jelinek’s purely statistical approach at the Johns Hopkins University Summer School.

It is, however, instructive to consider what transpired during the years 1990—94 in the competitive DARPA MT program. The program’s two flagship systems started out diametrically opposed, with CANDIDE (Brown et al., 1990) using purely statistical training and Pangloss (Frederking et al., 1994) following the traditional symbolic rule Interlingua approach. Three years and four evaluations later, the picture had changed completely. Both systems displayed characteristics of both statistics and linguistics, and did so in both the overall design philosophy and in the approach taken when constructing individual modules.

For CANDIDE, the impetus was always the drive toward quality; coverage and robustness the system had from the outset. But increased quality can be gained only by using increasingly specific rules, and (short of creating a truly massive table of rules that operates solely over lexemes, and eventually has to contain all possible sentences in the language) the rules have to operate on abstractions, which are represented by symbols. The questions facing CANDIDE’s builders were: which phenomena to abstract over, and what kinds of symbol systems to create for them? Every time a new phenomenon was identified as a bottleneck or as problematic, the very acts of describing the phenomenon, defining it, and creating a set of symbols to represent its abstractions were symbolic (in both senses of the word!). The builders thus were forced to partition the whole problem of MT into a set of relatively isolated smaller problems or modules, each one circumscribed in a somewhat traditional/symbolic way, and then to address each module individually. By December 1994, CANDIDE was a rather typical transfer system in structure, whose transfer rules required some initial symbolic/linguistic analysis of source and target languages, followed by a period of statistical training to acquire the rules.

For Pangloss, the development path was no easier. Pangloss was moved by the drive toward coverage and robustness. Although the Pangloss builders could always theorize representations for arbitrary new inputs and phenomena, Pangloss itself could not. It always needed more rules, representations, and lexical definitions. The Pangloss builders had to acquire more information than could be entered by hand, and so, in the face of increasingly challenging evaluations, were compelled to turn toward (semi-)automated information extraction from existing repositories, such as dictionaries, glossaries, and text corpora. The extracted rules were more general, providing not just the correct output for any input but a list of possible outputs for a general class of inputs, which were then filtered to select the best alternative(s). By the twin moves of extracting information from resources (semi-)automatically and of filtering alternatives automatically, Pangloss gradually took steps toward statistics.

6.3.4 Differences in Methodology and Technology

It is instructive to compare and contrast the methodologies of the two paradigms. Though good research in either paradigm follows the same sequence of five stages, the ways in which they follow them and the outcomes can differ dramatically.

The five stages of methodology:

  1. Stage 1: gathering data. Both symbolic/linguistic and statistical paradigms consider this stage to be critical. Typically, far more data is required for the statistical paradigm, since humans are better at creating generalizations than machines. However, since humans are less exhaustive than machines, they may overlook subtle patterns of difference.

  2. Stage 2: analysis and theory formation. Both paradigms perform this step manually; typically, it involves some linguistic/symbolic reasoning, generalization, and concept formation. The outcome of this stage for the symbolic paradigm is a (proto-) theory, possibly involving a set of symbols, that guides all subsequent work. The outcome of this stage for the statistical paradigm is a parametric model of the problem, ready for automated rule learning.

  3. Stage 3: construction of rules or data items such as lexical items. The symbolic paradigm performs this stage manually. The rule or data item collections typically number between a few dozen and a few thousand. Considerable effort may be expended on ensuring internal consistency, especially as the theory tends to evolve when new cases are encountered. In contrast, the statistical paradigm performs this stage automatically, under guidance of the parametric model (see the sketch after this list). Typically, thousands or hundreds of thousands of rules or data items are formed, not all of which are ultimately kept. Effort is expended on measuring the power or goodness of each candidate rule or data item.

  4. Stage 4: application of rules and data items in task. In both paradigms, the rules and data items are then used by the accompanying engines, usually automatically.

  5. Stage 5: evaluation and validation. The symbolic paradigm tends to be far more lax in this regard than the statistical one, preferring system-internal measures of growth (the number of new rules, the size of the lexicon, etc.) over external measures, which are often very difficult to create (see Chapter 8). The statistical paradigm finds external, automated, evaluation central, since it provides the clearest guidance to altering the parametric model and thereby improving the system.
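
As a concrete illustration of stage 3 in the statistical paradigm, the sketch below (in Python, with an invented miniature corpus and an arbitrary threshold; it is not drawn from any particular system) reads candidate word-to-tag rules off annotated data, scores each by relative frequency, and keeps only those judged reliable enough:

    from collections import Counter

    # Hypothetical annotated corpus: (word, part-of-speech) pairs.
    corpus = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
              ("the", "DET"), ("bark", "NOUN"), ("bark", "VERB")]

    pair_counts = Counter(corpus)
    word_counts = Counter(w for w, _ in corpus)

    # Candidate rules "word -> tag", scored by relative frequency; prune those
    # whose estimated reliability falls below a (somewhat arbitrary) threshold.
    THRESHOLD = 0.6
    rules = {(w, t): c / word_counts[w]
             for (w, t), c in pair_counts.items()
             if c / word_counts[w] >= THRESHOLD}

    print(rules)   # {('the', 'DET'): 1.0, ('dog', 'NOUN'): 1.0, ('barks', 'VERB'): 1.0}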

Problems with the symbolic paradigm are most apparent in stage 3, since manual rule building and data item collection is slow, and in stage 5, since there is a natural aversion to evaluation if it is not enforced. Problems with the statistical paradigm are apparent in stage 2, since parametric models tend to require oversimplification of complex phenomena, and in stage 1, since the sparseness (or even total unavailability) of suitable training data may hamper development.

It is also instructive to compare and contrast the technology built by the two paradigms.

Four aspects of technology:

Outputs: The symbolic paradigm tends to develop systems that produce a single output per transformation step, while the statistical paradigm tends to produce many outputs per step, often together with ratings of some kind. Later filtering stages then prune out unwanted candidates.

Rules: Symbolic rules tend to have detailed left hand sides (the portions of the rules that contain criteria of rule application), containing detailed features conforming to arbitrarily abstract theories. Statistical rules tend to have left hand sides that are either underspecified or that contain rather surface-level features (i.e., features that are either directly observable in the input or that require little additional prior analysis, such as words or parts of speech).

Behavior: Symbolic systems tend to produce higher quality output when they succeed, but to fail abjectly when their rules or data items do not cover the particular input at hand. In contrast, statistical systems tend to produce lower quality output but treat unexpected input more robustly.

Methods: Symbolic methods include Finite State Algorithms, unification, and other methods of applying rules and data items in grammars, lexicons, etc. Statistical methods include Hidden Markov Models, vector spaces, clustering and ranking algorithms, and other methods of assigning input into general parametric classes and then treating them accordingly.
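
The contrast in rule form can be made schematic (the feature names, values, and probabilities below are invented for illustration): a symbolic rule conditions on detailed, theory-laden features and licenses a single action, while statistical rules condition on surface-observable features and carry scores that let competing rules be ranked.

    # A symbolic rule: detailed, theory-dependent left-hand side, one categorical action.
    symbolic_rule = {
        "lhs": {"category": "NP", "case": "nominative",
                "agreement": {"number": "plural", "person": 3}},
        "rhs": "attach-as-subject",
    }

    # Statistical rules: surface-level left-hand sides (words, parts of speech) plus
    # scores; many such rules compete and later stages prune the weaker candidates.
    statistical_rules = [
        {"lhs": {"prev_tag": "DET", "word_suffix": "s"}, "rhs": "NOUN", "p": 0.71},
        {"lhs": {"prev_tag": "DET", "word_suffix": "s"}, "rhs": "VERB", "p": 0.06},
    ]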

6.3.5 Overall Status

While it is dangerous to generalize, certain general trends do seem to characterize each approach. With respect to processing, two types of modules can be identified:

With respect to data and rules, creation (whether symbolic or statistical) proceeds as follows. For each linguistic phenomenon / translation bottleneck, system builders:

In symbolic systems, data preparation is mostly done by hand (which is why older systems, with the benefit of years’ worth of hard labor, generally outperform younger ones) while in statistical systems data collection is done almost exclusively by computer, usually using the frequency of occurrence of each datum as the basis from which to compute its reliability (probability) value.
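
In the simplest case (ignoring the smoothing that practical systems add for rare and unseen events), the reliability value assigned to a datum is just its relative frequency in the training material:

    P(\text{datum}) \approx \frac{\text{count}(\text{datum})}{N}

where N is the total number of observations of the relevant kind.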

In general, phenomena exhibiting easily identified linguistic behavior, such as grammars of dates and names, seem to be candidates for symbolic approaches, while phenomena with less apparent regular behavior, such as lexically-anchored phrases, require automated rule formation. What constitutes sufficient regularity is a matter both of linguistic sophistication and of patience, and is often legitimately answered differently by different people. Hence, although many phenomena will eventually be treated in all MT systems the same way (either symbolically or statistically), many others will be addressed both ways, with different results.

Experience with statistical and symbolic methods for parsing, word sense disambiguation, and machine translation, then, suggests that neither the symbolic nor the statistical approach is clearly superior. Instead, a hybrid paradigm in which the human’s analysis produces the target for the machine to learn seems to be the most productive use of the strengths of both agencies. This observation may quite likely generalize to all areas of Language Processing. However, at present there is virtually no research into ways of synthesizing the two approaches in the field of computational linguistics. Even among statistics-based approaches, there is little understanding of how various statistical methods contribute to overall results. Systematic consideration of ways to synthesize statistical and symbolic approaches, then, seems to be the next step.

6.4 Where We Go from Here

In order to move toward synthesis of statistical and symbolic approaches to Language Processing, it is first necessary to consider where past experience with both has brought us to date.

What have we learned from theory-based approaches? One of the important (though quite simple) lessons from research in Linguistics is that language is too complex a phenomenon to be accurately treated with simple surface-based models. This fact is acknowledged by the general trend in current linguistic theories toward lexicalization (which is a way to recognize that simple abstract models are not able to represent language complexity without the help of extremely detailed lexical entries), and by the recent turn in computational linguistics to the lexicon as a central resource for sentence and discourse analysis.

Statistical approaches have generally provided us with a clever way to deal with the inherent complexity of language by taking into account its Zipfian nature. In any corpus, a small number of very frequent cases represent a large proportion of occurrences, while the remaining cases represent only a small fraction of occurrences and therefore correspond to rare phenomena. By taking into account the occurrence frequency (which is what probabilities essentially do), a system can quickly cover the most frequent cases and thereby achieve reasonable levels of performance. However, there remains a lot of work to be done to get the remaining cases right; in other words, even the most sophisticated statistical approaches can never achieve 100% accuracy. Furthermore, to the extent they address complex applications, statistical approaches rely on linguistic analysis for guidance. For example, as Speech Recognition research begins to grapple with the problems of extended dialogues, it has to take into account the effects of pragmatic (speaker- and hearer-based) variations in intonation contour, turn-taking noises, and similar non-word-level phenomena.
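
A small sketch of this effect (the token list is invented and far too small; real corpora show the same skew much more strongly): the few most frequent word types already account for a large share of all tokens.

    from collections import Counter

    tokens = ("the of and to in the of a the and the of to in is "
              "semantics parsimony the of and anaphora").split()

    counts = Counter(tokens)

    def coverage(k):
        """Fraction of all tokens accounted for by the k most frequent types."""
        return sum(c for _, c in counts.most_common(k)) / len(tokens)

    print(round(coverage(3), 2))   # 0.57: three types out of ten cover over half the tokens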

While it is coming to be widely acknowledged in both the statistical and symbolic Language Processing communities that synthesis of both approaches is the next step for work in the field, it is less clear how to go about achieving this synthesis. The types of contribution of each approach to the overall problem, and even the contribution of different methods within each approach, is not well understood. With this in view, we can recommend several concrete activities to be undertaken in order to move toward synthesis of symbolic and statistical approaches:

(1) A systematic and comprehensive analysis of various methods, together with an assessment of their importance for and contribution to solving the problems at hand, should be undertaken. This is not an easy task. Perhaps more importantly, it is not a task for which funding is likely to be readily forthcoming, since it leads only indirectly to the production of results and is not an end in itself. The trend toward funding for applications rather than basic research is a problem for the entire field–one which, hopefully, can be addressed and rectified in the future.

It will be necessary to develop a precise methodology for an in-depth analysis of methods. Simplistically, one can imagine a "building block" approach, where methods are first broken into individual components and then combined one by one to gain a better understanding of the nature of the contribution of each to a given task. Such an analysis would have to be done over a wide range of samples of varying language types and genres.

In the end, tradeoffs between the two approaches will certainly be recognized, in terms of, for example, precision vs. efficiency. There exist some analyses of such tradeoffs in the literature, but they are neither comprehensive across fields nor, for the most part, systematic enough to be generalizable.

(2) Similarly, resources, including corpora and lexicons, should also be evaluated for their contribution to Language Processing tasks. Recognition of the importance of the lexicon is increasing, but the amount and kind of information in existing lexicons varies widely. We need to understand what kinds of information are useful for various tasks, and where tradeoffs between information in the lexicon (primarily symbolic) and its use by both symbolic and statistical methods can be most usefully exploited. It is also essential to profit from the existence of large corpora and statistical methods to create these resources. In addition, data-driven systems need to be improved to take into account higher-level knowledge sources without losing computational tractability (which is essential in order to train systems on large volumes of data).

For corpora, the current situation may be even more critical (see also Chapter 1). While large amounts of corpus data exist, current resources are lacking in two significant ways:

(3) Multilinguality is key. There has been almost no study of the applicability of methods across languages or attempts to identify language-independent features that can be exploited in NLP systems across languages. Data-driven techniques are often language-independent, and once again, systematic analysis of what works in a multilingual environment is required. Data annotation, on the other hand, is largely language-dependent, but has to be produced in a standardized way in order to enable both system improvement and evaluation. The standardization of annotation formats (as in, for instance, the European EAGLES effort) and international collaboration are crucial here.

(4) Application technology push. Each of the four major application areas should be stimulated to face its particular challenges. Automated Speech Recognition should continue its recently begun move to dialogue, as evinced in DARPA’s COMMUNICATOR program and others, putting more language in the language models (syntax) and more speech in the speech models (speech acts)–see Chapter 5. Machine Translation (Chapter 4) should pursue coverage and robustness by putting more statistics into symbolic approaches and should pursue higher quality by putting more linguistics into statistical approaches. Information Retrieval (Chapter 2) should focus on multilinguality, which will require new, statistical, methods of simplifying traditional symbolic semantics. Text Summarization and Information Extraction (Chapter 3) should attack the problems of query analysis and sentence analysis in order to pinpoint specific regions in the text in which specific nuances of meaning are covered, mono- and multilingually, by merging their respective primarily statistical and primarily symbolic techniques.

6.5 Conclusion

Natural language processing research is at a crossroads: both symbolic and statistical approaches have been explored in depth, and the strengths and limitations of each are beginning to be well understood. We have the data to feed the development of lexicons, term banks, and other knowledge sources, and we have the data to perform large-scale study of statistical properties of both written and spoken language in actual use. Coupled with this is the urgent need to develop reliable and robust methods for retrieval, extraction, summarization, and generation, due in large part to the information explosion engendered by the development of the Internet. We have the tools and methods, yet we remain far from a solid understanding of, and a general solution to, the problem.

What is needed is a concerted and coordinated effort by researchers from across the spectrum of relevant disciplines, representing the international community, to come together and shape the bits and pieces into a coherent set of methods, resources, and tools. As noted above, this involves, in large part, a systematic and painstaking effort to gain a deep understanding of the contributing factors and elements, from both a linguistic and a computational perspective. However this may best be accomplished, one thing is clear: it demands conscious effort. The current emphasis on the development of applications may or may not naturally engender the sort of work that is necessary, but progress will certainly be enhanced with the appropriate recognition and support.

 

6.6 References

Abney, S. 1996. Statistical Methods and Linguistics. In J. Klavans and Ph. Resnik (eds.), The Balancing Act. Cambridge, MA: MIT Press.

Brown, P.F., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, P. Roossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics 16(2) (79—85).

Charniak, E. 1997. Statistical Parsing with a Context-Free Grammar and Word Statistics. Proceedings of Fourteenth National Conference on Artificial Intelligence (AAAI-97). Providence, RI (598—603).

Chomsky, N. 1957. Syntactic Structures. The Hague, The Netherlands: Mouton.

Church, K.W. and R. Mercer. 1993. Introduction to the Special Issue on Computational Linguistics Using Large Corpora. Computational Linguistics 19(1) 1—24.

Collins, M.J. 1996. A New Statistical Parser Based on Bigram Lexical Dependencies. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL). Santa Cruz, CA (184—191).

Collins, M.J. 1997. Three Generative, Lexicalised Models for Statistical Parsing. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL). Madrid, Spain (16—23).

DARPA. 1989—1994. Proceedings of conference series initially called Workshops on Speech and Natural Language and later Conferences on Human Language Technology. San Francisco: Morgan Kaufmann.

Dorr, B.J. 1994. Machine Translation Divergences: A Formal Description and Proposed Solution. Computational Linguistics 20(4) (597—634).

Frederking, R., S. Nirenburg, D. Farwell, S. Helmreich, E. Hovy, K. Knight, S. Beale, C. Domanshnev, D. Attardo, D Grannes, R. Brown. 1994. Integrating Translations from Multiple Sources within the Pangloss Mark III Machine Translation System. Proceedings of the First AMTA Conference, Columbia, MD (73—80).

Harris, Z.S. 1951. Methods in Structural Linguistics. Chicago: University of Chicago Press.

Hermjakob, U. and R.J. Mooney. 1997. Learning Parse and Translation Decisions from Examples with Rich Context. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL). Madrid, Spain (482—489).

Hermjakob, U. 1999. Machine Learning Based Parsing: A Deterministic Approach Demonstrated for Japanese. Submitted.

Hobbs, J.R., D.E. Appelt, J. Bear, M. Tyson, and D. Magerman. 1992. Robust Processing of Real-World Natural-Language Texts. In P. Jacobs (ed), Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. Hillsdale, NJ: Lawrence Erlbaum Associates (13—33).

Ide, N. 1998. Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora. Proceedings of the First International Language Resources and Evaluation Conference (LREC). Granada, Spain (463—470).

Ide, N. and J. Véronis. 1990. Very large neural networks for word sense disambiguation. Proceedings of the 9th European Conference on Artificial Intelligence (ECAI’90). Stockholm, Sweden (366—368).

Ide, N. and J. Véronis. 1998. Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art. Computational Linguistics 24(1) 1—40.

Johnson, R.L, M. King, and L. Des Tombe. 1985. EUROTRA: A Multi-Lingual System under Development. Computational Linguistics 11, (155—169).

Kilgarriff, A. and M. Palmer. Forthcoming. The Senseval Word Sense Disambiguation Exercise Proceedings. Computers and the Humanities (special issue).

Klavans, J.L. and Ph. Resnik. 1997. The Balancing Act: Combining Symbolic and Statistical Approaches to Language. Cambridge, MA: MIT Press.

Magerman, D.M. 1995. Statistical Decision-Tree Models for Parsing. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL). Cambridge, MA (276—283).

Niemann, H., E. Noeth, A. Kiessling, R. Kompe and A. Batliner. 1997. Prosodic Processing and its Use in Verbmobil. Proceedings of ICASSP-97, (75—78). Munich, Germany.

Pendergraft, E. 1967. Translating Languages. In H. Borko (ed.), Automated Language Processing. New York: John Wiley and Sons.

Shannon, C.E. and W. Weaver. 1949. The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press.

Small, S.L. and Ch. Rieger. 1982. Parsing and comprehending with word experts (a theory and its realization). In W. Lehnert and M. Ringle (eds.), Strategies for Natural Language Processing. Hillsdale, NJ: Lawrence Erlbaum and Associates (89—147).

White, J. and T. O’Connell. 1992—94. ARPA Workshops on Machine Translation. Series of 4 workshops on comparative evaluation. PRC Inc., McLean, VA.

Wilks, Y., D. Fass, Ch-M. Guo, J.E. MacDonald, T. Plate, and B.A. Slator. 1990. Providing Machine Tractable Dictionary Tools. In J. Pustejovsky (ed.), Semantics and the Lexicon. Cambridge, MA: MIT Press.

Yamron, J., J. Cant, A. Demedts, T. Dietzel, Y. Ito. 1994. The Automatic Component of the LINGSTAT Machine-Aided Translation System. In Proceedings of the ARPA Conference on Human Language Technology, Princeton, NJ (158—164).

 

 
