Linguistic Diversity on the Internet

The history of Machine Translation (MT), the automatic translation of texts by computer, dates back to the early days of computers in the 1940s and 1950s, partly as an outgrowth of wartime cryptography techniques. From then until the mid-60s a considerable amount of work was done on MT technology, but because of the severe limitations of computing and programming practices in this period, coupled with an overly naïve approach to MT lacking in linguistic sophistication and an excess of optimism on the part of both researchers and the general public, early expectations could not be fulfilled and a period of extreme scepticism followed.

In particular, the conclusions of the ALPAC Report of 1966, commissioned by U.S. government agencies, were strongly damning, and the consequent withdrawal of funding brought to an abrupt end the first generation of MT research in the United States, and furthermore produced a markedly negative public perception of the future possibilities of MT. ALPAC is now thought to have been shortsighted, but the damage it did to the image of MT as well as to progress in development was profound and lasting. One offshoot of these events was that, in the decade following ALPAC, with U.S. research brought to a standstill, European and Canadian projects came to the fore. The return of widescale interest in MT in the U.S. was slow, while Japan has since become a further important focal point for MT development.

Thus, although early work on MT laid the foundations for later developments, its shortcomings were many. In programming terms, the implementation of simple bilingual computerized lexica was fairly straightforward, though the human labour required to construct the lexica themselves was considerable. But creating quality machine translation systems is not a simple matter of "brute force": the problem is that translation calls for far more than the mere substitution of lexical items on a one-to-one basis, as was made evident by the poor quality of output from the first systems. Translation often involves choices between possible equivalents in the target language to a given word in the source language, and any system is doomed to constant errors unless it incorporates procedures and criteria for making such choices, most of which were beyond the capacity and level of sophistication of the early attempts. Also, translation requires the conversion not only of lexical items but also of grammatical (morphological and syntactic) structures, the analysis of which required more complex types of processing. Even with such analysis, adequate translation is far from guaranteed unless other kinds of information are also used, given the subtleties of meaning and expression in human language texts, even when these appear straightforward at first sight.

The image of MT was damaged just as much by unrealistic expectations. For relatively unsophisticated machine translations to be considered adequate and useful, a compromise must be accepted on at least one of two points: the requirement that translation should be fully automatic (taking place without human intervention); and the requirement that the quality of machine translation should match that achievable by human translators. Initial attempts to achieve Fully Automatic High Quality Translation (FAHQT) were doomed to failure; it was thus an error of ALPAC and other commentaries of the period to judge progress on MT by such an impossible standard. In later generations of MT it came to be accepted that machine translations not meeting the goal of FAHQT can also be of value, and that the demands made of MT systems should take into account both what is possible with current technology and what is necessary for particular applications.

After several ups and downs in the reputation of MT, the last half-decade has seen a renewed upsurge of interest due in no small part to the intensification of global communication following the initial growth of the Internet. Most experts, while aware of the technology's limitations, are nonetheless confident that MT will play a significant role in the technology of tomorrow's Information Society.

Before the nineties, three main approaches to Machine Translation were developed: the so-called direct, transfer and interlingua approaches. Direct and transfer-based systems must be implemented separately for each language pair in each direction, while the interlingua-based approach is oriented to translation between any two of a group of languages for which it has been implemented. The implications of this fundamental difference, as well as other features of each type of system, are discussed in this and the following sections. The more recent corpus-based approach is considered later in this section.

The direct approach, chronologically the first to appear, is technically also the least sophisticated, although this does not mean, within the limits of present translation technology, that it necessarily produces inferior results. In its purest form, a system of the direct type "translates as it goes along", on a word-by-word basis. Nonetheless, the systems of this type now in general use incorporate many sophisticated features that improve their efficiency, often matching systems of more recent design in the quality and robustness of the translations they produce. Well known and widely used systems are based on such an approach, but greatly enhanced, mainly by including information in the lexicon incorporating rules for disambiguation and bilingual syntax rules.
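As a rough illustration of the direct approach in its purest form, the following sketch translates word by word using a bilingual lexicon. The tiny Spanish-English lexicon and the function name are invented for this example and are not taken from any actual system.

```python
# Hypothetical toy lexicon (Spanish -> English), invented for illustration.
LEXICON = {
    "la": "the",
    "casa": "house",
    "blanca": "white",
    "es": "is",
    "grande": "big",
}


def translate_direct(sentence: str, lexicon: dict[str, str]) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    words = sentence.lower().split()
    return " ".join(lexicon.get(word, word) for word in words)


print(translate_direct("La casa blanca es grande", LEXICON))
# the house white is big
```

The output keeps the Spanish word order ("house white"), which is the kind of error that the bilingual syntax rules mentioned above are intended to correct.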

More recently developed approaches to MT divide the translation process into discrete stages, including an initial stage of analysis of the structure of a sentence in the source language, and a corresponding final stage of generation of a sentence from a structure in the target language. Neither analysis nor generation is translation as such. The analysis stage involves interpreting sentences in the source language, arriving at a structural representation which may incorporate morphological, syntactic and lexical coding, by applying information stored in the MT system as grammatical rules and dictionaries. The generation stage performs approximately the same functions in reverse, converting structural representations into sentences, again applying information embodied in rules and dictionaries.

The next crucial distinction involves what happens "in the middle", between source language analysis and target language generation. We shall discuss the interlingua approach first in order to keep the present exposition roughly chronological. In this approach the object of analysis is to arrive at a representation of each sentence in a form which is essentially independent of the languages between which translation is required: a language-neutral intermediate representation of the content of a sentence. This representation is expressed in an "intermediate language", or interlingua, although the word "language" here may be misleading since what is normally used is a computer-readable form of encoding rather than anything resembling a human language. Generation consists of reconstituting the content of representations in the interlingua as acceptable sentences in the target language. In a sense, an interlingua-based system can be said to carry out two "translations" for every one that the end user observes: first from the source language into the interlingua, and subsequently from the interlingua into the target language. Clearly, the quality of translation such a system can produce will be only as good as the weakest link in the chain, and many experiments with interlinguas foundered because in practice it was difficult to design an interlingua that was truly language-independent and yet capable of encoding accurately the full content of sentences found in natural human-language texts. If human languages are complex, it follows that designing an interlingua into and out of which texts generated by humans can be translated satisfactorily, without loss of significant information, is not a simple matter. In fact, the interlingua approach encountered many problems and the results obtained were substantially inferior to those yielded by direct and transfer-based MT systems.

The transfer approach, which characterizes the more sophisticated MT systems now in use, may be seen as a compromise between the direct and interlingua approaches, attempting to avoid the most extreme pitfalls of each. Although no attempt is made to arrive at a completely language-neutral interlingua representation, the system nevertheless performs an analysis of input sentences, and the sentences it outputs are obtained by generation. Analysis and generation are however shallower than in the interlingua approach, and in between analysis and generation, there is a transfer component, which converts structures in one language into structures in the other and carries out lexical substitution. The object of analysis here is to represent sentences in a way that will facilitate and anticipate the subsequent transfer to structures corresponding to the target language sentences.
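The three stages of a transfer-based system can be sketched in miniature as follows. This is only a toy under stated assumptions: the tag set, the two small dictionaries and the single reordering rule (noun-adjective becomes adjective-noun) are invented for the example, and real systems use far richer structural representations.

```python
# Hypothetical toy resources, invented for illustration only.
SOURCE_TAGS = {"la": "DET", "casa": "NOUN", "blanca": "ADJ"}
BILINGUAL = {"la": "the", "casa": "house", "blanca": "white"}


def analyse(sentence: str) -> list[tuple[str, str]]:
    """Analysis: tag each source word with a part of speech."""
    return [(w, SOURCE_TAGS.get(w, "UNK")) for w in sentence.lower().split()]


def transfer(structure: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Transfer: swap NOUN ADJ into ADJ NOUN, then substitute lexical items."""
    items = list(structure)
    i = 0
    while i < len(items) - 1:
        if items[i][1] == "NOUN" and items[i + 1][1] == "ADJ":
            items[i], items[i + 1] = items[i + 1], items[i]
            i += 2  # skip past the pair just reordered
        else:
            i += 1
    return [(BILINGUAL.get(w, w), tag) for w, tag in items]


def generate(structure: list[tuple[str, str]]) -> str:
    """Generation: linearise the target structure into a sentence."""
    return " ".join(word for word, _ in structure)


print(generate(transfer(analyse("la casa blanca"))))
# the white house
```

Only the transfer step depends on both languages; analysis looks at the source language alone and generation at the target language alone.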

None of the approaches described is trouble-free, and they all have trouble with much the same set of recalcitrant problems, in particular lexical and grammatical ambiguities in the source language. Direct systems tend to operate too "locally", failing to "understand" the complete structure of the sentences they must translate; nevertheless they often "work" in practice, when the structure of the source language sentence happens to be directly transferable to that of the target language translation. Predictably, this happens more frequently the more similar the source and target languages are, so that we can expect a direct system to yield better results when translating between Spanish and Italian than between a European language and Japanese, say.

Interlingua-based systems cannot benefit from such "free rides", since they abstract away from the source text to a language-neutral representation from which the translated text must then be generated. The trouble with this approach, in addition to the difficulty of designing an adequate interlingua, is that when things go wrong in the complex process of abstraction and reconstitution of sentences, they tend to go very badly wrong indeed. However, the interlingua approach has certain other advantages, to be discussed below.

In the 1990s a radically new approach to machine translation developed, known variously as corpus-based, example-based, or non-symbolic translation, and also associated with the notion of Translation Memory (TM). Such systems are based on a quite different principle from those discussed so far. One way to describe them would be as "direct" systems, except that the sentence, rather than the word, is the unit that they work with, and the resource they use to find equivalents is not a lexicon but a corpus. In their purest form, such systems do not break down or analyse the linguistic content of sentences. This means that they are protected against getting the analysis wrong; either the sentence to be translated is already in the available corpus or it is not. If the sentence is there, the translation will probably be 100% correct; if it is not, no translation can be provided. However, more advanced corpus-based systems attempt to recognise partial or fuzzy matches between a sentence in the input text and another in the system's corpus.
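The exact and fuzzy lookups described above might be sketched as follows. The two-sentence English-French memory, the function name and the 0.8 threshold are all assumptions made for illustration; the similarity measure is the character-based ratio from Python's standard difflib module, which is only one of many possible choices.

```python
from difflib import SequenceMatcher

# Hypothetical aligned corpus (English -> French), invented for illustration.
MEMORY = {
    "Press the red button.": "Appuyez sur le bouton rouge.",
    "Close the cover.": "Fermez le couvercle.",
}


def lookup(sentence, memory, threshold=0.8):
    """Return (translation, score) for the best match, or (None, score)."""
    if sentence in memory:  # exact match: reuse the stored translation
        return memory[sentence], 1.0
    best_source, best_score = None, 0.0
    for source in memory:
        score = SequenceMatcher(None, sentence, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= threshold:
        # Fuzzy match: the stored translation is a draft needing revision.
        return memory[best_source], best_score
    return None, best_score


print(lookup("Close the cover.", MEMORY))
print(lookup("Press the green button.", MEMORY))
print(lookup("Replace the battery every year.", MEMORY))
```

A fuzzy match returns the translation of a similar sentence, not of the input itself, so in this sketch "green" would still come back as "rouge"; the result has to be corrected by a person or by some further component.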

The difficulty here obviously stems from the fact that there is an infinite number of possible sentences in a language. However, as its proponents point out, actual language use tends to be quite repetitive, using the same formulae over and over again. The degree of predictability of language will obviously depend largely on the communicative context and its domain or subject matter, both of which are sometimes highly specialized. It is the fact that language use in technological environments tends to be increasingly specialized, coupled with the possibility of obtaining specialized linguistic corpora corresponding to precisely such parameters of language use, which lends feasibility to corpus-based systems. The corpus-based approach is also favoured by recent advances in computer technology and the rise in computer use: the former makes it possible to store, transmit and access very large corpora efficiently, rapidly and cheaply, while the latter is resulting in an ever larger body of electronically-formatted text in different languages that may be employed as ready-made corpora.

Here we have reviewed current approaches to MT in a highly schematic manner. Among the various "symbolic" MT approaches (sometimes referred to as Machine Translation proper), the differences outlined, while certainly pertinent to an understanding of MT systems, are somewhat academic. Actual MT systems in operation (as opposed, possibly, to those restricted to the laboratory) tend to be more eclectic in character, despite the labels attached to them. It seems likely that those distinctions will become yet more blurred in the future. Unfortunately, the fact that the systems in question are in industrial competition with each other is in this case a possible impediment to rapid progress.

In the long run the same may apply to the opposition between "symbolic" and "non-symbolic" translation systems. Each type of system is a tool that efficiently performs rather different tasks from the other, but there seems to be no reason why these might not be combined to produce translation systems benefitting from both approaches: a corpus-based component would be supplemented by a symbolic MT component to deal with parts of texts not covered by the corpus.

Basic MT techniques, as described above, have been and will continue to be combined with other technologies, in attempts either to improve MT performance or extend its application. Since many of the difficulties with MT stem from the fact that it doesn't understand what it is asked to translate - this is the single most telling gap separating machine and human performance - various ways to increase the computer's "understanding" of texts have been tried: inclusion of components for semantic analysis among central MT routines, incorporation of Artificial Intelligence, Knowledge Bases etc. Some of these approaches have so far proved more successful than others, and experiments along such lines are likely to continue. Whatever their outcome, we can expect to see MT combined with and enhanced by other technologies.

One of the most sensational ways of extending the application of MT will be the integration of voice recognition and synthesis modules with MT to produce systems capable of translating voice communication. Development of all the individual components of such technology is well advanced and approaching a consolidation stage, so it is only a matter of time before such possibilities are realized and become part of everyday life. The same applies, obviously, to other types of input and output interface. Optical character recognition, and even recognition of handwriting, are already realities, and there is now talk of systems capable of recognising human body language too. The incorporation of these human-oriented interfaces into future technology will cover a wide range of applications and certainly not be specific to Machine Translation, but the potential for integration is evident.

Still more generally, future technology is expected to take fuller advantage of the possibilities for integration of existing and emerging technological capabilities into more comprehensive systems; this is the way ahead for many newly developing techniques. In many cases users of tomorrow's devices will not even be aware of the components involved, any more than today's television viewers understand what goes on inside the TV set. To serve multilingual communities and a linguistically diverse world, the "hidden" components will often have to include either MT systems as such or modules of the kind found integrated in MT systems, such as language parsers and generators, lexica, corpora, etc.

Machine Translation development using systems of the "traditional" (or "symbolic") type is expensive and time consuming. Whatever kind of system we are looking at, we will wish to distinguish between developing a system as such and developing implementations for particular languages. Developing a completely new MT system is a large-scale programming endeavour comparable to the creation of other complex computer applications. A fair number of such systems have been developed over recent decades, and it seems unlikely that any major new projects for the creation of further "traditional" systems will be undertaken in the near future.

Any system is obviously only as useful as the language implementations that grow out of it. These are themselves very costly and take a long time to develop. Quantitatively speaking, the biggest task is lexicon development. The lexica of an MT system have to contain more, and more explicit, information about every single item than is found in ordinary dictionaries. The system must be equipped to handle not only individual words but word combinations - standard phrases, common collocations, idioms, compound expressions and the like. Everything must be tested, corrections made and re-tested and the cycle repeated many times before the system can be released for practical use. Errors will continue to surface in the course of use of the system, and over time the system will gradually improve if these errors continue to be corrected. For this reason, the best MT systems, in practice, are those that have been running the longest, independently of the degree of sophistication in system construction. Systran and Logos, as "veteran" systems, are among the best in practice for those language pairs with which they have been used most and longest, even though their design is considered "primitive" by academic standards in current MT technology.

There now exists a variety of MT systems implemented for a variety of language pairs, but because of the size of the task of developing implementations for new language pairs, progress has been slow relative to the large number of languages, including some of considerable size and world importance, that are still not provided for. Moreover, direct and transfer type systems, the two kinds of "traditional" MT system now in common use, tend not to be fully reversible systems, meaning that they all require more or less separate development for each language pair in each direction. For the two possible directions of translation between a given language pair, some of the resources cannot in practice be shared because the process of translating from Language A to Language B is not merely the reverse of B to A. This is the basic state of affairs, even though there may be some elements that are reusable between language pairs and between directions in a given pair, provided, of course, that the language pairs in question belong to the same MT system and are not the property of competing commercial interests. Thus the potential for large-scale economizing in the development of MT for many languages is limited by a variety of factors, ranging from technical feasibility to market forces and capacity for coordination of efforts.

This point requires even more careful consideration when what is needed is not merely a bilingual but a multilingual MT network, in which translation is possible from any language into any other language among a given network of languages or in a multilingual community. Unless a high degree of reusability is achieved, serious problems arise for all but very small multilingual sets. When, in 1978, an ambitious project, named Eurotra, was started to develop "a machine translation system of advanced design" between all official languages of the European Community (a target which was not achieved before the programme came to an end), the Community's official languages numbered only six: English, French, German, Dutch, Danish and Italian. This meant fifteen language pairs. Within eight years, the entry of Greece and subsequently Spain and Portugal into the Community had added three new official languages which had to be integrated into the system, still under development. This increase from six to nine languages meant that the number of language pairs more than doubled, rising from fifteen to thirty-six. If the programme had continued a little longer, by the time there were twelve official languages of the Community, the number of language pairs would have gone from 36 to 66; fifteen languages would have brought the figure up to 105, and so on, the number of pairs growing roughly as the square of the number of languages.
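The figures quoted above follow from the fact that n languages form n(n-1)/2 unordered pairs. The short sketch below reproduces them; the function name is our own. The third column doubles each figure to count translation directions, on the assumption, stated earlier, that direct and transfer systems need separate development for each direction.

```python
def language_pairs(n: int) -> int:
    """Number of unordered language pairs among n languages: n(n-1)/2."""
    return n * (n - 1) // 2


for n in (6, 9, 12, 15):
    pairs = language_pairs(n)
    print(n, pairs, 2 * pairs)
# 6 15 30
# 9 36 72
# 12 66 132
# 15 105 210
```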

Whether we are talking about machine or human translation, various considerations will tend to make the position worse for the smaller languages in the network (i.e. for most languages), in contrast to a very few large communities whose languages are widely used. The more limited the resources of a given language community within the network (and on average we would expect smaller language communities to dispose of fewer resources), the greater the proportional effort required to achieve integration into a common multilingual system if the language community in question must make this effort on its own. Calculated in relation to the intensity of demand for translations between each language pair, also, the less widely a language is used in absolute terms, the more expensive, in terms of cost/benefit, it will be to provide the full range of translation services, whether by human or mechanical means. On the other hand, in all but the simplest multilingual networks, there are probably good organizational, not to mention political reasons why it is desirable for each language community to maintain some autonomous capacity and control over integration of its own language into the multilingual network, rather than depending on a top-down structure where decisions affecting all languages are centralized, or else subordinated to decisions in language communities other than one's own.

These are some of the logistical problems inherent in multilingual networks in general. What we must ask is whether strategies can be found, whether or not they involve Machine Translation (but it seems probable that they might), to resolve or at least palliate this situation in a way which is compatible with linguistic diversity in a closely-knit multilingual community. A number of solutions might be considered.

One approach is to look for ways to produce MT language pairs more cheaply, on the philosophy that for a language not even minimally integrated into modern technology, something is better than nothing, at least as a point of departure. One way to approach this is by creating practical MT development tools that can be adapted relatively rapidly to a given linguistic system on the basis of available knowledge about that system, which in the most extreme case might be limited to what can be elicited from a speaker as linguistic informant. An alternative way is to develop tools capable of constructing a grammar and lexicon on the basis of a corpus. Both avenues have been explored in recent or ongoing projects, particularly those explicitly oriented to providing MT for languages for which development resources are limited, such as unofficial or "underdeveloped" languages, or third-world-origin languages spoken by minority groups in European countries. However, there are limitations to what the present technology can achieve, and if the quality of even the best MT systems available is seriously limited, those generated by "quick and dirty" approaches will be even more imperfect.

Another strategy would be to designate one or more specific languages as pivot languages for translation purposes within the multilingual community. Translation between two languages (other than pivot languages) can then be achieved in two stages (by two human translators or two machine translation systems), translating first into the pivot language and then from this into the target language. This solution admittedly has some shortcomings. It would probably lower the quality of translated output as a consequence of the double translations, and make such translations more expensive and time-consuming (at least in the case of human translators), unless obtaining a translation into the pivot language is also a goal in its own right. But for large multilingual communities the number of individual translation directions would be substantially reduced (to two for each non-pivot language: to and from the pivot).
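The size of the saving can be estimated with simple arithmetic. The sketch below assumes a single pivot that is itself one of the n languages, and compares the n(n-1) translation directions needed without a pivot to the 2(n-1) needed with one; both function names are invented for the example.

```python
def directions_without_pivot(n: int) -> int:
    """Every language translates directly into every other: n(n-1)."""
    return n * (n - 1)


def directions_with_pivot(n: int) -> int:
    """Each of the n-1 non-pivot languages translates to and from the pivot."""
    return 2 * (n - 1)


for n in (6, 9, 15):
    print(n, directions_without_pivot(n), directions_with_pivot(n))
# 6 30 10
# 9 72 16
# 15 210 28
```

With more than one pivot the count rises again, but it remains far below the figure for a fully connected network.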

Given that some pairs and groups of languages are relatively similar to each other and that translation between similar languages is presumably easier (for both machine and human translators), one could contemplate the use of local as well as global pivot languages. For example, for translation between Polish and Slovenian, assuming no MT system for direct translation between the two exists, a third Slavic language would no doubt function better as a "local pivot language" than, say, English, Spanish, Arabic or Indonesian. In a multilingual network working through various pivot languages, a translation between two given languages would be "routed" through the most convenient pivot, or even sequence of pivots, depending on language affinity and availability of pivots, except, obviously, in cases where a direct language pair from source to target happens to be available.
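Routing of this kind can be sketched as a search over the available translation directions. The example below is a minimal illustration, assuming an invented set of language pairs (with Czech standing in as the third Slavic language) and choosing simply the route with the fewest steps; it takes no account of language affinity, which a fuller treatment would have to weigh as well.

```python
from collections import deque

# Hypothetical set of available translation directions (source, target).
AVAILABLE = {
    ("pl", "cs"), ("cs", "pl"),
    ("cs", "sl"), ("sl", "cs"),
    ("pl", "en"), ("en", "pl"),
    ("en", "es"), ("es", "en"),
}


def route(source, target, available):
    """Return the shortest chain of languages from source to target, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Breadth-first search: extend the path by one available direction.
        for a, b in sorted(available):
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None


print(route("pl", "sl", AVAILABLE))  # ['pl', 'cs', 'sl']
print(route("sl", "es", AVAILABLE))  # ['sl', 'cs', 'pl', 'en', 'es']
```

Where a direct pair exists, the search returns it as a two-element path, which corresponds to the exception noted above.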

The problem of the proliferation of language pairs, which affects multilingual systems based on direct or transfer MT technology (as well as human translators), is also avoided by MT systems based on the interlingua concept. The hallmark of an interlingua system is the language-neutral nature of the intermediate representations that mediate between source and target languages (see above). The structure of an interlingua system partly resembles that of the pivot language model. The difference is that the pivot, in this case, is not one of the natural languages in the network but an artificially constituted, computer-internal representation. Apart from its neutrality, an interlingua can be designed specifically to fulfil its function; for example it may be more explicit than human languages, and capable of encoding the various kinds of meaning found in different languages. We would thus expect an interlingua to serve as a more efficient pivot for translation than a human language. However, it has been questioned whether an interlingua with these ideal characteristics can actually be constructed, and a number of experiments along these lines have so far given disappointing results. It is indeed unfortunate that, while interlingua-based MT would be the ideal solution for multilingual situations if it worked reliably, this goal has remained elusive, and there are at present no commercial systems of this type.

There is yet another possible approach which, while distinct from any of those so far mentioned, shares or could incorporate features of each of them. This involves general agreement on the use of a standard form of computer-readable representation for texts, into which source texts in any language would be converted, and subsequently converted from this intermediate representation into whatever target language is required. At present there is a new initiative of this type, in which an intermediate language has been developed called Universal Networking Language (UNL).

This may be viewed as an interlingua-based translation system; what most distinguishes it from earlier interlingua systems is the way it is proposed to be used. The intermediate language, in these proposals, is understood as a language in which information can be stored, permanently if necessary, and also transmitted electronically, without either the author knowing into what target language(s) the texts will ultimately find their way, or the eventual recipients necessarily being aware of the language of the ultimate source. The intermediate language thus functions as an international language that is structured and formatted in a computer-friendly manner rather than a manner comparable to human languages. Access by humans to information depends on the availability of machine conversion into the language of the user's choice. Similarly, the ability of authors to create texts in the intermediate format depends on the availability of machine conversion from their languages into it. The most likely repository for the converters to and from different languages would be the Internet, which is also probably the place where information will be stored and the channel for its transmission. In principle, this only requires the establishment of standard norms for an agreed intermediate language whose characteristics permit it to perform the desired function efficiently; for practical implementation, it is naturally also necessary for MT systems to exist to service different languages. The number of such systems required will correspond to the number of analysis and generation components needed for interlingua systems: two for each language.

Apart from possible logistic issues, the main technical obstacles faced by this attractive solution concern whether it is in fact possible to design an intermediate language with the desired characteristics, and whether the MT systems available for use with such a scheme can perform well enough to make it useful. As already pointed out, the interlingua approach to MT, of which this is a development, has run into difficulties in the past and its ultimate feasibility would need to be demonstrated. Basic specifications for UNL have recently been published on the Internet, to be followed by pilot translation systems in the near future.

In contrast to "traditional" MT, producing "non-symbolic" or corpus-based systems is simpler if the corpus already exists; this is one reason why the creation of language corpora is currently ranked as a high priority in language technology. The application of translation systems based on corpora is limited by the size and type of corpora, whereas symbolic-type MT does not have this limitation. Whether or not the corpus-based approach is useful will depend on whether its limitations are compatible with the purpose to which it is being applied and the conditions in which it must consequently work. This is, of course, equally true of symbolic machine translation.

For most kinds of machine translation, development and implementation is in any case costly and slow, and may be even more problematic if a multilingual system is the objective. Strategies and alternatives may be found to circumvent some problems, but those strategies may also impose further limitations on how the resulting system can be used and how acceptable its performance will be. In summary, then, specific kinds of MT system need to be evaluated in relation to the kind of environment in which they are to be used and the needs they are asked to satisfy.

An important limitation affecting the possible uses of Machine Translation technology, already referred to in the first section, concerns the quality of the translations it is able to provide. "Quality of translation" refers to how accurately an MT application performs a number of technically distinguishable tasks which together make up the complex process of translation: appropriate interpretation of the source text, production of a correct, intelligible target text, and of course perfect, or at least satisfactory, equivalence of meaning between the original text and the translation.

For most purposes it is fair to say that the quality we would ideally like to obtain in a machine translation is such that it would be indistinguishable from what an expert human translator would produce. In fact this formulation would strictly require some modification: not only is it rather ambitious to hope that a machine can model such subtle human skills perfectly, but we have also failed to mention that humans are themselves capable of error, and may, in reality, produce translations of poor quality as well as perfect ones!

However, a translation (whoever or whatever produces it) may be useful even if it is not of ideal quality. In itself, a low quality translation may serve practical purposes well; this depends on what those purposes are, on how critical quality is to those purposes, and also on what may be done about a translation's low quality if this is important.

In particular, depending on context and resources, it may be possible for machines and people to work hand in hand, in translation as in so many other spheres of technology: we need not expect the machine to carry out all the work on its own. Humans may be able to correct a machine's mistakes, taking the computer's effort as a first draft for revision. Alternatively, the computer may be able to request assistance or intervention from a human when necessary. Again, a human may be able to prepare the task for the machine ahead of time in such a way as to ensure that the latter is only asked to do what it is capable of. By such strategies, machines may be used as a tool to help provide translations of adequate quality, but not fully automatically.

Thus there may be a choice between having fully automatic translation which is not of high quality, or high quality translation which is not fully automatic. As we noted above, Machine Translation got a bad press early in its history when it was realised that the goal of FAHQT (fully automatic high quality translation) may be unattainable. This raises the question of why this ambitious goal should be the only one of any interest; while this ideal would certainly be very welcome if available, even low quality translation, or not fully automatic translation, may have important applications.

In the opinion of experts, many ordinary users of the Internet today are fairly tolerant of low quality machine translations such as are provided by existing automatic on-line services, presumably because given these users' priorities these serve their main purpose better than no translation at all. Human translation may be superior in quality but is not always available or affordable. This is one use of MT on the Internet that will predictably become more popular as the Internet expands around the world, becoming more international and multilingual. The primary goal of such services will be to provide translation which, in given circumstances, is available and affordable when needed and achieves a minimum threshold of quality for the purpose.

Humans can aid the machine to translate in a variety of ways. In some of these the assisting human needs the skills of a translator, or at least must know both source and target language, while in other cases the intervening person need only know the source language, or the target language, but not both. The other main criterion for classifying forms of human intervention is according to the point in the translation process at which intervention takes place.

In post-editing, the machine attempts to translate first and the editor then "cleans up" the result to produce a more correct text or a more accurate translation of the original. If the highest quality is of crucial importance, post-editing by a qualified translator is the best guarantee, since this gives maximum human control over the final product; by the same token, it is the least fully automated approach. Another possibility is for the MT programme to begin translating and stop to request clarification from a human operator during the process as necessary. This is called interactive editing. But if maximum machine autonomy is required, then short of full automation, the most appropriate form of human intervention to consider is pre-editing.

In pre-editing the editor, who may or may not need translator's skills depending on the system followed, checks or modifies the source text before or as this is input into the translation system, with the object of ensuring that this conforms to certain restrictions so that the machine is able to handle the input text adequately. Either the pre-editor has some form of training concerning what is acceptable to the machine, or else the machine itself prompts the user interactively, requesting clarification of points such as lexical or grammatical ambiguities in anticipation of what might otherwise give rise to problems.

Ambiguities can be resolved in different ways. One approach is for the source text to be required to conform to certain rules; this is referred to as controlled language. The imposition of a controlled language on the source text may be seen as a sacrifice, but what is obtained in return is a better quality machine translation. A classic example, and an early one - it began operation in 1976 - of a specialized MT system is the Canadian-created Météo system for translating daily weather bulletins from English into French. Its success is based on highly restricted input and a specialized domain. Official weather reports already followed a strict format employing a finite, specialized vocabulary and formulaic syntax, providing ideal circumstances for the implementation of an automated translation system. For the development of Météo, the existing linguistic formulae in use were first analyzed in detail and provided the basis for the standardized format that the system would handle with a high level of accuracy.
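The idea of checking a source text against a controlled language before translation can be illustrated with a minimal sketch. The vocabulary, the sentence-length limit and the function name below are all assumptions made for illustration; they are not taken from Météo or from any real controlled-language specification.

```python
import re

# Hypothetical controlled vocabulary for a weather-bulletin style domain.
# These words are illustrative only; they are not taken from Météo.
ALLOWED_WORDS = {
    "cloudy", "sunny", "rain", "snow", "wind", "winds", "with", "and",
    "today", "tonight", "tomorrow", "high", "low", "near", "light", "strong",
}
MAX_WORDS_PER_SENTENCE = 12  # assumed limit, chosen for illustration


def check_controlled(sentence):
    """Return a list of problems; an empty list means the sentence conforms."""
    # Split into lowercase alphabetic words and numbers, ignoring punctuation.
    words = re.findall(r"[a-z]+|\d+", sentence.lower())
    problems = []
    if len(words) > MAX_WORDS_PER_SENTENCE:
        problems.append(f"too long: {len(words)} words")
    for word in words:
        # Numbers (e.g. temperatures) are always accepted in this sketch.
        if not word.isdigit() and word not in ALLOWED_WORDS:
            problems.append(f"word not in controlled vocabulary: {word}")
    return problems
```

A pre-editor, or the system itself, could run such a check and pass to the translation stage only those sentences for which the returned list is empty, sending the others back for rewording.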

Another way for a text to be easier for any kind of MT system to handle, because language use is more predictable and therefore also less prone to ambiguities, is if the text naturally pertains to a specific type, or a particular subject area, known as a domain, for which the MT system used has been specially prepared. This is sometimes referred to as a sublanguage. A word, for example, may have various possible meanings in the language, but within a known domain or text type the meaning may be predictable. The result of this approach may be comparable to that of controlled language; the difference is that controlled languages are artificially imposed by an MT system, whereas sublanguages occur naturally in human text production. Well-defined sublanguages can be supported by special resources such as specialist terminology banks and limited-domain corpora, the creation of which is an investment whose cost, for some domains at least (as in the case of police reports or weather reports), may be offset by savings in terms of efficiency. If a sufficient number of special domain resources were developed, a gradually merging network of MT-supported fields could develop.
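How a known domain can make a word's translation predictable may be sketched as follows. The tiny English-to-French lexicon, the domain labels and the function name are hypothetical, chosen only to illustrate the principle; a real terminology bank would be far larger and richer.

```python
# Hypothetical English-to-French entries keyed by domain; illustrative only.
LEXICON = {
    "bank": {"finance": "banque", "geography": "rive"},
    "sentence": {"law": "peine", "linguistics": "phrase"},
    "rain": {"weather": "pluie"},
}


def translate_word(word, domain):
    """Pick a translation using the domain; return None if still unresolved."""
    senses = LEXICON.get(word, {})
    if domain in senses:
        # The domain selects one sense among several candidates.
        return senses[domain]
    if len(senses) == 1:
        # Only one sense is recorded, so no choice has to be made.
        return next(iter(senses.values()))
    # Unknown word, or ambiguous outside the domains the lexicon covers.
    return None
```

Here the word "sentence" is ambiguous in general, but in a police-report domain labelled "law" only one equivalent remains; outside the prepared domains the sketch declines to choose, which is the point at which a human might be asked to intervene.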

The expanding activities of electronic commerce may be viewed as one domain (in terms of transaction types) or a range of domains (in terms of content) for which the development of specialized language resources would be both technically feasible and commercially justified, although special measures will be required, as always, to compensate for the tendency for business interests to concentrate on the largest languages, which in turn might only reinforce the dominance of these without an interventionist policy, such as support for truly multilingual e-commerce transaction software.

Finally we come to another way for texts to be more machine-friendly, namely if they are themselves generated by machines, or if their creation is at least machine-guided. This may occur as part of a larger process such as we may find in Information Technology applications, but also directly in texts of human origin if these are created through the use of authoring tools. Authoring tools and processes are already coming into use for other reasons, and their use is set to increase with the spread of Information Technology. It ought to be possible, provided this is adopted as an objective, to have them operate in ways that, together with their other functions, provide good input for MT so as to enhance translation accuracy. Some large international companies, e.g. Caterpillar and Ford, have already implemented ambitious programmes based on such principles.

B. Applying Machine Translation to the Internet

The Internet today is less a homogeneous environment than a macro-environment: a range of only loosely related ways of using a global electronic network for a variety of purposes. There is therefore no reason to expect diverse applications of the Internet to share a single set of conditions relevant to the applicability of Machine Translation (MT), and so to be susceptible to identical solutions. Furthermore, as the Internet evolves, new specialized uses will emerge which may determine the potential roles of MT in the future. Thus, any realistic assessment of the contribution of MT to the Internet must be a complex one.

The Internet is a channel allowing information to be transmitted or stored. The essential features of this channel can be summed up in four points: its operation is efficient; its extension is global; its use is flexible; and its form is electronic.

The Internet is, and will be to an increasing degree, both a vehicle for providing MT services and a major beneficiary of their application. To this extent, it is likely to provide a further key to making the Internet a truly global medium which can transcend not only geographical barriers but also linguistic ones.

Europe, as the most notable focal point in the present-day world where a great capacity for technological innovation crosses paths with a high level of linguistic diversity, is excellently placed to lead the way forward. Other parts of the world are technologically capable but too self-contained and homogeneous culturally to acquire immediate awareness of the need for information technology to find its way across linguistic barriers, while still other communities are fully aware of the language problem but lack a comparable degree of access to technological resources and initiative needed to address the issue on such a scale. Whoever succeeds in making future communication global in linguistic terms will have forged a new tool of incalculable value to the entire world.

The Internet can be used to store and transmit messages, information or other material between people. While some other media are used primarily to store material, and others primarily to transmit it, the Internet is unique in being fully adapted to both these functions within a single technological macro-domain.

Concentrating on transmission, another way of classifying the various functions of the Internet would be according to whether material posted is addressed to a specific "recipient" (e.g. e-mail) or to a generic audience (e.g. websites). A further criterion for classification is according to whether the material that the "author" puts on the Internet is transmitted to a "recipient" immediately, "in real time" (where transmission is synchronous) or placed in storage for the "recipient" to access later (asynchronous).

Websites exemplify asynchronous communication; so does e-mail, strictly speaking, although the delay may be a short one. Interactive chat, in contrast, is synchronous; like e-mail it has a specific "recipient". The fourth possibility, synchronous communication to a generic audience, is what happens in live television or radio broadcasts; the Internet can also carry these.
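The two criteria just described give four combinations, which can be laid out as a small lookup table. This is only a restatement of the classification in the text; the type and function names are invented for the sketch.

```python
from enum import Enum


class Timing(Enum):
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


class Audience(Enum):
    SPECIFIC = "specific recipient"
    GENERIC = "generic audience"


# One example per combination, as named in the surrounding text.
MODES = {
    (Timing.ASYNCHRONOUS, Audience.SPECIFIC): "e-mail",
    (Timing.ASYNCHRONOUS, Audience.GENERIC): "websites",
    (Timing.SYNCHRONOUS, Audience.SPECIFIC): "interactive chat",
    (Timing.SYNCHRONOUS, Audience.GENERIC): "live broadcasts",
}


def example_for(timing, audience):
    """Return the example medium for a given timing and audience."""
    return MODES[(timing, audience)]
```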

This variety of uses has to do with the all-important matters of how material is provided to the Internet and how it is obtained or accessed from it, as well as with the way the material is stored in between. If all these phases, from the author via the channel to the recipient, do not share the same language, then translation must intervene.

At this point several questions need to be examined: Who should be "responsible" for the translation - the author or the recipient? At what point in the communication process should translation occur - at the beginning, at the end, or in the middle? And what means will be used to translate - fully automatic, human-assisted, or exclusively human?

For synchronous communication (e.g. chat in real time) to be possible across languages, translation itself should be synchronous. This narrows down the options regarding translation procedure to two: an on-line human interpreter (theoretically feasible, but extremely expensive) or fully automatic machine translation. With MT technology at present and for the foreseeable future, the quality of automatic translation performance in such an environment will be imperfect except in those cases where the language or domain of communication is restricted and the translation system appropriately specialized. This is in fact a very plausible scenario for some instances of electronic communication in the future. Where communication is not so restricted, it may be that the technology available will still be found so much more useful and affordable than any other available option that users will be prepared to pardon, and adjust their expectations and habits to, the inevitable margin of translation error in exchange for the benefits of being able to communicate in this manner across language barriers.

As is hinted here, in projecting future technological development, we must try to take into account the evolution of users' attitudes and mentality. The MT systems in this scenario would most likely be encountered on the Internet itself, perhaps as one component among many in an emerging information environment that is likely to transform the Internet itself radically.

Cross-language communication via e-mail - asynchronous interpersonal communication - offers a second scenario. Because this mode of communication produces a more durable text (the messages may be composed off-line and the recipient may retain them as a permanent record), users might, in some cases at least, judge differently the acceptability of low-quality machine translation, which is again to be expected unless the messages are restricted to a specialized domain or linguistic range, as they sometimes will be. Because communication is asynchronous, the communicators will have more time to devote to the translation procedure, so a larger range of solutions may be feasible. Without resorting to the services of a human translator, one option for improving the quality of the translation would be for the machine translation process to include a pre-editing phase. Naturally, two other possibilities here are for translation to be obtained or commissioned by the author off-line before the message is sent out, and for translation to be carried out by the recipient, after receiving the message in the source language. Future habits in this respect will clearly depend on the services that are available, which as with other forms of technology will be determined either by market forces or planning, and it is still difficult to predict which way things will go.

However, Internet-oriented translation technology is likely to start by focusing on the servicing of websites. Asynchronous and addressed to a generic public, even a web page that gets, say, 100 hits (a small number indeed) is seen by 100 times more people than an e-mail message addressed to a single recipient. Since "web publishing" (in this broad sense) is much more accessible to authors than traditional forms of publication, the volume and growth rate of the Web will far exceed that of traditional publishing.

Every stage in the communication cycle is a potential point at which translation might be managed and/or executed. A wide range of solutions can be contemplated, obviously including pre-translation by the author or post-translation by the recipient, but also en-route translation using special Internet translation servers, and distributed translation whereby the translation process is spread over more than one point in the cycle.

Each of these options has advantages and disadvantages. One argument in favour of translation being controlled by the author is economic: the author is more likely than the recipient to be able to afford the effort and/or cost of a "proper" translation, in view of the number of potential recipients the author presumably wishes to reach. A second advantage deriving from the first is that the author is therefore both better situated and more strongly motivated to devote time and resources to obtaining a high quality translation.

However, there are also important disadvantages, both from the author's and the recipient's viewpoint, to author-created multilingual sites containing a set of parallel, pre-translated pages in several languages.

From the author's point of view there are several down-sides to having a multilingual site: (1) the cost of quality translation is very high, and the budget for a multilingual site must cover not just original texts but the inevitable updates too; (2) the size of a multilingual site is logically much greater than that of a comparable monolingual site; (3) the internal structure of such a site is considerably more complicated too, often posing design problems; (4) the previous point results in magnifying problems of site maintenance: this can be extremely complex and expensive for properly multilingual sites, since anything that is updated must be modified in parallel in all languages, to the point that parallel maintenance may become simply unfeasible and unaffordable, whereupon the whole purpose of the multilingual site will become compromised.

From the recipient's point of view there is another, equally serious drawback to such a site, namely that a site, and indeed any given text, can only be pre-translated into a finite number of languages, so that no matter how many languages are on offer, the language needs of many potential recipients will remain unattended. Translating a website from English into Spanish, for instance, does nothing to help a Japanese reader!

Opposite considerations apply to the possibility of translation by the recipient. For an individual recipient it is far less likely to be affordable and worthwhile to obtain quality translations of websites in foreign languages than for the author; often it will be far more profitable to look for another site that one already understands. However, recipients can make use of automatic translation, even if of low quality, to obtain at least a general idea of a page's content, if the needed MT service is available (whether on-line or residing on the individual's computer). Unfortunately this availability cannot be taken for granted; it depends not only on the recipient's language but that of the original text, and on the availability of translation between that specific pair of languages. On the other hand, especially for speakers of smaller languages, and hence for the purposes of linguistic diversity, it is probably to the recipient's advantage to exercise some control over the translation process, provided the MT resources exist, rather than depending only on decisions by authors as to which languages to service.

While each of these approaches may provide adequate solutions for some cases, clearly neither fully author-driven nor fully recipient-driven translation is an ideal way to achieve a comprehensively multilingual Internet. The problem is likely to remain with us for a long time to come, but we may hope that technology, which has in a sense "created" the problem, will also eventually provide part of the solution, in the form of tools incorporated into a progressively more technological Internet. There are various possibilities here, and it would surely be premature to attempt to size them all up at present; this is an area within Information Technology which requires urgent attention.

One interesting set of alternatives would involve distributing the translation cycle over the information cycle, in any of various ways: (1) The author might translate the original text into a pivot language (English, Arabic, Esperanto or whatever is used to this end), or (2) convert it to a transportable interlingua such as Universal Networking Language (UNL); the reader would later have this converted into the language of choice. (3) Either or both operations could also be performed automatically somewhere en route. (4) Alternatively, the author, while not providing translations, could put the material through some form of pre-editing (which could result in modification or tagging of the source text in accordance with shared norms of some sort or other) or special authoring procedure, in order to facilitate translation if this should occur later on elsewhere in the cycle. In other words, materials could be posted in an agreed form that was not translated, but "translation-friendly".
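The division of labour in options (1) to (3) can be sketched in a few lines: the author's side converts the text to a shared intermediate form, and the recipient's side converts that form into the language of choice. The word-level tables and symbol names below are toy assumptions and bear no relation to the actual design of UNL. The two counting functions add a well-known arithmetic argument for this arrangement, assuming the pivot is a separate representation rather than one of the languages being served.

```python
# Toy word-level tables; the pivot symbols and entries are hypothetical.
TO_PIVOT = {
    "en": {"hello": "GREETING", "world": "WORLD"},
    "es": {"hola": "GREETING", "mundo": "WORLD"},
}
FROM_PIVOT = {
    "en": {"GREETING": "hello", "WORLD": "world"},
    "fr": {"GREETING": "bonjour", "WORLD": "monde"},
}


def encode(text, source_lang):
    """Author side: map each source word to a pivot symbol."""
    table = TO_PIVOT[source_lang]
    return [table[word] for word in text.lower().split()]


def decode(symbols, target_lang):
    """Recipient side: map pivot symbols to target-language words."""
    table = FROM_PIVOT[target_lang]
    return " ".join(table[symbol] for symbol in symbols)


def direct_systems(n):
    """One-way translation systems needed to link n languages pairwise."""
    return n * (n - 1)


def pivot_systems(n):
    """Systems needed if each language only converts to and from the pivot."""
    return 2 * n
```

With ten languages, direct translation between every pair would call for 90 one-way systems, whereas routing everything through a pivot calls for 20. The price, as the surrounding discussion suggests, is that every text passes through two conversions, each of which may lose information.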

The Internet has so far been portrayed as a largely inert channel through which information, consisting of texts (together with non-textual information: graphics, sound, etc.), is transmitted from an author to a recipient. This is an adequate portrayal of the initial stages of development of the Internet and the World Wide Web, which is still partly applicable at the present time, but has already begun to change.

Left to its own devices without technological renovation, the Internet would continue to grow in size and content, but in doing so would become less and less manageable: the more information appears, the harder it will be to find what is needed efficiently. To overcome this effect of the information explosion, new content-oriented technology and a more sophisticated structure will need to be integrated into the Internet as we know it, transforming it from a simple channel for the transmission of characters, sounds and pictures into a medium for the interchange of information and messages. The new type of information cycle will be less inert than its predecessor because the stage of accessing material will become far more interactive. The function of the processing component will be to accept specifications from the recipient and, by manipulation of raw information to which it has access, compile a report tailor-made to the recipient's requirements. One of those requirements will naturally concern the ultimate language of the report.

As well as being a channel for direct communication, a global library for all kinds of information, and an instrument of entertainment, the Internet is also fast becoming a medium through which a growing range of transactions can be performed, goods bought and sold, services obtained and provided. Electronic commerce, which is expected to grow dramatically over the next few years, is one arena for such transactions. Increasing functional specialization will be accompanied by corresponding technological specializations of various kinds. The Internet is going to incorporate more and more Information Technology, which may end up transforming the Internet's nature.

As the complexity and specialization of the Internet increase, so too, hopefully, will the options for effective adaptation to the requirements of linguistic diversity. The details of how this may come about are difficult to foresee, but there are a number of reasons why this evolution of the Internet could favour its becoming more multilingual - as long as there is a motivation and a will to implement such possibilities. The following are some of the reasons:

In compensation for the acknowledged weaknesses of Machine Translation as a complete answer to today's multilingual needs, the direction in which the Internet is currently evolving will open up many new opportunities to harness the emerging tools, structures and technologies to construct an environment capable of providing for the different languages of the world's citizens. Not to make provisions for this while much of the technology is still in its infancy and the Internet is still growing could amount to a lack of foresight with long-term consequences far more worrying than those caused by the oversight that led to the millennium bug crisis.

Appendix 1

Appendix 2

	Develop Foundations	Production and Publishing	Improve Access for Insiders	Improve Access for Outsiders
Speech Processing	Speech databases; Recognition; Generation	Dictation; vocalization	Voice control, alarms	[Interpreting]
Text Processing	Coding standards; Localization	Word processing	Text retrieval, summarization	Multilingual document search
Compiling Reference Material	Morph analysers; Parsers; corpora	Spell-checkers, gram-checkers	Multimedia, document libraries	Machine(-aided) translation
Networking	Interchange standards; protocols	World Wide Web	E-mail, discussion lists	Electronic networks, WWW
Computer-aided Instruction	Dictionaries (computer tractable)	Literacy	Classroom materials	Computer-aided language-learning

There is a wide profusion of applications possible for that confluence of computing and linguistics that we call Human Language Technology. So much so, that I have found it useful to organize them into a table, with the various aims of the applications on the horizontal axis, and the various technologies that can be deployed down the vertical axis.

The column listed under "Develop Foundations" is not in itself a list of applications, but rather of the kinds of studies, most carried out at research institutions, which may support progress in the other applications further along the row.

When the various applications are displayed like this, one immediately sees that applications which require high-level analysis of grammar and meaning are in a small minority, perhaps only Interpreting (not yet available) and Machine Translation; while Summarization, Grammar-checkers, Text Retrieval and Computer-Aided Language Learning might be expected to make much more use of it in the future. This only underlines the fact that smaller languages can begin to apply the technology even though very little work has been done as yet on formal analysis of their structures.

Taken from the ELRA Newsletter April-June 1999 "Does size matter? Language Technology and the Smaller Language" by Nicholas Ostler.

Acknowledgements

We cannot possibly list all those people in virtually every country of the European Union and beyond who have been of help to us in one way or another in preparing this report. Our thanks to them all and in particular to:

Abaitua, Joseba
Facultad de Filosofía y Letras, Universidad de Deusto, Basque Country.
Aizpurua, Joxerra & Landa, Josu
Ametzagaina Taldea/Ametzagaina Group and ASP Software Injiniaritza/ASP Software Engineering, Basque Country.
Aizpurua, Xabier
Secretariat for Language Policy, Eusko Jaurlaritza/Basque Government, Basque Country.
Argemí, Aureli & de Dalmasses, Francesc
CIEMEN (foundation for minorities and stateless nations), Barcelona, Catalonia.
Arrieta, Kutz
Project Manager, Logos (Seattle), USA.
Bel, Núria
Commissionat per a la Societat de la Informació, Grup d'Investigación en Lingüística Computacional, Universitat de Barcelona, Catalonia.
Cardeñosa, Jesús
Facultad de Informàtica, Universidad Politécnica de Madrid, Spain.
Coll, Jordi & Mas, Jordi
Softcatalà (freeware and shareware in Catalan), Barcelona, Catalonia.
Dahl, Erik
Euroseek (search engine).
Davies, Graham
Cymru Ar-Lein, BBC Cymru/Wales, Caerdydd/Cardiff, Wales.
Diz Gamallo, Inés
Centro Ramon Piñeiro, Santiago de Compostela, Galicia.
Goossenaerts, Jan
Eindhoven University of Technology, The Netherlands.
Gordon, Ian
Managing Director, Trados UK Ltd, UK.
Husson, Patrick
Multilingual Information Society, European Commission, Luxemburg.
Jones, Colin & Jones, Gwyn
Bwrdd yr Iaith Gymraeg (Welsh Language Board), Caerdydd/Cardiff, Wales.
Lockwood, Rose
Équipe Consortium Ltd, Cambridge, UK.
Moring, Tom
European Bureau for Lesser-Used Languages, Brussels Office, Belgium.
Ó Cróinín, Donncha
ITE, Dublin, Ireland.
Ostler, Nicholas
Director, Linguacubun Ltd, Bath, UK.
Partal, Vicent
Vilaweb (electronic newspaper), Barcelona, Catalonia.
Prys, Delyth
Canolfan Safoni Termau (Centre for the Standardization of Welsh Terminology), Prifysgol Cymru (University of Wales) Bangor, Wales.
Rivallain, Yann
European Bureau for Lesser-used Languages, Dublin Office, Ireland.
Sarasola, Kepa
IXA Taldea, Donostia, Basque Country.
Somers, Harold & McNaught, John
Centre for Computational Linguistics, University of Manchester Institute of Science & Technology, UK.
Strubell, Miquel & Climent, Salvador
Universitat Oberta de Catalunya, (Open University of Catalonia) Barcelona, Catalonia.
Theologitis, Dimitri
European Commission Translation Service.
Vlaeminck, Sylvia
Socrates and Lesser Used Languages, European Commission, Brussels, Belgium.
Wie, Hakon
Opera Software, Norway.
Williams, Briony
Centre for Speech Technology Research, University of Edinburgh, Scotland.
Williams, Cen & Hicks, Bill
Canolfan Bedwyr (research and development centre), Prifysgol Cymru (University of Wales) Bangor, Wales.

A Short Bibliography of Machine Translation

The following is a list of publications found useful in the preparation of this report. Fuller bibliographies can be found in several works cited.

European Documentation