Linguistic Diversity on the Internet: Assessment of the Contribution of Machine Translation
The objectives of this study have been to assess both the problems created and the opportunities offered by the Internet for the smaller and minority languages of the European Union; to consider what measures might facilitate the maximal use by European citizens of their own languages for communication and the accessing and presentation of information on the Internet; and to consider in particular the role which machine translation might play.
The study finds that the threat to linguistic diversity on the Internet will not in the future come from the dominance of one language but from a multilingualism limited to perhaps half a dozen main world languages between which machine translation will be fully developed to the exclusion of the great majority of languages. It argues that the development of language technology for all European languages is not only essential from the point of view of citizenship and avoiding social exclusion, but can give Europe an important technology cluster.
The weakest language-groups in the EU, while found to be making enterprising use of the Internet, need a basic IT environment in their languages. A larger number of languages which lack the full array of language resources - linguistic corpora, electronic dictionaries etc - are in danger of being excluded not from the Internet as it is now, but from many of the processes, including machine translation and other language processing functions, that will increasingly be carried out over the Internet. There is a need for a much enhanced investment in language resources.
Machine translation can only be understood in relation to the availability of the above-mentioned language resources. It is not one process which succeeds or fails by a single absolute standard, but a range of systems with different costs and advantages and suited to different user requirements. The study surveys the field, in respect of the uses of MT on the Internet, and particularly with the costs/benefits to the smaller languages in mind.
Bearing all these factors in mind the study proposes a range of policies which are appropriate for action at European Union level and which, taken together, can preserve and promote linguistic diversity on the Internet.
The objectives of this study have been the following:
to assess both the problems created and the opportunities offered by the Internet for the smaller and minority languages of the European Union;
to consider what measures might facilitate the maximal use by European citizens of their own languages for communication and the accessing and presentation of information on the Internet;
and to consider in particular the role which various systems of machine translation might play.
Our own expertise and perspective relate to information technology and to the regional and minority languages of the EU, but we have broadened the focus to include smaller languages more generally.
We find that the threat to linguistic diversity on the Internet will not in the future come from the dominance of one language but from the uneven development of language technology and resources which, given present trends, will privilege half a dozen world languages.
It is estimated that the proportion of non-English-speakers using the Internet will have risen to 60% by 2005, so that global communication and information retrieval, e-commerce and websites that wish to have a global reach will all need either to be multilingual or to use machine translation systems.
Language technologies have developed at an increasing pace and machine translation has achieved an acceptable level of accuracy in particular contexts, but it has to be understood that we are dealing with a range of MT systems based on a variety of principles, and suited to different user requirements. There are choices to be made between fully automatic translation that is not of very high quality and very high quality translation that is not wholly automatic - in other words, which needs interaction with a human translator. Indeed we see an increasing demand for human translators, but a shift in their function away from routine translation of texts towards becoming adapters of material between cultures.
The weakest language-groups in the EU - both very small state languages and regional and minority languages - inhabit an IT environment that marginalises them through an absence of word-processors, spell-checkers, Internet browsers and IT manuals in their language. There is a danger here that an Internet culture - indeed a computer culture - develops in which people either come to accept it as natural to use a language other than their own when using the Internet, or else feel excluded because of lack of fluency in another language. For these languages support is needed to develop a range of everyday IT applications, but such projects should from the start have a distribution and marketing dimension as well as a technical one if they are to succeed in reaching European citizens who speak those languages.
The Internet, however, offers these same small languages, and also small communities and regions using major languages, a range of new opportunities. We have cited a number of examples of good practice and suggest that there is room both for the development of pan-European Internet portals and for the circulating of experience in running Internet projects - as much on the management and financial sides as on the technical side. There are also new approaches possible to language-learning, using multi-media modules on the Internet.
A larger group of still relatively small European languages (but including those mentioned above) lack the full array of language resources - linguistic corpora, electronic dictionaries etc - which are necessary in varying degree for machine translation and other language processing functions. This means that they are in danger of being excluded not from the Internet as it is now, but from many of the processes and transactions that will increasingly be carried out over the Internet.
It is the uneven development of these underlying language resources which is the real threat to linguistic diversity on the Internet, since in the absence of language resources for small and medium-sized languages, multilingual access to information on the Internet could in the future be limited to perhaps half a dozen main world languages between which machine translation will be fully developed to the exclusion of the great majority of languages.
The development of language technology and language resources for all European languages is therefore essential from the point of view of citizenship and equal opportunity in the information society. Language resources take time to develop and have only an indirect input into commercial applications, so that for all but the largest and richest language-groups, public funding will be required. We emphasize, however, that public authorities and voluntary organizations concerned with the languages in question need to be brought together with language technology experts, so that overall strategies can be developed and technical advances do not occur in isolation.
The development of language resources for European languages should not be regarded only as a cost, however. It can create an important technology cluster and confer a first-mover competitive advantage on the EU, which is an ideal test-bed for language technology.
Machine translation is a subject that can only be discussed inside the wider context already outlined. As we have indicated, it is not one process which succeeds or fails by a single absolute standard, but a range of systems with different costs and advantages and suited to different user requirements. We have surveyed the field, particularly with the costs/benefits to the smaller languages in mind, and also with the Internet in mind. This has led us to prefer certain approaches.
The structure of the report
Chapter 1 looks at the global development of the Internet in relation to language, at the role of language in relation to IT in economic and cultural life, education, training and citizenship, and in particular against the background of existing European Union policy and ongoing linguistic development within the EU area.
Chapter 2 considers the range of small and minority languages within the EU and some of the uses to which the Internet is already being put in these languages. The lack of everyday IT applications in some languages is noted, as are some contrasting examples of policies to redress the problem. Then, the question of developing language resources is addressed.
Chapter 3 analyses in lay people's terms the present functions of the Internet and assesses the likely future direction of its development. In this context and with small languages particularly in mind it then gives a simplified account of the principles underlying various machine translation systems. Two "Technical Files" at the end of the report go into these questions in much greater technical detail and give an account of the historical evolution of machine translation.
Chapter 4 draws together the conclusions of the report and the policy options appropriate for action at European level that emerge from the discussion in earlier chapters. These are given here in summary form:
Summary of Options
1. Support for networking and the circulation of experience in managing Internet projects among smaller language-communities.
2. Support for the creation of multi-lingual, pan-European Internet portals of all kinds.
3. Support for the creation of everyday IT applications for the smallest languages based on cooperation, shareware/freeware solutions and the free use of reusable elements.
4. Much enhanced support for the creation of language resources - particularly large language corpora where these do not exist - to common standards.
5. Support for the participation of smaller languages in the Universal Networking Language (UNL) project.
6. Support for multi-media language-learning modules adaptable to a variety of languages and usable over the Internet.
7. Support for cooperation in the production of IT manuals and simple IT learning software in small languages - both for schools and for adults.
8. The European Parliament could reiterate its support for the principle that Internet domain names (with all diacritic marks) in any language should be registrable alongside the EU suffix.
9. The European institutions could help develop memory-based machine translation and a pool of translators in a wider variety of languages than at present through their own publications policy.
10. Given the penetration of information technology (and therefore language) into all areas of the economy and of social and cultural life, the European Parliament could reiterate the need to take into account linguistic factors and information technology within a whole range of its structural and other programmes. Special arrangements should ensure that smaller languages are not disadvantaged by the scale of projects supported.
Our brief for this study was to look at a whole range of questions implied by the title, mainly from the point of view of the regional and minority languages of the European Union, but in the context of a wider multilingualism.
A great many people throughout Europe and beyond were consulted directly, or helped us find our way to the information we needed or to the appropriate experts. We thank them all and have listed those most involved in the list of acknowledgements. It is also worth mentioning that this is not only a report about the Internet, but was carried out to a very considerable degree on the Internet, by research on websites, by e-mail and by synchronous interactive interview. Indeed, it would scarcely have been possible to cover the ground in the six-month period allowed us without the use of the Internet. One of the authors, Alan King, worked from a base in the Spanish Basque Country, the other two from Wales.
The three authors are greatly indebted to two colleagues at the Mercator Centre in Aberystwyth: George Jones, who assembled and sifted the European Union documentation on the subject and read the proofs, and Lowri Catrin Jones, who produced the document in its paper and electronic forms as well as checking a great many details.