STOA PUBLICATIONS
Linguistic Diversity on the Internet: Assessment of the Contribution of Machine Translation PE 289 662/Fin. St.

On this page:

* Chapter 1 - Policy Options
* Chapter 2 - The Background: Social, Political, Technological
* Chapter 3 - Information Technology and the Small Language
* Chapter 4 - The Internet and Machine Translation

Chapter 1 - Policy Options

During the six months that we have been engaged in preparing this report the increase in Internet use has been phenomenal. Wireless access to the Internet, which was scarcely spoken of a year ago, is now driving a new wave of technology, and developments such as automatic translation of the spoken word on mobile phones - between a few languages - are promised in the near future. Every day established companies turn to the Internet and new dot.com companies are launched which operate only on the Internet. Even allowing for a degree of fashionable overstatement which may be subject at some point to a correction, it is undeniable that today's reality is changing at remarkable speed, as is the horizon of what will be possible tomorrow.

We note in the next chapter a gap between language-technology-rich languages and those languages that do not have the same access to language technology. The speed of change means that this gap is now growing fast, bringing a more urgent need to pursue countervailing policies. But we should also remember that Internet-related skills are spreading fast as is the range of uses to which the Internet can be put, and that inventiveness is not the monopoly of any one group or culture.

We have become aware in carrying out this study that we are not dealing simply with technology but with perceptions, with the culture of the Internet. Do people have the tools in their own language that make them feel the technology belongs to them? When they create a website, have they a clear idea of the audience and the language or languages that audience may understand? Will they perhaps choose to use a language other than their own, imagining - wrongly as it happens - that they are reaching the whole world? It is in this context that a universal system of machine translation - releasing people to use their own languages in a world context both to convey and to access information - would revolutionise the relation between people's perception of the local and the global, creating what is sometimes known as 'glocal' consciousness.

Every language stands on the Internet within a planetary space and face to face with all the other languages there present. Minority languages which have survived as enclaves within nation-states now have to perceive themselves, like all other languages, as standing at a cultural cross-roads, open to multilateral relationships and exchanges. There are great opportunities as well as problems, and the emphasis in our report has been on enabling communities to take up these opportunities.

The needs which we have noted in the case of minority languages of the EU and those languages which have a degree of official status in their region are also immediately evident in the case of small state languages such as Icelandic, Letzeburgesch, Irish, and Slovene, and could be true tomorrow, as the language-technology gap widens, of half a dozen slightly larger state languages in the enlarged EU. It is because of this convergence, and the possibility of joint programmes between small languages of different kinds, that we do not want to isolate regional and minority languages as a hermetically sealed category within, for example, the EU's MLIS programme.

At the same time we are aware that within the institutional structures of the EU, the official language of a nation-state is likely to have greater influence exerted in its favour, so that where there is a mainstreaming of the minority languages into non-language-specific EU programmes, we believe there should also be a reserved percentage kept in the budget for projects which include minority languages, and a monitoring of what resources are in fact allotted to these languages. In other fields, such as education, however, where important actions of the Socrates/Lingua programme are reserved specifically for the official languages of the EU (at least for the next seven years), it seems to us that a counterbalancing programme of initiatives directed specifically to regional and minority languages is necessary.

We are writing this study at a time when the final details of the Culture 2000 programme are still awaited, and guidelines from the EU for the European Year of Languages Programme, relating to 2001, have yet to be confirmed. Furthermore, no decision has yet been taken on the legal act which is to underpin the budget-line for regional and minority languages. This uncertainty makes the exact location of some of our policy options necessarily somewhat difficult.

Though immigrant languages are at the margins of our brief and outside our area of expertise, we should like to say a word about them here. Where they are official languages in their home country, it is there - possibly with EU financial support and expertise - that IT and language resources for those languages should be developed, a process that will at the same time encourage the growth of a skills base in those countries and avoid duplication. The software produced will then become available to the immigrant communities in the EU, always provided that literacy in the home language exists and can be maintained.

There are, however, substantial and well-established immigrant communities in EU countries whose home languages have little or no recognition in their countries of origin - Kurdish, for example, or the Berber Tamazight language. Not surprisingly, the first software for Tamazight was developed in Canada and in France, where the TeX system was adapted to Tamazight. It seems to us that the EU and its member-states have a special responsibility towards such groups. Where language standardization is adequate, projects for such languages could be included alongside regional and minority languages and member-state languages within, for example, the MLIS programme.

Roma/Sinti are a very special case, whose salience will grow. As Eastern European countries enter the EU, millions of people who speak some variety of Romani will become EU citizens. Questions of literacy and language-standardization will present great challenges if policies to counteract social exclusion are to go hand in hand with respect for cultural difference. All we can do here is express the hope that in the case of autochthonous European languages which require standardization, something can now be done within the teacher training action of SOCRATES which will later allow codification for IT purposes to take place.

Against a background of accelerating change, where the general direction is clear but the detail of technological development cannot easily be foreseen, and faced with a range of languages whose needs are overlapping but not identical, we suggest a range of possible actions, not as alternatives but as a bundle of approaches which, taken together, will support and enable the further development of linguistic diversity on the Internet and promote equality of access and of opportunity to EU citizens whatever their language. Machine Translation is one important element in the bundle, but itself depends on a whole raft of underlying language resources and on an IT environment which is friendly to linguistic diversity.

The following is the range of actions we suggest as desirable, and appropriate for the European Parliament to take:

1. Support for networking, conferences, circulation of experience of Internet projects in smaller language-communities, and small communities generally. Organizational and financial aspects are quite as important in this context as technical aspects. Special provision should be made to ensure the participation of the smallest linguistic communities, since the Internet is a technology usable by even very small groups for internal communication, for communication with a distant home culture, and for teleworking. These activities should be supported under the new MLIS three-year programme.

2. Support for the creation of multi-lingual pan-European portals bringing together, for example, Internet radios, samples of literature in translation, festivals of national culture. Minority language sites will benefit most by being grouped in a variety of ways, so that they become accessible from different perspectives - by theme, by region, and by the fact of being minority languages. MLIS as above.

3. Support for the creation of everyday IT applications such as spell-checkers, Internet browsers, and in particular word-processing/office packages. There is scope for cooperation because there are re-usable elements in such programs. Also, software with a choice of two or three screen languages is often suitable for minority language situations. From the point of view of minority languages, and from a cost/benefit perspective for the Commission, we believe there is a great deal to be said for carrying this work out in a non-commercial environment, avoiding problems of intellectual property and ensuring maximum reusability. Linux-based applications may be appropriate. Localization could also be supported if carried out on freeware/shareware. Support should be concentrated on smaller languages, which international software companies do not consider profitable. MLIS as above.

4. Creation of language resources for large and small European languages on a much greater scale than hitherto. We are talking here of underlying resources such as electronic dictionaries and corpora of the written and spoken language, which take time to build and are unlikely to be funded from commercial sources. For minority languages the priority might be to do for them what is already being done, or has been done, for EU member-state languages under the PAROLE project - the construction of a large linguistic corpus of the written language for each language according to the model already in existence. We propose a similar co-ordinated project for minority languages which would, of course, only affect a reduced number of "unique" minority languages. Minority-language groups whose language is official in another state will normally have been catered for already. MLIS as above.

We think that support for the measures outlined in paragraphs 1, 2 and 3 above might be accommodated within a new three-year MLIS programme amounting to 15m euros. While MLIS would of course have to select the best schemes that come to hand, it would be desirable to allot a notional 25% of resources to projects which included one or more minority languages. It would be particularly important to support projects which include smaller/minority languages in the field of e-commerce. The next few years offer a special window of opportunity for the 'unique' minority languages in the EU. After that there will be strong demand from the state languages of the applicant states. (It is worth noting that minority language groups in the applicant states are, almost without exception, transfrontier minorities, so that their languages are state languages elsewhere).

Language resources are a much more expensive field and we suggest that a three-year budget of 50m euros to include sections 1-4 above, again on the basis of a notional 25% directed to projects involving minority languages, would not be too much.
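
As a rough illustration of the figures in the two paragraphs above, the following minimal sketch works out the notional 25% share for each envelope. It assumes one possible reading of the text, namely that the 50m euro budget includes the 15m euros suggested for items 1 to 3, leaving the remainder for the language resources of item 4. The variable and function names are illustrative only.

```python
# Figures taken from the text (millions of euros, over three years).
MLIS_ITEMS_1_TO_3 = 15.0      # suggested MLIS envelope for items 1-3
TOTAL_ITEMS_1_TO_4 = 50.0     # suggested overall envelope for items 1-4
MINORITY_SHARE = 0.25         # notional share for projects including minority languages


def notional_minority_allocation(envelope_m_eur: float, share: float = MINORITY_SHARE) -> float:
    """Return the notional amount for projects involving minority languages."""
    return envelope_m_eur * share


# Assumption (not stated outright in the text): the 50m figure includes
# the 15m for items 1-3, so the rest is what remains for item 4.
language_resources = TOTAL_ITEMS_1_TO_4 - MLIS_ITEMS_1_TO_3

print(notional_minority_allocation(MLIS_ITEMS_1_TO_3))   # 3.75
print(notional_minority_allocation(TOTAL_ITEMS_1_TO_4))  # 12.5
print(language_resources)                                # 35.0
```

On this reading, roughly 3.75m euros of the MLIS envelope and 12.5m euros of the overall envelope would go to projects that include one or more minority languages.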

5. Support for the participation of smaller European languages in the Universal Networking Language (UNL) project mentioned in chapter 3, subject always to a positive evaluation by the EC of the whole project. UNL is of considerable interest for all languages, but of particular interest to smaller languages, whether state, regional or minority languages - all languages in fact that are not likely to find the resources to develop a large array of language-pair MT systems. We understand that there are plans to set up a centre for teaching the UNL methodology, and that the research itself will be refined as a variety of languages develop the system. It is essential that smaller European languages, including minority languages, be associated with the project as soon as possible so that they may develop as early as possible and in an appropriate form any necessary language resources they may be lacking. There is therefore an interrelation here with paragraph 4 above. Framework 5 is the appropriate vehicle for support for a project which is still in the research and development stage.

6. Support for language-learning modules devised specifically for home learning over the Internet. We have in mind basic language-learning modules of a multi-media kind, based on situations that are not too culture-specific, in which the medium of instruction is entirely graphic, or if in words, is kept to a minimum and made available in a wide range of languages. Such materials could be developed on a cooperative basis and would be particularly suited to the Internet, where one does not know from what language the learner is starting. Such projects are in theory appropriate for cooperation between partners in all kinds of language-groups - official, immigrant, regional and minority; but if, because of the categorization of languages, support were impossible across all categories under, for example, the SOCRATES/LINGUA programme, then projects relating only to minority languages might be supported under the Regional and Minority Languages budget-line. There may also be possibilities of a one-off nature under the European Year of Languages Programme which has yet to be confirmed.

7. Support for cooperative projects to write introductions to IT (manuals and simple software) in minority languages and with particular sensitivity to the need that may arise to adapt standard applications packages. We have in mind both materials for schools and for adult vocational training. There is a terminology aspect to this work which would incidentally help forward the Machine Translation of software. This is a special need of minority languages and possibly the very smallest state languages. Regional and/or Minority Languages Budget Line.
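
The point made under item 5 about language-pair MT systems can be illustrated with simple arithmetic. This is a minimal sketch, assuming that direct translation needs one system for each ordered pair of languages, while a pivot approach of the UNL kind needs only one converter into, and one out of, the common representation for each language. The function names and the language counts are illustrative assumptions, not figures from this study.

```python
def direct_pair_systems(n_languages: int) -> int:
    """One-directional MT systems needed to link every language to every other."""
    return n_languages * (n_languages - 1)


def pivot_converters(n_languages: int) -> int:
    """Converters needed when every language passes through a shared pivot.

    Each language needs one converter into the pivot and one out of it.
    """
    return 2 * n_languages


# Hypothetical language counts, chosen only to show how the gap grows.
for n in (5, 10, 20, 40):
    print(n, direct_pair_systems(n), pivot_converters(n))
```

Under these assumptions the direct approach grows with the square of the number of languages and the pivot approach grows linearly, so a smaller language joining such a scheme would need only its own pair of converters.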

There are a number of areas where resolutions of the European Parliament might offer the best way forward.

8. The European Parliament could emphasize its support for the principle that Internet domain names in any language (and using all diacritic marks) should be registrable, alongside the use of the .eu suffix. We believe this indeed to be the EU's position in the ICANN negotiations. This is an important symbolic matter.

9. The European Parliament could ask the European Commission itself to help develop memory-based computer-aided translation in a wider variety of languages than at present, through its own publications policy. In areas where regional languages are well entrenched in the administration and a body of professional translators exists working between the state and the regional/minority language, such translators might be used for the translation of selected Commission publications into those two languages. Computer-aided translation will not be helped by single symbolic acts of translation into regional and minority languages, but by a steady if small flow of documents in the same limited subject-area.

10. In the information society, all languages, including minority languages, are closely bound up not only with culture but with the economy and with economic opportunity. It is important for the European Parliament to reiterate the need to take into account linguistic factors in relation to IT in programmes that may seem at first sight to have a different focus - support for Objective One areas or for the development of the European Media industry. It is particularly important that very small language groups should not be overlooked simply because they do not fit the scale of projects envisaged in a given programme, which in turn presupposes the scale of matching funding required.
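
Item 9 above argues that memory-based computer-aided translation benefits from a steady flow of documents in one subject-area. The following minimal sketch suggests why: a translation memory stores previously translated segments and proposes the stored translation when a new segment is sufficiently similar to an old one. It uses Python's standard difflib module as a stand-in similarity measure. The class name, the 0.75 threshold, the example sentences and the placeholder translations are all illustrative assumptions and do not describe any system used by the Commission.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


class TranslationMemory:
    """Toy translation memory: source segments mapped to stored translations."""

    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold          # hypothetical cut-off for a usable match
        self.segments: Dict[str, str] = {}

    def add(self, source: str, target: str) -> None:
        """Store a translated segment for later reuse."""
        self.segments[source] = target

    def lookup(self, source: str) -> Optional[Tuple[str, float]]:
        """Return (stored translation, similarity) for the closest match,
        or None if no stored segment reaches the threshold."""
        best: Optional[Tuple[str, float]] = None
        for stored_source, stored_target in self.segments.items():
            score = SequenceMatcher(None, source.lower(), stored_source.lower()).ratio()
            if best is None or score > best[1]:
                best = (stored_target, score)
        if best is not None and best[1] >= self.threshold:
            return best
        return None


tm = TranslationMemory()
tm.add("The deadline for applications is 31 March.", "[stored translation A]")
tm.add("Applications must be submitted in writing.", "[stored translation B]")

# A near-repeat from the same subject-area finds a reusable match.
print(tm.lookup("The deadline for applications is 30 April."))
# A sentence from an unrelated subject-area does not.
print(tm.lookup("Fishing quotas were revised last year."))
```

The more often documents from the same limited subject-area pass through such a memory, the more new segments find a close match, which is the practical sense in which a steady flow helps more than isolated translations.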

Chapter 2 - The Background: Social, Political, Technological

Linguistic diversity on the Internet is increasing...

There is a common popular perception that the Internet is an overwhelmingly English-language medium. This is understandable because the early and phenomenal growth in its use happened in the USA where personal computer ownership was most widespread. Estimates of Internet usage by language vary considerably, but all point in the same direction and indicate that whereas there is a steadily increasing growth of usage in English-speaking countries, there is a far greater rate of increase in other countries, and particularly in Europe, China and Japan. For example, in July 1999 one estimate was that 128 million people accessed the Internet in English, whereas 88 million accessed the Internet in other languages (1). But the same source showed that whereas the English figure had trebled in four years, the "other languages" figure had increased eightfold in the same period (2). A different source predicts that by 2002 a majority of Internet users worldwide will be non-English-speakers and that three years later their proportion will have risen to 60% (3).
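
The estimates quoted above can be turned into implied shares with a little arithmetic. This is a minimal sketch that takes the July 1999 figures and the four-year growth factors at face value and back-calculates the earlier totals. The back-calculated 1995 figures are derived values, not numbers reported by the source, and the function name is illustrative.

```python
# Figures quoted in the text (millions of users, July 1999).
ENGLISH_1999 = 128.0
OTHER_1999 = 88.0

# Growth over the preceding four years, as quoted.
ENGLISH_GROWTH = 3.0   # "trebled"
OTHER_GROWTH = 8.0     # "increased eightfold"


def share(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Implied figures four years earlier (derived here, not reported by the source).
english_1995 = ENGLISH_1999 / ENGLISH_GROWTH
other_1995 = OTHER_1999 / OTHER_GROWTH

# Non-English share in 1999, then the implied share in 1995.
print(round(share(OTHER_1999, ENGLISH_1999 + OTHER_1999), 1))  # 40.7
print(round(share(other_1995, english_1995 + other_1995), 1))  # 20.5
```

On these assumptions the non-English share roughly doubled in four years, from about a fifth to about two fifths of users.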

In the USA, one in two persons has already used the Internet, so the most that can be foreseen is a doubling. India, on the other hand, where English is one of the official languages but is spoken by a small elite only, has at present a PC penetration of just over two per thousand inhabitants (4). While many of those present PC users may be assumed to know English, that assumption cannot be made for the future. But even when the target audience for a service or product may be assumed to have some knowledge of English as a second or a third language, it is increasingly recognized that in competitive situations the companies that use the customer's language and understand the customer's culture will have the edge. In the United States itself, an estimated 45 million people access the Internet in languages other than English from homes where English is not the home language, though those same people mostly use English at work (5). There is also the question of cultural attitudes to other languages. Just as some cultures accept, while others resist, the use of subtitling on television, so it appears, for example, that Japanese with some knowledge of English still prefer to read English websites in (even inadequate) Japanese translation.

Global communication on the Internet is therefore going to have to take account of linguistic diversity, and global e-commerce is going to have to resort increasingly to multilingual presentation and management of information, and therefore to translation, including machine translation. Globalization requires localization. Equally, where information retrieval and text summarization are concerned, techniques of accessing and selecting information via the web will have to take account of multilingualism.

A range of interrelated language technologies including machine translation have been developed...

Since the Second World War a range of language-related technologies have been developed at an increasing pace, mainly in North America, Asia and Europe in fields such as machine translation, information retrieval, speech recognition and language recognition. These technologies rest on analytic work carried out and techniques developed in relation to particular languages, and across languages, and today such work requires language resources to exist in structured electronic form in those languages. By language resources we mean, for example, large annotated linguistic corpora and linguistic descriptions, speech databases, computerized dictionaries in database format, together with appropriate standards and methodologies. Many of the same techniques are used across a variety of applications. It is therefore impossible to discuss multilingual and translingual operations on the Internet or Machine Translation without reference to language resources more generally (6).

While each language technology has its own problem areas, and while levels of performance are uneven, more has been achieved than most people outside these fields are aware of. Europe is a magnificent test-bed for language technologies. Countries with more than one language have a ready-made environment for research, testing and evaluation, while the European Union, taken as a whole, has every incentive to make multilingualism work. Indeed the European Commission has been a key player in research and development and also in seeking to raise awareness of multilingualism in the information society (7), but it is still true to say that such awareness is greater in the research community than in the business community, or public administration or among the public in general.

The EUROMAP project concluded that language technology could emerge as an important "technology cluster" for Europe which would "confer first-mover competitive advantage to the EU" but that for this to happen required more active cooperation between European and national policymakers and between the research and industrial constituencies (8). Expert opinion also seems to agree that integration of the various language technologies into real-world applications now needs to take place alongside improved performance in some fields and further basic research in others. The growth of Internet use and of e-commerce in particular, with online targeting of customers by language, will precipitate and accelerate the need for multilingual applications, and in this context we can expect the market to respond, but only where some languages are concerned.

The real threat to linguistic diversity on the Internet, as we shall see, is not that one single language will prevail, but that five or six world languages will develop the full range of language resources and integrated language applications, including Machine Translation, and that those many other languages which do not do so will be excluded - not from the Internet as it now is but from many of the processes and transactions that will increasingly be carried out over the Internet.

But uneven development between languages is one factor widening the gap between information rich and information poor and impeding universal access

Here we come to a central problem underlying the particular questions this study will address - uneven development as between languages. Languages start from an uneven situation. Some languages have large numbers of speakers, others have few. But some large language communities are too poor to offer a market for information technology, while some relatively rich language communities are too small for large software companies to consider them a viable market. Policy for the support of language technology, and consequently funding for research and development is uneven between countries (9) and as a result expertise is unevenly spread. Then again, within countries the availability of language resources is uneven as between the historic official and unofficial languages of the same nation-state.

The creation of large-scale language resources as opposed to applications packages requires a long-term strategic vision and cannot be left to commercial interests alone (10). It is relatively expensive, and the returns come indirectly and slowly at first, so public funding is usually required, which is unevenly available as between languages. But this uneven development of language resources as between languages will result in a subsequent inability to develop a whole range of applications, including machine translation, for the marginalized languages, so that the gap between language-technology-rich communities and the rest will widen and widen. Languages in which people cannot interact with computers over the Internet will come to be considered inferior, pre-technological.

Nor can the creation of everyday applications such as word-processors, spell-checkers, search engines and Internet browsers be left entirely to the market without compounding the already existing inequalities between languages. If a language-community is considered too poor or too small to offer a viable market for these products, members of that community can be seriously disadvantaged in terms of economic opportunity, education and training, freedom of expression, access to information and full participation in the democratic processes.

And beyond these more specific questions lies the overarching question of how the Internet culture itself, or perhaps we should say the wider computer culture, is perceived, particularly by young people. It is not possible to draw an absolute line between the language of the tools and the language of the content. If the tools are not available in one's own language, there will be a tendency, well-known to minorities in earlier historical contexts, to become assimilated to the language of the tools in other respects too. The right to use your own language (with all necessary diacritic marks) in the naming and addressing system of the Internet rather than continuing with the present US-centric system - a question being discussed between the EU and the US Department of Commerce at the time of writing - is very important symbolically in this context (11).

Language and IT in economic and cultural life, education, training and citizenship

In developed countries, information technology is becoming increasingly integrated with virtually every aspect of life. Its importance in economic life is self-evident and needs no elaboration here. Equipping children and adults to use the technology has become a necessary part of education and training, and the same technology then also becomes a tool for teaching other subjects. Then again good citizenship depends on access to information in your own language. The Internet is a very cost-effective way of making information available and it also offers possibilities for consultation and participation in democratic decision-making. Indeed it is not too much to speak of an emerging cybercitizenship, but this can only be based on an equality of treatment as between all citizens. One dimension of this equality must be linguistic equality. Nor should one overlook the cultural and creative aspects of language in IT. The Internet already has its poets and writers working in a new medium and in a new symbiosis with graphic material and sound files. Self-expression and freedom of expression in your own language are also aspects of access to the Internet.

One dimension of education needs particular mention in the context of this study - the teaching and learning of languages. Its relevance to this study is twofold: in the first place, seldom in European documentation is language learning discussed without some reference to multimedia as an aid to language acquisition, and to telematics as a way of making contacts, accessing learning materials and establishing direct two-way exchange between learners and teaching institutions in different countries. Secondly, the teaching of languages has a direct relation with the supply of human translators.

Nowhere in this document do we suggest that Machine Translation will remove the need for more and better language teaching or for human translators. Mediated communication via computers and machine translation, and better direct communication through improved language skills, are two parallel strategies for achieving greater cohesion and integration within Europe. The translation profession is in no way threatened by technological developments. Translation needs are growing at a phenomenal rate, reflecting the accelerating pace of exchange of information between linguistically diverse communities, far outstripping the recruitment and training of human translators and the capacity of most consumers to pay for translation. What machine translation offers human translators is the prospect of increasing their productivity and taking some of the routine out of their work. Their role will move in the direction of becoming more often editors and cultural adapters of material rather than simple translators of texts.

The European Union - Philosophy, Policy and Practice

From its inception the European Union has based its philosophy, policies and to a degree its internal practice on the principle of multilingualism. A series of reports and programmes cited throughout this report have stressed the interrelated importance of multilingualism, the economy, the information society and citizenship.

But the EU is also a major employer of translators and interpreters, and a major funder of research and development in language technologies. As the number of member-states and languages has increased (and is soon to increase further) so the need has grown to increase productivity through computer-aided translation systems and ideally to automate some kinds of translation altogether. It is not too much to say that the preservation of the multilingual ideal and the equable functioning of the enlarged European Union will depend on the further development and integration of language technologies that relate to machine translation, and their introduction into the mainstream of citizens' lives.

For as well as multilingual practice in the EU institutions, and communication with member-state governments and institutions, there is the question of relating to European citizens whose lives are increasingly affected by Europe-wide policies, programmes and directives. Here the variety of languages needed to communicate fully and on a basis of equality with all European Union citizens is far greater than the number of official member-state languages. We have in mind both the languages of recent immigrant groups and the autochthonous regional and minority languages of the EU.

Globalization of the economy is by many observers seen as strengthening regional identities within nation states, in which context the use of a region's language in the public domain can be an indicator of cultural and economic self-confidence and also a way of giving the region a cultural salience and positioning it in the global market. The Committee of the Regions has consistently called for support for regional and minority languages in fields such as education (12) and for the transmission of certain EU information in minority languages (13).

At the same time recent events in eastern Europe have underlined the need for international agreements on minority linguistic and cultural rights as a timely means of preventing conflict. The extension of traditional concepts of individual human rights into the cultural and linguistic fields, that is to say in the direction of group rights, is evident in the Council of Europe's Framework Convention on National Minorities and is spelt out much more fully in the Council of Europe's Convention on Regional and Minority Languages (14).

During the last twenty years - the period in which the European Union, following a series of reports and resolutions in the European Parliament (15), has been giving support to regional and minority languages - great changes have also been taking place in the internal constitutional and language arrangements of several EU member-states. A number of regional and minority languages have become official or semi-official on their own territories within the states concerned, well entrenched in public administration, education and the media, and more extensively used than hitherto in commercial life. In recent times, the decision to ratify the Council of Europe's Convention on Regional and Minority Languages has involved recognition by some states of a number of languages that hitherto had little formal recognition.

Taken together, these developments have created an increased need for translation within nation-states and helped develop language-based industries. At the same time the expectations of citizens speaking some of these languages have been raised to levels not far short of those of speakers of official member-state languages. If they are able to deal and be dealt with by the public authorities in their own region in their own language, then they will expect no less of European institutions.

Modern language teaching, we have already argued, is relevant to this study. Improving language skills has been a continuing concern of the European Union (16) and is a main objective within the present Socrates, Leonardo and Youth programmes (17) . The Council of Europe and the European Centre for Modern Languages, established under the Council's auspices in Graz, also have a strong record in the field of advancing language teaching and learning. In particular one may mention the establishment of "threshold levels" for a range of languages including a number of regional and minority languages (18) . The designation of the year 2001 as "European Year of Languages" by both the EU and the Council of Europe will help to raise language awareness in the public at large and there must surely be an Internet dimension to this consciousness-raising.

Other relevant areas of EU policy are media and culture. A succession of media programmes have encouraged co-production with the aim of stimulating and developing the European film industry, and this has necessarily involved back-to-back production and subtitling, which clearly relate to multilingualism. At the time of writing final details are not available of the new Culture 2000 programme, but among other actions it is likely to carry forward elements of the former Ariane programme of translation, which gave special attention to smaller and minority languages.

This study and the policy options it sets out can therefore be situated within the context of established policies of the European Union relating to: research and development of language technologies, public information and citizenship, media and culture, the multilingual information society, education and vocational training, the teaching of modern languages and the raising of language awareness, and the promotion of regional and minority languages. In reviewing the European documentation historically we have noted an increasing convergence between the recommendations in the different fields, and a shift when speaking of regional and minority languages from the terminology of protection and conservation to the terminology of development, access and equal citizenship.

Chapter 3. - Information technology and the small language

Minority languages and small languages

Our main focus and the brief for this study is to consider the possibilities of the Internet and of Machine Translation in relation to the autochthonous regional and minority languages of the European Union. These are spoken by over 40m citizens of the European Union (19) and can expect to enter on a new period of development with the passing of the Legal Act currently going through the European Parliament.

But we shall not consider these languages in isolation. They share some problems with the smaller state languages and with some immigrant languages, particularly those which are minoritarian in their countries of origin, and are all part of a wider multilingualism. The distinction between a small state language and a strong regional or minority language becomes very tenuous in some parts of our discussion. We can go further and suggest that in respect of language technology many of the difficulties and threats to minority languages will soon be felt if they are not being felt already by all but the largest of the world's languages. In this context, anything that addresses minority-language problems is likely to be of much wider interest.

We shall not here attempt hard and fast definitions of regional and minority languages nor offer a definitive list, since there are always disagreements at the margins. But where a member-state has ratified the Council of Europe's Convention on Regional and Minority Languages (20), a list of recognized languages for that state exists, and the member-state government has taken on responsibility at a certain level for the safeguarding and promotion of each of those languages in a certain number of fields. One of the fields mentioned in the Convention is Media, and we would argue that, as things have developed, this should be interpreted to include "New Media".

Readers of this study may also want to take as a guide the list of languages studied in the Euromosaic Report (21), drawn up for the European Commission, or the list of language groups that have membership of the European Bureau for Lesser-Used Languages (22), or the language communities listed in the databases of the Mercator centres (23), all of which have a high degree of overlap.

Instead of seeking an absolute definition of regional and minority languages we shall set some markers of our own which are relevant to the subject of the study. Where a regional language/dialect does not have an agreed written standard, then the establishment of such a standard must necessarily precede work on text-based language resources and applications. These cases thus lie outside our present focus until such time as that agreed standard exists. We return to this question in the final chapter.

Some of the minority languages of the EU exist only in minority situations, whether minoritarian in one member-state only, as with Sorbian or Welsh, or minoritarian in two or more member-states, as in the cases of Catalan and Basque. But there are also transfrontier minority languages, where although the language is minoritarian on one side of the border, it also belongs to a large and sometimes powerful language-group possessing its own nation-state on the other side of the border (or further afield), as in the case of the German minorities in Belgium, Denmark or Italy. An intermediate case might be that of the Slovene minorities in Italy and Austria who indeed have access to state-supported language resources over the border in Slovenia, but within a language group whose total size is quite small - indeed smaller than some minority-language groups in the EU.

When, in this chapter and the next, we discuss the basic IT environment and the building of language resources, we shall have in mind those minority languages which exist only in minority situations, or very small state languages, or languages which fall into each of those categories on two sides of a border. We do not have to worry about the availability of word-processors and Internet browsers, or the creation of linguistic corpora for German-speaking minorities outside Germany. These exist within the language. But when we come to consider uses of the Internet for communication within and between minority language-groups, German-speaking minorities will certainly find themselves in the same set of regional and minority languages as Frisian or Scottish Gaelic.

Linguistic minorities as assets in the wider society

The importance of languages and their survival - all languages, whether belonging to a majority or a minority - to the identity of the individual, to the transmission of culture and values within the group, and to the self-definition of Europe - has often been reiterated by European institutions and here we do no more than restate it briefly.

In the field we are concerned with, however, there are costs attached to principles, and there is a danger that the provision of language resources for minority languages might be seen only in terms of an extra cost. We therefore think it worth emphasizing the positive aspects of linguistic minorities for the wider societies in which they live, and also what minorities are doing for themselves in terms of making good use of the Internet.

In the European Union, almost all speakers of minority languages are bilingual, and therefore already one step further than monolinguals towards the trilingualism which the EU has held out as desirable for its citizens. A minority language often serves as a bridgehead into one of the languages of another state, and sometimes into a different language family. Thus Sorbian speakers in Germany are linguistically and geographically close to speakers of Czech and Polish and beyond that offer a way into the Slavonic languages more generally. These linguistic links, which can be strengthened by the Internet, are assets in economic terms and also in terms of cultural exchange both to the state in which the minority exists and to the European Union. The same is true of the languages of immigrant minorities which are in some cases very important languages numerically in the countries and continents of origin. And what is true of minority language groups within a given state is true of smaller state languages within the EU as a whole.

Again, it is not always realized that some of the best examples of successful language learning are to be found in minority-language areas. The teaching of Basque to adults and children, and the teaching of children through Basque in Euskadi, the Basque Autonomous Region in Spain, is a success story on a scale that it would be difficult to match anywhere, and deserves greater attention from all those concerned with language teaching.

Many minority-language areas are bilingual areas, where both the majority and the minority language have co-officiality. This has led to the development in these areas of bilingual administrative procedures and design services, back-to-back film production, subtitling for media, simultaneous and written translation services, an interest in machine translation, and an overall awareness of language pluralism not just among specialists, as tends to be the case in monolingual societies, but in the public at large and at political and institutional levels.

Speakers of minority languages have been quick to seize the opportunities of the Internet

Once the hardware and communications infrastructure is in place, the Internet in its present form has many advantages for minority language communities as indeed for all small communities. We should remember, however, that some minority-language areas coincide with European Objective 1 areas - that is to say they are the poorest in the EU in terms of average income, and therefore likely to have fewer personal computers. They may also be rural and/or mountainous and have a poor telecommunications infrastructure. Yet it is often areas of this kind, far removed from large centres of population, that could most benefit from the Internet in terms of their economy and in some cases already do (24) . It is therefore extremely important that support for minority languages and information technology, both infrastructure and training, should go hand in hand where Objective 1 areas are also minority language areas.

The uses to which minority language groups put the Internet may seem at first to be the same as many we find in majority languages, but the significance is often different. Any presentation of the minority language and culture to a world-wide audience by definition breaks new ground since minority language groups, whatever access they may have had to broadcasting within the nation-state, have scarcely ever had the political or economic strength to project themselves outside the state in which they live, or indeed, in many cases, to their fellow citizens in other parts of the same state.

Thus the broadcasting organization BBC Cymru/Wales, which produces radio and television programmes mainly for Wales, has recently launched an electronic daily newspaper in Welsh (25) . The significance of this is that, although Welsh is a relatively strong minority language, with well-developed radio and TV, there never has been a daily newspaper in Welsh for a variety of historical and geographical reasons.

A newspaper on the Internet becomes immediately available to the Welsh-speaking diaspora around the world, not least in England where perhaps as many as 150,000 Welsh-speakers have only very limited access to broadcasts in Welsh by means of overspill of radio and television in border areas and the digital version of the Welsh television channel S4C which is available in the UK through satellite subscription. Diasporas can be very important for linguistic minorities who have often lost population from the home area through emigration. The Internet makes it possible for them to remain in regular touch with what is happening back home and, if they so wish, to lend their talents to the service of the home community.

Another ambitious and comprehensive electronic newspaper is the Catalan Vilaweb (26) , founded in 1995 by Vicent Partal and Assumpció Maresme, both experienced journalists. It is an electronic newspaper with a network of local editions which appear in towns and villages throughout the Catalan lands but also in diaspora areas such as Boston and New York, creating a kind of "virtual nation". The site also incorporates a directory of electronic resources in the Catalan language and reaches 90,000 different readers each month. This critical mass of users attracts some international web advertising, and local editions collect their own local advertising. Indeed the organizational and financial arrangements of Vilaweb are every bit as interesting as the technical ones and could be of interest in other minority languages. A similar network exists in Galicia (27) .

There are many courses teaching minority languages on the Internet. The most ambitious is likely to be HABENET, a three-year project for teaching Basque on the Internet (28) and costing some 1.8m euros. Internet courses in minority languages have new possibilities but also face new challenges. Most face-to-face courses and course materials for learning minority languages assume a knowledge of the local majority language and this is undoubtedly where the main demand will be, on and off the Internet. But it seems to us that there would also be room to develop a multi-media language-learning package that was language-independent or language-adaptable so far as the language of instruction went. Such a course would make each language approachable from any other language at least at an elementary level.

Many libraries and museums have websites with on-line catalogues which present a particular culture to the world. In minority-language areas these are often bilingual and sometimes trilingual. Part of the emergence of minorities into the wider world through the Internet might be a greater spread of multilingualism on these cultural sites and we should consider how best this might be encouraged at European level.

One presentation of culture over the Internet which relates particularly to translation is the projection of literature with samples in translation, and the listing of contacts for the selling of publishers' rights. The Welsh Cyfwe (29) site is a case in point, and this is due to be extended to a number of other languages (30) . A fascinating and more purely linguistic site is Scots on the Wab, which presents the Scots language (through Scots itself and through English) in a variety of informative, innovative and amusing ways (31) .

The rest of this study will be concerned with language in close interaction with the technology of the Internet and Machine Translation. The reasons we have spent time on these examples of present good practice in respect of use and content are in the first place to show that there is a richness of content and a readiness to seize opportunities among minorities; and secondly that there exist within these communities ideas and experience which need to be circulated. It is important to emphasize that one is talking not so much about sharing technical skills as about organizational and financial management experience in an Internet context. Networking this experience is a very appropriate field for EU support, as are collaborative projects. The European Bureau for Lesser-Used Languages at a general level, and the three Mercator centres which deal at a specialist level with minority languages in legislation, media (including new media) and education, are all well placed to foster collaboration.

We have taken our examples of good practice from minority-language communities who are often the most acutely aware of linguistic factors. But there are, of course, many relevant good examples to be found in other small language groups, and indeed among small communities who speak large majority languages. For example, Parthenay, a small French-speaking town in France, has an excellent municipal website combined with internet radio broadcasts (32) which could offer an interesting model within the reach of the smallest linguistic minorities. There are good examples of networks for small communities that might be adapted for minority languages to be found both in Europe (33) and further afield (34) .

For this reason we would not wish to see a ghetto created for regional and minority languages when it comes to European support for networking in this field. At the same time, however, we fear that minority language groups may be marginalized if simply subsumed in a general programme. We shall return to this question in the final chapter.

But there are problems in the basic IT environment ...

We have looked at these examples from the point of view of what is produced for the Internet and found many positive initiatives. Things do not look so good, however, if we start from the point of view of the user of a personal computer who is also a speaker of a minority-language or very small state language.

That person will in all likelihood have learnt, at school or in vocational training, how a particular operating system works, how a given word-processing package works, how a certain Internet browser works, perhaps even how a web-authoring package works, using software and terminology not in his or her first language. The computer bought privately or used at work may well have arrived with software already installed in the same majority language. There is a double problem here. Software in the minority language may not be available, but even when some smaller items are available, marketing them in the face of world-wide brands with high levels of software integration is a difficult task. The tendency to monopoly is also a tendency towards the exclusion of languages not considered commercially viable for the software manufacturers who have that near-monopoly.

When it comes to word-processing, if minority-language speakers type text in their own language they may find that features such as autocorrection that are user-friendly in the majority language create errors in the minority language. The hyphenation system may be inappropriate for their language, and not all the necessary accents and other diacritic signs may be easily accessible. No spell-checker may be available for their language, or if it is, it may not be integrated with the other programmes. The search engines offered by their internet service provider may not be able to select sites by language, or may not be efficient at doing so. The portals into the Internet offered by internet service providers are seldom constructed in a way that allows a minority-language presence to surface into the user's consciousness.

The sophisticated personal computer user will be able to circumvent some of these problems, but as PC and Internet use spreads, so in all probability will the numbers of less sophisticated users. There is a tendency for word-processing software to become more user-friendly with such users in mind, but in so doing it often becomes more language-specific, which then makes it less user-friendly when used with those languages which have not been targeted.

Then there is the question of the screen language. If menus and help facilities are not available in your language this in turn means that the whole terminology relating to personal computers and the Internet will make the technology seem to belong to a culture not your own, with which you may nevertheless want to identify because of its aura of modernity. This problem is compounded by the dearth of manuals and teaching materials - both for schools and for adult vocational training - in the minority or small language.

Before we can even begin to discuss machine translation and its integration with the Internet to serve smaller languages, we have to understand that where IT is in question these languages start with much more basic needs, which, essentially, are to provide speakers with an IT environment at least to some extent in their own language. This is apparent not only from what official and voluntary language organizations in the minorities have told us but from the strategies and priorities they have applied in practice.

Contrasting experiences and possible strategies

Here we want to consider two contrasting experiences and the possible strategies they suggest for other minorities and for European support.

The first is the agreement between the Basque Autonomous Government and Microsoft for the localization in the Basque language of Windows 95 and MS Office, subsequently extended to Windows 98 and Office 2000. This was a very expensive undertaking, paid for by the Basque Government and provincial councils. The first-time development costs were approximately 1.8m euros paid to Microsoft, but the total cost was more like 2.4m euros if the supplementary work carried out by Basque Government staff on translation and marketing is costed in. On the positive side it has to be said that an operating system and a very commonly used set of programmes are now available in Basque not only for offices but also for schools and vocational training, at a time when most minority language groups in the EU are still wondering what to do. However, the costs in themselves make this a model that can hardly be recommended to minority language groups with fewer resources.

Besides, there are other considerations which make this solution less than ideal. Because of the incorporation of complex components into new versions of the programmes, updates to the localization become more expensive rather than cheaper as time goes on. Moreover, by the time a programme has been localized in small languages such as Basque - which are allocated low priority within Microsoft - new versions of the original are already becoming available in English and some other languages which offer a large market. Finally, what might be thought a major advantage of cooperation with an international company, namely access to its marketing skills and distribution network, does not apply. The Basque version was not important enough to Microsoft for them to be interested in promoting it themselves.

The Basque Government has now looked at information technology needs for the next ten years. Localization is only one kind of action contemplated, and on the whole the assessment of costs and benefits seems to favour other priorities: the development of spelling and grammar checkers, of OCR tools specific to Basque, and of voice recognition software, and also support for making Basque dictionaries and reference works available for on-line public use. A five year plan starting this year (2000) is likely to support local companies working in some of these fields. There is also an interest in developing tools for the automatic translation of web-pages.

The Catalan Autonomous Government too entered into an agreement with Microsoft and has appointed a committee of experts to ensure that a strategy is in place so that electronic resources are created in the Catalan language. But the experience from Catalunya we want to look at here is entirely within the non-commercial and voluntary sector.

Softcatalà is a small-scale non-profit organization set up by a group of friends in 1998 (35) . It has two objectives: to use Catalan in the IT domain (rather than English, Spanish or French) and to localize freeware and shareware rather than commercialized products. The first piece of software localized was Netscape Navigator after Netscape decided to make the code for its browser freely available. The Catalan version of the browser can be downloaded free from the Softcatalà website or obtained on a CD distributed and paid for by the Ajuntaments de Catalunya.

The initial localization took three people ten months to carry out and involved translating 30,000 words. Softcatalà is essentially run by these three people who during the summer especially may have the help of up to twelve people, usually students. Subsequent updates to Netscape Navigator have been translated within three months of their US release and often earlier than the Spanish version. Softcatalà always localizes directly from the original English version.

Softcatalà does not keep statistics of how many people download its software but a software survey on the Vilaweb website in 1999 drew 333 replies from people most of whom were frequent or established Internet users concerned for the Catalan language. This may not be a good guide to the absolute numbers using Catalan software, but it is interesting that, within the sample, Softcatalà's free Netscape was far more popular than Microsoft Internet Explorer in Catalan which was used by fewer than 2% of respondents.

This seems an attractive alternative route for smaller applications such as browsers and spell-checkers, but a large word-processing package (Wordperfect, for example, has made its code freely available for non-commercial purposes) would require very considerable effort if done on a voluntary basis. There are also problems of integration, of course, if the aim is to offer a product range comparable, for example, with Microsoft Office.

Marketing is a problem, but, as we have seen, the same was the case with Microsoft software localized into Basque and Catalan. However, given that governmental or voluntary organizations have to do the marketing in each case, there must be some advantage in marketing a free product. The Basque Microsoft programmes, despite the heavy element of subsidy, have had to be purchased by individuals and institutions, including the Basque Government itself.

European Support

By its very nature, localization is language-specific and therefore projects concerning one language only might be thought to concern only the region and the nation-state to which it belongs. If that were all that remained to be said, the outlook would be bleak for the smaller or weaker minority languages which do not have the strategic planning or the strength of official support given in our example of the Basque Government. In most EU minorities development is fragmentary - a simple word-processing package based on shareware in Breton, a spell-checker and voice recognition software for Welsh. The point has also been made to us that joint European projects in this field are difficult to organize for another reason, namely that each area has its own rhythm of development and priorities at a given time, and operates within different budget constraints.

Nevertheless we feel that there is scope for joint projects in the field of applications to be loosely coordinated in ways that will differ from application to application, and thus to qualify for European support. We know of one such project already in the pipeline - for an Internet browser that will be available initially in Breton, Irish, Scottish Gaelic and Welsh. Very often there will be reusable elements in such projects, which in itself is a justification for cooperation. Just as there is a great deal to be said for any localization of existing software to be based on freeware/shareware, so there is a strong case, when public money (both regional and European) is involved in creating new software in minority languages, for that to be made freely available on the Internet. This will encourage the easy transfer of re-usable elements to future projects.

There is no doubt that the IT application that would find most acclamation by European minority language groups would be an office suite (word-processor, spreadsheet etc) that was truly multilingual in its capability for treating texts (minority-language speakers write both their own language and at least one majority language) and that was available with their own screen language as an option, or maybe as one of two options so that the software might be usable by majority and minority language speakers in the same office or household. Starting from scratch this would be a large undertaking that could perhaps be considered by an international group working, for example, with Linux. Alternatively, a group might come to an arrangement to use Wordperfect or Star Office.

There is room for networking within and between at least three professional groups. First, groups and companies currently engaged in work on minority language applications. A directory of such groups would be a good starting point.

Secondly, an academic group interested specifically in language resources for minority languages has recently been established (36) which could be a useful task force lending its expertise particularly to those language groups that have no centre for language technology within their area. The European Language Resources Association which commissions the production of language resources does not exclude projects from minority languages (http://www.icp.grenet.fr/ELRA/home.html) and in this context it is worth mentioning an article in the ELRA Newsletter April-June 1999 "Does size matter? Language Technology and the Smaller Language". The author Nicholas Ostler is a language technology consultant but also, interestingly, President of the Foundation for Endangered Languages. In the article, which has a wider than European scope, he establishes a very useful table of applications and underlying technologies which we reproduce as Appendix 2.

But equally there is a need for contact and the exchange of experience between official and voluntary organizations capable of contributing to the funding, distribution and marketing of products.

Most of all, however, there is a need for contact between these three different constituencies. If, as is generally agreed, there is a gap which must be bridged between research and the business and administrative worlds in the case of majority languages, how much greater is that need in the case of minority languages where resources are more limited. We think it essential that European support for the development of applications in respect of minority languages should be for projects where the distribution and marketing aspects of the project are part of the proposal.

In this chapter we have separated applications from the building of underlying language resources which are the concern of the next chapter. This is a slightly artificial division since the two are inter-related, and much that we have said here applies to both. Nevertheless, machine translation - which is addressed in the next chapter - does require a different order of language resources.

We have sketched out the elements of a basic platform of IT standards and applications that should be available to European citizens in their own language at the points at which most people enter into information technology and the Internet. In the next chapter we shall see that the direction in which the Internet is developing and some of the systems of Machine Translation available each require the building of language resources for every language that is not to be relegated as new services on the Internet become available.

Chapter 4. - The Internet and Machine Translation

Matching a range of language technologies to multiple uses and multiple needs

Just as we have made distinctions between categories of languages and language situations, so we shall need to make distinctions when speaking of the Internet or Machine Translation. The Internet, unlike traditional media, serves a number of quite different purposes. Machine Translation, too, is not a single process which either succeeds or does not succeed by some single absolute standard. Different systems of machine translation may be suited to different user requirements. The use that can be made of Machine Translation on the Internet in turn depends on the underlying range of language resources available to a given language. The ultimate purpose of making these distinctions is to match particular uses and technologies with the particular needs of the smaller language communities. But first we must give an overview of the present state of the Internet and of Machine Translation, together with our estimate of future developments - for these are fast-moving fields.

In this chapter we inevitably simplify the technical issues to some extent so as to avoid losing the argument in the detail. The Technical File appended to this report gives a fuller and more complex picture of the same areas of discussion and provides a bibliography.

The Internet today

The Internet today is a channel of communication that can be used to store or transmit messages, information or other material between people. It encompasses a range of loosely-related ways of using a global electronic network for a variety of purposes - e-mail, websites, interactive chat, live broadcast. Each individual message has a producer or author, and a user or recipient:

AUTHOR
creates material =>
CHANNEL
stores/carries material =>
RECIPIENT
uses material

While some other media (books, data files, picture albums, CDs, etc.) are used primarily to store material, and others (letters, radio, television, the telephone, etc.) primarily to transmit it, the Internet is unique in being fully adapted to both these complementary functions within a single technological macro-domain. Nonetheless, some uses of the Internet focus on one or the other function.

The Internet's essential features can be summed up as follows (the italic glosses comment on factors relevant to smaller languages):

(a) Efficient operation: communication is rapid (in many cases practically instantaneous), powerful (large volumes of traffic can be supported), reliable (messages are delivered with precision), and once the technological infrastructure and tools are in place, cheap in comparison to alternative channels of communication.

This last characteristic is important for small and minority language-groups, where the unit cost of producing and distributing print media can be prohibitive due to the small size of the group. That same saving is available to persons or institutions from outside wishing to communicate with the small language group, but for them may be offset by the costs of translation unless these can be reduced or eliminated. However, one should not confuse the reduced cost of distribution with the cost of assembling the information distributed, which may be high.

(b) Global extension: The Internet renders geographical distances insignificant from every point of view including cost.

However, this is not the same thing as saying that all uses of the Internet address global audiences. Indeed, where Internet usage is common, the medium is increasingly used for very local as well as wider purposes. But the Internet's capacity for global reach means that other obstacles to communication become more apparent - uneven availability of the required hardware, and cultural differences, especially linguistic ones. This is what gives machine translation such salience in the context of the Internet.

(c) Flexible use: A wide and increasing variety of content can be transmitted via the Internet - text, sound, graphics. Moreover, the material is instantly updatable. The only limits are the human and technical resources available for such content to be digitalized, the capacity of current technology to perform such digitalization, and the availability of hardware and communications infrastructure.

However, expectations rise to meet the potential of the new medium - people expect material on the Internet to have been recently updated in a way they would not expect a book to be - and one should not underestimate the time and human resources needed to create and maintain up-to-date, high quality on-line databases and websites.

(d) Electronic form: The electronic nature of the channel, which is the key element behind the features of efficiency, globality and flexibility, also has other advantages. Anything that can be done electronically can be done via the Internet; hence, more and more information technology applications can employ the same common channel, and we can expect more applications modules, including translation and language processing modules to be integrated with the Internet.

As a simple channel of communication, the Internet today does not discriminate between languages, provided they can be input into the system. It is the various applications integrated with the Internet that have the capacity to exclude some languages. Search engines, for example, operate on a variety of principles, and with many it is not possible to search for information by language. It is therefore in the developing area of Internet-integrated applications that we must encourage a truly multilingual and non-discriminatory approach.

The Internet tomorrow

The Internet of today is still a largely inert channel which messages and information pass through. Left to itself, without further technological innovation, the Internet would grow and grow in size, but at the same time become less manageable. New content-orientated technology and a more sophisticated structure will be needed to handle the information explosion, making the evolving Internet less inert and more interactive.

In the above model, it is the stage of accessing material that will become more interactive. The function of the processing component will be to accept specifications from the recipient and, by manipulating the raw information to which it has access, to compile a report tailor-made to the recipient's requirements. One of those requirements may concern the language of the report.

Although one cannot foresee the detail of future developments, there are reasons for believing that these developments will favour a multilingual presence and machine translation for those languages where the underlying language resources exist. These reasons are:

(a) Specialization of content, function and format. Specialized texts (pertaining to specific domains) are more amenable to Machine Translation than general ones, as is functionally standardized communication (e.g. e-commerce transactions).

(b) Automation of the way the Internet works. A larger proportion of the text that reaches the recipient will be partly or wholly machine-generated and therefore more amenable to automatic multilingual treatment.

(c) Widespread automation entails the existence of technological standards. Standardization will tend to favour efficient language processing for both information treatment and user interfaces.

(d) Automation will also lead to the development of common infrastructures which may incorporate or support, inter alia, the translation utilities and other language processes needed to make the Internet multilingual.

(e) Intelligent tools and sophisticated user-interfaces for Internet-based functions will incorporate knowledge bases, artificial intelligence and forms of language processing which can probably be combined with Machine Translation.

The above tendencies all seem to point to the need for terminological banks, specialized and more general linguistic corpora to be created where they do not already exist.

What should we expect from machine translation?

Non-specialists thinking about Machine Translation (MT) often apply one of two widespread but mistaken notions concerning the nature of translation itself. According to the "naive fallacy", translation is a straightforward matter of substituting for each word in the source language the corresponding word in the target language; thus the ability to translate merely consists of "knowing all the words". According to the "erudite fallacy", on the other hand, translation is such a dauntingly complex and subtle task that accurate translation is almost beyond even the expert human, while for a machine to translate reliably is inconceivable. The truth lies between these extremes.

If the "naive fallacy" were true, Machine Translation would be a very simple matter, well within reach of even the rudimentary computers of decades ago. All that would be needed is a bilingual word list (lexicon) together with a straightforward algorithm for replacing words in one language with those in another.

But languages, of course, do not consist merely of words, but also of morphology, syntax, semantics and indeed pragmatics. Moreover, even where words are concerned, one of the most daunting problems for Machine Translation is how to deal with frequent homonymies or ambiguities, where one item in Language 1 corresponds to more than one item in Language 2: getting a computer to make the right decisions in such cases is in fact one of the most fundamental and difficult of challenges facing MT technology. Then again, even if we could give the computer system fail-safe rules on which to base its decisions, it turns out that authentic texts produced by humans do not always follow the "rules". While the tendency towards unpredictability in language-use only rarely impedes successful communication between humans, it may frequently baffle a computer's more rigid logic. On top of all this there are problems of cultural references in texts which may not be transferable into another cultural context except by the most creative kind of translation solution.
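The weakness of the "naive" approach can be illustrated with a minimal sketch in Python. The lexicon below is a hypothetical, hand-made English-to-French word list invented purely for illustration; it is not drawn from any real MT system.

```python
# Tiny hypothetical lexicon, for illustration only.
# Each source word has exactly one target word, as the "naive fallacy" assumes.
LEXICON = {
    "the": "le",
    "bank": "banque",  # financial sense; the waterside sense would be "rive"
    "of": "de",
    "lake": "lac",
}


def word_for_word(sentence: str) -> str:
    """Replace each source word with its single lexicon entry."""
    words = sentence.lower().split()
    # Words missing from the lexicon are passed through unchanged.
    return " ".join(LEXICON.get(word, word) for word in words)


print(word_for_word("the bank of the lake"))
```

The sketch produces "le banque de le lac", where a human translator would write "la rive du lac". Three of the problems named above appear in a five-word phrase: the wrong sense of an ambiguous word is chosen, the article does not agree in gender with its noun, and "de le" is not contracted to "du".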

But, despite these complex challenges, MT systems have indeed been developed and are serving a range of useful purposes at this very moment, and in the near future will probably become essential to modern life. Their success, however, has to be judged in terms of their ability to perform to an acceptable level of efficiency in each specific context. Lists of car parts are capable of being translated accurately within a very limited and controlled system, while the translation of advertising slogans may require the same kind of cultural sensitivity as high quality literary translation and best be left to a highly-skilled human translator.

Just as with any technology, we must look at the needs and demands of a given user, or of the system within which the technology is incorporated, and ask whether or not the technology serves its immediate purpose, and also, crucially, what the alternative options are for obtaining similar (or better) results by any other means. Sometimes information technology will offer options for sidestepping translation as such through the spread of "machine-mediated communication" where the machine does not need strictly to translate, only to speak each user's language.

Quality of Translation

"Quality of translation" refers to how accurately an MT application performs a number of technically distinguishable tasks which together make up the complex process of translation. An "ideal" machine translation might be comparable to what an ideal human translator might produce, but we have to remember that in practice human translators are also capable of error and are not always "ideal".

However, a translation (whoever or whatever produces it), even if it is not ideal may serve practical purposes. Everything will depend on what those purposes are, on how critical the quality of translation is for those purposes, and how possible it is, in given circumstances, to correct low quality machine translation.

In particular, depending on context and resources, it may be possible for machines and people to work hand in hand. We need not expect the machine to carry out all the work on its own. Humans may be able to correct a machine's mistakes, taking the computer's effort as a first draft for revision. Alternatively, the computer may be able to request assistance or intervention from a human when necessary. Again, a human may be able to prepare the task for the machine ahead of time in such a way as to ensure that the latter is only asked to do what it is capable of. By such strategies, machines may be used as a tool to help provide translations of adequate quality, but not fully automatically.

Thus there may be a choice between having fully automatic translation which is not of high quality, or high quality translation which is not fully automatic. In the opinion of several experts consulted, many ordinary users of the Internet today are fairly tolerant of low quality machine translations such as are provided by existing automatic on-line services (though only between certain languages). The primary goal of such services is to provide translation which is affordable and available when needed, and which achieves a minimum threshold of quality. Given these users' priorities, this level of translation is better than no translation at all. But there are other users of the Internet - for example international companies creating websites - for whom high quality will be more important than cost, and who will choose human translation, possibly computer-aided.

There is also a trade-off between accuracy on the one hand and, on the other, our willingness to set limitations on the original text to be translated. Machine translation has greatest difficulties with ambiguities in the source text and one way of getting round this is for the source text to be required to conform to certain rules: this is referred to as controlled language. Another way for a text to be easier for MT to handle, because it is less prone to ambiguities, is if the text naturally pertains to a particular subject area, known as a domain, for which the MT system used has been prepared. It will be apparent that these strategies are better suited to specialist users of the Internet than to the ordinary user.
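The idea of a controlled language can be sketched as a simple check applied to the source text before it is sent for translation. The Python sketch below is only an illustration: the two rules (a sentence-length limit and an approved vocabulary), the limit of 12 words and the word list itself are all assumptions made for the example, not the rules of any actual controlled language.

```python
import re

# Assumed rules, for illustration only.
MAX_WORDS = 12
APPROVED_WORDS = {"remove", "the", "filter", "and", "clean", "it", "replace"}


def violations(sentence: str) -> list:
    """Return a list of the ways a sentence breaks the assumed rules."""
    words = re.findall(r"[a-z']+", sentence.lower())
    problems = []
    if len(words) > MAX_WORDS:
        problems.append(f"too long: {len(words)} words (limit {MAX_WORDS})")
    for word in words:
        if word not in APPROVED_WORDS:
            problems.append(f"word not in approved vocabulary: {word}")
    return problems


print(violations("Remove the filter and clean it."))
print(violations("Extract the filter."))
```

A sentence that passes such a check stays within the vocabulary and structures the MT system has been prepared for; a sentence that fails is returned to the author for rewording. This is the kind of discipline that a specialist user may accept but an ordinary user is unlikely to.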

Machine translation systems can be based on a number of different approaches, though in practice elements from more than one approach may be combined. Each approach has its own history and ups and downs in esteem, but no one approach has triumphed so completely as to make the others not worth considering. The different approaches are set out in the Technical File at the end of this document. Below, we adopt a purely ad hoc classification driven by the preoccupation of this report with multilingual translation capability and support for smaller languages.

Language-pairs

The first approach works with particular language pairs; well-known systems such as Systran and Logos operate in this way. Historically a great deal of work has been involved in building these systems, and it is worth noting that the work has to be done twice over, e.g. for French to English and then for English to French. These systems tend to get better with time and use, since problem areas can be addressed and improvements made. Institutional users have expressed reasonable satisfaction with the results, but these institutions, which include the European Commission, are in a position to have a human translator revise the text when a high level of accuracy is important to them. The systems are also expensive to build in the first place, though it is not impossible that ways may be found of simplifying the process, in which case the argument would change. The biggest problem, however, as things stand, is that the number of language pairs, and therefore the amount of work needed, grows roughly as the square of the number of languages in the pool for which a system of multilateral translation is being built. When the EU's EUROTRA project started, there were six official languages which meant creating 15 language pairs, or 30 translation directions. Fifteen languages would have brought the figures up to 105 and 210, but the EUROTRA project has been discontinued.
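The figures quoted above follow from simple counting: among n languages there are n(n-1)/2 unordered pairs, and twice as many translation directions. A short Python sketch (the function names are our own, chosen for illustration) reproduces them:

```python
def language_pairs(n: int) -> int:
    """Number of unordered language pairs among n languages."""
    return n * (n - 1) // 2


def translation_directions(n: int) -> int:
    """Each pair must be built in both directions."""
    return n * (n - 1)


for n in (6, 15):
    print(n, language_pairs(n), translation_directions(n))
```

For six languages this gives 15 pairs and 30 directions; for fifteen languages, 105 pairs and 210 directions.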

If it is unrealistic to add all state languages to a completely multilateral translation system, then the same must certainly be the case for minority languages. The cost/benefit of a system for, let us say, Frisian to Greek and vice versa, however desirable in principle, would not bear looking at in practice. Yet if language pairs are developed for only a few larger languages, the rest will be left outside a machine translation system based on language pairs.

It seems to us that language pairs which include a minority language could most easily be justified in cost-benefit terms where the other language is the member-state language and the minority language has a high level of use in government and administration. It will then be of direct use both to the state and the region, which might jointly consider the investment worth while. Where languages exist within the same political and institutional frame of reference and belong to the same linguistic family (as with languages other than Basque in Spain) it may be possible to build very successful and useful MT systems (37), but such projects would not normally qualify for European support.

On the other hand, when small languages have a very strong relationship with one particular trading partner, investment in a language-pair MT system with that other language may well be worthwhile, and such a project might be appropriate for support by the European Commission. There may also be a case for the EU supporting language-pair MT development as a means of conflict prevention in a situation such as exists in Moldova, where the official state language is simply not understood by sizeable minorities.

As for ordinary users of the Internet, however, though anything that is available in the way of machine translation is better than nothing from their point of view, we do not think that an incomplete range of language pairs is what they are looking for.

Interlingua

The second approach is based on an interlingua - a language-neutral representation which intervenes between the source language and the target language. The process involves the conversion of the source language into an interlingua representation, and then, as a completely independent module, the translation of the interlingua representation into the target language. For each language added to the system two modules are needed. Thereafter, there is the possibility of translation into every language in the system without the steep growth of effort and cost observed in the case of language pairs. In theory, therefore, a successful interlingua system incorporated into the Internet would be exactly what the ordinary user is looking for and possibly many other users too if the quality were good enough. However, from the point of view of quality there is a question mark over interlingua systems, which have run into difficulties in the past.
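The difference in effort between the two approaches can be made concrete by counting components. The Python sketch below assumes, as stated above, two modules per language for an interlingua system, and one system per translation direction for the language-pair approach; the function names are our own.

```python
def pairwise_systems(n: int) -> int:
    """Language-pair approach: one system per translation direction."""
    return n * (n - 1)


def interlingua_modules(n: int) -> int:
    """Interlingua approach: one module into and one out of the
    interlingua for each language."""
    return 2 * n


for n in (2, 3, 6, 15):
    print(n, pairwise_systems(n), interlingua_modules(n))
```

With two languages the interlingua route needs more components (4 against 2), and with three the two approaches are level at 6. Beyond that the gap widens quickly: fifteen languages require 30 interlingua modules against 210 direct translation directions. The count says nothing, of course, about the difficulty of building each module or about the quality of the result.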

We must however mention the ongoing Universal Networking Language project (UNL) currently being coordinated from the United Nations University in Tokyo. As we complete this report a new website (38) has been established for the project which indicates that it will be extended from the present few languages to all EU state languages and eventually all member-state languages of the United Nations.

While we have collected such information as we can about it and attended a presentation in Brussels, it is difficult to make a full assessment: in the first place, the work is not complete and in the second place, there is, for understandable reasons, a degree of secrecy about the detail of the project. But the intention is eventually to make the resulting system a resource freely available on the Internet.

With all the reservations one is bound to have in these circumstances, it remains true that even a relatively low quality system with global application would find users. Ordinary Internet users who are not able to afford human translators to check machine translation are ready to make some allowances for a system which though imperfect is better than nothing. And like other systems, an interlingua system might improve with use. The production of two modules, one working in each direction between the interlingua and the human language in question should not be beyond the resources of even small and/or minority languages.

Elsewhere in this study the actions we see as helpful to linguistic diversity and smaller languages involve the application and adaptation of existing technologies to additional languages, but here is an area of research and development where we think there would be a case for European support for one or two centres in small language areas to become partners in the UNL project - which is open to such partnerships. These could eventually help to transfer the expertise required in building the system to other small European languages.

Translation Memory

The third approach is known variously as Translation Memory (TM), corpus-based, example-based or non-symbolic translation. This method looks for equivalents of sentences in the text to be translated in existing parallel corpora or collections of texts in the two languages. If the sentence is in that corpus, the translation will probably be 100% correct; if it is not, no translation can be provided. Human translators can build up parallel corpora of their own in the fields of translation in which they work, and the software for this is commercially available. The system naturally works equally for all languages and is a very useful aid for professional translators. But if corpus-based systems are to yield something approximating more closely to automatic translation, then large parallel corpora of written language are required which have been annotated in particular ways. It is also possible to construct corpora based on the language of particular domains, or more general corpora. Those who work with this kind of Machine Translation are very enthusiastic about the approach and its potential. Quite apart from its merits in itself, it has the capacity to be combined with other approaches to MT. Moreover, the creation of a large scale corpus for a language can lead to many other uses besides machine translation.
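In its simplest form, the sentence-matching behaviour described above amounts to a lookup in a store of previously translated sentences. The following Python sketch assumes a very small, hand-made English-French memory invented for illustration; commercial translation memory software is considerably more elaborate.

```python
from typing import Optional

# Hypothetical parallel corpus of sentence pairs, for illustration only.
TRANSLATION_MEMORY = {
    "Good morning.": "Bonjour.",
    "Thank you very much.": "Merci beaucoup.",
}


def lookup(sentence: str) -> Optional[str]:
    """Return the stored translation of an exact sentence match, or None."""
    return TRANSLATION_MEMORY.get(sentence.strip())


print(lookup("Good morning."))
print(lookup("Good evening."))
```

The first call returns the stored translation; the second returns nothing, because the sentence has never been translated before. The usefulness of the method therefore depends directly on the size and coverage of the parallel corpus.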

The desirability of having a large corpus for each language was the reason behind the PAROLE project. This was a European initiative set up to create a large corpus for each of the member-state languages of the EU using a common methodology. That methodology and expertise now exist and could be extended to a number of minority languages, which could be included in a new phase of the PAROLE project.

Footnotes:
