Speech-to-speech machine translation

Related


"The industry realises that the incorporation of speech technology will imply the ultimate step to bring computers closer to the general public."

EuTrans: web page

Consortium:

"To the extent that personal computers are being equipped with more and more telematic applications, together with the impending arrival of third generation mobile phones, reliable speech recognition systems become a must."
Other related projects:
TransType2

"TT2 aims at facilitating the task of producing high-quality translations, and make the translation task more cost-effective for human translators. Research progress will thus be measurable in terms of the increased productivity of translators using this new computer-assisted translation system. "


Verbmobil

Vision
"The vision behind the Verbmobil project was a portable translation device that one could carry to a meeting with speakers of other languages."
Languages
Input in English, German or Japanese; the translation is bidirectional, English-German and Japanese-German.
Funders
The project was funded by the German Ministry for Research and Technology (BMFT) and an industrial consortium (including Alcatel, Daimler-Benz AG, IBM Deutschland, Philips GmbH, Siemens Aktiengesellschaft). For the first four years of the project the BMFT funding amounted to 60 Million DM. "
Verbmobil resources at ELDA
Including 200 spontaneous transliterated dialogues in "Denglish", English spoken by Germans.


LC-STAR: Lexica and Corpora for Speech to Speech Translation Components

Fame: Facilitating Agents in Multicultural Exchange

JANUS
Language Technology Institute (LTI), School of Computer Science at Carnegie Mellon University.
Travel Domain - conversations between travel agents and clients.
Languages: English, German, Japanese, Korean, Italian and French.
System applications include an Interactive Video Translation Station, a Portable Translator, and a Passive Dialog Interpreter.
Approach: Speech-to-speech translation of spontaneous conversational dialogues in multiple languages, primarily using an interlingua-based approach.


"Another notable point is that task success (73.8%) is higher than translation accuracy (51.8%). This confirms the need for Task Based Evaluation (TBE) in addition to Accuracy Based Evaluations (ABE). "

"The reason for task success being higher than translation accuracy is that both experienced and inexperienced users accepted some bad translations as long as they can be understood in context. For example, in the context of the question How much does it cost?, users will accept the answer 128 hours."

In Lessons Learned from a Task-Based Evaluation of Speech-to-Speech Machine Translation (Levin et al., 2000)

An interview with Francisco Casacuberta and Enrique Vidal


by Joseba Abaitua

"At present, there are only a few speech-to-speech machine translation projects of some relevance, whether in Europe, the United States, or Japan. Nevertheless, there is no doubt that its importance is continually increasing, just like cellular telephony and machine translation technologies. Without a doubt, in a few years' time, speech-to-speech machine translation will be a commonplace thing."

Because oral language is the most spontaneous and natural form of communication among people, speech technology is perceived as a determining factor in achieving better interaction with computers. The industry is aware of this fact and realises that the incorporation of speech technology will imply the ultimate step to bring computers closer to the general public. To the extent that personal computers are being equipped with more and more telematic applications, together with the impending arrival of third generation mobile phones, reliable speech recognition systems become a must. Recent years have seen important progress, although limitations remain (in vocabulary, in domain coverage, in the treatment of disfluencies, etc.). Despite these limitations, the technology is already capable of offering a wide range of services.

Vidal & Casacuberta

One of the most attractive applications is without a doubt speech-to-speech machine translation. A small number of projects have contributed most significantly to the development of this technology, among which Verbmobil, sponsored by the German government, and the European EuTrans are worth citing. In this interview we talk to two representatives of one of the Spanish research groups that has gained most recognition in recent years thanks to its research on speech-to-speech translation: the Pattern Recognition and Human Language Technology (PRHLT) group of the Universitat Politècnica de València (UPV), directed by Francisco Casacuberta Nolla and Enrique Vidal Ruiz.

PRHLT-ITI

This group carries out its research activity both in speech technologies and in computer vision. The PRHLT subgroup devoted to speech-to-speech translation in the EuTrans project comprises, in addition to the two directors, Carlos Martínez Hinarejos, Francisco Nevado Montero, Moisés Pastor Gadea, David Picó Vila and Alberto Sanchis Navarro, who belong to the Computer Science Institute (ITI) of the UPV, where they also lecture, and David Llorens Piñana and Juan Miguel Vilar Torres, from the Universitat Jaume I (UJI).

Other research projects on speech translation have been developed by the group, such as "EXTRA: Example-based extensions to text and speech translation in restricted domains" and "Translation and comprehension of the language spoken through example-based learning techniques: TRACOM", both funded by the Spanish Foundation of Science and Technology (CICYT). The group is currently participating in a new European project: "TransType2 - Computer-Assisted Translation" (TT2).

EuTrans

Question: In what context has your research been conducted lately?

Enrique Vidal: Recent research has been carried out within the framework of the EuTrans project, financed by the European ESPRIT programme (actions 20268 and 30268). The consortium that carried out the project was formed by the University of Aachen (Germany), the research centre of the Fondazione Ugo Bordoni (Italy), the German company ZERES GmbH, and our group at the Computer Science Institute (ITI) of the Universitat Politècnica de València (UPV), which led the project. The project involved two stages. The first stage (in 1996), lasting only six months, demonstrated the viability of the proposed approach on a task of moderate complexity. In the second stage, which took three years (from 1997 to 2000), methodologies were developed to address real tasks.

Question: Which were the main contributions of the ITI-UPV group to the consortium?

Enrique Vidal: Our research group has been working for several years on the development of speech recognition methods, with important contributions in acoustic modelling, in (syntactic/semantic) language modelling, and in the development of learning techniques for translation models. We developed the three learning techniques applied to the design of the EuTrans prototypes, all of them based on finite-state technologies: OSTIA (Onward Subsequential Transducer Inference Algorithm), OMEGA (OSTIA Modified for Employing Guarantees and Alignments) and MGTI (Morphic Generator Transducer Inference).

Question: How have these learning techniques contributed to the development of the prototypes?

Francisco Casacuberta: These learning algorithms are essential, because without them it would be very difficult or impossible to apply our translation model. We are talking about stochastic finite-state transducers that can have millions of transitions. Clearly, these models cannot be built manually on the basis of linguistic knowledge, but must be "learned" from examples. This is without a doubt an innovative approach to speech translation, inspired by the technology of speech recognition, and very different from the conventional rule-based translation approaches. By means of this approach it is possible to integrate the acoustic models into the translation model (which, in our case as in the majority of speech recognition systems, are continuous hidden Markov models).
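To make the idea of a stochastic finite-state transducer concrete, here is a deliberately tiny, hand-built sketch in Python. The words, states and probabilities are invented for illustration; a real EuTrans-style model has millions of transitions and is learned automatically from bilingual examples, not written by hand.

```python
# Toy stochastic finite-state transducer (SFST).  Each transition consumes
# one source word, emits a (possibly empty) tuple of target words, and
# carries a probability.  Delayed emission (an empty output) lets the
# model handle local reordering between languages.

# transitions[state][source_word] -> list of (next_state, target_words, prob)
transitions = {
    0: {"una": [(1, ("a",), 1.0)]},
    1: {"habitación": [(2, ("room",), 0.6),   # emit now...
                       (3, (), 0.4)]},        # ...or delay, to allow reordering
    3: {"doble": [(4, ("double", "room"), 1.0)]},
}
final_states = {2, 4}

def translate(source_words):
    """Return the most probable target sentence, via a Viterbi-style search."""
    # beam maps state -> (probability, target words emitted so far)
    beam = {0: (1.0, ())}
    for word in source_words:
        new_beam = {}
        for state, (prob, target) in beam.items():
            for nxt, out, p in transitions.get(state, {}).get(word, []):
                cand = (prob * p, target + out)
                if nxt not in new_beam or cand[0] > new_beam[nxt][0]:
                    new_beam[nxt] = cand
        beam = new_beam
    # best-scoring hypothesis that ends in an accepting state
    finals = [v for s, v in beam.items() if s in final_states]
    if not finals:
        return None
    return " ".join(max(finals, key=lambda v: v[0])[1])

print(translate(["una", "habitación"]))           # a room
print(translate(["una", "habitación", "doble"]))  # a double room
```

The point of the sketch is the data structure: because translation knowledge lives entirely in weighted transitions, it can in principle be estimated from sentence pairs, which is what OSTIA, OMEGA and MGTI do at scale.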

Question: And which are the drawbacks?

Francisco Casacuberta: There are two. The main drawback is that this solution requires large training corpora. In addition, it is necessary to restrict the area of application to concrete scenarios, because of the huge size of the models.

Question: What orders of magnitude are we talking about? Why is the size still a problem?

Francisco Casacuberta: Possibly, given the current state of the technology, tasks with lexicons of a few thousand words could be addressed without much difficulty. As I said before, the resulting finite-state transducers are very large and need a great deal of memory. Nevertheless, computation time would not be a very serious problem, since very efficient techniques exist to reduce the time cost.

The approach

Question: What is your approach to speech-translation?

Enrique Vidal: In the classical approach, a typical speech translator operates in two stages. In the first stage speech recognition takes place, converting the input sound into a source-language text. Then, in the second stage, that source text is translated into a target-language text. This approach is also known as uncoupled or serial translation. In our approach, speech recognition and translation are carried out simultaneously. This is possible thanks to the integration of the word-level acoustic models of the speech recognition system into the translation model. We call it the integrated approach. Integration is possible because hidden Markov models and finite-state transducers are very similar in essence.

Question: So, what are the advantages of this approach over the classical one?

Francisco Casacuberta: For one thing, no speech recognition system is perfect. Moreover, translation systems that work from text, as opposed to speech, take for granted that the source text is correct. As a result, in the classical approach, translation systems have to deal with errors of all types for which they were not designed. Obviously, this problem has a solution, but a very expensive one.

Question: Is there a way to overcome these limitations?

Enrique Vidal: One of the objectives of our approach is that the translation process should behave like a recognition process, with the difference that instead of recognising source-language sentences, target-language sentences are generated. Our speech-translation systems apply a strategy very similar to that used in speech recognition. It is basically a statistical approach that, as said before, employs finite-state transducers complemented with acoustic knowledge. In this context, translation is seen as the search for the word sequence in the target language that maximises the probability of the corresponding acoustic observation.
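The search Vidal describes can be written in the standard statistical formulation used in this line of work (the notation here is ours, not the interview's): given an acoustic observation x, the system seeks the target sentence t̂ such that

```latex
\hat{t} \;=\; \operatorname*{argmax}_{t}\; \Pr(t \mid x)
       \;=\; \operatorname*{argmax}_{t}\; \sum_{s} \Pr(t, s \mid x)
       \;\approx\; \operatorname*{argmax}_{t}\; \max_{s}\; \Pr(x \mid s)\,\Pr(s, t)
```

where s ranges over possible source-language transcriptions, Pr(x | s) is supplied by the acoustic (hidden Markov) models, and the joint probability Pr(s, t) by the stochastic finite-state transducer. Approximating the sum by a maximisation is what makes a single Viterbi-style integrated search possible, rather than recognising first and translating second.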

Question: What role do these transducers play?

Francisco Casacuberta: Stochastic finite-state transducers permit a direct application of the probability distribution mentioned above, making it possible to define an integrated translation architecture. The main advantage of these transducers is that they considerably reduce the impact of the unavoidable errors of speech recognition systems. In addition to its formal elegance, the integrated architecture offers important functional advantages, although in general it suffers from a practical problem: the search for the optimal output sentence may have a high computational cost. It should be pointed out, however, that many techniques are available to reduce the computation time, so that real-time speech-translation systems can still be designed.
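One family of techniques for cutting computation time, standard in speech decoders generally (the interview does not say which ones EuTrans used), is beam pruning: at each expansion step, partial hypotheses whose score falls too far below the current best are discarded. A generic sketch:

```python
# Beam pruning: drop partial hypotheses whose log-probability falls more
# than `beam_width` below the best one before the next expansion step.
# This trades a small risk of search error for a large cut in work.

def beam_prune(hypotheses, beam_width):
    """hypotheses is a list of (log_prob, partial_sentence) pairs;
    return only those within `beam_width` of the best score."""
    if not hypotheses:
        return hypotheses
    best = max(score for score, _ in hypotheses)
    return [(score, hyp) for score, hyp in hypotheses
            if score >= best - beam_width]

# Four partial hypotheses with (invented) log-probabilities; only the
# two competitive ones survive a beam of width 3.
hyps = [(-2.0, "a room"), (-2.5, "a double"),
        (-9.0, "one hour"), (-12.0, "a her")]
print(beam_prune(hyps, beam_width=3.0))
```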

Question: In short, what would you say is the most remarkable aspect of your approach?

Enrique Vidal: Our approach has two outstanding characteristics. One is the capacity to resolve, in a homogeneous and simultaneous way, the two phases of speech-recognition and translation. The second property is the possibility to generate acoustic, lexical and translation knowledge-sources automatically from examples.

But as I have said before, these systems have a limited application domain and need very large amounts of training data. As we know, corpus compilation is very expensive. Furthermore, the size of the resulting models represents another problem. Consequently, in order to make the approach viable, it is very important to restrict the domain of application.

Results

Question: How do the positive results of EuTrans translate into a more optimistic outlook for speech technologies?

Francisco Casacuberta: Within the framework of EuTrans several prototypes have been developed for relatively simple translation tasks. The language pairs developed were Spanish into English and Spanish into German. On top of these prototypes, two further translation systems (from Spanish and Italian into English) have been constructed for more realistic applications. The small error rates and excellent response times, close to real time, provide good practical support to our technological standpoint.

Two versions of the Spanish prototype were built, with very different training-sample sets, so that the influence of the training corpus size on the behaviour of the translation model could be measured. The first prototype was trained with nearly half a million sentence pairs generated semiautomatically from texts taken from tourist guides. The second was trained with only 10,000 sentence pairs.

This difference in the size of the training corpus was deliberate. We wanted to simulate a situation of scarcity of training samples and to verify its effect on the behaviour of the translator. The vocabulary contained in the corpus comprised 686 words in Spanish and 513 in English. The acoustic models had been trained on four hours of recorded speech. The oral test consisted of 336 sentences in Spanish pronounced by four speakers. With the bigger training corpus and microphone input, the Spanish system produced a translated-word error rate below 2%. With the corpus of only 10,000 sentence pairs a much worse outcome was obtained, with error rates just under 8%.

The Italian prototype was trained with a very small sample of just 3,038 sentence pairs. The acoustic models had been trained with a corpus of some eight hours of speech, acquired directly from real telephone calls to the reception desk of a hotel through the Wizard of Oz technique. The translation corpus was transcribed orthographically from the oral corpus and translated manually into English. The oral test was carried out on 278 Italian sentences, with 22% of the words poorly translated.

Question: In what domains have the prototypes been applied?

Enrique Vidal: The prototypes were designed for the restricted application of person-to-person interactions at the reception of a hotel. There were five expected actions: information requests, bookings, cancellations, claims, and booking-changes.

Question: This makes EuTrans similar to other speech-to-speech translation prototypes, such as Verbmobil or Janus. Is there some reason for this coincidence?

Enrique Vidal: This task is sufficiently restricted, with a vocabulary of moderate size, which makes it possible to demonstrate the viability of speech translation in a very direct way.

Question: What conclusions could be extracted from the evaluation tests?

Enrique Vidal: The results obtained clearly show the viability of the approach for concrete tasks and restricted discourse domains.

Question: Which have been your main scientific contributions?

Francisco Casacuberta: I would like to point out two main contributions: i) the validity of stochastic finite-state transducers for the translation of both text and speech; ii) the development of learning strategies (OSTIA, OMEGA, and MGTI) for training stochastic finite-state translators.

Furthermore, our group has contributed the ATROS (Automatically Trainable Recogniser Of Speech) system. This is a continuous-speech recogniser that runs on Linux. ATROS has three knowledge-source levels: acoustic, lexical, and translation. These knowledge sources are represented by means of finite-state models, which are automatically learned from examples and integrated one within the other. ATROS supports decoding speech into words, understanding speech as meaningful messages, and translating speech into the target language.

The future

Question: What is left to do in this field?

Francisco Casacuberta: Many problems remain open. In the first place, all those that affect speech modelling: unfavourable environments (for example, speech transmitted over cellular telephones), prosody, and other phenomena caused by spontaneous speech. Modelling dialogue, the way it is done in Verbmobil, would also help translation. With regard to translation models, it is necessary to resolve the computational problems that affect larger models, for instance when the task requires large lexicons (of several tens of thousands of words).

Question: What is the state of speech-to-speech translation in Europe and outside Europe?

Enrique Vidal: Some systems are currently being marketed. These employ the classical approach of a commercial speech recognition system coupled with a text-translation system. In any case, there are only a few projects of some relevance, whether in Europe, the United States, or Japan. Nevertheless, there is no doubt that the importance of this field is continually increasing, just like cellular telephony and machine translation technologies. In a few years' time, speech-to-speech machine translation will be a commonplace thing.

Conclusions

The achievements of the EuTrans project reveal two things. The first is that speech-to-speech translation is conditional on the development of speech recognition technology itself. The second is that the models employed in speech recognition, based on large collections of examples, have proved valid also for the development of speech translation. This implies that in the future these two technologies will be successfully integrated.

At present, however, speech-to-speech translation systems are scarcely available. In recent years speech recognition has made important progress thanks to the increasing availability of the resources needed for its development: large collections of oral texts and efficient data-driven processing techniques, such as those designed by the PRHLT-ITI group itself. However, the integration of these systems into market products is still complicated. We should not forget that the prototypes developed within research projects are only capable of processing a few hundred sentences (around 300), on very specific topics (accommodation booking, trip planning, etc.) and for a small set of languages (English, German, Japanese, Spanish, Italian). It seems unlikely that any application will be able to go beyond these boundaries in the short term.

The direct incorporation of speech translation prototypes into industrial applications is at present too costly. However, the growing demand for these products leads us to believe that they will soon be on the market at more affordable prices. The systems developed in projects such as Verbmobil, EuTrans, or Janus, in spite of being at the laboratory stage, contain thoroughly evaluated and robust technologies. A manufacturer considering their integration may join R&D projects and take part in the development of prototypes with the prospect of a fast recovery of the start-up costs. It is quite clear that we are witnessing the emergence of a new technology with great potential for expansion in the telecommunications and microelectronics markets of the immediate future.

Another remarkable aspect of the EuTrans project is its methodological contribution to machine translation as a whole, in both speech and written modes. Although these two modes of communication are very different in essence, and their respective technologies cannot always be compared, speech-to-speech translation has brought prospects of improvement for the other channel. Traditional methods for written texts tend to be based on grammatical rules. As a result, many MT systems have no coverage problem, although at the expense of quality. The most common way of improving quality consists in restricting the domain of application; it is widely accepted that broadening coverage immediately endangers quality. In this regard, learning techniques that enable systems to adapt automatically to new textual typologies, styles, structures, and terminological and lexical items represent a radical contribution to the technology.

On account of the difference between oral and written communication, rule-based systems prepared for written texts can hardly be readapted to oral applications; this is an approach that has failed. By contrast, the example-based learning methods designed for speech-to-speech translation systems can easily be adapted to written texts, given the increasingly high availability of bilingual corpora. One of the main contributions of the PRHLT-ITI group is precisely its learning model based on bilingual corpora. It is along this line of experimentation that interesting prospects of improvement in written translation exist.

Although limited in the number of languages, linguistic coverage, and context, effective speech-to-speech translation will become available in the coming years, along with other voice-oriented technologies. It could be argued that EuTrans' main contribution is to have raised speech-to-speech translation to the level of speech recognition technology, making any new innovation immediately accessible.


Joseba Abaitua has a Ph.D. in Computational Linguistics from the University of Manchester Institute of Science and Technology (UMIST). For four years he worked on the Japanese-Spanish module of Fujitsu's ATLAS machine-translation system. Since 1992 he has been professor of Linguistic Technology at the University of Deusto, where he has participated in several Natural Language Processing projects. Alongside his academic work he also acts as a technology consultant for AutomaticTrans.