Review of Speech Technology and Language Engineering

by

Laura Gravina Sobrino

 

 

 

Abstract

The objective of this report is to complete an assignment for the class "English Language and New Technologies". This class is taught by Joseba Abaitua and covers the uses and roles of Human Language Technologies in this century.

This report reviews speech technology and language engineering, two fields of great importance in Human Language Technologies today. I have chosen this subject because I believe that the uses and roles of these two technologies are closely tied to the advancement of Human Language Technologies. I will try to explain what speech technology and language engineering are and what uses they have, so that this report can serve as a useful tool for anyone who would like to learn about these two subjects.

 

Introduction

The aim of this report, as explained in the preceding paragraph, is to improve our knowledge of these two subjects and their uses. The report is structured as a simple series of questions and answers; most of the questions I have used to organise it come from the questionnaires we have been working on in class. The purpose of those questionnaires was to make us search for information on the internet in order to answer the questions given; for that reason, most of the answers are quotations which explain, more or less, the answers to these questions.

This report is divided into three sections; each section contains different questions about the subject with their corresponding answers. In the first section, entitled "About Language Engineering and Speech Technology", I give a short definition of each subject, that is, an explanation of what Language Engineering and Speech Technology are. In the second section, simply called "Language Engineering", I investigate this subject in more depth: its techniques, uses and components. After that, in the third section, "Speech Technology", I discuss the second subject in detail, examining its two key technologies (speech recognition and speech synthesis) and giving a brief review of speech-to-speech translation and some projects based on this theme.

As I have said, all the answers are developed from quotations taken from the internet, so I include the reference after every answer. At the end of the report, after the conclusion, the reader will find a list of all the references used to complete this report, including the name of the author, the name of the original document, the group or institution, and the URL.

Index of the report:

1. About Language Engineering and Speech Technology
      1.1. What is Language Engineering?
      1.2. What is Speech Technology?

2. Language Engineering
      2.1. What are the main techniques used in Language Engineering?
      2.2. Which language resources are essential components of Language Engineering?

3. Speech Technology
      3.1. About Speech Recognition
      3.2. About Speech Synthesis
      3.3. Speech-to-speech Machine Translation
      3.4. Three Speech-to-speech Machine Translation Projects

 

1. About Language Engineering and Speech Technology

  1.1. What is Language Engineering?

Language Engineering is the application of knowledge of language to the development of computer systems which can recognise, understand, interpret, and generate human language in all its forms. In practice, Language Engineering comprises a set of techniques and language resources. The former are implemented in computer software and the latter are a repository of knowledge which can be accessed by computer software. In other words, it is a way of interpreting human language using computational techniques.

Language Engineering (Brochure by HLTCentral: cached copy)

  1.2. What is Speech Technology?

In the mid- to late 1990s, personal computers became powerful enough to make it possible for people to speak to them and for the computers to speak back. Today, the technology is still a long way from delivering natural, unstructured conversations with computers that sound like humans; however, speech technology is already delivering some very real benefits in real applications.

The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).

Technology Overview (By Microsoft Corporation: Speech evaluation)

 

2. Language Engineering

Language Engineering provides ways in which we can extend and improve our use of language to make it a more effective tool. It is based on a vast amount of knowledge about language and the way it works, which has been accumulated through research. It uses language resources, such as electronic dictionaries and grammars, terminology banks and corpora, which have been developed over time. The research tells us what we need to know about language and develops the techniques needed to understand and manipulate it. The resources represent the knowledge base needed to recognise, validate, understand, and manipulate language using the power of computers. By applying this knowledge of language we can develop new ways to help solve problems across the political, social, and economic spectrum.

Language Engineering is a technology which uses our knowledge of language to enhance our application of computer systems:

Language Engineering (Brochure by HLTCentral: Language Engineering)

  2.1. What are the main techniques used in Language Engineering?

Language Engineering (Brochure by HLTCentral: Language Engineering)

  2.2. Which language resources are essential components of Language Engineering?

Language resources are essential components of Language Engineering. They are one of the main ways of representing the knowledge of language, which is used for the analytical work leading to recognition and understanding.

The work of producing and maintaining language resources is a huge task. Resources are produced, according to standard formats and protocols to enable access, in many EU languages, by research laboratories and public institutions. Many of these resources are being made available through the European Language Resources Association (ELRA).

Lexicons - A lexicon is a repository of words and knowledge about those words. This knowledge may include details of the grammatical structure of each word (morphology), its sound structure (phonology), and its meaning in different textual contexts, e.g. depending on the word or punctuation mark before or after it. A useful lexicon may have hundreds of thousands of entries. Lexicons are needed for every language of application.
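The idea of a lexicon as a repository of words with per-word knowledge can be sketched as a simple data structure. This is a minimal illustration only; the field names and entries below are invented, not a standard format:

```python
# A toy lexicon: each word maps to a list of entries, since one spelling
# can carry several parts of speech, pronunciations and senses.
lexicon = {
    "record": [
        {"pos": "noun", "phonology": "/'rek.ord/", "sense": "a stored account of facts"},
        {"pos": "verb", "phonology": "/ri'kord/", "sense": "to register sound or data"},
    ],
}

def lookup(word):
    """Return all known entries for a word, or an empty list if unknown."""
    return lexicon.get(word.lower(), [])
```

A real lexicon would hold hundreds of thousands of such entries, but the lookup idea is the same.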

Specialist Lexicons - There are a number of special cases which are usually researched and produced separately from general purpose lexicons:

Grammars - A grammar describes the structure of a language at different levels: word (morphological grammar), phrase, sentence, etc. A grammar can deal with structure both in terms of surface (syntax) and meaning (semantics and discourse).

Corpora - A corpus is a body of language, either text or speech, which provides the basis for:

There are national corpora of hundreds of millions of words, but there are also corpora constructed for particular purposes. For example, a corpus could comprise recordings of car drivers speaking to a simulated control system that recognises spoken commands; such a corpus could then be used to help establish the user requirements for a voice-operated control system for the market.

Language Engineering (Brochure by HLTCentral: Language Engineering)

 

3. Speech Technology

  3.1. About Speech Recognition

Speech recognition, or speech-to-text, involves capturing and digitizing the sound waves, converting them to basic language units or phonemes, constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right).
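The final, contextual step of this pipeline, choosing between words that sound alike, can be illustrated with a toy bigram model. The counts and word pairs below are invented for illustration; real recognizers use far larger statistical language models:

```python
# Invented bigram counts: how often each word follows a given previous word.
BIGRAMS = {
    ("to", "write"): 50, ("to", "right"): 2,
    ("turn", "right"): 40, ("turn", "write"): 0,
}

def pick_homophone(previous_word, candidates):
    """Choose the candidate most often seen after the previous word."""
    return max(candidates, key=lambda w: BIGRAMS.get((previous_word, w), 0))

pick_homophone("to", ["write", "right"])    # chooses "write"
pick_homophone("turn", ["write", "right"])  # chooses "right"
```

The same acoustic evidence ("rait") thus resolves differently depending on context, which is exactly what the contextual-analysis stage is for.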

Recognizers, also referred to as speech recognition engines, are the software drivers that convert the acoustic signal to a digital signal and deliver recognized speech as text to your application. Most recognizers support continuous speech, meaning you can speak naturally into a microphone at the speed of most conversations. Isolated or discrete speech recognizers require the user to pause after each word, and are currently being replaced by continuous speech engines.

Continuous speech recognition engines currently support two modes of speech recognition:

Dictation mode allows users to dictate memos, letters, and e-mail messages, as well as to enter data using a speech recognition dictation engine. The possibilities for what can be recognized are limited by the size of the recognizer's "grammar" or dictionary of words. Most recognizers that support dictation mode are speaker-dependent, meaning that accuracy varies on the basis of the user's speaking patterns and accent. To ensure accurate recognition, the application must create or access a "speaker profile" that includes a detailed map of the user's speech patterns used in the matching process during recognition.

Command and control mode offers developers the easiest implementation of a speech interface in an existing application. In command and control mode, the grammar (or list of recognized words) can be limited to the list of available commands, a much more finite scope than that of continuous dictation grammars, which must encompass nearly the entire dictionary. This provides better accuracy and performance, and reduces the processing overhead required by the application. The limited grammar also enables speaker-independent processing, eliminating the need for speaker profiles or "training" of the recognizer.
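The closed grammar that makes command and control mode accurate and speaker-independent can be sketched as a simple lookup against a finite command list. This is a minimal illustration; the command names are invented:

```python
# The entire "grammar": a small, closed set of accepted commands.
COMMANDS = {"open file", "save file", "close window", "print document"}

def recognize_command(utterance):
    """Accept the utterance only if it matches a command in the grammar."""
    text = utterance.lower().strip()
    return text if text in COMMANDS else None
```

Because anything outside this set is simply rejected, the recognizer never has to distinguish between tens of thousands of similar-sounding dictionary words, which is why accuracy improves and no speaker profile is needed.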

Speech recognition technology enables developers to include the following features in their applications:

Technology Overview (By Microsoft Corporation: Speech evaluation)

  3.2. About Speech Synthesis

Speech Synthesis, or text-to-speech, is the process of converting text into spoken language. This involves breaking down the words into phonemes; analyzing for special handling of text such as numbers, currency amounts, inflection, and punctuation; and generating the digital audio for playback.
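The text-analysis step, expanding numbers and currency amounts into speakable words, can be sketched as follows. This is a toy normalizer whose rules cover only bare digits and a dollar sign; real TTS front ends handle many more cases:

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Spell out $ amounts and bare digits so a synthesizer can speak them."""
    # "$5" -> "5 dollars"
    text = re.sub(r"\$(\d+)", lambda m: m.group(1) + " dollars", text)
    # "42" -> "four two" (a real system would say "forty-two")
    return " ".join(
        " ".join(DIGITS[int(d)] for d in tok) if tok.isdigit() else tok
        for tok in text.split()
    )

normalize("pay $5 now")  # "pay five dollars now"
```

Only after this normalization are the words broken into phonemes and rendered as audio.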

Software drivers called synthesizers, or text-to-speech voices, perform speech synthesis, handling the complexity of converting text and generating spoken language. A text-to-speech voice generates sounds similar to those created by human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position. Although easy to understand, the voice produced by synthesis technology tends to sound less human than a voice reproduced by a digital recording.

Nevertheless, text-to-speech applications may be the better alternative in situations where a digital audio recording is inadequate or impractical. Generally, consider using text-to-speech when:

Technology Overview (By Microsoft Corporation: Speech evaluation)

  3.3. Speech-to-speech Machine Translation

At present there are only a few speech-to-speech machine translation projects, whether in Europe, the United States or Japan. Nevertheless, speech-to-speech translation is continually increasing in importance, much like cellular telephony and machine translation technologies. Without a doubt, speech-to-speech machine translation will be commonplace within a few years.

Because oral language is the most spontaneous and natural form of communication among people, speech technology is perceived as a determining factor in achieving better interaction with computers. The industry is aware of this fact and realises that the incorporation of speech technology will be the ultimate step in bringing computers closer to the general public.

The achievements of the EuTrans project reveal two things. The first is that speech-to-speech translation is conditional on the development of speech recognition technology itself. Secondly, that the models employed in speech recognition based on large corpora have proved valid also for the development of speech translation. This implies that in the future these two technologies could be successfully integrated.

At present, however, speech-to-speech translation systems are not commonplace. In recent years speech recognition techniques have made important strides forward, thanks to the increased availability of the resources needed for their development: large collections of oral texts and more efficient data-oriented processing techniques, such as those designed by the PRHLT group itself. However, the integration of these systems into marketable products is still some way off.

It is worth remembering that most prototypes developed within research projects are currently only capable of processing a few hundred sentences (around 300), on very specific topics (accommodation booking, trip planning, etc.) and for a small group of languages: English, German, Japanese, Spanish and Italian. It seems unlikely that any application will be able to go beyond these boundaries in the near future.

The direct incorporation of speech translation prototypes into industrial applications is at present too costly. However, the growing demand for these products leads us to believe that they will soon be on the market at more affordable prices. The systems developed in projects such as Verbmobil, EuTrans or Janus—despite being at the laboratory phase—contain in practice thoroughly evaluated and robust technologies. A manufacturer considering their integration may join R&D projects and take part in the development of prototypes with the prospect of a fast return on investment. It is quite clear that we are witnessing the emergence of a new technology with great potential for penetrating the telecommunications and microelectronics market in the not too distant future.

Another remarkable aspect of the EuTrans project is its methodological contribution to machine translation as a whole, both in speech and written modes. Although these two modes of communication are very different in essence, and their respective technologies cannot always be compared, speech-to-speech translation has brought prospects of improvement for text translation. Traditional methods for written texts tend to be based on grammatical rules. Therefore, many MT systems show no coverage problem, although this is achieved at the expense of quality. The most common way of improving quality is by restricting the topic of interest. It is widely accepted that broadening of coverage immediately endangers quality. In this sense, learning techniques that enable systems to automatically adapt to new textual typologies, styles, structures, terminological and lexical items could have a radical impact on the technology.

Due to the differences between oral and written communication, rule-based systems prepared for written texts can hardly be re-adapted to oral applications. This approach has been tried, and it has failed. On the contrary, example-based learning methods designed for speech-to-speech translation systems can easily be adapted to written texts, given the increasing availability of bilingual corpora. One of the main contributions of the PRHLT-ITI group lies precisely in its learning model based on bilingual corpora. Herein lie some interesting prospects for improving written translation techniques.
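The example-based approach described above can be illustrated with a toy phrase-table lookup, in which translations of phrases seen in a bilingual corpus are reused greedily. The phrase pairs below are invented for illustration; real systems extract millions of such pairs automatically:

```python
# A tiny English-Spanish phrase table, as might be learned from a
# bilingual corpus in the accommodation-booking domain.
PHRASES = {
    "good morning": "buenos días",
    "a single room": "una habitación individual",
    "please": "por favor",
}

def translate(sentence):
    """Greedily replace the longest known source phrase at each position."""
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try longest span first
            chunk = " ".join(words[i:j])
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i = j
                break
        else:
            out.append(words[i])  # unknown word: pass it through
            i += 1
    return " ".join(out)
```

Extending coverage is then a matter of adding more corpus-derived phrase pairs rather than writing new grammar rules, which is the adaptability advantage the text describes.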

Effective speech-to-speech translation, along with other voice-oriented technologies, will become available in the coming years, albeit with some limitations, e.g. the number of languages, linguistic coverage, and context. It could be argued that EuTrans' main contribution has been to raise the possibilities of speech-to-speech translation to the levels of speech recognition technology, making any new innovation immediately accessible.

Speech-to-speech Machine Translation (By Joseba Abaitua: Speech-to-speech machine translation)

  3.4. Three Speech-to-speech Machine Translation Projects

CORETEX.- Nowadays, commercial speech recognition systems work well for a very specific task and language. However, they are not able to adapt to new domains, acoustic environments and languages. The objectives of the CORETEX project are to develop generic speech recognition technology that works well for a wide range of tasks with essentially no exposure to task-specific data, and to develop methods for rapid porting to new domains and languages with limited, inaccurately transcribed, or untranscribed training data. Another objective is to investigate techniques to produce an enriched symbolic speech transcription with extra information for higher-level (symbolic) processing, and to explore methods of using contemporary and/or topic-related texts to improve language models, as well as automatic pronunciation generation for vocabulary extension.
We began with first investigations into unsupervised training, i.e. training a speech recognition system for a new task without dedicated transcribed training data for that specific task. One problem with genericity and portability is the recognition vocabulary: when shifting to a new task, a lot of work has to be done to build phonetic transcriptions for new words manually. We developed a method for automatically determining the phonetic transcription (see the section on Pronunciation Modeling). Furthermore, we built a system to segment recorded broadcast shows into parts which can be handled by the speech recognition system.
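The automatic pronunciation generation mentioned above can be illustrated with a toy rule-based grapheme-to-phoneme converter. The rules below are invented and cover only a few English digraphs; real systems learn such mappings from data:

```python
# A few letter-to-sound rules, tried before falling back to single letters.
RULES = [("sh", "SH"), ("ch", "CH"), ("ph", "F"), ("th", "TH")]

def to_phonemes(word):
    """Apply digraph rules first, then map remaining letters to themselves."""
    phones, i, w = [], 0, word.lower()
    while i < len(w):
        for graph, phone in RULES:
            if w.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            phones.append(w[i].upper())
            i += 1
    return phones

to_phonemes("ship")  # ["SH", "I", "P"]
```

Automating this step removes the manual transcription work that otherwise makes porting a recognizer to a new vocabulary so costly.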

LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Technologies).- The objective of LC-STAR is to improve human-to-human and man-machine communication in multilingual environments. The project aims to create the lexica and corpora needed for speech-to-speech translation. Within LC-STAR, quasi-industrial standards for these language resources will be established, lexica for 12 languages and text corpora for 3 languages will be created, and a speech-to-speech translation demonstrator for three languages (English, Spanish and Catalan) will be developed. The Lehrstuhl für Informatik VI will focus on the investigation of speech-centered translation technologies, concentrating on requirements concerning language resources and the creation of lexica for speech recognition in German. LC-STAR is supported by the European Union.

VERBMOBIL II.- Verbmobil is a speaker-independent and bidirectional speech-to-speech translation system for spontaneous dialogues in mobile situations. It recognizes spoken input, analyses and translates it, and finally utters the translation. The multilingual system handles dialogues in three business-oriented domains, with context-sensitive translation between three languages (German, English, and Japanese). Within the BMBF-funded project, the Lehrstuhl für Informatik VI performed research on both speech recognition and translation. For both tasks, statistical methods were used, and self-contained software modules were developed and integrated into the final prototype system. For the speech recognition part we developed efficient search algorithms which operate in real time. In the end-to-end evaluation, the statistical machine translation approach significantly outperformed competing approaches such as classical transfer-based translation or example-based translation.

Projects (Speech-to-speech machine translation projects)

 

Conclusion

Throughout this report I have tried to review and explain Language Engineering and Speech Technology in some depth; in doing so I have learned the importance of these two subjects and the importance that new technologies have in them.

Nowadays, new technologies give people many advantages in study and work. Thanks to them we can work on language translation, which makes it easier for students with difficulties to continue their studies. There are also many advantages for ordinary users of the internet and PCs, for example in the gaming industry: thanks to speech recognition, traditional computer-based characters could evolve into characters that the user can actually talk to, and speech synthesis offers further advantages, such as allowing the characters in an application to "talk" back to the user instead of displaying speech balloons.

 

On-line references (in order of appearance)

Language Engineering (Brochure by HLTCentral: cached copy)

Technology Overview (By Microsoft Corporation: Speech evaluation)

Speech-to-speech Machine Translation (By Joseba Abaitua: Speech-to-speech machine translation)

Projects (Speech-to-speech machine translation projects)