Dafne Gurrutxaga




This report is a brief summary of Human Language Technologies (HLT). All the information is taken from the web pages provided by the teacher of the subject English Language and New Technologies. The report discusses the importance of technology in society, and specifically in the field of linguistics, and shows how technology is advancing to help people with languages.




New technologies are the basis of today's society and play a major role in human life. Among them are “Human Language Technologies”, which have been developing progressively over the last 40 years. Thanks to Human Language Technologies, people can communicate with each other whether or not they are near one another. In addition, systems have been created with which a machine can translate information from one language into another. This is called “Machine Translation”.

This report gives a brief summary that provides a general idea of what Human Language Technologies are and how they work.


Questionnaire 1



The overall objective of HLT is to support e-business in a global context and to promote a human centred infostructure ensuring equal access and usage opportunities for all. This is to be achieved by developing multilingual technologies and demonstrating exemplary applications providing features and functions that are critical for the realisation of a truly user friendly Information Society. Projects address generic and applied RTD from a multi- and cross-lingual perspective, and undertake to demonstrate how language specific solutions can be transferred to and adapted for other languages.

While elements of the three initial HLT action lines - Multilinguality, Natural Interactivity and Crosslingual Information Management - are still present, there has been periodic re-assessment and tuning of them to emerging trends and changes in the surrounding economic, social, and technological environment. The "trials and best practice in multilingual e-service and e-commerce" action line was introduced in the IST 2000 work programme (IST2000) to stimulate new forms of partnership between technology providers, system integrators and users through trials and best practice actions addressing end-to-end multi-language platforms and solutions for e-service and e-commerce. The fifth IST call for proposals covered this action line.

Taken from:

·         Human Language Technologies and the information society (Presentation of Action Line, by the EC: caché)



A natural language is one that evolved along with a culture of human native speakers who use the language for general-purpose communication. Languages like English, American Sign Language and Japanese are natural languages, while languages like Esperanto are called constructed languages, having been deliberately created for a specific purpose.

Natural Language Generation (NLG) is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form.

Some people view NLG as the opposite of natural language understanding. The difference can be put this way: whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to take decisions about how to put a concept into words.

Taken from:

From Wikipedia, the free encyclopaedia.


  Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition. Computational linguistics has applied and theoretical components.


  Theoretical CL takes up issues in theoretical linguistics and cognitive science. It deals with formal theories about the linguistic knowledge that a human needs for generating and understanding language. Today these theories have reached a degree of complexity that can only be managed by employing computers. Computational linguists develop formal models simulating aspects of the human language faculty and implement them as computer programmes. These programmes constitute the basis for the evaluation and further development of the theories. In addition to linguistic theories, findings from cognitive psychology play a major role in simulating linguistic competence. Within psychology, it is mainly the area of psycholinguistics that examines the cognitive processes constituting human language use. The relevance of computational modelling for psycholinguistic research is reflected in the emergence of a new subdiscipline: computational psycholinguistics.


  Applied CL focusses on the practical outcome of modelling human language use. The methods, techniques, tools and applications in this area are often subsumed under the term language engineering or (human) language technology. Although existing CL systems are far from achieving human ability, they have numerous possible applications. The goal is to create software products that have some knowledge of human language. Such products are going to change our lives. They are urgently needed for improving human-machine interaction since the main obstacle in the interaction between human and computer is a communication problem. Today's computers do not understand our language but computer languages are difficult to learn and do not correspond to the structure of human thought. Even if the language the machine understands and its domain of discourse are very restricted, the use of human language can increase the acceptance of software and the productivity of its users.


Taken from:

·         What is Computational Linguistics?, by Hans Uszkoreit (caché)


The development and convergence of computer and telecommunication technologies has led to a revolution in the way that we work, communicate with each other, buy goods and use services, and even the way we entertain and educate ourselves.

One of the results of this revolution is that large volumes of information will increasingly be held in a form which is more natural for human users than the strictly formatted, structured data typical of computer systems of the past. Information presented in visual images, as sound, and in natural language, either as text or speech, will become the norm.

We all deal with computer systems and services, either directly or indirectly, every day of our lives. This is the information age and we are a society in which information is vital to economic, social, and political success as well as to our quality of life.

The changes of the last two decades may have seemed revolutionary but, in reality, we are only on the threshold of this new age. There are still many new ways in which the application of telematics and the use of language technology will benefit our way of life, from interactive entertainment to lifelong learning.

Although these changes will bring great benefits, it is important that we anticipate difficulties which may arise, and develop ways to overcome them. Examples of such problems are:

Language Engineering can solve these problems.

Taken from:

·         Language Engineering and the Information Society (Document from I*M Europe)


Interactive multimedia content and services, interpersonal communication, cross-border trade and product documentation are all inherently bound to language and culture. Advances in computerised analysis, understanding and generation of written and spoken language are going to revolutionise human-computer interaction and technology mediated person-to-person communication.

Human Language Technologies aims to further strengthen Europe's position at the forefront of language-enabled systems and services. It will help bring the information society closer to the citizen by "humanising" information and communication services, and demonstrate the economic impact of language enabled applications in key sectors, notably those addressed by the Information Society Technologies (IST) programme.

The focus will be on three major challenges presented by key drivers of the Information Society - specifically, the globalisation of economy and society, high-bandwidth digital communication and the World Wide Web - for which human language technologies play a central role:

1.      adding multilinguality to information and communication systems, at all stages of the information cycle, including content generation and maintenance in multiple languages, content and software localisation, automated translation and interpretation, and computer assisted language training;

2.      providing natural interactivity and accessibility of digital services through multimodal dialogues, understanding of messages and communicative acts, unconstrained language input-output and keyboard-less operation;

3.      enabling active digital content for an optimal use and acquisition by all, through personalised language assistants supporting deep information analysis, knowledge extraction and summarisation, meaning classification and metadata generation.

Taken from:

Information Policy for an Information Society (Paper by Mairéad Browne - caché).

·         Is there any concern in Europe with Human Language Technologies?

The importance for Europe, in particular in the information age, to capitalise on the wealth represented by its linguistic and cultural diversity, while overcoming the inherent inefficiencies associated with it, has repeatedly been stated at various institutional and extra-institutional levels. In particular the relevance of linguistic and cultural aspects of the Information Society in Europe has been stressed by the European Council, the European Parliament, and by the G7 Conference of Ministers.

The G7 conference on The Information Society and Development has emphasised the fact that information technologies have a tremendous potential to preserve and exploit cultural and linguistic diversity.

The Information Society Forum has pointed out that, while Europe's cultural and linguistic diversity is a unique wealth, it is also a major challenge that can act as a powerful barrier to human and business communication, and to the development of a single market for European goods and services. It has expressed the opinion that, given the appropriate framework, Europe's cultural and linguistic diversity will be strengthened not threatened, providing new global opportunities for information products that exploit Europe's rich heritage.

Taken from:

·         Living and Working Together in the Information Society (Discussion Document from HLTCentral).


·         What is the current situation of the HLTCentral.org office?

www.HLTCentral.org, the central resource of European HLT developments, is seeking sponsors for the continued operation of the web site in 2004 and beyond. A variety of sponsorship, advertising and content options are available.

Taken from:




Questionnaire 2

·         Which are the main techniques used in Language Engineering?


Language Engineering comprises a set of techniques and language resources. The former are implemented in computer software and the latter are a repository of knowledge which can be accessed by computer software.


There are many techniques used in Language Engineering and some of these are described below.

Speaker Identification and Verification

A human voice is as unique to an individual as a fingerprint. This makes it possible to identify a speaker and to use this identification as the basis for verifying that the individual is entitled to access a service or a resource. The types of problems which have to be overcome are, for example, recognising that the speech is not recorded, selecting the voice through noise (either in the environment or the transfer medium), and identifying reliably despite temporary changes (such as caused by illness).


Speech Recognition

The sound of speech is received by a computer in analogue wave forms which are analysed to identify the units of sound (called phonemes) which make up words. Statistical models of phonemes and words are used to recognise discrete or continuous speech input. The production of quality statistical models requires extensive training samples (corpora) and vast quantities of speech have been collected, and continue to be collected, for this purpose.

There are a number of significant problems to be overcome if speech is to become a commonly used medium for dealing with a computer. The first of these is the ability to recognise continuous speech rather than speech which is deliberately delivered by the speaker as a series of discrete words separated by a pause. The next is to recognise any speaker, avoiding the need to train the system to recognise the speech of a particular individual. There is also the serious problem of the noise which can interfere with recognition, either from the environment in which the speaker uses the system or through noise introduced by the transmission medium, the telephone line, for example. Noise reduction, signal enhancement and key word spotting can be used to allow accurate and robust recognition in noisy environments or over telecommunication networks. Finally, there is the problem of dealing with accents, dialects, and language spoken, as it often is, ungrammatically.


Character and Document Image Recognition

Recognition of written or printed language requires that a symbolic representation of the language is derived from its spatial form of graphical marks. For most languages this means recognising and transforming characters. There are two cases of character recognition:

OCR from a single printed font family can achieve a very high degree of accuracy. Problems arise when the font is unknown or very decorative, or when the quality of the print is poor. In these difficult cases, and in the case of handwriting, good results can only be achieved by using ICR. This involves word recognition techniques which use language models, such as lexicons or statistical information about word sequences.

Document image analysis is closely associated with character recognition but involves the analysis of the document to determine firstly its make-up in terms of graphics, photographs, separating lines and text, and then the structure of the text to identify headings, sub-headings, captions etc. in order to be able to process the text effectively.


Natural Language Understanding

The understanding of language is obviously fundamental to many applications. However, perfect understanding is not always a requirement. In fact, gaining a partial understanding is often a very useful preliminary step in the process because it makes it possible to be intelligently selective about taking the depth of understanding to further levels.

Shallow or partial analysis of texts is used to obtain a robust initial classification of unrestricted texts efficiently. This initial analysis can then be used, for example, to focus on 'interesting' parts of a text for a deeper semantic analysis which determines the content of the text within a limited domain. It can also be used, in conjunction with statistical and linguistic knowledge, to identify linguistic features of unknown words automatically, which can then be added to the system's knowledge.

Semantic models are used to represent the meaning of language in terms of concepts and relationships between them. A semantic model can be used, for example, to map an information request to an underlying meaning which is independent of the actual terminology or language in which the query was expressed. This supports multi-lingual access to information without a need to be familiar with the actual terminology or structuring used to index the information.
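
As a rough illustration of the idea above, the following sketch maps query terms in different languages onto shared concept identifiers, so that a request is matched on its meaning rather than on its wording. The concept names, the term table, the document index and both functions are hypothetical and were invented for this example; a real semantic model would be far richer.

```python
# Hypothetical toy "semantic model": surface terms in several languages
# are mapped to language-independent concept identifiers.
TERM_TO_CONCEPT = {
    "train": "RAIL_VEHICLE",
    "tren": "RAIL_VEHICLE",       # Spanish
    "zug": "RAIL_VEHICLE",        # German
    "timetable": "SCHEDULE",
    "horario": "SCHEDULE",        # Spanish
    "fahrplan": "SCHEDULE",       # German
}

# Documents are indexed by concept, not by the words they happen to use.
DOCUMENTS = {
    "doc1": {"RAIL_VEHICLE", "SCHEDULE"},
    "doc2": {"SCHEDULE"},
}


def query_to_concepts(query):
    """Return the set of concepts mentioned in a query, ignoring unknown words."""
    words = query.lower().split()
    return {TERM_TO_CONCEPT[w] for w in words if w in TERM_TO_CONCEPT}


def search(query):
    """Return the sorted names of documents covering every concept in the query."""
    wanted = query_to_concepts(query)
    if not wanted:
        return []
    return sorted(name for name, concepts in DOCUMENTS.items() if wanted <= concepts)
```

With this table, the queries "train timetable", "horario tren" and "Fahrplan Zug" all reduce to the same pair of concepts and therefore retrieve the same document.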

Combinations of analysis and generation with a semantic model allow texts to be translated. At the current stage of development, applications where this can be achieved need be limited in vocabulary and concepts so that adequate Language Engineering resources can be applied. Templates for document structure, as well as common phrases with variable parts, can be used to aid generation of a high quality text.


Natural Language Generation

A semantic representation of a text can be used as the basis for generating language. An interpretation of basic data or the underlying meaning of a sentence or phrase can be mapped into a surface string in a selected fashion; either in a chosen language or according to stylistic specifications by a text planning system.


Speech Generation

Speech is generated from filled templates, by playing 'canned' recordings or concatenating units of speech (phonemes, words) together. Speech generated has to account for aspects such as intensity, duration and stress in order to produce a continuous and natural response.

Dialogue can be established by combining speech recognition with simple generation, either from concatenation of stored human speech components or synthesising speech using rules.

Providing a library of speech recognisers and generators, together with a graphical tool for structuring their application, allows someone who is neither a speech expert nor a computer programmer to design a structured dialogue which can be used, for example, in automated handling of telephone calls.


Language Resources

Language resources are essential components of Language Engineering. They are one of the main ways of representing the knowledge of language, which is used for the analytical work leading to recognition and understanding.

The work of producing and maintaining language resources is a huge task. Resources are produced, according to standard formats and protocols to enable access, in many EU languages, by research laboratories and public institutions. Many of these resources are being made available through the European Language Resources Association (ELRA).



A lexicon is a repository of words and knowledge about those words. This knowledge may include details of the grammatical structure of each word (morphology), the sound structure (phonology), the meaning of the word in different textual contexts, e.g. depending on the word or punctuation mark before or after it. A useful lexicon may have hundreds of thousands of entries. Lexicons are needed for every language of application.
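
To make the description above more concrete, here is a minimal sketch of how one lexicon entry might be stored. The class name `LexiconEntry`, its fields and the sample entry are assumptions made for illustration only; they do not come from the brochure, and real lexicons hold much more detail.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """One word and some of the knowledge a lexicon may hold about it."""
    lemma: str
    part_of_speech: str
    inflections: dict = field(default_factory=dict)   # morphology
    pronunciation: str = ""                           # phonology
    senses: list = field(default_factory=list)        # meanings in context


# Hypothetical one-entry lexicon; a useful one may have hundreds of thousands.
LEXICON = {
    "dry": LexiconEntry(
        lemma="dry",
        part_of_speech="verb",
        inflections={"3sg": "dries", "past": "dried", "gerund": "drying"},
        pronunciation="/draɪ/",
        senses=["remove moisture from", "become free of moisture"],
    ),
}


def lookup_form(form):
    """Find the entry whose lemma or inflected forms match the given word form."""
    for entry in LEXICON.values():
        if form == entry.lemma or form in entry.inflections.values():
            return entry
    return None
```

Looking up the form "dries" returns the entry for "dry", which is the kind of morphological knowledge the paragraph refers to.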


Specialist Lexicons

There are a number of special cases which are usually researched and produced separately from general purpose lexicons:

Proper names: Dictionaries of proper names are essential to effective understanding of language, at least so that they can be recognised within their context as places, objects, persons, or perhaps animals. They take on a special significance in many applications, however, where the name is key to the application such as in a voice operated navigation system, a holiday reservations system, or railway timetable information system, based on automated telephone call handling.

Terminology: In today's complex technological environment there are a host of terminologies which need to be recorded, structured and made available for language enhanced applications. Many of the most cost-effective applications of Language Engineering, such as multi-lingual technical document management and machine translation, depend on the availability of the appropriate terminology banks.

Wordnets: A wordnet describes the relationships between words; for example, synonyms, antonyms, collective nouns, and so on. These can be invaluable in such applications as information retrieval, translator workbenches and intelligent office automation facilities for authoring.
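
The wordnet idea can be sketched with a small dictionary of relations. Everything in the example below (the three words, the relation names and the two helper functions) is hypothetical and only meant to show how such relationships might support information retrieval by adding synonyms to a query.

```python
# Hypothetical miniature wordnet: each word lists related words by relation type.
WORDNET = {
    "big":   {"synonyms": {"large"},  "antonyms": {"small"}},
    "large": {"synonyms": {"big"},    "antonyms": {"small"}},
    "small": {"synonyms": {"little"}, "antonyms": {"big", "large"}},
}


def related(word, relation):
    """Return the words linked to `word` by `relation`, or an empty set."""
    return WORDNET.get(word, {}).get(relation, set())


def expand_query(terms):
    """Add synonyms to a list of search terms, as a retrieval system might."""
    expanded = set(terms)
    for term in terms:
        expanded |= related(term, "synonyms")
    return expanded
```

A search for "big" would then also match documents that only contain "large".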



A grammar describes the structure of a language at different levels: word (morphological grammar), phrase, sentence, etc. A grammar can deal with structure both in terms of surface (syntax) and meaning (semantics and discourse).



A corpus is a body of language, either text or speech, which provides the basis for:

There are national corpora of hundreds of millions of words but there are also corpora which are constructed for particular purposes. For example, a corpus could comprise recordings of car drivers speaking to a simulation of a control system that recognises spoken commands; the recordings are then used to help establish the user requirements for a voice operated control system for the market.


The Chain of Development and Application

The diagram below depicts the chain of activities which are involved in Language Engineering, from research to the delivery of language-enabled and language enhanced products and services to end-users. The process of research and development leads to the development of techniques, the production of resources, and the development of standards. These are the basic building blocks.

Taken from:

·         Language Engineering (Brochure by HLTCentral: caché)


The basic processes of Language Engineering are shown in the diagram below. These are broadly concerned with:

Model of a Language Enabled System

Within this general model there are, of course, many different configurations. Depending on the application of the technology, not all these components are needed.

Taken from:

·         Language Engineering (Brochure by HLTCentral: caché)




Authoring tools: facilities provided in conjunction with word processing to aid the author of documents, typically including an on-line dictionary and thesaurus, spell-, grammar-, and style-checking, and facilities for structuring, integrating and linking documents.

Taken from:


·         stemmer

A stemmer is a program or algorithm which determines the morphological root of a given inflected (or, sometimes, derived) word form -- generally a written word form.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem".

English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", "axes" being the plural of "ax" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language become more complex. For example, an Italian stemmer is more complex than an English one (because of more possible verb inflections), a Russian one is more complex (more possible noun declensions), a Hebrew one is even more complex (a hairy writing system), and so on.

Stemmers are common elements in query systems, since a user who runs a query on "daffodils" probably cares about documents that contain the word "daffodil" (without the s).
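
The examples given above can be reproduced with a very small suffix-stripping routine. This is only a naive sketch written for this report: the function name `naive_stem` and its rules are assumptions, it is not the Porter algorithm or any published stemmer, and it handles just a handful of English endings.

```python
def naive_stem(word):
    """Strip a few common English suffixes. A toy, not a published algorithm."""
    word = word.lower()
    # "dries" -> "dry": the "ies" ending usually replaces a final "y".
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    for suffix in ("ing", "ed", "er", "s"):
        # Only strip when at least three letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling ("stemm" -> "stem"), but leave vowels
            # and the letters "l" and "s" alone ("fall", "pass").
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word
```

This routine reduces "cats" to "cat", "daffodils" to "daffodil", and "stemmer", "stemming" and "stemmed" to "stem". It also shows the kind of problem mentioned above: "axes" becomes "axe", which is neither "ax" nor "axis".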


·         domain

Domain: usually applied to the area of application of the language enabled software, e.g. banking, insurance, travel, etc. The significance in Language Engineering is that the vocabulary of an application is restricted, so the language resource requirements are effectively limited by limiting the domain of application.

·         translator's workbench

Translator's workbench: a software system providing a working environment for a human translator, which offers a range of aids such as on-line dictionaries, thesauri, translation memories, etc.


·         shallow parser

Shallow parser: software which parses language to a point where a rudimentary level of understanding can be realised; this is often used in order to identify passages of text which can then be analysed in further depth to fulfil the particular objective.
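
One simple form of shallow parsing is to group words into flat noun phrases without building a full syntax tree. The sketch below assumes a tiny hand-made tag table (`TAGS`) and a hypothetical function `noun_phrase_chunks`; a real shallow parser would rely on a trained tagger and a much larger vocabulary.

```python
# Hypothetical hand-made tag lexicon; a real system would use a trained tagger.
TAGS = {
    "the": "DET", "a": "DET",
    "quick": "ADJ", "old": "ADJ",
    "fox": "NOUN", "dog": "NOUN", "translator": "NOUN",
    "sees": "VERB", "helps": "VERB",
}


def noun_phrase_chunks(sentence):
    """Group runs of determiner/adjective/noun words into flat noun-phrase chunks."""
    chunks, current = [], []
    for word in sentence.lower().split():
        tag = TAGS.get(word, "OTHER")
        if tag in ("DET", "ADJ", "NOUN"):
            current.append(word)
        else:
            # Any other tag closes the chunk; keep it only if it contains a noun.
            if any(TAGS.get(w) == "NOUN" for w in current):
                chunks.append(" ".join(current))
            current = []
    # Close a chunk that runs to the end of the sentence.
    if any(TAGS.get(w) == "NOUN" for w in current):
        chunks.append(" ".join(current))
    return chunks
```

For the sentence "The quick fox sees the old dog", the function returns the two chunks "the quick fox" and "the old dog", which could then be passed on for deeper analysis.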

Taken from:



Questionnaire 3



Comments about the state-of-the-art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.

Performance of speech recognition systems is typically described in terms of word error rate, E, defined as:

$$E = \frac{S + I + D}{N} \times 100\%$$

where N is the total number of words in the test set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively.
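
Word error rate is conventionally the number of substitutions, insertions and deletions divided by N, expressed as a percentage. The sketch below shows one common way to obtain that count, by aligning the reference and the recogniser output with a word-level edit distance. The function name and the whitespace tokenisation are assumptions made for this example; standard scoring tools apply more careful text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """Return (S + I + D) / N * 100 for two whitespace-tokenised strings."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum number of edits needed to turn the first i
    # reference words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)
```

For example, if the reference is "a b c d" and the recogniser outputs "a x c d", there is one substitution in four words, so the word error rate is 25%.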

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the hidden Markov model (HMM). The HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 respectively).

Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware---a feat unimaginable only a few years ago.

One of the most popular, and potentially most useful tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent performance on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.

High perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity ( ), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North America business news [PFF 94].

With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10--20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.

At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50% [CGF94]. It will be many years before unlimited vocabulary, speaker-independent continuous dictation capability is realized.


·         Explain the main differences between speech recognition and speech synthesis.

Speech Recognition

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as commands & control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in section .

Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in Figure . An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from script. Some systems require speaker enrollment---a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.
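
A finite-state network of this kind can be sketched as a table that lists, for each word, the words allowed to follow it. The vocabulary, the start and end markers and the function `is_permitted` below are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical word-pair network: for each word, the set of words allowed
# to follow it. "<s>" and "</s>" mark the start and end of an utterance.
FOLLOWERS = {
    "<s>": {"call", "dial"},
    "call": {"home", "office"},
    "dial": {"home", "office"},
    "home": {"</s>"},
    "office": {"</s>"},
}


def is_permitted(sentence):
    """Check that every adjacent word pair is allowed by the network."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return all(nxt in FOLLOWERS.get(cur, set()) for cur, nxt in zip(words, words[1:]))
```

With this network the recogniser would accept "call home" or "dial office" but would never consider "home call", which reduces the number of word sequences it has to search.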

One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see section for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.
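
The loose definition of perplexity given above can be checked with a short calculation. In the sketch below, perplexity is computed as the geometric mean of 1/p over the words of a sequence, where p is the probability the language model assigns to each word. The function name and the example probabilities are assumptions; in particular, the 11-word digit vocabulary (zero to nine, plus "oh") with equal probabilities is only an illustrative assumption.

```python
import math


def perplexity(word_probabilities):
    """Perplexity of a word sequence given the model probability of each word.

    Computed as exp(-mean(log p)), which is the geometric mean of 1/p.
    """
    if not word_probabilities:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


# Assumption: 11 digit words, each equally likely at every position,
# so every word in a 7-digit string has probability 1/11.
uniform_digits = [1 / 11] * 7
digit_perplexity = perplexity(uniform_digits)
```

When every one of k words is equally likely to follow, the perplexity is k, so `digit_perplexity` comes out at 11 (up to floating-point rounding). A language model that makes some continuations more likely than others lowers the perplexity below the vocabulary size.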

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in American English. At word boundaries, contextual variations can be quite dramatic, making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian.

Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

Figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10 to 20 msec (see sections and 11.3 for signal representation and digital signal processing, respectively). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.
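The first step, turning the digitized signal into measurements at a fixed rate, can be sketched as follows. This is only an illustration under stated assumptions: the sample rate of 8000 Hz, the 20 msec frame length, the 10 msec shift and the choice of log energy as the single feature are all assumed here, and real systems compute richer feature vectors.

```python
import math


def frame_log_energy(samples, sample_rate=8000, frame_ms=20, shift_ms=10):
    """Return one log-energy value per frame, one frame every shift_ms."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    features = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # A small constant keeps the logarithm defined for silent frames.
        features.append(math.log(energy + 1e-10))
    return features


# One second of a constant signal at 8000 Hz yields 99 overlapping frames.
one_second = [1.0] * 8000
features = frame_log_energy(one_second)
```

Each value in `features` corresponds to one 10 msec step of the signal, and it is a sequence of this kind that the search stage then consumes.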

Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics [Her90]. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (see section ). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modeling.

Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.
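A statistical language model based on the frequency of word sequences can be sketched with a simple bigram model. The corpus below is invented for illustration, `train_bigram` is a hypothetical helper name, and no smoothing is applied, so unseen word pairs receive probability zero; practical systems would not leave it that way.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate P(word | previous word) from relative frequencies."""
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # "<s>" and "</s>" are assumed sentence-boundary markers.
        words = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(words[:-1])
        pair_counts.update(zip(words[:-1], words[1:]))

    def prob(previous, word):
        if history_counts[previous] == 0:
            return 0.0  # unseen history; unsmoothed estimate
        return pair_counts[(previous, word)] / history_counts[previous]

    return prob


# Hypothetical three-sentence training corpus.
prob = train_bigram(["call home", "call office", "call home now"])
p_home_after_call = prob("call", "home")
```

Here "home" follows "call" in two of the three training sentences, so `p_home_after_call` is 2/3. Probabilities of this kind are what guide the search towards the more probable word sequences.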

The dominant recognition paradigm of the past fifteen years is based on hidden Markov models (HMMs). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame, surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in sections , and 11.2. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5.
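To give an idea of how an HMM is used to find the most likely hidden sequence, here is a small sketch of the Viterbi algorithm, a standard search procedure for HMMs that the text above does not name explicitly. The two states, the two observation symbols and all probabilities are invented for illustration; a real recognizer works with phoneme models and continuous acoustic features.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy model: silence versus speech, observed as low or high energy.
states = ("sil", "speech")
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
}
emit_p = {
    "sil": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.2, "high": 0.8},
}
path, probability = viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p)
```

For the observation sequence low, high, high the best path is sil, speech, speech. Both the transitions between states and the observations emitted by each state are probabilistic, which is what "doubly stochastic" refers to.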

An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks [ZGPS90,FBC95].


Synthetic Speech Generation

Speech generation is the process which allows the transformation of a string of phonetic and prosodic symbols into a synthetic speech signal. The quality of the result is a function of the quality of the string, as well as of the quality of the generation process itself. For a review of speech generation in English the reader is referred to [FR73] and [Kla87]. Recent developments can be found in [BB92], and in [VSSOH95].

Let us first examine what is required today from a text-to-speech (TtS) system. Usually two quality criteria are proposed. The first one is intelligibility, which can be measured by taking into account several kinds of units (phonemes, syllables, words, phrases). The second one, more difficult to define, is often labeled as pleasantness or naturalness. The concept of naturalness may be related to the concept of realism in the field of image synthesis: the goal is not to reproduce reality but to suggest it. Thus, listening to a synthetic voice must allow the listener to attribute this voice to some pseudo-speaker and to perceive some kind of expressivity, as well as some cues characterizing the speaking style and the particular speaking situation. For this purpose the corresponding extra-linguistic information must be supplied to the system [GN92].

Most present TtS systems produce an acceptable level of intelligibility, but the naturalness dimension and the ability to control expressivity, speech style and pseudo-speaker identity are still poorly mastered. However, users' demands vary to a large extent according to the field of application: general public applications such as telephone information retrieval need maximal realism and naturalness, whereas some applications involving professionals (process or vehicle control) or highly motivated persons (visually impaired users, applications in hostile environments) demand intelligibility with the highest priority.



·         Speech-to-speech machine translation. List and describe at least three projects.

At present there are only a few speech-to-speech machine translation projects, whether in Europe, the United States or Japan. Nevertheless, speech-to-speech translation is continually increasing in importance, much like cellular telephony and machine translation technologies. Without a doubt, speech-to-speech machine translation will become commonplace within a few years.

Because oral language is the most spontaneous and natural form of communication among people, speech technology is perceived as a determining factor in achieving better interaction with computers. The industry is aware of this fact and realises that the incorporation of speech technology will be the ultimate step in bringing computers closer to the general public.

As personal computers are equipped with more and more telematic applications, and with the impending arrival of third generation mobile phones, reliable speech recognition is becoming a must. There have been important advances in recent years, although some limitations persist, for example in vocabulary, in domain coverage and in the treatment of disfluencies (the variation in the fluency of speech). Despite these problems, the technology today is ready to offer a wide range of services.

One of the most attractive applications is without a doubt speech-to-speech machine translation. There are a small number of initiatives that have contributed significantly to the development of this technology. Verbmobil, a project sponsored by the German government, and the European EuTrans project are two worth mentioning.

In the following interview, we have two representatives of one of the Spanish research groups that have gained recognition in recent years thanks to their research on speech-to-speech translation. The group in question is the Pattern Recognition and Human Language Technology (PRHLT) Unit of the Universitat Politècnica of València (UPV), co-directed by Francisco Casacuberta Nolla and Enrique Vidal Ruiz.

The PRHLT group carries out research both in speech technologies and in computer vision. The EuTrans project (Example-based language translation systems) is one of the many projects currently undertaken by the group. Other research projects include "EXTRA: Example-based extensions to text and speech translation in restricted domains" and "Translation and comprehension of the language spoken through example-based learning techniques: TRACOM", both funded by the Spanish Foundation of Science and Technology (CICYT). The group is also currently participating in a new European project: "TransType2 (TT2): Computer-Assisted Translation".




This report gives a general idea of what Human Language Technologies are and how important they are in society. I have learnt that technology is developing rapidly and that everybody must learn to use it for two important reasons: first, because it is very helpful to people, and second, because nowadays it is essential to know it. So people must get used to using these technologies while the technologies are being adapted to people's needs.