Human Language Technologies in the Information Society

 

1. ABSTRACT

This is a report about Human Language Technologies and their role in the Information Society. All the information in the report was found on the Internet: some of the pages were provided by Professor Abaitua and others were found through Google. The structure of the body is based on the questionnaires we have been answering during this course. The purpose of this report is to give a general idea of Human Language Technologies.

 

2. INTRODUCTION

Nowadays, many people use computers every day; the computer has become a necessary machine in modern society, also called the Information Society. This report is about Human Language Technologies in the Information Society. The impact new technologies are having on society is enormous, and that makes the topic really interesting.

This report focuses on the relationship between language and new technologies. It is based on a series of questionnaires. It first introduces the theme, Human Language Technologies, and then presents more information about the resources and applications of Language Engineering and speech technology.

The objective of the report is to learn a little more about Human Language Technologies in a general way. All the information in this report was found on the Net; more information about the subject is available in the web pages listed in the REFERENCES section.

 

3. BODY

First of all, we must understand the meaning of some terminology, such as Computational Linguistics and Language Engineering. On the page of the Journal of Natural Language Engineering, http://www.dcs.shef.ac.uk/~hamish/LeIntro.html#sec:defns (1999), we can find these definitions:

Language is a communication mechanism whose medium is text or speech, and LE is concerned with computer processing of text and speech. We will define engineering in contrast to science.

Herb Simon contends that ``...far from striving to separate science from engineering, we need not distinguish them at all. But if we insist upon a distinction, we can think of engineering as science for people who are impatient''. This point is undermined a little by a criterion which immediately follows and by which the two fields may be separated: ``While the scientist is interested specifically in creating new knowledge, the engineer is interested also in creating systems that achieve desired goals''. We'll preserve the distinction here and use a definition of science as ``the systematic study of the nature and behaviour of the material and physical universe, based on observation, experiment, and measurement, and the formulation of laws to describe these facts in general terms''.

Engineering connotes ``creating cost-effective solutions to practical problems by applying scientific knowledge'', or ``applying scientific principles to the design, construction and maintenance of engines, cars, machines etc.'', or a ``rigorous set of development methods''. The ``...basic difference between science and engineering is that science is concerned with finding out how the world works, while engineering is concerned with using that knowledge to build artifacts in such a way that one can expect them to perform as required''.

To define Computational Linguistics, we refer to our definitions of science and language: CL is that part of the science of human language that uses computers to aid observation of, or experiment with, language. If ``Theoretical linguists... attempt to characterise the nature of... either a language or Language'' or ``a grammar or Grammar'', then ``...theoretical Computational Linguistics proper consists in attempting such a characterisation computationally''. In other words, CL concentrates ``on studying natural languages, just as traditional Linguistics does, but using computers as a tool to model (and, sometimes, verify or falsify) fragments of linguistic theories deemed of particular interest''.

Natural Language Processing is a term used in a variety of ways in different contexts. Much work that goes under the heading of NLP could well fit under our definition of CL, and some could also fit the definition of LE that follows. Here we'll use a narrow definition that makes CL and NLP disjoint. Whereas CL is ``a branch of linguistics in which computational techniques and concepts are applied to the elucidation of linguistic and phonetic problems'', NLP is a branch of computer science that studies computer

...systems for processing natural languages. It includes the development of algorithms for parsing, generation, and acquisition of linguistic knowledge; the investigation of the time and space complexity of such algorithms; the design of computationally useful formal languages (such as grammar and lexicon formalisms) for encoding linguistic knowledge; the investigation of appropriate software architectures for various NLP tasks; and consideration of the types of non-linguistic knowledge that impinge on NLP. It is a fairly abstract area of study and it is not one that makes particular commitments to the study of the human mind, nor indeed does it make particular commitments to producing useful artifacts.

There are elements of both CL and NLP in Winograd's early description of the work as

...part of a newly developing paradigm for looking at human behaviour, which has grown up from working with computers. ... Computers and computer languages give us a formal metaphor, within which we can model the processes and test the implications of our theories.

CL is a part of the science of language that uses computers as investigative tools; NLP is part of the science of computation whose subject matter is computer systems that process human language. There is crossover and blurring of these definitions in practice, but they capture some important generalisations.

Gazdar's definition of ``Applied NLP'' is close to that of LE, a subject which

...involves the construction of intelligent computational artifacts that process natural languages in ways that are useful to people other than computational linguists. The test of utility here is essentially that of the market. Examples include machine translation packages, programs that convert numerical data or sequences of error codes into coherent text or speech, systems that map text messages into symbolic or numeric data, and natural language interfaces to databases.

A short definition, in terms of the products of LE, is given by Jacobs:

   The principal defining characteristic of NLE work is its objective: to engineer products which deal with natural language and which satisfy the constraints in which they have to operate. This definition may seem tautologous or a statement of the obvious to an engineer practising in another, well established area (e.g. mechanical or civil engineering), but is still a useful reminder to practitioners of software engineering, and it becomes near-revolutionary when applied to natural language processing. This is partly because of what, in our opinion, has been the ethos of most Computational Linguistics research. Such research has concentrated on studying natural languages, just as traditional Linguistics does, but using computers as a tool to model (and, sometimes, verify or falsify) fragments of linguistic theories deemed of particular interest. This is of course a perfectly respectable and useful scientific endeavour, but does not necessarily (or even often) lead to working systems for the general public.

To summarise:

Language Engineering is the discipline or act of engineering software systems that perform tasks involving processing human language. Both the construction process and its outputs are measurable and predictable. The literature of the field relates to both application of relevant scientific results and a body of practice.

After clarifying those concepts, we should see whether the Information Society has any relation to human language. Hans Uszkoreit gives the answer at http://www.coli.uni-sb.de/~hansu/ (2000):

The rapid growth of the Internet/WWW and the emergence of the information society poses exciting new challenges to language technology.  Although the new media combine text, graphics, sound and movies, the whole world of multimedia information can only be structured, indexed and navigated through language. For browsing, navigating, filtering and processing the information on the web, we need software that can get at the contents of documents. Language technology for content management is a necessary precondition for turning the wealth of digital information into collective knowledge. The increasing multilinguality of the web constitutes an additional challenge for our discipline. The global web can only be mastered with the help of multilingual tools for indexing and navigating. Systems for crosslingual information and knowledge management will surmount language barriers for e-commerce, education and international cooperation.

And is there any concern in Europe about Human Language Technologies? This is what we find at http://www.statskontoret.se/gol-democracy/eu.htm, a page about the European Commission's initiatives.

A concern for the citizen, as well as the democratic values lying at the core of the European concept of information society, have been focused so far in a number of programmes, projects and actions.

Some of these initiatives, such as Interchange of Data between Administrations (IDA) and the Telematics Applications Programme are directly relevant to the principles of widespread access to electronic information, and in particular to public sector information, as a means of citizen participation in civil and political life. The European Commission itself has a long tradition of information dissemination, which is now looking toward the possibilities offered by the new communication and language technologies.

The Commission is also taking actions at the regulatory level, in order to facilitate the creation at large of a European information market which would go hand in hand with citizens' basic democratic rights. A green paper on "Access to and exploitation of public sector information in the information society" is being developed by the Commission and should be available in 1998.

The Bangemann group of experts identified the need for a truly interoperable trans-European public administrations network as a top priority for on-line access and exchange of public information. To achieve this, the group recommended strengthening and accelerating the IDA program, with the aim of providing the structural support for faster circulation of information between member state administrations and European institutions.

The Telematics Applications Programme (1994-1998), part of the 4th Framework Programme, is further promoting "the competitiveness of European industry and the efficiency of services of public interest" through the development of new telematics systems and services in different areas of life, such as environment, health, transport, education, etc. The Telematics for Administrations sector, in particular, supports applications fostering electronic access by the citizen to regulatory information, as well as intelligent online exchange of information and knowledge between citizens, administrations and decision makers. Moreover, it helps public administrations in their role of providing access across Europe and the wider international community to the cultural heritage of the member states.

The push for economic growth and social integration, coupled with the wish to eliminate factors tending to exclude people from the information society, lead to demands for 'language-enhanced' products and services to be made available both to professional users and to the general public. In response to this need, the Language Engineering Sector of the Telematics Applications Programme focuses on integrating language technologies into information and communications products and services, thus improving their ease of use and functionality.

Recognising that effective stimulation and co-ordination at the European level would help to ensure the availability of multilingual facilities for rapid, cost-effective on-line communication throughout the EU and with the rest of the world, the Multilingual Information Society Program (1996-1998) was launched recently with the aims of supporting the creation of a framework of services for European language resources, encouraging the use of language technologies and promoting the use of advanced language tools across the EU. The MLIS decision emphasises the need for exploiting the synergies with other initiatives whether they have an impact on multi-lingualism, are public or private, national or EU-wide.

The special concern for citizens has been expressed in a number of initiatives and projects, which are, or can be, related to public sector services. In particular, projects which approach information and knowledge concerns at a language processing level are providing innovative prototypes for user trials.

The current situation of HLTCentral.org is the following. The information is based on http://www.hltcentral.org/htmlengine.shtml?id=615, a page updated on 27/10/03.

The HLTCentral web site was established as an online information resource on human language technologies and related topics of interest to the HLT community at large. It covers news, R&D, technological and business developments in the field of speech, language, multilinguality, automatic translation, localisation and related areas. Its coverage of HLT news and developments is worldwide, with a unique European perspective. Two EU-funded projects, ELSNET and EUROMAP, are behind the development of HLTCentral.

Now we are going to focus on Language Engineering. The following are the main techniques used in Language Engineering; they are described on the HLT Central page, http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#t.

1- Speaker Identification and Verification

A human voice is as unique to an individual as a fingerprint. This makes it possible to identify a speaker and to use this identification as the basis for verifying that the individual is entitled to access a service or a resource. The types of problems which have to be overcome are, for example, recognising that the speech is not recorded, selecting the voice through noise (either in the environment or the transfer medium), and identifying reliably despite temporary changes (such as caused by illness).

2- Speech Recognition

The sound of speech is received by a computer in analogue wave forms which are analysed to identify the units of sound (called phonemes) which make up words. Statistical models of phonemes and words are used to recognise discrete or continuous speech input. The production of quality statistical models requires extensive training samples (corpora) and vast quantities of speech have been collected, and continue to be collected, for this purpose.

There are a number of significant problems to be overcome if speech is to become a commonly used medium for dealing with a computer. The first of these is the ability to recognise continuous speech rather than speech which is deliberately delivered by the speaker as a series of discrete words separated by a pause. The next is to recognise any speaker, avoiding the need to train the system to recognise the speech of a particular individual. There is also the serious problem of the noise which can interfere with recognition, either from the environment in which the speaker uses the system or through noise introduced by the transmission medium, the telephone line, for example. Noise reduction, signal enhancement and key word spotting can be used to allow accurate and robust recognition in noisy environments or over telecommunication networks. Finally, there is the problem of dealing with accents, dialects, and language spoken, as it often is, ungrammatically.

3- Character and Document Image Recognition

Recognition of written or printed language requires that a symbolic representation of the language is derived from its spatial form of graphical marks. For most languages this means recognising and transforming characters. There are two cases of character recognition:

- recognition of printed images, referred to as Optical Character Recognition (OCR)

- recognition of handwriting, usually known as Intelligent Character Recognition (ICR)

OCR from a single printed font family can achieve a very high degree of accuracy. Problems arise when the font is unknown or very decorative, or when the quality of the print is poor. In these difficult cases, and in the case of handwriting, good results can only be achieved by using ICR. This involves word recognition techniques which use language models, such as lexicons or statistical information about word sequences.
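
As a crude stand-in for the word recognition techniques mentioned above, here is a minimal Python sketch (not from the source) that maps a noisy character-recognition output to the closest entry in a small, hypothetical lexicon:

```python
import difflib

# Hypothetical mini-lexicon; a real ICR system would use a large dictionary
# plus statistical information about word sequences.
LEXICON = ["language", "engineering", "recognition", "handwriting"]

def correct(ocr_word):
    """Map a noisy character-recognition output to its closest lexicon entry."""
    matches = difflib.get_close_matches(ocr_word.lower(), LEXICON, n=1, cutoff=0.6)
    return matches[0] if matches else ocr_word  # leave unknown words unchanged

print(correct("recogniti0n"))  # -> recognition
```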

Document image analysis is closely associated with character recognition but involves the analysis of the document to determine firstly its make-up in terms of graphics, photographs, separating lines and text, and then the structure of the text to identify headings, sub-headings, captions etc. in order to be able to process the text effectively.

4-Natural Language Understanding

The understanding of language is obviously fundamental to many applications. However, perfect understanding is not always a requirement. In fact, gaining a partial understanding is often a very useful preliminary step in the process because it makes it possible to be intelligently selective about taking the depth of understanding to further levels.

Shallow or partial analysis of texts is used to obtain a robust initial classification of unrestricted texts efficiently. This initial analysis can then be used, for example, to focus on 'interesting' parts of a text for a deeper semantic analysis which determines the content of the text within a limited domain. It can also be used, in conjunction with statistical and linguistic knowledge, to identify linguistic features of unknown words automatically, which can then be added to the system's knowledge.
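
A minimal sketch of such shallow analysis, using NLTK's regular-expression chunker (an assumption: the source names no particular tool, and this requires NLTK plus its tokenizer and tagger data to be installed):

```python
import nltk  # assumes nltk plus its 'punkt' and perceptron-tagger data

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
tagged = nltk.pos_tag(tokens)  # part-of-speech tags, e.g. ('fox', 'NN')

# Toy noun-phrase pattern: optional determiner, any adjectives, then a noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
print(chunker.parse(tagged))   # flat tree with NP chunks, no deep structure
```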

Semantic models are used to represent the meaning of language in terms of concepts and relationships between them. A semantic model can be used, for example, to map an information request to an underlying meaning which is independent of the actual terminology or language in which the query was expressed. This supports multi-lingual access to information without a need to be familiar with the actual terminology or structuring used to index the information.

Combinations of analysis and generation with a semantic model allow texts to be translated. At the current stage of development, applications where this can be achieved need to be limited in vocabulary and concepts so that adequate Language Engineering resources can be applied. Templates for document structure, as well as common phrases with variable parts, can be used to aid generation of a high quality text.

5-Natural Language Generation

A semantic representation of a text can be used as the basis for generating language. An interpretation of basic data or the underlying meaning of a sentence or phrase can be mapped into a surface string in a selected fashion; either in a chosen language or according to stylistic specifications by a text planning system.

6-Speech Generation

Speech is generated from filled templates, by playing 'canned' recordings or by concatenating units of speech (phonemes, words) together. Generated speech has to account for aspects such as intensity, duration and stress in order to produce a continuous and natural response.

Dialogue can be established by combining speech recognition with simple generation, either from concatenation of stored human speech components or synthesising speech using rules.

Providing a library of speech recognisers and generators, together with a graphical tool for structuring their application, allows someone who is neither a speech expert nor a computer programmer to design a structured dialogue which can be used, for example, in automated handling of telephone calls.

There are some language resources which are essential components of Language Engineering. The HLT Central page lists them at http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#t.

They are one of the main ways of representing the knowledge of language, which is used for the analytical work leading to recognition and understanding.

The work of producing and maintaining language resources is a huge task. Resources are produced, according to standard formats and protocols to enable access, in many EU languages, by research laboratories and public institutions. Many of these resources are being made available through the European Language Resources Association (ELRA).

1- Lexicons

A lexicon is a repository of words and knowledge about those words. This knowledge may include details of the grammatical structure of each word (morphology), the sound structure (phonology), the meaning of the word in different textual contexts, e.g. depending on the word or punctuation mark before or after it. A useful lexicon may have hundreds of thousands of entries. Lexicons are needed for every language of application.
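
As a toy illustration of such a repository, the Python sketch below stores morphological, phonological and sense information per entry; the field names and structure are assumptions for illustration only:

```python
# A toy in-memory lexicon; real lexicons hold hundreds of thousands of entries.
lexicon = {
    "record": [
        {"pos": "noun", "phonology": "'re-cord", "plural": "records",
         "senses": ["a stored account of facts"]},
        {"pos": "verb", "phonology": "re-'cord", "past": "recorded",
         "senses": ["to capture sound or data"]},
    ],
}

def lookup(word):
    """Return all entries for a word, or an empty list if it is unknown."""
    return lexicon.get(word.lower(), [])

for entry in lookup("record"):
    print(entry["pos"], entry["phonology"])
```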

2- Specialist Lexicons

There are a number of special cases which are usually researched and produced separately from general purpose lexicons:

2.1- Proper names: Dictionaries of proper names are essential to effective understanding of language, at least so that they can be recognised within their context as places, objects, persons, or maybe animals. They take on a special significance in many applications, however, where the name is key to the application, such as in a voice operated navigation system, a holiday reservations system, or a railway timetable information system based on automated telephone call handling.

2.2- Terminology: In today's complex technological environment there are a host of terminologies which need to be recorded, structured and made available for language enhanced applications. Many of the most cost-effective applications of Language Engineering, such as multi-lingual technical document management and machine translation, depend on the availability of the appropriate terminology banks.

2.3- Wordnets: A wordnet describes the relationships between words; for example, synonyms, antonyms, collective nouns, and so on. These can be invaluable in such applications as information retrieval, translator workbenches and intelligent office automation facilities for authoring.
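
For example, the English WordNet can be queried through NLTK (an assumption: the source mentions wordnets generically, not this library; the 'wordnet' data must be downloaded first):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

for synset in wn.synsets("good", pos=wn.ADJ)[:3]:
    print(synset.name(), "-", synset.definition())
    for lemma in synset.lemmas():          # antonym links live on lemmas
        for ant in lemma.antonyms():
            print("  antonym:", lemma.name(), "<->", ant.name())
```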

3- Grammars

A grammar describes the structure of a language at different levels: word (morphological grammar), phrase, sentence, etc. A grammar can deal with structure both in terms of surface (syntax) and meaning (semantics and discourse).
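
A tiny sketch of a surface (syntactic) grammar and parser, using NLTK's context-free grammar support; the grammar covers a single sentence pattern and is purely illustrative:

```python
import nltk

# A toy context-free grammar: one sentence pattern, a handful of words.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N  -> 'dog' | 'ball'
    V  -> 'chases'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chases a ball".split()):
    tree.pretty_print()  # prints the phrase-structure tree
```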

4- Corpora

A corpus is a body of language, either text or speech, which provides the basis for tasks such as analysing the characteristics of a language and training statistical models.

There are national corpora of hundreds of millions of words but there are also corpora which are constructed for particular purposes. For example, a corpus could comprise recordings of car drivers speaking to a simulation of a control system, which recognises spoken commands, which is then used to help establish the user requirements for a voice operated control system for the market.

Now we are going to look at some further terms. The information is found at http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#t:

Domain

The term is usually applied to the area of application of the language-enabled software, e.g. banking, insurance, travel, etc.; the significance in Language Engineering is that the vocabulary of an application is restricted, so the language resource requirements are effectively limited by limiting the domain of application.

Translator's workbench

It is a software system providing a working environment for a human translator, which offers a range of aids such as on-line dictionaries, thesauri, translation memories, etc.

Shallow parser

It is software which parses language to a point where a rudimentary level of understanding can be realised; this is often used to identify passages of text which can then be analysed in further depth to fulfil a particular objective.

Speech Recognition

(See the description under Speech Recognition in the list of Language Engineering techniques above.)

Formalism

A means of representing the rules used in establishing a model of linguistic knowledge.

Authoring tools

They are facilities provided in conjunction with word processing to aid the author of documents, typically including an on-line dictionary and thesaurus, spell-, grammar-, and style-checking, and facilities for structuring, integrating and linking documents.

We have seen what speech technology is, but what is the state of the art in speech technology? Victor Zue, Ron Cole and Wayne Ward have the answer at http://cslu.cse.ogi.edu/HLTsurvey/ch1node4.html, a page dated 1996.


Comments about the state-of-the-art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.

Performance of speech recognition systems is typically described in terms of word error rate, E, defined as

E = 100% x (S + I + D) / N

where N is the total number of words in the test set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively.
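
The S, I and D counts come from a minimum-edit-distance alignment between the reference transcription and the recogniser output. A minimal Python sketch (illustrative only, not from the source):

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + I + D) / N, computed via a minimum-edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = fewest edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[-1][-1] / len(ref)

print(word_error_rate("she had your dark suit", "she had dark suit on"))  # 0.4
```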

 

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the hidden Markov model (HMM). The HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.

 

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

 

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 respectively).

 

Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware---a feat unimaginable only a few years ago.

 

One of the most popular, and potentially most useful, tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

 

One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific ocean. The best speaker-independent word error rate on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.
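
Perplexity measures how many words, on average, the language model considers possible at each point; lower perplexity makes recognition easier. A minimal sketch of bigram perplexity with add-one smoothing (the smoothing choice is an assumption; deployed systems use more refined methods):

```python
import math
from collections import Counter

def bigram_perplexity(train, test, vocab_size):
    """Perplexity of an add-one-smoothed bigram model over a test sequence."""
    bigrams = Counter(zip(train, train[1:]))
    unigrams = Counter(train)
    log_prob, n = 0.0, 0
    for prev, word in zip(test, test[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)  # 2 to the average negative log2 probability
```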

 

With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10--20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.

 

At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain such as dictating medical reports.

 

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited vocabulary, speaker-independent continuous dictation capability is realized.

 

We must distinguish between speech recognition and speech synthesis. From the comp.speech Frequently Asked Questions WWW site, http://www.speech.cs.cmu.edu/comp.speech/ (last revised 05/09/97), we get this information.

 

Speech synthesis programs convert written input to spoken output by automatically generating synthetic speech. Speech synthesis is often referred to as "Text-to-Speech" conversion (TTS).
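
A minimal text-to-speech call, using the pyttsx3 package as one example of an off-the-shelf synthesiser (an assumption: the FAQ names no particular package):

```python
import pyttsx3  # wraps the platform's default speech engine

engine = pyttsx3.init()
engine.say("Hello, this is synthetic speech.")
engine.runAndWait()  # blocks until the utterance has been spoken
```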

 

There are several algorithms. The choice depends on the task they're used for. The easiest way is to just record the voice of a person speaking the desired phrases. This is useful if only a restricted volume of phrases and sentences is used, e.g. messages in a train station, or schedule information via phone. The quality depends on the way recording is done.

 

More sophisticated, but worse in quality, are algorithms which split the speech into smaller pieces. The smaller those units are, the fewer of them are needed, but the quality also decreases. An often used unit is the phoneme, the smallest linguistic unit. Depending on the language, there are about 35-50 phonemes in western European languages, i.e. there are 35-50 single recordings. The problem is combining them, as fluent speech requires fluent transitions between the elements. The intelligibility is therefore lower, but the memory required is small.

 

A solution to this dilemma is using diphones. Instead of splitting at the transitions, the cut is done at the center of the phonemes, leaving the transitions themselves intact. This gives about 400 elements (20*20) and the quality increases.

 

The longer the units become, the more elements there are, but the quality increases along with the memory required. Other units which are widely used are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings.
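
A toy sketch of concatenative synthesis with a hypothetical unit inventory (random arrays stand in for recorded waveforms); a short crossfade smooths the joins between units:

```python
import numpy as np

# Hypothetical inventory: unit name -> recorded waveform (placeholder audio).
units = {
    "h-e": np.random.randn(800),
    "e-l": np.random.randn(800),
    "l-o": np.random.randn(900),
}

def concatenate(names, crossfade=80):
    """Join recorded units, crossfading a few samples at each transition."""
    out = units[names[0]].copy()
    fade = np.linspace(0.0, 1.0, crossfade)
    for name in names[1:]:
        nxt = units[name]
        out[-crossfade:] = out[-crossfade:] * (1 - fade) + nxt[:crossfade] * fade
        out = np.concatenate([out, nxt[crossfade:]])
    return out

waveform = concatenate(["h-e", "e-l", "l-o"])
```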

 

Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.

 

A wide variety of techniques are used to perform speech recognition, and there are many types and levels of speech recognition, analysis and understanding.

 

Typically speech recognition starts with the digital sampling of speech. The next stage is acoustic signal processing. Most techniques include spectral analysis; e.g. LPC analysis (Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients), cochlea modelling and many more.
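
As an example of this front end, MFCCs can be computed with the librosa library (an assumption: the FAQ lists MFCC as a technique but names no implementation; "utterance.wav" is a placeholder file):

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)     # digital sampling
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # acoustic signal processing
print(mfcc.shape)  # (13 coefficients, number of analysis frames)
```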

 

The next stage is recognition of phonemes, groups of phonemes and words. This stage can be achieved by many processes such as DTW (Dynamic Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), expert systems and combinations of techniques. HMM-based systems are currently the most commonly used and most successful approach.
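
Of these, DTW is simple enough to sketch in full: it aligns two feature sequences of different lengths (for example, MFCC frames of a stored word template and of the input), and the template with the smallest alignment cost wins. A minimal, illustrative implementation:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two feature sequences (rows = frames)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # stretch sequence a
                                 cost[i, j - 1],      # stretch sequence b
                                 cost[i - 1, j - 1])  # step both
    return cost[n, m]
```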

 

Most systems utilise some knowledge of the language to aid the recognition process.

 

Some systems try to "understand" speech. That is, they try to convert the words into a representation of what the speaker intended to mean or achieve by what they said.

 

In the following outline, built with the information found at http://www.cs.sunysb.edu/~tony/392/speech/speech.html, we can see the main issues and problems of speech recognition and synthesis:

       

Speech Synthesis

• Written text transformed into speech

- text-to-speech

• Two types of synthesiser

- parameterised
- concatenative

• Parameterised

- formant based - use rules based on signal from spoken input
- articulatory - use model of vocal tract

• Parameterised is more like musical instrument synthesis

• Concatenative - word

- just record all the words you need
- good for small sets

• Concatenative - phoneme

- phoneme - smallest unit of speech that differentiates one word from another
- makes more natural sounding speech

• Concatenative is more like sound sampling

[Figure: BALDI uses synthesized speech to teach deaf children how to form words.]

Speech Synthesis Problems

• Understandability
• Words not in dictionary
• Prosody

- stress, pitch, intonation

 

Speech Recognition Issues

• Continuous versus Discrete recognition

• Discrete

- improves accuracy
- reduce computation

• Continuous

- hard to do
- natural / fast
- Itisverysimilartotryingtoreadtextwithallofthespacesremoved

• Speaker dependent versus independent

• Dependent

- requires training - takes time
- can get good recognition rates

• Independent

- great for ‘walk up and use’ systems
- lower recognition rates in general

• Vocabulary size

• Smaller the size the higher the recognition rates

- 10 - phone digits
- 100 - menu selection
- 10k - 60k - general dictation, email

• Current desktop SR can get around 88% on large vocab

• Accuracy

• What is an error?

- out of vocabulary
- recognition failure
- mis-recognition
- insertion / deletion / substitution

• Hard to tell mis-recognition

 

 

Recognition Errors

• User spoke at the wrong time
• Sentence not in grammar
• User paused too long
• Words sound alike
• Word out of vocab
• User has a cold
• Over-emphasis

 

Finally, here we have three projects that deal with speech-to-speech machine translation. On the following page you can find several projects on the subject: http://www-i6.informatik.rwth-aachen.de/HTML/Forschung/Projects_frame.html, a page last modified on 02/11/01.

1- LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Technologies)


The objective of LC-STAR is to improve human-to-human and man-machine communication in multilingual environments. The project aims to create the lexica and corpora needed for speech-to-speech translation. Within LC-STAR, quasi-industrial standards for those language resources will be established, lexica for 12 languages and text corpora for 3 languages will be created, and a speech-to-speech translation demonstrator for the three languages English, Spanish and Catalan will be developed. The Lehrstuhl für Informatik VI will focus on the investigation of speech-centered translation technologies, focusing on requirements concerning language resources and the creation of lexica for speech recognition in German.
LC-STAR is supported by the European Union. Project partners are Siemens AG (Germany), IBM Deutschland Entwicklung GmbH (Germany), Universitat Politecnica de Catalunya (Spain), NSC - Natural Speech Communication Ltd (Israel), and Nokia Corporation (Finland).

2-VERBMOBIL II


VERBMOBIL is a speaker-independent and bidirectional speech-to-speech translation system for spontaneous dialogues in mobile situations. It recognizes spoken input, analyses and translates it, and finally utters the translation. The multi-lingual system handles dialogues in three business-oriented domains, with context-sensitive translation between three languages (German, English, and Japanese).
Within the BMBF-funded project, the Lehrstuhl für Informatik VI performed research on both speech recognition and translation. For both tasks, statistical methods were used, and self-contained software modules were developed and integrated into the final prototype system. For the speech recognition part we developed efficient search algorithms which operate in real time. In the end-to-end evaluation, the statistical machine translation significantly outperformed competing translation approaches such as classical transfer-based translation or example-based translation.

 

3-EU-Project EuTrans


Machine translation has been receiving considerable attention for a long time because of its great industrial and social interest. The focus of the EUTRANS project was the development and evaluation of example-based translation techniques for text and speech input. Our institute contributed acoustic models for the recognition of Italian telephone speech and analyzed different statistical translation techniques. EUTRANS was supported by the European Union ESPRIT LTR (Long Term Research) programme. Project partners were the Universidad Politécnica de Valencia (Spain), Fondazione Ugo Bordoni (Italy), Zeres GmbH (Bochum), and the Lehrstuhl für Informatik VI.

 

 

4. CONCLUSION

 

The development of new technologies is enormous; a hundred years ago nobody would have believed machines would go so far. Society seems to move more slowly than these technologies, but the new discoveries are accepted. However, machines always need the presence or help of humans. In this report we have seen how new technologies are applied to human language and how important the application of Language Engineering is. We have also seen that machines, like humans, are not perfect. New discoveries are very useful and their purpose is to make life easier. Nevertheless, they have some weak points, and some things cannot be done by machines. They cannot replace a person, at least for the moment, but looking at the latest projects and advances we can never know where the boundary - if there is any - that machines cannot cross lies.

 

5. REFERENCES

 

http://www.dcs.shef.ac.uk/~hamish/LeIntro.html#sec:defns (1999)

http://www.coli.uni-sb.de/~hansu/ (2000)

http://sirio.deusto.es/abaitua/konzeptu/nlp/Browne_M.html (1998)

http://www.hltcentral.org/htmlengine.shtml?id=615 (2003)

http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#t

http://cslu.cse.ogi.edu/HLTsurvey/ch1node4.html (1996)

http://www.speech.cs.cmu.edu/comp.speech/ (1997)

http://www.cs.sunysb.edu/~tony/392/speech/speech.html 

http://www-i6.informatik.rwth-aachen.de/HTML/Forschung/Projects_frame.html (2001)

http://cordis.lu/ist/ka3/hlt/htm 

http://www.statskontoret.se/gol-democracy/eu.htm#top

 

 

 

IERA ZINKUNEGI