A review of Human Language Technologies and their role in the Information Society

by

Alejandro Otaola Rojo

 

 

 

Abstract

This report is an assignment for the subject "English Language and New Technologies", included in the second year of the English Philology degree at the University of Deusto (Universidad de Deusto) in Bilbao, Spain. The subject is taught by Professor Joseba K. Abaitua Odriozola and deals with the uses and roles of Human Language Technologies today, mainly in the field of the Information Society.

Drawing on the knowledge acquired in class, we have been asked to write a report of this kind, related to Human Language Technologies or any other topic we have dealt with in class. I have tried to compile information in a way that makes this report a useful tool for anyone who may need it.

 

0. Introduction

As explained in the abstract, this report tries to give a general view of some aspects of Human Language Technologies in the field of the Information Society. For that purpose, I will use all the information we have been given in class, but mainly the questionnaires we have been working on.

Those questionnaires consist of questions related to the specific topics dealt with during the course. Their purpose is to make us look for information on the internet in order to answer the questions posed. As I base my report on those questionnaires, it contains a large number of quotations related to its topic. These quotations serve to provide information about the roles of Human Language Technologies in the Information Society.

Other topics this report deals with include Natural Language Processing, Computational Linguistics, Speech Technology, Language Engineering and Machine Translation, among others. These were developed in class by consulting sources of different kinds, in an effort to understand each topic and the features of it that matter most to us.

Machine Translation receives particular attention because of its close relation to the degree I am studying at university (English Philology). We will learn about some important aspects of Machine Translation, such as the main problems of applying it, along with other related features.

As all these topics are developed with the help of quotations taken from the internet, I provide readers with a list of all the references used to write this report, including the name of the author, group or institution, the date, and the publisher's name or URL.

 

1. Describe the different senses and usages of the terms:

   1.1. Human Language Technologies

The overall objective of HLT is to support e-business in a global context and to promote a human centred infostructure ensuring equal access and usage opportunities for all. This is to be achieved by developing multilingual technologies and demonstrating exemplary applications providing features and functions that are critical for the realisation of a truly user friendly Information Society. Projects address generic and applied RTD from a multi- and cross-lingual perspective, and undertake to demonstrate how language specific solutions can be transferred to and adapted for other languages.

While elements of the three initial HLT action lines - Multilinguality, Natural Interactivity and Crosslingual Information Management - are still present, there has been periodic re-assessment and tuning of them to emerging trends and changes in the surrounding economic, social, and technological environment. The trials and best practice in multilingual e-service and e-commerce action line was introduced in the IST 2000 work programme (IST2000) to stimulate new forms of partnership between technology providers, system integrators and users through trials and best practice actions addressing end-to-end multi-language platforms and solutions for e-service and e-commerce.

(HLTCentral)

http://www.hltcentral.org/htmlengine.shtml?id=169

   1.2. Natural Language Processing

Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems inherent in the processing and manipulation of natural language, but not, generally, natural language understanding.

Some problems which make NLP difficult:

Word boundary detection
In spoken language, there are no gaps between words; where to place the word boundary often depends on what choice makes the most sense grammatically and given the context. In written form, languages like Chinese do not have word boundaries either.
Word sense disambiguation
Any given word can have several different meanings; we have to select the meaning which makes the most sense in context.
Syntactic ambiguity
The grammar of natural languages is ambiguous, i.e. there are often multiple possible parse trees for a given sentence. Choosing the most appropriate one usually requires semantic and contextual information.
Imperfect or irregular input
Foreign or regional accents and vocal impediments in speech; typing or grammatical errors, OCR errors in texts.
Speech acts and plans
Sentences often don't mean what they literally say; for instance a good answer to "Can you pass the salt" is to pass the salt; in most contexts "Yes" is not a good answer, although "No" is better and "I'm afraid that I can't see it" is better yet. Or again, if a class was not offered last year, "The class was not offered last year" is a better answer to the question "How many students failed the class last year?" than "None" is.
 
(Wikipedia, the free encyclopedia)

http://en.wikipedia.org/wiki/Natural_language_processing
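As an editorial aside, the word-boundary problem listed above can be made concrete with a small sketch. The following Python fragment (the mini-lexicon and sample string are invented for illustration) segments a spaceless string by greedy longest-match against a word list, one simple strategy for scripts like Chinese that are written without spaces:

    # Greedy longest-match segmentation: a toy illustration of word
    # boundary detection for text written without spaces.
    LEXICON = {"ice", "cream", "icecream", "cone"}  # invented word list

    def segment(text, lexicon, max_len=8):
        """Scan left to right, always taking the longest known word."""
        words, i = [], 0
        while i < len(text):
            for j in range(min(len(text), i + max_len), i, -1):
                if text[i:j] in lexicon:
                    words.append(text[i:j])
                    i = j
                    break
            else:                      # no dictionary word found:
                words.append(text[i])  # emit a single character
                i += 1
        return words

    print(segment("icecreamcone", LEXICON))  # ['icecream', 'cone']

The greedy choice of "icecream" over "ice" shows why segmentation needs context: a different string might require the shorter reading.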

   1.3. Computational Linguistics

According to Hans Uszkoreit:

Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition. Computational linguistics has applied and theoretical components.

Applied CL focusses on the practical outcome of modelling human language use. The methods, techniques, tools and applications in this area are often subsumed under the term language engineering or (human) language technology. Although existing CL systems are far from achieving human ability, they have numerous possible applications. The goal is to create software products that have some knowledge of human language. Such products are going to change our lives. They are urgently needed for improving human-machine interaction, since the main obstacle in the interaction between human and computer is a communication problem. Today's computers do not understand our language, but computer languages are difficult to learn and do not correspond to the structure of human thought. Even if the language the machine understands and its domain of discourse are very restricted, the use of human language can increase the acceptance of software and the productivity of its users.

(Hans Uszkoreit)

http://www.coli.uni-sb.de/~hansu/what_is_cl.html

   1.4. Language Engineering

Language Engineering provides ways in which we can extend and improve our use of language to make it a more effective tool. It is based on a vast amount of knowledge about language and the way it works, which has been accumulated through research. It uses language resources, such as electronic dictionaries and grammars, terminology banks and corpora, which have been developed over time. The research tells us what we need to know about language and develops the techniques needed to understand and manipulate it. The resources represent the knowledge base needed to recognise, validate, understand, and manipulate language using the power of computers. By applying this knowledge of language we can develop new ways to help solve problems across the political, social, and economic spectrum.

Language Engineering is a technology which uses our knowledge of language to enhance our application of computer systems.

(HLTCentral)

http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#tiole

 

2. Does the notion of "Information Society" have any relation to Human Language?

The overall objective of HLT is to support e-business in a global context and to promote a human centred infostructure ensuring equal access and usage opportunities for all. This is to be achieved by developing multilingual technologies and demonstrating exemplary applications providing features and functions that are critical for the realisation of a truly user friendly Information Society. Projects address generic and applied RTD from a multi- and cross-lingual perspective, and undertake to demonstrate how language specific solutions can be transferred to and adapted for other languages.

(HLTCentral)

http://www.hltcentral.org/htmlengine.shtml?id=169

 

3. Is there any concern in Europe with Human Language Technologies?

During the seven years from the beginning of 1992 to the end of 1998, the European Union invested approximately 115 million ECU in language engineering through shared cost projects. 70% of this money was allocated during the Fourth Framework Programme. This allocation reflects an increasing recognition of language engineering as an important area of research and technological development.

Projects in the Multilingual Information Society Programme (MLIS) complement activities that support multilinguality, exploit existing experiences and knowledge of multilingual issues and solutions, and mobilise players in both the public and private sectors.

(HLTCentral)

http://www.hltcentral.org/page-218.0.shtml

 

4. What is the current situation of the HLTCentral.org office?

Today many of the initiatives taken through research and technological programmes in Europe are bearing fruit. Recent feature articles in the mainstream business and technology press indicate a change in the market significance of speech and natural language applications. Europe has been prominent in these developments and much of the technology available in other parts of the world is licensed from successful European suppliers.

The success of language engineering research and technological development in Europe is bound to have an impact on our economic future because it can be applied across such a wide range of information systems and services with such significant benefits.

European programmes have certainly helped to raise industrial interest. Products and services are being launched which demonstrate what can be achieved. The need now is for fresh, innovative ideas on future applications in attractive emerging markets.

(HLTCentral)

http://www.hltcentral.org/page-219.0.shtml

 

5. Which are the main techniques used in Language Engineering?

Speaker Identification and Verification

A human voice is as unique to an individual as a fingerprint. This makes it possible to identify a speaker and to use this identification as the basis for verifying that the individual is entitled to access a service or a resource. The types of problem which have to be overcome are, for example, recognising that the speech is not recorded, picking the voice out of noise (either in the environment or the transfer medium), and identifying the speaker reliably despite temporary changes (such as those caused by illness).

 

Speech Recognition

The sound of speech is received by a computer in analogue wave forms which are analysed to identify the units of sound (called phonemes) which make up words. Statistical models of phonemes and words are used to recognise discrete or continuous speech input. The production of quality statistical models requires extensive training samples (corpora) and vast quantities of speech have been collected, and continue to be collected, for this purpose.
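As a worked illustration of these statistical models (standard textbook material, not part of the quoted text), recognition can be framed as choosing the word sequence W that best explains the acoustic input A:

    W* = argmax_W P(W | A) = argmax_W P(A | W) · P(W)

Here the acoustic model P(A | W) is estimated from the phoneme-level training corpora just described, and the language model P(W) from text corpora.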

There are a number of significant problems to be overcome if speech is to become a commonly used medium for dealing with a computer. The first of these is the ability to recognise continuous speech rather than speech which is deliberately delivered by the speaker as a series of discrete words separated by a pause. The next is to recognise any speaker, avoiding the need to train the system to recognise the speech of a particular individual. There is also the serious problem of the noise which can interfere with recognition, either from the environment in which the speaker uses the system or through noise introduced by the transmission medium, the telephone line, for example. Noise reduction, signal enhancement and key word spotting can be used to allow accurate and robust recognition in noisy environments or over telecommunication networks. Finally, there is the problem of dealing with accents, dialects, and language spoken, as it often is, ungrammatically.

 

Character and Document Image Recognition

Recognition of written or printed language requires that a symbolic representation of the language is derived from its spatial form of graphical marks. For most languages this means recognising and transforming characters. There are two cases of character recognition:

OCR from a single printed font family can achieve a very high degree of accuracy. Problems arise when the font is unknown or very decorative, or when the quality of the print is poor. In these difficult cases, and in the case of handwriting, good results can only be achieved by using ICR. This involves word recognition techniques which use language models, such as lexicons or statistical information about word sequences.
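As an editorial sketch of the lexicon-based word recognition described above (the word list and the garbled inputs are invented), a noisy character-level hypothesis can be matched against a lexicon by string similarity:

    from difflib import get_close_matches

    # Invented mini-lexicon; a real ICR system would use a large word
    # list plus statistical information about word sequences.
    LEXICON = ["language", "engineering", "translation", "recognition"]

    def correct(ocr_token, lexicon):
        """Return the dictionary word closest to a noisy OCR hypothesis."""
        matches = get_close_matches(ocr_token.lower(), lexicon, n=1, cutoff=0.6)
        return matches[0] if matches else ocr_token

    print(correct("lan9uage", LEXICON))     # language
    print(correct("recogn1tion", LEXICON))  # recognition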

Document image analysis is closely associated with character recognition but involves the analysis of the document to determine firstly its make-up in terms of graphics, photographs, separating lines and text, and then the structure of the text to identify headings, sub-headings, captions etc. in order to be able to process the text effectively.

 

Natural Language Understanding

The understanding of language is obviously fundamental to many applications. However, perfect understanding is not always a requirement. In fact, gaining a partial understanding is often a very useful preliminary step in the process because it makes it possible to be intelligently selective about taking the depth of understanding to further levels.

Shallow or partial analysis of texts is used to obtain a robust initial classification of unrestricted texts efficiently. This initial analysis can then be used, for example, to focus on 'interesting' parts of a text for a deeper semantic analysis which determines the content of the text within a limited domain. It can also be used, in conjunction with statistical and linguistic knowledge, to identify linguistic features of unknown words automatically, which can then be added to the system's knowledge.

Semantic models are used to represent the meaning of language in terms of concepts and relationships between them. A semantic model can be used, for example, to map an information request to an underlying meaning which is independent of the actual terminology or language in which the query was expressed. This supports multi-lingual access to information without a need to be familiar with the actual terminology or structuring used to index the information.
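As an editorial illustration of such a semantic model (the concept identifiers and vocabulary are invented), surface terms from different languages can be mapped to shared, language-independent concept identifiers:

    # Toy semantic model: terms in several languages map to the same
    # abstract concept, so a query can be matched to indexed material
    # regardless of the language it was phrased in.
    CONCEPTS = {
        "car": "VEHICLE_01", "automobile": "VEHICLE_01",
        "coche": "VEHICLE_01",   # Spanish
        "timetable": "SCHEDULE_01", "horario": "SCHEDULE_01",
    }

    def concepts_in(query):
        return {CONCEPTS[w] for w in query.lower().split() if w in CONCEPTS}

    print(concepts_in("coche timetable"))  # {'VEHICLE_01', 'SCHEDULE_01'}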

Combinations of analysis and generation with a semantic model allow texts to be translated. At the current stage of development, applications where this can be achieved need to be limited in vocabulary and concepts so that adequate Language Engineering resources can be applied. Templates for document structure, as well as common phrases with variable parts, can be used to aid generation of a high-quality text.

 

Natural Language Generation

A semantic representation of a text can be used as the basis for generating language. An interpretation of basic data, or the underlying meaning of a sentence or phrase, can be mapped into a surface string in a selected fashion: either in a chosen language or according to stylistic specifications set by a text planning system.

 

Speech Generation

Speech is generated from filled templates, by playing 'canned' recordings, or by concatenating units of speech (phonemes, words) together. Generated speech has to account for aspects such as intensity, duration and stress in order to produce a continuous and natural response.

Dialogue can be established by combining speech recognition with simple generation, either from concatenation of stored human speech components or synthesising speech using rules.

Providing a library of speech recognisers and generators, together with a graphical tool for structuring their application, allows someone who is neither a speech expert nor a computer programmer to design a structured dialogue which can be used, for example, in automated handling of telephone calls.

(HLTCentral)

http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#t

 

6. Which language resources are essential components of Language Engineering?

Language resources are essential components of Language Engineering. They are one of the main ways of representing the knowledge of language, which is used for the analytical work leading to recognition and understanding.

The work of producing and maintaining language resources is a huge task. Resources are produced, according to standard formats and protocols to enable access, in many EU languages, by research laboratories and public institutions. Many of these resources are being made available through the European Language Resources Association (ELRA).

 

Lexicons

A lexicon is a repository of words and knowledge about those words. This knowledge may include details of the grammatical structure of each word (morphology), its sound structure (phonology), and the meaning of the word in different textual contexts, e.g. depending on the word or punctuation mark before or after it. A useful lexicon may have hundreds of thousands of entries. Lexicons are needed for every language of application.
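As an editorial sketch (all field names and values are invented), a single lexicon entry might bundle exactly the kinds of knowledge listed above:

    # Hypothetical structure for one lexicon entry: morphology,
    # phonology, and context-dependent senses.
    entry = {
        "lemma": "bank",
        "pos": "noun",
        "morphology": {"plural": "banks"},
        "phonology": "/bæŋk/",
        "senses": [
            {"gloss": "financial institution", "clues": ["money", "account"]},
            {"gloss": "edge of a river",       "clues": ["river", "water"]},
        ],
    }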

 

Specialist Lexicons

There are a number of special cases which are usually researched and produced separately from general purpose lexicons:

Proper names: Dictionaries of proper names are essential to effective understanding of language, at least so that they can be recognised within their context as places, objects, persons, or maybe animals. They take on a special significance in many applications, however, where the name is key to the application, such as in a voice-operated navigation system, a holiday reservation system, or a railway timetable information system based on automated telephone call handling.

Terminology: In today's complex technological environment there are a host of terminologies which need to be recorded, structured and made available for language enhanced applications. Many of the most cost-effective applications of Language Engineering, such as multi-lingual technical document management and machine translation, depend on the availability of the appropriate terminology banks.

Wordnets: A wordnet describes the relationships between words; for example, synonyms, antonyms, collective nouns, and so on. These can be invaluable in such applications as information retrieval, translator workbenches and intelligent office automation facilities for authoring.
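As an editorial sketch of the relations a wordnet records (the entries are invented for illustration):

    # Minimal wordnet-style relation table: each word points to sets
    # of related words, keyed by relation type.
    WORDNET = {
        "big":   {"synonym": {"large", "great"}, "antonym": {"small"}},
        "small": {"synonym": {"little"},         "antonym": {"big"}},
    }

    def related(word, relation):
        return WORDNET.get(word, {}).get(relation, set())

    print(related("big", "antonym"))  # {'small'}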

 

Grammars

A grammar describes the structure of a language at different levels: word (morphological grammar), phrase, sentence, etc. A grammar can deal with structure both in terms of surface (syntax) and meaning (semantics and discourse).

 

Corpora

A corpus is a body of language, either text or speech, which provides the basis for analysing the characteristics of a language and for training and testing the statistical models described earlier.

There are national corpora of hundreds of millions of words, but there are also corpora constructed for particular purposes. For example, a corpus could comprise recordings of car drivers speaking to a simulation of a control system that recognises spoken commands; such a corpus could then be used to help establish the user requirements for a voice-operated control system for the market.

(HLTCentral)

http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#t
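As a small editorial illustration (the sample text is invented), the most basic use of a corpus is counting frequencies, the raw material of the statistical models mentioned in section 5:

    from collections import Counter

    # A corpus is, at its simplest, a body of text from which
    # statistics such as word frequencies can be estimated.
    corpus = "the system recognises spoken commands the system repeats"
    frequencies = Counter(corpus.split())
    print(frequencies.most_common(2))  # [('the', 2), ('system', 2)]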

 

7. Check for the following terms:

   7.1. Shallow parser

Software which parses language to a point where a rudimentary level of understanding can be realised; this is often used in order to identify passages of text which can then be analysed in further depth to fulfil the particular objective.
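As an editorial sketch of shallow parsing (the tagged sentence and the chunking pattern are invented), noun-phrase chunks can be extracted from part-of-speech tags without building a full parse tree:

    # Shallow (partial) parsing: collect noun-phrase chunks of the
    # form determiner/adjectives + noun over POS-tagged tokens,
    # without attempting a full syntactic analysis.
    tagged = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
              ("jumps", "VERB"), ("over", "ADP"),
              ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]

    def np_chunks(tokens):
        chunks, current = [], []
        for word, tag in tokens:
            if tag in ("DET", "ADJ", "NOUN"):
                current.append(word)
                if tag == "NOUN":          # a noun closes the chunk
                    chunks.append(" ".join(current))
                    current = []
            else:
                current = []               # pattern broken: discard
        return chunks

    print(np_chunks(tagged))  # ['the quick fox', 'the lazy dog']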

   7.2. Translator's workbench

A software system providing a working environment for a human translator, which offers a range of aids such as on-line dictionaries, thesauri, translation memories, etc.

   7.3. Formalism

A means to represent the rules used in the establishment of a model of linguistic knowledge.

   7.4. Speech recognition

The sound of speech is received by a computer in analogue wave forms which are analysed to identify the units of sound (called phonemes) which make up words. Statistical models of phonemes and words are used to recognise discrete or continuous speech input.

   7.5. Text alignment

The process of aligning different language versions of a text in order to be able to identify equivalent terms, phrases, or expressions.

(HLTCentral)

http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#t
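To make text alignment (7.5) concrete, here is a much-simplified editorial sketch inspired by length-based sentence alignment; the sentence pairs are invented, and real aligners such as Gale and Church's use dynamic programming to handle one-to-two and two-to-one matches:

    def align_one_to_one(src_sentences, tgt_sentences, max_ratio=1.8):
        """Pair sentences in order; flag pairs whose lengths diverge,
        which a real aligner would treat as a 1-2 or 2-1 alignment."""
        pairs = []
        for s, t in zip(src_sentences, tgt_sentences):
            ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
            pairs.append((s, t, "ok" if ratio <= max_ratio else "check"))
        return pairs

    src = ["The house is big.", "It has a garden."]
    tgt = ["La casa es grande.", "Tiene un jardín."]
    for s, t, status in align_one_to_one(src, tgt):
        print(status, "|", s, "<->", t)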

 

8. What is the state of the art in Speech Technology?

According to Jennifer Lai, the state-of-the-art in speech technology has progressed to the point where it is now practical for designers to consider integrating speech input and output into their applications. Adding speech to a multimodal application or creating a speech-only interface presents design challenges that are different from those presented in a purely graphical environment.

(Jennifer Lai)

http://www.acm.org/sigmm/mm2003/t1.shtml

 

According to Enrique Vidal, it is worth remembering that most prototypes developed within research projects are currently only capable of processing a few hundred sentences (around 300), on very specific topics (accommodation booking, planning trips, etc.) and for a small group of languages: English, German, Japanese, Spanish, Italian. It seems unlikely that any application will be able to go beyond these boundaries in the near future.

(Enrique Vidal)

http://www.hltcentral.org/page-1086.0.shtml

 

9. Speech-to-speech translation

At present there are only a few speech-to-speech machine translation projects, whether in Europe, the United States or Japan. Nevertheless, speech-to-speech translation is continually increasing in importance, much as cellular telephony and machine translation technologies have. Without a doubt, speech-to-speech machine translation will be commonplace within a few years.

Because oral language is the most spontaneous and natural form of communication among people, speech technology is perceived as a determining factor in achieving better interaction with computers. The industry is aware of this fact and realises that the incorporation of speech technology will be the ultimate step in bringing computers closer to the general public.

The achievements of the EuTrans project reveal two things. The first is that speech-to-speech translation is conditional on the development of speech recognition technology itself. The second is that the models employed in speech recognition based on large corpora have also proved valid for the development of speech translation. This implies that in the future these two technologies could be successfully integrated.

At present, however, speech-to-speech translation systems are not commonplace. In recent years speech recognition techniques have made important strides forward, thanks to the increased availability of the resources needed for their development: large collections of oral texts and more efficient data-oriented processing techniques, such as those designed by the PRHLT group itself. However, the integration of these systems into marketable products is still some way off.

It is worth remembering that most prototypes developed within research projects are currently only capable of processing a few hundred sentences (around 300), on very specific topics (accommodation booking, planning trips, etc.) and for a small group of languages: English, German, Japanese, Spanish, Italian. It seems unlikely that any application will be able to go beyond these boundaries in the near future.

The direct incorporation of speech translation prototypes into industrial applications is at present too costly. However, the growing demand for these products leads us to believe that they will soon be on the market at more affordable prices. The systems developed in projects such as Verbmobil, EuTrans or Janus—despite being at the laboratory phase—contain in practice thoroughly evaluated and robust technologies. A manufacturer considering their integration may join R&D projects and take part in the development of prototypes with the prospect of a fast return on investment. It is quite clear that we are witnessing the emergence of a new technology with great potential for penetrating the telecommunications and microelectronics market in the not too distant future.

Another remarkable aspect of the EuTrans project is its methodological contribution to machine translation as a whole, both in speech and written modes. Although these two modes of communication are very different in essence, and their respective technologies cannot always be compared, speech-to-speech translation has brought prospects of improvement for text translation. Traditional methods for written texts tend to be based on grammatical rules. Therefore, many MT systems show no coverage problem, although this is achieved at the expense of quality. The most common way of improving quality is by restricting the topic of interest. It is widely accepted that broadening of coverage immediately endangers quality. In this sense, learning techniques that enable systems to automatically adapt to new textual typologies, styles, structures, terminological and lexical items could have a radical impact on the technology.

Due to the differences between oral and written communication, rule-based systems prepared for written texts can hardly be re-adapted to oral applications. This is an approach that has been tried, and has failed. By contrast, example-based learning methods designed for speech-to-speech translation systems can easily be adapted to written texts, given the increasing availability of bilingual corpora. One of the main contributions of the PRHLT-ITI group lies precisely in its learning model based on bilingual corpora. Herein lie some interesting prospects for improving written translation techniques.

Effective speech-to-speech translation, along with other voice-oriented technologies, will become available in the coming years, albeit with some limitations, e.g. in the number of languages, linguistic coverage and context. It could be argued that EuTrans' main contribution has been to raise the possibilities of speech-to-speech translation to the level of speech recognition technology, making any new innovation immediately accessible.

(Joseba Abaitua)

http://www.hltcentral.org/page-1086.0.shtml
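To make the example-based approach described above concrete, here is a toy editorial sketch (the bilingual example pairs and the matching method are invented): translation proceeds by retrieving the closest stored example rather than by applying hand-written rules.

    from difflib import SequenceMatcher

    # Tiny bilingual example base; an example-based system retrieves
    # the closest stored source sentence and reuses its translation.
    EXAMPLES = {
        "i would like a single room": "quisiera una habitación individual",
        "i would like a double room": "quisiera una habitación doble",
    }

    def translate(sentence):
        best = max(EXAMPLES,
                   key=lambda s: SequenceMatcher(None, s, sentence).ratio())
        return EXAMPLES[best]

    print(translate("i would like a single room please"))
    # quisiera una habitación individual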

 

10. How much new information is created each year?

The World Wide Web contains about 170 terabytes of information on its surface; in volume this is seventeen times the size of the Library of Congress print collections. Instant messaging generates five billion messages a day (750GB), or 274 Terabytes a year. Email generates about 400,000 terabytes of new information each year worldwide.

(University of California)

http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm#summary
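As a quick editorial sanity check on the instant-messaging figure (my own arithmetic, not part of the source):

    750 GB/day × 365 days ≈ 273,750 GB ≈ 274 TB/year

which agrees with the yearly total quoted.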

 

11. Will the semantic web be a solution to the current web situation?

The Semantic Web can be seen as a huge engineering solution... but it is more than that. We will find that as it becomes easier to publish data in a repurposable form, more people will want to publish data, and there will be a knock-on or domino effect. We may find that a large number of Semantic Web applications can be used for a variety of different tasks, increasing the modularity of applications on the Web. But enough subjective reasoning... on to how this will be accomplished.

(Sean B. Palmer)

http://infomesh.net/2001/swintro/#whatIsSw

 

12. Which are the most usual interpretations of the term "machine translation" (MT)?

The term machine translation (MT) is normally taken in its restricted and precise meaning of fully automatic translation. However, in this chapter we consider the whole range of tools that may support translation and document production in general, which is especially important when considering the integration of other language processing techniques and resources with MT. We therefore define Machine Translation to include any computer-based process that transforms (or helps a user to transform) written text from one human language into another. We define Fully Automated Machine Translation (FAMT) to be MT performed without the intervention of a human being during the process. Human-Assisted Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a (mono- or bilingual) human for help. Machine-Aided Translation (MAT) is the style of translation in which a human does most of the work but uses one or more computer systems, mainly as resources such as dictionaries and spelling checkers, as assistants.

Traditionally, two very different classes of MT have been identified. Assimilation refers to the class of translation in which an individual or organization wants to gather material written by others in a variety of languages and convert it all into his or her own language. Dissemination refers to the class in which an individual or organization wants to broadcast his or her own material, written in one language, in a variety of languages to the world. A third class of translation has also recently become evident. Communication refers to the class in which two or more individuals are in more or less immediate interaction, typically via email or otherwise online, with an MT system mediating between them. Each class of translation has very different features, is best supported by different underlying technology, and is to be evaluated according to somewhat different criteria.

(Bente Maegaard)

http://sirio.deusto.es/ABAITUA/konzeptu/nlp/Mlim/mlim4.html

 

13. Where was MT ten years ago?

Ten years ago, the typical users of machine translation were large organizations such as the European Commission, the US Government, the Pan American Health Organization, Xerox, Fujitsu, etc. Fewer small companies or freelance translators used MT, although translation tools such as online dictionaries were becoming more popular. However, ongoing commercial successes in Europe, Asia, and North America continued to illustrate that, despite imperfect levels of achievement, the levels of quality being produced by FAMT and HAMT systems did address some users' real needs. Systems were being produced and sold by companies such as Fujitsu, NEC, Hitachi, and others in Japan, Siemens and others in Europe, and Systran, Globalink, and Logos in North America (not to mention the unprecedented growth of cheap, rather simple MT assistant tools such as PowerTranslator).

(Bente Maegaard)

http://sirio.deusto.es/ABAITUA/konzeptu/nlp/Mlim/mlim4.html

 

14. Major Methods, Techniques and Approaches

One of the most pressing questions of MT results from the recent introduction of a new paradigm into Computational Linguistics. It had always been thought that MT, which combines the complexities of two languages (at least), requires highly sophisticated theories of linguistics in order to produce reasonable quality output.

As described above, the CANDIDE system (Brown et al., 1990) challenged that view. The DARPA MT Evaluation series of four MT evaluations, the last of which was held in 1994, compared the performance of three research systems, more than 5 commercial systems, and two human translators (White et al., 1992—94). It forever changed the face of MT, showing that MT systems using statistical techniques to gather their rules of cross-language correspondence were feasible competitors to traditional, purely hand-built ones. However, CANDIDE did not convince the community that the statistics-only approach was the optimal path; in developments since 1994, it has included steadily more knowledge derived from linguistics. This left the burning question: which aspects of MT systems are best approached by statistical methods, and which by traditional, linguistic ones?

Since 1994, a new generation of research MT systems has been investigating various hybridizations of statistical and symbolic techniques (Knight et al., 1995; Brown and Frederking, 1995; Dorr, 1997; Nirenburg et al., 1992; Wahlster, 1993; Kay et al., 1994). While it is clear by now that some modules are best approached under one paradigm or the other, it is a relatively safe bet that others are genuinely hybrid, and that their best design and deployment will be determined by the eventual use of the system in the world. Given the large variety of phenomena inherent in language, it is highly unlikely that there exists a single method to handle all the phenomena optimally, both in the data/rule collection stage and in the data/rule application (translation) stage. Thus one can expect all future non-toy MT systems to be hybrids. Methods of statistics and probability combination will predominate where robustness and wide coverage are at issue, while generalizations of linguistic phenomena, symbol manipulation, and structure creation and transformation will predominate where fine nuances (i.e., translation quality) are important. Just as we today have limousines, trucks, passenger cars, trolley buses, and bulldozers, we will have different kinds of MT systems that use different translation engines and concentrate on different functions.

(Bente Maegaard)

http://sirio.deusto.es/ABAITUA/konzeptu/nlp/Mlim/mlim4.html
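The statistics-based approach that CANDIDE introduced, discussed above, is usually summarised by the noisy-channel equation (standard in the literature, not taken from the quoted chapter). To translate a foreign sentence f, the system seeks the target sentence e maximising:

    e* = argmax_e P(e | f) = argmax_e P(f | e) · P(e)

where the translation model P(f | e) and the language model P(e) are both estimated from bilingual and monolingual corpora rather than written by hand.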

 

15. Which are the main problems of MT?

In this chapter we will consider some particular problems which the task of translation poses for the builder of MT systems, some of the reasons why MT is hard. It is useful to think of these problems under three headings: (i) problems of ambiguity, (ii) problems that arise from structural and lexical differences between languages, and (iii) multiword units like idioms and collocations. We will discuss typical problems of ambiguity, lexical and structural mismatches, and multiword units.

Of course, these sorts of problem are not the only reasons why MT is hard. Other problems include the sheer size of the undertaking, as indicated by the number of rules and dictionary entries that a realistic system will need, and the fact that there are many constructions whose grammar is poorly understood, in the sense that it is not clear how they should be represented, or what rules should be used to describe them. This is the case even for English, which has been extensively studied, and for which there are detailed descriptions, both traditional 'descriptive' and theoretically sophisticated, some of which are written with computational usability in mind. It is an even worse problem for other languages. Moreover, even where there is a reasonable description of a phenomenon or construction, producing a description which is sufficiently precise to be used by an automatic system raises non-trivial problems.

(D J Arnold)

http://sirio.deusto.es/abaitua/konzeptu/ta/MT_book_1995/node52.html#SECTION00810000000000000000

 

 

Conclusion

After writing this report about Human Language Technologies and their role in the Information Society, I have to say that I have learnt a lot in many ways. In order to write it, I have had to do research using the internet, something that I had never done before.

Another aspect I want to highlight is the number of different topics from which I have taken information. This has made me discover many things about New Technologies that I did not know at all. I consider that to be a very important feature of an assignment: being forced to discover "new worlds" of information about many interesting themes.

Although not every topic is as important or interesting to me as others, I must say that nearly all the topics I have dealt with while writing this report are very useful for my studies and my future career. This is mainly due to their value as knowledge of new techniques to be applied in the future, and in some cases even in the present.

I have learnt about the importance of New Technologies and about some aspects of Machine Translation; in short, about the help that we (mainly linguists, philologists and translators) can obtain from computers, the internet and other tools related to New Technologies.

I did not know much of what has been developed in this report, but from now on I will be able to use some of these tools; and even when not using them directly, I will at least try to apply them more skilfully in future jobs.

I hope I will be able to learn much more in close relation to my present degree (English Philology), and that this subject will provide me with more tools for my professional future in order to improve my abilities in the correct usage of the English language.

 

On-line references (in order of appearance)

http://www.hltcentral.org/htmlengine.shtml?id=169 by HLTCentral (2001)

http://en.wikipedia.org/wiki/Natural_language_processing by Wikipedia, the free encyclopedia (2004)

http://www.coli.uni-sb.de/~hansu/what_is_cl.html by Hans Uszkoreit (2000)

http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#tiole by HLTCentral (2004)

http://www.hltcentral.org/page-218.0.shtml by HLTCentral (2000)

http://www.hltcentral.org/page-219.0.shtml by HLTCentral (2000)

http://www.hltcentral.org/usr_docs/Harness/harness-en.htm#t by HLTCentral (2004)

http://www.acm.org/sigmm/mm2003/t1.shtml by Jennifer Lai (2003)

http://www.hltcentral.org/page-1086.0.shtml by Enrique Vidal (2003)

http://www.hltcentral.org/page-1086.0.shtml by Joseba Abaitua (2003)

http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm#summary by the University of California (2003)

http://infomesh.net/2001/swintro/#whatIsSw by Sean B. Palmer (2001)

http://sirio.deusto.es/ABAITUA/konzeptu/nlp/Mlim/mlim4.html by Bente Maegaard (1999)

http://sirio.deusto.es/abaitua/konzeptu/ta/MT_book_1995/node52.html#SECTION00810000000000000000 by D J Arnold (1995)