This document is a review of Human Language Technology based on a questionnaire researched on the Internet during the class periods. The paper also provides a brief overview to help the reader understand the subject: for example, it illustrates how language is connected to the new technologies in the world of computers and why human beings need this connection. The information contained in this paper has been elaborated through several answers related to six different topics. From my point of view, the best way to understand a theme is by trying to answer all of the possible questions, and this is the reason why the body of this paper is organized in that way. We will thus clarify the meaning of certain elements which help us to understand the whole field, such as the Information Society, Knowledge and Information, Translators, Machine Translation and Language Engineering. However, the paper is not limited to definitions of these elements; the explanations are also extended to other areas in order to provide the basic knowledge needed to understand each following topic.
In this report, we are going to analyze the relationship between Human Language Technology and the latest technological advances in the world of computers. The paper is divided into six different topics, each corresponding to a different week of the class period. We will find several questions relating to each topic. These questions are based on several different fields, such as the Information Society, Knowledge and Information, Translators, Machine Translation and Language Engineering.
However, the paper does not only define these different fields; it also covers other aspects. For example, the section on the Information Society does not merely define the phrase, but also elaborates on both the importance of the field and its influence on society. In the same way, we will not only define Knowledge and Information, but also compare these two terms to illustrate their main differences. Regarding the field of Translators, we ask whether human beings are better translators than machines. Finally, we will see how Language Engineering has an influence on the use of language as well as how it improves it. In addition, we will speak about organizations related to these fields, their tasks and their difficulties.
I have selected each answer according to my own specific criteria. First of all, I read all the questions corresponding to each week. Then, I read the information on the web pages provided in the main menu of the subject. Once I thought I understood the information, I tried to identify the correct answer to each question. Finally, I used the commands "copy" and "paste" to place the answers in the body of the paper. However, sometimes I had to make a few changes in sentence structure to connect some paragraphs with others. In addition, I added extra information that could help clarify the ideas.
The Information Society is a term used to describe a society and an economy that makes the best possible use of new information and communication technologies (ICTs). In an Information Society people will get the full benefits of new technology in all aspects of their lives: at work, at home and at play. Examples of ICTs are: ATMs for cash withdrawal and other banking services, mobile phones, teletext television, faxes, and information services such as the Internet and e-mail. These new technologies have implications for all aspects of our society and economy. They are changing the way in which we do business, how we learn and how we spend our leisure time. This also means important challenges for Government: our laws need to be up to date in order to support electronic transactions, our people need to be educated about new technology, businesses must get on-line if they are to succeed, and government services should be available electronically.
HLTCentral, the "Gateway to Speech & Language Technology Opportunities on the Web", was established as an online information resource on human language technologies and related topics of interest to the HLT community at large. It covers news, R&D, technological and business developments in the field of speech, language, multilinguality, automatic translation, localisation and related areas. Its coverage of HLT news and developments is worldwide, with a unique European perspective. Two EU-funded projects, EUROMAP and ELSNET, are behind the development of HLTCentral. EUROMAP ("Facilitating the path to market for language and speech technologies in Europe") aims to provide awareness, bridge-building and market-enabling services for accelerating the rate of technology transfer and market take-up of the results of European HLT RTD projects. ELSNET ("The European Network of Excellence in Human Language Technologies") aims to bring together the key players in language and speech technology, both in industry and in academia, and to encourage interdisciplinary co-operation through a variety of events and services.
The Information Age

The development and convergence of computer and telecommunication technologies has led to a revolution in the way that we work, communicate with each other, buy goods and use services, and even the way we entertain and educate ourselves. One of the results of this revolution is that large volumes of information will increasingly be held in a form which is more natural for human users than the strictly formatted, structured data typical of computer systems of the past. Information presented in visual images, as sound, and in natural language, either as text or speech, will become the norm. We all deal with computer systems and services, either directly or indirectly, every day of our lives. This is the information age and we are a society in which information is vital to economic, social, and political success as well as to our quality of life. The changes of the last two decades may have seemed revolutionary but, in reality, we are only on the threshold of this new age. There are still many new ways in which the application of telematics and the use of language technology will benefit our way of life, from interactive entertainment to lifelong learning. Although these changes will bring great benefits, it is important that we anticipate difficulties which may arise, and develop ways to overcome them. Examples of such problems are: access to much of the information may be available only to the computer literate and those who understand English; a surfeit of information from which it is impossible to identify and select what is really wanted. Language Engineering can solve these problems.

Information universally available

The language technologies will make an indispensable contribution to the success of this information revolution. The availability and usability of new telematics services will depend on developments in language engineering. Speech recognition will become a standard computer function providing us with the facility to talk to a range of devices, from our cars to our home computers, and to do so in our native language. In turn, these devices will present us with information, at least in part, by generating speech. Multi-lingual services will also be developed in many areas. In time, material provided by information services will be generated automatically in different languages. This will increase the availability of information to the general public throughout Europe. Initially, multi-lingual services will become available, based on basic data, such as weather forecasts and details of job vacancies, from which text can be generated in any language. Eventually, however, we can expect to see automated translation as an everyday part of information services so that we can both request and receive all sorts of information in our own language.

Home and Abroad

Language Engineering will also help in the way that we deal with associates abroad. Although the development of electronic commerce depends very much on the adoption of interchange standards for communications and business transactions, the use of natural language will continue, precisely because it is natural. However, systems to generate business letters and other forms of communication in foreign languages will ease and greatly enhance communication. Automated translation combined with the management of documentation, including technical manuals and user handbooks, will help to improve the quality of service in a global marketplace.
Export business will be handled cost effectively with the same high level of customer care that is provided in the home market.
How can we cope with so much information? One of the fundamental components of Language Engineering is the understanding of language by the computer. This is the basis of speech operated control systems and of translation, for example. It is also the way in which we can prevent ourselves from being overwhelmed with information, unable to collate, analyse, and select what we need. However, if information services are capable of understanding our requests, and can scan and select from the information base with real understanding, not only will the problem of information overload be solved but also no significant information will be missed. Language Engineering will deliver the right information at the right time.
Knowledge is of more value than information. Information is data given context and endowed with meaning and significance, whereas knowledge is information that is transformed through reasoning and reflection into beliefs, concepts, and mental models. Consider a document containing a table of numbers indicating product sales for the quarter. As they stand, these numbers are Data. An employee reads these numbers, recognizes the name and nature of the product, and notices that the numbers are below last year's figures, indicating a downward trend. The data has become Information. The employee considers possible explanations for the product decline (perhaps using additional information and personal judgment), and comes to the conclusion that the product is no longer attractive to its customers. This new belief, derived from reasoning and reflection, is Knowledge.
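The progression from data to information to knowledge described above can be sketched in a few lines of code. This is only an illustration of the reasoning steps; the product name, the figures and the decision rule are invented for the example.

```python
# Illustrative sketch of the data -> information -> knowledge progression.
# The figures, product name and decision rule are assumptions made for the example.

quarterly_sales = [1200, 1100, 950, 870]       # raw numbers: Data
last_year_sales = [1300, 1280, 1250, 1240]

def to_information(current, previous):
    """Give the raw figures context: compare them with last year and name the trend."""
    below = [c < p for c, p in zip(current, previous)]
    trend = "downward" if all(below) else "mixed"
    return {"product": "Widget X", "trend": trend}

def to_knowledge(information):
    """Apply reasoning (here a crude rule) to turn information into a belief."""
    if information["trend"] == "downward":
        return "The product is no longer attractive to its customers."
    return "No conclusion can be drawn yet."

info = to_information(quarterly_sales, last_year_sales)   # Information
print(info["trend"])
print(to_knowledge(info))                                  # Knowledge
```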
Yes, the possession of large quantities of data implies that we are well informed. Consider again a document containing a table of numbers indicating product sales for the quarter. An employee reads these numbers, recognizes the name and nature of the product, and notices that the numbers are below last year's figures, indicating a downward trend. The employee considers possible explanations for the product decline (perhaps using additional information and personal judgment), and comes to the conclusion that the product is no longer attractive to its customers. The more data the employee reads, the better informed he will be. http://www.fis.utoronto.ca/kmi/resources.htm How many words of technical information are recorded every day? 20 million words of technical information are recorded every day.
The most convenient way of representing information is the Information Architecture, which is a set of models, definitions, rules, and standards that give structure and order to an organization's information so that information needs can be matched with information resources. An Information Architecture defines what types of information exist in the organization, where the information can be found, who the creators and owners of the information are, and how the information is to be used. An Information Architecture may contain several of the following: a model or representation of main information entities and processes; a taxonomy or categorization scheme; standards; definitions and interpretations of terms; directories or inventories; resource maps and description frameworks; designs for developing information systems, products, and services.
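As a rough illustration of the elements listed above, an Information Architecture could be recorded in a simple structure like the following sketch; every entry in it is an invented example, not something taken from the source.

```python
# Illustrative sketch of an Information Architecture record.
# All entities, locations and owners below are invented examples.

information_architecture = {
    "entities": ["customer", "product", "sales report"],              # main information entities
    "taxonomy": {"reports": ["financial", "technical"],               # categorization scheme
                 "correspondence": ["internal", "external"]},
    "definitions": {"sales report": "quarterly summary of product sales"},
    "locations": {"sales report": "intranet/reports/quarterly"},      # where the information is found
    "owners": {"sales report": "Finance department"},                 # creators and owners
    "usage": {"sales report": "input to quarterly planning"},         # how the information is used
}

# Matching an information need with an information resource:
need = "sales report"
print(information_architecture["locations"][need], "-", information_architecture["owners"][need])
```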
Language Engineering is a technology which uses our knowledge of language to enhance our application of computer systems: improving the way we interface with them; assimilating, analysing, selecting, using, and presenting information more effectively; and providing human language generation and translation facilities. New opportunities are becoming available to change the way we do many things, to make them easier and more effective by exploiting our developing knowledge of language. When, in addition to accepting typed input, a machine can recognise written natural language and speech, in a variety of languages, we shall all have easier access to the benefits of a wide range of information and communications services, as well as the facility to carry out business transactions remotely, over the telephone or other telematics services. When a machine understands human language, translates between different languages, and generates speech as well as printed output, we shall have available an enormously powerful tool.
Language Technology, Language Engineering and Computational Linguistics: similarities and differences. Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition. Computational linguistics has applied and theoretical components. Theoretical CL takes up issues in theoretical linguistics and cognitive science. It deals with formal theories about the linguistic knowledge that a human needs for generating and understanding language. Today these theories have reached a degree of complexity that can only be managed by employing computers. Computational linguists develop formal models simulating aspects of the human language faculty and implement them as computer programmes. These programmes constitute the basis for the evaluation and further development of the theories. In addition to linguistic theories, findings from cognitive psychology play a major role in simulating linguistic competence. Within psychology, it is mainly the area of psycholinguistics that examines the cognitive processes constituting human language use. The relevance of computational modelling for psycholinguistic research is reflected in the emergence of a new subdiscipline: computational psycholinguistics. Applied CL focuses on the practical outcome of modelling human language use. The methods, techniques, tools and applications in this area are often subsumed under the term language engineering or (human) language technology. Although existing CL systems are far from achieving human ability, they have numerous possible applications. The goal is to create software products that have some knowledge of human language. Such products are going to change our lives. They are urgently needed for improving human-machine interaction, since the main obstacle in the interaction between humans and computers is a communication problem. Today's computers do not understand our language, but computer languages are difficult to learn and do not correspond to the structure of human thought. Even if the language the machine understands and its domain of discourse are very restricted, the use of human language can increase the acceptance of software and the productivity of its users.
Language Engineering is the application of knowledge of language to the development of computer systems which can recognise, understand, interpret, and generate human language in all its forms. In practice, Language Engineering comprises a set of techniques and language resources. The former are implemented in computer software and the latter are a repository of knowledge which can be accessed by computer software.
Language technologies are information technologies that are specialized for dealing with the most complex information medium in our world: human language. Therefore these technologies are also often subsumed under the term Human Language Technology. Human language occurs in spoken and written form. Whereas speech is the oldest and most natural mode of language communication, complex information and most human knowledge is maintained and transmitted in written texts. Speech and text technologies process or produce language in these two modes of realization. But language also has aspects that are shared between speech and text, such as dictionaries, most of the grammar, and the meaning of sentences. Thus large parts of language technology cannot be subsumed under speech and text technologies. Among those are technologies that link language to knowledge. We do not know how language, knowledge and thought are represented in the human brain. Nevertheless, language technology has to create formal representation systems that link language to concepts and tasks in the real world. This provides the interface to the fast growing area of knowledge technologies.
Techniques

There are many techniques used in Language Engineering and some of these are described below.

Speaker Identification and Verification

A human voice is as unique to an individual as a fingerprint. This makes it possible to identify a speaker and to use this identification as the basis for verifying that the individual is entitled to access a service or a resource. The types of problems which have to be overcome are, for example, recognising that the speech is not recorded, selecting the voice through noise (either in the environment or the transfer medium), and identifying reliably despite temporary changes (such as caused by illness).

Speech Recognition

The sound of speech is received by a computer in analogue wave forms which are analysed to identify the units of sound (called phonemes) which make up words. Statistical models of phonemes and words are used to recognise discrete or continuous speech input. The production of quality statistical models requires extensive training samples (corpora) and vast quantities of speech have been collected, and continue to be collected, for this purpose. There are a number of significant problems to be overcome if speech is to become a commonly used medium for dealing with a computer. The first of these is the ability to recognise continuous speech rather than speech which is deliberately delivered by the speaker as a series of discrete words separated by a pause. The next is to recognise any speaker, avoiding the need to train the system to recognise the speech of a particular individual. There is also the serious problem of the noise which can interfere with recognition, either from the environment in which the speaker uses the system or through noise introduced by the transmission medium, the telephone line, for example. Noise reduction, signal enhancement and key word spotting can be used to allow accurate and robust recognition in noisy environments or over telecommunication networks. Finally, there is the problem of dealing with accents, dialects, and language spoken, as it often is, ungrammatically.

Character and Document Image Recognition

Recognition of written or printed language requires that a symbolic representation of the language is derived from its spatial form of graphical marks. For most languages this means recognising and transforming characters. There are two cases of character recognition: recognition of printed images, referred to as Optical Character Recognition (OCR), and recognition of handwriting, usually known as Intelligent Character Recognition (ICR). OCR from a single printed font family can achieve a very high degree of accuracy. Problems arise when the font is unknown or very decorative, or when the quality of the print is poor. In these difficult cases, and in the case of handwriting, good results can only be achieved by using ICR. This involves word recognition techniques which use language models, such as lexicons or statistical information about word sequences. Document image analysis is closely associated with character recognition but involves the analysis of the document to determine firstly its make-up in terms of graphics, photographs, separating lines and text, and then the structure of the text to identify headings, sub-headings, captions etc. in order to be able to process the text effectively.

Natural Language Understanding

The understanding of language is obviously fundamental to many applications. However, perfect understanding is not always a requirement.
In fact, gaining a partial understanding is often a very useful preliminary step in the process because it makes it possible to be intelligently selective about taking the depth of understanding to further levels. Shallow or partial analysis of texts is used to obtain a robust initial classification of unrestricted texts efficiently. This initial analysis can then be used, for example, to focus on 'interesting' parts of a text for a deeper semantic analysis which determines the content of the text within a limited domain. It can also be used, in conjunction with statistical and linguistic knowledge, to identify linguistic features of unknown words automatically, which can then be added to the system's knowledge. Semantic models are used to represent the meaning of language in terms of concepts and relationships between them. A semantic model can be used, for example, to map an information request to an underlying meaning which is independent of the actual terminology or language in which the query was expressed. This supports multi-lingual access to information without a need to be familiar with the actual terminology or structuring used to index the information. Combinations of analysis and generation with a semantic model allow texts to be translated. At the current stage of development, applications where this can be achieved need to be limited in vocabulary and concepts so that adequate Language Engineering resources can be applied. Templates for document structure, as well as common phrases with variable parts, can be used to aid generation of a high quality text.

Natural Language Generation

A semantic representation of a text can be used as the basis for generating language. An interpretation of basic data or the underlying meaning of a sentence or phrase can be mapped into a surface string in a selected fashion; either in a chosen language or according to stylistic specifications by a text planning system.

Speech Generation

Speech is generated from filled templates, by playing 'canned' recordings or concatenating units of speech (phonemes, words) together. Speech generated has to account for aspects such as intensity, duration and stress in order to produce a continuous and natural response. Dialogue can be established by combining speech recognition with simple generation, either from concatenation of stored human speech components or synthesising speech using rules. Providing a library of speech recognisers and generators, together with a graphical tool for structuring their application, allows someone who is neither a speech expert nor a computer programmer to design a structured dialogue which can be used, for example, in automated handling of telephone calls.
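Since the passage above mentions generation from filled templates and from common phrases with variable parts, here is a minimal sketch of that idea. The templates, the weather data and the small condition lexicon are invented for the illustration; a real multilingual service would of course be far richer.

```python
# Minimal sketch of template-based generation: common phrases with variable parts
# are filled from basic data, in more than one language.
# Templates, weather data and the condition lexicon are invented for illustration.

templates = {
    "en": "The forecast for {city} is {condition} with a high of {high} degrees.",
    "fr": "Les prévisions pour {city} annoncent un temps {condition}, avec un maximum de {high} degrés.",
}

condition_lexicon = {"cloudy": {"en": "cloudy", "fr": "nuageux"}}

def generate(language, city, condition, high):
    """Fill the variable parts of a stored template with basic data."""
    localized_condition = condition_lexicon[condition][language]
    return templates[language].format(city=city, condition=localized_condition, high=high)

print(generate("en", "Bilbao", "cloudy", 18))
print(generate("fr", "Bilbao", "cloudy", 18))
```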
Language resources are essential components of Language Engineering. They are one of the main ways of representing the knowledge of language, which is used for the analytical work leading to recognition and understanding. The work of producing and maintaining language resources is a huge task. Resources are produced, according to standard formats and protocols to enable access, in many EU languages, by research laboratories and public institutions. Many of these resources are being made available through the European Language Resources Association (ELRA).
language processing: a term in use since the 1980s to define a class of software systems which handle text intelligently.
translator's workbench: a software system providing a working environment for a human translator, which offers a range of aids such as on-line dictionaries, thesauri, translation memories, etc.
shallow parser: software which parses language to a point where a rudimentary level of understanding can be realised; this is often used in order to identify passages of text which can then be analysed in further depth to fulfil the particular objective.
formalism: a means to represent the rules used in the establishment of a model of linguistic knowledge.
speech recognition: the sound of speech is received by a computer in analogue wave forms which are analysed to identify the units of sound (called phonemes) which make up words; statistical models of phonemes and words are then used to recognise discrete or continuous speech input (see the fuller description of this technique above).
text alignment: the process of aligning different language versions of a text in order to be able to identify equivalent terms, phrases, or expressions.
authoring tools: facilities provided in conjunction with word processing to aid the author of documents, typically including an on-line dictionary and thesaurus, spell-, grammar-, and style-checking, and facilities for structuring, integrating and linking documents.
controlled language (also artificial language): language which has been designed to restrict the number of words and the structure of the language used, in order to make language processing easier; typical users of controlled language work in an area where precision of language and speed of response is critical, such as the police and emergency services, aircraft pilots, air traffic control, etc.
domain: usually applied to the area of application of the language-enabled software, e.g. banking, insurance, travel, etc.; the significance in Language Engineering is that the vocabulary of an application is restricted, so the language resource requirements are effectively limited by limiting the domain of application.
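Text alignment, listed above as a way of pairing different language versions of a text, can be sketched very simply by pairing sentences in order and flagging pairs whose lengths diverge too much (a crude version of the classic length-based approach). The example sentences and the threshold are illustrative assumptions.

```python
# Very simplified sketch of text alignment: pair sentences of two language versions
# in order and use a length ratio to flag suspicious pairs. Real aligners also
# handle one-to-two and two-to-one pairings; the sentences below are invented.

def align(source_sentences, target_sentences):
    """Greedy one-to-one alignment with a length-ratio score for each pair."""
    pairs = []
    for src, tgt in zip(source_sentences, target_sentences):
        ratio = len(tgt) / max(len(src), 1)
        pairs.append((src, tgt, round(ratio, 2)))
    return pairs

english = ["Insert the paper in the printer.", "Switch the printer on."]
french = ["Mettez le papier dans l'imprimante.", "Allumez l'imprimante."]

for src, tgt, ratio in align(english, french):
    flag = "" if 0.7 <= ratio <= 1.5 else "  <-- check this pair"
    print(f"{src}  <->  {tgt}  (length ratio {ratio}){flag}")
```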
When discussing the relevance of technological training in the translation curricula, it is important to clarify the factors that make technology more indispensable and show how the training should be tuned accordingly. The relevance of technology will depend on the medium that contains the text to be translated. This particular aspect is becoming increasingly evident with the rise of the localization industry, which deals solely with information in digital form. There may be no other imaginable means for approaching the translation of such things as on-line manuals in software packages or CD-ROMs with technical documentation than computational ones. http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm Do professional interpreters and literary translators need translation technology? Which are the tools they need for their job? With the exception of a few eccentrics or maniacs, it will be rare in the future to see good professional interpreters and literary translators not using more or less sophisticated and specialized tools for their jobs, comparable to the familiarization with tape recorders or typewriters in the past. In any case, this may be something best left to the professional to decide, and may not be indispensable. It is clear that word processors, on-line dictionaries and all sorts of background documentation, such as concordances or collated texts, besides e-mail or other ways of network interaction with colleagues in the world, may substantially help the literary translator's work.
Information of many types is rapidly changing format and going digital. Electronic documentation is the adequate realm for the incorporation of translation technology. This is something that young students of translation must learn. As the conception and design of technical documentation becomes progressively influenced by the electronic medium, it is integrating more and more with the whole concept of a software product. The strategies and means for translating both software packages and electronic documents are becoming very similar and both are now, as we will see, the goal of the localization industry.
The main focus of the localization industry is to help software publishers, hardware manufacturers and telecommunications companies with versions of their software, documentation, marketing, and Web-based information in different languages for simultaneous worldwide release. Yes, I believe so, because the capacity for translation is very important for this sector.
Globalization: The adaptation of marketing strategies to regional requirements of all kinds (e.g., cultural, legal, and linguistic).
Internationalization: The engineering of a product (usually software) to enable efficient adaptation of the product to local requirements.
Localization: The adaptation of a product to a target language and culture (locale). The main goal of the LEIT initiative is to introduce localization courseware into translation studies, with versions ready for the start of the 1999 academic year. However, this must be done with care. Bert Esselink (1998), from AlpNet, for example, argues against separating localization from other disciplines and claims its basic principles should be covered in all areas of translation training. Furthermore, it is worth adding that trainers not only need constant feedback and guidance from the commercial sector, they also need to maintain close contact with the software industry. So, perhaps, one of the best features of the LEIT initiative is its combination of partners from the academic as well as from the industrial world. LISA offers the first version of this courseware on its Web site, and users have the possibility to contact the LEIT group and collaborate through an on-line questionnaire.
In the localization industry, the utilization of technology is congenital, and developing adequate tools has immediate economic benefits. The above lines depict a view of a translation environment which is closer to more traditional needs of the translator than to current requirements of the industry. Many aspects of software localization have not been considered, particularly the concepts of multilingual management and document-life monitoring. Corporations are now realizing that documentation is an integral part of the production line where the distinction between product, marketing and technical material is becoming more and more blurred. Product documentation is gaining importance in the whole process of product development with direct impact on time-to-market. Software engineering techniques that apply in other phases of software development are beginning to apply to document production as well. The appraisal of national and international standards of various types is also significant: text and character coding standards (e.g. SGML/XML and Unicode), as well as translation quality control standards (e.g. DIN 2345 in Germany, or UNI 10574 in Italy). In response to these new challenges, localization packages are now being designed to assist users throughout the whole life cycle of a multilingual document. These take them through job setup, authoring, translation preparation, translation, validation, and publishing, besides ensuring consistency and quality in source and target language variants of the documentation. New systems help developers monitor different versions, variants and languages of product documentation, and author customer-specific solutions. An average localization package today will normally consist of an industry standard SGML/XML editor (e.g. ArborText), a translation and terminology toolkit (Trados Translator's Workbench), and a publishing engine (e.g. Adobe's Frame+SGML). Unlike traditional translators, software localizers may be engaged in early stages of software development, as there are issues, such as platform portability, code exchange, format conversion, etc., which if not properly dealt with may hinder product internationalization. Localizers are often involved in the selection and application of utilities that perform code scanning and checking, that automatically isolate and suggest solutions to National Language Support (NLS) issues, which save time during the internationalization enabling process. There are run-time libraries that enable software developers and localizers to create single-source, multilingual, and portable cross-platform applications. Unicode support is also fundamental for software developers who work with multilingual texts, as it provides a consistent coding format for international character sets. In the words of Rose Lockwood (Language International 10.5), a consultant from Equipe Consortium Ltd, "as traditional translation methods give way to language engineering and disciplined authoring, translation and document-management methods, the role of technically proficient linguists and authors will be increasingly important to global WWW. The challenge will be to employ the skills used in conventional technical publishing in the new environment of a digital economy."
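The role of Unicode mentioned above can be illustrated with a small sketch using Python's standard unicodedata module: the "same" accented word can arrive as two different code-point sequences, and normalizing to a single form keeps terminology lookups and source/target comparisons consistent. The example word is chosen only for illustration.

```python
# Sketch of why consistent Unicode handling matters in localization work:
# the same accented word can be encoded in two different ways, and normalization
# makes comparisons reliable. The example word is chosen only for illustration.
import unicodedata

composed = "traducci\u00f3n"       # "ó" as one precomposed character
decomposed = "traduccio\u0301n"    # "o" followed by a combining acute accent

print(composed == decomposed)      # False: different code-point sequences
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))   # True once both are normalized

# Storing everything as NFC-normalized UTF-8 keeps source and target variants consistent.
print(unicodedata.normalize("NFC", decomposed).encode("utf-8"))
```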
Leaving behind the old conception of a monolithic compact translation engine, the industry is now moving in the direction of integrating systems: "In the future Trados will offer solutions that provide enterprise-wide applications for multilingual information creation and dissemination, integrating logistical and language-engineering applications into smooth workflow that spans the globe," says Trados manager Henri Broekmate. Logos, the veteran translation technology provider, has announced "an integrated technology-based translation package, which will combine term management, TM, MT and related tools to create a seamless full service localization environment." Other software manufacturers also in the race are Corel, Star, IBM, and the small but belligerent Spanish company Atril. This approach of integrating different tools is largely the view advocated by many language-technology specialists. Below is a description of an ideal engine which captures the answers given by Muriel Vasconcellos (from the Pan American Health Organization), Minako O'Hagan (author of The Coming Age of Teletranslations) and Eduard Hovy (President of the Association for Machine Translation in the Americas) to a recent survey (by Language International 10.6). The ideal workstation for the translator would combine the following features:

Full integration in the translator's general working environment, which comprises the operating system, the document editor (hypertext authoring, desktop publisher or the standard word-processor), as well as the emailer or the Web browser. These would be complemented with a wide collection of linguistic tools: from spell, grammar and style checkers to on-line dictionaries and glossaries, including terminology management, annotated corpora, concordances, collated texts, etc.

The system should comprise all advances in machine translation (MT) and translation memory (TM) technologies, be able to perform batch extraction and reuse of validated translations, and enable searches into TM databases by various keywords (such as phrases, authors, or issuing institutions). These TM databases could be distributed and accessible through the Internet. There is a new standard for TM exchange (TMX) that would permit translators and companies to work remotely and share memories in real time.

Eduard Hovy underlines the need for a genre detector. "We need a genre topology, a tree of more or less related types of text and ways of recognizing and treating the different types computationally." He also sees the difficulty of constantly updating the dictionaries and suggests a "restless lexicon builder that crawls all over the Web every night, ceaselessly collecting words, names, and phrases, and putting them into the appropriate lexicons." Muriel Vasconcellos pictures her ideal design of the workstation in the following way:

A good view of the source text, extensive enough to offer the overall context, including the previous sentence and two or three sentences after the current one.

Relevant on-line topical word lists, glossaries and thesauri. These should be immediately accessible and, in the case of topical lists, there should be an optional switch that shows, possibly in color, when there are subject-specific entries available.

Three target-text windows. The first would be the main working area, and it would start by providing a sentence from the original document (or a machine pre-translation), which could be over-struck or quickly deleted to allow the translator to work from scratch.
The original text or pre-translation could be switched off. Characters of any language and other symbols should be easy to produce. Drag-and-drop is essential and editing macros are extremely helpful when overstriking or translating from scratch. The second window would offer translation memory when it is available. The TM should be capable of fuzzy matching with a very large database, with the ability to include the organization's past texts if they are in some sort of electronic form. The third window would provide a raw machine translation which should be easy to paste into the target document. The grammar checker can be tailored so that it is not so sensitive. It would be ideal if one could write one's own grammar rules.
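The fuzzy matching behind the second window can be sketched with Python's standard difflib; the tiny translation memory below is invented for the illustration, and real systems use much larger databases and more refined similarity measures.

```python
# Minimal sketch of fuzzy matching against a translation memory (TM).
# The TM entries and the similarity threshold are illustrative assumptions.
from difflib import SequenceMatcher

translation_memory = {
    "Insert the paper in the printer.": "Mettez le papier dans l'imprimante.",
    "Switch the printer on.": "Allumez l'imprimante.",
}

def best_fuzzy_match(sentence, memory, threshold=0.6):
    """Return the most similar stored sentence, its translation and the score,
    provided the similarity reaches the threshold; otherwise return None."""
    best = max(memory, key=lambda src: SequenceMatcher(None, sentence, src).ratio())
    score = SequenceMatcher(None, sentence, best).ratio()
    return (best, memory[best], round(score, 2)) if score >= threshold else None

print(best_fuzzy_match("Insert the paper into the printer.", translation_memory))
```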
Having said all this, it is important to reassess the human factor. Like cooks, tailors or architects, professional translators need to become acquainted with technology, because good use of technology will make their jobs more competitive and satisfying. But they should not dismiss craftsmanship. Technology enhances productivity, but translation excellence goes beyond technology. It is important to delimit the roles of humans and machines in translation. Martin Kay's (1987) words in this respect are most illustrative: A computer is a device that can be used to magnify human productivity. Properly used, it does not dehumanize by imposing its own Orwellian stamp on the products of the human spirit and the dignity of human labor but, by taking over what is mechanical and routine, it frees human beings for what is essentially human. Translation is a fine and exacting art, but there is much about it that is mechanical and routine and, if this were given over to a machine, the productivity of the translator would not only be magnified but the work would become more rewarding, more exciting, more human. It has taken some 40 years for the specialists involved in the development of MT to realize that the limits to technology arise when going beyond the mechanical and routine aspects of language. From the outside, translation is often seen as a mere mechanical process, not any more complex than playing chess, for example. If computers have been programmed with the capacity to beat a chess master such as Kasparov, why should they not be capable of performing translation of the highest quality? Few people are aware of the complexity of literary translation. Douglas Hofstadter (1998) depicts this well: A skilled literary translator makes a far larger number of changes, and far more significant changes, than any virtuoso performer of classical music would ever dare to make in playing notes in the score of, say, a Beethoven piano sonata. In literary translation, it's totally humdrum stuff for new ideas to be interpreted, old ideas to be deleted, structures to be inverted, twisted around, and on and on. Although it may not be perceived at first sight, the complexity of natural language is of an order of magnitude far superior to any purely mechanical process. To how many words should the vocabulary be limited to make the complexity of producing "free sonnets" (that is, any combination of 6 words in 14 verses) comparable to the number of possible chess games? It may be difficult to believe, but the vocabulary should be restricted to 100 words. That is, making free sonnets with 100 words offers as many different alternatives as there are ways of playing a chess game (roughly 10^120; see DELI's Web page for discussion). The number of possibilities would quickly come down if combinations were restricted so that they not only made sense but acquired some sort of poetic value. However, defining formally or mechanically the properties of "making sense" and "having poetic value" is not an easy task. Or at least, it is far more difficult than establishing winning heuristics for a color to succeed in a chess game. No wonder, then, that Douglas Hofstadter's MT experiment translating the 16th century French poet Clément Marot's poem Ma Mignonne into English using IBM's Candide system should have performed so badly (see Sgrung's interview in Language International 10.1): Well, when you look at [IBM's Candide's] translation of Ma Mignonne, thinking of Ma Mignonne as prose, not as poetry, it's by far the worst.
It's so terrible that it's not even laughable, it just stinks! It's pathetic! Obviously, Hofstadter's experiment has gone beyond the recommended mechanical and routine scope of language and is therefore an abuse of MT. Outside the limits of the mechanical and routine, MT is impracticable and human creativity becomes indispensable. Translators of the highest quality are only obtainable from first-class raw materials and constant and disciplined training. The potentially good translator must be a sensitive, wise, vigilant, talented, gifted, experienced, and knowledgeable person. An adequate use of mechanical means and resources can make a good human translator a much more productive one. Nevertheless, very much like dictionaries and other reference material, technology may be considered an excellent prosthesis, but little more than that. As Martin Kay (1992) argues, there is an intrinsic and irreplaceable human aspect of translation: There is nothing that a person could know, or feel, or dream, that could not be crucial for getting a good translation of some text or other. To be a translator, therefore, one cannot just have some parts of humanity; one must be a complete human being. However, even for skilled human translators, translation is often difficult. One clear example is when linguistic form, as opposed to content, becomes an important part of a literary piece. Conveying the content, but missing the poetic aspects of the signifier, may considerably hinder the quality of the translation. This is a challenge to any translator. Jaime de Ojeda's (1989) Spanish translation of Lewis Carroll's Alice in Wonderland illustrates this problem:

Twinkle, twinkle, little bat
how I wonder what you're at!
Up above the world you fly
like a tea-tray in the sky.

Brilla, luce, ratita alada
¿en qué estás tan atareada?
Por encima del universo vuelas
como una bandeja de teteras.

Manuel Breva (1996) analyzes the example and shows how Ojeda solves the "formal hurdles" of the original: The above lines are a parody of the famous poem "Twinkle, twinkle, little star" by Jane Taylor, which, in Carroll's version, turns into a sarcastic attack against Bartholomew Price, a professor of mathematics, nicknamed "The Bat". Jaime de Ojeda translates "bat" as "ratita alada" for rhythmical reasons. "Murciélago", the Spanish equivalent of "bat", would be hard to fit in this context for the same poetic reasons. With Ojeda's choice of words the Spanish version preserves the meaning and maintains the same rhyming pattern (AABB) as in the original English verse-lines.
What would the output of any MT system be like if confronted with this fragment? Obviously, the result would be disastrous. Compared with the complexity of natural language, the figures that serve to quantify the "knowledge" of any MT program are absurd: 100,000-word bilingual vocabularies, 5,000 transfer rules... Well developed systems such as Systran or Logos hardly surpass these figures. How many more bilingual entries and transfer rules would be necessary to match Ojeda's competence? How long would it take to adequately train such a system? And even then, would it be capable of challenging Ojeda in the way the chess master Kasparov has been challenged? I have serious doubts about that being attainable at all. But there are other opinions, as is the case of the famous Artificial Intelligence master, Marvin Minsky. Minsky would argue that it is all a matter of time. He sees the human brain as an organic machine, and as such, its behavior, reactions and performance can be studied and reproduced. Other people believe there is an important aspect separating organic, living "machines" from synthetic machines. They would claim that creativity is in life, and that being creative is an exclusive faculty of living creatures. But from my point of view, machine translation will never be able to reach the level of human translation. The human being is not perfect, but he has more imagination to adapt a text into another language, using features such as feelings or his own knowledge, while a computer cannot use these kinds of things to translate a text.
LISA Education Initiative Taskforce (LEIT) is a consortium of schools training translators and computational linguists that was announced in 1998 as an initiative to develop a promotional program for the academic communities in Europe, North America, and Asia. The initial mandate of LEIT was to conduct a survey among academic and non-academic programs that offer courseware and training for internationalizers and localizers, and to query the market players to determine their needs with respect to major job profiles. LEIT's main objective is to stimulate more formal education in skills beneficial to the localization industry, which complains of a labor shortage. The academic institutions involved in the first release of LEIT are: University of Geneva (Switzerland), Brigham Young University (Utah), Kent State University (Ohio), University of Cologne (Germany), City College of Dublin (Ireland), Monterey Institute of International Studies (California), and National Software Center in Bombay (India).
Typical problems that make machine translation hard include:
(i) problems of ambiguity,
(ii) problems that arise from structural and lexical differences between languages, and
(iii) multiword units like idioms and collocations.
We will discuss typical problems of ambiguity in Section , lexical and structural mismatches in Section , and multiword units in Section . Of course, these sorts of problem are not the only reasons why MT is hard. Other problems include the sheer size of the undertaking, as indicated by the number of rules and dictionary entries that a realistic system will need, and the fact that there are many constructions whose grammar is poorly understood, in the sense that it is not clear how they should be represented, or what rules should be used to describe them. This is the case even for English, which has been extensively studied, and for which there are detailed descriptions -- both traditional `descriptive' and theoretically sophisticated -- some of which are written with computational usability in mind. It is an even worse problem for other languages. Moreover, even where there is a reasonable description of a phenomenon or construction, producing a description which is sufficiently precise to be used by an automatic system raises non-trivial problems.
Lexical holes are cases where one language has to use a phrase to express what another language expresses in a single word. Examples of this include the 'hole' that exists in English with respect to French ignorer ('to not know', 'to be ignorant of') and se suicider ('to suicide', i.e. 'to commit suicide', 'to kill oneself'). The problems raised by such lexical holes have a certain similarity to those raised by idioms: in both cases, one has phrases translating as single words.
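A transfer lexicon handles such lexical holes simply by allowing a single source word to map onto a whole target phrase. The sketch below only encodes the two examples given in the text; the lookup function itself is an assumption made for illustration.

```python
# Sketch of a transfer lexicon in which one French word maps onto an English phrase,
# covering the "lexical hole" examples mentioned above. Only the two entries given
# in the text are included; the lookup function is an illustrative assumption.

french_to_english = {
    "ignorer": "to not know",            # no single-word English equivalent
    "se suicider": "to commit suicide",  # a reflexive verb rendered as a phrase
}

def transfer(french_item):
    """Look up a French lexical item and return its (possibly multi-word) English rendering."""
    return french_to_english.get(french_item, "<no entry>")

print(transfer("ignorer"))
print(transfer("se suicider"))
```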
Morphological, syntactic, and semantic fields are the most relevant for MT. How many different types of ambiguity are there? In the best of all possible worlds (as far as most Natural Language Processing is concerned, anyway) every word would have one and only one meaning. But, as we all know, this is not the case. When a word has more than one meaning, it is said to be lexically ambiguous. When a phrase or sentence can have more than one structure, it is said to be structurally ambiguous. Ambiguity is a pervasive phenomenon in human languages. It is very hard to find words that are not at least two ways ambiguous, and sentences which are (out of context) several ways ambiguous are the rule, not the exception. This is not only problematic because some of the alternatives are unintended (i.e. represent wrong interpretations), but because ambiguities 'multiply'. In the worst case, a sentence containing two words, each of which is two ways ambiguous, may be four ways ambiguous (2 × 2); one with three such words may be eight (2 × 2 × 2) ways ambiguous, and so on. One can, in this way, get very large numbers indeed.
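The multiplication of readings can be made concrete with a few lines of code: each ambiguous word contributes a factor, and the candidate interpretations are the product of all those factors. The sense glosses below are rough illustrations, not dictionary entries.

```python
# Sketch of how lexical ambiguities multiply: two two-way ambiguous words give
# 2 x 2 = 4 combined readings, three give 2 x 2 x 2 = 8, and so on.
# The sense glosses below are rough illustrations, not dictionary entries.
from itertools import product

senses = {
    "use": ["verb: to employ", "noun: employment"],
    "button": ["noun: fastener on clothes", "noun: knob on an apparatus"],
    "can": ["modal verb: be able to", "noun: metal container"],
}

readings = list(product(*senses.values()))
print(len(readings))            # 2 * 2 * 2 = 8 candidate combinations
for combination in readings[:3]:
    print(combination)          # a few of the readings a system must choose between
```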
Imagine that we are trying to translate these two sentences into French: "You must not use abrasive cleaners on the printer casing" and "The use of abrasive cleaners on the printer casing is not recommended". In the first sentence use is a verb, and in the second a noun; that is, we have a case of lexical ambiguity. An English-French dictionary will say that the verb can be translated by (inter alia) se servir de and employer, whereas the noun is translated as emploi or utilisation. One way a reader or an automatic parser can find out whether the noun or verb form of use is being employed in a sentence is by working out whether it is grammatically possible to have a noun or a verb in the place where it occurs. For example, in English, there is no grammatical sequence of words which consists of the + V + PP, so of the two possible parts of speech to which use can belong, only the noun is possible in the second sentence. As we have noted earlier, we can give translation engines such information about grammar, in the form of grammar rules. This is useful in that it allows them to filter out some wrong analyses. However, giving our system knowledge about syntax will not allow us to determine the meaning of all ambiguous words. This is because words can have several meanings even within the same part of speech. Take for example the word button. Like the word use, it can be either a verb or a noun. As a noun, it can mean both the familiar small round object used to fasten clothes, as well as a knob on a piece of apparatus. To get the machine to pick out the right interpretation we have to give it information about meaning. In fact, arming a computer with knowledge about syntax, without at the same time telling it something about meaning, can be a dangerous thing. This is because applying a grammar to a sentence can produce a number of different analyses, depending on how the rules have applied, and we may end up with a large number of alternative analyses for a single sentence. Now syntactic ambiguity may coincide with genuine meaning ambiguity, but very often it does not, and it is the cases where it does not that we want to eliminate by applying knowledge about meaning. We can illustrate this with some examples. First, let us show how grammar rules, differently applied, can produce more than one syntactic analysis for a sentence. One way this can occur is where a word is assigned to more than one category in the grammar. For example, assume that the word cleaning is both an adjective and a verb in our grammar. This will allow us to assign two different analyses to the following sentence: "Cleaning fluids can be dangerous". One of these analyses will have cleaning as a verb, and one will have it as an adjective. In the former (less plausible) case the sense is 'to clean a fluid may be dangerous', i.e. it is about an activity being dangerous. In the latter case the sense is that fluids used for cleaning can be dangerous. Choosing between these alternative syntactic analyses requires knowledge about meaning. It may be worth noting, in passing, that this ambiguity disappears when can is replaced by a verb which shows number agreement by having different forms for third person singular and plural. For example, the following are not ambiguous in this way: "Cleaning fluids is dangerous" has only the sense that the action is dangerous, while "Cleaning fluids are dangerous" has only the sense that the fluids are dangerous.
We have seen that syntactic analysis is useful in ruling out some wrong analyses, and this is another such case, since, by checking for agreement of subject and verb, it is possible to find the correct interpretations. A system which ignored such syntactic facts would have to consider all these examples ambiguous, and would have to find some other way of working out which sense was intended, running the risk of making the wrong choice. For a system with proper syntactic analysis, this problem would arise only in the case of verbs like can which do not show number agreement. Another source of syntactic ambiguity is where whole phrases, typically prepositional phrases, can attach to more than one position in a sentence. For example, in the sentence "Connect the printer to a word processor package with a Postscript interface", the prepositional phrase with a Postscript interface can attach either to the NP the word processor package, meaning "the word processor which is fitted or supplied with a Postscript interface", or to the verb connect, in which case the sense is that the Postscript interface is to be used to make the connection. Notice, however, that this example is not genuinely ambiguous at all: knowledge of what a Postscript interface is (in particular, the fact that it is a piece of software, not a piece of hardware that could be used for making a physical connection between a printer and an office computer) serves to disambiguate. Similar problems arise with "You will require a printer and a word processor with Postscript interfaces", which could mean that the printer and the word processor both need Postscript interfaces, or that only the word processor needs them. This kind of real world knowledge is also an essential component in disambiguating the pronoun it in examples such as the following: "Put the paper in the printer. Then switch it on." In order to work out that it is the printer that is to be switched on, rather than the paper, one needs to use the knowledge of the world that printers (and not paper) are the sort of thing one is likely to switch on. There are other cases where real world knowledge, though necessary, does not seem to be sufficient. The following, where two people are re-assembling a printer, seems to be such an example:

A: Now insert the cartridge at the back.
B: Okay.
A: By the way, did you order more toner today?
B: Yes, I got some when I picked up the new paper.
A: OK, how far have you got?
A: Did you get it fixed?

It is not clear that any kind of real world knowledge will be enough to work out that it in the last sentence refers to the cartridge, rather than the new paper or the toner. All are probably equally reasonable candidates for fixing. What strongly suggests that it should be interpreted as the cartridge is the structure of the conversation: the discussion of the toner and new paper occurs in a digression, which has ended by the time it occurs. Here what one needs is knowledge of the way language is used. This is knowledge which is usually thought of as pragmatic in nature. Analysing the meaning of texts like the above example is important in dialogue translation, which is a long term goal for MT research, but similar problems occur in other sorts of text. Another sort of pragmatic knowledge is involved in cases where the translation of a sentence depends on the communicative intention of the speaker, that is, on the sort of action (the speech act) that the speaker intends to perform with the sentence.
For example, the question "Can you reprogram the printer interface on this printer?" could be a request for action, or a request for information, and this might make a difference to the translation. In some cases, working out which is intended will depend on the non-linguistic situation, but it could also depend on the kind of discourse that is going on; for example, is it a discourse where requests for action are expected, and is the speaker in a position to make such a request of the hearer? In dialogues, such pragmatic information about the discourse can be important for translating the simplest expressions. For example, the right translation of Thank you into French depends on what sort of speech act it follows. Normally, one would expect the translation to be merci. However, if it is uttered in response to an offer, the right translation would be s'il vous plaît ('please').
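A small hedged sketch, again not from the source, shows how the translation of Thank you could be made to depend on the preceding speech act as described above. The speech-act labels and the function name are assumptions introduced only for this illustration.

```python
# Sketch: choose a French rendering of "Thank you" from the preceding speech act.

def translate_thank_you(previous_speech_act):
    """Return a French translation of 'Thank you' given the discourse context."""
    if previous_speech_act == "offer":
        return "s'il vous plaît"   # accepting an offer
    return "merci"                 # the default rendering

print(translate_thank_you("offer"))      # -> s'il vous plaît
print(translate_thank_you("statement"))  # -> merci
```

A real dialogue translation system would first have to recognise the speech act itself, which is exactly the kind of pragmatic analysis discussed above.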
The term machine translation (MT) is normally taken in its restricted and precise meaning of fully automatic translation. However, in this chapter we consider the whole range of tools that may support translation and document production in general, which is especially important when considering the integration of other language processing techniques and resources with MT. We therefore define Machine Translation to include any computer-based process that transforms (or helps a user to transform) written text from one human language into another. We define Fully Automated Machine Translation (FAMT) to be MT performed without the intervention of a human being during the process. Human-Assisted Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a (mono- or bilingual) human for help. Machine-Aided Translation (MAT) is the style of translation in which a human does most of the work but uses one or more computer systems, mainly as resources such as dictionaries and spelling checkers, as assistants. Traditionally, two very different classes of MT have been identified. Assimilation refers to the class of translation in which an individual or organization wants to gather material written by others in a variety of languages and convert it all into his or her own language. Dissemination refers to the class in which an individual or organization wants to broadcast his or her own material, written in one language, in a variety of languages to the world. A third class of translation has also recently become evident. Communication refers to the class in which two or more individuals are in more or less immediate interaction, typically via email or otherwise online, with an MT system mediating between them. Each class of translation has very different features, is best supported by different underlying technology, and is to be evaluated according to somewhat different criteria.
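Purely as a compact summary, here is a small sketch, not part of the source, that records this terminology as Python enumerations; the descriptions simply restate the definitions given above.

```python
# Sketch: the MT terminology defined above, recorded as simple enumerations.
from enum import Enum

class AutomationLevel(Enum):
    FAMT = "fully automated MT, no human intervention during the process"
    HAMT = "human-assisted MT: the system translates, a human helps with difficulties"
    MAT = "machine-aided translation: a human translates, tools such as dictionaries assist"

class TranslationClass(Enum):
    ASSIMILATION = "gather material in many languages and convert it into one's own"
    DISSEMINATION = "broadcast one's own material into many languages"
    COMMUNICATION = "MT mediating a more or less immediate online exchange"

for item in list(AutomationLevel) + list(TranslationClass):
    print(f"{item.name}: {item.value}")
```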
Even the researchers at Georgetown University and IBM who were working towards the first operational systems accepted the long-term limitations of MT in the production of usable translations. More influential was the well-known dissent of Bar-Hillel. In 1960, he published a survey of MT research which was highly critical of the theory-based projects, particularly those investigating interlingua approaches, and which included his demonstration of the non-feasibility of fully automatic high quality translation (FAHQT) in principle. Instead, Bar-Hillel advocated the development of systems specifically designed on the basis of what he called 'man-machine symbiosis', a view which he had first proposed nearly ten years before, when MT was still in its infancy (Bar-Hillel 1951). In these circumstances it is not surprising that the Automatic Language Processing Advisory Committee (ALPAC) set up by the US sponsors of research found that MT had failed by its own criteria, since by the mid-1960s there were clearly no fully automatic systems capable of good quality translation and there was little prospect of such systems in the near future. MT research had not looked at the economic use of existing 'less than perfect' systems, and it had disregarded the needs of translators for computer-based aids.
The list of such applications of 'external' theories is long. It began in the 1950s and 1960s with information theory, categorial grammar, transformational-generative grammar, dependency grammar, and stratificational grammar. In the 1970s and 1980s came MT research based on artificial intelligence, non-linguistic knowledge bases, and formalisms such as Lexical-Functional Grammar, Generalized Phrase Structure Grammar, Head-driven Phrase Structure Grammar, Definite Clause Grammar, Principles and Parameters, and Montague semantics. In the 1990s, neural networks, connectionism, parallel processing, statistical methods, and many more have been added. In nearly every case, it has been found that the 'pure' adoption of the new theory was not as successful as initial trials on small samples appeared to demonstrate. Inevitably the theory had to be adapted to the demands of MT and translation, and in the process it became modified. But innovativeness and idealism must not be discouraged in a field such as MT, where the major problems are so great and all promising approaches must be examined closely. Unfortunately, there has been a tendency throughout the history of MT for the advocates of new approaches to exaggerate their contribution. Many new approaches have been proclaimed as definitive solutions on the basis of small-scale demonstrations with limited vocabulary and limited sentence structures. It is these initial untested claims that must always be treated with great caution. This lesson has been learnt by most MT researchers; no longer do they proclaim imminent breakthroughs.
Within the last ten years, research on spoken translation has developed into a major focus of MT activity. Of course, the idea or dream of translating the spoken word automatically was present from the beginning (Locke 1955), but it has remained a dream until now. Research projects such as those at ATR, CMU and on the Verbmobil project in Germany are ambitious. But they do not make the mistake of attempting to build all-purpose systems. The constraints and limitations are clearly defined through the specification of domains, sublanguages and categories of users. That lesson has been learnt. The potential benefits, even if success is only partial, are clear for all to see, and the fact that such ambitious projects can receive funding is a reflection of the standing of MT in general and a sign that it is no longer suffering from old perceptions.
In the future, much MT research will be oriented towards the development of 'translation modules' to be integrated into general 'office' systems, rather than towards the design of self-contained and independent systems. It is already evident that the range of computer-based translation activities is expanding to embrace any process which results in the production or generation of texts and documents in bilingual and multilingual contexts, and it is quite possible that MT will be seen as the most significant component in the facilitation of international communication and understanding in the future 'information age'. In this respect, the development of MT systems appropriate for electronic mail is an area which ought to be explored. Those systems which are in use (e.g. DP/Translator on CompuServe) were developed for quite different purposes and circumstances. It would be wrong to assume that existing systems are completely adequate for this purpose. They were not designed for the colloquial and often ungrammatical and incomplete dialogue style of the discussion lists on networks.
http://ourworld.compuserve.com/homepages/WJHutchins/MTS-95.htm
In conclusion, I think that Language Technology is a very important element in the development of our society. We must continue to improve Language Technology because it is a newly developed field of science and has not yet been fully explored. From my point of view, human beings could make much deeper use of Language Technology. I think that eventually people will receive the full benefits of this new technology in every aspect of their lives: at work, at home and at play. We just need more time to learn about Language Technology in order to receive these benefits.