This report aims to analyze and clarify the role of Human Language Technologies in the Information Society. To do so, I have based my work not only on the links provided on Abaitua's page, but also on information found through search engines such as Google and Yahoo.
The report has been built by answering the questionnaire found on the page http://sirio.deusto.es/abaitua. Starting from that point, I have tried to develop the topics as far as possible, presenting the information I have gathered and giving the opinions I formed after reading and analyzing it all.
The aim of this study is to inform us about New Technologies and their relation to humanity: our different languages, the problems we have - communication problems caused by multilinguality - and how we try to solve them, through machine translation and improved data management.
We live in a modern world where information is one of the most important things we have and need. All this information is becoming more and more universal, and we need developed technology to spread the relevant data to everybody.
Almost everybody has a computer at home, and the Internet is taking root everywhere, since it is an easy way to communicate with people far away or to find information on many different subjects. This universal use of the web has created many problems that New Technologies must solve through research.
The main objective must be to initiate discussion on the major social issues becoming apparent as information technology becomes more widespread in society. In addition, it should highlight the need for more awareness of the social consequences of the deployment of information and communication technologies. It also has to raise new issues of international concern in relation to the use of particular technologies, and their widespread deployment, without full consideration being given to the possible social consequences.
The overall objective of Human Language Technologies is to support e-business in a global context and to promote a human centered infostructure ensuring equal access and usage opportunities for all. This is to be achieved by developing multilingual technologies and demonstrating exemplary applications providing features and functions that are critical for the realization of a truly user friendly Information Society. Projects address generic and applied RTD from a multi- and cross-lingual perspective, and undertake to demonstrate how language specific solutions can be transferred to and adapted for other languages.
The topics developed in this work are the essentials: Language technologies and the Information Society, Information overload and methods to improve data-management, Language technology and engineering, Multilinguality, Machine Translation and the Assessment of the Contribution of Machine Translation to Linguistic Diversity on the Internet.
In this work I will try to clarify the definitions, successes, failures, and solutions to these problems, and I hope it serves to create a clear vision of what the improvements are and how we can apply all this innovation to our lives.
Information Society: A society characterized by a high level of information intensity in the everyday life of most citizens, in most organizations and workplaces; by the use of common or compatible technology for a wide range of personal, social, educational and business activities, and by the ability to transmit, receive and exchange digital data rapidly between places irrespective of distance.
The HLTCentral web site was established as an on-line information resource on human language technologies and related topics of interest to the HLT community at large. It covers news, R&D, technological and business developments in the field of speech, language, multilinguality, automatic translation, localization and related areas. Its coverage of HLT news and developments is worldwide - with a unique European perspective.
The development and convergence of computer and telecommunication technologies has led to a revolution in the way that we work, communicate with each other, buy goods and use services, and even the way we entertain and educate ourselves. One of the results of this revolution is that large volumes of information will increasingly be held in a form which is more natural for human users than the strictly formatted, structured data typical of computer systems of the past. Information presented in visual images, as sound, and in natural language, either as text or speech, will become the norm. We all deal with computer systems and services, either directly or indirectly, every day of our lives. This is the information age and we are a society in which information is vital to economic, social, and political success as well as to our quality of life. The changes of the last two decades may have seemed revolutionary but, in reality, we are only on the threshold of this new age. There are still many new ways in which the application of telematics and the use of language technology will benefit our way of life, from interactive entertainment to lifelong learning. Although these changes will bring great benefits, it is important that we anticipate difficulties which may arise, and develop ways to overcome them. Examples of such problems are: access to much of the information may be available only to the computer literate and those who understand English; a surfeit of information from which it is impossible to identify and select what is really wanted. Language Engineering can solve these problems.
Information management is the harnessing of the information resources and information capabilities of the organization in order to add and create value both for itself and for its clients or customers. Knowledge management is a framework for designing an organization's goals, structures, and processes so that the organization can use what it knows to learn and to create value for its customers and community. A KM framework involves designing and working with the following elements: categories of organizational knowledge (tacit knowledge, explicit knowledge, cultural knowledge); knowledge processes (knowledge creation, knowledge sharing, knowledge utilization); and organizational enablers (vision and strategy; roles and skills; policies and processes; tools and platforms). IM provides the foundation for KM, but the two are focused differently. IM is concerned with processing and adding value to information, and the basic issues here include access, control, coordination, timeliness, accuracy, and usability. KM is concerned with using knowledge to take action, and the basic issues here include codification, diffusion, practice, learning, innovation, and community building.
David Lewis coined the term "information fatigue syndrome" for what he expects will soon be a recognized medical condition. "Having too much information can be as dangerous as having too little. Among other problems, it can lead to a paralysis of analysis, making it far harder to find the right solutions or make the best decisions." "Information is supposed to speed the flow of commerce, but it often just clogs the pipes."
Every day, approximately 20 million words of technical information are recorded. A reader capable of reading 1000 words per minute would require 1.5 months, reading eight hours every day, to get through one day's output, and at the end of that period he would have fallen 5.5 years behind in his reading.
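These figures can be checked with a quick back-of-the-envelope calculation (a sketch; the constants are those quoted above, and the backlog model is my own simplification):

```python
WORDS_PER_DAY = 20_000_000   # technical words recorded daily (figure from the text)
WPM = 1000                   # reading speed, words per minute
HOURS_PER_DAY = 8            # reading hours per day

words_read_per_day = WPM * 60 * HOURS_PER_DAY              # 480,000 words
days_to_read_one_day = WORDS_PER_DAY / words_read_per_day  # ~41.7 days, ~1.4 months

# While reading, new output keeps arriving, so the backlog grows every day.
backlog_words = days_to_read_one_day * (WORDS_PER_DAY - words_read_per_day)
backlog_years = backlog_words / words_read_per_day / 365

print(f"{days_to_read_one_day:.1f} days (~{days_to_read_one_day / 30:.1f} months)")
print(f"backlog equivalent to ~{backlog_years:.1f} years of reading")
```

The backlog comes out at roughly five years of reading time, the same order of magnitude as the figure quoted.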
Parsing systems that use unification generally fall into two broad (and rather crude) categories. Computational grammars tend to run efficiently, but it is difficult to express linguistic information in them easily. Linguistic grammars tend to run slowly but are efficient at expressing linguistic information. Previous implementations of LFG have usually been interpreters that are typically inefficient, even when implemented in the form of a chart parser, which is recognized as having good efficiency. The present work starts with LFG grammars and lexicons, written in a style that is very recognizably LFG. Grammars are treated as rules and lexicons as facts which are compiled into a Prolog form. In particular, this involves using Prolog's term unification rather than the more usual linguistic unification. Previous systems have implemented linguistic unification on top of Prolog's term unification, with possible speed disadvantages and difficulties in ensuring the correctness of the new unification algorithm. Thus the work has the dual aims of allowing linguistic information to be encoded in a linguistically sophisticated way, while preserving the speed and accuracy of computational grammars.
One of the most important ways in which Language Engineering will have a significant impact is in the use of human language, especially speech, to interface with machines. This improves the usability of systems and services. It will also help to ensure that services can be used not just by the computer literate but by ordinary citizens without special training. This aspect of accessibility is fundamental to a democratic, open, and equitable society in the Information Age. A good example of the type of service which will be available is an automated legal advice service. The accessibility of the justice system to all citizens is becoming a serious problem in many societies where the cost of legal expertise and the process of law prevents all but the very rich, and those qualifying for legal aid, from exercising their legal rights. It will be possible using language based techniques not only to provide advice which is based on an understanding of the problem and an analysis of the relevant body of law, but also to understand a natural language description of the problem and deliver the advice, as a human lawyer would have done, in spoken or printed form. Such a service could be made available through kiosks in court buildings or post offices, for example. This type of application can also be used to inform citizens of social security entitlements and job opportunities, as well as providing a usable, comprehensible interface to more open government. Systems with the capacity to communicate with their users interactively, through human language, available either through access points in public places or in the home, via the telephone network or TV cables, will make it possible to change the nature of our democracy. There will be a potential for participation in the decision-making process through a far greater availability of information in understandable and 'objective' form and through opinion gathering on a very large scale. 
Many people whose lives are affected by disability can be helped through the application of language technology. Computers with an understanding of language, able to listen, see and speak, will offer new opportunities to access services at home and participate in the workplace.
Communication is probably the most obvious use of language. On the other hand, language is also the most obvious barrier to communication. Across cultures and between nations, difficulties arise all the time not only because of the problem of translating accurately from one language to another, but also because of the cultural connotations of words and phrases. A typical example in the European context is the word 'federal' which can mean a devolved form of government to someone who already lives in a federation, but to someone living in a unitary sovereign state, it is likely to mean the imposition of another level of more remote, centralized government. As the application of language knowledge enables better support for translators, with electronic dictionaries, thesauri, and other language resources, and eventually when high quality machine translation becomes a reality, so the barriers will be lowered. Agreements at all levels, whether political or commercial, will be better drafted more quickly in a variety of languages. International working will become more effective with a far wider range of individuals able to contribute. An example of a project which is successfully helping to improve communications in Europe is one which interconnects many of the police forces of northern Europe using a limited, controlled language which can be automatically translated, in real-time. Such a facility not only helps in preventing and detecting international crime, but also assists the emergency services to communicate effectively during a major incident.
Language is the natural means of human communication; the most effective way we have to express ourselves to each other. We use language in a host of different ways: to explain complex ideas and concepts; to manage human resources; to negotiate; to persuade; to make our needs known; to express our feelings; to narrate stories; to record our culture for future generations; and to create beauty in poetry and prose. For most of us language is fundamental to all aspects of our lives.
The use of language is currently restricted. In the main, it is only used in direct communications between human beings and not in our interactions with the systems, services and appliances which we use every day of our lives. Even between humans, understanding is usually limited to those groups who share a common language. In this respect language can sometimes be seen as much a barrier to communication as an aid.
A change is taking place which will revolutionize our use of language and greatly enhance the value of language in every aspect of communication. This change is the result of developments in Language Engineering.
Language Engineering provides ways in which we can extend and improve our use of language to make it a more effective tool. It is based on a vast amount of knowledge about language and the way it works, which has been accumulated through research. It uses language resources, such as electronic dictionaries and grammars, terminology banks and corpora, which have been developed over time. The research tells us what we need to know about language and develops the techniques needed to understand and manipulate it. The resources represent the knowledge base needed to recognize, validate, understand, and manipulate language using the power of computers. By applying this knowledge of language we can develop new ways to help solve problems across the political, social, and economic spectrum.
Language Engineering is a technology which uses our knowledge of language to enhance our application of computer systems: improving the way we interface with them; assimilating, analyzing, selecting, using, and presenting information more effectively; and providing human language generation and translation facilities.
New opportunities are becoming available to change the way we do many things, to make them easier and more effective by exploiting our developing knowledge of language.
When, in addition to accepting typed input, a machine can recognize written natural language and speech, in a variety of languages, we shall all have easier access to the benefits of a wide range of information and communications services, as well as the facility to carry out business transactions remotely, over the telephone or other telematics services.
When a machine understands human language, translates between different languages, and generates speech as well as printed output, we shall have available an enormously powerful tool to help us in many areas of our lives. When a machine can help us quickly to understand each other better, this will enable us to cooperate and collaborate more effectively both in business and in government.
The success of Language Engineering will be the achievement of all these possibilities. Already some of these things can be done, although they need to be developed further. The pace of advance is accelerating and we shall see many achievements over the next few years.
Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition. Computational linguistics has applied and theoretical components.
Applied CL focuses on the practical outcome of modelling human language use. The methods, techniques, tools and applications in this area are often subsumed under the term language engineering or (human) language technology. Although existing CL systems are far from achieving human ability, they have numerous possible applications. The goal is to create software products that have some knowledge of human language. Such products are going to change our lives. They are urgently needed for improving human-machine interaction, since the main obstacle in the interaction between human and computer is a communication problem. Today's computers do not understand our language, while computer languages are difficult to learn and do not correspond to the structure of human thought. Even if the language the machine understands and its domain of discourse are very restricted, the use of human language can increase the acceptance of software and the productivity of its users.
There are many techniques used in Language Engineering and some of these are described below. Speaker Identification and Verification A human voice is as unique to an individual as a fingerprint. This makes it possible to identify a speaker and to use this identification as the basis for verifying that the individual is entitled to access a service or a resource. The types of problems which have to be overcome are, for example, recognizing that the speech is not recorded, selecting the voice through noise (either in the environment or the transfer medium), and identifying reliably despite temporary changes (such as caused by illness).
Speech Recognition The sound of speech is received by a computer in analogue wave forms which are analyzed to identify the units of sound (called phonemes) which make up words. Statistical models of phonemes and words are used to recognize discrete or continuous speech input. The production of quality statistical models requires extensive training samples (corpora) and vast quantities of speech have been collected, and continue to be collected, for this purpose. There are a number of significant problems to be overcome if speech is to become a commonly used medium for dealing with a computer. The first of these is the ability to recognize continuous speech rather than speech which is deliberately delivered by the speaker as a series of discrete words separated by a pause. The next is to recognize any speaker, avoiding the need to train the system to recognize the speech of a particular individual. There is also the serious problem of the noise which can interfere with recognition, either from the environment in which the speaker uses the system or through noise introduced by the transmission medium, the telephone line, for example. Noise reduction, signal enhancement and key word spotting can be used to allow accurate and robust recognition in noisy environments or over telecommunication networks. Finally, there is the problem of dealing with accents, dialects, and language spoken, as it often is, ungrammatically.
Character and Document Image Recognition Recognition of written or printed language requires that a symbolic representation of the language is derived from its spatial form of graphical marks. For most languages this means recognizing and transforming characters. There are two cases of character recognition: recognition of printed images, referred to as Optical Character Recognition (OCR), and recognition of handwriting, usually known as Intelligent Character Recognition (ICR). OCR from a single printed font family can achieve a very high degree of accuracy. Problems arise when the font is unknown or very decorative, or when the quality of the print is poor. In these difficult cases, and in the case of handwriting, good results can only be achieved by using ICR. This involves word recognition techniques which use language models, such as lexicons or statistical information about word sequences. Document image analysis is closely associated with character recognition but involves the analysis of the document to determine firstly its make-up in terms of graphics, photographs, separating lines and text, and then the structure of the text to identify headings, sub-headings, captions etc. in order to be able to process the text effectively.
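As a minimal sketch of the lexicon-based word recognition that ICR relies on, the following chooses the dictionary word closest (by edit distance) to a noisy recognizer output; the tiny lexicon is invented for illustration:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical mini-lexicon acting as the language model.
LEXICON = ["language", "engineering", "recognition", "handwriting"]

def recognize(noisy: str) -> str:
    """Return the lexicon word nearest to the noisy character string."""
    return min(LEXICON, key=lambda w: edit_distance(noisy, w))

print(recognize("recogmtion"))   # -> recognition
```

Real ICR systems also weight candidates with statistical information about word sequences, which this sketch omits.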
Natural Language Understanding The understanding of language is obviously fundamental to many applications. However, perfect understanding is not always a requirement. In fact, gaining a partial understanding is often a very useful preliminary step in the process because it makes it possible to be intelligently selective about taking the depth of understanding to further levels. Shallow or partial analysis of texts is used to obtain a robust initial classification of unrestricted texts efficiently. This initial analysis can then be used, for example, to focus on 'interesting' parts of a text for a deeper semantic analysis which determines the content of the text within a limited domain. It can also be used, in conjunction with statistical and linguistic knowledge, to identify linguistic features of unknown words automatically, which can then be added to the system's knowledge. Semantic models are used to represent the meaning of language in terms of concepts and relationships between them. A semantic model can be used, for example, to map an information request to an underlying meaning which is independent of the actual terminology or language in which the query was expressed. This supports multi-lingual access to information without a need to be familiar with the actual terminology or structuring used to index the information. Combinations of analysis and generation with a semantic model allow texts to be translated. At the current stage of development, applications where this can be achieved need to be limited in vocabulary and concepts so that adequate Language Engineering resources can be applied. Templates for document structure, as well as common phrases with variable parts, can be used to aid generation of a high quality text.
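A toy illustration of such a semantic model, with invented terms and concept labels: surface words from two languages map onto the same language-independent concepts, so a query in either language retrieves the same documents:

```python
# Surface term -> language-independent concept (all entries invented).
CONCEPTS = {
    "car": "VEHICLE", "automobile": "VEHICLE", "coche": "VEHICLE",
    "price": "COST", "precio": "COST",
}

# Documents indexed by the concepts they cover, not by surface words.
DOCUMENTS = {
    "doc1": {"VEHICLE", "COST"},
    "doc2": {"COST"},
}

def retrieve(query: str):
    """Map query words to concepts, return documents covering them all."""
    wanted = {CONCEPTS[w] for w in query.lower().split() if w in CONCEPTS}
    return [d for d, concepts in DOCUMENTS.items() if wanted <= concepts]

print(retrieve("car price"))      # English query
print(retrieve("precio coche"))   # Spanish query, same result
```

Both queries return the same document because retrieval happens at the concept level, which is the point the text makes about multilingual access.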
Natural Language Generation A semantic representation of a text can be used as the basis for generating language. An interpretation of basic data or the underlying meaning of a sentence or phrase can be mapped into a surface string in a selected fashion; either in a chosen language or according to stylistic specifications by a text planning system.
Speech Generation Speech is generated from filled templates, by playing 'canned' recordings or concatenating units of speech (phonemes, words) together. Speech generated has to account for aspects such as intensity, duration and stress in order to produce a continuous and natural response. Dialogue can be established by combining speech recognition with simple generation, either from concatenation of stored human speech components or synthesizing speech using rules. Providing a library of speech recognizers and generators, together with a graphical tool for structuring their application, allows someone who is neither a speech expert nor a computer programmer to design a structured dialogue which can be used, for example, in automated handling of telephone calls.
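The kind of structured dialogue described above can be sketched as a small finite-state script that combines a (stubbed) keyword-spotting recognizer with template-based generation; the states, prompts and keywords here are all invented:

```python
# Each state has a prompt template and keywords leading to other states.
DIALOGUE = {
    "start":    {"prompt": "Say 'balance' or 'transfer'.",
                 "balance": "balance", "transfer": "transfer"},
    "balance":  {"prompt": "Your balance is {balance} euros. Goodbye."},
    "transfer": {"prompt": "How much would you like to transfer?"},
}

def run_turn(state: str, heard: str) -> str:
    """Move to the next state based on a keyword spotted in the input."""
    node = DIALOGUE[state]
    for keyword, nxt in node.items():
        if keyword != "prompt" and keyword in heard.lower():
            return nxt
    return state   # nothing recognized: stay in the state and re-prompt

state = run_turn("start", "I'd like my balance please")
print(DIALOGUE[state]["prompt"].format(balance=120))
```

In a real call-handling system the `heard` string would come from a speech recognizer and the prompt would be spoken by a speech generator; the dialogue structure itself is the part a non-programmer could design with a graphical tool.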
Language resources are essential components of Language Engineering. They are one of the main ways of representing the knowledge of language, which is used for the analytical work leading to recognition and understanding. The work of producing and maintaining language resources is a huge task. Resources are produced, according to standard formats and protocols to enable access, in many EU languages, by research laboratories and public institutions. Many of these resources are being made available through the European Language Resources Association (ELRA).
Lexicons A lexicon is a repository of words and knowledge about those words. This knowledge may include details of the grammatical structure of each word (morphology), its sound structure (phonology), and the meaning of the word in different textual contexts, e.g. depending on the word or punctuation mark before or after it. A useful lexicon may have hundreds of thousands of entries. Lexicons are needed for every language of application.
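A lexicon entry of the kind described might be sketched as follows; the field names and the sample entry are illustrative, not taken from any real lexicon standard:

```python
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    lemma: str
    pos: str                                          # part of speech
    morphology: dict = field(default_factory=dict)    # e.g. inflected forms
    phonology: str = ""                               # e.g. a transcription
    senses: dict = field(default_factory=dict)        # context -> meaning

# One illustrative entry: context-dependent senses of "bank".
bank = LexicalEntry(
    lemma="bank", pos="noun",
    morphology={"plural": "banks"},
    phonology="/bank/",
    senses={"finance": "institution handling money",
            "geography": "side of a river"},
)

lexicon = {entry.lemma: entry for entry in [bank]}
print(lexicon["bank"].senses["finance"])
```

A real lexicon would hold hundreds of thousands of such entries, as the text notes, typically stored in a database rather than in memory.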
Specialist Lexicons There are a number of special cases which are usually researched and produced separately from general purpose lexicons: Proper names: Dictionaries of proper names are essential to effective understanding of language, at least so that they can be recognized within their context as places, objects, persons, or animals. They take on a special significance in many applications, however, where the name is key to the application such as in a voice operated navigation system, a holiday reservations system, or a railway timetable information system, based on automated telephone call handling. Terminology: In today's complex technological environment there are a host of terminologies which need to be recorded, structured and made available for language enhanced applications. Many of the most cost-effective applications of Language Engineering, such as multi-lingual technical document management and machine translation, depend on the availability of the appropriate terminology banks. Wordnets: A wordnet describes the relationships between words; for example, synonyms, antonyms, collective nouns, and so on. These can be invaluable in such applications as information retrieval, translator workbenches and intelligent office automation facilities for authoring.
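A toy wordnet fragment, with invented word pairs, showing how typed relations between words can support synonym expansion in an application such as information retrieval:

```python
# Word pairs labelled with the relation that links them (invented data).
RELATIONS = {
    ("big", "large"): "synonym",
    ("big", "small"): "antonym",
    ("large", "huge"): "synonym",
}

def synonyms(word: str) -> set:
    """Collect every word linked to `word` by a synonym relation."""
    out = set()
    for (a, b), rel in RELATIONS.items():
        if rel == "synonym":
            if a == word:
                out.add(b)
            elif b == word:
                out.add(a)
    return out

print(synonyms("big"))     # {'large'}
print(synonyms("large"))   # {'big', 'huge'}
```

A retrieval system could add these synonyms to a user's query so that documents using "large" still match a search for "big".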
Grammars A grammar describes the structure of a language at different levels: word (morphological grammar), phrase, sentence, etc. A grammar can deal with structure both in terms of surface (syntax) and meaning (semantics and discourse).
Corpora A corpus is a body of language, either text or speech, which provides the basis for: analysis of language to establish its characteristics; training a machine, usually to adapt its behavior to particular circumstances; verifying empirically a theory concerning language; and a test set for a Language Engineering technique or application to establish how well it works in practice. There are national corpora of hundreds of millions of words but there are also corpora which are constructed for particular purposes. For example, a corpus could comprise recordings of car drivers speaking to a simulation of a control system, which recognizes spoken commands, which is then used to help establish the user requirements for a voice operated control system for the market.
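A minimal example of the first of these uses, establishing the characteristics of a (tiny, invented) corpus through word-frequency counts:

```python
from collections import Counter
import re

# A miniature corpus of spoken-command transcriptions (invented).
corpus = [
    "turn the radio on",
    "turn the heating off",
    "turn on the lights",
]

# Tokenize each line into lowercase words and count them.
tokens = [w for line in corpus for w in re.findall(r"[a-z]+", line.lower())]
freq = Counter(tokens)
print(freq.most_common(3))
```

Even on three sentences the counts reveal a characteristic of command language ("turn" and "the" dominate); on a corpus of millions of words the same technique underpins the statistical models mentioned earlier.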
Natural language processing (NLP) is the formulation and investigation of computationally effective mechanisms for communication through natural language. This involves natural language generation and understanding. An architecture that contains either one will be considered as containing NLP. If the user can communicate with it using natural language then it is clear that the architecture has NLP. It is true that some of the architectures can, theoretically, be programmed in such a way so as to provide NLP. This potential capability is not enough for our criteria; we must have an actual implementation of the architecture showing how it does NLP.
A software system providing a working environment for a human translator, which offers a range of aids such as on-line dictionaries, thesauri, translation memories, etc.
Software which parses language to a point where a rudimentary level of understanding can be realized; this is often used in order to identify passages of text which can then be analyzed in further depth to fulfil the particular objective
A means to represent the rules used in the establishment of a model of linguistic knowledge
The process of aligning different language versions of a text in order to be able to identify equivalent terms, phrases, or expressions
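A much simplified version of such alignment, in the spirit of length-based methods like Gale-Church: dynamic programming over 1-1 and unmatched-sentence beads, with a cost based on character-length difference. The two short texts are invented:

```python
def align(src, tgt, skip_penalty=10):
    """Align sentence lists by length similarity (simplified Gale-Church)."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:   # 1-1 bead: pair the two sentences
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "1-1")
            if i < n:             # source sentence with no counterpart
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "1-0")
            if j < m:             # target sentence with no counterpart
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "0-1")
    pairs, i, j = [], n, m        # trace back the cheapest path
    while (i, j) != (0, 0):
        pi, pj, bead = back[i][j]
        if bead == "1-1":
            pairs.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return list(reversed(pairs))

english = ["The cat sleeps.", "It is raining."]
spanish = ["El gato duerme.", "Esta lloviendo."]
print(align(english, spanish))
```

Real aligners model length ratios statistically and allow 2-1 and 1-2 beads as well, but the dynamic-programming skeleton is the same.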
Facilities provided in conjunction with word processing to aid the author of documents, typically including an on-line dictionary and thesaurus, spell-, grammar-, and style-checking, and facilities for structuring, integrating and linking documents
Language which has been designed to restrict the number of words and the structure of the language used (also known as artificial language), in order to make language processing easier; typical users of controlled language work in an area where precision of language and speed of response is critical, such as the police and emergency services, aircraft pilots, air traffic control, etc.
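A toy controlled-language checker, flagging sentences that use words outside an approved vocabulary or exceed a length limit (the two restrictions mentioned above); the vocabulary is invented:

```python
# Hypothetical approved vocabulary and sentence-length limit.
APPROVED = {"stop", "the", "vehicle", "at", "checkpoint", "two",
            "send", "unit"}
MAX_WORDS = 8

def check(sentence: str):
    """Return a list of problems; an empty list means the sentence conforms."""
    words = sentence.lower().rstrip(".").split()
    problems = [w for w in words if w not in APPROVED]
    if len(words) > MAX_WORDS:
        problems.append(f"too long ({len(words)} words)")
    return problems

print(check("Stop the vehicle at checkpoint two."))   # [] -> acceptable
print(check("Kindly halt the automobile."))           # flags unknown words
```

Restricting authors to sentences that pass such a check is what makes real-time automatic translation feasible for services like the police network mentioned earlier.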
Usually applied to the area of application of the language enabled software e.g. banking, insurance, travel, etc.; the significance in Language Engineering is that the vocabulary of an application is restricted so the language resource requirements are effectively limited by limiting the domain of application
When discussing the relevance of technological training in the translation curricula, it is important to clarify the factors that make technology more indispensable and show how the training should be tuned accordingly. The relevance of technology will depend on the medium that contains the text to be translated. This particular aspect is becoming increasingly evident with the rise of the localization industry, which deals solely with information in digital form. There may be no other imaginable means for approaching the translation of such things as on-line manuals in software packages or CD-ROMs with technical documentation than computational ones.
With the exception of a few eccentrics or maniacs, it will be rare in the future to see good professional interpreters and literary translators not using more or less sophisticated and specialized tools for their jobs, comparable to the familiarization with tape recorders or typewriters in the past. In any case, this may be something best left to the professional to decide, and may not be indispensable. It is clear that word processors, on-line dictionaries and all sorts of background documentation, such as concordances or collated texts, besides e-mail or other ways of network interaction with colleagues around the world, may substantially help the literary translator's work.
Information of many types is rapidly changing format and going digital. Electronic documentation is the natural realm for the incorporation of translation technology. This is something that young students of translation must learn. As the conception and design of technical documentation become progressively influenced by the electronic medium, they are integrating more and more with the whole concept of a software product. The strategies and means for translating both software packages and electronic documents are becoming very similar, and both are now, as we will see, the goal of the localization industry.
The main focus of the localization industry is to help software publishers, hardware manufacturers and telecommunications companies with versions of their software, documentation, marketing, and Web-based information in different languages for simultaneous worldwide release. I believe this, because translation capacity is very important for this sector.
Globalization: The adaptation of marketing strategies to regional requirements of all kinds (e.g., cultural, legal, and linguistic).
Internationalization: The engineering of a product (usually software) to enable efficient adaptation of the product to local requirements.
Localization: The adaptation of a product to a target language and culture (locale).
In the localization industry, the utilization of technology is congenital, and developing adequate tools has immediate economic benefits.
Unlike traditional translators, software localizers may be engaged in early stages of software development, as there are issues, such as platform portability, code exchange, format conversion, etc. which if not properly dealt with may hinder product internationalization. Localizers are often involved in the selection and application of utilities that perform code scanning and checking, that automatically isolate and suggest solutions to National Language Support (NLS) issues, which save time during the internationalization enabling process. There are run-time libraries that enable software developers and localizers to create single-source, multilingual, and portable cross-platform applications. Unicode support is also fundamental for software developers who work with multilingual texts, as it provides a consistent coding format for international character sets. In the words of Rose Lockwood (Language International 10.5), a consultant from Equipe Consortium Ltd, "as traditional translation methods give way to language engineering and disciplined authoring, translation and document-management methods, the role of technically proficient linguists and authors will be increasingly important to global WWW. The challenge will be to employ the skills used in conventional technical publishing in the new environment of a digital economy."
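The point about Unicode providing a consistent coding format for international character sets can be illustrated with a short sketch (Python is used here purely for illustration; the example word is an arbitrary choice): the same accented word has different byte representations under a legacy code page and under UTF-8, and a tool that guesses the wrong encoding garbles the text.

```python
# The same word under two encodings; the example word is an assumption.
text = "château"

latin1_bytes = text.encode("latin-1")   # legacy Western European code page
utf8_bytes = text.encode("utf-8")       # Unicode encoding

# The byte sequences differ, so a tool that guesses wrong garbles the text.
assert latin1_bytes != utf8_bytes

# Decoding UTF-8 bytes as if they were Latin-1 produces mojibake.
garbled = utf8_bytes.decode("latin-1")
round_trip = utf8_bytes.decode("utf-8")
print(garbled)       # the accented letter has become two junk characters
print(round_trip)    # the text survives intact
```

This is exactly the class of National Language Support problem that localizers must catch early: once every tool in the chain agrees on Unicode, the guessing step disappears.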
Leaving behind the old conception of a monolithic compact translation engine, the industry is now moving in the direction of integrating systems: "In the future Trados will offer solutions that provide enterprise-wide applications for multilingual information creation and dissemination, integrating logistical and language-engineering applications into smooth workflow that spans the globe," says Trados manager Henri Broekmate. Logos, the veteran translation technology provider, has announced "an integrated technology-based translation package, which will combine term management, TM, MT and related tools to create a seamless full service localization environment." Other software manufacturers also in the race are Corel, Star, IBM, and the small but belligerent Spanish company Atril. This approach for integrating different tools is largely the view advocated by many language-technology specialists. Below is a description of an ideal engine which captures the answers given by Muriel Vasconcellos (from the Pan American Health Organization), Minako O'Hagan (author of The Coming Age of Teletranslations) and Eduard Hovy (President of the Association of Machine Translation in the Americas) to a recent survey (by Language International 10.6). The ideal workstation for the translator would combine the following features: Full integration in the translator's general working environment, which comprises the operating system, the document editor (hypertext authoring, desktop publisher or the standard word-processor), as well as the emailer or the Web browser. These would be complemented with a wide collection of linguistic tools: from spell, grammar and style checkers to on-line dictionaries, and glossaries, including terminology management, annotated corpora, concordances, collated texts, etc. 
The system should comprise all advances in machine translation (MT) and translation memory (TM) technologies, be able to perform batch extraction and reuse of validated translations, and enable searches into TM databases by various keywords (such as phrases, authors, or issuing institutions). These TM databases could be distributed and accessible through the Internet. There is a new standard for TM exchange (TMX) that would permit translators and companies to work remotely and share memories in real-time. Eduard Hovy underlines the need for a genre detector. "We need a genre topology, a tree of more or less related types of text and ways of recognizing and treating the different types computationally." He also sees the difficulty of constantly up-dating the dictionaries and suggests a "restless lexicon builder that crawls all over the Web every night, ceaselessly collecting words, names, and phrases, and putting them into the appropriate lexicons." Muriel Vasconcellos pictures her ideal design of the workstation in the following way:
A good view of the source text, extensive enough to offer the overall context, including the previous sentence and two or three sentences after the current one.
Relevant on-line topical word lists, glossaries and thesauri. These should be immediately accessible and, in the case of topical lists, there should be an optional switch that shows, possibly in color, when there are subject-specific entries available.
Three target-text windows. The first would be the main working area, and it would start by providing a sentence from the original document (or a machine pre-translation), which could be over-struck or quickly deleted to allow the translator to work from scratch. The original text or pre-translation could be switched off. Characters of any language and other symbols should be easy to produce. Drag-and-drop is essential and editing macros are extremely helpful when overstriking or translating from scratch.
The second window would offer translation memory when it is available. The TM should be capable of fuzzy matching with a very large database, with the ability to include the organization's past texts if they are in some sort of electronic form. The third window would provide a raw machine translation which should be easy to paste into the target document. The grammar checker can be tailored so that it is not so sensitive. It would be ideal if one could write one's own grammar rules. The above lines depict a view of a translation environment which is closer to more traditional needs of the translator than to current requirements of the industry. Many aspects of software localization have not been considered, particularly the concepts of multilingual management and document-life monitoring. Corporations are now realizing that documentation is an integral part of the production line where the distinction between product, marketing and technical material is becoming more and more blurred. Product documentation is gaining importance in the whole process of product development with direct impact on time-to-market. Software engineering techniques that apply in other phases of software development are beginning to apply to document production as well. The appraisal of national and international standards of various types is also significant: text and character coding standards (e.g. SGML/XML and Unicode), as well as translation quality control standards (e.g. DIN 2345 in Germany, or UNI 10574 in Italy). In response to these new challenges, localization packages are now being designed to assist users throughout the whole life cycle of a multilingual document. These take them through job setup, authoring, translation preparation, translation, validation, and publishing, besides ensuring consistency and quality in source and target language variants of the documentation. 
New systems help developers monitor different versions, variants and languages of product documentation, and author customer specific solutions. An average localization package today will normally consist of an industry standard SGML/XML editor (e.g. ArborText), a translation and terminology toolkit (e.g. Trados Translator's Workbench), and a publishing engine (e.g. Adobe's Frame+SGML).
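The fuzzy matching against a large translation-memory database described earlier can be sketched in a few lines. This is an illustrative toy only: the TM entries, threshold value and function name are invented for the example, and it uses Python's standard difflib rather than a real TM engine.

```python
import difflib

# Hypothetical translation memory: source segments -> validated translations.
tm = {
    "The printer is out of paper.": "L'imprimante n'a plus de papier.",
    "Turn off the printer before cleaning.": "Eteignez l'imprimante avant le nettoyage.",
}

def fuzzy_lookup(segment, memory, threshold=0.7):
    """Return the best (score, source, target) match at or above threshold."""
    best = None
    for source, target in memory.items():
        score = difflib.SequenceMatcher(None, segment, source).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best

# A new segment that almost matches a stored one is retrieved for reuse.
match = fuzzy_lookup("The printer is out of ink.", tm)
print(match)
```

A production system would index the database rather than scan it linearly, but the principle (retrieve near matches above a similarity threshold and offer their validated translations for reuse) is the same.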
Like cooks, tailors or architects, professional translators need to become acquainted with technology, because good use of technology will make their jobs more competitive and satisfying. But they should not dismiss craftsmanship. Technology enhances productivity, but translation excellence goes beyond technology. It is important to delimit the roles of humans and machines in translation. Martin Kay's (1987) words in this respect are most illustrative:
A computer is a device that can be used to magnify human productivity. Properly used, it does not dehumanize by imposing its own Orwellian stamp on the products of human spirit and the dignity of human labor but, by taking over what is mechanical and routine, it frees human beings for what is essentially human. Translation is a fine and exacting art, but there is much about it that is mechanical and routine. If this were given over to a machine, the productivity of the translator would not only be magnified but this work would become more rewarding, more exciting, more human.
It has taken some 40 years for the specialists involved in the development of MT to realize that the limits of technology arise when going beyond the mechanical and routine aspects of language. From the outside, translation is often seen as a mere mechanical process, no more complex than playing chess, for example. If computers have been programmed with the capacity to beat a chess champion such as Kasparov, why should they not be capable of performing translation of the highest quality? Few people are aware of the complexity of literary translation. Douglas Hofstadter (1998) depicts this well:
A skilled literary translator makes a far larger number of changes, and far more significant changes, than any virtuoso performer of classical music would ever dare to make in playing notes in the score of, say, a Beethoven piano sonata. In literary translation, it's totally humdrum stuff for new ideas to be interpreted, old ideas to be deleted, structures to be inverted, twisted around, and on and on.
Consultant: A person that is sufficiently informed to advise potential users of translation technology. This person should be able to find out when and how technology may be useful or cost-effective; how to find out the most adequate tools or where to get the necessary information to come up with an answer. That is, a person that has read at least one paper like this, or knows where to find the basic relevant literature and references.
User: A person that has sufficient technological training to be efficient not only using the computer but also any specialized translation software with a minimally standard way of working.
Instructor: A person that can both assess and use the technology is, with a little more experience, also capable of training other people. Teaching requires some confidence with hardware and software, so it would be desirable for the instructor to also be a regular computer user.
Evaluator: Evaluating the technology requires a little more expertise than being a consultant. An evaluator would be able to analyze how good or bad particular software is. Therefore, some experience in software evaluation in general, and in translation technology in particular, is recommendable.
Manager: A person that has the responsibility to make a translation or localization company profitable should have quite some experience in using and testing translation technology. That person should also be able to design an optimal distribution between human and machine resources; and should know what kind of professionals the company needs (translators, computational linguists, or software engineers), as well as how to acquire the most appropriate technological infrastructure.
Developer: Localization software very often needs customizing, integration or up-dating. Good professionals may be involved in software development, where both linguistic and technical skills may be required.
(i) Problems of ambiguity, (ii) problems that arise from structural and lexical differences between languages, and (iii) multiword units like idioms and collocations. We will discuss typical problems of ambiguity first, then lexical and structural mismatches, and finally multiword units. Of course, these sorts of problem are not the only reasons why MT is hard. Other problems include the sheer size of the undertaking, as indicated by the number of rules and dictionary entries that a realistic system will need, and the fact that there are many constructions whose grammar is poorly understood, in the sense that it is not clear how they should be represented, or what rules should be used to describe them. This is the case even for English, which has been extensively studied, and for which there are detailed descriptions -- both traditional `descriptive' and theoretically sophisticated -- some of which are written with computational usability in mind. It is an even worse problem for other languages. Moreover, even where there is a reasonable description of a phenomenon or construction, producing a description which is sufficiently precise to be used by an automatic system raises non-trivial problems.
The morphological, syntactic and semantic fields are the most relevant for MT.
In the best of all possible worlds (as far as most Natural Language Processing is concerned, anyway) every word would have one and only one meaning. But, as we all know, this is not the case. When a word has more than one meaning, it is said to be lexically ambiguous. When a phrase or sentence can have more than one structure it is said to be structurally ambiguous. Ambiguity is a pervasive phenomenon in human languages. It is very hard to find words that are not at least two ways ambiguous, and sentences which are (out of context) several ways ambiguous are the rule, not the exception. This is not only problematic because some of the alternatives are unintended (i.e. represent wrong interpretations), but because ambiguities `multiply'. In the worst case, a sentence containing two words, each of which is two ways ambiguous, may be four (2 x 2) ways ambiguous; one with three such words may be eight (2 x 2 x 2) ways ambiguous, etc. One can, in this way, get very large numbers indeed.
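The way ambiguities multiply can be made concrete with a small sketch; the per-word sense counts below are invented for illustration:

```python
from math import prod
from itertools import product

# Invented sense counts for the words of a toy sentence.
senses = {"prices": 2, "rose": 2, "quickly": 1, "in": 1, "the": 1, "market": 2}

# Worst case: the number of readings is the product over all words.
worst_case = prod(senses.values())
print(worst_case)  # 2 * 2 * 1 * 1 * 1 * 2 = 8

# Enumerating the combinations makes the multiplication concrete.
readings = list(product(*(range(n) for n in senses.values())))
assert len(readings) == worst_case
```

With realistic dictionaries, where common words easily carry five or ten senses each, this product grows explosively, which is why disambiguation is so central to MT.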
Prices rose quickly in the market.
Each of the words prices, rose, and market can be either a noun or a verb; however, quickly is unambiguously an adverb and the unambiguously a definite article, and these facts ensure a unique phrase-structure analysis, in which prices is identified as a subject noun phrase, in the market as a prepositional phrase, and rose quickly as part of a verb phrase.
We can illustrate this with some examples. First, let us show how grammar rules, differently applied, can produce more than one syntactic analysis for a sentence. One way this can occur is where a word is assigned to more than one category in the grammar. For example, assume that the word cleaning is both an adjective and a verb in our grammar. This will allow us to assign two different analyses to the following sentence:
Cleaning fluids can be dangerous.
One of these analyses will have cleaning as a verb, and one will have it as an adjective. In the former (less plausible) case the sense is `to clean a fluid may be dangerous', i.e. it is about an activity being dangerous. In the latter case the sense is that fluids used for cleaning can be dangerous. Choosing between these alternative syntactic analyses requires knowledge about meaning.
It may be worth noting, in passing, that this ambiguity disappears when can is replaced by a verb which shows number agreement by having different forms for third person singular and plural. For example, the following are not ambiguous in this way:
Cleaning fluids is dangerous (has only the sense that the action is dangerous).
Cleaning fluids are dangerous (has only the sense that the fluids are dangerous).
We have seen that syntactic analysis is useful in ruling out some wrong analyses, and this is another such case, since, by checking for agreement of subject and verb, it is possible to find the correct interpretations. A system which ignored such syntactic facts would have to consider all these examples ambiguous, and would have to find some other way of working out which sense was intended, running the risk of making the wrong choice. For a system with proper syntactic analysis, this problem would arise only in the case of verbs like can which do not show number agreement.
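As a toy illustration of how agreement checking rules out analyses, the following sketch uses an invented mini-grammar of just two readings and two verb forms, and keeps only the readings whose subject number agrees with the verb:

```python
# Invented mini-grammar: in the verb reading of "cleaning fluids is/are
# dangerous", the gerund subject ("cleaning fluids") counts as singular;
# in the adjective reading, the head noun "fluids" is plural.
NUMBER = {"verb": "singular", "adjective": "plural"}
AGREES_WITH = {"is": "singular", "are": "plural"}

def compatible(reading, verb_form):
    """Keep a reading only if its subject number agrees with the verb."""
    return NUMBER[reading] == AGREES_WITH[verb_form]

for verb in ("is", "are"):
    surviving = [r for r in ("verb", "adjective") if compatible(r, verb)]
    print(verb, surviving)
# "is" keeps only the verb (activity) reading; "are" keeps only the
# adjective (fluids) reading; "can", which shows no agreement, would
# leave both readings alive, i.e. the ambiguity remains.
```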
English chooses different verbs for the action/event of putting on, and the action/state of wearing. Japanese does not make this distinction, but differentiates according to the object that is worn. In the case of English to Japanese, a fairly simple test on the semantics of the NPs that accompany a verb may be sufficient to decide on the right translation. Some of the color examples are similar, but more generally, investigation of color vocabulary indicates that languages actually carve up the spectrum in rather different ways, and that deciding on the best translation may require knowledge that goes well beyond what is in the text, and may even be undecidable. In this sense, the translation of color terminology begins to resemble the translation of terms for cultural artifacts (e.g. words like English cottage, Russian dacha, French château, etc. for which no adequate translation exists, and for which the human translator must decide between straight borrowing, neologism, and providing an explanation). In this area, translation is a genuinely creative act, which is well beyond the capacity of current computers.
A particularly obvious example of this involves problems arising from what are sometimes called lexical holes --- that is, cases where one language has to use a phrase to express what another language expresses in a single word. Examples of this include the `hole' that exists in English with respect to French ignorer (`to not know', `to be ignorant of'), and se suicider (`to suicide', i.e. `to commit suicide', `to kill oneself'). The problems raised by such lexical holes have a certain similarity to those raised by idioms: in both cases, one has phrases translating as single words. We will therefore postpone discussion of these until the discussion of multiword units below.
One kind of structural mismatch occurs where two languages use the same construction for different purposes, or use different constructions for what appears to be the same purpose.
Rather different from idioms are expressions like those below, which are usually referred to as collocations. Here the meaning can be guessed from the meanings of the parts. What is not predictable is the particular words that are used.
This butter is rancid (*sour, *rotten, *stale).
This cream is sour (*rancid, *rotten, *stale).
They took (*made) a walk.
They made (*took) an attempt.
They had (*made, *took) a talk.
For example, the fact that we say rancid butter, but not *sour butter, and sour cream, but not *rancid cream, does not seem to be completely predictable from the meaning of butter or cream, and the various adjectives. Similarly the choice of take as the verb for walk is not simply a matter of the meaning of walk (for example, one can either make or take a journey).
In what we have called linguistic knowledge (LK) systems, at least, collocations can potentially be treated differently from idioms. This is because for collocations one can often think of one part of the expression as being dependent on, and predictable from, the other. For example, one may think that make, in make an attempt, has little meaning of its own, and serves merely to `support' the noun (such verbs are often called light verbs, or support verbs). This suggests one can simply ignore the verb in translation, and have the generation or synthesis component supply the appropriate verb. For example, in Dutch, this would be done, since the Dutch for make an attempt is een poging doen (`do an attempt').
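The idea that the generation component can supply the support verb may be sketched as a simple lookup table; the entries and function name below are illustrative assumptions, not a real lexicon:

```python
# Illustrative support-verb lexicon, keyed by (noun, language).
SUPPORT_VERBS = {
    ("attempt", "en"): "make",
    ("walk", "en"): "take",
    ("attempt", "nl"): "doen",   # een poging doen ('do an attempt')
}

def supply_support_verb(noun, lang):
    """The generation component supplies the conventional verb for the noun."""
    return SUPPORT_VERBS[(noun, lang)]

print(supply_support_verb("attempt", "en"))  # make
print(supply_support_verb("attempt", "nl"))  # doen
```

The point of the design is that the source-language verb is never translated at all: only the noun carries meaning, and each target language contributes its own conventional light verb.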
If Sam mends the bucket, her children will be rich.
If Sam kicks the bucket, her children will be rich.
The problem with idioms, in an MT context, is that it is not usually possible to translate them using the normal rules. There are exceptions, for example take the bull by the horns (meaning `face and tackle a difficulty without shirking') can be translated literally into French as prendre le taureau par les cornes, which has the same meaning. But, for the most part, the use of normal rules in order to translate idioms will result in nonsense. Instead, one has to treat idioms as single units in translation. In many cases, a natural translation for an idiom will be a single word --- for example, the French word mourir (`die') is a possible translation for kick the bucket.
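The strategy of treating idioms as single units can be sketched as a lookup applied before any word-by-word rules; the table and the naive substring matching below are purely illustrative (a real system would also have to handle inflection, e.g. kicked the bucket):

```python
# Illustrative idiom table; longer idioms listed first so they match
# before any shorter entry they might contain.
IDIOMS_EN_FR = {
    "take the bull by the horns": "prendre le taureau par les cornes",
    "kick the bucket": "mourir",
}

def translate_idioms_first(sentence, table):
    """Replace whole idioms as single units before any word-by-word rules."""
    for idiom, translation in table.items():
        sentence = sentence.replace(idiom, translation)
    return sentence

print(translate_idioms_first("He will kick the bucket soon", IDIOMS_EN_FR))
# He will mourir soon
```

The output is of course a mixed-language intermediate, but it shows the essential move: the idiom is consumed as one unit, so the normal rules never get the chance to produce nonsense from its parts.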
The term machine translation (MT) is normally taken in its restricted and precise meaning of fully automatic translation. However, in this chapter we consider the whole range of tools that may support translation and document production in general, which is especially important when considering the integration of other language processing techniques and resources with MT. We therefore define Machine Translation to include any computer-based process that transforms (or helps a user to transform) written text from one human language into another. We define Fully Automated Machine Translation (FAMT) to be MT performed without the intervention of a human being during the process. Human-Assisted Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a (mono- or bilingual) human for help. Machine-Aided Translation (MAT) is the style of translation in which a human does most of the work but uses one or more computer systems, mainly as resources such as dictionaries and spelling checkers, as assistants. Traditionally, two very different classes of MT have been identified. Assimilation refers to the class of translation in which an individual or organization wants to gather material written by others in a variety of languages and convert them all into his or her own language. Dissemination refers to the class in which an individual or organization wants to broadcast his or her own material, written in one language, in a variety of languages to the world. A third class of translation has also recently become evident. Communication refers to the class in which two or more individuals are in more or less immediate interaction, typically via email or otherwise online, with an MT system mediating between them. Each class of translation has very different features, is best supported by different underlying technology, and is to be evaluated according to somewhat different criteria.
Researchers at Georgetown University and IBM were working towards the first operational systems, and they accepted the long-term limitations of MT in the production of usable translations. More influential was the well-known dissent of Bar-Hillel. In 1960, he published a survey of MT research at the time which was highly critical of the theory-based projects, particularly those investigating interlingua approaches, and which included his demonstration of the non-feasibility of fully automatic high quality translation (FAHQT) in principle. Instead, Bar-Hillel advocated the development of systems specifically designed on the basis of what he called 'man-machine symbiosis', a view which he had first proposed nearly ten years before when MT was still in its infancy (Bar-Hillel 1951). In these circumstances it is not surprising that the Automatic Language Processing Advisory Committee (ALPAC) set up by the US sponsors of research found that MT had failed by its own criteria, since by the mid 1960s there were clearly no fully automatic systems capable of good quality translation and there was little prospect of such systems in the near future. MT research had not looked at the economic use of existing 'less than perfect' systems, and it had disregarded the needs of translators for computer-based aids.
The list of such applications of 'external' theories is long. It began in the 1950s and 1960s with information theory, categorial grammar, transformational-generative grammar, dependency grammar, and stratificational grammar. In the 1970s and 1980s came MT research based on artificial intelligence, non-linguistic knowledge bases, and formalisms such as Lexical-Functional Grammar, Generalized Phrase Structure Grammar, Head-driven Phrase Structure Grammar, Definite Clause Grammar, Principles and Parameters, and Montague semantics. In the 1990s, neural networks, connectionism, parallel processing, statistical methods, and many more have been added. In nearly every case, it has been found that the 'pure' adoption of the new theory was not as successful as initial trials on small samples appeared to demonstrate. Inevitably the theory had to be adapted to the demands of MT and translation, and in the process it became modified. But innovativeness and idealism must not be discouraged in a field such as MT where the major problems are so great and all promising approaches must be examined closely. Unfortunately, there has been a tendency throughout the history of MT for the advocates of new approaches to exaggerate their contribution. Many new approaches have been proclaimed as definitive solutions on the basis of small-scale demonstrations with limited vocabulary and limited sentence structures. It is these initial untested claims that must always be treated with great caution. This lesson has been learnt by most MT researchers; no longer do they proclaim imminent breakthroughs.
Within the last ten years, research on spoken translation has developed into a major focus of MT activity. Of course, the idea or dream of translating the spoken word automatically was present from the beginning (Locke 1955), but it has remained a dream until now. Research projects such as those at ATR, CMU and on the Verbmobil project in Germany are ambitious. But they do not make the mistake of attempting to build all-purpose systems. The constraints and limitations are clearly defined by definition of domains, sublanguages and categories of users. That lesson has been learnt. The potential benefits even if success is only partial are clear for all to see, and it is a reflection of the standing of MT in general and a sign that it is no longer suffering from old perceptions that such ambitious projects can receive funding.
In the future, much MT research will be oriented towards the development of `translation modules' to be integrated in general `office' systems, rather than the design of systems to be self-contained and independent. It is already evident that the range of computer-based translation activities is expanding to embrace any process which results in the production or generation of texts and documents in bilingual and multilingual contexts, and it is quite possible that MT will be seen as the most significant component in the facilitation of international communication and understanding in the future `information age'. In this respect, the development of MT systems appropriate for electronic mail is an area which ought to be explored. Those systems which are in use (e.g. DP/Translator on CompuServe) were developed for quite different purposes and circumstances. It would be wrong to assume that existing systems are completely adequate for this purpose. They were not designed for the colloquial and often ungrammatical and incomplete dialogue style of the discussion lists on networks.
Efficient operation: Communication via the Internet is rapid (in some cases instantaneous), powerful (large volumes of traffic can be supported), reliable (messages are delivered with precision), and, once the necessary technological infrastructure and tools are in place, cheap in comparison to alternative channels of communication.
Global extension: The Internet renders geographical distances insignificant, turning the world into a "global village". Consequently, other obstacles to communication acquire greater relevance, including possession of the required technology (hence ultimately economic factors) and cultural differences (particularly language).
Flexible use: A wide and increasing variety of types of communication can be realized via the Internet, transmitting different sorts of content through different media; the only limits are the potential for such content and media to be digitalized, the capacity of current technology to perform such digitalization, and the availability of hardware and communications infrastructure with the required capacities and power.
Electronic form: The electronic nature of the channel is the key element behind the aforementioned features; it also implies other benefits. Anything that can be done electronically can be done via the Internet; hence, more and more of modern technology can employ the same common channel, including the numerous aspects of Information Technology which are beginning to emerge at the present time.
Every language stands on the Internet within a planetary space and face to face with all the other languages there present. Minority languages which have survived as enclaves within nation-states now have to perceive themselves, like all other languages, as standing at a cultural cross-roads, open to multilateral relationships and exchanges.
Some of the minority languages of the EU exist only in minority situations, whether minoritarian in one member-state only, as with Sorbian or Welsh, or minoritarian in two or more member-states, as in the cases of Catalan and Basque. But there are also transfrontier minority languages, where although the language is minoritarian on one side of the border, it also belongs to a large and sometimes powerful language-group possessing its own nation-state on the other side of the border (or further afield), as in the case of the German minorities in Belgium, Denmark or Italy.
We shall have in mind those minority languages which exist only in minority situations, or very small state languages, or languages which fall into each of those categories on two sides of a border. We do not have to worry about the availability of word-processors and Internet browsers, or the creation of linguistic corpora, for German-speaking minorities outside Germany: these exist within the language. But when we come to consider uses of the Internet for communication within and between minority language-groups, German-speaking minorities will certainly find themselves in the same set of regional and minority languages as Frisian or Scottish Gaelic.
Once the hardware and communications infrastructure is in place, the Internet in its present form has many advantages for minority language communities as indeed for all small communities.
The uses to which minority language groups put the Internet may seem at first to be the same as many we find in majority languages, but the significance is often different. Any presentation of the minority language and culture to a world-wide audience by definition breaks new ground since minority language groups, whatever access they may have had to broadcasting within the nation-state, have scarcely ever had the political or economic strength to project themselves outside the state in which they live, or indeed, in many cases, to their fellow citizens in other parts of the same state.
Another ambitious and comprehensive electronic newspaper is the Catalan Vilaweb, founded in 1995 by Vicent Partal and Assumpció Maresme, both experienced journalists. It is an electronic newspaper with a network of local editions which appear in towns and villages throughout the Catalan lands but also in diaspora areas such as Boston and New York, creating a kind of "virtual nation". The site also incorporates a directory of electronic resources in the Catalan language and reaches 90,000 different readers each month. This critical mass of users attracts some international web advertising, and local editions collect their own local advertisements. Indeed the organizational and financial arrangements of Vilaweb are every bit as interesting as the technical ones and could be of interest to other minority languages. A similar network exists in Galicia.
There are many courses teaching minority languages on the Internet. The most ambitious is likely to be HABENET, a three-year project for teaching Basque on the Internet and costing some 1.8m euros. Internet courses in minority languages have new possibilities but also face new challenges. Most face-to-face courses and course materials for learning minority languages assume a knowledge of the local majority language and this is undoubtedly where the main demand will be, on and off the Internet. But it seems to us that there would also be room to develop a multi-media language-learning package that was language-independent or language-adaptable so far as the language of instruction went. Such a course would make each language approachable from any other language at least at an elementary level.
Because of the incorporation of complex components into new versions of the programmes, updates to the localization become more expensive rather than cheaper as time goes on. Moreover, by the time a programme has been localized in small languages such as Basque - which are allocated low priority within Microsoft - new versions of the original are already becoming available in English and some other languages which offer a large market. Finally, what might be thought a major advantage of cooperation with an international company, namely access to its marketing skills and distribution network, does not apply. The Basque version was not important enough to Microsoft for them to be interested in promoting it themselves.
The Basque Government has now looked at information technology needs for the next ten years. Localization is only one kind of action contemplated, and on the whole the assessment of costs and benefits seems to favour other priorities: the development of spelling and grammar checkers, of OCR tools specific to Basque, and of voice recognition software, as well as support for making Basque dictionaries and reference works available for on-line public use. A five-year plan starting this year (2000) is likely to support local companies working in some of these fields. There is also an interest in developing tools for the automatic translation of web-pages.
The Catalan Autonomous Government too entered into an agreement with Microsoft and has appointed a committee of experts to ensure that a strategy is in place so that electronic resources are created in the Catalan language. But the experience from Catalunya we want to look at here is entirely within the non-commercial and voluntary sector.
Marketing is a problem, but, as we have seen, the same was the case with Microsoft software localized into Basque and Catalan. However, given that governmental or voluntary organizations have to do the marketing in each case, there must be some advantage in marketing a free product. The Basque Microsoft programmes, despite the heavy element of subsidy, have had to be purchased by individuals and institutions, including the Basque Government itself.
What is happening now though is that we are reaching a point in the translation industry, where translation technology is becoming an essential part of the translation process rather than a curious experiment. The emergence of internet-based translation services aimed at corporate users is raising expectations too, with the promise of fast turnarounds for translation jobs. People are starting to expect everything at 'internet speed', i.e. translation at the touch of a button. In addition, companies with a global presence are starting to realise the importance of localising their website. This is creating a huge amount of translation work which needs to be turned around virtually immediately, as website content may be updated on a daily basis. The volumes and turnaround times involved are often so high that traditional translation methods just cannot keep up with the demand, or the cost becomes unpalatable. The application of MT and TM in both these contexts has obvious potential, ensuring rapid turnarounds and lower costs, though it should be said that MT is not always suitable for the more marketing-oriented content. Controlling the source text is suddenly more important than ever if we are to make the most of the translation technologies available to us and if quality translations are to be delivered on time. The investment is now balanced against a much higher return.
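The translation-memory (TM) component mentioned above works by storing previously translated segments and, for each new source segment, retrieving the closest stored match above a similarity threshold, so that translators reuse rather than retranslate. The following is only a minimal illustrative sketch of that idea, not any real TM product's behaviour; the example sentence pairs, the `tm_lookup` helper, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

# A toy translation memory: previously translated source/target pairs.
# These example segments are invented for illustration only.
memory = {
    "The file could not be opened.": "El fitxer no s'ha pogut obrir.",
    "Click the button to continue.": "Feu clic al botó per continuar.",
}

def tm_lookup(segment, threshold=0.8):
    """Return (stored translation, similarity score) for the best
    fuzzy match in the memory, or None if nothing is close enough."""
    best_score, best_target = 0.0, None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_target = score, target
    if best_target is not None and best_score >= threshold:
        return best_target, best_score
    return None

# A near-identical segment (only the final period differs) is reused;
# an unrelated segment falls below the threshold and would go to MT
# or to a human translator instead.
print(tm_lookup("The file could not be opened"))
print(tm_lookup("Completely unrelated sentence"))
```

In a real workflow, segments that fall below the threshold are the ones passed on to machine translation or human translators, which is why controlling the source text, as noted above, raises the reuse rate.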
I have studied the main problems and successes of all these themes. That is why I would like to conclude that machines need the help of humans, since they need our support and improvement. I have realized how present New Technologies are in our daily life, with or without our assent.
There are many programmes all over the world aiming to create easier ways of dealing with modern services, such as the Information Society Technologies Programme. These must consider all sectors of society, creating a visible improvement in all of them by producing, for example, systems and services for the citizen, new methods of work and electronic commerce, multimedia content and tools, and essential technologies and infrastructures.
For the private individual, the objective is to meet the need and expectation of high-quality, affordable general-interest services. For enterprises, workers and consumers, the objective is to enable individuals and organisations to innovate and be more effective and efficient in their work, thereby providing the basis for sustainable growth and high added-value employment while also improving the quality of working life. In the sector of multimedia content, the key objective is to confirm each country as a leading force, realising its full potential. For the enabling technologies which are the foundations of the information society, the objective is to drive their development, enhance their applicability and accelerate their take-up in the whole world.