ABSTRACT

This report offers information and details about how language and new technologies are related. You will get to know what an information society is and the importance of language technologies in it. You will also find many details about how language engineering works and what its main techniques are. In addition, the idea of multilinguality as a barrier to communication is developed, together with how machine translation can help solve this problem. Some further details about machine translation, such as its methods and history, are also given.

INTRODUCTION

In order to write this report, I have used the questions assigned in the course "English Language and New Technologies" taught by Joseba Abaitua. The structure I have therefore chosen for the report is that of a questionnaire. That is, I have answered a series of questions so that the reader of this report can get a clear idea of different matters in language technologies.

I have decided to fill the report with varied information on language technologies instead of focusing on one particular area or problem. The reason for this is to be helpful to those people who want to have a general idea of this subject; people who may have no idea of what language technologies consist of and what their main objective is.

The main aim of this report is to gather information about language and new technologies. All the information needed to answer the questions has been found on several Internet pages, which are mentioned in the references.

The report is organized in the following way: the first three questions deal with human language technologies and the information society. The next six questions are related to the problem of having too much information, and to methods for improving data management. Questions ten to fourteen are connected with language technology and engineering. Questions fifteen to twenty-three give some details about the problem of multilinguality and how translation technology can help to solve it. In questions twenty-four to thirty-three you will find information about machine translation. Finally, the objective of the last three questions is to show how machine translation contributes to solving the problem of language diversity.

1. What is the "Information Society"?

Information Society is a term for a society in which the creation, distribution, and manipulation of information have become the most significant economic and cultural activities. An Information Society may be contrasted with societies in which the economic underpinning is primarily industrial or agrarian. The machine tools of the Information Society are computers and telecommunications, rather than lathes or ploughs.

2. What is the role of HLTCentral.org?

The HLTCentral web site was established as an online information resource on human language technologies and related topics of interest to the HLT community at large. It covers news, R&D, and technological and business developments in the fields of speech, language, multilinguality, automatic translation, localisation and related areas. Its coverage of HLT news and developments is worldwide, with a unique European perspective.

Two EU-funded projects, ELSNET and EUROMAP, are behind the development of HLTCentral. EUROMAP ("Facilitating the path to market for language and speech technologies in Europe") aims to provide awareness, bridge-building and market-enabling services for accelerating the rate of technology transfer and market take-up of the results of European HLT RTD projects. ELSNET ("The European Network of Excellence in Human Language Technologies") aims to bring together the key players in language and speech technology, both in industry and in academia, and to encourage interdisciplinary co-operation through a variety of events and services.

3. Why are language technologies so important for the Information Society?

The overall objective of HLT is to support e-business in a global context and to promote a human centred infostructure ensuring equal access and usage opportunities for all. This is to be achieved by developing multilingual technologies and demonstrating exemplary applications providing features and functions that are critical for the realisation of a truly user friendly Information Society. Projects address generic and applied RTD from a multi- and cross-lingual perspective, and undertake to demonstrate how language specific solutions can be transferred to and adapted for other languages.

4. Why "knowledge" is of more value than "information"?

Information management is the harnessing of the information resources and information capabilities of the organization in order to add and create value both for itself and for its clients or customers.

Knowledge management is a framework for designing an organization’s goals, structures, and processes so that the organization can use what it knows to learn and to create value for its customers and community.

5. Does the possession of large quantities of data imply that we are well informed?

Consider a document containing a table of numbers indicating product sales for the quarter. As they stand, these numbers are Data. An employee reads these numbers, recognizes the name and nature of the product, and notices that the numbers are below last year’s figures, indicating a downward trend. The data has become Information. The employee considers possible explanations for the product decline (perhaps using additional information and personal judgment), and comes to the conclusion that the product is no longer attractive to its customers. This new belief, derived from reasoning and reflection, is Knowledge.

Thus, information is data given context, and endowed with meaning and significance. Knowledge is information that is transformed through reasoning and reflection into beliefs, concepts, and mental models.

6. How many words of technical information are recorded every day?

Every day, approximately 20 million words of technical information are recorded. A reader capable of reading 1000 words per minute would require 1.5 months, reading eight hours every day, to get through one day's output, and at the end of that period he would have fallen 5.5 years behind in his reading.

7. What is the most convenient way of representing information? Why?

Language is the natural means of human communication; the most effective way we have to express ourselves to each other. We use language in a host of different ways: to explain complex ideas and concepts; to manage human resources; to negotiate; to persuade; to make our needs known; to express our feelings; to narrate stories; to record our culture for future generations; and to create beauty in poetry and prose. For most of us language is fundamental to all aspects of our lives.

One of the key features of an information service is its ability to deliver information which meets the immediate, real needs of its client in a focused way. It is not sufficient to provide information which is broadly in the category requested, in such a way that the client must sift through it to extract what is useful. Equally, if the way that the information is extracted leads to important omissions, then the results are at best inadequate and at worst they could be seriously misleading.

8. How can computer science and language technologies help manage information?

Language Engineering can improve the quality of information services by using techniques which not only give more accurate results to search requests, but also increase greatly the possibility of finding all the relevant information available. Use of techniques like concept searches, i.e. using a semantic analysis of the search criteria and matching them against a semantic analysis of the database, give far better results than simple keyword searches.
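To make the contrast concrete, here is a minimal sketch, in Python, of the difference between a keyword search and a concept search. The `CONCEPTS` table is an invented stand-in for the semantic analysis a real Language Engineering system would perform.

```python
# Minimal sketch contrasting keyword search with a concept search.
# The CONCEPTS table is a hypothetical, hand-built stand-in for the
# semantic analysis a real system would perform.
CONCEPTS = {
    "car": "VEHICLE", "automobile": "VEHICLE", "vehicle": "VEHICLE",
    "price": "COST", "cost": "COST", "fee": "COST",
}

def keyword_search(query, documents):
    """Return documents sharing a literal word with the query."""
    q = set(query.lower().split())
    return [d for d in documents if q & set(d.lower().split())]

def concept_search(query, documents):
    """Return documents sharing an underlying concept with the query."""
    def concepts(text):
        return {CONCEPTS.get(w, w) for w in text.lower().split()}
    q = concepts(query)
    return [d for d in documents if q & concepts(d)]

docs = ["automobile cost survey", "train timetable"]
print(keyword_search("car price", docs))   # [] -- no literal match
print(concept_search("car price", docs))   # ['automobile cost survey']
```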

9. Why language can sometimes be seen as a barrier to communication? How can this change?

Communication is probably the most obvious use of language. On the other hand, language is also the most obvious barrier to communication. Across cultures and between nations, difficulties arise all the time not only because of the problem of translating accurately from one language to another, but also because of the cultural connotations of word and phrases. A typical example in the European context is the word 'federal' which can mean a devolved form of government to someone who already lives in a federation, but to someone living in a unitary sovereign state, it is likely to mean the imposition of another level of more remote, centralised government.

As the application of language knowledge enables better support for translators, with electronic dictionaries, thesauri, and other language resources, and eventually when high quality machine translation becomes a reality, so the barriers will be lowered. Agreements at all levels, whether political or commercial, will be better drafted more quickly in a variety of languages. International working will become more effective with a far wider range of individuals able to contribute. An example of a project which is successfully helping to improve communications in Europe is one which interconnects many of the police forces of northern Europe using a limited, controlled language which can be automatically translated, in real-time. Such a facility not only helps in preventing and detecting international crime, but also assists the emergency services to communicate effectively during a major incident.

10. In what ways does Language Engineering improve the use of language?

Our ability to develop our use of language holds the key to the multi-lingual information society; the European society of the future. New developments in Language Engineering will enable us to:

- access information efficiently, focusing precisely on the information we need, saving time and avoiding information overload;
- talk to our computer systems, at home as well as at work, in our cars and in public places where we need information or assistance;
- teach ourselves other languages and improve our use of our own, at our convenience: in our own time, at our own pace, and in our own place;
- do business efficiently over the telephone by interacting reliably and directly with voice-operated computer systems, and even instruct our PCs to carry out transactions on our behalf;
- learn more about what is happening around us, locally, nationally and internationally, and have a greater influence on decisions affecting our lives;
- operate more effectively internationally, in business, in administration, in political activities and as citizens and consumers;
- provide a wider range of better services to the maximum number of fellow citizens, colleagues and customers.

11. Language Technology, Language Engineering and Computational Linguistics. Similarities and differences.

Language technologies are information technologies that are specialized for dealing with the most complex information medium in our world: human language. Therefore these technologies are also often subsumed under the term Human Language Technology. Human language occurs in spoken and written form. Whereas speech is the oldest and most natural mode of language communication, complex information and most of human knowledge are maintained and transmitted in written texts. Speech and text technologies process or produce language in these two modes of realization. But language also has aspects that are shared between speech and text, such as dictionaries, most of grammar, and the meaning of sentences. Thus large parts of language technology cannot be subsumed under speech and text technologies. Among those are technologies that link language to knowledge. We do not know how language, knowledge and thought are represented in the human brain. Nevertheless, language technology has had to create formal representation systems that link language to concepts and tasks in the real world. This provides the interface to the fast growing area of knowledge technologies.

In our communication we mix language with other modes of communication and other information media. We combine speech with gesture and facial expressions. Digital texts are combined with pictures and sounds. Movies may contain language in spoken and written form. Thus speech and text technologies overlap and interact with many other technologies that facilitate the processing of multimodal communication and multimedia documents.

Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition. Computational linguistics has applied and theoretical components.

Applied CL focuses on the practical outcome of modelling human language use. The methods, techniques, tools and applications in this area are often subsumed under the term language engineering or (human) language technology. Although existing CL systems are far from achieving human ability, they have numerous possible applications. The goal is to create software products that have some knowledge of human language. Such products are going to change our lives. They are urgently needed for improving human-machine interaction, since the main obstacle in the interaction between human and computer is a communication problem. Today's computers do not understand our language, and computer languages are difficult to learn and do not correspond to the structure of human thought. Even if the language the machine understands and its domain of discourse are very restricted, the use of human language can increase the acceptance of software and the productivity of its users.

Much older than communication problems between human beings and machines are those between people with different mother tongues. One of the original aims of applied computational linguistics has always been fully automatic translation between human languages. From bitter experience scientists have realized that they are still far away from achieving the ambitious goal of translating unrestricted texts. Nevertheless, computational linguists have created software systems that simplify the work of human translators and clearly improve their productivity. Less than perfect automatic translations can also be of great help to information seekers who have to search through large amounts of texts in foreign languages.

Language engineering is the application of knowledge of language to the development of computer systems which can recognise, understand, interpret, and generate human language in all its forms. In practice, Language Engineering comprises a set of techniques and language resources. The former are implemented in computer software and the latter are a repository of knowledge which can be accessed by computer software.

12. What are the main techniques used in Language Engineering?

There are many techniques used in Language Engineering and some of these are described below.

1.- Speaker Identification and Verification

A human voice is as unique to an individual as a fingerprint. This makes it possible to identify a speaker and to use this identification as the basis for verifying that the individual is entitled to access a service or a resource. The types of problem which have to be overcome are, for example, recognising that the speech is not recorded, selecting the voice through noise (either in the environment or the transfer medium), and identifying the speaker reliably despite temporary changes (such as those caused by illness).

2.- Speech Recognition

The sound of speech is received by a computer in analogue wave forms which are analysed to identify the units of sound (called phonemes) which make up words. Statistical models of phonemes and words are used to recognise discrete or continuous speech input. The production of quality statistical models requires extensive training samples (corpora) and vast quantities of speech have been collected, and continue to be collected, for this purpose. There are a number of significant problems to be overcome if speech is to become a commonly used medium for dealing with a computer. The first of these is the ability to recognise continuous speech rather than speech which is deliberately delivered by the speaker as a series of discrete words separated by a pause. The next is to recognise any speaker, avoiding the need to train the system to recognise the speech of a particular individual. There is also the serious problem of the noise which can interfere with recognition, either from the environment in which the speaker uses the system or through noise introduced by the transmission medium, the telephone line, for example. Noise reduction, signal enhancement and key word spotting can be used to allow accurate and robust recognition in noisy environments or over telecommunication networks. Finally, there is the problem of dealing with accents, dialects, and language spoken, as it often is, ungrammatically.
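The statistical idea can be illustrated in miniature: the recogniser picks the word that maximises the prior probability of the word times the likelihood of the observed sounds given that word. Everything in this sketch (the phoneme strings and the probability tables) is invented for illustration; real systems estimate such models from large speech corpora.

```python
# Toy illustration of statistical word recognition: choose the word w
# maximising P(w) * P(observation | w). All probabilities are invented
# stand-ins for models trained on large speech corpora.
PRIOR = {"wreck": 0.01, "recognise": 0.20, "nice": 0.30, "beach": 0.05}
# Likelihood of the observed phoneme string given each candidate word.
LIKELIHOOD = {
    ("r eh k", "wreck"): 0.60,
    ("r eh k", "recognise"): 0.25,
    ("n ay s", "nice"): 0.70,
    ("b iy ch", "beach"): 0.80,
}

def recognise(phonemes, vocabulary):
    """Return the most probable word for one phoneme observation."""
    return max(vocabulary,
               key=lambda w: PRIOR[w] * LIKELIHOOD.get((phonemes, w), 0.0))

print(recognise("r eh k", PRIOR))  # 'recognise' -- the prior outweighs the likelihood
```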

3.- Character and Document Image Recognition

Recognition of written or printed language requires that a symbolic representation of the language is derived from its spatial form of graphical marks. For most languages this means recognising and transforming characters. There are two cases of character recognition: recognition of printed images, referred to as Optical Character Recognition (OCR), and recognition of handwriting, usually known as Intelligent Character Recognition (ICR).

OCR from a single printed font family can achieve a very high degree of accuracy. Problems arise when the font is unknown or very decorative, or when the quality of the print is poor. In these difficult cases, and in the case of handwriting, good results can only be achieved by using ICR. This involves word recognition techniques which use language models, such as lexicons or statistical information about word sequences.
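A minimal sketch of the second kind of language model mentioned above: invented bigram counts are used to choose among the candidate words an ICR engine might propose for a hard-to-read image.

```python
# Sketch: use (invented) bigram statistics to pick among the candidates
# an ICR engine might propose for a smudged word.
BIGRAM = {("the", "clock"): 40, ("the", "dock"): 2, ("the", "block"): 8}

def best_candidate(previous_word, candidates):
    """Pick the candidate word best supported by the word-sequence model."""
    return max(candidates, key=lambda w: BIGRAM.get((previous_word, w), 0))

# The recogniser is unsure whether the image reads 'clock', 'dock' or 'block'.
print(best_candidate("the", ["clock", "dock", "block"]))  # 'clock'
```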

Document image analysis is closely associated with character recognition but involves the analysis of the document to determine firstly its make-up in terms of graphics, photographs, separating lines and text, and then the structure of the text to identify headings, sub-headings, captions etc. in order to be able to process the text effectively.

4.- Natural Language Understanding

The understanding of language is obviously fundamental to many applications. However, perfect understanding is not always a requirement. In fact, gaining a partial understanding is often a very useful preliminary step in the process because it makes it possible to be intelligently selective about taking the depth of understanding to further levels.

Shallow or partial analysis of texts is used to obtain a robust initial classification of unrestricted texts efficiently. This initial analysis can then be used, for example, to focus on 'interesting' parts of a text for a deeper semantic analysis which determines the content of the text within a limited domain. It can also be used, in conjunction with statistical and linguistic knowledge, to identify linguistic features of unknown words automatically, which can then be added to the system's knowledge.
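Chunking is one common realisation of shallow parsing. The sketch below uses NLTK, a real toolkit, but the single noun-phrase rule is our own simplification; it assumes NLTK's tokenizer and POS-tagger data packages are installed.

```python
# Shallow (chunk) parsing sketch with NLTK's rule-based chunker.
# Requires: pip install nltk, plus its tokenizer and POS-tagger data.
import nltk

# One simplified rule: a noun phrase is an optional determiner,
# any number of adjectives, then one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

sentence = "The shallow parser finds simple noun phrases quickly."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(chunker.parse(tagged))  # a Tree with NP chunks marked
```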

Semantic models are used to represent the meaning of language in terms of concepts and relationships between them. A semantic model can be used, for example, to map an information request to an underlying meaning which is independent of the actual terminology or language in which the query was expressed. This supports multi-lingual access to information without a need to be familiar with the actual terminology or structuring used to index the information.
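A toy illustration of the idea, with a hand-invented concept table: surface terms in several languages map to one language-independent concept identifier, so a query in any of them retrieves the same indexed documents.

```python
# Sketch of a (hand-invented) semantic model: terms in several languages
# all map to one language-independent concept identifier, so a query in
# any language can retrieve the same indexed information.
TERM_TO_CONCEPT = {
    "cheap flights": "C42_LOW_COST_AIR_TRAVEL",
    "vuelos baratos": "C42_LOW_COST_AIR_TRAVEL",
    "vols pas chers": "C42_LOW_COST_AIR_TRAVEL",
}
INDEX = {"C42_LOW_COST_AIR_TRAVEL": ["doc-17", "doc-90"]}

def retrieve(query):
    """Map a query to its concept, then look the concept up in the index."""
    concept = TERM_TO_CONCEPT.get(query.lower())
    return INDEX.get(concept, [])

print(retrieve("Vuelos baratos"))  # ['doc-17', 'doc-90']
```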

Combinations of analysis and generation with a semantic model allow texts to be translated. At the current stage of development, applications where this can be achieved need to be limited in vocabulary and concepts so that adequate Language Engineering resources can be applied. Templates for document structure, as well as common phrases with variable parts, can be used to aid generation of a high quality text.

5.- Natural Language Generation

A semantic representation of a text can be used as the basis for generating language. An interpretation of basic data or the underlying meaning of a sentence or phrase can be mapped into a surface string in a selected fashion; either in a chosen language or according to stylistic specifications by a text planning system.

6.- Speech Generation

Speech is generated from filled templates, by playing 'canned' recordings, or by concatenating units of speech (phonemes, words) together. Generated speech has to account for aspects such as intensity, duration and stress in order to produce a continuous and natural response.

Dialogue can be established by combining speech recognition with simple generation, either from concatenation of stored human speech components or synthesising speech using rules.

Providing a library of speech recognisers and generators, together with a graphical tool for structuring their application, allows someone who is neither a speech expert nor a computer programmer to design a structured dialogue which can be used, for example, in automated handling of telephone calls.
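A sketch of the template-filling approach described above; the prompt file names and slot values are invented.

```python
# Template-filling speech generation sketch: the response is assembled
# by concatenating pre-recorded prompt files (file names are invented).
CANNED = {"intro": "prompts/your_train_leaves_at.wav",
          "8": "prompts/eight.wav", "15": "prompts/fifteen.wav",
          "platform": "prompts/platform.wav"}

def render(template, slots):
    """Return the play-list of recordings for a filled template."""
    return [CANNED[part.format(**slots)] for part in template]

playlist = render(["intro", "{hour}", "{minute}", "platform", "{plat}"],
                  {"hour": "8", "minute": "15", "plat": "8"})
print(playlist)
```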

13. Which language resources are essential components of Language Engineering?

Language resources are essential components of Language Engineering. They are one of the main ways of representing the knowledge of language, which is used for the analytical work leading to recognition and understanding.

The work of producing and maintaining language resources is a huge task. Resources are produced, according to standard formats and protocols to enable access, in many EU languages, by research laboratories and public institutions. Many of these resources are being made available through the European Language Resources Association (ELRA). The following are the essential components of Language Engineering:

1.- Lexicons

A lexicon is a repository of words and knowledge about those words. This knowledge may include details of the grammatical structure of each word (morphology), its sound structure (phonology), and the meaning of the word in different textual contexts, e.g. depending on the word or punctuation mark before or after it. A useful lexicon may have hundreds of thousands of entries. Lexicons are needed for every language of application.
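As an illustration, here is one possible shape for a lexicon entry. The field names and content are invented; real lexicons follow agreed standard formats.

```python
# Sketch of what one lexicon entry might hold, following the description
# above; field names and content are illustrative, not a real standard.
lexicon = {
    "record": [
        {"pos": "noun", "phonology": "/ˈrek.ɔːd/",
         "morphology": {"plural": "records"},
         "senses": ["a stored account", "a disc of music"]},
        {"pos": "verb", "phonology": "/rɪˈkɔːd/",
         "morphology": {"past": "recorded", "3sg": "records"},
         "senses": ["to register information"]},
    ],
}

for entry in lexicon["record"]:
    print(entry["pos"], entry["phonology"])
```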

2.- Specialist Lexicons

There are a number of special cases which are usually researched and produced separately from general purpose lexicons:

Proper names: Dictionaries of proper names are essential to effective understanding of language, at least so that they can be recognised within their context as places, objects, persons, or animals. They take on a special significance in many applications, however, where the name is key to the application, such as in a voice-operated navigation system, a holiday reservation system, or a railway timetable information system based on automated telephone call handling.

Terminology: In today's complex technological environment there are a host of terminologies which need to be recorded, structured and made available for language enhanced applications. Many of the most cost-effective applications of Language Engineering, such as multi-lingual technical document management and machine translation, depend on the availability of the appropriate terminology banks.

Wordnets: A wordnet describes the relationships between words; for example, synonyms, antonyms, collective nouns, and so on. These can be invaluable in such applications as information retrieval, translator workbenches and intelligent office automation facilities for authoring.
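Princeton WordNet, for example, can be queried through NLTK; the sketch assumes NLTK and its `wordnet` data package are installed. Note that antonyms are stored on lemmas (individual word senses) rather than on whole synsets.

```python
# Querying Princeton WordNet through NLTK (requires the 'wordnet' data
# package, e.g. via nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("good")[:3]:
    print(synset.name(), "-", synset.definition())

# Antonyms live on lemmas (individual word senses), not on synsets.
first_lemma = wn.synsets("good")[0].lemmas()[0]
print([antonym.name() for antonym in first_lemma.antonyms()])
```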

3.- Grammars

A grammar describes the structure of a language at different levels: word (morphological grammar), phrase, sentence, etc. A grammar can deal with structure both in terms of surface (syntax) and meaning (semantics and discourse).

4.- Corpora

A corpus is a body of language, either text or speech, which provides the basis for:

- analysis of language to establish its characteristics;
- training a machine, usually to adapt its behaviour to particular circumstances;
- verifying empirically a theory concerning language;
- a test set for a Language Engineering technique or application, to establish how well it works in practice.

There are national corpora of hundreds of millions of words, but there are also corpora which are constructed for particular purposes. For example, a corpus could comprise recordings of car drivers speaking to a simulation of a control system which recognises spoken commands; such a corpus could then be used to help establish the user requirements for a voice-operated control system for the market.
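A corpus analysis in miniature: counting word and word-pair frequencies over a tiny invented corpus, the same kind of computation that, at scale, yields the statistical models mentioned earlier.

```python
# Corpus analysis in miniature: simple frequency counts over a tiny,
# invented corpus of spoken commands.
from collections import Counter

corpus = [
    "turn the radio on",
    "turn the heating off",
    "turn on the lights",
]
words = [w for line in corpus for w in line.split()]
bigrams = [pair for line in corpus
           for pair in zip(line.split(), line.split()[1:])]

print(Counter(words).most_common(3))    # most frequent words
print(Counter(bigrams).most_common(2))  # most frequent word pairs
```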

14. Check the following terms:

Natural language processing

Natural language processing is a term in use since the 1980s to define a class of software systems which handle text intelligently.

Translator's Workbench

It is a software system providing a working environment for a human translator, which offers a range of aids such as on-line dictionaries, thesauri, translation memories, etc.

Shallow parser

A shallow parser is software which parses language to a point where a rudimentary level of understanding can be realised; this is often used in order to identify passages of text which can then be analysed in further depth to fulfil the particular objective.

Formalism

It is a means of representing the rules used in the establishment of a model of linguistic knowledge.

Speech recognition

The analysis of the sound of speech, received by a computer in analogue wave form, to identify the phonemes that make up words, using statistical models of phonemes and words to recognise discrete or continuous speech input (see the fuller description under question 12 above).

Text alignment

The process of aligning different language versions of a text in order to be able to identify equivalent terms, phrases, or expressions.
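A deliberately naive alignment sketch: it assumes the two versions run strictly parallel, sentence by sentence, and merely flags pairs whose lengths diverge suspiciously. Real aligners (e.g. Gale & Church's length-based method) are statistical and also handle insertions, deletions and 2:1 matches.

```python
# Naive sentence-alignment sketch: assume the two versions run strictly
# parallel and flag pairs whose character lengths diverge suspiciously.
def align(source_sentences, target_sentences, max_ratio=1.8):
    pairs = []
    for src, tgt in zip(source_sentences, target_sentences):
        ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
        pairs.append((src, tgt, "ok" if ratio <= max_ratio else "check"))
    return pairs

english = ["The printer is on.", "Do not use abrasive cleaners."]
french = ["L'imprimante est allumée.",
          "Ne pas utiliser de nettoyants abrasifs."]
for pair in align(english, french):
    print(pair)
```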

Authoring tools

Authoring tools are facilities provided in conjunction with word processing to aid the author of documents, typically including an on-line dictionary and thesaurus; spell-, grammar-, and style-checking; and facilities for structuring, integrating and linking documents.

Controlled language

It is a language (also called an artificial language) which has been designed to restrict the number of words and the structure of the language used, in order to make language processing easier; typical users of controlled language work in an area where precision of language and speed of response are critical, such as the police and emergency services, aircraft pilots, air traffic control, etc.
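A controlled-language checker in miniature: any word outside the approved vocabulary is flagged. The vocabulary here is an invented fragment.

```python
# Controlled-language check in miniature: flag words outside the
# approved vocabulary (the vocabulary is an invented fragment).
APPROVED = {"send", "stop", "unit", "units", "to", "the", "location", "two"}

def check(sentence):
    """Return the words that the controlled language does not allow."""
    return [w for w in sentence.lower().split() if w not in APPROVED]

print(check("Send two units to the location"))   # []  -- acceptable
print(check("Dispatch two units to the place"))  # ['dispatch', 'place']
```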

Domain

It is usually applied to the area of application of the language-enabled software, e.g. banking, insurance, travel, etc.; the significance in Language Engineering is that the vocabulary of an application is restricted, so the language resource requirements are effectively limited by limiting the domain of application.

15. In the translation curricula, which factors make technology more indispensable?

When discussing the relevance of technological training in the translation curricula, it is important to clarify the factors that make technology more indispensable and show how the training should be tuned accordingly. The relevance of technology will depend on the medium that contains the text to be translated. This particular aspect is becoming increasingly evident with the rise of the localization industry, which deals solely with information in digital form. There may be no other imaginable means for approaching the translation of such things as on-line manuals in software packages or CD-ROMs with technical documentation than computational ones.

16. Do professional interpreters and literary translators need translation technology? Which are the tools they need for their job?

The traditional crafts of interpreting natural speech or translating printed material, which are peripheral to technology, may still benefit from technological training slightly more than anecdotally. It is clear that word processors, on-line dictionaries and all sorts of background documentation, such as concordances or collated texts, besides e-mail or other ways of network interaction with colleagues anywhere in the world may substantially help the literary translator's work. With the exception of a few eccentrics or maniacs, it will be rare in the future to see good professional interpreters and literary translators not using more or less sophisticated and specialized tools for their jobs, comparable to the familiarization with tape recorders or typewriters in the past. In any case, this might be something best left to the professional to decide, and may not be indispensable.

17. In what ways is documentation becoming electronic? How does this affect translation?

The increase of information in electronic format is linked to advances in computational techniques for dealing with it. Together with the proliferation of informational webs in Internet, we can also see a growing number of search and retrieval devices, some of which integrate translation technology. Technical documentation is becoming electronic, in the form of CD-ROM, on-line manuals, intranets, etc.

An important consequence of the popularization of Internet is that the access to information is now truly global and the demand for localizing institutional and commercial Web sites is growing fast.

18. What is the focus of the localization industry? Do you believe there might be a job for you in that industry sector?

The main role of localization companies is to help software publishers, hardware manufacturers and telecommunications companies with versions of their software, documentation, marketing, and Web-based information in different languages for simultaneous worldwide release. The recent expansion of these industries has considerably increased the demand for translation products and has created a new burgeoning market for the language business. According to a recent industry survey by LISA (the Localization Industry Standards Association), almost one third of software publishers, such as Microsoft, Oracle, Adobe, Quark, etc., generate above 20 percent of their sales from localized products, that is, from products which have been adapted to the language and culture of their targeted markets, and the great majority of publishers expect to be localizing into more than ten different languages.

The General Manager of LionBridge, Santi van der Kruk, for example, declares: "The profile we look for in translators is an excellent knowledge of computer technology and superb linguistic ability in both the source and target languages. They must know how to use the leading CAT [computer assisted translation] tools and applications and be flexible. The information technology and localization industries are evolving very rapidly and translators need to move with them."

In my opinion, and considering my own case, there may be a job for me in this kind of industry. The reason I have for this opinion is that we, students of English Philology, are supposed to go deeply into the world of linguistics and languages. So, with our knowledge of this area and considerable preparation in the field of machine translation and computers in general, we could have the chance to work in this industry.

19. Define internationalization, globalization and localization. How do they affect the design of software products?

Globalization: The adaptation of marketing strategies to regional requirements of all kinds (e.g., cultural, legal, and linguistic).

Internationalization: The engineering of a product (usually software) to enable efficient adaptation of the product to local requirements.

Localization: The adaptation of a product to a target language and culture (locale).

20. Are translation and localization the same thing? Explain the differences.

[Localization - is the process during which a computer program is translated to a different language for a specific market. The user interface is translated into the target language, dialog boxes are resized due to the use of different character sets, and if necessary, double-byte enabling is done.]

Van der Meer, president of AlpNet, puts it this way: "Localization was originally intended to set software (or information technology) translators apart from 'old fashioned' non-technical translators of all types of documents. Software translation required a different skill set: software translators had to understand programming code, they had to work under tremendous time pressure and be flexible about product changes and updates. Originally there was only a select group--the localizers--who knew how to respond to the needs of the software industry. From these beginnings, pure localization companies emerged focusing on testing, engineering, and project management."

21. What is a translation workstation? Compare it with a standard localization tool.

The ideal workstation for the translator would combine the following features:

Full integration in the translator's general working environment, which comprises the operating system, the document editor (hypertext authoring, desktop publisher or the standard word-processor), as well as the emailer or the Web browser. These would be complemented with a wide collection of linguistic tools: from spell, grammar and style checkers to on-line dictionaries, and glossaries, including terminology management, annotated corpora, concordances, collated texts, etc.

The system should comprise all advances in machine translation (MT) and translation memory (TM) technologies, be able to perform batch extraction and reuse of validated translations, enable searches into TM databases by various keywords (such as phrases, authors, or issuing institutions). These TM databases could be distributed and accessible through Internet. There is a new standard for TM exchange (TMX) that would permit translators and companies to work remotely and share memories in real-time.
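The translation-memory component can be sketched with nothing more than the Python standard library: `difflib` provides the fuzzy string matching, while the memory content below is invented (a real workstation would load it from a shared TMX database).

```python
# Fuzzy translation-memory lookup using difflib from the standard
# library. The memory content is invented.
import difflib

TM = {
    "Do not use abrasive cleaners on the printer casing.":
        "N'utilisez pas de nettoyants abrasifs sur le boîtier de l'imprimante.",
    "Switch the printer off before cleaning.":
        "Éteignez l'imprimante avant de la nettoyer.",
}

def tm_lookup(sentence, threshold=0.7):
    """Return (source, translation, score) for the best fuzzy match."""
    best, best_score = None, threshold
    for source, target in TM.items():
        score = difflib.SequenceMatcher(None, sentence, source).ratio()
        if score >= best_score:
            best, best_score = (source, target, round(score, 2)), score
    return best

print(tm_lookup("Do not use abrasive cleaner on the printer casing."))
```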

Muriel Vasconcellos pictures her ideal design of the workstation in the following way:

"Good view of the source text extensive enough to offer the overall context, including the previous sentence and two or three sentences after the current one."

"Relevant on-line topical word lists, glossaries and thesaurus. These should be immediately accessible and, in the case of topical lists, there should be an optimal switch that shows, possibly in color, when there are subject-specific entries available."

"Three target-text windows. The first would be the main working area, and it would start by providing a sentence from the original document (or a machine pre-translation), which could be over-struck or quickly deleted to allow the translator to work from scratch. The original text or pre-translation could be switched off. Characters of any language and other symbols should be easy to produce."

" Drag-and-drop is essential and editing macros are extremely helpful when overstriking or translating from scratch. "

"The second window would offer translation memory when it is available. The TM should be capable of fuzzy matching with a very large database, with the ability to include the organization's past texts if they are in some sort of electronic form."

"The third window would provide a raw machine translation which should be easy to paste into the target document."

"The grammar checker can be tailored so that it is not so sensitive. It would be ideal if one could write one's own grammar rules."

The above lines depict a view of a translation environment which is closer to more traditional needs of the translator than to current requirements of the industry. Many aspects of software localization have not been considered, particularly the concepts of multilingual management and document-life monitoring. Corporations are now realizing that documentation is an integral part of the production line where the distinction between product, marketing and technical material is becoming more and more blurred. Product documentation is gaining importance in the whole process of product development with direct impact on time-to-market. Software engineering techniques that apply in other phases of software development are beginning to apply to document production as well.

22. Machine translation vs. human translation. Do you agree that translation excellence goes beyond technology? Why?

In Martin Kay's (1987) words:

"A computer is a device that can be used to magnify human productivity. Properly used, it does not dehumanize by imposing its own Orwellian stamp on the products of human spirit and the dignity of human labor but, by taking over what is mechanical and routine, it frees human beings over what is mechanical and routine. Translation is a fine and exacting art, but there is much about it that is mechanical and routine, if this were given over to a machine, the productivity of the translator would not only be magnified but this work would become more rewarding, more exciting, more human."

It has taken some 40 years for the specialists involved in the development of MT to realize that the limits to technology arise when going beyond the mechanical and routine aspects of language. From the outside, translation is often seen as a mere mechanical process, not any more complex than playing chess, for example. If computers have been programmed with the capacity to beat a chess master champion such as Kasparov, why should they not be capable of performing translation of the highest quality? Few people are aware of the complexity of literary translation. Douglas Hofstadter (1998) depicts this well:

"A skilled literary translator makes a far larger number of changes, and far more significant changes, than any virtuoso performer of classical music would ever dare to make in playing notes in the score of, say, a Beethoven piano sonata. In literary translation, it's totally humdrum stuff for new ideas to be interpreted, old ideas to be deleted, structures to be inverted, twisted around, and on and on."

23. Which profiles should any person with a University degree in Translation be qualified for?

Obviously, Hofstadter's experiment has gone beyond the recommended mechanical and routine scope of language and is therefore an abuse of MT. Outside the limits of the mechanical and routine, MT is impracticable and human creativity becomes indispensable. Translators of the highest quality are only obtainable from first-class raw materials and constant and disciplined training. The potentially good translator must be a sensitive, wise, vigilant, talented, gifted, experienced, and knowledgeable person. An adequate use of mechanical means and resources can make a good human translator a much more productive one. Nevertheless, very much like dictionaries and other reference material, technology may be considered an excellent prosthesis, but little more than that. As Martin Kay (1992) argues, there is an intrinsic and irreplaceable human aspect of translation:

There is nothing that a person could know, or feel, or dream, that could not be crucial for getting a good translation of some text or other. To be a translator, therefore, one cannot just have some parts of humanity; one must be a complete human being.

24. Why is translation such a difficult task?

Some of the problems that the task of translation involves include the sheer size of the undertaking, as indicated by the number of rules and dictionary entries that a realistic system will need, and the fact that there are many constructions whose grammar is poorly understood, in the sense that it is not clear how they should be represented, or what rules should be used to describe them. This is the case even for English, which has been extensively studied, and for which there are detailed descriptions (both traditional 'descriptive' and theoretically sophisticated), some of which are written with computational usability in mind. It is an even worse problem for other languages. Moreover, even where there is a reasonable description of a phenomenon or construction, producing a description which is sufficiently precise to be used by an automatic system raises non-trivial problems.

25. Which are the main problems of MT?

The main problems of machine translation are ambiguity, lexical and structural mismatches and multiword units: idioms and collocations.

26. Which parts of Linguistics are more relevant for MT?

It is a truism to say that one of the most straightforward operations of any MT system should be the identification and generation of morphological variants of nouns and verbs. There are basically two types of morphology in question: inflectional morphology, as illustrated by the familiar verb and noun paradigms (French marcher, marche, marchons, marchait, a marché, etc.), and derivational morphology, which is concerned with the formation of nouns from verb bases, verbs from noun forms, adjectives from nouns, and so forth, e.g. nation, nationalism, nationalise, nationalisation, and equivalents in other languages.
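Both kinds of morphology can be caricatured in a few lines. The rules below handle only regular English forms and one derivational pattern; a real system would need exception lists for irregular forms.

```python
# Toy morphological generation for regular English paradigms; the rules
# are deliberately simplistic (no irregular forms).
def inflect_verb(stem):
    """Generate regular inflectional variants of an English verb stem."""
    return {
        "3sg": stem + "es" if stem.endswith(("s", "sh", "ch")) else stem + "s",
        "past": stem + "d" if stem.endswith("e") else stem + "ed",
        "gerund": stem[:-1] + "ing" if stem.endswith("e") else stem + "ing",
    }

def derive_noun(stem):
    """A single toy derivational rule: verb -> -ation noun."""
    return stem.rstrip("e") + "ation"   # nationalise -> nationalisation

print(inflect_verb("march"))      # {'3sg': 'marches', 'past': 'marched', ...}
print(derive_noun("nationalise"))  # nationalisation
```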

27. How many different types of ambiguity are there?

In the best of all possible worlds (as far as most Natural Language Processing is concerned, anyway) every word would have one and only one meaning. But, as we all know, this is not the case. When a word has more than one meaning, it is said to be lexically ambiguous. When a phrase or sentence can have more than one structure it is said to be structurally ambiguous.

28. Illustrate your discussion with examples.

Two examples of lexical ambiguity

Imagine that we are trying to translate these two sentences into French:

(a) You must not use abrasive cleaners on the printer casing.

(b) The use of abrasive cleaners on the printer casing is not recommended.

In the first sentence use is a verb, and in the second a noun; that is, we have a case of lexical ambiguity. An English-French dictionary will say that the verb can be translated by (inter alia) se servir de and employer, whereas the noun is translated as emploi or utilisation. One way a reader or an automatic parser can find out whether the noun or verb form of use is being employed in a sentence is by working out whether it is grammatically possible to have a noun or a verb in the place where it occurs. For example, in English, there is no grammatical sequence of words which consists of the + V + PP, so of the two possible parts of speech to which use can belong, only the noun is possible in the second sentence (b).
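The grammatical test just described can be sketched as follows. The mini-lexicon is invented, and the only rule encoded is that English forbids a verb immediately after a determiner.

```python
# Sketch of the grammatical test above: after a determiner such as
# 'the', only the noun reading of an ambiguous word is possible.
# Categories are hand-assigned for this tiny example.
LEXICON = {"use": {"noun", "verb"}, "the": {"det"}, "must": {"aux"}}

def disambiguate(previous_word, word):
    """Return the readings of `word` permitted by its left context."""
    readings = LEXICON.get(word, {"unknown"})
    if "det" in LEXICON.get(previous_word, set()):
        readings = readings - {"verb"}   # English forbids the + V
    return readings

print(disambiguate("must", "use"))  # {'noun', 'verb'} -- still ambiguous
print(disambiguate("the", "use"))   # {'noun'}
```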

One example of structural ambiguity

We can illustrate this with some examples. First, let us show how grammar rules, differently applied, can produce more than one syntactic analysis for a sentence. One way this can occur is where a word is assigned to more than one category in the grammar. For example, assume that the word cleaning is both an adjective and a verb in our grammar. This will allow us to assign two different analyses to the following sentence.

Cleaning fluids can be dangerous.

One of these analyses will have cleaning as a verb, and one will have it as an adjective. In the former (less plausible) case the sense is `to clean a fluid may be dangerous', i.e. it is about an activity being dangerous. In the latter case the sense is that fluids used for cleaning can be dangerous. Choosing between these alternative syntactic analyses requires knowledge about meaning.

It may be worth noting, in passing, that this ambiguity disappears when can is replaced by a verb which shows number agreement by having different forms for third person singular and plural. For example, the following are not ambiguous in this way:

(a) Cleaning fluids is dangerous.

(b) Cleaning fluids are dangerous.

Sentence (a) has only the sense that the action is dangerous; sentence (b) has only the sense that the fluids are dangerous.

We have seen that syntactic analysis is useful in ruling out some wrong analyses, and this is another such case, since, by checking for agreement of subject and object, it is possible to find the correct interpretations. A system which ignored such syntactic facts would have to consider all these examples ambiguous, and would have to find some other way of working out which sense was intended, running the risk of making the wrong choice. For a system with proper syntactic analysis, this problem would arise only in the case of verbs like can which do not show number agreement.

Three lexical and structural mismatches

English chooses different verbs for the action/event of putting on, and the action/state of wearing. Japanese does not make this distinction, but differentiates according to the object that is worn. In the case of English to Japanese, a fairly simple test on the semantics of the NPs that accompany a verb may be sufficient to decide on the right translation. Some of the colour examples are similar, but more generally, investigation of colour vocabulary indicates that languages actually carve up the spectrum in rather different ways, and that deciding on the best translation may require knowledge that goes well beyond what is in the text, and may even be undecidable. In this sense, the translation of colour terminology begins to resemble the translation of terms for cultural artifacts (e.g. words like English cottage, Russian dacha, French château, etc. for which no adequate translation exists, and for which the human translator must decide between straight borrowing, neologism, and providing an explanation). In this area, translation is a genuinely creative act, which is well beyond the capacity of current computers.

A particularly obvious example of this involves problems arising from what are sometimes called lexical holes --- that is, cases where one language has to use a phrase to express what another language expresses in a single word. Examples of this include the `hole' that exists in English with respect to French ignorer (`to not know', `to be ignorant of'), and se suicider (`to suicide', i.e. `to commit suicide', `to kill oneself'). The problems raised by such lexical holes have a certain similarity to those raised by idioms: in both cases, one has phrases translating as single words. We will therefore postpone discussion of these until the section on idiomatic expressions below.

One kind of structural mismatch occurs where two languages use the same construction for different purposes, or use different constructions for what appears to be the same purpose.

Cases where the same structure is used for different purposes include the use of passive constructions in English and Japanese. In the example below, the Japanese particle wa, which we have glossed as `TOP', here marks the `topic' of the sentence --- intuitively, what the sentence is about.

Satoo-san wa shyushoo ni erabaremashita.

Satoo-hon TOP Prime Minister in was-elected

Mr. Satoh was elected Prime Minister.

The example above indicates that Japanese has a passive-like construction, i.e. a construction where the PATIENT, which is normally realized as an OBJECT, is realized as SUBJECT. It is different from the English passive in the sense that in Japanese this construction tends to have an extra adversive nuance, which might make the example above rather odd, since it suggests an interpretation where Mr Satoh did not want to be elected, or where the election was somehow bad for him. This is not suggested by the English translation, of course. The translation problem from Japanese to English is one of those that looks unsolvable for MT, though one might try to convey the intended sense by adding an adverb such as unfortunately. The translation problem from English to Japanese is, on the other hand, within the scope of MT, since one must just choose another form. This is possible, since Japanese allows SUBJECTs to be omitted freely, so one can say the equivalent of elected Mr Satoh, and thus avoid having to mention an AGENT. However, in general, the result of this is that one cannot have simple rules like those described earlier for passives. In fact, unless one uses a very abstract structure indeed, the rules will be rather complicated.

Three collocations

Rather different from idioms are expressions like those below, which are usually referred to as collocations. Here the meaning can be guessed from the meanings of the parts. What is not predictable is the particular words that are used.

This butter is rancid (*sour, *rotten, *stale).

This cream is sour (*rancid, *rotten, *stale).

They took (*made) a walk.

They made (*took) an attempt.

They had (*made, *took) a talk.

For example, the fact that we say rancid butter, but not * sour butter, and sour cream, but not * rancid cream does not seem to be completely predictable from the meaning of butter or cream, and the various adjectives. Similarly the choice of take as the verb for walk is not simply a matter of the meaning of walk (for example, one can either make or take a journey).

In what we have called linguistic knowledge (LK) systems, at least, collocations can potentially be treated differently from idioms. This is because for collocations one can often think of one part of the expression as being dependent on, and predictable from, the other. For example, one may think that make, in make an attempt, has little meaning of its own, and serves merely to `support' the noun (such verbs are often called light verbs, or support verbs). This suggests one can simply ignore the verb in translation, and have the generation or synthesis component supply the appropriate verb. For example, in Dutch, this would be doen, since the Dutch for make an attempt is een poging doen (`do an attempt').
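In such a system the collocation table is consulted at generation time: the noun, not the source verb, selects the target verb. The table below is an invented fragment.

```python
# Support-verb ('light verb') selection sketch: the noun determines the
# verb at generation time. The table is an invented fragment.
SUPPORT_VERB = {
    ("en", "attempt"): "make",
    ("en", "walk"): "take",
    ("nl", "poging"): "doen",   # een poging doen = 'do an attempt'
}

def support_verb(lang, noun):
    """Supply the conventional support verb for a noun in a language."""
    return SUPPORT_VERB.get((lang, noun))

print(support_verb("en", "attempt"))  # make
print(support_verb("nl", "poging"))   # doen
```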

Two idiomatic expressions

If Sam mends the bucket, her children will be rich.

If Sam kicks the bucket, her children will be rich.

The problem with idioms, in an MT context, is that it is not usually possible to translate them using the normal rules. There are exceptions, for example take the bull by the horns (meaning `face and tackle a difficulty without shirking') can be translated literally into French as prendre le taureau par les cornes, which has the same meaning. But, for the most part, the use of normal rules in order to translate idioms will result in nonsense. Instead, one has to treat idioms as single units in translation. In many cases, a natural translation for an idiom will be a single word --- for example, the French word mourir (`die') is a possible translation for kick the bucket.

Another example is to keep tabs on (meaning `observe').
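A sketch of the idiom-first strategy: multi-word units are matched (longest first) before any word-for-word rules apply. The dictionaries are tiny invented samples, and a real system would match lemmas ('kick') rather than inflected surface forms ('kicks').

```python
# Idiom-first translation sketch: multi-word units are matched before
# word-for-word rules. All dictionary content is invented.
IDIOMS = {("kicks", "the", "bucket"): "meurt",
          ("keep", "tabs", "on"): "surveiller"}
WORDS = {"sam": "Sam", "if": "si", "the": "le", "bucket": "seau"}

def translate(tokens):
    out, i = [], 0
    while i < len(tokens):
        for length in (3, 2):                       # longest match first
            unit = tuple(t.lower() for t in tokens[i:i + length])
            if unit in IDIOMS:
                out.append(IDIOMS[unit])
                i += length
                break
        else:                                       # no idiom matched here
            out.append(WORDS.get(tokens[i].lower(), tokens[i]))
            i += 1
    return " ".join(out)

print(translate("Sam kicks the bucket".split()))  # 'Sam meurt'
```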

29. Which are the most usual interpretations of the term "machine translation" (MT)?

The term machine translation (MT) is normally taken in its restricted and precise meaning of fully automatic translation. However, we must consider the whole range of tools that may support translation and document production in general, which is especially important when considering the integration of other language processing techniques and resources with MT. We therefore define Machine Translation to include any computer-based process that transforms (or helps a user to transform) written text from one human language into another. We define Fully Automated Machine Translation (FAMT) to be MT performed without the intervention of a human being during the process. Human-Assisted Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a (mono- or bilingual) human for help. Machine-Aided Translation (MAT) is the style of translation in which a human does most of the work but uses one or more computer systems, mainly as resources such as dictionaries and spelling checkers, as assistants.

30. What do FAHQT and ALPAC mean in the evolution of MT?

There were of course dissenters from the dominant 'perfectionism'. Researchers at Georgetown University and IBM were working towards the first operational systems, and they accepted the long-term limitations of MT in the production of usable translations. More influential was the well-known dissent of Bar-Hillel. In 1960, he published a survey of MT research at the time which was highly critical of the theory-based projects, particularly those investigating interlingua approaches, and which included his demonstration of the non-feasibility of fully automatic high quality translation (FAHQT) in principle. Instead, Bar-Hillel advocated the development of systems specifically designed on the basis of what he called 'man-machine symbiosis', a view which he had first proposed nearly ten years before when MT was still in its infancy (Bar-Hillel 1951).

Nevertheless, the main thrust of research was based on the explicit or implicit assumption that the aim of MT must be fully automatic systems producing translations at least as good as those made by human translators. The current operational systems were regarded as temporary solutions to be superseded in the near future. There was virtually no serious consideration of how 'less than perfect' MT could be used effectively and economically in practice. Even more damaging was the almost total neglect of the expertise of professional translators, who naturally became anxious and antagonistic. They foresaw the loss of their jobs, since this is what many MT researchers themselves believed was inevitable.

In these circumstances it is not surprising that the Automatic Language Processing Advisory Committee (ALPAC) set up by the US sponsors of research found that MT had failed by its own criteria, since by the mid 1960s there were clearly no fully automatic systems capable of good quality translation and there was little prospect of such systems in the near future. MT research had not looked at the economic use of existing 'less than perfect' systems, and it had disregarded the needs of translators for computer-based aids.

While the ALPAC report brought to an end many MT projects, it did not banish the public perception of MT research as essentially the search for fully automatic solutions. The subsequent history of MT is in part the story of how this mistaken emphasis of the early years has had to be repaired and corrected. The neglect of the translation profession has eventually been made good by the provision of translation tools and translator workstations. MT research has turned increasingly to the development of realistic practical MT systems where the necessity for human involvement at different stages of the process is fully accepted as an integral component of their design architecture. And 'pure' MT research has by and large recognised its role within the broader contexts of commercial and industrial realities.

31. List some of the major methods, techniques and approaches.

Some of the major areas are: tools for translators, practical machine translation systems, and research methods for machine translation.

32. Where was MT ten years ago?

Ten years ago, the typical users of machine translation were large organizations such as the European Commission, the US Government, the Pan American Health Organization, Xerox, Fujitsu, etc. Fewer small companies or freelance translators used MT, although translation tools such as online dictionaries were becoming more popular. However, ongoing commercial successes in Europe, Asia, and North America continued to illustrate that, despite imperfect levels of achievement, the levels of quality being produced by FAMT and HAMT systems did address some users' real needs. Systems were being produced and sold by companies such as Fujitsu, NEC, Hitachi, and others in Japan, Siemens and others in Europe, and Systran, Globalink, and Logos in North America (not to mention the unprecedented growth of cheap, rather simple MT assistant tools such as PowerTranslator).

33. New directions and foreseeable breakthroughs of MT in the short term

Several applications have proved able to work effectively using only subsets of the knowledge required for full MT. It is now possible to evaluate different tasks, to measure the information involved in solving them, and to identify the most efficient techniques for a given task. Thus we must face the decomposition of monolithic systems and start talking about hybridization, engineering, architectural changes, shared modules, etc. When identifying tasks, it is important to evaluate linguistic information in terms of what is generalizable, and thus a good candidate for traditional parsing techniques (the argument structure of a transitive verb in the active voice, for instance), and what is idiosyncratic (collocations, for example). Moreover, one cannot discard the power of efficient techniques that yield better results than older approaches, as illustrated clearly by part-of-speech disambiguation, which has proved to be better solved using Hidden Markov Models than traditional parsers. On the other hand, it has been shown that well-motivated, linguistically driven tagging label sets improve the accuracy of statistical systems. Hence we must be ready to separate the knowledge we want to represent from the techniques and formalisms that process it.
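As a toy illustration of the HMM-based part-of-speech disambiguation mentioned above, the following Python sketch performs Viterbi decoding over a three-tag model. All probabilities and the tiny lexicon are invented for illustration; a real tagger would estimate them from an annotated corpus.

    # Minimal Viterbi decoder for HMM part-of-speech tagging.
    # All probabilities below are invented for illustration only.

    def viterbi(words, tags, start_p, trans_p, emit_p):
        # best[i][t] = (probability, previous tag) of the best path
        # that ends with tag t at word position i
        best = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-8), None)
                 for t in tags}]
        for i in range(1, len(words)):
            row = {}
            for t in tags:
                prob, prev = max(
                    (best[i - 1][s][0] * trans_p[s][t]
                     * emit_p[t].get(words[i], 1e-8), s)
                    for s in tags)
                row[t] = (prob, prev)
            best.append(row)
        # Trace back from the most probable final tag.
        path = [max(tags, key=lambda t: best[-1][t][0])]
        for i in range(len(words) - 1, 0, -1):
            path.append(best[i][path[-1]][1])
        return list(reversed(path))

    tags = ["DET", "NOUN", "VERB"]
    start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
    trans_p = {"DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
               "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
               "VERB": {"DET": 0.5,  "NOUN": 0.4,  "VERB": 0.1}}
    emit_p = {"DET":  {"the": 0.9},
              "NOUN": {"dog": 0.5, "barks": 0.1},
              "VERB": {"barks": 0.6}}
    print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
    # -> ['DET', 'NOUN', 'VERB']

Note that "barks" is ambiguous between noun and verb in this toy lexicon; the transition probabilities are what resolve it in favour of the verb reading, which is exactly the kind of disambiguation the paragraph above describes.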

Within the last ten years, research on spoken translation has developed into a major focus of MT activity. Of course, the idea or dream of translating the spoken word automatically was present from the beginning (Locke 1955), but it has remained a dream until now. Research projects such as those at ATR, CMU and on the Verbmobil project in Germany are ambitious, but they do not make the mistake of attempting to build all-purpose systems: the constraints and limitations are clearly defined through the definition of domains, sublanguages and categories of users. That lesson has been learnt. The potential benefits, even if success is only partial, are clear for all to see, and the fact that such ambitious projects can receive funding reflects the standing of MT in general and shows that it is no longer suffering from old perceptions.

34. What are the Internet's essential features?

Before the nineties, three main approaches to Machine Translation were developed: the so-called direct, transfer and interlingua approaches. Direct and transfer-based systems must be implemented separately for each language pair in each direction, while the interlingua-based approach is oriented to translation between any two of a group of languages for which it has been implemented. The implications of this fundamental difference, as well as other features of each type of system, are discussed in this and the following sections. The more recent corpus-based approach is considered later in this section.

More recently developed approaches to MT divide the translation process into discrete stages, including an initial stage of analysis of the structure of a sentence in the source language, and a corresponding final stage of generation of a sentence from a structure in the target language. Neither analysis nor generation is translation as such. The analysis stage involves interpreting sentences in the source language, arriving at a structural representation which may incorporate morphological, syntactic and lexical coding, by applying information stored in the MT system as grammatical rules and dictionaries. The generation stage performs approximately the same functions in reverse, converting structural representations into sentences, again applying information embodied in rules and dictionaries.

The transfer approach, which characterizes the more sophisticated MT systems now in use, may be seen as a compromise between the direct and interlingua approaches, attempting to avoid the most extreme pitfalls of each. Although no attempt is made to arrive at a completely language-neutral interlingua representation, the system nevertheless performs an analysis of input sentences, and the sentences it outputs are obtained by generation. Analysis and generation are, however, shallower than in the interlingua approach, and between analysis and generation there is a transfer component, which converts structures in one language into structures in the other and carries out lexical substitution. The object of analysis here is to represent sentences in a way that will facilitate and anticipate the subsequent transfer to structures corresponding to the target language sentences.
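To make the analysis-transfer-generation division concrete, here is a deliberately crude Python sketch. The "structures" are mere word/tag lists and both lexicons are invented for illustration; a real transfer system would manipulate full morphological and syntactic representations rather than flat word lists.

    # Toy transfer-based pipeline: analysis -> transfer -> generation.
    # The lexicons below are invented English -> Spanish toy data.

    SOURCE_LEXICON = {"the": "DET", "cat": "NOUN", "sees": "VERB", "bird": "NOUN"}
    TRANSFER_DICT = {"the": "el", "cat": "gato", "sees": "ve", "bird": "pájaro"}

    def analyse(sentence):
        # Analysis: map a source sentence onto a crude structural representation.
        return [(w, SOURCE_LEXICON.get(w, "UNK")) for w in sentence.lower().split()]

    def transfer(structure):
        # Transfer: convert the source structure into a target structure;
        # here this amounts to lexical substitution only.
        return [(TRANSFER_DICT.get(w, w), tag) for w, tag in structure]

    def generate(structure):
        # Generation: linearize the target structure back into a sentence.
        return " ".join(w for w, _ in structure)

    print(generate(transfer(analyse("the cat sees the bird"))))
    # -> el gato ve el pájaro

Even in this toy version, analysis and generation know nothing about the other language; only the transfer component is pair-specific, which is exactly why direct and transfer-based systems must be implemented separately for each language pair and direction.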

35. What is the role of minority languages on the Internet (Catalan, Basque...)?

This point requires even more careful consideration when what is needed is not merely a bilingual but a multilingual MT network, in which translation is possible from any language into any other language among a given network of languages or in a multilingual community. Unless a high degree of reusability is achieved, serious problems arise as soon as the multilingual set grows beyond a very limited size. When, in 1978, an ambitious project named Eurotra was started to develop "a machine translation system of advanced design" between all official languages of the European Community (a target which was not achieved before the programme came to an end), the Community's official languages numbered only six: English, French, German, Dutch, Danish and Italian. This meant fifteen language pairs. Within eight years, the entry of Greece and subsequently Spain and Portugal into the Community had added three new official languages which had to be integrated into the system, still under development. This increase from six to nine languages meant that the number of language pairs more than doubled, rising from fifteen to thirty-six. If the programme had continued a little longer, by the time there were twelve official languages of the Community, the number of language pairs would have gone from 36 to 66; fifteen languages would have brought the figure up to 105, and so on, growing quadratically with the number of languages.
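The figures above follow directly from the formula for unordered pairs among n languages, n(n-1)/2. A quick check in Python, which also contrasts the number of pair-specific modules a transfer or direct architecture needs (one per translation direction) with the 2n analysers and generators an interlingua design would require:

    # Language-pair combinatorics for a multilingual MT network.
    for n in (6, 9, 12, 15):
        pairs = n * (n - 1) // 2      # unordered language pairs
        directed = n * (n - 1)        # transfer/direct modules, one per direction
        interlingua = 2 * n           # one analyser + one generator per language
        print(f"{n} languages: {pairs} pairs, "
              f"{directed} directed modules, {interlingua} interlingua modules")
    # 6 languages: 15 pairs, 30 directed modules, 12 interlingua modules
    # 9 languages: 36 pairs, 72 directed modules, 18 interlingua modules
    # 12 languages: 66 pairs, 132 directed modules, 24 interlingua modules
    # 15 languages: 105 pairs, 210 directed modules, 30 interlingua modules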

36. In what ways can Machine Translation be applied on the Internet?

The Internet is, and will be to an increasing degree, both a vehicle for providing MT services and a major beneficiary of their application. In this way, MT is likely to provide a further key to making the Internet a truly global medium, one which can transcend not only geographical barriers but also linguistic ones.

Europe, as the most notable focal point in the present-day world where a great capacity for technological innovation crosses paths with a high level of linguistic diversity, is excellently placed to lead the way forward. Other parts of the world are technologically capable but too self-contained and homogeneous culturally to acquire immediate awareness of the need for information technology to find its way across linguistic barriers, while still other communities are fully aware of the language problem but lack a comparable degree of access to technological resources and initiative needed to address the issue on such a scale. Whoever succeeds in making future communication global in linguistic terms will have forged a new tool of incalculable value to the entire world.

CONCLUSION

In conclusion, it is necessary to say that the information society is becoming more and more important nowadays. We could say that it is a new kind of society which is developing and growing very quickly. It is a society based on computers and telecommunications, but one in which language technologies are essential because language itself is the natural means of communication.

Yet language can often act as a barrier to communication, and the main aim of language engineering is to pull down this barrier using different techniques. One of the hardest tasks in language engineering is translation, which is fundamental for sharing information between speakers of different languages. Machine translation in particular has become an important and almost essential tool for translators. But although machine translation is a necessary tool, human beings are still needed to produce a good translation, because for now machine translation can only assist them by taking over the routine work.

REFERENCES

http://whatis.techtarget.com/definition/0,,sid9_gci213588,00.html

Published by TechTarget (2003)

http://www.hltcentral.org/page-615.shtml

Published by HLTCentral.org (2001)

http://sirio.deusto.es/abaitua/konzeptu/nlp/hlt.htm

Published by HLTCentral.org (2001)

http://choo.fis.utoronto.ca/IMfaq/

Published by the University of Toronto Faculty of Information Studies

http://sirio.deusto.es/abaitua/konzeptu/fatiga.htm#Words

Published by Joseba Abaitua (1996)

http://www.hltcentral.org/usr_docs/project-source/en/broch/harness.html#lt

Published by HLTCentral.org (2001)

http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm

Published by Joseba Abaitua (1999)

http://sirio.deusto.es/abaitua/konzeptu/ta/MT_book_1995/node52.html

Published by Arnold DJ (1995)

http://sirio.deusto.es/abaitua/konzeptu/nlp/Mlim/mlim4.html

Published by Bente Maegaard

http://ourworld.compuserve.com/homepages/WJHutchins/MTS-95.htm

Published by John Hutchins (1995)

http://www.europarl.eu.int/stoa/publi/99-12-01/part2_en.htm

Published by the European Parliament (2000)