This report concerns LANGUAGE TECHNOLOGIES. Within that topic I cover several themes, which can be summed up as follows:
- The importance of Language Technology to the development of information and communication techniques and, closely linked to this idea, the transformation of our lifestyle.
- Language Engineering as a revolution, since it keeps multilinguality from being a barrier to communication.
- Regarding technology, the focus is on the area of translation: a) the opposition between human translation and machine translation; b) the problems that may arise when translating.
- The Internet as one of the fields that the localization industry has entered.
To begin with, I would like to spend some lines on the METHODOLOGY that has been used for this report. It has been quite a new way of working for me, since it relied on the computer, the intranet and the Internet. All the documents provided were accessible only through the net and the amount of information was huge (while writing this report, I found that the statement "having too much information is as dangerous as having too little" is true). On the first day of each week (there were 7 weeks) I read the questions that had been assigned for us to answer. From that moment on, I browsed in order to find the required information. Once the texts were found, I selected the relevant passages for my answer and, with a Word document open, copied and pasted them. Finally, I had to acknowledge the source of all the information: the name of the writer or institution, the date and the publisher's name or URL. However, it has not been possible for me to include all of this for some texts.

As far as the CONTENT is concerned, I will explain in detail the most important parts of my report. The first part is about the information society and how our activities depend on the services provided by information and communication technologies. Language Technology is essential if this information society is to be successful. The second part deals with the revolution in communication, that is, the development of Language Engineering. The third part concentrates on translation, comparing human translation with machine translation. Through the translation of a poem by both, it becomes clear that the automatic one cannot always be used, and I observe that there is still a great need for human translation. I also include some of the problems that may arise during translation when we come across idioms, collocations and so on.
To conclude this part, MT is defined and described thoroughly (history, approaches ...). The last part has to do with the Internet and multilinguality on the net. This channel is only one of the many areas that the localization industry has entered. Together with the Internet, we find home banking, mobile phones, games, education and entertainment. The differences between translation and localization are also dealt with in this part.

To conclude this introduction, I set out some objectives. The first is to show the important role that linguistics plays in the success of information technologies. The second has to do with the many technologies, such as MT, that are applied to languages. The last is to become familiar with the browser and with digital documents.
3. BODY This is the information age and we are a society in which information is vital to economic, social, and political success as well as to our quality of life. http://www.serv-inf.deusto.es/abaitua/konzeptu/nlp/echo/infoage.html Information and communication technologies (ICTs) are transforming dramatically many aspects of economic and social life(...) This decade is witnessing the forging of a link (...) between the technological innovation process and economic and social organisation. A new "information society" is emerging in which the services provided by information and communications technologies (ICTs) underpin human activities (...), in which management, quality and speed of information are key factors for competitiveness (...) Throughout the world production systems, methods of organising work and consumption patterns are undergoing changes that will have long-term effects comparable with the first industrial revolution. This is the result of the development of information and communications technologies. http://europa.eu.int/en/record/white/c93700/ch01_1.html The development of an "information society" will be a global phenomenon, led first of all by the Triad, but gradually extended to cover the entire planet. http://europa.eu.int/en/record/white/c93700/ch5_1.html But this huge phenomenon, that is the information society, needs an indispensable contribution to succeed: Language Technologies. Language technologies are information technologies that are specialised for dealing with the most complex information medium in our world: human language (...) these technologies are often subsumed under the term Human Language Technology. There are still many new ways in which the application of telematics and the use of language technology will benefit our way of life, from interactive entertainment to lifelong learning. Although these changes will bring great benefits, it is important that we anticipate difficulties that may arise... 
· access to much of the information may be available only to the computer literate and those who understand English; · select what is really wanted. Language Engineering can solve these problems and will also help in the way that we deal with associates abroad. (...) electronic commerce(...)systems to generate business letters and other forms of communication in foreign languages will ease and greatly enhance communication. The language technologies will make an indispensable contribution to the success of this information revolution. The availability and usability of new telematics services will depend on developments in language engineering. Speech recognition will become a standard computer function providing us with the facility to talk to a range of devices, from our cars to our home computers, and to do so in our native language. In turn, these devices will present us with information, at least in part, by generating speech. Multi-lingual services will also be developed in many areas. One of the fundamental components of Language Engineering is the understanding of language, by the computer. If information services are capable of understanding our requests, and can scan and select from the information base with real understanding, Language Engineering will deliver the right information at the right time. http://sirio.deusto.es/abaitua/konzeptu/nlp/echo/infoage.html It will be now interesting to see the main objects of Human Language Technologies: · easy access to information and communication services in one's own language; · effective harness of the information glut; · meaningful use and assimilation of information; · natural operation of new services without needing specialist skills; · productive communication and co-operation across languages and cultures. 
As we have seen, Human Language Technologies activities are relevant to many of the action lines within the thematic programme on the Information society, due to the pervasiveness of human language in information and communication related activities. http://www.hltcentral.org/htmlengine.shtml?id=55 Discussion Document, Luxembourg, July 1997 http://sirio.deusto.es/abaitua/konzeptu/umist.htm#elec

According to Joseba Abaitua: Over time, human language finds the mode of articulation that is most effective and best suited to the medium. http://sirio.deusto.es/abaitua/konzeptu/copyr.htm#Editorial

In a few years, men will be able to communicate more effectively through a machine than face to face. Creative, interactive communication requires (...) a dynamic medium in which premises will flow into consequences, and above all a common medium that can be contributed to and experimented with by all. Such a medium is at hand: the programmed digital computer. Its presence can change the nature and value of communication even more profoundly than did the printing press and the picture tube, for a well-programmed computer can provide direct access both to informational resources and to the processes for making use of the resources.

Having reached this point, I would like to explain the difference between knowledge and information, since their similarity can cause some confusion. Knowledge is power, but information is not. It's like the detritus that a gold-panner needs to sift through in order to find the nuggets. Having too much information can be as dangerous as having too little. Among other problems, it can lead to a paralysis of analysis, making it far harder to find the right solutions or make the best decisions. Information is supposed to speed the flow of commerce, but it often just clogs the pipes.
http://sirio.deusto.es/abaitua/konzeptu/fatiga.htm#knowledge Now, as a curiosity, I will mention the number of words regarding technical information that are recorded every day: Every day, approximately 20 million words of technical information are recorded. A reader capable of reading 1000 words per minute would require 1.5 months, reading eight hours every day, to get through one day's output, and at the end of that period he would have fallen 5.5 years behind in his reading. http://sirio.deusto.es/abaitua/konzeptu/fatiga.htm#Notes In fact, language is often seen as a barrier to communication. The following passage will help us to see why and how this is changing as a result of the developments in Language Engineering: The use of language is currently restricted. In the main, it is only used in direct communications between human beings and not in our interactions with the systems, services and appliances which we use every day of our lives. Even between humans, understanding is usually limited to those groups who share a common language. In this respect, language can sometimes be seen as much as a barrier to communication rather than as an aid. A change is taking place that will revolutionise our use of language and greatly enhance the value of language in every aspect of communication. This change is the result of developments in Language Engineering. http://www.hltcentral.org/usr_docs/project-source/en/broch/harness.html#lia As this quotation has explained, Language Engineering will revolutionise our use of Language in the way we will see later. First, it is convenient to know what Language Engineering is and its main techniques. Language is the natural means of human communication; the most effective way we have to express ourselves to each other. Language Engineering is the application of knowledge of language to the development of computer systems which can recognise, understand, interpret, and generate human language in all its forms. 
In practice, Language Engineering comprises a set of techniques and language resources. http://www.hltcentral.org/usr_docs/project-source/en/broch/harness.html#wile Language Engineering provides ways in which we can extend and improve our use of language to make it a more effective tool. It is based on a vast amount of knowledge about language and the way it works, which has been accumulated through research. It uses language resources, such as electronic dictionaries and grammars, terminology banks and corpora, which have been developed over time. (...) By applying this knowledge of language we can develop new ways to help solve problems across the political, social, and economic spectrum. Besides, it also uses our knowledge of language to enhance our application of computer systems: · improving the way we interface with them · assimilating, analysing, selecting, using, and presenting information more effectively · providing human language generation and translation facilities. New opportunities are becoming available to change the way we do many things (...) by exploiting our developing knowledge of language. When a machine can recognise written natural language and speech, in a variety of languages, we shall all have easier access to the benefits of a wide range of information and communication services (...) When a machine understands human language, translates between different languages (...) we shall have available an enormously powerful tool to help us in many areas of our lives. http://www.hltcentral.org/usr_docs/project-source/en/index.html These are the main techniques used in Language Engineering: · Speaker Identification and Verification: a human voice is as unique to an individual as a fingerprint. This makes it possible to identify a speaker and to use this identification as the basis for verifying the individual is entitled to access a service or a resource. 
The types of problems which have to be overcome are, for example, recognising that the speech is not recorded, selecting the voice through noise (either in the environment or the transfer medium), and identifying reliably despite temporary changes (such as caused by illness). · Speech Recognition: the sound of speech is received by a computer in analogue waveforms that are analysed to identify the units of sound that make up words. Statistical models of phonemes and words are used to recognise discrete or continuous speech input. The production of quality statistical models requires extensive training samples (corpora) and vast quantities of speech have been collected, and continue to be collected, for this purpose. · Character and Document Image Recognition: recognition of written or printed language requires that a symbolic representation of the language is derived from its spatial form of graphical marks. There are two cases of character recognition: - recognition of printed images, referred to as Optical Character Recognition (OCR) - recognising handwriting, usually known as Intelligent Character Recognition (ICR) · Natural Language Understanding: The understanding of language is obviously fundamental to many applications. However, perfect understanding is not always a requirement. In fact, gaining a partial understanding is often a very useful preliminary step in the process because it makes it possible to be intelligently selective about taking the depth of understanding to further levels. Shallow or partial analysis of texts is used to obtain a robust initial classification of unrestricted texts efficiently. This initial analysis can then be used, for example, to focus on 'interesting' parts of a text for a deeper semantic analysis which determines the content of the text within a limited domain. 
It can also be used, in conjunction with statistical and linguistic knowledge, to identify linguistic features of unknown words automatically, which can then be added to the system's knowledge. · Natural Language Generation: a semantic representation of a text can be used as the basis for generating language. An interpretation of basic data or the underlying meaning of a sentence or phrase can be mapped into a surface string in a selected fashion; either in a chosen language or according to stylistic specifications by a text planning system. · Speech Generation: Speech is generated from filled templates, by playing 'canned' recordings or concatenating units of speech (phonemes, words) together. Speech generated has to account for aspects such as intensity, duration and stress in order to produce a continuous and natural response. Dialogue can be established by combining speech recognition with simple generation, either from concatenation of stored human speech components or synthesising speech using rules. Providing a library of speech recognisers and generators, together with a graphical tool for structuring their application, allows someone who is neither a speech expert nor a computer programmer to design a structured dialogue which can be used, for example, in automated handling of telephone calls. http://www.hltcentral.org/usr_docs/project-source/en/index.html Language resources are essential components of Language Engineering. They are one of the main ways of representing the knowledge of language, which is used for the analytical work leading to recognition and understanding: · Lexicons: a lexicon is a repository of words and knowledge about those words. This knowledge may include details of the grammatical structure of each word (morphology), the sound structure (phonology), the meaning of the word in different textual contexts, (...) A useful lexicon may have hundreds of thousands of entries. 
· Specialist Lexicons: (...) special cases which are researched and produced separately from general purpose lexicons:
- Proper names
- Terminology
- Wordnets
· Grammars: A grammar describes the structure of a language at different levels: word (morphological grammar), phrase, sentence, etc. A grammar can deal with structure both in terms of surface (syntax) and meaning (semantics and discourse).
· Corpora: A corpus is a body of language, either text or speech, which provides the basis for:
- Analysis of language to establish its characteristics
- Training a machine, usually to adapt its behaviour to particular circumstances
- Verifying empirically a theory concerning language
- A test set for a Language Engineering technique or application to establish how well it works in practice.
There are national corpora of hundreds of millions of words but there are also corpora which are constructed for particular purposes. http://www.hltcentral.org/usr_docs/project-source/en/index.html

Among the different technologies that are being developed, here we are concerned with translation technology. When discussing the relevance of technological training in the translation curricula, it is important to clarify the factors that make technology more indispensable and show how the training should be tuned accordingly. The relevance of technology will depend on the medium that contains the text to be translated. This particular aspect is becoming increasingly evident with the rise of the localization industry, which deals solely with information in digital form. On the other hand, the traditional crafts of interpreting natural speech or translating printed material (...) may still benefit from technological training (...) word processors, on-line dictionaries (...) e-mail or other ways of network interaction with colleagues anywhere in the world may substantially help the literary translator's work. (...)
it will be rare in the future to see good professional interpreters and literary translators not using sophisticated and specialized tools for their jobs (...) http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm

(...) professional translators need to become acquainted with technology, because good use of technology will make their jobs more competitive and satisfactory. But they should not dismiss craftsmanship. Technology enhances productivity, but translation excellence goes beyond technology. It is important to delimit the roles of humans and machines in translation.

According to Martin Kay (1987): A computer is a device that can be used to magnify human productivity. (...) it frees human beings from what is mechanical and routine.

According to Hofstadter (1998): A skilled literary translator makes a far larger number of changes, and far more significant changes (...)

Obviously, Hofstadter's experiment has gone beyond the recommended mechanical and routine scope of language and is therefore an abuse of MT. Outside the limits of the mechanical and routine, MT is impracticable and human creativity becomes indispensable. (...) The potentially good translator must be a sensitive, wise, vigilant, talented, gifted, experienced, and knowledgeable person. An adequate use of mechanical means and resources can make a good human translator a much more productive one. Nevertheless, very much like dictionaries and other reference material, technology may be considered an excellent prosthesis, but little more than that.

However, even for skilled human translators, translation is often difficult. One clear example is when linguistic form, as opposed to content, becomes an important part of a literary piece. Compared with the complexity of natural language, the figures that serve to quantify the "knowledge" of any MT program are absurd: 100,000 bilingual vocabularies, 5,000 transfer rules... Well-developed systems such as Systran or Logos hardly surpass these figures.
http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm We are aware that documentation is becoming electronic. The reason for this is that this digital documentation is the best way for the incorporation of translation technology. The increase of information in electronic format is linked to advances in computational techniques for dealing with it. Together with the proliferation of informational webs in Internet, we can also see a growing number of search and retrieval devices, some of which integrate translation technology. Technical documentation is becoming electronic, in the form of CD-ROM, on-line manuals, Intranets, etc. An important consequence of the popularization of Internet is that the access to information is now truly global and the demand for localizing institutional and commercial Web sites is growing fast. In the localization industry, the utilization of technology is congenital, and developing adequate tools has immediate economic benefits. http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm It would be interesting now to describe and define 3 important terms and how they affect the design of software products. For this purpose, I will base myself on the definitions by Professor Margaret King of Geneva in the project LISA Education Initiative Taskforce. · Globalization: the adaptation of marketing strategies to regional requirements of all kinds (e.g., cultural, legal, and linguistic). · Internationalization: the engineering of a product (usually software) to enable efficient adaptation of the product to local requirements. · Localization: the adaptation of a product to a target language and culture (locale). Many aspects of software localization have not been considered, particularly, the concepts of multilingual management and document-life monitoring. Corporations are now realizing that documentation is an integral part of the production line where the distinction between product, marketing and technical material is becoming more and more blurred. 
Product documentation is gaining importance in the whole process of product development with direct impact on time-to-market. Software engineering techniques that apply in other phases of software development are beginning to apply to document production as well. The appraisal of national and international standards of various types is also significant: text and character coding standards (e.g. SGML/XML and Unicode), as well as translation quality control standards (e.g. DIN 2345 in Germany, or UNI 10574 in Italy). In response to these new challenges, localization packages are now being designed to assist users throughout the whole life cycle of a multilingual document. These take them through job set-up, authoring, translation preparation, translation, validation, and publishing, besides ensuring consistency and quality in source and target language variants of the documentation. New systems help developers monitor different versions, variants and languages of product documentation, and author customer specific solutions. According to Rose Lockwood, a consultant from Equipe consortium Ltd, "as traditional translation methods give way to language engineering and disciplined authoring, translation and document-management method, the role of technically proficient linguists and authors will be increasingly important to global WWW. The challenge will be to employ the skills used in conventional technical publishing in the new environment of a digital economy." http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm The focus of the localization industry is to help software publishers, hardware manufacturers and telecommunications companies with versions of their software, documentation, marketing, and Web-based information in different languages for simultaneous world-wide release. The recent expansion of these industries has considerably increased the demand for translation products and has created a new burgeoning market for the language business. 
According to a recent industry survey by LISA, almost one third of software publishers (...) generate above 20 percent of their sales from localized products, that is, from products which have been adapted to the language and culture of their targeted markets, and the great majority of publishers expect to be localizing into more than ten different languages. Localization is not limited to the software-publishing business and it has infiltrated many other facets of the market (...) Besides Internet, another emerging sector for the localization industry is the introduction of the e-book (electronic book) in the literary market. (...) it is clear that for a new generation of console and video-games users, who are more than adapted to reading on screens, literature on the console may be more than appealing. http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm

Having said all this, it is quite clear that localization and translation are two different things, although they are connected. Localization is the paradigm of the need for technology, while interpreting and literary translation are examples of crafts in which technology plays only a supporting role. The localization business is intimately connected with the software industry, and companies in the field complain about the lack of qualified personnel that combine both an adequate linguistic background and computational skills. This is the reason why the industry has taken the lead over educational institutions by proposing courseware standards for training localization professionals.

According to Van der Meer, president of Alpnet: Localization was originally intended to set software (or information technology) translators apart from 'old fashioned' non-technical translators of all types of documents. Software translation required a different skill set: software translators had to understand programming code, they had to work under tremendous time pressure and be flexible about product changes and updates.
Originally there was only a select group, the localizers, who knew how to respond to the needs of the software industry. From these beginnings, pure localization companies emerged focusing on testing, engineering, and project management. All in all, localization is the adaptation of a product to a target language and culture (locale). http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm

In the same way that I have highlighted the differences between localization and translation, it is worth explaining what the translation workstation is. According to Henri Broekmate, Trados manager, the translation workstation will provide enterprise-wide applications for multilingual information creation and dissemination, integrating logistical and language-engineering applications into smooth workflow that spans the globe. (...) the industry is now moving in the direction of integrating systems (...) Logos, the veteran translation technology provider, describes an integrated technology-based translation package, which will combine term management, TM, MT and related tools to create a seamless full service localization environment.

However, in general, the ideal workstation for the translator would combine the following features:
· Full integration in the translator's general working environment, which comprises the operating system, the document editor (...) as well as the emailer or the Web browser. These would be complemented with a wide collection of linguistic tools: from spell, grammar and style checkers to on-line dictionaries, and glossaries, including terminology management, annotated corpora, concordances, collated texts, etc.
· The system should comprise all advances in machine translation (MT) and translation memory (TM) technologies, be able to perform batch extraction and reuse of validated translations, enable searches into TM databases by various keywords. These TM databases could be distributed and accessible through Internet.
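The idea of reusing validated translations from a TM database can be illustrated with a minimal sketch. It is not taken from any of the products mentioned; the memory contents, the `lookup` function and the 0.75 threshold are assumptions made only for illustration, and the similarity measure is simply the one offered by Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: validated (source, target) segment pairs.
TRANSLATION_MEMORY = [
    ("Switch the printer on.", "Encienda la impresora."),
    ("Put the paper in the printer.", "Coloque el papel en la impresora."),
    ("Do not use abrasive cleaners.", "No utilice limpiadores abrasivos."),
]


def similarity(a, b):
    """Character-level similarity between two segments, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup(segment, memory=TRANSLATION_MEMORY, threshold=0.75):
    """Return (score, source, target) for the closest stored segment,
    or None when nothing in the memory is similar enough to reuse."""
    if not memory:
        return None
    best = max(
        ((similarity(segment, src), src, tgt) for src, tgt in memory),
        key=lambda match: match[0],
    )
    return best if best[0] >= threshold else None


if __name__ == "__main__":
    # An exact match and a near match are offered for reuse;
    # creative text finds nothing and is left to the translator.
    for text in ("Switch the printer on.",
                 "Switch the printer off.",
                 "Twinkle, twinkle, little bat"):
        print(text, "->", lookup(text))
```

A near match such as "Switch the printer off." is only a suggestion: the stored translation still has to be corrected by a human, which is why the threshold is a design choice rather than a fixed value.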
There is a new standard for TM exchange (TMX) that would permit translators and companies to work remotely and share memories in real-time. Muriel Vasconcellos also pictures her ideal design of the workstation. http://sirio.deusto.es/abaitua/konzeptu/ta/vic.htm

At this point we reach one of the important themes of the report: the opposition between human translation and machine translation. If we have a text and we want it to be translated automatically, that is, by a machine, we can do so. However, difficulties may arise if the source language and the target language have different structures. To exemplify this problem, I take the following poem:

Twinkle, twinkle, little bat
how I wonder what you're at!
Up above the world you fly
like a tea tray in the sky.
Lewis Carroll

Brilla, luce, ratita alada
¿en qué estás tan atareada?
Por encima del universo vuelas
como una bandeja de teteras.
Translation of Jaime de Ojeda

Centelleo, centelleo, pequeño palo,
¡cómo me pregunto en cuál usted está!
Encima sobre del mundo usted vuela
como una té-bandeja en el cielo.
Translation of SYSTRAN (machine translation)

According to Manuel Breva (1996), Jaime de Ojeda translates "bat" as "ratita alada" for rhythmical reasons. "Murciélago", the Spanish equivalent of "bat", would be hard to fit in this context for the same poetic reasons. With Ojeda's choice of words the Spanish version preserves the meaning and maintains the same rhyming pattern (AABB) as in the original English verse-lines.

According to Martin Kay: There is nothing that a person could know, or feel, or dream, that could not be crucial for getting a good translation of some text or other. To be a translator, therefore, one cannot just have some parts of humanity; one must be a complete human being. (...) history can offer no better example of inappropriate use of the computer than machine translation.
The example above shows that machine translation could not be used here, as the result would be ridiculous. In general, it is not appropriate when the language is creative. Yet if the language is repetitive, canonical and controlled, there should not be any problem in using machine translation. Adverse conditions: creative, spontaneous, unpredictable language (...) Optimal conditions: controlled, repetitive, canonical language (...) http://sirio.deusto.es/abaitua/konzeptu/ta/jaumei00.ppt

When we are translating a text we may have to face many important problems, such as:
- Lexical and structural ambiguities
- Lexical and structural mismatches
- Idioms
- Collocations

According to Arnold D. J., when a word has more than one meaning, it is said to be lexically ambiguous. When a phrase or sentence can have more than one structure it is said to be structurally ambiguous. In order to understand the phenomenon of ambiguity better, I will show some examples given by Arnold D. J. (1995).

To begin with, I will consider some examples of lexical ambiguity. Imagine that we are trying to translate these two sentences into French:
· You must not use abrasive cleaners on the printer casing.
· The use of abrasive cleaners on the printer casing is not recommended.
In the first sentence use is a verb, and in the second a noun, that is, we have a case of lexical ambiguity. An English-French dictionary will say that the verb can be translated by (inter alia) se servir de and employer, whereas the noun is translated as emploi or utilisation. One way a reader or an automatic parser can find out whether the noun or the verb form of use is being employed in a sentence is by working out whether it is grammatically possible to have a noun or a verb in the place where it occurs. Take for example the word button. Like the word use, it can be either a verb or a noun.
As a noun, it can mean both the familiar small round object used to fasten clothes, as well as a knob on a piece of apparatus.

Secondly, I will go on with some examples of structural ambiguity:

a) Another source of syntactic ambiguity is where whole phrases, typically prepositional phrases, can attach to more than one position in a sentence. For example, in the following example, the prepositional phrase with a Postscript interface can attach either to the NP the word processor package, meaning "the word-processor which is fitted or supplied with a Postscript interface", or to the verb connect, in which case the sense is that the Postscript interface is to be used to make the connection.
· Connect the printer to a word processor package with a Postscript interface.

b) This kind of real world knowledge is also an essential component in disambiguating the pronoun it in examples such as the following:
· Put the paper in the printer. Then switch it on.
In order to work out that it is the printer that is to be switched on, rather than the paper, one needs to use the knowledge of the world that printers (and not paper) are the sort of thing one is likely to switch on.
http://sirio.deusto.es/abaitua/konzeptu/ta/MT_book_1995/node53.html#SECTION00820000000000000000

All the examples given in these two questions are by Arnold D. J.

Thirdly, we have the problem of lexical and structural mismatches:

In the best of all possible worlds for NLP, every word would have exactly one sense. While this is true for most NLP, it is an exaggeration as regards MT (...) Some of the translation problems are to do with lexical differences between languages, that is, differences in the ways in which languages seem to classify the world, what concepts they choose to express by single words, and which they choose not to lexicalize (...) the result is that we have to complicate the translation process. (...) words like English cottage, Russian dacha, French château, etc.
for which no adequate translation exists, and for which the human translator must decide between straight borrowing, neologism, and providing an explanation (...) Calling cases such as those above lexical mismatches is not controversial. However, when one turns to cases of structural mismatch, classification is not so easy. A particularly obvious example of this involves problems arising from what are sometimes called lexical holes, that is, cases where one language has to use a phrase to express what another language expresses in a single word. Examples of this include the `hole' that exists in English with respect to French ignorer (`to not know', `to be ignorant of'), and se suicider (`to suicide', i.e. `to commit suicide', `to kill oneself'). The problems raised by such lexical holes have a certain similarity to those raised by idioms: in both cases, one has phrases translating as single words. One kind of structural mismatch occurs where two languages use the same construction for different purposes, or use different constructions for what appears to be the same purpose. Cases where the same structure is used for different purposes include the use of passive constructions in English, and Japanese. (...) in general, the result of this is that one cannot have simple rules like those described for passives. In fact, unless one uses a very abstract structure indeed, the rules will be rather complicated. We can see different constructions used for the same effect in cases like the following:

a) He is called Sam.
   Er heißt Sam. `He is-named Sam'
   Il s'appelle Sam. `He calls himself Sam'
b) Sam has just seen Kim.
   Sam vient de voir Kim. `Sam comes of see Kim'
c) Sam likes to swim.
   Sam zwemt graag. `Sam swims likingly'

[Figure: venir-de and have-just]

The first example shows how English, German and French choose different methods for expressing `naming'.
The other two examples show one language using an adverbial ADJUNCT (just, or graag (Dutch) `likingly' or `with pleasure'), where another uses a verbal construction. This is actually one of the most discussed problems in current MT, and it is worth examining why it is problematic. In particular, notice that while the main verb of the English sentence is see, the main verb of the French sentence is venir-de.

[Figure: Translating have-just into venir-de]

Of course, given a complicated enough rule, all this can be stated. However, there will still be problems because writing a rule in isolation is not enough. One must also consider how the rule interacts with other rules. For example, there will be a rule somewhere that tells the system how see is to be translated, and what one should do with its SUBJECT and OBJECT. One must make sure that this rule still works (...)

[Figure: The Representation of venir-de]

Sam has probably just seen Kim. ⇒ Il est probable que Sam vient de voir Kim. `It is probable that Sam comes of see Kim'

Of course, one could try to argue that the difference between English just and French venir de is only superficial. The argument could, for example, say that just should be treated as a verb at the semantic level. However, this is not very plausible. There are other cases where this does not seem possible. (...) where English uses a `manner' verb and a directional adverb/prepositional phrase, French uses a directional verb and a manner adverb. That is, where English classifies the event described as `running', French classifies it as an `entering'. A slightly different sort of structural mismatch occurs where two languages have `the same' construction (more precisely, similar constructions, with equivalent interpretations), but where different restrictions on the constructions mean that it is not always possible to translate in the most obvious way. (...) English and French differ in that English permits prepositions to be `stranded' (i.e. to appear without their objects, like in a).
French normally requires the preposition and its object to appear together, as in d), although of course English allows this too. In general, relative clause constructions in English consist of a head noun (letters in the previous example), a relative pronoun (such as which), and a sentence with a `gap' in it. The relative pronoun (and hence the head noun) is understood as if it filled the gap (...) In English, there are restrictions on where the `gap' can occur. In particular, it cannot occur inside an indirect question, or a `reason' ADJUNCT. These sorts of problem are beyond the scope of current MT systems; in fact, they are difficult even for human translators.
(Arnold D. J., 21 December 1995, http://clwww.essex.ac.uk/MTbook/)

As far as idioms are concerned, these are expressions whose component words do not help us to guess the overall meaning:

Roughly speaking, idioms are expressions whose meaning cannot be completely understood from the meanings of the component parts. For example, whereas it is possible to work out the meaning of a) on the basis of knowledge of English grammar and the meaning of words, this would not be sufficient to work out that b) can mean something like `If Sam dies, her children will be rich'. This is because kick the bucket is an idiom.
a) If Sam mends the bucket, her children will be rich.
b) If Sam kicks the bucket, her children will be rich.
The problem with idioms, in an MT context, is that it is not usually possible to translate them using the normal rules. There are exceptions, for example take the bull by the horns (meaning `face and tackle a difficulty without shirking') can be translated literally into French as prendre le taureau par les cornes, which has the same meaning. But, for the most part, the use of normal rules in order to translate idioms will result in nonsense. Instead, one has to treat idioms as single units in translation.
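As a rough illustration of what treating idioms as single units could look like, here is a minimal Python sketch. It is not the method of any real MT system: the two lexicons, the function name translate and the French glosses are all hypothetical toy data, and the matching only handles one fixed form of the idiom.

```python
# Toy English-to-French lexicons. All entries are hypothetical and only
# rough word-for-word glosses, chosen to mirror the "kick the bucket" example.
WORD_LEXICON = {
    "sam": "Sam",
    "kicks": "frappe",
    "mends": "répare",
    "the": "le",
    "bucket": "seau",
}

# Idioms are stored as whole units: a tuple of source words maps to one target.
IDIOM_LEXICON = {
    ("kicks", "the", "bucket"): "meurt",
}


def translate(sentence, use_idioms=True):
    """Translate word by word, optionally matching idioms as single units first."""
    tokens = sentence.lower().rstrip(".").split()
    output = []
    i = 0
    while i < len(tokens):
        matched = False
        if use_idioms:
            # Try longer idioms before shorter ones at the current position.
            for idiom in sorted(IDIOM_LEXICON, key=len, reverse=True):
                if tuple(tokens[i:i + len(idiom)]) == idiom:
                    output.append(IDIOM_LEXICON[idiom])
                    i += len(idiom)
                    matched = True
                    break
        if not matched:
            # Fall back to the ordinary word lexicon; unknown words pass through.
            output.append(WORD_LEXICON.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(output)


print(translate("Sam kicks the bucket.", use_idioms=False))  # Sam frappe le seau
print(translate("Sam kicks the bucket."))                    # Sam meurt
print(translate("Sam mends the bucket."))                    # Sam répare le seau
```

With the idiom table switched off, the sketch produces the kind of literal nonsense the quotation warns about. Because it matches only the exact string "kicks the bucket", it would miss "kicked the bucket", and it always prefers the idiomatic reading even when the sentence is literally about buckets.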
In many cases, a natural translation for an idiom will be a single word --- for example, the French word mourir (`die') is a possible translation for kick the bucket. Lexical holes and idioms are frequently instances of word phrase translation (...) In general, there are two approaches one can take to the treatment of idioms. The first is to try to represent them as single units in the monolingual dictionaries. What this means is that one will have lexical entries such as kick_the_bucket. A more reasonable idea is (...) to allow analysis rules to replace pieces of structure by information which is held in the lexicon at different stages of processing, just as they are allowed to change structures in other ways. (...) this approach will lead to translation rules saying something like the following, in a transformer or transfer system (in an interlingual system, idioms will correspond to collections of concepts, or single concepts in the same way as normal words). The second approach to idioms is to treat them with special rules that change the idiomatic source structure into an appropriate target structure. This would mean that kick the bucket and kick the table would have similar representations all through analysis. Clearly, this approach is only applicable in transfer or transformer systems, and even here, it is not very different from the first approach --- in the case where an idiom translates as a single word (...) One problem with sentences which contain idioms is that they are typically ambiguous, in the sense that either a literal or idiomatic interpretation is generally possible (i.e. the phrase kick the bucket can really be about buckets and kicking). The real problem with idioms is that they are not generally fixed in their form, and that the variation of forms is not limited to variations in inflection (as it is with ordinary words). Thus, there is a serious problem in recognising idioms. This problem does not arise with all idioms. 
Some are completely frozen forms whose parts always appear in the same form and in the same order. Examples are phrases like in fact, or in view of. However, such idioms are by far the exception. A typical way in which idioms can vary is in the form of the verb, which changes according to tense, as well as person and number (...) A second common form of variation is in the form of the possessive pronoun in expressions like to burn one's bridges (meaning `to proceed in such a way as to eliminate all alternative courses of action'). This varies in a regular way with the subject of the verb. In other cases, only the syntactic category of an element in an idiom can be predicted. Another common form of variation arises because some idioms allow adjectival modifiers (...) Of course, not all idioms allow these variations (e.g. one cannot passivize kick the bucket meaning `die'), and, as noted, some do not allow any variation in form.
(Arnold D. J., 21 December 1995, http://clwww.essex.ac.uk/MTbook/)

Finally, rather different from idioms are (...) collocations. Here the meaning can be guessed from the meanings of the parts. What is not predictable is the particular words that are used.
a) This butter is rancid (*sour, *rotten, *stale).
b) This cream is sour (*rancid, *rotten, *stale).
c) They took (*made) a walk.
d) They made (*took) an attempt.
e) They had (*made, *took) a talk.
For example, the fact that we say rancid butter, but not *sour butter, and sour cream, but not *rancid cream does not seem to be completely predictable from the meaning of butter or cream, and the various adjectives. Similarly the choice of take as the verb for walk is not simply a matter of the meaning of walk (for example, one can either make or take a journey). In what we have called linguistic knowledge (LK) systems, at least, collocations can potentially be treated differently from idioms.
This is because for collocations one can often think of one part of the expression as being dependent on, and predictable from the other. For example, one may think that make, in make an attempt has little meaning of its own, and serves merely to `support' the noun (such verbs are often called light verbs, or support verbs). This suggests one can simply ignore the verb in translation, and have the generation or synthesis component supply the appropriate verb. For example, in Dutch, this would be doen, since the Dutch for make an attempt is een poging doen (`do an attempt'). One way of doing this is to have analysis replace the lexical verb (e.g. make) with a `dummy verb' (e.g. VSUP). This can be treated as a sort of interlingual lexical item, and replaced by the appropriate verb in synthesis (the identity of the appropriate verb has to be included in the lexical entry of nouns (...)) The advantage is that support verb constructions can be handled without recourse to the sort of rules required for idioms (...) Lexical functions express a relation between two words. Take the case of heavy smoker, for example. The relationship between heavy and smoker is that of intensification, indicating that the appropriate adjective for English smoker is heavy, whereas that for the corresponding French word fumeur is grand (`large') and that for the German word Raucher is stark (`strong').
(Arnold D. J., 21 December 1995, http://clwww.essex.ac.uk/MTbook/)

Having commented on the clear differences between human translation and machine translation, we now need to extend our knowledge of Machine Translation, since it is another relevant issue that we are dealing with here.

The term machine translation (MT) is normally taken in its restricted and precise meaning of fully automatic translation. We define Fully Automated Machine Translation (FAMT) to be MT performed without the intervention of a human being during the process.
Human-Assisted Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a (mono- or bilingual) human for help. Machine-Aided Translation (MAT) is the style of translation in which a human does most of the work but uses one or more computer systems, mainly as resources such as dictionaries and spelling checkers, as assistants. Traditionally, two very different classes of MT have been identified:
1. Assimilation refers to the class of translation in which an individual or organization wants to gather material written by others in a variety of languages and convert them all into his or her own language.
2. Dissemination refers to the class in which an individual or organization wants to broadcast his or her own material, written in one language, in a variety of languages to the world.
However, a third class of translation has also recently become evident:
3. Communication refers to the class in which two or more individuals are in more or less immediate interaction, typically via email or otherwise online, with an MT system mediating between them.
Each class of translation has very different features, is best supported by different underlying technology, and is to be evaluated according to somewhat different criteria.
http://sirio.deusto.es/abaitua/konzeptu/nlp/Mlim/mlim4.html

Now that the definition has been given, we add some historical notes about it:

Analogy-based systems made their appearance in the 1990s and apply methods of statistical proximity to samples of previously translated texts. Some authors describe them as "assisted translation" systems. Until the 1990s, one of the firmest premises among the research community was to consider translation as fundamentally a problem of semantic equivalence.
This theoretical assumption is probably the factor that has most harmed the development of translation systems that are useful to translators.
http://sirio.deusto.es/abaitua/konzeptu/ta/mt10h_es/ta10h-3es.htm
http://sirio.deusto.es/abaitua/konzeptu/nlp/Mlim/mlim4.html

Ten years ago, the typical users of machine translation were large organizations such as the European Commission, the US Government, the Pan American Health Organization, Xerox, Fujitsu, etc. Fewer small companies or freelance translators used MT, although translation tools such as online dictionaries were becoming more popular. However, ongoing commercial successes in Europe, Asia, and North America continued to illustrate that, despite imperfect levels of achievement, the levels of quality being produced by FAMT and HAMT systems did address some users' real needs. In response, the European Commission funded the Europe-wide MT research project Eurotra, which involved representatives from most of the European languages, to develop a large multilingual MT system (Johnson, et al., 1985). Eurotra, which ended in the early 1990s, had the important effect of establishing Computational Linguistics groups in several countries where none had existed before. Following this effort, and responding to the promise of statistics-based techniques (...) the US Government funded a four-year effort, pitting three theoretical approaches against each other in a frequently evaluated research program. As we reach the end of the decade, the only large-scale multi-year research project on MT worldwide is Verbmobil in Germany (Niemann et al., 1997), which focuses on speech-to-speech translation of dialogues in the rather narrow domain of scheduling meetings. Thanks to ongoing commercial growth and the influence of new research, the situation is different today from ten years ago.
There has been a trend toward embedding MT as part of linguistic services, which may be as diverse as email across nations, foreign-language web searches, traditional document translation, and portable speech translators with very limited lexicons (for travelers, soldiers, etc).

These are the major approaches and techniques that MT uses: http://sirio.deusto.es/abaitua/konzeptu/nlp/Mlim/mlim4.html

· Statistical vs. Linguistic MT
The CANDIDE system (...) changed the face of MT, showing that MT systems using statistical techniques to gather their rules of cross-language correspondence were feasible competitors to traditional, purely hand-built ones. However, CANDIDE did not convince the community that the statistics-only approach was the optimal path; in developments since 1994, it has included steadily more knowledge derived from linguistics. This left the burning question: which aspects of MT systems are best approached by statistical methods and which by traditional, linguistic ones? While it is clear by now that some modules are best approached under one paradigm or the other, it is a relatively safe bet that others are genuinely hermaphroditic, and that their best design and deployment will be determined by the eventual use of the system in the world (...) we will have different kinds of MT systems that use different translation engines and concentrate on different functions.

· Symbolic vs. Statistical
Major applications include:
Assimilation tasks: lower quality, broad domains - statistical techniques predominate
Dissemination tasks: higher quality, limited domains - symbolic techniques predominate
Communication tasks: medium quality, medium domain - mixed techniques predominate
Ideally, systems will employ statistical techniques to augment linguistic insights, allowing the system builder, a computational linguist, to specify the knowledge in the form most convenient to him or her, and have the system perform the tedious work of data collection, generalisation, and rule creation. Such collaboration will capitalise on the (complementary) strengths of linguist and computer, and result in much more rapid construction of MT systems for new languages, with greater coverage and higher quality. Still, how exactly to achieve this optimal collaboration is far from clear.

· Rule-based vs. Example-based MT
Most production systems are rule-based. That is, they consist of grammar rules, lexical rules, etc. More rules lead (...) into systems that are quite difficult to maintain. Consequently, alternative methods have been sought. Translation by analogy, usually called memory-based or example-based translation (EBMT), see (Nagao, 1984), is one answer to this problem. An analogy-based translation system has pairs of bilingual expressions stored in an example database. Just as for translation memories, the analogy-based translation builds on approved translations; consequently, the quality of the output is expected to be high. Unfortunately, however, purely analogy-based systems have problems with scalability (...) Consequently, a combination of the rule-based approach and the analogy-based approach is the solution (...)

· Transfer vs. Interlingual MT
Current rule-based MT uses either the Transfer architecture or the Interlingua architecture. The Intermediate Structure is a (usually grammatical) analysis of the text, one sentence at a time.
The Interlingua is a (putatively) language-neutral analysis of the text. The theoretical advantage of the Interlingua approach is that one can add new languages at relatively low cost, by creating only rules mapping from the new language into the Interlingua and back again. In contrast, the Transfer approach requires one to build mapping rules from the new language to and from each other language in the system. The Transfer approach involves a comparison between just the two languages involved. In practical systems, the transfer approach is often chosen simply because it is the simplest and scales up the best. This is an important virtue in the development of production systems. However, researchers will continue to pursue the Interlingual approach for a variety of reasons. Not only does it hold the promise of decreasing the cost of adding a new language, but it also encourages the inclusion of deeper, more abstract levels of representation, including discourse structure and interpersonal pragmatics, than are included in transfer structures.

· Multi-Engine MT
In recent years, several different methods of performing MT (transfer, example-based, simple dictionary lookup, etc.) have all shown their worth in the appropriate circumstances. A promising recent development has been the attempt to integrate various approaches into a single multi-engine MT system. The idea is very simple: pass the sentence(s) to be translated through several MT engines in parallel, and at the end combine their output, selecting the best fragment(s) and recomposing them into the target sentence(s).

· Speech-to-Speech Translation
Current commercially available technology makes speech-to-speech translation already possible and usable.

Besides, we are also interested in the foreseeable breakthroughs of MT in both the short and the longer term:

Several applications have proven to be able to work effectively using only subsets of the knowledge required for MT.
It is possible now to evaluate different tasks, to measure the information involved in solving them, and to identify the most efficient techniques for a given task. Besides, one cannot discard the power of efficient techniques that yield better results than older approaches (...) On the other hand, it has been proven that good theoretically motivated and linguistically driven tagging label sets improve the accuracy of statistical systems. Hence we must be ready to separate the knowledge we want to represent from the techniques/formalisms that have to process it. In summary, we should be concerned with identifying what techniques can lead to better results under separation of phenomena: transfer vs. interlingua (including ontologies), grammar-based vs. example-based techniques, and so on. We should be willing to view alternatives not as competing approaches but as complementary techniques, the key point being to identify how to structure and to control the combination of all of them. I will distinguish now two different trends expected in five years: One important trend, of which the first instances can be seen already, is the availability of MT for casual, one-off, use via the Internet. Such services can either be standalone MT (Lernout and Hauspie and Systran) or bundled with some other application, such as web access (website of Altavista and Systran), multilingual information retrieval in general, text summarization (...) A second trend can also be recognized: the availability of low-quality portable speech-to-speech MT systems. It is expected that these domains will increase in size and complexity as speech recognition becomes more robust. As analysis and generation theory and practice becomes more standardized and established, the focus of research will increasingly turn to methods of constructing low-quality yet adequate MT systems (semi-)automatically. 
Methods of automatically building multilingual lexicons and wordlists involve bitext alignment and word correspondence discovery (...) (...)future developments will include highly integrated approaches to translation (integration of translation memory and MT, hybrid statistical-linguistic translation, multi-engine translation systems, and the like). We are likely to witness the development of statistical techniques to address problems that defy easy formalization and obvious rule-based behavior, such as sound transliteration (Knight and Graehl, 1997), word equivalence across languages (Wu, 1995), wordsense disambiguation (Yarowsky, 1995), etc. There are two other ongoing developments which do not draw much on empirical linguistics: The first is the continuing integration of low-level MT techniques with conventional word processing to provide a range of aids, tools, lexicons, etc., for both professional and occasional translators. The second continuing development, set apart from the statistical movement, is a continuing emphasis on large-scale handcrafted resources for MT. This emphasis implicitly rejects the assumptions of the empirical movement that such resources could be partly or largely acquired automatically by, e.g., extraction of semantic structures from machine readable dictionaries, of grammars from treebanks or by machine learning methods. Fortunately, recent developments in Text Linguistics, Discourse Study, and computational text planning have led to theories and techniques that are potentially of great importance for MT. As I have already mentioned above, the localization industry has infiltrated many facets of the market such as the Internet. http://www.europarl.eu.int/stoa/publi/99-12-01/part2en.htm#b The Internet is a channel allowing information to be transmitted or stored. 
The essential features of this channel can be summed up in four points: · Efficient operation: Communication via the Internet is rapid (in some cases instantaneous), powerful (large volumes of traffic can be supported), reliable (messages are delivered with precision), and, once the necessary technological infrastructure and tools are in place, cheap in comparison to alternative channels of communication. · Global extension: The Internet renders geographical distances insignificant, turning the world into a "global village". Consequently, other obstacles to communication acquire greater relevance, including possession of the required technology (hence ultimately economic factors) and cultural differences (particularly language). · Flexible use: A wide and increasing variety of types of communication can be realized via the Internet, transmitting different sorts of content through different media; the only limits are the potential for such content and media to be digitalized, the capacity of current technology to perform such digitalization, and the availability of hardware and communications infrastructure with the required capacities and power. · Electronic form: The electronic nature of the channel is the key element behind the aforementioned features; it also implies other benefits. Anything that can be done electronically can be done via the Internet; hence, more and more of modern technology can employ the same common channel, including the numerous aspects of Information Technology which are beginning to emerge at the present time. But what is the role of minority languages within the net? http://www.europarl.eu.int/stoa/publi/99-12-01/part2en.htm Whether we are talking about machine or human translation, various considerations will tend to make the position worse for the smaller languages in the network (i.e. for most languages), in contrast to a very few large communities whose languages are widely used. 
The more limited the resources of a given language community within the network (and on average we would expect smaller language communities to dispose of fewer resources), the greater the proportional effort required to achieve integration into a common multilingual system if the language community in question must make this effort on its own. Calculated in relation to the intensity of demand for translations between each language pair, also, the less widely a language is used in absolute terms, the more expensive, in terms of cost/benefit, it will be to provide the full range of translation services, whether by human or mechanical means (...) We have seen that Machine translation is a method to translate, but what we have not taken into account are the ways in which it can be applied on the Internet. I do include some quotations so that I can illustrate this: Machine translation is probably the oldest application of natural language processing. Its 50 years of history have seen the development of several major approaches and, (...) still, today, there is no dominant approach. Despite the commercial success of many MT systems, tools, and other products, the main problem remains unsolved, and the various ways of combining approaches and paradigms are only beginning to be explored (...) The future of MT is rosy. Thanks largely to the Internet and the growth of international commerce, casual (one-off) and repeated MT is growing at a very fast pace. Correspondingly, MT products are coming to market as well. The Machine Translation Compendium (Hutchins, 1999) lists commercial products in over 30 languages (including Zulu, Ukrainian, Dutch, Swahili, and Norwegian) in 83 language pairs (...) In tandem with this growth, it is imperative to ensure that research in MT begins again (...) 
http://sirio.deusto.es/abaitua/konzeptu/nlp/Mlim/mlim4.html
Editor: Bente Maegaard. Contributors: Nuria Bel, Bonnie Dorr, Eduard Hovy, Kevin Knight, Hitoshi Iida, Christian Boitet, Bente Maegaard, Yorick Wilks.

Finally, I would like to close my report with a brief introduction to the area of corpora. Given the breadth of this topic, I will focus, on the one hand, on the areas of application of linguistic corpus research, through these quotations:

"Early corpus linguistics" is a term we use here to describe linguistics before the advent of Chomsky. Below is a brief overview of some interesting corpus-based studies predating 1950.
· Language acquisition
The studies of child language in the diary studies period of language acquisition research (roughly 1876-1926) were based on carefully composed parental diaries recording the child's locutions (...) Corpus collection continued and diversified after the diary studies period: large sample studies covered the period roughly from 1927 to 1957 (...) Longitudinal studies have been dominant from 1957 to the present, again based on collections of utterances, but this time with a smaller (approximately 3) sample of children who are studied over long periods of time (e.g. Brown (1973) and Bloom (1970)).
· Spelling conventions
· Language pedagogy
http://sirio.deusto.es/abaitua/konzeptu/corpus/corpus1/1fra1.htm

On the other hand, according to Chomsky (1964) and Abercrombie (1965), the main drawbacks of corpus methodology for studying human languages are the following:

(...) the nub of Chomsky's initial criticism: a corpus is by its very nature a collection of externalised utterances; it is performance data and is therefore a poor guide to modelling linguistic competence. Further to that, if we are unable to measure linguistic competence, how do we determine from any given utterance what are linguistically relevant performance phenomena?
This is a crucial question, for without an answer to this, we are not sure that what we are discovering is directly relevant to linguistics. The impact of the criticisms levelled at early corpus linguistics in the 1950s was immediate and profound. Corpus linguistics was largely abandoned during this period, although it never totally died.
http://sirio.deusto.es/abaitua/konzeptu/corpus/corpus1/1fra1.htm
Apart from Chomsky's theoretical criticisms, there were problems of practicality with corpus linguistics. Abercrombie (1965) summed up the corpus-based approach as being composed of "pseudo-techniques". Can you imagine searching through an 11-million-word corpus such as that of Kading (1897) using nothing more than your eyes? The whole undertaking becomes prohibitively time consuming, not to say error-prone and expensive. Whatever Chomsky's criticisms were, Abercrombie's were undoubtedly correct. Early corpus linguistics required data processing abilities that were simply not available at that time.
I should also mention the importance of annotating a corpus, since annotation greatly increases its value.
http://sirio.deusto.es/abaitua/konzeptu/corpus/corpus2/2fra2.htm
If corpora is said to be unannotated it appears in its existing raw state of plain text, whereas annotated corpora has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated, making it no longer a body of text where linguistic information is implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through the process of concrete annotation.
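The difference between an unannotated and an annotated corpus, as described in the quotation above, can be sketched in a few lines of Python. This is only an illustrative sketch: the sample sentence, the part-of-speech tags (DET, NOUN, VERB) and the word_TAG notation are assumptions made for the example and are not taken from the quoted source.

```python
from collections import Counter

# Unannotated corpus: plain text, linguistic information is only implicit.
raw_text = "the dog barks"

# Annotated corpus: the same words with hand-assigned, hypothetical
# part-of-speech tags, written in an assumed word_TAG notation.
annotated_text = "the_DET dog_NOUN barks_VERB"


def parse_annotated(text):
    """Split word_TAG tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def words_with_tag(pairs, tag):
    """Return every word that carries the given tag."""
    return [word for word, t in pairs if t == tag]


pairs = parse_annotated(annotated_text)
tag_counts = Counter(tag for _, tag in pairs)

# The raw text only gives us the words themselves.
print(raw_text.split())
# The annotated text lets us ask a linguistic question directly.
print(words_with_tag(pairs, "NOUN"))
```

With the raw text we can only list the words, whereas the annotated version lets us retrieve, for instance, all the nouns without analysing the sentence again.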
To sum up, I think that "Human Language Technologies" has been a really interesting topic, because I have learnt about a field in which I had never worked before. In my opinion, it has probably been so appealing in the end because of the many difficulties I have had to overcome: 1. Finishing the whole report has taken me long hours. 2. I only had access to the documents while I was at the university. 3. The information was really new to me and it contained technical terms. Besides, I have now become aware of the importance that human beings still have, e.g. in translation, even though technology and machines are taking over many jobs in which people were considered indispensable some years ago. I would also highlight the connections between linguistics and technology. This is really striking to me, as I never thought they could have anything in common. Finally, the theme I enjoyed most was the opposition between human translation and machine translation. It has also led me to think deeply and carefully about it.