REVIEW OF HUMAN LANGUAGE TECHNOLOGIES
by Tamara Diez González
This report reviews Language Engineering, covering Language Technologies and the Information Society, as well as Machine Translation. It has been prepared using the information provided in the class of “English Language and New Technologies” and also sources found on the Internet, particularly through the Google search engine. It contains a great many quotations from other sites, but I have also tried to introduce my own ideas and opinions on these themes.
This report will first cover the term Language Engineering. I will try to clarify it, and I will introduce related terms such as Natural Language Processing, Computational Linguistics, and Human Language Technologies. These technologies, developed over the last few decades, enable people to communicate easily with computers in human language, both written and spoken, as if we were dealing with an intelligent machine that responds to us in an extremely effective way. Language itself is therefore very important, because it poses certain difficulties that have to be taken into account. To that end, we will reach the topic of Machine Translation, and we will see those difficulties as well as the advances made in that field.
Nevertheless, before we reach that point, I will also expand on the term Information Society, which has lately evolved into an important matter to take into consideration. This is mainly because of the great amount of information handled in our everyday lives, which gives it great importance in relation to the technology used in the common activities of our society.
Language is a communication mechanism whose medium is text or speech, and Language Engineering is concerned with the computer processing of those media. Engineering uses scientific knowledge to build artifacts in such a way that one can expect them to perform as required.
A field of engineering comprises the body of scientific knowledge relevant to a particular engineering task: a tool that, by means of that knowledge, offers solutions to problems. Thus, we define engineering as a construction process directed both by the intended conformance of the resulting artifact to well-specified criteria of fitness and by constraints operating on the nature of the process itself. Both the construction process and its outputs should be measurable and predictable, and the activity is informed by relevant scientific knowledge and practical experience.
Turning to the term Computational Linguistics, we may say that it is the part of the science of human language that uses computers to experiment with language. That is, CL is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty. Computational linguistics has applied and theoretical components. Applied Computational Linguistics focuses on the practical outcome of modelling human language use. The methods, techniques, tools and applications in this area are often subsumed under the term language engineering or (human) language technology. This leads us to the term Human Language Technology.
HLT has the objective of supporting e-business in a global context and promoting a human centred infostructure ensuring equal access and usage opportunities for all. This is to be achieved by developing multilingual technologies and demonstrating exemplary applications providing features and functions that are critical for the realisation of a truly user friendly Information Society.
Human Language Technology is sometimes quite familiar. We may deal with it without even knowing it, because it is present in things such as the spell checker in a word processor. At other times it is hidden inside complex systems, such as a machine for automatically reading postal addresses. It is therefore clearly important for many activities we face in school, home or work environments. From speech recognition to automatic translation, Human Language Technology products and services enable humans to communicate more naturally and more effectively with their computers, but, above all, with each other.
In relation to these terms, we also find Natural Language Processing. It is a subfield of artificial intelligence and linguistics. It studies the problems inherent in the processing and manipulation of natural language, but not, generally, natural language understanding. Natural language interfaces enable the user to communicate with the computer in any human language. The main aim of Natural Language Processing is to design and build software that will analyze, understand, and generate the languages that humans use naturally, so that a person will be able to address a computer as if it were another human being. But this is not an easy task. Understanding a language involves knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way. If we think about this, it is hard to digest the fact that what is the most natural of actions for us humans is not so easy for computers to process. For both written and spoken language, machines have difficulty working appropriately enough to provide the correct response to our requests.
The major tasks in NLP are the following:
Text to speech: Speech synthesis is the generation of human speech without directly using a human voice. Generally speaking, a speech synthesizer is software or hardware capable of rendering artificial speech. Speech synthesis systems are often called text-to-speech systems in reference to their ability to convert text into speech. However, there are also systems that can only render symbolic linguistic representations, such as phonetic transcriptions, into speech.
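To make the idea of a symbolic linguistic representation concrete, here is a minimal sketch of the text-analysis front end of a text-to-speech system: words are mapped to a phonetic transcription that a synthesiser back end would render as audio. The tiny lexicon and its ARPAbet-style transcriptions are illustrative assumptions, not a real system's data.

```python
# Hypothetical mini-lexicon; real systems use large pronunciation
# dictionaries plus letter-to-sound rules for unknown words.
LEXICON = {
    "hello": "HH AH L OW",
    "world": "W ER L D",
}

def to_phonemes(text):
    """Map each word to its phonetic transcription, if known."""
    phonemes = []
    for word in text.lower().split():
        phonemes.append(LEXICON.get(word, "<unk>"))
    return " | ".join(phonemes)

print(to_phonemes("Hello world"))  # HH AH L OW | W ER L D
```

A full synthesiser would then turn each phoneme sequence into a waveform; the point here is only that text is first converted into symbols before any speech is produced.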
Speech recognition: Speech recognition technologies allow computers equipped with microphones to interpret human speech, for example for transcription or as a control method. These systems differ from one another. While some require the user to "train" the system to recognise their particular speech patterns, others do not. Some can recognise continuous speech, while others require the user to pause after each word, that is, to speak discrete words. They also differ in the vocabulary the system recognises, which can be either small or large.
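One classic technique behind small-vocabulary, discrete-word recognisers is template matching with dynamic time warping (DTW), which tolerates the fact that the same word is rarely spoken at the same speed twice. This is a sketch under simplifying assumptions: real systems compare acoustic feature vectors such as MFCCs, not the toy one-dimensional sequences used here.

```python
def dtw(a, b):
    """Dynamic time warping alignment cost between two feature sequences."""
    INF = float("inf")
    cost = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between frames
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[len(a)][len(b)]

def recognise(utterance, templates):
    """Return the word whose stored template is closest under DTW."""
    return min(templates, key=lambda w: dtw(utterance, templates[w]))

# Hypothetical stored templates for two words.
templates = {"yes": [1, 3, 3, 2], "no": [5, 6, 6, 5]}
print(recognise([1, 3, 2], templates))  # yes
```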
Natural language generation: Natural Language Generation is the natural language processing task of generating natural language from a machine representation such as a knowledge base or a logical form. Some people view Natural Language Generation as the opposite of natural language understanding. The difference is this: whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language, in Natural Language Generation the system needs to make decisions about how to put a concept into words.
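The simplest way to put a machine representation into words is template-based generation, sketched below. The fact record and its field names are illustrative assumptions; real NLG systems also plan sentence structure, choose referring expressions, and handle agreement.

```python
def generate(fact):
    """Render a subject-relation-object fact as an English sentence."""
    # One surface template per relation; a real system would choose
    # among many templates and inflect the words as needed.
    templates = {
        "capital_of": "{obj} is the capital of {subj}.",
        "population": "{subj} has a population of {obj}.",
    }
    return templates[fact["relation"]].format(subj=fact["subject"],
                                              obj=fact["object"])

fact = {"subject": "Spain", "relation": "capital_of", "object": "Madrid"}
print(generate(fact))  # Madrid is the capital of Spain.
```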
Machine translation: Machine translation is the process of automatic translation from one natural language to another by a computer. One of the earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available that produce output which, if not perfect, is of sufficient quality to be useful in a number of areas and to assist human translators.
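Why is the goal elusive? The most naive MT strategy, word-for-word substitution, already shows the problem: the tiny Spanish-English glossary below is illustrative, and the output demonstrates that word order and agreement are not handled by simple lookup.

```python
# Hypothetical mini-glossary; real systems model whole phrases,
# word order and context, not isolated words.
GLOSSARY = {"la": "the", "casa": "house", "blanca": "white"}

def translate(sentence):
    """Naive word-for-word translation: substitute each word in place."""
    return " ".join(GLOSSARY.get(w, w) for w in sentence.lower().split())

print(translate("la casa blanca"))  # "the house white", not "the white house"
```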
Question answering: Question Answering is a type of information
retrieval. Given a collection of documents the system should be able to retrieve
answers to questions posed in natural language. QA is regarded as requiring more
complex natural language processing techniques than other types of information
retrieval such as document retrieval, and it is sometimes regarded as the next
step beyond search engines. Closed-domain question answering deals with
questions under a specific domain, and can be seen as an easier task because NLP
systems can exploit domain-specific knowledge such as ontologies. Open-domain question answering deals with questions about nearly anything and can only rely on general ontologies. On the other hand, these systems have much more data available from which to extract the answer.
Information retrieval: Information retrieval is the art and science of searching for information in documents, searching for the documents themselves, searching for metadata which describes documents, or searching within databases, whether stand-alone relational databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data. There is a common confusion, however, between data, document, information, and text retrieval, each of which has its own body of literature, theory, praxis and technologies. Web search engines such as Google and Lycos are amongst the most visible applications of information retrieval research.
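The classic ranking scheme in document retrieval is TF-IDF weighting: a document scores highly when it contains the query terms often (term frequency) and those terms are rare across the collection (inverse document frequency). The sketch below illustrates the idea only; real search engines use far more elaborate ranking.

```python
import math

def tfidf_scores(query, docs):
    """Score each document against the query; higher means more relevant."""
    tokenised = [d.lower().split() for d in docs]
    n = len(docs)
    scores = []
    for terms in tokenised:
        score = 0.0
        for q in query.lower().split():
            tf = terms.count(q)                       # term frequency in this doc
            df = sum(1 for t in tokenised if q in t)  # how many docs contain q
            if tf:
                score += tf * math.log(n / df)        # idf = log(N / df)
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs chase the cat", "sunny weather today"]
print(tfidf_scores("cat mat", docs))  # first document scores highest
```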
Information extraction: Information
extraction is a type of information retrieval whose goal is to automatically
extract structured or semistructured information from unstructured
machine-readable documents. A typical application of IE is to scan a set of
documents written in a natural language and populate a database with the
information extracted. Current approaches to IE use natural language processing
techniques that focus on very restricted domains.
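The restricted-domain character of IE can be illustrated with a single extraction pattern that scans text for "X was born in Y" and populates structured records. The pattern and field names are illustrative; real IE systems combine many such patterns with linguistic analysis.

```python
import re

# One hypothetical extraction pattern for a very restricted domain.
PATTERN = re.compile(r"(\w+) was born in (\w+)")

def extract(text):
    """Populate database-style records from free text."""
    return [{"person": p, "birthplace": c} for p, c in PATTERN.findall(text)]

text = "Cervantes was born in Alcala. Goya was born in Fuendetodos."
print(extract(text))
```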
THE INFORMATION SOCIETY
The term Information Society has been around for a long time now and, indeed, has become something of a cliché. In the European Union, the concept of the Information Society has been evolving strongly over the past few years building on the philosophy originally spelled out by Commissioner Martin Bangemann in 1994. Bangemann argued that the Information Society represents a "revolution based on information ... [which] adds huge new capacities to human intelligence and constitutes a resource which changes the way we work together and the way we live together..." (European Commission, 1994:4). One of the main implications of this "revolution" for Bangemann is that the Information Society can secure badly needed jobs (Europe and the Global Information Society, 1994:3). In other words, a driving motivation for the Information Society is the creation of employment for depressed economies.
Closer to home it is instructive to look at just a few policy documents to see the views of the Information Society dominant here. The Goldsworthy report sees the Information Society as a "societal revolution based around information and communication technologies and about the role of these in developing global competitiveness and managing the transition to a globalised free trade world" (Department of Industry, Science and Tourism, 1997). In short, Goldsworthy's idea of the Information Society is entirely an economic one. Given this blind faith in the existence and the desirability of an Information Society among diverse nations, it is instructive to look at the theoretical literature which has spawned the idea to see what it claims for the Information Society. The term Information Society has many synonyms: Information Age, Information Revolution, Information Explosion and so on and it is found across a wide spectrum of disciplines.
This notion of the Information Society focuses on the gee-whiz technology as epitomised by the 'Towards 2000' TV series. In recent times, the emphasis is on the convergence of computers and telecommunications and the capacity for storage, manipulation and transmission of vast amounts of data. The Goldsworthy Report sits squarely in this category, following earlier Australian reports such as the Broadband Services Expert Group's document (Broadband Services Expert Group, 1994).
The problem, however, is that drawing a direct line between the presence of information technology and some sort of new society is hard to justify. Will the presence of a computer in every home make us an Information Society? Or should that be two computers? At what point will we know we've arrived? What changes in our fundamental institutions, ways of living and working characterise an Information Society, as opposed to a non-Information Society? A further weakness of this concept is highlighted by the many commentators who point out the dangers of technological determinism in thinking about the Information Society and reject the view that technology impacts on society and is the prime agent of change, defining the social world (Webster, 1995:10).
This concept of the Information Society has been built on Fritz Machlup's seminal study of the size and effect of the US information industries in the 1960s. Machlup demonstrated that education, the media, computing, information services, R&D and so on accounted for some 30% of GNP (Machlup, 1962). Marc Porat continued this line of enquiry and demonstrated the rising proportion of information-related activities in the US economy (Porat, 1977). Barry Jones replicated this work for Australia in his highly-cited Sleepers, Wake! (Jones, 1983). More recently, an ABC "Background Briefing" programme on the Information Economy highlighted the significance of the value of logical structures, the expression of cognitive processes, within computer software. This was referred to as the "weightless economy".
Entrancing as it is to have numbers to quote in support of the importance of information in the economy, it is difficult to argue that the existence of lots of information activities in society actually impacts on social life without moving to an analysis of the substance or quality of that information. In any event, what matters, surely, is not the amount but the meaning and value of information. Some econometric studies suggest that the early exponential growth of information activities as a proportion of economic activities has actually slowed, with little change from 1958 to 1980. This hardly supports the idea that information is growing steadily in its dominance (Rubin and Huber, 1986). And there is the added difficulty of applying economic concepts to the creation, processing, flow and use of information. Sandra Braman's analysis shows the pitfalls of thinking of information as a commodity, as this fails to accommodate the fact that many forms of activity around information are not driven by market forces, for example, culturally transmitted information. Nor does an economic approach acknowledge the inappropriateness of many basic economic assumptions, given that the form and substance of information are not the same thing. Finally, there is the difficulty that economic approaches require information to be measured in discrete pieces for economic valuation (Braman, 1996).
This idea of the Information Society rests on the idea that in an Information Society the dominant category of worker is engaged as an "information worker". Many commentators have produced data to demonstrate growth patterns in the need for more workers who will use their brain rather than their brawn. Daniel Bell's influential 'Coming of the Post-Industrial Society' argued that the professional and technical classes would dominate in the new era with work organised around theoretically based knowledge for the purpose of social control and directing of innovation and change (Bell 1974: 15-20).
As the former head of a school of Information Studies, I have real doubts about the usefulness of the figures in these analyses. Conscientious attempts by myself and colleagues to analyse market demand for graduates in Information Studies led to immense frustration as we grappled with the poor descriptive powers of job titles, and advertisements in general, in relation to the information activities in a given position. Some were fairly obvious - Data Base Designer, Librarian, Information Manager, Research Officer, but we quickly found that lurking beneath just about every position described in the Saturday advertisements was some component of information handling and processing. The challenge was to find a way of saying definitively whether a job was predominantly an information professional's job or not.
This difficulty may not be enough on its own to say that occupational trends cannot be reliably tapped and used as an indicator of broad developments over time, but it suggests the basis of the Bell, Porat and Jones studies is probably more than a bit wobbly. A quick consideration of Jonscher's two categories of worker applied to publishers and booksellers points up the same difficulty at a more general level. If workers deal with tangible products such as books, are they production workers or are they information workers? The dilemma comes from the reality that just about everyone's job has some information activities embedded in it, so that deciding when information handling dominates to the point where the worker is an "information worker" rather than a "production worker" is simply too hard. It has to be concluded, then, that attempting to define an Information Society according to the number of people in the business of information is problematic. Consequently, measurement of trends in employment in information work, or comparison between societies to decide which, if any, is an Information Society, seems destined to be highly unreliable.
Webster has identified two more concepts of the Information Society. Firstly, there is the spatial idea of the Information Society as a networked society, a global village where people of like minds and purposes are linked together through electronic networks. This idea is now coming through in some EU Information Society policy documents in the idea of the Information Society as a mechanism for developing cultural cohesion, empowerment and integration of communities across the Union (European Union, 1996a). It would be fair to say also that Australian information policy documents also incorporate both cultural and spatial concepts of Information Society. The Broadband Services Expert Group final report dealt with the question of equity of access (regardless of geography) and called for communication and information infrastructure developments to build on community and individual user need rather than technological capacity. (Broadband Services Expert Group, 1994:5). The Jones report mentioned earlier while focusing on economic and occupational aspects, acknowledges the Information Society as a period in which use of time, and family life will be influenced by access to information technology.
Looking to the implications of these varied ideas of the Information Society for public policy making it is clear there was a time when policy was clearly the business of the public sector and was essentially about "what governments choose to do and what not to do" (Dye 1995). The trouble now is that the edges of the public and private spheres are becoming more difficult to distinguish as has been amply demonstrated by papers in this strand of the Conference. It is interesting that the field of information studies has in some ways anticipated this development as it has accepted the place of private sector organisational policy on information matters to be recognised as "information policy" even though, at least traditionally, these policies were turned inwards to the support of organisational roles.
Some understanding of how the fusion of public and private impacts on information policy can be gained from Nick Moore's analysis of Western and East Asian information policy implementation strategies (Moore, 1997). Moore argues that there are two broad approaches to information policy formation. One, the neo-liberal, puts its trust in the market to move society along towards the Information Society. The European Union policies illustrate this particularly well as there the basic tenet of information policy is the belief that the achievement of the Information Society "is a task for the private sector" with the role of government confined to ensuring a supportive regulatory climate and a refocussing of current public expenditure patterns. Bangemann is adamant that additional public money, subsidies or protectionism will not be available and talks about the need to "strike down entrenched positions which put Europe at a competitive disadvantage". The role of government is strictly limited to providing a regulatory framework for a partnership of private and public sectors (European Commission, 1994:3).
In conclusion, it can be said that there is a generally optimistic response to the idea of the Information Society, and it is mostly enthusiastically endorsed as desirable. Many go further and say that it is absolutely essential for nations and regions to become an Information Society. There are, however, many conceptions of the Information Society, which means that there is an ambiguous foundation for policy makers. Added to this is the complexity of the different political philosophies which affect the implementation of information policy. This complexity is further compounded when we start to look at the informational component of the Information Society.
As mentioned at the beginning, machine translation is not an easy task for computers. The fact that we have the capacity to go easily through the process of acquiring a language does not mean that it should also be easy for machines. For certain things, such as mathematical calculation, computers are faster and more effective than humans. Nevertheless, the field of linguistics is not as advanced as other fields. Great advances have been made in that respect, but some difficulties remain unsolved. A natural language is not easily acquired by a machine because of the ambiguity and other factors that each language has.
This is very well seen if we go to an online translator such as Altavista's Babel Fish. These translators are not 100% effective. The grammar is incorrect most of the time, because they make a literal translation of a sentence. We also have to add the case of expressions that mean something different because they are said in a different way in each language. Thus, it is evident that in this field the human translator is often needed more than a machine translator. That gives us a kind of relief, because it means that as machines are not exact, the presence of a person will still be needed, so our work will be of much help. Even so, as machines advance, we will be needed less than at present.
We could understand the term Machine Translation as fully automatic translation. However, we also have to consider the whole range of tools that may support translation and document production in general, which is especially important when considering the integration of other language processing techniques and resources with MT. Thus, we define Machine Translation to include any computer-based process that transforms (or helps a user to transform) written text from one human language into another. We also have the term Human-Assisted Machine Translation, the style of translation in which a computer system does most of the translation, appealing to a human for help in cases of difficulty. And we also have to mention Machine-Aided Translation, the style of translation in which a human does most of the work but uses one or more computer systems, mainly as resources such as dictionaries and spelling checkers, as assistants.
Traditionally, two very different classes of Machine Translation have been identified, but in fact there are three. Assimilation refers to the class of translation in which an individual or organization wants to gather material written by others in a variety of languages and convert it all into his or her own language. Dissemination refers to the class in which an individual or organization wants to broadcast his or her own material, written in one language, in a variety of languages to the world. The third class of translation, which has recently become evident, is Communication, which refers to the class in which two or more individuals are in more or less immediate interaction, typically via email or otherwise online, with a Machine Translation system mediating between them. Each class of translation has very different features, is best supported by different underlying technology, and is to be evaluated according to somewhat different criteria.
Problems of Machine Translation
There are some particular problems which the task of translation poses for the builder of MT systems --- some of the reasons why MT is hard. It is useful to think of these problems under two headings: problems of ambiguity, and problems that arise from structural and lexical differences between languages and from multiword units like idioms and collocations.
These sorts of problem are not the only reasons why MT is hard. Other problems include the sheer size of the undertaking, as indicated by the number of rules and dictionary entries that a realistic system will need, and the fact that there are many constructions whose grammar is poorly understood, in the sense that it is not clear how they should be represented, or what rules should be used to describe them. This is the case even for English, which has been extensively studied, and for which there are detailed descriptions -- both traditional "descriptive" and theoretically sophisticated -- some of which are written with computational usability in mind. It is an even worse problem for other languages. Moreover, even where there is a reasonable description of a phenomenon or construction, producing a description which is sufficiently precise to be used by an automatic system raises non-trivial problems.
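The ambiguity problem mentioned above can be made concrete with a single English noun: "bank" needs different Spanish translations ("banco" or "orilla") depending on context. The sketch below disambiguates by overlap with hypothetical context-clue words; real MT systems use statistical or neural models trained on large corpora for this.

```python
# Illustrative sense inventory: each Spanish translation of "bank"
# is paired with context words that suggest it.
SENSES = {
    "bank": [
        ("banco", {"money", "account", "loan"}),
        ("orilla", {"river", "water", "fishing"}),
    ],
}

def translate_ambiguous(word, sentence):
    """Pick the translation whose context clues overlap the sentence most."""
    context = set(sentence.lower().split())
    return max(SENSES[word], key=lambda sense: len(sense[1] & context))[0]

print(translate_ambiguous("bank", "she opened an account at the bank"))   # banco
print(translate_ambiguous("bank", "we went fishing by the river bank"))   # orilla
```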
The Internet is both a vehicle for providing Machine Translation services and a major beneficiary of their application. To that extent, MT is likely to provide a further key to making the Internet a truly global medium which can transcend not only geographical barriers but also linguistic ones.
Europe, as the most notable focal point in the present-day world where a great capacity for technological innovation crosses paths with a high level of linguistic diversity, is excellently placed to lead the way forward. Other parts of the world are technologically capable but too self-contained and homogeneous culturally to acquire immediate awareness of the need for information technology to find its way across linguistic barriers, while still other communities are fully aware of the language problem but lack a comparable degree of access to technological resources and initiative needed to address the issue on such a scale. Whoever succeeds in making future communication global in linguistic terms will have forged a new tool of incalculable value to the entire world.
Through this report, I can say that I have learned about some terms that I did not know before. The activity of searching for information on the Internet and reading through it has given me knowledge of new terms that I could not talk about before. Even though I am not yet able to give deep opinions on the Information Society, in other fields such as Machine Translation I may say I am capable of doing so.
I have realized how important humans still are in the field of Machine Translation. Machine translators are still not fully developed. They make a lot of mistakes, and thus humans will still be needed because of their better knowledge in this field.
At first, I could not see the relation that computer technology had with our degree, but now I see that they are related. It is not a separate field, with machines on the one hand and humans on the other. It is evident that through the union of both, many advances can be made. And I see that however much either of them advances, both will always be needed. Humans are not perfect, but we have also seen that machines are not either. So we will always be needed.
Concerning the Internet, I think it is the most amazing advance made in recent years. It is very powerful and useful, because it provides us with wider information than what we could get from a book or magazine. It makes it possible for people to work together even if they are in different parts of the world. And the good thing is that each day the amount of information available to us is growing. We are also able to take part in that giving and sharing of information.
Finally, I have to say that even though this report has taken a lot of time to finish, it has supplied me with knowledge that will still be there in the future. And my interest in this field will surely grow further. We never know what we will find interesting until we go through it. So we must stay open to new suggestions and matters.
Sources:
http://www.coli.uni-sb.de/~hansu/what_is_cl.html (Hans Uszkoreit)
http://www.gu.edu.au/centre/cmp/Papers_97/Browne_M.html (Mairéad Browne)