abstract

The world is entering a new era in which telecommunications play an important role: the era of the information society. Since they first appeared, the Internet, mobile phones and similar technologies have become powerful and accessible tools for searching for and providing information. Human Language Technologies have had a strong impact on the world of new technologies and the Internet since their appearance, because they focus on the essence of the information carried by text or voice: language itself. In a world where the English language is hegemonic, translation is becoming more important day by day.

 

HUMAN LANGUAGE TECHNOLOGIES and the INFORMATION SOCIETY

 

introduction

Now that global and accessible means of communication have created what we call "The Information Society", HLT, together with their complements and agents, play an important role in the treatment of voice and text. I have divided this report into three main parts. In the first, I analyse the role of HLT in our everyday lives in the new society, when we use computers, mobile phones and similar technological media; I then describe the main EU projects behind HLT, as well as other projects derived from the Framework Programmes. In the second part I focus on Machine Translation and its different forms, comparing human and machine translation. The last part describes the problem posed by the sheer amount of information and some possible solutions to it.

 

1st part: HLT in a new society

1.- Its role in the information society

The main aim of HLT in general is to support e-business in a global context and to promote a human-centred structure of information, ensuring equal access and usage opportunities for all. This has to be achieved by developing multilingual technologies and demonstrating exemplary applications providing features and functions that are critical for the realisation of a truly user-friendly Information Society. While elements of the three initial HLT action lines (Multilinguality, Natural Interactivity and Crosslingual Information Management) are still present, there has been periodic re-assessment and tuning of them to emerging trends and changes in the surrounding economic, social, and technological environment. The "trials and best practice in multilingual e-service and e-commerce" action line was introduced in the IST 2000 work programme (IST2000) to stimulate new forms of partnership between technology providers, system integrators and users through trials and best practice actions addressing end-to-end multi-language platforms and solutions for e-service and e-commerce. The fifth IST call for proposals covered this action line.

2.- Framework Programmes and other projects            

Two EU funded projects, ELSNET and EUROMAP, are behind the development of HLTCentral:

EUROMAP ("Facilitating the path to market for language and speech technologies in Europe") - aims to provide awareness, bridge-building and market-enabling services for accelerating the rate of technology transfer and market take-up of the results of European HLT RTD projects.

ELSNET ("The European Network of Excellence in Human Language Technologies") - aims to bring together the key players in language and speech technology, both in industry and in academia, and to encourage interdisciplinary co-operation through a variety of events and services.

A range of programmes has been launched to make this objective possible. In the early nineties, the Third Framework Programme was launched, which developed collaboration with public and private sectors, academic institutions, and individuals. Within it, Linguistic Research and Engineering (LRE), under the Telematics Applications Programme, developed essential natural language and speech technologies, as well as many components that could be incorporated into information systems and services.

LRE aimed at pre-competitive, generic technologies and promoting industrial participation. Activities did not lead directly to finished products, but concentrated on providing enabling technologies and opportunities for inter-working between proprietary solutions, and on defining standards and common reference architectures. It also sought to stimulate the development of pilot applications and demonstration projects to show how language technologies could be used, and to demonstrate the technical feasibility of the solutions.

Towards the end of the Third Framework Programme, the Multilingual Action Plan (MLAP) was introduced with a call for exploratory projects to ensure continuity and prepare for the Fourth Framework Programme. LRE and MLAP together funded over fifty projects, with a significant number of user organisations from both small and medium sized enterprises and multinationals. While research results were always of a high standard, commercial exploitation was somewhat disappointing.

The Third Framework Programme was followed by the Fourth Framework Programme, in which the importance of language engineering was recognised with a significant increase in budget to over 80 million ECU (the European Currency Unit, predecessor of the euro), with the focus on user-driven, application-oriented projects designed to stimulate and respond to market needs. The central aim was to promote the use of telematics applications through the use of language technologies in order to facilitate communication in and between different European languages. Work was focused on projects which integrated new methods of processing spoken and written language into information and communications systems and services, with a view to improving their usability, accessibility and functionality.

User involvement increased significantly with around 500 organisations participating in projects launched since 1995. More than 60% of the human and financial resources deployed in these projects originated from industrial and user organisations. User involvement was drawn from a wide range of groups:

administration agencies, service providers and law enforcement organisations;

small and medium, and larger enterprises in the areas of manufacturing (including aerospace and automotive industries) and services (e.g. software, financial, publishing, media, education and training, telecommunications);

professional users;

the general public.

During the seven years from the beginning of 1992 to the end of 1998, the European Union invested approximately 115 million ECU in language engineering through shared cost projects. 70% of this money was allocated during the Fourth Framework Programme. This allocation reflects an increasing recognition of language engineering as an important area of research and technological development.
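The investment figures quoted above can be checked with a short calculation. The following Python sketch is only an illustration: the variable and function names are my own, and it simply applies the 70% share to the approximate total of 115 million ECU.

```python
# Figures quoted in the text, in million ECU.
TOTAL_1992_1998 = 115.0  # approximate EU investment in language engineering
FP4_SHARE = 0.70         # share allocated during the Fourth Framework Programme


def split_investment(total: float, fp4_share: float) -> tuple[float, float]:
    """Return (amount allocated during FP4, remaining amount)."""
    fp4_amount = total * fp4_share
    return fp4_amount, total - fp4_amount


fp4_amount, remainder = split_investment(TOTAL_1992_1998, FP4_SHARE)
print(f"Allocated during FP4: {fp4_amount:.1f} million ECU")
print(f"Remaining 30%:        {remainder:.1f} million ECU")
```

The FP4 amount comes to roughly 80.5 million ECU, which is consistent with the budget of "over 80 million ECU" mentioned for the Fourth Framework Programme.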

Projects in the Multilingual Information Society Programme (MLIS) complement activities that support multilinguality, exploit existing experiences and knowledge of multilingual issues and solutions, and mobilise players in both the public and private sectors to:

stimulate provision and raise awareness of multilingual services in Europe;

create favourable conditions for the development of commercial activity based on language technologies;

reduce the cost of information transfer among languages;

contribute to the promotion of linguistic diversity in Europe.

In addition, a number of projects funded in the Esprit (long term research) and in the IN-CO (International Cooperation) programmes contain strong elements of language technology.

After FP3 and FP4 came the Fifth Framework Programme (FP5). The strategic objective of the Information Society Technologies (IST) Programme in FP5 is to realise the benefits of the information society for Europe, both by accelerating its emergence and by ensuring that the needs of individuals and enterprises are met.

The IST Programme has four inter-related objectives:

For the private individual - to meet the need and expectation of high-quality affordable general interest services.

For Europe's enterprises, workers and consumers - to enable individuals and organisations to innovate and be more effective and efficient, thereby providing the basis for sustainable growth and high added-value employment while also improving the quality of working life.

In the sector of multimedia content - to confirm Europe as a leading force.

For the enabling technologies - to drive development, enhance applicability and accelerate take-up in Europe.

The Programme vision is very simple: "Our surrounding is the interface" to a universe of integrated services. This will enable people to access systems and services wherever they are, whenever they want, and in the form that is most "natural" for them. While directly targeting the improvement of quality of life and work, the vision is expected to act as a catalyst for business opportunities from added-value services and products.

Preparations for the Sixth Framework Programme (FP6) began back in 2001. FP6 includes a total of seven thematic priorities, of which IST is one. The planned funding for IST in FP6 is 3,600 MEUR, the same figure as in FP5. Listed amongst the IST research priorities are "Communication, computing and software technologies" and "Knowledge and interface technologies". More information is available on the "FP6 - The Way Forward" pages.

More information on the preparatory activities can be found on the HLTCentral FP6 pages, and also on the EC's CORDIS web site.

The overarching aim of HLT is to maximise the effectiveness and competitiveness of global business activities and to promote a truly human-centred infostructure ensuring equal access and usage opportunities for all. HLT actions initially addressed three intertwined areas centred around how people interact with information, with information services and with each other:

Multilingual communication, aimed at building multilingual intelligence into business processes, communication services, information appliances, and public interest services.

The programmes can be placed on the following timeline:

1998 - 2002: Human Language Technologies

spring 2001: eContent

1994 - 1998: Language Engineering

1996 - 1999: MLIS - Multilingual Information Society

1994 - 1998: ESPRIT

Last updated: 22.07.02 16:16

http://www.hltcentral.org/page-214.0.shtml

3.- What is the current situation of the HLTCentral.org office?

A great number of initiatives across Europe have begun to produce effective results. Many articles in the mainstream business and technology press indicate a change in the market significance of speech and natural language applications. Europe has been prominent in these developments, as its market has to satisfy users of a large number of languages. It is interesting to note that much of the technology available in other parts of the world is licensed from successful European suppliers.

The success of language engineering research and technological development in Europe is expected to have an important impact on our economic future, because it can be applied across such a wide range of information systems and services with such significant benefits.

The convergence of technologies for creating, managing and communicating information has expanded the opportunities and the need for HLT. The explosion of Internet access has led to a higher number of users and has created a platform for networked computing, with an associated expansion in the range of devices through which information can be delivered and accessed.

There is an emerging landscape of new technologies that requires more advanced information-handling techniques, where HLT components will be key elements. New platforms, coupled with high communication bandwidth, enable the delivery of complex multimedia information. The consolidation in the telecommunications sector, which is a basic feature of the current economic market, exaggerates these trends and the need for differentiation through the introduction of innovative value-added services.

The information society will enter virtually every area of life involving interactions between people and organisations, in both the public and private spheres. HLT will enable the information society through intuitive, human-centred modes of interaction with products and services. These will include spoken interaction, which will make it possible to do without keyboards and keypads, and the use of many different languages to process information and interact with devices, as well as the ability to communicate across language barriers.

This information infrastructure, or infostructure, underpins new social and economic formations in which HLT will be applied. Services for both voice and message communication are at the heart of the infostructure, and HLT is rapidly being deployed in many applications. Traditional voice telephony is already highly automated, with the worldwide market for voice processing systems already estimated to be worth $5.8 billion. In the near term, small vocabulary systems will be universally available in all major EU languages for services such as advanced voice-controlled electronic assistants. This is software installed in a carrier’s network that manages interpersonal communications by performing a variety of tasks, all through spoken commands. "They can handle faxes and email, help handle multiple calls, make conference calls, retrieve voice messages from other systems, send group messages, and remember follow-up calls and action items."

Electronic commerce and call-centre based applications, including telephone and online sales, service and support, will likewise be an important area for new HLT applications in what we could call TeleBusiness. Telephone call centres are being combined with Web-based service and support sites, bringing opportunities to integrate speech and text applications in the quest for better customer relationship management. The result is estimated to be a major new services market generating more than $3.5 billion in annual revenues by 2005. The number of call centre agents is expected to double by 2005, and nearly 25% of them are forecast to use some type of network-based service. Suppliers in the telemarketing and support market will invest substantial sums in technology in order to remain competitive.

Web portal services and e-commerce vendors must provide customers with navigation aids and ways to interact with electronic ‘storefronts’. While all markets will need the intelligent retrieval and filtering capabilities of LT, in Europe there is a particular need for multilingual access interfaces. Entertainment will be a strong draw for online consumer services especially for the currently estimated number of net users who increasingly demand more interactivity and customisation.

It is true that most EU financial institutions have web sites, but online financial information and brokerage services are more advanced in the US than in Europe. European-based services are likely to take hold in the medium term where information profiling, filtering and extraction together with voice recognition and speaker identification will be key features of systems helping the estimated 10 million Europeans seeking financial guidance over the net by 2002.

According to Datamonitor, "the European Intranet services market is expected to generate revenue of $5.2 billion by 2003, over seven times its current value of $720 million. Networked business processes are becoming the rule rather than the exception, which make knowledge management and workflow support key requirements for successful, competitive businesses. Key LT components of corporate intranet applications will be information retrieval and extraction, speech-enabled automated assistants, and tools to support multilingual creation of and access to corporate information". This is just one example of the relevance of networks and intranets.

The localisation industry, which is already heavily concentrated in Europe, will thrive in a commercial publishing world that is being transformed by electronic delivery, and where publishing, video/film, audio/music, and information provision are all converging. Globalisation services, which include authoring for international markets, translation, cultural adaptation, and software localisation, will move from the software industry into the mainstream of corporate publishing. At the same time, language processing tools which support globalisation will find new and expanded markets in many sectors with digital multimedia content.

New technologies have shaped the industrial outlook of the last decade. European programmes have really helped to raise this industrial interest. Products and services are being launched which demonstrate what can be achieved. The need now is for fresh, innovative ideas on future applications in attractive emerging markets.

Last updated: 16.10.00 16:50

http://www.hltcentral.org/page-219.0.shtml

4.- The web page: www.hltcentral.org

As we can read there, the HLTCentral web site was established as an online information resource on human language technologies and related topics of interest to the HLT community at large. It covers news, R&D, technological and business developments in the field of speech, language, multilinguality, automatic translation, localisation and related areas. Its coverage of HLT news and developments is worldwide, with a unique European perspective.

 

2nd part: HLT and the translation agents

1.- Which are the most usual interpretations of the term "machine translation" (MT)?

The term machine translation (MT) refers generally to the completely automatic way of translating. However, we consider the whole range of tools that may support translation and document production in general, which is especially important when considering the integration of other language processing techniques and resources with MT. We therefore define Machine Translation to include any computer-based process that transforms (or helps a user to transform) written text from one human language into another. 

We define Fully Automated Machine Translation (FAMT) to be MT performed without any intervention of a human being during the process. Human-Assisted Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a (mono- or bilingual) human for help. Machine-Aided Translation (MAT) is the style of translation where a human does most of the work but uses one or more computer systems, mainly as resources such as dictionaries and spelling checkers, as assistants.

Traditionally, two very different classes of MT have been identified, assimilation and dissemination, although a third one has recently become evident. Assimilation refers to the class of translation in which an individual or organization wants to collect material written by others in a variety of languages and convert it all into his or her own language. Dissemination refers to the class in which an individual or organization wants to broadcast his or her own material, written in one language, in a variety of languages to the world. The third class, communication, occurs when two or more individuals are in more or less immediate interaction, normally through online channels such as e-mail, with an MT system that mediates between them. Each class of translation has very different features, is best supported by different underlying technology, and is to be evaluated following different criteria.

http://sirio.deusto.es/abaitua/konzeptu/nlp/Mlim/mlim4.html

2.- MT: Where was MT years ago?

    The history of MT research has gone through a number of phases in which certain frameworks have dominated. From the late 1960s the syntactic orientation was dominant, initially with syntactic transfer approaches (e.g. at MIT), then the interlingua formalisms of CETA and LRC, followed by the "second generation" transfer-based multi-level model of GETA-Ariane, SUSY, Mu, and Eurotra. In the 1980s the AI orientation was popular (e.g. Carnegie Mellon), more attention was paid to semantics and interlingua-based systems were explored (e.g. Rosetta and DLT). And now in the 1990s, the corpus-based paradigm with stochastic and example-based methodologies is the focus of much activity.

http://sirio.deusto.es/abaitua/konzeptu/nlp/Mlim/mlim4.html

3.- Speech-to-speech machine translation: three projects.

The achievements of the EuTrans project reveal two things. The first is that speech-to-speech translation is conditional on the development of speech recognition technology itself. The second is that the models employed in speech recognition based on large corpora have also proved valid for the development of speech translation. This implies that in the future these two technologies could be successfully integrated.

At present, however, speech-to-speech translation systems are not commonplace. In recent years speech recognition techniques have made important strides forward, thanks to the increased availability of the resources needed for their development: large collections of oral texts and more efficient data-oriented processing techniques, such as those designed by the PRHLT group itself. However, the integration of these systems into marketable products is still some way off.

It is worth remembering that most prototypes developed within research projects are currently only capable of processing a few hundred sentences (around 300), on very specific topics (accommodation booking, planning trips, etc.) and for a small group of languages, mostly the predominant ones. It seems unlikely that any application will be able to go beyond these boundaries in the near future. These limits make this kind of translation unreliable outside such narrow domains.

The direct incorporation of speech translation prototypes into industrial applications is at present too costly. However, the growing demand for these products leads us to believe that they will soon be on the market at more affordable prices. The systems developed in projects such as Verbmobil, EuTrans or Janus—despite being at the laboratory phase at the time of this article—contain in practice thoroughly evaluated and robust technologies. A manufacturer considering their integration may join R&D projects and take part in the development of prototypes with the prospect of a fast return on investment. It is quite clear that we are witnessing the emergence of a new technology with great potential for penetrating the telecommunications and microelectronics market in the not too distant future.

Another remarkable aspect of the EuTrans project is its methodological contribution to machine translation as a whole, both in speech and written modes. Although these two modes of communication are very different in essence, and their respective technologies cannot always be compared, speech-to-speech translation has brought prospects of improvement for text translation. Traditional methods for written texts tend to be based on grammatical rules. Therefore, many MT systems show no coverage problem, although this is achieved at the expense of quality. The most common way of improving quality is by restricting the topic of interest. It is widely accepted that broadening of coverage immediately endangers quality. In this sense, learning techniques that enable systems to automatically adapt to new textual typologies, styles, structures, terminological and lexical items could have a radical impact on the technology.

Due to the differences between oral and written communication, rule-based systems prepared for written texts can hardly be re-adapted to oral applications. This is an approach that has been tried, and has failed. By contrast, example-based learning methods designed for speech-to-speech translation systems can easily be adapted to written texts, given the increasing availability of bilingual corpora. One of the main contributions of the PRHLT-ITI group is precisely its learning model based on bilingual corpora. Herein lie some interesting prospects for improving written translation techniques.

Effective speech-to-speech translation, along with other voice-oriented technologies, will become available in the coming years, but with the limitations mentioned above, e.g. the number of languages, linguistic coverage, and context. It could be argued that EuTrans' main contribution has been to raise the possibilities of speech-to-speech translation to the level of speech recognition technology, making any innovation immediately accessible.

http://www.hltcentral.org/page-1086.0.shtml

Last updated: 13.06.03 15:19

4.- Human translation or machine translation?

Although human beings are not perfect, they have the capacity to use their imagination to adapt text fragments to the most acceptable form in another language. It is clear that translation is not so much a purely logical process of establishing equivalents as a search for the most acceptable rendering in each context. In some cases, as we see in poems, linguistic form becomes the most important part of the discourse: what matters is not so much what is written as what it is meant to express.

 

3rd part: HLT and the information

1.- How much new information is created each year?

Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002. 92% of the new information was stored on magnetic media, mostly in hard disks of computers.

How big is five exabytes? If digitized, the nineteen million books and other print collections in the Library of Congress would contain about ten terabytes of information; five exabytes of information is equivalent in size to the information contained in half a million new libraries the size of the Library of Congress print collections.

Hard disks store most new information. Ninety-two percent of new information is stored on magnetic media, primarily hard disks. Film represents 7% of the total, paper 0.01%, and optical media 0.002%.

The United States produces about 40% of the world's new stored information, including 33% of the world's new printed information, 30% of the world's new film titles, 40% of the world's information stored on optical media, and about 50% of the information stored on magnetic media.

How much new information per person? According to the Population Reference Bureau, the world population is 6.3 billion, thus almost 800 MB of recorded information is produced per person each year. It would take about 30 feet of books to store the equivalent of 800 MB of information on paper.

We estimate that the amount of new information stored on paper, film, magnetic, and optical media has about doubled in the last three years.
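The orders of magnitude quoted above can be verified with a few lines of arithmetic. The Python sketch below is only an illustration: it assumes decimal (SI) units, the variable names are my own, and it uses the growth rate of about 30% a year that the same source gives for 1999 to 2002.

```python
# Assumption: decimal (SI) units, not binary ones.
MEGABYTE = 10**6
TERABYTE = 10**12
EXABYTE = 10**18

new_information = 5 * EXABYTE        # new stored information in 2002
library_of_congress = 10 * TERABYTE  # digitized print collections
world_population = 6_300_000_000     # Population Reference Bureau figure

# How many Library of Congress print collections fit into five exabytes?
libraries_equivalent = new_information // library_of_congress

# New stored information per person, in megabytes.
megabytes_per_person = new_information / world_population / MEGABYTE

# Growth of about 30% a year, compounded over three years.
three_year_growth = 1.30 ** 3

print(libraries_equivalent)
print(round(megabytes_per_person))
print(round(three_year_growth, 2))
```

Under these assumptions the result is 500,000 libraries and about 794 MB per person, which matches "half a million" and "almost 800 MB". A growth of 30% a year over three years gives a factor of about 2.2, in line with the statement that the amount "has about doubled".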

Information explosion? We estimate that new stored information grew about 30% a year between 1999 and 2002.

Paperless society? Not really. The amount of information printed on paper is still increasing, but the vast majority of original information on paper is produced by individuals in office documents and postal mail, not in formally published titles such as books, newspapers and journals.

Information flows through electronic channels (telephone, radio, TV, and the Internet) contained almost 18 exabytes of new information in 2002, three and a half times more than is recorded in storage media. Ninety-eight percent of this total is the information sent and received in telephone calls, including both voice and data on both fixed lines and wireless.

Telephone calls worldwide, on both landlines and mobile phones, contained 17.3 exabytes of new information if stored in digital form; this represents 98% of the total of all information transmitted in electronic information flows, most of it person to person.

Most radio and TV broadcast content is not new information. About 70 million hours (3,500 terabytes) of the 320 million hours of radio broadcasting is original programming. TV worldwide produces about 31 million hours of original programming (70,000 terabytes) out of 123 million total hours of broadcasting.

The World Wide Web contains about 170 terabytes of information on its surface; in volume this is seventeen times the size of the Library of Congress print collections.

Instant messaging generates five billion messages a day (750 GB), or 274 terabytes a year.

Email generates about 400,000 terabytes of new information each year worldwide.

P2P file exchange on the Internet is growing rapidly. Seven percent of users provide files for sharing, while 93% of P2P users only download files. The largest files exchanged are video files larger than 100 MB, but the most frequently exchanged files contain music (MP3 files).

How we use information. Published studies on media use say that the average American adult uses the telephone 16.17 hours a month, listens to radio 90 hours a month, and watches TV 131 hours a month. About 53% of the U.S. population uses the Internet, averaging 25 hours and 25 minutes a month at home, and 74 hours and 26 minutes a month at work (about 13% of the time).

Release date: October 27, 2003. © 2003 Regents of the University of California
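Some of the figures on information flows quoted above can also be cross-checked. The Python sketch below is only an illustration under the assumption of decimal (SI) units; the variable and function names are my own.

```python
# Assumption: decimal (SI) units.
GIGABYTE = 10**9
TERABYTE = 10**12

# Instant messaging: five billion messages a day, 750 GB a day.
messages_per_day = 5_000_000_000
im_bytes_per_day = 750 * GIGABYTE

bytes_per_message = im_bytes_per_day / messages_per_day
im_terabytes_per_year = im_bytes_per_day * 365 / TERABYTE


def original_share(original_hours: float, total_hours: float) -> float:
    """Fraction of broadcast hours that is original programming."""
    return original_hours / total_hours


# Broadcasting figures, in million hours.
radio_share = original_share(70, 320)
tv_share = original_share(31, 123)

print(f"{bytes_per_message:.0f} bytes per message")
print(f"{im_terabytes_per_year:.2f} TB of instant messages per year")
print(f"radio: {radio_share:.1%} original, TV: {tv_share:.1%} original")
```

With these assumptions, 750 GB a day corresponds to an average of 150 bytes per message and to about 273.75 TB a year, consistent with the quoted 274 terabytes. Original programming comes to roughly 22% of radio hours and 25% of TV hours, which supports the remark that most broadcast content is not new information.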

2.- How can HLT contribute to solve this problem?

Many experts have put forward ideas on this point; some of them are quoted below:

"The solution is a huge wastepaper basket." R. Sachs. This clearly refers to the fact that there is a great amount of unnecessary information.

"Better training in separating essential data from material that, no matter how interesting, is irrelevant to the task at hand is needed." D. Lewis

"Information stress sets in when people in possession of a huge volume of data have to work against the clock, when major consequences -lives saved or lost, money made or lost- will flow from their decision, or when they feel at a disadvantage because even with their wealth of material they still think they do not have all the facts they need. So challenged, the human body reacts with a primitive survival response. This evolved millions of years ago to safeguard us when confronted by physical danger. In situations where the only options are to kill an adversary or flee from it, the 'fight-flight' response can make the difference between life and death." D. Lewis. This passage explains what information stress is.

http://sirio.deusto.es/abaitua/konzeptu/nlp/mlim.html

http://sirio.deusto.es/abaitua/konzeptu/fatiga.htm

3.- An overview

Human Language Technologies RTD contributes to enhancing usability and accessibility of digital content and services while supporting linguistic diversity in Europe. It is part of the Multimedia Contents and Tools activity (Key Action III) of the Information Society Technologies (IST) Programme. IST is the largest single element of the Fifth Framework programme (FP5). Key Action III has a budget of 564 Meuro and brings together all RTD activities relating to digital content including tools to create, manage, deliver, retrieve and exchange it.

The IST Programme in FP5 (1998 - 2002) is funded at 3,600 MEUR which represents 26.3% of the total FP5 budget. It is designed to anticipate the needs of the converging telecommunications, computing and media industries, and related markets and technologies. It integrates all previous Community ICT activities (ESPRIT, ACTS and Telematics) into one programme, managed by one service - the Information Society Directorate (INFSO).

Alongside multilingual communication, the HLT actions address two further areas:

Natural Interactivity, with the aim of enhancing the naturalness of human-computer interactions and the effectiveness of interpersonal communications.

Cross-lingual information management, with a view to improving the effectiveness of information access and the efficiency of information handling.

A total of 85 HLT projects are up and running in the IST programme so far, with more in the pipeline. The IST programme is synchronised with a related call in the US National Science Foundation programme.

HLT is also present in the eContent Programme, Action Line 3 - Facilitating linguistic and cultural customisation of digital products and services.

Current HLT themes and priorities are built upon a substantial base of existing skills and knowledge acquired as a result of achievements in the EU's previous R&D programmes and initiatives. Future market prospects for HLT-enabled systems and services are excellent due to the globalisation of the economy, opportunities from existing and emerging internet business models, and the increasingly important demand for more natural, effective and efficient user interfaces.

Last updated: 16.10.00 16:50

http://www.hltcentral.org/page-218.0.shtml

 

conclusion

HLT are fundamental in this new era. The new-technology market is aware that information channels such as mobile phones and the Internet are used by millions of people. The HLT community is aware of this too, and several Framework Programmes have supported work in this area. As information is conveyed by texts, most of them written in English, translation becomes relevant for many users. Human translators are available for this task, but HLT also offer machine helpers for translating information. The latter can be useful, but they have not yet reached the level of a human translator, since imagination is an important part of translating. This is of crucial importance not only for individuals searching for information, but also for those who want their message to reach people with different mother tongues. We are facing a new model of society, in which information (in many different languages and formats) is more accessible than ever before in human history. HLT provide valuable help with this.