by: Olatz Garcia
In this report for our course on English Language and New Technologies, we deal with an important and interesting topic: information retrieval. We use some of the search engines ("searchers", from the Spanish buscadores) available on the net and try to explain the main problems we find with them. Our main intention is to show how these tools work and how the results can change completely after a slight change in the words we type into the search engine.
In our last report we focused on an important but flawed tool, machine translation. Now we turn to another important and useful tool: search engines. These tools are used to find information on the net, but there are many subtleties to take into account when using them. This is what we try to explain and show briefly in this report.
The structure of the report is simple to follow. First we present some information we have taken from the net about these tools: a definition and a brief introduction to their history. This historical part covers not only the past but also the future, together with the main problems found in each period. Later we explain in a simple way how search engines work. We use simple vocabulary so that the report is easy to understand for readers who know nothing about the subject.
1· Multilingual Information Retrieval
1.1 Definition and Terms
Multilingual Information Retrieval (MLIR) refers to the ability to process a query for information in any language, search a collection of objects, including text, images, sound files, etc., and return the most relevant objects, translated if necessary into the user's language. The explosion in recent years of freely-distributed unstructured information in all media, most notably on the World Wide Web, has opened the traditional field of Information Retrieval (IR) up to include image, video, speech, and other media, and has extended out to include access across multiple languages. Being new, MLIR will probably also include the historically excluded access mechanisms typical of libraries involving structured data, such as MARC catalogue records.
The general field of MLIR has expanded in several directions, focusing on different issues; what exactly is within its purview remains open to discussion. It is generally agreed, however, that Machine Translation proper and Multimedia processing are not included. Nonetheless, several new terms have arisen around the new IR, each with a slight variation in emphasis, inclusiveness, or historical association with related fields. For example, recent research in multilingual information retrieval, such as (Fluhr et al., 1998) in (Grefenstette, 1998), includes descriptive catalogue data from libraries as well as unstructured data. Hull and Grefenstette (1996) list five uses of the term MLIR:
In addition to MLIR, four related terms have been used:
1. Multilingual Information Access (MLIA). The broadest possible term to use is Multilingual Information Access, which refers to query, retrieval, and presentation of information in any language. The term MLIA is used in the NSF-EU working groups (Klavans and Schäuble, 1998). In general, the use of information access rather than retrieval implies a more general set of access functions, including those that have been part of the traditional library, as well as other modalities of access to other media. Access could refer to the use of speech input for video output, where the language component could consist of closed-captioned text or text from speech recognition, or catalogue querying to metadata. The term information access came into use recently as a way to broaden the historically narrower use of information retrieval.
2. Multilingual Information Retrieval (MLIR). This term refers to the ability to process a query in any language and return objects, such as text, images, sound files, etc., relevant to the user query in any language. Historically, however, Information Retrieval (IR) as a field involved a group of researchers from the unstructured text data base community who employed statistical methods to match query and document (Salton, 1988). In general, this work was English dominated, given the amount of digital information made available to the research community in the early years in English, and excluded access mechanisms typical of libraries involving structured data, such as MARC catalogue records. Thus MLIR as used in this chapter denotes a significantly wider field of interest than that of traditional IR.
3. Cross-lingual Information Access. The use of the term cross-lingual refers (in this context) to bridging two languages, rather than the ability to access information in any language starting with input in any language. Systems with cross-lingual capability can accept a query in language L1 or L2, for example English and French, and are capable of returning documents in either L1 or L2. (In other meetings, the term cross-lingual (or translingual) has been used to distinguish systems that cross a language barrier, as opposed to multiple monolingual systems as in TREC.) This term logically includes access via catalogue record and other structured indexing, as for MLIA.
4. Cross-lingual Information Retrieval (CLIR). CLIR generally implies a relationship to IR, with all the implications that apply to MLIR. At the 1997 Cross-language Information Retrieval Spring Symposium of the American Association of Artificial Intelligence (Oard et al., 1997), CLIR was defined with the following research challenge: Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the user, with identical or near-identical objects in different media or languages appropriately identified. This definition of the requirements of a system gives full recognition to the query, retrieval, presentation requirements of a working system from a user perspective, and encapsulates succinctly the full set of capabilities to be included. However, its breadth makes it fit well with a definition of MLIA, the most general term, rather than CLIR, a more precise term.
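The statistical matching of query and document mentioned under item 2 can be illustrated with a small vector-space sketch. This is only one common way of doing it (TF-IDF weights compared by cosine similarity), not necessarily the method of any system cited above. The sample documents, the whitespace tokenizer, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # The +1 keeps terms that occur in every document from getting weight 0.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    tokenized = [d.lower().split() for d in docs]
    vectors, idf = tf_idf_vectors(tokenized)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query.lower().split()).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

# Hypothetical mini-collection, invented for this example.
DOCS = [
    "library catalogue records",
    "statistical text retrieval methods",
    "speech and video retrieval",
]
ranking = rank("text retrieval", DOCS)
```

Here the second document shares both query terms and is ranked first, the third shares one, and the first shares none. Note that the sketch is monolingual: a query in another language would match nothing, which is exactly the gap MLIR tries to close.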
2· Where We Were Five Years Ago
2.1 Capabilities Then
The lure of cross language information retrieval attracted experimentation by the IR community early on. Already in 1971, Salton showed that the use of a transfer dictionary for English and French (a bilingual wordlist with predefined mappings between terms) could be used to translate from a query in one language to another (Salton, 1971). This experiment, although ignoring the realistic and challenging problem of ambiguity, nonetheless served the information retrieval community well in providing a model for a viable approach to cross language IR. However, at the same time, the experiment also illustrated some of the exceedingly difficult problems in the language translation and mapping component of a system, namely one to many mappings, gaps in term translations, and ambiguity. Similarly, in a manual test with a small corpus, Pevzner (1972) showed for English and Russian that a controlled thesaurus can be used effectively for query term translation.
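The transfer-dictionary idea, and the three difficulties named above (one-to-many mappings, gaps, and ambiguity), can be sketched in a few lines. This is a minimal illustration, not a reconstruction of Salton's experiment: the dictionary entries, the sample query, and the function names are all hypothetical.

```python
from itertools import product

# Hypothetical English-to-French transfer dictionary (illustrative entries only).
TRANSFER_DICT = {
    "bank": ["banque", "rive"],  # one-to-many: financial bank vs. river bank
    "interest": ["intérêt"],
    "rate": ["taux"],
}

def translate_query(query, dictionary):
    """Return (candidate translations per term, terms missing from the dictionary)."""
    candidates, gaps = [], []
    for term in query.lower().split():
        if term in dictionary:
            candidates.append(dictionary[term])
        else:
            gaps.append(term)
            candidates.append([term])  # keep the untranslated term as a fallback
    return candidates, gaps

def expand(candidates):
    """Enumerate every full translated query implied by the candidates."""
    return [" ".join(choice) for choice in product(*candidates)]

candidates, gaps = translate_query("bank interest rate overdraft", TRANSFER_DICT)
translated_queries = expand(candidates)
```

The ambiguous term "bank" doubles the number of candidate queries, and "overdraft" has no entry and passes through untranslated. A plain wordlist has no way to choose between the candidates, which is the ambiguity problem the early experiments set aside.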
For nearly twenty years, the areas of IR and MT remained separate, leaving MLIR somewhat dormant. Apart from a few forays into refining these early techniques, all significant advances in MLIR have been made in the past five years. This is not surprising, given that increased amounts of information are becoming available in electronic format, and the economy is globalizing.
2.2 Major Methods, Techniques, and Approaches Five Years Ago
We discuss the problem within the framework outlined above.
System issues include the following.
Usability issues include the following. Early experiments were performed at a very small scale and were more in the nature of proof-of-concept than full-fledged large-scale systems. User feedback and user needs were simply not part of what was tested.
2.3 Major Bottlenecks and Problems Five Years Ago
The three major bottlenecks of the early part of this decade still persist. They are: limited resources for building domain and language models; limited new technologies for coping with size of collections; and limited understanding of the myriad of user needs.
3· Where We Are Today
The burgeoning field of MLIR is clearly in evidence, as can be seen in the bibliography in the first major review article on the topic (Oard and Dorr, 1996). Papers cited include related work on machine translation, including some research translated from Russian. There are 16 citations prior to 1980, 10 from 1980-89, and 52 from 1990 to early 1996. The first major book to be published on the topic (Grefenstette, 1998) reflects the same temporal bias. This work is slanted towards IR rather than toward MT. It contains 11 citations prior to 1980, 25 from 1980-89, and 101 from 1990 to very early 1998.
3.1 Major Methods, Techniques, and Approaches Now
Following the format above, we divide the methods into system-centered and user-centered concerns, although each provides feedback to the other.
System issues include the following:
Usability issues include the following. The development of effective MLIR technology will have no impact if the user's needs and operation patterns are not considered. Since MLIR is a growing field, and since applications are just emerging, formative studies of usability are essential. Currently, there are a limited number of systems in early operation which are providing important data (e.g., EuroSpider, the translate function of AltaVista, multilingual catalogue access). The incorporation of users in the relevance feedback loop is particularly important, since user needs vary greatly. A full review of user needs is found in (Klavans and Schäuble, 1998).
4· Where We Will Be in Five Years
The growing number of multilingual corpora is providing a valuable and as yet untapped resource for MLIR. Such corpora are essential to building successful dynamic term and phrase translation thesauri, which are, in turn, key to effective indexing and matching. One of the key challenges is in devising efficient yet linguistically informed methods of tapping these resources, methods which combine the best of what is known about fast statistical techniques along with more knowledge based symbolic methods. Even promising new techniques, such as translingual LSI (Landauer et al., 1998) and related techniques (Carbonell et al., 1997), will most probably still rely on parallel corpora. Such corpora are often difficult to find, and very expensive to prepare. This has been the motivation for the work on comparable corpora. However, more and more are being created electronically, especially to conform to legal requirements for the European Union.
An important class of techniques involves machine learning, as applied to the cross-language term mapping problem. Since term translation, loosely defined, is at the core of query processing, document processing, and matching, it is an important process to do thoroughly and accurately. Even if multiple translations are retained in the MLIR process, obtaining a sensible set of domain linked terms is an important and central task. One way to obtain these term dictionaries is through parallel corpora, but statistical processing is typically difficult to fine tune. As discussed before, machine learning techniques are a fundamental enhancement of the power of language processing systems and hold particular promise in this area as well.
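One simple way to see how a term dictionary might be derived from a parallel corpus is to score word pairs by how often they co-occur in aligned sentences. The sketch below uses the Dice coefficient for this. It is an illustrative stand-in, not the machine learning techniques the text refers to, and the three sentence pairs and function names are invented for the example.

```python
from collections import Counter
from itertools import product

# Tiny hypothetical sentence-aligned English/French corpus.
PARALLEL = [
    ("the house is small", "la maison est petite"),
    ("the house is red", "la maison est rouge"),
    ("the car is red", "la voiture est rouge"),
]

def dice_table(pairs):
    """Score each (source, target) word pair by the Dice coefficient
    2 * joint / (source_count + target_count), counted over aligned sentences."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_terms, t_terms = set(src.split()), set(tgt.split())
        src_count.update(s_terms)
        tgt_count.update(t_terms)
        joint.update(product(s_terms, t_terms))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }

def best_translations(pairs, term):
    """Return candidate translations of `term`, best score first."""
    scores = dice_table(pairs)
    ranked = sorted(
        ((score, t) for (s, t), score in scores.items() if s == term),
        key=lambda x: (-x[0], x[1]),
    )
    return [(t, round(score, 2)) for score, t in ranked]

house_candidates = best_translations(PARALLEL, "house")
the_candidates = best_translations(PARALLEL, "the")
```

For "house" the top candidate is "maison", but for "the" both "la" and "est" receive the same top score because all three words occur in every sentence. Ties of this kind hint at why purely statistical processing is hard to fine tune on small corpora.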
Finally, it is to be hoped that our understanding of user needs and user interactions with MLIR systems will be significantly better in five years than it is now. As early systems emerge and are tested in the field, a range of flexible and fluid applications that can learn and dynamically adjust to the users' levels of competence, across languages and across domains, should appear. One possible example of this type of flexible application might be human-aided MT systems for producing gisting-quality translations of retrieved documents, which would allow the user to make a personal time/quality tradeoff: the longer the user interacted with the translator, the better the resulting output. Most probably, these systems will incorporate multimedia seamlessly and permit multimodal input and output. Such capabilities will provide maximum usability.
5· Major Search Engines on the Net
Millions of people all over the world make everyday use of the tools that new technologies put at their disposal, and search engines are among the most important and helpful. As the name explains, their main purpose is to help people search for specific information on the net. Although they are very useful, they have an important problem: multilingualism. People all over the world speak different languages, and the net returns information from everywhere without taking into account the native language of the person looking for it.
There are a large number of search engines on the net. We deal with three of the best known, among them www.google.es and www.yahoo.com, and present a chart showing the differences between them. The most important difference is linguistic: some search engines attach extreme importance to the exact form of the word. There seems to be a problem, for example, with determiners and articles, or with singular and plural forms. We try to show this with a chart of words.
Another important point about search engines is that they do not look for a definition of the word you type in, but for words or places related to it. Sometimes they lead you to fragments of literary texts in which the word appears. As a result, you do not get the information you were looking for.
The case of the Yahoo search engine is the following. If we look for the definition of a word such as "criticism", what we get is:
1: disapproval expressed by pointing out faults or shortcomings; "the senator received severe criticism from his opponent" [syn: unfavorable judgment]
2: a serious examination and judgment of something; "constructive criticism is always appreciated" [syn: critique]
3: a written evaluation of a work of literature [syn: literary criticism]
If we enter the same word in the plural, the search engine gives us a list of words that resemble it, or suggests that we search for the word in another search engine such as Google. The list of similar words in the result is the following (14 matches found):
http://rds.yahoo.com/S=2766679/K=definition+%22criticisms%22/v=2/TID=DFX5_64/SID=e/l=WS1/R=1/H=0/*-http://www.csu.edu.au/faculty/arts/humss/wel217/week2/sld009.htm
Now we repeat the same experiment in Google's search engine. We do not get a clear definition; instead we are led to a large number of possible answers, none of which is what we were really looking for: pages that contain the term in relation to other subjects, such as literature or politics. We must keep searching through the page until we find the definition of the word, in the following terms:
If we enter the same word in the plural, we are prompted to search for the word in the singular.
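One possible reason a search engine treats "criticism" and "criticisms" alike is that it reduces words to a common form before matching. The sketch below shows the general idea with a deliberately crude plural-stripping rule and a tiny inverted index. We do not know how Google or Yahoo actually handle plurals, so the rules, page names, and texts here are hypothetical.

```python
def naive_singular(word):
    """Strip a common English plural ending (crude, illustrative rules only)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # studies -> study
    if w.endswith(("xes", "zes", "ches", "shes")):
        return w[:-2]                # boxes -> box
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # criticisms -> criticism
    return w

def build_index(pages):
    """Map each normalized word to the set of pages that contain it."""
    index = {}
    for name, text in pages.items():
        for token in text.split():
            key = naive_singular(token.strip('.,;:"'))
            index.setdefault(key, set()).add(name)
    return index

def search(index, query):
    """Look up a one-word query after applying the same normalization."""
    return sorted(index.get(naive_singular(query), set()))

# Hypothetical pages, invented for this example.
PAGES = {
    "page-a": "Literary criticism of the novel",
    "page-b": "Criticisms of the senator",
    "page-c": "Niñito and other diminutives",
}
index = build_index(PAGES)
plural_hits = search(index, "criticisms")
singular_hits = search(index, "criticism")
```

Both queries return the same two pages, which matches the behaviour observed above. Rules this simple also make mistakes (for example, "news" would be cut to "new"), so real systems need more careful linguistic treatment.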
Another experiment is to look for a word such as "niñito". Google gives us, as its second result, the definition of the word, although this is not a real definition but rather an explanation of some Spanish morphological features.
If we make the same search in Yahoo, we do not get any definition but rather a number of different pages related to the world of childhood, such as:
The page we will show now is related to psychology:
The next page is related to "demagogy":
Another important complication we find with these search engines is that most of them do not handle foreign languages. This is not the case with Google, which is probably the most important search engine in the world. If we type a Basque word into a search engine such as www.altavista.com, the result may be a note that directs us to another search engine. This is an important problem with these tools: they do not normally understand different languages. With other search engines such as Google the problem may be the opposite, that is, we get information from a large number of places and in an incredible variety of languages that we may not understand.
In conclusion, these tools are really useful and necessary for people in general and especially for students like us. Almost every day we have to look up information for our studies in the dictionaries or books we have at home, but often our own books do not contain what we need. Search engines help us to solve this problem: with a computer we can find information about almost any topic on the net. This is not always an advantage because, as some people point out, these tools are so easy to use that young children may gain access to them and find unsuitable information.
These tools still have some problems to solve, as they are relatively new inventions. One of the most important difficulties is multilinguality: we may get information in languages we do not understand. Another is that some search engines do not distinguish between certain morphemes, so we get almost the same results whether we enter a word in the singular or the plural. Finally, with a word like "criticism", if we look for the definition we get much more than that: search engines cannot distinguish the information we need from the rest.
As students we realize that these tools are extremely useful to us, because they give us a lot of information for our assignments and subjects. But they will have to improve as time passes, and we, as linguists or philologists, will probably be responsible for these changes. We think that people must be aware of the importance new technologies are acquiring in our daily lives, but should never forget that this is not only "scientific" work but also work related to language and linguistics. It is extremely important to create a kind of "community" between those who work in new technologies and those who are experts in language and linguistics, in order to improve the kinds of tools we have dealt with here.