Tamara Diez and Joana Salazar
In this report we focus on "Information Retrieval", mainly on the use of online search engines such as Google. Google is the search engine most used by students, because it is very easy to use and provides a lot of information in a short amount of time. However, we do not always find exactly what we want, and that is one of the main problems covered in this report. We also go through some features of these search engines from a linguistic point of view.
To start with, we have to know what "Information Retrieval", the topic of this project, means. Information Retrieval (IR) consists in building search engines that understand queries in human language. We can search for information within documents, for the documents themselves, and for metadata that describes documents. We can also search within databases, and even search for sounds and images. There is a common confusion, however, between data, document, information, and text retrieval, and each of these has its own features.
IR is a broad field that draws on many other disciplines. Indeed, because it is so broad, it is often poorly understood, being typically approached from only one perspective or another. It tries to cover so many things that it lacks specialization in each field, and that is the case with the linguistic field. Web search engines such as Google and Lycos are amongst the most visible applications of Information Retrieval research. However, they do not work like the human mind, because they search for the exact word without taking its variations into account. We are going to develop this aspect of search engines. Some engines are better than others at certain things, so we will go through them to see what their differences and advantages are.
Apart from Google and Lycos, which are practically the same, we find other search engines such as Word-IQ (http://www.wordiq.com) and Ask (http://www.ask.com). In this report we are going to see how these different search engines work. To illustrate that, we will use some examples.
Concerning this topic, we also have to mention other related terms. Apart from Information Retrieval, which refers to the identification of documents containing relevant information, we find the term "Information Extraction" (IE), which refers to the extraction of relevant information from documents, and when we talk about presenting condensed information we are referring to "Text Summarisation" (TS). Something very similar to Information Retrieval is "Cross-lingual information retrieval": it is the same as IR, but across languages. We define a Cross-lingual IR system as one in which users can formulate, expand and disambiguate queries, filter the search results and read the retrieved documents using only their native language. This multilingual functionality is achieved through dictionary-based query translation, multilingual document categorisation and automatic translation of summaries and documents. For example, if we run a search in Google, we can also get results in other languages, because it offers additional language tools.
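To make the idea of dictionary-based query translation more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the tiny Spanish-to-English dictionary and the function names `translate_terms` and `expand_queries` are invented for this example and do not describe how any real search engine works.

```python
from itertools import product

# Hypothetical toy Spanish-to-English dictionary, invented for this sketch.
TOY_DICTIONARY = {
    "guerra": ["war"],
    "lengua": ["language", "tongue"],
}


def translate_terms(query, dictionary):
    """Return, for each query term, its list of candidate translations.

    Terms that are missing from the dictionary are kept unchanged.
    """
    return [dictionary.get(term, [term]) for term in query.lower().split()]


def expand_queries(query, dictionary):
    """Build every target-language query that the candidates allow."""
    candidates = translate_terms(query, dictionary)
    return [" ".join(choice) for choice in product(*candidates)]


print(expand_queries("guerra lengua", TOY_DICTIONARY))
```

An ambiguous term such as "lengua" produces more than one translated query, which is why the definition above also mentions that users need to disambiguate their queries.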
Google is one of the search engines most frequently used by students. When they need to find something for a piece of work or a project, they go directly to Google because it is easy to use and works very fast. Although it is used by the majority of students, we do not always get the results we expect. One case would be, for example, looking for something about the short story by Viramontes entitled "Growing". We could spend hours and hours searching for information about it until we realise that there is nothing. What this example shows is that search engines such as Google are not always able to find what we want. However, in this case it may be because there is nothing on the Internet about the story.
The other case would be when we cannot reach the webpage where the information we want is located. Finding exactly what we want is a big problem, and that is why different search engines exist. There are general search engines such as Google and others that are more specialised, and we have to choose the right one depending on what we are working on. Google searches for a specific word. Its main problem is that the moment that word is modified, the results we get are different. The following table shows that change with several examples. The figures are the number of results we got for the singular form (first figure) and the plural form (second figure) of each word.
|WORD (SINGULAR / PLURAL)||SINGULAR RESULTS||PLURAL RESULTS|
|surface / surfaces||30,800,000||7,420,000|
|theory / theories||38,600,000||5,720,000|
|language / languages||86,100,000||24,500,000|
|foot / feet||57,400,000||55,100,000|
|notebook / notebooks||27,500,000||11,200,000|
|woman / women||56,400,000||217,000,000|
One of the main problems of Google is that it does not treat two words as related when the only difference between them is that one is singular and the other plural. This means that when we add the plural suffix to a word such as "surface", we get fewer results (7,420,000) than for the singular form (30,800,000). The only exception in our table is "woman / women", where the plural form gave more results (217,000,000) than the singular (56,400,000). In either case, this kind of search engine is not able to relate the plural form to the singular, as the human mind can.
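One way a search engine could relate singular and plural forms is to reduce both to a common key before searching. The following Python sketch shows the idea for the word pairs in our table. It is a deliberately naive illustration: the function `singularize`, its suffix rules and the small list of irregular plurals are our own assumptions, and a real system would more likely rely on a proper stemmer or lemmatizer.

```python
# Irregular plurals taken from our table; any other irregular form would
# have to be added by hand in this naive sketch.
IRREGULAR_PLURALS = {"feet": "foot", "women": "woman"}

TABLE_PAIRS = [
    ("surface", "surfaces"),
    ("theory", "theories"),
    ("language", "languages"),
    ("foot", "feet"),
    ("notebook", "notebooks"),
    ("woman", "women"),
]


def singularize(word):
    """Reduce an English noun to a rough singular form (naive rules)."""
    word = word.lower()
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"       # theories -> theory
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]             # surfaces -> surface
    return word


for singular, plural in TABLE_PAIRS:
    # Both forms should map to the same search key.
    print(singular, plural, singularize(singular) == singularize(plural))
```

With such a key, a search for "surfaces" and a search for "surface" would look up the same entry. The rules are far from complete (for example, they would turn "boxes" into "boxe"), so this is only meant to show the principle.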
Word-IQ is not as well known as Google. Its structure is more or less the same as Google's, but with more search options, such as dictionary, encyclopedia, thesaurus, the web and ebooks. With this system we have different fields in which we can get more information. That helps us to complete our work because we have many things in one place, so we can say that this search engine is quite complete. In our case, as English Philology students, it would be very useful in our studies, because we can search for books, meanings of words and general information about certain topics.
However, there are certain disadvantages with respect to language; we are referring to the choice of language. In Google, we can restrict our search by choosing the option of searching only Spanish sites. In Word-IQ, we can only search for English terms and sites. Nowadays, as English is the most widely used language in the world, it is the language in which we find most of the information.
Comparing Word-IQ with Google, we have come to the conclusion that both have the same problem with plurals. If we enter the plural form of a word in Word-IQ, we get fewer results than for the singular form, as in Google. We tested this problem with the Encyclopedia option. We entered the word "war" and got its definition, as a printed encyclopedia would give it:
Typically, warfare is mortal and lives of combatants are deliberately taken by enemy forces and the continued existence of a losing group as an entity is in doubt. In view of this, rules for the conduct of war are unenforceable during active conflict. A person faced with death, or an organisation faced with extinction, both have little incentive to obey rules that contribute to that result. If they can survive by breaking the rules they are likely to do so, and some would argue justifably.
Sometimes a distinction is made between a conflict and the formal declaration of a state of war. Given this distinction the term "war" is sometimes considered restricted to those conflicts where one or both belligerants have made a formal declaration.
Wars have been fought to control natural resources, for religious or cultural reasons, over political balances of power, legitimacy of particular laws, to settle economic and territorial disputes, and many other issues. The roots of any war are very complex - there is usually more than one issue involved.
In addition, we also got information about topics related to war, such as the following:
1 Philosophy of War
2 Types of war
3 Types of Warfare
4 Laws of war
5 Statistical analysis
6 Famous Quotes about War
However, when we entered the plural form "wars", the results were very different, because the machine did not recognise that the word comes from "war" and simply gave us the places in which the word "wars" appears. That is, the search engine did not relate the two forms. We got the following results, with the word "wars" highlighted in red:
1: ...uch, plus a forum that explains almost any [[Star Wars]] topic.
Interbellum (475 bytes)
1: An '''Interbellum''' is a period between wars. More specifically, the term is being used for th...
3: ...erbellum" contains an unspoken clause of "between wars that matter to major nations."
Afghan Wars (372 bytes)
1: ...s''' but is now referred to as the [[Anglo-Afghan wars]] perhaps to distinguish them from the civil stri...
Moff (397 bytes)
3: ...ath onboard the [[Death Star]] at the end of Star Wars IV.
5: ... onboard the second Death Star at the end of Star Wars VI.
AT-ST (327 bytes)
1: ...s Episode V: The Empire Strikes Back]] and [[Star Wars Episode VI: Return of the Jedi]], respectively.
Roger MacBride Allen (855 bytes)
War of the Three Henrys (262 bytes)
1: ... wars in [[France]], also known as the [[Huguenot Wars]].
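The behaviour above is what we would expect from a search that looks up the exact word form. The following Python sketch builds a very small index over three toy texts and compares an exact lookup with a lookup that folds a final "s". The texts are short fragments adapted from the results quoted in this report, and the names `build_index`, `search` and `strip_plural_s` are our own; we do not know how Word-IQ is actually implemented.

```python
import re
from collections import defaultdict

# Toy texts adapted from the excerpts quoted in this report.
TOY_DOCUMENTS = {
    "War": "rules for the conduct of war are unenforceable",
    "Interbellum": "an interbellum is a period between wars",
    "Moff": "at the end of star wars",
}


def tokenize(text):
    """Split a text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents, normalize=lambda token: token):
    """Map every (normalized) token to the titles that contain it."""
    index = defaultdict(set)
    for title, text in documents.items():
        for token in tokenize(text):
            index[normalize(token)].add(title)
    return index


def search(index, word, normalize=lambda token: token):
    """Return the sorted titles whose text contains the (normalized) word."""
    return sorted(index.get(normalize(word.lower()), set()))


def strip_plural_s(token):
    """Hypothetical, very rough normalizer: drop a final 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


exact_index = build_index(TOY_DOCUMENTS)
print(search(exact_index, "wars"))   # exact form only

folded_index = build_index(TOY_DOCUMENTS, strip_plural_s)
print(search(folded_index, "wars", strip_plural_s))   # "war" and "wars" together
```

In the exact index, "wars" only finds the texts that contain that precise form, just as in our test. The same normalizer has to be applied both when building the index and when searching, otherwise the two sides would not meet.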
Ask is quite different from the previous search engines. The difference is that it offers more options on which we can focus our search. We can search for famous people, the weather forecast, products, pictures, news, and so on. We can also do an advanced search, and even get a smart response.
This search engine, named Ask Jeeves, which responds to questions, phrases, or single words, makes our work easier by giving us several options to search for information. One important aspect to take into account is the correct spelling of words. If we make a mistake when typing a word, Ask Jeeves notices it and suggests a form that could be close to what we were looking for. In addition, we need to stick to the key terms. It is better to simplify what we are searching for than to give so much information that the search engine does not recognise what we want. We should not be too specific, but rather more general, in order to get better and many more results.
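A spelling suggestion of this kind can be imitated by comparing each typed word with a list of known words and proposing the closest one. The sketch below uses `difflib.get_close_matches` from the Python standard library. The small vocabulary and the functions `suggest` and `correct_query` are assumptions made for this example; we do not know which method Ask Jeeves really uses.

```python
import re
from difflib import get_close_matches

# Hypothetical vocabulary; a real engine would use a far larger word list.
VOCABULARY = [
    "what", "is", "how", "many", "wars", "are", "there",
    "human", "machine", "translation", "information", "retrieval",
]


def suggest(word, vocabulary=VOCABULARY):
    """Return the word itself if known, else the closest known word or None."""
    if word in vocabulary:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


def correct_query(query, vocabulary=VOCABULARY):
    """Replace each unknown word by its suggestion, if there is one."""
    corrected = []
    for word in re.findall(r"[a-z]+", query.lower()):
        corrected.append(suggest(word, vocabulary) or word)
    return " ".join(corrected)


print(correct_query("how many waras are there?"))
print(correct_query("what is hummana transaltion?"))
```

The `cutoff` value decides how similar a known word must be before it is suggested. Words with no sufficiently close match are left as they were typed.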
Moreover, it is recommended to ask one question at a time, and to make sure the spaces between words are correctly placed. Ask also gives us the opportunity to fill in a form (http://sp.ask.com/docs/help/help_tuwyt.php) in case we do not get the results that we want.
We tested what happens when spaces between words are left out or words are misspelled. Ask Jeeves recognised the correct form in most of the following cases:
|WRONG FORM||CORRECT FORM|
|whatis informationretrieval?||what is information retrieval?|
|where isnewyorkcity?||not recognised|
|whatis machinetranslation?||what is machine translation?|
|how many waras are there?||how many wars are there?|
|what is hummana transaltion?||what is human translation?|
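The missing-space cases in the table can be approached by trying to split a run of letters into known words. The following Python sketch does this with a simple dynamic-programming search that prefers the split with the fewest words. The vocabulary and the functions `segment` and `repair_spacing` are our own assumptions, chosen only to illustrate the technique, not to reproduce Ask Jeeves.

```python
import re

# Hypothetical vocabulary; it deliberately lacks "where", "new", "york"
# and "city".
VOCABULARY = {
    "what", "is", "information", "retrieval", "machine", "translation",
}


def segment(text, vocabulary=VOCABULARY):
    """Split text into vocabulary words, or return None if impossible."""
    # best[i] holds the shortest list of words that spells text[:i].
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


def repair_spacing(query, vocabulary=VOCABULARY):
    """Re-insert missing spaces where a token splits into known words."""
    repaired = []
    for token in re.findall(r"[a-z]+", query.lower()):
        words = segment(token, vocabulary)
        repaired.extend(words if words else [token])
    return " ".join(repaired)


print(repair_spacing("whatis informationretrieval?"))
print(segment("isnewyorkcity"))
```

In this sketch a run of letters cannot be split when one of its words is missing from the vocabulary, which is the case for "isnewyorkcity" here. Whether that is also why Ask Jeeves did not recognise "where isnewyorkcity?" is something we cannot tell from our test.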
As we have seen throughout this report, we have to pay close attention when we want to find something with these search engines. We have explained some of the problems that we found in different search engines. It is true that they have some flaws and weaknesses, but their advantages outweigh their disadvantages. The Internet is used by millions of people in different places, and these webpages allow us to use information that we would not get without being online. It has been a great advance in computer technology, and we hope it will improve as the years pass.
Something we noticed is that Google and Word-IQ give us the number of results, whereas Ask Jeeves does not. That made it difficult to compare the number of results for singular and plural forms. However, this is not a great concern here, because what we really wanted to know was which results we got, not how many.
As for the different search engines, we have seen that they have many limitations. We have explained some of them; although we may not know about all of them, we would notice more if we used these engines more often. It is also true that a great deal of work is going on to overcome those limitations.
Ask Jeeves: http://www.ask.com
Information Retrieval (and other terms): http://tfpsly.planet-d.net/english/IR.html