Report C by Ager Gondra, Unai Diego de Somonte, Andrew San Juan, Stéphane Cos.

 

 

ABSTRACT:

In this project we work on Information Retrieval. For this purpose we have selected a few internet search engines and searched for certain words on them, first in their singular form and afterwards in their plural form. We then compare the results obtained and draw our conclusions on why the results are different.

 

INTRODUCTION:

Our first step was to select a few internet search engines: Google, Yahoo, Terra, Jalgi and Aurki.

Some of these, like Google or Yahoo, are well known throughout the world, but others, like Aurki and Jalgi, are less famous. This contrast is interesting because it lets us see whether the popularity of an internet search engine is directly proportional to its semantic capabilities.

Once the search engines are chosen, the next step is to select a list of words. The words we have selected are: ball, giant, apple, war, connexion and boy. Each word is searched for in each internet search engine in its singular form as well as in its plural form. This procedure is carried out not only in English, but also in Basque (Euskera) and in Spanish.

Near the end of the report there is also a description of other internet search engines and some of their characteristics.

The last step will be to compare the results obtained from each search engine and, where they differ, to try to explain the differences.

 

 

BODY:

First of all, we have searched for a definition of what Information Retrieval means:

Information Retrieval (IR), or document retrieval, is the systematic manipulation of textual information so that it can easily be found again (retrieved). On the WWW, the most important method of IR is the indexing of free-form text. IR exhibits similarities to (but is not the same as) other areas of information processing, such as expert systems and data base management systems (DBMS).

 sammelpunkt.philo.at:8080/archive/00000023/01/HTML_Version/text/node83.html
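
To make the idea of "indexing of free-form text" more concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration under our own assumptions: the three short documents and the function name build_index are invented for this example, and real search engines use far more elaborate indexes.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical three-document collection, invented for illustration.
docs = {
    1: "the boy kicked the ball",
    2: "two boys played with balls",
    3: "the giant ate an apple",
}

index = build_index(docs)
print(sorted(index["ball"]))   # documents containing the exact token "ball"
print(sorted(index["balls"]))  # "balls" is a separate index entry
```

In such a simple index the singular and the plural are two unrelated entries, so a query for one form does not retrieve documents that contain only the other.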

 

We have also included a detailed description of each search engine selected.

Google

One of the most versatile web search engines. It allows searches using five different criteria:

- searches in the web

- searches by groups or associations

- searches of images only

- searches by directories

- searches of news associated with the typed word

It also has a translator integrated in the search engine that translates texts of up to 150 words. Moreover, it can translate most of the web pages the user may find interesting, and it can be configured to be displayed in various languages, including languages such as "Klingon" or "Bork!Bork!Bork!". This configuration is available through a complete preferences menu, alongside an "advanced search" option, which makes Google a globally efficient search and translation tool.

 

Yahoo

One of the oldest on the web. Conceived more as an information portal than as a search engine, it offers direct access to the most interesting news and the possibility of personalizing it if the user is registered, with access to e-mail, horoscopes, etc. The advanced search option is as detailed as the one in Google. Nevertheless, the range of languages is smaller and there is no text translation option. Yahoo has the following search criteria:

- search in all the web

- search of images

- yellow pages

- directories

- news

 

Terra

An exclusively Spanish portal similar to Yahoo. It offers services such as free e-mail and access to forums and chats. The portal emphasizes these options and neglects others, such as a more complete "advanced search" option; nevertheless, it offers a simple but complete translation system provided by "Reverso".

 

Jalgi

A Basque search engine that can be displayed in French, Spanish, English or Basque. It is a limited search engine, since it has neither an advanced search option nor a text translation system.

It is interesting to note that the most popular search engines, Google and Yahoo, are more "international", since they are registered under different domains; that is, they have international portals (Spanish, French, English...) and modify their services depending on the chosen country.

Aurki

It is not very famous; the most popular search engine in the Basque language remains Kaixo. As its title says, it is the first Basque-only search engine. It offers no option for translating the page into Spanish or French, which makes its usage more exclusive, limited to users with a knowledge of Basque.

But, as we look deeper into this search engine, we find an option for other languages. It is not very useful in itself, because it takes us directly to Google.

 

Testing the Search Engines

 

We have tested the Google search engine by searching for the following words in different languages and recording the number of results obtained.

 

Words in English

 

Singular     Results        Plural        Results
Ball         32,100,000     Balls         9,960,000
Giant        12,200,000     Giants        5,520,000
Apple        41,500,000     Apples        3,150,00
War          97,500,000     Wars          19,900,000
Connexion    4,360,000      Connexions    926,000
Boy          47,200,000     Boys          35,900,000

 

 

Words in Spanish

Singular     Results        Plural        Results
Pelota       483,000        Pelotas       693,000
Gigante      924,000        Gigantes      545,000
Manzana      387,000        Manzanas      217,000
Guerra       8,860,000      Guerras       559,000
Conexión     1,260,000      Conexiones    354,000
Niño         1,480,000      Niños         2,290,000

 

 

Words in Basque

 

Singular     Results        Plural        Results
Pilota       15,200         Pilotak       864
Erraldoia    974            Erraldoiak    830
Sagarra      13,900         Sagarrak      449
Guda         643            Gudak         22
Konexioa     936            Konexioak     952
Mutila       16,100         Mutilak       2,120

 

These are the results for the same words searched with the Yahoo! search engine.

 

 

Words in English

 

Singular     Results        Plural        Results
Ball         50,300,000     Balls         13,800,000
Giant        22,800,000     Giants        10,300,000
Apple        38,900,000     Apples        4,830,000
War          136,000,000    Wars          29,200,000
Connexion    6,400,000      Connexions    864,000
Boy          62,000,000     Boys          45,200,000

 

 

Words in Spanish

Singular     Results        Plural        Results
Pelota       826,000        Pelotas       440,000
Gigante      1,470,000      Gigantes      609,000
Manzana      20,000         Manzanas      210,000
Guerra       1,900,000      Guerras       926,000
Conexión     1,260,000      Conexiones    832,000
Niño         824            Niños         735

 

 

Words in Basque

 

Singular     Results        Plural        Results
Pilota       1,140,000      Pilotak       43,000
Erraldoia    491            Erraldoiak    419
Sagarra      24,300         Sagarrak      198
Guda         34,800         Gudak         194
Konexioa     505            Konexioak     138
Mutila       28,100         Mutilak       540

 

Popular words such as sagarra, pilota or mutila return more than 10,000 results each. In general, the singular form returns many more results than the plural one. The exception is konexioak in the Google results (952 against 936 for konexioa), a word that tends to occur in the plural in Basque.
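
The comparison between singular and plural can be expressed as a simple ratio. The short Python sketch below does this for the Google results for the Basque words reported earlier; the variable and function names are our own choices for illustration, not part of any search engine.

```python
# Google result counts for the Basque words, copied from the results
# reported earlier in this report: word -> (singular, plural).
google_basque = {
    "pilota": (15200, 864),
    "erraldoia": (974, 830),
    "sagarra": (13900, 449),
    "guda": (643, 22),
    "konexioa": (936, 952),
    "mutila": (16100, 2120),
}

def singular_to_plural_ratio(counts):
    """Return, for each word, singular results divided by plural results."""
    return {word: singular / plural for word, (singular, plural) in counts.items()}

ratios = singular_to_plural_ratio(google_basque)

# Words whose plural form returns more results than the singular form.
exceptions = [word for word, ratio in ratios.items() if ratio < 1]

for word, ratio in ratios.items():
    print(f"{word}: {ratio:.2f}")
print("plural more frequent:", exceptions)
```

A ratio above 1 means that the singular form is more frequent; with these figures only konexioa falls below 1.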

During our searches we have seen that search engines like Google or Yahoo! leave the most common words, such as articles, out of the search and ignore accent marks such as the Spanish "tilde". Other search engines, like Aurki, search not for the exact word but for its root. We have studied this less well-known search engine in more detail, which shows the limitations these tools can have.
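
As an illustration of this behaviour, the following Python sketch removes common words and accent marks from a query before searching. It is a sketch under our own assumptions: the stop-word list and the function names are hypothetical, and we do not know how Google or Yahoo! actually implement this step.

```python
import unicodedata

# Small hypothetical stop-word list; real engines use their own, larger lists.
STOP_WORDS = {"the", "a", "an", "el", "la", "los", "las", "un", "una"}

def strip_accents(text):
    """Remove combining marks, so that 'conexión' becomes 'conexion'.

    Note that this simple approach also turns 'ñ' into 'n'.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def normalize_query(query):
    """Lowercase the query, strip accents and drop stop words."""
    words = strip_accents(query.lower()).split()
    return [word for word in words if word not in STOP_WORDS]

print(normalize_query("La conexión"))
print(normalize_query("the giant apple"))
```

With this kind of normalization, "conexión" and "conexion" become the same query term, which is one possible reason why typing the accent or leaving it out makes no difference.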

 

We are going to look at the following Basque words: "baloia", "erraldoia", "sagarra", "guda", "konexioa" and "mutila".

Words                    Singular/plural
Baloia/baloiak           538/150
Erraldoia/erraldoiak     3 sections and 4 places
Sagarra/sagarrak         1 section and 2 places
Guda/gudak               3 sections and 3 places
Konexioa                 1 section and 1 place
Mutil                    3 sections and 2 places

In the case of "erraldoi" we find different interpretations of the word: it appears both as an adjective and as a noun, which reflects the ambiguity of a word taken out of its context.

It is strange how the search engine handles the word "sagarra": instead of restricting itself to the word's own meaning, it reads it as "sagarroi", a word based on "sagarra" but which has nothing to do with what the word apple means.

It is also remarkable that this search engine pays little attention to the declensions of the Basque words we have tested: the words were tested in singular and plural, but it also returns the places where the words appear in other forms, such as the indefinite form "mutil" for "mutila" or "mutilak". This is why the results in singular and plural are the same.
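
The behaviour we observed is what one would expect from a search by root. The Python sketch below is only our guess at the general idea, not Aurki's actual method: the suffix list, the function names and the list of indexed words are hypothetical, and real Basque morphology is much richer than two suffixes.

```python
# Hypothetical suffix list: -ak (plural) and -a (singular) of the article.
SUFFIXES = ("ak", "a")

def naive_root(word):
    """Strip one known suffix, keeping a root of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def root_search(query, indexed_words):
    """Return every indexed word that starts with the root of the query."""
    root = naive_root(query)
    return [word for word in indexed_words if word.startswith(root)]

# Hypothetical list of indexed words, invented for illustration.
indexed = ["mutil", "mutila", "mutilak", "erraldoi", "erraldoia"]

print(root_search("mutila", indexed))
print(root_search("mutilak", indexed))
```

Both queries are reduced to the root "mutil" and therefore return the same list, which would explain identical results for the singular and the plural.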

 

CONCLUSION

 

From the results obtained, we can see that in most cases the singular form of a noun generates far more results in each search than the plural form of the same word.

This could be related to the limitations of computer translators. In report B we saw that computers have problems dealing with lexical variation: plurals, compound nouns and neologisms are most of the time not well recognized. This lexical limitation seems to extend to search engines. Thus, changing a single letter to go from singular to plural results in fewer entries when searching for a word. This inability to recognise derived forms is an important problem of search engines.

We have also noticed that some search engines, such as Google, correct the word you are looking for if you have misspelled it or made a mistake while typing it.

Nevertheless, Basque internet search engines return far fewer results than any of the other search engines. This could be because their lexicon is much more limited, and perhaps also because those search engines are designed to look strictly for words in Basque.