Search and Content Retrieval Engines on the Internet
Idoia Martínez del Mozo
Ángela Maside Páramo
Alejandro Otaola Rojo
This report is an assigned research task for our class, "English Language and New Technologies", a second-year subject in the English Philology degree at the University of Deusto (Universidad de Deusto) in Bilbao, Spain. The subject is taught by Professor Joseba Koldobika Abaitua Odriozola, and this part of it deals with Information Retrieval and Information Management.
This is the third report we have been asked to write during this course on New Technologies. In the last part of the second half of the course we have been dealing with Information Retrieval and Information Management, i.e. methods for searching for information on the Internet. Here we apply the knowledge we have acquired on the subject, limited as it is for lack of time, by evaluating and comparing the working methods of some search engines that can be found on the Internet.
In this third report for the "English Language and New Technologies" course, we are expected to carry out research on Information Retrieval and Information Management. To complete this task, we will work with several search engines available on the Internet and compare them with one another.
We will focus mainly on the linguistic side of these search engines. What we will try to evaluate is the linguistic accuracy with which they do their job, that is, how they look for the information that we, the users, give them.
We will enter search terms into the different search engines. The terms will differ from one another in a single morphological feature, for example the number morpheme: we first give the search engine the singular form of a word and then its plural form. With this experiment we will try to find out whether the search engines really recognise what they are given, namely the search terms and their grammatical variants.
Built in 1999, the Google search robot has been one of the best tools made for finding websites or any other kind of item requested by the user. It covers about 1,060 million registered and unregistered web pages: about 560 million registered pages and about 500 million unregistered ones.
It works by reference to the links between web pages. When a user types a word or a sentence, a list of pages related to the query appears automatically.
One of the reasons why Google has become so popular is the variety of languages it offers: it can be used in more than 10 languages, including English, Spanish and Korean. Another point in Google's favour is that it also has a translation tool, with which users can translate whatever they want to look for. Google distinguishes between the linguistic forms of nouns: for this search engine each word form is unique and different from the others. For example, if we enter the term "telephone" we find approximately 7,250,000 entries, whereas if we enter the term "telephones" we find approximately 662,000 entries.
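The difference between the two counts is what one would expect from exact-term matching, where "telephone" and "telephones" are simply two unrelated strings. The following is a minimal sketch of that idea in Python, not a description of Google's actual implementation; the small corpus and the function names `tokenize` and `count_entries` are invented for illustration.

```python
import re

# Toy corpus invented for illustration; it is not data from Google.
DOCUMENTS = [
    "The telephone rang twice.",
    "Old telephones used rotary dials.",
    "A telephone and two telephones.",
    "Her telephone is broken.",
    "No phone here.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_entries(term, documents):
    """Count the documents that contain `term` as an exact token."""
    return sum(1 for doc in documents if term in tokenize(doc))


if __name__ == "__main__":
    # The two forms are counted independently, so the totals differ.
    print("telephone:", count_entries("telephone", DOCUMENTS))
    print("telephones:", count_entries("telephones", DOCUMENTS))
```

Because the match is made on whole tokens, a document that only contains "telephones" is not counted for "telephone", which is why the singular and plural totals need not agree.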
START, the world's first Web-based question answering system, has been on-line and continuously operating since December, 1993. It has been developed by Boris Katz and his associates of the InfoLab Group at the MIT Artificial Intelligence Laboratory. Unlike information retrieval systems (e.g., search engines), START aims to supply users with "just the right information," instead of merely providing a list of hits. Currently, the system can answer millions of English questions about places (e.g., cities, countries, lakes, coordinates, weather, maps, demographics, political and economic systems), movies (e.g., titles, actors, directors), people (e.g., birth dates, biographies), dictionary definitions, and much, much more.
However, this search engine has many deficiencies. One of them is that searches are restricted to English. It does not recognise other languages, so people who have no knowledge of English cannot use it. Another limitation (apart from being very slow) is that it sometimes does not recognise the question you ask it.
Linguistically, this search engine is not very rich because, for example, it does not distinguish between singular and plural. To test it, we entered the term "cat" and the term "cats". The definition returned was the same for both: "Cat is a favorite pet of people around the world".
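One way a system can end up giving the same answer for "cat" and "cats" is by reducing both forms to a single base form before looking them up. The sketch below illustrates that behaviour under stated assumptions: we do not know how START handles number internally, the suffix rules are deliberately naive, and the names `DEFINITIONS`, `singularize` and `define` are hypothetical.

```python
# Hypothetical one-entry knowledge base; the definition is the one quoted above.
DEFINITIONS = {"cat": "Cat is a favorite pet of people around the world"}


def singularize(word):
    """Very naive plural stripping; real systems use proper morphological analysis."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # cities -> city
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # cats -> cat
    return word                     # cat, glass: left unchanged


def define(term):
    """Look up the definition of the base form of `term` (None if unknown)."""
    return DEFINITIONS.get(singularize(term))


if __name__ == "__main__":
    print(define("cat"))
    print(define("cats"))
```

Both calls print the same sentence, because the plural is folded into the singular before the lookup takes place.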
START (SynTactic Analysis using Reversible Transformations) is a natural language processing system. It consists of two modules which share the same Grammar. The understanding module analyzes English text and produces a knowledge base which incorporates information found in the text. Given an appropriate segment of the knowledge base, the generating module produces English sentences. A user can retrieve the information stored in the knowledge base by querying it in English. The system will then produce an English response.
In addition, by annotating free-form text with English phrases and sentences, then matching these annotations with incoming queries, the power of sentence-level natural language processing can be effectively put to use in the service of multi-media information access. Furthermore, this technique generalizes easily to the indexing and retrieval of all types of information, whether or not these are based on text.
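The annotation idea can be sketched in a few lines of Python. This is a strong simplification and only an assumption about the general approach: START matches annotations and queries at the sentence level, whereas the sketch below merely counts shared content words. The annotations, file names, stop-word list and function names are all invented for illustration.

```python
import re

# Hypothetical annotations: each English sentence describes one resource,
# which may be a text, an image or any other kind of data.
ANNOTATIONS = [
    ("Voyager took pictures of Neptune", "neptune_images.gif"),
    ("Bilbao is a city in Spain", "bilbao_map.png"),
    ("Cats are popular pets", "cat_article.txt"),
]

STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "what", "where",
             "which", "who", "did", "do", "show", "me"}


def content_words(sentence):
    """Return the set of lowercase words in `sentence`, minus stop words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}


def best_match(query):
    """Return the resource whose annotation shares most words with the query."""
    query_words = content_words(query)
    best, best_score = None, 0
    for annotation, resource in ANNOTATIONS:
        score = len(query_words & content_words(annotation))
        if score > best_score:
            best, best_score = resource, score
    return best  # None when no annotation shares a word with the query


if __name__ == "__main__":
    print(best_match("Show me pictures of Neptune"))
    print(best_match("Where is Bilbao?"))
```

Note that the retrieved resources are never inspected: only their English annotations are compared with the query, which is why the same technique can index images or other non-textual material.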
The START system was demonstrated during the Voyager Neptune encounter when researchers from the Jet Propulsion Laboratory and members of the press in the JPL press room were able to use the START system to retrieve information concerning the encounter, the Voyager spacecraft, and the solar system.
A search engine is a system that uses spiders and robots (computer programs) to gather and disseminate information automatically on the Internet. Most of the major search engines store the collected information in a database and index it so that a huge amount of data can be handled efficiently. These computer-generated databases are updated frequently and provide the most comprehensive search results. MSN, a regular search engine (although it does not have its own database), is considered to be the largest source of referrals. According to a Jupiter Media report of January 2002, MSN Search came out on top of all other search engines.
Microsoft's MSN Search service is mainly a LookSmart-powered directory of web sites. Its secondary results are provided by Inktomi's database. MSN Search also uses both the RealNames and Direct Hit databases, and for certain search terms it provides results from Overture too. MSN Search also offers a unique way for Internet Explorer 5 users to save past searches. Although MSN Search uses other search engines' databases, the listings are ranked according to its own ranking algorithm.
In MSN Search, it is better to type search words in lowercase letters, because lowercase words will match any case. Using specific words (not long combinations of words) is also recommended. There is also the facility of Advanced Search to provide more specific results to the net
Like Google, the MSN search engine distinguishes between the linguistic forms of nouns. If we enter the term "dolphin" we find about 17,469 entries containing that word, whereas if we enter the term "dolphins" we find about 15,566 entries.
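The step of storing collected pages and indexing them can be illustrated with an inverted index, a structure that maps each word to the pages containing it. The following is a minimal sketch in Python, not the design of MSN or of any particular engine; the pages, the URLs and the function names `build_index` and `search` are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pages that a spider might have collected (URL -> text).
PAGES = {
    "http://example.org/a": "Dolphins are marine mammals",
    "http://example.org/b": "A dolphin can swim fast",
    "http://example.org/c": "Search engines index web pages",
}


def build_index(pages):
    """Map every lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    results = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(results)


if __name__ == "__main__":
    index = build_index(PAGES)
    print(search(index, "dolphin"))
    print(search(index, "web pages"))
```

The index is built once, when the pages are collected, so that answering a query only requires looking up each query word instead of rereading every page.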
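The lowercase rule described for MSN Search can be sketched as a small matching function. The first branch follows the text: a lowercase query word matches any case. The second branch is our assumption, since the text does not say what happens when the query contains capital letters; here such a word must match exactly. The function name `term_matches` is hypothetical.

```python
def term_matches(query_word, page_word):
    """Decide whether a word typed by the user matches a word on a page."""
    if query_word.islower():
        # Stated in the text: lowercase words match any case.
        return query_word == page_word.lower()
    # Assumption: a query word containing capitals must match exactly.
    return query_word == page_word


if __name__ == "__main__":
    for query, page in [("dolphin", "Dolphin"), ("dolphin", "DOLPHIN"),
                        ("Dolphin", "dolphin"), ("Dolphin", "Dolphin")]:
        print(query, page, term_matches(query, page))
```

Under this rule a lowercase query is the safest choice, because it never excludes a page on the grounds of capitalisation alone.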
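The entry counts reported in this document can be compared by dividing the singular count by the plural count. The short Python calculation below uses only the figures given above; the names `REPORTED_COUNTS` and `singular_to_plural_ratio` are ours.

```python
# Entry counts as reported in this document: (singular, plural).
REPORTED_COUNTS = {
    "Google (telephone / telephones)": (7_250_000, 662_000),
    "MSN (dolphin / dolphins)": (17_469, 15_566),
}


def singular_to_plural_ratio(singular_count, plural_count):
    """Return how many singular entries were found per plural entry."""
    return singular_count / plural_count


if __name__ == "__main__":
    for label, (singular, plural) in REPORTED_COUNTS.items():
        ratio = singular_to_plural_ratio(singular, plural)
        print(f"{label}: {ratio:.2f}")
```

The ratio is roughly 11 to 1 for the Google pair and roughly 1.1 to 1 for the MSN pair. Since the two engines were tested with different words, the ratios only show that each engine counts the two forms separately; they do not allow a direct comparison between the engines.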
After a brief period of working with search engines, we have noticed several things about how these tools are used and how they work.
We have realised that search engines differ greatly from one another. In general, however, what has caught our attention is that most of them do not understand some basic morphological differences between words. They are not capable of making a clear distinction between singular and plural forms, which from a linguistic point of view is a basic and primary distinction. Although this feature may not seem important at first, we realised that it is in fact very relevant for a clear morphological distinction to be made.
In contrast, we have found that some search engines do have linguistic tools that make searching considerably easier. These tools make a search more precise, for example by letting the user choose the language of the search or by means of other linguistically related features.
Web page for START searcher: www.ai.mit.edu/projects/infolab
Web page for Google searcher: www.google.es
Web page for MSN searcher: http://search.msn.com