Information Retrieval


Ane Alaña, Diana Sagarna, Nerea Basterretxea



















In this report, written as an exercise for the subject English Language and New Technologies, we discuss information retrieval and the difficulties we can face when looking for information: there is too much of it, it is scattered and unorganized, and it is hard to tell the "crap" from the useful information. These issues link with previous questionnaires and reports (see report A, which deals with them). The information has been taken from the references given by Prof. Joseba Abaitua.



Recently, there has been an attempt to improve the communication between machines and humans through language. However, there are still problems that have to be solved, such as:

  1. There is too much information, because more people now have access to the new technologies.
  2. The second problem comes as a consequence of the first one. There is so much information that we often cannot tell the right information from the wrong one (or the important parts from the "blah-blah"). This happens because there is so much of it, and it is so interrelated, that search engines often get confused. That is why we have to improve in this field.

The three key steps to be taken to reach the correct information are these: an improved information retrieval, information extraction from the document to get exactly what we were looking for, and text summarisation, in case it is too long.

But getting a machine to do all this is still very complicated. A collaboration between linguists and computer engineers should solve this problem, creating improved language tools for a better understanding between the machine and the user. This is where Human Language Technologies have an important role to play.

Cross-lingual information retrieval is a tool with which users can make searches in their native language and get information from any part of the world, translated from any language.

Like cross-lingual information retrieval, question-answering systems, which allow people to get information from the machine by asking questions, are still in the process of development. They are therefore not very effective yet, and their results are not accurate.

In this report we mainly want to present the effectiveness and efficiency of information retrieval, information overload, and the information fatigue syndrome.


About the Information Fatigue Syndrome: too much information to handle



Pile of papers

"On some days I can see the pile of papers on my desk grow right before my eyes, just like those time-lapse films of flowers opening up".

Peter Guilford, spokesman for the European Commission in Brussels.

Clutter in the mind

P. Guilford isn't just worried about the clutter on his desk; the clutter in his mind bothers him too. All that paper contains voluminous words, numbers and diagrams, far too much information for him to read, much less remember and thoroughly comprehend. And if he could somehow get through a deskful of documents, his computer could easily spit out more.

Rife Internet

The Internet is rife with Web pages and databases containing material that could be useful to Guilford, if only he could get to it. Still, much of what there is to wade through, he points out, is simply not worth the trouble.

Information overload

Like most bureaucrats, business executives, teachers, doctors, lawyers and other professionals, Guilford increasingly feels he is suffering from information overload. The symptoms of this epidemic ailment can include tension, occasional irritability and frequent feelings of helplessness, all signs that the victim is under considerable stress.

Information and knowledge

"Knowledge is power, but information is not. It's like the detritus that a gold-panner needs to sift through in order to find the nuggets."

D. Lewis


David Lewis coined the term "information fatigue syndrome" for what he expects will soon be a recognized medical condition.

"Having too much information can be as dangerous as having too little. Among other problems, it can lead to a paralysis of analysis, making it far harder to find the right solutions or make the best decisions."

"Information is supposed to speed the flow of commerce, but it often just clogs the pipes."

David Lewis

Dr. David Lewis is a British psychologist, author of the report Dying for Information?, commissioned by London-based Reuters Business Information. Lewis has coined the term "information fatigue syndrome" for what he expects will soon be a recognized medical condition. Lewis is a consultant who has studied the impact of data proliferation in the corporate world.

Essential vs. irrelevant data

"Better training in separating essential data from material that, no matter how interesting, is irrelevant to the task at hand is needed."

D. Lewis

The European Commission is also encouraging governments, corporations and small businesses to train people in how to manage data.

The irony of the fact that Dying for Information? was sponsored by Reuters Business Information is not lost on its executives, who direct the production and marketing of information services to corporate clients around the world.

"We would argue that Reuters' whole raison d'être for the past 150 years is getting through the overload to the salient facts."

Paul Waddington, marketing manager at Reuters.

"Dealing with the information burden is one of the most urgent challenges facing businesses. Unless we can discover ways of staying afloat amidst the surging torrents of information, we may end up drowning in them."

D. Lewis




Information retrieval (IR) is the art and science of searching for information in documents, searching for documents themselves, searching for metadata which describes documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data. There is a common confusion, however, between data, document, information, and text retrieval, and each of these has its own body of literature, theory, praxis and technologies.

IR is a broad interdisciplinary field, that draws on many other disciplines. Indeed, because it is so broad, it is normally poorly understood, being approached typically from only one perspective or another. It stands at the junction of many established fields, and draws upon cognitive psychology, information architecture, information design, human information behaviour, linguistics, semiotics, information science, computer science and librarianship.

Automated information retrieval (IR) systems were originally used to manage the information explosion in scientific literature over the last few decades. Many universities and public libraries use IR systems to provide access to books, journals, and other documents. IR systems are usually described in terms of two notions, the object and the query. Queries are formal statements of information needs that are put to an IR system by the user. An object is an entity which keeps or stores information in a database. User queries are matched to documents stored in a database. A document is, therefore, a data object. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates.

In 1992 the Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim was to support the information retrieval community by supplying the infrastructure needed for large-scale evaluation of text retrieval methodologies.

Web Search Engines such as Google and Lycos are amongst the most visible applications of Information retrieval research.



Google is (currently) the most popular search engine on the web. As of 2004, it handles upwards of 80% of all Internet searches through its website and the websites of clients like AOL, or roughly 200 million search requests per day. The popularity of Google is evinced by the fact that the verb "to google" is sometimes used generically to mean "to search the web".

(Note that "80% of all Internet searches" was when Yahoo! still used Google. With the recent Yahoo! move to deliver independent results, this figure might be much lower.)

In addition to web pages, Google also provides services for searching images, Usenet newsgroups and news sites. It currently indexes 4.28 billion web pages, 880 million images and 845 million Usenet messages, a total of 6 billion items. It also caches much of the content that it indexes.

The search engine

PageRank and indexing

Google uses an algorithm called PageRank to rank web pages that match a given search string. The PageRank algorithm computes a recursive figure of merit for web pages, based on the weighted sum of the PageRanks of the pages linking to them. The PageRank thus derives from human-generated links, and correlates well with human concepts of 'importance'. Previous keyword-based methods of ranking search results, used by many search engines that were once more popular than Google, would rank pages by how often the search terms occurred in the page, or how strongly associated the search terms were within each resulting page. In addition to PageRank, Google also uses other secret criteria for determining the ranking of pages on result lists.
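The recursive idea described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not Google's actual implementation: the function name `pagerank`, the toy link graph, the damping factor of 0.85, the fixed number of iterations and the even spreading of rank from pages without outgoing links are all choices made for this example and are not stated in the report.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly passing rank along links.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    # Start with the same rank for every page.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base amount of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked from A, B and D; nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy web, page C collects links from three pages and ends up with the highest rank, while page D, which no page links to, ends up with the lowest.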

Google employs server farms of more than 100,000 GNU/Linux computers around the world to answer search requests and to index the web. The indexing is performed by a program ("Googlebot") which periodically requests new copies of the web pages it already knows about. The more often a page updates, the more often Googlebot will visit. The links in these pages are examined to discover new pages to be added to its database. The index database and web page cache is several terabytes in size.

Google not only indexes and caches HTML-files but also 12 other file types, including .pdf (Portable Document Format), .txt (text), .doc (Word document), and .xls (Excel spreadsheet). Except in the case of text files, the cached version is a conversion to HTML. Hence Google allows reading these files even without having the corresponding program such as Word or Excel.

The search engine is somewhat customizable, allowing users to set a default language, whether to use "SafeSearch" filtering technology, and setting the number of results displayed per page. Google has been criticized for placing long-term cookies on users' machines to store these preferences, which also enables them to track a user's search terms over time. However, most of Google's services can be used with cookies disabled.



Lycos is an Internet search engine and web directory. It was born from a research project by Dr. Michael Mauldin of Carnegie Mellon University (CMU) in 1994. The original Lycos search engine went on to be used in Carnegie Mellon's Informedia Digital Library project. The name "Lycos" comes from Lycosidae, the Latin name of the wolf spider family.

Shortly after the development of the Lycos Search Engine, the Lycos company was formed using venture capital and initial internal support from CMU. The CEO of the Lycos company was Bob Davis, a native of Boston who moved the headquarters of Lycos to Waltham, Massachusetts from Pittsburgh, and concentrated on building it into an advertising-supported Web Portal, arguably at the expense of the Information Retrieval research on which the company was founded.

Lycos suffered under competition from Google, which concentrated on driving its business primarily on the basis of fast, effective web search, but remains a viable business as of early 2004.

Extracted from Wikipedia



Since the 1940s the problem of information storage and retrieval has attracted increasing attention. It is simply stated: we have vast amounts of information to which accurate and speedy access is becoming ever more difficult. One effect of this is that relevant information gets ignored since it is never uncovered, which in turn leads to much duplication of work and effort. With the advent of computers, a great deal of thought has been given to using them to provide rapid and intelligent retrieval systems. In libraries, many of which certainly have an information storage and retrieval problem, some of the more mundane tasks, such as cataloguing and general administration, have successfully been taken over by computers. However, the problem of effective retrieval remains largely unsolved.

In principle, information storage and retrieval is simple. Suppose there is a store of documents and a person (user of the store) formulates a question (request or query) to which the answer is a set of documents satisfying the information need expressed by his question. He can obtain the set by reading all the documents in the store, retaining the relevant documents and discarding all the others. In a sense, this constitutes 'perfect' retrieval. This solution is obviously impracticable. A user either does not have the time or does not wish to spend the time reading the entire document collection, apart from the fact that it may be physically impossible for him to do so.

An information retrieval system

Let me illustrate by means of a black box what a typical IR system would look like. The diagram shows three components: input, processor and output. Such a trichotomy may seem a little trite, but the components constitute a convenient set of pegs upon which to hang a discussion.

Starting with the input side of things. The main problem here is to obtain a representation of each document and query suitable for a computer to use. Let me emphasise that most computer-based retrieval systems store only a representation of the document (or query) which means that the text of a document is lost once it has been processed for the purpose of generating its representation. A document representative could, for example, be a list of extracted words considered to be significant. Rather than have the computer process the natural language, an alternative approach is to have an artificial language within which all queries and documents can be formulated. There is some evidence to show that this can be effective. Of course it presupposes that a user is willing to be taught to express his information need in the language.
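A document representative of the kind mentioned above, a list of extracted words considered to be significant, could be produced roughly as follows. This is a minimal sketch under our own assumptions: the tiny stop list and the function name `document_representative` are hypothetical, and the source text does not say how significance should be decided.

```python
import re

# Hypothetical stop list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "by"}


def document_representative(text):
    """Return the sorted set of words kept as the document's representative."""
    # Lower-case the text and keep only runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    # Drop the very common words, which say little about the content.
    return sorted({word for word in words if word not in STOP_WORDS})


rep = document_representative("The retrieval of information is the art of searching.")
```

Here `rep` holds only the four content words of the sentence; as the text says, the original wording and word order are lost once the representative has been generated.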

When the retrieval system is on-line, it is possible for the user to change his request during one search session in the light of a sample retrieval, thereby, it is hoped, improving the subsequent retrieval run. Such a procedure is commonly referred to as feedback. An example of a sophisticated on-line retrieval system is the MEDLINE system. I think it is fair to say that it will be only a short time before all retrieval systems will be on-line.

Secondly, the processor, that part of the retrieval system concerned with the retrieval process. The process may involve structuring the information in some appropriate way, such as classifying it. It will also involve performing the actual retrieval function, that is, executing the search strategy in response to a query. In the diagram, the documents have been placed in a separate box to emphasise the fact that they are not just input but can be used during the retrieval process in such a way that their structure is more correctly seen as part of the retrieval process.

Finally, we come to the output, which is usually a set of citations or document numbers. In an operational system the story ends here. However, in an experimental system it leaves the evaluation to be done.


Effectiveness and efficiency

Much of the research and development in information retrieval is aimed at improving the effectiveness and efficiency of retrieval. Efficiency is usually measured in terms of the computer resources used such as core, backing store, and C.P.U. time. It is difficult to measure efficiency in a machine independent way. In any case, it should be measured in conjunction with effectiveness to obtain some idea of the benefit in terms of unit cost. In the previous section I mentioned that effectiveness is commonly measured in terms of precision and recall. I repeat here that precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved, and recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents (both retrieved and not retrieved). The reason for emphasising these two measures is that frequent reference is made to retrieval effectiveness but its detailed discussion is delayed until Chapter 7. It will suffice until we reach that chapter to think of retrieval effectiveness in terms of precision and recall. It would have been possible to give the chapter on evaluation before any of the other material but this, in my view, would have been like putting the cart before the horse. Before we can appreciate the evaluation of observations we need to understand what gave rise to the observations. Hence I have delayed discussing evaluation until some understanding of what makes an information retrieval system tick has been gained. Readers not satisfied with this order can start by first reading Chapter 7 which in any case can be read independently.
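The two ratios defined above can be computed directly from the set of retrieved documents and the set of relevant documents. The following Python sketch is only illustrative: the function name, the document identifiers and the choice to return 0.0 when a set is empty are our own assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumption: report 0.0 instead of dividing by zero for empty sets.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents retrieved, three relevant in the collection.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"], relevant=["d2", "d4", "d7"])
```

In this example two of the four retrieved documents are relevant, so precision is 2/4, and two of the three relevant documents were found, so recall is 2/3.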


Future developments  

Much of the work in IR has suffered from the difficulty of comparing retrieval results. Experiments have been done with a large variety of document collections, and rarely has the same document collection been used in quite the same form in more than one piece of research. Therefore one is always left with the suspicion that worker A's results may be data specific and that were he to test them on worker B's data, they would not hold.

The lesson that is to be learnt is that should new research get underway it will be very important to have a suitable data-base ready. I have in mind a natural-language document collection, probably using the full text of each document. It should be constructed with many applications in mind and then be made universally available.

Information retrieval systems are likely to play an ever-increasing part in the community. They are likely to be on-line and interactive. The hardware to accomplish this is already available but its universal implementation will only follow after it has been made commercially viable.

One major recent development is that computers and data-bases are becoming linked into networks. It is foreseeable that individuals will have access to these networks through their private telephones and use normal television sets as output devices. The main impact of this for IR systems will be that they will have to be simple to communicate with, which means they will have to use ordinary language, and they will have to be competent in their ability to provide relevant information. The VIEWDATA system provided by the British Post Office is a good example of a system that will need to satisfy these demands. 

By extending the user population to include the non-specialist, it is likely that an IR system will be expected to provide not just a citation, but a display of the text, or part of it, and perhaps answer simple questions about the retrieved documents. Even specialists may well desire of an IR system that it do more than just retrieve citations.

To bring all this about the document retrieval system will have to be interfaced and integrated with data retrieval systems, to give access to facts related to those in the documents. An obvious application lies in a chemical or medical retrieval system. Suppose a person has retrieved a set of documents about a specific chemical compound, and that perhaps some spectral data was given. He may like to consult a data retrieval system giving him details about related compounds. Or he may want to go on-line to, say, DENDRAL which will give him a list of possible compounds consistent with the spectral data. Finally, he may wish to do some statistical analysis of the data contained in the documents. For this he will need access to a set of statistical programs.

Another example can be found in the context of computer-aided instruction, where it is clearly a good idea to give a student access to a document retrieval system which will provide him with further reading on a topic of his immediate interest. The main thrust of these examples is that an important consideration in the design of a retrieval system should be the manner in which it can be interfaced with other systems.

Although the networking of medium sized computers has made headline news, and individuals and institutions have been urged to buy into a network as a way of achieving access to a number of computers, it is by no means clear that this will always be the best strategy. Quite recently a revolution has taken place in the mini-computer market. It is now possible to buy a moderately powerful computer for a relatively small outlay. Since information channels are likely to be routed through libraries for some time to come, it is interesting to think about the way in which the cheaper hardware may affect their future role. Libraries have been keen to provide users with access to large data-bases, stored and controlled somewhere else, often situated at a great distance, possibly even in another country. One option libraries have is the one I have just mentioned, that is, they could connect a console into a large network. An alternative, and more flexible approach, would be for them to have a mini-computer maintaining access to a small, recently published chunk of the document collection. They would be able to change it periodically. The mini would be part of the network but the user would have the option of invoking the local or global system. The local system could then be tailored to local needs which would give it an important advantage. Such things as personal files, containing say user profiles could be maintained on the mini. In addition, if the local library's catalogue and subject index were available on-line, it would prove very useful in conjunction with the document retrieval system. A user could quickly check whether the library had copies of the documents retrieved as well as any related books.

Another hardware development likely to influence the development of IR systems is the marketing of cheap micro-processors. Because these cost so little now, many people have been thinking of designing 'intelligent' terminals to IR systems, that is, ones which are able to do some of the processing instead of leaving it all to the main computer. One effect of this may well be that some of the so-called more expensive operations can now be carried out at the terminal, whereas previously they would have been prohibited.

As automation advances, much lip service is paid to the likely benefit to society. It is an unfortunate fact that so much modern technology is established before we can actually assess whether or not we want it. In the case of information retrieval systems, there is still time to predict and investigate their impact. If we think that IR systems will make an important contribution, we ought to be clear about what it is we are going to provide and why it will be an improvement on the conventional methods of retrieving information. 



Early Developments

For approximately 4000 years, man has organized information for later retrieval and usage. A typical example is the table of contents of a book. Since the volume of information eventually grew beyond a few books, it became necessary to build specialized data structures to ensure faster access to the stored information. An old and popular data structure for faster information retrieval is a collection of selected words or concepts with which are associated pointers to the related information (or documents) -- the index. In one form or another, indexes are at the core of every modern information retrieval system. They provide faster access to the data and allow the query processing task to be speeded up.

For centuries, indexes were created manually as categorization hierarchies. In fact, most libraries still use some form of categorical hierarchy to classify their volumes (or documents). Such hierarchies have usually been conceived by human subjects from the library sciences field. More recently, the advent of modern computers has made possible the construction of large indexes automatically. Automatic indexes provide a view of the retrieval problem which is much more related to the system itself than to the user need. In this respect, it is important to distinguish between two different views of the IR problem: a computer-centered one and a human-centered one.

In the computer-centered view, the IR problem consists mainly of building up efficient indexes, processing user queries with high performance, and developing ranking algorithms which improve the 'quality' of the answer set. In the human-centered view, the IR problem consists mainly of studying the behavior of the user, of understanding his main needs, and of determining how such understanding affects the organization and operation of the retrieval system. According to this view, keyword based query processing might be seen as a strategy which is unlikely to yield a good solution to the information retrieval problem in the long run.


Information Retrieval in the Library

Libraries were among the first institutions to adopt IR systems for retrieving information. Usually, systems to be used in libraries were initially developed by academic institutions and later by commercial vendors. In the first generation, such systems consisted basically of an automation of previous technologies (such as card catalogs) and basically allowed searches based on author name and title. In the second generation, increased search functionality was added which allowed searching by subject headings, by keywords, and some more complex query facilities. In the third generation, which is currently being deployed, the focus is on improved graphical interfaces, electronic forms, hypertext features, and open system architectures.


The Web and Digital Libraries

If we consider the search engines on the Web today, we conclude that they continue to use indexes which are very similar to those used by librarians a century ago. What has changed then?

Three dramatic and fundamental changes have occurred due to the advances in modern computer technology and the boom of the Web. First, it became a lot cheaper to have access to various sources of information. This allows reaching a wider audience than ever possible before. Second, the advances in all kinds of digital communication provided greater access to networks. This implies that the information source is available even if distantly located and that the access can be done quickly (frequently, in a few seconds). Third, the freedom to post whatever information someone judges useful has greatly contributed to the popularity of the Web. For the first time in history, many people have free access to a large publishing medium.

Fundamentally, low cost, greater access, and publishing freedom have allowed people to use the Web (and modern digital libraries) as a highly interactive medium. Such interactivity allows people to exchange messages, photos, documents, software, videos, and to 'chat' in a convenient and low cost fashion. Further, people can do it at the time of their preference (for instance, you can buy a book late at night) which further improves the convenience of the service. Thus, high interactivity is the fundamental and current shift in the communication paradigm.

In the future, three main questions need to be addressed. First, despite the high interactivity, people still find it difficult (if not impossible) to retrieve information relevant to their information needs. Thus, in the dynamic world of the Web and of large digital libraries, which techniques will allow retrieval of higher quality? Second, with the ever increasing demand for access, quick response is becoming more and more a pressing factor. Thus, which techniques will yield faster indexes and smaller query response times? Third, the quality of the retrieval task is greatly affected by the user interaction with the system. Thus, how will a better understanding of the user behavior affect the design and deployment of new information retrieval strategies?


Practical Issues

Electronic commerce is a major trend on the Web nowadays and one which has benefited millions of people. In an electronic transaction, the buyer usually has to submit to the vendor some form of credit information which can be used for charging for the product or service. In its most common form, such information consists of a credit card number. However, since transmitting credit card numbers over the Internet is not a safe procedure, such data is usually transmitted over a fax line. This implies that, at least in the beginning, the transaction between a new user and a vendor requires executing an off-line procedure of several steps before the actual transaction can take place. This situation can be improved if the data is encrypted for security. In fact, some institutions and companies already provide some form of encryption or automatic authentication for security reasons.

However, security is not the only concern. Another issue of major interest is privacy. Frequently, people are willing to exchange information as long as it does not become public. The reasons are many but the most common one is to protect oneself against misuse of private information by third parties. Thus, privacy is another issue which affects the deployment of the Web and which has not been properly addressed yet.

Two other very important issues are copyright and patent rights. It is far from clear how the wide spread of data on the Web affects copyright and patent laws in the various countries. This is important because it affects the business of building up and deploying large digital libraries. For instance, is a site which supervises all the information it posts acting as a publisher? And if so, is it responsible for a misuse of the information it posts (even if it is not the source)?

Additionally, other practical issues of interest include scanning, optical character recognition (OCR), and cross-language retrieval (in which the query is in one language but the documents retrieved are in another language). In this book, however, we do not cover practical issues in detail because it is not our main focus.


The Retrieval Process

To describe the retrieval process, we use a simple and generic software architecture as shown in the figure. First of all, before the retrieval process can even be initiated, it is necessary to define the text database. This is usually done by the manager of the database, which specifies the following: (a) the documents to be used, (b) the operations to be performed on the text, and (c) the text model (i.e., the text structure and what elements can be retrieved). The text operations transform the original documents and generate a logical view of them.

Once the logical view of the documents is defined, the database manager (using the DB Manager Module) builds an index of the text. An index is a critical data structure because it allows fast searching over large volumes of data. Different index structures might be used, but the most popular one is the inverted file. The resources (time and storage space) spent on defining the text database and building the index are amortized by querying the retrieval system many times.
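An inverted file of the kind mentioned above maps each term to the documents that contain it, so a query only has to look up its own terms instead of scanning every document. The Python sketch below is a simplified illustration under our own assumptions: the function names, the three sample documents, the whitespace tokenisation and the Boolean AND matching are not taken from the source.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: terms are lower-cased words separated by whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of documents containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        # Keep only the documents that also contain this term.
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical three-document text database.
docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "information overload",
}
index = build_inverted_file(docs)
```

Building `index` is done once, and every later call to `search` reuses it, which is the amortization of the indexing cost that the paragraph describes.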

Given that the document database is indexed, the retrieval process can be initiated. The user first specifies a user need which is then parsed and transformed by the same text operations applied to the text. Then, query operations might be applied before the actual query, which provides a system representation for the user need, is generated. The query is then processed to obtain the retrieved documents. Fast query processing is made possible by the index structure previously built.

Before being sent to the user, the retrieved documents are ranked according to a likelihood of relevance. The user then examines the set of ranked documents in the search for useful information. At this point, he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle. In such a cycle, the system uses the documents selected by the user to change the query formulation. Hopefully, this modified query is a better representation of the real user need.
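One very simple way to picture the feedback cycle is to add to the query the terms that occur most often in the documents the user selected. The source does not say which reformulation method a real system uses, so the Python sketch below is only an illustration: the term-frequency expansion, the function name `expand_query` and the sample texts are our own assumptions.

```python
from collections import Counter


def expand_query(query_terms, selected_documents, extra_terms=2):
    """Add the most frequent new terms from user-selected documents to the query."""
    counts = Counter()
    for text in selected_documents:
        # Count only the words that are not already in the query.
        counts.update(w for w in text.lower().split() if w not in query_terms)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]


# Hypothetical session: the user marked three car-related documents as relevant.
new_query = expand_query(
    ["jaguar"],
    ["jaguar car engine", "jaguar car dealer", "classic car show"],
    extra_terms=1,
)
```

After the feedback step the query contains "car" as well as "jaguar", which is closer to what this user was actually looking for than the original one-word query.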

Consider now the user interfaces available with current information retrieval systems (including Web search engines and Web browsers). We first notice that the user almost never declares his information need. Instead, he is required to provide a direct representation for the query that the system will execute. Since most users have no knowledge of text and query operations, the query they provide is frequently inadequate. Therefore, it is not surprising to observe that poorly formulated queries lead to poor retrieval (as happens so often on the Web).

Now, we are going to show that there are different types of Internet search and retrieval engines.

InfoSeek Search Form

AltaVista Search Form

Lycos Search Form




For many years, man has organized information for later retrieval and usage; the table of contents of a book is one example. But tables of contents are inadequate for our "information society".

As the volume of information grew enormously, data structures became necessary to access specific information. An old and popular data structure is the index. In its traditional form it is not adequate for today's amounts of information, but indexes are still at the core of every modern information retrieval system.

Nowadays, the problem of information overload is very much present, so there is an urgent need to order and clarify that amount of information so that we can make good use of the material we have. This is where information retrieval has a role to play.

In preparing these reports we have learnt that, although important developments are being made in this field, the techniques still need to be improved to make them more effective. The main problem consists of building effective indexes and improving the quality of the answer set.



Information Retrieval, by C. J. van Rijsbergen

Information retrieval definition extracted from Wikipedia


See also:
Geographic Information System
Digital Libraries
Spoken Document Retrieval
Cross-language information retrieval


Major Figures in Information Retrieval: Gerard Salton, W. Bruce Croft, Karen Spärck Jones, C. J. van Rijsbergen

