Computer communication B

Information retrieval

Repetition

Retrieval models

Wildcards

Web information retrieval

Digital libraries

Information retrieval (IR) began as bibliographic retrieval, developed into full-text term retrieval over a dataset, and was finally extended to web information retrieval.

An information retrieval system analyses the contents of the information sources and the users' queries, and matches the two to retrieve the relevant items.

Components:
The document subsystem
The indexing subsystem
The searching subsystem
The matching subsystem

The searching subsystem is one of the fundamental parts of an information retrieval system.

IR searching models

Searching models can be seen as searching strategies:
Boolean search model
Probabilistic retrieval model
Vector processing model

The Boolean search model

IR systems use Boolean logic to let users express their queries.

George Boole devised a system of symbolic logic built on three operators:
The logical sum + (OR): specifies alternatives between (or among) search terms
The logical product × (AND): requires that two concepts occur together
The logical difference – (NOT): excludes terms from the search

The Boolean operators

The logical sum + (OR): house OR castle
The logical product × (AND): house AND castle
The logical difference – (NOT): house NOT castle

Boolean operators can be visualized with so-called Venn diagrams.

[Figure: Venn diagrams for AND, OR and NOT over the sets House and Castle]
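The three operators correspond directly to set operations on the documents that contain each term. The following is a minimal sketch, assuming a tiny invented collection and a simple whitespace tokenizer; the document ids, texts and the helper name `build_index` are illustrative assumptions, not part of any real catalogue.

```python
# Toy collection; ids and texts are invented for illustration only.
documents = {
    1: "a house on the hill",
    2: "a castle with a moat",
    3: "the house beside the castle",
    4: "a cottage in the woods",
}


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index(documents)
house = index.get("house", set())
castle = index.get("castle", set())

house_or_castle = house | castle    # logical sum (OR): union
house_and_castle = house & castle   # logical product (AND): intersection
house_not_castle = house - castle   # logical difference (NOT): set difference

print(house_or_castle, house_and_castle, house_not_castle)
```

Each result is an unordered set, which matches the Venn diagram picture: a document is either inside the region or outside it.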

The Boolean model: pro and contra

It is an easy search model.

Despite its simplicity, users are often unable to use the three Boolean operators effectively, especially for more complicated queries.

The search is sometimes imprecise: it returns too many items if the query is too broad, or too few if the query is too strict (with the risk of missing important items).

Boolean search does not permit ranking, i.e. the retrieved documents are not ordered by importance.

Boolean search: example

Catalogue of the RUG library

There are index terms: Boole indexed as an author is distinct from Boolean or Boole as a title index term.
The three Boolean operators are used.
Wildcards are integrated (see later).

http://opc.ub.rug.nl/IMPLAND=Y/SRT=YOP/LNG=NE/DB=1/

Probabilistic retrieval models

These tackle the last problem outlined for the Boolean model:

Probabilistic models try to rank the retrieved documents in order of decreasing probability of being useful or relevant to the user.

Vector space models

Documents are characterized and evaluated according to their index terms.
Each document is identified with a vector.
The dimensions of the vector are the index terms, so a document can have many dimensions.
The value for an index term is the number of times that term appears in the document (sometimes the value is 0).
A measure of the similarity between two documents is the cosine of the angle between their vectors.
Queries are likewise represented as vectors.
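The points above can be sketched in a few lines. This is a minimal illustration, assuming raw term counts as vector values and whitespace tokenization; the function names `term_vector`, `cosine_similarity` and `rank` are hypothetical helpers chosen for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Score every document against the query, highest similarity first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(text)), doc_id)
              for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)


# Invented example documents.
docs = {"d1": "house house castle", "d2": "castle moat", "d3": "garden"}
for score, doc_id in rank("house", docs):
    print(doc_id, round(score, 3))
```

Because the query is treated as just another vector, the same similarity measure yields a ranked result list, which the Boolean model cannot provide.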

Evaluation of a search

Precision: how many of the retrieved documents are relevant to the search?
Recall: how many of the relevant documents are retrieved by the search?
Fall-out: how many of the irrelevant documents are retrieved by the search?
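Each of the three questions is a ratio of set sizes. The sketch below assumes that the retrieved set, the relevant set and the whole collection are known as sets of document ids, and that the retrieved documents all belong to the collection; the function name `evaluate` and the sample ids are illustrative.

```python
def evaluate(retrieved, relevant, collection):
    """Return (precision, recall, fall_out) for one search."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    hits = retrieved & relevant            # relevant documents that were found
    irrelevant = collection - relevant     # everything that should not be found
    false_alarms = retrieved - relevant    # irrelevant documents that were found

    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fall_out = len(false_alarms) / len(irrelevant) if irrelevant else 0.0
    return precision, recall, fall_out


# Invented example: 10 documents, 4 of them relevant, 3 retrieved.
precision, recall, fall_out = evaluate(
    retrieved={1, 2, 5},
    relevant={1, 2, 3, 4},
    collection=range(1, 11),
)
print(precision, recall, fall_out)
```

In this example two of the three retrieved documents are relevant, two of the four relevant documents are found, and one of the six irrelevant documents is returned.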

Wildcards 1

Wildcards are characters that can stand for any string of characters. In other words, they mark unknown parts of a term.

Wildcards are usually written as an asterisk *, which substitutes zero or more unknown characters.

Example: aphas* → aphasiology, aphasia, aphasic, aphasics, aphasiological, etc.

Wildcards are an advantage for the user of the system, but they are costly for the system itself:
The user does not have to repeat the search with different terms.
But the system needs to interpret the term and test (search) all the possible terms stemming from it.
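The extra work on the system side can be sketched as an expansion step: the wildcard term is turned into the list of index terms that share its prefix, and each of those is then searched. This is a minimal sketch that assumes a sorted list of index terms and supports only a single trailing asterisk; the vocabulary and the function name `expand_trailing_wildcard` are invented for illustration.

```python
from bisect import bisect_left


def expand_trailing_wildcard(pattern, sorted_vocabulary):
    """Return all index terms matching a pattern such as 'aphas*'."""
    if not pattern.endswith("*") or "*" in pattern[:-1]:
        raise ValueError("this sketch supports exactly one trailing *")
    prefix = pattern[:-1]
    # In a sorted list, all terms sharing a prefix are adjacent.
    start = bisect_left(sorted_vocabulary, prefix)
    matches = []
    for term in sorted_vocabulary[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


vocabulary = sorted(
    ["aphasia", "aphasic", "aphasics", "aphasiology", "aphid", "apple", "sun"]
)
print(expand_trailing_wildcard("aphas*", vocabulary))
```

The expanded terms would then be combined with OR, so one wildcard query costs as many lookups as there are matching terms.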

Wildcards 2

Wildcard characters usually substitute a group of letters that cannot stand alone as a word but can form a word when joined to a specific root.

sun* → wildcard = nothing: sun; wildcard = -s: suns; wildcard = -set: sunset …

Searching with a wildcard at the beginning of a word or within a word is harder (the set of resulting possibilities is larger).

Wildcards 3: Permuterm index
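The slide names the permuterm index without further detail. As commonly described, the idea is to append an end marker to every term and index all rotations of the result, so that a wildcard anywhere in the word can be rotated to the end and answered as a prefix search. The sketch below assumes `$` as the end marker, supports exactly one asterisk per query, and uses a plain dictionary scan where a real system would use a sorted structure; all names and the vocabulary are illustrative.

```python
def rotations(term):
    """All rotations of the term with the end marker '$' appended."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(vocabulary):
    """Map every rotation of every term back to the original term."""
    index = {}
    for term in vocabulary:
        for rotation in rotations(term):
            index[rotation] = term
    return index


def wildcard_lookup(index, pattern):
    """Answer a query with exactly one '*', e.g. 'sun*', '*set' or 's*n'."""
    if pattern.count("*") != 1:
        raise ValueError("this sketch supports exactly one *")
    head, tail = pattern.split("*")
    # Rotate the query so the wildcard falls at the end: head*tail -> tail$head*
    prefix = tail + "$" + head
    return sorted({term for rotation, term in index.items()
                   if rotation.startswith(prefix)})


permuterm = build_permuterm(["sun", "suns", "sunset", "unsung", "aphasia"])
print(wildcard_lookup(permuterm, "*set"))
print(wildcard_lookup(permuterm, "s*n"))
```

The price is a larger index, since a term of n letters contributes n + 1 rotations.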


IR was created for bibliography retrieval. Nevertheless, a great deal of information now has to be accessed on the web, and IR addresses this kind of search as well.

Traditional and web IR differ in a number of characteristics.

Web information retrieval

1. The web is far more distributed and larger than the traditional set of information sources
2. The web is growing steadily
3. The web has different levels of depth for a search
4. The web has different types and formats of documents
5. The quality of documents on the web varies
6. The information on the web changes rapidly
7. Distributed users

Web information retrieval

1. The web is far larger than the traditional set of information sources
   1. Not only is the amount of information and documents larger, but the retrieval system also has to deal with sets of standards (software etc.) that differ from those of traditional IR systems. In fact, the web does not have a “set of standards”.
   2. As a consequence, the search is more difficult.


Web information retrieval

1. The web is growing steadily
   1. The amount of information on the web is growing (and will probably keep growing).
   2. Conventional text retrieval systems should be tested and adapted to work with larger datasets.

Web information retrieval

1. The web has different levels of depth for a search
   1. The web has two types of access: the free one, and the “deep” one, accessible only with passwords or special programs. Web IR can access only the surface information.
2. The web has different types and formats of documents
   1. Traditional IR works with texts. On the web there are several types of documents (images, sound files, etc.). Both indexing and information retrieval are therefore more complex.

Web information retrieval

1. The quality of documents on the web varies
   1. IR systems are not designed to check the quality of the information resources, so there is no control over quality.
2. The information on the web changes rapidly
   1. This differs from traditional text retrieval systems, which are quite static compared with the rapid changes of the web.
   2. Keeping track of the rapid changes is a challenge.
   3. The sources often move, and it is difficult to track them down.


Web information retrieval

1. Distributed users
   1. The builder of a conventional IR system knows approximately who the target users of the system are. The builder of a web information retrieval system does not have any “typical” user.

Search engines

Search engines are a kind of IR system.
They allow a search to be run using search terms, keywords or key sentences.
Most search engines allow the use of Boolean operators (AND, OR, NOT).
Special programs called “spiders” regularly collect information on web pages.
The search engine finds documents that match the search.
The search engine does not search the web for every query; it searches a database built by the spider programs. This database is regularly updated.

There are many types of search engines, including specialized ones (Google, Altavista news, Google Images, etc.).
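The division of labour between the spider and the search step can be sketched as follows. This is a toy illustration under strong assumptions: the "web" is an in-memory dictionary with invented example.org URLs, no network access takes place, and the names `FAKE_WEB`, `crawl`, `build_database` and `search` are hypothetical.

```python
# Stand-in for the web: URL -> (page text, outgoing links). All invented.
FAKE_WEB = {
    "http://example.org/a": ("house and castle tours", ["http://example.org/b"]),
    "http://example.org/b": ("castle history",
                             ["http://example.org/a", "http://example.org/c"]),
    "http://example.org/c": ("house prices", []),
}


def crawl(seed, web):
    """The 'spider': follow links from the seed and collect page texts."""
    seen, queue, pages = set(), [seed], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in web:
            continue
        seen.add(url)
        text, links = web[url]
        pages[url] = text
        queue.extend(links)
    return pages


def build_database(pages):
    """Turn the collected pages into a term -> set of URLs database."""
    database = {}
    for url, text in pages.items():
        for term in text.lower().split():
            database.setdefault(term, set()).add(url)
    return database


def search(database, *terms):
    """AND query answered from the database, without touching the web."""
    results = [database.get(term.lower(), set()) for term in terms]
    return set.intersection(*results) if results else set()


database = build_database(crawl("http://example.org/a", FAKE_WEB))
print(search(database, "house", "castle"))
```

Rerunning `crawl` and `build_database` on a schedule corresponds to the regular update of the database; a page that changes between two runs is searched in its old form until the next run.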

Digital libraries

A digital library: “must accomplish all essential services of traditional libraries and also exploit the well-known advantage of a digital storage”

Digital libraries provide access to different information sources, in various forms (text, images, audio files, etc.).

Digital libraries give access to a variety of information through different sources:
The web
E-journals
Online databases
Remote digital libraries

Every digital library has a library-user interface.

[Figure: The digital library. The USER reaches e-journals, the WWW, online databases and remote digital libraries through the digital library interface.]

Digital libraries

Some digital libraries:
Alexandria Digital Library Project
http://www.cdlib.org/
http://www.gutemberg.org
http://www.theeuropeanlibrary.org

Digital libraries

Digital libraries use features of IR systems.
Users can browse or search the collections.
Some digital libraries allow searching across a network of digital libraries.
Boolean search is the most widely used in digital libraries.
The search is by keywords or sentences, with the use of wildcards.

Introduction to modern information retrieval (Chowdhury, G.G.)

Date post:	30-Jan-2016
Upload:	julio