Information Retrieval: Introduction

Norbert Fuhr

April 28, 2014

Information Retrieval Applications

Outline: Application Examples; Facets of Search

Application Examples

Web Search
Product Search in Online Shops
Intranet Search
Searching in Digital Libraries
Multimedia Search

Facets of Search

Language: monolingual, cross-lingual, multilingual (example: cross-lingual search in Google)
Structure: atomic objects, fields, tree structure (e.g. XML), graph (e.g. the Web) (example: XML retrieval)
Media: texts, facts, images, audio, video, 3D, ... (example: similarity search for images)
Objects: products, people, companies, ... (example: people search in 123people)
Static/dynamic content: databases vs. streams (example: Twitter search)

Definition of Information Retrieval

Information Retrieval (IR) is about vagueness and uncertainty in information systems.

Vagueness: the user cannot give a precise specification of her information need
  vague query conditions
  iterative query formulation

Uncertainty: the system has uncertain knowledge about the (content of the) objects in the database
  uncertain representation (leads to wrong answers)
  incomplete representation (leads to missing answers)

IR = Content-Oriented Search (Narrow Definition of IR)

Searching at different abstraction levels:

Syntax: document as a sequence of symbols
Semantics: meaning of a text/media object
Pragmatics: usefulness for solving the user's current problem

Syntax, Semantics and Pragmatics

Example document: "Welcome at the Information Engineering Group. Our current work focuses on information retrieval, digital libraries and web-based information systems, with special emphasis on user-oriented research."

Syntax: query 'digital library' gives no match (the text contains only the plural 'digital libraries')
Semantics: query 'research area' gives a match (the text describes research areas without using these words)
Pragmatics: query 'potential project partner for medical information project' gives an open result (?), since the answer depends on the user's actual problem
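
The syntactic level of the example above can be illustrated with a plain substring test. This is only a minimal sketch: the function name `syntactic_match` is hypothetical, and case-insensitive substring matching is an assumption about what "syntax" matching means here.

```python
# Example document text, copied from the slide.
text = ("Welcome at the Information Engineering Group. Our current work "
        "focuses on information retrieval, digital libraries and web-based "
        "information systems, with special emphasis on user-oriented research.")


def syntactic_match(query: str, document: str) -> bool:
    """Case-insensitive exact substring match on the symbol sequence."""
    return query.lower() in document.lower()


print(syntactic_match("digital library", text))    # False: only the plural occurs
print(syntactic_match("digital libraries", text))  # True
print(syntactic_match("research area", text))      # False: a match needs semantics
```

A purely syntactic matcher misses both the inflected form and the semantically matching query, which is the gap the later slides on stemming and semantic descriptions address.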

Retrieval Quality

The Concept of Relevance

in contrast to database systems, an IR system cannot decide whether an answer is correct or not
the user has an information need
relevance: relationship between a document and the information need
relevance is judged by the user

Facets of Relevance

Situational relevance: related to the perceived task
Pertinence: related to the information need
Intellectual topicality: as judged by a human observer
Algorithmic relevance: system score comparing the request/query with the object

In the following: relevance is understood as pertinence/topicality, without further distinction.

Retrieval Metrics

RET: set of retrieved documents
REL: set of relevant documents in the database

Precision p: proportion of relevant documents among the retrieved ones
Recall r: proportion of retrieved documents among the relevant ones

$$p = \frac{|REL \cap RET|}{|RET|}, \qquad r = \frac{|REL \cap RET|}{|REL|}$$

Example: there are 20 relevant documents for the current query. The system returns 10 documents, of which 8 are relevant.

Precision: p = 8/10 = 0.8
Recall: r = 8/20 = 0.4
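
The two metrics can be computed directly from the sets RET and REL. The following is a minimal sketch; the function name `precision_recall` and the document IDs are made up for illustration, and returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved: set, relevant: set):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    hits = len(retrieved & relevant)  # |REL ∩ RET|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Reproduce the example: 20 relevant documents, 10 retrieved, 8 of them relevant.
relevant = set(range(20))                 # hypothetical IDs 0..19
retrieved = set(range(8)) | {100, 101}    # 8 relevant + 2 nonrelevant IDs
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.8 0.4
```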

Representations

Outline: Semantic Descriptions; Free Text Search; Objects, Representations, and Descriptions

Free text search: search in the document text
Semantic approach: assign semantic descriptions

Semantic Descriptions

Classification schemes: e.g. hierarchical classification, as in libraries or product catalogs
Tagging: users assign tags
Ontologies: e.g. OWL (Web Ontology Language)

Classification/ontology example: DMOZ

Free Text Search: Problems

Inflection: computer – computers, fly – flies, go – goes – going
Derivation: compute – computer – computerization – computation
Synonyms: mobile – smartphone, table – bench – board – counter
Polysemes: bank, head
Compounds: steamboat, testbed
Phrases: information retrieval – retrieval of information

Free Text Search: Approaches

Inflection, derivation: stemming algorithms (computer, computation, computerize → comput)
Synonyms: synonym lexicons
Compounds: splitting algorithms
Phrases: adjacency search

Most systems implement only stemming and adjacency search!
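
To make the stemming idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any other published algorithm; the suffix list and the function name `toy_stem` are assumptions chosen only so that the "comput" example above works.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ["erization", "ation", "erize", "ers", "er", "ing", "es", "s"]


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["computer", "computers", "computation", "computerize", "computerization"]:
    print(w, "->", toy_stem(w))  # every word maps to "comput"
```

Irregular cases such as fly/flies are not conflated by this sketch, which is one reason real stemmers use more elaborate rule sets.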

Objects, Representations, and Descriptions

[Figure: a document object]

Example: document text, representation, description

Text:
Research in the probabilistic theory of information retrieval involves the construction of mathematical models. In this kind of theory construction the assumptions laid down ...

Stopword removal and stemming:
research probabil theory informat retriev involv construct mathemat model kind theory construct assum lay down

Representation (bag of words):
(research,1), (probabil,1), (theory,2), (informat,1), (retriev,1), (involv,1), (construct,2), (mathemat,1), (model,1), (kind,1), (assum,1), (lay,1), (down,1)

Description:
(research,0.5), (probabil,0.5), (theory,1.0), (informat,0.5), (retriev,0.5), (involv,0.5), (construct,1.0), (mathemat,0.5), (model,0.5), (kind,0.5), (assum,0.5), (lay,0.5), (down,0.5)
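
The step from the stemmed token list to the bag of words and the description can be sketched as follows. The weights in the example above are consistent with dividing each term frequency by the maximum term frequency in the document; treating that as the weighting rule is an assumption made for this sketch, since the slide does not state the formula.

```python
from collections import Counter

# Stemmed tokens after stopword removal, as given in the example.
stemmed = ("research probabil theory informat retriev involv construct "
           "mathemat model kind theory construct assum lay down").split()

# Representation: bag of words (term -> within-document frequency).
bag = Counter(stemmed)

# Description: assumed weighting tf / max tf, which reproduces 0.5 and 1.0.
max_tf = max(bag.values())
description = {term: tf / max_tf for term, tf in bag.items()}

print(bag["theory"], bag["research"])                  # 2 1
print(description["theory"], description["research"])  # 1.0 0.5
```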

Conceptual Model

[Figure: conceptual model linking queries and documents to their representations and descriptions through the mappings $\alpha$ and $\beta$, with relevance judgements and the retrieval function $\varrho$]

$\underline{q}_k \in \underline{Q}$: query
$q_k \in Q$: query representation
$q_k^D \in Q^D$: query description
$\underline{d}_m \in \underline{D}$: document
$d_m \in D$: document representation
$d_m^D \in D^D$: document description
$R$: relevance scale
$\varrho$: retrieval function

Probabilistic Models

Outline: Probabilistic Event Space; Probability Ranking Principle; Binary Independence Retrieval Model; BM25 Model; Learning to Rank

Probabilistic Event Space [Fuhr 92]

[Figure: queries and documents together with their representations]

$\underline{Q}$: queries; $\underline{q}_k$: query; $q_k$: query representation
$\underline{D}$: documents; $\underline{d}_m$: document; $d_m$: document representation

[Figure: the event space as a grid of query-document pairs, each judged relevant (R) or nonrelevant (N); for the highlighted pair of representations, $P(R \mid q_k, d_m) = 0.5$]

Event Space

Event space: $Q \times D$
a single element is a query-document pair $(q_k, d_m)$
all elements are equiprobable

relevance judgement: $(q_k, d_m) \in R$
relevance judgements for different documents w.r.t. the same query are independent of each other

Probability of relevance $P(rel \mid q_k, d_m)$: probability that an element of $(q_k, d_m)$, i.e. a query-document pair with these representations, is relevant

regard collections as samples of possibly infinite sets

poor representation of retrieval objects: a single representation may stand for a number of different objects
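
Because a single pair of representations may stand for several query-document pairs, the probability of relevance can be read as a relative frequency over those pairs. The sketch below illustrates this reading; the judgement list is hypothetical and the function name `probability_of_relevance` is not from the slides.

```python
def probability_of_relevance(judgements):
    """Fraction of relevant ('R') pairs among all pairs sharing one representation."""
    if not judgements:
        raise ValueError("need at least one judged query-document pair")
    return sum(1 for j in judgements if j == "R") / len(judgements)


# Hypothetical judgements for four pairs with the same representations (q_k, d_m).
judgements = ["R", "N", "R", "N"]
print(probability_of_relevance(judgements))  # 0.5
```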

Probability Ranking Principle

The Probability Ranking Principle (PRP) defines optimum retrieval for probabilistic models: rank documents according to decreasing values of the probability of relevance $P(rel \mid q, d)$.

Advantage: the PRP yields
  optimum retrieval quality
  minimum retrieval costs
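
Given estimates of the probability of relevance, the PRP amounts to a sort in decreasing order. A minimal sketch, with made-up document IDs and probability values:

```python
def rank_by_prp(prob_rel: dict) -> list:
    """Return document IDs ordered by decreasing estimated P(rel | q, d)."""
    return sorted(prob_rel, key=lambda doc_id: prob_rel[doc_id], reverse=True)


# Hypothetical probability estimates for one query.
estimates = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
print(rank_by_prp(estimates))  # ['d2', 'd3', 'd1']
```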

Binary Independence Retrieval Model

Represent queries and documents as sets of terms.

$T = \{t_1, \ldots, t_n\}$: set of terms in the collection
$q \in Q$: query representation
$d_m \in D$: document representation
$q^T$: set of query terms
$d_m^T$: set of document terms

Simple retrieval function: coordination level match

$$\varrho_{COORD}(q, d_m) = |q^T \cap d_m^T|$$

Binary independence retrieval (BIR) model: assign weights to query terms

$$\varrho_{BIR}(q, d_m) = \sum_{t_i \in q^T \cap d_m^T} c_i$$
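
The coordination level match simply counts the query terms that also occur in the document. A minimal sketch with hypothetical term sets:

```python
def coord(query_terms: set, doc_terms: set) -> int:
    """Coordination level match: number of terms shared by query and document."""
    return len(query_terms & doc_terms)


# Hypothetical term sets for one query and one document.
q_terms = {"information", "retrieval", "model"}
d_terms = {"probabilistic", "retrieval", "model"}
print(coord(q_terms, d_terms))  # 2
```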

$$\varrho_{BIR}(q, d_m) = \sum_{t_i \in q^T \cap d_m^T} c_i, \qquad c_i = \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$

$p_i = P(t_i \mid rel)$: probability that $t_i$ occurs in an arbitrary relevant document
$s_i = P(t_i \mid \overline{rel})$: probability that $t_i$ occurs in an arbitrary nonrelevant document
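
The BIR retrieval function can be sketched as a sum of term weights over the terms shared by query and document. The probabilities below are invented for illustration, the natural logarithm is an assumption (the slides do not fix the base), and the function names are hypothetical.

```python
import math


def bir_weight(p_i: float, s_i: float) -> float:
    """c_i = log( p_i (1 - s_i) / (s_i (1 - p_i)) ); requires 0 < p_i, s_i < 1."""
    return math.log(p_i * (1 - s_i) / (s_i * (1 - p_i)))


def bir_score(query_terms: set, doc_terms: set, weights: dict) -> float:
    """Sum the weights c_i of all query terms that occur in the document."""
    return sum(weights[t] for t in query_terms & doc_terms)


# Hypothetical parameters: "retrieval" is rare in nonrelevant documents,
# "model" is equally likely in relevant and nonrelevant ones (weight 0).
weights = {"retrieval": bir_weight(0.5, 0.1), "model": bir_weight(0.5, 0.5)}
score = bir_score({"retrieval", "model"}, {"retrieval", "model", "theory"}, weights)
print(round(score, 3))  # 2.197, i.e. log 9
```

A term that is equally likely in relevant and nonrelevant documents contributes nothing, which matches the form of $c_i$.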

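Parameter Estimation: Relevance Feedback

Contingency table ($N$: number of documents, $R$: number of relevant documents, $n_i$: number of documents containing $t_i$, $r_i$: number of relevant documents containing $t_i$):

                 t_i occurs    t_i does not occur    total
relevant         r_i           R - r_i               R
nonrelevant      n_i - r_i     N - n_i - R + r_i     N - R
total            n_i           N - n_i               N

$p_i = P(t_i \mid rel)$: probability that $t_i$ occurs in an arbitrary relevant document

$$p_i \approx \frac{r_i}{R}$$

$s_i = P(t_i \mid \overline{rel})$: probability that $t_i$ occurs in an arbitrary nonrelevant document

$$s_i \approx \frac{n_i - r_i}{N - R} \approx \frac{n_i}{N}$$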
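
The estimates follow directly from the four counts in the contingency table. The sketch below uses invented counts; the function name is hypothetical, and no smoothing is applied, so an estimate of exactly 0 or 1 would make the BIR weight undefined.

```python
def estimate_from_feedback(N: int, R: int, n_i: int, r_i: int):
    """Return (p_i, s_i) estimated from relevance feedback counts."""
    p_i = r_i / R                # share of relevant documents containing t_i
    s_i = (n_i - r_i) / (N - R)  # share of nonrelevant documents containing t_i
    return p_i, s_i


# Hypothetical counts: 1000 documents, 10 judged relevant,
# term t_i occurs in 50 documents, 8 of which are relevant.
p_i, s_i = estimate_from_feedback(N=1000, R=10, n_i=50, r_i=8)
print(p_i)            # 0.8
print(round(s_i, 4))  # 0.0424, close to n_i / N = 0.05
```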

Parameter Estimation without Relevance Feedback

$N$: number of documents in the collection
$n_i$: number of documents containing term $t_i$

$s_i = P(t_i \mid \overline{rel})$: probability that $t_i$ occurs in an arbitrary nonrelevant document

$$s_i \approx \frac{n_i}{N}$$

$p_i = P(t_i \mid rel)$: probability that $t_i$ occurs in an arbitrary relevant document; assume a constant value $p_i = p = 0.5$

$$c_i = \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)} = \log \frac{p}{1 - p} + \log \frac{1 - s_i}{s_i} = 0 + \log \frac{N - n_i}{n_i} \approx \log \frac{N}{n_i}$$

IDF (inverse document frequency) weight: $\log (N / n_i)$
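
The quality of the approximation $\log((N - n_i)/n_i) \approx \log(N/n_i)$ can be checked numerically. In this sketch the collection sizes are invented and the natural logarithm is an assumption; the function names are hypothetical.

```python
import math


def bir_weight_no_feedback(N: int, n_i: int) -> float:
    """c_i for p = 0.5 and s_i = n_i / N, i.e. log((N - n_i) / n_i)."""
    return math.log((N - n_i) / n_i)


def idf(N: int, n_i: int) -> float:
    """Inverse document frequency weight log(N / n_i)."""
    return math.log(N / n_i)


# Hypothetical collection: one million documents, term occurs in 1000 of them.
N, n_i = 1_000_000, 1_000
print(round(bir_weight_no_feedback(N, n_i), 4))  # 6.9068
print(round(idf(N, n_i), 4))                     # 6.9078
```

The two values agree closely as long as the term occurs in only a small fraction of the collection.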

BM25 [Robertson et al 95]

heuristic extension of the BIR model from binary to weighted indexing (consideration of the within-document frequency tf)

[Figure: plot of the BM25 weight $u_{mi}$ over the within-document frequency tf; the vertical axis is marked at 1]

tf*idf Weighting

originally developed for the (non-probabilistic) vector space model

set of heuristics: the weight of a term should be higher ...
1. the less frequently the term occurs in the collection (inverse document frequency, idf; see above)
2. the more often the term occurs in the document (tf)
3. the shorter the document

From Binary to Weighted Indexing

$l_m$: document length (number of tokens in $d_m$)
$al$: average document length in $D$
$tf_{mi}$: occurrence frequency of $t_i$ in $d_m$
$b$: weight of length normalization, $0 \le b \le 1$
$k$: weight of occurrence frequency

Length normalization:

$$B = (1 - b) + b \, \frac{l_m}{al}$$

Normalized within-document frequency:

$$ntf_{mi} = \frac{tf_{mi}}{B}$$

BM25 weight:

$$u_{mi} = \frac{ntf_{mi}}{k + ntf_{mi}} = \frac{tf_{mi}}{k \left( (1 - b) + b \, \frac{l_m}{al} \right) + tf_{mi}}$$
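
The BM25 weight can be sketched in a few lines. The default values k = 1.2 and b = 0.75 are commonly used settings and are an assumption here; the slides give only the range of b and no value for k. The function name `bm25_weight` is hypothetical.

```python
def bm25_weight(tf: float, doc_len: float, avg_len: float,
                k: float = 1.2, b: float = 0.75) -> float:
    """u_mi = ntf / (k + ntf) with ntf = tf / B and B = (1 - b) + b * l_m / al."""
    B = (1 - b) + b * doc_len / avg_len  # length normalization
    ntf = tf / B                         # normalized within-document frequency
    return ntf / (k + ntf)


# A document of average length with tf = 1 and k = 1 gets weight 0.5.
print(bm25_weight(tf=1, doc_len=100, avg_len=100, k=1.0))  # 0.5

# The weight grows with tf but stays below 1, and shorter documents score higher.
print(bm25_weight(1, 100, 100) < bm25_weight(5, 100, 100) < 1.0)  # True
print(bm25_weight(2, 50, 100) > bm25_weight(2, 200, 100))         # True
```

The three printed checks correspond to the heuristics listed for tf*idf weighting: more occurrences and shorter documents both raise the weight.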

Learning to Rank

Parameter learning in IR [Fuhr 92]

[Figure: learning approaches in IR]

Learning to Rank for Web Searches

Interactive Retrieval

Outline: Search Models; Anomalous State of Knowledge; Ingwersen's Cognitive Model

Search Models

[Figure: classical search process model]

Empirical Studies

an information search consists of a sequence of connected but different searches
a search result may trigger new searches
only the task context remains the same
the main goal of a search is accumulated learning and the collection of new information while searching

Search Models: Berrypicking Model [Bates 90]

continuous change of the information need and of the queries during the search
the information need cannot be satisfied by a single result set
instead: a sequence of selections and the collection of pieces of information during the search

Anomalous State of Knowledge (ASK) (1) [Belkin 80]

Classic IR systems: "best match" principle

the system returns those documents that best fit the representation of the information need (e.g. the query statement)
this is only feasible if the user can give a precise specification of her information need (as e.g. in a DBMS)

Anomalous State of Knowledge (ASK) (2)

ASK hypothesis

an information need results from the user's anomalous state of knowledge (ASK)
the user is unable to specify precisely the information need for removing the ASK
instead: describe the ASK
resolving this anomaly requires capturing cognitive and situation-specific aspects

Ingwersen's Cognitive Model

[Figure: Ingwersen's cognitive model. Components: organizational, social and cultural context; cognitive actor(s) (team); interface; information objects; IT (engines, logics, algorithms). The arrows denote cognitive transformations and influence, and interactive communication of cognitive structures.]