Hadi Mohammadzadeh Information Retrieval (IR) 50 Pages
By: Hadi Mohammadzadeh
Institute of Applied Information Processing
University of Ulm – 3 Nov. 2009
Seminar on
Information Retrieval (IR)
Information Retrieval Definition
• Information Retrieval (IR) is:
1. finding material (usually documents)
2. of an unstructured nature (usually text)
3. that satisfies an information need (query)
4. from within large collections (usually stored on computers).
Basic assumptions of Information Retrieval
• Collection: Fixed set of documents
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
Search Methods
for
Finding Documents
Searching Methods
• Grep method
• Term-document incidence matrix (binary retrieval)
• Inverted index
• Inverted index with skip pointers/skip lists
• Positional postings (for phrase queries)
Term-document incidence
            Antony and  Julius   The
            Cleopatra   Caesar   Tempest  Hamlet  Othello  Macbeth
Antony          1          1        0        0       0        1
Brutus          1          1        0        1       0        0
Caesar          1          1        0        1       1        1
Calpurnia       0          1        0        0       0        0
Cleopatra       1          0        0        0       0        0
mercy           1          0        1        1       1        1
worser          1          0        1        1       1        0

1 if the play contains the word, 0 otherwise
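The matrix above can be queried with bitwise operations. A minimal sketch (the `answer` helper and its argument names are ours, not from the slides): take the 0/1 row for each query term, complement the rows of negated terms, and AND everything together.

```python
# Toy incidence matrix from the slide: one 0/1 row per term, one column per play.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
incidence = {
    "Antony":    [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}

def answer(query_terms, negated):
    """AND the rows together; complement the rows of negated terms."""
    result = [1] * len(plays)
    for term in query_terms:
        row = incidence[term]
        if term in negated:
            row = [1 - b for b in row]     # NOT term
        result = [a & b for a, b in zip(result, row)]
    return [p for p, b in zip(plays, result) if b]

print(answer(["Brutus", "Caesar", "Calpurnia"], negated={"Calpurnia"}))
# ['Antony and Cleopatra', 'Hamlet']
```

The classic query "Brutus AND Caesar AND NOT Calpurnia" thus returns Antony and Cleopatra and Hamlet.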
Inverted index
• For each term T, we must store a list of all documents that contain T.
• Do we use an array or a list for this?
Brutus    → 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar    → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia → 13 → 16
What happens if the word Caesar is added to document 14?
Sec. 1.2
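A sketch of these postings lists and the standard linear-time merge intersection (the function name is ours; the docIDs are the slide's):

```python
# Inverted index: each term maps to a sorted list of docIDs (its postings).
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16],
}

def intersect(p1, p2):
    """Walk both sorted lists in step: O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect(index["Brutus"], index["Caesar"]))  # [2, 8]
```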
Inverted index
• Linked lists generally preferred to arrays
  – Dynamic space allocation
  – Insertion of terms into documents easy
  – Space overhead of pointers

Dictionary     Postings lists (each element a posting)
Brutus    → 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar    → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia → 13 → 16
Sec. 1.2
Augment postings with skip pointers (at indexing time)
• Why? To skip postings that will not figure in the search results.
• Where do we place skip pointers?

2 → 4 → 8 → 41 → 48 → 64 → 128    (skip pointers: 2→41, 41→128)
1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 (skip pointers: 1→11, 11→31)
Where do we place skips?
• Tradeoff:
  – More skips → shorter skip spans → more likely to skip. But lots of comparisons to skip pointers.
  – Fewer skips → few pointer comparisons, but then long skip spans → few successful skips.
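The tradeoff above is commonly resolved by placing √L evenly spaced skips on a postings list of length L. A toy sketch of an intersection that exploits such skips (our own illustration, not the lecture's code):

```python
import math

def intersect_with_skips(p1, p2):
    """Merge intersection with skip pointers spaced every sqrt(len) entries."""
    s1 = int(math.sqrt(len(p1))) or 1   # skip span for p1
    s2 = int(math.sqrt(len(p2))) or 1   # skip span for p2
    ans, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            ans.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # at a skip point whose target does not overshoot p2[j]? take the skip
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return ans

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```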
Positional index example
<be: 993427;
  1: 7, 18, 33, 72, 86, 231;
  2: 3, 149;
  4: 17, 191, 291, 430, 434;
  5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
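A small positional-intersect sketch for a two-word phrase: find docs where the second term occurs one position after the first. The "be" positions come from the slide; the "to" positions are invented purely for illustration.

```python
pos_index = {
    "to": {1: [5], 4: [16, 190, 429, 433], 5: [362, 366]},   # hypothetical
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},      # from the slide
}

def phrase_docs(t1, t2, index):
    """Docs where some position of t1 is immediately followed by t2."""
    hits = []
    for doc in index[t1].keys() & index[t2].keys():
        positions2 = set(index[t2][doc])
        if any(p + 1 in positions2 for p in index[t1][doc]):
            hits.append(doc)
    return sorted(hits)

print(phrase_docs("to", "be", pos_index))  # [4, 5]
```

With these positions, only docs 4 and 5 can contain the phrase "to be".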
Steps of Inverted index construction
Documents to be indexed:   Friends, Romans, countrymen.
        ↓  Tokenizer
Token stream:              Friends  Romans  Countrymen
        ↓  Linguistic modules
Modified tokens:           friend  roman  countryman
        ↓  Indexer
Inverted index:
  friend     → 2 → 4
  roman      → 1 → 2
  countryman → 13 → 16
Sec. 1.2
Parts of an Inverted Index
• Dictionary
  – Commonly kept in memory
• Postings lists
  – Commonly kept on disk
Inverted index construction
Preprocessing to form the term vocabulary
• Tokenization (problem cases: hyphens, apostrophes, compounds, Chinese, numbers)
• Dropping stop words
  – But you need them for phrase queries, various song titles, relational queries
• Normalization (term equivalence classing): numbers, case folding (reduce all letters to lower case)
• Stemming (Porter’s algorithm): reduce terms to their “roots”
• Lemmatization: reduce variant forms to base form
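The steps above can be sketched as a toy pipeline. The stop-word list and the crude suffix-stripper are stand-ins of our own (a real system would use Porter's algorithm, which is far more careful):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of"}  # tiny illustrative list

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)               # tokenization
    tokens = [t.lower() for t in tokens]                  # case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop stop words
    stemmed = []
    for t in tokens:                                      # toy "stemming"
        for suffix in ("ing", "ies", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("Friends, Romans and countrymen"))
# ['friend', 'roman', 'countrymen']
```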
Inverted index construction
Index construction
• Blocked sort-based indexing (BSBI)
  – Accumulate postings for each block, sort, and write to disk
  – Then merge the blocks (external sorting) into one long sorted order
• Distributed indexing using MapReduce
  – Break indexing into two sets of parallel tasks: parsers and inverters
  – Break the input document corpus into splits
  – Parsers: the master assigns a split to an idle parser machine; the parser reads one document at a time and emits (term, doc) pairs, writing them into j partitions; each partition covers a range of terms’ first letters
  – Inverters: an inverter collects all (term, doc) pairs for one term partition, sorts them, and writes the postings lists
• Dynamic indexing
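A miniature BSBI sketch under simplifying assumptions (sorted blocks live in memory rather than on disk, and `heapq.merge` stands in for the external k-way merge of on-disk runs):

```python
import heapq

def bsbi(doc_stream, block_size):
    """doc_stream: iterable of (doc_id, text). Returns term -> sorted docIDs."""
    runs, block = [], []
    for doc_id, text in doc_stream:
        for term in text.split():
            block.append((term, doc_id))
            if len(block) == block_size:       # block full: sort and "flush"
                runs.append(sorted(block))
                block = []
    if block:
        runs.append(sorted(block))
    index = {}
    for term, doc_id in heapq.merge(*runs):    # external-sort-style merge
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:   # drop duplicate pairs
            postings.append(doc_id)
    return index

docs = [(1, "new home sales"), (2, "home sales rise")]
print(bsbi(docs, block_size=3))
# {'home': [1, 2], 'new': [1], 'rise': [2], 'sales': [1, 2]}
```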
Data flow
[Figure: data flow; the master assigns splits to parsers (map phase); each parser writes segment files partitioned a-f, g-p, q-z; each inverter (reduce phase) collects one partition and writes its postings.]
Search structures for Dictionary
• A naïve dictionary
• Hash tables
• Trees: binary tree, B-tree
Index compression
Dictionary compression for Boolean indexes
• Array of fixed-width entries (wasteful)
• Dictionary as a string
• Blocking
• Front coding

Postings compression
• Gap encoding using prefix-unique codes
  – Variable-byte codes
  – Gamma codes (seldom used in practice)
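Gap encoding plus variable-byte compression can be sketched as follows, using the textbook example docIDs 824, 829, 215406 (gaps 824, 5, 214577). Each byte carries 7 payload bits, with the high bit set on the last byte of a number:

```python
def vb_encode_number(n):
    """Variable-byte code for one gap."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128          # continuation bit marks the final byte
    return out

def vb_encode_gaps(postings):
    encoded, prev = [], 0
    for doc_id in postings:             # store gaps, not raw docIDs
        encoded.extend(vb_encode_number(doc_id - prev))
        prev = doc_id
    return encoded

def vb_decode(stream):
    docs, n, prev = [], 0, 0
    for b in stream:
        if b < 128:
            n = 128 * n + b
        else:                           # last byte of this gap
            n = 128 * n + (b - 128)
            prev += n
            docs.append(prev)
            n = 0
    return docs

encoded = vb_encode_gaps([824, 829, 215406])
print(len(encoded), vb_decode(encoded))  # 6 [824, 829, 215406]
```

The three docIDs compress to 6 bytes (2 + 1 + 3) instead of, say, 12 bytes of fixed 4-byte integers.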
Dictionary compression for Boolean indexes
Dictionary-as-a-String
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
[Table: columns Freq. (33, 29, 44, 126, …), Postings ptr., Term ptr.; the term pointers point into the string above]

Total string length = 400K x 8B = 3.2MB
Pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits ≈ 3 bytes
Dictionary compression for Boolean indexes
Blocking
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
[Table: columns Freq. (33, 29, 44, 126, 7, …), Postings ptr., Term ptr.; one term pointer per block of 4 terms, with term lengths stored in the string]

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Dictionary compression for Boolean indexes
Front coding
– Sorted words commonly have a long common prefix, so store differences only (for the last k-1 terms in a block of k)

8automata8automate9automatic10automation  →  8automat*a 1◊e 2◊ic 3◊ion

(“automat*” encodes the common prefix automat; each following entry stores only its extra length and text beyond automat.)
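A front-coding sketch for one block of sorted terms (we use `|` where the figure uses a marker symbol, and `front_code` is our own helper, not an API from the slides):

```python
import os.path

def front_code(block):
    """Front-code one block: full first term, then (extra length, suffix) pairs."""
    prefix = os.path.commonprefix(block)
    first = block[0]
    out = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for term in block[1:]:
        out.append(f"{len(term) - len(prefix)}|{term[len(prefix):]}")
    return "".join(out)

print(front_code(["automata", "automate", "automatic", "automation"]))
# 8automat*a1|e2|ic3|ion
```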
Information Retrieval
Ranked Retrieval
Information Retrieval
Ranked retrieval
• Thus far, our queries have all been Boolean.
  – Good for expert users
  – Also good for applications: applications can easily consume 1000s of results
  – Not good for the majority of users
  – Most users are incapable of writing Boolean queries (or they are capable, but think it’s too much work)
• Most users don’t want to wade through 1000s of results
  – This is particularly true of web search
Term Weighting
• Term frequency (tf) and inverse document frequency (idf)
  – tf_{t,d}: the number of occurrences of term t in document d
  – df_t: the number of docs in the collection that contain term t; idf_t = log10(N / df_t)
• tf-idf weighting
  – The tf-idf weight of a term is the product of its tf weight and its idf weight
• tf-idf is the best known weighting scheme in information retrieval

  w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0,  0 otherwise
  idf_t   = log10(N / df_t)
  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)
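The three formulas above, written directly as functions (base-10 logarithms, as on the slide):

```python
import math

def w_tf(tf):
    """Log-frequency weight of a term in a document."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(df, N):
    """Inverse document frequency for a term with document frequency df."""
    return math.log10(N / df)

def tf_idf(tf, df, N):
    return w_tf(tf) * idf(df, N)

# e.g. a term occurring 10 times in a doc, in 100 of 1,000,000 docs:
print(tf_idf(tf=10, df=100, N=1_000_000))  # (1 + 1) * 4 = 8.0
```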
Vector space model for scoring
– Represent the query as a weighted tf-idf vector
– Represent each document as a weighted tf-idf vector
– Compute the cosine similarity score for the query vector and each document vector
– Rank documents with respect to the query by score
– The weight increases with the number of occurrences within a document
– The weight increases with the rarity of the term in the collection

  cos(q, d) = (q · d) / (|q| |d|)
            = Σ_{i=1}^{|V|} q_i d_i / ( sqrt(Σ_{i=1}^{|V|} q_i²) sqrt(Σ_{i=1}^{|V|} d_i²) )
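A toy cosine-scoring sketch over sparse dict vectors (the terms and weights are made up for illustration):

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse term->weight vectors."""
    num = sum(q[t] * d.get(t, 0.0) for t in q)
    den = math.sqrt(sum(w * w for w in q.values())) * \
          math.sqrt(sum(w * w for w in d.values()))
    return num / den if den else 0.0

docs = {
    "d1": {"jealous": 0.5, "gossip": 0.5},
    "d2": {"jealous": 0.1, "wuthering": 0.9},
}
q = {"jealous": 1.0, "gossip": 1.0}

ranking = sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True)
print(ranking)  # ['d1', 'd2']
```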
Providing heuristic methods for
Speeding up Vector Space Scoring & Ranking
– Many of these heuristics achieve their speed at the risk of not finding quite the top K documents matching a query
• Efficient scoring & ranking:
  1. Inexact top-K document retrieval
  2. Index elimination
  3. Champion lists
  4. Static quality scores
• We want top-ranking documents to be both relevant and authoritative
  – Relevance is modeled by cosine scores
  – Authority is typically a query-independent property of a document
  – Assign a query-independent quality score in [0,1] to each document d; denote this by g(d)
Providing heuristic methods for
Speeding up Vector Space Scoring & Ranking (Cont.)
5. Cluster pruning: preprocessing
• Pick √N docs at random: call these leaders
• For every other doc, pre-compute its nearest leader
  – Docs attached to a leader are its followers
  – Likely: each leader has ~√N followers
• Process a query as follows:
  – Given query Q, find its nearest leader L
  – Seek the K nearest docs from among L’s followers
• Net score for a document d
  – Can be computed as a combination of cosine relevance and authority, e.g. net-score(q,d) = g(d) + cosine(q,d)
  – Top K by net score: fast methods exist
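A toy end-to-end sketch of cluster pruning on 2-d vectors (function names and data are ours): pick √N random leaders, attach each doc to its nearest leader, and answer a query from the nearest leader's followers only.

```python
import math, random

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def build(docs, seed=0):
    """docs: id -> vector. Pick sqrt(N) leaders; attach every doc to one."""
    rng = random.Random(seed)
    leaders = rng.sample(list(docs), max(1, int(math.sqrt(len(docs)))))
    clusters = {l: [] for l in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda l: cos(docs[d], docs[l]))
        clusters[nearest].append(d)
    return clusters

def query(q, docs, clusters, k=1):
    """Search only the followers of the query's nearest leader."""
    leader = max(clusters, key=lambda l: cos(q, docs[l]))
    pool = clusters[leader]
    return sorted(pool, key=lambda d: cos(q, docs[d]), reverse=True)[:k]

docs = {1: (1.0, 0.0), 2: (0.9, 0.1), 3: (0.0, 1.0), 4: (0.1, 0.9)}
clusters = build(docs)
print(query((1.0, 0.0), docs, clusters))  # [1]
```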
Cluster Pruning
[Figure: the query is routed to its nearest leader, then matched against that leader’s followers.]
Parametric and zone indexes
• In fact documents have multiple parts, some with special semantics:
  – Author, title, date of publication, language, format, etc.
• These constitute the metadata about a document
• We sometimes wish to search by this metadata
• Field or parametric index: postings for each field value
  – A field query is typically treated as a conjunction
• A zone is a region of the doc that can contain an arbitrary amount of text, e.g. title, abstract, references …
  – Build inverted indexes on zones as well to permit querying
Example zone indexes
Encode zones in dictionary vs. postings.
Tiered indexes
– Tiered indexes
• Break postings up into a hierarchy of lists
  – Most important
  – …
  – Least important
• Can be done by g(d) or another measure
• The inverted index is thus broken up into tiers of decreasing importance
• At query time, use the top tier unless it fails to yield K docs
  – If so, drop to lower tiers
Example tiered index
A Complete Search System
Evaluating Search Engines (Ranked Retrieval Methods)
Measures for a search engine
Which parameters are most important in a search engine?
– How fast does it index?
– How fast does it search?
– Expressiveness of the query language
– Uncluttered user interface (UI)
– Is it free?
The key measure
User happiness
• Useless answers won’t make a user happy
• Need a way of quantifying user happiness
• Issue: who is the user we are trying to make happy?
  – Web engine
  – eCommerce site
  – Enterprise (company/govt/academic)
• Happiness: elusive to measure
Evaluation of unranked retrieval
– Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
– Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

                Relevant   Nonrelevant
Retrieved          tp          fp
Not retrieved      fn          tn

• Precision P = tp / (tp + fp)
• Recall    R = tp / (tp + fn)
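Precision and recall straight from the contingency table (the counts in the example call are made up):

```python
def precision(tp, fp):
    """Fraction of retrieved docs that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant docs that are retrieved."""
    return tp / (tp + fn)

# e.g. 40 relevant retrieved, 10 nonrelevant retrieved, 20 relevant missed:
print(precision(40, 10), recall(40, 20))  # 0.8 0.666...
```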
Evaluation of unranked retrieval(Cont.)
• What about accuracy?
  – The accuracy of an engine: the fraction of classifications that are correct
  – Accuracy is a measure used in machine learning classification work
  – Why is this not a very useful evaluation measure in IR?
  – How to build a 99.9999% accurate search engine on a low budget…
Evaluation of unranked retrieval(Cont.)
• F measure
  – Combined measure that assesses the precision/recall tradeoff: the weighted harmonic mean

      F = 1 / ( α(1/P) + (1−α)(1/R) ) = (β² + 1)PR / (β²P + R),  with β² = (1−α)/α

  – People usually use the balanced F1 measure, i.e. β = 1 (α = ½)
  – For F1 the best value is 1 and the worst value is 0
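The weighted harmonic mean as a function; β = 1 gives the balanced F1:

```python
def f_measure(P, R, beta=1.0):
    """Weighted harmonic mean of precision P and recall R."""
    b2 = beta * beta
    return (b2 + 1) * P * R / (b2 * P + R)

print(f_measure(0.8, 2 / 3))  # F1 for P=0.8, R=2/3
```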
Evaluation of Ranked Retrieval
• By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve
• We can determine a value between the points using interpolation
• 11-point interpolated average precision
• Other methods: mean average precision (MAP) and R-precision
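A sketch of 11-point interpolated average precision: the interpolated precision at recall level r is the maximum precision at any recall ≥ r, averaged over r = 0.0, 0.1, …, 1.0 (the sample (recall, precision) points below are invented):

```python
def interpolated_precision(points, r):
    """points: (recall, precision) pairs for one ranked result list."""
    return max((p for rec, p in points if rec >= r), default=0.0)

def eleven_point_average(points):
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, r) for r in levels) / 11

pts = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.4), (1.0, 0.3)]
print(round(eleven_point_average(pts), 3))
```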
A precision-recall curve
[Figure: a precision-recall curve; precision on the y-axis and recall on the x-axis, both from 0.0 to 1.0.]
Typical (good) 11 point precisions
• SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)
[Figure: 11-point precision against recall, both from 0 to 1, for this run.]
Relevance Feedback (RF) for Query Refinement in Search Engines
Relevance Feedback
• User feedback on relevance of docs in an initial set of results
  – User issues a (short, simple) query
  – The user marks some results as relevant or non-relevant
  – The system computes a better representation of the information need based on the feedback
  – Relevance feedback can go through one or more iterations
• Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate
Relevance Feedback: Example
• Image search engine
http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Results for Initial Query
Relevance Feedback
Results after Relevance Feedback
Key concept: Centroid
• The centroid is the center of mass of a set of points
• Recall that we represent documents as points in a high-dimensional space
• Definition: centroid

    μ(C) = (1/|C|) Σ_{d ∈ C} d

  where C is a set of documents.
Rocchio Algorithm
• The Rocchio algorithm uses the vector space model to pick a relevance-feedback query
• Rocchio seeks the query q_opt that maximizes

    q_opt = argmax_q [ cos(q, μ(C_r)) − cos(q, μ(C_nr)) ]

  which works out to

    q_opt = (1/|C_r|) Σ_{d_j ∈ C_r} d_j − (1/|C_nr|) Σ_{d_j ∈ C_nr} d_j

• Tries to separate docs marked relevant and non-relevant
• Problem: we don’t know the truly relevant docs
Rocchio 1971 Algorithm (SMART)
• Used in practice:
    q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

• D_r = set of known relevant doc vectors
• D_nr = set of known nonrelevant doc vectors
  – Different from C_r and C_nr
• q_m = modified query vector; q_0 = original query vector; α, β, γ: weights (hand-chosen or set empirically)
• The new query moves toward relevant documents and away from nonrelevant documents
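The Rocchio update on toy dense vectors. The α, β, γ defaults below are the commonly quoted 1.0/0.75/0.15, and clipping negative weights to 0 is a standard practical tweak, not something the slide states:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio(q0, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Move q0 toward the relevant centroid, away from the nonrelevant one."""
    cr, cnr = centroid(rel), centroid(nonrel)
    q = [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in range(len(q0))]
    return [max(w, 0.0) for w in q]   # negative weights are usually clipped

q0 = [1.0, 0.0]
rel = [[0.0, 1.0], [0.0, 0.5]]        # known relevant doc vectors
nonrel = [[1.0, 0.0]]                 # known nonrelevant doc vectors
print(rocchio(q0, rel, nonrel))
```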
The Theoretically Best Query
x  non-relevant documents
o  relevant documents

[Figure: documents plotted in vector space; the optimal query is placed so as to separate the relevant documents (o) from the non-relevant ones (x).]
Relevance feedback on initial query
x  known non-relevant documents
o  known relevant documents

[Figure: the initial query is moved toward the known relevant documents (o) and away from the known non-relevant ones (x), giving the revised query.]
Relevance Feedback in vector spaces
• We can modify the query based on relevance feedback and apply the standard vector space model.
• Use only the docs that were marked.
• Relevance feedback can improve recall and precision
• Relevance feedback is most useful for increasing recall in situations where recall is important
– Users can be expected to review results and to take time to iterate
Relevance feedback revisited
• In relevance feedback, the user marks a number of documents as relevant/nonrelevant.
• We then try to use this information to return better search results.
• Suppose we just tried to learn a filter for nonrelevant documents
• This is an instance of a text classification problem:
  – Two “classes”: relevant, nonrelevant
  – For each document, decide whether it is relevant or nonrelevant
Text Classification
Classification Methods #1
Manual classification
• Used by Yahoo! (originally; now present but downplayed), Looksmart, about.com, ODP, PubMed
• Very accurate when the job is done by experts
• Consistent when the problem size and team are small
• Difficult and expensive to scale
Classification Methods #2
Automatic document classification
• Hand-coded rule-based systems
  – One technique used by CS departments’ spam filters, Reuters, CIA, etc.
  – Companies (e.g. Verity) provide an “IDE” for writing such rules
  – Accuracy is often very high if a rule has been carefully refined over time by a subject expert
  – Building and maintaining these rules is expensive
Classification Methods #3
Supervised learning
• Supervised learning of a document-label assignment function
  – Many systems partly rely on machine learning
• k-nearest neighbors (simple, powerful)
• Naive Bayes (simple, common method)
• Support vector machines (newer, more powerful)
• No free lunch: requires hand-classified training data
• But the data can be built up (and refined) by amateurs
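As a flavor of the learning methods listed above, a minimal k-nearest-neighbors classifier over term-count vectors (training texts and labels are invented):

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cos(u, v):
    num = sum(u[t] * v.get(t, 0) for t in u)
    den = math.sqrt(sum(c * c for c in u.values())) * \
          math.sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def knn_classify(train, text, k=3):
    """train: list of (text, label) pairs; majority vote among k nearest."""
    q = vec(text)
    neighbors = sorted(train, key=lambda ex: cos(q, vec(ex[0])), reverse=True)
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

train = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("monday project meeting notes", "ham"),
]
print(knn_classify(train, "buy cheap pills", k=3))  # spam
```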
References
• Introduction to Information Retrieval, Manning, Raghavan & Schütze, 2008
• Managing Gigabytes, Witten, Moffat & Bell, 1999