Post on 21-Dec-2015
Retrieval Models II
Vector Space, Probabilistic
Allan, Ballesteros, Croft, and/or Turtle
Properties of Inner Product
• The inner product is unbounded.
• Favors long documents with a large number of unique terms.
• Measures how many terms matched but not how many terms are not matched.
Inner Product -- Examples

Binary:
– D = 1, 1, 1, 0, 1, 1, 0
– Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3

Vocabulary (one vector position per term): retrieval, database, architecture, computer, text, management, information
Size of vector = size of vocabulary = 7; a 0 means the corresponding term is not found in the document or query

Weighted:
D1 = 2T1 + 3T2 + 5T3   D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
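The weighted inner products above can be checked with a short script; the vectors are the example's own weights over T1..T3:

```python
# Sketch of the weighted inner-product similarity from the example above.

def inner_product(d, q):
    """Sum of pairwise products of document and query term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

D1 = [2, 3, 5]   # D1 = 2*T1 + 3*T2 + 5*T3
D2 = [3, 7, 1]   # D2 = 3*T1 + 7*T2 + 1*T3
Q  = [0, 0, 2]   # Q  = 0*T1 + 0*T2 + 2*T3

print(inner_product(D1, Q))  # 10
print(inner_product(D2, Q))  # 2
```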
Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• Inner product normalized by the vector lengths:

CosSim(dj, q) = (Σi wij·qi) / sqrt((Σi wij²)·(Σi qi²))

D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / sqrt((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / sqrt((9+49+1)·(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

[Figure: in the t1–t2–t3 term space, Q lies at a smaller angle to D1 than to D2]

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
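The two cosine scores can be verified with a small function over the same example vectors:

```python
import math

def cos_sim(d, q):
    """Inner product normalized by the two vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    return dot / math.sqrt(sum(w * w for w in d) * sum(w * w for w in q))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```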
Simple Implementation
1. Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.
2. Convert query to a tf-idf-weighted vector q.
3. For each dj in D do
Compute score sj = cosSim(dj, q)
4. Sort documents by decreasing score.
5. Present top ranked documents to the user.
Time complexity: O(|V|·|D|). Bad for large V & D!
|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
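The five steps above can be sketched end to end; the toy corpus, whitespace tokenization, and the idf variant used (log N/df) are illustrative assumptions, not the lecture's exact recipe:

```python
import math
from collections import Counter

# Step 1-2: tf-idf vectors for documents and query over vocabulary V.
docs = ["text database management",
        "computer architecture",
        "information retrieval text"]
query = "text information"

vocab = sorted({t for d in docs for t in d.split()})
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))
idf = {t: math.log(N / df[t]) for t in vocab}

def tfidf(text):
    tf = Counter(text.split())
    return [tf[t] * idf[t] for t in vocab]

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Step 3-5: score every document, sort by decreasing score, present.
q = tfidf(query)
scores = sorted(((cos_sim(tfidf(d), q), d) for d in docs), reverse=True)
for s, d in scores:
    print(round(s, 3), d)
```

The inner loop touches every vocabulary entry for every document, which is exactly the O(|V|·|D|) cost the slide warns about; real systems avoid it with inverted indexes.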
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite obvious weaknesses.
• Allows efficient implementation for large document collections.
Problems with Vector Space Model
• Assumption of term independence
• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase structure, word order, proximity information).
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
– Given a two-term query “A B”, may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.
Statistical Models
• A document is typically represented by a bag of words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of the same element.
• User specifies a set of desired terms with optional weights:
– Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 >
– Unweighted query terms: Q = < database; text; information >
– No Boolean conditions specified in the query.
Statistical Retrieval
• Retrieval via similarity based on probability of relevance to Q
• Given Q, the set of all documents is partitioned into the sets rel and nonrel.
– The sets rel and nonrel change from query to query
• Output documents are ranked according to probability of relevance to the query.
– Pr(relevance) of each document to the query is not available in practice
Basic Probabilistic Retrieval Model
• We need a similarity function s such that:
– P(rel|Di) > P(rel|Dj) iff s(Q, Di) > s(Q, Dj)
• Retrieve if P(relevant|D) > P(non-relevant|D)
– equivalently, calculate the ratio P(D|R)/P(D|NR)
• Different ways of estimating these probabilities lead to different probabilistic models
Probability
• Experiment: a specific set of actions whose results cannot be predicted with certainty
– e.g. rolling two dice and recording their values
• Simple outcome: each possible set of recorded data
– for the example, each pair is a simple outcome
(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ...(6,5)
(1,6) (2,6) (3,6) ... (6,6)
Probability
• Sample Space
– a non-empty set containing all possible simple outcomes of the experiment
(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ... (6,5)
(1,6) (2,6) (3,6) ... (6,6)
– each element is known as a sample point
Probability
• Sample Space
(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ... (6,5)
(1,6) (2,6) (3,6) ... (6,6)
• Event space: subsets of a sample space defined by a specific event or outcome
– e.g. the event that the sum of the two dice is 4
Event Space
• The probability of an event is the sum of the probabilities of the sample points associated with the event
– what is the probability that the sum is 4?
• Recall that sample points represent the possible outcomes of a statistical “experiment”
• 36 possible outcomes when rolling 2 dice
• 3 ways to get a sum of 4: (1,3) (2,2) (3,1)
• Pr(sum is 4) = 3/36 = 1/12
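The dice calculation above can be reproduced by enumerating the sample space directly:

```python
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two dice and
# collect the sample points where the total is 4.
outcomes = list(product(range(1, 7), repeat=2))
event = [o for o in outcomes if sum(o) == 4]

print(len(outcomes))               # 36
print(event)                       # [(1, 3), (2, 2), (3, 1)]
print(len(event) / len(outcomes))  # 0.0833... = 1/12
```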
Event Space
• For a retrieval model, the event space is Q × D
– each sample point is a query-document pair
– each pair has an associated relevance judgment
• For a particular query, a probabilistic model tries to estimate P(R|D)
Probability Ranking Principle
• Ranking documents in decreasing order of pr(rel) to the query, where probabilities are estimated using all available evidence, produces the best possible effectiveness
– Assume relevance of a document is independent of other documents in the collection
– Bayes Decision Rule: Retrieve if P(R|D) > P(NR|D)
• minimizes the average probability of error
• equivalent to optimizing the recall/fallout tradeoff

P(error|D) = { P(NR|D) if we decide R
             { P(R|D)  if we decide NR
Basic Probabilistic Model
• Doc d = (t1, t2, ..., tn)
– ti = 0 means index term ti is absent; ti = 1 means term ti is present
– pi = P(ti =1|R) and 1-pi = P(ti =0|R)
– qi = P(ti =1|NR) and 1- qi = P(ti =0|NR)
• Assume conditional independence
– P(d|R) is the product of the probabilities for the components of d (i.e. the product of the probabilities of getting a particular vector of 1’s and 0’s)
• Appearance of a term in a doc is interpreted either as
– evidence that the document is relevant, or
– evidence that the document is non-relevant
• The key is finding a means of estimating pi and qi
– pi is the probability the term is present given relevance
– qi is the probability the term is present given non-relevance
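Under the conditional-independence assumption, the ratio P(d|R)/P(d|NR) multiplies one factor per index term. A minimal sketch, where the pi and qi values and the document vector are illustrative assumptions:

```python
# Likelihood ratio P(d|R)/P(d|NR) under conditional independence.
p = [0.8, 0.3, 0.5]   # pi = P(ti = 1 | R)   (illustrative values)
q = [0.2, 0.4, 0.5]   # qi = P(ti = 1 | NR)  (illustrative values)
d = [1, 0, 1]         # term presence/absence in the document

ratio = 1.0
for ti, pi, qi in zip(d, p, q):
    if ti == 1:
        ratio *= pi / qi              # term present: pi vs qi
    else:
        ratio *= (1 - pi) / (1 - qi)  # term absent: 1-pi vs 1-qi

print(ratio)  # (0.8/0.2) * (0.7/0.6) * (0.5/0.5)
```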
Basic Probabilistic Model
• Need to calculate
– “relevant to query given term appears” and
– “irrelevant to query given term appears”
• These values can be based upon some known relevance judgments
Estimation with Relevance Information
– N = total # docs, R = # rel docs, N-R = # nonrel docs
– ft = # docs w/term
– Rt = # rel docs w/ term
– ft -Rt = # nonrel w/term
• We can estimate the conditional probabilities with the table
– P(rel | t is present) = Rt/ft
– P(nonrel | t is present) = (ft - Rt)/ft
– P(t is present | rel) = Rt/R
– P(t is present | nonrel) = (ft - Rt)/(N - R)
Number of Docs    Relevant    Nonrelevant          Total
Term t present    Rt          ft - Rt              ft
Term t absent     R - Rt      N - ft - (R - Rt)    N - ft
Total             R           N - R                N
Estimation with Relevance Information
• wt = (Rt/(R - Rt)) / ((ft - Rt)/(N - ft - (R - Rt)))
– the ratio of relevant docs with the term to relevant docs without it, divided by the ratio of nonrelevant docs with the term to nonrelevant docs without it
• Suppose N = 20; R = 13 relevant; term t appears in 11 relevant docs; the term appears in 12 docs overall
– wt = (11/(13-11)) / ((12-11)/(20-12-(13-11))) = 5.5/(1/6) = 33
Number of Docs    Relevant    Nonrelevant          Total
Term t present    Rt          ft - Rt              ft
Term t absent     R - Rt      N - ft - (R - Rt)    N - ft
Total             R           N - R                N
Estimation with Relevance Information
• Think of this as the extent to which the term can discriminate between relevant and non-relevant docs
– wt = (Rt/(R - Rt)) / ((ft - Rt)/(N - ft - (R - Rt))) = 33
– t is strongly indicative of relevance since it frequently appears in relevant documents and rarely in nonrelevant ones
• What if N = 20; R = 13; Rt = 4; ft = 7?
– wt = (4/9)/(3/4) = 0.59
– t counts slightly against the doc being relevant
• wt = 1 indicates a neutral term since it appears randomly across relevant and nonrelevant docs
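Both worked examples can be checked by coding the wt formula directly:

```python
# The discriminative term weight wt from the slides, checked against
# the two worked examples (wt = 33 and wt ≈ 0.59).

def w_t(N, R, Rt, ft):
    """wt = (Rt/(R-Rt)) / ((ft-Rt)/(N-ft-(R-Rt)))"""
    return (Rt / (R - Rt)) / ((ft - Rt) / (N - ft - (R - Rt)))

print(round(w_t(N=20, R=13, Rt=11, ft=12), 2))  # 33.0
print(round(w_t(N=20, R=13, Rt=4, ft=7), 2))    # 0.59
```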
Estimation with Relevance Information
• wt = (Rt /(R-Rt ))/ ((ft - Rt )/(N - ft - (R -Rt )))
• wt = 1 indicates neutral term since it appears randomly across relevant and nonrelevant docs
• Assuming that the occurrences of terms in documents are independent
– the document weight is the product of its term weights
• w(d) = ∏ wt
– conventional to express as a sum of logs: log w(d) = Σ log wt
– negative values indicate non-relevance; 0 indicates there is as much evidence for relevance as for non-relevance
Estimation
• Relevance information is usually not available
• Estimate probabilities based on information in the query & collection
– previous queries can also be used with some learning approaches
• If qi (the probability of occurrence in non-relevant documents) is estimated as ft/N, the second part of the weight is

log((1 - qi)/qi) = log((N - ft)/ft)

– which for large N is approximately the IDF weight log(N/ft)
– i.e. the non-relevant documents are approximated by the whole collection
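The IDF approximation is easy to see numerically; the collection size and document frequency below are illustrative:

```python
import math

# With qi estimated as n/N, log((1-qi)/qi) = log((N-n)/n),
# which approaches the IDF weight log(N/n) as N grows.
N, n = 1_000_000, 100   # illustrative collection size and doc frequency

exact = math.log((N - n) / n)  # log((N - n)/n)
idf = math.log(N / n)          # log(N/n), the classic IDF weight
print(round(exact, 4), round(idf, 4))
```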
Estimation
• pi (probability of occurrence in relevant documents) can be estimated in various ways:
– constant (Croft and Harper combination match)
– proportional to probability of occurrence in collection
– more accurately, proportional to log(probability of occurrence) (Greiff, 1998)
• Maximum likelihood estimates have problems with small samples or zero values
• Estimating probabilities is the same problem as determining weighting formulae in less formal models
An Independence Assumption
• Typically, terms aren’t independent (e.g. phrases)
– Modeling dependence can be very complex
• Assumption: the set of all terms is distributed independently in both rel and nonrel
– A very strong assumption! e.g.:
Q: “What is happening with the impeachment trial?”
– the occurrence of “impeachment” in relevant documents is independent of the occurrence of “trial”