Post on 21-Dec-2015
Retrieval Models II
Vector Space, Probabilistic
Allan, Ballesteros, Croft, and/or Turtle
Properties of Inner Product
• The inner product is unbounded.
• Favors long documents with a large number of unique terms.
• Measures how many terms matched but not how many terms are not matched.
Inner Product -- Examples

Binary:
– D = 1, 1, 1, 0, 1, 1, 0
– Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3

Vocabulary (one vector position per term): retrieval, database, architecture, computer, text, management, information
Size of vector = size of vocabulary = 7; a 0 means the corresponding term is not found in the document or query

Weighted:
D1 = 2T1 + 3T2 + 5T3   D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
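The weighted inner products above can be checked with a short script; the vectors are the example's own weights over T1..T3:

```python
# Sketch of the weighted inner-product similarity from the example above.

def inner_product(d, q):
    """Sum of pairwise products of document and query term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

D1 = [2, 3, 5]   # D1 = 2*T1 + 3*T2 + 5*T3
D2 = [3, 7, 1]   # D2 = 3*T1 + 7*T2 + 1*T3
Q  = [0, 0, 2]   # Q  = 0*T1 + 0*T2 + 2*T3

print(inner_product(D1, Q))  # 10
print(inner_product(D2, Q))  # 2
```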
Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• Inner product normalized by the vector lengths:

CosSim(dj, q) = (Σi wij·qi) / sqrt((Σi wij²)·(Σi qi²))

D1 = 2T1 + 3T2 + 5T3   CosSim(D1, Q) = 10 / sqrt((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3   CosSim(D2, Q) = 2 / sqrt((9+49+1)·(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

[Figure: in the t1–t2–t3 term space, Q lies at a smaller angle to D1 than to D2]

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
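The two cosine scores can be verified with a small function over the same example vectors:

```python
import math

def cos_sim(d, q):
    """Inner product normalized by the two vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    return dot / math.sqrt(sum(w * w for w in d) * sum(w * w for w in q))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```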
Simple Implementation
1. Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.
2. Convert query to a tf-idf-weighted vector q.
3. For each dj in D do
Compute score sj = cosSim(dj, q)
4. Sort documents by decreasing score.
5. Present top ranked documents to the user.
Time complexity: O(|V|·|D|). Bad for large V & D!
|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
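The five steps above can be sketched end to end; the toy corpus, whitespace tokenization, and the idf variant used (log N/df) are illustrative assumptions, not the lecture's exact recipe:

```python
import math
from collections import Counter

# Step 1-2: tf-idf vectors for documents and query over vocabulary V.
docs = ["text database management",
        "computer architecture",
        "information retrieval text"]
query = "text information"

vocab = sorted({t for d in docs for t in d.split()})
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))
idf = {t: math.log(N / df[t]) for t in vocab}

def tfidf(text):
    tf = Counter(text.split())
    return [tf[t] * idf[t] for t in vocab]

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Step 3-5: score every document, sort by decreasing score, present.
q = tfidf(query)
scores = sorted(((cos_sim(tfidf(d), q), d) for d in docs), reverse=True)
for s, d in scores:
    print(round(s, 3), d)
```

The inner loop touches every vocabulary entry for every document, which is exactly the O(|V|·|D|) cost the slide warns about; real systems avoid it with inverted indexes.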
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite obvious weaknesses.
• Allows efficient implementation for large document collections.
Problems with Vector Space Model
• Assumption of term independence
• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase structure, word order, proximity information).
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
– Given a two-term query “A B”, may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.
Statistical Models
• A document is typically represented by a bag of words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of the same element.
• User specifies a set of desired terms with optional weights:
– Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 >
– Unweighted query terms: Q = < database; text; information >
– No Boolean conditions specified in the query.
Statistical Retrieval
• Retrieval via similarity based on probability of relevance to Q
• Given Q, the set of all documents is partitioned into the sets rel and nonrel.
– The sets rel and nonrel change from query to query
• Output documents are ranked according to probability of relevance to the query.
– Pr(relevance) of each document to the query is not available in practice
Basic Probabilistic Retrieval Model
• We need a similarity function s such that:
– P(rel|Di) > P(rel|Dj) iff s(Q, Di) > s(Q, Dj)
• Retrieve if P(relevant|D) > P(non-relevant|D)
– equivalently, calculate the ratio P(D|R)/P(D|NR)
• Different ways of estimating these probabilities lead to different probabilistic models
Probability
• Experiment: a specific set of actions whose results cannot be predicted with certainty
– e.g. rolling two dice and recording their values
• Simple outcome: each possible set of recorded data
– for the example, each pair is a simple outcome
(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ...(6,5)
(1,6) (2,6) (3,6) ... (6,6)
Probability
• Sample Space
– a non-empty set containing all possible simple outcomes of the experiment
(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ... (6,5)
(1,6) (2,6) (3,6) ... (6,6)
– each element is known as a sample point
Probability
• Sample Space
(1,1) (2,1) (3,1) ... (6,1)
(1,2) (2,2) (3,2) ... (6,2)
(1,3) (2,3) (3,3) ... (6,3)
(1,4) (2,4) (3,4) ... (6,4)
(1,5) (2,5) (3,5) ... (6,5)
(1,6) (2,6) (3,6) ... (6,6)
• Event space: subsets of a sample space defined by a specific event or outcome
– e.g. the event that the sum of the two dice is 4
Event Space
• The probability of an event is the sum of the probabilities of the sample points associated with the event
– what is the probability that the sum is 4?
• Recall that sample points represent the possible outcomes of a statistical “experiment”
• 36 possible outcomes when rolling 2 dice
• 3 ways to get a sum of 4: (1,3) (2,2) (3,1)
• Pr(sum is 4) = 3/36 = 1/12
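The dice calculation above can be reproduced by enumerating the sample space directly:

```python
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two dice and
# collect the sample points where the total is 4.
outcomes = list(product(range(1, 7), repeat=2))
event = [o for o in outcomes if sum(o) == 4]

print(len(outcomes))               # 36
print(event)                       # [(1, 3), (2, 2), (3, 1)]
print(len(event) / len(outcomes))  # 0.0833... = 1/12
```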
Event Space
• For a retrieval model, the event space is Q × D
– each sample point is a query-document pair
– each pair has an associated relevance judgment
• For a particular query, a probabilistic model tries to estimate P(R|D)
Probability Ranking Principle
• Ranking documents in decreasing order of pr(rel) to the query, where probabilities are estimated using all available evidence, produces the best possible effectiveness
– Assume relevance of a document is independent of other documents in the collection
– Bayes Decision Rule: Retrieve if P(R|D) > P(NR|D)
• minimizes the average probability of error
• equivalent to optimizing the recall/fallout tradeoff

P(error|D) = { P(NR|D) if we decide R
             { P(R|D)  if we decide NR
Basic Probabilistic Model
• Doc d = (t1, t2, ..., tn)
– ti = 0 means index term ti is absent; ti = 1 means term ti is present
– pi = P(ti =1|R) and 1-pi = P(ti =0|R)
– qi = P(ti =1|NR) and 1- qi = P(ti =0|NR)
• Assume conditional independence
– P(d|R) is the product of the probabilities for the components of d (i.e. the product of the probabilities of getting a particular vector of 1’s and 0’s)
• Appearance of a term in a doc is interpreted either as
– evidence that the document is relevant, or
– evidence that the document is non-relevant
• The key is finding a means of estimating pi and qi
– pi is the probability the term is present given relevance
– qi is the probability the term is present given non-relevance
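Under the conditional-independence assumption, the ratio P(d|R)/P(d|NR) multiplies one factor per index term. A minimal sketch, where the pi and qi values and the document vector are illustrative assumptions:

```python
# Likelihood ratio P(d|R)/P(d|NR) under conditional independence.
p = [0.8, 0.3, 0.5]   # pi = P(ti = 1 | R)   (illustrative values)
q = [0.2, 0.4, 0.5]   # qi = P(ti = 1 | NR)  (illustrative values)
d = [1, 0, 1]         # term presence/absence in the document

ratio = 1.0
for ti, pi, qi in zip(d, p, q):
    if ti == 1:
        ratio *= pi / qi              # term present: pi vs qi
    else:
        ratio *= (1 - pi) / (1 - qi)  # term absent: 1-pi vs 1-qi

print(ratio)  # (0.8/0.2) * (0.7/0.6) * (0.5/0.5)
```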
Basic Probabilistic Model
• Need to calculate
– “relevant to query given term appears” and
– “irrelevant to query given term appears”
• These values can be based upon some known relevance judgments
Estimation with Relevance Information
– N = total # docs, R = # rel docs, N-R = # nonrel docs
– ft = # docs w/term
– Rt = # rel docs w/ term
– ft -Rt = # nonrel w/term
• We can estimate the conditional probabilities with the table
– P(rel | t is present) = Rt/ft
– P(nonrel | t is present) = (ft - Rt)/ft
– P(t is present | rel) = Rt/R
– P(t is present | nonrel) = (ft - Rt)/(N - R)
Number of Docs    Relevant    Nonrelevant          Total
Term t present    Rt          ft - Rt              ft
Term t absent     R - Rt      N - ft - (R - Rt)    N - ft
Total             R           N - R                N
Estimation with Relevance Information
• wt = (Rt/(R - Rt)) / ((ft - Rt)/(N - ft - (R - Rt)))
– the ratio of relevant docs with the term to relevant docs without it, divided by the ratio of nonrelevant docs with the term to nonrelevant docs without it
• Suppose N = 20; R = 13 relevant; term t appears in 11 relevant docs; the term appears in 12 docs overall
– wt = (11/(13-11)) / ((12-11)/(20-12-(13-11))) = 5.5/(1/6) = 33
Number of Docs    Relevant    Nonrelevant          Total
Term t present    Rt          ft - Rt              ft
Term t absent     R - Rt      N - ft - (R - Rt)    N - ft
Total             R           N - R                N
Estimation with Relevance Information
• Think of this as the extent to which the term can discriminate between relevant and non-relevant docs
– wt = (Rt/(R - Rt)) / ((ft - Rt)/(N - ft - (R - Rt))) = 33
– t is strongly indicative of relevance since it frequently appears in relevant documents and rarely in nonrelevant ones
• What if N = 20; R = 13; Rt = 4; ft = 7?
– wt = (4/9)/(3/4) = 0.59
– t counts slightly against the doc being relevant
• wt = 1 indicates a neutral term since it appears randomly across relevant and nonrelevant docs
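Both worked examples can be checked by coding the wt formula directly:

```python
# The discriminative term weight wt from the slides, checked against
# the two worked examples (wt = 33 and wt ≈ 0.59).

def w_t(N, R, Rt, ft):
    """wt = (Rt/(R-Rt)) / ((ft-Rt)/(N-ft-(R-Rt)))"""
    return (Rt / (R - Rt)) / ((ft - Rt) / (N - ft - (R - Rt)))

print(round(w_t(N=20, R=13, Rt=11, ft=12), 2))  # 33.0
print(round(w_t(N=20, R=13, Rt=4, ft=7), 2))    # 0.59
```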
Estimation with Relevance Information
• wt = (Rt /(R-Rt ))/ ((ft - Rt )/(N - ft - (R -Rt )))
• wt = 1 indicates neutral term since it appears randomly across relevant and nonrelevant docs
• Assuming that the occurrences of terms in documents are independent
– the document weight is the product of its term weights
• w(d) = ∏ wt
– conventional to express as a sum of logs: log w(d) = Σ log wt
– negative values indicate non-relevance; 0 indicates there is as much evidence for relevance as for non-relevance
Estimation
• Relevance information is usually not available
• Estimate probabilities based on information in the query & collection
– previous queries can also be used with some learning approaches
• If qi (the probability of occurrence in non-relevant documents) is estimated as ft/N, the second part of the weight is

log((1 - qi)/qi) = log((N - ft)/ft)

– which for large N is approximately the IDF weight log(N/ft)
– i.e. the non-relevant documents are approximated by the whole collection
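The IDF approximation is easy to see numerically; the collection size and document frequency below are illustrative:

```python
import math

# With qi estimated as n/N, log((1-qi)/qi) = log((N-n)/n),
# which approaches the IDF weight log(N/n) as N grows.
N, n = 1_000_000, 100   # illustrative collection size and doc frequency

exact = math.log((N - n) / n)  # log((N - n)/n)
idf = math.log(N / n)          # log(N/n), the classic IDF weight
print(round(exact, 4), round(idf, 4))
```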
Estimation
• pi (probability of occurrence in relevant documents) can be estimated in various ways:
– constant (Croft and Harper combination match)
– proportional to probability of occurrence in collection
– more accurately, proportional to log(probability of occurrence) (Greiff, 1998)
• Maximum likelihood estimates have problems with small samples or zero values
• Estimating probabilities is the same problem as determining weighting formulae in less formal models
An Independence Assumption
• Typically, terms aren’t independent (e.g. phrases)
– Modeling dependence can be very complex
• Assumption: the set of all terms is distributed independently in both rel and nonrel
– A very strong assumption! e.g.:
Q: “What is happening with the impeachment trial?”
– the occurrence of “impeachment” in relevant documents is independent of the occurrence of “trial”