Measuring Search Engine Quality
Rank-Biased Precision
Alistair Moffat and Justin Zobel, “Rank-Biased Precision for Measurement of Retrieval Effectiveness”, ACM TOIS, vol. 27, no. 1, 2008.
Ofer Egozi, LARA group, Technion
Introduction to IR Evaluation
Mean Average Precision
Rank-Biased Precision
Analysis of RBP
Outline
Task: given query q, output a ranked list of documents
◦ Find the probability that document d is relevant to q
Evaluation is difficult
◦ No (per-query) test data
◦ Queries vary tremendously
◦ Relevance is a vague (human) concept
IR Evaluation
Precision / recall
◦ Precision and recall usually conflict
◦ Single measures proposed (P@X, RR, AP…)
Elementary IR Measures
[Venn diagram: collection D, relevant documents rel(q,D), retrieved documents alg(q,D)]
Precision: |alg ∩ rel| / |alg|    Recall: |alg ∩ rel| / |rel|
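As a minimal sketch of these two set-based definitions (the document ids and sets below are made up for illustration):

```python
# Minimal sketch of the set-based precision/recall definitions above.
# The document ids and sets are illustrative, not from the slides.

def precision(retrieved, relevant):
    """|alg ∩ rel| / |alg|: fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """|alg ∩ rel| / |rel|: fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"d1", "d2", "d3", "d4"}   # alg(q, D)
relevant = {"d2", "d4", "d7"}          # rel(q, D)
print(precision(retrieved, relevant))  # 0.5
print(recall(retrieved, relevant))     # 0.666...
```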
Relevance requires human judgment
◦ Exhaustive judging is not scalable
◦ TREC uses pooling
◦ Shown to miss a significant portion of the relevant documents…
◦ …but also shown to compare systems against each other well
◦ Biased against novel approaches
Pooling for Scalable Judging
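A rough sketch of the pooling idea, assuming each run submits a ranked list and only the union of the top-k documents per run is judged (the system names and k are illustrative):

```python
# Rough sketch of TREC-style pooling: only the union of each run's top-k
# documents is sent to the human assessors.  Names and k are illustrative.

def build_pool(runs, k):
    """Union of the top-k documents from every submitted run."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

runs = {
    "system_A": ["d3", "d1", "d7", "d9"],
    "system_B": ["d1", "d2", "d8", "d3"],
}
print(sorted(build_pool(runs, k=2)))  # ['d1', 'd2', 'd3'] are judged; the rest are not
```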
In the real world, what does recall measure?
◦ Recall is important only with “perfect” knowledge
◦ If I got one result, and there is another I don’t know of, am I half-satisfied?…
◦ …yes, for specific needs (a legal or patent search session)
◦ “Boiling temperature of lead”: one good answer is enough
Precision is more user-oriented
◦ P@10 measures real user satisfaction
◦ Still, P@10 = 0.3 can mean the first three results or the last three…
How Important is Recall?
Calculated as AP = (1/R) · Σ P@k, summed over every rank k at which a relevant document appears (R = number of relevant documents)
◦ Intuitively: sum P@X at every rank where a relevant document is found, and divide by the total number of relevant documents to normalize for averaging across queries
Example ('$' = relevant, '-' = not relevant): $$---$----$-----$---
Consider: $$---$----$-----$$$$
◦ AP drops to 0.5234, despite P@20 increasing
◦ Finding more relevant documents can harm the AP score!
◦ Similar problems arise if some documents are initially unjudged
(Mean) Average Precision
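A small sketch of the AP computation on the two example rankings above ('$' = relevant, '-' = not relevant); it reproduces the effect that appending relevant documents at the tail lowers AP:

```python
# Sketch of AP on the example rankings ('$' = relevant, '-' = not relevant).
# R is taken as the number of relevant documents appearing in the ranking.

def average_precision(ranking):
    rel_seen = 0
    precisions = []
    for k, mark in enumerate(ranking, start=1):
        if mark == "$":
            rel_seen += 1
            precisions.append(rel_seen / k)  # P@k at each relevant rank
    return sum(precisions) / rel_seen if rel_seen else 0.0

print(average_precision("$$---$----$-----$---"))  # ~0.63
print(average_precision("$$---$----$-----$$$$"))  # ~0.53, lower despite more relevant docs
```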
Methodological problem of instability
◦ Results may depend on the extent of judging
◦ More judging can be destabilizing (error margins do not necessarily shrink as uncertainty is reduced)
MAP is Unstable
Complex abstraction of user satisfaction
◦ “Every time a relevant document is encountered, the user pauses, asks ‘Over the documents I have seen so far, on average how satisfied am I?’ and writes a number on a piece of paper. Finally, when the user has examined every document in the collection — because this is the only way to be sure that all of the relevant ones have been seen — the user computes the average of the values they have written.”
How can R be truly calculated? Think of evaluating a Google query…
Still, MAP is highly popular and useful:
◦ Validated in numerous TREC studies
◦ Shown to be stable and robust across query sets (for deep enough pools)
MAP is not “Real-Life”
Enter RBP…
Induced by a user model
◦ The document at rank i is examined with probability p^(i-1)
◦ Expected number of documents examined: Σ_{i≥1} p^(i-1) = 1/(1-p)
◦ Total expected utility (r_i = known relevance of the document at rank i): Σ_i r_i · p^(i-1)
◦ RBP = expected rate of utility gain = utility/effort = (1-p) · Σ_i r_i · p^(i-1)
Rank-Biased Precision
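A minimal sketch of the formula, RBP = (1-p) · Σ r_i · p^(i-1), using the same '$'/'-' notation as the AP example (the printed values are only illustrative):

```python
# Sketch of RBP = (1 - p) * sum_i r_i * p^(i-1), with '$' / '-' as in the AP example.

def rbp(ranking, p):
    return (1 - p) * sum(p ** (i - 1)
                         for i, mark in enumerate(ranking, start=1)
                         if mark == "$")

ranking = "$$---$----$-----$---"
for p in (0.5, 0.8, 0.95):
    print(p, round(rbp(ranking, p), 3))  # smaller p weights the early hits most
```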
Values of p reflect user behaviors
◦ p = 0.95: persistent user (~60% chance of reaching the 2nd page)
◦ p = 0.5: impatient user (~0.1% chance of reaching the 2nd page)
◦ p = 0: “I’m feeling lucky” (identical to P@1)
Values of p control the contribution of each relevant document
◦ But the contribution is always positive!
RBP Intuitions
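A quick check of the “second page” figures above, assuming a 10-result first page, so the chance of examining result 11 is p^10:

```python
# Probability of examining rank 11 (the start of a 10-result second page) is p ** 10.
for p in (0.95, 0.5):
    print(p, round(p ** 10, 4))  # 0.95 -> ~0.60, 0.5 -> ~0.001
```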
Evaluating a new evaluation measure…
RBP Stability
Uncertainty: how many relevant documents are there? (further down the ranking, or even unjudged within the current depth)
The RBP value computed from judged documents is inherently a lower bound
The residual uncertainty is easy to calculate: assume every unjudged document is relevant…
Error Bounds
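A sketch of the base + residual computation under that assumption: judged-relevant ranks ('$') contribute to the lower bound, while unjudged ranks ('?') and everything below the evaluation depth d contribute to the residual (the geometric tail beyond depth d sums to p^d):

```python
# Sketch: RBP lower bound ('base') counts only judged-relevant ranks ('$').
# The residual adds the weight of every unjudged rank ('?') plus all ranks
# below the evaluation depth d; that tail sums to p ** d.

def rbp_with_residual(ranking, p):
    base = residual = 0.0
    for i, mark in enumerate(ranking, start=1):
        weight = (1 - p) * p ** (i - 1)
        if mark == "$":
            base += weight
        elif mark == "?":              # unjudged: might turn out to be relevant
            residual += weight
    residual += p ** len(ranking)      # unexamined tail below the judged depth
    return base, residual

base, res = rbp_with_residual("$$-?-$?---", p=0.8)
print(f"RBP lies in [{base:.3f}, {base + res:.3f}]")
```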
RBP in Comparison
Similarity (correlation) between measures
Detected significance in evaluated systems’ ranking
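The cross-measure comparison is typically done by rank-correlating the system orderings produced by each measure; below is a hedged sketch using Kendall's tau with made-up scores (not the paper's data):

```python
# Hedged sketch: correlate the system orderings induced by two measures.
# The per-system scores below are invented for illustration.
from scipy.stats import kendalltau

map_scores = [0.31, 0.28, 0.35, 0.22, 0.30]  # one MAP value per system
rbp_scores = [0.44, 0.40, 0.47, 0.33, 0.45]  # one RBP value per system

tau, _ = kendalltau(map_scores, rbp_scores)
print(round(tau, 2))  # high tau -> the two measures rank the systems similarly
```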
RBP has significant advantages:
◦ Based on a solid, supported user model
◦ Realistic: no unknown factors (R, |D|)
◦ Error bounds for uncertainty
◦ Statistical significance as good as other measures
But also:
◦ Values are absolute, not relative to query difficulty
◦ A value of p must be chosen
Conclusion