Yuval Feinstein
Answers Corporation
BM25 Scoring for Lucene:
From Academia to Industry
Apache Lucene EuroCon 2010 Meetup
Prague, May 2010
Overview
Answers.com
A Relevance problem
BM25F - a possible solution
Joaquin’s Implementation
Productization
Future directions
Answers.com
Mission - Provide best answers about anything.
A popular web site (according to comScore,
March 2010):
#33 worldwide, with 75.8 million unique users
#18 in US, with 51.2 million unique users
WikiAnswers – community Q&A site (UGC)
ReferenceAnswers – editorial content
Atlas – internal search engine
Implicit search example: find similar
questions
Similar Questions
Case 31136
Enter BM25F
Query Q = (t_1, t_2, …, t_m)
Document D
Term frequency tf_i
How much should tf_i influence similarity?
Determine similarity by choosing weights
BM25F: saturation, soft length normalization, idf
weights and field weights.
$$ similarity(Q, D) = \sum_{t_i \in Q \cap D} w_i \cdot tf_i $$
Saturation
[Figure: Frequency Saturation. Saturated weight tf/(2+tf) versus term frequency tf, for tf from 0 to 30.]
Replace tf by tf/(k1+tf)
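The saturation step can be sketched in a few lines of Python. This is only an illustration: the function name `saturate` is made up here, and the default k1 = 2 is taken from the plotted curve tf/(2+tf), not from any stated production setting.

```python
def saturate(tf: float, k1: float = 2.0) -> float:
    """Map a raw term frequency to a weight in [0, 1) via tf / (k1 + tf).

    k1 = 2.0 mirrors the plotted curve; it is an assumed default.
    """
    if tf < 0:
        raise ValueError("term frequency must be non-negative")
    return tf / (k1 + tf)


if __name__ == "__main__":
    # Each extra occurrence adds less than the previous one.
    for tf in (0, 1, 2, 5, 10, 30):
        print(tf, round(saturate(tf), 3))
```

The weight grows quickly for the first few occurrences and then flattens out, so a term repeated 30 times counts for little more than one repeated 10 times.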
Soft Length Normalization
[Figure: length normalization. Normalized frequency versus document length, for lengths from 0 to 30.]
Replace tf by
$$ tf' = \frac{tf}{(1 - b) + b \cdot \frac{dl}{avdl}} $$
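As a rough sketch, soft length normalization divides tf by a factor that interpolates between 1 (no normalization, b = 0) and dl/avdl (full normalization, b = 1). The function name `normalize_tf` and the default b = 0.75 are assumptions for illustration; the slides do not give a value for b.

```python
def normalize_tf(tf: float, dl: float, avdl: float, b: float = 0.75) -> float:
    """Return tf / ((1 - b) + b * dl / avdl).

    dl is the document length, avdl the average document length.
    b = 0.75 is an assumed default, not a value from the slides.
    """
    if avdl <= 0:
        raise ValueError("average document length must be positive")
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    return tf / ((1.0 - b) + b * dl / avdl)


if __name__ == "__main__":
    # A document twice the average length has its frequency reduced.
    print(normalize_tf(3, dl=20, avdl=10))
```

A document of exactly average length keeps its raw frequency, longer documents are penalized, and shorter ones are boosted.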
Inverse Document Frequency (IDF)
$$ w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5} $$
[Figure: IDF weighting. IDF weight (w_i) versus number of documents containing the term (n_i), for n_i from 0 to 120.]
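A minimal sketch of the IDF weight, with N the number of documents and n_i the number of documents containing the term. The name `idf_weight` is illustrative, and the natural logarithm is an assumption, since the slides do not state the base.

```python
import math


def idf_weight(n_docs: int, n_with_term: int) -> float:
    """Return log((N - n_i + 0.5) / (n_i + 0.5)) using the natural log."""
    if not 0 <= n_with_term <= n_docs:
        raise ValueError("need 0 <= n_with_term <= n_docs")
    return math.log((n_docs - n_with_term + 0.5) / (n_with_term + 0.5))


if __name__ == "__main__":
    # Rare terms get large weights, common terms small ones.
    for n_i in (1, 10, 50, 90):
        print(n_i, round(idf_weight(100, n_i), 3))
```

Note that this form of IDF becomes negative once a term appears in more than half of the documents, which is one reason implementations often clamp or shift it.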
Field Weights
Every field has a different b (length verbosity parameter) and a different v
(field value parameter)
The BM25F Formula
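Field weighting:
$$ \tilde{tf}_i = \sum_{s=1}^{S} v_s \cdot \frac{tf_{si}}{B_s} $$
Field length normalization:
$$ B_s = (1 - b_s) + b_s \cdot \frac{sl_s}{avsl_s} $$
Saturation and IDF:
$$ w_i^{BM25F} = \frac{\tilde{tf}_i}{k_1 + \tilde{tf}_i} \cdot w_i^{IDF} $$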
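Putting the three steps together, the weight of a single query term in a single document could be computed roughly as follows. This is a sketch under stated assumptions, not the library's code: `FieldParams`, `pseudo_frequency` and `bm25f_term_weight` are hypothetical names, k1 = 2 is borrowed from the saturation plot, and the natural log is assumed for the IDF.

```python
import math
from dataclasses import dataclass


@dataclass
class FieldParams:
    """Per-field settings (hypothetical container)."""
    v: float        # field weight v_s
    b: float        # length normalization strength b_s, in [0, 1]
    avg_len: float  # average field length avsl_s


def pseudo_frequency(field_tfs, field_lens, params):
    """Sum over fields of v_s * tf_s / B_s, with B_s = (1 - b_s) + b_s * sl_s / avsl_s."""
    total = 0.0
    for name, tf in field_tfs.items():
        p = params[name]
        norm = (1.0 - p.b) + p.b * field_lens[name] / p.avg_len
        total += p.v * tf / norm
    return total


def bm25f_term_weight(field_tfs, field_lens, params, n_docs, n_with_term, k1=2.0):
    """Saturate the combined pseudo-frequency, then scale it by the IDF weight."""
    tf = pseudo_frequency(field_tfs, field_lens, params)
    idf = math.log((n_docs - n_with_term + 0.5) / (n_with_term + 0.5))
    return tf / (k1 + tf) * idf


if __name__ == "__main__":
    params = {"title": FieldParams(2.0, 0.5, 10.0),
              "body": FieldParams(1.0, 0.75, 100.0)}
    print(bm25f_term_weight({"title": 1, "body": 2},
                            {"title": 10, "body": 100},
                            params, n_docs=1000, n_with_term=10))
```

The point of the design is that field frequencies are combined before saturation, so a term repeated across several fields saturates once instead of being rewarded separately in each field.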
Joaquin’s Implementation
Joaquín Pérez Iglesias of UNED, Madrid, Spain
implemented a BM25F library for Lucene,
with the class BM25BooleanQuery
Algorithm:
Collect documents with query terms
Score individual terms using BM25F
Combine scores using addition to get Boolean query
score
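The three steps above can be illustrated with a toy in-memory index. This does not use Lucene or the BM25BooleanQuery class; `score_boolean_query`, the `postings` layout and the stand-in scorer are assumptions made for the example.

```python
from collections import defaultdict


def score_boolean_query(query_terms, postings, term_score):
    """Rank documents by the sum of their per-term scores.

    postings:   term -> {doc_id: term frequency}  (toy inverted index)
    term_score: callable (term, doc_id, tf) -> float, e.g. a BM25F term weight
    """
    scores = defaultdict(float)
    for term in query_terms:
        # Step 1: collect the documents that contain this query term.
        for doc_id, tf in postings.get(term, {}).items():
            # Steps 2 and 3: score the term, then combine by addition.
            scores[doc_id] += term_score(term, doc_id, tf)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    postings = {"lucene": {1: 2, 2: 1}, "bm25": {1: 2}}
    # Stand-in scorer: saturation only, with k1 = 2.
    ranked = score_boolean_query(["lucene", "bm25"], postings,
                                 lambda term, doc, tf: tf / (2 + tf))
    print(ranked)
```

Documents that match more query terms accumulate more per-term scores, which gives the additive behaviour of a Boolean OR query.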
BM25F Usefulness for Our Case
Short texts
Term repetitions hurt relevance for short texts
Want to combine different fields (in the future,
different information sources)
Initial experiments showed good relevance, but…
Feeling Safe to make Changes
How can we be sure not to break anything?
Added Unit Tests
(This is almost a Lucene standard, but not in
Academia…)
Production Challenges – Performance
Can this library handle 10M queries daily?
Initial Runtimes:
| | Average runtime (ms) | Median runtime (ms) |
| Standard Lucene scoring | 161 | 119 |
| BM25F | 273 | 209 |
| Difference | 68% | 75% |
Improving Performance
Addressed using:
Benchmarking
Profiling
Refactoring, to give
| | Average runtime (ms) | Median runtime (ms) |
| Standard Lucene scoring | 93 | 65 |
| BM25F | 92 | 70 |
| Difference | -1% | 8% |
Production Challenges – Robustness
Lots of strange user inputs, e.g.
////////////////////////////////////////
;-)
fdsfdsdfsdffssssssfsfsfs
Addressed using more careful tokenization
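The slides do not say what "more careful tokenization" involved. One plausible sketch, purely as an assumption, is to keep only alphanumeric tokens and cap the query length, so that punctuation-only input yields an empty query that can be rejected before scoring. The names `safe_tokens` and `max_tokens` are hypothetical.

```python
import re

# Runs of letters and digits; punctuation, whitespace and underscores split tokens.
TOKEN_RE = re.compile(r"[^\W_]+")


def safe_tokens(text: str, max_tokens: int = 32) -> list:
    """Lowercase alphanumeric tokens, capped at max_tokens (an assumed limit)."""
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    return tokens[:max_tokens]


if __name__ == "__main__":
    for raw in ("////////////////////////////////////////", ";-)",
                "fdsfdsdfsdffssssssfsfsfs", "What is BM25?"):
        print(repr(raw), "->", safe_tokens(raw))
```

With this approach the first two inputs produce no tokens at all, while the keyboard-mash input becomes a single token that simply matches nothing in the index.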
Production Challenges – Integration and Interoperability
Needs data not currently in Lucene index:
Average Field Lengths
Document-level IDF
We calculated the first externally and
approximated the second using longest field IDF
Library does not play nicely with others – not
recursive
BM25 Library supports BooleanQuery, not
phrases, prefix, etc.
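Computing average field lengths outside the index could look roughly like the sketch below, which scans tokenized documents once. The input layout and the names `average_field_lengths` and `longest_field` are assumptions. The second helper reflects one reading of "longest field IDF", namely that document frequencies are taken from the field with the largest average length; the slides do not spell this out.

```python
from collections import defaultdict


def average_field_lengths(docs):
    """Average token count per field, over the documents that have that field.

    docs: iterable of {field_name: list_of_tokens}  (assumed input layout)
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for doc in docs:
        for field, tokens in doc.items():
            totals[field] += len(tokens)
            counts[field] += 1
    return {field: totals[field] / counts[field] for field in totals}


def longest_field(avg_lengths):
    """Name of the field with the largest average length."""
    return max(avg_lengths, key=avg_lengths.get)


if __name__ == "__main__":
    docs = [{"title": ["a", "b"], "body": ["x"] * 10},
            {"title": ["c", "d", "e", "f"], "body": ["y"] * 30}]
    averages = average_field_lengths(docs)
    print(averages, longest_field(averages))
```

The resulting averages would then be supplied to the scorer as the per-field avsl values, since they are not stored in the index itself.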
Remember case 31136?
Well, she’s mostly pleased…
BM25 runs in our production environment
Supporting tens of millions of queries daily
Future Work
LUCENE-2091 – Our suggested contrib patch
LUCENE-2392 – Current work on making Lucene
scoring more flexible, to incorporate BM25 as well
as other models
We want to incorporate BM25 scoring into Solr
Could this be faster as well?
References
Integrating the Probabilistic Model BM25/BM25F
into Lucene – Joaquin Perez Iglesias
The Probabilistic Relevance Framework: BM25
and Beyond – Stephen Robertson and Hugo
Zaragoza
Working Effectively with Legacy Code – Michael
Feathers