BM25 Scoring for Lucene: From Academia to Industry

Yuval Feinstein

Answers Corporation


Apache Lucene EuroCon 2010 Meetup

Prague, May 2010

Overview

Answers.com

A Relevance problem

BM25F - a possible solution

Joaquin’s Implementation

Productization

Future directions


Answers.com

Mission - Provide best answers about anything.

A popular web site (according to comScore, March 2010):

#33 worldwide, with 75.8 million unique users

#18 in US, with 51.2 million unique users

WikiAnswers – community Q&A site (UGC)

ReferenceAnswers – editorial content

Atlas – internal search engine

Implicit search example: find similar questions

Similar Questions


Case 31136


Enter BM25F

Query Q = (t_1, t_2, …, t_m)

Document D

Term frequency tf_i

How much should tf_i influence similarity?

Determine similarity by choosing weights

BM25F: saturation, soft length normalization, IDF weights and field weights.

$$\mathrm{similarity}(Q, D) = \sum_{t_i \in Q \cap D} w_i \cdot tf_i$$

Saturation

[Figure: Frequency Saturation. x-axis: term frequency tf (0 to 30); y-axis: saturated weight tf/(2+tf) (0 to 1).]

Replace tf by tf/(k1+tf)
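The saturation step can be illustrated with a few lines of Python. This is only a sketch: the function name `saturate` is made up for this example, and the default k1 = 2 is taken from the plotted curve tf/(2+tf), not from any library setting.

```python
def saturate(tf: float, k1: float = 2.0) -> float:
    """Map a raw term frequency to a bounded weight tf / (k1 + tf)."""
    if tf < 0 or k1 <= 0:
        raise ValueError("tf must be non-negative and k1 positive")
    return tf / (k1 + tf)


if __name__ == "__main__":
    # Each extra occurrence adds less than the previous one,
    # and the weight never reaches 1.
    for tf in (0, 1, 2, 5, 10, 30):
        print(f"tf={tf:2d} -> weight={saturate(tf):.3f}")
```

With k1 = 2 the weight reaches 0.5 at tf = 2, so the first couple of occurrences carry most of the signal.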

Soft Length Normalization

[Figure: Length normalization. x-axis: document length (0 to 30); y-axis: normalized frequency (0 to 2).]

Replace tf by

$$tf' = \frac{tf}{(1 - b) + b \cdot \frac{dl}{avdl}}$$
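A minimal Python sketch of soft length normalization follows. The function name `normalize_tf` is hypothetical, and the default b = 0.75 is a commonly used value assumed here for illustration; the slides do not state which b was used.

```python
def normalize_tf(tf: float, dl: float, avdl: float, b: float = 0.75) -> float:
    """Soft length normalization: tf' = tf / ((1 - b) + b * dl / avdl).

    dl   -- length of this document
    avdl -- average document length in the collection
    b    -- 0 disables normalization, 1 normalizes fully by relative length
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    if avdl <= 0:
        raise ValueError("avdl must be positive")
    return tf / ((1.0 - b) + b * dl / avdl)
```

A document of exactly average length keeps its tf unchanged; longer documents have tf reduced and shorter ones have it boosted, with b controlling how strongly.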

Inverse Document Frequency (IDF)

$$w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5}$$

[Figure: IDF weighting. x-axis: number of documents containing the term, n_i (0 to 120); y-axis: IDF weight w_i (0 to 2.5).]
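The IDF weight can be sketched in Python as below. The function name `idf_weight` is an assumption for this example, and the natural logarithm is used because the slide does not state the base.

```python
import math


def idf_weight(n_docs: int, doc_freq: int) -> float:
    """IDF weight log((N - n_i + 0.5) / (n_i + 0.5)).

    n_docs   -- N, the number of documents in the collection
    doc_freq -- n_i, the number of documents containing the term
    """
    if not 0 <= doc_freq <= n_docs:
        raise ValueError("doc_freq must lie between 0 and n_docs")
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
```

Rare terms get large weights. With this form the weight drops to zero when a term appears in exactly half of the documents and becomes negative beyond that, so an implementation may want to clamp it.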

Field Weights


Every field has its own b (length verbosity parameter) and its own v (field value parameter).

The BM25F Formula

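Field weighting:

$$\tilde{tf}_i = \sum_{s=1}^{S} v_s \cdot \frac{tf_{si}}{B_s}$$

Field length normalization:

$$B_s = (1 - b_s) + b_s \cdot \frac{sl_s}{avsl_s}$$

Saturation and IDF:

$$w_i^{BM25F} = \frac{\tilde{tf}_i}{k_1 + \tilde{tf}_i} \cdot w_i^{IDF}$$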
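Putting the three parts of the formula together, a document score can be sketched in Python as follows. This is an illustrative sketch, not the library's code: `FieldParams`, `bm25f_score`, the token-list document representation and the default k1 = 2 are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FieldParams:
    v: float     # field weight v_s
    b: float     # length normalization strength b_s
    avsl: float  # average length of this field over the collection


def bm25f_score(query: List[str],
                doc: Dict[str, List[str]],
                fields: Dict[str, FieldParams],
                doc_freq: Dict[str, int],
                n_docs: int,
                k1: float = 2.0) -> float:
    """Sum of per-term BM25F weights for one document."""
    score = 0.0
    for term in set(query):
        # Field weighting and field length normalization:
        # accumulate v_s * tf_si / B_s over all fields.
        tf_tilde = 0.0
        for name, p in fields.items():
            tokens = doc.get(name, [])
            tf = tokens.count(term)
            if tf == 0:
                continue
            b_s = (1.0 - p.b) + p.b * len(tokens) / p.avsl
            tf_tilde += p.v * tf / b_s
        if tf_tilde == 0.0:
            continue  # term does not occur in this document
        # Saturation and IDF.
        n_i = doc_freq.get(term, 0)
        idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
        score += tf_tilde / (k1 + tf_tilde) * idf
    return score
```

Note that the field frequencies are combined before saturation is applied, so a term repeated across several fields saturates once rather than once per field.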

Joaquin’s Implementation

Joaquín Pérez Iglesias of UNED, Madrid, Spain implemented a BM25F library for Lucene, with the class BM25BooleanQuery

Algorithm:

Collect documents with query terms

Score individual terms using BM25F

Combine scores using addition to get Boolean query score
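The three steps above can be sketched over a toy in-memory inverted index. This is not the BM25BooleanQuery code; `boolean_or_search`, the `Postings` layout and the pluggable `term_score` callback are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# term -> {doc_id: term frequency}; a stand-in for a real index
Postings = Dict[str, Dict[int, int]]


def boolean_or_search(query: List[str],
                      postings: Postings,
                      term_score: Callable[[str, int, int], float],
                      top_k: int = 10) -> List[Tuple[int, float]]:
    """Return up to top_k (doc_id, score) pairs, best first."""
    totals: Dict[int, float] = defaultdict(float)
    for term in set(query):
        # Step 1: collect the documents that contain this query term.
        for doc_id, tf in postings.get(term, {}).items():
            # Step 2: score the individual term.
            # Step 3: combine by addition.
            totals[doc_id] += term_score(term, doc_id, tf)
    # Sort by descending score, breaking ties by doc_id.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

For example, calling it with `lambda term, doc_id, tf: tf / (2.0 + tf)` as the scorer ranks documents by summed saturated frequency; a BM25F weight function would be plugged in the same way.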


BM25F Usefulness for Our Case

Short texts

Term repetitions hurt relevance for short texts

Want to combine different fields (in the future, different information sources)

Initial experiments showed good relevance, but…


Feeling Safe to make Changes

How can we be sure not to break anything?

Added unit tests

(This is almost a Lucene standard, but not in academia…)


Production Challenges – Performance

Can this library handle 10M queries daily?

Initial runtimes:

Scoring                   Average runtime (ms)   Median runtime (ms)
Standard Lucene scoring   161                    119
BM25F                     273                    209
Difference                68%                    75%

Improving Performance

Addressed using:

Benchmarking

Profiling

Refactoring, to give

Scoring                   Average runtime (ms)   Median runtime (ms)
Standard Lucene scoring   93                     65
BM25F                     92                     70
Difference                -1%                    8%
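A comparison of this kind can be reproduced with a small timing harness. The sketch below is an assumption about how such numbers might be gathered, not a description of the benchmark actually used; `benchmark`, `relative_difference` and the `run_query` callback are hypothetical names.

```python
import statistics
import time
from typing import Callable, Iterable, Tuple


def benchmark(run_query: Callable[[str], object],
              queries: Iterable[str]) -> Tuple[float, float]:
    """Return (mean, median) wall-clock runtime per query in milliseconds."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        timings.append((time.perf_counter() - start) * 1000.0)
    if not timings:
        raise ValueError("no queries to benchmark")
    return statistics.mean(timings), statistics.median(timings)


def relative_difference(candidate: float, baseline: float) -> float:
    """Percentage by which candidate exceeds baseline (negative if faster)."""
    return (candidate - baseline) / baseline * 100.0
```

Applied to the table above, `relative_difference(92, 93)` is about -1% and `relative_difference(70, 65)` is about 8%. Reporting the median alongside the mean helps because a few slow queries can dominate the average.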

Production Challenges – Robustness

Lots of strange user inputs, e.g.:

////////////////////////////////////////

;-)

fdsfdsdfsdffssssssfsfsfs

Addressed using more careful tokenization
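The slides do not say what the more careful tokenization consisted of. One possible defensive tokenizer is sketched below; the function name `safe_tokenize`, the ASCII-only token pattern and both limits are assumptions for illustration, and a real deployment would use a proper Lucene analyzer.

```python
import re
from typing import List

# Simplification: only ASCII letters and digits count as token characters.
_TOKEN = re.compile(r"[a-z0-9]+")


def safe_tokenize(text: str,
                  max_token_len: int = 20,
                  max_tokens: int = 32) -> List[str]:
    """Tokenize user input, discarding punctuation-only and junk tokens."""
    tokens: List[str] = []
    for tok in _TOKEN.findall(text.lower()):
        if len(tok) > max_token_len:
            continue  # heuristic: very long tokens are likely keyboard mashing
        tokens.append(tok)
        if len(tokens) == max_tokens:
            break  # cap the query size so scoring cost stays bounded
    return tokens
```

With these settings the three example inputs above all yield an empty token list, which the caller can treat as "no query" instead of passing garbage to the scorer.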

Production Challenges – Integration and Interoperability

Needs data not currently in the Lucene index:

Average field lengths

Document-level IDF

We calculated the first externally and approximated the second using the IDF of the longest field

Library does not play nicely with others – not recursive

BM25 library supports BooleanQuery, but not phrase queries, prefix queries, etc.
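Computing average field lengths outside the index amounts to a single pass over the documents. The sketch below assumes documents are available as dictionaries of token lists; `average_field_lengths` is a hypothetical helper, and counting a missing field as length zero is a design choice made here, not something the slides specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def average_field_lengths(docs: Iterable[Dict[str, List[str]]]) -> Dict[str, float]:
    """Average token count per field over all documents.

    A document that lacks a field contributes length 0 for that field.
    """
    total_len: Dict[str, int] = defaultdict(int)
    n_docs = 0
    for doc in docs:
        n_docs += 1
        for field, tokens in doc.items():
            total_len[field] += len(tokens)
    if n_docs == 0:
        return {}
    return {field: total / n_docs for field, total in total_len.items()}
```

The resulting values would be supplied to the scorer as the avsl parameter of each field and refreshed whenever the index changes substantially.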

Remember case 31136?

Well, she’s mostly pleased…

BM25 runs in our production environment, supporting tens of millions of queries daily

Future Work

LUCENE-2091 – Our suggested contrib patch

LUCENE-2392 – Current work on making Lucene scoring more flexible, to incorporate BM25 as well as other models

We want to incorporate BM25 scoring into Solr

Could this be faster as well?


https://issues.apache.org/jira/browse/LUCENE-2091






References

Integrating the Probabilistic Model BM25/BM25F into Lucene – Joaquín Pérez Iglesias
http://nlp.uned.es/~jperezi/Lucene-BM25/

The Probabilistic Relevance Framework: BM25 and Beyond – Stephen Robertson and Hugo Zaragoza
http://www.zaragozas.info/hugo/academic/pdf/robertson_FTIR09.pdf

Working Effectively with Legacy Code – Michael Feathers
http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052

Date post:	21-May-2015
Category:	Technology
Upload:	yuvalf