
BM25 Scoring for Lucene: From Academia to Industry

Date post: 21-May-2015
Upload: yuvalf
Description:
Slides from a talk about the BM25 library given at the Meetup session of Apache Lucene Eurocon 2010 in Prague on May 20th, 2010.
Transcript
Page 1: BM25 Scoring for Lucene: From Academia to Industry

Yuval Feinstein

Answers Corporation

BM25 Scoring for Lucene:

From Academia to Industry

Apache Lucene EuroCon 2010 Meetup

Prague, May 2010

Page 2: BM25 Scoring for Lucene: From Academia to Industry

Overview

Answers.com

A Relevance problem

BM25F - a possible solution

Joaquin’s Implementation

Productization

Future directions

Page 3: BM25 Scoring for Lucene: From Academia to Industry

Answers.com

Mission - Provide best answers about anything.

A popular web site (according to comScore, March 2010):

#33 worldwide, with 75.8 million unique users

#18 in US, with 51.2 million unique users

WikiAnswers – community Q&A site (UGC)

ReferenceAnswers – editorial content

Atlas – internal search engine

Implicit search example: find similar questions

Page 4: BM25 Scoring for Lucene: From Academia to Industry

Similar Questions

Page 5: BM25 Scoring for Lucene: From Academia to Industry

Case 31136

Page 6: BM25 Scoring for Lucene: From Academia to Industry

Enter BM25F

Query Q = (t1, t2, …, tm)

Document D

Term frequency tf_i

How much should tf_i influence similarity?

Determine similarity by choosing weights

BM25F: saturation, soft length normalization, IDF weights and field weights.

$$ \mathrm{similarity}(Q, D) = \sum_{t_i \in Q \cap D} w_i \cdot tf_i $$

Page 7: BM25 Scoring for Lucene: From Academia to Industry

Saturation

[Figure: Frequency Saturation. Plot of the saturated weight tf/(2+tf), ranging from 0 to 1, against term frequency tf from 0 to 30.]

Replace tf by tf/(k1+tf)
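
A minimal sketch of the saturation step, to show how quickly the weight flattens out. The function name `saturate` is hypothetical, and the default k1 = 2 is taken from the plotted curve tf/(2+tf) rather than from any stated production setting.

```python
def saturate(tf: float, k1: float = 2.0) -> float:
    """Map a raw term frequency to the bounded weight tf / (k1 + tf)."""
    if tf < 0 or k1 <= 0:
        raise ValueError("tf must be non-negative and k1 must be positive")
    return tf / (k1 + tf)


# The weight grows quickly for the first few occurrences and then levels off,
# always staying below 1.
for tf in (0, 1, 2, 5, 10, 30):
    print(f"tf={tf:2d}  weight={saturate(tf):.3f}")
```

With k1 = 2 the weight reaches one half at tf = 2, so later repetitions of a term add progressively less.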

Page 8: BM25 Scoring for Lucene: From Academia to Industry

Soft Length Normalization

[Figure: length normalization. Plot of normalized frequency, ranging from 0 to 2, against document length from 0 to 30.]

Replace tf by

$$ tf' = \frac{tf}{(1 - b) + b \cdot \frac{dl}{avdl}} $$
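
A small sketch of soft length normalization, assuming dl is the document length and avdl the average document length, both in tokens. The function name `normalize_tf` and the default b = 0.75 are assumptions for illustration; the slides do not state a value for b.

```python
def normalize_tf(tf: float, dl: float, avdl: float, b: float = 0.75) -> float:
    """Return tf' = tf / ((1 - b) + b * dl / avdl).

    b = 0 switches length normalization off; b = 1 normalizes fully.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    if avdl <= 0:
        raise ValueError("avdl must be positive")
    return tf / ((1.0 - b) + b * dl / avdl)


# A document of average length keeps its tf; longer documents are damped
# and shorter ones are boosted.
for dl in (5, 10, 20):
    print(f"dl={dl:2d}  tf'={normalize_tf(3, dl, avdl=10):.3f}")
```

The normalization is "soft" because b interpolates between no correction and full division by relative length.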

Page 9: BM25 Scoring for Lucene: From Academia to Industry

Inverse Document Frequency (IDF)

$$ w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5} $$

[Figure: IDF weighting. Plot of the IDF weight (w_i), ranging from 0 to 2.5, against the number of documents containing the term (n_i), from 0 to 120.]
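
A sketch of the IDF weight, assuming N is the total number of documents and n_i the number of documents containing term i. The function name `idf_weight` is hypothetical, and the natural logarithm is an assumption, since the slides do not state the base.

```python
import math


def idf_weight(n_i: int, N: int) -> float:
    """Return log((N - n_i + 0.5) / (n_i + 0.5)) using the natural log."""
    if not 0 <= n_i <= N:
        raise ValueError("n_i must lie between 0 and N")
    return math.log((N - n_i + 0.5) / (n_i + 0.5))


# Rare terms get large weights; the weight crosses zero at n_i = N / 2.
for n_i in (1, 10, 50, 90):
    print(f"n_i={n_i:2d}  w={idf_weight(n_i, 100):+.3f}")
```

Note that this form goes negative once a term appears in more than half of the documents, which some implementations clamp or shift to avoid.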

Page 10: BM25 Scoring for Lucene: From Academia to Industry

Field Weights

Every field has a different b (length verbosity parameter) and a different v (field value parameter)

Page 11: BM25 Scoring for Lucene: From Academia to Industry

The BM25F Formula

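Field weighting:

$$ \widetilde{tf}_i = \sum_{s=1}^{S} v_s \cdot \frac{tf_{si}}{B_s} $$

Field length normalization:

$$ B_s = (1 - b_s) + b_s \cdot \frac{sl_s}{avsl} $$

Saturation and IDF:

$$ w_i^{BM25F} = \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} \cdot w_i^{IDF} $$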
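
The three steps can be sketched as follows for a single query term. This is an illustration under stated assumptions, not the library's code: `FieldParams`, `pseudo_tf` and `bm25f_term_weight` are hypothetical names, the average field length is kept per field, and the parameter values in the example are made up.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class FieldParams:
    v: float     # field value (weight) parameter
    b: float     # length verbosity parameter, in [0, 1]
    avsl: float  # average length of this field, in tokens


def pseudo_tf(field_tfs: Mapping[str, int],
              field_lengths: Mapping[str, int],
              params: Mapping[str, FieldParams]) -> float:
    """Field weighting: sum over fields of v_s * tf_si / B_s."""
    total = 0.0
    for name, p in params.items():
        tf = field_tfs.get(name, 0)
        if tf == 0:
            continue  # the term does not occur in this field
        # Field length normalization: B_s = (1 - b_s) + b_s * sl_s / avsl.
        B = (1.0 - p.b) + p.b * field_lengths[name] / p.avsl
        total += p.v * tf / B
    return total


def bm25f_term_weight(tf_tilde: float, idf: float, k1: float = 2.0) -> float:
    """Saturation and IDF: tf~ / (k1 + tf~) * idf."""
    return tf_tilde / (k1 + tf_tilde) * idf


# Hypothetical two-field setup: the title counts twice as much as the body.
params = {"title": FieldParams(v=2.0, b=0.5, avsl=10.0),
          "body": FieldParams(v=1.0, b=0.75, avsl=100.0)}
tf_tilde = pseudo_tf({"title": 1, "body": 2}, {"title": 10, "body": 100}, params)
print(tf_tilde, bm25f_term_weight(tf_tilde, idf=1.5))
```

Combining the field frequencies before saturation means a term repeated across several fields saturates once, rather than once per field.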

Page 12: BM25 Scoring for Lucene: From Academia to Industry

Joaquin’s Implementation

Joaquín Pérez Iglesias of UNED, Madrid, Spain implemented a BM25F library for Lucene, with the class BM25BooleanQuery

Algorithm:

Collect documents with query terms

Score individual terms using BM25F

Combine scores using addition to get Boolean query score
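
The three algorithm steps above can be sketched on a toy in-memory index. This is a simplified single-field illustration, not the BM25BooleanQuery implementation: the function names, whitespace tokenization, parameter defaults and sample documents are all assumptions.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Return (term -> {doc_id: tf}, doc_id -> length) for whitespace tokens."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def search(query, index, lengths, k1=2.0, b=0.75):
    """Score every document containing a query term; sum the term scores."""
    N = len(lengths)
    if N == 0:
        return []
    avdl = sum(lengths.values()) / N
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, {})          # step 1: collect documents
        n_i = len(postings)
        idf = math.log((N - n_i + 0.5) / (n_i + 0.5))
        for doc_id, tf in postings.items():     # step 2: score the term
            norm_tf = tf / ((1 - b) + b * lengths[doc_id] / avdl)
            scores[doc_id] += norm_tf / (k1 + norm_tf) * idf  # step 3: add
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    1: "how do i bake bread",
    2: "how do i fix a bike",
    3: "what is the capital of france",
    4: "where can i buy fresh bread",
    5: "who wrote hamlet",
}
index, lengths = build_index(docs)
results = search("bake bread", index, lengths)
print(results)
```

Only documents that contain at least one query term receive a score, and a document matching both terms ranks above one matching a single term.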

Page 13: BM25 Scoring for Lucene: From Academia to Industry

BM25F Usefulness for Our Case

Short texts

Term repetitions hurt relevance for short texts

Want to combine different fields (in the future, different information sources)

Initial experiments showed nice relevance, but…

Page 14: BM25 Scoring for Lucene: From Academia to Industry

Feeling Safe to make Changes

How can we be sure not to break anything?

Added Unit Tests

(This is almost a Lucene standard, but not in Academia…)

Page 15: BM25 Scoring for Lucene: From Academia to Industry

Production Challenges – Performance

Can this library handle 10M queries daily?

Initial runtimes:

                          Average runtime (ms)   Median runtime (ms)
Standard Lucene scoring   161                    119
BM25F                     273                    209
Difference                68%                    75%

Page 16: BM25 Scoring for Lucene: From Academia to Industry

Improving Performance

Addressed using:

Benchmarking

Profiling

Refactoring, to give:

                          Average runtime (ms)   Median runtime (ms)
Standard Lucene scoring   93                     65
BM25F                     92                     70
Difference                -1%                    8%
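
A sketch of how such a comparison might be summarized, assuming per-query runtimes in milliseconds have already been collected for each scorer. The function names and the sample runtimes are hypothetical; the final two lines reuse the averages and medians from the table above.

```python
from statistics import mean, median


def summarize(runtimes_ms):
    """Return (average, median) of a list of per-query runtimes."""
    return mean(runtimes_ms), median(runtimes_ms)


def relative_difference(baseline: float, candidate: float) -> float:
    """Percentage by which candidate exceeds baseline (negative if faster)."""
    return 100.0 * (candidate - baseline) / baseline


# Hypothetical sample: one slow query pulls the average well above the median.
print(summarize([10, 20, 90]))

# Differences recomputed from the table values after refactoring.
print(round(relative_difference(93, 92)))  # average runtime
print(round(relative_difference(65, 70)))  # median runtime
```

Reporting both statistics is useful because a few slow outlier queries move the average much more than the median.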

Page 17: BM25 Scoring for Lucene: From Academia to Industry

Production Challenges – Robustness

Lots of strange user inputs, e.g.

////////////////////////////////////////

;-)

fdsfdsdfsdffssssssfsfsfs

Addressed using more careful tokenization
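
The slides do not say what the more careful tokenization consisted of. One possible sketch, purely as an assumption, keeps only ASCII alphanumeric runs and drops implausibly long tokens; `clean_tokens` and both limits are hypothetical.

```python
import re

# Runs of lowercase ASCII letters and digits; everything else is a separator.
_TOKEN = re.compile(r"[a-z0-9]+")


def clean_tokens(text: str, max_token_len: int = 20, max_tokens: int = 32):
    """Tokenize defensively: punctuation-only input yields no tokens,
    overlong tokens are dropped, and the token count is capped."""
    tokens = _TOKEN.findall(text.lower())
    kept = [t for t in tokens if len(t) <= max_token_len]
    return kept[:max_tokens]


for raw in ("////////////////////////////////////////", ";-)",
            "fdsfdsdfsdffssssssfsfsfs", "How do I bake bread?"):
    print(repr(raw), "->", clean_tokens(raw))
```

A query that yields no tokens can then be rejected before it reaches the scorer. The length cut-off is a blunt heuristic and would also discard legitimate long words.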

Page 18: BM25 Scoring for Lucene: From Academia to Industry

Production Challenges – Integration and Interoperability

Needs data not currently in Lucene index:

Average Field Lengths

Document-level IDF

We calculated the first externally and approximated the second using longest field IDF

Library does not play nicely with others – not recursive

BM25 Library supports BooleanQuery, not phrases, prefix, etc.
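
Computing average field lengths outside the index might look like the sketch below. It assumes documents are available as dictionaries of field name to text, counts whitespace-separated tokens, and averages each field only over the documents that contain it; none of this is taken from the actual pipeline.

```python
from collections import defaultdict


def average_field_lengths(docs):
    """Return {field: mean token count} over documents containing the field."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for doc in docs:
        for field, text in doc.items():
            totals[field] += len(text.split())
            counts[field] += 1
    return {field: totals[field] / counts[field] for field in totals}


# Hypothetical two-document corpus with a title and a body field.
corpus = [
    {"title": "bake bread", "body": "how do i bake bread at home"},
    {"title": "fix a bike", "body": "my bike has a flat tire"},
]
averages = average_field_lengths(corpus)
print(averages)
```

The resulting averages would then be supplied to the scorer as the per-field average lengths used in length normalization.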

Page 19: BM25 Scoring for Lucene: From Academia to Industry

Remember case 31136?

Well, she’s mostly pleased…

BM25 runs in our production environment

Supporting 10s of millions of queries daily

Page 20: BM25 Scoring for Lucene: From Academia to Industry

Future Work

LUCENE-2091 – Our suggested contrib patch

LUCENE-2392 – Current work on making Lucene scoring more flexible, to incorporate BM25 as well as other models

We want to incorporate BM25 scoring into Solr

Could this be faster as well?
