Beyond tf idf why, what & how

Date post: 08-Jan-2017
Upload: lucenerevolution
Beyond TF-IDF Stephen Murtagh etsy.com
Transcript
Page 1: Beyond tf idf why, what & how

Beyond TF-IDF

Stephen Murtagh, etsy.com

Page 2: Beyond tf idf why, what & how
Page 3: Beyond tf idf why, what & how

20,000,000 items

Page 4: Beyond tf idf why, what & how

1,000,000 sellers

Page 5: Beyond tf idf why, what & how
Page 6: Beyond tf idf why, what & how

15,000,000 daily searches

80,000,000 daily calls to Solr

Page 7: Beyond tf idf why, what & how

Etsy Engineering

• Code as Craft - our engineering blog

• http://codeascraft.etsy.com/

• Continuous Deployment

• https://github.com/etsy/deployinator

• Experiment-driven culture

• Hybrid engineering roles

• Dev-Ops

• Data-Driven Products

Page 8: Beyond tf idf why, what & how

Etsy Search

• 2 search clusters: Flip and Flop

• Master -> 20 slaves

• Only one cluster takes traffic

• Thrift (no HTTP endpoint)

• BitTorrent for index replication

• Solr 4.1

• Incremental index every 12 minutes

Page 9: Beyond tf idf why, what & how

Beyond TF-IDF

• Why?

• What?

• How?

Page 10: Beyond tf idf why, what & how
Page 11: Beyond tf idf why, what & how

Luggage tags

“unique bag”

q = unique+bag

Page 12: Beyond tf idf why, what & how

q = unique+bag

>

Page 13: Beyond tf idf why, what & how

Scoring in Lucene

Page 14: Beyond tf idf why, what & how

Scoring in Lucene

Fixed for any given query

constant

Page 15: Beyond tf idf why, what & how

Scoring in Lucene

f(term, document)

f(term)

Page 16: Beyond tf idf why, what & how

Scoring in Lucene

User content

Only measure rarity

Page 17: Beyond tf idf why, what & how

IDF(“unique”) = 4.429547 > 4.32836 = IDF(“bag”)

Page 18: Beyond tf idf why, what & how

q = unique+bag

score(“unique unique bag”) > score(“unique bag bag”)

Page 19: Beyond tf idf why, what & how

“unique” tells us nothing...

Page 20: Beyond tf idf why, what & how

Stop words

• Add “unique” to stop word list?

• What about “handmade” or “blue”?

• Low-information words can still be useful for matching

• ... but harmful for ranking

Page 21: Beyond tf idf why, what & how

Why not replace IDF?

Page 22: Beyond tf idf why, what & how

Beyond TF-IDF

• Why?

  • IDF ignores term “usefulness”

• What?

• How?

Page 24: Beyond tf idf why, what & how

What do we replace it with?

Page 26: Beyond tf idf why, what & how

Benefits of IDF

I_1 =

          doc_1  doc_2  doc_3  ...  doc_n
art           2      0      1  ...      1
jewelry       1      3      0  ...      0
...
term_m        1      0      1  ...      0

$$\mathrm{IDF}(\text{jewelry}) = 1 + \log\left(\frac{n}{\sum_d i_{d,\text{jewelry}}}\right)$$
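A minimal sketch of the IDF computation on a toy term-document matrix. The counts are illustrative, and the sketch assumes that the summed quantity is a 0/1 indicator of whether a document contains the term (so the sum is the document frequency). The helper name `idf` is hypothetical and is not a Lucene API.

```python
import math

# Toy term-document count matrix (rows: terms, columns: documents).
# Values are illustrative, not real index data.
counts = {
    "art":     [2, 0, 1, 1],
    "jewelry": [1, 3, 0, 0],
}

def idf(term, matrix):
    """Return 1 + log(n / document frequency) using the natural log."""
    row = matrix[term]
    n = len(row)                              # number of documents
    doc_freq = sum(1 for c in row if c > 0)   # documents containing the term
    return 1.0 + math.log(n / doc_freq)

for term in counts:
    print(term, round(idf(term, counts), 3))
```

The score depends only on how many documents contain the term, which is why two words with very different usefulness can end up with nearly the same IDF.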

Page 28: Beyond tf idf why, what & how

Sharding

I_1 =

          doc_1  doc_2  doc_3  ...  doc_k
art           2      0      1  ...      1
jewelry       1      3      0  ...      0
...
term_m        1      0      1  ...      0

I_2 =

          doc_{k+1}  doc_{k+2}  doc_{k+3}  ...  doc_n
art               6          1          0  ...      1
jewelry           0          1          3  ...      0
...
term_m            0          1          1  ...      0

$$\mathrm{IDF}(\text{jewelry}) = 1 + \log\left(\frac{n}{\sum_d i_{d,\text{jewelry}}}\right)$$

Page 29: Beyond tf idf why, what & how

Sharding

I_1 =

          doc_1  doc_2  doc_3  ...  doc_k
art           2      0      1  ...      1
jewelry       1      3      0  ...      0
...
term_m        1      0      1  ...      0

I_2 =

          doc_{k+1}  doc_{k+2}  doc_{k+3}  ...  doc_n
art               6          1          0  ...      1
jewelry           0          1          3  ...      0
...
term_m            0          1          1  ...      0

$$\mathrm{IDF}_1(\text{jewelry}) \neq \mathrm{IDF}_2(\text{jewelry}) \neq \mathrm{IDF}(\text{jewelry})$$
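To make the sharding problem concrete, here is a small sketch with made-up counts for one term on two shards. It assumes IDF is computed as 1 + log(n / document frequency) and shows that each shard's local value differs from the value computed over the combined statistics. All names are hypothetical.

```python
import math

def idf(n, doc_freq):
    """1 + log(n / document frequency), natural log."""
    return 1.0 + math.log(n / doc_freq)

def stats(row):
    """(number of documents, documents containing the term) for one shard."""
    return len(row), sum(1 for c in row if c > 0)

# Hypothetical per-document counts of "jewelry" on two shards.
shard1 = [1, 3, 0, 0]
shard2 = [0, 1, 3, 1, 2, 4]

n1, df1 = stats(shard1)
n2, df2 = stats(shard2)

idf1 = idf(n1, df1)                   # what shard 1 computes locally
idf2 = idf(n2, df2)                   # what shard 2 computes locally
idf_global = idf(n1 + n2, df1 + df2)  # what a central source would compute

print(idf1, idf2, idf_global)
```

Summing the per-shard statistics before taking the logarithm is the step a central source would perform. Averaging the per-shard IDF values would not give the same result.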

Page 30: Beyond tf idf why, what & how

Sharded IDF options

• Ignore it - Shards score differently

• Shards exchange stats - Messy

• Central source distributes IDF to shards

Page 31: Beyond tf idf why, what & how

Information Gain

• P(x) - Probability of "x" appearing in a listing

• P(x|y) - Probability of "x" appearing given "y" appears

$$\mathrm{info}(y) = D\big(P(X \mid y) \,\|\, P(X)\big)$$

$$\mathrm{info}(y) = \sum_{x \in X} \log\left(\frac{P(x \mid y)}{P(x)}\right) \cdot P(x \mid y)$$
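The definition is a Kullback-Leibler divergence between the term distribution conditioned on y and the overall term distribution. A minimal sketch, assuming both distributions are given as dictionaries over the same terms and that the natural log is used (the slides do not state the base). The probabilities below are invented for illustration.

```python
import math

def info(p_given_y, p):
    """KL divergence D(P(X|y) || P(X)) over a shared set of terms."""
    total = 0.0
    for x, p_xy in p_given_y.items():
        if p_xy > 0:  # terms with P(x|y) = 0 contribute nothing
            total += p_xy * math.log(p_xy / p[x])
    return total

# Hypothetical background distribution over four terms.
p = {"bag": 0.25, "tote": 0.25, "ring": 0.25, "print": 0.25}

# Seeing "bag" concentrates probability on bag-related terms.
p_given_bag = {"bag": 0.5, "tote": 0.5, "ring": 0.0, "print": 0.0}

# Seeing "unique" leaves the distribution unchanged in this toy example.
p_given_unique = dict(p)

print(info(p_given_bag, p), info(p_given_unique, p))
```

A term that does not change what we expect to see in the rest of the listing gets a score of zero, which matches the intuition that “unique” tells us nothing.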

Page 32: Beyond tf idf why, what & how

Term      Info(x)  IDF
unique    0.26     4.43
bag       1.24     4.33
pattern   1.20     4.38
original  0.85     4.38
dress     1.31     4.42
man       0.64     4.41
photo     0.74     4.37
stone     0.92     4.35

Similar IDF

Page 33: Beyond tf idf why, what & how

Term      Info(x)  IDF
unique    0.26     4.39
black     0.22     3.32
red       0.22     3.52
handmade  0.20     3.26
two       0.32     5.64
white     0.19     3.32
three     0.37     6.19
for       0.21     3.59

Similar Info Gain

Page 34: Beyond tf idf why, what & how

q = unique+bag

Using IDF

score(“unique unique bag”) > score(“unique bag bag”)

Using information gain

score(“unique unique bag”) < score(“unique bag bag”)
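The flip in ordering can be reproduced with a deliberately simplified scorer: the sum of term frequency times term weight over the query terms. This is an assumption made for illustration and is not Lucene's full scoring formula. The weights are the values for “unique” and “bag” from the table of terms with similar IDF.

```python
# Term weights taken from the slide table (IDF and Info(x) columns).
IDF = {"unique": 4.43, "bag": 4.33}
INFO_GAIN = {"unique": 0.26, "bag": 1.24}

def score(query_terms, doc, weights):
    """Simplified score: sum of (term frequency x term weight)."""
    tokens = doc.split()
    return sum(tokens.count(t) * weights[t] for t in query_terms)

query = ["unique", "bag"]
doc_a = "unique unique bag"
doc_b = "unique bag bag"

for name, weights in (("IDF", IDF), ("info gain", INFO_GAIN)):
    print(name, score(query, doc_a, weights), score(query, doc_b, weights))
```

With IDF weights the document that repeats “unique” scores slightly higher. With information gain weights the document that repeats “bag” wins by a wide margin.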

Page 36: Beyond tf idf why, what & how

Beyond TF-IDF

• Why?

  • IDF ignores term “usefulness”

• What?

  • Information gain accounts for term quality

• How?

Page 38: Beyond tf idf why, what & how

Listing Quality

• Performance relative to rank

• Hadoop: logs -> hdfs

• cron: hdfs -> master

• bash: master -> slave

• Loaded as external file field

Page 39: Beyond tf idf why, what & how

Computing info gain

I_1 =

          doc_1  doc_2  doc_3  ...  doc_n
art           2      0      1  ...      1
jewelry       1      3      0  ...      0
...
term_m        1      0      1  ...      0

$$\mathrm{info}(y) = D\big(P(X \mid y) \,\|\, P(X)\big)$$

$$\mathrm{info}(y) = \sum_{x \in X} \log\left(\frac{P(x \mid y)}{P(x)}\right) \cdot P(x \mid y)$$

Page 40: Beyond tf idf why, what & how

Hadoop

• Brute-force

• Count all terms

• Count all co-occurring terms

• Construct distributions

• Compute info gain for all terms
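The brute-force steps above can be sketched in memory on a handful of listings. This is not the Hadoop job itself, and it rests on assumptions the slides leave open: terms are counted once per listing, P(X) is the normalized listing frequency of each term, and P(X|y) is the same quantity restricted to listings that contain y. The function name and sample listings are hypothetical.

```python
import math
from collections import Counter, defaultdict

def info_gain(listings):
    """Brute-force info gain for every term in a list of tokenized listings."""
    term_counts = Counter()            # listings containing x
    co_counts = defaultdict(Counter)   # co_counts[y][x]: listings with y and x

    for tokens in listings:
        terms = set(tokens)
        term_counts.update(terms)          # count all terms
        for y in terms:
            co_counts[y].update(terms)     # count all co-occurring terms

    # Construct the background distribution P(X).
    total = sum(term_counts.values())
    p = {x: c / total for x, c in term_counts.items()}

    # Construct P(X|y) for each y and compute the KL divergence.
    scores = {}
    for y, counts in co_counts.items():
        co_total = sum(counts.values())
        scores[y] = sum(
            (c / co_total) * math.log((c / co_total) / p[x])
            for x, c in counts.items()
        )
    return scores

listings = [
    ["unique", "leather", "bag"],
    ["unique", "silver", "ring"],
    ["unique", "art", "print"],
    ["canvas", "tote", "bag"],
]
scores = info_gain(listings)
print(sorted(scores.items(), key=lambda kv: kv[1]))
```

In this toy data “unique” appears alongside unrelated items, so it scores lower than “bag”. On a real corpus the pair counting is the expensive part, which is presumably why it runs as a batch job.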

Page 41: Beyond tf idf why, what & how

File Distribution

• cron copies score file to master

• master replicates file to slaves

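# Pick the lexicographically last info_gain.* file (presumably the newest)
infogain=$(find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1)

# Copy it to the same path on a slave
scp "$infogain" "user@$slave:$infogain"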

Page 42: Beyond tf idf why, what & how

File Distribution

Page 43: Beyond tf idf why, what & how

schema.xml

Page 44: Beyond tf idf why, what & how

Beyond TF-IDF

• Why?

  • IDF ignores term “usefulness”

• What?

  • Information gain accounts for term quality

• How?

  • Hadoop + similarity factory = win

Page 45: Beyond tf idf why, what & how

Fast Deploys, Careful Testing

• Idea

• Proof of Concept

• Side-By-Side

• A/B test

• 100% Live

Page 46: Beyond tf idf why, what & how

Side-by-Side

Page 47: Beyond tf idf why, what & how
Page 48: Beyond tf idf why, what & how
Page 49: Beyond tf idf why, what & how
Page 50: Beyond tf idf why, what & how

Relevant != High quality

Page 52: Beyond tf idf why, what & how

A/B Test

• Users are randomly assigned to A or B

• A sees IDF-based results

• B sees info gain-based results

• Small but significant decrease in clicks, page views, etc.
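The slides do not say how users are assigned to a variant. One common approach, shown here purely as an assumption, is to hash the user ID together with an experiment name so that each user always lands in the same bucket. The function and experiment names are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment="info_gain_ranking"):
    """Deterministically map a user to variant 'A' or 'B'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # An even hash value means A (IDF-based), odd means B (info gain-based).
    return "A" if int(digest, 16) % 2 == 0 else "B"

variants = [assign_variant(i) for i in range(1000)]
print(variants.count("A"), variants.count("B"))
```

Hashing rather than drawing a fresh random number keeps a user's results stable across searches, so differences in clicks and page views can be attributed to the ranking change.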

Page 53: Beyond tf idf why, what & how

More homogeneous results

Lower average quality score

Page 54: Beyond tf idf why, what & how

Next Steps

Page 55: Beyond tf idf why, what & how

Parameter Tweaking...

Rebalance relevancy and quality signals in score

Page 56: Beyond tf idf why, what & how

The Future

Page 57: Beyond tf idf why, what & how

Latent Semantic Indexing in Solr/Lucene

Page 58: Beyond tf idf why, what & how

Latent Semantic Indexing

• In TF-IDF, documents are sparse vectors in term space

• LSI re-maps these to dense vectors in “concept” space

• Construct transformation matrix: $T_{r \times m} : \mathbb{R}^m_+ \to \mathbb{R}^r$

• Load file at index and query time

• Re-map query and documents
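A minimal sketch of the re-mapping step, assuming the transformation is applied as a plain matrix-vector product between an r × m matrix and a sparse term vector. The matrix below is hand-picked for illustration. In practice it would be derived offline (typically from a truncated SVD) and loaded from a file.

```python
def remap(T, sparse_vec):
    """Multiply an r x m matrix T by a sparse term vector {term_index: weight}."""
    return [sum(row[j] * w for j, w in sparse_vec.items()) for row in T]

# Hypothetical 2 x 4 transformation: 4 terms mapped onto 2 "concepts".
T = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

doc = {0: 2.0, 3: 1.0}   # sparse document vector in term space
query = {1: 1.0}         # sparse query vector in term space

doc_concepts = remap(T, doc)
query_concepts = remap(T, query)
print(doc_concepts, query_concepts)
```

The query and the document share no terms here, yet both have weight on the first concept after re-mapping, which is the effect LSI is meant to provide.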

