Beyond TF-IDF: Why, What & How

Posted on 08-Jan-2017


Transcript

Beyond TF-IDF

Stephen Murtagh, etsy.com

20,000,000 items

1,000,000 sellers

15,000,000 daily searches

80,000,000 daily calls to Solr

Etsy Engineering

• Code as Craft - our engineering blog

• http://codeascraft.etsy.com/

• Continuous Deployment

• https://github.com/etsy/deployinator

• Experiment-driven culture

• Hybrid engineering roles

• Dev-Ops

• Data-Driven Products

Etsy Search

• 2 search clusters: Flip and Flop

• Master -> 20 slaves

• Only one cluster takes traffic

• Thrift (no HTTP endpoint)

• BitTorrent for index replication

• Solr 4.1

• Incremental index every 12 minutes

Beyond TF-IDF

• Why?

• What?

• How?

Luggage tags

“unique bag”

q = unique+bag

[Two luggage-tag search results for the same query; one scores higher than the other]

Scoring in Lucene

[Slide series annotating Lucene’s scoring formula: the query norm is fixed for any given query (a constant); tf is f(term, document); idf is f(term); both are computed over user content]

IDF only measures rarity:

IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836

q = unique+bag:

score(“unique unique bag”) > score(“unique bag bag”)

“unique” tells us nothing...

Stop words

• Add “unique” to stop word list?

• What about “handmade” or “blue”?

• Low-information words can still be useful for matching

• ... but harmful for ranking

Why not replace IDF?

Beyond TF-IDF

• Why? IDF ignores term “usefulness”

• What?

• How?


What do we replace it with?

Benefits of IDF

I1 =

          doc1  doc2  doc3  ...  docn
art         2     0     1   ...    1
jewelry     1     3     0   ...    0
...
term_m      1     0     1   ...    0

IDF(jewelry) = 1 + log( n / Σ_d i_{d,jewelry} )

where i_{d,jewelry} indicates whether document d contains “jewelry”.
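The slide’s IDF formula can be sketched in a few lines of Python over a toy term-document matrix (the counts mirror the I1 example above; they are illustrative, not Etsy data):

```python
import math

# Toy term-document count matrix, mirroring the slide's I1:
# rows are terms, columns are documents, values are term counts.
index = {
    "art":     [2, 0, 1, 1],
    "jewelry": [1, 3, 0, 0],
}

def idf(term, index):
    """IDF(term) = 1 + log(n / doc_freq), per the slide's formula.
    doc_freq counts the documents containing the term at least once."""
    n = len(next(iter(index.values())))              # number of documents
    doc_freq = sum(1 for c in index[term] if c > 0)  # Σ_d i_{d,term}
    return 1 + math.log(n / doc_freq)

# "jewelry" appears in 2 of 4 docs, "art" in 3 of 4,
# so the rarer "jewelry" gets the higher IDF.
print(idf("jewelry", index), idf("art", index))
```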

Sharding

I1 =

          doc1  doc2  doc3  ...  dock
art         2     0     1   ...    1
jewelry     1     3     0   ...    0
...
term_m      1     0     1   ...    0

I2 =

          dock+1  dock+2  dock+3  ...  docn
art          6       1       0    ...    1
jewelry      0       1       3    ...    0
...
term_m       0       1       1    ...    0

IDF(jewelry) = 1 + log( n / Σ_d i_{d,jewelry} )

IDF1(jewelry) ≠ IDF2(jewelry) ≠ IDF(jewelry)

Sharded IDF options

• Ignore it - shards score differently

• Shards exchange stats - messy

• Central source distributes IDF to shards
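A toy sketch of why per-shard IDF is a problem, using the slide’s formula with made-up shard statistics:

```python
import math

def idf(doc_freq, n):
    # 1 + log(n / doc_freq), the slide's IDF formula
    return 1 + math.log(n / doc_freq)

# Hypothetical doc-frequencies of "jewelry" on two shards
shard1_docs, shard1_df = 1000, 10    # "jewelry" is rare on shard 1
shard2_docs, shard2_df = 1000, 100   # "jewelry" is common on shard 2

idf1 = idf(shard1_df, shard1_docs)
idf2 = idf(shard2_df, shard2_docs)
idf_global = idf(shard1_df + shard2_df, shard1_docs + shard2_docs)

# Each shard's local IDF differs from the other's and from the global
# value, so identical documents score differently by shard.
print(idf1, idf2, idf_global)
```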

Information Gain

• P(x) - Probability of "x" appearing in a listing

• P(x|y) - Probability of "x" appearing given "y" appears

info(y) = D( P(X|y) || P(X) )

info(y) = Σ_{x∈X} log( P(x|y) / P(x) ) · P(x|y)
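The formula transcribes directly to Python; the distributions below are made up to show why a term like “unique” scores low (it barely shifts the distribution of co-occurring terms) while “bag” scores high:

```python
import math

def info_gain(p_x_given_y, p_x):
    """info(y) = D(P(X|y) || P(X)) = Σ_x P(x|y) * log(P(x|y) / P(x))."""
    return sum(pxy * math.log(pxy / p_x[x])
               for x, pxy in p_x_given_y.items() if pxy > 0)

# Toy distributions over co-occurring terms (illustrative numbers only).
p_x            = {"leather": 0.25, "tote": 0.25, "ring": 0.25, "print": 0.25}
p_given_unique = {"leather": 0.26, "tote": 0.24, "ring": 0.25, "print": 0.25}
p_given_bag    = {"leather": 0.55, "tote": 0.40, "ring": 0.03, "print": 0.02}

# "bag" concentrates the conditional distribution -> much higher gain
print(info_gain(p_given_unique, p_x), info_gain(p_given_bag, p_x))
```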

Term      Info(x)   IDF
unique      0.26   4.43
bag         1.24   4.33
pattern     1.20   4.38
original    0.85   4.38
dress       1.31   4.42
man         0.64   4.41
photo       0.74   4.37
stone       0.92   4.35

Similar IDF

Term      Info(x)   IDF
unique      0.26   4.39
black       0.22   3.32
red         0.22   3.52
handmade    0.20   3.26
two         0.32   5.64
white       0.19   3.32
three       0.37   6.19
for         0.21   3.59

Similar Info Gain

q = unique+bag

Using IDF:

score(“unique unique bag”) > score(“unique bag bag”)

Using information gain:

score(“unique unique bag”) < score(“unique bag bag”)


Beyond TF-IDF

• Why? IDF ignores term “usefulness”

• What? Information gain accounts for term quality

• How?


Listing Quality

• Performance relative to rank

• Hadoop: logs -> hdfs

• cron: hdfs -> master

• bash: master -> slave

• Loaded as external file field
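The “external file field” is Solr’s ExternalFileField; a minimal sketch of how a per-document quality score could be wired in (the field and file names are illustrative, not Etsy’s actual schema):

```xml
<!-- schema.xml: keyField joins file entries to documents by unique key -->
<fieldType name="quality" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="listing_quality" type="quality" indexed="false" stored="false"/>
```

The score data itself lives in the index data directory as a file named external_&lt;fieldname&gt;, one `key=value` line per document, and can be swapped out without reindexing.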

Computing info gain

I1 =

          doc1  doc2  doc3  ...  docn
art         2     0     1   ...    1
jewelry     1     3     0   ...    0
...
term_m      1     0     1   ...    0

info(y) = D( P(X|y) || P(X) )

info(y) = Σ_{x∈X} log( P(x|y) / P(x) ) · P(x|y)

Hadoop

• Brute-force

• Count all terms

• Count all co-occurring terms

• Construct distributions

• Compute info gain for all terms
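The brute-force job above can be sketched with in-memory Python standing in for the Hadoop counting steps (the corpus is toy data; term order inside a listing is ignored):

```python
import math
from collections import Counter
from itertools import permutations

# 1) toy corpus: each listing is a set of terms
docs = [
    {"unique", "leather", "bag"},
    {"unique", "silver", "ring"},
    {"leather", "bag", "tote"},
    {"silver", "ring", "unique"},
]

# 2) count all terms and all co-occurring (y, x) term pairs
term_counts = Counter(t for d in docs for t in d)
pair_counts = Counter(p for d in docs for p in permutations(d, 2))
total = sum(term_counts.values())

# 3) build P(x) and P(x|y), then 4) compute info(y)
def info_gain(y):
    """info(y) = Σ_x P(x|y) * log(P(x|y) / P(x))"""
    co = {x: c for (yy, x), c in pair_counts.items() if yy == y}
    n_y = sum(co.values())
    return sum((c / n_y) * math.log((c / n_y) / (term_counts[x] / total))
               for x, c in co.items())

# "unique" co-occurs with everything -> lower gain than "bag"
print({t: round(info_gain(t), 3) for t in ["unique", "bag"]})
```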

File Distribution

• cron copies score file to master

• master replicates file to slaves

# pick the most recent info_gain file on the master
infogain=`find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1`

# copy it to the same path on each slave
scp "$infogain" user@$slave:"$infogain"

File Distribution

schema.xml
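The schema.xml slide presumably wires the precomputed scores into ranking through a custom similarity; a hypothetical sketch (the factory class and parameter name are invented for illustration — Solr does let a schema declare a custom SimilarityFactory this way):

```xml
<!-- schema.xml (sketch): a custom SimilarityFactory substitutes the
     precomputed info-gain scores for Lucene's IDF; the class name is
     hypothetical, not Etsy's actual implementation -->
<similarity class="com.etsy.solr.search.InfoGainSimilarityFactory">
  <str name="infoGainFile">info_gain.txt</str>
</similarity>
```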

Beyond TF-IDF

• Why? IDF ignores term “usefulness”

• What? Information gain accounts for term quality

• How? Hadoop + similarity factory = win

Fast Deploys, Careful Testing

• Idea

• Proof of Concept

• Side-By-Side

• A/B test

• 100% Live

Side-by-Side

Relevant != High quality

A/B Test

• Users are randomly assigned to A or B

• A sees IDF-based results

• B sees info gain-based results


• Small but significant decrease in clicks, page views, etc.

More homogeneous results

Lower average quality score

Next Steps

Parameter Tweaking...

Rebalance relevancy and quality signals in the score

The Future

Latent Semantic Indexing in Solr/Lucene

Latent Semantic Indexing

• In TF-IDF, documents are sparse vectors in term space

• LSI re-maps these to dense vectors in “concept” space

• Construct transformation matrix T_{r×m} : R^m_+ → R^r

• Load file at index and query time

• Re-map query and documents
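The mapping can be sketched with a truncated SVD in NumPy (toy dimensions and random data; this is the standard LSI fold-in construction, not necessarily the implementation the talk envisions):

```python
import numpy as np

# LSI sketch: factor the (terms x docs) matrix with a truncated SVD,
# then project sparse term-space vectors into a dense r-dimensional
# "concept" space. T (r x m) is applied to documents at index time
# and to queries at query time, so both live in the same space.
rng = np.random.default_rng(0)
A = rng.random((50, 200))            # toy m x n term-document matrix

r = 10                               # number of latent "concepts"
U, s, Vt = np.linalg.svd(A, full_matrices=False)
T = np.diag(1 / s[:r]) @ U[:, :r].T  # T maps R^m_+ -> R^r

query = np.zeros(50)                 # sparse query vector in term space
query[[3, 17]] = 1.0                 # two hypothetical query terms
dense_query = T @ query              # dense concept-space representation
print(dense_query.shape)             # -> (10,)
```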

CONTACT: Stephen Murtagh, smurtagh@etsy.com