Beyond TF-IDF
Stephen Murtagh, etsy.com
20,000,000 items
1,000,000 sellers
15,000,000 daily searches
80,000,000 daily calls to Solr
Etsy Engineering
• Code as Craft - our engineering blog
• http://codeascraft.etsy.com/
• Continuous Deployment
• https://github.com/etsy/deployinator
• Experiment-driven culture
• Hybrid engineering roles
• Dev-Ops
• Data-Driven Products
Etsy Search
• 2 search clusters: Flip and Flop
• Master -> 20 slaves
• Only one cluster takes traffic
• Thrift (no HTTP endpoint)
• BitTorrent for index replication
• Solr 4.1
• Incremental index every 12 minutes
Beyond TF-IDF
• Why?
• What?
• How?
Luggage tags
“unique bag”
q = unique+bag
[Screenshots: two result listings for q = unique+bag compared; one outranks the other]
Scoring in Lucene
• Some score factors are constant: fixed for any given query
• Some are f(term, document): e.g. tf
• IDF is f(term): computed from user content, it only measures rarity

IDF(“unique”) = 4.429547 > IDF(“bag”) = 4.32836
q = unique+bag
score(“unique unique bag”) > score(“unique bag bag”)
“unique” tells us nothing...
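The comparison above can be reproduced with a toy version of Lucene's classic scoring. A minimal sketch, assuming a simplified formula (sqrt(tf) · idf², ignoring norms, coord, and boosts) and the IDF values quoted in the slides:

```python
import math

# IDF values quoted in the slides.
IDF = {"unique": 4.429547, "bag": 4.32836}

def tfidf_score(query_terms, doc):
    """Simplified classic Lucene score: sum over query terms of sqrt(tf) * idf^2."""
    tokens = doc.split()
    return sum(math.sqrt(tokens.count(t)) * IDF[t] ** 2 for t in query_terms)

q = ["unique", "bag"]
s1 = tfidf_score(q, "unique unique bag")
s2 = tfidf_score(q, "unique bag bag")
print(s1 > s2)  # True: the extra "unique" outweighs the extra "bag"
```

Because IDF(“unique”) is slightly higher, a second “unique” adds more score than a second “bag”, even though “unique” says nothing about the listing.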
Stop words
• Add “unique” to stop word list?
• What about “handmade” or “blue”?
• Low-information words can still be useful for matching
• ... but harmful for ranking
Why not replace IDF?
Beyond TF-IDF
• Why? IDF ignores term “usefulness”
• What?
• How?
What do we replace it with?
Benefits of IDF

I1 =
          doc1  doc2  doc3  ...  docn
art         2     0     1   ...    1
jewelry     1     3     0   ...    0
...
termm       1     0     1   ...    0

IDF(jewelry) = 1 + log( n / Σd i_d,jewelry )

Everything IDF needs lives inside a single index.
Sharding

I1 =
          doc1   doc2   doc3   ...  dock
art         2      0      1    ...    1
jewelry     1      3      0    ...    0
...
termm       1      0      1    ...    0

I2 =
          dock+1  dock+2  dock+3  ...  docn
art         6       1       0    ...    1
jewelry     0       1       3    ...    0
...
termm       0       1       1    ...    0

IDF(jewelry) = 1 + log( n / Σd i_d,jewelry )

IDF1(jewelry) ≠ IDF2(jewelry) ≠ IDF(jewelry)

Each shard sees only its own documents, so per-shard IDFs disagree with each other and with the global value.
Sharded IDF options
• Ignore it: shards score differently
• Shards exchange stats: messy
• Central source distributes IDF to shards
Information Gain
• P(x): probability of “x” appearing in a listing
• P(x|y): probability of “x” appearing given “y” appears

info(y) = D( P(X|y) || P(X) )
info(y) = Σx∈X P(x|y) · log( P(x|y) / P(x) )
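The two lines define the same quantity: the KL divergence between the term distribution conditioned on y and the unconditioned one. A minimal sketch with made-up distributions (term names and probabilities are illustrative only, not Etsy data):

```python
import math

def info_gain(p_x, p_x_given_y):
    """info(y) = D(P(X|y) || P(X)) = sum_x P(x|y) * log(P(x|y) / P(x))."""
    return sum(p * math.log(p / p_x[x], 2)
               for x, p in p_x_given_y.items() if p > 0)

# Illustrative toy distributions over four terms.
p_x = {"art": 0.25, "jewelry": 0.25, "ring": 0.25, "bag": 0.25}
p_x_skewed = {"art": 0.05, "jewelry": 0.70, "ring": 0.20, "bag": 0.05}

print(info_gain(p_x, p_x))                    # 0.0: seeing y changes nothing
print(round(info_gain(p_x, p_x_skewed), 3))   # positive: y is informative
```

A term with near-zero info gain co-occurs with everything, so seeing it barely changes the distribution over the other terms: that is exactly the behavior of “unique”.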
Term       Info(x)   IDF
unique     0.26      4.43
bag        1.24      4.33
pattern    1.20      4.38
original   0.85      4.38
dress      1.31      4.42
man        0.64      4.41
photo      0.74      4.37
stone      0.92      4.35

Similar IDF
Term       Info(x)   IDF
unique     0.26      4.39
black      0.22      3.32
red        0.22      3.52
handmade   0.20      3.26
two        0.32      5.64
white      0.19      3.32
three      0.37      6.19
for        0.21      3.59

Similar Info Gain
q = unique+bag

Using IDF:
score(“unique unique bag”) > score(“unique bag bag”)

Using information gain:
score(“unique unique bag”) < score(“unique bag bag”)
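Swapping the IDF weights for the info-gain values from the tables above (unique: 0.26, bag: 1.24) flips the comparison. A sketch using the same simplified scoring form (sqrt(tf) · weight²):

```python
import math

# Info-gain values from the tables above stand in for IDF.
WEIGHT = {"unique": 0.26, "bag": 1.24}

def score(query_terms, doc):
    """Simplified score: sum over query terms of sqrt(tf) * weight^2."""
    tokens = doc.split()
    return sum(math.sqrt(tokens.count(t)) * WEIGHT[t] ** 2 for t in query_terms)

q = ["unique", "bag"]
s_uub = score(q, "unique unique bag")
s_ubb = score(q, "unique bag bag")
print(s_uub < s_ubb)  # True: the extra "bag" now outweighs the extra "unique"
```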
Beyond TF-IDF
• Why? IDF ignores term “usefulness”
• What? Information gain accounts for term quality
• How?
Listing Quality
• Performance relative to rank
• Hadoop: logs -> hdfs
• cron: hdfs -> master
• bash: master -> slave
• Loaded as external file field
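“Loaded as external file field” refers to Solr's ExternalFileField, which maps per-document float values from a flat file (conventionally named external_&lt;fieldname&gt;, in the index data directory) onto documents by key. A hedged sketch of the schema.xml declaration; the field and type names here are hypothetical, not from the talk:

```xml
<!-- Sketch only; "quality" / "listing_quality" are hypothetical names. -->
<!-- Values are read from external_listing_quality in the index data dir. -->
<fieldType name="quality" class="solr.ExternalFileField" keyField="id" defVal="0"/>
<field name="listing_quality" type="quality" indexed="false" stored="false"/>
```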
Computing info gain

I1 =
          doc1  doc2  doc3  ...  docn
art         2     0     1   ...    1
jewelry     1     3     0   ...    0
...
termm       1     0     1   ...    0

info(y) = D( P(X|y) || P(X) )
info(y) = Σx∈X P(x|y) · log( P(x|y) / P(x) )
Hadoop
• Brute-force
• Count all terms
• Count all co-occurring terms
• Construct distributions
• Compute info gain for all terms
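The four brute-force steps above fit in a few lines at toy scale. A sketch, assuming a tiny hand-written corpus in place of the Hadoop job (documents and counts are invented for illustration):

```python
import math
from collections import Counter
from itertools import permutations

# Toy corpus standing in for the logs on HDFS.
docs = [
    "unique leather bag",
    "unique bag",
    "leather bag",
    "unique silver ring",
    "silver ring",
]

# Steps 1-2: count all terms and all co-occurring term pairs.
term_counts = Counter()
pair_counts = Counter()  # pair_counts[(y, x)]: docs containing both y and x
for doc in docs:
    terms = set(doc.split())
    term_counts.update(terms)
    pair_counts.update(permutations(terms, 2))

# Step 3: construct distributions. P(x) from term counts, P(x|y) from pairs.
total = sum(term_counts.values())
p_x = {x: c / total for x, c in term_counts.items()}

# Step 4: compute info gain for every term.
def gain(y):
    pairs = {x: c for (yy, x), c in pair_counts.items() if yy == y}
    total_y = sum(pairs.values())
    return sum((c / total_y) * math.log((c / total_y) / p_x[x], 2)
               for x, c in pairs.items())

for y in sorted(term_counts, key=gain, reverse=True):
    print(f"{y:8s} {gain(y):.3f}")
```

Even in this toy corpus, “unique” scores lowest: it co-occurs with everything, so it predicts nothing about the rest of a listing.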
File Distribution
• cron copies score file to master
• master replicates file to slaves
# Pick the newest info_gain.* snapshot in /search/data/
infogain=`find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1`
# Copy it to the slave at the same path
scp "$infogain" user@$slave:"$infogain"
File Distribution
schema.xml
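The content of the schema.xml slide was not captured. It presumably registered the custom similarity that swaps IDF for the distributed info-gain scores; only the &lt;similarity&gt; element below is standard Solr, while the class and parameter names are invented for illustration:

```xml
<!-- Hypothetical: class and parameter names are not from the talk. -->
<similarity class="com.etsy.solr.InfoGainSimilarityFactory">
  <str name="infoGainFile">/search/data/info_gain.latest</str>
</similarity>
```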
Beyond TF-IDF
• Why? IDF ignores term “usefulness”
• What? Information gain accounts for term quality
• How? Hadoop + similarity factory = win
Fast Deploys, Careful Testing
• Idea
• Proof of Concept
• Side-By-Side
• A/B test
• 100% Live
Side-by-Side
Relevant != High quality
A/B Test
• Users are randomly assigned to A or B
• A sees IDF-based results
• B sees info gain-based results
• Small but significant decrease in clicks, page views, etc.
More homogeneous results; lower average quality score
Next Steps
Parameter tweaking: rebalance relevance and quality signals in the score
The Future
Latent Semantic Indexing in Solr/Lucene
Latent Semantic Indexing
• In TF-IDF, documents are sparse vectors in term space (ℝm+)
• LSI re-maps these to dense vectors in “concept” space (ℝr)
• Construct transformation matrix T ∈ ℝr×m : ℝm+ → ℝr
• Load file at index and query time
• Re-map query and documents
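A minimal sketch of the idea via truncated SVD, using numpy; the matrix values are toy counts, not Etsy data:

```python
import numpy as np

# Toy m x n term-document matrix (m = 3 terms, n = 4 documents).
A = np.array([[2., 0., 1., 1.],   # art
              [1., 3., 0., 0.],   # jewelry
              [0., 2., 2., 0.]])  # ring

r = 2                                  # number of latent "concepts"
U, s, Vt = np.linalg.svd(A, full_matrices=False)
T = np.diag(1.0 / s[:r]) @ U[:, :r].T  # T in R^{r x m}: maps R^m_+ -> R^r

q = np.array([1., 1., 0.])             # sparse query vector in term space
q_concept = T @ q                      # dense query in concept space
doc_concepts = Vt[:r].T                # one row per document in concept space

# Rank documents by cosine similarity in concept space.
sims = doc_concepts @ q_concept / (
    np.linalg.norm(doc_concepts, axis=1) * np.linalg.norm(q_concept))
print(sims.argsort()[::-1])            # document indices, best first
```

The same T folds in documents at index time and queries at query time, which is why it has to be loaded from a file at both points, just as the slide says.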