Beyond text similarity

Beyond Text Similarity_ Tune your search for your Business Domain Search Meetup Munich 26.10.2016 Christian Uhl
Transcript
Page 1: Beyond text similarity

Beyond Text Similarity_ Tune your search for your Business Domain
Search Meetup Munich 26.10.2016
Christian Uhl

Page 2: Beyond text similarity

Agenda

Moving from simple text matching towards custom scoring

• Recap: Text similarity and why this stops working in the travel domain
• Using recommendations and user interaction feedback
• Performance!
• Protect yourself against regressions

Page 3: Beyond text similarity

Practical scoring and text similarity are not enough this time

Page 4: Beyond text similarity

Text Similarity

Elasticsearch • Lucene Practical Scoring

score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Page 9: Beyond text similarity

Text Similarity

Inverse Document Frequency

Search for “Da Vinci Paris”

Few Da Vincis in the World, but Paris occurs a lot.

But is a “Da Vinci” in Valencia more relevant than any other Hotel in Paris?
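
To make the effect concrete, here is a minimal Python sketch of the idf term alone. It assumes the classic Lucene TF/IDF definition idf(t) = 1 + ln(numDocs / (docFreq + 1)). The document counts are invented for illustration and are not figures from the talk.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style inverse document frequency (assumed definition)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index statistics, chosen only to illustrate the effect.
NUM_DOCS = 100_000
DOC_FREQ = {"vinci": 20, "paris": 3_000}

idf_by_term = {term: idf(df, NUM_DOCS) for term, df in DOC_FREQ.items()}

for term, value in idf_by_term.items():
    # The scoring formula uses idf squared, which widens the gap further.
    print(f"{term}: idf={value:.2f}, idf^2={value ** 2:.2f}")
```

With numbers like these the rare term dominates the sum, so a “Da Vinci” hotel outside Paris can outrank Paris hotels that match only on “Paris”.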

Page 12: Beyond text similarity

Text Similarity

Search for “Paris”

• Paris, Illinois
• Paris, Texas
• Paris, France
• even more WTF

I don’t visit no cheese eating surrender monkeys – Uncle Sam

Page 13: Beyond text similarity

Summary

Lucene Practical Scoring is finely crafted and tuned to find occurrences in large text bodies

We do not have large text bodies

Well, s***

Page 14: Beyond text similarity

Bring in the real world

Page 15: Beyond text similarity

Bring in the real World

Change the score!

• Instead of just relying on the practical scoring function, add other parameters

• Use values from the real world that reflect the relevance of a given document in the whole document space

Page 16: Beyond text similarity

Bring in the real World

Our users were kind enough to provide valuable feedback about our data

• They rate and recommend things (Hotels)

• They click on things (Everywhere*)

*except ads

We also have a geospatial relation between hotels and destinations

Page 17: Beyond text similarity

Bring in the real World

Rescore!

• Hotels by recommendations

• Destinations by clicks and hotel count

• POIs by clicks
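
One way such a rescoring could look for destinations is sketched below in Python. The combination formula, the field names and all numbers are assumptions made for illustration; the talk does not state how the signals were actually weighted.

```python
import math
from dataclasses import dataclass

@dataclass
class Destination:
    name: str
    text_score: float  # score from the text query (hypothetical value)
    clicks: int        # user clicks on this destination (hypothetical)
    hotel_count: int   # hotels geographically linked to it (hypothetical)

def rescored(dest: Destination) -> float:
    # log1p dampens large counts so popularity grows slowly, not linearly.
    popularity = math.log1p(dest.clicks) + math.log1p(dest.hotel_count)
    return dest.text_score * (1.0 + popularity)

# All three match the query "Paris" equally well on text alone.
candidates = [
    Destination("Paris, Illinois", text_score=1.0, clicks=12, hotel_count=3),
    Destination("Paris, Texas", text_score=1.0, clicks=40, hotel_count=8),
    Destination("Paris, France", text_score=1.0, clicks=90_000, hotel_count=1_800),
]

ranked = sorted(candidates, key=rescored, reverse=True)
for dest in ranked:
    print(f"{dest.name}: {rescored(dest):.2f}")
```

Text similarity cannot separate the three candidates, but the real-world signals can.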

Page 18: Beyond text similarity

Bring in the real World

Don’t sort by average rating!

Page 21: Beyond text similarity

Bring in the real World

Don’t sort by average rating!
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

Maybe use the lower bound of the Wilson score confidence interval for a Bernoulli parameter!
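
A minimal Python sketch of that lower bound follows. It assumes each recommendation is a yes/no vote and uses z = 1.96 (roughly a 95% interval); the function name and the sample numbers are illustrative and not taken from the talk.

```python
import math

def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a Bernoulli parameter."""
    if total == 0:
        return 0.0  # no votes yet: assume nothing in the item's favour
    p = positive / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)
    return (centre - margin) / (1 + z2 / total)

# One perfect vote versus 95 positive votes out of 100.
single_vote = wilson_lower_bound(1, 1)
many_votes = wilson_lower_bound(95, 100)
print(f"1/1    -> {single_vote:.3f}")
print(f"95/100 -> {many_votes:.3f}")
```

The hotel with a single perfect vote has the higher average, yet the one with 95 of 100 gets the higher bound, because few votes mean a wide interval.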

Page 22: Beyond text similarity

Bring in the real World

Don’t fear the math

Page 23: Beyond text similarity

Doesn’t custom scoring kill performance?

Page 24: Beyond text similarity

Bring in the real World

Yes, it does.

We started with a script score function to compute a better score at search time. Very bad idea: 500 ms to 1 s queries, CPUs screaming for mercy.

Page 25: Beyond text similarity

Bring in the real World

Rescoring!

• Generate a search result with ES/Lucene standard scoring
• Rescore the top 40
• Fetch the top n of that
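
The three steps above map onto the Elasticsearch `rescore` section of a search request. The Python sketch below only builds such a request body. The field names `name` and `popularity`, the weights and the choice of `field_value_factor` are assumptions for illustration; the talk does not show the actual query.

```python
import json

def build_rescored_search(user_query: str, window_size: int = 40, size: int = 10) -> dict:
    """Build a search body: cheap text query first, then rescore the top hits."""
    return {
        "size": size,  # the "top n" returned to the caller
        # Step 1: standard text scoring over a hypothetical "name" field.
        "query": {"match": {"name": user_query}},
        # Step 2: only the best window_size hits per shard are rescored.
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        # Hypothetical precomputed numeric field on each document.
                        "field_value_factor": {
                            "field": "popularity",
                            "modifier": "log1p",
                            "missing": 0,
                        }
                    }
                },
                "query_weight": 1.0,
                "rescore_query_weight": 2.0,
            },
        },
    }

body = build_rescored_search("paris")
print(json.dumps(body, indent=2))
```

Because the expensive part runs on a fixed-size window instead of every matching document, its cost no longer grows with the size of the result set.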

Page 27: Beyond text similarity

Protect yourself against regressions

Page 28: Beyond text similarity

Testing

Regression Testing

• Record ~4500 searches users did that brought in money

• Generate tests that make sure for each search term the relevant result is in the result set

• Define a threshold for OK (qualitative tests)

• Execute on CI!
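
A check along these lines could be sketched in Python as follows. The recorded searches, the stand-in search function and the threshold value are all hypothetical; the real suite would replay the recorded searches against the actual search backend.

```python
from typing import Callable, Dict, List

def hit_rate(recorded: Dict[str, str],
             search: Callable[[str], List[str]],
             top_n: int = 10) -> float:
    """Share of recorded searches whose expected result appears in the top_n."""
    if not recorded:
        return 1.0
    hits = sum(1 for term, expected in recorded.items()
               if expected in search(term)[:top_n])
    return hits / len(recorded)

# Hypothetical recorded searches: search term -> id of the result the user chose.
RECORDED = {"paris": "dest-paris-fr", "da vinci paris": "hotel-da-vinci-paris"}

def fake_search(term: str) -> List[str]:
    # Stand-in for a call to the real search backend.
    index = {
        "paris": ["dest-paris-fr", "dest-paris-tx"],
        "da vinci paris": ["hotel-da-vinci-valencia", "hotel-da-vinci-paris"],
    }
    return index.get(term, [])

THRESHOLD = 0.95  # assumed value for "OK", not taken from the talk

rate = hit_rate(RECORDED, fake_search)
assert rate >= THRESHOLD, f"search regression: hit rate {rate:.2%} below threshold"
```

Asserting on an aggregate hit rate rather than on every single search keeps the suite stable when a scoring change reshuffles a handful of results.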

Page 32: Beyond text similarity

“Unless you’re a library, you should use additional real-world data for scoring” - me

Page 33: Beyond text similarity

#done

Get in touch!

[email protected] • @chrisuhlcc