Date post: | 22-Jan-2018 |
Category: | Technology |
Upload: | toria-gibbs |
LEONOR. Macrame wall hanging
$145.00 USD, AncestralStore
Bread your Cat Costume for Cats
$12.00 USD, MissMaddyMakes
Agenda
Why Build Search Systems?
Search Indexes
Open Source Tools
Interesting Challenges in Search
“Isn’t search a solved problem? We have Google!” - All my friends
Photo by Alissa, loveherbyalissa.etsy.com
Google                      Etsy
Very very large scope       Medium scope
No control over content     Some control over content
High intent                 Low intent
Optimize for Google users   Optimize for Etsy users
id description price
001 red cat mittens 40.00
002 blue mittens 19.99
003 blue hat for cats 12.50
004 cat hat 25.00
005 red and blue hat 30.00
Database Example
q=“cat”
SELECT * FROM items WHERE description LIKE '%cat%'
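A minimal sketch of the database example, assuming an in-memory SQLite database stands in for the real item store. The table layout follows the slide; the `search` helper and the column types are hypothetical choices made for this illustration.

```python
import sqlite3

# In-memory database holding the five example items from the table above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, description TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("001", "red cat mittens", 40.00),
        ("002", "blue mittens", 19.99),
        ("003", "blue hat for cats", 12.50),
        ("004", "cat hat", 25.00),
        ("005", "red and blue hat", 30.00),
    ],
)


def search(term):
    """Return ids of items whose description contains `term` as a substring."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id FROM items WHERE description LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


matching_ids = search("cat")
print(matching_ids)
```

The query is passed as a bound parameter rather than pasted into the SQL string, which is the usual way to avoid injection problems when the search term comes from a user.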
Database Scalability
m = 25
n          n·m
10         250
100        2500
1000       25000
10000      250000
100000     2500000
1000000    25000000
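The table above is plain multiplication, and a short sketch makes the growth explicit. It assumes that n·m approximates the work of a full scan (every character of every description is examined); the function name `scan_cost` is hypothetical.

```python
AVERAGE_LENGTH = 25  # m, as on the slide


def scan_cost(n_items, avg_length=AVERAGE_LENGTH):
    """Rough work estimate for a full scan: n items times m characters each."""
    return n_items * avg_length


costs = {n: scan_cost(n) for n in (10, 100, 1_000, 10_000, 100_000, 1_000_000)}
for n, cost in costs.items():
    print(f"{n:>9,} items -> {cost:>12,} units of work")
```

The cost grows linearly with the number of items, so every tenfold increase in the catalog means ten times more work per query.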
Why build search systems?
1. Customize the solution (your users, your data, your algorithms)
2. Improve performance
✓ cat hat
✓ blue hat for cats
✓ vacation hat
? kitten hat
By Laura Solarte, floflyco.etsy.com
SELECT * FROM items WHERE description LIKE '%cat%'
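The quality problem can be reproduced without a database. This sketch assumes that Python's `in` operator behaves like `LIKE '%cat%'` for lowercase text; the helper name `substring_match` is hypothetical.

```python
descriptions = ["cat hat", "blue hat for cats", "vacation hat", "kitten hat"]


def substring_match(query, texts):
    """Mimic LIKE '%query%': keep every text that contains the query anywhere."""
    return [text for text in texts if query in text]


matches = substring_match("cat", descriptions)
missed = [text for text in descriptions if text not in matches]
print("matched:", matches)
print("missed: ", missed)
```

"vacation hat" matches because "cat" is hidden inside "vacation", while "kitten hat" is missed even though a shopper searching for "cat" would probably want it.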
Why build search systems?
1. Customize the solution (your users, your data, your algorithms)
2. Improve performance
3. Improve quality of results
Inverted Index
red [001, 005]
blue [002, 003, 005]
cat [001, 003, 004]
hat [003, 004, 005]
mitten [001, 002]
001 red cat mittens
002 blue mittens
003 blue hat for cats
004 cat hat
005 red and blue hat
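One way the inverted index above could be built from the five descriptions is sketched here. Two assumptions are made to reproduce the slide: "for" and "and" are treated as stop words because they do not appear in the slide's index, and a naive plural strip stands in for real stemming. The names `to_terms` and `build_inverted_index` are hypothetical.

```python
from collections import defaultdict

documents = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}

STOP_WORDS = {"for", "and"}  # assumption: absent from the slide's index


def to_terms(text):
    """Split on whitespace, drop stop words, strip a trailing plural 's'."""
    terms = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]  # "cats" -> "cat", "mittens" -> "mitten"
        terms.append(word)
    return terms


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in to_terms(docs[doc_id]):
            if doc_id not in index[term]:
                index[term].append(doc_id)
    return dict(index)


inverted_index = build_inverted_index(documents)
for term, doc_ids in inverted_index.items():
    print(term, doc_ids)
```

Answering a query is then a dictionary lookup, for example `inverted_index.get("cat", [])`, instead of a scan over every description.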
Terminology
● A document is a single searchable unit
    001 red cat mittens 40.00
● A field is a defined value in a document
    id description price
    001 red cat mittens 40.00
● A term is a value extracted from the source in order to build the inverted index
● An inverted index is an internal data structure that maps terms of a field to document ids
    red [001, 005]
    blue [002, 003, 005]
    cat [001, 003, 004]
    hat [003, 004, 005]
    mitten [001, 002]
● An index is a collection of documents
12.50 [003]
19.99 [002]
25.00 [004]
30.00 [005]
40.00 [001]
001 red cat mittens 40.00
002 blue mittens 19.99
... ... ...
red [001, 005]
blue [002, 003, 005]
cat [001, 003, 004]
hat [003, 004, 005]
mitten [001, 002]
001 red cat mittens
002 blue mittens
003 blue hat for cats
004 cat hat
005 red and blue hat
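The picture above, one inverted index per field, can be sketched as follows. This is an illustration under assumptions: prices are kept as strings so that each whole value becomes a single term, the description analyzer uses a naive stop-word list and plural strip, and all function names are hypothetical.

```python
from collections import defaultdict

items = [
    {"id": "001", "description": "red cat mittens", "price": "40.00"},
    {"id": "002", "description": "blue mittens", "price": "19.99"},
    {"id": "003", "description": "blue hat for cats", "price": "12.50"},
    {"id": "004", "description": "cat hat", "price": "25.00"},
    {"id": "005", "description": "red and blue hat", "price": "30.00"},
]


def analyze_description(text):
    """Text field: split into words, drop stop words, strip a plural 's'."""
    terms = []
    for word in text.lower().split():
        if word in {"for", "and"}:
            continue
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        terms.append(word)
    return terms


def analyze_price(value):
    """Exact-value field: the whole value is a single term."""
    return [value]


ANALYZERS = {"description": analyze_description, "price": analyze_price}


def build_field_indexes(docs, analyzers):
    """Build one inverted index (term -> document ids) for every field."""
    indexes = {field: defaultdict(list) for field in analyzers}
    for doc in docs:
        for field, analyze in analyzers.items():
            for term in analyze(doc[field]):
                postings = indexes[field][term]
                if doc["id"] not in postings:
                    postings.append(doc["id"])
    return {field: dict(index) for field, index in indexes.items()}


field_indexes = build_field_indexes(items, ANALYZERS)
print(field_indexes["price"])
print(field_indexes["description"])
```

Keeping a separate analyzer per field is the design point here: free text is broken into terms, while a value such as a price is indexed as is.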
How did we do this?
Stemming
By Paradise Crow, ParadiseCrow.etsy.com
“cats” → “cat”
“walking” → “walk”
“painting” → “paint” ?
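A naive suffix-stripping stemmer is enough to reproduce the three examples above. This is only a sketch: the suffix list and the minimum stem length are assumptions, and production systems normally rely on an established algorithm rather than rules like these.

```python
SUFFIXES = ("ing", "s")  # checked in this order


def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for example in ("cats", "walking", "painting", "is"):
    print(example, "->", naive_stem(example))
```

The "painting" case shows the trade-off: after stemming, a search for "paint" also matches paintings, which may or may not be what the shopper wants.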
By Dina Castellano, mamaslilsugarcrochet.etsy.com
Bonus: Synonyms
✓ [“cat”, “kitten”]
✓ [“color”, “colour”]
✓ [“Canada”, “Canadian”, “canuck”]
✗ [“Poland”, “Polish”]
By Ludwinus van den Arend, circuszoo.etsy.com
● Stemming      ✓ hat for cats
● Tokenization  ✗ vacation
● Synonyms      ✓ kitten hat
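The three steps above can be combined into a single analysis function applied to both the query and the item text. This is a sketch under assumptions: the synonym table, the plural-only stemmer, and the names `analyze` and `matches` are hypothetical.

```python
import re

SYNONYMS = {"kitten": "cat"}  # hypothetical table: term -> canonical term


def stem(word):
    """Naive stemming: strip a trailing plural 's' from longer words."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def analyze(text):
    """Tokenize into whole words, stem each one, then apply synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [stem(token) for token in tokens]
    return {SYNONYMS.get(s, s) for s in stems}


def matches(query, text):
    """A text matches when it contains every analyzed query term."""
    return analyze(query) <= analyze(text)


descriptions = ["cat hat", "blue hat for cats", "vacation hat", "kitten hat"]
results = [text for text in descriptions if matches("cat", text)]
print(results)
```

Because matching now works on whole terms instead of substrings, "vacation hat" drops out, while stemming and synonyms bring in "blue hat for cats" and "kitten hat".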
Building an Inverted Index
INDEX TIME: O(n·m·p)
QUERY TIME: O(1)
n = items in database
m = length of string
p = preprocessing steps
title
1. “big data”
2. “small data”
3. “big data”
4. “small data”
5. “big data”
6. “small data”
7. “big data”
8. “small data”
9. “big data”
10. “small data”
11. “bigger data”
12. “biggest data”
data=[1,2,3,4,5,6,7,8,9,10,11,12]
big=[1,3,5,7,9,11,12]
small=[2,4,6,8,10]
title
1. “Carlos Vives is the greatest singer alive”
2. “Shakira is the best dancer in the world”
3. “Sophía Vergara is the most famous Colombian in the United States”
carlos=[1]
vives=[1]
is=[1,2,3]
the=[1,2,3]
great=[1]
singer=[1]
alive=[1]
shakira=[2]
best=[2]
dancer=[2]
in=[2,3]
world=[2]
sophia=[3]
vergara=[3]
most=[3]
famous=[3]
colombia=[3]
unite=[3]
states=[3]
Did we solve it?
✓ Customize the solution (your users, your data, your algorithms)
✓ Improve performance
✓ Improve quality of results
Agenda
✓ Why Build Search Systems?
✓ Search Indexes
Open Source Tools
Interesting Challenges in Search
● Inverted index
● Field data (uninverted index)
● Basic stemming, tokenizing, faceting
● Advanced stemming, tokenizing, faceting
● Plugins
● Caching, warming
● Replication
● Sharding, distribution
● ...and more!
Source: Side by Side with Elasticsearch and Solr, by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
See also: http://solr-vs-elasticsearch.com/, by Kelvin Tan
It Doesn’t Matter
● Most projects work well with either
● Getting configuration right is more important
● Test with your own data and your own queries
<schema name="items" version="1.6">
  <types>
    <fieldType name="long" class="solr.TrieLongField"/>
    <fieldType name="int" class="solr.TrieField" type="integer"/>
    <fieldType name="tdate" class="solr.TrieDateField"/>
    <fieldType name="text" class="solr.TextField"/>
  </types>
  <fields>
    <field name="item_id" type="long" stored="true" required="true"/>
    <field name="description" type="text"/>
    <field name="quantity" type="int"/>
    <field name="price" type="long"/>
    <field name="update_date" type="tdate"/>
  </fields>
  <defaultSearchField>description</defaultSearchField>
  <uniqueKey>item_id</uniqueKey>
</schema>

"item" : {
  "properties" : {
    "item_id": { "type": "long", "store": true },
    "description": { "type": "string" },
    "quantity": { "type": "integer" },
    "price": { "type": "long" },
    "update_date": { "type": "date" }
  }
}
TF-IDF
TF(term) = (# times this term appears in doc) / (total # terms in doc)
IDF(term) = log_e(total number of docs / # docs which contain this term)
1. The orange cat is a very good cat
2. My cat ate an orange
3. Cats are the best and I will give every cat a special cat toy
1. TF(cat) = 2/8
2. TF(cat) = 1/5
3. TF(cat) = 3/14
IDF(cat) = log_e(3/3)
“cat” → [1, 3, 2]
TF-IDF
1. The orange cat is a very good cat
2. My cat ate an orange
3. Cats are the best and I will give every cat a special cat toy cat cat cat cat cat
1. TF(cat) = 2/8
2. TF(cat) = 1/5
3. TF(cat) = 8/19
IDF(cat) = log_e(3/3)
“cat” → [3, 1, 2]
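Both TF-IDF examples above can be checked with a few lines of code. This sketch assumes a naive tokenizer that lowercases words and strips a plural "s" (so "Cats" counts as "cat"); the function names are hypothetical. Because "cat" appears in all three documents, IDF(cat) is log_e(3/3) = 0, so the orderings shown follow from TF alone, and the sketch ranks by TF for that reason.

```python
import math
import re


def tokenize(text):
    """Lowercase, split into words, and strip a trailing plural 's'."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        tokens.append(word)
    return tokens


def tf(term, tokens):
    """Term frequency: occurrences of the term over total terms in the doc."""
    return tokens.count(term) / len(tokens)


def idf(term, token_lists):
    """Inverse document frequency with the natural logarithm."""
    containing = sum(1 for tokens in token_lists if term in tokens)
    if containing == 0:
        return 0.0  # assumption: an unseen term contributes nothing
    return math.log(len(token_lists) / containing)


def rank_by_tf(term, docs):
    """Return 1-based document numbers ordered by descending TF."""
    scores = {i: tf(term, tokenize(doc)) for i, doc in enumerate(docs, start=1)}
    return sorted(scores, key=scores.get, reverse=True)


original = [
    "The orange cat is a very good cat",
    "My cat ate an orange",
    "Cats are the best and I will give every cat a special cat toy",
]
# Same documents, with document 3 stuffed with extra copies of the keyword.
stuffed = original[:2] + [original[2] + " cat cat cat cat cat"]

print(rank_by_tf("cat", original))
print(rank_by_tf("cat", stuffed))
print(idf("cat", [tokenize(doc) for doc in original]))
```

The second ranking shows why raw term frequency is easy to game: repeating the keyword moves document 3 to the top without making it any more relevant.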
Quality
By Lisa, airfriend.etsy.com
● User reviews
● Clicks
● Favorites
● Adds to shopping cart
● Purchases
● Dwell (time spent viewing the item)
● ...and more!
Recency
By Olya, foxberrystudio.etsy.com
● Ensure that each visit is new and fresh
● New items have a chance to be seen
Query Understanding
● Tokenization and stemming
● Language identification
● Spelling correction
● Query rewriting (scoping, expansion, relaxation)
For more information: http://queryunderstanding.com/, by Daniel Tunkelang
Query Scoping
q=“red mittens” → q=“mittens” & color=red
q=“pizza restaurants in Medellin” → q=“pizza restaurant” & location=“Medellin”
q=“necklace under $20” → q=“necklace” & price<20
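A toy, rule-based version of query scoping is sketched below. Everything in it is an assumption made for illustration: the color vocabulary, the two regular expressions, the filter names, and the function `scope_query`. It also leaves the remaining text as typed, so it returns "pizza restaurants" rather than the stemmed "pizza restaurant".

```python
import re

COLORS = {"red", "blue", "green"}  # hypothetical color vocabulary


def scope_query(query):
    """Split a raw query into free text plus structured filters."""
    filters = {}
    text = query

    # "under $20" becomes an upper bound on price.
    price = re.search(r"\bunder \$(\d+(?:\.\d+)?)", text)
    if price:
        filters["price_lt"] = float(price.group(1))
        text = text[: price.start()] + text[price.end():]

    # "in Medellin" becomes a location filter (capitalized words after "in").
    location = re.search(r"\bin ([A-Z]\w*(?: [A-Z]\w*)*)", text)
    if location:
        filters["location"] = location.group(1)
        text = text[: location.start()] + text[location.end():]

    # The first known color word becomes a color filter.
    remaining = []
    for word in text.split():
        if word.lower() in COLORS and "color" not in filters:
            filters["color"] = word.lower()
        else:
            remaining.append(word)

    return " ".join(remaining), filters


for raw in ("red mittens", "pizza restaurants in Medellin", "necklace under $20"):
    print(raw, "->", scope_query(raw))
```

Hand-written rules like these break quickly on real queries (a "red velvet cake" is not a red item), which is part of what makes query understanding an interesting problem.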
How Etsy Uses Thermodynamics to Help You Search for “Geeky”, by Fiona Condon
http://codeascraft.com/2015/08/31/how-etsy-uses-thermodynamics-to-help-you-search-for-geeky
Agenda
✓ Why Build Search Systems?
✓ Search Indexes
✓ Open Source Tools
✓ Interesting Challenges in Search