HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING
Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)
Presented by Vaibhav Mehta
Scenario
Arnab Nandi & Phil Bernstein
Search over structured data: commerce, entertainment
Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.
[Figure: users, query, results, search engine + data warehouse, 3rd party feeds, “Amazon.com”]
• High precision (irrespective of recall)
• Minimal human involvement
Example Feed
3rd Party Movie Site (Foreign)
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>
Warehouse: Movies (Host)
<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <MPAA>NR</MPAA>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
Schema Matching
[Figure: the 3rd party (foreign) and warehouse (host) movie records from the Example Feed, shown side by side; in this version the host record carries <RATING>NR</RATING> instead of <MPAA>NR</MPAA>]
Taxonomy Matching
[Figure: the same foreign and host movie records as under Schema Matching, with <RATING>NR</RATING> in the host record]
Various Problems
Badly normalized…
Unit conversion…
Formatting choices…
In-band signaling…
Arbitrary labels
Non-standard vocabulary / language
Zero documentation
Not enough instances
Unlike conventional matching…
We have web search click data
For both Warehouse & 3rd party website
The databases we are integrating (usually) have a presence on the web
Why not use click data as a feature for schema & taxonomy matching?
[Figure: users, query, results, search engine + data warehouse, 3rd party feed]
Outline
Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results
Core idea
“If two (sets of) products are searched for by similar queries, then they are similar”
[Figure: web search clicklog entries for the query “Small laptop”; pages X, Y, Z; warehouse categories “Small Laptops” and “Pro. Laptops”; foreign category “eee ::: small laptops”]
Query Distributions
[Figure: query distribution (click count per query)]
Mapping to Taxonomy
Map URL to product, which belongs to taxonomy
http://www.amazon.com/dp/B001JTA59C
Shopping | Electronics | Netbooks
3rd party DB (provided to us)
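The URL-to-taxonomy lookup can be sketched as follows. This is a minimal illustration, not the system's actual code: the dictionary `PRODUCT_TAXONOMY` stands in for the 3rd party DB, both helper names are hypothetical, and the `/dp/<id>` URL pattern is assumed from the single example URL above.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the 3rd party DB: product ID -> taxonomy path.
PRODUCT_TAXONOMY = {
    "B001JTA59C": ("Shopping", "Electronics", "Netbooks"),
}


def product_id_from_url(url):
    """Extract the product ID from a /dp/<id> style URL (assumed pattern)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "dp" in parts:
        i = parts.index("dp")
        if i + 1 < len(parts):
            return parts[i + 1]
    return None


def taxonomy_for_url(url):
    """Return the taxonomy path of the product behind a clicked URL, or None."""
    return PRODUCT_TAXONOMY.get(product_id_from_url(url))


path = taxonomy_for_url("http://www.amazon.com/dp/B001JTA59C")
print(" | ".join(path))  # Shopping | Electronics | Netbooks
```

URLs that do not resolve to a known product return `None`, so their clicks can simply be skipped when query distributions are built.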
Aggregating Query Distributions
[Figure: per-page query distributions aggregated per category; warehouse categories “Small Laptops” and “Pro. Laptops”; foreign category “eee ::: small laptops”]
Generating Correspondences
Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
Process
For each page (URL)
  Identify query distribution
  Identify category / schema element of that page
For each category / schema element C
  Aggregate over pages in C to get query distribution
For each foreign category / schema element
  Find host category / schema element with most similar query distribution
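The three steps of the process can be sketched in a few lines of Python. Everything here is illustrative: the function names, the toy clicklog, the URLs and the categories are all made up, and `overlap` is only a placeholder similarity (shared probability mass of identical queries), not the metric used in the talk.

```python
from collections import Counter, defaultdict


def aggregate(clicklog, category_of):
    """Sum clicks per (category, query), then normalize to distributions.

    clicklog: iterable of (query, clicks, url); category_of: url -> category.
    """
    counts = defaultdict(Counter)
    for query, clicks, url in clicklog:
        if url in category_of:
            counts[category_of[url]][query] += clicks
    return {
        cat: {q: n / sum(c.values()) for q, n in c.items()}
        for cat, c in counts.items()
    }


def overlap(p, q):
    """Placeholder similarity: probability mass shared by identical queries."""
    return sum(min(p[k], q[k]) for k in p.keys() & q.keys())


def match(foreign, host, similarity=overlap):
    """Map each foreign category to the host category with the most similar distribution."""
    return {
        f: max(host, key=lambda h: similarity(host[h], dist))
        for f, dist in foreign.items()
    }


# Hypothetical toy data: one clicklog covering host and foreign pages.
clicklog = [
    ("tent", 8, "host/1"), ("sleeping bag", 2, "host/1"),
    ("tent", 1, "host/2"), ("stove", 9, "host/2"),
    ("tent", 5, "foreign/9"), ("stove", 1, "foreign/9"),
]
host = aggregate(clicklog, {"host/1": "Shelter", "host/2": "Cooking"})
foreign = aggregate(clicklog, {"foreign/9": "Tents"})
print(match(foreign, host))  # {'Tents': 'Shelter'}
```

Passing the similarity function as a parameter keeps the matching step independent of the particular distribution metric.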
Example: Taxonomy Matching
query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc
Categories of the clicked pages:
Warehouse: Small Laptops (http://searchengine.com/product/mininote)
Warehouse: Professional Laptops (http://searchengine.com/product/macbookpro)
eee (http://asus.com/eeepc)
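As a rough sketch, the per-page query distributions can be derived from the clicklog table above by summing clicks per URL and query. The rows are copied from the table; the function name `click_counts` is made up for this illustration.

```python
from collections import Counter, defaultdict

# Rows of the example clicklog: (query, freq, url).
CLICKLOG = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]


def click_counts(clicklog):
    """Return {url: {query: total clicks}} for every clicked URL."""
    counts = defaultdict(Counter)
    for query, freq, url in clicklog:
        counts[url][query] += freq
    return {url: dict(c) for url, c in counts.items()}


for url, counts in click_counts(CLICKLOG).items():
    total = sum(counts.values())
    # Print each query's share of the page's clicks as count/total.
    print(url, {q: f"{n}/{total}" for q, n in counts.items()})
```

For the mininote page, for instance, this gives 25/45 for "laptop" and 20/45 for "netbook".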
Query distributions per category:
Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
Distribution Similarity Metric
Similarity(host, foreign) = Σ over all (q_host, q_foreign) combinations of Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign)
“small laptops” vs “eee”:
laptop vs laptop: 1 × (5/25)
netbook vs netbook: 1 × (20/45)
laptop vs cheap laptop: 0.5 × (5/25)
1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74
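The worked example can be checked with a short sketch. Two points are my reading of the slide rather than something it states outright: Jaccard is taken over the sets of words in each query (which gives 0.5 for "laptop" vs "cheap laptop"), and MinFreq is the smaller of the two relative frequencies. The distributions are the ones listed for “small laptops” and “eee”.

```python
def jaccard(a, b):
    """Jaccard similarity of two queries, each treated as a set of words."""
    wa, wb = set(a.split()), set(b.split())
    if not (wa | wb):
        return 0.0
    return len(wa & wb) / len(wa | wb)


def similarity(host, foreign):
    """Sum Jaccard x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        jaccard(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )


small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

print(round(similarity(small_laptops, eee), 2))  # 0.74
```

Query pairs with no words in common have a Jaccard score of 0 and drop out of the sum, which is why only three terms appear in the worked example.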
Example: Taxonomy Matching (result)
Similarity of “eee” to Warehouse: Small Laptops: 0.74
Similarity of “eee” to Warehouse: Professional Laptops: 0.31
“eee” is therefore matched to Warehouse: Small Laptops, the host category with the more similar query distribution.
Advantages of Clicklogs
Resilient to language
Resilient to new domains, data, and features
  As long as people query & click, we have data to learn from
Generates mappings previous methods can’t
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools
System Design
[Figure: system architecture diagram]
Experimenting with Click Logs
Commercial warehouse mapping: 258 products from a 70,000 term Amazon.com taxonomy (613 in gold) to a 6,000 term warehouse taxonomy (40 in gold)
Live.com (now Bing.com) search querylog
Amazon to warehouse mapping task, consecutively halving the clicklog size used
1.8 million clicks to Amazon.com product pages
Each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22)
Summary of Results
90% precision / recall possible
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
Precision / Recall
Commercial warehouse mapping: 258 products from a 70K term Amazon.com taxonomy to a 6,000 term warehouse taxonomy (613 categories used)
Varying Clicklog Size
Successively decreased clicklog size by half
Recall decreases as clicklog size is decreased
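One way to produce such a series of shrinking clicklogs is sketched below. The slides do not say how each half was chosen, so uniform random sampling of rows is an assumption here, and `halvings` is a made-up helper name.

```python
import random


def halvings(clicklog, rounds, seed=0):
    """Yield the full clicklog, then successively smaller random halves of it."""
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    current = list(clicklog)
    yield current
    for _ in range(rounds):
        # Assumption: each step keeps a uniform random half of the rows.
        current = rng.sample(current, len(current) // 2)
        yield current


# Hypothetical 16-row clicklog; only the sizes matter for this illustration.
rows = [("query %d" % i, 1, "url/%d" % i) for i in range(16)]
print([len(subset) for subset in halvings(rows, 3)])  # [16, 8, 4, 2]
```

Because each subset is drawn from the previous one, the smaller logs are nested inside the larger ones, so any change in recall comes from losing data rather than from seeing different data.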
Comparing Query Distributions
Similarity(host, foreign) = Σ over all (q_host, q_foreign) combinations of Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign)
Replace Jaccard with various phrase similarity metrics
Minimal difference due to the size of most queries
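Swapping the phrase similarity can be sketched by making it a parameter of the distribution metric. The slides do not name the alternative metrics that were tried, so the word-level Dice coefficient and the character-level ratio from Python's `difflib` are illustrative stand-ins only; the distributions are the “small laptops” and “eee” ones from the earlier example.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Word-set Jaccard similarity of two queries."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0


def dice(a, b):
    """Word-set Dice coefficient (stand-in alternative, not from the slides)."""
    wa, wb = set(a.split()), set(b.split())
    return 2 * len(wa & wb) / (len(wa) + len(wb)) if (wa or wb) else 0.0


def char_ratio(a, b):
    """Character-level similarity from difflib (stand-in alternative)."""
    return SequenceMatcher(None, a, b).ratio()


def similarity(host, foreign, phrase_sim=jaccard):
    """Sum phrase_sim x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        phrase_sim(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )


small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

for metric in (jaccard, dice, char_ratio):
    print(metric.__name__, round(similarity(small_laptops, eee, metric), 2))
```

Only pairs of non-identical queries are scored differently by the three functions, since identical queries score 1.0 under all of them.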
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
Conclusion
Unsupervised mapping is possible
  Very high recall / precision when enough queries are present
Click logs are promising
  Finds results that other methods cannot find
  As clicklog size increases, it will produce more mappings
Combinable with existing methods
Questions?