

HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING

Arnab Nandi (University of Michigan), Phil Bernstein (Microsoft Research)


Scenario

Arnab Nandi & Phil Bernstein

Arnab Nandi & Phil Bernstein

3

Scenario

Search over structured data: commerce, entertainment

Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.


[Diagram: users send queries to the search engine + data warehouse and receive results; multiple 3rd-party feeds (e.g., "Amazon.com") are onboarded into the warehouse]

- High precision
- High recall
- Minimal human involvement

Arnab Nandi & Phil Bernstein

5

Example Feed

-<Movie> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> <Release Key="Yes">2008</Release> <Description>Ever…</Description> <RunTime>127</RunTime><Categories>

 <Category>Action</Category> <Category>Comedy</Category>

 </Categories> <MPAA>PG-13</MPAA> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>-<Persons> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>-</Persons> </Movie>

Warehouse: Movies (Host) 3rd Party Movie Site (Foreign)

<MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of

the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <MPAA>NR</MPAA> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2></MOVIE>

Arnab Nandi & Phil Bernstein

6

Schema Matching

-<Movie> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> <Release Key="Yes">2008</Release> <Description>Ever…</Description> <RunTime>127</RunTime><Categories>

 <Category>Action</Category> <Category>Comedy</Category>

 </Categories> <MPAA>PG-13</MPAA> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>-<Persons> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>-</Persons> </Movie>

Warehouse: Movies (Host) 3rd Party Movie Site (Foreign)

<MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of

the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <RATING>NR</RATING> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2></MOVIE>

From        To
Movie       MOVIE
Title       MOVIE_NAME
Runtime     RUNTIME
Category    GENRE*
MPAA        RATING
Person      ACTOR*

Arnab Nandi & Phil Bernstein

7

Taxonomy Matching

-<Movie> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> <Release Key="Yes">2008</Release> <Description>Ever…</Description> <RunTime>127</RunTime><Categories>

 <Category>Action</Category> <Category>Comedy</Category>

 </Categories> <MPAA>PG-13</MPAA> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>-<Persons> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>-</Persons> </Movie>

Warehouse: Movies (Host) 3rd Party Movie Site (Foreign)

<MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of

From        To
Action      Action/Adventure
PG-13       NR
R           R


Various Problems

- Badly normalized
- Unit conversion
- Formatting choices
- In-band signaling
- Arbitrary labels
- Non-standard vocabulary / language
- Zero documentation
- Not enough instances


Unlike conventional matching…

Arnab Nandi & Phil Bernstein

- We have web search click data for both the warehouse and the 3rd-party website
- The databases we are integrating (usually) have a presence on the web
- Why not use click data as a feature for schema & taxonomy matching?



Outline

- Scenario
- Using Clicklogs
  - Core idea
  - Using Query Distributions
  - Example
  - System Architecture
- Results

Arnab Nandi & Phil Bernstein

11

Core idea

“If two (sets of) products are searched for by similar queries, then they are similar”

[Illustration: a user issues the web search "small laptop"]

[Diagram: the clicklog records which pages users click for each query — e.g., users X, Y, and Z search "small laptop" or "hardware eee" and click through to products in the warehouse categories "Small Laptops" and "Pro. Laptops", and to the "eee" pages on Asus.com]


Query Distributions

Arnab Nandi & Phil Bernstein

[Bar chart: the query distribution for one product page — click counts (0–50) for the queries "small laptop", "netbook", "hp mini 1000", and "hp mini"]
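As a concrete illustration (not taken from the paper; the counts below are invented), a per-page query distribution can be kept as a simple map from query string to click count and then normalized into frequencies:

    from collections import Counter

    def normalize(clicks: Counter) -> dict:
        """Turn raw click counts for one page into a query distribution."""
        total = sum(clicks.values())
        return {query: count / total for query, count in clicks.items()}

    # Hypothetical clicks on one netbook product page (counts are made up).
    page_clicks = Counter({"small laptop": 50, "netbook": 30, "hp mini 1000": 15, "hp mini": 10})
    print(normalize(page_clicks))  # {'small laptop': 0.476..., 'netbook': 0.285..., ...}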


Mapping to Taxonomy

Map each clicked URL to a product, which belongs to a taxonomy category:

http://www.amazon.com/dp/B001JTA59C → Shopping | Electronics | Netbooks
(via the 3rd-party DB provided to us)
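A minimal sketch of this lookup (only the URL and taxonomy path come from the slide; the product name and table contents are hypothetical):

    # The 3rd-party product DB (provided to us) lets us join a clicked URL to a
    # product, and the product to its taxonomy path.
    url_to_product = {"http://www.amazon.com/dp/B001JTA59C": "some netbook"}  # hypothetical product
    product_to_category = {"some netbook": ("Shopping", "Electronics", "Netbooks")}

    def category_of(url):
        product = url_to_product.get(url)
        return product_to_category.get(product)

    print(category_of("http://www.amazon.com/dp/B001JTA59C"))
    # ('Shopping', 'Electronics', 'Netbooks')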


Aggregating Query Distributions

Arnab Nandi & Phil Bernstein

[Diagram: per-page query distributions (bar charts) are summed into a single aggregate query distribution per category — e.g., the warehouse categories "Small Laptops" and "Pro. Laptops", and the "eee" pages on Asus.com]
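A sketch of this aggregation step, using the clicklog example that appears a few slides later (function and variable names are mine, not the paper's):

    from collections import Counter, defaultdict

    def aggregate_by_category(url_clicks, url_category):
        """Sum per-URL click counts into one query distribution per category,
        then normalize each category's distribution."""
        totals = defaultdict(Counter)
        for url, clicks in url_clicks.items():
            category = url_category.get(url)
            if category is None:   # page not covered by the taxonomy mapping
                continue
            totals[category].update(clicks)
        return {cat: {q: c / sum(counts.values()) for q, c in counts.items()}
                for cat, counts in totals.items()}

    url_clicks = {
        "http://searchengine.com/product/mininote":   Counter({"laptop": 25, "netbook": 20}),
        "http://searchengine.com/product/macbookpro": Counter({"laptop": 70, "netbook": 5}),
    }
    url_category = {
        "http://searchengine.com/product/mininote":   "Small Laptops",
        "http://searchengine.com/product/macbookpro": "Professional Laptops",
    }
    print(aggregate_by_category(url_clicks, url_category))
    # {'Small Laptops': {'laptop': 0.556, 'netbook': 0.444}, 'Professional Laptops': {...}}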

Arnab Nandi & Phil Bernstein

17

Generating Correspondences

Goal: to decide whether two schema elements (or categories) match, determine whether they are searched for by the same distribution of queries.

Process:
1. For each page (URL): identify its query distribution and the category / schema element the page belongs to.
2. For each category / schema element C: aggregate over the pages in C to get C's query distribution.
3. For each foreign category / schema element: find the host category / schema element with the most similar query distribution.
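Read as code, the process might look like the following sketch (names are mine; `similarity` stands for the distribution-similarity metric defined on a later slide):

    from collections import Counter, defaultdict

    def generate_correspondences(pages, similarity):
        """pages: iterable of (query_distribution, side, element) tuples, one per URL,
        where side is 'host' or 'foreign' and element is the page's schema element
        or category. Returns {foreign element: best-matching host element}."""
        # Steps 1-2: aggregate per-page query distributions by element, then normalize.
        agg = defaultdict(Counter)
        for qdist, side, element in pages:
            agg[(side, element)].update(qdist)
        norm = {key: {q: c / sum(counts.values()) for q, c in counts.items()}
                for key, counts in agg.items()}

        host    = {e: d for (side, e), d in norm.items() if side == "host"}
        foreign = {e: d for (side, e), d in norm.items() if side == "foreign"}

        # Step 3: for each foreign element, pick the host element whose
        # aggregated query distribution is most similar.
        return {f: max(host, key=lambda h: similarity(host[h], f_dist))
                for f, f_dist in foreign.items()}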


Example: Taxonomy Matching

Arnab Nandi & Phil Bernstein

query          freq   url
laptop          70    http://searchengine.com/product/macbookpro
laptop          25    http://searchengine.com/product/mininote
laptop           5    http://asus.com/eeepc
netbook          5    http://searchengine.com/product/macbookpro
netbook         20    http://searchengine.com/product/mininote
netbook         15    http://asus.com/eeepc
cheap netbook    5    http://asus.com/eeepc

The mininote page belongs to the warehouse category "Small Laptops", the macbookpro page to "Professional Laptops", and the asus.com/eeepc page is the foreign "eee" page.


Aggregated query distributions:

Warehouse: Small Laptops         "laptop": 25/45, "netbook": 20/45
Warehouse: Professional Laptops  "laptop": 70/75, "netbook": 5/75
eee                              "laptop": 5/25,  "netbook": 15/25, "cheap laptop": 5/25


Distribution Similarity Metric

Arnab Nandi & Phil Bernstein

similarity(host, foreign) = Σ over all (q_host, q_foreign) combinations of Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign)
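One direct reading of this metric, as a sketch (token-level Jaccard between the query strings, weighted by the smaller of the two normalized frequencies, summed over all query pairs); applied to the "Small Laptops" and "eee" distributions above, it reproduces the ≈0.74 worked out below:

    def jaccard(a, b):
        """Token-level Jaccard similarity between two query strings."""
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb)

    def qd_similarity(host, foreign):
        """Sum of Jaccard(q_host, q_foreign) * min(freq_host, freq_foreign)
        over all query pairs; each argument maps query -> normalized frequency."""
        return sum(jaccard(qh, qf) * min(fh, ff)
                   for qh, fh in host.items()
                   for qf, ff in foreign.items())

    small_laptops = {"laptop": 25/45, "netbook": 20/45}
    eee           = {"laptop": 5/25, "netbook": 15/25, "cheap laptop": 5/25}
    print(round(qd_similarity(small_laptops, eee), 2))  # 0.74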


"Small Laptops" vs "eee":
- laptop vs laptop (Jaccard 1)
- netbook vs netbook (Jaccard 1)
- laptop vs cheap laptop (Jaccard 0.5)

1 × min(25/45, 5/25) + 1 × min(20/45, 15/25) + 0.5 × min(25/45, 5/25)
≈ 0.20 + 0.44 + 0.10 ≈ 0.74

Comparing the foreign "eee" category against the two warehouse categories:

Warehouse: Small Laptops vs eee          0.74
Warehouse: Professional Laptops vs eee   0.31

so "eee" is matched to "Small Laptops".

Arnab Nandi & Phil Bernstein

23

Advantages of Clicklogs

- Resilient to language
- Resilient to new domains, data, and features: as long as people query & click, we have data to learn from
- Generates mappings previous methods can't:
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools


System Design

Arnab Nandi & Phil Bernstein

25


Experimenting with Click Logs

- Commercial warehouse mapping: 258 products from a 70,000-term Amazon.com taxonomy (613 in gold) to a 6,000-term warehouse taxonomy (40 in gold)
- Live.com (now Bing.com) search querylog
- Amazon-to-warehouse mapping task, consecutively halving the clicklog size used
- 1.8 million clicks to Amazon.com product pages
- Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22)


Summary of Results

Arnab Nandi & Phil Bernstein

- 90% precision / recall possible
- Query distribution is a good similarity metric
- Bigger clicklogs imply better recall
- Technique isn't very sensitive to the similarity metric

Arnab Nandi & Phil Bernstein

28

Precision / Recall

Commercial warehouse mapping: 258 products from a 70K-term Amazon.com taxonomy to a 6,000-term warehouse taxonomy (613 categories used)

[Chart: precision vs. recall (both 0–1) for the instance-based, query-distribution, consensus, and name-based matchers]


Match Quality

- QDs are unique to entities
- QDs are unique to aggregate classes

Correct matches when comparing query distributions across the four sets:

                       Amazon Products   Amazon Categories   Warehouse Products   Warehouse Categories
Amazon Products        257/258           241/258             189/258 (73%)        226/258
Amazon Categories                        373/613             204/400              525/613 (85%)
Warehouse Products                                           392/400              383/400
Warehouse Categories                                                              40/40

- QDs of entities are closest to the distributions of their aggregate classes
- QDs of similar aggregates are similar


Varying Clicklog Size

Successively decreased clicklog size by half

Recall decreases as clicklog size is decreased

[Chart: precision vs. recall for items and categories as the clicklog is reduced from the full log to 1/2, 1/4, …, 1/32 of its size]

Arnab Nandi & Phil Bernstein

33


Comparing Query Distributions

similarity(host, foreign) = Σ over all (q_host, q_foreign) combinations of Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign)

Replaced Jaccard with various phrase-similarity metrics
Minimal difference, because most queries are only a few words long

Arnab Nandi & Phil Bernstein

35


Related + Future Work

Arnab Nandi & Phil Bernstein

Usage-based / crowdsourcing:
- Usage-Based Schema Matching (ICDE 2008). H. Elmeleegy, M. Ouzzani, A. Elmagarmid.
- Matching Schemas in Online Communities: A Web 2.0 Approach (ICDE 2008). R. McCann, W. Shen, A. Doan.

Web-scale integration:
- Web-scale Data Integration: You Can Only Afford to Pay As You Go (CIDR 2007). J. Madhavan, S. R. Jeffery, S. Cohen, X. (Luna) Dong, D. Ko, C. Yu, A. Halevy.


"Mixed" methods:
- Ontology Matching: A Machine Learning Approach (Handbook on Ontologies, 2004). A. Doan, J. Madhavan, P. Domingos, A. Halevy.
- Learning to Match the Schemas of Data Sources: A Multistrategy Approach (Machine Learning Journal, 2003). A. Doan, P. Domingos, A. Halevy.
- Schema and Ontology Matching with COMA++ (SIGMOD 2005). D. Aumueller, H. H. Do, S. Massmann, E. Rahm.

Arnab Nandi & Phil Bernstein

38

Conclusion

- Unsupervised mapping is possible: very high recall / precision when enough queries are present
- Click logs are promising: they find results that other methods cannot, and as clicklog size increases they will produce more mappings
- Combinable with existing methods


Arnab Nandi & Phil Bernstein

http://arnab.org/contact

http://research.microsoft.com/~philbe/

Questions?