

HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING

Arnab Nandi (University of Michigan), Phil Bernstein (Microsoft Research)


Scenario

Arnab Nandi & Phil Bernstein

Arnab Nandi & Phil Bernstein

3

Scenario

Search over structured data: commerce, entertainment

Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.


[Diagram: users send queries to the search engine + data warehouse and receive results; multiple 3rd-party feeds (e.g., "Amazon.com") are onboarded into the warehouse]

- High precision
- High recall
- Minimal human involvement

Arnab Nandi & Phil Bernstein

5

Example Feed

-<Movie> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> <Release Key="Yes">2008</Release> <Description>Ever…</Description> <RunTime>127</RunTime><Categories>

 <Category>Action</Category> <Category>Comedy</Category>

 </Categories> <MPAA>PG-13</MPAA> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>-<Persons> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>-</Persons> </Movie>

Warehouse: Movies (Host) 3rd Party Movie Site (Foreign)

<MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of

the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <MPAA>NR</MPAA> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2></MOVIE>

Arnab Nandi & Phil Bernstein

6

Schema Matching

-<Movie> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> <Release Key="Yes">2008</Release> <Description>Ever…</Description> <RunTime>127</RunTime><Categories>

 <Category>Action</Category> <Category>Comedy</Category>

 </Categories> <MPAA>PG-13</MPAA> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>-<Persons> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>-</Persons> </Movie>

Warehouse: Movies (Host) 3rd Party Movie Site (Foreign)

<MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of

the Crystal Skull</MOVIE_NAME> <RUNTIME>02:00</RUNTIME> <GENRE1>Action/Adventure</GENRE1> <GENRE2/> <RATING>NR</RATING> <ADVISORY/> <URL>http://www.indianajones.com/</URL> <ACTOR1>Harrison Ford</ACTOR1> <ACTOR2>Karen Allen</ACTOR2></MOVIE>

From        To
Movie       MOVIE
Title       MOVIE_NAME
Runtime     RUNTIME
Category    GENRE*
MPAA        RATING
Person      ACTOR*

Arnab Nandi & Phil Bernstein

7

Taxonomy Matching

-<Movie> <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title> <Release Key="Yes">2008</Release> <Description>Ever…</Description> <RunTime>127</RunTime><Categories>

 <Category>Action</Category> <Category>Comedy</Category>

 </Categories> <MPAA>PG-13</MPAA> <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>-<Persons> <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>-</Persons> </Movie>

Warehouse: Movies (Host) 3rd Party Movie Site (Foreign)

<MOVIE> <MOVIE_ID>57590</MOVIE_ID> <MOVIE_NAME>Indiana Jones and the Kingdom of

From        To
Action      Action/Adventure
PG-13       NR
R           R


Various Problems

- Badly normalized
- Unit conversion
- Formatting choices
- In-band signaling
- Arbitrary labels
- Non-standard vocabulary / language
- Zero documentation
- Not enough instances


Unlike conventional matching…

Arnab Nandi & Phil Bernstein

- We have web search click data for both the warehouse and the 3rd-party website
- The databases we are integrating (usually) have a presence on the web
- Why not use click data as a feature for schema & taxonomy matching?



Outline

- Scenario
- Using Clicklogs
  - Core idea
  - Using Query Distributions
  - Example
  - System Architecture
- Results

Arnab Nandi & Phil Bernstein

11

Core idea

“If two (sets of) products are searched for by similar queries, then they are similar”

[Illustration: a user issues the web search "small laptop"]

[Diagram: the clicklog records which pages users click for each query — e.g., users X, Y, and Z search "small laptop" or "hardware eee" and click through to products in the warehouse categories "Small Laptops" and "Pro. Laptops", and to the "eee" pages on Asus.com]


Query Distributions

Arnab Nandi & Phil Bernstein

[Bar chart: the query distribution for one product page — click counts (0–50) for the queries "small laptop", "netbook", "hp mini 1000", and "hp mini"]
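As a concrete illustration (not taken from the paper; the counts below are invented), a per-page query distribution can be kept as a simple map from query string to click count and then normalized into frequencies:

    from collections import Counter

    def normalize(clicks: Counter) -> dict:
        """Turn raw click counts for one page into a query distribution."""
        total = sum(clicks.values())
        return {query: count / total for query, count in clicks.items()}

    # Hypothetical clicks on one netbook product page (counts are made up).
    page_clicks = Counter({"small laptop": 50, "netbook": 30, "hp mini 1000": 15, "hp mini": 10})
    print(normalize(page_clicks))  # {'small laptop': 0.476..., 'netbook': 0.285..., ...}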


Mapping to Taxonomy

Map each clicked URL to a product, which belongs to a taxonomy category:

http://www.amazon.com/dp/B001JTA59C → Shopping | Electronics | Netbooks
(via the 3rd-party DB provided to us)
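A minimal sketch of this lookup (only the URL and taxonomy path come from the slide; the product name and table contents are hypothetical):

    # The 3rd-party product DB (provided to us) lets us join a clicked URL to a
    # product, and the product to its taxonomy path.
    url_to_product = {"http://www.amazon.com/dp/B001JTA59C": "some netbook"}  # hypothetical product
    product_to_category = {"some netbook": ("Shopping", "Electronics", "Netbooks")}

    def category_of(url):
        product = url_to_product.get(url)
        return product_to_category.get(product)

    print(category_of("http://www.amazon.com/dp/B001JTA59C"))
    # ('Shopping', 'Electronics', 'Netbooks')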


Aggregating Query Distributions

Arnab Nandi & Phil Bernstein

[Diagram: per-page query distributions (bar charts) are summed into a single aggregate query distribution per category — e.g., the warehouse categories "Small Laptops" and "Pro. Laptops", and the "eee" pages on Asus.com]
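A sketch of this aggregation step, using the clicklog example that appears a few slides later (function and variable names are mine, not the paper's):

    from collections import Counter, defaultdict

    def aggregate_by_category(url_clicks, url_category):
        """Sum per-URL click counts into one query distribution per category,
        then normalize each category's distribution."""
        totals = defaultdict(Counter)
        for url, clicks in url_clicks.items():
            category = url_category.get(url)
            if category is None:   # page not covered by the taxonomy mapping
                continue
            totals[category].update(clicks)
        return {cat: {q: c / sum(counts.values()) for q, c in counts.items()}
                for cat, counts in totals.items()}

    url_clicks = {
        "http://searchengine.com/product/mininote":   Counter({"laptop": 25, "netbook": 20}),
        "http://searchengine.com/product/macbookpro": Counter({"laptop": 70, "netbook": 5}),
    }
    url_category = {
        "http://searchengine.com/product/mininote":   "Small Laptops",
        "http://searchengine.com/product/macbookpro": "Professional Laptops",
    }
    print(aggregate_by_category(url_clicks, url_category))
    # {'Small Laptops': {'laptop': 0.556, 'netbook': 0.444}, 'Professional Laptops': {...}}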

Arnab Nandi & Phil Bernstein

17

Generating Correspondences

Goal: to decide whether two schema elements (or categories) match, determine whether they are searched for by the same distribution of queries.

Process:
1. For each page (URL): identify its query distribution and the category / schema element the page belongs to.
2. For each category / schema element C: aggregate over the pages in C to get C's query distribution.
3. For each foreign category / schema element: find the host category / schema element with the most similar query distribution.
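Read as code, the process might look like the following sketch (names are mine; `similarity` stands for the distribution-similarity metric defined on a later slide):

    from collections import Counter, defaultdict

    def generate_correspondences(pages, similarity):
        """pages: iterable of (query_distribution, side, element) tuples, one per URL,
        where side is 'host' or 'foreign' and element is the page's schema element
        or category. Returns {foreign element: best-matching host element}."""
        # Steps 1-2: aggregate per-page query distributions by element, then normalize.
        agg = defaultdict(Counter)
        for qdist, side, element in pages:
            agg[(side, element)].update(qdist)
        norm = {key: {q: c / sum(counts.values()) for q, c in counts.items()}
                for key, counts in agg.items()}

        host    = {e: d for (side, e), d in norm.items() if side == "host"}
        foreign = {e: d for (side, e), d in norm.items() if side == "foreign"}

        # Step 3: for each foreign element, pick the host element whose
        # aggregated query distribution is most similar.
        return {f: max(host, key=lambda h: similarity(host[h], f_dist))
                for f, f_dist in foreign.items()}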


Example: Taxonomy Matching

Arnab Nandi & Phil Bernstein

query          freq   url
laptop          70    http://searchengine.com/product/macbookpro
laptop          25    http://searchengine.com/product/mininote
laptop           5    http://asus.com/eeepc
netbook          5    http://searchengine.com/product/macbookpro
netbook         20    http://searchengine.com/product/mininote
netbook         15    http://asus.com/eeepc
cheap netbook    5    http://asus.com/eeepc

The mininote page belongs to the warehouse category "Small Laptops", the macbookpro page to "Professional Laptops", and the asus.com/eeepc page is the foreign "eee" page.


Aggregated query distributions:

Warehouse: Small Laptops         "laptop": 25/45, "netbook": 20/45
Warehouse: Professional Laptops  "laptop": 70/75, "netbook": 5/75
eee                              "laptop": 5/25,  "netbook": 15/25, "cheap laptop": 5/25


Distribution Similarity Metric

Arnab Nandi & Phil Bernstein

similarity(host, foreign) = Σ over all (q_host, q_foreign) combinations of Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign)
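One direct reading of this metric, as a sketch (token-level Jaccard between the query strings, weighted by the smaller of the two normalized frequencies, summed over all query pairs); applied to the "Small Laptops" and "eee" distributions above, it reproduces the ≈0.74 worked out below:

    def jaccard(a, b):
        """Token-level Jaccard similarity between two query strings."""
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / len(ta | tb)

    def qd_similarity(host, foreign):
        """Sum of Jaccard(q_host, q_foreign) * min(freq_host, freq_foreign)
        over all query pairs; each argument maps query -> normalized frequency."""
        return sum(jaccard(qh, qf) * min(fh, ff)
                   for qh, fh in host.items()
                   for qf, ff in foreign.items())

    small_laptops = {"laptop": 25/45, "netbook": 20/45}
    eee           = {"laptop": 5/25, "netbook": 15/25, "cheap laptop": 5/25}
    print(round(qd_similarity(small_laptops, eee), 2))  # 0.74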


"Small Laptops" vs "eee":
- laptop vs laptop (Jaccard 1)
- netbook vs netbook (Jaccard 1)
- laptop vs cheap laptop (Jaccard 0.5)

1 × min(25/45, 5/25) + 1 × min(20/45, 15/25) + 0.5 × min(25/45, 5/25)
≈ 0.20 + 0.44 + 0.10 ≈ 0.74

Comparing the foreign "eee" category against the two warehouse categories:

Warehouse: Small Laptops vs eee          0.74
Warehouse: Professional Laptops vs eee   0.31

so "eee" is matched to "Small Laptops".

Arnab Nandi & Phil Bernstein

23

Advantages of Clicklogs

- Resilient to language
- Resilient to new domains, data, and features: as long as people query & click, we have data to learn from
- Generates mappings previous methods can't:
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools


System Design

Arnab Nandi & Phil Bernstein

25


Experimenting with Click Logs

- Commercial warehouse mapping: 258 products from a 70,000-term Amazon.com taxonomy (613 in gold) to a 6,000-term warehouse taxonomy (40 in gold)
- Live.com (now Bing.com) search querylog
- Amazon-to-warehouse mapping task, consecutively halving the clicklog size used
- 1.8 million clicks to Amazon.com product pages
- Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22)


Summary of Results

Arnab Nandi & Phil Bernstein

- 90% precision / recall possible
- Query distribution is a good similarity metric
- Bigger clicklogs imply better recall
- Technique isn't very sensitive to the similarity metric

Arnab Nandi & Phil Bernstein

28

Precision / Recall

Commercial warehouse mapping: 258 products from a 70K-term Amazon.com taxonomy to a 6,000-term warehouse taxonomy (613 categories used)

[Chart: precision vs. recall (both 0–1) for the instance-based, query-distribution, consensus, and name-based matchers]


Match Quality

- QDs are unique to entities
- QDs are unique to aggregate classes

Correct matches when comparing query distributions across the four sets:

                       Amazon Products   Amazon Categories   Warehouse Products   Warehouse Categories
Amazon Products        257/258           241/258             189/258 (73%)        226/258
Amazon Categories                        373/613             204/400              525/613 (85%)
Warehouse Products                                           392/400              383/400
Warehouse Categories                                                              40/40

- QDs of entities are closest to the distributions of their aggregate classes
- QDs of similar aggregates are similar


Varying Clicklog Size

Successively decreased clicklog size by half

Recall decreases as clicklog size is decreased

[Chart: precision vs. recall for items and categories as the clicklog is reduced from the full log to 1/2, 1/4, …, 1/32 of its size]

Arnab Nandi & Phil Bernstein

33


Comparing Query Distributions

similarity(host, foreign) = Σ over all (q_host, q_foreign) combinations of Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign)

Replaced Jaccard with various phrase-similarity metrics
Minimal difference, because most queries are only a few words long

Arnab Nandi & Phil Bernstein

35


Related + Future Work

Arnab Nandi & Phil Bernstein

Usage-based / crowdsourcing:
- Usage-Based Schema Matching (ICDE 2008). H. Elmeleegy, M. Ouzzani, A. Elmagarmid.
- Matching Schemas in Online Communities: A Web 2.0 Approach (ICDE 2008). R. McCann, W. Shen, A. Doan.

Web-scale integration:
- Web-scale Data Integration: You Can Only Afford to Pay As You Go (CIDR 2007). J. Madhavan, S. R. Jeffery, S. Cohen, X. (Luna) Dong, D. Ko, C. Yu, A. Halevy.


"Mixed" methods:
- Ontology Matching: A Machine Learning Approach (Handbook on Ontologies, 2004). A. Doan, J. Madhavan, P. Domingos, A. Halevy.
- Learning to Match the Schemas of Data Sources: A Multistrategy Approach (Machine Learning Journal, 2003). A. Doan, P. Domingos, A. Halevy.
- Schema and Ontology Matching with COMA++ (SIGMOD 2005). D. Aumueller, H. H. Do, S. Massmann, E. Rahm.

Arnab Nandi & Phil Bernstein

38

Conclusion

- Unsupervised mapping is possible: very high recall / precision when enough queries are present
- Click logs are promising: they find results that other methods cannot, and as clicklog size increases they will produce more mappings
- Combinable with existing methods


Arnab Nandi & Phil Bernstein

http://arnab.org/contact

http://research.microsoft.com/~philbe/

Questions?