Date post: 27-Sep-2020
http://poloclub.gatech.edu/cse6242
CSE6242: Data & Visual Analytics

Data Integration

Duen Horng (Polo) Chau
Associate Professor, College of Computing
Associate Director, MS Analytics
Machine Learning Area Leader, College of Computing
Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos


What is Data Integration?
Combining data from multiple sources to provide the user with a unified view.

Why is it Important?
Think about the apps, websites, and services that you use every day.


Businesses derive value through data integration.


Apple Siri


More Examples?

• Social media (data from users, businesses)
• Facebook: your posts, advertisements, reviews
• Search engines: Google, Bing, Yahoo, etc.
• Smart assistants: Siri, Cortana, Alexa
• Price comparison: Kayak
• Uber, Lyft: drivers, traffic data, customers
• Google Maps: users, restaurants, traffic, …


How to do data integration?


“Low” Effort Approaches

1. Use the database’s “Join”! (e.g., SQLite)

   id   name
   111  Smith
   222  Johnson
   333  Lee

   id   salary
   111  $40k
   222  $60k
   333  $50k

   Joined on id:

   id   name     salary
   111  Smith    $40k
   222  Johnson  $60k
   333  Lee      $50k

   When does this approach work? (Or, when does it NOT work?)

2. OpenRefine http://openrefine.org (Video #3 “Reconcile and Match Data”)
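
A minimal sketch of the join above, using Python's built-in sqlite3 module with an in-memory database. The table names `names` and `salaries` are assumptions (the slide does not name its tables), and the extra row with id 444 is made up, added only to show what happens when an ID has no match.

```python
import sqlite3

# In-memory database; the table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE salaries (id INTEGER PRIMARY KEY, salary TEXT)")

conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    # Row 444 is a made-up extra record with no salary entry.
    [(111, "Smith"), (222, "Johnson"), (333, "Lee"), (444, "Garcia")],
)
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [(111, "$40k"), (222, "$60k"), (333, "$50k")],
)

# Inner join: only IDs present in BOTH tables survive.
joined = conn.execute(
    "SELECT n.id, n.name, s.salary "
    "FROM names AS n JOIN salaries AS s ON n.id = s.id "
    "ORDER BY n.id"
).fetchall()

# Left join: keeps every row of `names`; a missing salary becomes None.
left_joined = conn.execute(
    "SELECT n.id, n.name, s.salary "
    "FROM names AS n LEFT JOIN salaries AS s ON n.id = s.id "
    "ORDER BY n.id"
).fetchall()

print(joined)
print(left_joined)
```

The join works here because both tables carry the same exact id values. It stops working as soon as the sources disagree on IDs, or have none at all.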


IDs are really important, and can simplify data integration!

But who creates the IDs?


Crowd-sourcing Approaches: Freebase

Freebase intro video: https://youtu.be/TJfrNo3Z-DU
Learn more about Freebase at https://en.wikipedia.org/wiki/Freebase


Freebase (a graph of entities)

“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”
(Wikipedia)


So what? What can you do with the Freebase knowledge graph?

Hint: Google acquired it in 2010.


Google Knowledge Graph video: https://youtu.be/mmQl6VGvX-c


Freebase replaced by Google Knowledge Graph API


Example: What does Google know about Taylor Swift?

https://developers.google.com/knowledge-graph/
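
A rough sketch of how such a lookup might be issued from Python, using only the standard library. The endpoint and the `query`, `key`, and `limit` parameters follow the Knowledge Graph Search API documentation linked above as best I understand it. The response field names used in `entity_names` (`itemListElement`, `result`, `name`) are likewise an assumption about the documented format, and `YOUR_API_KEY` is a placeholder. No request is sent unless `search` is called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Knowledge Graph Search API (see the docs linked above).
KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_kg_url(query, api_key, limit=1):
    """Build the request URL for an entity search."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_ENDPOINT + "?" + urlencode(params)


def entity_names(response):
    """Pull entity names out of a parsed JSON response.

    Assumes results live under "itemListElement", each holding a
    "result" object with a "name" field.
    """
    return [
        item.get("result", {}).get("name")
        for item in response.get("itemListElement", [])
    ]


def search(query, api_key, limit=1):
    """Call the API; needs network access and a valid key."""
    with urlopen(build_kg_url(query, api_key, limit)) as resp:
        return json.load(resp)


# Only builds and prints the URL; "YOUR_API_KEY" is a placeholder.
print(build_kg_url("Taylor Swift", "YOUR_API_KEY"))
```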



What if we don’t have the luxury of having IDs?

(Screenshot from Freebase video)

A common problem in academia:
Polo Chau
Duen Horng Chau
Duen Chau
D. Chau
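
To see why simple matching falls short here, the sketch below tries two naive keys on the four name variants: the raw string, and a "last name + first initial" key. The key and the helper name `initial_key` are hypothetical, chosen only for illustration; they are not from the lecture.

```python
from collections import defaultdict

# The four variants listed above; all refer to the same person.
variants = ["Polo Chau", "Duen Horng Chau", "Duen Chau", "D. Chau"]


def initial_key(name):
    """Hypothetical matching key: (last name, first initial), lowercased."""
    parts = name.replace(".", "").lower().split()
    return (parts[-1], parts[0][0])


# Exact string matching: every variant looks like a different person.
exact_groups = set(variants)

# Key-based matching: group the variants that share the same key.
key_groups = defaultdict(list)
for name in variants:
    key_groups[initial_key(name)].append(name)

print(len(exact_groups))
print(dict(key_groups))
```

The key merges three of the variants but still misses the nickname “Polo”, and it would wrongly merge any other author whose name reduces to the same key.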


Then you need to do…

Entity Resolution
(A hard problem in data integration)


Why is entity resolution so difficult?

Let’s understand it through shopping for an iPhone on Apple, Amazon and eBay


D-Dupe
Interactive Data Deduplication and Integration
TVCG 2008
University of Maryland
Bilgic, Licamele, Getoor, Kang, Shneiderman

https://linqspub.soe.ucsc.edu/basilic/web/Publications/2006/bilgic:vast06/


[Figure: graph with nodes Polo, Palo, Alice, Bob, Carol, Dave]


Core components: Similarity functions

Determine how two entities are similar.
D-Dupe’s approach: Attribute similarity + relational similarity

Similarity score for a pair of entities
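
A rough sketch of combining the two kinds of similarity for one pair of entities. The transcript does not give D-Dupe's actual formula, so everything here is an assumption. Attribute similarity is approximated with difflib's string ratio on names, and relational similarity with the Jaccard overlap of neighbor sets. The two are mixed with a hypothetical weight `alpha`, and the neighbor sets are made up for illustration.

```python
from difflib import SequenceMatcher


def attribute_similarity(name_a, name_b):
    """String similarity in [0, 1] between two names (difflib ratio)."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def relational_similarity(neighbors_a, neighbors_b):
    """Jaccard overlap of the two entities' neighbor sets."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


def pair_score(name_a, nbrs_a, name_b, nbrs_b, alpha=0.5):
    """Mix attribute and relational similarity; alpha is a hypothetical weight."""
    attr = attribute_similarity(name_a, name_b)
    rel = relational_similarity(nbrs_a, nbrs_b)
    return alpha * attr + (1 - alpha) * rel


# Made-up neighbor sets for two records that may be the same person.
polo_neighbors = {"Alice", "Bob", "Carol"}
palo_neighbors = {"Bob", "Carol", "Dave"}

score = pair_score("Polo", polo_neighbors, "Palo", palo_neighbors)
print(round(score, 3))
```

Two records with similar names and many shared neighbors score high on both terms, which is the intuition behind using relational evidence alongside attributes.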


Attribute similarity (a weighted sum)
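
One way to read “a weighted sum”: compute a similarity per attribute, then add the results up with per-attribute weights. The sketch below is illustrative only. The attribute names, weights, and records are hypothetical, and difflib's string ratio is a stand-in for whatever per-attribute measure a real system would use.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Per-attribute similarity in [0, 1]; difflib ratio is a stand-in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_attribute_similarity(rec_a, rec_b, weights):
    """Weighted sum of per-attribute similarities.

    `weights` maps attribute name -> weight. The sum is divided by the
    total weight so the result stays in [0, 1].
    """
    total = sum(weights.values())
    weighted = sum(
        w * string_sim(rec_a[attr], rec_b[attr]) for attr, w in weights.items()
    )
    return weighted / total


# Hypothetical records and weights, for illustration only.
a = {"name": "Polo Chau", "affiliation": "Georgia Tech"}
b = {"name": "Polo Chan", "affiliation": "Georgia Tech"}
weights = {"name": 0.7, "affiliation": 0.3}

sim = weighted_attribute_similarity(a, b, weights)
print(round(sim, 3))
```

Raising the weight on an attribute makes disagreements on that attribute count for more in the final score.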


Numerous similarity functions

• Euclidean distance
  Euclidean norm / L2 norm
• TaxiCab/Manhattan distance
• Jaccard similarity (e.g., used with w-shingles)
  e.g., overlap of nodes’ neighbors
• String edit distance
  e.g., “Polo Chau” vs “Polo Chan”

Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
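
The four measures listed above can be sketched in a few lines of plain Python. These are textbook definitions, not code from the lecture. The helper names, the word-level reading of w-shingles, and the convention for two empty sets in `jaccard` are choices made here for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean (L2) distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def manhattan(p, q):
    """Taxicab / Manhattan (L1) distance."""
    return sum(abs(x - y) for x, y in zip(p, q))


def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def shingles(text, w=2):
    """Set of w-shingles, read here as runs of w consecutive words."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(euclidean((0, 0), (3, 4)))                # 5.0
print(manhattan((0, 0), (3, 4)))                # 7
print(jaccard({"Alice", "Bob", "Carol"}, {"Bob", "Carol", "Dave"}))  # 0.5
print(edit_distance("Polo Chau", "Polo Chan"))  # 1
```

Note that the first two are distances (smaller means more alike) while Jaccard is a similarity (larger means more alike), so they need to be put on a common scale before being combined into one score.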


https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html


Excellent Tutorial on Entity Resolution

http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

by Lise Getoor and Ashwin Machanavajjhala
