
Data Integration - Georgia Institute of...

Date post: 31-Aug-2019
CSE6242 / CX4242: Data & Visual Analytics
http://poloclub.gatech.edu/cse6242

Data Integration

Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Page 1: Data Integration - Georgia Institute of Technology · poloclub.gatech.edu/cse6242/2017fall/slides/CSE6242-300-Integrate.pdf



What is Data Integration?
Combining data from multiple sources to provide the user with a unified view.

Why is it Important?
Think about the apps, websites, and services that you use every day.


Businesses derive value through data integration.


More Examples?

• Social media (data from users, businesses)

• Facebook: your posts, advertisements, reviews

• Search engines: Google, Bing, Yahoo, etc.

• Smart assistants: Siri, Cortana, Alexa

• Price comparison: Kayak

• Uber, Lyft: drivers, traffic data, customers

• Google Maps: users, restaurants, traffic…


How to do data integration?


“Low” Effort Approaches

1. Use database’s “Join”! (e.g., SQLite)

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA

id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

When does this approach work? (Or, when does it NOT work?)

2. Open Refine http://openrefine.org (video #3)
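
The join in approach 1 can be sketched as follows. This is a minimal illustration, assuming Python's built-in sqlite3 module and an in-memory database; the table names people and residence and the helper join_people are hypothetical and do not come from the slides.

```python
import sqlite3


def join_people(names, states):
    """Join (id, name) rows with (id, state) rows on their shared id.

    The table names `people` and `residence` are hypothetical, chosen
    only for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE residence (id INTEGER PRIMARY KEY, state TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", names)
    conn.executemany("INSERT INTO residence VALUES (?, ?)", states)
    # Inner join: only ids present in BOTH tables appear in the result.
    rows = conn.execute(
        "SELECT p.id, p.name, r.state "
        "FROM people AS p JOIN residence AS r ON p.id = r.id "
        "ORDER BY p.id"
    ).fetchall()
    conn.close()
    return rows


names = [(111, "Smith"), (222, "Johnson"), (333, "Obama")]
states = [(111, "GA"), (222, "NY"), (333, "CA")]
joined = join_people(names, states)
for row in joined:
    print(row)
```

The join works here because both tables carry the same reliable id. If the two sources do not share an id (or use different ids for the same person), the inner join silently returns nothing for those rows.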


So IDs are really important!

But who creates the IDs?


Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

Freebase intro: https://www.youtube.com/watch?v=TJfrNo3Z-DU
Freebase moved to Wikidata in July (2015): http://goo.gl/3ZDTg7


Freebase (a graph of entities)

“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…” (Wikipedia)


So what? What can you do with Freebase?

Hint: Google acquired it in 2010


https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html


Freebase replaced by Google Knowledge Graph API


Example: What does Google know about Taylor Swift?

https://developers.google.com/knowledge-graph/
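
A query like the Taylor Swift example could be issued roughly as sketched below. This is a hedged illustration, not part of the slides: it assumes the Knowledge Graph Search API endpoint at kgsearch.googleapis.com with the query, key, and limit parameters, assumes you supply your own API key, and uses hypothetical helper names build_kg_url and search_entities. Check the developer page above for current details before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the Knowledge Graph Search API (see the developer page).
KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_kg_url(query, api_key, limit=1):
    """Build the request URL for an entity search (hypothetical helper)."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_ENDPOINT + "?" + urllib.parse.urlencode(params)


def search_entities(query, api_key, limit=1):
    """Send the request and return the list of matching entity records.

    Needs network access and a valid API key. The response is assumed to
    carry its matches under "itemListElement", each with a "result" entry.
    """
    with urllib.request.urlopen(build_kg_url(query, api_key, limit)) as resp:
        data = json.load(resp)
    return [item["result"] for item in data.get("itemListElement", [])]


# "YOUR_API_KEY" is a placeholder, so only the URL is built here.
url = build_kg_url("Taylor Swift", "YOUR_API_KEY")
print(url)
```

With a real key, search_entities("Taylor Swift", key) would return the entity records that the service holds for that query.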


Google has the Knowledge Graph.

Facebook has…


https://www.facebook.com/about/graphsearch
https://www.youtube.com/watch?v=W3k1USQbq80


What if we don’t have the luxury of having IDs?

(Screenshot from Freebase video)

A common problem in academia:
• Polo Chau
• Duen Horng Chau
• Duen Chau
• D. Chau


Then you need to do…

Entity Resolution (A hard problem in data integration)


Why is entity resolution so difficult?

Let’s understand it through shopping for an iPhone 6 on Apple, Amazon and eBay


D-Dupe: Interactive Data Deduplication and Integration
TVCG 2008
University of Maryland
Bilgic, Licamele, Getoor, Kang, Shneiderman

https://linqspub.soe.ucsc.edu/basilic/web/Publications/2006/bilgic:vast06/


[Figure: nodes labeled Polo, Paolo, Alice, Bob, Carol, Dave]


Numerous similarity functions

• Euclidean distance
  Euclidean norm / L2 norm

• Taxicab/Manhattan distance

• Jaccard similarity (e.g., used with w-shingles)
  e.g., overlap of nodes’ #neighbors

• String edit distance
  e.g., “Polo Chau” vs “Polo Chan”

Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
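
The four similarity functions listed above can be sketched in a few lines each. This is an illustrative sketch using only the Python standard library; the function names are hypothetical, the edit distance shown is the Levenshtein variant (insert, delete, substitute, each at cost 1), and the Jaccard example compares neighbor sets built from made-up data.

```python
import math


def euclidean_distance(p, q):
    """L2 norm of the difference between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Taxicab (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical here (a convention choice).
    return len(a & b) / len(union) if union else 1.0


def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(euclidean_distance((0, 0), (3, 4)))
print(manhattan_distance((0, 0), (3, 4)))
print(jaccard_similarity({"Alice", "Bob", "Carol"}, {"Bob", "Carol", "Dave"}))
print(edit_distance("Polo Chau", "Polo Chan"))
```

The two names differ in a single character, so their edit distance is 1; the two neighbor sets share two of four distinct members, so their Jaccard similarity is 0.5.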


https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html


Core components: Similarity functions

Determine how two entities are similar.
D-Dupe’s approach: Attribute similarity + relational similarity

Similarity score for a pair of entities


Attribute similarity (a weighted sum)
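
One way to picture a pairwise score built from attribute similarity plus relational similarity is sketched below. This is an assumption-laden illustration rather than D-Dupe's actual formula: the per-attribute string measure (difflib's SequenceMatcher ratio), the attribute weights, the mixing parameter alpha, and all function names are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1] between two strings (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_similarity(e1, e2, weights):
    """Weighted sum of per-attribute similarities, normalized by total weight."""
    total = sum(weights.values())
    score = sum(w * string_similarity(e1[attr], e2[attr])
                for attr, w in weights.items())
    return score / total


def relational_similarity(neighbors1, neighbors2):
    """Jaccard overlap of the two entities' neighbor sets (e.g., co-authors)."""
    union = neighbors1 | neighbors2
    return len(neighbors1 & neighbors2) / len(union) if union else 0.0


def combined_similarity(e1, e2, neighbors1, neighbors2, weights, alpha=0.5):
    """Blend attribute and relational similarity; alpha is an assumed knob."""
    return ((1 - alpha) * attribute_similarity(e1, e2, weights)
            + alpha * relational_similarity(neighbors1, neighbors2))


# Made-up records for illustration only.
a = {"name": "Polo Chau", "affiliation": "Georgia Tech"}
b = {"name": "Duen Horng Chau", "affiliation": "Georgia Tech"}
weights = {"name": 0.7, "affiliation": 0.3}
score = combined_similarity(a, b, {"Alice", "Bob", "Carol"},
                            {"Bob", "Carol", "Dave"}, weights)
print(round(score, 3))
```

Shared neighbors can lift the score for a pair whose names look quite different, which is the motivation for adding relational evidence on top of attribute evidence.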


Excellent Tutorial on Entity Resolution

http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

by Lise Getoor and Ashwin Machanavajjhala
