http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics
Data Integration
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
What is Data Integration?
Combining data from multiple sources to provide the user with a unified view.
Why is it Important?
Think about the apps, websites, and services that you use every day.
Businesses derive value through data integration.
More Examples?
• Social media (data from users, businesses)
• Facebook: your posts, advertisements, reviews
• Search engines: Google, Bing, Yahoo, etc.
• Smart assistants: Siri, Cortana, Alexa
• Price comparison: Kayak
• Uber, Lyft: drivers, traffic data, customers
• Google Maps: users, restaurants, traffic, …
How to do data integration?
“Low” Effort Approaches
1. Use database’s “Join”! (e.g., SQLite)
When does this approach work? (Or, when does it NOT work?)
id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA
2. OpenRefine http://openrefine.org (video #3)
So IDs are really important!
But who creates the IDs?
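A minimal sketch of the “Join” approach, using Python's built-in sqlite3 module and the two small tables above. The table names (names, states) and the extra row with id 444 are assumptions made for illustration; they are not from the slides. The second query hints at when the approach does NOT work: a record whose ID has no counterpart in the other table is silently dropped by a plain join.

```python
import sqlite3

# In-memory database holding the two source tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names  (id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT);
    INSERT INTO names  VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    INSERT INTO states VALUES (111, 'GA'), (222, 'NY'), (333, 'CA');
""")

# The join works because both tables share the same id for each entity.
joined = conn.execute("""
    SELECT names.id, names.name, states.state
    FROM names JOIN states ON names.id = states.id
    ORDER BY names.id
""").fetchall()

# Hypothetical extra record (not on the slide) with no matching id in states.
conn.execute("INSERT INTO names VALUES (444, 'Chau')")

# A LEFT JOIN exposes the rows that the plain join would have dropped.
unmatched = conn.execute("""
    SELECT names.id, names.name
    FROM names LEFT JOIN states ON names.id = states.id
    WHERE states.id IS NULL
""").fetchall()

print(joined)
print(unmatched)
```

Without shared, consistent IDs there is nothing to join on, which is why the question of who creates the IDs matters.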
Crowd-sourcing Approaches: Freebase
http://wiki.freebase.com/wiki/What_is_Freebase%3F
Freebase intro: https://www.youtube.com/watch?v=TJfrNo3Z-DU
Freebase moved to Wikidata (July 2015): http://goo.gl/3ZDTg7
Freebase (a graph of entities)
“…a large collaborative knowledge base consisting of metadata composed mainly
by its community members…”
Wikipedia.
So what? What can you do with Freebase?
Hint: Google acquired it in 2010
https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html
Freebase replaced by Google Knowledge Graph API
Example: What does Google know about Taylor Swift?
https://developers.google.com/knowledge-graph/
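A rough sketch of how such a lookup might be issued from Python with only the standard library. The endpoint and the query, key, and limit parameters are given here as recalled from Google's Knowledge Graph Search API documentation and should be checked against the link above. The helper names build_kg_url and search_kg and the placeholder key are assumptions for illustration. Only the URL is built here; the actual request needs a valid API key.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as recalled from the Knowledge Graph Search API docs (verify before use).
KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def build_kg_url(query, api_key, limit=1):
    """Build the search URL for a free-text entity query (hypothetical helper)."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_ENDPOINT + "?" + urllib.parse.urlencode(params)

def search_kg(query, api_key, limit=1):
    """Send the request and return the parsed JSON response (needs network access)."""
    with urllib.request.urlopen(build_kg_url(query, api_key, limit)) as resp:
        return json.load(resp)

# Only the URL is constructed here; replace the placeholder key to run a real query.
url = build_kg_url("Taylor Swift", "YOUR_API_KEY")
print(url)
```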
Google has the Knowledge Graph.
Facebook has…
https://www.facebook.com/about/graphsearch
https://www.youtube.com/watch?v=W3k1USQbq80
What if we don’t have the luxury of having IDs?
(Screenshot from Freebase video)
A common problem in academia:
Polo Chau
Duen Horng Chau
Duen Chau
D. Chau
Then you need to do…
Entity Resolution (a hard problem in data integration)
Why is entity resolution so difficult?
Let’s understand it through shopping for an iPhone 6 on
Apple, Amazon and eBay
D-Dupe
Interactive Data Deduplication and Integration
TVCG 2008
University of Maryland
Bilgic, Licamele, Getoor, Kang, Shneiderman
https://linqspub.soe.ucsc.edu/basilic/web/Publications/2006/bilgic:vast06/
[Figure: node labels Polo, Paolo, Alice, Bob, Carol, Dave]
Numerous similarity functions
• Euclidean distance (Euclidean norm / L2 norm)
• Taxicab / Manhattan distance
• Jaccard similarity (e.g., used with w-shingles), e.g., overlap of two nodes’ neighbors
• String edit distance, e.g., “Polo Chau” vs “Polo Chan”
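The four measures listed above can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used by any particular tool. The function names are chosen here, and the edit distance shown is the standard Levenshtein variant with unit costs for insert, delete, and substitute, which is an assumption since the slide does not say which variant is meant.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Taxicab / Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard(s, t):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(s & t) / len(s | t)

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(jaccard({"Alice", "Bob", "Carol"}, {"Bob", "Carol", "Dave"}))
print(edit_distance("Polo Chau", "Polo Chan"))
```

Note that the first two and the last are distances (smaller means more similar), while Jaccard is a similarity (larger means more similar), so they need to be put on a common scale before being combined.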
Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html
Core components: Similarity functions
Determine how two entities are similar.
D-Dupe’s approach: attribute similarity + relational similarity
Similarity score for a pair of entities
Attribute similarity (a weighted sum)
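One way to picture a pair score built from attribute similarity (a weighted sum) plus relational similarity is sketched below. This is not D-Dupe's actual formula. The attribute names, the weights, the per-attribute similarity functions, the use of Jaccard overlap of coauthors as the relational part, and the mixing parameter alpha are all assumptions made for illustration.

```python
def jaccard(s, t):
    """Jaccard similarity of two collections treated as sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t) if (s or t) else 1.0

def token_jaccard(a, b):
    """Similarity of two strings based on their lowercase word sets."""
    return jaccard(a.lower().split(), b.lower().split())

def exact_match(a, b):
    """1.0 if the two values are identical, else 0.0."""
    return 1.0 if a == b else 0.0

def attribute_similarity(e1, e2, weights, sim_fns):
    """Weighted sum of per-attribute similarities, normalized by total weight."""
    total = sum(weights.values())
    return sum(w * sim_fns[attr](e1[attr], e2[attr])
               for attr, w in weights.items()) / total

def pair_score(e1, e2, weights, sim_fns, alpha=0.5):
    """Blend attribute and relational similarity; alpha is a hypothetical knob."""
    attr = attribute_similarity(e1, e2, weights, sim_fns)
    rel = jaccard(e1["coauthors"], e2["coauthors"])  # overlap of neighbors
    return alpha * attr + (1 - alpha) * rel

# Hypothetical records for two author entries that may be the same person.
a = {"name": "Polo Chau", "affiliation": "Georgia Tech",
     "coauthors": {"Alice", "Bob", "Carol"}}
b = {"name": "Duen Horng Chau", "affiliation": "Georgia Tech",
     "coauthors": {"Bob", "Carol", "Dave"}}

weights = {"name": 0.5, "affiliation": 0.5}
sim_fns = {"name": token_jaccard, "affiliation": exact_match}

attr_score = attribute_similarity(a, b, weights, sim_fns)
score = pair_score(a, b, weights, sim_fns)
print(attr_score, score)
```

The names alone look dissimilar here, but the shared affiliation and shared coauthors pull the score up, which is the intuition behind adding relational evidence.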
Excellent Tutorial on Entity Resolution
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf
by Lise Getoor and Ashwin Machanavajjhala