http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Data Cleaning & Integration Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Page 1: Data Cleaning & Integrationpoloclub.gatech.edu/cse6242/2015fall/slides/CSE6242-2... · 2015-08-31 · Data Integration Combining data from different sources to provide the user with

http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics

Data Cleaning & Integration

Duen Horng (Polo) Chau Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Data Cleaning

Why can data be dirty?

How dirty is real data?

Examples

• …

http://blogs.verdantis.com/wp-content/uploads/2015/02/Data-cleansing.jpg

How dirty is real data? (Previous semester)

Examples

• duplicates
• empty rows
• abbreviations (different kinds)
• differences in scale / inconsistent descriptions / units sometimes included
• typos
• missing values
• trailing spaces
• incomplete cells
• synonyms of the same thing
• skewed distribution (outliers)
• bad formatting / not in relational format (in a format not expected)
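Several of the kinds of dirtiness listed above can be handled with a few lines of code. The sketch below is only an illustration, using Python's standard library: the sample data, the column names, the synonym table `STATE_SYNONYMS` and the helper `to_cm` are all made up for this example, and a real dataset would need rules of its own.

```python
import csv
import io

# Hypothetical raw data showing several kinds of dirtiness from the list above.
RAW = """name,state,height
Smith ,GA,180 cm
Smith,GA,180 cm
,,
Johnson,N.Y.,
obama,California,1.85 m
"""

# Hypothetical synonym/abbreviation table; a real one would be much larger.
STATE_SYNONYMS = {"n.y.": "NY", "california": "CA", "ga": "GA"}


def to_cm(text):
    """Convert '180 cm' or '1.85 m' to centimetres; None marks a missing value."""
    text = text.strip()
    if not text:
        return None
    value, unit = text.split()
    return round(float(value) * (100.0 if unit == "m" else 1.0), 1)


def clean(raw):
    seen = set()
    rows = []
    for row in csv.DictReader(io.StringIO(raw)):
        # Trailing/leading spaces.
        row = {key: value.strip() for key, value in row.items()}
        # Empty rows.
        if not any(row.values()):
            continue
        # Inconsistent capitalisation, abbreviations and synonyms.
        name = row["name"].title()
        state = STATE_SYNONYMS.get(row["state"].lower(), row["state"])
        # Mixed units; missing values stay explicit as None.
        record = (name, state, to_cm(row["height"]))
        # Exact duplicates (after normalisation).
        if record in seen:
            continue
        seen.add(record)
        rows.append(record)
    return rows


cleaned = clean(RAW)
for record in cleaned:
    print(record)
```

Typos and outliers are not covered here: they usually need similarity measures or a look at the value distribution rather than fixed rules.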

More to read

Big Data's Dirty Problem [Fortune]
http://fortune.com/2014/06/30/big-data-dirty-problem/

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights [New York Times]
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0

Data Janitor

Data Cleaners

Watch videos
• Open Refine (previously Google Refine)
• Data Wrangler (research at Stanford)

Write down
• Examples of data dirtiness
• Tool’s features demo-ed (or that you like)

Will collectively summarize similarities and differences afterwards

Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/

How are the tools similar or different?

• [G] cluster similar entities (e.g., T&M), synonyms
• [G, W] history
• [G] trailing spaces
• [W] text extraction
• [W] export script, code (work on other systems? interoperability)
• [W, G] one-click (usability)
• [G] distribution of data (apply log scale)
• [W] pivot data (unfold)
• [W] suggestions (even more usable)
• [W + G] preview changes

G = Google Refine
W = Data Wrangler
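The "cluster similar entities" feature can be approximated with a key-collision approach: reduce every value to a normalized key and group the values whose keys collide. The sketch below is written in the spirit of fingerprint keying and is not the tool's actual implementation; the function names and the sample values are assumptions made for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Key that ignores case, punctuation, accents, spacing and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    # Only keys shared by more than one spelling are interesting.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical column values.
names = ["Georgia Tech", "georgia tech ", "Tech, Georgia", "Géorgia Tech.", "Emory"]
clusters = cluster(names)
print(clusters)
```

Key collision is cheap (one pass over the column) but only catches variants that normalize to exactly the same key; typos such as "Georgia Tehc" would need a distance-based method instead.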

! The videos only show some of the tools’ features. Try them out.

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/

Data Integration

Course Overview

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

What is Data Integration? Why is it Important?

Data Integration

Combining data from different sources to provide the user with a unified view

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges

How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)

Example businesses that derive value via data integration

Craigslist now has map view! What problem has it solved?

https://atlanta.craigslist.org/search/hhh

More Examples?

• Google Now
• Yelp?
• Amazon: different kinds of products (dpreview.com)
• DealNews, Slickdeals, FatWallet?
• Tinder (Facebook and Instagram)
• Facebook (news stories, etc.)
• Walmart (different merchants)
• search engines
• ? any websites with advertising (e.g., New York Times)

More Examples? (Previous semester)

• [FREE] Mint: account app, integrates multiple accounts (credit card, bank, etc.), can parse receipts
• Google News
• Crime mapping
• Feedly
• apps that check gas prices, coupons
• Zillow-Trulia / Redfin
• IMDb (movie database)
• Coin: combines multiple credit cards
• eBay

More Examples? (Previous semester)

• Palantir Gotham
• Yelp: restaurant reviews, business reviews
• Facebook friend request: looks at your friends’ friends and recommends them as your friends
• Trulia / Zillow (real estate sites)
• Graph Search (Facebook)
• Waze
• Yahoo Pipes
• Google search engine
• Google Transit
• Google Now / Apple Siri

How to do data integration?

“Low” Effort Approaches

Use database’s “Join”! (e.g., SQLite)

When would this approach work? (Or, when won’t it work?)

id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA

Google Refine
http://code.google.com/p/google-refine/ (video #3)
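The join on this slide can be reproduced with Python's built-in `sqlite3` module. This is a minimal sketch: the table names `names`, `states` and `states_by_name` are assumptions, and the second half uses made-up rows to suggest one answer to "when won't it work?", namely when the sources share no clean, exactly matching key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names  (id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT);
    INSERT INTO names  VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    INSERT INTO states VALUES (111, 'GA'), (222, 'NY'), (333, 'CA');
""")

# Works: both tables carry the same clean id.
joined = conn.execute("""
    SELECT names.id, names.name, states.state
    FROM names JOIN states ON names.id = states.id
    ORDER BY names.id
""").fetchall()
print(joined)

# Failure case (hypothetical): a source with no shared id, only a
# differently formatted name. The exact-match join finds no rows.
conn.executescript("""
    CREATE TABLE states_by_name (name TEXT, state TEXT);
    INSERT INTO states_by_name VALUES ('smith ', 'GA'), ('Johnson, B.', 'NY');
""")
unmatched = conn.execute("""
    SELECT names.name, states_by_name.state
    FROM names JOIN states_by_name ON names.name = states_by_name.name
""").fetchall()
print(unmatched)
conn.close()
```

The second query returns no rows without raising any error, which is what makes this failure easy to miss in practice.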

Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

Freebase (a graph of entities)

“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”

Wikipedia.

So what? What can you do with Freebase?

Hint: Google acquired it in 2010

Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7

Given a graph of entities, like Freebase, what other cool things can you do?

https://www.facebook.com/about/graphsearch
https://www.youtube.com/watch?v=W3k1USQbq80

https://www.youtube.com/watch?v=mmQl6VGvX-c

Feldspar
Finding Information by Association.

CHI 2008
Polo Chau, Brad Myers, Andrew Faulring

Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E

We need ways to identify the many names that one thing may be called. How?

(Screenshot from Freebase video)

Entity Resolution (A hard problem in data integration)

Polo Chau
P. Chau
Duen Horng Chau
Duen Chau
D. Chau
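To see why this is hard, consider a naive rule applied to the five name variants on this slide: group names by last name plus first initial. The rule and the function name `blocking_key` are assumptions made up for this sketch, not a method from the lecture.

```python
from collections import defaultdict

NAMES = ["Polo Chau", "P. Chau", "Duen Horng Chau", "Duen Chau", "D. Chau"]


def blocking_key(name):
    """(last name, first initial), lower-cased: a crude, hypothetical rule."""
    parts = name.replace(".", "").lower().split()
    return parts[-1], parts[0][0]


groups = defaultdict(list)
for name in NAMES:
    groups[blocking_key(name)].append(name)

for key, members in groups.items():
    print(key, members)
```

The rule splits the five variants into a "P" group and a "D" group, because nothing in the strings themselves says that "Polo" and "Duen Horng" refer to the same person. Resolving that needs extra evidence, such as shared attributes or shared relationships.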

Why is entity resolution so important?

Case Study

Let’s shop for an iPhone 6 on Apple, Amazon and eBay

D-Dupe
Interactive Data Deduplication and Integration

TVCG 2008
University of Maryland
Bilgic, Licamele, Getoor, Kang, Shneiderman

http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55)
http://linqs.cs.umd.edu/basilic/web/Publications/2008/kang:tvcg08/kang-tvcg08.pdf

(Figure: nodes labeled Polo, Paolo, Alice, Bob, Carol, Dave)

Numerous similarity functions

• Euclidean distance
  Euclidean norm / L2 norm
• TaxiCab/Manhattan distance
• Jaccard Similarity (e.g., used with w-shingles)
  e.g., overlap of nodes’ #neighbors
• String edit distance
  e.g., “Polo Chau” vs “Polo Chan”
• Canberra distance (compare ranked items)

Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
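The measures listed on this slide are short enough to write out directly. The sketch below uses only the standard library and textbook definitions; the neighbor sets in the Jaccard example are made up for illustration, and the edit distance shown is the unit-cost Levenshtein variant, which is one of several edit distances.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    # Terms where both coordinates are zero are skipped (0/0 treated as 0).
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if a or b)


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def shingles(text, w=2):
    """Set of w-word shingles (contiguous word windows)."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def edit_distance(s, t):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]


print(euclidean((0, 0), (3, 4)))                # 5.0
print(manhattan((0, 0), (3, 4)))                # 7
print(canberra((1, 2), (3, 2)))                 # 0.5
# Hypothetical neighbor sets of two nodes in a graph.
print(jaccard({"Alice", "Bob", "Carol"}, {"Bob", "Carol", "Dave"}))  # 0.5
print(jaccard(shingles("data cleaning and integration"),
              shingles("data cleaning and analysis")))               # 0.5
print(edit_distance("Polo Chau", "Polo Chan"))  # 1
```

Note that Euclidean, Manhattan, Canberra and edit distance are distances (0 means identical), while Jaccard is a similarity (1 means identical), so they need to be put on a common scale before being combined.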

Core components: Similarity functions

Determine how similar two entities are.

D-Dupe’s approach: Attribute similarity + relational similarity

Similarity score for a pair of entities

Attribute similarity (a weighted sum)
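A weighted sum of per-attribute similarities could look like the sketch below. It is not D-Dupe's actual formula or code: the records, the attribute names, the weights in `WEIGHTS`, and the use of `difflib.SequenceMatcher` as the per-attribute measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1]; stands in for any per-attribute measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_similarity(e1, e2, weights):
    """Weighted sum of per-attribute similarities; weights sum to 1."""
    return sum(w * string_similarity(e1[attr], e2[attr])
               for attr, w in weights.items())


# Hypothetical records and weights.
a = {"name": "Polo Chau", "affiliation": "Georgia Tech"}
b = {"name": "Polo Chan", "affiliation": "Georgia Tech"}
c = {"name": "Alice Smith", "affiliation": "University of Maryland"}
WEIGHTS = {"name": 0.7, "affiliation": 0.3}

score_ab = attribute_similarity(a, b, WEIGHTS)
score_ac = attribute_similarity(a, c, WEIGHTS)
print(round(score_ab, 3), round(score_ac, 3))
```

Because the weights sum to 1 and each per-attribute score lies in [0, 1], the combined score also lies in [0, 1], which makes it straightforward to threshold or to combine with a relational similarity term.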

Summary for data integration

• Enable new services
• Improve existing services
• New ways to interact with data