
Data Cleaning & Integration - Polo Club of Data...

Date post: 03-Feb-2018
Page 1: Data Cleaning & Integration - Polo Club of Data Sciencepoloclub.gatech.edu/cse6242/2015spring/slides/CSE6242-2-Clean... · Data Cleaning & Integration Duen Horng (Polo) Chau ... •

http://poloclub.gatech.edu/cse6242

CSE6242 / CX4242: Data & Visual Analytics

Data Cleaning & Integration

Duen Horng (Polo) Chau Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos


Last Time

Big data analytics building blocks

Data collection & simple data storage

• Why SQLite?

• Simplicity: nothing to install/maintain, database in a single file

• Popular: cross-platform, cross-device

• SQL basics (create table, join, create index, etc.)


Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination


Data Cleaning

Why can data be dirty?


Examples

• …


How dirty is real data?


Examples

• duplicates

• empty rows

• abbreviations (different kinds)

• differences in scale / inconsistent descriptions / units sometimes included

• typos

• missing values

• trailing spaces

• incomplete cells

• synonyms of the same thing

• skewed distribution (outliers)

• bad formatting / not in relational format (in a format not expected)
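A few of the problems listed above (trailing spaces, empty rows, duplicates, missing values) can be handled with very little code. The following is a minimal sketch in Python, not a complete cleaning pipeline; the function names `clean_rows` and `count_missing` and the sample rows are made up for illustration.

```python
def clean_rows(rows):
    """Strip whitespace, drop empty rows, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Trailing (and leading) spaces: normalize every cell.
        stripped = tuple(cell.strip() for cell in row)
        # Empty rows: skip rows in which every cell is blank.
        if not any(stripped):
            continue
        # Duplicates: keep only the first occurrence of each row.
        if stripped in seen:
            continue
        seen.add(stripped)
        cleaned.append(stripped)
    return cleaned


def count_missing(rows):
    """Count blank cells that remain after cleaning (missing values)."""
    return sum(1 for row in rows for cell in row if cell == "")


# Hypothetical sample data: one trailing space, one duplicate,
# one empty row, and one missing value.
raw = [
    ("111", "Smith ", "GA"),
    ("111", "Smith", "GA"),
    ("", "", ""),
    ("222", "Johnson", ""),
]
cleaned = clean_rows(raw)
missing = count_missing(cleaned)
```

Note that whitespace is stripped before the duplicate check; otherwise "Smith " and "Smith" would be treated as different rows. Typos, synonyms, and abbreviations need fuzzier matching, which is where tools such as the ones discussed in this lecture help.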


(Fall ’14) How dirty is real data?


More to read

Big Data's Dirty Problem [Fortune]
http://fortune.com/2014/06/30/big-data-dirty-problem/

A Taxonomy of Dirty Data [Won Kim+]
http://sci2s.ugr.es/docencia/m1/KimTaxonomy03.pdf
(Very detailed, may be slightly outdated)

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights [New York Times]
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0


Data Cleaners

Watch videos

• Open Refine (previously Google Refine)

• Data Wrangler (research at Stanford)

Write down

• Examples of data dirtiness

• Tools’ features demoed (or that you like)

Will collectively summarize similarities and differences afterwards

Open Refine: http://openrefine.org

Data Wrangler: http://vis.stanford.edu/wrangler/


How are the tools similar or different?

• …

G = Google Refine

W = Data Wrangler


The videos only show some of the tools’ features. Try them out.

Google Refine: http://code.google.com/p/google-refine/

Data Wrangler: http://vis.stanford.edu/wrangler/


Data Integration


Course Overview

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination


What is Data Integration? Why is it Important?


Data Integration

Combining data from different sources to provide the user with a unified view

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges

How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)


Examples of businesses based on data integration


Mashup


More Examples?

• [FREE] Mint: account app that integrates multiple accounts (credit card, bank, etc.); can parse receipts

• Google News

• Crime mapping

• Feedly

• apps that check gas prices, coupons

• Zillow-Trulia / Redfin

• IMDb (movie database)

• Coin: combines multiple credit cards

• eBay


More Examples?

• Palantir Gotham

• Yelp: restaurant reviews, business reviews

• Facebook friend requests: look at your friends’ friends and recommend them as your friends

• Trulia / Zillow (real estate sites)

• Graph Search (Facebook)

• Waze

• Yahoo Pipes

• Google search engine

• Google Transit

• Google Now / Apple Siri


How to do data integration?


“Low” Effort Approaches

Use database’s “Join”! (e.g., SQLite)

Google Refine
http://code.google.com/p/google-refine/ (video #3)


id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA
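The join shown in the tables above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the table names `names` and `states` are made up here (the slide does not name the tables), and an in-memory database stands in for a real SQLite file.

```python
import sqlite3

# In-memory database; a file path would work the same way.
conn = sqlite3.connect(":memory:")

# Hypothetical table names for the two source tables on the slide.
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT)")

conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
)
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [(111, "GA"), (222, "NY"), (333, "CA")],
)

# Inner join on the shared id column yields the combined (id, name, state) table.
joined = conn.execute(
    """
    SELECT names.id, names.name, states.state
    FROM names
    JOIN states ON names.id = states.id
    ORDER BY names.id
    """
).fetchall()

conn.close()
```

An inner join keeps only ids present in both tables. If one source might be missing some ids, a `LEFT JOIN` would keep those rows with a NULL state instead of dropping them.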


Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F


Freebase (a graph of entities)

“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…” (Wikipedia)


So what? What can you do with Freebase?

Hint: Google acquired it in 2010

Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7


http://www.google.com/insidesearch/features/search/knowledge.html


Given a graph of entities, like Freebase, what other cool things can you do?


https://www.facebook.com/about/graphsearch


Facebook’s Graph Search

Integrate your friends’ info with yours


Feldspar

Finding Information by Association.

CHI 2008 Polo Chau, Brad Myers, Andrew Faulring

Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf

YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E


Summary for data integration

Opportunities

• enable new services (Siri, padmapper)

• enable new ways to discover info

• improve existing services

• reduce redundancy

• new ways to interact with data

• promote knowledge transfer (e.g., between companies)

