
CompSci 590.01 Spring 2017 Statistical Distortion: …...Consequences of Data Cleaning Data Cleaning...

Date post: 06-May-2020
Statistical Distortion: Consequences of Data Cleaning Data Cleaning & Integration CompSci 590.01 Spring 2017 Junyang Gao, Amir Rahimzadeh Ilkhechi, Yuhao Wen Some contents were based on : Tamraparni Dasu’s DSAA Tutorial, 2016
Page 1: CompSci 590.01 Spring 2017 Statistical Distortion: …...Consequences of Data Cleaning Data Cleaning & Integration CompSci 590.01 Spring 2017 Junyang Gao, Amir Rahimzadeh Ilkhechi,

Statistical Distortion: Consequences of Data Cleaning

Data Cleaning & Integration
CompSci 590.01 Spring 2017

Junyang Gao, Amir Rahimzadeh Ilkhechi, Yuhao Wen

Some content is based on Tamraparni Dasu’s DSAA Tutorial, 2016


Tamraparni Dasu, Ji Meng Loh. “Statistical Distortion: Consequences of Data Cleaning.” VLDB, 2012


Biggest take-away points?

(For us:)

● Cleaner data are not necessarily more useful or usable data

● In practice, a simple cleaning strategy may outperform a more sophisticated method whose assumptions do not suit the data


Outline

● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis


Data Cleaning or Data Mangling?

● Changed the shape:
  a. Most frequent values (Mode)
  b. Least frequent values (Anomalies)
● Moved good values
● Turned good values to glitches


How to measure data cleaning strategies?

● Three-dimensional data quality metric:
  a. Statistical Distortion
  b. Glitch Improvement
  c. Cost


Experimental Framework

1. Glitch Index

● Weighted sum of the glitch indicators
● The lower the glitch index, the “cleaner” the data set

Name  City  State
0     0     0
0     0     0
0     0     1
0     0     1
0     0     1
0     0     0
0     0     0
0     0     0
0     0     0

Glitch Vector
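
The weighted sum on this slide can be sketched in a few lines. This is only an illustration: the bit table is copied from the slide and read here as one row per record and one column per attribute, while the equal `weights` and the function name `glitch_index` are assumptions that do not come from the slides.

```python
# Glitch bits copied from the slide: one row per record,
# one column per attribute (Name, City, State); 1 marks a glitch.
glitch_matrix = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]

# Hypothetical weights (an assumption, not from the slides).
weights = [1.0, 1.0, 1.0]


def glitch_index(matrix, weights):
    """Weighted sum of every glitch bit in the matrix."""
    return sum(w * bit for row in matrix for w, bit in zip(weights, row))


index = glitch_index(glitch_matrix, weights)
```

With equal weights the index simply counts the glitches; a data set with no flagged cells has index 0, which matches the "lower is cleaner" reading.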


Experimental Framework

2. Statistical Distortion


Distance between two distributions

1. Kullback-Leibler “distance”

● P, Q are two probability distributions over the same event space
● D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}


Distance between two distributions

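1. Kullback-Leibler divergence

● D_{KL}(P \| Q) = H(P, Q) - H(P)
  ○ H(P) = -\sum_i P(i) \log P(i): entropy of P
  ○ H(P, Q) = -\sum_i P(i) \log Q(i): cross-entropy of P, Q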
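
A small numerical check may help make the entropy and cross-entropy labels concrete. The sketch below assumes discrete distributions given as lists of probabilities on the same support and uses natural logarithms; the function names and the example values of `p` and `q` are illustrative only.

```python
import math


def entropy(p):
    """H(P) = -sum_i P(i) log P(i), skipping zero-probability events."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_i P(i) log Q(i); infinite if Q misses mass that P has."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total -= p_i * math.log(q_i)
    return total


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) log(P(i) / Q(i))."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
gap = kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))
```

Here `gap` is zero up to floating-point error, and `kl_divergence(p, q)` differs from `kl_divergence(q, p)`, which is why the earlier slide puts "distance" in quotes.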


Distance between two distributions

2. Jensen–Shannon divergence

● symmetrized and smoothed version of the Kullback–Leibler divergence

● JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M), where M = \frac{1}{2}(P + Q)
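
As a rough illustration of the symmetrized and smoothed behaviour, the sketch below computes the Jensen–Shannon divergence with the mixture M = (P + Q) / 2. It assumes discrete distributions as lists and natural logarithms; the helper `_kl` and the name `js_divergence` are illustrative choices.

```python
import math


def _kl(p, q):
    """D_KL(P || Q); only called with q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

Unlike the Kullback–Leibler divergence, this stays finite even when the two distributions have disjoint support: with natural logarithms the value is then log 2.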


Distance between two distributions

3. Earth Mover’s distance

● Minimum cost of converting P to Q, posed as a transportation problem
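
For intuition, the one-dimensional special case can be computed without a solver. The sketch below assumes two histograms over the same ordered, equally spaced bins with equal total mass, and a ground distance equal to the gap between bin positions; under those assumptions the minimum transport cost equals the summed absolute difference of the cumulative sums. The multivariate case discussed in the slides needs the full transportation problem, and the name `emd_1d` is illustrative.

```python
from itertools import accumulate


def emd_1d(p, q, bin_width=1.0):
    """Earth Mover's distance between two 1-D histograms on the same bins."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must carry the same total mass")
    # Mass that still has to cross each bin boundary is the CDF difference.
    cdf_gaps = (abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))
    return bin_width * sum(cdf_gaps)
```

Moving all mass from the first bin to the third costs 2 bin widths, whereas a divergence that compares bins one by one cannot tell a near shift from a far one.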


Experimental Framework

3. Cost

● Highly context dependent
● In this paper: glitch-based (percentage of glitches removed)


Experimental Framework

The “best” strategy depends on the user’s tolerance for:

● Statistical Distortion
● Glitch Improvement
● Cost


Outline

● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis


Applicable to:

● Structured,
● Hierarchical,
● Spatio-temporal,
● Unstructured data


A hierarchical network example

[Figure: node hierarchy with example nodes N_1, N_13, N_132]


A hierarchical network example (cont.)

● Each node measures v variables (time series)
● For N_ijk: X^t represents the collected data at time t
● F_t: the history up to time t-1
● The ⍵-step window is the history from time t-⍵ up to time t-1


Glitch Types:

● Multitype
● Co-occurring
● Stand alone


Glitch Detection

Glitch detector is a function of X^t

Missing values


Glitch Detection (cont.)

Glitch detector is a function of X^t

Inconsistent values


Glitch Detection (cont.)

Glitch detector is also a function of other parameters

Outlier values


Glitch Detection (cont.)

Glitch Matrix:


Glitch Index*

*1 ⨉ p is a typo in the paper


Statistical Distortion Measure (EMD)*

*Slides adapted from Pete Barnum’s presentation


Linear programming approach for EMD:


Linear programming approach for EMD (as an instance of transportation problem):


Linear programming approach for EMD (constraint 1):


Linear programming approach for EMD (constraint 2):


Linear programming approach for EMD (constraint 3):


Linear programming approach for EMD (constraint 4):


Final result for EMD:
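
The formulas on the linear-programming slides did not survive in this transcript. As a sketch only, assuming the slides follow the standard transportation-problem formulation of EMD, the program and the final normalized result can be written as below. The notation is an assumption: P has m clusters with weights w_{p_i}, Q has n clusters with weights w_{q_j}, d_{ij} is the ground distance between cluster i of P and cluster j of Q, and f_{ij} is the flow moved between them. The order in which the four constraints are listed is also assumed.

```latex
\begin{aligned}
\min_{f}\quad & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{subject to}\quad
  & f_{ij} \ge 0, \qquad 1 \le i \le m,\; 1 \le j \le n \\
  & \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \qquad 1 \le i \le m \\
  & \sum_{i=1}^{m} f_{ij} \le w_{q_j}, \qquad 1 \le j \le n \\
  & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}
      = \min\Bigl(\sum_{i=1}^{m} w_{p_i},\; \sum_{j=1}^{n} w_{q_j}\Bigr) \\[6pt]
\mathrm{EMD}(P, Q) \;=\;
  & \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij}}
         {\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}
\end{aligned}
```

In this reading the constraints say, in turn, that flow only moves from P to Q, that no cluster of P ships more than its weight, that no cluster of Q receives more than its weight, and that as much mass as possible is moved. Dividing by the total flow makes the result independent of the total mass.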


Outline

● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis


Experiment

Dataset:

● 20,000 time series, length at most 170, 3 variables

Glitches:

● Inconsistencies (violations of any of these rules)
  ○ A1 >= 0
  ○ 0 <= A3 <= 1
  ○ If A3 is missing, A1 should not be populated
● Outliers
● Missing values
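
The three inconsistency rules listed on this slide translate directly into a record-level check. The sketch below is an assumption-laden illustration: missing values are represented by `None`, the function name `is_inconsistent` is hypothetical, and a missing value alone is not counted as an inconsistency because the slide lists missing values as a separate glitch type.

```python
def is_inconsistent(a1, a3):
    """Return True if (A1, A3) violates any rule from the slide.

    None stands for a missing value (an assumption of this sketch).
    """
    if a1 is not None and a1 < 0:
        return True  # violates A1 >= 0
    if a3 is not None and not (0 <= a3 <= 1):
        return True  # violates 0 <= A3 <= 1
    if a3 is None and a1 is not None:
        return True  # A1 populated although A3 is missing
    return False
```

For example, `is_inconsistent(3.0, None)` flags the record, while `is_inconsistent(None, None)` does not.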

All graphs in the following slides are from T. Dasu and J. Loh. "Statistical Distortion: Consequences of Data Cleaning." VLDB 2012.


Experiment

Sampling

● From D & DI with replacement, 50 pairs in total
● Sample size: 100, 500 (no significant impact)

Factors concerned:

● Attribute transformations
● Strategies
● Cost
● Sample size
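
The sampling step can be sketched as follows, assuming D and DI are held as Python lists of time series and that each pair consists of one sample from each. The function name `draw_sample_pairs`, the parameter names, and the fixed seed are illustrative choices, not details from the slides.

```python
import random


def draw_sample_pairs(dirty, ideal, n_pairs=50, sample_size=100, seed=0):
    """Draw n_pairs of (dirty sample, ideal sample), each with replacement."""
    rng = random.Random(seed)  # seeded so the replications are repeatable
    return [
        (rng.choices(dirty, k=sample_size), rng.choices(ideal, k=sample_size))
        for _ in range(n_pairs)
    ]
```

Calling it again with `sample_size=500` gives the second sample-size setting named on the slide.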


Analysis

● Cleaning Strategies
● Studying Cost
● Data Transformation and Cleaning
● Strategies and Attribute Distributions
● Strategies Evaluation
● Cleaning Cost


Analysis - Cleaning Strategies

Applied 5 strategies to each of the 100 test pairs of data streams.

Strategy | Missing values                                 | Inconsistent values                            | Outliers
1        | Impute using SAS PROC MI                       | Impute using SAS PROC MI                       | Winsorization on a per-attribute basis
2        | Impute using SAS PROC MI                       | Impute using SAS PROC MI                       | ignore
3        | ignore                                         | ignore                                         | Winsorization on a per-attribute basis
4        | Replace with attribute mean from ideal dataset | Replace with attribute mean from ideal dataset | ignore
5        | Replace with attribute mean from ideal dataset | Replace with attribute mean from ideal dataset | Winsorization (outliers only)
weight   | 0.25                                           | 0.25                                           | 0.5
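
The two simple building blocks in this table, Winsorization and mean replacement, can be sketched as below. This is a minimal illustration, not the procedure used in the paper: the Winsorization limits are passed in by the caller because the slides do not say how they were chosen, `None` is assumed to mark a missing value, and both function names are hypothetical. SAS PROC MI is not reproduced here.

```python
def winsorize(values, lower, upper):
    """Clamp each present value into [lower, upper]; None stays None."""
    return [v if v is None else min(max(v, lower), upper) for v in values]


def replace_with_mean(values, ideal_mean):
    """Replace missing values with the attribute mean of the ideal data set."""
    return [ideal_mean if v is None else v for v in values]


# Strategy 5 style combination on one attribute (illustrative numbers).
raw = [-5.0, 0.5, None, 9.0]
cleaned = winsorize(replace_with_mean(raw, ideal_mean=0.4), lower=0.0, upper=1.0)
```

Here `cleaned` is `[0.0, 0.5, 0.4, 1.0]`: the two extreme values are pulled to the limits and the missing value takes the ideal mean.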


Analysis - Studying Cost

● Cost ~ Proportion of the glitches cleaned
● Process:
  ○ Compute normalized glitch score for each time series
  ○ Rank them
  ○ Top x% cleaned (x=0 -> Nothing cleaned, x=100 -> everything cleaned)
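
The three-step process above can be sketched as follows. The assumptions are that each time series comes with a glitch matrix of 0/1 rows (one row per time step, one column per glitch type), that normalizing means dividing the weighted glitch count by the series length, and that the weights are the 0.25 / 0.25 / 0.5 values from the strategy table. The function names are illustrative.

```python
WEIGHTS = [0.25, 0.25, 0.5]  # missing, inconsistent, outlier


def normalized_glitch_score(glitch_rows, weights=WEIGHTS):
    """Weighted glitch count of one time series divided by its length."""
    if not glitch_rows:
        return 0.0
    total = sum(w * bit for row in glitch_rows for w, bit in zip(weights, row))
    return total / len(glitch_rows)


def select_for_cleaning(scores, x_percent):
    """Indices of the top x% of series, ranked by glitch score, worst first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = int(len(scores) * x_percent / 100)
    return ranked[:k]
```

With `x_percent=0` nothing is selected and with `x_percent=100` every series is, matching the two endpoints named on the slide.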


Analysis - Data Transformation and Cleaning

Strategy: 1

● Gray: imputed missing values
● X=Y: untouched data
● Black dots: Winsorized values


Analysis - Strategies and Attribute Distributions


Analysis - Strategies Evaluation

Figure 6


● Single cleaning method
  ○ SAS PROC MI
  ○ Mean
● Winsorization only
● Using two methods
  ○ Impute + Winsorize
  ○ Mean + Winsorize


Analysis - Cleaning Cost

Strategy: Imputation + Winsorization

