This chapter covers:
1. Why preprocess the data?
2. Descriptive data summarization
3. Data cleaning
4. Data integration and transformation
5. Data reduction
6. Discretization and concept hierarchy generation
Data in the real world is dirty:
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation = "" (a missing value)
- Noisy: containing errors or outliers
  - e.g., Salary = -10
- Inconsistent: containing discrepancies in codes or names
  - e.g., Age = 42 but Birthday = 03/07/1997
  - e.g., ratings were 1, 2, 3 and are now A, B, C
  - e.g., discrepancies between duplicate records
Real-world data are highly susceptible to noise, incompleteness, and inconsistency.
Why Is Data Dirty?
Inconsistent data may come from:
- Different data sources
- Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning.
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
- Quality decisions must be based on quality data
  - e.g., duplicate or missing data may cause incorrect or even misleading statistics.
- A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
[Figure: noisy data → lack of data quality → lack of quality in query results → lack of quality information → lack of quality in mining → lack of quality decision making]
Measures of Data Quality
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
1. Data cleaning: can be applied to remove noise and correct inconsistencies in the data.
2. Data integration: merges data from multiple sources into a coherent data store, such as a data warehouse.
3. Data transformation: operations such as normalization.
4. Data reduction: can reduce the data size by aggregating, eliminating redundant features, or clustering.
5. Data discretization: part of data reduction, but of particular importance, especially for numerical data.
Data preprocessing techniques are applied before mining.
They can improve the overall quality of the patterns mined and the time required for the actual mining.
For data preprocessing to be successful, it is essential
to have an overall picture of the data.
For many preprocessing tasks, users would like to
learn about data characteristics regarding both central
tendency and dispersion of the data.
Measuring the central tendency
- A distributive measure is a measure that can be computed for a given data set by partitioning the data into smaller subsets and then merging the results to arrive at the measure's value for the entire data set. For example, sum(), count().
- An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. For example, mean, median, mode, and midrange (a small sketch follows below).
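A minimal Python sketch (not from the slides) of these central-tendency measures; the data values are hypothetical and numpy is assumed to be available.

```python
import numpy as np
from statistics import mode

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

mean = data.mean()                         # algebraic: sum() / count()
median = float(np.median(data))            # middle value of the sorted data
most_frequent = mode(data.tolist())        # value occurring most often (first seen on ties, Python 3.8+)
midrange = (data.min() + data.max()) / 2   # average of the largest and smallest values

print(mean, median, most_frequent, midrange)   # 58.0  54.0  52  70.0
```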
Measuring the dispersion
The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. It is measured, for example, by the range, quartile deviation (QD), standard deviation (SD), quantile plots, and scatter plots.
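A companion sketch (same hypothetical values as above) for the numeric dispersion measures just named:

```python
import numpy as np

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

value_range = data.max() - data.min()     # range = max - min
q1, q3 = np.percentile(data, [25, 75])    # first and third quartiles
iqr = q3 - q1                             # interquartile range
quartile_deviation = iqr / 2              # semi-interquartile range (QD)
std_dev = data.std()                      # population standard deviation (SD)

print(value_range, iqr, quartile_deviation, std_dev)
```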
Real-world data tend to be incomplete, noisy, and inconsistent.
Data cleaning routines attempt
to fill in missing values
to smooth out noise while identifying outliers
to correct inconsistencies in the data
to resolve redundancy caused by data integration.
1. Missing Data
Data is not always available, e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
- equipment malfunction
- data that were inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- failure to register the history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
The methods are as follows (a sketch of several appears below):
- Ignore the tuple: usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values.
- Fill in the missing values manually: time consuming; not feasible for large data sets with many missing values.
- Use a global constant to fill in the missing value
- Use the attribute mean to fill in the missing value
- Use the most probable value to fill in the missing value
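A minimal pandas sketch of these fill-in strategies; the column names and values are hypothetical stand-ins.

```python
import pandas as pd

df = pd.DataFrame({"income": [45000, None, 52000, None, 61000],
                   "age": [25, 37, None, 41, 33]})

dropped = df.dropna()                  # ignore (drop) tuples with missing values
constant = df.fillna({"income": 0})    # fill one attribute with a global constant
mean_filled = df.fillna(df.mean())     # fill each attribute with its mean
# "Most probable value" would instead train a model (e.g., regression or a
# decision tree) on the complete tuples and predict the blanks.
```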
2. Noisy Data
What is noise?
It is random error or variance in a measured variable.
Why does noise occur?
Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
How to Handle Noisy Data?
1. Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
2. Regression: smooth by fitting the data to regression functions.
3. Clustering: detect and remove outliers.
4. Combined computer and human inspection: detect suspicious values and have them checked by a human.
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size (a uniform grid)
- If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
- The most straightforward, but outliers may dominate the presentation
- Skewed data is not handled well
Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky
A sketch contrasting the two schemes follows below.
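A minimal numpy sketch of the two partitioning schemes, using the price values from the example on the next slide:

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
N = 3

# Equal-width: interval width W = (B - A) / N
A, B = data.min(), data.max()
width_edges = A + (B - A) / N * np.arange(N + 1)   # [ 4. 14. 24. 34.]

# Equal-depth: each bin receives ~len(data)/N values
depth_bins = np.array_split(data, N)               # 4 values per bin here

print(width_edges)
print([b.tolist() for b in depth_bins])
```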
Binning Methods: Examples
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
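A minimal sketch reproducing the smoothing results above for the equal-frequency price bins:

```python
import numpy as np

bins = [np.array([4, 8, 9, 15]),
        np.array([21, 21, 24, 25]),
        np.array([26, 28, 29, 34])]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [np.full(len(b), round(b.mean())) for b in bins]

# Smoothing by bin boundaries: every value moves to the closer of the
# bin's minimum and maximum.
by_boundaries = [np.where(b - b.min() <= b.max() - b, b.min(), b.max())
                 for b in bins]

print([b.tolist() for b in by_means])       # [[9,9,9,9], [23,23,23,23], [29,29,29,29]]
print([b.tolist() for b in by_boundaries])  # [[4,4,4,15], [21,21,25,25], [26,26,26,34]]
```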
Regression
Data can be smoothed by fitting the data to a function, such as a regression.
[Figure: data points (X1, Y1) smoothed onto the fitted regression line y = x + 1]
Cluster Analysis
Outliers may be detected by clustering, where similar values are organized into
groups, or clusters.
Values that fall outside of the set of clusters may be considered outliers.
Data integration combines data from multiple sources into a coherent data store.
These sources may include multiple databases, data cubes, or flat files.
Data integration issues:
- Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ. Possible reasons: different representations, different scales, e.g., metric vs. British units.
Data integration issues (contd.):
- Schema integration and object matching can be tricky
  - Entity identification problem: how can equivalent real-world entities from multiple data stores be matched up?
- Redundancy and duplication
  - An attribute may be redundant if it can be derived from another attribute or set of attributes.
  - Some redundancies can be detected by correlation analysis.
- Detection and resolution of data value conflicts
Why Does Redundancy Cause Problems?
Consider the following tables:
EMP(ENO, ENAME, BASIC, DA, PAY), where PAY = BASIC + DA
EMPLOYEE(ENO, ENAME, BASIC, DA, PF, PAY), where PAY = BASIC + DA - PF
The derived attribute PAY is computed differently in each source, so the integrated values conflict. In the same way, an ITEM-PRICE attribute may be determined by local taxes, which vary from area to area.
If redundant variables are numeric, it is better to normalize them first before integrating data from multiple sources.
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases:
- Object identification: the same attribute or object may have different names in different databases
- Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis.
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Correlation Analysis (Numerical Data)
The correlation coefficient (also called Pearson's product-moment coefficient):

r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-products.
- If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
- If r_{A,B} = 0, A and B are independent; if r_{A,B} < 0, they are negatively correlated.
A sketch of the computation follows below.
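A minimal numpy sketch of the coefficient above, on hypothetical data; note that using the sample standard deviation (ddof=1) matches the (n-1) denominator.

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.5, 3.1, 6.2, 7.8, 10.1])
n = len(A)

# Pearson r, written exactly as in the formula above.
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(r)                        # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # numpy's built-in computation agrees
```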
Correlation Analysis (Categorical Data)
χ² (chi-square) test:

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
- Correlation does not imply causality:
  - the number of hospitals and the number of car thefts in a city are correlated
  - both are causally linked to a third variable: population
Chi-Square Calculation: An Example
χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated based on the data distribution in the two categories):

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

This shows that like_science_fiction and play_chess are correlated in the group. A sketch reproducing the calculation follows below.
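A minimal numpy sketch reproducing the expected counts and the χ² value above:

```python
import numpy as np

observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # not like science fiction

row = observed.sum(axis=1, keepdims=True)   # [[450], [1050]]
col = observed.sum(axis=0, keepdims=True)   # [[300, 1200]]
expected = row * col / observed.sum()       # [[90, 360], [210, 840]]

chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)   # matches the parenthesised counts in the table
print(chi2)       # ~507.93
```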
Here, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
- Smoothing: remove noise from the data through binning, regression, and clustering
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization
- Min-max normalization, to [new_min_A, new_max_A]:

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

  Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to \frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716.
- Z-score normalization (\mu_A: mean, \sigma_A: standard deviation):

v' = \frac{v - \mu_A}{\sigma_A}

  Ex. Let \mu_A = 54,000 and \sigma_A = 16,000. Then \frac{73600 - 54000}{16000} = 1.225.
- Normalization by decimal scaling:

v' = \frac{v}{10^j}

  where j is the smallest integer such that \max(|v'|) < 1.
A sketch of all three follows below.
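A minimal sketch of the three normalizations, checked against the income examples above; the decimal-scaling input (986) is a hypothetical value.

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    # Rescale v from [mn, mx] to [new_mn, new_mx].
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mean, std):
    # Express v in standard deviations from the mean.
    return (v - mean) / std

def decimal_scaling(v, j):
    # j: smallest integer such that max(|v'|) < 1 over the data.
    return v / 10 ** j

print(min_max(73600, 12000, 98000))   # ~0.716
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling(986, 3))        # 0.986, for values ranging up to 986
```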
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on
the complete data set
Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the integrity of
the original data.
That is, mining on the reduced data set should be more efficient yet produce the
same analytical results.
Data reduction strategies:
1. Data cube aggregation
2. Attribute subset selection
3. Data compression or dimensionality reduction
4. Numerosity reduction, e.g., fitting data into models
1. Data Cube Aggregation
- Data cubes store multidimensional aggregated information.
- The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer, and the lowest level should be useful for analysis.
- The cube at the highest level of abstraction is the apex cuboid.
Data cubes created for varying levels of abstraction are referred to as
cuboids.
Each higher level of abstraction further reduces the resulting data size.
When replying to data mining requests, the smallest available cuboid
relevant to the given task should be used.
2. Attribute selection
Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes or dimensions.
The goal here is to find a minimum set of attributes such that the
resulting probability distribution of data classes is as close as
possible to the original distribution obtained using all attributes.
The best attributes are typically determined using tests of statistical
significance, which assume that the attributes are independent of one
another.
Basic heuristic methods of attribute subset selection include the following techniques (a sketch of the first appears below):
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction
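A minimal sketch of stepwise forward selection, assuming scikit-learn is available; the data set, classifier, and use of cross-validated accuracy as the selection criterion are stand-ins, not the slides' prescribed test.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []

while remaining:
    # Greedily add the attribute that most improves cross-validated accuracy.
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    if selected and scores[best] <= best_score:
        break                      # no attribute improves the score: stop
    selected.append(best)
    remaining.remove(best)
    best_score = scores[best]

print(selected)                    # indices of the chosen attribute subset
```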
3. Data Compression or Dimensionality Reduction
Here, data encodings or transformations are applied so as to obtain a reduced or compressed representation of the original data. Compression may be:
- Lossy
- Lossless
Two lossy compression techniques:
1. Wavelet transforms
   - Discrete wavelet transform (DWT)
   - Discrete Fourier transform (DFT)
   - Hierarchical pyramid algorithm
2. Principal component analysis (PCA)
Examples:
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (not audio)
  - Typically short, and vary slowly with time
Dimensionality Reduction: Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
- Method:
  - The length, L, must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has two functions: smoothing and difference
  - They apply to pairs of data, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
[Figure: Haar-2 and Daubechies-4 wavelet basis functions]
Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data.
Steps (a sketch follows below):
- Normalize the input data, so each attribute falls within the same range
- Compute k orthonormal (unit) vectors, i.e., the principal components
- Each input data vector is a linear combination of the k principal component vectors
- The principal components are sorted in order of decreasing significance or strength
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only. Used when the number of dimensions is large.
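A minimal numpy sketch of the steps above via an eigendecomposition of the covariance matrix; the data are hypothetical, and "normalize" is simplified to mean-centering.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # N = 100 vectors, n = 5 dimensions
k = 2                                    # keep k <= n components

Xc = X - X.mean(axis=0)                  # 1. normalize (here: mean-center)
cov = np.cov(Xc, rowvar=False)           # covariance between attributes
eigvals, eigvecs = np.linalg.eigh(cov)   # 2. orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]        # 3. sort by decreasing variance
components = eigvecs[:, order[:k]]       # 4. drop the weak components

reduced = Xc @ components                # each row is now a k-dim vector
approx = reduced @ components.T + X.mean(axis=0)   # approximate reconstruction
```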
[Figure: Principal Component Analysis, showing principal axes Y1 and Y2 of data plotted in the (X1, X2) plane]
4. Numerosity Reduction
- Techniques of numerosity reduction can be applied to reduce the data volume by choosing alternative, smaller forms of data representation.
- These techniques may be parametric or nonparametric.
- For parametric methods, a model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data. (Outliers may also be stored.)
- Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Data Reduction Method (1): Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line, often using the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions
- Linear regression: Y = wX + b
  - The two regression coefficients, w and b, specify the line and are estimated by using the data at hand
  - They are obtained by applying the least-squares criterion to the known values Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}
A least-squares sketch follows below.
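A minimal numpy sketch of fitting Y = wX + b by least squares on hypothetical data; once fitted, only (w, b) need be stored instead of the data points.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least-squares estimates of the two coefficients:
w = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b = Y.mean() - w * X.mean()

print(w, b)
print(np.polyfit(X, Y, 1))   # numpy's degree-1 fit agrees: [w, b]
```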
Data Reduction Method (2): Histograms
- Divide the data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - Equal-width: equal bucket range
  - Equal-frequency (or equal-depth): equal number of values per bucket
  - V-optimal: the histogram with the least variance (a weighted sum over the original values that each bucket represents)
  - MaxDiff: set bucket boundaries between the pairs of adjacent values having the β - 1 largest differences (for β buckets)
[Figure: example equal-width histogram; x-axis: price (10,000 to 90,000), y-axis: count (0 to 40)]
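A minimal numpy sketch of equal-width vs. equal-frequency buckets for a hypothetical list of prices; only one stored number per bucket replaces the raw values.

```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10,
                   12, 14, 15, 15, 18, 18, 20, 21, 21, 25])

# Equal-width: 4 buckets of equal range.
counts, edges = np.histogram(prices, bins=4)
print(edges)    # bucket boundaries
print(counts)   # one stored count per bucket instead of 20 values

# Equal-frequency: boundaries at the quartiles, ~5 values per bucket.
print(np.percentile(prices, [0, 25, 50, 75, 100]))
```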
Data Reduction Method (3): Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is smeared
- Clustering can be hierarchical and stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
Data Reduction Method (4): Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods, e.g., stratified sampling:
  - Approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data
- Note: sampling may not reduce database I/Os (a page is read at a time)
A sketch of these sampling variants follows below.
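A minimal numpy sketch of the sampling variants above; the data and the skewed class labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.arange(1000)                           # the "whole data set"
labels = np.where(N < 900, "common", "rare")  # skewed class distribution

srs_wor = rng.choice(N, size=50, replace=False)  # simple random, without replacement
srs_wr = rng.choice(N, size=50, replace=True)    # simple random, with replacement

# Stratified: sample each class in proportion to its share of the data,
# so the rare class is still represented.
stratified = np.concatenate([
    rng.choice(N[labels == c],
               size=max(1, int(50 * (labels == c).mean())),
               replace=False)
    for c in np.unique(labels)])
```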
Sampling: With or Without Replacement
[Figure: raw data sampled with replacement and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data vs. cluster/stratified sample]
Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Continuous: real numbers, e.g., integer or real values
Discretization:
- Divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduce data size by discretization
- Prepare for further analysis
Discretization
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
Concept hierarchy formation:
- Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)
Discretization techniques can be categorized based on how the discretization is performed:
1. Supervised discretization: the process uses class information.
2. Unsupervised discretization: the process does not use class information.
- Top-down discretization (splitting): the process starts by first finding one or a few points (split or cut points) to split the entire attribute range, and then repeats recursively on the resulting intervals.
- Bottom-up discretization (merging): the process starts by considering all of the continuous values as potential split points, and merges neighboring values to form intervals.
Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy. Concept hierarchies are useful for mining at multiple levels of abstraction.
Typical methods (all can be applied recursively):
- Binning (covered above): top-down split, unsupervised
- Histogram analysis (covered above): top-down split, unsupervised
- Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
- Entropy-based discretization: supervised, top-down split
- Interval merging by χ² analysis: supervised, bottom-up merge
- Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected class information after partitioning is

I(S,T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2)

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability of class i in S1.
- The boundary that minimizes the entropy function over all possible boundaries is selected as the binary discretization
- The process is applied recursively to the partitions obtained until some stopping criterion is met
- Such a boundary may reduce data size and improve classification accuracy
A sketch appears below.
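A minimal sketch of selecting one entropy-based split; the values and class labels are hypothetical, and candidate boundaries are taken as midpoints between consecutive values.

```python
import numpy as np

values = np.array([1, 3, 4, 6, 7, 8, 10, 12])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # m = 2 classes

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]                               # skip empty classes (log2(0))
    return -(p * np.log2(p)).sum()

def split_info(t):
    # I(S, T): size-weighted entropy of the two intervals around boundary t.
    left, right = labels[values <= t], labels[values > t]
    return (len(left) / len(labels)) * entropy(left) \
         + (len(right) / len(labels)) * entropy(right)

candidates = (values[:-1] + values[1:]) / 2    # midpoints between neighbors
best = min(candidates, key=split_info)
print(best)   # 5.0 here: it separates the two classes perfectly, I(S,T) = 0
```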
Interval Merging by χ² Analysis
- Merging-based (bottom-up) vs. splitting-based methods
- Merge: find the best neighboring intervals and merge them to form larger intervals, recursively
- ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]:
  - Initially, each distinct value of the numerical attribute A is considered to be one interval
  - χ² tests are performed for every pair of adjacent intervals
  - Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
  - This merge process proceeds recursively until a predefined stopping criterion is met (such as a significance level, max-interval, max inconsistency, etc.)
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equal-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
A concept hierarchy for a numerical attribute defines a discretization of the attribute.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for age) with higher-level concepts (such as youth, middle-aged, or senior).
The high-level concepts are useful for data generalization.
Discretization techniques and concept hierarchies are applied before data mining as a preprocessing step, rather than during mining.
Concept Hierarchy Generation for Categorical Data
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping, e.g., {Urbana, Champaign, Chicago}