This chapter covers:
1. Why preprocess the data?
2. Descriptive data summarization
3. Data cleaning
4. Data integration and transformation
5. Data reduction
6. Discretization and concept hierarchy generation
Data in the real world is dirty:
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation = "" (a missing value)
- Noisy: containing errors or outliers
  - e.g., Salary = -10
- Inconsistent: containing discrepancies in codes or names
  - e.g., Age = 42 but Birthday = 03/07/1997
  - e.g., ratings were 1, 2, 3 and are now A, B, C
  - e.g., discrepancies between duplicate records
Real-world data are highly susceptible to noise, incompleteness, and inconsistency.
Why Is Data Dirty?
Inconsistent data may come from:
- Different data sources
- Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning.
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
- Quality decisions must be based on quality data
  - e.g., duplicate or missing data may cause incorrect or even misleading statistics.
- A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
[Figure: noisy data → lack of data quality → lack of quality in query results → lack of quality information → lack of quality in mining → lack of quality decision making]
Measures of Data Quality
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
1. Data cleaning: can be applied to remove noise and correct inconsistencies in the data.
2. Data integration: merges data from multiple sources into a coherent data store, such as a data warehouse.
3. Data transformation: operations such as normalization.
4. Data reduction: can reduce the data size by aggregating, eliminating redundant features, or clustering.
5. Data discretization: part of data reduction, but of particular importance, especially for numerical data.
Data preprocessing techniques are applied before mining.
They can improve the overall quality of the patterns mined and the time required for the actual mining.
For data preprocessing to be successful, it is essential
to have an overall picture of the data.
For many preprocessing tasks, users would like to
learn about data characteristics regarding both central
tendency and dispersion of the data.
Measuring the central tendency
- A distributive measure is a measure that can be computed for a given data set by partitioning the data into smaller subsets and then merging the results to arrive at the measure's value for the entire data set. For example, sum(), count().
- An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. For example, mean, median, mode, and midrange (a small sketch follows below).
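A minimal Python sketch (not from the slides) of these central-tendency measures; the data values are hypothetical and numpy is assumed to be available.

```python
import numpy as np
from statistics import mode

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

mean = data.mean()                         # algebraic: sum() / count()
median = float(np.median(data))            # middle value of the sorted data
most_frequent = mode(data.tolist())        # value occurring most often (first seen on ties, Python 3.8+)
midrange = (data.min() + data.max()) / 2   # average of the largest and smallest values

print(mean, median, most_frequent, midrange)   # 58.0  54.0  52  70.0
```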
Measuring the dispersion
The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. It is measured, for example, by the range, quartile deviation (QD), standard deviation (SD), quantile plots, and scatter plots.
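A companion sketch (same hypothetical values as above) for the numeric dispersion measures just named:

```python
import numpy as np

data = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

value_range = data.max() - data.min()     # range = max - min
q1, q3 = np.percentile(data, [25, 75])    # first and third quartiles
iqr = q3 - q1                             # interquartile range
quartile_deviation = iqr / 2              # semi-interquartile range (QD)
std_dev = data.std()                      # population standard deviation (SD)

print(value_range, iqr, quartile_deviation, std_dev)
```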
Real-world data tend to be incomplete, noisy, and inconsistent.
Data cleaning routines attempt
to fill in missing values
to smooth out noise while identifying outliers
to correct inconsistencies in the data
to resolve redundancy caused by data integration.
1. Missing Data
Data is not always available, e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
- equipment malfunction
- data that were inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- failure to register the history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
The methods are as follows (a sketch of several appears below):
- Ignore the tuple: usually done when the class label is missing. This method is not very effective unless the tuple contains several attributes with missing values.
- Fill in the missing values manually: time consuming; not feasible for large data sets with many missing values.
- Use a global constant to fill in the missing value
- Use the attribute mean to fill in the missing value
- Use the most probable value to fill in the missing value
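A minimal pandas sketch of these fill-in strategies; the column names and values are hypothetical stand-ins.

```python
import pandas as pd

df = pd.DataFrame({"income": [45000, None, 52000, None, 61000],
                   "age": [25, 37, None, 41, 33]})

dropped = df.dropna()                  # ignore (drop) tuples with missing values
constant = df.fillna({"income": 0})    # fill one attribute with a global constant
mean_filled = df.fillna(df.mean())     # fill each attribute with its mean
# "Most probable value" would instead train a model (e.g., regression or a
# decision tree) on the complete tuples and predict the blanks.
```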
2. Noisy Data
What is noise?
It is random error or variance in a measured variable.
Why does noise occur?
Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
How to Handle Noisy Data?
1. Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
2. Regression: smooth by fitting the data to regression functions.
3. Clustering: detect and remove outliers.
4. Combined computer and human inspection: detect suspicious values and have them checked by a human.
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size (a uniform grid)
- If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
- The most straightforward, but outliers may dominate the presentation
- Skewed data is not handled well
Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky
A sketch contrasting the two schemes follows below.
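A minimal numpy sketch of the two partitioning schemes, using the price values from the example on the next slide:

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
N = 3

# Equal-width: interval width W = (B - A) / N
A, B = data.min(), data.max()
width_edges = A + (B - A) / N * np.arange(N + 1)   # [ 4. 14. 24. 34.]

# Equal-depth: each bin receives ~len(data)/N values
depth_bins = np.array_split(data, N)               # 4 values per bin here

print(width_edges)
print([b.tolist() for b in depth_bins])
```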
Binning Methods: Examples
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
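A minimal sketch reproducing the smoothing results above for the equal-frequency price bins:

```python
import numpy as np

bins = [np.array([4, 8, 9, 15]),
        np.array([21, 21, 24, 25]),
        np.array([26, 28, 29, 34])]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [np.full(len(b), round(b.mean())) for b in bins]

# Smoothing by bin boundaries: every value moves to the closer of the
# bin's minimum and maximum.
by_boundaries = [np.where(b - b.min() <= b.max() - b, b.min(), b.max())
                 for b in bins]

print([b.tolist() for b in by_means])       # [[9,9,9,9], [23,23,23,23], [29,29,29,29]]
print([b.tolist() for b in by_boundaries])  # [[4,4,4,15], [21,21,25,25], [26,26,26,34]]
```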
Regression
Data can be smoothed by fitting the data to a function, such as a regression.
[Figure: data points (X1, Y1) smoothed onto the fitted regression line y = x + 1]
Cluster Analysis
Outliers may be detected by clustering, where similar values are organized into
groups, or clusters.
Values that fall outside of the set of clusters may be considered outliers.
Data integration combines data from multiple sources into a coherent data store.
These sources may include multiple databases, data cubes, or flat files.
Data integration issues:
- Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ. Possible reasons: different representations, different scales, e.g., metric vs. British units.
Data integration issues (contd.):
- Schema integration and object matching can be tricky
  - Entity identification problem: how can equivalent real-world entities from multiple data stores be matched up?
- Redundancy and duplication
  - An attribute may be redundant if it can be derived from another attribute or set of attributes.
  - Some redundancies can be detected by correlation analysis.
- Detection and resolution of data value conflicts
Why Does Redundancy Cause Problems?
Consider the following tables:
EMP(ENO, ENAME, BASIC, DA, PAY), where PAY = BASIC + DA
EMPLOYEE(ENO, ENAME, BASIC, DA, PF, PAY), where PAY = BASIC + DA - PF
The derived attribute PAY is computed differently in each source, so the integrated values conflict. In the same way, an ITEM-PRICE attribute may be determined by local taxes, which vary from area to area.
If redundant variables are numeric, it is better to normalize them first before integrating data from multiple sources.
Handling Redundancy in Data Integration
Redundant data occur often when integrating multiple databases:
- Object identification: the same attribute or object may have different names in different databases
- Derivable data: one attribute may be a derived attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis.
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Correlation Analysis (Numerical Data)
The correlation coefficient (also called Pearson's product-moment coefficient):

r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum a_i b_i is the sum of the AB cross-products.
- If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
- If r_{A,B} = 0, A and B are independent; if r_{A,B} < 0, they are negatively correlated.
A sketch of the computation follows below.
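A minimal numpy sketch of the coefficient above, on hypothetical data; note that using the sample standard deviation (ddof=1) matches the (n-1) denominator.

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.5, 3.1, 6.2, 7.8, 10.1])
n = len(A)

# Pearson r, written exactly as in the formula above.
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(r)                        # close to +1: strongly positively correlated
print(np.corrcoef(A, B)[0, 1])  # numpy's built-in computation agrees
```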
Correlation Analysis (Categorical Data)
χ² (chi-square) test:

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
- Correlation does not imply causality:
  - the number of hospitals and the number of car thefts in a city are correlated
  - both are causally linked to a third variable: population
Chi-Square Calculation: An Example
χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated based on the data distribution in the two categories):

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

This shows that like_science_fiction and play_chess are correlated in the group. A sketch reproducing the calculation follows below.
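A minimal numpy sketch reproducing the expected counts and the χ² value above:

```python
import numpy as np

observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # not like science fiction

row = observed.sum(axis=1, keepdims=True)   # [[450], [1050]]
col = observed.sum(axis=0, keepdims=True)   # [[300, 1200]]
expected = row * col / observed.sum()       # [[90, 360], [210, 840]]

chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)   # matches the parenthesised counts in the table
print(chi2)       # ~507.93
```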
Here, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
- Smoothing: remove noise from the data through binning, regression, and clustering
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization
- Min-max normalization, to [new_min_A, new_max_A]:

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

  Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to \frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716.
- Z-score normalization (\mu_A: mean, \sigma_A: standard deviation):

v' = \frac{v - \mu_A}{\sigma_A}

  Ex. Let \mu_A = 54,000 and \sigma_A = 16,000. Then \frac{73600 - 54000}{16000} = 1.225.
- Normalization by decimal scaling:

v' = \frac{v}{10^j}

  where j is the smallest integer such that \max(|v'|) < 1.
A sketch of all three follows below.
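A minimal sketch of the three normalizations, checked against the income examples above; the decimal-scaling input (986) is a hypothetical value.

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    # Rescale v from [mn, mx] to [new_mn, new_mx].
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mean, std):
    # Express v in standard deviations from the mean.
    return (v - mean) / std

def decimal_scaling(v, j):
    # j: smallest integer such that max(|v'|) < 1 over the data.
    return v / 10 ** j

print(min_max(73600, 12000, 98000))   # ~0.716
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling(986, 3))        # 0.986, for values ranging up to 986
```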
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on
the complete data set
Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the integrity of
the original data.
That is, mining on the reduced data set should be more efficient yet produce the
same analytical results.
Data reduction strategies:
1. Data cube aggregation
2. Attribute subset selection
3. Data compression or dimensionality reduction
4. Numerosity reduction, e.g., fitting data into models
1. Data Cube Aggregation
- Data cubes store multidimensional aggregated information.
- The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer, and the lowest level should be useful for analysis.
- The cube at the highest level of abstraction is the apex cuboid.
Data cubes created for varying levels of abstraction are referred to as
cuboids.
Each higher level of abstraction further reduces the resulting data size.
When replying to data mining requests, the smallest available cuboid
relevant to the given task should be used.
2. Attribute selection
Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes or dimensions.
The goal here is to find a minimum set of attributes such that the
resulting probability distribution of data classes is as close as
possible to the original distribution obtained using all attributes.
The best attributes are typically determined using tests of statistical
significance, which assume that the attributes are independent of one
another.
Basic heuristic methods of attribute subset selection include the following techniques (a sketch of the first appears below):
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction
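A minimal sketch of stepwise forward selection, assuming scikit-learn is available; the data set, classifier, and use of cross-validated accuracy as the selection criterion are stand-ins, not the slides' prescribed test.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []

while remaining:
    # Greedily add the attribute that most improves cross-validated accuracy.
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    if selected and scores[best] <= best_score:
        break                      # no attribute improves the score: stop
    selected.append(best)
    remaining.remove(best)
    best_score = scores[best]

print(selected)                    # indices of the chosen attribute subset
```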
3. Data Compression or Dimensionality Reduction
Here, data encodings or transformations are applied so as to obtain a reduced or compressed representation of the original data. Compression may be:
- Lossy
- Lossless
Two lossy compression techniques:
1. Wavelet transforms
   - Discrete wavelet transform (DWT)
   - Discrete Fourier transform (DFT)
   - Hierarchical pyramid algorithm
2. Principal component analysis (PCA)
Examples:
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (not audio)
  - Typically short, and vary slowly with time
Dimensionality Reduction: Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
- Method:
  - The length, L, must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has two functions: smoothing and difference
  - They apply to pairs of data, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
[Figure: Haar-2 and Daubechies-4 wavelet basis functions]
Dimensionality Reduction: Principal Component Analysis (PCA)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data.
Steps (a sketch follows below):
- Normalize the input data, so each attribute falls within the same range
- Compute k orthonormal (unit) vectors, i.e., the principal components
- Each input data vector is a linear combination of the k principal component vectors
- The principal components are sorted in order of decreasing significance or strength
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only. Used when the number of dimensions is large.
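A minimal numpy sketch of the steps above via an eigendecomposition of the covariance matrix; the data are hypothetical, and "normalize" is simplified to mean-centering.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # N = 100 vectors, n = 5 dimensions
k = 2                                    # keep k <= n components

Xc = X - X.mean(axis=0)                  # 1. normalize (here: mean-center)
cov = np.cov(Xc, rowvar=False)           # covariance between attributes
eigvals, eigvecs = np.linalg.eigh(cov)   # 2. orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]        # 3. sort by decreasing variance
components = eigvecs[:, order[:k]]       # 4. drop the weak components

reduced = Xc @ components                # each row is now a k-dim vector
approx = reduced @ components.T + X.mean(axis=0)   # approximate reconstruction
```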
[Figure: Principal Component Analysis, showing principal axes Y1 and Y2 of data plotted in the (X1, X2) plane]
4. Numerosity Reduction
- Techniques of numerosity reduction can be applied to reduce the data volume by choosing alternative, smaller forms of data representation.
- These techniques may be parametric or nonparametric.
- For parametric methods, a model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data. (Outliers may also be stored.)
- Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Data Reduction Method (1): Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line, often using the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions
- Linear regression: Y = wX + b
  - The two regression coefficients, w and b, specify the line and are estimated by using the data at hand
  - They are obtained by applying the least-squares criterion to the known values Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}
A least-squares sketch follows below.
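A minimal numpy sketch of fitting Y = wX + b by least squares on hypothetical data; once fitted, only (w, b) need be stored instead of the data points.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least-squares estimates of the two coefficients:
w = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b = Y.mean() - w * X.mean()

print(w, b)
print(np.polyfit(X, Y, 1))   # numpy's degree-1 fit agrees: [w, b]
```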
Data Reduction Method (2): Histograms
- Divide the data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - Equal-width: equal bucket range
  - Equal-frequency (or equal-depth): equal number of values per bucket
  - V-optimal: the histogram with the least variance (a weighted sum over the original values that each bucket represents)
  - MaxDiff: set bucket boundaries between the pairs of adjacent values having the β - 1 largest differences (for β buckets)
[Figure: example equal-width histogram; x-axis: price (10,000 to 90,000), y-axis: count (0 to 40)]
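A minimal numpy sketch of equal-width vs. equal-frequency buckets for a hypothetical list of prices; only one stored number per bucket replaces the raw values.

```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10,
                   12, 14, 15, 15, 18, 18, 20, 21, 21, 25])

# Equal-width: 4 buckets of equal range.
counts, edges = np.histogram(prices, bins=4)
print(edges)    # bucket boundaries
print(counts)   # one stored count per bucket instead of 20 values

# Equal-frequency: boundaries at the quartiles, ~5 values per bucket.
print(np.percentile(prices, [0, 25, 50, 75, 100]))
```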
Data Reduction Method (3): Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is smeared
- Clustering can be hierarchical and stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
Data Reduction Method (4): Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods, e.g., stratified sampling:
  - Approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data
- Note: sampling may not reduce database I/Os (a page is read at a time)
A sketch of these sampling variants follows below.
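A minimal numpy sketch of the sampling variants above; the data and the skewed class labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.arange(1000)                           # the "whole data set"
labels = np.where(N < 900, "common", "rare")  # skewed class distribution

srs_wor = rng.choice(N, size=50, replace=False)  # simple random, without replacement
srs_wr = rng.choice(N, size=50, replace=True)    # simple random, with replacement

# Stratified: sample each class in proportion to its share of the data,
# so the rare class is still represented.
stratified = np.concatenate([
    rng.choice(N[labels == c],
               size=max(1, int(50 * (labels == c).mean())),
               replace=False)
    for c in np.unique(labels)])
```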
Sampling: With or Without Replacement
[Figure: raw data sampled with replacement and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data vs. cluster/stratified sample]
Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Continuous: real numbers, e.g., integer or real values
Discretization:
- Divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduce data size by discretization
- Prepare for further analysis
Discretization
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
Concept hierarchy formation:
- Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)
Discretization techniques can be categorized based on how the discretization is performed:
1. Supervised discretization: the process uses class information.
2. Unsupervised discretization: the process does not use class information.
- Top-down discretization (splitting): the process starts by first finding one or a few points (split or cut points) to split the entire attribute range, and then repeats recursively on the resulting intervals.
- Bottom-up discretization (merging): the process starts by considering all of the continuous values as potential split points, and merges neighboring values to form intervals.
Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy. Concept hierarchies are useful for mining at multiple levels of abstraction.
Typical methods (all can be applied recursively):
- Binning (covered above): top-down split, unsupervised
- Histogram analysis (covered above): top-down split, unsupervised
- Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
- Entropy-based discretization: supervised, top-down split
- Interval merging by χ² analysis: supervised, bottom-up merge
- Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected class information after partitioning is

I(S,T) = \frac{|S_1|}{|S|}\,Entropy(S_1) + \frac{|S_2|}{|S|}\,Entropy(S_2)

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability of class i in S1.
- The boundary that minimizes the entropy function over all possible boundaries is selected as the binary discretization
- The process is applied recursively to the partitions obtained until some stopping criterion is met
- Such a boundary may reduce data size and improve classification accuracy
A sketch appears below.
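A minimal sketch of selecting one entropy-based split; the values and class labels are hypothetical, and candidate boundaries are taken as midpoints between consecutive values.

```python
import numpy as np

values = np.array([1, 3, 4, 6, 7, 8, 10, 12])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # m = 2 classes

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]                               # skip empty classes (log2(0))
    return -(p * np.log2(p)).sum()

def split_info(t):
    # I(S, T): size-weighted entropy of the two intervals around boundary t.
    left, right = labels[values <= t], labels[values > t]
    return (len(left) / len(labels)) * entropy(left) \
         + (len(right) / len(labels)) * entropy(right)

candidates = (values[:-1] + values[1:]) / 2    # midpoints between neighbors
best = min(candidates, key=split_info)
print(best)   # 5.0 here: it separates the two classes perfectly, I(S,T) = 0
```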
Interval Merging by χ² Analysis
- Merging-based (bottom-up) vs. splitting-based methods
- Merge: find the best neighboring intervals and merge them to form larger intervals, recursively
- ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]:
  - Initially, each distinct value of the numerical attribute A is considered to be one interval
  - χ² tests are performed for every pair of adjacent intervals
  - Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
  - This merge process proceeds recursively until a predefined stopping criterion is met (such as a significance level, max-interval, max inconsistency, etc.)
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equal-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
A concept hierarchy for a numerical attribute defines a discretization of the attribute.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for age) with higher-level concepts (such as youth, middle-aged, or senior).
The high-level concepts are useful for data generalization.
Discretization techniques and concept hierarchies are applied before data mining as a preprocessing step, rather than during mining.
Concept Hierarchy Generation for Categorical Data
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping, e.g., {Urbana, Champaign, Chicago}