Data Mining MTAT.03.183 (4AP = 6EAP): Descriptive analysis and preprocessing. Jaak Vilo, 2009 Fall
Page 1:

Data Mining MTAT.03.183 (4AP = 6EAP)

Descriptive analysis and preprocessing

Jaak Vilo

2009 Fall

Page 2:

Reminder – shopping basket

• Database consists of sets of items bought together

• Describe the data

• Characterise the “typical” purchase patterns
– Frequent subsets of items

– Association rules -> correlations

• Be clever in how to count

• Lattices, borders, RAM/disk, sets, indexes…

Jaak Vilo and other authors UT: Data Mining 2009 2

Page 3:

Motivation for apriori

• RAM can hold a fraction of disk

DB

Page 4: [figure]

Page 5:

Algorithms

• Theoretical characteristics: O() notation
– linear time

– n log n

– on paper vs in practice

• Machine learning
– Quality of predictions

– time, etc.

Page 6:

Need

• Scan DB only a limited # of times

• Collect necessary information, generate new candidates

• In consecutive scans, check/verify

Page 7:

Data preprocessing

Page 8:

Why Data Preprocessing?

Data in the real world is dirty:

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=“ ”

noisy: containing errors or outliers
e.g., Salary=“-10”

inconsistent: containing discrepancies in codes or names
e.g., Age=“42”, Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records

September 17, 2009 Data Mining: Concepts and Techniques 8

Page 9:

Why Is Data Dirty?

Incomplete data may come from:
“Not applicable” data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems

Noisy data (incorrect values) may come from:
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission

Inconsistent data may come from:
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning

Page 10:

Why Is Data Preprocessing Important?

No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Garbage in, garbage out

Page 11:

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility

Broad categories: intrinsic, contextual, representational, and accessibility

Page 12:

Major Tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation

Data reduction
Obtains reduced representation in volume but produces the same or similar analytical results

Data discretization
Part of data reduction but with particular importance, especially for numerical data

Page 13:

Forms of Data Preprocessing

Page 14:

Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Page 15:

• 1.51 9.36 5.93 2.09 4.67 7.38 8.83 2.35 4.90 4.00 7.25 6.93 8.22 2.15 4.04 5.00 6.15 5.39 0.09 6.39 3.78 1.37 4.44 8.45 8.97 2.39 6.08 4.04 3.39 0.87 9.61 5.33 6.46 4.08 8.16 6.41 7.64 5.41 6.95 8.30 0.61 9.48 9.87 1.05 5.50 4.18 9.86 1.45 1.48 5.95 0.85 9.16 3.10 3.45 5.58 0.99 7.52 1.26 8.64 0.39 9.32 7.17 8.96 9.27 1.53 6.08 8.97 0.03 8.75 8.85 0.58 8.33 8.90 6.71 9.53 1.37 5.01 0.58 0.45 2.26 7.44 9.90 4.57 8.24 0.02 7.37 1.00 9.48 0.50 4.50 0.21 5.62 7.07 7.81 5.33 9.14 7.68 6.95 2.85 8.39

Page 16:

Characterise data

use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_date, residence, phone#, gpa
from student
where status in "graduate"

Page 17:

Mining Data Descriptive Characteristics

Motivation

To better understand the data: central tendency, variation and spread

Data dispersion characteristics

median, max, min, quantiles, outliers, variance, etc.

Numerical dimensions correspond to sorted intervals

Data dispersion: analyzed with multiple granularities of precision

Boxplot or quantile analysis on sorted intervals

Dispersion analysis on computed measures

Folding measures into numerical dimensions

Boxplot or quantile analysis on the transformed cube

Page 18:

Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population):

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    $\mu = \frac{1}{N}\sum x_i$

Weighted arithmetic mean:

$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

Trimmed mean: chopping extreme values

Median: a holistic measure

Middle value if odd number of values, or average of the middle two values otherwise

Estimated by interpolation (for grouped data):

$\mathrm{median} = L_1 + \left(\frac{n/2 - (\sum \mathrm{freq})_l}{\mathrm{freq}_{\mathrm{median}}}\right) c$

Mode

Value that occurs most frequently in the data

Unimodal, bimodal, trimodal

Empirical formula: $\mathrm{mean} - \mathrm{mode} \approx 3 \times (\mathrm{mean} - \mathrm{median})$
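A quick numeric check of the three measures of central tendency (a Python sketch; the dataset reuses the small vector from the R example on page 43):

```python
from statistics import mean, median, mode

# Same small dataset as the R example later in the slides (page 43)
data = [1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 3, 4, 5, 4, 3, 2, 3, 4, 4, 5, 6, 7]

m = mean(data)      # arithmetic mean: sum(data) / len(data)
med = median(data)  # middle value of the sorted data (n = 23 is odd)
mo = mode(data)     # most frequent value

print(m, med, mo)
```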

Page 19:

• Histograms and Probability Density Functions

• Probability Density Functions
– Total area under curve integrates to 1

• Frequency Histograms
– Y-axis is counts
– Simple interpretation
– Can't be directly related to probabilities or density functions

• Relative Frequency Histograms
– Divide counts by total number of observations
– Y-axis is relative frequency
– Can be interpreted as probabilities for each range
– Can't be directly related to density function
• Bar heights sum to 1 but won't integrate to 1 unless bar width = 1

• Density Histograms
– Divide counts by (total number of observations × bar width)
– Y-axis is density values
– Bar height × bar width gives probability for each range
– Can be directly related to density function
• Bar areas sum to 1
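The three normalizations above differ only in what the counts are divided by; a dependency-free Python sketch (the uniform sample here is arbitrary illustration data):

```python
import random

random.seed(0)
data = [random.uniform(0, 10) for _ in range(100)]

bins, lo, hi = 5, 0.0, 10.0
width = (hi - lo) / bins
counts = [0] * bins
for x in data:
    i = min(int((x - lo) / width), bins - 1)  # clamp x == hi into the last bin
    counts[i] += 1

n = len(data)
rel_freq = [c / n for c in counts]           # bar heights sum to 1
density = [c / (n * width) for c in counts]  # bar areas (height x width) sum to 1

print(sum(rel_freq))                   # ~1.0 (up to float rounding)
print(sum(d * width for d in density)) # ~1.0 (up to float rounding)
```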


http://www.geog.ucsb.edu/~joel/g210_w07/lecture_notes/lect04/oh07_04_1.html

Page 20: [figure]

Page 21: [figure]

Page 22: [figure]

Page 23:

Histograms

• equal sub-intervals, known as ‘bins’

• break points

• bin width

Page 24:

The data are (the log of) wing spans of aircraft built from 1956–1984.

Page 25: [figure]

Page 26: [figure]

Page 27:

Histogram vs kernel density

• Properties of histograms, shown by these two examples:
– they are not smooth
– depend on end points of bins
– depend on width of bins

• We can alleviate the first two problems by using kernel density estimators.

• To remove the dependence on the end points of the bins, we centre each of the blocks at each data point rather than fixing the end points of the blocks.

Page 28:

We place a block of width 1 and height 1/12 (the dotted boxes) at each data point, as there are 12 data points, and then add them up.

Page 29:

• Blocks: it is still discontinuous, as we have used a discontinuous kernel as our building block

• If we use a smooth kernel for our building block, then we will have a smooth density estimate.

Page 30:

• It's important to choose the most appropriate bandwidth, as a value that is too small or too large is not useful.

• If we use a normal (Gaussian) kernel with bandwidth or standard deviation of 0.1 (which has area 1/12 under each curve), then the kernel density estimate is said to be undersmoothed, as the bandwidth is too small, in the figure below.

• It appears that there are 4 modes in this density; some of these are surely artifices of the data.

Page 31: [figure]

Page 32: [figure]

Page 33:

• Choose optimal bandwidth
– Need to estimate it

• AMISE = Asymptotic Mean Integrated Squared Error

• optimal bandwidth = arg min AMISE
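AMISE depends on the unknown density, so practical bandwidth selectors plug in an estimate. One standard closed-form choice for a Gaussian kernel, not stated on the slides, is Silverman's rule of thumb, h = 1.06·σ̂·n^(−1/5); a sketch:

```python
import statistics

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth for a Gaussian kernel: h = 1.06 * sigma * n**(-1/5)."""
    n = len(data)
    sigma = statistics.stdev(data)  # sample standard deviation as the plug-in scale
    return 1.06 * sigma * n ** (-1 / 5)

# Same small dataset as the R example on page 43
d = [1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 3, 4, 5, 4, 3, 2, 3, 4, 4, 5, 6, 7]
print(round(silverman_bandwidth(d), 3))
```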

Page 34:

• The optimal value of the bandwidth for our dataset is about 0.25. 

• From the optimally smoothed kernel density estimate, there are two modes. As these are the log of aircraft wing span, it means that there were a group of smaller, lighter planes built, and these are clustered around 2.5 (which is about 12 m). 

• Whereas the larger planes, maybe using jet engines as these were used on a commercial scale from about the 1960s, are grouped around 3.5 (about 33 m).

Page 35: [figure]

Page 36:

• The properties of kernel density estimators, as compared to histograms:
– smooth
– no end points
– depend on bandwidth

Page 37:

Kernel density estimation

• Gentle introduction
– http://school.maths.uwa.edu.au/~duongt/seminars/intro2kde/

• Tutorial
– http://parallel.vub.ac.be/research/causalModels/tutorial/kde.html

Page 38: [figure]

Page 39: [figure]

Page 40: [figure]

Page 41:

http://jmlr.csail.mit.edu/proceedings/papers/v2/kontkanen07a/kontkanen07a.pdf

Page 42: [figure]

Page 43:

R – example (due to K. Tretjakov)

d = c(1,2,2,2,2,1,2,2,2,3,2,3,4,5,4,3,2,3,4,4,5,6,7)

kernelsmooth <- function(data, sigma, x) {
  # sum a Gaussian kernel centred at each data point
  result = 0
  for (d in data) {
    result = result + exp(-(x - d)^2 / 2 / sigma^2)
  }
  # note: not divided by length(data), so the curve integrates to n,
  # matching the count histogram below (bin width 1)
  result / sqrt(2 * pi) / sigma
}

x = seq(min(d), max(d), by = 0.1)
y = sapply(x, function(x) { kernelsmooth(d, 1, x) })

hist(d)
lines(x, y)

Page 44:

• R tutorial
– http://cran.r-project.org/doc/manuals/R-intro.html
– http://www.google.com/search?q=R+tutorial

Page 45:

More links on R and kernel density

• http://en.wikipedia.org/wiki/Kernel_density_estimation

• http://sekhon.berkeley.edu/stats/html/density.html

• http://stat.ethz.ch/R‐manual/R‐patched/library/stats/html/density.html

• http://www.google.com/search?q=kernel+density+estimation+R

Page 46:

Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data [figure]

Page 47:

Measuring the Dispersion of Data

Quartiles, outliers and boxplots

Quartiles: Q1 (25th percentile), Q3 (75th percentile)

Inter-quartile range: IQR = Q3 – Q1

Five-number summary: min, Q1, M, Q3, max

Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outliers individually

Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
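These quartile and outlier definitions can be sketched in a few lines of Python (note: quartile conventions differ between tools; `method="inclusive"` is just one choice, and the dataset reuses the small vector from the R example on page 43):

```python
import statistics

data = [1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 3, 4, 5, 4, 3, 2, 3, 4, 4, 5, 6, 7]

# Quartiles: the three cut points that split the data into four equal parts
q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual 1.5 x IQR outlier fences

five_number = (min(data), q1, med, q3, max(data))
outliers = [x for x in data if x < lo or x > hi]
print(five_number)
print(outliers)
```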

Variance and standard deviation (sample: s, population: σ)

Variance (algebraic, scalable computation):

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2 - \mu^2$

Standard deviation s (or σ) is the square root of variance s² (or σ²)
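The two expressions for the sample variance above are algebraically equal; a quick Python check (using the small dataset from the R example on page 43):

```python
data = [1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 3, 4, 5, 4, 3, 2, 3, 4, 4, 5, 6, 7]
n = len(data)
mean = sum(data) / n

# Two-pass definition: s^2 = sum((x - mean)^2) / (n - 1)
s2_two_pass = sum((x - mean) ** 2 for x in data) / (n - 1)

# One-pass ("scalable") form: only sum(x) and sum(x^2) are needed,
# so it can be computed in a single scan of the data
# (beware: this form can be numerically unstable for large values)
s2_one_pass = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(abs(s2_two_pass - s2_one_pass) < 1e-12)  # True: the two forms agree
```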

Page 48:

Properties of Normal Distribution Curve

The normal (distribution) curve:
From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
From μ–2σ to μ+2σ: contains about 95% of it
From μ–3σ to μ+3σ: contains about 99.7% of it
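The 68/95/99.7 figures follow from the normal CDF: P(μ − kσ < X < μ + kσ) = erf(k/√2). A quick check in Python:

```python
from math import erf, sqrt

for k in (1, 2, 3):
    # probability that a normal variable falls within k standard deviations of the mean
    p = erf(k / sqrt(2))
    print(f"within {k} sigma: {100 * p:.1f}%")
# prints 68.3%, 95.4%, 99.7%
```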

Page 49:

Boxplot Analysis

Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum

Boxplot:
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum

Page 50:

Box Plots

• Tukey77: John W. Tukey, "Exploratory Data Analysis". Addison-Wesley, Reading, MA, 1977.

• http://informationandvisualization.de/blog/box-plot

Page 51: [figure]

Page 52: [figure]

Page 53:

Visualization of Data Dispersion: Boxplot Analysis

Page 54: [figure]

Page 55: [figure]

Page 56:

Quantile Plot

Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)

Plots quantile information
For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i % of the data are below or equal to the value x_i
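A sketch of how the plotted (f_i, x_i) pairs are computed, using the common convention f_i = (i − 0.5)/n (the exact convention varies by textbook and tool):

```python
data = [1, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 3, 4, 5, 4, 3, 2, 3, 4, 4, 5, 6, 7]

xs = sorted(data)
n = len(xs)
# f_i = (i - 0.5) / n for the i-th smallest value (i counted from 1 here,
# so the 0-based enumerate index needs + 0.5)
points = [((i + 0.5) / n, x) for i, x in enumerate(xs)]

for f, x in points[:3]:
    print(f"f={f:.3f}  x={x}")
```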

Page 57:

Kemmeren et al. (Mol. Cell, 2002)

Randomized expression data

Yeast 2-hybrid studies

Known (literature) PPI

MPK1 YLR350w
SNF4 YCL046W
SNF7 YGR122W
…

Page 58:

Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Allows the user to view whether there is a shift in going from one distribution to another

Page 59:

Scatter plot

Provides a first look at bivariate data to see clusters of points, outliers, etc.

Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Page 60: [figure]

Page 61:

Not Correlated Data

Page 62:

Loess Curve

Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence

Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Page 63:

Elements of microarray statistics

Reference vs Test:

M = log2 R – log2 G = log2(R/G)

A = 1/2 (log2 R + log2 G)

Page 64:

Normalisation

can be used to transform data

Expression Profiler 64

Page 65: [figure]

Page 66:

Graphic Displays of Basic Statistical Descriptions

Histogram: (shown before)
Boxplot: (covered before)
Quantile plot: each value x_i is paired with f_i, indicating that approximately 100·f_i % of data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence

Page 67:

Chapter 2: Data PreprocessingChapter 2: Data Preprocessing

Why preprocess the data?

D i ti d t i ti Descriptive data summarization

Data cleaning Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

September 17, 2009 Data Mining: Concepts and Techniques 67

Summary

Data Cleaning

Importance:
- "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
- "Data cleaning is the number one problem in data warehousing" (DCI survey)

Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration

Missing Data

- Data is not always available: e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - failure to register history or changes of the data
- Missing data may need to be inferred.

How to Handle Missing Data?

- Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with:
  - a global constant: e.g., "unknown", a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree
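The class-wise mean strategy can be sketched in a few lines of Python; the customer table, its column names, and the use of None as the missing-value marker are illustrative assumptions:

```python
def fill_missing_with_class_mean(rows, attr, label):
    """Replace None in `attr` by the mean of that attribute
    within each class given by `label`."""
    sums, counts = {}, {}
    for r in rows:
        if r[attr] is not None:
            c = r[label]
            sums[c] = sums.get(c, 0.0) + r[attr]
            counts[c] = counts.get(c, 0) + 1
    filled = []
    for r in rows:
        r = dict(r)                          # keep the input untouched
        if r[attr] is None:
            r[attr] = sums[r[label]] / counts[r[label]]
        filled.append(r)
    return filled

customers = [
    {"income": 30, "class": "budget"},
    {"income": 50, "class": "budget"},
    {"income": None, "class": "budget"},     # filled with (30 + 50) / 2 = 40
    {"income": 90, "class": "premium"},
]
result = fill_missing_with_class_mean(customers, "income", "class")
```

Dropping the `label` grouping (one stratum for all rows) gives the plain attribute-mean variant.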

Noisy Data

- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?

- Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Regression: smooth by fitting the data into regression functions
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and check by human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning

Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
- The most straightforward, but outliers may dominate the presentation
- Skewed data is not handled well

Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
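The partitioning and smoothing steps above can be sketched in Python; it assumes the number of values divides evenly into the bins, and rounds the means as the slide does:

```python
def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of equal size
    (assumes len(values) is divisible by n_bins)."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # every value in a bin is replaced by the (rounded) bin mean
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # every value is replaced by the closer of the bin's min/max
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
means = smooth_by_means(bins)          # [9,9,9,9], [23,...], [29,...]
bounds = smooth_by_boundaries(bins)    # [4,4,4,15], [21,21,25,25], [26,26,26,34]
```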

Regression

[Figure: scatter of points with the fitted regression line y = x + 1; the observed value Y1 at X1 is replaced by the smoothed value Y1' on the line.]

Cluster Analysis

[Figure: points grouped into clusters.]

Data Cleaning as a Process

Data discrepancy detection:
- Use metadata (e.g., domain, range, dependency, distribution)
- Check field overloading
- Check uniqueness rule, consecutive rule and null rule
- Use commercial tools:
  - Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
  - Data auditing: analyze the data to discover rules and relationships and detect violators (e.g., correlation and clustering to find outliers)

Data migration and integration:
- Data migration tools: allow transformations to be specified
- ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

Integration of the two processes:
- Iterative and interactive (e.g., Potter's Wheel)

Chapter 2: Data Preprocessing

- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Data Integration

- Data integration: combines data from multiple sources into a coherent store
- Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
- Entity identification problem:
  - Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts:
  - For the same real-world entity, attribute values from different sources differ
  - Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

- Redundant data occur often when integrating multiple databases
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant attributes may be detected by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson's product-moment coefficient):

    r(A,B) = Σ (a - Ā)(b - B̄) / ((n-1) σA σB) = (Σ ab - n Ā B̄) / ((n-1) σA σB)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ ab is the sum of the AB cross-products.

- If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation
- r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
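The coefficient is easy to compute directly; in the form below the sums of squared deviations cancel the (n-1) factors of the formula above:

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson's product-moment correlation of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly correlated: 1.0
r_neg = pearson_r([1, 2, 3], [3, 2, 1])         # perfectly anti-correlated: -1.0
```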

Correlation Analysis (Categorical Data)

χ² (chi-square) test:

    χ² = Σ (Observed - Expected)² / Expected

- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
- Correlation does not imply causality
  - The number of hospitals and the number of car thefts in a city are correlated
  - Both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
  Like science fiction     250 (90)     200 (360)        450
  Not like science fiction 50 (210)     1000 (840)       1050
  Sum (col.)               300          1200             1500

χ² calculation (numbers in parentheses are expected counts, calculated based on the data distribution in the two categories):

    χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93

It shows that like_science_fiction and play_chess are correlated in the group.
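The statistic for the table above can be recomputed directly; each expected count is row_total * col_total / n:

```python
def chi_square(table):
    """Chi-square statistic of a 2-D contingency table (list of rows)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected count
            chi2 += (obs - exp) ** 2 / exp
    return chi2

table = [[250, 200],   # like science fiction
         [50, 1000]]   # not like science fiction
stat = chi_square(table)   # about 507.93, as on the slide
```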

Data Transformation

- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization

- Min-max normalization: to [new_minA, new_maxA]

      v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA

  Ex.: let income range $12,000 to $98,000 be normalized to [0.0, 1.0]; then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716

- Z-score normalization (μ: mean, σ: standard deviation):

      v' = (v - μA) / σA

  Ex.: let μ = 54,000, σ = 16,000; then (73,600 - 54,000) / 16,000 = 1.225

- Normalization by decimal scaling:

      v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
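The three normalizations above, checked against the slide's income examples:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # linearly rescale v from [old_min, old_max] to [new_min, new_max]
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    # find the smallest j such that max(|v'|) < 1, then divide by 10^j
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return v / 10 ** j

a = min_max(73600, 12000, 98000)        # about 0.716
b = z_score(73600, 54000, 16000)        # 1.225
c = decimal_scaling(-986, 986)          # -0.986 (j = 3)
```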

Chapter 2: Data Preprocessing

- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Data Reduction Strategies

Why data reduction?
- A database/data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction:
- Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results

Data reduction strategies:
- Data cube aggregation
- Dimensionality reduction: e.g., remove unimportant attributes
- Data compression
- Numerosity reduction: e.g., fit data into models
- Discretization and concept hierarchy generation

Data Cube Aggregation

- The lowest level of a data cube (base cuboid)
  - The aggregated data for an individual entity of interest
  - E.g., a customer in a phone calling data warehouse
- Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with
- Reference appropriate levels
  - Use the smallest representation which is enough to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible

Attribute Subset Selection

- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Fewer attributes appear in the discovered patterns, which makes the patterns easier to understand
- Heuristic methods (due to the exponential number of choices):
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction

Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: induced decision tree testing A4?, then A1? and A6?, with Class 1 / Class 2 leaves.]

=> Reduced attribute set: {A1, A4, A6}

Heuristic Feature Selection Methods

- There are 2^d possible sub-features of d features
- Several heuristic feature selection methods:
  - Best single features under the feature independence assumption: choose by significance tests
  - Best step-wise feature selection:
    - The best single feature is picked first
    - Then the next best feature conditioned on the first, ...
  - Step-wise feature elimination:
    - Repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound:
    - Use feature elimination and backtracking
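Step-wise forward selection can be sketched as a greedy loop; the scoring function below (how many differently-labelled row pairs the chosen features can tell apart) and the toy rows are illustrative assumptions, not part of the slide:

```python
def forward_select(features, score, k):
    """Greedy step-wise forward selection: repeatedly add the feature
    that improves `score` the most, stop when nothing improves."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break                   # no single feature helps any more
        selected.append(best)
        remaining.remove(best)
    return selected

rows = [({"a": 0, "b": 0, "c": 1}, "x"),
        ({"a": 0, "b": 1, "c": 1}, "y"),
        ({"a": 1, "b": 0, "c": 1}, "x")]

def separates(subset):
    # count differently-labelled row pairs that the subset tells apart
    n = 0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            (fi, li), (fj, lj) = rows[i], rows[j]
            if li != lj and any(fi[f] != fj[f] for f in subset):
                n += 1
    return n

chosen = forward_select(["a", "b", "c"], separates, 2)   # ["b"]
```

Swapping the loop to start from all features and drop the worst gives step-wise backward elimination.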

Data Compression

- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences are not audio
  - Typically short and vary slowly with time

Data Compression

[Figure: original data mapped to compressed data; lossless compression restores the original exactly, lossy compression yields an approximation.]

Dimensionality Reduction: Wavelet Transformation

- Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis (e.g., Haar-2, Daubechies-4)
- Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
- Method:
  - The length, L, must be an integer power of 2 (padding with 0s when necessary)
  - Each transform has 2 functions: smoothing, difference
  - Applied to pairs of data, resulting in two sets of data of length L/2
  - The two functions are applied recursively, until the desired length is reached
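The smoothing/difference recursion can be sketched with an (unnormalized) Haar transform; the pairwise-average convention below is an assumption, since the slide does not fix one:

```python
def haar(data):
    """Haar wavelet transform: pairwise averages (smoothing) and halved
    differences, recursing on the averages; len(data) must be a power of 2."""
    out = list(data)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + dif      # averages first, then detail coefficients
        n = half
    return out

coeffs = haar([2, 2, 0, 2])   # [1.5, 0.5, 0.0, -1.0]
```

Keeping only the largest-magnitude entries of `coeffs` (and zeroing the rest) gives the compressed approximation the slide describes.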

Dimensionality Reduction: Principal Component Analysis (PCA)

- Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
- Steps:
  - Normalize the input data: each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., principal components
  - Each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing "significance" or strength
  - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
- Works for numeric data only
- Used when the number of dimensions is large

Principal Component Analysis

[Figure: data points in the (X1, X2) plane with the principal component directions Y1 and Y2.]

Numerosity Reduction

- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Example: log-linear models obtain a value at a point in m-D space as the product on appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling

Data Reduction Method (1): Regression and Log-Linear Models

- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models

- Linear regression: Y = w X + b
  - Two regression coefficients, w and b, specify the line and are estimated from the data at hand
  - Using the least-squares criterion on the known values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = αab βac χad δbcd

Data Reduction Method (2): Histograms

- Divide the data into buckets and store the average (sum) for each bucket
- Partitioning rules:
  - Equal-width: equal bucket range
  - Equal-frequency (or equal-depth)
  - V-optimal: with the least histogram variance (weighted sum of the original values that each bucket represents)
  - MaxDiff: set bucket boundaries between the pairs with the β-1 largest differences

[Figure: equal-width histogram of values between 10,000 and 90,000.]

Data Reduction Method (3): Clustering

- Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in depth in Chapter 7

Data Reduction Method (4): Sampling

- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
- Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
- Note: sampling may not reduce database I/Os (page at a time)
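The three sampling schemes can be sketched with the standard library; the two-class row table and the fixed seed are illustrative assumptions:

```python
import random

def srs_without_replacement(data, s, rng):
    # simple random sample of s distinct elements (SRSWOR)
    return rng.sample(data, s)

def srs_with_replacement(data, s, rng):
    # each draw is independent, so duplicates are possible (SRSWR)
    return [rng.choice(data) for _ in range(s)]

def stratified_sample(rows, label, fraction, rng):
    """Sample the same fraction from every class, preserving proportions."""
    strata = {}
    for r in rows:
        strata.setdefault(r[label], []).append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)                 # seeded for reproducibility
rows = [{"cls": "a"}] * 90 + [{"cls": "b"}] * 10
s = stratified_sample(rows, "cls", 0.1, rng)   # 9 from "a", 1 from "b"
```

On this skewed table a plain 10-element SRSWOR could easily miss class "b" entirely; the stratified sample keeps one representative by construction.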

Sampling: with or without Replacement

[Figure: samples drawn from the raw data with and without replacement.]

Sampling: Cluster or Stratified Sampling

[Figure: raw data and the corresponding cluster/stratified sample.]

Chapter 2: Data Preprocessing

- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Discretization

Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Continuous: real numbers, e.g., integer or real values

Discretization:
- Divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduce data size by discretization
- Prepare for further analysis

Discretization and Concept Hierarchy

Discretization:
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute

Concept hierarchy formation:
- Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data

Typical methods (all can be applied recursively):
- Binning (covered above): top-down split, unsupervised
- Histogram analysis (covered above): top-down split, unsupervised
- Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
- Entropy-based discretization: supervised, top-down split
- Interval merging by χ² analysis: supervised, bottom-up merge
- Segmentation by natural partitioning: top-down split, unsupervised

Page 109: Data Mining MTAT03 183MTAT.03 - Arvutiteaduse instituut · Estimatedbyinterpolation(forEstimated by interpolation (for groupeddatagrouped data): Mode V l th t t f tl i th d t c f

Entropy-Based Discretizationopy a d a o

Given a set of samples S, if S is partitioned into two intervals S1 and S2

using boundary T, the entropy after partitioning is

I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

Entropy(S1) = − Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the probability of class i in S1

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy
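A minimal sketch of one split step, following the formulas above: every midpoint between adjacent sorted values is tried as a candidate boundary T, and the one minimizing I(S, T) is returned. The function names and toy data are illustrative assumptions.

```python
import math

# One step of entropy-based (supervised, top-down) discretization:
# choose the boundary T that minimizes the weighted class entropy I(S, T).
def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_i = None, float("inf")
    for k in range(1, n):
        t = (pairs[k - 1][0] + pairs[k][0]) / 2   # candidate boundary T
        left = [c for v, c in pairs[:k]]
        right = [c for v, c in pairs[k:]]
        i_st = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t

# Two well-separated classes: the chosen boundary falls between them.
vals = [1, 2, 3, 10, 11, 12]
labs = ["a", "a", "a", "b", "b", "b"]
print(best_split(vals, labs))  # → 6.5
```

Applying `best_split` recursively to each resulting partition, until a stopping criterion such as a minimum information gain is reached, yields the full discretization.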

Page 110:

Interval Merge by χ² Analysis

Merging-based (bottom-up) vs. splitting-based methods

Merge: Find the best neighboring intervals and merge them to form larger intervals recursively

ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]

Initially, each distinct value of a numerical attr. A is considered to be one interval

χ² tests are performed for every pair of adjacent intervals

Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max-inconsistency, etc.)
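The χ² statistic for a pair of adjacent intervals can be sketched as a standard 2×m contingency computation. The helper name and the example class counts are illustrative assumptions, not from ChiMerge's original code.

```python
# χ² statistic for two adjacent intervals, as used by ChiMerge:
# a low value means similar class distributions, so the pair is a merge candidate.
def chi2_adjacent(counts_a, counts_b):
    """counts_a / counts_b: per-class sample counts in two adjacent intervals."""
    total = sum(counts_a) + sum(counts_b)
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_sum = sum(row)
        for j, observed in enumerate(row):
            col_sum = counts_a[j] + counts_b[j]
            expected = row_sum * col_sum / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Identical class distributions → χ² = 0 (merge these first).
print(chi2_adjacent([5, 5], [5, 5]))    # → 0.0
# Very different distributions → large χ² (keep the intervals separate).
print(chi2_adjacent([10, 0], [0, 10]))  # → 20.0
```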

Page 111:

Segmentation by Natural Partitioning

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.

If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals

If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals

If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals


Page 112:

Example of 3-4-5 Rule

Step 1 (profit data): Min = -$351,976, Low (5%-tile) = -$159,876, High (95%-tile) = $1,838,761, Max = $4,700,896
Step 2: msd = $1,000,000, so Low rounds down to -$1,000,000 and High rounds up to $2,000,000
Step 3: (-$1,000,000 .. $2,000,000] covers 3 distinct values at the msd, giving 3 equi-width intervals: (-$1,000,000 .. 0], (0 .. $1,000,000], ($1,000,000 .. $2,000,000]
Step 4: adjusting with the real Min and Max gives 4 intervals: (-$400,000 .. 0], (0 .. $1,000,000], ($1,000,000 .. $2,000,000], ($2,000,000 .. $5,000,000]
Step 5: each interval is partitioned recursively, e.g. (0 .. $1,000,000] into (0 .. $200,000], ($200,000 .. $400,000], ($400,000 .. $600,000], ($600,000 .. $800,000], ($800,000 .. $1,000,000]

Page 113:

Example: -351,976.00 .. 4,700,896.50

MIN = -351,976.00
MAX = 4,700,896.50
LOW = 5th percentile = -159,876
HIGH = 95th percentile = 1,838,761
msd = 1,000,000 (most significant digit)
LOW = -1,000,000 (round down)
HIGH = 2,000,000 (round up)

3 value ranges:
1. (-1,000,000 .. 0]
2. (0 .. 1,000,000]
3. (1,000,000 .. 2,000,000]

Adjust with real MIN and MAX:
1. (-400,000 .. 0]
2. (0 .. 1,000,000]
3. (1,000,000 .. 2,000,000]
4. (2,000,000 .. 5,000,000]


Page 114:

Recursive …

1.1. (-400,000 .. -300,000]
1.2. (-300,000 .. -200,000]
1.3. (-200,000 .. -100,000]
1.4. (-100,000 .. 0]

2.1. (0 .. 200,000]
2.2. (200,000 .. 400,000]
2.3. (400,000 .. 600,000]
2.4. (600,000 .. 800,000]
2.5. (800,000 .. 1,000,000]

3.1. (1,000,000 .. 1,200,000]
3.2. (1,200,000 .. 1,400,000]
3.3. (1,400,000 .. 1,600,000]
3.4. (1,600,000 .. 1,800,000]
3.5. (1,800,000 .. 2,000,000]

4.1. (2,000,000 .. 3,000,000]
4.2. (3,000,000 .. 4,000,000]
4.3. (4,000,000 .. 5,000,000]
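The top-level step shown in this worked example can be sketched in a few lines. This is a simplified illustration of the 3-4-5 rule, covering only the msd rounding and the distinct-digit-count table, not the MIN/MAX adjustment or the recursion; the function name is an assumption.

```python
# Top-level step of the 3-4-5 rule on the slides' profit example.
# Simplified sketch: msd rounding plus the digit-count → interval-count table.
def three_four_five(low, high, msd):
    lo = (low // msd) * msd           # round LOW down to the msd
    hi = -(-high // msd) * msd        # round HIGH up to the msd
    distinct = (hi - lo) // msd       # distinct values at the msd
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    else:                             # 1, 5, or 10 distinct values
        k = 5
    width = (hi - lo) // k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

# LOW = -159,876 and HIGH = 1,838,761 with msd = 1,000,000
print(three_four_five(-159_876, 1_838_761, 1_000_000))
# → [(-1000000, 0), (0, 1000000), (1000000, 2000000)]
```

This reproduces Step 3 of the example; Steps 4-5 would adjust the outer intervals to the real MIN/MAX and recurse into each interval.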

Page 115:

Concept Hierarchy Generation for Categorical Data

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country

Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois

Specification of only a partial set of attributes
E.g., only street < city, not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values


E.g., for a set of attributes: {street, city, state, country}

Page 116:

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set

The attribute with the most distinct values is placed at the lowest level of the hierarchy

Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_state 365 distinct values

city 3567 distinct values


street 674,339 distinct values
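Sorting the attributes by their distinct-value counts reproduces the generated hierarchy. A minimal sketch, using the counts from the slide:

```python
# Order attributes into a hierarchy by cardinality: fewest distinct
# values at the top of the hierarchy, most at the bottom.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674_339,
}
hierarchy = sorted(distinct_counts, key=distinct_counts.get)  # top → bottom
print(" < ".join(reversed(hierarchy)))
# → street < city < province_or_state < country
```

The weekday/month/quarter/year exception shows why the heuristic is not universal: weekday has only 7 distinct values but belongs below month in a time hierarchy.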

Page 117:

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Page 118:

Summary

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Descriptive data summarization is needed for quality data preprocessing

Data preparation includes

Data cleaning and data integration

Data reduction and feature selection

Discretization

A lot of methods have been developed, but data preprocessing is still an active area of research

Page 119:


