7/24/2019 LECTURE01_DataPreprocessing
Topics
K-Nearest Neighbor Editing
Density based Clustering, Subspace Clustering/Bi-Clustering
Skyline Pattern Mining (top-K Pattern Mining)
Hidden Markov Models, Moving Objects Mining
Workflow Discovery (Process Mining)
Items Recommendation, Mining Reviews/Sentiments Data
Self Organized Maps
Map Reduce
Data Preprocessing
Data Quality
Data quality is a major concern for Data Mining tasks.
Why: Almost all Data Mining algorithms induce knowledge strictly from data.
The quality of the knowledge extracted depends heavily on the quality of the data.
There are two main problems in data quality:
• Missing data: the data is not present
• Noisy data: the data is present but not correct
Missing/Noisy data sources:
• Hardware failure
• Data transmission error
• Data entry problem
• Refusal of respondents to answer certain questions
Effect of Noisy Data on Results Accuracy
Data Mining: discover only those rules whose support (frequency) meets the minimum threshold.
Example rule: If age [comparison illegible] 40 and income = medium then buys_computer = no
Due to the missing value in the training dataset, the accuracy of prediction is reduced to [value illegible].
[Figure: training data and testing (actual) data]
Imputation of Missing Values (Basic)
Imputation is a term that denotes a procedure that replaces the missing values in a dataset using plausible values
• i.e. by considering the relationships among correlated values across the attributes of the dataset
If we consider only {attribute#2}, then the value cool appears in 4 records.
Probability of imputing the value (20) = 75%
Probability of imputing the value (30) = 25%
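The counting behind these probabilities can be sketched in Python. This is a minimal illustration, not the lecture's own code: the function name `imputation_probabilities`, the key names, and the toy records are all hypothetical, chosen so that three of the four known "cool" records carry the value 20.

```python
from collections import Counter


def imputation_probabilities(records, given, target):
    """Relative frequency of each known `target` value among the
    records whose attributes match every key/value pair in `given`."""
    matches = [
        r[target]
        for r in records
        if r.get(target) is not None
        and all(r.get(key) == value for key, value in given.items())
    ]
    counts = Counter(matches)
    total = sum(counts.values())
    if total == 0:
        return {}  # no correlated records: nothing to base an imputation on
    return {value: n / total for value, n in counts.items()}


# Hypothetical toy records (not the table from the slide).
records = [
    {"attribute#2": "cool", "attribute#3": "high", "value": 20},
    {"attribute#2": "cool", "attribute#3": "high", "value": 20},
    {"attribute#2": "cool", "attribute#3": "low", "value": 20},
    {"attribute#2": "cool", "attribute#3": "low", "value": 30},
    {"attribute#2": "hot", "attribute#3": "high", "value": 10},
    {"attribute#2": "cool", "attribute#3": "high", "value": None},  # to impute
]

probs = imputation_probabilities(records, {"attribute#2": "cool"}, "value")
```

Conditioning on more attributes narrows the set of matching records, so the estimate becomes more specific but rests on fewer observations.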
Imputation of Missing Values (Basic)
For {attribute#4} the value true appears in 6 records
Probability of imputing the value (20) = 50%
Probability of imputing the value (10) = 50%
For {attribute#2, attribute#3} the value {cool, high} appears in only 2 records
Probability of imputing the value (20) =
Methods for Imputing Missing Values
Replace missing values using a prediction/classification model:
• Advantage: it considers the relationship among the known attribute values and the missing values, so the imputation accuracy is higher than with statistical techniques
• Disadvantage: if there exists no correlation between instances having missing values and instances having no missing values, then the imputation can't be performed
• (Alternative approach): use a hybrid combination of a Prediction/Classification model and Mean/Mode. First try to impute the missing value using the prediction/classification model, and then fall back to Median/Mode
Methods for Imputing Missing Values
K-Nearest Neighbor (k-NN)
• k-NN imputes the missing attribute values on the basis of the nearest K neighbors. Neighbors are determined on the basis of a distance measure
• Once the K neighbors are determined, the missing values are imputed by taking the mean/median or Mode of the known values
[Figure: the record with the missing value and the other dataset records]
K-Nearest Neighbor (Pseudo-code)
Missing values imputation using k-NN
Input: Dataset (D), size of K
for each record (x) having a missing value
• for each data object (y) in D
    Compute the distance between (x, y)
    Save the distance in the Similarity (S) array
• Sort the array S in ascending order of distance
• Pick the top K data objects from S
• Impute the missing attribute value(s) of x on the basis of the known values in S (use Mean/Median or Mode)
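The pseudo-code above can be sketched in Python as follows. This is one possible reading under stated assumptions: all attributes are numeric, distance is Euclidean over the attributes that x actually has, the nearest neighbors are the ones with the smallest distance, and the mean is used for the final step. The name `knn_impute` and the toy records are hypothetical.

```python
import math
from statistics import mean


def knn_impute(dataset, record, missing_key, k=3):
    """Return a copy of `record` with `missing_key` filled in from the
    mean of that attribute over the k nearest records in `dataset`."""
    # Compare only on attributes that are known in the incomplete record.
    features = [
        key for key, value in record.items()
        if key != missing_key and value is not None
    ]

    def distance(a, b):
        # Euclidean distance (an assumed choice of distance measure).
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))

    # Only records that know the missing attribute can act as donors.
    candidates = [
        y for y in dataset
        if y is not record and y.get(missing_key) is not None
    ]
    if not candidates:
        raise ValueError("no record with a known value to impute from")

    # Smallest distance first, then keep the first k.
    nearest = sorted(candidates, key=lambda y: distance(record, y))[:k]

    imputed = dict(record)
    imputed[missing_key] = mean(y[missing_key] for y in nearest)
    return imputed


# Hypothetical toy data.
data = [
    {"age": 25, "income": 30, "spend": 10},
    {"age": 27, "income": 32, "spend": 12},
    {"age": 26, "income": 31, "spend": 14},
    {"age": 60, "income": 90, "spend": 50},
]
x = {"age": 26, "income": 30, "spend": None}
filled = knn_impute(data, x, "spend", k=3)
```

With k=3 the distant fourth record is ignored, so the imputed value is the mean of 10, 12 and 14. For categorical attributes the mean would be replaced by the mode.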
Noisy Data
Noise: random error, data present but not correct
• Data transmission error
• Data entry problem
Removing noise
• Data smoothing (rounding, averaging within a window).
• Clustering/merging and detecting outliers.
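Averaging within a window can be sketched as a centered moving average. This is a minimal illustration: the function name `moving_average`, the odd-window restriction, and the choice to shrink the window at the edges are assumptions made for the example.

```python
def moving_average(values, window=3):
    """Smooth `values` with a centered moving average.
    Near the edges the window shrinks to the values that exist."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical readings with one noisy spike (30).
readings = [1, 2, 30, 4, 5]
smoothed = moving_average(readings, window=3)
```

A mean spreads a spike over its neighbors rather than removing it. Taking the median of each window instead is a common variant when the noise consists of isolated outliers.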
Effect of Continuous Data on Results Accuracy
[Two tables with columns age and buys_computer; the cell values are illegible]
Data Mining
If age = 2- then buys_computer = no
If age = 30 then buys_computer = no
If age = 41 then buys_computer = no
If age = 52 then buys_computer = no
What would be the accuracy of the above rules?
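Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is the size-weighted sum of the entropies of S1 and S2,
where p_i is the probability of class i in the interval whose entropy is being computed.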
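For reference, the usual textbook form of this quantity is sketched below. The symbols E(S, T) and Ent are assumed notation, and p_i is read as the probability of class i within the interval S_j whose entropy is being computed.

```latex
E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2),
\qquad
\mathrm{Ent}(S_j) = -\sum_{i} p_i \log_2 p_i
```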
Entropy-Based Discretization
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
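The boundary search can be sketched in Python as below. This is a minimal illustration under stated assumptions: candidate boundaries are the midpoints between consecutive distinct values, the entropy after partitioning is the size-weighted sum of the two interval entropies, and the names `entropy`, `best_boundary` and the toy ages are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def best_boundary(samples):
    """samples: list of (value, class_label) pairs.
    Returns (boundary, entropy_after_partitioning) for the best binary
    split, or None when there are fewer than two distinct values."""
    values = sorted({value for value, _ in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        boundary = (lo + hi) / 2  # midpoint candidate (an assumption)
        left = [label for value, label in samples if value <= boundary]
        right = [label for value, label in samples if value > boundary]
        weighted = (
            len(left) * entropy(left) + len(right) * entropy(right)
        ) / len(samples)
        if best is None or weighted < best[1]:
            best = (boundary, weighted)
    return best


# Hypothetical (age, buys_computer) samples.
samples = [(21, "no"), (25, "no"), (30, "no"), (41, "yes"), (52, "yes")]
boundary, split_entropy = best_boundary(samples)
```

Here the split between 30 and 41 separates the two classes perfectly, so its entropy after partitioning is zero and it is selected.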
The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,
Example (cont.)
The numbers of elements in S1 and S2 are: |S