LECTURE01_DataPreprocessing

Date post: 23-Feb-2018
Upload: bilo044

Transcript
  • 7/24/2019 LECTURE01_DataPreprocessing

    Topics

    K-Nearest Neighbor Editing

    Density-based Clustering, Subspace Clustering/Bi-Clustering

    Skyline Pattern Mining (top-K Pattern Mining)

    Hidden Markov Models, Moving Objects Mining

    Workflow Discovery (Process Mining)

    Items Recommendation, Mining Reviews/Sentiments Data

    Self-Organized Maps

    Map Reduce

    Data Preprocessing

    Data Quality

    Data quality is a major concern for data mining tasks.

    Why? Almost all data mining algorithms induce knowledge strictly from data, so the quality of the extracted knowledge depends heavily on the quality of the data.

    There are two main problems in data quality:

    - Missing data: the data is not present

    - Noisy data: the data is present but not correct

    Sources of missing/noisy data:

    - Hardware failure

    - Data transmission error

    - Data entry problem

    - Refusal of respondents to answer certain questions

    Effect of Noisy Data on Results Accuracy

    [Figure: training data -> data mining -> discovered rules, evaluated on testing (actual) data]

    Discover only those rules whose support (frequency) is >= 2.

    Discovered rule: If age [?] 40 and income = medium then buys_computer = no

    Due to the missing value in the training dataset, the accuracy of prediction is reduced to [value garbled in source].

    Imputation of Missing Values (Basic)

    Imputation is a term that denotes a procedure that replaces the missing values in a dataset using plausible values,

    - i.e. by considering the relationships among correlated values across the attributes of the dataset.

    If we consider only {attribute#2}, then the value cool appears in 4 records.

    Probability of imputing value (20) = 75%

    Probability of imputing value (30) = 25%
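
The frequency reasoning above can be sketched in a few lines of Python. This is only an illustration: the slide's data table is not preserved, so the toy records, the target name `attribute1`, and the helper `imputation_probabilities` are all assumptions chosen to reproduce the 75% / 25% split.

```python
from collections import Counter

def imputation_probabilities(records, target, known):
    """Estimate P(target value | known attribute values) from complete records.

    records: list of dicts; None marks a missing entry.
    target:  name of the attribute to impute.
    known:   dict of attribute -> value observed in the incomplete record.
    """
    matches = [
        r[target]
        for r in records
        if r.get(target) is not None
        and all(r.get(a) == v for a, v in known.items())
    ]
    counts = Counter(matches)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

# Hypothetical toy data: four complete records share attribute2 = "cool".
records = [
    {"attribute1": 20, "attribute2": "cool"},
    {"attribute1": 20, "attribute2": "cool"},
    {"attribute1": 20, "attribute2": "cool"},
    {"attribute1": 30, "attribute2": "cool"},
    {"attribute1": 10, "attribute2": "hot"},
    {"attribute1": None, "attribute2": "cool"},  # record to impute
]

probs = imputation_probabilities(records, "attribute1", {"attribute2": "cool"})
print(probs)  # {20: 0.75, 30: 0.25}
```

Passing more than one entry in `known` conditions on several attributes at once, which usually leaves fewer matching records to estimate from.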

    Imputation of Missing Values (Basic)

    For {attribute#4}, the value true appears in 4 records.

    Probability of imputing value (20) = 50%

    Probability of imputing value (10) = 50%

    For {attribute#2, attribute#3}, the value {cool, high} appears in only 2 records.

    Probability of imputing value (20) =

    Methods for Imputing Missing Values

    Replace missing values using a prediction/classification model:

    - Advantage: it considers the relationship among the known attribute values and the missing values, so the imputation accuracy is higher than with statistical techniques.

    - Disadvantage: if there is no correlation between the instances that have missing values and the instances that do not, the imputation cannot be performed.

    - (Alternative approach): use a hybrid combination of the prediction/classification model and mean/mode. First try to impute the missing value using the prediction/classification model, and then fall back to median/mode.
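
One way to picture the hybrid approach is the sketch below. It is not the lecture's method. As a stand-in for the prediction model it uses a one-variable least-squares line, and it falls back to the mean when the Pearson correlation is weak. The function names and the `min_corr` threshold are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either variable is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def hybrid_impute(pairs, x_new, min_corr=0.5):
    """Impute y for x_new from complete (x, y) pairs.

    Uses a least-squares line when |correlation| >= min_corr (assumed
    threshold), otherwise falls back to the mean of the known y values.
    Returns (imputed value, name of the method used).
    """
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    if abs(pearson(xs, ys)) >= min_corr:
        mx, my = mean(xs), mean(ys)
        slope = (sum((x - mx) * (y - my) for x, y in pairs)
                 / sum((x - mx) ** 2 for x in xs))
        return my + slope * (x_new - mx), "model"
    return mean(ys), "mean"

# Hypothetical data: strongly correlated, then uncorrelated.
correlated = [(1, 2), (2, 4), (3, 6), (4, 8)]
uncorrelated = [(1, 5), (2, 1), (3, 1), (4, 5)]
print(hybrid_impute(correlated, 5))
print(hybrid_impute(uncorrelated, 5))
```

For a categorical attribute, the same structure would apply with a classifier in place of the regression line and the mode in place of the mean.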

    Methods for Imputing Missing Values

    K-Nearest Neighbor (k-NN)

    - k-NN imputes the missing attribute values on the basis of the K nearest neighbors. Neighbors are determined on the basis of a distance measure.

    - Once the K neighbors are determined, the missing values are imputed by taking the mean, median, or mode of the neighbors' known values.

    [Figure: the record with the missing value and the other dataset records]

    K-Nearest Neighbor (Pseudo-code)

    Missing values imputation using k-NN

    Input: Dataset (D), size of K

    for each record (x) having a missing value

    - for each data object (y) in D: compute the distance between (x, y) and save it in the similarity array (S)

    - Sort the array S in ascending order of distance

    - Pick the top K data objects from S

    - Impute the missing attribute value(s) of x on the basis of the known values of those K objects (use mean/median or mode)
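
The pseudo-code above might translate to Python roughly as follows. This is a minimal sketch under stated assumptions: numeric attributes only, Euclidean distance over the attributes known in both records, and the mean as the aggregation. The function name `knn_impute` and the sample data are hypothetical.

```python
from math import sqrt
from statistics import mean

def knn_impute(dataset, k):
    """Return a copy of dataset with missing entries (None) filled in.

    dataset: list of equal-length lists of numbers; None marks a missing value.
    Each missing entry becomes the mean of that attribute over the k nearest
    records in which the attribute is known.
    """
    def distance(x, y):
        # Euclidean distance over the attributes known in both records.
        shared = [(a, b) for a, b in zip(x, y)
                  if a is not None and b is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((a - b) ** 2 for a, b in shared))

    imputed = [list(record) for record in dataset]
    for i, x in enumerate(dataset):
        for j, value in enumerate(x):
            if value is not None:
                continue
            # Distances to every other record that knows attribute j.
            candidates = [(distance(x, y), y[j])
                          for idx, y in enumerate(dataset)
                          if idx != i and y[j] is not None]
            candidates.sort(key=lambda pair: pair[0])  # nearest first
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = mean(nearest)
    return imputed

# Hypothetical data: the last record is missing its third attribute.
data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 50.0],
    [1.05, 2.05, None],
]
result = knn_impute(data, k=2)
print(result[3])  # [1.05, 2.05, 11.0]
```

Sorting in ascending order matters here: the array holds distances, so the smallest values identify the nearest neighbors.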

    Noisy Data

    Noise: random error; the data is present but not correct

    - Data transmission error

    - Data entry problem

    Removing noise:

    - Data smoothing (rounding, averaging within a window).

    - Clustering/merging and detecting outliers.
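
As an illustration of averaging within a window, here is a small centered moving-average sketch. The function name `smooth`, the default window of 3, and the sample readings are assumptions; the window is expected to be odd so that it is centered on each point.

```python
def smooth(values, window=3):
    """Centered moving average; the window shrinks at the edges of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical sensor readings with one noisy spike (50).
readings = [10, 12, 11, 50, 12, 11, 10]
print(smooth(readings))
```

Averaging dampens the spike but also spreads it over the neighboring positions, which is one reason outlier detection is listed as a separate technique.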

    Effect of Continuous Data on Results Accuracy

    [Two tables with columns age and buys_computer; most age values are illegible in the source. In the second table every record is labeled buys_computer = no.]

    Data Mining

    If age = 2- then buys_computer = no

    If age = 30 then buys_computer = no

    If age = 41 then buys_computer = no

    If age = 52 then buys_computer = no

    What would be the accuracy of the above rules?

    Entropy-Based Discretization

    Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    where pi is the probability of class i in S
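
The expression referred to here is commonly written as follows. The names E and Ent are assumptions, since the transcript preserves only the surrounding text; in the second formula, p_i is the probability of class i within the interval S_j.

```latex
E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2),
\qquad
\mathrm{Ent}(S_j) = -\sum_{i} p_i \log_2 p_i
```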

    Entropy-Based Discretization

    The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

    The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,
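
A single binary split of this kind could be sketched as below. The assumptions are labeled: candidate boundaries are taken as midpoints between consecutive distinct values, the entropy after partitioning is the size-weighted sum of the two interval entropies, and the names `entropy` and `best_boundary` and the sample data are hypothetical. Recursion and the stopping criterion are left out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(labels).values())

def best_boundary(samples):
    """Find the boundary T with the lowest entropy after partitioning.

    samples: list of (value, class_label) pairs.
    Returns (T, weighted entropy), or None if there are fewer than two
    distinct values.
    """
    values = sorted({v for v, _ in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # assumed candidate: midpoint between neighbors
        s1 = [c for v, c in samples if v <= t]
        s2 = [c for v, c in samples if v > t]
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(samples)
        if best is None or e < best[1]:
            best = (t, e)
    return best

# Hypothetical (age, buys_computer) samples.
samples = [(25, "no"), (30, "no"), (35, "no"),
           (41, "yes"), (45, "yes"), (52, "yes")]
print(best_boundary(samples))  # (38.0, 0.0)
```

Applying `best_boundary` again to the samples on each side of T would give the recursive version described above.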

    Example (cont.)

    The number of elements in S1 and S2 are: |S