Data Mining Memahami Data

Date post: 07-Jul-2018
Upload: asa
Transcript
  • 8/19/2019 Data Mining Memahami Data

    Data Mining: Getting to Know and Understanding Data

  • Getting to Know and Understanding Data

    Data objects and attribute types
    Descriptive statistics of data
    Data visualization
    Measuring data similarity and dissimilarity

  • Types of Data Sets

    Record
        Relational records
        Data matrix, e.g., numerical matrix
        Document data: text documents, term-frequency vector
        Transaction data
    Graph and network
        World Wide Web
        Social or information networks
    Ordered
        Video data: sequence of images
        Temporal data: time-series
    Spatial, image and multimedia
        Spatial data: maps
        Image data
        Video data

    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk

  • Data Objects

    A data object represents an entity. Examples:
        Sales database: customers, items sold, sales
        Medical database: patients, treatments
        University database: students, professors, courses
    Data objects are described by attributes.
    Database rows -> data objects; columns -> attributes.

  • Attributes

    Attribute (also dimension, feature, variable): a characteristic or feature of a data object
        E.g., customer_ID, name, address
    Types:
        Nominal
        Ordinal
        Binary
        Numeric:
            Interval-scaled
            Ratio-scaled

  • Attribute Types

    Nominal: categories, states, or "names of things"
        Hair color, status, postal code, NRP, etc.
    Binary: a nominal attribute with only 2 states (0 and 1)
        Symmetric binary: both states are equally important
            E.g., gender
        Asymmetric binary: the two states are not equally important
            E.g., a medical test (positive or negative)
            By convention, 1 denotes the more important state (e.g., HIV positive)
    Ordinal
        Values have a meaningful order (ranking), but the difference between successive values is not expressed as a magnitude
        Size = {small, medium, large}, grades, ranks

  • Numeric Attributes

    Quantity (integer or real-valued)
    Interval
        Measured on a scale of equal-sized units
        Values have an order, e.g., calendar dates
        No true zero-point
    Ratio
        Inherent zero-point
        E.g., length, weight, etc.
        A value can be stated as a multiple of another value, e.g., road A is 2 times as long as road B

  • Discrete vs. Continuous Attributes

    Discrete attribute
        Has a finite set of values, or a countable one even if it is infinite
        E.g., postal codes, the words in a collection of documents
        Sometimes represented as integer variables
        Note: binary attributes are a special case of discrete attributes
    Continuous attribute
        Has real numbers as values, e.g., temperature, height, weight
        Continuous attributes are typically represented as floating-point variables

  • Basic Statistical Descriptions of Data

    Purpose: to understand the data: central tendency, variation and spread
    Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.

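  • Measuring Central Tendency

    Mean (algebraic measure) (sample vs. population):
        $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\mu = \frac{\sum x}{N}$
        Note: n is the sample size and N is the population size.
        Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
        Trimmed mean: the mean computed after chopping off extreme values
    Median:
        Estimated by interpolation (for grouped data):
        $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$
        where $L_1$ is the lower boundary of the median interval, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below it, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the interval width
        [Figure label: median interval]
    Mode
        Value that occurs most frequently in the data
        Unimodal, bimodal, trimodal
        Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$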
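The measures on this slide can be sketched in a few lines of Python. This is only an illustration: the sample `data`, the function names and the 10% trimming fraction are assumptions made up for the example, not values from the slides.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut per side
    return mean(ordered[k:len(ordered) - k])


def modes(values):
    """All most-frequent values (two entries means the data is bimodal)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical sample, invented for illustration only.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(data))                # arithmetic mean
print(median(data))              # average of the two middle values
print(modes(data))               # most frequent value(s)
print(trimmed_mean(data, 0.1))   # drops the smallest and largest value
```

The trimmed mean is less sensitive to the extreme value 110 than the plain mean, which is the reason for chopping the tails.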

  • Measuring the Dispersion of Data

    Quartiles, outliers and boxplots
        Quartiles: Q1 (25th percentile), Q3 (75th percentile)
        Inter-quartile range: IQR = Q3 − Q1
        Five number summary: min, Q1, median, Q3, max
        Outlier: usually a value more than 1.5 × IQR above Q3 or below Q1
    Variance and standard deviation (sample: s, population: σ)
        Variance (algebraic, scalable computation):
        $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
        $\sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2 - \mu^2$
        Standard deviation s (or σ) is the square root of the variance $s^2$ (or $\sigma^2$)
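A minimal Python sketch of the five-number summary, the 1.5 × IQR outlier rule and the one-pass ("scalable") form of the sample variance follows. The sample values are invented. Quartile conventions differ between tools, and the `"inclusive"` method of `statistics.quantiles` used here is one assumed choice, not something the slides prescribe.

```python
from statistics import quantiles, variance


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def sample_variance_one_pass(values):
    """s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1), from two running sums."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (total_sq - total * total / n) / (n - 1)


# Hypothetical sample, invented for illustration only.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(data))
print(iqr_outliers(data))
print(sample_variance_one_pass(data), variance(data))
```

The one-pass form only needs the running sums of x and x², which is why it scales to large data, although it can lose precision when the mean is large relative to the spread.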

  • Boxplot Analysis

    Five-number summary of a distribution: Minimum, Q1, Median, Q3, Maximum
    Boxplot
        Data is represented with a box
        The ends of the box are at the first and third quartiles, so the height of the box is the IQR
        The median is marked by a line within the box
        Outliers: plotted individually outside the box

  • Properties of the Normal Distribution Curve

    The normal curve (μ: mean, σ: standard deviation)
        From μ−σ to μ+σ: contains about 68% of the measurements
        From μ−2σ to μ+2σ: contains about 95% of the measurements
        From μ−3σ to μ+3σ: contains about 99.7% of the measurements
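These three percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2). The helper name `coverage_within` is made up for this sketch.

```python
import math


def coverage_within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * coverage_within(k), 2), "%")
```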

  • Histogram Analysis

    Histogram: a graph that displays a tabulation of data frequencies

  • Histograms Often Tell More than Boxplots

    The two histograms have the same boxplot
        Same values for: min, Q1, median, Q3, max
    But their data distributions are different

  • Scatter plot

    Provides a look at bivariate data to see clusters, outliers, etc.
    Each point shows the coordinate pair of one data item

  • Positively and Negatively Correlated Data

    Top left: positive correlation
    Top right: negative correlation

  • Uncorrelated Data

  • Data Visualization

    Why data visualization?
        Gain insight into an information space by mapping data onto graphical primitives
        Provide a qualitative overview of large data sets
        Search for patterns, trends, structure, irregularities, relationships among data
        Help find interesting regions and suitable parameters for further quantitative analysis

  • Geometric Projection Visualization Techniques

    Visualization of geometric transformations and projections of the data
    Methods
        Scatterplot and scatterplot matrices

  • Scatterplot Matrices

    Matrix of scatterplots (x-y-diagrams) of the k-dim. data [a total of $k(k-1)/2$ distinct scatterplots]
    Used by permission of M. Ward, Worcester Polytechnic Institute

  • Similarity and Dissimilarity

    Similarity
        Numerical measure of how alike two data objects are
        The value is higher when the objects are more alike
        Range [

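  • Data Matrix and Dissimilarity Matrix

    Data matrix
        n data points with p dimensions
        Two modes
        $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
    Dissimilarity matrix
        n data points, but only the distance is recorded
        A triangular matrix
        Single mode
        $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$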

  • Proximity Measure for Nominal Attributes

    Can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
    Method: simple matching
        m: number of matches, p: total number of variables
        $d(i,j) = \frac{p - m}{p}$
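Simple matching is short enough to write out directly. In this sketch the function name and the two example objects are hypothetical; each object is just a list of nominal values, one per attribute.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                      # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # matching attributes
    return (p - m) / p


# Hypothetical objects described by (color, size, shape).
print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))
```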

  • Proximity Measure for Binary Attributes

    A contingency table for binary data (rows: Object i, columns: Object j)
    Distance measure for symmetric binary variables:
    Distance measure for asymmetric binary variables:
    Jaccard coefficient (similarity measure for asymmetric binary variables):
    Note: the Jaccard coefficient is the same as "coherence"

  • Dissimilarity between Binary Variables

    Example

    Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
    Jack  M       Y      N      P       N       N       N
    Mary  F       Y      N      P       N       P       N
    Jim   M       Y      P      N       N       N       N

    Gender is a symmetric attribute
    The remaining attributes are asymmetric binary
    Let the values Y and P be 1, and the value N be 0

    $d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
    $d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
    $d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
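The three results above can be reproduced with a short Python sketch. The formula is not spelled out in the transcript, so the code assumes the usual asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s). Here q counts attributes where both objects are 1, r counts attributes where only i is 1, and s counts attributes where only j is 1. The names `encode` and `asymmetric_binary_dissimilarity` are made up for illustration.

```python
def asymmetric_binary_dissimilarity(obj_i, obj_j):
    """d(i, j) = (r + s) / (q + r + s); 0/0 matches are ignored."""
    pairs = list(zip(obj_i, obj_j))
    q = sum(1 for a, b in pairs if a == 1 and b == 1)
    r = sum(1 for a, b in pairs if a == 1 and b == 0)
    s = sum(1 for a, b in pairs if a == 0 and b == 1)
    return (r + s) / (q + r + s)  # undefined if both objects are all zeros


def encode(row):
    """Map Y and P to 1 and N to 0, as in the example."""
    return [1 if value in ("Y", "P") else 0 for value in row]


# Asymmetric attributes only: Fever, Cough, Test-1 .. Test-4 (Gender is excluded).
jack = encode("YNPNNN")
mary = encode("YNPNPN")
jim = encode("YPNNNN")

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))
```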

  • Standardizing Numeric Data

    Z-score: $z = \frac{x - \mu}{\sigma}$
        X: raw score to be standardized, μ: mean of the population, σ: standard deviation
        The distance between the raw score and the population mean in units of the standard deviation
        Negative when the raw score is below the mean, "+" when above
    An alternative way: calculate the mean absolute deviation
        $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
        where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$
        Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
    Using the mean absolute deviation is more robust than using the standard deviation
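Both standardizations can be sketched as follows. The sample values and function names are invented for the example, and the population standard deviation (`statistics.pstdev`) is assumed for σ.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Classic z-score: (x - mu) / sigma, using the population sigma."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


def mad_z_scores(values):
    """Variant using the mean absolute deviation s_f instead of sigma."""
    m_f = mean(values)
    s_f = mean([abs(x - m_f) for x in values])
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_scores(values))
print(mad_z_scores(values))
```

Because deviations are not squared, a single extreme value inflates the mean absolute deviation less than it inflates the standard deviation, so outliers stay more visible after standardizing.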

  • Example: Data Matrix and Dissimilarity Matrix

    [Figure: Data Matrix]
    [Figure: Dissimilarity Matrix (with Euclidean Distance)]

  • Distance on Numeric Data: Minkowski Distance

    Minkowski distance: a popular distance measure
        $d(i,j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h}$
        where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
    Properties
        d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
        d(i, j) = d(j, i) (Symmetry)
        d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)

  • Special Cases of Minkowski Distance

    h = 1: Manhattan (city block, L1 norm) distance
        E.g., the Hamming distance: the number of bits that are different between two binary vectors
        $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
    h = 2: (L2 norm) Euclidean distance
        $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
    h → ∞: "supremum" (Lmax norm, L∞ norm) distance
        This is the maximum difference between any component (attribute) of the vectors
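The three special cases fall out of one small function. This is a sketch under simple assumptions: objects are plain sequences of numbers, and the supremum case is selected by passing `math.inf` as the order. The two points are invented for illustration.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)              # supremum (L-infinity) distance
    return sum(d ** h for d in diffs) ** (1 / h)


# Hypothetical 2-dimensional objects.
p1, p2 = (1, 2), (3, 5)

print(minkowski(p1, p2, 1))         # Manhattan
print(minkowski(p1, p2, 2))         # Euclidean
print(minkowski(p1, p2, math.inf))  # supremum
```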

  • Example: Minkowski Distance Dissimilarity Matrices

    [Figure: Manhattan (L1)]
    [Figure: Euclidean (L2)]
    [Figure: Supremum]

  • Ordinal Variables

    An ordinal variable can be discrete or continuous
    Order is important, e.g., rank
    Can be treated like interval-scaled
        Replace $x_{if}$ by their rank
        Map the range of each variable onto [

  • Attributes of Mixed Type

    A database may contain all attribute types
        Nominal, symmetric binary, asymmetric binary, numeric, ordinal
    One may use a weighted formula to combine their effects
        $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
        f is binary or nominal:
            $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
        f is numeric: use the normalized distance
        f is ordinal
            Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
            Treat $z_{if}$ as interval-scaled
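The ordinal step can be illustrated on its own. The sketch assumes the caller supplies the levels in increasing order and that there are at least two of them (M_f > 1). The function name and the size levels are made up for the example.

```python
def ordinal_to_interval(values, ordered_levels):
    """Map each ordinal value to z = (r - 1) / (M - 1), where r is its rank."""
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    m = len(ordered_levels)  # M_f: number of ordered states, must be > 1
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal attribute.
levels = ["small", "medium", "large"]
print(ordinal_to_interval(["small", "large", "medium"], levels))
```

After this mapping, the values can be compared with any of the numeric distances, for example the Manhattan or Euclidean distance.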

  • Cosine Similarity

    A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
    Other vector objects: gene features in micro-arrays, …
    Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
    Cosine measure: if $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
        $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
    where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
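A minimal sketch of the cosine measure in plain Python follows. The two term-frequency vectors are hypothetical, and raising an error for an all-zero vector is a choice made for this example, since the cosine is undefined there.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Hypothetical term-frequency vectors over a 3-word vocabulary.
doc_a = [3, 0, 4]
doc_b = [4, 0, 3]
print(cosine_similarity(doc_a, doc_b))
```

The measure depends only on the angle between the vectors, so a long document and a short one with the same word proportions get a similarity of 1.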

  • Example: Cosine Similarity

    $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
    Ex: Find the similarity between documents 1 and 2.
    $d_1$ = (

  • KL Divergence: Comparing Two Probability Distributions

    The Kullback-Leibler (KL) divergence: measures the difference between two probability distributions over the same variable x
        From information theory, closely related to relative entropy, information divergence, and information for discrimination
    $D_{KL}(p(x) \| q(x))$: divergence of q(x) from p(x), measuring the information lost when q(x) is used to approximate p(x)
        Discrete form: $D_{KL}(p(x) \| q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$
    The KL divergence measures the expected number of extra bits required to code samples from p(x) (the "true" distribution) when using a code based on q(x), which represents a theory, model, description, or approximation of p(x)
        Its continuous form: $D_{KL}(p(x) \| q(x)) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$
    The KL divergence is not a distance measure and not a metric: it is asymmetric and does not satisfy the triangular inequality

  • Computing the KL Divergence

    Based on the formula, $D_{KL}(P \| Q) \ge 0$, and $D_{KL}(P \| Q) = 0$ if and only if $P = Q$.
    How about when p = 0 or q = 0?
    (a: 5/9 − ε/3, b: 3/9 − ε/3, c: ε, d: 1/9 − ε/3). $D_{KL}(P' \| Q')$ can be computed easily.
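The discrete form, including the zero cases raised above, can be sketched as follows. Two conventions are assumptions of this example rather than statements from the slides: a term with p(x) = 0 contributes 0, and the divergence is taken as infinite when p(x) > 0 but q(x) = 0. Base-2 logarithms are used so the result is in bits, and the distributions are invented.

```python
import math


def kl_divergence(p, q):
    """Discrete D_KL(p || q) in bits, for aligned probability lists."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # assumed convention: 0 * log(0 / q) = 0
        if qx == 0:
            return math.inf   # assumed convention: p > 0 where q = 0
        total += px * math.log2(px / qx)
    return total


# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]

print(kl_divergence(p, q))            # differs from kl_divergence(q, p)
print(kl_divergence(q, p))
print(kl_divergence(p, p))            # zero for identical distributions
print(kl_divergence(p, [1.0, 0.0]))   # infinite: q gives 0 to a possible outcome
```

Replacing exact zeros with a small ε and renormalizing, as the smoothed distribution above does, is one way to keep the result finite.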

