CT477-3 Information Retrieval Ch.3

  • 8/9/2019 CT477-3 Information Retrieval Ch.3


Contents

3.1 Introduction: automatic classification and document clustering
3.2 Measures of Association
3.3 Dissimilarity
3.4 Classification Methods
3.5 The Cluster Hypothesis
3.6 Types of Clustering Algorithm
3.7 Clustering Algorithms
Exercises
References


3.1 Automatic Classification and Document Clustering

Automatic classification (document clustering) has applications beyond IR, for example in pattern recognition and automatic medical diagnosis. In IR it takes two forms: keyword clustering and document clustering. R.M. Hayes described classification as the grouping of items according to the logical relationships among them into a logical organization. Clustering serves two purposes in retrieval: (1) documents are grouped and each group is summarised by a group vector, and (2) a query is matched against these group vectors instead of against every individual document.

  • 8/9/2019 CT477-3 Information Retrieval Ch.3

    3/44

    3 61

Whether a document matches a query can be judged in several ways: string matching/comparison, use of the same vocabulary, the probability that the documents arise from the same model, or the same meaning of the text.

3.2 Measures of Association

A measure of association quantifies how strongly two objects (e.g. two sets of keywords) are related. Common coefficients:

1. Simple Coefficient: |X ∩ Y|, the number of elements X and Y share.
   Example: X = {1, 2, 3} and Y = {1, 4} give X ∩ Y = {1}, so |X ∩ Y| = 1.

2. Dice's Coefficient: 2|X ∩ Y| / (|X| + |Y|), which normalises the overlap by the sizes of X and Y.

3. Jaccard's Coefficient: |X ∩ Y| / |X ∪ Y|

  • 8/9/2019 CT477-3 Information Retrieval Ch.3

    4/44

4. Cosine Coefficient: the cosine correlation used by Salton in the SMART system. For n-dimensional vectors X = (X1, …, Xn) and Y = (Y1, …, Yn) it is (X, Y) / (||X|| ||Y||), the cosine of the angle between the vectors X and Y, where (X, Y) denotes the inner product and ||·|| the vector length; in set form it is |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2)).

   Vector space similarity between documents Di and Dj:

   sim(Di, Dj) = Σ(k=1..t) w_ik · w_jk

   with term weights

   w_ik = tf_ik · log(N/n_k) / sqrt( Σ(k=1..t) (tf_ik)² · [log(N/n_k)]² )

   When the term weights are normalized to the range 0–1 in this way, the cosine reduces to the normalized inner product.

5. Overlap Coefficient: |X ∩ Y| / min(|X|, |Y|)
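These set-based coefficients are short to state in code. A minimal sketch on Python sets (the function names are my own, not from the text):

```python
# The five association coefficients of section 3.2, on Python sets.

def simple(x, y):
    return len(x & y)                         # |X ∩ Y|

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine_sets(x, y):
    return len(x & y) / (len(x) ** 0.5 * len(y) ** 0.5)

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

X, Y = {1, 2, 3}, {1, 4}
print(simple(X, Y))   # → 1, as in the worked example above
print(dice(X, Y))     # → 0.4
```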


Example 3.1: in the vector space model the document vectors D, the query Q, and the weights w are normalized.

Example 3.2: in a two-dimensional vector space with axes Term A and Term B, a query Q is compared with documents D1 = (0.8, 0.3) and D2 = (0.2, 0.3); the angles between Q and the documents give cos α1 = 0.74 and cos α2 = 0.98.


Figure 3.1: the cosine of the angle between the query (Q) and D2 is larger than that between (Q) and D1, so D2 matches the query better. For D1:

SIM(Q, D1) = (Q · D1) / (||Q|| ||D1||) = 0.56 / (0.89 × 0.85) ≈ 0.74
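The query vector itself is not legible in this copy of the example; assuming Q ≈ (0.4, 0.8), a direction consistent with the two cosines quoted above, the comparison can be reproduced:

```python
# Cosine similarity between a query and two document vectors.
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

Q  = (0.4, 0.8)   # assumed query vector; only its direction matters
D1 = (0.8, 0.3)
D2 = (0.2, 0.3)

print(round(cos_sim(Q, D1), 2))  # close to the example's 0.74
print(round(cos_sim(Q, D2), 2))  # close to the example's 0.98
```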

3.3 Dissimilarity


A dissimilarity measure D on a set of objects P is a function from P × P to the non-negative real numbers such that:

(1) D(X, Y) ≥ 0 for all X, Y in P
(2) D(X, X) = 0 for all X in P
(3) D(X, Y) = D(Y, X) for all X, Y in P
(4) D(X, Y) ≤ D(X, Z) + D(Y, Z) (triangle inequality)

Examples of dissimilarity coefficients:

1. |X Δ Y| / (|X| + |Y|), where |X Δ Y| = |X ∪ Y| − |X ∩ Y| is the size of the symmetric difference. This is exactly the complement of Dice's Coefficient 2|X ∩ Y| / (|X| + |Y|):

   Dissimilarity = 1 − 2|X ∩ Y| / (|X| + |Y|)
                 = (|X| + |Y| − 2|X ∩ Y|) / (|X| + |Y|)
                 = (|X ∪ Y| − |X ∩ Y|) / (|X| + |Y|)
                 = |X Δ Y| / (|X| + |Y|)

2. Similarly, 1 minus the Jaccard coefficient is a dissimilarity. For binary vectors, where X_i is 1 if attribute i is present and 0 otherwise,

   |X| = Σ X_i  (i = 1..N, for N attributes)
   |X ∩ Y| = Σ X_i · Y_i  (i = 1..N)

  • 8/9/2019 CT477-3 Information Retrieval Ch.3

    8/44

   |X Δ Y| / (|X| + |Y|) = ( Σ X_i(1 − Y_i) + Σ Y_i(1 − X_i) ) / ( Σ X_i + Σ Y_i )
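The equivalence derived above can be checked mechanically on a small binary-vector example (function names are my own):

```python
# Check on binary vectors that 1 - Dice equals |X Δ Y| / (|X| + |Y|),
# using the summation form of the symmetric difference.

def dice_dissim(x, y):
    inter = sum(a * b for a, b in zip(x, y))        # |X ∩ Y|
    return 1 - 2 * inter / (sum(x) + sum(y))

def sym_diff_dissim(x, y):
    num = sum(a * (1 - b) + b * (1 - a) for a, b in zip(x, y))
    return num / (sum(x) + sum(y))

# X = {1,2,3}, Y = {1,4} over the universe {1,2,3,4}
X = [1, 1, 1, 0]
Y = [1, 0, 0, 1]
print(dice_dissim(X, Y), sym_diff_dissim(X, Y))  # both 0.6
```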

   These coefficients can also be read probabilistically in terms of the attribute probabilities P(Xi) and P(Xj).

3. Jardine and Sibson's measure: when each attribute is binary, taking the values 1 and 0 with probabilities P1(1), P1(0), P2(1), P2(0), Jardine and Sibson define the dissimilarity between the two distributions u and v as the information radius between them.

3.4 Classification Methods

Classification applies to many kinds of objects: documents, keywords, handwritten characters, biological species. Each object is represented by a description, for example:
- a set of keywords
- a probability distribution over features

    3 66

  • 8/9/2019 CT477-3 Information Retrieval Ch.3

    9/44

Sparck Jones distinguishes classifications along several dimensions.

1. Monothetic vs polythetic. Let G = {f1, f2, …, fn} be the features describing a group of individuals (in a table, the rows are individuals and the columns are the features of G). A class is monothetic if there is a set of features that every member possesses, and polythetic if each member possesses many, but not necessarily all, of the class's features. In Table 3.2, for example, individuals 1–5 form a polythetic class: each possesses three of the features, but no single feature is common to all five.

Table 3.2: Monothetic vs polythetic classes.

    3 67

  • 8/9/2019 CT477-3 Information Retrieval Ch.3

    10/44

    3 68

2. Exclusive vs overlapping. In an exclusive classification each individual belongs to exactly one class; in an overlapping classification an individual may belong to more than one class.

3. Ordered vs unordered. An ordered classification arranges its classes in a structure, typically a hierarchy (hierarchical classification); an unordered classification, such as a thesaurus, imposes no such structure.

3.5 The Cluster Hypothesis

The cluster hypothesis states that "closely associated documents tend to be relevant to the same requests". In other words, documents relevant to a request tend to resemble one another more than they resemble the non-relevant documents.


Figure 3.3 plots, over a set of requests, the distribution of association values for relevant–relevant (R-R) document pairs against relevant–non-relevant (R-N-R) pairs. The further apart the two distributions lie, the better document clustering can separate relevant from non-relevant documents for that collection.

Desirable properties of a clustering algorithm include:
- the clustering should not change drastically when further documents are added (stability under growth);
- small errors in the document descriptions should lead to only small changes in the clustering.


- the result should be independent of the initial ordering of the documents.

Most clustering methods are distance-based (distance-based clustering): they group documents according to a distance or similarity measure.

3.6 Types of Clustering Algorithm

Clustering algorithms fall into four broad types:

(1) Exclusive clustering
(2) Overlapping clustering
(3) Hierarchical clustering, built on top of exclusive or overlapping clustering
(4) Probabilistic clustering

Within these, two general approaches can be distinguished.


1. Methods that work directly from the pairwise associations between objects.

1.1 Graph Theoretic Method

Figure 3.3: a graph of 6 objects.


From the graph of Figure 3.3, a cluster graph is formed by joining every pair of objects whose association exceeds a chosen threshold; changing the threshold changes the clusters that emerge.

Figure 3.5: clusters obtained by thresholding the graph.


Figure 3.5 reflects early work on keyword clustering: Sparck Jones and Jackson, Augustson and Minker, and Vaswani and Cameron defined clusters graph-theoretically, either as connected components (every vertex reachable from every other) or, more strictly, as maximal complete subgraphs (cliques).
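A minimal sketch of the graph-theoretic method, taking clusters to be the connected components of the threshold graph; the similarity values below are illustrative, not the figure's:

```python
# Graph-theoretic clustering: link every pair whose similarity exceeds
# a threshold, then take connected components as the clusters.

def components(n, edges):
    parent = list(range(n))               # union-find over n objects
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for i, j in edges:
        union(i, j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

sim = {(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.7, (2, 3): 0.2, (4, 5): 0.1}
threshold = 0.5
edges = [pair for pair, s in sim.items() if s > threshold]
print(components(6, edges))  # → [[0, 1, 2], [3, 4], [5]]
```

Raising the threshold removes edges and splits clusters; lowering it merges them.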

1.2 Single Link

Single link builds a hierarchic clustering of objects from a dissimilarity coefficient and displays it as a dendrogram (tree structure).

Figure 3.6: a dendrogram. For the objects {A, B, C, D, E}, the clusters at level L1 are {A, B}, {C}, {D}, {E}; at L2 they are {A, B} and {C, D, E}; and at L3 everything merges into {A, B, C, D, E}. Jardine and Sibson characterise single-link formally in terms of the dissimilarity coefficient (DC).


Thresholding the dissimilarity coefficient at increasing values produces the successive levels of the hierarchy. Other hierarchic cluster methods include complete-link and average-link.

At search time a request is scored against clusters with a matching function and retrieval proceeds from some level of the hierarchy. Retrieving small, low-level clusters gives high precision but low recall; cutting off at a low rank position behaves similarly, while retrieving large, high-level clusters gives high recall but low precision.

Figure 3.7: single-link clusters obtained by thresholding.


Computing a full hierarchic clustering can be expensive, but the single-link hierarchy can be obtained efficiently from the Minimum Spanning Tree (MST) of the dissimilarity graph: the MST contains all the information needed to derive the single-link hierarchy, and thresholding the MST's edges yields the single-link clusters. A minimum spanning tree is the spanning tree whose edges have the smallest possible total weight.

[Figure: a weighted graph on the vertices A, B, C, D, E and, below it, its minimum spanning tree; thresholding the MST's edge weights yields the single-link clusters.]
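The MST construction can be sketched with Kruskal's algorithm; since the figure's exact edge weights are unclear in this copy, the graph below is illustrative:

```python
# Kruskal's algorithm for the MST step of single-link clustering.
# Edge weights are illustrative, not the figure's exact values.

def kruskal(nodes, edges):
    parent = {v: v for v in nodes}        # union-find forest
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    mst = []
    for w, u, v in sorted(edges):         # consider edges lightest first
        ru, rv = find(u), find(v)
        if ru != rv:                      # edge joins two components: keep
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

edges = [(800, 'A', 'B'), (612, 'A', 'C'), (1421, 'A', 'D'),
         (200, 'B', 'C'), (410, 'B', 'D'), (310, 'D', 'E')]
mst = kruskal('ABCDE', edges)
print(mst, sum(w for w, _, _ in mst))
```

For 5 vertices the MST keeps exactly 4 edges; removing all MST edges above a threshold splits the tree into the single-link clusters at that level.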

2. Methods based on cluster descriptions. Instead of comparing every pair of objects, each cluster is summarised by a cluster representative (also called a cluster profile, classification vector, or centroid), and objects are compared against these representatives.


Typical controls in such methods are: thresholds on the matching function, which decide whether an object joins a cluster; the amount of overlap allowed between clusters; and how the representatives are recomputed. Two classical algorithms based on cluster descriptions follow.

1. Rocchio's clustering algorithm runs in 3 phases; documents that end up in no cluster are collected in a "rag-bag" cluster, and membership is controlled by thresholds on the matching function (overlap).

The Single-Pass algorithm:

- the first document's description becomes the representative of the first cluster;
- each subsequent document is matched against the representatives of the existing clusters;
- it is assigned to the best-matching cluster if that match exceeds a threshold (and the representative is then recomputed); otherwise it starts a new cluster.


Properties of single-pass clustering: the result depends on the matching function and its thresholds, and on the order in which documents are presented (a common criticism in tests); the thresholds are input parameters that must be tuned.

2. Dattola's algorithm is a hierarchic variant of this representative-based approach.

Broadly, graph-theoretic methods require the full matrix of association measures, costing on the order of n² operations for n documents, whereas heuristic approaches based on a matching function can run in about n log n.

3.7 Clustering Algorithms

(1) K-means is a partitioning method: it divides n objects into K clusters.

a. Choose k initial cluster centres.
b. Assign each object to its nearest centre and recompute each cluster's centroid (mean).
c. Reassign every object to its nearest new centroid.
d. Repeat b–c until the assignments no longer change.


[Figure: three snapshots of K-means clustering, steps 1–3.]

K-means costs O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. It usually terminates at a local optimum rather than the global optimum, and the number of clusters k must be chosen in advance.


Weaknesses of K-means: it applies only when a mean is defined, which is a problem for categorical data; the k-modes variant substitutes modes for means with a frequency-based update, and k-prototype handles mixed numeric/categorical data. K-means is also sensitive to noise and outliers, which motivates using a medoid (an actual object) as each cluster's centre, as in PAM (Partitioning Around Medoids, 1987).

PAM:
- choose k medoids;
- for each non-medoid object h and medoid i, compute the total cost change TCih of swapping i with h;
- perform the swap with the most negative TCih (it improves the clustering), and repeat until no improving swap remains.


Choosing K:

a. run the algorithm with a range of values of K;
b. evaluate the quality of each resulting clustering;
c. keep the K that scores best;
d. since the outcome depends on initialisation, repeat steps b–c multiple times.

K-means can be stated more precisely. Given n feature vectors X1, X2, …, Xn, choose K < n. Let mi be the mean of cluster i; a sample X belongs to cluster i when ||X − mi|| is smallest.

- Initialise the means m1, m2, …, mK (e.g. with K of the samples).
- Until there are no changes in any mean:
  - assign each sample to the cluster with the nearest mean;
  - for i = 1 to K: recompute mi as the mean of the samples currently in cluster i;
- end_until
- Return the K means.
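The pseudocode above translates almost line for line into a minimal sketch (initialisation with the first K samples is one common choice):

```python
# Minimal K-means following the pseudocode above: assign each point to
# the nearest mean, recompute the means, stop when nothing changes.
import math

def kmeans(points, k, iters=100):
    means = points[:k]                      # initialise with k samples
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[i].append(p)
        new_means = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
            for i, cl in enumerate(clusters)
        ]
        if new_means == means:              # no change in any mean: done
            break
        means = new_means
    return means, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
means, clusters = kmeans(points, 2)
print(sorted(means))                        # one mean per obvious group
```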


Several refinements address K-means' cost on large data sets: CLARA (Clustering LARge Applications, 1990) and CLARANS (Ng & Han, 1994) scale the medoid approach, and Sanpawat and Couch (http://www.cs.tufts.edu/~{sanpawat,couch}) describe a parallel K-means that divides the work among K processes.

Spherical K-Means has been applied to clustering full-text, unstructured text documents represented in the Vector Space Model (VSM). In the VSM, w_ik is the weight of term k in document i, and document Di is the vector

Di = (w_i1, w_i2, …, w_it)

in a t-dimensional space.


Similarity between documents is measured with the cosine, which ranges from 0 to 1. Term weights use tf*idf (term frequency × inverse document frequency) with idf = log(N/df): tf_ik is the frequency of term k in document i, N is the number of documents, and df_k is the number of documents containing term k. The weights are then normalized so that every document vector has length 1, and with ||Di|| = ||Dj|| = 1 the cosine reduces to a plain inner product.
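A small sketch of the tf*idf weighting and normalisation just described, on assumed toy documents; after normalisation the cosine is simply the inner product:

```python
# tf*idf weighting with unit-length normalisation: once ||D|| = 1,
# the cosine similarity is just the inner product of weight vectors.
import math

docs = [["database", "query", "dbms"],        # toy documents (assumed)
        ["query", "indexing", "dbms"],
        ["neural", "network"]]

N = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency

def weights(doc):
    w = [doc.count(t) * math.log(N / df[t]) for t in vocab]  # tf * idf
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]                             # ||D|| = 1

W = [weights(d) for d in docs]
sim = lambda a, b: sum(x * y for x, y in zip(a, b))
print(sim(W[0], W[1]))   # documents sharing "query" and "dbms"
```

Note that a term occurring in every document gets idf = log(1) = 0 and so contributes nothing to the similarity.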


Example 3.3: three documents D1, D2, D3 are first word-segmented; df and idf are then computed for every term.


Terms whose weight is 0 contribute nothing and can be dropped; the resulting VSM vectors are the input to the clustering algorithm.

Figure 3.8: the document-clustering pipeline.

Spherical K-means proceeds like K-means but measures distance by the cosine rather than the Euclidean metric, so the documents and the centroids all lie on the unit sphere.


In one experiment the method was run on 4,800 documents drawn from 5 categories of 1,146, 1,653, 828, 47, and 1,126 documents. After word segmentation with Longest Matching the vocabulary contained 32,675 distinct terms, and the clustering was evaluated with the F-measure.


(2) Fuzzy C-Means (FCM)

FCM minimises the objective function

J = Σ(i=1..c) Σ(j=1..n) (μ_ij)^m · d²(X_j, Z_i)

where X = {X1, X2, …, Xn} are the n data points, c is the number of clusters, Z_i is the centre of cluster i, m > 1 controls the fuzziness, μ_ij is the degree of membership of point X_j in cluster i, and d²(X_j, Z_i) is the squared distance between X_j and centre Z_i.

J is minimised by alternating two updates:

Z_i = Σ(j=1..n) (μ_ij)^m · X_j / Σ(j=1..n) (μ_ij)^m

μ_ij = [1/d²(X_j, Z_i)]^(1/(m−1)) / Σ(k=1..c) [1/d²(X_j, Z_k)]^(1/(m−1))
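A minimal sketch of the two update formulas above, with fuzziness m = 2 and squared Euclidean distance on 1-D data (the data and initialisation are illustrative):

```python
# Fuzzy C-means implementing the centre and membership updates above.
import random

def fcm(xs, c, m=2.0, iters=100):
    random.seed(1)
    # initialise memberships so each point's memberships sum to 1
    mu = [[random.random() for _ in xs] for _ in range(c)]
    for j in range(len(xs)):
        s = sum(mu[i][j] for i in range(c))
        for i in range(c):
            mu[i][j] /= s
    for _ in range(iters):
        # centre update: Z_i = sum_j mu_ij^m X_j / sum_j mu_ij^m
        z = [sum(mu[i][j] ** m * xs[j] for j in range(len(xs)))
             / sum(mu[i][j] ** m for j in range(len(xs)))
             for i in range(c)]
        # membership update: mu_ij proportional to (1/d^2)^(1/(m-1))
        for j, x in enumerate(xs):
            d2 = [max((x - z[i]) ** 2, 1e-12) for i in range(c)]
            tot = sum((1 / d) ** (1 / (m - 1)) for d in d2)
            for i in range(c):
                mu[i][j] = (1 / d2[i]) ** (1 / (m - 1)) / tot
    return z, mu

xs = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
z, mu = fcm(xs, 2)
print(sorted(z))   # centres settle near the two groups of points
```

Unlike K-means, every point keeps a graded membership in every cluster; hardening the result (assigning each point to its highest-membership cluster) recovers a K-means-like partition.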


The iteration, as a flowchart: initialise the centroids Z1, Z2, …, Zc; calculate the memberships from the given centroids; calculate new centroids; if the centroids improved, repeat; otherwise stop and report the memberships and the objective function.

The distance may be the (squared) Euclidean distance:

ED_ji = (X_j − Z_i)^T (X_j − Z_i)

where ED_ji is the distance between point X_j and centre Z_i, and T denotes the matrix transpose.


Alternatively the Mahalanobis distance can be used:

MD_ji = (X_j − Z_i)^T A^(−1) (X_j − Z_i)

where MD_ji is the distance between point X_j and centre Z_i, and A is the variance–covariance matrix

A = (1/n) Σ(j=1..n) (X_j − Z_i)(X_j − Z_i)^T

An application of FCM to Thai text is described in a JCSSE 2005 paper (www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf).

(3) Hierarchical Clustering builds a hierarchical decomposition of the objects from a distance matrix. The number of clusters k need not be fixed in advance; only a termination condition is required.


AGNES (AGglomerative NESting), Kaufmann & Rousseeuw (1990), merges clusters bottom-up using the single-link criterion, producing a dendrogram (tree of clusters). DIANA (DIvisive ANAlysis), also Kaufmann & Rousseeuw (1990), is its inverse: it starts from one all-inclusive cluster and splits top-down.


Agglomerative clustering costs at least O(n²) in the number of objects n and does not scale to large collections. Later methods integrate hierarchical clustering with other techniques: BIRCH (1996) uses a CF-tree to condense the data, while CURE (1998) and CHAMELEON (1999) refine how clusters are represented and merged, improving on BIRCH and CURE respectively.

A generic agglomerative hierarchical algorithm starts from the N×N distance matrix: initially each of the N items is a cluster of one, and at each step the two closest clusters are merged, so that after N−1 merges a single cluster containing all N items remains. Cutting the resulting tree at a chosen level yields any desired number K of clusters.

Steps: (1) start with N singleton clusters; (2) find the closest pair of clusters and merge them; (3) recompute the distances between the new cluster and the remaining ones; (4) repeat steps 2–3 until one cluster remains.

Single-Linkage Clustering

Let D = [d(i,j)] be the N×N proximity matrix. The clusterings are numbered 0, 1, …, (n−1), L(k) is the level of the k-th clustering, m is the sequence number of the current merge, and d[(r),(s)] is the proximity between clusters (r) and (s).


Step 1: begin with the disjoint clustering at level L(0) = 0 and sequence number m = 0.

Step 2: find the least dissimilar pair of clusters (r), (s): d[(r),(s)] = min d[(i),(j)].

Step 3: set m = m + 1, merge (r) and (s) into a single cluster, and set the level L(m) = d[(r),(s)].

Step 4: update the proximity matrix D by deleting the rows and columns for (r) and (s) and adding one for the new cluster, denoted (r,s), with d[(k),(r,s)] = min{ d[(k),(r)], d[(k),(s)] } for every remaining cluster (k).

Step 5: repeat from step 2 until all items are in one cluster.

Example 3.4: hierarchical clustering with single-linkage of six cities BA, FI, MI, NA, RM, TO.

Input distance matrix (L = 0):

    BA FI MI NA RM TO

    BA 0 662 877 255 412 996

    FI 662 0 295 468 268 400

    MI 877 295 0 754 564 138

    NA 255 468 754 0 219 869

    RM 412 268 564 219 0 669

    TO 996 400 138 869 669 0

The smallest entry is d(MI,TO) = 138, so MI and TO are merged into the cluster MI/TO at level L(MI/TO) = 138, m = 1. In single-linkage, the distance from MI/TO to every other city is the minimum of its distances to MI and to TO:

    BA FI MI/TO NA RM

    BA 0 662 877 255 412

    FI 662 0 295 468 268

    MI/TO 877 295 0 754 564

    NA 255 468 754 0 219

    RM 412 268 564 219 0

min d(i,j) = d(NA,RM) = 219: merge NA and RM into NA/RM, with L(NA/RM) = 219, m = 2.

    BA FI MI/TO NA/RM

    BA 0 662 877 255

    FI 662 0 295 268

    MI/TO 877 295 0 564

    NA/RM 255 268 564 0

min d(i,j) = d(BA,NA/RM) = 255: merge BA and NA/RM into BA/NA/RM, with L(BA/NA/RM) = 255, m = 3.

    BA/NA/RM FI MI/TO

    BA/NA/RM 0 268 564

    FI 268 0 295

    MI/TO 564 295 0

min d(i,j) = d(BA/NA/RM,FI) = 268: merge BA/NA/RM and FI into BA/NA/RM/FI, with L(BA/NA/RM/FI) = 268, m = 4.

    BA/NA/RM/FI MI/TO

    BA/NA/RM/FI 0 295

    MI/TO 295 0


Finally the last two clusters merge at level 295.

The resulting hierarchical tree (dendrogram) has leaves ordered BA, NA, RM, FI, MI, TO.
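The worked example can be reproduced by coding steps 1–5 directly on the distance matrix above:

```python
# Single-linkage (Johnson's algorithm) on the example's distance matrix:
# repeatedly merge the closest pair, taking the minimum distance to the
# merged cluster.

labels = ["BA", "FI", "MI", "NA", "RM", "TO"]
d = {
    ("BA", "FI"): 662, ("BA", "MI"): 877, ("BA", "NA"): 255,
    ("BA", "RM"): 412, ("BA", "TO"): 996, ("FI", "MI"): 295,
    ("FI", "NA"): 468, ("FI", "RM"): 268, ("FI", "TO"): 400,
    ("MI", "NA"): 754, ("MI", "RM"): 564, ("MI", "TO"): 138,
    ("NA", "RM"): 219, ("NA", "TO"): 869, ("RM", "TO"): 669,
}
dist = lambda a, b: d[min(a, b), max(a, b)]

clusters = [frozenset([l]) for l in labels]
D = {(a, b): dist(x, y)
     for a in clusters for b in clusters if a != b
     for x in a for y in b}
levels = []
while len(clusters) > 1:
    (r, s) = min(((a, b) for a in clusters for b in clusters if a != b),
                 key=lambda p: D[p])
    levels.append(D[r, s])                      # L(m) = d[(r),(s)]
    merged = r | s
    clusters = [c for c in clusters if c not in (r, s)]
    for c in clusters:                          # d[(k),(r,s)] = min(...)
        D[c, merged] = D[merged, c] = min(D[c, r], D[c, s])
    clusters.append(merged)
print(levels)  # → [138, 219, 255, 268, 295]
```

The printed merge levels, 138, 219, 255, 268, 295, match the worked example step by step.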

Single-linkage clustering takes O(n²) time for n items.

(4) Mixture of Gaussians is a model-based method: each cluster is assumed to be generated by a component probability distribution, e.g. a Gaussian or a Poisson, and the data as a whole by a mixture model, a weighted sum of the component distributions.

With Gaussian components, component i generates points from N[μ_i, σ²I] and contributes with mixing weight α_i, so


P(X | μ_1, μ_2, …, μ_k) = Σ(i=1..k) α_i · P(X | μ_i)

The parameters are estimated with the EM (Expectation-Maximization) algorithm. The following example illustrates EM before returning to the mixture of Gaussians.

Example: a variable X takes four values whose probabilities depend on a single unknown parameter μ:

P(X1) = 0.5,  P(X2) = μ,  P(X3) = 2μ,  P(X4) = 0.5 − 3μ

Case 1: all counts are observed.

X1 : a students
X2 : b students
X3 : c students
X4 : d students

The likelihood is

P(a, b, c, d | μ) ∝ (0.5)^a · (μ)^b · (2μ)^c · (0.5 − 3μ)^d

Maximise by setting ∂P/∂μ = 0. Taking logarithms,

log L(P) = a·log(0.5) + b·log(μ) + c·log(2μ) + d·log(0.5 − 3μ)

∂log L/∂μ = b/μ + c/μ − 3d/(0.5 − 3μ) = 0

which solves to

μ = (b + c) / (6(b + c + d))

With a = 14, b = 6, c = 9, d = 10 this gives μ = 15/150 = 1/10.

Case 2: x1 and x2 cannot be distinguished; only their combined count is observed.

x1 + x2 : h students
x3 : c students
x4 : d students

EM alternates between estimating the hidden split of h and re-estimating μ:

E-step:  a ← h · (1/2) / (1/2 + μ),   b ← h · μ / (1/2 + μ)

M-step:  μ ← (b + c) / (6(b + c + d))
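The two steps of case 2 can be iterated numerically. The combined count h is not legible in this copy; assuming h = a + b = 20 (the counts from case 1), with c = 9 and d = 10:

```python
# EM for the partially observed case: h students got grade x1 or x2,
# but we do not know which.  h = 20, c = 9, d = 10 are assumptions
# carried over from case 1.

h, c, d = 20, 9, 10
mu = 0.05                             # any starting guess in (0, 1/6)
for _ in range(100):
    b = h * mu / (0.5 + mu)           # E-step: expected count for x2
    mu = (b + c) / (6 * (b + c + d))  # M-step: maximise with b fixed
print(round(mu, 4))
```

The iteration settles near μ ≈ 0.091, slightly below the fully observed estimate of 1/10.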

The EM algorithm for a mixture of Gaussians:

Step 1: initialise the parameters

λ_0 = {μ_1, μ_2, …, μ_k, p_1, p_2, …, p_k}

Step 2 (E-step): compute the responsibility of each component k for each point x_j under the current parameters λ_t:

p(k | x_j, λ_t) = p(x_j | k, λ_t) · p_k(t) / p(x_j | λ_t)
               = p(x_j | μ_k(t), σ_k(t)) · p_k(t) / Σ(i=1..K) p(x_j | μ_i(t), σ_i(t)) · p_i(t)

Step 3 (M-step): re-estimate the parameters from the weighted points:

μ_k(t+1) = Σ_i p(k | x_i, λ_t) · x_i / Σ_i p(k | x_i, λ_t)

p_k(t+1) = Σ_i p(k | x_i, λ_t) / R

where R is the number of data points. Repeat steps 2–3 until the parameters converge.

(5) Genetic Algorithms, introduced by John Holland in 1975, are general optimization methods that have been applied to clustering and to query optimisation in IR, among other problems. A genetic algorithm repeats 5 steps:

a. encode candidate solutions as chromosomes and create an initial population;
b. evaluate the fitness of every chromosome;
c. select chromosomes for reproduction according to fitness;
d. recombine the selected chromosomes by crossover;
e. apply mutation, then return to step b until the population converges.


Example 3.5: five documents are to be clustered:

    DOC1 ={Database, Query, Data Retrieval , Computer,Network, DBMS}

    DOC2={Artificial Intelligence, Internet, Indexing, Natural Language Processing}

    DOC3={Database , Expert System, Information Retrieval System, Multimedia}

    DOC4={Fuzzy Logic, Neural Network, Computer Networks}

DOC5={Object-Oriented, DBMS, Query, Indexing}

From the five documents, 16 index terms are extracted: Artificial Intelligence, Computer Network, Data Retrieval, Database,

    DBMS, Expert System , Fuzzy Logic, Indexing

    Information Retrieval System, Internet,Multimedia,Natural Language Processing,

Neural Network, Object Oriented, Query, Relational Database. Each document is then encoded as a 16-bit chromosome, one bit per term:

    DOC1=0110100000000011

    DOC2=1000000101010000

    DOC3=0001010010100000

    DOC4=0100001000001000

    DOC5=0000100100000110


The query is encoded as a 16-bit string in the same way. The fitness of a chromosome is its association with the query, measured for instance by the Dice, Cosine, or Jaccard coefficient; fitness therefore ranges from 0.0 to 1.0, with 1.0 the best possible match.

Selection keeps the fittest chromosomes for reproduction (survival of the fittest). Crossover then mates pairs of chromosomes by exchanging their tails after a crossover point; crossing the two parents below at position 8 produces the two offspring:

    101111110011101

    100110011110000

    101111111110000

    100110010011101
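The crossover step can be checked directly; cutting the two parents at position 8 reproduces the offspring shown above:

```python
# Single-point crossover: swap the tails of two parents after `point`.

def crossover(p1, p2, point):
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

p1 = "101111110011101"
p2 = "100110011110000"
c1, c2 = crossover(p1, p2, 8)
print(c1)  # → 101111111110000
print(c2)  # → 100110010011101
```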

Mutation flips a randomly chosen bit of a chromosome, for example:


101111110011101  →  101111010011101

Generations of selection, crossover, and mutation repeat until the best fitness exceeds a required threshold.


Exercises

1. Explain the Simple, Dice's, Jaccard's, Cosine, and Overlap coefficients.
2. Give an example of computing the Jaccard coefficient.
3. Give an example of computing Dice's coefficient.
4. Compare monothetic and polythetic classification.
5. Compare exclusive classes and overlapping classes.
6. Explain ordered classification.
7. Explain clustering.
8. Describe the general form of a clustering algorithm.
9. …
10. Explain the graph-theoretic method.
11. Explain the single-link method.
12. Describe Rocchio's clustering algorithm.
13. Describe the K-means algorithm.
14. Describe the PAM algorithm.
15. Describe Fuzzy C-means (FCM).
16. Describe single-linkage clustering.
17. Describe genetic algorithms.


References

1. […], CS337 lecture notes, B.E. 2535 (1992).
2. […], Technical Journal, Vol. 11, No. 7, March–June 2000.
3. Sanpawat Kantabutra and Alva L. Couch, "Parallel K-means Clustering Algorithm on NOW's", Department of Computer Science, Tufts University, Medford, Massachusetts. www.nectec.or.tn/NTJ/n06/papers/No6_short_1.pdf
4. […], The Joint Conference on Computer Science and Software Engineering (JCSSE), November 17–18, 2005. www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf
5. […], "Spherical K-Means", Intelligent Information Retrieval and Database Laboratory, Department of Computer Science, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand.
