8/9/2019 CT477-3 Information Retrieval Ch.3
Chapter 3

Contents
3.1 Introduction
3.2 Measures of Association
3.3 Dissimilarity
3.4 Classification Methods
3.5 The Cluster Hypothesis
3.6 Types of Clustering Algorithms
3.7 Clustering Algorithms
Review Questions
References
3.1 Introduction

Automatic classification, in the form of document clustering, is used in many fields, e.g. pattern recognition and automatic medical diagnosis. In IR, clustering takes 2 forms: keyword clustering and document clustering. R.M. Hayes described classification as the grouping of related items, that is, arranging objects according to the logical relationships among them into a logical organization.

Clustering serves 2 purposes: 1. grouping similar documents so that each group can be represented by a group vector, and 2. letting a query be matched first against group vectors rather than against every individual document.
Whether a document matches a query can be judged in several ways: string matching/comparison, the same vocabulary being used, the probability that the documents arise from the same model, or the same meaning of the text.
3.2 Measures of Association

A measure of association quantifies how strongly two objects (e.g. documents or keywords) are related, usually via the keyword sets X and Y that describe them.

1. Simple Coefficient

| X ∩ Y |, the number of keywords that X and Y share.
Example: if X = { 1 , 2 , 3 } and Y = { 1 , 4 }, then X ∩ Y = { 1 } and | X ∩ Y | = 1.

2. Dice's Coefficient

2 | X ∩ Y | / ( | X | + | Y | ), which normalizes the overlap by the combined sizes of X and Y.

3. Jaccard's Coefficient

| X ∩ Y | / | X ∪ Y |
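The three set-based coefficients above can be sketched in Python (a minimal illustration; the function names are ours):

```python
def simple_coefficient(x, y):
    # number of keywords shared by the two sets
    return len(x & y)

def dice_coefficient(x, y):
    # 2|X n Y| / (|X| + |Y|)
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard_coefficient(x, y):
    # |X n Y| / |X u Y|
    return len(x & y) / len(x | y)

X, Y = {1, 2, 3}, {1, 4}
print(simple_coefficient(X, Y))   # 1
print(dice_coefficient(X, Y))     # 0.4
print(jaccard_coefficient(X, Y))  # 0.25
```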
4. Cosine Coefficient

The cosine correlation, used by Salton in the SMART system, measures the cosine of the angle between two n-dimensional vectors X = (X1, ..., Xn) and Y = (Y1, ..., Yn):

cosine(X, Y) = (X, Y) / ( ||X|| ||Y|| )

where (X, Y) denotes the inner product and ||.|| the norm. For sets it takes the form | X ∩ Y | / ( |X|^1/2 * |Y|^1/2 ).
Vector Space Similarity

sim(Di, Dj) = Σ_{k=1..t} w_ik * w_jk

where Di and Dj are document vectors of t term weights. With tf-idf weighting,

w_ik = tf_ik * log(N/n_k) / sqrt( Σ_{k=1..t} (tf_ik)^2 * [log(N/n_k)]^2 )

where tf_ik is the frequency of term k in document i, N is the number of documents, and n_k the number of documents containing term k. Normalizing the term weights this way keeps them between 0 and 1, and the cosine becomes a normalized inner product.
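These formulas can be sketched on a hypothetical toy corpus (log base and the idf = log(N/n_k) form follow the text; the collection itself is made up):

```python
import math

def tfidf_vector(tf, N, df):
    # w_ik = tf_ik * log(N/n_k), then normalized to unit length
    w = [t * math.log(N / d) for t, d in zip(tf, df)]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w] if norm else w

def sim(di, dj):
    # normalized inner product = cosine similarity
    return sum(a * b for a, b in zip(di, dj))

# toy collection: N = 4 documents, 3 terms with document frequencies df
N, df = 4, [1, 2, 4]
d1 = tfidf_vector([2, 0, 5], N, df)
d2 = tfidf_vector([1, 1, 0], N, df)
print(round(sim(d1, d1), 3))  # 1.0 (identical documents)
print(sim(d1, d2))            # between 0 and 1
```

Note that the term appearing in every document (df = N) gets weight 0, since log(N/N) = 0.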
5. Overlap coefficient
| X ∩ Y | / min( |X| ,|Y| )
Example 3.1: In the vector space model, documents D and query Q are represented by term-weight vectors w, which are then normalized.

Example 3.2: Let query Q be compared against documents D1 = (0.8, 0.3) and D2 = (0.2, 0.3) in a two-term space (Term A, Term B). The angles between Q and the documents give cos α1 = 0.74 and cos α2 = 0.98.
Figure 3.1: cosine similarity measured in degrees. The query Q is closer in angle to D2 than to D1; the worked computation gives SIM(Q,D1) ≈ 0.74.
3.3 Dissimilarity

A dissimilarity measure is the converse of a similarity measure: the larger its value, the less alike two objects are.
Let P be the set of objects to be clustered. A dissimilarity coefficient D is a function on P x P satisfying:
(1) D(X, Y) ≥ 0 for all X, Y in P
(2) D(X, X) = 0 for all X in P
(3) D(X, Y) = D(Y, X) for all X, Y in P
(4) D(X, Y) ≤ D(X, Z) + D(Y, Z) (the triangle inequality)
Examples of dissimilarity coefficients:

1. | X Δ Y | / ( |X| + |Y| ), where | X Δ Y | = | X ∪ Y | - | X ∩ Y | is the size of the symmetric difference. This is the complement of Dice's coefficient 2 | X ∩ Y | / ( |X| + |Y| ):

Dissimilarity = 1 - ( 2 | X ∩ Y | / ( |X| + |Y| ) )
             = ( |X| + |Y| - 2 | X ∩ Y | ) / ( |X| + |Y| )
             = ( | X ∪ Y | - | X ∩ Y | ) / ( |X| + |Y| )
             = | X Δ Y | / ( |X| + |Y| )
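The identity can be checked numerically (a quick sketch on the example sets from section 3.2):

```python
def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def symdiff_dissimilarity(x, y):
    # |X delta Y| / (|X| + |Y|), with X ^ Y the symmetric difference
    return len(x ^ y) / (len(x) + len(y))

X, Y = {1, 2, 3}, {1, 4}
# 1 - Dice = 1 - 0.4 = 0.6, and |X delta Y| / (|X|+|Y|) = 3/5 = 0.6
assert abs((1 - dice(X, Y)) - symdiff_dissimilarity(X, Y)) < 1e-12
```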
2. The complement of Jaccard's coefficient. For binary vectors, where component i is 1 if keyword i is present and 0 if it is absent:

|X| = Σ X_i, i ∈ 1..N, where N is the number of keywords
| X ∩ Y | = Σ X_i Y_i, i ∈ 1..N
| X Δ Y | / ( |X| + |Y| ) = ( Σ X_i (1 - Y_i) + Σ Y_i (1 - X_i) ) / ( Σ X_i + Σ Y_i )

3. A measure based on the probability distributions P(X_i), P(X_j). Representing the presence (1) or absence (0) of a term by the probabilities P1(1), P1(0), P2(1), P2(0), Jardine and Sibson defined a dissimilarity based on the information radius, with weights u and v.
3.4 Classification Methods

Classification applies to many kinds of objects: documents, keywords, hand-written characters, biological species. Objects are classified on the basis of their descriptions, for example:
- documents described by keywords
- objects described by probability distributions
Sparck Jones characterizes classifications along several dimensions.

1. Monothetic vs. polythetic. Let G = { f1 , f2 , f3 , ... , fn } be the set of attributes f_i possessed by the individuals to be classified; picture a table whose rows are individuals and whose columns record which attributes f of G each individual possesses. A monothetic class is defined by a set of attributes that every member must possess; in a polythetic class each member possesses many of the defining attributes, and no single attribute need be common to all members. In Figure 3.2, individuals 6, 7, and 8 form a monothetic class, while individuals 1, 2, 3, 4, 5 (each possessing 3 of the attributes) form a polythetic class.

Figure 3.2: Monothetic vs. polythetic classes.
2. Exclusive vs. overlapping. In an exclusive classification each individual belongs to exactly one class; in an overlapping classification an individual may belong to more than one class.
3. Ordered vs. unordered. In an ordered classification the classes themselves are ordered, typically into a hierarchy (hierarchical classification); in an unordered classification they are not, as with the classes of a thesaurus.
3.5 The Cluster Hypothesis

The cluster hypothesis states that "closely associated documents tend to be relevant to the same requests". The associations among relevant documents therefore differ measurably from those between relevant and non-relevant documents.
Figure 3.3: for a given request, the distribution of association values between pairs of relevant documents (relevant-relevant, R-R) is separated from the distribution between relevant and non-relevant pairs (relevant-non-relevant, R-N-R); the degree of separation between the two curves indicates how well clustering can serve that request.

This separation motivates document clustering: a clustering algorithm should group the relevant documents together, apart from the non-relevant ones.
Most clustering methods are distance-based (distance-based clustering): objects are grouped according to a distance, or dissimilarity, computed between them.
3.6 Types of Clustering Algorithms

Clustering algorithms fall into 4 broad categories:
(1) Exclusive clustering: each object belongs to exactly one cluster.
(2) Overlapping clustering: an object may belong to several clusters.
(3) Hierarchical clustering: built on top of exclusive or overlapping clustering by merging clusters level by level.
(4) Probabilistic clustering: objects are assigned to clusters probabilistically.

The methods themselves divide into 2 families.
1. Methods based on the associations between objects: the clustering is computed from a measure of similarity (or dissimilarity) between every pair of objects.
2. Methods based on object descriptions: the clustering is built directly from cluster representatives, without computing all pairwise associations.

1. Methods based on associations between objects

1.1 Graph Theoretic Method

Figure 3.3: a similarity graph on 6 objects.
In Figure 3.3, an edge connects two objects whenever the association between them exceeds a chosen threshold; at the threshold shown, the graph splits into 2 clusters of 2 objects each.

Figure 3.5: clusters obtained by thresholding the similarity graph.
Figure 3.5 relates to keyword clustering as studied by Sparck Jones and Jackson, Augustson and Minker, and Vaswani and Cameron: clusters can be defined as strings, as connected components, or as maximal complete subgraphs (cliques) of the threshold graph.
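The connected-component variant can be sketched as follows: connect two objects when their similarity exceeds the threshold, then read off the components as clusters (the similarity values here are made up for illustration):

```python
def threshold_clusters(n, sims, threshold):
    # sims: dict {(i, j): similarity}; union-find over edges above threshold
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for (i, j), s in sims.items():
        if s > threshold:
            parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

sims = {(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.7, (0, 5): 0.2}
print(threshold_clusters(6, sims, 0.5))  # [[0, 1, 2], [3, 4], [5]]
```

Raising the threshold deletes edges and splits clusters; lowering it merges them.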
1.2 Single Link Method

Single-link hierarchic clustering groups objects from a dissimilarity coefficient (DC); the result is a hierarchy drawn as a dendrogram (tree structure).

Figure 3.6: a dendrogram over {A,B,C,D,E}. At level L1 the clusters are {A,B}, {C}, {D}, {E}; at level L2 they are {A,B} and {C,D,E}; at level L3 everything merges into {A,B,C,D,E}.

Jardine and Sibson formalized single-link clustering in terms of the dissimilarity coefficient (DC).
Thresholding the DC at a sequence of levels produces the nested clusters of the hierarchy. Other hierarchic methods, such as complete-link and average-link, differ in how the dissimilarity between clusters is defined.

Retrieval then applies a matching function and a threshold to the request: small clusters at low levels of the hierarchy give high precision but low recall, while cutting off at a low rank position higher up gives high recall but low precision.

Figure 3.7: single-link clusters obtained by thresholding.
A hierarchic clustering can be computed efficiently via the Minimum Spanning Tree (MST). The single-link hierarchy is derivable from the MST: deleting (thresholding) every MST edge longer than a given level leaves connected components that are exactly the single-link clusters at that level.
A minimum spanning tree is the subset of edges that connects all nodes with the smallest possible total edge weight.

Example: a weighted graph on the nodes A, B, C, D, E with edge weights 800, 1421, 400, 200, 410, 612, 2915, and 310. Its minimum spanning tree keeps only the edges of weight 200, 410, 612, and 310.
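The MST can be found with Kruskal's algorithm; since the figure's node-to-edge assignment did not survive, the endpoints below are a hypothetical assignment of the example's eight edge weights:

```python
def kruskal(nodes, edges):
    # edges: list of (weight, u, v); returns the MST edge list
    parent = {v: v for v in nodes}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    mst = []
    for w, u, v in sorted(edges):      # consider edges in increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                   # edge joins two components: keep it
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

# hypothetical endpoints for the weights in the example figure
edges = [(800, 'A', 'D'), (1421, 'B', 'E'), (400, 'A', 'C'),
         (200, 'A', 'B'), (410, 'C', 'D'), (612, 'D', 'E'),
         (2915, 'A', 'E'), (310, 'B', 'C')]
mst = kruskal('ABCDE', edges)
print(sorted(w for w, _, _ in mst))  # [200, 310, 410, 612]
```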
2. Methods based on object descriptions

These methods work directly from the descriptions of the objects and summarize each cluster by a cluster representative, also called a cluster profile, classification vector, or centroid.
Characteristics of these heuristic methods:
- thresholds on the matching function control when an object joins a cluster;
- the amount of overlap between clusters can be controlled;
- the clustering depends on input parameters chosen in advance.

Methods that cluster from object descriptions include:

1. Rocchio's clustering algorithm, which proceeds in 3 phases; documents that fit no cluster are collected in a "rag-bag" cluster. Its behavior is governed by thresholds on the matching function and on the allowed overlap.

The Single-Pass algorithm:
- each object is processed once, in input order;
- the first object becomes the representative of the first cluster;
- each subsequent object is matched against the representatives of the existing clusters and joins the best-matching cluster if the match exceeds a threshold (the representative is then recomputed); otherwise it starts a new cluster.
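The single-pass procedure can be sketched as follows (cosine matching on simple count vectors; the threshold and the data are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(docs, threshold):
    clusters = []            # each cluster: {'rep': centroid, 'members': [...]}
    for idx, d in enumerate(docs):
        best, best_sim = None, threshold
        for c in clusters:   # match against every existing representative
            s = cosine(d, c['rep'])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:     # no representative matches well enough
            clusters.append({'rep': list(d), 'members': [idx]})
        else:                # join cluster and recompute its representative
            best['members'].append(idx)
            n = len(best['members'])
            best['rep'] = [(r * (n - 1) + x) / n for r, x in zip(best['rep'], d)]
    return [c['members'] for c in clusters]

docs = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 1)]
print(single_pass(docs, 0.6))  # [[0, 1], [2, 3]]
```

Note how the result depends on the threshold and on the input order, the weaknesses the text describes.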
Drawbacks of these methods are that the result depends on the matching function and on thresholds that must be found by testing, i.e. on the input parameters.

2. Dattola's algorithm, a hierarchic variant of this family.

Compared with graph-theoretic approaches, these heuristic approaches are cheaper: graph-theoretic methods must compute the association measure between every pair of objects (order n^2), while heuristic methods need only on the order of n log n evaluations of the matching function.
3.7 Clustering Algorithms

(1) K-means (partitioning): partition n objects into K clusters.
a. Choose K objects as the initial cluster centers.
b. Assign each object to the cluster with the nearest center.
c. Recompute each center as the centroid (mean) of its cluster.
d. Repeat steps b and c until the assignments no longer change.
The figures K-means clustering 1-3 illustrate successive iterations of the algorithm.

K-means runs in O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. It converges to a local optimum, not necessarily the global optimum, and the number of clusters k must be specified in advance.
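Steps a-d can be sketched in a few lines (1-D data for clarity; initialization simply takes the first k points, one of several common choices):

```python
def kmeans(points, k, iters=100):
    centers = points[:k]                       # a. initial centers
    clusters = []
    for _ in range(iters):                     # d. iterate until stable
        clusters = [[] for _ in range(k)]
        for p in points:                       # b. nearest-center assignment
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centers[i]   # c. recompute means
               for i, c in enumerate(clusters)]
        if new == centers:                     # converged
            break
        centers = new
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 10.0, 11.0], 2)
print(centers)   # [1.5, 10.5]
print(clusters)  # [[1.0, 2.0], [10.0, 11.0]]
```

Starting from different initial centers can land in a different local optimum, which is why the algorithm is often restarted several times.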
K-means applies only when a mean is defined; for categorical data, variants use the mode (frequency-based k-modes) or the k-prototype method. K-means is also sensitive to outliers, which motivates clustering around medoids: PAM (Partitioning Around Medoids, 1987) chooses actual objects (medoids) as cluster centers.

PAM:
- select k medoids arbitrarily;
- for each non-medoid object h and each medoid i, compute the total cost TCih of swapping i with h;
- if the best TCih is an improvement, perform the swap and repeat; otherwise stop and assign each object o to its nearest medoid.
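A compact PAM-style sketch (exhaustive swap search on tiny 1-D data; real PAM adds a BUILD phase and incremental cost bookkeeping):

```python
def total_cost(points, medoids):
    # sum of each point's distance to its nearest medoid
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    medoids = points[:k]                      # arbitrary initial medoids
    while True:
        best = (total_cost(points, medoids), medoids)
        for i in range(len(medoids)):         # try every medoid/non-medoid swap
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                c = total_cost(points, trial)
                if c < best[0]:
                    best = (c, trial)
        if best[1] == medoids:                # no swap improves: done
            return medoids, best[0]
        medoids = best[1]

medoids, cost = pam([1, 2, 10, 11], 2)
print(sorted(medoids), cost)
```

Because the centers are actual data points, a single far-away outlier shifts the result much less than it would shift a mean.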
Stated more formally: given n feature vectors X1, X2, ..., Xn and K < n, find cluster means m1, ..., mK minimizing the distance ||X - mi|| from each vector X to the mean mi of its cluster i.

a. Choose K initial means m1, m2, ..., mK.
b. Until no mean changes:
   - assign each sample to the cluster with the nearest mean;
   - for i = 1 to K: recompute mi as the mean of cluster i.
   end_until
c. Return the K means.

Because the result depends on the initial means, the algorithm is often run multiple times with different initializations.
Improvements on K-means for large data sets include CLARA (Clustering Large Applications, 1990) and CLARANS (Ng & Han, 1994). Sanpawat Kantabutra and Couch proposed a parallel K-means for networks of workstations (http://www.cs.tufts.edu/~{sanpawat,couch}), reducing the running time by a factor of about K/2.

Spherical K-Means

Spherical K-means clusters full-text (unstructured) documents represented in the Vector Space Model (VSM). In the VSM, w_ik is the weight of term k in document i, and document Di is the vector

Di = (w_i1, w_i2, ..., w_it)

in a t-dimensional term space.
Document similarity is measured by the cosine, which lies between 0 and 1. Term weights use term frequency, usually tf*idf (term frequency * inverse document frequency), where idf = log(N/df): N is the number of documents and df the number of documents containing the term. Each vector is normalized to length 1, ||Di|| = ||Dj|| = 1, so that the inner product of two document vectors equals their cosine.

Here tf_ik is the frequency of term k in document i, N is the number of documents, and df_k is the number of documents containing term k.
Example 3.3: documents D1, D2, D3 are first word-segmented; df and idf are then computed for each term.
Because most weights are 0, the VSM document vectors are sparse; these vectors are the input to the clustering.

Figure 3.8: overview of the spherical K-means process.

Spherical K-means differs from ordinary K-means in the distance used: instead of Euclidean distance it uses the cosine similarity, and each centroid is renormalized to unit length.
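A spherical K-means sketch: unit vectors, cosine (dot-product) assignment, centroids renormalized each round (toy vectors; initialization takes the first k documents):

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def spherical_kmeans(docs, k, iters=20):
    docs = [unit(d) for d in docs]
    centers = docs[:k]
    assign = []
    for _ in range(iters):
        # assign each document to the center with highest cosine (dot product)
        new_assign = [max(range(k),
                          key=lambda c: sum(a * b for a, b in zip(d, centers[c])))
                      for d in docs]
        if new_assign == assign:
            break
        assign = new_assign
        for c in range(k):   # recompute and renormalize each centroid
            members = [d for d, a in zip(docs, assign) if a == c]
            if members:
                centers[c] = unit([sum(col) for col in zip(*members)])
    return assign

docs = [(1, 0, 0), (0, 0, 1), (0.9, 0.1, 0), (0, 0.1, 0.9)]
print(spherical_kmeans(docs, 2))  # [0, 1, 0, 1]
```

Renormalizing the centroid keeps the comparison a pure direction (angle) comparison, which is what matters for length-normalized document vectors.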
In an experiment, 4800 Thai documents in 5 classes (of 1146, 1653, 828, 47, and 1126 documents) were segmented with the Longest Matching method, yielding 32675 distinct terms; clustering quality was evaluated with the F-measure.
(2) Fuzzy C-means (FCM)

Fuzzy C-means minimizes the objective function

J = Σ_{i=1..c} Σ_{j=1..n} (μ_ij)^m d^2(X_j, Z_i)

where X = {X_1, X_2, ..., X_n} is the data set of n points, c is the number of clusters, m > 1 is the fuzziness exponent, μ_ij is the membership of point X_j in cluster i, and d^2(X_j, Z_i) is the squared distance from point x_j to cluster center z_i.

Minimizing J yields the update equations

Z_i = Σ_{j=1..n} (μ_ij)^m X_j / Σ_{j=1..n} (μ_ij)^m

μ_ij = [1 / d^2(X_j, Z_i)]^{1/(m-1)} / Σ_{k=1..c} [1 / d^2(X_j, Z_k)]^{1/(m-1)}
The FCM iteration:
1. Initialize the centroids Z1, Z2, Z3, ..., Zc.
2. Calculate the memberships from the given centroids.
3. Calculate new centroids, then recompute the memberships and the objective function.
4. If the centroids still improve, go to step 2; otherwise stop.

The distance is usually the Euclidean distance:

ED^2_ji = (X_j - Z_i)^T (X_j - Z_i)

where ED_ji is the distance between data point X_j and centroid Z_i, and T denotes the matrix transpose.
Alternatively the Mahalanobis distance can be used:

MD^2_ji = (X_j - Z_i)^T A^{-1} (X_j - Z_i)

where MD_ji is the distance between data point X_j and centroid Z_i, and A is the variance-covariance matrix

A = (1/(n-1)) Σ_{j=1..n} (X_j - Z_i)(X_j - Z_i)^T

(the formulation above follows the Thai FCM paper at www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf).
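The update equations translate directly into code (1-D toy data, m = 2, Euclidean distance; the initial centers are assumptions for the demonstration):

```python
def fcm(points, centers, m=2.0, iters=50):
    c = len(centers)
    mu = []
    for _ in range(iters):
        # membership update: mu_ij = [1/d^2]^(1/(m-1)) / sum_k [1/d^2]^(1/(m-1))
        mu = []
        for x in points:
            inv = [(1.0 / max((x - z) ** 2, 1e-12)) ** (1.0 / (m - 1))
                   for z in centers]
            s = sum(inv)
            mu.append([v / s for v in inv])
        # center update: Z_i = sum_j mu_ij^m X_j / sum_j mu_ij^m
        centers = [sum(mu[j][i] ** m * points[j] for j in range(len(points)))
                   / sum(mu[j][i] ** m for j in range(len(points)))
                   for i in range(c)]
    return centers, mu

centers, mu = fcm([1.0, 2.0, 10.0, 11.0], centers=[0.0, 5.0])
print(centers)   # two centers, one near each group
print(mu[0])     # point 1.0 belongs mostly to the first cluster
```

Unlike K-means, every point keeps a graded membership in every cluster; hardening the result (taking the largest membership) recovers a crisp partition.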
(3) Hierarchical Clustering builds a hierarchical decomposition of the data from the distance matrix. It does not require the number of clusters k in advance, but it does need a termination condition.
AGNES (Agglomerative Nesting), introduced by Kaufmann and Rousseeuw (1990), merges clusters bottom-up using single-link distances; the result is a dendrogram (tree of clusters). DIANA (Divisive Analysis), also by Kaufmann and Rousseeuw (1990), works in the opposite direction, splitting clusters top-down.
Plain agglomerative clustering takes O(n^2) time for n objects, which does not scale to large data sets; scalable variants include BIRCH (1996), which builds a CF-tree, CURE (1998), and CHAMELEON (1999), with BIRCH and CURE the best known.

A generic hierarchical (agglomerative) algorithm: given N items and an N*N distance matrix,
1. Start with N clusters, one item per cluster, so that inter-cluster distances equal inter-item distances.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Recompute the distances between the new cluster and each remaining cluster.
4. Repeat steps 2 and 3 until all N items form a single cluster (or stop when K clusters remain).
Single-Linkage Clustering

Given an N*N proximity matrix D = [d(i,j)], the clusterings are numbered 0, 1, 2, ..., (n-1); L(k) is the level of the kth clustering, m is the sequence number of the current clustering, and d[(r),(s)] is the distance between clusters (r) and (s).
1. Start with the disjoint clustering at level L(0) = 0 and sequence number m = 0.
2. Find the closest pair of clusters (r), (s): d[(r),(s)] = min d[(i),(j)].
3. Set m = m + 1, merge (r) and (s) into a single cluster, and set the level L(m) = d[(r),(s)].
4. Update the proximity matrix D: delete the rows and columns for (r) and (s), and add a row and column for the new cluster, denoted (r,s), with d[(k),(r,s)] = min( d[(k),(r)], d[(k),(s)] ) for every remaining cluster (k).
5. If all objects are in one cluster, stop; otherwise repeat from step 2.
Example 3.4 (hierarchical clustering): single-linkage on the distances between the cities BA, FI, MI, NA, RM, TO.

Input distance matrix (L = 0):
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
The closest pair is MI and TO at distance 138, so they merge into MI/TO at level L(MI/TO) = 138, m = 1. Single-linkage clustering continues with the updated matrix:
BA FI MI/TO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MI/TO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0
min d(i,j) = d(NA,RM) = 219: merge NA and RM into NA/RM, L(NA/RM) = 219, m = 2.
BA FI MI/TO NA/RM
BA 0 662 877 255
FI 662 0 295 268
MI/TO 877 295 0 564
NA/RM 255 268 564 0
min d(i,j) = d(BA,NA/RM) = 255: merge BA and NA/RM into BA/NA/RM, L(BA/NA/RM) = 255, m = 3.
BA/NA/RM FI MI/TO
BA/NA/RM 0 268 564
FI 268 0 295
MI/TO 564 295 0
min d(i,j) = d(BA/NA/RM,FI) = 268: merge BA/NA/RM and FI into BA/NA/RM/FI, L(BA/NA/RM/FI) = 268, m = 4.
BA/NA/RM/FI MI/TO
BA/NA/RM/FI 0 295
MI/TO 295 0
Finally the last two clusters merge at level 295. The resulting hierarchical tree (dendrogram) over BA, NA, RM, FI, MI, TO reflects the merge levels 138, 219, 255, 268, and 295.
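The worked example can be replayed in code; the merge levels come out as 138, 219, 255, 268, 295 (a naive O(n^3) sketch of the procedure above):

```python
def single_linkage(labels, d):
    # d: dict {frozenset({a, b}): distance}; returns (level, merged cluster) list
    clusters = [frozenset([x]) for x in labels]
    def dist(c1, c2):  # single link: minimum pairwise distance
        return min(d[frozenset([a, b])] for a in c1 for b in c2)
    history = []
    while len(clusters) > 1:
        lvl, c1, c2 = min((dist(a, b), a, b)
                          for i, a in enumerate(clusters)
                          for b in clusters[i + 1:])
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        history.append((lvl, sorted(c1 | c2)))
    return history

pairs = {('BA', 'FI'): 662, ('BA', 'MI'): 877, ('BA', 'NA'): 255,
         ('BA', 'RM'): 412, ('BA', 'TO'): 996, ('FI', 'MI'): 295,
         ('FI', 'NA'): 468, ('FI', 'RM'): 268, ('FI', 'TO'): 400,
         ('MI', 'NA'): 754, ('MI', 'RM'): 564, ('NA', 'RM'): 219,
         ('NA', 'TO'): 869, ('RM', 'TO'): 669, ('MI', 'TO'): 138}
d = {frozenset(k): v for k, v in pairs.items()}
history = single_linkage(['BA', 'FI', 'MI', 'NA', 'RM', 'TO'], d)
print([lvl for lvl, _ in history])  # [138, 219, 255, 268, 295]
```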
This algorithm is O(n^2), where n is the number of objects.
(4) Mixture of Gaussians clustering is model-based: each cluster is modeled by a component distribution, typically a Gaussian (a Poisson is also possible), and the data are modeled by a mixture of these component distributions.

For a mixture of Gaussians, cluster i generates points from a Gaussian N(μ_i, σ^2 I) and is chosen with prior probability P(ω_i), so the density of a data point X is

P(X | μ_1, μ_2, ..., μ_k) = Σ_i P(ω_i) P(X | ω_i, μ_i)

The EM (Expectation-Maximization) algorithm fits the mixture of Gaussians.
A worked example of maximum likelihood and EM. Grades fall into four categories with probabilities P(X1) = 1/2, P(X2) = μ, P(X3) = 2μ, P(X4) = 1/2 - 3μ; for instance, observed counts X1 = 30, X2 = 18, X3 = 0, X4 = 23.

Step 1 (all counts observed):
X1: a students
X2: b students
X3: c students
X4: d students

The likelihood is P(a,b,c,d | μ) ∝ (0.5)^a (μ)^b (2μ)^c (0.5 - 3μ)^d. Setting ∂P/∂μ = 0 via the log-likelihood

L = log P(μ) = a log(0.5) + b log(μ) + c log(2μ) + d log(0.5 - 3μ)

∂L/∂μ = b/μ + c/μ - 3d/(0.5 - 3μ) = 0

gives

μ = (b + c) / ( 6 (b + c + d) )

With a = 14, b = 6, c = 9, d = 10 this yields μ = 1/10.
Step 2 (incomplete data): suppose only the combined count is observed,
x1 + x2: h students
x3: c students
x4: d students

EM then alternates between:

E-step: given the current μ, split h into the expected counts
a = h * (1/2) / (1/2 + μ),  b = h * μ / (1/2 + μ)

M-step: re-estimate
μ = (b + c) / ( 6 (b + c + d) )

and iterates until μ converges.
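Steps 1 and 2 can be checked numerically (counts from the example; the stopping tolerance is our choice):

```python
def ml_estimate(b, c, d):
    # complete-data maximum likelihood: mu = (b + c) / (6 (b + c + d))
    return (b + c) / (6 * (b + c + d))

def em(h, c, d, mu=0.05, tol=1e-10):
    # only h = a + b is observed; iterate E and M steps until mu stabilizes
    while True:
        b = h * mu / (0.5 + mu)        # E-step: expected count for X2
        new_mu = ml_estimate(b, c, d)  # M-step
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu

print(ml_estimate(6, 9, 10))   # 0.1 (the worked example: a=14, b=6, c=9, d=10)
print(round(em(20, 9, 10), 4))
```

With h = a + b = 20, c = 9, d = 10, the EM iteration settles near μ ≈ 0.0912, slightly below the complete-data estimate of 0.1.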
The EM algorithm for a mixture of Gaussians:

Step 1: Initialize the parameters
λ_0 = { μ_1, μ_2, ..., μ_k, p_1, p_2, ..., p_k }

Step 2 (E-step): compute the posterior probability of each component k for each data point x_t,

p(ω_k | x_t, λ) = p(x_t | ω_k, λ) p(ω_k) / p(x_t | λ)
               = p(x_t | μ_k, σ_k^2) p_k / Σ_j p(x_t | μ_j, σ_j^2) p_j

Step 3 (M-step): re-estimate the parameters

μ_i^(t+1) = Σ_k p(ω_i | x_k, λ) x_k / Σ_k p(ω_i | x_k, λ)

p_i^(t+1) = Σ_k p(ω_i | x_k, λ) / R

where R is the number of data points. Steps 2 and 3 repeat until convergence.
(5) Genetic Algorithm: introduced by John Holland in 1975 as a general optimization technique, applicable to clustering documents, keywords, and queries. A genetic algorithm repeats 5 steps:
a. encode candidate solutions as chromosomes and create an initial population;
b. evaluate the fitness of each individual;
c. select the fittest individuals to survive;
d. create new individuals by crossover;
e. apply mutation, and repeat from step b.
Example 3.5: five documents described by keywords:

DOC1 = {Database, Query, Data Retrieval, Computer Network, DBMS}
DOC2 = {Artificial Intelligence, Internet, Indexing, Natural Language Processing}
DOC3 = {Database, Expert System, Information Retrieval System, Multimedia}
DOC4 = {Fuzzy Logic, Neural Network, Computer Networks}
DOC5 = {Object-Oriented, DBMS, Query, Indexing}

The 16 distinct keywords, in order, are: Artificial Intelligence, Computer Network, Data Retrieval, Database, DBMS, Expert System, Fuzzy Logic, Indexing, Information Retrieval System, Internet, Multimedia, Natural Language Processing, Neural Network, Object Oriented, Query, Relational Database.

Each document is encoded as a 16-bit vector over these keywords:

DOC1=0110100000000011
DOC2=1000000101010000
DOC3=0001010010100000
DOC4=0100001000001000
DOC5=0000100100000110
A query is likewise encoded as a 16-bit vector. The similarity between the query and each document can be computed with the Dice coefficient, Cosine coefficient, or Jaccard coefficient, each ranging from 0.0 to 1.0, where 1.0 is a perfect match. This similarity serves as the fitness of each individual.

Following survival of the fittest, the fittest individuals are kept, and new individuals are produced by crossover: two parent bit strings are cut at a crossover point (here after bit 8) and their tails exchanged:

101111110011101
100110011110000

produce the offspring

101111111110000
100110010011101
Mutation then inverts randomly chosen bits of an individual, e.g.

101111110011101
10101111110111101

Individuals whose fitness exceeds a threshold form the resulting cluster.
Review Questions

Review the following topics:
1. The Simple, Dice's, Jaccard's, Cosine, and Overlap coefficients.
2. Jaccard's coefficient.
3. Dice's coefficient.
4. Monothetic and polythetic classes.
5. Exclusive and overlapping classes.
6. Ordered classification.
7. Clustering.
8. Clustering algorithms.
9. ...
10. The graph-theoretic method.
11. The single-link method.
12. Rocchio's algorithm.
13. K-means.
14. PAM.
15. Fuzzy C-means (FCM).
16. Single-linkage clustering.
17. Genetic algorithms.
References

[Thai-language textbook], CS337, 2535.

[Thai-language article], Technical Journal, Vol. 11, No. 7, March-June 2000.

Sanpawat Kantabutra and Alva L. Couch, "Parallel K-means Clustering Algorithm on NOWs", Department of Computer Science, Tufts University, Medford, Massachusetts, www.nectec.or.tn/NTJ/n06/papers/No6_short_1.pdf

[Thai-language textbook].

[Thai-language paper], The Joint Conference on Computer Science and Software Engineering, November 17-18, 2005, www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf

"Spherical K-Means" [Thai-language paper], Intelligent Information Retrieval and Database Laboratory, Department of Computer Science, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand.