Author : Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, and Dharmendra S. Modha
Source : KDD ’04, August 22-25, 2004, ACM, pp. 509-514
Presenter : Allen Wu
112/04/09
Introduction
Bregman divergences
Bregman co-clustering
Algorithm
Experiments
Conclusion
Information-theoretic co-clustering (ITCC) models the co-clustering problem in terms of a joint probability distribution $p(X, Y)$.
We seek a co-clustering of both dimensions such that the loss in “Mutual Information” is minimized, given a fixed number of row and column clusters:
$$\min_{\hat{X}, \hat{Y}} \; I(X; Y) - I(\hat{X}; \hat{Y})$$
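The loss in mutual information equals
$$I(X; Y) - I(\hat{X}; \hat{Y}) = D_{KL}\big(p(x, y) \,\|\, q(x, y)\big)$$
where
$$q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y}), \quad \text{where } x \in \hat{x},\ y \in \hat{y}$$
It can be shown that q(x,y) is a “maximum entropy” approximation to p(x,y).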
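The identity above can be checked numerically. The following is a minimal sketch, assuming a small hypothetical 4×4 joint distribution and a hypothetical 2×2 co-clustering (neither is taken from the slides); the function names are illustrative only.

```python
import numpy as np

def mutual_information(p):
    """Mutual information (in nats) of a joint distribution stored as a 2-D array."""
    px = p.sum(axis=1, keepdims=True)   # row marginal, shape (m, 1)
    py = p.sum(axis=0, keepdims=True)   # column marginal, shape (1, n)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def itcc_approximation(p, row_labels, col_labels):
    """Build q(x,y) = p(xhat,yhat) p(x|xhat) p(y|yhat) and the clustered joint p(xhat,yhat)."""
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    R = np.eye(row_labels.max() + 1)[row_labels]   # m x k row-cluster indicators
    C = np.eye(col_labels.max() + 1)[col_labels]   # n x l column-cluster indicators
    p_hat = R.T @ p @ C                            # joint distribution of (xhat, yhat)
    px, py = p.sum(axis=1), p.sum(axis=0)
    px_hat, py_hat = R.T @ px, C.T @ py
    q = p_hat[row_labels][:, col_labels]
    q = q * (px / px_hat[row_labels])[:, None] * (py / py_hat[col_labels])[None, :]
    return q, p_hat

def kl_divergence(p, q):
    """KL divergence D(p || q) in nats, skipping zero entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical joint distribution (entries sum to 1) and co-clustering.
p = np.array([[0.10, 0.08, 0.02, 0.01],
              [0.09, 0.11, 0.01, 0.02],
              [0.02, 0.01, 0.12, 0.13],
              [0.01, 0.02, 0.14, 0.11]])
q, p_hat = itcc_approximation(p, row_labels=[0, 0, 1, 1], col_labels=[0, 0, 1, 1])

loss = mutual_information(p) - mutual_information(p_hat)
divergence = kl_divergence(p, q)
```

Here `loss` and `divergence` should agree up to floating-point error, and `q` is itself a joint distribution with the same row and column marginals as `p`.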
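Example:
p(y) = (0.18, 0.18, 0.14, 0.14, 0.18, 0.18)
p(x) = (0.15, 0.15, 0.15, 0.15, 0.2, 0.2)
$$q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y}) = p(\hat{x}, \hat{y})\, \frac{p(x)}{p(\hat{x})}\, \frac{p(y)}{p(\hat{y})}$$
p(ŷ) = (0.5, 0.5)
p(x̂) = (0.3, 0.3, 0.4)
$$0.3 \times \frac{0.15}{0.3} \times \frac{0.18}{0.5} = 0.054$$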
[Figures: D(p||q) values for the example co-clusterings]
However, the matrix may contain negative entries, or a distortion measure other than KL-divergence, such as the squared Euclidean distance, might be more appropriate.
This paper addresses the general situation by extending ITCC along three directions:
“Nearness” is now measured by any Bregman divergence.
A larger class of constraints can be specified.
The maximum entropy approach is generalized.
The objective function is
$$\min_{\{\mu_1, \dots, \mu_k\}} \; \sum_{h=1}^{k} \sum_{x \in \mathcal{X}_h} \| x - \mu_h \|^2$$
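As a rough illustration, the objective can be read as the total squared Euclidean distance from each point to the representative of its cluster. The sketch below assumes that reading, and assumes each representative is the cluster mean (the minimizer for a fixed partition under squared distance); the data and the function name are hypothetical.

```python
import numpy as np

def clustering_objective(points, labels, k):
    """Sum over clusters of squared distances from members to the cluster mean."""
    total = 0.0
    for h in range(k):
        members = points[labels == h]
        if len(members) == 0:
            continue                      # an empty cluster contributes nothing
        mu_h = members.mean(axis=0)       # best representative for a fixed partition
        total += float(np.sum((members - mu_h) ** 2))
    return total

# Hypothetical data: two well-separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = clustering_objective(points, np.array([0, 0, 1, 1]), k=2)
bad = clustering_objective(points, np.array([0, 1, 0, 1]), k=2)
```

Grouping nearby points (`good`) gives a much smaller objective value than mixing the two pairs (`bad`).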
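Let $\phi$ be a real-valued strictly convex function defined on the convex set $S = \mathrm{dom}(\phi) \subseteq \mathbb{R}$, such that $\phi$ is differentiable on $\mathrm{int}(S)$, the interior of $S$.
The Bregman divergence $d_\phi : S \times \mathrm{int}(S) \to [0, \infty)$ is defined as
$$d_\phi(z_1, z_2) = \phi(z_1) - \phi(z_2) - \langle z_1 - z_2, \nabla\phi(z_2) \rangle$$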
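I-Divergence: Given $z \in \mathbb{R}_+$, let $\phi(z) = z \log z$. For $z_1, z_2 \in \mathbb{R}_+$,
$$d_\phi(z_1, z_2) = z_1 \log(z_1 / z_2) - (z_1 - z_2)$$
Squared Euclidean Distance: Given $z \in \mathbb{R}$, let $\phi(z) = z^2$. For $z_1, z_2 \in \mathbb{R}$,
$$d_\phi(z_1, z_2) = (z_1 - z_2)^2$$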
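Both special cases follow from the general definition by plugging in the convex function and its derivative. A minimal sketch for scalar arguments, with illustrative function names and arbitrarily chosen test values:

```python
import math

def bregman(phi, grad_phi, z1, z2):
    """Scalar Bregman divergence: phi(z1) - phi(z2) - (z1 - z2) * phi'(z2)."""
    return phi(z1) - phi(z2) - (z1 - z2) * grad_phi(z2)

def i_divergence(z1, z2):
    """Bregman divergence generated by phi(z) = z log z, for positive z1, z2."""
    return bregman(lambda z: z * math.log(z), lambda z: math.log(z) + 1.0, z1, z2)

def squared_euclidean(z1, z2):
    """Bregman divergence generated by phi(z) = z^2."""
    return bregman(lambda z: z * z, lambda z: 2.0 * z, z1, z2)

z1, z2 = 3.0, 2.0                                      # arbitrary positive values
i_div = i_divergence(z1, z2)
i_closed = z1 * math.log(z1 / z2) - (z1 - z2)          # closed form for the I-divergence
sq = squared_euclidean(z1, z2)                         # should equal (z1 - z2)^2
```

The generic definition and the closed forms agree, and both divergences are zero when `z1 == z2`.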
Bregman information is defined as the expected Bregman divergence to the expectation:
$$I_\phi(Z) = E[d_\phi(Z, E[Z])]$$
I-Divergence: Given a real non-negative random variable $Z$, the Bregman information is
$$I_\phi(Z) = E[Z \log(Z / E[Z])]$$
Squared Euclidean Distance: Given any real random variable $Z$, the Bregman information is
$$I_\phi(Z) = E[(Z - E[Z])^2]$$
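For a discrete random variable, the definition can be evaluated directly. The sketch below assumes a hypothetical three-valued distribution; for squared Euclidean distance the result is the variance, and for the I-divergence the term E[Z - E[Z]] vanishes, leaving E[Z log(Z/E[Z])].

```python
import numpy as np

def bregman_information(values, probs, divergence):
    """Expected divergence from Z to its own mean, for a discrete Z."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mean = float(np.sum(probs * values))
    return float(np.sum(probs * divergence(values, mean)))

def sq_euclid(z1, z2):
    return (z1 - z2) ** 2

def i_div(z1, z2):
    return z1 * np.log(z1 / z2) - (z1 - z2)

# Hypothetical distribution: Z takes the values 1, 2, 4 with these probabilities.
values = [1.0, 2.0, 4.0]
probs = [0.5, 0.25, 0.25]

info_sq = bregman_information(values, probs, sq_euclid)   # variance of Z
info_i = bregman_information(values, probs, i_div)        # E[Z log(Z / E[Z])]
```

With these numbers E[Z] = 2, so the variance is 1.5 and the I-divergence case works out to 0.5 log 2.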
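Let $(X, Y) \sim p(X, Y)$ be jointly distributed random variables, with $X$ taking values in $\{x_u\}$, $[u]_1^m$, and $Y$ taking values in $\{y_v\}$, $[v]_1^n$.
$p(X, Y)$ can be written in the form of the matrix $Z$:
$$Z = [z_{uv}], \quad [u]_1^m, \ [v]_1^n, \quad z_{uv} = p(x_u, y_v)$$
The quality of the co-clustering can be defined as
$$E[d_\phi(Z, \hat{Z})] = \sum_{u=1}^{m} \sum_{v=1}^{n} w_{uv}\, d_\phi(z_{uv}, \hat{z}_{uv})$$
where $w_{uv}$ is the probability measure on the matrix entries and $\hat{Z}$ is uniquely determined by the co-clustering $(\rho, \gamma)$.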
A co-clustering $(\rho, \gamma)$ involves four random variables, $\{U, V, \hat{U}, \hat{V}\}$, corresponding to the various partitionings of the matrix $Z$.
We can obtain different matrix approximations based on the statistics of $Z$ corresponding to the non-trivial combinations of $\{U, V, \hat{U}, \hat{V}\}$:
$$\{\{U, \hat{V}\}, \{\hat{U}, V\}, \{\hat{U}, \hat{V}\}, \{U\}, \{V\}, \{\hat{U}\}, \{\hat{V}\}\}$$
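(Γ) denotes the class of matrix approximation schemes based on $(\rho, \gamma)$.
Four constraint sets are considered:
$$C_1 = \{\{\hat{U}\}, \{\hat{V}\}\}, \quad C_2 = \{\{\hat{U}, \hat{V}\}\}$$
$$C_3 = \{\{\hat{U}, \hat{V}\}, \{U\}, \{V\}\}, \quad C_4 = \{\{U, \hat{V}\}, \{\hat{U}, V\}\}$$
The set of approximations $M_A(\rho, \gamma, C)$ consists of all $Z' \in S^{m \times n}$ that preserve the statistics of $Z$ specified by $C$.
The “best” approximation $\hat{Z}$ is
$$\hat{Z} = \operatorname*{argmin}_{Z' \in M_A(\rho, \gamma, C)} E[d_\phi(Z, Z')]$$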
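As a minimal sketch of one such approximation, assume squared Euclidean distance, uniform weights on the matrix entries, and that only the co-cluster statistics over (Û, V̂) are preserved. Under these assumptions each entry is replaced by the mean of its co-cluster. The matrix, the cluster assignments, and the function names below are hypothetical.

```python
import numpy as np

def cocluster_mean_approximation(Z, row_labels, col_labels):
    """Replace every entry by the mean of the co-cluster (block) it belongs to."""
    Z = np.asarray(Z, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    Z_hat = np.empty_like(Z)
    for g in np.unique(row_labels):
        for h in np.unique(col_labels):
            block = np.ix_(row_labels == g, col_labels == h)
            Z_hat[block] = Z[block].mean()
    return Z_hat

def expected_squared_loss(Z, Z_hat):
    """E[d(Z, Z_hat)] for squared Euclidean distance with uniform weights."""
    return float(np.mean((np.asarray(Z, dtype=float) - Z_hat) ** 2))

# Hypothetical 3 x 4 matrix with 2 row clusters and 2 column clusters.
Z = np.array([[1.0, 1.0, 5.0, 5.0],
              [1.0, 3.0, 5.0, 7.0],
              [9.0, 9.0, 2.0, 2.0]])
Z_hat = cocluster_mean_approximation(Z, row_labels=[0, 0, 1], col_labels=[0, 0, 1, 1])
loss = expected_squared_loss(Z, Z_hat)
```

The block means here are 1.5, 5.5, 9 and 2, giving an average squared error of 0.5. Richer constraint sets would also preserve row or column statistics and are not covered by this sketch.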
We present brief case studies to demonstrate two salient features:
Dimensionality reduction
Missing value prediction
Clustering interleaved with implicit dimensionality reduction
Superior performance as compared to one-sided clustering
Assign zero measure for missing elements, co-cluster and use reconstructed matrix for prediction
Implicit discovery of correlated sub-matrices
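The reconstruction step can be sketched as follows. This is an illustration only: it assumes the row and column cluster assignments are already given (the co-clustering step itself is not shown), uses squared Euclidean distance, and fills each missing entry with the weighted mean of its co-cluster, where missing entries carry zero weight. All names and data are hypothetical.

```python
import numpy as np

def predict_missing(Z, observed, row_labels, col_labels):
    """Fill unobserved entries with the mean of the observed entries in their co-cluster."""
    Z = np.asarray(Z, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    W = observed.astype(float)                  # zero measure for missing elements
    Z_safe = np.where(observed, Z, 0.0)         # keep NaNs out of the sums
    Z_hat = np.zeros_like(Z_safe)
    for g in np.unique(row_labels):
        for h in np.unique(col_labels):
            block = np.ix_(row_labels == g, col_labels == h)
            weight = W[block].sum()
            if weight > 0:                      # leave fully unobserved blocks at 0
                Z_hat[block] = (W[block] * Z_safe[block]).sum() / weight
    return np.where(observed, Z, Z_hat)

# Hypothetical ratings-style matrix with two missing entries (NaN).
Z = np.array([[5.0, 5.0, 1.0, np.nan],
              [5.0, np.nan, 1.0, 1.0],
              [2.0, 2.0, 4.0, 4.0]])
observed = ~np.isnan(Z)
filled = predict_missing(Z, observed, row_labels=[0, 0, 1], col_labels=[0, 0, 1, 1])
```

In this toy case the missing entries are filled with 1 and 5, the values shared by the other entries of their co-clusters, while observed entries are left unchanged.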
Bregman divergences serve as the co-clustering loss function, with I-divergence and squared Euclidean distance as special cases.
Approximation models of various complexities are possible, depending on which statistics are preserved.
The minimum Bregman information principle generalizes the maximum entropy principle.