C. Ding, NMF => Unsupervised Clustering 1
(Semi-)Nonnegative Matrix Factorization and K-means Clustering

Chris Ding, Lawrence Berkeley National Laboratory

with: Xiaofeng He (Lawrence Berkeley Nat'l Lab), Horst Simon (Lawrence Berkeley Nat'l Lab), Tao Li (Florida Int'l Univ.), Michael Jordan (UC Berkeley), Haesun Park (Georgia Tech)
Nonnegative Matrix Factorization (NMF)

Data matrix: n points in p dimensions, X = (x_1, x_2, ..., x_n),
where each x_i is an image, document, webpage, etc.

Decomposition (low-rank approximation): X ≈ F G^T,
with F = (f_1, f_2, ..., f_k) and G = (g_1, g_2, ..., g_k).

Nonnegative matrices: X_ij ≥ 0, F_ij ≥ 0, G_ij ≥ 0.
Some historical notes

• Earlier work by statistics people (G. Golub)
• P. Paatero (1994), Environmetrics
• Lee and Seung (1999, 2000)
  – Parts of whole (no cancellation)
  – A multiplicative update algorithm
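The multiplicative update algorithm mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the Lee-Seung updates for the Frobenius objective ||X − FG^T||²; the test matrix, rank k, and iteration count are arbitrary choices for demonstration, not values from the talk:

```python
import numpy as np

def nmf(X, k, n_iter=200, seed=0):
    """Lee-Seung multiplicative updates for min ||X - F G^T||_F^2, F, G >= 0."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k)) + 0.1   # p x k nonnegative factor
    G = rng.random((n, k)) + 0.1   # n x k nonnegative factor
    eps = 1e-10                    # guard against division by zero
    for _ in range(n_iter):
        F *= (X @ G) / (F @ (G.T @ G) + eps)    # F <- F * (XG) / (F G^T G)
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)  # G <- G * (X^T F) / (G F^T F)
    return F, G

# nonnegative data: residual shrinks, factors stay elementwise nonnegative
X = np.abs(np.random.default_rng(1).random((8, 12)))
F, G = nmf(X, k=4)
err = np.linalg.norm(X - F @ G.T)
```

Because both updates multiply by nonnegative ratios, F and G remain nonnegative throughout, which is exactly the "no cancellation" property of the factorization.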
[Figure: a face image flattened into a single nonnegative pixel vector]
Lee and Seung (1999): Parts-based Perspective

X = (x_1, x_2, ..., x_n) ≈ F G^T,  F = (f_1, f_2, ..., f_k),  G = (g_1, g_2, ..., g_k)

[Figure: original images and learned factors]
"Parts of Whole" Picture

X ≈ F G^T,  F = (f_1, f_2, ..., f_k)

Straightforward NMF doesn't give a parts-based picture.
Several people explicitly sparsify F to get a parts-based picture
(Li et al., 2001; Hoyer, 2003).
Donoho & Stodden (2003) study conditions for parts-of-whole.
Meanwhile ... a number of studies empirically show the usefulness of NMF
for pattern discovery/clustering:
  Xu et al. (SIGIR'03)
  Brunet et al. (PNAS'04)
  Many others

We claim: NMF factors give holistic pictures of the data.
Our Experiments: NMF gives holistic pictures

[Figures]
Task: Prove NMF is doing "Data Clustering"

NMF => K-means Clustering
NMF-Kmeans Theorem

  min_{F≥0, G≥0, G^T G = I} ||X − F G^T||²

G-orthogonal NMF is equivalent to relaxed K-means clustering.

Proof:
  min_{G≥0, G^T G = I} Tr(X^T X − G^T X^T X G)

(Ding, He, Simon, SDM 2005)
K-means clustering

• Also called "isodata", "vector quantization"
• Developed in the 1960's (Lloyd, MacQueen, Hartigan, etc.)
• Computationally efficient (order mN)
• Most widely used in practice
  – Benchmark to evaluate other algorithms

Given n points in m dimensions, X = (x_1, x_2, ..., x_n)^T,
the K-means objective is

  min J_K = Σ_{k=1}^{K} Σ_{i∈C_k} ||x_i − c_k||²
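A minimal NumPy sketch of the Lloyd iteration for this objective; the two-blob test data and the explicit initialization are illustrative assumptions, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iter=50, centers=None, seed=0):
    """Lloyd's algorithm for min sum_k sum_{i in C_k} ||x_i - c_k||^2.
    Rows of X are the n points; `centers` optionally fixes the init."""
    rng = np.random.default_rng(seed)
    if centers is None:
        centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(n_iter):
        # n x k table of squared distances, then nearest-center assignment
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):           # recompute each centroid as a cluster mean
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels, centers

# two well-separated blobs; seed one initial center in each blob
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, centers = kmeans(X, 2, centers=X[[0, 20]].copy())
```

Each iteration costs on the order of m·N·k work, which is the "order mN" efficiency claimed above.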
Reformulate K-means Clustering

  J_K = Σ_i ||x_i||² − Σ_{k=1}^{K} (1/n_k) Σ_{i,j∈C_k} x_i^T x_j

Cluster membership indicators:

  h_k = (0,...,0, 1,...,1, 0,...,0)^T / n_k^{1/2}

  J_K = Σ_i ||x_i||² − Σ_{k=1}^{K} h_k^T X^T X h_k,   H = (h_1, ..., h_K)

Solving K-means  =>  max_{H≥0, H^T H = I} Tr(H^T X^T X H)

(Zha, Ding, Gu, He, Simon, NIPS 2001)  (Ding & He, ICML 2004)
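The identity J_K = Σ_i ||x_i||² − Tr(H^T X^T X H), with H built from the normalized indicator vectors h_k, can be checked numerically. The five-point example below is a made-up illustration (columns of X are the points, matching the notation X = (x_1, ..., x_n)):

```python
import numpy as np

# 5 points in 2-D, clusters C1 = {0,1,2}, C2 = {3,4}
X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.], [6., 5.]]).T
labels = np.array([0, 0, 0, 1, 1])

# normalized indicators h_k = (0..0,1..1,0..0)^T / sqrt(n_k)
n = X.shape[1]
H = np.zeros((n, 2))
for k in range(2):
    idx = labels == k
    H[idx, k] = 1.0 / np.sqrt(idx.sum())

# direct K-means objective: within-cluster squared distances to the mean
J_direct = sum(
    np.sum((X[:, labels == k] - X[:, labels == k].mean(1, keepdims=True)) ** 2)
    for k in range(2))

# trace form from the reformulation
J_trace = np.sum(X ** 2) - np.trace(H.T @ X.T @ X @ H)
```

The two quantities agree exactly, because h_k^T X^T X h_k = n_k ||c_k||² for each cluster.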
Reformulate K-means Clustering

Cluster membership indicators for three clusters C1, C2, C3:

  H = (h_1, h_2, h_3),
  h_1^T = (1 1 0 0 0 0 0)
  h_2^T = (0 0 1 1 1 0 0)
  h_3^T = (0 0 0 0 0 1 1)
NMF-Kmeans Theorem

  min_{F≥0, G≥0, G^T G = I} ||X − F G^T||²

G-orthogonal NMF is equivalent to relaxed K-means clustering.

Proof:
  min_{G≥0, G^T G = I} Tr(X^T X − G^T X^T X G)

(Ding, He, Simon, SDM 2005)
Kernel K-means Clustering

Map feature vectors to a higher-dimensional space: x_i → φ(x_i)

Kernel K-means objective:

  min J_K^φ = Σ_{k=1}^{K} Σ_{i∈C_k} ||φ(x_i) − φ(c_k)||²,
  φ(c_k) ≡ (1/n_k) Σ_{i∈C_k} φ(x_i)

Kernel K-means optimization:

  J_K^φ = Σ_i ||φ(x_i)||² − Σ_{k=1}^{K} (1/n_k) Σ_{i,j∈C_k} φ(x_i)^T φ(x_j)

Minimizing J_K^φ is equivalent to

  max Σ_{k=1}^{K} (1/n_k) Σ_{i,j∈C_k} ⟨φ(x_i), φ(x_j)⟩ = max Tr(H^T W H)

where W is the matrix of pairwise similarities, W_ij = ⟨φ(x_i), φ(x_j)⟩.
Symmetric NMF

Symmetric NMF: W ≈ H H^T, where W is a symmetric nonnegative matrix.

  min_{H≥0, H^T H = I} ||W − H H^T||²

is equivalent to

  max_{H≥0, H^T H = I} Tr(H^T W H)

Orthogonal symmetric NMF is equivalent to kernel K-means clustering.
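A quick numerical sketch of symmetric NMF, W ≈ HH^T, using a damped multiplicative update in the style of the SDM 2005 algorithm (the 1/2 damping factor and the block similarity matrix below are illustrative assumptions):

```python
import numpy as np

def sym_nmf(W, k, n_iter=300, seed=0):
    """Multiplicative updates for min ||W - H H^T||^2 with H >= 0.
    Damped rule H <- H * (1/2 + (W H) / (2 H H^T H))."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[0], k)) + 0.1
    for _ in range(n_iter):
        H *= 0.5 + 0.5 * (W @ H) / (H @ (H.T @ H) + 1e-10)
    return H

# block similarity matrix: two obvious clusters
W = np.array([[1.0, 0.9, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.9, 0.0, 0.0],
              [0.9, 0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 0.9],
              [0.0, 0.0, 0.0, 0.9, 1.0]])
H = sym_nmf(W, 2)
err = np.linalg.norm(W - H @ H.T)
```

On such a clean block matrix the learned H has one dominant column per block, so reading off argmax over columns of H gives the clustering, as the equivalence above predicts.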
Orthogonality in NMF

Strictly orthogonal G: hard clustering.
Non-orthogonal G: soft clustering, H = (h_1, h_2).
Ambiguous/outlier points.

[Figure: data X = (x_1, x_2, ..., x_n)]
K-means Clustering Theorem

  min_{G≥0, G^T G = I} ||X_± − F_± G_+^T||²

G-orthogonal NMF is equivalent to relaxed K-means clustering.
The proof requires only G-orthogonality and nonnegativity.

F = (f_1, f_2, ..., f_k)  => cluster centroids
G = (g_1, g_2, ..., g_k)  => cluster indicators

(Ding, Li, Jordan, 2006)
NMF Generalizations

SVD:         X_± = F_± G_±^T = U Σ V^T
Semi-NMF:    X_± = F_± G_+^T
Convex-NMF:  X_± = X_± W_+ G_+^T
Kernel-NMF:  φ(X)_± = φ(X)_± W_+ G_+^T
Tri-NMF:     X_± = F_± S_+ G_+^T

(Ding, Li, Jordan, 2006)  (Ding, Li, Peng, Park, KDD 2006)
Semi-NMF

Semi-NMF: X_± = F_± G_+^T
• For any mixed-sign input data (e.g. centered data)
• Clustering and low-rank approximation

Objective:  min ||X − F G^T||²

Update F:  F = X G (G^T G)^{-1}

Update G:  G_ik ← G_ik √( [(X^T F)^+_ik + (G (F^T F)^-)_ik] /
                          [(X^T F)^-_ik + (G (F^T F)^+)_ik] )

where A^+_ik = (|A_ik| + A_ik)/2 and A^-_ik = (|A_ik| − A_ik)/2.

(Ding, Li, Jordan, 2006)
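The two update rules translate directly into NumPy. The sketch below uses the elementwise positive/negative parts A^+ = (|A| + A)/2 and A^- = (|A| − A)/2; the random mixed-sign test data and iteration count are illustrative assumptions:

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2   # elementwise positive part A^+
def neg(A): return (np.abs(A) - A) / 2   # elementwise negative part A^-

def semi_nmf(X, k, n_iter=200, seed=0):
    """Semi-NMF: mixed-sign X ~ F G^T with only G constrained nonnegative.
    F is solved exactly by least squares; G by the multiplicative rule."""
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[1], k)) + 0.1
    eps = 1e-10
    for _ in range(n_iter):
        F = X @ G @ np.linalg.inv(G.T @ G + eps * np.eye(k))
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF)) /
                     (neg(XtF) + G @ pos(FtF) + eps))
    return F, G

# centered (mixed-sign) data
rng = np.random.default_rng(3)
X = rng.normal(size=(6, 15))
F, G = semi_nmf(X, k=3)
err = np.linalg.norm(X - F @ G.T)
```

The positive/negative split keeps every multiplier nonnegative, so G never leaves the feasible region even though X and F carry mixed signs.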
Convex-NMF

In NMF:       X_+ = F_+ G_+^T
In Semi-NMF:  X_± = F_± G_+^T, where F is in a large space.

For the factor f_k to capture the notion of a cluster centroid,
require f_k to be a convex combination of the input data:

  f_k = w_1k x_1 + ... + w_nk x_n,   i.e.  F = X W,  W ≥ 0

Convex-NMF:  X_± = X_± W_+ G_+^T

For F interpretability:  F = X W_±  (affine combination)

(Ding, Li, Jordan, 2006)
Convex-NMF: Computing Algorithm

Convex-NMF:  X_± = X_± W_+ G_+^T

Objective:  min ||X − X W G^T||²

With Y = X^T X split into positive and negative parts, Y = Y^+ − Y^-:

Update W:  W_ik ← W_ik √( [(Y^+ G)_ik + (Y^- W G^T G)_ik] /
                          [(Y^- G)_ik + (Y^+ W G^T G)_ik] )

Update G:  G_ik ← G_ik √( [(Y^+ W)_ik + (G W^T Y^- W)_ik] /
                          [(Y^- W)_ik + (G W^T Y^+ W)_ik] )

(Ding, Li, Jordan, 2006)
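A NumPy sketch of these updates (the random mixed-sign data, rank, and iteration count are illustrative assumptions). Note the algorithm touches the data only through Y = X^T X, which is what enables the kernel version later:

```python
import numpy as np

def pos(A): return (np.abs(A) + A) / 2   # elementwise positive part
def neg(A): return (np.abs(A) - A) / 2   # elementwise negative part

def convex_nmf(X, k, n_iter=300, seed=0):
    """Convex-NMF: X ~ (X W) G^T with W, G >= 0, via the multiplicative
    rules on the slide applied to the split Y = X^T X = Y^+ - Y^-."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k)) + 0.1
    G = rng.random((n, k)) + 0.1
    Y, eps = X.T @ X, 1e-10
    Yp, Yn = pos(Y), neg(Y)
    for _ in range(n_iter):
        G *= np.sqrt((Yp @ W + G @ (W.T @ Yn @ W)) /
                     (Yn @ W + G @ (W.T @ Yp @ W) + eps))
        W *= np.sqrt((Yp @ G + Yn @ W @ (G.T @ G)) /
                     (Yn @ G + Yp @ W @ (G.T @ G) + eps))
    return W, G

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 12))
W, G = convex_nmf(X, k=2)
err = np.linalg.norm(X - X @ W @ G.T)
```

Here the implicit centroids F = XW are convex-style combinations of the data points, so each column of F lies in the span of the inputs, which is the interpretability property motivated above.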
[Figure slides: Semi-NMF factors vs. Convex-NMF factors]
Sparsity of Convex-NMF

- Sparse factorization is a recent trend.
- Sparsity is usually explicitly enforced.
- Convex-NMF factors are naturally sparse.

  ||X − X W G^T||² = ||X (I − W G^T)||² = Σ_k σ_k² ||v_k^T (I − W G^T)||²

Consider  ||I − W G^T||² = Σ_k ||e_k^T (I − W G^T)||².
Its solution is  W = G = (e_1, ..., e_k),
a matrix with a single 1 in each column and zeros elsewhere.

From this we infer that Convex-NMF factors are naturally sparse.
A Simple Example

Data: x x x x x x x x   x x x x x x x x
      (cluster 1)       (cluster 2)

  ||F_semi − C_Kmeans|| = 0.53,   ||F_convex − C_Kmeans|| = 0.08

  ||X − F G^T||:  SVD 0.27940,  Semi 0.27944,  Convex 0.30877
Experiments on 7 datasets
NMF variants always perform better than K-means
Kernel NMF -- Generalized Convex NMF

Map feature vectors to a higher-dimensional space: x_i → φ(x_i),
  φ(X) = [φ(x_1), φ(x_2), ..., φ(x_n)]

NMF/semi-NMF  φ(X) = F G^T  depends on the explicit mapping function φ(·).

Kernel NMF:  φ(X) = [φ(X) W] G^T

The minimization objective depends on the kernel only:

  ||φ(X) − φ(X) W G^T||² = Tr[(I − G W^T) φ(X)^T φ(X) (I − W G^T)]

(Ding & He, ICML 2004)
Kernel K-means Clustering

Map feature vectors to a higher-dimensional space: x_i → φ(x_i)

Kernel K-means objective:

  min J_K^φ = Σ_{k=1}^{K} Σ_{i∈C_k} ||φ(x_i) − φ(c_k)||²,
  φ(c_k) ≡ (1/n_k) Σ_{i∈C_k} φ(x_i)

Kernel K-means optimization:

  J_K^φ = Σ_i ||φ(x_i)||² − Σ_{k=1}^{K} (1/n_k) Σ_{i,j∈C_k} φ(x_i)^T φ(x_j)

Minimizing J_K^φ is equivalent to

  max Σ_{k=1}^{K} (1/n_k) Σ_{i,j∈C_k} ⟨φ(x_i), φ(x_j)⟩ = max Tr(H^T W H)

where W is the matrix of pairwise similarities, W_ij = ⟨φ(x_i), φ(x_j)⟩.
NMF and PLSI: Equivalence

So far we have only used the Frobenius norm as the NMF objective function.
Another objective is the KL divergence.
Kernel-NMF Algorithm

Same updates as Convex-NMF, with Y = φ(X)^T φ(X):

Update W:  W_ik ← W_ik √( [(Y^+ G)_ik + (Y^- W G^T G)_ik] /
                          [(Y^- G)_ik + (Y^+ W G^T G)_ik] )

Update G:  G_ik ← G_ik √( [(Y^+ W)_ik + (G W^T Y^- W)_ik] /
                          [(Y^- W)_ik + (G W^T Y^+ W)_ik] )

The computing algorithm depends only on the kernel ⟨φ(X), φ(X)⟩.

(Ding, Li, Jordan, 2006)
Orthogonal Nonnegative Tri-Factorization

  min_{F≥0, G≥0, F^T F = I, G^T G = I} ||X − F S G^T||²

3-factor NMF with explicit orthogonality constraints.
Simultaneous K-means clustering of rows and columns.

F = (f_1, f_2, ..., f_k)  => row cluster indicators
G = (g_1, g_2, ..., g_k)  => column cluster indicators

1. Solution is unique.
2. Can't be reduced to NMF.

(Ding, Li, Peng, Park, KDD 2006)
NMF-like algorithms are different ways to relax F, G!

K-means clustering objective function:

  J_K = Σ_{k=1}^{K} Σ_{i∈C_k} ||x_i − f_k||² = Σ_{i=1}^{n} ||x_i − Σ_k g_ik f_k||² = ||X − F G^T||²

  X = (x_1, x_2, ..., x_n) = input data
  F = (f_1, f_2, ..., f_k) = cluster centroids
  G = (g_1, g_2, ..., g_k) = cluster indicators

  f_k = X g_k / n_k,   F = X G D_n^{-1},   D_n = diag(n_1, ..., n_k)

  J_K = ||X − X G̃ G̃^T||²,   G̃ = G D_n^{-1/2},   G̃^T G̃ = I
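The factorization form of the objective, J_K = ||X − FG^T||² with F = X G D_n^{-1} and G a 0/1 indicator matrix, can be verified directly; the six-point toy data below are made up for illustration:

```python
import numpy as np

# 6 points in 2-D (columns of X), two clusters of three points each
X = np.array([[0., 1., 0., 7., 8., 7.],
              [0., 0., 1., 7., 7., 8.]])
labels = np.array([0, 0, 0, 1, 1, 1])

# unnormalized 0/1 indicator matrix G and centroid matrix F = X G D_n^{-1}
G = np.eye(2)[labels]                  # n x k indicator
Dn_inv = np.diag(1.0 / G.sum(0))      # D_n^{-1} = diag(1/n_1, ..., 1/n_k)
F = X @ G @ Dn_inv                     # columns of F are cluster centroids

# the two forms of the objective
J_factor = np.linalg.norm(X - F @ G.T) ** 2
J_kmeans = sum(np.sum((X[:, labels == k] - F[:, [k]]) ** 2) for k in range(2))
```

Column i of FG^T is exactly the centroid of point i's cluster, which is why the two expressions coincide.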
NMF ⇔ PLSI

NMF objective functions:
• Frobenius norm
• KL-divergence:

  J_NMF-KL = Σ_{i=1}^{m} Σ_{j=1}^{n} [ x_ij log( x_ij / (F G^T)_ij ) − x_ij + (F G^T)_ij ]

Probabilistic LSI (Hofmann, 1999) is a latent variable model for clustering:

  J_PLSI = Σ_{i=1}^{m} Σ_{j=1}^{n} x(w_i, d_j) log p(w_i, d_j)

  p(w_i, d_j) = Σ_k p(w_i | z_k) p(z_k) p(d_j | z_k)

We can show:  J_PLSI = −J_NMF-KL + constant

(Ding, Li, Peng, AAAI 2006)
Summary

• NMF is doing K-means clustering (or PLSI)
• Interpretability is key to motivating new NMF-like factorizations
  – Semi-NMF, Convex-NMF, Kernel-NMF, Tri-NMF
• NMF-like algorithms always outperform K-means clustering
• Advantage: hard/soft clustering
• Convex-NMF enforces the notion of cluster centroids and is naturally sparse

NMF: a new/rich paradigm for unsupervised learning
References

• Chris Ding, Xiaofeng He, Horst Simon. On the Equivalence of Nonnegative Matrix Factorization and K-means/Spectral Clustering. SDM 2005.
• Chris Ding, Tao Li, Michael Jordan. Convex and Semi-Nonnegative Matrix Factorization. Submitted.
• Chris Ding, Tao Li, Wei Peng, Haesun Park. Orthogonal Non-negative Matrix Tri-Factorization for Clustering. KDD 2006.
• Chris Ding, Tao Li, Wei Peng. Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence, Chi-square and a Hybrid Algorithm. AAAI 2006.
Data Clustering: NMF and PCA

  min_{G≥0, G^T G = I} ||X_± − F_± G_+^T||²

F = (f_1, f_2, ..., f_k)  => cluster centroids
G = (g_1, g_2, ..., g_k)  => cluster indicators

NMF is useful due to nonnegativity:
G-orthogonality and nonnegativity.

What happens if we ignore nonnegativity?
K-means clustering ⇔ PCA

Ignore nonnegativity => orthogonal transform R:

  min_{G R} ||X_± − (F_± R)(G_+ R)^T||²

Equivalent to:  max_{G R} Tr[(G R)^T X^T X (G R)]

Solution is given by the SVD:  X = U Σ V^T,  F R = U,  G R = V

Centroid subspace projection:   F F^T = (F R)(F R)^T = U U^T
Cluster indicator projection:   G G^T = (G R)(G R)^T = V V^T

PCA/SVD is automatically doing K-means clustering.

(Ding & He, ICML 2004)
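A small sketch of the SVD route: for K = 2 the relaxed cluster indicator is the principal right singular vector of the centered data, and its sign pattern recovers the two clusters. The two-blob data are an illustrative assumption:

```python
import numpy as np

# two well-separated blobs of 15 points each
rng = np.random.default_rng(5)
A = np.vstack([rng.normal(0.0, 0.2, (15, 2)), rng.normal(4.0, 0.2, (15, 2))])

# center the data and put points in columns: X is p x n
X = (A - A.mean(0)).T

# relaxed indicator = top right singular vector of X = U S V^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                      # one entry per data point
labels = (v1 > 0).astype(int)   # sign pattern splits the points
```

No iterative clustering is run at all; the split falls out of one SVD, which is the sense in which "PCA/SVD is automatically doing K-means clustering."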
A Simple Example

Data: x x x x x x x x   x x x x x x x x
      (cluster 1)       (cluster 2)

  ||F_semi − C_Kmeans|| = 0.53,   ||F_convex − C_Kmeans|| = 0.08

  ||X − F G^T||:  SVD 0.27940,  Semi 0.27944,  Convex 0.30877
NMF = Spectral Clustering (Normalized Cut)

Normalized Cut objective:

  J_Ncut(h_1, ..., h_k) = h_1^T (D − W) h_1 / (h_1^T D h_1) + ... + h_k^T (D − W) h_k / (h_k^T D h_k)

Normalized Cut ⇒ cluster indicators:

  y_k = D^{1/2} (0,...,0, 1,...,1, 0,...,0)^T / ||D^{1/2} h_k||

Re-write:

  J_Ncut(y_1, ..., y_k) = y_1^T (I − W̃) y_1 + ... + y_k^T (I − W̃) y_k
                        = Tr( Y^T (I − W̃) Y ),
  W̃ = D^{-1/2} W D^{-1/2}

Optimize:  max Tr(Y^T W̃ Y)  subject to  Y^T Y = I

  min_{H≥0, H^T H = I} ||W̃ − H H^T||²

(Gu et al., 2001)
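The normalized-cut quantities can be checked on a toy graph: for a block-diagonal (disconnected) similarity matrix W, the scaled indicator columns y_k are orthonormal and Tr(Y^T(I − W̃)Y) vanishes, since nothing is cut. The graph below is a made-up illustration:

```python
import numpy as np

# block-diagonal adjacency: no edges between the two clusters, so Ncut = 0
W = np.array([[0., 1., 1., 0., 0.],
              [1., 0., 1., 0., 0.],
              [1., 1., 0., 0., 0.],
              [0., 0., 0., 0., 1.],
              [0., 0., 0., 1., 0.]])
d = W.sum(1)                                   # node degrees
D_half = np.diag(np.sqrt(d))
W_tilde = np.diag(1 / np.sqrt(d)) @ W @ np.diag(1 / np.sqrt(d))

# scaled, normalized cluster indicators y_k = D^{1/2} h_k / ||D^{1/2} h_k||
labels = np.array([0, 0, 0, 1, 1])
Y = np.zeros((5, 2))
for k in range(2):
    h = (labels == k).astype(float)
    y = D_half @ h
    Y[:, k] = y / np.linalg.norm(y)

orth = np.linalg.norm(Y.T @ Y - np.eye(2))      # should be ~0
J_ncut = np.trace(Y.T @ (np.eye(5) - W_tilde) @ Y)  # should be ~0
```

Since the true indicators already minimize Tr(Y^T(I − W̃)Y) here, relaxing to symmetric NMF of W̃ as above recovers the same partition.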