Nonnegative Matrix Factorization for Clustering
Haesun Park (hpark@cc.gatech.edu)
School of Computational Science and Engineering, Georgia Institute of Technology
Atlanta, GA, USA
MMDS July 2012
This work was supported in part by the National Science Foundation.
Haesun Park hpark@cc.gatech.edu Nonnegative Matrix Factorization for Clustering
Co-authors
Jingu Kim, Nokia
Da Kuang, CSE, Georgia Tech
Yunlong He, Math, Georgia Tech
Outline
- Overview of NMF
- Fast algorithms for NMF with Frobenius norm: the Block Coordinate Descent (BCD) framework, convergence, some other algorithms
- Variations of NMF: nonnegative tensor factorization, NMF with Bregman divergences, ...
- NMF for clustering: sparse NMF via regularization, symmetric NMF for graph clustering
- Experimental results
- Summary
Nonnegative Matrix Factorization (NMF) (Lee & Seung 99, Paatero & Tapper 94)

Given A ∈ R_+^{m×n} and a desired rank k << min(m,n), find W ∈ R_+^{m×k} and H ∈ R_+^{k×n} s.t. A ≈ WH:

    min_{W≥0, H≥0} ‖A − WH‖_F

The problem is nonconvex, and W and H are not unique: for a suitable matrix D, W' = WD ≥ 0 and H' = D^{-1}H ≥ 0 give the same product W'H' = WH.

Notation: R_+: nonnegative real numbers
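As a concrete illustration (not from the slides; the sizes and random data are hypothetical), a small NumPy sketch of the objective and of the diagonal-scaling non-uniqueness noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 30, 20, 5          # small illustrative sizes
A = rng.random((m, n))       # nonnegative data matrix

# Any feasible pair (W, H) with W, H >= 0 gives an upper bound
# on the NMF objective ||A - WH||_F.
W = rng.random((m, k))
H = rng.random((k, n))
obj = np.linalg.norm(A - W @ H, 'fro')

# Non-uniqueness: rescaling by a positive diagonal matrix D leaves
# the product (and hence the objective) unchanged.
D = np.diag(rng.random(k) + 0.1)
W2, H2 = W @ D, np.linalg.inv(D) @ H
assert np.allclose(W2 @ H2, W @ H)
assert (W2 >= 0).all() and (H2 >= 0).all()
```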
Nonnegative Matrix Factorization (NMF) (Lee & Seung 99, Paatero & Tapper 94)

Given A ∈ R_+^{m×n} and a desired rank k << min(m,n), find W ∈ R_+^{m×k} and H ∈ R_+^{k×n} s.t. A ≈ WH:

    min_{W≥0, H≥0} ‖A − WH‖_F

NMF improves the approximation as k increases: if rank_+(A) > k, then

    min_{W_{k+1}≥0, H_{k+1}≥0} ‖A − W_{k+1}H_{k+1}‖_F < min_{W_k≥0, H_k≥0} ‖A − W_kH_k‖_F,

where W_i ∈ R_+^{m×i} and H_i ∈ R_+^{i×n}.

But the SVD does better: if A = UΣV^T, then

    ‖A − U_kΣ_kV_k^T‖_F ≤ min ‖A − WH‖_F over W ∈ R_+^{m×k}, H ∈ R_+^{k×n}.

So why NMF? For nonnegative data, NMF provides better interpretation of the lower-rank approximation.
Algorithms for NMF
- Multiplicative update rules: Lee and Seung 99; modified multiplicative update: Lin 07
- Alternating least squares (ALS): Berry et al. 06
- Alternating nonnegative least squares (ANLS):
  - Lin 07, projected gradient descent
  - D. Kim et al. 07, quasi-Newton
  - H. Kim and Park 08, active set
  - J. Kim and Park 08, block principal pivoting
  - Han et al. 09, projected Barzilai-Borwein
- Other algorithms and variants:
  - Cichocki et al. 07, hierarchical ALS (HALS)
  - Ho 08, rank-one residue iteration (RRI)
  - Gillis and Glineur 12, accelerated multiplicative updates and HALS / multilevel approach
  - Hsieh and Dhillon 11, coordinate descent with variable selection
  - Zdunek, Cichocki, Amari 06, quasi-Newton
  - Chu and Lin 07, low-dimensional polytope approximation
  - Other rank-1 deflation based algorithms (Vavasis, ...)
  - C. Ding, T. Li, tri-factor NMF, orthogonal NMF, ...
  - Cichocki, Zdunek, Phan, Amari: NMF and NTF: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, Wiley, 09
  - Andersson and Bro, Nonnegative Tensor Factorization, 00
  - And many more...
Block Coordinate Descent (BCD) Method
A constrained nonlinear problem:

    min f(x)   (e.g., f(W,H) = ‖A − WH‖_F)
    subject to x ∈ X = X_1 × X_2 × · · · × X_p,

where x = (x_1, x_2, . . . , x_p), x_i ∈ X_i ⊂ R^{n_i}, i = 1, . . . , p.

The Block Coordinate Descent method generates x^(k+1) = (x_1^(k+1), . . . , x_p^(k+1)) by

    x_i^(k+1) = arg min_{ξ∈X_i} f(x_1^(k+1), . . . , x_{i−1}^(k+1), ξ, x_{i+1}^(k), . . . , x_p^(k)).

Theorem (Bertsekas, 99): Suppose f is continuously differentiable over the Cartesian product of closed, convex sets X_1, X_2, . . . , X_p, and suppose for each i and x ∈ X the minimum

    min_{ξ∈X_i} f(x_1^(k+1), . . . , x_{i−1}^(k+1), ξ, x_{i+1}^(k), . . . , x_p^(k))

is uniquely attained. Then every limit point of the sequence {x^(k)} generated by the BCD method is a stationary point.

NOTE: Uniqueness is not required when p = 2 (Grippo and Sciandrone, 00).
BCD with k(m + n) Scalar Blocks
Minimize functions of w_ij or h_ij while all other components in W and H are fixed:

    w_ij ← arg min_{w_ij≥0} ‖w_ij h_j^T − (r_i^T − Σ_{k≠j} w_ik h_k^T)‖_2,
    h_ij ← arg min_{h_ij≥0} ‖w_i h_ij − (a_j − Σ_{k≠i} w_k h_kj)‖_2,

where W = (w_1 ... w_k), H has rows h_1^T, . . . , h_k^T, and A = (a_1 ... a_n) has rows r_1^T, . . . , r_m^T.

Each subproblem is a scalar quadratic function with a closed-form solution.
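The scalar-block step can be sketched as follows. This is an illustrative NumPy implementation with hypothetical random data; the update is the nonnegative minimizer of the one-dimensional quadratic in w_ij:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 8, 10, 3
A = rng.random((m, n))
W = rng.random((m, k))
H = rng.random((k, n))

def update_wij(A, W, H, i, j):
    """Closed-form nonnegative minimizer for the scalar block w_ij,
    all other entries of W and H held fixed (one k(m+n)-scalar-block
    BCD step)."""
    HHt = H @ H.T
    grad = (W @ HHt)[i, j] - (A @ H.T)[i, j]  # d/dw_ij of 0.5*||A - WH||_F^2
    return max(0.0, W[i, j] - grad / HHt[j, j])

i, j = 2, 1
w_star = update_wij(A, W, H, i, j)

def f(w):
    W2 = W.copy(); W2[i, j] = w
    return np.linalg.norm(A - W2 @ H, 'fro')

# the update never increases the objective, and nearby feasible points
# are no better
assert f(w_star) <= f(W[i, j]) + 1e-12
for eps in (1e-3, -1e-3):
    assert f(w_star) <= f(max(0.0, w_star + eps)) + 1e-12
```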
BCD with k(m + n) Scalar Blocks
Lee and Seung (01)'s multiplicative updating (MU) rule:

    w_ij ← w_ij (AH^T)_ij / (WHH^T)_ij,    h_ij ← h_ij (W^T A)_ij / (W^T WH)_ij

Derivation based on a gradient-descent form:

    w_ij ← w_ij + [w_ij / (WHH^T)_ij] [(AH^T)_ij − (WHH^T)_ij]
    h_ij ← h_ij + [h_ij / (W^T WH)_ij] [(W^T A)_ij − (W^T WH)_ij]

Rewriting of the solution of coordinate descent:

    w_ij ← [ w_ij + (1 / (HH^T)_jj) ((AH^T)_ij − (WHH^T)_ij) ]_+
    h_ij ← [ h_ij + (1 / (W^T W)_ii) ((W^T A)_ij − (W^T WH)_ij) ]_+

In MU, conservative steps are taken to ensure nonnegativity. Bertsekas' theorem on convergence is not applicable to MU.
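A minimal sketch of the MU rule on hypothetical random data; the small epsilon guard in the denominators is an implementation detail not on the slide:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 20, 15, 4
A = rng.random((m, n))
W = rng.random((m, k)) + 0.1
H = rng.random((k, n)) + 0.1

eps = 1e-12  # guard against division by zero
err_prev = np.linalg.norm(A - W @ H, 'fro')
for _ in range(50):
    # Lee & Seung multiplicative updates; the multiplicative form
    # preserves nonnegativity automatically.
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)
err = np.linalg.norm(A - W @ H, 'fro')
assert err <= err_prev            # MU does not increase the objective
assert (W >= 0).all() and (H >= 0).all()
```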
BCD with 2k Vector Blocks
Minimize functions of w_i or h_i while all other components in W and H are fixed:

    ‖A − Σ_{j=1}^k w_j h_j^T‖_F = ‖(A − Σ_{j≠i} w_j h_j^T) − w_i h_i^T‖_F = ‖R^(i) − w_i h_i^T‖_F

    w_i ← arg min_{w_i≥0} ‖w_i h_i^T − R^(i)‖_F
    h_i ← arg min_{h_i≥0} ‖w_i h_i^T − R^(i)‖_F

Each subproblem has the form min_{x≥0} ‖c x^T − G‖_F and has the closed-form solution x = [G^T c / c^T c]_+.

Hierarchical Alternating Least Squares (HALS) (Cichocki et al., 07, 09), Rank-one Residue Iteration (RRI) (Ho, 08)
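The 2k-vector-block (HALS/RRI) updates above can be sketched as below; the sizes, random data, and small denominator guard are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 20, 15, 4
A = rng.random((m, n))
W = rng.random((m, k)) + 0.1
H = rng.random((k, n)) + 0.1

err0 = np.linalg.norm(A - W @ H, 'fro')
for _ in range(30):
    for i in range(k):
        # residual excluding component i: R_i = A - sum_{j != i} w_j h_j^T
        R = A - W @ H + np.outer(W[:, i], H[i, :])
        # closed-form nonnegative solutions of min ||c x^T - G||_F
        H[i, :] = np.maximum(0, R.T @ W[:, i] / (W[:, i] @ W[:, i] + 1e-12))
        W[:, i] = np.maximum(0, R @ H[i, :] / (H[i, :] @ H[i, :] + 1e-12))
err = np.linalg.norm(A - W @ H, 'fro')
assert err <= err0 + 1e-10   # each block solve is exact, so no increase
```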
Successive Rank-1 Deflation in SVD and NMF
(Perron-Frobenius) There are nonnegative left and right singular vectors u_1 and v_1 of A ∈ R_+^{m×n} associated with the largest singular value σ_1. So for A ∈ R_+^{m×n}, the rank-1 SVD = the rank-1 NMF.

Successive rank-1 deflation works for the SVD but not for NMF:

    A − σ_1 u_1 v_1^T ≈ σ_2 u_2 v_2^T ?    A − w_1 h_1^T ≈ w_2 h_2^T ?

For example,

    [4 6 0]   [1/√2 −1/√2 0] [10 0 0] [1/√2  1/√2 0]
    [6 4 0] = [1/√2  1/√2 0] [ 0 2 0] [1/√2 −1/√2 0]
    [0 0 1]   [ 0     0   1] [ 0 0 1] [ 0     0   1]

The sum of the two successive best rank-1 nonnegative approximations is

    [4 6 0]   [5 5 0]   [0 0 0]
    [6 4 0] ≈ [5 5 0] + [0 0 0]
    [0 0 1]   [0 0 0]   [0 0 1]

while the best rank-2 nonnegative approximation is

         [4 6 0]   [4 6]
    WH = [6 4 0] = [6 4] [1 0 0]
         [0 0 0]   [0 0] [0 1 0]

NOTE: 2k-vector-block BCD ≠ successive rank-1 deflation for NMF.
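The 3 × 3 example can be checked numerically. This is a NumPy sketch; the abs() on the singular vectors only undoes NumPy's arbitrary sign choice, which Perron-Frobenius guarantees is harmless here:

```python
import numpy as np

A = np.array([[4., 6., 0.],
              [6., 4., 0.],
              [0., 0., 1.]])

# Rank-1 SVD of the nonnegative A is itself nonnegative (Perron-Frobenius).
U, s, Vt = np.linalg.svd(A)
A1 = s[0] * np.outer(np.abs(U[:, 0]), np.abs(Vt[0]))  # best rank-1 approx.
assert np.allclose(A1, [[5, 5, 0], [5, 5, 0], [0, 0, 0]])

# Deflating and adding the best rank-1 nonnegative approximation of the
# residual (per the slide, only the (3,3) entry) leaves Frobenius error 2 ...
deflated = A1 + np.array([[0, 0, 0], [0, 0, 0], [0, 0, 1.]])
# ... while the rank-2 NMF from the slide has error 1.
W = np.array([[4., 6.], [6., 4.], [0., 0.]])
H = np.array([[1., 0., 0.], [0., 1., 0.]])
assert np.isclose(np.linalg.norm(A - deflated, 'fro'), 2.0)
assert np.isclose(np.linalg.norm(A - W @ H, 'fro'), 1.0)
```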
BCD with 2 Matrix Blocks
Minimize functions of W or H while the other is fixed:

    W ← arg min_{W≥0} ‖H^T W^T − A^T‖_F
    H ← arg min_{H≥0} ‖WH − A‖_F

Alternating Nonnegativity-constrained Least Squares (ANLS). No closed-form solution; each subproblem is solved by:

- Projected gradient method (Lin, 07)
- Projected quasi-Newton method (D. Kim et al., 07)
- Active-set method (H. Kim and Park, 08)
- Block principal pivoting method (J. Kim and Park, 08 ICDM, 11 SISC)
- ALS (M. Berry et al. 06) ?
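A sketch of the two-matrix-block ANLS iteration, using SciPy's Lawson-Hanson `nnls` as the NLS solver in place of the faster methods listed above; the data and sizes are hypothetical:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)
m, n, k = 15, 12, 3
A = rng.random((m, n))
W = rng.random((m, k)) + 0.1
H = rng.random((k, n))

def nls(C, B):
    """Solve min_{X>=0} ||C X - B||_F column by column (a stand-in for
    the block-principal-pivoting solver discussed on this slide)."""
    return np.column_stack([nnls(C, B[:, j])[0] for j in range(B.shape[1])])

err0 = np.linalg.norm(A - W @ H, 'fro')
for _ in range(10):          # two-matrix-block BCD (ANLS)
    H = nls(W, A)            # min_{H>=0} ||W H - A||_F
    W = nls(H.T, A.T).T      # min_{W>=0} ||H^T W^T - A^T||_F
err = np.linalg.norm(A - W @ H, 'fro')
assert err <= err0 + 1e-10   # each block is solved exactly
```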
NLS: min_{X≥0} ‖CX − B‖_F^2 = Σ_i min_{x_i≥0} ‖Cx_i − b_i‖_2^2

The Nonnegativity-constrained Least Squares (NLS) problem:

- Projected gradient method (Lin, 07): x^(k+1) ← P_+(x^(k) − α_k ∇f(x^(k)))
  * P_+(·): projection onto the nonnegative orthant
  * back-tracking selection of the step α_k
- Projected quasi-Newton method (Kim et al., 07): partition x into y (nonzero variables) and z (zero variables) and update
      x^(k+1) ← [ P_+(y^(k) − α D^(k) ∇f(y^(k))) ; 0 ]
  * gradient scaling only for the nonzero variables
- Other methods: Merritt and Zhang 05, interior point gradient method; Zdunek and Cichocki 08, quasi-Newton, projected Landweber, and projected sequential subspace methods; Bellavia et al. 06, interior point Newton-like method; Franc et al. 05, sequential coordinate-wise method; ...
- Active-set method (H. Kim and Park, 08; Lawson and Hanson 74; Bro and De Jong 97; Van Benthem and Keenan 04)
- Block principal pivoting method (J. Kim and Park, 08 and 11); cf. linear complementarity problems (LCP) (Judice and Pires, 94)

Active-set-type methods fully exploit the structure of the NLS problems in NMF.
Active-set type Algorithms for min_{x≥0} ‖Cx − b‖_2, C: m × k

KKT conditions:

    y = C^T C x − C^T b,    y ≥ 0,  x ≥ 0,  x_i y_i = 0,  i = 1, · · · , k

If we knew P = {i | x_i > 0} in the solution in advance, then we would only need to solve min ‖C_P x_P − b‖_2 and set the remaining x_i = 0, where C_P denotes the columns of C with indices in P.

(Figure: the sign pattern of x selects which columns of C enter the reduced least squares problem.)
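The KKT conditions and the support-reduction idea can be checked numerically. This sketch uses SciPy's `nnls` (an active-set solver) on hypothetical random data:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(5)
m, k = 20, 6
C = rng.random((m, k))
b = rng.random(m)

x, _ = nnls(C, b)                 # min_{x>=0} ||C x - b||_2
y = C.T @ C @ x - C.T @ b         # dual variable from the KKT conditions

tol = 1e-8
assert (x >= 0).all()             # primal feasibility
assert (y >= -tol).all()          # dual feasibility
assert abs(x @ y) < tol           # complementarity x_i y_i = 0

# Knowing the support P = {i : x_i > 0} reduces the problem to an
# unconstrained least squares on the columns C_P:
P = x > tol
xP, *_ = np.linalg.lstsq(C[:, P], b, rcond=None)
assert np.allclose(xP, x[P], atol=1e-6)
```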
Experimental Results (NMF) (J. Kim and Park, 2011 SISC)
NMF algorithms compared:

    Name         Description                      Author
    ANLS-BPP     ANLS / block principal pivoting  J. Kim and HP 08, 11
    ANLS-AS      ANLS / active set                H. Kim and HP 08
    ANLS-PGRAD   ANLS / projected gradient        Lin 07
    ANLS-PQN     ANLS / projected quasi-Newton    D. Kim et al. 07
    HALS         Hierarchical ALS                 Cichocki et al. 07
    MU           Multiplicative updating          Lee and Seung 01
    ALS          Alternating least squares        Berry et al. 06
Residual vs. Execution time (J. Kim and Park, 2011)
(Figure: relative objective value vs. execution time for HALS, MU, ALS, ANLS-PGRAD, ANLS-PQN, and ANLS-BPP on TDT2 with k = 10 and k = 160.)

TDT2 text data: 19,009 × 3,087, k = 10 and k = 160
Residual vs. Execution time (J. Kim and Park, 2011)
(Figure: relative objective value vs. execution time for the same NMF algorithms on PIE 64 with k = 80 and k = 160.)

PIE 64 image data: 4,096 × 11,554, k = 80 and k = 160
Nonnegative Tensor Factorization (PARAFAC) (J. Kim and Park, 2012)

Consider min_{A,B,C≥0} ‖X − [[A, B, C]]‖_F^2, where X ∈ R_+^{m×n×p}, A ∈ R_+^{m×k}, B ∈ R_+^{n×k}, C ∈ R_+^{p×k}.

The loading matrices (A, B, and C) can be iteratively estimated. The unfolded matrices are longer and thinner, which is ideal for ANLS/BPP. The approach can be similarly extended to higher-order tensors.

NTF algorithms compared:

    Name       Description                      Author
    ANLS-BPP   ANLS / block principal pivoting  J. Kim and HP 08
    ANLS-AS    ANLS / active set                H. Kim and HP 08
    HALS       Hierarchical ALS                 Cichocki et al. 07
    MU         Multiplicative updating          Welling and Weber 01
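A sketch of ANLS for the third-order case, unfolding the tensor along each mode and solving each tall, thin NLS subproblem with `scipy.optimize.nnls` (column by column) in place of the block-principal-pivoting solver; the low-rank tensor here is a small synthetic example:

```python
import numpy as np
from scipy.linalg import khatri_rao
from scipy.optimize import nnls

rng = np.random.default_rng(6)
m, n, p, k = 6, 5, 4, 2
# hypothetical low-rank nonnegative tensor X = [[A, B, C]]
A0, B0, C0 = (rng.random((d, k)) for d in (m, n, p))
X = np.einsum('ir,jr,lr->ijl', A0, B0, C0)

def nls(C, B):
    # min_{X>=0} ||C X - B||_F, column by column
    return np.column_stack([nnls(C, B[:, j])[0] for j in range(B.shape[1])])

A, B, C = (rng.random((d, k)) for d in (m, n, p))
def err():
    return np.linalg.norm(X - np.einsum('ir,jr,lr->ijl', A, B, C))

e0 = err()
for _ in range(20):
    # each mode-d unfolding gives min ||(· khatri-rao ·) M^T - X_(d)^T||,
    # a tall, thin NLS problem -- the shape ANLS/BPP is designed for
    A = nls(khatri_rao(C, B), X.transpose(0, 2, 1).reshape(m, -1).T).T
    B = nls(khatri_rao(C, A), X.transpose(1, 2, 0).reshape(n, -1).T).T
    C = nls(khatri_rao(B, A), X.transpose(2, 1, 0).reshape(p, -1).T).T
assert err() <= e0 + 1e-8   # exact block solves, so no increase
```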
Residual vs. Execution time (J. Kim and Park, 2012)
(Figure: relative objective value vs. execution time for MU, HALS, ANLS-AS, and ANLS-BPP on YALEB-CROP with k = 60 and on NIPS with k = 10.)

Extended Yale Face: 168 × 192 × 2424 with k = 60; NIPS data: 2037 × 1740 × 13649 × 13 with k = 10
NMF and K-means
Clustering and lower-rank approximation are related. NMF for clustering: document (Xu et al. SIGIR 03), image (Cai et al. ICDM 08), microarray (Kim & Park, Bio 07), etc.

Objective functions for K-means and NMF (Ding et al. SDM 05; Kim & Park, TR 08):

    min Σ_{i=1}^n ‖a_i − w_{σ_i}‖_2^2 = min ‖A − WH‖_F^2

where σ_i = j when the i-th point is assigned to the j-th cluster (j ∈ {1, · · · , k}).
K-means: W holds the k cluster centroids; h_i is a cluster membership indicator.
NMF: W holds basis vectors for the rank-k approximation; H is the k-dimensional representation of A.

Sparse NMF (for sparse H) (H. Kim and Park, Bioinformatics, 07):

    min_{W,H} ‖A − WH‖_F^2 + η ‖W‖_F^2 + β Σ_{j=1}^n ‖H(:, j)‖_1^2,   W_ij, H_ij ≥ 0 ∀ i, j

ANLS reformulation (H. Kim and Park, 07): alternate the following two NLS subproblems (stacking rows: √β e_{1×k} under W, √η I_k under H^T):

    min_{H≥0} ‖ [W ; √β e_{1×k}] H − [A ; 0_{1×n}] ‖_F^2
    min_{W≥0} ‖ [H^T ; √η I_k] W^T − [A^T ; 0_{k×m}] ‖_F^2
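The stacked-matrix reformulation can be sketched directly; the values of η and β, the random data, and the use of `nnls` as the NLS solver are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(7)
m, n, k = 12, 10, 3
A = rng.random((m, n))
W = rng.random((m, k)) + 0.1
H = rng.random((k, n))
eta, beta = 0.1, 0.1     # hypothetical regularization parameters

def nls(C, B):
    return np.column_stack([nnls(C, B[:, j])[0] for j in range(B.shape[1])])

def obj():
    return (np.linalg.norm(A - W @ H, 'fro') ** 2
            + eta * np.linalg.norm(W, 'fro') ** 2
            + beta * sum(H[:, j].sum() ** 2 for j in range(n)))

o0 = obj()
for _ in range(10):
    # H-step: stack sqrt(beta) * (row of ones) under W, a zero row under A;
    # since H >= 0, the extra row contributes beta * ||H(:,j)||_1^2
    H = nls(np.vstack([W, np.sqrt(beta) * np.ones((1, k))]),
            np.vstack([A, np.zeros((1, n))]))
    # W-step: stack sqrt(eta) * I_k under H^T, zeros under A^T
    W = nls(np.vstack([H.T, np.sqrt(eta) * np.eye(k)]),
            np.vstack([A.T, np.zeros((k, m))])).T
assert obj() <= o0 + 1e-8   # exact block solves of the full objective
```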
The objective functions of K-means and NMF are related, but their clustering performance may be very different.
NMF as a Clustering Method

NMF performs well on documents: many success stories (Xu et al. 03; Pauca et al. 04; Li et al. 07; Kim & Park, 08; Ding et al. 10, ...). The columns of W are good cluster representatives for documents.

Clustering accuracy on TDT2 text data:

    # clusters   2       6       10      14      18
    K-means      0.8099  0.7295  0.7015  0.6675  0.6675
    NMF/ANLS     0.9990  0.8717  0.7436  0.7021  0.7160
    SNMF/ANLS    0.9991  0.8770  0.7512  0.7269  0.7278
However, NMF may fail as a clustering method:
(Figure: a two-cluster data set in the (x1, x2) plane, clustered by standard K-means, spherical K-means, and standard NMF.)
NMF still approximates the data points well, but no two nonnegative basis vectors can represent the two clusters: NMF tries to find k linearly independent cluster representatives, behaving more like spherical clustering.
SymNMF for Graph Clustering
When H ≥ 0 and H^T H = I,  max trace(H^T SH) ⇔ min ‖S − HH^T‖_F^2
S ∈ R^{n×n}: pairwise similarity matrix
H ∈ R^{n×k}: cluster membership indicator

SymNMF formulation (Kuang & Park, SDM 12): min_{H≥0} ‖S − HH^T‖_F^2

Starting from min_{H≥0, H^TH=I} ‖S − HH^T‖_F^2, keeping only H^TH = I gives spectral clustering; keeping only H ≥ 0 gives SymNMF.

Spectral clustering relies on the eigen-structure of S (Ng et al. NIPS 01). The solution of SymNMF is independent of the eigenvectors and has a more natural interpretation; no post-clustering is required. S is indefinite in general, and the multiplicative update rule algorithm does not work well when applied to SymNMF. We have developed Newton-like and ANLS/BPP-type algorithms with good convergence properties for SymNMF.
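As an illustration only (this is a plain projected-gradient sketch on a small synthetic similarity matrix, not the Newton-like or ANLS/BPP algorithms mentioned above):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 30, 3
# hypothetical similarity matrix: symmetric, with k planted blocks,
# indefinite in general
labels = np.repeat(np.arange(k), n // k)
S = 0.9 * (labels[:, None] == labels[None, :]) + 0.1 * rng.random((n, n))
S = (S + S.T) / 2

def f(H):
    return np.linalg.norm(S - H @ H.T, 'fro') ** 2

H = rng.random((n, k))
f0 = prev = f(H)
step = 1e-3
for _ in range(500):
    grad = 4 * (H @ H.T - S) @ H          # gradient of ||S - HH^T||_F^2
    Hn = np.maximum(0, H - step * grad)   # projected gradient step
    if f(Hn) <= prev:
        H, prev = Hn, f(Hn)
    else:
        step /= 2                         # crude backtracking

clusters = H.argmax(axis=1)  # rows of H act as soft cluster indicators
assert prev <= f0 and (H >= 0).all()
```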
SymNMF Experiments
Artificial graphs:
(Figure: six artificial graphs, Graphs 1 through 6, each plotted in [−1, 1]².)
Number of optimal assignments among 20 runs on sparse graphs:

    Graph      1    2    3    4    5    6
    Spectral   7    16   2    10   1    18
    SymNMF     14   17   18   18   11   20
Reuters-21578: clustering accuracy averaged over 20 subsets:

    k =        2       6       10      14      18
    Kmeans     0.7867  0.5137  0.4191  0.4529  0.3403
    NMF        0.9257  0.6934  0.5568  0.5654  0.4313
    GNMF       0.8709  0.7439  0.7038  0.6160  0.5704
    Spectral   0.8885  0.6452  0.5428  0.5637  0.4411
    SymNMF     0.9111  0.7265  0.6842  0.6539  0.6188

GNMF: Graph-regularized NMF (Cai et al. ICDM 08)
SymNMF Experiments
COIL-20: clustering accuracy averaged over 20 subsets:

    k =        2       4       6       8       10      20
    Kmeans     0.9206  0.7484  0.7443  0.6541  0.6437  0.6083
    NMF        0.9291  0.7488  0.7402  0.6667  0.6238  0.4765
    GNMF       0.9345  0.7325  0.7389  0.6352  0.6041  0.4638
    Spectral   0.8925  0.8115  0.8023  0.7969  0.7372  0.7014
    SymNMF     0.9917  0.8406  0.8725  0.8221  0.8018  0.7397
BSDS500 (image segmentation): preliminary results on 320 × 480 images (n = 153,600). (Figure: original image / spectral clustering / SymNMF.)
Summary
- Overview of NMF with Frobenius norm and algorithms
- Fast algorithms and convergence via the BCD framework
- Nonnegative PARAFAC
- NMF and SNMF for clustering
- SymNMF for graph clustering
- Computational comparisons

Ongoing and future directions:

- NMF for clustering and semi-supervised clustering
- NMF and probability-related methods
- Adaptive NMF
- NMF algorithms for large-scale problems; parallel/GPU implementations
- NMF with other difference measures (other matrix norms, Bregman and Csiszar divergences)
- NMF for blind source separation? Uniqueness?
- More theoretical study of NMF, especially foundations for computational methods
NMF Matlab codes and papers available at http://www.cc.gatech.edu/~hpark and http://www.cc.gatech.edu/~jingu
Papers
- Hyunsoo Kim and Haesun Park. Sparse Non-negative Matrix Factorizations via Alternating Non-negativity-constrained Least Squares for Microarray Data Analysis. Bioinformatics, 23(12):1495-1502, 2007.
- Hyunsoo Kim and Haesun Park. Nonnegative Matrix Factorization Based on Alternating Non-negativity-constrained Least Squares and the Active Set Method. SIAM Journal on Matrix Analysis and Applications (SIMAX), 30(2):713-730, 2008.
- Jingu Kim and Haesun Park. Sparse Nonnegative Matrix Factorization for Clustering. Georgia Tech Technical Report GT-CSE-08-01, 2008.
- Jingu Kim and Haesun Park. Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons. Proc. of the 8th IEEE Int. Conf. on Data Mining (ICDM), pp. 353-362, 2008.
- Barry Drake, Jingu Kim, Mahendra Mallick, and Haesun Park. Supervised Raman Spectra Estimation Based on Nonnegative Rank Deficient Least Squares. Proc. of the 13th Int. Conf. on Information Fusion, Edinburgh, UK, 2010.
- Anoop Korattikara, Levi Boyles, Max Welling, Jingu Kim, and Haesun Park. Statistical Optimization of Non-Negative Matrix Factorization. Proc. of the Fourteenth Int. Conf. on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP 15, 2011.
- Jingu Kim and Haesun Park. Fast Nonnegative Matrix Factorization: An Active-set-like Method and Comparisons. SIAM Journal on Scientific Computing (SISC), 33(6):3261-3281, 2011.
- Jingu Kim and Haesun Park. Fast Nonnegative Tensor Factorization with an Active-set-like Method. In High-Performance Scientific Computing: Algorithms and Applications, Springer, pp. 311-326, 2012.
- Jingu Kim, Renato Monteiro, and Haesun Park. Group Sparsity in Nonnegative Matrix Factorization. Proc. of SIAM Int. Conf. on Data Mining (SDM), pp. 851-862, 2012.
- Da Kuang, Chris Ding, and Haesun Park. Symmetric Nonnegative Matrix Factorization for Graph Clustering. Proc. of SIAM Int. Conf. on Data Mining (SDM), pp. 106-117, 2012.
- Liangda Li, Guy Lebanon, and Haesun Park. Coordinate Descent Algorithm for Nonnegative Matrix Factorization with Bregman Divergences. Proc. of the 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2012.
Thank you!