Computing Nonnegative Matrix Factorizations

Nicolas Gillis

Joint work with François Glineur, Robert Luce, Stephen Vavasis, Arnaud Vandaele, Jérémy Cohen

Where is Mons?

Nonnegative Matrix Factorization (NMF)

Given a matrix $M \in \mathbb{R}^{p \times n}_+$ and a factorization rank $r \ll \min(p, n)$, find $U \in \mathbb{R}^{p \times r}$ and $V \in \mathbb{R}^{r \times n}$ such that

$$\min_{U \geq 0,\, V \geq 0} \|M - UV\|_F^2 = \sum_{i,j} (M - UV)_{ij}^2. \qquad \text{(NMF)}$$

NMF is a linear dimensionality reduction technique for nonnegative data:

$$\underbrace{M(:, i)}_{\geq 0} \;\approx\; \sum_{k=1}^{r} \underbrace{U(:, k)}_{\geq 0}\, \underbrace{V(k, i)}_{\geq 0} \quad \text{for all } i.$$

Why nonnegativity?

→ Interpretability: nonnegativity constraints lead to easily interpretable factors (and a sparse, part-based representation).
→ Many applications: image processing, text mining, hyperspectral unmixing, community detection, clustering, etc.
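A minimal numpy illustration of this model (a synthetic example, not one of the datasets discussed later): we build nonnegative factors, form M = UV, and check that each column of M is a nonnegative combination of the columns of U.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, r = 6, 10, 3
U = rng.random((p, r))   # U >= 0
V = rng.random((r, n))   # V >= 0
M = U @ V                # nonnegative data matrix of rank <= r

# Column i of M is the combination sum_k U[:, k] * V[k, i], with V[:, i] >= 0.
i = 0
assert np.allclose(M[:, i], sum(U[:, k] * V[k, i] for k in range(r)))
```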

Example 1: Blind hyperspectral unmixing

Figure: Urban hyperspectral image, 162 spectral bands and 307-by-307 pixels.

Problem. Identify the materials and classify the pixels.

Linear mixing model

Example 1: Blind hyperspectral unmixing with NMF

The basis elements allow us to recover the different endmembers: U ≥ 0;
the abundances of the endmembers in each pixel: V ≥ 0.

Urban hyperspectral image

Figure: Decomposition of the Urban dataset.

Example 2: topic recovery and document classification

The basis elements allow us to recover the different topics;
the weights allow us to assign each text to its corresponding topics.

Example 3: feature extraction and classification

The basis elements extract facial features such as eyes, nose and lips.

Outline

1. Computational complexity
2. Standard non-linear optimization schemes and acceleration
3. Exact NMF (M = UV) and its geometric interpretation
4. NMF under the separability assumption

Computational Complexity of NMF

Complexity of NMF

$$\min_{U \in \mathbb{R}^{p \times r},\, V \in \mathbb{R}^{r \times n}} \|M - UV\|_F^2 \quad \text{such that } U \geq 0,\ V \geq 0.$$

For r = 1, the problem is solvable in polynomial time via the Eckart–Young and Perron–Frobenius theorems (the leading singular vectors of a nonnegative matrix can be chosen nonnegative).

Checking whether there exists an exact factorization M = UV is NP-hard (Vavasis, 2009) when p, n and r are not fixed.

Using quantifier elimination (a reformulation with a fixed number of variables):
Cohen and Rothblum [1991]: $(mn)^{O(mr + nr)}$, non-polynomial.
Arora et al. [2012]: $(mn)^{O(2^r)}$, polynomial for fixed r.
Moitra [2013]: $(mn)^{O(r^2)}$, polynomial for fixed r → not really useful in practice...

This does not imply that rank$_+$ (the minimum r such that M = UV) can be computed in polynomial time (because there is no upper bound on rank$_+$).

Complexity for other norms

$$\min_{u \in \mathbb{R}^p,\, v \in \mathbb{R}^n} \|M - uv^T\|_1 = \sum_{i,j} |M_{ij} - u_i v_j|. \qquad (\ell_1 \text{ norm})$$

If M is binary, $M \in \{0,1\}^{p \times n}$, any optimal solution $(u^*, v^*)$ can be assumed to be binary, that is, $(u^*, v^*) \in \{0,1\}^p \times \{0,1\}^n$.

$$\min_{u \in \mathbb{R}^p,\, v \in \mathbb{R}^n} \|M - uv^T\|_W^2 = \sum_{i,j} W_{ij} (M - uv^T)_{ij}^2, \qquad \text{(weighted } \ell_2 \text{ norm)}$$

where W is a nonnegative weight matrix. This model can be used when
data is missing ($W_{ij} = 0$ for missing entries), or
entries have different variances ($W_{ij} = 1/\sigma_{ij}^2$).

G., Vavasis, On the Complexity of Robust PCA and ℓ1-Norm Low-Rank Matrix Approximation, Mathematics of Operations Research, 2018.
G., Glineur, Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard, SIAM J. Matrix Anal. Appl., 2011.

NMF Algorithms and Acceleration

NMF Algorithms

Given a matrix $M \in \mathbb{R}^{m \times n}_+$ and a factorization rank $r \in \mathbb{N}$:

$$\min_{U \in \mathbb{R}^{m \times r}_+,\, V \in \mathbb{R}^{r \times n}_+} \|M - UV\|_F^2 = \sum_{i,j} (M - UV)_{ij}^2. \qquad \text{(NMF)}$$

This is a difficult non-linear optimization problem with potentially many local minima.

Standard framework (a minimal sketch follows below):

0. Initialize (U, V). Then, alternately update U and V:
1. Update $V \approx \operatorname{argmin}_{X \geq 0} \|M - UX\|_F^2$. (NNLS)
2. Update $U \approx \operatorname{argmin}_{Y \geq 0} \|M - YV\|_F^2$. (NNLS)

Most NMF algorithms come with no guarantees (except convergence to stationary points).

The solution is in general highly non-unique: identifiability issues.
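A minimal sketch of this alternating framework, assuming numpy/scipy; each NNLS subproblem is solved exactly, column by column, with scipy's active-set solver (fine for small examples; the HALS updates on the next slide scale better):

```python
import numpy as np
from scipy.optimize import nnls

def anls_nmf(M, r, n_iter=50, seed=0):
    """Alternating nonnegative least squares for min ||M - UV||_F^2, U, V >= 0."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.random((m, r))
    for _ in range(n_iter):
        # Step 1: V <- argmin_{X >= 0} ||M - UX||_F^2 (one NNLS per column of M)
        V = np.column_stack([nnls(U, M[:, j])[0] for j in range(n)])
        # Step 2: U <- argmin_{Y >= 0} ||M - YV||_F^2 (one NNLS per row of M)
        U = np.column_stack([nnls(V.T, M[i, :])[0] for i in range(m)]).T
    return U, V
```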

Block coordinate descent method

Use block-coordinate descent on the NNLS subproblems → closed-form solutions for the columns of U and rows of V:

$$U_{:k}^* = \operatorname{argmin}_{U_{:k} \geq 0} \|R_k - U_{:k} V_{k:}\|_F^2 = \max\!\left(0, \frac{R_k V_{k:}^T}{\|V_{k:}\|_2^2}\right) \quad \forall k,$$

where $R_k = M - \sum_{j \neq k} U_{:j} V_{j:}$, and similarly for V. This is the so-called HALS algorithm (a sketch of the update follows the list below).

It can be accelerated:

1. Gauss–Seidel coordinate descent (Hsieh, Dhillon, 2011).
2. Loop several times over the columns of U / rows of V to perform more iterations at a lower computational cost (Glineur, G., 2012).
3. Randomized shuffling (Chow, Wu, Yin, 2017).
4. Use an extrapolation step $\hat{W}^{(k+1)} = W^{(k+1)} + \beta_k \big(W^{(k+1)} - W^{(k)}\big)$ (Ang, G., 2018).
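A sketch of one HALS sweep over the columns of U, assuming numpy; rather than forming each residual $R_k$ explicitly, it uses the standard identity $R_k V_{k:}^T = (MV^T)_{:k} - U(VV^T)_{:k} + U_{:k}(VV^T)_{kk}$, so a full sweep costs only one pair of matrix products:

```python
import numpy as np

def hals_sweep_U(M, U, V, eps=1e-16):
    """One HALS pass updating the columns of U in place for min ||M - UV||_F^2."""
    A = M @ V.T   # m x r, equals M V^T
    B = V @ V.T   # r x r, equals V V^T
    for k in range(U.shape[1]):
        denom = max(B[k, k], eps)  # guard against an all-zero row V[k, :]
        # Closed-form update: max(0, R_k V_k:^T / ||V_k:||_2^2)
        U[:, k] = np.maximum(0.0, U[:, k] + (A[:, k] - U @ B[:, k]) / denom)
    return U
```

The sweep for V follows by symmetry: `V = hals_sweep_U(M.T, V.T, U.T).T` updates the rows of V.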

Illustration on the CBCL face image data set

Exact NMF: Geometry and Extended Formulations

Geometric interpretation of exact NMF

Given M = UV, one can scale M and U so that they become column stochastic, implying that V is column stochastic too:

$$M = UV \iff M' = M D_M = (U D_U)(D_U^{-1} V D_M) = U' V'.$$

The columns of M are then convex combinations of the columns of U:

$$M_{:j} = \sum_{i=1}^{k} U_{:i} V_{ij} \quad \text{with} \quad \sum_{i=1}^{k} V_{ij} = 1 \ \forall j, \quad V_{ij} \geq 0 \ \forall i, j.$$

In other words,

$$\operatorname{conv}(M) \subseteq \operatorname{conv}(U) \subseteq S_n,$$

where conv(X) is the convex hull of the columns of X, and $S_n = \{x \in \mathbb{R}^n \mid x \geq 0,\ \sum_{i=1}^{n} x_i = 1\}$ is the unit simplex.

Exact NMF ≡ find r points whose convex hull is nested between two given polytopes.
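A quick numerical check of this rescaling, assuming numpy; $D_M$ and $D_U$ are the diagonal matrices of inverse column sums:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((5, 3))
V = rng.random((3, 8))
M = U @ V

D_M = np.diag(1.0 / M.sum(axis=0))      # makes M' column stochastic
D_U = np.diag(1.0 / U.sum(axis=0))      # makes U' column stochastic
Mp, Up = M @ D_M, U @ D_U
Vp = np.diag(U.sum(axis=0)) @ V @ D_M   # V' = D_U^{-1} V D_M

assert np.allclose(Mp, Up @ Vp)          # M' = U'V' still holds
assert np.allclose(Vp.sum(axis=0), 1.0)  # and V' is column stochastic
```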

Geometric interpretation of NMF

Example: two nested hexagons ($\operatorname{rank}(M_a) = 3$), with

$$M_a = \frac{1}{a}\begin{pmatrix}
1 & a & 2a-1 & 2a-1 & a & 1\\
1 & 1 & a & 2a-1 & 2a-1 & a\\
a & 1 & 1 & a & 2a-1 & 2a-1\\
2a-1 & a & 1 & 1 & a & 2a-1\\
2a-1 & 2a-1 & a & 1 & 1 & a\\
a & 2a-1 & 2a-1 & a & 1 & 1
\end{pmatrix}, \quad a > 1.$$

Case 1: a = 2, $\operatorname{rank}_+(M_a) = 3$, col(M) = col(U).

Figure: $\Delta_p \cap \operatorname{col}(M_2)$, $\operatorname{conv}(M_2)$ and $\operatorname{conv}(U)$.

Case 2: a = 3, $\operatorname{rank}_+(M_a) = 4$, col(M) = col(U).

Figure: $\Delta_p \cap \operatorname{col}(M_3)$, $\operatorname{conv}(M_3)$ and $\operatorname{conv}(U)$.

Case 3: a → +∞, $\operatorname{rank}_+(M_a) = 5$, col(M) ≠ col(U).

An amazing result: NMF and extended formulations

Let P be a polytope

$$P = \{x \in \mathbb{R}^k \mid b_i - A(i,:)x \geq 0 \text{ for } 1 \leq i \leq m\},$$

and let the $v_j$'s ($1 \leq j \leq n$) be its vertices.

We define the m-by-n slack matrix $S_P$ of P as follows:

$$S_P(i, j) = b_i - A(i,:)v_j \geq 0, \quad 1 \leq i \leq m,\ 1 \leq j \leq n.$$

The hexagon:

$$S_P = \begin{pmatrix}
0 & 1 & 2 & 2 & 1 & 0\\
0 & 0 & 1 & 2 & 2 & 1\\
1 & 0 & 0 & 1 & 2 & 2\\
2 & 1 & 0 & 0 & 1 & 2\\
2 & 2 & 1 & 0 & 0 & 1\\
1 & 2 & 2 & 1 & 0 & 0
\end{pmatrix}$$

An extended formulation of P is a higher-dimensional polyhedron $Q \subseteq \mathbb{R}^{k+p}$ that (linearly) projects onto P. The minimum number of facets of such a polytope is called the extension complexity xp(P) of P.

Theorem (Yannakakis, 1991). $\operatorname{rank}_+(S_P) = \operatorname{xp}(P)$.

Proof (one direction). Given $P = \{x \in \mathbb{R}^k \mid b - Ax \geq 0\}$, any exact NMF $S_P = UV$ with $U \geq 0$, $V \geq 0$ provides an explicit extended formulation (with some redundant equalities) of P:

$$P = \{x \mid b - Ax \geq 0\} = \{x \mid b - Ax = Uy \text{ for some } y \geq 0\}.$$

Remark. The slack matrix $S_P$ of P satisfies

$$\operatorname{conv}(S_P) = S_m \cap \operatorname{col}(S_P).$$

To get a small factorization, we need to go to a higher-dimensional space: rank(U) > rank(M).

The Hexagon

$$S_P = \begin{pmatrix}
0 & 1 & 2 & 2 & 1 & 0\\
0 & 0 & 1 & 2 & 2 & 1\\
1 & 0 & 0 & 1 & 2 & 2\\
2 & 1 & 0 & 0 & 1 & 2\\
2 & 2 & 1 & 0 & 0 & 1\\
1 & 2 & 2 & 1 & 0 & 0
\end{pmatrix}
= \begin{pmatrix}
1 & 0 & 0 & 1/2 & 0\\
0 & 1 & 0 & 1 & 0\\
0 & 0 & 1 & 1/2 & 0\\
0 & 0 & 1 & 0 & 1/2\\
0 & 1 & 0 & 0 & 1\\
1 & 0 & 0 & 0 & 1/2
\end{pmatrix}
\begin{pmatrix}
0 & 1 & 2 & 1 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 1\\
1 & 0 & 0 & 0 & 1 & 2\\
0 & 0 & 0 & 2 & 2 & 0\\
2 & 2 & 0 & 0 & 0 & 0
\end{pmatrix},$$

with

$$\operatorname{rank}(S_P) = 3 \leq \operatorname{rank}_+(S_P) = 5 \leq \min(m, n) = 6.$$
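A numerical check of this factorization, assuming numpy:

```python
import numpy as np

# Slack matrix of the regular hexagon and its rank-5 nonnegative factorization.
S = np.array([[0,1,2,2,1,0],[0,0,1,2,2,1],[1,0,0,1,2,2],
              [2,1,0,0,1,2],[2,2,1,0,0,1],[1,2,2,1,0,0]])
U = np.array([[1,0,0,.5,0],[0,1,0,1,0],[0,0,1,.5,0],
              [0,0,1,0,.5],[0,1,0,0,1],[1,0,0,0,.5]])
V = np.array([[0,1,2,1,0,0],[0,0,1,0,0,1],[1,0,0,0,1,2],
              [0,0,0,2,2,0],[2,2,0,0,0,0]])

assert np.allclose(U @ V, S)               # exact NMF with 5 factors
print(np.linalg.matrix_rank(S))            # 3, while rank_+(S) = 5
```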

Some implications

Problem: limits of LP for solving combinatorial problems: given a polytope, what is the most compact way to represent it?
Its extension complexity = the nonnegative rank of its slack matrix.
Key tool: lower-bound techniques for the nonnegative rank.

Ex. The matching problem cannot be solved via a polynomial-size LP.
Rothvoss (2014). The matching polytope has exponential extension complexity, STOC.

This can be generalized to
approximations (no poly-size LP can approximate these problems up to some precision):
Braun, Fiorini, Pokutta & Steurer (2012). Approximation limits of linear programs (beyond hierarchies), FOCS.
any convex cone, in particular PSD (the so-called PSD-rank):
See the survey: Fawzi, Gouveia, Parrilo, Robinson & Thomas, Positive semidefinite rank, Mathematical Programming, 2015.

Exact NMF computation and regular n-gons

Can we use numerical solvers to get insight into these problems? Yes!

We have developed a library to compute exact NMFs for small matrices using meta-heuristics.
[V14] Vandaele, G., Glineur & D. Tuyttens, Heuristics for Exact NMF (2014).

Extension complexity of the octagon?

$$\operatorname{rank}(S_P) = 3 \leq \operatorname{rank}_+(S_P) = 6 \leq \min(m, n) = 8.$$

We observed a special structure in the solutions for regular n-gons, leading to the best known upper bound and closing the gap for some n-gons:

$$\operatorname{rank}_+(S_n) \leq
\begin{cases}
2\lceil \log_2(n) \rceil - 1 & \text{for } 2^{k-1} < n \leq 2^{k-1} + 2^{k-2},\\
2\lceil \log_2(n) \rceil & \text{for } 2^{k-1} + 2^{k-2} < n \leq 2^k.
\end{cases}$$

[V15] Vandaele, G. & Glineur, On the Linear Extension Complexity of Regular n-gons (2015).

Implication: conic quadratic programming is 'polynomially reducible' to linear programming.
[BTN01] Ben-Tal and Nemirovski (2001). On polyhedral approximations of the second-order cone. Mathematics of Operations Research, 26(2), 193-205.

NMF under the separability assumption

Separability Assumption

Separability of M: there exists an index set $\mathcal{K}$ with $|\mathcal{K}| = r$ and $V \geq 0$ such that

$$M = \underbrace{M(:, \mathcal{K})}_{U} V.$$

[AGKM12] Arora, Ge, Kannan, Moitra, Computing a Nonnegative Matrix Factorization – Provably, STOC 2012.

Applications

In hyperspectral imaging, this is the pure-pixel assumption: for each material, there is a 'pure' pixel containing only that material.
[M+14] Ma et al., A Signal Processing Perspective on Hyperspectral Unmixing: Insights from Remote Sensing, IEEE Signal Processing Magazine 31(1):67-81, 2014.

In document classification: for each topic, there is a 'pure' word used only by that topic (an 'anchor' word).
[A+13] Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.

Time-resolved Raman spectra analysis: each substance has a peak in its spectrum where the other spectra are close to zero.
[L+16] Luce et al., Using Separable Nonnegative Matrix Factorization for the Analysis of Time-Resolved Raman Spectra, Appl. Spectrosc. 2016.

Others: video summarization, foreground-background separation.
[ESV12] Elhamifar, Sapiro, Vidal, See all by looking at a few: Sparse modeling for finding representative objects, CVPR 2012.
[KSK13] Kumar, Sindhwani, Near-separable Non-negative Matrix Factorization with ℓ1- and Bregman Loss Functions, SIAM Data Mining 2015.

Geometric Interpretation

The columns of U are the vertices of the convex hull of the columns of M:

$$M(:, j) = \sum_{k=1}^{r} U(:, k)\, V(k, j) \quad \forall j, \quad \text{where} \quad \sum_{k=1}^{r} V(k, j) = 1, \ V \geq 0.$$

Geometric Interpretation with Noise

The columns of U are the vertices of the convex hull of the columns of M:

$$M(:, j) \approx \sum_{k=1}^{r} U(:, k)\, V(k, j) \quad \forall j, \quad \text{where} \quad \sum_{k=1}^{r} V(k, j) = 1, \ V \geq 0.$$

Goal: a theoretical analysis of the robustness to noise of separable NMF algorithms.

Key Parameters: Noise and Conditioning

We assume

$$M = U [I_r, V'] \Pi + N,$$

where $V' \geq 0$, Π is a permutation and N is the noise.

We will assume that the noise is bounded (but otherwise arbitrary),

$$\|N(:, j)\|_2 \leq \epsilon \quad \text{for all } j,$$

and some dependence on the conditioning $\kappa(U) = \frac{\sigma_{\max}(U)}{\sigma_{\min}(U)}$ is unavoidable.

Successive Projection Algorithm (SPA)

0: Initially $\mathcal{K} = \emptyset$.
For i = 1 : r
  1: Find $j^* = \operatorname{argmax}_j \|M(:, j)\|$.
  2: $\mathcal{K} = \mathcal{K} \cup \{j^*\}$.
  3: $M \leftarrow (I - uu^T)\, M$ where $u = \frac{M(:, j^*)}{\|M(:, j^*)\|_2}$.
end
∼ modified Gram–Schmidt with column pivoting.

Theorem. If $\epsilon \leq O\!\left(\frac{\sigma_{\min}(U)}{\sqrt{r}\, \kappa^2(U)}\right)$, SPA satisfies

$$\|U - M(:, \mathcal{K})\| = \max_{1 \leq k \leq r} \|U(:, k) - M(:, \mathcal{K}(k))\| \leq O\!\left(\epsilon\, \kappa^2(U)\right).$$

Advantages. Extremely fast, no parameters.
Drawbacks. Requires U to be full rank; the bound is weak.

[GV14] G., Vavasis, Fast and Robust Recursive Algorithms for Separable Nonnegative Matrix Factorization, IEEE Trans. Patt. Anal. Mach. Intell. 36 (4), pp. 698-714, 2014.
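A direct numpy translation of this pseudocode (a sketch; the column norms are taken to be ℓ2):

```python
import numpy as np

def spa(M, r):
    """Successive Projection Algorithm; returns the selected column indices K."""
    M = M.astype(float).copy()   # work on a copy, M is projected in place
    K = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(M, axis=0)))  # step 1
        K.append(j)                                    # step 2
        u = M[:, j] / np.linalg.norm(M[:, j])
        M -= np.outer(u, u @ M)                        # step 3: M <- (I - uu^T) M
    return K
```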

Pre-conditioning for More Robust SPA

Observation. Pre-multiplying M preserves separability:

$$PM = P\big(U[I_r, V']\Pi + N\big) = (PU)[I_r, V']\Pi + PN.$$

Ideally, $P = U^{-1}$, so that $\kappa(PU) = 1$ (assuming m = r).

Solving for the minimum-volume ellipsoid centered at the origin and containing all the columns of M (which is SDP representable),

$$\min_{A \in \mathbb{S}^{r}_{+}} \log \det(A)^{-1} \quad \text{s.t.} \quad m_j^T A\, m_j \leq 1 \ \forall j,$$

allows one to approximate $U^{-1}$: in fact, $A^* \approx (UU^T)^{-1}$.

Theorem. If $\epsilon \leq O\!\left(\frac{\sigma_{\min}(U)}{r\sqrt{r}}\right)$, preconditioned SPA satisfies

$$\|U - M(:, \mathcal{K})\| \leq O(\epsilon\, \kappa(U)).$$

[GV15] G., Vavasis, SDP-based Preconditioning for More Robust Near-Separable NMF, SIAM J. on Optimization, 2015.
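A sketch of this ellipsoid computation, assuming cvxpy with a conic solver that supports log_det (e.g. SCS), and assuming M has already been reduced to r rows (for instance via a truncated SVD) so that m = r as above; since $A^* \approx (UU^T)^{-1}$, taking $P = (A^*)^{1/2}$ gives $\kappa(PU) \approx 1$:

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import sqrtm

def min_vol_ellipsoid_precond(M):
    """Minimum-volume origin-centered ellipsoid {x : x^T A x <= 1} containing
    the columns of M (assumed to have r rows), solved as an SDP; returns the
    preconditioner P = A^{1/2}."""
    r, n = M.shape
    A = cp.Variable((r, r), PSD=True)
    cons = [cp.quad_form(M[:, j], A) <= 1 for j in range(n)]
    # min log det(A)^{-1} is the same as max log det(A)
    cp.Problem(cp.Maximize(cp.log_det(A)), cons).solve(solver=cp.SCS)
    return np.real(sqrtm(A.value))
```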

Geometric Interpretation

Figure: geometric interpretation of the SDP-based preconditioning.

See also Mizutani, Ellipsoidal Rounding for Nonnegative Matrix Factorization Under Noisy Separability, JMLR, 2014.

Synthetic data sets

Each entry of $U \in \mathbb{R}^{40 \times 20}_+$ is uniform in [0, 1]; each column is normalized.
The other columns of M are the middle points of pairs of columns of U (hence there are $\binom{20}{2} = 190$ of them).
The noise moves the middle points toward the outside of the convex hull of the columns of U.

Results for the synthetic data sets

Figure: average fraction of columns correctly extracted as a function of the noise level (for each noise level, 25 matrices are generated).

Combinatorial formulation for separable NMF

We want to find the index set $\mathcal{K}$ with $|\mathcal{K}| = r$ such that

$$M = M(:, \mathcal{K})\, V.$$

This is equivalent to finding $X \in \mathbb{R}^{n \times n}$ with r non-zero rows such that

$$M = MX.$$

A combinatorial formulation:

$$\min_X \|X\|_{\text{row},0} \quad \text{such that} \quad M = MX \ \text{ or } \ \|M - MX\| \leq \epsilon.$$

How to make X row sparse?

Page 100: Nicolas Gillis Joint work with Fran˘cois Glineur, …laurent.risser.free.fr › TMP_SHARE › optimCIMI2018 › slides...Computing Nonnegative Matrix Factorizations Nicolas Gillis

A Linear Optimization Model

min_{X ∈ R^{n×n}_+} trace(X) = ||diag(X)||_1

such that ||M − MX|| ≤ ε, X_ij ≤ X_ii ≤ 1 for all i, j.

Robustness: noise ≤ O(κ^{−1}) ⇒ error ≤ O(rεκ) [GL14].

This model is an improvement over [B+12]: it is more robust and detects the factorization rank r automatically.

It is equivalent [GL16] to using ||X||_{1,∞} = Σ_{i=1}^n ||X(i,:)||_∞ as a convex surrogate for ||X||_{row,0} [E+12].

[GL14] G., Luce, Robust Near-Separable NMF Using Linear Optimization, JMLR 2014.

[B+12] Bittorf, Recht, Ré, Tropp, Factoring nonnegative matrices with LPs, NIPS 2012.

[E+12] Esser et al., A convex model for NMF and dimensionality reduction on physical space, IEEE Trans. Image Processing, 2012.

[GL16] G. and Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
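A direct transcription of the model in a convex modeling language is straightforward. The sketch below uses cvxpy and, for simplicity, the Frobenius norm (a polyhedral norm such as ℓ1 would yield a genuine LP). Selecting K from the largest diagonal entries of the solution is a common post-processing step, stated here as an assumption.

```python
import numpy as np
import cvxpy as cp

def self_dictionary_model(M, eps, r):
    n = M.shape[1]
    X = cp.Variable((n, n), nonneg=True)
    cons = [cp.norm(M - M @ X, 'fro') <= eps, cp.diag(X) <= 1]
    cons += [X[i, :] <= X[i, i] for i in range(n)]   # X_ij <= X_ii for all j
    cp.Problem(cp.Minimize(cp.trace(X)), cons).solve()
    # Assumption: the r largest diagonal entries indicate the extracted columns.
    return np.argsort(-np.diag(X.value))[:r]
```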


Practical Model and Algorithm

min_{X ∈ Ω} ||M − MX||_F^2 + μ tr(X),

Ω = { X ∈ R^{n×n} | X_ii ≤ 1, w_i X_ij ≤ w_j X_ii ∀ i, j }.

We used a fast gradient method (an optimal first-order method):

1. Choose an initial point X^(0); set Y = X^(0) and α_1 ∈ (0, 1).

2. For k = 1, 2, . . .

   2.1 X^(k) = P_Ω(Y − (1/L) ∇f(Y)).

   2.2 Y = X^(k) + β_k (X^(k) − X^(k−1)),

   where β_k = α_k(1 − α_k) / (α_k^2 + α_{k+1}), with α_{k+1} ≥ 0 such that α_{k+1}^2 = (1 − α_{k+1}) α_k^2.

The projection onto Ω can be computed efficiently in O(n^2 log(n)) operations.

The total computational cost is O(pn^2) operations.

[GL16] G. and Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
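A minimal sketch of this scheme for f(X) = ||M − MX||_F^2 + μ tr(X), with the α/β updates above. The exact O(n^2 log(n)) projection onto Ω is more involved (see [GL16]); the clipping projection below is only a crude stand-in, labeled as such.

```python
import numpy as np

def fgm(M, mu, proj, iters=500, alpha=0.5):
    """Fast gradient method for min_{X in Omega} ||M - MX||_F^2 + mu*tr(X)."""
    n = M.shape[1]
    MtM = M.T @ M
    L = 2 * np.linalg.norm(MtM, 2)                # Lipschitz constant of grad f
    X = np.zeros((n, n)); Y = X.copy()
    for _ in range(iters):
        grad = 2 * (MtM @ Y - MtM) + mu * np.eye(n)
        X_new = proj(Y - grad / L)
        # alpha_{k+1} >= 0 solves alpha_{k+1}^2 = (1 - alpha_{k+1}) * alpha_k^2.
        alpha_new = 0.5 * alpha * (np.sqrt(alpha**2 + 4) - alpha)
        beta = alpha * (1 - alpha) / (alpha**2 + alpha_new)
        Y = X_new + beta * (X_new - X)
        X, alpha = X_new, alpha_new
    return X

def proj_simple(Z):
    """Crude stand-in for P_Omega: enforces 0 <= X_ij <= X_ii <= 1 only
    (the true projection of [GL16] also handles the weights w_i)."""
    Z = np.clip(Z, 0, 1)
    d = np.diag(Z).copy()
    Z = np.minimum(Z, d[:, None])
    np.fill_diagonal(Z, d)
    return Z
```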


Hyperspectral unmixing

                  r = 6                        r = 8
            Time (s.)  Rel. err. (%)    Time (s.)  Rel. err. (%)
VCA             1.02       18.05            1.05       22.68
VCA-500         0.03        7.19            0.09        7.25
SPA             0.26        9.58            0.32        9.45
SPA-500        <0.01       10.05           <0.01        8.86
SNPA           13.60        9.63           23.02        5.64
SNPA-500        0.15       10.05            0.25        8.86
XRAY           28.17        7.50           95.34        6.82
XRAY-500        0.15        8.07            0.28        7.36
H2NMF          12.20        5.81           14.92        5.47
H2NMF-500       0.27        5.87            0.37        5.68
FGNSR-500      40.11        5.07           39.49        4.08

Table: Numerical results for the Urban HSI (the lowest relative error in each case is achieved by FGNSR-500).


Figure: Abundance maps extracted by FGNSR-500.


Minimum-volume NMF: Relaxing separability

Separable NMF:

min_{K, V≥0} ||M − M(:,K)V||_F^2 such that |K| = r.

Relaxing the constraint that the columns of U are chosen among the columns of M leads to minimum-volume NMF:

min_{U≥0, V≥0} vol(U) such that ||M − UV||_F^2 ≤ ε,

where vol(U) ∼ det(U^T U) and V(:, j) ∈ Δ^r for all j.

Open problems: efficient algorithms for min-vol NMF, robustness to noise.

Fu, Huang, Sidiropoulos, Ma, Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications, arXiv:1803.01257, 2018.
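In practice the constrained problem is often handled in penalized form with a smoothed volume term. A sketch of the resulting objective, where the penalty weight lam and the smoothing delta are assumptions (log det(UᵀU + δI) is a common choice in the min-vol literature):

```python
import numpy as np

def minvol_objective(M, U, V, lam, delta=1e-3):
    """Penalized min-vol objective; lam and delta are assumptions."""
    fit = np.linalg.norm(M - U @ V, 'fro')**2
    vol = np.log(np.linalg.det(U.T @ U + delta * np.eye(U.shape[1])))
    return fit + lam * vol
```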


Identifiability with sparsity

Decompose a low-rank matrix whose coefficients have known sparsity:

M = UV, rank(M) = rank(U) = r, ||V(:, j)||_0 ≤ k = r − s < r ∀ j.

Many theoretical results (see, e.g., [Gribonval 16]) and algorithms (dictionary learning) exist. But:

✗ Not many results are specific to the low-rank case.

✗ There are only two deterministic identifiability results [Elad 06, Georgiev 05].

✗ Not much exists in the NMF case except ℓ1 regularization.


Identifiability with sparsity: example

Example: p = 3, r = 3, s = sparsity = 1, n = 9.

Figure: data points, first decomposition, second decomposition (two essentially different decompositions of the same data).


Identifiability results

Theorem

Let M = UV where rank(U) = rank(M) = r and each column of V has at least s zeros. The factorization (U, V) is essentially unique if, on each hyperplane spanned by all but one column of U, there are ⌊r(r−2)/s⌋ + 1 data points with spark r.

✓ For s = 1, this requires r^3 − 2r^2 + r data points, and it is tight up to the constant r (counterexamples exist for any n = r^3 − 2r^2).

✓ For s = r − 1, this requires r data points, and it is tight (one on each intersection of r − 1 hyperplanes).

✓ It is tight up to constant factors for any s = βr, for any fixed constant β.

✓ Nonnegativity is not taken into account in the analysis; it helps both in theory and in practice (further work).

[CG18] Cohen, G., Identifiability of Low-Rank Sparse Component Analysis, arXiv:1808.08765.
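The spark condition can be checked by brute force on small examples (the spark is the size of the smallest linearly dependent subset of columns, so computing it is hard in general). A hedged helper, with an illustrative name:

```python
import numpy as np
from itertools import combinations

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (brute force)."""
    n = A.shape[1]
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < k:
                return k
    return np.inf   # all columns are linearly independent
```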


Geometric intuition

Example: p = 3, r = 3, sparsity = 1, n = 4 + 3 + 2 = 9.

Figure: data points, unique decomposition.


Sparsity in action

Spectral unmixing, r = 6, s = 4.

✓ Sparsity is another way to obtain identifiability for matrix decompositions.

✗ It leads to hard combinatorial problems to solve...


Take-home messages

1. NMF is a useful and widely used linear model in data analysis and machine learning.

2. NMF is difficult (NP-hard) and ill-posed (non-uniqueness).

3. NMF is closely related to the nested polytopes problem and to extended formulations.

4. NMF with a (self-)dictionary is tractable and well-posed (separable NMF).

5. To obtain identifiable NMF models, minimum volume or sparsity can be used but, as opposed to separability, they do not lead to tractable models. This is an important direction of research (robustness to noise, tractability).


Thank you for your attention!

Code and papers available from https://sites.google.com/site/nicolasgillis
