Low Rank Approximation, Lecture 1
Daniel Kressner, Chair for Numerical Algorithms and HPC
Institute of Mathematics, [email protected]
1
Organizational aspects
- Lecture dates: 16.4., 23.4., 30.4., 14.5., 28.5., 4.6., 11.6., 18.6., 25.6., 2.7. (tentative)
- Exam: To be discussed next week (most likely oral exam).
- Webpage: https://www5.in.tum.de/wiki/index.php/Low_Rank_Approximation
  Slides on http://anchp.epfl.ch.
- EFY = Exercise For You.
2
From http://www.niemanlab.org
... his [Aleksandr Kogan's] message went on to confirm that his approach was indeed similar to SVD or other matrix factorization methods, like in the Netflix Prize competition, and the Kosinski-Stillwell-Graepel Facebook model. Dimensionality reduction of Facebook data was the core of his model.
3
Rank and matrix factorizations

For a field F, let A ∈ F^(m×n). Then

rank(A) := dim(range(A)).

For simplicity, F = R throughout the lecture and often m ≥ n.
Let B = {b_1, ..., b_r} ⊂ R^m with r = rank(A) be a basis of range(A). Then each of the columns of A = (a_1, a_2, ..., a_n) can be expressed as a linear combination of B:

a_j = Σ_{i=1}^r b_i c_{ji}  for some coefficients c_{ji} ∈ R, i = 1, ..., r, j = 1, ..., n.

Defining B = (b_1, b_2, ..., b_r) ∈ R^(m×r):

a_j = B [ c_{j1} ; ... ; c_{jr} ],    A = B [ c_{11} · · · c_{n1} ; ... ; c_{1r} · · · c_{nr} ].
4
Rank and matrix factorizations

Lemma. A matrix A ∈ R^(m×n) of rank r admits a factorization of the form

A = B C^T,  B ∈ R^(m×r),  C ∈ R^(n×r).

We say that A has low rank if rank(A) ≪ m, n.
Illustration of low-rank factorization: storing A requires mn entries, while storing the factors B and C requires only mr + nr entries.
- Generically (and in most applications), A has full rank, that is, rank(A) = min{m, n}.
- Aim instead at approximating A by a low-rank matrix.
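As a quick numerical illustration of the storage savings, here is a numpy sketch (the sizes m, n, r are hypothetical, chosen only for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 300, 200, 10

# Construct a rank-r matrix explicitly in factored form A = B C^T.
B = rng.standard_normal((m, r))
C = rng.standard_normal((n, r))
A = B @ C.T

# Storing the factors needs (m + n) r entries instead of m n.
print(np.linalg.matrix_rank(A))    # r (up to roundoff)
print(m * n, (m + n) * r)          # 60000 vs 5000
```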
5
Questions addressed in lecture series

What? Theoretical foundations of low-rank approximation.
When? A priori and a posteriori estimates for low-rank approximation. Situations that allow for low-rank approximation techniques.
Why? Applications in engineering, scientific computing, data analysis, ... where low-rank approximation plays a central role.
How? State-of-the-art algorithms for performing and working with low-rank approximations.

Will cover both matrices and tensors.
6
Contents of Lecture 1

1. Fundamental tools (SVD, relation to eigenvalues, norms, best low-rank approximation)
2. Overview of applications
3. Fundamental tools (stability, QR)
4. Extensions (weighted approximation, bivariate functions)
5. Subspace iteration
7
Literature for Lecture 1

Golub/Van Loan'2013: Golub, Gene H.; Van Loan, Charles F. Matrix computations. Fourth edition. Johns Hopkins University Press, Baltimore, MD, 2013.
Horn/Johnson'2013: Horn, Roger A.; Johnson, Charles R. Matrix analysis. Second edition. Cambridge University Press, 2013.
+ References on slides.
8
1. Fundamental tools
- SVD
- Relation to eigenvalues
- Norms
- Best low-rank approximation
9
The singular value decomposition

Theorem (SVD). Let A ∈ R^(m×n) with m ≥ n. Then there are orthogonal matrices U ∈ R^(m×m) and V ∈ R^(n×n) such that

A = U Σ V^T,  with  Σ = [ diag(σ_1, ..., σ_n) ; 0 ] ∈ R^(m×n)

and σ_1 ≥ σ_2 ≥ · · · ≥ σ_n ≥ 0.

- σ_1, ..., σ_n are called singular values.
- u_1, ..., u_m are called left singular vectors.
- v_1, ..., v_n are called right singular vectors.
- A v_i = σ_i u_i,  A^T u_i = σ_i v_i  for i = 1, ..., n.
- Singular values are always uniquely defined by A.
- Singular vectors are never unique. If σ_1 > σ_2 > · · · > σ_n > 0, they are unique up to simultaneous sign flips u_i ← ±u_i, v_i ← ±v_i.
10
SVD: Sketch of proof

Induction over n; the case n = 1 is trivial.
For general n, let v_1 solve max{‖Av‖_2 : ‖v‖_2 = 1} =: ‖A‖_2. Set σ_1 := ‖A‖_2 and u_1 := A v_1 / σ_1 (if σ_1 = 0, choose an arbitrary unit vector u_1). By definition,

A v_1 = σ_1 u_1.

After completion to orthogonal matrices U_1 = (u_1, U_⊥) ∈ R^(m×m) and V_1 = (v_1, V_⊥) ∈ R^(n×n):

U_1^T A V_1 = [ u_1^T A v_1, u_1^T A V_⊥ ; U_⊥^T A v_1, U_⊥^T A V_⊥ ] = [ σ_1, w^T ; 0, A_1 ],

with w := V_⊥^T A^T u_1 and A_1 := U_⊥^T A V_⊥. Since ‖·‖_2 is invariant under orthogonal transformations,

σ_1 = ‖A‖_2 = ‖U_1^T A V_1‖_2 = ‖ [ σ_1, w^T ; 0, A_1 ] ‖_2 ≥ √(σ_1² + ‖w‖_2²).

Hence, w = 0. The proof is completed by applying the induction hypothesis to A_1.
11
Very basic properties of the SVD

- r = rank(A) is the number of nonzero singular values of A.
- kernel(A) = span{v_(r+1), ..., v_n}.
- range(A) = span{u_1, ..., u_r}.
12
SVD: Computation (for small dense matrices)

Computation of the SVD proceeds in two steps:
1. Reduction to bidiagonal form: By applying n Householder reflectors from the left and n − 1 Householder reflectors from the right, compute orthogonal matrices U_1, V_1 such that

U_1^T A V_1 = B = [ B_1 ; 0 ],

that is, B_1 ∈ R^(n×n) is an upper bidiagonal matrix.
2. Reduction to diagonal form: Use Divide&Conquer to compute orthogonal matrices U_2, V_2 such that Σ = U_2^T B_1 V_2 is diagonal.

Set U = U_1 U_2 and V = V_1 V_2.
Step 1 is usually the most expensive. Remarks on Step 1:
- If m is significantly larger than n, say, m ≥ 3n/2, first computing a QR decomposition of A reduces the cost.
- Most modern implementations reduce A successively via banded form to bidiagonal form.²

²Bischof, C. H.; Lang, B.; Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Software 26 (2000), no. 4, 581–601.
13
SVD: Computation (for small dense matrices)

In most applications, the vectors u_(n+1), ..., u_m are not of interest. By omitting these vectors one obtains the following variant of the SVD.

Theorem (Economy size SVD). Let A ∈ R^(m×n) with m ≥ n. Then there are a matrix U ∈ R^(m×n) with orthonormal columns and an orthogonal matrix V ∈ R^(n×n) such that

A = U Σ V^T,  with  Σ = diag(σ_1, ..., σ_n) ∈ R^(n×n)

and σ_1 ≥ σ_2 ≥ · · · ≥ σ_n ≥ 0.

Computed by MATLAB's [U,S,V] = svd(A,'econ').
Complexity:

                        memory        operations
singular values only    O(mn)         O(mn²)
economy size SVD        O(mn)         O(mn²)
(full) SVD              O(m² + mn)    O(m²n + mn²)
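In numpy, the economy size SVD corresponds to passing full_matrices=False; a small sketch on random demo data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))    # m = 6, n = 4, m >= n

# Full SVD: U is 6x6. Economy size SVD: U keeps only n = 4 columns.
U_full, s, Vt = np.linalg.svd(A)                       # full_matrices=True
U, s_econ, Vt_econ = np.linalg.svd(A, full_matrices=False)

print(U_full.shape, U.shape)       # (6, 6) (6, 4)
# Both variants reproduce A; Sigma = diag(s_econ) in the economy variant.
print(np.allclose(U @ np.diag(s_econ) @ Vt_econ, A))   # True
```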
14
SVD: Computation (for small dense matrices)

Beware of roundoff error when interpreting singular value plots.
Example: semilogy(svd(hilb(100)))

[Semilogarithmic plot of the computed singular values, decaying from 10^0 and flattening out near 10^-20.]

- The kink is caused by roundoff error and does not reflect the true behavior of the singular values.
- The exact singular values are known to decay exponentially.³
- Sometimes more accuracy is possible.⁴

³Beckermann, B. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numer. Math. 85 (2000), no. 4, 553–577.
⁴Drmac, Z.; Veselic, K. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29 (2007), no. 4, 1322–1342.
Singular/eigenvalue relations: symmetric matrices

A symmetric matrix A = A^T ∈ R^(n×n) admits a spectral decomposition

A = U diag(λ_1, λ_2, ..., λ_n) U^T

with an orthogonal matrix U. After reordering, we may assume |λ_1| ≥ |λ_2| ≥ · · · ≥ |λ_n|. The spectral decomposition can be turned into an SVD A = U Σ V^T by defining

Σ = diag(|λ_1|, ..., |λ_n|),  V = U diag(sign(λ_1), ..., sign(λ_n)).

Remark: This extends to the more general case of normal matrices (e.g., orthogonal or skew-symmetric) via complex spectral or real Schur decompositions.
16
Singular/eigenvalue relations: general matrices

Consider the SVD A = U Σ V^T of A ∈ R^(m×n) with m ≥ n. We then have:
1. Spectral decomposition of the Gramian A^T A:

A^T A = V Σ^T Σ V^T = V diag(σ_1², ..., σ_n²) V^T

⇒ A^T A has eigenvalues σ_1², ..., σ_n²; the right singular vectors of A are eigenvectors of A^T A.
2. Spectral decomposition of the Gramian A A^T:

A A^T = U Σ Σ^T U^T = U diag(σ_1², ..., σ_n², 0, ..., 0) U^T

⇒ A A^T has eigenvalues σ_1², ..., σ_n² and, additionally, m − n zero eigenvalues; the first n left singular vectors of A are eigenvectors of A A^T.
3. Decomposition of the Golub-Kahan matrix

𝒜 = [ 0, A ; A^T, 0 ] = [ U, 0 ; 0, V ] [ 0, Σ ; Σ^T, 0 ] [ U, 0 ; 0, V ]^T.

EFY. Prove that 𝒜 has eigenvalues ±σ_j with eigenvectors (1/√2) [ ±u_j ; v_j ].
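These three relations are easy to check numerically; a numpy sketch on a random demo matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))
s = np.linalg.svd(A, compute_uv=False)           # sigma_1 >= ... >= sigma_n

# Eigenvalues of the Gramian A^T A are the squared singular values.
evals = np.linalg.eigvalsh(A.T @ A)[::-1]        # reorder to descending
print(np.allclose(evals, s**2))                  # True

# The Golub-Kahan matrix [[0, A], [A^T, 0]] has eigenvalues +-sigma_j,
# plus m - n zero eigenvalues.
GK = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])
gk_evals = np.linalg.eigvalsh(GK)                # ascending order
print(np.allclose(np.sort(gk_evals)[-n:], np.sort(s)))   # True
```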
17
Norms: Spectral and Frobenius norm

Given the SVD A = U Σ V^T, one defines:
- Spectral norm: ‖A‖_2 = σ_1.
- Frobenius norm: ‖A‖_F = √(σ_1² + · · · + σ_n²).

Basic properties:
- ‖A‖_2 = max{‖Av‖_2 : ‖v‖_2 = 1} (see proof of SVD).
- ‖·‖_2 and ‖·‖_F are both (submultiplicative) matrix norms.
- ‖·‖_2 and ‖·‖_F are both unitarily invariant, that is,

‖QAZ‖_2 = ‖A‖_2,  ‖QAZ‖_F = ‖A‖_F

for any orthogonal matrices Q, Z.
- ‖A‖_2 ≤ ‖A‖_F ≤ √r ‖A‖_2 with r = rank(A).
- ‖AB‖_F ≤ min{‖A‖_2 ‖B‖_F, ‖A‖_F ‖B‖_2}.

EFY. Prove these two inequalities. Hint for the second inequality: Use the relations on the next slide to first show that ‖B‖_F = ‖(‖b_1‖_2, ..., ‖b_n‖_2)‖_2.

EFY. Find a matrix A ∈ R^(m1×n) and a nonzero matrix B ∈ R^(m2×n) such that ‖A‖_2 = ‖[ A ; B ]‖_2. Classify the set of matrices A ∈ R^(m1×n) such that ‖A‖_2 < ‖[ A ; B ]‖_2 for every nonzero matrix B ∈ R^(m2×n). Investigate analogous questions for the Frobenius norm.
18
Euclidean geometry on matrices

Let B ∈ R^(n×n) have eigenvalues λ_1, ..., λ_n ∈ C. Then

trace(B) := b_11 + · · · + b_nn = λ_1 + · · · + λ_n.

In turn,

‖A‖_F² = trace(A^T A) = trace(A A^T) = Σ_{i,j} a_ij².

Two simple consequences:
- ‖·‖_F is the norm induced by the matrix inner product

⟨A, B⟩ := trace(A B^T),  A, B ∈ R^(m×n).

- Partition A = (a_1, a_2, ..., a_n) and define the vectorization

vec(A) = [ a_1 ; ... ; a_n ] ∈ R^(mn).

Then ⟨A, B⟩ = ⟨vec(A), vec(B)⟩ and ‖A‖_F = ‖vec(A)‖_2.
19
Von Neumann's trace inequality

Theorem. For m ≥ n, let A, B ∈ R^(m×n) have singular values σ_1(A) ≥ · · · ≥ σ_n(A) and σ_1(B) ≥ · · · ≥ σ_n(B), respectively. Then

|⟨A, B⟩| ≤ σ_1(A) σ_1(B) + · · · + σ_n(A) σ_n(B).

Consequence:

‖A − B‖_F² = ⟨A − B, A − B⟩ = ‖A‖_F² − 2⟨A, B⟩ + ‖B‖_F²
           ≥ ‖A‖_F² − 2 Σ_{i=1}^n σ_i(A) σ_i(B) + ‖B‖_F²
           = Σ_{i=1}^n (σ_i(A) − σ_i(B))².

EFY. Use Von Neumann's trace inequality and the SVD to show for 1 ≤ k ≤ n that

max{|⟨A, P Q^T⟩| : P ∈ R^(m×k), Q ∈ R^(n×k), P^T P = Q^T Q = I_k} = σ_1(A) + · · · + σ_k(A).
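Both the trace inequality and its consequence can be verified numerically; a numpy sketch on random demo matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((5, 4))

inner = np.trace(A @ B.T)                       # <A, B>
sA = np.linalg.svd(A, compute_uv=False)
sB = np.linalg.svd(B, compute_uv=False)

# |<A, B>| <= sum_i sigma_i(A) sigma_i(B) ...
print(abs(inner) <= sA @ sB)                                    # True
# ... and hence ||A - B||_F^2 >= sum_i (sigma_i(A) - sigma_i(B))^2
print(np.linalg.norm(A - B, 'fro')**2 >= np.sum((sA - sB)**2))  # True
```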
20
Proof of Von Neumann's trace inequality⁵

The singular value vector σ(A) can be written as a linear combination with nonnegative coefficients,

σ(A) = σ_n(A) f_n + (σ_(n−1)(A) − σ_n(A)) f_(n−1) + · · ·

with f_j = e_1 + · · · + e_j. Decompose A analogously via its SVD A = U_A Σ_A V_A^T:

A = σ_n(A) A_n + (σ_(n−1)(A) − σ_n(A)) A_(n−1) + · · · ,  A_j := U_A diag(f_j) V_A^T.

Insert into the left-hand side of the trace inequality:

|⟨A, B⟩| ≤ σ_n(A) |⟨A_n, B⟩| + (σ_(n−1)(A) − σ_n(A)) |⟨A_(n−1), B⟩| + · · · .

The right-hand side is linear with respect to σ(A) ⇒ may assume A = A_k for some k = 1, ..., n. Analogously for B.

⁵This proof follows [Grigorieff, R. D. Note on von Neumann's trace inequality. Math. Nachr. 151 (1991), 327–328]. For Mirsky's ingenious proof based on doubly stochastic matrices, see Theorem 8.7.6 in [Horn/Johnson'2013].
21
Proof of Von Neumann's trace inequality

Let A = U_A diag(f_k) V_A^T, B = U_B diag(f_ℓ) V_B^T, and k ≤ ℓ. Then

⟨A, B⟩ = trace( ( Σ_{i=1}^k v_(A,i) u_(A,i)^T ) ( Σ_{j=1}^ℓ u_(B,j) v_(B,j)^T ) )
       = Σ_{i=1}^k Σ_{j=1}^ℓ trace( v_(A,i) u_(A,i)^T u_(B,j) v_(B,j)^T )
       = Σ_{i=1}^k Σ_{j=1}^ℓ ( u_(A,i)^T u_(B,j) ) ( v_(B,j)^T v_(A,i) ).

By the Cauchy-Schwarz inequality,

|⟨A, B⟩| ≤ Σ_{i=1}^k ‖U_B^T u_(A,i)‖_2 ‖V_B^T v_(A,i)‖_2 = k,

which completes the proof.
22
Schatten norms

There are other unitarily invariant matrix norms.⁶
Let s(A) = (σ_1, ..., σ_n). The p-Schatten norm defined by

‖A‖_(p) := ‖s(A)‖_p

is a matrix norm for any 1 ≤ p ≤ ∞.
p = ∞: spectral norm; p = 2: Frobenius norm; p = 1: nuclear norm.

EFY. What is lim_{p→0+} ‖A‖_(p)?

Definition. The dual of a matrix norm ‖·‖ on R^(m×n) is defined by

‖A‖^D = max{⟨A, B⟩ : ‖B‖ = 1}.

Lemma. Let p, q ∈ [1, ∞] such that p⁻¹ + q⁻¹ = 1. Then

‖A‖_(p)^D = ‖A‖_(q).

EFY. Prove this lemma for p = ∞. Hint: Von Neumann's trace inequality.

⁶Complete characterization via symmetric gauge functions in [Horn/Johnson'2013].
Best low-rank approximation

Consider k < n and let

U_k := (u_1 · · · u_k),  Σ_k := diag(σ_1, ..., σ_k),  V_k := (v_1 · · · v_k).

Then

T_k(A) := U_k Σ_k V_k^T

has rank at most k. For any unitarily invariant norm ‖·‖:

‖T_k(A) − A‖ = ‖diag(0, ..., 0, σ_(k+1), ..., σ_n)‖.

In particular, for the spectral norm and the Frobenius norm:

‖A − T_k(A)‖_2 = σ_(k+1),  ‖A − T_k(A)‖_F = √(σ_(k+1)² + · · · + σ_n²).

The two error norms are nearly equal if and only if the singular values decay sufficiently quickly.
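A minimal numpy sketch of T_k and its two error formulas (the helper name truncate is ours, used only for illustration):

```python
import numpy as np

def truncate(A, k):
    """T_k(A) = U_k Sigma_k V_k^T from the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
s = np.linalg.svd(A, compute_uv=False)
k = 3

Tk = truncate(A, k)
# Spectral error: sigma_{k+1}.  Frobenius error: sqrt(sigma_{k+1}^2 + ...).
print(np.isclose(np.linalg.norm(A - Tk, 2), s[k]))                           # True
print(np.isclose(np.linalg.norm(A - Tk, 'fro'), np.sqrt(np.sum(s[k:]**2))))  # True
```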
24
Best low-rank approximation

Theorem (Schmidt-Mirsky). Let A ∈ R^(m×n). Then

‖A − T_k(A)‖ = min{‖A − B‖ : B ∈ R^(m×n) has rank at most k}

holds for any unitarily invariant norm ‖·‖.

Proof⁷ for ‖·‖_F: Follows directly from the consequence of Von Neumann's trace inequality.
Proof for ‖·‖_2: For any B ∈ R^(m×n) of rank ≤ k, kernel(B) has dimension ≥ n − k. Hence, there exists w ∈ kernel(B) ∩ range(V_(k+1)) with ‖w‖_2 = 1. Then

‖A − B‖_2² ≥ ‖(A − B) w‖_2² = ‖A w‖_2² = ‖A V_(k+1) V_(k+1)^T w‖_2² = ‖U_(k+1) Σ_(k+1) V_(k+1)^T w‖_2²
           = Σ_{j=1}^{k+1} σ_j² |v_j^T w|² ≥ σ_(k+1)² Σ_{j=1}^{k+1} |v_j^T w|² = σ_(k+1)².

⁷See Section 7.4.9 in [Horn/Johnson'2013] for the general case.
Best low-rank approximation

Uniqueness:
- If σ_k > σ_(k+1), the best rank-k approximation with respect to the Frobenius norm is unique.
- If σ_k = σ_(k+1), the best rank-k approximation is never unique. For example, I_3 has several best rank-two approximations:

diag(1, 1, 0),  diag(1, 0, 1),  diag(0, 1, 1).

- With respect to the spectral norm, the best rank-k approximation is only unique if σ_(k+1) = 0. For example, diag(2, 1, ε) with 0 < ε < 1 has infinitely many best rank-two approximations:

diag(2, 1, 0),  diag(2 − ε/2, 1 − ε/2, 0),  diag(2 − ε/3, 1 − ε/3, 0),  ... .

EFY. Given a symmetric matrix A ∈ R^(n×n) and 1 ≤ k < n, show that there is always a best rank-k approximation that is symmetric. Is every best rank-k approximation (with respect to the Frobenius norm) symmetric? What about the spectral norm?
26
Approximating the range of a matrix

Aim at finding a matrix Q ∈ R^(m×k) with orthonormal columns such that

range(Q) ≈ range(A).

I − Q Q^T is the orthogonal projector onto range(Q)^⊥ ⇒ aim at minimizing

‖(I − Q Q^T) A‖ = ‖A − Q Q^T A‖

for a unitarily invariant norm ‖·‖. Because rank(Q Q^T A) ≤ k,

‖A − Q Q^T A‖ ≥ ‖A − T_k(A)‖.

Setting Q = U_k one obtains

U_k U_k^T A = U_k U_k^T U Σ V^T = U_k Σ_k V_k^T = T_k(A).

⇒ Q = U_k is optimal.
27
Approximating the range of a matrix

Variation:

max{‖Q^T A‖_F : Q^T Q = I_k}.

Equivalent to

max{|⟨A A^T, Q Q^T⟩| : Q^T Q = I_k}.

By Von Neumann's trace inequality and the equivalence between eigenvectors of A A^T and left singular vectors of A, the optimal Q is given by U_k.

EFY. When replacing the Frobenius norm by the spectral norm in this formulation, does one obtain the same result?
28
2. Applications
- Principal Component Analysis
- Matrix Completion
- Some other applications
29
Principal Component Analysis (PCA)

- Most popular method for dimensionality reduction in statistics, data science, ...

Consider N independently drawn observations for K random variables X_1, ..., X_K. Illustration of N = 100 observations for K = 2:

[Scatter plot of the 100 observations.]
30
Principal Component Analysis (PCA)

Each of the observations is arranged in a vector x_j ∈ R^K with j = 1, ..., N.
Subtract the sample mean

x̄ := (1/N)(x_1 + · · · + x_N).

Data with mean subtracted:

[Scatter plot of the centered observations.]
31
Principal Component Analysis (PCA)

Covariance matrix:

C := 1/(N − 1) Σ_{j=1}^N (x_j − x̄)(x_j − x̄)^T.

The diagonal entry c_ii estimates the variance of X_i, while the off-diagonal entry c_ik estimates the covariance between X_i and X_k. Defining A := [ x_1 − x̄, ..., x_N − x̄ ] ∈ R^(K×N), we can equivalently write

C = 1/(N − 1) A A^T.
32
Principal Component Analysis (PCA)

Reduce data to dimension 1: Find a linear combination Y_1 = w_1 X_1 + · · · + w_K X_K with w_1, ..., w_K ∈ R and w_1² + · · · + w_K² = 1 that captures most of the observed variation ⇒ maximize the variance of the new variable Y_1.
The corresponding observations of Y_1 are given by w^T x_1, ..., w^T x_N with sample mean w^T x̄ ⇒ maximization of the variance corresponds to

max_{w ∈ R^K, ‖w‖_2 = 1} Σ_{j=1}^N (w^T x_j − w^T x̄)² = max_{w ∈ R^K, ‖w‖_2 = 1} ‖w^T A‖_2².

The optimal vector w is given by the dominant left singular vector of A! (Corresponds to the eigenvector for the largest eigenvalue of A A^T.) This is the first principal vector.
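A numpy sketch of the first principal vector on synthetic two-dimensional data (the data model below is a made-up example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
# N = 200 observations of K = 2 correlated variables (synthetic data):
# the second variable is roughly twice the first, plus small noise.
N = 200
x = rng.standard_normal(N)
X = np.column_stack([x, 2 * x + 0.1 * rng.standard_normal(N)])  # N x K

xbar = X.mean(axis=0)
A = (X - xbar).T                  # K x N matrix of centered observations

# First principal vector = dominant left singular vector of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
w = U[:, 0]
print(w)      # roughly +-(1, 2)/sqrt(5), the direction of largest variance

# It maximizes the sample variance among all unit vectors:
v = rng.standard_normal(2)
v /= np.linalg.norm(v)
print(np.linalg.norm(w @ A) >= np.linalg.norm(v @ A))   # True
```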
33
Principal Component Analysis (PCA)

Data with first principal vector:

[Scatter plot with the first principal vector overlaid.]

Projection of data onto first principal vector:

[Scatter plot of the projected data.]
34
Principal Component Analysis (PCA)

- Analogously, the first k principal vectors are given by the dominant k left singular vectors u_1, ..., u_k. Equivalent to the best rank-k approximation of the data matrix:

min_{U^T U = I_k} ‖A − U C^T‖.

- PCA is not robust with respect to outliers in the data. Robust PCA⁸ uses the model

A ≈ low rank + sparse,

obtained via the solution of

min{‖L‖_(1) + λ‖S‖_1 : A = L + S},

for a multiplier λ > 0 and ‖S‖_1 = 1-norm of vec(S).

⁸Emmanuel J. Candès; Xiaodong Li; Yi Ma; John Wright. Robust Principal Component Analysis? J. ACM 58 (2011), no. 3.
35
Matrix Completion

Assume that the data matrix is modeled by (low) rank k. Two popular approaches to deal with missing entries:
1. Impute data (insert 0 or row/column means in missing entries). Apply the SVD to get the best low-rank approximation B C^T of the imputed data matrix.
2. Find the rank-k matrix B C^T that fits the known entries best, measured in a (weighted) Euclidean norm.

Predict unknown entries from B C^T.
The Netflix prize was won by a combination of matrix completion with other techniques.
36
Applications in Scientific Computing and Engineering

- POD, reduced basis method, reduced-order modelling.
- High-dimensional integration.
- Solution of large-scale matrix equations. Optimal control.
- Solution of high-dimensional PDEs.
- Uncertainty quantification.
- ...

Several of these will be covered in later parts of the course.
37
3. Fundamental Tools
- Stability of SVD
- Canonical angles
- Stability of low-rank approximation
- QR decomposition
38
Stability of SVD

What happens to the SVD if A is perturbed by noise?

Lemma. Let A, E ∈ R^(m×n). Then

|σ_i(A + E) − σ_i(A)| ≤ ‖E‖_2.

Proof. Using the characterization

σ_i(A + E) = min{‖B‖_2 : rank(A + E − B) ≤ i − 1}

and setting B = A − T_(i−1)(A) + E, we obtain

σ_i(A + E) ≤ ‖B‖_2 ≤ ‖A − T_(i−1)(A)‖_2 + ‖E‖_2 = σ_i(A) + ‖E‖_2.

Exchanging the roles of A and A + E implies the result.
The result is also a special case of Weyl's famous inequality.

EFY. Show that the matrix rank is a lower semi-continuous function.
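The lemma is easy to test numerically; a numpy sketch with a random small perturbation:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
E = 1e-3 * rng.standard_normal((6, 4))

sA = np.linalg.svd(A, compute_uv=False)
sAE = np.linalg.svd(A + E, compute_uv=False)

# |sigma_i(A + E) - sigma_i(A)| <= ||E||_2 for every i
print(np.max(np.abs(sAE - sA)) <= np.linalg.norm(E, 2))   # True
```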
39
Stability of SVD

Singular values are perfectly well conditioned.
Singular vectors tend to be less stable! Example:

A = [ 1, 0 ; 0, 1 + ε ],  E = [ 0, ε ; ε, −ε ].

- A has right singular vectors (1, 0)^T, (0, 1)^T.
- A + E has right singular vectors (1/√2)(1, 1)^T, (1/√2)(1, −1)^T.

To formulate a perturbation bound, we need to measure distances between subspaces.
40
Canonical angles

Let the columns of X, Y ∈ C^(n×k) contain orthonormal bases of two k-dimensional subspaces 𝒳, 𝒴 ⊂ C^n, respectively. Denote the singular values (in reverse order) of X^T Y by

0 ≤ σ_1 ≤ · · · ≤ σ_k ≤ 1.

We call

θ_i(𝒳, 𝒴) := arccos σ_i,  i = 1, ..., k,

the canonical angles between 𝒳 and 𝒴. Note: For k = 1, θ_1 is the usual angle θ(x, y) between vectors.
Geometric characterization:

θ_1(𝒳, 𝒴) = max_{x ∈ 𝒳, x ≠ 0} min_{y ∈ 𝒴, y ≠ 0} θ(x, y).

It follows that θ_1(𝒳, 𝒴) = π/2 if and only if 𝒳 ∩ 𝒴^⊥ ≠ {0}.
41
Canonical angles

Note that X X^T and Y Y^T are the orthogonal projectors onto 𝒳 and 𝒴, respectively.

Lemma (Projector characterization). Define sin Θ(𝒳, 𝒴) = diag(sin θ_1(𝒳, 𝒴), ..., sin θ_k(𝒳, 𝒴)). Then

sin θ_1(𝒳, 𝒴) = ‖sin Θ(𝒳, 𝒴)‖_2 = ‖X X^T − Y Y^T‖_2.

Proof. See Theorem I.5.5 in [Stewart/Sun'1990].

Lemma. Let Q ∈ R^((n−k)×k), and 𝒳 = range([ I_k ; 0 ]), 𝒴 = range([ I_k ; Q ]). Then θ_1(𝒳, 𝒴) = arctan ‖Q‖_2.

Proof. The columns of [ I_k ; Q ] (I + Q^T Q)^(−1/2) form an orthonormal basis of 𝒴. By definition, this implies that cos θ_1(𝒳, 𝒴) is the smallest singular value of (I + Q^T Q)^(−1/2). By the SVD of Q, it follows that

cos θ_1(𝒳, 𝒴) = 1 / √(1 + ‖Q‖_2²).
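A numpy sketch computing θ_1 from the singular values of X^T Y and checking the projector characterization (the helper name is ours, not from the lecture):

```python
import numpy as np

def largest_canonical_angle(X, Y):
    """theta_1 = arccos of the smallest singular value of X^T Y,
    for real matrices X, Y with orthonormal columns."""
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.arccos(np.clip(s.min(), -1.0, 1.0))

rng = np.random.default_rng(6)
n, k = 7, 3
X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis of a
Y, _ = np.linalg.qr(rng.standard_normal((n, k)))   # random k-dim subspace

theta1 = largest_canonical_angle(X, Y)
# Projector characterization: sin(theta_1) = ||X X^T - Y Y^T||_2
print(np.isclose(np.sin(theta1), np.linalg.norm(X @ X.T - Y @ Y.T, 2)))  # True
```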
42
Stability of SVD

Theorem (Wedin). Let k < n and assume

δ := σ_k(A + E) − σ_(k+1)(A) > 0.

Let 𝒰_k / 𝒰̃_k / 𝒱_k / 𝒱̃_k denote the subspaces spanned by the first k left/right singular vectors of A / A + E. Then

√( ‖sin Θ(𝒰_k, 𝒰̃_k)‖_F² + ‖sin Θ(𝒱_k, 𝒱̃_k)‖_F² ) ≤ √2 ‖E‖_F / δ.  (1)

Θ: diagonal matrix containing the canonical angles between two subspaces.

- Perturbation on input multiplied by δ⁻¹ ≈ [σ_k(A) − σ_(k+1)(A)]⁻¹.
- Bad news for the stability of low-rank approximations?
43
Stability of low-rank approximation

Lemma. Let A ∈ R^(m×n) have rank ≤ k. Then

‖T_k(A + E) − A‖ ≤ C ‖E‖

holds with C = 2 for any unitarily invariant norm ‖·‖. For the Frobenius norm, the constant can be improved to C = (1 + √5)/2.

Proof. Schmidt-Mirsky gives ‖T_k(A + E) − (A + E)‖ ≤ ‖E‖, because A itself has rank ≤ k. The triangle inequality implies

‖T_k(A + E) − A‖ = ‖T_k(A + E) − (A + E) + (A + E) − A‖ ≤ 2‖E‖.

The second part is a result by Hackbusch⁹.

Implication for a general matrix A:

‖T_k(A + E) − T_k(A)‖ = ‖T_k( T_k(A) + (A − T_k(A)) + E ) − T_k(A)‖
                       ≤ C ‖(A − T_k(A)) + E‖ ≤ C (‖A − T_k(A)‖ + ‖E‖).

⇒ Perturbations on the level of the truncation error pose no danger.

⁹Hackbusch, W. New estimates for the recursive low-rank truncation of block-structured matrices. Numer. Math. 132 (2016), no. 2, 303–328.
Stability of low-rank approximation: Application

Consider the partitioned matrix

A = [ A_11, A_12 ; A_21, A_22 ],  A_ij ∈ R^(m_i×n_j),

and desired rank k ≤ m_i, n_j. Let ε := ‖T_k(A) − A‖.

E_ij := T_k(A_ij) − A_ij  ⇒  ‖E_ij‖ ≤ ε.

By the stability of low-rank approximation,

‖ T_k( [ T_k(A_11), T_k(A_12) ; T_k(A_21), T_k(A_22) ] ) − A ‖_F = ‖ T_k( A + [ E_11, E_12 ; E_21, E_22 ] ) − A ‖_F ≤ C ε,

with C = (3/2)(1 + √5).

This allows, e.g., to perform truncations in parallel.
45
The QR decomposition

Theorem. Let X ∈ R^(m×n) with m ≥ n. Then there is an orthogonal matrix Q ∈ R^(m×m) such that

X = Q R,  with  R = [ R_1 ; 0 ],

that is, R_1 ∈ R^(n×n) is an upper triangular matrix.
MATLAB: [Q,R] = qr(X).
We will use the economy size QR decomposition instead: Letting Q_1 ∈ R^(m×n) contain the first n columns of Q, one obtains

X = Q_1 R_1.

MATLAB: [Q,R] = qr(X,0).

EFY. Let A = (a_1, a_2, ..., a_n) ∈ R^(n×n). Using the QR decomposition, show Hadamard's inequality:

|det(A)| ≤ ‖a_1‖_2 · ‖a_2‖_2 · · · ‖a_n‖_2.

Characterize the set of all matrices A for which equality holds.
46
QR for recompression

Suppose that

A = B C^T,  with B ∈ R^(m×K), C ∈ R^(n×K).  (2)

Goal: Compute the best rank-k approximation of A for k < K.
Typical example: Sum of J matrices of rank k:

A = Σ_{j=1}^J B_j C_j^T = (B_1 · · · B_J)(C_1 · · · C_J)^T,  (3)

with (B_1 · · · B_J) ∈ R^(m×Jk), (C_1 · · · C_J) ∈ R^(n×Jk), and B_j ∈ R^(m×k), C_j ∈ R^(n×k).

Algorithm to recompress A:
1. Compute (economy size) QR decompositions B = Q_B R_B and C = Q_C R_C.
2. Compute the truncated SVD T_k(R_B R_C^T) = Ũ_k Σ_k Ṽ_k^T.
3. Set U_k = Q_B Ũ_k, V_k = Q_C Ṽ_k and return T_k(A) := U_k Σ_k V_k^T.

Returns the best rank-k approximation of A with O((m + n)K²) operations.
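The three-step recompression can be sketched in numpy as follows (the function name is ours; it never forms the m × n matrix A, which is only built here to verify the result):

```python
import numpy as np

def recompress(B, C, k):
    """Best rank-k approximation of A = B C^T from the factors only,
    via economy size QR of B and C: O((m + n) K^2) operations."""
    QB, RB = np.linalg.qr(B)          # B = QB RB, QB: m x K, RB: K x K
    QC, RC = np.linalg.qr(C)          # C = QC RC
    U, s, Vt = np.linalg.svd(RB @ RC.T)   # small K x K SVD
    Uk = QB @ U[:, :k]
    Vk = QC @ Vt[:k, :].T
    return Uk, s[:k], Vk

rng = np.random.default_rng(7)
m, n, K, k = 40, 30, 8, 3
B = rng.standard_normal((m, K))
C = rng.standard_normal((n, K))
A = B @ C.T

Uk, sk, Vk = recompress(B, C, k)
err = np.linalg.norm(A - (Uk * sk) @ Vk.T, 'fro')
s = np.linalg.svd(A, compute_uv=False)
# Matches the optimal Frobenius error sqrt(sigma_{k+1}^2 + ...):
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))   # True
```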
47
4. Extensions
- Weighted and structured low-rank approximation
- Semi-separable approximation of bivariate functions
48
Weighted low-rank approximation

If some columns or rows are more important than others (e.g., they are known to be less corrupted by noise), replace the low-rank approximation problem by

min{‖D_R (A − B) D_C‖ : B ∈ R^(m×n) has rank at most k}

with suitably chosen positive definite diagonal matrices D_R, D_C. More generally: Given invertible matrices W_R ∈ R^(m×m), W_C ∈ R^(n×n), the weighted low-rank approximation problem consists of

min{‖W_R (A − B) W_C‖ : B ∈ R^(m×n) has rank at most k}.

Solution given by

B = W_R^(−1) · T_k(W_R A W_C) · W_C^(−1).

Proof: EFY.
Remark: A numerically more stable approach is via the generalized SVD [Golub/Van Loan'2013].
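A numpy sketch of the solution formula, for the special case of positive diagonal weights (the helper truncate is ours, and the weights are random demo data):

```python
import numpy as np

def truncate(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(8)
m, n, k = 6, 5, 2
A = rng.standard_normal((m, n))
# positive diagonal weights, a special case of invertible W_R, W_C
WR = np.diag(rng.uniform(0.5, 2.0, m))
WC = np.diag(rng.uniform(0.5, 2.0, n))

# Solution of min ||W_R (A - B) W_C|| over all rank-k matrices B:
B = np.linalg.inv(WR) @ truncate(WR @ A @ WC, k) @ np.linalg.inv(WC)

print(np.linalg.matrix_rank(B) <= k)     # True: B has rank at most k
# The weighted error equals the optimal truncation error of W_R A W_C.
swr = np.linalg.svd(WR @ A @ WC, compute_uv=False)
print(np.isclose(np.linalg.norm(WR @ (A - B) @ WC, 'fro'),
                 np.sqrt(np.sum(swr[k:]**2))))          # True
```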
49
Limit case: Infinite weights

Choosing diagonal weights that converge to ∞ ⇒ the corresponding rows/columns remain unperturbed.
Case of fixed columns: Consider the block column partition

A = (A_1, A_2).

Consider

min{‖A_2 − B_2‖ : (A_1, B_2) has rank at most k}.

No/trivial solution if rank(A_1) ≥ k. Assume ℓ := rank(A_1) < k and let X_1 ∈ R^(m×ℓ) contain an orthonormal basis of range(A_1). Then¹⁰

B_2 = X_1 X_1^T A_2 + T_(k−ℓ)( (I − X_1 X_1^T) A_2 ).

¹⁰Golub, G. H.; Hoffman, A.; Stewart, G. W. A generalization of the Eckart-Young-Mirsky matrix approximation theorem. Linear Algebra Appl. 88/89 (1987), 317–327.
50
General weights

Given an mn × mn symmetric positive definite matrix W, define

‖A‖_W = √( vec(A)^T W vec(A) ).

Equals the Frobenius norm for W = I. General weighted low-rank approximation problem:

min{‖A − B‖_W : B ∈ R^(m×n) has rank at most k}.

EFY. Show that this problem can be rephrased as the previously considered (standard) weighted low-rank approximation problem for the case of a Kronecker product W = W_2 ⊗ W_1. Hint: Cholesky decomposition.

- For general W, no expression in terms of the SVD is available ⇒ need to use a general optimization method.
- Similarly, imposing general structures on A (such as nonnegativity, fixing individual entries, ...) usually does not admit solutions in terms of the SVD. One often ends up with NP-hard problems.
51
Separable approximation of bivariate functions

Given Ω_x ⊂ R^(d_x) and Ω_y ⊂ R^(d_y), aim at finding a semi-separable approximation of f ∈ L²(Ω_x × Ω_y) ≅ L²(Ω_x) ⊗ L²(Ω_y):

f(x, y) ≈ g_1(x) h_1(y) + · · · + g_r(x) h_r(y)

for g_1, ..., g_r ∈ L²(Ω_x), h_1, ..., h_r ∈ L²(Ω_y).
Application to higher-dimensional integrals:

∫_{Ω_x} ∫_{Ω_y} f(x, y) dμ_y(y) dμ_x(x) ≈ Σ_{i=1}^r ∫_{Ω_x} ∫_{Ω_y} g_i(x) h_i(y) dμ_y(y) dμ_x(x)
  = Σ_{i=1}^r [ ∫_{Ω_x} g_i(x) dμ_x(x) ] [ ∫_{Ω_y} h_i(y) dμ_y(y) ].

⇒ Semi-separable approximation breaks down the dimensionality of the integrals (for separable measures).
52
Separable approximation of bivariate functions

Given f ∈ L²(Ω_x × Ω_y), consider the linear operator

L_f : L²(Ω_x) → L²(Ω_y),  w ↦ ∫_{Ω_x} w(x) f(x, y) dx.

It admits an SVD

L_f(·) = Σ_{i=1}^∞ σ_i u_i ⟨v_i, ·⟩

with L² orthonormal bases u_1, u_2, ... and v_1, v_2, ....
The best semi-separable approximation of f (in L²(Ω_x × Ω_y)) is given by

f_r(x, y) = Σ_{i=1}^r σ_i u_i(x) v_i(y),

provided that Σ_{i=1}^∞ σ_i² < ∞ (Hilbert-Schmidt). Then

‖f − f_r‖²_{L²} = σ²_(r+1) + σ²_(r+2) + · · · .
53
Separable and low-rank approximation

Choose discretizations x_1, ..., x_m ∈ Ω_x and y_1, ..., y_n ∈ Ω_y. Define

F = [ f(x_1, y_1), f(x_1, y_2), · · · , f(x_1, y_n) ; ... ; f(x_m, y_1), f(x_m, y_2), · · · , f(x_m, y_n) ]

and

F_r = [ f_r(x_i, y_j) ]_{i,j} = Σ_{i=1}^r [ g_i(x_1) ; ... ; g_i(x_m) ] [ h_i(y_1) ; ... ; h_i(y_n) ]^T.

⇒ F_r has rank at most r.

EFY. Prove ‖F − F_r‖_F² ≤ σ²_(r+1) + σ²_(r+2) + · · · .
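A numpy sketch illustrating the fast singular value decay of such a discretization, for the kernel f(x, y) = 1/(x + y) on [1, 2] × [1, 2] (our choice of example, not from the slides):

```python
import numpy as np

# Sample f(x, y) = 1/(x + y) on a tensor grid; x + y stays away from 0
# on [1, 2] x [1, 2], so f is smooth and well approximable.
m = n = 100
x = np.linspace(1.0, 2.0, m)
y = np.linspace(1.0, 2.0, n)
F = 1.0 / (x[:, None] + y[None, :])      # F[i, j] = f(x_i, y_j)

s = np.linalg.svd(F, compute_uv=False)
# Rapid singular value decay: a small rank r already gives high accuracy.
print(s[:6] / s[0])                      # drops by orders of magnitude
r = 10
print(np.sqrt(np.sum(s[r:]**2)) / np.linalg.norm(F, 'fro') < 1e-5)   # True
```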
54
5. Subspace Iteration
55
Subspace iteration and low-rank approximation

Subspace iteration = extension of the power method.

Input: Matrix A ∈ R^(m×n).
1: Choose starting matrix X^(0) ∈ R^(m×k) with (X^(0))^T X^(0) = I_k.
2: j = 0.
3: repeat
4:   Set j := j + 1.
5:   Compute Y^(j) := A A^T X^(j−1).
6:   Compute economy size QR factorization: Y^(j) = Q R.
7:   Set X^(j) := Q.
8: until convergence is detected

As will soon be seen, this converges to a basis of the dominant subspace 𝒰_k. A low-rank approximation is obtained from

T_k(A) ≈ X^(j) (X^(j))^T A.
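The algorithm above can be sketched in numpy as follows (the function name and the Hilbert-matrix test case are ours):

```python
import numpy as np

def subspace_iteration(A, k, iters):
    """Subspace iteration on A A^T; returns X with orthonormal columns
    approximating the dominant k-dimensional left singular subspace."""
    rng = np.random.default_rng(9)
    X, _ = np.linalg.qr(rng.standard_normal((A.shape[0], k)))
    for _ in range(iters):
        Y = A @ (A.T @ X)            # A A^T X without forming A A^T
        X, _ = np.linalg.qr(Y)       # economy size QR re-orthonormalizes
    return X

# 50 x 50 Hilbert matrix: rapidly decaying singular values.
A = np.array([[1.0 / (i + j + 1) for j in range(50)] for i in range(50)])
k = 5
X = subspace_iteration(A, k, iters=10)

# The low-rank approximation X X^T A is close to the best one T_k(A),
# whose spectral error is sigma_{k+1}.
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - X @ (X.T @ A), 2)
print(err <= 1.01 * s[k] + 1e-12)    # True: nearly optimal
```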
56
Convergence of subspace iteration

Theorem. Consider the SVD A = U Σ V^T and, for k < n, let 𝒰_k = span{u_1, ..., u_k}. Assume that σ_k > σ_(k+1) and θ_1(𝒰_k, 𝒳^(0)) < π/2. Then the iterates 𝒳^(j) = range(X^(j)) of the subspace iteration satisfy

tan θ_1(𝒰_k, 𝒳^(j)) ≤ ( σ_(k+1) / σ_k )^(2j) tan θ_1(𝒰_k, 𝒳^(0)).

Sketch of proof. As the angles do not depend on the choice of bases, we may omit the QR decompositions ⇒ 𝒳^(j) = range( (A A^T)^j X^(0) ). By the SVD of A, we may set A = Σ and hence 𝒰_k = span{e_1, ..., e_k}. Partition

Σ = [ Σ_1, 0 ; 0, Σ_2 ],  X^(0) = [ X_1^(0) ; X_2^(0) ],  Σ_1, X_1^(0) ∈ R^(k×k).

The result follows from applying the expression for the tangent of θ_1.

EFY. Complete the details of the proof.
57
Numerical experiments

Convergence of subspace iteration for the 100 × 100 Hilbert matrix.
k = 5 ⇒ σ_(k+1)/σ_k ≈ 0.188. Random starting guess.

[Semilogarithmic convergence plot.
Black curve: tan θ_1(𝒰_k, 𝒳^(j)).
Blue curve: ‖T_k(A) − X^(j) (X^(j))^T A‖_2.
Red curve: ‖A − X^(j) (X^(j))^T A‖_2.]
58
Numerical experiments

Convergence of subspace iteration for a matrix with singular values

1, 0.99, 0.98, 1/10, 0.99/10, 0.98/10, 1/100, 0.99/100, 0.98/100, ....

k = 7 ⇒ σ_(k+1)/σ_k = 0.99. Random starting guess.

[Semilogarithmic convergence plot.
Black curve: tan θ_1(𝒰_k, 𝒳^(j)).
Blue curve: ‖T_7(A) − X^(j) (X^(j))^T A‖_2.
Red curve: ‖A − X^(j) (X^(j))^T A‖_2.]
59
Numerical experiments

Observations:
- The low-rank approximation is sufficiently good (for most purposes) already after 1 iteration.
- Convergence to the dominant subspace can be arbitrarily slow, but this is not relevant.
- The classical, asymptotic convergence analysis is insufficient.
- A pre-asymptotic analysis needs to take the randomization of the starting guess into account.
60