
Stein's Method for Matrix Concentration

Lester Mackey†

Collaborators: Michael I. Jordan‡, Richard Y. Chen*, Brendan Farrell*, and Joel A. Tropp*

†Stanford University  ‡University of California, Berkeley  *California Institute of Technology

December 10, 2012

Mackey (Stanford)  Stein's Method for Matrix Concentration  December 10, 2012  1 / 35

Motivation

Concentration Inequalities

Matrix concentration:

    P{‖X − EX‖ ≥ t} ≤ δ
    P{λmax(X − EX) ≥ t} ≤ δ

Non-asymptotic control of random matrices with complex distributions

Applications

Matrix completion from sparse random measurements (Gross 2011; Recht 2011; Negahban and Wainwright 2010; Mackey, Talwalkar, and Jordan 2011)

Randomized matrix multiplication and factorization (Drineas, Mahoney, and Muthukrishnan 2008; Hsu, Kakade, and Zhang 2011b)

Convex relaxation of robust or chance-constrained optimization (Nemirovski 2007; So 2011; Cheung, So, and Wang 2011)

Random graph analysis (Christofides and Markström 2008; Oliveira 2009)

Motivation: Matrix Completion

Goal: Recover a matrix L0 ∈ R^{m×n} from a subset of its entries

[partially observed ratings matrix → completed matrix]

Examples

Collaborative filtering: How will user i rate movie j?

Ranking on the web: Is URL j relevant to user i?

Link prediction: Is user i friends with user j?

Motivation: Matrix Completion

Goal: Recover a matrix L0 ∈ R^{m×n} from a subset of its entries

[partially observed ratings matrix → completed matrix]

Bad News: Impossible to recover a generic matrix. Too many degrees of freedom, too few observations.

Good News: A small number of latent factors determines preferences (movie ratings cluster by genre and director):

    L0 = A B⊤

These low-rank matrices are easier to complete.

Motivation: Matrix Completion

How to Complete a Low-rank Matrix

Suppose Ω is the set of observed entry locations.

First attempt:

    minimize_A  rank(A)
    subject to  A_ij = (L0)_ij, (i, j) ∈ Ω

Problem: NP-hard ⇒ computationally intractable

Solution: Solve a convex relaxation (*):

    minimize_A  ‖A‖*
    subject to  A_ij = (L0)_ij, (i, j) ∈ Ω

where ‖A‖* = Σk σk(A) is the trace/nuclear norm of A.
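The nuclear norm ‖A‖* in the relaxation above is just the sum of singular values. A minimal pure-Python sketch (not from the slides): for a 2×2 matrix we can get the singular values in closed form from the eigenvalues of AᵀA, and see that a rank-1 matrix has a single nonzero singular value.

```python
import math

def singular_values_2x2(a, b, c, d):
    """Singular values of [[a, b], [c, d]] via the eigenvalues of A^T A."""
    # A^T A = [[a*a + c*c, a*b + c*d], [a*b + c*d, b*b + d*d]] is symmetric 2x2
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    mean = (p + r) / 2.0
    disc = math.sqrt(((p - r) / 2.0) ** 2 + q * q)
    return math.sqrt(mean + disc), math.sqrt(max(mean - disc, 0.0))

def nuclear_norm_2x2(a, b, c, d):
    """Trace/nuclear norm: sum of singular values."""
    s1, s2 = singular_values_2x2(a, b, c, d)
    return s1 + s2

# Rank-1 example: [[1, 2], [2, 4]] = [1, 2]^T [1, 2] has one nonzero singular value
s1, s2 = singular_values_2x2(1, 2, 2, 4)
nn = nuclear_norm_2x2(1, 2, 2, 4)
```

For this rank-1 example the nuclear norm equals the single singular value, which is exactly the kind of sparsity-in-the-spectrum that makes ‖·‖* a convex surrogate for rank.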

Motivation: Matrix Completion

Can Convex Optimization Recover L0?

Yes, with high probability.

Theorem (Recht, 2011)
If L0 ∈ R^{m×n} has rank r and s ≳ βrn log²(n) entries are observed uniformly at random, then (under some technical conditions) convex optimization recovers L0 exactly with probability at least 1 − n^{−β}.

See also Gross (2011); Mackey, Talwalkar, and Jordan (2011)

Past results (Candès and Recht 2009; Candès and Tao 2009) required stronger assumptions and more intensive analysis

The streamlined approach reposes on a matrix variant of a classical Bernstein inequality (1946)

Motivation: Matrix Completion

Scalar Bernstein Inequality

Theorem (Bernstein, 1946)
Let (Yk)_{k≥1} be independent random variables in R satisfying

    E Yk = 0  and  |Yk| ≤ R  for each index k.

Define the variance parameter

    σ² = Σk E Yk².

Then, for all t ≥ 0,

    P{|Σk Yk| ≥ t} ≤ 2 · exp{−t² / (2σ² + 2Rt/3)}.

Gaussian decay controlled by variance when t is small

Exponential decay controlled by uniform bound for large t
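A quick Monte Carlo sanity check of the scalar bound (an illustration, not part of the slides): for a sum of n independent uniform ±1 signs we have E Yk = 0, R = 1, and σ² = n, so the empirical tail frequency should sit below the Bernstein bound.

```python
import math
import random

random.seed(0)

# Sum of n independent Rademacher signs: E Yk = 0, |Yk| <= R = 1, sigma^2 = n.
n, trials, t = 100, 20000, 25
hits = 0
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    if abs(s) >= t:
        hits += 1

empirical_tail = hits / trials
sigma2, R = float(n), 1.0
bernstein_bound = 2.0 * math.exp(-t * t / (2.0 * sigma2 + 2.0 * R * t / 3.0))
```

With these parameters the bound is roughly 0.11 while the true tail is near 0.01: the inequality is valid but not tight at this t.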

Motivation: Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (Yk)_{k≥1} be independent matrices in R^{m×n} satisfying

    E Yk = 0  and  ‖Yk‖ ≤ R  for each index k.

Define the variance parameter

    σ² = max{ ‖Σk E Yk Yk⊤‖, ‖Σk E Yk⊤ Yk‖ }.

Then, for all t ≥ 0,

    P{‖Σk Yk‖ ≥ t} ≤ (m + n) · exp{−t² / (3σ² + 2Rt)}.

See also Tropp (2011); Oliveira (2009); Recht (2011)

Gaussian tail when t is small; exponential tail for large t
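The matrix bound can be checked the same way. A self-contained sketch (my construction, not from the slides): a Rademacher series with a fixed symmetric 2×2 coefficient, using the closed-form eigenvalues of a symmetric 2×2 matrix for the spectral norm, with R = ‖A‖ and σ² = ‖Σk Ak²‖.

```python
import math
import random

random.seed(1)

def eig_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    disc = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mean + disc, mean - disc

def spec_norm(a, b, d):
    hi, lo = eig_sym_2x2(a, b, d)
    return max(abs(hi), abs(lo))

# Y_k = eps_k * A for the fixed symmetric coefficient A = [[1, 0.5], [0.5, -1]].
n = 50
A = (1.0, 0.5, -1.0)                                   # (a, b, d) of A
A_sq = (A[0] * A[0] + A[1] * A[1],                      # A^2, also symmetric
        A[1] * (A[0] + A[2]),
        A[1] * A[1] + A[2] * A[2])
R = spec_norm(*A)                                       # ||Y_k|| = ||A||
sigma2 = spec_norm(n * A_sq[0], n * A_sq[1], n * A_sq[2])  # ||sum_k A^2||

t, trials, hits = 30.0, 5000, 0
for _ in range(trials):
    sa = sb = sd = 0.0
    for _ in range(n):
        eps = random.choice((-1, 1))
        sa += eps * A[0]; sb += eps * A[1]; sd += eps * A[2]
    if spec_norm(sa, sb, sd) >= t:
        hits += 1

empirical_tail = hits / trials
bernstein_bound = (2 + 2) * math.exp(-t * t / (3.0 * sigma2 + 2.0 * R * t))  # m + n = 4
```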

Motivation: Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
For all t ≥ 0,

    P{‖Σk Yk‖ ≥ t} ≤ (m + n) · exp{−t² / (3σ² + 2Rt)}.

Consequences for matrix completion

Recht (2011) showed that uniform sampling of entries captures most of the information in incoherent low-rank matrices

Negahban and Wainwright (2010) showed that i.i.d. sampling of entries captures most of the information in non-spiky (near) low-rank matrices

Foygel and Srebro (2011) characterized the generalization error of convex MC through Rademacher complexity

Motivation: Matrix Concentration

Concentration Inequalities

Matrix concentration:

    P{λmax(X − EX) ≥ t} ≤ δ

Difficulty: Matrix multiplication is not commutative

    ⇒ e^{X+Y} ≠ e^X e^Y

Past approaches (Ahlswede and Winter 2002; Oliveira 2009; Tropp 2011)

Rely on deep results from matrix analysis

Apply to sums of independent matrices and matrix martingales

This work

Stein's method of exchangeable pairs (1972), as advanced by Chatterjee (2007) for scalar concentration
⇒ Improved exponential tail inequalities (Hoeffding, Bernstein)
⇒ Polynomial moment inequalities (Khintchine, Rosenthal)
⇒ Dependent sums and more general matrix functionals

Motivation: Matrix Concentration

Roadmap

1. Motivation
2. Stein's Method: Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences
6. Extensions

Background

Notation

Hermitian matrices: H^d = {A ∈ C^{d×d} : A = A*}. All matrices in this talk are Hermitian.

Maximum eigenvalue: λmax(·)

Trace: tr B, the sum of the diagonal entries of B

Spectral norm: ‖B‖, the maximum singular value of B

Background

Matrix Stein Pair

Definition (Exchangeable Pair)
(Z, Z′) is an exchangeable pair if (Z, Z′) =_d (Z′, Z).

Definition (Matrix Stein Pair)
Let (Z, Z′) be an exchangeable pair, and let Ψ : Z → H^d be a measurable function. Define the random matrices

    X = Ψ(Z)  and  X′ = Ψ(Z′).

(X, X′) is a matrix Stein pair with scale factor α ∈ (0, 1] if

    E[X′ | Z] = (1 − α)X.

Matrix Stein pairs are exchangeable pairs

Matrix Stein pairs always have zero mean
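A concrete scalar (d = 1) instance, verified by exact enumeration (my illustration, not from the slides): take Z = (ε1, …, εn) with Rademacher signs, X = Σk εk ak, and let Z′ resample one uniformly chosen coordinate. Then E[X′ | Z] = (1 − 1/n)X, i.e. a Stein pair with scale factor α = 1/n.

```python
from itertools import product

# Z = (eps_1, ..., eps_n), X = sum_k eps_k * a_k; Z' resamples one uniformly
# chosen sign, so the pair should satisfy E[X' | Z] = (1 - 1/n) X.
a = [1.0, 2.0, 3.0]
n = len(a)
alpha = 1.0 / n

def cond_mean_X_prime(eps):
    """E[X' | Z = eps], averaging over the index K and the resampled sign."""
    total = 0.0
    for k in range(n):
        for new_sign in (-1, 1):
            eps2 = list(eps)
            eps2[k] = new_sign
            total += sum(e * c for e, c in zip(eps2, a))
    return total / (2 * n)

ok = all(
    abs(cond_mean_X_prime(eps)
        - (1 - alpha) * sum(e * c for e, c in zip(eps, a))) < 1e-12
    for eps in product((-1, 1), repeat=n)
)
```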

Background

The Conditional Variance

Definition (Conditional Variance)
Suppose that (X, X′) is a matrix Stein pair with scale factor α, constructed from the exchangeable pair (Z, Z′). The conditional variance is the random matrix

    ∆X = ∆X(Z) = (1/2α) · E[(X − X′)² | Z].

∆X is a stochastic estimate for the variance E X²

Take-home Message: Control over ∆X yields control over λmax(X)
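For the Rademacher-series Stein pair above, ∆X can be computed exactly: X − X′ = (εK − ε′K) A_K and E[(εk − ε′k)² | εk] = 2, so ∆X = Σk Ak², a deterministic matrix. A pure-Python check by exact enumeration, using diagonal 2×2 coefficients so the matrix arithmetic stays elementwise (an illustration under these assumptions, not from the slides):

```python
from itertools import product

# Delta_X for X = sum_k eps_k A_k with the resample-one-sign pair (alpha = 1/n).
A = [(1.0, 2.0), (0.5, 1.0), (3.0, 0.25)]   # diagonal entries of A_1, A_2, A_3
n = len(A)
alpha = 1.0 / n

def delta_X(eps):
    """(1/2alpha) E[(X - X')^2 | eps], enumerating the index K and the new sign."""
    acc = [0.0, 0.0]
    for k in range(n):
        for new_sign in (-1, 1):
            diff = eps[k] - new_sign
            for i in range(2):
                acc[i] += (diff * A[k][i]) ** 2
    # average over the 2n equally likely (K, sign) choices, then divide by 2 alpha
    return [v / (2 * n) / (2 * alpha) for v in acc]

expected = [sum(row[i] ** 2 for row in A) for i in range(2)]  # sum_k A_k^2
ok = all(
    max(abs(x - y) for x, y in zip(delta_X(eps), expected)) < 1e-12
    for eps in product((-1, 1), repeat=n)
)
```

Since ∆X here is the constant matrix Σk Ak², the bound ∆X ≼ vI holds with v = ‖Σk Ak²‖, which is exactly how the Hoeffding-type corollary below arises.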

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (X, X′) be a matrix Stein pair with X ∈ H^d. Suppose that

    ∆X ≼ cX + vI  almost surely, for c, v ≥ 0.

Then, for all t ≥ 0,

    P{λmax(X) ≥ t} ≤ d · exp{−t² / (2v + 2ct)}.

Control over the conditional variance ∆X yields a Gaussian tail for λmax(X) for small t and an exponential tail for large t

When d = 1, improves the scalar result of Chatterjee (2007)

The dimensional factor d cannot be removed

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let X = Σk Yk for independent matrices in H^d satisfying

    E Yk = 0  and  Yk² ≼ Ak²

for deterministic matrices (Ak)_{k≥1}. Define the variance parameter

    σ² = ‖Σk Ak²‖.

Then, for all t ≥ 0,

    P{λmax(Σk Yk) ≥ t} ≤ d · e^{−t² / 2σ²}.

Improves upon the matrix Hoeffding inequality of Tropp (2011), with the optimal constant 1/2 in the exponent

Can replace the variance parameter with σ² = (1/2) ‖Σk (Ak² + E Yk²)‖

Tighter than the classical Hoeffding inequality (1963) when d = 1
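A Monte Carlo sanity check of the d·e^{−t²/2σ²} tail (my construction, not from the slides): a Rademacher series with diagonal 2×2 coefficients, so λmax of the sum is just the larger of two scalar sums and σ² = ‖Σk Ak²‖ = max of the two diagonal sums of squares.

```python
import math
import random

random.seed(2)

# Y_k = eps_k * diag(a_k, b_k); lambda_max(sum Y_k) = max of two scalar sums.
coeffs = [(1.0, 0.5) for _ in range(60)]
sigma2 = max(sum(a * a for a, _ in coeffs), sum(b * b for _, b in coeffs))

t, trials, hits = 20.0, 5000, 0
for _ in range(trials):
    s0 = s1 = 0.0
    for a, b in coeffs:
        eps = random.choice((-1, 1))
        s0 += eps * a
        s1 += eps * b
    if max(s0, s1) >= t:
        hits += 1

empirical_tail = hits / trials
hoeffding_bound = 2 * math.exp(-t * t / (2.0 * sigma2))  # d = 2
```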

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

1. Matrix Laplace transform method (Ahlswede & Winter, 2002)

Relate the tail probability to the trace of the mgf of X:

    P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ),  where m(θ) = E tr e^{θX}.

Problem: e^{X+Y} ≠ e^X e^Y when X, Y ∈ H^d. How to bound the trace mgf?

Past approaches: Golden-Thompson, Lieb's concavity theorem

Chatterjee's strategy for scalar concentration:

Control mgf growth by bounding the derivative

    m′(θ) = E tr X e^{θX}  for θ ∈ R,

rewritten using exchangeable pairs.

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma
Suppose that (X, X′) is a matrix Stein pair with scale factor α. Let F : H^d → H^d be a measurable function satisfying

    E ‖(X − X′) F(X)‖ < ∞.

Then

    E[X F(X)] = (1/2α) · E[(X − X′)(F(X) − F(X′))].  (1)

Intuition

Can characterize the distribution of a random matrix by integrating it against a class of test functions F

Eq. (1) allows us to estimate this integral using the smoothness properties of F and the discrepancy X − X′
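The identity (1) is exact, so it can be verified by brute force in the scalar case. A sketch (my illustration, not from the slides) using the Rademacher-series pair with α = 1/n and the test function F(x) = x³, enumerating all sign patterns and all single-coordinate resamples:

```python
from itertools import product

# Check E[X F(X)] = (1/2alpha) E[(X - X')(F(X) - F(X'))] exactly.
a = [1.0, 2.0, 0.5]
n = len(a)
alpha = 1.0 / n
F = lambda x: x ** 3

lhs = rhs = 0.0
for eps in product((-1, 1), repeat=n):
    X = sum(e * c for e, c in zip(eps, a))
    lhs += X * F(X)
    for k in range(n):                      # uniform index K ...
        for new_sign in (-1, 1):            # ... and resampled sign
            eps2 = list(eps)
            eps2[k] = new_sign
            Xp = sum(e * c for e, c in zip(eps2, a))
            rhs += (X - Xp) * (F(X) - F(Xp)) / (2 * n)
lhs /= 2 ** n
rhs /= 2 ** n * 2 * alpha                   # outer average, then the 1/(2 alpha)
```

Here lhs = E X⁴, so the lemma converts a fourth moment into an average of local discrepancies, which is the engine behind the mgf-derivative bound on the next slide.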

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf:

    m′(θ) = E tr X e^{θX} = (1/2α) · E tr[(X − X′)(e^{θX} − e^{θX′})]

Goal: Use the smoothness of F(X) = e^{θX} to bound the derivative

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Suppose that g : R → R is a weakly increasing function and that h : R → R is a function whose derivative h′ is convex. For all matrices A, B ∈ H^d, it holds that

    tr[(g(A) − g(B)) · (h(A) − h(B))] ≤ (1/2) · tr[(g(A) − g(B)) · (A − B) · (h′(A) + h′(B))].

Standard matrix functions: If g : R → R and A = Q diag(λ1, …, λd) Q*, then g(A) = Q diag(g(λ1), …, g(λd)) Q*.

The inequality does not hold without the trace.

For exponential concentration, we let g(A) = A and h(B) = e^{θB}.
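A numeric spot check of the lemma with g(x) = x and h(x) = e^{θx} (so h′ is convex) on random symmetric 2×2 matrices, evaluating matrix functions through the closed-form 2×2 spectral decomposition. This is a sanity check I constructed, not part of the slides.

```python
import math
import random

random.seed(3)

def mat_fun(M, f):
    """Apply f to a symmetric 2x2 matrix M = (a, b, d) ~ [[a, b], [b, d]]."""
    a, b, d = M
    mean = (a + d) / 2.0
    disc = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    l1, l2 = mean + disc, mean - disc
    if disc < 1e-12:                      # (near-)multiple eigenvalue: f(A) = f(l) I
        return (f(l1), 0.0, f(l1))
    # Spectral projector P1 = (A - l2 I)/(l1 - l2); f(A) = f(l1) P1 + f(l2)(I - P1)
    p_a, p_b, p_d = (a - l2) / (2 * disc), b / (2 * disc), (d - l2) / (2 * disc)
    return (f(l1) * p_a + f(l2) * (1 - p_a),
            (f(l1) - f(l2)) * p_b,
            f(l1) * p_d + f(l2) * (1 - p_d))

def tr2(M, N):
    """tr(M N) for symmetric 2x2 matrices stored as (a, b, d)."""
    return M[0] * N[0] + 2.0 * M[1] * N[1] + M[2] * N[2]

def tr3(M, N, L):
    """tr(M N L) for symmetric 2x2 matrices stored as (a, b, d)."""
    p11 = M[0] * N[0] + M[1] * N[1]
    p12 = M[0] * N[1] + M[1] * N[2]
    p21 = M[1] * N[0] + M[2] * N[1]
    p22 = M[1] * N[1] + M[2] * N[2]
    return p11 * L[0] + (p12 + p21) * L[1] + p22 * L[2]

sub = lambda M, N: (M[0] - N[0], M[1] - N[1], M[2] - N[2])
add = lambda M, N: (M[0] + N[0], M[1] + N[1], M[2] + N[2])

theta = 0.7
h = lambda x: math.exp(theta * x)
hp = lambda x: theta * math.exp(theta * x)   # h' is convex

holds = True
for _ in range(200):
    A = tuple(random.uniform(-1, 1) for _ in range(3))
    B = tuple(random.uniform(-1, 1) for _ in range(3))
    # g = identity, so g(A) - g(B) = A - B
    lhs = tr2(sub(A, B), sub(mat_fun(A, h), mat_fun(B, h)))
    rhs = 0.5 * tr3(sub(A, B), sub(A, B), add(mat_fun(A, hp), mat_fun(B, hp)))
    if lhs > rhs + 1e-9:
        holds = False
```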

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

    m′(θ) = (1/2α) · E tr[(X − X′)(e^{θX} − e^{θX′})]
          ≤ (θ/4α) · E tr[(X − X′)² · (e^{θX} + e^{θX′})]
          = (θ/2α) · E tr[(X − X′)² · e^{θX}]
          = θ · E tr[(1/2α) · E[(X − X′)² | Z] · e^{θX}]
          = θ · E tr[∆X e^{θX}]

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

    m′(θ) ≤ θ · E tr[∆X e^{θX}]

4. Conditional Variance Bound: ∆X ≼ cX + vI

Yields the differential inequality

    m′(θ) ≤ cθ · E tr[X e^{θX}] + vθ · E tr[e^{θX}] = cθ · m′(θ) + vθ · m(θ).

Solve to bound m(θ), and thereby bound

    P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{−t² / (2v + 2ct)}.

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X ≼ cX + vI:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (X, X′) be a bounded matrix Stein pair with X ∈ H^d. Define the function

    r(ψ) = (1/ψ) · log(E tr e^{ψ∆X} / d)  for each ψ > 0.

Then, for all t ≥ 0 and all ψ > 0,

    P{λmax(X) ≥ t} ≤ d · exp{−t² / (2r(ψ) + 2t/√ψ)}.

r(ψ) measures the typical magnitude of the conditional variance:

    E λmax(∆X) ≤ inf_{ψ>0} [r(ψ) + (log d)/ψ]

When d = 1, improves the scalar result of Chatterjee (2008)

The proof extends to unbounded random matrices

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (Yk)_{k≥1} be independent matrices in H^d satisfying

    E Yk = 0  and  ‖Yk‖ ≤ R  for each index k.

Define the variance parameter

    σ² = ‖Σk E Yk²‖.

Then, for all t ≥ 0,

    P{λmax(Σk Yk) ≥ t} ≤ d · exp{−t² / (3σ² + 2Rt)}.

Gaussian tail controlled by the improved variance when t is small

Key proof idea: Apply refined concentration and bound r(ψ) = (1/ψ) log(E tr e^{ψ∆X} / d) using unrefined concentration

Constants are better than Oliveira (2009), worse than Tropp (2011)

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let p = 1 or p ≥ 1.5. Suppose that (X, X′) is a matrix Stein pair where E tr|X|^{2p} < ∞. Then

    (E tr|X|^{2p})^{1/2p} ≤ √(2p − 1) · (E tr ∆X^p)^{1/2p}.

Moral: The conditional variance controls the moments of X

Generalizes Chatterjee's version (2007) of the scalar Burkholder-Davis-Gundy inequality (Burkholder, 1973)

See also Pisier & Xu (1997); Junge & Xu (2003, 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite-dimensional Schatten-class operators

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (εk)_{k≥1} be an independent sequence of Rademacher random variables and (Ak)_{k≥1} a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,

    E tr(Σk εk Ak)^{2p} ≤ (2p − 1)^p · tr(Σk Ak²)^p.

The noncommutative Khintchine inequality (Lust-Piquard 1986; Lust-Piquard and Pisier 1991) is a dominant tool in applied matrix analysis

e.g., used in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007)

Stein's method offers an unusually concise proof

The constant √(2p − 1) is within √e of optimal
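For p = 1 the two sides coincide, since the cross terms E εj εk vanish for j ≠ k. An exact enumeration check with diagonal 2×2 coefficients (my illustration, not from the slides):

```python
from itertools import product

# Exact check of E tr (sum_k eps_k A_k)^2 <= (2p - 1)^p tr (sum_k A_k^2)^p at p = 1.
A = [(1.0, -2.0), (0.5, 3.0), (2.0, 0.25)]   # diagonal entries of A_1, A_2, A_3
n = len(A)

lhs = 0.0
for eps in product((-1, 1), repeat=n):
    s = [sum(e * row[i] for e, row in zip(eps, A)) for i in range(2)]
    lhs += s[0] ** 2 + s[1] ** 2             # tr of the squared diagonal matrix
lhs /= 2 ** n

rhs = sum(row[0] ** 2 + row[1] ** 2 for row in A)  # tr sum_k A_k^2; (2p-1)^p = 1
```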

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration
2. Stein's Method: Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums
6. Extensions

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-mean Matrices)
Given a sequence of Hermitian matrices (Yk)_{k=1}^n satisfying the conditional zero-mean property

    E[Yk | (Yj)_{j≠k}] = 0  for all k,

define the random sum X = Σ_{k=1}^n Yk.

Note: (Yk)_{k≥1} is a martingale difference sequence.

Examples

Sums of independent centered random matrices

Many sums of conditionally independent random matrices:

    Yk ⊥⊥ (Yj)_{j≠k} | Z  and  E[Yk | Z] = 0

Rademacher series with random matrix coefficients:

    X = Σk εk Wk,  with (Wk)_{k≥1} Hermitian and (εk)_{k≥1} independent Rademacher

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-mean Property)

    E[Yk | (Yj)_{j≠k}] = 0

Matrix Stein Pair for X = Σ_{k=1}^n Yk

Let Y′k and Yk be conditionally i.i.d. given (Yj)_{j≠k}

Draw an index K uniformly from {1, …, n}

Define X′ = X + Y′K − YK

Check the Stein pair condition:

    E[X − X′ | (Yj)_{j≥1}] = E[YK − Y′K | (Yj)_{j≥1}]
                           = (1/n) Σ_{k=1}^n (Yk − E[Y′k | (Yj)_{j≠k}])
                           = (1/n) Σ_{k=1}^n Yk = (1/n) X

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-mean Property)

    E[Yk | (Yj)_{j≠k}] = 0

Conditional Variance for X = Σ_{k=1}^n Yk (scale factor α = 1/n)

    ∆X = (n/2) · E[(X − X′)² | (Yj)_{j≥1}]
       = (n/2) · E[(YK − Y′K)² | (Yj)_{j≥1}]
       = (1/2) Σ_{k=1}^n (Yk² + E[Yk² | (Yj)_{j≠k}])

⇒ The conditional variance is controlled when the summands are bounded

⇒ Dependent analogues of the concentration and moment inequalities follow

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
Given a deterministic array (Ajk)_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic

    Y = Σ_{j=1}^n A_{jπ(j)}  with mean  EY = (1/n) Σ_{j,k=1}^n Ajk.

Generalizes the scalar statistics studied by Hoeffding (1951)

Example: Sampling without replacement from {B1, …, Bn}:

    W = Σ_{j=1}^s B_{π(j)}

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

    Y = Σ_{j=1}^n A_{jπ(j)}  with mean  EY = (1/n) Σ_{j,k=1}^n Ajk

Matrix Stein Pair for X = Y − EY

Draw a pair of indices (J, K) uniformly from {1, …, n}²

Define π′ = π ∘ (J, K) and X′ = Σ_{j=1}^n A_{jπ′(j)} − EY

Check the Stein pair condition:

    E[X − X′ | π] = E[A_{Jπ(J)} + A_{Kπ(K)} − A_{Jπ(K)} − A_{Kπ(J)} | π]
                  = (1/n²) Σ_{j,k=1}^n (A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)})
                  = (2/n)(Y − EY) = (2/n) X
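The Stein pair condition for the transposition construction can be verified exactly in the scalar case (my illustration, not from the slides): enumerate all permutations π and all index pairs (J, K), and check E[X − X′ | π] = (2/n)X.

```python
from itertools import permutations

# Scalar check of E[X - X' | pi] = (2/n) X, where pi' = pi composed with the
# transposition (J, K) for (J, K) drawn uniformly from {1,...,n}^2.
a = [[1.0, 2.0, 0.5],
     [0.0, -1.0, 3.0],
     [2.5, 1.0, -0.5]]
n = len(a)
EY = sum(sum(row) for row in a) / n

def stat(pi):
    return sum(a[j][pi[j]] for j in range(n))

ok = True
for pi in permutations(range(n)):
    X = stat(pi) - EY
    diff = 0.0
    for J in range(n):
        for K in range(n):
            pi2 = list(pi)
            pi2[J], pi2[K] = pi2[K], pi2[J]
            diff += (X - (stat(pi2) - EY)) / (n * n)
    if abs(diff - 2.0 * X / n) > 1e-12:
        ok = False
```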

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

    Y = Σ_{j=1}^n A_{jπ(j)}  with mean  EY = (1/n) Σ_{j,k=1}^n Ajk

Conditional Variance for X = Y − EY

    ∆X(π) = (n/4) · E[(X − X′)² | π]
          = (1/4n) Σ_{j,k=1}^n [A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)}]²
          ≼ (1/n) Σ_{j,k=1}^n [A²_{jπ(j)} + A²_{kπ(k)} + A²_{jπ(k)} + A²_{kπ(j)}]

⇒ The conditional variance is controlled when the summands are bounded

⇒ Dependent analogues of the concentration and moment inequalities follow

Extensions

General Complex Matrices

Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via dilation:

    D(B) = [[0, B], [B*, 0]] ∈ H^{d1+d2}

Preserves spectral information: λmax(D(B)) = ‖B‖
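The dilation identity is easy to see numerically. A pure-Python sketch (my illustration, not from the slides): build D(B) for a small real rectangular B and recover λmax(D(B)) by power iteration, which should match the largest singular value of B.

```python
import math
import random

random.seed(4)

# B is 3x2 with columns (3, 4, 0) and (0, 0, 1): singular values 5 and 1.
B = [[3.0, 0.0],
     [4.0, 0.0],
     [0.0, 1.0]]
m, n = len(B), len(B[0])

# Hermitian dilation D(B) = [[0, B], [B^T, 0]], a symmetric (m+n) x (m+n) matrix
d = m + n
D = [[0.0] * d for _ in range(d)]
for i in range(m):
    for j in range(n):
        D[i][m + j] = B[i][j]
        D[m + j][i] = B[i][j]

# Power iteration: the Rayleigh-type norm ||D v|| converges to the top
# eigenvalue magnitude of D, which equals ||B|| by the dilation identity.
v = [random.uniform(0.5, 1.0) for _ in range(d)]
lam = 0.0
for _ in range(500):
    w = [sum(D[i][j] * v[j] for j in range(d)) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
    lam = norm
```

D(B) has eigenvalues ±σi(B) (plus zeros), so even though its spectrum is symmetric, the iterated norm converges to the top singular value.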

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

e.g., matrix second-order Rademacher chaos: Σ_{j,k} εj εk Ajk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X) − g(X′) | Z] = αX almost surely, for g : R → R weakly increasing

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at optimization-online.org, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans Cp (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. URL http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi:10.1007/s10107-006-0033-0.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007 (electronic).

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 2: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Motivation

Concentration Inequalities

Matrix concentration

PX minus EX ge t le δ

Pλmax(X minus EX) ge t le δ

Non-asymptotic control of random matrices with complexdistributions

Applications

Matrix completion from sparse random measurements(Gross 2011 Recht 2011 Negahban and Wainwright 2010 Mackey Talwalkar and Jordan 2011)

Randomized matrix multiplication and factorization(Drineas Mahoney and Muthukrishnan 2008 Hsu Kakade and Zhang 2011b)

Convex relaxation of robust or chance-constrained optimization(Nemirovski 2007 So 2011 Cheung So and Wang 2011)

Random graph analysis (Christofides and Markstrom 2008 Oliveira 2009)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 2 35

Motivation Matrix Completion

Motivation Matrix Completion

Goal Recover a matrix L0 isin Rmtimesn from a subset of its entries

1 43 5 5

rarr

2 3 1 43 4 5 12 5 3 5

Examples

Collaborative filtering How will user i rate movie j

Ranking on the web Is URL j relevant to user i

Link prediction Is user i friends with user j

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 3 35

Motivation Matrix Completion

Motivation Matrix Completion

Goal Recover a matrix L0 isin Rmtimesn from a subset of its entries

1 43 5 5

rarr

2 3 1 43 4 5 12 5 3 5

Bad News Impossible to recover a generic matrixToo many degrees of freedom too few observations

Good News

Small number of latent factors determine preferencesMovie ratings cluster by genre and director

L0 = A

B⊤

These low-rank matrices are easier to completeMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 4 35

Motivation Matrix Completion

How to Complete a Low-rank Matrix

Suppose Ω is the set of observed entry locations

First attemptminimizeA rankA

subject to Aij = L0ij (i j) isin Ω

Problem NP-hard rArr computationally intractable

Solution Solve convex relaxation ()

minimizeA Alowastsubject to Aij = L0ij (i j) isin Ω

where Alowast =sum

k σk(A) is the tracenuclear norm of A

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 5 35

Motivation Matrix Completion

Can Convex Optimization Recover L0

Yes with high probability

Theorem (Recht 2011)

If L0 isin Rmtimesn has rank r and s amp βrn log2(n) entries are observeduniformly at random then (under some technical conditions) convexoptimization recovers L0 exactly with probability at least 1minus nminusβ

See also Gross (2011) Mackey Talwalkar and Jordan (2011)

Past results (Candes and Recht 2009 Candes and Tao 2009) requiredstronger assumptions and more intensive analysis

Streamlined approach reposes on a matrix variant of a classicalBernstein inequality (1946)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 6 35

Motivation Matrix Completion

Scalar Bernstein Inequality

Theorem (Bernstein 1946)

Let (Yk)kge1 be independent random variables in R satisfying

EYk = 0 and |Yk| le R for each index k

Define the variance parameter

σ2 =sum

kEY 2

k

Then for all t ge 0

P

sum

kYk

∣ge t

le 2 middot exp minust22σ2 + 2Rt3

Gaussian decay controlled by variance when t is small

Exponential decay controlled by uniform bound for large t

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 7 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Rmtimesn satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 = max(∥

sum

kEYkY

⊤k

∥∥

sum

kEY ⊤

k Yk

)

Then for all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

See also Tropp (2011) Oliveira (2009) Recht (2011)

Gaussian tail when t is small exponential tail for large t

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 8 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

For all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

Consequences for matrix completion

Recht (2011) showed that uniform sampling of entries capturesmost of the information in incoherent low-rank matrices

Negahban and Wainwright (2010) showed that iid sampling ofentries captures most of the information in non-spiky (near)low-rank matrices

Foygel and Srebro (2011) characterized the generalization errorof convex MC through Rademacher complexity

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 9 35

Motivation Matrix Concentration

Concentration Inequalities

Matrix concentration

Pλmax(X minus EX) ge t le δ

Difficulty Matrix multiplication is not commutative

rArr eX+Y 6= eXeY

Past approaches (Ahlswede and Winter 2002 Oliveira 2009 Tropp 2011)

Rely on deep results from matrix analysis

Apply to sums of independent matrices and matrix martingales

This work

Steinrsquos method of exchangeable pairs (1972) as advanced byChatterjee (2007) for scalar concentrationrArr Improved exponential tail inequalities (Hoeffding Bernstein)rArr Polynomial moment inequalities (Khintchine Rosenthal)rArr Dependent sums and more general matrix functionals

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 10 35

Motivation Matrix Concentration

Roadmap

1 Motivation

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent Sequences

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 11 35

Background

Notation

Hermitian matrices Hd = A isin Cdtimesd A = AlowastAll matrices in this talk are Hermitian

Maximum eigenvalue λmax(middot)Trace trB the sum of the diagonal entries of B

Spectral norm B the maximum singular value of B

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 12 35

Background

Matrix Stein Pair

Definition (Exchangeable Pair)

(ZZ prime) is an exchangeable pair if (ZZ prime)d= (Z prime Z)

Definition (Matrix Stein Pair)

Let (ZZ prime) be an exchangeable pair and let Ψ Z rarr Hd be ameasurable function Define the random matrices

X = Ψ(Z) and X prime = Ψ(Z prime)

(XX prime) is a matrix Stein pair with scale factor α isin (0 1] if

E[X prime |Z] = (1minus α)X

Matrix Stein pairs are exchangeable pairs

Matrix Stein pairs always have zero mean

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 13 35

Background

The Conditional Variance

Definition (Conditional Variance)

Suppose that (XX prime) is a matrix Stein pair with scale factor αconstructed from the exchangeable pair (ZZ prime) The conditional

variance is the random matrix

∆X = ∆X(Z) =1

2αE[

(X minusX prime)2 |Z]

∆X is a stochastic estimate for the variance EX2

Take-home Message

Control over ∆X yields control over λmax(X)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 14 35

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let (X, X′) be a matrix Stein pair with X ∈ H^d. Suppose that
ΔX ≼ cX + v I almost surely for c, v ≥ 0.
Then, for all t ≥ 0,
P{λmax(X) ≥ t} ≤ d · exp{−t²/(2v + 2ct)}.

Control over the conditional variance ΔX yields a Gaussian tail for λmax(X) for small t and an exponential tail for large t.
When d = 1, improves the scalar result of Chatterjee (2007).
The dimensional factor d cannot be removed.

Matrix Hoeffding Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let X = Σk Yk for independent matrices in H^d satisfying
EYk = 0 and Yk² ≼ Ak²
for deterministic matrices (Ak)k≥1. Define the variance parameter
σ² = ‖Σk Ak²‖.
Then, for all t ≥ 0,
P{λmax(Σk Yk) ≥ t} ≤ d · e^{−t²/2σ²}.

Improves upon the matrix Hoeffding inequality of Tropp (2011): optimal constant 1/2 in the exponent.
Can replace the variance parameter with σ² = ½ ‖Σk (Ak² + EYk²)‖.
Tighter than the classical Hoeffding inequality (1963) when d = 1.
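A quick Monte Carlo sanity check of the bound (on made-up data, not an example from the talk): with Yk = εk Ak for diagonal Ak, the hypothesis Yk² ≼ Ak² holds with equality, and the empirical tail should sit well below d · e^{−t²/2σ²}.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 30, 2, 2000

# Y_k = eps_k * A_k with diagonal A_k, so Y_k^2 = A_k^2 and
# sigma^2 = ||sum_k A_k^2|| is the largest diagonal entry of the sum.
a = rng.uniform(0.5, 1.0, size=(n, d))          # diagonals of A_1, ..., A_n
sigma2 = (a ** 2).sum(axis=0).max()
t = 2.0 * np.sqrt(sigma2)

eps = rng.choice([-1.0, 1.0], size=(trials, n))
# lambda_max of a sum of diagonal matrices = max over the d diagonal sums
lam_max = (eps[:, :, None] * a[None, :, :]).sum(axis=1).max(axis=1)

bound = d * np.exp(-t ** 2 / (2 * sigma2))      # = 2 e^{-2}, about 0.271
empirical = (lam_max >= t).mean()
assert empirical <= bound
```

The empirical frequency is only a noisy estimate of the true tail probability, so this checks consistency with the theorem rather than proving it.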

Exponential Concentration: Proof Sketch

1. Matrix Laplace transform method (Ahlswede & Winter 2002)
Relate the tail probability to the trace of the mgf of X:
P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ), where m(θ) = E tr e^{θX}.

Problem: e^{X+Y} ≠ e^X e^Y when X, Y ∈ H^d. How to bound the trace mgf?
Past approaches: Golden–Thompson, Lieb's concavity theorem.

Chatterjee's strategy for scalar concentration:
Control mgf growth by bounding the derivative
m′(θ) = E tr X e^{θX} for θ ∈ R.
Rewrite using exchangeable pairs.

Method of Exchangeable Pairs

Lemma
Suppose that (X, X′) is a matrix Stein pair with scale factor α. Let F : H^d → H^d be a measurable function satisfying
E‖(X − X′)F(X)‖ < ∞.
Then
E[X F(X)] = (1/(2α)) · E[(X − X′)(F(X) − F(X′))].   (1)

Intuition:
Can characterize the distribution of a random matrix by integrating it against a class of test functions F.
Eq. (1) allows us to estimate this integral using the smoothness properties of F and the discrepancy X − X′.

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs
Rewrite the derivative of the trace mgf:
m′(θ) = E tr X e^{θX} = (1/(2α)) · E tr[(X − X′)(e^{θX} − e^{θX′})].
Goal: Use the smoothness of F(X) = e^{θX} to bound the derivative.

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Suppose that g : R → R is a weakly increasing function and that h : R → R is a function whose derivative h′ is convex. For all matrices A, B ∈ H^d, it holds that
tr[(g(A) − g(B)) · (h(A) − h(B))] ≤ ½ · tr[(g(A) − g(B)) · (A − B) · (h′(A) + h′(B))].

Standard matrix functions: if g : R → R and A = Q diag(λ1, …, λd) Q*, then g(A) = Q diag(g(λ1), …, g(λd)) Q*.
The inequality does not hold without the trace.
For exponential concentration, we let g(A) = A and h(B) = e^{θB}.
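The lemma can be probed numerically in the concentration-relevant case g(A) = A, h(x) = e^{θx} (so h′(x) = θe^{θx} is convex for θ > 0). The sketch below, on randomly generated symmetric matrices, checks the stated trace inequality; it is an illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
d, theta = 4, 0.7

def rand_sym(d):
    M = rng.standard_normal((d, d))
    return (M + M.T) / 2

def expm_sym(A):
    """Matrix exponential of a real symmetric matrix via eigendecomposition."""
    lam, Q = np.linalg.eigh(A)
    return (Q * np.exp(lam)) @ Q.T

for _ in range(100):
    A, B = rand_sym(d), rand_sym(d)
    eA, eB = expm_sym(theta * A), expm_sym(theta * B)
    # g(A) - g(B) = A - B;  h(A) - h(B) = e^{theta A} - e^{theta B}
    lhs = np.trace((A - B) @ (eA - eB))
    # h'(A) + h'(B) = theta (e^{theta A} + e^{theta B})
    rhs = 0.5 * np.trace((A - B) @ (A - B) @ (theta * (eA + eB)))
    assert lhs <= rhs + 1e-8
```

This is exactly the step used to pass from m′(θ) to the conditional-variance bound in the proof sketch.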

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality
Bound the derivative of the trace mgf:
m′(θ) = (1/(2α)) · E tr[(X − X′)(e^{θX} − e^{θX′})]
 ≤ (θ/(4α)) · E tr[(X − X′)² · (e^{θX} + e^{θX′})]
 = (θ/(2α)) · E tr[(X − X′)² · e^{θX}]   (by exchangeability of (X, X′))
 = θ · E tr[(1/(2α)) E[(X − X′)² | Z] · e^{θX}]
 = θ · E tr[ΔX e^{θX}].

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality
Bound the derivative of the trace mgf:
m′(θ) ≤ θ · E tr[ΔX e^{θX}].

4. Conditional Variance Bound: ΔX ≼ cX + v I
Yields the differential inequality
m′(θ) ≤ cθ · E tr[X e^{θX}] + vθ · E tr[e^{θX}] = cθ · m′(θ) + vθ · m(θ).
Solve to bound m(θ) and thereby bound
P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{−t²/(2v + 2ct)}.
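One way to carry out the final "solve" step (a sketch; the authors' argument may differ in details): since m(0) = tr I = d, rearranging and integrating the differential inequality gives

```latex
\begin{aligned}
m'(\theta) &\le \frac{v\theta}{1-c\theta}\,m(\theta)
\quad\text{for } 0 \le \theta < 1/c, \qquad m(0) = \operatorname{tr} I = d,\\[2pt]
\log m(\theta) &\le \log d + \int_0^\theta \frac{vs}{1-cs}\,ds
\;\le\; \log d + \frac{v\theta^2}{2(1-c\theta)},\\[2pt]
\mathbb{P}\{\lambda_{\max}(X) \ge t\}
&\le \inf_{0 \le \theta < 1/c} d\,\exp\Bigl\{-\theta t + \frac{v\theta^2}{2(1-c\theta)}\Bigr\}
\;\le\; d\,\exp\Bigl\{\frac{-t^2}{2v+2ct}\Bigr\},
\end{aligned}
```

where the last step substitutes θ = t/(v + ct), for which 1 − cθ = v/(v + ct) and the exponent collapses to −t²/(2v + 2ct).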

Refined Exponential Concentration

Relaxing the constraint ΔX ≼ cX + v I:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let (X, X′) be a bounded matrix Stein pair with X ∈ H^d. Define the function
r(ψ) = (1/ψ) · log E tr(e^{ψΔX}/d) for each ψ > 0.
Then, for all t ≥ 0 and all ψ > 0,
P{λmax(X) ≥ t} ≤ d · exp{−t²/(2r(ψ) + 2t/√ψ)}.

r(ψ) measures the typical magnitude of the conditional variance:
E λmax(ΔX) ≤ inf_{ψ>0} [r(ψ) + (log d)/ψ].

When d = 1, improves the scalar result of Chatterjee (2008).
The proof extends to unbounded random matrices.

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let (Yk)k≥1 be independent matrices in H^d satisfying
EYk = 0 and ‖Yk‖ ≤ R for each index k.
Define the variance parameter
σ² = ‖Σk EYk²‖.
Then, for all t ≥ 0,
P{λmax(Σk Yk) ≥ t} ≤ d · exp{−t²/(3σ² + 2Rt)}.

Gaussian tail controlled by the improved variance when t is small.
Key proof idea: apply the refined concentration result and bound r(ψ) = (1/ψ) log E tr(e^{ψΔX}/d) using the unrefined result.
Constants better than Oliveira (2009), worse than Tropp (2011).

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let p = 1 or p ≥ 1.5. Suppose that (X, X′) is a matrix Stein pair where E tr |X|^{2p} < ∞. Then
(E tr |X|^{2p})^{1/(2p)} ≤ √(2p − 1) · (E tr ΔX^p)^{1/(2p)}.

Moral: The conditional variance controls the moments of X.
Generalizes Chatterjee's version (2007) of the scalar Burkholder–Davis–Gundy inequality (Burkholder 1973).
See also Pisier & Xu (1997), Junge & Xu (2003, 2008).
Proof techniques mirror those for exponential concentration.
Also holds for infinite-dimensional Schatten-class operators.

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let (εk)k≥1 be an independent sequence of Rademacher random variables and (Ak)k≥1 a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,
E tr(Σk εk Ak)^{2p} ≤ (2p − 1)^p · tr(Σk Ak²)^p.

The noncommutative Khintchine inequality (Lust-Piquard 1986; Lust-Piquard and Pisier 1991) is a dominant tool in applied matrix analysis, e.g., used in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin 2007).
Stein's method offers an unusually concise proof.
The constant √(2p − 1) is within √e of optimal.
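For small n the Rademacher expectation can be computed exactly by enumerating all 2^n sign patterns, so the corollary can be checked directly (here with p = 2 and made-up random symmetric coefficients):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n, d, p = 4, 3, 2

M = rng.standard_normal((n, d, d))
A = (M + M.transpose(0, 2, 1)) / 2              # Hermitian (real symmetric) A_k

# Exact E tr (sum_k eps_k A_k)^{2p} over all 2^n sign patterns
lhs = 0.0
for signs in product([-1.0, 1.0], repeat=n):
    S = np.einsum('k,kij->ij', np.array(signs), A)
    lhs += np.trace(np.linalg.matrix_power(S, 2 * p)) / 2 ** n

# (2p - 1)^p tr (sum_k A_k^2)^p
V = np.einsum('kij,kjl->il', A, A)              # sum_k A_k^2
rhs = (2 * p - 1) ** p * np.trace(np.linalg.matrix_power(V, p))
assert 0.0 <= lhs <= rhs + 1e-9
```

The left side is a sum of 2p-th powers of eigenvalues, hence nonnegative, which the assertion also records.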

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration
2. Stein's Method Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums
6. Extensions

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)
Given a sequence of Hermitian matrices (Yk)_{k=1}^n satisfying the conditional zero-mean property
E[Yk | (Yj)j≠k] = 0 for all k,
define the random sum X = Σ_{k=1}^n Yk.

Note: (Yk)k≥1 is a martingale difference sequence.

Examples:
Sums of independent centered random matrices.
Many sums of conditionally independent random matrices:
Yk ⫫ (Yj)j≠k | Z and E[Yk | Z] = 0.
Rademacher series with random matrix coefficients:
X = Σk εk Wk, with (Wk)k≥1 Hermitian and (εk)k≥1 independent Rademacher.

Sums of Conditionally Zero-mean Matrices

Conditional zero-mean property: E[Yk | (Yj)j≠k] = 0.

Matrix Stein Pair for X = Σ_{k=1}^n Yk:
Let Y′k and Yk be conditionally i.i.d. given (Yj)j≠k.
Draw an index K uniformly from {1, …, n}.
Define X′ = X + Y′K − YK.

Check the Stein pair condition:
E[X − X′ | (Yj)j≥1] = E[YK − Y′K | (Yj)j≥1]
 = (1/n) Σ_{k=1}^n (Yk − E[Y′k | (Yj)j≠k])
 = (1/n) Σ_{k=1}^n Yk = (1/n) X.
Hence (X, X′) is a matrix Stein pair with scale factor α = 1/n.

Sums of Conditionally Zero-mean Matrices

Conditional zero-mean property: E[Yk | (Yj)j≠k] = 0.

Conditional Variance for X = Σ_{k=1}^n Yk (scale factor α = 1/n):
ΔX = (n/2) · E[(X − X′)² | (Yj)j≥1]
 = (n/2) · E[(YK − Y′K)² | (Yj)j≥1]
 = ½ Σ_{k=1}^n (Yk² + E[Yk² | (Yj)j≠k])

⇒ The conditional variance is controlled when the summands are bounded.
⇒ Dependent analogues of the concentration and moment inequalities follow.

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
Given a deterministic array (Ajk)_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic
Y = Σ_{j=1}^n A_{jπ(j)}, with mean EY = (1/n) Σ_{j,k=1}^n Ajk.

Generalizes the scalar statistics studied by Hoeffding (1951).

Example: sampling without replacement from {B1, …, Bn}:
W = Σ_{j=1}^s B_{π(j)}.

Combinatorial Sums of Matrices

Y = Σ_{j=1}^n A_{jπ(j)}, with mean EY = (1/n) Σ_{j,k=1}^n Ajk.

Matrix Stein Pair for X = Y − EY:
Draw a pair of indices (J, K) uniformly from {1, …, n}².
Define π′ = π ∘ (J, K) and X′ = Σ_{j=1}^n A_{jπ′(j)} − EY.

Check the Stein pair condition:
E[X − X′ | π] = E[A_{Jπ(J)} + A_{Kπ(K)} − A_{Jπ(K)} − A_{Kπ(J)} | π]
 = (1/n²) Σ_{j,k=1}^n (A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)})
 = (2/n)(Y − EY) = (2/n) X.
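Because (J, K) ranges over only n² pairs, the Stein pair condition can be verified exactly for a fixed permutation and made-up coefficient array:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 2

# Hypothetical Hermitian array (A_{jk}): symmetrize each d x d block
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2

pi = rng.permutation(n)
Y  = A[np.arange(n), pi].sum(axis=0)             # Y = sum_j A_{j, pi(j)}
EY = A.sum(axis=(0, 1)) / n                      # EY = (1/n) sum_{j,k} A_{jk}
X  = Y - EY

# Average X - X' exactly over the uniform index pair (J, K): swapping
# pi(J) and pi(K) removes A_{J,pi(J)}, A_{K,pi(K)} and adds A_{J,pi(K)}, A_{K,pi(J)}.
diff = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        diff += (A[J, pi[J]] + A[K, pi[K]] - A[J, pi[K]] - A[K, pi[J]]) / n**2

assert np.allclose(diff, (2.0 / n) * X)          # E[X - X' | pi] = (2/n) X
```

The J = K terms vanish, matching the fact that swapping an index with itself leaves the statistic unchanged.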

Combinatorial Sums of Matrices

Y = Σ_{j=1}^n A_{jπ(j)}, with mean EY = (1/n) Σ_{j,k=1}^n Ajk.

Conditional Variance for X = Y − EY (scale factor α = 2/n):
ΔX(π) = (n/4) · E[(X − X′)² | π]
 = (1/(4n)) Σ_{j,k=1}^n [A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)}]²
 ≼ (1/n) Σ_{j,k=1}^n [A²_{jπ(j)} + A²_{kπ(k)} + A²_{jπ(k)} + A²_{kπ(j)}]

⇒ The conditional variance is controlled when the summands are bounded.
⇒ Dependent analogues of the concentration and moment inequalities follow.

Extensions

General Complex Matrices:
Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via the dilation
D(B) = [0, B; B*, 0] ∈ H^{d1+d2}.
Dilation preserves spectral information: λmax(D(B)) = ‖B‖.

Beyond Sums:
Matrix-valued functions satisfying a self-reproducing property,
e.g., the matrix second-order Rademacher chaos Σ_{j,k} εj εk Ajk.
Yields a dependent bounded-differences inequality for matrices.

Generalized Matrix Stein Pairs:
Satisfy E[g(X) − g(X′) | Z] = αX almost surely for g : R → R weakly increasing.
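The dilation identity is easy to confirm numerically (on an arbitrary random rectangular matrix): the eigenvalues of D(B) are ±σi(B) plus zeros, so its largest eigenvalue equals the spectral norm of B.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 5
B = rng.standard_normal((m, n))

# Hermitian dilation D(B) = [[0, B], [B^*, 0]]
D = np.block([[np.zeros((m, m)), B],
              [B.T, np.zeros((n, n))]])

lam_max = np.linalg.eigvalsh(D).max()
assert np.isclose(lam_max, np.linalg.norm(B, 2))   # lambda_max(D(B)) = ||B||
```

This is the standard trick for transferring Hermitian concentration results to general rectangular matrices.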

References

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.
Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.
Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973.
Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.
Candès, E. J. and Tao, T. The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. arXiv:0903.1476.
Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.
Chatterjee, S. Concentration Inequalities with Exchangeable Pairs. PhD thesis, Stanford University, Feb. 2008. arXiv:math/0507526.
Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: a safe tractable approximation approach. Preprint, 2011.
Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.
Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl., 30:844–881, 2008.
Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. JMLR Workshop and Conference Proceedings, 19:315–340, 2011.
Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.
Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.
Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58(301):13–30, 1963.
Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672, 2011a.
Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3, 2011b.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: applications. Israel J. Math., 167:227–282, 2008.
Lust-Piquard, F. Inégalités de Khintchine dans Cp (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.
Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.
Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.
Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. arXiv:1201.6002, 2012.
Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: optimal bounds with noise. arXiv:1009.2118, 2010.
Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, Jan. 2007.
Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. arXiv:0911.0600, Nov. 2009.
Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.
Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.
Rudelson, M. and Vershynin, R. Sampling from large matrices: an approach through geometric functional analysis. J. ACM, 54(4):Article 21, Jul. 2007.
So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.
Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Univ. California Press, 1972.
Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., Aug. 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 3: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Motivation Matrix Completion

Motivation Matrix Completion

Goal Recover a matrix L0 isin Rmtimesn from a subset of its entries

1 43 5 5

rarr

2 3 1 43 4 5 12 5 3 5

Examples

Collaborative filtering How will user i rate movie j

Ranking on the web Is URL j relevant to user i

Link prediction Is user i friends with user j

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 3 35

Motivation Matrix Completion

Motivation Matrix Completion

Goal Recover a matrix L0 isin Rmtimesn from a subset of its entries

1 43 5 5

rarr

2 3 1 43 4 5 12 5 3 5

Bad News Impossible to recover a generic matrixToo many degrees of freedom too few observations

Good News

Small number of latent factors determine preferencesMovie ratings cluster by genre and director

L0 = A

B⊤

These low-rank matrices are easier to completeMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 4 35

Motivation Matrix Completion

How to Complete a Low-rank Matrix

Suppose Ω is the set of observed entry locations

First attemptminimizeA rankA

subject to Aij = L0ij (i j) isin Ω

Problem NP-hard rArr computationally intractable

Solution Solve convex relaxation ()

minimizeA Alowastsubject to Aij = L0ij (i j) isin Ω

where Alowast =sum

k σk(A) is the tracenuclear norm of A

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 5 35

Motivation Matrix Completion

Can Convex Optimization Recover L0

Yes with high probability

Theorem (Recht 2011)

If L0 isin Rmtimesn has rank r and s amp βrn log2(n) entries are observeduniformly at random then (under some technical conditions) convexoptimization recovers L0 exactly with probability at least 1minus nminusβ

See also Gross (2011) Mackey Talwalkar and Jordan (2011)

Past results (Candes and Recht 2009 Candes and Tao 2009) requiredstronger assumptions and more intensive analysis

Streamlined approach reposes on a matrix variant of a classicalBernstein inequality (1946)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 6 35

Motivation Matrix Completion

Scalar Bernstein Inequality

Theorem (Bernstein 1946)

Let (Yk)kge1 be independent random variables in R satisfying

EYk = 0 and |Yk| le R for each index k

Define the variance parameter

σ2 =sum

kEY 2

k

Then for all t ge 0

P

sum

kYk

∣ge t

le 2 middot exp minust22σ2 + 2Rt3

Gaussian decay controlled by variance when t is small

Exponential decay controlled by uniform bound for large t

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 7 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Rmtimesn satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 = max(∥

sum

kEYkY

⊤k

∥∥

sum

kEY ⊤

k Yk

)

Then for all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

See also Tropp (2011) Oliveira (2009) Recht (2011)

Gaussian tail when t is small exponential tail for large t

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 8 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

For all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

Consequences for matrix completion

Recht (2011) showed that uniform sampling of entries capturesmost of the information in incoherent low-rank matrices

Negahban and Wainwright (2010) showed that iid sampling ofentries captures most of the information in non-spiky (near)low-rank matrices

Foygel and Srebro (2011) characterized the generalization errorof convex MC through Rademacher complexity

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 9 35

Motivation Matrix Concentration

Concentration Inequalities

Matrix concentration

Pλmax(X minus EX) ge t le δ

Difficulty Matrix multiplication is not commutative

rArr eX+Y 6= eXeY

Past approaches (Ahlswede and Winter 2002 Oliveira 2009 Tropp 2011)

Rely on deep results from matrix analysis

Apply to sums of independent matrices and matrix martingales

This work

Steinrsquos method of exchangeable pairs (1972) as advanced byChatterjee (2007) for scalar concentrationrArr Improved exponential tail inequalities (Hoeffding Bernstein)rArr Polynomial moment inequalities (Khintchine Rosenthal)rArr Dependent sums and more general matrix functionals

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 10 35

Motivation Matrix Concentration

Roadmap

1 Motivation

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent Sequences

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 11 35

Background

Notation

Hermitian matrices Hd = A isin Cdtimesd A = AlowastAll matrices in this talk are Hermitian

Maximum eigenvalue λmax(middot)Trace trB the sum of the diagonal entries of B

Spectral norm B the maximum singular value of B

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 12 35

Background

Matrix Stein Pair

Definition (Exchangeable Pair)

(ZZ prime) is an exchangeable pair if (ZZ prime)d= (Z prime Z)

Definition (Matrix Stein Pair)

Let (ZZ prime) be an exchangeable pair and let Ψ Z rarr Hd be ameasurable function Define the random matrices

X = Ψ(Z) and X prime = Ψ(Z prime)

(XX prime) is a matrix Stein pair with scale factor α isin (0 1] if

E[X prime |Z] = (1minus α)X

Matrix Stein pairs are exchangeable pairs

Matrix Stein pairs always have zero mean

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 13 35

Background

The Conditional Variance

Definition (Conditional Variance)

Suppose that (XX prime) is a matrix Stein pair with scale factor αconstructed from the exchangeable pair (ZZ prime) The conditional

variance is the random matrix

∆X = ∆X(Z) =1

2αE[

(X minusX prime)2 |Z]

∆X is a stochastic estimate for the variance EX2

Take-home Message

Control over ∆X yields control over λmax(X)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 14 35

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a matrix Stein pair with X isin Hd Suppose that

∆X 4 cX + v I almost surely for c v ge 0

Then for all t ge 0

Pλmax(X) ge t le d middot exp minust22v + 2ct

Control over the conditional variance ∆X yields

Gaussian tail for λmax(X) for small t exponential tail for large t

When d = 1 improves scalar result of Chatterjee (2007)

The dimensional factor d cannot be removed

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 15 35

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let X =sum

k Yk for independent matrices in Hd satisfying

EYk = 0 and Y 2k 4 A2

k

for deterministic matrices (Ak)kge1 Define the variance parameter

σ2 =∥

sum

kA2k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot eminust22σ2

Improves upon the matrix Hoeffding inequality of Tropp (2011)Optimal constant 12 in the exponent

Can replace variance parameter with σ2 = 12

sum

k

(

A2k + EY 2

k

)∥

Tighter than classical Hoeffding inequality (1963) when d = 1

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 16 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

1 Matrix Laplace transform method (Ahlswede amp Winter 2002)

Relate tail probability to the trace of the mgf of X

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ)

where m(θ) = E tr eθX

Problem eX+Y 6= eXeY when XY isin Hd

How to bound the trace mgf

Past approaches Golden-Thompson Liebrsquos concavity theorem

Chatterjeersquos strategy for scalar concentration

Control mgf growth by bounding derivative

mprime(θ) = E trXeθX for θ isin R

Rewrite using exchangeable pairs

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 17 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that (XX prime) is a matrix Stein pair with scale factor α LetF Hd rarr Hd be a measurable function satisfying

E(X minusX prime)F (X) ltinfin

Then

E[X F (X)] =1

2αE[(X minusX prime)(F (X)minus F (X prime))] (1)

Intuition

Can characterize the distribution of a random matrix byintegrating it against a class of test functions F

Eq 1 allows us to estimate this integral using the smoothnessproperties of F and the discrepancy X minusX prime

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 18 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

2 Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf

mprime(θ) = E trXeθX =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

Goal Use the smoothness of F (X) = eθX to bound the derivative

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 19 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey Jordan Chen Farrell and Tropp 2012)

Suppose that g R rarr R is a weakly increasing function and thath R rarr R is a function whose derivative hprime is convex For allmatrices AB isin Hd it holds that

tr[(g(A)minus g(B)) middot (h(A)minus h(B))] le1

2tr[(g(A)minus g(B)) middot (AminusB) middot (hprime(A) + hprime(B))]

Standard matrix functions If g R rarr R and

A = Q

λ1

λd

Qlowast then g(A) = Q

g(λ1)

g(λd)

Qlowast

Inequality does not hold without the traceFor exponential concentration we let g(A) = A and h(B) = eθB

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 20 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

le θ

4αE tr

[

(X minusX prime)2 middot(

eθX + eθXprime)]

2αE tr

[

(X minusX prime)2 middot eθX]

= θ middot E tr

[

1

2αE[

(X minusX prime)2 |Z]

middot eθX]

= θ middot E tr[

∆X eθX]

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 21 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-mean Property)

    E[Y_k | (Y_j)_{j≠k}] = 0

Matrix Stein Pair for X = ∑_{k=1}^n Y_k:
- Let Y′_k and Y_k be conditionally i.i.d. given (Y_j)_{j≠k}
- Draw an index K uniformly from {1, ..., n}
- Define X′ = X + Y′_K − Y_K

Check the Stein pair condition:

    E[X − X′ | (Y_j)_{j≥1}] = E[Y_K − Y′_K | (Y_j)_{j≥1}]
                            = (1/n) ∑_{k=1}^n (Y_k − E[Y′_k | (Y_j)_{j≠k}])
                            = (1/n) ∑_{k=1}^n Y_k = (1/n) X

⇒ (X, X′) is a matrix Stein pair with scale factor α = 1/n
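The α = 1/n identity can be verified exactly in a small simulation (illustrative names; a sketch, not the talk's code). For a Rademacher series with deterministic coefficients, the conditional law of Y′_K given the rest is a fresh uniform sign on A_K, so averaging over K and that sign is a finite enumeration:

```python
import numpy as np

rng = np.random.default_rng(2)
A = [(lambda G: (G + G.T) / 2)(rng.standard_normal((3, 3))) for _ in range(5)]
eps = rng.choice([-1, 1], size=len(A))  # one realization of the signs
X = sum(e * Ak for e, Ak in zip(eps, A))

# E[X - X' | (Y_j)]: average over the uniform index K and the
# conditionally fresh sign eps'_K in {-1, +1}
n = len(A)
diff_mean = np.zeros_like(X)
for K in range(n):
    for new_sign in (-1, 1):
        X_prime = X + new_sign * A[K] - eps[K] * A[K]
        diff_mean += (X - X_prime) / (2 * n)

assert np.allclose(diff_mean, X / n)  # Stein pair with alpha = 1/n
```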

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-mean Property)

    E[Y_k | (Y_j)_{j≠k}] = 0

Conditional Variance for X = ∑_{k=1}^n Y_k:

    Δ_X = (n/2) · E[(X − X′)² | (Y_j)_{j≥1}]
        = (n/2) · E[(Y_K − Y′_K)² | (Y_j)_{j≥1}]
        = (1/2) ∑_{k=1}^n (Y_k² + E[Y_k² | (Y_j)_{j≠k}])

⇒ The conditional variance is controlled when the summands are bounded
⇒ Dependent analogues of the concentration and moment inequalities follow
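For a Rademacher series with deterministic coefficients, Y_k² = E[Y_k² | rest] = A_k², so the closed form collapses to Δ_X = ∑_k A_k². A hedged sketch (illustrative names) that checks this against the defining average Δ_X = (n/2) E[(X − X′)² | (Y_j)]:

```python
import numpy as np

rng = np.random.default_rng(3)
A = [(lambda G: (G + G.T) / 2)(rng.standard_normal((3, 3))) for _ in range(4)]
eps = rng.choice([-1, 1], size=len(A))
n = len(A)

# Delta_X = (n/2) E[(X - X')^2 | (Y_j)], averaging over K and the fresh sign
Delta = np.zeros((3, 3))
for K in range(n):
    for new_sign in (-1, 1):
        D = eps[K] * A[K] - new_sign * A[K]  # X - X'
        Delta += (n / 2) * (D @ D) / (2 * n)

# Closed form (1/2) sum_k (Y_k^2 + E[Y_k^2 | rest]) = sum_k A_k^2 here
assert np.allclose(Delta, sum(Ak @ Ak for Ak in A))
```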

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (A_{jk})_{j,k=1}^n of Hermitian matrices and a
uniformly random permutation π on {1, ..., n}, define the
combinatorial matrix statistic

    Y = ∑_{j=1}^n A_{j,π(j)}   with mean   E Y = (1/n) ∑_{j,k=1}^n A_{jk}

Generalizes the scalar statistics studied by Hoeffding (1951).

Example:
- Sampling without replacement from {B_1, ..., B_n}:
      W = ∑_{j=1}^s B_{π(j)}
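The mean formula follows because π(j) is uniform on {1, ..., n} for each j. For a small n it can be checked exactly by averaging Y over all n! permutations (an illustrative sketch, not the talk's code):

```python
from itertools import permutations

import numpy as np

rng = np.random.default_rng(4)
n, d = 4, 2
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2  # make each entry A[j, k] Hermitian

# Exact expectation: average the statistic over every permutation pi
Ys = [sum(A[j, pi[j]] for j in range(n)) for pi in permutations(range(n))]
EY = np.mean(Ys, axis=0)

assert np.allclose(EY, A.sum(axis=(0, 1)) / n)  # E Y = (1/n) sum_{j,k} A_jk
```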

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

    Y = ∑_{j=1}^n A_{j,π(j)}   with mean   E Y = (1/n) ∑_{j,k=1}^n A_{jk}

Matrix Stein Pair for X = Y − E Y:
- Draw an index pair (J, K) uniformly from {1, ..., n}²
- Define π′ = π ∘ (J, K) and X′ = ∑_{j=1}^n A_{j,π′(j)} − E Y

Check the Stein pair condition:

    E[X − X′ | π] = E[A_{J,π(J)} + A_{K,π(K)} − A_{J,π(K)} − A_{K,π(J)} | π]
                  = (1/n²) ∑_{j,k=1}^n (A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)})
                  = (2/n)(Y − E Y) = (2/n) X

⇒ (X, X′) is a matrix Stein pair with scale factor α = 2/n
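The α = 2/n identity is a finite average over the n² index pairs, so it can be verified exactly for one fixed permutation (a sketch with illustrative names):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 4, 2
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2  # Hermitian entries A[j, k]

EY = A.sum(axis=(0, 1)) / n
pi = list(rng.permutation(n))  # one fixed permutation
X = sum(A[j, pi[j]] for j in range(n)) - EY

# E[X - X' | pi]: average over the uniform index pair (J, K)
diff_mean = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        pi2 = pi.copy()
        pi2[J], pi2[K] = pi2[K], pi2[J]  # pi' = pi composed with (J, K)
        X2 = sum(A[j, pi2[j]] for j in range(n)) - EY
        diff_mean += (X - X2) / n ** 2

assert np.allclose(diff_mean, (2 / n) * X)  # Stein pair with alpha = 2/n
```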

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

    Y = ∑_{j=1}^n A_{j,π(j)}   with mean   E Y = (1/n) ∑_{j,k=1}^n A_{jk}

Conditional Variance for X = Y − E Y:

    Δ_X(π) = (n/4) · E[(X − X′)² | π]
           = (1/4n) ∑_{j,k=1}^n [A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)}]²
           ≼ (1/n) ∑_{j,k=1}^n [A²_{j,π(j)} + A²_{k,π(k)} + A²_{j,π(k)} + A²_{k,π(j)}]

⇒ The conditional variance is controlled when the summands are bounded
⇒ Dependent analogues of the concentration and moment inequalities follow
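The last step uses the semidefinite bound (a + b − c − d)² ≼ 4(a² + b² + c² + d²) for Hermitian matrices. A numerical check of the domination for one random instance (illustrative sketch; ≼ is verified via the smallest eigenvalue of the difference):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 4, 3
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2  # Hermitian entries A[j, k]
pi = list(rng.permutation(n))

Delta = np.zeros((d, d))
bound = np.zeros((d, d))
for j in range(n):
    for k in range(n):
        M = A[j, pi[j]] + A[k, pi[k]] - A[j, pi[k]] - A[k, pi[j]]
        Delta += (M @ M) / (4 * n)  # Delta_X(pi)
        for B in (A[j, pi[j]], A[k, pi[k]], A[j, pi[k]], A[k, pi[j]]):
            bound += (B @ B) / n    # the semidefinite upper bound

# bound - Delta should be positive semidefinite
assert np.linalg.eigvalsh(bound - Delta).min() >= -1e-10
```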

Extensions

Extensions

General Complex Matrices
- Map any matrix B ∈ C^{d₁×d₂} to a Hermitian matrix via dilation:

      D(B) = [ 0   B ]
             [ B*  0 ]  ∈ H^{d₁+d₂}

- Preserves spectral information: λ_max(D(B)) = ‖B‖

Beyond Sums
- Matrix-valued functions satisfying a self-reproducing property
- e.g., matrix second-order Rademacher chaos: ∑_{j,k} ε_j ε_k A_{jk}
- Yields a dependent bounded-differences inequality for matrices

Generalized Matrix Stein Pairs
- Satisfy E[g(X) − g(X′) | Z] = α X almost surely for
  g : ℝ → ℝ weakly increasing
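The dilation trick is easy to exercise directly: the eigenvalues of D(B) are ± the singular values of B, so the top eigenvalue recovers the spectral norm. A small sketch (illustrative names):

```python
import numpy as np


def dilation(B):
    """Hermitian dilation of a (possibly rectangular) matrix B."""
    d1, d2 = B.shape
    top = np.hstack([np.zeros((d1, d1)), B])
    bot = np.hstack([B.conj().T, np.zeros((d2, d2))])
    return np.vstack([top, bot])


rng = np.random.default_rng(7)
B = rng.standard_normal((3, 5))
D = dilation(B)

assert np.allclose(D, D.conj().T)        # D(B) is Hermitian
lam_max = np.linalg.eigvalsh(D).max()
spec_norm = np.linalg.norm(B, 2)         # largest singular value of B
assert np.isclose(lam_max, spec_norm)    # lambda_max(D(B)) = ||B||
```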

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Available at http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi:10.1007/s10107-006-0033-0. URL http://dl.acm.org/citation.cfm?id=1229716.1229726.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007 (electronic).

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 4: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Motivation Matrix Completion

Motivation Matrix Completion

Goal Recover a matrix L0 isin Rmtimesn from a subset of its entries

1 43 5 5

rarr

2 3 1 43 4 5 12 5 3 5

Bad News Impossible to recover a generic matrixToo many degrees of freedom too few observations

Good News

Small number of latent factors determine preferencesMovie ratings cluster by genre and director

L0 = A

B⊤

These low-rank matrices are easier to completeMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 4 35

Motivation Matrix Completion

How to Complete a Low-rank Matrix

Suppose Ω is the set of observed entry locations

First attemptminimizeA rankA

subject to Aij = L0ij (i j) isin Ω

Problem NP-hard rArr computationally intractable

Solution Solve convex relaxation ()

minimizeA Alowastsubject to Aij = L0ij (i j) isin Ω

where Alowast =sum

k σk(A) is the tracenuclear norm of A

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 5 35

Motivation Matrix Completion

Can Convex Optimization Recover L0

Yes with high probability

Theorem (Recht 2011)

If L0 isin Rmtimesn has rank r and s amp βrn log2(n) entries are observeduniformly at random then (under some technical conditions) convexoptimization recovers L0 exactly with probability at least 1minus nminusβ

See also Gross (2011) Mackey Talwalkar and Jordan (2011)

Past results (Candes and Recht 2009 Candes and Tao 2009) requiredstronger assumptions and more intensive analysis

Streamlined approach reposes on a matrix variant of a classicalBernstein inequality (1946)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 6 35

Motivation Matrix Completion

Scalar Bernstein Inequality

Theorem (Bernstein 1946)

Let (Yk)kge1 be independent random variables in R satisfying

EYk = 0 and |Yk| le R for each index k

Define the variance parameter

σ2 =sum

kEY 2

k

Then for all t ge 0

P

sum

kYk

∣ge t

le 2 middot exp minust22σ2 + 2Rt3

Gaussian decay controlled by variance when t is small

Exponential decay controlled by uniform bound for large t

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 7 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Rmtimesn satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 = max(∥

sum

kEYkY

⊤k

∥∥

sum

kEY ⊤

k Yk

)

Then for all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

See also Tropp (2011) Oliveira (2009) Recht (2011)

Gaussian tail when t is small exponential tail for large t

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 8 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

For all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

Consequences for matrix completion

Recht (2011) showed that uniform sampling of entries capturesmost of the information in incoherent low-rank matrices

Negahban and Wainwright (2010) showed that iid sampling ofentries captures most of the information in non-spiky (near)low-rank matrices

Foygel and Srebro (2011) characterized the generalization errorof convex MC through Rademacher complexity

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 9 35

Motivation Matrix Concentration

Concentration Inequalities

Matrix concentration

Pλmax(X minus EX) ge t le δ

Difficulty Matrix multiplication is not commutative

rArr eX+Y 6= eXeY

Past approaches (Ahlswede and Winter 2002 Oliveira 2009 Tropp 2011)

Rely on deep results from matrix analysis

Apply to sums of independent matrices and matrix martingales

This work

Steinrsquos method of exchangeable pairs (1972) as advanced byChatterjee (2007) for scalar concentrationrArr Improved exponential tail inequalities (Hoeffding Bernstein)rArr Polynomial moment inequalities (Khintchine Rosenthal)rArr Dependent sums and more general matrix functionals

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 10 35

Motivation Matrix Concentration

Roadmap

1 Motivation

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent Sequences

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 11 35

Background

Notation

Hermitian matrices Hd = A isin Cdtimesd A = AlowastAll matrices in this talk are Hermitian

Maximum eigenvalue λmax(middot)Trace trB the sum of the diagonal entries of B

Spectral norm B the maximum singular value of B

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 12 35

Background

Matrix Stein Pair

Definition (Exchangeable Pair)

(ZZ prime) is an exchangeable pair if (ZZ prime)d= (Z prime Z)

Definition (Matrix Stein Pair)

Let (ZZ prime) be an exchangeable pair and let Ψ Z rarr Hd be ameasurable function Define the random matrices

X = Ψ(Z) and X prime = Ψ(Z prime)

(XX prime) is a matrix Stein pair with scale factor α isin (0 1] if

E[X prime |Z] = (1minus α)X

Matrix Stein pairs are exchangeable pairs

Matrix Stein pairs always have zero mean

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 13 35

Background

The Conditional Variance

Definition (Conditional Variance)

Suppose that (XX prime) is a matrix Stein pair with scale factor αconstructed from the exchangeable pair (ZZ prime) The conditional

variance is the random matrix

∆X = ∆X(Z) =1

2αE[

(X minusX prime)2 |Z]

∆X is a stochastic estimate for the variance EX2

Take-home Message

Control over ∆X yields control over λmax(X)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 14 35

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a matrix Stein pair with X isin Hd Suppose that

∆X 4 cX + v I almost surely for c v ge 0

Then for all t ge 0

Pλmax(X) ge t le d middot exp minust22v + 2ct

Control over the conditional variance ∆X yields

Gaussian tail for λmax(X) for small t exponential tail for large t

When d = 1 improves scalar result of Chatterjee (2007)

The dimensional factor d cannot be removed

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 15 35

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let X =sum

k Yk for independent matrices in Hd satisfying

EYk = 0 and Y 2k 4 A2

k

for deterministic matrices (Ak)kge1 Define the variance parameter

σ2 =∥

sum

kA2k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot eminust22σ2

Improves upon the matrix Hoeffding inequality of Tropp (2011)Optimal constant 12 in the exponent

Can replace variance parameter with σ2 = 12

sum

k

(

A2k + EY 2

k

)∥

Tighter than classical Hoeffding inequality (1963) when d = 1

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 16 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

1 Matrix Laplace transform method (Ahlswede amp Winter 2002)

Relate tail probability to the trace of the mgf of X

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ)

where m(θ) = E tr eθX

Problem eX+Y 6= eXeY when XY isin Hd

How to bound the trace mgf

Past approaches Golden-Thompson Liebrsquos concavity theorem

Chatterjeersquos strategy for scalar concentration

Control mgf growth by bounding derivative

mprime(θ) = E trXeθX for θ isin R

Rewrite using exchangeable pairs

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 17 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that (XX prime) is a matrix Stein pair with scale factor α LetF Hd rarr Hd be a measurable function satisfying

E(X minusX prime)F (X) ltinfin

Then

E[X F (X)] =1

2αE[(X minusX prime)(F (X)minus F (X prime))] (1)

Intuition

Can characterize the distribution of a random matrix byintegrating it against a class of test functions F

Eq 1 allows us to estimate this integral using the smoothnessproperties of F and the discrepancy X minusX prime

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 18 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

2 Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf

mprime(θ) = E trXeθX =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

Goal Use the smoothness of F (X) = eθX to bound the derivative

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 19 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey Jordan Chen Farrell and Tropp 2012)

Suppose that g R rarr R is a weakly increasing function and thath R rarr R is a function whose derivative hprime is convex For allmatrices AB isin Hd it holds that

tr[(g(A)minus g(B)) middot (h(A)minus h(B))] le1

2tr[(g(A)minus g(B)) middot (AminusB) middot (hprime(A) + hprime(B))]

Standard matrix functions If g R rarr R and

A = Q

λ1

λd

Qlowast then g(A) = Q

g(λ1)

g(λd)

Qlowast

Inequality does not hold without the traceFor exponential concentration we let g(A) = A and h(B) = eθB

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 20 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

le θ

4αE tr

[

(X minusX prime)2 middot(

eθX + eθXprime)]

2αE tr

[

(X minusX prime)2 middot eθX]

= θ middot E tr

[

1

2αE[

(X minusX prime)2 |Z]

middot eθX]

= θ middot E tr[

∆X eθX]

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 21 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 5: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Motivation Matrix Completion

How to Complete a Low-rank Matrix

Suppose Ω is the set of observed entry locations

First attemptminimizeA rankA

subject to Aij = L0ij (i j) isin Ω

Problem NP-hard rArr computationally intractable

Solution Solve convex relaxation ()

minimizeA Alowastsubject to Aij = L0ij (i j) isin Ω

where Alowast =sum

k σk(A) is the tracenuclear norm of A

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 5 35

Motivation Matrix Completion

Can Convex Optimization Recover L0

Yes with high probability

Theorem (Recht 2011)

If L0 isin Rmtimesn has rank r and s amp βrn log2(n) entries are observeduniformly at random then (under some technical conditions) convexoptimization recovers L0 exactly with probability at least 1minus nminusβ

See also Gross (2011) Mackey Talwalkar and Jordan (2011)

Past results (Candes and Recht 2009 Candes and Tao 2009) requiredstronger assumptions and more intensive analysis

Streamlined approach reposes on a matrix variant of a classicalBernstein inequality (1946)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 6 35

Scalar Bernstein Inequality

Theorem (Bernstein, 1946)
Let (Yk)k≥1 be independent random variables in R satisfying
    E Yk = 0 and |Yk| ≤ R for each index k.
Define the variance parameter
    σ² = Σk E Yk².
Then, for all t ≥ 0,
    P{|Σk Yk| ≥ t} ≤ 2 · exp{−t² / (2σ² + 2Rt/3)}.

Gaussian decay controlled by variance when t is small
Exponential decay controlled by uniform bound for large t
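The bound can be compared against a Monte Carlo estimate of the tail. A quick numpy sketch (our own illustration, not from the slides), using uniform variables on [−1, 1] so that R = 1 and E Yk² = 1/3:

```python
import numpy as np

rng = np.random.default_rng(1)
n, R, trials, t = 100, 1.0, 20000, 20.0

# Y_k uniform on [-1, 1]: mean zero, |Y_k| <= R = 1, E Y_k^2 = 1/3.
sigma2 = n / 3.0
bound = 2 * np.exp(-t**2 / (2 * sigma2 + 2 * R * t / 3))

sums = rng.uniform(-1, 1, size=(trials, n)).sum(axis=1)
empirical = np.mean(np.abs(sums) >= t)

# The theorem guarantees the true tail is below the bound; the
# empirical frequency should sit well below it too.
assert empirical <= bound
assert bound < 1.0
```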

Matrix Bernstein Inequality

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (Yk)k≥1 be independent matrices in R^{m×n} satisfying
    E Yk = 0 and ‖Yk‖ ≤ R for each index k.
Define the variance parameter
    σ² = max{‖Σk E Yk Ykᵀ‖, ‖Σk E Ykᵀ Yk‖}.
Then, for all t ≥ 0,
    P{‖Σk Yk‖ ≥ t} ≤ (m + n) · exp{−t² / (3σ² + 2Rt)}.

See also Tropp (2011), Oliveira (2009), Recht (2011).

Gaussian tail when t is small, exponential tail for large t
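The statement is directly computable: given the summands, one evaluates σ², picks a threshold t, and compares the bound to an empirical tail. A hedged numpy sketch (ours, not from the slides) for a Rademacher series Yk = εk Bk with fixed rectangular coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, N, trials = 3, 4, 50, 5000

# Fixed rectangular coefficients rescaled so each ||B_k|| = 1;
# Y_k = eps_k B_k with Rademacher signs, hence E Y_k = 0 and R = 1.
B = rng.standard_normal((N, m, n))
B /= np.linalg.norm(B, ord=2, axis=(1, 2))[:, None, None]
R = 1.0

# Variance parameter: here E Y_k Y_k^T = B_k B_k^T since eps_k^2 = 1.
S_left = np.einsum('kij,klj->il', B, B)    # sum_k B_k B_k^T
S_right = np.einsum('kji,kjl->il', B, B)   # sum_k B_k^T B_k
sigma2 = max(np.linalg.norm(S_left, 2), np.linalg.norm(S_right, 2))

t = 4.0 * np.sqrt(sigma2)
bound = (m + n) * np.exp(-t**2 / (3 * sigma2 + 2 * R * t))

eps = rng.choice([-1.0, 1.0], size=(trials, N))
sums = np.einsum('tk,kmn->tmn', eps, B)
empirical = np.mean(np.linalg.norm(sums, ord=2, axis=(1, 2)) >= t)
assert empirical <= bound
assert bound < 1.0
```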

Matrix Bernstein Inequality

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
For all t ≥ 0,
    P{‖Σk Yk‖ ≥ t} ≤ (m + n) · exp{−t² / (3σ² + 2Rt)}.

Consequences for matrix completion:

Recht (2011) showed that uniform sampling of entries captures most of the information in incoherent low-rank matrices.

Negahban and Wainwright (2010) showed that i.i.d. sampling of entries captures most of the information in non-spiky (near) low-rank matrices.

Foygel and Srebro (2011) characterized the generalization error of convex MC through Rademacher complexity.

Concentration Inequalities

Matrix concentration:
    P{λmax(X − EX) ≥ t} ≤ δ

Difficulty: Matrix multiplication is not commutative
    ⇒ e^{X+Y} ≠ e^X e^Y

Past approaches (Ahlswede and Winter, 2002; Oliveira, 2009; Tropp, 2011)
    Rely on deep results from matrix analysis
    Apply to sums of independent matrices and matrix martingales

This work:
    Stein's method of exchangeable pairs (1972), as advanced by Chatterjee (2007) for scalar concentration
    ⇒ Improved exponential tail inequalities (Hoeffding, Bernstein)
    ⇒ Polynomial moment inequalities (Khintchine, Rosenthal)
    ⇒ Dependent sums and more general matrix functionals

Roadmap

1 Motivation
2 Stein's Method Background and Notation
3 Exponential Tail Inequalities
4 Polynomial Moment Inequalities
5 Dependent Sequences
6 Extensions

Background

Notation

Hermitian matrices: H^d = {A ∈ C^{d×d} : A = A*}.
    All matrices in this talk are Hermitian.
Maximum eigenvalue: λmax(·)
Trace: tr B, the sum of the diagonal entries of B
Spectral norm: ‖B‖, the maximum singular value of B

Matrix Stein Pair

Definition (Exchangeable Pair)
(Z, Z′) is an exchangeable pair if (Z, Z′) =_d (Z′, Z).

Definition (Matrix Stein Pair)
Let (Z, Z′) be an exchangeable pair, and let Ψ : Z → H^d be a measurable function. Define the random matrices
    X = Ψ(Z) and X′ = Ψ(Z′).
(X, X′) is a matrix Stein pair with scale factor α ∈ (0, 1] if
    E[X′ | Z] = (1 − α)X.

Matrix Stein pairs are exchangeable pairs.
Matrix Stein pairs always have zero mean.
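A standard way to build such a pair (used later in the talk for independent sums): let Z = (Y1, …, Yn) with independent centered summands, and form Z′ by replacing a uniformly chosen coordinate with a fresh copy; then X = Σk Yk gives a Stein pair with α = 1/n. The defining identity can be checked exactly, since averaging over the uniform index is a finite sum. An illustrative numpy sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 10

# Z = (Y_1, ..., Y_n): independent centered Hermitian matrices; X = sum_k Y_k.
# Z' replaces a uniformly chosen coordinate with a fresh copy, giving a
# matrix Stein pair with scale factor alpha = 1/n.
G = rng.standard_normal((n, d, d))
Y = (G + G.transpose(0, 2, 1)) / 2          # symmetrize; E Y_k = 0
X = Y.sum(axis=0)

# E[X' | Z]: average over the uniform index K, using E[Y'_K] = 0.
cond_exp = np.mean([X - Y[k] for k in range(n)], axis=0)
alpha = 1.0 / n
assert np.allclose(cond_exp, (1 - alpha) * X)
```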

The Conditional Variance

Definition (Conditional Variance)
Suppose that (X, X′) is a matrix Stein pair with scale factor α, constructed from the exchangeable pair (Z, Z′). The conditional variance is the random matrix
    ∆X = ∆X(Z) = (1/2α) · E[(X − X′)² | Z].

∆X is a stochastic estimate for the variance, E X².

Take-home Message
Control over ∆X yields control over λmax(X).

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (X, X′) be a matrix Stein pair with X ∈ H^d. Suppose that
    ∆X ≼ cX + v·I almost surely, for c, v ≥ 0.
Then, for all t ≥ 0,
    P{λmax(X) ≥ t} ≤ d · exp{−t² / (2v + 2ct)}.

Control over the conditional variance ∆X yields:
    Gaussian tail for λmax(X) for small t, exponential tail for large t
    When d = 1, improves scalar result of Chatterjee (2007)
    The dimensional factor d cannot be removed

Matrix Hoeffding Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let X = Σk Yk for independent matrices in H^d satisfying
    E Yk = 0 and Yk² ≼ Ak²
for deterministic matrices (Ak)k≥1. Define the variance parameter
    σ² = ‖Σk Ak²‖.
Then, for all t ≥ 0,
    P{λmax(Σk Yk) ≥ t} ≤ d · e^{−t²/2σ²}.

Improves upon the matrix Hoeffding inequality of Tropp (2011)
Optimal constant 1/2 in the exponent
Can replace variance parameter with σ² = (1/2)‖Σk (Ak² + E Yk²)‖
Tighter than classical Hoeffding inequality (1963) when d = 1

Exponential Concentration: Proof Sketch

1. Matrix Laplace transform method (Ahlswede & Winter, 2002)
    Relate tail probability to the trace of the mgf of X:
        P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ), where m(θ) = E tr e^{θX}.
    Problem: e^{X+Y} ≠ e^X e^Y when X, Y ∈ H^d.
    How to bound the trace mgf?
    Past approaches: Golden–Thompson, Lieb's concavity theorem

Chatterjee's strategy for scalar concentration
    Control mgf growth by bounding the derivative
        m′(θ) = E tr X e^{θX} for θ ∈ R.
    Rewrite using exchangeable pairs.
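The Laplace transform step rests on a pointwise domination: whenever λmax(X) ≥ t, we have tr e^{θX} ≥ e^{θλmax(X)} ≥ e^{θt}, so the indicator of the tail event is bounded by e^{−θt} tr e^{θX} sample by sample. That makes the inequality hold even between raw Monte Carlo estimates. A small numpy sketch (our illustration, with arbitrarily chosen t and θ):

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, trials, t, theta = 3, 30, 2000, 3.0, 1.0

def trace_exp(A):
    # tr e^A for symmetric A, via its eigenvalues
    return np.exp(np.linalg.eigvalsh(A)).sum()

# A Rademacher series with fixed Hermitian coefficients.
A = rng.standard_normal((N, d, d))
A = (A + A.transpose(0, 2, 1)) / (2 * np.sqrt(N))
eps = rng.choice([-1.0, 1.0], size=(trials, N))
Xs = np.einsum('tk,kij->tij', eps, A)

tail = np.mean([np.linalg.eigvalsh(X)[-1] >= t for X in Xs])
mgf = np.mean([trace_exp(theta * X) for X in Xs])

# 1{lmax(X) >= t} <= e^{-theta t} tr e^{theta X} holds per sample,
# so the Monte Carlo estimates inherit the Laplace-transform bound.
assert tail <= np.exp(-theta * t) * mgf
```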

Method of Exchangeable Pairs

Lemma
Suppose that (X, X′) is a matrix Stein pair with scale factor α. Let F : H^d → H^d be a measurable function satisfying
    E‖(X − X′)F(X)‖ < ∞.
Then
    E[X F(X)] = (1/2α) · E[(X − X′)(F(X) − F(X′))].   (1)

Intuition
    Can characterize the distribution of a random matrix by integrating it against a class of test functions F.
    Eq. (1) allows us to estimate this integral using the smoothness properties of F and the discrepancy X − X′.

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs
    Rewrite the derivative of the trace mgf:
        m′(θ) = E tr X e^{θX} = (1/2α) · E tr[(X − X′)(e^{θX} − e^{θX′})].
    Goal: Use the smoothness of F(X) = e^{θX} to bound the derivative.

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Suppose that g : R → R is a weakly increasing function and that h : R → R is a function whose derivative h′ is convex. For all matrices A, B ∈ H^d, it holds that
    tr[(g(A) − g(B)) · (h(A) − h(B))] ≤ (1/2) tr[(g(A) − g(B)) · (A − B) · (h′(A) + h′(B))].

Standard matrix functions: if g : R → R and A = Q diag(λ1, …, λd) Q*, then g(A) = Q diag(g(λ1), …, g(λd)) Q*.

The inequality does not hold without the trace.
For exponential concentration, we let g(A) = A and h(B) = e^{θB}.
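The lemma is easy to probe numerically for the instance used in the proof, g(A) = A and h(A) = e^{θA} (so h′(A) = θe^{θA}, convex for θ > 0). A numpy sketch (ours), building standard matrix functions from an eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(5)
d, theta = 4, 0.7

def sym(M):
    return (M + M.T) / 2

def fun(A, f):
    # standard matrix function: apply f to the eigenvalues of symmetric A
    lam, Q = np.linalg.eigh(A)
    return (Q * f(lam)) @ Q.T

A, B = sym(rng.standard_normal((d, d))), sym(rng.standard_normal((d, d)))

# g(A) = A, h(A) = e^{theta A}, h'(A) = theta e^{theta A}
h = lambda M: fun(M, lambda x: np.exp(theta * x))
lhs = np.trace((A - B) @ (h(A) - h(B)))
rhs = 0.5 * np.trace((A - B) @ (A - B) @ (theta * (h(A) + h(B))))
assert lhs <= rhs + 1e-8
```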

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality
    Bound the derivative of the trace mgf:
        m′(θ) = (1/2α) · E tr[(X − X′)(e^{θX} − e^{θX′})]
              ≤ (θ/4α) · E tr[(X − X′)² · (e^{θX} + e^{θX′})]
              = (θ/2α) · E tr[(X − X′)² · e^{θX}]
              = θ · E tr[(1/2α) E[(X − X′)² | Z] · e^{θX}]
              = θ · E tr[∆X e^{θX}].

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality
    Bound the derivative of the trace mgf:
        m′(θ) ≤ θ · E tr[∆X e^{θX}].

4. Conditional Variance Bound: ∆X ≼ cX + v·I
    Yields a differential inequality:
        m′(θ) ≤ cθ E tr[X e^{θX}] + vθ E tr[e^{θX}] = cθ · m′(θ) + vθ · m(θ).
    Solve to bound m(θ) and thereby bound
        P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{−t² / (2v + 2ct)}.
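The "solve" step, which the slide leaves implicit, is a standard integration of the differential inequality; a sketch of the computation (our reconstruction of the routine argument):

```latex
% For 0 < \theta < 1/c, rearrange the differential inequality:
m'(\theta)\,(1 - c\theta) \le v\theta\, m(\theta)
\quad\Longrightarrow\quad
\frac{d}{d\theta}\log m(\theta) \le \frac{v\theta}{1 - c\theta}.
% Integrate from 0 to \theta, using m(0) = \mathbb{E}\,\mathrm{tr}\,e^{0} = d:
\log m(\theta) \le \log d + \int_0^\theta \frac{vs}{1 - cs}\,ds
  \le \log d + \frac{v\theta^2}{2(1 - c\theta)}.
% Substitute into the Laplace bound and choose \theta = t/(v + ct):
\mathbb{P}\{\lambda_{\max}(X) \ge t\}
  \le d \cdot \exp\Big\{-\theta t + \frac{v\theta^2}{2(1 - c\theta)}\Big\}
  = d \cdot \exp\Big\{\frac{-t^2}{2v + 2ct}\Big\}.
```

With this choice of θ, 1 − cθ = v/(v + ct), so the two exponent terms combine to −t²/(2v + 2ct), matching the stated tail bound.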

Refined Exponential Concentration

Relaxing the constraint ∆X ≼ cX + v·I:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (X, X′) be a bounded matrix Stein pair with X ∈ H^d. Define the function
    r(ψ) = (1/ψ) · log(E tr e^{ψ∆X} / d) for each ψ > 0.
Then, for all t ≥ 0 and all ψ > 0,
    P{λmax(X) ≥ t} ≤ d · exp{−t² / (2r(ψ) + 2t/√ψ)}.

r(ψ) measures the typical magnitude of the conditional variance:
    E λmax(∆X) ≤ inf_{ψ>0} [r(ψ) + (log d)/ψ].

When d = 1, improves scalar result of Chatterjee (2008).
Proof extends to unbounded random matrices.

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (Yk)k≥1 be independent matrices in H^d satisfying
    E Yk = 0 and ‖Yk‖ ≤ R for each index k.
Define the variance parameter
    σ² = ‖Σk E Yk²‖.
Then, for all t ≥ 0,
    P{λmax(Σk Yk) ≥ t} ≤ d · exp{−t² / (3σ² + 2Rt)}.

Gaussian tail controlled by improved variance when t is small.
Key proof idea: Apply refined concentration and bound r(ψ) = (1/ψ) · log(E tr e^{ψ∆X} / d) using unrefined concentration.
Constants better than Oliveira (2009), worse than Tropp (2011).

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let p = 1 or p ≥ 1.5. Suppose that (X, X′) is a matrix Stein pair where E tr |X|^{2p} < ∞. Then
    (E tr |X|^{2p})^{1/2p} ≤ √(2p − 1) · (E tr ∆X^p)^{1/2p}.

Moral: The conditional variance controls the moments of X.

Generalizes Chatterjee's version (2007) of the scalar Burkholder–Davis–Gundy inequality (Burkholder, 1973).
See also Pisier & Xu (1997), Junge & Xu (2003, 2008).
Proof techniques mirror those for exponential concentration.
Also holds for infinite-dimensional Schatten-class operators.

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (εk)k≥1 be an independent sequence of Rademacher random variables and (Ak)k≥1 be a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,
    E tr(Σk εk Ak)^{2p} ≤ (2p − 1)^p · tr(Σk Ak²)^p.

The noncommutative Khintchine inequality (Lust-Piquard, 1986; Lust-Piquard and Pisier, 1991) is a dominant tool in applied matrix analysis.
    e.g., used in analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007)
Stein's method offers an unusually concise proof.
The constant √(2p − 1) is within √e of optimal.
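For a small number of coefficients the expectation over the Rademacher signs is a finite average over all 2^n sign patterns, so the inequality can be verified exactly. An illustrative numpy sketch (ours) for p = 2:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(6)
d, n, p = 3, 4, 2

A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2   # Hermitian (real symmetric) coefficients

# Exact expectation over all 2^n Rademacher sign patterns.
lhs = 0.0
for signs in product([-1.0, 1.0], repeat=n):
    X = np.einsum('k,kij->ij', np.array(signs), A)
    lhs += np.trace(np.linalg.matrix_power(X, 2 * p))
lhs /= 2 ** n

S = np.einsum('kij,kjl->il', A, A)   # sum_k A_k^2
rhs = (2 * p - 1) ** p * np.trace(np.linalg.matrix_power(S, p))
assert lhs <= rhs + 1e-9
```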

Dependent Sequences

Adding Dependence

1 Motivation
    Matrix Completion
    Matrix Concentration
2 Stein's Method Background and Notation
3 Exponential Tail Inequalities
4 Polynomial Moment Inequalities
5 Dependent Sequences
    Sums of Conditionally Zero-mean Matrices
    Combinatorial Sums
6 Extensions

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)
Given a sequence of Hermitian matrices (Yk)_{k=1}^n satisfying the conditional zero-mean property
    E[Yk | (Yj)_{j≠k}] = 0
for all k, define the random sum X = Σ_{k=1}^n Yk.

Note: (Yk)k≥1 is a martingale difference sequence.

Examples
    Sums of independent centered random matrices
    Many sums of conditionally independent random matrices:
        Yk ⊥⊥ (Yj)_{j≠k} | Z and E[Yk | Z] = 0
    Rademacher series with random matrix coefficients:
        X = Σk εk Wk,
        (Wk)k≥1 Hermitian, (εk)k≥1 independent Rademacher

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-Mean Property)
    E[Yk | (Yj)_{j≠k}] = 0

Matrix Stein Pair for X = Σ_{k=1}^n Yk
    Let Y′k and Yk be conditionally i.i.d. given (Yj)_{j≠k}.
    Draw index K uniformly from {1, …, n}.
    Define X′ = X + Y′_K − Y_K.

Check Stein pair condition:
    E[X − X′ | (Yj)_{j≥1}] = E[Y_K − Y′_K | (Yj)_{j≥1}]
        = (1/n) Σ_{k=1}^n (Yk − E[Y′k | (Yj)_{j≠k}])
        = (1/n) Σ_{k=1}^n Yk = (1/n) X.

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-Mean Property)
    E[Yk | (Yj)_{j≠k}] = 0

Conditional Variance for X = Σ_{k=1}^n Yk
    ∆X = (n/2) · E[(X − X′)² | (Yj)_{j≥1}]
       = (n/2) · E[(Y_K − Y′_K)² | (Yj)_{j≥1}]
       = (1/2) Σ_{k=1}^n (Yk² + E[Yk² | (Yj)_{j≠k}])

⇒ Conditional variance controlled when summands are bounded
⇒ Dependent analogues of concentration and moment inequalities
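For the Rademacher-series example X = Σk εk Wk with fixed Hermitian Wk, the formula is especially concrete: Yk² = Wk² and E[Yk² | rest] = Wk², so ∆X = Σk Wk², a deterministic matrix, and it coincides exactly with E X². A numpy sketch (ours) checking this by enumerating all sign patterns:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
d, n = 3, 4

W = rng.standard_normal((n, d, d))
W = (W + W.transpose(0, 2, 1)) / 2   # fixed Hermitian coefficients

# For X = sum_k eps_k W_k: Y_k^2 = W_k^2 and E[Y_k^2 | rest] = W_k^2,
# so Delta_X = (1/2) sum_k (Y_k^2 + E[Y_k^2 | rest]) = sum_k W_k^2.
Delta = np.einsum('kij,kjl->il', W, W)

# It matches E X^2 exactly (enumerate all 2^n sign patterns).
EX2 = np.zeros((d, d))
for signs in product([-1.0, 1.0], repeat=n):
    X = np.einsum('k,kij->ij', np.array(signs), W)
    EX2 += X @ X
EX2 /= 2 ** n
assert np.allclose(Delta, EX2)
```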

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
Given a deterministic array (Ajk)_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic
    Y = Σ_{j=1}^n A_{jπ(j)}, with mean EY = (1/n) Σ_{j,k=1}^n Ajk.

Generalizes the scalar statistics studied by Hoeffding (1951).

Example
    Sampling without replacement from {B1, …, Bn}:
        W = Σ_{j=1}^s B_{π(j)}

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
    Y = Σ_{j=1}^n A_{jπ(j)}, with mean EY = (1/n) Σ_{j,k=1}^n Ajk

Matrix Stein Pair for X = Y − EY
    Draw indices (J, K) uniformly from {1, …, n}².
    Define π′ = π ∘ (J, K) and X′ = Σ_{j=1}^n A_{jπ′(j)} − EY.

Check Stein pair condition:
    E[X − X′ | π] = E[A_{Jπ(J)} + A_{Kπ(K)} − A_{Jπ(K)} − A_{Kπ(J)} | π]
        = (1/n²) Σ_{j,k=1}^n (A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)})
        = (2/n)(Y − EY) = (2/n) X.
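The conditional expectation over the transposition indices (J, K) is a finite average over n² pairs, so the Stein-pair identity can be checked exactly for any fixed permutation. A numpy sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 5, 3

A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2          # Hermitian array (A_jk)
EY = A.sum(axis=(0, 1)) / n                    # EY = (1/n) sum_{j,k} A_jk

pi = rng.permutation(n)                        # a fixed permutation pi
Y = sum(A[j, pi[j]] for j in range(n))
X = Y - EY

# E[X - X' | pi]: average the swap increment over all n^2 pairs (J, K).
diff = np.zeros((d, d))
for j in range(n):
    for k in range(n):
        diff += A[j, pi[j]] + A[k, pi[k]] - A[j, pi[k]] - A[k, pi[j]]
diff /= n ** 2
assert np.allclose(diff, (2.0 / n) * X)
```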

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
    Y = Σ_{j=1}^n A_{jπ(j)}, with mean EY = (1/n) Σ_{j,k=1}^n Ajk

Conditional Variance for X = Y − EY
    ∆X(π) = (n/4) · E[(X − X′)² | π]
        = (1/4n) Σ_{j,k=1}^n [A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)}]²
        ≼ (1/n) Σ_{j,k=1}^n [A²_{jπ(j)} + A²_{kπ(k)} + A²_{jπ(k)} + A²_{kπ(j)}]

⇒ Conditional variance controlled when summands are bounded
⇒ Dependent analogues of concentration and moment inequalities

Extensions

Extensions

General Complex Matrices
    Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via dilation:
        D(B) = [ 0  B ; B*  0 ] ∈ H^{d1+d2}.
    Preserves spectral information: λmax(D(B)) = ‖B‖.

Beyond Sums
    Matrix-valued functions satisfying a self-reproducing property
    e.g., matrix second-order Rademacher chaos: Σ_{j,k} εj εk Ajk
    Yields a dependent bounded-differences inequality for matrices.

Generalized Matrix Stein Pairs
    Satisfy E[g(X) − g(X′) | Z] = αX almost surely, for g : R → R weakly increasing.

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Available at http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi:10.1007/s10107-006-0033-0.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007 (electronic).

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 6: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Motivation Matrix Completion

Can Convex Optimization Recover L0

Yes with high probability

Theorem (Recht 2011)

If L0 isin Rmtimesn has rank r and s amp βrn log2(n) entries are observeduniformly at random then (under some technical conditions) convexoptimization recovers L0 exactly with probability at least 1minus nminusβ

See also Gross (2011) Mackey Talwalkar and Jordan (2011)

Past results (Candes and Recht 2009 Candes and Tao 2009) requiredstronger assumptions and more intensive analysis

Streamlined approach reposes on a matrix variant of a classicalBernstein inequality (1946)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 6 35

Motivation Matrix Completion

Scalar Bernstein Inequality

Theorem (Bernstein 1946)

Let (Yk)kge1 be independent random variables in R satisfying

EYk = 0 and |Yk| le R for each index k

Define the variance parameter

σ2 =sum

kEY 2

k

Then for all t ge 0

P

sum

kYk

∣ge t

le 2 middot exp minust22σ2 + 2Rt3

Gaussian decay controlled by variance when t is small

Exponential decay controlled by uniform bound for large t

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 7 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Rmtimesn satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 = max(∥

sum

kEYkY

⊤k

∥∥

sum

kEY ⊤

k Yk

)

Then for all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

See also Tropp (2011) Oliveira (2009) Recht (2011)

Gaussian tail when t is small exponential tail for large t

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 8 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

For all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

Consequences for matrix completion

Recht (2011) showed that uniform sampling of entries capturesmost of the information in incoherent low-rank matrices

Negahban and Wainwright (2010) showed that iid sampling ofentries captures most of the information in non-spiky (near)low-rank matrices

Foygel and Srebro (2011) characterized the generalization errorof convex MC through Rademacher complexity

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 9 35

Motivation Matrix Concentration

Concentration Inequalities

Matrix concentration

Pλmax(X minus EX) ge t le δ

Difficulty Matrix multiplication is not commutative

rArr eX+Y 6= eXeY

Past approaches (Ahlswede and Winter 2002 Oliveira 2009 Tropp 2011)

Rely on deep results from matrix analysis

Apply to sums of independent matrices and matrix martingales

This work

Steinrsquos method of exchangeable pairs (1972) as advanced byChatterjee (2007) for scalar concentrationrArr Improved exponential tail inequalities (Hoeffding Bernstein)rArr Polynomial moment inequalities (Khintchine Rosenthal)rArr Dependent sums and more general matrix functionals

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 10 35

Motivation Matrix Concentration

Roadmap

1 Motivation

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent Sequences

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 11 35

Background

Notation

Hermitian matrices Hd = A isin Cdtimesd A = AlowastAll matrices in this talk are Hermitian

Maximum eigenvalue λmax(middot)Trace trB the sum of the diagonal entries of B

Spectral norm B the maximum singular value of B

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 12 35

Background

Matrix Stein Pair

Definition (Exchangeable Pair)

(ZZ prime) is an exchangeable pair if (ZZ prime)d= (Z prime Z)

Definition (Matrix Stein Pair)

Let (ZZ prime) be an exchangeable pair and let Ψ Z rarr Hd be ameasurable function Define the random matrices

X = Ψ(Z) and X prime = Ψ(Z prime)

(XX prime) is a matrix Stein pair with scale factor α isin (0 1] if

E[X prime |Z] = (1minus α)X

Matrix Stein pairs are exchangeable pairs

Matrix Stein pairs always have zero mean

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 13 35

Background

The Conditional Variance

Definition (Conditional Variance)

Suppose that (XX prime) is a matrix Stein pair with scale factor αconstructed from the exchangeable pair (ZZ prime) The conditional

variance is the random matrix

∆X = ∆X(Z) =1

2αE[

(X minusX prime)2 |Z]

∆X is a stochastic estimate for the variance EX2

Take-home Message

Control over ∆X yields control over λmax(X)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 14 35

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a matrix Stein pair with X isin Hd Suppose that

∆X 4 cX + v I almost surely for c v ge 0

Then for all t ge 0

Pλmax(X) ge t le d middot exp minust22v + 2ct

Control over the conditional variance ∆X yields

Gaussian tail for λmax(X) for small t exponential tail for large t

When d = 1 improves scalar result of Chatterjee (2007)

The dimensional factor d cannot be removed

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 15 35

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let X =sum

k Yk for independent matrices in Hd satisfying

EYk = 0 and Y 2k 4 A2

k

for deterministic matrices (Ak)kge1 Define the variance parameter

σ2 =∥

sum

kA2k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot eminust22σ2

Improves upon the matrix Hoeffding inequality of Tropp (2011)Optimal constant 12 in the exponent

Can replace variance parameter with σ2 = 12

sum

k

(

A2k + EY 2

k

)∥

Tighter than classical Hoeffding inequality (1963) when d = 1

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 16 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

1 Matrix Laplace transform method (Ahlswede amp Winter 2002)

Relate tail probability to the trace of the mgf of X

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ)

where m(θ) = E tr eθX

Problem eX+Y 6= eXeY when XY isin Hd

How to bound the trace mgf

Past approaches Golden-Thompson Liebrsquos concavity theorem

Chatterjeersquos strategy for scalar concentration

Control mgf growth by bounding derivative

mprime(θ) = E trXeθX for θ isin R

Rewrite using exchangeable pairs

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 17 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that (XX prime) is a matrix Stein pair with scale factor α LetF Hd rarr Hd be a measurable function satisfying

E(X minusX prime)F (X) ltinfin

Then

E[X F (X)] =1

2αE[(X minusX prime)(F (X)minus F (X prime))] (1)

Intuition

Can characterize the distribution of a random matrix byintegrating it against a class of test functions F

Eq 1 allows us to estimate this integral using the smoothnessproperties of F and the discrepancy X minusX prime

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 18 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

2 Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf

mprime(θ) = E trXeθX =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

Goal Use the smoothness of F (X) = eθX to bound the derivative

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 19 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey Jordan Chen Farrell and Tropp 2012)

Suppose that g R rarr R is a weakly increasing function and thath R rarr R is a function whose derivative hprime is convex For allmatrices AB isin Hd it holds that

tr[(g(A)minus g(B)) middot (h(A)minus h(B))] le1

2tr[(g(A)minus g(B)) middot (AminusB) middot (hprime(A) + hprime(B))]

Standard matrix functions If g R rarr R and

A = Q

λ1

λd

Qlowast then g(A) = Q

g(λ1)

g(λd)

Qlowast

Inequality does not hold without the traceFor exponential concentration we let g(A) = A and h(B) = eθB

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 20 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

le θ

4αE tr

[

(X minusX prime)2 middot(

eθX + eθXprime)]

2αE tr

[

(X minusX prime)2 middot eθX]

= θ middot E tr

[

1

2αE[

(X minusX prime)2 |Z]

middot eθX]

= θ middot E tr[

∆X eθX]

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 21 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let $p = 1$ or $p \ge 1.5$. Suppose that $(X, X')$ is a matrix Stein pair where $\mathbb{E}\,\mathrm{tr}\,|X|^{2p} < \infty$. Then

$\bigl(\mathbb{E}\,\mathrm{tr}\,|X|^{2p}\bigr)^{1/2p} \le \sqrt{2p-1}\cdot\bigl(\mathbb{E}\,\mathrm{tr}\,\Delta_X^p\bigr)^{1/2p}$

Moral: The conditional variance controls the moments of $X$

Generalizes Chatterjee's version (2007) of the scalar Burkholder-Davis-Gundy inequality (Burkholder, 1973)

See also Pisier & Xu (1997), Junge & Xu (2003, 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite-dimensional Schatten-class operators

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 25 / 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let $(\varepsilon_k)_{k\ge1}$ be an independent sequence of Rademacher random variables and $(A_k)_{k\ge1}$ be a deterministic sequence of Hermitian matrices. Then, if $p = 1$ or $p \ge 1.5$,

$\mathbb{E}\,\mathrm{tr}\bigl(\sum_k \varepsilon_k A_k\bigr)^{2p} \le (2p-1)^p \cdot \mathrm{tr}\bigl(\sum_k A_k^2\bigr)^p$

The noncommutative Khintchine inequality (Lust-Piquard, 1986; Lust-Piquard and Pisier, 1991) is a dominant tool in applied matrix analysis

e.g., used in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007)

Stein's method offers an unusually concise proof

The constant $\sqrt{2p-1}$ is within $\sqrt{e}$ of optimal

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 26 / 35

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration

2. Stein's Method: Background and Notation

3. Exponential Tail Inequalities

4. Polynomial Moment Inequalities

5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums

6. Extensions

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 27 / 35

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-mean Matrices)

Given a sequence of Hermitian matrices $(Y_k)_{k=1}^n$ satisfying the

conditional zero-mean property: $\mathbb{E}[Y_k \mid (Y_j)_{j \ne k}] = 0$

for all $k$, define the random sum $X = \sum_{k=1}^n Y_k$.

Note: $(Y_k)_{k\ge1}$ is a martingale difference sequence.

Examples

Sums of independent centered random matrices

Many sums of conditionally independent random matrices:

$Y_k \perp\!\!\!\perp (Y_j)_{j \ne k} \mid Z$ and $\mathbb{E}[Y_k \mid Z] = 0$

Rademacher series with random matrix coefficients:

$X = \sum_k \varepsilon_k W_k$

with $(W_k)_{k\ge1}$ Hermitian and $(\varepsilon_k)_{k\ge1}$ independent Rademacher

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 28 / 35

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-mean Property)

$\mathbb{E}[Y_k \mid (Y_j)_{j \ne k}] = 0$

Matrix Stein Pair for $X = \sum_{k=1}^n Y_k$

Let $Y'_k$ and $Y_k$ be conditionally i.i.d. given $(Y_j)_{j \ne k}$.

Draw an index $K$ uniformly from $\{1, \dots, n\}$.

Define $X' = X + Y'_K - Y_K$.

Check the Stein pair condition:

$\mathbb{E}[X - X' \mid (Y_j)_{j\ge1}] = \mathbb{E}[Y_K - Y'_K \mid (Y_j)_{j\ge1}]$
$= \frac{1}{n}\sum_{k=1}^n \bigl(Y_k - \mathbb{E}[Y'_k \mid (Y_j)_{j \ne k}]\bigr)$
$= \frac{1}{n}\sum_{k=1}^n Y_k = \frac{1}{n}X$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 29 / 35
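A numpy illustration of this construction (not from the talk; it specializes to independent sign-series summands $Y_k = \varepsilon_k A_k$, an assumption made only for the demo). It averages $X - X'$ exactly over the index $K$ and the resampled sign, confirming the Stein pair condition $\mathbb{E}[X - X' \mid (Y_j)] = X/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 5

# Independent centered Hermitian summands Y_k = eps_k * A_k: a special
# case of the conditional zero-mean property E[Y_k | (Y_j)_{j != k}] = 0.
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
eps = rng.choice([-1.0, 1.0], size=n)
Y = eps[:, None, None] * A
X = Y.sum(axis=0)

# Exchangeable counterpart X' = X + Y'_K - Y_K, with K uniform on
# {1,...,n} and Y'_K a conditionally i.i.d. copy of Y_K.  Average
# X - X' exactly over K and over the sign of Y'_K.
acc = np.zeros((d, d))
for K in range(n):
    for s in (-1.0, 1.0):
        Xp = X + s * A[K] - Y[K]
        acc += X - Xp
acc /= 2 * n

# Stein pair condition with scale factor alpha = 1/n: E[X - X' | Y] = X / n
err = np.abs(acc - X / n).max()
print(err)
```

Since the conditional expectation is computed by exact enumeration rather than sampling, the error is floating-point noise.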

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-mean Property)

$\mathbb{E}[Y_k \mid (Y_j)_{j \ne k}] = 0$

Conditional Variance for $X = \sum_{k=1}^n Y_k$

$\Delta_X = \frac{n}{2}\cdot\mathbb{E}\bigl[(X - X')^2 \mid (Y_j)_{j\ge1}\bigr]$
$= \frac{n}{2}\cdot\mathbb{E}\bigl[(Y_K - Y'_K)^2 \mid (Y_j)_{j\ge1}\bigr]$
$= \frac{1}{2}\sum_{k=1}^n \bigl(Y_k^2 + \mathbb{E}[Y_k^2 \mid (Y_j)_{j \ne k}]\bigr)$

⇒ The conditional variance is controlled when the summands are bounded

⇒ Dependent analogues of the concentration and moment inequalities

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 30 / 35

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array $(A_{jk})_{j,k=1}^n$ of Hermitian matrices and a uniformly random permutation $\pi$ on $\{1, \dots, n\}$, define the combinatorial matrix statistic

$Y = \sum_{j=1}^n A_{j\pi(j)}$ with mean $\mathbb{E} Y = \frac{1}{n}\sum_{j,k=1}^n A_{jk}$

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from $\{B_1, \dots, B_n\}$: $W = \sum_{j=1}^s B_{\pi(j)}$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 31 / 35

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

$Y = \sum_{j=1}^n A_{j\pi(j)}$ with mean $\mathbb{E} Y = \frac{1}{n}\sum_{j,k=1}^n A_{jk}$

Matrix Stein Pair for $X = Y - \mathbb{E} Y$

Draw indices $(J, K)$ uniformly from $\{1, \dots, n\}^2$.

Define $\pi' = \pi \circ (J\ K)$ and $X' = \sum_{j=1}^n A_{j\pi'(j)} - \mathbb{E} Y$.

Check the Stein pair condition:

$\mathbb{E}[X - X' \mid \pi] = \mathbb{E}\bigl[A_{J\pi(J)} + A_{K\pi(K)} - A_{J\pi(K)} - A_{K\pi(J)} \mid \pi\bigr]$
$= \frac{1}{n^2}\sum_{j,k=1}^n \bigl(A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)}\bigr)$
$= \frac{2}{n}(Y - \mathbb{E} Y) = \frac{2}{n}X$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 32 / 35
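The transposition construction above can be checked numerically. The sketch below (not from the talk; the dimensions and the random array are arbitrary demo choices) averages $X - X'$ exactly over all $n^2$ index pairs $(J, K)$ and compares against $(2/n)X$:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
d, n = 2, 4

# Deterministic array of Hermitian matrices A_jk and one fixed permutation pi
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2
EY = A.sum(axis=(0, 1)) / n                    # E Y = (1/n) sum_{j,k} A_jk
pi = rng.permutation(n)
X = sum(A[j, pi[j]] for j in range(n)) - EY

# pi' = pi composed with the transposition (J K); average X - X'
# exactly over all n^2 index pairs
acc = np.zeros((d, d))
for J, K in product(range(n), repeat=2):
    pip = pi.copy()
    pip[J], pip[K] = pip[K], pip[J]
    Xp = sum(A[j, pip[j]] for j in range(n)) - EY
    acc += X - Xp
acc /= n ** 2

# Stein pair condition with scale factor alpha = 2/n: E[X - X' | pi] = (2/n) X
max_err = np.abs(acc - 2 * X / n).max()
print(max_err)
```

The identity is exact, so the discrepancy is floating-point noise regardless of the realization of $\pi$.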

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

$Y = \sum_{j=1}^n A_{j\pi(j)}$ with mean $\mathbb{E} Y = \frac{1}{n}\sum_{j,k=1}^n A_{jk}$

Conditional Variance for $X = Y - \mathbb{E} Y$

$\Delta_X(\pi) = \frac{n}{4}\,\mathbb{E}\bigl[(X - X')^2 \mid \pi\bigr]$
$= \frac{1}{4n}\sum_{j,k=1}^n \bigl[A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)}\bigr]^2$
$\preccurlyeq \frac{1}{n}\sum_{j,k=1}^n \bigl[A_{j\pi(j)}^2 + A_{k\pi(k)}^2 + A_{j\pi(k)}^2 + A_{k\pi(j)}^2\bigr]$

⇒ The conditional variance is controlled when the summands are bounded

⇒ Dependent analogues of the concentration and moment inequalities

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 33 / 35

Extensions

Extensions

General Complex Matrices

Map any matrix $B \in \mathbb{C}^{d_1 \times d_2}$ to a Hermitian matrix via dilation:

$\mathcal{D}(B) = \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix} \in \mathbb{H}^{d_1 + d_2}$

Preserves spectral information: $\lambda_{\max}(\mathcal{D}(B)) = \|B\|$

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

e.g., matrix second-order Rademacher chaos $\sum_{j,k} \varepsilon_j \varepsilon_k A_{jk}$

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy $\mathbb{E}[g(X) - g(X') \mid Z] = \alpha X$ almost surely for $g: \mathbb{R} \to \mathbb{R}$ weakly increasing

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 34 / 35
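The dilation identity $\lambda_{\max}(\mathcal{D}(B)) = \|B\|$ is easy to confirm numerically. This numpy sketch (not from the talk; the dimensions and the complex Gaussian $B$ are arbitrary) builds the dilation and compares its top eigenvalue with the spectral norm:

```python
import numpy as np

rng = np.random.default_rng(4)
d1, d2 = 3, 5
B = rng.standard_normal((d1, d2)) + 1j * rng.standard_normal((d1, d2))

# Hermitian dilation D(B) = [[0, B], [B*, 0]] in H^(d1 + d2)
D = np.block([[np.zeros((d1, d1)), B],
              [B.conj().T, np.zeros((d2, d2))]])

lam_max = np.linalg.eigvalsh(D).max()   # maximum eigenvalue of the dilation
spec_norm = np.linalg.norm(B, 2)        # maximum singular value of B
print(lam_max, spec_norm)
```

The agreement is exact in theory: the eigenvalues of the dilation are plus/minus the singular values of $B$ (padded with zeros), so the largest eigenvalue is the largest singular value.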

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 35 / 35

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Available at http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi:10.1007/s10107-006-0033-0.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007.

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 36 / 35


Motivation Matrix Completion

Scalar Bernstein Inequality

Theorem (Bernstein, 1946)

Let $(Y_k)_{k\ge1}$ be independent random variables in $\mathbb{R}$ satisfying

$\mathbb{E} Y_k = 0$ and $|Y_k| \le R$ for each index $k$.

Define the variance parameter

$\sigma^2 = \sum_k \mathbb{E} Y_k^2$.

Then, for all $t \ge 0$,

$\mathbb{P}\bigl\{\bigl|\sum_k Y_k\bigr| \ge t\bigr\} \le 2 \cdot \exp\{-t^2/(2\sigma^2 + 2Rt/3)\}$

Gaussian decay controlled by the variance when $t$ is small

Exponential decay controlled by the uniform bound for large $t$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 7 / 35
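A quick numerical illustration (not from the talk; the uniform summands, sample sizes, and threshold $t = 3\sigma$ are arbitrary choices for the demo) comparing the empirical tail of a sum of bounded random variables with the Bernstein bound:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 50, 100_000
R = 1.0

# Independent, centered, bounded summands: Y_k uniform on [-1, 1],
# so |Y_k| <= R = 1 and E Y_k^2 = 1/3.
sigma2 = n / 3.0                        # sum_k E Y_k^2

t = 3.0 * np.sqrt(sigma2)
bound = 2.0 * np.exp(-t ** 2 / (2 * sigma2 + 2 * R * t / 3))

S = rng.uniform(-1.0, 1.0, size=(trials, n)).sum(axis=1)
empirical = np.mean(np.abs(S) >= t)
print(empirical, "<=", bound)
```

The empirical frequency sits below the bound, as the theorem guarantees; the gap reflects the slack in the inequality at this threshold.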

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let $(Y_k)_{k\ge1}$ be independent matrices in $\mathbb{R}^{m \times n}$ satisfying

$\mathbb{E} Y_k = 0$ and $\|Y_k\| \le R$ for each index $k$.

Define the variance parameter

$\sigma^2 = \max\bigl(\bigl\|\sum_k \mathbb{E} Y_k Y_k^\top\bigr\|, \bigl\|\sum_k \mathbb{E} Y_k^\top Y_k\bigr\|\bigr)$.

Then, for all $t \ge 0$,

$\mathbb{P}\bigl\{\bigl\|\sum_k Y_k\bigr\| \ge t\bigr\} \le (m + n) \cdot \exp\{-t^2/(3\sigma^2 + 2Rt)\}$

See also Tropp (2011), Oliveira (2009), Recht (2011)

Gaussian tail when $t$ is small, exponential tail for large $t$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 8 / 35

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

For all $t \ge 0$,

$\mathbb{P}\bigl\{\bigl\|\sum_k Y_k\bigr\| \ge t\bigr\} \le (m + n) \cdot \exp\{-t^2/(3\sigma^2 + 2Rt)\}$

Consequences for matrix completion

Recht (2011) showed that uniform sampling of entries captures most of the information in incoherent low-rank matrices

Negahban and Wainwright (2010) showed that i.i.d. sampling of entries captures most of the information in non-spiky (near) low-rank matrices

Foygel and Srebro (2011) characterized the generalization error of convex MC through Rademacher complexity

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 9 / 35

Motivation Matrix Concentration

Concentration Inequalities

Matrix concentration

$\mathbb{P}\{\lambda_{\max}(X - \mathbb{E} X) \ge t\} \le \delta$

Difficulty: Matrix multiplication is not commutative

$\Rightarrow e^{X+Y} \ne e^X e^Y$

Past approaches (Ahlswede and Winter, 2002; Oliveira, 2009; Tropp, 2011)

Rely on deep results from matrix analysis

Apply to sums of independent matrices and matrix martingales

This work

Stein's method of exchangeable pairs (1972), as advanced by Chatterjee (2007) for scalar concentration

⇒ Improved exponential tail inequalities (Hoeffding, Bernstein)

⇒ Polynomial moment inequalities (Khintchine, Rosenthal)

⇒ Dependent sums and more general matrix functionals

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 10 / 35

Motivation Matrix Concentration

Roadmap

1 Motivation

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent Sequences

6 Extensions

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 11 / 35

Background

Notation

Hermitian matrices: $\mathbb{H}^d = \{A \in \mathbb{C}^{d \times d} : A = A^*\}$. All matrices in this talk are Hermitian.

Maximum eigenvalue: $\lambda_{\max}(\cdot)$

Trace: $\mathrm{tr}\,B$, the sum of the diagonal entries of $B$

Spectral norm: $\|B\|$, the maximum singular value of $B$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 12 / 35
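For readers following along in code, the notation maps directly onto numpy. The 2x2 matrix below is a hypothetical example, not taken from the talk:

```python
import numpy as np

B = np.array([[2.0, 1.0],
              [1.0, -1.0]])                 # Hermitian: B = B^*

lam_max = np.linalg.eigvalsh(B).max()       # maximum eigenvalue lambda_max(B)
trace = float(np.trace(B))                  # tr B, sum of the diagonal entries
spec_norm = np.linalg.norm(B, 2)            # ||B||, the maximum singular value
print(lam_max, trace, spec_norm)
```

Note that for Hermitian matrices the singular values are the absolute eigenvalues, so the spectral norm equals $\max(|\lambda_{\max}|, |\lambda_{\min}|)$, which here coincides with $\lambda_{\max}$.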

Background

Matrix Stein Pair

Definition (Exchangeable Pair)

$(Z, Z')$ is an exchangeable pair if $(Z, Z') \overset{d}{=} (Z', Z)$.

Definition (Matrix Stein Pair)

Let $(Z, Z')$ be an exchangeable pair, and let $\Psi: \mathcal{Z} \to \mathbb{H}^d$ be a measurable function. Define the random matrices

$X = \Psi(Z)$ and $X' = \Psi(Z')$.

$(X, X')$ is a matrix Stein pair with scale factor $\alpha \in (0, 1]$ if

$\mathbb{E}[X' \mid Z] = (1 - \alpha)X$.

Matrix Stein pairs are exchangeable pairs

Matrix Stein pairs always have zero mean

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 13 / 35

Background

The Conditional Variance

Definition (Conditional Variance)

Suppose that $(X, X')$ is a matrix Stein pair with scale factor $\alpha$, constructed from the exchangeable pair $(Z, Z')$. The conditional variance is the random matrix

$\Delta_X = \Delta_X(Z) = \frac{1}{2\alpha}\,\mathbb{E}\bigl[(X - X')^2 \mid Z\bigr]$.

$\Delta_X$ is a stochastic estimate for the variance $\mathbb{E} X^2$

Take-home Message

Control over $\Delta_X$ yields control over $\lambda_{\max}(X)$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 14 / 35

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let $(X, X')$ be a matrix Stein pair with $X \in \mathbb{H}^d$. Suppose that

$\Delta_X \preccurlyeq cX + v\,I$ almost surely for $c, v \ge 0$.

Then, for all $t \ge 0$,

$\mathbb{P}\{\lambda_{\max}(X) \ge t\} \le d \cdot \exp\{-t^2/(2v + 2ct)\}$

Control over the conditional variance $\Delta_X$ yields:

Gaussian tail for $\lambda_{\max}(X)$ for small $t$, exponential tail for large $t$

When $d = 1$, improves the scalar result of Chatterjee (2007)

The dimensional factor $d$ cannot be removed

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 15 / 35

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let $X = \sum_k Y_k$ for independent matrices in $\mathbb{H}^d$ satisfying

$\mathbb{E} Y_k = 0$ and $Y_k^2 \preccurlyeq A_k^2$

for deterministic matrices $(A_k)_{k\ge1}$. Define the variance parameter

$\sigma^2 = \bigl\|\sum_k A_k^2\bigr\|$.

Then, for all $t \ge 0$,

$\mathbb{P}\bigl\{\lambda_{\max}\bigl(\sum_k Y_k\bigr) \ge t\bigr\} \le d \cdot e^{-t^2/2\sigma^2}$

Improves upon the matrix Hoeffding inequality of Tropp (2011); optimal constant $1/2$ in the exponent

Can replace the variance parameter with $\sigma^2 = \frac{1}{2}\bigl\|\sum_k \bigl(A_k^2 + \mathbb{E} Y_k^2\bigr)\bigr\|$

Tighter than the classical Hoeffding inequality (1963) when $d = 1$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 16 / 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

1. Matrix Laplace transform method (Ahlswede & Winter, 2002)

Relate the tail probability to the trace of the mgf of $X$:

$\mathbb{P}\{\lambda_{\max}(X) \ge t\} \le \inf_{\theta > 0} e^{-\theta t} \cdot m(\theta)$, where $m(\theta) := \mathbb{E}\,\mathrm{tr}\, e^{\theta X}$

Problem: $e^{X+Y} \ne e^X e^Y$ when $X, Y \in \mathbb{H}^d$.

How to bound the trace mgf?

Past approaches: Golden-Thompson, Lieb's concavity theorem

Chatterjee's strategy for scalar concentration:

Control mgf growth by bounding the derivative

$m'(\theta) = \mathbb{E}\,\mathrm{tr}\,X e^{\theta X}$ for $\theta \in \mathbb{R}$

Rewrite using exchangeable pairs

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 17 / 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that $(X, X')$ is a matrix Stein pair with scale factor $\alpha$. Let $F: \mathbb{H}^d \to \mathbb{H}^d$ be a measurable function satisfying

$\mathbb{E}\bigl\|(X - X')\,F(X)\bigr\| < \infty$.

Then

$\mathbb{E}[X\,F(X)] = \frac{1}{2\alpha}\,\mathbb{E}\bigl[(X - X')\bigl(F(X) - F(X')\bigr)\bigr]$ (1)

Intuition

Can characterize the distribution of a random matrix by integrating it against a class of test functions $F$

Eq. (1) allows us to estimate this integral using the smoothness properties of $F$ and the discrepancy $X - X'$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 18 / 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

2. Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf:

$m'(\theta) = \mathbb{E}\,\mathrm{tr}\,X e^{\theta X} = \frac{1}{2\alpha}\,\mathbb{E}\,\mathrm{tr}\bigl[(X - X')\bigl(e^{\theta X} - e^{\theta X'}\bigr)\bigr]$

Goal: Use the smoothness of $F(X) = e^{\theta X}$ to bound the derivative

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 19 / 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Suppose that $g: \mathbb{R} \to \mathbb{R}$ is a weakly increasing function and that $h: \mathbb{R} \to \mathbb{R}$ is a function whose derivative $h'$ is convex. For all matrices $A, B \in \mathbb{H}^d$, it holds that

$\mathrm{tr}\bigl[(g(A) - g(B)) \cdot (h(A) - h(B))\bigr] \le \frac{1}{2}\,\mathrm{tr}\bigl[(g(A) - g(B)) \cdot (A - B) \cdot (h'(A) + h'(B))\bigr]$

Standard matrix functions: if $g: \mathbb{R} \to \mathbb{R}$ and $A = Q\,\mathrm{diag}(\lambda_1, \dots, \lambda_d)\,Q^*$, then $g(A) = Q\,\mathrm{diag}(g(\lambda_1), \dots, g(\lambda_d))\,Q^*$

The inequality does not hold without the trace

For exponential concentration, we let $g(A) = A$ and $h(B) = e^{\theta B}$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 20 / 35
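The lemma can be probed numerically with the choices used later in the proof, $g(A) = A$ and $h(B) = e^{\theta B}$ (so $h'(B) = \theta e^{\theta B}$). The sketch below is not from the talk; the dimension, $\theta$, and the random Hermitian draws are arbitrary demo choices:

```python
import numpy as np

def herm_exp(M, theta):
    # exp(theta * M) for a Hermitian matrix M, via eigendecomposition
    w, Q = np.linalg.eigh(M)
    return (Q * np.exp(theta * w)) @ Q.conj().T

rng = np.random.default_rng(6)
d, theta, trials = 4, 0.7, 200

# With g(A) = A and h(B) = e^{theta B}, the lemma reads
# tr[(A-B)(e^{tA} - e^{tB})] <= (theta/2) tr[(A-B)^2 (e^{tA} + e^{tB})]
worst = -np.inf
for _ in range(trials):
    M = rng.standard_normal((d, d)); A = (M + M.T) / 2
    M = rng.standard_normal((d, d)); B = (M + M.T) / 2
    eA, eB = herm_exp(A, theta), herm_exp(B, theta)
    lhs = np.trace((A - B) @ (eA - eB))
    rhs = (theta / 2) * np.trace((A - B) @ (A - B) @ (eA + eB))
    worst = max(worst, lhs - rhs)
print(worst)
```

Every random draw should satisfy the inequality, so the worst observed gap stays nonpositive up to floating-point error.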

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

$m'(\theta) = \frac{1}{2\alpha}\,\mathbb{E}\,\mathrm{tr}\bigl[(X - X')\bigl(e^{\theta X} - e^{\theta X'}\bigr)\bigr]$
$\le \frac{\theta}{4\alpha}\,\mathbb{E}\,\mathrm{tr}\bigl[(X - X')^2 \cdot \bigl(e^{\theta X} + e^{\theta X'}\bigr)\bigr]$
$= \frac{\theta}{2\alpha}\,\mathbb{E}\,\mathrm{tr}\bigl[(X - X')^2 \cdot e^{\theta X}\bigr]$
$= \theta \cdot \mathbb{E}\,\mathrm{tr}\Bigl[\frac{1}{2\alpha}\,\mathbb{E}\bigl[(X - X')^2 \mid Z\bigr] \cdot e^{\theta X}\Bigr]$
$= \theta \cdot \mathbb{E}\,\mathrm{tr}\bigl[\Delta_X\, e^{\theta X}\bigr]$

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 21 / 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 8: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Matrix Bernstein Inequality

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (Y_k)_{k≥1} be independent matrices in R^{m×n} satisfying

    E Y_k = 0 and ‖Y_k‖ ≤ R for each index k.

Define the variance parameter

    σ² = max{ ‖Σ_k E[Y_k Y_kᵀ]‖, ‖Σ_k E[Y_kᵀ Y_k]‖ }.

Then, for all t ≥ 0,

    P{ ‖Σ_k Y_k‖ ≥ t } ≤ (m + n) · exp{ −t² / (3σ² + 2Rt) }.

See also Tropp (2011), Oliveira (2009), Recht (2011).
Gaussian tail when t is small; exponential tail for large t.

Mackey (Stanford) · Stein's Method for Matrix Concentration · December 10, 2012

Matrix Bernstein Inequality

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
For all t ≥ 0,

    P{ ‖Σ_k Y_k‖ ≥ t } ≤ (m + n) · exp{ −t² / (3σ² + 2Rt) }.

Consequences for matrix completion:
- Recht (2011) showed that uniform sampling of entries captures most of the information in incoherent low-rank matrices.
- Negahban and Wainwright (2010) showed that i.i.d. sampling of entries captures most of the information in non-spiky (near) low-rank matrices.
- Foygel and Srebro (2011) characterized the generalization error of convex MC through Rademacher complexity.
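The Bernstein bound can be exercised numerically. The sketch below is illustrative and not from the talk: it builds bounded, zero-mean rectangular summands Y_k = ε_k B_k (the matrices B_k, the dimensions, and the level t are my own choices), computes σ² and R as defined above, and compares a Monte Carlo estimate of the tail probability with the bound.

```python
import numpy as np

# Illustrative check of the matrix Bernstein tail bound (a sketch; all
# dimensions, matrices, and the level t are assumptions, not from the talk).
rng = np.random.default_rng(0)
m, n, K = 3, 4, 50

# Independent, zero-mean, bounded summands: Y_k = eps_k * B_k with eps_k = +/-1.
B = rng.standard_normal((K, m, n)) / np.sqrt(K)
R = max(np.linalg.norm(Bk, 2) for Bk in B)      # ||Y_k|| = ||B_k||

# sigma^2 = max(||sum_k E[Y_k Y_k^T]||, ||sum_k E[Y_k^T Y_k]||);
# for Y_k = eps_k B_k, E[Y_k Y_k^T] = B_k B_k^T.
row = sum(Bk @ Bk.T for Bk in B)
col = sum(Bk.T @ Bk for Bk in B)
sigma2 = max(np.linalg.norm(row, 2), np.linalg.norm(col, 2))

t = 8.0
bound = (m + n) * np.exp(-t**2 / (3 * sigma2 + 2 * R * t))

# Monte Carlo estimate of P(||sum_k Y_k|| >= t).
trials = 2000
hits = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=K)
    S = np.tensordot(eps, B, axes=1)            # sum_k eps_k B_k
    hits += np.linalg.norm(S, 2) >= t
freq = hits / trials
assert freq <= bound                            # empirical tail below the bound
```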

Concentration Inequalities

Matrix concentration:

    P{ λ_max(X − EX) ≥ t } ≤ δ

Difficulty: matrix multiplication is not commutative, so e^{X+Y} ≠ e^X e^Y.

Past approaches (Ahlswede and Winter, 2002; Oliveira, 2009; Tropp, 2011):
- Rely on deep results from matrix analysis
- Apply to sums of independent matrices and matrix martingales

This work: Stein's method of exchangeable pairs (1972), as advanced by Chatterjee (2007) for scalar concentration
⇒ Improved exponential tail inequalities (Hoeffding, Bernstein)
⇒ Polynomial moment inequalities (Khintchine, Rosenthal)
⇒ Dependent sums and more general matrix functionals

Roadmap

1. Motivation
2. Stein's Method: Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences
6. Extensions

Notation

- Hermitian matrices: H^d = { A ∈ C^{d×d} : A = A* }. All matrices in this talk are Hermitian.
- Maximum eigenvalue: λ_max(·)
- Trace: tr B, the sum of the diagonal entries of B
- Spectral norm: ‖B‖, the maximum singular value of B
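The three quantities above are easy to compute numerically; this small sketch (the example matrix is my own) also checks that, for a Hermitian matrix, the spectral norm equals the largest eigenvalue in absolute value.

```python
import numpy as np

# Illustrative computations of lambda_max, trace, and spectral norm.
rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4))
A = (G + G.T) / 2                      # real symmetric (hence Hermitian)

lam_max = np.linalg.eigvalsh(A)[-1]    # maximum eigenvalue (eigvalsh is ascending)
trace = np.trace(A)                    # sum of the diagonal entries
spec_norm = np.linalg.norm(A, 2)       # maximum singular value

# For Hermitian A, ||A|| is the largest |eigenvalue|.
assert np.isclose(spec_norm, max(abs(np.linalg.eigvalsh(A)[0]), lam_max))
```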

Matrix Stein Pair

Definition (Exchangeable Pair)
(Z, Z′) is an exchangeable pair if (Z, Z′) =_d (Z′, Z).

Definition (Matrix Stein Pair)
Let (Z, Z′) be an exchangeable pair, and let Ψ : Z → H^d be a measurable function. Define the random matrices

    X = Ψ(Z) and X′ = Ψ(Z′).

(X, X′) is a matrix Stein pair with scale factor α ∈ (0, 1] if

    E[X′ | Z] = (1 − α) X.

- Matrix Stein pairs are exchangeable pairs
- Matrix Stein pairs always have zero mean
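A concrete Stein pair can be checked by hand. In this sketch (the matrices A_k and the signs are my own illustrative choices), X = Σ_k ε_k A_k is a matrix Rademacher series, and Z′ resamples one uniformly chosen sign; the scale factor is α = 1/n, and the conditional expectation is computed exactly by enumeration.

```python
import numpy as np

# Verify E[X' | Z] = (1 - alpha) X with alpha = 1/n for a Rademacher series.
rng = np.random.default_rng(2)
n, d = 5, 3
A = [(lambda G: (G + G.T) / 2)(rng.standard_normal((d, d))) for _ in range(n)]
eps = rng.choice([-1.0, 1.0], size=n)

X = sum(e * Ak for e, Ak in zip(eps, A))

# Exact conditional expectation: average over the uniform index K and the
# resampled sign eps'_K in {-1, +1}.
EXp = np.zeros((d, d))
for K in range(n):
    for new_sign in (-1.0, 1.0):
        Xp = X - eps[K] * A[K] + new_sign * A[K]
        EXp += Xp / (2 * n)

alpha = 1.0 / n
assert np.allclose(EXp, (1 - alpha) * X)
```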

The Conditional Variance

Definition (Conditional Variance)
Suppose that (X, X′) is a matrix Stein pair with scale factor α, constructed from the exchangeable pair (Z, Z′). The conditional variance is the random matrix

    Δ_X = Δ_X(Z) = (1 / (2α)) · E[ (X − X′)² | Z ].

- Δ_X is a stochastic estimate for the variance, E X²

Take-home Message: control over Δ_X yields control over λ_max(X).
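For the Rademacher-series Stein pair (α = 1/n), the conditional variance can be computed exactly and works out to Σ_k A_k², independent of the signs. This sketch (matrices illustrative) verifies that identity by enumeration.

```python
import numpy as np

# Delta_X = (1/(2 alpha)) E[(X - X')^2 | Z] for the Rademacher-series pair.
rng = np.random.default_rng(3)
n, d = 5, 3
A = [(lambda G: (G + G.T) / 2)(rng.standard_normal((d, d))) for _ in range(n)]
eps = rng.choice([-1.0, 1.0], size=n)

# X - X' = (eps_K - eps'_K) A_K; average (X - X')^2 exactly over K and eps'_K.
E_sq = np.zeros((d, d))
for K in range(n):
    for new_sign in (-1.0, 1.0):
        D = (eps[K] - new_sign) * A[K]
        E_sq += (D @ D) / (2 * n)

alpha = 1.0 / n
Delta_X = E_sq / (2 * alpha)
assert np.allclose(Delta_X, sum(Ak @ Ak for Ak in A))
```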

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (X, X′) be a matrix Stein pair with X ∈ H^d. Suppose that

    Δ_X ≼ cX + vI almost surely, for c, v ≥ 0.

Then, for all t ≥ 0,

    P{ λ_max(X) ≥ t } ≤ d · exp{ −t² / (2v + 2ct) }.

- Control over the conditional variance Δ_X yields a Gaussian tail for λ_max(X) for small t, an exponential tail for large t
- When d = 1, improves the scalar result of Chatterjee (2007)
- The dimensional factor d cannot be removed

Matrix Hoeffding Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let X = Σ_k Y_k for independent matrices in H^d satisfying

    E Y_k = 0 and Y_k² ≼ A_k²

for deterministic matrices (A_k)_{k≥1}. Define the variance parameter

    σ² = ‖ Σ_k A_k² ‖.

Then, for all t ≥ 0,

    P{ λ_max(Σ_k Y_k) ≥ t } ≤ d · e^{−t² / 2σ²}.

- Improves upon the matrix Hoeffding inequality of Tropp (2011)
- Optimal constant 1/2 in the exponent
- Can replace the variance parameter with σ² = (1/2) ‖ Σ_k (A_k² + E Y_k²) ‖
- Tighter than the classical Hoeffding inequality (1963) when d = 1
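A quick Monte Carlo sanity check of the Hoeffding bound, sketched for a Rademacher series X = Σ_k ε_k A_k, where Y_k² = A_k² holds with equality. The matrices A_k, the trial count, and the level t are my own illustrative choices.

```python
import numpy as np

# Illustrative Monte Carlo check of the matrix Hoeffding tail bound.
rng = np.random.default_rng(10)
n, d = 10, 3
A = [(lambda G: (G + G.T) / 2)(rng.standard_normal((d, d))) for _ in range(n)]

sigma2 = np.linalg.norm(sum(Ak @ Ak for Ak in A), 2)
t = 4.0 * np.sqrt(sigma2)                  # a level 4 sigma out in the tail
bound = d * np.exp(-t**2 / (2 * sigma2))   # = d * exp(-8)

trials = 3000
hits = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    X = sum(e * Ak for e, Ak in zip(eps, A))
    hits += np.linalg.eigvalsh(X)[-1] >= t
freq = hits / trials
assert freq <= bound + 0.01                # small slack for Monte Carlo noise
```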

Exponential Concentration: Proof Sketch

1. Matrix Laplace transform method (Ahlswede & Winter, 2002)
Relate the tail probability to the trace of the mgf of X:

    P{ λ_max(X) ≥ t } ≤ inf_{θ>0} e^{−θt} · m(θ), where m(θ) = E tr e^{θX}.

Problem: e^{X+Y} ≠ e^X e^Y when X, Y ∈ H^d. How to bound the trace mgf?
- Past approaches: Golden–Thompson, Lieb's concavity theorem
- Chatterjee's strategy for scalar concentration: control mgf growth by bounding the derivative

    m′(θ) = E tr[ X e^{θX} ] for θ ∈ R,

  rewritten using exchangeable pairs.

Method of Exchangeable Pairs

Lemma
Suppose that (X, X′) is a matrix Stein pair with scale factor α. Let F : H^d → H^d be a measurable function satisfying E ‖(X − X′) F(X)‖ < ∞. Then

    E[ X F(X) ] = (1 / (2α)) · E[ (X − X′)(F(X) − F(X′)) ].   (1)

Intuition:
- Can characterize the distribution of a random matrix by integrating it against a class of test functions F
- Eq. (1) allows us to estimate this integral using the smoothness properties of F and the discrepancy X − X′
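The identity (1) can be checked exactly by enumeration for a small Stein pair. In this sketch (matrices and the test function F(X) = X² are my own illustrative choices), the pair is the Rademacher series with α = 1/n, and both sides are averaged over every sign pattern and every resampled coordinate.

```python
import itertools
import numpy as np

# Exact check of E[X F(X)] = (1/(2 alpha)) E[(X - X')(F(X) - F(X'))].
rng = np.random.default_rng(4)
n, d = 4, 2
A = [(lambda G: (G + G.T) / 2)(rng.standard_normal((d, d))) for _ in range(n)]
F = lambda M: M @ M                     # an illustrative test function
alpha = 1.0 / n

lhs = np.zeros((d, d))
rhs = np.zeros((d, d))
count = 0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    X = sum(e * Ak for e, Ak in zip(signs, A))
    lhs += X @ F(X)
    for K in range(n):                  # uniform index K, resampled sign s
        for s in (-1.0, 1.0):
            Xp = X - signs[K] * A[K] + s * A[K]
            rhs += (X - Xp) @ (F(X) - F(Xp)) / (2 * n * 2 * alpha)
    count += 1
lhs /= count
rhs /= count
assert np.allclose(lhs, rhs)
```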

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs
Rewrite the derivative of the trace mgf:

    m′(θ) = E tr[ X e^{θX} ] = (1 / (2α)) · E tr[ (X − X′)(e^{θX} − e^{θX′}) ].

Goal: use the smoothness of F(X) = e^{θX} to bound the derivative.

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Suppose that g : R → R is a weakly increasing function and that h : R → R is a function whose derivative h′ is convex. For all matrices A, B ∈ H^d, it holds that

    tr[ (g(A) − g(B)) · (h(A) − h(B)) ] ≤ (1/2) tr[ (g(A) − g(B)) · (A − B) · (h′(A) + h′(B)) ].

Standard matrix functions: if g : R → R and A = Q diag(λ₁, …, λ_d) Q*, then g(A) = Q diag(g(λ₁), …, g(λ_d)) Q*.
- The inequality does not hold without the trace
- For exponential concentration, we let g(A) = A and h(B) = e^{θB}
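The trace inequality is easy to sanity-check numerically in the case used for exponential concentration, g(x) = x and h(x) = eˣ (so h′ = exp as well). The sketch below (random matrices are my own choices) applies a standard matrix function via the eigendecomposition and tests the inequality on random Hermitian pairs.

```python
import numpy as np

# Numeric sanity check of the mean value trace inequality with g = id, h = exp.
rng = np.random.default_rng(5)

def matfunc(f, M):
    # Standard matrix function via the eigendecomposition M = Q diag(lam) Q*.
    lam, Q = np.linalg.eigh(M)
    return (Q * f(lam)) @ Q.T

d = 3
for _ in range(100):
    A = (lambda G: (G + G.T) / 2)(rng.standard_normal((d, d)))
    B = (lambda G: (G + G.T) / 2)(rng.standard_normal((d, d)))
    gA, gB = A, B                          # g = identity (weakly increasing)
    hA, hB = matfunc(np.exp, A), matfunc(np.exp, B)
    lhs = np.trace((gA - gB) @ (hA - hB))
    rhs = 0.5 * np.trace((gA - gB) @ (A - B) @ (hA + hB))  # h' = exp
    assert lhs <= rhs + 1e-10
```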

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality
Bound the derivative of the trace mgf:

    m′(θ) = (1 / (2α)) · E tr[ (X − X′)(e^{θX} − e^{θX′}) ]
          ≤ (θ / (4α)) · E tr[ (X − X′)² · (e^{θX} + e^{θX′}) ]
          = (θ / (2α)) · E tr[ (X − X′)² · e^{θX} ]
          = θ · E tr[ (1 / (2α)) · E[ (X − X′)² | Z ] · e^{θX} ]
          = θ · E tr[ Δ_X e^{θX} ].

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality
Bound the derivative of the trace mgf:

    m′(θ) ≤ θ · E tr[ Δ_X e^{θX} ].

4. Conditional Variance Bound: Δ_X ≼ cX + vI
Yields the differential inequality

    m′(θ) ≤ cθ · E tr[ X e^{θX} ] + vθ · E tr[ e^{θX} ] = cθ · m′(θ) + vθ · m(θ).

Solve to bound m(θ), and thereby bound

    P{ λ_max(X) ≥ t } ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{ −t² / (2v + 2ct) }.
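The "solve" step compresses a short calculus argument. A sketch of the standard derivation (my reconstruction of the intermediate steps, not verbatim from the talk) is:

```latex
% For 0 < \theta < 1/c, rearrange m'(\theta) \le c\theta\, m'(\theta) + v\theta\, m(\theta):
\frac{d}{d\theta} \log m(\theta) \;=\; \frac{m'(\theta)}{m(\theta)} \;\le\; \frac{v\theta}{1 - c\theta}.
% Integrate from 0 to \theta, using m(0) = \operatorname{tr} I = d and 1 - cs \ge 1 - c\theta:
\log m(\theta) \;\le\; \log d + \int_0^\theta \frac{v s}{1 - c s}\, ds \;\le\; \log d + \frac{v\theta^2}{2(1 - c\theta)}.
% Substitute into the Laplace transform bound and choose \theta = t/(v + ct):
\mathbb{P}\{\lambda_{\max}(X) \ge t\}
  \;\le\; d \cdot \exp\!\Big({-\theta t} + \frac{v\theta^2}{2(1 - c\theta)}\Big)
  \;\le\; d \cdot \exp\!\Big(\frac{-t^2}{2v + 2ct}\Big).
```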

Refined Exponential Concentration

Relaxing the constraint Δ_X ≼ cX + vI:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (X, X′) be a bounded matrix Stein pair with X ∈ H^d. Define the function

    r(ψ) = (1/ψ) · log E tr( e^{ψ Δ_X} / d ) for each ψ > 0.

Then, for all t ≥ 0 and all ψ > 0,

    P{ λ_max(X) ≥ t } ≤ d · exp{ −t² / (2 r(ψ) + 2t/√ψ) }.

- r(ψ) measures the typical magnitude of the conditional variance:

    E λ_max(Δ_X) ≤ inf_{ψ>0} [ r(ψ) + (log d)/ψ ]

- When d = 1, improves the scalar result of Chatterjee (2008)
- Proof extends to unbounded random matrices

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (Y_k)_{k≥1} be independent matrices in H^d satisfying

    E Y_k = 0 and ‖Y_k‖ ≤ R for each index k.

Define the variance parameter

    σ² = ‖ Σ_k E Y_k² ‖.

Then, for all t ≥ 0,

    P{ λ_max(Σ_k Y_k) ≥ t } ≤ d · exp{ −t² / (3σ² + 2Rt) }.

- Gaussian tail controlled by the improved variance when t is small
- Key proof idea: apply the refined concentration result and bound r(ψ) = (1/ψ) log E tr(e^{ψΔ_X}/d) using the unrefined result
- Constants better than Oliveira (2009), worse than Tropp (2011)

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let p = 1 or p ≥ 1.5. Suppose that (X, X′) is a matrix Stein pair where E tr |X|^{2p} < ∞. Then

    ( E tr |X|^{2p} )^{1/(2p)} ≤ √(2p − 1) · ( E tr Δ_X^p )^{1/(2p)}.

Moral: the conditional variance controls the moments of X.
- Generalizes Chatterjee's version (2007) of the scalar Burkholder–Davis–Gundy inequality (Burkholder, 1973)
- See also Pisier & Xu (1997), Junge & Xu (2003, 2008)
- Proof techniques mirror those for exponential concentration
- Also holds for infinite-dimensional Schatten-class operators

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (ε_k)_{k≥1} be an independent sequence of Rademacher random variables and (A_k)_{k≥1} a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,

    E tr( Σ_k ε_k A_k )^{2p} ≤ (2p − 1)^p · tr( Σ_k A_k² )^p.

- The noncommutative Khintchine inequality (Lust-Piquard, 1986; Lust-Piquard and Pisier, 1991) is a dominant tool in applied matrix analysis, e.g., used in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007)
- Stein's method offers an unusually concise proof
- The constant √(2p − 1) is within √e of optimal
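For a small number of summands, the Khintchine moment bound can be verified exactly by enumerating every sign pattern. The matrices and the choice p = 2 below are illustrative assumptions.

```python
import itertools
import numpy as np

# Exact check of E tr (sum_k eps_k A_k)^{2p} <= (2p-1)^p tr (sum_k A_k^2)^p.
rng = np.random.default_rng(6)
n, d, p = 4, 3, 2
A = [(lambda G: (G + G.T) / 2)(rng.standard_normal((d, d))) for _ in range(n)]

lhs = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    X = sum(e * Ak for e, Ak in zip(signs, A))
    lhs += np.trace(np.linalg.matrix_power(X, 2 * p))
lhs /= 2 ** n                                  # exact expectation over signs

S = sum(Ak @ Ak for Ak in A)
rhs = (2 * p - 1) ** p * np.trace(np.linalg.matrix_power(S, p))
assert lhs <= rhs + 1e-9
```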

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration
2. Stein's Method: Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums
6. Extensions

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-mean Matrices)
Given a sequence of Hermitian matrices (Y_k)_{k=1}^n satisfying the conditional zero mean property

    E[ Y_k | (Y_j)_{j≠k} ] = 0 for all k,

define the random sum X = Σ_{k=1}^n Y_k.
Note: (Y_k)_{k≥1} is a martingale difference sequence.

Examples:
- Sums of independent centered random matrices
- Many sums of conditionally independent random matrices: Y_k ⫫ (Y_j)_{j≠k} | Z and E[Y_k | Z] = 0
- Rademacher series with random matrix coefficients: X = Σ_k ε_k W_k, with (W_k)_{k≥1} Hermitian and (ε_k)_{k≥1} independent Rademacher

Sums of Conditionally Zero-mean Matrices

Conditional zero mean property: E[ Y_k | (Y_j)_{j≠k} ] = 0.

Matrix Stein Pair for X = Σ_{k=1}^n Y_k:
- Let Y′_k and Y_k be conditionally i.i.d. given (Y_j)_{j≠k}
- Draw an index K uniformly from {1, …, n}
- Define X′ = X + Y′_K − Y_K

Check the Stein pair condition:

    E[ X − X′ | (Y_j)_{j≥1} ] = E[ Y_K − Y′_K | (Y_j)_{j≥1} ]
                              = (1/n) Σ_{k=1}^n ( Y_k − E[ Y′_k | (Y_j)_{j≠k} ] )
                              = (1/n) Σ_{k=1}^n Y_k = (1/n) X.

Sums of Conditionally Zero-mean Matrices

Conditional zero mean property: E[ Y_k | (Y_j)_{j≠k} ] = 0.

Conditional Variance for X = Σ_{k=1}^n Y_k (with scale factor α = 1/n):

    Δ_X = (n/2) · E[ (X − X′)² | (Y_j)_{j≥1} ]
        = (n/2) · E[ (Y_K − Y′_K)² | (Y_j)_{j≥1} ]
        = (1/2) Σ_{k=1}^n ( Y_k² + E[ Y_k² | (Y_j)_{j≠k} ] ).

⇒ The conditional variance is controlled when the summands are bounded
⇒ Dependent analogues of the concentration and moment inequalities follow

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
Given a deterministic array (A_{jk})_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic

    Y = Σ_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) Σ_{j,k=1}^n A_{jk}.

- Generalizes the scalar statistics studied by Hoeffding (1951)

Example: sampling without replacement from {B₁, …, B_n}:

    W = Σ_{j=1}^s B_{π(j)}.
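The mean formula can be verified exactly by averaging over all n! permutations when n is small. The array of matrices below is an illustrative assumption.

```python
import itertools
import numpy as np

# Exact check of E Y = (1/n) sum_{j,k} A_{jk} for Y = sum_j A_{j, pi(j)}.
rng = np.random.default_rng(7)
n, d = 4, 2
A = rng.standard_normal((n, n, d, d))
A = (A + np.swapaxes(A, -1, -2)) / 2            # make each A_{jk} symmetric

perms = list(itertools.permutations(range(n)))  # all n! permutations
EY = sum(sum(A[j, pi[j]] for j in range(n)) for pi in perms) / len(perms)
assert np.allclose(EY, A.sum(axis=(0, 1)) / n)
```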

Combinatorial Sums of Matrices

Definition: Y = Σ_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) Σ_{j,k=1}^n A_{jk}.

Matrix Stein Pair for X = Y − EY:
- Draw indices (J, K) uniformly from {1, …, n}²
- Define π′ = π ∘ (J, K) and X′ = Σ_{j=1}^n A_{jπ′(j)} − EY

Check the Stein pair condition:

    E[ X − X′ | π ] = E[ A_{Jπ(J)} + A_{Kπ(K)} − A_{Jπ(K)} − A_{Kπ(J)} | π ]
                    = (1/n²) Σ_{j,k=1}^n ( A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)} )
                    = (2/n)(Y − EY) = (2/n) X.
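The Stein pair condition (scale factor α = 2/n) can also be checked exactly: for a fixed permutation π, average Y − Y′ over all n² choices of the transposition indices (J, K). All matrices below are illustrative.

```python
import numpy as np

# Exact check of E[X - X' | pi] = (2/n) X for the combinatorial Stein pair.
rng = np.random.default_rng(9)
n, d = 4, 2
A = rng.standard_normal((n, n, d, d))
A = (A + np.swapaxes(A, -1, -2)) / 2            # make each A_{jk} symmetric

pi = list(rng.permutation(n))
EY = A.sum(axis=(0, 1)) / n
Y = sum(A[j, pi[j]] for j in range(n))
X = Y - EY

diff = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        pi2 = pi.copy()
        pi2[J], pi2[K] = pi2[K], pi2[J]         # compose pi with (J, K)
        Yp = sum(A[j, pi2[j]] for j in range(n))
        diff += (Y - Yp) / n**2
assert np.allclose(diff, (2.0 / n) * X)
```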

Combinatorial Sums of Matrices

Definition: Y = Σ_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) Σ_{j,k=1}^n A_{jk}.

Conditional Variance for X = Y − EY:

    Δ_X(π) = (n/4) · E[ (X − X′)² | π ]
           = (1/(4n)) Σ_{j,k=1}^n [ A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)} ]²
           ≼ (1/n) Σ_{j,k=1}^n [ A_{jπ(j)}² + A_{kπ(k)}² + A_{jπ(k)}² + A_{kπ(j)}² ]

⇒ The conditional variance is controlled when the summands are bounded
⇒ Dependent analogues of the concentration and moment inequalities follow

Extensions

General Complex Matrices
Map any matrix B ∈ C^{d₁×d₂} to a Hermitian matrix via dilation:

    D(B) = [ 0   B ]
           [ B*  0 ]  ∈ H^{d₁+d₂}.

Dilation preserves spectral information: λ_max(D(B)) = ‖B‖.

Beyond Sums
Matrix-valued functions satisfying a self-reproducing property, e.g., the matrix second-order Rademacher chaos Σ_{j,k} ε_j ε_k A_{jk}. Yields a dependent bounded differences inequality for matrices.

Generalized Matrix Stein Pairs
Satisfy E[ g(X) − g(X′) | Z ] = αX almost surely, for g : R → R weakly increasing.
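The dilation identity is easy to confirm numerically: the nonzero eigenvalues of D(B) are ± the singular values of B, so the top eigenvalue equals ‖B‖. The rectangular matrix below is an illustrative choice.

```python
import numpy as np

# Hermitian dilation D(B) = [[0, B], [B*, 0]] and lambda_max(D(B)) = ||B||.
rng = np.random.default_rng(8)
d1, d2 = 3, 5
B = rng.standard_normal((d1, d2))

D = np.block([[np.zeros((d1, d1)), B],
              [B.T, np.zeros((d2, d2))]])       # real case: B* = B^T

lam_max = np.linalg.eigvalsh(D)[-1]
assert np.allclose(lam_max, np.linalg.norm(B, 2))
```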

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl., 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. J. Mach. Learn. Res. Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58(301):13–30, 1963.

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, Jan. 2007.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4):Article 21, 19 pp., Jul. 2007.

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., Aug. 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 9: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Motivation Matrix Completion

Matrix Bernstein Inequality

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

For all t ge 0

P

sum

kYk

∥ge t

le (m+ n) middot exp minust23σ2 + 2Rt

Consequences for matrix completion

Recht (2011) showed that uniform sampling of entries capturesmost of the information in incoherent low-rank matrices

Negahban and Wainwright (2010) showed that iid sampling ofentries captures most of the information in non-spiky (near)low-rank matrices

Foygel and Srebro (2011) characterized the generalization errorof convex MC through Rademacher complexity

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 9 35

Motivation Matrix Concentration

Concentration Inequalities

Matrix concentration

Pλmax(X minus EX) ge t le δ

Difficulty Matrix multiplication is not commutative

rArr eX+Y 6= eXeY

Past approaches (Ahlswede and Winter 2002 Oliveira 2009 Tropp 2011)

Rely on deep results from matrix analysis

Apply to sums of independent matrices and matrix martingales

This work

Steinrsquos method of exchangeable pairs (1972) as advanced byChatterjee (2007) for scalar concentrationrArr Improved exponential tail inequalities (Hoeffding Bernstein)rArr Polynomial moment inequalities (Khintchine Rosenthal)rArr Dependent sums and more general matrix functionals

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 10 35

Motivation Matrix Concentration

Roadmap

1 Motivation

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent Sequences

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 11 35

Background

Notation

Hermitian matrices Hd = A isin Cdtimesd A = AlowastAll matrices in this talk are Hermitian

Maximum eigenvalue λmax(middot)Trace trB the sum of the diagonal entries of B

Spectral norm B the maximum singular value of B

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 12 35

Background

Matrix Stein Pair

Definition (Exchangeable Pair)

(ZZ prime) is an exchangeable pair if (ZZ prime)d= (Z prime Z)

Definition (Matrix Stein Pair)

Let (ZZ prime) be an exchangeable pair and let Ψ Z rarr Hd be ameasurable function Define the random matrices

X = Ψ(Z) and X prime = Ψ(Z prime)

(XX prime) is a matrix Stein pair with scale factor α isin (0 1] if

E[X prime |Z] = (1minus α)X

Matrix Stein pairs are exchangeable pairs

Matrix Stein pairs always have zero mean

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 13 35

Background

The Conditional Variance

Definition (Conditional Variance)

Suppose that (XX prime) is a matrix Stein pair with scale factor αconstructed from the exchangeable pair (ZZ prime) The conditional

variance is the random matrix

∆X = ∆X(Z) =1

2αE[

(X minusX prime)2 |Z]

∆X is a stochastic estimate for the variance EX2

Take-home Message

Control over ∆X yields control over λmax(X)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 14 35

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a matrix Stein pair with X isin Hd Suppose that

∆X 4 cX + v I almost surely for c v ge 0

Then for all t ge 0

Pλmax(X) ge t le d middot exp minust22v + 2ct

Control over the conditional variance ∆X yields

Gaussian tail for λmax(X) for small t exponential tail for large t

When d = 1 improves scalar result of Chatterjee (2007)

The dimensional factor d cannot be removed

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 15 35

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let X =sum

k Yk for independent matrices in Hd satisfying

EYk = 0 and Y 2k 4 A2

k

for deterministic matrices (Ak)kge1 Define the variance parameter

σ2 =∥

sum

kA2k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot eminust22σ2

Improves upon the matrix Hoeffding inequality of Tropp (2011)Optimal constant 12 in the exponent

Can replace variance parameter with σ2 = 12

sum

k

(

A2k + EY 2

k

)∥

Tighter than classical Hoeffding inequality (1963) when d = 1

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 16 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

1 Matrix Laplace transform method (Ahlswede amp Winter 2002)

Relate tail probability to the trace of the mgf of X

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ)

where m(θ) = E tr eθX

Problem eX+Y 6= eXeY when XY isin Hd

How to bound the trace mgf

Past approaches Golden-Thompson Liebrsquos concavity theorem

Chatterjeersquos strategy for scalar concentration

Control mgf growth by bounding derivative

mprime(θ) = E trXeθX for θ isin R

Rewrite using exchangeable pairs

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 17 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that (XX prime) is a matrix Stein pair with scale factor α LetF Hd rarr Hd be a measurable function satisfying

E(X minusX prime)F (X) ltinfin

Then

E[X F (X)] =1

2αE[(X minusX prime)(F (X)minus F (X prime))] (1)

Intuition

Can characterize the distribution of a random matrix byintegrating it against a class of test functions F

Eq 1 allows us to estimate this integral using the smoothnessproperties of F and the discrepancy X minusX prime

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 18 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

2 Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf

mprime(θ) = E trXeθX =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

Goal Use the smoothness of F (X) = eθX to bound the derivative

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 19 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey Jordan Chen Farrell and Tropp 2012)

Suppose that g R rarr R is a weakly increasing function and thath R rarr R is a function whose derivative hprime is convex For allmatrices AB isin Hd it holds that

tr[(g(A)minus g(B)) middot (h(A)minus h(B))] le1

2tr[(g(A)minus g(B)) middot (AminusB) middot (hprime(A) + hprime(B))]

Standard matrix functions If g R rarr R and

A = Q

λ1

λd

Qlowast then g(A) = Q

g(λ1)

g(λd)

Qlowast

Inequality does not hold without the traceFor exponential concentration we let g(A) = A and h(B) = eθB

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 20 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

le θ

4αE tr

[

(X minusX prime)2 middot(

eθX + eθXprime)]

2αE tr

[

(X minusX prime)2 middot eθX]

= θ middot E tr

[

1

2αE[

(X minusX prime)2 |Z]

middot eθX]

= θ middot E tr[

∆X eθX]

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 21 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-mean Property)

E[Y_k | (Y_j)_{j≠k}] = 0

Matrix Stein Pair for X = ∑_{k=1}^n Y_k:

Let Y'_k and Y_k be conditionally i.i.d. given (Y_j)_{j≠k}.
Draw an index K uniformly from {1, ..., n}.
Define X' = X + Y'_K − Y_K.

Check the Stein pair condition:

E[X − X' | (Y_j)_{j≥1}] = E[Y_K − Y'_K | (Y_j)_{j≥1}]
  = (1/n) ∑_{k=1}^n (Y_k − E[Y'_k | (Y_j)_{j≠k}])
  = (1/n) ∑_{k=1}^n Y_k = (1/n) X,

so (X, X') is a matrix Stein pair with scale factor α = 1/n.
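The Stein pair identity E[X − X' | (Y_j)] = X/n can be verified exactly for the Rademacher-series example, averaging over the uniform index K and the resampled sign. A small sketch (not from the slides; assumes NumPy, with arbitrary illustrative W_k):

```python
import numpy as np

def sym(M):
    """Symmetrize a square matrix so it is real symmetric (Hermitian)."""
    return (M + M.T) / 2

rng = np.random.default_rng(1)
d, n = 3, 5
W = [sym(rng.standard_normal((d, d))) for _ in range(n)]
eps = rng.choice([-1.0, 1.0], size=n)

X = sum(e * Wk for e, Wk in zip(eps, W))

# E[X - X' | eps]: exact average over the uniform coordinate K and the
# conditionally i.i.d. resampled sign eps'_K in {-1, +1}
diff = np.zeros((d, d))
for K in range(n):
    for new_sign in (-1.0, 1.0):
        Xp = X + (new_sign - eps[K]) * W[K]
        diff += (X - Xp) / (n * 2)

assert np.allclose(diff, X / n)  # scale factor alpha = 1/n
```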

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-mean Property)

E[Y_k | (Y_j)_{j≠k}] = 0

Conditional Variance for X = ∑_{k=1}^n Y_k:

∆X = (n/2) · E[(X − X')² | (Y_j)_{j≥1}]
  = (n/2) · E[(Y_K − Y'_K)² | (Y_j)_{j≥1}]
  = (1/2) ∑_{k=1}^n (Y_k² + E[Y_k² | (Y_j)_{j≠k}])

⇒ The conditional variance is controlled when the summands are bounded.
⇒ Dependent analogues of the concentration and moment inequalities follow.

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (A_jk)_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, ..., n}, define the combinatorial matrix statistic

Y = ∑_{j=1}^n A_{jπ(j)} with mean EY = (1/n) ∑_{j,k=1}^n A_{jk}

Generalizes the scalar statistics studied by Hoeffding (1951).

Example:
Sampling without replacement from {B_1, ..., B_n}: W = ∑_{j=1}^s B_{π(j)}
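The mean formula EY = (1/n) ∑_{j,k} A_{jk} follows because each entry A_{jk} is selected by exactly (n−1)! of the n! permutations. A quick exact check over all permutations of a small array (not from the slides; assumes NumPy):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
n, d = 4, 3
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2  # make each A_{jk} symmetric

# E Y computed exactly over all n! permutations
EY = np.zeros((d, d))
perms = list(permutations(range(n)))
for pi in perms:
    EY += sum(A[j, pi[j]] for j in range(n))
EY /= len(perms)

assert np.allclose(EY, A.sum(axis=(0, 1)) / n)
```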

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y = ∑_{j=1}^n A_{jπ(j)} with mean EY = (1/n) ∑_{j,k=1}^n A_{jk}

Matrix Stein Pair for X = Y − EY:

Draw indices (J, K) uniformly from {1, ..., n}².
Define π' = π ∘ (J K), the composition of π with the transposition of J and K, and X' = ∑_{j=1}^n A_{jπ'(j)} − EY.

Check the Stein pair condition:

E[X − X' | π] = E[A_{Jπ(J)} + A_{Kπ(K)} − A_{Jπ(K)} − A_{Kπ(J)} | π]
  = (1/n²) ∑_{j,k=1}^n (A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)})
  = (2/n)(Y − EY) = (2/n) X

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y = ∑_{j=1}^n A_{jπ(j)} with mean EY = (1/n) ∑_{j,k=1}^n A_{jk}

Conditional Variance for X = Y − EY:

∆X(π) = (n/4) E[(X − X')² | π]
  = (1/(4n)) ∑_{j,k=1}^n [A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)}]²
  ≼ (1/n) ∑_{j,k=1}^n [A²_{jπ(j)} + A²_{kπ(k)} + A²_{jπ(k)} + A²_{kπ(j)}]

⇒ The conditional variance is controlled when the summands are bounded.
⇒ Dependent analogues of the concentration and moment inequalities follow.

Extensions

Extensions

General Complex Matrices
Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via dilation:

D(B) = [ 0   B ]
       [ B*  0 ]  ∈ H^{d1+d2}

Preserves spectral information: λmax(D(B)) = ‖B‖.

Beyond Sums
Matrix-valued functions satisfying a self-reproducing property,
e.g., the matrix second-order Rademacher chaos ∑_{j,k} ε_j ε_k A_{jk}.
Yields a dependent bounded differences inequality for matrices.

Generalized Matrix Stein Pairs
Satisfy E[g(X) − g(X') | Z] = αX almost surely, for g: R → R weakly increasing.
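The dilation identity λmax(D(B)) = ‖B‖ is easy to confirm numerically: the eigenvalues of D(B) are ±σ_i(B) plus zeros. A small check (not from the slides; assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2 = 3, 5
B = rng.standard_normal((d1, d2))

# Hermitian dilation D(B) = [[0, B], [B*, 0]]
D = np.block([[np.zeros((d1, d1)), B],
              [B.T, np.zeros((d2, d2))]])

lam_max = np.linalg.eigvalsh(D).max()   # maximum eigenvalue of the dilation
spec_norm = np.linalg.norm(B, 2)        # spectral norm of B

assert np.isclose(lam_max, spec_norm)
```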

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.
Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.
Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.
Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.
Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.
Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.
Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. arXiv:math/0507526v1.
Cheung, S.-S., So, A. M.-C., and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at optimization-online, 2011.
Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.
Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl., 30:844–881, 2008.
Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. J. Mach. Learn. Res. Proceedings Track, 19:315–340, 2011.
Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.
Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.
Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58(301):13–30, 1963.

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.
Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.
Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.
Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.
Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.
Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. http://arxiv.org/abs/1201.6002, 2012.
Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.
Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, Jan. 2007. doi:10.1007/s10107-006-0033-0.
Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.
Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.
Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.
Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4):Article 21, 19 pp., Jul. 2007.
So, A. M.-C. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.
Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.
Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., Aug. 2011.

Motivation: Matrix Concentration

Concentration Inequalities

Matrix concentration:

P{λmax(X − EX) ≥ t} ≤ δ

Difficulty: matrix multiplication is not commutative,
⇒ e^{X+Y} ≠ e^X e^Y.

Past approaches (Ahlswede and Winter 2002; Oliveira 2009; Tropp 2011):
Rely on deep results from matrix analysis.
Apply to sums of independent matrices and matrix martingales.

This work:
Stein's method of exchangeable pairs (1972), as advanced by Chatterjee (2007) for scalar concentration.
⇒ Improved exponential tail inequalities (Hoeffding, Bernstein)
⇒ Polynomial moment inequalities (Khintchine, Rosenthal)
⇒ Dependent sums and more general matrix functionals

Motivation: Matrix Concentration

Roadmap

1. Motivation
2. Stein's Method: Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences
6. Extensions

Background

Notation

Hermitian matrices: H^d = {A ∈ C^{d×d} : A = A*}. All matrices in this talk are Hermitian.
Maximum eigenvalue: λmax(·).
Trace: tr B, the sum of the diagonal entries of B.
Spectral norm: ‖B‖, the maximum singular value of B.
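These quantities are all available in NumPy; a minimal sketch (not from the slides, added for illustration) computing them for a random Hermitian matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
A = (M + M.conj().T) / 2  # Hermitian: A = A*

lam_max = np.linalg.eigvalsh(A).max()   # maximum eigenvalue
trace = np.trace(A).real                # trace (real for Hermitian A)
spec_norm = np.linalg.norm(A, 2)        # spectral norm = max singular value

# For a Hermitian matrix, the spectral norm equals the largest |eigenvalue|
assert np.isclose(spec_norm, np.abs(np.linalg.eigvalsh(A)).max())
```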

Background

Matrix Stein Pair

Definition (Exchangeable Pair)
(Z, Z') is an exchangeable pair if (Z, Z') =_d (Z', Z).

Definition (Matrix Stein Pair)
Let (Z, Z') be an exchangeable pair, and let Ψ: Z → H^d be a measurable function. Define the random matrices
X = Ψ(Z) and X' = Ψ(Z').
(X, X') is a matrix Stein pair with scale factor α ∈ (0, 1] if
E[X' | Z] = (1 − α)X.

Matrix Stein pairs are exchangeable pairs.
Matrix Stein pairs always have zero mean.

Background

The Conditional Variance

Definition (Conditional Variance)
Suppose that (X, X') is a matrix Stein pair with scale factor α, constructed from the exchangeable pair (Z, Z'). The conditional variance is the random matrix
∆X = ∆X(Z) = (1/(2α)) E[(X − X')² | Z].

∆X is a stochastic estimate for the variance EX².

Take-home Message: control over ∆X yields control over λmax(X).
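For a Rademacher series with deterministic Hermitian coefficients, the conditional variance works out to the fixed matrix ∆X = ∑_k A_k², which connects directly to the Khintchine-type bounds later in the talk. A small exact computation (not from the slides; assumes NumPy, arbitrary illustrative A_k):

```python
import numpy as np

def sym(M):
    """Symmetrize a square matrix so it is real symmetric (Hermitian)."""
    return (M + M.T) / 2

rng = np.random.default_rng(5)
d, n = 3, 4
A = [sym(rng.standard_normal((d, d))) for _ in range(n)]
eps = rng.choice([-1.0, 1.0], size=n)

X = sum(e * Ak for e, Ak in zip(eps, A))
alpha = 1.0 / n  # scale factor for the coordinate-resampling Stein pair

# Delta_X = (1/(2 alpha)) E[(X - X')^2 | eps], averaging exactly over the
# uniform coordinate K and the resampled sign eps'_K in {-1, +1}
Delta = np.zeros((d, d))
for K in range(n):
    for s in (-1.0, 1.0):
        Xp = X + (s - eps[K]) * A[K]
        Delta += (X - Xp) @ (X - Xp) / (n * 2)
Delta /= 2 * alpha

assert np.allclose(Delta, sum(Ak @ Ak for Ak in A))
```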

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let (X, X') be a matrix Stein pair with X ∈ H^d. Suppose that
∆X ≼ cX + v I almost surely, for c, v ≥ 0.
Then, for all t ≥ 0,
P{λmax(X) ≥ t} ≤ d · exp{−t²/(2v + 2ct)}.

Control over the conditional variance ∆X yields a Gaussian tail for λmax(X) for small t and an exponential tail for large t.
When d = 1, improves the scalar result of Chatterjee (2007).
The dimensional factor d cannot be removed.

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let X = ∑_k Y_k for independent matrices in H^d satisfying
EY_k = 0 and Y_k² ≼ A_k²
for deterministic matrices (A_k)_{k≥1}. Define the variance parameter
σ² = ‖∑_k A_k²‖.
Then, for all t ≥ 0,
P{λmax(∑_k Y_k) ≥ t} ≤ d · e^{−t²/(2σ²)}.

Improves upon the matrix Hoeffding inequality of Tropp (2011): optimal constant 1/2 in the exponent.
Can replace the variance parameter with σ² = (1/2)‖∑_k (A_k² + EY_k²)‖.
Tighter than the classical Hoeffding inequality (1963) when d = 1.
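A Monte Carlo sanity check of the corollary for the special case Y_k = ε_k A_k (so Y_k² = A_k² exactly). This sketch is not from the slides; it assumes NumPy, and the scaling of the A_k and the choice of t are arbitrary illustrative values:

```python
import numpy as np

def sym(M):
    """Symmetrize a square matrix so it is real symmetric (Hermitian)."""
    return (M + M.T) / 2

rng = np.random.default_rng(6)
d, n, trials, t = 3, 10, 2000, 3.0

# Y_k = eps_k * A_k with Y_k^2 = A_k^2, so the corollary applies exactly
A = [sym(rng.standard_normal((d, d)) / np.sqrt(n)) for _ in range(n)]
sigma2 = np.linalg.norm(sum(Ak @ Ak for Ak in A), 2)

bound = d * np.exp(-t**2 / (2 * sigma2))

# Empirical tail probability of lambda_max(sum_k eps_k A_k)
hits = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    X = sum(e * Ak for e, Ak in zip(eps, A))
    hits += np.linalg.eigvalsh(X).max() >= t
empirical = hits / trials

assert empirical <= bound + 0.05  # small slack for Monte Carlo error
```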

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

1. Matrix Laplace transform method (Ahlswede & Winter 2002)
Relate the tail probability to the trace of the mgf of X:
P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ), where m(θ) = E tr e^{θX}.

Problem: e^{X+Y} ≠ e^X e^Y when X, Y ∈ H^d. How to bound the trace mgf?
Past approaches: Golden-Thompson, Lieb's concavity theorem.
Chatterjee's strategy for scalar concentration: control mgf growth by bounding the derivative
m'(θ) = E tr X e^{θX} for θ ∈ R,
rewritten using exchangeable pairs.

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma
Suppose that (X, X') is a matrix Stein pair with scale factor α. Let F: H^d → H^d be a measurable function satisfying
E‖(X − X')F(X)‖ < ∞.
Then
E[X F(X)] = (1/(2α)) E[(X − X')(F(X) − F(X'))].   (1)

Intuition:
Can characterize the distribution of a random matrix by integrating it against a class of test functions F.
Eq. (1) allows us to estimate this integral using the smoothness properties of F and the discrepancy X − X'.
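Eq. (1) can be confirmed by exact enumeration for the Rademacher-series Stein pair with the simplest test function F(X) = X, where both sides reduce to ∑_k A_k². A small sketch (not from the slides; assumes NumPy, arbitrary illustrative A_k):

```python
import numpy as np
from itertools import product

def sym(M):
    """Symmetrize a square matrix so it is real symmetric (Hermitian)."""
    return (M + M.T) / 2

rng = np.random.default_rng(7)
d, n = 2, 3
A = [sym(rng.standard_normal((d, d))) for _ in range(n)]
alpha = 1.0 / n

lhs = np.zeros((d, d))   # E[X F(X)] with F(X) = X
rhs = np.zeros((d, d))   # (1/(2 alpha)) E[(X - X')(F(X) - F(X'))]
count = 0
# Enumerate all sign patterns eps, coordinates K, and resampled signs s
for eps in product([-1.0, 1.0], repeat=n):
    X = sum(e * Ak for e, Ak in zip(eps, A))
    for K in range(n):
        for s in (-1.0, 1.0):
            Xp = X + (s - eps[K]) * A[K]
            lhs += X @ X
            rhs += (X - Xp) @ (X - Xp) / (2 * alpha)
            count += 1
lhs /= count
rhs /= count

assert np.allclose(lhs, rhs)
```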

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs
Rewrite the derivative of the trace mgf:
m'(θ) = E tr X e^{θX} = (1/(2α)) E tr[(X − X')(e^{θX} − e^{θX'})].

Goal: use the smoothness of F(X) = e^{θX} to bound the derivative.

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Suppose that g: R → R is a weakly increasing function and that h: R → R is a function whose derivative h' is convex. For all matrices A, B ∈ H^d, it holds that
tr[(g(A) − g(B)) · (h(A) − h(B))] ≤ (1/2) tr[(g(A) − g(B)) · (A − B) · (h'(A) + h'(B))].

Standard matrix functions: if g: R → R and A = Q diag(λ_1, ..., λ_d) Q*, then g(A) = Q diag(g(λ_1), ..., g(λ_d)) Q*.
The inequality does not hold without the trace.
For exponential concentration, we let g(A) = A and h(B) = e^{θB}.
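The lemma is easy to exercise numerically with g(x) = x and h(x) = e^{θx} (so h' is convex for θ > 0), applying scalar functions via the eigendecomposition exactly as described above. A sketch (not from the slides; assumes NumPy, arbitrary illustrative A, B, θ):

```python
import numpy as np

def matfun(f, A):
    """Apply a scalar function to a Hermitian matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return (Q * f(w)) @ Q.conj().T

def sym(M):
    return (M + M.T) / 2

rng = np.random.default_rng(8)
d, theta = 4, 0.7
A = sym(rng.standard_normal((d, d)))
B = sym(rng.standard_normal((d, d)))

g = lambda x: x                             # weakly increasing
h = lambda x: np.exp(theta * x)
hp = lambda x: theta * np.exp(theta * x)    # h', convex for theta > 0

lhs = np.trace((matfun(g, A) - matfun(g, B)) @ (matfun(h, A) - matfun(h, B)))
rhs = 0.5 * np.trace((matfun(g, A) - matfun(g, B)) @ (A - B)
                     @ (matfun(hp, A) + matfun(hp, B)))

assert lhs <= rhs + 1e-9  # tolerance for floating-point error
```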

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality
Bound the derivative of the trace mgf:

m'(θ) = (1/(2α)) E tr[(X − X')(e^{θX} − e^{θX'})]
  ≤ (θ/(4α)) E tr[(X − X')² · (e^{θX} + e^{θX'})]
  = (θ/(2α)) E tr[(X − X')² · e^{θX}]
  = θ · E tr[(1/(2α)) E[(X − X')² | Z] · e^{θX}]
  = θ · E tr[∆X e^{θX}]

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality
Bound the derivative of the trace mgf:
m'(θ) ≤ θ · E tr[∆X e^{θX}]

4. Conditional Variance Bound: ∆X ≼ cX + v I
Yields the differential inequality
m'(θ) ≤ cθ E tr[X e^{θX}] + vθ E tr[e^{θX}] = cθ · m'(θ) + vθ · m(θ).
Solve to bound m(θ) and thereby bound
P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{−t²/(2v + 2ct)}.

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X ≼ cX + v I:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let (X, X') be a bounded matrix Stein pair with X ∈ H^d. Define the function
r(ψ) = (1/ψ) log E tr(e^{ψ∆X}/d) for each ψ > 0.
Then, for all t ≥ 0 and all ψ > 0,
P{λmax(X) ≥ t} ≤ d · exp{−t²/(2r(ψ) + 2t/√ψ)}.

r(ψ) measures the typical magnitude of the conditional variance:
E λmax(∆X) ≤ inf_{ψ>0} [r(ψ) + (log d)/ψ].
When d = 1, improves the scalar result of Chatterjee (2008).
The proof extends to unbounded random matrices.

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let (Y_k)_{k≥1} be independent matrices in H^d satisfying
EY_k = 0 and ‖Y_k‖ ≤ R for each index k.
Define the variance parameter
σ² = ‖∑_k EY_k²‖.
Then, for all t ≥ 0,
P{λmax(∑_k Y_k) ≥ t} ≤ d · exp{−t²/(3σ² + 2Rt)}.

Gaussian tail controlled by an improved variance when t is small.
Key proof idea: apply the refined concentration result and bound r(ψ) = (1/ψ) log E tr(e^{ψ∆X}/d) using the unrefined result.
Constants better than Oliveira (2009), worse than Tropp (2011).

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)
Let p = 1 or p ≥ 1.5. Suppose that (X, X') is a matrix Stein pair where E tr|X|^{2p} < ∞. Then
(E tr|X|^{2p})^{1/2p} ≤ √(2p − 1) · (E tr ∆X^p)^{1/2p}.

Moral: the conditional variance controls the moments of X.
Generalizes Chatterjee's version (2007) of the scalar Burkholder-Davis-Gundy inequality (Burkholder 1973).
See also Pisier & Xu (1997), Junge & Xu (2003, 2008).
Proof techniques mirror those for exponential concentration.
Also holds for infinite-dimensional Schatten-class operators.

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 11: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Motivation Matrix Concentration

Roadmap

1 Motivation

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent Sequences

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 11 35

Background

Notation

Hermitian matrices Hd = A isin Cdtimesd A = AlowastAll matrices in this talk are Hermitian

Maximum eigenvalue λmax(middot)Trace trB the sum of the diagonal entries of B

Spectral norm B the maximum singular value of B

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 12 35

Background

Matrix Stein Pair

Definition (Exchangeable Pair)

(ZZ prime) is an exchangeable pair if (ZZ prime)d= (Z prime Z)

Definition (Matrix Stein Pair)

Let (ZZ prime) be an exchangeable pair and let Ψ Z rarr Hd be ameasurable function Define the random matrices

X = Ψ(Z) and X prime = Ψ(Z prime)

(XX prime) is a matrix Stein pair with scale factor α isin (0 1] if

E[X prime |Z] = (1minus α)X

Matrix Stein pairs are exchangeable pairs

Matrix Stein pairs always have zero mean

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 13 35

Background

The Conditional Variance

Definition (Conditional Variance)

Suppose that (XX prime) is a matrix Stein pair with scale factor αconstructed from the exchangeable pair (ZZ prime) The conditional

variance is the random matrix

∆X = ∆X(Z) =1

2αE[

(X minusX prime)2 |Z]

∆X is a stochastic estimate for the variance EX2

Take-home Message

Control over ∆X yields control over λmax(X)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 14 35

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a matrix Stein pair with X isin Hd Suppose that

∆X 4 cX + v I almost surely for c v ge 0

Then for all t ge 0

Pλmax(X) ge t le d middot exp minust22v + 2ct

Control over the conditional variance ∆X yields

Gaussian tail for λmax(X) for small t exponential tail for large t

When d = 1 improves scalar result of Chatterjee (2007)

The dimensional factor d cannot be removed

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 15 35

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let X =sum

k Yk for independent matrices in Hd satisfying

EYk = 0 and Y 2k 4 A2

k

for deterministic matrices (Ak)kge1 Define the variance parameter

σ2 =∥

sum

kA2k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot eminust22σ2

Improves upon the matrix Hoeffding inequality of Tropp (2011)Optimal constant 12 in the exponent

Can replace variance parameter with σ2 = 12

sum

k

(

A2k + EY 2

k

)∥

Tighter than classical Hoeffding inequality (1963) when d = 1

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 16 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

1 Matrix Laplace transform method (Ahlswede amp Winter 2002)

Relate tail probability to the trace of the mgf of X

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ)

where m(θ) = E tr eθX

Problem eX+Y 6= eXeY when XY isin Hd

How to bound the trace mgf

Past approaches Golden-Thompson Liebrsquos concavity theorem

Chatterjeersquos strategy for scalar concentration

Control mgf growth by bounding derivative

mprime(θ) = E trXeθX for θ isin R

Rewrite using exchangeable pairs

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 17 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that (XX prime) is a matrix Stein pair with scale factor α LetF Hd rarr Hd be a measurable function satisfying

E(X minusX prime)F (X) ltinfin

Then

E[X F (X)] =1

2αE[(X minusX prime)(F (X)minus F (X prime))] (1)

Intuition

Can characterize the distribution of a random matrix byintegrating it against a class of test functions F

Eq 1 allows us to estimate this integral using the smoothnessproperties of F and the discrepancy X minusX prime

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 18 35

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf:

    m′(θ) = E tr X e^{θX} = (1/2α) E tr[ (X − X′)( e^{θX} − e^{θX′} ) ].

Goal: Use the smoothness of F(X) = e^{θX} to bound the derivative.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 19 / 35

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Suppose that g : R → R is a weakly increasing function and that h : R → R is a function whose derivative h′ is convex. For all matrices A, B ∈ H^d, it holds that

    tr[(g(A) − g(B)) · (h(A) − h(B))] ≤ (1/2) tr[(g(A) − g(B)) · (A − B) · (h′(A) + h′(B))].

Standard matrix functions: if g : R → R and A = Q diag(λ_1, …, λ_d) Q*, then g(A) = Q diag(g(λ_1), …, g(λ_d)) Q*.

The inequality does not hold without the trace.
For exponential concentration, we let g(A) = A and h(B) = e^{θB}.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 20 / 35
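A numerical spot check of the lemma with g(x) = x and h(x) = e^{θx}, so h′(x) = θe^{θx} is convex for θ > 0 (a sketch; the matrices and θ are arbitrary):

```python
import numpy as np

def herm_fun(A, f):
    # Apply a scalar function f to a Hermitian matrix via its eigenvalues.
    w, Q = np.linalg.eigh(A)
    return (Q * f(w)) @ Q.conj().T

rng = np.random.default_rng(2)
theta = 0.7
MA, MB = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A, B = (MA + MA.T) / 2, (MB + MB.T) / 2

g = lambda M: M                                                   # g(x) = x
h = lambda M: herm_fun(M, lambda w: np.exp(theta * w))            # h(x) = e^{theta x}
hp = lambda M: herm_fun(M, lambda w: theta * np.exp(theta * w))   # h'(x)

lhs = np.trace((g(A) - g(B)) @ (h(A) - h(B)))
rhs = 0.5 * np.trace((g(A) - g(B)) @ (A - B) @ (hp(A) + hp(B)))
assert lhs <= rhs + 1e-9
```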

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

    m′(θ) = (1/2α) E tr[ (X − X′)( e^{θX} − e^{θX′} ) ]
          ≤ (θ/4α) E tr[ (X − X′)² · ( e^{θX} + e^{θX′} ) ]
          = (θ/2α) E tr[ (X − X′)² · e^{θX} ]      (by exchangeability)
          = θ · E tr[ (1/2α) E[ (X − X′)² | Z ] · e^{θX} ]
          = θ · E tr[ ∆_X e^{θX} ].

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 21 / 35

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

    m′(θ) ≤ θ · E tr[ ∆_X e^{θX} ].

4. Conditional Variance Bound: ∆_X ≼ cX + v I

Yields the differential inequality

    m′(θ) ≤ cθ E tr[ X e^{θX} ] + vθ E tr[ e^{θX} ] = cθ · m′(θ) + vθ · m(θ).

Solve to bound m(θ) and thereby bound

    P{ λ_max(X) ≥ t } ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{ −t² / (2v + 2ct) }.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 22 / 35
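One way the differential inequality can be integrated (a sketch consistent with the stated bound; the paper's bookkeeping may differ in details): for 0 < θ < 1/c,

```latex
% From m'(theta) <= c*theta*m'(theta) + v*theta*m(theta), for 0 < theta < 1/c:
\frac{d}{d\theta}\log m(\theta) = \frac{m'(\theta)}{m(\theta)}
  \le \frac{v\theta}{1 - c\theta}.
% Integrate from 0 to theta, using m(0) = tr I = d and 1/(1-cs) <= 1/(1-c*theta):
\log m(\theta) \le \log d + \frac{v\theta^2}{2(1 - c\theta)}.
% Substitute into the Laplace transform bound and choose theta = t/(v + ct):
\mathbb{P}\{\lambda_{\max}(X) \ge t\}
  \le d \exp\!\Bigl(-\theta t + \frac{v\theta^2}{2(1 - c\theta)}\Bigr)
  = d \exp\!\Bigl(-\frac{t^2}{2v + 2ct}\Bigr).
```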

Refined Exponential Concentration

Relaxing the constraint ∆_X ≼ cX + v I:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (X, X′) be a bounded matrix Stein pair with X ∈ H^d. Define the function

    r(ψ) = (1/ψ) log E tr( e^{ψ∆_X} / d ) for each ψ > 0.

Then, for all t ≥ 0 and all ψ > 0,

    P{ λ_max(X) ≥ t } ≤ d · exp{ −t² / ( 2 r(ψ) + 2t/√ψ ) }.

r(ψ) measures the typical magnitude of the conditional variance:

    E λ_max(∆_X) ≤ inf_{ψ>0} [ r(ψ) + (log d)/ψ ].

When d = 1, improves the scalar result of Chatterjee (2008).
The proof extends to unbounded random matrices.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 23 / 35
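A small check of the remark that r(ψ) + (log d)/ψ dominates λ_max (a sketch with a deterministic PSD stand-in Δ for the conditional variance, so the expectation inside r(ψ) is computed exactly; the matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
M = rng.normal(size=(d, d))
Delta = M @ M.T / d                       # fixed PSD stand-in for Delta_X
w = np.linalg.eigvalsh(Delta)
lam = w.max()

def r(psi):
    # (1/psi) log tr(e^{psi Delta} / d), computed stably via eigenvalues.
    return (psi * lam + np.log(np.exp(psi * (w - lam)).sum()) - np.log(d)) / psi

bounds = [r(psi) + np.log(d) / psi for psi in np.linspace(0.5, 60, 300)]
assert min(bounds) >= lam - 1e-9   # inf_psi [r(psi) + log(d)/psi] >= lambda_max(Delta)
assert min(bounds) - lam < 0.5     # ...and is nearly tight for large psi
```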

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (Y_k)_{k≥1} be independent matrices in H^d satisfying

    E Y_k = 0 and ‖Y_k‖ ≤ R for each index k.

Define the variance parameter

    σ² = ‖ ∑_k E Y_k² ‖.

Then, for all t ≥ 0,

    P{ λ_max(∑_k Y_k) ≥ t } ≤ d · exp{ −t² / (3σ² + 2Rt) }.

Gaussian tail controlled by the improved variance parameter when t is small.
Key proof idea: apply the refined concentration result and bound r(ψ) = (1/ψ) log E tr( e^{ψ∆_X} / d ) using the unrefined result.
Constants better than Oliveira (2009), worse than Tropp (2011).

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 24 / 35
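A sketch of the bound's shape for a concrete family of bounded summands (Y_k = ε_k A_k with ‖A_k‖ = 1, so R = 1 and E Y_k² = A_k²; the matrices are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 100
As = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    H = (M + M.T) / 2
    As.append(H / np.linalg.norm(H, 2))   # normalize so ||Y_k|| <= R = 1

R = 1.0
sigma2 = np.linalg.norm(sum(A @ A for A in As), 2)   # || sum_k E Y_k^2 ||

def bernstein_bound(t):
    return d * np.exp(-t**2 / (3 * sigma2 + 2 * R * t))

ts = np.linspace(0.0, 5 * np.sqrt(sigma2), 60)
vals = [bernstein_bound(t) for t in ts]
assert np.isclose(vals[0], d)                         # trivial bound d at t = 0
assert all(a >= b for a, b in zip(vals, vals[1:]))    # decreasing in t
```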

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let p = 1 or p ≥ 1.5. Suppose that (X, X′) is a matrix Stein pair where E tr |X|^{2p} < ∞. Then

    ( E tr |X|^{2p} )^{1/2p} ≤ √(2p − 1) · ( E tr ∆_X^p )^{1/2p}.

Moral: The conditional variance controls the moments of X.

Generalizes Chatterjee's version (2007) of the scalar Burkholder–Davis–Gundy inequality (Burkholder, 1973).
See also Pisier & Xu (1997), Junge & Xu (2003, 2008).
Proof techniques mirror those for exponential concentration.
Also holds for infinite-dimensional Schatten-class operators.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 25 / 35

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (ε_k)_{k≥1} be an independent sequence of Rademacher random variables and (A_k)_{k≥1} a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,

    E tr( ∑_k ε_k A_k )^{2p} ≤ (2p − 1)^p · tr( ∑_k A_k² )^p.

The noncommutative Khintchine inequality (Lust-Piquard, 1986; Lust-Piquard and Pisier, 1991) is a dominant tool in applied matrix analysis.
E.g., used in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007).
Stein's method offers an unusually concise proof.
The constant √(2p − 1) is within √e of optimal.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 26 / 35
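The inequality can be verified exactly for small examples by enumerating all 2^n sign patterns (a sketch; n = 4 random Hermitian coefficients and p = 2 are arbitrary choices):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
d, n, p = 3, 4, 2
As = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    As.append((M + M.T) / 2)

# Exact E tr (sum_k eps_k A_k)^{2p} over all 2^n equally likely sign patterns.
lhs = np.mean([
    np.trace(np.linalg.matrix_power(sum(e * A for e, A in zip(eps, As)), 2 * p))
    for eps in product([1.0, -1.0], repeat=n)
])
rhs = (2 * p - 1) ** p * np.trace(
    np.linalg.matrix_power(sum(A @ A for A in As), p))
assert lhs <= rhs + 1e-9     # E tr(sum eps_k A_k)^{2p} <= (2p-1)^p tr(sum A_k^2)^p
```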

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration
2. Stein's Method: Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums
6. Extensions

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 27 / 35

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Y_k)_{k=1}^n satisfying the conditional zero-mean property

    E[Y_k | (Y_j)_{j≠k}] = 0 for all k,

define the random sum X = ∑_{k=1}^n Y_k.

Note: (Y_k)_{k≥1} is a martingale difference sequence.

Examples:
Sums of independent, centered random matrices.
Many sums of conditionally independent random matrices:

    Y_k ⫫ (Y_j)_{j≠k} | Z and E[Y_k | Z] = 0.

Rademacher series with random matrix coefficients:

    X = ∑_k ε_k W_k, with (W_k)_{k≥1} Hermitian and (ε_k)_{k≥1} independent Rademacher.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 28 / 35

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-Mean Property): E[Y_k | (Y_j)_{j≠k}] = 0.

Matrix Stein Pair for X = ∑_{k=1}^n Y_k:
Let Y′_k and Y_k be conditionally i.i.d. given (Y_j)_{j≠k}.
Draw an index K uniformly from {1, …, n}.
Define X′ = X + Y′_K − Y_K.

Check the Stein pair condition:

    E[X − X′ | (Y_j)_{j≥1}] = E[Y_K − Y′_K | (Y_j)_{j≥1}]
                            = (1/n) ∑_{k=1}^n ( Y_k − E[Y′_k | (Y_j)_{j≠k}] )
                            = (1/n) ∑_{k=1}^n Y_k = (1/n) X.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 29 / 35
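The Stein pair condition can be checked by exact enumeration in the simplest instance, a Rademacher series Y_k = ε_k A_k (a sketch; the coefficients and the fixed sign realization are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 3, 5
As = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    As.append((M + M.T) / 2)
eps = rng.choice([1.0, -1.0], size=n)     # a fixed realization of the signs

X = sum(e * A for e, A in zip(eps, As))

# E[X - X' | (Y_j)]: average over the uniform index K and the resampled
# sign eps'_K in {+1, -1} (a conditionally i.i.d. copy of eps_K).
diff = sum((eps[k] - ep) * As[k] / (2 * n)
           for k in range(n) for ep in (1.0, -1.0))
assert np.allclose(diff, X / n)           # matches E[X - X' | .] = (1/n) X
```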

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-Mean Property): E[Y_k | (Y_j)_{j≠k}] = 0.

Conditional Variance for X = ∑_{k=1}^n Y_k:

    ∆_X = (n/2) · E[ (X − X′)² | (Y_j)_{j≥1} ]
        = (n/2) · E[ (Y_K − Y′_K)² | (Y_j)_{j≥1} ]
        = (1/2) ∑_{k=1}^n ( Y_k² + E[Y_k² | (Y_j)_{j≠k}] ).

⇒ The conditional variance is controlled when the summands are bounded.
⇒ Dependent analogues of the concentration and moment inequalities follow.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 30 / 35
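Continuing the Rademacher-series instance, the conditional variance formula can also be checked by enumeration (a sketch; here both expressions reduce to ∑_k A_k²):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 3, 5
As = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    As.append((M + M.T) / 2)
eps = rng.choice([1.0, -1.0], size=n)

# Delta_X = (n/2) E[(Y_K - Y'_K)^2 | signs], averaging over K and eps'_K.
Dx = (n / 2) * sum(
    ((eps[k] - ep) ** 2) * (As[k] @ As[k]) / (2 * n)
    for k in range(n) for ep in (1.0, -1.0))
# Closed form (1/2) sum_k (Y_k^2 + E[Y_k^2 | rest]) = sum_k A_k^2 here.
assert np.allclose(Dx, sum(A @ A for A in As))
```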

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (A_{jk})_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic

    Y = ∑_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) ∑_{j,k=1}^n A_{jk}.

Generalizes the scalar statistics studied by Hoeffding (1951).

Example: Sampling without replacement from {B_1, …, B_n}:

    W = ∑_{j=1}^s B_{π(j)}.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 31 / 35

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic): Y = ∑_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) ∑_{j,k=1}^n A_{jk}.

Matrix Stein Pair for X = Y − E Y:
Draw a pair of indices (J, K) uniformly from {1, …, n}².
Define π′ = π ∘ (J, K) and X′ = ∑_{j=1}^n A_{jπ′(j)} − E Y.

Check the Stein pair condition:

    E[X − X′ | π] = E[ A_{Jπ(J)} + A_{Kπ(K)} − A_{Jπ(K)} − A_{Kπ(J)} | π ]
                  = (1/n²) ∑_{j,k=1}^n ( A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)} )
                  = (2/n)(Y − E Y) = (2/n) X.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 32 / 35
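The combinatorial Stein pair condition can be checked exactly for a fixed permutation by averaging the transposition increment over all n² index pairs (a sketch; the array of Hermitian blocks is randomly generated):

```python
import numpy as np

rng = np.random.default_rng(8)
d, n = 2, 4
A = rng.normal(size=(n, n, d, d))
A = (A + np.swapaxes(A, -1, -2)) / 2      # make every block A[j, k] Hermitian

pi = rng.permutation(n)                   # a fixed permutation realization
Y = sum(A[j, pi[j]] for j in range(n))
EY = A.sum(axis=(0, 1)) / n               # (1/n) sum_{j,k} A[j, k]
X = Y - EY

# E[X - X' | pi]: average the swap increment over all index pairs (J, K).
diff = sum(A[j, pi[j]] + A[k, pi[k]] - A[j, pi[k]] - A[k, pi[j]]
           for j in range(n) for k in range(n)) / n**2
assert np.allclose(diff, 2 * X / n)       # matches E[X - X' | pi] = (2/n) X
```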

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic): Y = ∑_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) ∑_{j,k=1}^n A_{jk}.

Conditional Variance for X = Y − E Y:

    ∆_X(π) = (n/4) E[ (X − X′)² | π ]
           = (1/4n) ∑_{j,k=1}^n [ A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)} ]²
           ≼ (1/n) ∑_{j,k=1}^n [ A_{jπ(j)}² + A_{kπ(k)}² + A_{jπ(k)}² + A_{kπ(j)}² ].

⇒ The conditional variance is controlled when the summands are bounded.
⇒ Dependent analogues of the concentration and moment inequalities follow.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 33 / 35

Extensions

General Complex Matrices:
Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via the dilation

    D(B) = [[0, B], [B*, 0]] ∈ H^{d1+d2}.

Preserves spectral information: λ_max(D(B)) = ‖B‖.
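The dilation and its spectral identity are easy to verify numerically (a sketch; B is an arbitrary 3×5 real matrix):

```python
import numpy as np

rng = np.random.default_rng(9)
B = rng.normal(size=(3, 5))

# Hermitian dilation D(B) = [[0, B], [B^*, 0]] in H^{d1+d2}.
D = np.block([[np.zeros((3, 3)), B], [B.T, np.zeros((5, 5))]])
assert np.allclose(D, D.T)                                       # Hermitian
assert np.isclose(np.linalg.eigvalsh(D).max(), np.linalg.norm(B, 2))  # = ||B||
```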

Beyond Sums:
Matrix-valued functions satisfying a self-reproducing property.
E.g., matrix second-order Rademacher chaos: ∑_{j,k} ε_j ε_k A_{jk}.
Yields a dependent bounded-differences inequality for matrices.

Generalized Matrix Stein Pairs:
Satisfy E[g(X) − g(X′) | Z] = αX almost surely, for g : R → R weakly increasing.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 34 / 35

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 35 / 35

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Available at http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. doi:10.1007/s10107-006-0033-0.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007.

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 36 / 35


Background

Notation

Hermitian matrices: H^d = { A ∈ C^{d×d} : A = A* }. All matrices in this talk are Hermitian.
Maximum eigenvalue: λ_max(·).
Trace: tr B, the sum of the diagonal entries of B.
Spectral norm: ‖B‖, the maximum singular value of B.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 12 / 35

Matrix Stein Pair

Definition (Exchangeable Pair)

(Z, Z′) is an exchangeable pair if (Z, Z′) =_d (Z′, Z).

Definition (Matrix Stein Pair)

Let (Z, Z′) be an exchangeable pair, and let Ψ : Z → H^d be a measurable function. Define the random matrices

    X = Ψ(Z) and X′ = Ψ(Z′).

(X, X′) is a matrix Stein pair with scale factor α ∈ (0, 1] if

    E[X′ | Z] = (1 − α) X.

Matrix Stein pairs are exchangeable pairs.
Matrix Stein pairs always have zero mean.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 13 / 35

The Conditional Variance

Definition (Conditional Variance)

Suppose that (X, X′) is a matrix Stein pair with scale factor α, constructed from the exchangeable pair (Z, Z′). The conditional variance is the random matrix

    ∆_X = ∆_X(Z) = (1/2α) E[ (X − X′)² | Z ].

∆_X is a stochastic estimate for the variance E X².

Take-home Message: Control over ∆_X yields control over λ_max(X).

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 14 / 35

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (X, X′) be a matrix Stein pair with X ∈ H^d. Suppose that

    ∆_X ≼ cX + v I almost surely, for c, v ≥ 0.

Then, for all t ≥ 0,

    P{ λ_max(X) ≥ t } ≤ d · exp{ −t² / (2v + 2ct) }.

Control over the conditional variance ∆_X yields a Gaussian tail for λ_max(X) for small t and an exponential tail for large t.
When d = 1, improves the scalar result of Chatterjee (2007).
The dimensional factor d cannot be removed.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 15 / 35

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let X =sum

k Yk for independent matrices in Hd satisfying

EYk = 0 and Y 2k 4 A2

k

for deterministic matrices (Ak)kge1 Define the variance parameter

σ2 =∥

sum

kA2k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot eminust22σ2

Improves upon the matrix Hoeffding inequality of Tropp (2011)Optimal constant 12 in the exponent

Can replace variance parameter with σ2 = 12

sum

k

(

A2k + EY 2

k

)∥

Tighter than classical Hoeffding inequality (1963) when d = 1

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 16 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

1 Matrix Laplace transform method (Ahlswede amp Winter 2002)

Relate tail probability to the trace of the mgf of X

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ)

where m(θ) = E tr eθX

Problem eX+Y 6= eXeY when XY isin Hd

How to bound the trace mgf

Past approaches Golden-Thompson Liebrsquos concavity theorem

Chatterjeersquos strategy for scalar concentration

Control mgf growth by bounding derivative

mprime(θ) = E trXeθX for θ isin R

Rewrite using exchangeable pairs

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 17 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that (XX prime) is a matrix Stein pair with scale factor α LetF Hd rarr Hd be a measurable function satisfying

E(X minusX prime)F (X) ltinfin

Then

E[X F (X)] =1

2αE[(X minusX prime)(F (X)minus F (X prime))] (1)

Intuition

Can characterize the distribution of a random matrix byintegrating it against a class of test functions F

Eq 1 allows us to estimate this integral using the smoothnessproperties of F and the discrepancy X minusX prime

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 18 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

2 Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf

mprime(θ) = E trXeθX =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

Goal Use the smoothness of F (X) = eθX to bound the derivative

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 19 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey Jordan Chen Farrell and Tropp 2012)

Suppose that g R rarr R is a weakly increasing function and thath R rarr R is a function whose derivative hprime is convex For allmatrices AB isin Hd it holds that

tr[(g(A)minus g(B)) middot (h(A)minus h(B))] le1

2tr[(g(A)minus g(B)) middot (AminusB) middot (hprime(A) + hprime(B))]

Standard matrix functions If g R rarr R and

A = Q

λ1

λd

Qlowast then g(A) = Q

g(λ1)

g(λd)

Qlowast

Inequality does not hold without the traceFor exponential concentration we let g(A) = A and h(B) = eθB

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 20 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

le θ

4αE tr

[

(X minusX prime)2 middot(

eθX + eθXprime)]

2αE tr

[

(X minusX prime)2 middot eθX]

= θ middot E tr

[

1

2αE[

(X minusX prime)2 |Z]

middot eθX]

= θ middot E tr[

∆X eθX]

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 21 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008


Background

Matrix Stein Pair

Definition (Exchangeable Pair)

(Z, Z′) is an exchangeable pair if (Z, Z′) and (Z′, Z) have the same distribution.

Definition (Matrix Stein Pair)

Let (Z, Z′) be an exchangeable pair, and let Ψ : Z → H^d be a measurable function. Define the random matrices

    X = Ψ(Z)  and  X′ = Ψ(Z′).

(X, X′) is a matrix Stein pair with scale factor α ∈ (0, 1] if

    E[X′ | Z] = (1 − α) X.

Matrix Stein pairs are exchangeable pairs.

Matrix Stein pairs always have zero mean.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 13 / 35

Background

The Conditional Variance

Definition (Conditional Variance)

Suppose that (X, X′) is a matrix Stein pair with scale factor α, constructed from the exchangeable pair (Z, Z′). The conditional variance is the random matrix

    ∆X = ∆X(Z) = (1/(2α)) · E[(X − X′)² | Z].

∆X is a stochastic estimate for the variance E X².

Take-home Message

Control over ∆X yields control over λmax(X).

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 14 / 35

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (X, X′) be a matrix Stein pair with X ∈ H^d. Suppose that

    ∆X ≼ cX + vI almost surely, for c, v ≥ 0.

Then, for all t ≥ 0,

    P{λmax(X) ≥ t} ≤ d · exp{−t² / (2v + 2ct)}.

Control over the conditional variance ∆X yields a Gaussian tail for λmax(X) for small t and an exponential tail for large t.

When d = 1, this improves the scalar result of Chatterjee (2007).

The dimensional factor d cannot be removed.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 15 / 35
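The two regimes of the tail bound are easy to see numerically. A minimal sketch (the function name and parameter values are my own, not from the slides):

```python
import numpy as np

def stein_exp_tail(t, d, c, v):
    """Tail bound d * exp(-t^2 / (2v + 2ct)) on P{lambda_max(X) >= t}."""
    return d * np.exp(-t**2 / (2 * v + 2 * c * t))

d, c, v = 10, 0.5, 1.0

# Small t: the denominator is dominated by 2v, so the bound tracks the
# Gaussian tail d * exp(-t^2 / 2v).  Large t: the denominator is dominated
# by 2ct, so the bound decays like an exponential tail of order exp(-t / 2c).
small = stein_exp_tail(0.2, d, c, v)
gauss = d * np.exp(-0.2**2 / (2 * v))
```

For t = 0.2 the two quantities agree to within a fraction of a percent, while for large t the bound decays only linearly in the exponent.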

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let X = Σk Yk for independent matrices in H^d satisfying

    E Yk = 0 and Yk² ≼ Ak²

for deterministic matrices (Ak)k≥1. Define the variance parameter

    σ² = ‖ Σk Ak² ‖.

Then, for all t ≥ 0,

    P{ λmax( Σk Yk ) ≥ t } ≤ d · e^{−t²/(2σ²)}.

Improves upon the matrix Hoeffding inequality of Tropp (2011): optimal constant 1/2 in the exponent.

Can replace the variance parameter with σ² = (1/2) ‖ Σk (Ak² + E Yk²) ‖.

Tighter than the classical Hoeffding inequality (1963) when d = 1.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 16 / 35
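The corollary can be sanity-checked by simulation with summands Yk = εk·Ak, for which Yk² = Ak² holds exactly. This is only a sketch; the dimensions, scaling, threshold t, and seed below are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials, t = 5, 200, 2000, 4.0

# Fixed Hermitian A_k; the summands Y_k = eps_k * A_k satisfy Y_k^2 = A_k^2.
G = rng.standard_normal((n, d, d))
A = (G + G.transpose(0, 2, 1)) / (2 * np.sqrt(n))

sigma2 = np.linalg.norm(np.sum(A @ A, axis=0), 2)  # || sum_k A_k^2 ||, spectral norm
bound = d * np.exp(-t**2 / (2 * sigma2))

eps = rng.choice([-1.0, 1.0], size=(trials, n))
X = np.einsum('tk,kij->tij', eps, A)               # X = sum_k eps_k A_k, one per trial
lam_max = np.linalg.eigvalsh(X)[:, -1]             # eigvalsh sorts ascending
freq = np.mean(lam_max >= t)                       # empirical tail frequency
```

The empirical frequency `freq` should fall well below `bound`, since the inequality is not tight for generic coefficient matrices.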

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

1. Matrix Laplace transform method (Ahlswede & Winter, 2002)

Relate the tail probability to the trace of the mgf of X:

    P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ), where m(θ) = E tr e^{θX}.

Problem: e^{X+Y} ≠ e^X e^Y when X, Y ∈ H^d. How to bound the trace mgf?

Past approaches: Golden–Thompson inequality, Lieb's concavity theorem.

Chatterjee's strategy for scalar concentration: control mgf growth by bounding the derivative

    m′(θ) = E tr X e^{θX} for θ ∈ R,

rewritten using exchangeable pairs.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 17 / 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that (X, X′) is a matrix Stein pair with scale factor α. Let F : H^d → H^d be a measurable function satisfying

    E ‖(X − X′) F(X)‖ < ∞.

Then

    E[X F(X)] = (1/(2α)) · E[(X − X′)(F(X) − F(X′))].    (1)

Intuition

The distribution of a random matrix can be characterized by integrating it against a class of test functions F.

Eq. (1) allows us to estimate this integral using the smoothness properties of F and the discrepancy X − X′.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 18 / 35

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf:

    m′(θ) = E tr X e^{θX} = (1/(2α)) · E tr[(X − X′)(e^{θX} − e^{θX′})].

Goal: Use the smoothness of F(X) = e^{θX} to bound the derivative.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 19 / 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Suppose that g : R → R is a weakly increasing function and that h : R → R is a function whose derivative h′ is convex. For all matrices A, B ∈ H^d, it holds that

    tr[(g(A) − g(B)) · (h(A) − h(B))]
        ≤ (1/2) tr[(g(A) − g(B)) · (A − B) · (h′(A) + h′(B))].

Standard matrix functions: if g : R → R and A = Q diag(λ1, …, λd) Q*, then g(A) = Q diag(g(λ1), …, g(λd)) Q*.

The inequality does not hold without the trace.

For exponential concentration, we let g(A) = A and h(B) = e^{θB}.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 20 / 35
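The lemma with g(A) = A and h(A) = e^{θA} can be spot-checked numerically. The helper `apply_spectral` below implements the standard matrix function from the slide; the helper name, matrix sizes, and seed are my own:

```python
import numpy as np

def apply_spectral(f, A):
    """Standard matrix function: apply f to the eigenvalues of Hermitian A."""
    w, Q = np.linalg.eigh(A)
    return (Q * f(w)) @ Q.conj().T

rng = np.random.default_rng(1)
d, theta = 6, 0.7
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
B = rng.standard_normal((d, d)); B = (B + B.T) / 2

# g(A) = A is weakly increasing; h(x) = e^{theta x} has convex h' for theta >= 0.
eA = apply_spectral(lambda x: np.exp(theta * x), A)          # h(A)
eB = apply_spectral(lambda x: np.exp(theta * x), B)          # h(B)
dA = apply_spectral(lambda x: theta * np.exp(theta * x), A)  # h'(A)
dB = apply_spectral(lambda x: theta * np.exp(theta * x), B)  # h'(B)

lhs = np.trace((A - B) @ (eA - eB)).real
rhs = 0.5 * np.trace((A - B) @ (A - B) @ (dA + dB)).real
```

Any Hermitian A, B should satisfy `lhs <= rhs` up to floating-point error.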

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

    m′(θ) = (1/(2α)) · E tr[(X − X′)(e^{θX} − e^{θX′})]
          ≤ (θ/(4α)) · E tr[(X − X′)² · (e^{θX} + e^{θX′})]
          = (θ/(2α)) · E tr[(X − X′)² · e^{θX}]
          = θ · E tr[ (1/(2α)) · E[(X − X′)² | Z] · e^{θX} ]
          = θ · E tr[∆X e^{θX}]

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 21 / 35

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

    m′(θ) ≤ θ · E tr[∆X e^{θX}]

4. Conditional Variance Bound: ∆X ≼ cX + vI

This yields a differential inequality:

    m′(θ) ≤ cθ E tr[X e^{θX}] + vθ E tr[e^{θX}] = cθ · m′(θ) + vθ · m(θ).

Solve to bound m(θ), and thereby bound

    P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{−t² / (2v + 2ct)}.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 22 / 35
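The "solve" step can be spelled out. One standard route, sketched here in a form consistent with the theorem's bound:

```latex
% From  m'(\theta) \le c\theta\, m'(\theta) + v\theta\, m(\theta),
% rearrange for 0 \le \theta < 1/c:
\frac{d}{d\theta}\log m(\theta) = \frac{m'(\theta)}{m(\theta)}
  \le \frac{v\theta}{1 - c\theta}.
% Integrate from 0, using m(0) = \operatorname{tr} I = d and
% 1/(1-cs) \le 1/(1-c\theta) for 0 \le s \le \theta:
\log m(\theta) \le \log d + \frac{v\theta^2}{2(1 - c\theta)}.
% Substitute into the Laplace transform bound and choose
% \theta = t/(v + ct) \in [0, 1/c):
\mathbb{P}\{\lambda_{\max}(X) \ge t\}
  \le d \cdot \exp\!\Big\{-\theta t + \frac{v\theta^2}{2(1 - c\theta)}\Big\}
  = d \cdot \exp\!\Big\{\frac{-t^2}{2v + 2ct}\Big\}.
```

With this choice of θ, one checks 1 − cθ = v/(v + ct), so the second exponent term equals t²/(2(v + ct)) and the stated bound follows.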

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X ≼ cX + vI:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (X, X′) be a bounded matrix Stein pair with X ∈ H^d. Define the function

    r(ψ) = (1/ψ) log E tr(e^{ψ∆X}/d) for each ψ > 0.

Then, for all t ≥ 0 and all ψ > 0,

    P{λmax(X) ≥ t} ≤ d · exp{−t² / (2r(ψ) + 2t/√ψ)}.

r(ψ) measures the typical magnitude of the conditional variance:

    E λmax(∆X) ≤ inf_{ψ>0} [ r(ψ) + (log d)/ψ ].

When d = 1, this improves the scalar result of Chatterjee (2008).

The proof extends to unbounded random matrices.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 23 / 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (Yk)k≥1 be independent matrices in H^d satisfying

    E Yk = 0 and ‖Yk‖ ≤ R for each index k.

Define the variance parameter

    σ² = ‖ Σk E Yk² ‖.

Then, for all t ≥ 0,

    P{ λmax( Σk Yk ) ≥ t } ≤ d · exp{−t² / (3σ² + 2Rt)}.

Gaussian tail controlled by the improved variance parameter when t is small.

Key proof idea: Apply the refined concentration result and bound r(ψ) = (1/ψ) log E tr(e^{ψ∆X}/d) using the unrefined result.

Constants: better than Oliveira (2009), worse than Tropp (2011).

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 24 / 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let p = 1 or p ≥ 1.5. Suppose that (X, X′) is a matrix Stein pair where E tr |X|^{2p} < ∞. Then

    ( E tr |X|^{2p} )^{1/(2p)} ≤ √(2p − 1) · ( E tr ∆X^p )^{1/(2p)}.

Moral: The conditional variance controls the moments of X.

Generalizes Chatterjee's version (2007) of the scalar Burkholder–Davis–Gundy inequality (Burkholder, 1973).

See also Pisier & Xu (1997) and Junge & Xu (2003, 2008).

Proof techniques mirror those for exponential concentration.

Also holds for infinite-dimensional Schatten-class operators.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 25 / 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (εk)k≥1 be an independent sequence of Rademacher random variables and (Ak)k≥1 a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,

    E tr( Σk εk Ak )^{2p} ≤ (2p − 1)^p · tr( Σk Ak² )^p.

The noncommutative Khintchine inequality (Lust-Piquard, 1986; Lust-Piquard and Pisier, 1991) is a dominant tool in applied matrix analysis, e.g., in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007).

Stein's method offers an unusually concise proof.

The constant √(2p − 1) is within √e of optimal.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 26 / 35
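With a small number of summands, the expectation over the Rademacher signs can be computed exactly by enumerating all sign patterns, so the corollary can be checked without Monte Carlo error. A sketch with arbitrary sizes and seed of my choosing:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d, p = 8, 4, 2                       # p = 2 satisfies p >= 1.5
G = rng.standard_normal((n, d, d))
A = (G + G.transpose(0, 2, 1)) / 2      # deterministic Hermitian A_k

def tr_pow(M, k):
    """tr(M^k) for a Hermitian matrix M."""
    return np.trace(np.linalg.matrix_power(M, k)).real

# Exact E tr (sum_k eps_k A_k)^{2p} over all 2^n Rademacher sign patterns.
total = 0.0
for signs in itertools.product((-1.0, 1.0), repeat=n):
    X = np.einsum('k,kij->ij', np.array(signs), A)
    total += tr_pow(X, 2 * p)
lhs = total / 2**n

rhs = (2 * p - 1)**p * tr_pow(np.sum(A @ A, axis=0), p)
```

By the corollary, `lhs <= rhs` must hold exactly (up to floating-point error), and generically with considerable slack.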

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration

2. Stein's Method: Background and Notation

3. Exponential Tail Inequalities

4. Polynomial Moment Inequalities

5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums

6. Extensions

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 27 / 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk), k = 1, …, n, satisfying the conditional zero-mean property

    E[Yk | (Yj)j≠k] = 0 for all k,

define the random sum X = Σⁿk=1 Yk.

Note: (Yk)k≥1 is a martingale difference sequence.

Examples

Sums of independent centered random matrices.

Many sums of conditionally independent random matrices:

    Yk independent of (Yj)j≠k given Z, and E[Yk | Z] = 0.

Rademacher series with random matrix coefficients:

    X = Σk εk Wk,

with (Wk)k≥1 Hermitian and (εk)k≥1 independent Rademacher.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 28 / 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-Mean Property)

    E[Yk | (Yj)j≠k] = 0

Matrix Stein Pair for X = Σⁿk=1 Yk

Let Y′k and Yk be conditionally i.i.d. given (Yj)j≠k.

Draw an index K uniformly from {1, …, n}, and define X′ = X + Y′K − YK.

Check the Stein pair condition:

    E[X − X′ | (Yj)j≥1] = E[YK − Y′K | (Yj)j≥1]
                        = (1/n) Σⁿk=1 ( Yk − E[Y′k | (Yj)j≠k] )
                        = (1/n) Σⁿk=1 Yk = (1/n) X.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 29 / 35
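For the Rademacher-series example with fixed Hermitian coefficients, the Stein pair condition E[X − X′ | ε] = X/n (i.e., α = 1/n) can be verified exactly by averaging over the uniform index K and the two values of the resampled sign. The sizes and seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 6, 4
G = rng.standard_normal((n, d, d))
W = (G + G.transpose(0, 2, 1)) / 2            # fixed Hermitian coefficients W_k
eps = rng.choice([-1.0, 1.0], size=n)
X = np.einsum('k,kij->ij', eps, W)            # X = sum_k eps_k W_k

# X' replaces a uniformly chosen eps_K by an independent Rademacher copy.
# Average X - X' exactly: K uniform on n indices, new sign uniform on {-1, +1}.
diff = np.zeros((d, d))
for K in range(n):
    for new_sign in (-1.0, 1.0):
        X_prime = X + (new_sign - eps[K]) * W[K]
        diff += (X - X_prime) / (2 * n)
```

The accumulated conditional expectation `diff` equals X/n exactly, since the resampled coordinate has mean zero.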

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-Mean Property)

    E[Yk | (Yj)j≠k] = 0

Conditional Variance for X = Σⁿk=1 Yk

    ∆X = (n/2) · E[(X − X′)² | (Yj)j≥1]
       = (n/2) · E[(YK − Y′K)² | (Yj)j≥1]
       = (1/2) Σⁿk=1 ( Yk² + E[Yk² | (Yj)j≠k] )

⇒ The conditional variance is controlled when the summands are bounded.

⇒ Dependent analogues of the concentration and moment inequalities follow.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 30 / 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk), j, k = 1, …, n, of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic

    Y = Σⁿj=1 A_{j,π(j)}, with mean E Y = (1/n) Σⁿj,k=1 Ajk.

Generalizes the scalar statistics studied by Hoeffding (1951).

Example

Sampling without replacement from {B1, …, Bn}:  W = Σˢj=1 B_{π(j)}.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 31 / 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

    Y = Σⁿj=1 A_{j,π(j)}, with mean E Y = (1/n) Σⁿj,k=1 Ajk.

Matrix Stein Pair for X = Y − E Y

Draw indices (J, K) uniformly from {1, …, n}².

Define π′ = π ∘ (J, K) and X′ = Σⁿj=1 A_{j,π′(j)} − E Y.

Check the Stein pair condition:

    E[X − X′ | π] = E[ A_{J,π(J)} + A_{K,π(K)} − A_{J,π(K)} − A_{K,π(J)} | π ]
                  = (1/n²) Σⁿj,k=1 ( A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)} )
                  = (2/n)(Y − E Y) = (2/n) X.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 32 / 35
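The identity E[X − X′ | π] = (2/n)X can also be confirmed exactly in code by averaging over all n² choices of the transposed pair (J, K); array sizes and seed are my own:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 3
G = rng.standard_normal((n, n, d, d))
A = (G + G.transpose(0, 1, 3, 2)) / 2          # deterministic Hermitian array (A_jk)

pi = rng.permutation(n)
EY = A.sum(axis=(0, 1)) / n                    # EY = (1/n) sum_{j,k} A_jk
Y = sum(A[j, pi[j]] for j in range(n))
X = Y - EY

# Average X - X' exactly over the n^2 uniform choices of (J, K), where
# pi' is pi composed with the transposition (J, K).  J = K leaves pi fixed.
acc = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        pi2 = pi.copy()
        pi2[J], pi2[K] = pi2[K], pi2[J]
        X_prime = sum(A[j, pi2[j]] for j in range(n)) - EY
        acc += (X - X_prime) / n**2
```

The average `acc` matches (2/n)·X to machine precision, confirming the scale factor α = 2/n.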

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

    Y = Σⁿj=1 A_{j,π(j)}, with mean E Y = (1/n) Σⁿj,k=1 Ajk.

Conditional Variance for X = Y − E Y

    ∆X(π) = (n/4) · E[(X − X′)² | π]
          = (1/(4n)) Σⁿj,k=1 [ A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)} ]²
          ≼ (1/n) Σⁿj,k=1 [ A²_{j,π(j)} + A²_{k,π(k)} + A²_{j,π(k)} + A²_{k,π(j)} ]

⇒ The conditional variance is controlled when the summands are bounded.

⇒ Dependent analogues of the concentration and moment inequalities follow.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 33 / 35

Extensions

Extensions

General Complex Matrices

Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via dilation:

    D(B) = [ 0   B
             B*  0 ] ∈ H^{d1+d2}.

Preserves spectral information: λmax(D(B)) = ‖B‖.

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property, e.g., the matrix second-order Rademacher chaos Σ_{j,k} εj εk Ajk.

Yields a dependent bounded differences inequality for matrices.

Generalized Matrix Stein Pairs

Satisfy E[g(X) − g(X′) | Z] = αX almost surely, for g : R → R weakly increasing.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 34 / 35

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans Cp (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. doi:10.1007/s10107-006-0033-0.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007.

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35


Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 15: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Exponential Tail Inequalities

Exponential Concentration for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)

Let (X, X′) be a matrix Stein pair with X ∈ H^d. Suppose that

ΔX ≼ cX + v·I almost surely, for some c, v ≥ 0.

Then, for all t ≥ 0,

P{λmax(X) ≥ t} ≤ d · exp{−t² / (2v + 2ct)}.

Control over the conditional variance ΔX yields a Gaussian tail for λmax(X) for small t and an exponential tail for large t.

When d = 1, this improves the scalar result of Chatterjee (2007).

The dimensional factor d cannot be removed.

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 15 35

Matrix Hoeffding Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp 2012)

Let X = Σ_k Y_k for independent matrices in H^d satisfying

E Y_k = 0 and Y_k² ≼ A_k²

for deterministic matrices (A_k)_{k≥1}. Define the variance parameter

σ² = ∥Σ_k A_k²∥.

Then, for all t ≥ 0,

P{λmax(Σ_k Y_k) ≥ t} ≤ d · e^{−t²/(2σ²)}.

Improves upon the matrix Hoeffding inequality of Tropp (2011):

Optimal constant 1/2 in the exponent
Can replace the variance parameter with σ² = (1/2)·∥Σ_k (A_k² + E Y_k²)∥

Tighter than the classical Hoeffding inequality (1963) when d = 1.
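The corollary can be checked numerically. The sketch below is not from the slides; it assumes NumPy, and the dimensions, seed, and test points t are arbitrary choices. Taking Y_k = ε_k·A_k for Rademacher signs gives Y_k² = A_k² exactly, so the hypotheses hold, and the exact tail of λmax over all 2^n sign patterns can be compared against d·e^{−t²/(2σ²)}:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 6
A = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    A.append((M + M.T) / 2)          # Hermitian (real symmetric) coefficients

# With Y_k = eps_k * A_k, Y_k^2 = A_k^2, so the corollary applies directly.
sigma2 = np.linalg.eigvalsh(sum(Ak @ Ak for Ak in A)).max()  # ||sum_k A_k^2||

# Exact law of lambda_max over all 2^n sign patterns.
eigs = [np.linalg.eigvalsh(sum(e * Ak for e, Ak in zip(eps, A)))
        for eps in itertools.product([-1, 1], repeat=n)]
ts = [np.sqrt(sigma2), 2 * np.sqrt(sigma2), 3 * np.sqrt(sigma2)]
probs = [np.mean([ev.max() >= t for ev in eigs]) for t in ts]
bounds = [d * np.exp(-t ** 2 / (2 * sigma2)) for t in ts]
for p, b in zip(probs, bounds):
    print(f"P = {p:.4f}  <=  bound = {b:.4f}")
```

The inequality is guaranteed by the corollary, so the comparison holds for every choice of t.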

Exponential Concentration: Proof Sketch

1. Matrix Laplace transform method (Ahlswede & Winter 2002)

Relate the tail probability to the trace mgf of X:

P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ), where m(θ) = E tr e^{θX}.

Problem: e^{X+Y} ≠ e^X · e^Y when X, Y ∈ H^d. How to bound the trace mgf?

Past approaches: Golden–Thompson inequality, Lieb's concavity theorem.

Chatterjee's strategy for scalar concentration:

Control mgf growth by bounding the derivative m′(θ) = E tr[X e^{θX}] for θ ∈ R
Rewrite using exchangeable pairs
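The matrix Laplace transform bound can be seen in action on a toy model. In the sketch below (not from the slides; NumPy and all parameter choices are assumptions), the distribution of X = Σ_k ε_k A_k is enumerated exactly, m(θ) is computed from eigenvalues, and the infimum over θ is approximated on a grid:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 5
A = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    A.append((M + M.T) / 2)

# Exact distribution of X = sum_k eps_k A_k over all 2^n sign patterns.
eigs = [np.linalg.eigvalsh(sum(e * Ak for e, Ak in zip(eps, A)))
        for eps in itertools.product([-1, 1], repeat=n)]

t = 1.5
p_exact = np.mean([ev.max() >= t for ev in eigs])

# m(theta) = E tr e^{theta X}, computed from eigenvalues; scan a theta grid.
thetas = np.linspace(0.01, 5.0, 500)
m = np.array([np.mean([np.exp(th * ev).sum() for ev in eigs]) for th in thetas])
laplace_bound = (np.exp(-thetas * t) * m).min()
print(p_exact, laplace_bound)
```

The bound holds for every θ > 0 by Markov's inequality, since tr e^{θX} ≥ e^{θ·λmax(X)}, so the grid minimum is always a valid upper bound on the exact tail probability.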

Method of Exchangeable Pairs

Lemma

Suppose that (X, X′) is a matrix Stein pair with scale factor α. Let F: H^d → H^d be a measurable function satisfying

E∥(X − X′) · F(X)∥ < ∞.

Then

E[X · F(X)] = (1/2α) · E[(X − X′)(F(X) − F(X′))].   (1)

Intuition:

We can characterize the distribution of a random matrix by integrating it against a class of test functions F.
Eq. (1) allows us to estimate this integral using the smoothness properties of F and the discrepancy X − X′.

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf:

m′(θ) = E tr[X e^{θX}] = (1/2α) · E tr[(X − X′)(e^{θX} − e^{θX′})]

Goal: use the smoothness of F(X) = e^{θX} to bound the derivative.

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp 2012)

Suppose that g: R → R is a weakly increasing function and that h: R → R is a function whose derivative h′ is convex. For all matrices A, B ∈ H^d, it holds that

tr[(g(A) − g(B)) · (h(A) − h(B))] ≤ (1/2) · tr[(g(A) − g(B)) · (A − B) · (h′(A) + h′(B))].

Standard matrix functions: if g: R → R and A = Q · diag(λ₁, …, λ_d) · Q*, then g(A) = Q · diag(g(λ₁), …, g(λ_d)) · Q*.

The inequality does not hold without the trace.

For exponential concentration, we let g(A) = A and h(B) = e^{θB}.
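The lemma is easy to exercise numerically in the case used for exponential concentration, g(A) = A and h(A) = e^{θA} (so h′(A) = θe^{θA}, with convex scalar derivative). The sketch below is an illustration, not part of the slides; NumPy and the sample sizes are assumptions:

```python
import numpy as np

def matrix_fn(A, f):
    """Standard matrix function: apply f to the eigenvalues of Hermitian A."""
    w, Q = np.linalg.eigh(A)
    return (Q * f(w)) @ Q.T

rng = np.random.default_rng(2)
theta, d = 0.7, 4
lhs_vals, rhs_vals = [], []
for _ in range(200):
    A = rng.normal(size=(d, d)); A = (A + A.T) / 2
    B = rng.normal(size=(d, d)); B = (B + B.T) / 2
    expA = matrix_fn(A, lambda w: np.exp(theta * w))   # h(A) = e^{theta A}
    expB = matrix_fn(B, lambda w: np.exp(theta * w))
    # lhs: tr[(A - B)(e^{theta A} - e^{theta B})]  (g(A) = A)
    lhs_vals.append(np.trace((A - B) @ (expA - expB)))
    # rhs: (1/2) tr[(A - B)^2 * (theta e^{theta A} + theta e^{theta B})]
    rhs_vals.append(0.5 * theta * np.trace((A - B) @ (A - B) @ (expA + expB)))
```

Every sampled pair must satisfy lhs ≤ rhs, since the lemma holds deterministically for all Hermitian A, B.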

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

m′(θ) = (1/2α) · E tr[(X − X′)(e^{θX} − e^{θX′})]
     ≤ (θ/4α) · E tr[(X − X′)² · (e^{θX} + e^{θX′})]
     = (θ/2α) · E tr[(X − X′)² · e^{θX}]
     = θ · E tr[(1/2α) · E[(X − X′)² | Z] · e^{θX}]
     = θ · E tr[ΔX · e^{θX}]

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality: bound the derivative of the trace mgf

m′(θ) ≤ θ · E tr[ΔX e^{θX}]

4. Conditional Variance Bound: ΔX ≼ cX + v·I

This yields the differential inequality

m′(θ) ≤ cθ · E tr[X e^{θX}] + vθ · E tr[e^{θX}] = cθ · m′(θ) + vθ · m(θ).

Solve to bound m(θ), and thereby bound

P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{−t² / (2v + 2ct)}.

Refined Exponential Concentration

Relaxing the constraint ΔX ≼ cX + v·I:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)

Let (X, X′) be a bounded matrix Stein pair with X ∈ H^d. Define the function

r(ψ) = (1/ψ) · log E tr(e^{ψΔX}/d) for each ψ > 0.

Then, for all t ≥ 0 and all ψ > 0,

P{λmax(X) ≥ t} ≤ d · exp{−t² / (2r(ψ) + 2t/√ψ)}.

r(ψ) measures the typical magnitude of the conditional variance:

E λmax(ΔX) ≤ inf_{ψ>0} [r(ψ) + (log d)/ψ]

When d = 1, this improves the scalar result of Chatterjee (2008).

The proof extends to unbounded random matrices.

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp 2012)

Let (Y_k)_{k≥1} be independent matrices in H^d satisfying

E Y_k = 0 and ∥Y_k∥ ≤ R for each index k.

Define the variance parameter

σ² = ∥Σ_k E Y_k²∥.

Then, for all t ≥ 0,

P{λmax(Σ_k Y_k) ≥ t} ≤ d · exp{−t² / (3σ² + 2Rt)}.

Gaussian tail controlled by the variance σ² when t is small

Key proof idea: apply the refined concentration result and bound r(ψ) = (1/ψ) · log E tr(e^{ψΔX}/d) using the unrefined result

Constants better than Oliveira (2009), worse than Tropp (2011)
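As with the Hoeffding corollary, the Bernstein bound can be verified exactly on a small Rademacher model. In the sketch below (an illustration with assumed parameters, not from the slides), Y_k = ε_k·A_k gives E Y_k = 0, ∥Y_k∥ ≤ R = max_k ∥A_k∥, and E Y_k² = A_k²:

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
d, n = 3, 6
A = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    A.append((M + M.T) / 2)

# With Y_k = eps_k A_k: E Y_k = 0, ||Y_k|| = ||A_k|| <= R, E Y_k^2 = A_k^2.
R = max(np.abs(np.linalg.eigvalsh(Ak)).max() for Ak in A)
sigma2 = np.linalg.eigvalsh(sum(Ak @ Ak for Ak in A)).max()

eigs = [np.linalg.eigvalsh(sum(e * Ak for e, Ak in zip(eps, A)))
        for eps in itertools.product([-1, 1], repeat=n)]
ts = [0.5 * np.sqrt(sigma2), np.sqrt(sigma2), 2 * np.sqrt(sigma2)]
probs = [np.mean([ev.max() >= t for ev in eigs]) for t in ts]
bounds = [d * np.exp(-t ** 2 / (3 * sigma2 + 2 * R * t)) for t in ts]
for p, b in zip(probs, bounds):
    print(f"P = {p:.4f}  <=  bound = {b:.4f}")
```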

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp 2012)

Let p = 1 or p ≥ 1.5. Suppose that (X, X′) is a matrix Stein pair where E tr |X|^{2p} < ∞. Then

(E tr |X|^{2p})^{1/(2p)} ≤ √(2p − 1) · (E tr ΔX^p)^{1/(2p)}.

Moral: the conditional variance controls the moments of X.

Generalizes Chatterjee's version (2007) of the scalar Burkholder–Davis–Gundy inequality (Burkholder 1973); see also Pisier & Xu (1997) and Junge & Xu (2003, 2008).

Proof techniques mirror those for exponential concentration.

Also holds for infinite-dimensional Schatten-class operators.

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp 2012)

Let (ε_k)_{k≥1} be an independent sequence of Rademacher random variables and (A_k)_{k≥1} a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,

E tr(Σ_k ε_k A_k)^{2p} ≤ (2p − 1)^p · tr(Σ_k A_k²)^p.

The noncommutative Khintchine inequality (Lust-Piquard 1986; Lust-Piquard and Pisier 1991) is a dominant tool in applied matrix analysis, e.g., in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin 2007).

Stein's method offers an unusually concise proof.

The constant √(2p − 1) is within a factor of √e of optimal.
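For small n, the expectation in the matrix Khintchine inequality can be computed exactly by enumerating all sign patterns. The sketch below (not from the slides; NumPy and the p = 2 choice are assumptions) checks the p = 2 case, i.e. the fourth trace moment:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
d, n, p = 3, 6, 2                     # check the 2p = 4th trace moment
A = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    A.append((M + M.T) / 2)

# Exact E tr (sum_k eps_k A_k)^{2p} over all 2^n sign patterns.
lhs = np.mean([np.trace(np.linalg.matrix_power(
                   sum(e * Ak for e, Ak in zip(eps, A)), 2 * p))
               for eps in itertools.product([-1, 1], repeat=n)])
rhs = (2 * p - 1) ** p * np.trace(
    np.linalg.matrix_power(sum(Ak @ Ak for Ak in A), p))
print(lhs, rhs)
```

Since p = 2 ≥ 1.5, the corollary guarantees lhs ≤ rhs for any Hermitian coefficients.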

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration
2. Stein's Method: Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums
6. Extensions

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Y_k)_{k=1}^n satisfying the conditional zero-mean property

E[Y_k | (Y_j)_{j≠k}] = 0 for all k,

define the random sum X = Σ_{k=1}^n Y_k.

Note: (Y_k)_{k≥1} is a martingale difference sequence.

Examples:

Sums of independent centered random matrices
Many sums of conditionally independent random matrices: Y_k ⊥⊥ (Y_j)_{j≠k} | Z and E[Y_k | Z] = 0
Rademacher series with random matrix coefficients: X = Σ_k ε_k W_k, with (W_k)_{k≥1} Hermitian and (ε_k)_{k≥1} independent Rademacher

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-Mean Property): E[Y_k | (Y_j)_{j≠k}] = 0

Matrix Stein Pair for X = Σ_{k=1}^n Y_k:

Let Y′_k and Y_k be conditionally i.i.d. given (Y_j)_{j≠k}
Draw an index K uniformly from {1, …, n}
Define X′ = X + Y′_K − Y_K

Check the Stein pair condition:

E[X − X′ | (Y_j)_{j≥1}] = E[Y_K − Y′_K | (Y_j)_{j≥1}]
    = (1/n) · Σ_{k=1}^n (Y_k − E[Y′_k | (Y_j)_{j≠k}])
    = (1/n) · Σ_{k=1}^n Y_k = (1/n) · X
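The Stein pair identity E[X − X′ | (Y_j)] = X/n can be verified by exact averaging in the Rademacher-series example. The sketch below (illustrative, not from the slides; NumPy and the sizes are assumptions) fixes one realization of the signs and averages over the uniform index K and the fresh resampled sign:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 5
W = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    W.append((M + M.T) / 2)

eps = rng.choice([-1, 1], size=n)     # one fixed realization of the signs
X = sum(e * Wk for e, Wk in zip(eps, W))

# E[X - X' | signs]: average over index K and a fresh sign for coordinate K.
diff = np.zeros((d, d))
for K in range(n):
    for fresh in (-1, 1):
        X_prime = X + fresh * W[K] - eps[K] * W[K]
        diff += (X - X_prime) / (2 * n)
print(np.max(np.abs(diff - X / n)))
```

The average is exactly X/n: the fresh sign averages to zero, leaving ε_K·W_K, and averaging over K recovers (1/n)·Σ_k ε_k W_k.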

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero-Mean Property): E[Y_k | (Y_j)_{j≠k}] = 0

Conditional Variance for X = Σ_{k=1}^n Y_k:

ΔX = (n/2) · E[(X − X′)² | (Y_j)_{j≥1}]
   = (n/2) · E[(Y_K − Y′_K)² | (Y_j)_{j≥1}]
   = (1/2) · Σ_{k=1}^n (Y_k² + E[Y_k² | (Y_j)_{j≠k}])

⇒ Conditional variance controlled when the summands are bounded
⇒ Dependent analogues of the concentration and moment inequalities

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (A_{jk})_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic

Y = Σ_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) · Σ_{j,k=1}^n A_{jk}.

Generalizes the scalar statistics studied by Hoeffding (1951).

Example: sampling without replacement from {B₁, …, B_n}: W = Σ_{j=1}^s B_{π(j)}

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic): Y = Σ_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) · Σ_{j,k=1}^n A_{jk}

Matrix Stein Pair for X = Y − E Y:

Draw indices (J, K) uniformly from {1, …, n}²
Define π′ = π ∘ (J K) and X′ = Σ_{j=1}^n A_{jπ′(j)} − E Y

Check the Stein pair condition:

E[X − X′ | π] = E[A_{Jπ(J)} + A_{Kπ(K)} − A_{Jπ(K)} − A_{Kπ(J)} | π]
    = (1/n²) · Σ_{j,k=1}^n (A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)})
    = (2/n) · (Y − E Y) = (2/n) · X
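The identity E[X − X′ | π] = (2/n)·X can likewise be checked exactly: for a fixed permutation π, average over all n² transposed pairs. The sketch below is illustrative (NumPy and the small sizes are assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 2, 4
A = rng.normal(size=(n, n, d, d))
A = (A + np.swapaxes(A, 2, 3)) / 2    # make each A[j, k] Hermitian

EY = A.sum(axis=(0, 1)) / n
pi = rng.permutation(n)               # a fixed realization of pi
X = sum(A[j, pi[j]] for j in range(n)) - EY

# E[X - X' | pi]: average over the n^2 uniformly chosen index pairs (J, K).
diff = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        pi2 = pi.copy()
        pi2[J], pi2[K] = pi2[K], pi2[J]      # pi' = pi o (J K)
        X_prime = sum(A[j, pi2[j]] for j in range(n)) - EY
        diff += (X - X_prime) / n ** 2
print(np.max(np.abs(diff - 2 * X / n)))
```

Averaging the four-term difference over (J, K) yields (2/n)·Y − (2/n)·E Y, matching the slide's computation.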

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic): Y = Σ_{j=1}^n A_{jπ(j)}, with mean E Y = (1/n) · Σ_{j,k=1}^n A_{jk}

Conditional Variance for X = Y − E Y:

ΔX(π) = (n/4) · E[(X − X′)² | π]
    = (1/4n) · Σ_{j,k=1}^n [A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)}]²
    ≼ (1/n) · Σ_{j,k=1}^n [A²_{jπ(j)} + A²_{kπ(k)} + A²_{jπ(k)} + A²_{kπ(j)}]

⇒ Conditional variance controlled when the summands are bounded
⇒ Dependent analogues of the concentration and moment inequalities

Extensions

General Complex Matrices

Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via dilation:

D(B) = [ 0   B
         B*  0 ] ∈ H^{d1+d2}

Preserves spectral information: λmax(D(B)) = ∥B∥

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property
e.g., matrix second-order Rademacher chaos Σ_{j,k} ε_j ε_k A_{jk}
Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X) − g(X′) | Z] = αX almost surely, for g: R → R weakly increasing
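The dilation identity λmax(D(B)) = ∥B∥ is quick to confirm numerically, since the eigenvalues of D(B) are the signed singular values of B (padded with zeros). A small sketch, not from the slides (NumPy and the shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
d1, d2 = 3, 5
B = rng.normal(size=(d1, d2)) + 1j * rng.normal(size=(d1, d2))

# Hermitian dilation D(B) = [[0, B], [B*, 0]].
D = np.block([[np.zeros((d1, d1)), B],
              [B.conj().T, np.zeros((d2, d2))]])
lam_max = np.linalg.eigvalsh(D).max()
spec_norm = np.linalg.norm(B, 2)      # largest singular value of B
print(lam_max, spec_norm)
```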

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastekhizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi: 10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Available at http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, Jan. 2007. doi: 10.1007/s10107-006-0033-0.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4): Article 21, 19 pp., Jul. 2007 (electronic).

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., Aug. 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 16: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Exponential Tail Inequalities

Matrix Hoeffding Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let X =sum

k Yk for independent matrices in Hd satisfying

EYk = 0 and Y 2k 4 A2

k

for deterministic matrices (Ak)kge1 Define the variance parameter

σ2 =∥

sum

kA2k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot eminust22σ2

Improves upon the matrix Hoeffding inequality of Tropp (2011)Optimal constant 12 in the exponent

Can replace variance parameter with σ2 = 12

sum

k

(

A2k + EY 2

k

)∥

Tighter than classical Hoeffding inequality (1963) when d = 1

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 16 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

1 Matrix Laplace transform method (Ahlswede amp Winter 2002)

Relate tail probability to the trace of the mgf of X

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ)

where m(θ) = E tr eθX

Problem eX+Y 6= eXeY when XY isin Hd

How to bound the trace mgf

Past approaches Golden-Thompson Liebrsquos concavity theorem

Chatterjeersquos strategy for scalar concentration

Control mgf growth by bounding derivative

mprime(θ) = E trXeθX for θ isin R

Rewrite using exchangeable pairs

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 17 35

Exponential Tail Inequalities

Method of Exchangeable Pairs

Lemma

Suppose that (XX prime) is a matrix Stein pair with scale factor α LetF Hd rarr Hd be a measurable function satisfying

E(X minusX prime)F (X) ltinfin

Then

E[X F (X)] =1

2αE[(X minusX prime)(F (X)minus F (X prime))] (1)

Intuition

Can characterize the distribution of a random matrix byintegrating it against a class of test functions F

Eq 1 allows us to estimate this integral using the smoothnessproperties of F and the discrepancy X minusX prime

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 18 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

2 Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf

mprime(θ) = E trXeθX =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

Goal Use the smoothness of F (X) = eθX to bound the derivative

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 19 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey Jordan Chen Farrell and Tropp 2012)

Suppose that g R rarr R is a weakly increasing function and thath R rarr R is a function whose derivative hprime is convex For allmatrices AB isin Hd it holds that

tr[(g(A)minus g(B)) middot (h(A)minus h(B))] le1

2tr[(g(A)minus g(B)) middot (AminusB) middot (hprime(A) + hprime(B))]

Standard matrix functions If g R rarr R and

A = Q

λ1

λd

Qlowast then g(A) = Q

g(λ1)

g(λd)

Qlowast

Inequality does not hold without the traceFor exponential concentration we let g(A) = A and h(B) = eθB

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 20 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

le θ

4αE tr

[

(X minusX prime)2 middot(

eθX + eθXprime)]

2αE tr

[

(X minusX prime)2 middot eθX]

= θ middot E tr

[

1

2αE[

(X minusX prime)2 |Z]

middot eθX]

= θ middot E tr[

∆X eθX]

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 21 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Info. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 35 / 35

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Available at http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi:10.1007/s10107-006-0033-0. URL http://dl.acm.org/citation.cfm?id=1229716.1229726.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007.

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 36 / 35




Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 19: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

2. Method of Exchangeable Pairs

Rewrite the derivative of the trace mgf:

    m'(θ) = E tr[X e^{θX}] = (1/(2α)) E tr[(X − X')(e^{θX} − e^{θX'})]

Goal: Use the smoothness of F(X) = e^{θX} to bound the derivative.

Mackey (Stanford)    Stein's Method for Matrix Concentration    December 10, 2012    19 / 35

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Suppose that g : R → R is a weakly increasing function and that h : R → R is a function whose derivative h' is convex. For all matrices A, B ∈ H^d, it holds that

    tr[(g(A) − g(B)) · (h(A) − h(B))] ≤ (1/2) tr[(g(A) − g(B)) · (A − B) · (h'(A) + h'(B))].

Standard matrix functions: if g : R → R and A = Q diag(λ1, …, λd) Q*, then g(A) = Q diag(g(λ1), …, g(λd)) Q*.

The inequality does not hold without the trace.
For exponential concentration, we let g(A) = A and h(B) = e^{θB}.
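Not from the slides: a quick numerical sanity check of the lemma in the exponential-concentration instantiation g(A) = A, h(A) = e^{θA} (so h'(A) = θ e^{θA}, and h' is convex for θ > 0). The matrices and dimensions are arbitrary illustrative choices.

```python
import numpy as np

def expm_h(A):
    # Matrix exponential of a real symmetric (Hermitian) matrix via eigendecomposition.
    w, Q = np.linalg.eigh(A)
    return (Q * np.exp(w)) @ Q.T

rng = np.random.default_rng(0)
d, theta = 5, 0.7
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
B = rng.standard_normal((d, d)); B = (B + B.T) / 2

# Lemma with g(A) = A (weakly increasing) and h(A) = exp(theta*A).
lhs = np.trace((A - B) @ (expm_h(theta * A) - expm_h(theta * B)))
rhs = 0.5 * np.trace((A - B) @ (A - B)
                     @ (theta * expm_h(theta * A) + theta * expm_h(theta * B)))
print(lhs <= rhs + 1e-10)  # True
```

The inequality holds for every pair of Hermitian matrices, so the check passes for any seed.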

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:

    m'(θ) = (1/(2α)) E tr[(X − X')(e^{θX} − e^{θX'})]
          ≤ (θ/(4α)) E tr[(X − X')² · (e^{θX} + e^{θX'})]
          = (θ/(2α)) E tr[(X − X')² · e^{θX}]
          = θ · E tr[ (1/(2α)) E[(X − X')² | Z] · e^{θX} ]
          = θ · E tr[∆X e^{θX}]
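Not on the slide: the step from the second to the third line uses exchangeability of (X, X'), which lets the two exponential terms be merged:

```latex
\mathbb{E}\,\mathrm{tr}\bigl[(X - X')^2 \, e^{\theta X'}\bigr]
= \mathbb{E}\,\mathrm{tr}\bigl[(X' - X)^2 \, e^{\theta X}\bigr]
= \mathbb{E}\,\mathrm{tr}\bigl[(X - X')^2 \, e^{\theta X}\bigr],
\quad\text{hence}\quad
\mathbb{E}\,\mathrm{tr}\bigl[(X - X')^2 \bigl(e^{\theta X} + e^{\theta X'}\bigr)\bigr]
= 2\,\mathbb{E}\,\mathrm{tr}\bigl[(X - X')^2 \, e^{\theta X}\bigr].
```

The final line is the tower property applied to the definition of the conditional variance ∆X.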

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:  m'(θ) ≤ θ · E tr[∆X e^{θX}]

4. Conditional Variance Bound: ∆X ⪯ cX + vI

Yields the differential inequality

    m'(θ) ≤ cθ E tr[X e^{θX}] + vθ E tr[e^{θX}] = cθ · m'(θ) + vθ · m(θ).

Solve to bound m(θ) and thereby bound

    P{λmax(X) ≥ t} ≤ inf_{θ>0} e^{−θt} · m(θ) ≤ d · exp{−t² / (2v + 2ct)}.
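Not on the slide: the solve step can be made explicit. Rearranging for 0 < θ < 1/c, integrating, and optimizing θ recovers the stated tail bound:

```latex
m'(\theta) \le \frac{v\theta}{1 - c\theta}\, m(\theta)
\;\Longrightarrow\;
\frac{d}{d\theta} \log m(\theta) \le \frac{v\theta}{1 - c\theta}.
% Integrate from 0 to theta, using m(0) = E tr I = d:
\log m(\theta) \le \log d + \int_0^\theta \frac{vs}{1 - cs}\, ds
\le \log d + \frac{v\theta^2}{2(1 - c\theta)}.
% Substitute into the Laplace transform bound and pick theta = t/(v + ct) < 1/c:
\mathbb{P}\{\lambda_{\max}(X) \ge t\}
\le e^{-\theta t}\, m(\theta)
\le d \exp\Bigl\{-\theta t + \frac{v\theta^2}{2(1 - c\theta)}\Bigr\}
= d \exp\Bigl\{\frac{-t^2}{2v + 2ct}\Bigr\}.
```

With θ = t/(v + ct) one has 1 − cθ = v/(v + ct), so the exponent collapses to −t²/(2v + 2ct) exactly.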

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X ⪯ cX + vI:

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (X, X') be a bounded matrix Stein pair with X ∈ H^d. Define the function

    r(ψ) = (1/ψ) log( E tr e^{ψ∆X} / d )   for each ψ > 0.

Then, for all t ≥ 0 and all ψ > 0,

    P{λmax(X) ≥ t} ≤ d · exp{−t² / (2r(ψ) + 2t/√ψ)}.

r(ψ) measures the typical magnitude of the conditional variance:

    E λmax(∆X) ≤ inf_{ψ>0} [ r(ψ) + (log d)/ψ ].

When d = 1, this improves the scalar result of Chatterjee (2008).
The proof extends to unbounded random matrices.

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (Yk)_{k≥1} be independent matrices in H^d satisfying

    E Yk = 0   and   ‖Yk‖ ≤ R   for each index k.

Define the variance parameter σ² = ‖Σ_k E Yk²‖. Then, for all t ≥ 0,

    P{ λmax(Σ_k Yk) ≥ t } ≤ d · exp{−t² / (3σ² + 2Rt)}.

Gaussian tail controlled by the variance σ² when t is small.
Key proof idea: Apply the refined concentration result and bound r(ψ) = (1/ψ) log(E tr e^{ψ∆X} / d) using the unrefined result.
Constants better than Oliveira (2009), worse than Tropp (2011).
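Not from the slides: a sketch that checks the corollary against an exactly computable example. For a Rademacher series with a small number of fixed Hermitian coefficients, the tail probability of λmax can be computed exactly by enumerating all sign patterns, and the theorem guarantees it never exceeds the Bernstein bound. All matrices and the threshold t here are illustrative choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
k, d, t = 8, 2, 8.0
# Independent summands Y_i = eps_i * A_i with fixed Hermitian A_i, so E Y_i = 0.
As = []
for _ in range(k):
    M = rng.standard_normal((d, d))
    As.append((M + M.T) / 2)

R = max(np.linalg.norm(A, 2) for A in As)           # uniform bound ||Y_i|| <= R
sigma2 = np.linalg.norm(sum(A @ A for A in As), 2)  # || sum_i E Y_i^2 ||

bound = d * np.exp(-t**2 / (3 * sigma2 + 2 * R * t))

# Exact tail probability: enumerate all 2^k equally likely sign patterns.
hits = 0
for signs in itertools.product([-1, 1], repeat=k):
    S = sum(s * A for s, A in zip(signs, As))
    if np.linalg.eigvalsh(S).max() >= t:
        hits += 1
prob = hits / 2**k
print(prob <= bound)  # True
```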

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let p = 1 or p ≥ 1.5. Suppose that (X, X') is a matrix Stein pair where E tr|X|^{2p} < ∞. Then

    ( E tr|X|^{2p} )^{1/(2p)} ≤ √(2p − 1) · ( E tr ∆X^p )^{1/(2p)}.

Moral: The conditional variance controls the moments of X.

Generalizes Chatterjee's version (2007) of the scalar Burkholder–Davis–Gundy inequality (Burkholder, 1973).
See also Pisier & Xu (1997), Junge & Xu (2003, 2008).
Proof techniques mirror those for exponential concentration.
Also holds for infinite-dimensional Schatten-class operators.

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)

Let (εk)_{k≥1} be an independent sequence of Rademacher random variables and (Ak)_{k≥1} be a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,

    E tr( Σ_k εk Ak )^{2p} ≤ (2p − 1)^p · tr( Σ_k Ak² )^p.

The noncommutative Khintchine inequality (Lust-Piquard, 1986; Lust-Piquard and Pisier, 1991) is a dominant tool in applied matrix analysis.
E.g., used in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007).
Stein's method offers an unusually concise proof.
The constant √(2p − 1) is within √e of optimal.
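Not from the slides: with few coefficients, the left-hand expectation is an exact average over all sign patterns, so the corollary can be verified directly. The coefficients, dimension, and choice p = 2 below are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
k, d, p = 6, 3, 2
As = []
for _ in range(k):
    M = rng.standard_normal((d, d))
    As.append((M + M.T) / 2)

# Exact E tr (sum_k eps_k A_k)^{2p}: average over all 2^k sign patterns.
lhs = 0.0
for signs in itertools.product([-1, 1], repeat=k):
    S = sum(s * A for s, A in zip(signs, As))
    lhs += np.trace(np.linalg.matrix_power(S, 2 * p))
lhs /= 2**k

V = sum(A @ A for A in As)                     # sum_k A_k^2
rhs = (2 * p - 1) ** p * np.trace(np.linalg.matrix_power(V, p))
print(lhs <= rhs)  # True
```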

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration
2. Stein's Method Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums
6. Extensions

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-mean Matrices)

Given a sequence of Hermitian matrices (Yk)_{k=1}^n satisfying the conditional zero-mean property

    E[Yk | (Yj)_{j≠k}] = 0   for all k,

define the random sum X = Σ_{k=1}^n Yk.

Note: (Yk)_{k≥1} is a martingale difference sequence.

Examples:
Sums of independent, centered random matrices.
Many sums of conditionally independent random matrices:
    Yk ⊥⊥ (Yj)_{j≠k} | Z   and   E[Yk | Z] = 0.
Rademacher series with random matrix coefficients:
    X = Σ_k εk Wk,  with (Wk)_{k≥1} Hermitian and (εk)_{k≥1} independent Rademacher.

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Conditional zero-mean property: E[Yk | (Yj)_{j≠k}] = 0

Matrix Stein Pair for X = Σ_{k=1}^n Yk:
Let Y'_k and Yk be conditionally i.i.d. given (Yj)_{j≠k}.
Draw an index K uniformly from {1, …, n}.
Define X' = X + Y'_K − Y_K.

Check the Stein pair condition:

    E[X − X' | (Yj)_{j≥1}] = E[Y_K − Y'_K | (Yj)_{j≥1}]
                           = (1/n) Σ_{k=1}^n ( Yk − E[Y'_k | (Yj)_{j≠k}] )
                           = (1/n) Σ_{k=1}^n Yk = (1/n) X.

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Conditional zero-mean property: E[Yk | (Yj)_{j≠k}] = 0

Conditional Variance for X = Σ_{k=1}^n Yk:

    ∆X = (n/2) · E[ (X − X')² | (Yj)_{j≥1} ]
       = (n/2) · E[ (Y_K − Y'_K)² | (Yj)_{j≥1} ]
       = (1/2) Σ_{k=1}^n ( Yk² + E[Yk² | (Yj)_{j≠k}] )

⇒ Conditional variance controlled when summands are bounded
⇒ Dependent analogues of concentration and moment inequalities
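Not on the slide: the last equality follows because, conditional on (Yj)_{j≠k}, Y'_k is an independent copy of Yk with conditional mean zero, so the cross terms in the square vanish:

```latex
\mathbb{E}\bigl[(Y_K - Y_K')^2 \,\big|\, (Y_j)_{j\ge 1}\bigr]
= \frac{1}{n}\sum_{k=1}^n \Bigl( Y_k^2
  - Y_k\,\mathbb{E}[Y_k' \mid (Y_j)_{j\neq k}]
  - \mathbb{E}[Y_k' \mid (Y_j)_{j\neq k}]\, Y_k
  + \mathbb{E}[Y_k'^2 \mid (Y_j)_{j\neq k}] \Bigr)
= \frac{1}{n}\sum_{k=1}^n \bigl( Y_k^2 + \mathbb{E}[Y_k^2 \mid (Y_j)_{j\neq k}] \bigr),
```

since E[Y'_k | (Yj)_{j≠k}] = 0 and Y'_k has the same conditional distribution as Yk. Multiplying by n/2 gives the stated ∆X.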

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic

    Y = Σ_{j=1}^n A_{j,π(j)}   with mean   E Y = (1/n) Σ_{j,k=1}^n Ajk.

Generalizes the scalar statistics studied by Hoeffding (1951).

Example:
Sampling without replacement from {B1, …, Bn}: W = Σ_{j=1}^s B_{π(j)}.

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Combinatorial matrix statistic: Y = Σ_{j=1}^n A_{j,π(j)} with mean E Y = (1/n) Σ_{j,k=1}^n Ajk

Matrix Stein Pair for X = Y − E Y:
Draw an index pair (J, K) uniformly from {1, …, n}².
Define π' = π ∘ (J, K) and X' = Σ_{j=1}^n A_{j,π'(j)} − E Y.

Check the Stein pair condition:

    E[X − X' | π] = E[ A_{J,π(J)} + A_{K,π(K)} − A_{J,π(K)} − A_{K,π(J)} | π ]
                  = (1/n²) Σ_{j,k=1}^n ( A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)} )
                  = (2/n)(Y − E Y) = (2/n) X.
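Not from the slides: the Stein pair identity E[X − X' | π] = (2/n) X is exact for every fixed π, so it can be verified by averaging over all n² equally likely index pairs (J, K). The array size and the particular permutation below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 4, 3
# Deterministic array A[j, k] of Hermitian (real symmetric) matrices.
A = rng.standard_normal((n, n, d, d))
A = (A + np.swapaxes(A, 2, 3)) / 2
EY = A.sum(axis=(0, 1)) / n

perm = np.array([2, 0, 3, 1])              # one fixed permutation pi
X = sum(A[j, perm[j]] for j in range(n)) - EY

# E[X - X' | pi]: average over the n^2 equally likely index pairs (J, K).
avg = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        p2 = perm.copy()
        p2[J], p2[K] = p2[K], p2[J]        # pi' = pi composed with transposition (J K)
        Xp = sum(A[j, p2[j]] for j in range(n)) - EY
        avg += (X - Xp) / n**2
print(np.allclose(avg, (2 / n) * X))       # True
```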

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Combinatorial matrix statistic: Y = Σ_{j=1}^n A_{j,π(j)} with mean E Y = (1/n) Σ_{j,k=1}^n Ajk

Conditional Variance for X = Y − E Y:

    ∆X(π) = (n/4) E[ (X − X')² | π ]
          = (1/4n) Σ_{j,k=1}^n [ A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)} ]²
          ⪯ (1/n) Σ_{j,k=1}^n [ A²_{j,π(j)} + A²_{k,π(k)} + A²_{j,π(k)} + A²_{k,π(j)} ]

⇒ Conditional variance controlled when summands are bounded
⇒ Dependent analogues of concentration and moment inequalities
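Not from the slides: the semidefinite step above is an instance of the operator inequality (Σ_{i=1}^m ±Mᵢ)² ⪯ m Σ_{i=1}^m Mᵢ² for Hermitian Mᵢ (here m = 4, which turns the prefactor 1/4n into 1/n). A spot check that the gap matrix is positive semidefinite, with illustrative random matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
Ms = []
for _ in range(4):
    M = rng.standard_normal((d, d))
    Ms.append((M + M.T) / 2)

# (M1 + M2 - M3 - M4)^2  is dominated by  4 * (M1^2 + M2^2 + M3^2 + M4^2).
S = Ms[0] + Ms[1] - Ms[2] - Ms[3]
gap = 4 * sum(M @ M for M in Ms) - S @ S
print(np.linalg.eigvalsh(gap).min() >= -1e-10)  # True: gap is PSD
```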

Extensions

Extensions

General Complex Matrices:
Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via dilation:

    D(B) = [ 0   B ]
           [ B*  0 ]   ∈ H^{d1+d2}

Preserves spectral information: λmax(D(B)) = ‖B‖.

Beyond Sums:
Matrix-valued functions satisfying a self-reproducing property.
E.g., matrix second-order Rademacher chaos Σ_{j,k} εj εk Ajk.
Yields a dependent bounded differences inequality for matrices.

Generalized Matrix Stein Pairs:
Satisfy E[g(X) − g(X') | Z] = αX almost surely, for g : R → R weakly increasing.
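Not from the slides: the dilation's spectrum consists of ± the singular values of B (plus zeros), so its top eigenvalue equals the spectral norm ‖B‖. A short check with an illustrative rectangular matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2 = 3, 5
B = rng.standard_normal((d1, d2))

# Hermitian dilation D(B) = [[0, B], [B*, 0]] in H^{d1+d2}.
D = np.block([[np.zeros((d1, d1)), B],
              [B.T, np.zeros((d2, d2))]])

lam_max = np.linalg.eigvalsh(D).max()
spec_norm = np.linalg.norm(B, 2)       # largest singular value of B
print(np.isclose(lam_max, spec_norm))  # True
```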

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.
Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.
Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.
Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.
Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Info. Theory, 2009. To appear. Available at arXiv:0903.1476.
Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.
Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. URL arxiv:math/0507526v1.
Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.
Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.
Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.
Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.
Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.
Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.
Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.
Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.
Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.
Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.
Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.
Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. URL http://arxiv.org/abs/1201.6002, 2012.
Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.
Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi:10.1007/s10107-006-0033-0. URL http://dl.acm.org/citation.cfm?id=1229716.1229726.
Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.
Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.
Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.
Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007 (electronic).
So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.
Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.
Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 20: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Exponential Tail Inequalities

Mean Value Trace Inequality

Lemma (Mackey Jordan Chen Farrell and Tropp 2012)

Suppose that g R rarr R is a weakly increasing function and thath R rarr R is a function whose derivative hprime is convex For allmatrices AB isin Hd it holds that

tr[(g(A)minus g(B)) middot (h(A)minus h(B))] le1

2tr[(g(A)minus g(B)) middot (AminusB) middot (hprime(A) + hprime(B))]

Standard matrix functions If g R rarr R and

A = Q

λ1

λd

Qlowast then g(A) = Q

g(λ1)

g(λd)

Qlowast

Inequality does not hold without the traceFor exponential concentration we let g(A) = A and h(B) = eθB

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 20 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

le θ

4αE tr

[

(X minusX prime)2 middot(

eθX + eθXprime)]

2αE tr

[

(X minusX prime)2 middot eθX]

= θ middot E tr

[

1

2αE[

(X minusX prime)2 |Z]

middot eθX]

= θ middot E tr[

∆X eθX]

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 21 35

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 21: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) =1

2αE tr

[

(X minusX prime)(

eθX minus eθXprime)]

le θ

4αE tr

[

(X minusX prime)2 middot(

eθX + eθXprime)]

2αE tr

[

(X minusX prime)2 middot eθX]

= θ middot E tr

[

1

2αE[

(X minusX prime)2 |Z]

middot eθX]

= θ middot E tr[

∆X eθX]

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 21 35

Exponential Tail Inequalities

Exponential Concentration: Proof Sketch

3. Mean Value Trace Inequality

Bound the derivative of the trace mgf:
\[ m'(\theta) \le \theta \cdot \mathbb{E}\operatorname{tr}\!\big[\Delta_X\, e^{\theta X}\big] \]

4. Conditional Variance Bound: \(\Delta_X \preccurlyeq cX + vI\)

Yields the differential inequality
\[ m'(\theta) \le c\theta\,\mathbb{E}\operatorname{tr}\!\big[X e^{\theta X}\big] + v\theta\,\mathbb{E}\operatorname{tr}\!\big[e^{\theta X}\big] = c\theta \cdot m'(\theta) + v\theta \cdot m(\theta). \]

Solve to bound m(θ) and thereby bound
\[ \mathbb{P}\{\lambda_{\max}(X) \ge t\} \le \inf_{\theta > 0}\, e^{-\theta t} \cdot m(\theta) \le d \cdot \exp\!\left\{\frac{-t^2}{2v + 2ct}\right\}. \]
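The final step can be checked numerically. Solving the differential inequality yields the standard mgf bound m(θ) ≤ d·exp{vθ²/(2(1−cθ))} for 0 < θ < 1/c; the sketch below (with arbitrary illustrative values of d, v, c, t — none taken from the slides) confirms that substituting θ = t/(v + ct) into the Chernoff bound reproduces the stated tail exactly.

```python
import math

def chernoff(theta, d, v, c, t):
    # e^{-theta t} * m(theta), using the mgf bound
    # m(theta) <= d * exp(v theta^2 / (2 (1 - c theta))), valid for 0 < theta < 1/c
    return math.exp(-theta * t) * d * math.exp(v * theta**2 / (2 * (1 - c * theta)))

d, v, c, t = 10, 2.0, 0.5, 3.0
stated = d * math.exp(-(t**2) / (2 * v + 2 * c * t))

# The choice theta = t / (v + c t) recovers the slide's tail bound exactly:
# theta t - v theta^2 / (2 (1 - c theta)) simplifies to t^2 / (2v + 2ct).
theta_star = t / (v + c * t)
assert abs(chernoff(theta_star, d, v, c, t) - stated) < 1e-12
```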

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint \(\Delta_X \preccurlyeq cX + vI\):

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (X, X') be a bounded matrix Stein pair with X ∈ H^d. Define the function
\[ r(\psi) = \frac{1}{\psi}\log \mathbb{E}\operatorname{tr}\!\big(e^{\psi \Delta_X}/d\big) \quad \text{for each } \psi > 0. \]
Then, for all t ≥ 0 and all ψ > 0,
\[ \mathbb{P}\{\lambda_{\max}(X) \ge t\} \le d \cdot \exp\!\left\{\frac{-t^2}{2r(\psi) + 2t/\sqrt{\psi}}\right\}. \]

r(ψ) measures the typical magnitude of the conditional variance:
\[ \mathbb{E}\,\lambda_{\max}(\Delta_X) \le \inf_{\psi > 0}\left[r(\psi) + \frac{\log d}{\psi}\right] \]
When d = 1, this improves the scalar result of Chatterjee (2008).
The proof extends to unbounded random matrices.

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (Y_k)_{k≥1} be independent matrices in H^d satisfying
\[ \mathbb{E} Y_k = 0 \quad\text{and}\quad \|Y_k\| \le R \quad \text{for each index } k. \]
Define the variance parameter
\[ \sigma^2 = \Big\| \sum\nolimits_k \mathbb{E} Y_k^2 \Big\|. \]
Then, for all t ≥ 0,
\[ \mathbb{P}\Big\{\lambda_{\max}\Big(\sum\nolimits_k Y_k\Big) \ge t\Big\} \le d \cdot \exp\!\left\{\frac{-t^2}{3\sigma^2 + 2Rt}\right\}. \]

Gaussian tail controlled by the variance σ² when t is small.
Key proof idea: apply the refined concentration result and bound r(ψ) = (1/ψ) log E tr(e^{ψΔ_X}/d) using the unrefined result.
Constants better than Oliveira (2009), worse than Tropp (2011).
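The corollary can be probed by simulation. The sketch below uses an illustrative setup not taken from the slides: a Rademacher series with fixed Hermitian coefficients, so the summands are independent, mean zero, and bounded, and the empirical tail at t = 4σ is compared against the Bernstein bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 30
# Fixed Hermitian (real symmetric) coefficients; Y_k = eps_k * A_k are
# independent, mean zero, and bounded with ||Y_k|| <= R.
As = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    As.append((G + G.T) / 2)
R = max(np.linalg.norm(A, 2) for A in As)          # uniform bound on ||Y_k||
sigma2 = np.linalg.norm(sum(A @ A for A in As), 2) # || sum_k E Y_k^2 ||

t = 4.0 * np.sqrt(sigma2)
bernstein = d * np.exp(-t**2 / (3 * sigma2 + 2 * R * t))

trials, hits = 1000, 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    X = sum(e * A for e, A in zip(eps, As))
    hits += np.linalg.eigvalsh(X).max() >= t
empirical = hits / trials
assert empirical <= bernstein + 0.05  # tail bound holds, up to Monte Carlo error
```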

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let p = 1 or p ≥ 1.5. Suppose that (X, X') is a matrix Stein pair where E tr|X|^{2p} < ∞. Then
\[ \big(\mathbb{E}\operatorname{tr}|X|^{2p}\big)^{1/(2p)} \le \sqrt{2p - 1}\cdot\big(\mathbb{E}\operatorname{tr}\Delta_X^{\,p}\big)^{1/(2p)}. \]

Moral: the conditional variance controls the moments of X.
Generalizes Chatterjee's version (2007) of the scalar Burkholder–Davis–Gundy inequality (Burkholder, 1973).
See also Pisier & Xu (1997), Junge & Xu (2003, 2008).
Proof techniques mirror those for exponential concentration.
Also holds for infinite-dimensional Schatten-class operators.

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (ε_k)_{k≥1} be an independent sequence of Rademacher random variables and (A_k)_{k≥1} a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,
\[ \mathbb{E}\operatorname{tr}\Big(\sum\nolimits_k \varepsilon_k A_k\Big)^{2p} \le (2p - 1)^p \cdot \operatorname{tr}\Big(\sum\nolimits_k A_k^2\Big)^p. \]

The noncommutative Khintchine inequality (Lust-Piquard, 1986; Lust-Piquard and Pisier, 1991) is a dominant tool in applied matrix analysis, e.g., used in the analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007).
Stein's method offers an unusually concise proof.
The constant √(2p − 1) is within √e of optimal.
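For small n, the expectation on the left is a finite average over the 2^n sign patterns, so the corollary can be verified exactly. A sketch with illustrative random coefficients (p = 2):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d, n, p = 3, 6, 2
As = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    As.append((G + G.T) / 2)   # Hermitian (real symmetric) coefficients

# Exact expectation: enumerate all 2^n Rademacher sign patterns.
lhs = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    X = sum(s * A for s, A in zip(signs, As))
    lhs += np.trace(np.linalg.matrix_power(X, 2 * p))
lhs /= 2 ** n

rhs = (2 * p - 1) ** p * np.trace(
    np.linalg.matrix_power(sum(A @ A for A in As), p))
assert lhs <= rhs   # matrix Khintchine inequality
```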

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration
2. Stein's Method: Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums
6. Extensions

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)
Given a sequence of Hermitian matrices (Y_k)_{k=1}^n satisfying the conditional zero-mean property
\[ \mathbb{E}[Y_k \mid (Y_j)_{j \ne k}] = 0 \quad \text{for all } k, \]
define the random sum X = ∑_{k=1}^n Y_k.

Note: (Y_k)_{k≥1} is a martingale difference sequence.

Examples:
Sums of independent centered random matrices.
Many sums of conditionally independent random matrices: Y_k ⫫ (Y_j)_{j≠k} | Z and E[Y_k | Z] = 0.
Rademacher series with random matrix coefficients: X = ∑_k ε_k W_k, with (W_k)_{k≥1} Hermitian and (ε_k)_{k≥1} independent Rademacher.

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Conditional zero-mean property: E[Y_k | (Y_j)_{j≠k}] = 0.

Matrix Stein Pair for X = ∑_{k=1}^n Y_k:
Let Y'_k and Y_k be conditionally i.i.d. given (Y_j)_{j≠k}.
Draw an index K uniformly from {1, …, n}.
Define X' = X + Y'_K − Y_K.

Check the Stein pair condition:
\[ \mathbb{E}[X - X' \mid (Y_j)_{j\ge 1}] = \mathbb{E}[Y_K - Y'_K \mid (Y_j)_{j\ge 1}] = \frac{1}{n}\sum_{k=1}^n \big(Y_k - \mathbb{E}[Y'_k \mid (Y_j)_{j \ne k}]\big) = \frac{1}{n}\sum_{k=1}^n Y_k = \frac{1}{n}\, X. \]
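For a Rademacher series X = ∑_k ε_k W_k (the third example above), the conditioning can be carried out exactly by averaging over the uniform index K and the resampled sign, which confirms the Stein pair condition E[X − X′ | (Y_j)] = X/n. A small illustrative check with arbitrary fixed coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 5
Ws = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    Ws.append((G + G.T) / 2)   # fixed Hermitian coefficients
eps = rng.choice([-1.0, 1.0], size=n)
X = sum(e * W for e, W in zip(eps, Ws))

# E[X - X' | eps]: average over the uniform index K and the fresh sign eps'_K.
diff = np.zeros((d, d))
for k in range(n):
    for new in (-1.0, 1.0):
        Xp = X + (new - eps[k]) * Ws[k]   # X' = X + Y'_K - Y_K
        diff += (X - Xp) / (2 * n)
assert np.allclose(diff, X / n)           # Stein pair condition, alpha = 1/n
```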

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Conditional zero-mean property: E[Y_k | (Y_j)_{j≠k}] = 0.

Conditional Variance for X = ∑_{k=1}^n Y_k:
\[ \Delta_X = \frac{n}{2}\cdot\mathbb{E}\big[(X - X')^2 \mid (Y_j)_{j\ge 1}\big] = \frac{n}{2}\cdot\mathbb{E}\big[(Y_K - Y'_K)^2 \mid (Y_j)_{j\ge 1}\big] = \frac{1}{2}\sum_{k=1}^n \big(Y_k^2 + \mathbb{E}[Y_k^2 \mid (Y_j)_{j \ne k}]\big) \]

⇒ The conditional variance is controlled when the summands are bounded.
⇒ Dependent analogues of the concentration and moment inequalities follow.
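The same exact averaging verifies the conditional variance formula. For a Rademacher series the closed form collapses to ∑_k W_k², since Y_k² = W_k² and E[Y_k² | rest] = W_k². An illustrative, self-contained check:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 5
Ws = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    Ws.append((G + G.T) / 2)   # fixed Hermitian coefficients
eps = rng.choice([-1.0, 1.0], size=n)

# Delta_X = (n/2) E[(X - X')^2 | eps], averaging over the uniform index K
# and the resampled sign eps'_K in {-1, +1}.
Delta = np.zeros((d, d))
for k in range(n):
    for new in (-1.0, 1.0):
        Dk = (eps[k] - new) * Ws[k]      # X - X'
        Delta += (n / 2) * (Dk @ Dk) / (2 * n)

# Closed form from the slide: (1/2) sum_k (Y_k^2 + E[Y_k^2 | rest]) = sum_k W_k^2.
assert np.allclose(Delta, sum(W @ W for W in Ws))
```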

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
Given a deterministic array (A_{jk})_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, …, n}, define the combinatorial matrix statistic
\[ Y = \sum_{j=1}^n A_{j\pi(j)} \quad\text{with mean}\quad \mathbb{E} Y = \frac{1}{n}\sum_{j,k=1}^n A_{jk}. \]

Generalizes the scalar statistics studied by Hoeffding (1951).

Example: sampling without replacement from {B_1, …, B_n}: W = ∑_{j=1}^s B_{π(j)}.

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Combinatorial matrix statistic: Y = ∑_{j=1}^n A_{jπ(j)} with mean E Y = (1/n) ∑_{j,k=1}^n A_{jk}.

Matrix Stein Pair for X = Y − E Y:
Draw indices (J, K) uniformly from {1, …, n}².
Define π' = π ∘ (J K) and X' = ∑_{j=1}^n A_{jπ'(j)} − E Y.

Check the Stein pair condition:
\[ \mathbb{E}[X - X' \mid \pi] = \mathbb{E}\big[A_{J\pi(J)} + A_{K\pi(K)} - A_{J\pi(K)} - A_{K\pi(J)} \mid \pi\big] = \frac{1}{n^2}\sum_{j,k=1}^n \big(A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)}\big) = \frac{2}{n}(Y - \mathbb{E} Y) = \frac{2}{n}\, X. \]
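Since (J, K) ranges over only n² pairs, the conditional expectation above can be computed exactly for a small illustrative array, confirming E[X − X′ | π] = (2/n)X:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 2, 4
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2   # each A[j, k] Hermitian (real symmetric)

EY = A.sum(axis=(0, 1)) / n
pi = rng.permutation(n)
X = sum(A[j, pi[j]] for j in range(n)) - EY

# E[X - X' | pi]: average over all n^2 choices of the transposed pair (J, K).
diff = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        pip = pi.copy()
        pip[J], pip[K] = pip[K], pip[J]  # pi' = pi composed with (J K)
        Xp = sum(A[j, pip[j]] for j in range(n)) - EY
        diff += (X - Xp) / n**2
assert np.allclose(diff, 2 * X / n)      # Stein pair condition, alpha = 2/n
```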

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Combinatorial matrix statistic: Y = ∑_{j=1}^n A_{jπ(j)} with mean E Y = (1/n) ∑_{j,k=1}^n A_{jk}.

Conditional Variance for X = Y − E Y:
\[ \Delta_X(\pi) = \frac{n}{4}\,\mathbb{E}\big[(X - X')^2 \mid \pi\big] = \frac{1}{4n}\sum_{j,k=1}^n \big[A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)}\big]^2 \preccurlyeq \frac{1}{n}\sum_{j,k=1}^n \big[A_{j\pi(j)}^2 + A_{k\pi(k)}^2 + A_{j\pi(k)}^2 + A_{k\pi(j)}^2\big] \]

⇒ The conditional variance is controlled when the summands are bounded.
⇒ Dependent analogues of the concentration and moment inequalities follow.
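Both the closed form and the semidefinite domination can be checked exactly on a small illustrative array by enumerating the n² transpositions (the domination follows from the operator inequality (a + b − c − d)² ≼ 4(a² + b² + c² + d²)):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 2, 4
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2   # each A[j, k] Hermitian (real symmetric)
pi = rng.permutation(n)
EY = A.sum(axis=(0, 1)) / n
X = sum(A[j, pi[j]] for j in range(n)) - EY

def M(j, k):
    # A_{j pi(j)} + A_{k pi(k)} - A_{j pi(k)} - A_{k pi(j)}
    return A[j, pi[j]] + A[k, pi[k]] - A[j, pi[k]] - A[k, pi[j]]

# Delta_X(pi) = (n/4) E[(X - X')^2 | pi], averaging over the n^2 pairs (J, K).
Delta = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        pip = pi.copy()
        pip[J], pip[K] = pip[K], pip[J]
        Xp = sum(A[j, pip[j]] for j in range(n)) - EY
        Delta += (n / 4) * (X - Xp) @ (X - Xp) / n**2

# Closed form (1/(4n)) sum_{j,k} M(j,k)^2 from the slide.
closed = sum(M(j, k) @ M(j, k) for j in range(n) for k in range(n)) / (4 * n)
assert np.allclose(Delta, closed)

# Semidefinite domination: the upper bound minus Delta is positive semidefinite.
dom = sum(A[j, pi[j]] @ A[j, pi[j]] + A[k, pi[k]] @ A[k, pi[k]]
          + A[j, pi[k]] @ A[j, pi[k]] + A[k, pi[j]] @ A[k, pi[j]]
          for j in range(n) for k in range(n)) / n
assert np.linalg.eigvalsh(dom - Delta).min() >= -1e-10
```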

Extensions

Extensions

General Complex Matrices
Map any matrix B ∈ C^{d_1×d_2} to a Hermitian matrix via dilation:
\[ \mathcal{D}(B) = \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix} \in \mathbb{H}^{d_1 + d_2} \]
Preserves spectral information: λ_max(D(B)) = ‖B‖.

Beyond Sums
Matrix-valued functions satisfying a self-reproducing property.
E.g., matrix second-order Rademacher chaos: ∑_{j,k} ε_j ε_k A_{jk}.
Yields a dependent bounded differences inequality for matrices.

Generalized Matrix Stein Pairs
Satisfy E[g(X) − g(X') | Z] = αX almost surely, for g: R → R weakly increasing.
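The dilation's spectral claim is easy to verify numerically: the eigenvalues of D(B) are ±(singular values of B), so λ_max(D(B)) equals the spectral norm of B. An illustrative check on a random complex matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
d1, d2 = 3, 5
B = rng.standard_normal((d1, d2)) + 1j * rng.standard_normal((d1, d2))

# Hermitian dilation D(B) = [[0, B], [B*, 0]] in H^{d1+d2}
D = np.block([[np.zeros((d1, d1)), B],
              [B.conj().T, np.zeros((d2, d2))]])
assert np.allclose(D, D.conj().T)                # D(B) is Hermitian
assert np.isclose(np.linalg.eigvalsh(D).max(),   # lambda_max(D(B)) = ||B||
                  np.linalg.norm(B, 2))
```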

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.
Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.
Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi:10.1214/aop/1176997023.
Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.
Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.
Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.
Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.
Cheung, S.-S., So, A. M.-C., and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.
Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.
Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl., 30:844–881, 2008.
Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. J. Mach. Learn. Res. — Proceedings Track, 19:315–340, 2011.
Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.
Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.
Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58(301):13–30, 1963.

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.
Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.
Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.
Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.
Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J. et al. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.
Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Available at http://arxiv.org/abs/1201.6002, 2012.
Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.
Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, Jan. 2007. doi:10.1007/s10107-006-0033-0.
Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.
Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.
Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.
Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. ACM, 54(4):Article 21, 19 pp., Jul. 2007.
So, A. M.-C. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.
Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.
Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., Aug. 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 22: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Exponential Tail Inequalities

Exponential Concentration Proof Sketch

3 Mean Value Trace Inequality

Bound the derivative of the trace mgf

mprime(θ) le θ middot E tr[

∆X eθX]

4 Conditional Variance Bound ∆X 4 cX + v I

Yields differential inequality

mprime(θ) le cθE tr[

X eθX]

+ vθE tr[

eθX]

= cθ middotmprime(θ) + vθ middotm(θ)

Solve to bound m(θ) and thereby bound

Pλmax(X) ge t le infθgt0

eminusθt middotm(θ) le d middot exp minust22v + 2ct

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 22 35

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 23: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Exponential Tail Inequalities

Refined Exponential Concentration

Relaxing the constraint ∆X 4 cX + v

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let (XX prime) be a bounded matrix Stein pair with X isin Hd Definethe function

r(ψ) =1

ψlogE tr(eψ∆Xd) for each ψ gt 0

Then for all t ge 0 and all ψ gt 0

Pλmax(X) ge t le d middot exp minust22r(ψ) + 2t

radicψ

r(ψ) measures typical magnitude of conditional variance

Eλmax(∆X) le infψgt0

[

r(ψ) + log dψ

]

When d = 1 improves scalar result of Chatterjee (2008)Proof extends to unbounded random matricesMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 23 35

Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (Yk)kge1 be independent matrices in Hd satisfying

EYk = 0 and Yk le R for each index k

Define the variance parameter

σ2 =∥

sum

kEY 2

k

Then for all t ge 0

P

λmax

(

sum

kYk

)

ge t

le d middot exp minust23σ2 + 2Rt

Gaussian tail controlled by improved variance when t is smallKey proof idea Apply refined concentration and boundr(ψ) = 1

ψlogE tr(eψ∆Xd) using unrefined concentration

Constants better than Oliveira (2009) worse than Tropp (2011)Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 24 35

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey Jordan Chen Farrell and Tropp 2012)

Let p = 1 or p ge 15 Suppose that (XX prime) is a matrix Stein pairwhere E tr |X|2p ltinfin Then

(

E tr |X|2p)12p le

radic

2pminus 1 middot(

E tr∆pX

)12p

Moral The conditional variance controls the moments of X

Generalizes Chatterjeersquos version (2007) of the scalarBurkholder-Davis-Gundy inequality (Burkholder 1973)

See also Pisier amp Xu (1997) Junge amp Xu (2003 2008)

Proof techniques mirror those for exponential concentration

Also holds for infinite dimensional Schatten-class operators

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 25 35

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
Y = Σ_{j=1}^n A_{j,π(j)}   with mean   EY = (1/n) Σ_{j,k=1}^n A_{jk}

Conditional Variance for X = Y − EY

∆_X(π) = (n/4) · E[ (X − X′)² | π ]
       = (1/4n) Σ_{j,k=1}^n [ A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)} ]²
       ⪯ (1/n) Σ_{j,k=1}^n [ A²_{j,π(j)} + A²_{k,π(k)} + A²_{j,π(k)} + A²_{k,π(j)} ]

⇒ Conditional variance controlled when summands are bounded
⇒ Dependent analogues of concentration and moment inequalities
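The semidefinite bound in the last line, which follows from (a + b − c − d)² ⪯ 4(a² + b² + c² + d²) for Hermitian matrices, can be spot-checked by confirming that the difference of the two sides is positive semidefinite for a sample permutation. Illustrative sketch, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = [[None] * n for _ in range(n)]
for j in range(n):
    for k in range(n):
        M = rng.normal(size=(2, 2))
        A[j][k] = M + M.T

pi = [1, 3, 0, 2]                                      # a sample permutation

lhs = np.zeros((2, 2))
rhs = np.zeros((2, 2))
for j in range(n):
    for k in range(n):
        B = A[j][pi[j]] + A[k][pi[k]] - A[j][pi[k]] - A[k][pi[j]]
        lhs += (B @ B) / (4.0 * n)                     # exact conditional variance
        rhs += (A[j][pi[j]] @ A[j][pi[j]] + A[k][pi[k]] @ A[k][pi[k]]
                + A[j][pi[k]] @ A[j][pi[k]] + A[k][pi[j]] @ A[k][pi[j]]) / n

# Semidefinite order: rhs - lhs should be positive semidefinite
min_eig = np.linalg.eigvalsh(rhs - lhs).min()
assert min_eig >= -1e-9
```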

Extensions

General Complex Matrices
Map any matrix B ∈ C^{d₁×d₂} to a Hermitian matrix via dilation:

D(B) = [ 0   B
         B*  0 ] ∈ H^{d₁+d₂}

Preserves spectral information: λ_max(D(B)) = ‖B‖

Beyond Sums
Matrix-valued functions satisfying a self-reproducing property
e.g., matrix second-order Rademacher chaos Σ_{j,k} ε_j ε_k A_{jk}
Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs
Satisfy E[g(X) − g(X′) | Z] = αX almost surely for g: R → R weakly increasing
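The norm-preserving property of the Hermitian dilation is easy to confirm numerically: the eigenvalues of D(B) are ± the singular values of B (plus zeros), so λ_max(D(B)) equals the spectral norm ‖B‖. Illustrative sketch (assumes NumPy; B is an arbitrary real matrix standing in for a general complex one):

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2 = 3, 2
B = rng.normal(size=(d1, d2))

# Hermitian dilation D(B) in H^{d1 + d2}
D = np.block([[np.zeros((d1, d1)), B],
              [B.T,                np.zeros((d2, d2))]])

# lambda_max(D(B)) equals the spectral norm ||B||
lam_max = np.linalg.eigvalsh(D).max()
assert np.isclose(lam_max, np.linalg.norm(B, 2))
```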

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi: 10.1214/aop/1176997023.

Candes, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candes, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. Available at arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markstrom, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans Cp (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Available at http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi: 10.1007/s10107-006-0033-0. URL http://dl.acm.org/citation.cfm?id=1229716.1229726.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007 (electronic).

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.


Exponential Tail Inequalities

Matrix Bernstein Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (Y_k)_{k≥1} be independent matrices in H^d satisfying

E Y_k = 0 and ‖Y_k‖ ≤ R for each index k.

Define the variance parameter

σ² = ‖ Σ_k E Y_k² ‖.

Then, for all t ≥ 0,

P{ λ_max( Σ_k Y_k ) ≥ t } ≤ d · exp( −t² / (3σ² + 2Rt) )

Gaussian tail controlled by improved variance when t is small
Key proof idea: Apply refined concentration and bound r(ψ) = (1/ψ) log E tr(e^{ψ∆_X} d) using unrefined concentration
Constants better than Oliveira (2009), worse than Tropp (2011)
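For a tiny example the tail bound can be compared against an exactly computed tail probability. The sketch below is illustrative only (it assumes NumPy; the summands Y_k = ε_k A_k with fixed Hermitian A_k and Rademacher signs are one special case of the corollary's hypotheses):

```python
import itertools
import math
import numpy as np

# Independent, zero-mean, uniformly bounded summands: Y_k = eps_k * A_k
A = [np.diag([1.0, -1.0]) if k % 2 == 0 else np.array([[0.0, 1.0], [1.0, 0.0]])
     for k in range(8)]
d = 2
R = max(np.linalg.norm(a, 2) for a in A)             # each ||Y_k|| <= R
sigma2 = np.linalg.norm(sum(a @ a for a in A), 2)    # sigma^2 = ||sum_k E Y_k^2||

t = 3.0
bound = d * math.exp(-t**2 / (3.0 * sigma2 + 2.0 * R * t))

# Exact tail probability by enumerating all 2^8 sign patterns
hits, total = 0, 0
for signs in itertools.product([-1.0, 1.0], repeat=len(A)):
    S = sum(s * a for s, a in zip(signs, A))
    total += 1
    if np.linalg.eigvalsh(S).max() >= t:
        hits += 1
p_exact = hits / total
assert p_exact <= bound
```

With only eight summands the bound is loose, as expected for a small-sample example, but the inequality holds exactly.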

Polynomial Moment Inequalities

Polynomial Moments for Random Matrices

Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let p = 1 or p ≥ 1.5. Suppose that (X, X′) is a matrix Stein pair where E tr |X|^{2p} < ∞. Then

( E tr |X|^{2p} )^{1/(2p)} ≤ √(2p − 1) · ( E tr ∆_X^p )^{1/(2p)}

Moral: The conditional variance controls the moments of X

Generalizes Chatterjee's version (2007) of the scalar Burkholder-Davis-Gundy inequality (Burkholder, 1973)
See also Pisier & Xu (1997), Junge & Xu (2003, 2008)
Proof techniques mirror those for exponential concentration
Also holds for infinite-dimensional Schatten-class operators
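For the Rademacher-series Stein pair constructed later in the deck, ∆_X = Σ_k W_k² deterministically, and at p = 1 the theorem reads E tr X² ≤ E tr ∆_X, which in fact holds with equality since the cross terms of X² vanish in expectation. A small numerical sketch (assumes NumPy; the W_k are arbitrary examples):

```python
import itertools
import numpy as np

W = [np.array([[1.0, 2.0], [2.0, 0.0]]),
     np.array([[0.0, 1.0], [1.0, 3.0]])]
n = len(W)

# E tr X^2 by exact enumeration over sign patterns, X = sum_k eps_k W_k
patterns = list(itertools.product([-1.0, 1.0], repeat=n))
lhs = 0.0
for signs in patterns:
    X = sum(s * w for s, w in zip(signs, W))
    lhs += np.trace(X @ X)
lhs /= len(patterns)

# For this Stein pair, Delta_X = sum_k W_k^2, so E tr Delta_X is deterministic
rhs = np.trace(sum(w @ w for w in W))
assert lhs <= rhs + 1e-12          # theorem at p = 1
assert np.isclose(lhs, rhs)        # and here the bound is tight
```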

Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012)
Let (ε_k)_{k≥1} be an independent sequence of Rademacher random variables and (A_k)_{k≥1} be a deterministic sequence of Hermitian matrices. Then, if p = 1 or p ≥ 1.5,

E tr ( Σ_k ε_k A_k )^{2p} ≤ (2p − 1)^p · tr ( Σ_k A_k² )^p

The noncommutative Khintchine inequality (Lust-Piquard, 1986; Lust-Piquard and Pisier, 1991) is a dominant tool in applied matrix analysis
e.g., used in analysis of column sampling and projection for approximate SVD (Rudelson and Vershynin, 2007)
Stein's method offers an unusually concise proof
The constant √(2p − 1) is within √e of optimal
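The inequality can be verified exactly for small instances by enumerating all sign patterns. An illustrative sketch (assumes NumPy; the A_k below are arbitrary symmetric matrices, not from the slides):

```python
import itertools
import numpy as np

# Fixed Hermitian coefficients and p = 2 (so the moment order is 2p = 4)
A = [np.array([[1.0, 0.0], [0.0, -1.0]]),
     np.array([[0.0, 1.0], [1.0, 0.0]]),
     np.array([[2.0, 1.0], [1.0, 0.0]])]
p = 2

# Left side: E tr (sum_k eps_k A_k)^{2p}, exact over all 2^3 sign patterns
patterns = list(itertools.product([-1.0, 1.0], repeat=len(A)))
lhs = 0.0
for signs in patterns:
    S = sum(s * a for s, a in zip(signs, A))
    lhs += np.trace(np.linalg.matrix_power(S, 2 * p))
lhs /= len(patterns)

# Right side: (2p - 1)^p * tr (sum_k A_k^2)^p
V = sum(a @ a for a in A)
rhs = (2 * p - 1) ** p * np.trace(np.linalg.matrix_power(V, p))
assert lhs <= rhs
```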

Dependent Sequences

Adding Dependence

1. Motivation: Matrix Completion; Matrix Concentration
2. Stein's Method Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences: Sums of Conditionally Zero-mean Matrices; Combinatorial Sums
6. Extensions

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)
Given a sequence of Hermitian matrices (Y_k)_{k=1}^n satisfying the conditional zero mean property E[Y_k | (Y_j)_{j≠k}] = 0 for all k, define the random sum X = Σ_{k=1}^n Y_k.

Note: (Y_k)_{k≥1} is a martingale difference sequence

Examples
Sums of independent centered random matrices
Many sums of conditionally independent random matrices:
Y_k ⫫ (Y_j)_{j≠k} | Z and E[Y_k | Z] = 0
Rademacher series with random matrix coefficients:
X = Σ_k ε_k W_k, with (W_k)_{k≥1} Hermitian and (ε_k)_{k≥1} independent Rademacher
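The conditional zero mean property can be checked by brute force for a Rademacher-series example: group the 2^n sign patterns by the values of the other summands and average Y_k within each group. Illustrative sketch, assuming NumPy (deterministic W_k for simplicity):

```python
import itertools
import numpy as np

# Rademacher series with fixed Hermitian coefficients: Y_k = eps_k * W_k
W = [np.array([[1.0, 2.0], [2.0, 0.0]]),
     np.array([[0.0, 1.0], [1.0, 3.0]]),
     np.array([[1.0, 0.0], [0.0, -1.0]])]
n = len(W)

max_dev = 0.0
for k in range(n):
    groups = {}
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        Y = [s * w for s, w in zip(signs, W)]
        key = tuple(Y[j].tobytes() for j in range(n) if j != k)
        groups.setdefault(key, []).append(Y[k])
    for vals in groups.values():
        m = sum(vals) / len(vals)                 # E[Y_k | (Y_j)_{j != k}]
        max_dev = max(max_dev, float(np.abs(m).max()))
assert max_dev < 1e-12
```

Within every group the sign ε_k is ±1 with equal frequency, so each conditional mean vanishes.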

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)
E[Y_k | (Y_j)_{j≠k}] = 0

Matrix Stein Pair for X = Σ_{k=1}^n Y_k
Let Y′_k and Y_k be conditionally i.i.d. given (Y_j)_{j≠k}
Draw index K uniformly from {1, ..., n}
Define X′ = X + Y′_K − Y_K

Check Stein pair condition:
E[X − X′ | (Y_j)_{j≥1}] = E[Y_K − Y′_K | (Y_j)_{j≥1}]
                        = (1/n) Σ_{k=1}^n ( Y_k − E[Y′_k | (Y_j)_{j≠k}] )
                        = (1/n) Σ_{k=1}^n Y_k = (1/n) X
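For the Rademacher-series special case, the Stein pair identity E[X − X′ | (Y_j)] = (1/n) X can be checked exactly: given the signs, Y′_K is just a fresh fair sign on W_K, so the conditional expectation is a finite average over K and that sign. Illustrative sketch, assuming NumPy:

```python
import itertools
import numpy as np

W = [np.array([[1.0, 2.0], [2.0, 0.0]]),
     np.array([[0.0, 1.0], [1.0, 3.0]]),
     np.array([[1.0, 0.0], [0.0, -1.0]])]
n = len(W)

for signs in itertools.product([-1.0, 1.0], repeat=n):
    X = sum(s * w for s, w in zip(signs, W))
    # Average X - X' over K uniform in {1,...,n} and a fresh sign for Y'_K
    diff = np.zeros((2, 2))
    for K in range(n):
        for new_sign in (-1.0, 1.0):
            Xp = X + new_sign * W[K] - signs[K] * W[K]
            diff += (X - Xp) / (2.0 * n)
    assert np.allclose(diff, X / n)               # E[X - X' | (Y_j)] = (1/n) X
```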



Polynomial Moment Inequalities

Matrix Khintchine Inequality

Corollary (Mackey Jordan Chen Farrell and Tropp 2012)

Let (εk)kge1 be an independent sequence of Rademacher randomvariables and (Ak)kge1 be a deterministic sequence of Hermitianmatrices Then if p = 1 or p ge 15

E tr(

sum

kεkAk

)2p

le (2pminus 1)p middot tr(

sum

kA2k

)p

Noncommutative Khintchine inequality (Lust-Piquard 1986 Lust-Piquard

and Pisier 1991) is a dominant tool in applied matrix analysis

eg Used in analysis of column sampling and projection forapproximate SVD (Rudelson and Vershynin 2007)

Steinrsquos method offers an unusually concise proof

The constantradic2pminus 1 is within

radice of optimal

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 26 35

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences   Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
    Y = ∑_{j=1}^n A_{j,π(j)}   with mean   E Y = (1/n) ∑_{j,k=1}^n A_{jk}

Matrix Stein Pair for X = Y − E Y
    Draw indices (J, K) uniformly from {1, ..., n}²
    Define π' = π ∘ (J, K) and X' = ∑_{j=1}^n A_{j,π'(j)} − E Y
    Check the Stein pair condition:
        E[X − X' | π] = E[A_{J,π(J)} + A_{K,π(K)} − A_{J,π(K)} − A_{K,π(J)} | π]
                      = (1/n²) ∑_{j,k=1}^n ( A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)} )
                      = (2/n)(Y − E Y) = (2/n) X

Mackey (Stanford)   Stein's Method for Matrix Concentration   December 10, 2012   32 / 35
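Since (J, K) ranges over only n² transpositions, the Stein pair condition E[X − X' | π] = (2/n) X can be checked by exhaustive enumeration for a fixed permutation π. A sketch under the same toy setup as before (names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 2
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2          # symmetric d x d blocks

def Y(perm):
    return sum(A[j, perm[j]] for j in range(n))

EY = A.sum(axis=(0, 1)) / n
pi = list(rng.permutation(n))
X = Y(pi) - EY

# Enumerate all n^2 index pairs (J, K); pi' swaps positions J and K
diff = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        sigma = pi.copy()
        sigma[J], sigma[K] = sigma[K], sigma[J]
        diff += (Y(pi) - Y(sigma)) / n**2      # accumulates E[X - X' | pi]

assert np.allclose(diff, (2 / n) * X)          # Stein pair condition
```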

Dependent Sequences   Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)
    Y = ∑_{j=1}^n A_{j,π(j)}   with mean   E Y = (1/n) ∑_{j,k=1}^n A_{jk}

Conditional Variance for X = Y − E Y
    ∆_X(π) = (n/4) · E[(X − X')² | π]
           = (1/4n) ∑_{j,k=1}^n [ A_{j,π(j)} + A_{k,π(k)} − A_{j,π(k)} − A_{k,π(j)} ]²
           ≼ (1/n) ∑_{j,k=1}^n [ A²_{j,π(j)} + A²_{k,π(k)} + A²_{j,π(k)} + A²_{k,π(j)} ]

⇒ Conditional variance controlled when summands are bounded
⇒ Dependent analogues of concentration and moment inequalities

Mackey (Stanford)   Stein's Method for Matrix Concentration   December 10, 2012   33 / 35

Extensions

Extensions

General Complex Matrices
    Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via dilation:
        D(B) = [ 0    B ]
               [ B*   0 ]  ∈ H^{d1+d2}
    Preserves spectral information: λmax(D(B)) = ‖B‖

Beyond Sums
    Matrix-valued functions satisfying a self-reproducing property
    e.g., matrix second-order Rademacher chaos ∑_{j,k} ε_j ε_k A_{jk}
    Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs
    Satisfy E[g(X) − g(X') | Z] = αX almost surely
    for g: ℝ → ℝ weakly increasing

Mackey (Stanford)   Stein's Method for Matrix Concentration   December 10, 2012   34 / 35
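The dilation identity λmax(D(B)) = ‖B‖ is easy to confirm numerically, since the eigenvalues of D(B) are ± the singular values of B (padded with zeros). A short sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(4)
d1, d2 = 3, 4
B = rng.standard_normal((d1, d2)) + 1j * rng.standard_normal((d1, d2))

# Hermitian dilation D(B) = [[0, B], [B*, 0]]
D = np.block([[np.zeros((d1, d1)), B],
              [B.conj().T, np.zeros((d2, d2))]])

lam_max = np.linalg.eigvalsh(D).max()   # D is Hermitian by construction
spec_norm = np.linalg.norm(B, 2)        # largest singular value of B

assert np.isclose(lam_max, spec_norm)   # lambda_max(D(B)) = ||B||
```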

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.
Bernstein, S. The theory of probabilities. Gastehizdat Publishing House, 1946.
Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi: 10.1214/aop/1176997023.
Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.
Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Info. Theory, 2009. To appear. Available at arXiv:0903.1476.
Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.
Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. URL arXiv:math/0507526v1.
Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.
Christofides, D. and Markström, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.
Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.
Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.
Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.
Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.
Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Mackey (Stanford)   Stein's Method for Matrix Concentration   December 10, 2012   35 / 35

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.
Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.
Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.
Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.
Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.
Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.
Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. URL http://arxiv.org/abs/1201.6002, 2012.
Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.
Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi: 10.1007/s10107-006-0033-0. URL http://dl.acm.org/citation.cfm?id=1229716.1229726.
Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.
Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.
Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.
Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007 (electronic).
So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.
Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.
Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

Mackey (Stanford)   Stein's Method for Matrix Concentration   December 10, 2012   36 / 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 27: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Dependent Sequences

Adding Dependence

1 MotivationMatrix CompletionMatrix Concentration

2 Steinrsquos Method Background and Notation

3 Exponential Tail Inequalities

4 Polynomial Moment Inequalities

5 Dependent SequencesSums of Conditionally Zero-mean MatricesCombinatorial Sums

6 Extensions

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 27 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 28: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Sum of Conditionally Zero-Mean Matrices)

Given a sequence of Hermitian matrices (Yk)nk=1 satisfying the

Conditional zero mean property E[Yk | (Yj)j 6=k] = 0

for all k define the random sum X =sumn

k=1Yk

Note (Yk)kge1 is a martingale difference sequence

Examples

Sums of independent centered random matricesMany sums of conditionally independent random matrices

Yk perpperp (Yj)j 6=k | Z and E[Yk |Z] = 0

Rademacher series with random matrix coefficients

X =sum

kεkWk

(Wk)kge1 Hermitian (εk)kge1 independent Rademacher

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 28 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 29: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Matrix Stein Pair for X =sumn

k=1 Yk

Let Y primek and Yk be conditionally iid given (Yj)j 6=k

Draw index K uniformly from 1 nDefine X prime = X + Y prime

K minus YK

Check Stein pair condition

E[X minusX prime | (Yj)jge1] = E[YK minus Y primeK | (Yj)jge1]

=1

n

sumn

k=1

(

Yk minus E[Y primek | (Yj)j 6=k]

)

=1

n

sumn

k=1Yk =

1

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 29 35

Dependent Sequences Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Yk | (Yj)j 6=k] = 0

Conditional Variance for X = Y minus EY

∆X =n

2middot E

[

(X minusX prime)2 | (Yj)jge1

]

=n

2middot E

[

(YK minus Y primeK)

2 | (Yj)jge1

]

=1

2

sumn

k=1

(

Y 2k + E[Y 2

k | (Yj)j 6=k])

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 30 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. URL http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi: 10.1007/s10107-006-0033-0. URL http://dl.acm.org/citation.cfm?id=1229716.1229726.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007 (electronic).

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions

Dependent Sequences: Sums of Conditionally Zero-mean Matrices

Sums of Conditionally Zero-mean Matrices

Definition (Conditional Zero Mean Property)

E[Y_k | (Y_j)_{j≠k}] = 0

Conditional Variance for X = Y − EY:

Δ_X = (n/2) · E[(X − X′)² | (Y_j)_{j≥1}]
    = (n/2) · E[(Y_K − Y′_K)² | (Y_j)_{j≥1}]
    = (1/2) Σ_{k=1}^n ( Y_k² + E[Y_k² | (Y_j)_{j≠k}] )

⇒ Conditional variance controlled when summands are bounded
⇒ Dependent analogues of concentration and moment inequalities
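The collapse of the conditional variance for bounded summands can be sanity-checked numerically. A minimal NumPy sketch (illustrative, not from the slides): with independent Rademacher signs ε_k and fixed symmetric matrices A_k, the summands Y_k = ε_k · A_k satisfy the conditional zero-mean property, and the conditional variance reduces to Σ_k A_k².

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3

# Fixed symmetric matrices A_k (real stand-ins for Hermitian matrices).
A = []
for _ in range(n):
    M = rng.standard_normal((d, d))
    A.append((M + M.T) / 2)

# Y_k = eps_k * A_k with independent Rademacher signs is conditionally
# zero-mean: E[Y_k | (Y_j)_{j != k}] = E[eps_k] * A_k = 0.
eps = rng.choice([-1.0, 1.0], size=n)
Y = [eps[k] * A[k] for k in range(n)]

# Conditional variance: (1/2) * sum_k ( Y_k^2 + E[Y_k^2 | rest] ).
# Here Y_k^2 = A_k^2 deterministically, so both terms equal A_k^2.
Delta_X = 0.5 * sum(Yk @ Yk + Ak @ Ak for Yk, Ak in zip(Y, A))
bound = sum(Ak @ Ak for Ak in A)

assert np.allclose(Delta_X, bound)
```

So whenever the A_k are bounded in norm, Δ_X is bounded as well, which is exactly the control the concentration argument needs.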

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (A_{jk})_{j,k=1}^n of Hermitian matrices and a uniformly random permutation π on {1, ..., n}, define the combinatorial matrix statistic

Y = Σ_{j=1}^n A_{jπ(j)}   with mean   EY = (1/n) Σ_{j,k=1}^n A_{jk}

Generalizes the scalar statistics studied by Hoeffding (1951)

Example
Sampling without replacement from {B_1, ..., B_n}: W = Σ_{j=1}^s B_{π(j)}
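The mean formula can be verified exactly for small n (an illustrative sketch, not from the slides): each pair (j, k) is matched with probability 1/n, so averaging Y over all n! permutations reproduces (1/n) Σ_{j,k} A_{jk}.

```python
import itertools
import math

import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 2

# Deterministic array (A_jk) of symmetric d x d matrices.
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2

def Y(perm):
    # Combinatorial matrix statistic: sum_j A[j, perm(j)].
    return sum(A[j, perm[j]] for j in range(n))

# Exact expectation over all n! permutations.
EY = sum(Y(p) for p in itertools.permutations(range(n))) / math.factorial(n)

# Closed form: (1/n) * sum_{j,k} A_jk.
closed_form = A.sum(axis=(0, 1)) / n

assert np.allclose(EY, closed_form)
```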

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y = Σ_{j=1}^n A_{jπ(j)}   with mean   EY = (1/n) Σ_{j,k=1}^n A_{jk}

Matrix Stein Pair for X = Y − EY:

Draw indices (J, K) uniformly from {1, ..., n}².
Define π′ = π ∘ (J K) and X′ = Σ_{j=1}^n A_{jπ′(j)} − EY.

Check Stein pair condition:

E[X − X′ | π] = E[ A_{Jπ(J)} + A_{Kπ(K)} − A_{Jπ(K)} − A_{Kπ(J)} | π ]
             = (1/n²) Σ_{j,k=1}^n ( A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)} )
             = (2/n) (Y − EY) = (2/n) X
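Since (J, K) is the only randomness in X′ given π, the Stein pair identity can be checked exactly for a fixed π by averaging X − X′ over all n² index pairs (an illustrative NumPy sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 3

# Deterministic array (A_jk) of symmetric matrices and a fixed permutation pi.
A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2
pi = rng.permutation(n)

EY = A.sum(axis=(0, 1)) / n
X = sum(A[j, pi[j]] for j in range(n)) - EY

# Average X - X' over all n^2 choices of the swapped pair (J, K).
diff = np.zeros((d, d))
for J in range(n):
    for K in range(n):
        pi2 = pi.copy()
        pi2[J], pi2[K] = pi2[K], pi2[J]   # pi' = pi composed with (J K)
        X2 = sum(A[j, pi2[j]] for j in range(n)) - EY
        diff += (X - X2) / n**2

# Stein pair condition: E[X - X' | pi] = (2/n) X.
assert np.allclose(diff, (2 / n) * X)
```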

Dependent Sequences: Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y = Σ_{j=1}^n A_{jπ(j)}   with mean   EY = (1/n) Σ_{j,k=1}^n A_{jk}

Conditional Variance for X = Y − EY:

Δ_X(π) = (n/4) · E[ (X − X′)² | π ]
       = (1/4n) Σ_{j,k=1}^n [ A_{jπ(j)} + A_{kπ(k)} − A_{jπ(k)} − A_{kπ(j)} ]²
       ≼ (1/n) Σ_{j,k=1}^n [ A_{jπ(j)}² + A_{kπ(k)}² + A_{jπ(k)}² + A_{kπ(j)}² ]

⇒ Conditional variance controlled when summands are bounded
⇒ Dependent analogues of concentration and moment inequalities
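The semidefinite bound rests on the operator inequality (M₁ + M₂ − M₃ − M₄)² ≼ 4(M₁² + M₂² + M₃² + M₄²). A numerical sketch (illustrative, not from the slides) checks that the right-hand side dominates Δ_X(π) in the positive semidefinite order for a random instance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 3

A = rng.standard_normal((n, n, d, d))
A = (A + A.transpose(0, 1, 3, 2)) / 2
pi = rng.permutation(n)

Delta = np.zeros((d, d))   # conditional variance Delta_X(pi)
bound = np.zeros((d, d))   # (1/n) * sum of the four squared terms
for j in range(n):
    for k in range(n):
        D = A[j, pi[j]] + A[k, pi[k]] - A[j, pi[k]] - A[k, pi[j]]
        Delta += (D @ D) / (4 * n)
        for M in (A[j, pi[j]], A[k, pi[k]], A[j, pi[k]], A[k, pi[j]]):
            bound += (M @ M) / n

# bound - Delta should be positive semidefinite (up to roundoff).
assert np.linalg.eigvalsh(bound - Delta).min() >= -1e-10
```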

Extensions

General Complex Matrices

Map any matrix B ∈ C^{d1×d2} to a Hermitian matrix via dilation:

D(B) = [ 0   B
         B*  0 ] ∈ H^{d1+d2}

Preserves spectral information: λmax(D(B)) = ‖B‖

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property
e.g., matrix second-order Rademacher chaos Σ_{j,k} ε_j ε_k A_{jk}
Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X) − g(X′) | Z] = αX almost surely for g: R → R weakly increasing
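The dilation trick is easy to illustrate numerically (a sketch, not from the slides): D(B) is Hermitian by construction, its eigenvalues are the singular values of B together with their negatives, and so its largest eigenvalue recovers the spectral norm ‖B‖.

```python
import numpy as np

def dilation(B):
    """Hermitian dilation of a (possibly rectangular) complex matrix B."""
    d1, d2 = B.shape
    top = np.hstack([np.zeros((d1, d1), dtype=complex), B])
    bot = np.hstack([B.conj().T, np.zeros((d2, d2), dtype=complex)])
    return np.vstack([top, bot])

rng = np.random.default_rng(4)
B = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))
D = dilation(B)

assert np.allclose(D, D.conj().T)          # D(B) is Hermitian
# lambda_max(D(B)) equals the spectral norm of B.
assert np.isclose(np.linalg.eigvalsh(D).max(), np.linalg.norm(B, 2))
```

This is why results for Hermitian matrices transfer to general rectangular ones: bound λmax(D(B) − E D(B)) and read off a bound on ‖B − EB‖.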

Extensions

References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The theory of probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi: 10.1214/aop/1176997023.

Candes, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candes, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Info. Theory, 2009. To appear. Available at arXiv:0903.1476.

Chatterjee, S. Stein's method for concentration inequalities. Probab. Theory Related Fields, 138:305–321, 2007.

Chatterjee, S. Concentration inequalities with exchangeable pairs. PhD thesis, Stanford University, Palo Alto, Feb. 2008. URL arXiv:math/0507526v1.

Cheung, S.-S., So, A. Man-Cho, and Wang, K. Chance-constrained linear matrix inequalities with dependent perturbations: A safe tractable approximation approach. Available at http://www.optimization-online.org/DB_FILE/2011/01/2898.pdf, 2011.

Christofides, D. and Markstrom, K. Expansion properties of random Cayley graphs and vertex transitive graphs via matrix martingales. Random Struct. Algorithms, 32(1):88–100, 2008.

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.

Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. Journal of Machine Learning Research - Proceedings Track, 19:315–340, 2011.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theory, 57(3):1548–1566, Mar. 2011.

Hoeffding, W. A combinatorial central limit theorem. Ann. Math. Statist., 22:558–566, 1951.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 31: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Given a deterministic array (Ajk)njk=1 of Hermitian matrices and a

uniformly random permutation π on 1 n define thecombinatorial matrix statistic

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Generalizes the scalar statistics studied by Hoeffding (1951)

Example

Sampling without replacement from B1 BnW =

sums

j=1Bπ(j)

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 31 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 32: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Matrix Stein Pair for X = Y minus EY

Draw indices (JK) uniformly from 1 n2Define πprime = π (JK) and X prime =

sumnj=1Ajπprime(j) minus EY

Check Stein pair condition

E[X minusX prime | π] = E[

AJπ(J) +AKπ(K) minusAJπ(K) minusAKπ(J) | π]

=1

n2

sumn

jk=1Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

=2

n(Y minus EY ) =

2

nX

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 32 35

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 33: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Dependent Sequences Combinatorial Sums

Combinatorial Sums of Matrices

Definition (Combinatorial Matrix Statistic)

Y =sumn

j=1Ajπ(j) with mean EY =

1

n

sumn

jk=1Ajk

Conditional Variance for X = Y minus EY

∆X(π) =n

4E[

(X minusX prime)2 | π]

=1

4n

sumn

jk=1

[

Ajπ(j) +Akπ(k) minusAjπ(k) minusAkπ(j)

]2

41

n

sumn

jk=1

[

A2jπ(j) +A2

kπ(k) +A2jπ(k) +A2

kπ(j)

]

rArr Conditional variance controlled when summands are bounded

rArr Dependent analogues of concentration and moment inequalities

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 33 35

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matrices Available atarXiv11041672 2011a

Hsu D Kakade S M and Zhang T Dimension-free tail inequalities for sums of random matricesarXiv11041672v3[mathPR] 2011b

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities Ann Probab 31(2)948ndash995 2003

Junge M and Xu Q Noncommutative BurkholderRosenthal inequalities II Applications Israel J Math 167227ndash282 2008

Lust-Piquard F Inegalites de Khintchine dans Cp (1 lt p lt infin) C R Math Acad Sci Paris 303(7)289ndash292 1986

Lust-Piquard F and Pisier G Noncommutative Khintchine and Paley inequalities Ark Mat 29(2)241ndash260 1991

Mackey L Talwalkar A and Jordan M I Divide-and-conquer matrix factorization In Shawe-Taylor J Zemel R SBartlett P L Pereira F C N and Weinberger K Q (eds) Advances in Neural Information Processing Systems 24 pp1134ndash1142 2011

Mackey L Jordan M I Chen R Y Farrell B and Tropp J A Matrix concentration inequalities via the method ofexchangeable pairs URL httparxivorgabs12016002 2012

Negahban S and Wainwright M J Restricted strong convexity and weighted matrix completion Optimal bounds with noisearXiv10092118v2[csIT] 2010

Nemirovski A Sums of random symmetric matrices and quadratic optimization under orthogonality constraints Math

Program 109283ndash317 January 2007 ISSN 0025-5610 doi 101007s10107-006-0033-0 URLhttpdlacmorgcitationcfmid=12297161229726

Oliveira R I Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges Availableat arXiv09110600 Nov 2009

Pisier G and Xu Q Non-commutative martingale inequalities Comm Math Phys 189(3)667ndash698 1997

Recht B Simpler approach to matrix completion J Mach Learn Res 123413ndash3430 2011

Rudelson M and Vershynin R Sampling from large matrices An approach through geometric functional analysis J Assoc

Comput Mach 54(4)Article 21 19 pp Jul 2007 (electronic)

So A Man-Cho Moment inequalities for sums of random matrices and their applications in optimization Math Program 130(1)125ndash151 2011

Stein C A bound for the error in the normal approximation to the distribution of a sum of dependent random variables InProc 6th Berkeley Symp Math Statist Probab Berkeley 1972 Univ California Press

Tropp J A User-friendly tail bounds for sums of random matrices Found Comput Math August 2011

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 36 35

  • Motivation
    • Matrix Completion
    • Matrix Concentration
      • Extensions
Page 34: Stein's Method for Matrix Concentrationlmackey/papers/matstein-12_10_12-slides.pdf · Stein’s Method for Matrix Concentration Lester Mackey† Collaborators: Michael I. Jordan‡,

Extensions

Extensions

General Complex Matrices

Map any matrix B isin Cd1timesd2 to a Hermitian matrix via dilation

D(B) =

[

0 B

Blowast0

]

isin Hd1+d2

Preserves spectral information λmax(D(B)) = B

Beyond Sums

Matrix-valued functions satisfying a self-reproducing property

eg Matrix second-order Rademacher chaossum

jk εjεkAjk

Yields a dependent bounded differences inequality for matrices

Generalized Matrix Stein Pairs

Satisfy E[g(X)minus g(X prime) |Z] = αX almost surely forg R rarr R weakly increasingMackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 34 35

Extensions

References IAhlswede R and Winter A Strong converse for identification via quantum channels IEEE Trans Inform Theory 48(3)

569ndash579 Mar 2002

Bernstein S The theory of probabilities Gastehizdat Publishing House 1946

Burkholder D L Distribution function inequalities for martingales Ann Probab 119ndash42 1973 doi101214aop1176997023

Candes E J and Recht B Exact matrix completion via convex optimization Found Comput Math 9717ndash772 2009

Candes E J and Tao T The power of convex relaxation Near-optimal matrix completion IEEE Trans Info Theory 2009URL arXiv09031476 To appear Available at arXiv09031476

Chatterjee S Steinrsquos method for concentration inequalities Probab Theory Related Fields 138305ndash321 2007

Chatterjee S Concentration inequalities with exchangeable pairs PhD thesis Stanford University Palo Alto Feb 2008 URLarxivmath0507526vl

Cheung S-S So A Man-Cho and Wang K Chance-constrained linear matrix inequalities with dependent perturbations Asafe tractable approximation approach Available athttpwwwoptimization-onlineorgDB_FILE2011012898pdf 2011

Christofides D and Markstrom K Expansion properties of random cayley graphs and vertex transitive graphs via matrixmartingales Random Struct Algorithms 32(1)88ndash100 2008

Drineas P Mahoney M W and Muthukrishnan S Relative-error CUR matrix decompositions SIAM Journal on Matrix

Analysis and Applications 30844ndash881 2008

Foygel R and Srebro N Concentration-based guarantees for low-rank matrix reconstruction Journal of Machine Learning

Research - Proceedings Track 19315ndash340 2011

Gross D Recovering low-rank matrices from few coefficients in any basis IEEE Trans Inform Theory 57(3)1548ndash1566 Mar2011

Hoeffding W A combinatorial central limit theorem Ann Math Statist 22558ndash566 1951

Hoeffding W Probability inequalities for sums of bounded random variables Journal of the American Statistical Association58(301)13ndash30 1963

Mackey (Stanford) Steinrsquos Method for Matrix Concentration December 10 2012 35 35

Extensions

References II

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. Available at arXiv:1104.1672, 2011a.

Hsu, D., Kakade, S. M., and Zhang, T. Dimension-free tail inequalities for sums of random matrices. arXiv:1104.1672v3 [math.PR], 2011b.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities. Ann. Probab., 31(2):948–995, 2003.

Junge, M. and Xu, Q. Noncommutative Burkholder/Rosenthal inequalities II: Applications. Israel J. Math., 167:227–282, 2008.

Lust-Piquard, F. Inégalités de Khintchine dans C_p (1 < p < ∞) [Khintchine inequalities in C_p (1 < p < ∞)]. C. R. Math. Acad. Sci. Paris, 303(7):289–292, 1986.

Lust-Piquard, F. and Pisier, G. Noncommutative Khintchine and Paley inequalities. Ark. Mat., 29(2):241–260, 1991.

Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 1134–1142, 2011.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. URL http://arxiv.org/abs/1201.6002, 2012.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. arXiv:1009.2118v2 [cs.IT], 2010.

Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Math. Program., 109:283–317, January 2007. ISSN 0025-5610. doi: 10.1007/s10107-006-0033-0. URL http://dl.acm.org/citation.cfm?id=1229716.1229726.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Available at arXiv:0911.0600, Nov. 2009.

Pisier, G. and Xu, Q. Non-commutative martingale inequalities. Comm. Math. Phys., 189(3):667–698, 1997.

Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.

Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. J. Assoc. Comput. Mach., 54(4):Article 21, 19 pp., Jul. 2007 (electronic).

So, A. Man-Cho. Moment inequalities for sums of random matrices and their applications in optimization. Math. Program., 130(1):125–151, 2011.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., Berkeley, 1972. Univ. California Press.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., August 2011.

Mackey (Stanford) Stein's Method for Matrix Concentration December 10, 2012 36 / 35


References I

Ahlswede, R. and Winter, A. Strong converse for identification via quantum channels. IEEE Trans. Inform. Theory, 48(3):569–579, Mar. 2002.

Bernstein, S. The Theory of Probabilities. Gastehizdat Publishing House, 1946.

Burkholder, D. L. Distribution function inequalities for martingales. Ann. Probab., 1:19–42, 1973. doi: 10.1214/aop/1176997023.

Candès, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717–772, 2009.

Candès, E. J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Info. Theory, 2009. To appear. Available at arXiv:0903.1476.


