
CONSTRAINED LOW-RANK MATRIX (AND TENSOR) ESTIMATION

Lenka Zdeborová (IPhT, CEA Saclay, France)

with T. Lesieur, F. Krzakala; Proofs with J. Xu, J. Barbier, N. Macris, M. Dia, M. Lelarge, L. Miolane.

LET’S PLAY A GAME

[Figure: N=15 people, each holding a card with value +1 or −1.]


LET’S PLAY A GAME

• Generate a random Gaussian variable Z (zero mean and variance Δ)

• Report:

‣ $Y = Z + 1/\sqrt{N}$ if the cards were the same.

‣ $Y = Z - 1/\sqrt{N}$ if the cards were different.


LET’S PLAY A GAME

• Each pair (i,j) reports:

‣ $Y_{ij} = Z_{ij} + 1/\sqrt{N}$ if the cards are the same.

‣ $Y_{ij} = Z_{ij} - 1/\sqrt{N}$ if the cards are different.

with $Z_{ij} \sim \mathcal{N}(0, \Delta)$.

Collect $Y_{ij}$ for every pair (i,j).

Goal: Recover the cards (up to symmetry) purely from the knowledge of $Y = \{Y_{ij}\}_{i<j}$.
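As a concrete illustration (not part of the slides), here is a minimal Python sketch of the data-generating process just described; the function name and parameter values are my own choices.

```python
# Minimal sketch of the card game: each of N players holds a hidden card
# x*_i in {-1,+1}; every pair (i,j) reports Y_ij = Z_ij + x*_i x*_j / sqrt(N)
# with Z_ij ~ N(0, Delta).
import numpy as np

def generate_game(N=15, Delta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x_star = rng.choice([-1, 1], size=N)               # the hidden cards
    Z = np.triu(rng.normal(0.0, np.sqrt(Delta), (N, N)), 1)
    Z = Z + Z.T                                        # symmetric Gaussian noise
    Y = np.outer(x_star, x_star) / np.sqrt(N) + Z
    np.fill_diagonal(Y, 0.0)                           # only pairs i < j are reported
    return x_star, Y

x_star, Y = generate_game()
```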

HOW TO SOLVE THIS?

True values of the cards: $x^* \in \{-1,+1\}^N$, with $Y_{ij} = \frac{1}{\sqrt{N}}\, x^*_i x^*_j + Z_{ij}$, $Z_{ij} \sim \mathcal{N}(0,\Delta)$.

Eigen-decomposition of $Y$ (aka PCA) minimises $\sum_{i<j} (Y_{ij} - \hat{Y}_{ij})^2$ with $\mathrm{rank}(\hat{Y}) = 1$.

$x^{\rm PCA}$ (leading eigenvector of $Y$) estimates $x^*$ (up to a sign).

BBP phase transition: for $\Delta < 1$, $|x^{\rm PCA} \cdot x^*| > 0$; for $\Delta > 1$, $x^{\rm PCA} \cdot x^* \approx 0$.
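A quick numerical check of the PCA estimator and the BBP transition is sketched below (my own illustration; the helper name, system size, and Δ values are assumptions, not from the talk).

```python
# PCA estimate of x*: take the leading eigenvector of Y and measure its overlap
# with the hidden cards; it is macroscopic for Delta < 1 and ~0 for Delta > 1.
import numpy as np

def pca_overlap(N=2000, Delta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x_star = rng.choice([-1.0, 1.0], size=N)
    Z = np.triu(rng.normal(0.0, np.sqrt(Delta), (N, N)), 1)
    Y = np.outer(x_star, x_star) / np.sqrt(N) + Z + Z.T
    eigval, eigvec = np.linalg.eigh(Y)
    x_pca = np.sqrt(N) * eigvec[:, -1]        # leading eigenvector, rescaled to norm sqrt(N)
    return abs(x_pca @ x_star) / N            # overlap, defined up to a global sign

print(pca_overlap(Delta=0.5))   # noticeably > 0: below the BBP transition
print(pca_overlap(Delta=1.5))   # close to 0:     above the BBP transition
```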

MAIN QUESTIONS

What is the minimal achievable estimation error on x*?

(Is it possible to do better than PCA?)

What is the minimal efficiently achievable estimation error on x*?

BAYESIAN INFERENCE

Values of the cards: $x_i \in \{-1,+1\}$, $x \in \{-1,+1\}^N$.

Posterior distribution:

$P(x|Y) = \frac{P(x)\,P(Y|x)}{P(Y)} = \frac{1}{Z(Y,\Delta)} \prod_{i=1}^{N} \left[\delta(x_i+1) + \delta(x_i-1)\right] \prod_{i<j} e^{-\frac{(Y_{ij} - x_i x_j/\sqrt{N})^2}{2\Delta}}$

Bayes-optimal inference = computation of the marginals (the argmax of each marginal maximizes the number of correctly assigned values; the mean of the marginals minimises the mean-squared error).

IN THIS TALK

Bayes-optimal inference for generic prior and output. Generate ground-truth $x^*_i$ from $P_X$; generate $Y_{ij}$ from $P_{\rm out}$. Goal: infer $x^*$ from $Y$.

Symmetric case:

$P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i<j} P_{\rm out}\!\left(Y_{ij} \,\middle|\, x_i^\top x_j/\sqrt{N}\right)$

or asymmetric case:

$P(u,v|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_U(u_i) \prod_{j=1}^{M} P_V(v_j) \prod_{i,j} P_{\rm out}\!\left(Y_{ij} \,\middle|\, u_i^\top v_j/\sqrt{N}\right)$

or tensor (order-$p$) case:

$P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i_1<\dots<i_p} P_{\rm out}\!\left(Y_{i_1\dots i_p} \,\middle|\, \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x_{i_1}\cdots x_{i_p}\right)$

Another example: stochastic block model (dense).

$r$-valued cards: $P_X(x) = \frac{1}{r}\sum_{k=1}^{r} \delta(x - e_k)$, with $x \in \mathbb{R}^r$ and $e_k^\top = (0,\dots,0,1,0,\dots,0)$.

$Y_{ij}$ is the adjacency matrix of a graph:

$P_{\rm out}\!\left(Y_{ij}=1 \,\middle|\, \tfrac{x_i^\top x_j}{\sqrt{N}}\right) = p_{\rm out} + \frac{\mu}{\sqrt{N}}\, x_i^\top x_j$

$P_{\rm out}\!\left(Y_{ij}=0 \,\middle|\, \tfrac{x_i^\top x_j}{\sqrt{N}}\right) = 1 - p_{\rm out} - \frac{\mu}{\sqrt{N}}\, x_i^\top x_j$
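For concreteness, a small sketch (my own, with assumed parameter values) of how one could sample the adjacency matrix of this dense stochastic block model:

```python
# Dense SBM in this parametrisation: nodes get one-hot group labels x_i, and an
# edge appears with probability p_out + (mu/sqrt(N)) x_i^T x_j, i.e. p_out between
# groups and p_out + mu/sqrt(N) inside a group.
import numpy as np

def generate_dense_sbm(N=1000, r=3, p_out=0.5, mu=2.0, seed=0):
    rng = np.random.default_rng(seed)
    groups = rng.integers(0, r, size=N)                        # group label of each node
    same = (groups[:, None] == groups[None, :]).astype(float)  # x_i^T x_j for one-hot x
    prob = p_out + mu / np.sqrt(N) * same
    Y = np.triu((rng.random((N, N)) < prob).astype(int), 1)
    return groups, Y + Y.T                                     # symmetric adjacency matrix

groups, Y = generate_dense_sbm()
```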

More examples (symmetric case):

$P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i<j} P_{\rm out}\!\left(Y_{ij} \,\middle|\, x_i^\top x_j/\sqrt{N}\right)$

Submatrix localization.

$\mathbb{Z}_2$ synchronization.

Planted spin glass (Ising/spherical/vectorial).

Spiked Wigner models.

ASYMMETRIC CASE

$P(u,v|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_U(u_i) \prod_{j=1}^{M} P_V(v_j) \prod_{i,j} P_{\rm out}\!\left(Y_{ij} \,\middle|\, u_i^\top v_j/\sqrt{N}\right)$

Gaussian mixture clustering.

Biclustering.

Dawid-Skene model for crowdsourcing.

Johnstone’s spiked covariance model.

Restricted Boltzmann machine with random weights.

TENSOR ESTIMATION

$P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i_1<\dots<i_p} P_{\rm out}\!\left(Y_{i_1\dots i_p} \,\middle|\, \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x_{i_1}\cdots x_{i_p}\right)$

Spiked tensor model (Richard, Montanari, NIPS’14).

Hyper-graph clustering.

Tensor completion.

Sub-tensor localisation.

OUR RESULTS

In the limit $N \to \infty$, $M/N = \alpha = O(1)$, we compute rigorously the minimum mean-squared error

$\mathrm{MMSE} = \frac{1}{N} \sum_{i=1}^{N} \left(x^*_i - \hat{x}_i\right)^2, \qquad \hat{x}_i = \sum_{x} x_i \, P(x|Y),$

for the symmetric case

$P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i<j} P_{\rm out}\!\left(Y_{ij} \,\middle|\, x_i^\top x_j/\sqrt{N}\right)$

as well as for the asymmetric and tensor cases.

Message-passing algorithm that is asymptotically optimal outside of a sharply delimited “hard” region of parameters.

COMMENTS

Limit $N \to \infty$, $M/N = \alpha = O(1)$: high-dimensional statistics. Rank $= O(1)$.

Regime of MSE: when is the MSE better than a random pick from the prior, and by how much? A statistician would perhaps rather ask how fast the MSE goes to zero.

When we talk about sparsity we mean a finite fraction of non-zeros; most statistics works take the number of non-zeros to be $o(N)$.

The noise and the spikes are i.i.d., which does not describe most real data. Still, a precise analysis of optimality and of many algorithms is possible, with intriguing behaviour (phase transitions).

BACK TO THE CARD GAME

How do we compute the Bayes-optimal performance? Map to a spin glass: $Y \to J$, $x_i \to S_i$, with $S_i \in \{-1,+1\}$.

$P(S|J) = \frac{1}{Z(J,\Delta)} \prod_{i<j} e^{-\frac{(J_{ij} - S_i S_j/\sqrt{N})^2}{2\Delta}} = \frac{1}{\tilde{Z}(J,\Delta)}\, e^{\frac{1}{\Delta\sqrt{N}} \sum_{i<j} J_{ij} S_i S_j}$

Boltzmann measure of a mean-field Ising spin glass (Sherrington-Kirkpatrick’75 model), with $J_{ij}$ conditioned on $S_i^*$: planted disorder.

MEAN-FIELD SPIN GLASS

‣ Mean-field spin glass models are solvable using the non-rigorous replica / cavity method (Mézard, Parisi, Nishimori, Watkin, Nadal, Sompolinsky, and many many others, 70s-80s).

‣ For Ising spins $S_i \in \{-1,+1\}$ (De Almeida, Thouless’78):

[Figure: estimation error (between 0 and 0.5) as a function of $\sqrt{\Delta}$, with a phase transition at $\Delta = \Delta^*$.]

LET’S JUMP ~40 YEARS FORWARD:

MAIN RESULTS

DEFINITIONS:

Fisher-score matrix: $S_{ij} \equiv \left.\frac{\partial \log P_{\rm out}(y_{ij}|w)}{\partial w}\right|_{y_{ij},\, w=0}$

Fisher information: $\frac{1}{\Delta} \equiv \mathbb{E}_{P_{\rm out}(y|w=0)}\!\left[\left(\left.\frac{\partial \log P_{\rm out}(y|w)}{\partial w}\right|_{y,\, w=0}\right)^{\!2}\right]$

Low-dimensional denoising measure and its mean (with $A \in \mathbb{R}^{r\times r}$, $B, x \in \mathbb{R}^{r}$):

$\mathcal{P}(x; A, B) = \frac{1}{\mathcal{Z}(A,B)}\, P_X(x)\, \exp\!\left(B^\top x - \frac{x^\top A x}{2}\right), \qquad f(A,B) \equiv \mathbb{E}_{\mathcal{P}(x;A,B)}\left[x\right]$

For the symmetric posterior $P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i<j} P_{\rm out}\!\left(Y_{ij} \,\middle|\, x_i^\top x_j/\sqrt{N}\right)$.

THEOREMS:

Theorem 1: $\frac{1}{N}\log Z(Y)$ concentrates around the maximum over $M \in \mathbb{R}^{r\times r}$ of the replica-symmetric free energy

$\Phi(M) = \mathbb{E}_{x,w}\!\left[\log \mathcal{Z}\!\left(\frac{M}{\Delta},\ \frac{M}{\Delta}x + \sqrt{\frac{M}{\Delta}}\, w\right)\right] - \frac{\mathrm{Tr}(MM^\top)}{4\Delta}$

with $x \sim P_X(x)$, $x \in \mathbb{R}^r$, $w \sim \mathcal{N}(0, \mathbb{1}_r)$.

Why is this useful? When $N \gg 1$, the $rN$-dimensional problem defined by the posterior $P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i<j} P_{\rm out}(Y_{ij} \,|\, x_i^\top x_j/\sqrt{N})$ reduces to an $r$-dimensional one.

THEOREMS:

Theorem 1: $\frac{1}{N}\log Z(Y)$ concentrates around the maximum over $M \in \mathbb{R}^{r\times r}$ of

$\Phi(M) = \mathbb{E}_{x,w}\!\left[\log \mathcal{Z}\!\left(\frac{M}{\Delta},\ \frac{M}{\Delta}x + \sqrt{\frac{M}{\Delta}}\, w\right)\right] - \frac{\mathrm{Tr}(MM^\top)}{4\Delta}, \qquad x \sim P_X(x),\ w \sim \mathcal{N}(0, \mathbb{1}_r).$

Theorem 2: $\mathrm{MMSE} = \mathrm{Tr}\!\left[\mathbb{E}_x(xx^\top) - \operatorname{argmax}_M \Phi(M)\right]$

Proofs: Korada, Macris’10; Krzakala, Xu, LZ, ITW’16; Barbier, Dia, Macris, Krzakala, Lesieur, LZ, NIPS’16; more elegant: Lelarge, Miolane’16; El Alaoui, Krzakala’17.
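To make Theorems 1-2 concrete, here is a minimal numerical sketch for the simplest case, rank r = 1 with the ±1 prior of the card game (so the scalar partition function is $\mathcal{Z}(A,B) = e^{-A/2}\cosh B$); the grid, quadrature order, and the value of Δ are my own illustrative choices.

```python
# Evaluate the scalar free energy Phi(m) on a grid and read off the MMSE:
# Phi(m) = E_w[ log Z(m/Delta, m/Delta + sqrt(m/Delta) w) ] - m^2/(4 Delta),
# with the ground truth fixed to x = +1 by symmetry, and MMSE = 1 - argmax Phi.
import numpy as np

t, wts = np.polynomial.hermite.hermgauss(80)       # Gauss-Hermite nodes and weights
w = np.sqrt(2.0) * t                                # E_{w~N(0,1)}[g(w)] = sum(wts*g(w))/sqrt(pi)

def phi(m, Delta):
    A = m / Delta
    B = A + np.sqrt(A) * w
    log_Z = -A / 2.0 + np.logaddexp(B, -B) - np.log(2.0)   # log( exp(-A/2) cosh(B) )
    return np.sum(wts * log_Z) / np.sqrt(np.pi) - m**2 / (4.0 * Delta)

Delta = 0.6
grid = np.linspace(0.0, 1.0, 401)
m_star = grid[int(np.argmax([phi(m, Delta) for m in grid]))]
print("argmax m* =", m_star, "   MMSE =", 1.0 - m_star)
```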

FREE ENERGY FOR THE ASYMMETRIC CASE

$P(u,v|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_U(u_i) \prod_{j=1}^{M} P_V(v_j) \prod_{i,j} P_{\rm out}\!\left(Y_{ij} \,\middle|\, u_i^\top v_j/\sqrt{N}\right)$

$\Phi(M_u, M_v) = \mathbb{E}_{u,w}\!\left[\log \mathcal{Z}_u\!\left(\frac{\alpha M_v}{\Delta},\ \frac{\alpha M_v}{\Delta}u + \sqrt{\frac{\alpha M_v}{\Delta}}\, w\right)\right] + \alpha\, \mathbb{E}_{v,w}\!\left[\log \mathcal{Z}_v\!\left(\frac{M_u}{\Delta},\ \frac{M_u}{\Delta}v + \sqrt{\frac{M_u}{\Delta}}\, w\right)\right] - \frac{\alpha\, \mathrm{Tr}(M_v M_u^\top)}{2\Delta}$

Conjectured: Lesieur, Krzakala, LZ’15. Proof: Miolane’17.

FREE ENERGY FOR THE TENSOR CASE

$P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i_1<\dots<i_p} P_{\rm out}\!\left(Y_{i_1\dots i_p} \,\middle|\, \frac{\sqrt{(p-1)!}}{N^{(p-1)/2}}\, x_{i_1}\cdots x_{i_p}\right)$

For rank = 1:

$\Phi(M) = \mathbb{E}_{x,w}\!\left[\log \mathcal{Z}\!\left(\frac{M^{p-1}}{\Delta},\ \frac{M^{p-1}}{\Delta}x + \sqrt{\frac{M^{p-1}}{\Delta}}\, w\right)\right] - \frac{M^p (p-1)}{2p\Delta}$

Proof (r=1): Lesieur, Miolane, Lelarge, Krzakala, LZ’17. General rank: Barbier, Macris, Miolane’17.
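The same kind of scalar evaluation works for the rank-one tensor formula above; below is a sketch, again assuming the ±1 prior so that $\mathcal{Z}(A,B) = e^{-A/2}\cosh B$ (the values of p, Δ, and the grid are illustrative choices of mine).

```python
# Evaluate the rank-1 tensor free energy on a grid:
# Phi(M) = E_w[ log Z(M^{p-1}/Delta, M^{p-1}/Delta + sqrt(M^{p-1}/Delta) w) ] - M^p (p-1)/(2 p Delta)
import numpy as np

t, wts = np.polynomial.hermite.hermgauss(80)
w = np.sqrt(2.0) * t                                 # Gauss-Hermite for w ~ N(0,1)

def phi_tensor(M, Delta, p=3):
    A = M**(p - 1) / Delta                           # ground truth x = +1 by symmetry
    B = A + np.sqrt(A) * w
    log_Z = -A / 2.0 + np.logaddexp(B, -B) - np.log(2.0)
    return np.sum(wts * log_Z) / np.sqrt(np.pi) - M**p * (p - 1) / (2.0 * p * Delta)

grid = np.linspace(0.0, 1.0, 401)
vals = [phi_tensor(M, Delta=0.3) for M in grid]
print("argmax Phi(M) =", grid[int(np.argmax(vals))])
```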

KEY PROOF INGREDIENTS

Guerra’s interpolation (from N independent scalar denoising problems) + …

MAIN QUESTIONS

What is the minimal achievable estimation error on x*?

(Is it possible to do better than PCA?)

What is the minimal efficiently achievable estimation error on x*?

APPROXIMATE MESSAGE PASSING

AMP estimates the means and variances of the marginals of $P(x|Y) = \frac{1}{Z(Y)} \prod_{i=1}^{N} P_X(x_i) \prod_{i<j} P_{\rm out}(y_{ij} \,|\, x_i^\top x_j/\sqrt{N})$:

$B_i^t = \frac{1}{\sqrt{N}} \sum_{l=1}^{N} S_{il}\, a_l^t - \frac{1}{\Delta}\left(\frac{1}{N}\sum_{l=1}^{N} v_l^t\right) a_i^{t-1}$

$A^t = \frac{1}{N\Delta} \sum_{l=1}^{N} a_l^t \left(a_l^t\right)^{\!\top}$

$a_i^{t+1} = f(A^t, B_i^t), \qquad v_i^{t+1} = \partial_B f(A^t, B_i^t)$

Thouless, Anderson, Palmer’77; Rangan, Fletcher’12; Matsushita, Tanaka’14; Deshpande, Montanari’14; Lesieur, Krzakala, LZ’15 and ’16.
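Below is a minimal sketch of this AMP iteration for the simplest instance, rank 1 with the ±1 prior and Gaussian output (so $f(A,B) = \tanh B$, $\partial_B f = 1 - \tanh^2 B$, and the Fisher score is $S = Y/\Delta$); the function name, initialization, and iteration count are my own choices.

```python
# AMP for rank-1, x in {-1,+1}, Gaussian output: iterate the update rules above
# with f(A, B) = tanh(B); a_i tracks the posterior mean, v_i the posterior variance.
import numpy as np

def amp_rank1_pm1(Y, Delta, n_iter=50, seed=0):
    N = Y.shape[0]
    rng = np.random.default_rng(seed)
    S = Y / Delta                                  # Fisher-score matrix for Gaussian noise
    a = 1e-3 * rng.standard_normal(N)              # means of the marginals (small random init)
    a_old = np.zeros(N)
    v = np.ones(N)                                 # variances of the marginals
    for _ in range(n_iter):
        B = S @ a / np.sqrt(N) - (v.mean() / Delta) * a_old   # local field + Onsager term
        a_old = a.copy()
        a = np.tanh(B)                             # a^{t+1}_i = f(A^t, B^t_i)
        v = 1.0 - a**2                             # v^{t+1}_i = dB f(A^t, B^t_i)
    return a

# Usage: a = amp_rank1_pm1(Y, Delta); overlap = abs(a @ x_star) / N
```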

(Here $S_{ij}$, $\Delta$, and $f(A,B)$ are the Fisher-score matrix, the Fisher information, and the denoising function from the DEFINITIONS slide above.)

STATE EVOLUTION

Characterisation of AMP via the matrix order parameter

$M^t \equiv \frac{1}{N} \sum_{i=1}^{N} a_i^t \left(x_i^*\right)^{\!\top} \in \mathbb{R}^{r\times r}$

$M^{t+1} = \mathbb{E}_{x,w}\!\left[f\!\left(\frac{M^t}{\Delta},\ \frac{M^t}{\Delta}x + \sqrt{\frac{M^t}{\Delta}}\, w\right) x^\top\right], \qquad x \sim P_X(x),\ w \sim \mathcal{N}(0, \mathbb{1}_r)$

$\mathrm{MSE}_{\rm AMP} = \mathrm{Tr}\!\left[\mathbb{E}_x(xx^\top) - M_{\rm AMP}\right]$

Observation: stationary points of $\Phi(M)$ are fixed points of the state evolution.

Proof: Rangan, Fletcher’12; Javanmard, Montanari’12; Deshpande, Montanari’14.
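The corresponding state evolution for the same rank-1, ±1 setting reduces to a scalar recursion; the sketch below iterates it from a small initial overlap, the way AMP is initialized (quadrature order and Δ values are illustrative choices of mine).

```python
# Scalar state evolution: M^{t+1} = E_w[ tanh(M^t/Delta + sqrt(M^t/Delta) w) ], w ~ N(0,1).
# The fixed point reached from small M gives M_AMP, and MSE_AMP = 1 - M_AMP.
import numpy as np

t, wts = np.polynomial.hermite.hermgauss(80)
w = np.sqrt(2.0) * t

def se_update(M, Delta):
    g = np.tanh(M / Delta + np.sqrt(M / Delta) * w)
    return np.sum(wts * g) / np.sqrt(np.pi)

def state_evolution(Delta, M0=1e-4, n_iter=500):
    M = M0
    for _ in range(n_iter):
        M = se_update(M, Delta)
    return M

for Delta in (0.5, 1.5):
    M_amp = state_evolution(Delta)
    print(Delta, "M_AMP =", M_amp, "MSE_AMP =", 1.0 - M_amp)
```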

BOTTOM LINE

$\Phi(M) = \mathbb{E}_{x,w}\!\left[\log \mathcal{Z}\!\left(\frac{M}{\Delta},\ \frac{M}{\Delta}x + \sqrt{\frac{M}{\Delta}}\, w\right)\right] - \frac{\mathrm{Tr}(MM^\top)}{4\Delta}$

The MMSE is given by the global maximum of the free energy: $\mathrm{MMSE} = \mathrm{Tr}\!\left[\mathbb{E}_x(xx^\top) - \operatorname{argmax}_M \Phi(M)\right]$.

The AMP MSE is given by the local maximum of the free energy reached by gradient descent starting from small $M$ (large MSE): $\mathrm{MSE}_{\rm AMP} = \mathrm{Tr}\!\left[\mathbb{E}_x(xx^\top) - M_{\rm AMP}\right]$.

[Figure: free energy as a function of $M$, with $M_{\rm AMP}$ marked at a local maximum and $\operatorname{argmax}\Phi(M)$ at the global one.]

ZOOLOGY OF FIXED POINTS (FOR MATRIX ESTIMATION)

Zero-mean prior, $\mathbb{E}_X(x) = 0$:

SE always has a “trivial” fixed point $M=0$. Stability of the trivial fixed point: near $M=0$, $M^{t+1} = \frac{\Sigma M^t \Sigma}{\Delta}$ with $\Sigma = \mathbb{E}_X(xx^\top)$, which for $r=1$ reads $M^{t+1} = \frac{[\mathbb{E}_X(x^2)]^2}{\Delta}\, M^t$.

This is the same as the spectral phase transition of the Fisher-score matrix (Edwards’68, known as the BBP’05 transition).

Non-zero-mean priors, $\mathbb{E}_X(x) \neq 0$:

MMSE always better than random guessing (spectral methods still have a phase transition). Multiple fixed points may still exist.

Example prior: $P_X(x_i) = \frac{\rho}{2}\left[\delta(x_i - 1) + \delta(x_i + 1)\right] + (1-\rho)\,\delta(x_i)$

From fixed points to phase transitions:

[Figure: accuracy as a function of the noise $\Delta$, obtained from the fixed points of the state evolution.]
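As a worked instance of the stability criterion above (my own arithmetic, following the r = 1 formula): for this sparse Rademacher prior $\mathbb{E}_X(x^2) = \rho$, so near $M = 0$ the recursion reads $M^{t+1} = \frac{\rho^2}{\Delta} M^t$, and the trivial fixed point becomes unstable, i.e. AMP (and the matching spectral method) starts to detect the spike, when $\Delta < \rho^2$.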

ALGORITHMIC INTERPRETATION

$P_X(x_i) = \frac{\rho}{2}\left[\delta(x_i - 1) + \delta(x_i + 1)\right] + (1-\rho)\,\delta(x_i)$

• Easy: solvable by approximate message passing.

• Impossible: information-theoretically.

• Hard phase: in the presence of a first-order phase transition.

Conjecture: in the hard phase no polynomial algorithm works. Physically sensible; mathematically wide open.

[Figure: accuracy as a function of the noise $\Delta$.]

Phase diagram for $P_X(x_i) = \frac{\rho}{2}\left[\delta(x_i - 1) + \delta(x_i + 1)\right] + (1-\rho)\,\delta(x_i)$:

[Figure: phase diagram with easy, hard, and impossible regions.]

HARD PHASE IN NATURE

Analogy: metastable diamond = high error; equilibrium graphite = low error. Algorithms remain stuck at the high error for exponentially long times.

MAIN QUESTIONS

What is the minimal achievable estimation error on x*?

(Is it possible to do better than PCA?)

What is the minimal efficiently achievable estimation error on x*?

Recall, for $P_X(x_i) = \frac{\rho}{2}\left[\delta(x_i - 1) + \delta(x_i + 1)\right] + (1-\rho)\,\delta(x_i)$, from fixed points to phase transitions:

[Figure: accuracy as a function of the noise $\Delta$.]

OPTIMAL SPECTRAL ALGORITHMS

For zero-mean priors, there is a spectral method with the same phase transition as AMP; AMP achieves a better error.

For noise that is not additive Gaussian, to obtain the optimal phase transition the spectral algorithm must be run on the Fisher-score matrix

$S_{ij} \equiv \left.\frac{\partial \log P_{\rm out}(y_{ij}|w)}{\partial w}\right|_{y_{ij},\, w=0}$

OPTIMAL PRE-PROCESSING

Exponential additive noise: $P_{\rm out}(y|w) = e^{-|y-w|}/2$, Fisher score $S_{ij} = \mathrm{sign}(Y_{ij})$.

Cauchy additive noise: $P_{\rm out}(y|w) = \frac{1}{\pi\left[1 + (y-w)^2\right]}$, Fisher score $S_{ij} = \frac{Y_{ij}}{1 + Y_{ij}^2}$.
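A sketch of this pre-processing step (my own illustration; the two channels match the examples above, and `spectral_estimate` is a hypothetical helper name):

```python
# Build the Fisher-score matrix S for the given output channel and run the
# spectral method on S instead of on Y itself.
import numpy as np

def fisher_score(Y, noise="cauchy"):
    if noise == "exponential":                 # P_out(y|w) = exp(-|y-w|)/2
        return np.sign(Y)
    if noise == "cauchy":                      # P_out(y|w) = 1/(pi (1+(y-w)^2))
        return Y / (1.0 + Y**2)
    raise ValueError(f"unknown noise model: {noise}")

def spectral_estimate(Y, noise="cauchy"):
    S = fisher_score(Y, noise)
    S = (S + S.T) / 2.0                        # symmetrise
    np.fill_diagonal(S, 0.0)                   # keep only pairwise observations
    eigval, eigvec = np.linalg.eigh(S)
    return eigvec[:, -1]                       # leading eigenvector of the Fisher-score matrix
```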

OTHER EXAMPLES OF PHASE DIAGRAMS

Non-zero mean prior: $P_X(x_i) = (1-\rho)\,\delta(x_i) + \rho\,\delta(x_i - 1)$

[Figures: accuracy as a function of the noise for $\rho = 0.2$ and $\rho = 0.01$, showing the algorithmic threshold $\Delta_{\rm Alg}$ and a hard region.]

Stochastic block model, r groups:

$P_{\rm out}\!\left(Y_{ij}=1 \,\middle|\, \tfrac{x_i^\top x_j}{\sqrt{N}}\right) = p_{\rm out} + \frac{\mu}{\sqrt{N}}\, x_i^\top x_j, \qquad \Delta = \frac{p_{\rm out}(1 - p_{\rm out})}{\mu^2}$

[Figure: MSE versus $\Delta r^2$ for r = 15, comparing AMP from the solution with the stable and unstable branches of the state evolution, and marking the transitions $\Delta_c = \Delta_{\rm Alg}$, $\Delta_{\rm IT}$, $\Delta_{\rm Dyn}$ and the hard region.]

For r > 4 a hard phase exists; for r < 4 it does not.

Two groups, different sizes, same average degree. Edge probabilities and noise:

$\begin{pmatrix} p_{\rm out} & p_{\rm out} \\ p_{\rm out} & p_{\rm out} \end{pmatrix} + \frac{\mu}{\sqrt{N}} \begin{pmatrix} \frac{1-\rho}{\rho} & -1 \\ -1 & \frac{\rho}{1-\rho} \end{pmatrix}, \qquad \Delta = \frac{p_{\rm out}(1 - p_{\rm out})}{\mu^2}$

$P_X(x) = \rho\,\delta\!\left(x - \sqrt{\tfrac{1-\rho}{\rho}}\right) + (1-\rho)\,\delta\!\left(x + \sqrt{\tfrac{\rho}{1-\rho}}\right), \qquad \rho_c = \frac{1}{2} - \frac{1}{\sqrt{12}}$

For small $\rho$: $k_{\rm Alg} = \sqrt{N}\,\sqrt{\frac{p_{\rm out}}{1 - p_{\rm out}}}$, $k_{\rm IT} = \frac{4\, p_{\rm out}}{1 - p_{\rm out}}\log(N)$. As in balanced planted clique.

[Figure: phase diagram with impossible, hard, and easy regions.]

TENSORS

ZOOLOGY OF FIXED POINTS (FOR TENSOR ESTIMATION)

Zero-mean prior, $\mathbb{E}_X(x) = 0$:

SE has a “trivial” fixed point $M=0$, stable for any $\Delta = \Omega(1)$.

Information-theoretic phase transition at $\Delta_{\rm IT} = \Omega(1)$.

Huge hard phase, until $\Delta = \Omega(N^{(2-p)/4})$ (e.g. Richard, Montanari’14).

Non-zero-mean priors, $\mathbb{E}_X(x) \neq 0$:

The hard phase shrinks back to the $\Delta = \Omega(1)$ regime.

Take home: in tensor estimation, use your prior!

SPIKED TENSOR (ZERO MEAN SPIKE)

$P_X(x) = \mathcal{N}(x; 0, 1)$, p = 3.

[Figure: above $\Delta_{\rm IT}$ no information is contained in Y; below $\Delta_{\rm IT}$ estimation is GOOD statistically but HARD algorithmically.]

SPIKED TENSOR (NON-ZERO MEAN SPIKE)

$P_X(x) = \mathcal{N}(x; 0.2, 1)$, p = 3.

[Figure: above $\Delta_{\rm IT}$ almost no information is contained in Y; below it there are HARD and EASY regions.]

PHASE DIAGRAMS SPIKED TENSORS

[Figures: phase diagrams for p = 3 with $P_X(x_i) = (1-\rho)\,\delta(x_i) + \rho\,\delta(x_i - 1)$ and with $P_X(x) = \mathcal{N}(x; \mu, 1)$.]

CONCLUSION

• Analysis of Bayes optimal inference in low-rank matrix and tensor estimation.

• Approximate message passing and its performance.

• Channel universality. Optimal pre-processing for spectral methods.

• Existence of the hard phase (metastability next to a first order phase transition) for a range of priors.

WORK IN PROGRESS

• Beyond i.i.d. priors. Priors coming from another graphical model are also tractable, e.g. the optimal generalisation error of neural networks with one small hidden layer.

• Applications of optimal pre-processing for spectral methods: degree-corrected stochastic block model; inference of patterns learned by a real biological neural network.

• Nature of the hard phase. Deep connection with the algorithmic barrier of sum-of-squares proofs.

TALK BASED ON

• Lesieur, Krzakala, LZ, Phase transitions in sparse PCA, ISIT’15

• Lesieur, Krzakala, LZ, MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel, Allerton’15.

• Lesieur, De Bacco, Banks, Krzakala, Moore, LZ, Phase transitions and optimal algorithms in high-dimensional Gaussian mixture clustering, Allerton’16

• Krzakala, Xu, LZ, Mutual information in rank-one matrix estimation, ITW’16

• Barbier, Dia, Macris, Krzakala, Lesieur, LZ Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula, NIPS’16

• Lesieur, Krzakala, LZ, Constrained Low-rank Matrix Estimation: Phase Transitions, Approximate Message Passing and Applications, J. Stat. Mech.’17

• Lesieur, Miolane, Lelarge, Krzakala, LZ, Statistical and computational phase transitions in spiked tensor estimation, ISIT’17