Recent Advances in Approximate Message Passing
Phil Schniter
Supported in part by NSF grant CCF-1716388.
July 5, 2019
Overview
1 Linear Regression
2 Approximate Message Passing (AMP)
3 Vector AMP (VAMP)
4 Unfolding AMP and VAMP into Deep Neural Networks
5 Extensions: GLMs, Parameter Learning, Bilinear Problems
Phil Schniter (Ohio State Univ.) July’19 2 / 52
Linear Regression
The Linear Regression Problem

Consider the following linear regression problem: recover xo from

    y = A xo + w,  where
    xo ∈ Rn is the unknown signal,
    A ∈ Rm×n is a known linear operator,
    w ∈ Rm is white Gaussian noise.
Typical methodologies:
1 Optimization (or MAP estimation):
    x = argmin_x { (1/2)‖Ax − y‖₂² + R(x) }
2 Approximate MMSE:
x ≈ E{x|y} for x ∼ p(x), y|x ∼ N (Ax, νwI)
3 Plug-and-play:1 iteratively apply a denoising algorithm like BM3D
4 Train a deep network to recover xo from y.

1 Venkatakrishnan, Bouman, Wohlberg '13
Approximate Message Passing (AMP)
The AMP Methodology
All of the aforementioned methodologies can be addressed using the Approximate Message Passing (AMP) framework.
AMP tackles these problems via iterative denoising.
We will write the iteration-t denoiser as ηt(·) : Rn → Rn.
Each method defines the denoiser ηt(·) differently:
    Optimization: ηt(r) = argmin_x { R(x) + (1/(2νt))‖x − r‖₂² } ≜ "prox_{R,νt}(r)"
    MMSE: ηt(r) = E{x | r = x + N(0, νt)}
Plug-and-play: ηt(r) = BM3D(r, νt)
Deep network: ηt(r) is learned from training data.
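For the ℓ1 regularizer R(x) = λ‖x‖₁ (the constant λ is an illustrative assumption, not spelled out on this slide), the prox denoiser reduces to elementwise soft-thresholding with threshold λνt. A minimal numpy sketch:

```python
import numpy as np

def soft_threshold(r, tau):
    """Prox of tau*||x||_1: shrink each entry of r toward zero by tau."""
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

# entries with |r_j| <= tau map exactly to zero; the rest are shrunk by tau
x = soft_threshold(np.array([-2.0, -0.3, 0.0, 0.5, 3.0]), 1.0)
```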
The AMP Algorithm
initialize x⁰ = 0, v⁻¹ = 0
for t = 0, 1, 2, . . .
    vt = y − A xt + (n/m) vt−1 div(ηt−1(xt−1 + ATvt−1))    . . . corrected residual
    xt+1 = ηt(xt + ATvt)    . . . denoising

where div(ηt(r)) ≜ (1/n) tr(∂ηt(r)/∂r) is the "divergence."
Note: the original version was proposed by Donoho, Maleki, and Montanari in 2009.
    They considered "scalar" denoisers, such that [ηt(r)]j = ηt(rj) ∀j.
    For scalar denoisers, div(ηt(r)) = (1/n) Σnj=1 ηt′(rj).
    AMP can be recognized as iterative shrinkage/thresholding2 plus "Onsager correction."
    It can be derived using Gaussian & Taylor-series approximations of loopy belief propagation (hence "AMP").
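The recursion fits in a few lines of numpy. The sketch below uses a soft-thresholding denoiser whose threshold tracks the current residual energy; the tuning constant tau and the fixed iteration count are assumptions for illustration, not part of the slide:

```python
import numpy as np

def soft(r, tau):
    return np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)

def amp(y, A, tau=1.4, T=50):
    """AMP with a soft-thresholding denoiser and Onsager correction."""
    m, n = A.shape
    x = np.zeros(n)          # x^0 = 0
    v = np.zeros(m)          # v^{-1} = 0
    div = 0.0                # div(eta_{t-1}(r^{t-1})), zero at t = 0
    for _ in range(T):
        v = y - A @ x + (n / m) * v * div     # Onsager-corrected residual
        r = x + A.T @ v                       # denoiser input r^t
        thr = tau * np.sqrt(np.mean(v**2))    # threshold ~ estimated noise std
        x = soft(r, thr)
        div = np.mean(np.abs(r) > thr)        # (1/n) sum_j eta'(r_j) for soft-thresholding
    return x
```

For soft-thresholding, the divergence is just the fraction of entries that survive the threshold, so the Onsager correction costs essentially nothing.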
2 Chambolle, DeVore, Lee, Lucier '98
AMP’s Denoising Property
Original AMP Assumptions
A ∈ Rm×n is drawn i.i.d. Gaussian
m, n → ∞ s.t. m/n → δ ∈ (0,∞) . . . "large-system limit"
[ηt(r)]j = ηt(rj) with Lipschitz η(·) . . . “scalar denoising”
Under these assumptions, the denoiser's input rt ≜ xt + ATvt obeys3
    rtj = xo,j + N(0, νtr).
That is, rt is a Gaussian-noise corrupted version of the true signal xo.
It should now be clear why we think of ηt(·) as a “denoiser.”
Furthermore, the effective noise variance can be consistently estimated:
    ν̂tr ≜ (1/m)‖vt‖² −→ νtr.
3 Bayati, Montanari '11
AMP’s State Evolution
Assume that the measurements y were generated via
y = Axo +N (0, νwI)
where xo empirically converges to some random variable Xo as n → ∞.
Define the iteration-t mean-squared error (MSE)
    Et ≜ (1/n)‖xt − xo‖².
Under the above assumptions, AMP obeys the following state evolution (SE):4

for t = 0, 1, 2, . . .
    νtr = νw + (n/m) Et
    Et+1 = E{[ηt(Xo + N(0, νtr)) − Xo]²}
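The SE recursion is one-dimensional and easy to evaluate by Monte Carlo. The sketch below does so for a Bernoulli-Gaussian Xo and a soft-thresholding ηt; the prior, the threshold rule alpha*sqrt(vr), and all numeric parameters are illustrative assumptions:

```python
import numpy as np

def amp_se(delta=0.5, sparsity=0.1, vw=1e-4, alpha=1.4, T=30, N=200_000, seed=0):
    """Monte-Carlo evaluation of the AMP state evolution for a
    Bernoulli-Gaussian signal and a soft-thresholding denoiser."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(N) * (rng.random(N) < sparsity)  # samples of X_o
    E = np.mean(x0**2)                       # E_0 for the all-zero initialization
    for _ in range(T):
        vr = vw + E / delta                  # nu_r^t = nu_w + (n/m) E_t
        r = x0 + np.sqrt(vr) * rng.standard_normal(N)
        tau = alpha * np.sqrt(vr)            # threshold proportional to noise level
        xhat = np.sign(r) * np.maximum(np.abs(r) - tau, 0.0)
        E = np.mean((xhat - x0) ** 2)        # E_{t+1}
    return E
```

Running this at a sampling rate inside the recovery region drives E down to the noise floor, while a rate that is too small stalls at a large fixed point.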
4 Bayati, Montanari '11
Achievability Analysis via the AMP SE
AMP’s SE can be applied to analyze achievability in various problems.
E.g., it yields a closed-form expression5 for the sparsity/sampling region where ℓ1-penalized regression is equivalent to ℓ0-penalized regression:
ρ(δ) = max_{c>0} [ 1 − (2/δ)[(1 + c²)Φ(−c) − cφ(c)] ] / [ 1 + c² − 2[(1 + c²)Φ(−c) − cφ(c)] ],
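The expression is easy to evaluate numerically. A small sketch using only the standard library (Φ and φ via math.erf; the grid-search range for c is an assumption):

```python
import math

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):  # standard normal pdf
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def rho(delta, grid=10_000):
    """Grid search over c > 0 for the weak l1/l0 phase-transition curve."""
    best = 0.0
    for i in range(1, grid):
        c = 5.0 * i / grid                      # search c in (0, 5]
        g = (1 + c * c) * Phi(-c) - c * phi(c)
        num = 1.0 - (2.0 / delta) * g
        den = 1.0 + c * c - 2.0 * g
        best = max(best, num / den)
    return best
```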
[Figure: phase-transition plot of ρ = k/m (sparsity rate) vs. δ = m/n (sampling rate), comparing the MMSE reconstruction boundary, the weak ℓ1/ℓ0 equivalence curve, and empirical AMP.]
5 Donoho, Maleki, Montanari '09
MMSE Optimality of AMP
Now suppose that the AMP Assumptions hold, and that
y = Axo +N (0, νwI),
where the elements of xo are i.i.d. draws of some random variable Xo.
Suppose also that ηt(·) is the MMSE denoiser, i.e.,
ηt(R) = E{Xo | R = Xo + N(0, νtr)}.
Then, if the state evolution has a unique fixed point, the MSE of xt converges6 to the replica prediction of the MMSE as t → ∞.
Under the AMP Assumptions, the replica prediction of the MMSE was shown to be correct.7,8
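For a concrete example, when Xo is Bernoulli-Gaussian the MMSE denoiser has a simple closed form: a posterior activity probability gating a Wiener-style linear shrinkage. A numpy sketch (the prior parameters rho and sigma2 are illustrative assumptions):

```python
import numpy as np

def bg_mmse_denoiser(r, nu, rho=0.1, sigma2=1.0):
    """MMSE denoiser for X0 = B*G with B ~ Bern(rho), G ~ N(0, sigma2),
    observed through R = X0 + N(0, nu)."""
    # likelihoods of r under the "active" and "inactive" hypotheses
    p1 = rho * np.exp(-r**2 / (2 * (sigma2 + nu))) / np.sqrt(sigma2 + nu)
    p0 = (1 - rho) * np.exp(-r**2 / (2 * nu)) / np.sqrt(nu)
    post = p1 / (p1 + p0)                       # posterior Pr{X0 != 0 | r}
    return post * (sigma2 / (sigma2 + nu)) * r  # Wiener shrinkage, gated by post
```

Unlike soft-thresholding, this denoiser never maps a nonzero input exactly to zero, but it shrinks small inputs very aggressively.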
6 Bayati, Montanari '11. 7 Reeves, Pfister '16. 8 Barbier, Dia, Macris, Krzakala '16.
Universality of AMP State Evolution
Until now, it was assumed that A is drawn i.i.d. Gaussian.
The state evolution also holds when A has i.i.d. entries Aij such that
    E{Aij} = 0
    E{A²ij} = 1/m
    E{A⁶ij} = C/m³ for some fixed C > 0,
conditions often abbreviated as "sub-Gaussian Aij."
The proof 9 assumes polynomial scalar denoising ηt(·) of bounded order.
9 Bayati, Lelarge, Montanari '15
Deriving AMP via Loopy BP (e.g., the sum-product algorithm)

[Diagram: bipartite factor graph with variable nodes x1, . . . , xn, prior factors f(xj), and likelihood factors N(yi; [Ax]i, νw), carrying messages pi→j(xj) and pi←j(xj).]

1 Message from the yi node to the xj node:
    pi→j(xj) ∝ ∫{xl}l≠j N(yi; Σl ail xl, νw) Πl≠j pi←l(xl)    (Σl ail xl ≈ N via CLT)
             ≈ ∫zi N(yi; zi, νw) N(zi; ẑi(xj), νzi(xj)) ∼ N

To compute ẑi(xj) and νzi(xj), the means and variances of {pi←l}l≠j suffice, implying Gaussian message passing, similar to expectation propagation. Remaining problem: we have 2mn messages to compute (too many!).

2 Exploiting similarity among the messages {pi←j}mi=1, AMP employs a Taylor-series approximation of their difference whose error vanishes as m → ∞ for dense A (and similarly for {pi←j}nj=1 as n → ∞). Finally, we need to compute only O(m + n) messages!
Understanding AMP
The belief-propagation derivation of AMP provides very little insight!
    Loopy BP is suboptimal, even if implemented exactly.
    The i.i.d. property of A is never used in the derivation.
And the rigorous proofs of AMP's state evolution are very technical!
As a middle ground, we suggest an alternate derivation that gives insight into how and why AMP works.
    It is based on the idea of "first-order cancellation."
    We will assume equiprobable Bernoulli aij ∈ {±1/√m} and polynomial η(·).
AMP as First-Order Cancellation
Recall the AMP recursion:
    vt = y − A xt + (n/m) vt−1 div(η(rt−1))
    xt+1 = η(xt + ATvt),  with rt ≜ xt + ATvt.

Notice that
    [Axt]i = aTi η(xt−1 + Σl al vt−1l),  where aTi is the ith row of A
           = aTi η(rt−1i + ai vt−1i),  where rt−1i ≜ xt−1 + Σl≠i al vt−1l removes the direct contribution of ai from rt−1
           = aTi [η(rt−1i) + (∂η/∂r)(rt−1i) ai vt−1i + O(1/m)]    using a Taylor expansion
           = aTi η(rt−1i) + vt−1i Σj a²ij η′(rt−1ij) + O(1/√m)
           = aTi η(rt−1i) + (n/m) vt−1i (1/n) Σj η′(rt−1ij) + O(1/√m)    since a²ij = 1/m ∀ij,
where (1/n) Σj η′(rt−1ij) = div(η(rt−1i)), which uncovers the Onsager correction.
AMP as First-Order Cancellation (cont.)
Now use [Axt]i to study the jth component of the denoiser input error et ≜ rt − xo:

    etj = Σi aij Σl≠j ail [xo,l − η(rt−1il)] + Σi aij wi
            + Σi aij [ (n/m) vt−1i div(η(rt−1)) − (n/m) vt−1i div(η(rt−1i)) ] + O(1/√m),

where the divergence difference can be absorbed into the O(1/√m) term. Writing ǫtil ≜ xo,l − η(rt−1il),

    etj = Σi aij Σl≠j ail ǫtil + Σi aij wi + O(1/√m)
        ∼ N(0, (1/m²) Σi Σl≠j (ǫtil)²) + N(0, (1/m) Σi w²i) + O(1/√m),

using the CLT and assuming independence of {ail}nl=1 and {rt−1il}nl=1,

        ∼ N(0, (n/m) E(t) + νw) + O(1/√m)    . . . the AMP state evolution,

where E(t) ≜ (1/n) Σnj=1 [xo,j − x(t)j]² and νw ≜ (1/m) Σmi=1 w²i.
AMP with Non-Separable Denoisers
Until now, we have focused on separable denoisers, i.e., [ηt(r)]j = ηt(rj) ∀j
Can we use sophisticated non-separable η(·) with AMP?
Yes! Many examples. . .
Markov-chain,10 Markov-field,11 and Markov-tree12 denoisers in 2010–2012.
Blockwise & TV denoising considered by Donoho, Johnstone, Montanari in 2011.
BM3D denoising considered by Metzler, Maleki, Baraniuk in 2015.
Rigorous state-evolution proven by Berthier, Montanari, Nguyen in 2017.
Assumes A drawn i.i.d. Gaussian.
Assumes η is Lipschitz and "convergent under Gaussian inputs."
10 S '10. 11 Som, S '11. 12 Som, S '12.
AMP at Large but Finite Dimensions
Until now, we have focused on the large-system limit m, n → ∞ with m/n → δ ∈ (0,∞).
The non-asymptotic case was analyzed by Rush and Venkataramanan.13
They showed that the probability of ǫ-deviation between the finite-dimensional and limiting SE falls exponentially in m, provided the number of iterations satisfies t = o(log n / log log n).
13 Rush, Venkataramanan '18
AMP Summary: The good, the bad, and the ugly
The good:
With large i.i.d. sub-Gaussian A, AMP is rigorously characterized by a scalar state evolution whose fixed points, when unique, are MMSE optimal under proper choice of denoiser.
Empirically, AMP behaves well with many other “sufficiently random” A
(e.g., randomly sub-sampled Fourier A & i.i.d. sparse x).
The bad:
With general A, AMP gives no guarantees.
The ugly:
With some A, AMP may fail to converge! (e.g., ill-conditioned or non-zero-mean A)
Vector AMP (VAMP)
Vector AMP (VAMP)
Recall the goal of linear regression: recover xo from y = A xo + N(0, I/γw).
    (It will now be easier to work with inverse variances, i.e., precisions.)
VAMP is like AMP in many ways, but supports a larger class of random matrices.
VAMP yields a precise analysis for right-orthogonally invariant A:
    svd(A) = USVT with U deterministic orthogonal, S deterministic diagonal, and V "Haar," i.e., uniform on the set of orthogonal matrices,
of which i.i.d. Gaussian A is a special case.
VAMP can be derived as a form of message passing on a vector-valued factor graph.
[Diagram: vector-valued factor graph with factors p(x1), δ(x1 − x2), and N(y; Ax2, I/γw) connecting variables x1 and x2.]
VAMP: The Algorithm
With SVD A = U Diag(s)VT, damping ζ ∈ (0, 1], and Lipschitz ηt1(·) : Rn → Rn.

Initialize r1, γ1.
For t = 1, 2, 3, . . .
    x1 ← ηt1(r1)    denoising of r1 = xo + N(0, I/γ1)
    ξ1 ← γ1 / div(ηt1(r1))
    r2 ← (ξ1 x1 − γ1 r1)/(ξ1 − γ1)    Onsager correction
    γ2 ← ξ1 − γ1
    x2 ← η2(r2; γ2)    LMMSE estimate of x ∼ N(r2, I/γ2) from y = Ax + N(0, I/γw)
    ξ2 ← γ2 / div(η2(r2; γ2))
    r1 ← ζ(ξ2 x2 − γ2 r2)/(ξ2 − γ2) + (1 − ζ)r1    Onsager correction
    γ1 ← ζ(ξ2 − γ2) + (1 − ζ)γ1    damping

where
    η2(r2; γ2) = (γw ATA + γ2 I)−1 (γw ATy + γ2 r2)
               = V (γw Diag(s)² + γ2 I)−1 (γw Diag(s) UTy + γ2 VTr2)
    ξ2 = [ (1/n) Σnj=1 (γw s²j + γ2)−1 ]−1
→ only two matrix-vector multiplications per iteration!
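The algorithm can be sketched directly in numpy, with the denoiser passed in as a function that returns both the estimate and its divergence. For a quick sanity check, a Gaussian prior N(0, I/γx) makes the MMSE denoiser linear, in which case VAMP's fixed point is the exact LMMSE (ridge) solution. The interface and initialization below are illustrative assumptions:

```python
import numpy as np

def vamp(y, A, gamma_w, denoiser, T=20, damp=1.0):
    """VAMP sketch: denoiser(r, gamma) must return (x_hat, divergence)."""
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uty = U.T @ y
    s2 = np.zeros(n)
    s2[:s.size] = s**2                    # squared singular values, zero-padded to n
    r1, gamma1 = np.zeros(n), 1.0
    for _ in range(T):
        # denoising stage
        x1, alpha1 = denoiser(r1, gamma1)
        xi1 = gamma1 / alpha1
        r2 = (xi1 * x1 - gamma1 * r1) / (xi1 - gamma1)    # Onsager correction
        gamma2 = xi1 - gamma1
        # LMMSE stage via the precomputed SVD: two matrix-vector multiplies
        c = Vt @ r2
        d = (gamma_w * s * Uty + gamma2 * c) / (gamma_w * s**2 + gamma2)
        x2 = r2 + Vt.T @ (d - c)
        xi2 = 1.0 / np.mean(1.0 / (gamma_w * s2 + gamma2))
        r1 = damp * (xi2 * x2 - gamma2 * r2) / (xi2 - gamma2) + (1 - damp) * r1
        gamma1 = damp * (xi2 - gamma2) + (1 - damp) * gamma1
    return x2
```

With a separable nonlinear denoiser, alpha1 would be the empirical average of the denoiser's derivative, exactly as in the state evolution.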
VAMP’s Denoising Property
Original VAMP Assumptions
A ∈ Rm×n is right-orthogonally invariant
m,n → ∞ s.t. m/n → δ ∈ (0,∞) . . . “large-system limit”
[ηt1(r)]j = ηt1(rj) with Lipschitz ηt1(·) . . . “separable denoising”
Under these assumptions, the elements of the denoiser's input rt1 obey14
    rt1,j = xo,j + N(0, νt1).
That is, rt1 is a Gaussian-noise corrupted version of the true signal xo.
As with AMP, we can interpret η1(·) as a “denoiser.”
14 Rangan, S, Fletcher '16
VAMP’s State Evolution
Assume empirical convergence of {sj} → S and {(r01,j , xo,j)} → (R01, Xo), and define
    Eti ≜ (1/n)‖xti − xo‖² for i = 1, 2.
Then under the VAMP Assumptions, VAMP obeys the following state evolution:

for t = 0, 1, 2, . . .
    Et1 = E{[ηt1(Xo + N(0, νt1)) − Xo]²}    MSE
    αt1 = E{ηt1′(Xo + N(0, νt1))}    divergence
    γt2 = γt1 (1 − αt1)/αt1,    νt2 = [Et1 − (αt1)² νt1] / (1 − αt1)²
    Et2 = E{[γw S² + γt2]−1}    MSE
    αt2 = γt2 E{[γw S² + γt2]−1}    divergence
    γt+1,1 = γt2 (1 − αt2)/αt2,    νt+1,1 = [Et2 − (αt2)² νt2] / (1 − αt2)²

Note: the above equations assume that η2(·) uses the true noise precision γw.
If not, the expressions for Et2 and αt2 are more complicated.
MMSE Optimality of VAMP
Now suppose that the VAMP Assumptions hold, and that
y = Axo +N (0, I/γw),
where the elements of xo are i.i.d. draws of some random variable Xo.
Suppose also that ηt1(·) is the MMSE denoiser, i.e.,
    ηt1(R1) = E{Xo | R1 = Xo + N(0, νt1)}.
Then, if the state evolution has a unique fixed point, the MSE of xt1 converges15 to the replica prediction16 of the MMSE as t → ∞.
15 Rangan, S, Fletcher '16. 16 Tulino, Caire, Verdu, Shamai '13.
Experiment with MMSE Denoising
Comparison of several algorithms17 with MMSE denoising.
[Figure: median normalized MSE [dB] vs. condition number κ(A) for AMP, S-AMP, damped GAMP, and VAMP, against the replica MMSE. Setup: n = 1024, m/n = 0.5, A = U Diag(s)VT with U, V ∼ Haar and sj/sj−1 = φ ∀j (φ determines κ(A)); Xo ∼ Bernoulli-Gaussian with Pr{Xo ≠ 0} = 0.1; SNR = 40 dB.]

VAMP achieves the replica MMSE over a wide range of condition numbers.
17 S-AMP: Cakmak, Fleury, Winther '14; damped GAMP: Vila, S, Rangan, Krzakala, Zdeborova '15.
Experiment with MMSE Denoising (cont.)
Comparison of several algorithms with priors matched to data.
[Figure: median NMSE [dB] vs. iterations for AMP, S-AMP, damped GAMP, VAMP, and the VAMP SE, at condition numbers 1 and 1000. Same setup as the previous experiment: n = 1024, m/n = 0.5, A = U Diag(s)VT with U, V ∼ Haar, Xo ∼ Bernoulli-Gaussian with Pr{Xo ≠ 0} = 0.1, SNR = 40 dB.]

VAMP is relatively fast even when A is ill-conditioned.
VAMP for Optimization
Consider the optimization problem
    x = argmin_x { (1/2)‖Ax − y‖² + R(x) },
where R(·) is strictly convex and A is arbitrary (e.g., not necessarily right-orthogonally invariant).

If we choose the denoiser
    ηt1(r) = argmin_x { R(x) + (γt1/2)‖x − r‖² } = prox_{R/γt1}(r)
and the damping parameter
    ζ ≤ 2 min{γ1, γ2} / (γ1 + γ2),
then a double-loop version of VAMP converges18 to x from above.

Furthermore, if the γ1 and γ2 variables are fixed over the iterations, then VAMP reduces to the Peaceman-Rachford variant of ADMM.
18 Fletcher, Sahraee, Rangan, S '16
Example of AMP & VAMP on the LASSO Problem
[Figure: NMSE [dB] vs. iterations for VAMP, AMP, Chambolle-Pock, and FISTA, with an i.i.d. Gaussian A (left) and a column-correlated (0.99) A (right).]

Solving LASSO to reconstruct 40-sparse x ∈ R1000 from noisy y ∈ R400:
    x = argmin_x { (1/2)‖y − Ax‖₂² + λ‖x‖₁ }.
Deriving VAMP from EC
Ideally, we would like to compute the exact posterior density
    p(x|y) = p(x) ℓ(x; y) / Z(y)    for Z(y) ≜ ∫ p(x) ℓ(x; y) dx,
but the high-dimensional integral in Z(y) is difficult to compute.

We might try to circumvent Z(y) through variational optimization:
    p(x|y) = argmin_b D(b(x)‖p(x|y)),    where D(·‖·) is KL divergence
           = argmin_b D(b(x)‖p(x)) + D(b(x)‖ℓ(x; y)) + H(b(x))    . . . "Gibbs free energy"
           = argmin_{b1,b2,q} D(b1(x)‖p(x)) + D(b2(x)‖ℓ(x; y)) + H(q(x))    ≜ JGibbs(b1, b2, q)
             s.t. b1 = b2 = q,
but the density constraint keeps the problem difficult.
Deriving VAMP from EC (cont.)
In expectation-consistent approximation (EC),19 the density constraint is relaxed to moment-matching constraints:
    p(x|y) ≈ argmin_{b1,b2,q} JGibbs(b1, b2, q)
    s.t. E{x|b1} = E{x|b2} = E{x|q}
         tr(Cov{x|b1}) = tr(Cov{x|b2}) = tr(Cov{x|q}).

The stationary points of EC are the densities
    b1(x) ∝ p(x) N(x; r1, I/γ1)
    b2(x) ∝ ℓ(x; y) N(x; r2, I/γ2)
    q(x) = N(x; x, I/ξ)
    s.t. E{x|b1} = E{x|b2} = x,    (1/n) tr(Cov{x|b1}) = (1/n) tr(Cov{x|b2}) = 1/ξ.

VAMP iteratively solves for the quantities r1, γ1, r2, γ2, x, ξ above.
    This leads to ηt1(·) being the MMSE denoiser of r1 = xo + N(0, I/γt1).
    In this setting, VAMP is simply an instance of expectation propagation (EP).20
    But VAMP is more general than EP, in that it allows non-MMSE denoisers η1.
19 Opper, Winther '04. 20 Minka '01.
Plug-and-play VAMP
Recall the scalar denoising step of VAMP (or AMP):
    x1 = ηt1(r1),    where r1 = xo + N(0, I/γt1).
For many signal classes (e.g., images), very sophisticated non-separable denoisers η1(·) have been developed (e.g., BM3D, DnCNN).
These non-separable denoisers can be "plugged into" VAMP!
Their divergence can be approximated via Monte Carlo:21
    div(ηt1(r)) ≈ (1/K) ΣKk=1 pTk [ηt1(r + ǫ pk) − ηt1(r)] / (n ǫ)
with random vectors pk ∈ {±1}n and small ǫ > 0. Empirically, K = 1 suffices.
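The estimator is a few lines of numpy (function and parameter names are illustrative):

```python
import numpy as np

def mc_divergence(denoiser, r, eps=1e-3, K=1, rng=None):
    """Monte-Carlo divergence estimate with K random probes p_k in {-1,+1}^n."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = r.size
    est = 0.0
    for _ in range(K):
        p = rng.choice([-1.0, 1.0], size=n)
        # p^T [eta(r + eps*p) - eta(r)] / (n*eps): a randomized trace estimate
        est += p @ (denoiser(r + eps * p) - denoiser(r)) / (n * eps)
    return est / K
```

For a linear denoiser η(r) = c·r the estimate equals c exactly for any probe, which makes a handy unit test.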
A rigorous state-evolution has been established for plug-and-play VAMP.22
21 Ramani, Blu, Unser '08. 22 Fletcher, Rangan, Sarkar, S '18.
Experiment: Compressive Image Recovery with BM3D
Plug-and-play versions of VAMP and AMP behave similarly when A is i.i.d. Gaussian, but VAMP can handle a larger class of random matrices A.
[Figure: PSNR vs. sampling rate M/N with i.i.d. Gaussian A (left), and PSNR vs. condition number with spread-spectrum A at M/N = 0.2 (right), comparing VAMP-BM3D, AMP-BM3D, VAMP-L1, and AMP-L1.]
Results above are averaged over 128× 128 versions of
lena, barbara, boat, fingerprint, house, peppers
and 10 random realizations of A,w.
Unfolding AMP and VAMP into Deep Neural Networks
Deep learning for sparse reconstruction
Until now we've focused on designing algorithms to recover xo ∼ p(x) from measurements y = A xo + w.

[Diagram: y → algorithm (using model p(x), A) → x]

What about training deep networks to predict xo from y? Can we increase accuracy and/or decrease computation?

[Diagram: y → deep network (trained on data {(xd, yd)}Dd=1) → x]

Are there connections between these approaches?
Phil Schniter (Ohio State Univ.) July’19 35 / 52
Unfolding AMP and VAMP into Deep Neural Networks
Unfolding Algorithms into Networks
Consider, e.g., the classical sparse-reconstruction algorithm, ISTA.23
    vt = y − A xt
    xt+1 = η(xt + ATvt)
⇔  xt+1 = η(S xt + B y)    with S ≜ I − ATA and B ≜ AT.
Gregor & LeCun24 proposed to "unfold" it into a deep net and "learn" improved parameters using training data, yielding "learned ISTA" (LISTA):

[Diagram: LISTA network; each layer computes xt+1 = η(S xt + B y), with y entering every layer through B.]
The same "unfolding & learning" idea can be used to improve AMP, yielding "learned AMP" (LAMP).25
23 Chambolle, DeVore, Lee, Lucier '98. 24 Gregor, LeCun '10. 25 Borgerding, S '16.
Onsager-Corrected Deep Networks
tth LISTA layer:
[Diagram: layer computes rt from xt and y via learned (Bt, At), then xt+1 = η(rt; λt); the pair (Bt, At) exploits the low-rank product BtAt in the linear stage St = I − BtAt.]

tth LAMP layer:
[Diagram: same structure, plus the Onsager term (N/M) vt div(η) in the residual and a noise-adaptive threshold λt‖vt‖₂/√M.]

Onsager correction now aims to decouple errors across layers.
LAMP performance with soft-threshold denoising
LISTA beats AMP, FISTA, and ISTA; LAMP beats LISTA in convergence speed and asymptotic MSE.

[Figure: average NMSE [dB] vs. layers/iterations for ISTA, FISTA, AMP, LISTA (tied/untied), and LAMP (tied/untied); also a QQ plot showing that LAMP's denoiser input rt is approximately Gaussian.]
LAMP beyond soft-thresholding
So far, we used soft-thresholding to isolate the effects of Onsager correction.
What happens with more sophisticated (learned) denoisers?
[Figure: average NMSE [dB] vs. layers for LISTA, LAMP-l1, LAMP-bg, LAMP-expo, LAMP-pwlin, LAMP-spline, and the support oracle.]

Here we learned the parameters of these denoiser families:
    scaled soft-thresholding
    conditional mean under a Bernoulli-Gaussian prior
    exponential kernel26
    piecewise linear26
    spline27

Big improvement!
26 Guo, Davies '15. 27 Kamilov, Mansour '16.
LAMP versus VAMP
How does our best Learned AMP compare to MMSE VAMP?
[Figure: average NMSE [dB] vs. layers/iterations for LAMP-pwlin, VAMP-bg, and the support oracle.]
VAMP wins!
So what about “learned VAMP”?
Learned VAMP
Suppose we unfold VAMP and learn (via backprop) the parameters {St, ηt}Tt=1 that minimize the training MSE.
[Diagram: unfolded VAMP network; each layer applies a denoiser ηt(·) and a linear stage St, each followed by an Onsager correction, with denoiser inputs behaving as xo + N(0, I/γt1) and xo + N(0, I/γt2).]
Remarkably, backpropagation learns the parameters prescribed by VAMP!
Theory explains the deep network!
Onsager correction decouples the design of {St, ηt(·)}Tt=1:
    layer-wise optimal St, ηt(·) ⇒ network-optimal {St, ηt(·)}Tt=1.
Extensions: GLMs, Parameter Learning, Bilinear Problems
Generalized linear models
Until now we have considered the standard linear model: y = Axo +w.
One may also consider the generalized linear model (GLM), where
y ∼ p(y|z) with hidden z = A xo,
which supports, e.g.,
    yi = zi + wi: additive, possibly non-Gaussian noise
    yi = Q(zi + wi): quantization
    yi = sgn(zi + wi): binary classification
    yi = |zi + wi|: phase retrieval
    Poisson yi: photon-limited imaging.
For this, there is a Generalized AMP29 with a rigorous state evolution.30
There is also a Generalized VAMP31 with a rigorous state evolution.32
29 Rangan '11. 30 Javanmard, Montanari '12. 31 S, Fletcher, Rangan '16. 32 Fletcher, Rangan, S '18.
Parameter learning
Consider inference under prior p(x; θ1) and likelihood ℓ(x; y, θ2), where the hyperparameters θ ≜ [θ1, θ2] are unknown.
    θ1 might specify the sparsity rate, or all parameters of a Gaussian mixture model.
    θ2 might specify the measurement noise variance, or the forward model A.

EM-inspired extensions of (G)AMP and (G)VAMP that simultaneously estimate x and learn θ from y have been developed.
    They have rigorous state evolutions.33,34
    "Adaptive VAMP" yields asymptotically consistent34 estimates of θ.

SURE-based auto-tuning AMP algorithms have also been proposed:
    for LASSO by Mousavi, Maleki, and Baraniuk;
    for parametric separable denoisers by Guo and Davies.
33 Kamilov, Rangan, Fletcher, Unser '12. 34 Fletcher, Sahraee, Rangan, S '17.
Bilinear problems
So far we have considered (generalized) linear models.
AMP has also been applied to (generalized) bilinear models.
The typical problem is to recover B ∈ Rm×k and C ∈ Rk×n from
    Y = BC + W    (standard bilinear model), or
    Y ∼ p(Y|Z) for Z = BC    (generalized bilinear model).

The case where m, n → ∞ for fixed k is well understood.35 (See Jean's talk.)
With m, n, k → ∞, algorithms work (e.g., BiGAMP36) but are not well understood.

A more general bilinear problem is to recover b ∈ Rk and c ∈ Rn from
    yi = bTAic + wi, i = 1, . . . , m, or
    yi ∼ p(yi|zi) for zi = bTAic, i = 1, . . . , m,
where {Ai} are known matrices.
Algorithms37 and replica analyses38 (for m,n, k →∞ and i.i.d. Ai) exist.
35 Montanari, Venkataramanan '17. 36 Parker, S, Cevher '14. 37 Parker, S '16. 38 Schulke, S, Zdeborova '16.
Conclusions
AMP and VAMP are computationally efficient algorithms for (generalized) linear regression.
With large random A, the ensemble behaviors of AMP and VAMP obey rigorous state evolutions whose fixed points, when unique, agree with the replica predictions of the MMSE.
AMP and VAMP support non-separable (i.e., "plug-in") denoisers, also with rigorous state evolutions.
For convex optimization problems, VAMP is provably convergent for any A.
Extensions of AMP and VAMP cover:
    unfolded deep networks
    the learning of unknown prior/likelihood parameters
    bilinear problems
Not discussed: multilayer versions of AMP & VAMP.
References I
S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, "Plug-and-play priors for model-based reconstruction," in Proc. IEEE Global Conf. Signal Info. Process., pp. 945–948, 2013.
D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing," Proc. Nat. Acad. Sci., vol. 106, pp. 18914–18919, Nov. 2009.
A. Chambolle, R. A. DeVore, N. Lee, and B. J. Lucier, "Nonlinear wavelet image processing: Variational problems, compression, and noise removal through wavelet shrinkage," IEEE Trans. Image Process., vol. 7, pp. 319–335, Mar. 1998.
M. Bayati and A. Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," IEEE Trans. Inform. Theory, vol. 57, pp. 764–785, Feb. 2011.
G. Reeves and H. D. Pfister, "The replica-symmetric prediction for compressed sensing with Gaussian matrices is exact," in Proc. IEEE Int. Symp. Inform. Thy., 2016.
J. Barbier, M. Dia, N. Macris, and F. Krzakala, "The mutual information in random linear estimation," in Proc. Allerton Conf. Commun. Control Comput., pp. 625–632, 2016.
M. Bayati, M. Lelarge, and A. Montanari, "Universality in polytope phase transitions and message passing algorithms," Ann. App. Prob., vol. 25, no. 2, pp. 753–822, 2015.
References II
P. Schniter, "Turbo reconstruction of structured sparse signals," in Proc. Conf. Inform. Science & Syst., (Princeton, NJ), pp. 1–6, Mar. 2010.
S. Som and P. Schniter, "Approximate message passing for recovery of sparse signals with Markov-random-field support structure." Internat. Conf. Mach. Learning—Workshop on Structured Sparsity: Learning and Inference, (Bellevue, WA), July 2011.
S. Som and P. Schniter, "Compressive imaging using approximate message passing and a Markov-tree prior," IEEE Trans. Signal Process., vol. 60, pp. 3439–3448, July 2012.
D. L. Donoho, I. M. Johnstone, and A. Montanari, "Accurate prediction of phase transitions in compressed sensing via a connection to minimax denoising," IEEE Trans. Inform. Theory, vol. 59, June 2013.
C. A. Metzler, A. Maleki, and R. G. Baraniuk, "BM3D-AMP: A new image recovery algorithm based on BM3D denoising," in Proc. IEEE Int. Conf. Image Process., pp. 3116–3120, 2015.
R. Berthier, A. Montanari, and P.-M. Nguyen, "State evolution for approximate message passing with non-separable functions," Inform. Inference, 2019.
References III
C. Rush and R. Venkataramanan, "Finite-sample analysis of approximate message passing algorithms," IEEE Trans. Inform. Theory, vol. 64, no. 11, pp. 7264–7286, 2018.
S. Rangan, P. Schniter, and A. K. Fletcher, "Vector approximate message passing," IEEE Trans. Inform. Theory, to appear (see also arXiv:1610.03082).
A. M. Tulino, G. Caire, S. Verdu, and S. Shamai (Shitz), "Support recovery with sparsely sampled free random matrices," IEEE Trans. Inform. Theory, vol. 59, pp. 4243–4271, July 2013.
B. Cakmak, O. Winther, and B. H. Fleury, "S-AMP: Approximate message passing for general matrix ensembles," in Proc. Inform. Theory Workshop, pp. 192–196, 2014.
J. Vila, P. Schniter, S. Rangan, F. Krzakala, and L. Zdeborova, "Adaptive damping and mean removal for the generalized approximate message passing algorithm," in Proc. IEEE Int. Conf. Acoust. Speech & Signal Process., pp. 2021–2025, 2015.
A. K. Fletcher, M. Sahraee-Ardakan, S. Rangan, and P. Schniter, "Expectation consistent approximate inference: Generalizations and convergence," in Proc. IEEE Int. Symp. Inform. Thy., pp. 190–194, 2016.
M. Opper and O. Winther, "Expectation consistent approximate inference," J. Mach. Learn. Res., vol. 6, pp. 2177–2204, 2005.
References IV
T. Minka, A Family of Approximate Algorithms for Bayesian Inference. PhD thesis, Dept. Comp. Sci. Eng., MIT, Cambridge, MA, Jan. 2001.
S. Ramani, T. Blu, and M. Unser, "Monte-Carlo SURE: A black-box optimization of regularization parameters for general denoising algorithms," IEEE Trans. Image Process., vol. 17, no. 9, pp. 1540–1554, 2008.
A. K. Fletcher, S. Rangan, S. Sarkar, and P. Schniter, "Plug-in estimation in high-dimensional linear inverse problems: A rigorous analysis," in Proc. Neural Inform. Process. Syst. Conf., pp. 7440–7449, 2018.
M. Borgerding, P. Schniter, and S. Rangan, "AMP-inspired deep networks for sparse linear inverse problems," IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4293–4308, 2017.
C. Guo and M. E. Davies, "Near optimal compressed sensing without priors: Parametric SURE approximate message passing," IEEE Trans. Signal Process., vol. 63, pp. 2130–2141, 2015.
U. Kamilov and H. Mansour, "Learning optimal nonlinearities for iterative thresholding algorithms," IEEE Signal Process. Lett., vol. 23, pp. 747–751, May 2016.
References V
S. Rangan, "Generalized approximate message passing for estimation with random linear mixing," in Proc. IEEE Int. Symp. Inform. Thy., pp. 2168–2172, Aug. 2011 (full version at arXiv:1010.5141).
A. Javanmard and A. Montanari, "State evolution for general approximate message passing algorithms, with applications to spatial coupling," Inform. Inference, vol. 2, no. 2, pp. 115–144, 2013.
P. Schniter, S. Rangan, and A. K. Fletcher, "Vector approximate message passing for the generalized linear model," in Proc. Asilomar Conf. Signals Syst. Comput., pp. 1525–1529, 2016.
A. K. Fletcher, S. Rangan, and P. Schniter, "Inference in deep networks in high dimensions," in Proc. IEEE Int. Symp. Inform. Thy., 2018.
U. S. Kamilov, S. Rangan, A. K. Fletcher, and M. Unser, "Approximate message passing with consistent parameter estimation and applications to sparse learning," IEEE Trans. Inform. Theory, vol. 60, pp. 2969–2985, May 2014.
A. K. Fletcher, M. Sahraee-Ardakan, S. Rangan, and P. Schniter, "Rigorous dynamics and consistent estimation in arbitrarily conditioned linear systems," in Proc. Neural Inform. Process. Syst. Conf., pp. 2542–2551, 2017.
References VI
A. Mousavi, A. Maleki, and R. G. Baraniuk, "Consistent parameter estimation for LASSO and approximate message passing," Ann. Statist., vol. 45, no. 6, pp. 2427–2454, 2017.
A. Montanari and R. Venkataramanan, "Estimation of low-rank matrices via approximate message passing," arXiv:1711.01682, 2017.
J. T. Parker, P. Schniter, and V. Cevher, "Bilinear generalized approximate message passing—Part I: Derivation," IEEE Trans. Signal Process., vol. 62, pp. 5839–5853, Nov. 2014.
J. T. Parker, P. Schniter, and V. Cevher, "Bilinear generalized approximate message passing—Part II: Applications," IEEE Trans. Signal Process., vol. 62, pp. 5854–5867, Nov. 2014.
J. T. Parker and P. Schniter, "Parametric bilinear generalized approximate message passing," IEEE J. Sel. Topics Signal Process., vol. 10, no. 4, pp. 795–808, 2016.
C. Schulke, P. Schniter, and L. Zdeborova, "Phase diagram of matrix compressed sensing," Physical Rev. E, vol. 94, pp. 062136(1–16), Dec. 2016.