
Hybrid Approximate Message Passing with Applications to Structured Sparsity

Sundeep Rangan, Alyson K. Fletcher, Vivek K Goyal, and Philip Schniter

Abstract—Gaussian and quadratic approximations of message passing algorithms on graphs have attracted considerable recent attention due to their computational simplicity, analytic tractability, and wide applicability in optimization and statistical inference problems. This paper presents a systematic framework for incorporating such approximate message passing (AMP) methods in general graphical models. The key concept is a partition of dependencies of a general graphical model into strong and weak edges, with the weak edges representing interactions through aggregates of small, linearizable couplings of variables. AMP approximations based on the Central Limit Theorem can be readily applied to the weak edges and integrated with standard message passing updates on the strong edges. The resulting algorithm, which we call hybrid generalized approximate message passing (Hybrid-GAMP), can yield significantly simpler implementations of sum-product and max-sum loopy belief propagation. By varying the partition of strong and weak edges, a performance–complexity trade-off can be achieved. Group sparsity problems are studied as an example of this general methodology where there is a natural partition of edges.

Index Terms—belief propagation, estimation, group sparsity, max-sum algorithm, maximum a posteriori probability, minimum mean-squared error, optimization, simultaneous sparsity, sum-product algorithm

I. INTRODUCTION

Message passing algorithms on graphical models have become widely-used in high-dimensional optimization and inference problems in a range of fields [1], [2]. The fundamental principle of graphical models is to factor high-dimensional problems into sets of smaller problems of lower dimension. The factorization is represented via a graph where the problem variables and factors are represented by the graph vertices, and the dependencies between them are represented by edges. Message passing methods such as loopy belief propagation (BP) use this graphical structure to perform approximate inference or optimization in an iterative manner. In each iteration, inference or optimization is performed "locally" on the sub-problems associated with each factor, and "messages" are passed between the variables and factors to account for the coupling between the local problems.

Although effective in a range of problems, loopy BP is only as computationally simple as the problems in the constituent factors. If the factors themselves are of high dimension, exact implementation of loopy BP will be computationally intractable.

S. Rangan (email: srangan@poly.edu) is with Polytechnic Institute of New York University, Brooklyn, NY.

A. K. Fletcher (email: alyson@eecs.berkeley.edu) is with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.

V. K. Goyal (email: vgoyal@mit.edu) is with the Department of Electrical Engineering and Computer Science and the Research Laboratory of Electronics, Massachusetts Institute of Technology.

P. Schniter (email: schniter@ece.osu.edu) is with the Department of Electrical and Computer Engineering, The Ohio State University.

To reduce the complexity of loopy BP, this paper presents a hybrid generalized approximate message passing (Hybrid-GAMP) algorithm for what we call graphical models with linear mixing. The basic idea is that when factors depend on large numbers of variables, the dependencies are often through aggregates of small, linearizable perturbations. In the proposed graphical model with linear mixing, these linear, weak interactions are identified by partitioning the graph edges into weak and strong edges, with the dependencies on the weak edges being described by a linear transform. Under the assumption that the components of the linear transform are small, it is argued that the computations for the messages of standard loopy BP along the weak edges can be significantly simplified. The approximate message passing along the weak edges is integrated with the standard loopy BP messages on the strong edges.

We illustrate this Hybrid-GAMP methodology in the context of two common variants of loopy BP: the sum-product algorithm for inference (e.g., computation of a posterior mean) and the max-sum algorithm for optimization (e.g., computation of a posterior mode). For the sum-product loopy BP algorithm, we show that the messages along the weak edges can be approximated as Gaussian random variables and the computations for these messages can be simplified via the Central Limit Theorem. For max-sum loopy BP, we argue that one can use quadratic approximations of the messages and perform the computations via a simple least-squares solution.

These approximations can dramatically simplify the computations. The complexity of standard loopy BP generically grows exponentially with the maximum degree of the factor nodes. With the GAMP approximation, however, the complexity is exponential only in the maximum degree from the strong edges, while it is linear in the number of weak edges. As a result, Hybrid-GAMP algorithms on a graphical model with linear mixing can remain tractable even with very large numbers of weak, linearizable interactions.

Gaussian and quadratic approximations for message passing algorithms with linear dependencies are, of course, not new. The purpose of this paper is to provide a systematic and general framework for these approximations that incorporates and extends many earlier algorithms. For example, many previous works have considered Gaussian approximations of loopy BP for the problem of estimating vectors with independent components observed through noisy, linear measurements [3]–[9]. In the terminology of this paper, these algorithms apply to


graphs where all the non-trivial edges are weak. As discussed in Section III, by enabling graphs that have mixes of both strong and weak edges, the framework of this paper can significantly generalize these methods. For example, instead of the unknown vector simply having independent components, the presence of strong edges can enable the vector to have any prior describable with a graphical model.

The approach here of combining approximate message passing methods and standard graphical models with linear mixing is closest to the methods developed in [10]–[13] for wavelet image denoising and turbo equalization. These works also considered graphical models that had both linear and nonlinear components, and applied approximate message passing techniques along the lines of [7], [8] to the linearizable portions while maintaining standard BP updates in the remainder of the graph. The use of approximate message passing methods on portions of a factor graph has also been applied with joint parameter estimation and decoding for CDMA multiuser detection in [14] and in a wireless interference coordination problem in [15], and was proposed in [16, Section 7] in the context of compressed sensing. The framework presented here unifies and extends all of these examples and thus provides a systematic procedure for incorporating Gaussian approximations of message passing in a modular manner in general graphical models.

II. GRAPHICAL MODEL PROBLEMS WITH LINEAR MIXING

Let x and z be real-valued block column vectors

x = (x_1^*, \ldots, x_n^*)^*, \qquad z = (z_1^*, \ldots, z_m^*)^*,   (1)

and consider a function of these vectors of the form

F(x, z) := \sum_{i=1}^{m} f_i(x_{α(i)}, z_i),   (2)

where, for each i, f_i(·) is a real-valued function; α(i) is a subset of the indices {1, . . . , n}; and x_{α(i)} is the concatenation of the vectors {x_j, j ∈ α(i)}. We will be interested in computations on this function subject to linear constraints of the form

z_i = \sum_{j=1}^{n} A_{ij} x_j = A_i x,   (3)

where each A_{ij} is a real-valued matrix and A_i is the block column matrix with components A_{ij}. We will also let A be the block matrix with components A_{ij} so that we can write the linear constraints as z = Ax.

Fig. 1. Factor graph representation of the linear mixing estimation and optimization problems. The variable nodes (circles) are connected to the factor nodes (squares) either directly (strong edges) or via the output of the linear mixing matrix A (weak edges).

The function F(x, z) is naturally described via a graphical model as shown in Fig. 1. Specifically, we associate with F(x, z) a bipartite factor graph G = (V, E) whose vertices V consist of n variable nodes corresponding to the (vector-valued) variables x_j, and m factor nodes corresponding to the factors f_i(·) in (2). There is an edge (i, j) ∈ E in the graph if and only if the variable x_j has some influence on the factor f_i(x_{α(i)}, z_i). This influence can occur in one of two mutually exclusive ways:

• The index j is in α(i), so that the variable x_j directly appears in the sub-vector x_{α(i)} in the factor f_i(x_{α(i)}, z_i). In this case, (i, j) will be called a strong edge, since x_j can have an arbitrary and potentially-large influence on the factor.

• The matrix A_{ij} is nonzero, so that x_j affects f_i(x_{α(i)}, z_i) through its linear influence on z_i in (3). In this case, (i, j) will be called a weak edge, since the approximations we will make in the algorithms below assume that A_{ij} is small. The set of weak edges into the factor node i will be denoted β(i).

Together, α(i) and β(i) comprise the set of all indices j such that the variable node x_j is connected to the factor node f_i(·) in the graph G. The union ∂(i) = α(i) ∪ β(i) is thus the neighbor set of f_i(·). Similarly, for any variable node x_j, we let α(j) be the set of indices i such that the factor node f_i(·) is connected to x_j via a strong edge, and let β(j) be the set of indices i such that there is a weak edge. We let ∂(j) = α(j) ∪ β(j) be the union of these sets, which is the neighbor set of x_j.

Given these definitions, we are interested in two problems:

• Optimization problem P-OPT: Given a function F(x, z) of the form (2) and a matrix A, compute the maxima

\hat{x} = \arg\max_{x : z = Ax} F(x, z), \qquad \hat{z} = A\hat{x}.   (4)

Also, for each j, compute the marginal value function

Δ_j(x_j) := \max_{x_{\setminus j} : z = Ax} F(x, z),   (5)

where the maximization is over all variables x_r for r ≠ j.

• Expectation problem P-EXP: Given a function F(x, z)

of the form (2), a matrix A, and scale factor u > 0, define the joint distribution

p(x) := \frac{1}{Z(u)} \exp\left[ u F(x, z) \right], \qquad z = Ax,   (6)

where Z(u) is a normalization constant called the partition function (it is a function of u). For this distribution, compute the expectations

\hat{x} = E[x], \qquad \hat{z} = E[z].   (7)


Also, for each j, compute the log marginal

Δ_j(x_j) := \frac{1}{u} \log \int \exp\left[ u F(x, z) \right] dx_{\setminus j},   (8)

where the integral is over all variables x_r for r ≠ j.

Both problems P-OPT and P-EXP naturally arise in statistical inference: Suppose we are given a probability distribution p(x) of the form (6) for some function F(x, z). The function F(x, z) may depend implicitly on some observed vector y, so that p(x) represents the posterior distribution of x given y. In this context, the solution (\hat{x}, \hat{z}) to the problem P-OPT is precisely the maximum a posteriori (MAP) estimate of x and z given the observations y. Similarly, the solution (\hat{x}, \hat{z}) to the problem P-EXP is precisely the minimum mean squared error (MMSE) estimate when u = 1. For P-EXP, the function Δ_j(x_j) is the log marginal distribution of x_j.

The two problems are related: A standard large deviations argument [17] shows that, under suitable conditions, as u → ∞ the distribution p(x) in (6) concentrates around the maxima (\hat{x}, \hat{z}) in the solution to the problem P-OPT. As a result, the solution (\hat{x}, \hat{z}) to P-EXP converges to the solution to P-OPT.

A. Further Assumptions and Notation

In the analysis below, we will assume that, for all factor nodes f_i(·), the strong and weak neighbors, α(i) and β(i), are disjoint. That is,

α(i) ∩ β(i) = ∅. (9)

This assumption introduces no loss of generality: If an edge (i, j) is both weak and strong, we can modify the function f_i(x_{α(i)}, z_i) to "move" the influence of x_j from the term z_i into the direct term x_{α(i)}. For example, suppose that for some i,

z_i = A_{i1} x_1 + A_{i3} x_3 + A_{i4} x_4

and α(i) = {1, 2}. In this case, the edge (i, 1) is both strong and weak. That is, the function f_i(x_{α(i)}, z_i) depends on the variable x_1 both directly through x_{α(i)} and through z_i. To satisfy the assumption (9), we define a new z_i,

z_i^{new} = A_{i3} x_3 + A_{i4} x_4,

and a new function f_i(·),

f_i^{new}(x_{α(i)}, z_i^{new}) = f_i^{new}((x_1, x_2), z_i^{new}) = f_i((x_1, x_2), A_{i1} x_1 + z_i^{new}).

With these definitions,

f_i(x_{α(i)}, z_i) = f_i^{new}(x_{α(i)}, z_i^{new}).

Therefore we can replace f_i(·) and z_i with f_i^{new}(·) and z_i^{new} and obtain an equivalent problem.

Even when the dependence of a factor f_i(x_{α(i)}, z_i) on a variable x_j is only through the linear term z_i, we may still wish to "move" the dependence to a strong edge. The reason is that the GAMP algorithm below assumes that the linear dependence is weak, that is, A_{ij} is small. If that is not the case, the dependence can be treated as strong, where the small-term approximations are not made. This improves the accuracy of the method at the expense of greater computation.


Fig. 2. An example of a simple graphical model for an estimation problem where x has independent components with priors p(x_j), z = Ax, and the observation vector y is the output of a componentwise measurement channel with transition function p(y_i | z_i).

One final piece of notation: since A_{ij} ≠ 0 only when j ∈ β(i), we may sometimes write the summation (3) as

z_i = \sum_{j ∈ β(i)} A_{ij} x_j = A_{i,β(i)} x_{β(i)},   (10)

where x_{β(i)} is the sub-vector of x with components j ∈ β(i) and A_{i,β(i)} is the corresponding portion of the ith block-row of A.

III. MOTIVATING EXAMPLES

We begin with a basic development to show that some previously-studied problems with a fully separable prior on x fit within our model. Then we show an extension to more complicated problems. A more extensive example is deferred to Section VI.

Linear Mixing and General Output Channel—Independent Variables: As a simple example of a graphical model with linear mixing, consider the following estimation problem: An unknown vector x has independent components x_j, each with probability distribution p(x_j). The vector x is passed through a linear transform to yield an output z = Ax. Each component z_i of z then randomly generates an output component y_i with a conditional distribution p(y_i | z_i). The problem is to estimate x given the observations y and transform matrix A.

Under the assumption that the components x_j are independent, and the components y_i are conditionally independent given z, the posterior distribution of x factors as

p(x | y) = \frac{1}{Z(y)} \prod_{i=1}^{m} p(y_i | z_i) \prod_{j=1}^{n} p(x_j), \qquad z = Ax,

where Z(y) is a normalization constant. If we now fix some observed y, we can rewrite this posterior as

p(x | y) ∝ \exp\left[ F(x, z) \right], \qquad z = Ax,

where F(x, z) is the log posterior,

F(x, z) = \sum_{i=1}^{m} \log p(y_i | z_i) + \sum_{j=1}^{n} \log p(x_j),

and the dependence on y is implicit. The log posterior is therefore in the form of (2) with the scale factor u = 1 and


m + n factors, f_i(·), i = 1, . . . , m + n. The first m factors can be used for the output terms

f_i(z_i) = \log p(y_i | z_i), \qquad i = 1, . . . , m,

which do not depend directly on any of the terms x_j. That is, α(i) = ∅ for each of these factor indices i. The remaining n factors are for the inputs

f_{m+j}(x_j) = \log p(x_j), \qquad j = 1, . . . , n.

For these factors, the strong edge set is the singleton α(m + j) = {j}, and there is no linear term; we can think of z_{m+j} as zero-dimensional. The corresponding factor graph with the m + n factors is shown in Fig. 2.

In the case when all the variables x_j and z_i are scalars, the estimation problem is precisely the set-up considered in [6], [18], as mentioned in the introduction. The special subcase of this problem when each output measurement y_i is z_i with additive white Gaussian noise (AWGN),

y_i = z_i + w_i, \qquad w_i ∼ N(0, σ_w^2),   (11)

is considered in several references [3], [5], [8], [9], [19] for a variety of different distributions p(x_j).
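As a concrete illustration of this separable model, the following minimal Python sketch draws data from it; the i.i.d. Gaussian mixing matrix and the Bernoulli–Gaussian prior are purely illustrative choices, not ones prescribed here.

```python
import numpy as np

# Minimal sketch of the separable linear-mixing model with the AWGN
# output channel (11). The Bernoulli-Gaussian prior and i.i.d. Gaussian A
# are illustrative assumptions.
rng = np.random.default_rng(0)
n, m = 200, 100            # signal and measurement dimensions
rho, sigma_w = 0.1, 0.1    # sparsity level and noise standard deviation

x = rng.standard_normal(n) * (rng.random(n) < rho)  # x_j drawn i.i.d. from a sparse prior
A = rng.standard_normal((m, n)) / np.sqrt(m)        # dense mixing matrix
z = A @ x                                           # linear mixing z = A x
y = z + sigma_w * rng.standard_normal(m)            # componentwise AWGN channel (11)
```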

Linear Mixing and General Output Channel—Dependent Variables: The graphical model framework considered here is significantly more general. For example, consider the graphical model in Fig. 3. In this case, the components of the input vector are no longer necessarily independent, but are instead themselves described by a graphical model. Some additional latent variables in the vector u may also be added. For example, [10] used a discrete Markov chain to model clustered sparsity, [11] used discrete-Markov and Gauss-Markov chains to model slow changes in support and amplitude across multiple measurement vectors, and [12] used a discrete Markov tree model to capture the persistence across scales in the wavelet coefficients of an image. In the example we present in Section VI, we will use the graphical model to represent joint sparsity. Similarly, the output need not involve a separable mapping, and the observations y can depend on the outputs z through a second graphical model, also with some latent variables v that can act as unknown parameters. For example, the output could be an unknown nonlinear function and the parameters v may be the parameters in the function. This technique was used in [13] to incorporate constraints on LDPC coded bits when performing turbo sparse-channel estimation, equalization, and decoding using GAMP.

Fig. 3. A generalization of the model in Fig. 2, where the input variables x are themselves generated by a graphical model with latent variables u. Similarly, the dependence of the observation vector y on the linear mixing output z is through a second graphical model.

IV. REVIEW OF LOOPY BELIEF PROPAGATION

Finding exact solutions to the problems P-OPT and P-EXP above is generally intractable for most factors f_i(·), as the solution requires optimizations or expectations over all n variables x_j. A widely-used approximate technique is loopy belief propagation [2], [20], which attempts to reduce the optimization and estimation problems to a sequence of lower-dimensional problems associated with each factor f_i(x_{α(i)}, z_i). We consider two common variants of loopy BP: the max-sum algorithm for the problem P-OPT and the sum-product algorithm for the problem P-EXP. This section will briefly review these methods, as they will be the basis of the GAMP algorithms described in Section V.

The max-sum loopy BP algorithm is based on iteratively passing estimates of the marginal utilities Δ_j(x_j) in (5) along the graph edges. Similarly, the sum-product loopy BP algorithm passes estimates of the log marginals Δ_j(x_j) in (8). For either algorithm, we index the iterations by t = 0, 1, . . ., and denote the estimate "message" from the factor node f_i to the variable node x_j in the tth iteration by Δ_{i→j}(t, x_j) and the reverse message by Δ_{i←j}(t, x_j).

To describe the updates of the messages, we need to introduce some additional notation. First, the messages in loopy BP are equivalent up to a constant factor. That is, adding any constant term that does not depend on x_j to either the message Δ_{i→j}(t, x_j) or Δ_{i←j}(t, x_j) has no effect on the algorithm. We will thus use the notation

Δ(x) ≡ g(x) \Longleftrightarrow Δ(x) = g(x) + C,

for some constant C that does not depend on x. Similarly, we write p(x) ∝ q(x) when p(x) = C q(x) for some constant C. Finally, for the sum-product algorithm, we will fix the scale factor u > 0 in the problem P-EXP, and, for any function Δ(·), we will write E[g(x) ; Δ(·)] to denote the expectation of g(x) with respect to a distribution specified indirectly by Δ(·):

E[g(x) ; Δ(·)] = \int g(x) p(x)\, dx,   (12)

where p(x) is the probability distribution

p(x) = \frac{1}{Z(u)} \exp\left[ u Δ(x) \right]

and Z(u) is a normalization constant. Given these definitions, the updates for the messages in both the max-sum and sum-product loopy BP algorithms are as follows:

Algorithm 1: Loopy belief propagation: Consider the problems P-OPT or P-EXP above for some function F(x, z) of the form (2) and matrix A. For the problem P-EXP, fix the scale factor u > 0. The max-sum loopy BP algorithm for the problem P-OPT and the sum-product loopy BP algorithm for the problem P-EXP proceed as follows:


1) Initialization: Set t = 0 and, for all (i, j) ∈ E, set Δ_{i←j}(t, x_j) = 0.

2) Factor node update: For all edges (i, j) ∈ E, compute the function

H_{i→j}(t, x_{∂(i)}, z_i) := f_i(x_{α(i)}, z_i) + \sum_{r ∈ ∂(i), r ≠ j} Δ_{i←r}(t, x_r).   (13)

For max-sum loopy BP compute:

Δ_{i→j}(t, x_j) ≡ \max_{x_{∂(i)\setminus j},\ z_i = A_i x} H_{i→j}(t, x_{∂(i)}, z_i),   (14)

where the maximization in (14) is over all variables x_r with r ∈ ∂(i) and r ≠ j, subject to the constraint z_i = A_i x.

For sum-product loopy BP compute:

Δ_{i→j}(t, x_j) ≡ \frac{1}{u} \log \int p_{i→j}(t, x_{∂(i)})\, dx_{∂(i)\setminus j},   (15)

where the integration is over variables x_r with r ∈ ∂(i) and r ≠ j, and p_{i→j}(t, x_{∂(i)}) is the probability distribution

p_{i→j}(t, x_{∂(i)}) ∝ \exp\left[ u H_{i→j}(t, x_{∂(i)}, z_i) \right], \qquad z_i = A_i x.   (16)

3) Variable node update: For all (i, j) ∈ E:

Δ_{i←j}(t+1, x_j) ≡ \sum_{ℓ ∈ ∂(j), ℓ ≠ i} Δ_{ℓ→j}(t, x_j).   (17)

Also, let

Δ_j(t+1, x_j) ≡ \sum_{i ∈ ∂(j)} Δ_{i→j}(t, x_j).   (18)

For max-sum loopy BP, compute:

\hat{x}_j(t+1) := \arg\max_{x_j} Δ_j(t+1, x_j).   (19)

For sum-product loopy BP, compute:

\hat{x}_j(t+1) := E\left[ x_j ; Δ_j(t+1, ·) \right].   (20)

Increment t and return to step 2 for some number of iterations.

When the graph G is acyclic, it can be shown that the max-sum and sum-product loopy BP algorithms above converge, respectively, to the exact solutions of the problems P-OPT and P-EXP in Section II. However, for graphs with cycles, loopy BP is, in general, only approximate. A complete analysis of loopy BP is beyond the scope of this work and is covered extensively elsewhere. See, for example, [2], [20] and [21].

What is important here is the computational complexity of loopy BP. Brute force solutions to the problems P-OPT and P-EXP in Section II involve either a joint optimization or expectation over all n variables x_j. The loopy BP algorithm, in contrast, reduces these "global" problems to a sequence of "local" problems associated with each of the factors f_i(·). The local optimization or expectation problems may be significantly lower in dimension than the global problem. Consider, for example, the update in Step 2 at some factor node f_i(·) and let d = |∂(i)| be the number of neighbors of f_i(·). The factor f_i(x_{α(i)}, z_i) is thus a function of d variables x_j, each connected through one of the |α(i)| strong edges or |β(i)| weak edges. For each j ∈ ∂(i), the optimization (14) for the max-sum algorithm thus involves an optimization over d−1 of the variables. Similarly, for the sum-product algorithm, the factor node update (15) involves an integration over d−1 variables. For certain classes of functions (such as the binary constraint functions in LDPC codes [22], [23]), the optimization or expectation admits a simple solution. However, both optimizations and expectations, in general, have complexities that grow exponentially in d. Thus, standard loopy BP is typically tractable only when the degrees d of the factor nodes are small or the factors have some particular form.
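To make this scaling concrete, the toy sketch below performs the max-sum factor update (14) at a single factor node with d binary neighbors by brute force; the factor f and the binary alphabet are hypothetical choices for illustration only. The enumeration over the other d−1 variables has 2^{d−1} terms per outgoing message, which is exactly the cost that Hybrid-GAMP avoids on the weak edges.

```python
import itertools
import numpy as np

# Brute-force max-sum factor update (14) at one factor node whose d
# neighbors are binary variables: cost 2**(d-1) per outgoing message.
def maxsum_factor_update(f, d, j, xj):
    """Message Delta_{i->j}(xj): maximize f over the other d-1 binary variables."""
    best = -np.inf
    for other in itertools.product([0, 1], repeat=d - 1):
        x = list(other[:j]) + [xj] + list(other[j:])   # insert x_j at position j
        best = max(best, f(x))
    return best

d = 12
w = np.random.default_rng(1).standard_normal(d)
f = lambda x: -0.5 * (np.dot(w, x) - 1.0) ** 2         # an arbitrary toy factor
msg = [maxsum_factor_update(f, d, j=0, xj=b) for b in (0, 1)]
```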

V. HYBRID-GAMP

The Hybrid-GAMP algorithm reduces the cost of loopy BP by exploiting complexity-reducing approximations of the cumulative effect of the weak edges. We saw in the previous section that the loopy BP update at each factor node f_i(·) has a cost that may be exponential in the degree d of the node, which consists of |α(i)| strong edges and |β(i)| weak edges. The Hybrid-GAMP algorithm with edge partitioning uses the linear mixing property to eliminate the exponential dependence on the |β(i)| weak edges, so the only exponential dependence is on the |α(i)| strong edges. Thus, the edge partitioning makes Hybrid-GAMP computationally tractable as long as the number of strong edges is small. There can be an arbitrary number of weak edges. In particular, the mixing matrix A can be dense.

The basis of the Hybrid-GAMP approximation is to assume that the matrix A_{ij} is small along any weak edge (i, j). Under this assumption, in the max-sum algorithm one can apply a quadratic approximation of the messages along the weak edges and reduce the factor node update to a standard least-squares problem. Similarly, in the sum-product algorithm, one can apply a Gaussian approximation of the messages along the weak edges and use the Central Limit Theorem at the factor nodes.

A heuristic derivation of the Hybrid-GAMP approximations is given in Appendix A for the max-sum algorithm and Appendix B for the sum-product algorithm. We emphasize that these derivations are merely heuristic—we do not claim any formal matching between loopy BP and the Hybrid-GAMP approximation.

To state the Hybrid-GAMP algorithm, we need additional notation: The Hybrid-GAMP algorithm produces a sequence of estimates \hat{x}_j(t) and \hat{z}_i(t) for the variables x_j and z_i. Several other intermediate variables \hat{p}_i(t), \hat{s}_i(t) and \hat{r}_j(t) are also produced. Associated with each of the variables are matrices Q^x_j(t), Q^z_i(t), . . ., that represent certain Hessians for the max-sum algorithm and covariances for the sum-product algorithm. When we need to take the inverse of the matrices, we will use the notation Q^{-x}_j(t) to mean (Q^x_j(t))^{-1}. Also, a^* and A^* denote the transposes of vector a and matrix A. Finally, for any positive definite matrix Q and vector a, we will let ‖a‖^2_Q = a^* Q^{-1} a, which is a weighted two norm.
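The weighted norm ‖a‖^2_Q = a^* Q^{-1} a appears throughout the algorithm below; as a small aid, it can be written as a one-line helper (a minimal sketch, with the function name chosen here for illustration):

```python
import numpy as np

# Weighted squared norm ||a||_Q^2 = a* Q^{-1} a for a positive definite Q.
def wnorm2(a, Q):
    return a @ np.linalg.solve(Q, a)
```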


Algorithm 2: Hybrid-GAMP: Consider the problems P-OPT or P-EXP above for some function F(x, z) of the form (2) and matrix A. For the problem P-EXP, fix the scale factor u > 0. The max-sum GAMP algorithm for the problem P-OPT and the sum-product GAMP algorithm for the problem P-EXP proceed as follows:

1) Initialization: Set t = 0 and select some initial values Δ_{i→j}(t−1, x_j) for all strong edges (i, j) and values \hat{r}_j(t−1) and Q^r_j(t−1) for all variable node indices j.

2) Variable node update, strong edges: For all strong edges (i, j), compute

Δ_{i←j}(t, x_j) ≡ \sum_{ℓ ∈ α(j), ℓ ≠ i} Δ_{ℓ→j}(t−1, x_j) − \frac{1}{2} ‖\hat{r}_j(t−1) − x_j‖^2_{Q^r_j(t−1)}.   (21)

3) Variable node update, weak edges: For all variable nodes j, compute

Δ_j(t, x_j) ≡ H^x_j(t, x_j, \hat{r}_j(t−1), Q^r_j(t−1))   (22)

and

H^x_j(t, x_j, r_j, Q^r_j) = \sum_{i ∈ α(j)} Δ_{i→j}(t−1, x_j) − \frac{1}{2} ‖r_j − x_j‖^2_{Q^r_j}.   (23)

For max-sum GAMP:

\hat{x}_j(t) = \arg\max_{x_j} Δ_j(t, x_j),   (24a)

Q^{-x}_j(t) = − \frac{∂^2}{∂x_j^2} Δ_j(t, x_j).   (24b)

For sum-product GAMP:

\hat{x}_j(t) = E(x_j ; Δ_j(t, ·)),   (25a)

Q^x_j(t) = u\, var(x_j ; Δ_j(t, ·)).   (25b)

4) Factor node update, linear step: For all factor nodes i, compute

\hat{z}_i(t) = \sum_{j ∈ β(i)} A_{ij} \hat{x}_j(t),   (26a)

\hat{p}_i(t) = \hat{z}_i(t) − Q^p_i(t) \hat{s}_i(t−1),   (26b)

Q^p_i(t) = \sum_{j ∈ β(i)} A_{ij} Q^x_j(t) A^*_{ij},   (26c)

where, initially, we set \hat{s}_i(−1) = 0.

5) Factor node update, strong edges: For all strong edges (i, j), compute:

H^z_{i→j}(t, x_{α(i)}, z_i, \hat{p}_i, Q^p_i) := f_i(x_{α(i)}, z_i) + \sum_{r ∈ α(i), r ≠ j} Δ_{i←r}(t, x_r) − \frac{1}{2} ‖z_i − \hat{p}_i‖^2_{Q^p_i}.   (27)

Then, for max-sum GAMP compute:

Δ_{i→j}(t, x_j) = \max_{x_{α(i)\setminus j},\ z_i} H^z_{i→j}(t, x_{α(i)}, z_i, \hat{p}_i(t), Q^p_i(t)),   (28)

where the maximization is over z_i and all components x_r with r ∈ α(i) \ j.

For sum-product GAMP compute:

Δ_{i→j}(t, x_j) ≡ \frac{1}{u} \log \int p_{i→j}(t, x_{α(i)}, z_i)\, dx_{α(i)\setminus j}\, dz_i,   (29)

where the integral is over z_i and all components x_r with r ∈ α(i) \ j, and p_{i→j}(t, x_{α(i)}, z_i) is the probability distribution function

p_{i→j}(t, x_{α(i)}, z_i) ∝ \exp\left( u H^z_{i→j}(t, x_{α(i)}, z_i, \hat{p}_i(t), Q^p_i(t)) \right).   (30)

6) Factor node update, weak edges: For all factor nodes i, compute

H^z_i(t, x_{α(i)}, z_i, \hat{p}_i, Q^p_i) := f_i(x_{α(i)}, z_i) + \sum_{r ∈ α(i)} Δ_{i←r}(t, x_r) − \frac{1}{2} ‖z_i − \hat{p}_i‖^2_{Q^p_i}.   (31a)

Then, for max-sum GAMP compute:

(\hat{x}^0_{α(i)}(t), \hat{z}^0_i(t)) := \arg\max_{x_{α(i)}, z_i} H^z_i(t, x_{α(i)}, z_i, \hat{p}_i(t), Q^p_i(t)),   (31b)

D^z_i(t) := − \frac{∂^2}{∂z_i^2} H^z_i(t, \hat{x}^0_{α(i)}, \hat{z}^0_i, \hat{p}_i(t), Q^p_i(t)),   (31c)

where the maximization in (31b) is over the sub-vector x_{α(i)} and output vector z_i.

For sum-product GAMP, let

\hat{z}^0_i(t) = E(z_i), \qquad Q^z_i(t) = u\, var(z_i),   (32)

where z_i is the component of the pair (x_{α(i)}, z_i) with the joint distribution

p_i(t, x_{α(i)}, z_i) ∝ \exp\left( u H^z_i(t, x_{α(i)}, z_i, \hat{p}_i(t), Q^p_i(t)) \right).   (33)

Then, for either max-sum or sum-product compute

\hat{s}_i(t) = Q^{-p}_i(t) \left[ \hat{z}^0_i(t) − \hat{p}_i(t) \right],   (34a)

Q^s_i(t) = Q^{-p}_i(t) − Q^{-p}_i(t) D^{-z}_i(t) Q^{-p}_i(t).   (34b)

7) Variable node update, linear step: For all variable nodes j compute

Q^{-r}_j(t) = \sum_{i ∈ β(j)} A^*_{ij} Q^s_i(t) A_{ij},   (35a)

\hat{r}_j(t) = \hat{x}_j(t) + Q^r_j(t) \sum_{i ∈ β(j)} A^*_{ij} \hat{s}_i(t).   (35b)

Increment t and return to step 2 for some number of iterations.

Although the Hybrid-GAMP algorithm above appears much more complicated than standard loopy BP (Algorithm 1), Hybrid-GAMP can be dramatically cheaper computationally. Recall that the main computational difficulty of loopy BP is Step 2, the factor update. The updates (14) and (15) involve an optimization or expectation over |∂(i)| variables, where ∂(i) is the set of all variables connected to the factor node i. In the Hybrid-GAMP algorithm, these computations are replaced by (28) and (29), where the optimization and expectation need only be computed over the strong edge variables α(i). If the number of edges is large, the computational savings can be dramatic. The other steps of the Hybrid-GAMP algorithm are all linear steps, simple least-squares operations, or componentwise nonlinear functions on the individual variables.
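For scalar variables, the linear steps (26) and (35) reduce to matrix–vector products with A, its transpose, and their entrywise squares. The minimal sketch below writes out one pass of these weak-edge computations; the function names and array shapes are illustrative assumptions.

```python
import numpy as np

# One pass of the weak-edge (linear) Hybrid-GAMP steps for scalar variables.
# A is m-by-n; xhat, Qx are the current estimates and variances of x_j;
# shat holds s_i(t-1); Qs holds Q^s_i(t).
def factor_linear_step(A, xhat, Qx, shat):
    A2 = A ** 2                      # entrywise squares |A_ij|^2
    zhat = A @ xhat                  # (26a)
    Qp = A2 @ Qx                     # (26c)
    phat = zhat - Qp * shat          # (26b)
    return zhat, phat, Qp

def variable_linear_step(A, xhat, Qs, shat):
    A2 = A ** 2
    Qr = 1.0 / (A2.T @ Qs)           # (35a), inverted componentwise
    rhat = xhat + Qr * (A.T @ shat)  # (35b)
    return rhat, Qr
```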

A. Variants

For illustration, we have only presented one form of the Hybrid-GAMP procedure. Several variants are possible:

• Discrete distributions: The above description assumed continuous-valued random variables x_j. The procedures can be easily modified for discrete-valued variables by appropriately replacing integrals with summations.

• Message scheduling: The above description also only considered a completely parallel implementation where each iteration performs exactly one update on all edges. Other so-called message schedules are also possible and may offer more efficient implementations or better convergence depending on the application [24], [25].

VI. APPLICATION TO STRUCTURED SPARSITY

A. Hybrid-GAMP Algorithm

To illustrate the Hybrid-GAMP method, we consider the group sparse estimation problem [26], [27]. Although this problem does not utilize the full generality of the Hybrid-GAMP framework, it provides a simple example of the Hybrid-GAMP method and has a number of existing algorithms that can be compared against.

A general version of the group sparsity problem that falls within the Hybrid-GAMP framework can be described as follows: Let x be an n-dimensional vector with scalar components x_j, j = 1, . . . , n. Vector-valued components can also be considered, but we restrict our attention to scalar components for simplicity. The component indices j of the vector x are divided into K (possibly overlapping) groups, G_1, . . . , G_K ⊆ {1, . . . , n}. We let γ(j) be the set of group indices k such that j ∈ G_k. That is, γ(j) is the set of groups to which the component x_j belongs.

Suppose that each group G_k can be "active" or "inactive", and each component x_j can be non-zero only when the group G_k is active for at least one k ∈ γ(j). Qualitatively, a vector x is sparse with respect to this group structure if it is consistent with only a small number of groups being active. That is, most of the components of x are zero, with the non-zero components having support contained in a union of a small number of groups. The group sparse estimation problem is to estimate the vector x from some measurements y. The traditional (non-group) sparse estimation problem corresponds to the special case when there are n singleton groups G_j = {j}.

Fig. 4. Graphical model for the group sparsity problem with overlapping groups. The group dependencies between components of the vector x are modeled via a set of binary latent variables ξ.

Particularly with overlapping groups, there are a number of ways to model the group sparse structure in a Bayesian manner. For illustration, we consider the following simple model: For each group G_k, let ξ_k ∈ {0, 1} be a Boolean variable with ξ_k = 1 when the group G_k is active and ξ_k = 0 when it is inactive. We call the ξ_k activity indicators and model them as i.i.d. with

P(ξ_k = 1) = 1 − P(ξ_k = 0) = ρ   (36)

for some sparsity level ρ ∈ (0, 1). We assume that, given the vector ξ, the components of x are independent with the conditional distributions

x_j ∼ \begin{cases} 0 & \text{if } ξ_k = 0 \text{ for all } k ∈ γ(j) \\ V & \text{otherwise,} \end{cases}   (37)

where V is a random variable having the distribution of the component x_j in the event that it belongs to an active group. Finally, suppose that the measurement vector y is generated by first passing x through a linear transform z = Ax, and then a separable componentwise measurement channel with probability distribution functions p(y_i | z_i). Many other dependencies on the activities of x and measurement models for y are possible; we use this simple model for illustration.
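A minimal sketch of drawing a group-sparse vector from the model (36)–(37) is given below for non-overlapping groups; taking V ∼ N(0, 1) for the active components is an illustrative assumption, since the model allows any distribution for V.

```python
import numpy as np

# Draw a group-sparse x from (36)-(37) with K non-overlapping groups of size d.
# The Gaussian choice for V is an illustrative assumption.
rng = np.random.default_rng(0)
K, d, rho = 25, 4, 0.1                    # number of groups, group size, activity prob.
groups = np.arange(K * d).reshape(K, d)   # G_k = {kd, ..., kd + d - 1}

xi = rng.random(K) < rho                  # activity indicators, P(xi_k = 1) = rho  (36)
x = np.zeros(K * d)
for k in np.flatnonzero(xi):
    x[groups[k]] = rng.standard_normal(d) # active-group components drawn from V   (37)
```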

Under this model, the prior on x and the measurements y are naturally described by a graphical model with linear mixing. Due to the independence assumptions, the posterior distribution of x given y factors as

p(x | y) = \frac{1}{Z(y)} \prod_{i=1}^{m} p(y_i | z_i) \prod_{j=1}^{n} P(x_j | ξ_{γ(j)}) \prod_{k=1}^{K} P(ξ_k),   (38)

where P(x_j | ξ_{γ(j)}) is the conditional distribution for the random variable in (37). The factor graph corresponding to this distribution is shown in Fig. 4.

Under this graphical model, Appendix C shows that the sum-product version of the Hybrid-GAMP algorithm in Algorithm 2 reduces to the simple procedure in Algorithm 3. A similar max-sum algorithm could also be derived. In lines 8 and 9, we have used the notation E(X|R; Q^r, ρ) and var(X|R; Q^r, ρ) to denote the expectation and variance of the scalar variable X with distribution

X ∼ \begin{cases} 0 & \text{with probability } 1 − ρ \\ V & \text{with probability } ρ, \end{cases}   (39)

where R is an AWGN-corrupted version of X,

R = X + W, \qquad W ∼ N(0, Q^r).   (40)
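A minimal sketch of the scalar denoiser used in lines 8 and 9 of Algorithm 3 is shown below under the illustrative assumption V ∼ N(0, σ_x^2) in (39); with this choice the posterior of X given R is a Bernoulli–Gaussian mixture and the conditional mean and variance have the closed form coded here.

```python
import numpy as np

# E(X|R; Qr, rho) and var(X|R; Qr, rho) for the model (39)-(40), assuming
# V ~ N(0, sigma_x2). Used as an illustration of lines 8-9 of Algorithm 3.
def denoise(r, Qr, rho, sigma_x2=1.0):
    def gauss(r, v):                                  # N(r; 0, v)
        return np.exp(-0.5 * r**2 / v) / np.sqrt(2 * np.pi * v)
    p_act = rho * gauss(r, sigma_x2 + Qr)             # evidence if X is active
    p_off = (1 - rho) * gauss(r, Qr)                  # evidence if X = 0
    pi = p_act / (p_act + p_off)                      # posterior activity probability
    m1 = sigma_x2 / (sigma_x2 + Qr) * r               # E[X | R = r, active]
    v1 = sigma_x2 * Qr / (sigma_x2 + Qr)              # var[X | R = r, active]
    xhat = pi * m1                                    # E(X | R = r; Qr, rho)
    Qx = pi * (v1 + m1**2) - xhat**2                  # var(X | R = r; Qr, rho)
    return xhat, Qx
```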

The algorithm can be interpreted as the GAMP procedure in [18] run in parallel with updates of the sparsity levels. Specifically, each iteration t of the main repeat-until loop has two stages. The first half of the iteration, labeled as the "basic GAMP update", is identical to the standard updates from the basic GAMP algorithm [18], treating the components x_j as independent with sparsity level ρ_j(t). The second half of the iteration, labeled as the "sparsity update", updates the sparsity levels ρ_j(t) based on the estimates from the basic GAMP half of the iteration.


Algorithm 3 Sum-product Hybrid-GAMP for group sparsity
 1: {Initialization}
 2: t ← 0
 3: Q^r_j(t−1) ← ∞
 4: LLR_{j←k}(t−1) ← log(ρ/(1 − ρ))
 5: ρ_j(t) ← 1 − \prod_{k ∈ γ(j)} 1/(1 + exp LLR_{j←k}(t−1))
 6: repeat
 7:   {Basic GAMP update}
 8:   \hat{x}_j(t) ← E(X | R = \hat{r}_j(t−1); Q^r_j(t−1), ρ_j(t))
 9:   Q^x_j(t) ← var(X | R = \hat{r}_j(t−1); Q^r_j(t−1), ρ_j(t))
10:   \hat{z}_i(t) ← \sum_j A_{ij} \hat{x}_j(t)
11:   Q^p_i(t) ← \sum_j |A_{ij}|^2 Q^x_j(t)
12:   \hat{p}_i(t) ← \hat{z}_i(t) − Q^p_i(t) \hat{s}_i(t−1)
13:   \hat{z}^0_i(t) ← E(z_i | \hat{p}_i(t), Q^p_i(t))
14:   Q^z_i(t) ← var(z_i | \hat{p}_i(t), Q^p_i(t))
15:   \hat{s}_i(t) ← (\hat{z}^0_i(t) − \hat{p}_i(t)) / Q^p_i(t)
16:   Q^s_i(t) ← Q^{-p}_i(t) (1 − Q^z_i(t)/Q^p_i(t))
17:   Q^{-r}_j(t) ← \sum_i |A_{ij}|^2 Q^s_i(t)
18:   \hat{r}_j(t) ← \hat{x}_j(t) + Q^r_j(t) \sum_i A^*_{ij} \hat{s}_i(t)
19:   {Sparsity level update}
20:   ρ_{j→k}(t) ← 1 − \prod_{i ∈ γ(j), i ≠ k} 1/(1 + exp LLR_{j←i}(t−1))
21:   Compute LLR_{j→k}(t) from (41)
22:   LLR_{j←k}(t) ← log(ρ/(1 − ρ)) + \sum_{i ∈ G_k, i ≠ j} LLR_{i→k}(t)
23:   ρ_j(t+1) ← 1 − \prod_{k ∈ γ(j)} 1/(1 + exp LLR_{j←k}(t))
24:   t ← t + 1
25: until Terminate


The sparsity update half of the iteration in Algorithm 3 also has a simple interpretation. The quantities ρ_j(t) and ρ_{j→k}(t) can be interpreted, respectively, as estimates for the probabilities

ρ_j = Pr(ξ_k = 1 for some k ∈ γ(j) | y),
ρ_{j→k} = Pr(ξ_i = 1 for some i ∈ γ(j), i ≠ k | y).

That is, ρ_j(t) is an estimate for the probability that the component x_j belongs to at least one active group, and ρ_{j→k}(t) is the estimate for the probability that it belongs to an active group other than G_k. Similarly, the quantities LLR_{j→k}(t) and LLR_{j←k}(t) are estimates for the log likelihood ratios

LLR_k = \log \frac{P(ξ_k = 1 | y)}{P(ξ_k = 0 | y)}.

Most of the updates in the sparsity update half of the iteration are the natural conversions from the LLR values to estimates for ρ_j and ρ_{j→k}. In line 21, the LLR message is computed by

LLR_{j→k}(t) = \log\left( \frac{p_R(\hat{r}_j(t); Q^r_j(t), ρ = 1)}{p_R(\hat{r}_j(t); Q^r_j(t), ρ = ρ_{j→k}(t))} \right),   (41)

where p_R(r; Q^r, ρ) is the probability distribution for the scalar random variable R in (40) where X has the distribution (39). The message (41) is the ratio of the likelihood of the output r given that x_j definitely belongs to an active group to the likelihood given that x_j belongs to an active group other than the group G_k.
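A sketch of this LLR message is given below, again under the illustrative assumption V ∼ N(0, σ_x^2), for which p_R(r; Q^r, ρ) is simply a two-component Gaussian mixture.

```python
import numpy as np

# LLR message (41) under the assumption V ~ N(0, sigma_x2), so that
# p_R(r; Qr, rho) is the Gaussian-mixture density of R = X + W in (40).
def p_R(r, Qr, rho, sigma_x2=1.0):
    def gauss(r, v):
        return np.exp(-0.5 * r**2 / v) / np.sqrt(2 * np.pi * v)
    return (1 - rho) * gauss(r, Qr) + rho * gauss(r, Qr + sigma_x2)

def llr_message(r, Qr, rho_jk, sigma_x2=1.0):
    # likelihood with x_j surely active vs. likelihood with activity prob. rho_jk
    return np.log(p_R(r, Qr, 1.0, sigma_x2)) - np.log(p_R(r, Qr, rho_jk, sigma_x2))
```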

In this way, the Hybrid-GAMP procedure provides an intuitive and simple method for extending the basic GAMP algorithm of [18] to structured sparsity.

The Hybrid-GAMP algorithm for group sparsity is also extremely general. The algorithm can apply to arbitrary priors and output channels. In particular, the algorithm can incorporate logistic outputs that are often used for group sparse classification problems [28]–[30]. Also, the method can handle arbitrary, even overlapping, groups. In contrast, the extensions of other iterative algorithms to the case of overlapping groups sometimes require approximations. See, for example, [31]. In fact, the methodology is quite general and can likely be applied to general structured sparsity, including possibly the graphical-model-based sparse structures in image processing considered in [32].

B. Computational Complexity

In addition to its generality, the Hybrid-GAMP procedure is among the most computationally efficient. To illustrate this point, consider the special case when there are K non-overlapping groups of d elements each. In this case, the total vector dimension for x is n = Kd. We consider the non-overlapping case since there are many algorithms that apply to this case that we can compare against. For non-overlapping uniform groups, Table I compares the computational cost of the Hybrid-GAMP algorithm to other methods.

The dominant computation in each iteration of the Hybrid-GAMP algorithm, Algorithm 3, is simply the matrix multiplications by A and A^* and by the componentwise squares of A and A^*. These operations all have total cost O(mn) = O(mdK). Note that these multiplications could be cheaper if the matrix has any particular structure (e.g., sparse, Fourier, etc.). The other per-iteration computations are the m scalar estimates at the output (lines 13 and 14), the n scalar estimates at the input (lines 8 and 9), and the updates of the LLRs. All of these computations cost less than the O(mn) of the matrix multiplies.

For the case of non-overlapping groups, the Hybrid-GAMP algorithm could also be implemented using vector-valued components. Specifically, the vector x can be regarded as a block vector with K vector components, each of dimension d. The general Hybrid-GAMP algorithm, Algorithm 2, can be applied on the vector-valued components. To contrast this with Algorithm 3, we will call Algorithm 3 Hybrid-GAMP with scalar components, and call the vector-valued case Hybrid-GAMP with vector components.

The cost is slightly higher for Hybrid-GAMP with vector components. In this case, there are no non-trivial strong edges since the block components are independent. However, in the update (26c), each A_{ij} is 1 × d and Q^x_j(t) is d × d. Thus, the computation (26c) requires mK computations of d^2 cost each for a total cost of O(mKd^2) = O(mnd), which is the dominant cost. Of course, there may be a benefit in performance for Hybrid-GAMP with vector components, since it maintains the complete correlation matrix of all the components in each group. We do not investigate this possible performance benefit in this paper.


Method                                     Complexity
Group-OMP [34]                             O(ρmn^2)
Group-Lasso [26], [27], [35]               O(mn) per iteration
Relaxed BP with vector components [33]     O(mn^2) per iteration
Hybrid-GAMP with vector components         O(mnd) per iteration
Hybrid-GAMP with scalar components         O(mn) per iteration

TABLE I. Complexity comparison of different algorithms for group sparsity estimation of a sparse vector with K groups, each group of dimension d. The number of measurements is m and the sparsity ratio is ρ.


Also shown in Table I is the cost of a relaxed BP method of [33], which also uses approximate message passing similar to Hybrid-GAMP with vector components. That method, however, performs the same computations as Hybrid-GAMP on each of the mK graph edges as opposed to the m + K graph vertices. It can be verified that the resulting cost has an O(mK^2 d^2) = O(mn^2) term.


These message passing algorithms can be compared against widely-used group LASSO methods [26], [27], which estimate x by solving some variant of a regularized least-squares problem of the form

\hat{x} := \arg\min_{x} \frac{1}{2} ‖y − Ax‖^2 + γ \sum_{j=1}^{n} ‖x_j‖_2,   (42)

for some regularization parameter γ > 0. The problem (42) is convex and can be solved via a number of methods including [35]–[37], the fastest of which is the SpaRSA algorithm of [35]. Interestingly, this algorithm is similar to the GAMP method in that it is an iterative procedure, where in each iteration there is a linear update followed by a componentwise scalar minimization. Like the GAMP method, the bulk of the cost is the O(mn) operations per iteration for the linear transform. An alternative approach for group sparse estimation is group orthogonal matching pursuit (Group-OMP) of [30], [34], a greedy algorithm that detects one group at a time. Each round of detection requires K correlations of cost md^2. If there are on average ρK nonzero groups, the total complexity will be O(ρK^2 m d^2) = O(ρmn^2). From the complexity estimates summarized in Table I it can be seen that GAMP, despite its generality, is computationally as simple (per iteration) as some of the most efficient algorithms specifically designed for the group sparsity problem.
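For reference, a minimal sketch of evaluating the group-LASSO objective in (42), with the penalty written as a sum of group ℓ2 norms over an illustrative non-overlapping group layout:

```python
import numpy as np

# Objective of the regularized least-squares problem (42); the group layout
# and the value of gamma are illustrative.
def group_lasso_objective(x, A, y, groups, gamma):
    data_fit = 0.5 * np.linalg.norm(y - A @ x) ** 2
    penalty = sum(np.linalg.norm(x[g]) for g in groups)   # sum of group l2 norms
    return data_fit + gamma * penalty
```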

Of course, a complete comparison requires that we consider the number of iterations, not just the computation per iteration. This comparison requires further study beyond the scope of this paper. However, it is possible that the Hybrid-GAMP procedure will be favorable in this regard. Our simulations below show good convergence after only 10–20 iterations. Moreover, in the case of independent (i.e., non-group) sparsity, the number of iterations for AMP algorithms is typically small and often much less than for other iterative methods. For example, the paper [16] shows excellent convergence in 10–20 iterations, which is dramatically faster than the iterative soft-thresholding method of [38].

Fig. 5. Comparison of performances of various estimation algorithms for group sparsity with n = 100 groups of dimension d = 4 with a sparsity fraction of ρ = 0.1. (The plot shows normalized MSE in dB versus the number of measurements m for LMMSE, group LASSO, group OMP, and Hybrid-GAMP.)


C. Numerical Simulation

Fig. 5 shows a simple simulation comparison of the mean squared error (MSE) of the Hybrid-GAMP method (Algorithm 3) along with group OMP, group LASSO, and a simple linear minimum MSE estimator. The simulation used a vector x with n = 100 groups of size d = 4 and a sparsity fraction of ρ = 0.1. The matrix was i.i.d. Gaussian and the observations were corrupted with AWGN at an SNR of 20 dB. The number of measurements m was varied from 50 to 200, and the plot shows the MSE for each of the methods. The Hybrid-GAMP method was run with 20 iterations. For group LASSO, at each value of m, the algorithm was simulated with several values of the regularization parameter γ in (42) and the plot shows the minimum MSE. For Group-OMP, the algorithm was run with the true value of the number of nonzero coefficients. It can be seen that the GAMP method is consistently as good as or better than both other methods. All code for the simulations can be found on the open-source SourceForge website [39].

We conclude that, even in the special case of group sparsity with AWGN measurements, the GAMP method is at least comparable in performance and computational complexity to the most competitive algorithms. On top of this, GAMP offers a much more general framework that can include much richer modeling of both the output and input.

VII. CONCLUSIONS

A general model for optimization and statistical inference based on graphical models with linear mixing was presented. The linear mixing components of the graphical model account for interactions through aggregates of large numbers of small, linearizable perturbations. Gaussian and second-order approximations are shown to greatly simplify the implementation of loopy BP for these interactions, and the Hybrid-GAMP framework presented here enables these approximations to be incorporated in a systematic manner in general graphical models. Simulations were presented for group sparsity, where the Hybrid-GAMP method has equal or superior performance to existing methods. However, the generality of the method will enable GAMP to be applied to much more complex models where few algorithms are available. In addition to experimenting with such models, future work will focus on establishing rigorous theoretical analyses along the lines of [9], [18].

APPENDIX A
HEURISTIC DERIVATION OF MAX-SUM HYBRID-GAMP

A. Preliminary Lemma

Before deriving the Hybrid-GAMP approximation for the max-sum algorithm, we need the following result. Let H(w, v) be a real-valued function of vectors w and v of the form

H(w, v) = H_0(w) − \frac{1}{2} ‖w − v‖^2_{Q^v}   (43)

for some positive definite matrix Q^v. For each v, let

\hat{w}(v) := \arg\max_{w} H(w, v),   (44a)

G(v) := H(\hat{w}(v), v) = \max_{w} H(w, v).   (44b)

Lemma 1: Assume the maximization in (44) exists and is unique and twice differentiable. Then,

\frac{∂}{∂v} G(v) = Q^{-v} (\hat{w}(v) − v),   (45a)

\frac{∂\hat{w}}{∂v} = −D^{-1} Q^{-v},   (45b)

\frac{∂^2}{∂v^2} G(v) = −Q^{-v} − Q^{-v} D^{-1} Q^{-v},   (45c)

where

D = \left. \frac{∂^2 H(w, v)}{∂w^2} \right|_{w = \hat{w}(v)}.

Proof: Since w = \hat{w}(v) is a maximizer of H(w, v),

\frac{∂H(\hat{w}(v), v)}{∂w} = 0.   (46)

Therefore, (45a) follows from

\frac{∂G(v)}{∂v} = \frac{∂H(\hat{w}(v), v)}{∂w} \frac{∂\hat{w}(v)}{∂v} + \frac{∂H(\hat{w}(v), v)}{∂v} = \frac{∂H(\hat{w}(v), v)}{∂v} = Q^{-v} (\hat{w}(v) − v),

where the last step is a result of the form of H(·) in (43). The form of H(·) in (43) also shows that for all w and v,

\frac{∂^2 H(w, v)}{∂w ∂v} = Q^{-v}.

Taking the derivative of (46),

\frac{∂^2 H(w, v)}{∂w ∂v} + \frac{∂^2 H(w, v)}{∂w^2} \frac{∂\hat{w}(v)}{∂v} = 0,

which implies that

\frac{∂\hat{w}(v)}{∂v} = −D^{-1} Q^{-v},

which proves (45b). Finally, taking the second derivative of (45a) along with (45b) shows (45c).
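As a sanity check, the identities (45a)–(45b) can be verified numerically for a concave quadratic H_0, for which the maximizer \hat{w}(v) is available in closed form; the finite-difference comparison below is purely illustrative and the matrices P, Q^v and vector b are arbitrary choices.

```python
import numpy as np

# Numerical check of Lemma 1 for H0(w) = -0.5 w'Pw + b'w with P > 0, so that
# H(w, v) = H0(w) - 0.5 (w - v)' Qv^{-1} (w - v) is maximized in closed form.
rng = np.random.default_rng(0)
k = 4
M = rng.standard_normal((k, k)); P = M @ M.T + np.eye(k)    # P positive definite
M = rng.standard_normal((k, k)); Qv = M @ M.T + np.eye(k)   # Qv positive definite
b = rng.standard_normal(k)
Qvi = np.linalg.inv(Qv)

def w_hat(v):                        # maximizer of H(., v), cf. (44a)
    return np.linalg.solve(P + Qvi, b + Qvi @ v)

def G(v):                            # value at the maximizer, cf. (44b)
    w = w_hat(v)
    return -0.5 * w @ P @ w + b @ w - 0.5 * (w - v) @ Qvi @ (w - v)

v, h = rng.standard_normal(k), 1e-6
grad_fd = np.array([(G(v + h * e) - G(v - h * e)) / (2 * h) for e in np.eye(k)])
assert np.allclose(grad_fd, Qvi @ (w_hat(v) - v), atol=1e-5)            # (45a)

D = -(P + Qvi)                       # Hessian of H in w at the maximizer
Jw_fd = np.column_stack([(w_hat(v + h * e) - w_hat(v - h * e)) / (2 * h) for e in np.eye(k)])
assert np.allclose(Jw_fd, -np.linalg.inv(D) @ Qvi, atol=1e-5)           # (45b)
```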

B. Hybrid-GAMP Approximation for Max-Sum

First consider the factor node update (14) and partition the objective function H_{i→j}(·) in (13) as

H_{i→j}(t, x_{∂(i)}, z_i) = H^{strong}_{i→j}(t, x_{α(i)}, z_i) + H^{weak}_{i→j}(t, x_{β(i)}),   (47)

where

H^{strong}_{i→j}(t, x_{α(i)}, z_i) := f_i(x_{α(i)}, z_i) + \sum_{r ∈ α(i), r ≠ j} Δ_{i←r}(t, x_r),   (48a)

H^{weak}_{i→j}(t, x_{β(i)}) := \sum_{r ∈ β(i), r ≠ j} Δ_{i←r}(t, x_r).   (48b)

That is, we have separated the terms in the objective function H_{i→j}(·) between the strong and weak edges. We can also partition the maximization (14) as

Δ_{i→j}(t, x_j) = \max_{z_i} \left[ Δ^{strong}_{i→j}(t, x_j, z_i) + Δ^{weak}_{i→j}(t, x_j, z_i) \right],   (49)

where

Δ^{strong}_{i→j}(t, x_j, z_i) := \max_{x_{α(i)\setminus j}} H^{strong}_{i→j}(t, x_{α(i)}, z_i),   (50a)

Δ^{weak}_{i→j}(t, x_j, z_i) := \max_{x_{β(i)\setminus j},\ z_i = A_i x} H^{weak}_{i→j}(t, x_{β(i)}),   (50b)

with the maximization in (50a) being over all x_r with r ∈ α(i) \ j, and the maximization in (50b) over all x_r with r ∈ β(i) \ j subject to z_i = A_i x. The partitioning (49) is valid since the strong and weak edges are distinct. This ensures that for all r ∈ ∂(i), either r ∈ α(i) or r ∈ β(i), but not both.

The Hybrid-GAMP approximation applies to the weak term (50b). For any j and all weak edges (i, j), define:

\hat{x}_j(t) := \arg\max_{x_j} Δ_j(t, x_j),   (51a)

\hat{x}_{i←j}(t) := \arg\max_{x_j} Δ_{i←j}(t, x_j),   (51b)

Q^{-x}_j(t) := − \left. \frac{∂^2}{∂x_j^2} Δ_j(t, x_j) \right|_{x_j = \hat{x}_j(t)},   (51c)

Q^{-x}_{i←j}(t) := − \left. \frac{∂^2}{∂x_j^2} Δ_{i←j}(t, x_j) \right|_{x_j = \hat{x}_{i←j}(t)},   (51d)

which are the maxima and Hessians of the incoming weak messages. Since the assumption of the Hybrid-GAMP algorithm is that A_{ir} is small for all weak edges (i, r), the values of x_r in the maximization (50b) will be close to \hat{x}_{i←r}(t). So, for all weak edges (i, r), we can approximate each term Δ_{i←r}(t, x_r) in (48b) with the second-order approximation

Δ_{i←r}(t, x_r) ≈ Δ_{i←r}(t, \hat{x}_{i←r}(t)) − \frac{1}{2} ‖x_r − \hat{x}_{i←r}(t)‖^2_{Q^x_r(t)},   (52)

where we have additionally made the approximation Q^x_{i←r}(t) ≈ Q^x_r(t) for all i. Substituting (52) into (48b), the maximization (50b) reduces to

Δ^{weak}_{i→j}(t, x_j, z_i) ≈ const − \min_{x_{β(i)\setminus j} : z_i = A_i x} \frac{1}{2} \sum_{r ∈ β(i), r ≠ j} ‖x_r − \hat{x}_{i←r}(t)‖^2_{Q^x_r(t)},   (53)

where the constant term does not depend on x_j or z_i.

To proceed, we need to consider two cases separately: when j ∈ β(i) and when j ∉ β(i). First consider the case when j ∈ β(i). That is, (i, j) is a weak edge. In this case, a standard least squares calculation shows that (53) reduces to

Δ^{weak}_{i→j}(t, x_j, z_i) ≈ const − \frac{1}{2} ‖z_i − A_{ij} x_j − \hat{p}_{i→j}(t)‖^2_{Q^p_{i→j}(t)},   (54)

where

\hat{p}_{i→j}(t) = \sum_{r ∈ β(i), r ≠ j} A_{ir} \hat{x}_{i←r}(t),   (55a)

Q^p_{i→j}(t) = \sum_{r ∈ β(i), r ≠ j} A_{ir} Q^x_r(t) A^*_{ir}.   (55b)

Also, when j ∈ β(i), the assumption that α(i) and β(i) are disjoint implies that j ∉ α(i). In this case, Δ^{strong}_{i→j}(t, x_j, z_i) in (50a) with the objective function (48a) will not depend on x_j, so we can write

Δ^{strong}_{i→j}(t, x_j, z_i) = Δ^{strong}_i(t, z_i) := \max_{x} \left[ f_i(x_{α(i)}, z_i) + \sum_{r ∈ α(i)} Δ_{i←r}(t, x_r) \right],   (56)

where the maximization is over all x_r for r ∈ α(i). Combining (49), (54) and (56), we can write that, for all weak edges (i, j),

Δ_{i→j}(t, x_j) ≈ G_i(t, \hat{p}_{i→j}(t) + A_{ij} x_j),   (57)

where

G_i(t, p_i) := \max_{x_{α(i)}, z_i} H^z_i(t, x_{α(i)}, z_i, p_i, Q^p_i(t))   (58)

and H^z_i(·) is defined in (31a). Now, define

\hat{p}_i(t) = \sum_{r ∈ β(i)} A_{ir} \hat{x}_{i←r}(t),   (59a)

Q^p_i(t) = \sum_{r ∈ β(i)} A_{ir} Q^x_r(t) A^*_{ir},   (59b)

so that the expressions in (55) can be re-written as

\hat{p}_{i→j}(t) = \hat{p}_i(t) − A_{ij} \hat{x}_{i←j}(t),   (60a)

Q^p_{i→j}(t) = Q^p_i(t) − A_{ij} Q^x_j(t) A^*_{ij}.   (60b)

Using (60), neglecting terms of order O(‖A_{ij}‖^2) and taking the approximation that \hat{x}_{i←j}(t) ≈ \hat{x}_j(t), (57) can be further approximated as

Δ_{i→j}(t, x_j) ≈ G_i(t, \hat{p}_i(t) + A_{ij}(x_j − \hat{x}_j(t))).   (61)

Now let

\hat{s}_i(t) = \frac{∂}{∂p} G_i(t, \hat{p}_i(t)),   (62a)

Q^{-s}_i(t) = − \frac{∂^2}{∂p^2} G_i(t, \hat{p}_i(t)).   (62b)

Based on the definition of G_i(·) in (58) with H^z_i(·) defined in (31a), one can apply Lemma 1 to show that (62) agrees with (34). Also applying (62), we can take a second-order approximation of (61) as

Δ_{i→j}(t, x_j) ≈ const + \hat{s}_i(t)^* A_{ij}(x_j − \hat{x}_j(t)) − \frac{1}{2} ‖A_{ij}(x_j − \hat{x}_j(t))‖^2_{Q^s_i(t)}
   = const + \left[ A^*_{ij} \hat{s}_i(t) + A^*_{ij} Q^s_i(t) A_{ij} \hat{x}_j(t) \right]^* x_j − \frac{1}{2} x_j^* A^*_{ij} Q^s_i(t) A_{ij} x_j   (63)

for all weak edges (i, j).

Next consider the case when j ∉ β(i) so that (i, j) is a

strong edge. In this case, Δ^{weak}_{i→j}(t, x_j, z_i) in (53) does not depend on x_j, so we can write

Δ^{weak}_{i→j}(t, x_j, z_i) ≈ const + Δ^{weak}_i(t, z_i),   (64)

where

Δ^{weak}_i(t, z_i) := \max_{x : z_i = A_i x} H^{weak}_{i→j}(t, x_{β(i)}),   (65)

with the maximization being over x such that z_i = A_i x. Using a similar least-squares calculation as above, Δ^{weak}_i(t, z_i) is given by

Δ^{weak}_i(t, z_i) := − \frac{1}{2} ‖z_i − \hat{p}_i(t)‖^2_{Q^p_i(t)},   (66)

and \hat{p}_i(t) and Q^p_i(t) are defined in (59). Combining (49), (64) and (66), we can write that, for all strong edges (i, j),

Δ_{i→j}(t, x_j) ≈ const + \max_{z_i} \left[ Δ^{strong}_{i→j}(t, x_j, z_i) − \frac{1}{2} ‖z_i − \hat{p}_i(t)‖^2_{Q^p_i(t)} \right].   (67)

From (48a) and (50a), we see that (67) agrees with the factor node update (28) for the strong edges.

We now turn to the variable node update (17), which we partition as

Δ_{i←j}(t, x_j) = Δ^{weak}_{i←j}(t, x_j) + Δ^{strong}_{i←j}(t, x_j),   (68)

where

Δ^{strong}_{i←j}(t+1, x_j) = \sum_{ℓ ≠ i :\ j ∈ α(ℓ)} Δ_{ℓ→j}(t, x_j),   (69a)

Δ^{weak}_{i←j}(t+1, x_j) = \sum_{ℓ ≠ i :\ j ∈ β(ℓ)} Δ_{ℓ→j}(t, x_j).   (69b)

Substituting the approximation (63) into the summation (69b),

Δ^{weak}_{i←j}(t+1, x_j) ≈ − \frac{1}{2} ‖\hat{r}_{i←j}(t) − x_j‖^2_{Q^r_{i←j}(t)},   (70)


where

Q^{-r}_{i←j}(t) = \sum_{ℓ ≠ i} A^*_{ℓj} Q^s_ℓ(t) A_{ℓj},   (71a)

\hat{r}_{i←j}(t) = Q^r_{i←j}(t) \sum_{ℓ ≠ i} \left[ A^*_{ℓj} \hat{s}_ℓ(t) + A^*_{ℓj} Q^s_ℓ(t) A_{ℓj} \hat{x}_j(t) \right] = \hat{x}_j(t) + Q^r_{i←j}(t) \sum_{ℓ ≠ i} A^*_{ℓj} \hat{s}_ℓ(t).   (71b)

We now again consider two cases: when (i, j) is a weak edge and when it is a strong edge. First consider the weak edge case. In this case, j ∉ α(i), so Δ^{strong}_{i←j}(t+1, x_j) in (69a) does not depend on i. Combining (68) and (70), we see that

Δ_{i←j}(t+1, x_j) ≈ H^x_j(t, x_j, \hat{r}_{i←j}(t), Q^r_{i←j}(t)),   (72)

where H^x_j(·) is defined in (23). Also, comparing (35) with (71), we have that

Q^{-r}_{i←j}(t) ≈ Q^{-r}_j(t),   (73a)

\hat{r}_{i←j}(t) ≈ \hat{r}_j(t) − Q^r_j(t) A^*_{ij} \hat{s}_i(t).   (73b)

Substituting (73) into (72) we get

Δ_{i←j}(t+1, x_j) ≈ H^x_j(t, x_j, \hat{r}_j(t) − Q^r_j(t) A^*_{ij} \hat{s}_i(t), Q^r_j(t)).   (74)

A similar set of calculations shows that Δ_j(t+1, x_j) in (18) can be approximated as

Δ_j(t+1, x_j) ≈ H^x_j(t, x_j, \hat{r}_j(t), Q^r_j(t)).   (75)

Thus, the definitions of \hat{x}_j(t+1) and Q^x_j(t+1) in (51) agree with (24). Also, if we let

Γ_j(t, r_j) := \arg\max_{x_j} H^x_j(t, x_j, r_j, Q^r_j(t)),

it follows from (51), (74) and (75) that

\hat{x}_j(t+1) ≈ Γ_j(t, \hat{r}_j(t)),

\hat{x}_{i←j}(t+1) ≈ Γ_j(t, \hat{r}_j(t) − Q^r_j(t) A^*_{ij} \hat{s}_i(t)) ≈ \hat{x}_j(t+1) − \frac{∂Γ_j(t, \hat{r}_j(t))}{∂r_j} Q^r_j(t) A^*_{ij} \hat{s}_i(t).   (76)

It can be shown from Lemma 1 that

\frac{∂Γ_j(t, \hat{r}_j(t))}{∂r_j} = − \left[ \frac{∂^2}{∂x_j^2} H^x_j(t, \hat{x}_j(t+1), \hat{r}_j(t), Q^r_j(t)) \right]^{-1} Q^{-r}_j(t) ≈ Q^x_j(t) Q^{-r}_j(t),

and hence, from (76),

\hat{x}_{i←j}(t+1) ≈ \hat{x}_j(t+1) − Q^x_j(t+1) A^*_{ij} \hat{s}_i(t).   (77)

Substituting (77) into (59) we obtain

\hat{p}_i(t) ≈ \sum_{j ∈ β(i)} A_{ij} \hat{x}_j(t) − \sum_{j ∈ β(i)} A_{ij} Q^x_j(t) A^*_{ij} \hat{s}_i(t−1) ≈ \hat{z}_i(t) − Q^p_i(t) \hat{s}_i(t−1),

which agrees with the definition in (26).

APPENDIX B
HEURISTIC DERIVATION OF SUM-PRODUCT HYBRID-GAMP

A. Preliminary Lemma

The derivation of the approximate sum-product algorithm is similar to the derivation in Appendix A for the max-sum algorithm. In particular, the analogue to Lemma 1 is as follows:

Lemma 2: Suppose that W and V are random vectors with a conditional probability distribution function of the form

p_{W|V}(w | v) = \frac{1}{Z(v)} \exp\left[ u H(w, v) \right],

where H(w, v) is given in (43), u > 0 is some constant and Z(v) is a normalization constant (called the partition function). Then,

\frac{∂}{∂v} \bar{x}(v) = D Q^{-v},   (78a)

\frac{∂}{∂v} \log Z(v) = Q^{-v} (\bar{x}(v) − v),   (78b)

\frac{∂^2}{∂v^2} \log Z(v) = −Q^{-v} + Q^{-v} D Q^{-v},   (78c)

where

\bar{x}(v) = E[W | V = v], \qquad D = u\, var(W | V = v).

Proof: The relations are standard properties of exponential families [2].

B. Approximation for Hybrid-GAMP

The derivation of the Hybrid-GAMP algorithm for the sum-product algorithm is similar to the derivation of the max-sum algorithm in Appendix A, so we will just sketch the proof here. Similar to the max-sum derivation, we first partition the function H_{i→j}(·) in (13) as in (47). Then, the marginal distribution p_{i→j}(t, x_j) of the distribution p_{i→j}(t, x_{∂(i)}) in (16) can be re-written as

p_{i→j}(t, x_j) = \int p_{i→j}(t, x_{∂(i)})\, dx_{∂(i)\setminus j} ∝ \int ψ^{strong}_{i→j}(t, x_j, z_i)\, ψ^{weak}_{i→j}(t, x_j, z_i)\, dz_i,   (79)

where

ψ^{strong}_{i→j}(t, x_j, z_i) ∝ \int_{x_{α(i)\setminus j}} \exp\left[ u H^{strong}_{i→j}(t, x_{α(i)}, z_i) \right] dx_{α(i)\setminus j},   (80a)

ψ^{weak}_{i→j}(t, x_j, z_i) ∝ \int_{x_{β(i)\setminus j},\ z_i = A_i x} \exp\left[ u H^{weak}_{i→j}(t, x_{β(i)}) \right] dx_{β(i)\setminus j},   (80b)

where the integration in (80a) is over the variables x_r with r ∈ α(i), r ≠ j, and the integration in (80b) is over the variables x_r with r ∈ β(i), r ≠ j, subject to z_i = A_i x.

To approximate p_{i→j}(t, x_j) in (79), we separately consider the cases when (i, j) is a weak edge and when it is a strong edge. We begin with the weak edge case. That is, j ∈ β(i). Let

\hat{x}_j(t) := E[x_j ; Δ_j(t, ·)],   (81a)

\hat{x}_{i←j}(t) := E[x_j ; Δ_{i←j}(t, ·)],   (81b)

Q^x_j(t) := u\, var[x_j ; Δ_j(t, ·)],   (81c)

Q^x_{i←j}(t) := u\, var[x_j ; Δ_{i←j}(t, ·)],   (81d)

where we have used the notation E[g(x) ; Δ(·)] from (12). Now, using the expression for H^{weak}_{i→j}(t, x_{β(i)}) in (48b), it can be verified that ψ^{weak}_{i→j}(t, x_j, z_i) is equivalent to the probability distribution of the random variable

z_i = A_{ij} x_j + \sum_{r ∈ β(i), r ≠ j} A_{ir} x_r,   (82)

with the variables x_r being independent with probability distribution

p(x_r) ∝ \exp(u Δ_{i←r}(t, x_r)).

Moreover, \hat{x}_{i←j}(t) and Q^x_{i←j}(t)/u in (81) are precisely the mean and variance of the random variable x_j under this distribution. Therefore, if the summation in (82) is over a large number of terms, we can use the Central Limit Theorem to approximate the variable z_i in (82) as Gaussian, with distribution ψ^{weak}_{i→j}(t, x_j, z_i) given by

ψ^{weak}_{i→j}(t, x_j, z_i) ≈ N(A_{ij} x_j + \hat{p}_{i→j}(t), Q^p_{i→j}(t)/u),   (83)

where \hat{p}_{i→j}(t) and Q^p_{i→j}(t) are given in (55). Applying this Gaussian approximation to the probability distribution p_{i→j}(t, x_{∂(i)}, z_i) in (16), and then using the definitions in (48a) and (80a), we obtain the approximation for the message in (15),

Δ_{i→j}(t, x_j) ≈ G_i(t, A_{ij} x_j + \hat{p}_{i→j}(t), Q^p_{i→j}(t)),

where

G_i(t, p_i, Q^p_i) := \frac{1}{u} \log \int \exp\left[ u H^z_i(t, x_{α(i)}, z_i, p_i, Q^p_i) \right] dx_{α(i)}\, dz_i   (84)

and H^z_i(·) is given in (31a).

We now define \hat{p}_i(t) and Q^p_i(t) as in (59), and similar to (62) let

\hat{s}_i(t) = \frac{∂}{∂p} G_i(t, \hat{p}_i(t), Q^p_i(t)),   (85a)

Q^{-s}_i(t) = − \frac{∂^2}{∂p^2} G_i(t, \hat{p}_i(t), Q^p_i(t)).   (85b)

Using Lemma 2, one can show that the definitions in (85) agree with the updates (34), where \hat{z}^0_i(t) and Q^z_i(t) are the mean and covariance of the random variable z_i with the distribution (33). Using similar approximations as in the derivation of the max-sum algorithm, one can then obtain the quadratic approximation in (63) for Δ_{i→j}(t, x_j) for all weak edges (i, j).

Next consider the case when j ∉ β(i) so that (i, j) is a strong edge. In this case, ψ^{weak}_{i→j}(t, x_j, z_i) does not depend on x_j, and a similar calculation as above shows that

ψ^{weak}_{i→j}(t, x_j, z_i) ≈ ψ^{weak}_i(t, z_i) := N(\hat{p}_i(t), Q^p_i(t)/u),   (86)

where \hat{p}_i(t) and Q^p_i(t) are defined in (59). Substituting the Gaussian approximation (86) into (16), and then using the definitions in (48a) and (80a), one can show that the marginal distribution p_{i→j}(t, x_j) in (16) is equal to the marginal distribution of p_{i→j}(t, x_{α(i)}, z_i) in (30). Therefore, the message Δ_{i→j}(t, x_j) in (15) can be written as (29) for all strong edges (i, j).

We now turn to the variable update steps of the sum-product algorithm. Since this step is identical to that of the max-sum algorithm, one can follow the derivation in Appendix A to show that ∆_{i←j}(t+1, x_j) and ∆_j(t+1, x_j) are given by (74) and (75), respectively, and that \hat{r}_j(t) and Q^r_j(t) are given in (35). Also, the definitions of \hat{x}_j(t) and Q^x_j(t) in (81a) and (81c) are consistent with (25).

Finally, define
\[
\Gamma_j(t, \hat{r}_j) := E\bigl[x_j;\, H^x_j(t, \cdot, \hat{r}_j, Q^r_j(t-1))\bigr], \tag{87}
\]
where, again, we are using the notation (12) and H^x_j(·) is defined in (23). It follows from (74), (75) and (81) that

\[
\begin{aligned}
\hat{x}_j(t+1) &\approx \Gamma_j(t, \hat{r}_j(t)), \\
\hat{x}_{i\leftarrow j}(t+1) &\approx \Gamma_j\bigl(t, \hat{r}_j(t) - Q^r_j(t) A^*_{ij} \hat{s}_i(t)\bigr) \\
&\approx \hat{x}_j(t+1) - \frac{\partial \Gamma_j(t, \hat{r}_j(t))}{\partial \hat{r}_j}\, Q^r_j(t) A^*_{ij} \hat{s}_i(t). 
\end{aligned} \tag{88}
\]

From the definition (87), Lemma 2 shows that
\[
\frac{\partial \Gamma_j(t, \hat{r}_j(t))}{\partial \hat{r}_j} \approx Q^x_j(t)\, Q^{-r}_j(t), \tag{89}
\]
and hence, from (88),
\[
\hat{x}_{i\leftarrow j}(t+1) \approx \hat{x}_j(t+1) - Q^x_j(t+1) A^*_{ij} \hat{s}_i(t). \tag{90}
\]
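For concreteness, the practical consequence of (88)-(90) is that the per-edge means \hat{x}_{i←j}(t+1) do not require a separate evaluation of Γ_j for every weak edge; one evaluation at \hat{r}_j(t) plus a first-order correction suffices. The Python sketch below contrasts the two computations for a single variable node; all names are illustrative, and Γ_j is passed in as an abstract callable since its exact form depends on (23) and (87), which are not reproduced here.

```python
import numpy as np

def edge_means_first_order(Gamma_j, r_j, Qr_j, A_j, s, Qx_j_next):
    """Illustrate (88)-(90) for a single variable node x_j.

    Gamma_j   : callable r -> E[x_j] under the message distribution, as in (87)
    r_j, Qr_j : scalars r_hat_j(t) and Q^r_j(t)
    A_j       : array of coefficients A_{ij} over the weak factors i attached to x_j
    s         : array of the corresponding s_hat_i(t)
    Qx_j_next : Q^x_j(t+1), standing in for the derivative in (89) times Q^r_j(t)
    """
    x_next = Gamma_j(r_j)                           # x_hat_j(t+1), first line of (88)
    # Exact per-edge update: one evaluation of Gamma_j for every weak edge.
    x_edges_exact = np.array([Gamma_j(r_j - Qr_j * np.conj(a) * si)
                              for a, si in zip(A_j, s)])
    # First-order expansion (90): a single evaluation plus a correction term.
    x_edges_approx = x_next - Qx_j_next * np.conj(A_j) * s
    return x_next, x_edges_exact, x_edges_approx
```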

The proof now follows identically to the derivation of theHybrid-GAMP max-sum algorithm.

APPENDIX C
DERIVATION OF HYBRID-GAMP FOR GROUP SPARSITY

This Appendix provides a brief derivation of the steps in Algorithm 3 based on the general Hybrid-GAMP algorithm, Algorithm 2. In the description of the general Hybrid-GAMP algorithm, we used labels i and j for the factor and variable nodes. However, the group sparse estimation problem has a large number of indices. To avoid confusion, we adopt the following more explicit (albeit somewhat more cumbersome) labeling. The variable nodes will be labeled explicitly by x_j or ξ_k. For the factor nodes, we use the labels:
• a_i for the factors p(y_i|z_i);
• b_j for the factors P(x_j|ξ_{γ(j)}); and
• c_k for the factors P(ξ_k).

With this convention, for example, ∆_{b_j←ξ_k}(t, ξ_k) represents the message from the variable node ξ_k to the factor node b_j when j ∈ G_k.

Now, in the graphical model in Fig. 4, the strong edges are all the edges to the right of the variables x_j. That is, the strong edges are:
• Between the variables x_j and the factors b_j for all j;
• Between the variables ξ_k and the factors b_j for all j ∈ G_k; and
• Between the variables ξ_k and the factors c_k for all k.
The remaining edges, those between the variables x_j and the factor nodes a_i, are all weak.

With these definitions, we can easily derive the steps in Algorithm 3 from the general Algorithm 2. First, note that all the steps from lines 10 to 18 are simply the weak edge updates in Algorithm 2, specialized to the case of scalar variables.

To understand the role of the remaining lines, first consider the message along the strong edge from the factor node c_k to the variable ξ_k. The factor node c_k corresponds to the prior P(ξ_k) in (36). Since this factor is attached to only one variable node, the outgoing message in (29) for this edge reduces to
\[
\Delta_{c_k\to\xi_k}(t, \xi_k) = \log P(\xi_k) =
\begin{cases}
\log \rho & \text{if } \xi_k = 1, \\
\log(1-\rho) & \text{if } \xi_k = 0,
\end{cases} \tag{91}
\]

where the last step follows from (36).

Next consider the message along the strong edge from the variable ξ_k to the factor node b_j for some j ∈ G_k. Similar to the case of binary LDPC codes [23], since ξ_k = 0 or 1, it is convenient to work with log-likelihood ratios (LLRs). Given any strong edge between b_j and ξ_k, define the LLR
\[
\mathrm{LLR}_{j\to k}(t) := \Delta_{b_j\to\xi_k}(t, \xi_k = 1) - \Delta_{b_j\to\xi_k}(t, \xi_k = 0). \tag{92}
\]

The reverse LLR, LLR_{j←k}(t), is defined similarly.

Since the variable node ξ_k is not connected to any weak edges, the variable node output message in (21) reduces to
\[
\Delta_{b_j\leftarrow\xi_k}(t+1, \xi_k) = \Delta_{c_k\to\xi_k}(t, \xi_k) + \sum_{i \in G_k,\, i \neq j} \Delta_{b_i\to\xi_k}(t, \xi_k).
\]

Therefore the LLR in (92) is given by
\[
\begin{aligned}
\mathrm{LLR}_{j\leftarrow k}(t+1) &= \Delta_{c_k\to\xi_k}(t, 1) - \Delta_{c_k\to\xi_k}(t, 0) + \sum_{r \in G_k,\, r \neq j} \mathrm{LLR}_{r\to k}(t) \\
&= \log\!\Bigl(\frac{\rho}{1-\rho}\Bigr) + \sum_{r \in G_k,\, r \neq j} \mathrm{LLR}_{r\to k}(t),
\end{aligned} \tag{93}
\]
where the last step follows from (91).
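For concreteness, the extrinsic combining in (93) is the same leave-one-out rule used in LDPC decoding [23]. A minimal Python sketch, with illustrative variable names, is

```python
import numpy as np

def indicator_to_factor_llrs(rho, llr_in):
    """Update (93) at a group indicator xi_k.

    rho    : prior activation probability P(xi_k = 1) = rho from (36)
    llr_in : array of incoming LLRs LLR_{r->k}(t), one per factor b_r, r in G_k

    Returns an array whose entry for b_j is LLR_{j<-k}(t+1): the prior LLR
    plus the sum of all incoming LLRs except the one from b_j itself.
    """
    prior_llr = np.log(rho / (1.0 - rho))
    return prior_llr + (np.sum(llr_in) - llr_in)   # leave-one-out sums, as in (93)
```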

Next consider the message from b_j to x_j. Recall that the factor node b_j corresponds to the distribution P(x_j|ξ_{γ(j)}), defined by the variable x_j in (37). Also, this factor node has no weak edges. Hence, it can be verified that the message (29), applied to the edge from the factor node b_j to x_j, is given by

\[
\Delta_{b_j\to x_j}(t, x_j) = \log P_{b_j\to x_j}(t, x_j), \tag{94}
\]
where P_{b_j→x_j}(t, x_j) is the probability distribution function
\[
P_{b_j\to x_j}(t, x_j) = E\bigl[P(x_j|\xi_{\gamma(j)})\bigr], \tag{95}
\]
and the expectation is over independent variables ξ_k with
\[
P(\xi_k = 1) = 1 - P(\xi_k = 0) = \frac{1}{1 + \exp(-\mathrm{LLR}_{j\leftarrow k}(t))}. \tag{96}
\]

Using the fact that P(x_j|ξ_{γ(j)}) is the conditional distribution of the variable x_j in (37), the probability distribution P_{b_j→x_j}(t, x_j) in (95) can be written as
\[
P_{b_j\to x_j}(t, x_j) = P_X\bigl(x_j;\, \rho = \rho_j(t)\bigr), \tag{97}
\]
where P_X(x; ρ) is the distribution for the variable X in (39) and ρ_j(t) is the probability
\[
\rho_j(t) = \Pr\bigl(\xi_k = 0,\ \forall k \in \gamma(j)\bigr) = \prod_{k \in \gamma(j)} \frac{1}{1 + \exp(\mathrm{LLR}_{j\leftarrow k}(t))}. \tag{98}
\]
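The probability (98) is obtained from the incoming LLRs by the standard sigmoid mapping implied by (96). A one-function illustrative sketch, faithful to (98) as stated:

```python
import numpy as np

def rho_j(llr_in):
    """Evaluate (98): the product over k in gamma(j) of P(xi_k = 0),
    with P(xi_k = 0) = 1 / (1 + exp(LLR_{j<-k}(t))) as implied by (96)."""
    return np.prod(1.0 / (1.0 + np.exp(np.asarray(llr_in))))
```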

Now, the variable node x_j has only one strong edge, namely the connection to the factor node b_j. Therefore, the log probability in (22) reduces to
\[
\Delta_{x_j}(t+1, x_j) = \Delta_{b_j\to x_j}(t, x_j) - \frac{1}{2 Q^r_j(t)}\, |\hat{r}_j(t) - x_j|^2. \tag{99}
\]
Now, as described in equations (94) and (97), ∆_{b_j→x_j}(t, x_j) is the log of the probability distribution for the variable X in (39) with ρ = ρ_j(t). Hence ∆_{x_j}(t+1, x_j) in (99) must be the log posterior distribution for X with the measurement R = \hat{r}_j(t) in (40). Therefore, the expectations and variances in (25) agree with the expressions in lines 8 and 9 of Algorithm 3.
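These updates amount to the posterior mean and variance of a scalar X observed in Gaussian noise of variance Q^r_j(t). The Python sketch below assumes, purely for illustration, the Bernoulli-Gaussian form X = 0 with probability 1 − ρ and X ∼ N(0, var_x) otherwise; this is only one possible instance of the prior P_X(x; ρ) in (39), whose exact nonzero component is not reproduced here.

```python
import numpy as np

def scalar_posterior_moments(r, Qr, rho, var_x=1.0):
    """Posterior mean and variance of X given R = X + N(0, Qr), cf. (99) and (40),
    under the assumed Bernoulli-Gaussian prior
        X = 0 with prob. 1 - rho,   X ~ N(0, var_x) with prob. rho.
    """
    # Marginal likelihoods of r under the two mixture components
    like_off = np.exp(-r ** 2 / (2 * Qr)) / np.sqrt(2 * np.pi * Qr)
    like_on = np.exp(-r ** 2 / (2 * (Qr + var_x))) / np.sqrt(2 * np.pi * (Qr + var_x))
    # Posterior probability that X is nonzero
    pi_on = rho * like_on / (rho * like_on + (1 - rho) * like_off)
    # Moments of the nonzero (Gaussian) branch of the posterior
    mean_on = var_x / (var_x + Qr) * r
    var_on = var_x * Qr / (var_x + Qr)
    x_hat = pi_on * mean_on                                # posterior mean
    Qx = pi_on * (var_on + mean_on ** 2) - x_hat ** 2      # posterior variance
    return x_hat, Qx
```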

Finally, consider the message from the factor node b_j to a variable node ξ_k. The derivation for this message is similar to that of the message from b_j to x_j. Specifically, it can be verified that the factor node message (29), applied to the strong edge from b_j to ξ_k, is given by

\[
\Delta_{b_j\to\xi_k}(t, \xi_k) = \log P_{b_j\to\xi_k}(t, \xi_k), \tag{100}
\]
where P_{b_j→ξ_k}(t, ξ_k) is the probability mass function
\[
P_{b_j\to\xi_k}(t, \xi_k) = \int \exp\bigl(\Delta_{b_j\leftarrow x_j}(t-1, x_j)\bigr)\, E\bigl(P(x_j|\xi_{\gamma(j)}) \,\big|\, \xi_k\bigr)\, dx_j, \tag{101}
\]

where the expectation is over the independent variables ξ_i, i ∈ γ(j), i ≠ k, with probabilities of the form (96). To evaluate the expectation on the right-hand side of (101), consider the conditional expectation E(P(x_j|ξ_{γ(j)})|ξ_k). Since the distribution P(x_j|ξ_{γ(j)}) corresponds to the random variable x_j in (37),

\[
E\bigl(P(x_j|\xi_{\gamma(j)}) \,\big|\, \xi_k\bigr) =
\begin{cases}
P_X(x_j;\, \rho = 1) & \text{if } \xi_k = 1, \\
P_X(x_j;\, \rho = \rho_{j\to k}(t)) & \text{if } \xi_k = 0,
\end{cases} \tag{102}
\]

where P_X(x; ρ) is the probability distribution for the random variable X in (39) and
\[
\rho_{j\to k}(t) = 1 - \Pr\bigl(\xi_i = 0,\ \forall i \in \gamma(j),\ i \neq k\bigr)
= 1 - \prod_{i \in \gamma(j),\, i \neq k} \frac{1}{1 + \exp(\mathrm{LLR}_{j\leftarrow i}(t))}. \tag{103}
\]

Also, the edge from the variable node x_j to the factor node b_j is the only strong edge connected to x_j. Therefore, the variable node message (21), applied to that edge, reduces to
\[
\Delta_{b_j\leftarrow x_j}(t-1, x_j) = -\frac{1}{2 Q^r_j(t-1)}\, |x_j - \hat{r}_j(t-1)|^2. \tag{104}
\]

Substituting (102) and (104) into (101), we obtain
\[
P_{b_j\to\xi_k}(t, \xi_k) \propto
\begin{cases}
p_R\bigl(\hat{r}_j(t-1);\, Q^r_j(t-1),\, 1\bigr) & \text{if } \xi_k = 1, \\
p_R\bigl(\hat{r}_j(t-1);\, Q^r_j(t-1),\, \rho_{j\to k}(t)\bigr) & \text{if } \xi_k = 0,
\end{cases} \tag{105}
\]


where p_R(r; Q^r, ρ) is the probability distribution of the scalar random variable R in (40), with X distributed as in (39). The LLR corresponding to (105) is thus
\[
\begin{aligned}
\mathrm{LLR}_{j\to k}(t) &= \log P_{b_j\to\xi_k}(t, \xi_k = 1) - \log P_{b_j\to\xi_k}(t, \xi_k = 0) \\
&= \log p_R\bigl(\hat{r}_j(t-1);\, Q^r_j(t-1),\, \rho = 1\bigr) - \log p_R\bigl(\hat{r}_j(t-1);\, Q^r_j(t-1),\, \rho = \rho_{j\to k}(t)\bigr),
\end{aligned}
\]
which agrees with (41).
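The update (41) can thus be computed from two evaluations of the marginal likelihood p_R in (40). The sketch below uses the same assumed Bernoulli-Gaussian prior as in the previous listing, so both p_R and the resulting LLR are illustrative stand-ins rather than the paper's exact expressions.

```python
import numpy as np

def log_pR(r, Qr, rho, var_x=1.0):
    """log p_R(r; Qr, rho) for R = X + N(0, Qr) with the assumed
    Bernoulli-Gaussian X of activation probability rho (stands in for (40))."""
    like_off = np.exp(-r ** 2 / (2 * Qr)) / np.sqrt(2 * np.pi * Qr)
    like_on = np.exp(-r ** 2 / (2 * (Qr + var_x))) / np.sqrt(2 * np.pi * (Qr + var_x))
    return np.log(rho * like_on + (1 - rho) * like_off)

def factor_to_indicator_llr(r, Qr, llr_from_others, var_x=1.0):
    """Sketch of (103) and (105): LLR_{j->k}(t) from factor b_j to indicator xi_k.

    llr_from_others : incoming LLRs at b_j from the other indicators in gamma(j)\\{k}
    """
    # (103): rho_{j->k}(t) = 1 - Pr(all other indicators are zero)
    rho_jk = 1.0 - np.prod(1.0 / (1.0 + np.exp(np.asarray(llr_from_others))))
    # (105): difference of marginal log-likelihoods of r = r_hat_j(t-1)
    return log_pR(r, Qr, 1.0, var_x) - log_pR(r, Qr, rho_jk, var_x)
```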

REFERENCES

[1] B. J. Frey, Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998.
[2] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference, ser. Foundations and Trends in Machine Learning. Hanover, MA: NOW Publishers, 2008, vol. 1.
[3] J. Boutros and G. Caire, “Iterative multiuser joint decoding: Unified framework and asymptotic analysis,” IEEE Trans. Inform. Theory, vol. 48, no. 7, pp. 1772–1793, Jul. 2002.
[4] T. Tanaka and M. Okada, “Approximate belief propagation, density evolution, and neurodynamics for CDMA multiuser detection,” IEEE Trans. Inform. Theory, vol. 51, no. 2, pp. 700–706, Feb. 2005.
[5] D. Guo and C.-C. Wang, “Asymptotic mean-square optimality of belief propagation for sparse linear systems,” in Proc. IEEE Inform. Theory Workshop, Chengdu, China, Oct. 2006, pp. 194–198.
[6] ——, “Random sparse linear systems observed via arbitrary channels: A decoupling principle,” in Proc. IEEE Int. Symp. Inform. Theory, Nice, France, Jun. 2007, pp. 946–950.
[7] S. Rangan, “Estimation with random linear mixing, belief propagation and compressed sensing,” arXiv:1001.2228v1 [cs.IT], Jan. 2010.
[8] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
[9] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Trans. Inform. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.
[10] P. Schniter, “Turbo reconstruction of structured sparse signals,” in Proc. Conf. on Inform. Sci. & Sys., Princeton, NJ, Mar. 2010.
[11] J. Ziniel, L. C. Potter, and P. Schniter, “Tracking and smoothing of time-varying sparse signals via approximate belief propagation,” in Conf. Rec. Asilomar Conf. on Signals, Syst. & Computers, Pacific Grove, CA, Nov. 2010.
[12] S. Som, L. C. Potter, and P. Schniter, “Compressive imaging using approximate message passing and a Markov-tree prior,” in Conf. Rec. Asilomar Conf. on Signals, Syst. & Computers, Pacific Grove, CA, Nov. 2010.
[13] P. Schniter, “A message-passing receiver for BICM-OFDM over unknown clustered-sparse channels,” in Proc. IEEE Workshop Signal Process. Adv. Wireless Commun., San Francisco, CA, Jun. 2011.
[14] G. Caire, A. Tulino, and E. Biglieri, “Iterative multiuser joint detection and parameter estimation: a factor-graph approach,” in Proc. IEEE Inform. Theory Workshop, Cairns, Australia, Sep. 2001, pp. 36–38.
[15] S. Rangan and R. K. Madan, “Belief propagation methods for intercell interference coordination,” in Proc. IEEE Infocom, Shanghai, China, Apr. 2011.
[16] A. Montanari, “Graphical models concepts in compressed sensing,” arXiv:1011.4328v3 [cs.IT], Mar. 2011.
[17] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications. New York: Springer, 1998.
[18] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” arXiv:1010.5141v1 [cs.IT], Oct. 2010.
[19] A. Montanari and D. Tse, “Analysis of belief propagation for non-linear problems: The example of CDMA (or: How to prove Tanaka’s formula),” arXiv:cs/0602028v1 [cs.IT], Feb. 2006.
[20] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann Publ., 1988.
[21] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Understanding belief propagation and its generalizations,” in Exploring Artificial Intelligence in the New Millennium. San Francisco, CA: Morgan Kaufmann Publishers, 2003, pp. 239–269.
[22] T. J. Richardson and R. L. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 599–618, Feb. 2001.
[23] ——, Modern Coding Theory. Cambridge, UK: Cambridge University Press, 2009.
[24] D. M. Malioutov, J. K. Johnson, and A. S. Willsky, “Walk-sums and belief propagation in Gaussian graphical models,” J. Machine Learning Res., vol. 7, Dec. 2006.
[25] G. Elidan, I. McGraw, and D. Koller, “Residual belief propagation: Informed scheduling for asynchronous message passing,” in Proc. Conf. on Uncertainty in AI, Boston, MA, Jul. 2006.
[26] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. Royal Statist. Soc., vol. 68, pp. 49–67, 2006.
[27] P. Zhao, G. Rocha, and B. Yu, “The composite absolute penalties family for grouped and hierarchical variable selection,” Ann. Stat., vol. 37, no. 6, pp. 3468–3497, 2009.
[28] Y. Kim, J. Kim, and Y. Kim, “Blockwise sparse regression,” Statistica Sinica, vol. 16, pp. 375–390, 2006.
[29] L. Meier, S. van de Geer, and P. Buhlmann, “The group lasso for logistic regression,” J. Royal Statistical Society: Series B, vol. 70, no. 1, pp. 53–71, 2008.
[30] A. C. Lozano, G. Swirszcz, and N. Abe, “Group orthogonal matching pursuit for logistic regression,” J. Machine Learning Res., vol. 15, 2011.
[31] N. S. Rao, R. D. Nowak, S. J. Wright, and N. G. Kingsbury, “Convex approaches to model wavelet sparsity patterns,” arXiv:1104.4385 [cs.CV], Apr. 2011.
[32] V. Cevher, P. Indyk, L. Carin, and R. Baraniuk, “Sparse signal recovery and acquisition with graphical models,” IEEE Signal Process. Mag., vol. 27, no. 6, pp. 92–103, Nov. 2010.
[33] J. Kim, W. Chang, B. Jung, D. Baron, and J. C. Ye, “Belief propagation for joint sparse recovery,” arXiv:1102.3289, Feb. 2011.
[34] A. C. Lozano, G. Swirszcz, and N. Abe, “Group orthogonal matching pursuit for variable selection and prediction,” in Proc. Neural Information Process. Syst., Vancouver, Canada, Dec. 2008.
[35] S. J. Wright, R. D. Nowak, and M. Figueiredo, “Sparse reconstruction by separable approximation,” IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2479–2493, Jul. 2009.
[36] M. Figueiredo, S. J. Wright, and R. D. Nowak, “Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems,” IEEE J. Sel. Topics Signal Process., vol. 1, no. 4, pp. 586–597, Dec. 2007.
[37] S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, “An interior point method for large-scale ℓ1-regularized least squares,” IEEE J. Sel. Topics Signal Process., vol. 1, no. 4, pp. 606–617, Dec. 2007.
[38] I. Daubechies, M. Defrise, and C. D. Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, Nov. 2004.
[39] S. Rangan et al., “Generalized approximate message passing,” SourceForge.net project gampmatlab, available on-line at http://gampmatlab.sourceforge.net/.