Structured Variational Methods for Distributed Inference in Networked Systems: Design and Analysis

Huaiyu Dai*, Senior Member, IEEE, Yanbing Zhang, Member, IEEE, and Juan Liu

Abstract

In this paper, a variational message passing framework is proposed for distributed inference in networked systems. Based on this framework, structured variational methods are explored to take advantage of both the simplicity of variational approximation (for inter-cluster processing) and the quality of more accurate inference (for intra-cluster processing). To investigate the convergence performance of our inference approach, we distinguish the inter- and intra-cluster inference algorithms as vertex and edge processes, respectively. Based on an analysis of the intra-cluster inference procedure, the overall performance of structured variational methods, modeled as a mixed vertex-edge process, is quantitatively characterized via a coupling approach. The tradeoff between performance and complexity of this inference approach is also addressed.

Index Terms: convergence analysis, distributed inference, Markov chain, variational methods

Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

H. Dai and J. Liu are with the Department of Electrical and Computer Engineering, NC State University, Raleigh, NC 27695 (Email: [email protected], [email protected]). Y. Zhang is with Broadcom Corporation, Matawan, NJ 07747 (Email: [email protected]). This work was done while he was with NC State University. This work was supported in part by the National Science Foundation under Grants CCF-0830462 and ECCS-1002258.

Part of this work was presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, April 2009 [29], and the IEEE International Symposium on Information Theory (ISIT), Seoul, South Korea, June 2009 [30].

I. INTRODUCTION

Large-scale networked systems of intelligent devices are playing an increasingly important role in protecting the nation's critical infrastructures as well as serving people's needs; smart grid, intelligent transportation, precision agriculture, and seamless surveillance are a few such examples. In many such systems, there is a pressing need for automatic reasoning and inference due to practical and economic considerations, and it is desirable that inference be conducted in a distributed fashion. This motivates us to develop a general and flexible framework for automatic inference in networked systems which admits wide applications and provides the desired tradeoff between accuracy and efficiency, while allowing simple and distributed implementation.

Exact inference is known to be NP-hard [2], and is generally computationally intractable for many applications. Therefore, approximate methods are often resorted to in practice. One popular approach to approximate inference is sampling, of which the family of Markov Chain Monte Carlo methods is noteworthy. Concerns about this approach include slow convergence, analytical tractability, and computational complexity. Belief propagation (BP) algorithms [1] and their variants (such as consensus propagation [3], a special case of Gaussian BP) have also been widely studied in the literature. BP algorithms yield accurate inference on acyclic graphs, and continue to work well on loopy graphs with sufficient sparsity and symmetry. They are also amenable to distributed implementation. However, BP and related algorithms are not guaranteed to converge in general cyclic graphs. They are also computationally intractable when continuous variables are involved (except for Gaussian distributions), and approximate methods such as particle filtering may be employed as a remedy in practice.

Variational methods [4] are an alternative for approximate inference. Being a deterministic approach, they are often more efficient in computation, more amenable to analysis, and admit wide applicability regarding the underlying models, whether acyclic or cyclic, discrete or continuous. A message-passing algorithm for mean-field (MF) inference was proposed for conjugate-exponential models in (directed) Bayesian networks [5]. An implementation on the factor graph can be found in [6]. In this paper, we derive a variational message passing framework for Markov random fields (MRFs), which arguably offer advantages in modeling wireless networks. In particular, we formulate explicit message passing rules for distributions in the exponential family, which covers a large class of probabilistic models. Relevant discussion is given in Section II.

Among variational methods, the simplest MF approach is most often considered due to its analytical and computational tractability; its inference accuracy is nonetheless limited by the inherent assumption that the variables of interest are fully independent. Naturally, a richer structure for the variational distribution can be exploited for better inference quality, at the cost of increased complexity. Such approaches are named structured variational methods, or simply structured mean field (SMF). They have mainly been studied in the artificial intelligence area [7][8], and little consideration has been given to their application in real networks. In this work, we further investigate exploiting substructures of networks to improve variational methods in real systems. Thus the simplicity of variational methods (for inter-cluster processing) and the accuracy of (approximately) exact inference algorithms (for intra-cluster processing) can be exploited simultaneously, as detailed in Section III. In this study, BP is adopted as an approximation for exact inference in intra-cluster processing, as it can be readily realized in a distributed form. Meanwhile, our SMF framework can effectively control the cluster sizes (and even the topologies) to ensure good performance for BP processing. In [24], an alternative approach combining BP and MF inference is presented on the factor graph model: the whole set of factor nodes is divided into two parts, with BP applied on one subset (in particular, discrete variables) and MF on the remaining part. One possible application of such an approach is to design iterative message-passing algorithms jointly for different components of a communication system, such as joint channel estimation and decoding, which is orthogonal to our study.

For the distributed inference algorithms mentioned above and studied in this work, stochastic weight matrices conformant to the underlying graphical structure (network topology) are typically employed. Hence the convergence of these algorithms is closely related to the mixing time of a random walk on the corresponding graph. Random walks on graphs can be categorized as vertex process-based or edge process-based. The essential difference between the two is that the former is a process on nodes that transitions along edges and is allowed to "backtrack", while the latter is a process on directed edges that transitions towards nodes and in which "backtracking" is forbidden. As we will see, distributed algorithms derived from the variational method can be characterized by a vertex process, typically involving reversible Markov chains, while belief propagation and its variants correspond to edge processes, typically involving non-reversible Markov chains.
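As an illustration of this distinction (our own sketch, not part of the original development), the following builds the transition matrix of an ordinary (vertex) random walk and of its non-backtracking (edge) counterpart on the complete graph K4; the choice of graph and all helper names are assumptions for illustration only.

```python
from itertools import permutations

# Vertex process: random walk on the nodes of K4 (complete graph on 4 vertices).
nodes = range(4)
nbrs = {i: [j for j in nodes if j != i] for i in nodes}
P_V = [[1.0 / len(nbrs[i]) if j in nbrs[i] else 0.0 for j in nodes] for i in nodes]

# Edge process: walk on directed edges (u, v) that moves to (v, w) with w != u,
# i.e. immediate backtracking is forbidden (a non-reversible chain).
edges = [(u, v) for u, v in permutations(nodes, 2)]
idx = {e: k for k, e in enumerate(edges)}
P_E = [[0.0] * len(edges) for _ in edges]
for (u, v) in edges:
    succ = [(v, w) for w in nbrs[v] if w != u]
    for e in succ:
        P_E[idx[(u, v)]][idx[e]] = 1.0 / len(succ)

# Both matrices are row-stochastic; the edge walk never reverses a step.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P_V)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P_E)
assert all(P_E[idx[(u, v)]][idx[(v, u)]] == 0.0 for (u, v) in edges)
```

Note that the vertex walk is a chain on |V| states while the edge walk lives on 2|E| directed-edge states, mirroring the dimensions used in the analysis of Sections IV and V.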

Even though quite a few techniques exist for analyzing the convergence of reversible Markov chains, including spectral theory, conductance, canonical paths, and multi-commodity flow (see [9] and the references therein), few of them can be successfully applied to non-reversible cases. In [10], a non-reversible random walk on the one-dimensional chain is analyzed through a direct probabilistic approach. A study of the convergence properties of consensus propagation is given in [3] through function mapping and matrix analysis; an explicit result on convergence time is derived for the cycle, with conjectures given for higher-dimensional tori. Structured variational methods, as we will formulate in Section III, are actually mixed vertex-edge processes involving hybrid Markov chains, entailing even more difficulties in analysis. In this paper, we use a "divide and conquer" strategy to investigate their performance: first we analyze the convergence of the intra-cluster edge process, where we derive an upper bound on the mixing time and verify the conjecture in [3] for the two-dimensional (2-d) torus; then we exploit the coupling technique [11] to combine the results for edge and vertex processes and obtain a characterization of the overall performance. The relevant contents are given in Section IV and Section V, respectively. Finally, the performance-complexity tradeoff in structured variational methods is addressed in Section VI, together with supporting simulation results.

The contributions of this work are summarized as follows. First, we derive a general and scalable variational message-passing framework for Markov random fields, which admits wide applicability concerning network size and topology, allows a flexible tradeoff between performance and complexity, and easily adapts to practical wireless networks. In particular, we obtain explicit forms of the variational message passing rules for probabilistic distributions in the exponential family, which are simple and yet admit wide applications. Then, this framework is applied to a clustered network (exemplified by a Gaussian MRF) to realize a novel distributed inference approach which can achieve a flexible balance between inference accuracy and computational complexity. We also characterize the convergence behavior of the proposed inference algorithm on a 2-d torus, and in the process derive an upper bound on the mixing time of the intra-cluster BP inference process, which should be of independent interest. Our analytical methodologies are developed for general edge process-based and mixed random walks, and may find wider applicability.

II. VARIATIONAL MESSAGE PASSING IN MRF

A. System Model

Distributed inference in complex networked systems is often cast as probabilistic inference in a graphical model. Well-known graphical models include Markov random fields, Bayesian networks, and factor graphs. Associated with each graphical model is a family of distributions which factorize according to the dependency structure of the underlying graph. Besides obvious advantages in visual representation, graphical models also facilitate the design and analysis of distributed algorithms. Among existing graphical models, Markov random fields exhibit certain modeling conveniences for wireless networks, as they can be conveniently mapped to real communication graphs, and often admit simpler forms for message-passing algorithms.

In this work we mainly consider a pairwise MRF¹ represented by an undirected graph $(V, E)$, where $V$ and $E$ denote the vertex and edge sets, respectively, and each node $i \in V$ is associated with a random variable $X_i$ and an observation $y_i$. Define for each node a local potential function $\phi_i(X_i, y_i)$, and for each edge $(i,j) \in E$ a compatibility function $\psi_{ij}(X_i, X_j)$ [12]. The Hammersley-Clifford theorem [1] indicates that the posterior probability of the random vector $\mathbf{X} = \{X_i\}_{i \in V}$ given the observation vector $\mathbf{y} = \{y_i\}_{i \in V}$ admits the following product form:

$p(\mathbf{X} \mid \mathbf{y}) \propto \prod_{(i,j) \in E} \psi_{ij}(X_i, X_j) \prod_{i \in V} \phi_i(X_i, y_i)$.  (1)

We also assume that $\{\phi_i\}$ and $\{\psi_{ij}\}$ belong to the exponential family, i.e., take the following forms:

$\phi_i(X_i, y_i) = \exp\big(\boldsymbol{\theta}_i^T \boldsymbol{\eta}_i(X_i) - g_i(\boldsymbol{\theta}_i)\big)$,  (2)

$\psi_{ij}(X_i, X_j) = \exp\big(\boldsymbol{\theta}_{ij}^T \boldsymbol{\eta}_{ij}(X_i, X_j) - g_{ij}(\boldsymbol{\theta}_{ij})\big)$,  (3)

where the $\boldsymbol{\theta}$'s and $\boldsymbol{\eta}$'s are usually referred to as the natural parameters and sufficient statistics, respectively, and $g_i(\boldsymbol{\theta}_i)$ and $g_{ij}(\boldsymbol{\theta}_{ij})$ are functions of the $\boldsymbol{\theta}$'s only, independent of the $X$'s. The exponential family covers a large class of distributions of interest, such as the Gaussian, Wishart, Gamma, and Beta distributions, and any discrete distribution.

The objective of distributed inference is to compute $P(\mathbf{X} \mid \mathbf{Y} = \mathbf{y})$ (or more generally $P(S \mid \mathbf{Y} = \mathbf{y})$ for some subset $S \subset \mathbf{X}$) in an efficient way, through local computation and communication only. This is particularly important for large-scale networked systems, where the observation data is widely distributed and each node may have limited computation and communication resources. Distributed inference algorithms can also find applications in data processing where communication is not a concern but computational complexity is. Centralized processing incurs a computational complexity scaling exponentially with the size of $\mathbf{X}$, and additional complexity is needed when a subset $S \subset \mathbf{X}$ is considered.

¹ An MRF with higher-order cliques can always be converted into an equivalent pairwise MRF [12].
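To make the forms (2)-(3) concrete, the following sketch (our own illustration, not from the paper) writes a scalar Gaussian node potential in exponential-family form, with natural parameters $\theta = (\mu/\sigma^2, -1/(2\sigma^2))$, sufficient statistics $\eta(x) = (x, x^2)$, and log-partition $g(\theta) = \mu^2/(2\sigma^2) + \tfrac{1}{2}\log(2\pi\sigma^2)$, and checks that it reproduces the Gaussian density.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    # Standard N(mu, sigma2) density.
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def expfam_pdf(x, mu, sigma2):
    # Same density in exponential-family form exp(theta^T eta(x) - g(theta)).
    theta = (mu / sigma2, -1.0 / (2 * sigma2))  # natural parameters
    eta = (x, x * x)                            # sufficient statistics
    g = mu * mu / (2 * sigma2) + 0.5 * math.log(2 * math.pi * sigma2)  # log-partition
    return math.exp(theta[0] * eta[0] + theta[1] * eta[1] - g)

for x in (-1.0, 0.0, 0.7, 2.5):
    assert abs(gaussian_pdf(x, 0.3, 1.7) - expfam_pdf(x, 0.3, 1.7)) < 1e-12
```

The same bookkeeping (natural parameter, sufficient statistic, log-partition) is what the message-passing rules of Section II.C operate on.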

B. Variational Inference

Modern variational methods in general refer to converting the original problem into an optimization problem (variational transformation), and seeking approximate solutions to the latter by approximating the objective function or restricting the feasible set of solutions (variational approximation). Solving the (relaxed) optimization problem typically results in a set of fixed-point equations, and successive enforcement of them can (hopefully) lead to a solution.

Applying the variational approach to distributed inference, the original problem is first transformed into finding a distribution $Q(\mathbf{X})$ which minimizes the KL divergence $KL\big(Q(\mathbf{X}) \,\|\, P(\mathbf{X} \mid \mathbf{Y} = \mathbf{y})\big)$, or equivalently maximizes a tight lower bound of $\log P(\mathbf{y})$ [4]:

$L(Q) = H(Q) + E_Q\{\log P(\mathbf{X}, \mathbf{y})\}$,  (4)

where $H(Q)$ is the entropy of $Q(\mathbf{X})$, and $E_Q\{\cdot\}$ stands for expectation with respect to (w.r.t.) $Q(\mathbf{X})$. For analytical and computational tractability, the variational distribution $Q(\mathbf{X})$ is often restricted to a class of distributions with a simpler dependency structure (such as sub-graphs of the original graphical model). Thus far, the most fruitful applications of variational inference assume a fully factorized form $Q(\mathbf{X}) = \prod_i Q_i(X_i)$, referred to as the mean field approach. When (4) is instantiated with this form and optimized w.r.t. each individual component, the following set of fixed-point equations is obtained for the optimal variational distribution (where $E_Q\{\cdot \mid X_i\}$ refers to conditional expectation given $X_i$, and $Z_i$ is a normalization constant):

$\log Q_i(X_i) = E_Q\{\log P(\mathbf{X} \mid \mathbf{y}) \mid X_i\} - \log Z_i, \quad \forall i$.  (5)

The complexity and accuracy of variational inference depend on the inherent structure of the variational distribution $Q(\mathbf{X})$. While the MF approach is attractive for its simplicity, its inference accuracy may not be satisfactory, as the posterior correlation is not captured. In this paper, we consider a richer structure for the variational distributions and explore the resulting tradeoff between performance and complexity.
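As a toy illustration of the fixed-point equations (5) (our own example; the model and all numbers are hypothetical), consider a two-variable binary MRF $p(x_1, x_2) \propto \exp(J x_1 x_2 + h_1 x_1 + h_2 x_2)$ with $x_i \in \{-1, +1\}$. For this model the mean-field updates reduce to $m_i \leftarrow \tanh(h_i + J m_j)$ on the means $m_i = E_{Q_i}[X_i]$, and iterating them to a fixed point yields the MF approximation.

```python
import math

# Pairwise binary MRF p(x1, x2) ∝ exp(J*x1*x2 + h1*x1 + h2*x2), x_i in {-1, +1}.
# J, h1, h2 are arbitrary illustrative values.
J, h1, h2 = 0.4, 0.3, -0.1

# Mean-field coordinate updates on the means m_i = E_Q[X_i].
m1, m2 = 0.0, 0.0
for _ in range(200):
    m1 = math.tanh(h1 + J * m2)
    m2 = math.tanh(h2 + J * m1)

# At convergence, the fixed-point equations (the analogue of (5)) hold.
assert abs(m1 - math.tanh(h1 + J * m2)) < 1e-10
assert abs(m2 - math.tanh(h2 + J * m1)) < 1e-10
```

The factorized $Q$ captures the marginal means but, by construction, no correlation between $X_1$ and $X_2$ — precisely the limitation that motivates the structured methods of Section III.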

C. Variational Message Passing Framework

In this section, we follow the general procedure of variational inference discussed in Section II.B to derive a message-passing framework for distributed inference in networked systems. Consider the system model (1), and rearrange $\boldsymbol{\theta}_{ij}^T \boldsymbol{\eta}_{ij}(X_i, X_j)$ in terms of $X_i$: $\boldsymbol{\theta}_{ij}^T \boldsymbol{\eta}_{ij}(X_i, X_j) = (\boldsymbol{\theta}'_{ij})^T \boldsymbol{\eta}'_{ij}(X_i)$, where $\boldsymbol{\theta}'_{ij}$ may be a function of $X_j$. Let $\tilde{\boldsymbol{\eta}}_i(X_i)$ be the union of the sufficient statistics $\boldsymbol{\eta}_i(X_i)$ and $\boldsymbol{\eta}'_{ij}(X_i)$. Then the corresponding terms in (2) and (3) can be rewritten as

$\boldsymbol{\theta}_i^T \boldsymbol{\eta}_i(X_i) = \tilde{\boldsymbol{\theta}}_{v,i}^T \tilde{\boldsymbol{\eta}}_i(X_i)$,  (6)

$\boldsymbol{\theta}_{ij}^T \boldsymbol{\eta}_{ij}(X_i, X_j) = (\boldsymbol{\theta}'_{ij})^T \boldsymbol{\eta}'_{ij}(X_i) = \tilde{\boldsymbol{\theta}}_{ij}^T \tilde{\boldsymbol{\eta}}_i(X_i)$,  (7)

where the newly obtained $\tilde{\boldsymbol{\theta}}_{v,i}$ and $\tilde{\boldsymbol{\theta}}_{ij}$² are named the extended natural parameters and $\tilde{\boldsymbol{\eta}}_i(X_i)$ the extended sufficient statistics.

It can be shown that the optimal mean-field approximation $Q_i^*$ dictated by (5) is also a member of the exponential family, with sufficient statistics $\tilde{\boldsymbol{\eta}}_i(X_i)$ and natural parameter

$\boldsymbol{\theta}_i^{Q^*} = \tilde{\boldsymbol{\theta}}_{v,i} + \sum_{j \in \mathcal{N}_i} E_{Q^* \setminus Q_i^*}\{\tilde{\boldsymbol{\theta}}_{ij}\}$,  (8)

where $\mathcal{N}_i$ is the set of neighboring nodes of node $i$, and $E_{Q^* \setminus Q_i^*}$ stands for expectation w.r.t. the distribution $\prod_{k \neq i} Q_k^*(X_k)$. From (8), a simple message passing rule can be obtained:

Message passing: $\mathbf{m}_{i \leftarrow j}^{(n)} = E_{Q_j^{(n-1)}}\{\tilde{\boldsymbol{\theta}}_{ij}\}, \quad (i,j) \in E$;  (9)

Parameter updating: $\boldsymbol{\theta}_i^{(n)} = \tilde{\boldsymbol{\theta}}_{v,i} + \sum_{j \in \mathcal{N}_i} \mathbf{m}_{i \leftarrow j}^{(n)}$.  (10)

That is, in the nth iteration, the message from node $j$ to its neighbor $i$ is the expected value of the extended parameter of the corresponding edge compatibility function, $\tilde{\boldsymbol{\theta}}_{ij}$ (generally a function of $X_j$), w.r.t. the current variational approximation $Q_j^{(n-1)}$ (similarly, the message from node $i$ to its neighbor $j$ is $\mathbf{m}_{j \leftarrow i}^{(n)} = E_{Q_i^{(n-1)}}\{\tilde{\boldsymbol{\theta}}_{ji}\}$). In turn, node $i$ sums up all the messages from its neighbors, together with the extended parameter of its own potential function, to get an updated parameter for its variational distribution component. The iteration generally converges under mild conditions [4][13]. While conforming to general expressions in the literature, the above explicit message-passing and parameter-updating forms appear to be new.

² Note that there is a counterpart $\tilde{\boldsymbol{\theta}}_{ji}$ which abstracts the corresponding terms of $X_j$.

D. Gaussian MRF

For concreteness of discussion, we will particularly consider Gaussian graphical models in this study. Gaussian models are widely adopted in the theory and practice of many areas, such as computer vision, oceanography, and wireless networks, and serve as good approximations in many scenarios due to the central limit theorem. Without loss of generality, consider that $\mathbf{X}$ in (1) is jointly Gaussian with zero mean and (positive definite) covariance matrix $\boldsymbol{\Sigma}_X$, abbreviated as $\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_X)$, where $\sigma_{S,i}^2 = E[X_i^2]$ and $\sigma_{S,i}\sigma_{S,j}\rho_{ij} = E[X_i X_j]$. The observation at each node is given by

$y_i = H_i x_i + n_i, \quad i = 1, \ldots, |V|$,  (11)

where the channel gain $H_i$ is assumed known, and the noise $n_i \sim \mathcal{N}(0, \sigma_{N,i}^2)$ is independent across the network³. Given the observation vector $\mathbf{y}$, the posterior probability $P(\mathbf{X} \mid \mathbf{y})$ is Gaussian distributed as⁴

$P(\mathbf{X} \mid \mathbf{y}) \sim \mathcal{N}\big(\mathbf{F}^{-1}\mathbf{H}^T\boldsymbol{\Xi}^{-1}\mathbf{y}, \, \mathbf{F}^{-1}\big) \sim \mathcal{N}^{-1}\big(\mathbf{H}^T\boldsymbol{\Xi}^{-1}\mathbf{y}, \, \mathbf{F}\big)$,  (12)

where $\mathbf{H} = \mathrm{diag}(H_i)$, $\boldsymbol{\Xi} = \mathrm{diag}(\sigma_{N,i}^2)$, and $\mathbf{F} = [F_{ij}]_{|V| \times |V|} = \boldsymbol{\Sigma}_X^{-1} + \mathbf{H}^T\boldsymbol{\Xi}^{-1}\mathbf{H}$.

Consider approximating the posterior probability (12) by the MF variational distribution with $Q_i(X_i) \sim \mathcal{N}(\mu_{MF,i}, \sigma_{MF,i}^2)$, where $\mu_{MF,i}$ and $\sigma_{MF,i}^2$ are the posterior mean and variance, respectively.

³ It is straightforward to extend the model to include more complex scenarios, such as observations and noise correlated in space.

⁴ $\mathcal{N}^{-1}(\mathbf{h}, \mathbf{J})$ is the information parameterization of the Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with $\mathbf{J} = \boldsymbol{\Sigma}^{-1}$ and $\mathbf{h} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$.

As an application of the variational message passing framework derived in the previous section, the following iterative form is obtained for the estimate of $\mu_{MF,i}$ (see Appendix A):

$\mu_{MF,i}^{(n)} = \Big( H_i y_i / \sigma_{N,i}^2 - \sum_{j \in \mathcal{N}_i} F_{ij}\, \mu_{MF,j}^{(n-1)} \Big) \Big/ F_{ii}$.  (13)

Collecting all node estimates into a vector of dimension $|V|$ leads to the expression

$\boldsymbol{\mu}_{MF}^{(n)} = \mathbf{A}\mathbf{y} - \mathbf{B}\hat{\mathbf{P}}_V\, \boldsymbol{\mu}_{MF}^{(n-1)}$,  (14)

where $\mathbf{A}$ and $\mathbf{B}$ are the relevant coefficient matrices, and the stochastic matrix $\hat{\mathbf{P}}_V$ has entries

$[\hat{P}_V]_{ij} = \begin{cases} F_{ij} \big/ \sum_{j \in \mathcal{N}_i} F_{ij}, & j \in \mathcal{N}_i, \\ 0, & \text{otherwise}. \end{cases}$  (15)

In the next section, the same variational message passing framework will be applied to derive the message exchange rules between clusters (viewed as "mega-nodes").
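The componentwise update (13) is a Jacobi-style sweep for the linear system $\mathbf{F}\boldsymbol{\mu} = \mathbf{H}^T\boldsymbol{\Xi}^{-1}\mathbf{y}$, so for a well-conditioned (e.g. diagonally dominant) $\mathbf{F}$ it converges to the exact posterior mean in (12). The sketch below (our own toy instance; all matrix values are hypothetical) runs the iteration on a 3-node model and checks the residual at the fixed point.

```python
# Toy 3-node Gaussian MRF: F = Sigma_X^{-1} + H^T Xi^{-1} H (here simply a
# diagonally dominant precision matrix), h_i = H_i * y_i / sigma_{N,i}^2.
F = [[2.0, 0.5, 0.0],
     [0.5, 2.0, 0.5],
     [0.0, 0.5, 2.0]]
h = [1.0, -0.4, 0.6]
n = len(h)

# Mean-field iteration (13): mu_i <- (h_i - sum_{j != i} F_ij mu_j) / F_ii.
mu = [0.0] * n
for _ in range(100):
    mu = [(h[i] - sum(F[i][j] * mu[j] for j in range(n) if j != i)) / F[i][i]
          for i in range(n)]

# At the fixed point, F mu = h, i.e. mu is the exact posterior mean of (12).
residual = max(abs(sum(F[i][j] * mu[j] for j in range(n)) - h[i]) for i in range(n))
assert residual < 1e-10
```

Each node's update uses only its own observation and its neighbors' current estimates, which is what makes the scheme distributed.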

III. STRUCTURED VARIATIONAL METHODS FOR DISTRIBUTED INFERENCE

Although attractive for its computational simplicity, the naive mean field approach may not yield sufficient accuracy or fast convergence, due to the independence restriction on the variational distributions. A natural idea for improvement is to consider a variational distribution with a richer (and yet tractable) dependence structure, and to integrate exact or more accurate probabilistic inference algorithms with the mean field method to achieve a good tradeoff between accuracy and complexity. As mentioned earlier, the application of structured variational methods to distributed inference in practical networks is largely unexplored, as is their quantitative performance analysis in this setting. In this section, we discuss their instantiation in the context of clustered wireless networks, and establish convergence. In the following two sections, we will characterize the convergence rate.

A. Overview of the SMF⁵ Method

The MF approximation corresponds to a totally disconnected graph in which all the dependencies between the variables of the original model are removed. A natural idea to enrich the structure of $Q$ is to replace each node in the MF approximation by a "mega-node", i.e., to consider a class of variational distributions of the form $Q = \prod_{i=1}^{s} Q_i(\mathbf{X}_{C_i})$, where $C_1, \ldots, C_s$ form a disjoint partition of all nodes (variables). This approach keeps the original dependency structure within the clusters while decoupling the connections between clusters. More accurate inference is pursued within the clusters to improve performance, while the MF approximation is utilized across the clusters to maintain tractability. The tradeoff between accuracy and complexity can be realized through the construction of the clusters, with the MF approach (clusters of size one) and exact inference (one cluster) at the two extremes.
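The structural effect of this partition can be made concrete with a short sketch (our own illustration; the graph and the partition are hypothetical): given an edge list and a disjoint partition into clusters, intra-cluster edges are the dependencies SMF keeps in the variational distribution, while inter-cluster edges are the ones it decouples.

```python
# A toy 6-node graph and a partition into two clusters.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
clusters = [{0, 1, 2}, {3, 4, 5}]
label = {v: c for c, cluster in enumerate(clusters) for v in cluster}

# SMF keeps edges inside a cluster and decouples edges across clusters.
intra = [e for e in edges if label[e[0]] == label[e[1]]]
inter = [e for e in edges if label[e[0]] != label[e[1]]]

assert intra == [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
assert inter == [(2, 3)]
```

Shrinking the clusters to singletons empties `intra` (plain MF); merging everything into one cluster empties `inter` (exact inference), which is exactly the tradeoff described above.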

In particular, we adopt the belief propagation algorithm for intra-cluster reasoning, as it yields accurate inference on acyclic graphs and continues to work well on loopy graphs with suitable sparsity or symmetry. An important reason for choosing BP is that it can be readily implemented in a message-passing style through the prominent sum-product algorithm or its variants. However, intra-cluster processing in our SMF framework is not limited to BP. Other exact inference algorithms, such as the junction tree (JT) algorithm, can be employed for better convergence and wider applicability. It should be noted that the computational complexity of JT grows exponentially with the size of the maximal clique in the cluster, so it is sensible to put a limit on the cluster size when JT is applied. In both cases, our SMF framework provides a platform to best reap the benefits of these high-accuracy inference algorithms in practice.

SMF also requires some overhead for clustering, which can be done before the network setup and can be adjusted during network operation when necessary. We have designed a distributed clustering scheme for SMF which endeavors to minimize the dependence (correlation) between the clusters, and thus improve the algorithm performance. The effectiveness of this scheme is demonstrated in Section VI via simulations. Interested readers are referred to [14] for details.

⁵ In this paper, "structured variational method" and "structured mean field" are largely interchangeable; more specifically, the former refers to the methodology, while the latter refers to the scheme we develop.

B. SMF in Gaussian MRF

Consider the Gaussian MRF model presented in Section II.D, for which the messages and node beliefs are both Gaussian distributed. Assuming that in the nth iteration the message from node $i$ to $j$, $m_{j \leftarrow i}^{(n)}$, and the belief at node $i$, $b_i^{(n)}$, are parameterized in information form (cf. Footnote 4) as

$m_{j \leftarrow i}^{(n)}(x_j) \sim \mathcal{N}^{-1}\big(\mu_{j \leftarrow i}^{(n)}, V_{j \leftarrow i}^{(n)}\big)$ and $b_i^{(n)}(x_i) \sim \mathcal{N}^{-1}\big(q_i^{(n)}, W_i^{(n)}\big)$,

we can obtain the updating rules as (see Appendix B)

$\mu_{j \leftarrow i}^{(n)} = -\,F_{ji}\, \dfrac{\alpha_i + \sum_{k \in \mathcal{N}_i \setminus \{j\}} \mu_{i \leftarrow k}^{(n-1)}}{V_i + \sum_{k \in \mathcal{N}_i \setminus \{j\}} V_{i \leftarrow k}^{(n-1)}}, \qquad V_{j \leftarrow i}^{(n)} = -\,\dfrac{F_{ji}^2}{V_i + \sum_{k \in \mathcal{N}_i \setminus \{j\}} V_{i \leftarrow k}^{(n-1)}}$,  (16)

with

$\alpha_i = H_i y_i / \sigma_{N,i}^2, \qquad V_i = H_i^2/\sigma_{N,i}^2 + \big[\boldsymbol{\Sigma}_X^{-1}\big]_{ii}$,  (17)

and

$q_i^{(n)} = \alpha_i + \sum_{k \in \mathcal{N}_i} \mu_{i \leftarrow k}^{(n)}, \qquad W_i^{(n)} = V_i + \sum_{k \in \mathcal{N}_i} V_{i \leftarrow k}^{(n)}$.  (18)

We proceed to discuss inter-cluster message updates. Consider a partitioning of the network nodes $C = \{C_i\}_{i=1}^{s}$, and denote by $\mathbf{X}_{C_i}$ the collection of all variables $X_\ell$ with $\ell \in C_i$. For each cluster $C_i$, define its Markov blanket (MB), $MB(C_i)$, as the set of nodes outside of $C_i$ that are connected to some nodes in $C_i$. In turn, those nodes in $C_i$ which are connected to some nodes in $MB(C_i)$ are denoted as gateway nodes. A neighboring cluster which contains part of $MB(C_i)$ is named a Markov blanket cluster (MBC) of $C_i$, and the collection of the MBCs of $C_i$ is denoted by $\mathcal{N}_{C_i}$. A conceptual illustration of MB and MBC is given in Figure 1. To apply the variational message passing rules derived in Section II.C, the posterior probability can be reformulated as (cf. (1))

$P(\mathbf{X} \mid \mathbf{y}) \propto \prod_{C_i \in C} \phi_{C_i}(\mathbf{X}_{C_i}, \mathbf{y}_{C_i}) \prod_{(C_i, C_j)} \psi_{C_i, C_j}(\mathbf{X}_{C_i}, \mathbf{X}_{C_j})$,  (19)

where $\phi_{C_i}(\mathbf{X}_{C_i}, \mathbf{y}_{C_i}) = \prod_{\ell \in C_i} \phi_\ell(X_\ell, y_\ell) \prod_{(\ell, m) \in E:\ \ell, m \in C_i} \psi_{\ell m}(X_\ell, X_m)$ collects the node potentials and edge compatibility functions within cluster $C_i$, while $\psi_{C_i, C_j}(\mathbf{X}_{C_i}, \mathbf{X}_{C_j}) = \prod_{(\ell, m) \in E:\ \ell \in C_i,\, m \in C_j} \psi_{\ell m}(X_\ell, X_m)$ collects the compatibility functions of the connecting edges between clusters $C_i$ and $C_j$.

As derived in Appendix C, the inter-cluster update can be readily obtained for a gateway node $i$ in cluster $C_i$ as

$\tilde{y}_i^{(n)} = y_i - \dfrac{\sigma_{N,i}^2}{H_i} \sum_{j \in MB(C_i)} F_{ij}\, \big(W_j^{(n-1)}\big)^{-1} q_j^{(n-1)}$.  (20)

Expression (20) admits an interesting interpretation: gateway nodes use the intra-cluster estimates $\big(W_j^{(n-1)}\big)^{-1} q_j^{(n-1)}$ of their neighbors in the Markov blanket to "update" their observations, and exploit these "new" observations, which encode the messages from other parts of the network (propagated through intra- and inter-cluster processing), for the next round of intra-cluster inference. The execution of intra- and inter-cluster updating need not alternate one-for-one; in practice it is found to be advantageous to run intra-cluster inference more often than inter-cluster inference.
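The cluster-boundary notions above are straightforward to compute; the sketch below (our own illustration, on a hypothetical graph and partition) derives the Markov blanket and the gateway nodes of a cluster from the edge list.

```python
# Toy graph (a 6-node chain) and a two-cluster partition; all values hypothetical.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
nbrs = {}
for u, v in edges:
    nbrs.setdefault(u, set()).add(v)
    nbrs.setdefault(v, set()).add(u)
C1, C2 = {0, 1, 2}, {3, 4, 5}

def markov_blanket(cluster):
    # Nodes outside the cluster connected to some node inside it.
    return {v for u in cluster for v in nbrs[u] if v not in cluster}

def gateways(cluster):
    # Nodes inside the cluster connected to the cluster's Markov blanket.
    mb = markov_blanket(cluster)
    return {u for u in cluster if nbrs[u] & mb}

assert markov_blanket(C1) == {3}
assert gateways(C1) == {2}
assert markov_blanket(C2) == {2}
assert gateways(C2) == {3}
```

In the update (20), only the gateway nodes (here node 2 of cluster C1) modify their observations, using the estimates held by their Markov-blanket neighbors (here node 3).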

C. Convergence Analysis

The convergence of Gaussian BP in loopy graphs has been actively studied in the literature ([15][3][16] and references therein). While a full understanding is still lacking, various sufficient conditions have been found, among which the pairwise-normalizable condition⁶ is noteworthy [16]. Here we will assume that this condition is satisfied, so that intra-cluster BP is guaranteed to converge. We can see from (16) that the inverse-variance updating of BP within clusters stands alone. It is also observed in our study that the inverse-variance iteration converges much faster than the message-mean iteration, so we can allow it to run first until the variance is sufficiently low (as clarified in the proof of Lemma 3.1 below). In the following, we provide an alternative proof of the convergence of the mean iteration, assuming that the variances in intra-cluster BP have already converged to some small values $\{\bar{\sigma}_{j \leftarrow i}^2\}$. This approach can be easily extended to the analysis of the overall convergence of SMF, and facilitates the study of the convergence rate.

⁶ That is, there exists a decomposition of the form (1) in which both the node potential and edge compatibility functions are valid Gaussian distributions.

The message-mean iteration in (16) can be rewritten with the conventional parameter pair $(\mu_{j \leftarrow i}, \sigma_{j \leftarrow i})$ as (cf. Footnote 4)

$\mu_{j \leftarrow i}^{(n)} = G_{j \leftarrow i}\big(\boldsymbol{\mu}^{(n-1)}\big) = \dfrac{H_i y_i/\sigma_{N,i}^2 + \sum_{k \in \mathcal{N}_i \setminus \{j\}} \bar{\sigma}_{i \leftarrow k}^{-2}\, \mu_{i \leftarrow k}^{(n-1)}}{\dfrac{\sigma_{S,i}\sigma_{S,j}\rho_{ij}}{1 - \rho_{ij}^2} \Big( V_i + \sum_{k \in \mathcal{N}_i \setminus \{j\}} \bar{\sigma}_{i \leftarrow k}^{-2} \Big)}$,  (21)

where $\boldsymbol{\mu}^{(n-1)} = \big[\mu_{j \leftarrow i}^{(n-1)}\big] \in \mathbb{R}^{2|E| \times 1}$, $(i,j) \in E$.

Lemma 3.1 $G_{i\to j}$ is a contraction mapping.$^7$

Proof: Let

$$K_{k\to i} = \frac{\sigma_{k\to i}^{-2}}{V_i},\qquad
\bar y_{ij} = \frac{\beta_{ij}\, H_i y_i/\sigma_{N,i}^2}{\sigma_{S,i}\sigma_{S,j}\big(1-\beta_{ij}^2\big)\,V_i},\qquad
\gamma_{ij} = \frac{\beta_{ij}}{\sigma_{S,i}\sigma_{S,j}\big(1-\beta_{ij}^2\big)},$$

then Equation (21) can be reformulated as

$$G_{i\to j}\big(\boldsymbol{\mu}^{(n-1)}\big)
= \frac{\bar y_{ij} + \gamma_{ij}\sum_{k\in N(i)\setminus\{j\}} K_{k\to i}\,\mu_{k\to i}^{(n-1)}}{1 + \sum_{k\in N(i)\setminus\{j\}} K_{k\to i}}. \qquad (22)$$

Define $\bar K_{ij} = \Big(1 + \sum_{k\in N(i)\setminus\{j\}} K_{k\to i}\Big)^{-1}$, $\mathbf{D} = \mathrm{diag}\big(\bar K_{ij}\big) \in \mathbb{R}^{2|E|\times 2|E|}$, $\mathbf{V} = \mathrm{diag}\big(\gamma_{ij}\big) \in \mathbb{R}^{2|E|\times 2|E|}$ and

$\bar{\mathbf{y}} = \big[\bar y_{ij}\big] \in \mathbb{R}^{2|E|\times 1}$. Further define a stochastic matrix $\hat{\mathbf{P}}_{BP} \in \mathbb{R}^{2|E|\times 2|E|}$ with the entries

7 A contraction mapping on a metric space $(M, d)$ is a function $G$ from $M$ to itself, with the property that there is some nonnegative real number $\gamma < 1$ such that for all $\mu$ and $\mu'$ in $M$, $d\big(G(\mu), G(\mu')\big) \le \gamma\, d(\mu, \mu')$.


$$\hat P^{BP}_{e,e'} = \begin{cases}
\dfrac{K_{s(e')\to d(e')}}{\sum_{e'':\ d(e'')=s(e),\ s(e'')\neq d(e)} K_{s(e'')\to d(e'')}}, & s(e)=d(e')\ \text{but}\ s(e')\neq d(e), \\[4pt]
0, & \text{otherwise},
\end{cases} \qquad (23)$$

where $s(e)$ and $d(e)$ denote the source and destination nodes of edge $e$. The iteration (22) can be

    written in a vector-matrix form as (c.f. (14))

$$\boldsymbol{\mu}^{(n)} = \mathbf{D}\bar{\mathbf{y}} + \mathbf{V}\big(\mathbf{I}-\mathbf{D}\big)\hat{\mathbf{P}}_{BP}\,\boldsymbol{\mu}^{(n-1)}, \qquad (24)$$

which leads to

$$\begin{aligned}
\big\|G(\boldsymbol{\mu}) - G(\boldsymbol{\mu}')\big\|_\infty
&= \big\|\mathbf{V}(\mathbf{I}-\mathbf{D})\hat{\mathbf{P}}_{BP}\,(\boldsymbol{\mu}-\boldsymbol{\mu}')\big\|_\infty \\
&\le \big\|\mathbf{V}(\mathbf{I}-\mathbf{D})\hat{\mathbf{P}}_{BP}\,\mathbf{1}\big\|_\infty\, \big\|\boldsymbol{\mu}-\boldsymbol{\mu}'\big\|_\infty \\
&= \big\|\mathbf{V}(\mathbf{I}-\mathbf{D})\,\mathbf{1}\big\|_\infty\, \big\|\boldsymbol{\mu}-\boldsymbol{\mu}'\big\|_\infty \\
&\le \gamma\, \big\|\boldsymbol{\mu}-\boldsymbol{\mu}'\big\|_\infty
\end{aligned} \qquad (25)$$

with $\gamma < 1$. The last inequality comes from the assumption that for sufficiently large $n$,

$$\frac{\beta_{ij}}{\sigma_{S,i}\sigma_{S,j}\big(1-\beta_{ij}^2\big)} < 1 \quad \text{for all } (i,j)\in E,$$

i.e., $\gamma_{ij}^{(n)} < 1$. Thus, $G(\boldsymbol{\mu})$ is a maximum-norm contraction mapping and

    hence has a unique fixed point. This proves the convergence of the mean in Gaussian belief

    propagation.
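The contraction argument above can be checked numerically. The sketch below builds a random instance of the vector iteration (24) — with hypothetical stand-ins for $\mathbf{D}$, $\mathbf{V}$, $\hat{\mathbf{P}}_{BP}$ and $\bar{\mathbf{y}}$ (not derived from an actual MRF) chosen so that $\|\mathbf{V}(\mathbf{I}-\mathbf{D})\mathbf{1}\|_\infty \le \gamma < 1$ — and verifies geometric convergence to the unique fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 12            # plays the role of 2|E|, the number of directed-edge states
gamma = 0.9       # target contraction factor

# Hypothetical stand-ins: a row-stochastic P_hat, a diagonal D with entries in
# (0,1), and a diagonal V scaled so every row sum of V(I-D)P_hat is <= gamma.
P = rng.random((m, m)); P /= P.sum(axis=1, keepdims=True)
d = rng.uniform(0.2, 0.8, m)
v = rng.uniform(0.5, 1.0, m) * gamma / (1 - d)
D, V = np.diag(d), np.diag(v)
ybar = rng.normal(size=m)

A = V @ (np.eye(m) - D) @ P                          # iteration matrix, ||A||_inf <= gamma
mu_fix = np.linalg.solve(np.eye(m) - A, D @ ybar)    # the unique fixed point
mu, errs = np.zeros(m), []
for _ in range(400):
    mu = D @ ybar + A @ mu                           # the mean iteration (24)
    errs.append(np.max(np.abs(mu - mu_fix)))

assert np.linalg.norm(A, ord=np.inf) <= gamma + 1e-12
assert errs[-1] < 1e-9                               # converged to the fixed point
assert all(e2 <= gamma * e1 + 1e-15 for e1, e2 in zip(errs, errs[1:]))
```

Each iteration shrinks the max-norm error by at least the factor $\gamma$, exactly as the chain of inequalities in (25) predicts.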

    Based on Lemma 3.1, we have:

    Theorem 3.1 In a Gaussian MRF, the structured variational method using Gaussian BP as the

    intra-cluster inference algorithm converges.

    Proof: Taking inter-cluster updating into account, the change in observations (20) will only

reflect on $y_i^{(n)}$, which is cancelled out in $G_{i\to j}(\boldsymbol{\mu}) - G_{i\to j}(\boldsymbol{\mu}')$ (c.f. (22)). So the proof of

    Lemma 3.1 still can be applied, and the conclusion follows.


IV. CONVERGENCE RATE OF INTRA-CLUSTER INFERENCE
Although successfully employed in SMF for intra-cluster inference, the performance of the

    belief propagation algorithm is still not fully understood. We first analyze the performance of the

    intra-cluster algorithm in this section, where we derive an upper bound on its convergence time

    for the 2-d torus; then we utilize this result in Section V to investigate the overall performance of

SMF. Our analysis mainly focuses on 2-d tori, which capture the essence of planar networks;

    extension to other models such as geometric random graphs will be considered in our future

    work.

A. Vertex, Edge and Mixed Process
Both updates in (14) and (24) involve stochastic matrices which define irreducible and aperiodic

Markov chains. The matrix $\hat{\mathbf{P}}_{MF}$ in (14) is a $|V|\times|V|$ matrix defined on the vertex set (c.f. (15)), while

$\hat{\mathbf{P}}_{BP}$ in (24) turns out to be a $2|E|\times 2|E|$ matrix defined on the set of directed edges (c.f. (23)).

We refer to the evolution of the corresponding Markov chains in these two schemes as the

vertex process and the edge process, respectively.

    Figure 2 illustrates the distinction between a vertex process and an edge process. As (a)

    shows, the states in a vertex process are represented by nodes (the circles), while the allowable

    (two-way) transition between the states is determined by the undirected edges. In contrast, the

    states in an edge process are represented by the directed edges (the arrows in (b)), and the

    transitions are guided by the directions that the arrows point to. More specifically, the transition

can only occur between edges which are connected but not directly opposed (i.e., between

$e'$ and $e$ such that $s(e) = d(e')$ but $s(e') \neq d(e)$), dictated by the rule in BP that the

    message from one neighbor of a node contributes to the new messages sent to other neighbors

    but not back to itself (c.f. (21)). For structured variational methods, we constrain the edge

    process only within clusters, and employ the vertex process to exchange information between

    clusters. This leads to a mixed vertex-edge process model as shown in (c).

  • 16

For a Markov chain $\hat{\mathbf{P}}$ with stationary distribution $\boldsymbol{\pi}$, the mixing time is defined as

$$t_{\mathrm{mix}}(\epsilon) = \max_i\, \inf\Big\{t:\ \tfrac12\big\|\hat{\mathbf{P}}^t(i,\cdot) - \boldsymbol{\pi}\big\|_1 \le \epsilon\Big\}, \qquad (26)$$

where $\hat{\mathbf{P}}^t(i,\cdot)$ is the $i$-th row of the $t$-step transition matrix, and $\|\cdot\|_1$ stands for the $l_1$ norm.

Essentially, $t_{\mathrm{mix}}(\epsilon)$ specifies the (worst-case) time that $\hat{\mathbf{P}}$ takes to converge to the $\epsilon$-vicinity of

its stationary distribution, considering all possible initial states.
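For small chains, the definition (26) can be evaluated directly by powering the transition matrix. A minimal sketch — the lazy random walk on a cycle is only an illustrative stand-in, not a chain from the paper:

```python
import numpy as np

def mixing_time(P, eps):
    """Smallest t with max_i ||P^t(i,.) - pi||_1 / 2 <= eps, per (26)."""
    n = P.shape[0]
    w, v = np.linalg.eig(P.T)                    # pi: left eigenvector for eigenvalue 1
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    pi /= pi.sum()
    Pt = P.copy()
    for t in range(1, 10_000):
        if 0.5 * np.abs(Pt - pi).sum(axis=1).max() <= eps:
            return t
        Pt = Pt @ P
    raise RuntimeError("did not mix")

# lazy simple random walk on an 8-cycle (irreducible and aperiodic)
n = 8
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.25

assert mixing_time(P, 0.25) <= mixing_time(P, 0.01)   # mixing time grows as eps shrinks
```

Since total-variation distance to stationarity is non-increasing in $t$, the returned time is monotone in $\epsilon^{-1}$, as the assertion checks.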

    The convergence behavior of vertex processes has been well studied in the context of

    reversible Markov chains. In particular, it is not difficult to prove that the mixing time of a

reversible Markov chain on a 2-d $n\times n$ torus is $t_{\mathrm{mix}} = O(n^2)$ [17], which characterizes the

    convergence time of vertex processes and thus the variational method. However, the

    performance of edge processes, which in general involves non-reversible Markov chains, is still

largely unexplored. To the best of our knowledge, there is no formal convergence analysis

of the mixed model.

B. Convergence Rate of Edge Process
To facilitate our discussion, we adopt the following labeling to represent an edge process on an

$n\times n$ torus, as shown in Figure 3. Specifically, with node $(1,1)$ on the bottom-left corner and

$(n,n)$ on the top-right corner, the four outgoing edges of node $(i,j)$ pointing in the East, North, West

and South directions are respectively labeled as $(i,j)$, $(j,-i)$, $(-i,-j)$ and $(-j,i)$. The states are

only allowed to make 90-degree turns with probability $q(n)/n$ (e.g., from $(i,j)$ to $(j,-(i+1))$

or $(-j,i+1)$), and move forward with probability $1 - 2q(n)/n$ (e.g., from $(i,j)$ to $(i+1,j)$),

but cannot backtrack (e.g., the transition from $(i,j)$ to $(-(i+1),-j)$ is forbidden).
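The walk is easy to simulate directly on the torus — tracking positions plus headings rather than the $(s_0, s_1)$ labels of Figure 3, which encode the same process; the parameters below are illustrative:

```python
import random
from collections import Counter

def edge_process(n, q, steps, seed=0):
    """Simulate the edge process on an n x n torus: a walk over directed edges
    that turns 90 degrees left/right with probability q/n each, otherwise moves
    forward, and never reverses direction (a sketch of the walk described above)."""
    rng = random.Random(seed)
    x, y, dx, dy = 0, 0, 1, 0            # start heading East
    visits = Counter()
    for _ in range(steps):
        u = rng.random()
        if u < q / n:
            dx, dy = -dy, dx             # turn left
        elif u < 2 * q / n:
            dx, dy = dy, -dx             # turn right
        x, y = (x + dx) % n, (y + dy) % n   # move forward (backtracking impossible)
        visits[(x, y, dx, dy)] += 1
    return visits

v = edge_process(6, 2.0, 200_000)        # q/n = 1/3, the normal-BP regime
assert len(v) == 4 * 6 * 6               # all 4n^2 directed-edge states reached
```

By symmetry the empirical occupation frequencies approach the uniform stationary distribution $1/(4n^2)$ discussed below.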

    A conjecture is put forth in [3] that the convergence time of consensus propagation (or belief

propagation) on a 2-d $n\times n$ torus is $O(n^{3/2})$. In this section, we verify this conjecture by

deriving a slightly more general upper bound for the mixing time of edge processes on a 2-d

torus, assuming a turning probability $q(n)/n$, where $q(n)$ satisfies $\lim_{n\to\infty} q(n) = \infty$ and


$q(n)/n \le 1/3$. For a normal BP algorithm, we have $q(n)/n = c$ for some constant $c \le 1/3$.$^8$ We

begin by citing a result from [18], which applies to general Markov chains.

Lemma 4.1 For any irreducible and aperiodic Markov chain $\hat{\mathbf{P}}$ with stationary distribution $\boldsymbol{\pi}$,

$$t_{\mathrm{mix}}(\epsilon) \le \Big\lceil \log\big(\epsilon^{-1}\big) \big/ \log\big((1-c)^{-1}\big) \Big\rceil \big(1 + t_{\mathrm{fill}}(c)\big), \qquad (27)$$

where $t_{\mathrm{fill}}(c) = \max_i \inf\big\{t:\ \hat{\mathbf{P}}^t(i,\cdot) \ge c\,\boldsymbol{\pi}\big\}$, $0 < c < 1$.

As mentioned above, a two-tuple $(s_0, s_1) \in \{-n,\ldots,-1,1,\ldots,n\}^2$ is used to represent the states

of the edge process. It can be verified that the state evolution (whether horizontal or vertical)

admits:

$$\text{Moving Forward:}\quad s_0^{i+1} = s_0^i + 1,\ \ s_1^{i+1} = s_1^i, \qquad (28)$$

$$\text{Turning Left:}\quad s_0^{i+1} = s_1^i,\ \ s_1^{i+1} = -\big(s_0^i + 1\big), \qquad (29)$$

$$\text{Turning Right:}\quad s_0^{i+1} = -s_1^i,\ \ s_1^{i+1} = s_0^i + 1. \qquad (30)$$

Without loss of generality, assume the random walk starts from state $\big(s_0^0, s_1^0\big) = (a,b)$. Let

$T_0 = 0, T_1, T_2, \ldots$ be the time instances at which the random walk makes turns, and let $D_i \in \{\mathrm{L}, \mathrm{R}\}$ be the

corresponding turning direction (Left or Right) at time $T_i$. Then for $t \in [T_k, T_{k+1})$, with $k$

being the total number of turns before time $t$, the destination state evolves as$^9$

$$\big(s_0^t, s_1^t\big) = \begin{cases}
\Big(t - T_k + s_1^{T_{k-1}},\ -\big(s_0^{T_{k-1}} + T_k - T_{k-1}\big)\Big), & D_k = \mathrm{L}, \\[4pt]
\Big(t - T_k - s_1^{T_{k-1}},\ \ s_0^{T_{k-1}} + T_k - T_{k-1}\Big), & D_k = \mathrm{R}.
\end{cases} \qquad (31)$$

Clearly, the number of possible destination states grows with the number of turns made. For

example, when $t \in [T_1, T_2)$,

$$\big(s_0^t, s_1^t\big) = \begin{cases}
\big(t - T_1 + b,\ -(a + T_1)\big), & D_1 = \mathrm{L}, \\[2pt]
\big(t - T_1 - b,\ \ a + T_1\big), & D_1 = \mathrm{R};
\end{cases} \qquad (32)$$

8 The reader is referred to [10][18] for the construction of non-reversible chains that mix even faster, where $q(n)$ is a constant.
9 All the summations in (31)-(35) should be understood as modulo-$2n$ operations, taking values in $\{-n, \ldots, -1, 1, \ldots, n\}$.


and when $t \in [T_2, T_3)$,

$$\big(s_0^t, s_1^t\big) = \begin{cases}
\big(t - T_2 - a - T_1,\ -(b + T_2 - T_1)\big), & (D_1, D_2) = (\mathrm{L}, \mathrm{L}), \\[2pt]
\big(t - T_2 + a + T_1,\ \ b + T_2 - T_1\big), & (D_1, D_2) = (\mathrm{L}, \mathrm{R}), \\[2pt]
\big(t - T_2 + a + T_1,\ \ b - T_2 + T_1\big), & (D_1, D_2) = (\mathrm{R}, \mathrm{L}), \\[2pt]
\big(t - T_2 - a - T_1,\ -b + T_2 - T_1\big), & (D_1, D_2) = (\mathrm{R}, \mathrm{R}).
\end{cases} \qquad (33)$$

Generally, with $k$ turns in total, the destination state has $2^k$ possibilities; when $k$ is

even, it is given by

$$\begin{pmatrix} s_0^t \\ s_1^t \end{pmatrix} = \begin{pmatrix}
(t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm (T_3 - T_2) \pm T_1 \pm a \\
\pm (T_k - T_{k-1}) \pm (T_{k-2} - T_{k-3}) \pm \cdots \pm (T_2 - T_1) \pm b
\end{pmatrix}, \qquad (34)$$

and when $k$ is odd, it is given by

$$\begin{pmatrix} s_0^t \\ s_1^t \end{pmatrix} = \begin{pmatrix}
(t - T_k) \pm (T_{k-1} - T_{k-2}) \pm \cdots \pm (T_2 - T_1) \pm b \\
\pm (T_k - T_{k-1}) \pm (T_{k-2} - T_{k-3}) \pm \cdots \pm T_1 \pm a
\end{pmatrix}, \qquad (35)$$

where the plus/minus signs are determined by the turning directions.

By symmetry, the stationary distribution of this Markov chain is uniform with probability

$1/(4n^2)$. As dictated by Lemma 4.1, we need to find the minimum time $t$ such that

$\Pr\big((s_0^t, s_1^t) = (x,y)\big) \ge c/n^2$ for any $(x,y) \in \{-n,\ldots,-1,1,\ldots,n\}^2$ for some constant $c$, regardless of

the initial state. Intuitively, both $s_0^t$ and $s_1^t$ in (34) and (35) are sums of independent geometric

random variables, which allows us to examine the final state probability via the Central Limit

Theorem. This is done in Lemma 4.3 below, which requires a technical result, Lemma 4.2,

to simplify the analysis.

Lemma 4.2 With high probability (w.h.p.)$^{10}$ there exists a constant $c_1 > 0$ such that there are

at least $\lceil q(n)^{3/2} \rceil$ positive odd integers (time indices) $i$ satisfying

$T_{i+1} - T_i \le \lfloor n/q(n)^{3/4} \rfloor$ and $T_i - T_{i-1} \le \lfloor n/q(n)^{3/4} \rfloor$

10 The probability approaches 1 as $n \to \infty$.


in the first $\lfloor c_1 \sqrt{q(n)}\, n \rfloor$ steps, where $\lfloor\cdot\rfloor$ denotes the floor function, which returns the largest

integer not exceeding the argument, and $\lceil\cdot\rceil$ denotes its counterpart, the ceiling function.

    Proof: See Appendix D.

Lemma 4.2 essentially reveals some regularity in the turning times; in particular, some

intervals between them are well bounded. Divide the whole turning-time index set $\{1,2,\ldots\}$ into

two subsets $S_1$ and $S_2$, where $S_1$ is the set of positive odd integers $i$ such that

$T_{i+1} - T_i \le \lfloor n/q(n)^{3/4} \rfloor$ and $T_i - T_{i-1} \le \lfloor n/q(n)^{3/4} \rfloor$, as in Lemma 4.2, and $S_2$ contains the rest. For

our purpose, it is sufficient to find a lower bound on the conditional probability

$\Pr\big((s_0^t, s_1^t) = (x,y) \,\big|\, T_{S_2}\big)$ for any given set $T_{S_2} = \{T_j\}_{j \in S_2}$.$^{11}$ This allows us to focus on the more

regular random variables $\big(T_i - T_{i-1},\ T_{i+1} - T_i\big)$ in (34) and (35) specified by $i \in S_1$. We thus can derive:

Lemma 4.3 There exists a constant $c > 0$ such that if $t \ge c_1 \sqrt{q(n)}\, n$,

$$\Pr\big((s_0^t, s_1^t) = (x,y) \,\big|\, T_{S_2}\big) \ge c/n^2, \qquad (36)$$

for any $(x,y) \in \{-n,\ldots,-1,1,\ldots,n\}^2$ and any $T_{S_2} = \{T_j\}_{j \in S_2}$ w.h.p.

    Proof: See Appendix E.

    Combining Lemma 4.1 and Lemma 4.3, we get the following conclusion:

Theorem 4.1 On a 2-d $n\times n$ torus, the mixing time of an edge process with turning probability

$q(n)/n$ is $O\big(\sqrt{q(n)}\, n\big)$ w.h.p.

As a result, the convergence time of consensus propagation [3] and our intra-cluster BP

inference (16) is $O(n^{3/2})$, with $q(n)/n = c$ for some constant $c$ in the worst case. This is verified

in Figure 4, where the mixing times with $\epsilon = 0.01$ of the vertex and edge process (with

    11 Our goal is then achieved by the total probability theorem.


$q(n)/n = 1/3$) are simulated. It is observed that the two curves fit well with $O(n^2)$ and $O(n^{3/2})$,

    respectively.

V. CONVERGENCE RATE OF STRUCTURED VARIATIONAL METHODS
It has been shown in Section IV.A that the performance of structured variational methods is

    governed by a mixed vertex-edge process, or equivalently a hybrid Markov chain model. The

complexity of this model precludes the direct application of standard techniques in the literature. In

this section, we explore the coupling technique [11] to analyze this model.

Coupling provides a simple and elegant way of bounding the mixing time, and is not tied to

reversibility. Essentially, a coupling of Markov chains is a process $\{X_t, Y_t\}_{t \ge 0}$ with the property

that both $\{X_t\}$ and $\{Y_t\}$ are Markov chains with the same transition matrix $\hat{\mathbf{P}}$ of interest, but

typically with different starting states. Once the two chains meet at one state, they stay together

at all times after that, i.e., if $X_{t'} = Y_{t'}$, then $X_t = Y_t$ for $t \ge t'$. For starting states $x$ and $y$, let

$$T^{x,y} = \inf\big\{t:\ X_t = Y_t \mid X_0 = x,\ Y_0 = y\big\}, \qquad (37)$$

then the coupling time is defined as

$$t_{\mathrm{couple}} = \max_{x,y}\, E\big(T^{x,y}\big), \qquad (38)$$

which can serve as an upper bound for the mixing time according to the Coupling Lemma [11]:

$$t_{\mathrm{mix}}(\epsilon) \le \big\lceil \ln \epsilon^{-1} \big\rceil\, t_{\mathrm{couple}}. \qquad (39)$$
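The coupling time (38) can be estimated by Monte Carlo. A toy sketch for two independent lazy random walks on an $n$-cycle (an illustrative stand-in for the vertex process, not the clustered torus itself); the walks evolve independently until they meet and move together afterwards, so only the meeting time matters:

```python
import random

def coupling_time(n, trials=500, seed=0):
    """Monte-Carlo estimate of the expected coupling time (37)-(38) for two
    independent lazy random walks on an n-cycle, started n/2 apart."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x, y, t = 0, n // 2, 0
        while x != y:
            # each walk stays put w.p. 1/2, else steps +-1 (laziness avoids parity traps)
            x = (x + rng.choice((-1, 0, 0, 1))) % n
            y = (y + rng.choice((-1, 0, 0, 1))) % n
            t += 1
        total += t
    return total / trials

# the coupling time on a cycle grows with n (roughly quadratically)
t8, t16 = coupling_time(8), coupling_time(16)
assert t16 > t8
```

Plugging such an estimate into (39) gives a numerical upper bound on the mixing time without ever computing the stationary distribution.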

We assume an $n\times n$ torus is equally divided into $s^2$ clusters, each of size $(n/s)\times(n/s)$, and

    consider a vertex-edge process on it as indicated in Figure 2(c). Then by investigating the

    coupling time of two random walks on such a clustered graph, we can obtain a characterization

    for the mixing time. Firstly, we study how quickly an edge process can “escape” from a 2-d torus

    (or how long it can stay in the torus before hitting any outgoing edges on the boundaries), and

    obtain the following result:


Lemma 5.1 On a 2-d $n\times n$ torus, the average staying time of an edge process is upper-bounded by $O\big(\sqrt{q(n)}\, n\big)$ w.h.p.

    Proof: See Appendix F.

    Using this result, the mixing time of a mixed vertex-edge process can be characterized as

    follows:

Theorem 5.1 On a 2-d $n\times n$ torus, the mixing time of a mixed vertex-edge process with equal

cluster size of $(n/s)\times(n/s)$ is $O\big(\sqrt{s}\, n^{3/2}\big)$ w.h.p.

Proof: Suppose two random walks start from two randomly selected points in this clustered

    graph, then the coupling process can be described as follows: firstly, these two random walks

    wander inside their respective clusters (edge process) until they hit the gateway nodes and exit

    the starting clusters. From then on, they roam over the network, repeatedly entering and leaving

clusters, and finally arrive at the same cluster. This journey, at a high level, can be regarded as a

vertex process on an $s\times s$ torus of "mega-vertices", which takes $O(s^2)$ steps to couple. At each

mega-vertex (cluster), the average staying time is $O\big((n/s)^{3/2}\big)$ according to Lemma 5.1. Besides,

    we need to consider the following scenario. Even after these two walks reach the same cluster,

    one of them may leave early, by hitting gateway nodes before coupling with the other walk; the

probability of such an event is denoted as $p$. In this case, the above journey is repeated.

To evaluate $p$, assume $a$ is the state of the random walk which entered the cluster earlier,

$x$ is the state of the second random walk which has just stepped into this cluster, and $z$ is a boundary

node of this cluster. It has been shown [19] that the probability that, starting from node $x$, a

random walk hits node $z$ before it hits $a$ is given by the ratio of the effective resistance$^{12}$ between

$a$ and $x$ and that between $a$ and $z$:

$$p = P_{x}\big(\text{hit } z \text{ before } a\big) = \frac{R(a \leftrightarrow x)}{R(a \leftrightarrow z)}. \qquad (40)$$

12 The effective resistance between nodes $i$ and $j$ is the expected number of traversals in a random walk starting at $i$ and ending at $j$ [19].


Since both $x$ and $z$ are boundary states, while we have no further information about $a$ but

to assume it is uniformly and randomly located inside the cluster, $R(a \leftrightarrow x)$ and $R(a \leftrightarrow z)$ are

of the same order, and so $p$ is a constant.

Then the total time to couple these two random walks is given by

$$\begin{aligned}
t_{\mathrm{couple}} &= \sum_{i=1}^{\infty} i \Big[ O(s^2)\, O\big((n/s)^{3/2}\big) + O\big((n/s)^{3/2}\big) \Big] (1-p)\, p^{\,i-1} \\
&= O(s^2)\, O\big((n/s)^{3/2}\big) \big/ (1-p) + O\big((n/s)^{3/2}\big) \big/ (1-p),
\end{aligned} \qquad (41)$$

where the first term gives the total roaming time among the clusters, while the second term

corresponds to the staying time of the two random walks in the same cluster.

We thus have

$$t_{\mathrm{mix}}(\epsilon) \le \big\lceil \ln \epsilon^{-1} \big\rceil\, t_{\mathrm{couple}} = O\big(s^2 (n/s)^{3/2}\big) = O\big(\sqrt{s}\, n^{3/2}\big). \qquad (42) \qquad \square$$

Note that Theorem 5.1 includes the previous results for the vertex process ($s = n$) and the edge process

($s = 1$) as two special cases.
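The simplification $O\big(s^2 (n/s)^{3/2}\big) = O\big(\sqrt{s}\, n^{3/2}\big)$ and its two special cases can be sanity-checked in a few lines:

```python
def mixed_bound(n, s):
    """Leading term s^2 * (n/s)^(3/2) of the coupling-time bound in (42)."""
    return s ** 2 * (n / s) ** 1.5

n = 256
assert mixed_bound(n, 1) == n ** 1.5                          # s = 1: edge process, n^(3/2)
assert mixed_bound(n, n) == n ** 2                            # s = n: vertex process, n^2
assert abs(mixed_bound(n, 16) - 16 ** 0.5 * n ** 1.5) < 1e-6  # general s: sqrt(s) * n^(3/2)
```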

    VI. PERFORMANCE-COMPLEXITY TRADEOFF IN STRUCTURED VARIATIONAL METHODS

Theorem 5.1 indicates that the inference performance decreases with $s$ (i.e., improves with the cluster

size). To inspect how clustering affects the message complexity, note that for the $s^2$ clusters in total,

each cluster has $(n/s)^2$ nodes and thus $4(n/s)^2$ directed edges. On each directed edge, two

quantities, the message mean and variance, are exchanged in the BP algorithm, except for the

$4(n/s)$ outgoing edges that cross the cluster boundaries. On these cross-cluster edges, only

the estimated means are needed for inter-cluster updating. So the message complexity per

iteration of the SMF algorithm is given by

$$O\Big( s^2 \big[ 2\big(4(n/s)^2 - 4(n/s)\big) + 4(n/s) \big] \Big) = O\big(8n^2 - 4ns\big). \qquad (43)$$
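The accounting behind (43) can be reproduced directly; the function below follows the counting above (2 scalars per intra-cluster directed edge, 1 per outgoing cross-boundary edge) and matches the closed form exactly:

```python
def smf_messages_per_iteration(n, s):
    """Per-iteration message count of SMF on an n x n torus divided into
    s^2 clusters of (n/s) x (n/s) nodes, per the accounting leading to (43)."""
    assert n % s == 0
    m = n // s
    # 4m^2 directed edges per cluster, of which 4m cross the boundary;
    # intra-cluster edges carry mean + variance, boundary edges carry the mean only
    per_cluster = 2 * (4 * m * m - 4 * m) + 4 * m
    return s * s * per_cluster

assert smf_messages_per_iteration(12, 3) == 8 * 12 ** 2 - 4 * 12 * 3
assert smf_messages_per_iteration(20, 5) == 8 * 20 ** 2 - 4 * 20 * 5
```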


    Comparing (42) and (43), we can observe the performance-complexity tradeoff inherent in SMF:

    as the cluster size increases, more accurate inference is performed on more nodes; this results in

faster convergence, but also inevitably increases the communication burden.

    The inherent tradeoff of SMF is further explored through simulations. We consider a

    Gaussian MRF estimation, and adopt a similar simulation setting as in [20]13, with 150 nodes

    uniformly and randomly distributed in a unit plane. The estimation mean square error (MSE) is

used as the comparison metric, defined by

$$\mathrm{MSE} = \big\| \hat{\mathbf{x}}^m - \hat{\mathbf{x}} \big\|^2 \big/ \big\| \hat{\mathbf{x}} \big\|^2, \qquad (44)$$

with $\hat{\mathbf{x}}^m$ denoting the estimation vector at the $m$th iteration, and $\hat{\mathbf{x}}$ being the exact estimate (the

MMSE solution).

    Two clustering schemes are considered in the simulation. One is a centralized scheme [21],

    which uses semi-definite programming to solve a relaxation of the min-cut problem14 with equal

    cluster size constraint, and the other is our proposed distributed clustering algorithm in [14],

    which seeks to approximate the min-cut solution in a greedy manner.

    Figure 5 compares the convergence rate of BP, MF, and SMF – with the cluster size limit of 3

    and 8, respectively. The results verify that BP converges the fastest. But we also observe that even

    when the network is divided into very small clusters (of size 3), SMF can still significantly speed

    up the convergence; in this case there is not much increase in computational complexity as

compared to MF. A reasonable cluster size (of 8) achieves performance almost indistinguishable

from that of BP. It is also observed that our distributed clustering scheme yields results close to the

    centralized one, but with a much lower computational overhead.

    Another important consideration in wireless networks is the energy consumption, which is

    typically dominated by the communication energy. Comparison of the three approaches in this

    aspect is illustrated in Figure 6, where the total number of exchanged messages till reaching the

13 See Section V.A in [20].
14 The edge weight is set proportional to the correlation between the two end variables.


    corresponding MSE is used as an indicator for communication energy. Note that compared to MF,

    BP exchanges more messages per round but converges faster. It is interesting to observe that for

    this simulation setting15 SMF consumes the least communication energy to obtain the same

    estimation accuracy, which indicates its potential superiority in practice.

The above analysis and simulations indicate that an appropriate cluster size should be

selected for SMF depending on the application, to achieve a good balance among estimation

accuracy, convergence rate, computational complexity, and energy efficiency. In practice, how to

select an appropriate cluster size is worth further investigation. If the anticipated operational

environment is stable, we may conduct off-line simulations or experiments such as those in Figures 4

and 5 to determine the best cluster size. In more dynamic scenarios, we can resort to a cross-

    layer approach to facilitate re-clustering. For example, when there is a demand for higher quality

    of service from the application layer, the inference module will notify the clustering module to re-

    cluster with a larger cluster size. This can be easily accommodated through our distributed

    clustering scheme in [14], which has a procedure to merge clusters of smaller sizes into larger

    ones till the designated conditions are satisfied. Similarly, a split of current clusters can be

    initiated when a smaller cluster size is desired.

VII. CONCLUSIONS AND FUTURE WORK
In this paper, we develop a general variational message passing framework for distributed

inference in Markov random fields. Structured variational methods are explored to achieve a favorable

tradeoff between system performance and complexity. We also investigate the convergence

    performance of our proposed structured variational methods for distributed inference. We adopt

    a direct probabilistic approach to analyze the convergence of an edge process which models the

    intra-cluster processing, and devise a coupling process to characterize the overall performance

concerning a more complicated mixed vertex-edge process. We expect that both the results obtained

    and the analytical tools developed in this work can be applied to other similar problems in

15 This result should not be viewed as contradictory to (43), which only describes the scaling behavior.


    wireless networks. In particular, the methodologies developed on the mixing behavior of the

edge process and the mixed process are not limited to the Gaussian assumption and should

enjoy wider applicability.

In this study, the quality of distributed inference at convergence has not been addressed. This is a

    very challenging research topic, and any progress in this area will likely lead to significant

impact. Relevant works along this line include [15], [25] and [26]. Another direction for future

    work is to further consider the impact of channel uncertainties and communication constraints

    [26][27]. One recent work [28] is also worth mentioning, where a hidden variable is introduced

    to decouple the dependence among observations, for the purpose of simplifying the optimal local

    decision rules in distributed detection. The feasibility of incorporating this idea into our

    variational processing framework deserves further exploration.

APPENDIX
A. Derivation of Gaussian Variational Message Passing Eq. (13)
Considering the information representation of the posterior distribution given in (12), we can find a

    factorization of the form (1), with the individual node potential functions and edge compatibility

    functions given by

$$\psi_i(X_i, y_i) = \exp\Big( v_i X_i - \tfrac12 F_{ii} X_i^2 \Big), \quad \text{and} \quad
\psi_{ij}(X_i, X_j) = \exp\Big( -\tfrac12 \begin{bmatrix} X_i & X_j \end{bmatrix} \begin{bmatrix} 0 & F_{ij} \\ F_{ij} & 0 \end{bmatrix} \begin{bmatrix} X_i \\ X_j \end{bmatrix} \Big), \qquad (45)$$

where $v_i = H_i^T \boldsymbol{\Xi}_i^{-1} y_i = H_i y_i/\sigma_{N,i}^2$. Defining an extended sufficient statistics vector

$\boldsymbol{\eta}_i(X_i) = \big[X_i,\ X_i^2\big]^T$, the corresponding extended natural parameters in (6) and (7) are

respectively given by

$$\boldsymbol{\theta}_i = \big[ v_i,\ -\tfrac12 F_{ii} \big]^T \quad \text{and} \quad \boldsymbol{\theta}_{ij} = \big[ -F_{ij} X_j,\ 0 \big]^T. \qquad (46)$$

The MF variational distribution $Q_i(X_i)$ can be parameterized with the same sufficient statistics,

and the corresponding natural parameter is given by $\big[ \mu_{MF,i}/\sigma_{MF,i}^2,\ -1/(2\sigma_{MF,i}^2) \big]^T$. Applying the


    variational message passing rule and after some algebra, we obtain the updating form given in

    (13).

B. Derivation of Gaussian Belief Propagation Eq. (16) and (18)
The message passing rules of belief propagation are derived on a spanning tree of the network.$^{16}$

In this setting, the posterior probability (12) assumes the following form, composed of the prior

marginals $p_i(x_i)$ and pairwise joint distributions $p_{ij}(x_i, x_j)$, as well as the marginal conditional

distributions $f_i(y_i \mid x_i)$ (viewed as functions of $x_i$):

$$P(\mathbf{X} \mid \mathbf{y}) \propto \prod_{i \in V} f_i(y_i \mid x_i)\, \frac{\prod_{(i,j) \in E} p_{ij}(x_i, x_j)}{\prod_{i \in V} p_i(x_i)^{|N(i)| - 1}}. \qquad (47)$$

In the above expression, $f_i(y_i \mid x_i)$, viewed as a function of $x_i$, takes the information form

$\mathcal{N}^{-1}\big( H_i y_i/\sigma_{N,i}^2,\ H_i^2/\sigma_{N,i}^2 \big)$, $p_i(x_i) \sim \mathcal{N}^{-1}\big( 0,\ 1/\sigma_{S,i}^2 \big)$, and

$p_{ij}(x_i, x_j) \sim \mathcal{N}^{-1}\big( \mathbf{0},\ \boldsymbol{\Omega}_{ij} \big)$, where

$$\boldsymbol{\Omega}_{ij} = \frac{1}{1 - \beta_{ij}^2} \begin{bmatrix} \sigma_{S,i}^{-2} & -\beta_{ij}\, \sigma_{S,i}^{-1} \sigma_{S,j}^{-1} \\ -\beta_{ij}\, \sigma_{S,i}^{-1} \sigma_{S,j}^{-1} & \sigma_{S,j}^{-2} \end{bmatrix}.$$

Note that with the information parameter representation, if $p_1(\mathbf{x}) \sim \mathcal{N}^{-1}(\mathbf{h}_1, \mathbf{J}_1)$ and

$p_2(\mathbf{x}) \sim \mathcal{N}^{-1}(\mathbf{h}_2, \mathbf{J}_2)$ are two different distributions on the same Gaussian random

vector $\mathbf{x}$, then the product density

$$p_{12}(\mathbf{x}) \propto p_1(\mathbf{x})\, p_2(\mathbf{x}) \sim \mathcal{N}^{-1}\big( \mathbf{h}_{12},\ \mathbf{J}_{12} \big) \qquad (48)$$

with $\mathbf{h}_{12} = \mathbf{h}_1 + \mathbf{h}_2$ and $\mathbf{J}_{12} = \mathbf{J}_1 + \mathbf{J}_2$. Similarly, the quotient $p_1(\mathbf{x})/p_2(\mathbf{x})$ produces an

exponential quadratic form with parameters $\big( \mathbf{h}_1 - \mathbf{h}_2,\ \mathbf{J}_1 - \mathbf{J}_2 \big)$, which defines a valid

probability density when $\mathbf{J}_1 - \mathbf{J}_2$ is positive definite. Moreover, if $\mathbf{x}_1$ and $\mathbf{x}_2$ are two jointly

16 Various forms have been developed in the literature (e.g., [15]). We provide a derivation for the specific forms of (16) and (18) here for completeness, which will also facilitate the discussion in Section III.


Gaussian random vectors with distribution $p(\mathbf{x}_1, \mathbf{x}_2) \sim \mathcal{N}^{-1}\left( \begin{bmatrix} \mathbf{h}_1 \\ \mathbf{h}_2 \end{bmatrix}, \begin{bmatrix} \mathbf{J}_{11} & \mathbf{J}_{12} \\ \mathbf{J}_{21} & \mathbf{J}_{22} \end{bmatrix} \right)$, the marginal

distribution $p(\mathbf{x}_1) \sim \mathcal{N}^{-1}\big( \hat{\mathbf{h}}_1,\ \hat{\mathbf{J}}_1 \big)$ is also Gaussian, with information parameters given by

$$\hat{\mathbf{h}}_1 = \mathbf{h}_1 - \mathbf{J}_{12} \mathbf{J}_{22}^{-1} \mathbf{h}_2, \qquad \hat{\mathbf{J}}_1 = \mathbf{J}_{11} - \mathbf{J}_{12} \mathbf{J}_{22}^{-1} \mathbf{J}_{21}. \qquad (49)$$

Comparing (47) with the standard Hammersley-Clifford form (1), we have the node potential

function

$$\psi_i(X_i) = \mathcal{N}^{-1}\big( \nu_i,\ V_i \big), \qquad (50)$$

where the parameters are given in (17), and the edge compatibility function

$$\psi_{ij}(X_i, X_j) = \mathcal{N}^{-1}\big( \mathbf{0},\ \boldsymbol{\Omega}_{ij} \big). \qquad (51)$$

Consider the general message updating and belief computation formulas in belief propagation

[1][12]:

$$m_{i\to j}^{(n)}(x_j) = \int_{x_i} \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in N(i)\setminus\{j\}} m_{k\to i}^{(n-1)}(x_i)\, dx_i, \qquad
b_i^{(n)}(x_i) = \psi_i(x_i) \prod_{k \in N(i)} m_{k\to i}^{(n)}(x_i).$$

By applying the product rule (48) to the message updating equation, the integrand satisfies

$$\psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in N(i)\setminus\{j\}} m_{k\to i}^{(n-1)}(x_i) \sim \mathcal{N}^{-1}\big( \boldsymbol{\iota}_{ij},\ \mathbf{J}_{ij} \big), \qquad (52)$$

where

$$\boldsymbol{\iota}_{ij} = \begin{bmatrix} \nu_i + \sum_{k \in N(i)\setminus\{j\}} \iota_{k\to i}^{(n-1)} \\ 0 \end{bmatrix}, \qquad (53)$$

$$\mathbf{J}_{ij} = \begin{bmatrix}
V_i + \sum_{k \in N(i)\setminus\{j\}} J_{k\to i}^{(n-1)} + \dfrac{1}{\big(1 - \beta_{ij}^2\big)\, \sigma_{S,i}^2} & -\dfrac{\beta_{ij}}{\big(1 - \beta_{ij}^2\big)\, \sigma_{S,i} \sigma_{S,j}} \\[8pt]
-\dfrac{\beta_{ij}}{\big(1 - \beta_{ij}^2\big)\, \sigma_{S,i} \sigma_{S,j}} & \dfrac{1}{\big(1 - \beta_{ij}^2\big)\, \sigma_{S,j}^2}
\end{bmatrix}, \qquad (54)$$

in which $\iota_{k\to i}^{(n-1)}$ and $J_{k\to i}^{(n-1)}$ denote the information parameters of the incoming messages.

Using the marginalization rule (49) to carry out the integration over $x_i$, we obtain the

message updating rule (16). The belief updating rule (18) follows more directly from the

product rule (48).


C. Derivation of Inter-cluster Variational Message Passing Eq. (20)
As in Appendix A, define the extended sufficient statistics for the cluster $C_i$ as

$$\boldsymbol{\eta}_{C_i}(\mathbf{X}) = \big[ \{X_i\},\ \{X_i^2\},\ \{X_i X_j\} \big]^T, \qquad i, j \in C_i, \qquad (55)$$

where $\{X_i\}$ denotes the collection of all $X_i$'s in cluster $C_i$, and the other two terms are defined

similarly. Referring to (19) and (45), the corresponding extended natural parameters of the

cluster potential and cluster compatibility functions are given by

$$\boldsymbol{\theta}_{C_i} = \big[ \{v_i\},\ \{-\tfrac12 F_{ii}\},\ \{-F_{ij}\} \big]^T,\ \ i, j \in C_i, \quad \text{and} \quad
\boldsymbol{\theta}_{C_i C_j} = \big[ \{-F_{ij} X_j\},\ \{0\},\ \{0\} \big]^T,\ \ i \in C_i,\ j \in MB(C_i), \qquad (56)$$

where the involved parameters are the same as in (46). Considering the structure of $\boldsymbol{\theta}_{C_i C_j}$ above,

we only need to focus on the first component when applying the variational message passing rule

in Section II.C. In particular, the messages are exchanged between gateway nodes and their

Markov blankets, taking the form $E_{Q^{(n-1)}}\big[ \boldsymbol{\theta}_{C_j C_i} \big]$ according to (9); then, following the spirit of

(10), such messages are used together with the parameters of the gateway nodes, $v_i = H_i y_i/\sigma_{N,i}^2$,

to update the corresponding variational distributions. Recall that in our structured variational

method, the variational distribution is assumed to take the form $Q(\mathbf{X}) = \prod_i Q_{C_i}$, and Gaussian

belief propagation is adopted for intra-cluster inference. Therefore, the variational message

passing rule indicates the following interaction between neighboring clusters: the intra-cluster

inference in the previous round provides a mean estimate for all the variables in the Markov

blanket of cluster $C_i$, i.e., $E_{Q^{(n-1)}}[X_j] = \big(W_j^{(n-1)}\big)^{-1} q_j^{(n-1)}$, $j \in MB(C_i)$ (c.f. (18)). Then this estimate is

used to construct a "new" observation $y_i^{(n)}$ as given in (20) to stimulate a new round of intra-

cluster inference through (16)-(18).


    D. Proof of Lemma 4.2

Notice that the $T_{i+1} - T_i$ are i.i.d. geometric random variables with parameter $2q(n)/n$. We can

readily compute the probability

$$p = \Pr\Big( T_{i+1} - T_i \le \lfloor n/q(n)^{3/4} \rfloor,\ T_i - T_{i-1} \le \lfloor n/q(n)^{3/4} \rfloor \Big)
= \Big[ 1 - \big( 1 - 2q(n)/n \big)^{\lfloor n/q(n)^{3/4} \rfloor} \Big]^2, \qquad (57)$$

which is non-vanishing as $n \to \infty$ (for all $q(n) = O(n)$). Choose a constant $c_2 > 1/p$ and

consider a group of $\lfloor 2c_2\, q(n)^{3/2} \rfloor$ positive odd integers. By the Chernoff bound, the probability

that there are at least $\lceil q(n)^{3/2} \rceil$ positive odd integers $i$ satisfying $T_{i+1} - T_i \le \lfloor n/q(n)^{3/4} \rfloor$ and

$T_i - T_{i-1} \le \lfloor n/q(n)^{3/4} \rfloor$ in such a group is at least $1 - \exp\big( -\tfrac12 \big(1 - 1/(c_2 p)\big)^2\, c_2 p\, q(n)^{3/2} \big)$, which

approaches 1 as $n$ goes to infinity. Finally, consider a constant $c_1 = 2(c_2 + 1)$; we claim

that in the first $\lfloor c_1 \sqrt{q(n)}\, n \rfloor$ steps there are at least $\lfloor 2c_2\, q(n)^{3/2} \rfloor$ turns w.h.p. To prove this claim,

note that the random variable $T_{\lfloor 2c_2 q(n)^{3/2} \rfloor}$ is the sum of $\lfloor 2c_2\, q(n)^{3/2} \rfloor$ i.i.d. geometric random

variables, each with mean $n/(2q(n))$ and variance between $n^2/(12\, q(n)^2)$ and $n^2/(4\, q(n)^2)$. By

Chebyshev's inequality,

$$\Pr\Big( T_{\lfloor 2c_2 q(n)^{3/2} \rfloor} \ge c_1 \sqrt{q(n)}\, n \Big)
\le \frac{\mathrm{var}\Big( T_{\lfloor 2c_2 q(n)^{3/2} \rfloor} \Big)}{\Big( c_1 \sqrt{q(n)}\, n - E\Big[ T_{\lfloor 2c_2 q(n)^{3/2} \rfloor} \Big] \Big)^2}
\le \frac{2c_2\, q(n)^{3/2}\, n^2 \big/ \big( 4\, q(n)^2 \big)}{\Big( (c_2 + 2) \sqrt{q(n)}\, n \Big)^2} \to 0. \qquad (58)$$

    (58)

E. Proof of Lemma 4.3

Recall that $T_{i+1} - T_i$ are i.i.d. geometric random variables with parameter $2q(n)/n$. For turning times indexed by $S_1$, there is a useful property given as follows. If $T_{i-1} = j$ and $T_{i+1} = j + L$, it is known that [22]
$$\Pr\!\left(T_i = k \mid T_{i-1} = j,\, T_{i+1} = j + L\right) = \frac{1}{L-1}, \quad k \in \{j+1, j+2, \ldots, j+L-1\}.$$
That is, when $T_{S_2}$ is fixed, for every $i \in S_1$ the component vector $\begin{pmatrix} T_{i+1} - T_i \\ T_i - T_{i-1} \end{pmatrix}$ in (34) and (35) is a uniform random vector on the space $\Omega = \left\{-n/q^{3/4}(n), \ldots, -1, 1, \ldots, n/q^{3/4}(n)\right\}^{2}$. From Lemma 4.2, we know that if $t \ge c_1 q(n) n$, there are at least $q^{3/2}(n)$ such uniform random vectors on $\Omega$, with zero mean and covariance matrix $\Sigma$ whose entries are all on the order of $n^2/q^{3/2}(n)$. Denote these random vectors by $\mathbf{X}_m(n)$, $m = 1, \ldots, q^{3/2}(n)$, and let $\mathbf{Y}_m(n) = \Sigma^{-1/2} \mathbf{X}_m(n) / q^{3/4}(n)$. Then from the Central Limit Theorem, the summation of the $\mathbf{Y}_m(n)$ converges to a standard multivariate normal distribution:
$$\mathbf{S}_n \triangleq \sum_{m=1}^{q^{3/2}(n)} \mathbf{Y}_m(n) \xrightarrow{D} \mathcal{N}\!\left(\mathbf{0}_{2\times 1}, \mathbf{I}_{2\times 2}\right). \quad (59)$$

Clearly the probability $\Pr\!\left(c_3 \le \|\mathbf{S}_n\| \le c_4\right)$ is non-vanishing for any constants $c_3$ and $c_4$, where $\|\cdot\|$ denotes the Euclidean distance. Noting that the entries of $q^{3/4}(n)\,\Sigma^{1/2}$ are on the order of $n$, there exists a constant $c_5$ such that $\Pr\!\left(n \le \left\|q^{3/4}(n)\,\Sigma^{1/2}\mathbf{S}_n\right\| \le 2n\right) \ge c_5$. Also note that $\mathbf{S}_n$ is a zero-mean unimodal symmetric random vector; therefore there exists a constant $c$ such that $\Pr\!\left(q^{3/4}(n)\,\Sigma^{1/2}\mathbf{S}_n = \mathbf{x}\right) \ge c/n^2$ for such a point $\mathbf{x}$. Now consider the sum of $q^{3/2}(n)$ i.i.d. random vectors which are uniform on $\Omega$ (those indexed by $S_1$ in (34) and (35), with the rest fixed) and denote it by $\mathbf{U}_n$; then $\mathbf{U}_n$ and $q^{3/4}(n)\,\Sigma^{1/2}\mathbf{S}_n$ have the same distribution. Therefore the lemma is proved.
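The normalization leading to (59) can be illustrated with a small Monte Carlo experiment. In the sketch below, the grid half-width `w` and sample count `m` are illustrative stand-ins for $n/q^{3/4}(n)$ and $q^{3/2}(n)$, and the two coordinates are drawn independently so that $\Sigma$ is diagonal, a simplification of the setting in the proof.

```python
import numpy as np

# Numerical sketch of the CLT step behind (59): m i.i.d. uniform random
# vectors on a symmetric 2-d grid, whitened and scaled by 1/sqrt(m),
# should behave like a standard bivariate normal. w and m below are
# illustrative assumptions, not values from the paper.

rng = np.random.default_rng(1)
w, m, runs = 50, 1000, 2000

# Per-coordinate support {-w, ..., -1, 1, ..., w} (zero excluded),
# matching the shape of the space Omega in the proof.
support = np.concatenate([np.arange(-w, 0), np.arange(1, w + 1)])
sigma2 = np.mean(support.astype(float) ** 2)   # per-coordinate variance

# Draw `runs` independent copies of S_n = sum of m whitened vectors.
# Coordinates are sampled independently here, so Sigma = sigma2 * I.
X = rng.choice(support, size=(runs, m, 2)).astype(float)
S = X.sum(axis=1) / np.sqrt(m * sigma2)

cov = np.cov(S.T)   # empirical covariance; should be close to I_2
print(np.round(cov, 2))
```

The empirical covariance of the normalized sums approaches the identity as `runs` grows, consistent with the limit $\mathcal{N}(\mathbf{0}, \mathbf{I}_{2\times 2})$.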

F. Proof of Lemma 5.1

From the state representation of the edge process (cf. Figure 3), we know that hitting an outgoing edge on the boundary corresponds to $s_t^0 = 1$ or $n$. According to (34) and (35), $s_t^0$ is given by
$$s_t^0 = \begin{cases} a + (t - T_k) - (T_k - T_{k-1}) + \cdots - (T_2 - T_1), & k \text{ is even}, \\ b - (t - T_k) + (T_k - T_{k-1}) - \cdots + (T_2 - T_1), & k \text{ is odd}, \end{cases} \quad (60)$$
which is the sum of $k$ geometric random variables with zero mean and variance $n^2/q^{3/2}(n)$. We learn from the proof of Lemma 4.2 that in the first $c_1 q(n) n$ steps there are $k = O(q^{3/2}(n))$ turns. Let us consider $s_t^0 = n$, for which the local limit theorem ([23], page 10) can be invoked to get
$$n \Pr\!\left(s_t^0 = n\right) \to \frac{2}{\sqrt{2\pi}} \exp\!\left(-\frac{n^2}{2 \left(n^2/q^{3/2}(n)\right) q^{3/2}(n)}\right), \quad (61)$$
or $\Pr(s_t^0 = n) \ge 1/(2n\sqrt{e})$ as $n \to \infty$. Then by the Chernoff bound, the probability that there is at least one time that the walk hits the boundary edges before $t = c_6 q(n) n$, for some constant $c_6$, is at least
$$1 - \exp\!\left(-\frac{1}{2} c_7 q(n) \left(1 - \frac{1}{c_7 q(n)}\right)^{2}\right) \quad \left(c_7 = c_6/(2\sqrt{e})\right),$$
which approaches 1 as $n \to \infty$.
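The behavior of this final bound can be tabulated directly. The sketch below assumes the illustrative choices $q(n) = \sqrt{n}$ and $c_6 = 4$ (neither is fixed by the paper), with $c_7 = c_6/(2\sqrt{e})$ as in the proof, and shows the lower bound on the boundary-hitting probability approaching 1.

```python
import math

# Evaluate the closing Chernoff lower bound from the proof of Lemma 5.1:
#   1 - exp(-(1/2) * c7 * q(n) * (1 - 1/(c7*q(n)))^2),  c7 = c6/(2*sqrt(e)).
# q(n) = sqrt(n) and c6 = 4 are illustrative assumptions only.

def hit_probability_bound(n, c6=4.0):
    """Lower bound on hitting a boundary edge within c6*q(n)*n steps."""
    q = math.sqrt(n)
    c7 = c6 / (2.0 * math.sqrt(math.e))
    return 1.0 - math.exp(-0.5 * c7 * q * (1.0 - 1.0 / (c7 * q)) ** 2)

for n in (10, 100, 10000):
    print(n, round(hit_probability_bound(n), 6))
```

The bound is increasing in $n$ and tends to 1, matching the proof's conclusion that the walk hits the boundary within $c_6 q(n) n$ steps w.h.p.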

REFERENCES

[1] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
[2] G. Cooper, "Probabilistic inference using belief networks is NP-hard," Artificial Intelligence, vol. 42, pp. 393-405, 1990.
[3] C. C. Moallemi and B. Van Roy, "Consensus propagation," IEEE Transactions on Information Theory, vol. 52, no. 11, pp. 4753-4766, 2006.
[4] T. Jaakkola, "Tutorial on variational approximation methods," in Advanced Mean Field Methods: Theory and Practice. MIT Press, 2000.
[5] C. M. Bishop, J. M. Winn, and D. Spiegelhalter, "VIBES: A variational inference engine for Bayesian networks," Advances in Neural Information Processing Systems, 2002.
[6] J. Dauwels, "On variational message passing on factor graphs," Proc. International Symposium on Information Theory (ISIT), 2007.
[7] L. Saul and M. Jordan, "Exploiting tractable substructures in intractable networks," in Advances in Neural Information Processing Systems, vol. 8. MIT Press, 1996.
[8] E. P. Xing, M. I. Jordan, and S. Russell, "A generalized mean field algorithm for variational inference in exponential families," in Proc. 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2003.
[9] D. Randall, "Rapidly mixing Markov chains with applications in computer science and physics," Computing in Science and Engineering, vol. 8, no. 2, March 2006.
[10] P. Diaconis, S. Holmes, and R. M. Neal, "Analysis of a non-reversible Markov chain sampler," Biometrics Unit, Cornell University, Tech. Rep. BU-1385-M, 1997.
[11] T. Lindvall, Lectures on the Coupling Method. Courier Dover Publications, 2002.
[12] J. Yedidia, W. Freeman, and Y. Weiss, "Understanding belief propagation and its generalizations," Technical Report TR-2001-22, Mitsubishi Electric Research Laboratories, 2001.
[13] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, nos. 1-2, pp. 1-305, 2008.
[14] Y. Zhang and H. Dai, "Distributed network decomposition: A probabilistic greedy approach," 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, TX, Mar. 2010.
[15] Y. Weiss and W. Freeman, "Correctness of belief propagation in Gaussian graphical models of arbitrary topology," Neural Computation, vol. 13, no. 10, pp. 2173-2200, 2001.
[16] D. Malioutov, J. Johnson, and A. Willsky, "Walk-sums and belief propagation in Gaussian graphical models," Journal of Machine Learning Research, 2006.
[17] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Mixing times for random walks on geometric random graphs," SIAM Workshop on Analytic Algorithmics and Combinatorics (ANALCO), Vancouver, Canada, January 2005.
[18] W. Li, H. Dai, and Y. Zhang, "Location aided fast distributed consensus in wireless networks," IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6208-6227, Dec. 2010.
[19] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society, 2008.
[20] V. Delouille, R. Neelamani, and R. Baraniuk, "Robust distributed estimation using the embedded subgraphs algorithm," IEEE Transactions on Signal Processing, vol. 54, no. 8, 2006.
[21] K. Schloegel, G. Karypis, and V. Kumar, "Graph partitioning for high performance scientific simulations," CRPC Parallel Computing Handbook, 2000.
[22] G. Grimmett and D. Stirzaker, Probability and Random Processes. Oxford University Press, 2001.
[23] V. F. Kolchin, Random Graphs. Cambridge University Press, 1999.
[24] E. Riegler, G. E. Kirkelund, C. N. Manchon, and B. H. Fleury, "Merging belief propagation and the mean field approximation: A free energy approach," 2010 6th International Symposium on Turbo Codes and Iterative Information Processing (ISTC), pp. 256-260, Sept. 6-10, 2010.
[25] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, "Tree-based reparameterization framework for analysis of sum-product and related algorithms," IEEE Transactions on Information Theory, vol. 49, no. 5, pp. 1120-1146, May 2003.
[26] A. T. Ihler, J. W. Fisher III, and A. S. Willsky, "Loopy belief propagation: Convergence and effects of message errors," Journal of Machine Learning Research, vol. 6, pp. 905-936, May 2005.
[27] O. P. Kreidl and A. S. Willsky, "Inference with minimal communication: A decision-theoretic variational approach," Advances in Neural Information Processing Systems, 2006.
[28] H. Chen, B. Chen, and P. K. Varshney, "A new framework for distributed detection with conditionally dependent observations," IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1409-1419, March 2012.
[29] Y. Zhang and H. Dai, "Structured variational methods for distributed inference in wireless ad hoc and sensor networks," 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, April 2009.
[30] Y. Zhang and H. Dai, "Structured variational methods for distributed inference: Convergence analysis and performance-complexity tradeoff," 2009 IEEE International Symposium on Information Theory (ISIT), Seoul, South Korea, June 2009.

Huaiyu Dai (M'03, SM'09) received the B.E. and M.S. degrees in Electrical Engineering from Tsinghua University, Beijing, China, in 1996 and 1998, respectively, and the Ph.D. degree in Electrical Engineering from Princeton University, Princeton, NJ, in 2002.

He was with Bell Labs, Lucent Technologies, Holmdel, NJ, during summer 2000, and with AT&T Labs-Research, Middletown, NJ, during summer 2001. Currently he is an Associate Professor of Electrical and Computer Engineering at NC State University, Raleigh. His research interests are in the general areas of communication systems and networks, advanced signal processing for digital communications, and communication theory and information theory. His current research focuses on networked information processing and cross-layer design in wireless networks, cognitive radio networks, wireless security, and associated information-theoretic and computation-theoretic analysis.

He has served as an editor of the IEEE Transactions on Communications, Signal Processing, and Wireless Communications. He co-edited two special issues for EURASIP journals, on distributed signal processing techniques for wireless sensor networks and on multiuser information theory and related applications, respectively. He co-chairs the Signal Processing for Communications Symposium of IEEE Globecom 2013, the Communications Theory Symposium of IEEE ICC 2014, and the Wireless Communications Symposium of IEEE Globecom 2014.

Yanbing Zhang (M'09) received the B.E. and M.S. degrees in electronics engineering from Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the Ph.D. degree in electrical engineering from North Carolina State University, Raleigh, in 2009. Currently, he is a Staff Scientist in the Mobile Communications Group, Broadcom Inc., Matawan, NJ. His research interests are in the general areas of wireless communications and networking and signal processing for wireless communications, with emphasis on cooperative communication and information processing in wireless networks.

Juan Liu received her B.S. degree in Information and Electronic Engineering from Zhejiang University, Hangzhou, China, in 2000, her M.S. degree in Information Engineering from Beijing University of Posts and Telecommunications, Beijing, China, in 2005, and her Ph.D. degree in Electronic Engineering from Tsinghua University, Beijing, China, in 2011. She is currently a postdoctoral researcher in the Department of Electrical and Computer Engineering, NC State University, Raleigh, NC. Her research interest is in wireless communications.

Figure 1 Markov blanket (shaded nodes) and Markov blanket clusters (shaded clusters) for cluster $C_i$.

Figure 2 Vertex Process (a), Edge Process (b) and Mixed Process (c)

Figure 3 State labeling of edge process on a 2-d torus


Figure 4 Mixing times of the vertex and edge processes

    Figure 5 Mean square error of estimation versus time complexity

[Plot: mean square error (log scale, 10^-5 to 10^0) versus iteration number (0 to 60); curves: MF, BP, SMF - Centralized Clustering, SMF - Distributed Clustering; annotation: Cluster Size.]


    Figure 6 Mean square error of estimation versus message complexity

[Plot: mean square error (log scale, 10^-5 to 10^-1) versus number of exchanged messages (0 to 250); curves: BP, MF, SMF - Centralized Clustering, SMF - Distributed Clustering; annotation: Cluster Size.]


Recommended