
UNIVERSITY OF WISCONSIN, MACHINE LEARNING RESEARCH GROUP WORKING PAPER 06-1, SEPTEMBER 2006

Improving the Efficiency of Belief Propagation in Large, Highly Connected Graphs

Frank DiMaio and Jude Shavlik
Computer Sciences Dept.

University of Wisconsin–Madison
Madison, WI 53706

{dimaio,shavlik}@cs.wisc.edu

Abstract

We describe a part-based object-recognition framework, specialized to mining complex 3D objects from detailed 3D images. Objects are modeled as a collection of parts together with a pairwise potential function. The algorithm's key component is an efficient inference algorithm, based on belief propagation, that finds the optimal layout of parts, given some input image. Belief Propagation (BP) – a message passing method for approximate inference in graphical models – is well suited to this task. However, for large objects with many parts, even BP may be intractable. We present AggBP, a message aggregation scheme for BP, in which groups of messages are approximated as a single message, producing a message update analogous to that of mean-field methods. For objects consisting of N parts, we reduce CPU time and memory requirements from O(N²) to O(N). We apply AggBP to both real-world and synthetic tasks. First, we use our framework to recognize protein fragments in three-dimensional images. Scaling BP to this task for even average-sized proteins is infeasible without our enhancements. We then use a synthetic "object generator" to test our algorithm's ability to locate a wide variety of part-based objects. These experiments show that our improvements result in minimal loss of accuracy, and in some cases produce a more accurate solution than standard BP.

1 Introduction

Several recent publications have explored the use of part-based models for recognizing generic objects in images [4, 19, 9]. These models represent physical objects as a graph: a collection of vertices ("parts") connected by edges enforcing pairwise constraints. An inference algorithm determines the most probable location of each part in the model given the image. However, this previous work has only considered simple objects with relatively few parts, often only using two-dimensional image data. We present a part-based object recognition algorithm specialized to objects with hundreds of parts in detailed, three-dimensional images.

Rich, three-dimensional data commonly arises in biological datasets, especially with recent advancements in biological imaging techniques. For example, fMRI scans produce detailed 3D images of the brain. Confocal microscopy constructs high-quality 3D images of tissues. X-ray crystallography yields a 3D electron density map, a three-dimensional "image" of a macromolecule. This three-dimensional data often contains objects comprised of many parts, connected with some complex topology. As detailed biological imagery becomes easier to acquire, techniques to accurately interpret such images are needed. For example, a vascular biologist may want to automatically locate all the blood vessels in a kidney section; a crystallographer may want to trace a piece of RNA in an electron density map. Even rich two-dimensional data, such as detailed satellite imagery, may contain complex objects that cannot be interpreted using current methods.

To effectively mine complex 3D objects, our algorithm includes an efficient message-passing inference algorithm based on belief propagation [15]. Message-passing algorithms are an extremely powerful tool for inference in graphical models (that is, probabilistic models defined on a graph). Belief propagation (BP) – also known as the sum-product algorithm – is a message-passing method for exactly computing marginal distributions in tree-structured graphs. In graphs with arbitrary topologies, no such optimality is guaranteed. Empirically, however, "loopy BP" often provides extremely accurate approximations when exact inference methods are intractable [6, 13, 21].

For very large, highly-connected graphs, with large input images, even loopy BP may not offer enough efficiency. In near-fully connected graphs, with hundreds or thousands of vertices, approximations to BP's messages may be necessary to compute marginal distributions in a reasonable amount of time. We describe AggBP (for aggregate BP), a technique for approximating groups of BP messages with a single message. This composite message turns out to be quite similar to the message update for mean-field methods. We illustrate that, for certain types of graphs, AggBP may reduce running time in a (near-fully connected) graph with N nodes from O(N²) to O(N).

Additionally, we provide a method for dealing with continuously-valued variables that is efficient and does not require accurate initialization. Recently, an extension to BP, nonparametric belief propagation (NBP), was introduced [18]. NBP represents variables that have continuous non-Gaussian distributions as a mixture of Gaussians. We introduce an efficient variant which instead represents probability distributions over a continuous three-dimensional space as a set of Fourier-series coefficients. We describe efficient message passing and update algorithms in this framework.

Finally, we test our approximation techniques using both real-world and synthetic data. Our first testbed is a real-world computer-vision task, identifying protein fragments in three-dimensional images. Interpreting these protein images is a very important step in determining protein structures using x-ray crystallography. AggBP lets us scale interpretation to large proteins in large 3D images. Our second testbed uses a synthetic object generator to test AggBP's performance locating a wide variety of objects with various part topologies.

2 Modeling 3D Objects

Following others [4], we describe a class of objects using a graphical model. Graphical models, such as Bayesian networks and Markov fields, represent the joint probability distribution over a set of variables as a function defined over some graph. A pairwise undirected graphical model (or pairwise Markov field) represents this joint distribution as a product of potential functions defined on each edge and vertex in the graph. To represent an object as a graph, then, vertices correspond to parts in the object, while edges correspond to constraints between pairs of parts.

Formally, the graph G = (V, E) consists of a set of nodes s ∈ V connected by edges (s, t) ∈ E. Each node in the graph is associated with a (hidden) random variable xs ∈ x, and the graph is conditioned on a set of observation variables y. For object recognition, these xs's are the 3D position of part s. Each vertex has a corresponding observation potential ψs(xs, y), and each edge is associated with a structural potential ψst(xs, xt). Then, we can represent the full joint probability as

p(x|y) ∝ ∏_{(s,t)∈E} ψst(xs, xt) × ∏_{s∈V} ψs(xs|y)    (1)

In many applications, this paper included, we are most concerned with finding the maximum marginal assignment, that is, the labels xs ∈ x that maximize this joint probability for some value of y.
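As a concrete illustration of Eq. (1), the sketch below (not from the paper; the tiny three-part chain, label space, and potentials are hypothetical placeholders) evaluates the unnormalized joint probability of a layout and finds the most probable one by brute-force enumeration, which is only feasible for toy models.

# A minimal sketch of the pairwise Markov field joint probability in Eq. (1).
# The graph, label space, and potentials below are hypothetical placeholders.
import itertools
import numpy as np

def joint_prob(labels, edges, psi_node, psi_edge):
    """Unnormalized p(x|y): product of edge potentials psi_st times node potentials psi_s."""
    p = 1.0
    for s, x_s in enumerate(labels):
        p *= psi_node[s][x_s]                          # observation potential psi_s(x_s|y)
    for (s, t) in edges:
        p *= psi_edge[(s, t)][labels[s], labels[t]]    # structural potential psi_st(x_s, x_t)
    return p

# Toy object: a 3-part chain, each part with 4 candidate positions.
rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2)]
psi_node = [rng.random(4) for _ in range(3)]
psi_edge = {e: rng.random((4, 4)) for e in edges}

# Brute-force search over all 4^3 joint layouts; real models need approximate inference instead.
best = max(itertools.product(range(4), repeat=3),
           key=lambda x: joint_prob(x, edges, psi_node, psi_edge))
print("most probable layout:", best)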

To describe an object using a graphical model, one must provide three pieces of data: a part graph, each node's observation potential, and each edge's structural potential. Given a graph describing an object, these potential functions are learned from a set of previously solved problem instances.

For 3D object recognition, the part graph is fully connected; most edges are associated with identical, weak (diffuse) potentials, while a sparse subset of the graph (the "skeleton") connects very highly correlated variables. As an illustration, consider using a graphical model for recognizing people in images, as in Figure 1. In this model, a sparsely connected skeleton connects highly correlated nodes. For example, the head and body are connected in this skeletal structure, because the position of the head is highly correlated with the position of the body.

However, many other pairs of nodes – such as the left leg and the left arm – are not connected in the skeletal structure, yet their labels are not completely (conditionally) independent. There is a weak correlation: the two parts may not occupy the same location in space. As this constraint is not implicitly modeled by the chain that connects them in the skeletal structure, an edge between them is necessary. These occupancy edges only serve to ensure that two parts in the model do not overlap in 3D space. For example, when modeling a hand [19], occupancy edges are required to ensure two fingers do not occupy the same space. The potential associated with these edges is typically very diffuse; it is non-zero everywhere except in a small neighborhood around the origin (in the node's local coordinates).

Each part's observation potential is usually based on the application of a simple classifier. At each location in 3D space, it returns the probability that a particular part is at that location. Individual part potential functions may use template matching, color matching, edge detection, or any other method. Observation potentials need not be particularly accurate, as belief propagation is able to infer the true location using the combined power of many weak detectors.

As illustrated in the person-detector example, structural potentials are broken into two types: skeletal potentials (or sequential potentials) model the relationship between parts connected in an object's skeleton, while occupancy potentials model the weakly correlated relationship between all other pairs of nodes. Skeletal potentials may take an arbitrary form, learned from a set of allowable object conformations. They may be a function of position as well as orientation of the 3D object. Occupancy potentials take the form of a step function (using a "hard collision" model) or a sigmoidal function (using a "soft collision" model). Occupancy potentials only depend on the position of the connected objects, and are only nonzero if the connected objects are sufficiently far apart.

Figure 1. A sample graphical model for recognizing a person in an image. Thicker dark edges illustrate the highly-correlated "skeleton" of the model, while thinner light edges are weakly-correlated occupancy edges, which ensure two parts do not occupy the same 3D space.

3 Inference: Locating 3D Objects

Given an image and some object's graphical model, inference attempts to find the most-probable location of each of the object's parts in the image. Because our object graph is fully connected, with a number of loops, exact inference methods either will not work (e.g., tree-based methods) or are intractable (e.g., exhaustive methods). Instead, we are forced to rely on approximate inference methods. Our object-recognition framework uses belief propagation, a message-passing approximate inference algorithm.

3.1 Belief propagation

Belief propagation – based on Pearl's polytree algorithm [15] – computes the marginal probability over each xs (the location of each part) by passing a series of local messages. The marginal probability refers to the joint probability where all but one variable is summed out, that is:

bs(xs|y) = ∑_{x1} … ∑_{x_{s−1}} ∑_{x_{s+1}} … ∑_{xN} P(x|y)    (2)

This marginal distribution is important because it provides information about the distribution of some variable (xs above) in the full joint distribution, without requiring one to explicitly compute the (possibly intractable) full joint distribution.

Algorithm 1: Belief propagation

input: Observation potentials ψs(xs|y) and structural potentials ψst(xs, xt)
output: An approximation to the marginal bs(xs|y) ≈ ∑_{x1} … ∑_{x_{s−1}} ∑_{x_{s+1}} … ∑_{xN} P(x|y)

initialize messages to 1
while b's have not converged do
    foreach part s = 1 … N do
        bs(xs|y) ← ψs(xs|y)
        foreach part t = 1 … N do
            if t ≠ s and bt has been updated then
                m^n_{t→s}(xs) ← ∫_{xt} ψst × b^n_t / m^{n−1}_{s→t} dxt
            end
            bs(xs|y) ← bs(xs|y) × m^n_{t→s}(xs)
        end
    end
end

Pseudocode appears in Algorithm 1. At each iteration, a part in the model computes the product of all incoming messages, then passes a convolution of this product to its neighbors (for clarity, the message's dependence on y is usually dropped):

m^n_{t→s}(xs) ∝ ∫_{xt} ψst(xs, xt) × ψt(xt|y) × ∏_{u∈Γ(t)\s} m^{n−1}_{u→t}(xt) dxt    (3)

In the above, Γ(t)\s denotes all neighbors of t in the graph excluding s. Typically, these messages are normalized so that the probabilities sum to unity. Following Koller et al. [11], we assign some order to the nodes, and update the belief at each node sequentially, alternating between forward and backward passes through our ordering. At any iteration, the algorithm computes an approximation to the marginal as the product of incoming messages and the node's observation potential ψs,

b^n_s(xs|y) ∝ ψs(xs|y) × ∏_{u∈Γ(s)} m^n_{u→s}(xs)    (4)

In tree-structured graphs (graphs without cycles), this algorithm is exact. Unfortunately, for many tasks, this limitation is overly restrictive. In graphs with arbitrary topologies, there are no guarantees on the convergence of this algorithm – and convergence may not be to the correct solution – but empirical results show that "loopy BP" often produces good estimates in practice [14].
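For readers who want to experiment with the updates in Eqs. (3)-(4), here is a minimal sketch of loopy BP for discrete labels. It is a simplification of Algorithm 1: it uses synchronous rather than sequential forward/backward updates, and all names and data structures are illustrative assumptions, not the authors' implementation.

# A minimal loopy-BP sketch for discrete labels, following Eqs. (3)-(4).
# Assumes psi_edge[(t, s)] is indexed [x_t, x_s] and is provided for every ordered pair of neighbors.
import numpy as np

def loopy_bp(psi_node, psi_edge, neighbors, n_iters=20):
    N, L = len(psi_node), len(psi_node[0])
    msgs = {(t, s): np.ones(L) / L for t in range(N) for s in neighbors[t]}
    for _ in range(n_iters):
        new_msgs = {}
        for t in range(N):
            for s in neighbors[t]:
                # Eq. (3): observation potential times messages from all neighbors of t except s,
                # then summed ("convolved") against the edge potential over x_t.
                prod = psi_node[t].copy()
                for u in neighbors[t]:
                    if u != s:
                        prod = prod * msgs[(u, t)]
                m = psi_edge[(t, s)].T @ prod
                new_msgs[(t, s)] = m / m.sum()      # normalize so the message sums to one
        msgs = new_msgs
    # Eq. (4): belief = observation potential times all incoming messages.
    beliefs = []
    for s in range(N):
        b = psi_node[s].copy()
        for u in neighbors[s]:
            b = b * msgs[(u, s)]
        beliefs.append(b / b.sum())
    return beliefs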

Several papers have explored circumstances under which loopy BP's convergence or optimality can be guaranteed. Weiss has shown a category of graphical models with a single loop in which optimality is guaranteed [22]. More recent work [24] has shown the existence of fixed-points in loopy BP, but they are neither unique nor optimal. Heskes [7] has developed sufficient conditions for the uniqueness of BP's fixed-points. Others have characterized the fixed-points in loopy BP [20].

Others have explored message approximation in loopy BP. When exact message computation is intractable, stochastic approximation of messages [11] as well as message simplification [1] have been investigated. Additionally, when dealing with continuous-valued variables, some sort of approximation or simplifying assumptions must be made [9, 18]. Ihler et al. [8] have explored the consequences of approximating messages in BP, placing bounds on accumulated message errors as BP progresses.

A recent paper [18] investigates the special case where the labels xt are continuously valued. Using ideas from particle filtering [3], these authors use weighted-Gaussian probability density estimates. That is, given a set of weights w^i_s, i = 1 … N, a set of Gaussian centers µ^i_s, and a covariance matrix Λs, an estimate of the belief is given by

b^n_s(xs|y) = ∑_i w^(i)_s × N(xs; µ^(i)_s, Λs)    (5)

Message computation is implemented using an efficient Gibbs sampling routine. The Gibbs sampler [8] approximates the product of k Gaussian mixtures – each with M components – as an M-component mixture. This sampling is used to compute BP message products. For the BP convolution operation, forward sampling is employed. Their inference algorithm was applied to several vision tasks. Isard [9] makes use of a similar technique, with a sampling routine specialized to mixture-of-Gaussian edge potentials.
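For reference, a mixture-of-Gaussians belief of the form in Eq. (5) can be evaluated as in the sketch below. The weights, centers, and covariance are made-up placeholders rather than quantities from NBP [18].

# A small sketch of the weighted-Gaussian belief estimate of Eq. (5).
import numpy as np
from scipy.stats import multivariate_normal

def nbp_belief(x, weights, centers, cov):
    """b(x) = sum_i w_i * N(x; mu_i, Lambda), with a shared covariance Lambda."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu in zip(weights, centers))

weights = np.array([0.5, 0.3, 0.2])                          # mixture weights (sum to one)
centers = np.array([[0.0, 0.0, 0.0], [1, 1, 0], [2, 0, 1]])  # Gaussian centers in 3D
cov = 0.25 * np.eye(3)                                       # shared covariance Lambda_s
print(nbp_belief(np.zeros(3), weights, centers, cov))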

In the next sections, we provide several techniques to scale belief propagation for 3D part-based object recognition. One of these techniques is message aggregation, used to handle the large, highly connected graphs that arise in mining complex 3D images. We also include an alternate representation for continuously-valued beliefs and potentials, which allows for efficient message computation and products.

4 Scaling Belief Propagation

Belief propagation was originally intended for small, sparsely connected graphs. In large, highly-connected graphs, the number of messages quickly becomes overwhelming. To make BP tractable in these types of graphs, we propose AggBP, which approximates some subset of outgoing messages at a single node with a single message, replacing many message computations with relatively few.

4.1 BP Message Aggregation

In the undirected graphical models used for 3D object recognition, pairs of nodes along skeleton edges are highly correlated. Consequently, messages along these edges have a high information content. It is important to exactly compute messages along these edges. Coarse approximations – like those used in mean-field methods [10] – introduce too much error.

However, in these graphs, the majority of edges are occupancy edges, which enforce the constraint that two parts cannot occupy the same 3D space. The potential functions associated with these edges are weak – that is, nearly uniform – and messages along these edges carry little information. Along these edges, we can make some approximations; a full BP message update may be overkill.

Formally, BP's message update, given by Equation 3, can be alternately written as (again, the explicit dependence of the message on y is dropped for clarity):

m^n_{t→s}(xs) ← α1 ∫_{xt} ψst(xs, xt) × b^n_t(xt|y) / m^{n−1}_{s→t}(xt) dxt    (6)

The denominator in the above, m^{n−1}_{s→t}(xt), is a term that serves to avoid double-counting or "feedback", making the method exact in tree-structured graphs. In loopy graphs, such feedback – through the graph's loops – is unavoidable. For messages along occupancy edges this denominator carries little information, and AggBP drops it with little loss of accuracy. This gives an update equation more like the naïve mean-field theory update:

m^n_{t→s}(xs) ← α2 ∫_{xt} ψst(xs, xt) × b^n_t(xt|y) dxt    (7)

The key advantage of doing this – assuming that the structural potential ψst is identical along all occupancy edges – is that all occupancy messages outgoing from a single node are identical. For the remainder of this section, we will refer to these approximate messages as m_{t→∗}(x∗).

Assuming identical ψst's, AggBP reduces the number of occupancy messages computed from O(N²) to O(N) in a model with N parts. However, updating the belief for some part still requires multiplying all the incoming occupancy messages times all the incoming skeletal messages; for an N-part model, this is still O(N²). To reduce this complexity, we utilize the fact that each node receives this broadcast message from all but a few nodes in the graph: its neighbors (in the skeleton graph) and itself. We consider, then, sending all these aggregate messages to a central accumulator:

ACC(x∗) ← ∏_{t=1}^{N} m_{t→∗}(x∗)    (8)

This accumulator is then used to efficiently update a node's belief, by sending – in a single message – the product of all occupancy messages.

Figure 2. Our message aggregation approximates (a) all the outgoing messages at node 3, with (b) a single message sent to all non-adjacent nodes. Caching these aggregate message products results in significant run-time savings.

Figure 2 illustrates AggBP when the graph is a chain. For a chain, we compute the product of incoming messages, using this accumulator, as:

b1 ← ψ1 × ACC × m_{2→1} / (m_{1→∗} × m_{2→∗})
b2 ← ψ2 × ACC × (m_{1→2} × m_{3→2}) / (m_{1→∗} × m_{2→∗} × m_{3→∗})
⋮
bk ← ψk × ACC × (m_{k−1→k} × m_{k+1→k}) / (m_{k−1→∗} × m_{k→∗} × m_{k+1→∗})
⋮

The numerators of these message updates contain skeletal messages, while the denominators contain the approximated occupancy messages. AggBP reduces the runtime and memory requirements from O(N²) to O(N) in a model with N parts. The storage benefit is especially appealing when the 3D space for each part is large, and storing O(N²) messages is space-prohibitive. Section 5.1 provides a closer look at one such application where this is the case.
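A minimal sketch of this accumulator trick, under the assumption of identical occupancy potentials, is shown below; the data structures and names are illustrative, not the paper's implementation.

# AggBP-style belief update sketch: the cached accumulator (Eq. 8) supplies the product of
# all broadcast occupancy messages; the node's own broadcast and those of its skeleton
# neighbors are divided out and replaced by exact skeletal messages, as in the chain above.

def aggbp_belief(s, psi_node, skeletal_msgs, broadcast_msgs, accumulator, skel_neighbors):
    """psi_node[s]: observation potential over the label grid (numpy array);
       skeletal_msgs[(t, s)]: exact message along skeleton edge t->s;
       broadcast_msgs[t]: aggregate occupancy message m_{t->*};
       accumulator: elementwise product of all broadcast messages."""
    b = psi_node[s] * accumulator
    b = b / broadcast_msgs[s]                    # a node sends no occupancy message to itself
    for t in skel_neighbors[s]:
        # replace t's broadcast with the exact skeletal message m_{t->s}
        b = b * skeletal_msgs[(t, s)] / broadcast_msgs[t]
    return b / b.sum()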

Algorithm 2: Aggregate belief propagation (same I/O as Algorithm 1).

initialize accumulator ACC, messages m to 1
while b's have not converged do
    foreach part s = 1 … N do
        ACC ← ACC / m^{n−1}_{s→∗}
        bs(xs|y) ← ψs × ACC
        foreach part t = 1 … N do
            if s is skeleton neighbor of t then
                if bt has been updated then
                    m^n_{t→s}(xs) ← ∫_{xt} ψst × b^n_t / m^{n−1}_{s→t} dxt
                end
                bs(xs|y) ← bs(xs|y) × (m^n_{t→s} / m^n_{t→∗})
            end
        end
        // compute composite message
        m^n_{s→∗}(x∗) ← ∫_{xs} ψs∗ × b^n_s(xs|y) dxs
        // update accumulator
        ACC ← ACC × m^n_{s→∗}
    end
end

Algorithm 2 gives a pseudocode overview of AggBP (notice the arguments xs and xt have been dropped for clarity). As we progress from node to node, instead of computing each outgoing message at a single node, we compute significantly fewer composite messages. The key difference from Algorithm 1 is in the inner loop, "if s is a skeleton neighbor of t." For graphs used in part-based object recognition, this loop is rarely entered, requiring significantly fewer message calculations.

Finally, when the occupancy edges all have a different potential function (e.g., when parts in the model are of a different size), AggBP can still take advantage of this approximation, with additional approximation error. In this case, AggBP simply computes the broadcast message from a part t using the average potential function outgoing from t:

m^n_{t→∗}(x∗) ← α ∫_{xt} [ (1/N) ∑_{u=1}^{N} ψtu(xt, x∗) ] × b^n_t(xt) dxt    (9)

Section 5.2 explores how well AggBP handles varying occupancy potentials.

4.2 Belief representation

Section 3 describes an approach to belief propagation with continuous-valued labels based on particle filtering. However, there may be cases where these models are insufficient. In general, when using particle-filtering-based methods, reasonably accurate initialization of the Gaussian centers representing the probability distribution is necessary for accurate inference [19]. In this section, we describe an alternative belief representation that uses a Fourier-series probability density estimate [17] to represent probabilities and messages. While particle-filtering-based methods tend to concentrate on high-probability space, our approach accurately represents the probability distribution over the entire space of each random variable. Efficient message passing and message computation make this representation ideal for large, highly-connected graphs.

Formally, we represent marginal distributions b^n_s as a set of 3-dimensional Fourier coefficients fk, where, given an upper-frequency limit K,

b^n_s(xs|y) ≈ ∑_{k=0}^{K} fk × e^{−2πi(xs·k)}    (10)

Messages are represented using the same type of probability density estimate.

4.2.1 Message convolution

Recall from Eq. (3) that computing m_{t→s} requires integrating the product of the edge potential ψst(xs, xt), the observation potential ψt(xt|y), and the incoming message product ∏_{u∈Γ(t)\s} m^{n−1}_{u→t}(xt) over all xt. While this computation is difficult in general for Fourier-based density estimates, if our edge potential can be represented as a function of the difference between the labels of the two connected nodes, that is, ψst(xs, xt) = f(||xs − xt||), then m_{t→s} is just a convolution:

m^n_{t→s}(xs) = ( ψst ∗ [ ψt × ∏_{u∈Γ(t)\s} m^{n−1}_{u→t} ] )(xs)    (11)

This is easily computed as the product of Fourier coefficients:

F[ m^n_{t→s}(xs) ] = F[ ψst(xs, xt) ] × F[ ψt(xt|y) × ∏_{u∈Γ(t)\s} m^{n−1}_{u→t}(xt) ]    (12)

For object recognition, all of our occupancy potentials may be represented in this manner. That is, the potential here only depends upon the difference between labels.

This computational shortcut was originally proposed by Felzenszwalb for belief propagation in low-level vision [5]. Computing these message products is efficient, with running time O(K), where K is the high-frequency limit of the density estimate.

4.2.2 Message products

As shown in Eq. (4), computing the current belief b^n_s at a given node requires taking the product of all incoming messages m_{t→s}(xs), and multiplying it by the observation potential ψs(xs, y). Given the Fourier coefficients of all messages, we compute this multiplication in real space:

b^n_s(xs|y) = ψs(xs, y) × ∏_t F^{−1}[ F[ m^n_{t→s}(xs) ] ]    (13)

As with the message convolution, this operation is fairly efficient. Each transform and inverse transform runs in time O(K log K).
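The sketch below illustrates these Fourier-domain operations with numpy FFTs in one dimension (the paper works over a 3D grid); the specific potentials, grid size, and normalization are illustrative assumptions rather than details from the paper.

# Fourier-domain message operations in the spirit of Eqs. (11)-(13), shown in 1D.
import numpy as np

L = 64
xs = np.arange(L)

# A difference-based edge potential psi_st (circular for simplicity) and a stand-in for the
# product of the observation potential and incoming messages at node t.
dist = np.minimum(xs, L - xs)
psi_st = np.exp(-0.5 * (dist / 4.0) ** 2)
incoming_prod = np.random.rand(L)

# Eq. (12): the outgoing message is a convolution, computed as a product of FFTs.
m_ts = np.real(np.fft.ifft(np.fft.fft(psi_st) * np.fft.fft(incoming_prod)))
m_ts = np.clip(m_ts, 0.0, None)
m_ts /= m_ts.sum()

# Eq. (13): beliefs are formed back in real space by multiplying the observation potential
# with the (inverse-transformed) incoming messages.
psi_s = np.random.rand(L)
belief = psi_s * m_ts
belief /= belief.sum()
print(belief[:5])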

5 Experiments

In this section, we compare the standard "full" belief propagation algorithm with AggBP, using both real-world and synthetic datasets. The real-world task is based upon locating protein fragments in 3D images; these objects consist of a chain of "parts" (amino acids). The synthetic dataset looks at our algorithm's performance recognizing objects containing more complex part topologies.

5.1 Protein fragment identification

One application for object recognition arises from x-ray crystallography: our approach builds a graphical model for a protein in order to identify it in a three-dimensional image. These three-dimensional images, or electron density maps, are produced when determining protein structures using x-ray crystallography. Interpreting this map – illustrated in Figure 3 – is the final step of x-ray crystallography [16]. Interpretation produces the Cartesian coordinates of every atom in the protein. It is often quite difficult and time-consuming to interpret electron-density maps: it may take weeks to months of a crystallographer's time to find every atom in the protein. Alternatively, a backbone trace focuses instead on predicting the location of a key carbon atom – the alpha carbon, or Cα – contained in each amino acid. We use AggBP to automatically determine a 3D backbone structure given an electron density map and a protein sequence.

5.1.1 Task overview

A protein is constructed as a linear chain of amino acids. Each of the 20 different naturally-occurring amino acids consists of a constant four-atom motif (the backbone) and a variable sidechain. Figure 4 illustrates a protein structure, highlighting the backbone and sidechains. Protein recognition is difficult for several reasons. The electron-density maps are experimentally determined and are often very noisy. Additionally, the protein chain is extremely flexible: proteins are typically tightly coiled together, and amino acids distant on the linear chain are often very close in three-dimensional space. A protein chain typically has hundreds to thousands of amino acids, and the density map "image" usually contains more than one copy of the protein.

Figure 3. An overview of electron density map interpretation. Given the amino acid sequence of the protein and (a) a density map, the crystallographer's goal is to find (b) the positions of all the protein's atoms. Alternatively, (c) a backbone trace provides the location of a key carbon atom, called Cα, that is in each amino acid.

Figure 4. Proteins are constructed by amino acids condensing into a polypeptide chain. A chain of three amino acids is illustrated.

Figure 5 shows our encoding of a protein as a Markov field model. Each node s represents an amino acid in the protein. The label ws = {xs, qs} for each amino acid consists of seven terms: the 3D Cartesian coordinates xs of the amino acid's Cα, and four internal parameters qs. These four internal parameters are an alternate parameterization of: (a) three 3D rotational parameters plus (b) the bend angle formed by three consecutive Cαs. Probability distributions over Cartesian space make use of the Fourier-series-based parameterization outlined in Section 4.2. The internal parameters qs are modeled as constant-width Gaussians conditioned on the Cartesian coordinates, that is, bs(qs) = N(qs | µ^i_s(xs), Λ), i = 1 … 4.

Figure 5. Our protein part graph. The probability of some conformation is the product of an observation potential, a sequential potential (dark lines), and an occupancy potential (light lines).

A recent paper by this paper's authors [2] describes how protein-specific structural and observation potential functions are learned; it also compares this method to other algorithms for map interpretation. However, this previous work did not present the results shown here; nor did it describe our message approximations and aggregation. Before describing our new contributions, we briefly review the potential functions for electron density map interpretation that we introduced in our prior article, and use in our experiments in this paper.

Each node's potential function ψs(ws, y) is computed by matching a learned set of small-protein-fragment templates to the electron density map. Details of this potential function are beyond the scope of this paper, but appear elsewhere [2]. Edge potential functions are of two basic types. Sequential edges connect adjacent amino acids; the corresponding potential ψ^seq_st(ws, wt) ensures that these adjacent amino acids are the proper distance apart and in the proper orientation with respect to each other. This proper distance and orientation is learned from a set of previously-solved protein structures. Occupancy edges connect all other pairs of amino acids, and the corresponding potential ψ^occ_st(xs, xt) ensures two amino acids do not occupy the same space.

As the graph is fully connected, and the messages are continuous three-dimensional probability distributions over the entire unit cell, storage and run-time requirements are considerable. We use AggBP's message aggregation and approximation outlined in Section 4.1. Without these shortcuts, inference in even medium-sized proteins would be computationally intractable.

Figure 6. A comparison of memory and CPU time usage between our approximate-BP and standard BP.

5.1.2 Results

This section details some experiments locating protein fragments in electron density maps. These maps were provided to us by crystallographer George Phillips at UW-Madison. We convolved the maps with a Gaussian to simulate a poor-quality (3Å resolution) density map: a resolution at which other automated interpretation methods fail. The maps had been previously solved by a crystallographer, giving us the "true" solution with which to compare our predictions. For a given protein, we were provided the sequence, and we constructed a Markov field based on this sequence.

We compare standard BP inference (exact-BP) to AggBP on this Markov field model. Exact-BP was unable to scale to the entire protein (as many as 500 amino acids in our testset), so to compare these two methods we consider protein fragments of between 15 and 65 amino acids. CPU time per iteration and memory usage of the two techniques are illustrated in Figure 6. Because the actual running time and memory usage are dependent upon the size of the density map, we normalize these values so that exact-BP's time and memory usage at 15 amino acids is 1.0 (in an average-sized protein, these values are 200 MB and 120 sec, respectively). We increased fragment length until our 6 GB machine began paging; Figure 6 does not include time spent swapping to disk; however, in larger fragments this is a serious issue.

At each of six fragment lengths, we searched for 15 different fragments of that length from 5 different proteins (in 5 different electron-density maps), for a total of 90 different target fragments. Fragments were chosen that roughly corresponded with the beginning, middle, and end of each protein chain. We ran BP and AggBP until convergence or 20 iterations (where one iteration is a single forward or backward pass through the protein). In each map, we reduced the electron density map to a small neighborhood around each fragment, then searched for that fragment. Using this reduced density map is a more realistic model of searching for a complete protein.

Results from this experiment appear in Figure 7. We plot three different metrics – RMS deviation and log-likelihood of the maximum-marginal interpretation (i.e., the predicted backbone trace), as well as average KL-divergence [12] between the marginal distributions – as a function of iteration. As these plots show, the solutions found by these two methods differ somewhat; however, in terms of error versus the true trace, both produce equally accurate traces. More interestingly, Figure 7b shows the log-likelihood of the maximum-marginal interpretation. Under this metric, AggBP produces a better solution, perhaps because AggBP's approximation avoids overfitting the data. Figure 7d shows the RMS error as a function of protein-fragment length. Not surprisingly, both methods seem to perform slightly worse when searching for longer fragments; still, the predicted structure is fairly accurate – considering the quality of the maps – with an RMS error of under 4Å.

Finally, a scatterplot of log-likelihoods, where each of the 90 fragments is represented as a point, is illustrated in Figure 8. In this figure, points below the diagonal correspond to fragments on which our AggBP produced a more-likely interpretation. For almost every fragment, AggBP produces a solution with a greater log-likelihood than does standard BP. This difference is statistically significant; a two-tailed, paired t test gives a p value of 0.014.

5.2 Synthetic object recognition

While the protein fragment identification testbed shows the CPU and memory savings achievable by our algorithm, it uses a rather limited part topology: the skeletal structure is just a linear chain, and each part is a constant distance apart. In this section, we construct a synthetic object generator that builds "part graphs" with varying branching factors, object sizes, and object "softness." We also explore approximation performance under various part-finder accuracies.


Figure 7. A comparison of our BP approximation with standard BP as the algorithm progresses, using (a) RMS deviation, (b) average KL-divergence of the predicted marginals, and (c) log-likelihood of the maximum-marginal interpretation. Additionally, (d) shows RMS error as a function of protein fragment size (at iteration 20).

5.2.1 Object generator

We have developed a synthetic object generator to better understand how well AggBP works over the range of possible tasks locating 3D objects composed of interconnected parts. This object generator lets us vary the graph topology and individual part parameters, as in Figure 9. The generator constructs objects with a predefined number of parts, arranged in a tree-structured skeleton. Given some branching factor, the skeleton is randomly assembled from the parts. As before, all pairs of nodes not connected in this skeleton are connected with edges enforcing occupancy potentials, which ensure two parts do not occupy the same three-dimensional space.

Each part in the model is given a radius ri ∈ r and a softness si ∈ s, from which the structural potential functions are derived. Pairs of parts directly connected in the skeleton maintain a distance equal to the sum of their radii. Other pairs may not occupy the same 3D space: they should be at least as far apart as the sum of their radii, although they may get closer than this distance as softness increases. This softness parameter allows these part pairs to get slightly closer than the sum of their radii with some low probability. Specifically, the softness parameter replaces the occupancy potential's step function with a sigmoid. For non-zero softness, then, the probability distribution of the distance d between two parts i and j, with radii ri and rj, and softness si and sj, is given by:

pij(d) = 1 / ( 1 + exp( −(d − (ri + rj)) / (si·ri + sj·rj) ) )    (14)
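A small sketch of this "soft collision" potential follows; the radii and softness values are arbitrary examples for illustration, not settings used in the experiments.

# The sigmoidal occupancy potential of Eq. (14), falling back to a hard step when softness is zero.
import math

def occupancy_potential(d, r_i, r_j, s_i, s_j):
    """Probability that parts i and j are an acceptable distance d apart."""
    denom = s_i * r_i + s_j * r_j
    if denom == 0.0:
        return 1.0 if d >= r_i + r_j else 0.0      # "hard collision" step function
    return 1.0 / (1.0 + math.exp(-(d - (r_i + r_j)) / denom))

print(occupancy_potential(d=2.1, r_i=1.0, r_j=1.0, s_i=0.2, s_j=0.2))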

Figure 8. A scatterplot showing – for each of the 90 target fragments – the log likelihood of AggBP's trace versus the standard BP's trace. Points below the diagonal correspond to fragments where AggBP returned a more likely solution.

Figure 9. An illustration of the three graph topology parameters we vary using our graph generator: vary radii, increase branching factor, and allow spatial overlap.

Our testbed generator also generates observation potentials ψobs, that is, the probability distribution of each part's location in 3D space. These would normally be generated by some type of pattern matcher in a 2D or 3D image. Our generator assumes we have a classifier that – given a location in 3D space – returns a score. As shown in Figure 10, scores are drawn from one of two distributions: at the true location of a part, the score for that part is drawn from one distribution; at any other location the score is drawn from another distribution.

For simplification, we assume both distributions are fixed-width Gaussians with different means. Varying the difference in means results in classifiers with varying accuracy. Given this difference in means, then, we generate each part's observation potential by drawing scores at random from these two distributions. Assuming the distributions are known, scores are converted into probabilities using Bayes' rule.
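For concreteness, the following sketch converts a classifier score into a probability with Bayes' rule under the two fixed-width Gaussian score distributions described above; the means, width, and prior are made-up values for illustration.

# Score-to-probability conversion via Bayes' rule over two Gaussian score distributions.
from scipy.stats import norm

def score_to_prob(score, mu_pos, mu_neg, sigma=1.0, prior_pos=1e-4):
    """P(part is at this location | score), assuming known score distributions and prior."""
    p_pos = norm.pdf(score, loc=mu_pos, scale=sigma) * prior_pos
    p_neg = norm.pdf(score, loc=mu_neg, scale=sigma) * (1.0 - prior_pos)
    return p_pos / (p_pos + p_neg)

print(score_to_prob(score=2.5, mu_pos=3.85, mu_neg=0.0))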

A specific value of µ corresponds to a single part classifier, with some accuracy. In the remainder of this section, we report not this value for µ, but rather the area under the precision-recall curve (AUPRC) which it – along with the number of positive and negative examples – induces. For example, in a 40x40x40 grid, µ = 3.85 corresponds to an AUPRC of 0.3.

Figure 10. Observation potentials are generated by drawing scores from two distributions. The parameter µ is directly related to each part-classifier's accuracy.

5.2.2 Results

We used our generator to vary four different parameters in the model (default values are shown in parentheses):

• branching-factor: the average branching factor in the skeleton graph (default = 2)

• softness: each part's softness (default = 0.0)

• σ(radius): the standard deviation of radii in the graph (default = 0)

• µ: the difference in means between the positive score distribution and negative score distribution. We report this value as the area under the associated classifier's precision-recall curve (default area = 0.3).

In every graph, the average part radius was fixed (at 1 grid point), and each model was constructed of 100 parts.

We used our object recognition framework to search for the optimal layout of parts, given some generated object and observation potentials. As in the previous section, we used both standard belief propagation as well as AggBP, and compared the results. We assumed that part parameters – radius and softness – were known (or learned) by the algorithm. For both AggBP and standard BP, we ran until convergence or 20 iterations. Standard BP occasionally did not converge; in these cases, we took the highest-likelihood solution at any iteration. At each parameter setting, we compute the average error using 20 randomly generated part graphs.

Results from this experiment appear in Figure 11. For each of the four varied parameters we plot the RMS error (a) between standard BP and ground truth, (b) between AggBP and truth, and (c) between AggBP's solution and standard BP's. Runtime and memory usage are almost identical to the previous experiment.

Figure 11. A comparison of our BP approximation with standard BP using the synthetic object generator. While holding other parameters fixed, we vary (a) skeleton branching factor, (b) part softness, (c) radius standard deviation, and (d) classifier AUPRC. We report the RMS error of our algorithm (AggBP) and standard BP (BP) against each other as well as ground truth.

For two of the varied parameters – graph branching factor and classifier AUPRC (Figures 11a and 11d) – the solutions returned by the two methods are of comparable accuracy. Even though the solutions themselves may be quite different, they are both equally close to ground truth. Figure 11c shows that our algorithm performs reasonably well as the radii of the model's parts are varied more and more. The performance of the two algorithms is similar until the standard deviation of part radii is increased to three times the radius. Larger variations could be handled by clustering objects into multiple groups based on radius, approximating messages to each group.

The most interesting result, however, is in Figure 11b, where the object softness is varied. Increasing object softness allows two objects to move closer than would normally be allowed, with some low probability. Here, for any non-zero softness, AggBP finds a more accurate solution than standard BP. The reason for this is unclear; however, it may be due to feedback introduced by this softness that is dampened by our approximation. When running with a non-zero softness, standard BP fails to converge quite often, giving some support to the idea that our approximation is dampening some feedback loops.

Log likelihood plots, not shown for our synthetic experiments, are very similar to the error plots. These experiments show that our method is clearly valid for a wide variety of model parameters and part topologies. In a large majority of the synthetic experiments, AggBP produced an interpretation that was as good as or better than standard BP, in a small fraction of the time.

6 Conclusions and Future Work

We describe a part-based, 3D object recognition framework, well suited to mining detailed 3D image data. We introduce AggBP, a message approximation and aggregation scheme that makes belief propagation tractable in large and highly connected graphs. Using a message approximation similar to that of mean-field methods, we reduce the number of message computations at a single node from many to just a few. In the fully connected graphs used by our object-recognition framework, we reduce the running time and memory requirements for an object with N parts from O(N²) to O(N). Additionally, we describe an efficient probability representation based on Fourier series. Experiments on a 3D vision task arising from x-ray crystallography show that these improvements produce solutions as good as or better than standard BP. Synthetic tests show that AggBP is accurate for a variety of object types with various part topologies, almost always producing a solution as good as or better than standard BP.

It is unclear why AggBP should sometimes produce more-accurate results than standard BP. Our approximate-message computation ignores a term that serves to avoid feedback and that makes the method exact in tree-structured graphs. However, in graphs with loops, such feedback is unavoidable (through the loops of the graph). For some types of edge potentials, ignoring this term produces a more-accurate approximation, perhaps by dampening some of these feedback loops inherent in loopy belief propagation. Further investigation into this is needed.

In the future, we would like to take a more dynamic approach to message aggregation. For example, in the protein backbone-tracing task, AggBP's approximation error is highest along edges connecting amino acids that are nearby in space. If we could accurately predict which amino acids are close (in space) as BP iterates, we could precisely compute messages between these pairs of nodes, and approximately compute messages along other edges.

The results using AggBP illustrate that our techniques are useful in the automatic interpretation of complex 3D image data. The shortcuts we introduce drastically increase the size of problems on which BP is tractable. On one real and one synthetic dataset, we produce accurate results with significant CPU and storage savings over standard BP. Our algorithm appears to be a powerful tool for mining large images.

Acknowledgements

This work is supported by NLM Grant 1R01 LM008796 and NLM Grant 1T15 LM007359.

References

[1] J. Coughlan and S. Ferreira. Finding deformable shapes using loopy belief propagation. Proc. ECCV.

[2] F. DiMaio, J. Shavlik, and G. Phillips (2006). A probabilistic approach to protein backbone tracing in electron density maps. Proc. ISMB.

[3] A. Doucet, S. Godsill, and C. Andrieu (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing.

[4] P. Felzenszwalb and D. Huttenlocher (2000). Efficient matching of pictorial structures. Proc. CVPR.

[5] P. Felzenszwalb and D. Huttenlocher (2004). Efficient belief propagation for early vision. Proc. CVPR.

[6] B. Frey (1998). Graphical Models for Machine Learning and Digital Communication. MIT Press.

[7] T. Heskes (2004). On the uniqueness of loopy belief propagation fixed points. Neural Comp., 16.

[8] A. Ihler, E. Sudderth, W. Freeman, and A. Willsky (2004). Efficient multiscale sampling from products of Gaussian mixtures. Proc. NIPS.

[9] M. Isard (2003). PAMPAS: Real-valued graphical models for computer vision. Proc. CVPR.

[10] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul (1999). An introduction to variational methods for graphical models. Machine Learning.

[11] D. Koller, U. Lerner, and D. Angelov (1999). A general algorithm for approximate inference and its application to hybrid Bayes nets. Proc. UAI.

[12] S. Kullback and R. Leibler. On information and sufficiency. Annals of Mathematical Statistics.

[13] D. MacKay and R. Neal (1995). Good codes based on very sparse matrices. Cryptography and Coding: 5th IMA Conference.

[14] K. Murphy, Y. Weiss, and M. Jordan (1999). Loopy belief propagation for approximate inference: An empirical study. Proc. UAI.

[15] J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo.

[16] G. Rhodes (2000). Crystallography Made Crystal Clear. Academic Press.

[17] B. W. Silverman (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall.

[18] E. Sudderth, A. Ihler, W. Freeman, and A. Willsky (2003). Nonparametric belief propagation. Proc. CVPR.

[19] E. Sudderth, M. Mandel, W. Freeman, and A. Willsky (2004). Visual hand tracking using nonparametric belief propagation. MIT LIDS Technical Report 2603.

[20] S. Tatikonda and M. Jordan (2002). Loopy belief propagation and Gibbs measures. Proc. UAI.

[21] Y. Weiss (1996). Interpreting images by propagating Bayesian beliefs. Proc. NIPS.

[22] Y. Weiss (2000). Correctness of local probability propagation in graphical models with loops. Neural Comp., 12.

[23] Y. Weiss and W. T. Freeman (2001). Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Comp., 13.

[24] J. Yedidia, W. Freeman, and Y. Weiss (2005). Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory.
