+ All Categories
Home > Documents > Efficient Factored Inference with Affine Algebraic Decision...

Efficient Factored Inference with Affine Algebraic Decision...

Date post: 09-Jul-2020
Category:
Upload: others
View: 1 times
Download: 0 times
Share this document with a friend
40
Journal of Artificial Intelligence Research ? (?) ? Submitted 9/10; published ?/? Efficient Factored Inference with Affine Algebraic Decision Diagrams Scott Sanner SSANNER@NICTA. COM. AU NICTA and the Australian National University Canberra, ACT 2601, Australia David McAllester MCALLESTER@TTI - C. ORG Toyota Technological Institute at Chicago Chicago, IL 60637, USA Will Uther WILLU@CSE. UNSW. EDU. AU NICTA and the University of New South Wales Sydney, NSW, 2052, Australia Karina Valdivia Delgado KVD@IME. USP. BR University of Sao Paulo Sao Paulo, SP, Brazil Abstract A key component of efficient factored probabilistic and decision-theoretic inference in AI is the use of compact factor representations that support computationally efficient operations. In this article we propose an affine extension to the algebraic decision diagram (ADD) capable of com- pactly representing logical, additive, and multiplicative structure in discrete factors. We show that affine ADDs (AADDs) have worst-case time and space complexity within a multiplicative constant of ADDs and in some cases yield an exponential improvement in time and space over ADDs. We introduce a procedure for efficiently approximating AADDs within a fixed error bound and show this procedure can lead to both smaller and lower error approximations in comparison to standard ADD and tabular approximation techniques. In an empirical comparison of tabular, ADD, and AADD representations used in Bayes net variable elimination and factored MDP value iteration algorithms, we observe that AADDs perform at least as well as tables and ADDs and often yield a substantial space and time reduction over both (and lower error in the case of approximation). In summary, these results suggest that the substitution of AADDs for tabular or ADD representations can improve the space and time efficiency and approximation accuracy of factored inference when additive or multiplicative structure can be exploited in the underlying problem domain. 
c ? AI Access Foundation. All rights reserved.
Transcript
Page 1: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

Journal of Artificial Intelligence Research ? (?) ? Submitted 9/10; published ?/?

Efficient Factored Inference with Affine Algebraic Decision Diagrams

Scott Sanner SSANNER@NICTA .COM.AU

NICTA and the Australian National UniversityCanberra, ACT 2601, Australia

David McAllester [email protected]

Toyota Technological Institute at ChicagoChicago, IL 60637, USA

Will Uther WILLU @CSE.UNSW.EDU.AU

NICTA and the University of New South WalesSydney, NSW, 2052, Australia

Karina Valdivia Delgado KVD @IME .USP.BR

University of Sao PauloSao Paulo, SP, Brazil

AbstractA key component of efficient factored probabilistic and decision-theoretic inference in AI is

the use of compact factor representations that support computationally efficient operations. In thisarticle we propose an affine extension to the algebraic decision diagram (ADD) capable of com-pactly representing logical, additive, and multiplicative structure in discrete factors. We show thataffine ADDs (AADDs) have worst-case time and space complexity within a multiplicative constantof ADDs and in some cases yield an exponential improvement intime and space over ADDs. Weintroduce a procedure for efficiently approximating AADDs within a fixed error bound and showthis procedure can lead to both smaller and lower error approximations in comparison to standardADD and tabular approximation techniques. In an empirical comparison of tabular, ADD, andAADD representations used in Bayes net variable elimination and factored MDP value iterationalgorithms, we observe that AADDs perform at least as well astables and ADDs and often yield asubstantial space and time reduction over both (and lower error in the case of approximation). Insummary, these results suggest that the substitution of AADDs for tabular or ADD representationscan improve the space and time efficiency and approximation accuracy of factored inference whenadditive or multiplicative structure can be exploited in the underlying problem domain.

c©? AI Access Foundation. All rights reserved.

Page 2: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

Contents

1 Introduction 4

2 Factored Inference 52.1 Basic Tabular Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52.2 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2.1 Variable Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2.2 Loopy Belief Propagation(???) . . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Factored MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.3.1 Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Algebraic Decision Diagrams (ADDs) 63.1 Canonical Reduced ADDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .83.2 Binary Operations on ADDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2.1 Terminal computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2.2 Recursive computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2.3 Other Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Affine Algebraic Decision Diagrams (AADDs) 144.1 Canonical Reduced AADDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .154.2 Binary Operations on AADDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2.1 Terminal computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2.2 Recursive computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.2.3 Canonical caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.2.4 Other Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.2.5 Cache Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.3 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5 Factor Approximation 265.1 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265.2 ADDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265.3 AADDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

6 Empirical Evaluation 316.1 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316.2 Exact Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 31

6.2.1 Bayes Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316.2.2 Factored MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6.3 Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .346.3.1 Bayes Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346.3.2 Factored MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

7 Related Work 35

8 Conclusions and Future Work 36

2

Page 3: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

References 38

3

Page 4: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

ADD Structure

017 6 5 4 3 2

07 6 5 4 3 2 11 0 1 0

x2

x1

x3

x2

x1 x1x1x1x1

x2 x2

x3x3

Structurec) Additive and Multiplicative

Structurea) Conjunctive ADD b) Disjunctive ADD

Figure 1: Some example ADDs showing a) conjunctive structure (f = if(a∧b∧c, 1, 0), b) disjunc-tive structure (f = if(a∨b∨c, 1, 0)), and c) additive (f = 4a+2b+c) and multiplicative(f = γ4a+2b+c) structure (top and bottom sets of terminal values, respectively). The high(true) edge is solid, the low (false) edge is dotted.

WARNING: VERY ROUGH DRAFT, MAINLY PLACEHOLDER TEXT

1. Introduction

A key component of efficient factored probabilistic and decision-theoretic inference in AI is the useof compact factor representations that support computationally efficientoperations.

The simplest explicit representation for discrete factors is a tabular formatthat simply enumer-ates all domain assignments and their corresponding valuations. [GIVE EXAMPLE]

However, when many values are repeated, data structures like Algebraicdecision diagrams(ADDs) (Bahar, Frohm, Gaona, Hachtel, Macii, Pardo, & Somenzi, 1993) provide an even moreefficient means for representing and performing arithmetic operations on functions from a factoredboolean domain to a real-valued range (i.e.,B

n → R). ADDs rely on two main principles to do this:

1. ADDs represent a functionBn → R as a directed acyclic graph – essentially a decision treewith reconvergent branches and real-valued terminal nodes.

2. ADDs enforce a strict variable ordering on the decisions from the root to the terminal node,enabling a minimal, canonical diagram to be produced for a given function.Thus, two identi-cal functions will always have identical ADD representations under the same variable order-ing.

A few examples of ADDs along with their representation of various independences are givenin figure 1. As shown, ADDs often provide an efficient representation of functions with context-specific independence (Boutilier, Friedman, Goldszmidt, & Koller, 1996), such as functions whosestructure is conjunctive (1a) or disjunctive (1b) in nature. Thus, ADDscan offer exponential spacesavings over a fully enumerated tabular representation. However, the compactness of ADDs doesnot extend to the case of additive or multiplicative independence, as demonstrated by the exponen-tially large representations when this structure is present (1c). Unfortunately such structure often

4

Page 5: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

occurs in probabilistic and decision-theoretic reasoning domains, potentiallyleading to exponentialrunning times and space requirements for inference in these domains.

In this article, we propose to remedy this representational and computationallimitation of ADDsby extending them with the ability to compactly represent functions with additive and multiplicativeindependence in addition to logical structure. We do so by introducing an extension of the ADDthat we term an affine AADD (AADD).

Specifically, we make the following contributions in this article:

• We provide a formal definition of the AADD and prove that every functionBn → R has a

unique canonical AADD representation.

• We generalize the fundamental ADD procedures ofReduce for constructing canonical ADDsandApply for performing binary operations to the case of AADDs.

• We prove that operations on an AADD will never perform more than a constant times worsethan the ADD in space or time and that there exist functions and operations for which theAADD offers an exponential reduction in space and time over the ADD.

• We introduce a greedy technique for approximating AADDs within a fixed error bound thatrequires linear time and space in the size of the original AADD.

• We provide an extensive comparative empirical evaluation of (a) tabular representations, (b)ADDs, and (c) AADDs on (1) basic operations, (2) inference in Bayesnets, and (3) valueiteration in factored MDPs, demonstrating the clear advantages of AADDs for bothexact andapproximate inference.

To this end, the rest of this article is structured as follows. In Section 2, we discuss the basic op-erations of factored inference in the context of a tabular representationand the two specific factoredinference applications of variable elimination in graphical models and value iteration in factoredMDPs (that we use for empirical evaluation later in the article). In Section 3 we review ADDs andtheir operations and extend this in Section 4 to the novel case of AADDs. In Section 5 we reviewtechniques for approximating tabular representations and ADDs and introduce novel techniques forapproximating AADDs. In Section 6 we proceed to provide an extensive empirical evaluation ofthe tabular, ADD, and AADD representations for factored inference problems as outlined above.Finally in Section 7 we discuss related work, and we conclude in Section 8 with asummary of ourresults and directions for future work.

2. Factored Inference

2.1 Basic Tabular Operations

2.2 Graphical Models

2.2.1 VARIABLE ELIMINATION

2.2.2 LOOPY BELIEF PROPAGATION (???)

2.3 Factored MDPs

In the factored version of a Markov Decision Processs (MDP) (Puterman, 1994), states will berepresented by vectors~x of lengthn, where for simplicity we assume the state variablesx1, . . . , xn

5

Page 6: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

have domain{0, 1}; hence the total number of states isN = 2n. We also assume a set of actionsA = {a1, . . . , an}. An MDP is defined by: (1) a state transition modelP (~x′|~x, a) which specifiesthe probability of the next state~x′ given the current state~x and actiona; (2) a reward functionR(~x, a) which specifies the immediate reward obtained by taking actiona in state~x; and (3) adiscount factorγ, 0 ≤ γ < 1. A policy π specifies the actionπ(~x) to take in each state~x. Our goalis to find a policy that maximizes the value function, defined using the infinite horizon, discountedreward criterion:V π(~x) = Eπ[

∑∞t=0

γt · rt|~x], wherert is the reward obtained at timet (starting instate~x).

Many MDPs often have a natural structure that can be exploited in the formof a factoredMDP (Boutilier, Dean, & Hanks, 1999). For example, the transition functioncan be factoredas a dynamic Bayes net (DBN)P (x′

i|~xi, a) where each next state variablex′i is only dependent

upon the actiona and its direct parents~xi in the DBN. Then the transition model can be com-pactly specified asP (~x′|~x, a) =

∏ni=1

P (x′i|~xi, a). Often, the reward can be factored additively as

R(~x, a) =∑m

i=1Ri(~x, a).

2.3.1 VALUE ITERATION

Value iteration(Bellman, 1957) is a simple dynamic programming algorithm for constructing opti-mal policies. We first define abackupoperatorBa for actiona as follows:

(BaV )(~x) = γ∑

~x′

n∏

i=1

P (x′i|~xi, a)V (~x′) (1)

If π∗ denotes the optimal policy andV ∗ its value function, we have the fixed-point relationshipV ∗(~x) = maxa∈A {∑m

r=1Ri(~xr, a) + (BaV ∗)(~x)}.

Value iteration proceeds by constructing a series oft-stage-to-go value functionsV t. SettingV 0 to arbitrary values, we define

V t+1(~x) = maxa∈A

{∑m

r=1Ri(~xr, a) + (BaV t)(~x)

}

(2)

The sequence of value functionsV t produced by value iteration converges linearly to the optimalvalue functionV ∗.

Approximate value iteration (AVI)is an approximate dynamic programming variant of the valueiteration algorithm with the additional step that after each Bellman backup, the value function maybe projected onto a more compact representation while inducing some error inthis projection step.

3. Algebraic Decision Diagrams (ADDs)

We can often represent a table more efficiently than by enumerating all state configurations of thevariables in that table. Quite often, we find that certain values of variables ina CPT render the othervalues irrelevant. This is known ascontext-specific independence (CSI)(Boutilier et al., 1996).

For the example DBN in Figure 2(a), given that the value ofx′2 depends onx1,x2 anda in

P (x′2|x1, x2, a) but that in the context ofa 6= reboot(2), the value ofx′

2 depends on no other vari-ables, we say that in the context ofa 6= reboot(2), x′

2 is independent of all other variables and thusP (x′

2|x1, x2, a 6= reboot(2)) = P (x′2|a 6= reboot(2)). In order to represent this CSI compactly,

we can use a decision tree or an algebraic decision diagram (ADD) (Bahar et al., 1993), which is

6

Page 7: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

.475 .025

x3 x2

x1

x3

.95

x1

P(x2’| a,x1,x2)

x2

x1

.05

R(x1,x2,x3)x1

x2

3 1 0

x2

x3 x3 x3

2

SysAdminNetwork

t t+1a

x2

r

x2’

x3’

x1 x1’

a) DBN Representation of Transition Function

b) TransitionCPT as ADD

c) Reward as ADD

a≠reboot(2)

x2’

1

Figure 2: a) A dynamic Bayes network and decision diagram representinga transition function anda reward function for SYSADMIN with n = 3 and a unidirectional ring network topology.b) A compact encoding of the transition function CPT for the DBN as an ADD.Notethatx′

3 sums to one over all possible previous states. c) An ADD representation oftheadditive reward function for SYSADMIN . For all ADDs, the high (true) edge is solid, thelow (false) edge is dotted.

similar to a tree except that it is a canonicaldirected acyclic graph (DAG)with all variable deci-sion tests following a strict order from the root to the leaves. An example ADDfor this probabilitydistribution showing the above CSI is given in Figure 2(b). Effectively, CSI performs automaticstate aggregation in that all possible state contexts under the conditiona 6= reboot(2) are effectivelygrouped together and assigned a common value. An example ADD for the reward is given in Fig-ure 2(c), here there is no explicit CSI, but the reconvergent DAG structure of the ADD does allowsharing of common substructure that reduces what would be a tabular representation exponentiallysized inn to an ADD representation quadratically sized inn.

In addition to the representational efficiency of state aggregation in ADDs,we note that com-putation with ADDs can also be very efficient. When we perform operationson factors representedas ADDs, we can just replace these operations with their ADD-based versions (Bahar et al., 1993),allowing us to exploit CSI and shared substructure not only in the representation of factored MDPs,but also in the computations required for their solution.

Since the ADD will be a crucial data structure for our subsequent presentation of factored MDPsolution algorithms, we provide a formal definition of ADDs and algorithms to construct and ma-

7

Page 8: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

nipulate them in the following subsections. The following discussion draws onthe work of Baharet al. (1993), which is itself a slight variant of the original work on orderedbinary decision diagrams(BDDs)of Bryant (1986).

3.1 Canonical Reduced ADDs

An ADD is a decision diagram with a fixed variable ordering of all decision tests on paths from theroot to the leaves that is capable of representing functions fromB

n → R. We define ADDs with thefollowing simple BNF grammar:

F ::= C | if (F var ) then Fh else Fl (3)

Here,C ∈ R is a constant-valued terminal node. Each internal decision node is represented asif (F var) then Fh else Fl and is associated with a single variablevar that indicates the highbranch leading to nodeFh should be taken whenvar = true and the low branch leading toFl

should be taken whenvar = false.Let V al(F, ρ) be the value of ADDF under variable value assignmentρ. Then the valuation of

an ADD can be defined recursively by the following equation:

V al(F, ρ) =

F = C : CF 6= C ∧ ρ(F var ) = true : V al(Fh, ρ)F 6= C ∧ ρ(F var ) = false : V al(Fl, ρ)

Formally, we define avariable orderingas a total ordering over all variables such that for all variablepairsxi, xj (i 6= j) eitherxi ≻ xj or xj ≻ xi. We say thatF satisfies a given variable ordering ifF = C or F is of the formif (F var) then Fh else Fl where (1)F var does not occur inFh or Fl,(2) F var is the earliest variable under the given ordering occuring inF and (3)Fl andFh satisfy thevariable ordering. We discuss choices for variable order later in the context of variable reordering.

Then we obtain the following lemma where we define areducedADD to be the minimally-sizedordered decision diagram representation a functionf(x1, . . . , xn).

Lemma 3.1. Fix a variable ordering overx1, . . . , xn. For any functionf(x1, . . . , xn) mappingB

n −→ R, there exists a unique reduced ADDF over variable domainx1, . . . , xn satisfying thegiven variable ordering such that for allρ ∈ B

n we havef(ρ) = V al(F, ρ).

Bryant (1986) provides a proof of this lemma for BDDs, which only have two distinct terminalvalues. The proof trivially generalizes to ADDs, which can have more thantwo distinct terminalvalues. This lemma shows that there is a unique canonical ADD representation of all functions fromB

n −→ R.Given that there exists a unique reduced ADD for any function fromB

n −→ R, we next de-scribe how this reduced ADD can be constructed from an arbitrary ordered decision diagram. Allalgorithms that we will define rely on the helper functionGetNode in Algorithm 1, which returnsa canonical representation of a single internal decision node. UsingGetNode, theReduce proce-dure in Algorithm 2 takes any ordered decision diagram and returns its reduced, canonical ADDrepresentation (necessarily removing any redundant structure in the process). The control flow ofReduce is very simple in that it uses theGetNode procedure to recursively build a reduced ADDfrom the bottom up (i.e., from the terminal leaf nodes all the way up to the root node). An exampleapplication of theReduce algorithm is given in Figure 8.

8

Page 9: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

Algorithm 1: GetNode(v, Fh, Fl〉) −→ Fr

input : v, Fh, Fl : Var and node ids for high/low branchesoutput : Fr : Return values for offset,

multiplier, and canonical node idbegin

// If branches redundant, return childif (Fl = Fh) then

returnFl;

// Make new node if not in cacheif (〈v, Fh, Fl → id is not in node cache)then

id := currently unallocated id;insert〈v, Fh, Fl〉〉 → id in cache;

// Return the cached, canonical nodereturnid ;

end

Algorithm 2: Reduce(F ) −→ Fr

input : F : Node idoutput : Fr : Canonical node id for reduced ADDbegin

// Check for terminal nodeif (F is terminal node)then

return canonical terminal node for value ofF ;

// Check reduce cacheif ( F → Fr is not in reduce cache)then

// Not in cache, so recurseFh := Reduce(Fh);Fl := Reduce(Fl);

// Retrieve canonical formFr := GetNode(F var, Fh, Fl);

// Put in cacheinsertF → Fr in reduce cache;

// Return canonical reduced nodereturnFr;

end

3.2 Binary Operations on ADDs

Given functionsBn −→ R represented as ADDs, we now want to apply operations to these func-tions that work directly on the ADD representation. Additionally, we would prefer that these oper-ations avoid enumerating all possible variable assignments whenever possible.

To do this, we first define theApply function that applies a binary operation to two operandsrepresented as ADDs and returns the result as an ADD. We letop denote a binary operator onADDs with possible operations being addition, substraction, multiplication, division, min, and maxdenoted respectively as⊕, ⊖, ⊗, ⊘, min(·, ·), andmax(·, ·). We also define binary comparison

9

Page 10: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

1

x2

x1

1 0

x2

x1

1 0

x2

x1

1 0

x1

1 0

x1

1 0x

Figure 3: An example application of theReduce algorithm. The input is the leftmost diagram.From left to right, the hollow arrow shows the nodeF currently being evaluated byReduce just after the recursiveReduce calls to the high branchFh and low branchFl

but beforeGetNode(F var , Fh, Fl) is called and the canonical representation ofF is re-turned (see Algorithm 2). The next diagram in the sequence shows the result after thepreviousReduce call. The rightmost diagram is the final canonical ADD representationof the input.

op

Fvar1,h Fvar

1,l

Fvar1 Fvar

Fvar Fvar

2

2,h 2,l

Figure 4: Two ADD nodesF1 andF2 and a binary operationop with the corresponding notationused in the presentation of theApply function.

functions≥, >, ≤, < that return an indicator function represented as an ADD that takes the value1when the comparison is satisfied and0 otherwise.

The high-level control flow of theApply routine in Algorithm 3 is straightforward: we firstcheck whether we can compute the result immediately by callingComputeResult , otherwise wecheck if we can reuse the result of a previously cachedApply computation. If we can do neitherof these, we then choose a variable to branch on and recursively call the Apply routine for eachinstantiation of the variable. We cover these steps in-depth in the following sections and note thatFigure 5 provides an example of theApply operation.

10

Page 11: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

Algorithm 3: Apply(F1, F2, op) −→ Fr

input : F1, F2, op : ADD nodes and opoutput : Fr : ADD result node to returnbegin

// Check if result can be immediately computedif (ComputeResult(F1, F2, op) → Fr is not null )then

returnFr;// Check if result already in apply cacheif ( 〈F1, F2, op〉 → Fr is not in apply cache)then

// Not terminal, so recurseif (F1 is a non-terminal node)then

if (F2 is a non-terminal node)thenif (F var

1 comes beforeF var2 ) then

var := F var1 ;

elsevar := F var

2 ;

elsevar := F var

1 ;

elsevar := F var

2 ;

// Set up nodes for recursionif (F1 is non-terminal∧ var = F var

1 ) thenF v1

l := F1,l; F v1

h := F1,h;

elseF v1

l/h := F1;

if (F2 is non-terminal∧ var = F var2 ) then

F v2

l := F2,l; F v2

h := F2,h;

elseF v2

l/h := F2;

// Recurse and get cached resultFl := Apply(F v1

l , F v2

l , op);Fh := Apply(F v1

h , F v2

h , op);Fr := GetNode(var, Fh, Fl);

// Put result in apply cache and returninsert〈F1, F2, op〉 → Fr into apply cache;

returnFr;end

3.2.1 TERMINAL COMPUTATION

The functionComputeResult given in Table 1, determines if the result of a computation can be im-mediately computed without recursion. The first entry in this table is required for proper terminationof the algorithm as it computes the result of an operation applied to two terminal constant nodes.However, the other entries denote a number of pruning optimizations that immediately return a nodewithout recursion. For example, we know thatF1 ⊕ 0 = F1 andF1 ⊗ 1 = F1. If a result cannot beimmediately determined inComputeResult then we must continue recursing on the substructure ofthe operands until a result can be computed.

11

Page 12: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

ComputeResult(F1, F2, op) −→ Fr

Operation and Conditions Return Value

F1 op F2; F1 = C1; F2 = C2 C1 op C2

F1 ⊕ F2; F2 = 0 F1

F1 ⊕ F2; F1 = 0 F2

F1 ⊖ F2; F2 = 0 F1

F1 ⊗ F2; F2 = 1 F1

F1 ⊗ F2; F1 = 1 F2

F1 ⊘ F2; F2 = 1 F1

min(F1, F2); max(F1) ≤ min(F2) F1

min(F1, F2); max(F2) ≤ min(F1) F2

similarly for max

F1 ≤ F2; max(F1) ≤ min(F2) 1

F1 ≤ F2; max(F2) ≤ min(F1) 0

similarly for <,≥, >

other null

Table 1: Input and output summaries ofComputeResult . If ComputeResult receives two constantADD nodes as input, the constant resulting from the direct evaluation ofany possiblebinary operation is returned. In other cases where at least one node isnon-terminal, specialoperand structure and specific operator properties sometimes permit the computation ofthe result without further recursion. Some computations rely on the unarymin(F ) andmax(F ) operators that are discussed directly following theApply algorithm.

3.2.2 RECURSIVE COMPUTATION

If a call to Apply is unable to immediately compute a result or reuse a previously cached compu-tation, we must recursively compute the result. For this we have two cases (the third case whereboth operands are constant terminal nodes having been taken care of inthe previous section). Thesealgorithms assume the notation given in Figure 4 for the structure of the operands.

• F1 or F2 is a constant terminal node, orF var1 6= F var

2 : For simplicity of exposition, weassume the operation is commutative and reorder the operands so thatF1 is the constant nodeor the operand whose variable comeslater in the variable ordering so that we know to branchon F var

2 first.1 Thus, we compute the operation applied separately toF1 andeachof F2’shigh and low branches. We then build an internalif decision node conditional onF var

2 and

1. We note that the first case prohibits the use of the non-commutative⊖ and⊘ operations. However, a simple solutionwould be to recursively descend on eitherF1 or F2 rather than assuming commutativity and swapping operands toensure descent onF2. To accommodate general non-commutative operations, we have used this alternate approachin our specification of theApply routine.

12

Page 13: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

get its canonical representation for the result:

Fh = Apply(F1, F2,h, op)

Fl = Apply(F1, F2,l, op)

Fr = GetNode(F var2 , Fh, Fl)

• F1 and F2 are constant nodes andF var1 = F var

2 : Since the variables for each operandmatch, we know the resultFr is simply anif statement branching onF var

1 (= F var2 ) with the

true case being the operator applied to the high branches ofF1 andF2 and likewise for thefalse case and the low branches:

Fh = Apply(F1,h, F2,h, op)

Fl = Apply(F1,l, F2,l, op)

Fr = GetNode(F var1 , Fh, Fl)

(1)

1

x2

1 0

x1

x2

1

x1

0

(5)

x2

1

x1

0 1

(1)

(1)

(2)

(2)

(3)

(3)

(4)

(4)

(5)

(5)

(2)

(3) (4)

x

Figure 5: An example application of theApply algorithm. The indices(i) in the diagram corre-spond to successive (recursive) calls to theApply algorithm: for the operands the indicesdenote which node of each operand is passed as a parameter to the call toApply (the opis always⊕); for the result the indices indicate the node that is returned by the call toApply . For example, the initial call toApply takes the arguments corresponding to thenode marked (1)x2 on the LHS of the⊕ and the node (1)x1 on the RHS of the⊕ (as wellas the operation⊕ itself) and returns the node marked (1) on the RHS of the equality.

3.2.3 OTHER OPERATIONS

Above we covered binary operations on ADDs, but we will also need to perform a variety of unaryoperations on ADDs such as determining themin andmax value of an ADD and marginalizationover variables. Here we cover some unary operations that can be performed (efficiently) on ADDs:

• min and max computation: During theReduce operation, it is easy to maintain the mini-mum and maximum values for each internal decision node. Exploiting the fact that an ADDis a DAG,minF = min(Fl, Fh) and likewise formax. A simple example of this annotationand its recursive relationship is shown in Figure 6.

13

Page 14: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

[.4,.6]x2

.2.5

x2

.1 .2

x1F = true F x1= false

x2

x1

.1 .2.5

F x1Σ F

x2

.6 .4[.1,.5]

[.1,.5] [.2,.5] [.1,.2]

Figure 6: An example application of the unaryrestriction andmarginalizationoperations. EachADD has all of its internal nodes annotated with[min, max], which can be recursivelycomputed from the children of each internal node.

• Restriction: The restriction of a variablexi in an ADDF to eithertrueor false(i.e. F |xi=true/false)can be computed by replacing all decision nodes for variablexi with the branch correspond-ing to the variable restriction. ThenReduce can be applied on the resulting decision diagramto convert it to a canonical ADD. Two examples of restriction are given in Figure 6.

• Sum out/marginalization: A variablexi can be summed (or marginalized) out of a functionF simply by computing the sum of the restricted functions (i.e.

xiF = F |xi=T ⊕F |xi=F ).

An example of this is given in Figure 6.

• Negation/reciprocation: Negation can be performed using the binaryApply operation on0 ⊖ F . Likewise, reciprocation (i.e.,1F ) can be computed using the binaryApply operation1 ⊘ F .

• Variable reordering: Rudell (1993) provides an ADD variable reordering algorithm thatcasts a general variable reordering in terms of a sequence of pairwise reorderings of neigh-boring variables. Then, the basic idea is that two variablesxi andxj can be reordered locally(i.e., rotated) in the ADD DAG without requiring the modification of any internal nodes otherthan those involvingxi andxj . Furthermore, Rudell describes how this can be done withoutrequiring extra storage for backpointers from children to parents ifGetNode ’s canonical nodecache is allowed to be modified.

4. Affine Algebraic Decision Diagrams (AADDs)

Although ADDs can exploit some structure in additive rewards as was shown in Figure 2(c), ADDswere not intended to directly exploit additive structure nor can they compactly represent all additivefunctions. What is needed is a decision diagram that can exploit logical, additive, and multiplicativeforms of independence.

14

Page 15: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

}

2 3

31 −

−0 ,< >

3

31 − 31 −

− −1, ><

31 − 3 ,< >

x2 x2

x1 x1

a) Additive AADD Structure b) Multiplicative AADD Structure

< 0 , 0 >< 1, 0 >

< 0 , 1/3 >

< 0 , 3 >

< 2/3, 1/3 >

0 0

< 0 , 0 > < 1, 0 >

f = 2x + x f = < 1;2x x2 112

}}

F

G

F

0

}

Figure 7: Portions of the ADDs from Figure 1(c) expressed as generalized AADDs. The edgeweights are given as〈c, b〉. The curly braces on the right indicate the elements of theAADD grammar that correspond to each portion of the AADD diagram.

To address the limitations of ADDs, we introduce an affine extension to the ADD(AADD) thatis capable of canonically and compactly representing context-specific, additive, and multiplicativestructure in functions fromBn → R. However, before we formally define AADDs we begin withtwo examples of AADDs that compactly represent additive and multiplicative structure.

Figure 7 shows portions of the exponentially sized ADDs from Figure 1c represented by AADDsof linear size. The evaluation of an AADD is essentially the same as ADDs: given a variableassignment, one traverses the AADD from the root to the leaf following branches at each nodecorresponding to the given variable assignment. However, one will note that the edges are labelledwith two parameters〈c, b〉 that denote an affine transform of the subnode it points to. That is, if thesubnode evaluates tov, then the affine transform of that subnode evaluates toc + b · v. This verysimple modification to ADDs to specify affine transforms on edges turns out to be quite powerful inthat previously exponentially-sized ADDs can be represented as linearly-sized ADDs as shown inthese examples.

4.1 Canonical Reduced AADDs

Recalling our definitions from Section 3 for ADDs, we formally define AADDswith the followingBNF grammar whereF represents anormalized AADDthat we will subsequently restrict to havemaximum range[0, 1] andG represents ageneralized AADDwith range[c, c + b]:

G ::= c + bF

F ::= 0 | if (F var ) then ch + bhFh else cl + blFl

F may be the constant0 terminal node or an internal decision node represented asif (F var )then ch + bhFh else cl + blFl. Internal decision nodes have essentially the same semantics as

15

Page 16: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

they did for ADDs in the BNF grammar from Equation 3 except that there is an affine transformch + bh · Fh on the high edge (evaluated whenvar = true) and an affine transformcl + bl · Fl

on the low edge (evaluated whenvar = false). Here,ch andcl are real constants in the closedinterval[0, 1], bh andbl are real constants in the half-open interval(0, 1], F var is a boolean variableassociated withF , andFl andFh are of grammarF (i.e., normalized AADDs themselves). We alsoimpose the following constraints to enforce canonicity of the AADD representation:

1. The variableF var does not appear inFh or Fl.

2. min(ch, cl) = 0

3. max(ch + bh, cl + bl) = 1

4. If Fh = 0 thenbh = 0 andch > 0. Similarly for Fl.

5. In the grammar forG, we require that ifF = 0 thenb = 0, otherwiseb > 0.

These constraints require thatF is normalized to have range[0, 1] (whenF 6= 0). Since normal-ized AADDs in grammarF are restricted to the range[0, 1], we need the top-level positive affinetransform of generalized AADDs in grammarG to allow for the representation of functions witharbitrary range. One can verify that these constraints hold for the AADDs in Figure 7 where all vari-able and terminal nodes are normalized AADD nodes in the grammar forF and the affine transformfor the root node of the AADD is a generalized node in the grammar forG.

Let V al(F, ρ) be the value of AADDF under variable value assignmentρ. This can be definedrecursively by the following equation:

V al(F, ρ) =

F = 0 : 0F 6= 0 ∧ ρ(F var ) = true : ch + bh · V al(Fh, ρ)F 6= 0 ∧ ρ(F var ) = false : cl + bl · V al(Fl, ρ)

Lemma 4.1. For any normalized AADDF over a variable domainx1, . . . , xn and for all vari-able assignmentsρ to variables inF ’s domain, we have thatV al(F, ρ) is in the interval[0, 1],minρ V al(F, ρ) = 0, and ifF 6= 0 thenmaxρ V al(F, ρ) = 1.

Proof. For the base case ofF = 0, the lemma obviously holds. Now, forF 6= 0, we inductivelyassume thatFl and Fh satisfy the lemma and are in the interval[0, 1]. Then forF , we obtainthe range[min(ch + min(Fh), cl + min(Fl)), max(ch + bh · max(Fh), cl + bl · max(Fl))], whichsimplifies to[min(ch, cl), max(ch + bh, cl + bl)] based on our inductive assumption. Our previousconstraints (2) and (3) then imply the range ofF is [0, 1], which proves the inductive case.

Recalling our previous definition of variable ordering for ADDs, we say thatF satisfies a givenvariable ordering ifF = 0 or F is of the formif (F var) then ch + bhFh else cl + blFl whereF var

does not occur inFh or Fl andF var is the earliest variable under the given ordering occuring inF .We say that a generalized AADD of formc + bF satisfies the order ifF satisfies the order.

Lemma 4.2. Fix a variable ordering overx1, . . . , xn. For any non-constant functiong(x1, . . . , xn)mappingB

n −→ R, there exists a unique generalized AADDG over variable domainx1, . . . , xn

satisfying the given variable ordering such that for allρ ∈ Bn we haveg(ρ) = V al(G, ρ).

16

Page 17: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

Algorithm 4: GetGNode(v, 〈ch, bh, Fh〉, 〈cl, bl, Fl〉) −→ 〈cr, br, Fr〉

input : v, 〈ch, bh, Fh〉, 〈cl, bl, Fl〉 : Var, offset, mult, and node id for high/low branches

output : 〈cr, br, Fr〉 : Return values for offset,multiplier, and canonical node id

begin// If branches redundant, return childif (cl = ch ∧ bl = bh ∧ Fl = Fh) then

return〈cl, bl, Fl〉;

// Non-redundant so compute canonical formrmin := min(cl, ch);rmax := max(cl + bl, ch + bh);rrange := rmax − rmin;cl := (cl − rmin)/rrange;ch := (ch − rmin)/rrange;bl := bl/rrange;bh := bh/rrange;

// Make new node if not in cacheif (〈v, 〈ch, bh, Fh〉, 〈cl, bl, Fl〉〉 → id is not in node cache)then

id := currently unallocated id;insert〈v, 〈ch, bh, Fh〉, 〈cl, bl, Fl〉〉 → id in cache;

// Return the cached, canonical nodereturn〈rmin, rrange, id〉 ;

end

See Appendix A for the proof.

This second lemma shows that under a given variable ordering, generalized AADDs are canon-ical, i.e., two identical functions will always have identical AADD representations.

We now define AADD algorithms that are analogs of those previously given for ADDs. Assuch, familiarity with theGetNode, Reduce, andApply algorithms from Section 3 will greatly aidin understanding the extensions to these algorithms for AADDs.

Similar to ADDs, we begin by defining a procedure for maintaining a cache of unique AADDnodes. All algorithms rely on the helper functionGetGNode given in Algorithm 4 that takes anunnormalized AADD node of the formif (v) then ch + bhFh else cl + blFl and returns the uniquecached, generalized2 AADD node of the form〈cr + brFr〉. As for GetNode with ADDs, such aprocedure is needed to ensure that there is a single unique node representing any given function.3

Then, given a potentially unnormalized representation of an entire AADD, we define an AADDgeneralization of theReduce algorithm that constructs a corresponding canonical generalized AADD,removing any redundant structure in the process. Next, we define an AADD generalization ofthe Apply algorithm to specify an efficient procedure for performing binary operations on theseAADDs. From these operations, we can then build the remaining operations such as unarymin andmax and marginalization that we will need for probabilistic inference.

2. Thus the “G” in the procedure name forGetGNode.3. Throughout all of the algorithms we use the tuple representation〈c, b, F 〉, while in the text we often use the equivalent

notation〈c + bF 〉 to make the node semantics more clear.

17

Page 18: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

3

2

x1

< 0 , 0 >< 1, 0 >

< 0 , 1/3 >

< 0 , 3 >

< 2/3, 1/3 >

0

x2

x1

0

< 3, 0 > < 2 , 0 >

x1 x1

x2x2

x1

x2

x1

x2

x1x1

0

< 1, 0 > < 0 , 0 >

0

< 0 , 0 >

< 2, 1 >

< 1, 0 >< 0 , 0 >< 1, 0 >

0

< 0 , 1 >< 2, 1 >

0

< 1, 0 > < 0 , 0 >

< 2, 1 >

012

x

Figure 8: An example application of theReduce algorithm. The input is the top, leftmost diagram(all edge weights are assumed to be〈0, 1〉). The solid arrow shows the node currentlybeing evaluated byReduce while the next diagram shows the result after this evaluation;when the solid arrow is on a branch rather than a node itself, it indicates thatit is complet-ing the evaluation of that branch within theReduce call for the parent node. The bottom,leftmost diagram is the final canonical AADD representation of the input.

At an abstract level, one can view theGetNode, Reduce, andApply algorithms for AADDsas essentially identical to those for ADDs except that they are extended to propagate the affinetransform of the edge weights on recursion and to compute the normalization of the resulting nodeon return.

TheReduce algorithm given in Algorithm 5 takes an arbitrary ordered AADD, normalizesandcaches the internal nodes, and returns the corresponding generalized AADD. This produces a uniquerepresentation of the AADD that removes any redundant structure in the input representation. Onewill note that theReduce algorithm precisely follows the constructive proof in Lemma 4.2. This issufficient to prove correctness of the algorithm. An example application of theReduce algorithm isgiven in Figure 8.

One nice property of theReduce algorithm is that one does not need to prespecify the structurethat the AADD should exploit. If the represented function contains context-specific, additive, ormultiplicative independence, theReduce algorithm will compactly represent this structure uniquelyand automatically w.r.t. the variable ordering as guaranteed by previous lemmas.

18

Page 19: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

Algorithm 5: Reduce(〈c, b, F 〉) −→ 〈cr, br, Fr〉

input : 〈c, b, F 〉 : Offset, multiplier, and node id

output : 〈cr, br, Fr〉 : Return values for offset,multiplier, and node id

begin// Check for terminal nodeif (F = 0) then

return〈c, 0, 0〉;

// Check reduce cacheif ( F → 〈cr, br, Fr〉 is not in reduce cache)then

// Not in cache, so recurse〈ch, bh, Fh〉 := Reduce(ch, bh, Fh);〈cl, bl, Fl〉 := Reduce(cl, bl, Fl);

// Retrieve canonical form〈cr, br, Fr〉 := GetGNode(F var, 〈ch, bh, Fh〉, 〈cl, bl, Fl〉);

// Put in cacheinsertF → 〈cr, br, Fr〉 in reduce cache;

// Return canonical reduced nodereturn〈c + b · cr, b · br, Fr〉;

end

, b

Fvar1,h Fvar

1,l

< c , b >2 2

< c , b >1 1

Fvar1 Fvar

Fvar Fvar

2

2,h 2,l

op

1,h>

1,h>

1,l 1,l >2,h 2,h

>2,l 2,l

< c , b < c , b < c , b < c

Figure 9: Two AADD nodesF1 andF2 and a binary operationop with the corresponding notationused in the presentation of theApply algorithm.

4.2 Binary Operations on AADDs

We letop denote a binary operator on AADDs with possible operations being addition,substraction,multiplication, division, min, and max denoted respectively as⊕, ⊖, ⊗, ⊘, min(·, ·), andmax(·, ·).We do not explicitly provide binary comparison functions≥, >, ≤, < for AADDs as we did forADDs, but note that they could be easily defined analogously to the other binary operations, ifneeded.

19

Page 20: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

Algorithm 6: Apply(〈c1, b1, F1〉, 〈c2, b2, F2〉, op) −→ 〈cr, br, Fr〉

input : 〈c1, b1, F1〉, 〈c2, b2, F2〉, op : Nodes and op

output : 〈cr, br, Fr〉 : Generalized node to returnbegin

// Check if result can be immediately computedif (ComputeResult(〈c1, b1, F1〉, 〈c2, b2, F2〉, op) → 〈cr, br, Fr〉 is not null )then

return〈cr, br, Fr〉;

// Get normalized key and check apply cache〈〈c′1, b

1〉, 〈c′

2, b′

2〉〉 :=GetNormCacheKey(〈c1, b1, F1〉, 〈c2, b2, F2〉, op);

if ( 〈〈c′1, b′

1, F1〉, 〈c′

2, b′

2, F2〉, op〉 → 〈cr, br, Fr〉 is not in apply cache)then// Not terminal, so recurseif (F1 is a non-terminal node)then

if (F2 is a non-terminal node)thenif (F var

1 comes beforeF var2 ) then

var := F var1 ;

elsevar := F var

2 ;

elsevar := F var

1 ;

elsevar := F var

2 ;

// Propagate affine transform to branchesif (F1 is non-terminal∧ var = F var

1 ) thenF v1

l := F1,l; F v1

h := F1,h;cv1

l := c′1 + b′1 · c1,l; cv1

h := c′1 + b′1 · c1,h;bv1

l := b′1 · b1,l; bv1

h := b′1 · b1,h;

elseF v1

l/h := F1; cv1

l/h := c′1; bv1

l/h := b′1;

if (F2 is non-terminal∧ var = F var2 ) then

F v2

l := F2,l; F v2

h := F2,h;cv2

l := c′2 + b′2 · c2,l; cv2

h := c′2 + b′2 · c2,h;bv2

l := b′2 · b2,l; bv2

h := b′1 · b2,h;

elseF v2

l/h := F2; cv2

l/h := c′2; bv2

l/h := b′2;

// Recurse and get cached result〈cl, bl, Fl〉 := Apply(〈cv1

l , bv1

l , F v1

l 〉, 〈cv2

l , bv2

l , F v2

l 〉, op);〈ch, bh, Fh〉 := Apply(〈cv1

h , bv1

h , F v1

h 〉, 〈cv2

h , bv2

h , F v2

h 〉, op);〈cr, br, Fr〉 := GetGNode(var, 〈ch, bh, Fh〉, 〈cl, bl, Fl〉);

// Put result in apply cache and returninsert〈c′1, b

1, F1, c′

2, b′

2, F2, op〉 → 〈cr, br, Fr〉 into apply cache;

returnModifyResult(〈cr, br, Fr〉);

end

TheApply routine given in Algorithm 6 takes two generalized AADD operands and an oper-ation as given in Figure 9 and produces the resulting generalized AADD. The control flow of thealgorithm is straightforward: We first check whether we can compute the result immediately, oth-

20

Page 21: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

ComputeResult(〈c1, b1, F1〉, 〈c2, b2, F2〉, op) −→ 〈cr, br, Fr〉Operation and Conditions Return Value

〈c1 + b1F1〉 〈op〉 〈c2 + b2F2〉; F1 = F2 = 0 〈(c1 〈op〉 c2) + 0 · 0〉max(〈c1 + b1F1〉, 〈c2 + b2F2〉); c1 + b1 ≤ c2 〈c2 + b2F2〉max(〈c1 + b1F1〉, 〈c2 + b2F2〉); c2 + b2 ≤ c1 〈c1 + b1F1〉〈c1 + b1F1〉 ⊕ 〈c2 + b2F2〉; F1 = F2 〈(c1 + c2) + (b1 + b2)F1〉max(〈c1 + b1F1〉, 〈c2 + b2F1〉); F1 = F2,(c1 ≥ c2 ∧ b1 ≥ b2) ∨ (c2 ≥ c1 ∧ b2 ≥ b1)

c1 ≥ c2 ∧ b1 ≥ b2 : 〈c1 + b1F1〉c2 ≥ c1 ∧ b2 ≥ b1 : 〈c2 + b2F1〉

Note: for allmax operations above, return opposite formin

〈c1 + b1F1〉 〈op〉 〈c2 + b2F2〉; F2 = 0, op ∈ {⊕,⊖} 〈(c1 〈op〉 c2) + b1F1〉〈c1 + b1F1〉 〈op〉 〈c2 + b2F2〉; F2 = 0, c2 ≥ 0, op ∈ {⊗,⊘} 〈(c1 〈op〉 c2) + (b1 〈op〉 c2)F1〉

Note: above two operations can be modified to handleF1 = 0 whenop ∈ {⊕,⊗}other null

Table 2: Input and output summaries of theComputeResult terminal computation routine.

GetNormCacheKey(〈c1, b1, F1〉, 〈c2, b2, F2〉, op) −→ 〈〈c′1, b′

1〉〈c′

2, b′

2〉〉 and ModifyResult(〈cr, br, Fr〉) −→ 〈c′r, b′

r, F′

r〉

Operation and Conditions Normalized Cache Key and Computation Result Modification

〈c1 + b1F1〉 ⊕ 〈c2 + b2F2〉; F1 6= 0 〈cr + brFr〉 = 〈0 + 1F1〉 ⊕ 〈0 + (b2/b1)F2〉 〈(c1 + c2 + b1cr) + b1brFr〉

〈c1 +b1F1〉⊖〈c2 +b2F2〉; F1 6= 0 〈cr + brFr〉 = 〈0 + 1F1〉 ⊖ 〈0 + (b2/b1)F2〉 〈(c1 − c2 + b1cr) + b1brFr〉

〈c1 + b1F1〉 ⊗ 〈c2 + b2F2〉; F1 6= 0 〈cr + brFr〉 = 〈(c1/b1) + F1〉 ⊗ 〈(c2/b2) + F2〉 〈b1b2cr + b1b2brFr〉

〈c1 + b1F1〉 ⊘ 〈c2 + b2F2〉; F1 6= 0 〈cr + brFr〉 = 〈(c1/b1) + F1〉 ⊘ 〈(c2/b2) + F2〉 〈(b1/b2)cr + (b1/b2)brFr〉

max(〈c1+b1F1〉, 〈c2+b2F2〉);F1 6= 0, Note: same formin

〈cr + brFr〉 =max(〈0 + 1F1〉, 〈(c2 − c1)/b1 + (b2/b1)F2〉)

〈(c1 + b1cr) + b1brFr〉

any〈op〉 not matching above:〈c1 + b1F1〉 〈op〉 〈c2 + b2F2〉

〈cr + brFr〉 = 〈c1 + b1F1〉 〈op〉 〈c2 + b2F2〉 〈cr + brFr〉

Table 3: Input and output summaries of theGetNormCacheKey , andModifyResult routines.

erwise we normalize the operands to a canonical form and check if we canreuse the result of apreviously cached computation. If we can do neither of these, we then choose a variable to branchon and recursively call theApply routine for each instantiation of the variable. We cover these stepsin-depth in the following sections.

4.2.1 TERMINAL COMPUTATION

The functionComputeResult given in thetop half of Table 3, determines if the result of a compu-tation can be immediately computed without recursion. The first entry in this table isrequired forproper termination of the algorithm as it computes the result of an operation applied to two terminal0 nodes. However, the other entries denote a number of pruning optimizationsthat immediatelyreturn a node without recursion. For example, given the operation〈3 + 4F1〉 ⊕ 〈5 + 6F1〉, we canimmediately return the result〈8 + 10F1〉 sinceF1 is shared by both operands.

21

Page 22: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

4.2.2 RECURSIVE COMPUTATION

If a call toApply is unable to immediately compute a result or reuse a previously cached computa-tion, we must recursively compute the result. For this we have two cases (thethird case where bothoperands are0 terminal nodes having been taken care of in the previous section):

• F1 or F2 is a 0 terminal node, or F var1 6= F var

2 : We assume the operation is commutativeand reorder the operands so thatF1 is the0 node or the operand whose variable comeslaterin the variable ordering so that we know to branch onF var

2 first.4 Then we propagate theaffine transform to each ofF2’s branches and compute the operation applied separately toF1

andeachof F2’s high and low branches. We then build anif statement conditional onF var2

and normalize it to obtain the generalized AADD node〈cr, br, Fr〉 for the result:

〈ch, bh, Fh〉 = Apply(〈c1, b1, F1〉, 〈c2 + b2c2,h, b2b2,h, F2,h〉, op)

〈cl, bl, Fl〉 = Apply(〈c1, b1, F1〉, 〈c2 + b2c2,l, b2b2,l, F2,l〉, op)

〈cr, br, Fr〉 = GetGNode(F var2 , 〈ch, bh, Fh〉, 〈cl, bl, Fl〉)

• F1 and F2 are non-terminal nodes andF var1 = F var

2 : Since the variables for each operandmatch, we know the result〈cr, br, Fr〉 is simply a generalizedif statement branching onF var

1

(= F var2 ) with the true case being the operator applied to the high branches ofF1 andF2 and

likewise for the false case and the low branches:

〈ch, bh, Fh〉 = Apply(〈c1 + b1c1,h, b1b1,h, F1,h〉,〈c2 + b2c2,h, b2b2,h, F2,h〉, op)

〈cl, bl, Fl〉 = Apply(〈c1 + b1c1,l, b1b1,l, F1,l〉〈c2 + b2c2,l, b2b2,l, F2,l〉, op)

〈cr, br, Fr〉 = GetGNode(F var1 , 〈ch, bh, Fh〉, 〈cl, bl, Fl〉)

4.2.3 CANONICAL CACHING

If the AADD Apply algorithm were to compute and cache the results of applying an operationdirectly to the operands, the algorithm would provably have the same time complexity as the ADDApply algorithm. Yet, if we were to compute〈0+1F1〉⊕〈0+2F2〉 and cache the result〈cr +brFr〉,we could compute〈5 + 2F1〉 ⊕ 〈4 + 4F2〉 without recursion as follows:

(a) 〈5 + 2F1〉 ⊕ 〈4 + 4F2〉 = 9 + 2 · (〈0 + F1〉 ⊕ 〈0 + 2F2〉)(b) = 9 + 2 · 〈cr + brFr〉(c) = 〈(9 + 2cr) + 2brFr〉

The key observation here is that we can (a) rewrite the second operationin a normalized formwhere we subtract off the constants and divide by the first coefficient,(b) substitute in the result of apreviously cached computation, and then (c) modify the result to reverse the previous normalization.

4. As for ADDs, we note that the first case prohibits the use of the non-commutative⊖ and⊘ operations. However, asimple solution would be to recursively descend on eitherF1 orF2 rather than assuming commutativity and swappingoperands to ensure descent onF2. To accommodate general non-commutative operations, we have used this alternateapproach in our specification of theApply routine given in Algorithm 6.

22

Page 23: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

This suggests a canonical caching scheme that normalizes all cache entries to increase thechance of a cache hit. The actual result can then be easily computed fromthe cached result by re-versing the normalization as demonstrated in the example. This ensures optimal reuse of theApply

operations cache and can lead to an exponential reduction in running time over the non-canonicalcaching version.

We introduce two additional functions to perform this caching:GetNormCacheKey to com-pute the canonical cache key, andModifyResult to reverse the normalization in order to computethe actual result. These algorithms are summarized in thebottom halfof Table 3.

4.2.4 OTHER OPERATIONS

We summarize some of the remaining operations that can be performed (efficiently) on AADDs:

• min and max computation: The min and max of a generalized AADD node〈c + bF 〉 arerespectivelyc andc + b due to[0, 1] normalization ofF .

• Restriction: The restriction of a variablexi in a function to eithertrueor false(i.e. F |xi=T/F )can be computed similarly to ADDs by replacing all decision nodes for variablexi with thebranch corresponding to the variable restriction and propagating the affine transform to thedirect subnodes. ThenReduce can be applied on the resulting decision diagram to convert itto a canonical AADD.

• Sum out/marginalization: A variablexi can be summed (or marginalized) out of a functionF simply by computing the sum of the restricted functions (i.e.F |xi=T ⊕F |xi=F ) exactly asdone for ADDs.

• Negation/reciprocation: While it may seem that negation of a generalized AADD node〈c + bF 〉 would be as simple as〈−c + −bF 〉, we note that this violates our normalizationscheme which requiresb > 0. Consequently, negation must be performed explicitly with theApply operation as0 ⊖ 〈c + bF 〉. Likewise, reciprocation (i.e., 1

〈c+bF 〉 ) must be performedexplicitly with theApply operation as1 ⊘ 〈c + bF 〉.

• Variable reordering: The ADD variable reordering algorithm of Rudell (1993) previouslysummarized for ADDs can be applied to AADDs without loss of efficiency. The only mod-ification needed is to recompute the normalized affine transforms for pairwiserotations ofneighboring nodes involving variablesxi andxj , but this is simply a local application of theReduce algorithm.

4.2.5 CACHE IMPLEMENTATION

If one were to use a naive cache implementation that relied on exact floating-point values for hashingand equality testing, one would find that many nodes whichshouldbe the same under exact com-putation often turn out to have offsets or multipliers differing by±1e-15; these numerical precisionerrors result from repeated multiplications and divisions during theReduce andApply operations.This can result in an exponential explosion of nodes if not controlled. Consequently, it is betterto use a hashing scheme that considers equality within some range of numerical precision errorǫ.While it is difficult to guarantee such an exact property for anefficienthashing scheme, we next

23

Page 24: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

SANNER, MCALLESTER, UTHER, DELGADO

2d

<u ,u >2

εε

<0,0>

<v ,v >1 2

1

1d

ε

Figure 10: A geometric representation of the hashing scheme we use. All points withinǫ of 〈u1, u2〉(the shaded circle) lie within the ring having outer and inner radius

√u1

2 + u22 ± ǫ.

Thus, a hashing scheme which hashes all points within the ring to thesamebucketguarantees that all points withinǫ of 〈u1, u2〉 also hash to the same bucket. Note thatbuckets are discretized according to the distance from the origin (i.e., the vantage pointfor comparison).

outline an approximate approach that we have found to work both efficientlyand nearly optimallyin practice.

The node cache used inGetGNode and the operation result cache used inApply both use cachekeys containing four floating-point values (i.e., the offsets and multipliers for two AADD nodes).If we consider this 4-tuple of floating-point values to be a point in Euclideanspace, then we canmeasure the distance between two 4-tuples〈u1, u2, u3, u4〉 and〈v1, v2, v3, v4〉 as the Ł2 (Euclidean)distance between these points. In an approximate caching scheme that takesnumerical precisionerror into account, we might consider two 4-tuples corresponding to hashkeys to be equivalent iftheir Ł2 distance from each other is smaller thanǫ:

(u1 − v1)2 + (u2 − v2)2 + (u3 − v3)2 + (u4 − v4)2 < ǫ (4)

Ideally, when probing the cache to see if a key exists within an Ł2 distance ofǫ, we would preferto avoid a pairwise comparison of our probe key to all nodes currently in thecache. Fortunately,we can use thevantage point(Yianilos, 1993) method for efficiently finding nearest neighbors in ametric space. The basic idea of these methods is that we can exploit the triangleinequality to obtainthe following necessary conditions implied by the previous error between two4-tuples:

(u1 − v1)2 + (u2 − v2)2 + (u3 − v3)2 + (u4 − v4)2 ≤ ǫ

=⇒ |√

u12 + u2

2 + u32 + u4

2 −√

v12 + v2

2 + v32 + v4

2| ≤ ǫ

A geometric representation providing intuitions for these necessary conditions is given for twodimensions in Figure 10. The benefit of these necessary conditions is thattheir computation onlyrequires the relative distances of each 4-tuple to the origin (thus, we can view the origin as thevantage point for comparison). While this only gives us a necessary condition in our search for

24

Page 25: Efficient Factored Inference with Affine Algebraic Decision ...users.cecs.anu.edu.au/~ssanner/Unpub/aadd_jair.pdf · SANNER, MCALLESTER, UTHER, DELGADO ADD Structure 7 6 5 4 3 2

EFFICIENT FACTORED INFERENCE WITHAFFINE ALGEBRAIC DECISION DIAGRAMS

4-tuples within some Ł2 distance of the probe, it gives us a simple test that allows us to prune outthe majority of 4-tuples that we need to consider in a typical case.

Based on these necessary conditions for Equation 4, we can use the following approximatehashing scheme that will determine other 4-tuples in the hash table that are candidates for beingwithin ǫ distance of a probe〈v1, v2, v3, v4〉: Compute the Ł2 distanced between〈v1, v2, v3, v4〉and the origin. To compute the hash key for〈v1, v2, v3, v4〉, extract only the bits of the floating-point representation ofd representing a fractional portion greater thanǫ and use this for an integerrepresentation of the hash key (we are effectively discretizing the distances into buckets of widthǫ). For equality testing in the hash table, test that the true Ł2 metric between a tuple〈u1, u2, u3, u4〉and the probe〈v1, v2, v3, v4〉 is less thanǫ.

While this hashing scheme does not guarantee that all 4-tuples havingǫ distance from the origin〈0, 0, 0, 0〉 hash to the same bucket (some 4-tuples withinǫ could fall over bucket boundaries), wefound that with bucket widthǫ = 1e-9 and numerical precision error generally less than1e-13,there was only a small chance of two nodes withinǫ distance hashing to different buckets. Forthe empirical results we describe, this hashing scheme was sufficient to prevent any uncontrollablecases of numerical precision error.

An alternate (and exact) hashing scheme would be to explicitly check the neighboring bucketfor matching 4-tuples when the probe comes withinǫ of a bucket boundary.5

4.3 Theoretical Results

Here we present two fundamental results for AADDs. The first theorembounds the worst-casespace and time performance of theReduce andApply operations for AADDs in terms of the corre-sponding operations on ADDs:

Theorem 4.3. For all functionsF1 : Bn −→ R and F2 : B

m −→ R (n ≥ 0 and m ≥ 0),the time and space performance ofReduce(F1) andApply(F1, F2, op) for AADDs (operands andresults represented as canonical AADDs) is within a multiplicative constant of Reduce(F1) andApply(F1, F2, op) for ADDs (operands and results represented as canonical ADDs) in theworstcase assuming any fixed variable ordering.

See Appendix A for the proof.While the above results bound the space and time of the AADD operations on arbitrary functions

relative to the ADD operations for the same functions, it is interesting to note that the worst casespace and time bounds for theApply operation givensolely in terms of the corresponding size ofthe input operands is very different for ADDs vs. AADDs.

The size of the result of the ADDApply operation is known to be bounded quadratically inthe size of the largest input operand. Bryant (1986) shows this simply byobserving that the sizeof the ADD can be bounded in the number of possible distinctApply calls given two operands(any non-distinct calls will already have been cached), which is at most all possible pairs of nodeswhen taking one node each from the first and second operands. The number of these node pairs isobviously quadratic in the size of the largest input operand. Since each (recursive)Apply call cancontribute a maximum of one node to the ADD resulting from theApply operation, the space boundof Apply follows.

5. This suggestion is due to Roni Khardon.


SANNER, MCALLESTER, UTHER, DELGADO

On the other hand, we note that the size of the result of the AADD Apply operation can only be bounded exponentially in the combined size of the operands. To understand this, note that unlike ADDs, AADDs allow reconvergent edges when these edges are labelled with different affine transforms of the same child node. For example, this can be observed in the linearly structured AADDs of Figure 7. Let n be the number of nodes in a linearly structured AADD; then an Apply call with one of these AADDs as an operand may need to traverse all possible distinct paths from the root node to the terminal node, which is exp(n) due to the reconvergent structure. Following the same reasoning as for the ADD, this exponential number of Apply calls can lead to a result of the Apply operation that has a number of nodes exponential in the combined size of the operands (in the worst case).

Nonetheless, it is important to reiterate the result of Theorem 4.3: the time and space complexity of operations on functions represented as AADDs is never more than a constant factor worse than that of the same operations applied to the same functions represented as ADDs.

The second theorem shows that in special cases, the AADD can yield an exponential-to-linear reduction in time and space complexity over the ADD:

Theorem 4.4. There exist functions F1 and F2 and an operator op such that the running time and space performance of Apply(F1, F2, op) for AADDs can be linear in the number of variables when the corresponding ADD operations are exponential in the number of variables.

See Appendix A for the proof.

Empirically, we note that while the use of AADDs in place of ADDs has always led to smaller space requirements and faster operations in all of our test cases, the rather extreme best case of a reduction from exponential to linear complexity noted in Theorem 4.4 has rarely been observed in practice. Perhaps more disappointingly, functions that appear to have additive and multiplicative structure that could be exploited extensively by AADDs sometimes turn out to benefit little from the use of AADDs in place of ADDs. For example, the AADD representation of the function (∑_{i=1}^{n} 2^i x_i)^2 requires precisely 1/4 of the space of the ADD (for n > 2), even though the additive and multiplicative structure inherent in this function ostensibly suggests that the AADD might achieve a substantially more compact representation than the ADD. Nonetheless, the 75% reduction in space obtained by using the AADD in place of the ADD still justifies its use in this case.

5. Factor Approximation

5.1 Tables

5.2 ADDs

ADDs can be efficiently pruned to reduce their size in exchange for some approximation error. To compress an ADD F within ǫ error, the operation ApproxADD(F, ǫ) (Algorithm 7) collects all leaves of the ADD and determines which can be merged to form new values without introducing more than ǫ error. The old values are then replaced with these new values, creating a new (minimally reduced) ADD. An illustrative example of ApproxADD(F, ǫ) is provided in Figure 11.


EFFICIENT FACTORED INFERENCE WITH AFFINE ALGEBRAIC DECISION DIAGRAMS

[Figure 11 here: the original ADD with leaves 0, 0.01, 1.01, 1.04, 2.01, 2.03, 3.03, 3.07, 4.01, 4.03, 5.03, 5.07, 6.03, 6.06, 7.06, 7.11, and its compressed version with merged leaves 0.005, 1.025, 2.02, 3.05, 4.02, 5.05, 6.045, 7.085.]

Figure 11: Compression of the ADD ∑_{i=1}^{3} 2^i x_i + ∑_{i=1}^{4} ∑_{j=i}^{4} 0.01 x_i x_j within precision 0.1. Dotted lines are the low (false) branch and solid lines are the high (true) branch.

Algorithm 7: ApproxADD(DD, ǫ)
begin
    leaves_old := collectLeavesADD(DD);
    {leaves_old → leaves_new} := mergeLeaves(leaves_old, ǫ);
    return createNewDD(DD, {leaves_old → leaves_new});
end
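The paper does not spell out mergeLeaves in detail, but one plausible greedy realization (a sketch under our own assumptions) sorts the leaf values and groups runs whose total span stays within ǫ, mapping every value in a group to the group's midpoint; this reproduces the merges shown in Figure 11 for ǫ = 0.1:

```python
def merge_leaves(leaves, eps):
    """Greedily group sorted leaf values whose total span is <= eps,
    mapping each old value to the midpoint of its group (so no value
    moves by more than eps / 2)."""
    mapping = {}
    group = []
    for v in sorted(set(leaves)):
        if group and v - group[0] > eps:
            # current value would stretch the group past eps: close it
            mid = (group[0] + group[-1]) / 2.0
            for g in group:
                mapping[g] = mid
            group = []
        group.append(v)
    if group:
        mid = (group[0] + group[-1]) / 2.0
        for g in group:
            mapping[g] = mid
    return mapping
```

On the Figure 11 leaves, e.g., 3.03 and 3.07 map to 3.05, and 0 and 0.01 map to 0.005, matching the compressed ADD in the figure.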

5.3 AADDs

We now introduce a method for efficiently finding compact approximations of AADDs within an ǫ error budget.

Whereas it was fairly simple to approximate ADDs within ǫ error as shown in Figure 11, it is less straightforward for AADDs. The problem is that the only leaf value is 0 and all of the value structure is stored internally in the edge-based affine transforms.

To see how we might approximate an AADD, it is best to view an example. If we examine Figure 12, we note that the "noisy" AADD on the left is simply the function ∑_{i=1}^{3} 2^i x_i with pairwise noise factors ∑_{i=1}^{4} ∑_{j=i}^{4} 0.01 x_i x_j added in. On the right-hand side, we see the compressed version of this AADD, representing a compact approximation of ∑_{i=1}^{3} 2^i x_i without the additional pairwise noise terms that lead to branching in the AADD, since this structure can be merged or pruned within an ǫ = 0.1 error budget.

How do we obtain the compressed AADD on the right-hand side of Figure 12? It turns out that there are two basic operations that will allow us to recover it. However, we must first execute MarkRange (Algorithm 8) on the AADD we wish to compress in order to determine the maximum contribution of any node F to the overall value (we store this value in F^MaxRange, which should be initialized to zero before running the algorithm). Incidentally, MarkRange also sets the F^ǫ property of each node F, which indicates how much of the ǫ error budget is left to use in potentially approximating node F. Once we have done this, we can perform the following two operations, leading to an ApproxAADD(〈c, b, F〉, ǫ) operation for AADDs that we will formally define shortly.


[Figure 12 here: on the left, the "noisy" AADD with edge-based affine transforms (e.g., 〈0 + 7.11·〉 at the root and 〈0.332 + 0.668·〉, 〈0 + 0.665·〉 on internal branches); on the right, its compressed version, in which the near-duplicate x3 nodes have been merged and the low-impact x4 nodes removed.]

Figure 12: Compression of the AADD ∑_{i=1}^{3} 2^i x_i + ∑_{i=1}^{4} ∑_{j=i}^{4} 0.01 x_i x_j within precision 0.1. Dotted lines are the low (false) branch and solid lines are the high (true) branch.

Algorithm 8: MarkRange(〈c, b, F〉, range, ǫ)
input: 〈c, b, F〉 : offset, multiplier, and node id
begin
    // Check for terminal node
    if F = 0 or (F^visited and F^MaxRange > range) then
        return;
    // Initialize error budget for current node
    F^ǫ := ǫ;
    // Update max range for current node
    F^MaxRange := max(F^MaxRange, range);
    // Recurse on both branches of F with updated range
    MarkRange(〈F.c_l, F.b_l, F.F_l〉, range · F.b_l, ǫ);
    MarkRange(〈F.c_h, F.b_h, F.F_h〉, range · F.b_h, ǫ);
end

Merge Nodes: The first approximation procedure we might want to perform is illustrated in Figure 13. Here we have two nodes F1 and F2 and we need to determine whether to merge them into a single node.

To see why we would want to do this, note that in Figure 12 there are many nodes that have the same variable tests, the same children, and nearly identical affine transforms on their low and high branches. If they do not have the same children, we note that if their grandchildren were first merged, they might then have the same children. These nodes and affine transforms would be identical except for the addition of the asymmetrical noise term ∑_{i=1}^{4} ∑_{j=i}^{4} 0.01 x_i x_j. However, we note that we can remove this noise in many cases by merging these nearly identical nodes while controlling the amount of error induced by this approximation.

To potentially merge nodes with identical high and low children, we must calculate the maximum error incurred in the function when the affine transform for the low branch of F1 is used for


[Figure 13 here: two AADD nodes F1 and F2 testing the same variable, reached via edges 〈c_1, b_1〉 and 〈c_2, b_2〉, with low/high branch transforms 〈c_{1,l}, b_{1,l}〉, 〈c_{1,h}, b_{1,h}〉 and 〈c_{2,l}, b_{2,l}〉, 〈c_{2,h}, b_{2,h}〉 respectively.]

Figure 13: Two AADD nodes F1 and F2 (where F1^var = F2^var) with the notation used in the merging procedure. Here, we want to ask whether these two nodes can be merged while incurring less than ǫ error impact on the function.

[Figure 14 here: an AADD node F1 with the same child on its low and high branches, reached via edge 〈c, b〉, with branch transforms 〈c_l, b_l〉 and 〈c_h, b_h〉; pruning bypasses F1 with a single edge 〈c_r, b_r〉.]

Figure 14: An AADD node F1 with the notation used in the pruning procedure. Here, we want to ask whether F1 can be completely pruned while incurring less than ǫ error impact on the function.

the low branch of F2, and likewise when the affine transform for the high branch of F1 is used for the high branch of F2:

    error := max(F1^MaxRange, F2^MaxRange)                                (5)
             · max(|F1.c_l − F2.c_l| + |F1.b_l − F2.b_l|,
                   |F1.c_h − F2.c_h| + |F1.b_h − F2.b_h|)

If error < ǫ then we can perform a node merge where we simply replace F1 with F2 and update our error budget for F2 as F2^ǫ := F2^ǫ − error. Clearly, the maximum merge error is just the error of the affine transform approximation multiplied by the MaxRange of this node, since all nodes are normalized to [0, 1]. A slightly more complex procedure could replace both nodes with an averaged version, but this has subtle implications for AADD normalization that complicate the algorithm and reduce its efficiency.
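The merge test can be sketched in a few lines of Python (our own illustration; the NodeInfo fields are hypothetical stand-ins for the per-node bookkeeping, and the paper's prose and Algorithm 9 differ slightly in how the budget is updated, so this follows one consistent choice):

```python
from dataclasses import dataclass

@dataclass
class NodeInfo:
    """Hypothetical bookkeeping for one AADD node (field names are ours):
    low/high affine transforms plus the MaxRange and epsilon budget
    computed by MarkRange."""
    cl: float
    bl: float
    ch: float
    bh: float
    max_range: float
    eps: float

def merge_error(F1, F2):
    """Eq. (5): worst low/high transform deviation between F1 and F2,
    scaled by the larger contribution either node makes to the function."""
    deviation = max(abs(F1.cl - F2.cl) + abs(F1.bl - F2.bl),
                    abs(F1.ch - F2.ch) + abs(F1.bh - F2.bh))
    return max(F1.max_range, F2.max_range) * deviation

def try_merge(F1, F2):
    """Merge F2 into F1 when the induced error fits within the smaller
    of the two remaining budgets; returns True if the merge happened."""
    err = merge_error(F1, F2)
    if err >= min(F1.eps, F2.eps):
        return False
    F1.max_range = max(F1.max_range, F2.max_range)
    F1.eps = min(F1.eps, F2.eps) - err
    return True
```

For two nodes whose transforms differ only by 0.01 in one multiplier, with MaxRanges 2.0 and 1.0, the merge error is 2.0 · 0.01 = 0.02, well inside an ǫ = 0.1 budget.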

Prune Nodes: The second approximation procedure we might want to perform is to remove a node entirely and replace it with its child, in the case that it has the same child on its low and high branches, as shown for F1 in Figure 14.

To see why we would want to do this operation, note that in Figure 12 the variable x4 has little impact (quantitatively, 0.04 or less) on the overall value of the AADD; with an allowable error budget ǫ = 0.1, it can be removed entirely. This removal cannot be done by merging nodes; it requires pruning nodes. The error analysis for such pruning determines the error incurred if the


Algorithm 9: ApproxAADD(AADD = 〈c, b, F〉, ǫ)
begin
    MarkRange(〈c, b, F〉, b, ǫ);
    foreach variable level from bottom to top in the AADD do
        foreach node F1 in the level do
            if F1^visited then continue;
            F1^ǫ := min(F_{1,l}^ǫ, F_{1,h}^ǫ);
            mergeList := ∅;
            foreach node F2 in the level do
                if F2^visited then continue;
                F2^ǫ := min(F_{2,l}^ǫ, F_{2,h}^ǫ);
                if F_{1,l} = F_{2,l} and F_{1,h} = F_{2,h} then
                    F2^mergeErr := compute using Eq (5) with F1, F2;
                    err := min(F1^ǫ, F2^ǫ) − F2^mergeErr;
                    if err > 0 then
                        insert F2 in mergeList;
            if size(mergeList) > 0 then
                foreach node F2 in mergeList do
                    F1^MaxRange := max(F1^MaxRange, F2^MaxRange);
                    F1^ǫ := min(F1^ǫ, F2^ǫ);
                    replace references to F2 with F1;
                    F2^visited := true;
end

decision for F1 is removed:

    error := F1^MaxRange · (|F1.c_h − F1.c_l| + |F1.b_h − F1.b_l|)/2

If error < ǫ, we assign c_r := c + b · (F1.c_l + F1.c_h)/2 and b_r := b · (F1.b_l + F1.b_h)/2, replace F1 with its child F2, and reduce our error budget for F2 by F2^ǫ := F2^ǫ − error. Again, since all nodes are normalized to [0, 1], we only lose the error induced by the deviation of the two affine transforms from their average, multiplied by the MaxRange of the node being pruned.
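A minimal Python sketch of these two pruning formulas (our own illustration; the node fields reuse the hypothetical names from the merge sketch and are not from the paper's implementation):

```python
from types import SimpleNamespace

def prune_error(F1):
    """Error bound for bypassing node F1 entirely: the deviation of its
    low and high affine transforms from their average, scaled by F1's
    maximum contribution to the overall function (its MaxRange)."""
    return F1.max_range * (abs(F1.ch - F1.cl) + abs(F1.bh - F1.bl)) / 2.0

def pruned_transform(c, b, F1):
    """The replacement edge <c_r, b_r> that routes around F1, built from
    the average of F1's low and high transforms."""
    cr = c + b * (F1.cl + F1.ch) / 2.0
    br = b * (F1.bl + F1.bh) / 2.0
    return cr, br

# hypothetical node with the fields used above (names are ours)
F = SimpleNamespace(cl=0.0, bl=0.2, ch=0.4, bh=0.2, max_range=1.0)
```

With the example node F above, prune_error(F) = 1.0 · (0.4 + 0)/2 = 0.2, so F could only be pruned under a budget larger than 0.2.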

However, it turns out that in a greedy approximation procedure, pruning nodes can often use up most of the error budget early on in the approximation, thereby preventing the merging of nodes in


later operations that could potentially save more space at less error cost. As a consequence, while we see the potential value of node pruning in Figure 14, we note that it has led to poor performance in practice, so we opt not to use it here. Nonetheless, we mention it simply because it may be useful in future work if its aggressive error consumption could somehow be controlled better (e.g., by placing a lower error budget on prune operations).

Algorithm: We now formally define the algorithm that performs AADD compression.

ApproxAADD (Algorithm 9) merges the nodes of the AADD by starting at the bottom-level nodes and working its way up to the root. By doing this, we ensure that as many child nodes as possible are merged so that merging can then be performed on their parent nodes. The procedure takes any unvisited node and finds all the nodes that can be merged with it, taking into account the merge error. After that, it replaces all references to merged nodes with F1 and updates F1^ǫ to reflect its decreased error budget.

6. Empirical Evaluation

First we explore the running time and space requirements of ADDs and AADDs for simple operations such as summation, multiplication, and maximization. Then we explore a number of paradigms for structured probabilistic inference and compare the performance of standard algorithms implemented using ADD and tabular representations to those using AADDs.

6.1 Basic Operations

Figure 15 demonstrates the relative time and space performance of tables, ADDs, and AADDs for ⊕, ⊗, and max, each for one example function. These results verify the exponential-to-linear space and time reductions proved in Theorem 4.4. The functions used in these examples are simply generalizations of the additive and multiplicative functions given in Figures 1c and 7 that can be represented in exponential space with ADDs and linear space with AADDs.

6.2 Exact Inference

6.2.1 BAYES NETS

Since dynamic Bayes nets are used in factored MDPs, it is informative to evaluate AADDs on a variety of Bayes net structures. For Bayes nets, we simply evaluate the variable elimination algorithm (Zhang & Poole, 1996) under the greedy tree-width-minimizing min-fill (Kjaerulff, 1990) variable ordering, with the conditional probability tables (CPTs) Pj and corresponding operations replaced with those for tables, ADDs, and AADDs:

    ∑_{x_i ∉ Query} P_1(x_1 | Parents(x_1)) · · · P_j(x_j | Parents(x_j))

Table 4 shows the total number of table entries/nodes required to represent the original network and the total running time of 100 random queries (each consisting of one query variable and one evidence variable) for a number of publicly available Bayes nets6 and a noisy-or (Pearl, 1986) model P(x_1 | x_2, …, x_n) = 1 − ∏_{i=2}^{n} P(x_1 | x_i), where P(x_1 | x_i) = 0.1 and n = 15.
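The multiplicative structure the AADD exploits in the noisy-or CPT is easy to see in code. Following the text's model with per-parent term 0.1, each CPT entry is a product over the active parents, so the full 2^14-entry table collapses to a chain of per-parent factors (an illustrative Python sketch with our own function name):

```python
def noisy_or_entry(parent_values, q=0.1):
    """One CPT entry P(x1 = true | x2 .. xn) under the text's noisy-or
    model 1 - prod over active parents of the per-parent term q."""
    p_off = 1.0
    for on in parent_values:
        if on:
            p_off *= q  # multiplicative structure: one factor per parent
    return 1.0 - p_off

# A table for 14 parents needs 2**14 entries, while the product form is
# O(n) -- the structure the AADD captures in its edge multipliers.
```

This is why the Noisy-Or-15 row of Table 4 shows 65,566 table entries but only 1,066 AADD nodes.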

6. See the Bayes net repository: http://www.cs.huji.ac.il/labs/compbio/Repository


[Figure 15 here: six plots of running time in ms (top row) and number of table entries/nodes (bottom row) vs. number of variables for tables, ADDs, and AADDs, for the sum, product, and max operations.]

Figure 15: Comparison of Apply operation running time (top) and table entries/nodes (bottom) for tables, ADDs, and AADDs. Left to right: (∑_i 2^i x_i) ⊕ (∑_i 2^i x_i), (γ^{∑_i 2^i x_i}) ⊗ (γ^{∑_i 2^i x_i}), and max(∑_i 2^i x_i, ∑_i 2^i x_i). Note the linear time/space for AADDs.

                      Table                    ADD                    AADD
Bayes Net      # Table Entries  Time    # ADD Nodes  Time    # AADD Nodes  Time
Alarm                1,192     2.97 s         689   2.42 s          405   1.26 s
Barley             470,294     EML*       139,856   EML*         60,809    207 m
Carpo                  636     0.58 s         955   0.57 s          360   0.49 s
Hailfinder           9,045     26.4 s       4,511    9.6 s        2,538    2.7 s
Insurance            2,104      278 s       1,596    116 s          775     37 s
Noisy-Or-15         65,566     27.5 s     125,356   50.2 s        1,066    0.7 s

Table 4: Number of table entries/nodes in the original network and variable elimination running times using tabular, ADD, and AADD representations for inference in various Bayes nets. *EML denotes that a query exceeded the 1 GB memory limit.


[Figure 16 here: six plots of running time in s (top row) and number of entries/nodes (bottom row) vs. number of computers for tables, ADDs, and AADDs, for the star, bidirectional ring, and independent rings network configurations.]

Figure 16: MDP value iteration running times (top) and number of entries/nodes (bottom) in the final value function using tabular, ADD, and AADD representations for various network configurations in the SYSADMIN problem.

Note that the intermediate probability tables were too large in one instance for the tables or ADDs, but not the AADDs, indicating that the AADD was able to exploit additive or multiplicative structure in these cases. Also, the AADD appears to yield an exponential-to-linear reduction on the Noisy-Or-15 problem by exploiting the multiplicative structure inherent in these special CPTs. While other algorithms have been explicitly designed to exploit noisy-or network structure for efficient inference (Heckerman, 1990), the AADD automatically exploits this structure in standard variable elimination without explicit modification.

6.2.2 FACTORED MDPS

For MDPs, we simply evaluate the value iteration algorithm using a tabular representation and its extension for decision diagrams, as previously discussed for the SPUDD algorithm in exact structured value iteration. We apply these variants of value iteration to factored MDPs from the SYSADMIN domain introduced in Chapter 1 and formalized as a factored MDP in Section ?? of this chapter. Here we simply substitute tables, ADDs, and AADDs for the reward function, value function, and DBN transition model dynamics in the factored MDP value iteration update of Equation ??.

Figure 16 shows the relative performance of value iteration until convergence within 0.01 of the optimal value for networks in star, bidirectional ring, and independent ring configurations. While the reward and transition dynamics in the SYSADMIN problem have considerable additive structure, we note that the exponential size of the AADD (as for all representations) indicates that little additive structure survives in the exact value function. Nonetheless, the AADD-based algorithm still manages to take considerable advantage of the additive structure during computations and thus performs comparably to or exponentially better than ADDs and tables for exact value iteration.


[Figure 17 here: six plots of execution time in s (top row) and space in number of nodes (bottom row) vs. true approximation error for ADDs and AADDs on the Bi-Ring-16, Uni-Ring-16, and Star-16 problems.]

Figure 17: Time and space performance for 3 different SYSADMIN problems with 16 variables vs. the true approximation error max_x |V*(x) − V_approx(x)|. Results were taken after 100 iterations of approximate dynamic programming, when values changed little.

6.3 Approximate Inference

6.3.1 BAYES NETS

Need to run approximate inference here. Could do standard factor approximation and/or Loopy BP.

6.3.2 FACTORED MDPS

In Figure 17 we show the time and space performance for 3 different SYSADMIN problems vs. the true approximation error max_x |V*(x) − V_approx(x)| incurred by the solution.7 This provides an indicator of the tradeoff between actual approximation error and time and space for the ADD and AADD approaches. In general, we note that the AADD always uses fewer nodes with a major reduction in running time, yielding a threefold speedup on the Bi-Ring-16 problem while using about half the space (or better). But perhaps more importantly, we note that while running faster and taking less space for any given approximation error, approximation with AADDs consistently outperforms approximation with ADDs.

In Figure 18 we show the time and space performance for 3 different SYSADMIN problems as the level of pruning precision is held constant at δ = 0.03 (thus giving an a priori approximation

7. It is important to note when viewing these running times that the Bellman backup complexity is O(|S|^2 |A|). While this worst case is not encountered in domains like grid-worlds, the complexity of the SYSADMIN backup comes close to this worst case due to its dense transition function. For reference, we note that the algorithms here all outperform enumerated state dynamic programming by at least an order of magnitude in time and space.


[Figure 18 here: six plots of execution time in s (top row) and space in number of nodes (bottom row) vs. problem size for ADDs and AADDs on the Bi-Ring, Uni-Ring, and Star problems.]

Figure 18: Time and space performance for 3 different SYSADMIN problems as the level of pruning precision is held constant at δ = 0.03 and the number of computers in SYSADMIN increases. Results were taken after 100 iterations of approximate dynamic programming.

error guarantee) as the number of computers in SYSADMIN increases. Results were taken after 100 iterations of approximate dynamic programming, when the respective values of both algorithms stopped changing. Here we notice that the AADD-based approximate dynamic programming approach far outscales the performance of the ADD approach on two of the three problems. The speedup in these cases is well over an order of magnitude. Consequently, for any fixed a priori error bound, we note that the approximation with AADDs always ran faster (sometimes up to an order of magnitude) and took less space than its ADD counterpart.

7. Related Work

There has been much related work in the formal verification literature that has attempted to tackle additive and multiplicative structure in the representation of functions from B^n → B^m. These include *BMDs (Bryant & Chen, 1995), K*BMDs (Drechsler, Becker, & Ruppertz, 1997), EVBDDs & FEVBDDs (Tafertshofer & Pedram, 1997), and HDDs & *PHDDs (Chen & Bryant, 1997).8

However, without covering each data structure in detail, we note there are a few major differences between this related work and AADDs:

• These data structures all originated in the verification community, which means that their terminals are restricted to be vectors of boolean variables or, more generally, integers. When these diagrams can exploit both additive and multiplicative structure, normalization of nodes

8. See (Drechsler & Sieling, 2001) for an excellent general overview of most of these decision diagrams.


in these data structures requires prime factorizations of edge weights, so there is no direct correspondence between this normalization and AADD normalization (obviously, the prime factorization of a value in R is ill-defined).

• One could attempt to perform probabilistic inference with integer terminals, thus requiring a rational or direct floating-point representation of values in R. Unfortunately, rational representations of terminals require large amounts of space to achieve precision comparable to floating-point representations. And when rational representations are restricted to the same space as floating-point representations, their computation error is much greater than that of a floating-point representation (these reasons are, in fact, the motivation behind floating-point representations). Probabilistic inference applications require manipulating very small values, and small numerical approximation errors tend to multiply uncontrollably during marginalization, requiring very precise numerical representations and accurate computations. This can only be reasonably achieved with a floating-point representation.

• *PHDDs are the only such decision diagrams intended to directly represent floating-point numbers and perform standard operations on them, since they were created for the verification of floating-point arithmetic. However, the caveat is that computation with *PHDDs is equivalent to performing all floating-point operations in software. In contrast, AADDs use direct machine floating-point representations and highly accelerated hardware implementations. So even if *PHDDs could match AADDs in representational efficiency (the correspondence, if true, is not at all obvious and is an open question), their software-based floating-point computation would slow them down by orders of magnitude in comparison to AADDs.

8. Conclusions and Future Work

We have introduced the AADD and proved that its worst-case time and space performance is within a multiplicative constant of that of ADDs, while it can lead to exponential reductions in time and space over ADDs. We have provided an empirical comparison of tabular, ADD, and AADD representations used in Bayes net and MDP inference algorithms, concluding that AADDs perform at least as well as ADDs and tables and can yield an exponential time and space improvement over both when additive or multiplicative structure can be exploited. Furthermore, we showed that approximate inference algorithms with AADDs yield lower-error, more compact approximations than ADDs or tabular representations when additive or multiplicative structure can be exploited.

[FUTURE DIRECTIONS]

Appendix A. Proofs

Lemma 4.2. Fix a variable ordering over x_1, …, x_n. For any function g(x_1, …, x_n) mapping B^n → R, there exists a unique generalized AADD G over variable domain x_1, …, x_n satisfying the given variable ordering such that for all ρ ∈ B^n we have g(ρ) = Val(G, ρ).

Proof. We prove this lemma by induction on n. For n = 0, we have a function representing a constant C. The constraints imply that G = C + 0 · 0 is the only legal representation of this function.


Now, for the inductive case, we assume that we have n variables in our function f(x_1, …, x_n) with variable x_1 first in the ordering. We inductively assume that the lemma holds for the representation of the functions f_h(x_1 = true, x_2, …, x_n) and f_l(x_1 = false, x_2, …, x_n) over n − 1 variables, so that both of these functions are represented by unique generalized AADDs G_h = c_h + b_h F_h and G_l = c_l + b_l F_l. If G_h = G_l, then this case is satisfied by our inductive assumption since f(x_1, …, x_n) technically ranges over n − 1 variables. Otherwise G_h ≠ G_l, so the only way to represent f(x_1, …, x_n) in the grammar is to use an if node branching on x_1. Constraint (1) implies that we can have at most one if node above F_h and F_l branching on variable x_1. So we build F = if (F^var) then c'_h + b'_h F_h else c'_l + b'_l F_l and G = c + bF to represent f(x_1, …, x_n). Let r_min = min(c_h, c_l), r_max = max(c_h + b_h, c_l + b_l), and r_range = r_max − r_min; these respectively denote the minimum, maximum, and value span of the child functions G_l and G_h, which allow us to normalize the newly constructed F node to have a range of [0, 1], while at the same time providing us with the offset c and multiplier b for the newly constructed G node.

Now, we must solve for c, b, c'_h, b'_h, c'_l, b'_l satisfying constraints (2) and (3). This gives us the following six equations that must be simultaneously satisfied:

    c = r_min
    b = r_range
    c_h = b · c'_h + c
    b_h = b · b'_h
    c_l = b · c'_l + c
    b_l = b · b'_l

In matrix form, this linear system is non-singular when b > 0, which follows from r_range > 0 as implied by constraint (4). Thus, the matrix is full rank and the linear system has one unique solution. By simple Gaussian elimination, we can derive this unique solution as the following: c = r_min, b = r_range, c'_h = (c_h − r_min)/r_range, c'_l = (c_l − r_min)/r_range, b'_h = b_h/r_range, b'_l = b_l/r_range. This shows us that there is only one unique construction of G to represent f(x_1, …, x_n). Thus, the inductive case is satisfied and the statement of the lemma follows.
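The solution can be checked mechanically. The Python sketch below (our own illustration) computes the normalization; note that the reconstruction equations c_h = b · c'_h + c and b_h = b · b'_h force b = r_range = r_max − r_min, which the test below confirms along with the children re-normalizing to the range [0, 1]:

```python
def normalize(ch, bh, cl, bl):
    """Given child transforms <ch, bh> and <cl, bl>, return the parent
    transform <c, b> and the re-normalized child transforms from the
    unique solution of the six constraint equations (assumes the
    non-degenerate case r_range > 0)."""
    rmin = min(ch, cl)
    rmax = max(ch + bh, cl + bl)
    rrange = rmax - rmin
    c, b = rmin, rrange
    ch_p, cl_p = (ch - rmin) / rrange, (cl - rmin) / rrange
    bh_p, bl_p = bh / rrange, bl / rrange
    return c, b, (ch_p, bh_p), (cl_p, bl_p)
```

For instance, children 〈2, 3〉 and 〈0, 1〉 have r_min = 0 and r_max = 5, giving parent transform 〈0, 5〉 and normalized children 〈0.4, 0.6〉 and 〈0, 0.2〉.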

Theorem 4.3. For all functions F1 : B^n → R and F2 : B^m → R (n ≥ 0 and m ≥ 0), the time and space performance of Reduce(F1) and Apply(F1, F2, op) for AADDs (operands and results represented as canonical AADDs) is within a multiplicative constant of Reduce(F1) and Apply(F1, F2, op) for ADDs (operands and results represented as canonical ADDs) in the worst case, assuming any fixed variable ordering.

Proof. The ADD Reduce and Apply algorithms can be seen as analogs of the corresponding AADD algorithms without the overhead of propagating the affine transforms of edge weights during recursive calls and normalizing them when returning. However, a comparison of the ADD/AADD Reduce and Apply algorithms shows that there are only a constant number of additional constant-time operations for manipulating edge weights in each AADD algorithm in comparison to the corresponding ADD algorithm. Thus each call to an AADD algorithm incurs an additional constant time overhead over the corresponding call to the ADD algorithm. We denote the respective constant times to evaluate one ADD Reduce or Apply call by T_ADD^Reduce and T_ADD^Apply. Likewise, we denote the respective times to evaluate one AADD Reduce or Apply call by




T^Reduce_AADD = T^Reduce_ADD + C^Reduce_AADD and T^Apply_AADD = T^Apply_ADD + C^Apply_AADD, where C represents the additional constant-time overhead of the call for an AADD in comparison to the ADD.

Now, we only need to show that the AADD makes equal or fewer calls to Reduce and Apply than the ADD version. First we note that under the same variable ordering, an ADD is equivalent to a non-canonical AADD with fixed edge weights c = 0, b = 1. Thus, if we did not normalize AADD nodes in Reduce and Apply, then there would be a direct 1-1 mapping between each Reduce and Apply call for ADDs and the corresponding call for AADDs. Since normalization can only increase the number of Reduce and Apply cache hits and reduce the number of cached nodes, it is clear that an AADD must generate equal or fewer Reduce and Apply calls and have equal or fewer cached nodes than the corresponding ADD. This allows us to conclude that, in the worst case, the AADD generates as many Reduce and Apply calls and cache hits as the ADD. Assuming n calls are made by both the ADD and AADD variants of Reduce and Apply, the ADD requires total time n·T^Reduce_ADD and n·T^Apply_ADD for each respective algorithm, whereas the AADD requires time n·(T^Reduce_ADD + C^Reduce_AADD) and n·(T^Apply_ADD + C^Apply_AADD) for each respective algorithm. This verifies that the AADD operations are within a multiplicative constant of the time required by the corresponding ADD operations (specifically, (T^Reduce_ADD + C^Reduce_AADD)/T^Reduce_ADD for Reduce and (T^Apply_ADD + C^Apply_AADD)/T^Apply_ADD for Apply).

An analogous proof for space can be obtained by substituting “space” for “time” above.
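The embedding used in the proof (an ADD viewed as a non-canonical AADD with every edge weight fixed to ⟨c = 0, b = 1⟩) can be sketched as follows. The tuple encoding and function names are illustrative only, not the representation of any particular package:

```python
# An ADD node is either a leaf value or a triple (var, high, low).
def eval_add(node, env):
    while isinstance(node, tuple):
        var, high, low = node
        node = high if env[var] else low
    return node

# An AADD edge is a triple (c, b, node); evaluating it returns c + b * F,
# where F is the value of the node reached through the edge.
def eval_aadd(edge, env):
    c, b, node = edge
    if isinstance(node, tuple):
        var, high, low = node
        return c + b * eval_aadd(high if env[var] else low, env)
    return c + b * node

# Embed an ADD as a (non-canonical) AADD by attaching identity weights
# <c = 0, b = 1> to every edge; leaf values are left unchanged.
def embed(node):
    if isinstance(node, tuple):
        var, high, low = node
        return (0.0, 1.0, (var, embed(high), embed(low)))
    return (0.0, 1.0, node)
```

Under this embedding, eval_aadd(embed(D), env) = eval_add(D, env) for every ADD D and assignment env, which is the 1-1 correspondence the proof relies on before normalization is applied.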

Theorem 4.4. There exist functions F_1 and F_2 and an operator op such that the running time and space performance of Apply(F_1, F_2, op) for AADDs can be linear in the number of variables when the corresponding ADD operations are exponential in the number of variables.

Proof. Two functions and Apply operation examples where this holds true are Σ_{i=1}^n 2^i x_i ⊕ Σ_{i=1}^n 2^i x_i and Π_{i=1}^n γ^{2^i x_i} ⊗ Π_{i=1}^n γ^{2^i x_i}. (Examples of these operands as ADDs and AADDs were given in Figures 1(c) and 7.) Because these computations result in a number of terminal values exponential in n, the ADD operations must require time and space exponential in n. On the other hand, it is known that the operands can be represented in linear-sized AADDs. Due to this structure, the Apply algorithm will begin by recursing on the high branch of both operands to depth n. Then, at each step as it returns and recurses down the low branch of decision test x_i, the respective additive difference of 2^i and multiplicative coefficient of γ^{2^i} in the corresponding high-branch and low-branch Apply operation calls will be normalized out for the respective operations of ⊕ and ⊗ due to the canonical caching scheme in Table 3, thus yielding cache hits for all low branches. For each operation on the specified pair of functions, this results in n cached nodes and 2n Apply calls for the AADD operations.
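The exponential gap for the additive operand can be checked by brute-force enumeration for small n. The sketch below (illustrative helper names, not an AADD implementation) tabulates g_i(x_i, ..., x_n) = Σ_{j=i}^n 2^j x_j and confirms that an ADD needs 2^n distinct terminals, while after affine normalization the high (x_i = 1) and low (x_i = 0) children of every decision node are identical and can be shared, so one node per variable suffices:

```python
from itertools import product

def table(i, n):
    """Explicit value table of g_i(x_i, ..., x_n) = sum_{j=i..n} 2**j * x_j."""
    return tuple(sum(2 ** j * x for j, x in zip(range(i, n + 1), xs))
                 for xs in product((0, 1), repeat=n - i + 1))

def normalized(t):
    """Affine-normalize a (non-constant) value table to the range [0, 1]."""
    lo, hi = min(t), max(t)
    return tuple((v - lo) / (hi - lo) for v in t)

n = 6

# An ADD keeps one terminal per distinct value of the function: 2**n of them.
add_terminals = len(set(table(1, n)))

# In an AADD, the additive offset 2**i on the high branch is absorbed into
# the edge weights, so both children of the node testing x_i normalize to
# the same function and can be shared.
children_shared = all(
    normalized(tuple(2 ** i + v for v in table(i + 1, n)))
    == normalized(table(i + 1, n))
    for i in range(1, n))
```

Here add_terminals grows as 2^n while the shared-children property holds at every level, which is exactly the structure that lets the canonical caching scheme produce cache hits on all low branches.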

