
Clamping Variables and Approximate Inference

Adrian Weller, University of Cambridge

MSR Cambridge, Mar 18, 2016

Work with Tony Jebara and Justin Domke

For more information, see http://mlg.eng.cam.ac.uk/adrian/


Motivation: undirected graphical models

Powerful way to represent relationships across variables

Many applications including: computer vision, social network analysis, deep belief networks, protein folding...

In this talk, focus on binary pairwise (Ising) models

Example: Grid for computer vision (attractive)


Motivation: undirected graphical models

Example: Part of epinions social network (mixed)

Figure courtesy of N. Ruozzi


Motivation: undirected graphical models

[Figure: bipartite graph with layers $x_4, x_5, x_6, x_7, x_8$ and $x_1, x_2, x_3$]

Example: Restricted Boltzmann machine (mixed)

A fundamental problem is marginal inference

Estimate marginal probability distribution of one variable

$p(x_1) = \sum_{x_2, \ldots, x_n} p(x_1, x_2, \ldots, x_n)$

Closely related to computing the partition function

Computationally intractable, focus on approximate methods

Our theme: combining approximate inference with clamping can be very fruitful as a proof technique, and in practice


Background: Binary pairwise models

Binary variables $X_1, \ldots, X_n \in \{0, 1\}$; singleton and pairwise potentials $\theta$

Write θ · x for the total score of a complete configuration

Probability distribution given by

$p(x) = \frac{1}{Z} \exp(\theta \cdot x)$

To ensure probabilities sum to 1, need normalizing constant

$Z = \sum_x \exp(\theta \cdot x)$

$Z$ is the partition function, a fundamental quantity we'd like to compute or approximate
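As a quick illustration, here is a brute-force sketch of this definition in Python; the three-variable chain and all parameter values are hypothetical, chosen only so the full enumeration is easy to see.

```python
import itertools
import math

# Hypothetical toy model: three binary variables on a chain x1 - x2 - x3.
theta = {1: 0.5, 2: -0.2, 3: 0.1}   # singleton potentials theta_i
W = {(1, 2): 1.0, (2, 3): -0.5}     # pairwise potentials W_ij

def score(x):
    """Total score theta . x of a complete configuration (dict i -> {0,1})."""
    s = sum(theta[i] * x[i] for i in theta)
    s += sum(w * x[i] * x[j] for (i, j), w in W.items())
    return s

# Z = sum over all 2^n configurations of exp(score): the normalizing constant.
Z = sum(math.exp(score(dict(zip(theta, bits))))
        for bits in itertools.product([0, 1], repeat=len(theta)))
print(Z)  # now p(x) = exp(score(x)) / Z sums to 1
```

The sum has $2^n$ terms, which is exactly why exact computation is intractable for large $n$ and approximate methods are needed.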


Background: A variational approximation

Recall $p(x) = \frac{1}{Z} \exp(\theta \cdot x)$

Exact inference may be viewed as optimization,

$\log Z = \max_{\mu \in \mathbb{M}} [\, \theta \cdot \mu + S(\mu) \,]$

$\mathbb{M}$ is the space of marginals that are globally consistent, $S$ is the (Shannon) entropy

Bethe makes two pairwise approximations,

$\log Z_B = \max_{q \in \mathbb{L}} [\, \theta \cdot q + S_B(q) \,]$

$\mathbb{L}$ is the space of marginals that are pairwise consistent, $S_B$ is the Bethe entropy approximation

Loopy Belief Propagation finds stationary points of Bethe

For models with no cycles (acyclic), Bethe is exact: $Z_B = Z$
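To make this connection concrete, below is a minimal loopy BP sketch on a hypothetical 4-cycle (all parameters invented for illustration): synchronous sum-product updates, then $\theta \cdot q + S_B(q)$ evaluated at the resulting beliefs, which at a BP fixed point gives $\log Z_B$. This is an illustrative sketch, not the talk's implementation.

```python
import itertools
import math

# Hypothetical toy model: a single 4-cycle with moderate attractive weights.
theta = {0: 0.3, 1: -0.1, 2: 0.2, 3: 0.0}
W = {(0, 1): 0.8, (1, 2): 0.8, (2, 3): 0.8, (3, 0): 0.8}

nbrs = {i: [] for i in theta}
for i, j in W:
    nbrs[i].append(j)
    nbrs[j].append(i)

def phi(i, xi):                      # node potential exp(theta_i x_i)
    return math.exp(theta[i] * xi)

def psi(i, j, xi, xj):               # edge potential exp(W_ij x_i x_j)
    return math.exp(W.get((i, j), W.get((j, i), 0.0)) * xi * xj)

# Directed messages m[(i, j)][x_j], initialized uniform
m = {(i, j): [0.5, 0.5] for i, j in W}
m.update({(j, i): [0.5, 0.5] for i, j in W})

for _ in range(200):                 # synchronous sum-product updates
    new = {}
    for (i, j) in m:
        msg = []
        for xj in (0, 1):
            total = 0.0
            for xi in (0, 1):
                p = phi(i, xi) * psi(i, j, xi, xj)
                for k in nbrs[i]:
                    if k != j:
                        p *= m[(k, i)][xi]
            
                total += p
            msg.append(total)
        s = sum(msg)
        new[(i, j)] = [v / s for v in msg]   # normalize for stability
    m = new

def belief(i):                       # singleton belief b_i(x_i)
    b = [phi(i, x) * math.prod(m[(k, i)][x] for k in nbrs[i]) for x in (0, 1)]
    s = sum(b)
    return [v / s for v in b]

def edge_belief(i, j):               # pairwise belief b_ij(x_i, x_j)
    b = {}
    for xi, xj in itertools.product((0, 1), repeat=2):
        v = phi(i, xi) * phi(j, xj) * psi(i, j, xi, xj)
        v *= math.prod(m[(k, i)][xi] for k in nbrs[i] if k != j)
        v *= math.prod(m[(l, j)][xj] for l in nbrs[j] if l != i)
        b[xi, xj] = v
    s = sum(b.values())
    return {k: v / s for k, v in b.items()}

def H(p):                            # entropy of a distribution
    vals = p.values() if isinstance(p, dict) else p
    return -sum(v * math.log(v) for v in vals if v > 0)

# log Z_B = theta . q + S_B(q) at the BP beliefs, with the Bethe entropy
# S_B = sum_edges H(b_ij) - sum_i (degree_i - 1) H(b_i)
score = sum(theta[i] * belief(i)[1] for i in theta)
score += sum(W[i, j] * edge_belief(i, j)[1, 1] for i, j in W)
S_B = sum(H(edge_belief(i, j)) for i, j in W)
S_B -= sum((len(nbrs[i]) - 1) * H(belief(i)) for i in theta)
print("log Z_B ~=", score + S_B)
```

Since this toy model is attractive, the results discussed later imply $Z_B \le Z$ here, which can be checked against brute-force enumeration of $\log Z$.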

Background: When is Bethe a good approximation?

We know that Bethe is exact for acyclic models, $Z_B = Z$

When else does Bethe perform well?

‘Tree-like models’: models with long cycles or weak potentials

Also: attractive models (all edges attractive)

Sudderth, Wainwright and Willsky (NIPS 2007) used loop series to show that for a subclass of attractive binary pairwise models, $Z_B \le Z$

Conjectured $Z_B \le Z$ for all attractive binary pairwise models

Proved true by Ruozzi (NIPS 2012) using graph covers

Here we provide a separate proof building from first principles, and also derive an upper bound for $Z$ in terms of $Z_B$

We use the idea of clamping variables


Background: What is clamping?

[Figure: example model on variables $x_1, \ldots, x_{10}$]

To compute the partition function $Z$, we can enumerate all states and sum:

x1 x2 ... x10   score   exp(score)
 0  0 ...  0      1        2.7
 0  0 ...  1      2        7.4
 .  . ...  .      .         .
 0  1 ...  1      1.3      3.7
 1  0 ...  0     -1        0.4
 1  0 ...  1      0.2      1.2
 .  . ...  .      .         .
 1  1 ...  1      1.8      6.0

Total Z = 47.1


[Figure: the same model, with $x_1$ selected for clamping]

Can split $Z$ in two: clamp variable $X_1$ to each of $\{0, 1\}$, then add the two sub-partition functions:

$Z = Z|_{X_1=0} + Z|_{X_1=1}$

After we clamp a variable, it may be removed

x1 x2 ... x10   score   exp(score)
 0  0 ...  0      1        2.7
 0  0 ...  1      2        7.4
 .  . ...  .      .         .
 0  1 ...  1      1.3      3.7     Z|_{X1=0} = 27.5
 1  0 ...  0     -1        0.4
 1  0 ...  1      0.2      1.2
 .  . ...  .      .         .
 1  1 ...  1      1.8      6.0     Z|_{X1=1} = 19.6

Total Z = 47.1

$p(X_1 = 1) = \dfrac{Z|_{X_1=1}}{Z}$
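A quick numerical check of this split, using brute-force enumeration on a hypothetical toy model (feasible only for small $n$):

```python
import itertools
import math

# Hypothetical toy model: check Z = Z|X1=0 + Z|X1=1 exactly.
theta = [0.5, -0.2, 0.1, 0.4]
W = {(0, 1): 1.0, (1, 2): -0.5, (2, 3): 0.7, (3, 0): 0.3}

def Z_clamped(i=None, val=None):
    """Sum exp(score) over configurations, optionally clamping X_i = val."""
    Z = 0.0
    for x in itertools.product([0, 1], repeat=len(theta)):
        if i is not None and x[i] != val:
            continue
        s = sum(t * xi for t, xi in zip(theta, x))
        s += sum(w * x[a] * x[b] for (a, b), w in W.items())
        Z += math.exp(s)
    return Z

Z = Z_clamped()
Z0, Z1 = Z_clamped(0, 0), Z_clamped(0, 1)
assert abs(Z - (Z0 + Z1)) < 1e-9   # Z splits exactly across the clamp
print("p(X1 = 1) =", Z1 / Z)       # exact marginal from sub-partition functions
```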



After removing the clamped variable, if the remaining sub-models are acyclic then we can find the sub-partition functions efficiently (BP; the Bethe approximation is exact on trees)

If not,
can repeat: clamp and remove variables until acyclic, or
settle for approximate inference on sub-models:

$Z_B^{(i)} := Z_B|_{X_i=0} + Z_B|_{X_i=1}$

Will this lead to a better estimate than approximate inference on the original model? Always? Often, but not always
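The clamp-and-recurse scheme could look as follows; this is an illustrative toy, not the talk's implementation. The variable to clamp is chosen naively (highest remaining degree), and the acyclic base case is solved by enumeration purely for brevity, where a real implementation would run BP, which is exact on trees.

```python
import itertools
import math

# Hypothetical toy model: a triangle (0,1,2) with a pendant path 2-3-4.
theta = {0: 0.5, 1: -0.2, 2: 0.1, 3: 0.4, 4: 0.0}
W = {(0, 1): 1.0, (1, 2): -0.5, (2, 0): 0.7, (2, 3): 0.3, (3, 4): 0.2}

def is_acyclic(nodes, edges):
    """Union-find cycle check on an undirected graph."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            return False
        parent[ri] = rj
    return True

def Z_rec(clamped):
    """Z restricted to the clamped assignments, by clamp-and-recurse."""
    free = [v for v in theta if v not in clamped]
    edges = [e for e in W if e[0] not in clamped and e[1] not in clamped]
    if is_acyclic(free, edges):
        # Base case: remaining model is acyclic. Enumeration stands in for
        # BP here purely for brevity; BP would be exact on this tree.
        Z = 0.0
        for bits in itertools.product([0, 1], repeat=len(free)):
            x = dict(zip(free, bits))
            x.update(clamped)
            s = sum(theta[i] * x[i] for i in theta)
            s += sum(w * x[i] * x[j] for (i, j), w in W.items())
            Z += math.exp(s)
        return Z
    # Otherwise clamp one more variable (naively: highest remaining degree)
    deg = {v: 0 for v in free}
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    v = max(free, key=deg.get)
    return sum(Z_rec({**clamped, v: b}) for b in (0, 1))

print(Z_rec({}))
```

Each split is exact ($Z = Z|_{X_v=0} + Z|_{X_v=1}$), so the recursion returns the true $Z$; in practice the error comes only from approximate inference at the base case.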



A variational perspective on clamping

Bethe approximation

$\log Z_B = \max_{q \in \mathbb{L}} [\, \theta \cdot q + S_B(q) \,]$

Observe that when $X_i$ is clamped, we optimize over a subset:

$\log Z_B|_{X_i=0} = \max_{q \in \mathbb{L}:\, q_i=0} [\, \theta \cdot q + S_B(q) \,]$

$\Rightarrow Z_B|_{X_i=0} \le Z_B$, similarly $Z_B|_{X_i=1} \le Z_B$

Recap of notation:
$Z$: true partition function
$Z_B$: Bethe optimum partition function
$Z_B^{(i)} := Z_B|_{X_i=0} + Z_B|_{X_i=1} \le 2 Z_B$: the approximation obtained when we clamp and sum approximate sub-partition functions


Clamping variables: an upper bound on Z

From before,

$Z_B^{(i)} := Z_B|_{X_i=0} + Z_B|_{X_i=1} \le 2 Z_B$

Repeat: clamp and remove variables, until the remaining model is acyclic, where Bethe is exact

For example, if we must delete 2 variables $X_i, X_j$, we obtain

$Z_B^{(ij)} := \sum_{a, b \in \{0,1\}} Z_B|_{X_i=a,\, X_j=b} \le 2^2 Z_B$

But these sub-partition functions are exact, hence the LHS $= Z$

[Figure: the example model; then with $x_1$ clamped and removed; then with $x_1$ and $x_2$ clamped and removed]


Clamping variables: an upper bound on Z

$Z_B^{(i)} := Z_B|_{X_i=0} + Z_B|_{X_i=1} \le 2 Z_B$

Repeat: clamp and remove variables, until the remaining model is acyclic, where Bethe is exact

Let $k(G)$ be the minimum size of a feedback vertex set

Theorem (result is tight in a sense)

$Z \le 2^k Z_B$
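To make $k(G)$ concrete, here is a brute-force feedback vertex set computation; exponential, so only for small graphs, and the example graph (two triangles joined by an edge) is hypothetical.

```python
import itertools

# Brute-force k(G): smallest set of vertices whose removal leaves the
# graph acyclic (a feedback vertex set). Only for small illustrative graphs.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]

def is_acyclic(remaining, edges):
    """Union-find cycle check over the surviving vertices only."""
    parent = {v: v for v in remaining}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for i, j in edges:
        if i in parent and j in parent:
            ri, rj = find(i), find(j)
            if ri == rj:
                return False
            parent[ri] = rj
    return True

def min_feedback_vertex_set(nodes, edges):
    for k in range(len(nodes) + 1):          # try sizes 0, 1, 2, ...
        for S in itertools.combinations(nodes, k):
            if is_acyclic(set(nodes) - set(S), edges):
                return set(S)

k = len(min_feedback_vertex_set(nodes, edges))
print("k(G) =", k)   # the theorem then gives Z <= 2**k * Z_B
```

For this example $k(G) = 2$ (one vertex from each triangle must go).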



Attractive models: a lower bound on Z

An attractive model is one with all edges attractive

Recall definition,

$Z_B^{(i)} := Z_B|_{X_i=0} + Z_B|_{X_i=1}$

Theorem (we actually show a stronger result; ask if interested)

For an attractive binary pairwise model and any $X_i$, $Z_B \le Z_B^{(i)}$

Repeat as before: $Z_B \le Z_B^{(i)} \le Z_B^{(ij)} \le \cdots \le Z$

Corollary (similar proof to the earlier result; first proved by Ruozzi, 2012)

For an attractive binary pairwise model, $Z_B \le Z$

$\Rightarrow$ each clamp-and-sum can only improve $Z_B$



Recap of results so far

We have used clamping as a proof technique

Derived lower and upper bounds on Z for attractive models

$\underbrace{Z_B \le}_{\text{attractive only}} Z \underbrace{\le 2^k Z_B}_{\text{attractive and mixed}} \quad\Leftrightarrow\quad \frac{Z}{2^k} \le Z_B \underbrace{\le Z}_{\text{attractive only}}$

We also proved that for attractive models, clamping and summing (optimum) Bethe sub-partition functions can only improve the estimate

How about for mixed models?


Example: here clamping any variable worsens the $Z_B$ estimate

[Figure: 4-cycle on $x_1, x_2, x_3, x_4$]

Blue edges are attractive with edge weight +2. Red edges are repulsive with edge weight −2. No singleton potentials.

(performance is only slightly worse with clamping)

In practice, if we pick a good variable to clamp, then clamping is usually helpful


New work: what does clamping do for MF and TRW?

Mean field (MF) approximation assumes independent variables, yields a lower bound, $Z_M \le Z$

Tree-reweighted (TRW) is a pairwise approximation similar to Bethe but allows a convex optimization and yields an upper bound, $Z \le Z_T$. Together: $Z_M \le Z \le Z_T$

Earlier, we showed that for Bethe, clamping always improves the approximation for attractive models; often but not always improves it for mixed models

How about for MF and TRW? ($Z_M \le Z_B \le Z_T$)

Theorem

For both MF and TRW, for attractive and mixed models, clamping and summing approximate sub-partition functions can only improve the respective approximation and bound (any number of labels).
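For intuition about the MF bound, here is a minimal naive mean-field sketch for this model class, with hypothetical toy parameters: coordinate ascent on a fully factored $q$, whose objective value is the lower bound $\log Z_M$.

```python
import math

# Hypothetical toy model: a 4-cycle with attractive weights.
theta = [0.3, -0.1, 0.2, 0.0]
W = {(0, 1): 0.8, (1, 2): 0.8, (2, 3): 0.8, (3, 0): 0.8}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Coordinate ascent: q_i <- sigmoid(theta_i + sum_j W_ij q_j),
# the stationarity condition of the mean-field objective.
q = [0.5] * len(theta)
for _ in range(200):
    for i in range(len(q)):
        field = theta[i]
        for (a, b), w in W.items():
            if a == i:
                field += w * q[b]
            elif b == i:
                field += w * q[a]
        q[i] = sigmoid(field)

# log Z_M = E_q[theta . x] + H(q), a guaranteed lower bound on log Z
avg_score = sum(t * qi for t, qi in zip(theta, q))
avg_score += sum(w * q[a] * q[b] for (a, b), w in W.items())
H = -sum(v * math.log(v) + (1 - v) * math.log(1 - v) for v in q)
print("log Z_M ~=", avg_score + H)
```

Clamping applies verbatim here: run the same ascent with $q_i$ fixed to 0 and to 1 and sum the two resulting bounds, which per the theorem can only improve the MF bound.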



Error in log Z vs number of clamps: grids

[Figure: error in log Z vs number of clamps (0 to 5), four panels: large (9x9) and small (5x5) grids, attractive grids $W \in [0, 6]$ (left) vs mixed grids $W \in [-6, 6]$ (right); curves: best, worst, pseudo, greedy clamp selection]

Conclusions for practitioners

Typically Bethe performs very well

Clamping can be very helpful, more so for denser models with stronger edge weights, a setting where inference is often hard

We provide fast methods to select a good variable to clamp

MF and TRW provide useful bounds on $Z$ and $Z_B$

Thank you

For more information, see http://mlg.eng.cam.ac.uk/adrian/


References

N. Ruozzi. The Bethe partition function of log-supermodular graphical models. In NIPS, 2012.

E. Sudderth, M. Wainwright, and A. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In NIPS, 2007.

A. Weller and J. Domke. Clamping improves TRW and mean field approximations. To appear in AISTATS, 2016.

A. Weller and T. Jebara. Bethe bounds and approximating the global optimum. In AISTATS, 2013.

A. Weller and T. Jebara. Clamping variables and approximate inference. In NIPS, 2014.

J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In IJCAI, Distinguished Lecture Track, 2001.


Supplementary material

Extra slides for questions or further explanation


Error in log Z vs number of clamps: complete graphs

[Figure: error in log Z vs number of clamps (0 to 5); left: attractive K15, $W \in [0, 6]$; right: mixed K15, $W \in [-6, 6]$; curves: best, worst, pseudo, greedy]

For dense mixed models (many edges), MF can be better than Bethe.

What happens if we increase edge strength?


Error in log Z vs number of clamps: complete graphs

[Figure: error in log Z vs number of clamps (0 to 5); left: mixed K15, $W \in [-6, 6]$; right: mixed K15, $W \in [-12, 12]$; curves: best, worst, pseudo, greedy]

With stronger edges, MF is much better than Bethe!

But MF assumes variables are independent, so what's going on?

Frustrated cycles cause Bethe to overestimate by a lot.
TRW is even worse.
MF behaves much better (in the marginal polytope).



Time (secs) vs error in log Z for various methods

Mixed models, $W_{ij} \sim U[-6, 6]$. Time shown on a log scale.

[Figure: time (secs, log scale) vs error in log Z; left: 7x7 grid; right: complete K10; methods: TRW, B (Bethe), MF, maxW+c+TRE, pseudo, greedy]

Clamping can make the subsequent optimization problems easier, hence sometimes the total time with clamping is lower while also being more accurate


Clamping variables: strongest result for attractive models

$\log Z_B = \max_{q \in \mathbb{L}} [\, \theta \cdot q + S_B(q) \,]$

For any variable $X_i$ and $x \in [0, 1]$, let $q_i = q(X_i = 1)$ and

$\log Z_{B_i}(x) = \max_{q \in \mathbb{L}:\, q_i = x} [\, \theta \cdot q + S_B(q) \,]$

$Z_{B_i}(x)$ is the 'Bethe partition function constrained to $q_i = x$'

Note: $Z_{B_i}(0) = Z_B|_{X_i=0}$, $Z_{B_i}(x^*) = Z_B$, $Z_{B_i}(1) = Z_B|_{X_i=1}$

Define a new function,

$A_i(x) := \log Z_{B_i}(x) - S_i(x)$

Theorem (implies all other results for attractive models)

For an attractive binary pairwise model, $A_i(x)$ is convex

Builds on derivatives of the Bethe free energy from [WJ13]



Experiments: Which variable to clamp?

Compare the error $|\log Z - \log Z_B^{(i)}|$ to the original error $|\log Z - \log Z_B|$ for various ways to choose which variable $X_i$ to clamp:

best: clamp for the best improvement in error of $Z$ in hindsight
worst: clamp for the worst improvement in error of $Z$ in hindsight
avg: average clamping performance
maxW: max sum of incident edge weights, $\sum_{j \in N(i)} |W_{ij}|$
Mpower: more sophisticated, based on powers of a related matrix
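As a sketch, the maxW selection rule is just an argmax over total absolute incident edge weight (the weights below are hypothetical):

```python
# maxW heuristic sketch: clamp the variable with the largest total
# absolute incident edge weight, sum_{j in N(i)} |W_ij|.
W = {(0, 1): 1.0, (1, 2): -0.5, (2, 3): 2.2, (3, 0): 0.3, (1, 3): -1.8}

strength = {}
for (i, j), w in W.items():
    strength[i] = strength.get(i, 0.0) + abs(w)
    strength[j] = strength.get(j, 0.0) + abs(w)

clamp_var = max(strength, key=strength.get)
print("clamp X%d" % clamp_var)
```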

[Figure: the 'lamp' graph on $x_1, \ldots, x_{10}$]


Experiments: attractive random graph n = 10, p = 0.5

unary $\theta_i \sim U[-2, 2]$, edge $W_{ij} \sim U[0, W_{max}]$

Observe:
Clamping any variable helps significantly
Our selection methods perform well

[Figure: left, error of estimate of $\log Z$; right, avg $\ell_1$ error of singleton marginals, using Frank-Wolfe to optimize the Bethe free energy; x-axis: max interaction strength $W$; curves: Original, all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]


Experiments: mixed random graph n = 10, p = 0.5

unary $\theta_i \sim U[-2, 2]$, edge $W_{ij} \sim U[-W_{max}, W_{max}]$

Results remain promising for higher $n$.

[Figure: left, error of estimate of $\log Z$; right, avg $\ell_1$ error of singleton marginals, using Frank-Wolfe to optimize the Bethe free energy; x-axis: max interaction strength $W$; curves: Original, avg/all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]


Experiments: attractive complete graph n = 10, TRW

unary $\theta_i \sim U[-0.1, 0.1]$, edge $W_{ij} \sim U[0, W_{max}]$

Note low unary potentials. Clamping a variable 'breaks symmetry' and overcomes the TRW advantage.

[Figure: left, error of estimate of $\log Z$; right, avg $\ell_1$ error of singleton marginals; x-axis: max interaction strength $W$; curves: Original, all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower, TRW]


Experiments: mixed complete graph n = 10, TRW

unary $\theta_i \sim U[-2, 2]$, edge $W_{ij} \sim U[-W_{max}, W_{max}]$

Note regular singleton potentials.

[Figure: left, error of estimate of $\log Z$; right, avg $\ell_1$ error of singleton marginals; x-axis: max interaction strength $W$; curves: Original, avg Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower, TRW]


Experiments: attractive random graph n = 50, p = 0.1

unary $\theta_i \sim U[-2, 2]$, edge $W_{ij} \sim U[0, W_{max}]$

'worst Clamp' performs worse here due to suboptimal solutions found by Frank-Wolfe.

[Figure: left, error of estimate of $\log Z$; right, avg $\ell_1$ error of singleton marginals; x-axis: max interaction strength $W$]


Experiments: mixed random graph n = 50, p = 0.1

unary $\theta_i \sim U[-2, 2]$, edge $W_{ij} \sim U[-W_{max}, W_{max}]$

Performance is still good when clamping just one variable.

[Figure: left, error of estimate of $\log Z$; right, avg $\ell_1$ error of singleton marginals; x-axis: max interaction strength $W$; curves: Original, avg/all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]


Experiments: attractive ‘lamp’ graph

unary $\theta_i \sim U[-2, 2]$, edge $W_{ij} \sim U[0, W_{max}]$

Mpower performs well, significantly better than maxW.

[Figure: the 'lamp' graph on $x_1, \ldots, x_{10}$; left, error of estimate of $\log Z$; right, avg $\ell_1$ error of singleton marginals; x-axis: max interaction strength $W$; curves: Original, all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]


Experiments: mixed ‘lamp’ graph

unary $\theta_i \sim U[-2, 2]$, edge $W_{ij} \sim U[-W_{max}, W_{max}]$

Mpower performs well, significantly better than maxW.

[Figure: the 'lamp' graph on $x_1, \ldots, x_{10}$; left, error of estimate of $\log Z$; right, avg $\ell_1$ error of singleton marginals; x-axis: max interaction strength $W$; curves: Original, avg/all Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower]
