Bayes Linear for Dummies
JAC+IRV
24th April 2009

Why would I use Bayes linear?
Difficulties with full Bayes
Even in small problems, it can be too difficult or time-consuming to express, document and validate a meaningful joint prior probability specification;
Given such a specification, the computations for learning from data become technically difficult and extremely computer-intensive;
In higher dimensions the likelihood surface can be very complicated, making full Bayes calculations potentially highly non-robust.
Therefore if, in complex problems, we are unable to make and analyse full prior probability specifications, it follows that we require methods based around simpler belief specifications.
Working with partial belief specifications
Bayes linear is such an approach, based around partial belief specifications in terms of expectations
Typically, we require only mean, variance and covariance specifications for all uncertain quantities
We may view the Bayes linear approach as:
- Offering a simple approximation to a full Bayes analysis
- Complementary to the full Bayes approach, offering new interpretative and diagnostic tools
- A generalisation of the full Bayes approach where we lift the restriction of requiring a full probabilistic prior before we may learn anything from data
Features of Bayes linear
Features of the Bayes linear approach
Subjective and Bayesian
Belief specifications honestly correspond to our beliefs
Expectation as primitive
Adjust beliefs by linear fitting rather than conditioning
Computationally straightforward, allowing the analysis of more complex problems
Diagnostic tools are a key part of the approach:
- How prior beliefs affect conclusions
- How beliefs change by the adjustment
- How beliefs about observables compare to the observations themselves
Important special cases - multivariate Gaussian
Stages of belief analysis
A typical Bayes linear analysis of beliefs proceeds in the following stages:
1. Specification of prior beliefs
2. Interpret the expected adjustments a priori
3. Given observations, perform and interpret the adjustments
4. Make diagnostic comparisons between actual and expected beliefs
Belief Specification
de Finetti: Expectation as Primitive
de Finetti spent most of his life studying subjective conceptions of probability.
He proposed the use of expectation as the primitive entity on which to base any analysis, as opposed to probability.
In the Bayes linear approach, we follow de Finetti and take expectation as primitive.
Probabilities (where relevant) enter as derived quantities: they are the expectations of indicator functions.
Note this asymmetry: if probability is treated as the primitive quantity then one has to specify (in the continuous case) an infinite set of probabilities in order to derive a single expectation.
Belief Specification
The Bayes linear approach is subjectivist, and so in any analysis we need to specify our beliefs over all random quantities of interest.
However, as we consider expectation as primitive, we require only specifications of the expectations, variances and covariances of the random quantities of interest.
(If we have beliefs about higher orders we can include these in the analysis too!)
For example, say we are interested in predicting B = (B1, B2)^T from knowledge of D = (D1, D2)^T, which we will measure soon; then all we need to specify are E(B), E(D), Var(B), Var(D) and Cov(B,D).
Methods for Assigning Expectations
There are no strict rules for quantifying prior beliefs; every case will depend on personal judgement, the problem in question, and the availability of information
Possible techniques include:
- Studying summary statistics from samples in related populations
- Identifying one and two standard deviation intervals
- Specifying probability quantiles and/or distributions consistent with those quantiles
- Assessing a covariance by considering the variance of the difference of the corresponding quantities
- Partitioning variances and covariances into terms corresponding to uncorrelated components
The example: Numbers
Suppose we have a simple computer simulator and consider the output points F = (B1, B2, D1, D2)^T
We want to predict the computer model output at B = (B1, B2)^T from the observed values at D = (D1, D2)^T
We have a very simple prior specification:
E(F) = 0,  Var(F)_ii = 100,
and we obtain a correlation matrix via the standard Gaussian covariance function with θ = 1

       B1    B2    D1    D2
B1   1.00  0.56  0.52  0.61
B2   0.56  1.00  0.32  0.98
D1   0.52  0.32  1.00  0.28
D2   0.61  0.98  0.28  1.00
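This prior can be encoded directly. The slide does not give the point locations or the exact kernel convention, so the `gaussian_corr` form below is only one common squared-exponential convention, included for illustration; the quoted correlations (rounded to 2 d.p.) are used as-is.

```python
import numpy as np

# Prior specification from the example: E(F) = 0, Var(F)_ii = 100,
# with the correlation matrix quoted on the slide (rounded to 2 d.p.).
labels = ["B1", "B2", "D1", "D2"]
corr = np.array([
    [1.00, 0.56, 0.52, 0.61],
    [0.56, 1.00, 0.32, 0.98],
    [0.52, 0.32, 1.00, 0.28],
    [0.61, 0.98, 0.28, 1.00],
])
E_F = np.zeros(4)          # E(F) = 0
var_F = 100.0 * corr       # Var(F)_ii = 100, off-diagonals from the correlations

# One common convention for the 'Gaussian' (squared-exponential) covariance
# function; illustrative only, since the slide gives neither the point
# locations nor the exact kernel form.
def gaussian_corr(x, xp, theta=1.0):
    return np.exp(-((x - xp) ** 2) / theta ** 2)
```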
The example: Picture
Adjusted Expectation and Variance
Belief Adjustment
We are interested in how our beliefs about B change in the light of information given by D.
We look among the collection of linear estimates, i.e. those of the form c0 + c1 D1 + c2 D2, and choose constants c0, c1, c2 to minimise the prior expected squared error loss in estimating each of B1 and B2:
E([Bi - c0 - c1 D1 - c2 D2]^2).
The choices of constants may be easily computed, and the estimators E_D(B) = (E_D(B1), E_D(B2))^T turn out to be given by:
E_D(B) = E(B) + Cov(B,D) Var(D)^†(D - E(D)),
which we refer to as the adjusted expectation for collection B given collection D.
Adjusted expectation
The adjusted expectation for collection B given collection D is
E_D(B) = E(B) + Cov(B,D) Var(D)^†(D - E(D)).
The adjusted version of B given D is the 'residual' vector
A_D(B) = B - E_D(B).
We can partition the vector B as the sum of two uncorrelated vectors:
B = E_D(B) + A_D(B).
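The adjusted expectation can be computed directly with a Moore-Penrose pseudo-inverse. A minimal numpy sketch using the example's prior (correlations rounded to 2 d.p. on the slide, so reproduced coefficients match the quoted ones only approximately):

```python
import numpy as np

# Correlation matrix from the example slide (rounded), scaled to Var(F)_ii = 100.
corr = np.array([
    [1.00, 0.56, 0.52, 0.61],
    [0.56, 1.00, 0.32, 0.98],
    [0.52, 0.32, 1.00, 0.28],
    [0.61, 0.98, 0.28, 1.00],
])
V = 100.0 * corr
B, D = [0, 1], [2, 3]                  # index sets for B = (B1,B2), D = (D1,D2)
cov_BD = V[np.ix_(B, D)]
var_D = V[np.ix_(D, D)]

# E_D(B) = E(B) + Cov(B,D) Var(D)^+ (D - E(D)); all prior means are zero here,
# so the adjusted expectation is just a linear map applied to D.
coef = cov_BD @ np.linalg.pinv(var_D)  # rows: coefficients of (D1, D2)
# The slide quotes roughly [[0.381, 0.507], [0.051, 0.961]].
```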
Adjusted variance
We partition the variance matrix of B into two variance components:
Var(B) = Var(E_D(B)) + Var(A_D(B))
       = RVar_D(B) + Var_D(B)
These are the resolved variance matrix and the adjusted variance matrix (i.e. explained and residual variation).
The variance matrices are calculated as
Var_D(B) = Var(B) - Cov(B,D) Var(D)^† Cov(D,B),
RVar_D(B) = Cov(B,D) Var(D)^† Cov(D,B).
Our variance matrices must be non-negative definite.
We use the Moore-Penrose generalized inverse (A†) to allowfor degeneracy.
Resolution
We summarize the expected effect of the data D for the adjustment of B by a scale-free measure which we call the resolution of B induced by D,
R_D(B) = 1 - Var_D(B)/Var(B) = Var(E_D(B))/Var(B).
The resolution lies between 0 and 1, and in general, small (large) resolutions imply that the information has little (much) linear predictive value, given the prior specification.
Similar in spirit to an R^2 measure for the adjustment.
Example: The Adjustment
We can calculate our adjusted expectations for points B given D algebraically as:
E_D(B1) = 0.381 D1 + 0.507 D2 + 0
E_D(B2) = 0.051 D1 + 0.961 D2 + 0
We see that B2 is mainly determined by the value of D2 – unsurprising given how close these points are!
We can also calculate the adjusted variance and resolutions

Var_D(B) = [ 49.06  -5.83 ]      R_D(B) = ( 0.509 )
           [ -5.83   4.64 ],              ( 0.954 )

We can see that we resolve much of the uncertainty about B2
Example: Variance Partition
We can decompose the prior variance into its resolved and unresolved portions:
Var(B) = RVar_D(B) + Var_D(B)

[ 100.00  55.71 ]   [ 50.94  61.54 ]   [ 49.06  -5.83 ]
[  55.71 100.00 ] = [ 61.54  95.36 ] + [ -5.83   4.64 ]

Which is nice!
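The variance partition and resolutions can be reproduced numerically. Values differ from the slide's in the last decimal or so because the quoted correlations are rounded, but the partition identity holds exactly by construction:

```python
import numpy as np

# Prior covariance from the example (correlations rounded to 2 d.p.).
corr = np.array([
    [1.00, 0.56, 0.52, 0.61],
    [0.56, 1.00, 0.32, 0.98],
    [0.52, 0.32, 1.00, 0.28],
    [0.61, 0.98, 0.28, 1.00],
])
V = 100.0 * corr
B, D = [0, 1], [2, 3]
var_B = V[np.ix_(B, B)]
cov_BD = V[np.ix_(B, D)]
var_D = V[np.ix_(D, D)]

# RVar_D(B) = Cov(B,D) Var(D)^+ Cov(D,B);  Var_D(B) = Var(B) - RVar_D(B)
rvar = cov_BD @ np.linalg.pinv(var_D) @ cov_BD.T
var_adj = var_B - rvar
resolution = 1.0 - np.diag(var_adj) / np.diag(var_B)
```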
Interpretations of belief adjustment
An approximation
- If we're fully Bayesian, then adjusted expectation is a tractable approximation to the full Bayes conditional expectation
- Adjusted variance is then an easily-computable upper bound on the full Bayes preposterior risk, under quadratic loss
An estimator
- E_D(B) is an 'estimator' of the value of B, which combines the data with simple aspects of our prior beliefs in a plausible manner
- Adjusted variance is then the mean-squared error of the estimator E_D(B)
A primitive
- Adjusted expectation is a primitive quantification of further aspects of our beliefs about B having 'accounted for' D
- Adjusted variance is also a primitive, but applied to the 'residual variance' in B having removed the effects of D
Adjusted and Conditional Expectations
The conditional expectation of B|D is the value you would specify under the penalty L_C = Σi c Di [B - E(B|Di)]^2
If D is a partition, so Di ∈ {0,1} and Σi Di = 1, then the adjusted expectation minimises L_A = Σi c Di [B - xi]^2.
So we choose xi to be the conditional expectation, and
E_D(B) = Σi E(B|Di) Di
So when D is a partition, the adjusted and conditional expectations are identical
Adjusted expectation does not require D to be a partition, and so can be considered as a generalization of conditional expectation
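The partition result can be checked numerically with a made-up discrete example: the cell probabilities `p` and conditional expectations `b_given` below are arbitrary illustrative choices, and all moments are computed exactly from them.

```python
import numpy as np

# Hypothetical three-cell partition: D = (D1,D2,D3) are the cell indicators.
p = np.array([0.2, 0.5, 0.3])          # P(cell i)
b_given = np.array([1.0, 4.0, -2.0])   # E(B | cell i), chosen arbitrarily

E_D = p                                 # E(Di) = pi
E_B = p @ b_given
var_D = np.diag(p) - np.outer(p, p)     # Cov(Di,Dj) = pi*(i==j) - pi*pj (singular)
cov_BD = b_given * p - E_B * p          # Cov(B,Di) = E(B|Di)*pi - E(B)*pi

# Adjusted expectation evaluated at D = e_j (i.e. cell j occurred); the
# Moore-Penrose inverse handles the degeneracy of Var(D).
coef = cov_BD @ np.linalg.pinv(var_D)
adj = np.array([E_B + coef @ (np.eye(3)[j] - E_D) for j in range(3)])
# adj[j] recovers the conditional expectation E(B | cell j).
```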
Extension to linear combinations
Let ⟨B⟩ be the set of all linear combinations of B
If X = h^T B ∈ ⟨B⟩, then we can write
E(X) = h^T E(B),  Var(X) = h^T Var(B) h.
So by specifying E(B) and Var(B) we have implicitly specified expectations and variances for all elements of ⟨B⟩
Similarly, by calculating E_D(B) and Var_D(B), we have implicitly calculated the adjustment for all X ∈ ⟨B⟩
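This closure property is easy to verify numerically: adjusting X = h^T B from its own implied moments agrees with pushing h through the adjustment of the whole collection. A sketch with an arbitrary h and data value (both generated at random purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
corr = np.array([
    [1.00, 0.56, 0.52, 0.61],
    [0.56, 1.00, 0.32, 0.98],
    [0.52, 0.32, 1.00, 0.28],
    [0.61, 0.98, 0.28, 1.00],
])
V = 100.0 * corr
B, D = [0, 1], [2, 3]
var_B = V[np.ix_(B, B)]
cov_BD = V[np.ix_(B, D)]
var_D_inv = np.linalg.pinv(V[np.ix_(D, D)])

coef = cov_BD @ var_D_inv                 # E_D(B) = coef @ D (zero prior means)
var_adj = var_B - coef @ cov_BD.T         # Var_D(B)

h = rng.normal(size=2)                    # an arbitrary X = h^T B in <B>
d = rng.normal(size=2)                    # an arbitrary observed value of D

# Adjust X = h^T B directly from its own (implied) moments ...
cov_XD = h @ cov_BD
EX_adj = cov_XD @ var_D_inv @ d
varX_adj = h @ var_B @ h - cov_XD @ var_D_inv @ cov_XD

# ... and by pushing h through the adjustment of the collection B.
EX_via_B = h @ (coef @ d)
varX_via_B = h @ var_adj @ h
```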
The observed adjustment
Given the observed value d of D, we can calculate the observed adjusted expectation
E_d(B) = E(B) + Cov(B,D) Var(D)^†(d - E(D)).
For our example, we observe d = (-8, 10) and the corresponding observed adjusted expectations are:
E_d(B) = (2.02, 9.20)^T
Having observed D = d, we notice that our adjusted expectations have both increased
B1 is relatively far from D and so only moves a little, whereas B2 is close to D2 and so its expectation shifts substantially towards the value d2 = 10
Diagnostics
Data and Diagnostics
Once data have been observed (first for D and then for B) we can perform diagnostics.
The Bayes linear methodology has a rich variety of diagnostic tools available (more than in a fully Bayesian analysis).
We can perform diagnostics on individual random quantities, or on collections of random quantities.
Three important versions are:
- Prior Diagnostics
- Adjustment Diagnostics
- Final Observation Diagnostics
Prior Diagnostics
Each prior belief statement that we make describes our beliefs about some random quantity.
If we observe that quantity, we may compare what we expect to happen with what actually happens.
Once we observe the values of D = d, we can check whether the data is consistent with our prior specifications.
For a single random quantity, we can calculate the standardized change and the discrepancy:
S(di) = (di - E(Di)) / sqrt(Var(Di)),   Dis(di) = (di - E(Di))^2 / Var(Di) = S(di)^2
E(S(di)) = 0 and Var(S(di)) = 1, so if we observe |S(di)| greater than about 3 this suggests an inconsistency.
Discrepancy Ratio
For the entire collection, the natural counterpart of the discrepancy is the Mahalanobis distance:
Dis(d) = (d - E(D))^T Var(D)^†(d - E(D)).
The prior expected value of Dis(d) is given by E(Dis(d)) = rk{Var(D)}
NB: if we pretend D is Normal, then Dis(d) would be χ²
We can then normalise the discrepancy, to obtain the discrepancy ratio for d
Dr(d) = Dis(d) / rk{Var(D)},
which has prior expectation E(Dr(d)) = 1.
Large Dr(d) will of course also suggest inconsistencies.
Example: Prior Diagnostics
Comparing our observed values with our priors for those values, we obtain
S(d) = (-0.8, 1.0)^T,   Dr(d) = 1.1
So d1 is smaller than expected, and d2 is larger. But not by much.
Unsurprising, as observing d = (-8, 10) when we have a prior standard deviation of 10 is perfectly reasonable
If we assumed that S(di) is unimodal, then approximate 95% bounds are given by ±3σ – so this is clearly ok
Considering the collection, Dr(d) ≈ 1 so the observed values are not inconsistent with our prior beliefs
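These prior diagnostics take only a few lines to reproduce:

```python
import numpy as np

corr = np.array([
    [1.00, 0.56, 0.52, 0.61],
    [0.56, 1.00, 0.32, 0.98],
    [0.52, 0.32, 1.00, 0.28],
    [0.61, 0.98, 0.28, 1.00],
])
var_D = 100.0 * corr[np.ix_([2, 3], [2, 3])]
d = np.array([-8.0, 10.0])
E_D = np.zeros(2)                            # prior expectations

S = (d - E_D) / np.sqrt(np.diag(var_D))      # standardized changes S(d_i)
dis = (d - E_D) @ np.linalg.pinv(var_D) @ (d - E_D)   # discrepancy Dis(d)
dr = dis / np.linalg.matrix_rank(var_D)               # discrepancy ratio Dr(d)
```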
Adjustment Diagnostics
Having obtained the observed adjusted expectation, we may now check how much our beliefs have been affected by the data
We calculate the univariate standardized adjustments (and discrepancies):
S_d(Bi) = S(E_d(Bi)) = (E_d(Bi) - E(E_D(Bi))) / sqrt(Var(E_D(Bi))) = (E_d(Bi) - E(Bi)) / sqrt(RVar_D(Bi))
The adjustment discrepancy for a collection is given by:
Dis_d(B) = (E_d(B) - E(B))^T RVar_D(B)^†(E_d(B) - E(B)).
Again E(S(E_d(Bi))) = 0, Var(S(E_d(Bi))) = 1 and E(Dis_d(B)) = rk{RVar_D(B)}, so large values warrant further investigation.
Size and Size Ratio
We may also now check how different our observed adjusted expectation is from our prior expectations
For this, we calculate the size of the adjustment of B by D
Size_d(Bi) = (E_d(Bi) - E(Bi))^2 / Var(Bi)
Similarly, the size of the adjustment for the collection B by D = d is
Size_d(B) = (E_d(B) - E(B))^T Var(B)^†(E_d(B) - E(B)),
which has prior expectation RU_d(B).
Example: Adjustment Diagnostics
Calculating the standardised adjustments we obtain:
S_d(B) = (0.28, 0.94)^T,   Dr_d(B) = 1.13
So our beliefs about B1 appear only slightly affected by the data, whereas our beliefs about B2 are more influenced due to its strong correlation to D2
The size diagnostics are given by:
Size_d(Bi) = (0.04, 0.85)^T,   Size_d(B) = 0.49
Since E(Size_d(B)) = 1.13 (= RU_d(B)), this suggests our adjusted beliefs are closer to our priors than expected – perhaps indicating that we may have over-stated our variance
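The standardized adjustments and the univariate sizes can be reproduced numerically (the collection-level size depends on the exact unrounded covariances, so it is not checked here):

```python
import numpy as np

corr = np.array([
    [1.00, 0.56, 0.52, 0.61],
    [0.56, 1.00, 0.32, 0.98],
    [0.52, 0.32, 1.00, 0.28],
    [0.61, 0.98, 0.28, 1.00],
])
V = 100.0 * corr
B, D = [0, 1], [2, 3]
var_B = V[np.ix_(B, B)]
cov_BD = V[np.ix_(B, D)]
var_D = V[np.ix_(D, D)]

coef = cov_BD @ np.linalg.pinv(var_D)
rvar = coef @ cov_BD.T                       # RVar_D(B)
d = np.array([-8.0, 10.0])
E_adj = coef @ d                             # E_d(B) - E(B), since E(B) = 0

Sd = E_adj / np.sqrt(np.diag(rvar))          # standardized adjustments S_d(B_i)
dis_adj = E_adj @ np.linalg.pinv(rvar) @ E_adj
dr_adj = dis_adj / np.linalg.matrix_rank(rvar)
size = E_adj ** 2 / np.diag(var_B)           # univariate sizes Size_d(B_i)
```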
Final Observation Diagnostics
Eventually, we may observe the values B = b, and in addition to checking how these deviate from the prior expected values E(B), we should also check the change from adjusted expectation E_d(B) to actual observation b.
For a single random quantity, the appropriate standardized change and discrepancy are
S_d(bi) = S(A_d(bi)) = (bi - E_d(Bi)) / sqrt(Var_D(Bi)),   Dis_d(bi) = S_d(bi)^2,
and for the collection we have:
Dis_d(b) = (b - E_d(B))^T Var_D(B)^†(b - E_d(B)).
Such checks tell us whether the predictions were roughly within the tolerances suggested by our prior variance specifications.
Example: Final Observation Diagnostics
We actually observe B to be b = (1, 9). Comparing this to our adjusted expectations given D, we obtain
S_d(b) = (-0.15, -0.09)^T,   Dr_d(b) = 0.02
So our adjusted expectation is suspiciously close to the observed values of B, and Dr_d(b) is suspiciously small
Perhaps this could indicate we've overstated our variance or mis-specified our correlation
Or perhaps it's just a great prediction! Or an artificially good simulated example!
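The final observation diagnostics follow the same pattern. With the rounded correlations the exact numbers drift slightly from the slide's (-0.15, -0.09, 0.02), but the qualitative conclusion (a suspiciously small discrepancy ratio) is reproduced:

```python
import numpy as np

corr = np.array([
    [1.00, 0.56, 0.52, 0.61],
    [0.56, 1.00, 0.32, 0.98],
    [0.52, 0.32, 1.00, 0.28],
    [0.61, 0.98, 0.28, 1.00],
])
V = 100.0 * corr
B, D = [0, 1], [2, 3]
var_B = V[np.ix_(B, B)]
cov_BD = V[np.ix_(B, D)]
var_D = V[np.ix_(D, D)]

coef = cov_BD @ np.linalg.pinv(var_D)
var_adj = var_B - coef @ cov_BD.T            # Var_D(B)
d = np.array([-8.0, 10.0])
b = np.array([1.0, 9.0])                     # the eventual observation of B
resid = b - coef @ d                         # b - E_d(B)

Sd_b = resid / np.sqrt(np.diag(var_adj))     # standardized changes S_d(b_i)
dr_b = resid @ np.linalg.pinv(var_adj) @ resid / np.linalg.matrix_rank(var_adj)
```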
The observed adjustment
[Figure: diagnostics for differing choices of E[D], with Dis_d(b) and Dis(d) plotted against the prior expectation of D]
Diagnostics: morals
Piecemeal diagnostic analysis of individual quantities is not sufficient. We have to examine diagnostics for collections.
Each part of the adjustment process can be diagnostically scrutinized. We have shown diagnostic measures for:
- the raw data;
- the difference between adjusted expectations (estimates) and prior expectations, relative to variance explained and prior variance;
- the difference between adjusted versions (residuals) and prior expectations, relative to variance remaining.
If we had found a problem, how might we have avoided it, or at least detected it sooner?
Canonical Structure
Canonical analysis
Our belief specification for B and our adjustment by D implies specifications and adjustments for all linear combinations in ⟨B⟩.
We can explore the (possibly complex) changes in beliefs about ⟨B⟩ induced by the adjustment via a canonical analysis
A key component of the canonical analysis is the resolution transform matrix, defined as
T_{B:D} = Var(B)^† Cov(B,D) Var(D)^† Cov(D,B).
T_{B:D} has the property that Var(B) T_{B:D} = RVar_D(B)
The eigenstructure of T_{B:D} summarises all the effects of belief adjustment
Let the normed right eigenvectors of T_{B:D} be v1, ..., v_rB, ordered by eigenvalues 1 ≥ λ1 ≥ λ2 ≥ ... ≥ λ_rB ≥ 0 and scaled as vi^T Var(B) vi = 1
Canonical directions
We define the ith canonical direction as
Yi = vi^T (B - E(B))
The canonical directions have the following properties
E(Yi) = 0,  Var(Yi) = 1,  Corr(Yi, Yj) = 0 (i ≠ j)
RVar_D(Yi) = λi,  Var_D(Yi) = 1 - λi
So the collection {Y1, Y2, ...} forms a mutually uncorrelated 'grid' of directions over ⟨B⟩, summarizing the effects of the adjustment.
Y1 is the quantity we learn most about. Y2 is the quantity we learn next most about, given that it is uncorrelated with Y1. Y_rk{B} is the quantity we learn least about.
Relationship to canonical correlation analysis (and PCA)
Canonical properties and system resolution
Each X ∈ ⟨B⟩ can be expressed using the canonical structure as
X - E(X) = Σi Cov(X, Yi) Yi,
and RVar_D(X) = Σi λi (Corr(X, Yi))^2
We can use this structure to express the resolved uncertainty for the entire collection ⟨B⟩ given adjustment by D via the resolved uncertainty and the system resolution
RU_D(B) = Σi λi,   R_D(B) = (1/rk{B}) Σi λi
R_D(B) is a scalar summary of the effectiveness of the adjustment by D for the entire collection ⟨B⟩
Bearing 180, Mark 0
The matrix T_{B:D} fully summarises all aspects of the unobserved adjustment
The observed adjustment can be summarised in a single vector – the bearing
The bearing for the adjustment of B by D = d is a random quantity in ⟨B⟩ which maximises Size_d(X) and is given by
Z_d(B) = [E_d(B) - E(B)]^T Var(B)^†[B - E(B)].
The bearing expresses both the direction and the magnitude of the change between prior and adjusted beliefs, relative to the prior covariance specification.
The biggest possible expected squared change in expectation, relative to prior variance, is for the linear combination given by Z_d(B)
Example: Canonical gubbins
Investigating the canonical structure of the unobserved adjustment yields:

T_{B:D} = [ 0.24  0.12 ]       RU_D(B) = 1.13
          [ 0.48  0.89 ],

λ = (0.97, 0.16)^T,   Y1 = 0.17 B1 + 0.98 B2,   Y2 = 0.83 B1 - 0.55 B2

So we learn most about Y1 (and so particularly about B2); we expect to resolve 97% of the uncertainty in this direction
We learn comparatively little in direction Y2
Having seen the data, we can obtain the bearing
Z_d(B) = -0.045 B1 + 0.117 B2
Which is nice!
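The resolution transform, its eigenvalues and the bearing can all be reproduced from the (rounded) prior specification:

```python
import numpy as np

corr = np.array([
    [1.00, 0.56, 0.52, 0.61],
    [0.56, 1.00, 0.32, 0.98],
    [0.52, 0.32, 1.00, 0.28],
    [0.61, 0.98, 0.28, 1.00],
])
V = 100.0 * corr
B, D = [0, 1], [2, 3]
var_B = V[np.ix_(B, B)]
cov_BD = V[np.ix_(B, D)]
var_D = V[np.ix_(D, D)]

rvar = cov_BD @ np.linalg.pinv(var_D) @ cov_BD.T
T = np.linalg.pinv(var_B) @ rvar                  # resolution transform T_{B:D}
lam = np.sort(np.linalg.eigvals(T).real)[::-1]    # canonical resolutions

d = np.array([-8.0, 10.0])
E_adj = cov_BD @ np.linalg.pinv(var_D) @ d
z = np.linalg.pinv(var_B) @ E_adj                 # coefficients of the bearing Z_d(B)
```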
Partial Analysis
I need more data!
- Suppose we have already adjusted our beliefs about B given data D
- Now suppose we get even more data F; how should we further adjust our beliefs about B?
What does this bit do?
- Suppose we have already adjusted our beliefs about B given data H = D ∪ F
- What were the individual effects of adjusting by D or F?
This requires a partial analysis, where we consider the effects of subsets of the data on our beliefs
Partial adjustments
By adjusting beliefs sequentially, we can separate and scrutinize the adjustments at each stage
In order to separate the effects on our beliefs of different sub-collections, we evaluate partial adjustments representing the change in adjustment as we accumulate data.
Suppose we intend to adjust our beliefs about B by observations on D and F
We adjust B by (D ∪ F) but separate the effects of the subsets by adjusting B in stages, first by D, then adding F (or vice versa)
Separating things out
How do we separate the effects of D and F on B?
If D ⊥⊥ F, then adjusted expectations are additive, so
E_{D∪F}(B - E(B)) = E_D(B - E(B)) + E_F(B - E(B))
If D and F are correlated, then we obtain a similar expression by removing the 'common variability' between F and D.
For any D, F, the vectors D and A_D(F) = F - E_D(F) are uncorrelated.
Also the collection of linear combinations ⟨D ∪ F⟩ is the same as ⟨D ∪ A_D(F)⟩
So, for any D, F
E_{D∪F}(B - E(B)) = E_D(B - E(B)) + E_{A_D(F)}(B - E(B))
The partial adjustment
The partial adjustment of B by F given D, denoted E_{[F/D]}(B), is defined by
E_{D∪F}(B) = E_D(B) + E_{[F/D]}(B)
We can partition the variance in several ways:
Var(B) = RVar_D(B) + Var_D(B)
       = RVar_D(B) + RVar_{[F/D]}(B) + Var_{D∪F}(B)
       = RVar_{D∪F}(B) + Var_{D∪F}(B)
The partial resolved variance matrix of B by F given D is
RVar_{[F/D]}(B) = Var(E_{[F/D]}(B))
Diagnostics and Path correlation
Every summary and diagnostic which we have already discussed can be calculated for the partial adjustment
There is an extra diagnostic available for partial adjustments
Every belief adjustment changes our beliefs. This change is encapsulated in the bearing for that adjustment.
If we do multiple partial adjustments, these changes may reinforce or contradict one another
We can assess this by the path correlation
PC(d, [f/d]) = Corr(Z_d(B), Z_{[f/d]}(B)).
If this is near +1, then we may view the two collections as complementary
If this is near -1, then the two collections are giving contradictory messages
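The partial adjustment identity and the path correlation can be sketched numerically. Everything below is hypothetical: a third collection F and a randomly generated joint covariance stand in for real data, purely to check that the two-stage adjustment reproduces the pooled one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint covariance over B (2 quantities), D (2), F (2),
# built as A A^T so it is positive definite; prior means taken as zero.
A = rng.normal(size=(6, 6))
V = A @ A.T
B, D, F = [0, 1], [2, 3], [4, 5]
DF = D + F
d = rng.normal(size=2)                       # made-up observed values of D
f = rng.normal(size=2)                       # made-up observed values of F

def cov(i, j):
    return V[np.ix_(i, j)]

pinv = np.linalg.pinv

# One-step adjustment by the pooled data (D ∪ F):
E_DF = cov(B, DF) @ pinv(cov(DF, DF)) @ np.concatenate([d, f])

# Two-stage: adjust by D, then by the 'new part' of F, A_D(F) = F - E_D(F).
E_D_B = cov(B, D) @ pinv(cov(D, D)) @ d
resid_f = f - cov(F, D) @ pinv(cov(D, D)) @ d           # observed A_D(F)
cov_B_AF = cov(B, F) - cov(B, D) @ pinv(cov(D, D)) @ cov(D, F)
var_AF = cov(F, F) - cov(F, D) @ pinv(cov(D, D)) @ cov(D, F)
E_partial = cov_B_AF @ pinv(var_AF) @ resid_f           # E_[F/D](B)

# Path correlation between the bearings Z_d(B) and Z_[f/d](B):
z1 = pinv(cov(B, B)) @ E_D_B
z2 = pinv(cov(B, B)) @ E_partial
pc = (z1 @ cov(B, B) @ z2) / np.sqrt(
    (z1 @ cov(B, B) @ z1) * (z2 @ cov(B, B) @ z2))
```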
The end
We have seen:
How we represent our beliefs – using expectation as primitive
How we would update our beliefs – the BL adjustment
How we can investigate potential problems in our belief specification – diagnostics
How we can understand how our beliefs are affected by the data – canonical analysis
How we would incorporate additional information – partial analysis