Chapter 4
More than one parameter
Aims:
◃ Moving towards practical applications
◃ Illustrating that computations become quickly involved
◃ Illustrating that frequentist results can be obtained with Bayesian procedures
◃ Illustrating a multivariate (independent) sampling algorithm
Bayesian Biostatistics - Piracicaba 2014 196
4.1 Introduction
• Most statistical models involve more than one parameter to estimate
• Examples:
◃ Normal distribution: mean µ and variance σ2
◃ Linear regression: regression coefficients β0, β1, . . . , βd and residual variance σ2
◃ Logistic regression: regression coefficients β0, β1, . . . , βd
◃ Multinomial distribution: class probabilities θ1, θ2, . . . , θd with Σdj=1 θj = 1
• This requires a joint prior for all parameters, expressing our beliefs about the model parameters
• Aim: derive the posterior for all parameters and their summary measures
Bayesian Biostatistics - Piracicaba 2014 197
• It turns out that in most cases analytical solutions for the posterior will not be possible anymore
• In Chapter 6, we will see that Markov chain Monte Carlo methods are needed for this
• Here, we look at a simple multivariate sampling approach: Method of Composition
Bayesian Biostatistics - Piracicaba 2014 198
4.2 Joint versus marginal posterior inference
• Bayes theorem:
p(θ | y) = L(θ | y) p(θ) / ∫ L(θ | y) p(θ) dθ
◦ Hence, the same expression as before but now θ = (θ1, θ2, . . . , θd)T
◦ Now, the prior p(θ) is multivariate. But often a prior is given for each parameter separately
◦ Posterior p(θ | y) is also multivariate. But we usually look only at the (marginal) posteriors p(θj | y) (j = 1, . . . , d)
• We also need for each parameter: posterior mean, median (and sometimes mode), and credible intervals
Bayesian Biostatistics - Piracicaba 2014 199
• Illustration on the normal distribution with µ and σ2 unknown
• Application: determining 95% normal range of alp (continuation of Example III.6)
• We look at three cases (priors):
◃ No prior knowledge is available
◃ Previous study is available
◃ Expert knowledge is available
• But first, a brief theoretical introduction
Bayesian Biostatistics - Piracicaba 2014 200
4.3 The normal distribution with µ and σ2 unknown
Acknowledging that µ and σ2 are unknown
• Sample y1, . . . , yn of independent observations from N(µ, σ2)
• Joint likelihood of (µ, σ2) given y:
L(µ, σ2 | y) = (1/(2πσ2)^(n/2)) exp[−(1/(2σ2)) Σni=1 (yi − µ)2]
• The posterior is again the product of likelihood and prior, divided by the denominator, which involves an integral
• In this case analytical calculations are possible in 2 of the 3 cases
Bayesian Biostatistics - Piracicaba 2014 201
4.3.1 No prior knowledge on µ and σ2 is available
• Noninformative joint prior p(µ, σ2) ∝ σ−2 (µ and σ2 a priori independent)
• Posterior distribution p(µ, σ2 | y) ∝ (1/σ^(n+2)) exp{−(1/(2σ2)) [(n − 1)s2 + n(ȳ − µ)2]}
[Figure: joint posterior density of (µ, σ2)]
Bayesian Biostatistics - Piracicaba 2014 202
Justification prior distribution
• Most often prior information on several parameters arrives to us for each of the parameters separately and independently ⇒ p(µ, σ2) = p(µ) × p(σ2)
• And, we do not have prior information on µ nor on σ2 ⇒ choice of prior distributions:
[Figure: flat prior for µ and flat prior for log(σ)]
• The chosen priors are called flat priors
Bayesian Biostatistics - Piracicaba 2014 203
• Motivation:
◦ If one is totally ignorant of a location parameter, then it could take any value on the real line with equal prior probability.
◦ If totally ignorant about the scale of a parameter, then it is as likely to lie in the interval 1-10 as in the interval 10-100. This implies a flat prior on the log scale.
• The flat prior p(log(σ)) = c is equivalent to the prior p(σ2) ∝ σ−2
Bayesian Biostatistics - Piracicaba 2014 204
Marginal posterior distributions
Marginal posterior distributions are needed in practice
◃ p(µ | y)
◃ p(σ2 | y)
• Calculation of the marginal posterior distributions involves integration:
p(µ | y) = ∫ p(µ, σ2 | y) dσ2 = ∫ p(µ | σ2, y) p(σ2 | y) dσ2
• Marginal posterior is a weighted sum of conditional posteriors, with weights = uncertainty on the other parameter(s)
Bayesian Biostatistics - Piracicaba 2014 205
Conditional & marginal posterior distributions for the normal case
• Conditional posterior for µ: p(µ | σ2, y) = N(ȳ, σ2/n)
• Marginal posterior for µ: p(µ | y) = tn−1(ȳ, s2/n)
⇒ (µ − ȳ)/(s/√n) ∼ tn−1 (µ is the random variable)
• Marginal posterior for σ2: p(σ2 | y) = Inv-χ2(n − 1, s2) (scaled inverse chi-squared distribution)
⇒ (n − 1)s2/σ2 ∼ χ2n−1 (σ2 is the random variable)
= special case of IG(α, β) with α = (n − 1)/2, β = (n − 1)s2/2
Bayesian Biostatistics - Piracicaba 2014 206
Some t-densities
[Figure: t-densities for various degrees of freedom]
Bayesian Biostatistics - Piracicaba 2014 207
Some inverse-gamma densities
[Figure: inverse-gamma densities for various shape and scale parameters]
Bayesian Biostatistics - Piracicaba 2014 208
Joint posterior distribution
• Joint posterior = multiplication of marginal with conditional posterior
p(µ, σ2 | y) = p(µ | σ2, y) p(σ2 | y) = N(ȳ, σ2/n) × Inv-χ2(n − 1, s2)
• Normal-scaled-inverse chi-squared distribution = N-Inv-χ2(ȳ, n, n − 1, s2)
[Figure: contours of the joint posterior density of (µ, σ2)]
⇒ A posteriori µ and σ2 are dependent
Bayesian Biostatistics - Piracicaba 2014 209
Posterior summary measures and PPD
For µ:
◃ Posterior mean = mode = median = ȳ
◃ Posterior variance = ((n − 1)/(n(n − 3))) s2
◃ 95% equal tail credible and HPD interval:
[ȳ − t(0.025; n − 1) s/√n, ȳ + t(0.025; n − 1) s/√n]
For σ2:
◦ Posterior mean, mode, median, variance, 95% equal tail CI all analytically available
◦ 95% HPD interval is computed iteratively
PPD:
◦ tn−1[ȳ, s2(1 + 1/n)]-distribution
Bayesian Biostatistics - Piracicaba 2014 210
Implications of previous results
Frequentist versus Bayesian inference:
◃ Numerical results are the same
◃ Inference is based on different principles
Bayesian Biostatistics - Piracicaba 2014 211
Example IV.1: SAP study – Noninformative prior
◃ Example III.6: normal range for alp is too narrow
◃ Joint posterior distribution = N-Inv-χ2 (NI prior + likelihood, see before)
◃ Marginal posterior distributions (red curves) for y = 100/√alp
[Figure: marginal posterior densities of µ and σ2]
Bayesian Biostatistics - Piracicaba 2014 212
Normal range for alp:
• PPD for y = t249(7.11, 1.37)-distribution
• 95% normal range for alp = [104.1, 513.2], slightly wider than before
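This back-transformation is easy to verify numerically. A minimal sketch in Python (assuming the PPD is read as a t-distribution with 249 df, location 7.11 and scale 1.37, and recovering alp as (100/y)2; small differences from the slide's interval can arise from sampling versus analytical quantiles):

```python
from scipy.stats import t

# PPD for y = 100/sqrt(alp): t-distribution with 249 df,
# location 7.11 and scale 1.37
df, loc, scale = 249, 7.11, 1.37

# 95% equal-tail interval for y on the transformed scale
y_lo, y_hi = t.ppf([0.025, 0.975], df, loc=loc, scale=scale)

# back-transform: alp = (100 / y)^2 is monotone decreasing, so the bounds swap
alp_lo, alp_hi = (100.0 / y_hi) ** 2, (100.0 / y_lo) ** 2
# alp_lo, alp_hi are close to the slide's [104.1, 513.2]
```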
Bayesian Biostatistics - Piracicaba 2014 213
4.3.2 An historical study is available
• Posterior of historical data can be used as prior to the likelihood of current data
• Prior = N-Inv-χ2(µ0,κ0,ν0,σ20)-distribution (from historical data)
• Posterior = N-Inv-χ2(µ,κ, ν,σ2)-distribution (combining data and N-Inv-χ2 prior)
◃ N-Inv-χ2 is conjugate prior
◃ Again shrinkage of posterior mean towards prior mean
◃ Posterior variance = weighted average of prior variance, sample variance and the distance between prior and sample mean
⇒ posterior variance is not necessarily smaller than prior variance!
• Similar results for posterior measures and PPD as in first case
Bayesian Biostatistics - Piracicaba 2014 214
Example IV.2: SAP study – Conjugate prior
• Prior based on retrospective study (Topal et al., 2003) of 65 ‘healthy’ subjects:
◦ Mean (SD) for y = 100/√alp = 5.25 (1.66)
◦ Conjugate prior = N-Inv-χ2(5.25, 65, 64,2.76)
◦ Note: mean (SD) prospective data: 7.11 (1.4), quite different
◦ Posterior = N-Inv-χ2(6.72, 315, 314, 2.61):
◦ Posterior mean in between prior mean & sample mean, but:
◦ Posterior precision = prior + sample precision
◦ Posterior variance < prior variance and > sample variance
◦ Posterior variance with informative prior > with NI prior
◦ Prior information did not lower posterior uncertainty; reason: conflict of likelihood with prior
Bayesian Biostatistics - Piracicaba 2014 215
Marginal posteriors:
[Figure: marginal posterior densities of µ and σ2]
Red curves = marginal posteriors from informative prior (historical data)
Bayesian Biostatistics - Piracicaba 2014 216
Histograms retro- and prospective data:
[Figure: histograms of 100/√alp — prospective data (likelihood) and retrospective data (informative prior)]
Bayesian Biostatistics - Piracicaba 2014 217
4.3.3 Expert knowledge is available
• Expert knowledge available on each parameter separately
⇒ Joint prior N(µ0, σ02) × Inv-χ2(ν0, τ02), which is not conjugate
• Posterior cannot be derived analytically, but numerical/sampling techniques are available
Bayesian Biostatistics - Piracicaba 2014 218
What now?
Computational problem:
◃ ‘Simplest problem’ in classical statistics is already complicated
◃ Ad hoc solution is still possible, but not satisfactory
◃ There is the need for another approach
Bayesian Biostatistics - Piracicaba 2014 219
4.4 Multivariate distributions
Distributions with a multivariate response:
◃ Multivariate normal distribution: generalization of normal distribution
◃ Multivariate Student’s t-distribution: generalization of location-scale t-distribution
◃ Multinomial distribution: generalization of binomial distribution
Multivariate prior distributions:
◃ N-Inv-χ2-distribution: prior for N(µ, σ2)
◃ Dirichlet distribution: generalization of beta distribution
◃ (Inverse-)Wishart distribution: generalization of the (inverse-)gamma (prior) for covariance matrices (see mixed model chapter)
Bayesian Biostatistics - Piracicaba 2014 220
Example IV.3: Young adult study – Smoking and alcohol drinking
• Study examining life style among young adults
Smoking
Alcohol No Yes
No-Mild 180 41
Moderate-Heavy 216 64
Total 396 105
• Of interest: association between smoking & alcohol-consumption
Bayesian Biostatistics - Piracicaba 2014 221
Likelihood part:
2×2 contingency table = multinomial model Mult(n, θ)
• θ = {θ11, θ12, θ21, θ22} with θ22 = 1 − θ11 − θ12 − θ21, hence Σi,j θij = 1
• y = {y11, y12, y21, y22} with n = Σi,j yij
Mult(n, θ) = (n! / (y11! y12! y21! y22!)) θ11^y11 θ12^y12 θ21^y21 θ22^y22
Bayesian Biostatistics - Piracicaba 2014 222
Dirichlet prior:
Conjugate prior to the multinomial distribution = Dirichlet prior Dir(α)
p(θ | α) = (1/B(α)) Πi,j θij^(αij − 1)
◦ α = {α11, α12, α21, α22}
◦ B(α) = Πi,j Γ(αij) / Γ(Σi,j αij)
⇒ Posterior distribution = Dir(α + y)
• Note:
◦ Dirichlet distribution = extension of the beta distribution to higher dimensions
◦ Marginal distributions of a Dirichlet distribution = beta distributions
Bayesian Biostatistics - Piracicaba 2014 223
Measuring association:
• Association between smoking and alcohol consumption:
ψ = (θ11 θ22) / (θ12 θ21)
• Needed: p(ψ | y), but difficult to derive analytically
• Alternative: replace analytical calculations by a sampling procedure
Bayesian Biostatistics - Piracicaba 2014 224
Analysis of contingency table:
• Prior distribution: Dir(1, 1, 1, 1)
• Posterior distribution: Dir(180+1, 41+1, 216+1, 64+1)
• Sample of 10,000 generated values for the θ parameters
• 95% equal tail CI for ψ: [0.839, 2.014]
• Practically equal to the classically obtained estimate
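The sampling analysis above can be sketched as follows (the seed and number of draws are arbitrary choices, so the interval will vary slightly between runs):

```python
import numpy as np

rng = np.random.default_rng(2014)

# posterior Dir(181, 42, 217, 65): Dir(1,1,1,1) prior + observed counts
alpha_post = np.array([180 + 1, 41 + 1, 216 + 1, 64 + 1])

# 10,000 draws of (theta11, theta12, theta21, theta22)
theta = rng.dirichlet(alpha_post, size=10_000)

# cross-ratio psi = theta11*theta22 / (theta12*theta21), one value per draw
psi = theta[:, 0] * theta[:, 3] / (theta[:, 1] * theta[:, 2])

# 95% equal-tail credible interval, close to the slide's [0.839, 2.014]
lo, hi = np.percentile(psi, [2.5, 97.5])
```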
Bayesian Biostatistics - Piracicaba 2014 225
Posterior distributions:
[Figure: sampled posterior distributions of θ11, θ12, θ21 and ψ]
Bayesian Biostatistics - Piracicaba 2014 226
4.5 Frequentist properties of Bayesian inference
• Not of prime interest for a Bayesian to know the sampling properties of estimators
• However, it is important that the Bayesian approach most often gives the right answer
• What is known?
◃ Theory: posterior is normal for a large sample (BCLT)
◃ Simulations: the Bayesian approach may offer alternative interval estimators with better coverage than classical frequentist approaches
Bayesian Biostatistics - Piracicaba 2014 227
4.6 The Method of Composition
A method to yield a random sample from a multivariate distribution
• Stagewise approach
• Based on factorization of joint distribution into a marginal & several conditionals
p(θ1, . . . , θd | y) = p(θd | y) p(θd−1 | θd, y) . . . p(θ1 | θ2, . . . , θd, y)
• Sampling approach:
◃ Sample θd from p(θd | y)
◃ Sample θd−1 from p(θd−1 | θd, y)
◃ . . .
◃ Sample θ1 from p(θ1 | θ2, . . . , θd, y)
Bayesian Biostatistics - Piracicaba 2014 228
Sampling from posterior when y ∼ N(µ, σ2), both parameters unknown
• Sample first σ2; then, given the sampled value σ̃2, sample µ from p(µ | σ̃2, y)
• Output case 1: No prior knowledge on µ and σ2 on next page
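A minimal sketch of this two-step sampler (the summary statistics n, ȳ and s2 below are illustrative stand-ins, not the SAP data):

```python
import numpy as np

rng = np.random.default_rng(42)

# summary statistics of the observed sample (illustrative values)
n, ybar, s2 = 250, 7.11, 1.37

M = 10_000
# step 1: sigma2 ~ Inv-chi^2(n-1, s^2), generated as (n-1) s^2 / chi^2_{n-1}
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=M)

# step 2: mu | sigma2 ~ N(ybar, sigma2 / n)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))

# the pairs (mu, sigma2) are draws from the joint posterior p(mu, sigma2 | y)
```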
Bayesian Biostatistics - Piracicaba 2014 229
Sampled posterior distributions:
[Figure: (a) sampled p(σ2 | y), (b) sampled p(µ | y), (c) sampled joint posterior of (µ, σ2), (d) sampled PPD of ỹ]
Bayesian Biostatistics - Piracicaba 2014 230
4.7 Bayesian linear regression models
• Example of a classical multiple linear regression analysis
• Non-informative Bayesian multiple linear regression analysis:
◃ Non-informative prior for all parameters + . . . classical linear regression model
◃ Analytical results are available + method of composition can be applied
Bayesian Biostatistics - Piracicaba 2014 231
4.7.1 The frequentist approach to linear regression
Classical regression model: y =Xβ + ε
. y = n× 1 vector of independent responses
.X = n× (d + 1) design matrix
. β = (d + 1)× 1 vector of regression parameters
. ε = n× 1 vector of random errors ∼ N(0, σ2 I)
Likelihood:
L(β, σ2 | y, X) = (1/(2πσ2)^(n/2)) exp[−(1/(2σ2)) (y − Xβ)T (y − Xβ)]
. MLE = LSE of β: β̂ = (XTX)−1XTy
. Residual sum of squares: S = (y − Xβ̂)T (y − Xβ̂)
. Mean residual sum of squares: s2 = S/(n − d − 1)
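These quantities are straightforward to compute; a small sketch on simulated data (the coefficients and noise level below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data for illustration: y = 1.0 + 2.0*x + noise
n, d = 100, 1
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])          # n x (d+1) design matrix
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

# MLE = LSE: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# residual sum of squares S and mean residual sum of squares s^2
resid = y - X @ beta_hat
S = resid @ resid
s2 = S / (n - d - 1)
```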
Bayesian Biostatistics - Piracicaba 2014 232
Example IV.7: Osteoporosis study: a frequentist linear regression analysis
◃ Cross-sectional study (Boonen et al., 1996)
◃ 245 healthy elderly women in a geriatric hospital
◃ Aim: Find determinants for osteoporosis
◃ Average age women = 75 yrs with a range of 70-90 yrs
◃ Marker for osteoporosis = tbbmc (in kg) measured for 234 women
◃ Simple linear regression model: regressing tbbmc on bmi
◃ Classical frequentist regression analysis:
◦ β̂0 = 0.813 (0.12)
◦ β̂1 = 0.0404 (0.0043)
◦ s2 = 0.29, with n − d − 1 = 232
◦ corr(β̂0, β̂1) = −0.99
Bayesian Biostatistics - Piracicaba 2014 233
Scatterplot + fitted regression line:
[Figure: scatterplot of TBBMC (kg) versus BMI (kg/m2) with fitted regression line]
Bayesian Biostatistics - Piracicaba 2014 234
4.7.2 A noninformative Bayesian linear regression model
Bayesian linear regression model = prior information on regression parameters & residual variance + normal regression likelihood
• Noninformative prior for (β, σ2): p(β, σ2) ∝ σ−2
• Notation: omit design matrix X
• Posterior distributions:
p(β, σ2 | y) = N(d+1)(β | β̂, σ2(XTX)−1) × Inv-χ2(σ2 | n − d − 1, s2)
p(β | σ2, y) = N(d+1)(β | β̂, σ2(XTX)−1)
p(σ2 | y) = Inv-χ2(σ2 | n − d − 1, s2)
p(β | y) = Tn−d−1(β | β̂, s2(XTX)−1)
Bayesian Biostatistics - Piracicaba 2014 235
4.7.3 Posterior summary measures for the linear regression model
• Posterior summary measures of
(a) regression parameters β
(b) parameter of residual variability σ2
• Univariate posterior summary measures
◃ The marginal posterior mean (mode, median) of βj = MLE (LSE) β̂j
◃ 95% HPD interval for βj
◃ Marginal posterior mode and mean of σ2
◃ 95% HPD-interval for σ2
Bayesian Biostatistics - Piracicaba 2014 236
Multivariate posterior summary measures
Multivariate posterior summary measures for β
• Posterior mean (mode) of β = β (MLE=LSE)
• 100(1-α)%-HPD region
• Contour probability for H0 : β = β0
Bayesian Biostatistics - Piracicaba 2014 237
Posterior predictive distribution
• PPD of a future observation ỹ at covariate value x: t-distribution
• How to sample?
◃ Directly from t-distribution
◃ Method of Composition
Bayesian Biostatistics - Piracicaba 2014 238
4.7.4 Sampling from the posterior distribution
• Most posteriors can be sampled via standard sampling algorithms
• What about p(β | y) = multivariate t-distribution? How to sample from this distribution? (R function rmvt in package mvtnorm)
• Easy with Method of Composition: Sample in two steps
◃ Sample from p(σ2 | y): scaled inverse chi-squared distribution ⇒ σ2
◃ Sample from p(β | σ2,y) = multivariate normal distribution
Bayesian Biostatistics - Piracicaba 2014 239
Example IV.8: Osteoporosis study – Sampling with Method of Composition
• Sample σ̃2 from p(σ2 | y) = Inv-χ2(σ2 | n − d − 1, s2)
• Sample β from p(β | σ̃2, y) = N(d+1)(β | β̂, σ̃2(XTX)−1)
• Sampled mean regression vector = (0.816, 0.0403)
• 95% equal tail CIs = β0: [0.594, 1.040] & β1: [0.0317, 0.0486]
• Contour probability for H0 : β = 0 is < 0.001
• Marginal posterior of (β0, β1) has a ridge (r(β0, β1) = −0.99)
Bayesian Biostatistics - Piracicaba 2014 240
PPD:
• Distribution of a future observation at bmi = 30
• Sample future observation ỹ from N(µ̃30, σ̃230):
◃ µ̃30 = β̃T(1, 30)T
◃ σ̃230 = σ̃2[1 + (1, 30)(XTX)−1(1, 30)T]
• Sampled mean and standard deviation = 2.033 and 0.282
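The sampling steps can be sketched end-to-end. This uses simulated stand-in data (the osteoporosis data set itself is not reproduced here); given a sampled β̃ and σ̃2, the future observation is drawn from N((1,30)β̃, σ̃2), which is distributionally equivalent to the slide's formulation with β̂:

```python
import numpy as np

rng = np.random.default_rng(1)

# stand-in data mimicking the reported fit: tbbmc ~ 0.813 + 0.0404*bmi
n, d = 234, 1
bmi = rng.uniform(20, 40, n)
X = np.column_stack([np.ones(n), bmi])
y = 0.813 + 0.0404 * bmi + rng.normal(0, 0.29, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - d - 1)
XtX_inv = np.linalg.inv(X.T @ X)

M = 5_000
x0 = np.array([1.0, 30.0])
ppd = np.empty(M)
for m in range(M):
    # step 1: sigma2 ~ Inv-chi^2(n-d-1, s^2)
    sigma2 = (n - d - 1) * s2 / rng.chisquare(n - d - 1)
    # step 2: beta | sigma2 ~ N(beta_hat, sigma2 (X'X)^{-1})
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    # step 3: future observation at bmi = 30, given the sampled (beta, sigma2)
    ppd[m] = rng.normal(x0 @ beta, np.sqrt(sigma2))
```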
Bayesian Biostatistics - Piracicaba 2014 241
Posterior distributions:
[Figure: (a) sampled posterior of β0, (b) sampled posterior of β1, (c) sampled joint posterior of (β0, β1), (d) sampled PPD of ỹ at bmi = 30]
Bayesian Biostatistics - Piracicaba 2014 242
4.8 Bayesian generalized linear models
Generalized Linear Model (GLIM): extension of the linear regression model to a wide class of regression models
• Examples:
◦ Normal linear regression model: normal distribution for a continuous response, σ2 assumed known
◦ Poisson regression model: Poisson distribution for a count response, log(mean) = linear function of covariates
◦ Logistic regression model: Bernoulli distribution for a binary response, logit of probability = linear function of covariates
Bayesian Biostatistics - Piracicaba 2014 243
4.8.1 More complex regression models
• The multiparameter models considered so far are limited
◃ Weibull distribution for alp?
◃ Censored/truncated data?
◃ Cox regression?
• Postpone to MCMC techniques
Bayesian Biostatistics - Piracicaba 2014 244
Take home messages
• Any practical application involves more than one parameter, hence Bayesian inference is immediately multivariate, even with univariate data.
• A multivariate prior is needed and a multivariate posterior is obtained, but the marginal posterior is the basis for practical inference
• Nuisance parameters:
◃ Bayesian inference: average out the nuisance parameter
◃ Classical inference: profile out (maximize out) the nuisance parameter
• Multivariate independent sampling can be done, if marginals can be computed
• Frequentist properties of Bayesian estimators (with NI priors) often good
Bayesian Biostatistics - Piracicaba 2014 245
Chapter 5
Choosing the prior distribution
Aims:
◃ Review the different principles that lead to a prior distribution
◃ Critically review the impact of the subjectivity of prior information
Bayesian Biostatistics - Piracicaba 2014 246
5.1 Introduction
Incorporating prior knowledge
◃ Unique feature for Bayesian approach
◃ But might introduce subjectivity
◃ Useful in clinical trials to reduce sample size
In this chapter we review different kinds of priors:
◃ Conjugate
◃ Noninformative
◃ Informative
Bayesian Biostatistics - Piracicaba 2014 247
5.2 The sequential use of Bayes theorem
• Posterior of the kth experiment = prior for the (k + 1)th experiment (sequential surgeries)
• In this way, the Bayesian approach can mimic our human learning process
• Meaning of ‘prior’ in prior distribution:
◦ Prior: prior knowledge should be specified independent of the collected data
◦ In RCTs: fix the prior distribution in advance
Bayesian Biostatistics - Piracicaba 2014 248
5.3 Conjugate prior distributions
In this section:
• Conjugate priors for univariate & multivariate data distributions
• Conditional conjugate and semi-conjugate distributions
• Hyperpriors
Bayesian Biostatistics - Piracicaba 2014 249
5.3.1 Conjugate priors for univariate data distributions
• In previous chapters, examples were given whereby the combination of prior with likelihood gives a posterior of the same type as the prior.
• This property is called conjugacy.
• For an important class of distributions (those that belong to the exponential family) there is a recipe to produce the conjugate prior
Bayesian Biostatistics - Piracicaba 2014 250
Table conjugate priors for univariate discrete data distributions
Exponential family member Parameter Conjugate prior
UNIVARIATE CASE
Discrete distributions
Bernoulli Bern(θ) θ Beta(α0, β0)
Binomial Bin(n,θ) θ Beta(α0, β0)
Negative Binomial NB(k,θ) θ Beta(α0, β0)
Poisson Poisson(λ) λ Gamma(α0, β0)
Bayesian Biostatistics - Piracicaba 2014 251
Table conjugate priors for univariate continuous data distributions
Exponential family member Parameter Conjugate prior
UNIVARIATE CASE
Continuous distributions
Normal-variance fixed N(µ, σ2), σ2 fixed µ N(µ0, σ02)
Normal-mean fixed N(µ, σ2), µ fixed σ2 IG(α0, β0) or Inv-χ2(ν0, τ02)
Normal∗ N(µ, σ2) µ, σ2 NIG(µ0, κ0, a0, b0) or N-Inv-χ2(µ0, κ0, ν0, τ02)
Exponential Exp(λ) λ Gamma(α0, β0)
Bayesian Biostatistics - Piracicaba 2014 252
Recipe to choose conjugate priors
p(y | θ) ∈ exponential family:
p(y | θ) = b(y) exp[c(θ)T t(y) + d(θ)]
◦ d(θ), b(y) = scalar functions, c(θ) = (c1(θ), . . . , cd(θ))T
◦ t(y) = d-dimensional sufficient statistic for θ (canonical parameter)
◦ Examples: binomial distribution, Poisson distribution, normal distribution, etc.
For a random sample y = {y1, . . . , yn} of i.i.d. elements:
p(y | θ) = b(y) exp[c(θ)T t(y) + n d(θ)]
◦ b(y) = Πn1 b(yi) & t(y) = Σn1 t(yi)
Bayesian Biostatistics - Piracicaba 2014 253
Recipe to choose conjugate priors
For the exponential family, the class of prior distributions ℑ closed under sampling:
p(θ | α, β) = k(α, β) exp[c(θ)T α + β d(θ)]
◦ α = (α1, . . . , αd)T and β hyperparameters
◦ Normalizing constant: k(α, β) = 1/∫ exp[c(θ)T α + β d(θ)] dθ
Proof of closure:
p(θ | y) ∝ p(y | θ) p(θ)
= exp[c(θ)T t(y) + n d(θ)] exp[c(θ)T α + β d(θ)]
= exp[c(θ)T α∗ + β∗ d(θ)],
with α∗ = α + t(y), β∗ = β + n
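For the Bernoulli likelihood, t(y) = Σ yi, so the update α∗ = α + t(y), β∗ = β + n reduces to the familiar beta-binomial rule; a tiny worked example (the prior Beta(2, 3) and the data are made up for illustration):

```python
# Bernoulli likelihood: sufficient statistic t(y) = sum(y), sample size n.
# Natural conjugate = Beta(a0, b0); closure under sampling gives
# Beta(a0 + sum(y), b0 + n - sum(y))
a0, b0 = 2.0, 3.0
y = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]      # toy data: 7 successes in 10 trials

n, t_y = len(y), sum(y)
a_post, b_post = a0 + t_y, b0 + (n - t_y)

# posterior mean is a weighted combination of prior mean and sample proportion
post_mean = a_post / (a_post + b_post)
```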
Bayesian Biostatistics - Piracicaba 2014 254
Recipe to choose conjugate priors
• Above rule gives the natural conjugate family
• Enlarge the class of priors ℑ by adding extra parameters: conjugate family of priors, again closed under sampling (O'Hagan & Forster, 2004)
• The conjugate prior has the same functional form as the likelihood, obtained byreplacing the data (t(y) and n) by parameters (α and β)
• A conjugate prior is model-dependent, in fact likelihood-dependent
Bayesian Biostatistics - Piracicaba 2014 255
Practical advantages when using conjugate priors
A (natural) conjugate prior distribution for the exponential family is convenient from several viewpoints:
• mathematical
• numerical
• interpretational (convenience prior):
◃ The likelihood of historical data can be easily turned into a conjugate prior.
The natural conjugate distribution = equivalent to a fictitious experiment
◃ For a natural conjugate prior, the posterior mean = weighted combination of the prior mean and the sample estimate
Bayesian Biostatistics - Piracicaba 2014 256
Example V.2: Dietary study – Normal versus t-prior
• Example II.2: IBBENS-2 normal likelihood was combined with a N(328, 100) (conjugate) prior distribution
• Replace the normal prior by a t30(328, 100)-prior
⇒ posterior practically unchanged, but 3 elegant features of normal prior are lost:
◃ Posterior cannot be determined analytically
◃ Posterior is not of the same class as the prior
◃ Posterior summary measures are not obvious functions of the prior and the sample summary measures
Bayesian Biostatistics - Piracicaba 2014 257
5.3.2 Conjugate prior for normal distribution – mean and variance unknown
N(µ, σ2) with µ and σ2 unknown ∈ two-parameter exponential family
• Conjugate = product of a normal prior with inverse gamma prior
• Notation: NIG(µ0, κ0, a0, b0)
Bayesian Biostatistics - Piracicaba 2014 258
Mean known and variance unknown
• For σ2 unknown and µ known :
Natural conjugate is inverse gamma (IG)
Equivalently: scaled inverse-χ2 distribution (Inv-χ2)
Bayesian Biostatistics - Piracicaba 2014 259
5.3.3 Multivariate data distributions
Priors for two popular multivariate models:
• Multinomial model
• Multivariate normal model
Bayesian Biostatistics - Piracicaba 2014 260
Table conjugate priors for multivariate data distributions
Exponential family member Parameter Conjugate prior
MULTIVARIATE CASE
Discrete distributions
Multinomial Mult(n,θ) θ Dirichlet(α0)
Continuous distributions
Normal-covariance fixed N(µ, Σ)-Σ fixed µ N(µ0, Σ0)
Normal-mean fixed N(µ, Σ)-µ fixed Σ IW(Λ0, ν0)
Normal∗ N(µ, Σ) µ, Σ NIW(µ0, κ0, ν0, Λ0)
Bayesian Biostatistics - Piracicaba 2014 261
Multinomial model
Mult(n, θ): p(y | θ) = (n!/(y1! y2! . . . yk!)) Πkj=1 θj^yj ∈ exponential family
Natural conjugate: Dirichlet(α0) distribution
p(θ | α0) = (Γ(Σkj=1 α0j) / Πkj=1 Γ(α0j)) Πkj=1 θj^(α0j − 1)
Properties:
◃ Posterior distribution = Dirichlet(α0 + y)
◃ Beta distribution = special case of a Dirichlet distribution with k = 2
◃ Marginal distributions of the Dirichlet distribution = beta distributions
◃ Dirichlet(1, 1, . . . , 1) = extension of the classical uniform prior Beta(1,1)
Bayesian Biostatistics - Piracicaba 2014 262
Multivariate normal model
The p-dimensional multivariate normal distribution:
p(y1, . . . , yn | µ, Σ) = (1/((2π)^(np/2) |Σ|^(n/2))) exp[−(1/2) Σni=1 (yi − µ)T Σ−1 (yi − µ)]
Conjugates:
◃ Σ known and µ unknown: N(µ0,Σ0) for µ
◃ Σ unknown and µ known: inverse Wishart distribution IW(Λ0, ν0) for Σ
◃ Σ unknown and µ unknown:
Normal-inverse Wishart distribution NIW(µ0, κ0, ν0,Λ0) for µ and Σ
Bayesian Biostatistics - Piracicaba 2014 263
5.3.4 Conditional conjugate and semi-conjugate priors
Example θ = (µ, σ2) for y ∼ N(µ, σ2)
• Conditional conjugate for µ: N(µ0, σ20)
• Conditional conjugate for σ2: IG(α, β)
• Semi-conjugate prior = product of conditional conjugates
• Conjugate priors often cannot be used in WinBUGS, but semi-conjugates are popular
Bayesian Biostatistics - Piracicaba 2014 264
5.3.5 Hyperpriors
Conjugate priors are too restrictive to represent prior knowledge
⇒ Give the parameters of the conjugate prior themselves a prior
Example:
• Prior: θ ∼ Beta(1, 1)
• Instead: θ ∼ Beta(α, β) and α ∼ Gamma(1, 3), β ∼ Gamma(2, 4)
◃ α, β = hyperparameters
◃ Gamma(1, 3)× Gamma(2, 4) = hyperprior/hierarchical prior
• Aim: more flexibility in prior distribution (and useful for Gibbs sampling)
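Sampling from this hierarchical prior proceeds stagewise, exactly as in the Method of Composition (reading Gamma(a, b) as shape-rate is an assumption here, since the slide does not fix the parametrization):

```python
import numpy as np

rng = np.random.default_rng(7)

M = 10_000
# hyperpriors: alpha ~ Gamma(1, 3), beta ~ Gamma(2, 4)
# (numpy uses shape-scale, so scale = 1/rate)
alpha = rng.gamma(shape=1.0, scale=1.0 / 3.0, size=M)
beta_ = rng.gamma(shape=2.0, scale=1.0 / 4.0, size=M)

# marginal (hierarchical) prior for theta: a mixture of Beta(alpha, beta)
# densities, more flexible than any single Beta
theta = rng.beta(alpha, beta_)
```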
Bayesian Biostatistics - Piracicaba 2014 265
5.4 Noninformative prior distributions
Bayesian Biostatistics - Piracicaba 2014 266
5.4.1 Introduction
Sometimes/often researchers cannot or do not wish to make use of prior knowledge ⇒ the prior should reflect this absence of knowledge
• A prior that expresses no knowledge was (initially) called a noninformative (NI) prior
• Central question: What prior reflects absence of knowledge?
◃ Flat prior?
◃ Huge amount of research to find best NI prior
◃ Other terms for NI: non-subjective, objective, default, reference, weak, diffuse,flat, conventional and minimally informative, etc
• Challenge: make sure that posterior is a proper distribution!
Bayesian Biostatistics - Piracicaba 2014 267
5.4.2 Expressing ignorance
• Equal prior probabilities = principle of insufficient reason, principle of indifference, Bayes-Laplace postulate
• Unfortunately . . . a flat prior cannot express ignorance
Bayesian Biostatistics - Piracicaba 2014 268
Ignorance at different scales:
[Figure: flat priors on σ and on σ2, each plotted against both the σ- and the σ2-scale]
Ignorance on σ-scale is different from ignorance on σ2-scale
Bayesian Biostatistics - Piracicaba 2014 269
Ignorance cannot be expressed mathematically
Bayesian Biostatistics - Piracicaba 2014 270
5.4.3 General principles to choose noninformative priors
A lot of research has been spent on the specification of NI priors; the most popular are Jeffreys priors:
• Result of a Bayesian analysis depends on choice of scale for flat prior:
p(θ) ∝ c or p(h(θ)) ≡ p(ψ) ∝ c
• To preserve conclusions when changing scale, Jeffreys suggested a rule to construct priors based on the invariance principle/rule (conclusions do not change when changing scale)
• Jeffreys rule suggests a way to choose the scale on which to take the flat prior
• A Jeffreys rule also exists for more than one parameter (Jeffreys multi-parameter rule)
Bayesian Biostatistics - Piracicaba 2014 271
Examples of Jeffreys priors
• Binomial model: p(θ) ∝ θ−1/2(1 − θ)−1/2 ⇔ flat prior on ψ(θ) = arcsin√θ
• Poisson model: p(λ) ∝ λ−1/2 ⇔ flat prior on ψ(λ) =√λ
• Normal model with σ fixed: p(µ) ∝ c
• Normal model with µ fixed: p(σ2) ∝ σ−2 ⇔ flat prior on log(σ)
• Normal model with µ and σ2 unknown: p(µ, σ2) ∝ σ−2, which reproduces some classical frequentist results!
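The binomial case is easy to check numerically: the square root of the Fisher information I(θ) = n/(θ(1 − θ)) is proportional to a Beta(1/2, 1/2) density (a small sketch; the grid is arbitrary):

```python
import numpy as np
from scipy.stats import beta

# Fisher information of Bin(n, theta) is I(theta) = n / (theta (1 - theta));
# Jeffreys prior ~ sqrt(I(theta)) ~ theta^{-1/2} (1-theta)^{-1/2}
theta = np.linspace(0.05, 0.95, 19)
jeffreys = np.sqrt(1.0 / (theta * (1.0 - theta)))   # unnormalized, n dropped

# compare against the normalized Beta(1/2, 1/2) density:
# a constant ratio (here the normalizing constant pi) confirms proportionality
ratio = jeffreys / beta.pdf(theta, 0.5, 0.5)
```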
Bayesian Biostatistics - Piracicaba 2014 272
5.4.4 Improper prior distributions
• Many NI priors are improper (= AUC is infinite)
• Improper prior is technically no problem when posterior is proper
• Example: Normal likelihood (µ unknown + σ2 known) + flat prior on µ
p(µ | y) = p(y | µ) p(µ) / ∫ p(y | µ) p(µ) dµ = p(y | µ) c / ∫ p(y | µ) c dµ
= (1/(√(2π) σ/√n)) exp[−(n/2) ((µ − ȳ)/σ)2]
• Complex models: difficult to know when an improper prior yields a proper posterior (e.g. variance of the level-2 observations in a Gaussian hierarchical model)
• Interpretation of improper priors?
Bayesian Biostatistics - Piracicaba 2014 273
5.4.5 Weak/vague priors
• For practical purposes it is sufficient that the prior is locally uniform, also called vague or weak
• Locally uniform: prior ≈ constant on the interval outside which the likelihood ≈ zero
• Examples for a N(µ, σ2) likelihood:
◦ µ: N(0, σ02) prior with σ0 large
◦ σ2: IG(ε, ε) prior with ε small ≈ Jeffreys prior
Bayesian Biostatistics - Piracicaba 2014 274
Locally uniform prior
[Figure: locally uniform prior, likelihood and posterior for µ]
Bayesian Biostatistics - Piracicaba 2014 275
Vague priors in software:
• WinBUGS allows only (proper) vague priors (Jeffreys priors are not allowed)
◦ mu ∼ dnorm(0.0,1.0E-6): normal prior with variance 106 (sd = 1000)
◦ tau2 ∼ dgamma(0.001,0.001): inverse gamma prior for the variance with shape = rate = 0.001
• SAS allows improper priors (allows Jeffreys priors)
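The precision parametrization is a common source of confusion; a small sketch of the conversions (plain arithmetic, no BUGS call):

```python
import numpy as np

# WinBUGS parametrizes the normal by the precision tau = 1/variance:
# mu ~ dnorm(0.0, 1.0E-6) is a N(0, 10^6) prior, i.e. sd = 1000
prec = 1.0e-6
prior_var = 1.0 / prec
prior_sd = np.sqrt(prior_var)

# tau ~ dgamma(0.001, 0.001) puts a Gamma(shape 0.001, rate 0.001) prior on
# the precision, equivalently an IG(0.001, 0.001) prior on sigma2 = 1/tau
a, b = 0.001, 0.001
tau_mean = a / b          # prior mean of the precision = 1
tau_var = a / b**2        # prior variance = 1000: very diffuse
```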
Bayesian Biostatistics - Piracicaba 2014 276
Density of log(σ) for σ2 (= 1/τ 2) ∼ IG(ε, ε)
[Figure: density of log(σ) for decreasing values of ε]
Bayesian Biostatistics - Piracicaba 2014 277
Density of log(σ) for σ2 ∼ IG(ε, ε)
[Figure: density of log(σ) for still smaller values of ε]
Bayesian Biostatistics - Piracicaba 2014 278
5.5 Informative prior distributions
Bayesian Biostatistics - Piracicaba 2014 279
5.5.1 Introduction
• In basically all research some prior knowledge is available
• In this section:
◃ Formalize the use of historical data as prior information using the power prior
◃ Review the use of clinical priors, which are prior distributions based on either historical data or on expert knowledge
◃ Priors that are based on formal rules expressing prior skepticism and optimism
• The set of priors representing prior knowledge = subjective or informative priors
• But first, two success stories of how the Bayesian approach helped to find:
◃ a crashed plane
◃ a lost fisherman on the Atlantic Ocean
Bayesian Biostatistics - Piracicaba 2014 280
Locating a lost plane
◃ Statisticians helped locate an Air France plane in 2011, which had been missing for two years, using Bayesian methods
◃ June 2009: Air France flight 447 went missing flying from Rio de Janeiro in Brazil to Paris, France
◃ Debris from the Airbus A330 was found floating on the surface of the Atlantic five days later
◃ After a number of days, the debris would have moved with the ocean current, hence finding the black box is not easy
◃ Existing software (used by the US Coast Guard) did not help
◃ Senior analyst at Metron, Colleen Keller, relied on Bayesian methods to locate the black box in 2011
Bayesian Biostatistics - Piracicaba 2014 281
Photos: members of the Brazilian Frigate Constituicao recovering debris in June 2009; debris from the Air France crash laid out for investigation in 2009; a 2009 infrared satellite image showing weather conditions off the Brazilian coast and the plane search area
Bayesian Biostatistics - Piracicaba 2014 282
Finding a lost fisherman on the Atlantic Ocean
New York Times (30 September 2014)
◃ ". . . if not for statisticians, a Long Island fisherman might have died in the Atlantic Ocean after falling off his boat early one morning last summer
◃ The man owes his life to a once obscure field known as Bayesian statistics - a set of mathematical rules for using new data to continuously update beliefs or existing knowledge
◃ It is proving especially useful in approaching complex problems, including searches like the one the Coast Guard used in 2013 to find the missing fisherman, John Aldridge
◃ But the current debate is about how scientists turn data into knowledge, evidence and predictions. Concern has been growing in recent years that some fields are not doing a very good job at this sort of inference. In 2012, for example, a team at the biotech company Amgen announced that they had analyzed 53 cancer studies and found they could not replicate 47 of them
Bayesian Biostatistics - Piracicaba 2014 283
◃ The Coast Guard has been using Bayesian analysis since the 1970s. The approach lends itself well to problems like searches, which involve a single incident and many different kinds of relevant data, said Lawrence Stone, a statistician for Metron, a scientific consulting firm in Reston, Va., that works with the Coast Guard
Photo caption: The Coast Guard, guided by the statistical method of Thomas Bayes, was able to find the missing fisherman John Aldridge.
Bayesian Biostatistics - Piracicaba 2014 284
5.5.2 Data-based prior distributions
• In previous chapters:
◦ Combined historical data with current data assuming identical conditions
◦ Discounted importance of prior data by increasing variance
• Generalized by power prior (Ibrahim and Chen):
◦ Likelihood of the historical data: L(θ | y0) based on y0 = {y01, . . . , y0n0}
◦ Prior for the historical data: p0(θ | c0)
◦ Power prior distribution:
p(θ | y0, a0) ∝ L(θ | y0)^a0 p0(θ | c0)
with 0 ≤ a0 ≤ 1 (a0 = 0: no accounting for the historical data; a0 = 1: fully accounting)
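For a conjugate case the power prior stays in closed form. A minimal sketch for a binomial proportion θ with a Beta(1, 1) initial prior (the historical counts 30 out of 50 are hypothetical, not data from the course):

```python
import numpy as np
from scipy import stats

def power_prior_beta(y0, n0, a0, a_init=1.0, b_init=1.0):
    """Power prior for a binomial proportion theta.

    The historical likelihood theta^y0 (1-theta)^(n0-y0), raised to the
    power a0 and multiplied by a Beta(a_init, b_init) initial prior,
    is again a Beta distribution (conjugacy is preserved).
    """
    return stats.beta(a_init + a0 * y0, b_init + a0 * (n0 - y0))

# a0 = 0: historical data ignored; a0 = 1: historical data fully used
for a0 in (0.0, 0.5, 1.0):
    prior = power_prior_beta(30, 50, a0)
    print(f"a0={a0}: mean={prior.mean():.3f}, sd={prior.std():.3f}")
```

Intermediate values of a0 shrink the effective historical sample size from n0 to a0·n0, which is one way to read the "discounting" interpretation above.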
Bayesian Biostatistics - Piracicaba 2014 285
5.5.3 Elicitation of prior knowledge
• Elicitation of prior knowledge: turn (qualitative) information from ‘experts’ into probabilistic language
• Challenges:
◃ Most experts have no statistical background
◃ What to ask to construct prior distribution:
◦ Prior mode, median, mean and prior 95% CI?
◦ Description of the prior: quartiles, mean, SD?
◃ Some probability statements are easier to elicit than others
Bayesian Biostatistics - Piracicaba 2014 286
Example V.5: Stroke study – Prior for 1st interim analysis from experts
Prior knowledge on θ (incidence of SICH), elicitation based on:
◦ Most likely value for θ and prior equal-tail 95% CI
◦ Prior belief pk on each of the K intervals Ik ≡ [θk−1, θk) covering [0,1]
[Figure: elicited prior density for θ]
Bayesian Biostatistics - Piracicaba 2014 287
Elicitation of prior knowledge – some remarks
• Community and consensus prior: obtained from a community of experts
• Difficulty in eliciting prior information on more than 1 parameter jointly
• Lack of Bayesian papers based on genuine prior information
Bayesian Biostatistics - Piracicaba 2014 288
Identifiability issues
• An overspecified model is non-identifiable
• When an unidentified parameter is given a NI prior, its posterior is also NI
• The Bayesian approach can make parameters estimable through informative priors, so that the model becomes identifiable
• In the next example, not all parameters can be estimated without extra (prior) information
Bayesian Biostatistics - Piracicaba 2014 289
Example V.6: Cysticercosis study – Estimate prevalence without gold standard
Experiment:
◃ 868 pigs tested in Zambia with Ag-ELISA diagnostic test
◃ 496 pigs showed a positive test
◃ Aim: estimate the prevalence π of cysticercosis in Zambia among pigs
If estimates of the sensitivity α and specificity β are available, then:
π = (p+ + β − 1) / (α + β − 1)
◦ p+ = n+/n = proportion of subjects with a positive test
◦ α and β = estimated sensitivity and specificity
Bayesian Biostatistics - Piracicaba 2014 290
Data:
Table of results:
Test     Disease + (True)    Disease − (True)    Observed
+        πα                  (1− π)(1− β)        n+ = 496
−        π(1− α)             (1− π)β             n− = 372
Total    π                   1− π                n = 868
◃ Only collapsed table is available
◃ Since α and β vary geographically, expert knowledge is needed
Bayesian Biostatistics - Piracicaba 2014 291
Prior and posterior:
• Prior distributions on π (p(π)), α (p(α)) and β (p(β)) are needed
• Posterior distribution:
p(π, α, β | n+, n−) ∝ (n choose n+) [πα + (1− π)(1− β)]^{n+} [π(1− α) + (1− π)β]^{n−} p(π)p(α)p(β)
• WinBUGS was used
Bayesian Biostatistics - Piracicaba 2014 292
Posterior of π:
(a) Uniform priors for π, α and β (no prior information)
(b) Beta(21,12) prior for α and Beta(32,4) prior for β (historical data)
[Figure: posterior densities p(π|y) for case (a) (N = 10000, bandwidth = 0.04473) and case (b) (N = 10000, bandwidth = 0.01542)]
Bayesian Biostatistics - Piracicaba 2014 293
5.5.4 Archetypal prior distributions
• Use of prior information in Phase III RCTs is problematic, except for medical device trials (FDA guidance document)
⇒ Pleas for objective priors in RCTs
• There is a role of subjective priors for interim analyses:
◃ Skeptical prior
◃ Enthusiastic prior
Bayesian Biostatistics - Piracicaba 2014 294
Example V.7: Skeptical priors in a phase III RCT
Tan et al. (2003):
◃ Phase III RCT for treating patients with hepatocellular carcinoma
◃ Standard treatment: surgical resection
◃ Experimental treatment: surgery + adjuvant radioactive iodine (adjuvant therapy)
◃ Planning: recruit 120 patients
Frequentist interim analyses for efficacy were planned:
◃ First interim analysis (30 patients): experimental treatment better (P = 0.01 < 0.029 = P-value threshold of the stopping rule)
◃ But, scientific community was skeptical about adjuvant therapy
⇒ New multicentric trial (300 patients) was set up
Bayesian Biostatistics - Piracicaba 2014 295
Prior to the start of the subsequent trial:
◃ Pretrial opinions of the 14 clinical investigators were elicited
◃ The prior distribution of each investigator was constructed by eliciting the prior belief on the treatment effect (adjuvant versus standard) on a grid of intervals
◃ Average of all priors = community prior
◃ Average of the priors of the 5 most skeptical investigators = skeptical prior
To exemplify the use of the skeptical prior:
◃ Combine skeptical prior with interim analysis results of previous trial
⇒ 1-sided contour probability (in 1st interim analysis) = 0.49
⇒ The first trial would not have been stopped for efficacy
Bayesian Biostatistics - Piracicaba 2014 296
Questionnaire:
Bayesian Biostatistics - Piracicaba 2014 297
Prior of investigators:
Bayesian Biostatistics - Piracicaba 2014 298
Skeptical priors:
Bayesian Biostatistics - Piracicaba 2014 299
A formal skeptical/enthusiastic prior
Formal subjective priors (Spiegelhalter et al., 1994) in normal case:
• Useful in the context of monitoring clinical trials in a Bayesian manner
• θ = true effect of treatment (A versus B)
• Skeptical normal prior: choose mean and variance of p(θ) to reflect skepticism
• Enthusiastic normal prior: choose mean and variance of p(θ) to reflect enthusiasm
• See figure next page & book
Bayesian Biostatistics - Piracicaba 2014 300
Example V.8+9
[Figure: skeptical and enthusiastic normal priors for θ, each with 5% tail probability beyond θa]
Bayesian Biostatistics - Piracicaba 2014 301
5.6 Prior distributions for regression models
Bayesian Biostatistics - Piracicaba 2014 302
5.6.1 Normal linear regression
Normal linear regression model:
yi = xi^T β + εi (i = 1, . . . , n), or in matrix notation: y = Xβ + ε
Bayesian Biostatistics - Piracicaba 2014 303
Priors
• Non-informative priors:
◃ Popular NI prior: p(β, σ2) ∝ σ−2 (Jeffreys multi-parameter rule)
◃ WinBUGS: product of independent N(0, σ0^2) priors (σ0^2 large) + IG(ε, ε) prior (ε small)
• Conjugate priors:
◃ Conjugate NIG prior = N(β0, σ2 Σ0) × IG(a0, b0) (or Inv-χ2(ν0, τ0^2))
• Historical/expert priors:
◃ Prior knowledge on regression coefficients must be given jointly
◃ Elicitation process via distributions at covariate values
◃ Most popular: express prior based on historical data
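As an illustration of the popular NI prior p(β, σ2) ∝ σ−2, the posterior can be sampled in closed form (Method of Composition). A sketch on simulated data, not a data set from the course:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (stand-in for a real data set)
n = 200
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(0, 1.5, n)

# Under p(beta, sigma2) ∝ sigma^-2 the posterior factorizes:
# sigma2 | y ~ Inv-chi2(n-p, s2), beta | sigma2, y ~ N(beta_hat, sigma2 (X'X)^-1)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
p = X.shape[1]
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)

draws = 5000
sigma2 = (n - p) * s2 / rng.chisquare(n - p, draws)   # scaled Inv-chi2 draws
XtX_inv = np.linalg.inv(X.T @ X)
betas = np.array([rng.multivariate_normal(beta_hat, s2_draw * XtX_inv)
                  for s2_draw in sigma2])

print(betas.mean(axis=0).round(3), round(sigma2.mean(), 3))
```

Posterior means recover the generating values (intercept 1.0, slope 0.5, σ2 = 2.25) up to Monte Carlo error.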
Bayesian Biostatistics - Piracicaba 2014 304
5.6.2 Generalized linear models
• In practice the choice of NI priors is much the same as with linear models
• But a too large prior variance may not be best for sampling, e.g. in a logistic regression model
• In SAS: Jeffreys (improper) prior can be chosen
• Conjugate priors are based on fictive historical data
◃ Data augmentation priors & conditional mean priors
◃ Not implemented in classical software, but fictive data can be explicitly added and then standard software can be used
Bayesian Biostatistics - Piracicaba 2014 305
5.7 Modeling priors
Modeling prior: adapt characteristics of the statistical model
• Multicollinearity: appropriate prior avoids inflation of β
• Numerical (separation) problems: appropriate prior avoids inflation of β
• Constraints on parameters: constraint can be put in prior
• Variable selection: prior can direct the variable search
Bayesian Biostatistics - Piracicaba 2014 306
Multicollinearity
Multicollinearity: |X^T X| ≈ 0 ⇒ regression coefficients and standard errors inflated
Ridge regression:
◃ Minimize: (y∗ − Xβ)^T (y∗ − Xβ) + λ β^T β with λ ≥ 0 & y∗ = y − ȳ 1n
◃ Estimate: βR(λ) = (X^T X + λI)^{-1} X^T y∗
= Posterior mode of a Bayesian normal linear regression analysis with:
◃ Normal ridge prior N(0, τ2 I) for β
◃ τ2 = σ2/λ with σ and λ fixed
• Can be easily extended to BGLIM
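The equivalence is easy to verify numerically: the ridge estimate equals the posterior mode (= mean) of the normal posterior under a N(0, (σ2/λ)I) prior with σ fixed. A sketch with simulated, deliberately collinear covariates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 3, 5.0

X = rng.normal(size=(n, p))
# Make columns nearly collinear to mimic multicollinearity
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=n)
beta_true = np.array([1.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)
y_star = y - y.mean()                     # centered response

# Ridge estimate: (X'X + lam I)^-1 X' y*
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y_star)

# Same point as the mode (= mean) of the normal posterior obtained
# from a N(0, (sigma2/lam) I) prior, with sigma fixed at 1
sigma2 = 1.0
post_prec = X.T @ X / sigma2 + (lam / sigma2) * np.eye(p)
post_mean = np.linalg.solve(post_prec, X.T @ y_star / sigma2)

assert np.allclose(beta_ridge, post_mean)
print(beta_ridge.round(3))
```

The identity holds for any fixed σ because σ2 cancels from the posterior mean; only the ratio λ = σ2/τ2 matters.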
Bayesian Biostatistics - Piracicaba 2014 307
Numerical (separation) problems
Separation problems in binary regression models: complete separation and quasi-complete separation
Solution: Take weakly informative prior on regression coefficients
[Figure: quasi-complete separation in the (x1, x2)-plane; fits under N(0,100) and Cauchy (Gelman) priors]
Bayesian Biostatistics - Piracicaba 2014 308
Constraints on parameters
Signal-Tandmobiel® study:
• θk = probability of CE among Flemish children in school year k (k = 1, . . . , 6)
• Constraint on parameters: θ1 ≤ θ2 ≤ · · · ≤ θ6
• Solutions:
◃ Prior on θ = (θ1, . . . , θ6)T that maps all θs that violate the constraint to zero
◃ Neglect the values that are not allowed in the posterior (useful when sampling)
Bayesian Biostatistics - Piracicaba 2014 309
Other modeling priors
• LASSO prior (see Bayesian variable selection)
• . . .
Bayesian Biostatistics - Piracicaba 2014 310
5.8 Other regression models
• A great variety of models
• Not considered here: conditional logistic regression model, Cox proportional hazards model, generalized linear mixed effects models
• . . .
Bayesian Biostatistics - Piracicaba 2014 311
Take home messages
• Often prior is dominated by the likelihood (data)
• Prior in RCTs: prior to the trial
• Conjugate priors: convenient mathematically, computationally and from an interpretational viewpoint
• Conditional conjugate priors: heavily used in Gibbs sampling
• Hyperpriors: extend the range of conjugate priors, also important in Gibbs sampling
Bayesian Biostatistics - Piracicaba 2014 312
• Noninformative priors:
◃ do not exist, strictly speaking
◃ in practice vague priors (e.g. locally uniform) are ok
◃ important class of NI priors: Jeffreys priors
◃ be careful with improper priors, they might imply improper posterior
• Informative priors:
◃ can be based on historical data & expert knowledge (but only useful when reflecting the viewpoint of a community of experts)
◃ are useful in clinical trials to reduce sample size
Bayesian Biostatistics - Piracicaba 2014 313
Chapter 6
Markov chain Monte Carlo sampling
Aims:
◃ Introduce the sampling approach(es) that revolutionized the Bayesian approach
Bayesian Biostatistics - Piracicaba 2014 314
6.1 Introduction
◃ Solving the posterior distribution analytically is often not feasible due to the difficulty in determining the integration constant
◃ Computing the integral using numerical integration methods is a practical alternative if only a few parameters are involved
⇒ New computational approach is needed
◃ Sampling is the way to go!
◃ With Markov chain Monte Carlo (MCMC) methods:
1. Gibbs sampler
2. Metropolis-(Hastings) algorithm
MCMC approaches have revolutionized Bayesian methods!
Bayesian Biostatistics - Piracicaba 2014 315
Intermezzo: Joint, marginal and conditional probability
Two (discrete) random variables X and Y
• Joint probability of X and Y: probability that X=x and Y=y happen together
• Marginal probability of X: probability that X=x happens
• Marginal probability of Y: probability that Y=y happens
• Conditional probability of X given Y=y: probability that X=x happens if Y=y
• Conditional probability of Y given X=x: probability that Y=y happens if X=x
Bayesian Biostatistics - Piracicaba 2014 316
Intermezzo: Joint, marginal and conditional probability
IBBENS study: 563 (556) bank employees in 8 subsidiaries of a Belgian bank participated in a dietary study
[Figure: scatter plot of WEIGHT versus LENGTH]
Bayesian Biostatistics - Piracicaba 2014 317
Intermezzo: Joint, marginal and conditional probability
IBBENS study: frequency table
Length
Weight −150 150− 160 160− 170 170− 180 180− 190 190− 200 200− Total
−50 2 12 4 0 0 0 0 18
50− 60 1 25 50 14 0 0 0 90
60− 70 0 12 54 52 13 1 0 132
70− 80 0 5 42 72 34 0 0 153
80− 90 0 0 12 58 32 2 1 105
90− 100 0 0 0 20 18 3 0 41
100− 110 0 0 1 2 7 1 0 11
110− 120 0 0 0 2 2 1 0 5
120− 0 0 0 0 1 0 0 1
Total 3 54 163 220 107 8 1 556
Bayesian Biostatistics - Piracicaba 2014 319
Intermezzo: Joint, marginal and conditional probability
IBBENS study: joint probability
Length
Weight −150 150− 160 160− 170 170− 180 180− 190 190− 200 200− total
−50 2/556 12/556 4/556 0/556 0/556 0/556 0/556 18/556
50− 60 1/556 25/556 50/556 14/556 0/556 0/556 0/556 90/556
60− 70 0/556 12/556 54/556 52/556 13/556 1/556 0/556 132/556
70− 80 0/556 5/556 42/556 72/556 34/556 0/556 0/556 153/556
80− 90 0/556 0/556 12/556 58/556 32/556 2/556 1/556 105/556
90− 100 0/556 0/556 0/556 20/556 18/556 3/556 0/556 41/556
100− 110 0/556 0/556 1/556 2/556 7/556 1/556 0/556 11/556
110− 120 0/556 0/556 0/556 2/556 2/556 1/556 0/556 5/556
120− 0/556 0/556 0/556 0/556 1/556 0/556 0/556 1/556
Total 3/556 54/556 163/556 220/556 107/556 8/556 1/556 1
Bayesian Biostatistics - Piracicaba 2014 320
Intermezzo: Joint, marginal and conditional probability
IBBENS study: marginal probabilities
Length
Weight −150 150− 160 160− 170 170− 180 180− 190 190− 200 200− total
−50 2/556 12/556 4/556 0/556 0/556 0/556 0/556 18/556
50− 60 1/556 25/556 50/556 14/556 0/556 0/556 0/556 90/556
60− 70 0/556 12/556 54/556 52/556 13/556 1/556 0/556 132/556
70− 80 0/556 5/556 42/556 72/556 34/556 0/556 0/556 153/556
80− 90 0/556 0/556 12/556 58/556 32/556 2/556 1/556 105/556
90− 100 0/556 0/556 0/556 20/556 18/556 3/556 0/556 41/556
100− 110 0/556 0/556 1/556 2/556 7/556 1/556 0/556 11/556
110− 120 0/556 0/556 0/556 2/556 2/556 1/556 0/556 5/556
120− 0/556 0/556 0/556 0/556 1/556 0/556 0/556 1/556
Total 3/556 54/556 163/556 220/556 107/556 8/556 1/556 1
Bayesian Biostatistics - Piracicaba 2014 321
Intermezzo: Joint, marginal and conditional probability
IBBENS study: conditional probabilities
◦ Conditional probabilities of weight given length 150−160: −50: 12/54, 50−60: 25/54, 60−70: 12/54, 70−80: 5/54, other rows: 0/54 (total 54/54)
◦ Conditional probabilities of length given weight 50−60: −150: 1/90, 150−160: 25/90, 160−170: 50/90, 170−180: 14/90, other columns: 0/90 (total 90/90)
Bayesian Biostatistics - Piracicaba 2014 322
Intermezzo: Joint, marginal and conditional density
Two (continuous) random variables X and Y
• Joint density of X and Y: density f (x, y)
• Marginal density of X: density f (x)
• Marginal density of Y: density f (y)
• Conditional density of X given Y=y: density f (x|y)
• Conditional density of Y given X=x: density f (y|x)
Bayesian Biostatistics - Piracicaba 2014 323
Intermezzo: Joint, marginal and conditional density
IBBENS study: joint density
Bayesian Biostatistics - Piracicaba 2014 324
Intermezzo: Joint, marginal and conditional density
IBBENS study: marginal densities
Bayesian Biostatistics - Piracicaba 2014 325
Intermezzo: Joint, marginal and conditional density
IBBENS study: conditional densities
Conditional density of
LENGTH GIVEN WEIGHT
Conditional density of
WEIGHT GIVEN LENGTH
Bayesian Biostatistics - Piracicaba 2014 326
6.2 The Gibbs sampler
• Gibbs sampler: introduced by Geman and Geman (1984) in the context of image processing for the estimation of the parameters of the Gibbs distribution
• Gelfand and Smith (1990) introduced Gibbs sampling to tackle complex estimation problems in a Bayesian manner
Bayesian Biostatistics - Piracicaba 2014 327
6.2.1 The bivariate Gibbs sampler
Method of Composition:
• p(θ1, θ2 | y) is completely determined by:
◃ marginal p(θ2 | y)
◃ conditional p(θ1 | θ2,y)
• Split-up yields a simple way to sample from joint distribution
Bayesian Biostatistics - Piracicaba 2014 328
Gibbs sampling:
• p(θ1, θ2 | y) is completely determined by:
◃ conditional p(θ2 | θ1,y)
◃ conditional p(θ1 | θ2,y)
• Property yields another simple way to sample from joint distribution:
◃ Take starting values θ1^0 and θ2^0 (only one is needed)
◃ Given θ1^k and θ2^k at iteration k, generate the (k + 1)-th value according to the iterative scheme:
1. Sample θ1^(k+1) from p(θ1 | θ2^k, y)
2. Sample θ2^(k+1) from p(θ2 | θ1^(k+1), y)
Bayesian Biostatistics - Piracicaba 2014 329
Result of Gibbs sampling:
• Chain of vectors: θ^k = (θ1^k, θ2^k)^T, k = 1, 2, . . .
◦ Consists of dependent elements
◦ Markov property: p(θ^(k+1) | θ^k, θ^(k−1), . . . , y) = p(θ^(k+1) | θ^k, y)
• Chain depends on the starting value ⇒ initial portion (burn-in part) must be discarded
• Under mild conditions: sample from the posterior distribution = target distribution
⇒ From k0 on: summary measures calculated from the chain consistently estimate the true posterior measures
Gibbs sampler is called a Markov chain Monte Carlo method
Bayesian Biostatistics - Piracicaba 2014 330
Example VI.1: SAP study – Gibbs sampling the posterior with NI priors
• Example IV.5: sampling from the posterior distribution of the normal likelihood based on 250 alp measurements of ‘healthy’ patients with a NI prior for both parameters
• Now using the Gibbs sampler based on y = 100/√alp
• Determine two conditional distributions:
1. p(µ | σ2, y): N(µ | ȳ, σ2/n)
2. p(σ2 | µ, y): Inv-χ2(σ2 | n, s2_µ) with s2_µ = (1/n) Σ_{i=1}^n (yi − µ)^2
• Iterative procedure: at iteration (k + 1)
1. Sample µ^(k+1) from N(ȳ, (σ2)^k/n)
2. Sample (σ2)^(k+1) from Inv-χ2(n, s2_{µ^(k+1)})
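The two conditional draws translate directly into code. A sketch of the scheme; simulated values stand in for the 250 transformed alp measurements, which are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated stand-in for the 250 transformed alp measurements
y = rng.normal(7.1, 1.4, 250)
n, ybar = len(y), y.mean()

iters, burnin = 1500, 500
mu, sigma2 = 5.0, 1.0                        # arbitrary starting values
chain = np.empty((iters, 2))
for k in range(iters):
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))    # p(mu | sigma2, y)
    s2_mu = np.mean((y - mu) ** 2)
    sigma2 = n * s2_mu / rng.chisquare(n)         # scaled Inv-chi2(n, s2_mu)
    chain[k] = mu, sigma2

post = chain[burnin:]
print(post.mean(axis=0).round(2))
```

A scaled Inv-χ2(ν, s2) draw is obtained as ν·s2 divided by a χ2(ν) draw, which is what the `sigma2` line does.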
Bayesian Biostatistics - Piracicaba 2014 331
Gibbs sampling:
[Figure: four snapshots of the Gibbs steps in the (µ, σ2)-plane]
◦ Sampling from conditional density of µ given σ2
◦ Sampling from conditional density of σ2 given µ
Bayesian Biostatistics - Piracicaba 2014 332
Gibbs sampling path and sample from joint posterior:
[Figure: (a) Gibbs sampling path and (b) sample from the joint posterior in the (µ, σ2)-plane]
◦ Zigzag pattern in the (µ, σ2)-plane
◦ 1 complete step = 2 substeps (blue=genuine element)
◦ Burn-in = 500, total chain = 1,500
Bayesian Biostatistics - Piracicaba 2014 333
Posterior distributions:
[Figure: marginal posterior densities of (a) µ and (b) σ2]
Solid line = true posterior distribution
Bayesian Biostatistics - Piracicaba 2014 334
Example VI.2: Sampling from a discrete × continuous distribution
• Joint distribution: f(x, y) ∝ (n choose x) y^{x+α−1} (1− y)^{n−x+β−1}
◦ x a discrete random variable taking values in {0, 1, . . . , n}
◦ y a continuous random variable on the unit interval
◦ α, β > 0 parameters
• Question: f (x)?
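Both full conditionals are standard here: x | y ~ Binomial(n, y) and y | x ~ Beta(x + α, n − x + β), so the Gibbs sampler is two lines per iteration. A sketch (the values n = 30, α = 2, β = 4 are arbitrary choices, not from the course):

```python
import numpy as np

rng = np.random.default_rng(7)
n, alpha, beta = 30, 2.0, 4.0

iters, burnin = 20000, 500
x, y = 0, 0.5
xs = np.empty(iters, dtype=int)
for k in range(iters):
    x = rng.binomial(n, y)                   # p(x | y) = Binomial(n, y)
    y = rng.beta(x + alpha, n - x + beta)    # p(y | x) = Beta(x+a, n-x+b)
    xs[k] = x

# The marginal f(x) is beta-binomial, with mean n*alpha/(alpha+beta) = 10
print(round(xs[burnin:].mean(), 2))
```

The retained x values estimate the marginal f(x) without ever writing it down analytically, which is the point of the example.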
Bayesian Biostatistics - Piracicaba 2014 335
Marginal distribution:
[Figure: histogram of the sampled x values with the true marginal distribution]
◦ Solid line = true marginal distribution
◦ Burn-in = 500, total chain = 1,500
Bayesian Biostatistics - Piracicaba 2014 336
Example VI.3: SAP study – Gibbs sampling the posterior with I priors
• Example VI.1: now with independent informative priors (semi-conjugate prior)
◦ µ ∼ N(µ0, σ0^2)
◦ σ2 ∼ Inv-χ2(ν0, τ0^2)
• Posterior:
p(µ, σ2 | y) ∝ (1/σ0) e^{−(µ−µ0)^2/(2σ0^2)} × (σ2)^{−(ν0/2+1)} e^{−ν0 τ0^2/(2σ2)} × (1/σ^n) Π_{i=1}^n e^{−(yi−µ)^2/(2σ2)}
∝ Π_{i=1}^n e^{−(yi−µ)^2/(2σ2)} e^{−(µ−µ0)^2/(2σ0^2)} (σ2)^{−((n+ν0)/2+1)} e^{−ν0 τ0^2/(2σ2)}
Bayesian Biostatistics - Piracicaba 2014 337
Conditional distributions:
• Determine two conditional distributions:
1. p(µ | σ2, y) ∝ Π_{i=1}^n e^{−(yi−µ)^2/(2σ2)} e^{−(µ−µ0)^2/(2σ0^2)}, a normal density N(µ_k, σ_k^2) with
σ_k^2 = 1/(n/σ2 + 1/σ0^2) and µ_k = σ_k^2 (n ȳ/σ2 + µ0/σ0^2)
2. p(σ2 | µ, y): Inv-χ2(ν0 + n, [Σ_{i=1}^n (yi − µ)^2 + ν0 τ0^2]/(ν0 + n))
• Iterative procedure: at iteration (k + 1)
1. Sample µ^(k+1) from N(µ_k, σ_k^2), with µ_k and σ_k^2 evaluated at σ2 = (σ2)^k
2. Sample (σ2)^(k+1) from Inv-χ2(ν0 + n, [Σ_{i=1}^n (yi − µ^(k+1))^2 + ν0 τ0^2]/(ν0 + n))
Bayesian Biostatistics - Piracicaba 2014 338
Trace plots:
[Figure: trace plots of (a) µ and (b) σ2]
Bayesian Biostatistics - Piracicaba 2014 339
6.2.2 The general Gibbs sampler
Starting position: θ^0 = (θ1^0, . . . , θd^0)^T
Multivariate version of the Gibbs sampler — iteration (k + 1):
1. Sample θ1^(k+1) from p(θ1 | θ2^k, . . . , θ_{d−1}^k, θd^k, y)
2. Sample θ2^(k+1) from p(θ2 | θ1^(k+1), θ3^k, . . . , θd^k, y)
...
d. Sample θd^(k+1) from p(θd | θ1^(k+1), . . . , θ_{d−1}^(k+1), y)
Bayesian Biostatistics - Piracicaba 2014 340
• Full conditional distributions: p(θj | θ1, . . . , θ_{j−1}, θ_{j+1}, . . . , θd, y)
• Also called: full conditionals
• Under mild regularity conditions: θ^k, θ^(k+1), . . . ultimately are observations from the posterior distribution
With the help of advanced sampling algorithms (AR, ARS, ARMS, etc.), sampling the full conditionals is done based on the prior × likelihood
Bayesian Biostatistics - Piracicaba 2014 341
Example VI.4: British coal mining disasters data
◃ British coal mining disasters data set: # severe accidents in British coal mines from 1851 to 1962
◃ Decrease in frequency of disasters from year 40 (+ 1850) onwards?
[Figure: # disasters per year (1850 + year)]
Bayesian Biostatistics - Piracicaba 2014 342
Statistical model:
• Likelihood: Poisson process with a change point at k
◃ yi ∼ Poisson(θ) for i = 1, . . . , k
◃ yi ∼ Poisson(λ) for i = k + 1, . . . , n (n=112)
• Priors
◃ θ: Gamma(a1, b1), (a1 constant, b1 parameter)
◃ λ: Gamma(a2, b2), (a2 constant, b2 parameter)
◃ k: p(k) = 1/n
◃ b1: Gamma(c1, d1), (c1, d1 constants)
◃ b2: Gamma(c2, d2), (c2, d2 constants)
Bayesian Biostatistics - Piracicaba 2014 343
Full conditionals:
p(θ | y, λ, b1, b2, k) = Gamma(a1 + Σ_{i=1}^k yi, k + b1)
p(λ | y, θ, b1, b2, k) = Gamma(a2 + Σ_{i=k+1}^n yi, n − k + b2)
p(b1 | y, θ, λ, b2, k) = Gamma(a1 + c1, θ + d1)
p(b2 | y, θ, λ, b1, k) = Gamma(a2 + c2, λ + d2)
p(k | y, θ, λ, b1, b2) = π(y | k, θ, λ) / Σ_{j=1}^n π(y | j, θ, λ)
with π(y | k, θ, λ) = exp[k(λ − θ)] (θ/λ)^{Σ_{i=1}^k yi}
◦ a1 = a2 = 0.5, c1 = c2 = 0, d1 = d2 = 1
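The five full conditionals map one-to-one onto a Gibbs loop. A sketch run on simulated counts (rate 3 before an assumed change point at year 40, rate 1 after), since the coal mining counts themselves are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated disaster counts: rate 3 before the change point, 1 after
n, k_true = 112, 40
y = np.concatenate([rng.poisson(3.0, k_true), rng.poisson(1.0, n - k_true)])

a1 = a2 = 0.5
c1 = c2 = 0.0
d1 = d2 = 1.0
theta, lam, b1, b2, k = 1.0, 1.0, 1.0, 1.0, n // 2
cum = np.concatenate([[0.0], np.cumsum(y)])   # cum[k] = y_1 + ... + y_k

iters, burnin = 3000, 500
ks = np.empty(iters, dtype=int)
for it in range(iters):
    theta = rng.gamma(a1 + cum[k], 1.0 / (k + b1))          # rate k + b1
    lam = rng.gamma(a2 + cum[n] - cum[k], 1.0 / (n - k + b2))
    b1 = rng.gamma(a1 + c1, 1.0 / (theta + d1))
    b2 = rng.gamma(a2 + c2, 1.0 / (lam + d2))
    # Discrete full conditional of k, computed on the log scale
    j = np.arange(1, n + 1)
    logw = j * (lam - theta) + cum[1:n + 1] * np.log(theta / lam)
    w = np.exp(logw - logw.max())
    k = rng.choice(j, p=w / w.sum())
    ks[it] = k

print(np.bincount(ks[burnin:]).argmax())   # posterior mode of k
```

NumPy's gamma sampler is parameterized by shape and scale, hence the `1.0 / rate` arguments; the log-scale trick for k avoids overflow in exp[k(λ − θ)].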
Bayesian Biostatistics - Piracicaba 2014 344
Posterior distributions:
[Figure: marginal posterior distributions of θ, λ (left) and k (right)]
◦ Posterior mode of k: 1891
◦ Posterior mean of θ/λ = 3.42 with 95% CI = [2.48, 4.59]
Bayesian Biostatistics - Piracicaba 2014 345
Note:
• In most published analyses of this data set, b1 and b2 are given inverse gamma priors. The full conditionals are then also inverse gamma
• The results are almost the same ⇒ our analysis is a sensitivity analysis of the analyses seen in the literature
• Despite the classical full conditionals, the WinBUGS/OpenBUGS samplers for θ and λ are not standard gamma samplers but rather a slice sampler. See Exercise 8.10.
Bayesian Biostatistics - Piracicaba 2014 346
Example VI.5: Osteoporosis study – Using the Gibbs sampler
Bayesian linear regression model with NI priors:
◃ Regression model: tbbmc_i = β0 + β1 bmi_i + εi (i = 1, . . . , n = 234)
◃ Priors: p(β0, β1, σ2) ∝ σ−2
◃ Notation: y = (tbbmc1, . . . , tbbmc234)^T, x = (bmi1, . . . , bmi234)^T
Full conditionals:
p(σ2 | β0, β1, y) = Inv-χ2(n, s2_β)
p(β0 | σ2, β1, y) = N(r_{β1}, σ2/n)
p(β1 | σ2, β0, y) = N(r_{β0}, σ2/x^T x)
with
s2_β = (1/n) Σ (yi − β0 − β1 xi)^2
r_{β1} = (1/n) Σ (yi − β1 xi)
r_{β0} = Σ (yi − β0) xi / x^T x
Bayesian Biostatistics - Piracicaba 2014 347
Comparison with Method of Composition:
Parameter Method of Composition
2.5% 25% 50% 75% 97.5% Mean SD
β0 0.57 0.74 0.81 0.89 1.05 0.81 0.12
β1 0.032 0.038 0.040 0.043 0.049 0.040 0.004
σ2 0.069 0.078 0.083 0.088 0.100 0.083 0.008
Gibbs sampler
2.5% 25% 50% 75% 97.5% Mean SD
β0 0.67 0.77 0.84 0.91 1.10 0.77 0.11
β1 0.030 0.036 0.040 0.042 0.046 0.039 0.0041
σ2 0.069 0.077 0.083 0.088 0.099 0.083 0.0077
◦ Method of Composition = 1,000 independently sampled values
◦ Gibbs sampler: burn-in = 500, total chain = 1,500
Bayesian Biostatistics - Piracicaba 2014 348
Index plot from Method of Composition:
[Figure: index plots of (a) β1 and (b) σ2 from the Method of Composition]
Bayesian Biostatistics - Piracicaba 2014 349
Trace plot from Gibbs sampler:
[Figure: trace plots of (a) β1 and (b) σ2 from the Gibbs sampler]
Bayesian Biostatistics - Piracicaba 2014 350
Trace versus index plot:
Comparison of index plot with trace plot shows:
• σ2: index plot and trace plot similar ⇒ (almost) independent sampling
• β1: trace plot shows slow mixing ⇒ quite dependent sampling
⇒ Method of Composition and Gibbs sampling: similar posterior measures of σ2
⇒ Method of Composition and Gibbs sampling: less similar posterior measures of β1
Bayesian Biostatistics - Piracicaba 2014 351
Autocorrelation:
◃ Autocorrelation of lag 1: correlation of β1^k with β1^(k−1) (k = 1, . . .)
◃ Autocorrelation of lag 2: correlation of β1^k with β1^(k−2) (k = 1, . . .)
. . .
◃ Autocorrelation of lag m: correlation of β1^k with β1^(k−m) (k = 1, . . .)
High autocorrelation:
⇒ burn-in part is larger ⇒ takes longer to forget initial positions
⇒ remaining part needs to be longer to obtain stable posterior measures
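The lag-m autocorrelation can be estimated directly from the chain. A sketch, using an AR(1) series as a stand-in for a slowly mixing sampler (its theoretical lag-m autocorrelation is 0.8^m):

```python
import numpy as np

def autocorr(chain, max_lag=10):
    """Sample autocorrelations of a chain at lags 1..max_lag."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    var = x @ x / len(x)
    return np.array([x[m:] @ x[:-m] / (len(x) * var)
                     for m in range(1, max_lag + 1)])

rng = np.random.default_rng(0)
# AR(1) chain mimicking the dependence of a slowly mixing sampler
rho, n = 0.8, 50000
eps = rng.normal(size=n)
z = np.empty(n)
z[0] = eps[0]
for t in range(1, n):
    z[t] = rho * z[t - 1] + eps[t]

print(autocorr(z, 3).round(2))   # theoretical values: 0.8, 0.64, 0.51
```

High autocorrelations like these are exactly the "slow mixing" pattern seen in the β1 trace plot: more iterations are needed for the same posterior precision.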
Bayesian Biostatistics - Piracicaba 2014 352
6.2.3 Remarks∗
• Full conditionals determine joint distribution
• Generate joint distribution from full conditionals
• Transition kernel
Bayesian Biostatistics - Piracicaba 2014 353
6.2.4 Review of Gibbs sampling approaches
Sampling the full conditionals is done via different algorithms depending on:
◃ Shape of full conditional (classical versus general purpose algorithm)
◃ Preference of software developer:
◦ SAS® procedures GENMOD, LIFEREG and PHREG: ARMS algorithm
◦ WinBUGS: variety of samplers
Several versions of the basic Gibbs sampler:
◃ Deterministic- or systematic-scan Gibbs sampler: the d dimensions are visited in a fixed order
◃ Block Gibbs sampler: the d dimensions are split up into m blocks of parameters and the Gibbs sampler is applied to the blocks
Bayesian Biostatistics - Piracicaba 2014 354
Review of Gibbs sampling approaches – The block Gibbs sampler
Block Gibbs sampler:
• Normal linear regression
◃ p(σ2 | β0, β1,y)
◃ p(β0, β1 | σ2,y)
• May speed up convergence considerably, at the expense of more computational time needed at each iteration
• WinBUGS: blocking option on
• SAS® procedure MCMC: allows the user to specify the blocks
Bayesian Biostatistics - Piracicaba 2014 355
6.3 The Metropolis(-Hastings) algorithm
Metropolis-Hastings (MH) algorithm = general Markov chain Monte Carlo technique to sample from the posterior distribution that does not require full conditionals
• Special case: Metropolis algorithm proposed by Metropolis in 1953
• General case: Metropolis-Hastings algorithm proposed by Hastings in 1970
• Became popular only after introduction of Gelfand & Smith’s paper (1990)
• Further generalization: Reversible Jump MCMC algorithm by Green (1995)
Bayesian Biostatistics - Piracicaba 2014 356
6.3.1 The Metropolis algorithm
Sketch of algorithm:
• New positions are proposed by a proposal density q
• Proposed positions will be:
◃ Accepted:
◦ Proposed location has higher posterior probability: with probability 1
◦ Otherwise: with probability proportional to ratio of posterior probabilities
◃ Rejected:
◦ Otherwise
• Algorithm satisfies again Markov property ⇒ MCMC algorithm
• Similarity with AR algorithm
Bayesian Biostatistics - Piracicaba 2014 357
Metropolis algorithm:
Chain is at θ^k ⇒ the Metropolis algorithm samples the value θ^(k+1) as follows:
1. Sample a candidate θ̃ from the symmetric proposal density q(θ̃ | θ^k)
2. The next value θ^(k+1) will be equal to:
• θ̃ with probability α(θ^k, θ̃) (accept proposal)
• θ^k otherwise (reject proposal)
with
α(θ^k, θ̃) = min( r = p(θ̃ | y) / p(θ^k | y), 1 )
The function α(θ^k, θ̃) = probability of a move
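The algorithm in code, with a standard normal density standing in for the (unnormalized) posterior; working with log densities avoids overflow in the ratio r:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_post(theta):
    """Toy target: standard normal log-density (up to a constant)."""
    return -0.5 * theta ** 2

iters, scale = 20000, 2.4        # proposal sd (a common tuning choice)
theta = 0.0
chain = np.empty(iters)
accepted = 0
for k in range(iters):
    prop = rng.normal(theta, scale)               # symmetric proposal
    log_r = log_post(prop) - log_post(theta)      # log of the ratio r
    if np.log(rng.uniform()) < log_r:             # accept w.p. min(r, 1)
        theta, accepted = prop, accepted + 1
    chain[k] = theta

print(f"acceptance rate = {accepted / iters:.2f}")
print(f"mean = {chain[2000:].mean():.2f}, sd = {chain[2000:].std():.2f}")
```

Note that only the unnormalized density enters through the ratio, which is why prior × likelihood suffices.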
Bayesian Biostatistics - Piracicaba 2014 358
The MH algorithm only requires the product of the prior and the likelihood to sample from the posterior
Bayesian Biostatistics - Piracicaba 2014 359
Example VI.7: SAP study – Metropolis algorithm for NI prior case
Settings as in Example VI.1, now apply Metropolis algorithm:
◃ Proposal density: N(θ^k, Σ) with θ^k = (µ^k, (σ2)^k)^T and Σ = diag(0.03, 0.03)
[Figure: (a) sampling path and (b) sample from the posterior in the (µ, σ2)-plane]
◦ Jumps to any location in the (µ, σ2)-plane
◦ Burn-in = 500, total chain = 1,500
Bayesian Biostatistics - Piracicaba 2014 360
MH-sampling:
[Figure: three snapshots of Metropolis proposals and moves in the (µ, σ2)-plane]
Bayesian Biostatistics - Piracicaba 2014 361
Marginal posterior distributions:
[Figure: marginal posterior densities of (a) µ and (b) σ2]
◦ Acceptance rate = 40%
◦ Burn-in = 500, total chain = 1,500
Bayesian Biostatistics - Piracicaba 2014 362
Trace plots:
[Figure: trace plots of (a) µ and (b) σ2]
◦ Accepted moves = blue color, rejected moves = red color
Bayesian Biostatistics - Piracicaba 2014 363
Second choice of proposal density:
◃ Proposal density: N(θ^k, Σ) with θ^k = (µ^k, (σ2)^k)^T and Σ = diag(0.001, 0.001)
[Figure: (a) sampling path in the (µ, σ2)-plane and (b) posterior density of σ2]
◦ Acceptance rate = 84%
◦ Poor approximation of true distribution
Bayesian Biostatistics - Piracicaba 2014 364
Accepted + rejected positions:
[Figure: accepted and rejected positions for proposal variances 0.03, 0.001 and 0.1]
Bayesian Biostatistics - Piracicaba 2014 365
Problem:
What should the acceptance rate be for a good Metropolis algorithm?
From theoretical work + simulations:
• Acceptance rate: ≈ 45% for d = 1 and ≈ 24% for d > 1
Bayesian Biostatistics - Piracicaba 2014 366
6.3.2 The Metropolis-Hastings algorithm
Metropolis-Hastings algorithm:
Chain is at θ^k ⇒ the Metropolis-Hastings algorithm samples the value θ^(k+1) as follows:
1. Sample a candidate θ̃ from the (asymmetric) proposal density q(θ̃ | θ^k)
2. The next value θ^(k+1) will be equal to:
• θ̃ with probability α(θ^k, θ̃) (accept proposal)
• θ^k otherwise (reject proposal)
with
α(θ^k, θ̃) = min( r = [p(θ̃ | y) q(θ^k | θ̃)] / [p(θ^k | y) q(θ̃ | θ^k)], 1 )
Bayesian Biostatistics - Piracicaba 2014 367
• Reversibility condition: Probability of move from θ to θ = probability of movefrom θ to θ
• Reversible chain: chain satisfying reversibility condition
• Example asymmetric proposal density: q(θ | θk) ≡ q(θ) (Independent MHalgorithm)
• WinBUGS makes use of a univariate MH algorithm to sample from some non-standard full conditionals
Example VI.8: Sampling a t-distribution using Independent MH algorithm
Target distribution: t3(3, 2²)-distribution
(a) Independent MH algorithm with proposal density N(3, 4²)
(b) Independent MH algorithm with proposal density N(3, 2²)
[Figure: histograms of the sampled values for proposals (a) N(3, 4²) and (b) N(3, 2²)]
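Proposal (a) can be mimicked with a short Python sketch (illustrative, not the course code). Only unnormalized log densities are needed, since the normalizing constants cancel in the ratio r:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_t(x, df=3, loc=3.0, scale=2.0):
    # unnormalized log density of the t3(3, 2^2) target
    z = (x - loc) / scale
    return -(df + 1) / 2 * np.log(1 + z * z / df)

def log_q(x, loc=3.0, scale=4.0):
    # unnormalized log density of the N(3, 4^2) proposal of case (a)
    return -0.5 * ((x - loc) / scale) ** 2

theta, chain = 3.0, []
for _ in range(5000):
    cand = rng.normal(3.0, 4.0)        # drawn independently of the current state
    # independent proposal: q(cand | theta) = q(cand), q(theta | cand) = q(theta)
    log_r = log_t(cand) + log_q(theta) - log_t(theta) - log_q(cand)
    if np.log(rng.uniform()) < log_r:  # accept with probability min(r, 1)
        theta = cand
    chain.append(theta)
chain = np.array(chain)
```

The wide N(3, 4²) proposal covers the heavy tails of the target; the narrower N(3, 2²) proposal of case (b) underweights the tails and approximates the target poorly.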
6.3.3 Remarks*
• The Gibbs sampler is a special case of the Metropolis-Hastings algorithm, but the Gibbs sampler is still treated differently
• The transition kernel of the MH-algorithm
• The reversibility condition
• Difference with AR algorithm
6.5. Choice of the sampler
Choice of the sampler depends on a variety of considerations
Example VI.9: Caries study – MCMC approaches for logistic regression
Subset of n = 500 children of the Signal-Tandmobiel® study at the 1st examination:
◃ Research questions:
◦ Do girls have a different risk of developing caries experience (CE) than boys (gender) in the first year of primary school?
◦ Is there an east-west gradient (x-coordinate) in CE?
◃ Bayesian model: logistic regression + N(0, 100²) priors for the regression coefficients
◃ No standard full conditionals
◃ Three algorithms:
◦ Self-written R program: evaluate full conditionals on a grid + ICDF-method
◦ WinBUGS program: multivariate MH algorithm (blocking mode on)
◦ SAS® procedure MCMC: Random-Walk MH algorithm
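As the caries data are not included in the slides, the Random-Walk MH route can be illustrated on simulated data with the same three covariates and N(0, 100²) priors (a Python sketch, not any of the three programs above):

```python
import numpy as np

rng = np.random.default_rng(3)

# simulated stand-in for the caries data: intercept, gender (0/1), x-coordinate
n = 500
X = np.column_stack([np.ones(n),
                     rng.integers(0, 2, n).astype(float),
                     rng.uniform(0.0, 200.0, n)])
beta_true = np.array([-0.6, 0.0, 0.005])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def log_post(beta):
    # logistic log-likelihood + N(0, 100^2) priors on the coefficients
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta))) - np.sum(beta ** 2) / (2 * 100 ** 2)

beta, chain = np.zeros(3), []
prop_sd = np.array([0.1, 0.1, 0.001])   # ad-hoc per-coefficient proposal sd
for _ in range(5000):
    cand = beta + rng.normal(0, prop_sd)
    if np.log(rng.uniform()) < log_post(cand) - log_post(beta):
        beta = cand
    chain.append(beta.copy())
chain = np.array(chain)[1000:]          # discard a burn-in of 1,000
```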
Program   Parameter   Mode      Mean      SD       Median    MCSE
MLE       Intercept   -0.5900             0.2800
          gender      -0.0379             0.1810
          x-coord      0.0052             0.0017
R         Intercept             -0.5880   0.2840   -0.5860   0.0104
          gender                -0.0516   0.1850   -0.0578   0.0071
          x-coord                0.0052   0.0017    0.0052   6.621E-5
WinBUGS   Intercept             -0.5800   0.2810   -0.5730   0.0094
          gender                -0.0379   0.1770   -0.0324   0.0060
          x-coord                0.0052   0.0018    0.0053   5.901E-5
SAS®      Intercept             -0.6530   0.2600   -0.6450   0.0317
          gender                -0.0319   0.1950   -0.0443   0.0208
          x-coord                0.0055   0.0016    0.0055   0.00016
Conclusions:
• Posterior means/medians of the three samplers are close (to the MLE)
• The precision with which the posterior mean was determined (high precision = low MCSE) differs considerably
• The clinical conclusion was the same
⇒ Samplers may have quite a different efficiency
Take home messages
• The two MCMC approaches allow fitting basically any proposed model
• There is no free lunch: computation time can be MUCH longer than with likelihood approaches
• The choice between Gibbs sampling and the Metropolis-Hastings approach depends on computational and practical considerations