Introduction to Bayesian Methods
Introduction to Bayesian Methods – p.1/??
Introduction

We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study in which the parameter θ is of inferential interest. Here θ may be vector valued. For example,
1. θ = difference in treatment means
2. θ = hazard ratio
3. θ = vector of regression coefficients
4. θ = probability a treatment is effective
In parametric inference, we specify a parametric model for the data, indexed by the parameter θ. Letting x denote the data, we denote this model (density) by p(x|θ). The likelihood function of θ is any function proportional to p(x|θ), i.e.,

L(θ) ∝ p(x|θ).

Example
Suppose x|θ ∼ Binomial(N, θ). Then

p(x|θ) = (N choose x) θ^x (1 − θ)^(N−x),  x = 0, 1, ..., N.
We can take

L(θ) = θ^x (1 − θ)^(N−x).

The parameter θ is unknown. In the Bayesian mind-set, we express our uncertainty about quantities by specifying distributions for them. Thus, we express our uncertainty about θ by specifying a prior distribution for it. We denote the prior density of θ by π(θ). The word "prior" denotes that it is the density of θ before the data x are observed. By Bayes' theorem, we can construct the distribution of θ|x, which is called the posterior distribution of θ. We denote the posterior density of θ by p(θ|x).
By Bayes' theorem,

p(θ|x) = p(x|θ)π(θ) / ∫_Θ p(x|θ)π(θ) dθ,

where Θ denotes the parameter space of θ. The quantity

p(x) = ∫_Θ p(x|θ)π(θ) dθ

is the normalizing constant of the posterior distribution. For most inference problems, p(x) does not have a closed form. Bayesian inference about θ is primarily based on the posterior distribution of θ, p(θ|x).
From the posterior, one can compute various posterior summaries, such as the mean, median, mode, variance, and quantiles. For example, the posterior mean of θ is given by

E(θ|x) = ∫_Θ θ p(θ|x) dθ.
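Since p(x) rarely has a closed form, posterior summaries such as this mean are often computed numerically. A minimal sketch on a grid over (0, 1) — the binomial model and uniform prior below are illustrative choices, not from the text:

```python
import math

def posterior_mean(like, prior, grid_n=100_000):
    # Approximate E(theta|x) = ∫ theta p(theta|x) dtheta by the midpoint
    # rule, normalizing by p(x) = ∫ p(x|theta) pi(theta) dtheta.
    h = 1.0 / grid_n
    num = den = 0.0
    for i in range(grid_n):
        t = (i + 0.5) * h
        w = like(t) * prior(t)
        num += t * w * h
        den += w * h
    return num / den

# Binomial likelihood with x = 7 successes in N = 10 trials, uniform prior;
# the exact posterior is Beta(8, 4), whose mean is 8/12.
N, x = 10, 7
like = lambda t: math.comb(N, x) * t**x * (1 - t) ** (N - x)
prior = lambda t: 1.0
print(posterior_mean(like, prior))   # ≈ 0.6667
```

The same grid approximation works for any one-dimensional prior; recognizing a conjugate form, as in Example 1 below, simply makes the integration unnecessary.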
Example 1. Given θ, suppose x1, x2, ..., xn are i.i.d. Binomial(1, θ), and θ ∼ Beta(α, λ). The parameters of the prior distribution are often called the hyperparameters.
Let us derive the posterior distribution of θ. Let x = (x1, x2, ..., xn). Then
p(x|θ) = ∏_{i=1}^n p(xi|θ)
       = ∏_{i=1}^n θ^(xi) (1 − θ)^(1−xi)
       = θ^(Σxi) (1 − θ)^(n−Σxi),

where Σxi = Σ_{i=1}^n xi. Also,

π(θ) = Γ(α + λ) / (Γ(α)Γ(λ)) θ^(α−1) (1 − θ)^(λ−1).

Now, we can write the kernel of the posterior density as
p(θ|x) ∝ θ^(Σxi) θ^(α−1) (1 − θ)^(n−Σxi) (1 − θ)^(λ−1)
       = θ^(Σxi+α−1) (1 − θ)^(n−Σxi+λ−1).

Thus p(θ|x) ∝ θ^(Σxi+α−1) (1 − θ)^(n−Σxi+λ−1). We can recognize this kernel as a beta kernel with parameters (Σxi + α, n − Σxi + λ). Thus,

θ|x ∼ Beta(Σxi + α, n − Σxi + λ),

and therefore

p(θ|x) = Γ(α + n + λ) / (Γ(Σxi + α) Γ(n − Σxi + λ)) × θ^(Σxi+α−1) (1 − θ)^(n−Σxi+λ−1).
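This conjugate update is easy to carry out numerically. A minimal sketch of Example 1 — the hyperparameters and simulated data below are illustrative assumptions, not from the text:

```python
import random

# Beta-Binomial conjugate update from Example 1.
# Hyperparameters and data are illustrative choices.
alpha, lam = 2.0, 2.0                          # prior: theta ~ Beta(alpha, lambda)
random.seed(1)
x = [1 if random.random() < 0.7 else 0 for _ in range(50)]
n, s = len(x), sum(x)

# Posterior: theta | x ~ Beta(sum(xi) + alpha, n - sum(xi) + lambda)
a_post, b_post = s + alpha, n - s + lam
post_mean = a_post / (a_post + b_post)         # beta mean a/(a + b)
print(f"theta | x ~ Beta({a_post}, {b_post}); posterior mean {post_mean:.3f}")
```

The posterior mean (Σxi + α)/(n + α + λ) is a compromise between the sample proportion Σxi/n and the prior mean α/(α + λ).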
Remark. In deriving posterior densities, an often-used technique is to try to recognize the kernel of the posterior density of θ. This avoids direct computation of p(x) and saves a lot of time in derivation. If the kernel cannot be recognized, then p(x) must be computed directly.
In this example we have

p(x) = p(x1, ..., xn) ∝ ∫_0^1 θ^(Σxi+α−1) (1 − θ)^(n−Σxi+λ−1) dθ
     = Γ(Σxi + α) Γ(n − Σxi + λ) / Γ(α + n + λ).
Thus

p(x1, ..., xn) = Γ(α + λ) / (Γ(α)Γ(λ)) × Γ(Σxi + α) Γ(n − Σxi + λ) / Γ(α + n + λ)

for xi = 0, 1, and i = 1, ..., n.
Suppose A1, A2, ... are events such that Ai ∩ Aj = ∅ for i ≠ j and ∪_{i=1}^∞ Ai = Ω, where Ω denotes the sample space. Let B denote an event in Ω. Then Bayes' theorem for events can be written as

P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1}^∞ P(B|Aj)P(Aj).
P(Ai) is the prior probability of Ai, and P(Ai|B) is the posterior probability of Ai given that B has occurred.
Example 2. Bayes' theorem is often used in diagnostic tests for cancer. A young person was diagnosed as having a type of cancer that occurs extremely rarely in young people. Naturally, he was very upset. A friend told him that it was probably a mistake. His friend reasoned as follows: no medical test is perfect; there are always incidences of false positives and false negatives.
Let C stand for the event that he has cancer and let + stand for the event that an individual responds positively to the test. Assume P(C) = 1/1,000,000 = 10^(−6) and P(+|C^c) = .01. (So only one per million people his age have the disease, and the test is extremely good relative to most medical tests, giving only 1% false positives and 1% false negatives, so that P(+|C) = .99.) Find the probability that he has cancer given that he has a positive response. (After you make this calculation you will not be surprised to learn that he did not have cancer.)

P(C|+) = P(+|C)P(C) / (P(+|C)P(C) + P(+|C^c)P(C^c))
       = (.99)(10^(−6)) / ((.99)(10^(−6)) + (.01)(.999999))
P(C|+) = .00000099 / .01000098 = .00009899.
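The arithmetic of Example 2 is a one-liner to check:

```python
# Bayes' theorem for the diagnostic-test example (Example 2).
p_c = 1e-6          # P(C): prevalence of this cancer at his age
sens = 0.99         # P(+|C): 1% false negatives
fpr = 0.01          # P(+|C^c): 1% false positives

p_c_pos = sens * p_c / (sens * p_c + fpr * (1 - p_c))
print(f"{p_c_pos:.8f}")   # 0.00009899
```

Even with a 99%-accurate test, the tiny prior probability P(C) dominates the posterior.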
Example 3. Suppose x1, ..., xn is a random sample from N(µ, σ²).
i) Suppose σ² is known and µ ∼ N(µo, σo²). The posterior density of µ is given by:

p(µ|x) ∝ ( ∏_{i=1}^n p(xi|µ, σ²) ) π(µ)
       ∝ exp{ −(1/(2σ²)) Σ(xi − µ)² } × exp{ −(1/(2σo²)) (µ − µo)² }
∝ exp{ −(1/2) [ (nσo² + σ²)/(σo²σ²) ] [ µ² − 2µ (σo² Σxi + µo σ²)/(nσo² + σ²) ] }

∝ exp{ −(1/2) [ (nσo² + σ²)/(σo²σ²) ] [ µ − (σo² Σxi + µo σ²)/(nσo² + σ²) ]² }

We can recognize this as a normal kernel with mean

µ_post = (σo² Σxi + µo σ²)/(nσo² + σ²)

and variance

σ²_post = [ (nσo² + σ²)/(σo²σ²) ]^(−1) = σo²σ² / (nσo² + σ²).

Thus

µ|x ∼ N( (σo² Σxi + µo σ²)/(nσo² + σ²),  σo²σ² / (nσo² + σ²) ).
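A quick numerical sketch of this update; the hyperparameters and simulated data below are illustrative choices, not from the text:

```python
import random

# Posterior of mu with sigma^2 known (Example 3 i).
random.seed(0)
sigma2 = 4.0                        # known sampling variance
mu0, sigma02 = 0.0, 10.0            # prior: mu ~ N(mu0, sigma0^2)
xs = [random.gauss(3.0, sigma2 ** 0.5) for _ in range(25)]
n, sx = len(xs), sum(xs)

# mu | x ~ N( (sigma0^2 * sum(xi) + mu0 * sigma^2) / (n*sigma0^2 + sigma^2),
#             sigma0^2 * sigma^2 / (n*sigma0^2 + sigma^2) )
mu_post = (sigma02 * sx + mu0 * sigma2) / (n * sigma02 + sigma2)
var_post = (sigma02 * sigma2) / (n * sigma02 + sigma2)
print(mu_post, var_post)
```

The posterior mean is a precision-weighted average of the sample mean and µo, and the posterior variance σo²σ²/(nσo² + σ²) is smaller than both the prior variance σo² and the sampling variance σ²/n of x̄.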
ii) Suppose µ is known and σ² is unknown. Let τ = 1/σ². τ is often called the precision parameter. Suppose τ ∼ Gamma(δo/2, γo/2). Thus

π(τ) ∝ τ^(δo/2 − 1) exp( −τγo/2 ).

Let us derive the posterior distribution of τ.

p(τ|x) ∝ τ^(n/2) exp{ −(τ/2) Σ(xi − µ)² } × τ^(δo/2 − 1) exp( −τγo/2 )
       ∝ τ^((n+δo)/2 − 1) exp{ −(τ/2) ( γo + Σ(xi − µ)² ) }
Thus

τ|x ∼ Gamma( (n + δo)/2,  (γo + Σ(xi − µ)²)/2 ).
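A numerical sketch of this gamma update; µ, the prior hyperparameters, and the data below are illustrative assumptions:

```python
import random

# Posterior of the precision tau = 1/sigma^2 with mu known (Example 3 ii).
random.seed(0)
mu, sigma = 1.0, 2.0                    # mu known; true precision is 0.25
xs = [random.gauss(mu, sigma) for _ in range(200)]
delta0, gamma0 = 1.0, 1.0               # prior: tau ~ Gamma(delta0/2, gamma0/2)

ss = sum((x - mu) ** 2 for x in xs)
# tau | x ~ Gamma( (n + delta0)/2, (gamma0 + ss)/2 )
shape = (len(xs) + delta0) / 2
rate = (gamma0 + ss) / 2
tau_post_mean = shape / rate            # mean of a Gamma(shape, rate)
print(tau_post_mean)
```

With n = 200 observations, the posterior mean of τ should sit close to the true precision 1/σ² = 0.25, since the data term Σ(xi − µ)² swamps the prior terms.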
iii) Now suppose µ and σ² are both unknown. Suppose we specify the joint prior

π(µ, τ) = π(µ|τ)π(τ),

where

µ|τ ∼ N(µo, τ^(−1) σo²)  and  τ ∼ Gamma(δo/2, γo/2).
The joint posterior density of (µ, τ) is given by

p(µ, τ|x) ∝ ( τ^(n/2) exp{ −(τ/2) Σ(xi − µ)² } ) × ( τ^(1/2) exp{ −(τ/(2σo²)) (µ − µo)² } ) × ( τ^(δo/2 − 1) exp{ −τγo/2 } )
          = τ^((n+δo+1)/2 − 1) exp{ −(τ/2) ( γo + (µ − µo)²/σo² + Σ(xi − µ)² ) }

The joint posterior does not have a clearly recognizable form. Thus, we need to compute p(x) by brute force.
p(x) ∝ ∫_0^∞ ∫_(−∞)^∞ τ^((n+δo+1)/2 − 1) exp{ −(τ/2) ( γo + (µ − µo)²/σo² + Σ(xi − µ)² ) } dµ dτ

     = ∫_0^∞ ∫_(−∞)^∞ τ^((n+δo+1)/2 − 1) exp{ −(τ/2) ( γo + µ²(n + 1/σo²) − 2µ(Σxi + µo/σo²) + µo²/σo² + Σxi² ) } dµ dτ

     = ∫_0^∞ τ^((n+δo+1)/2 − 1) exp{ −(τ/2) ( γo + µo²/σo² + Σxi² ) }
       × ( ∫_(−∞)^∞ exp{ −(τ/2) ( µ²(n + 1/σo²) − 2µ(Σxi + µo/σo²) ) } dµ ) dτ
The integral with respect to µ can be evaluated by completing the square:

∫_(−∞)^∞ exp{ −(τ(n + σo^(−2))/2) [ µ − (Σxi + µo σo^(−2))/(n + σo^(−2)) ]² } × exp{ τ(Σxi + µo σo^(−2))² / (2(n + σo^(−2))) } dµ

= exp{ τ(Σxi + µo σo^(−2))² / (2(n + σo^(−2))) } (2π)^(1/2) τ^(−1/2) (n + σo^(−2))^(−1/2).
Now we need to evaluate

∫_0^∞ (2π)^(1/2) (n + σo^(−2))^(−1/2) τ^(−1/2) τ^((n+δo+1)/2 − 1)
  × exp{ −(τ/2) [ γo + µo²/σo² + Σxi² ] } × exp{ (τ/2) [ (Σxi + µo/σo²)² / (n + 1/σo²) ] } dτ

= (2π)^(1/2) (n + σo^(−2))^(−1/2) ∫_0^∞ τ^((n+δo)/2 − 1)
  × exp{ −(τ/2) [ γo + µo²/σo² + Σxi² − (Σxi + µo/σo²)² / (n + 1/σo²) ] } dτ
= (2π)^(1/2) Γ((n+δo)/2) (n + 1/σo²)^(−1/2) / [ (1/2) ( γo + µo²/σo² + Σxi² − (Σxi + µo/σo²)² / (n + 1/σo²) ) ]^((n+δo)/2)

= (2π)^(1/2) Γ((n+δo)/2) 2^((n+δo)/2) (n + 1/σo²)^(−1/2) / [ γo + µo²/σo² + Σxi² − (Σxi + µo/σo²)² / (n + 1/σo²) ]^((n+δo)/2)

≡ p*(x)

Thus,

p(x) = ( (2π)^(−(n+1)/2) σo^(−1) (γo/2)^(δo/2) / Γ(δo/2) ) p*(x).
The joint posterior density of (µ, τ) given x can also be obtained in this case by deriving p(µ, τ|x) = p(µ|τ, x) p(τ|x).
Exercise: Find p(µ|τ, x) and p(τ|x).
It is of great interest to find the marginal posterior distributions of µ and τ.

p(µ|x) = ∫_0^∞ p(µ, τ|x) dτ
       ∝ ∫_0^∞ τ^((n+δo+1)/2 − 1) exp{ −(τ/2) [ γo + µo²/σo² + Σxi² ] }
         × exp{ −(τ/2) [ µ²(n + 1/σo²) − 2µ(Σxi + µo/σo²) ] } dτ
= ∫_0^∞ τ^((n+δo+1)/2 − 1) exp{ −(τ/2) [ γo + µo²/σo² + Σxi² ] }
  × exp{ −(τ(n + 1/σo²)/2) [ µ − (Σxi + µo/σo²)/(n + 1/σo²) ]² }
  × exp{ (τ/2) [ (Σxi + µo/σo²)² / (n + 1/σo²) ] } dτ

Let a = (Σxi + µo/σo²)/(n + 1/σo²). Then, we can write the integral as
= ∫_0^∞ τ^((n+δo+1)/2 − 1) exp{ −(τ/2) [ γo + µo²/σo² + Σxi² + (n + 1/σo²)(µ − a)² − (n + 1/σo²)a² ] } dτ

= Γ((n+δo+1)/2) 2^((n+δo+1)/2) / [ γo + µo²/σo² + Σxi² + (n + 1/σo²)(µ − a)² − (n + 1/σo²)a² ]^((n+δo+1)/2)

∝ [ 1 + c(µ − a)² / (b − ca²) ]^(−(n+δo+1)/2),

where c = n + 1/σo² and b = γo + µo²/σo² + Σxi². We recognize this kernel as that of a t-distribution with location parameter a, dispersion parameter ( (n+δo)c / (b − ca²) )^(−1), and n + δo degrees of freedom.
Definition. Let y = (y1, ..., yp)′ be a p × 1 random vector. Then y is said to have a p-dimensional multivariate t distribution with d degrees of freedom, location parameter m, and p × p dispersion matrix Σ if y has density

p(y) = Γ((d+p)/2) (πd)^(−p/2) |Σ|^(−1/2) / Γ(d/2) × [ 1 + (1/d)(y − m)′ Σ^(−1) (y − m) ]^(−(d+p)/2).

We write this as y ∼ S_p(d, m, Σ). In our problem, p = 1, d = n + δo, m = a, Σ^(−1) = (n+δo)c / (b − ca²), and Σ = ( (n+δo)c / (b − ca²) )^(−1).
The marginal posterior distribution of τ is given by

p(τ|x) ∝ ∫_(−∞)^∞ τ^((n+δo+1)/2 − 1) exp{ −(τ/2) [ γo + µo²/σo² + Σxi² ] }
         × exp{ (τ/2) (n + 1/σo²) a² } × exp{ −(τ(n + 1/σo²)/2) (µ − a)² } dµ

       ∝ τ^((n+δo+1)/2 − 1) τ^(−1/2) exp{ −(τ/2) [ γo + µo²/σo² + Σxi² − (n + 1/σo²)a² ] }

       = τ^((n+δo)/2 − 1) exp{ −(τ/2) [ γo + µo²/σo² + Σxi² − (n + 1/σo²)a² ] }

Thus,

τ|x ∼ Gamma( (n + δo)/2,  (1/2) ( γo + µo²/σo² + Σxi² − (n + 1/σo²)a² ) ).
Remark. A t distribution can be obtained as a scale mixture of normals. That is, if x|τ ∼ N_p(m, τ^(−1)Σ) and τ ∼ Gamma(δo/2, γo/2), then

p(x) = ∫_0^∞ p(x|τ) π(τ) dτ

is the S_p(δo, m, (γo/δo)Σ) density. That is, x ∼ S_p(δo, m, (γo/δo)Σ).
Note:

p(x|τ) = (2π)^(−p/2) τ^(p/2) |Σ|^(−1/2) × exp{ −(τ/2) (x − m)′ Σ^(−1) (x − m) }.
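This representation can be checked by simulation. A sketch for p = 1 and Σ = 1 (the values of δo, γo, m below are illustrative): draw τ from its gamma prior, then x | τ from the normal, and compare the spread of the draws with the implied t distribution.

```python
import random

# Monte Carlo sketch of the scale-mixture representation: with
# tau ~ Gamma(delta0/2, gamma0/2) and x | tau ~ N(m, 1/tau), the draws are
# marginally Student-t with delta0 degrees of freedom and scale^2 = gamma0/delta0.
random.seed(0)
delta0, gamma0, m = 5.0, 5.0, 0.0
draws = []
for _ in range(50_000):
    # random.gammavariate takes (shape, scale); rate gamma0/2 -> scale 2/gamma0
    tau = random.gammavariate(delta0 / 2, 2 / gamma0)
    draws.append(random.gauss(m, (1 / tau) ** 0.5))

# For d = 5 dof and scale^2 = 1 the t variance is d/(d - 2) = 5/3.
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)
```

This mixture view is also how the marginal posterior p(µ|x) above ends up being a t density: integrating the normal-gamma joint posterior over τ performs exactly this mixing.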
Remark. Note that in Examples 1 and 3 i), ii), the posterior distribution is of the same family as the prior distribution. When the posterior distribution of a parameter is of the same family as the prior distribution, such prior distributions are called conjugate prior distributions.
In Example 1, a beta prior on θ led to a beta posterior for θ. In Example 3 i), a normal prior for µ yielded a normal posterior for µ. In Example 3 ii), a gamma prior for τ yielded a gamma posterior for τ. More on conjugate priors later.
Advantages of Bayesian Methods

1. Interpretation
Having a distribution for your unknown parameter θ is easier to understand than a point estimate and a standard error. In addition, consider the following example of a confidence interval. A 95% confidence interval for a population mean θ can be written as

x̄ ± (1.96) s/√n.

Yet for the realized interval (a, b) computed from the data, P(a < θ < b) ≠ 0.95.
We have to rely on a repeated-sampling interpretation to make a probability statement as above. Thus, after observing the data, we cannot make a statement like "the true θ has a 95% chance of falling in x̄ ± (1.96) s/√n," although we are tempted to say this.
2. Bayes Inference Obeys the Likelihood Principle
The likelihood principle: if two distinct sampling plans (designs) yield proportional likelihood functions for θ, then inference about θ should be identical under the two designs. Frequentist inference does not obey the likelihood principle, in general.
Example. Suppose in 12 independent tosses of a coin, 9 heads and 3 tails are observed. I wish to test the null hypothesis Ho: θ = 1/2 vs. H1: θ > 1/2, where θ is the true probability of heads.
Consider the following two choices for the likelihood function:
a) Binomial: n = 12 (fixed), x = number of heads, x ∼ Binomial(12, θ), and the likelihood is

L1(θ) = (n choose x) θ^x (1 − θ)^(n−x) = (12 choose 9) θ^9 (1 − θ)^3.

b) Negative binomial: n is not fixed; flip until the third tail appears. Here x is the number of heads observed before the third tail, so x ∼ NegBinomial(r = 3, θ).
L2(θ) = (r + x − 1 choose x) θ^x (1 − θ)^r = (11 choose 9) θ^9 (1 − θ)^3.

Note that L1(θ) ∝ L2(θ). From a Bayesian perspective, the posterior distribution of θ is the same under either design. That is,

p(θ|x) = L1(θ)π(θ) / ∫ L1(θ)π(θ) dθ ≡ L2(θ)π(θ) / ∫ L2(θ)π(θ) dθ.
However, under the frequentist paradigm, inferences about θ are quite different under the two designs. The p-value based on the binomial likelihood is

P(x ≥ 9 | θ = 1/2) = Σ_{j=9}^{12} (12 choose j) θ^j (1 − θ)^(12−j) = 0.073,

while for the negative binomial likelihood, the p-value is

P(x ≥ 9 | θ = 1/2) = Σ_{j=9}^∞ (2 + j choose j) θ^j (1 − θ)^3 = 0.0327.

At the 0.05 level, the two designs lead to different decisions: rejecting Ho under design b) and not under design a).
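The two p-values can be computed directly from exact tail sums:

```python
from math import comb

# The coin example: 9 heads observed before the 3rd tail, theta0 = 1/2.
theta = 0.5

# Design a): Binomial with n = 12 fixed; P(X >= 9).
p_binom = sum(comb(12, j) * theta**j * (1 - theta) ** (12 - j)
              for j in range(9, 13))

# Design b): Negative Binomial with r = 3 tails, x = number of heads;
# P(X >= 9) computed as 1 minus the finite lower tail.
p_negbin = 1 - sum(comb(2 + j, j) * theta**j * (1 - theta) ** 3
                   for j in range(9))

print(round(p_binom, 4), round(p_negbin, 4))   # 0.073 0.0327
```

Both p-values are computed from the same 12 flips; only the stopping rule differs, yet they straddle the conventional 0.05 threshold.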
3. Bayesian Inference Does Not Lead to Absurd Results
Absurd results can be obtained when doing UMVUE estimation. Suppose x ∼ Poisson(λ), and we want to estimate θ = e^(−2λ), 0 < θ < 1. It can be shown that the UMVUE of θ is (−1)^x. Thus, if x is even the UMVUE of θ is 1, and if x is odd the UMVUE of θ is −1!!
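The unbiasedness behind this claim is easy to check by simulation: E[(−1)^X] = e^(−2λ) for X ∼ Poisson(λ), yet (−1)^x itself is never a sensible estimate of a quantity in (0, 1). The value of λ and the sampler below are illustrative; Knuth's method is used only to avoid external libraries.

```python
import math
import random

random.seed(0)
lam = 1.5

def poisson(lam):
    # Knuth's method for one Poisson(lam) draw (fine for small lam).
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

# Average of (-1)^X over many draws vs the target theta = e^(-2*lam).
est = sum((-1) ** poisson(lam) for _ in range(200_000)) / 200_000
print(est, math.exp(-2 * lam))
```

The Monte Carlo average lands near e^(−3) ≈ 0.05, confirming unbiasedness, while every individual estimate is ±1.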
4. Bayes' Theorem Is a Formula for Learning
Suppose you conduct an experiment and collect observations x1, ..., xn. Then

p(θ|x) = p(x|θ)π(θ) / ∫_Θ p(x|θ)π(θ) dθ,

where x = (x1, ..., xn). Suppose you collect an additional observation xn+1 in a new study. Then

p(θ|x, xn+1) = p(xn+1|θ) p(θ|x) / ∫_Θ p(xn+1|θ) p(θ|x) dθ.

So your prior in the new study is the posterior from the previous one.
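With the Beta-Binomial model of Example 1, this learning rule is exact: updating one observation at a time, each time using the previous posterior as the new prior, gives the same answer as a single batch update. The prior and data below are illustrative choices.

```python
# Sequential vs. batch Bayesian updating in the Beta-Binomial model.
alpha, lam = 1.0, 1.0                 # prior: theta ~ Beta(1, 1)
data = [1, 0, 1, 1, 0, 1, 1, 1]

a, b = alpha, lam
for xi in data:                       # one observation at a time:
    a, b = a + xi, b + (1 - xi)       # posterior becomes the next prior

a_batch = alpha + sum(data)           # all data at once
b_batch = lam + len(data) - sum(data)
print((a, b) == (a_batch, b_batch))   # True
```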
5. Bayes Inference Does Not Require Large-Sample Theory
With modern computing advances, "exact" calculations can be carried out using Markov chain Monte Carlo (MCMC) methods. Bayes methods do not require asymptotics for valid inference; small-sample Bayesian inference proceeds in the same way as if one had a large sample.
6. Bayes Inference Often Has Frequentist Inference as a Special Case
Often one can obtain frequentist answers by choosing a uniform prior for the parameters, i.e., π(θ) ∝ 1, so that

p(θ|x) ∝ L(θ).

In such cases, frequentist answers can be obtained from the posterior distribution.