Course on
Program Bayesian GAP
and
its use for detrending TFP
C. Planas and A. Rossi
Joint Research Centre
European Commission
Course objective
To explain the Bayesian estimation of TFP cycle as
implemented in Program GAP.
For this we will:
• review the model and the time series tools involved in
detrending TFP;
• introduce Bayesian concepts;
• use the Bayesian module of Program GAP to detrend TFP
Course outline
• The bivariate TFP-CU model;
• The likelihood function via state space modelling and
Kalman filtering;
• Introduction to the Bayesian framework;
• Parameterization and prior distributions;
• From prior to posterior via Gibbs sampling;
• Convergence diagnostics;
• Program Bayesian GAP;
• Application to Member States data.
• Free time for open discussion.
The TFP - CU model
Cobb-Douglas production function:
$$Y = \mathit{TFP}\; K^{1-\alpha} L^{\alpha} \qquad (1)$$
Y: output, K: capital, L: labour. TFP contains short-term
fluctuations C and persistent efficiency improvements P:
$$\mathit{TFP} = P \times C$$
Let $CU_K$ and $CU_L$ denote the variations in the capacity
utilization of capital and of labour. Writing (1) as:
$$Y = P\,(CU_K \times K)^{1-\alpha}\,(CU_L \times L)^{\alpha}$$
suggests a link between the TFP gap and CU such that:
$$C = CU_K^{1-\alpha}\, CU_L^{\alpha}$$
Problem: Only aggregated CU is available, no CUK or CUL.
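We assume high correlation between CU and $CU_K$, and:
$$cu_L = \gamma\, cu_K + \varepsilon, \qquad 0 < \gamma < 1$$
where $x = \log X$ and $\varepsilon$ is a random shock. The TFP gap is related
to capacity utilization through:
$$c = (1 - \alpha + \alpha\gamma)\, cu + \alpha\varepsilon$$
Hence we consider the bivariate model, i.e. the measurement
equations:
$$tfp_t = p_t + c_t$$
$$cu_t = \mu_{cu} + \beta c_t + e_{cu,t}, \qquad \text{with } \beta = \frac{1}{1 - \alpha(1-\gamma)} \qquad (2)$$
Notice: $0 < \alpha < 1$ and $0 < \gamma < 1$ imply $\beta > 1$.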
Unobserved components dynamics, i.e. the state equations:
$$\Delta p_t = w + \mu_{t-1}$$
$$\mu_t = \rho\,\mu_{t-1} + a_{\mu t}, \qquad V(a_{\mu t}) = V_\mu$$
$$c_t = \phi_{c1} c_{t-1} + \phi_{c2} c_{t-2} + a_{ct}, \qquad V(a_{ct}) = V_c \qquad (3)$$
The TFP mean growth rate is $E(\Delta tfp_t) = E(\Delta p_t) = w$.
For the stochastic term $e_{cu,t}$, we consider:
either $e_{cu,t} = a_{cu,t}$ or $e_{cu,t} = \delta\, e_{cu,t-1} + a_{cu,t}$,
with $V(a_{cu,t}) = V_{CU}$.
To evaluate the likelihood, GAP casts model (1)-(3)
into a state space model and runs the Kalman filter.
GAP allows variations around (1)-(3): no CU, noise instead
of an AR(2) cycle, other trend models.
State Space Models
SSM are defined by a measurement equation like
$$X_t = H\xi_t + C Z_t + u_t, \qquad (1)$$
where $X_t$ is an $n \times 1$ vector of observations, $\xi_t$ is the $k \times 1$
state vector, $u_t$ is an $n \times 1$ vector of shocks with covariance
matrix $R$, and $Z_t$ is a vector of $r$ exogenous variables.
The matrices $H$, $C$ and $R$ are of conformable dimensions.
The model is completed with a transition equation for the
state
$$\xi_{t+1} = F\xi_t + v_{t+1}, \qquad (2)$$
where $F$ is the $k \times k$ transition matrix and $v_t$ is a vector
of $k$ shocks with $k \times k$ covariance matrix $Q$. The vectors
$u_t$ and $v_t$ are orthogonal at all leads and lags.
Example of State Space Model
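Implicit model in Hodrick-Prescott filtering (Harvey and
Jaeger, JAE 1993): $y_t = p_t + u_t$, where $\Delta^2 p_t = a_{pt}$, and
$u_t$, $a_{pt}$ are uncorrelated white noises.
State equation:
$$\begin{pmatrix} p_{t+1} \\ p_t \\ u_{t+1} \end{pmatrix} = \begin{pmatrix} 2 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} p_t \\ p_{t-1} \\ u_t \end{pmatrix} + \begin{pmatrix} a_{p,t+1} \\ 0 \\ u_{t+1} \end{pmatrix}$$
Measurement equation:
$$y_t = \begin{pmatrix} 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} p_t \\ p_{t-1} \\ u_t \end{pmatrix}$$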
The signal to noise ratio is V (apt)/V (ut), for instance
1/1600 for quarterly data.
Why is I(2)+noise the model implicit in Hodrick-Prescott
filtering? - (skip)
Hodrick-Prescott decompose $x_t$ into $x_t = p_t + u_t$ by
minimising the following loss function:
$$\min_{p_1,\dots,p_T} \; \sum_{t=1}^{T} u_t^2 + \lambda \sum_{t=3}^{T} (\Delta^2 p_t)^2$$
where $\lambda$ is the inverse signal to noise ratio.
Now consider the following model:
$$x_t = p_t + u_t$$
$$\Delta^2 p_t = \sqrt{\lambda}\, a_t, \qquad V(a_t) = V(u_t) = 1$$
Assume we want to estimate $p_1, \dots, p_T$ that minimise the
sum of squared prediction errors:
$$\min_{p_1,\dots,p_T} \sum_{t=1}^{T} (x_t - x_{t|t-1})^2 = \min_{p_1,\dots,p_T} \; \sum_{t=1}^{T} u_t^2 + \lambda \sum_{t=3}^{T} a_t^2$$
SS for the TFP - CU model
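The TFP - CU model:
$$tfp_t = p_t + c_t$$
$$cu_t = \mu_{cu} + \beta c_t + a_{cu,t}$$
$$p_t = p_{t-1} + \eta_{t-1}; \qquad \eta_t = \rho\,\eta_{t-1} + a_{\eta t}$$
$$c_t = \phi_{c1} c_{t-1} + \phi_{c2} c_{t-2} + a_{ct}$$
State and measurement equations (for simplicity w = 0):
$$\begin{pmatrix} p_{t+1} \\ \eta_{t+1} \\ c_{t+1} \\ c_t \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & \rho & 0 & 0 \\ 0 & 0 & \phi_{c1} & \phi_{c2} \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} p_t \\ \eta_t \\ c_t \\ c_{t-1} \end{pmatrix} + \begin{pmatrix} 0 \\ a_{\eta,t+1} \\ a_{c,t+1} \\ 0 \end{pmatrix}$$
$$\begin{pmatrix} tfp_t \\ cu_t \end{pmatrix} = \begin{pmatrix} 0 \\ \mu_{cu} \end{pmatrix} + \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & \beta & 0 \end{pmatrix} \begin{pmatrix} p_t \\ \eta_t \\ c_t \\ c_{t-1} \end{pmatrix} + \begin{pmatrix} 0 \\ a_{cu,t} \end{pmatrix}$$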
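As an illustration of the state and measurement equations above, the system matrices can be assembled numerically. This is only a sketch, not GAP's code: the function name `tfp_cu_state_space`, the argument names and the parameter values are hypothetical, and w = 0 as on the slide.

```python
import numpy as np

def tfp_cu_state_space(rho, phi1, phi2, beta, mu_cu, v_eta, v_c, v_cu):
    """System matrices for the state vector (p_t, eta_t, c_t, c_{t-1})."""
    # Transition matrix F of the state equation
    F = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, rho, 0.0, 0.0],
                  [0.0, 0.0, phi1, phi2],
                  [0.0, 0.0, 1.0, 0.0]])
    # Only eta and c receive shocks
    Q = np.diag([0.0, v_eta, v_c, 0.0])
    # Measurement matrix: tfp = p + c, cu = mu_cu + beta * c + a_cu
    H = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, beta, 0.0]])
    intercept = np.array([0.0, mu_cu])
    R = np.diag([0.0, v_cu])
    return F, Q, H, intercept, R

# Illustrative (made-up) parameter values
F, Q, H, intercept, R = tfp_cu_state_space(
    rho=0.5, phi1=1.2, phi2=-0.49, beta=1.5, mu_cu=0.0,
    v_eta=1e-4, v_c=1e-3, v_cu=1e-3)

state = np.array([1.0, 0.02, 0.03, 0.01])   # (p_t, eta_t, c_t, c_{t-1})
next_state_mean = F @ state                  # E[state_{t+1} | state_t]
obs_mean = intercept + H @ state             # E[(tfp_t, cu_t) | state_t]
```

The first row of `F` reproduces $p_{t+1} = p_t + \eta_t$ and the third row the AR(2) cycle.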
The likelihood function
Given a state-space model with observations $x^T = (x_1, x_2, \dots, x_T)$
and a vector of parameters $\theta$, the
likelihood is defined as the joint density of the sample
for a given $\theta$:
$$L(x^T; \theta) = f(x^T | \theta)$$
The likelihood can be factorised into the product of
conditional densities:
$$L(x^T; \theta) = f(x_1; \theta) \prod_{t=2}^{T} f(x_t | x^{t-1}; \theta)$$
for I(1) models. For I(2) models, the minimum conditioning
set contains $x_1$ and $x_2$ instead of only $x_1$.
Computing the likelihood function
We focus on the term $f(x_t | x^{t-1})$. Since we assume
normally distributed shocks, we only need to characterize
the conditional mean $E(x_t | x^{t-1})$ and the conditional
variance $V(x_t | x^{t-1})$:
$$f(x_t | x^{t-1}; \theta) \propto |V(x_t | x^{t-1})|^{-1/2} \exp\left\{-\tfrac{1}{2} \left(x_t - E(x_t | x^{t-1})\right)' V(x_t | x^{t-1})^{-1} \left(x_t - E(x_t | x^{t-1})\right)\right\}$$
Let $\xi_{t|t-1} \equiv E(\xi_t | x^{t-1})$ and $P_{t|t-1} \equiv V(\xi_t | x^{t-1})$. The
measurement equation implies:
$$x_{t|t-1} = H \xi_{t|t-1} + C Z_t$$
$$V(x_t | x^{t-1}) = H P_{t|t-1} H' + R$$
Hence the problem is to find $\xi_{t|t-1}$ and $P_{t|t-1}$ for all t.
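Using
$$\begin{pmatrix} \xi_t \\ x_t \end{pmatrix} \Big|\, x^{t-1} \sim N\left( \begin{pmatrix} \xi_{t|t-1} \\ H\xi_{t|t-1} + C Z_t \end{pmatrix}, \begin{pmatrix} P_{t|t-1} & P_{t|t-1} H' \\ H P_{t|t-1} & H P_{t|t-1} H' + R \end{pmatrix} \right)$$
the properties of the normal distribution imply:
$$\xi_{t|t} = \xi_{t|t-1} + P_{t|t-1} H' \left(H P_{t|t-1} H' + R\right)^{-1} \left(X_t - H\xi_{t|t-1} - C Z_t\right)$$
$$P_{t|t} = P_{t|t-1} - P_{t|t-1} H' \left(H P_{t|t-1} H' + R\right)^{-1} H P_{t|t-1}$$
and from the transition equation:
$$\xi_{t+1|t} = F \xi_{t|t}$$
$$P_{t+1|t} = F P_{t|t} F' + Q$$
These are the Kalman recursions. Given a starting
point $\xi_{1|0}$ and $P_{1|0}$, the KF produces all quantities $\xi_{t+1|t}$
and $P_{t+1|t}$ necessary for computing the likelihood.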
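The recursions above translate almost line by line into code. The following is a minimal sketch for a stationary or properly initialised model (it does not implement the diffuse initialisation used by GAP); the function name `kalman_loglik` and its arguments are hypothetical, and the exogenous term $CZ_t$ is optional.

```python
import numpy as np

def kalman_loglik(X, F, Q, H, R, xi0, P0, C=None, Z=None):
    """Gaussian log-likelihood of X (T x n) via the Kalman recursions.

    xi0 and P0 play the role of xi_{1|0} and P_{1|0}.
    """
    T, n = X.shape
    xi_pred, P_pred = xi0.copy(), P0.copy()
    loglik = 0.0
    for t in range(T):
        # Conditional mean and variance of x_t given the past
        mean = H @ xi_pred
        if C is not None:
            mean = mean + C @ Z[t]
        S = H @ P_pred @ H.T + R
        err = X[t] - mean
        _, logdet = np.linalg.slogdet(S)
        loglik += -0.5 * (n * np.log(2 * np.pi) + logdet
                          + err @ np.linalg.solve(S, err))
        # Updating step: xi_{t|t}, P_{t|t}
        K = P_pred @ H.T @ np.linalg.inv(S)
        xi_filt = xi_pred + K @ err
        P_filt = P_pred - K @ H @ P_pred
        # Prediction step: xi_{t+1|t}, P_{t+1|t}
        xi_pred = F @ xi_filt
        P_pred = F @ P_filt @ F.T + Q
    return loglik

# Toy example: local level model with made-up variances
X = np.array([[0.5], [-1.0], [2.0]])
F = np.array([[1.0]]); Q = np.array([[0.3]])
H = np.array([[1.0]]); R = np.array([[0.7]])
ll = kalman_loglik(X, F, Q, H, R, xi0=np.zeros(1), P0=np.array([[2.0]]))
```

For this toy model the result can be checked against the joint Normal density of the three observations, whose covariance matrix is known in closed form.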
Initialising the Kalman filter
In order to start the Kalman recursions one needs to
specify the initial conditions
$$\xi_{d+1|d} \equiv E[\xi_{d+1} | X^d]$$
$$P_{d+1|d} \equiv V[\xi_{d+1} | X^d],$$
where d is the maximum order of integration of the
components of the state vector $\xi_t$, i.e. d = 1 in the
TFP-CU model.
Convention: $X^0 = \emptyset$.
In both the frequentist and the Bayesian context, the
likelihood function needs exact filter initialization.
Case 1. d = 0: all entries in the state vector are
stationary
$$\xi_{1|0} = E[\xi_1], \qquad P_{1|0} = Var[\xi_1]$$
Weak stationarity implies $\xi_{1|0} = \xi_{0|0}$ and $P_{1|0} = P_{0|0}$.
Using $P_{1|0} = P_{0|0}$ in the prediction step gives
$$P_{1|0} = F P_{1|0} F' + Q,$$
which can be solved for $P_{1|0}$ as:
$$Vec(P_{1|0}) = [I - F \otimes F]^{-1} Vec(Q),$$
where $Vec$ is the vectorization operator (columns stacked), $I$ is the identity
matrix, and $\otimes$ is the Kronecker product.
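The Vec formula above is straightforward to evaluate numerically. A minimal sketch follows; the function name `stationary_covariance` and the AR(2) example values are illustrative assumptions.

```python
import numpy as np

def stationary_covariance(F, Q):
    """Solve P = F P F' + Q through vec(P) = (I - F kron F)^{-1} vec(Q)."""
    k = F.shape[0]
    vec_Q = Q.reshape(-1, order="F")            # column-stacking vec operator
    vec_P = np.linalg.solve(np.eye(k * k) - np.kron(F, F), vec_Q)
    return vec_P.reshape(k, k, order="F")

# Stationary AR(2) cycle (c_t, c_{t-1}) with made-up coefficients
F = np.array([[1.2, -0.49],
              [1.0, 0.0]])
Q = np.array([[1.0, 0.0],
              [0.0, 0.0]])
P = stationary_covariance(F, Q)
```

The solution only exists when all eigenvalues of `F` lie strictly inside the unit circle, which is why this initialisation is restricted to the stationary case.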
Case 2. d > 0: at least one entry of the state vector
is I(d). The Diffuse Kalman Filter
Let $\xi^*_t$ be the I(d) element of the state. Its unconditional variance
is infinite. Writing $Var[\xi^*_1] = \kappa$ for large $\kappa$, then:
$$\xi_{d+1|d} \equiv \lim_{\kappa \to \infty} E[\xi_{d+1} | \kappa, X^d]$$
$$P_{d+1|d} \equiv \lim_{\kappa \to \infty} Var[\xi_{d+1} | \kappa, X^d]$$
GAP implements the Diffuse Kalman Filter through
an algorithm due to de Jong, Annals of Statistics
(1991).
Bayesian analysis
In practice we observe the data but the parameter $\theta \in \Theta$
is unknown. To make inference about $\theta$, there are two
approaches: sampling theory and Bayesian analysis.
Sampling theory: $\theta$ is a constant. Estimators are defined as
$\hat\theta \equiv \hat\theta(Y)$, i.e. as a function of hypothetical vectors Y that
have the sampling distribution $f(Y|\theta)$.
Inference is made by comparing $\hat\theta(Y = y)$ with the
distribution of $\hat\theta(Y)$ induced by $f(Y|\theta)$.
Example: $Y|\theta \sim N(\theta 1_T, I_T) \Rightarrow \hat\theta = \bar{Y} \sim N(\theta, 1/T)$.
Bayesian analysis: $\theta$ is a random variable with prior
distribution $f(\theta)$ that describes the initial state of knowledge.
The joint distribution $f(Y|\theta) \times f(\theta)$ defines the model.
Inference is made through the posterior distribution of $\theta$,
i.e. the conditional distribution of $\theta$ given the observations y,
$f(\theta|Y = y)$.
Bayes’ theorem
Bayes' theorem tells us how the data update our prior
knowledge of $\theta$:
$$f(\theta|Y) = \frac{f(Y, \theta)}{f(Y)} = \frac{f(Y|\theta) f(\theta)}{f(Y)}$$
where $f(Y) = \int_\Theta f(Y|\theta) f(\theta)\, d\theta$ is a strictly positive
normalizing constant, also known as the marginal likelihood. Since
$f(Y)$ does not depend on $\theta$ we also write:
$$f(\theta|Y) \propto f(Y|\theta) f(\theta) \;\Leftrightarrow\; \text{likelihood} \times \text{prior}$$
To compute $f(\theta|Y)$ all sample information enters through
the likelihood $f(Y|\theta)$: this is the likelihood principle.
Example: the Normal mean
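Assume $(Y_1, Y_2, \dots, Y_T | \theta) \sim iiN(\theta, 1)$. The likelihood
$f(Y|\theta) = \prod_{j=1}^{T} f(Y_j|\theta)$ can be written as
$$f(Y|\theta) = (2\pi)^{-T/2} \exp\left[-\tfrac{1}{2}\left(s^2 + T(\theta - \hat\theta)^2\right)\right]$$
where $\hat\theta = \bar{Y}$ and $s^2 = \sum_j (Y_j - \bar{Y})^2$.
Assume $f(\theta) = N(\theta_0, V_0)$. By Bayes' theorem:
$$f(\theta|Y) \propto f(Y|\theta) f(\theta) \propto \exp\left[-\tfrac{1}{2} T(\theta - \hat\theta)^2 - \tfrac{1}{2V_0}(\theta - \theta_0)^2\right] \propto \exp\left[-\tfrac{1}{2\bar{V}}(\theta - \bar\theta)^2\right]$$
with $\bar{V} = \dfrac{1}{T + 1/V_0}$ and $\bar\theta = \bar{V}\,(T\hat\theta + \theta_0/V_0)$.
i. Normal prior + Normal likelihood ⇒ Normal posterior.
ii. The prior loses importance as $T \to \infty$ or $V_0 \to \infty$.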
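A small numerical illustration of the Normal-mean update and of remark ii follows. The data and hyperparameter values are made up and the function name `normal_mean_posterior` is hypothetical.

```python
import numpy as np

def normal_mean_posterior(y, theta0, V0):
    """Posterior mean and variance of theta for y_j ~ N(theta, 1), theta ~ N(theta0, V0)."""
    T = len(y)
    theta_hat = np.mean(y)                      # sample mean
    V_post = 1.0 / (T + 1.0 / V0)               # posterior variance
    theta_post = V_post * (T * theta_hat + theta0 / V0)
    return theta_post, V_post

y = np.array([0.8, 1.3, 1.1, 0.6, 1.2])         # sample mean is 1.0
theta_post, V_post = normal_mean_posterior(y, theta0=0.0, V0=1.0)

# With a very diffuse prior the posterior mean approaches the sample mean
theta_diffuse, _ = normal_mean_posterior(y, theta0=0.0, V0=1e8)
```

With T = 5 and $V_0 = 1$ the posterior mean is a precision-weighted average that puts weight 5/6 on the sample mean and 1/6 on the prior mean.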
Prior distributions - Informative
An informative prior distribution can be specified using:
• information from previous studies,
• a macroeconomic model suggesting theoretical values,
• a personal view for instance about an elasticity.
Stating an informative prior requires:
• to choose a distribution family,
• to tune the distribution hyper-parameters according to the
information available.
Program GAP offers graphical tools for prior elicitation.
Informative priors - Examples
Beta-distribution: if $\beta \in [0, 1]$, $\beta \sim Beta(a, b)$, $a, b > 0$:
$$f(\beta) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \beta^{a-1} (1 - \beta)^{b-1}$$
where $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\, dx$.
Moments: $E(\beta^r) = \dfrac{\Gamma(a+r)\,\Gamma(a+b)}{\Gamma(a)\,\Gamma(a+b+r)}$, $a + r > 0$.
[Figure: densities on [0, 1] of Beta(3,3) (b) and Beta(10,10) (r); of Beta(2,6) (b) and Beta(6,2) (r)]
Useful for imposing parameter bounds. If $\beta \in (\beta_l, \beta_u)$ instead
of $\beta \in [0, 1]$, set $\beta = \beta_l + \lambda(\beta_u - \beta_l)$, $\lambda \sim Beta(a, b)$.
GAP uses Beta-priors for cycle amplitude and periodicity.
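The rescaling used to impose bounds can be illustrated by simulation. In this sketch the bounds (2, 32) and the hyperparameters are arbitrary illustrative choices and the function names are hypothetical; the analytical mean uses $E(\lambda) = a/(a+b)$, which follows from the moment formula with r = 1.

```python
import numpy as np

def sample_bounded_beta(a, b, lower, upper, size, rng):
    """Draw beta = lower + lambda * (upper - lower) with lambda ~ Beta(a, b)."""
    lam = rng.beta(a, b, size=size)
    return lower + lam * (upper - lower)

def bounded_beta_mean(a, b, lower, upper):
    """Prior mean implied by E(lambda) = a / (a + b)."""
    return lower + (upper - lower) * a / (a + b)

rng = np.random.default_rng(0)
draws = sample_bounded_beta(2.0, 6.0, lower=2.0, upper=32.0, size=50_000, rng=rng)
prior_mean = bounded_beta_mean(2.0, 6.0, lower=2.0, upper=32.0)   # 2 + 30 * 0.25
```

Plotting a histogram of `draws` is a quick way to check whether chosen hyperparameters express the intended prior view.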
Informative priors - Examples
Normal-distribution: $\beta$ is Normally distributed with mean
$\beta_0$ and variance $m_{\beta_0}^{-1}$, i.e. $\beta \sim N(\beta_0, m_{\beta_0}^{-1})$, if its pdf is:
$$f(\beta) = \frac{1}{\sqrt{2\pi m_{\beta_0}^{-1}}} \exp\left\{-\tfrac{1}{2} m_{\beta_0} (\beta - \beta_0)^2\right\}$$
The inverse variance $m_{\beta_0}$ is also known as the precision.
[Figure: densities of N(0,1) and N(0,3) on (−5, 5)]
Bounds can be imposed by truncation.
In GAP, w and ρ are Normal.
Informative priors - Examples
Inverted Gamma-2 (IG) for variance parameters;
$V_u \sim IG(s_0, \nu_0)$, $s_0, \nu_0 > 0$:
$$f(V_u) = \Gamma(\nu_0/2)^{-1} (s_0/2)^{\nu_0/2}\, V_u^{-\frac{1}{2}(\nu_0+2)} \exp\left\{-\frac{s_0}{2V_u}\right\}$$
Moments:
$$E(V_u) = \frac{s_0}{\nu_0 - 2} \quad \text{for } \nu_0 > 2$$
$$V(V_u) = \frac{2}{\nu_0 - 4}\, E(V_u)^2 \quad \text{for } \nu_0 > 4$$
[Figure: densities of IG(0.5,4) and IG(100,202) on (0, 1)]
GAP uses IG-priors for all variance parameters.
Informative priors - Examples
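Normal-Inverted Gamma-2 distribution: $\beta$, $V_u$ are jointly
NIG-distributed with parameters $\beta_0$, $M_{\beta_0}$, $s_0 > 0$ and $\nu_0 > 0$,
i.e. $f(\beta, V_u) = NIG(\beta_0, M_{\beta_0}, s_0, \nu_0)$, if:
$$f(\beta, V_u) \propto V_u^{-\frac{1}{2}(\nu_0 + 2 + d(\beta))} \exp\left\{-\frac{1}{2V_u}\left(s_0 + (\beta - \beta_0)' M_{\beta_0} (\beta - \beta_0)\right)\right\}$$
where $d(\beta)$ is the dimension of $\beta$.
NIG-properties:
$$f(\beta|V_u) = N(\beta_0, V_u M_{\beta_0}^{-1})$$
$$f(V_u) = IG(s_0, \nu_0)$$
$$f(V_u|\beta) = IG(s_0 + (\beta - \beta_0)' M_{\beta_0} (\beta - \beta_0),\; \nu_0 + d(\beta))$$
$$f(\beta) = t(\beta_0, M_{\beta_0}, s_0, \nu_0)$$
Moments of $f(\beta)$: $E(\beta) = \beta_0$, $V(\beta) = E(V_u)\, M_{\beta_0}^{-1} = \dfrac{s_0}{\nu_0 - 2} M_{\beta_0}^{-1}$.
In GAP, $\mu_{CU}$, $\beta$, $V_{CU}$ are NIG.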
Conjugate prior distributions
Definition: If F is a class of sampling distributions f(Y |θ)
and P is a class of prior distributions for θ, then the class P
is conjugate for F if the posterior distribution f(θ|y) belongs
to P for all f(Y |θ) ∈ F and f(θ) ∈ P .
Natural conjugacy if prior, likelihood and posterior belong to
the same class of distributions ⇔ natural conjugate families
are closed for inference.
Convenience: the posterior distribution is in a known
parametric form.
Requirement: the prior information available can be
adequately represented with a distribution of this family.
Examples
• The Beta prior is a conjugate family for binomial likelihoods:
$$p(\theta|y) \propto \theta^{y}(1 - \theta)^{n-y} \times \theta^{a-1}(1 - \theta)^{b-1}, \quad \text{i.e. } \theta|y \sim Beta(y + a,\, n - y + b)$$
• The NIG prior is natural conjugate for Normal likelihoods.
In general, when the data density belongs to the exponential
family
$$f(y|\theta) \propto \exp\left[\sum_{j=1}^{m} t_j(y)\,\phi_j(\theta)\right]$$
as for Normal likelihoods, then prior distributions such as
$$f(\theta) \propto \exp\left\{\sum_{j=1}^{m} b_j\,\phi_j(\theta)\right\}$$
imply the posterior
$$f(\theta|y) \propto \exp\left\{\sum_{j=1}^{m} (b_j + t_j(y))\,\phi_j(\theta)\right\}$$
⇒ likelihoods belonging to the exponential family have
natural conjugate prior distributions.
Parameterization and prior distributions
The parameterization must be adequate for the information available.
Example: consider a cycle whose dynamics are described by
the AR(2) polynomial $\phi(L) = 1 - \phi_1 L - \phi_2 L^2$. According to business
cycle studies, this cycle has a typical length of 8 years and
amplitude 0.8. How to incorporate this prior information?
Impossible on the φ-scale ⇒ re-parameterize
Let $r_{1,2} = A e^{\pm \imath w}$, $A < 1$, $\imath^2 = -1$, $w \in [0, \pi]$, be the
complex roots of $z^2 - \phi_1 z - \phi_2 = 0$. In terms of periodicity
$\tau$, $w = 2\pi/\tau$ and the AR(2) polynomial can be written as:
$$\phi(L) = (1 - A e^{-\imath 2\pi/\tau} L)(1 - A e^{\imath 2\pi/\tau} L) = 1 - 2A\cos(2\pi/\tau) L + A^2 L^2$$
More convenient for inserting prior information: for instance,
$A \sim Beta(a_A, b_A)$ and $\dfrac{\tau - \tau_l}{\tau_u - \tau_l} \sim Beta(a_\tau, b_\tau)$ for $\tau \in (\tau_l, \tau_u)$.
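The mapping between the polar coordinates (A, τ) and the AR(2) coefficients can be sketched as follows. The function names are hypothetical, and the example assumes quarterly data, so that a cycle length of 8 years corresponds to τ = 32 periods.

```python
import numpy as np

def ar2_from_polar(A, tau):
    """AR(2) coefficients implied by amplitude A and periodicity tau."""
    phi1 = 2.0 * A * np.cos(2.0 * np.pi / tau)
    phi2 = -A ** 2
    return phi1, phi2

def polar_from_ar2(phi1, phi2):
    """Inverse mapping, valid when the roots are complex (phi1^2 + 4*phi2 < 0)."""
    A = np.sqrt(-phi2)
    tau = 2.0 * np.pi / np.arccos(phi1 / (2.0 * A))
    return A, tau

# Amplitude 0.8 and an 8-year cycle on quarterly data (assumption: tau = 32)
phi1, phi2 = ar2_from_polar(0.8, 32.0)
roots = np.roots([1.0, -phi1, -phi2])       # roots of z^2 - phi1 z - phi2
A_back, tau_back = polar_from_ar2(phi1, phi2)
```

Both roots have modulus A, so the stationarity constraint reduces to A < 1 in this parameterization.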
What if a flat prior is put on φ1, φ2 under the complex root
constraint?
p(φ1 ∈ (−2, 0)) = 0.5 = p(φ1 ∈ (0, 2))
φ1 ∈ (−2, 0) ⇔ τ ∈ (2, 4) (cos(2π/τ) < 0)
φ1 ∈ (0, 2) ⇔ τ ∈ (4,∞) (cos(2π/τ) > 0)
(2, 4) and (4,∞) period-ranges receive equal weights: short
term movements are emphasized.
⇒ GAP uses polar coordinates A and τ instead of AR
coefficients
Non-informative prior distributions
Three reasons for being non-informative:
• Ignorance
• Make the analysis as objective as possible
• Assess the prior influence on posterior inference
We would like inference to be only driven by the likelihood
function ⇒ a candidate prior is uniform.
Problem: as in general the shape of the likelihood depends
on the parameter scale, for which parameterization should
the uniform distribution be specified?
General solution: Jeffreys' non-informative prior, which is invariant
to the parameterization.
Example: p(V ) ∝ 1/V ⇔ p(log V ) ∝ 1.
Program GAP does not use non-informative priors
GAP prior assumptions for the TFP - CU model
The following block-independence structure is imposed:
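$$p(\theta) = p(A)\,p(\tau)\,p(V_c)\,p(w)\,p(\rho)\,p(V_\eta)\,p(\mu_{cu}, \beta, V_{cu})$$
$$p(A) = Beta(a_A, b_A)$$
$$p\left(\frac{\tau - 2}{T - 2}\right) = Beta(a_\tau, b_\tau)$$
$$p(V_c) = IG(s_{c0}, v_{c0})$$
$$p(w) = N(w_0, V_{w0})$$
$$p(\rho) = N(\rho_0, V_{\rho 0})$$
$$p(V_\eta) = IG(s_{\eta 0}, v_{\eta 0})$$
$$p(\mu_{cu}, \beta, V_{cu}) = NIG(\delta_0, M_0^{-1}, s_0, v_0)$$
The periodicity $\tau$ belongs to the interval [2, T].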
The GAP menu Prior enables users to set hyperparameters
and to save & load priors.
Bayesian inference
Given a prior $p(\theta)$ on parameter $\theta$, we are interested in the
posterior $p(\theta|y)$. In some cases, $p(\theta|y)$ is known in closed
form.
Example: the linear regression
$$y_t = \beta' x_t + u_t, \qquad u_t | V_u \sim iiN(0, V_u)$$
Let $y = (y_1, y_2, \dots, y_T)'$ and $x = (x_1, x_2, \dots, x_T)'$. The
sampling density verifies:
$$f(y|x, \beta, V_u) \propto V_u^{-T/2} \exp\left\{-\frac{1}{2V_u}(y - x\beta)'(y - x\beta)\right\}$$
$$\propto V_u^{-T/2} \exp\left\{-\frac{1}{2V_u}\left[ssr + (\beta - \hat\beta)' x'x (\beta - \hat\beta)\right]\right\}$$
$$= NIG(\hat\beta, x'x, ssr, T - 2 - d(\beta))$$
where $\hat\beta = (x'x)^{-1} x'y$, and $ssr = (y - x\hat\beta)'(y - x\hat\beta)$.
Posteriors with the natural conjugate prior
Assume $(\beta, V_u) \sim NIG(\beta_0, M_0, s_0, \nu_0)$:
$$f(\beta, V_u) = f(\beta|V_u) f(V_u) = N(\beta_0, V_u M_0^{-1}) \times IG(s_0, \nu_0)$$
NIG-Theorem: the posterior density $f(\beta, V_u | y, x)$ for
the linear regression model with NIG prior is a
$NIG(\beta_*, M_*, s_*, \nu_*)$ with hyperparameters:
$$M_* = M_0 + x'x$$
$$\beta_* = M_*^{-1}(M_0\beta_0 + x'x\hat\beta)$$
$$s_* = s_0 + ssr + (\beta_0 - \hat\beta)'[M_0^{-1} + (x'x)^{-1}]^{-1}(\beta_0 - \hat\beta)$$
$$\nu_* = \nu_0 + T$$
GAP uses the NIG-theorem for building posterior samples.
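The hyperparameter update of the NIG-theorem, and a joint draw of $(\beta, V_u)$ from the resulting posterior, can be sketched as below. This is an illustration under stated assumptions rather than GAP's implementation: the function names and the simulated data are hypothetical, and the draw of $V_u$ uses the fact that under the IG density above $s/V_u$ follows a chi-square distribution with ν degrees of freedom.

```python
import numpy as np

def nig_posterior(y, x, beta0, M0, s0, nu0):
    """Posterior NIG hyperparameters for the linear regression y = x beta + u."""
    T = len(y)
    xtx = x.T @ x
    beta_hat = np.linalg.solve(xtx, x.T @ y)            # OLS estimate
    resid = y - x @ beta_hat
    ssr = resid @ resid
    M_star = M0 + xtx
    beta_star = np.linalg.solve(M_star, M0 @ beta0 + xtx @ beta_hat)
    diff = beta0 - beta_hat
    middle = np.linalg.inv(np.linalg.inv(M0) + np.linalg.inv(xtx))
    s_star = s0 + ssr + diff @ middle @ diff
    nu_star = nu0 + T
    return beta_star, M_star, s_star, nu_star

def draw_nig(beta_star, M_star, s_star, nu_star, rng):
    """One draw: V_u ~ IG(s*, nu*), then beta | V_u ~ N(beta*, V_u M*^{-1})."""
    V_u = s_star / rng.chisquare(nu_star)
    beta = rng.multivariate_normal(beta_star, V_u * np.linalg.inv(M_star))
    return beta, V_u

# Simulated data (made-up values)
rng = np.random.default_rng(1)
x = np.column_stack([np.ones(30), rng.normal(size=30)])
y = x @ np.array([1.0, 2.0]) + rng.normal(size=30)
beta0, M0 = np.zeros(2), np.eye(2)
beta_star, M_star, s_star, nu_star = nig_posterior(y, x, beta0, M0, s0=1.0, nu0=3.0)
beta_draw, V_draw = draw_nig(beta_star, M_star, s_star, nu_star, rng)
```

Because the posterior is available in closed form, these draws are independent and need no convergence check.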
From the NIG-properties, $f(\beta, V_u | y, x) = NIG(\beta_*, M_*, s_*, \nu_*)$
implies:
$$f(\beta | y, x) = t(\beta_*, M_*, s_*, \nu_*)$$
$$f(V_u | y, x) = IG(s_*, \nu_*)$$
with first two moments:
$$E(\beta | y, x) = \beta_*; \qquad V(\beta | y, x) = \frac{s_*}{\nu_* - 2} M_*^{-1}$$
$$E(V_u | y, x) = \frac{s_*}{\nu_* - 2}; \qquad V(V_u | y, x) = \frac{2}{\nu_* - 4}\, E(V_u | y, x)^2$$
i. Frequentist interpretation: posterior ⇔ likelihood of the data
extended with a pre-sample whose likelihood coincides with the
NIG-prior on $\beta$, $V_u$.
ii. $V(\beta | y, x)$ may be larger than $V(\beta)$: conflict of information
between prior and likelihood ⇒ update your prior!
iii. The larger $|\beta_0 - \hat\beta|$ is, the larger $E(V_u | y, x)$.
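Proof of the NIG-theorem (skip): By Bayes' theorem
$f(\beta, V_u | y, x) \propto f(y | x, \beta, V_u) f(\beta, V_u)$, hence
$$f(\beta, V_u | y, x) \propto V_u^{-T/2} \exp\left\{-\frac{1}{2V_u}\left[ssr + (\beta - \hat\beta)' x'x (\beta - \hat\beta)\right]\right\} \times V_u^{-\frac{1}{2}(\nu_0 + 2 + d(\beta))} \exp\left\{-\frac{1}{2V_u}\left(s_0 + (\beta - \beta_0)' M_0 (\beta - \beta_0)\right)\right\}$$
$$\propto V_u^{-\frac{T + \nu_0 + d(\beta) + 2}{2}} \exp\left\{-\frac{1}{2V_u}\left[s_0 + ssr + (\beta - \hat\beta)' x'x (\beta - \hat\beta) + (\beta - \beta_0)' M_0 (\beta - \beta_0)\right]\right\}$$
Notice that (see Bauwens et al. 1999, pp 58-59)
$$(\beta - \hat\beta)' x'x (\beta - \hat\beta) + (\beta - \beta_0)' M_0 (\beta - \beta_0) = (\beta - \beta_*)' M_* (\beta - \beta_*) + \beta_0' M_0 \beta_0 + \hat\beta' x'x \hat\beta - \beta_*' M_* \beta_*$$
and
$$\beta_0' M_0 \beta_0 + \hat\beta' x'x \hat\beta - \beta_*' M_* \beta_* = (\beta_0 - \hat\beta)'[M_0^{-1} + (x'x)^{-1}]^{-1}(\beta_0 - \hat\beta)$$
Hence:
$$f(\beta, V_u | y, x) \propto V_u^{-\frac{T + \nu_0 + d(\beta) + 2}{2}} \exp\left\{-\frac{1}{2V_u}\left[(\beta - \beta_*)' M_* (\beta - \beta_*) + s_0 + ssr + (\beta_0 - \hat\beta)'[M_0^{-1} + (x'x)^{-1}]^{-1}(\beta_0 - \hat\beta)\right]\right\}$$
$$= NIG(\beta_*, M_*, s_*, \nu_*)$$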
Sampling from unknown distributions: MCMC
Most often however, the posterior f(θ|y) is of unknown form.
One possibility is to approximate it by sampling.
Samples from the posterior f(θ|y) make possible inference
for any quantity of interest, like marginals or functions of θ.
Markov Chain Monte Carlo is a class of algorithms for
sampling from unknown distributions.
The most used MCMC algorithms are Gibbs sampling and
Metropolis-Hastings.
Gibbs sampling
Partition $\theta$ into k blocks: $\theta = (\theta_1, \theta_2, \dots, \theta_k)$. Geman and
Geman (IEEE 1984) show that the iterative scheme:
0. Initialize: $\theta_2^{(0)}, \dots, \theta_k^{(0)}$;
1. Sample from the full conditionals:
$$\theta_1^{(1)} \sim f(\theta_1 | \theta_2^{(0)}, \dots, \theta_k^{(0)}, y)$$
$$\theta_2^{(1)} \sim f(\theta_2 | \theta_1^{(1)}, \theta_3^{(0)}, \dots, \theta_k^{(0)}, y)$$
$$\vdots$$
$$\theta_k^{(1)} \sim f(\theta_k | \theta_1^{(1)}, \theta_2^{(1)}, \dots, \theta_{k-1}^{(1)}, y)$$
2. Repeat G times to obtain $\theta_1^{(g)}, \theta_2^{(g)}, \dots, \theta_k^{(g)}$, $g = 1, 2, \dots, G$;
is such that $f_G(\theta|y) \to f(\theta|y)$ as $G \to \infty$.
Requirement: sampling from full conditionals is feasible.
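The scheme above can be illustrated on a toy problem with k = 2 blocks. The example is not taken from GAP: it assumes $y_j \sim N(\theta, V)$ with independent priors $\theta \sim N(\theta_0, V_0)$ and $V \sim IG(s_0, \nu_0)$, for which both full conditionals are standard; all names and values are hypothetical.

```python
import numpy as np

def gibbs_normal(y, theta0, V0, s0, nu0, G, burn_in, rng):
    """Two-block Gibbs sampler for the mean theta and variance V of a Normal sample."""
    T, ybar = len(y), y.mean()
    V = y.var()                                  # step 0: initialise the second block
    draws = np.empty((G, 2))
    for g in range(burn_in + G):
        # theta | V, y is Normal (precision-weighted combination of prior and data)
        V_bar = 1.0 / (T / V + 1.0 / V0)
        theta_bar = V_bar * (T * ybar / V + theta0 / V0)
        theta = rng.normal(theta_bar, np.sqrt(V_bar))
        # V | theta, y is IG(s0 + sum of squares, nu0 + T); s / V ~ chi-square(nu)
        s = s0 + np.sum((y - theta) ** 2)
        V = s / rng.chisquare(nu0 + T)
        if g >= burn_in:
            draws[g - burn_in] = theta, V
    return draws

rng = np.random.default_rng(2)
y = rng.normal(2.0, 1.5, size=200)               # simulated data
draws = gibbs_normal(y, theta0=0.0, V0=100.0, s0=1.0, nu0=3.0,
                     G=2000, burn_in=500, rng=rng)
post_mean_theta, post_mean_V = draws.mean(axis=0)
```

With a diffuse prior the posterior means should lie close to the sample mean and sample variance, which gives a simple sanity check on the sampler.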
The Metropolis-Hastings algorithm
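0. Choose a starting point $\theta^{(0)}$;
1. Generate a candidate $\theta^c$ from an importance density $q(\cdot)$;
2. Evaluate the ratio
$$\alpha = \min\left\{\frac{f(\theta^c|y)/q(\theta^c)}{f(\theta^{(0)}|y)/q(\theta^{(0)})},\; 1\right\}$$
3. Generate $u \sim U(0, 1)$; if $u < \alpha$ set $\theta^{(1)} = \theta^c$, else
$\theta^{(1)} = \theta^{(0)}$;
4. Repeat steps 1-3 starting from the current draw and return $\{\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(G)}\}$.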
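The steps above can be sketched for an independence chain. The target and candidate densities below are toy choices made for illustration (a Normal target known up to a constant and a fatter-tailed Student-t candidate); the function names are hypothetical. Working with logs avoids numerical underflow in the ratio.

```python
import numpy as np

def mh_independence(log_target, log_q, draw_q, theta0, G, rng):
    """Metropolis-Hastings with a candidate density q that ignores the current state."""
    theta = theta0
    log_w = log_target(theta) - log_q(theta)     # log of f(theta|y) / q(theta)
    chain = np.empty(G)
    accepted = 0
    for g in range(G):
        cand = draw_q(rng)
        log_w_cand = log_target(cand) - log_q(cand)
        # Accept with probability min{ratio, 1}
        if np.log(rng.uniform()) < log_w_cand - log_w:
            theta, log_w = cand, log_w_cand
            accepted += 1
        chain[g] = theta
    return chain, accepted / G

# Toy target: N(1, 0.5^2) kernel; candidate: Student-t with 3 degrees of freedom
log_target = lambda th: -0.5 * ((th - 1.0) / 0.5) ** 2
log_q = lambda th: -2.0 * np.log1p(th ** 2 / 3.0)    # t(3) log-kernel
draw_q = lambda rng: rng.standard_t(3)

rng = np.random.default_rng(3)
chain, acc_rate = mh_independence(log_target, log_q, draw_q, theta0=0.0, G=20_000, rng=rng)
kept = chain[2_000:]                                  # discard a burn-in
```

Only kernels are needed for both densities, since normalizing constants cancel in the acceptance ratio.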
Choice of the candidate-generating density
i. Random walk chain: θc = θ(0) + white noise;
ii. Independence chain: choose q(·) independent of θ(0);
iii. Factorizing: if $f(\theta|y) = f_1(\theta) f_2(\theta)$ with $f_1(\theta)$ uniformly
bounded, then taking $q(\cdot) = f_2(\theta)$ yields
$$\alpha = \min\left\{\frac{f_1(\theta^c)}{f_1(\theta^{(0)})},\; 1\right\}$$
This is the variant used in GAP.
Most difficult step: setting the scale of the importance density.
This impacts the rate of acceptance.
In general the candidate-generating density must have fatter
tails than the target density.
Rule of thumb: target an acceptance ratio in (.20, .45).
Care: convergence may fail in spite of a good acceptance
ratio.
Bayesian inference in the TFP-CU model
$$tfp_t = p_t + c_t$$
$$cu_t = \mu_{cu} + \beta c_t + a_{cu,t}$$
$$p_t = p_{t-1} + \eta_{t-1}; \qquad \eta_t = w(1 - \rho) + \rho\,\eta_{t-1} + a_{\eta t}$$
$$c_t = 2A\cos(2\pi/\tau)\, c_{t-1} - A^2 c_{t-2} + a_{ct}$$
Let $\theta$ denote the model parameters $\theta = (A, \tau, V_c, w, \rho, V_\eta, \mu_{cu}, \beta, V_{cu})$,
$\xi_t$ the unobservables $\xi_t = (c_t, p_t)$, and $y^T = (tfp^T, cu^T)$.
Our target is the joint posterior $p(\xi^T, \theta | y^T)$. Since no closed
form exists we resort to a Gibbs scheme:
• $p(\xi^T | \theta, y^T)$
• $p(\theta | \xi^T, y^T)$
Sampling the state
$p(\xi^T | \theta, y^T)$ is Normal but needs a $T \times T$ covariance matrix.
⇒ use the Carter and Kohn (Biometrika, 1994) state sampler:
$$p(\xi^T | \theta, y^T) = p(\xi_T | \theta, y^T) \prod_{t=1}^{T-1} p(\xi_t | \theta, y^t, \xi_{t+1})$$
1. compute $\xi_{t|t}$, $P_{t|t}$, $t = 2, 3, \dots, T$, via the Kalman filter;
2. sample $\xi_T \sim p(\xi_T | \theta, y^T) = N(\xi_{T|T}, P_{T|T})$;
3. sample $\xi_t \sim p(\xi_t | \theta, \xi_{t+1}, y^t) = N(E[\xi_t | \theta, \xi_{t+1}, y^t], V[\xi_t | \theta, \xi_{t+1}, y^t])$
backwards in time, $t = T - 1, \dots, 2$; for the
first two moments, use Normal properties;
4. sample $\xi_1 \sim p(\xi_1 | \theta, \xi_2, y^1) = N(E[\xi_1 | \theta, \xi_2, y^1], V[\xi_1 | \theta, \xi_2, y^1])$.
The last step requires $\xi_{1|1}$ and $P_{1|1}$: use Koopman (1997, JASA).
Sampling parameters given the state
GAP exploits the model parameterization, prior block-independence
and the likelihood factorization to build 3 parameter
blocks:
$$p(A, \tau, V_c, w, \rho, V_\eta, \mu_{cu}, \beta, V_{cu} | \xi^T, y^T) = p(A, \tau, V_c | c^T) \times p(w, \rho, V_\eta | p^T) \times p(\mu_{cu}, \beta, V_{cu} | c^T, cu^T)$$
Let us focus on the CU equation parameters. We can write:
$$p(\mu_{cu}, \beta, V_{cu} | c^T, cu^T) \propto p(cu^T | c^T, \mu_{cu}, \beta, V_{cu})\, p(\mu_{cu}, \beta, V_{cu}) \propto NIG(\delta_*, M_*, s_*, v_*)$$
with hyperparameters given by the NIG theorem (see also the next
slide).
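(skip) Let Z denote the $T \times 2$ matrix of regressors in the
CU equation and $\hat\delta$ be the OLS estimate of $(\mu_{cu}, \beta)$. From the
NIG-theorem:
$$p(\mu_{cu}, \beta, V_{cu} | c^T, cu^T) \propto NIG(\delta_*, M_*, s_*, v_*)$$
with
$$M_* = M_0 + Z'Z$$
$$\delta_* = M_*^{-1}[M_0\delta_0 + Z'Z\hat\delta]$$
$$v_* = v_0 + T - 2$$
$$s_* = s_0 + (cu^T)'(I_T - Z(Z'Z)^{-1}Z')\,cu^T + (\delta_0 - \hat\delta)'[M_0^{-1} + (Z'Z)^{-1}]^{-1}(\delta_0 - \hat\delta)$$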
Sampling parameters given the state
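For $p(w, \rho, V_\eta | p^T)$, we use:
• $p(w | \rho, V_\eta, p^T) \propto p(p^T | w, \rho, V_\eta) \times p(w) = N(w^*, V_w^*)$
where $w^*$ and $V_w^*$ can be worked out from the kernel identity:
$$\frac{1}{V_\eta} \sum_{t=3}^{T} \left(\Delta p_t - \rho\,\Delta p_{t-1} - w(1 - \rho)\right)^2 + \frac{1}{V_{w0}}(w - w_0)^2 = \text{constant} + \frac{(w - w^*)^2}{V_w^*}$$
• $p(\rho | w, V_\eta, p^T) \propto p(p^T | \rho, w, V_\eta) \times p(\rho)$
$\propto p(p_1^2 | \rho, w, V_\eta)\; p(p_3^T | p_1^2, \rho, w, V_\eta)\; p(\rho)$
$\propto p(p_1, p_2 | \rho, w, V_\eta) \times N(\rho^*, V_\rho^*)$
⇒ use MH with $N(\rho^*, V_\rho^*)$ as proposal: MH within Gibbs.
• $p(V_\eta | p^T, w, \rho) \propto p(p^T | w, \rho, V_\eta) \times p(V_\eta)$
$\propto IG(s_{\eta 0} + \sum a_\eta^2,\; v_{\eta 0} + T - 1)$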
Sampling parameters given the state
Cycle parameters: the full conditional of $V_c$ is still an IG, but sampling A and
τ needs either an MH or an ARMS step.
Full details can be read in Planas, Rossi and Fiorentini,
Journal of Business & Economic Statistics (2008).
Sampling in practice
• Set a burn-in period of B iterations and save the next G.
• Select a thinning t, i.e. record every t iterations.
• Monitor chain convergence: a failure invalidates inference.
Non-convergence indicates that some region of the sample
space is poorly explored. Chains may not converge because
of long lasting autocorrelations.
In case of non-convergence: increase the burn-in, change the
parameterization, implement a different MCMC scheme, ....
Convergence diagnostics
1. Visual inspection by plotting cumulated posterior means
$\frac{1}{g}\sum_{j=1}^{g} \theta^{(j)}$, $g = 1, 2, \dots, G$.
2. Geweke convergence diagnostic (1992): compare the
mean of the first $n_1$ elements against the mean of the last
$n_2$:
$$\bar\theta_1 = \frac{1}{n_1}\sum_{j=1}^{n_1} \theta^{(j)}; \qquad \bar\theta_2 = \frac{1}{n_2}\sum_{j=G-n_2+1}^{G} \theta^{(j)}$$
with $n_1 + n_2 < G$. As $G \to \infty$ while $n_1/G$ and $n_2/G$ remain fixed,
$$Z = \frac{\bar\theta_2 - \bar\theta_1}{\sqrt{V(\bar\theta_1) + V(\bar\theta_2)}} \to N(0, 1)$$
Large values of |Z| indicate lack of convergence.
Geweke suggests n1 = G/5 and n2 = G/2.
Convergence diagnostics
GAP reports:
• Geweke CD (as p-value).
• Chain autocorrelations.
• NSE: the numerical standard error of the posterior mean,
computed using autocovariances up to a lag equal to 4% of the recorded
simulations.
• RNE: the relative numerical efficiency, i.e. the ratio of the
variance of the posterior mean under the iid hypothesis to the
squared NSE. Values close to 1 indicate high efficiency.
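The three diagnostics can be sketched as follows. This is a simplified illustration, not GAP's estimator: the Bartlett weighting of the autocovariances is an assumption made here to keep the variance estimate non-negative, the fractions 1/5 and 1/2 follow the slide on the Geweke diagnostic, and all function names are hypothetical.

```python
import numpy as np

def nse(chain, lag_frac=0.04):
    """Numerical standard error of the chain mean from weighted autocovariances."""
    x = np.asarray(chain, dtype=float)
    G = len(x)
    xc = x - x.mean()
    L = max(1, int(lag_frac * G))               # truncation lag: 4% of the draws
    long_run_var = xc @ xc / G
    for k in range(1, L + 1):
        gamma_k = xc[k:] @ xc[:-k] / G          # lag-k autocovariance
        long_run_var += 2.0 * (1.0 - k / (L + 1)) * gamma_k   # Bartlett weight (assumption)
    return np.sqrt(long_run_var / G)

def rne(chain, lag_frac=0.04):
    """Variance of the mean under iid sampling divided by the squared NSE."""
    x = np.asarray(chain, dtype=float)
    return (x.var() / len(x)) / nse(x, lag_frac) ** 2

def geweke_z(chain, frac1=0.2, frac2=0.5, lag_frac=0.04):
    """Z statistic comparing the mean of the first and last parts of the chain."""
    x = np.asarray(chain, dtype=float)
    G = len(x)
    first, last = x[: int(frac1 * G)], x[G - int(frac2 * G):]
    return (last.mean() - first.mean()) / np.sqrt(
        nse(first, lag_frac) ** 2 + nse(last, lag_frac) ** 2)

rng = np.random.default_rng(4)
iid_chain = rng.normal(size=20_000)
ar_chain = np.empty(20_000)                      # strongly autocorrelated AR(1) chain
ar_chain[0] = 0.0
for g in range(1, len(ar_chain)):
    ar_chain[g] = 0.9 * ar_chain[g - 1] + rng.normal()
drifting_chain = np.linspace(0.0, 1.0, 5_000) + 0.1 * rng.normal(size=5_000)

rne_iid, rne_ar = rne(iid_chain), rne(ar_chain)
z_iid, z_drift = geweke_z(iid_chain), geweke_z(drifting_chain)
```

An iid chain should give an RNE near 1 and a small |Z|, the autocorrelated chain a much lower RNE, and the drifting chain a large |Z|.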
Estimates from MCMC
Any quantity of interest can be derived from the posterior
samples: for instance marginal distributions and related
moments, as in
$$E[\theta_1 | y] = \frac{1}{G}\sum_{j=1}^{G} \theta_1^{(j)}, \qquad f(\theta_1) = \frac{1}{G}\sum_{j=1}^{G} I_{\theta_1^{(j)} \in (\theta_1 - \delta,\, \theta_1 + \delta)}$$
Hypothesis testing via the Highest Posterior Density (HPD) region
with probability content α: the smallest interval R such that
$$p(\beta \in R | y) = \int_R p(\beta|y)\, d\beta = \alpha$$
⇒ accept $H_0: \beta \in I$ if $I \subset R$.
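From posterior draws, an HPD interval can be approximated by the shortest interval that contains the required share of the sorted sample. The sketch below assumes a unimodal posterior (otherwise the HPD region need not be an interval); the function name `hpd_interval` and the simulated draws are hypothetical.

```python
import numpy as np

def hpd_interval(draws, content=0.95):
    """Shortest interval containing a share `content` of the draws."""
    x = np.sort(np.asarray(draws, dtype=float))
    G = len(x)
    m = int(np.ceil(content * G))               # number of draws inside the interval
    widths = x[m - 1:] - x[: G - m + 1]         # width of every candidate interval
    j = int(np.argmin(widths))
    return x[j], x[j + m - 1]

rng = np.random.default_rng(5)
sym_lo, sym_hi = hpd_interval(rng.normal(size=100_000))          # symmetric posterior
skew_lo, skew_hi = hpd_interval(rng.exponential(size=100_000))   # skewed posterior
```

For a symmetric posterior the result is close to the equal-tailed interval, while for a skewed one the HPD interval shifts towards the mode.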
In output, GAP shows priors and posteriors, posterior
mode, highest posterior regions, unobservables posterior
mean, marginal distribution of pt and ct, forecasts, CU
equation fit, ... All graphics are exportable in PS-format.
Two important checks: chain convergence and prior-posterior
congruency.
Enjoy GAP!