Course on Program Bayesian GAP and its use for detrending TFP

C. Planas and A. Rossi

Joint Research Centre

European Commission

Course objective

To explain the Bayesian estimation of TFP cycle as

implemented in Program GAP.

For this we will:

• review the model and the time series tools involved in

detrending TFP;

• introduce Bayesian concepts;

• use the Bayesian module of Program GAP to detrend TFP

Course outline

• The bivariate TFP-CU model;

• The likelihood function via state space modelling and

Kalman filtering;

• Introduction to the Bayesian framework;

• Parameterization and prior distributions;

• From prior to posterior via Gibbs sampling;

• Convergence diagnostics;

• Program Bayesian GAP;

• Application to Member States data.

• Free time for open discussion.

The TFP - CU model

Cobb-Douglas production function:

\[ Y = TFP \cdot K^{1-\alpha} L^{\alpha} \qquad (1) \]

Y: output, K: capital, L: labour. TFP contains short-term

fluctuations C and persistent efficiency improvements P:

TFP = P × C

Let $CU_K$ and $CU_L$ denote the variations in the capacity utilization of capital and of labour. Writing (1) as:

\[ Y = P\,(CU_K \times K)^{1-\alpha}\,(CU_L \times L)^{\alpha} \]

suggests a link between the TFP gap and CU such that:

\[ C = CU_K^{1-\alpha}\, CU_L^{\alpha} \]

Problem: only aggregated CU is available, not $CU_K$ or $CU_L$.

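We assume high correlation between CU and $CU_K$, and:

\[ cu_L = \gamma\, cu_K + \varepsilon, \qquad 0 < \gamma < 1 \]

where $x = \log X$ and $\varepsilon$ is a random shock. Taking $cu$ as a proxy for $cu_K$, the TFP gap is related to capacity utilization through:

\[ c = (1-\alpha)\,cu_K + \alpha\,cu_L = (1 - \alpha + \alpha\gamma)\,cu + \alpha\varepsilon \]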

Hence we consider the bivariate model - or measurement

equations:

\[ tfp_t = p_t + c_t \]
\[ cu_t = \mu_{cu} + \beta c_t + e_{cu,t} \qquad (2) \]

with

\[ \beta = \frac{1}{1 - \alpha(1-\gamma)} \]

Notice: $0 < \alpha < 1$ and $0 < \gamma < 1$ imply $\beta > 1$.
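As a quick numerical check of the loading on the cycle in the CU equation, the following sketch evaluates $\beta = 1/(1-\alpha(1-\gamma))$. The function name `cu_loading` and the values of `alpha` and `gamma` are illustrative assumptions, not GAP settings.

```python
def cu_loading(alpha: float, gamma: float) -> float:
    """Return beta = 1 / (1 - alpha * (1 - gamma)) for 0 < alpha, gamma < 1."""
    if not (0.0 < alpha < 1.0 and 0.0 < gamma < 1.0):
        raise ValueError("alpha and gamma must lie strictly between 0 and 1")
    return 1.0 / (1.0 - alpha * (1.0 - gamma))


# Illustrative values only (hypothetical labour share and CU_L sensitivity).
beta = cu_loading(alpha=0.65, gamma=0.5)
print(round(beta, 4))
```

Any admissible pair gives a denominator strictly between 0 and 1, which is why $\beta$ always exceeds 1.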

Unobserved components dynamic - or state equations:

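\[ \Delta p_t = w + \mu_{t-1} \]
\[ \mu_t = \rho\,\mu_{t-1} + a_{\mu,t}, \qquad V(a_{\mu,t}) = V_\mu \]
\[ c_t = \phi_{c1} c_{t-1} + \phi_{c2} c_{t-2} + a_{c,t}, \qquad V(a_{c,t}) = V_c \qquad (3) \]

The TFP mean growth rate is $E(\Delta tfp_t) = E(\Delta p_t) = w$.

For the stochastic term $e_{cu,t}$, we consider:

either $e_{cu,t} = a_{cu,t}$ or $e_{cu,t} = \delta\, e_{cu,t-1} + a_{cu,t}$, with $V(a_{cu,t}) = V_{CU}$.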

For evaluating the likelihood, GAP casts model (1)-(3) into a state space model and runs the Kalman filter.

GAP allows variations around (1)-(3): no CU, noise instead

of AR2-cycle, other trend models.

State Space Models

SSM are defined by a measurement equation like

\[ X_t = H\xi_t + CZ_t + u_t, \qquad (1) \]

where $X_t$ is an $n \times 1$ vector of observations, $\xi_t$ is the $k \times 1$ state vector, $u_t$ is an $n \times 1$ vector of shocks with covariance matrix $R$, and $Z_t$ is a vector of $r$ exogenous variables. The matrices $H$, $C$ and $R$ are of conformable dimensions.

The model is completed with a transition equation for the state

\[ \xi_{t+1} = F\xi_t + v_{t+1}, \qquad (2) \]

where $F$ is the $k \times k$ transition matrix and $v_t$ is a $k \times 1$ vector of shocks with $k \times k$ covariance matrix $Q$. The vectors $u_t$ and $v_t$ are orthogonal at all leads and lags.

Example of State Space Model

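Implicit model in Hodrick-Prescott filtering (Harvey and Jaeger, JAE 1993): $y_t = p_t + u_t$, where $\Delta^2 p_t = a_{p,t}$, and $u_t$, $a_{p,t}$ are uncorrelated white noises.

State equation:

\[
\begin{pmatrix} p_{t+1} \\ p_t \\ u_{t+1} \end{pmatrix}
=
\begin{pmatrix} 2 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\begin{pmatrix} p_t \\ p_{t-1} \\ u_t \end{pmatrix}
+
\begin{pmatrix} a_{p,t+1} \\ 0 \\ u_{t+1} \end{pmatrix}
\]

Measurement equation:

\[
y_t = \begin{pmatrix} 1 & 0 & 1 \end{pmatrix}
\begin{pmatrix} p_t \\ p_{t-1} \\ u_t \end{pmatrix}
\]

The signal to noise ratio is $V(a_{p,t})/V(u_t)$, for instance 1/1600 for quarterly data.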

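Why is I(2)+noise the model implicit in Hodrick-Prescott filtering? - (skip)

Hodrick-Prescott decomposes $x_t$ into $x_t = p_t + u_t$ by minimising the following loss function:

\[ \min_{p_1,\dots,p_T} \; \sum_{t=1}^{T} u_t^2 + \lambda \sum_{t=3}^{T} (\Delta^2 p_t)^2 \]

where $\lambda$ is the inverse signal to noise ratio.

Now consider the following model:

\[ x_t = p_t + u_t \]
\[ \Delta^2 p_t = \lambda^{-1/2} a_t, \qquad V(a_t) = V(u_t) = 1 \]

so that $V(u_t)/V(\Delta^2 p_t) = \lambda$. Assume we want to estimate $p_1, \dots, p_T$ that minimise the sum of squared prediction errors:

\[ \min_{p_1,\dots,p_T} \sum_{t=1}^{T} (x_t - x_{t|t-1})^2 = \min_{p_1,\dots,p_T} \; \sum_{t=1}^{T} u_t^2 + \sum_{t=3}^{T} a_t^2 = \min_{p_1,\dots,p_T} \; \sum_{t=1}^{T} u_t^2 + \lambda \sum_{t=3}^{T} (\Delta^2 p_t)^2 \]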

SS for the TFP - CU model

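The TFP - CU model:

\[ tfp_t = p_t + c_t \]
\[ cu_t = \mu_{cu} + \beta c_t + a_{cu,t} \]
\[ p_t = p_{t-1} + \eta_{t-1}; \qquad \eta_t = \rho\,\eta_{t-1} + a_{\eta,t} \]
\[ c_t = \phi_{c1} c_{t-1} + \phi_{c2} c_{t-2} + a_{c,t} \]

State and measurement equations (for simplicity $w = 0$):

\[
\begin{pmatrix} p_{t+1} \\ \eta_{t+1} \\ c_{t+1} \\ c_t \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & \rho & 0 & 0 \\ 0 & 0 & \phi_{c1} & \phi_{c2} \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} p_t \\ \eta_t \\ c_t \\ c_{t-1} \end{pmatrix}
+
\begin{pmatrix} 0 \\ a_{\eta,t+1} \\ a_{c,t+1} \\ 0 \end{pmatrix}
\]

\[
\begin{pmatrix} tfp_t \\ cu_t \end{pmatrix}
=
\begin{pmatrix} 0 \\ \mu_{cu} \end{pmatrix}
+
\begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & \beta & 0 \end{pmatrix}
\begin{pmatrix} p_t \\ \eta_t \\ c_t \\ c_{t-1} \end{pmatrix}
+
\begin{pmatrix} 0 \\ a_{cu,t} \end{pmatrix}
\]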
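The system matrices above can be assembled directly. The sketch below builds them with NumPy and simulates a short sample; the function names `tfp_cu_system` and `simulate`, and all parameter values, are illustrative assumptions rather than GAP's internal code.

```python
import numpy as np


def tfp_cu_system(rho, phi_c1, phi_c2, beta, mu_cu, v_eta, v_c, v_cu):
    """Matrices for the state xi_t = (p_t, eta_t, c_t, c_{t-1})' with w = 0."""
    F = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, rho, 0.0, 0.0],
                  [0.0, 0.0, phi_c1, phi_c2],
                  [0.0, 0.0, 1.0, 0.0]])
    Q = np.diag([0.0, v_eta, v_c, 0.0])      # state shock covariance
    H = np.array([[1.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, beta, 0.0]])
    const = np.array([0.0, mu_cu])           # measurement intercept
    R = np.diag([0.0, v_cu])                 # measurement shock covariance
    return F, Q, H, const, R


def simulate(F, Q, H, const, R, T, rng):
    """Simulate T observations (tfp_t, cu_t); Q and R are diagonal here."""
    k, n = F.shape[0], H.shape[0]
    xi = np.zeros(k)
    obs = np.empty((T, n))
    for t in range(T):
        obs[t] = const + H @ xi + np.sqrt(np.diag(R)) * rng.standard_normal(n)
        xi = F @ xi + np.sqrt(np.diag(Q)) * rng.standard_normal(k)
    return obs


# Hypothetical parameter values chosen only for illustration.
F, Q, H, const, R = tfp_cu_system(rho=0.5, phi_c1=1.13, phi_c2=-0.64, beta=1.5,
                                  mu_cu=0.0, v_eta=1e-4, v_c=1e-3, v_cu=1e-3)
obs = simulate(F, Q, H, const, R, T=50, rng=np.random.default_rng(0))
```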

The likelihood function

Given a state-space model with observations $x^T = (x_1, x_2, \dots, x_T)$ and a vector of parameters $\theta$, the likelihood is defined as the joint density of the sample for a given $\theta$:

\[ L(x^T; \theta) = f(x^T|\theta) \]

The likelihood can be factorised into the product of conditional densities:

\[ L(x^T; \theta) = f(x_1; \theta) \prod_{t=2}^{T} f(x_t | x^{t-1}; \theta) \]

for I(1) models. For I(2) models, the minimum conditioning set contains $x_1$ and $x_2$ instead of only $x_1$.

Computing the likelihood function

We focus on the term $f(x_t|x^{t-1})$. Since we assume normally distributed shocks, we only need to characterize the conditional mean $E(x_t|x^{t-1})$ and the conditional variance $V(x_t|x^{t-1})$:

\[ f(x_t|x^{t-1}; \theta) \propto |V(x_t|x^{t-1})|^{-1/2} \exp\Big\{ -\tfrac{1}{2} \big(x_t - E(x_t|x^{t-1})\big)' V(x_t|x^{t-1})^{-1} \big(x_t - E(x_t|x^{t-1})\big) \Big\} \]

Let $\xi_{t|t-1} \equiv E(\xi_t|x^{t-1})$ and $P_{t|t-1} \equiv V(\xi_t|x^{t-1})$. The measurement equation implies:

\[ x_{t|t-1} = H\,\xi_{t|t-1} + C Z_t \]
\[ V(x_t|x^{t-1}) = H P_{t|t-1} H' + R \]

Hence the problem is to find $\xi_{t|t-1}$ and $P_{t|t-1}$ for all $t$.

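Using

\[
\begin{pmatrix} \xi_t \\ x_t \end{pmatrix} \Big|\, x^{t-1} \sim N\left(
\begin{pmatrix} \xi_{t|t-1} \\ H\xi_{t|t-1} + CZ_t \end{pmatrix},
\begin{pmatrix} P_{t|t-1} & P_{t|t-1}H' \\ HP_{t|t-1} & HP_{t|t-1}H' + R \end{pmatrix}
\right)
\]

the properties of the normal distribution imply:

\[ \xi_{t|t} = \xi_{t|t-1} + P_{t|t-1}H'\,(HP_{t|t-1}H' + R)^{-1}\,(x_t - H\xi_{t|t-1} - CZ_t) \]
\[ P_{t|t} = P_{t|t-1} - P_{t|t-1}H'\,(HP_{t|t-1}H' + R)^{-1}\,HP_{t|t-1} \]

and from the transition equation:

\[ \xi_{t+1|t} = F\xi_{t|t} \]
\[ P_{t+1|t} = FP_{t|t}F' + Q \]

These are the Kalman recursions. Given a starting point $\xi_{1|0}$ and $P_{1|0}$, the KF produces all quantities $\xi_{t+1|t}$ and $P_{t+1|t}$ necessary for computing the likelihood.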
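The recursions can be sketched in a few lines of NumPy. This is a minimal illustration for a stationary model with known $\xi_{1|0}$ and $P_{1|0}$ and no exogenous term $CZ_t$; the function name `kalman_loglik` is an assumption, and the diffuse initialisation used by GAP is not covered.

```python
import numpy as np


def kalman_loglik(x, F, Q, H, R, xi0, P0):
    """Gaussian log-likelihood of x (T x n) via the Kalman recursions."""
    T, n = x.shape
    xi, P = xi0.copy(), P0.copy()            # xi_{t|t-1}, P_{t|t-1}
    loglik = 0.0
    for t in range(T):
        err = x[t] - H @ xi                  # one-step prediction error
        S = H @ P @ H.T + R                  # V(x_t | x^{t-1})
        S_inv = np.linalg.inv(S)
        loglik += -0.5 * (n * np.log(2 * np.pi)
                          + np.linalg.slogdet(S)[1]
                          + err @ S_inv @ err)
        K = P @ H.T @ S_inv
        xi_filt = xi + K @ err               # xi_{t|t}
        P_filt = P - K @ H @ P               # P_{t|t}
        xi = F @ xi_filt                     # xi_{t+1|t}
        P = F @ P_filt @ F.T + Q             # P_{t+1|t}
    return loglik


# Sanity case: with F = 0, Q = 1, R = 0 the observations are iid N(0, 1).
x_demo = np.array([[1.0], [2.0]])
ll_demo = kalman_loglik(x_demo,
                        F=np.zeros((1, 1)), Q=np.eye(1),
                        H=np.eye(1), R=np.zeros((1, 1)),
                        xi0=np.zeros(1), P0=np.eye(1))
```

In the sanity case the result must coincide with the sum of two standard Normal log-densities, which gives a simple way to check an implementation.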

Initialising the Kalman filter

In order to start the Kalman recursions one needs to specify the initial conditions

\[ \xi_{d+1|d} \equiv E[\xi_{d+1}|X^d] \]
\[ P_{d+1|d} \equiv V[\xi_{d+1}|X^d], \]

where $d$ is the maximum order of integration of the components of the state vector $\xi_t$, i.e. $d = 1$ in the TFP-CU model.

Convention: $X^0 = \emptyset$.

Whether the context is frequentist or Bayesian, the likelihood function needs exact filter initialization.

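Case 1. d = 0: all entries in the state vector are stationary

\[ \xi_{1|0} = E[\xi_1] \]
\[ P_{1|0} = Var[\xi_1] \]

Weak stationarity implies $\xi_{1|0} = \xi_{0|0}$ and $P_{1|0} = P_{0|0}$. Substituting $P_{0|0} = P_{1|0}$ into the prediction equation $P_{1|0} = FP_{0|0}F' + Q$ gives:

\[ P_{1|0} = FP_{1|0}F' + Q \]

from which $P_{1|0}$ is recovered as:

\[ Vec(P_{1|0}) = [I - F \otimes F]^{-1}\, Vec(Q), \]

where $Vec$ is the vectorization operator (stacking columns), $I$ is the identity matrix, and $\otimes$ is the Kronecker product.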
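The vectorized formula translates directly into NumPy. The sketch below is a minimal illustration; the function name `stationary_cov` is an assumption, and the AR(2) coefficients in the example are hypothetical values for a stationary cycle.

```python
import numpy as np


def stationary_cov(F, Q):
    """Solve P = F P F' + Q through vec(P) = (I - F kron F)^{-1} vec(Q)."""
    k = F.shape[0]
    vec_q = Q.reshape(-1, order="F")                 # column-stacking vec
    vec_p = np.linalg.solve(np.eye(k * k) - np.kron(F, F), vec_q)
    return vec_p.reshape(k, k, order="F")


# Companion form of a stationary AR(2) cycle (illustrative coefficients).
F_c = np.array([[1.13, -0.64],
                [1.0, 0.0]])
Q_c = np.diag([1.0, 0.0])
P_c = stationary_cov(F_c, Q_c)
```

The formula requires all eigenvalues of $F$ to lie inside the unit circle; otherwise $I - F \otimes F$ is singular or the solution is not a covariance matrix.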

Case 2. d > 0: at least one entry of the state vector is I(d). The Diffuse Kalman Filter

Let $\xi^*_t$ be the I(d) element of the state. Its unconditional variance is infinite. Writing $Var[\xi^*_1] = \kappa$ for large $\kappa$, then:

\[ \xi_{d+1|d} \equiv \lim_{\kappa \to \infty} E[\xi_{d+1}|\kappa, X^d] \]
\[ P_{d+1|d} \equiv \lim_{\kappa \to \infty} Var[\xi_{d+1}|\kappa, X^d] \]

GAP implements the Diffuse Kalman Filter through an algorithm due to de Jong, Annals of Statistics (1991).

Bayesian analysis

In practice we observe the data but the parameter $\theta \in \Theta$ is unknown. To make inference about $\theta$, there are two approaches: sampling theory and Bayes' theorem.

Sampling theory: $\theta$ is a constant. Estimators are defined as $\hat\theta \equiv \hat\theta(Y)$, i.e. as a function of hypothetical vectors $Y$ that have the sampling distribution $f(Y|\theta)$.

Inference is made by comparing $\hat\theta(Y = y)$ with the distribution of $\hat\theta(Y)$ induced by $f(Y|\theta)$.

Example: $Y|\theta \sim N(\theta 1_T, I_T) \Rightarrow \hat\theta = \bar{Y} \sim N(\theta, 1/T)$.

Bayesian analysis: $\theta$ is a random variable with prior distribution $f(\theta)$ that describes the initial state of knowledge.

The joint distribution $f(Y|\theta) \times f(\theta)$ defines the model.

Inference is made through the posterior distribution of $\theta$, i.e. the conditional distribution of $\theta$ given the observations $y$, $f(\theta|Y = y)$.

Bayes’ theorem

Bayes' theorem tells us how the data update our prior knowledge of $\theta$:

\[ f(\theta|Y) = \frac{f(Y, \theta)}{f(Y)} = \frac{f(Y|\theta) f(\theta)}{f(Y)} \]

where $f(Y) = \int_\Theta f(Y|\theta) f(\theta)\, d\theta$ is a strictly positive normalizing constant, also known as the marginal likelihood. Since $f(Y)$ does not depend on $\theta$ we also write:

\[ f(\theta|Y) \propto f(Y|\theta) f(\theta) \;\Leftrightarrow\; \text{likelihood} \times \text{prior} \]

To compute $f(\theta|Y)$ all sample information enters through the likelihood $f(Y|\theta)$ - likelihood principle.

Example: the Normal mean

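Assume $(Y_1, Y_2, \dots, Y_T|\theta)$ are independent and identically distributed $N(\theta, 1)$. The likelihood $f(Y|\theta) = \prod_{j=1}^{T} f(Y_j|\theta)$ can be written as

\[ f(Y|\theta) = (2\pi)^{-T/2} \exp\big[-\tfrac{1}{2}\big(T s^2 + T(\theta - \hat\theta)^2\big)\big] \]

where $\hat\theta = \bar{Y}$ and $s^2 = (1/T)\sum_j (Y_j - \bar{Y})^2$.

Assume $f(\theta) = N(\theta_0, V_0)$. By Bayes' theorem:

\[ f(\theta|Y) \propto f(Y|\theta) f(\theta) \]
\[ \propto \exp\big[-\tfrac{1}{2} T (\theta - \hat\theta)^2 - \tfrac{1}{2V_0}(\theta - \theta_0)^2\big] \]
\[ \propto \exp\big[-\tfrac{1}{2\bar{V}}(\theta - \bar\theta)^2\big] \]

with $\bar{V} = \dfrac{1}{T + 1/V_0}$ and $\bar\theta = \bar{V}\,(T\hat\theta + \theta_0/V_0)$.

i. Normal prior + Normal likelihood ⇒ Normal posterior.

ii. The prior loses importance as $T \to \infty$ or $V_0 \to \infty$.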
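The update for the Normal mean is easy to verify numerically. The sketch below implements the two posterior formulas for unit-variance data; the function name `normal_mean_posterior` and the data are illustrative assumptions.

```python
import numpy as np


def normal_mean_posterior(y, theta0, v0):
    """Posterior mean and variance of theta for y_j ~ N(theta, 1), theta ~ N(theta0, v0)."""
    T = len(y)
    theta_hat = float(np.mean(y))
    v_bar = 1.0 / (T + 1.0 / v0)                         # posterior variance
    theta_bar = v_bar * (T * theta_hat + theta0 / v0)    # posterior mean
    return theta_bar, v_bar


y_demo = np.array([1.0, 2.0, 3.0])
theta_bar, v_bar = normal_mean_posterior(y_demo, theta0=0.0, v0=1.0)
# A very diffuse prior leaves the sample mean almost untouched.
theta_flat, _ = normal_mean_posterior(y_demo, theta0=0.0, v0=1e8)
```

With three observations averaging 2 and a N(0, 1) prior, the posterior mean is pulled to 1.5; with a very large prior variance it stays at the sample mean, which illustrates remark ii.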

Prior distributions - Informative

Informative prior distribution can be stated using:

• information from previous studies,

• a macroeconomic model suggesting theoretical values,

• a personal view for instance about an elasticity.

Stating an informative prior requires:

• to choose a distribution family,

• to tune the distribution hyper-parameters according to the

information available.

Program GAP offers graphical tools for prior elicitation.

Informative priors - Examples

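Beta-distribution: if $\beta \in [0, 1]$, $\beta \sim Beta(a, b)$, $a, b > 0$:

\[ f(\beta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \beta^{a-1} (1-\beta)^{b-1} \]

where $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\, dx$.

Moments: $E(\beta^r) = \dfrac{\Gamma(a+r)\,\Gamma(a+b)}{\Gamma(a)\,\Gamma(a+b+r)}$, $a + r > 0$.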

[Figure: two panels of Beta densities on [0, 1]. Left: Beta(3,3) (b) and Beta(10,10) (r). Right: Beta(2,6) (b) and Beta(6,2) (r).]

Useful for imposing parameter-bounds. If β ∈ (βl, βu) instead

of β ∈ [0, 1], set β = βl + λ(βu − βl), λ ∼ Beta(a, b).

GAP uses Beta-priors for cycle amplitude and periodicity.


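Normal-distribution: $\beta$ is Normally distributed with mean $\beta_0$ and variance $m_{\beta_0}^{-1}$, i.e. $\beta \sim N(\beta_0, m_{\beta_0}^{-1})$, if its pdf is:

\[ f(\beta) = \frac{1}{\sqrt{2\pi\, m_{\beta_0}^{-1}}} \exp\big\{-\tfrac{1}{2}\, m_{\beta_0} (\beta - \beta_0)^2\big\} \]

The inverse variance $m_{\beta_0}$ is also known as the precision.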

[Figure: two panels of Normal densities on [−5, 5]: N(0,1) and N(0,3).]

Bounds can be imposed by truncation.

In GAP, w and ρ are Normal.


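Inverted Gamma-2 (IG) for variance parameters; $V_u \sim IG(s_0, \nu_0)$, $s_0, \nu_0 > 0$:

\[ f(V_u) = \Gamma(\nu_0/2)^{-1}\, (s_0/2)^{\nu_0/2}\, V_u^{-\frac{1}{2}(\nu_0+2)} \exp\Big\{-\frac{s_0}{2V_u}\Big\} \]

Moments:

\[ E(V_u) = \frac{s_0}{\nu_0 - 2} \quad \text{for } \nu_0 > 2 \]
\[ V(V_u) = \frac{2}{\nu_0 - 4}\, E(V_u)^2 \quad \text{for } \nu_0 > 4 \]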

[Figure: two panels of Inverted Gamma-2 densities on [0, 1]: IG(0.5,4) and IG(100,202).]

GAP uses IG-priors for all variance parameters.
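A draw from the Inverted Gamma-2 density above can be obtained from a chi-square variate: if $X \sim \chi^2(\nu_0)$ then $s_0/X \sim IG(s_0, \nu_0)$, which follows from a change of variable in the density. The sketch below checks the first moment by simulation; the function name `sample_ig2` is an assumption and this is not a description of GAP's sampler.

```python
import numpy as np


def sample_ig2(s0, nu0, size, rng):
    """Draw from IG(s0, nu0) as s0 divided by a chi-square(nu0) variate."""
    return s0 / rng.chisquare(nu0, size=size)


rng = np.random.default_rng(0)
draws_ig = sample_ig2(100.0, 202.0, size=200_000, rng=rng)

# Theoretical mean s0 / (nu0 - 2) for the IG(100, 202) example.
mean_theory = 100.0 / (202.0 - 2.0)
mean_mc = draws_ig.mean()
```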


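Normal-Inverted Gamma-2 distribution: $\beta, V_u$ are jointly NIG-distributed with parameters $\beta_0$, $M_{\beta_0}$, $s_0 > 0$ and $\nu_0 > 0$, i.e. $f(\beta, V_u) = NIG(\beta_0, M_{\beta_0}, s_0, \nu_0)$, if:

\[ f(\beta, V_u) \propto V_u^{-\frac{1}{2}(\nu_0 + 2 + d(\beta))} \times \exp\Big\{-\frac{1}{2V_u}\big(s_0 + (\beta - \beta_0)' M_{\beta_0} (\beta - \beta_0)\big)\Big\} \]

where $d(\beta)$ is the dimension of $\beta$.

NIG-properties:

\[ f(\beta|V_u) = N(\beta_0, V_u M_{\beta_0}^{-1}) \]
\[ f(V_u) = IG(s_0, \nu_0) \]
\[ f(V_u|\beta) = IG\big(s_0 + (\beta - \beta_0)' M_{\beta_0} (\beta - \beta_0),\; \nu_0 + d(\beta)\big) \]
\[ f(\beta) = t(\beta_0, M_{\beta_0}, s_0, \nu_0) \]

Moments of $f(\beta)$: $E(\beta) = \beta_0$, $V(\beta) = E(V_u)\, M_{\beta_0}^{-1} = \dfrac{s_0}{\nu_0 - 2}\, M_{\beta_0}^{-1}$.

In GAP, $\mu_{CU}$, $\beta$, $V_{CU}$ are NIG.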

Conjugate prior distributions

Definition: if $F$ is a class of sampling distributions $f(Y|\theta)$ and $P$ is a class of prior distributions for $\theta$, then the class $P$ is conjugate for $F$ if the posterior distribution $f(\theta|y)$ belongs to $P$ for all $f(Y|\theta) \in F$ and $f(\theta) \in P$.

Natural conjugacy holds if prior, likelihood and posterior belong to the same class of distributions ⇔ natural conjugate families are closed for inference.

Convenience: the posterior distribution is in a known

parametric form.

Requirement: the prior information available can be

adequately represented with a distribution of this family.

Examples

• The Beta family is conjugate for binomial likelihoods:

\[ p(\theta|y) \propto \theta^{y}(1-\theta)^{n-y} \times \theta^{a-1}(1-\theta)^{b-1} \propto Beta(y + a,\; n - y + b) \]

• The NIG prior is natural conjugate for Normal likelihoods.

In general, when the data density belongs to the exponential family

\[ f(y|\theta) \propto \exp\Big[\sum_{j=1}^{m} t_j(y)\,\phi_j(\theta)\Big] \]

as for Normal likelihoods, then prior distributions such as

\[ f(\theta) \propto \exp\Big\{\sum_{j=1}^{m} b_j\,\phi_j(\theta)\Big\} \]

imply the posterior

\[ f(\theta|y) \propto \exp\Big\{\sum_{j=1}^{m} (b_j + t_j(y))\,\phi_j(\theta)\Big\} \]

⇒ likelihoods belonging to the exponential family have natural conjugate prior distributions.

Parameterization and prior distributions

Parameterization must be adequate for info available.

Example: consider a cycle whose dynamic is described with

the AR(2) φ(L) = 1 − φ1L − φ2L2. According to business

cycle studies, this cycle has a typical length of 8 years and

amplitude 0.8. How to incorporate this prior information?

Impossible on the φ-scale ⇒ re-parameterize

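Let $r_{1,2} = A e^{\pm \imath w}$, $A < 1$, $\imath^2 = -1$, $w \in [0, \pi]$, be the complex roots of $z^2 - \phi_1 z - \phi_2 = 0$. In terms of periodicity $\tau$, $w = 2\pi/\tau$ and the AR(2) polynomial can be written as:

\[ \phi(L) = (1 - A e^{-\imath 2\pi/\tau} L)(1 - A e^{\imath 2\pi/\tau} L) = 1 - 2A\cos(2\pi/\tau) L + A^2 L^2 \]

This is more convenient for inserting prior information: for instance, $A \sim Beta(a_A, b_A)$ and $\dfrac{\tau - \tau_l}{\tau_u - \tau_l} \sim Beta(a_\tau, b_\tau)$ for $\tau \in (\tau_l, \tau_u)$.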
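Matching the two forms of $\phi(L)$ gives $\phi_1 = 2A\cos(2\pi/\tau)$ and $\phi_2 = -A^2$. The sketch below converts between the two parameterizations for the example of an 8-year cycle with amplitude 0.8 (annual data assumed); the function names are illustrative assumptions.

```python
import numpy as np


def ar2_from_polar(A, tau):
    """AR(2) coefficients (phi1, phi2) from amplitude A and periodicity tau."""
    return 2.0 * A * np.cos(2.0 * np.pi / tau), -A ** 2


def polar_from_ar2(phi1, phi2):
    """Inverse mapping, valid when the roots are complex (phi1^2 + 4 phi2 < 0)."""
    A = np.sqrt(-phi2)
    tau = 2.0 * np.pi / np.arccos(phi1 / (2.0 * A))
    return A, tau


phi1, phi2 = ar2_from_polar(A=0.8, tau=8.0)
# The roots of z^2 - phi1 z - phi2 = 0 should have modulus A = 0.8.
roots = np.roots([1.0, -phi1, -phi2])
A_back, tau_back = polar_from_ar2(phi1, phi2)
```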

What if a flat prior is put on φ1, φ2 under the complex root

constraint?

p(φ1 ∈ (−2, 0)) = 0.5 = p(φ1 ∈ (0, 2))

φ1 ∈ (−2, 0) ⇔ τ ∈ (2, 4) (cos(2π/τ) < 0)

φ1 ∈ (0, 2) ⇔ τ ∈ (4,∞) (cos(2π/τ) > 0)

(2, 4) and (4,∞) period-ranges receive equal weights: short

term movements are emphasized.

⇒ GAP uses polar coordinates A and τ instead of AR

coefficients

Non-informative prior distributions

Three reasons for being non-informative:

• Ignorance

• Make the analysis as objective as possible

• Assess the prior influence on posterior inference

We would like inference to be only driven by the likelihood

function ⇒ a candidate prior is uniform.

Problem: as in general the shape of the likelihood depends

on the parameter scale, for which parameterization should

the uniform distribution be specified?

General solution: Jeffreys' non-informative prior, which is invariant to the parameterization.

Example: p(V ) ∝ 1/V ⇔ p(log V ) ∝ 1.

Program GAP does not use non-informative priors

GAP prior assumptions for the TFP - CU model

The following block-independence structure is imposed:

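\[ p(\theta) = p(A)\, p(\tau)\, p(V_c)\, p(w)\, p(\rho)\, p(V_\eta)\, p(\mu_{cu}, \beta, V_{cu}) \]

\[ p(A) = Beta(a_A, b_A) \]
\[ p\Big(\frac{\tau - 2}{T - 2}\Big) = Beta(a_\tau, b_\tau) \]
\[ p(V_c) = IG(s_{c0}, v_{c0}) \]
\[ p(w) = N(w_0, V_{w0}) \]
\[ p(\rho) = N(\rho_0, V_{\rho 0}) \]
\[ p(V_\eta) = IG(s_{\eta 0}, v_{\eta 0}) \]
\[ p(\mu_{cu}, \beta, V_{cu}) = NIG(\delta_0, M_0^{-1}, s_0, v_0) \]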

The periodicity τ belongs to the interval [2, T ].

The GAP menu Prior enables users to set hyperparameters

and to save & load priors.

Bayesian inference

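Given a prior $p(\theta)$ on parameter $\theta$, we are interested in the posterior $p(\theta|y)$. In some cases, $p(\theta|y)$ is known in closed form.

Example: the linear regression

\[ y_t = \beta' x_t + u_t, \qquad u_t|V_u \sim \text{iid } N(0, V_u) \]

Let $y = (y_1, y_2, \dots, y_T)'$ and $x = (x_1, x_2, \dots, x_T)'$. The sampling density satisfies:

\[ f(y|x, \beta, V_u) \propto V_u^{-T/2} \exp\Big\{-\frac{1}{2V_u}(y - x\beta)'(y - x\beta)\Big\} \]
\[ \propto V_u^{-T/2} \exp\Big\{-\frac{1}{2V_u}\big[ssr + (\beta - \hat\beta)' x'x (\beta - \hat\beta)\big]\Big\} \]
\[ = NIG(\hat\beta,\; x'x,\; ssr,\; T - 2 - d(\beta)) \]

where $\hat\beta = (x'x)^{-1} x'y$ and $ssr = (y - x\hat\beta)'(y - x\hat\beta)$.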

Posteriors with the natural conjugate prior

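Assume $(\beta, V_u) \sim NIG(\beta_0, M_0, s_0, \nu_0)$:

\[ f(\beta, V_u) = f(\beta|V_u) f(V_u) = N(\beta_0, V_u M_0^{-1}) \times IG(s_0, \nu_0) \]

NIG-Theorem: the posterior density $f(\beta, V_u|y, x)$ for the linear regression model with NIG prior is a $NIG(\beta_*, M_*, s_*, \nu_*)$ with hyperparameters:

\[ M_* = M_0 + x'x \]
\[ \beta_* = M_*^{-1}(M_0\beta_0 + x'x\hat\beta) \]
\[ s_* = s_0 + ssr + (\beta_0 - \hat\beta)'[M_0^{-1} + (x'x)^{-1}]^{-1}(\beta_0 - \hat\beta) \]
\[ \nu_* = \nu_0 + T \]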

GAP uses NIG-theorem for building posterior samples.
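The four hyperparameter updates of the NIG-theorem can be sketched as follows. The function name `nig_posterior` and the simulated data are illustrative assumptions; the sketch only mirrors the formulas above and is not GAP's implementation.

```python
import numpy as np


def nig_posterior(y, x, beta0, M0, s0, nu0):
    """Posterior NIG hyperparameters (beta*, M*, s*, nu*) for y = x beta + u."""
    xtx = x.T @ x
    beta_hat = np.linalg.solve(xtx, x.T @ y)             # OLS estimate
    resid = y - x @ beta_hat
    ssr = resid @ resid
    M_star = M0 + xtx
    beta_star = np.linalg.solve(M_star, M0 @ beta0 + xtx @ beta_hat)
    diff = beta0 - beta_hat
    W = np.linalg.inv(np.linalg.inv(M0) + np.linalg.inv(xtx))
    s_star = s0 + ssr + diff @ W @ diff
    nu_star = nu0 + len(y)
    return beta_star, M_star, s_star, nu_star


# Simulated regression with an intercept and one regressor (hypothetical data).
rng = np.random.default_rng(1)
x_reg = np.column_stack([np.ones(40), rng.standard_normal(40)])
y_reg = x_reg @ np.array([1.0, 2.0]) + 0.5 * rng.standard_normal(40)
beta0, M0, s0, nu0 = np.zeros(2), np.eye(2), 1.0, 4.0
beta_star, M_star, s_star, nu_star = nig_posterior(y_reg, x_reg, beta0, M0, s0, nu0)
```

A useful check is the equivalent expression $s_* = s_0 + y'y + \beta_0' M_0 \beta_0 - \beta_*' M_* \beta_*$, which follows from the completing-the-square identities used in the proof.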

From the NIG-properties, $f(\beta, V_u|y, x) = NIG(\beta_*, M_*, s_*, \nu_*)$ implies:

\[ f(\beta|y, x) = t(\beta_*, M_*, s_*, \nu_*) \]
\[ f(V_u|y, x) = IG(s_*, \nu_*) \]

with first two moments:

\[ E(\beta|y, x) = \beta_*; \qquad V(\beta|y, x) = \frac{s_*}{\nu_* - 2}\, M_*^{-1} \]
\[ E(V_u|y, x) = \frac{s_*}{\nu_* - 2}; \qquad V(V_u|y, x) = \frac{2}{\nu_* - 4}\, E(V_u|y, x)^2 \]

i. Frequentist interpretation: posterior ⇔ likelihood of data extended with a pre-sample whose likelihood coincides with the NIG-prior on $\beta, V_u$.

ii. $V(\beta|y, x)$ may be larger than $V(\beta)$: conflict of information between prior and likelihood ⇒ update your prior!

iii. The larger $|\beta_0 - \hat\beta|$ is, the larger $E(V_u|y, x)$.

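Proof of the NIG-theorem (skip): By Bayes' theorem $f(\beta, V_u|y, x) \propto f(y|x, \beta, V_u) f(\beta, V_u)$, hence

\[ f(\beta, V_u|y, x) \propto V_u^{-T/2} \exp\Big\{-\frac{1}{2V_u}\big[ssr + (\beta - \hat\beta)' x'x (\beta - \hat\beta)\big]\Big\} \]
\[ \times\; V_u^{-\frac{1}{2}(\nu_0 + 2 + d(\beta))} \exp\Big\{-\frac{1}{2V_u}\big(s_0 + (\beta - \beta_0)' M_0 (\beta - \beta_0)\big)\Big\} \]
\[ \propto V_u^{-\frac{1}{2}(T + \nu_0 + d(\beta) + 2)} \exp\Big\{-\frac{1}{2V_u}\big[s_0 + ssr + (\beta - \hat\beta)' x'x (\beta - \hat\beta) + (\beta - \beta_0)' M_0 (\beta - \beta_0)\big]\Big\} \]

Notice that (see Bauwens et al. 1999, pp 58-59)

\[ (\beta - \hat\beta)' x'x (\beta - \hat\beta) + (\beta - \beta_0)' M_0 (\beta - \beta_0) = (\beta - \beta_*)' M_* (\beta - \beta_*) + \beta_0' M_0 \beta_0 + \hat\beta' x'x \hat\beta - \beta_*' M_* \beta_* \]

and

\[ \beta_0' M_0 \beta_0 + \hat\beta' x'x \hat\beta - \beta_*' M_* \beta_* = (\beta_0 - \hat\beta)'[M_0^{-1} + (x'x)^{-1}]^{-1}(\beta_0 - \hat\beta) \]

Hence:

\[ f(\beta, V_u|y, x) \propto V_u^{-\frac{1}{2}(T + \nu_0 + d(\beta) + 2)} \exp\Big\{-\frac{1}{2V_u}\big[(\beta - \beta_*)' M_* (\beta - \beta_*) + s_0 + ssr + (\beta_0 - \hat\beta)'[M_0^{-1} + (x'x)^{-1}]^{-1}(\beta_0 - \hat\beta)\big]\Big\} \]
\[ = NIG(\beta_*, M_*, s_*, \nu_*) \]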

Sampling from unknown distributions: MCMC

Most often however, the posterior f(θ|y) is of unknown form.

One possibility is to approximate it by sampling.

Samples from the posterior f(θ|y) make possible inference

for any quantity of interest, like marginals or functions of θ.

Markov Chain Monte Carlo is a class of algorithms for

sampling from unknown distributions.

The most used MCMC algorithms are Gibbs sampling and

Metropolis-Hastings.

Gibbs sampling

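Partition $\theta$ into $k$ blocks: $\theta = (\theta_1, \theta_2, \dots, \theta_k)$. Geman and Geman (IEEE 1984) show that the iterative scheme:

0. Initialize: $\theta_2^{(0)}, \dots, \theta_k^{(0)}$;

1. Sample from the full conditionals:

\[ \theta_1^{(1)} \sim f(\theta_1|\theta_2^{(0)}, \dots, \theta_k^{(0)}, y) \]
\[ \theta_2^{(1)} \sim f(\theta_2|\theta_1^{(1)}, \theta_3^{(0)}, \dots, \theta_k^{(0)}, y) \]
\[ \vdots \]
\[ \theta_k^{(1)} \sim f(\theta_k|\theta_1^{(1)}, \theta_2^{(1)}, \dots, \theta_{k-1}^{(1)}, y) \]

2. Repeat $G$ times to obtain $\theta_1^{(g)}, \theta_2^{(g)}, \dots, \theta_k^{(g)}$, $g = 1, 2, \dots, G$.

is such that $f_G(\theta|y) \to f(\theta|y)$ as $G \to \infty$.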

Requirement: sampling from full conditionals is feasible.
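The scheme can be illustrated on a two-block toy problem: $y_j \sim N(\mu, V)$ with independent priors $\mu \sim N(\mu_0, V_0)$ and $V \sim IG(s_0, \nu_0)$, for which both full conditionals are standard. This sketch is an assumption-laden illustration (the function name `gibbs_normal`, the priors and the data are hypothetical), not the sampler used in GAP.

```python
import numpy as np


def gibbs_normal(y, mu0, v0, s0, nu0, G, burn, rng):
    """Gibbs sampler for (mu, V) in y_j ~ N(mu, V); returns a (G, 2) array."""
    T, ybar = len(y), y.mean()
    V = y.var()                                   # starting value for block 2
    draws = np.empty((G, 2))
    for g in range(burn + G):
        # Block 1: mu | V, y is Normal (precision-weighted average).
        v_mu = 1.0 / (T / V + 1.0 / v0)
        m_mu = v_mu * (T * ybar / V + mu0 / v0)
        mu = rng.normal(m_mu, np.sqrt(v_mu))
        # Block 2: V | mu, y is IG(s0 + sum of squares, nu0 + T).
        s = s0 + np.sum((y - mu) ** 2)
        V = s / rng.chisquare(nu0 + T)
        if g >= burn:
            draws[g - burn] = mu, V
    return draws


rng = np.random.default_rng(2)
y_sim = 2.0 + rng.standard_normal(500)
draws_gibbs = gibbs_normal(y_sim, mu0=0.0, v0=100.0, s0=1.0, nu0=3.0,
                           G=2000, burn=200, rng=rng)
post_mean_mu, post_mean_V = draws_gibbs.mean(axis=0)
```

With a weak prior and 500 observations, the posterior means should sit close to the sample mean and sample variance.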

The Metropolis-Hastings algorithm

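0. Choose a starting point $\theta^{(0)}$;

1. Generate a candidate $\theta^c$ from an importance density $q(\cdot)$;

2. Evaluate the ratio

\[ \alpha = \min\Big\{ \frac{f(\theta^c|y)/q(\theta^c)}{f(\theta^{(0)}|y)/q(\theta^{(0)})},\; 1 \Big\} \]

3. Generate $u \sim U(0, 1)$; if $u < \alpha$ set $\theta^{(1)} = \theta^c$, else $\theta^{(1)} = \theta^{(0)}$;

4. Repeat steps 1 to 3 starting from the current draw, and return $\{\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(G)}\}$.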
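The steps above can be sketched for the case where the candidate is drawn independently of the current point. The function name `independence_mh`, the Normal target kernel and the Normal candidate density are illustrative assumptions; densities are handled in logs and only up to constants, which cancel in the ratio.

```python
import numpy as np


def independence_mh(log_target, sample_q, log_q, theta0, G, rng):
    """Metropolis-Hastings with a candidate density q independent of the current draw."""
    theta = theta0
    log_w = log_target(theta) - log_q(theta)      # log of f(theta|y) / q(theta)
    draws = np.empty(G)
    accepted = 0
    for g in range(G):
        cand = sample_q(rng)
        log_w_cand = log_target(cand) - log_q(cand)
        # Accept with probability min{w_cand / w, 1}.
        if np.log(rng.uniform()) < min(log_w_cand - log_w, 0.0):
            theta, log_w = cand, log_w_cand
            accepted += 1
        draws[g] = theta
    return draws, accepted / G


# Target kernel: N(1, 0.5^2); candidate: N(0, 2^2), with fatter tails than the target.
draws_mh, acc_rate = independence_mh(
    log_target=lambda th: -0.5 * ((th - 1.0) / 0.5) ** 2,
    sample_q=lambda rng: rng.normal(0.0, 2.0),
    log_q=lambda th: -0.5 * (th / 2.0) ** 2,
    theta0=0.0, G=20_000, rng=np.random.default_rng(3))
```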

Choice of the candidate-generating density

i. Random walk chain: θc = θ(0) + white noise;

ii. Independence chain: choose q(·) independent of θ(0);

iii. Factorizing: if $f(\theta|y) = f_1(\theta) f_2(\theta)$ with $f_1(\theta)$ uniformly bounded, then taking $q(\cdot) = f_2(\theta)$ yields $\alpha = \min\Big\{ \dfrac{f_1(\theta^c)}{f_1(\theta^{(0)})},\; 1 \Big\}$ - GAP

Most difficult step: set the scale of the importance density.

This impacts the rate of acceptance.

In general the candidate-generating density must have fatter

tails than the target density.

Rule of thumb: target an acceptance ratio in (.20, .45) -

Care: convergence may fail in spite of a good acceptance

ratio.

Bayesian inference in the TFP-CU model

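\[ tfp_t = p_t + c_t \]
\[ cu_t = \mu_{cu} + \beta c_t + a_{cu,t} \]
\[ p_t = p_{t-1} + \eta_{t-1}; \qquad \eta_t = w(1 - \rho) + \rho\,\eta_{t-1} + a_{\eta,t} \]
\[ c_t = 2A\cos(2\pi/\tau)\, c_{t-1} - A^2 c_{t-2} + a_{c,t} \]

Let $\theta$ denote the model parameters, $\theta = (A, \tau, V_c, w, \rho, V_\eta, \mu_{cu}, \beta, V_{cu})$, $\xi_t$ the unobservables, $\xi_t = (c_t, p_t)$, and $y^T = (tfp^T, cu^T)$.

Our target is the joint posterior $p(\xi^T, \theta|y^T)$. Since no closed form exists, we resort to a Gibbs scheme that alternates between:

• $p(\xi^T|\theta, y^T)$

• $p(\theta|\xi^T, y^T)$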

Sampling the state

$p(\xi^T|\theta, y^T)$ is Normal but needs a $T \times T$ covariance matrix.

⇒ use Carter and Kohn (Biom., 1994) state sampler:

\[ p(\xi^T|\theta, y^T) = p(\xi_T|\theta, y^T) \prod_{t=1}^{T-1} p(\xi_t|\theta, y^t, \xi_{t+1}) \]

1. compute $\xi_{t|t}$, $P_{t|t}$, $t = 2, 3, \dots, T$, via Kalman filter;

2. sample $\xi_T \sim p(\xi_T|\theta, y^T) = N(\xi_{T|T}, P_{T|T})$;

3. sample $\xi_t \sim p(\xi_t|\theta, \xi_{t+1}, y^t) = N(E[\xi_t|\theta, \xi_{t+1}, y^t], V[\xi_t|\theta, \xi_{t+1}, y^t])$ back in time, $t = T-1, \dots, 2$; for the first two moments, use Normal properties.

4. sample $\xi_1 \sim p(\xi_1|\theta, \xi_2, y^1) = N(E[\xi_1|\theta, \xi_2, y^1], V[\xi_1|\theta, \xi_2, y^1])$.

The last step requires $\xi_{1|1}$ and $P_{1|1}$: use Koopman (1997, JASA).

Sampling parameters given the state

GAP exploits model parametrization, prior block-independence and likelihood factorization to build 3 parameter blocks:

\[ p(A, \tau, V_c, w, \rho, V_\eta, \mu_{cu}, \beta, V_{cu}|\xi^T, y^T) = p(A, \tau, V_c|c^T) \times p(w, \rho, V_\eta|p^T) \times p(\mu_{cu}, \beta, V_{cu}|c^T, cu^T) \]

Let us focus on the CU equation parameters. We can write:

\[ p(\mu_{cu}, \beta, V_{cu}|c^T, cu^T) \propto p(cu^T|c^T, \mu_{cu}, \beta, V_{cu})\, p(\mu_{cu}, \beta, V_{cu}) \propto NIG(\delta_*, M_*, s_*, v_*) \]

with hyperparameters given by the NIG theorem, restated for the CU equation on the next slide.

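(skip) Let $Z$ denote the $T \times 2$ matrix of regressors in the CU equation and $\hat\delta$ be the OLS estimate of $(\mu_{cu}, \beta)$. From the NIG-theorem:

\[ p(\mu_{cu}, \beta, V_{cu}|c^T, cu^T) \propto NIG(\delta_*, M_*, s_*, v_*) \]

with

\[ M_* = M_0 + Z'Z \]
\[ \delta_* = M_*^{-1}[M_0\delta_0 + Z'Z\hat\delta] \]
\[ v_* = v_0 + T - 2 \]
\[ s_* = s_0 + cu^{T\prime}(I_T - Z(Z'Z)^{-1}Z')\,cu^T + (\delta_0 - \hat\delta)'[M_0^{-1} + (Z'Z)^{-1}]^{-1}(\delta_0 - \hat\delta) \]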


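For $p(w, \rho, V_\eta|p^T)$, we use:

• $p(w|\rho, V_\eta, p^T) \propto p(p^T|w, \rho, V_\eta) \times p(w) = N(w_*, V_{w*})$

where $w_*$ and $V_{w*}$ can be worked out from the kernel identity:

\[ \frac{1}{V_\eta} \sum_{t=3}^{T} (\Delta p_t - \rho\,\Delta p_{t-1} - w(1 - \rho))^2 + \frac{1}{V_{w0}}(w - w_0)^2 = \text{constant} + \frac{(w - w_*)^2}{V_{w*}} \]

• $p(\rho|w, V_\eta, p^T) \propto p(p^T|\rho, w, V_\eta) \times p(\rho)$

\[ \propto p(p_1^2|\rho, w, V_\eta)\; p(p_3^T|p_1^2, \rho, w, V_\eta)\; p(\rho) \]
\[ \propto p(p_1, p_2|\rho, w, V_\eta) \times N(\rho_*, V_{\rho*}) \]

where $p_s^t$ denotes $(p_s, \dots, p_t)$ ⇒ use MH with $N(\rho_*, V_{\rho*})$ as proposal - MH within Gibbs.

• $p(V_\eta|p^T, w, \rho) \propto p(p^T|w, \rho, V_\eta) \times p(V_\eta) \propto IG\big(s_{\eta 0} + \sum a_{\eta,t}^2,\; v_{\eta 0} + T - 1\big)$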


Cycle parameters: for $V_c$ the full conditional is still an IG, but sampling $A$ and $\tau$ needs either an MH or an ARMS (adaptive rejection Metropolis sampling) step.

Full details can be read in Planas, Rossi and Fiorentini, Journal of Business & Economic Statistics (2008).

Sampling in practice

• Set a burn-in period of B iterations and save the next G.

• Select a thinning $t$, i.e. record one draw every $t$ iterations.

• Monitor chain convergence: a failure invalidates inference.

Non-convergence indicates that some region of the sample

space is poorly explored. Chains may not converge because

of long lasting autocorrelations.

In case of non-convergence: increase the burn-in, change the

parameterization, implement a different MCMC scheme, ....

Convergence diagnostics

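1. Visual inspection by plotting cumulated posterior means $\frac{1}{g}\sum_{j=1}^{g} \theta^{(j)}$, $g = 1, 2, \dots, G$.

2. Geweke convergence diagnostic (1992): compare the mean of the first $n_1$ elements against the mean of the last $n_2$:

\[ \bar\theta_1 = \frac{1}{n_1} \sum_{j=1}^{n_1} \theta^{(j)}; \qquad \bar\theta_2 = \frac{1}{n_2} \sum_{j=G-n_2+1}^{G} \theta^{(j)} \]

with $n_1 + n_2 < G$. As $G \to \infty$ while $n_1/G$ and $n_2/G$ remain fixed,

\[ Z = \frac{\bar\theta_2 - \bar\theta_1}{\sqrt{V(\bar\theta_1) + V(\bar\theta_2)}} \to N(0, 1) \]

Large values of $|Z|$ indicate lack of convergence.

Geweke suggests $n_1 = G/5$ and $n_2 = G/2$.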
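A minimal sketch of the Geweke statistic follows. The variances of the two segment means must account for the autocorrelation of the chain; here they are estimated with a Bartlett-tapered sum of autocovariances, which is one possible choice and an assumption of this sketch (as are the function names and the truncation lag).

```python
import numpy as np


def long_run_var(x, max_lag):
    """Bartlett-weighted long-run variance of a series (non-negative by construction)."""
    x = x - x.mean()
    n = len(x)
    v = x @ x / n
    for k in range(1, max_lag + 1):
        weight = 1.0 - k / (max_lag + 1.0)
        v += 2.0 * weight * (x[k:] @ x[:-k]) / n
    return v


def geweke_z(chain, frac1=0.2, frac2=0.5):
    """Z statistic comparing the mean of the first frac1 and last frac2 of the chain."""
    G = len(chain)
    n1, n2 = int(frac1 * G), int(frac2 * G)
    first, last = chain[:n1], chain[G - n2:]
    var1 = long_run_var(first, max(1, int(0.04 * n1))) / n1
    var2 = long_run_var(last, max(1, int(0.04 * n2))) / n2
    return (last.mean() - first.mean()) / np.sqrt(var1 + var2)


rng = np.random.default_rng(4)
good_chain = rng.standard_normal(5000)           # stationary draws
bad_chain = good_chain.copy()
bad_chain[2500:] += 3.0                          # mean shift halfway through
z_good, z_bad = geweke_z(good_chain), geweke_z(bad_chain)
```

The stationary chain should give a moderate $|Z|$, while the chain with a mean shift produces a very large value.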

Convergence diagnostics

GAP reports:

• Geweke CD (as p-value).

• Chain autocorrelations.

• NSE the numerical standard error of the posterior mean

using autocovariances until lag equal to 4% of the recorded

simulations.

• RNE the relative numerical efficiency: the ratio of the variance of the posterior mean under the iid hypothesis to the squared NSE. Values close to 1 indicate high efficiency.
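The two efficiency measures can be sketched as below, using autocovariances up to a lag equal to 4% of the recorded draws as described above. The Bartlett taper applied to the autocovariances and the function name `nse_rne` are assumptions of this sketch; GAP's exact weighting is not stated here.

```python
import numpy as np


def nse_rne(chain, lag_share=0.04):
    """Numerical standard error of the chain mean and relative numerical efficiency."""
    G = len(chain)
    x = chain - chain.mean()
    max_lag = max(1, int(lag_share * G))
    gamma0 = x @ x / G                            # variance under the iid hypothesis
    lrv = gamma0
    for k in range(1, max_lag + 1):
        weight = 1.0 - k / (max_lag + 1.0)        # Bartlett taper (assumption)
        lrv += 2.0 * weight * (x[k:] @ x[:-k]) / G
    nse = np.sqrt(lrv / G)
    rne = (gamma0 / G) / nse ** 2
    return nse, rne


rng = np.random.default_rng(5)
G = 5000
iid_chain = rng.standard_normal(G)
shocks = rng.standard_normal(G)
ar_chain = np.empty(G)
ar_chain[0] = shocks[0]
for g in range(1, G):
    ar_chain[g] = 0.9 * ar_chain[g - 1] + shocks[g]   # strongly autocorrelated chain

nse_iid, rne_iid = nse_rne(iid_chain)
nse_ar, rne_ar = nse_rne(ar_chain)
```

For the independent chain the RNE is of order 1, while the autocorrelated chain has a much lower RNE: many more draws are needed for the same numerical precision.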

Estimates from MCMC

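Any quantity of interest can be derived from the posterior samples: for instance the marginal distribution and related moments, as in

\[ E[\theta_1|y] \approx \frac{1}{G} \sum_{j=1}^{G} \theta_1^{(j)}; \qquad f(\theta_1|y) \approx \frac{1}{2\delta G} \sum_{j=1}^{G} I\big\{\theta_1^{(j)} \in (\theta_1 - \delta, \theta_1 + \delta)\big\} \]

Hypothesis testing via the highest posterior density (HPD) region with probability content $\alpha$: the smallest interval $R$ such that

\[ p(\beta \in R|y) = \int_R p(\beta|y)\, d\beta = \alpha \]

⇒ accept $H_0: \beta \in I$ if $I \subset R$.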
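From posterior draws, an HPD interval can be approximated by the shortest interval that contains the required share of the sorted sample. The sketch below assumes a unimodal posterior, for which the HPD region is a single interval; the function name `hpd_interval` is illustrative.

```python
import numpy as np


def hpd_interval(draws, content=0.95):
    """Shortest interval containing a share `content` of the draws."""
    x = np.sort(np.asarray(draws, dtype=float))
    n = len(x)
    m = int(np.ceil(content * n))                 # number of draws inside the interval
    widths = x[m - 1:] - x[:n - m + 1]            # width of every candidate interval
    i = int(np.argmin(widths))
    return x[i], x[i + m - 1]


# A skewed sample: the shortest 80% interval excludes the outlying draw.
low, high = hpd_interval([0.0, 1.0, 2.0, 3.0, 100.0], content=0.8)

rng = np.random.default_rng(6)
lo95, hi95 = hpd_interval(rng.standard_normal(50_000), content=0.95)
```

For a symmetric posterior such as the standard Normal, the interval is close to the usual central interval of about plus or minus 1.96.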

In output, GAP shows priors and posteriors, posterior

mode, highest posterior regions, unobservables posterior

mean, marginal distribution of pt and ct, forecasts, CU

equation fit, ... All graphics are exportable in PS-format.

Two important checks: chain convergence and prior-posterior

congruency.

Enjoy GAP!

Date post:	10-Apr-2019
