
Chapter 7: Bayesian Econometrics

Christophe Hurlin

University of Orléans

June 26, 2014


Section 1

Introduction


1. Introduction

The outline of this chapter is the following:

Section 2. Prior and posterior distributions

Section 3. Posterior distributions and inference

Section 4. Applications (VAR and DSGE)

Section 5. Numerical simulations

1. Introduction

References

Geweke J. (2005), Contemporary Bayesian Econometrics and Statistics. New York: John Wiley and Sons. (advanced)

Geweke J., Koop G. and Van Dijk H. (2011), The Oxford Handbook of Bayesian Econometrics. Oxford University Press.

Greene W. (2007), Econometric Analysis, sixth edition, Pearson - Prentice Hall.

Greenberg E. (2008), Introduction to Bayesian Econometrics, Cambridge University Press. (recommended)

Koop G. (2003), Bayesian Econometrics. New York: John Wiley and Sons.

Lancaster T. (2004), An Introduction to Modern Bayesian Inference. Oxford University Press.

1. Introduction

Notations: In this course, I will (try to...) follow some conventions of notation:

Y : random variable
y : realisation
fY(y) : probability density function or probability mass function
FY(y) : cumulative distribution function
Pr(.) : probability
y : vector
Y : matrix

Problem: this system of notation does not allow one to discriminate between a vector (matrix) of random elements and a vector (matrix) of non-stochastic elements (realisations).

See Abadir and Magnus (2002), Notation in econometrics: a proposal for a standard, Econometrics Journal.

Section 2

Prior and posterior distributions


2. Prior and posterior distributions

Objectives

The objectives of this section are the following:

1 Introduce the concept of prior distribution

2 Define the hyperparameters

3 Define the concept of posterior distribution

4 Define the concept of unnormalised posterior distribution

2. Prior and posterior distributions

In statistics, there is a distinction between two concepts of probability:

1 Frequentist probability

2 Subjective probability

2. Prior and posterior distributions

Frequentist probability

Frequentists restrict the assignment of probabilities to statements that describe the outcome of an experiment that can be repeated.

Example. Statement A1: a coin tossed three times will come up heads either two or three times. We can imagine repeating the experiment of tossing a coin three times and recording the number of times that two or three heads were reported. Then:

Pr(A1) = lim_{n→∞} (number of times two or three heads occur) / n

2. Prior and posterior distributions

Subjective probability

1 Those who take the subjective view of probability believe that probability theory is applicable to any situation in which there is uncertainty.

2 Outcomes of repeated experiments fall into that category, but so do statements about tomorrow's weather, which are not the outcomes of repeated experiments.

3 Calling probabilities "subjective" does not imply that they can be set arbitrarily; probabilities set in accordance with the axioms are consistent.

2. Prior and posterior distributions

Reminder

The probability of event A is denoted by Pr(A). Probabilities are assumed to satisfy the following axioms:

1 0 ≤ Pr(A) ≤ 1

2 Pr(A) = 1 if A represents a logical truth

3 If A and B describe disjoint events (events that cannot both occur), then Pr(A ∪ B) = Pr(A) + Pr(B)

4 Let Pr(A|B) denote the probability of A, given (or conditioned on the assumption) that B is true. Then:

Pr(A|B) = Pr(A ∩ B) / Pr(B)

2. Prior and posterior distributions

Reminder

Definition. The union of A and B is the event that A or B (or both) occur; it is denoted by A ∪ B.

Definition. The intersection of A and B is the event that both A and B occur; it is denoted by A ∩ B.

2. Prior and posterior distributions

Questions

1 What is the fundamental idea of the posterior distribution?

2 How can it be computed from the likelihood function and the prior distribution?

2. Prior and posterior distributions

Example (Subjective view of probability). Let Y be a binary variable, with Y = 1 if a coin toss results in a head and 0 otherwise, and let:

Pr(Y = 1) = θ

Pr(Y = 0) = 1 − θ

which is assumed to be constant for each trial. In this model, θ is a parameter and the value of Y is the data (realisation y).

2. Prior and posterior distributions

Example (cont'd). Under these assumptions, Y is said to have the Bernoulli distribution, written as:

Y ∼ Be(θ)

with probability mass function (pmf):

fY(y; θ) = Pr(Y = y) = θ^y (1 − θ)^(1−y)

We consider a sample of i.i.d. variables (Y1, .., Yn) that corresponds to n repeated experiments. The realisation is denoted by (y1, .., yn).

2. Prior and posterior distributions

Frequentist approach

1 From the frequentist point of view, probability theory can tell us something about the distribution of the data for a given θ:

Y ∼ Be(θ)

2 The parameter θ is an unknown number between zero and one.

3 It is not given a probability distribution of its own, because it is not regarded as being the outcome of a repeated experiment.

2. Prior and posterior distributions

Fact (Frequentist approach)

In a frequentist approach, the parameters θ are considered as constant terms, and the aim is to study the distribution of the data given θ, through the likelihood of the sample.

2. Prior and posterior distributions

Reminder

The likelihood function is defined to be:

Ln : Θ × R^n → R^+

(θ; y1, .., yn) ↦ Ln(θ; y1, .., yn) = f_{Y1,..,Yn}(y1, y2, .., yn; θ)

Under the i.i.d. assumption:

(θ; y1, .., yn) ↦ Ln(θ; y1, .., yn) = ∏_{i=1}^n fY(yi; θ)

2. Prior and posterior distributions

Remark: the (log-)likelihood function depends on two arguments: the sample (realisation) and θ.

2. Prior and posterior distributions

Reminder

In our example, Y is a discrete variable and the likelihood can be interpreted as the joint probability of observing the sample (realisation) (y1, .., yn) given a value of θ:

Ln(θ; y1, .., yn) = Pr((Y1 = y1) ∩ ... ∩ (Yn = yn))

If the sample (Y1, .., Yn) is i.i.d., then:

Ln(θ; y1, .., yn) = ∏_{i=1}^n Pr(Yi = yi) = ∏_{i=1}^n fY(yi; θ)

2. Prior and posterior distributions

Reminder

In our example, the likelihood of the sample (y1, .., yn) is:

Ln(θ; y1, .., yn) = ∏_{i=1}^n fY(yi; θ) = ∏_{i=1}^n θ^(yi) (1 − θ)^(1−yi)

Or equivalently:

Ln(θ; y1, .., yn) = θ^(Σyi) (1 − θ)^(Σ(1−yi))

2. Prior and posterior distributions

Notations:

In the rest of the chapter, I will use the following alternative notations:

Ln(θ; y) ≡ L(θ; y1, .., yn) ≡ Ln(θ)

ℓn(θ; y) ≡ ln Ln(θ; y) ≡ ln L(θ; y1, .., yn) ≡ ln Ln(θ)

2. Prior and posterior distributions

Subjective (Bayesian) approach

From the subjective point of view, however, θ is an unknown quantity.

Since there is uncertainty over its value, it can be regarded as a random variable and assigned a probability distribution.

Before seeing the data, it is assigned a prior distribution:

π(θ), with 0 ≤ θ ≤ 1

2. Prior and posterior distributions

Definition (Prior distribution)

In a Bayesian framework, the parameters θ associated with the distribution of the data are considered as random variables. Their distribution is called the prior distribution and is denoted by π(θ).

2. Prior and posterior distributions

Remark

In our example, the endogenous variable Yi is discrete (0 or 1):

Yi ∼ Be(θ)

but the parameter θ = Pr(Yi = 1) can be considered as a continuous random variable over [0, 1]: in this case, π(θ) is a pdf.

2. Prior and posterior distributions

Example (Prior distribution). For instance, we may postulate that:

θ ∼ U[0,1]

2. Prior and posterior distributions

Remark

Whatever the type of the distribution of the endogenous variable (discrete or continuous), the prior distribution is generally continuous:

π(θ) = probability density function (pdf)

2. Prior and posterior distributions

Definition (Uninformative prior)

An uninformative prior is a flat prior. Example: θ ∼ U[0,1]

2. Prior and posterior distributions

Remark

In most cases, the prior distribution is parametrised, i.e. the pdf π(θ; γ) depends on a set of parameters γ.

Definition (Hyperparameters). The parameters of the prior distribution, called hyperparameters, are supplied by the researcher.

2. Prior and posterior distributions

Example (Hyperparameters)

If θ ∈ R and the prior distribution is normal:

π(θ; γ) = (1 / (σ√(2π))) exp(−(θ − µ)² / (2σ²))

with γ = (µ, σ²)^⊤ the vector of hyperparameters.

2. Prior and posterior distributions

Example (Beta prior distribution)

If θ ∈ [0, 1], a common (parametrised) prior distribution is the Beta distribution, denoted B(α, β):

π(θ; γ) = (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1),  α, β > 0,  θ ∈ [0, 1]

with γ = (α, β)^⊤ the vector of hyperparameters.

2. Prior and posterior distributions

Beta distribution: reminder

The gamma function, denoted Γ(p), is defined to be:

Γ(p) = ∫_0^{+∞} e^(−x) x^(p−1) dx,  p > 0

with:

Γ(p) = (p − 1) Γ(p − 1)

Γ(p) = (p − 1)! = (p − 1)(p − 2) ... 2 × 1 if p ∈ N

Γ(1/2) = √π

2. Prior and posterior distributions

The Beta distribution B (α, β) has very interesting features:

1 Depending on the choice of α and β, this prior can capture beliefs that indicate θ is centered at 1/2, or it can shade θ toward zero or one;

2 It can be highly concentrated, or it can be spread out;

3 When both parameters are less than one, it can have two modes.

2. Prior and posterior distributions

The shape of a Beta distribution can be understood by examining its mean and variance. If θ ∼ B(α, β):

E(θ) = α / (α + β),  V(θ) = αβ / ((α + β)² (α + β + 1))

1 The mean is equal to 1/2 if α = β

2 A larger α (respectively β) shades the mean toward 1 (respectively 0)

3 The variance decreases as α or β increases.

2. Prior and posterior distributions

[Figure: Beta densities for (a, b) = (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (30.0, 30.0), (10.0, 5.0) and (1.0, 30.0).]
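These panels are easy to reproduce. Below is a minimal sketch in Python (assuming numpy, scipy and matplotlib; the course's own exercises use Matlab) that plots the same six Beta densities.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

params = [(0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (30.0, 30.0), (10.0, 5.0), (1.0, 30.0)]
theta = np.linspace(0.001, 0.999, 400)   # avoid the endpoints, where the pdf can diverge
fig, axes = plt.subplots(2, 3, figsize=(10, 5))
for ax, (a, b) in zip(axes.ravel(), params):
    ax.plot(theta, beta.pdf(theta, a, b))   # B(a, b) density on (0, 1)
    ax.set_title(f"a = {a}  b = {b}")
plt.tight_layout()
plt.show()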

2. Prior and posterior distributions

Remark

In some models, the hyperparameters are stochastic: like the parameters of interest θ, they have a distribution of their own.

These models are called hierarchical models.

2. Prior and posterior distributions

Definition (Hierarchical models). In a hierarchical model, we add one or more additional levels, where the hyperparameters themselves are given a prior distribution depending on another set of hyperparameters.

2. Prior and posterior distributions

Example (hierarchical model). An example of a hierarchical model is given by:

y : pdf f(y|θ)
θ : pdf π(θ|α)
α : pdf π(α|β)

where the hyperparameters β are constant terms.

2. Prior and posterior distributions

Definition (Posterior distribution)

Bayesian inference centers on the posterior distribution π(θ|y), which is the conditional distribution of the random variable θ given the data (realisation of the sample) y = (y1, .., yn):

θ | (Y1 = y1, .., Yn = yn) ∼ posterior distribution

2. Prior and posterior distributions

Remark

When there is more than one parameter, the posterior distribution is a joint conditional distribution of all the parameters given the observed data.

2. Prior and posterior distributions

Notations

π(θ) : prior distribution = pdf of θ

π(θ|y) : posterior distribution = conditional pdf

Discrete endogenous variable:

p(y; θ) : probability mass function (pmf)

Pr(A) : probability of event A

Continuous endogenous variable:

fY(y; θ) : probability density function (pdf)

FY(y; θ) : cumulative distribution function

2. Prior and posterior distributions

Warning

For all the general definitions and results, we will employ the notation fY(y; θ) for both probability mass and density functions.

Be careful with the interpretation of fY(y; θ): a density (continuous case) or directly a probability (discrete case).

2. Prior and posterior distributions

Example (Discrete random variable)

If Y ∼ Be(θ), the pmf, given by:

fY(y; θ) = Pr(Y = y) = θ^y (1 − θ)^(1−y)

is a probability.

2. Prior and posterior distributions

Example (Continuous random variable)

If Y ∼ N(µ, σ²) with θ = (µ, σ²)^⊤, the pdf

fY(y; θ) = (1 / (σ√(2π))) exp(−(y − µ)² / (2σ²))

is not a probability. The probability is given by:

Pr(Y ≤ y) = FY(y; θ) = ∫_{−∞}^{y} fY(x; θ) dx

Pr(Y = y) = 0

2. Prior and posterior distributions

The posterior density function π(θ|y) is computed by Bayes' theorem:

Theorem (Bayes' theorem). For events A and B, the conditional probability of event A given that B has occurred is:

Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)

2. Prior and posterior distributions

Bayes' theorem:

Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)

By setting A = θ and B = y, we have:

π(θ|y) = f_{Y|θ}(y|θ) π(θ) / fY(y)

where fY(y) = ∫ f_{Y|θ}(y|θ) π(θ) dθ

2. Prior and posterior distributions

Definition (Posterior distribution). For one observation yi, the posterior distribution is the conditional distribution of θ given yi, defined to be:

π(θ|yi) = f_{Yi|θ}(yi|θ) π(θ) / f_{Yi}(yi)

where

f_{Yi}(yi) = ∫_Θ f_{Yi|θ}(yi|θ) π(θ) dθ

and Θ is the support of the distribution of θ.

2. Prior and posterior distributions

Remark

π(θ|yi) = f_{Yi|θ}(yi|θ) π(θ) / f_{Yi}(yi)

The term f_{Yi|θ}(yi|θ) corresponds to the likelihood of the observation yi:

f_{Yi|θ}(yi|θ) = Li(θ; yi)

2. Prior and posterior distributions

Remark

The effect of dividing by f_{Yi}(yi) is to make π(θ|yi) a normalised probability distribution: integrating π(θ|yi) with respect to θ yields 1:

∫ π(θ|yi) dθ = ∫ f_{Yi|θ}(yi|θ) π(θ) / f_{Yi}(yi) dθ

= (1 / f_{Yi}(yi)) ∫ f_{Yi|θ}(yi|θ) π(θ) dθ

= 1

since f_{Yi}(yi) = ∫_Θ f_{Yi|θ}(yi|θ) π(θ) dθ.

2. Prior and posterior distributions

Discrete variable

For a discrete random variable Yi, by setting A = θ and B = yi, we have:

π(θ|yi) = p(yi|θ) π(θ) / p(yi)

posterior (pdf) = conditional probability × prior (pdf) / probability

where p(yi) = ∫ p(yi|θ) π(θ) dθ

2. Prior and posterior distributions

Definition (Prior predictive distribution). The term p(yi) or f_{Yi}(yi) is sometimes called the prior predictive distribution:

p(yi) = ∫ p(yi|θ) π(θ) dθ

f_{Yi}(yi) = ∫ f_{Yi|θ}(yi|θ) π(θ) dθ

2. Prior and posterior distributions

Remark

1 In general, we consider an i.i.d. sample Y = (Y1, .., Yn) with a realisation (data) y = (y1, .., yn), and not only one observation.

2 In this case, the posterior distribution can be written as a function of the likelihood of the i.i.d. sample y = (y1, .., yn):

Ln(θ; y1, .., yn) = ∏_{i=1}^n Li(θ; yi) = ∏_{i=1}^n fY(yi; θ)

2. Prior and posterior distributions

Definition (Posterior distribution, sample). For the sample (y1, .., yn), the posterior distribution is the conditional distribution of θ given (y1, .., yn), defined to be:

π(θ|y1, .., yn) = Ln(θ; y1, .., yn) π(θ) / f_{Y1,..,Yn}(y1, .., yn)

where Ln(θ; y1, .., yn) is the likelihood of the sample,

f_{Y1,..,Yn}(y1, .., yn) = ∫_Θ Ln(θ; y1, .., yn) π(θ) dθ

and Θ is the support of the distribution of θ.

2. Prior and posterior distributions

Notations

π(θ|y1, .., yn) = Ln(θ; y1, .., yn) π(θ) / f_{Y1,..,Yn}(y1, .., yn)

For simplicity, we drop the notation (y1, .., yn) for the sample and use only the generic term y:

π(θ|y) = Ln(θ; y) π(θ) / fY(y) = f_{Y|θ}(y|θ) π(θ) / fY(y)

2. Prior and posterior distributions

Remark

π(θ|y1, .., yn) = Ln(θ; y1, .., yn) π(θ) / f_{Y1,..,Yn}(y1, .., yn)

In this setting, the data (y1, .., yn) are viewed as constants whose (marginal) distribution does not involve the parameters of interest θ:

f_{Y1,..,Yn}(y1, .., yn) = constant

2. Prior and posterior distributions

Remark (cont'd)

As a consequence, Bayes' theorem

Pr(parameters|data) = Pr(data|parameters) Pr(parameters) / Pr(data)

implies that:

Pr(parameters|data) ∝ Pr(data|parameters) × Pr(parameters)

posterior density ∝ likelihood × prior density

where the symbol "∝" means "is proportional to".

2. Prior and posterior distributions

Definition (Unnormalised posterior distribution). The unnormalised posterior distribution is the product of the likelihood of the sample and the prior distribution:

π(θ|y1, .., yn) ∝ Ln(θ; y1, .., yn) π(θ)

or, with simpler notations:

π(θ|y) ∝ Ln(θ; y) π(θ)

where the symbol "∝" means "is proportional to".

2. Prior and posterior distributions

Example (Posterior distribution)

Consider an i.i.d. sample (Y1, .., Yn) of binary variables, such that Yi ∼ Be(θ) and:

f_{Yi}(yi; θ) = Pr(Yi = yi) = θ^(yi) (1 − θ)^(1−yi)

We assume that the (uninformative) prior distribution for θ is a (continuous) uniform distribution over [0, 1].

Question: Write the pdf associated with the unnormalised posterior distribution and the posterior distribution.

2. Prior and posterior distributions

Solution

π(θ|y1, .., yn) = Ln(θ; y1, .., yn) π(θ) / f_{Y1,..,Yn}(y1, .., yn)

The sample (y1, .., yn) is i.i.d., so its likelihood is defined to be:

Ln(θ; y1, .., yn) = ∏_{i=1}^n f_{Yi}(yi; θ) = ∏_{i=1}^n θ^(yi) (1 − θ)^(1−yi)

So we have:

Ln(θ; y1, .., yn) = θ^(Σyi) (1 − θ)^(Σ(1−yi))

2. Prior and posterior distributions

Solution (cont'd)

π(θ|y1, .., yn) = Ln(θ; y1, .., yn) π(θ) / f_{Y1,..,Yn}(y1, .., yn)

The (uninformative) prior distribution is:

θ ∼ U[0,1]

with pdf:

π(θ) = 1 if θ ∈ [0, 1], 0 otherwise

2. Prior and posterior distributions

Solution (cont'd)

π(θ|y1, .., yn) ∝ Ln(θ; y1, .., yn) π(θ)

Ln(θ; y1, .., yn) = θ^(Σyi) (1 − θ)^(Σ(1−yi))

π(θ) = 1 if θ ∈ [0, 1], 0 otherwise

The unnormalised posterior distribution is:

Ln(θ; y1, .., yn) π(θ) = θ^(Σyi) (1 − θ)^(Σ(1−yi)) if θ ∈ [0, 1], 0 otherwise

2. Prior and posterior distributions

Solution (cont'd)

π(θ|y1, .., yn) = Ln(θ; y1, .., yn) π(θ) / f_{Y1,..,Yn}(y1, .., yn)

The joint density of (Y1, .., Yn) evaluated at (y1, .., yn) is:

f_{Y1,..,Yn}(y1, .., yn) = ∫_0^1 f_{Y1,..,Yn|θ}(y1, .., yn|θ) π(θ) dθ

= ∫_0^1 Ln(θ; y1, .., yn) π(θ) dθ

= ∫_0^1 θ^(Σyi) (1 − θ)^(Σ(1−yi)) dθ

2. Prior and posterior distributions

Solution (cont'd)

π(θ|y1, .., yn) = Ln(θ; y1, .., yn) π(θ) / f_{Y1,..,Yn}(y1, .., yn)

Finally, we have:

π(θ|y1, .., yn) = θ^(Σyi) (1 − θ)^(Σ(1−yi)) / ∫_0^1 θ^(Σyi) (1 − θ)^(Σ(1−yi)) dθ  if θ ∈ [0, 1]

π(θ|y1, .., yn) = 0 if θ ∉ [0, 1]

Note that this is the pdf of a Beta distribution B(1 + Σyi, 1 + n − Σyi): the uniform prior is the special case B(1, 1).

2. Prior and posterior distributions

Example (Posterior distribution)

Consider an i.i.d. sample (Y1, .., Yn) of binary variables, such that Yi ∼ Be(θ) with θ = 0.3. We assume that the (uninformative) prior distribution for θ is a uniform distribution over [0, 1].

Question: Write a Matlab code (i) to generate a sample of size n = 100 and (ii) to display the pdf associated with the prior and the posterior distribution.
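A sketch of this exercise in Python rather than the requested Matlab (assuming numpy, scipy and matplotlib), using the Beta form of the posterior derived above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

rng = np.random.default_rng(0)
n, theta0 = 100, 0.3
y = rng.binomial(1, theta0, size=n)      # (i) simulated Bernoulli(0.3) sample
s = int(y.sum())

grid = np.linspace(0, 1, 500)
prior = np.ones_like(grid)               # U[0,1] prior density
post = beta.pdf(grid, 1 + s, 1 + n - s)  # posterior = B(1 + Σyi, 1 + n − Σyi)

plt.plot(grid, prior, label="Prior")     # (ii) display both densities
plt.plot(grid, post, label="Posterior")
plt.xlabel("theta")
plt.legend()
plt.show()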

2. Prior and posterior distributions

[Figure: unnormalised posterior distribution as a function of θ; before normalisation the vertical scale is of order 10^−28.]

2. Prior and posterior distributions

[Figure: posterior distribution of θ.]


2. Prior and posterior distributions

Example (Posterior distribution)

Consider an i.i.d. sample (Y1, .., Yn) of binary variables, such that Yi ∼ Be(θ) and:

f_{Yi}(yi; θ) = Pr(Yi = yi) = θ^(yi) (1 − θ)^(1−yi)

We assume that the prior distribution for θ is a Beta distribution B(α, β) with pdf:

π(θ; γ) = (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1),  α, β > 0,  θ ∈ [0, 1]

with γ = (α, β)^⊤ the vector of hyperparameters.

Question: Write the pdf associated with the unnormalised posterior distribution.

2. Prior and posterior distributions

Solution

The likelihood of the sample (y1, .., yn) is:

Ln(θ; y1, .., yn) = θ^(Σyi) (1 − θ)^(Σ(1−yi))

The prior distribution is:

π(θ; γ) = (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1)

So the unnormalised posterior distribution is:

π(θ|y1, .., yn) ∝ Ln(θ; y1, .., yn) π(θ)

∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1) θ^(Σyi) (1 − θ)^(Σ(1−yi))

2. Prior and posterior distributions

Solution (cont'd)

π(θ|y1, .., yn) ∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1) θ^(Σyi) (1 − θ)^(Σ(1−yi))

or equivalently:

π(θ|y1, .., yn) ∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1+Σyi) (1 − θ)^(β−1+Σ(1−yi))

Note that the term Γ(α + β) / (Γ(α) Γ(β)) does not depend on θ. So the unnormalised posterior can also be written as:

π(θ|y1, .., yn) ∝ θ^((α+Σyi)−1) (1 − θ)^((β+Σ(1−yi))−1)

2. Prior and posterior distributions

Solution (cont'd)

π(θ|y1, .., yn) ∝ θ^((α+Σyi)−1) (1 − θ)^((β+Σ(1−yi))−1)

Recall that the pdf of a Beta B(α, β) distribution is:

(Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1)

The posterior distribution is in the form of a Beta distribution with parameters:

α1 = α + ∑_{i=1}^n yi,  β1 = β + n − ∑_{i=1}^n yi

This is an example of a conjugate prior, where the posterior distribution is in the same family as the prior distribution.

2. Prior and posterior distributions

Definition (Conjugate prior). A conjugate prior is such that the posterior distribution is in the same family as the prior distribution.
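This conjugacy is easy to check numerically. In the sketch below (Python, assuming numpy and scipy; the hyperparameters and data are illustrative), normalising likelihood × prior on a grid recovers the B(α + Σyi, β + n − Σyi) density:

import numpy as np
from scipy.stats import beta
from scipy.integrate import trapezoid

rng = np.random.default_rng(1)
a, b = 2.0, 3.0                                  # illustrative prior B(a, b)
y = rng.binomial(1, 0.3, size=50)
s, n = int(y.sum()), y.size

grid = np.linspace(0.001, 0.999, 2000)
unnorm = grid**s * (1 - grid)**(n - s) * beta.pdf(grid, a, b)
post_numeric = unnorm / trapezoid(unnorm, grid)  # brute-force normalisation
post_exact = beta.pdf(grid, a + s, b + n - s)    # conjugate closed form

print(np.abs(post_numeric - post_exact).max())   # ≈ 0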

2. Prior and posterior distributions

Example (Posterior distribution)

Consider an i.i.d. sample (Y1, .., Yn) of binary variables, such that Yi ∼ Be(θ) and:

f_{Yi}(yi; θ) = Pr(Yi = yi) = θ^(yi) (1 − θ)^(1−yi)

We assume that the prior distribution for θ is a Beta distribution B(α, β).

Question: Determine the mean of the posterior distribution.

2. Prior and posterior distributions

Solution

We know that:

π(θ|y1, .., yn) ∝ θ^(α1−1) (1 − θ)^(β1−1)

α1 = α + ∑_{i=1}^n yi,  β1 = β + n − ∑_{i=1}^n yi

A simple way to normalise this posterior distribution consists in writing:

π(θ|y1, .., yn) = (Γ(α1 + β1) / (Γ(α1) Γ(β1))) θ^(α1−1) (1 − θ)^(β1−1)

This expression corresponds to the pdf of a B(α1, β1) distribution, and it is normalised to 1 by definition:

∫_0^1 π(θ|y1, .., yn) dθ = 1

2. Prior and posterior distributions

Solution (cont'd)

π(θ|y1, .., yn) = (Γ(α1 + β1) / (Γ(α1) Γ(β1))) θ^(α1−1) (1 − θ)^(β1−1)

α1 = α + ∑_{i=1}^n yi,  β1 = β + n − ∑_{i=1}^n yi

Using the properties of the B(α1, β1) distribution, we have:

E(θ|y1, .., yn) = α1 / (α1 + β1)

= (α + ∑_{i=1}^n yi) / (α + ∑_{i=1}^n yi + β + n − ∑_{i=1}^n yi)

= (α + ∑_{i=1}^n yi) / (α + β + n)

2. Prior and posterior distributions

Solution (cont'd)

E(θ|y1, .., yn) = (α + ∑_{i=1}^n yi) / (α + β + n)

We can express the mean as a function of the MLE, ȳn = n^(−1) ∑_{i=1}^n yi, as follows:

E(θ|y1, .., yn) = ((α + β) / (α + β + n)) × (α / (α + β)) + (n / (α + β + n)) × ȳn

posterior mean = weight × prior mean + weight × MLE
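A two-line numerical check of this shrinkage identity, with illustrative numbers (prior B(2, 2), 3 successes in n = 10 trials):

a, b, n, s = 2.0, 2.0, 10, 3
ybar = s / n                                   # MLE
post_mean = (a + s) / (a + b + n)              # direct formula
shrunk = (a + b) / (a + b + n) * (a / (a + b)) + n / (a + b + n) * ybar
print(post_mean, shrunk)                       # both 5/14 ≈ 0.357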

2. Prior and posterior distributions

Remark

E(θ|y1, .., yn) = ((α + β) / (α + β + n)) × (prior mean) + (n / (α + β + n)) × ȳn (MLE)

1 If n → ∞, then the weight on the prior mean approaches zero and the weight on the MLE approaches one, implying:

lim_{n→∞} E(θ|y1, .., yn) = ȳn

2 If the sample size is very small, n → 0, then we have:

lim_{n→0} E(θ|y1, .., yn) = α / (α + β)

2. Prior and posterior distributions

Example (Posterior distribution)

Consider an i.i.d. sample (Y1, .., Yn) of binary variables, such that Yi ∼ Be(θ) and:

∑_{i=1}^n yi = 3 if n = 10

∑_{i=1}^n yi = 15 if n = 50

We assume that the prior distribution for θ is a Beta distribution B(2, 2).

Question: Write a Matlab code to plot (i) the prior, (ii) the posterior distribution and (iii) the likelihood in the two cases.

2. Prior and posterior distributions

[Figure: prior, posterior and likelihood for n = 10; the likelihood is of order 10^−3.]

[Figure: prior, posterior and likelihood for n = 50; the likelihood is of order 10^−11.]
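A sketch of this exercise (Python in place of the requested Matlab, assuming scipy and matplotlib); the likelihood is drawn on a secondary axis because its scale is many orders of magnitude smaller than the densities:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

a, b = 2, 2                                   # B(2, 2) prior
grid = np.linspace(0.001, 0.999, 500)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (n, s) in zip(axes, [(10, 3), (50, 15)]):
    ax.plot(grid, beta.pdf(grid, a, b), label="Prior")
    ax.plot(grid, beta.pdf(grid, a + s, b + n - s), label="Posterior")
    ax.twinx().plot(grid, grid**s * (1 - grid)**(n - s), "g--")  # likelihood
    ax.set_title(f"n = {n}")
    ax.legend()
plt.tight_layout()
plt.show()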

2. Prior and posterior distributions

Key concepts of Section 2

1 Frequentist versus subjective probability

2 Prior distribution

3 Hyperparameters

4 Prior predictive distribution

5 Posterior distribution

6 Unnormalised posterior distribution

7 Conjugate prior

Section 3

Posterior distributions and inference


3. Posterior distributions and inference

Objectives

The objectives of this section are the following:

1 Generalise the Bayesian approach to a vector of parameters

2 Generalise the Bayesian approach to a regression model

3 Introduce the Bayesian updating mechanism

4 Study the posterior distribution in the case of a large sample

5 Discuss inference in a Bayesian framework

3. Posterior distributions and inference

The concept of posterior distribution can be generalized to:

1 a case with a vector of parameters θ = (θ1, .., θd)^⊤;

2 a model with exogenous variables and/or lagged endogenous variables (linear regression model, time series models, DSGE, etc.).

Sub-Section 3.1

Generalisation to a vector of parameters


3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Vector of parameters

Consider a model/variable with a pdf/pmf that depends on a vector of parameters θ = (θ1, .., θd)^⊤.

The previous definitions of likelihood, prior and posterior distributions still apply.

But they are now, respectively, the joint prior distribution and joint posterior distribution of the multivariate random variable θ.

From the joint distributions, we may derive marginal and conditional distributions.

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Definition (Marginal posterior distribution). The marginal posterior distribution of θ1 can be found by integrating out the remaining parameters from the joint posterior distribution:

π(θ1|y) = ∫ π(θ1, .., θd|y) dθ2 .. dθd

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Definition (Conditional posterior distribution). The conditional posterior distribution of θ1 is defined to be:

π(θ1|θ2, .., θd, y) = π(θ1, .., θd|y) / π(θ2, .., θd|y)

where the denominator on the right-hand side is the marginal posterior distribution of (θ2, .., θd), obtained by integrating θ1 out of the joint distribution.

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Remark

In most applications, the marginal distribution of a parameter is more useful than its conditional distribution, because the marginal takes into account the uncertainty over the values of the remaining parameters, while the conditional sets them at particular values.

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Example (Marginal posterior distribution). Consider the multinomial distribution Mn(.), which generalises the Bernoulli example discussed above. In this model each trial, assumed independent of the other trials, results in one of d outcomes, labelled 1, 2, .., d, with probabilities θ1, .., θd, where ∑_{i=1}^d θi = 1. When the experiment is repeated n times and outcome i arises yi times, the likelihood function is:

Ln(θ; y1, .., yd) = θ1^(y1) θ2^(y2) ... θd^(yd),  with ∑_{i=1}^d yi = n

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Example (cont'd). Consider a Dirichlet prior distribution (a generalisation of the Beta):

π(θ) = (Γ(∑_{i=1}^d αi) / ∏_{i=1}^d Γ(αi)) θ1^(α1−1) θ2^(α2−1) ... θd^(αd−1),  αi > 0,  ∑_{i=1}^d θi = 1

Question: Determine the marginal posterior distribution of θ1.

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Solution

Following the previous procedure, we find the (unnormalised) joint posterior distribution given the data y = (y1, .., yd):

π(θ|y) ∝ Ln(θ; y) π(θ)

∝ θ1^(y1) θ2^(y2) ... θd^(yd) × (Γ(∑_{i=1}^d αi) / ∏_{i=1}^d Γ(αi)) θ1^(α1−1) θ2^(α2−1) ... θd^(αd−1)

Or equivalently:

π(θ|y) ∝ θ1^(y1+α1−1) θ2^(y2+α2−1) ... θd^(yd+αd−1)

Remark: the Dirichlet prior is a conjugate prior for the multinomial model.

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Solution (cont'd)

The marginal distribution of θ1 is defined to be:

π(θ1|y) = ∫ π(θ|y) dθ2 .. dθd

In this context, we have:

π(θ1|y) = (Γ(∑_{i=1}^d (αi + yi)) / ∏_{i=1}^d Γ(αi + yi)) θ1^(y1+α1−1) ∫ θ2^(y2+α2−1) ... θd^(yd+αd−1) dθ2 .. dθd

But we can also use some general results about the Dirichlet distribution.

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Solution (cont'd)

Definition (Dirichlet distribution). The Dirichlet distribution generalises the Beta distribution. Let x = (x1, .., xd) with 0 ≤ xi ≤ 1 and ∑_{i=1}^d xi = 1. Then x ∼ D(α1, .., αd) if:

f(x; α1, .., αd) = (Γ(∑_{i=1}^d αi) / ∏_{i=1}^d Γ(αi)) x1^(α1−1) x2^(α2−1) ... xd^(αd−1),  αi > 0

Marginally, we have:

xi ∼ B(αi, ∑_{k≠i} αk)

3. Posterior distributions and inference
3.1. Generalisation to a vector of parameters

Solution (cont'd)

π(θ|y) ∝ θ1^(y1+α1−1) θ2^(y2+α2−1) ... θd^(yd+αd−1)

θ|y ∼ D(y1 + α1, .., yd + αd)

The marginal posterior distribution of θ1 is a Beta distribution:

θ1|y ∼ B(y1 + α1, ∑_{i=2}^d (yi + αi))
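A small numerical sketch of this Dirichlet-multinomial update (Python, assuming numpy and scipy; the prior and the counts are illustrative): posterior draws of θ1 match the Beta marginal stated above.

import numpy as np
from scipy.stats import dirichlet, beta

alpha = np.array([1.0, 2.0, 3.0])           # Dirichlet prior hyperparameters
counts = np.array([10, 5, 35])              # outcome counts over n = 50 trials
post = alpha + counts                       # conjugate update: D(y1+a1, .., yd+ad)

draws = dirichlet.rvs(post, size=200000, random_state=0)
print(draws[:, 0].mean())                   # simulated posterior mean of theta_1
print(beta.mean(post[0], post[1:].sum()))   # mean of B(y1+a1, sum of the rest)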

Sub-Section 3.2

Generalisation to a model


3. Posterior distributions and inference
3.2. Generalisation to a model

Remark: We can also aim at estimating the parameters of a model (with dependent and explanatory variables) such that:

y = g(x; θ) + ε

where θ denotes the vector of parameters, X a set of explanatory variables, ε an error term, and g(.) the link function.

In this case, we generally consider the conditional distribution of Y given X, which is equivalent to the unconditional distribution of the error term ε:

Y|X ∼ D ⟺ ε ∼ D

3. Posterior distributions and inference
3.2. Generalisation to a model

Notations (model)

Let us consider two continuous random variables Y and X.

We assume that Y has a conditional distribution given X = x with a pdf denoted f_{Y|x}(y; θ), for y ∈ R.

θ = (θ1, .., θK)^⊤ is a K × 1 vector of unknown parameters. We assume that θ ∈ Θ ⊆ R^K.

Let us consider a sample {Xi, Yi}_{i=1}^n of i.i.d. random variables and a realisation {xi, yi}_{i=1}^n.

3. Posterior distributions and inference
3.2. Generalisation to a model

Definition (Conditional likelihood function). The (conditional) likelihood function of the i.i.d. sample {Xi, Yi}_{i=1}^n is defined to be:

Ln(θ; y|x) = ∏_{i=1}^n f_{Y|X}(yi|xi; θ)

where f_{Y|X}(yi|xi; θ) denotes the conditional pdf of Yi given Xi.

Remark: The conditional likelihood function is the joint conditional density of the data given θ.

3. Posterior distributions and inference
3.2. Generalisation to a model

Example (Linear Regression Model). Consider the following linear regression model:

yi = xi^⊤ β + εi

where Xi is a K × 1 vector of random variables and β = (β1, .., βK)^⊤ a K × 1 vector of parameters. We assume that the εi are i.i.d. with εi ∼ N(0, σ²). Then the conditional distribution of Yi given Xi = xi is:

Yi|xi ∼ N(xi^⊤ β, σ²)

Li(θ; yi|xi) = f_{Y|x}(yi|xi; θ) = (1 / (σ√(2π))) exp(−(yi − xi^⊤ β)² / (2σ²))

where θ = (β^⊤, σ²)^⊤ is a (K + 1) × 1 vector.

3. Posterior distributions and inference
3.2. Generalisation to a model

Example (Linear Regression Model, cont'd). Then, if we consider an i.i.d. sample {yi, xi}_{i=1}^n, the corresponding conditional (log-)likelihood is defined to be:

Ln(θ; y|x) = ∏_{i=1}^n f_{Y|X}(yi|xi; θ) = ∏_{i=1}^n (1 / (σ√(2π))) exp(−(yi − xi^⊤ β)² / (2σ²))

= (2πσ²)^(−n/2) exp(−(1 / (2σ²)) ∑_{i=1}^n (yi − xi^⊤ β)²)

ℓn(θ; y|x) = −(n/2) ln σ² − (n/2) ln(2π) − (1 / (2σ²)) ∑_{i=1}^n (yi − xi^⊤ β)²
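This log-likelihood translates directly into code; a minimal sketch in Python/numpy, evaluated on simulated data for illustration:

import numpy as np

def gaussian_loglik(beta_vec, sigma2, y, X):
    # log-likelihood of y = X beta + eps, with eps ~ N(0, sigma2) i.i.d.
    resid = y - X @ beta_vec
    n = y.size
    return (-0.5 * n * np.log(sigma2) - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * resid @ resid / sigma2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.8, size=200)
print(gaussian_loglik(np.array([1.0, -0.5]), 0.8**2, y, X))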

3. Posterior distributions and inference
3.2. Generalisation to a model

Remark: Given this principle, we can derive the (conditional) likelihood and log-likelihood functions associated with a specific sample for any type of econometric model in which the conditional distribution of the dependent variable is known:

Dichotomous models: probit, logit, etc.

Censored regression models: Tobit, etc.

Time series models: AR, ARMA, VAR, etc.

GARCH models

etc.

3. Posterior distributions and inference
3.2. Generalisation to a model

Definition (Posterior distribution, model). For the sample {yi, xi}_{i=1}^n, the posterior distribution is the conditional distribution of θ given the data, defined to be:

π(θ|y, x) = Ln(θ; y|x) π(θ) / f_{Y|X}(y|x)

where Ln(θ; y|x) is the likelihood of the sample,

f_{Y|X}(y|x) = ∫_Θ Ln(θ; y|x) π(θ) dθ

and Θ is the support of the distribution of θ.

Sub-Section 3.3

Bayesian updating


3. Posterior distributions and inference
3.3. Bayesian updating

Bayesian updating

A very attractive feature of Bayesian inference is the way in which posterior distributions are updated as new information becomes available.

3. Posterior distributions and inference
3.3. Bayesian updating

Bayesian updating

As usual, for the first observation y1 we have:

π(θ|y1) ∝ f(y1|θ) π(θ)

Next, suppose that a new data point y2 is obtained, and we wish to compute the posterior distribution given the complete data set, π(θ|y1, y2):

π(θ|y1, y2) ∝ f(y1, y2|θ) π(θ)

The posterior can be rewritten as:

π(θ|y1, y2) ∝ f(y2|y1, θ) f(y1|θ) π(θ)

∝ f(y2|y1, θ) π(θ|y1)

3. Posterior distributions and inference
3.3. Bayesian updating

Definition (Bayesian updating). The posterior distribution is updated as new information becomes available as follows:

π(θ|y1, .., yn) ∝ f(yn|y1, .., y_{n−1}, θ) π(θ|y1, .., y_{n−1})

If the observations are independent (i.i.d. sample), then f(yn|y1, .., y_{n−1}, θ) = f(yn|θ) and:

π(θ|y1, .., yn) ∝ f(yn|θ) π(θ|y1, .., y_{n−1})
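A numerical sketch of this updating rule for the Bernoulli/Beta case (plain Python, with illustrative prior and data): processing the observations one at a time, with yesterday's posterior as today's prior, lands on exactly the same Beta posterior as using the whole sample at once.

a, b = 2.0, 2.0                    # illustrative prior hyperparameters
y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

a_seq, b_seq = a, b
for yi in y:                       # one conjugate update per observation
    a_seq += yi
    b_seq += 1 - yi

a_batch = a + sum(y)               # update with the full sample at once
b_batch = b + len(y) - sum(y)
print((a_seq, b_seq) == (a_batch, b_batch))   # True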

3. Posterior distributions and inference
3.3. Bayesian updating

Example (Bayesian updating). Consider an i.i.d. sample (Y1, .., Yn) of binary variables, such that Yi ∼ Be(θ) and:

f_{Yi}(yi; θ) = Pr(Yi = yi) = θ^(yi) (1 − θ)^(1−yi)

We assume that the prior distribution for θ is a Beta distribution B(α, β) with pdf:

π(θ; γ) = (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1),  α, β > 0,  θ ∈ [0, 1]

with γ = (α, β)^⊤ the vector of hyperparameters.

Question: Write the posterior of θ given (y1, y2) as a function of the posterior given y1.

3. Posterior distributions and inference
3.3. Bayesian updating

Solution

The likelihood of (y1, y2) is:

L(θ; y1, y2) = θ^(y1+y2) (1 − θ)^(2−y1−y2)

The prior distribution is:

π(θ; γ) = (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1)

So the unnormalised posterior distribution is:

π(θ|y1, y2) ∝ L(θ; y1, y2) π(θ)

∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1) θ^(y1+y2) (1 − θ)^(2−y1−y2)

3. Posterior distributions and inference
3.3. Bayesian updating

Solution (cont'd)

π(θ|y1, y2) ∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α−1) (1 − θ)^(β−1) θ^(y1+y2) (1 − θ)^(2−y1−y2)

or equivalently:

π(θ|y1, y2) ∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α+y1+y2−1) (1 − θ)^(β+2−y1−y2−1)

So:

θ|y1, y2 ∼ B(α + y1 + y2, β + 2 − y1 − y2)

3. Posterior distributions and inference
3.3. Bayesian updating

Solution (cont'd)

π(θ|y1, y2) ∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α+y1+y2−1) (1 − θ)^(β+2−y1−y2−1)

θ|y1, y2 ∼ B(α + y1 + y2, β + 2 − y1 − y2)

Given the observation y1, we have:

π(θ|y1) ∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α+y1−1) (1 − θ)^(β+1−y1−1)

θ|y1 ∼ B(α + y1, β + 1 − y1)

3. Posterior distributions and inference
3.3. Bayesian updating

Solution (cont'd)

π(θ|y1, y2) ∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α+y1+y2−1) (1 − θ)^(β+2−y1−y2−1)

π(θ|y1) ∝ (Γ(α + β) / (Γ(α) Γ(β))) θ^(α+y1−1) (1 − θ)^(β+1−y1−1)

The updating mechanism is given by:

π(θ|y1, y2) ∝ π(θ|y1) θ^(y2) (1 − θ)^(1−y2)

or equivalently:

π(θ|y1, y2) ∝ π(θ|y1) f_{Y2}(y2; θ)

Sub-Section 3.4

Large sample


3. Posterior distributions and inference
3.4. Large sample

Large samples

Although all statistical results for Bayesian estimators are necessarily "finite sample" (they are conditioned on the sample data), it remains of interest to consider how the estimators behave in large samples.

What is the behavior of the posterior distribution when n is large?

π(θ|y1, .., yn)

3. Posterior distributions and inference
3.4. Large sample

Fact (Large sample). Greenberg (2008) summarises the behavior of the posterior distribution when n is large as follows:

(1) the prior distribution plays a relatively small role in determining the posterior distribution;

(2) the posterior distribution converges to a degenerate distribution at the true value of the parameter;

(3) the posterior distribution is approximately normally distributed with mean θ̂, the MLE of θ.

3. Posterior distributions and inference
3.4. Large sample

Large samples

What is the intuition behind the first two results?

1 the prior distribution plays a relatively small role in determining the posterior distribution;

2 the posterior distribution converges to a degenerate distribution at the true value of the parameter.

3. Posterior distributions and inference
3.4. Large sample

Definition (Log-likelihood function). The log-likelihood function is defined to be:

ℓn : Θ × R^n → R

(θ; y1, .., yn) ↦ ℓn(θ; y1, .., yn) = ln(Ln(θ; y1, .., yn)) = ∑_{i=1}^n ln fY(yi; θ)

3. Posterior distributions and inference
3.4. Large sample

Large samples

Introduce the mean log-likelihood contribution ℓ̄(θ; y) (cf. chapter 3):

ℓn(θ; y) = ∑_{i=1}^n ln fY(yi; θ) = n ℓ̄(θ; y)

The posterior distribution can be written as:

π(θ|y) ∝ exp(n ℓ̄(θ; y)) × π(θ)

where the exponential term depends on n and the prior π(θ) does not.

3. Posterior distributions and inference
3.4. Large sample

Large samples

π(θ|y) ∝ exp(n ℓ̄(θ; y)) × π(θ)

For large n, the exponential term dominates π(θ):

The prior distribution plays a relatively smaller role than the data (likelihood function) when the sample size is large.

Conversely, the prior distribution has relatively greater weight when n is small.

3. Posterior distributions and inference
3.4. Large sample

If we denote the true value of θ by θ0, it can be shown that:

lim_{n→∞} n ℓ̄(θ; y) = n ℓ̄(θ0; y)

Then we have, for n large:

π(θ|y) ∝ exp(n ℓ̄(θ0; y)) × π(θ),  ∀θ ∈ Θ

where the exponential term does not depend on θ: whatever the value of θ, the value of π(θ|y) tends to a constant times the prior.

3. Posterior distributions and inference
3.4. Large sample

Example (Large sample). Consider an i.i.d. sample (Y1, .., Yn) of binary variables, such that Yi ∼ Be(θ) with θ = 0.3 and:

f_{Yi}(yi; θ) = Pr(Yi = yi) = θ^(yi) (1 − θ)^(1−yi)

We assume that the prior distribution for θ is a Beta distribution B(2, 2).

Question: Write a Matlab code to illustrate how the posterior distribution changes with n.

3. Posterior distributions and inference
3.4. Large sample

An animation is worth 1,000,000 words... (animation omitted)
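In lieu of the animation, a sketch of the exercise (Python in place of the requested Matlab, assuming scipy and matplotlib): overlaying the Beta posteriors for growing n shows the distribution piling up around the true value θ = 0.3.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

rng = np.random.default_rng(0)
a, b, theta0 = 2, 2, 0.3
y = rng.binomial(1, theta0, size=5000)      # one long simulated sample
grid = np.linspace(0.001, 0.999, 500)
for n in (10, 50, 200, 1000, 5000):
    s = int(y[:n].sum())
    plt.plot(grid, beta.pdf(grid, a + s, b + n - s), label=f"n = {n}")
plt.axvline(theta0, color="k", linestyle=":")
plt.xlabel("theta")
plt.legend()
plt.show()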

Sub-Section 3.5

Inference


3. Posterior distributions and inference
3.5. Inference

The output of the Bayesian estimation is the posterior distribution:

π(θ|y)

1 In some particular cases, the posterior distribution is a standard distribution (Beta, normal, etc.) and its pdf has an analytic form.

2 In other cases, the posterior distribution is obtained from numerical simulations (cf. Section 5).

3. Posterior distributions and inference
3.5. Inference

Whatever the case (analytical or numerical), the outcome of the Bayesian estimation procedure may be:

1 The graph of the posterior density π(θ|y) for all values of θ ∈ Θ

2 Some particular moments (expectation, variance, etc.) or quantiles of this distribution

3. Posterior distributions and inference
3.5. Inference

[Figures omitted. Source: Gerali et al. (2014), Credit and Banking in a DSGE Model of the Euro Area, JMCB, 42(6), 107-141.]

3. Posterior distributions and inference
3.5. Inference

But we may also be interested in estimating one parameter of the model.

The Bayesian approach to this problem uses the idea of a loss function:

L(θ̂, θ)

where θ̂ is the Bayesian estimator (cf. chapter 2).

3. Posterior distributions and inference
3.5. Inference

Example (Absolute loss function). The absolute loss function is defined to be:

L(θ̂, θ) = |θ̂ − θ|

Example (Quadratic loss function). The quadratic loss function is defined to be:

L(θ̂, θ) = (θ̂ − θ)²

3. Posterior distributions and inference
3.5. Inference

Definition (Bayes estimator). The Bayes estimator of θ is the value of θ̂ that minimises the expected value of the loss, where the expectation is taken over the posterior distribution of θ; that is, θ̂ is chosen to minimise:

E[L(θ̂, θ)] = ∫ L(θ̂, θ) π(θ|y) dθ

3. Posterior distributions and inference
3.5. Inference

Idea

The idea is to minimise the average loss whatever the possible value of θ:

E[L(θ̂, θ)] = ∫ L(θ̂, θ) π(θ|y) dθ

3. Posterior distributions and inference
3.5. Inference

Under quadratic loss, we minimise:

E[L(θ̂, θ)] = ∫ (θ̂ − θ)² π(θ|y) dθ

By differentiating the function with respect to θ̂ and setting the derivative equal to zero:

2 ∫ (θ̂ − θ) π(θ|y) dθ = 0

or equivalently:

θ̂ = ∫ θ π(θ|y) dθ = E(θ|y)

3. Posterior distributions and inference
3.5. Inference

Definition (Bayes estimator, quadratic loss). For a quadratic loss function, the optimal Bayes estimator is the expectation of the posterior distribution:

θ̂ = E(θ|y) = ∫_Θ θ π(θ|y) dθ
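For the Bernoulli/Beta example, this posterior mean can be checked by quadrature (a sketch assuming scipy; the posterior parameters are illustrative):

import numpy as np
from scipy.stats import beta
from scipy.integrate import trapezoid

a1, b1 = 5.0, 9.0                           # illustrative posterior B(a1, b1)
grid = np.linspace(0, 1, 10001)
post = beta.pdf(grid, a1, b1)
bayes_est = trapezoid(grid * post, grid)    # E(theta | y) by numerical integration
print(bayes_est, a1 / (a1 + b1))            # matches the closed-form mean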

3. Posterior distributions and inference
3.5. Inference

[Table omitted. Source: Greene W. (2007), Econometric Analysis, sixth edition, Pearson - Prentice Hall.]

3. Posterior distributions and inference
3.5. Inference

In addition to reporting a point estimate of a parameter θ, it is often useful to report an interval estimate of the form:

Pr(θL ≤ θ ≤ θU) = 1 − α

Bayesians call such intervals credibility intervals (or Bayesian confidence intervals) to distinguish them from a quite different concept that appears in frequentist statistics, the confidence interval (cf. chapter 2).

3. Posterior distributions and inference
3.5. Inference

Definition (Bayesian confidence interval). If the posterior distribution is unimodal, then a Bayesian confidence interval, or credibility interval, on the value of θ is given by the two values θL and θU such that:

Pr(θL ≤ θ ≤ θU) = 1 − α

where α is the level of risk.

3. Posterior distributions and inference
3.5. Inference

For a Bayesian, values of θL and θU can be determined to obtain the desired probability from the posterior distribution.

If more than one pair is possible, the pair that results in the shortest interval may be chosen; such a pair yields the highest posterior density interval (h.p.d.).

3. Posterior distributions and inference
3.5. Inference

Definition (Highest posterior density interval). The highest posterior density (hpd) interval is the smallest region H such that:

Pr(θ ∈ H) = 1 − α

where α is the level of risk. If the posterior distribution is multimodal, the hpd region may be disjoint.
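For a unimodal Beta posterior, the equal-tailed credibility interval comes straight from the posterior quantiles, and the hpd interval can be found by scanning for the shortest interval with the required coverage (a sketch assuming scipy; the posterior parameters are illustrative):

import numpy as np
from scipy.stats import beta

a1, b1, alpha = 5.0, 9.0, 0.05              # illustrative posterior, 95% level
eq_tail = beta.ppf([alpha / 2, 1 - alpha / 2], a1, b1)

# hpd: among all intervals with coverage 1 - alpha, keep the shortest
lo = np.linspace(0, alpha, 1000)
lower = beta.ppf(lo, a1, b1)
upper = beta.ppf(lo + 1 - alpha, a1, b1)
i = int(np.argmin(upper - lower))
print("equal-tailed:", eq_tail)
print("hpd:", (lower[i], upper[i]))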

3. Posterior distributions and inference
3.5. Inference

[Figure omitted. Source: Colletaz and Hurlin (2005), Modèles non linéaires et prévision, report for l'Institut pour la Recherche CDC.]

3. Posterior distributions and inference
3.5. Inference

Another basic issue in statistical inference is the prediction of new data values.

Definition (Forecasting). The general form of the pdf/pmf of Y_{n+1} given y1, .., yn is:

f(y_{n+1}|y) = ∫ f(y_{n+1}|θ, y) π(θ|y) dθ

where π(θ|y) is the posterior distribution of θ.

3. Posterior distributions and inference
3.5. Inference

Forecasting

f(y_{n+1}|y) = ∫ f(y_{n+1}|θ, y) π(θ|y) dθ

1 If Y is a discrete variable, this formula gives the conditional probability of Y_{n+1} = y_{n+1} given y1, .., yn:

Pr(Y_{n+1} = y_{n+1}|y) = ∫ p(y_{n+1}|θ, y) π(θ|y) dθ

2 If Y is a continuous variable, this formula gives the conditional density of Y_{n+1} given y1, .., yn. In this case, we can compute the expected value of Y_{n+1} as:

E(y_{n+1}|y) = ∫ f(y_{n+1}|y) y_{n+1} dy_{n+1}

3. Posterior distributions and inference
3.5. Inference

Forecasting

f(y_{n+1}|y) = ∫ f(y_{n+1}|θ, y) π(θ|y) dθ

If Y_{n+1} and (Y1, .., Yn) are independent, then f(y_{n+1}|θ, y) = f(y_{n+1}|θ) and we have:

f(y_{n+1}|y) = ∫ f(y_{n+1}|θ) π(θ|y) dθ

3. Posterior distributions and inference
3.5. Inference

Example (Forecasting). Consider an i.i.d. sample (Y1, .., Yn) of binary variables, such that Yi ∼ Be(θ). We assume that the prior distribution for θ is a Beta distribution B(α, β). We want to forecast the value of Y_{n+1} given the realisations (y1, .., yn).

Question: Determine the probability Pr(Y_{n+1} = 1|y1, .., yn).

3. Posterior distributions and inference
3.5. Inference

Solution

In this example the trials are independent, so Y_{n+1} is independent of (Y1, .., Yn):

Pr(Y_{n+1} = 1|y) = ∫ Pr(Y_{n+1} = 1|θ, y) π(θ|y) dθ = ∫ Pr(Y_{n+1} = 1|θ) π(θ|y) dθ

The posterior distribution of θ is in the form of a Beta distribution:

π(θ|y) = (Γ(α1 + β1) / (Γ(α1) Γ(β1))) θ^(α1−1) (1 − θ)^(β1−1)

with

α1 = α + ∑_{i=1}^n yi,  β1 = β + n − ∑_{i=1}^n yi

3. Posterior distributions and inference3.5. Inference

Solution (contd)

Pr (Yn+1 = 1j y) =ZPr (Yn+1 = 1j θ)π ( θj y) dθ

π ( θj y) = Γ (α1 + β1)

Γ (α1) + Γ (β1)θα11 (1 θ)β11

Since Pr (Yn+1 = 1j θ) = θ, we have

Pr (Yn+1 = 1j y) =Γ (α1 + β1)

Γ (α1) + Γ (β1)

1Z0

θ θα11 (1 θ)β11 dθ

=Γ (α1 + β1)

Γ (α1) + Γ (β1)

1Z0

θα1 (1 θ)β11 dθ

Solution (cont'd)

Pr(Y_{n+1} = 1 | y) = [Γ(α1 + β1) / (Γ(α1) Γ(β1))] ∫₀¹ θ^{α1} (1 − θ)^{β1−1} dθ

We admit that:

∫₀¹ θ^{α1} (1 − θ)^{β1−1} dθ = Γ(α1 + 1) Γ(β1) / Γ(α1 + β1 + 1)

So, we have

Pr(Y_{n+1} = 1 | y) = [Γ(α1 + β1) / (Γ(α1) Γ(β1))] × [Γ(α1 + 1) Γ(β1) / Γ(α1 + β1 + 1)]

Solution (cont'd)

Since Γ(p) = (p − 1) Γ(p − 1), we have Γ(α1 + 1) = α1 Γ(α1) and Γ(α1 + β1 + 1) = (α1 + β1) Γ(α1 + β1), so:

Pr(Y_{n+1} = 1 | y) = [Γ(α1 + β1) / (Γ(α1) Γ(β1))] × [α1 Γ(α1) Γ(β1) / (Γ(α1 + β1) (α1 + β1))] = α1 / (α1 + β1)

Solution (cont'd)

Pr(Y_{n+1} = 1 | y) = α1 / (α1 + β1) = (α + ∑_{i=1}^{n} y_i) / (α + β + n)

So, we found that

Pr(Y_{n+1} = 1 | y) = E(θ | y) = (α + ∑_{i=1}^{n} y_i) / (α + β + n)

The estimate of Pr(Y_{n+1} = 1 | y) is the mean of the posterior distribution of θ.
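A minimal Matlab sketch (not from the original slides) that checks this closed form against a Monte Carlo average over posterior draws; the prior hyperparameters and the data are illustrative.

% Predictive probability Pr(Y_{n+1}=1|y): closed form vs posterior simulation
a0 = 1; b0 = 1;                      % hypothetical Beta prior hyperparameters
y = [1 0 1 1 0 1 0 1 1 1];           % hypothetical Bernoulli realisations
n = length(y); s = sum(y);
a1 = a0 + s; b1 = b0 + n - s;        % posterior is Beta(a1,b1)
p_exact = a1/(a1 + b1)               % (alpha + sum(y))/(alpha + beta + n)
p_mc = mean(betarnd(a1,b1,1e5,1))    % Monte Carlo mean of posterior draws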


Key concepts of Section 3

1 Marginal and conditional posterior distribution

2 Bayesian updating

3 Bayes estimator

5 Bayesian confidence interval

5 Highest posterior density interval

6 Bayesian prediction


Section 4

Applications


Objectives

The objectives of this section are the following:

1 To discuss the Bayesian estimation of VAR models

2 To propose various priors for this type of model

3 To introduce the issue of numerical simulations of the posterior

Consider a typical VAR(p):

Y_t = B_1 Y_{t−1} + B_2 Y_{t−2} + .. + B_p Y_{t−p} + D z_t + ε_t    t = 1, .., T

where:

Y_t is an n × 1 vector of endogenous variables

ε_t is an n × 1 vector of error terms, i.i.d. with ε_t ∼ IIN(0_{n×1}, Σ_{n×n})

z_t is a d × 1 vector of exogenous variables

B_i, for i = 1, .., p, is an n × n matrix of parameters

D is an n × d matrix of parameters

Classical estimation of the parameters B_1, .., B_p, D, Σ may yield imprecisely estimated relations that fit the data well only because of the large number of variables included: a problem known as overfitting.

The number of parameters to be estimated, n(np + d + (n + 1)/2), grows geometrically with the number of variables (n) and proportionally with the number of lags (p).

When the number of parameters is relatively high and the sample information is relatively loose (macro-data), it is likely that the estimates are influenced by noise as opposed to signal.

A Bayesian approach to VAR estimation was originally advocated by Litterman (1980) as a solution to the overfitting problem.

Litterman R. (1980), Techniques for Forecasting with Vector AutoRegressions, University of Minnesota, Ph.D. dissertation.

Bayesian estimation is a solution to overfitting that avoids imposing exact zero restrictions on the parameters.

The researcher cannot be sure that some coefficients are zero and should not ignore their possible range of variation.

A Bayesian perspective fits precisely this view through the prior distribution.

Y_t = B_1 Y_{t−1} + B_2 Y_{t−2} + .. + B_p Y_{t−p} + D z_t + ε_t

Rewrite the VAR in a compact form:

Y_t = X_t β + ε_t

X_t = I_n ⊗ W_{t−1}' is n × nk, with k = np + d

W_{t−1} = (Y_{t−1}', .., Y_{t−p}', z_t')' is k × 1

β = vec(B_1, .., B_p, D) is an nk × 1 vector

Definition (Likelihood)

Under the normality assumption, the conditional distribution of Y_t = X_t β + ε_t given X_t and the set of parameters β is normal:

Y_t | X_t, β ∼ N(X_t β, Σ)

and the likelihood of the sample of realisations (Y_1, .., Y_T), denoted Y, is:

L_T(Y | X, β, Σ) ∝ |Σ|^{−T/2} exp( −(1/2) ∑_{t=1}^{T} (Y_t − X_t β)' Σ⁻¹ (Y_t − X_t β) )

By convention, we will denote L_T(Y | X, β, Σ) = L_T(Y, β, Σ).

Definition (Joint posterior distribution)

Given a prior π(β, Σ), the joint posterior distribution of the parameters (β, Σ) is given by

π(β, Σ | Y) = L_T(Y, β, Σ) π(β, Σ) / f(Y)

or equivalently

π(β, Σ | Y) ∝ L_T(Y, β, Σ) π(β, Σ)

Marginal posterior distribution

Given π(β, Σ | Y), the marginal posterior distributions of β and Σ can be obtained by integration:

π(β | Y) = ∫ π(β, Σ | Y) dΣ

π(Σ | Y) = ∫ π(β, Σ | Y) dβ

where dΣ and dβ denote integration over all the elements of Σ and β respectively.

Bayesian estimates

The location and dispersion of π(β | Y) and π(Σ | Y) can easily be analysed to yield point estimates (under a quadratic loss) of the parameters of interest and measures of precision, comparable to those obtained using the classical approach to estimation. Especially:

E(β | Y)    V(β | Y)    E(Σ | Y)

Two problems arise:

1 The numerical integration of the marginal posterior distribution

2 The choice of the prior

Fact (Numerical integration)

In most cases, the numerical integration of π(β, Σ | Y) may be difficult or even impossible to implement. For instance, if n = 1 and p = 4, we have

π(Σ | Y) = ∫ π(β, Σ | Y) dβ = ∫∫∫∫ π(β, Σ | Y) dβ1 dβ2 dβ3 dβ4

This problem, however, can often be solved by using numerical integration based on Monte Carlo simulations.

One particular MCMC (Markov Chain Monte Carlo) estimation method is the Gibbs sampler, which is a particular version of the Metropolis-Hastings algorithm (see Section 5).

Definition (Gibbs sampler)

The Gibbs sampler is a recursive Monte Carlo method that allows one to generate simulated values of (β, Σ) from the joint posterior distribution. This method requires only knowledge of the full conditional posterior distributions of the parameters of interest, π(β | Y, Σ) and π(Σ | Y, β).

Definition (Gibbs algorithm)

Suppose that β and Σ are scalar. The Gibbs sampler algorithm starts from arbitrary values (β⁽⁰⁾, Σ⁽⁰⁾), and samples alternately from the density of each element of the parameter vector, conditional on the value of the other element sampled in the previous iteration. Thus, the Gibbs sampler samples recursively as follows:

β⁽¹⁾ from π(β | Y, Σ⁽⁰⁾)

Σ⁽¹⁾ from π(Σ | Y, β⁽¹⁾)

β⁽²⁾ from π(β | Y, Σ⁽¹⁾)

...

β⁽ᵐ⁾ from π(β | Y, Σ⁽ᵐ⁻¹⁾)

Fact (Gibbs sampler)

The vectors (β⁽ᵐ⁾, Σ⁽ᵐ⁾) form a Markov chain and, for a sufficiently large number of iterations m ≥ M, can be regarded as draws from the true joint posterior distribution π(β, Σ | Y). Given a large sample of draws from this limiting distribution, {(β⁽ᵐ⁾, Σ⁽ᵐ⁾)}_{m=M+1}^{M+G}, any posterior moment or marginal density of interest can then be estimated consistently with its corresponding sample average. For instance:

(1/G) ∑_{m=M+1}^{M+G} β⁽ᵐ⁾ → E(β | Y)

Remark

The process must be started somewhere, though it does not matter much where.

Nonetheless, a burn-in period is required to eliminate the influence of the starting values, so the first M values (β⁽ᵐ⁾, Σ⁽ᵐ⁾) are discarded.

Example (Gibbs sampler)

We consider the bivariate normal distribution first. Suppose we wish to draw a random sample from the population

(X1, X2)' ∼ N( (0, 0)', [1 ρ; ρ 1] )

Question: write a Matlab code to generate a sample of n = 1,000 observations of (x1, x2)' with a Gibbs sampler.

Solution

The Gibbs sampler takes advantage of the result

X1 | x2 ∼ N(ρ x2, 1 − ρ²)

X2 | x1 ∼ N(ρ x1, 1 − ρ²)
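A minimal Matlab sketch of this Gibbs sampler; the value of ρ, the burn-in length and the starting point (0, 0) are illustrative choices.

% Gibbs sampler for the bivariate normal using the two conditionals above
rho = 0.6; n = 1000; burn = 500;
sd = sqrt(1 - rho^2);                 % conditional standard deviation
x = zeros(n + burn, 2);               % start from (0,0); the start matters little
for t = 2:(n + burn)
    x(t,1) = rho*x(t-1,2) + sd*randn; % draw x1 | x2
    x(t,2) = rho*x(t,1)   + sd*randn; % draw x2 | x1
end
x = x(burn+1:end,:);                  % discard the burn-in draws
scatter(x(:,1), x(:,2), '.')          % should mimic the joint distribution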

(Figure: scatter plot of the n = 1,000 simulated pairs (x1, x2) produced by the Gibbs sampler.)

Priors

A second problem in implementing the Bayesian estimation of VAR models is the choice of the prior distribution for the model's parameters:

π(β, Σ)

We distinguish two types of priors:

1 Some priors lead to an analytical formula for the posterior distribution:

- Diffuse prior
- Natural conjugate prior
- Minnesota prior

2 For other priors there is no analytical formula for the posterior and the Gibbs sampler is required:

- Independent Normal-Wishart prior

Definition (Full Bayesian estimation)

A full Bayesian estimation requires specifying a prior distribution for the hyperparameters of the prior distribution of the parameters of interest.

Definition (Empirical Bayesian estimation)

The empirical Bayesian estimation is based on a preliminary estimation of the hyperparameters γ. These estimates could be obtained by OLS, maximum likelihood, GMM or another estimation method. Then, π(θ; γ) is substituted by π(θ; γ̂). Note that the uncertainty in the estimation of the hyperparameters is not taken into account in the posterior distribution.

Empirical Bayesian estimation

In this section, we consider only empirical Bayesian estimation methods:

Y_t = X_t β + ε_t    ε_t ∼ IIN(0, Σ)

Notations:

β̂ = ( ∑_{t=1}^{T} X_t' X_t )⁻¹ ( ∑_{t=1}^{T} X_t' Y_t )    OLS estimator of β

Σ̂ = (1/(T − k)) ∑_{t=1}^{T} (Y_t − X_t β̂)(Y_t − X_t β̂)'    OLS estimator of Σ

Matlab codes for Bayesian estimation of VAR models are proposed by Koop and Korobilis (2010):

1 Analytical results:

- Code BVAR_ANALYT.m gives posterior means and variances of parameters & predictives, using the analytical formulas.
- Code BVAR_FULL.m estimates the BVAR model combining all the priors discussed below, and provides predictions and impulse responses.

2 Gibbs sampler: code BVAR_GIBBS.m estimates this model, but also allows the prior mean and covariance to be specified by the user.

Koop G. and Korobilis D. (2010), Bayesian Multivariate Time Series Methods for Empirical Macroeconomics.

Sub-Section 4.1

Minnesota Prior


Fact (Stylised facts)

Litterman (1986) specifies his prior by appealing to three statistical regularities of macroeconomic time series data:

(1) the trending behavior of macroeconomic time series;

(2) more recent values of a series usually contain more information on the current value of the series than past values;

(3) past values of a given variable contain more information on its current state than past values of other variables.

Litterman R. (1986), Forecasting with Bayesian Vector AutoRegressions, JBES, 4, 25-38.

A Bayesian researcher specifies these regularities by assigning a probability distribution to the parameters in such a way that:

1 the mean of the coefficients assigned to all lags other than the first one is equal to zero;

2 the variance of the coefficients depends inversely on the number of lags;

3 the coefficients of variable j in equation g are assigned a lower prior variance than those of variable g.

These requirements will be controlled by the hyperparameters:

π = (π1, .., πd)' ⟹ prior on B_1, .., B_p, D, Σ

Definition (Minnesota prior, Litterman 1986)

In the Minnesota prior, the variance-covariance matrix Σ is assumed to be fixed and equal to Σ̂. Denote by βk, for k = 1, .., n, the vector of parameters of the k-th equation; the corresponding prior distribution is:

βk ∼ N(β̄k, Ω̄k)

Remark

1 Given these assumptions, there is prior and posterior independence between equations.

2 The equations can be estimated separately.

3 By assuming Ω̄k⁻¹ = 0 (a diffuse prior), the posterior mean of βk corresponds to the OLS estimator of βk.

Litterman (1986) then assigns numerical values to the hyperparameters given the previous stylised facts.

For the prior mean:

1 The traditional choice is to set β̄k = 0 for all the parameters if the k-th variable is a growth rate (random walk behavior).

2 If the variable is in level, all the hyperparameters are set to 0, except the parameter associated with the first own lag.

The prior covariance matrix Ω̄k is assumed to be diagonal with elements ω_{i,i} for i = 1, .., k:

ω_{i,i} = a1/r²  for coefficients on own lags, r = 1, .., p
ω_{i,i} = (a2/r²)(σ_{ii}/σ_{jj})  for coefficients on lag r of variable j ≠ i, r = 1, .., p
ω_{i,i} = a3 σ_{ii}  for coefficients on exogenous variables

This prior simplifies the complicated choice of fully specifying all the elements of Ω̄k to choosing three scalars: a1, a2 and a3.

Theorem (Posterior distribution)

If we assume the Minnesota prior, the posterior distribution of βk is:

βk | Y ∼ N(β̃k, Ω̃k)

β̃k = Ω̃k ( Ω̄k⁻¹ β̄k + σ̂_{kk}⁻² X' Y_k )

Ω̃k = ( Ω̄k⁻¹ + σ̂_{kk}⁻² X' X )⁻¹

where X denotes the matrix of stacked regressors and Y_k the vector of observations on the k-th variable.
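A minimal single-equation Matlab sketch of these formulas (not from the original slides); the data, the prior mean and the prior variances are illustrative, and the fixed residual variance is the usual OLS plug-in.

% Minnesota posterior for one equation (illustrative data and hyperparameters)
T = 200; K = 5;
X = randn(T,K);
y = X*(0.5*ones(K,1)) + randn(T,1);
beta0  = zeros(K,1);                          % prior mean (growth-rate convention)
Omega0 = 0.2*eye(K);                          % diagonal prior variances omega_ii
sig2   = var(y - X*(X\y));                    % fixed residual variance (OLS plug-in)
Omega1 = inv(inv(Omega0) + (X'*X)/sig2);      % posterior variance
beta1  = Omega1*(Omega0\beta0 + (X'*y)/sig2)  % posterior mean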


Sub-Section 4.2

Diffuse Prior

Definition (Diffuse prior)

The diffuse prior is defined as:

π(β, Σ) ∝ |Σ|^{−(n+1)/2}

Remark

π(β, Σ) ∝ |Σ|^{−(n+1)/2}

Contrary to the Minnesota prior:

1 Σ is not assumed to be fixed

2 The equations are not independent

Theorem (Posterior distribution)

For a diffuse prior distribution, the joint posterior distribution is given by:

π(β, Σ | Y) = π(β | Y, Σ) π(Σ | Y)

β | Y, Σ ∼ N(β̂, Σ ⊗ (X'X)⁻¹)

Σ | Y ∼ IW(Σ̂, T − k)

where IW denotes the Inverted Wishart distribution.

Sub-Section 4.3

Natural conjugate prior


Definition (Natural conjugate prior)

The natural conjugate prior has the form:

β | Σ ∼ N(β̄, Σ ⊗ Ω̄)

Σ ∼ IW(Σ̄, α)

with α > n, where IW denotes the Inverted Wishart distribution with α degrees of freedom. The hyperparameters are β̄, Ω̄, Σ̄ and α.

Theorem (Posterior distribution)

The posterior distribution associated with the natural conjugate prior is:

β | Σ, Y ∼ N(β̃, Σ ⊗ Ω̃)

Σ | Y ∼ IW(Σ̃, T + α)

where

β̃ = Ω̃ ( Ω̄⁻¹ β̄ + X'X β̂_ols )

Ω̃ = ( Ω̄⁻¹ + X'X )⁻¹

Sub-Section 4.4

Independent Normal-Wishart Prior


Definition (Independent Normal-Wishart prior)

The independent Normal-Wishart prior is defined as:

π(β, Σ⁻¹) = π(β) π(Σ⁻¹)

where

β ∼ N(β̄, Ω̄)

Σ⁻¹ ∼ W(Σ̄⁻¹, α)

where W(., α) denotes the Wishart distribution with α degrees of freedom.

Remark

1 Note that this prior allows the prior covariance matrix Ω̄ to be anything the researcher chooses, rather than the restrictive Σ ⊗ Ω̄ form of the natural conjugate prior. For instance, the researcher could choose a prior similar in spirit to the Minnesota prior.

2 A noninformative prior can be obtained by setting β̄ = Ω̄⁻¹ = Σ̄⁻¹ = 0.

Theorem (Posterior distribution)

In the case of the independent Normal-Wishart prior, the joint posterior distribution cannot be derived analytically. We can only derive the conditional posterior distributions (used in the Gibbs sampler):

β | Σ, Y ∼ N(β̃, Ω̃)

Σ | Y, β ∼ IW(Σ̃, T + α)

where

β̃ = Ω̃ ( Ω̄⁻¹ β̄ + ∑_{t=1}^{T} X_t' Σ⁻¹ Y_t )

Ω̃ = ( Ω̄⁻¹ + ∑_{t=1}^{T} X_t' Σ⁻¹ X_t )⁻¹
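A Matlab sketch of the corresponding Gibbs cycle, under the assumption that the data arrays Xt{t} (n × nk) and Yt{t} (n × 1), the prior hyperparameters b0, O0, S0, a0 and the constants n, T, M and G are already defined (these names are ours, not Koop and Korobilis'); iwishrnd is the Statistics Toolbox inverse-Wishart generator.

% One Gibbs chain for the independent Normal-Wishart BVAR (sketch)
Sigma = eye(n); draws = zeros(length(b0), M + G);
for m = 1:(M + G)
    P = inv(O0); s = O0\b0;                  % accumulate posterior precision / mean
    for t = 1:T
        P = P + Xt{t}'*(Sigma\Xt{t});
        s = s + Xt{t}'*(Sigma\Yt{t});
    end
    Otil = inv(P);
    beta = Otil*s + chol(Otil,'lower')*randn(length(b0),1);  % beta | Sigma, Y
    S = S0;
    for t = 1:T
        e = Yt{t} - Xt{t}*beta;              % residuals at the current beta
        S = S + e*e';
    end
    Sigma = iwishrnd(S, T + a0);             % Sigma | beta, Y
    draws(:,m) = beta;
end
draws = draws(:, M+1:end);                   % discard the burn-in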


Section 5

Simulation methods

Objectives

The objectives of this section are the following:

1 Introduce the Probability Integral Transform (PIT) method

2 Introduce the Accept-Reject (AR) method

3 Introduce the Importance sampling method

4 Introduce the Gibbs algorithm

5 Introduce the Metropolis-Hastings algorithm

In Bayesian econometrics, we may distinguish two cases, given the prior:

1 When we use a conjugate prior, the posterior distribution is sometimes "standard" (normal, gamma, beta, etc.) and standard packages of statistical software may be used to compute the density π(θ | y), its moments (for instance E(θ | y)), the hpd, etc.

2 In most cases, the posterior density π(θ | y) is obtained numerically through simulation methods.

Sub-Section 5.1

Probability Integral Transform (PIT) method


Probability Integral Transform (PIT) method

Suppose we wish to draw a sample of values from a continuous random variable X that has cdf F_X(.), assumed to be nondecreasing.

The PIT method allows us to generate a sample of X from a sample of random values drawn from U, where U has a uniform distribution over [0, 1]:

U ∼ U[0,1]

Probability Integral Transform (PIT) method

Indeed, assume that the random variable U is a function of the random variable X such that:

U = F_X(X)

What is the distribution of U?

F_X(x) = Pr(X ≤ x) = Pr(F_X(X) ≤ F_X(x)) = Pr(U ≤ F_X(x))

Probability Integral Transform (PIT) method

F_X(x) = Pr(X ≤ x) = Pr(F_X(X) ≤ F_X(x)) = Pr(U ≤ F_X(x))

So, we have:

Pr(U ≤ F_X(x)) = F_X(x)

We know that if U has a uniform distribution then

Pr(U ≤ u) = u

So, U has a uniform distribution.

Definition (Probability Integral Transform (PIT))

If X is a continuous random variable with a cdf F_X(.), then the transformed (probability integral transformation) variable U = F_X(X) has a uniform distribution over [0, 1]:

U = F_X(X) ∼ U[0,1]

How do we get a draw of X from a draw of U?

Definition (PIT algorithm)

In order to get a realisation x of the random variable X with a cdf F_X(.), from a realisation u of the variable U with U ∼ U[0,1], the following procedure has to be adopted:

(1) Draw u from U[0, 1].

(2) Compute x = F_X⁻¹(u).

Example (PIT method)

Suppose we wish to draw a sample from a random variable with density function

f_X(x) = (3/8) x²  if 0 ≤ x ≤ 2,  0 otherwise

Question: Write a Matlab code to generate a sample (x1, .., xn) with n = 100.

Solution

First, determine the cdf of X:

F_X(x) = (3/8) ∫₀ˣ t² dt = (1/8) x³

So, we have:

U = F_X(X) = (1/8) X³ ∼ U[0,1]

Then, determine the probability inverse transformation:

X = F_X⁻¹(U) = 2 U^{1/3}
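A minimal Matlab sketch of the requested code:

% PIT sampling from f(x) = (3/8) x^2 on [0,2]
n = 100;
u = rand(n,1);          % draws from U[0,1]
x = 2*u.^(1/3);         % x = F^{-1}(u)
hist(x)                 % the draws accumulate near 2, as the density implies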

(Figure: the n = 100 simulated values x_i = 2 u_i^{1/3}, which lie in [0, 2].)

Remark

Note that a multivariate random variable cannot be simulated by this method, because its cdf is not one-to-one and therefore not invertible.

Remark

An important application of this method is the problem of sampling from a truncated distribution.

Suppose that X has a cdf F_X(x) and that we want to generate restricted values of X such that

c1 ≤ X ≤ c2

Remark (cont'd)

The cdf of the truncated variable is

[F_X(x) − F_X(c1)] / [F_X(c2) − F_X(c1)]   for c1 ≤ x ≤ c2

Then, we have:

U = [F_X(X) − F_X(c1)] / [F_X(c2) − F_X(c1)] ∼ U[0,1]

and the truncated variable can be defined as:

X_trunc = F_X⁻¹( F_X(c1) + U (F_X(c2) − F_X(c1)) )
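For instance, a minimal Matlab sketch for a standard normal truncated to [c1, c2]; the bounds are illustrative.

% PIT sampling from a N(0,1) truncated to [c1, c2]
c1 = -1; c2 = 2; n = 1000;
u = rand(n,1);
x = norminv(normcdf(c1) + u.*(normcdf(c2) - normcdf(c1)));
% every draw satisfies c1 <= x <= c2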


Recommendation

1 Why use the PIT method?

- In order to generate a sample of simulated values from a given distribution.
- This distribution is not available in my statistical software.

2 What are the prerequisites of the PIT?

- The functional form of the cdf is known and we need to know F⁻¹(x).

Sub-Section 5.2

Accept-reject method


Definition (Accept-reject method)

The accept-reject (AR) algorithm can be used to simulate values from a density function f_X(.) if it is possible to simulate values from a density g(.) and if a number c can be found such that

f_X(x) ≤ c g(x)

for all x in the support of f_X(.).

Definition (Target and source densities)

The density f_X(.) is called the target density and g(.) is called the source density.


Posterior distribution

In the context of Bayesian econometrics, we have:

π(θ | y) ≤ c g(θ | y)   ∀ θ ∈ Θ

c = sup_{θ∈Θ} π(θ | y) / g(θ | y)

where:

1 Target density = posterior distribution π(θ | y)

2 Source density = g(θ | y)

Definition (AR algorithm, posterior distribution)

The Accept-Reject algorithm for the posterior distribution is the following:

(1) Generate a value θ_s from g(θ | y).

(2) Draw a value u from U[0,1].

(3) Return θ_s as a draw from π(θ | y) if

u ≤ π(θ_s | y) / (c g(θ_s | y))

If not, reject it and return to step 1. (The effect of this step is to accept θ_s with probability π(θ_s | y) / (c g(θ_s | y)).)

Example (AR algorithm)

We aim at simulating some realisations from a N(0, 1) distribution (target distribution) with a source density given by a Laplace distribution with pdf:

g(x) = (1/2) exp(−|x|)

Question: write a Matlab code to simulate a sample of n = 1,000 realisations of the normal distribution.

Solution

In order to simplify the problem, we generate only positive values for x.

The source density can then be transformed as follows (exponential density with λ = 1):

g(x) = exp(−x)

Since the normal and the Laplace distributions are symmetric about zero, if the proposal (x > 0) is accepted, it is assigned a positive value with probability one half and a negative value with probability one half.

Solution (cont'd)

The pdfs of the target and source distributions are, for x > 0:

f(x) = (1/√(2π)) exp(−x²/2)    g(x) = exp(−x)

Determination of c: determine the maximum value of f(x)/g(x):

f(x)/g(x) = (1/√(2π)) exp(x − x²/2)

The maximum is reached for x = 1, so we have:

c = (1/√(2π)) exp(1/2)

(Figures: the ratio f(x)/g(x), which peaks at x = 1, and the target density f(x) plotted against the envelope c·g(x), for x ∈ [0, 10].)

Solution (cont'd)

The AR algorithm is the following:

1 Generate x from an exponential distribution with parameter λ = 1.

2 Generate u from a uniform distribution U[0,1].

3 If

u ≤ φ(x) / (c exp(−x)) = [(1/√(2π)) exp(−x²/2)] / [(1/√(2π)) exp(1/2) exp(−x)] = exp(−(x − 1)²/2)

then return x if u > 1/2 and −x if u ≤ 1/2. Otherwise, reject x.
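A Matlab sketch of this algorithm; note that, to keep the sign independent of the acceptance test, the sketch draws a separate uniform for the sign rather than reusing u.

% Accept-reject sampling of N(0,1) from an exponential(1) source
n = 1000; out = zeros(n,1); i = 0;
while i < n
    x = -log(rand);                   % exponential(1) proposal (inverse cdf)
    if rand <= exp(-(x-1)^2/2)        % accept with probability f(x)/(c*g(x))
        i = i + 1;
        out(i) = x*sign(rand - 0.5);  % random sign from a separate uniform
    end
end
% out is (approximately) a sample from N(0,1)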



(Figures: "Generated sample" — the trace of the 1,000 simulated draws — and "Kernel density estimates" of the simulated sample, close to the N(0, 1) density.)

Fact (Unnormalised posterior distribution)

An interesting feature of the AR algorithm is that it allows us to simulate some values of θ from the posterior distribution π(θ | y) by only using the unnormalised posterior:

π(θ | y) = f(y | θ) π(θ) / f(y)

π̃(θ | y) = f(y | θ) π(θ), so that π(θ | y) ∝ π̃(θ | y)

Intuition

π(θ | y) = (1/f(y)) × f(y | θ) π(θ)

Assume that f(y | θ) π(θ) is known, but 1/f(y) is unknown.

If a value of θ generated from g(θ | y) is accepted with probability f(y | θ) π(θ) / (c g(θ | y)), the accepted values of θ are a sample from π(θ | y).

This method can therefore be used even if the normalizing constant of the target distribution is unknown.

Example (AR and posterior distribution)

Consider an i.i.d. sample (Y_1, .., Y_n) of binary variables, such that Y_i ∼ Be(θ) with θ = 0.3. We assume that the (uninformative) prior distribution for θ is a uniform distribution over [0, 1]. Then, we have

π(θ | y) = L_n(θ; y_1, .., y_n) π(θ) / f_{Y_1,..,Y_n}(y_1, .., y_n) = θ^{∑ y_i} (1 − θ)^{∑(1 − y_i)} / ∫₀¹ θ^{∑ y_i} (1 − θ)^{∑(1 − y_i)} dθ

We can use the pdf of the unnormalised posterior π̃(θ | y) to simulate some values (θ_1, .., θ_S) from the posterior distribution:

π̃(θ | y) = θ^{∑ y_i} (1 − θ)^{∑(1 − y_i)}
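A minimal Matlab sketch of this scheme with a uniform source density on [0, 1], so that c is the maximum of π̃(θ | y), attained at the MLE ∑ y_i / n; the sample size and the number of draws are illustrative.

% AR draws from the unnormalised Bernoulli-uniform posterior
y = rand(50,1) < 0.3; n = length(y); s = sum(y);
ptil = @(th) th.^s .* (1-th).^(n-s);    % unnormalised posterior
c = ptil(s/n);                          % sup over [0,1], attained at the MLE
S = 5000; draws = zeros(S,1); i = 0;
while i < S
    th = rand;                          % proposal from the U[0,1] source
    if rand <= ptil(th)/c               % accept with prob ptil/(c*g), g = 1
        i = i + 1; draws(i) = th;
    end
end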


Recommendation

1 Why use the Accept-Reject method?

- In order to generate a sample of simulated values from a given distribution.
- This distribution is not available in my statistical software.
- To generate samples of θ from the unnormalised posterior distribution.

2 What are the prerequisites of the AR method?

- The functional form of the pdf of the target distribution is known.

Sub-Section 5.3

Importance sampling


Suppose that one is interested in calculating the value of the integral

I = E(h(θ) | y) = ∫_Θ h(θ) π(θ | y) dθ

where h(θ) is a continuous function.

Example (Importance sampling)

Suppose that we want to compute the expectation and the variance of the posterior distribution:

E(θ | y) = ∫_Θ θ π(θ | y) dθ

E(θ² | y) = ∫_Θ θ² π(θ | y) dθ

V(θ | y) = E(θ² | y) − E²(θ | y)

I = ∫_Θ h(θ) π(θ | y) dθ

Consider a source density g(θ | y) that is easy to sample from and which is a close match to π(θ | y). Write:

I = ∫_Θ [h(θ) π(θ | y) / g(θ | y)] g(θ | y) dθ

This integral can be approximated by drawing a sample of G values from g(θ | y), denoted θ_1^g, .., θ_G^g, and computing

I ≈ (1/G) ∑_{i=1}^{G} h(θ_i^g) π(θ_i^g | y) / g(θ_i^g | y)

Definition (Importance sampling)

The moment of the posterior distribution

E(h(θ) | y) = ∫_Θ h(θ) π(θ | y) dθ

can be approximated by drawing G realisations θ_1^g, .., θ_G^g from a source density g(θ | y) and computing

E(h(θ) | y) ≈ (1/G) ∑_{i=1}^{G} h(θ_i^g) π(θ_i^g | y) / g(θ_i^g | y)

This expression can be regarded as a weighted average of the h(θ_i^g), where the importance weights are π(θ_i^g | y) / g(θ_i^g | y).

Example (Truncated exponential)

Consider a continuous random variable X with an exponential distribution X ∼ exp(1) truncated to [0, 1]. We want to approximate

E( 1 / (1 + X²) )

by the importance sampling method, with a B(2, 3) source density, because it is defined over [0, 1] and because, for this choice of parameters, the match between the beta density and the target density is good.

Question: write a Matlab code to approximate this integral. Compare it to the value obtained by numerical integration.

Solution

For an exponential distribution with a rate parameter of 1, we have

f(x) = exp(−x)    F(x) = 1 − exp(−x)    for x > 0

If this density is truncated to [0, 1] (truncated at the right), we have

π(x) = f(x) / F(1) = exp(−x) / (1 − exp(−1))

So, we aim at computing:

E( 1 / (1 + X²) ) = ∫₀¹ [1 / (1 + x²)] [exp(−x) / (1 − exp(−1))] dx

Solution (cont'd)

The importance sampling algorithm is the following:

1 Generate a sample of G values x_1, .., x_G from a Beta distribution B(2, 3).

2 Compute

E( 1 / (1 + X²) ) ≈ (1/G) ∑_{i=1}^{G} h(x_i) π(x_i) / g_{2,3}(x_i)

with h(x_i) = 1/(1 + x_i²) and π(x_i) = exp(−x_i)/(1 − exp(−1)), and where g_{α,β}(x) is the pdf of the B(α, β) distribution evaluated at x:

g_{α,β}(x) = x^{α−1} (1 − x)^{β−1} / B(α, β)

and B(α, β) is the beta function.
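A Matlab sketch of this algorithm, with a check against Matlab's numerical integration routine:

% Importance sampling estimate of E[1/(1+X^2)] for the truncated exponential
G = 1e5;
x = betarnd(2,3,G,1);                          % draws from the B(2,3) source
w = (exp(-x)/(1 - exp(-1))) ./ betapdf(x,2,3); % importance weights pi(x)/g(x)
I_is  = mean(w./(1 + x.^2))                    % importance sampling estimate
I_num = integral(@(t) exp(-t)./((1 - exp(-1))*(1 + t.^2)), 0, 1)  % quadrature check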


(Results: figure comparing the importance sampling estimate with the value obtained by numerical integration.)

Recommendation

1 Why use the Importance sampling method?

- In order to compute the moments of a given distribution (typically the posterior distribution).

2 What are the prerequisites of importance sampling?

- The functional form of the pdf of the target distribution is known.

End of Chapter 7
