Hessian Matrices In Statistics
Ferris Jumah, David Schlueter, Matt Vance
MTH 327 Final Project
December 7, 2011
Topic Introduction
Today we are going to talk about . . .
- Introduce the Hessian matrix
- Brief description of relevant statistics
- Maximum Likelihood Estimation (MLE)
- Fisher Information and Applications
The Hessian Matrix
Recall the Hessian matrix
H(f) =
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\
\dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}   (1)
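As a concrete illustration (not part of the original slides), the Hessian in (1) can be approximated numerically with central differences. The function f below is an arbitrary example chosen so the exact Hessian is easy to check by hand.

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Approximate the Hessian of f at x with central differences."""
    n = len(x)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = h
            e_j = np.zeros(n); e_j[j] = h
            # Central-difference estimate of the (i, j) second partial.
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h * h)
    return H

# Example: f(x, y) = x^2 y + y^3 has exact Hessian [[2y, 2x], [2x, 6y]],
# so at (1, 2) we expect [[4, 2], [2, 12]].
f = lambda v: v[0] ** 2 * v[1] + v[1] ** 3
H = hessian(f, np.array([1.0, 2.0]))
```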
Statistics: Some things to recall
Now, let's talk a bit about inferential statistics.
- Parameters
- Random variables. Definition: a random variable X is a function X : Ω → R.
- Each r.v. follows a distribution that has an associated probability function f(x|θ). E.g., the normal density

  f(x|\mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right]   (2)

- What is a random sample? X_1, \ldots, X_n i.i.d.; the outputs of these r.v.s are our sample data.
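The density in (2) is easy to evaluate directly; a minimal sketch (the sample values below are illustrative, not from the slides):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Normal density f(x | mu, sigma^2) from equation (2)."""
    return (1.0 / math.sqrt(2 * math.pi * sigma2)
            * math.exp(-(x - mu) ** 2 / (2 * sigma2)))

# At x = mu the density peaks at 1 / (sigma * sqrt(2*pi)).
peak = normal_pdf(0.0, 0.0, 1.0)   # standard normal at its mean
```

The density is symmetric about mu, which the test below also checks.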
Stats cont.
Estimators (θ̂) of Population Parameters
- Definition: an estimator is a formula for calculating an estimate θ̂ of a parameter θ from sample data.
- There are many estimators, but which is the best?
Maximum Likelihood Estimation (MLE)
Key Concept: Maximum Likelihood Estimation
GOAL: to determine the best estimate of a parameter θ from a sample.

Likelihood Function
- We obtain a data vector x = (x_1, \ldots, x_n).
- Since the random sample is i.i.d., we express the probability of our observed data given θ as

  f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1|\theta) \cdot f(x_2|\theta) \cdots f(x_n|\theta)   (3)

  f_n(x|\theta) = \prod_{i=1}^{n} f(x_i|\theta)   (4)

- Implication of maximizing the likelihood function: the resulting estimate θ̂ is the parameter value under which the observed data are most probable.
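A minimal numerical sketch of maximizing (4): for a normal sample with known variance, a grid search over the log of the likelihood recovers the sample mean. The simulated data and grid bounds are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=1000)   # i.i.d. sample, sigma = 1 known

def log_likelihood(mu):
    """Log of equation (4) for N(mu, 1), additive constants dropped."""
    return -0.5 * np.sum((x - mu) ** 2)

# Grid search over candidate values of mu.
grid = np.linspace(2.0, 4.0, 2001)
mle = grid[np.argmax([log_likelihood(m) for m in grid])]

# For the normal mean, the MLE is exactly the sample mean.
```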
Example of MLE
Example: Gaussian (normal) linear regression
- Recall least squares regression.
- We wish to determine the weight vector w.
- The likelihood function is given by

  P(y|x, w) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{n} \exp\left[ -\frac{\sum_i (y_i - w^T x_i)^2}{2\sigma^2} \right]   (5)

- To maximize (5), we need to minimize

  \sum_{i=1}^{n} (y_i - w^T x_i)^2 = (y - Aw)^T (y - Aw)   (6)

  where A is the design matrix of our data.
Example of MLE cont.
Following the standard optimization procedure, we compute the gradient of the sum of squares S in (6) (dropping the constant factor of 2):

  \nabla S = -A^T y + A^T A w   (7)

Notice that this is a linear combination of the weights and the columns of A^T A. Setting \nabla S = 0, our resulting critical point is

  w = (A^T A)^{-1} A^T y,   (8)

which we recognize to be the normal equations!
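The critical point (8) can be checked numerically against a library least-squares solver, which minimizes the same sum of squares (6). The data below are synthetic, generated only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))                  # design matrix
w_true = np.array([2.0, -1.0, 0.5])
y = A @ w_true + 0.1 * rng.normal(size=50)    # noisy responses

# Normal equations: solve (A^T A) w = A^T y, as in equation (8).
w_hat = np.linalg.solve(A.T @ A, A.T @ y)

# np.linalg.lstsq minimizes the sum of squares (6) directly.
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Using `solve` rather than forming the explicit inverse is the standard numerically safer reading of (8).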
Computing the Hessian Matrix
We compute the Hessian in order to show that this critical point is a minimum:

  \frac{\partial}{\partial w_k} \nabla S
  = \frac{\partial}{\partial w_k} \left[ w_1 \begin{pmatrix} x_{1,1} \\ \vdots \\ x_{n,1} \end{pmatrix} + \cdots + w_k \begin{pmatrix} x_{1,k} \\ \vdots \\ x_{n,k} \end{pmatrix} + \cdots + w_n \begin{pmatrix} x_{1,n} \\ \vdots \\ x_{n,n} \end{pmatrix} \right]
  = \begin{pmatrix} x_{1,k} \\ \vdots \\ x_{n,k} \end{pmatrix}

Therefore,

  H = A^T A   (9)

which is positive semi-definite (w^T A^T A w = \|Aw\|^2 \ge 0 for every w). Therefore, our estimate for w minimizes the sum of squares (6) and hence maximizes our likelihood function.
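That H = AᵀA in (9) is positive semi-definite can be verified numerically: all of its eigenvalues are non-negative. The random design matrix below is an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 4))      # any real design matrix
H = A.T @ A                       # Hessian from equation (9)

# H is symmetric, so eigvalsh (for Hermitian matrices) applies.
eigvals = np.linalg.eigvalsh(H)
# Since w^T H w = ||A w||^2 >= 0 for every w, no eigenvalue can be negative.
```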
MLE cont.
Advantages and Disadvantages
- Larger samples give better estimates: as n → ∞, θ̂_n → θ.
- Other advantages
- Disadvantages: uniqueness, existence, reliance upon distribution fit.
- This begs the question: how much information about a parameter can be gathered from sample data?
Fisher Information
Key Concept: Fisher Information
We determine the amount of information about a parameter carried by a sample using the Fisher information, defined by

  I(\theta) = -E\left[ \frac{\partial^2 \ln[f(x|\theta)]}{\partial \theta^2} \right].   (10)

Intuitive appeal: more data provides more information about the population parameter.
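Definition (10) can be approximated by Monte Carlo. For N(µ, σ²) with σ² known, ∂² ln f/∂µ² = −1/σ² is constant, so I(µ) = 1/σ²; by a standard identity this also equals E[score²], where the score is (x−µ)/σ². The sketch below uses that identity with illustrative parameter values.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2 = 0.0, 4.0
x = rng.normal(mu, np.sqrt(sigma2), size=200_000)

# Score for mu: d/dmu ln f(x|mu) = (x - mu) / sigma^2.
score = (x - mu) / sigma2

# Monte Carlo estimate of I(mu) = E[score^2]; exact value is 1/sigma^2 = 0.25.
I_mu = np.mean(score ** 2)
```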
Fisher information example
Example: finding the Fisher information for the normal distribution N(µ, σ²).

The log-likelihood function is

  \ln[f(x|\theta)] = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}   (11)

where the parameter vector is θ = (µ, σ²).

The gradient of the log-likelihood is

  \left( \frac{\partial \ln[f(x|\theta)]}{\partial \mu},\ \frac{\partial \ln[f(x|\theta)]}{\partial \sigma^2} \right)
  = \left( \frac{x-\mu}{\sigma^2},\ \frac{(x-\mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2} \right)   (12)
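The two partial derivatives in (12) can be checked against central differences of (11); the evaluation point below is arbitrary.

```python
import math

def log_f(x, mu, s2):
    """Log-likelihood (11) for a single observation."""
    return -0.5 * math.log(2 * math.pi * s2) - (x - mu) ** 2 / (2 * s2)

x, mu, s2, h = 1.3, 0.4, 2.0, 1e-6

# Analytic gradient from equation (12).
d_mu = (x - mu) / s2
d_s2 = (x - mu) ** 2 / (2 * s2 ** 2) - 1 / (2 * s2)

# Central-difference approximations of the same partials.
num_mu = (log_f(x, mu + h, s2) - log_f(x, mu - h, s2)) / (2 * h)
num_s2 = (log_f(x, mu, s2 + h) - log_f(x, mu, s2 - h)) / (2 * h)
```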
Fisher information example continued
We now compute the Hessian matrix, which will lead us to our Fisher information matrix:

  \frac{\partial^2 \ln[f(x|\theta)]}{\partial \theta^2}
  = \begin{pmatrix} \frac{\partial^2 \ln[f(x|\theta)]}{\partial \mu^2} & \frac{\partial^2 \ln[f(x|\theta)]}{\partial \mu\,\partial \sigma^2} \\ \frac{\partial^2 \ln[f(x|\theta)]}{\partial \mu\,\partial \sigma^2} & \frac{\partial^2 \ln[f(x|\theta)]}{\partial (\sigma^2)^2} \end{pmatrix}
  = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{x-\mu}{\sigma^4} \\ -\frac{x-\mu}{\sigma^4} & \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6} \end{pmatrix}   (13)

We now compute our Fisher information matrix. Using E[x-\mu] = 0 and E[(x-\mu)^2] = \sigma^2, we see that

  I(\theta) = -E\left( \frac{\partial^2 \ln[f(x|\theta)]}{\partial \theta^2} \right)   (14)

  = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}   (15)
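Matrix (15) can be sanity-checked by averaging −1 times the Hessian entries of (13) over simulated draws; the parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, s2 = 1.0, 2.0
x = rng.normal(mu, np.sqrt(s2), size=500_000)

# Sign-flipped Hessian entries from (13), averaged over the sample.
I_11 = 1.0 / s2                                # -(-1/sigma^2) is constant
I_12 = np.mean((x - mu) / s2 ** 2)             # -E[-(x-mu)/sigma^4], expect 0
I_22 = np.mean((x - mu) ** 2 / s2 ** 3 - 1.0 / (2 * s2 ** 2))
# Expected from (15): I_11 = 1/s2 = 0.5, I_12 = 0, I_22 = 1/(2 s2^2) = 0.125.
```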
Applications of Fisher information
Fisher information is used in the calculation of . . .
- The lower bound of Var(θ̂) (the Cramér–Rao lower bound) for an estimator θ̂, given by

  Var(\hat{\theta}) \ge \frac{1}{I(\theta)}   (16)

- The Wald test: comparing a proposed value θ_0 of θ against the MLE θ̂. The test statistic is given by

  W = \frac{\hat{\theta} - \theta_0}{s.e.(\hat{\theta})}   (17)

  where

  s.e.(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}}   (18)
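A sketch of (17)–(18) for the mean of a normal sample with known σ². Assumptions not in the slides: synthetic data, and I here is the whole-sample information n/σ² (per-observation information 1/σ² scaled by n), so the standard error comes out as σ/√n.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, n = 1.0, 400
x = rng.normal(loc=0.2, scale=np.sqrt(sigma2), size=n)

theta_hat = x.mean()              # MLE of the mean
theta_0 = 0.0                     # proposed null value
info = n / sigma2                 # sample Fisher information for mu
se = 1.0 / np.sqrt(info)          # equation (18): s.e. = 1/sqrt(I)
W = (theta_hat - theta_0) / se    # equation (17)

# Under H0, W is approximately standard normal; |W| > 1.96 rejects at the 5% level.
```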