
The theory of the generalized linear models

We cover in this section three aspects of the statistical theory for generalized linear models. We treat maximum likelihood estimation based on the exponential dispersion models. This includes the derivation of the non-linear estimation equation, known as the score equation, and the iterative weighted least squares algorithm that is used in practice to fit the models to data. We also show a result on the existence and uniqueness of the solution to the score equation for the canonical link. We introduce the likelihood ratio tests (deviance tests) and z-tests. These tests replace the F- and t-tests used for the linear model. In contrast to the linear model it is not possible to derive exact distributional results in the general case. However, the operational procedures for using the tests are the same; the only difference is that the distributions used to compute p-values are approximations. Formal justifications can be based on asymptotic arguments, which are discussed again in the next chapter, page 72, but asymptotic arguments will only be treated briefly. Finally, we discuss model diagnostics and, among other things, the different possible generalizations of residuals.

Maximum likelihood estimation

We first consider the simple case where $Y \sim \mathcal{E}(\theta(\eta), \nu_\psi)$, that is, the distribution of $Y$ is given by the exponential dispersion model with canonical parameter $\theta(\eta)$ and structure measure $\nu_\psi$. Derivations of the score equation and Fisher information in this case can then be used to derive the score equation and Fisher information in the general case when we have independent observations $Y_1, \ldots, Y_n$ with $Y_i \mid X_i \sim \mathcal{E}(\theta(\eta_i), \nu_\psi)$ and $\eta_i = X_i^T \beta$.

Definition 3.13. The score statistic is the gradient of the log-likelihood function,
$$U(\eta) := \nabla_\eta \ell(\eta).$$
The Fisher information,
$$\mathcal{J}(\eta) = -E_\eta D_\eta U(\eta),$$
is minus the expectation of the derivative of the score statistic, or, equivalently, the expectation of the second derivative of the negative log-likelihood.

The score equation is obtained by setting the score statistic equal to 0. In all that follows, the dispersion parameter $\psi$ is regarded as fixed.

Lemma 3.14. If $Y \sim \mathcal{E}(\theta(\eta), \nu_\psi)$ then the log-likelihood function is
$$\ell_Y(\eta) = \frac{\theta(\eta)Y - c(\eta)}{\psi},$$
the score function is $U(\eta) = \theta'(\eta)(Y - \mu(\eta))/\psi$, and the Fisher information is
$$\mathcal{J}(\eta) = \frac{\theta'(\eta)\mu'(\eta)}{\psi}.$$

Proof. The density for the distribution of $Y$ w.r.t. $\nu_\psi$ is by definition
$$e^{\frac{\theta(\eta)y - c(\eta)}{\psi}},$$
and it follows that $\ell_Y(\eta)$ has the stated form. Differentiation of $\psi \ell_Y(\eta)$ yields
$$\psi U(\eta) = \psi \ell'(\eta) = \theta'(\eta)Y - c'(\eta) = \theta'(\eta)\Big(Y - \underbrace{\tfrac{c'(\eta)}{\theta'(\eta)}}_{\mu(\eta)}\Big),$$
where we have used Corollary 3.9. Furthermore, we find that
$$\psi U'(\eta) = \theta''(\eta)(Y - \mu(\eta)) - \theta'(\eta)\mu'(\eta),$$
and since $E_\eta Y = \mu(\eta)$ it follows that
$$\mathcal{J}(\eta) = -E_\eta U'(\eta) = \frac{\theta'(\eta)\mu'(\eta)}{\psi}.$$

With only a single observation, the score equation is equivalent to $\mu(\eta) = Y$, and it follows that there is a solution to the score equation if $Y \in J = \mu(I^\circ)$. However, the situation with a single observation is not relevant for practical purposes. The result is only given as an intermediate step towards the next result.
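As a concrete illustration, consider the Poisson family with the canonical log link and $\psi = 1$. Here $\theta(\eta) = \eta$ and $c(\eta) = e^\eta$, so $\mu(\eta) = e^\eta$ and Lemma 3.14 gives
$$U(\eta) = Y - e^\eta, \qquad \mathcal{J}(\eta) = e^\eta.$$
With a single observation the score equation $\mu(\eta) = Y$ reads $e^\eta = Y$, which has a solution $\eta = \log Y$ precisely when $Y \in J = \mu(I^\circ) = (0, \infty)$, that is, when $Y > 0$.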


The score function $U$ above is a function of the univariate parameter $\eta$. We adopt in the following the convention that for a vector $\eta = (\eta_1, \ldots, \eta_n)^T$,
$$U(\eta) = (U(\eta_1), \ldots, U(\eta_n))^T.$$
That is, the score function is applied coordinatewise to the $\eta$-vector. Note that the derivative (the Jacobian) of $\eta \mapsto U(\eta)$ is an $n \times n$ diagonal matrix:
$$\partial_{\eta_i} U(\eta)_j = \begin{cases} U'(\eta_j) & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$

Theorem 3.15. Assume that $Y_1, \ldots, Y_n$ are independent and that $Y_i \mid X_i \sim \mathcal{E}(\theta(\eta_i), \nu_\psi)$ where $\eta_i = X_i^T \beta$. Then with $\eta = X\beta$ the score statistic is
$$U(\beta) = X^T U(\eta).$$
The Fisher information is
$$\mathcal{J}(\beta) = X^T W X,$$
with the entries in the diagonal weight matrix $W$ being
$$w_{ii} = \frac{\mu'(\eta_i)^2}{\psi V(\eta_i)} = \frac{\theta'(\eta_i)\mu'(\eta_i)}{\psi}.$$
That is, the diagonal weight matrix $W$ is
$$W = \frac{1}{\psi}\begin{pmatrix} \frac{\mu'(\eta_1)^2}{V(\eta_1)} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{\mu'(\eta_n)^2}{V(\eta_n)} \end{pmatrix}.$$

Proof. By the independence assumption the log-likelihood is
$$\ell_Y(\beta) = \sum_{i=1}^n \ell_{Y_i}(\eta_i)$$
where $\eta_i = X_i^T \beta$. By the chain rule,
$$U(\beta) = \nabla_\beta \ell_Y(\beta) = \sum_{i=1}^n X_i U(\eta_i) = X^T U(\eta).$$
As argued above, $-D_\eta U(\eta)$ is diagonal, and the expectations of the diagonal entries are, according to Lemma 3.14,
$$w_{ii} = \frac{\theta'(\eta_i)\mu'(\eta_i)}{\psi}.$$
Thus
$$\mathcal{J}(\beta) = -E_\beta D_\beta U(\beta) = -E_\beta X^T D_\eta U(\eta) X = X^T W X.$$


By observing that
$$U(\eta)_i = \frac{\theta'(\eta_i)(Y_i - \mu(\eta_i))}{\psi}$$
it follows that the score equation $U(\beta) = 0$ is equivalent to the system of equations
$$\sum_{i=1}^n \theta'(\eta_i)(Y_i - \mu(\eta_i))X_{ij} = 0$$
for $j = 1, \ldots, p$. Note that the score equation, and thus its solution, does not depend upon the dispersion parameter. Note also that for the canonical link function the equations simplify because $\theta'(\eta_i) = 1$, and the weights also simplify to
$$w_{ii} = \frac{\mu'(\eta_i)}{\psi} = \frac{V(\eta_i)}{\psi}.$$
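For instance, for Bernoulli responses with the canonical logit link (and $\psi = 1$) we have $\mu(\eta_i) = e^{\eta_i}/(1 + e^{\eta_i})$, and the system above becomes
$$\sum_{i=1}^n \Big( Y_i - \frac{e^{\eta_i}}{1 + e^{\eta_i}} \Big) X_{ij} = 0, \qquad j = 1, \ldots, p,$$
which are the familiar estimating equations for logistic regression.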

Whether there is a solution to the score equation, and whether it is unique, are questions that have a complete answer for the canonical link. For arbitrary link functions the situation is less clear, and we must be prepared for the existence of multiple solutions, or no solution, in practice.

Example 3.16. For the normal distribution $\mathcal{N}(\mu, \sigma^2)$ and with the canonical link function the log-likelihood function becomes
$$\ell(\beta) = \frac{1}{\psi}\sum_{i=1}^n \Big( Y_i X_i^T\beta - \frac{(X_i^T\beta)^2}{2} \Big) = \frac{1}{2\psi}\Big( 2Y^T X\beta - \beta^T X^T X\beta \Big) = \frac{1}{2\psi}\Big( \|Y\|^2 - \|Y - X\beta\|^2 \Big).$$
Up to the term $\|Y\|^2$, which doesn't depend upon the unknown $\beta$-vector, the log-likelihood function is proportional to the squared error loss with proportionality constant $-1/(2\psi)$. The maximum likelihood estimator is thus equal to the least squares estimator.
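A quick numerical illustration of this in R (a minimal sketch with simulated data and illustrative variable names): the Gaussian GLM with its canonical identity link reproduces the least squares fit.

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit_lm  <- lm(y ~ x1 + x2)                        # least squares fit
fit_glm <- glm(y ~ x1 + x2, family = gaussian())  # Gaussian GLM, identity link
all.equal(coef(fit_lm), coef(fit_glm))            # agree up to numerical error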

The general, non-linear score equation does not have a closed form solution and must be solved by iterative methods. Newton's algorithm is based on a first order Taylor approximation of the score function. The resulting approximation of the score equation is a linear equation. Newton's algorithm consists of iteratively computing the first order Taylor approximation and solving the resulting linear approximation. The preferred algorithm for estimation of generalized linear models is a slight modification where the derivative of the score is replaced by its expectation, that is, by the Fisher information. To present the idea we consider a simple example of estimation in the exponential distribution with i.i.d. observations.

Example 3.17. Consider the parametrization $\theta(\eta) = -\eta^{-k}$ for $\eta > 0$ (and a fixed $k > 0$) of the canonical parameter in the exponential distribution. That is, the density is
$$e^{\theta(\eta)y - k\log \eta}$$
w.r.t. the Lebesgue measure on $(0, \infty)$. (If $Z$ is Weibull distributed with shape parameter $k$ and scale parameter $\eta$ then $Z^k$ is exponentially distributed with scale parameter $\eta^k$. This explains the interest in this particular parametrization, as it allows us to fit models with Weibull distributed responses.) The mean value function is
$$\mu(\eta) = -\frac{1}{\theta(\eta)} = \eta^k.$$
With $Y_1, \ldots, Y_n$ i.i.d. observations from this distribution and with
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$$
the score function amounts to
$$U(\eta) = \sum_{i=1}^n \theta'(\eta)(Y_i - \mu(\eta)) = nk\eta^{-1}(\eta^{-k}\bar{Y} - 1),$$
where we have used that $\theta'(\eta) = k\eta^{-k-1}$. The score equation is thus equivalent to
$$\eta^{-k}\bar{Y} - 1 = 0.$$
It is straightforward to solve this equation analytically, and the solution is
$$\eta = \bar{Y}^{1/k}.$$
(This shows that if $Z_1, \ldots, Z_n$ are i.i.d. Weibull distributed with known shape parameter $k$ and scale parameter $\eta$, the MLE of $\eta$ is $\hat{\eta} = \big(\tfrac{1}{n}\sum_{i=1}^n Z_i^k\big)^{1/k}$.) However, to illustrate the general techniques we Taylor expand the left hand side of the equation around $\eta_m$ to first order and obtain the linear equation
$$\eta_m^{-k}\bar{Y} - 1 - k\eta_m^{-k-1}\bar{Y}(\eta - \eta_m) = 0.$$


The solution of this linear equation is
$$\eta_{m+1} = \eta_m - \frac{\eta_m^{-k}\bar{Y} - 1}{-k\eta_m^{-k-1}\bar{Y}} = \eta_m - \frac{U(\eta_m)}{U'(\eta_m)}$$
provided that $U'(\eta_m) \neq 0$. This is Newton's algorithm. With a suitable choice of starting value $\eta_1$ we iteratively update $\eta_m$ until convergence.

If we replace the derivative of the score function in the approximating linear equation with its expectation we arrive at the linear equation
$$\eta_m^{-k}\bar{Y} - 1 - k\eta_m^{-1}(\eta - \eta_m) = 0,$$
whose solution is
$$\eta_{m+1} = \eta_m - \frac{\eta_m^{-k}\bar{Y} - 1}{-k\eta_m^{-1}} = \eta_m + \frac{U(\eta_m)}{\mathcal{J}(\eta_m)}.$$
This general technique is known as Fisher scoring.
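A minimal sketch in R of the Fisher scoring iteration for this example (simulated data and illustrative names; the closed form MLE $\bar{Y}^{1/k}$ is only used as a check):

set.seed(1)
k    <- 2                                  # fixed shape parameter
eta0 <- 1.5                                # scale parameter used for simulation
Y    <- rexp(200, rate = eta0^(-k))        # exponential with mean eta0^k
Ybar <- mean(Y)
n    <- length(Y)

eta <- 1                                   # starting value eta_1
for (m in 1:25) {
  U    <- n * k * eta^(-1) * (eta^(-k) * Ybar - 1)  # score U(eta)
  Info <- n * k^2 * eta^(-2)                        # Fisher information J(eta)
  eta  <- eta + U / Info                            # Fisher scoring update
}
c(fisher_scoring = eta, closed_form = Ybar^(1/k))   # the two should agree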

The iterative weighted least squares algorithm in the general case is no more difficult to formulate than for the one-dimensional example above. The derivative of minus the score function (often called the observed Fisher information) is found as in the proof of Theorem 3.15 to be
$$-D_\beta U(\beta) = X^T W^{\mathrm{obs}} X,$$
where
$$W^{\mathrm{obs}} = -\begin{pmatrix} U'(\eta_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & U'(\eta_n) \end{pmatrix}
\quad \text{and} \quad
U'(\eta_i) = \frac{1}{\psi}\Big(\theta''(\eta_i)(Y_i - \mu(\eta_i)) - \theta'(\eta_i)\mu'(\eta_i)\Big).$$
A first order Taylor expansion of the score function results in the following linearization of the score equation
$$X^T U(\eta_m) - X^T W^{\mathrm{obs}}_m X(\beta - \beta_m) = 0,$$
whose solution is
$$\beta_{m+1} = \beta_m + (X^T W^{\mathrm{obs}}_m X)^{-1} X^T U(\eta_m),$$
provided that $X^T W^{\mathrm{obs}}_m X$ has full rank $p$. (Note that $W^{\mathrm{obs}}_m$ as well as $W_m$ depend upon the current $\beta_m$ through $\eta_m = X\beta_m$, hence the subscript $m$.) Replacing $W^{\mathrm{obs}}_m$ by $W_m$ from Theorem 3.15 we get the Fisher scoring algorithm.


We can rewrite the update formula for the Fisher scoring algorithm as follows:
$$\beta_{m+1} = \beta_m + (X^T W_m X)^{-1} X^T U(\eta_m) = (X^T W_m X)^{-1} X^T W_m \underbrace{\big( X\beta_m + W_m^{-1} U(\eta_m) \big)}_{Z}.$$
The vector $Z$ is known as the working response and it can be simplified to
$$Z = X\beta_m + \begin{pmatrix} \frac{Y_1 - \mu(\eta_{m,1})}{\mu'(\eta_{m,1})} \\ \vdots \\ \frac{Y_n - \mu(\eta_{m,n})}{\mu'(\eta_{m,n})} \end{pmatrix}. \qquad (3.5)$$
In terms of the working response, the vector $\beta_{m+1}$ is the minimizer of the weighted squared error loss
$$(Z - X\beta)^T W_m (Z - X\beta), \qquad (3.6)$$
see Theorem 2.1. The Fisher scoring algorithm for generalized linear models is known as iterative weighted least squares (IWLS), since it can be understood as iteratively solving a weighted least squares problem. To implement the algorithm we can also rely on general solvers of weighted least squares problems. This results in the following version of IWLS. The dispersion parameter can be eliminated, and in practice we take $\psi = 1$ in the IWLS algorithm; this does not mean that the dispersion parameter is irrelevant, as it matters for the subsequent statistical analysis, but not for the estimation. Given $\beta_1$ we iterate over steps 1-3 until convergence:

1. Compute the working response vector $Z$ based on $\beta_m$ using (3.5).

2. Compute the weights
$$w_{ii} = \frac{\mu'(\eta_{m,i})^2}{V(\eta_{m,i})}.$$

3. Compute $\beta_{m+1}$ by minimizing the weighted sum of squares (3.6).

It is noteworthy that the computations only rely on the mean value map $\mu$, its derivative $\mu'$ and the variance function $V$. Thus the IWLS algorithm depends on the mean and variance structure, as specified in the assumptions GA1 and GA2, and not on any other aspects of the exponential dispersion model.
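As a minimal sketch in R of steps 1-3 (simulated Poisson data with the canonical log link, so $\mu(\eta) = \mu'(\eta) = V(\eta) = e^\eta$; variable names are illustrative and glm() is only used as a check):

set.seed(1)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))              # design matrix with intercept
b <- c(0.5, 0.3, -0.2)                         # coefficients used for simulation
Y <- rpois(n, lambda = exp(X %*% b))

beta <- rep(0, ncol(X))                        # starting value beta_1
for (m in 1:25) {
  eta  <- drop(X %*% beta)
  mu   <- exp(eta)
  Z    <- eta + (Y - mu) / mu                  # working response, (3.5)
  w    <- mu                                   # weights mu'(eta)^2 / V(eta)
  beta <- coef(lm(Z ~ X - 1, weights = w))     # weighted least squares, (3.6)
}
cbind(IWLS = beta, glm = coef(glm(Y ~ X - 1, family = poisson())))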


The canonical link function offers some simplifications compared to the general case. We will, in particular, show a satisfactory characterization of existence and uniqueness of a solution to the score equation for the canonical link. We first observe that for the canonical link function $\theta''(\eta) = 0$, and the observed Fisher information coincides with the Fisher information. That is, for the canonical link function
$$W^{\mathrm{obs}} = W,$$
and Newton's algorithm coincides with the IWLS algorithm. Moreover, if we introduce
$$t(\beta) = \sum_{i=1}^n \mu(X_i^T\beta)X_i \quad \text{and} \quad t = \sum_{i=1}^n Y_i X_i,$$
then the score equation is equivalent to the equation
$$t(\beta) = t. \qquad (3.7)$$
Define the convex, open set
$$D = \{\beta \in \mathbb{R}^p \mid X\beta \in (I^\circ)^n\}$$
of parameters for which the linear predictor is an interior point of $I$. The set depends upon $X$, and the map $t$ is defined on $D$. We only search for solutions in $D$. We will also assume in the following that $\mu'(\eta) = V(\eta) > 0$ for $\eta \in I^\circ$. This implies that the diagonal entries in $W$ are strictly positive for $\beta \in D$, and thus that the Fisher information
$$\mathcal{J}(\beta) = X^T W X$$
is positive definite for $\beta \in D$ if and only if $X$ has rank $p$.

Theorem 3.18. If $X$ has full rank $p$ the map $t : D \to \mathbb{R}^p$ is one-to-one. With $C := t(D)$ there is a unique solution to (3.7) if and only if $t \in C$.

Proof. All we need is to establish that $t$ is one-to-one. To reach a contradiction, assume that $t(\beta) = t(\beta')$ with $\rho = \beta' - \beta \neq 0$. Then consider the function
$$h(\alpha) = \rho^T t(\beta + \alpha\rho)$$
with the property that $h(0) = h(1)$. We find that $h$ is continuously differentiable with
$$h'(\alpha) = \rho^T \mathcal{J}(\beta + \alpha\rho)\rho.$$
Since $h(0) = h(1)$ there is an $\alpha_0 \in (0, 1)$ where $h$ attains a local optimum and thus $h'(\alpha_0) = 0$. This implies that $\mathcal{J}(\beta + \alpha_0\rho)$ is not positive definite (only positive semidefinite), which contradicts the full rank $p$ assumption on $X$.

Recall that $J = \mu(I^\circ)$. (Disclaimer: in the following proof we assume that $D = \mathbb{R}^p$, which is not needed for the result to hold.)

Lemma 3.19. If $t_0 = \sum_{i=1}^n \mu_i X_i$ with $\mu_i \in J$ then $t_0 \in C$.

Proof. Take $\nu = \nu_1$, that is, $\nu$ is the structure measure corresponding to the dispersion parameter $\psi = 1$. If $\mu \in J$ there is a $\theta$ such that
$$0 = \int (y - \mu)e^{\theta y}\nu(dy) = \int_{\{y \leq \mu\}} (y - \mu)e^{\theta y}\nu(dy) + \int_{\{y > \mu\}} (y - \mu)e^{\theta y}\nu(dy).$$
If the latter integral is 0 the former integral must be 0 too, which implies that $\nu$ is degenerate at $\mu$. This contradicts the assumption that the mean value map $\mu$ is strictly increasing (that is, that $V(\eta) = \mu'(\eta) > 0$). Thus the latter integral is non-zero, and the important conclusion is that $\nu(\{y \mid y - \mu > 0\}) > 0$. Likewise, $\nu(\{y \mid \mu - y > 0\}) > 0$.

With
$$L(\beta) = e^{\sum_{i=1}^n \mu_i X_i^T\beta - \kappa(X_i^T\beta)} = \prod_{i=1}^n e^{\mu_i X_i^T\beta - \kappa(X_i^T\beta)}$$
we see that $D_\beta \log L(\beta) = 0$ is equivalent to the equation $t(\beta) = t_0$. Thus if we can show that the function $L$ attains a maximum we are done. To this end fix a unit vector $e \in \mathbb{R}^p$. By definition
$$e^{\kappa(\lambda X_i^T e)} = \int e^{\lambda y_i X_i^T e}\,\nu(dy_i),$$
and if we plug this into the definition of $L$ we get that
$$L(\lambda e)^{-1} = \prod_{i=1}^n e^{-\lambda\mu_i X_i^T e}\int e^{\lambda y_i X_i^T e}\,\nu(dy_i) = \int e^{\lambda\left(\sum_{i=1}^n (y_i - \mu_i)X_i^T e\right)}\,\nu^{\otimes n}(dy)$$
for $\lambda > 0$. With $A_+ = \{(y_1, \ldots, y_n) \mid (y_i - \mu_i)\,\mathrm{sign}(X_i^T e) > 0\}$ it follows from the previous considerations that $\nu^{\otimes n}(A_+) > 0$ and by monotone convergence that
$$L(\lambda e)^{-1} \geq \int_{A_+} e^{\lambda\left(\sum_{i=1}^n (y_i - \mu_i)X_i^T e\right)}\,\nu^{\otimes n}(dy) \to \infty \cdot \nu^{\otimes n}(A_+) = \infty$$
for $\lambda \to \infty$.

If $0$ is a maximizer we are done, so we assume it is not. Then there is a sequence $\beta_n$ such that
$$L(0) \leq L(\beta_n) \nearrow \sup_{\beta \in D} L(\beta)$$
and such that $\lambda_n := \|\beta_n\| > 0$ for all $n$. Define the unit vectors $e_n = \beta_n/\lambda_n$; then since the unit sphere is compact this sequence has a convergent subsequence. By taking a subsequence we can thus assume that $e_n \to e$ for $n \to \infty$. By taking a further subsequence we can assume that either $\lambda_n$ is convergent or $\lambda_n \to \infty$. In the former case we conclude that $\beta_n = \lambda_n e_n$ is convergent, and by continuity of $L$ the limit is a maximizer.

To reach a contradiction, assume therefore that $\lambda_n \to \infty$. Choose $\lambda_e$ according to the previous derivations such that $L(\lambda_e e) < L(0)$; then $\lambda_n > \lambda_e$ from a certain point, and by log-concavity of $L$ it holds that
$$L(\lambda_e e_n) \geq L(0)$$
from this point onward. Since the left hand side converges to $L(\lambda_e e)$ we reach a contradiction. We conclude that $L$ always attains a maximum, that this maximum is a solution to the equation $t(\beta) = t_0$, and thus that $t_0 \in C$.

Corollary 3.20. The set $C = t(D)$ has the representation
$$C = \left\{ \sum_{i=1}^n \mu_i X_i \,\middle|\, \mu_i \in J \right\} \qquad (3.8)$$
and is convex. If $X$ has full rank $p$ then $C$ is open.

Proof. Lemma 3.19 shows that $C$ has the claimed representation. The function $\mu$ is continuous and increasing, and, by assumption, it is strictly increasing. Since $J := \mu(I^\circ)$ is the image of the open interval $I^\circ$, the continuity of $\mu$ assures that $J$ is an interval and strict monotonicity assures that $J$ is open. Since $J$ is an interval, $C$ is convex. The full rank condition ensures that $X^T$ maps $\mathbb{R}^n$ onto $\mathbb{R}^p$ as a linear map, which implies that $X^T$ is an open map (it maps open sets onto open sets). In particular, we have that
$$C = X^T(J \times \ldots \times J)$$
is open.

To prove that there is a unique solution to the score equation amounts to proving that $t \in C$. This is, by Corollary 3.20, clearly the case if
$$P_\theta(Y \in J) = 1,$$
but less trivial to check if $P_\theta(Y \in \partial J) > 0$. Note that the solution, if it exists, is unique if $X$ has full rank $p$. Note also that $Y \in \bar{J}$; the observations are in the closure of $J$. The following is a useful observation. Suppose that $X$ has full rank $p$ such that $C$ is open and assume that $t \in C$. Consider one additional observation $(Y_{n+1}, X_{n+1})$ and let $C'$ denote the $C$-set corresponding to the enlarged data set. Then $t + Y_{n+1}X_{n+1} \in C'$. This is obvious if $Y_{n+1} \in J$. Assume that $Y_{n+1}$ is the left end point of $J$; then for sufficiently small $\delta > 0$
$$t + Y_{n+1}X_{n+1} = t - \delta X_{n+1} + \underbrace{(Y_{n+1} + \delta)}_{\in J}X_{n+1},$$
and $t - \delta X_{n+1} \to t \in C$ for $\delta \to 0$. Since $C$ is open, $t - \delta X_{n+1} \in C$ for sufficiently small $\delta$. A similar argument applies if $Y_{n+1}$ is the right end point. In conclusion, if $X$ has full rank $p$ and $t \in C$ such that the score equation has a unique solution, there will still be a unique solution if we add more observations. Figuratively speaking, we cannot lose the existence and uniqueness of the solution to the score equation once it is obtained.
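For binary (logistic) regression the responses always lie on the boundary $\partial J = \{0, 1\}$, and a solution may fail to exist; the classical example is complete separation, in which case no maximum likelihood estimate exists and $t$ falls outside the open set $C$. A minimal sketch in R (illustrative data; glm() is expected to warn about fitted probabilities numerically 0 or 1, and the slope estimate diverges numerically):

x <- c(-2, -1, 1, 2)
y <- c( 0,  0, 1, 1)                     # y = 1 exactly when x > 0: complete separation
fit <- glm(y ~ x, family = binomial())   # expect convergence / fitted-probability warnings
coef(fit)                                # numerically divergent slope estimate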


Big data

Something on using the biglm package and the bigmemory project.
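As a minimal, illustrative sketch (assuming the biglm package is installed; the data and chunk size below are made up), bigglm() fits a generalized linear model by processing the data in chunks, so the full design matrix never has to be held in memory at once:

library(biglm)
d   <- data.frame(y = rpois(1e5, lambda = 2), x = rnorm(1e5))
fit <- bigglm(y ~ x, data = d, family = poisson(), chunksize = 10000)
summary(fit)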

Exercises

Exercise 3.1. The point probabilities for any probability measure on $\{0, 1\}$ can be written as
$$p^y(1 - p)^{1-y}$$
for $y \in \{0, 1\}$. Show that this family of probability measures for $p \in (0, 1)$ forms an exponential family. Identify $I$ and how the canonical parameter depends on $p$.

Exercise 3.2. Show that the binomial distributions on $\{0, 1, \ldots, n\}$ with success probabilities $p \in (0, 1)$ form an exponential family.

Exercise 3.3. Show that if the structure measure $\nu = \delta_y$ is the Dirac measure in $y$ then $\rho_\theta = \delta_y$ for all $\theta \in \mathbb{R}$, where $\rho_\theta$ is the corresponding exponential family. Show next that if $\nu$ is not a one-point measure, then $\rho_\theta$ is not a Dirac measure for $\theta \in I^\circ$, and conclude that its variance is strictly positive.

