Generalized Linear Models-1


7/31/2019 Generalized Linear Models-1

1/29

Generalized Linear Models


2/29

Generalized linear models: Exponential family

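In generalized linear models the response is assumed to have a probability distribution from the exponential family:

$$f(y \mid \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}$$

Here $\theta$ determines the link function, $b(\theta)$ determines the moments, and $a(\phi)$ is the dispersion.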


3/29


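The normal distribution has the form:

$$
\begin{aligned}
f(y \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}
 = e^{-\frac{(y-\mu)^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)} \\
&= e^{-\frac{y^2 - 2y\mu + \mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)}
 = e^{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)} \\
&= e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}
\end{aligned}
$$

with $\theta = \mu$, $b(\theta) = \mu^2/2$, $a(\phi) = \sigma^2$ and $c(y,\phi) = -\left(\frac{y^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)\right)$.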


4/29


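The Poisson distribution has the form:

$$f(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = e^{y\ln(\mu) - \mu - \ln(y!)} = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

with $\theta = \ln(\mu)$, $b(\theta) = \mu = e^{\theta}$, $a(\phi) = 1$ and $c(y,\phi) = -\ln(y!)$.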


5/29


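The binomial distribution has the form:

$$
\begin{aligned}
f(y \mid n, p) &= \binom{n}{y} p^{y}(1-p)^{n-y}
 = e^{y\ln(p) + (n-y)\ln(1-p) + \ln\binom{n}{y}} \\
&= e^{y\ln\left(\frac{p}{1-p}\right) - \left(-n\ln(1-p)\right) + \ln\binom{n}{y}}
 = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}
\end{aligned}
$$

with $\theta = \ln\left(\frac{p}{1-p}\right)$, $b(\theta) = -n\ln(1-p)$, $a(\phi) = 1$ and $c(y,\phi) = \ln\binom{n}{y}$.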
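The exponential-family decompositions of the normal, Poisson and binomial densities can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper names (`expfam_density`, `poisson_expfam`, and so on) and the test values are illustrative assumptions. It evaluates exp{(yθ − b(θ))/a(φ) + c(y, φ)} and compares the result with the textbook density.

```python
import math

def expfam_density(y, theta, phi, a, b, c):
    """Evaluate exp{(y*theta - b(theta)) / a(phi) + c(y, phi)}."""
    return math.exp((y * theta - b(theta)) / a(phi) + c(y, phi))

def poisson_expfam(y, mu):
    # theta = ln(mu), b(theta) = exp(theta), a(phi) = 1, c = -ln(y!)
    return expfam_density(
        y, math.log(mu), 1.0,
        a=lambda phi: 1.0,
        b=lambda th: math.exp(th),
        c=lambda y_, phi: -math.lgamma(y_ + 1),
    )

def binomial_expfam(y, n, p):
    # theta = ln(p/(1-p)), b(theta) = n*ln(1+exp(theta)) = -n*ln(1-p),
    # a(phi) = 1, c = ln C(n, y)
    return expfam_density(
        y, math.log(p / (1 - p)), 1.0,
        a=lambda phi: 1.0,
        b=lambda th: n * math.log1p(math.exp(th)),
        c=lambda y_, phi: math.log(math.comb(n, y_)),
    )

def normal_expfam(y, mu, sigma2):
    # theta = mu, b(theta) = theta^2/2, a(phi) = sigma^2,
    # c = -y^2/(2 sigma^2) - 0.5*ln(2 pi sigma^2)
    return expfam_density(
        y, mu, sigma2,
        a=lambda phi: phi,
        b=lambda th: th * th / 2,
        c=lambda y_, phi: -y_ * y_ / (2 * phi) - 0.5 * math.log(2 * math.pi * phi),
    )

# Textbook densities at arbitrary (made-up) test points
poisson_direct = math.exp(-2.5) * 2.5 ** 3 / math.factorial(3)
binomial_direct = math.comb(10, 4) * 0.3 ** 4 * 0.7 ** 6
normal_direct = math.exp(-(1.2 - 0.5) ** 2 / (2 * 2.0)) / math.sqrt(2 * math.pi * 2.0)

print(poisson_expfam(3, 2.5), poisson_direct)
print(binomial_expfam(4, 10, 0.3), binomial_direct)
print(normal_expfam(1.2, 0.5, 2.0), normal_direct)
```

Each printed pair should agree up to floating-point rounding.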


6/29


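In generalized linear models the response is assumed to have a probability distribution from the exponential family:

$$f(y \mid \theta, \phi) = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

$$E(Y) = \mu = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\,a(\phi), \qquad g(\mu) = \theta$$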


7/29


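The normal distribution has the form:

$$
\begin{aligned}
f(y \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}
 = e^{-\frac{(y-\mu)^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)} \\
&= e^{-\frac{y^2 - 2y\mu + \mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)}
 = e^{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)} \\
&= e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}
\end{aligned}
$$

with $\theta = \mu$, $b(\theta) = \mu^2/2$, $a(\phi) = \sigma^2$ and $c(y,\phi) = -\left(\frac{y^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)\right)$, so that

$$E(Y) = b'(\theta) = \mu, \qquad \mathrm{Var}(Y) = b''(\theta)\,a(\phi) = \sigma^2, \qquad g(\mu) = \mu$$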


8/29


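The Poisson distribution has the form:

$$f(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = e^{y\ln(\mu) - \mu - \ln(y!)} = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}$$

with $\theta = \ln(\mu)$, $b(\theta) = \mu$, $a(\phi) = 1$ and $c(y,\phi) = -\ln(y!)$, so that

$$\theta = \ln(\mu), \qquad b(\theta) = \exp(\theta), \qquad E(Y) = b'(\theta) = \exp(\theta) = \mu$$

$$\mathrm{Var}(Y) = b''(\theta)\,a(\phi) = \exp(\theta) = \mu, \qquad g(\mu) = \ln(\mu)$$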


9/29


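The binomial distribution has the form:

$$
\begin{aligned}
f(y \mid n, p) &= \binom{n}{y} p^{y}(1-p)^{n-y}
 = e^{y\ln(p) + (n-y)\ln(1-p) + \ln\binom{n}{y}} \\
&= e^{y\ln\left(\frac{p}{1-p}\right) - \left(-n\ln(1-p)\right) + \ln\binom{n}{y}}
 = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)}
\end{aligned}
$$

with $\theta = \ln\left(\frac{p}{1-p}\right)$, $b(\theta) = -n\ln(1-p)$, $a(\phi) = 1$ and $c(y,\phi) = \ln\binom{n}{y}$, so that

$$\theta = \ln\left(\frac{p}{1-p}\right), \qquad b(\theta) = n\ln(1 + \exp(\theta)), \qquad E(Y) = b'(\theta) = np$$

$$\mathrm{Var}(Y) = b''(\theta)\,a(\phi) = np(1-p), \qquad g(p) = \ln\left(\frac{p}{1-p}\right)$$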
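The moment relations E(Y) = b'(θ) and Var(Y) = b''(θ)a(φ) can be verified with finite differences. This is a small sketch under assumed parameter values (μ = 2.5 for the Poisson, n = 10 and p = 0.3 for the binomial, μ = 0.5 and σ² = 2 for the normal); the helper names `d1` and `d2` are hypothetical.

```python
import math

def d1(f, x, h=1e-4):
    """Central-difference approximation of the first derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    """Central-difference approximation of the second derivative."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

# Poisson: theta = ln(mu), b(theta) = exp(theta), a(phi) = 1
mu = 2.5
theta_pois = math.log(mu)
mean_pois = d1(math.exp, theta_pois)          # should be close to mu
var_pois = d2(math.exp, theta_pois) * 1.0     # should be close to mu

# Binomial: theta = ln(p/(1-p)), b(theta) = n*ln(1+exp(theta)), a(phi) = 1
n, p = 10, 0.3
theta_bin = math.log(p / (1 - p))
b_bin = lambda th: n * math.log1p(math.exp(th))
mean_bin = d1(b_bin, theta_bin)               # should be close to n*p
var_bin = d2(b_bin, theta_bin) * 1.0          # should be close to n*p*(1-p)

# Normal: theta = mu, b(theta) = theta^2/2, a(phi) = sigma^2
mu_n, sigma2 = 0.5, 2.0
b_norm = lambda th: th * th / 2
mean_norm = d1(b_norm, mu_n)                  # should be close to mu_n
var_norm = d2(b_norm, mu_n) * sigma2          # should be close to sigma2

print(mean_pois, var_pois)
print(mean_bin, var_bin)
print(mean_norm, var_norm)
```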


10/29

Probability distributions

Normal: (2)

Inverse Gaussian: (3)

Gamma: (4)


11/29

Probability distributions

Negative Binomial: (5)

Poisson: (6)

Binomial: (6)


12/29

Generalized Linear Models (GLM)

A general class of linear models made up of 3 components: random, systematic, and link function.

Random component: identifies the dependent variable (Y) and its probability distribution.

Systematic component: identifies the set of explanatory variables (X1,...,Xk).

Link function: identifies a function of the mean that is a linear function of the explanatory variables:

$$g(\mu) = \alpha + \beta_1 X_1 + \cdots + \beta_k X_k$$


13/29

Generalised linear model

If the distribution of the observations is one of the distributions from the exponential family, and some function of the expected value of the observations is a linear function of the parameters, then a generalised linear model is used:

$$g(E(\mathbf{y})) = \bigl(g(E(y_1)), \ldots, g(E(y_n))\bigr) = \bigl(g(b'(\theta_1)), \ldots, g(b'(\theta_n))\bigr) = X\beta$$

The function g is called the link function. Here is a list of popular distributions and their corresponding link functions:

binomial - logit = ln(p/(1-p))

normal - identity

Gamma - inverse

Poisson - log

The most natural choice is $\theta = X\beta$. Optimization for this kind of function is done iteratively.
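The four link functions listed for this slide, together with their inverses, can be sketched as follows. The dictionary `LINKS`, the function `predict_mean` and the coefficient values are illustrative assumptions, not something defined in the slides.

```python
import math

# family -> (link g, inverse link g^{-1}); these are the links named on the slide
LINKS = {
    "binomial": (lambda p: math.log(p / (1 - p)),        # logit
                 lambda eta: 1 / (1 + math.exp(-eta))),
    "normal":   (lambda mu: mu, lambda eta: eta),         # identity
    "gamma":    (lambda mu: 1 / mu, lambda eta: 1 / eta), # inverse
    "poisson":  (math.log, math.exp),                     # log
}

def predict_mean(family, x, beta):
    """Map the linear predictor eta = x . beta back to the mean via g^{-1}."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    _, inverse_link = LINKS[family]
    return inverse_link(eta)

# Hypothetical coefficients: intercept 0.5 and slope 0.25, covariate value 2
print(predict_mean("poisson", [1.0, 2.0], [0.5, 0.25]))   # exp(1.0)
print(predict_mean("binomial", [1.0, 2.0], [0.5, 0.25]))  # logistic(1.0)
```

Applying a link to the output of its inverse returns the linear predictor, which is a quick sanity check for any link pair added to the table.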


14/29

Likelihood function

$$f(\theta \mid X) = \frac{f(X \mid \theta)\, p(\theta)}{p(X)}$$


15/29

Likelihood function

$$L(x_1, x_2, \ldots, x_n; p) = \prod_{i=1}^{n} f_X(x_i; p)$$

$$\ln L(x_1, x_2, \ldots, x_n; p) = \sum_{i=1}^{n} \ln f_X(x_i; p)$$
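As an illustration of the product and sum forms of the likelihood, the sketch below evaluates the log-likelihood of an independent sample and locates its maximum on a grid. The Poisson density, the sample values and the grid are assumptions chosen for the example.

```python
import math

def poisson_logpmf(y, mu):
    """ln f(y; mu) for a Poisson density."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def log_likelihood(sample, mu):
    """ln L = sum of the log densities of the independent observations."""
    return sum(poisson_logpmf(y, mu) for y in sample)

sample = [2, 4, 3, 0, 5, 1, 3]                 # made-up counts
grid = [0.5 + 0.01 * k for k in range(600)]    # candidate values of mu
best_mu = max(grid, key=lambda m: log_likelihood(sample, m))
sample_mean = sum(sample) / len(sample)

# The product of densities equals exp(sum of log densities)
product = math.prod(math.exp(poisson_logpmf(y, 2.0)) for y in sample)

print(best_mu, sample_mean)
print(product, math.exp(log_likelihood(sample, 2.0)))
```

For the Poisson case the grid maximum lands next to the sample mean, which is the closed-form maximum likelihood estimate.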


16/29

Likelihood function Poisson distribution

$$L = f(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = e^{y\log(\mu) - \mu - \log(y!)}$$

$$\ln L = y\log(\mu) - \mu - \log(y!)$$

Newton-Raphson algorithm
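A minimal sketch of Newton-Raphson applied to the Poisson log-likelihood, assuming a log link with an intercept and a single covariate, log(μ_i) = b0 + b1 x_i. The function name, the starting values and the data are hypothetical; the slides only name the algorithm.

```python
import math

def fit_poisson_newton(x, y, iters=25, tol=1e-10):
    """Maximise sum_i [y_i*eta_i - exp(eta_i)] with eta_i = b0 + b1*x_i."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start at the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (first derivatives of the log-likelihood)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative second derivatives)
        i00 = sum(mu)
        i01 = sum(mi * xi for xi, mi in zip(x, mu))
        i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system  information * delta = score
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

x_data = [0, 1, 2, 3, 4, 5]        # made-up covariate
y_data = [1, 2, 2, 4, 6, 9]        # made-up counts
b0_hat, b1_hat = fit_poisson_newton(x_data, y_data)
print(b0_hat, b1_hat)
```

At convergence the score equations hold, that is, the residuals y_i − μ_i sum to zero both unweighted and weighted by x_i.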


17/29

Likelihood ratio test

Let us assume that we have a sample of size n (x = (x1,...,xn)) and we want to estimate a parameter vector θ = (θ1, θ2). Both θ1 and θ2 can also be vectors. We want to test the null hypothesis against the alternative:

$$H_0: \theta_1 = \theta_{10} \quad \text{against} \quad H_1: \theta_1 \neq \theta_{10}$$

Let us assume that the likelihood function is L(x | θ). Then the likelihood ratio test works as follows: 1) maximise the likelihood function under the null hypothesis (i.e. fix the parameter(s) θ1 equal to θ10) and find the value of the likelihood at the maximum; 2) maximise the likelihood under the alternative hypothesis (i.e. unconstrained maximisation) and find the value of the likelihood at the maximum; then find the ratio:

$$w = \frac{L(x \mid \theta_{10}, \tilde\theta_2)}{L(x \mid \hat\theta_1, \hat\theta_2)}$$

where $\tilde\theta_2$ is the value of the parameter after constrained maximisation, and $\hat\theta_1, \hat\theta_2$ are the values of both parameters after unconstrained maximisation.

w is the likelihood ratio statistic. Tests carried out using this statistic are called likelihood ratio tests. In this case it is clear that:

$$0 \le w \le 1$$

If the value of w is small then the null hypothesis is rejected. If g(w) is the density of the distribution of w, then the critical region can be calculated using:

$$\int_0^{c} g(w)\, dw = \alpha$$
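The likelihood ratio statistic can be sketched for the simplest case: a Poisson sample with a single parameter μ and no nuisance parameter θ2, testing H0: μ = μ0. Instead of the exact density g(w), the sketch relies on the standard asymptotic result that −2 ln w is approximately chi-squared with 1 degree of freedom under H0; the sample and μ0 are made up for illustration.

```python
import math

def poisson_loglik(sample, mu):
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in sample)

def lrt_poisson_mean(sample, mu0):
    """Return (w, -2 ln w, asymptotic p-value) for H0: mu = mu0."""
    mu_hat = sum(sample) / len(sample)          # unconstrained maximum
    log_w = poisson_loglik(sample, mu0) - poisson_loglik(sample, mu_hat)
    w = math.exp(log_w)
    stat = max(0.0, -2.0 * log_w)               # guard against rounding below 0
    # Survival function of chi-squared with 1 d.f.: P(Z^2 > s) = erfc(sqrt(s/2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return w, stat, p_value

sample = [2, 4, 3, 0, 5, 1, 3]                  # made-up counts
w, stat, p_value = lrt_poisson_mean(sample, mu0=1.5)
print(w, stat, p_value)
```

A small w (equivalently a large −2 ln w) leads to rejection of the null hypothesis, and w equals 1 when μ0 coincides with the sample mean.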


18/29

Deviances

In the linear model, we maximise the likelihood with the full model and under the null hypothesis. The ratio of the values of the likelihood function under the two hypotheses (null and alternative) is related to the F-distribution. The interpretation is how much the variance would increase if we removed part of the model (null hypothesis).

In logistic and log-linear models, the likelihood function is again maximised under the null and alternative hypotheses. Then the logarithm of the ratio of the values of the likelihood under these two hypotheses (it is called the deviance) asymptotically has a chi-squared distribution:

$$\chi^2 = 2\,\bigl(\log l(y \mid y) - \log l(y \mid \hat\mu)\bigr)$$

That is, twice the difference between the maximum achievable log-likelihood and the log-likelihood at the estimated parameters.

That is the reason why in log-linear and logistic regressions it is usual to talk about deviances and chi-squared statistics instead of variances and F-statistics. Analysis based on log-linear and logistic models (in general, on generalised linear models) is usually called analysis of deviance. The reason for this is that chi-squared is related to the deviation between the fitted model and the observations.

Another test is based on Pearson's chi-squared statistic:

$$X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat\mu_i)^2}{\mathrm{Var}(y_i)}$$

It asymptotically approaches a chi-squared distribution with n-p degrees of freedom (n is the number of observations and p is the number of parameters).
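For a concrete instance of both statistics, the sketch below assumes a Poisson model, where the variance of y_i equals the fitted mean, and an intercept-only fit on made-up counts. The function names are hypothetical. It also checks the closed-form Poisson deviance against the definition 2(log l(y | y) − log l(y | μ̂)).

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood; a term with y_i = 0 contributes 0*log(mu_i) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(mi) if yi > 0 else 0.0
        total += log_term - mi - math.lgamma(yi + 1)
    return total

def poisson_deviance(y, mu):
    """2 * sum[y*ln(y/mu) - (y - mu)], with y*ln(y) taken as 0 when y = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total

def pearson_chi2(y, mu):
    """Pearson statistic with the Poisson variance Var(y_i) = mu_i."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

y = [2, 4, 3, 0, 5, 1, 3]                       # made-up counts
mu_hat = [sum(y) / len(y)] * len(y)             # intercept-only fit

deviance = poisson_deviance(y, mu_hat)
by_definition = 2.0 * (poisson_loglik(y, y) - poisson_loglik(y, mu_hat))
print(deviance, by_definition, pearson_chi2(y, mu_hat))
```

Here n = 7 and p = 1, so both statistics would be referred to a chi-squared distribution with 6 degrees of freedom.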


19/29

Goodness of Fit: Deviance

Deviance = -2[LM - LS]

where LM is the maximum log-likelihood of the model of interest and LS is the maximum log-likelihood for the most complex model, which has a separate parameter at each explanatory setting (the saturated model).

Deviance has approximately a chi-square distribution with df = N - p, where N = number of observations and p = number of parameters (including the intercept).

Likelihood ratio test for model comparison between M1 and M0 (M0 is a simpler model than M1):

Likelihood ratio = -2[L0 - L1] = -2[L0 - LS] - {-2[L1 - LS]} = Deviance0 - Deviance1
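The identity between the likelihood ratio statistic and the difference of deviances can be illustrated with two nested Poisson models whose maximum likelihood fits are available in closed form: M0 with one common mean, and M1 with a separate mean for each of two groups. The data and group structure are made up for the example.

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood; a term with y_i = 0 contributes 0*log(mu_i) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(mi) if yi > 0 else 0.0
        total += log_term - mi - math.lgamma(yi + 1)
    return total

group_a = [1, 0, 2, 1, 3]          # made-up counts, group A
group_b = [4, 6, 3, 5, 7]          # made-up counts, group B
y = group_a + group_b

# M0: a single mean for all observations (1 parameter)
mu0 = [sum(y) / len(y)] * len(y)
# M1: one mean per group (2 parameters)
mu1 = ([sum(group_a) / len(group_a)] * len(group_a)
       + [sum(group_b) / len(group_b)] * len(group_b))

L0 = poisson_loglik(y, mu0)
L1 = poisson_loglik(y, mu1)
LS = poisson_loglik(y, y)          # saturated model: fitted value = observation

deviance0 = -2.0 * (L0 - LS)
deviance1 = -2.0 * (L1 - LS)
lr_stat = -2.0 * (L0 - L1)

# One extra parameter in M1, so compare with chi-squared on 1 d.f.
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print(deviance0, deviance1, lr_stat, p_value)
```

The saturated log-likelihood cancels in the difference, which is why only the two deviances are needed to compare nested models.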


20/29

Model fit metrics

Covariance matrix for parameters: computed from 2nd partial derivatives of the log-likelihood function.

Likelihood ratio test: ratio of max square log likelihood to square log likelihood of null hypothesis.

Deviance: measure of how overdetermined the system is. Compare max of full system to max of saturated model (number of parameters equals number of data points).


21/29

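Range of plausible models

Likelihood ratio

$$\lambda = \frac{f(y; b_0, \sigma^2)}{f(y; \hat b, \sigma^2)}$$

With $b_0$ the specified model and $\hat b$ the best model.

Ratio of likelihood of any model to likelihood of best model:

$$\lambda = \frac{\exp\left[-\frac{1}{2}\left(\frac{y - b_0}{\sigma}\right)^2\right]}{\exp\left[-\frac{1}{2}\left(\frac{y - \hat b}{\sigma}\right)^2\right]} = \exp\left[-\frac{1}{2}\left(\frac{y - b_0}{\sigma}\right)^2\right] = \exp\left[-\frac{1}{2} z^2\right]$$

(the best model has $\hat b = y$, so the denominator equals 1).

Log-likelihood ratio: $\ln\lambda = -\frac{1}{2} z^2$, so $z^2 = -2\ln\lambda$.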


22/29

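Example

(Alt = altitude, Sl = slope, Te = temperature, Pp = precipitation, V = vegetation, Lc = landcover)

Site Longitude Latitude Alt Sl Te Pp V Ndvi Soil Lc S M B P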

Nicolás B. -104.7 24.38 1920 2 17 450 5 90 7 8 9 3 79 80

Librado R. -104.26 24.4 2005 2 17 450 7 84 9 8 24 28 31 32

La Ermita -104.33 23.89 2169 10 17 550 6 109 9 11 47 6 13 14

Madero -104.29 24.27 1942 2 17 450 3 74 9 8 16 85 110 111

Castillo N. -104.49 24.34 1923 2 17 450 7 78 7 8 20 58 33 34

Km 188 -104.61 25.38 1501 3 21 350 4 83 2 10 29 28 27 28

Km 130 -104.51 24.99 1733 4 19 450 4 85 3 9 22 20 20 21

Las Huertas -104.29 24.27 1930 2 17 450 3 75 9 8 15 13 20 21

18 de Agosto -104.15 23.95 1866 1 17 450 7 81 9 12 17 10 20 21

El Venado -104.28 23.87 1747 4 17 450 6 83 4 8 20 10 20 21

Km 23 -104.46 24.51 2126 4 15 550 7 83 3 11 16 8 14 15

Km 73 -104.32 25.13 1284 5 21 350 4 78 9 8 1 0 1 2

Rodríguez -104.09 24.32 2064 4 17 550 6 78 2 11 15 0 7 8

Berros - Tuitan -104.27 23.97 1855 4 17 450 6 84 9 9 15 0 0 1

27 de Noviembre -104.49 24.22 1862 2 17 450 5 91 9 8 17 0 31 32

Km 86 -104.64 24.7 1954 6 17 450 3 89 4 8 8 4 10 11

Morcillo -104.7 24.11 1971 3 17 550 8 88 9 10 2 0 5 6

Km 43 -104.47 24.65 1908 3 17 450 7 75 3 9 12 0 3 4

Zarco - Cieneguilla -104.04 24.1 2143 5 15 550 7 82 2 9 10 0 9 10

Berros - Saltito -104.28 23.94 1858 15 17 450 6 83 9 8 6 0 1 2

Km 36 -104.7 24.27 1909 2 17 450 3 86 9 9 0 0 0 1

Zaragoza -104.16 23.87 1856 1 17 450 7 76 9 8 11 15 21 22

Entrada Guadiana -104.34 23.95 1867 8 17 550 6 83 9 11 1 0 2 3

Carlos R. -104.44 24.27 1867 1 17 450 5 88 7 8 15 0 5 6

Km 153 -104.53 25.12 1360 2 21 350 7 81 9 9 8 6 0 1

Km 51 -104.16 25.21 1416 24 21 250 7 79 4 9 1 0 0 1

Km 29 -104.16 25.36 1396 7 21 250 7 71 4 10 0 0 0 1

Km 237 -104.75 25.76 1905 2 17 450 6 79 9 11 5 1 0 1

Km 260 -104.89 25.96 1940 2 17 450 6 82 2 9 0 0 0 1

Km 84 -104.42 25.82 1770 3 19 350 7 75 9 11 0 0 0 1

Francisco Z. -104.07 24.22 2166 3 15 550 6 74 2 8 2 0 0 1

Km 61 -104.29 25.55 1651 7 21 250 7 77 9 9 0 0 0 1

Km 104 -104.58 25.79 1942 1 17 450 6 77 9 10 3 0 1 2

Km 76 -104.28 25.67 1817 5 19 250 4 77 4 10 1 0 0 1


23/29

Variable distribution

[Figure: distribution of the variables M, B and P; x-axis: grasshopper count, y-axis: frequency]


24/29

Correlation

Longitude Latitude Altitude Slope Temperature Precipitation Vegetation Ndvi Soil Landcover

M. lakinus 0.09 -0.42a 0.28 -0.10 -0.19 0.34a -0.09 0.57a -0.05 0.03

B. nubilum 0.03 -0.16 0.07 -0.21 -0.05 0.04 -0.24 -0.16 0.10 -0.26

P. nebrascensis -0.04 -0.28 0.17 -0.25 -0.17 0.14 -0.33 0.07 0.13 -0.31


25/29

Multicollinearity

Longitude Latitude Altitude Slope Temperature Precipitation Vegetation Ndvi Soil Landcover

Longitude 1.00 -0.37a 0.00 0.32 -0.01 -0.06 0.22 -0.30 0.09 -0.01

Latitude -0.37a 1.00 -0.45a 0.02 0.57a -0.66a -0.02 -0.37a -0.18 0.23

Altitude 0.00 -0.45a 1.00 -0.28 -0.92a 0.79a 0.04 0.30 -0.08 0.09

Slope 0.32 0.02 -0.28 1.00 0.32 -0.30 0.16 0.11 0.12 0.03

Temperature -0.01 0.57 -0.92a 0.32 1.00 -0.85a -0.04 -0.22 0.05 0.04

Precipitation -0.06 -0.64 0.79a -0.30 -0.85a 1.00 0.09 0.39a -0.03 0.08

Vegetation 0.22 -0.02 0.04 0.15 -0.04 0.06 1.00 -0.10 0.04 0.32

Ndvi -0.30 -0.37a 0.30 0.11 -0.22 0.39a -0.10 1.00 0.16 0.06

Soil 0.09 -0.18 -0.08 0.12 0.05 -0.03 0.04 0.16 1.00 -0.02

Landcover -0.01 0.23 0.09 0.04 0.05 0.08 0.32 0.06 -0.02 1.00


26/29

Deviance

Species           Model   Link function   Deviance value   d.f.   Value/df
M. lakinus        Gamma   Log                      2.244      7      0.335
B. nubilum        Gamma   Log                     11.211      9      1.080
P. nebrascensis   Gamma   Log                      2.835      7      0.715


27/29

B = estimate; SE = Std. Error; Lower, Upper = 95% Wald Confidence Interval; Wald = Wald Chi-Square, with df and Sig. for the Hypothesis Test.

Parameter                       B      SE    Lower    Upper     Wald  df  Sig.
(Intercept)               835.919 62.1403  714.126  957.712  180.960   1  .000
[Temperature=15.00]        -2.627   .5287   -3.663   -1.591   24.692   1  .000
[Temperature=17.00]        -2.889   .6660   -4.195   -1.584   18.822   1  .000
[Temperature=19.00]        -5.630   .5807   -6.768   -4.491   93.982   1  .000
[Temperature=21.00]            0a       .        .        .        .   .     .
[Precipitation=250.00]     -4.156   .3781   -4.897   -3.415  120.788   1  .000
[Precipitation=350.00]         0a       .        .        .        .   .     .
[Precipitation=450.00]      2.332   .4734    1.404    3.260   24.268   1  .000
[Precipitation=550.00]         0a       .        .        .        .   .     .
[Vegetation=3.00]          -3.481   .5117   -4.484   -2.478   46.261   1  .000
[Vegetation=4.00]          -2.383   .8388   -4.027    -.739    8.072   1  .004
[Vegetation=5.00]          -3.402   .5694   -4.518   -2.286   35.696   1  .000
[Vegetation=6.00]          -4.161   .5299   -5.199   -3.122   61.647   1  .000
[Vegetation=7.00]          -4.288   .5890   -5.442   -3.133   52.991   1  .000
[Vegetation=8.00]              0a       .        .        .        .   .     .
[Soil=2.00]                 -.103   .3156    -.721     .516     .106   1  .745
[Soil=3.00]                 1.911   .2833    1.356    2.467   45.522   1  .000
[Soil=4.00]                  .488   .1837     .128     .848    7.052   1  .008
[Soil=7.00]                  .533   .2793    -.015    1.080    3.638   1  .056
[Soil=9.00]                    0a       .        .        .        .   .     .
[Landcover=8.00]             .436   .2687    -.090     .963    2.636   1  .104
[Landcover=9.00]             .369   .2951    -.210     .947    1.561   1  .212
[Landcover=10.00]           -.118   .3949    -.892     .656     .089   1  .765
[Landcover=11.00]           1.666   .4357     .812    2.520   14.623   1  .000
[Landcover=12.00]              0a       .        .        .        .   .     .
Longitude                   8.360   .6374    7.111    9.610  172.047   1  .000
Latitude                    1.366   .2438     .889    1.844   31.421   1  .000
Slope                       -.039   .0147    -.067    -.010    6.859   1  .009
Ndvi                         .122   .0113     .100     .144  117.147   1  .000
(Scale)                     .048b   .0122     .029     .079

Dependent Variable: M1

Model: (Intercept), Precipitation, Temperature, Vegetation, Soil, Landcover, Longitude, Latitude, Slope, Ndvi

a. Set to zero because this parameter is redundant.

b. Maximum likelihood estimate.
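The Wald columns of the parameter table can be reproduced from B and the standard error alone. The sketch below assumes the usual definitions (interval B ± 1.96 × Std. Error, Wald chi-square (B / Std. Error)² on 1 degree of freedom); the function name is hypothetical, and small differences from the table come from the rounding of the printed B and Std. Error.

```python
import math

def wald_summary(b, se, z=1.959964):
    """95% Wald interval, Wald chi-square and its p-value (1 d.f.)."""
    lower = b - z * se
    upper = b + z * se
    chi2 = (b / se) ** 2
    # Survival function of chi-squared with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return lower, upper, chi2, p_value

# Longitude row: B = 8.360, Std. Error = .6374
lon = wald_summary(8.360, 0.6374)
# [Soil=7.00] row: B = .533, Std. Error = .2793
soil7 = wald_summary(0.533, 0.2793)

print([round(v, 3) for v in lon])
print([round(v, 3) for v in soil7])
```

For Longitude this gives an interval close to the tabulated 7.111 to 9.610 and a chi-square close to 172.047; for [Soil=7.00] the p-value is close to the tabulated .056.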


28/29

Residual


29/29

Fit
