Date posted: 18-Dec-2015
Datamining and statistical learning - lecture 9
Generalized additive models (GAMs)
Some examples of linear models
Proc GAM in SAS
Model selection in GAM
Linear regression models
The inputs can be:
quantitative inputs
functions of quantitative inputs
basis expansions of quantitative inputs
dummy variables
interaction terms
E(Y | X1, ..., Xn) = β0 + β1X1 + ... + βpXp
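As an illustration (not from the lecture), all the input types listed above can enter one ordinary least-squares fit; a minimal NumPy sketch with made-up data and coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=200)        # quantitative input
group = rng.integers(0, 2, size=200)   # class variable coded as a dummy

# Design matrix: intercept, x, a basis expansion (x**2),
# a dummy variable, and an interaction term (x * dummy)
X = np.column_stack([np.ones_like(x), x, x**2, group, x * group])
beta_true = np.array([1.0, 2.0, 0.5, -1.0, 0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=200)

# Least-squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

With this little noise the estimated coefficients come out close to the true ones.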
Justification of linear regression models
Many response variables are linearly or almost linearly
related to a set of inputs
Linear models are easy to comprehend and to fit to observed data
Linear regression models are particularly useful when:
• the number of cases is moderate
• data are sparse
• the signal-to-noise ratio is low
Performance of predictors based on:
(i) a simple linear regression model
(ii) a quadratic regression model
when the true expected response is a second order polynomial
in the input
[Figure: true expected response E(y) and fitted values over 0 ≤ x ≤ 4, y-scale 0–18. Left: predictions based on a linear model (yhat). Right: predictions based on a quadratic model (yhat2).]
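This comparison can be reproduced in a few lines; a sketch with simulated data (the coefficients of the true second-order polynomial are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 4, 50)
Ey = 1.0 + 2.0 * x + 0.8 * x**2        # true second-order polynomial mean
y = Ey + rng.normal(scale=0.5, size=x.size)

# Least-squares fits of a first- and a second-degree polynomial
yhat = np.polyval(np.polyfit(x, y, 1), x)
yhat2 = np.polyval(np.polyfit(x, y, 2), x)

sse1 = float(np.sum((y - yhat) ** 2))
sse2 = float(np.sum((y - yhat2) ** 2))
print(sse2 < sse1)   # the correctly specified quadratic model fits better
```

Since the quadratic model nests the linear one, its residual sum of squares is necessarily no larger.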
Logistic regression of multiple purchases
vs first amount spent
[Figure: observed binary response and estimated event probability (0–1) plotted against first amount spent (0–7000).]
Logistic regression for a binary response variable Y
E(Y | X = x) = P(Y = 1 | X = x) = exp(β0 + β1x) / (1 + exp(β0 + β1x))

log [ E(Y | X = x) / (1 − E(Y | X = x)) ] = log [ P(Y = 1 | X = x) / P(Y = 0 | X = x) ] = β0 + β1x

The logit of the expectation of Y given x is a linear function of x
[Figure: the logistic curve E(Y | X = x) plotted against x, rising from near 0 towards 1 over 0 ≤ x ≤ 4.]
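The logistic-model identities can be checked numerically; a minimal sketch (the coefficient values are illustrative, chosen roughly on the scale of the purchase data):

```python
import math

b0, b1 = -3.0, 0.002   # illustrative values of beta0, beta1

def p_event(x):
    """P(Y = 1 | X = x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))"""
    eta = b0 + b1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """log(p / (1 - p))"""
    return math.log(p / (1.0 - p))

# The logit of the event probability is linear in x, so the change over
# an interval of length 1000 equals b1 * 1000 = 2.0
print(round(logit(p_event(1000)) - logit(p_event(0)), 6))
```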
Generalized additive models: some examples
A nonlinear, additive model:
E(Y | X1, ..., Xn) = s1(X1) + ... + sp(Xp)

A mixed linear and nonlinear, additive model:
E(Y | X1, ..., Xn) = s1(X1) + ... + sp(Xp) + Σ_{j=p+1}^{q} βjXj

A mixed linear and nonlinear, additive model with a class variable:
E(Y | X1, ..., Xn, Class = k) = αk + s1(X1) + ... + sp(Xp) + Σ_{j=p+1}^{q} βjXj
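The three forms differ only in which terms enter the additive predictor; a small sketch of evaluating such a predictor (the smooth functions and coefficients below are arbitrary stand-ins, not fitted components):

```python
import math

def s1(x):   # stand-in smooth component, e.g. a seasonal pattern
    return math.sin(2.0 * math.pi * x / 12.0)

def s2(x):   # stand-in smooth component, e.g. a long-term trend
    return 0.1 * (x - 1995.0)

def predict(alpha, x1, x2, beta=0.0, x_lin=0.0):
    """Additive predictor: alpha + s1(x1) + s2(x2) + beta * x_lin."""
    return alpha + s1(x1) + s2(x2) + beta * x_lin

# Additivity: the effect of changing x1 does not depend on x2
d_at_1990 = predict(4.0, 6, 1990) - predict(4.0, 1, 1990)
d_at_2000 = predict(4.0, 6, 2000) - predict(4.0, 1, 2000)
print(abs(d_at_1990 - d_at_2000) < 1e-12)
```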
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
[Figure: two 3D surface plots of total-N concentration (mg N/l, 0–8) by month (Jan–Dec) and year (1989–2001): observed data and fitted model.]
Output:
Total-N conc
Inputs:
Monthly pattern
Trend function
Modelling the concentration of total nitrogen at Lobith on the Rhine:
Extracted additive components
[Figure: month components (linear and smooth) plotted over the months of the year, range −1.0 to 1.5, and year components (linear and smooth) plotted over 1988–2004, range −1.5 to 2.0.]
Weekly mortality and confirmed cases of influenza in Sweden
[Figure: weekly time series 1994–2004 of mortality (left axis, 0–3000) and confirmed cases of influenza (right axis, 0–450).]
Response:
Weekly mortality
Inputs:
Confirmed cases of influenza
Seasonal dummies
Long-term trend
SYNTAX for common GAM models
Type of Model            Syntax                          Mathematical Form
Parametric model         y = param(x);                   β0 + β1x
Nonparametric model      y = spline(x);                  β0 + s(x)
Nonparametric model      y = loess(x);                   β0 + s(x)
Semiparametric model     y = param(x1) spline(x2);       β0 + β1x1 + s(x2)
Additive model           y = spline(x1) spline(x2);      β0 + s1(x1) + s2(x2)
Thin-plate spline model  y = spline2(x1,x2);             β0 + s(x1, x2)
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
Model 1
proc gam data=Mining.Rhine;
   model Nconc = spline(Year) spline(Month);
   output out=addmodel1;
run;

Model 2
proc gam data=Mining.Rhine;
   model Nconc = spline2(Year, Month);
   output out=addmodel2;
run;
Proc GAM – degrees of freedom of the spline components
The degrees of freedom of the spline components are selected by the user or by specifying method=GCV

proc gam data=Mining.Rhine;
   model Nconc = spline(Year, df=3) spline(Month, df=3);
   output out=addmodel1;
run;

• df=3 implies that the same cubic polynomial is valid over the entire range of the input
• Increasing the df-value implies that knots are introduced
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
[Figure: partial predictions from the additive model plotted against observation number (1–157): P_Year (range −0.10 to 0.25) and P_Month (range −1.00 to 1.50).]
proc gam data=Mining.Rhine;
   model Nconc = spline(Year) spline(Month);
   output out=addmodel1;
run;
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
[Figure: surface plot of P_Nconc vs Month and Year.]
Model 1
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
Model 2
df=4
[Figure: surface plot of P_Nconc_2 vs Month and Year.]
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
[Figure: surface plot of P_Nconc_3 vs Month and Year.]
Model 3
df=20
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
The GAM Procedure
Dependent Variable: Nconc
Smoothing Model Component(s): spline(Year) spline(Month)

Summary of Input Data Set
   Number of Observations                 168
   Number of Missing Observations           0
   Distribution                      Gaussian
   Link Function                     Identity

Iteration Summary and Fit Statistics
   Final Number of Backfitting Iterations              2
   Final Backfitting Criterion              1.987193E-30
   The Deviance of the Final Estimate        42.92519322

The local score algorithm converged.
Model 1
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
Regression Model Analysis
Parameter Estimates

   Parameter        Estimate   Standard Error   t Value   Pr > |t|
   Intercept       420.69388         19.84413     21.20     <.0001
   Linear(Year)     -0.20824          0.00994    -20.94     <.0001
   Linear(Month)    -0.10461          0.01161     -9.01     <.0001

Smoothing Model Analysis
Analysis of Deviance

   Source           DF        Sum of Squares   Chi-Square   Pr > ChiSq
   Spline(Year)     3.00000         2.527155       9.3609       0.0249
   Spline(Month)    3.00000        51.143931     189.4432       <.0001
Model 1
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
Iteration Summary and Fit Statistics
   Final Number of Backfitting Iterations              2
   Final Backfitting Criterion                         0
   The Deviance of the Final Estimate        74.22284569

Regression Model Analysis
Parameter Estimates

   Parameter    Estimate   Standard Error   t Value   Pr > |t|
   Intercept     4.46475          0.05206     85.76     <.0001

Smoothing Model Analysis
Analysis of Deviance

   Source                 DF        Sum of Squares   Chi-Square   Pr > ChiSq
   Spline2(Year Month)    4.00000       162.668070     357.2336       <.0001
Model 2
Generalized additive models: Modelling the concentration of total nitrogen at Lobith on the Rhine
Iteration Summary and Fit Statistics
   Final Number of Backfitting Iterations              2
   Final Backfitting Criterion                         0
   The Deviance of the Final Estimate       36.577160798

Regression Model Analysis
Parameter Estimates

   Parameter    Estimate   Standard Error   t Value   Pr > |t|
   Intercept     4.46475          0.03849    116.01     <.0001

Smoothing Model Analysis
Analysis of Deviance

   Source                 DF         Sum of Squares   Chi-Square   Pr > ChiSq
   Spline2(Year Month)    20.00000       200.313755     805.0412       <.0001
Model 3 (df=20)
Estimation of additive models
- the backfitting algorithm
Model:  E(Y | X1 = x1, ..., Xp = xp) = α + f1(x1) + ... + fp(xp)

1. Initialize:  α̂ = (1/N) Σ_{i=1}^{N} yi ;  f̂j = 0,  j = 1, ..., p

2. Cycle:  j = 1, 2, ..., p, 1, 2, ..., p, ...

      f̂j ← Sj [ { yi − α̂ − Σ_{k≠j} f̂k(xik) },  i = 1, ..., N ]

      f̂j ← f̂j − (1/N) Σ_{i=1}^{N} f̂j(xij)

   until the functions f̂j change less than a prespecified threshold
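The backfitting cycle can be sketched in a few lines of Python, using a crude running-mean scatterplot smoother in place of the spline smoother Sj (the smoother, window size, iteration count, and data below are assumptions for illustration, not what Proc GAM does internally):

```python
import numpy as np

def running_mean_smoother(x, r, window=11):
    """Smooth residuals r against x with a centered moving average."""
    order = np.argsort(x)
    smoothed = np.convolve(r[order], np.ones(window) / window, mode="same")
    out = np.empty_like(r)
    out[order] = smoothed
    return out

def backfit(X, y, n_iter=20):
    N, p = X.shape
    alpha = y.mean()                      # 1. initialize: alpha-hat = mean of y
    f = np.zeros((p, N))                  #    f_j = 0 for all j
    for _ in range(n_iter):               # 2. cycle over j = 1, ..., p
        for j in range(p):
            partial = y - alpha - f.sum(axis=0) + f[j]   # partial residuals
            f[j] = running_mean_smoother(X[:, j], partial)
            f[j] -= f[j].mean()           # center each component function
    return alpha, f

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(500, 2))
y = 3.0 + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
alpha, f = backfit(X, y)
resid = y - alpha - f.sum(axis=0)
print(resid.var() < y.var())   # the additive fit explains most of the variation
```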
Modelling ln daily electricity consumption as a spline function
of the population-weighted mean temperature in Sweden
proc gam data=sasuser.smhi;
model lnDaily_consumption = spline(Meantemp, df=20);
ID Time;
output out=smhiouttemp pred resid;
run;
[Figure: observed and fitted ln daily electricity consumption (MWh), range 12.2–13.4, plotted against population-weighted temperature (−30 to 30).]
Modelling ln daily electricity consumption as a spline function
of the population-weighted mean temperature in Sweden:
residual analysis
[Figure: residuals (range −0.25 to 0.20) plotted against Julian day (0–400).]
Modelling ln daily electricity consumption in Sweden
- residual analysis
[Figure: residuals (range −0.25 to 0.20) against Julian day (0–400) for two models. Left: spline of temperature. Right: spline of temperature, spline of Julian day, and weekday dummies.]
Modelling ln daily electricity consumption in Sweden
- residual analysis
[Figure: residuals (range −0.25 to 0.20) against Julian day (0–400) for two models. Left: spline of temperature, spline of Julian day, and weekday dummies. Right: splines of contemporaneous and time-lagged weather data, splines of Julian day and time, and weekday and holiday dummies.]
Deviance analysis of the investigated models of
ln daily electricity consumption in Sweden
Model                        Deviance
Temp only                      10.233
Temp, Julian day, weekday       3.822
Final model                     0.742
The residual deviance of a fitted model is minus twice its log-likelihood
If the error terms are normally distributed, the deviance is equal to the
sum of squared residuals
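This equivalence can be verified numerically for a Gaussian model; a minimal sketch with simulated data (deviance computed relative to the saturated model, with the error variance fixed at 1):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

# Ordinary least-squares fit of a straight line
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
sse = float(np.sum(resid ** 2))

# Gaussian log-likelihoods with sigma = 1; the saturated model sets mu_i = y_i
const = 0.5 * np.log(2.0 * np.pi)
ll_model = -0.5 * sse - n * const
ll_saturated = -n * const
deviance = 2.0 * (ll_saturated - ll_model)

print(np.isclose(deviance, sse))   # True: deviance equals the residual sum of squares
```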
Modelling ln daily electricity consumption in Sweden:
time series plot of residuals
[Figure: time series plot of residuals (range −0.15 to 0.15) against time (0–2000).]