Page 1: Francis

Introduction to Generalized Linear Models

2007 CAS Predictive Modeling Seminar
Prepared by

Louise Francis, Francis Analytics and Actuarial Data Mining, Inc.

[email protected]

October 11, 2007

Page 2: Francis

Objectives

Gentle introduction to linear models
Illustrate some simple applications of linear models
Address some practical modeling issues

Show features common to LMs and GLMs

Page 3: Francis

Predictive Modeling Family

Predictive Modeling

Classical Linear Models GLMs Data Mining

Page 4: Francis

Linear Models Are Basic Statistical Building Blocks. Ex: Mean Payment by Age Group

Page 5: Francis

Linear Model for Means: A Step Function. Ex: Mean Payment by Age Group

Page 6: Francis

Linear Models Based on Means: Payment by Age Group and Attorney Involvement

Page 7: Francis

An Introduction to Linear Regression

[Chart: Severity by Year]

Page 8: Francis

Intro to Regression Cont.

Fits the line that minimizes the squared deviation between actual and fitted values

min Σ (Y_i − Ŷ_i)²
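The presentation itself works in Excel, but as an illustration the same least-squares fit can be sketched in a few lines of Python. The year and severity values below are hypothetical, not the presentation's data.

```python
import numpy as np

# Hypothetical accident years and average severities (made-up values)
year = np.array([2000, 2001, 2002, 2003, 2004, 2005])
severity = np.array([5200.0, 5450.0, 5900.0, 6100.0, 6600.0, 6950.0])

x = year - year.min()                               # shift years to keep the fit well conditioned
slope, intercept = np.polyfit(x, severity, deg=1)   # minimizes the sum of squared deviations
fitted = intercept + slope * x
sse = np.sum((severity - fitted) ** 2)              # the quantity being minimized
print(slope, intercept, sse)
```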

Page 9: Francis

Some Work-Related Liability Data: Closed Claims from the Texas Dept of Insurance

Total Award
Initial Indemnity Reserve
Policy Limit
Attorney Involvement
Lags: closing lag, report lag
Injury: sprain, back injury, death, etc.

The data, along with some of the analysis, will be posted on the internet

Page 10: Francis

Simple Illustration: Total Settlement vs. Initial Indemnity Reserve

Page 11: Francis

How Strong Is the Linear Relationship? The Correlation Coefficient

Varies between -1 and 1
Zero = no linear correlation

                        lnInitialIndemnityRes   lnTotalAward   lnInitialExpense   lnReportlag
lnInitialIndemnityRes    1.000
lnTotalAward             0.303                  1.000
lnInitialExpense         0.118                  0.227          1.000
lnReportlag             -0.112                  0.048          0.090              1.000
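A minimal pandas sketch of the same Pearson-correlation calculation, using randomly generated stand-in columns (the column names follow the slide; the values are made up, so the numbers will not match):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-ins for the logged claim fields shown above
df = pd.DataFrame({
    "lnInitialIndemnityRes": rng.normal(9.0, 1.0, 500),
    "lnReportlag": rng.normal(3.0, 0.8, 500),
})
df["lnTotalAward"] = 10.0 + 0.3 * df["lnInitialIndemnityRes"] + rng.normal(0, 1.0, 500)

# Pearson correlation matrix; values near zero indicate little linear association
print(df.corr().round(3))
```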

Page 12: Francis

Scatterplot Matrix

Prepared with Excel add-in XLMiner

Page 13: Francis

Linear Modeling Tools Widely Available: Excel Analysis Toolpak

Install the Analysis ToolPak add-in that comes with Excel
Click Tools, Data Analysis, Regression

Page 14: Francis

How Good Is the Fit?

Page 15: Francis

First Step: Compute the residual
Residual = actual - fitted

Sum the squares of the residuals (SSE)
Compute the total variance of the data with no model (SST)

Y = lnTotalAward   Predicted   Residual
10.13              11.76       -1.63
14.08              12.47        1.61
10.31              11.65       -1.34
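A short sketch of the same bookkeeping in Python, reusing the three actual/predicted values shown in the table above:

```python
import numpy as np

actual = np.array([10.13, 14.08, 10.31])       # lnTotalAward, as in the table above
predicted = np.array([11.76, 12.47, 11.65])

residual = actual - predicted                  # residual = actual - fitted
sse = np.sum(residual ** 2)                    # sum of squared residuals (SSE)
sst = np.sum((actual - actual.mean()) ** 2)    # total variation with no model (SST)
print(residual.round(2), sse.round(2), sst.round(2))
```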

Page 16: Francis

Goodness of Fit Statistics

R²: SS Regression / SS Total, the percentage of variance explained

Adjusted R2

R2 adjusted for number of coefficients in model

Note: SSE = sum of squared errors; MS is the mean square (SS divided by its degrees of freedom)
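For concreteness, a small helper that computes both statistics from actual and fitted values. This is a sketch; the function name and arguments are ours, not part of the slides.

```python
import numpy as np

def r_squared(actual, fitted, n_predictors):
    """Return (R^2, adjusted R^2) for a fitted linear model."""
    sse = np.sum((actual - fitted) ** 2)              # residual sum of squares
    sst = np.sum((actual - np.mean(actual)) ** 2)     # total sum of squares
    n = len(actual)
    r2 = 1.0 - sse / sst                              # share of variance explained
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2
```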

Page 17: Francis

R2 Statistic

Page 18: Francis

Significance of Regression

F statistic: Mean Square of Regression / Mean Square of Residual
Df of numerator = k = number of predictor variables
Df of denominator = N - k - 1 (for a model with an intercept)
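A hedged sketch of the same F test, assuming scipy is available; the helper below is illustrative, and its degrees of freedom assume a model with an intercept.

```python
import numpy as np
from scipy import stats

def regression_f_test(actual, fitted, n_predictors):
    """Overall F statistic and p-value for a regression with an intercept."""
    n = len(actual)
    ssr = np.sum((fitted - np.mean(actual)) ** 2)   # sum of squares due to regression
    sse = np.sum((actual - fitted) ** 2)            # residual sum of squares
    df_num = n_predictors                           # numerator degrees of freedom (k)
    df_den = n - n_predictors - 1                   # residual degrees of freedom
    f = (ssr / df_num) / (sse / df_den)             # MS regression / MS residual
    p_value = stats.f.sf(f, df_num, df_den)         # upper-tail probability
    return f, p_value
```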

Page 19: Francis

ANOVA (Analysis of Variance) Table

Standard way to evaluate the fit of a model
Breaks the total sum of squares into model and residual components

Page 20: Francis

Goodness of Fit Statistics

T statistic: uses the SE of a coefficient to determine whether it is significant

The SE of a coefficient is a function of s (the standard error of the regression)
Uses the t-distribution for the test
It is customary to drop a variable if its coefficient is not significant

Page 21: Francis

T-Statistic: Are the Intercept and Coefficient Significant?

                        Coefficient   Standard Error   t Stat    P-value
Intercept               10.343        0.112            92.122    0
lnInitialIndemnityRes    0.154        0.011            13.530    8.21E-40
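The table above is Excel regression output. A statsmodels OLS fit reports the same four columns; the sketch below uses simulated stand-ins for lnInitialIndemnityRes and lnTotalAward rather than the actual claim data, so the numbers will differ.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(9.0, 1.0, 200)                  # stand-in for lnInitialIndemnityRes
y = 10.3 + 0.15 * x + rng.normal(0, 1.0, 200)  # stand-in for lnTotalAward

X = sm.add_constant(x)                         # adds the intercept column
fit = sm.OLS(y, X).fit()
# summary() shows coefficient, standard error, t statistic and p-value
print(fit.summary())
```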

Page 22: Francis

Other Diagnostics: Residual Plot (Independent Variable vs. Residual)

Points should scatter randomly around zero
If not, regression assumptions are violated

Page 23: Francis

Predicted vs. Residual

Page 24: Francis

Random Residual

[Chart: data with normally distributed errors, randomly generated]

Page 25: Francis

What May Residuals Indicate?

If the absolute size of the residuals increases as the predicted value increases, this may indicate non-constant variance

May indicate a need to log the dependent variable
May need to use weighted regression (see the sketch after this list)

May indicate a nonlinear relationship
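As a sketch of the weighted-regression remedy mentioned above (the data and the 1/x² weights are made up for illustration; statsmodels WLS does the fitting):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, 300)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)   # error variance grows with x

X = sm.add_constant(x)
weights = 1.0 / x ** 2                       # weight inversely proportional to the variance
wls_fit = sm.WLS(y, X, weights=weights).fit()
print(wls_fit.params)
```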

Page 26: Francis

Standardized Residual: Find Outliers

z_i = (y_i - ŷ_i) / σ̂_se,   where   σ̂_se = sqrt[ Σ_{i=1..N} (y_i - ŷ_i)² / (N - k - 1) ]

[Chart: Standardized Residuals by Observation; values range from about -4 to 6 across roughly 2,000 observations]
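A direct transcription of the formula above into Python (a sketch; values beyond roughly ±3 are the usual outlier flags):

```python
import numpy as np

def standardized_residuals(actual, fitted, n_predictors):
    """Standardized residuals z_i as defined above."""
    resid = actual - fitted
    n = len(actual)
    sigma_se = np.sqrt(np.sum(resid ** 2) / (n - n_predictors - 1))
    return resid / sigma_se
```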

Page 27: Francis

Outliers

May represent an error
May be legitimate but have undue influence on the regression
Can downweight outliers (see the sketch after this list)

Weight inversely proportional to the variance of the observation
Robust regression:

Based on absolute deviations
Based on lower weights for more extreme observations
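One way to sketch the robust-regression idea is statsmodels' RLM with Huber's M-estimator, which gives lower weight to observations with large residuals. The data below are simulated, with a few planted outliers; this is an illustration, not the presenter's method.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 200)
y[:5] += 25.0                                   # a handful of extreme observations

X = sm.add_constant(x)
# Huber's M-estimator downweights the more extreme observations
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print(robust_fit.params)
```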

Page 28: Francis

Non-Linear Relationship

Page 29: Francis

Non-Linear Relationships

Suppose the relationship between the dependent and independent variable is non-linear. Linear regression requires a linear relationship.

Page 30: Francis

Transformation of Variables

Apply a transformation to either the dependent variable, the independent variable, or both
Examples:

Y' = log(Y)
X' = log(X)
X' = 1/X
Y' = Y^(1/2)
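In code the transformations are one-liners; a sketch with made-up settlement amounts and report lags:

```python
import numpy as np

y = np.array([25000.0, 1300000.0, 30000.0, 42500.0])   # hypothetical settlement amounts
x = np.array([10.0, 250.0, 35.0, 5.0])                  # hypothetical report lags

y_log = np.log(y)      # Y' = log(Y)
x_log = np.log(x)      # X' = log(X)
x_inv = 1.0 / x        # X' = 1/X
y_sqrt = np.sqrt(y)    # Y' = Y^(1/2)
```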

Page 31: Francis

Transformation of Variables: Skewness of Distribution
Use exploratory data analysis to detect skewness and heavy tails

Page 32: Francis

After the log transformation the data are much less skewed and more like the Normal, though still somewhat skewed

[Box plot of lnTotalAward; histogram of lnTotalAward, bins from roughly 9.6 to 17.6]

Page 33: Francis

Transformation of Variables

Suppose the Claim Severity is a function of the log of report lag

Compute X' = log(Report Lag)
Regress Severity on X'
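A minimal sketch of that two-step recipe, using simulated severities and report lags rather than the closed-claim data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
report_lag = rng.uniform(1.0, 400.0, 300)                      # hypothetical report lags
severity = 8000.0 + 1500.0 * np.log(report_lag) + rng.normal(0, 2000.0, 300)

x_prime = np.log(report_lag)                                   # X' = log(Report Lag)
fit = sm.OLS(severity, sm.add_constant(x_prime)).fit()         # regress Severity on X'
print(fit.params)
```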

Page 34: Francis

Categorical Independent Variables: The Other Linear Model (ANOVA)

Average of Total settlement amount or court award
Injury                  Total
Amputation              567,889
Backinjury              168,747
Braindamage             863,485
Burnschemical         1,097,402
Burnsheat               801,748
Circulatorycondition    302,500

Table above created with Excel Pivot Tables

Page 35: Francis

Model

The model is Y = a_i, where i is a category of the independent variable and a_i is the mean of category i.

[Chart: Average Trended Severity by Injury. Totals: BRUISE 4,215.78; BURN 2,185.64; CRUSHING 2,608.14; CUT/PUNCT 1,248.90; EYE 534.23; FRACTURE 14,197.4; OTHER 6,849.98; SPRAIN 3,960.45; STRAIN 7,493.70. Each category mean corresponds to one of Y = a_1 through Y = a_9.]

Page 36: Francis

Two Categories

Model Y = a_i, where i is a category of the independent variable and a_i is its mean. In traditional statistics we compare a_1 to a_2.

Page 37: Francis

If Only Two Categories: T-Test for Significance of the Independent Variable

                              Variable 1     Variable 2
Mean                          124,002        440,758
Variance                      2.35142E+11    1.86746E+12
Observations                  354            1448
Hypothesized Mean Difference  0
df                            1591
t Stat                        -7.17
P(T<=t) one-tail              0.00
t Critical one-tail           1.65
P(T<=t) two-tail              0.00
t Critical two-tail           1.96

Use the T-Test tool from the Excel Data Analysis ToolPak

Page 38: Francis

More Than Two Categories

Use an F-test instead of a t-test
With more than 2 categories, we refer to it as an Analysis of Variance (ANOVA)

Page 39: Francis

Fitting ANOVA With Two Categories Using A Regression

Create a dummy variable for attorney involvement
The variable is 1 if an attorney is involved, and 0 otherwise

Attorney involvement (insurer)   Attorney   Total Settlement
Y                                1              25,000
Y                                1           1,300,000
Y                                1              30,000
N                                0              42,500
Y                                1              25,000
N                                0              30,000
Y                                1              36,963
Y                                1             145,000
N                                0             875,000

Page 40: Francis

More Than 2 Categories

If there are k categories, create k - 1 dummy variables

Dummy_i = 1 if the claim is in category i, and 0 otherwise

The kth category is 0 for all the dummies; its value is the intercept of the regression (see the sketch below)
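The slides build the dummies by hand and with XLMiner; as a sketch, pandas can do the k - 1 coding in one call (the injury labels below are just a small made-up sample):

```python
import pandas as pd

claims = pd.DataFrame({"Injury": ["Backinjury", "Braindamage", "Backinjury",
                                  "Burnsheat", "Other"]})

# k categories produce k - 1 dummy columns; the dropped ("base") category
# is the one absorbed into the regression intercept
dummies = pd.get_dummies(claims["Injury"], prefix="Injury", drop_first=True)
print(dummies)
```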

Page 41: Francis

Design Matrix

Injury Code   Injury_Backinjury   Injury_Multipleinjuries   Injury_Nervouscondition   Injury_Other
 1            0                   0                         0                         0
 1            0                   0                         0                         0
12            1                   0                         0                         0
11            0                   1                         0                         0
17            0                   0                         0                         1

Dummy variables in the top table were hand coded; those in the bottom table were created by XLMiner.

Page 42: Francis

Regression Output for Categorical Independent

Page 43: Francis

A More Complex Model: Multiple Regression

• Let Y = a + b1*X1 + b2*X2 + … + bn*Xn + e

• The X’s can be numeric variables or categorical dummies

Page 44: Francis

Multiple Regression
Y = a + b1*Initial Reserve + b2*Report Lag + b3*Policy Limit + b4*Age + c_i*Attorney_i + d_k*Injury_k + e

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.49844
R Square            0.24844
Adjusted R Square   0.24213
Standard Error      1.10306
Observations        1802

ANOVA
             df      SS        MS      F
Regression   15      718.36    47.89   39.360
Residual     1786    2173.09   1.22
Total        1801    2891.45

                              Coefficient   Standard Error   t Stat    P-value
Intercept                     10.052        0.156            64.374    0.000
lnInitialIndemnityRes          0.105        0.011             9.588    0.000
lnReportlag                    0.020        0.011             1.887    0.059
Policy Limit                   0.000        0.000             4.405    0.000
Clmt Age                      -0.002        0.002            -1.037    0.300
Attorney                       0.718        0.068            10.599    0.000
Injury_Backinjury             -0.150        0.075            -1.995    0.046
Injury_Braindamage             0.834        0.224             3.719    0.000
Injury_Burnschemical           0.587        0.247             2.375    0.018
Injury_Burnsheat               0.637        0.175             3.645    0.000
Injury_Circulatorycondition    0.935        0.782             1.196    0.232
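The output above is Excel's. For readers who prefer code, the sketch below fits the same kind of model with statsmodels on a simulated claim file; the column names mirror the slide, but every value is randomly generated, so the coefficients will not match.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
df = pd.DataFrame({
    "lnInitialIndemnityRes": rng.normal(9.0, 1.0, n),
    "lnReportlag": rng.normal(3.0, 0.8, n),
    "PolicyLimit": rng.uniform(1e5, 1e6, n),
    "ClmtAge": rng.integers(18, 70, n),
    "Attorney": rng.integers(0, 2, n),
    "Injury": rng.choice(["Backinjury", "Braindamage", "Burnsheat", "Other"], n),
})
df["lnTotalAward"] = (10.0 + 0.1 * df["lnInitialIndemnityRes"]
                      + 0.7 * df["Attorney"] + rng.normal(0, 1.1, n))

# Numeric predictors plus k-1 injury dummies, as in the model specification above
X = pd.concat([df[["lnInitialIndemnityRes", "lnReportlag", "PolicyLimit",
                   "ClmtAge", "Attorney"]],
               pd.get_dummies(df["Injury"], prefix="Injury", drop_first=True)],
              axis=1).astype(float)
fit = sm.OLS(df["lnTotalAward"], sm.add_constant(X)).fit()
print(fit.summary())
```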

Page 45: Francis

More Than One Categorical Variable

For each categorical variable:
Create k - 1 dummy variables
k is the total number of categories
The category left out becomes the "base" category
Its value is contained in the intercept

The model is Y = a_i + b_j + … + e, or
Y = u + a_i + b_j + … + e, where a_i and b_j are offsets to u
e is the random error term

Page 46: Francis

Correlation of Predictor Variables: Multicollinearity

Page 47: Francis

Multicollinearity

• Predictor variables are assumed uncorrelated

• Assess with correlation matrix

Page 48: Francis

Remedies for Multicollinearity

• Drop one or more of the highly correlated variables

• Use factor analysis or principal components to produce a new variable which is a weighted average of the correlated variables

• Use stepwise regression to select variables to include

Page 49: Francis

Similarities with GLMs

Linear Models:
Transformation of variables
Use dummy coding for categorical variables
Residual
Test significance of coefficients

GLMs:
Link functions
Use dummy coding for categorical variables
Deviance
Test significance of coefficients
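To make the parallel concrete, here is a minimal GLM sketch in statsmodels: a Poisson model (log link by default) fit to simulated counts, with the deviance and coefficient tests playing the roles that SSE and t statistics play in the linear model. The data are made up, and the choice of a Poisson family is ours for illustration, not the presenter's.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, 400)
counts = rng.poisson(np.exp(0.5 + 0.8 * x))     # hypothetical claim counts

X = sm.add_constant(x)
# The Poisson family uses a log link by default; deviance is the GLM
# counterpart of the residual sum of squares
glm_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(glm_fit.deviance)
print(glm_fit.summary())                        # coefficient significance tests
```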

Page 50: Francis

Introductory Modeling Library Recommendations

• Berry, W., Understanding Regression Assumptions, Sage University Press
• Iversen, R. and Norpoth, H., Analysis of Variance, Sage University Press
• Fox, J., Regression Diagnostics, Sage University Press
• Shmueli, Patel and Bruce, Data Mining for Business Intelligence: Concepts, Applications and Techniques in Microsoft Office Excel with XLMiner, Wiley, 2007
• De Jong and Heller, Generalized Linear Models for Insurance Data, Cambridge, 2008

