Diagnostics for Generalized Linear Models

Sonia Benghiat

A Thesis

in

The Department

of

Mathematics and Statistics

Presented in Partial Fulfillment of the Requirements

for the Degree of Master of Science at

Concordia University

Montreal, Quebec, Canada

National Library of Canada / Bibliothèque nationale du Canada

Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

Diagnostics for Generalized Linear Models

Sonia Benghiat

The analysis of residuals can capture departures from a parametrized model. In this thesis we look at how the generalized linear model has become one of the most important developments in statistics in the last thirty years, and at the adequacy of regression model diagnostics that are meaningful and significant in a generalized linear model context. Some asymptotic properties are discussed and numerical examples are provided to illustrate the techniques for binomial, Poisson, and gamma distributed random variables.

Résumé

Diagnostics for Generalized Linear Models

Sonia Benghiat

The analysis of residuals is a very powerful tool that allows us to check the validity of a parametric model. In this thesis, I give an overview of the importance that generalized linear models have had in the development of statistics over the last thirty years. I analyze the ease that such models offer when it comes to regression diagnostics. I also examine the asymptotic laws concerning these models. Finally, I present examples for binomial, Poisson, and gamma random variables.

Acknowledgements

This thesis would not have been possible without the patience and the boundless support from my husband. To him I owe a debt of gratitude. My parents, my brother and my sister continuously reminded me of the importance of completing my master's degree and to them I am thankful for their persistent encouragements. I hold a great respect for my supervisor Prof. Y. Chaubey. He very patiently guided the advancements of this thesis. To him I express my sincerest gratitude. I would also like to thank Prof. J. Garrido who willingly provided me with some useful material for the realization of this thesis. I thank Prof. A. Canty for kindly accepting to advise me on the choice of my software application. I thank the graduate secretaries and the professors from the Mathematics and Statistics department, and my classmates, not least, for their insightful help and for offering a pleasant learning environment altogether.

Contents

1 Introduction
  1.1 The Linear Model
    1.1.1 Validity of Assumptions
    1.1.2 Other Diagnostics
    1.1.3 Remedial Measures
  1.2 Outline of Thesis

2 The Generalized Linear Model
  2.1 Historical Aspects
  2.2 Mean and Variance Functions in an Exponential Family
  2.3 Description of the Generalized Linear Model
  2.4 Maximum Likelihood Estimation for the GLM
    2.4.1 The Newton-Raphson Method
    2.4.2 Fisher's Scoring Method
    2.4.3 Iteratively Weighted Least Squares (IWLS)
  2.5 The Goodness of Model Fit
    2.5.1 The Deviance Function
    2.5.2 The Pearson Statistic
    2.5.3 Residuals and the Projection Matrix
  2.6 Alternative Models

3 Residual Diagnostic Measures
  3.1 Modified Residuals
  3.2 Influential Observations
  3.3 Testing the Goodness-of-Fit
  3.4 Testing Goodness-of-Link Functions
  3.5 Software Applications

4 Numerical Examples
  4.1 Introduction
  4.2 Binomial Data
  4.3 Poisson Data
  4.4 Gamma Data
  4.5 Conclusion

A Programs for Parameter Estimation for Different Families
  A.1 MLE program for binomial family
  A.2 MLE program for Poisson family
  A.3 MLE program for Gamma family
  A.4 One-step function

B
  B.1 Output for the Herbicide data
  B.2 Output for One-Step function using the Herbicide data

List of Figures

4.1 Deviance residuals for birth abnormalities due to herbicide spray exposure
4.2 $\chi$ residuals for birth abnormalities due to herbicide spray exposure
4.3 Projection matrix diagonal elements for birth abnormalities due to herbicide spray exposure
4.4 Standardized change in $\hat{\beta}$ for birth abnormalities due to herbicide spray exposure
4.5 Standardized change in $\hat{\beta}$ for herbicide data
4.6 Deviance residuals for defects found on furniture produced in a certain manufacturing plant
4.7 $\chi$ residuals for defects found on furniture produced in a certain manufacturing plant
4.8 Projection matrix diagonal elements for defects found on furniture produced in a certain manufacturing plant
4.9 Standardized change in $\hat{\beta}$ for defects found on furniture produced in a certain manufacturing plant
4.10 Standardized change in $\hat{\beta}$ for furniture damage data
4.11 Standardized change in $\hat{\beta}$ for furniture damage data
4.12 Standardized change in $\hat{\beta}$ for furniture damage data
4.13 Standardized change in $\hat{\beta}_4$ for furniture damage data
4.14 Standardized change in $\hat{\beta}$ for furniture damage data
4.15 Deviance residuals for lot 1 of blood clot time
4.16 $\chi$ residuals for lot 1 of blood clot time
4.17 Projection matrix diagonal elements for lot 1 of blood clot time
4.18 Standardized change in $\hat{\beta}$ for lot 1 of blood clot time
4.19 Standardized change in $\hat{\beta}$ for lot 1 of blood clot time

List of Tables

2.1 Dispersion Parameter, Canonical Link and Variance Function for Distributions of the Exponential Family
2.2 Distribution Functions with their Associated Links
2.3 An Extension of the Normal-Theory Linear Model to the GLM
2.4 Deviance Function for Exponential Family Distributions
3.1 Anscombe and Variance-Stabilizing Residuals Expressed for the Binomial, Poisson and Gamma Distributions
3.2 Deviance and Adjusted Deviance Residuals for the Three Distributions
4.1 Number of birth abnormalities out of total births per month for herbicide effect
4.2 Contingency table for furniture defect
4.3 Blood clotting times in seconds for 9 percentage concentrations of plasma and for 2 lots

Chapter 1

Introduction

1.1 The Linear Model

Most of the generalized linear model concepts stem from the theory of the normal linear model. Before introducing the generalized linear model, it is useful to set the scene by providing a brief review of the normal linear model in this first chapter, and hence to understand and see the parallels between the two types of models.

The normal-theory linear model is given by

$$y = X\beta + \epsilon, \tag{1.1}$$

where $y$ is an $n \times 1$ observation vector, $X$ is an $n \times p$ known design matrix, $\beta$ is a $p \times 1$ vector of unknown parameters, called regression parameters, and $\epsilon$ is an $n \times 1$ vector of unobserved random variables with zero mean and constant variance $\sigma^2$, which are independently and normally distributed. The model (1.1) is alternatively described by the mean vector and variance-covariance matrix of the observations $y$ as

$$E(y) = X\beta, \qquad \mathrm{Var}(y) = \sigma^2 I.$$

The linearity of the model is understood in terms of the regression parameters $\beta$. For estimation of the parameters, the maximum likelihood method can be used when the errors are normal. Likewise, the principle of least squares provides the same estimates of the regression parameters. However, it does not require any distributional assumption. It is described below.

Least Squares Estimation of Parameters $\beta$

The least squares method estimates the regression parameters $\beta$ by minimizing the sum of squares:

$$(y - X\beta)'(y - X\beta) = y'y - 2\beta'X'y + \beta'X'X\beta.$$
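The minimization can be carried out explicitly. As a sketch, assuming that $X$ has full column rank $p$ (so that $X'X$ is invertible), differentiating the sum of squares with respect to $\beta$ and setting the derivative to zero gives the normal equations and the least squares estimator:

```latex
\frac{\partial}{\partial \beta}\left( y'y - 2\beta'X'y + \beta'X'X\beta \right)
  = -2X'y + 2X'X\beta = 0
\;\Longrightarrow\;
X'X\hat{\beta} = X'y
\;\Longrightarrow\;
\hat{\beta} = (X'X)^{-1}X'y .
```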

In addition to being unbiased, the least squares estimator (LSE) $\hat{\beta}$ has the following properties:

(1) it has minimum variance among all unbiased linear estimators (Gauss-Markov theorem),

(2) it is consistent, and

Projection Matrix and Residuals

The building blocks for detecting influential observations in a given data set are generated by the projection matrix, $M$, and the residuals, $e$, which are defined in what follows.

Consider the model (1.1) with corresponding fitted values ($\hat{y}$) and residual vector ($e$) defined by:

$$\hat{y} = X\hat{\beta} = Hy, \qquad e = y - \hat{y} = (I - H)y.$$

The projection matrix $M = (m_{ij})$ is defined by:

$$M = I - H, \qquad \text{where } H = X(X'X)^{-1}X'$$

is called the "hat matrix". The projection matrix is most useful in the analysis of residuals as it spans the residual space, i.e.,

$$e = My.$$

The residuals $e$ measure the difference between the observed and the fitted values, with the following properties:

- $E(e) = 0$,
- $\mathrm{Var}(e) = \sigma^2 (I - H)$.

An unbiased estimator of $\sigma^2$ based on the residual $e$ is given by

$$\hat{\sigma}^2 = \frac{e'e}{n - p}, \tag{1.8}$$

whereby (1.8) is denoted by MSE, the mean square due to error. Therefore,

$$\widehat{\mathrm{Var}}(e) = MSE\,(I - H).$$
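The quantities of this subsection can be computed directly on a small example. The following is a minimal sketch, assuming a hypothetical data set with an intercept and one covariate ($n = 5$, $p = 2$); the helper names are illustrative and not taken from the thesis programs. It forms $H$ and $M = I - H$, computes $e = My$ and the MSE, and checks numerically that $M$ is symmetric and idempotent with trace $n - p$.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix (enough here because p = 2)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical data, not from the thesis: intercept plus one covariate.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
X = [[1.0, xi] for xi in x]
n, p = len(X), len(X[0])

Xt = transpose(X)
H = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)      # hat matrix
M = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]

e = [sum(M[i][j] * y[j] for j in range(n)) for i in range(n)]  # e = My
mse = sum(ei ** 2 for ei in e) / (n - p)                       # e'e / (n - p)

MM = matmul(M, M)
max_idempotency_error = max(abs(MM[i][j] - M[i][j]) for i in range(n) for j in range(n))
max_symmetry_error = max(abs(M[i][j] - M[j][i]) for i in range(n) for j in range(n))
trace_M = sum(M[i][i] for i in range(n))
```

For larger $p$ a general linear solver would replace `inverse_2x2`; the 2 x 2 inverse is used only to keep the sketch free of external libraries.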

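Theorem 1.1 The following are important properties associated with the projection matrix $M$:

1. $H$ and $M = (I - H)$ are symmetric and idempotent,

2. $\mathrm{rank}(M) = \mathrm{rank}(I - H) = \mathrm{tr}(M) = \mathrm{tr}(I - H) = n - p$,

and

2. Since $(I - H)$ is idempotent, $\mathrm{rank}(I - H) = \mathrm{tr}(I - H)$. Furthermore, since

$$\mathrm{tr}(H) = \mathrm{tr}\left(X(X'X)^{-1}X'\right) = \mathrm{tr}\left((X'X)^{-1}X'X\right) = \mathrm{tr}(I_p) = p,$$

it follows that $\mathrm{tr}(I - H) = n - p$.

It can be further deduced that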

In fitting a linear regression model, the residuals $e$ can be used to justify the assumptions about the random errors $\epsilon$. Since $e$ is linear in $y$, $e$ is a random variable following a normal distribution, and hence the assumption of normality can be used to draw inferences about the linear model. Thus, an analysis which combines the residuals and the fitted values will examine whether there are any departures from the linear model with normal errors. The model departures to be examined are categorized as:

- non-constant variance,
- non-independence,
- omission of independent covariates.

Graphical methods (see Draper and Smith [7], Chapter 4) involving the residuals provide useful tools for detecting such model departures. They are described below:

1. Plots of residuals against independent variables will detect potential outliers, non-constant variance, non-linearity of an independent variable or the need for more independent variables,

2. Plots of residuals against the fitted values will detect non-constancy of variance,

3. Plots of residuals against time (if possible) will detect non-independence among errors or if the time effect has been omitted from the model,

4. Box-plots, normal probability plots, half-normal plots, histograms and stem-and-leaf plots will check for normality and outliers, and

5. Plots of residuals against other significant independent variables (if possible) will detect whether such variables are to be included in the model.

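Formal tests build statistics involving residuals which are used to test the validity of the following normal linear regression model assumptions:

F-test for Adequacy of the Regression Model

Consider the linear regression model (1.1) whereby the errors $\epsilon_i$ are assumed to be i.i.d. The adequacy of the model is interpreted in the form of the significance of the independent variables $\{x_j\}$, $j = 1, \ldots, p - 1$. The following hypotheses are tested:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$$

$$H_a: \text{not all } \beta_j = 0; \quad j = 1, \ldots, p - 1.$$

It can be shown that the likelihood ratio test for $H_0$ vs. $H_a$ yields the following F-statistic, with $H_0$ rejected for large values:

$$F = \frac{MSR}{MSE} > F_{p-1,\,n-p;\,\alpha}, \tag{1.15}$$

where

$$MSE = \frac{y'(I - H)y}{n - p} = \frac{e'e}{n - p}$$

and

$$MSR = \frac{y'\left(H - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)y}{p - 1},$$

with the random variable $F_{\nu_1,\nu_2}$ having an F-distribution with $\nu_1, \nu_2$ degrees of freedom. The critical region given in (1.15) is justified by the following facts:

(i) $(n - p)\,\dfrac{MSE}{\sigma^2} \sim \chi^2_{n-p}$,

(ii) $(p - 1)\,\dfrac{MSR}{\sigma^2} \sim \chi^2_{p-1}(\lambda)$, where $\lambda = \beta'X'\left(H - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X\beta/\sigma^2$ and $\chi^2_{\nu}(\lambda)$ denotes the non-central chi-square random variable with $\nu$ degrees of freedom and non-centrality parameter (ncp) $\lambda$,

(iii) $MSE$ and $MSR$ are independent,

(iv) $E(MSR) = \sigma^2 + \beta'X'\left(H - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X\beta/(p - 1) \ge \sigma^2 = E(MSE)$.

The assertions (i)-(iii) are consequences of Cochran's Theorem (see Searle [23], Chapter 3), essentially by using the following theorem:

Theorem 1.2 Let $z \sim N(0, I)$. Then,

(1) $z'Az$ has a $\chi^2$-distribution with $\mathrm{rank}(A)$ degrees of freedom, iff $A$ is idempotent;

(2) $z'Az$ and $z'Bz$ are independent iff $AB = 0$.

To see (i), note that

$$(n - p)\,\frac{MSE}{\sigma^2} = z'Az,$$

where $z = \epsilon/\sigma \sim N(0, I)$ and $A = (I - H)$.

Since $A$ is idempotent with rank $n - p$ (Theorem 1.1), it follows that

$$(n - p)\,\frac{MSE}{\sigma^2} \sim \chi^2_{n-p}$$

and, similarly,

$$(p - 1)\,\frac{MSR}{\sigma^2} = \frac{y'\left(H - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)y}{\sigma^2}$$

has a non-central chi-square distribution with degrees of freedom $= \mathrm{trace}\left(H - \frac{1}{n}\mathbf{1}\mathbf{1}'\right) = p - 1$ and non-centrality parameter

$$\lambda = \beta'X'\left(H - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right)X\beta/\sigma^2.$$

Since $HX = X$, the non-centrality parameter simplifies to

$$\lambda = \beta'X'\left(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right)X\beta/\sigma^2,$$

which is $\ge 0$ and equals zero iff $H_0$ holds.

Independence easily follows since

$$(I - H)\left(H - \tfrac{1}{n}\mathbf{1}\mathbf{1}'\right) = H - \tfrac{1}{n}\mathbf{1}\mathbf{1}' - H + \tfrac{1}{n}H\mathbf{1}\mathbf{1}' = 0,$$

because $H\mathbf{1} = \mathbf{1}$ when the model includes an intercept.

The assertion in (iv) is a strict inequality if at least one of the $\beta_j \ne 0$.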
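The F-statistic can be illustrated on the simplest case. This is a minimal sketch, assuming a hypothetical data set with an intercept and a single covariate ($p = 2$), where the least squares fit has a closed form; the variable names are illustrative. It computes $e'e$ and $y'(H - \frac{1}{n}\mathbf{1}\mathbf{1}')y$ as sums of squares about the fitted values and about the mean, and then forms $F = MSR/MSE$.

```python
# Hypothetical data, not from the thesis: one covariate plus an intercept.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n, p = len(y), 2

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Closed-form least squares estimates for a straight line.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # e'e = y'(I - H)y
ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # y'(H - 11'/n)y

mse = sse / (n - p)
msr = ssr / (p - 1)
f_stat = msr / mse   # compare with an F(p - 1, n - p) critical value
```

The critical value $F_{p-1,\,n-p;\,\alpha}$ would come from tables or a statistics library; it is left out here to keep the sketch dependency-free.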

1.1.2 Other Diagnostics

Some diagnostic tools are used to detect influential and outlying observations in a given regression model. The Studentized residual is very informative in examining residuals under a normal model since it is standardized and it introduces the idea of case deletion, where the fit for all observations is compared to the fit with the deleted case. Now,

$$\mathrm{Var}(e) = \sigma^2 M,$$

where

The diagonal elements $m_{ii}$ of the projection matrix depict those observations with high leverage (i.e. highly influential observations) since they are related to the distance between $x_i$ and $\bar{x}$. Given that $X$ is of full rank, then

$$\sum_{i=1}^{n} m_{ii} = \mathrm{tr}(M) = n - p.$$

Hence, the average of the diagonal elements $m_{ii}$ is $1 - p/n$ and high-leverage observations should have small values for $m_{ii}$ as compared to $1 - p/n$. As a rule of thumb, from Hoaglin and Welsch ([11]), if $m_{ii} \le 1 - 2p/n$, then the $i$th observation is a high-leverage point. Thus, $M$ is a useful diagnostic tool for detecting influential observations.
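The rule of thumb is easy to apply. The sketch below assumes a hypothetical straight-line design (intercept plus one covariate, $p = 2$), for which the hat-matrix diagonal has the well-known closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_k (x_k - \bar{x})^2$, so that $m_{ii} = 1 - h_{ii}$; the data and names are illustrative only.

```python
# Hypothetical covariate values; the last point lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
n, p = len(x), 2

x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Diagonal of H for a straight-line fit, and of M = I - H.
h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
m = [1.0 - hi for hi in h]

cutoff = 1.0 - 2.0 * p / n                       # rule-of-thumb threshold
high_leverage = [i for i, mi in enumerate(m) if mi <= cutoff]
```

Here only the last observation (index 5) falls below the cutoff, and the $m_{ii}$ sum to $n - p$ as stated above.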

Another type of ill-fitting point which arises in model-fitting is an outlier. It does not necessarily imply an influential observation in a given model. In fact, an outlier may be outweighed by neighboring X-valued points. So, the effect that an outlying point exerts on the fit needs to be measured. The smaller the number of observations involved in a model, the greater the effect of the outlier on the model. This can be done through the diagnostic tool of Cook's distance, which measures the effect of deleting an outlier from the data:

$$(\hat{\beta} - \hat{\beta}_{(i)})'X'X(\hat{\beta} - \hat{\beta}_{(i)}) = (\Delta_i\hat{\beta})'X'X(\Delta_i\hat{\beta}), \tag{1.18}$$

where $\Delta_i\hat{\beta} = \hat{\beta} - \hat{\beta}_{(i)}$, with $\hat{\beta}_{(i)}$ denoting the usual LSE of $\beta$ with the $i$th observation deleted from the data.

It gives the distance between the usual least squares estimator and the least squares estimator obtained after the $i$th observation has been deleted and provides a measure for the change in the least squares estimates $\hat{\beta}$ for the deletion of the $i$th observation. It can be shown that

$$\Delta_i\hat{\beta} = \frac{(X'X)^{-1}x_i\,e_i}{m_{ii}},$$

hence, it can be written that

$$(\Delta_i\hat{\beta})'X'X(\Delta_i\hat{\beta}) = \frac{e_i^2\,(1 - m_{ii})}{m_{ii}^2}.$$
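Case deletion can be checked by brute force. The following minimal sketch, assuming a hypothetical straight-line model and illustrative function names, refits the line without case $i$, evaluates the quadratic form $(\hat{\beta} - \hat{\beta}_{(i)})'X'X(\hat{\beta} - \hat{\beta}_{(i)})$, and compares it with the closed form $e_i^2 (1 - m_{ii}) / m_{ii}^2$ that follows from the standard deletion identity. No scaling by $p$ or by the MSE is applied here; the conventional Cook's distance divides by $p \cdot MSE$.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    return y_bar - b1 * x_bar, b1

def deletion_distance(x, y, i):
    """Quadratic form in the change of the estimates, by refitting without case i."""
    b0, b1 = fit_line(x, y)
    c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    d0, d1 = b0 - c0, b1 - c1
    # X'X for an intercept-plus-slope design is [[n, sum x], [sum x, sum x^2]].
    n, sx, sxx = len(x), sum(x), sum(xi * xi for xi in x)
    return d0 * (n * d0 + sx * d1) + d1 * (sx * d0 + sxx * d1)

def closed_form_distance(x, y, i):
    """e_i^2 (1 - m_ii) / m_ii^2 with m_ii = 1 - h_ii, no refitting needed."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    m_ii = 1.0 - (1.0 / n + (x[i] - x_bar) ** 2 / s_xx)
    e_i = y[i] - (b0 + b1 * x[i])
    return e_i ** 2 * (1.0 - m_ii) / m_ii ** 2

# Hypothetical data; the last response is an outlier at an extreme x value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 9.0]
distances = [deletion_distance(x, y, i) for i in range(len(x))]
most_influential = distances.index(max(distances))
```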

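The residual sum of squares (RSS) will also change as a result of an observation deletion. This is measured by:

$$\Delta_i RSS = RSS - RSS_{(i)},$$

where $RSS_{(i)}$ represents the RSS with the $i$th case deleted. Another approach is to measure the perturbation of the fit by letting $\epsilon_j \sim N(0, \sigma^2/v_j)$. Consider

$$v_j = \begin{cases} r & \text{if } j = i, \\ 1 & \text{otherwise,} \end{cases}$$

where $0 \le r \le 1$ is a weight factor defining the matrix $V = \mathrm{diag}(v_j)$. The resulting weighted LSE of $\beta$ is denoted by $\hat{\beta}(r)$.

At $r = 1$: $\hat{\beta}(1) = \hat{\beta}$, the usual least squares estimate, and

at $r = 0$: $\hat{\beta}(0) = \hat{\beta}_{(i)}$, the least squares estimate when the $i$th point is deleted from the data.

The normal equations are changed and consequently, so are the least squares estimates. $\hat{\beta}(r)$ can be expressed as

$$\hat{\beta}(r) = (X'VX)^{-1}X'Vy = \hat{\beta} - \frac{(X'X)^{-1}x_i\,(1 - r)\,e_i}{1 - (1 - r)(1 - m_{ii})}. \tag{1.22}$$

The perturbation effect is measured by differentiating (1.22) with respect to $r$:

$$\frac{\partial\hat{\beta}(r)}{\partial r} = \frac{(X'X)^{-1}x_i\,e_i}{\left[1 - (1 - r)(1 - m_{ii})\right]^2}.$$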

1.1.3 Remedial Measures

If the normality assumptions made on the least squares estimates for linear models are not met in practice, then some remedial measures need to be taken. Throughout the extensive literature available on this topic, one of the most prominent solutions is to use a transformation on the data which may keep the normal linear regression form. However, the implications involved with a selected transformation may not necessarily be easy to interpret. Some of the standard remedial measures taken in case of various model departures are described below.

Non-Linearity

Non-linear Least Squares Estimation:

When a model has normally distributed errors with constant variance, but is non-linear in the independent variables, then the property of additive errors may enable a linear model through a transformation of the independent variables. The most common transformations are the following:

Such models are intrinsically linear ([7], Chapter 5). If these transformations are not possible, then alternative non-linear models may have to be considered:

$$y = g(\beta, x) + \epsilon,$$

where $x$ represents a vector of predictor variables and $g(\beta, x)$ is not linear in $\beta$. The least squares estimator $\hat{\beta}$ for $\beta$ is obtained through differentiation, which yields $p$ normal equations that are not linear, unlike in the case of ordinary least squares. Hence, these normal equations are more complicated to solve. Consequently, numerical methods are usually required to obtain solutions.

When the observations are independent yet have unequal variances, an ordinary least squares regression may yield unbiased estimates, but it will not have minimum variance. Then the observations need to be transformed in terms of weights, $w_i > 0$: $\mathrm{Var}(y_i) = \sigma^2/w_i$, such that the transformed observations $\sqrt{w_i}\,y_i$ have constant variance $\sigma^2$.

Large weights $w_i$ imply small variances and have more impact in a regression model.

Examples of weight components:

1. if the $i$th response is the result of an average of $n_i$ equally variable observations, then $\mathrm{Var}(y_i) = \sigma^2/n_i$ where $w_i = n_i$;

Then, introducing the weight matrix, $W = \mathrm{diag}(w_i)$, the weighted estimator of $\beta$ is

$$\hat{\beta}_W = (X'WX)^{-1}X'Wy.$$
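Weighted least squares can be sketched for a straight line, where the solution of the weighted normal equations has a closed form in terms of weighted means. The data and the function name below are hypothetical; the weights are taken as $w_i = n_i$, matching the example of responses that are averages of $n_i$ observations.

```python
def weighted_line_fit(x, y, w):
    """Minimize sum_i w_i (y_i - b0 - b1 x_i)^2 for a straight line.

    This is the closed form of (X'WX)^{-1} X'Wy when X has an intercept
    column and one covariate.
    """
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw      # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw      # weighted mean of y
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: each y_i is an average of n_i observations, so w_i = n_i.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n_i = [10, 2, 5, 1, 8]

b0, b1 = weighted_line_fit(x, y, n_i)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# The weighted normal equations X'W(y - X beta) = 0 should hold at the solution.
normal_eq_intercept = sum(wi * ri for wi, ri in zip(n_i, residuals))
normal_eq_slope = sum(wi * xi * ri for wi, xi, ri in zip(n_i, x, residuals))
```

With all weights equal, the same function reduces to the ordinary least squares fit.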

Variance Stabilizing Transformations:

When the variances of the observations are not constant, it is possible to transform (see Rao [22], Chapter 6) the observations to make the variance constant. For this method to work, the form of the heteroscedasticity must be known, which is often not the case. Hence, in practice, one seeks transformations in a larger family and looks for an optimal member in this family, which closely follows the assumptions of the model. One such transformation, known as the Box-Cox transformation, is discussed later.

The roughness penalty approach using cubic splines is a method for relaxing the model assumptions in the normal-theory linear case. It addresses two equally important problems in curve estimation: that of finding a good fit to the data used and that of quantifying the rapid fluctuation of a curve. Consider a model which is specified without placing any restrictions on the curve $g$. Hence, if there are no distributional assumptions made, then the normality of errors assumption is relaxed. Methods associated with the above model come under the general auspices of the topic of Non-parametric Regression and the literature on this topic is extensive (see Green and Silverman [10]).

Non-normality and Heteroscedasticity

The Box-Cox transformation is given by

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \ne 0, \\[1ex] \log y, & \lambda = 0, \end{cases}$$

for a positive response variable $y > 0$. This transformation may bring symmetry to a skewed response and reduce the heavy tails of a distribution while still retaining the simplicity of the normal linear model. When it does not provide a good fit to the data, alternative approaches have to be explored. One such approach is to use the generalized linear model (GLM), where the response is assumed to belong to the exponential family. The assumptions made here are based on the concept that the response depends on the predictors through a linear form. Thus, the linear models are generalized through

1. a link function which relates the expectation of the response to the linear predictor, and through

2. an exponential family distribution for the errors.

This model will be described in detail in Chapter 2 and is the highlight of this thesis.
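The Box-Cox transformation referred to in this section is commonly written as $(y^{\lambda} - 1)/\lambda$ for $\lambda \ne 0$ and $\log y$ for $\lambda = 0$. A minimal sketch, assuming this standard one-parameter form (the function name is illustrative):

```python
import math

def box_cox(y, lam):
    """Standard one-parameter Box-Cox transform of a positive response y."""
    if y <= 0:
        raise ValueError("the Box-Cox transformation requires y > 0")
    if lam == 0:
        return math.log(y)            # limiting case as lam tends to 0
    return (y ** lam - 1.0) / lam

# lam = 1 only shifts the data, lam = 0.5 is a square-root-type scale,
# and lam = 0 is the log scale.
examples = {lam: box_cox(4.0, lam) for lam in (1.0, 0.5, 0.0)}
```

In practice $\lambda$ is usually chosen by maximizing a profile likelihood over a grid of values, which is not shown here.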

1.2 Outline of Thesis

The next chapter introduces the GLM, with all the relevant notation. It gives the properties of estimators and computational details for estimating the parameters for common exponential families. Tests for goodness-of-fit and inclusion/exclusion of variables are also included. The basic properties of residuals in the normal theory linear models are used for extending the regression diagnostics to the generalized linear models in Chapter 3. This extension is made possible through transformed residuals, which are explained in detail in that chapter. The final chapter presents numerical illustrations of the techniques discussed in Chapter 3 and gives hands-on experience with real data through computer programs developed using the S-Plus software application.

Chapter 2

The Generalized Linear Model

2.1 Historical Aspects

The term "generalized linear model" was first introduced by Nelder and Wedderburn in 1972. The generalized linear model has been one of the most important developments in the field of statistics in the last thirty years. Much used in applications to the social sciences and medicine, these models also play an important role in the analysis of survival data. As their name suggests, these models generalize the normal-theory linear models such that the usual linear regression component is used to describe a wider class of probability distributions, specifically the exponential family distributions. Although generalized linear models have had an important impact on statistics, most introductory statistics textbooks still only present normal linear models.

It was seen in Chapter 1 that an adequate linear regression model should include a scale which ensures the combination of constancy of variance, approximate normality of the errors, and additivity of the systematic effects. However, this scale does not always respect all three criteria. For example, if some discrete data is found to have errors with an approximate Poisson distribution, the systematic effects may be multiplicative, in which case log-linear models are usually employed. The following choices of scaling are obtained by transforming $y$:

- $y^{1/2}$ to ensure approximate constancy of variance,

Generally, none of these scaling possibilities combine all three criteria for an adequate linear regression analysis. Alternatively, a generalized linear model encompasses errors distributed according to an exponential family and a variance function which depends on the mean in some known way, so that there is no need to scale for normality of errors or for constancy of variance. In fact, the scaling problem is reduced to ensuring that the systematic effects are additive. It may be considered to be an extension to the normal-theory linear model with some added modifications where the mean $\mu$ of an exponential family with response variable $y$ is linearly related to the predictors $x_1, \ldots, x_p$ by a link function, $g(\mu)$. This is described in detail in the sections that follow.

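2.2 Mean and Variance Functions in an Exponential Family

An observation $y$ follows an exponential family distribution if its probability density function is given by

$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}, \tag{2.1}$$

where $a$, $b$, and $c$ are some known functions, $\theta$ is the location parameter and $\phi$ is the dispersion parameter. This is denoted by

$$y \sim \xi(\theta, \phi; a, b, c).$$

When the dispersion parameter $\phi$ is known, $\theta$ is the canonical parameter. The mean and variance of $y$ are given by $b'(\theta)$ and $a(\phi)b''(\theta)$. Thus it can be written that

$$\mathrm{Var}(y) = a(\phi)V(\mu), \qquad \text{where } V(\mu) = b''(\theta)$$

is called the variance function. For example, in the case of the normal distribution, $\theta = \mu$, $V(\mu) = 1$ and $a(\phi) = \sigma^2$. These may be derived from

$$E\left(\frac{\partial l}{\partial\theta}\right) = 0$$

and

$$E\left(\frac{\partial^2 l}{\partial\theta^2}\right) + E\left[\left(\frac{\partial l}{\partial\theta}\right)^2\right] = 0, \tag{2.6}$$

respectively, where $l$ is the log-likelihood function. Note that

$$\frac{\partial l}{\partial\theta} = \frac{y - b'(\theta)}{a(\phi)}, \qquad \frac{\partial^2 l}{\partial\theta^2} = -\frac{b''(\theta)}{a(\phi)},$$

hence equation (2.6) yields

$$\mathrm{Var}(y) = a(\phi)b''(\theta).$$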
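These two identities can be checked numerically for a specific family. The sketch below assumes the Poisson case, for which $b(\theta) = e^{\theta}$ and $a(\phi) = 1$, with a hypothetical mean of 3.5; it compares the mean and variance computed directly from the probability function with finite-difference approximations of $b'(\theta)$ and $b''(\theta)$.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of the count y, computed on the log scale."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

theta = math.log(3.5)      # canonical parameter for a hypothetical mean of 3.5
b = math.exp               # cumulant function b(theta) = exp(theta) for the Poisson
step = 1e-4

# Central finite differences for b'(theta) and b''(theta).
b_prime = (b(theta + step) - b(theta - step)) / (2 * step)
b_double_prime = (b(theta + step) - 2 * b(theta) + b(theta - step)) / step ** 2

# Mean and variance summed directly over (a truncation of) the support.
mu = math.exp(theta)
support = range(200)
mean = sum(y * poisson_pmf(y, mu) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)
```

Both `mean` and `variance` come out close to 3.5, in agreement with $b'(\theta) = b''(\theta) = e^{\theta}$ and hence $V(\mu) = \mu$ for the Poisson family.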

2.3 Description of the Generalized Linear Model

The observations belonging to a statistical model can be summarized in terms of a systematic component and a random component. In the generalized linear model (GLM) discussed by McCullagh and Nelder [17], the random component is inherent in the exponential family distribution of the observation, while the systematic component assumes a linear structure in the predictor variables for a function of the mean. This function is known as the link function. When the parameter $\theta$ is modeled as a linear function of the predictors, then the link function is known as the canonical link. Therefore, for a given set of observations $\{y_i\}_{i=1}^{n}$, where $y_i$ is considered to be associated with predictor values $x_i = (x_{i1}, \ldots, x_{ip})'$, the GLM is expressed as:

$$y_i \sim \xi(\theta_i, \phi; a, b, c), \qquad i = 1, \ldots, n,$$

where $\theta_i$ is assumed to depend on $x_i$ through the relation

$$g(\mu_i) = x_i'\beta, \qquad \mu_i = b'(\theta_i).$$

If $g$ is the canonical link, then the link function is specified by

$$g(\mu_i) = \theta_i = x_i'\beta.$$

In practice, a given data set may be distributed according to some unknown member of the exponential family and therefore, different link functions have to be evaluated. The link function serves to determine the scale on which linearity is assumed, and the form of the exponential family structures the variation in the data. If the parameters $\beta_1, \ldots, \beta_p$ are unrestricted, then $g(\mu)$ can take any value in $\mathbb{R}$, hence the link function is determined to some extent by the domain of variation of $\mu$. For example, if the response is a proportion, then the link function $g$ must map the unit interval of the domain of variation onto the unrestricted range $(-\infty, \infty)$. In the case where the response is limited to being positive, $g$ must map the positive interval onto $\mathbb{R}$.
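The role of the domain of variation can be made concrete with two common links. This is a minimal sketch with illustrative function names: the logit maps the unit interval onto the real line, the log maps the positive half-line onto the real line, and their inverses map any value of the linear predictor back into the admissible range of $\mu$.

```python
import math

def logit(mu):
    """Link for a proportion: maps (0, 1) onto the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse logit: maps any real linear predictor back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_link(mu):
    """Link for a positive mean: maps (0, infinity) onto the real line."""
    return math.log(mu)

def inv_log_link(eta):
    """Inverse log link: maps any real linear predictor to a positive mean."""
    return math.exp(eta)

# Whatever value the linear predictor takes, the implied mean stays admissible.
etas = [-10.0, -1.0, 0.0, 1.0, 10.0]
proportions = [inv_logit(eta) for eta in etas]
positive_means = [inv_log_link(eta) for eta in etas]
```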

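It is shown, as follows, that in the case of a canonical link, the sufficient statistic for the linear parameter $\beta$ is given by $X'y$, where $X = (x_1, \ldots, x_n)'$ represents the design matrix of the $p$ predictor variables and $y$ represents the column vector of the $n$ observations.

To see this, first note that $\mu = b'(\theta)$ and for the canonical link $g(\mu) = \theta$, then it follows that

$$\frac{\partial\mu}{\partial\eta} = \frac{\partial\mu}{\partial\theta} = b''(\theta) = V(\mu), \qquad \eta = g(\mu). \tag{2.9}$$

This fact is used in deriving the likelihood estimator of $\beta$, which will be consequently shown to depend on the observations $y$ through $X'y$, proving the sufficiency. Here, the log-likelihood function is given by

$$l(\beta) = \sum_{i=1}^{n}\left\{\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right\}, \tag{2.10}$$

where $\theta_i = x_i'\beta$. Now, the differentiation of the likelihood function in equation (2.10) gives

$$\frac{\partial l}{\partial\beta_j} = \sum_{i=1}^{n}\frac{(y_i - \mu_i)}{a(\phi)V(\mu_i)}\,\frac{\partial\mu_i}{\partial\eta_i}\,x_{ij}.$$

Using equation (2.9) along with the above equation produces

$$\frac{\partial l}{\partial\beta_j} = \sum_{i=1}^{n}\frac{(y_i - \mu_i)\,x_{ij}}{a(\phi)},$$

which implies for canonical links that

$$X'y = X'q(\beta),$$

for some nonlinear function $q$. This is attributed to the fact that $g(\mu) = \theta$ holds for canonical links only. Hence, the maximum likelihood estimator of $\beta$ depends on the observations only through $X'y$.

Now, canonical links for the binomial, Poisson and gamma families are given respectively by the logit, log and inverse transformations. Consider the probability distribution of the proportion $y$ based on a sequence of $m$ identical Bernoulli trials with probability of success $\pi$, then

$$f(y; \theta, \phi) = \exp\left\{m\left[y\theta - \log(1 + e^{\theta})\right] + \log\binom{m}{my}\right\},$$

where $\theta = \log\dfrac{\pi}{1 - \pi}$, hence the canonical link is given by the logit transformation and the generalized linear model is given by

$$\log\frac{\pi_i}{1 - \pi_i} = x_i'\beta.$$

For the Poisson data with mean $\mu$, the probability distribution function is denoted by:

$$f(y; \theta, \phi) = \exp\{(y\theta - e^{\theta}) - \log(y!)\},$$

where $\theta = \log\mu$, then clearly the log transformation yields a canonical link. Similarly for the gamma data with density

$$f(y) = \frac{1}{\Gamma(\alpha)k^{\alpha}}\,e^{-y/k}\,y^{\alpha - 1},$$

it may be reparametrized such that $\alpha = 1/\phi$ and $k = -\phi/\theta$, hence to get

$$f(y; \theta, \phi) = \exp\left\{\frac{y\theta + \log(-\theta)}{\phi} - \frac{1}{\phi}\log\phi + \left(\frac{1}{\phi} - 1\right)\log y - \log\Gamma\!\left(\frac{1}{\phi}\right)\right\}.$$

Therefore, $\mu = k\alpha = -1/\theta$ and consequently, the canonical link is given by

$$g(\mu) = \theta = -\frac{1}{\mu}.$$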

Table 2.1: Dispersion Parameter, Canonical Link and Variance Function for Distributions of the Exponential Family

DISTRIBUTION | Notation | $a(\phi)$ | $\theta = g(\mu)$ | Name | $V(\mu)$

Table 2.1 gives canonical links and other components for common distribution families with respect to the exponential family given by equation (2.1) [17]. The choice of a proper link function that will satisfy the criterion of the domain of variation of $\mu$ is based on:

1. how easily the parameters in the linear predictor can be interpreted under the link function;

2. how the link fits to the data; and

3. the existence of a simple sufficient statistic.

Possible link functions associated to some important members of the exponential family are stated in Table 2.2. In summary, generalized linear models make up a general class of probabilistic regression models with the assumptions that:

(1) the response probability distribution is a member of the exponential family of distributions;

(2) the responses $y_i$, $i = 1, \ldots, n$ are a set of independent random variables;

(3) the explanatory variables are linearly combined to explain systematic variation in a function of the mean.

In a practical data situation, GLM fitting involves the following:

- choosing an error distribution that is relevant;
- identifying the independent variables to be included in the systematic component; and
- specifying the link function.

The next section presents the maximum likelihood method for estimating the regression parameters assuming that the above have been specified.

2.4 Maximum Likelihood Estimation

for the GLM

If the probability specifications of an exponential family model are known through $f(y; \theta)$, then the best way to fit a generalized linear model is by Maximum Likelihood Estimation of the parameters $\beta$ for the data observed (Silverman and Green [10]). Given the many desirable properties of maximum likelihood estimators, such as consistency, efficiency, sufficiency and asymptotic normality, it is natural to consider such a method for GLMs. In general, the maximum likelihood equations which result from GLMs cannot be solved explicitly and hence recourse must be made to numerical methods. Three methods are described in this section: the Newton-Raphson method, the Fisher Scoring method, and the Iteratively Weighted Least Squares method. But first, the maximum likelihood equations are derived. Given the responses $y_1, \ldots, y_n$, where $y_i$ is considered to be generated from a member of the exponential family $f(\theta_i, \phi; a, b, c)$, the likelihood function is written as

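$$L(\theta, \phi; y) = \prod_{i=1}^{n} f(y_i; \theta_i, \phi) = \prod_{i=1}^{n} \exp\left\{ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}.$$

Then the log-likelihood is given by

$$\ell(\theta, \phi; y) = \sum_{i=1}^{n} \ell_i,$$

whereby $\ell_i$ is the ith component of the log-likelihood and is therefore given by

$$\ell_i = \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi).$$

The likelihood implicitly depends on the parameters $\beta_j$, $j = 1, \ldots, p$, firstly through the link function $g(\mu)$ and secondly through the linearity that it encompasses with respect to the $\beta$ values. The derivatives of the log-likelihood with respect to $\beta_j$, otherwise known as the score functions, are evaluated by the chain rule:

$$\frac{\partial \ell_i}{\partial \beta_j} = \frac{\partial \ell_i}{\partial \theta_i}\,\frac{d\theta_i}{d\mu_i}\,\frac{d\mu_i}{d\eta_i}\,\frac{\partial \eta_i}{\partial \beta_j}.$$

Hence, the score functions reduce to

$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \frac{(y_i - \mu_i)\,x_{ij}}{a(\phi)\,V(\mu_i)\,g'(\mu_i)}, \qquad j = 1, \ldots, p.$$

In vector form, the score equations are given by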

where

The maximum likelihood estimator of $\beta$ is obtained by solving (2.19) using the linearity found in $g(\mu) = X\beta$, where $g(\mu) = (g(\mu_1), \ldots, g(\mu_n))'$. Numerical methods to solve (2.19) are essentially iterative. Common to all these methods is the need for a starting value of the estimate. With the immediate aim of obtaining a "good" starting value of the estimate, the following technique is employed using the approximate linearized form $g(y) = g(\mu) + (y - \mu)g'(\mu)$. The adjusted dependent variate $z$, which depends on both $y$ and $\mu$, is introduced.

Given that the variance of $z$ is $a(\phi)[g'(\mu)]^2 V(\mu)$, an initial estimate of $\beta$ may be obtained by Weighted Least Squares of $z$ (with $\mu = y$) on $X$, with variance-covariance matrix given by a diagonal matrix whose components are given by

1

Known as the working weights matrix, this matrix is denoted by $W$. In cases where repeated observations occur at a given design point, $y_i$ is replaced by the average of the sample observations. Since the average also belongs to the same exponential family, with the variance replaced by $a(\phi)V(\mu_i)/m_i$, $m_i$ being the number of observations on which the sample mean is based, the working weights matrix contains diagonal elements given by

or equivalently, solving for the weighted least squares estimator from the model

Both $z$ and $W$ are used for maximum likelihood estimation through a weighted least squares regression. This process is iterative, since both $z$ and $W$ depend on the fitted values from the current estimates. Some scoring method is needed to measure the change between iterations of a weighted least squares regression of a GLM, until convergence is reached.
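As a small illustration of the adjusted dependent variate and working weights described above, the following sketch computes $z$ and the diagonal of $W$ for one particular case. It assumes a Poisson response with log link, so that $g'(\mu) = 1/\mu$, $V(\mu) = \mu$ and $a(\phi) = 1$; the function name and the numbers are hypothetical and are not taken from the thesis programs.

```python
import math

def working_response_and_weights(y, mu):
    """Adjusted dependent variate z and working weights w (assumed Poisson, log link)."""
    z, w = [], []
    for yi, mi in zip(y, mu):
        g_prime = 1.0 / mi                            # derivative of g(mu) = log(mu)
        z.append(math.log(mi) + (yi - mi) * g_prime)  # z = g(mu) + (y - mu) g'(mu)
        w.append(1.0 / (g_prime ** 2 * mi))           # 1 / ([g'(mu)]^2 V(mu)), equal to mu here
    return z, w

y = [2.0, 5.0, 9.0]    # hypothetical counts
mu = [3.0, 5.0, 8.0]   # hypothetical current fitted means
z, w = working_response_and_weights(y, mu)
```

For the log link the working weight reduces to the fitted mean itself, which is why `w` reproduces `mu` in this sketch.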

2.4.1 The Newton-Raphson Method

The Newton-Raphson method presents a numerical approach to calculating the maximum likelihood estimate $\hat{\beta}$. This iterative process begins with a weighted least squares estimator obtained from the initial solution of (2.23). A Taylor-series expansion of $\ell(\hat{\beta})$ about $\beta^{(0)}$ is used:

This is iteratively repeated until convergence is obtained [10].
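The Newton-Raphson idea can be sketched on the simplest possible case. The example below assumes an intercept-only Poisson model with log link, for which the log-likelihood is $\sum_i (y_i\beta - e^{\beta})$ up to a constant, the score is $\sum_i y_i - n e^{\beta}$ and the second derivative is $-n e^{\beta}$. The function name, starting value and data are illustrative assumptions only.

```python
import math

def newton_raphson_poisson_intercept(y, beta0=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson for an intercept-only Poisson log-linear model (illustrative)."""
    n = len(y)
    total = sum(y)
    beta = beta0
    for _ in range(max_iter):
        score = total - n * math.exp(beta)   # first derivative of the log-likelihood
        hessian = -n * math.exp(beta)        # second derivative of the log-likelihood
        step = score / hessian
        beta -= step                         # Newton update: beta - score / hessian
        if abs(step) < tol:
            break
    return beta

y = [2.0, 5.0, 9.0, 4.0]   # hypothetical counts
beta_hat = newton_raphson_poisson_intercept(y)
```

In this special case the maximum likelihood estimate is known in closed form, the log of the sample mean, which gives a direct check on the iteration.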

2.4.2 Fisher's Scoring Method

If the negative second-derivative matrix, or the Hessian matrix, is not positive definite at every iteration (i.e. if it is not invertible), then the Newton-Raphson algorithm is no longer valid. In this case, the Hessian matrix is replaced by its expectation, which gives Fisher's scoring algorithm. This method is simple since the expected matrix is more likely to be positive definite, as

$$E\left[-\frac{\partial^2 \ell}{\partial\beta\,\partial\beta'}\right] = E\left[\frac{\partial \ell}{\partial\beta}\,\frac{\partial \ell}{\partial\beta'}\right],$$

which is the expectation of a positive semi-definite matrix. Thus, the iterative process for Fisher's scoring algorithm is given by:

where the inverse $-\left(E\left[\frac{\partial^2 \ell}{\partial\beta\,\partial\beta'}\right]\right)^{-1}$ is evaluated at the previous iteration. For evaluating the derivatives in (2.28), the linear predictor $\eta$ is used, where $\eta = X\beta$:

and

Note that $-E\left[\frac{\partial^2 \ell}{\partial\eta_i\,\partial\eta_j}\right] = [a(\phi)]^{-1} w_{ij}$ for $i = j$, and that it is 0 for $i \neq j$. Consider $z^{(0)}$ to be the initial n-vector with

Then it follows that

Since $\eta = X\beta$, then by the chain rule

Fisher's scoring algorithm yields the following sequence of updated estimates:

The dispersion parameter $\phi$ is eliminated because $a(\phi)$ cancels in the multiplication; hence it is called a nuisance parameter (McCullagh and Nelder [17]).

2.4.3 Iteratively Weighted Least Squares (IWLS)

As indicated in Section 2.4, the introduction of the adjusted dependent variate $z$ results in the following equation for the MLE [see (2.23)]:

$$\hat{\beta} = (X'WX)^{-1}X'Wz.$$

However, $z$ and $W$ depend on the unknown $\beta$, hence this equation gives rise to the iterative process

$$\hat{\beta}^{(i+1)} = (X'W^{(i)}X)^{-1}X'W^{(i)}z^{(i)}.$$

This is known as the method of iteratively weighted least squares, IWLS. The starting value of the iteration is obtained by substituting $\hat{\mu}_0 = y$. At each iteration $i$, a weighted least squares regression of the working response variate $z^{(i)}$ on the design matrix $X$ is carried out with the working weights matrix $W^{(i)}$, where $z^{(i)}$ and $W^{(i)}$ are obtained by replacing $\mu$ with $\hat{\mu}^{(i)} = g^{-1}(X\hat{\beta}^{(i)})$. The algorithm can thus be summarized as follows:

• Start with a sufficient statistic from the data to get an initial fitted value vector $\hat{\mu}$.

• From this statistic, the link function $g$ is used to derive an initial linear predictor $\hat{\eta}$.

• These statistics are used in creating the starting adjusted dependent variate and working weights matrix as follows:

• A weighted least squares regression of $z^{(0)}$ on $X$ is carried out for the model $E(z) = X\beta$ with the working weights matrix $W^{(0)}$ to obtain a first maximum likelihood estimate:

which is then used to obtain updated values of $\hat{\eta}$ and $\hat{\mu}$:

This process is repeated to update the regression estimates at each iteration via a scoring algorithm, until the variation from one iteration to the next is sufficiently small. The maximum likelihood estimation method through the IWLS procedure is an extension of the non-iterative least squares method of estimation for normal-theory linear models, with $W^{1/2}X$ as the design matrix and the adjusted dependent variate $W^{1/2}z$ as the response variable.

At convergence, if it occurs, $z$ becomes $z = \hat{\eta} + W^{-1}(y - \hat{\mu})$, so that the maximum likelihood estimate of $\beta$ is:

If the working weights matrix $W = I$ (the identity matrix), then the maximum likelihood and least squares methods coincide. No iteration is required for the maximum likelihood estimation:

Hence, the IWLS method extends the least squares procedure beyond the linear model to the generalized linear model, which includes the binomial, Poisson, normal, inverse normal, gamma, exponential, and multinomial distributions.
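The IWLS steps summarized above can be sketched for one concrete case. The code below assumes a binomial response with logit link, an intercept and a single covariate, so that the weighted normal equations form a 2 by 2 system that can be solved by hand. It is not the S-Plus program of Appendix A; the function name, the shrunken starting proportions and the data are all hypothetical choices made for illustration.

```python
import math

def iwls_logistic(x, y, m, tol=1e-10, max_iter=100):
    """IWLS for a logit model with intercept and one covariate (illustrative sketch).

    x: covariate values, y: success counts, m: numbers of trials.
    """
    # Start from slightly shrunken observed proportions so that the logit is finite.
    pi = [(yi + 0.5) / (mi + 1.0) for yi, mi in zip(y, m)]
    eta = [math.log(p / (1.0 - p)) for p in pi]
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # Working weights and adjusted dependent variate at the current fit.
        w = [mi * p * (1.0 - p) for mi, p in zip(m, pi)]
        z = [e + (yi - mi * p) / wi for e, yi, mi, p, wi in zip(eta, y, m, pi, w)]
        # Solve (X'WX) beta = X'Wz for the design matrix X = [1, x].
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        # Update the linear predictor and the fitted probabilities.
        eta = [b0 + b1 * xi for xi in x]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        if change < tol:
            break
    return b0, b1, pi

x = [0.0, 1.0, 2.0, 3.0]        # hypothetical covariate
y = [2.0, 4.0, 6.0, 9.0]        # hypothetical success counts
m = [10.0, 10.0, 10.0, 10.0]    # hypothetical numbers of trials
b0, b1, pi_hat = iwls_logistic(x, y, m)
score0 = sum(yi - mi * p for yi, mi, p in zip(y, m, pi_hat))
score1 = sum(xi * (yi - mi * p) for xi, yi, mi, p in zip(x, y, m, pi_hat))
```

Because the logit is the canonical link for the binomial, the converged fit should satisfy the likelihood equations $X'(y - \hat{\mu}) = 0$, which is what `score0` and `score1` check.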

An interesting point to note is that the working weights matrix $W$ used in IWLS is updated at each iterative step, so that each element $w_{ii}$ of $W$ is also updated for each observation $i$. Hence, $W$ depends entirely on the fit of the model, and not at all on the likelihood equations $X'(y - \hat{\mu}) = 0$ used to determine $\hat{\beta}$. In contrast, the weights determine the fit in the weighted least squares method.

The basic components of the generalized linear model, as an extension of the normal-theory model, may be summarized in the following table:

2.5 The Goodness of Model Fit

As previously stated, the link function which is used to describe the systematic component is often unknown. Canonical links may simplify the mathematics, but they may not necessarily give the best prediction. A natural question bound to arise in fitting a GLM is "How good is the link function used?", in relation to some other potential link functions. In fact, the model fit is questioned. Other issues attributable to model fitting are based on assumptions such as the exponential family distribution of the observations, the constancy of the dispersion parameter and the independence of the observations, much like those seen in normal-theory linear models, and the issue of identifying influential observations.

A common goal in postulating the systematic effects is to have only as many independent variables as necessary for a good fit. Consequently, measures which can determine the quality of the fit and statistical tests for keeping the variates in the model are sought. In particular, the two most useful goodness-of-fit statistics are the deviance measure and the Pearson statistic. The deviance measure is motivated by the discrepancy between the maxima of the observed and the expected (under the model) log-likelihood functions. Conversely, the Pearson statistic measures the relative difference between the observed and the fitted values. Both of these statistics can be approximated by the $\chi^2$ distribution with corresponding degrees of freedom. In either case, a large deviance or chi-square value implies poorly fitted observations with respect to the model.

2.5.1 The Deviance Function

The maximized likelihood for a given model may be considered to be an indicator of the goodness-of-fit. For example, the ratio of the maximized likelihoods under two models, as a measure of the goodness of one model over the other, may be such an indicator, or alternatively, the logarithm of this ratio. The deviance measure $D$ is thus defined as twice the logarithm of the likelihood ratio. Subsequently, a related

implies a small $D^*$, a good fit is indicated by small values of the deviance. The table below expresses the deviance function for the different members of the exponential family with their respective canonical links. Note that $\hat{\mu}_i$ is the value of $\mu_i = E(y_i)$ for the model considered.

The unscaled version of the deviance is

In (2.38), the parameter $\hat{\theta}_i$ is the MLE of $\theta_i$ under the fitted model. Each $d_i$ measure contributes to the deviance. The value of $\theta_i$ which maximizes the likelihood function for each ith observation satisfies $b'(\theta_i) = y_i$.
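To make the deviance concrete, the sketch below evaluates it for one member of the family. It assumes a Poisson response, for which the unscaled deviance takes the standard form $D = 2\sum_i [y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]$, with the convention that $y \log y = 0$ when $y = 0$. The function name and the numbers are hypothetical.

```python
import math

def poisson_deviance(y, mu):
    """Unscaled deviance for Poisson responses (standard form, assumed here)."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # y * log(y / mu) is taken as 0 when y = 0.
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total

y = [2.0, 5.0, 9.0]        # hypothetical counts
mu_hat = [3.0, 5.0, 8.0]   # hypothetical fitted means
D = poisson_deviance(y, mu_hat)
D_saturated = poisson_deviance(y, y)   # fitted values equal to the data
```

When the fitted values reproduce the data exactly the deviance is zero, and it grows as the fit moves away from the observations.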

2.5.2 The Pearson Statistic

The Pearson statistic is defined using the weighted least squares approach, which provides the following chi-square goodness-of-fit measure:

$$X^2 = \min_{\beta} \sum_{i=1}^{n} w_i\,(z_i - x_i'\beta)^2.$$

This measure is computationally simpler than the deviance measure, but it is more useful for distributions closer to the Normal family, as it resembles the RSS under normal theory for other diagnostic purposes. However, when the probability density function of the observations is markedly asymmetric, the outliers may not be well detected by Pearson residuals. Conversely, the deviance residuals will detect outliers better in these situations.
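A minimal sketch of the Pearson statistic follows, written in its familiar equivalent form $X^2 = \sum_i (y_i - \hat{\mu}_i)^2 / V(\hat{\mu}_i)$. The variance function is passed in as an argument; the default $V(\mu) = \mu$ corresponds to the Poisson case, and the function name and data are assumptions made for illustration.

```python
def pearson_statistic(y, mu, variance=lambda m: m):
    """Pearson chi-square: squared raw residuals scaled by the variance function."""
    return sum((yi - mi) ** 2 / variance(mi) for yi, mi in zip(y, mu))

y = [2.0, 5.0, 9.0]        # hypothetical counts
mu_hat = [3.0, 5.0, 8.0]   # hypothetical fitted means
X2_poisson = pearson_statistic(y, mu_hat)                          # V(mu) = mu
X2_normal = pearson_statistic(y, mu_hat, variance=lambda m: 1.0)   # V(mu) = 1
```

With $V(\mu) = 1$ the statistic is exactly the residual sum of squares, which illustrates the resemblance to the RSS noted above.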

2.5.3 Residuals and the Projection Matrix

The usefulness of the residuals $r_i = y_i - \hat{\mu}_i$, where $\hat{\mu}_i$ is from the model fit, as used for diagnostic purposes in normal-theory linear models, does not carry over to GLMs. However, since $\sum_i r_i^2$ serves as a measure of goodness-of-fit in normal-theory models, it would be best if the two measures given here could be decomposed into components, which in turn could serve as modified residuals in GLMs. Using this concept, it can be seen that

where

which are the weighted residuals or the Pearson residuals.

Similarly,

where

These are the deviance residuals (see Pregibon [20]). Hence, as in normal-theory models, both the Pearson and deviance residuals may be useful in developing diagnostic tools in GLMs. This will be discussed in Chapter 3.
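The decomposition of the two goodness-of-fit measures into residual components can be sketched as follows. The code assumes Poisson responses, with Pearson residuals $(y_i - \hat{\mu}_i)/\sqrt{\hat{\mu}_i}$ and deviance residuals equal to the signed square roots of the individual deviance contributions; the function names and data are hypothetical.

```python
import math

def pearson_residuals(y, mu):
    """Pearson residuals for Poisson responses: (y - mu) / sqrt(V(mu)) with V(mu) = mu."""
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

def deviance_residuals(y, mu):
    """Signed square roots of the Poisson deviance contributions."""
    out = []
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        d_i = 2.0 * (term - (yi - mi))
        sign = 1.0 if yi >= mi else -1.0
        out.append(sign * math.sqrt(max(d_i, 0.0)))   # guard against tiny negative rounding
    return out

y = [2.0, 5.0, 9.0]        # hypothetical counts
mu_hat = [3.0, 5.0, 8.0]   # hypothetical fitted means
r_p = pearson_residuals(y, mu_hat)
r_d = deviance_residuals(y, mu_hat)
X2 = sum(r * r for r in r_p)   # recovers the Pearson statistic
D = sum(r * r for r in r_d)    # recovers the deviance
```

Summing the squared residuals of each type returns the Pearson statistic and the deviance respectively, which is the sense in which they act as components of the two measures.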

For detecting influential observations and outliers, the use of the adjusted dependent variate $z$ permits the use of the projection matrix

using the transformation $X \to W^{1/2}X = X_w$ and the least squares theory as introduced in Chapter 1. Hence

shares the properties of a projection matrix. As mentioned in Chapter 1, the diagonal elements $m_{ii}$ can be used for diagnostic purposes. It is also interesting to note that

This implies that

where $\chi$ denotes the vector of Pearson residuals for the canonical link. Hence, to conclude, $M$ spans the space of the Pearson residuals under the condition that the canonical link is used.
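As an illustration of how the diagonal elements of a weighted projection matrix are obtained, the sketch below computes the leverages $h_{ii}$ of the hat matrix $H = W^{1/2}X(X'WX)^{-1}X'W^{1/2}$ listed in Table 2.3. It assumes a design matrix with an intercept and one covariate, so that the inverse of $X'WX$ can be written out explicitly; the function name, covariate values and working weights are hypothetical.

```python
def leverages(x, w):
    """Diagonal of H = W^(1/2) X (X'WX)^(-1) X' W^(1/2) for X = [1, x]."""
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = s_w * s_wxx - s_wx ** 2
    # Entries of the symmetric 2x2 inverse of X'WX.
    a, b, c = s_wxx / det, -s_wx / det, s_w / det
    # h_ii = w_i * (1, x_i) (X'WX)^(-1) (1, x_i)'
    return [wi * (a + 2.0 * b * xi + c * xi * xi) for wi, xi in zip(w, x)]

x = [0.0, 1.0, 2.0, 3.0]   # hypothetical covariate
w = [1.6, 2.4, 2.4, 0.9]   # hypothetical working weights at convergence
h = leverages(x, w)
```

The leverages lie between 0 and 1 and sum to the number of fitted parameters (2 here), as expected for a projection matrix.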

2.6 Alternative Models

For both normal linear models and GLMs, the form of the distribution, and therefore the likelihood function, is known. However, in practice this information may not be available. Then some features of the data need to be evaluated, such as how the mean response $\mu$ relates to the independent variates, how the variability of the response relates to $\mu$, and whether the observations are all independent. Quasi-likelihood estimation is based on the idea of incomplete distribution specification. It is determined entirely by the mean and variance functions. Like the optimal property of linear least squares estimates, quasi-likelihood estimates have asymptotic optimality properties.

Consider $g$ to be the link function which relates the mean response $\mu$ to the systematic part of a GLM:

$$g(\mu) = X\beta.$$

Only the form of the mean and variance functions is necessary for the quasi-likelihood function.

The quasi-likelihood function is defined by the form

Since $V(\mu)$ is most often proportional to $\mathrm{Cov}(y)$, it is safe to assume that $V(\mu) = \mathrm{Cov}(y)$. Here, the proportionality of $\mathrm{Cov}(y)$ to a matrix of known constants in normal linear models is extended to proportionality to a matrix of known functions of the mean vector $\mu$ for nonlinear models. Then it follows from least squares theory that

(1) the estimate $\hat{\eta}$ minimizes the quadratic form of $Q(\mu; y)$ over $\mu(\beta)$, and

(2) the weighted sum of squares estimate $\hat{\beta}$ will satisfy the quasi-score equations

This approach is the GLM counterpart of the least squares approach to the usual linear model with normality assumption. It provides a basis for using the generalized linear model without adhering to a particular exponential family assumption.
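As a worked illustration of how a quasi-likelihood is built from the variance function alone, consider a single observation. The block below assumes the usual Wedderburn form of the quasi-likelihood for one observation, with dispersion $\phi$, and specializes it to $V(\mu) = \mu$; this form is an assumption stated here for illustration and is not copied from the thesis equations.

```latex
% Assumed (Wedderburn) form of the quasi-likelihood for one observation:
\[
  Q(\mu; y) = \int_{y}^{\mu} \frac{y - t}{\phi\, V(t)}\, dt .
\]
% Worked case V(t) = t:
\[
  Q(\mu; y) = \frac{1}{\phi}\Big[\, y \log t - t \,\Big]_{t = y}^{t = \mu}
            = \frac{1}{\phi}\left\{ y \log\frac{\mu}{y} - (\mu - y) \right\},
  \qquad
  \frac{\partial Q}{\partial \mu} = \frac{y - \mu}{\phi\, \mu} .
\]
```

Under this assumption, $-2\phi\,Q(\mu; y)$ equals $2\{y \log(y/\mu) - (y - \mu)\}$, the Poisson deviance contribution, so the quasi-likelihood with $V(\mu) = \mu$ behaves like a Poisson log-likelihood without requiring the Poisson distribution.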

Table 2.2: Distribution Functions with their Associated Links

FAMILY MEMBER

LINK    Normal    Poisson    Binomial    Gamma    Inverse Gaussian

Table 2.3: An Extension of the Normal-Theory Linear Model to the GLM

Normal-Linear                          GLM

$y$ - dependent variate                $z$ - adjusted dependent variate

$\hat{\mu}$ - linear predictor         $\hat{\eta}$ - linear predictor

s2 - the res-iduai variance PM h~ 6v@) X W"~X

$H$ - the hat (projection) matrix      $H = W^{1/2}X(X'WX)^{-1}X'W^{1/2}$

Chapter 3

Residual Diagnostic Measures

Two types of residuals were introduced in Chapter 2, namely, the Pearson type ($r_P$) and the deviance-based ($r_D$). It is found that the deviance-based residuals provide better goodness-of-fit measures for GLMs than does the Pearson statistic, even though the latter is more nearly chi-squared distributed. The reasons for this are the almost normality of the deviance-based residuals and the convenience of their use for likelihood-based inference. In fact, deviance-based residuals are especially appropriate for identifying individual poorly fitted observations. Here, the dispersion parameter $\phi$ is considered to be known, in which case the exponential family is essentially given by the density function

where the scale parameter is omitted. The $\theta_i$ are assumed to follow the tentative model given by

where $g(\cdot)$ is a specified function, $x_i$ is a vector of known variables, and $\beta$ is a vector of unknown parameters. The residuals discussed in this chapter, however, are useful in a more general setting than just the exponential family distribution. The diagnostics are based on the asymptotic distribution of residuals. In GLMs, two types of asymptotic situations arise:

(1) when $n \to \infty$, and

(2) when the index $m \to \infty$, which is equivalent to each $Y_i$ becoming approximately normal.

These situations are referred to as n-asymptotics and m-asymptotics respectively. In situation (2), $m$ would correspond to the sample size for the binomial distribution, the means for the Poisson, or the gamma shape parameters. Hence $m$ can be thought of as a common factor multiplying the exponents in these aforementioned densities. The standard asymptotic results for estimation and hypothesis testing with respect to $\beta$ apply if either $m$ or $n$ is large. However, asymptotic results pertaining to individual case diagnostics require large $m$, irrespective of $n$. The problem arises when $n$ is large but $m$ is not. This is a commonly occurring situation for residual distributions. Distinguishing between first- and second-order m-asymptotics (i.e. corresponding to stochastic convergence of order $m^{-1/2}$ and $m^{-1}$ respectively), the second-order asymptotic results are more useful than the first-order ones when $m$ is small (see Pierce and Schafer [21]).


Consider residuals that are approximately normally distributed. In the following models, $\theta_i$ is treated as known, but in practice, it is replaced by

Three types of residuals are considered:

where E = mean and SD = standard deviation,

where $t(\cdot)$ is a specified transformation depending on the particular distribution of $y$.

There are two ways to go about choosing a transformation $t(\cdot)$. One way lets the first-order m-asymptotic skewness of $t(y)$ be zero (i.e. symmetrizing), and hence approximate normality may be achieved. This is done using primarily the Anscombe residual.

(a) Anscombe Residual (see [2])

Starting with a function which will make the distribution of $A(y)$ as normal as possible, standardized with 0 mean and unit variance to the first order in $\mu$, for the likelihood functions in GLMs, the function $A(\cdot)$ is given by:

A 'symmetrizing transformation' (see Chaubey and Mudholkar [3]) on $t(\cdot)$ (for $t' \neq 0$) can be obtained by solving

In the case of the binomial distribution with proportions $\pi$ and $m$ trials, the symmetrizing transformation is given by

which has no explicit solution but can be solved numerically using the incomplete beta function.

For a Poisson distribution with mean $\mu$, the transformation yields

As for the gamma distribution with mean $\mu$ and shape parameter $\alpha$, the transformation is known as the Wilson-Hilferty cube-root transformation

An alternative to the approximate normality objective is to choose a $t(\cdot)$ that will make the m-asymptotic variance of $t(y)$ constant in $\theta$.

(b) Variance Stabilizing Residual (see [a)

If $\{t_n\}$, $n = 1, 2, \ldots$, is a sequence of statistics such that

i.e. $\sqrt{n}(t_n - \theta)$ has an asymptotic distribution,

then it follows that if $g$ is a function with the first derivative existing and being continuous, $g'(\theta) \neq 0$, then

and further, if $\sigma(\theta)$ is continuous, then

By the Taylor series expansion,

Now if $h$ is a function such that $h'(\theta)\sigma(\theta) = c$, where $c$ is independent of $\theta$,

$$\frac{dh}{d\theta} = \frac{c}{\sigma(\theta)}, \qquad h(\theta) = c\int \frac{d\theta}{\sigma(\theta)}.$$

Then the asymptotic variance of $h(t_n)$ is independent of $\theta$:

If $y$ is a random variable with distribution $B(m, \pi)$, then the variance-stabilizing transformation for the binomial distribution is

and for the Poisson, $P(\mu)$,


The variance-stabilizing transformation for the gamma distribution $G(\alpha, k)$, where $E(y) = \alpha k = \mu$ and $\mathrm{Var}(y) = \alpha k^2 = k\mu$, yields the following asymptotic mean

and variance

Table 3.1 summarizes the Anscombe residuals with an $O(m^{-1/2})$ correction added to $t[E(y)]$ and the variance-stabilizing residuals (see [21]):

Binomial, Poisson and Gamma distributions

ANSCOMBE RESIDUAL    VARIANCE-STABILIZING RESIDUAL
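For a concrete feel for the two transformation-based residuals, the sketch below evaluates both for a Poisson response. It assumes the common textbook forms, $\tfrac{3}{2}(y^{2/3} - \mu^{2/3})/\mu^{1/6}$ for the Anscombe residual and $2(\sqrt{y} - \sqrt{\mu})$ for the variance-stabilizing residual, without the $O(m^{-1/2})$ correction mentioned above; these forms, the function names and the numbers are assumptions for illustration rather than the entries of Table 3.1.

```python
import math

def anscombe_residual_poisson(y, mu):
    """Anscombe residual for a Poisson response (textbook form, no correction term)."""
    return 1.5 * (y ** (2.0 / 3.0) - mu ** (2.0 / 3.0)) / mu ** (1.0 / 6.0)

def variance_stabilizing_residual_poisson(y, mu):
    """Square-root (variance-stabilizing) residual for a Poisson response."""
    return 2.0 * (math.sqrt(y) - math.sqrt(mu))

y_obs, mu_fit = 9.0, 4.0   # hypothetical observation and fitted mean
r_a = anscombe_residual_poisson(y_obs, mu_fit)
r_vs = variance_stabilizing_residual_poisson(y_obs, mu_fit)
r_pearson = (y_obs - mu_fit) / math.sqrt(mu_fit)   # Pearson residual, for comparison
```

For an observation in the upper tail, both transformed residuals come out smaller than the Pearson residual, reflecting the way the transformations pull in the long right tail of the Poisson.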

$$r_D(y, \theta) = \mathrm{sign}(\hat{\theta} - \theta)\left\{2\left[\ell(\hat{\theta}, y) - \ell(\theta, y)\right]\right\}^{1/2}, \qquad (3.11)$$

where $\hat{\theta}$ is the MLE of $\theta$ based on $y$ without restriction by the model $\theta_i = g(x_i'\beta)$. The deviance residual measures the discrepancy between the log-likelihood for the current model and the maximum possible log-likelihood for the data. Under first-order m-asymptotics, the deviance residual has an approximate standard normal distribution. An adjustment to the deviance residual will remove the bias coming from the asymptotic term, $O(m^{-1/2})$, and the adjusted deviance residual is formed, as described next.

(4) Adjusted deviance residual

The table which follows cites the expressions for deviance residuals and adjusted deviance residuals for the three given distributions.

Table 3.2: Deviance and Adjusted Deviance Residuals for the Three Distributions

ADJUSTMENT TERM TO

Taking the normal approximated tail probabilities, these residuals for different values of $y$ lie between .0001 and .10 for the binomial and Poisson distributions and are equal to .05 and .01 for the gamma. Pierce and Schafer [21] compared the true tail probabilities for each respective density,

$\Phi[R(y + .5, \theta)]$ and $1 - \Phi[R(y - .5, \theta)]$,

by considering different residuals $R$, where $y$ is an integer. For all three density functions, Pierce and Schafer found that the Anscombe residual and the adjusted deviance residual are good for approximate normality, even when $m$ is small. Furthermore, the adjusted deviance residual should be consistently the closest to the true tail probability throughout, for the different distributions, due to its almost-normal characteristic.
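The adjustment can be sketched for the Poisson case. The code below assumes the form attributed to Pierce and Schafer, in which one sixth of the standardized third cumulant is added to the deviance residual; for a Poisson response with mean $\mu$ that cumulant is $\mu^{-1/2}$, giving an adjustment of $1/(6\sqrt{\mu})$. This specific form, the function names and the values are stated as assumptions, since the entries of Table 3.2 are not reproduced here.

```python
import math

def deviance_residual_poisson(y, mu):
    """Signed square root of the Poisson deviance contribution."""
    term = y * math.log(y / mu) if y > 0 else 0.0
    d = 2.0 * (term - (y - mu))
    sign = 1.0 if y >= mu else -1.0
    return sign * math.sqrt(max(d, 0.0))

def adjusted_deviance_residual_poisson(y, mu):
    """Deviance residual plus the assumed adjustment term 1 / (6 sqrt(mu))."""
    return deviance_residual_poisson(y, mu) + 1.0 / (6.0 * math.sqrt(mu))

r_at_mean = adjusted_deviance_residual_poisson(4.0, 4.0)   # hypothetical y = mu = 4
r_small_mu = adjusted_deviance_residual_poisson(1.0, 1.0)  # adjustment is larger for small mu
```

The adjustment shrinks as the mean grows, which is consistent with it mattering most when the index $m$ is small.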

3.2 Influential Observations

Deletion or perturbation of observations from a given model helps detect those individual observations which may exert influence on the various components of the fitted model. The following approach is described in Pregibon [20]. To see the effect of perturbing an individual observation is to see the effect of its deletion. Pregibon pursues this idea by considering the likelihood

where setting $v_i = 1$ for all $i$ yields the usual likelihood, whereas setting $v_i = 1$ for all $i$ except $i = \ell$ amounts to deleting the $\ell$th observation. Thus, a matrix composed of diagonal

for $0 < v < 1$.

Then the likelihood estimate $\hat{\beta}$ becomes a function of $V$ and is denoted by $\hat{\beta}(v)$. The likelihood equations are


Then Fisher's scoring algorithm for the modified likelihood leads to a new sequence of estimates:

$$\beta^{t+1}(v) = \beta^{t}(v) + (X'W^{1/2}VW^{1/2}X)^{-1}X'V(y - \hat{\mu}). \qquad (3.15)$$

As $v \to 0$, the $\ell$th point has less leverage in the fit. The $\ell$th point is influential if a small value of $v$ yields a large change in $\hat{\beta}(v)$:

measures the impact that the $\ell$th observation exerts on the vector of coefficients in a GLM regression. Plotting the standardized change in coefficients $\Delta_\ell\hat{\beta}_j / \mathrm{s.e.}(\hat{\beta}_j)$ against $\ell$ detects any influential observations for the selected coefficient $\beta_j$. Cook's statistic $c_\ell$ measures the impact of an observation on all the coefficients $\beta$. One convenient way of interpreting $c_\ell$ in a GLM context is by the confidence region displacement for $\hat{\beta}$ due to deleting the $\ell$th observation, namely,


A large $c_\ell$ corresponds to a highly influential $\ell$th observation on the overall fit of the model. By applying a second-order Taylor series expansion to (3.19), the confidence region is generated by the limiting Normal distribution of $\hat{\beta}$. The concept of observation deletions can be extended to perturbations by letting $v_\ell = 0$, so that $\hat{\beta}(0)$ measures the influence that the $\ell$th point exerts on the coefficient estimates $\hat{\beta}$ through $c_\ell$. Then the confidence interval displacement is measured by the one-step approximation to $\hat{\beta}(0)$:

where $\chi_\ell = r_{P\ell}$, (2.43).

3.3 Testing the Goodness-of-Fit

Measuring the goodness-of-fit of a model can be done by calculating the effect of a change in $v$ on the diagnostic measures of the deviance function $D$ and Pearson's statistic $X^2$. In the case of the deviance function, the maximum likelihood estimate should minimize $D$, much like the least squares estimate minimizes the residual sum of squares RSS in a normal-theory linear model. Subsequently, deletion of an observation decreases $D$, like it would decrease RSS in the normal-theory model.

Using the observation count matrix $V$ in the likelihood function yields a deviance

A one-step estimate $\hat{\beta}^1(v)$ and a second-order Taylor series of $D_v(X\hat{\beta}^1(v); y)$ about $\hat{\beta}$ approximate the above quantity:

at $v = 0$, $D_v(X\hat{\beta}^1(v); y)$ is at a minimum of $D(X\hat{\beta}; y) - (\chi_\ell^2 + c_\ell)$, where $c_\ell$ is the change in the confidence interval displacement diagnostic.

The deviance decreases as $v \to 0$.

The rate of change of $D$ due to perturbations is obtained by taking the derivative of (3.22) with respect to $v$.

The change in deviance due to deletion of the $\ell$th point is approximated by:

which are useful for index plotting. The presence of 2 components is a feature found in the one-step approximation, making it a useful diagnostic tool.

Pearson's statistic is not a straightforward measure to interpret, since it does not extend from the normal-theory linear model as does the deviance function. As observations are deleted from a given model, the $X^2$ measure does not necessarily decrease. However, like the RSS, the $X^2$ is the result of a sum of squares of differences between the observed and the fitted values. The one-step approximation to the $X^2$ due to the deletion of the $\ell$th observation is:

In extreme cases, $X^2$ will increase for some observation deletions.

The deviance function and Pearson's $X^2$ goodness-of-fit statistics can be interpreted in two ways:

(1) when the $\ell$th point is not well fit by a given model, i.e. it is an outlier, then a model perturbation caused by $v_\ell$ will be reflected in the single components of $D$ and $X^2$;

(2) when the $\ell$th point is an extreme point in the design matrix, i.e. an influential point, then all the individual components of $D$ and $X^2$ will change.

A change in either the deviance function or Pearson's statistic will not distinguish whether the change comes from (1) or (2). An additional diagnostic measure $h_{\ell j}$ can resolve this problem, where $h_{\ell j}$ is an off-diagonal element in the hat matrix $H$ for the $\ell$th observation with respect to the jth observation, $|h_{\ell j}| \leq \sqrt{h_{\ell\ell}h_{jj}}$. The $h_{\ell j}$'s in combination with the $\chi_\ell$ and $d_\ell$ are useful for measuring how influential the $\ell$th point is on the remaining $(n - 1)$ points.

There are other ways of measuring the goodness-of-fit, such as by investigating the interactions between covariates, or by looking for non-linear effects by adding some terms to a model in the hope of reducing the approximated deviance.

Once a model has been tested for potential outliers and influential observations, and these have been removed from the data, the validity of the link function needs to be checked. Consider a generalized linear model to be fitted with a hypothesized link function $g_0(\mu)$ generated from a class of functions, of which the true and unknown link function $g_*(\mu)$ is also a member. All link functions belonging to a class of functions are indexed by one or more unknown parameters. Plotting a range of fixed parameter values against the corresponding deviances is useful in deciding which range of parameter values is most consistent with the data. The adequacy of the hypothesized link function is examined by expanding and linearizing the link to optimize over the range of parameters. The deviances obtained from fixed parameter values are tested against best-fitting values. This is called the goodness-of-link test.

If a class of link functions is generated by the power family for one parameter $\lambda$, then it is defined either by

$$g(\mu; \lambda) = \frac{\mu^{\lambda} - 1}{\lambda},$$

with limiting value $g(\mu; \lambda) = \log\mu$ as $\lambda \to 0$,

or by

The power family transforms the fitted values $\mu$ in the GLM case. Conversely, the Box-Cox transformation is a power function which transforms the data in a normal linear model.

If a model is fitted with a link function $g_0(\mu)$ when the true link is $g_*(\mu)$, then this can be represented by:

To optimize over $\lambda_*$, one approach is to linearize the power family through a first-order Taylor series expansion about $g_0(\mu)$. Based on the approximate relationship

the true link $g_*(\mu) = X\beta$ is approximated by

where $z = (g'(\mu; \lambda_0))$ and $\gamma = (-\lambda_* + \lambda_0)$.

The hypothesized link function is now modified by the addition of a covariate $\hat{z}$ to the design matrix, and its parameter estimate $\hat{\gamma}$ yields a first-order adjustment to $\lambda_0$. Hence the additional factor in the systematic linear component accounts for local differences between the hypothesized link and the modified one. These differences are measured by a reduction in the deviance. In turn, this reduction serves to test whether $\lambda_0$ is suitable enough for $\lambda_*$:

(e= O.& - p - q ) or x?/(n - p - q ) .

When $g_*(\mu)$ is assumed to be the identity link (i.e. the data are normally distributed), then the approximations made on the $F$ distribution are exact:

The process is repeated to form a new adjusted value for $\lambda_*$ at each iteration until a possible convergence is reached, and then the maximum likelihood estimate of $\lambda_*$ is obtained. If the initial $\lambda_0$ is sufficiently close to $\lambda_*$, convergence is assured. Then the linearization of the power family will yield the true maximum likelihood estimate. The process follows a sequence

which is implemented in the iterations for fitting a generalized linear model.

The link modification method has its limits, in that it is restricted to a specified class of link functions $g$. The most which can be done is to improve an already reasonable fit in order to obtain the true link function. On the other hand, if the hypothesized link is inadequate, then the true link function belongs to another class of link functions altogether. This is attributable to a misspecification of the systematic component of the model.

Consider a model initially fitted with link $g_0(\mu) = X\beta$ to get estimates $\hat{\beta}$ and fitted values $\hat{\eta} = X\hat{\beta}$. Thus $\hat{z} = (g'(\hat{\mu}; \lambda_0))$ can be obtained, and the model is refitted with the extended design matrix now including the covariate $\hat{z}$. In turn,

The sum of squares corresponding to $\hat{\gamma}$ (to test if $\gamma = 0$) is

A parallel reduction in the degrees of freedom and in the deviance from the initial model to the extended one including $\hat{z}$ is produced. This reduction is evaluated by an F-test to decide on the validity of the hypothesized function.

For every parameter added to the power function, an extra covariate is added to the design matrix, which is given by $-\left.\frac{\partial g}{\partial\lambda}\right|_{\lambda = \lambda_0}$. The power family provides link generalizations for the normal distribution with identity link, for the Poisson with log link, for the gamma with reciprocal link and for the inverse Gaussian with $\mu^{-2}$ link. For log-linear data, the power family is defined by the one-parameter link function

As for binomial data, the power family does not apply. Another one-parameter link family is given instead by

As $\lambda \to 0$, the complementary log-log link is generated:

$$\lim_{\lambda \to 0} g(\mu; \lambda) = \log\{-\log(1 - \pi)\},$$

where $\pi = \mu/m$ is the response proportion. It is a two-parameter link family with parameters $\alpha$ and $\delta$ (based on tolerance distributions). This family of functions generates the logit link as the limiting form of $g$:

For this model, the series expansion is

The true link function is approximated by the extended model

The maximum likelihood estimate of $\gamma$ is reached through the iterative process described earlier. A reduction in deviance results from adding the additional factor to the systematic linear component. Finally, an F-test uses the change in deviance to assess whether the estimate of $\gamma$ via $(\alpha_0, \delta_0)$, hence of the link function itself, is

3.5 Software Applications

The software application GLIM ("Generalized Linear Interactive Modelling") was created in the early 1970's for generalized linear model computations, but because one had to have some in-depth knowledge of statistics to use this tool, generalized linear models were not popularized. It took twenty years for generalized linear modelling procedures to become available to everyone through user-"friendly" software applications. In SAS, GLMs can be fitted through the Genmod procedure, and the GEE macro analyzes longitudinal data by using the Generalized Estimating Equation approach. In S-Plus, the StatMod library contains some functions for GLM statistical modelling. R, which is a non-commercial equivalent to S-Plus, can fit GLMs. It shares some libraries with S-Plus which are accessible from the website

LispStat is useful for GLMs and uses some R coding. Matlab uses a module called glmlab to fit GLMs. Another application is Genstat, which is much like GLIM. Some websites offer articles and abstracts on GLMs. The following are only a few websites worth consulting for a start:

• http://www.ams.org/mathscinet/ and

Chapter 4

Numerical Examples

4.1 Introduction

In this chapter, three sets of data are used to illustrate the techniques presented earlier for generalized linear models. The first set of data is assumed to come from the binomial family, the second one from the Poisson family and the third one from the gamma family. In each case, the maximum likelihood fit of the model is provided along with the residual diagnostics. The parameter estimates were obtained through some computer programs created in S-Plus. These programs are provided in Appendix A: see A.1 for binomial data, A.2 for Poisson, and A.3 for gamma data.

4.2 Binomial Data

A study of the effect of a herbicide on the proportion of birth abnormalities was conducted over a time span of one year (see Aitken, Anderson, and Francis, 1989, "Statistical Modelling in GLIM"). The data were recorded on a monthly basis. The birth abnormality proportions are determined by dividing the observed number of birth abnormalities by the total number of births for a given month.

Table 4.1: Number of birth abnormalities out of total births per month for herbicide effect

MONTH   ABNORM.   TOTAL   HERB      MONTH   ABNORM.   TOTAL   HERB
Jan.    10        222     0         July    20        208     788
Feb.    17        221     0         Aug.    17        210     0
Mar.    18        188     0         Sep.    9         198     304
Apr.    11        183     0         Oct.    15        216     5m
May     16        197     1454      Nov.    16        244     0
June    24        218     3280      Dec.    15        218     0

Based on the assumption that the data are binomially distributed and that the logit link is used to fit this model, a combination of graphical and analytical techniques is used to test for any high-leverage or outlying observations. The maximum likelihood estimates for this logistic regression model are calculated using an S-Plus computer program that was created for this purpose. Other pertinent statistics (the adjusted dependent variate, fitted values, variance) are also calculated in an iterative fashion through the S-Plus linear model function (see lm, A.1). The output is presented on the following page in table format. Testing the goodness-of-fit for the current logit model with one explanatory variable accounting for birth abnormalities, the test statistic (2.37) gives $D^* = 8.31 < 18.3 = \chi^2_{10,0.05}$, which implies that the logit model fits the binomially distributed data well at a 5% level of significance. Further, a one-step function based on Pregibon's work [20], which modifies the loglikelihood function, was also developed in S-Plus to determine the effect that each observation exerts on the regression coefficients through model perturbations, up to and including case deletions. I called this function ka0nesteps (see Appendix B). A small change in the coefficients for the $l$th observation means that the observation is non-influential in the model fit.
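The comparison of a deviance with a chi-square critical value can be reproduced without statistical tables. The following is a minimal sketch using only the Python standard library; the function names `chi2_cdf` and `chi2_quantile` are my own, and the series-plus-bisection approach is one simple option rather than the method used in the thesis.

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom,
    using the power series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))

def chi2_quantile(p, df):
    """Smallest x with chi2_cdf(x, df) >= p, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    deviance, df = 8.31, 10          # values quoted in the text
    critical = chi2_quantile(0.95, df)
    print("critical value:", round(critical, 3))
    print("fit judged adequate at 5%:", deviance < critical)
    print("upper tail probability:", round(1.0 - chi2_cdf(deviance, df), 3))
```

The ten degrees of freedom correspond to twelve monthly observations less the two fitted coefficients.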

Data   Fitted Values
10     15-05~x1
17     14.93851
18     12.75041
11     12.41130
16     16.63094
24     24.07666
20     15.89181
17     14.85287
9      14.06200
15     15.94563
16     16.-
15     14.78505

Adjusted Dependent Variable Variance

Figure 4.1: Deviance residuals for birth abnormalities due to herbicide spray exposure

Figure 4.2: $\chi$ residuals for birth abnormalities due to herbicide spray exposure

Figure 4.3: Projection matrix diagonal elements for birth abnormalities due to herbicide spray exposure

Figure 4.4: Standardized change in $\hat{\beta}_0$ for birth abnormalities due to herbicide spray exposure

Figure 4.5: Standardized change in $\hat{\beta}_1$ for herbicide data

According to the deviance residual and the $\chi$ residual index plots, the month of March would indicate that the herbicide spray effect is significantly greater on birth abnormalities than for any other month of the year. The standardized change plots in both the intercept ($\beta_0$) and the herbicide spray exposure variable ($\beta_1$) would also agree that a perturbation or a deletion of the observation for the month of March (i.e. $w = 0.5, 0.2$ or $w = 0$, respectively) would cause a greater standardized change in the regression coefficients than for any other month. Hence, based on these diagnostics, it is likely that the month of March exerts an undue influence on the total number of birth abnormalities.

4.3 Poisson Data

The set of data given here classifies the defects found on furniture from a given manufacturing plant (see Aitken, Anderson, and Francis, [1]). The defects are classified by the type of defect and the production shift. There were a total of n = 309 defects recorded in all, classified as one of four types: A, B, C, D. Each piece of furniture is also classified by one of three production shifts: 1, 2, 3. The contingency table below tabulates these defect counts by type of defect and production shift.

Table 4.2: Contingency table for furniture defect

SHIFT   A    B    C    D
1       15   21   45   13
2       26   31   34   5
3       33   17   49   20

The Poisson distribution model is fitted to the data with the log link. The computer program in Appendix A.2 calculates the MLEs for the GLM log-linear regression. The output is shown in the following table:

Data   Fitted Values / Variance
15     22.51133
21     20.99029
45     38.93851
13     11.55987
26     22.99029
31     21.43689
34     39.76699
5      11.80583
33     28.49838
17     26.57282
49     49.29450
20     14.63430

This model is explained by four levels of defect types and three levels of production shifts. To assess the significance of this log-linear model, the statistics from equations (2.37) and (2.42) are compared to $\chi^2_{6,0.05} = 12.6$. Since $D^* = 20.34$ and $X^2 = 19.14$, it is concluded that the log-linear model does not provide a good fit to the Poisson distributed data at a 5% significance level. In fact, the goodness-of-fit for this model is only significant at the 1% level. The index plots of the deviance residuals, the $\chi$ residuals and the diagonal elements of the projection matrix are based on the fitted log-linear model. Both the 6th and the 8th observations, which correspond to the Type B number of defects and the Type D number of defects respectively, found in the second production shift, are not well fit by the model. In fact, the 8th observation has a very large value. The standardized change in coefficient plots for the intercept ($\beta_0$), the Type B defect variable, and the second production shift variable agree that the 6th observation is causing instability in these coefficients, while the 8th observation is causing instability more so in the Type D defect variable and the second production shift variable. Hence, the standardized change in coefficient plots are in line with the residual and projection matrix index plots.
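The fitted values and both goodness-of-fit statistics can be recomputed directly from the twelve counts. The sketch below assumes the fitted model is the main-effects (independence) log-linear model, whose fitted means have the closed form row total times column total over n, and it assumes the counts are listed shift by shift with types A to D within each shift; the names used are illustrative.

```python
import math

# Defect counts in the order the observations are listed (assumed layout):
# rows are production shifts 1-3, columns are defect types A-D.
COUNTS = [
    [15, 21, 45, 13],
    [26, 31, 34, 5],
    [33, 17, 49, 20],
]

def independence_fit(table):
    """Fitted means of the main-effects log-linear model: row total * column total / n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

def flatten(table):
    return [value for row in table for value in row]

def deviance_residuals(y, mu):
    """Signed square roots of the Poisson unit deviances."""
    out = []
    for yi, mi in zip(y, mu):
        unit = 2.0 * ((yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi))
        out.append(math.copysign(math.sqrt(max(unit, 0.0)), yi - mi))
    return out

def pearson_residuals(y, mu):
    """(y - mu) / sqrt(mu), since the Poisson variance equals the mean."""
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

if __name__ == "__main__":
    y = flatten(COUNTS)
    mu = flatten(independence_fit(COUNTS))
    r_dev = deviance_residuals(y, mu)
    r_chi = pearson_residuals(y, mu)
    print("deviance:", round(sum(r * r for r in r_dev), 2))
    print("Pearson :", round(sum(r * r for r in r_chi), 2))
    for i, (d, c) in enumerate(zip(r_dev, r_chi), start=1):
        print(i, round(d, 3), round(c, 3))
```

Under these assumptions the largest Pearson residual in absolute value belongs to observation 6 and the largest deviance residual to observation 8, the same two cells singled out above.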

Figure 4.6: Deviance residuals for defects found on furniture produced in a certain manufacturing plant

Figure 4.7: $\chi$ residuals for defects found on furniture produced in a certain manufacturing plant

Figure 4.8: Projection matrix diagonal elements for defects found on furniture produced in a certain manufacturing plant

Figure 4.9: Standardized change in $\hat{\beta}_0$ for defects found on furniture produced in a certain manufacturing plant

Figure 4.10: Standardized change in $\hat{\beta}_1$ for furniture damage data

Figure 4.11: Standardized change in $\hat{\beta}_2$ for furniture damage data

Figure 4.13: Standardized change in $\hat{\beta}_4$ for furniture damage data

Figure 4.14: Standardized change in $\hat{\beta}_5$ for furniture damage data

4.4 Gamma Data

The next set of data is taken from McCullagh and Nelder, 1989, "Generalized Linear Models", p.300. They describe blood clotting times, in seconds, for normal plasma diluted at nine different percentage concentrations (X) with a prothrombin-free agent. The blood clotting is induced by two lots of thromboplastin. Bliss (1970) fitted a hyperbolic model by using an inverse transformation of the data to the first lot only. Here, the data are assumed to follow a gamma distribution with the inverse link applied to each lot separately, since some initial plots indicate that the two intercepts and slopes are different for the two lots. Some of the output from the program in Appendix A.3 is shown below.

Table 4.3: Blood clotting times in seconds for 9 percentage concentrations of plasma and for 2 lots

% CONCENTRATION
LOT 1   118   58   42   35   27   25   21   19   18
LOT 2   69    35   26   21   18   16   13   12   12

Data (lot 1)    118     58      42   35      27      25     21      19      18
Data (lot 2)    69      35      26   21      18      16     13      12      12
Fitted Values   71.06   32.86   25   21.37   17.74   15.M   13.75   12.58   11.8

At the 0.05 level of significance, the 95th percentile of the $\chi^2_7$ distribution is 14.1. The value obtained through (2.37) is much less than that: $D^* = 0.017$ for lot 1 and $D^* = 0.013$ for lot 2. Thus, the gamma distribution provides a good model fit to the blood clotting times for both lots.
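The gamma fit with the inverse link can be sketched with a few lines of iteratively reweighted least squares. This is an illustration under stated assumptions, not the program of Appendix A.3: the concentrations 5, 10, 15, 20, 30, 40, 60, 80, 100 are taken from McCullagh and Nelder because they are not legible in Table 4.3, the covariate is assumed to be the logarithm of the concentration, and the link is taken as $\eta = 1/\mu$.

```python
import math

# Assumed percentage concentrations (McCullagh and Nelder, 1989); not legible in Table 4.3.
CONC = [5, 10, 15, 20, 30, 40, 60, 80, 100]
LOT1 = [118, 58, 42, 35, 27, 25, 21, 19, 18]
LOT2 = [69, 35, 26, 21, 18, 16, 13, 12, 12]

def wls_line(x, z, w):
    """Weighted least squares estimates (b0, b1) for z = b0 + b1 * x."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sz = sum(wi * zi for wi, zi in zip(w, z))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    b1 = (sw * sxz - sx * sz) / (sw * sxx - sx * sx)
    b0 = (sz - b1 * sx) / sw
    return b0, b1

def fit_gamma_inverse(x, y, max_iter=50, tol=1e-10):
    """IRLS for a gamma GLM with link eta = 1/mu and one covariate."""
    mu = [float(v) for v in y]              # start from the data
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # Working weights mu^2 and adjusted dependent variate z = eta - (y - mu) / mu^2.
        w = [m * m for m in mu]
        z = [1.0 / m - (yi - m) / (m * m) for yi, m in zip(y, mu)]
        new_b0, new_b1 = wls_line(x, z, w)
        mu = [1.0 / (new_b0 + new_b1 * xi) for xi in x]
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change < tol:
            break
    return b0, b1, mu

def gamma_deviance(y, mu):
    """Unscaled gamma deviance: 2 * sum(-log(y/mu) + (y - mu)/mu)."""
    return 2.0 * sum(-math.log(yi / m) + (yi - m) / m for yi, m in zip(y, mu))

if __name__ == "__main__":
    x = [math.log(u) for u in CONC]
    for name, y in (("lot 1", LOT1), ("lot 2", LOT2)):
        b0, b1, mu = fit_gamma_inverse(x, y)
        print(name, round(b0, 5), round(b1, 5), round(gamma_deviance(y, mu), 4))
        print([round(m, 2) for m in mu])
```

With these assumptions the lot 2 fit begins near 71.06, 32.86 and 25, which suggests that the fitted values printed above belong to lot 2.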

In the graphs that follow, some diagnostic tools are used to assess which observations exert some influence on the fitted model for lot 1. The first two index plots agree that observation 2, which is the 10% concentration of the prothrombin-free agent, is not well fitted by the inverse model of the blood clotting times. However, the two standardized change in coefficient plots for the intercept ($\beta_0$) and the percentage of agent concentration ($\beta_1$) agree that the 5% concentration level is greatly influential on the model fit, depending on the level of perturbation ($w = 0.5, 0.2$) or on a case deletion ($w = 0$) altogether.

Figure 4.15: Deviance residuals for lot 1 of bloodclot time

Figure 4.16: $\chi$ residuals for lot 1 of bloodclot time

Figure 4.17: Projection matrix diagonal elements for lot 1 of bloodclot time

Figure 4.18: Standardized change in $\hat{\beta}_0$ for lot 1 of bloodclot time

Figure 4.19: Standardized change in $\hat{\beta}_1$ for lot 1 of bloodclot time

4.5 Conclusion

The diagnostic measures developed through the one-step function provide an effective coarse device to modify the loglikelihood function which is not too time consuming. In fact, the one-step function presents an adequate way of detecting and quantifying the effect of outlying observations and extreme points for GLMs. It is noteworthy that for logistic regression, the Hauck-Donner phenomenon can occur (see [27], p.225). When the $\hat{\beta}_j$ are large, the t statistic goes to zero. This implies that highly significant $\hat{\beta}_j$ may have non-significant t ratios. For example, when dealing with fitted values that are very close to either one or zero, a dual conflict of the Hauck-Donner phenomenon and convergence problems may arise. This can be seen when dealing with a very large dataset of, say, 1000 observations and about five binary explanatory variables, whereby one of the covariates is always one to confirm the presence of a disease, for example. Then the resulting fitted probabilities with respect to that covariate must necessarily be one, and hence its associated regression coefficient is $\hat{\beta}_j = \infty$. This in turn implies that the maximum likelihood estimates do not exist.
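The Hauck-Donner behaviour can be seen in the simplest logistic regression, one binary covariate, where the maximum likelihood slope is the log odds ratio of a 2x2 table and its standard error has a closed form. The counts below are hypothetical and chosen only to push the fitted probability in the exposed group towards one; the function name is illustrative.

```python
import math

def wald_for_binary_covariate(succ0, n0, succ1, n1):
    """Slope, standard error and Wald ratio for logistic regression on one binary covariate.
    The slope is the log odds ratio; the variance is the sum of reciprocal cell counts."""
    a, b = succ1, n1 - succ1      # covariate = 1: successes, failures
    c, d = succ0, n0 - succ0      # covariate = 0: successes, failures
    beta = math.log(a / b) - math.log(c / d)
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return beta, se, beta / se

if __name__ == "__main__":
    # Hypothetical data: 5000 of 10000 successes without the covariate,
    # and an increasing number of successes out of 10000 with it.
    for succ1 in (9000, 9900, 9990, 9999):
        beta, se, z = wald_for_binary_covariate(5000, 10000, succ1, 10000)
        print(succ1, round(beta, 3), round(se, 4), round(z, 2))
```

As the success count approaches the group size, the slope estimate keeps growing while the Wald ratio shrinks, because the standard error grows faster than the estimate. When every exposed case is a success the log odds ratio is infinite and the estimate no longer exists.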

Since the generalized linear models are members of the exponential family of distributions, the computations and diagnostic measures described here can be extended to a greater scope to lead to applications in time series models and survival models. Some research work on diagnostic measures for survival models has been carried out by D. Pregibon.

Appendix A

Programs for Parameter Estimation for Different Families

A.1 MLE program for binomial family

# Binomial data program: sufficient statistic is the proportion of y to m
# for the ith observation.
#
muhatbin <- function(y, m) {
    muhat <- rep(NA, length(y))
    for (i in 1:length(y)) {
        if (y[i]/m[i] == 0 || y[i]/m[i] == 1) {
            muhat[i] <- (y[i] + 0.5)/(m[i] + 1)
        }
        else {
            muhat[i] <- y[i]/m[i]
        }
    }
    muhat
}

# Gather all information (z - adjusted dependent variate, X - covariates,
# V - weightvalue) in a dataframe - GenDataFrame.
#
GenDataFrame <- function(zValue, X, weightValue, nRows) {
    # Generate matdata.
    matdata <- data.frame(zValue, X[1, ])
    if (nRows >= 2) {
        for (j in 2:nRows) {
            matdata <- data.frame(matdata, X[j, ])
        }
    }
    matdata <- data.frame(matdata, weightValue)
}

# Purpose of this function is to create and execute the command:
# betaValue <- lm(zValue ~ X[1,]+X[2,]+X[3,]+etc..., matdataValue)$coefficients
# through concatenation of each covariate Xi.
#
Genlm <- function(zValue, X, matdataValue, nRows, weightValue) {
    # cat-file & parse file to generate Beta0
    cat("betaValue <- lm(zValue ~ X[1,]", file="tmp.1")
    if (nRows >= 2) {
        for (i in 2:nRows) {
            cat("+X[", file="tmp.1", append=T)
            cat(i, file="tmp.1", append=T)
            cat(",]", file="tmp.1", append=T)
        }
    }
    # Closing part of the command, as in the listing of A.3.
    cat(", weights=weightValue)$coefficients", file="tmp.1", "\n", append=T)
    # Now execute the created command
    eval(parse(file="tmp.1"), local=T)
}

# Purpose of this function is to create and execute the command:
# etahat <- betaValue[1] + betaValue[2]*X[1,] + betaValue[3]*X[2,] +
# betaValue[4]*X[3,] + etc.
#
Genetahat <- function(betaValue, X, nRows) {
    # Need a for loop to generate the required command.
    cat("etahat <- betaValue[1]", file="tmp.1")
    for (k in 1:nRows) {
        cat("+ betaValue[", file="tmp.1", append=T)
        cat(k+1, file="tmp.1", append=T)
        cat("]*X[", file="tmp.1", append=T)
        cat(k, file="tmp.1", append=T)
        cat(",]", file="tmp.1", append=T)
    }
    # Now execute the created command.
    eval(parse(file="tmp.1"))
}

# This part is made to measure for binomial data - need to extract
# pertinent statistics.
#
iterbin <- function(y, X, m, itmax=50) {
    # Find out how many Xi's, by the length of a column.
    nRows <- length(X[,1])
    # Starting values, following the same pattern as iterpoi in A.2.
    muhat0 <- muhatbin(y, m)
    etahat0 <- log(muhat0/(1 - muhat0))
    weight0 <- rep(1, length(y))
    z0 <- etahat0
    # Generate matdata.
    matdata <- GenDataFrame(z0, X, weight0, nRows)
    beta0 <- Genlm(z0, X, matdata, nRows, weight0)
    n <- 0
    for (i in 1:itmax) {
        n <- n+1
        etahat <- Genetahat(beta0, X, nRows)
        muhat <- exp(etahat)/(1 + exp(etahat))
        weight <- m*muhat*(1-muhat)
        z <- etahat + m*((y/m)-muhat)/weight
        matdata <- GenDataFrame(z, X, weight, nRows)
        beta <- Genlm(z, X, matdata, nRows, weight)
        if (sum(abs(beta-beta0)) <= 10^(-10)) {
            return(list("Pass"=T, coefficients=beta, fittedValues=m*muhat,
                        adjustedValue=z, Variance=weight, iterations=n))
        }
        beta0 <- beta
    }
    list("Pass"=F, coefficients=beta, iterations=n)
}

A.2 MLE program for Poisson family

# Poisson data program: sufficient statistic is the mean of y
# for the ith observation.
# y is vector of the sum of counts
muhatpoi <- function(y, m) {
    y/m
}

# Gather all information (z - adjusted dependent variate, X - covariates,
# V - weightvalue) in a dataframe - GenDataFrame.
#
GenDataFrame <- function(zValue, X, weightValue, nRows) {
    # Generate matdata.
    matdata <- data.frame(zValue, X[1, ])
    if (nRows >= 2) {
        for (j in 2:nRows) {
            matdata <- data.frame(matdata, X[j, ])
        }
    }
    matdata <- data.frame(matdata, weightValue)
}

# Purpose of this function is to create and execute the command:
# betaValue <- lm(zValue ~ X[1,]+X[2,]+X[3,]+etc..., matdataValue)$coefficients
#
Genlm <- function(zValue, X, matdataValue, nRows, weightValue) {
    # cat-file & parse file to generate Beta0
    cat("betaValue <- lm(zValue ~ X[1,]", file="tmp.1")
    if (nRows >= 2) {
        for (i in 2:nRows) {
            cat("+X[", file="tmp.1", append=T)
            cat(i, file="tmp.1", append=T)
            cat(",]", file="tmp.1", append=T)
        }
    }
    # Closing part of the command, as in the listing of A.3.
    cat(", weights=weightValue)$coefficients", file="tmp.1", "\n", append=T)
    # Now execute the created command
    eval(parse(file="tmp.1"), local=T)
}

# Purpose of this function is to create and execute the command:
# etahat <- betaValue[1] + betaValue[2]*X[1,] + betaValue[3]*X[2,] +
# betaValue[4]*X[3,] + etc
#
Genetahat <- function(betaValue, X, nRows) {
    # Need a for loop to generate the required command.
    cat("etahat <- betaValue[1]", file="tmp.1")
    for (k in 1:nRows) {
        cat("+ betaValue[", file="tmp.1", append=T)
        cat(k+1, file="tmp.1", append=T)
        cat("]*X[", file="tmp.1", append=T)
        cat(k, file="tmp.1", append=T)
        cat(",]", file="tmp.1", append=T)
    }
    # Now execute the created command.
    eval(parse(file="tmp.1"))
}

# This part is made to measure for Poisson data - need to extract
# pertinent statistics.
#
iterpoi <- function(y, X, m, itmax=100) {
    # Find out how many Xi's, by the length of a column.
    nRows <- length(X[,1])
    etahat0 <- log(muhatpoi(y, m))
    weight0 <- rep(1, length(y))
    z0 <- etahat0
    # Generate matdata.
    matdata <- GenDataFrame(z0, X, weight0, nRows)
    # cat-file & parse file to generate Beta0
    # beta0 <- lm(z0 ~ X[1,]+X[2,]+X[3,]+etc..., matdata)$coefficients
    beta0 <- Genlm(z0, X, matdata, nRows, weight0)
    h <- 0
    for (i in 1:itmax) {
        h <- h+1
        # etahat <- beta0[1] + beta0[2]*X[1,] + beta0[3]*X[2,] + beta0[4]*X[3,]
        etahat <- Genetahat(beta0, X, nRows)
        muhat <- exp(etahat)
        weight <- muhat
        z <- etahat + (y-muhat)/weight
        # Generate matdata
        matdata <- GenDataFrame(z, X, weight, nRows)
        beta <- Genlm(z, X, matdata, nRows, weight)
        # Convergence check, following the same pattern as iterbin in A.1.
        if (sum(abs(beta-beta0)) <= 10^(-10)) {
            return(list("Pass"=T, coefficients=beta, fittedValues=muhat,
                        adjustedValue=z, Variance=weight, iterations=h))
        }
        beta0 <- beta
    }
    list("Pass"=F, coefficients=beta, iterations=h)
}

A.3 MLE program for Gamma family

# Gamma data program: sufficient statistic is the mean of y
# for the ith observation.
#
muhatgam <- function(y, m) {
    muhat <- rep(NA, length(y))
    for (i in 1:length(y)) {
        muhat[i] <- y[i]/m[i]
    }
    muhat
}

# Gather all information (z: adjusted dependent variate, X: covariates,
# V: weightvalue) in a dataframe - GenDataFrame.
#
GenDataFrame <- function(zValue, X, weightValue, nRows) {
    # Generate matdata.
    matdata <- data.frame(zValue, X[1, ])
    if (nRows >= 2) {
        for (j in 2:nRows) {
            matdata <- data.frame(matdata, X[j, ])
        }
    }
    matdata <- data.frame(matdata, weightValue)
}

# Purpose of this function is to create and execute the command:
# betaValue <- lm(zValue ~ X[1,]+X[2,]+X[3,]+etc..., matdataValue)$coefficients
#
Genlm <- function(zValue, X, matdataValue, nRows, weightValue) {
    # cat-file & parse file to generate Beta0
    cat("betaValue <- lm(zValue ~ X[1,]", file="tmp.1")
    if (nRows >= 2) {
        for (i in 2:nRows) {
            cat("+X[", file="tmp.1", append=T)
            cat(i, file="tmp.1", append=T)
            cat(",]", file="tmp.1", append=T)
        }
    }
    cat(", weights=weightValue)$coefficients", file="tmp.1", "\n", append=T)
    # Now execute the created command
    eval(parse(file="tmp.1"), local=T)
}

# Purpose of this function is to create and execute the command:
# etahat <- betaValue[1] + betaValue[2]*X[1,] + betaValue[3]*X[2,] +
# betaValue[4]*X[3,] + etc
#
Genetahat <- function(betaValue, X, nRows) {
    # Need a for loop to generate the required command.
    cat("etahat <- betaValue[1]", file="tmp.1")
    for (k in 1:nRows) {
        cat("+ betaValue[", file="tmp.1", append=T)
        cat(k+1, file="tmp.1", append=T)
        cat("]*X[", file="tmp.1", append=T)
        cat(k, file="tmp.1", append=T)
        cat(",]", file="tmp.1", append=T)
    }
    # Now execute the created command.
    eval(parse(file="tmp.1"))
}

# This part is made to measure for gamma data - need to extract pertinent
# statistics.
#
itergam <- function(y, X, m, itmax=50) {
    # Find out how many Xi's, by the length of a column.
    nRows <- length(X[,1])
    # Starting values, following the same pattern as iterpoi in A.2.
    etahat0 <- inverse(muhatgam(y, m))
    weight0 <- rep(1, length(y))
    z0 <- etahat0
    # Generate matdata.
    matdata <- GenDataFrame(z0, X, weight0, nRows)
    # cat-file & parse file to generate Beta0
    # beta0 <- lm(z0 ~ X[1,]+X[2,]+X[3,]+etc..., matdata)$coefficients
    beta0 <- Genlm(z0, X, matdata, nRows, weight0)
    h <- 0
    for (i in 1:itmax) {
        h <- h+1
        # etahat <- beta0[1] + beta0[2]*X[1,] + beta0[3]*X[2,] + beta0[4]*X[3,]
        etahat <- Genetahat(beta0, X, nRows)
        muhat <- inverse(etahat)
        weight <- muhat^2
        z <- etahat + (y-muhat)/weight
        # Generate matdata
        matdata <- GenDataFrame(z, X, weight, nRows)
        beta <- Genlm(z, X, matdata, nRows, weight)
        # Convergence check, following the same pattern as iterbin in A.1.
        if (sum(abs(beta-beta0)) <= 10^(-10)) {
            return(list("Pass"=T, coefficients=beta, fittedValues=muhat,
                        adjustedValue=z, Variance=weight, iterations=h))
        }
        beta0 <- beta
    }
    list("Pass"=F, coefficients=beta, iterations=h)
}

for (i in 1:dimen) {
    W <- diag(rep(1, dimen))
    W[i,i] <- 0
    temp <- onestep(X, V, W, z, i)
}

Appendix B

B.1 Output for the Herbicide data

B.2 Output for One-Step Function using the Herbicide data

Bibliography

[1] Aitken, M., Anderson, D. and Francis, B. (1989). Statistical Modelling in GLIM. Oxford University Press, New York.

[2] Anscombe, F.J. (1948). The Transformation of Poisson, Binomial, Negative Binomial data. Biometrika, 35 246-254.

[3] Chaubey, Y.P. and Mudholkar, G.S. (IW). On the Symmetrizing Transformations of random variables. Paper unpublished.

[4] Cook, R.D. and Weisberg, S. (1982). Residuals and Influence in Regression. Wiley, New York.

[5] Cox, D.R., Hinkley, D.V., Reid, N. and Snell, E.J. (1991). Statistical Theory and Modelling: In Honour of Sir David Cox, FRS. Chapman and Hall, London; New York.

[6] Davison, A.C. and Gigli, A. (1989). Deviance Residuals and Normal Scores Plots. Biometrika, 76 211-221.

[7] Draper, N.R. and Smith, H. (1981). Applied Regression Analysis. Second ed. Wiley, New York.

[8] Firth, D. (1988). Multiplicative Errors: Log-Normal or Gamma? J. R. Statist. Soc. B, 50 266-268.

[9] Green, P.J. (1984). Iteratively Reweighted Least Squares for Maximum Likelihood Estimation and some Robust and Resistant Alternatives. J. R. Statist. Soc., 46 149-192.

[10] Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models, A roughness penalty approach. Chapman and Hall, London.

[11] Hoaglin, D.C. and Welsch, R.E. (1978). The Hat Matrix in Regression and ANOVA. Amer. Statistician, 32 17-22.

[12] Lindsey, J.K. (1997). Applying Generalized Linear Models. Springer-Verlag, New York.

[13] Mathai, A.M. and Provost, S. (1992). Quadratic forms in random variables: theory and application. Dekker, New York.

[14] Mathsoft (1997). S-PLUS Programmer's Guide, Data Analysis Products Division, Seattle, WA.

[15] McCullagh, P. (1985). On the Asymptotic Distribution of Pearson's Statistic in Linear Exponential Family Models. International Statistical Review, 53 61-67.

[16] McCullagh, P. and Nelder, J.A. (1983). Generalized Linear Models. Chapman and Hall, London.

[17] McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, Second ed. Chapman and Hall, London.

[18] Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized Linear Models. J. R. Statist. Soc. A, 135 370-383.

[19] Pregibon, D. (1980). Goodness of Link Tests for Generalized Linear Models. Appl. Statist., 29 15-24.

[20] Pregibon, D. (1981). Logistic Regression Diagnostics. The Annals of Statistics, 9 705-724.

[21] Pierce, D.A. and Schafer, D.W. (1986). Residuals in Generalized Linear Models. JASA, 396 977-986.

[22] Rao, C.R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.

[23] Searle, S.R. (1971). Linear Models. Wiley, New York.

[24] Seber, G.A.F. (1977). Linear Regression Analysis. Wiley, New York.

[25] Seber, G.A.F. and Wild, C.J. (1989). Nonlinear Regression. Wiley, New York.

[26] Spector, P. (1994). An Introduction to S and S-Plus, Duxbury Press.

[27] Venables, W.N. and Ripley, B.D. (1999). Modern Applied Statistics with S-PLUS, Third ed. Springer-Verlag, New York.

[28] Williams, D.A. (1987). Generalized Linear Model Diagnostics Using the Deviance and Single Case Deletions. Appl. Statist., 36 181-191.

[29] Zellner, A. (1976). Bayesian and Non-Bayesian Analysis of the Regression Model with Multivariate Student-t Error Terms. JASA, 354 400-405.
