Date post: | 14-Apr-2017 |
Upload: | sylvia-cheung |
1. Introduction   2. Two-way classification and PROC GENMOD
3. Three-way classification   4. Class exercises
CHAPTER 2: BINARY LOGIT ANALYSIS OF CONTINGENCY TABLES
Prof. Alan Wan
Table of contents
1. Introduction
2. Two-way classification and PROC GENMOD
   2.1. PROC GENMOD: frequency weight syntax
   2.2. PROC GENMOD: event/trial syntax
3. Three-way classification
4. Class exercises
Introduction
- Contingency table: a table that cross-classifies observations by two or more variables; the purpose is to determine whether these variables are related;
- Here is an example:

                                 Annual changes in stock prices
                                 Up           Down         Total
    January changes     Up       22 (16.1)    1 (6.9)      23
    in stock prices     Down     6 (11.9)     11 (5.1)     17
                        Total    28           12           40
- A table of this sort can be used to test whether, as some financial analysts suggest, the January movement is a good predictor of whether stock prices will go up or down over the entire year; i.e., we can test
  H0: whether or not stock prices go up over the entire year is the same regardless of their behaviour in January (independence), vs.
  H1: otherwise;
- Expected frequencies (under H0) are shown in parentheses in the table.
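- The expected frequencies under H0 are calculated as follows:
$$16.1 = \frac{28}{40} \times 23; \quad 6.9 = \frac{12}{40} \times 23; \quad 11.9 = \frac{28}{40} \times 17; \quad 5.1 = \frac{12}{40} \times 17;$$
- Why? Take 16.1 as an example;
- Note that $\Pr(\mathrm{Up}_Y \cap \mathrm{Up}_J) = \Pr(\mathrm{Up}_Y \mid \mathrm{Up}_J)\Pr(\mathrm{Up}_J)$;
- But under independence (H0), $\Pr(\mathrm{Up}_Y \mid \mathrm{Up}_J) = \Pr(\mathrm{Up}_Y)$. Hence
$$\Pr(\mathrm{Up}_Y \cap \mathrm{Up}_J) = \Pr(\mathrm{Up}_Y)\Pr(\mathrm{Up}_J) = \frac{28}{40} \cdot \frac{23}{40} = \frac{16.1}{40},$$
so that, out of the 40 observations, the expected number with both movements up is 16.1.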
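The same argument applies to every cell of the table. As a sketch in general notation (the symbols $R_i$, $C_j$, $n$ and $E_{ij}$ are introduced here for illustration and do not appear in the slides), write $R_i$ for the total of row $i$, $C_j$ for the total of column $j$ and $n$ for the grand total:

```latex
E_{ij} = n \cdot \frac{R_i}{n} \cdot \frac{C_j}{n} = \frac{R_i \, C_j}{n},
\qquad \text{e.g. } E_{11} = \frac{23 \times 28}{40} = 16.1 .
```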
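- This test can be conducted using the usual Pearson's Chi-square statistic:
$$\text{Pearson's } \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{(r-1)(c-1)},$$
where $O_i$ and $E_i$ are the observed and expected frequencies of cell $i$, $n$ is the number of cells, and $r$ and $c$ are the numbers of rows and columns in the table respectively;
- For this example,
$$\sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i} = \frac{(22-16.1)^2}{16.1} + \frac{(1-6.9)^2}{6.9} + \frac{(6-11.9)^2}{11.9} + \frac{(11-5.1)^2}{5.1} = 16.96;$$
- Now, $\chi^2_{1,0.05} = 3.84$. Hence we reject H0 and conclude that stock price movements during the whole year are not independent of their movements in January of that year.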
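The critical value and the p-value can also be obtained in SAS rather than from tables. The following is a minimal sketch, assuming the built-in CINV (chi-square quantile) and PROBCHI (chi-square distribution function) functions; the variable names crit and pvalue are illustrative only.

```sas
data _null_;
  crit   = cinv(0.95, 1);           /* 5% critical value, chi-square with 1 df */
  pvalue = 1 - probchi(16.96, 1);   /* upper-tail probability of the observed statistic */
  put crit= pvalue=;                /* results are written to the SAS log */
run;
```

H0 is rejected at the 5% level when the statistic exceeds crit or, equivalently, when pvalue is below 0.05.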
    data stock;
      input f yp jp;
      datalines;
    22 1 1
    6 1 0
    1 0 1
    11 0 0
    ;
    proc freq data=stock;
      weight f;
      tables yp*jp / chisq cmh;
    run;

    Statistics for Table of yp by jp

    Statistic                       DF    Value      Prob
    Chi-Square                      1     16.9577    <.0001
    Likelihood Ratio Chi-Square     1     18.5678    <.0001
    Continuity Adj. Chi-Square      1     14.2053    0.0002
    Mantel-Haenszel Chi-Square      1     16.5338    <.0001
    Phi Coefficient                       0.6511
    Contingency Coefficient               0.5456
    Cramer's V                            0.6511
PROC GENMOD: frequency weight syntax
- Consider the penalty data of Chapter 1. Suppose individual data are unavailable and all we have is the following table:

               Blacks    Non-blacks    Total
    Death      28        22            50
    Life       45        52            97
    Total      73        74            147

- The logit model regressing DEATH on BLACKD, with the data contained in a contingency table, can be estimated using PROC GENMOD;
- One way to invoke PROC GENMOD is to use the FREQ statement, which treats each record as if it were replicated the number of times given by the frequency variable, thereby converting the data into individual format.
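    /* One record per cell: F = cell count, BLACKD = 1 for blacks, DEATH = 1 for a death sentence */
    DATA CONT1;
      INPUT F BLACKD DEATH;
      DATALINES;
    22 0 1
    28 1 1
    52 0 0
    45 1 0
    ;
    /* DESCENDING makes GENMOD model Pr(DEATH=1); D=B requests the binomial distribution */
    PROC GENMOD DATA=CONT1 DESCENDING;
      FREQ F;
      MODEL DEATH=BLACKD / D=B;
    RUN;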
    The GENMOD Procedure

    Model Information
    Data Set                      WORK.CONT1
    Distribution                  Binomial
    Link Function                 Logit
    Dependent Variable            DEATH
    Frequency Weight Variable     F
    Observations Used             4
    Sum Of Frequency Weights      147

    Response Profile
    Ordered Value    DEATH    Total Frequency
    1                1        50
    2                0        97

    PROC GENMOD is modeling the probability that DEATH='1'.
    Criteria For Assessing Goodness Of Fit
    Criterion             DF     Value        Value/DF
    Deviance              145    187.2704     1.2915
    Scaled Deviance       145    187.2704     1.2915
    Pearson Chi-Square    145    147.0000     1.0138
    Scaled Pearson X2     145    147.0000     1.0138
    Log Likelihood               -93.6352

    Algorithm converged.

    Analysis Of Parameter Estimates
                                 Standard    Wald 95% Confidence    Chi-
    Parameter    DF   Estimate   Error       Limits                 Square    Pr > ChiSq
    Intercept    1    -0.8602    0.2543      -1.3587    -0.3617     11.44     0.0007
    BLACKD       1    0.3857     0.3502      -0.3006    1.0721      1.21      0.2706
    Scale        0    1.0000     0.0000      1.0000     1.0000

    NOTE: The scale parameter was held fixed.
- As far as PROC GENMOD is concerned, only 4 observations have been read in;
- The actual number of observations, namely 147, is the sum of the frequencies. The FREQ statement expands the 4 records into the 147 individual observations used for ML estimation.
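What the FREQ statement does conceptually can be made explicit by expanding the table by hand. The following is a minimal sketch, not the procedure's internal mechanism; the data set name INDIV and the loop counter K are hypothetical names introduced for illustration.

```sas
DATA CONT1;
  INPUT F BLACKD DEATH;
  DATALINES;
22 0 1
28 1 1
52 0 0
45 1 0
;
/* Write each cell record F times to obtain one row per individual */
DATA INDIV;
  SET CONT1;
  DO K = 1 TO F;
    OUTPUT;
  END;
  DROP K F;
RUN;
/* No FREQ statement is needed once the data are in individual format */
PROC GENMOD DATA=INDIV DESCENDING;
  MODEL DEATH = BLACKD / D=B;
RUN;
```

Fitting the model to INDIV should reproduce the parameter estimates obtained with FREQ F, with 147 reported as the number of observations used instead of 4.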
- The Deviance statistic is a LR test of whether there is a significant difference between the "estimated" (restricted) and "saturated" (unrestricted) models;
- The statistic is
$$\text{Deviance} = 2[\ln L(\hat\beta_S) - \ln L(\hat\beta_E)] \sim \chi^2_m,$$
where $m$ is the difference in the number of parameters between the saturated and the estimated models;
- The saturated model is a model in which the number of unknown parameters equals the number of observations;
- Hence for a model estimated from individual data, there are $n$ observations for $n$ unknowns, so the saturated model fits every observation perfectly, resulting in $L(\hat\beta_S) = 1$, $\ln L(\hat\beta_S) = 0$ and $\text{Deviance} = -2\ln L(\hat\beta_E)$.
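Applied to the frequency-weight output reported earlier (log likelihood $-93.6352$, 147 observations, 2 estimated parameters), this formula reproduces the Deviance and its degrees of freedom. The lines below are a worked restatement of those reported figures, not additional output:

```latex
\text{Deviance} = -2 \times (-93.6352) = 187.2704,
\qquad m = 147 - 2 = 145 .
```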
PROC GENMOD: event/trial syntax
- Instead of entering all 4 internal cell counts, we enter the cell frequencies for death sentences ("events") along with the column totals ("trials"):

    /* One record per column of the table: DEATH = events, TOTAL = trials */
    DATA CONT1;
      INPUT DEATH TOTAL BLACKD;
      DATALINES;
    22 74 0
    28 73 1
    ;
    PROC GENMOD DATA=CONT1;
      MODEL DEATH/TOTAL=BLACKD / D=B;
    RUN;
    The GENMOD Procedure

    Model Information
    Data Set                      WORK.CONT1
    Distribution                  Binomial
    Link Function                 Logit
    Response Variable (Events)    DEATH
    Response Variable (Trials)    TOTAL
    Observations Used             2
    Number Of Events              50
    Number Of Trials              147

    Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value       Value/DF
    Deviance              0     0.0000      .
    Scaled Deviance       0     0.0000      .
    Pearson Chi-Square    0     0.0000      .
    Scaled Pearson X2     0     0.0000      .
    Log Likelihood              -93.6352

    Algorithm converged.

    Analysis Of Parameter Estimates
                                 Standard    Wald 95% Confidence    Chi-
    Parameter    DF   Estimate   Error       Limits                 Square    Pr > ChiSq
    Intercept    1    -0.8602    0.2543      -1.3587    -0.3617     11.44     0.0007
    BLACKD       1    0.3857     0.3502      -0.3006    1.0721      1.21      0.2706
    Scale        0    1.0000     0.0000      1.0000     1.0000

    NOTE: The scale parameter was held fixed.
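- With frequency weight syntax:
$$L = \prod_{i=1}^{147} \left[\frac{1}{1 + e^{-(\beta_1 + \beta_2 \text{BLACKD}_i)}}\right]^{\text{DEATH}_i} \times \left[1 - \frac{1}{1 + e^{-(\beta_1 + \beta_2 \text{BLACKD}_i)}}\right]^{1 - \text{DEATH}_i}$$
- With event/trial syntax:
$$L = \left\{\left[\frac{1}{1 + e^{-(\beta_1 + \beta_2 (\text{BLACKD}=0))}}\right]^{22} \left[1 - \frac{1}{1 + e^{-(\beta_1 + \beta_2 (\text{BLACKD}=0))}}\right]^{52}\right\} \times \left\{\left[\frac{1}{1 + e^{-(\beta_1 + \beta_2 (\text{BLACKD}=1))}}\right]^{28} \left[1 - \frac{1}{1 + e^{-(\beta_1 + \beta_2 (\text{BLACKD}=1))}}\right]^{45}\right\}$$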
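As a numerical check on the second form, note that with two parameters and two groups the ML fitted probabilities coincide with the observed proportions. The symbols $\hat p_0$ and $\hat p_1$ (fitted probabilities for BLACKD=0 and BLACKD=1) are introduced here for illustration only. Evaluating the log of the likelihood at these values recovers the log likelihood reported by PROC GENMOD:

```latex
\begin{aligned}
\hat p_0 &= \tfrac{22}{74}, \qquad \hat p_1 = \tfrac{28}{73}, \\
\ln L &= 22 \ln \hat p_0 + 52 \ln(1 - \hat p_0) + 28 \ln \hat p_1 + 45 \ln(1 - \hat p_1) \\
      &\approx -26.6865 - 18.3467 - 26.8311 - 21.7709 \\
      &\approx -93.6352 .
\end{aligned}
```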
- The two likelihood functions are of course algebraically identical, but PROC GENMOD treats the first likelihood as being based on 147 Bernoulli(p) observations, and the second as being based on 2 observations, each being a product of Bernoulli(p) densities sharing a common value of BLACKD, namely BLACKD=0 for the first observation and BLACKD=1 for the second;
- Under the event/trial syntax, there are 2 observations for estimating 2 parameters. Hence the estimated model is the saturated model, resulting in a Deviance statistic of 0;
- The Deviance statistic therefore carries no useful information for a two-way cross-classification.
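Because the event/trial model is saturated, its fitted odds equal the observed odds in each column of the table, so the estimates can be recovered by hand. The following worked check uses only the cell counts given earlier and agrees with the reported estimates of $-0.8602$ and $0.3857$:

```latex
\hat\beta_1 = \ln\frac{22}{52} \approx -0.8602,
\qquad
\hat\beta_2 = \ln\frac{28}{45} - \ln\frac{22}{52}
            = \ln\frac{28 \times 52}{45 \times 22} \approx 0.3857 .
```

In other words, $e^{\hat\beta_2} \approx 1.47$ is simply the sample odds ratio of the 2×2 table.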
Three-way classification
- Consider the cross-classification of race, gender and possession of a driver's license for a sample of 17- and 18-year-olds:

                          Driver's license
    Race     Gender       Yes      No
    White    Male         43       134
    White    Female       26       149
    Black    Male         29       23
    Black    Female       22       36

- Let YES represent the "event" of interest, and TOTAL=YES+NO represent the "trial".
    DATA DRIVER;
      INPUT WHITE MALE YES NO;
      TOTAL = YES+NO;
      DATALINES;
    1 1 43 134
    1 0 26 149
    0 1 29 23
    0 0 22 36
    ;
    PROC GENMOD DATA=DRIVER;
      MODEL YES/TOTAL=WHITE MALE / D=B;
    RUN;
    Model Information
    Data Set                      WORK.DRIVER
    Distribution                  Binomial
    Link Function                 Logit
    Response Variable (Events)    YES
    Response Variable (Trials)    TOTAL
    Observations Used             4
    Number Of Events              120
    Number Of Trials              462

    Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              1     0.0583       0.0583
    Scaled Deviance       1     0.0583       0.0583
    Pearson Chi-Square    1     0.0583       0.0583
    Scaled Pearson X2     1     0.0583       0.0583
    Log Likelihood              -245.8974

    Algorithm converged.

    Analysis Of Parameter Estimates
                                 Standard    Wald 95% Confidence    Chi-
    Parameter    DF   Estimate   Error       Limits                 Square    Pr > ChiSq
    Intercept    1    -0.4555    0.2221      -0.8909    -0.0201     4.20      0.0403
    WHITE        1    -1.3135    0.2378      -1.7795    -0.8474     30.51     <.0001
    MALE         1    0.6478     0.2250      0.2068     1.0889      8.29      0.0040
    Scale        0    1.0000     0.0000      1.0000     1.0000

    NOTE: The scale parameter was held fixed.
- There are 4 observations, 120 events (YES) and 462 trials (TOTAL);
- Both the race and gender coefficients are significantly different from zero;
- The estimated model is not the saturated model, as there are 4 observations for 3 parameters. The estimated and saturated models differ by one parameter; hence the Deviance statistic has df=1.
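One way to see how close the main-effects model comes to the saturated fit is to compare fitted and observed proportions cell by cell. The following is a minimal sketch in which the coefficients are typed in by hand from the reported output (rounded to 4 decimals); the data set name FITTED and the variables Z, P_HAT and P_OBS are hypothetical names introduced for illustration.

```sas
/* Driver's license counts, as in the slides */
DATA DRIVER;
  INPUT WHITE MALE YES NO;
  TOTAL = YES + NO;
  DATALINES;
1 1 43 134
1 0 26 149
0 1 29 23
0 0 22 36
;
/* Fitted vs. observed proportions under the main-effects model */
DATA FITTED;
  SET DRIVER;
  Z     = -0.4555 - 1.3135*WHITE + 0.6478*MALE;  /* estimated log odds */
  P_HAT = 1 / (1 + EXP(-Z));                     /* fitted probability */
  P_OBS = YES / TOTAL;                           /* observed proportion */
RUN;
PROC PRINT DATA=FITTED;
  VAR WHITE MALE P_OBS P_HAT;
RUN;
```

Small gaps between P_OBS and P_HAT in all four cells are what a Deviance as low as 0.0583 on 1 df would lead one to expect.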
- How can the saturated model be constructed with the available data?
- The present model can be expanded to yield the saturated model by introducing the interaction term WHITE×MALE;
- The Deviance test is essentially a test of the significance of the interaction term.
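- Estimated model:
$$p_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 \text{WHITE}_i + \beta_3 \text{MALE}_i)}}$$
- Saturated model:
$$p_i = \frac{1}{1 + e^{-(\beta_1 + \beta_2 \text{WHITE}_i + \beta_3 \text{MALE}_i + \beta_4 \text{WHITE}_i \times \text{MALE}_i)}}$$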
- Testing the significance of the difference between the estimated and saturated models is the same as testing β4 = 0;
- The p-value corresponding to the Deviance statistic of 0.0583 can be computed using the following SAS commands:

    data _null_;
      pvalue = 1 - probchi(0.0583, 1);   /* upper-tail chi-square probability, 1 df */
      put pvalue;
    run;

- This results in a p-value of 0.8092. Hence the interaction term between MALE and WHITE is not significantly different from zero.
- To see this more clearly, let us fit the model explicitly with the interaction term:

    DATA DRIVER;
      INPUT WHITE MALE YES NO;
      TOTAL = YES+NO;
      DATALINES;
    1 1 43 134
    1 0 26 149
    0 1 29 23
    0 0 22 36
    ;
    PROC GENMOD DATA=DRIVER;
      MODEL YES/TOTAL=WHITE MALE WHITE*MALE / D=B;
    RUN;
    The GENMOD Procedure

    Model Information
    Data Set                      WORK.DRIVER
    Distribution                  Binomial
    Link Function                 Logit
    Response Variable (Events)    YES
    Response Variable (Trials)    TOTAL
    Observations Used             4
    Number Of Events              120
    Number Of Trials              462

    Criteria For Assessing Goodness Of Fit
    Criterion             DF    Value        Value/DF
    Deviance              0     0.0000       .
    Scaled Deviance       0     0.0000       .
    Pearson Chi-Square    0     0.0000       .
    Scaled Pearson X2     0     0.0000       .
    Log Likelihood              -245.8682

    Algorithm converged.

    Analysis Of Parameter Estimates
                                 Standard    Wald 95% Confidence    Chi-
    Parameter     DF  Estimate   Error       Limits                 Square    Pr > ChiSq
    Intercept     1   -0.4925    0.2706      -1.0229    0.0379      3.31      0.0688
    WHITE         1   -1.2534    0.3441      -1.9278    -0.5789     13.27     0.0003
    MALE          1   0.7243     0.3888      -0.0378    1.4864      3.47      0.0625
    WHITE*MALE    1   -0.1151    0.4765      -1.0491    0.8189      0.06      0.8092
    Scale         0   1.0000     0.0000      1.0000     1.0000

    NOTE: The scale parameter was held fixed.
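- Now, to test H0: β4 = 0 vs. H1: otherwise, we apply the LR test:
$$\text{Deviance} = 2\,(-245.8682 - (-245.8974)) = 0.0584$$
- Also, the log of the odds is given by
$$Z_i = \beta_1 + \beta_2 \text{WHITE}_i + \beta_3 \text{MALE}_i + \beta_4 \text{WHITE}_i \times \text{MALE}_i .$$
So
$$\frac{\partial Z_i}{\partial \text{WHITE}_i} = \beta_2 + \beta_4 \text{MALE}_i \quad \text{and} \quad \frac{\partial Z_i}{\partial \text{MALE}_i} = \beta_3 + \beta_4 \text{WHITE}_i .$$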
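As a cross-check on the LR statistic, the Wald chi-square for the interaction term in the output is the squared ratio of the estimate to its standard error. The line below is a worked restatement of the reported figures, not additional output:

```latex
\text{Wald } \chi^2 = \left(\frac{-0.1151}{0.4765}\right)^2 \approx 0.0583 .
```

The Wald and LR statistics are nearly identical here, which is consistent with both leading to a p-value of about 0.81.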
Hence the odds ratio estimates of WHITE and MALE are:
$$e^{-1.2534 - 0.1151\,\text{MALE}_i} \quad \text{and} \quad e^{0.7243 - 0.1151\,\text{WHITE}_i}$$
respectively, with the following interpretations:
- The odds of having a license for white females are $e^{-1.2534} = 0.286$ times the odds for black females;
- The odds of having a license for white males are $e^{-1.2534 - 0.1151} = 0.2545$ times the odds for black males;
- The odds of having a license for black males are $e^{0.7243} = 2.063$ times the odds for black females;
- The odds of having a license for white males are $e^{0.7243 - 0.1151} = 1.839$ times the odds for white females.
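Since the model with the interaction term is saturated, these odds ratios can also be read directly off the contingency table. The following worked check uses only the observed Yes/No counts and agrees with the values obtained from the coefficients up to rounding:

```latex
\begin{aligned}
\text{white vs. black females:} \quad & \frac{26/149}{22/36} \approx 0.2855, \\
\text{white vs. black males:}   \quad & \frac{43/134}{29/23} \approx 0.2545, \\
\text{black males vs. black females:} \quad & \frac{29/23}{22/36} \approx 2.0632, \\
\text{white males vs. white females:} \quad & \frac{43/134}{26/149} \approx 1.8390 .
\end{aligned}
```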
- The interaction term also affects the marginal effect on $p_i$. For example,
$$\frac{\partial p_i}{\partial \text{WHITE}_i} = f(Z_i)(\beta_2 + \beta_4 \text{MALE}_i),$$
where $f(Z_i) = p_i(1 - p_i)$ is the logistic density. In other words, the marginal change in $p_i$ with respect to a change of race from black to white depends on the gender of the person;
- Pearson's Chi-square goodness of fit test: see Tutorial 2.
Class exercises
1. Tutorial 2
2. 2004 Final Exam, Question 1
3. 2007 Final Exam, Question 1