Ch. 3 Simple Inference for Categorical Data

Christopher Kinson

Department of Statistics, University of Illinois at Urbana-Champaign

What is Categorical Data?

- Any value with a meaning other than a numeric quantity

- Can be words and/or numbers so long as the definition is meaningful

- May be stored as counts (frequencies) or proportions (risks) in
  - a standard data set
  - two-way tables for two categorical variables
  - multi-way tables for multiple categorical variables

Data description includes

- data visualization - proc sgplot

- numerical summaries - proc freq

Inference includes

- hypothesis testing - proc freq
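As a minimal sketch of the data visualization step, a bar chart of one categorical variable can be requested with the vbar statement of proc sgplot. The names dataName and varName are hypothetical placeholders, not data sets or variables from this chapter.

```sas
/* Hypothetical placeholders: replace dataName and varName with your own names. */
proc sgplot data=dataName;
  vbar varName;                    /* one bar per category; bar height is the frequency */
  title "Bar Graph of varName";    /* title shown above the graph */
run;
title;                             /* clear the title so it does not carry over */
```

The closing title statement with no text resets the title, which otherwise stays in effect for later procedures.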

How is Categorical Data Stored?

Most data sets are not made up entirely of categorical variables; they may contain only a few. Those data sets often appear as columns of data with rows corresponding to new observations or subjects - the standard layout we’ve used before.

For this chapter, we will also analyze categorical data presented in tables. These tables have various names: contingency, frequency, cross-tabulated, cross-classified, or n-way tables, where n is the number of categorical variables.

A 2 × 2 contingency table

                      Column Variable
Row Variable      Category 1    Category 2    Total
Category 1        n11           n12           n1+
Category 2        n21           n22           n2+
Total             n+1           n+2           N

Tests of Association

Hypothesis testing for a two-way table includes

- H0: there is no association between the row variable and the column variable

- HA: there is an association between the two variables

To test the null, we compare the observed and expected counts within the table.

- observed count: the frequency reported in the table, i.e. the raw count

- expected count: (row total) × (column total) / (overall total)

- a small p-value means strong evidence of association
  - but it does not tell us the characteristics of the association

Hypothesis tests based on deviations from expected counts

Hypothesis Testing: Pearson Chi-Square

- rows i = 1, ..., R and columns j = 1, ..., C

- overall total
  \[ N = \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij} \]

- observed counts $n_{ij}$

- expected counts
  \[ \hat{\mu}_{ij} = \frac{n_{i+} \, n_{+j}}{N} \]

- Pearson’s chi-square statistic
  \[ X^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}} \]

- degrees of freedom df = (R − 1)(C − 1)

- larger values of $X^2$ imply more evidence against the null hypothesis
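As a worked illustration of these formulas, the sketch below applies them to the aspirin and heart attack counts given later in this chapter (placebo: 189 yes, 10845 no; aspirin: 104 yes, 10933 no). The rounding is mine, so treat the last digits as approximate.

```latex
% Margins: n_{1+} = 11034, n_{2+} = 11037, n_{+1} = 293, n_{+2} = 21778, N = 22071
\hat{\mu}_{11} = \frac{11034 \cdot 293}{22071} \approx 146.48, \quad
\hat{\mu}_{12} \approx 10887.52, \quad
\hat{\mu}_{21} \approx 146.52, \quad
\hat{\mu}_{22} \approx 10890.48

% In a 2 x 2 table every |n_{ij} - \hat{\mu}_{ij}| is the same, here about 42.52
X^2 \approx 42.52^2 \left( \frac{1}{146.48} + \frac{1}{10887.52}
      + \frac{1}{146.52} + \frac{1}{10890.48} \right) \approx 25.01,
\qquad df = (2-1)(2-1) = 1
```

With 1 degree of freedom, a statistic this large lies far in the upper tail of the chi-square distribution, so the p-value is very small.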

Hypothesis Testing: Likelihood Ratio Chi-Square

- likelihood ratio chi-square statistic
  \[ G^2 = 2 \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij} \log\!\left( \frac{n_{ij}}{\hat{\mu}_{ij}} \right) \]

- degrees of freedom df = (R − 1)(C − 1)

- larger values of $G^2$ imply more evidence against the null hypothesis

- alternative to Pearson’s $X^2$
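For comparison, here is a sketch of the same calculation for the aspirin and heart attack counts given later in this chapter (189, 10845, 104, 10933), using expected counts of about 146.48, 10887.52, 146.52, and 10890.48 and natural logarithms. The intermediate values are rounded by hand and are approximate.

```latex
G^2 = 2 \left[ 189 \log\frac{189}{146.48} + 10845 \log\frac{10845}{10887.52}
             + 104 \log\frac{104}{146.52} + 10933 \log\frac{10933}{10890.48} \right]
    \approx 2 \, (48.17 - 42.44 - 35.65 + 42.60) \approx 25.37
```

For this table the value is close to the Pearson statistic, which is typical when the sample is large.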

Measures of Association: Phi Coefficient

- based on the $X^2$ statistic and overall total N

- take the square root of
  \[ \phi^2 = \frac{X^2}{N} \]

- small values mean weak association

- large values mean strong association

- for a 2 × 2 contingency table:
  - the phi coefficient has the same interpretation as the Pearson correlation
    \[ \phi = \frac{n_{11} n_{22} - n_{12} n_{21}}{\sqrt{n_{1+} \, n_{2+} \, n_{+1} \, n_{+2}}} \]
  - values lie in [−1, +1]
  - values near −1 mean strong negative association
  - values near +1 mean strong positive association
  - values near 0 mean weak or no association

Measures of Association: Contingency Coefficient

- based on the $X^2$ statistic and overall total N
  \[ cc = \sqrt{\frac{X^2}{N + X^2}} \]

- values always less than 1

- large values mean strong association

- useful for a square contingency table larger than 2 × 2

Measures of Association: Cramer’s V Coefficient

- based on the phi coefficient
  \[ V = \sqrt{\frac{\phi^2}{\min(R - 1, \, C - 1)}} \]
  - R is the total number of rows
  - C is the total number of columns

- values lie in [0, 1]

- large values mean strong association

- for a 2 × 2 contingency table:
  - Cramer’s V is the same as the phi coefficient
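To see how the three measures compare in practice, the sketch below takes $X^2 \approx 25.01$, the value Pearson's formula gives for the aspirin and heart attack table presented later in this chapter (N = 22071). The figures are rounded by hand.

```latex
\phi = \sqrt{\frac{25.01}{22071}} \approx 0.034, \qquad
cc = \sqrt{\frac{25.01}{22071 + 25.01}} \approx 0.034, \qquad
V = \sqrt{\frac{\phi^2}{\min(2-1, \, 2-1)}} = \phi \approx 0.034
```

All three values are close to 0, so the association is weak in size even though the chi-square statistic itself is large. With a very large N, strong evidence of an association does not imply a strong association.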

Using proc freq

- proc freq can produce the aforementioned tests for association and measures of association

- tables statement sets up the frequency table containing counts, overall percentages, row percentages, and column percentages by default

- chisq - runs chi-square and other tests of association & shows measures of association

- expected - shows the expected counts for each cell

- deviation - shows the residual (observed − expected) for each cell

- cellchi2 - shows the chi-square contribution for each cell

- norow - hides the row percentages

- nocol - hides the column percentages

- nopercent - hides the overall percentages

- noprint - does not print the table

Using proc freq (cont.)

When the data exist as a contingency table, the weight statement tells SAS that the values of the named variable are observed cell counts.

proc freq data=dataName;
  tables rowVarName*columnVarName / chisq expected;
  *weight n;  /* remove the leading * when each row of the data holds a cell count n */
run;

Oral Contraceptives Data

Women patients in several hospitals were asked whether they use oral contraceptives (such as birth control pills). Each case was a married woman who suffered from blood clots (idiopathic thromboembolism) over a 3-year period. These cases were matched to controls - women without blood clots who were discharged alive from the same hospital in the same time interval. What we care about is whether there is an association between suffering from blood clots and oral contraceptive usage. See Sartwell et al. (1969) “Thromboembolism and oral contraceptives: An epidemiological case-control study” for more details.

                                     Oral Contraceptive Usage: Controls
Oral Contraceptive Usage: Cases      Used    Not Used
Used                                 10      57
Not Used                             13      95

Oral Contraceptives Data (cont.)

Manually reading it into SAS!

data pill;
  input caseuse $ controluse $ n;  /* n is the number of matched pairs in the cell */
  cards;
Y Y 10
Y N 57
N Y 13
N N 95
;
run;

Heart Attack Data

A 5-year randomized study was done on male physicians to study aspirin’s effect on cardiovascular disease. The physicians took either one aspirin or one placebo, and they did not know the type of pill they took. What we care about is whether the placebo group and the aspirin group experience heart attacks (myocardial infarctions) at similar rates. See Preliminary Report: Findings from the Aspirin Component of the Ongoing Physicians’ Health Study (1988) for more information.

                Myocardial Infarction
Drug Group      Yes     No
Placebo         189     10845
Aspirin         104     10933

Heart Attack Data (cont.)

Manually reading it into SAS!

data heart;
  input group $ attack $ n;  /* n is the cell count */
  datalines;
placebo yes 189
placebo no 10845
aspirin yes 104
aspirin no 10933
;
run;
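A quick way to check that the table was read correctly is to print it back with proc freq. This is a minimal sketch that assumes the heart data set created above. It uses the weight statement because each row holds a cell count, and order=data so that the categories appear in the order they were typed rather than alphabetically.

```sas
/* Assumes the heart data set from the data step above. */
proc freq data=heart order=data;
  tables group*attack / norow nocol nopercent;  /* show the counts only */
  weight n;                                     /* n holds the observed cell counts */
run;
```

The printed counts should match the table above, with placebo as the first row and yes as the first column.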

Car Accidents Data

The data set is a subset of the National Automotive Sampling System’s (NASS) Crashworthiness Data System (CDS), which contains a stratified random sample of nationwide police-reported crashes between the years 1997-2002. CDS data focus on passenger vehicle crashes and are used to investigate injury mechanisms. Something interesting to investigate is the debunking of the horrible stereotype that women are inadequate drivers. The data contain 26218 observations, and we retain 6 of the original 15 variables.

filename adata url "https://tinyurl.com/y8hujqvk";

data accidentDB;
  infile adata dsd dlm='09'x truncover firstobs=2;
  input weight dead $ airbag $ seatbelt $ frontal sex $
        ageOFocc yearacc yearVeh abcat $ occRole $ deploy
        injSeverity;
  keep dead weight sex occrole yearacc ageofocc;
run;

proc print data=accidentDB (obs=50);
run;

Steam Video Game Data

This is a subset of the steam-200k data set on Kaggle with an added ESRB rating column. Steam is a very popular online gaming hub with hundreds of thousands of users playing millions of hours of video games. The video games featured on Steam are of various genres and can be found across multiple consoles such as Playstation, XBox, Wii, and many others. The data is a random selection of 100 users and the game each spent the most hours playing. The variables in the data include user ID, game title, hours played, ESRB rating, genre, and a binary variable indicating whether (1) or not (0) they played more than 40 hours of that game. The ESRB rating does not apply to the online gaming experience, but I use it anyway. A question of interest is whether there is an association between the ESRB rating and the number of hours played.

Steam Video Game Data (cont.)

Reading it into SAS!

filename vgdata url "https://tinyurl.com/ya2cvtt6";

data game;
  infile vgdata dsd dlm='09'x truncover firstobs=2;
  input userID title $ hoursplayed Rating $ Genre $
        over40hours;
run;

proc print data=game;
run;

Weight Perception Data

This data comes from a national youth survey consisting of a nationally representative sample of young people ages 14 to 20 years old as of December 31, 1999, who self-report their weight perception in response to the prompt “How would you describe your weight?” Variables include age, height (in inches), weight (in pounds), sex, and a categorical response about weight perception. Do upperclassmen feel better about their weight than underclassmen? Do young women tend to feel more strongly about their weight than young men?

filename wdata url "https://tinyurl.com/yc7dnv8p";

data teenweightDB;
  infile wdata dsd dlm='09'x truncover firstobs=2;
  input gender $ age height weight weightperception $16.;
run;

proc print data=teenweightDB;
run;

Exercise: Car Accidents Data

1. Create bar graphs of the variables: sex and occupant’s role. Make sure the title of the graph says: “Bar Graph of Occupant’s Role”.

2. Report the frequencies and expected values for the two variables.

3. Run a Pearson $X^2$ test of association. What conclusions do you draw from the results?

4. Use the measures of association to describe the strength of association between sex and occupant’s role.

Hypothesis Testing: Risk Difference

For 2 × 2 tables,

- risks are binomial proportions
  - the row percentages reported by proc freq by default
  - comparing the proportion in row 1 to the proportion in row 2 of the table

- We can test whether the difference in risks is significant
  - H0: risk1 − risk2 = 0
  - HA: risk1 − risk2 ≠ 0

- We can find confidence intervals for the difference in risks
  - Asymptotically normal under the null
  - If the individual risks are very close to 0, then
    - The risk difference test results can be misleading
    - Compute the odds ratio for an alternative interpretation

proc freq data=dataName;
  tables rowVarName*columnVarName / riskdiff;
  *weight n;
run;
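The calculation behind such an interval can be sketched by hand, assuming the usual Wald (normal-approximation) standard error for a difference of two proportions. The numbers below use the aspirin and heart attack table from this chapter, with the risk of a heart attack in the placebo row as risk 1 and in the aspirin row as risk 2.

```latex
\hat{p}_1 = \frac{189}{11034} \approx 0.01713, \qquad
\hat{p}_2 = \frac{104}{11037} \approx 0.00942, \qquad
\hat{p}_1 - \hat{p}_2 \approx 0.00771

SE = \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_{1+}} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_{2+}}}
   \approx 0.00154

\text{95\% CI: } 0.00771 \pm 1.96 \, (0.00154) \approx (0.0047, \; 0.0107)
```

After rounding, this agrees with the interval (0.005, 0.011) quoted in the Heart Attack Data exercise. Since the interval excludes 0, the difference in risks is significant at the 5% level.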

Measure of Association: Odds Ratio

- For 2 × 2 tables, the sample odds ratio is
  \[ OR = \frac{odds_1}{odds_2} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11} n_{22}}{n_{12} n_{21}} \]

- OR = 1 - the row and column variables are independent (no association)

- OR far from 1 in either direction (much larger than 1 or close to 0) - strong association

- OR > 1 - subjects in row 1 are more likely to have a success than subjects in row 2

- OR < 1 - subjects in row 1 are less likely to have a success than subjects in row 2

- Confidence intervals are based on log(OR)
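As a worked sketch, the odds ratio for the aspirin and heart attack table in this chapter can be computed by hand. The interval below assumes the standard Wald approach: build a normal-approximation interval for log(OR), then exponentiate the endpoints.

```latex
\widehat{OR} = \frac{189 \cdot 10933}{10845 \cdot 104} = \frac{2066337}{1127880} \approx 1.83

SE(\log \widehat{OR}) = \sqrt{\frac{1}{189} + \frac{1}{10845} + \frac{1}{104} + \frac{1}{10933}}
                      \approx 0.123

\log \widehat{OR} \pm 1.96 \, SE \approx 0.605 \pm 0.241 = (0.365, \; 0.846)
\quad \Longrightarrow \quad
\text{95\% CI for } OR \approx (e^{0.365}, \, e^{0.846}) \approx (1.44, \; 2.33)
```

The interval excludes 1, so the odds of a heart attack are estimated to be higher in the placebo group than in the aspirin group.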

Measure of Association: Odds Ratio (cont.)

For two binary variables, the sample odds ratio OR = 1.25 has the following equivalent interpretations:

1. the odds of success in row 1 are 1.25 times the odds of success in row 2

2. the odds of success in row 2 are 1/1.25 = 0.8 times the odds of success in row 1

3. the odds of success are 25% higher for row 1 than for row 2
   - this 3rd interpretation makes sense when the odds ratio is 1 < OR < 2

/* To show the odds ratios only */
proc freq data=dataName;
  tables rowVarName*columnVarName / or(cl=Wald);
  *weight n;
run;

Hypothesis Testing: Fisher’s Exact

For 2 × 2 contingency tables

- Appropriate for a table containing small expected cell frequencies and/or a small sample size

- In theory, we assume the row totals and column totals are fixed, and the statistic relies on n11

- This test is very conservative
  - reject only for really small p-values

For larger tables

- Appropriate when data is sparse
  - sparse: several cells with frequency of 0

Hypothesis Testing: Fisher’s Exact (cont.)

/* for 2 by 2 tables you can use */
proc freq data=dataName;
  tables rowVarName*columnVarName / chisq;
  *weight n;
run;

/* for general sized two-way tables you can use */
proc freq data=dataName;
  tables rowVarName*columnVarName / exact;
  *weight n;
run;

Hypothesis Testing: Mantel-Haenszel

A special kind of association test

- Test for linear association (as categories increase/decrease in value)

- Appropriate for two ordinal variables
  - ordinal variable: a categorical variable containing ordered categories
  - e.g. age groups listed from youngest to oldest, level of agreement ordered from strongly disagree to strongly agree

- Asymptotically chi-square with 1 degree of freedom
  - H0: no linear association
  - HA: increases/decreases in one variable are associated with increases/decreases in the other variable

proc freq data=dataName;
  tables rowVarName*columnVarName / chisq;  /* chisq output includes the Mantel-Haenszel chi-square */
  *weight n;
run;

Hypothesis Testing: McNemar’s

- Appropriate for 2 × 2 tables with a matched pairs design

- Some matched pairs designs include
  - Case-control studies
  - Studies about twins
  - Studies of one group of subjects at two time points

- H0: the two marginal proportions are the same (n+1 = n1+ and n+2 = n2+)

- HA: the two marginal proportions are not the same

/* agree requests McNemar's test and the kappa coefficient; */
/* ods select keeps only the McNemar's test output          */
proc freq data=dataName;
  tables rowVarName*columnVarName / agree;
  weight n;
  ods select McNemarsTest;
run;
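The slides do not state the test statistic, so the following is a sketch using the standard (uncorrected) form of McNemar's statistic, which depends only on the two off-diagonal cells where the members of a pair disagree. It is applied here to the oral contraceptives table from this chapter, where $n_{12} = 57$ and $n_{21} = 13$.

```latex
Q = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}
  = \frac{(57 - 13)^2}{57 + 13}
  = \frac{1936}{70} \approx 27.66, \qquad df = 1
```

The 10 and 95 pairs on the diagonal agree with each other and therefore carry no information about whether the two marginal proportions differ.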

Hypothesis Testing: Exact Tests

- When sample sizes are small but the expected frequencies are not a problem, you can use an exact test for the appropriate test of association

- Some examples of exact tests:
  - chisq - exact Pearson, likelihood ratio, and Mantel-Haenszel chi-square tests
  - pchi - exact Pearson chi-square
  - lrchi - exact likelihood ratio chi-square
  - mhchi - exact Mantel-Haenszel chi-square
  - fisher - Fisher’s exact test
  - or - exact confidence limits for the odds ratio

proc freq data=dataName;
  tables rowVarName*columnVarName / chisq;
  *weight n;
  exact chisq;
run;

Additional Guidance

Using the order=data option as part of the proc freq statement will print the contingency table with the categories in their order of appearance in the data rather than in alphabetical order.

- If the first entry in a data set for one variable, say gender, appears as women and the second unique entry is men, then the contingency table will list women as the first category and men as the second.

In general, if the sample size is too small, then exact tests should be used. The exact statement can be used to produce some results, and an exact option is available for other methods.

For more details and/or guidance, check any of the following links

- SAS Procedures by Name and Product: http://support.sas.com/documentation/cdl/en/allprodsproc/70141/HTML/default/viewer.htm#procedures.htm

- Chi-Square Tests and Statistics: https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_freq_a0000000658.htm

- UCLA’s Proc freq SAS Annotated Output: https://stats.idre.ucla.edu/sas/output/proc-freq/

Exercise: Oral Contraceptives Data

1. Create a frequency table showing the counts, percentages, and expected counts. Do any cells have large frequencies?

2. Run a test of association and state why you chose this test. Comment on the test results.

3. Give a confidence interval for the difference between the proportion of women with blood clots who use oral contraceptives and the corresponding proportion of women not suffering from blood clots. What does this interval tell us?

Exercise: Weight Perception Data

1. Create a subset of the data with weights larger than 0 and where teens are younger than 19 years old.

2. Create a categorical status variable such that teens younger than 17 are labeled ‘Lowerclassman’ and teens ages 17-18 are labeled ‘Upperclassman’.

3. Obtain the expected counts and the chi-square contributions for each cell of the table of status vs weightperception. Make sure the ordering is in the same direction for both variables.

4. Run the Mantel-Haenszel test for linear association on the table with variables status and weightperception. Comment on the results.

5. Run the Mantel-Haenszel test for linear association on the table with variables sex (stored as gender) and weightperception. Interpret the results.

6. Compute the odds ratios and their confidence limits for the table with sex and weightperception. Interpret the results.

Exercise: Steam Video Game Data

1. Print the data to see if some of your favorite games are in this data set.

2. Make a subset of the data for the games with mature and everyone ratings. Create a new variable that is 1 for mature games and 0 for everyone games.

3. Suppose anyone who plays a video game for more than 40 hours is an extreme gamer. Is there an association between the rating of a game and being an extreme gamer?

4. Which test is appropriate for this setting and why?

Exercise: Heart Attack Data

Using the table from above complete the following questions.

1. Read in the data using SAS.

2. Obtain the expected counts and the chi-square contributions for each cell.

3. Test for association and state why you chose this test. Comment on the test results.

4. Obtain risk estimates to see if the difference is significant.

   One book found the 95% CI for the difference between the risk of a heart attack in the placebo group and the risk of a heart attack in the aspirin group to be 0.008 ± 0.003, or (0.005, 0.011).

5. Compute the odds ratios and their confidence limits. Interpret the results.

Exercise: Car Accidents Data

1. Obtain risk estimates to see if there’s any difference in the occupant’s role for the males and females.

2. Interpret the odds ratio and its confidence interval for the 2 × 2 table.

3. Suppose we created a subset by matching each female driver with a male driver and randomly choosing 8000 of the pairs. We checked whether the accidents resulted in death. Using Table 1 below, what can we conclude from a test of association for this data?

Table 1
                           Male drivers: Dead?
Female drivers: Dead?      Dead    Alive
Dead                       10      271
Alive                      398     7321

Exercise: Car Accidents Data (cont.)

4. We randomly selected 4000 male and 4000 female occupants and checked whether they died in traffic accidents. These results are in Table 2. Determine the risk difference of female vs male death for the table and comment on the results.

Table 2
            Dead?
Sex         Dead    Alive
Female      139     3861
Male        193     3807
