Lecture 22 Psychology 790
Regression Models for Quantitative and Qualitative Predictors
Lecture 22 (and last of the year)
December 5, 2006
Psychology 790
Overview: Today's Lecture, Schedule, Announcements
Categorical Variables
Variable Coding
Dummy Coding
Multiple Categories (> 2)
Wrapping Up
Today's Lecture
Regression with a single categorical independent variable.
Coding procedures for analysis.
Dummy coding.
Relationship between regression on a categorical independent variable and other statistical tests.
A surprise! (and teaching evaluations).
Our New Schedule
Date    Topic                                      Chapter
12/5    Qualitative and Quantitative Predictors    K 8.3-8.7
12/7    Final exam discussion
Announcements
Interested in learning more statistics?
Here are three courses you should consider:
Psych 791: it goes without saying, but learn why the general linear model is so cool.
Psych 892: Test Theory.
Learn about how we develop scales and questionnaires.
Psych 993: Statistical Consulting.
Have data and need stats help?
Or, do you want hands-on stats experience under my guidance?
Having taken 790, you are prepared for all of these courses.
Categorical Variables
Regression with Continuous Variables
Linear regression regresses a continuous-valued dependent variable, Y, onto a set of continuous-valued independent variables X.
The regression line gives the estimate of the mean of Y conditional on the values of X, or E(Y|X).
But what happens when some or all independent variables are categorical in nature?
Is the point of the regression to determine E(Y|X) across the levels of X?
Can't we just put the categorical variables into SAS's proc glm and push the "run" button?
Example Data Set
Neter (1996, p. 676).
The Kenton Food Company wished to test four different package designs for a new breakfast cereal.
Twenty stores, with approximately equal sales volumes, were selected as the experimental units.
Each store was randomly assigned one of the package designs, with each package design assigned to five stores.
The stores were chosen to be comparable in location and sales volume.
Other relevant conditions that could affect sales, such as price, amount and location of shelf space, and special promotional efforts, were kept the same for all of the stores in the experiment.
Cereal
A Regular Regression?
[Figure: scatterplot of Number of Cases Sold (10 to 30) against Package Type (1 to 4)]
Number of Cases Sold = 7.70 + 4.38 * package
RSquare = 0.64
What is wrong with this picture?
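The fitted line above can be reproduced with a short least-squares sketch. It is written in Python rather than the SAS used in this lecture, and it assumes the cereal sales figures listed in the dummy-coded data table later in these notes, with package type entered as the numbers 1 to 4. The function name `simple_regression` is made up for illustration.

```python
# Cases sold per store, grouped by package type (values taken from the
# dummy-coded data table later in these notes).
sales = {
    1: [11, 17, 16, 14, 15],
    2: [12, 10, 15, 19, 11],
    3: [23, 20, 18, 17, 19],
    4: [27, 33, 22, 26, 28],
}

def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r_squared = (b1 * sxy) / syy
    return b0, b1, r_squared

# Treat the package label as if it were a quantity (the questionable step).
x = [package for package, values in sales.items() for _ in values]
y = [value for values in sales.values() for value in values]

b0, b1, r_squared = simple_regression(x, y)
print(f"cases sold = {b0:.2f} + {b1:.2f} * package, R-square = {r_squared:.2f}")
```

A straight line through the labels assumes every one-step change in package number shifts sales by the same amount, even though the numbering of the designs is arbitrary.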
Categorical Variables
Categorical variables commonly occur in research settings.
Another term sometimes used for categorical variables is qualitative variables.
A strict definition of a qualitative or categorical variable is a variable that has a finite number of levels.
Continuous (or quantitative) variables, alternatively, have infinitely many levels.
Often this is assumed more than practiced.
Quantitative variables often have countably many levels.
The level of precision of an instrument can limit the number of levels of a quantitative variable.
Research Design
Categorical variables can occur in many different research designs:
Experimental research.
Quasi-experimental research.
Nonexperimental/Observational research.
Such variables can be used with regression for:
Prediction.
Explanation.
Analysis Specifics
Because of the nature of categorical variables, the emphasis of the regression is not on linear trends but on differences between means (of Y) at each level of the category.
Not all categorical variables are ordered (like cereal box type, gender, etc.).
When considering differences in the mean of the dependent variable, the type of analysis being conducted by a regression is commonly called an ANalysis Of VAriance (ANOVA).
A combination of categorical and continuous variables in the same regression is called ANalysis Of CoVAriance (ANCOVA - Chapters 14 and 15).
Example Variable: Two Categories
From Pedhazur (1997; p. 343): Assume that the data reported [below] were obtained in an experiment in which E represents an experimental group and C represents a control group.
                                        E     C
                                        20    10
                                        18    12
                                        17    11
                                        17    15
                                        13    17
$\sum Y$                                85    65
$\bar{Y}$                               17    13
$\sum (Y - \bar{Y})^2 = \sum y^2$       26    34
Old School Statistics: The t-test
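As you may recall from an earlier course on statistics, an easy way to determine if the means of the two conditions differ significantly is to use a t-test (with $n_1 + n_2 - 2$ degrees of freedom).
$H_0: \mu_1 = \mu_2$
$H_A: \mu_1 \neq \mu_2$
$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\frac{\sum y_1^2 + \sum y_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$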
Old School Statistics: The t-test
$$t = \frac{17 - 13}{\sqrt{\frac{26 + 34}{5 + 5 - 2}\left(\frac{1}{5} + \frac{1}{5}\right)}} = \frac{4}{\sqrt{3}} = 2.31$$
From Excel (=tdist(2.31,8,2)), p = 0.0496.
If we used a Type-I error rate of 0.05, we would reject the null hypothesis and conclude the means of the two groups were significantly different.
But what if we had more than two groups?
This type of problem can be solved equivalently from within the context of the General Linear Model.
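The t statistic above can be checked with a few lines of code. This is a minimal sketch in Python (not the Excel or SAS used in the lecture), using the E and C scores from the Pedhazur example. The helper names `sum_sq_dev` and `pooled_t` are hypothetical, and the p-value is left to Excel's tdist as in the slide.

```python
import math

experimental = [20, 18, 17, 17, 13]  # group E
control = [10, 12, 11, 15, 17]       # group C

def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def pooled_t(group1, group2):
    """Two-sample t statistic with a pooled variance; returns (t, df)."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled_var = (sum_sq_dev(group1) + sum_sq_dev(group2)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / std_err, df

t_stat, df = pooled_t(experimental, control)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```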
Variable Coding
Variable Coding
When using categorical variables in regression, levels of the categories must be recoded from their original value to ensure the regression model truly estimates the mean differences at levels of the categories.
Several types of coding strategies are common:
Dummy coding.
Effect coding.
Each type will produce the same fit of the model (via $R^2$).
The estimated regression parameters are different across coding types, thereby representing the true difference in approaches employed by each type of coding.
The choice of method of coding does not differ as a function of the type of research or analysis or purpose (explanation or prediction) of the analysis.
Variable Coding
Definition: a code is a set of symbols to which meanings can be assigned (Pedhazur, 1997; p. 342).
The assignment of symbols follows a rule (or set of rules) determined by the categories of the variable used.
Typically symbols represent the respective levels of a categorical variable.
All entities within the same symbol are considered alike (or homogeneous) within that category level.
Categorical levels must be predetermined prior to analysis.
Some variables are obviously categorical - gender.
Some variables are not so obviously categorical - political affiliation.
Dummy Coding
Dummy Coding
The most straightforward method of coding categorical variables is dummy coding.
In dummy coding, one creates a set of column vectors that represent the membership of an observation to a given category level.
If an observation is a member of a specific category level, it is given a value of 1 in that category level's column vector.
If an observation is not a member of a specific category level, it is given a value of 0 in that category level's column vector.
Dummy Coding
For each observation, no more than a single 1 will appear in the set of column vectors for that variable.
The column vectors represent the predictor variables in a regression analysis, where the dependent variable is modeled as a function of these columns.
Because of linear dependence with an intercept, one category-level vector is often excluded from the analysis.
Because all observations at a given category level have the same value across the set of predictors, the predicted value of the dependent variable, Y, will be identical for all observations within a category.
The set of category vectors (and a vector for an intercept) are now used as input into a regression model.
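The coding rule described above can be sketched in a few lines. The example is in Python rather than SAS, and the function name `dummy_code` is made up for illustration. It builds one 0/1 column per category level for the two-group (E and C) example.

```python
def dummy_code(labels, levels):
    """Return one 0/1 column per level, in the order given by `levels`."""
    return {level: [1 if label == level else 0 for label in labels]
            for level in levels}

groups = ["E"] * 5 + ["C"] * 5
columns = dummy_code(groups, ["E", "C"])

# When every level keeps its column, each observation has exactly one 1.
row_sums = [sum(col[i] for col in columns.values()) for i in range(len(groups))]

print(columns["E"])
print(columns["C"])
print(row_sums)
```

Because the row sums are all 1, the full set of columns adds up to the intercept column, which is why one category-level vector is usually dropped when an intercept is in the model.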
Dummy Coded Regression Example
        Y      X0    X1    X2    Group
        20     1     1     0     E
        18     1     1     0     E
        17     1     1     0     E
        17     1     1     0     E
        13     1     1     0     E
        10     1     0     1     C
        12     1     0     1     C
        11     1     0     1     C
        15     1     0     1     C
        17     1     0     1     C
Mean    15     1     0.5   0.5
SS      100    0     2.5   2.5
$\sum y x_1 = 10$
$\sum y x_2 = -10$
Dummy Coded Regression
The General Linear Model states that the estimated regression parameters are given by:
$b = (X'X)^{-1}X'Y$
From the previous slide, you can see what our entries for X could be, but...
Notice that $X_0 = X_1 + X_2$.
This linear dependency means that:
$(X'X)$ is a singular matrix - no inverse exists.
Any combination of two of the columns would rid us of the linear dependency.
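The singularity can be verified numerically. The sketch below (Python, standard library only; the helper names are hypothetical) builds the design matrix from the intercept column and both group indicators in the dummy-coded table, forms the cross-product matrix, and takes its determinant.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Each row is (intercept, E indicator, C indicator) for one observation.
X = [[1, 1, 0]] * 5 + [[1, 0, 1]] * 5

XtX = matmul(transpose(X), X)
print(XtX)        # the cross-product matrix X'X
print(det3(XtX))  # a zero determinant means X'X has no inverse
```

Dropping any one of the three columns leaves a 2x2 cross-product matrix with a nonzero determinant, which is what the three example analyses rely on.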
Dummy Coded Regression - X1 and X2
For our first example analysis, consider the regression of Y on X1 and X2.
$Y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \epsilon_i$
$b_1 = 17$
$b_2 = 13$
SSTO = 100
$SSE = \sum (Y_i - \hat{Y}_i)^2 = 60$
SSR = 100 - 60 = 40
$R^2 = \frac{40}{100} = 0.4$
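As a worked check of these estimates (a sketch that uses only the data table and the least-squares formula $b = (X'X)^{-1}X'Y$): X1 and X2 are never both 1 for the same observation, so $X'X$ is diagonal, and $X'Y$ holds the two group totals.

```latex
X'X = \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}, \qquad
X'Y = \begin{pmatrix} 85 \\ 65 \end{pmatrix}, \qquad
b = (X'X)^{-1} X'Y
  = \begin{pmatrix} 85/5 \\ 65/5 \end{pmatrix}
  = \begin{pmatrix} 17 \\ 13 \end{pmatrix}
```

The residuals are then deviations from the two group means, so SSE is the sum of the two within-group sums of squares, 26 + 34 = 60.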
Dummy Coded Regression - X1 and X2
b1 = 17 is the mean for the E category.
b2 = 13 is the mean for the C category.
Without an intercept, the model is fairly easy to interpret.
For more advanced models, an intercept will prove to be helpful in interpretation.
Dummy Coded Regression - X0 and X1
For our second example analysis, consider the regression of Y on X0 (the intercept) and X1.
$Y_i = \beta_0 + \beta_1 X_{i1} + \epsilon_i$
$b_0 = 13$
$b_1 = 4$
SSTO = 100
$SSE = \sum (Y_i - \hat{Y}_i)^2 = 60$
SSR = 100 - 60 = 40
$R^2 = \frac{40}{100} = 0.4$
Dummy Coded Regression - X0 and X1
$b_0 = 13$ is the mean for the C category.
$b_1 = 4$ is the mean difference between the E category and the C category.
The C category is called the reference category.
For members of the C category:
$\hat{Y} = b_0 + b_1 X_{i1} = 13 + 4(0) = 13$
For members of the E category:
$\hat{Y} = b_0 + b_1 X_{i1} = 13 + 4(1) = 17$
With the intercept, the model parameters are now different from the first example.
The fit of the model, however, is the same.
Dummy Coded Regression - X0 and X2
For our third example analysis, consider the regression of Y on X0 (the intercept) and X2.
$Y_i = \beta_0 + \beta_2 X_{i2} + \epsilon_i$
$b_0 = 17$
$b_2 = -4$
SSTO = 100
$SSE = \sum (Y_i - \hat{Y}_i)^2 = 60$
SSR = 100 - 60 = 40
$R^2 = \frac{40}{100} = 0.4$
Dummy Coded Regression - X0 and X2
$b_0 = 17$ is the mean for the E category.
$b_2 = -4$ is the mean difference between the C category and the E category.
The E category is called the reference category.
For members of the E category:
$\hat{Y} = b_0 + b_2 X_{i2} = 17 - 4(0) = 17$
For members of the C category:
$\hat{Y} = b_0 + b_2 X_{i2} = 17 - 4(1) = 13$
With the intercept, the model parameters are now different from the first example.
The fit of the model, however, is the same.
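The three example analyses can be compared side by side. The following is a minimal Python sketch (not the lecture's SAS) that solves the 2x2 normal equations for each pair of columns from the dummy-coded table. The function name `fit_two_columns` is hypothetical.

```python
y = [20, 18, 17, 17, 13, 10, 12, 11, 15, 17]
x0 = [1] * 10            # intercept column
x1 = [1] * 5 + [0] * 5   # 1 for group E
x2 = [0] * 5 + [1] * 5   # 1 for group C

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_two_columns(u, v, y):
    """Least squares for y = bu*u + bv*v via the 2x2 normal equations."""
    suu, svv, suv = dot(u, u), dot(v, v), dot(u, v)
    suy, svy = dot(u, y), dot(v, y)
    det = suu * svv - suv * suv
    bu = (svv * suy - suv * svy) / det
    bv = (suu * svy - suv * suy) / det
    sse = sum((yi - bu * ui - bv * vi) ** 2 for yi, ui, vi in zip(y, u, v))
    return bu, bv, sse

fits = {
    "X1 and X2": fit_two_columns(x1, x2, y),
    "X0 and X1": fit_two_columns(x0, x1, y),
    "X0 and X2": fit_two_columns(x0, x2, y),
}
for name, (b_first, b_second, sse) in fits.items():
    print(f"{name}: b = ({b_first:g}, {b_second:g}), SSE = {sse:g}")
```

The coefficients change with the choice of columns, while the error sum of squares (and therefore the fit) is identical in all three runs.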
Hypothesis Test - Regression Coefficient
Because each model had the same value for $R^2$ and the same number of degrees of freedom for the regression (1), all hypothesis tests of the model parameters will result in the same value of the test statistic.
$F = \frac{MSR}{MSE} = \frac{40/1}{60/8} = 5.33$
From Excel (=fdist(5.33,1,8)), p = 0.0496.
If we used a Type-I error rate of 0.05, we would reject the null hypothesis and conclude that the regression coefficient in each analysis is significantly different from zero.
Hypothesis Test of the Regression Coefficient
Recall from the t-test of the mean difference, t = 2.31.
For the test of the coefficient, notice that $F = t^2$ (with the unrounded value, $t^2 = 16/3 \approx 5.33$).
Also notice that the p-values for each hypothesis test were the same, p = 0.0496.
The test of the regression coefficient is equivalent to running a t-test when using a single categorical variable with two categories.
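The equivalence of the two tests can be confirmed numerically. This short Python sketch (variable names are illustrative only) computes F from the regression sums of squares and t from the pooled-variance formula, using the E and C scores from the two-group example.

```python
import math

y_e = [20, 18, 17, 17, 13]
y_c = [10, 12, 11, 15, 17]
n_e, n_c = len(y_e), len(y_c)
mean_e, mean_c = sum(y_e) / n_e, sum(y_c) / n_c
grand_mean = (sum(y_e) + sum(y_c)) / (n_e + n_c)

# Error sum of squares: deviations of each score from its own group mean.
sse = sum((v - mean_e) ** 2 for v in y_e) + sum((v - mean_c) ** 2 for v in y_c)
# Regression sum of squares: deviations of the group means from the grand mean.
ssr = n_e * (mean_e - grand_mean) ** 2 + n_c * (mean_c - grand_mean) ** 2

df_error = n_e + n_c - 2
f_stat = (ssr / 1) / (sse / df_error)
t_stat = (mean_e - mean_c) / math.sqrt((sse / df_error) * (1 / n_e + 1 / n_c))

print(f"F = {f_stat:.2f}, t = {t_stat:.2f}, t squared = {t_stat ** 2:.2f}")
```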
Multiple Categories (> 2)
Multiple Categories (> 2)
Generalizing the concept of dummy coding, we revisit our first example data set, the cereal experiment data.
Recall that there were four different types of cereal boxes.
A dummy coding scheme would involve creation of four new column vectors, each representing observations from one box type.
Just as was the case with two categories, a linear dependency is created if we wanted to use all four variables.
Therefore, we must choose which category to remove from the analysis.
One-Way Analysis of Variance
Just as was the case for the example with two categories, a multiple category regression model with a single categorical independent variable has a direct link to a statistical test you may be familiar with.
The regression model tests for mean differences across all pairings of category levels simultaneously.
Testing for a difference between multiple groups (> 2) equates to a one-way ANOVA model (for a model with a single categorical independent variable).
Y X0 X1 X2 X3 X4 Type
11 1 1 0 0 0 1
17 1 1 0 0 0 1
16 1 1 0 0 0 1
14 1 1 0 0 0 1
15 1 1 0 0 0 1
12 1 0 1 0 0 2
10 1 0 1 0 0 2
15 1 0 1 0 0 2
19 1 0 1 0 0 2
11 1 0 1 0 0 2
23 1 0 0 1 0 3
20 1 0 0 1 0 3
18 1 0 0 1 0 3
17 1 0 0 1 0 3
19 1 0 0 1 0 3
27 1 0 0 0 1 4
33 1 0 0 0 1 4
22 1 0 0 0 1 4
26 1 0 0 0 1 4
28 1 0 0 0 1 4
Categorical Variables in SAS
libname cereal 'C:\Documents and Settings\Jonathan Templin\sas';

* SAS's coding;
proc glm data=cereal;
  class boxuncode;
  model units_sold = boxuncode / solution;
run;

Notice the class line - this indicates to SAS that the variable(s) listed on this line are categorical variables and to encode them prior to the analysis.
To get regression model parameter estimates, we must put the /solution option after the model specification on the model line.
Categorical Variables in SAS
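                       Sum of
Source        DF      Squares     Mean Square   F Value   Pr > F
Model          3   588.1500000   196.0500000     19.80    [lost in extraction]
BoxUnCode      3   588.1500000   196.0500000     19.80    [lost in extraction]

Parameter      Estimate         Standard Error   t Value   Pr > |t|
Intercept      27.20000000 B    1.40712473       19.33     [lost in extraction]
[The remaining rows of the SAS output were lost in extraction.]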
Breakfast Cereal Example
To make things interesting, let's drop X4 from our analysis (this is actually what SAS does).
$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \epsilon_i$
Because X4 (representing box type four) was omitted from our model, the estimated intercept parameter now represents the mean for box type four.
All other parameters represent the difference between their respective category level and category level four with respect to the dependent variable.
$b_0 = 27.2$
$b_1 = -12.6$, $b_2 = -13.8$, $b_3 = -7.8$
Breakfast Cereal Example
Therefore:
$\hat{Y}_A = \bar{Y}_A = b_0 + b_1(1) + b_2(0) + b_3(0) = 27.2 - 12.6 = 14.6$
$\hat{Y}_B = \bar{Y}_B = b_0 + b_1(0) + b_2(1) + b_3(0) = 27.2 - 13.8 = 13.4$
$\hat{Y}_C = \bar{Y}_C = b_0 + b_1(0) + b_2(0) + b_3(1) = 27.2 - 7.8 = 19.4$
$\hat{Y}_D = \bar{Y}_D = b_0 + b_1(0) + b_2(0) + b_3(0) = 27.2$
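The link between the coefficients and the group means can be illustrated directly from the data. This Python sketch relies on a known property of a single dummy-coded predictor set with an intercept: the fitted values are the group means, so the intercept equals the reference-group mean and each coefficient is a difference from it. The variable names are illustrative, and box type 4 is taken as the reference category to match the SAS output.

```python
sales = {
    1: [11, 17, 16, 14, 15],
    2: [12, 10, 15, 19, 11],
    3: [23, 20, 18, 17, 19],
    4: [27, 33, 22, 26, 28],
}
reference = 4  # the category whose dummy column is dropped

group_means = {box: sum(v) / len(v) for box, v in sales.items()}

# The intercept is the reference-group mean; every other coefficient is the
# difference between its group mean and the reference-group mean.
b0 = group_means[reference]
slopes = {box: group_means[box] - b0 for box in sales if box != reference}

# Predicted value for each box type: intercept plus that type's coefficient.
predicted = {box: b0 + slopes.get(box, 0.0) for box in sales}

print(f"b0 = {b0:.1f}")
for box, b in slopes.items():
    print(f"b{box} = {b:.1f}")
print({box: round(value, 1) for box, value in predicted.items()})
```

Choosing a different reference category would change every coefficient, while the four predicted means stay the same.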
Hypothesis Test
To test that all means are equal to each other ($H_0: \mu_1 = \mu_2 = \ldots = \mu_k$) against the hypothesis that at least one mean differs ($H_1$: at least one $\mu_j \neq \mu_{j'}$), called an omnibus test, the same hypothesis test from before can be used:
$$F = \frac{R^2/k}{(1 - R^2)/(N - k - 1)}$$
In the F statistic, k is the number of dummy-coded predictors in the model (the number of groups minus one).
For the two-group example: $F = \frac{0.4/1}{(1 - 0.4)/(10 - 1 - 1)} = 5.33$
For the cereal data (consistent with the SAS output):
SSTO = 746.55
SSE = 158.4
SSR = 746.55 - 158.4 = 588.15
$R^2 = 588.15/746.55 = 0.788$
Hypothesis Tests
$F = \frac{MSR}{MSE} = \frac{588.15/3}{158.4/16} = 19.80$
From Excel (=fdist(19.80,3,16)), p < 0.0001.
If we used a Type-I error rate of 0.05, we would reject the null hypothesis and conclude that at least one regression coefficient in this analysis is significantly different from zero.
Having a regression coefficient of zero means having zero difference between two means (the reference category and the specific category being compared).
Having all regression coefficients of zero means absolutely no difference between any of the means.
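The omnibus test can be recomputed from the twenty scores in the data table. The sketch below is Python rather than SAS, with illustrative variable names. It computes the sums of squares directly and then applies the $R^2$ form of the F statistic. The result agrees with the SAS output shown earlier (Model sum of squares 588.15, F Value 19.80).

```python
sales = {
    1: [11, 17, 16, 14, 15],
    2: [12, 10, 15, 19, 11],
    3: [23, 20, 18, 17, 19],
    4: [27, 33, 22, 26, 28],
}
scores = [v for values in sales.values() for v in values]
n_total = len(scores)
k = len(sales) - 1  # number of dummy-coded predictors in the model

grand_mean = sum(scores) / n_total
ssto = sum((v - grand_mean) ** 2 for v in scores)
# Predicted values are the group means, so SSE is the within-group variation.
sse = sum(
    (v - sum(values) / len(values)) ** 2
    for values in sales.values()
    for v in values
)
ssr = ssto - sse
r_squared = ssr / ssto
f_stat = (r_squared / k) / ((1 - r_squared) / (n_total - k - 1))

print(f"SSTO = {ssto:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print(f"R-square = {r_squared:.3f}, F = {f_stat:.2f} on {k} and {n_total - k - 1} df")
```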
Final Thought
Regression with categorical variables can be accomplished by coding schemes.
Differing ways of coding (or inclusion of certain coded column vectors) may change the interpretation of the model parameters, but will not change the overall fit of the model.
Combinations of categorical and continuous X variables lead to something called Analysis of Covariance (see: Psych 791).
Next Time
The final exam is handed out.
We discuss the final.
We all say goodbye for the year.