Chi-Square and Analysis of Variance (ANOVA)
Lecture 9
The Chi-Square Distribution and Test for Independence
Hypothesis testing between two or more categorical variables
Chi-square Test of Independence
Tests the association between two nominal (categorical) variables.
Null hypothesis: the two variables are independent.
It's really just a comparison between expected frequencies and observed frequencies among the cells of a crosstabulation table.
Example crosstab: gender x binary question (expected frequencies in parentheses)

            Yes           No            Total
Males       46 (40.97)    71 (76.02)    117
Females     37 (42.03)    83 (77.97)    120
Total       83            154           237
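The expected frequencies in the table above come straight from the row and column totals. A pure-Python sketch that reproduces them and the resulting chi-square statistic (counts are from the slide; 3.841 is the standard 5% critical value for df = 1):

```python
# Chi-square test of independence on the gender x binary-question crosstab.
observed = [[46, 71],   # males:   yes, no
            [37, 83]]   # females: yes, no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row total * column total) / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1)(c-1) = 1

print([[round(e, 2) for e in row] for row in expected])
# matches (to rounding) the parenthesized values: 40.97, 76.03, 42.03, 77.97
print(round(chi2, 3), df)
# chi2 is about 1.87, below the 5% critical value of 3.841 for df = 1,
# so we fail to reject independence here.
```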
Degrees of freedom
Chi-square degrees of freedom
df = (r - 1)(c - 1), where r = # of rows and c = # of columns
Thus, in any 2x2 contingency table, the degrees of freedom = 1.
As the degrees of freedom increase, the distribution shifts to the right and the critical values of chi-square become larger.
Chi-Square Distribution
The chi-square distribution results when independent variables with standard normal distributions are squared and summed.
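This can be checked by simulation: summing k squared standard-normal draws produces a chi-square variate with k degrees of freedom, which has mean k and variance 2k. A stdlib-only sketch (seed and sample size are arbitrary choices):

```python
import random

random.seed(0)
k = 3              # degrees of freedom
n_draws = 10_000

# Each draw: sum of k squared standard-normal values
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(k))
         for _ in range(n_draws)]

mean = sum(draws) / n_draws
var = sum((d - mean) ** 2 for d in draws) / n_draws
print(round(mean, 2), round(var, 2))   # close to k = 3 and 2k = 6
```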
Requirements for Chi-Square test
Must be a random sample from the population
Data must be in raw frequencies (counts)
Observations must be independent of one another
Categories for each variable must be mutually exclusive and exhaustive
Using the Chi-Square Test
Often used with contingency tables (i.e., crosstabulations), e.g., gender x race.
Basically, the chi-square test of independence tests whether the columns are contingent on the rows in the table.
In this case, the null hypothesis is that there is no relationship between row and column frequencies.
Practical Example:
Expected frequencies versus observed frequencies
General Social Survey example
ANOVA and the F-distribution
Hypothesis testing between a 3+ category variable and a metric variable
Analysis of Variance
In its simplest form, it is used to compare means for three or more categories.
Example: life happiness scale and marital status (married, never married, divorced)
Relies on the F-distribution.
Just like the t-distribution and chi-square distribution, there is a different sampling distribution for each possible value of df.
What is ANOVA?
If we have a categorical variable with 3+ categories and a metric/scale variable, we could just run 3 t-tests.
The problem is that the 3 tests would not be independent of each other (i.e., all of the information is known).
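A related problem with running several t-tests is that each one carries its own Type I error risk, and those risks compound. A quick sketch (assuming, for simplicity, that the tests were independent, with alpha = .05):

```python
def familywise_error(n_tests, alpha=0.05):
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 3, 10):
    print(k, round(familywise_error(k), 3))
# 1 test   -> 0.05
# 3 tests  -> 0.143   (the three pairwise t-tests for 3 groups)
# 10 tests -> 0.401
```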
A better approach: compare the variability between groups (treatment variance + error) to the variability within the groups (error).
The F-ratio
MS = mean square
bg = between groups
wg = within groups

F = MS_bg / MS_wg

Numerator is the "effect" and denominator is the "error".
Numerator df = # of categories - 1 (k - 1); denominator df = N - k.
Between-Group Sum of Squares (Numerator)
Between-group variability = total variability - residual variability
Total variability is quantified as the sum of the squares of the differences between each value and the grand mean. Also called the total sum of squares.
Variability within groups is quantified as the sum of squares of the differences between each value and its group mean. Also called the residual sum of squares.
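The whole decomposition can be computed by hand. A sketch on invented happiness scores for three hypothetical marital-status groups (the data are made up for illustration):

```python
# One-way ANOVA "by hand" on hypothetical data.
groups = {
    "married":       [7, 8, 6, 9, 8],
    "never married": [5, 6, 7, 5, 6],
    "divorced":      [4, 5, 6, 5, 4],
}

all_scores = [x for g in groups.values() for x in g]
n = len(all_scores)
k = len(groups)
grand_mean = sum(all_scores) / n

# Total SS: squared deviations of every score from the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

# Within (residual) SS: squared deviations from each group's own mean
ss_within = sum(
    sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups.values()
)

# Between SS is what's left: total variability minus residual variability
ss_between = ss_total - ss_within

ms_between = ss_between / (k - 1)   # numerator df = k - 1 = 2
ms_within = ss_within / (n - k)     # denominator df = N - k = 12
F = ms_between / ms_within
print(round(F, 2))
# For these made-up data F is about 11.2, well above the 5% critical
# value of roughly 3.89 for df = (2, 12).
```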
Null Hypothesis in ANOVA
If there is no difference between the means, then the between-group mean square should equal the within-group mean square, so that

F = MS_bg / MS_wg ≈ 1
F-distribution
The F-test is always a one-tailed test. Why?
Logic of the ANOVA
Conceptual Intro to ANOVA
Bringing it all together:
Choosing the appropriate bivariate statistic
Reminder About Causality
Remember from earlier lectures: bivariate statistics do not test causal relationships; they only show that there is a relationship.
Even if you plan to use more sophisticated causal tests, you should always run simple bivariate statistics on your key variables to understand their relationships.
Choosing the Appropriate Statistical Test
General rules for choosing a bivariate test:
Two categorical variables: Chi-square (crosstabulations)
Two metric variables: Correlation
One 3+ category categorical variable, one metric variable: ANOVA
One binary categorical variable, one metric variable: t-test
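The decision rules above can be sketched as a small lookup function (the level labels "binary", "categorical", and "metric" are just illustrative names, not a standard API):

```python
def choose_bivariate_test(var1, var2):
    """Pick a bivariate test from the two variables' measurement levels.

    Levels used here: 'binary', 'categorical' (3+ groups), 'metric'.
    """
    levels = {var1, var2}
    if levels <= {"binary", "categorical"}:
        return "chi-square (crosstab)"       # two categorical variables
    if levels == {"metric"}:
        return "correlation"                 # two metric variables
    if levels == {"categorical", "metric"}:
        return "ANOVA"                       # 3+ groups vs. metric
    if levels == {"binary", "metric"}:
        return "t-test"                      # two groups vs. metric
    raise ValueError(f"no rule for {var1} x {var2}")

print(choose_bivariate_test("binary", "metric"))    # t-test
print(choose_bivariate_test("metric", "metric"))    # correlation
```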
Assignment #2
Online (course website)
Due next Monday in class (April 10th)