Agresti/Franklin Statistics, 1 of 52
Chapter 3 Association: Contingency, Correlation, and Regression
Learn ….
How to examine links between two variables
Section 3.2
How Can We Explore the Association Between Two Quantitative Variables?
Scatterplot
Graphical display of two quantitative variables:
• Horizontal Axis: Explanatory variable, x
• Vertical Axis: Response variable, y
Example: Internet Usage and Gross Domestic Product (GDP)
Positive Association
Two quantitative variables, x and y, are said to have a positive association when high values of x tend to occur with high values of y, and when low values of x tend to occur with low values of y
Negative Association
Two quantitative variables, x and y, are said to have a negative association when high values of x tend to occur with low values of y, and when low values of x tend to occur with high values of y
Example: Did the Butterfly Ballot Cost Al Gore the 2000 Presidential Election?
Linear Correlation: r
Measures the strength of the linear association between x and y
• A positive r-value indicates a positive association
• A negative r-value indicates a negative association
• An r-value close to +1 or -1 indicates a strong linear association
• An r-value close to 0 indicates a weak association
Calculating the correlation, r
$$r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$$
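As a minimal sketch of the calculation, the correlation can be computed as the sum of products of z-scores for x and y, divided by n - 1. The function name `correlation` and the car data below are illustrative assumptions, not taken from the slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data (assumption): car age in years, mileage in thousands of miles
age = [1, 3, 5, 8, 10]
mileage = [12, 35, 61, 90, 118]

r = correlation(age, mileage)
print(round(r, 3))  # close to +1: a strong positive linear association
```

Older cars in this made-up sample tend to have more miles, so r comes out close to +1.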
Example: 100 cars on the lot of a used-car dealership
Would you expect a positive association, a negative association, or no association between the age of the car and the mileage on the odometer?
• Positive association
• Negative association
• No association
Section 3.3
How Can We Predict the Outcome of a Variable?
Regression Line
Predicts the value for the response variable, y, as a straight-line function of the value of the explanatory variable, x
$$\hat{y} = a + bx$$
Example: How Can Anthropologists Predict Height Using Human Remains?
Regression Equation:
$$\hat{y} = 61.4 + 2.4x$$
$\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters
Use the regression equation to predict the height of a person whose femur length was 50 centimeters
$$\hat{y} = 61.4 + 2.4(50) = 181.4 \text{ cm}$$
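The prediction can be checked with a few lines of code. This is only a sketch: the function name `predict_height` is an assumption, while the intercept 61.4 and slope 2.4 come from the regression equation in the example.

```python
def predict_height(femur_cm, a=61.4, b=2.4):
    """Predicted height (cm) from femur length (cm): y-hat = a + b * x."""
    return a + b * femur_cm

predicted = predict_height(50)
print(round(predicted, 1))  # 61.4 + 2.4 * 50 = 181.4
```

Calling `predict_height(51)` instead raises the prediction by 2.4 cm, which is the slope.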
Interpreting the y-Intercept
y-Intercept:
• The predicted value for y when x = 0
• Helps in plotting the line
• May not have any interpretative value if no observations had x values near 0
Interpreting the Slope
Slope: measures the change in the predicted value of the response variable for every one-unit increase in the explanatory variable
Example: A 1 cm increase in femur length results in a 2.4 cm increase in predicted height
Slope Values: Positive, Negative, Equal to 0
Residuals
Measure the size of the prediction errors
Each observation has a residual
Calculation for each residual: $y - \hat{y}$
A large residual indicates an unusual observation
Large residuals can easily be found by constructing a histogram of the residuals
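A small sketch of how residuals single out an unusual observation, using the femur equation from the earlier example (intercept 61.4, slope 2.4). The observations below are hypothetical and the helper name `residuals` is an assumption. A plain listing stands in for the histogram.

```python
def residuals(xs, ys, a, b):
    """Residual for each observation: observed y minus predicted a + b * x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical observations (assumption): femur length (cm) and height (cm)
femur = [42, 45, 48, 50, 53]
height = [162.0, 170.5, 176.0, 196.0, 188.0]

res = residuals(femur, height, a=61.4, b=2.4)
largest = max(range(len(res)), key=lambda i: abs(res[i]))

for x, y, e in zip(femur, height, res):
    print(f"x={x}  y={y}  residual={e:+.1f}")
print("largest residual at index", largest)  # index 3, the 196.0 cm observation
```

Four of the residuals are within about 1 cm of zero, while the fourth observation sits 14.6 cm above its predicted height.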
“Least Squares Method” Yields the Regression Line
Residual sum of squares:
$$\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$$
The optimal line through the data is the line that minimizes the residual sum of squares
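To illustrate the minimizing property, the sketch below fits a line with the standard closed-form least-squares solution and checks that nudging the intercept or slope in either direction increases the residual sum of squares. The data and the names `rss` and `least_squares` are assumptions for illustration.

```python
from statistics import mean

def rss(xs, ys, a, b):
    """Residual sum of squares for the line y-hat = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Closed-form least-squares intercept and slope."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data (assumption)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best = rss(xs, ys, a, b)

# Any nudge away from the least-squares line gives a larger RSS
for da, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    print(da, db, rss(xs, ys, a + da, b + db) > best)
```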
Regression Formulas for y-Intercept and Slope
Slope:
$$b = r\left(\frac{s_y}{s_x}\right)$$
Y-Intercept:
$$a = \bar{y} - b\bar{x}$$
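The two formulas translate directly into code: the slope is r times the ratio of standard deviations, and the intercept is the mean of y minus b times the mean of x. This is a sketch with hypothetical data, and `fit_line` is an assumed name.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (a, b, r) using b = r * (s_y / s_x) and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * (s_y / s_x)
    a = y_bar - b * x_bar
    return a, b, r

# Hypothetical data (assumption)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 0.05 1.99
```

One consequence of the intercept formula is that the fitted line always passes through the point of means, since a + b * x_bar equals y_bar.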
The Slope and the Correlation
Correlation:
• Describes the strength of the association between 2 variables
• Does not change when the units of measurement change
• It is not necessary to identify which variable is the response and which is the explanatory
Slope:
• Numerical value depends on the units used to measure the variables
• Does not tell us whether the association is strong or weak
• The two variables must be identified as response and explanatory variables
• The regression equation can be used to predict the response variable
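The contrast between the two measures can be checked numerically. In this sketch, femur length is converted from centimeters to inches: the correlation stays the same while the slope is rescaled by the conversion factor. The data and function names are assumptions for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r computed from deviations and sample standard deviations."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar)
               for x, y in zip(xs, ys)) / ((n - 1) * stdev(xs) * stdev(ys))

def slope(xs, ys):
    """Regression slope b = r * (s_y / s_x), with xs as the explanatory variable."""
    return correlation(xs, ys) * stdev(ys) / stdev(xs)

# Hypothetical measurements (assumption): femur length and height in cm
femur_cm = [42, 45, 48, 50, 53]
height_cm = [162.0, 170.5, 176.0, 181.0, 188.0]
femur_in = [x / 2.54 for x in femur_cm]  # same lengths expressed in inches

r_cm, r_in = correlation(femur_cm, height_cm), correlation(femur_in, height_cm)
b_cm, b_in = slope(femur_cm, height_cm), slope(femur_in, height_cm)

print(round(r_cm, 4), round(r_in, 4))  # identical correlations
print(round(b_cm, 3), round(b_in, 3))  # slope per inch is 2.54 times slope per cm
```

Swapping the roles of the two variables also leaves r unchanged, whereas the slope of height on femur length is a different number from the slope of femur length on height.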
Section 3.4
What Are Some Cautions in Analyzing Associations?
Extrapolation
Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data
• Riskier the farther we move from the range of the given x-values
• There is no guarantee that the relationship will have the same trend outside the range of x-values
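One simple safeguard is to flag any prediction whose x-value lies outside the observed range. The sketch below uses the femur equation from the earlier example. The observed range of 40 to 55 cm is a made-up assumption, and `predict_with_range_check` is a hypothetical helper.

```python
def predict_with_range_check(x, a, b, x_min, x_max):
    """Return (prediction, is_extrapolation) for the line y-hat = a + b * x."""
    is_extrapolation = not (x_min <= x <= x_max)
    return a + b * x, is_extrapolation

# Assumed observed range of femur lengths in the sample (hypothetical)
X_MIN, X_MAX = 40, 55

for femur_cm in (50, 0, 80):
    pred, flagged = predict_with_range_check(femur_cm, 61.4, 2.4, X_MIN, X_MAX)
    note = "extrapolation, treat with caution" if flagged else "within observed range"
    print(f"x={femur_cm}: predicted height {pred:.1f} cm ({note})")
```

The line still returns a number for a femur length of 0 cm, but that prediction says nothing reliable because no observations are anywhere near x = 0.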
Regression Outliers
Construct a scatterplot
Search for data points that are well removed from the trend that the rest of the data points follow
Influential Observation
An observation that has a large effect on the regression analysis
Two conditions must hold for an observation to be influential:
• Its x-value is relatively low or high compared to the rest of the data
• It is a regression outlier, falling quite far from the trend that the rest of the data follow
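The two conditions can be illustrated by refitting the line with and without a suspect point. In this sketch, a regression outlier in the middle of the x-range barely moves the slope, while one at an extreme x-value changes it drastically. All data are hypothetical, and `slope` is an assumed helper using the standard least-squares formula.

```python
from statistics import mean

def slope(points):
    """Least-squares slope for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data (assumption) that follow a trend close to y = x
base = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.0), (5, 5.2), (6, 5.9), (7, 7.1)]

b_base = slope(base)
b_mid = slope(base + [(4, 10.0)])   # outlier, but x is in the middle of the data
b_far = slope(base + [(15, 3.0)])   # outlier with an unusually high x-value

print(round(b_base, 3), round(b_mid, 3), round(b_far, 3))
```

With these numbers the slope stays near 0.99 when the mid-range outlier is added, but falls to roughly 0.13 when the high-x outlier is added, so only the second point is influential.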
Which Regression Outlier is Influential?
Example: Does More Education Cause More Crime?
Correlation Does Not Imply Causation
A correlation between x and y means that there is a linear trend between the two variables
A correlation between x and y does not mean that x causes y
Lurking Variable
A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest
Simpson’s Paradox
The direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable
Example: Is Smoking Actually Beneficial to Your Health?
An association can look quite different after adjusting for the effect of a third variable by grouping the data according to the values of the third variable
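A small numerical sketch shows how such a reversal can arise. The counts below are invented for illustration and are NOT the data from the smoking example. The age grouping, the dictionary layout, and the helper `pooled_rate` are all assumptions.

```python
# Hypothetical counts (assumption, not the slide data): (survived, total)
counts = {
    "younger": {"smoker": (190, 200), "nonsmoker": (98, 100)},
    "older":   {"smoker": (20, 50),   "nonsmoker": (100, 200)},
}

def pooled_rate(status):
    """Survival rate for one smoking status, ignoring age group."""
    survived = sum(group[status][0] for group in counts.values())
    total = sum(group[status][1] for group in counts.values())
    return survived / total

# Ignoring age, smokers appear to survive at a higher rate
print("pooled:", pooled_rate("smoker"), pooled_rate("nonsmoker"))  # 0.84 vs 0.66

# Within each age group, the direction reverses
for age, group in counts.items():
    s, n = group["smoker"], group["nonsmoker"]
    print(age, s[0] / s[1], n[0] / n[1])
```

In these made-up numbers most smokers are in the younger group, where survival is high for everyone, so age acts as the third variable that distorts the pooled comparison.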
Data are available for all fires in Chicago last year on x = number of firefighters at the fires and y = cost of damages due to fire
Would you expect the correlation to be negative, zero, or positive?
a. Negative
b. Zero
c. Positive
If the correlation is positive, does this mean that having more firefighters at a fire causes the damages to be worse?
a. Yes
b. No
Identify a third variable that could be considered a common cause of x and y:
a. Distance from the fire station
b. Intensity of the fire
c. Time of day that the fire was discovered