Chapter 3: Examining Relationships
Introduction: Most statistical studies involve more than one variable. In Chapters 1 and 2 we used boxplots, stem-and-leaf plots, and histograms to analyze one-variable distributions. In this chapter we will examine studies with two or more variables.
Exploration: SAT Activity - Verbal vs. Math Scores Consider the following: Is there an obvious pattern? Can you describe the pattern? Can you see any obvious association between SAT Math
and SAT Verbal?
I. Variables –
A. Response – measures an outcome of a study
B. Explanatory – attempts to explain the observed outcome
Example 3.1 Effect of Alcohol on Body Temperature
Alcohol has many effects on the body. One effect is a drop in body
temperature. To study this effect, researchers give several different amounts of
alcohol to mice, then measure the change in each mouse’s body temperature in
the 15 minutes after taking the alcohol.
Response Variable –
Explanatory Variable –
Note: When attempting to predict future outcomes, we are interested in
the response variable.
• In Example 3.1 alcohol actually causes a change in body
temperature. Sometimes there is no cause-and-effect
relationship between variables. As long as scores are closely
related, we can use one to predict the other.
• Correlation and Causation
Exercise 3.1 Explanatory and Response Variables
In each of the following situations, is it more reasonable to simply explore the
relationship between the two variables or to view one of the variables as an
explanatory variable and the other as a response variable? In the latter case,
which is the explanatory variable and which is the response variable?
(a) The amount of time spent studying for a stat test and the grade on the test
(b) The weight and height of a person
(c) The amount of yearly rainfall and the yield of a crop
(d) A student’s grades in statistics and French
(e) The occupational class of a father and of a son
• (a) Time spent studying is explanatory; the grade is
the response variable.
• (b) Explore the relationship; there is no reason to
view one or the other as explanatory.
• (c) Rainfall is explanatory; crop yield is the response
variable.
• (d) Explore the relationship.
• (e) The father’s class is explanatory; the son’s class is
the response variable.
Exercise 3.2 Quantitative and Categorical Variables
How well does a child’s height at age 6 predict height at
age 16? To find out, measure the heights of a large
group of children at age 6, wait until they reach age 16,
then measure their heights again. What are the
explanatory and response variables here? Are these
variables categorical or quantitative?
Height at age six is explanatory, and height at age 16 is the
response variable. Both are quantitative.
3.1 Scatterplots
Scatterplots show the relationship between two quantitative variables taken on the same
individual observation.
Exercise 3.7 Are Jet Skis Dangerous
An article in the August 1997 issue of the Journal of the American Medical Association
reported on a survey that tracked emergency room visits at randomly selected hospitals
nationwide. Here are data on the number of jet skis in use, the number of accidents, and
the number of fatalities for the years 1987 – 1996:
Year   Number in use   Accidents   Fatalities
1987          92,756         376            5
1988         126,881         650           20
1989         178,510         844           20
1990         241,376       1,162           28
1991         305,915       1,513           26
1992         372,283       1,650           34
1993         454,545       2,236           35
1994         600,000       3,002           56
1995         760,000       4,028           68
1996         900,000       4,010           55
Interpreting Scatterplots
Look for an overall pattern and for striking deviations from that
pattern.
In particular, look at the direction, form, and strength of the relationship
between the two variables.
a) direction – negative, positive
b) form – clustering of data, linear, quadratic, etc.
c) strength – how closely the points follow a clear form
Positive correlation (direction) – when above average values of one
variable tend to accompany above average values of the other variable.
Also, the same for below average.
Negative correlation (direction) – when above average values of
one variable tend to accompany below average values of the other
variable.
Note: The more students who took the test, the lower the state average; this is the “open enrollment” concept.
No correlation
Linear Relationship – when points lie in a straight line pattern.
Outliers – in this case, deviations from the overall scatterplot patterns.
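The definitions of positive and negative direction above can be checked numerically. The following is a minimal sketch, assuming the jet ski data from Exercise 3.7 (number in use versus accidents); the helper name `count_sides` is hypothetical and not part of the original notes. It counts how many points lie on the same side of both averages and how many lie on opposite sides.

```python
from statistics import mean

# Jet ski data from Exercise 3.7 (1987-1996).
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]

def count_sides(x, y):
    """Count points on the same side vs. opposite sides of the two means."""
    x_bar, y_bar = mean(x), mean(y)
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    same = sum(1 for p in products if p > 0)      # both above or both below average
    opposite = sum(1 for p in products if p < 0)  # one above, one below average
    return same, opposite

same, opposite = count_sides(in_use, accidents)
print(same, opposite)  # prints: 10 0
```

Every year falls on the same side of both averages, which is what a positive direction looks like in the data.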
III. Drawing Scatterplots –
Step 1. Scale the horizontal and vertical axes. The intervals along each axis MUST be
the same size.
Step 2. Label both axes.
Step 3. Adopt a scale that uses the whole grid.
Step 4. Add a categorical variable by using a different plotting symbol
or color.
A scatterplot with categorical data included.
III. Example 3.4 pg 127: Heating Degree-Days
end Sec. 3.1
Sec. 3.2 Correlation ( r ) - is a measure of the direction and strength
of the linear relationship between two
quantitative variables.
I. Characteristics of “r”-
A. Indicates the direction of a linear relationship by its sign
B. Always satisfies $-1 \le r \le 1$
C. Ignores the distinction between explanatory and response
variables
D. Requires that both variables be quantitative
E. Is not resistant*
* In this respect, correlation joins the mean and standard deviation: all three are strongly affected by outliers.
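Characteristics B and C can be illustrated with a short computation. This is a sketch only: the notes do not give a formula for r, so the code assumes the standard definition (the sum of products of standardized values, divided by n - 1), and the function name `correlation` is hypothetical. The data are the jet ski figures from Exercise 3.7.

```python
from statistics import mean, stdev

# Jet ski data from Exercise 3.7 (1987-1996).
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]

def correlation(x, y):
    """Assumed standard definition: average product of standardized values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)

r_xy = correlation(in_use, accidents)  # x = number in use, y = accidents
r_yx = correlation(accidents, in_use)  # roles of x and y swapped
print(round(r_xy, 3), round(r_yx, 3))
```

Swapping the two lists gives the same value, and the result lies between -1 and 1.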
II. Graphical Interpretations of “r”
II. Least-Squares Regression-
Summary:
The method of least-squares is a standard approach to the approximate
solution of overdetermined systems (sets of equations in which there are
more equations than unknowns). "Least-squares" means that the overall
solution minimizes the sum of the squares of the errors made in solving
every single equation.
The most important application is in data fitting. The best fit in the
least-squares sense minimizes the sum of squared residuals, a residual being the
difference between an observed value and the fitted value provided by a
model.
A regression line is a straight line that describes how a response
variable y changes as an explanatory variable x changes. A
regression line is often used to predict the value of y for a given
value of x. Regression, unlike correlation, requires that we have an
explanatory variable and a response variable.
Simply put,
III. Recall: Correlation (r) makes no use of the distinction between explanatory
and response variables. That is, it is not necessary that we define one of the
quantitative variables as a response variable and the second as the explanatory
variable. For the calculation of r, it doesn’t matter which one
is the y (D.V.) and which one is x (I.V.).
In least-squares regression we are required to designate
a response variable (y) and an explanatory variable (x).
$\hat{y} = a + bx$
IV. Regression line – a line that describes how a response variable (y)
changes as an explanatory variable (x) changes.
A. Regression lines are often used to predict the value of “y” for a
given value of “x”.
Equation of the regression line:
$\hat{y} = a + bx$
where
$b = r \frac{s_y}{s_x}$ is the slope,
$a = \bar{y} - b\bar{x}$ is the intercept, and
$\hat{y}$ is the predicted rather than an actual "y" for any "x".
“y” is the observed value; $\hat{y}$ is the predicted value of “y”.
B. Least-squares regression line of “y” on “x” is the line that makes
the sum of the squares of the vertical distances (the distances in “y”) of the
data points from the line as small as possible. This minimizes the
error in predicting “y”.
$\hat{y} = a + bx$
C. Prediction errors are errors in “y” which is the vertical direction in the
scatterplot and are calculated as
error = observed value - predicted value
D. To determine the equation of a least-squares line, we must solve for two
unknowns: the intercept “a” and the slope “b”. (1) It can be
shown that every least-squares regression line passes through the point
$(\bar{x}, \bar{y})$, which gives the intercept once the slope is known. (2) The slope of the least-squares line is equal to the product
of the correlation r and the quotient of the s.d.’s:
$b = r \frac{s_y}{s_x}$
E. Constructing the least-squares equation-
Given some explanatory and response variable with the following stat
summary: $\bar{x} = 17.222$, $\bar{y} = 161.111$, $s_x = 19.696$, $s_y = 33.479$, $r = 0.997$
Even though we don’t know the actual data we can still construct the
equation and use it to make predictions. The slope and intercept can be
calculated as
$b = r \frac{s_y}{s_x} = 0.997 \times \frac{33.479}{19.696} = 1.695$
$a = \bar{y} - b\bar{x} = 161.111 - (1.695)(17.222) = 131.920$
So that the least-squares line has equation $\hat{y} = 131.920 + 1.695x$
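The arithmetic in part E can be reproduced in a few lines. This is a minimal sketch: the function name `lsrl_from_summary` and the `digits` parameter are hypothetical, and the slope is rounded to three decimals before the intercept is computed because that is the order of operations the worked example appears to use.

```python
def lsrl_from_summary(x_bar, y_bar, s_x, s_y, r, digits=3):
    """Slope and intercept of the least-squares line from summary statistics."""
    b = round(r * s_y / s_x, digits)      # slope: b = r * s_y / s_x
    a = round(y_bar - b * x_bar, digits)  # intercept: a = y_bar - b * x_bar
    return a, b

# Summary statistics from part E of the notes.
a, b = lsrl_from_summary(x_bar=17.222, y_bar=161.111,
                         s_x=19.696, s_y=33.479, r=0.997)
print(a, b)

def predict(x):
    """Predicted y for a given x, using the fitted line."""
    return a + b * x

# x = 20 is an arbitrary illustrative value, not taken from the notes.
print(predict(20))
```

Carrying the unrounded slope through instead would shift the intercept slightly in the third decimal place.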
IV. The Role of $r^2$ in Regression Analysis
We know that the correlation r is the slope of the least-squares regression
line when we measure both x and y in standardized units. The square of
the correlation, $r^2$, is the fraction of the variance of one variable that is
explained by least-squares regression on the other variable.
When you report a regression, give r-square as a measure of how
successful the regression was in explaining the response.
While correlation coefficients are normally reported as r = (a value
between -1 and +1), squaring them makes them easier to understand. The
square of the coefficient (or r-square) is equal to the fraction of the
variation in one variable that is related to the variation in the other. To
express it as a percent, multiply r-square by 100. An r of .5 means 25% of the
variation is related (.5 squared = .25). An r value of .7 means 49% of
the variation is related (.7 squared = .49).
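The "fraction of variation explained" reading of r-square can be checked directly. The sketch below uses a small hypothetical data set (not from the notes) and compares r-square with one minus the ratio of the residual sum of squares to the total sum of squares; the variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, chosen only to keep the arithmetic simple.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # total variation in y

r = s_xy / sqrt(s_xx * s_yy)  # correlation
b = s_xy / s_xx               # least-squares slope (same as r * s_y / s_x)
a = y_bar - b * x_bar         # least-squares intercept

# Variation left over after the regression: sum of squared residuals.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
explained = 1 - sse / s_yy

print(round(r ** 2, 3), round(explained, 3))  # prints: 0.6 0.6
```

For these numbers, 60% of the variation in y is accounted for by the line and 40% remains in the residuals.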
V. Summary of Least-Squares Regression-
A. The distinction between explanatory and response variables is essential
in regression.
B. There is a close relationship between correlation and the slope of the
least-squares line. The slope is
$b = r \frac{s_y}{s_x}$
This means that along the regression line, a change of one s.d. in x corresponds to
a change of r standard deviations in y.
C. The LSRL always passes through the point $(\bar{x}, \bar{y})$ on the graph of y
against x.
D. The correlation describes the strength of the straight-line relationship. In
the regression setting, this description takes a specific form: the square of
the correlation is the fraction of the variation in the values of y that is
explained by the least-squares regression of y on x.
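Points B and C of this summary follow in one line each from the slope and intercept formulas $b = r\,s_y/s_x$ and $a = \bar{y} - b\bar{x}$. The short derivation below is a sketch; writing the prediction as a function $\hat{y}(x) = a + bx$ is a notational assumption made here for convenience.

```latex
\begin{align*}
\hat{y}(\bar{x}) &= a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}
  && \text{(the line passes through } (\bar{x}, \bar{y})\text{)} \\
\hat{y}(x + s_x) - \hat{y}(x) &= b\,s_x = r\,\frac{s_y}{s_x}\,s_x = r\,s_y
  && \text{(one s.d. in } x \text{ gives } r \text{ s.d.'s in } y\text{)}
\end{align*}
```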
VI. Residuals - a residual is the difference between an observed value of the
response variable and the value predicted by the regression line. That is,
residual = observed y – predicted y = $y - \hat{y}$
A. A residual plot is a scatterplot of the regression residuals against the explanatory variable. These plots assist us in assessing the fit of a regression line.
B. Residual example:
A study was conducted to test whether the age at which a child begins
to talk can be reliably used to predict scores on a test of mental ability
taken years later, in this instance the “Gesell Adaptive
Score” test. The data appear on the next slide.
Scatterplot of Gesell Adaptive Scores
vs the age at first word for 21 kids.
The line is the LSRL for predicting
Gesell score from age at first word.
The residual plot for the regression
shows how far the data fall from the
regression line. Child 19 is an outlier
and child 18 is an “influential”
observation that does not have a large
residual.
The plot shows a negative relationship that is
moderately linear, with r = -0.640.
Uniform scatter indicates that the
regression line fits the data well.
The curved pattern means that a
straight line is an inappropriate
model.
The response variable y has more
spread for larger values of the
explanatory variable x, so the
prediction will be less accurate
when x is larger.
C. The residuals from the least-squares line have the special property:
the mean of the least-squares residuals is always zero.
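The zero-mean property of the residuals can be verified numerically. The following is a minimal sketch using the jet ski data from Exercise 3.7, with accidents as the response and number in use as the explanatory variable; the helper name `fit_line` is hypothetical, and the slope is computed as the ratio of sums of deviations, which is algebraically the same as $r\,s_y/s_x$.

```python
from statistics import mean

# Jet ski data from Exercise 3.7 (1987-1996).
in_use = [92756, 126881, 178510, 241376, 305915,
          372283, 454545, 600000, 760000, 900000]
accidents = [376, 650, 844, 1162, 1513, 1650, 2236, 3002, 4028, 4010]

def fit_line(x, y):
    """Return intercept a and slope b of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx        # equivalent to r * s_y / s_x
    a = y_bar - b * x_bar  # forces the line through (x_bar, y_bar)
    return a, b

a, b = fit_line(in_use, accidents)

# residual = observed y - predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(in_use, accidents)]
mean_residual = mean(residuals)
print(abs(mean_residual) < 1e-6)  # prints: True
```

The mean is zero only up to floating-point rounding, which is why the check uses a small tolerance rather than exact equality.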
VI. Influential Observations – an observation is influential for a statistical
calculation if removing it would markedly change the result of the calculation.
A. Outlier- is an observation that lies outside the overall pattern of other
observations.
B. Points that are outliers in the x direction of a scatterplot are often
influential for the LSRL but need not have large residuals.
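The effect described in point B can be seen by fitting a line with and without a point that is an outlier in the x direction. The data below are hypothetical (they are not the Gesell data), and `fit_line` is an illustrative helper name; this is a sketch of the idea rather than a formal influence diagnostic.

```python
from statistics import mean

def fit_line(x, y):
    """Return intercept a and slope b of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Hypothetical data: the last point (x = 20) lies far out in the x direction.
x = [1, 2, 3, 4, 5, 20]
y = [2, 4, 5, 4, 5, 2]

a_with, b_with = fit_line(x, y)                  # all six points
a_without, b_without = fit_line(x[:-1], y[:-1])  # outlier removed

# Residuals from the line fitted to all six points.
residuals = [yi - (a_with + b_with * xi) for xi, yi in zip(x, y)]
outlier_residual = residuals[-1]
largest_other = max(abs(e) for e in residuals[:-1])

print(round(b_with, 3), round(b_without, 3))
print(abs(outlier_residual) < largest_other)  # prints: True
```

Removing the single point flips the slope from negative to positive, yet that point's own residual is smaller than the largest residual among the other points, because it pulls the line toward itself.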
end Sec. 3.3 and
Chpt. 3