9/15/15
Correlation and Regression
STA 2300
Chapter 4
Response Variable
• Also called the dependent variable
• Measures an outcome of a study
• Is often a/the primary focus of the study
• Individual values are called responses
Explanatory Variable
• Also called the predictor or the independent variable
• Influences the response variable directly or indirectly
• Helps explain variation in responses
• Can be used to predict responses
Graphical Approach
[Diagram: Inputs (explanatory variable) → Process → Outputs (response variable)]
Examples
• Quarterback’s salary for next season and number of touchdowns thrown
    Explanatory: ________    Response: ________
• Weight loss and amount of exercise
    Explanatory: ________    Response: ________
• Years in job and income
    Explanatory: ________    Response: ________
Scatterplots
Represent the association between two quantitative variables measured on the same individuals.

Person    Monthly Income    Monthly Rent
Sally     $4,000            $925
Austin    $1,200            $425
Joe       $600              $635
Ginny     $2,000            $600
Ruby      $0                $525

[Figure: scatterplot of Rent (vertical axis, $0 to $1,000) against Income (horizontal axis, $0 to $4,000)]
What to Look For in a Scatterplot
• Form of the overall pattern
• Direction of the overall pattern
• Strength of the relationship
Form of the Overall Pattern
• Linear
• Curvature
• Deviations from that pattern
[Figure: scatterplot of Rent against Income]
Use the rectangle method to assess whether a linear pattern is present.
Direction of the Overall Pattern
• Positive: As one variable increases, so does the other
• Negative: As one variable increases, the other decreases
• No relationship: None of the above
[Figure: three scatterplots. Positive: Rent against Income. Negative: Taxes against Distance from NYC. No Relationship: DJIA against Outside Temperature.]
Strength of the Relationship
• For linear patterns, use the rectangle to observe the length-to-width ratio
• A large length-to-width ratio indicates a strong relationship; a small ratio indicates a weak relationship
[Figure: three scatterplots of Rent against Income, labeled Weak, Moderate, and Strong]
Adding Explanatory Variables

Female
Mass (kg)    Rate (cal)
36.1         995
54.6         1425
48.5         1396
42           1418
50.6         1502
42           1256
40.3         1189
33.1         913
42.4         1124
34.5         1052
51.1         1347
41.2         1204

Male
Mass (kg)    Rate (cal)
62           1792
62.9         1666
47.4         1362
48.7         1614
51.9         1460
51.9         1867
46.9         1439

[Figure: scatterplots of Rate (800 to 2000) against Mass (30 to 70), one panel for Female and one for Male]
Correlation: r
• Measures the strength and direction of the linear relationship.
• Does NOT describe curved relationships, no matter how strong they are.
• Always a number between -1 and 1.
• Strong linear relationships: r is close to -1 or 1.
• Weak linear relationships: r is close to 0.
    • This does NOT mean that there is no other kind of relationship.
• Requires that both variables be quantitative.
• Has no units.
• Direction of the linear relationship is indicated by the sign of r:
    • Positive relationship: Positive r
    • Negative relationship: Negative r
• Strongly affected by outliers.
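To illustrate that r does not describe curved relationships, here is a minimal sketch. The data are made up for illustration (they do not come from the slides): y is completely determined by x, but along a curve. The helper name `pearson_r` is also an assumption; it computes r as the sum of products of standardized values divided by n - 1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical illustration data: a perfect curved (quadratic) relationship
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(abs(r_curved) < 1e-9)  # r is essentially 0 despite the exact relationship
```

Here r is essentially 0 even though y can be predicted perfectly from x, so a value near 0 only rules out a linear pattern.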
[Figure: example scatterplots labeled Weak and Strong]
Calculation of r
\[
r = \frac{1}{n-1} \sum_{i} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
\]
Note: This requires the mean and standard deviation for both variables.
n represents the number of individuals.
Calculation of r: Example
Here z_x = (x_i - x̄)/s_x and z_y = (y_i - ȳ)/s_y are the standardized values of Size and Price.

Size (GB)    z_x     Price ($)    z_y     Product (z_x × z_y)
8            -0.6    310          -0.5    0.30
6            -0.8    290          -0.6    0.48
18            0.5    500           0.4    0.20
30            1.7    800           1.9    3.23
20            0.7    470           0.3    0.21
10           -0.4    330          -0.4    0.16
3            -1.1    150          -1.2    1.32

                      Size (GB)    Price ($)
Mean                  13.6         407.1
Standard Deviation    9.5          209.0

Sum of products = 5.90
r = 5.90/(7-1) = 0.98
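The worked example can be checked with a short sketch that applies the same steps to the Size and Price values from the table: standardize each variable, multiply the pairs, and divide the sum by n - 1. The variable names are illustrative choices, not part of the slides.

```python
from math import sqrt

sizes = [8, 6, 18, 30, 20, 10, 3]              # Size (GB), from the table
prices = [310, 290, 500, 800, 470, 330, 150]   # Price ($), from the table

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
# Sample standard deviations (divisor n - 1)
s_x = sqrt(sum((x - mean_x) ** 2 for x in sizes) / (n - 1))
s_y = sqrt(sum((y - mean_y) ** 2 for y in prices) / (n - 1))

# Standardized values and their products, one per row of the table
z_x = [(x - mean_x) / s_x for x in sizes]
z_y = [(y - mean_y) / s_y for y in prices]
products = [zx * zy for zx, zy in zip(z_x, z_y)]

r = sum(products) / (n - 1)

print(round(mean_x, 1), round(s_x, 1))   # mean and SD of Size
print(round(mean_y, 1), round(s_y, 1))   # mean and SD of Price
print(round(r, 2))
```

Because the table rounds each standardized value to one decimal before multiplying, its sum of products (5.90) is slightly larger than the unrounded sum (about 5.88); r rounds to 0.98 either way.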
More on r
• Correlation makes no distinction between explanatory and response variables.
    • Labeling of x and y in the calculation is arbitrary.
• Does not change when the units of measurement of x and/or y are changed.
    • This is because the standardized values of the observations are used in the calculation.
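Both properties can be checked numerically. The sketch below reuses the Size and Price data from the worked example; the helper name `pearson_r` and the unit conversions (1 GB taken as 1000 MB, dollars to cents) are assumptions made for illustration.

```python
from math import sqrt, isclose

def pearson_r(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

sizes_gb = [8, 6, 18, 30, 20, 10, 3]
prices_usd = [310, 290, 500, 800, 470, 330, 150]

r = pearson_r(sizes_gb, prices_usd)

# Swapping the roles of x and y leaves r unchanged
r_swapped = pearson_r(prices_usd, sizes_gb)

# Changing units (assumed conversions: GB -> MB, dollars -> cents) leaves r unchanged
sizes_mb = [1000 * s for s in sizes_gb]
prices_cents = [100 * p for p in prices_usd]
r_rescaled = pearson_r(sizes_mb, prices_cents)

print(isclose(r, r_swapped), isclose(r, r_rescaled))
```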
Regression Line
• A straight line
• Form: y = b0 + b1x
• Describes how the response y is affected by changes in the explanatory variable, x
• Often used to predict values of y from values of x
Fitting the Regression Line
• Using the scatterplot, a line of best fit may be “eyeballed.”
• The line provides a description of the association between two variables.

Person    Monthly Income    Monthly Rent
Sally     $4,000            $925
Austin    $1,200            $425
Joe       $600              $635
Ginny     $2,000            $600
Ruby      $0                $525

[Figure: scatterplot of Rent against Income with a regression line drawn through the points]
“Eyeballing” the Line
[Figure: scatterplot of Rent against Income with an eyeballed line drawn through the points]
• y-intercept: 450
• Change in y: 200
• Change in x: 2000
• Slope: 200/2000 = 0.1
“Eyeball” Estimate of the Equation
y = 450 + 0.1x
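As a small sketch of how the eyeballed equation would be used, the code below turns the intercept and the two changes read off the plot into a prediction function and compares its predictions with the rents in the table. The function name `predict_rent` is an illustrative assumption.

```python
# Values read off the scatterplot by eye
intercept = 450
slope = 200 / 2000   # change in y divided by change in x

def predict_rent(income):
    """Predicted monthly rent from the eyeballed line y = 450 + 0.1x."""
    return intercept + slope * income

# Monthly income and rent from the table
incomes = {"Sally": 4000, "Austin": 1200, "Joe": 600, "Ginny": 2000, "Ruby": 0}
rents = {"Sally": 925, "Austin": 425, "Joe": 635, "Ginny": 600, "Ruby": 525}

for person, income in incomes.items():
    predicted = predict_rent(income)
    print(person, "observed:", rents[person], "predicted:", round(predicted))
```

An eyeballed line depends on who draws it, so two people may arrive at somewhat different equations for the same plot.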
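Least-Squares Regression Line
• Mathematical method to determine the line of best fit.
• Minimizes the sum of the squares of the vertical distances between the points and the line.
[Figure: scatterplot of Y against X with a fitted line; the vertical distance from a point to the line is labeled "residual"]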
Residuals
• The difference between an observed value of the response variable and the value predicted by the regression line.
\[ \hat{e}_i = y_i - \hat{y}_i \]
• For the least-squares line, the sum of the residuals is 0.
\[ \sum_i \hat{e}_i = 0 \]
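A minimal sketch of these ideas, using the income and rent table from the scatterplot slides: it fits the least-squares line with the standard closed-form formulas (slope equal to the sum of cross-deviations divided by the sum of squared x-deviations), then computes the residuals and confirms that they sum to 0 up to rounding error. Variable names are illustrative.

```python
incomes = [4000, 1200, 600, 2000, 0]   # Monthly income ($), from the table
rents = [925, 425, 635, 600, 525]      # Monthly rent ($), from the table

n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(rents) / n

# Closed-form least-squares slope and intercept
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, rents))
s_xx = sum((x - mean_x) ** 2 for x in incomes)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Residual = observed response minus predicted response
residuals = [y - (b0 + b1 * x) for x, y in zip(incomes, rents)]

print(round(b0, 2), round(b1, 4))
print(abs(sum(residuals)) < 1e-9)  # residuals sum to 0 (up to rounding)
```

For these five points the fitted line is roughly y = 470 + 0.098x, close to the eyeballed estimate y = 450 + 0.1x.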
Equation of the Least-Squares Regression Line
• Equation notation:
\[ \hat{y} = b_0 + b_1 x \]
  where \(\hat{y}\) is the predicted response for any x.
• Calculation of the slope:
\[ b_1 = r \frac{s_y}{s_x} \]
• Calculation of the intercept:
\[ b_0 = \bar{y} - b_1 \bar{x} \]
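Example
\[ \bar{x} = 1.75\%, \quad \bar{y} = 9.07\%, \quad s_x = 5.36\%, \quad s_y = 15.35\%, \quad r = 0.596 \]
Slope:
\[ b_1 = r \frac{s_y}{s_x} = 0.596 \left( \frac{15.35}{5.36} \right) = 1.707 \]
Intercept:
\[ b_0 = \bar{y} - b_1 \bar{x} = 9.07 - 1.707(1.75) = 6.083 \]
Equation:
\[ \hat{y} = 6.083 + 1.707x \]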
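The arithmetic in this example can be reproduced with a few lines. The sketch below uses only the summary statistics given in the example; the function name `predict` is an illustrative assumption.

```python
# Summary statistics from the example (in percent)
mean_x, mean_y = 1.75, 9.07
s_x, s_y = 5.36, 15.35
r = 0.596

b1 = r * s_y / s_x          # slope
b0 = mean_y - b1 * mean_x   # intercept

def predict(x):
    """Predicted response from the fitted line."""
    return b0 + b1 * x

print(round(b1, 3), round(b0, 3))
# The fitted line returns the mean of y when evaluated at the mean of x
print(abs(predict(mean_x) - mean_y) < 1e-9)
```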
Important Facts
• The distinction between explanatory and response variables is essential.
• The correlation and the slope of the line always have the same sign.
• The LS regression line always passes through the point \((\bar{x}, \bar{y})\).
Coefficient of Determination: r^2
• The fraction of the variation in the values of y that is explained by the LS regression of y on x.
Example
\[ r^2 = (0.596)^2 = 0.355 \]
35.5% of the variation in y can be explained by x.
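To make "fraction of the variation explained" concrete, the sketch below uses the income and rent table from the scatterplot slides. It computes r^2 directly and also as 1 minus (sum of squared residuals) / (total variation in y), a standard identity for the least-squares line, and checks that the two agree. Variable names are illustrative.

```python
from math import sqrt

incomes = [4000, 1200, 600, 2000, 0]   # Monthly income ($), from the table
rents = [925, 425, 635, 600, 525]      # Monthly rent ($), from the table

n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(rents) / n
s_xx = sum((x - mean_x) ** 2 for x in incomes)
s_yy = sum((y - mean_y) ** 2 for y in rents)   # total variation in y
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, rents))

# Equivalent to the standardized-value formula for r (the n - 1 factors cancel)
r = s_xy / sqrt(s_xx * s_yy)

# Least-squares line and the variation it leaves unexplained
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(incomes, rents))
explained_fraction = 1 - sse / s_yy

print(round(r ** 2, 3), round(explained_fraction, 3))
print(round(0.596 ** 2, 3))   # the slide's example value
```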
Cautions
• Correlation and regression only describe linear relationships.
• r and the regression line are NOT resistant to outliers.
• Do not use a regression line to predict far outside the range of the observed explanatory variable, x.
    • Called extrapolation
• Be aware of lurking variables.
    • Variables not included in the study that may nevertheless influence the interpretation of the relationship.
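The lack of resistance to outliers can be seen in the income and rent table from the scatterplot slides. The sketch below computes r for all five people and again with Sally, whose income is far from the others, left out. The helper name `pearson_r` is an illustrative assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

people = ["Sally", "Austin", "Joe", "Ginny", "Ruby"]
incomes = [4000, 1200, 600, 2000, 0]   # Monthly income ($), from the table
rents = [925, 425, 635, 600, 525]      # Monthly rent ($), from the table

r_all = pearson_r(incomes, rents)

# Leave out the first person (Sally), the point farthest from the rest
r_without_sally = pearson_r(incomes[1:], rents[1:])

print(round(r_all, 2), round(r_without_sally, 2))
```

With all five people r is about 0.81; without Sally it falls to about 0.06, so a single point accounts for nearly all of the apparent linear relationship.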
Lastly,
Correlation Does NOT Imply Causation