Objective: Find the line of regression.
Use the line of regression to make predictions.
Relevance: To be able to find a model that best represents quantitative data with two variables and use it to make predictions.
A better alternative to “storing” numbers!
2-Variable Statistics
Now that we have used one-variable statistics to “store” our necessary numbers, let’s learn another way that’s even better.
Find the mean and standard deviation of the x’s and y’s using 2-var stats.
x y
21 6
18 9
30 3
35 4
Use this when using your lists to find r.
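For readers without a calculator at hand, the same 2-Var Stats summary can be sketched in Python. This is only an illustration: it assumes the standard-library `statistics` module and the sample standard deviation (dividing by n - 1), and the variable names are made up for this example.

```python
from statistics import mean, stdev

# Data from the x/y table above.
x = [21, 18, 30, 35]
y = [6, 9, 3, 4]

# stdev() is the sample standard deviation (it divides by n - 1).
x_bar, s_x = mean(x), stdev(x)
y_bar, s_y = mean(y), stdev(y)

print(f"x: mean = {x_bar:.3f}, s = {s_x:.3f}")
print(f"y: mean = {y_bar:.3f}, s = {s_y:.3f}")
```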
Find the correlation Coefficient:
x y
4 6
8 15
15 22
19 18
22 27
\( r = \frac{3.599887921}{4} = 0.900 \)
Find the correlation Coefficient:
x y
32 27
40 82
30 34
18 14
15 1
25 22
\( r = \frac{4.558674571}{5} = 0.912 \)
Find the correlation Coefficient:
x y
2 72
8 60
10 64
14 52
28 43
32 40
18 32
\( r = \frac{-4.868894211}{6} = -0.811 \)
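The three answers above all follow the same recipe: standardize x and y, multiply the z-scores pair by pair, add the products, and divide by n - 1. A minimal Python sketch of that recipe follows. The function name `correlation` is hypothetical, and the data are entered on the assumption that each row of numbers is a sequence of (x, y) pairs.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (sum of z_x * z_y over all points) / (n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (len(xs) - 1)

# The three data sets above, read as (x, y) pairs.
r1 = correlation([4, 8, 15, 19, 22], [6, 15, 22, 18, 27])
r2 = correlation([32, 40, 30, 18, 15, 25], [27, 82, 34, 14, 1, 22])
r3 = correlation([2, 8, 10, 14, 28, 32, 18], [72, 60, 64, 52, 43, 40, 32])

print(round(r1, 3), round(r2, 3), round(r3, 3))
```

The numerators shown above (3.5998..., 4.5586..., -4.8688...) are the sums of the z-score products before dividing by n - 1.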
A student wonders if tall women tend to date taller men than do short women. She measures herself, her dormitory roommate, and the women in the adjoining rooms. Then she measures the next man each woman dates. Draw & discuss the scatterplot and calculate the correlation coefficient.
Women (x) Men (y)
66 72
64 68
66 70
65 68
70 71
65 65
Women (x)  Men (y)  z_x      z_y      z_x * z_y
66         72       0        1.1859   0
64         68       -0.9535  -0.3953  0.3769
66         70       0        0.3953   0
65         68       -0.4767  -0.3953  0.1884
70         71       1.9069   0.7906   1.5076
65         65       -0.4767  -1.581   0.7538
\( r = \frac{2.826668855}{5} = 0.565 \)
Linear Regression
Guess the correlation coefficient
http://istics.net/stat/Correlations/
Can we make a Line of Best Fit?
Want:
1) The distances to the line to be the same.
2) The smallest distances.
Regression Line
When a scatterplot shows a linear relationship, we’d like to summarize the overall pattern by drawing a line on the scatterplot.
A regression line summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other.
Regression – unlike scatter plots – REQUIRES that we have an explanatory variable and a response variable.
Regression Line
This is a line that describes how a response variable (y) changes as an explanatory variable (x) changes.
It’s used to predict the value of (y) for a given value of (x).
The regression line is a model for the data.
Let’s try some!
http://illuminations.nctm.org/ActivityDetail.aspx?ID=146
Regression Line
When given the response variable (y) and the explanatory variable (x), the regression line relating y to x has an equation of the following form:
\[ \hat{y} = b_0 + b_1 x \]
Predicted value (\( \hat{y} \)): the predicted value of y for a given value of x.
y-intercept (\( b_0 \)): the predicted value of y when x is 0.
Slope (\( b_1 \)): the amount by which y is predicted to change when x increases by 1 unit.
The following data shows the number of miles driven and advertised price for 11 used Honda CR-Vs from the 2002-2006 model years (prices found at www.carmax.com). The scatterplot below shows a strong, negative linear association between number of miles and advertised cost. The correlation is -0.874. The line on the plot is the regression line for predicting advertised price based on number of miles.
Thousand Miles Driven  Cost (dollars)
22  17998
29  16450
35  14998
39  13998
45  14599
49  14988
55  13599
56  14599
69  11998
70  14450
86  10998
[Scatterplot of Cost (vertical axis) against ThousandMilesDriven (horizontal axis), with the fitted line Cost = 1.88e+04 - 86.2 ThousandMilesDriven]
Use the regression line to answer the following.
\[ \widehat{\text{cost}} = 18773 - 86.18\,(\text{thousands of miles}) \]
Slope: The predicted price of the car decreases by $86.18 for every additional thousand miles driven.
y-intercept: The predicted cost of a used 2002-2006 Honda CR-V with 0 miles is $18,773.
Predict the price for a Honda with 50,000 miles. (Use 50 in the equation!)
\[ \widehat{\text{cost}} = 18773 - 86.18(50) = 14464 \]
The predicted price is about $14,464.
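As a check on the printed equation, the line can be refit from the 11 listed cars. The sketch below uses the standard least-squares formulas (slope = sum of (x - x̄)(y - ȳ) divided by sum of (x - x̄)², intercept = ȳ - slope · x̄), which the later slides arrive at by a different route. The names `cars` and `predicted_cost` are hypothetical.

```python
from statistics import mean

# (thousand miles driven, advertised cost in dollars) for the 11 CR-Vs.
cars = [(22, 17998), (29, 16450), (35, 14998), (39, 13998), (45, 14599),
        (49, 14988), (55, 13599), (56, 14599), (69, 11998), (70, 14450),
        (86, 10998)]
miles = [m for m, _ in cars]
cost = [c for _, c in cars]

m_bar, c_bar = mean(miles), mean(cost)

# Least-squares slope and intercept.
slope = (sum((m - m_bar) * (c - c_bar) for m, c in cars)
         / sum((m - m_bar) ** 2 for m in miles))
intercept = c_bar - slope * m_bar

def predicted_cost(thousand_miles):
    """Predicted advertised cost in dollars for a given mileage (in thousands)."""
    return intercept + slope * thousand_miles

print(f"Cost = {intercept:.0f} + ({slope:.2f}) * ThousandMilesDriven")
print(f"Predicted cost at 50 thousand miles: ${predicted_cost(50):.0f}")
```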
Extrapolation
This refers to using a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line.
Such predictions are often not accurate.
Should we predict the asking price for a used 2002-2006 Honda CR-V with 250,000 miles?
No! We only have data for cars with between 22,000 and 86,000 miles. We don’t know if the linear pattern will continue beyond these values. In fact, if we did predict the asking price for a car with 250 thousand miles, it would be −$2772!
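One way to make this habit concrete is to check the x-value against the range of the data before trusting a prediction. The sketch below is a simplification: it flags anything outside 22 to 86 thousand miles, whereas the definition above only warns about values far outside that range. The function name `predict_cost` is hypothetical.

```python
# Range of the explanatory variable in the CR-V data (thousand miles).
X_MIN, X_MAX = 22, 86

def predict_cost(thousand_miles):
    """Return (predicted cost, is_extrapolation) from Cost = 18773 - 86.18x."""
    cost = 18773 - 86.18 * thousand_miles
    is_extrapolation = not (X_MIN <= thousand_miles <= X_MAX)
    return cost, is_extrapolation

for x in (50, 250):
    cost, flagged = predict_cost(x)
    note = " (extrapolation: outside 22 to 86 thousand miles)" if flagged else ""
    print(f"{x} thousand miles -> ${cost:.0f}{note}")
```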
Slope: 40. The rat’s predicted weight increases by 40 grams per week.
y-intercept: 100. The rat’s predicted weight at birth is 100 grams.
Predict weight after 16 weeks: \( \hat{y} = 100 + 40(16) = 740 \) grams.
Predict weight at 2 years (104 weeks): \( \hat{y} = 100 + 40(104) = 4260 \) grams. This is unreasonable and is a result of extrapolation.
Residual
A residual is the difference between an observed value of the response variable and the value predicted by the regression line.
residual = observed y – predicted y
residual = \( y - \hat{y} \)
Example
The equation of the least-squares regression line for the sprint time and long-jump distance data is
\[ \widehat{\text{distance}} = 304.56 - 27.63\,(\text{sprint time}). \]
Find and interpret the residual for the student who had a sprint time of 8.09 seconds.
\[ \widehat{\text{distance}} = 304.56 - 27.63(8.09) = 81.03 \text{ inches} \]
The residual is the observed distance minus 81.03 inches, which comes to 69.97 inches. This student jumped 69.97 inches farther than we expected based on his sprint time.
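The residual calculation can be sketched in a few lines of Python. One caveat: the slides do not state the student's observed jump directly. The value 151 inches used below is an assumption inferred from the two numbers that are given (81.03 predicted plus 69.97 residual). The function name `predicted_distance` is hypothetical.

```python
def predicted_distance(sprint_time):
    """Least-squares line from the example: predicted distance in inches."""
    return 304.56 - 27.63 * sprint_time

# ASSUMPTION: 151 inches is inferred from the slide's numbers
# (81.03 predicted + 69.97 residual); it is not stated directly.
observed = 151.0
predicted = predicted_distance(8.09)
residual = observed - predicted  # residual = observed y - predicted y

print(f"predicted = {predicted:.2f} in, residual = {residual:.2f} in")
```

A positive residual means the point lies above the line: the student jumped farther than the line predicts.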
Regression
Let’s see how a regression line is calculated.
Fat vs Calories in Burgers
Fat (g) Calories
19 410
31 580
34 590
35 570
39 640
39 680
43 660
Let’s standardize the variables
Fat (g) Calories z_x z_y
19 410 -1.959 -2
31 580 -0.42 -0.1
34 590 -0.036 0
35 570 0.09 -0.2
39 640 0.6 0.56
39 680 0.6 1
43 660 1.12 0.78
The line must contain the point \( (\bar{x}, \bar{y}) \) and pass through the origin.
Let’s clarify a little. (Just watch & listen)
The equation for a line that passes through the origin can be written with just a slope and no intercept: \( y = mx \).
But we’re using z-scores, so our equation should reflect this, and thus it’s \( \hat{z}_y = m z_x \).
Many lines with different slopes pass through the origin. Which one fits our data the best? That is, which slope determines the line that minimizes the sum of the squared residuals?
Line of Best Fit: Least-Squares Regression Line
It’s the line for which the sum of the squared residuals is smallest. We want to find the mean squared residual.
Focus on the vertical deviations from the line.
𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍=𝑶𝒃𝒔𝒆𝒓𝒗𝒆𝒅−𝑷𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅
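Let’s find it. (Just watch and soak it in.)
Note: MSR is “Mean Squared Residual.”
\[
\begin{aligned}
MSR &= \frac{\sum (z_y - \hat{z}_y)^2}{n-1} \\
    &= \frac{\sum (z_y - m z_x)^2}{n-1} \qquad \text{since } \hat{z}_y = m z_x \\
    &= \frac{\sum \left( z_y^2 - 2 m z_x z_y + m^2 z_x^2 \right)}{n-1} \\
    &= \frac{\sum z_y^2}{n-1} - 2m\,\frac{\sum z_x z_y}{n-1} + m^2\,\frac{\sum z_x^2}{n-1} \\
    &= 1 - 2mr + m^2
\end{aligned}
\]
The standard deviation of z-scores is 1, so the variance \( \frac{\sum z^2}{n-1} \) is 1 also.
The middle fraction \( \frac{\sum z_x z_y}{n-1} \) is r!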
Continue...
\[ MSR = 1 - 2rm + m^2 \]
Since this is a parabola in m, it reaches its minimum at \( x = -\frac{b}{2a} \).
This gives us
\[ m = \frac{-(-2r)}{2(1)} = r \]
Hence the slope of the best-fit line for z-scores is the correlation coefficient, r.
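The result can be checked numerically with the burger data. The sketch below computes the mean squared residual directly from the z-scores, compares it with the closed form 1 - 2rm + m², and then searches a grid of candidate slopes for the smallest MSR. The grid search is only for illustration (the parabola argument already gives the exact answer), and the helper names `z_scores` and `msr` are hypothetical.

```python
from statistics import mean, stdev

fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]
n = len(fat)

def z_scores(values):
    """Standardize a list using the sample standard deviation."""
    v_bar, s = mean(values), stdev(values)
    return [(v - v_bar) / s for v in values]

z_x, z_y = z_scores(fat), z_scores(calories)
r = sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

def msr(m):
    """Mean squared residual of the line z_y-hat = m * z_x."""
    return sum((b - m * a) ** 2 for a, b in zip(z_x, z_y)) / (n - 1)

# The direct sum should match the closed form 1 - 2rm + m^2 for any m.
for m in (0.0, 0.5, 1.0):
    assert abs(msr(m) - (1 - 2 * r * m + m ** 2)) < 1e-9

# Try slopes 0.000, 0.001, ..., 1.500 and keep the one with the smallest MSR.
candidates = [k / 1000 for k in range(1501)]
best_m = min(candidates, key=msr)
print(f"r = {r:.3f}, best slope on the grid = {best_m:.3f}")
```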
Slope: rise over run
A slope of r for z-scores means that for every increase of 1 in \( z_x \) (one standard deviation in x) there is an increase of r in \( z_y \) (r standard deviations in y): “over 1 and up r.”
Translate back to x and y values: “over one standard deviation in x, up r standard deviations in y.”
Slope of the regression line is:
\[ b_1 = r\,\frac{s_y}{s_x} \]
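The translation from z-scores back to x and y can also be written out algebraically. This is a sketch that assumes the usual z-score definitions \( z_x = (x - \bar{x})/s_x \) and \( z_y = (y - \bar{y})/s_y \), which the slides use without restating here.

```latex
\begin{align*}
\hat{z}_y &= r\, z_x \\
\frac{\hat{y} - \bar{y}}{s_y} &= r \cdot \frac{x - \bar{x}}{s_x} \\
\hat{y} &= \bar{y} + r\,\frac{s_y}{s_x}\,(x - \bar{x})
\end{align*}
```

The coefficient of x in the last line is \( r\,s_y/s_x \), and setting \( x = \bar{x} \) gives \( \hat{y} = \bar{y} \), so the line passes through \( (\bar{x}, \bar{y}) \).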
Why is correlation “r”?
Because it was originally calculated from the regression of y on x after standardizing the variables, just as we have done here. So r was chosen to stand for (standardized) regression.
Let’s Write the Equation
Fat (g) Calories
19 410
31 580
34 590
35 570
39 640
39 680
43 660
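From algebra: \( y = mx + b \), where m is the slope and b is the y-intercept.
For regression: \( \hat{y} = b_0 + b_1 x \), where \( b_0 \) is the y-intercept and \( b_1 \) is the slope.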
Slope:
\[ b_1 = r\,\frac{s_y}{s_x} = \frac{0.961\,(89.815)}{7.804} = 11.056 \]
Explain the slope: The predicted calories increase by 11.056 for every additional gram of fat.
\[ \hat{y} = b_0 + b_1 x \]
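Now for the final part: the equation!
y-intercept: Remember, the line has to pass through the point \( (\bar{x}, \bar{y}) \).
\[ \bar{y} = b_0 + b_1 \bar{x} \]
Solve for the y-intercept:
\[ b_0 = \bar{y} - b_1 \bar{x} \]
Find the value of the y-intercept:
\[ b_0 = \bar{y} - b_1 \bar{x} = 210.954 \]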
Put the parts together to form the equation of the regression line. Now it can be used to predict.
How many calories do I expect to find in a hamburger that has 25 grams of fat?
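The whole calculation for the burger data can be sketched in Python as a way to check the work. It follows the steps from the slides (r from z-scores, then b1 = r · s_y / s_x, then b0 = ȳ - b1 · x̄). The function name `predicted_calories` is hypothetical, and small differences from hand calculation come from rounding r, s_x, and s_y along the way.

```python
from statistics import mean, stdev

fat = [19, 31, 34, 35, 39, 39, 43]
calories = [410, 580, 590, 570, 640, 680, 660]
n = len(fat)

x_bar, y_bar = mean(fat), mean(calories)
s_x, s_y = stdev(fat), stdev(calories)

# Correlation from z-score products, then slope and intercept.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(fat, calories)) / (n - 1)
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

def predicted_calories(fat_grams):
    """Predicted calories for a burger with the given grams of fat."""
    return b0 + b1 * fat_grams

print(f"calories-hat = {b0:.3f} + {b1:.3f} * fat")
print(f"25 g of fat -> about {predicted_calories(25):.0f} calories")
```

Since 25 grams lies inside the 19 to 43 gram range of the data, this prediction is not an extrapolation.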
Try another problem
Mean call-to-shock time (min)  Survival Rate
2 90
6 45
7 30
9 5
12 2
\[ r = \frac{-3.84030233}{4} = -0.960 \]
\[ b_1 = r\,\frac{s_y}{s_x} = -9.2956 \]
\[ b_0 = \bar{y} - b_1 \bar{x} = 101.3285 \]
Interpret the slope: The predicted survival rate decreases by 9.2956 for every additional minute of call-to-shock time.
Interpret the y-intercept: The predicted survival rate is 101.3285 when the call-to-shock time is 0 minutes.
Predict the survival rate for a 10 min. call-to-shock time:
\[ \widehat{\text{survival rate}} = 101.3285 - 9.2956(10) = 8.3725 \]
Predict the survival rate for a 20 min. call-to-shock time:
\[ \widehat{\text{survival rate}} = 101.3285 - 9.2956(20) = -84.5835 \]
Extrapolation! A negative survival rate is impossible, and 20 minutes is outside the 2 to 12 minute range of the data.
Try another problem
SAT Math SAT Verbal
600 650
720 800
540 600
450 500
620 620
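\[ r = \frac{3.853}{4} = 0.963 \]
\[ b_1 = 1.05 \]
\[ b_0 = 20.7 \]
\[ \widehat{\text{SAT Verbal}} = 20.7 + 1.05\,(\text{SAT Math}) \]
Interpret the slope: The predicted verbal score increases by 1.05 points for every additional point in math.
Interpret the y-intercept: The predicted verbal score is 20.7 for a math score of 0.
Predict the verbal score for a math score of 400:
\[ \widehat{\text{Verbal Score}} = 20.7 + 1.05(400) = 439.33 \]
Extrapolated! A math score of 400 is below the lowest math score in the data (450).
Predict the verbal score for a math score of 500:
\[ \widehat{\text{Verbal Score}} = 20.7 + 1.05(500) = 543.99 \]
The predictions 439.33 and 543.99 were computed with the unrounded slope (about 1.0466), not the rounded 1.05.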
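The effect of rounding the coefficients can be seen by refitting the line from the five SAT pairs. The sketch below uses the equivalent least-squares formula slope = sum of (x - x̄)(y - ȳ) divided by sum of (x - x̄)², which gives the same line as r · s_y / s_x. It prints each prediction twice: once with unrounded coefficients and once with the rounded 20.7 and 1.05. The variable names are made up for this example.

```python
from statistics import mean

math_scores = [600, 720, 540, 450, 620]
verbal_scores = [650, 800, 600, 500, 620]

x_bar, y_bar = mean(math_scores), mean(verbal_scores)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(math_scores, verbal_scores))
      / sum((x - x_bar) ** 2 for x in math_scores))
b0 = y_bar - b1 * x_bar

for score in (400, 500):
    full = b0 + b1 * score            # unrounded coefficients
    rounded = 20.7 + 1.05 * score     # coefficients rounded as printed above
    outside = not (min(math_scores) <= score <= max(math_scores))
    flag = " (extrapolated)" if outside else ""
    print(f"math {score}: {full:.2f} unrounded, {rounded:.1f} rounded{flag}")
```

Carrying the full calculator value of the slope through to the end, and rounding only the final answer, avoids the gap of a point or so between the two versions.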
That’s…all…..Folks!
Homework: p. 191 (27-32, 35, 37, 39, 41, 47)