Section 2.2 Correlation
A numerical measure to supplement the graph.
Will give us an indication of “how closely” the data points fit a particular line – the least squares regression line.
Will give us an indication of the type of association – positive or negative.
Warning: Notice the scale!
Notes:
1. Each term in the summation is standardized like a z-score, so r is not affected by a change of units: r = [1/(n-1)] * Σ [(x_i - x̄)/s_x][(y_i - ȳ)/s_y]
2. Divide the scatterplot into quadrants based on the centroid:
   - Data in the 1st and 3rd quadrants contribute positive values to r
   - Data in the 2nd and 4th quadrants contribute negative values to r
Page 124
How Data Relative to Centroid Affects r
[Figure: scatterplot divided into Quadrants 1-4 around the centroid]
1. Turn diagnostics on: 2nd Catalog, scroll down to DiagnosticOn and press Enter (you do not have to repeat this step every time!)
2. Compute r (and a few other things!): Stat | Calc | LinReg(a+bx), press Enter, then give your lists: L1, L2
3. Your output should be: a = 102.49, b = -3.622, r^2 = 0.8915, r = -0.9442
Student                    A   B   C   D   E   F   G
Number of Absences (L1)    6   2  15   9  12   5   8
Final Grade (L2)          82  86  43  74  58  90  78
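The LinReg(a+bx) output quoted above can be reproduced without a calculator. A minimal pure-Python sketch of the least-squares computations for this table:

```python
# Cross-check of the TI LinReg(a+bx) output for the absence/grade data.
from math import sqrt

absences = [6, 2, 15, 9, 12, 5, 8]       # L1
grades   = [82, 86, 43, 74, 58, 90, 78]  # L2

n = len(absences)
mean_x = sum(absences) / n
mean_y = sum(grades) / n

# Sums of squared deviations and cross-products about the means
sxx = sum((x - mean_x) ** 2 for x in absences)
syy = sum((y - mean_y) ** 2 for y in grades)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(absences, grades))

b = sxy / sxx              # slope of the least-squares line
a = mean_y - b * mean_x    # intercept (line passes through the centroid)
r = sxy / sqrt(sxx * syy)  # correlation coefficient

print(round(a, 2), round(b, 3), round(r, 4), round(r ** 2, 4))
# a = 102.49, b = -3.622, r = -0.9442, r^2 = 0.8915
```

These match the TI output, so the calculator steps and the hand formulas agree.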
Final Class Grade versus Number of Absences
[Figure: scatterplot with x = Number of Absences (0 to 20) and y = Final Class Grade (40 to 100); data points and centroid marked]
What are the meanings of these numbers??
Let’s start with r….
Let’s use our TI’s to find the correlation for our data set!
Properties of the Correlation Coefficient (r)
• Symmetric in X and Y (it makes no difference which variable is the explanatory and which is the response)
• Both variables must be quantitative!
• -1 <= r <= 1 ALWAYS
• The closer in magnitude r is to 1, the stronger the linear relationship between X and Y
• The sign of r indicates whether there is a positive or negative relationship between X and Y
• Just like the mean and standard deviation, r is strongly affected by outliers
• See page 125 for more!
Getting a Feel for r
Let’s Play the Guessing Correlations Game!
http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GuessCGI.html
I will put this link on your assignments page!
Figure on Page 126
Section 2.3 Least-Squares Regression
We will first learn how to find the least-squares regression line and then understand how to interpret it.
Please enter the data in Example 2.9 on page 152 into L3 and L4 on your TI calculator. L3 is NEA Increase (cal) and L4 is Fat Gain (kg)
Scatterplot of Example 2.9 Data (page 133): r = ?
Using LinReg(a+bx) L3,L4 we get the coefficients for the least-squares regression line:
a = 3.505, b = -0.00344, r^2 = 0.6061, and r = -0.7786
So we have the line:
L4 variable = a + b*(L3 variable)
fat gain = 3.505 - 0.00344*(NEA increase)
To use the line to predict fat gain for an NEA increase of 400 calories (Example 2.10, page 134), plug the value 400 in for NEA increase. How does this look graphically?
fat gain = 3.505 - 0.00344*(NEA increase)
Slope = b = -0.00344
Y-intercept = a = 3.505
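Plugging a value into the fitted line is a one-liner. A sketch using the rounded coefficients reported by LinReg (a = 3.505, b = -0.00344):

```python
# Predict fat gain (kg) from an NEA increase (calories), Example 2.10.
a, b = 3.505, -0.00344      # intercept and slope from LinReg(a+bx)
nea_increase = 400          # calories
fat_gain = a + b * nea_increase
print(round(fat_gain, 3))   # 2.129 kg of predicted fat gain
```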
[Figure: least-squares regression line on the scatterplot, with the y-intercept a marked]
Let’s Get the Equation of the least-squares regression line for our Absence and Final Grade Data (It should still be in L1 and L2):
LinReg(a+bx) L1,L2 gives: y=a+bx where:
a=102.49, b=-3.622, r^2=0.8915 and r=-0.9442
So the equation of the least-squares regression line is:
Final Grade = 102.49 – 3.622*(Number of Absences)
Use this model to predict the Final Grade for a student who has 10 absences.
Let’s look at this graphically….
Another Example
Final Class Grade versus Number of Absences
[Figure: scatterplot with fitted line y = -3.6219x + 102.49, R^2 = 0.8915; x = Number of Absences (0 to 20), y = Final Class Grade (40 to 100); data points, centroid, and linear fit shown]
Notice:
• The least-squares regression line goes through the centroid.
• We can graphically represent the prediction of the Final Class Grade for a given Number of Absences.
• What is the meaning of b and r^2????
Caution! Using the Regression Line to Make Predictions for Certain Values of x
Interpretation of the Least-Squares Regression Line
Page 136
Error = Residual = Observed - Predicted
Makes sense because the line always passes through the centroid!
Interpretation of b, the Slope of the Regression Line (page 138)
• A change of one unit in x corresponds to a change of b units in y.
• A change of one standard deviation in x corresponds to a change of r standard deviations in y.
• Let’s find b via the formula on page 137 for our example data:
• How do we interpret b?
• What are the units for b?
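These questions can be explored numerically. A plain-Python sketch of the page-137 formula, b = r*(s_y/s_x), for the absence/grade data:

```python
# Verify the slope formula b = r * (s_y / s_x) for the example data.
from math import sqrt

absences = [6, 2, 15, 9, 12, 5, 8]
grades   = [82, 86, 43, 74, 58, 90, 78]
n = len(absences)
mx, my = sum(absences) / n, sum(grades) / n

sx = sqrt(sum((x - mx) ** 2 for x in absences) / (n - 1))  # sample sd of x
sy = sqrt(sum((y - my) ** 2 for y in grades) / (n - 1))    # sample sd of y
sxy = sum((x - mx) * (y - my) for x, y in zip(absences, grades))
r = sxy / ((n - 1) * sx * sy)

b = r * sy / sx
print(round(b, 3))   # -3.622, matching LinReg(a+bx)
```

Note the units: since x is counted in absences and y in grade points, b is in grade points per absence.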
Two Sources of Variability in y:
- Relationship between x and y via the regression line (r^2 tells the %)
- Variability for a fixed value of x
Pages 141, 142
Interpretation of r^2 (p141, 142)
Let’s use some list operations to verify r^2 for our example data set of Absences and Final Grade:
- Regression line: Final Grade = 102.49 - 3.622*(Number of Absences)
- Observed values of y are in L2
- To get predicted values of y for each value of x (in L1): 102.49 - 3.622*L1 → L5 (→ is the STO key)
- To get the residuals (i.e. Observed - Predicted): L2 - L5 → L6 (What is the meaning of the data in L6???)
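The same list operations can be mirrored in code. A sketch using the rounded slide coefficients, so the residual sum comes out only approximately zero:

```python
# Predicted values (L5) and residuals (L6) for the absence/grade data,
# using the rounded coefficients 102.49 and -3.622 from the slide.
absences = [6, 2, 15, 9, 12, 5, 8]        # L1
grades   = [82, 86, 43, 74, 58, 90, 78]   # L2

predicted = [102.49 - 3.622 * x for x in absences]            # L5
residuals = [y - yhat for y, yhat in zip(grades, predicted)]  # L6

print([round(e, 2) for e in residuals])
# The residuals about the least-squares line sum to zero exactly;
# with rounded coefficients the sum is merely close to zero.
print(round(sum(residuals), 2))
```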
r^2 = (Variance of Predicted Values)/(Variance of Observed Values)
= (standard dev. of L5)^2 / (standard dev. of L2)^2 = (15.8472)^2 / (16.7829)^2 = (15.8472/16.7829)^2 = 0.8916
(note we have some round-off error in the 4th decimal place)
So the regression line explains about 89% of the variability in the values of y (a very strong result!)
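This variance-ratio check can also be scripted. The sketch below refits the line exactly (no rounding), so the ratio agrees with r^2 without the round-off error noted above:

```python
# Verify r^2 = Var(predicted) / Var(observed) for the absence/grade data.
from statistics import pvariance  # population variance; the ratio is the
                                  # same with sample variance

absences = [6, 2, 15, 9, 12, 5, 8]
grades   = [82, 86, 43, 74, 58, 90, 78]
n = len(absences)
mx, my = sum(absences) / n, sum(grades) / n

sxx = sum((x - mx) ** 2 for x in absences)
sxy = sum((x - mx) * (y - my) for x, y in zip(absences, grades))
b = sxy / sxx        # exact (unrounded) slope
a = my - b * mx      # exact intercept

predicted = [a + b * x for x in absences]
ratio = pvariance(predicted) / pvariance(grades)
print(round(ratio, 4))   # 0.8915 = r^2
```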
Here r^2=0.606 so the regression model explains about 61% of the variability in y, i.e. about 61% of the vertical scatter in y.
Section 2.4 - Cautions about Correlation and Regression
Error = Residual = Observed - Predicted
Example 2.15 (scatterplot with regression line page 152)
An Interesting Fact: The sum of the residuals about the least-squares regression line is always zero.
A residual plot (page 153) gives us a visual representation of the leftover variation in the response variable after taking the regression into account. It helps us assess how well the line describes the data. If the regression line catches the overall pattern of the data, there should be no pattern in the residuals.
[Figure: examples of (a) a positive residual and (b) a negative residual]
A Residual Plot
Note: No discernible pattern to the residuals
Measurements of Pipe Defects
[Figure: scatterplot of Field Measurements versus Laboratory Measurements (0 to 100) with fitted line y = 0.7267x + 4.9433, R^2 = 0.8921; data points, centroid, and linear fit shown]
Measurements of Pipe Defects
[Figure: scatterplot of Field Measurements versus Laboratory Measurements (0 to 100); data points and centroid shown]
Example 2.4 (page 108) Revisited
Both the scatterplot and the residual plot show more variability in field measurements as the true (laboratory-measured) defect size increases, despite a strong correlation (r = 0.9445) and a large percent of the variability in y being explained by the regression (r^2 = 0.8921).
Beware of Outliers and Influential Points
Example 2.16 (pages 154-157)

Data Point    r with data point    r without data point
Subject 15    0.4819               0.5684   (weakens the regression)
Subject 18    0.4819               0.3837   (strengthens the regression)
Beware the Lurking Variable (page 158) and Remember: Correlation does not imply Causation! (page 160)
Lurking variables can create “nonsense correlations” or possibly hide true relationships between x and y.
A “nonsense” correlation. Lurking variable? Both variables increased during the time period plotted; thus the common year is a lurking variable.
Example 2.2 page 159
Example 2.21 page 160
[Figure: scatterplots of four data sets, each with a nearly identical fitted line:
Data Set A: y = 0.5001x + 3.0001, R^2 = 0.6665
Data Set B: y = 0.5x + 3.0009, R^2 = 0.6662
Data Set C: y = 0.4997x + 3.0025, R^2 = 0.6663
Data Set D: y = 0.4999x + 3.0017, R^2 = 0.6667]
Problem 2.80 (page 169): Any observations?
Figure for Problem 2.81
Figure for Problem 2.83
Section 2.5 The Question of Causation
Dashed double-arrow line: an observed association.
Solid arrow from x to y: “x causes y.”