
Section 2.2 Correlation

A numerical measure to supplement the graph.

Will give us an indication of “how closely” the data points fit a particular line – the least squares regression line.

Will give us an indication of the type of association – positive or negative.

Warning: Notice the Scale!

Notes:
1. Data in the summation is standardized like a z-score (and thus is not affected by a change in units).
2. Divide the scatterplot into quadrants based on the centroid:
- Data in the 1st and 3rd quadrants contribute positive values to r
- Data in the 2nd and 4th quadrants contribute negative values to r

How Data Relative to Centroid Affects r

[Figure: scatterplot divided about the centroid into Quadrant 1, Quadrant 2, Quadrant 3, and Quadrant 4]
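The notes on standardizing and on quadrants can be checked numerically. The sketch below assumes the usual textbook formula, r = (sum of z_x * z_y) / (n - 1), and uses the absences and final-grade data from this section. The function names are illustrative only.

```python
import math

absences = [6, 2, 15, 9, 12, 5, 8]      # x values (L1)
grades = [82, 86, 43, 74, 58, 90, 78]   # y values (L2)

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation_terms(xs, ys):
    """Return the per-point products z_x * z_y that are summed to give r."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    return [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]

terms = correlation_terms(absences, grades)
r = sum(terms) / (len(terms) - 1)

# A point above-right or below-left of the centroid gives a positive term;
# a point above-left or below-right gives a negative term.
for x, y, t in zip(absences, grades, terms):
    sign = "positive" if t > 0 else "negative" if t < 0 else "zero"
    print(f"({x:2d}, {y:2d}) contributes a {sign} term: {t:+.3f}")
print(f"r = {r:.4f}")
```

Because each term is built from z-scores, rescaling either variable (for example, grades out of 1 instead of 100) leaves every term, and therefore r, unchanged.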

1. Turn Diagnostics On: 2nd Catalog, scroll down to DiagnosticOn and press Enter (you do not have to repeat this step every time!)
2. Compute r (and a few other things!): Stat|Calc|LinReg(a+bx), press Enter, and then give your lists: L1,L2
3. Your output should be: a=102.5, b=-3.62, r^2=0.8915, r=-0.9442

Student                    A   B   C   D   E   F   G
Number of Absences (L1)    6   2  15   9  12   5   8
Final Grade (L2)          82  86  43  74  58  90  78
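For readers without a TI calculator, the LinReg(a+bx) computation can be sketched in plain Python. This is a minimal illustration of the standard least-squares formulas, not the calculator's internal routine. The function name `lin_reg` is made up for this example.

```python
import math

absences = [6, 2, 15, 9, 12, 5, 8]      # L1
grades = [82, 86, 43, 74, 58, 90, 78]   # L2

def lin_reg(xs, ys):
    """Least-squares fit y = a + b*x; returns (a, b, r_squared, r)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                    # slope
    a = y_bar - b * x_bar            # intercept
    r = sxy / math.sqrt(sxx * syy)   # correlation
    return a, b, r * r, r

a, b, r_squared, r = lin_reg(absences, grades)
print(f"a={a:.2f}, b={b:.3f}, r^2={r_squared:.4f}, r={r:.4f}")
# prints a=102.49, b=-3.622, r^2=0.8915, r=-0.9442
```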

[Figure: scatterplot of Final Class Grade versus Number of Absences (x-axis: X - Number of Absences), showing the data and the centroid]

What are the meanings of these numbers??

Let’s start with r….

Let’s use our TI’s to find the correlation for our data set!

Properties of the Correlation Coefficient (r)

• Symmetric in X and Y (makes no difference which variable is the explanatory and which is the response)
• Both variables must be quantitative!
• -1 <= r <= 1 ALWAYS
• The closer in magnitude r is to 1, the stronger the linear relationship between X and Y
• The sign of r indicates whether there is a positive or negative relationship between X and Y
• Just like the mean and standard deviation, r is strongly affected by outliers
• See page 125 for more!
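Several of these properties can be checked directly on the absences and final-grade data. The sketch below is illustrative only. The extra point (20 absences, grade 100) is a made-up outlier and is not part of the course data.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

r_xy = correlation(absences, grades)
r_yx = correlation(grades, absences)              # swap explanatory and response
r_rescaled = correlation(absences, [g / 100 for g in grades])  # change of units

# Hypothetical outlier, invented for illustration only.
r_with_outlier = correlation(absences + [20], grades + [100])

print(f"r(x, y)        = {r_xy:.4f}")
print(f"r(y, x)        = {r_yx:.4f}")
print(f"r after rescale = {r_rescaled:.4f}")
print(f"r with outlier = {r_with_outlier:.4f}")
```

The first three values agree, while the single added point pulls r much closer to 0.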

Getting a Feel for r

Let’s Play the Guessing Correlations Game!

http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GuessCGI.html

I will put this link on your assignments page!

Figure on Page 126

Section 2.3 Least-Squares Regression

We will first learn how to find the least-squares regression line and then understand how to interpret it.

Please enter the data in Example 2.9 on page 152 into L3 and L4 on your TI calculator. L3 is NEA Increase (cal) and L4 is Fat Gain (kg)

Scatterplot of Example 2.9 Data (page 133): r = ?

Using LinReg(a+bx) L3,L4 we get the coefficients for the least-squares regression line:

a=3.505, b=-0.00344, r^2=0.6061, and r=-0.7786

So we have the line:

L4 variable = a + b*(L3 variable)

fat gain = 3.505 – 0.00344*(NEA increase)

To use the line to predict fat gain for an NEA increase of 400 calories (Example 2.10, page 134), plug the value 400 in for NEA increase. How does this look graphically?

fat gain = 3.505 – 0.00344*(NEA increase)

Slope = b = -0.00344

Y-intercept = a = 3.505

[Figure: least-squares regression line, with the y-intercept a marked]
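As a worked check of the prediction for an NEA increase of 400 calories, using the LinReg coefficients a = 3.505 and b = -0.00344 reported above:

```latex
\widehat{\text{fat gain}}
  = 3.505 - 0.00344 \times 400
  = 3.505 - 1.376
  = 2.129 \text{ kg}
```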

Let’s Get the Equation of the least-squares regression line for our Absence and Final Grade Data (It should still be in L1 and L2):

LinReg(a+bx) L1,L2 gives: y=a+bx where:

a=102.49, b=-3.622, r^2=0.8915 and r=-0.9442

So the equation of the least-squares regression line is:

Final Grade = 102.49 – 3.622*(Number of Absences)

Use this model to predict the Final Grade for a student who has 10 absences.
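A minimal sketch of this prediction, using the rounded coefficients a = 102.49 and b = -3.622 from the LinReg output. The function name is illustrative.

```python
def predict_final_grade(absences, a=102.49, b=-3.622):
    """Predicted final grade from the fitted line: grade = a + b * absences."""
    return a + b * absences

predicted = predict_final_grade(10)
print(f"Predicted final grade for 10 absences: {predicted:.2f}")
# prints Predicted final grade for 10 absences: 66.27
```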

Let’s look at this graphically….

Another Example

[Figure: Final Class Grade versus Number of Absences (x-axis: X - Number of Absences), showing the data, the centroid, and the fitted line Linear (Data): y = -3.6219x + 102.49, R^2 = 0.8915]

Notice:
• The least-squares regression line goes through the centroid.
• We can graphically represent the prediction of the Final Class Grade for a given Number of Absences.
• What is the meaning of b and r^2?
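The first observation can be checked numerically. The sketch below plugs the mean number of absences into the fitted line, using the rounded coefficients 102.49 and -3.622, and compares the result with the mean final grade. The small gap comes only from rounding the coefficients.

```python
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

# The centroid is the point (mean of x, mean of y).
x_bar = sum(absences) / len(absences)
y_bar = sum(grades) / len(grades)

# Prediction at x_bar from the rounded least-squares line.
predicted_at_x_bar = 102.49 - 3.622 * x_bar

print(f"centroid = ({x_bar:.3f}, {y_bar:.3f})")
print(f"line evaluated at x_bar = {predicted_at_x_bar:.3f}")
```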

Caution! Using the Regression Line to Make Predictions For Certain Values of x
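One plausible reading of this caution is that predictions far outside the observed x values cannot be trusted. The sketch below assumes that reading and uses the absences line as an illustration. The value of 30 absences is hypothetical, and the helper name is made up.

```python
observed_absences = [6, 2, 15, 9, 12, 5, 8]

def predict_with_range_check(x, xs=observed_absences, a=102.49, b=-3.622):
    """Return (prediction, inside_range) for the fitted line grade = a + b*x."""
    inside_range = min(xs) <= x <= max(xs)
    return a + b * x, inside_range

for x in (10, 30):
    grade, inside_range = predict_with_range_check(x)
    note = "within observed x range" if inside_range else "OUTSIDE observed x range"
    print(f"{x} absences -> predicted grade {grade:.2f} ({note})")
```

For 30 absences the line returns a negative grade, which is impossible, even though the arithmetic is carried out correctly.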

Interpretation of the Least-Squares Regression Line

Error = Residual = Observed - Predicted

Makes sense because the line always passes through the centroid!

Interpretation of b, the Slope of the Regression Line (page 138)

• A change of one unit in x corresponds to a change of b units in y.

• A change of one standard deviation in x corresponds to a change of r standard deviations in y.

• Let’s find b via the formula on page 137 for our example data:

• How do we interpret b?
• What are the units for b?
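The slope can be recomputed from r and the two standard deviations. This sketch assumes the formula on page 137 is the standard one, b = r * s_y / s_x, and applies it to the absences and final-grade data.

```python
import statistics

absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

s_x = statistics.stdev(absences)   # sample standard deviation of x
s_y = statistics.stdev(grades)     # sample standard deviation of y

x_bar = statistics.mean(absences)
y_bar = statistics.mean(grades)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades)) / (
    (len(absences) - 1) * s_x * s_y
)

b = r * s_y / s_x   # units: grade points per absence
print(f"s_x = {s_x:.4f}, s_y = {s_y:.4f}, r = {r:.4f}, b = {b:.3f}")
```

Under this formula b carries the units of y per unit of x, so here each additional absence corresponds to a drop of about 3.6 grade points in the predicted final grade.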

Two Sources of Variability in y:
- Relationship between x and y via the regression line (r^2 tells %)
- Variability for a fixed value of x

Interpretation of r^2 (p141, 142)

Let’s use some list operations to verify r^2 for our example data set of Absences and Final Grade:
- Regression line: Final Grade = 102.49 – 3.622*(Number of Absences)
- Observed values of y are in L2
- To get the predicted values of y for each value of x (in L1): 102.49-3.622*L1→L5 (→ is the STO key)
- To get the residuals (i.e. Observed – Predicted): L2-L5→L6 (What is the meaning of the data in L6?)

r^2 = (Variance of Predicted Values)/(Variance of Observed Values)
= (standard dev. L5)^2/(standard dev. L2)^2
= (15.8472)^2/(16.7829)^2
= (15.8472/16.7829)^2
= 0.8916 (note we have some round-off error in the 4th decimal place)

So the regression line explains about 89% of the variability in the values of y (a very strong result!)
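The same list operations can be mirrored in Python as a rough check. The names L1, L2, L5, and L6 copy the calculator lists. The residuals follow the definition Residual = Observed - Predicted, and the rounded coefficients 102.49 and 3.622 are used, so the same round-off appears.

```python
import statistics

L1 = [6, 2, 15, 9, 12, 5, 8]         # Number of Absences
L2 = [82, 86, 43, 74, 58, 90, 78]    # Final Grade (observed y)

L5 = [102.49 - 3.622 * x for x in L1]              # predicted y
L6 = [obs - pred for obs, pred in zip(L2, L5)]     # residuals = observed - predicted

sd_predicted = statistics.stdev(L5)
sd_observed = statistics.stdev(L2)
r_squared = statistics.variance(L5) / statistics.variance(L2)

print(f"standard dev. L5 = {sd_predicted:.4f}")
print(f"standard dev. L2 = {sd_observed:.4f}")
print(f"r^2 = {r_squared:.4f}")
```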

For the NEA data (Example 2.9), r^2=0.606, so the regression model explains about 61% of the variability in y, i.e. about 61% of the vertical scatter in y.


Section 2.4 - Cautions about Correlation and Regression

Error = Residual = Observed - Predicted

Example 2.15 (scatterplot with regression line page 152)

An Interesting Fact: The sum of the residuals about the least-squares regression line is always zero.

A residual plot (page 153) gives us a visual representation of the leftover variation in the response variable after taking the regression into account. It helps us assess how well the line describes the data.

If the regression line catches the overall pattern of the data, there should be no pattern in the residuals.
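The zero-sum fact can be checked on the absences and final-grade data. This sketch refits the line with unrounded coefficients, because rounding a and b leaves a small nonzero sum. The variable names are illustrative.

```python
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n

# Unrounded least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades)) / sum(
    (x - x_bar) ** 2 for x in absences
)
a = y_bar - b * x_bar

# Residual = Observed - Predicted
residuals = [y - (a + b * x) for x, y in zip(absences, grades)]
total = sum(residuals)

for x, res in zip(absences, residuals):
    print(f"absences = {x:2d}, residual = {res:+.2f}")
print(f"sum of residuals = {total:.10f}")
```

Plotting these residuals against the number of absences would give the residual plot described above. The sum is zero up to floating-point error.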

[Figure: (a) Positive Residual; (b) Negative Residual]

The residual plot will

A Residual Plot. Note: No discernible pattern to the residuals.

[Figure: Measurements of Pipe Defects (x-axis: Laboratory Measurements), showing the data, the centroid, and the fitted line Linear (Data): y = 0.7267x + 4.9433, R^2 = 0.8921]

[Figure: Measurements of Pipe Defects (x-axis: Laboratory Measurements), showing the centroid]

Example 2.4 (page 108) Revisited

Both the scatterplot and the residual plot show more variability in field measurements as true (laboratory-measured) defect size increases, despite the strong correlation (r=0.9445) and the large percent of variability in y explained by the regression (r^2=0.8921).

Beware of Outliers and Influential Points

Example 2.16 (pages 154–157)

Data Point    r with data point    r without data point    Effect of the point
Subject 15    0.4819               0.5684                  Weakens Regression
Subject 18    0.4819               0.3837                  Strengthens Regression

Beware the Lurking Variable (page 158) and Remember: Correlation does not imply Causation! (page 160)

Lurking variables can create “nonsense correlations” or possibly hide true relationships between x and y.

A “nonsense” correlation. Lurking variable? Both variables increased during the time period plotted. Thus the common year is a lurking variable.

Example 2.2 page 159

Example 2.21 page 160

[Figure: four scatterplots with fitted lines]

Data Set    Fitted line             R^2
A           y = 0.5001x + 3.0001    0.6665
B           y = 0.5x + 3.0009       0.6662
C           y = 0.4997x + 3.0025    0.6663
D           y = 0.4999x + 3.0017    0.6667

Problem 2.80 (page 169) – Any Observations?

Figure for Problem 2.81

Figure for Problem 2.83

Section 2.5 The Question of Causation

Legend:
- Dashed double arrow line is an observed association.
- Solid arrow from x to y shows “x causes y”.
