+ All Categories
Home > Documents > Math 3680 Lecture #19 Correlation and Regression

Math 3680 Lecture #19 Correlation and Regression

Date post: 09-Jan-2016
Category:
Upload: ion
View: 35 times
Download: 0 times
Share this document with a friend
Description:
Math 3680 Lecture #19 Correlation and Regression. The Correlation Coefficient: Limitations. Moral : Correlation coefficients only measure linear association. Moral : Correlation coefficients are most appropriate for football-shaped scatter diagrams and can be very sensitive to outliers. - PowerPoint PPT Presentation
38
Math 3680 Lecture #19 orrelation and Regressio
Transcript
Page 1: Math 3680 Lecture #19 Correlation and Regression

Math 3680

Lecture #19

Correlation and Regression

Page 2: Math 3680 Lecture #19 Correlation and Regression

The Correlation Coefficient:

Limitations

Page 3: Math 3680 Lecture #19 Correlation and Regression

0

5

10

15

20

25

30

-6 -4 -2 0 2 4 6

X Y X (su) Y (su) Product-5 25 -1.25 1.1254 -1.40676-4 16 -1 0.2572 -0.25724-1 1 -0.25 -1.1897 0.2974292 4 0.5 -0.9003 -0.450163 9 0.75 -0.418 -0.313515 25 1.25 1.1254 1.40676

AVG AVG0 13.33333 r = -0.1447

SD SD4 10.36661

Moral: Correlation coefficients only measure linear association.

Page 4: Math 3680 Lecture #19 Correlation and Regression

Film starring Matthew

McConaughey

x, Minutes Shirtless

y, Opening Weekend Gross (millions of dollars)

We Are Marshall 0 6.1ED tv 0.8 8.3Reign of Fire 1.6 15.6Sahara 1.8 18.1Fool's Gold 14.6 21.6

Correlation 0.733965

Page 5: Math 3680 Lecture #19 Correlation and Regression

Film starring Matthew

McConaughey

x, Minutes Shirtless

y, Opening Weekend Gross (millions of dollars)

We Are Marshall 0 6.1ED tv 0.8 8.3Reign of Fire 1.6 15.6Sahara 1.8 18.1Fool's Gold

Correlation 0.966256

Moral: Correlation coefficients are most appropriate for football-shaped scatter diagrams and can be very sensitive to outliers.

Page 6: Math 3680 Lecture #19 Correlation and Regression

Regression

Page 7: Math 3680 Lecture #19 Correlation and Regression

The heights and weights from a survey of 988 men are shown in the scatter diagram.

Avg height = 70 in

SD height = 3 in

Avg weight = 162 lb

SD weight = 30 lb

r = 0.47

Page 8: Math 3680 Lecture #19 Correlation and Regression

Example. Suppose a man is one SD above average in height (70 + 3 = 73 inches).

Should you guess his weight to be one SD above average (162 + 30 = 192 pounds)?

Page 9: Math 3680 Lecture #19 Correlation and Regression

Solution: No. Notice that maybe 10 or 11 of the men 73 inches tall have weights above 192 pounds, while dozens have weights below 192 pounds.

(73, 192)

Page 10: Math 3680 Lecture #19 Correlation and Regression

A better prediction is obtained by increasing by not a full SD but by r SDs:

Prediction = Average + ( r )(# SDs)(SD)

= 162 + (0.47) (1) (30)

= 176.1 lb

This is our second interpretation of the correlation coefficient.

Page 11: Math 3680 Lecture #19 Correlation and Regression

Prediction

=162+(1) (0.47) (30)

=176.1 lb

(73, 176.1)

Page 12: Math 3680 Lecture #19 Correlation and Regression

(73, 176.1)

(70, 162)

3 inches

(0.47)(30) lbs

Slope = x

y

s

sr

3

3047.0

Page 13: Math 3680 Lecture #19 Correlation and Regression

Example: Predict the height of a man who is 176.1 pounds. Does this contradict our previous example?

Page 14: Math 3680 Lecture #19 Correlation and Regression

Example: Predict the weight of a man who is 5’6”. Where does this prediction appear in the diagram?

Page 15: Math 3680 Lecture #19 Correlation and Regression

Notice that these points are displayed on the solid line in the diagram. This line is called the regression line. To obtain this line, you start at the point of averages, and draw a line with slope

In other words, the equation of the regression line is

Reverse the roles of x and y when predicting in the other direction.

sy r sx

)(ˆ xxs

sryy

x

y

Page 16: Math 3680 Lecture #19 Correlation and Regression

Example: Find the equation of the regression line for the height-weight diagram.

Page 17: Math 3680 Lecture #19 Correlation and Regression

Average SAT score = 550 SD = 80

Average first-year GPA = 2.6 SD = 0.6

 

Example: A university has made a statistical analysis of the relationship between SAT-M scores and first-year GPA. The results are:

40.r

The scatter diagram is football shaped. Find the equation of the regression line. Then predict the first-year GPA of a randomly chosen student with an SAT-M score of 650.

Page 18: Math 3680 Lecture #19 Correlation and Regression

Both Excel and TI calculators are capable of computing and visualizing regression lines. (See book p. 426). In Excel 2007, highlight the x- and y-values and use

Insert, Scatter,

to draw a scatter plot. Click the data points, and then right-click Add Trendline to see the regression line.

Age (mo) Height (cm)24 8748 10160 12096 15963 13539 10463 12639 96

Age (mo) Line Fit Plot

0

20

40

60

80

100

120

140

160

180

0 20 40 60 80 100 120Age (mo)

Hei

ght (

cm)

Height (cm)

Predicted Height (cm)

Page 19: Math 3680 Lecture #19 Correlation and Regression

The Regression Effect

Page 20: Math 3680 Lecture #19 Correlation and Regression

For a study of 1,078 fathers and sons:

Average fathers’ height = 68 inSD = 2.7 inAverage sons’ height = 69 inSD = 2.7 inches 0.5 r

Suppose a father is 72 inches tall. How tall would you predict his son to be?

Repeat for a father who is 64 inches tall.

Page 21: Math 3680 Lecture #19 Correlation and Regression

Notice that tall fathers tend to have tall sons – though sons who are not as tall. Likewise, short fathers on average will have short sons – just not as short. Hence the term, “regression.” A pioneering but aristocratic statistician (Galton) called this effect, the “regression toward mediocrity,” and the term has stuck.

There is no biological cause to this effect – it is strictly statistical. Thinking that the regression effect is due to something important is called the regression fallacy.

Page 22: Math 3680 Lecture #19 Correlation and Regression

Example: A preschool program attempts to boost students’ IQs. The children are tested when they enter the program (pretest), and again when they leave the program (post-test). On both occasions, the average IQ score was 100, with an SD of 15. Also, students with below-average IQs on the pretest had scores that went up by 5 points, while students with above average scores of the pretest had their scores drop by an average of 5 points.

What is going on? Does the program equalize intelligence?

Page 23: Math 3680 Lecture #19 Correlation and Regression

Example. Suppose someone gets a score of 140 on the pretest. Does this mean that the student has an IQ of exactly 140?

Page 24: Math 3680 Lecture #19 Correlation and Regression

Solution: No. There will always be chance error associated with the measurement. For the sake of argument, let’s assume that the chance error is equal to 5 points.

Then there are two likely explanations, they are:

         Actual IQ of 135, with a chance error of +5

         Actual IQ of 145, with a chance error of -5

Which of the above is the most likely explanation?

Page 25: Math 3680 Lecture #19 Correlation and Regression

This explains the regression effect. If someone scores above average on the first test, we would estimate that the true score is probably a bit lower than the observed score.

Page 26: Math 3680 Lecture #19 Correlation and Regression

Example: An instructor gives a midterm. She asks the students who score 20 points below average to see her regularly during her office hours for special tutoring. They all scored at least average on the final. Can this improvement be attributed to the regression effect?

Page 27: Math 3680 Lecture #19 Correlation and Regression

Regression and

Error Estimation

Page 28: Math 3680 Lecture #19 Correlation and Regression

n = 988Avg height = 70 inSD height = 3 inAvg weight = 162 lbSD weight = 30 lbr = 0.47

For a man 73 in tall,we predict weight of

162+(1) (0.47) (30) =176.1 lb

Next question: what is the error for this estimate?Based on the picture, is it 30 lb? Or less?

Page 29: Math 3680 Lecture #19 Correlation and Regression

THEOREM. Assuming the data is normally distributed, we have

For the current example, that means

The weight is therefore estimated as 176.1 ± 26.5 lb.

2

11 2

n

nrss ye

5.26986

987)47.0(130 2 es

Page 30: Math 3680 Lecture #19 Correlation and Regression

Note. If n is large, the last factor in

may be safely ignored. In other words, if n is large, then

2

11 2

n

nrss ye

21 rss ye

Page 31: Math 3680 Lecture #19 Correlation and Regression

Example: For a study of 1,078 fathers and sons:

Fathers:Average height = 68 in SD = 2.7 in

Sons: Average height = 69 in SD = 2.7 inches r = 0.5

Suppose a father is 63 inches tall. What percentage have sons who are at least 66 inches tall?

Page 32: Math 3680 Lecture #19 Correlation and Regression
Page 33: Math 3680 Lecture #19 Correlation and Regression

Testing Correlation

Page 34: Math 3680 Lecture #19 Correlation and Regression

Recall the equation of the regression line:

so that the slope of the regression line is .

The standard error for the slope is given by

)(ˆ xxs

sryy

x

y

x

y

s

srb

2

1

2

1

1

22

n

r

r

b

n

r

s

s

ns

sSE

x

y

x

eb

Page 35: Math 3680 Lecture #19 Correlation and Regression

The standard error for the slope is given by

Under the assumptions of:

1. normality, and

2. homoscedasticity (see below),

the t distribution with df = n - 2 may be used to find confidence intervals and perform hypothesis tests.

Homoscedasticity means that the variability of the data about the regression line does not depend on the value of x.

2

1

2

1

1

22

n

r

r

b

n

r

s

s

ns

sSE

x

y

x

eb

Page 36: Math 3680 Lecture #19 Correlation and Regression

n = 988Avg height = 70 inSD height = 3 inAvg weight = 162 lbSD weight = 30 lbr = 0.47

7.43

30)47.0(

b

Page 37: Math 3680 Lecture #19 Correlation and Regression

281.0986

)47.0(1

47.0

7.4 2

bSE

For df = 986, the Student t distribution is almost normal. So a 95% confidence interval for the slope is

The corresponding confidence interval for the correlation coefficient for all men is

55.07.4)281.0)(96.1(7.4

055.047.0

Page 38: Math 3680 Lecture #19 Correlation and Regression

Example: For a study of 1,078 fathers and sons:

Fathers:Average height = 68 in SD = 2.7 in

Sons: Average height = 69 in SD = 2.7 inches r = 0.5

Test the hypothesis that the correlation coefficient for all fathers and sons is positive.


Recommended