Describing Relationships: Regression, Prediction and...

Post on 23-Sep-2020

Describing Relationships: Regression, Prediction and Causation

Chapter 15 plus extra

November 5, 2012

Prediction
Predictions for a Scatter Plot
Anatomy of a Line
The Regression Line
Regression and Least Squares
Regression Fallacy

1.0 Prediction

If we have two quantitative variables X and Y that are linearly related to each other, then knowing the particular value of X for one individual can help us to estimate (or predict) the value of Y for that individual.

We will explore what the best prediction of the response variable (Y) is, given a value of the explanatory variable (X), in football-shaped data clouds.

What is the likely size of the prediction error?

1.1 Fundamental Principle of Prediction

Incoming students at a large law school have an average L.S.A.T. score of 163 and an S.D. of 8. You may assume the histogram of these data follows a normal curve approximately. Tomorrow one of these students will be chosen at random.

What is your best guess for their score?

The guess will be compared to their actual score to see how far off it is. What is the likely size for the error in your guess?
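The question above can be explored with a short simulation. The sketch below assumes, as the slide does, a normal curve with average 163 and S.D. 8; the sample size, the seed, the alternative guesses 155 and 171, and the function name `rms_error` are illustrative choices that do not come from the original.

```python
import math
import random

def rms_error(guess, scores):
    """Root-mean-square size of the error when `guess` is used for every score."""
    return math.sqrt(sum((s - guess) ** 2 for s in scores) / len(scores))

random.seed(0)  # arbitrary seed, only for reproducibility
# Hypothetical draw of L.S.A.T. scores from a normal curve (average 163, S.D. 8).
scores = [random.gauss(163, 8) for _ in range(100_000)]

# Compare guessing the average with guessing one S.D. below or above it.
errors = {guess: rms_error(guess, scores) for guess in (155, 163, 171)}
for guess, err in errors.items():
    print(f"guess {guess}: r.m.s. error about {err:.1f}")
```

Under these assumptions the error is smallest when the guess is the average, and its likely size is then close to the S.D.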

2.0 Predictions for a Scatter Plot

[Figure: two scatter plots of son's height (inches) against father's height (inches).]

[Figure: scatter plot of son's height (inches) against father's height (inches).]

The graph of averages shows the average son's height for each father's height.

It is close to a straight line in the middle.

At the ends, it is quite bumpy.

Use the mean of the relevant sub-group of data as our predictor.

The S.D. of the group gives the “likely size” of the error in our prediction.
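The sub-group rule can be sketched in a few lines of Python. The data below are invented purely for illustration (they are not the father-son data from the slides), and the function name `graph_of_averages` and the choice to group fathers by height rounded to the nearest inch are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def graph_of_averages(pairs):
    """Map each rounded x value to (average y, S.D. of y, count) for that group."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[round(x)].append(y)
    return {x: (mean(ys), pstdev(ys), len(ys)) for x, ys in sorted(groups.items())}

# Hypothetical (father, son) heights in inches, invented for illustration only.
pairs = [(64.2, 66.0), (63.8, 68.0), (68.1, 68.0), (67.9, 70.0),
         (68.3, 69.0), (72.0, 70.0), (71.6, 72.0)]

averages = graph_of_averages(pairs)
# Predict with the group mean; the group S.D. is the likely size of the error.
prediction, likely_error, _ = averages[68]
print(f"father 68 in.: predict son {prediction:.1f} in., give or take {likely_error:.1f} in.")
```

With only a handful of points in each group, the group averages are noisy, which is why the graph of averages is bumpy at the ends.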

[Figure: scatter plot of son's height (inches) against father's height (inches).]

The regression method for prediction uses a line fit to the graph of averages.

It smooths away some of the chance variation in the group averages.

2.1 Example

The figure below is based on a representative sample of married couples in New York. The graph shows the average income of the wives (Y), given their husbands' income (X). The regression line is plotted too. Predict the income of wives when the husband makes $60,000.

3.0 Anatomy of a Line

Any line can be described by its slope and intercept.

The y-intercept is the height of the line when x is 0.

The slope is the rate at which y increases, per unit increase in x.

The equation of a line can be written in terms of its slope and intercept:

y = slope × x + intercept

3.1 Father and Son Example

Average height of fathers ≈ 68 in., SD of height of fathers ≈ 2.7 in.
Average height of sons ≈ 69 in., SD of height of sons ≈ 2.7 in.
r ≈ 0.5

[Figure: scatter plot showing the S.D. line and the regression line.]

S.D. line: the line where a 1 SD change in X is matched by a 1 SD change in Y.

Take the fathers who are 72 inches tall:

average X + 1.5 × SD of X ≈ 72 in.

The average height of their sons is only 71 inches.

average Y + 0.5 × 1.5 × SD of Y ≈ 71 in.

What about the fathers who are 64 inches tall?

average X - 1.5 × SD of X ≈ 64 in.

The average height of their sons is 67 inches.

average Y - 0.5 × 1.5 × SD of Y ≈ 67 in.
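Both calculations above follow one rule: for each SD that X is away from its average, the estimate for Y moves only r SDs away from its average. A minimal sketch of that rule, using the summary statistics quoted on the slide; the function name `predict_in_sd_units` is an illustrative choice, not from the original.

```python
def predict_in_sd_units(x, avg_x, sd_x, avg_y, sd_y, r):
    """Regression estimate: move r times as many SDs in Y as x is from average in X."""
    z_x = (x - avg_x) / sd_x   # how many SDs x is above (+) or below (-) average
    return avg_y + r * z_x * sd_y

# Summary statistics quoted on the slide (heights in inches).
stats = dict(avg_x=68, sd_x=2.7, avg_y=69, sd_y=2.7, r=0.5)

tall = predict_in_sd_units(72, **stats)
short = predict_in_sd_units(64, **stats)
print(f"fathers 72 in.: sons average about {tall:.0f} in.")
print(f"fathers 64 in.: sons average about {short:.0f} in.")
```

In both cases the sons' average is closer to the overall average of 69 inches than the fathers' height is to 68 inches.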

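[Figure: regression line with fathers' height on the horizontal axis and sons' height on the vertical axis, marking the point of averages (fathers 68 in., sons 69 in.), a run of 1.5 × SD of fathers' height, and a rise of r × 1.5 × SD of sons' height.]

Slope = (r × 1.5 × SD of sons' height) / (1.5 × SD of fathers' height) = 0.5 in./in.

Intercept = 69 in. - 0.5 in./in. × 68 in. = 69 in. - 34 in. = 35 in.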
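The slope and intercept worked out above can be reproduced from the five summary statistics. This is a minimal sketch; the function name `regression_coefficients` is an illustrative choice, not from the original.

```python
def regression_coefficients(avg_x, sd_x, avg_y, sd_y, r):
    """Return (slope, intercept) of the regression line for predicting Y from X."""
    slope = r * sd_y / sd_x
    intercept = avg_y - slope * avg_x
    return slope, intercept

# Father-son summary statistics quoted on the slide (heights in inches).
slope, intercept = regression_coefficients(avg_x=68, sd_x=2.7, avg_y=69, sd_y=2.7, r=0.5)
print(f"son's height = {slope:.1f} x father's height + {intercept:.0f} in.")

# The line passes through the point of averages (68 in., 69 in.).
assert abs(slope * 68 + intercept - 69) < 1e-9
```

Because the intercept is defined as the average of Y minus the slope times the average of X, the regression line always passes through the point of averages.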

4.0 The Regression Line

The regression line for predicting Y from X has the form:

Y = slope × X + intercept

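where

\[ b = \text{slope} = r \times \frac{\text{S.D. of } Y}{\text{S.D. of } X}, \]

\[ a = \text{intercept} = \bar{Y} - b\,\bar{X} = \bar{Y} - \left( r \times \frac{\text{S.D. of } Y}{\text{S.D. of } X} \right) \bar{X}. \]

Here \bar{X} and \bar{Y} denote the averages of X and Y.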

4.1 Prediction from a Regression Line

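The predicted value of Y for a given value of X, say X*, has the form:

\[ Y^* = b X^* + a = \left( r \times \frac{\text{S.D. of } Y}{\text{S.D. of } X} \right) X^* + \bar{Y} - \left( r \times \frac{\text{S.D. of } Y}{\text{S.D. of } X} \right) \bar{X} \]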

4.2 Example

A university has made a statistical analysis of the relationship between Math S.A.T. scores (ranging from 200 to 800) and first year G.P.A.s (ranging from 0 to 4.0), for students who complete the first year. The results:

average S.A.T. score = 550, S.D. = 80
average first year G.P.A. = 2.6, S.D. = 0.6
r = 0.4

The scatter diagram is football shaped. A student is chosen at random, and has an S.A.T. of 650. Predict this individual's first year G.P.A.
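One way to work this example is to plug the summary statistics into the regression-line formulas from section 4.0. The sketch below does only that; the variable names are illustrative.

```python
# Summary statistics from the example.
avg_sat, sd_sat = 550, 80
avg_gpa, sd_gpa = 2.6, 0.6
r = 0.4

slope = r * sd_gpa / sd_sat            # G.P.A. points per S.A.T. point
intercept = avg_gpa - slope * avg_sat  # height of the line at S.A.T. = 0

sat = 650
predicted_gpa = slope * sat + intercept
print(f"slope = {slope:.4f}, intercept = {intercept:.2f}")
print(f"predicted first year G.P.A. at S.A.T. {sat}: {predicted_gpa:.1f}")
```

The same answer comes from standard units: 650 is 1.25 SDs above the average S.A.T., so the estimate is 0.4 × 1.25 = 0.5 SDs, or 0.3 points, above the average G.P.A. of 2.6.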

5.0 Regression and Least Squares

The regression line is also known as the least squares line, because it minimizes the sum of the squares of the vertical distances from the data points to the line.

[Figure 15.3: a data point and its vertical distance to the regression line, plotted on x and y axes.]
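The least squares property can be checked numerically on a small data set. The five points below are invented for illustration only, and the helper name `sum_of_squares` is an assumption; the sketch builds the regression line from r and the two S.D.s, then confirms that nearby lines have a larger sum of squared vertical distances.

```python
from statistics import mean, pstdev

def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

avg_x, avg_y = mean(xs), mean(ys)
sd_x, sd_y = pstdev(xs), pstdev(ys)
# Correlation coefficient: average product of the values in standard units.
r = mean(((x - avg_x) / sd_x) * ((y - avg_y) / sd_y) for x, y in zip(xs, ys))

slope = r * sd_y / sd_x
intercept = avg_y - slope * avg_x
best = sum_of_squares(xs, ys, slope, intercept)

# Nudging the slope or intercept in either direction should only increase the sum.
nudged = [sum_of_squares(xs, ys, slope + ds, intercept + di)
          for ds in (-0.1, 0.0, 0.1) for di in (-0.1, 0.0, 0.1)
          if (ds, di) != (0.0, 0.0)]
print(f"regression line: {best:.2f}; smallest nudged line: {min(nudged):.2f}")
```

Checking a few nudged lines is not a proof, only a quick illustration that the regression line sits at the minimum.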

6.0 The Regression Fallacy

In virtually every scatterplot with less than perfect correlation, the data points that are extreme along the x axis tend not to be as extreme on the y axis. This is called the regression effect.

Definition: Thinking that the regression effect must be due to something important, not just chance error, is called the regression fallacy.

6.1 Example

An instructor standardizes both her midterm and the final each semester so the class average is 50 and the S.D. is 10 on both tests. The correlation between the tests is around 0.5. One semester she took all the students who scored below 30 on the midterm and gave them special tutoring. On average, they gained 10 points on the final. She claims that her tutoring worked. Can you give an alternative explanation?
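A simulation can show what the regression effect alone would produce here. The sketch below assumes a football-shaped (bivariate normal) cloud of midterm and final scores with average 50, S.D. 10, and correlation 0.5, and it assumes that no student receives any tutoring; the class size of 200,000 and the seed are arbitrary choices made only to get a stable average.

```python
import math
import random

random.seed(1)  # arbitrary seed, only for reproducibility
r, avg, sd = 0.5, 50, 10

gains = []
for _ in range(200_000):
    # Hypothetical students with NO tutoring: standard-unit scores with correlation r.
    z_mid = random.gauss(0, 1)
    z_fin = r * z_mid + math.sqrt(1 - r ** 2) * random.gauss(0, 1)
    midterm, final = avg + sd * z_mid, avg + sd * z_fin
    if midterm < 30:
        gains.append(final - midterm)

average_gain = sum(gains) / len(gains)
print(f"{len(gains)} students scored below 30; average gain {average_gain:.1f} points")
```

Under these assumptions, students selected for a very low midterm score gain roughly ten points or more on the final with no tutoring at all, so a 10-point gain by itself is not evidence that the tutoring worked.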