Post on 29-Apr-2018

22s:152 Applied Linear Regression

Chapter 2: Regression Analysis

Regression analysis

• a class of statistical methods for

– studying relationships between variables that can be measured

e.g. predicting blood pressure from age

– using known values of certain variables to predict the values of other variables for the same subjects

e.g. given a person’s age, cholesterol, and weight, predict blood pressure


Well-known Example: Space Shuttle Challenger

On January 27, 1986, the night before a planned launch, a 3-hour discussion took place.

The discussion was about the forecasted low temperature of 31° F for the next day, and the effect of low temperature on O-ring performance (O-rings seal joints).

In their discussion, the participants used the following plot, showing the relationship between temperature and the number of O-rings having some thermal distress, to decide whether the shuttle should take off as planned.

[Figure: number of incidents (vertical axis) plotted against temperature (horizontal axis, 50 to 85).]

The final decision was to launch the shuttle as planned.

- 7 astronauts were killed

- a combustion gas leak through an O-ring was the cause of the accident

After the tragedy, a commission noted a mistake in the analysis of the data: the flights with zero incidents had been left off the plot because it was felt that these flights did not contribute any information about the temperature effect.

[Figure: number of incidents (vertical axis) plotted against temperature (horizontal axis, 50 to 85), with the zero-incident flights included.]

What may have helped in the decision-making process?

- use of all the data (rather than only the data conditional on the occurrence of an incident)

- quantification of the relationship between temperature and O-ring failure (perhaps as a conditional probability)

- prediction of the probability of O-ring failure at 31° F (logistic regression; Dalal et al. used this approach in their 1989 article)

Dalal, S.R., Fowlkes, E.B. and Hoadley, B. (1989). Risk analysis of the Space Shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association, v.84, 945-957.
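The first two points above (use all the data, and express the relationship as a conditional probability) can be sketched in a few lines. The records below are hypothetical toy values chosen for illustration, not the actual shuttle data, and the 65° F cut-off between "cold" and "warm" is likewise an assumption.

```python
# Hypothetical toy records (NOT the actual shuttle data):
# each pair is (temperature in degrees F, 1 if the flight had an incident else 0).
FLIGHTS = [
    (53, 1), (57, 1), (63, 1),
    (66, 0), (67, 0), (68, 0), (70, 1), (70, 0),
    (72, 0), (75, 1), (76, 0), (79, 0), (81, 0),
]


def incident_rate(flights, low, high):
    """Fraction of flights with low <= temperature < high that had an incident."""
    in_range = [hit for temp, hit in flights if low <= temp < high]
    if not in_range:
        raise ValueError("no flights in the requested temperature range")
    return sum(in_range) / len(in_range)


# Conditioning on an incident having occurred throws away every zero.
incident_only = [(temp, hit) for temp, hit in FLIGHTS if hit == 1]

for label, data in (("all flights", FLIGHTS), ("incident flights only", incident_only)):
    cold = incident_rate(data, 50, 65)  # assumed cut-off: below 65 counts as cold
    warm = incident_rate(data, 65, 85)
    print(f"{label}: P(incident | cold) = {cold:.2f}, P(incident | warm) = {warm:.2f}")
```

With the zero-incident flights dropped, the estimated rate is 1.00 in both temperature groups, so the subset cannot show a temperature effect; with all flights, the toy rates differ (1.00 against 0.20).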


‘Investing it: duffers need not apply’

New York Times, May 31, 1998

An example of inappropriate removal of outliers

- An investment compensation expert carried out a study purporting to show that major companies whose C.E.O.s had low golf scores had high-performing stocks.

- The expert obtained golf scores from the journal Golf Digest and used his own data on the stock market performance of the companies of 51 chief executives.

- He created a stock rating for each company based on how investors who held its stock fared, with 100 being the highest and 0 the lowest.


[Figure: three scatterplots of stock rating (vertical axis, 0 to 100) against handicap (horizontal axis, 5 to 35).

Panel 1, all data points: corr = −0.04

Panel 2, points considered outliers: 'outliers' marked with X

Panel 3, data in final analysis: 'outliers' removed, corr = −0.41]

King, B. (1998). Critique of ‘Investing it: duffers need not apply.’ Chance News 7.06.
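How much a correlation can move when a few hand-picked points are dropped can be sketched with a tiny example. The grid below is hypothetical toy data (not the golf study), built so that the full data set has exactly zero correlation.

```python
import math


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: a 3 x 3 grid of (handicap, stock rating) with no association.
all_points = [(h, s) for h in (10, 20, 30) for s in (20, 50, 80)]

# Drop two hand-picked corner points, the ones that contradict a negative trend.
so_called_outliers = {(10, 20), (30, 80)}
trimmed = [p for p in all_points if p not in so_called_outliers]

print(pearson_r(all_points))  # 0.0
print(pearson_r(trimmed))     # -0.5
```

Removing 2 of 9 points is enough here to turn no association into a moderate negative one, which is the same mechanism as in the critique, only with made-up numbers.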


Ch.2 Regression analysis... (as stated in book p. 16)

examines the relationship between a quantitative dependent variable Y and one or more quantitative independent variables, $X_1, \ldots, X_k$. (The author reserves the term regression for quantitative variables.)

Regression analysis traces the conditional distribution of Y, or some aspect of the distribution such as its mean, as a function of the X’s.

Examples:

- General relationship between X and Y (where $\varepsilon$ represents a random error):

$Y = f(X) + \varepsilon$

Here $f$ may be a linear or non-linear relationship.


Linear Models (linear in the parameters)

- Simple linear relationship:
Model the conditional mean response of a continuous variable using a linear relationship to a single continuous variable, assuming normal errors

$Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

Given X, Y has a normal distribution with a mean (center) of $\beta_0 + \beta_1 X$ and a variance of $\sigma^2$.

Also written as: $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$

[Sketch: plot showing normal conditional distributions]
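The simple linear model can be sketched by simulation: draw Y from its normal conditional distribution given X, then recover the coefficients by least squares. The parameter values, sample size, and the helper name `ols_fit` are assumptions made for illustration.

```python
import random


def ols_fit(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Assumed (illustrative) true parameter values.
BETA0, BETA1, SIGMA = 1.0, 2.0, 1.0

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(2000)]
# Y | X ~ N(beta0 + beta1 * X, sigma^2)
ys = [BETA0 + BETA1 * x + rng.gauss(0, SIGMA) for x in xs]

b0_hat, b1_hat = ols_fit(xs, ys)
print(f"estimated intercept = {b0_hat:.3f}, estimated slope = {b1_hat:.3f}")
```

With this many simulated points the estimates should land close to the assumed values of 1 and 2, though not exactly on them, because of the random error term.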

- Quadratic relationship:
Model the conditional mean response of a continuous variable as a quadratic relationship to a single continuous variable (this is still a linear model, as it’s linear in the parameters)

$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

- Multiple linear relationships:
Model the conditional mean response of a continuous variable as a linear relationship with each of two continuous variables (no interaction)

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

Mean response surface (errors not shown):

[Figure: mean response surface; axis labels x1, y, Z.]

This surface is a plane in space.
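Because the quadratic model and the two-predictor model are both linear in the parameters, the same least-squares routine can fit either one; only the columns of the design matrix change. The sketch below assumes noiseless toy data and solves the normal equations directly. The function names are illustrative, and in practice a numerically more stable method (such as a QR decomposition from a linear algebra library) would usually be preferred.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(design, ys):
    """Coefficients minimising squared error, via the normal equations X'X b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Quadratic model: the columns are 1, x, x^2, so the mean is linear in (b0, b1, b2).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
quad_y = [1 + 2 * x + 3 * x ** 2 for x in xs]  # noiseless, illustrative values
quad_coef = least_squares([[1.0, x, x ** 2] for x in xs], quad_y)

# Two-predictor model: the columns are 1, x1, x2 (a plane, no interaction).
pairs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
plane_y = [4 - 1 * x1 + 0.5 * x2 for x1, x2 in pairs]  # noiseless, illustrative values
plane_coef = least_squares([[1.0, x1, x2] for x1, x2 in pairs], plane_y)

print([round(c, 6) for c in quad_coef])
print([round(c, 6) for c in plane_coef])
```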

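Non-Linear Models (not linear in the parameters)

- Specific relationship:

$Y = \beta_0 + \beta_1 X_1^{\beta_2} + \beta_3 X_2^{\beta_4} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

- Specific relationship:

$Y = f(X_1, X_2) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$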

[Figure: mean response surface (errors not shown).]
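One informal way to check whether a model is linear in the parameters is to differentiate the mean function with respect to each parameter: if no derivative involves a parameter, the model is linear in the parameters. The sketch below applies this to the quadratic model and to the first non-linear model; the symbol $\mu$ for the conditional mean is notation introduced here, not taken from the slides.

```latex
% Quadratic model: linear in the parameters
\mu(X) = \beta_0 + \beta_1 X + \beta_2 X^2,
\qquad
\frac{\partial \mu}{\partial \beta_2} = X^2
\quad \text{(free of all } \beta\text{'s)}

% Power model: not linear in the parameters
\mu(X_1, X_2) = \beta_0 + \beta_1 X_1^{\beta_2} + \beta_3 X_2^{\beta_4},
\qquad
\frac{\partial \mu}{\partial \beta_1} = X_1^{\beta_2},
\qquad
\frac{\partial \mu}{\partial \beta_2} = \beta_1 X_1^{\beta_2} \ln X_1
\quad (X_1 > 0)
```

In the power model the derivatives still depend on $\beta_1$ and $\beta_2$, so least squares has no closed-form solution and an iterative fit is needed.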

Non-normality

The conditional distribution of Y given X does not have to be normal, BUT the validity of many of our common hypothesis tests depends on normality.

$Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim$ a right-skewed distribution

[Sketch]

- Might attain normality of errors through transformations ⇒ if so, common statistical tests are valid

- Could use the original skewed data and maximum likelihood methods for estimation (with a specified non-normal distribution)
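The transformation idea can be sketched with a simulated right-skewed variable. The example assumes the values are log-normal, a convenient special case in which the log transformation yields exactly normal values; real data will rarely be this tidy, and the choice of transformation is itself a modelling decision.

```python
import math
import random


def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)
# Assumption for illustration: log-normal values, so their logs are exactly normal.
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in skewed]

print(f"skewness before transformation: {sample_skewness(skewed):.2f}")
print(f"skewness after log transformation: {sample_skewness(logged):.2f}")
```

The raw values should show clearly positive skewness, while the logged values should have skewness near zero.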


Nonparametric Regression

LOWESS (locally weighted scatterplot smoother)

[Figure: scatterplot of Prestige (vertical axis, 20 to 80) against Average Income, USD (horizontal axis, 0 to 25000).]

- The lowess smoother estimates the function $f$ in $Y_i = f(x_i) + \varepsilon_i$

- The predicted $Y_i$ for a given $x_i$ is determined by considering only ‘local’ points in a ‘window’ around $x_i$

- Often a simple linear regression is fit to the local points, and the prediction falls on this line

- The researcher chooses the width of the window
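The steps above can be sketched as a local linear smoother with tricube weights. This is a simplified illustration, not a full LOWESS implementation: it omits the robustness iterations that down-weight outlying points, and the function names, the `span` parameter, and the toy data are assumptions.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u (zero outside |u| < 1)."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear_predict(xs, ys, x0, span=0.5):
    """Predict at x0 from a weighted line fit to the nearest `span` fraction of points."""
    k = max(3, int(round(span * len(xs))))
    distances = sorted(abs(x - x0) for x in xs)
    # Window half-width: just beyond the k-th nearest point, so it keeps a positive weight.
    half_width = distances[min(k, len(xs)) - 1] * 1.0001 + 1e-12
    weights = [tricube((x - x0) / half_width) for x in xs]

    # Weighted simple linear regression on the points inside the window.
    sw = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / sw
    mean_y = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)


# Illustrative data only: a curved trend that a single straight line would miss.
xs = [float(i) for i in range(21)]
ys = [(x - 10.0) ** 2 for x in xs]
smooth = [local_linear_predict(xs, ys, x, span=0.3) for x in xs]
print([round(v, 1) for v in smooth])
```

A wider window (larger `span`) gives a smoother but more biased curve; a narrower one follows the data more closely but is noisier.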


Other analyses

• The type of data will affect how the data is modeled and the choice of analysis

– Binary response (0/1) with covariate predictors:

Logistic regression

– Relationship between categorical/ordinal variables:

Contingency tables, chi-squared test (we won’t cover this in this class)

– Relationship between a quantitative dependent variable (Y) and a qualitative predictor:

t-test or ANOVA

– Predicting a continuous response from both quantitative and qualitative variables:

Dummy-variable regression or ANCOVA

– Response is a count (Poisson distribution) and the Poisson distribution mean depends on the covariates:

Poisson regression
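As one illustration from the list above, the binary-response case can be sketched as a one-predictor logistic regression fit by Newton's method. The data are hypothetical toy values and the function name is an assumption; statistical software would normally be used for real analyses.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Maximum-likelihood fit of P(Y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Information matrix (negative Hessian), a symmetric 2 x 2 matrix.
        h00 = sum(p * (1 - p) for p in ps)
        h01 = sum(p * (1 - p) * x for x, p in zip(xs, ps))
        h11 = sum(p * (1 - p) * x * x for x, p in zip(xs, ps))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix inverse) times (score).
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def predict_probability(b0, b1, x):
    """Fitted probability that Y = 1 at covariate value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Hypothetical toy data: the 0/1 responses overlap, so the estimates are finite.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_y = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logistic(toy_x, toy_y)
print(f"b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
print(f"fitted P(Y=1 | x=2) = {predict_probability(b0_hat, b1_hat, 2.0):.3f}")
print(f"fitted P(Y=1 | x=7) = {predict_probability(b0_hat, b1_hat, 7.0):.3f}")
```

If the 0s and 1s were perfectly separated by x, the maximum-likelihood estimates would not be finite and this simple loop would fail, which is why the toy responses overlap.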
