
22s:152 Applied Linear Regression Chapter 2:...

Date post: 29-Apr-2018
Upload: nguyenthuan
Page 1: 22s:152 Applied Linear Regression Chapter 2: …homepage.stat.uiowa.edu/~rdecook/stat3200/notes/ch2.pdfChapter 2: Regression Analysis ||||| Regression analysis ... outliers 5 10 15

22s:152 Applied Linear Regression

Chapter 2: Regression Analysis
--------------------

Regression analysis

• a class of statistical methods for

– studying relationships between variables that can be measured

e.g. predicting blood pressure from age

– using known values of certain variables to predict the values of other variables for the same subjects

e.g. given a person’s age, cholesterol, and weight, predict blood pressure


Well-known Example: Space Shuttle Challenger

On January 27, 1986, the night before a planned launch, a 3-hour discussion took place.

The discussion was about the forecasted low temperature of 31°F for the next day, and the effect of low temperature on O-ring performance. (O-rings seal joints.)

In their discussion they used the following plot, showing the relationship between the number of O-rings having some thermal distress and the temperature, to decide whether the shuttle should take off as planned.

[Figure: scatterplot of number of incidents versus temperature (50 to 85 °F)]


The final decision was to launch the shuttle as planned.

- 7 astronauts were killed

- a combustion gas leak through an O-ring was the cause of the accident

Post-tragedy, a commission noted a mistake in the analysis of the data: the flights with zero incidents were left off because it was felt that these flights did not contribute any information about the temperature effect.

[Figure: scatterplot of number of incidents versus temperature (50 to 85 °F), with additional points along the bottom of the plot]


What may have helped in the decision-making process?

- use of all the data (rather than using data conditional on the occurrence of an incident)

- quantification of the relationship between temperature and O-ring failure (perhaps as a conditional probability)

- prediction of the probability of O-ring failure at 31°F (logistic regression; Dalal et al. used this approach in their 1989 article)

Dalal, S.R., Fowlkes, E.B. and Hoadley, B. (1989). Risk analysis of the Space Shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association, v.84, 945-957.
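The logistic-regression idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not the analysis of Dalal et al.: the temperatures and 0/1 distress indicators below are invented for the example (they are not the actual flight record), and the helper names `fit_logistic` and `prob_distress` are hypothetical. The model is P(distress | temperature) = 1 / (1 + exp(−(b0 + b1·temperature))), fitted by Newton-Raphson on centered temperatures.

```python
import math

# HYPOTHETICAL data, invented for illustration (not the real flight record):
# launch temperature in degrees F and whether any O-ring distress occurred.
temps = [52, 55, 58, 60, 63, 65, 68, 70, 72, 75, 78, 80]
distress = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * d = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Centering the temperatures keeps the Newton iterations well conditioned.
mean_temp = sum(temps) / len(temps)
b0, b1 = fit_logistic([t - mean_temp for t in temps], distress)


def prob_distress(temp_f):
    """Estimated probability of distress at a given temperature (deg F)."""
    return sigmoid(b0 + b1 * (temp_f - mean_temp))


for t in (31, 53, 75):
    print(f"P(distress | {t} F) = {prob_distress(t):.3f}")
```

Note that 31°F lies far below every temperature in this made-up data set, so the value at 31°F is an extrapolation and should be read with that in mind.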


‘Investing it: duffers need not apply’
New York Times, May 31, 1998
An example of inappropriate removal of outliers

- An investment compensation expert carried out a study purporting to show that the major companies whose CEOs had low golf scores had high-performing stocks.

- The expert obtained data for golf scores from the journal Golf Digest and used his own data on the stock market performance of the companies of 51 chief executives.

- He created a Stock Rating, which scored each company based on how investors who held its stock did, with 100 being highest and 0 lowest.


[Figure: three scatterplots of stock rating (0 to 100) versus handicap (5 to 35).
Panel 1, all data points: corr = −0.04.
Panel 2, points considered outliers: 'outliers' marked with X.
Panel 3, data in final analysis: 'outliers' removed, corr = −0.41.]

King, B. (1998). Critique of ‘Investing it: duffers need not apply.’ Chance News 7.06.
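To see how much a correlation can move when a few points are dropped, here is a small sketch. The (handicap, stock rating) pairs are invented for illustration and are not the data from the study; `pearson_r` is a hypothetical helper that computes the usual sample correlation.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL (handicap, stock rating) pairs, invented for illustration.
core = [(5, 82), (10, 68), (15, 63), (20, 47), (25, 42), (30, 28)]
# Points an analyst might be tempted to label 'outliers' and discard.
flagged = [(8, 25), (12, 30), (31, 88), (33, 92)]

all_points = core + flagged
r_all = pearson_r(*zip(*all_points))   # every observation kept
r_trimmed = pearson_r(*zip(*core))     # flagged points removed

print(f"all points:        r = {r_all:.2f}")
print(f"'outliers' removed: r = {r_trimmed:.2f}")
```

With these made-up values, the full data show only a weak association, while the trimmed data show a strong negative one. The apparent relationship comes from the choice of which points to discard.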


Ch.2 Regression analysis... (as stated in book p. 16)

examines the relationship between a quantitative dependent variable $Y$ and one or more quantitative independent variables, $X_1, \ldots, X_k$. (He reserves the term regression for quantitative variables.)

Regression analysis traces the conditional distribution of $Y$ - or some aspect of the distribution, such as its mean - as a function of the $X$’s

Examples:

- General relationship between $X$ and $Y$ (where $\varepsilon$ represents a random error).

$Y = f(X) + \varepsilon$

Here $f$ may be a linear or non-linear relationship.


Linear Models (linear in the parameters)

- Simple linear relationship:
Model the conditional mean response of a continuous variable using a linear relationship to a single continuous variable, assuming normal errors

$Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

Given $X$, $Y$ has a normal distribution with a mean (center) of $\beta_0 + \beta_1 X$ and a variance of $\sigma^2$.

Also written as: $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$

Sketch of plot showing normal conditional distributions:
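The simple linear model can also be explored by simulation. The sketch below draws Y from N(β0 + β1·X, σ²) at ten X values and then recovers the line by least squares. The parameter values (β0 = 2, β1 = 0.5, σ = 1), the seed, and the sample size are arbitrary choices made for this illustration, and `least_squares` is a hypothetical helper.

```python
import random
import statistics

# Arbitrary illustrative values (assumptions, not taken from the notes).
BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0
rng = random.Random(152)

# 2000 observations spread evenly over X = 0, 1, ..., 9.
xs = [float(i % 10) for i in range(2000)]
# Y | X ~ N(BETA0 + BETA1 * X, SIGMA^2)
ys = [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for x in xs]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


b0_hat, b1_hat = least_squares(xs, ys)

# Conditional distribution at X = 4: mean should be near BETA0 + 4*BETA1
# and the standard deviation near SIGMA.
y_at_4 = [y for x, y in zip(xs, ys) if x == 4.0]
mean_at_4 = statistics.mean(y_at_4)
sd_at_4 = statistics.stdev(y_at_4)

print(f"estimated line: {b0_hat:.2f} + {b1_hat:.2f} X")
print(f"at X = 4: mean {mean_at_4:.2f}, sd {sd_at_4:.2f}")
```

Each slice of the data at a fixed X behaves like one of the normal curves in the sketch: centered on the line, with the same spread σ at every X.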


- Quadratic relationship:
Model the conditional mean response of a continuous variable as a quadratic relationship to a single continuous variable (this is still a linear model as it’s linear in the parameters)

$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

- Multiple linear relationships:
Model the conditional mean response of a continuous variable as a linear relationship with each of two continuous variables (no interaction)

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$
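One way to see why the quadratic model counts as a linear model is to write both models in matrix form. The notation below is introduced here for illustration and is not from the notes: $i = 1, \ldots, n$ indexes the subjects, $x_i$ is the single predictor in the quadratic model, and $x_{i1}, x_{i2}$ are the two predictors in the multiple regression model.

```latex
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^{\top},
\qquad
\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})
\]
\[
\mathbf{X}_{\text{quadratic}} =
\begin{pmatrix}
1 & x_1 & x_1^2 \\
\vdots & \vdots & \vdots \\
1 & x_n & x_n^2
\end{pmatrix},
\qquad
\mathbf{X}_{\text{multiple}} =
\begin{pmatrix}
1 & x_{11} & x_{12} \\
\vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2}
\end{pmatrix}
\]
```

In both cases the mean of $\mathbf{y}$ is a linear combination of the columns of $\mathbf{X}$ with weights $\beta_0, \beta_1, \beta_2$. The squared term only changes what goes in the third column, not how the parameters enter.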

Mean response surface shown on next page...


Mean response surface (errors not shown):

[Figure: three-dimensional plot of the mean response surface]

This surface is a plane in space.


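Non-Linear Models (not linear in the parameters)

- Specific relationship:

$Y = \beta_0 + \beta_1 X_1^{\beta_2} + \beta_3 X_2^{\beta_4} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$

- Specific relationship:

$Y = f(X_1, X_2) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$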

Mean response surface (errors not shown):


Non-normality
The conditional distribution of $Y$ given $X$ does not have to be normal. BUT the validity of many of our common hypothesis tests depends on normality.

$Y = \beta_0 + \beta_1 X + \varepsilon$ with $\varepsilon \sim$ a right-skewed distribution

[sketch]

- Might attain normality of errors through transformations ⇒ if so, common statistical tests are valid

- Could use the original skewed data and maximum likelihood methods for estimation (with a specified non-normal distribution)
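As a rough illustration of the transformation idea, the sketch below generates right-skewed positive values and compares their sample skewness before and after a log transform. The lognormal distribution, its parameters, the seed, and the choice of the log are all assumptions made for this example. In practice, which transformation helps (if any) depends on the data.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment over (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(2)

# HYPOTHETICAL right-skewed responses at a fixed X: lognormal draws,
# so log(Y) is exactly normal by construction.
raw = [rng.lognormvariate(0.0, 0.8) for _ in range(5000)]
logged = [math.log(y) for y in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness of Y:      {skew_raw:.2f}")
print(f"skewness of log(Y): {skew_log:.2f}")
```

The raw values have a long right tail (skewness well above zero), while the logged values are close to symmetric. That is the situation in which the usual normal-theory tests become reasonable on the transformed scale.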


Nonparametric Regression

LOWESS (locally weighted scatterplot smoother)

[Figure: scatterplot of Prestige (20 to 80) versus Average Income, USD (0 to 25000)]

- The lowess smoother estimates the function $f$ in $Y_i = f(x_i) + \varepsilon_i$

- The predicted $Y_i$ for a given $x_i$ is determined by considering only ‘local’ points in a ‘window’ around $x_i$

- Often a simple linear regression is fit to the local points, and the prediction falls on this line

- Researcher chooses width of window
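The local-fitting steps above can be sketched as a small function. This is a simplified illustration rather than a full lowess implementation: it fits one weighted straight line per target point using tricube weights on the nearest neighbours, and it omits the robustness iterations that the standard lowess procedure adds. The function name `local_linear`, the `span` parameter (fraction of points in the window), and the synthetic data are assumptions made for the example.

```python
import math


def local_linear(x0, xs, ys, span=0.5):
    """Predict y at x0 from a weighted line fit to the nearest span*n points."""
    n = len(xs)
    k = min(n, max(3, round(span * n)))  # number of points in the window
    # Indices of the k points closest to x0 (the 'local' points).
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    h = abs(xs[nearest[-1]] - x0)        # half-width of the window
    if h == 0.0:
        return sum(ys[i] for i in nearest) / k
    # Tricube weights: near 1 close to x0, falling to 0 at the window edge.
    w = [(1.0 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest]
    sw = sum(w)
    if sw == 0.0:
        return sum(ys[i] for i in nearest) / k
    xbar = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    ybar = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (xs[i] - xbar) ** 2 for wi, i in zip(w, nearest))
    if sxx == 0.0:
        return ybar
    sxy = sum(wi * (xs[i] - xbar) * (ys[i] - ybar) for wi, i in zip(w, nearest))
    # The prediction falls on the locally fitted line, evaluated at x0.
    return ybar + (sxy / sxx) * (x0 - xbar)


# SYNTHETIC curved data, invented for illustration.
xs = [float(i) for i in range(20)]
ys = [10.0 * math.sqrt(x) + (3.0 if i % 2 else -3.0) for i, x in enumerate(xs)]

smooth = [local_linear(x, xs, ys, span=0.4) for x in xs]
for x, s in zip(xs[::5], smooth[::5]):
    print(f"x = {x:4.1f}  smoothed y = {s:6.2f}")
```

A wider `span` uses more points per fit and gives a smoother curve. A narrower one follows the data more closely. This is the window-width choice left to the researcher.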


Other analyses

• The type of data will affect how the data is modeled and the choice of analysis

– Binary response (0/1) with covariate predictors:

Logistic regression

– Relationship between categorical/ordinal variables:

Contingency tables, chi-squared test (we won’t cover this in this class)

– Relationship between a quantitative dependent variable (Y) and qualitative predictor:

t-test or ANOVA


– Predicting a continuous response from both quantitative and qualitative variables:

Dummy-variable regression or ANCOVA

– Response is a count (Poisson distribution) and the Poisson distribution mean is dependent on the covariates:

Poisson regression
