
Unit 4: Regression assumptions: Evaluating their tenability

Date post: 31-Dec-2015
Page 1: Unit 4: Regression assumptions: Evaluating their tenability

© Judith D. Singer, Harvard Graduate School of Education Unit 4/Slide 1

Unit 4: Regression assumptions: Evaluating their tenability

Page 2: Unit 4: Regression assumptions: Evaluating their tenability

The S-030 roadmap: Where’s this unit in the big picture?

• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects
• Unit 11: Regression modeling in practice

Stages of the roadmap: Building a solid foundation; Mastering the subtleties; Adding additional predictors; Generalizing to other types of predictors and effects; Pulling it all together

Page 3: Unit 4: Regression assumptions: Evaluating their tenability

In this unit, we’re going to learn about…

• Reprise of the assumptions required for least squares estimation and inference
• The four major types of model violations:
  – Outliers
  – Nonlinearity
  – Heteroscedasticity
  – Non-independence of errors
• Determining whether the regression assumptions hold: strategies and rationale
  – Why residuals provide a powerful lens for evaluating regression assumptions
  – Residuals as controlled observations
  – Raw residuals and studentized residuals
  – Residual plots: How to construct them and what to look for
  – What should we do if we find an outlier or other unusual observation?
• How would we summarize our results?

Page 4: Unit 4: Regression assumptions: Evaluating their tenability

(Reprise of) Assumptions required for least squares estimation & inference

$Y = \beta_0 + \beta_1 X + \varepsilon$

1. At each value of X, there is a distribution of Y. These distributions have a mean $\mu_{Y|X}$ and a variance of $\sigma^2_{Y|X}$.
2. The straight line model is correct. The means of each of these distributions, the $\mu_{Y|X}$’s, may be joined by a straight line.
3. Homoscedasticity. The variances of each of these distributions, the $\sigma^2_{Y|X}$’s, are identical.
4. Independence of observations. At each given value of X (at each $x_i$), the values of Y (the $y_i$’s) are independent of each other.
5. Normality. At each given value of X (at each $x_i$), the values of Y (the $y_i$’s) are normally distributed.

Assumptions ARE about the behavior of Y at each X in the population.
Assumptions ARE NOT about the sample, X, or Y overall.

[Figure: distributions of Y at x1, x2, and x3, centered on the means $\mu_{Y|x_1}$, $\mu_{Y|x_2}$, $\mu_{Y|x_3}$; axes X and Y.]
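The five assumptions are often collected into one compact statement. The block below is a conventional textbook way of writing them, not notation taken from the slide; the subscript $i$ and the symbol $\sigma^2$ for the common variance are assumptions introduced here for illustration.

```latex
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
\quad\Longrightarrow\quad
\mu_{Y|x_i} = \beta_0 + \beta_1 x_i,
\qquad \sigma^2_{Y|x_i} = \sigma^2 \ \text{for every } x_i .
```

Read left to right: "iid" carries independence, "N" carries normality, the single $\sigma^2$ carries homoscedasticity, and the form of $\mu_{Y|x_i}$ carries linearity.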

Page 5: Unit 4: Regression assumptions: Evaluating their tenability

Anscombe’s data: Good looks can be deceiving

   Data Set I      Data Set II     Data Set III    Data Set IV
    x      y        x      y        x      y        x      y
   10    8.04      10    9.14      10    7.46       8    6.58
    8    6.95       8    8.14       8    6.77       8    5.76
   13    7.58      13    8.74      13   12.74       8    7.71
    9    8.81       9    8.77       9    7.11       8    8.84
   11    8.33      11    9.26      11    7.81       8    8.47
   14    9.96      14    8.10      14    8.84       8    7.04
    6    7.24       6    6.13       6    6.08       8    5.25
    4    4.26       4    3.10       4    5.39       8    5.56
   12   10.84      12    9.13      12    8.15       8    7.91
    7    4.82       7    7.26       7    6.42       8    6.89
    5    5.68       5    4.74       5    5.73      19   12.50

Fitted regression (the same for all four data sets): $\hat{y} = 3.0 + 0.5X$, $t = 4.24$, $R^2 = 67\%$
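The claim that four very different data sets share one fitted line can be checked directly. The following is a minimal Python sketch (the course itself uses SAS; Python and the helper name `simple_ols` are choices made here for illustration) that fits a simple least-squares line to each data set as transcribed from the slide.

```python
import math

# Anscombe's four data sets, transcribed from the slide.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
DATA = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8] * 10 + [19], [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
}


def simple_ols(x, y):
    """Return (intercept, slope, t statistic for the slope, R-squared)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = syy - slope * sxy  # residual (error) sum of squares
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, slope / se_slope, 1 - sse / syy


results = {name: simple_ols(x, y) for name, (x, y) in DATA.items()}
for name, (b0, b1, t, r2) in results.items():
    print(f"Data Set {name}: y-hat = {b0:.2f} + {b1:.2f} X, t = {t:.2f}, R^2 = {r2:.1%}")
```

The summary statistics agree to about two decimal places across the four sets, which is why the numbers alone cannot reveal the curvature, the outlier, or the single high-leverage point; only plots can.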

Page 6: Unit 4: Regression assumptions: Evaluating their tenability

Four major types of model violations

• Non-independence of errors. Observations within the data set are clustered (or otherwise related)**
• Heteroscedasticity. The variance of Y varies as a function of X
• Outliers. Extreme observations that don’t fit the general pattern
• Nonlinearity. There’s a relationship between Y and X, but it’s not best summarized by a straight line

**Remember: we can’t see this one visually.

A cautionary tale….

Page 7: Unit 4: Regression assumptions: Evaluating their tenability

Where to go for dinner? Using Zagat ratings to identify restaurants that are “good” and “poor” values

n = 76

RQ: Do you get what you pay for? What’s the relationship between a restaurant’s rating and its cost?

$RATING = \beta_0 + \beta_1 COST + \varepsilon$

RQ: Which restaurants are good values? Given what you’re paying (controlling for price), where do you get the best food?

Residuals as “controlled” observations

NAME                COST    RATING
Jacob Wirth           26   16.3333
Grafton St. Pub       24   17.0000
29 Newbury            38   17.0000
Vox Populi            31   17.0000
Avenue One            35   17.3333
Orleans               28   17.3333
Daedalus              26   17.6667
...
224 Boston St.        31   21.3333
West Side Lounge      30   21.3333
Fava                  43   21.6667
...
Harvest               46   23.6667
Square Café           39   23.6667
Troquet               54   23.6667
TW Foods              39   24.0000
flora                 40   24.0000
...
Hamersley's           56   25.3333
Icarus                54   25.3333
Excelsior             65   25.6667
Federalist            69   25.6667
Meritage              63   25.6667

Outcome

The UNIVARIATE Procedure
Variable: RATING

        Location                  Variability
Mean      20.98684     Std Deviation    2.49589
Median    21.00000     Variance         6.22945
Mode      19.66667     Range            9.33333

Stem Leaf          #  Boxplot
  25 777           3     |
  25 333           3     |
  24                     |
  24 0000033       7     |
  23 7777          4     |
  23 03            2     |
  22 777           3  +-----+
  22 0000333       7  |     |
  21 7             1  |     |
  21 000003333     9  *--+--*
  20 777           3  |     |
  20 00033         5  |     |
  19 77777         5  |     |
  19 0003333       7  +-----+
  18 777           3     |
  18 00333         5     |
  17 777           3     |
  17 00033         5     |
  16                     |
  16 3             1     |
     ----+----+----+----+

Predictor

The UNIVARIATE Procedure
Variable: COST

        Location                  Variability
Mean      38.52632     Std Deviation   10.54574
Median    38.00000     Variance       111.21263
Mode      38.00000     Range           50.00000

Stem Leaf          #  Boxplot
  68 0             1     0
  66
  64 0             1     |
  62 0             1     |
  60                     |
  58 0             1     |
  56 0             1     |
  54 00            2     |
  52 000           3     |
  50 0000          4     |
  48                     |
  46 00            2     |
  44 00000         5  +-----+
  42 00000         5  |     |
  40 000           3  |     |
  38 0000000000   10  *--+--*
  36 0000          4  |     |
  34 000000        6  |     |
  32 0000          4  |     |
  30 0000000       7  +-----+
  28 000000        6     |
  26 00000         5     |
  24 000           3     |
  22 0             1     |
  20                     |
  18 0             1     |
     ----+----+---

Page 8: Unit 4: Regression assumptions: Evaluating their tenability

What’s the relationship between cost and restaurant ratings?

The REG Procedure
Dependent Variable: RATING

Analysis of Variance

Source             DF   Sum of Squares   Mean Square
Model               1        287.86080     287.86080
Error              74        179.34826       2.42363
Corrected Total    75        467.20906

Root MSE           1.55680    R-Square   0.6161
Dependent Mean    20.98684    Adj R-Sq   0.6109
Coeff Var          7.41798

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             13.82968          0.68057     20.32     <.0001
COST         1              0.18577          0.01705     10.90     <.0001

$\widehat{RATING} = 13.83 + 0.19\,COST$

• Effect is strong: 61.6% of the variation in ratings is associated with cost
• Estimated effect is large†: Each $10.00 difference in cost is positively associated with a 1.9 difference in ratings
• Effect is statistically significant: Unlikely that in the population of Boston restaurants, there’s no relationship between price & ratings
• Estimated effect is quite precise: Narrow 95% CI (0.1523, 0.2193)

†How large is large? Compare to SD of outcome (≈ 2.5)
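Each of the four summary claims can be traced back to numbers in the PROC REG output. The sketch below (Python, for illustration only) redoes that arithmetic. One assumption is flagged in the code: the critical value 1.96 is a normal approximation, chosen because it appears to reproduce the interval on the slide; the exact t critical value for 74 degrees of freedom is slightly larger and would widen the interval a little.

```python
# Values transcribed from the SAS output on the slides.
SS_MODEL = 287.86080   # model sum of squares
SS_TOTAL = 467.20906   # corrected total sum of squares
SLOPE = 0.18577        # estimated coefficient for COST
SE_SLOPE = 0.01705     # its standard error
SD_RATING = 2.49589    # SD of RATING from PROC UNIVARIATE

# Assumption: normal-approximation critical value (not stated on the slide).
Z_CRIT = 1.96

r_squared = SS_MODEL / SS_TOTAL            # "effect is strong"
t_value = SLOPE / SE_SLOPE                 # "statistically significant"
ci_low = SLOPE - Z_CRIT * SE_SLOPE         # "quite precise"
ci_high = SLOPE + Z_CRIT * SE_SLOPE
diff_per_10_dollars = 10 * SLOPE           # "effect is large"
in_sd_units = diff_per_10_dollars / SD_RATING

print(f"R^2 = {r_squared:.4f}, t = {t_value:.2f}")
print(f"95% CI for slope: ({ci_low:.4f}, {ci_high:.4f})")
print(f"$10 difference -> {diff_per_10_dollars:.2f} rating points "
      f"({in_sd_units:.2f} SD of RATING)")
```

The last line puts the footnote into numbers: a $10 difference in cost corresponds to roughly three quarters of a standard deviation in rating.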

Page 9: Unit 4: Regression assumptions: Evaluating their tenability

How can we determine whether the assumptions hold?

What does it mean to refer to the “values of Y at each X?”

Assumptions we can examine:
1. Normality
2. Homoscedasticity
3. Linearity
which all refer to the values of Y at each X

[Figure: distributions of Y at x1, x2, and x3 around the line $Y = \beta_0 + \beta_1 X + \varepsilon$, and the corresponding residuals $Y - \hat{Y}$ plotted against X around 0.]

So the residual distributions should be:
1. Normal
2. Homoscedastic
3. Totally unrelated to X

The hope? To see random scatter in the residual plot
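One subtlety behind "totally unrelated to X": least-squares residuals always average zero and always have zero linear correlation with X, whatever the data look like, so the check has to come from the pattern in the residual plot. The following minimal Python sketch (for illustration only) uses Anscombe's Data Set II from the earlier slide. Correlating the residuals with $(x - \bar{x})^2$ is a device chosen here to stand in for eyeballing curvature; it is not a procedure from the course.

```python
import math

# Anscombe's Data Set II, transcribed from the earlier slide.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def mean(values):
    return sum(values) / len(values)


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    abar, bbar = mean(a), mean(b)
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Fit the straight line by least squares.
xbar, ybar = mean(x), mean(y)
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar

# Residual = observed Y minus predicted Y.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

resid_mean = mean(residuals)                  # zero by construction
resid_x_corr = correlation(residuals, x)      # zero by construction
curvature = [(xi - xbar) ** 2 for xi in x]
resid_curv_corr = correlation(residuals, curvature)  # far from zero here

print(f"mean residual        = {resid_mean:.2e}")
print(f"corr(residual, x)    = {resid_x_corr:.2e}")
print(f"corr(residual, x^2-ish) = {resid_curv_corr:.3f}")
```

The first two quantities are zero up to rounding error even though the straight-line model is clearly wrong for this data set; the strong relationship with the squared term is what shows up in a residual plot as an arch instead of random scatter.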

Page 10: Unit 4: Regression assumptions: Evaluating their tenability

A first look at residuals for the restaurant data

                            Dependent   Predicted
Obs   NAME                   Variable       Value    Residual
  6   Bambara                 20.6667     21.6322     -0.9655
  7   Birch St. Bistro        19.3333     19.4029     -0.0695
  8   Blarney Stone           18.3333     17.9167      0.4166
  9   blu                     23.6667     23.3041      0.3625
 10   Brenden Crocker'        22.0000     20.7033      1.2967
 11   Bristol                 25.3333     22.1895      3.1439
 12   B-Side Lounge           18.6667     18.6598    0.006881
 13   Central Kitchen         20.0000     17.3594      2.6406
 14   Daedalus                17.6667     18.6598     -0.9931
 15   Dalia's Bistro          18.0000     19.5887     -1.5887
 16   Dalya's                 21.0000     21.0748     -0.0748
 17   Dedo Lounge             19.6667     20.1460     -0.4793
 18   Devlin's                20.3333     18.4740      1.8593
 19   Excelsior               25.6667     25.9049     -0.2383
 20   Fava                    21.6667     21.8179     -0.1513
 21   Federalist              25.6667     26.6480     -0.9814
 22   flora                   24.0000     21.2606      2.7394
 23   Franklin Café           21.0000     19.7744      1.2256
 24   Gardner Museum          19.0000     18.4740      0.5260
 25   Gargoyles               22.6667     20.7033      1.9634
 26   Grafton St. Pub         17.0000     18.2882     -1.2882
 27   Grapevine               22.6667     20.8891      1.7776
 28   Green St. Grill         19.3333     19.0313      0.3020
 29   Hamersley's Bist        25.3333     24.2330      1.1003
 30   Harvest                 23.6667     22.3753      1.2914
. . .
 57   Square Café             23.6667     21.0748      2.5918
 58   Stanhope Grille         18.6667     21.8179     -3.1513
 59   Stephanie's             18.6667     19.9602     -1.2935
 60   Temple Bar              17.6667     18.8456     -1.1789
 61   Ten Tables              23.0000     20.8891      2.1109
 62   33 Restaurant           19.3333     21.6322     -2.2988
 63   Top of the Hub          22.0000     23.1183     -1.1183
 64   Tremont 647             19.6667     20.3317     -0.6651
 65   Troquet                 23.6667     23.8614     -0.1948
 66   Tryst                   20.6667     21.2606     -0.5939
 67   29 Newbury              17.0000     20.8891     -3.8891

The UNIVARIATE Procedure
Variable: rawres1 (Residual)

Basic Statistical Measures

        Location                  Variability
Mean      0.000000     Std Deviation          1.54639
Median    0.001587     Variance               2.39131
Mode      .            Range                  7.03292
                       Interquartile Range    2.14756

Stem Leaf            #  Boxplot
   3 1               1     |
   2 6679            4     |
   2 0013            4     |
   1 57899           5     |
   1 00112334        8  +-----+
   0 5566778         7  |     |
   0 011133344       9  *--+--*
  -0 443222110       9  |     |
  -0 977665          6  |     |
  -1 33322110000    11  +-----+
  -1 765             3     |
  -2 3311            4     |
  -2 65              2     |
  -3 20              2     |
  -3 9               1     |
     ----+----+----+----+

Upper stems: positive residuals. Lower stems: negative residuals.

Page 11: Unit 4: Regression assumptions: Evaluating their tenability

What would the residuals look like if they were normally distributed?

Introducing studentized residuals

$\text{Studentized residual}_i = \dfrac{\text{residual}_i}{sd(\text{residual}_i)}$

Rule of thumb: If studentized residuals are normally distributed, we expect:
• 5% beyond ±2
• 1% beyond ±2.5

Flag & examine |stdres| > 2

[Figure: standard normal curve, horizontal axis from -4 to +4: about 68% of values within ±1, 95% within ±1.96, 99% within ±2.58.]

The UNIVARIATE Procedure
Variable: stdres (Studentized Residual)

Basic Statistical Measures

        Location                  Variability
Mean     -0.00016     Std Deviation          1.00425
Median    0.00105     Variance               1.00852
Mode      .           Range                  4.55280
                      Interquartile Range    1.40607

Stem Leaf        #  Boxplot
  20 4           1     |
  18 9           1     |
  16 857         3     |
  14 6           1     |
  12 25786       5     |
  10 35          2     |
   8 4437        4     |
   6 67239       5  +-----+
   4 1353        4  |     |
   2 0147346     7  |     |
   0 08997       5  *-----*
  -0 630550      6  |  +  |
  -2 81961       5  |     |
  -4 7530        4  |     |
  -6 97307652    8  +-----+
  -8 9644        4     |
 -10 13          2     |
 -12 97          2     |
 -14 29          2     |
 -16 83          2     |
 -18 4           1     |
 -20 4           1     |
 -22                   |
 -24 1           1     |

3 cases with |stdres| > 2 is well within our expectations when n = 76
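To make the studentized-residual formula concrete, here is a minimal Python sketch (for illustration only) that computes it for Anscombe's Data Set III from the earlier slide. It assumes the usual definition for simple regression, in which the standard deviation of residual $i$ is $\text{Root MSE} \times \sqrt{1 - h_i}$ with leverage $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$; the slide does not spell out this denominator, so treat that detail as an assumption.

```python
import math

# Anscombe's Data Set III, transcribed from the earlier slide.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar

# Raw residuals and the root mean squared error (n - 2 degrees of freedom).
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
root_mse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Leverage of each observation in a one-predictor regression (assumed formula).
leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Studentized residual = raw residual / its estimated standard deviation.
studentized = [e / (root_mse * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Rule of thumb from the slide: flag anything beyond +/-2.
flagged = [i for i, r in enumerate(studentized) if abs(r) > 2]
for i in flagged:
    print(f"x = {x[i]}, y = {y[i]}: studentized residual = {studentized[i]:.2f}")

# How many flags would normality lead us to expect among 76 restaurants?
expected_flags_n76 = 0.05 * 76
print(f"expected number beyond +/-2 when n = 76: {expected_flags_n76:.1f}")
```

Only the point at x = 13 is flagged in this data set. The last line shows why three flagged restaurants out of 76 is unremarkable: about 5% of 76, or roughly four, would be expected by chance.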

Page 12: Unit 4: Regression assumptions: Evaluating their tenability

Do we see anything unusual in the residual plots for the restaurant data?

5 most over-rated (controlling for cost)

NAME               COST    RATING    RESIDUAL
Jer-Ne               52   21.0000    -2.48989
Vox Populi           31   17.0000    -2.58865
Avenue One           35   17.3333    -2.99841
Stanhope Grille      43   18.6667    -3.15127
29 Newbury           38   17.0000    -3.88907

5 best places to eat (controlling for cost)

NAME               COST    RATING    RESIDUAL
Square Café          39   23.6667     2.59183
Central Kitchen      19   20.0000     2.64063
flora                40   24.0000     2.73939
TW Foods             39   24.0000     2.92516
Bristol              45   25.3333     3.14385

[Figure: two residual plots against COST, one of raw residuals and one of studentized residuals, each with the ten restaurants above labeled.]
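The residuals listed for these ten restaurants can be recovered from the fitted equation, which is what makes them "controlled observations": each one is the rating minus the rating predicted for that cost. A minimal Python sketch (for illustration only; the coefficients are the PROC REG estimates shown earlier, so results match the slide only to about three decimal places):

```python
# Fitted coefficients from the PROC REG output for RATING on COST.
INTERCEPT = 13.82968
SLOPE = 0.18577

# (name, cost, rating) for the ten restaurants labeled on the slide.
restaurants = [
    ("Jer-Ne", 52, 21.0000),
    ("Vox Populi", 31, 17.0000),
    ("Avenue One", 35, 17.3333),
    ("Stanhope Grille", 43, 18.6667),
    ("29 Newbury", 38, 17.0000),
    ("Square Café", 39, 23.6667),
    ("Central Kitchen", 19, 20.0000),
    ("flora", 40, 24.0000),
    ("TW Foods", 39, 24.0000),
    ("Bristol", 45, 25.3333),
]


def residual(cost, rating):
    """Observed rating minus the rating predicted from cost."""
    return rating - (INTERCEPT + SLOPE * cost)


# Sort from most over-rated (most negative) to best value (most positive).
ranked = sorted(
    ((name, residual(cost, rating)) for name, cost, rating in restaurants),
    key=lambda pair: pair[1],
)
for name, res in ranked:
    print(f"{name:<16} {res:8.3f}")
```

Note that Central Kitchen has only a middling rating (20.0) but ranks among the best values because it is the cheapest restaurant in the list.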

Page 13: Unit 4: Regression assumptions: Evaluating their tenability

Could (would?) we have spotted these observations on the scatterplot?

Raw residuals: Where were these on the original scatterplot?

[Figure: the raw residual plot and the original scatterplot of RATING against COST, each labeling the same ten restaurants tabulated on Page 12: the 5 most over-rated (Jer-Ne, Vox Populi, Avenue One, Stanhope Grille, 29 Newbury) and the 5 best places to eat (Square Café, Central Kitchen, flora, TW Foods, Bristol), all controlling for cost.]

Boston Restaurant Weeks, March 9th – 14th & 16th to 21st
Three-course Prix-fixe Lunch Menu: $20.08
Three-course Prix-fixe Dinner Menu: $33.08

Page 14: Unit 4: Regression assumptions: Evaluating their tenability

Did the butterfly ballot affect the 2000 Presidential race in Florida?

Of the nearly 6 million votes cast in Florida, the official tally has Bush beating Gore by 537 votes.

• Although Democrats are listed second in the left-hand column, you vote Democratic by punching the third hole.
• If you punch the second hole, you are voting for the Reform party (i.e., Pat Buchanan).

RQ: In the 2000 Presidential election, did Buchanan get more votes than we “would have expected?”

Poly-CY, Internet Resources for political science has much more information on the statistical analysis of the 2000 Presidential election results.

Page 15: Unit 4: Regression assumptions: Evaluating their tenability

Did Buchanan receive more votes than expected in Palm Beach county?

n = 67

$BUCH = \beta_0 + \beta_1 REGREF + \varepsilon$

ID   COUNTY        BUCH   REGREF
 1   ALACHUA        262       91
 2   BAKER           73        4
 3   BAY            248       55
 4   BRADFORD        65        3
 5   BREVARD        570      148
 6   BROWARD        789      332
. . .
48   ORANGE         446      199
49   OSCEOLA        145       62
50   PALM BEACH    3407      337
51   PASCO          570      167
52   PINELLAS      1010      425
. . .
64   VOLUSIA        396      176
65   WAKULLA         46        7
66   WALTON         120       22
67   WASHINGTON      88        9

The UNIVARIATE Procedure
Variable: BUCH

        Location                  Variability
Mean      258.6119     Std Deviation    449.48775
Median    114.0000     Variance            202039
Mode       29.0000     Range                 3398

Stem Leaf                                          #  Boxplot
  34 1                                             1     *
  32
  30
  28
  26
  24
  22
  20
  18
  16
  14
  12
  10 1                                             1     0
   8 4                                             1     0
   6 59                                            2     0
   4 05046677                                      8     |
   2 345677789011                                 12  +--+--+
   0 112233333333444455677788999900011122245899   42  *-----*
     ----+----+----+----+----+----+----+----+--
Multiply Stem.Leaf by 10**+2

Counties labeled among the high values: Pinellas, Hillsborough, Broward, Duval

10 Nov 2000: “The Bush campaign claims that the number of votes for Buchanan in Palm Beach County is perfectly accurate. ‘New information has come to our attention that puts in perspective the results of the vote in Palm Beach County,’ Bush spokesman Ari Fleischer said on Thursday. ‘Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there.’” (Salon.com)

Page 16: Unit 4: Regression assumptions: Evaluating their tenability

Is Palm Beach county a Buchanan (and Reform party) stronghold?

The REG Procedure
Dependent Variable: BUCH

Analysis of Variance

Source             DF   Sum of Squares   Mean Square
Model               1          7412114       7412114
Error              65          5922476         91115
Corrected Total    66         13334590

Root MSE          301.85265    R-Square   0.5559
Dependent Mean    258.61194    Adj R-Sq   0.5490
Coeff Var         116.72031

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              1.53252         46.60847      0.03     0.9739
REGREF       1              3.68671          0.40876      9.02     <.0001

$\widehat{BUCH} = 1.53 + 3.69\,REGREF$

• Effect is strong: 55.6% of the variation in Buchanan votes is associated with Reform party registration
• Estimated effect is large: Each registered Reform party member is associated with 3.69 votes for Buchanan
• Effect is statistically significant: Unlikely that in the population of Florida counties, there’s no relationship between Reform party registration and Buchanan votes
• Estimated effect is somewhat precise: 95% CI (2.87, 4.50)

[Figure: scatterplot of BUCH against REGREF with the fitted line; Palm Beach labeled.]
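The summary claims for the Florida model follow from the ANOVA and parameter tables by simple arithmetic. A minimal Python sketch (for illustration only): the critical value 1.997 is an assumption introduced here, an approximate two-sided 95% value of the t distribution with 65 degrees of freedom, since the slide does not say which multiplier was used.

```python
# Values transcribed from the PROC REG output for BUCH on REGREF.
SS_MODEL = 7412114
SS_ERROR = 5922476
SLOPE = 3.68671
SE_SLOPE = 0.40876

# Assumption: approximate 95% two-sided t critical value for 65 df.
T_CRIT_65DF = 1.997

# The corrected total sum of squares is the model plus error sums of squares.
ss_total = SS_MODEL + SS_ERROR
r_squared = SS_MODEL / ss_total
t_value = SLOPE / SE_SLOPE
ci = (SLOPE - T_CRIT_65DF * SE_SLOPE, SLOPE + T_CRIT_65DF * SE_SLOPE)

print(f"corrected total SS = {ss_total}")
print(f"R^2 = {r_squared:.4f}, t = {t_value:.2f}")
print(f"95% CI for slope: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval is noticeably wider, relative to the estimate, than the one for the restaurant data, which is why the slide calls this effect only "somewhat precise."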

Page 17: Unit 4: Regression assumptions: Evaluating their tenability

Buchanan stronghold or very unusual data point?

[Figure: scatterplot of BUCH against REGREF with Palm Beach labeled.]

10 Nov 2000: When asked about the Bush campaign's statement, Buchanan's Florida coordinator, Jim McConnell, responded: "That's nonsense." He estimate[s] the number of Buchanan activists in the county to be between 300 and 500 -- nowhere near the 3,407 who voted for him. (Salon.com)

Page 18: Unit 4: Regression assumptions: Evaluating their tenability

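What happens if we set aside Palm Beach County?

$\widehat{BUCH} = 50.28 + 2.45\,REGREF$ ($R^2 = 86.4\%$)

...so in Palm Beach we would predict:

$\widehat{BUCH} = 50.28 + 2.45(337) = 875.93$

[Figure: scatterplot of BUCH against REGREF without Palm Beach, with Collier, Duval, Polk, and Marion labeled and the values 646 and 1106 marked. These lie about 230 votes either side of the prediction of 875.93, consistent with 95% prediction limits at REGREF = 337.]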
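The size of the Palm Beach discrepancy is easiest to appreciate by computing it. A minimal Python sketch (for illustration only) using the rounded coefficients from the two fitted equations in the slides. The band limits 646 and 1106 are read off the slide and are assumed here to be the 95% prediction limits at REGREF = 337; the slide does not label them explicitly.

```python
REGREF_PALM_BEACH = 337   # registered Reform party voters in Palm Beach
ACTUAL_BUCH = 3407        # Buchanan votes actually recorded there

# Assumption: these two values from the slide are the 95% prediction limits.
PRED_BAND = (646, 1106)


def predict(intercept, slope, regref):
    """Predicted Buchanan votes for a county with `regref` Reform registrants."""
    return intercept + slope * regref


pred_all = predict(1.53, 3.69, REGREF_PALM_BEACH)       # all 67 counties
pred_without = predict(50.28, 2.45, REGREF_PALM_BEACH)  # Palm Beach set aside

excess = ACTUAL_BUCH - pred_without
outside_band = not (PRED_BAND[0] <= ACTUAL_BUCH <= PRED_BAND[1])

print(f"prediction, all counties:       {pred_all:.2f}")
print(f"prediction, without Palm Beach: {pred_without:.2f}")
print(f"actual minus prediction:        {excess:.2f}")
print(f"actual vote outside the band?   {outside_band}")
```

Even the model fitted with Palm Beach included, which the outlier pulls toward itself, predicts far fewer votes than were recorded.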

Page 19: Unit 4: Regression assumptions: Evaluating their tenability

How would we summarize the results of this analysis?

Comparison of regression models predicting Buchanan’s total vote (Florida Presidential Election data, 2000)

Predictor                             All Florida counties (n=67)   Without Palm Beach (n=66)
Intercept                             1.53                          50.28***
                                      (46.61)                       (12.98)
                                      0.03                          3.87
Number of registered reform voters    3.69***                       2.45***
                                      (0.41)                        (0.12)
                                      9.02                          20.18
R2                                    55.6%                         86.4%

Cell entries are estimated regression coefficients, (standard errors) and t-statistics.
*** p<.001

[Figure: scatterplot with two fitted lines, labeled “All Florida counties” and “Without Palm Beach”; Palm Beach labeled.]

It’s déjà vu all over again… The results of the 2006 congressional elections in Florida
The New York Times, 24 February 2007

Page 20: Unit 4: Regression assumptions: Evaluating their tenability


Decisions about just a few data points can have serious consequences

Page 21: Unit 4: Regression assumptions: Evaluating their tenability

What’s the big takeaway from this unit?

• Regression models invoke a series of important assumptions
  – Before accepting a set of regression results, you should examine the assumptions to make sure they’re tenable
  – The assumptions may well be reasonable but you can’t be sure your conclusions are correct unless you have evaluated their tenability
  – Residuals are the key to evaluating regression assumptions
• Regression as statistical control
  – We often want to do more than just summarize the relationship between variables
  – Regression provides a straightforward strategy that allows us to statistically control for the effects of a predictor and see what’s “left over”
  – Residuals can be easily interpreted as “controlled observations”
• Outliers can distort regression results or be interesting on their own
  – Always inspect scatterplots and residual plots to determine whether there are any unusual values that might unduly influence the fitted regression line
  – If you find outliers, re-fit the regression model without those observations and compare the results
  – Regardless of how you decide to handle the presence of outliers, always tell your audience about their existence and what you did about them
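The "re-fit without the outlier and compare" advice can be sketched in a few lines. The Python example below (for illustration only; the helper name `fit_line` is hypothetical) uses Anscombe's Data Set III from earlier in the unit and sets aside the single point at x = 13, y = 12.74. Which observation to set aside is a judgment made here for the sake of the example.

```python
# Anscombe's Data Set III, transcribed from the earlier slide.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Return (intercept, slope, R-squared) for a simple least-squares fit."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    syy = sum((yi - ybar) ** 2 for yi in ys)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope, sxy ** 2 / (sxx * syy)


# Fit once with every observation, once with the suspected outlier set aside.
OUTLIER_INDEX = 2  # the point (13, 12.74)
keep = [i for i in range(len(x)) if i != OUTLIER_INDEX]

all_fit = fit_line(x, y)
trimmed_fit = fit_line([x[i] for i in keep], [y[i] for i in keep])

for label, (b0, b1, r2) in (("all points", all_fit), ("without outlier", trimmed_fit)):
    print(f"{label:<16} y-hat = {b0:.2f} + {b1:.3f} X, R^2 = {r2:.1%}")
```

One point out of eleven changes the slope by roughly a third and moves the fit from moderate to nearly perfect, which is the reason for reporting both versions to your audience.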

Page 22: Unit 4: Regression assumptions: Evaluating their tenability

Appendix: Annotated PC-SAS Code for Unit 4, ZAGAT data

Note that the handouts include only annotations for the new additional SAS code. For the complete program, check program “Unit 4—ZAGAT analysis” on the website.

*------------------------------------------------------*
 Sorting observations in sample from lowest to highest RATING
*------------------------------------------------------*;
proc sort data=one;
  by rating;
run;

proc sort sorts the newly created SAS data set (named “one”). The by statement identifies the variable by which the data are sorted.

proc reg data=one;
  title2 "Examining residuals from the regression of Rating on Cost";
  model rating=cost/p;
  plot residual.*cost;
  plot student.*cost;
  output out=resdat r=rawres student=stdres;
  id name;
run;

proc reg allows you to add a plot statement that produces a bivariate scatterplot of the raw and studentized residuals by the predictor (“residual” stands for raw residuals, and “student” for studentized). The syntax is identical to that of proc gplot (i.e., plot y*x); note the “.” after the residual keyword in the plot statement. The output statement creates a new (temporary) dataset, called RESDAT, that contains all the data in ONE as well as the raw and studentized residuals.

*--------------------------------------------------------*
 Univariate summary information on raw and studentized residuals
 from OLS regression model RATING on COST
*-------------------------------------------------------*;
proc univariate data=resdat plot;
  title2 "Distribution of residuals from the regr of Rating on Cost";
  var rawres stdres;
  id name;
run;

proc univariate can be used to analyze the new dataset RESDAT and presents summary statistics of the residuals (e.g., means, sd’s, stem-and-leaf displays). As in any proc univariate, the var statement specifies the residuals you want analyzed; the id statement provides identifiers for extreme values.

Page 23: Unit 4: Regression assumptions: Evaluating their tenability

Appendix: Annotated PC-SAS Code for Unit 4, FLVOTE data

*---------------------------------------------------------------------*
 Creating FLVote data subset that excludes Palm Beach County (id=50)
*----------------------------------------------------------------*;
data two;
  set one;
  if id=50 then delete;
run;

The data step here creates a new dataset called “two.” The if-then-delete statement specifies which data points to delete.

*---------------------------------------------------------------------*
 Fitting OLS regression model BUCH on REGREF, excluding Palm Beach County
 Plotting BUCH vs REGREF with 95% prediction interval bands
 Plotting studentized residuals on REGREF
*-------------------------------------------------------------------*;
proc reg data=two;
  title2 "Regression results and residual analysis";
  model buch=regref/p;
  plot buch*regref/pred95;
  plot student.*regref;
  output out=resdat2 r=residual student=student;
  id county;
run;

proc reg here reads the data from the new dataset “two”. An option (/pred95) has been added to the first plot statement to produce 95% prediction intervals for individual levels of Y at each value of X, superimposed on the bivariate scatterplot. The other plot statement produces a bivariate scatterplot of studentized residuals by predictor. Finally, the output statement creates a new (temporary) dataset, called RESDAT2, that includes all the data in TWO and the raw and studentized residuals.

*-------------------------------------------------------------------*
 Univariate summary information of studentized residuals from
 OLS regression model BUCH on REGREF, excluding Palm Beach County
*-------------------------------------------------------------------*;
proc univariate data=resdat2 plot;
  title2 "Studentized Residuals";
  var student;
  id county;
run;

proc univariate analyzes the new dataset RESDAT2 and presents summary statistics of the residuals.

Page 24: Unit 4: Regression assumptions: Evaluating their tenability

Glossary terms included in Unit 4

• Assumptions
• Homoscedasticity
• Independence of observations
• Linearity
• Normality
• Outlier
• Studentized residuals

Page 25: Unit 4: Regression assumptions: Evaluating their tenability


A cautionary tale….

