Stat 5100 – Linear Regression and Time Series

Dr. Corcoran, Spring 2013

V. Multiple Linear Regression

The term “multiple” refers to the inclusion of more than one regression variable. The general regression model extends the simple model by including p – 1 covariates Xi1, Xi2, …, Xi,p-1 as follows:

Yi = β0 + β1Xi1 + β2Xi2 + ··· + βp-1Xi,p-1 + εi,

where as before we assume εi ~ N(0,σ2).


Why “control” for additional factors?

In scientific settings, there may be several reasons to include additional variables in a regression model:

• Other investigative variables may have scientifically interesting predictive effects on Y.

• We may be interested in testing the simultaneous effects of multiple factors.

• We need to model a predictor that (a) is a nominal (unordered) categorical variable (e.g., the feed type in Example III.K), or (b) has nonlinear effects that require including polynomial terms (e.g., an X² term for a quadratic effect).

• There are confounding factors that need to be controlled for as part of the modeling analysis.


Confounding Variables

Suppose that we are investigating the relationship between two variables X and Y. That is, we want to know if variation in X causes variation in Y.

Controlled or randomized experiments are those in which study subjects are randomly assigned to levels of X. This ensures balanced variation across factors that might influence the relationship between X and Y. We often refer to these additional factors as confounding variables.

Observational studies are those in which there is no randomization – investigative groups are defined based on how subjects selected themselves into a particular treatment or exposure. In this case, we need to control for confounding factors as part of the data analysis.


Example V.A

Consider recent findings from a study of “lighthearted” pursuits on average blood pressure:

http://www.cnn.com/2011/HEALTH/03/25/laughter.music.lower.blood.pressure/index.html.

Is this a controlled or observational study? What advantages (if any) does the study design have in this case?

Consider also the following results reported on a study of weight and bullying among children:

http://inhealth.cnn.com/living-well/overweight-kids-more-likely-to-be-bullied?did=cnnmodule1.

What kind of study is this? What factors might the researchers need to control for?


Interpretation of Regression Coefficients

For the multiple regression model, a coefficient βj represents the effect of Xij on E{Yi} (the average of the outcome variable), holding all other variables constant.

In this sense, we are controlling for the effects of other factors by assuming a common effect of Xij on the average of Y across all levels of the additional variables.

Note that in the regression setting it is also quite straightforward to model interactions between two predictor variables, if the effect of one depends on the level of the other. We will discuss interactions more later.


Example V.B

Consider a two-variable model

Yi = β0 + β1Xi1 + β2Xi2 + εi.

Suppose that β0 = 5, β1 = 2, and β2 = 1.5.

What is your interpretation of β0?

What is the effect of Xi1 on the average of Yi when Xi2 = 3? What is the effect of Xi1 when Xi2 = 10?
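One way to see this (a quick check using only the coefficients stated above): the mean function is

E{Yi} = 5 + 2Xi1 + 1.5Xi2,

so when Xi2 = 3 we have E{Yi} = 9.5 + 2Xi1, and when Xi2 = 10 we have E{Yi} = 20 + 2Xi1. In either case a one-unit increase in Xi1 raises the average of Yi by β1 = 2; only the intercept of the relationship shifts.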


Example V.B (continued)

A geometric illustration of this relationship is a three-dimensional scatterplot of a random sample of points (figure not reproduced here). The superimposed plane represents the combined linear effect of X1 and X2 on the average of Y. What model parameter measures the variation of the points around the plane?

What relationship do we assume between X1 and Y if we hold X2 constant?


Example V.B (continued)

Note that for three or more predictor variables, the linear relationship is manifest through a hyperplane in p dimensions, where p – 1 represents the number of variables included in the model.

Beyond two predictors, it is therefore difficult to actually visualize the model, but we can apply the same interpretation of the relative effects (in terms of the model coefficients) on the average of the outcome variable.


Additional Remarks on “Linear”

“Linear” refers to the effects of the coefficients, not necessarily to the effects of the predictors themselves.

We have already seen this before when discussing the role of transformations for a single variable. For example, we may observe a relationship such that

log(Y) = β0 + Xi1β1 + ··· + Xi,p-1βp-1 + εi,

so that fitting the linear regression model is simply a matter of fitting

Y’ = β0 + Xi1β1 + ··· + Xi,p-1βp-1 + εi,

where Y’ = log(Y).
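As a concrete sketch of this in SAS (the data set name mydata and variables y, x1, x2 here are hypothetical placeholders), the transformation is created in a DATA step and the regression is fit to Y’:

data logfit;
  set mydata;          /* hypothetical input data set */
  logy = log(y);       /* Y' = log(Y); log() in SAS is the natural logarithm */

proc reg;
  model logy = x1 x2;  /* fit the linear model to the transformed outcome */
run;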


Least-squares Fit

Note that the least-squares fit involves the same minimization of squared residuals (in this case around a plane or hyperplane). The residual is defined as:

εi = Yi – (β0 + β1Xi1 + ··· + βp-1Xi,p-1).

To find the least-squares fit, we minimize the objective function

Q = Σi [Yi – (β0 + β1Xi1 + ··· + βp-1Xi,p-1)]², summing over i = 1,…,n.

Note that there is a lot of matrix notation in chapter 6 to describe the least-squares approach. Unless you’re interested in the linear algebra, you can ignore all of that.


Fitted Residuals and MSE

As in the simple case, the least-squares fit yields estimates b0, b1,…, bp-1 for the regression coefficients, and our prediction equation can be expressed as:

Ŷi = b0 + b1Xi1 + b2Xi2 + ··· + bp-1Xi,p-1.

The observed residuals are then given by:

ei = Yi – Ŷi = Yi – (b0 + b1Xi1 + ··· + bp-1Xi,p-1), i = 1,…,n,

and our estimated error variance (MSE) is computed as

s² = [1/(n – p)] Σi ei² = [1/(n – p)] Σi (Yi – Ŷi)² = SSE/(n – p) = MSE.


Analysis of Variance

Note the error degrees of freedom for MSE. This suggests an ANOVA table for a multiple regression model that looks something like this:

Source        Degrees of Freedom    Sum of Squares    Mean Squares    F-statistic    p-value

Regression    p – 1                 SSR               MSR             MSR/MSE
Error         n – p                 SSE               MSE
Total         n – 1                 SSTO


Components of Variation

As with the simple model:

• SSTO represents the total variation in Y.

• SSR represents the variation in Y explained by the regression model that includes the set of X variables X1,…,Xp-1.

• SSE represents the residual (unexplained) variation, or the total variation of the points around the hyperplane.


ANOVA F Test

Note that there are p – 1 degrees of freedom associated with SSR, and n – p degrees of freedom for SSE. The ANOVA F test statistic is therefore given by MSR/MSE, and approximately follows an F(p – 1, n – p) distribution.

The question is: what is the F statistic for? For a regression model with multiple predictors, the F statistic tests the relationship between Y and the set of X variables X1,…,Xp-1. In other words, the overall F test evaluates the hypothesis

H0: β1 = β2 = ··· = βp-1 = 0, versus
HA: At least one βj ≠ 0, for j = 1,…,p – 1.

Under H0, MSR and MSE should be approximately equal, so a large value of F (i.e., a small p-value) provides evidence against the null.


Distribution of Individual Fitted Coefficients

As with the estimated slope in the simple case, the individual fitted coefficients b1,…,bp-1 are at least approximately normally distributed.

All we need to know for a given bj is its estimated standard error s{bj}, and the t distribution can be applied as before.

In other words, the statistic

t = bj / s{bj}

follows a t(n – p) distribution. Note that (especially in the case of multiple predictors) the computation of s{bj} is too complicated to perform by hand, and in general is carried out using software.


Inference for Individual Coefficients

Using the distribution on the previous slide, a (1 – α)100% confidence interval for βj is given by

bj ± t(1 – α/2; n – p)·s{bj}.

We also would like to test the null hypothesis H0: βj = 0 versus the alternative hypothesis HA: βj ≠ 0. A test statistic for assessing the evidence against H0 is given by

t = (bj – 0) / s{bj}.

Under H0, this test statistic approximately follows the t(n – p) distribution. The p-value is therefore given by 2P{t(n – p) ≥ |t|}.

Note that we can conceivably test against any specific value of βj, although 0 is generally the value of interest.


Example V.C

The data for this example come from a study of cholesterol levels (Y) among 25 individuals, who were also measured for weight in kg (X1) and age in years (X2). The data are tabulated below:

Chol  Wt  Age    Chol  Wt  Age
354   84   46    220   82   34
254   57   23    385   72   36
190   73   20    402   79   57
405   65   52    365   75   44
263   70   30    311   59   46
395   59   60    181   67   23
434   69   48    274   85   37
220   60   34    209   27   24
451   76   57    290   89   31
302   69   25    346   65   52
288   63   28    303   55   40
374   79   51    244   63   30
308   75   50


Example V.C (continued)

A good exploratory tool when looking at associations between several variables is a scatterplot matrix, as shown on the following slide. (A scatterplot matrix can be produced in SAS – for version 9.2 or higher – using commands shown later in this example.)

What bivariate associations do you observe?

Based on these associations, what do you predict we will see with respect to a fitted model that includes both predictors?

Is it possible that age might confound the relationship between weight and cholesterol level?

(Slide 19 shows the scatterplot matrix of chol, weight, and age; the figure is not reproduced here.)

Example V.C (continued)

We want to fit a linear model for these data of the form

Yi = β0 + β1Xi1 + β2Xi2 + εi.

The SAS code below reads in the data file, fits the two-variable model (with confidence intervals for the individual variables), and also produces a scatterplot matrix.

options ls=79 nodate;

data;
  infile "c:\chris\classes\stat5100\data\cholesterol.txt" firstobs=2;
  input chol weight age;

proc reg;
  model chol=weight age / clb;
run;

proc sgscatter;
  title "Scatterplot Matrix for Cholesterol Data";
  matrix chol weight age;
run;


Example V.C (continued): SAS Output

The REG Procedure
Model: MODEL1
Dependent Variable: chol

Number of Observations Read          25
Number of Observations Used          25

Analysis of Variance

                              Sum of          Mean
Source              DF       Squares        Square    F Value    Pr > F

Model                2        102571         51285      26.36    <.0001
Error               22         42806    1945.73752
Corrected Total     24        145377

Root MSE             44.11051    R-Square    0.7056
Dependent Mean      310.72000    Adj R-Sq    0.6788
Coeff Var            14.19623


Example V.C (continued): SAS Output

Parameter Estimates

                         Parameter       Standard
Variable         DF       Estimate          Error    t Value    Pr > |t|

Intercept         1       77.98254       52.42964       1.49      0.1511
weight            1        0.41736        0.72878       0.57      0.5727
age               1        5.21659        0.75724       6.89      <.0001

Parameter Estimates

Variable         DF      95% Confidence Limits

Intercept         1      -30.74988      186.71495
weight            1       -1.09403        1.92875
age               1        3.64616        6.78702
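As a quick arithmetic check on these limits (using the critical value t(0.975; 22) ≈ 2.074), the interval for weight can be reproduced by hand:

0.41736 ± 2.074 × 0.72878 ≈ 0.41736 ± 1.5115, i.e., approximately (–1.094, 1.929),

which agrees with the reported 95% confidence limits.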


Example V.C (continued)

Using the SAS results, write out the fitted model.

Interpret the fitted regression coefficients, including the intercept.

Report and interpret the estimated model variance.

Interpret the 95% confidence intervals for each of the coefficients of weight and age.

Is there evidence that either weight or age is individually associated with average cholesterol level in the presence of the other?

Interpret the ANOVA F test results.


Analyzing Partial Sums of Squares

Sums of squares in the ANOVA table can be decomposed into components that correspond to individual predictor variables or sets of predictor variables.

This is a very useful tool for comparing models in order to assess which variables should be included and in what form.

When variables (either one or more) are added to a given model, the SSE is always reduced. The idea behind analyzing so-called partial or extra sums of squares is that we would like to know whether the addition of these extra factors leads to a significant reduction in SSE. If the reduction is significant, this indicates that the additional variables likewise have significant effects with respect to the outcome variable.


Example V.D

Consider again the cholesterol data in Example V.C. Suppose we begin with a simple regression model using only the weight variable Xi1. Fitting that model, we obtain the prediction equation

Ŷi = 199.30 + 1.62Xi1,

along with the ANOVA table below:

Source        df    SS        MS         F       p-value

Regression     1    10232     10232      1.74    0.200
Error         23    135145    5875.883
Total         24    145377


Example V.D (continued)

If we add in the age variable Xi2, as in the two-variable model of Example V.C, note what the effect is on the regression and error sums of squares. Denote the two-variable model as (I), and the simple model on the previous slide as (II).

How is (II) nested in (I)?

What are the SSR and SSE for model (I)? What are the SSR and SSE for model (II)? How do they compare?


Effects of Additional Covariates on ANOVA Sums of Squares

In summary, what we observe in Example V.D holds generally: adding covariates to a regression model always increases the SSR, and decreases the SSE by the same amount.

The issue is whether this change in the sums of squares is significant enough to warrant the larger model.

In other words, is the information or cost required to estimate additional effects worth it?


Nested Models

Consider the predictive factors X1, X2,…, Xq, yielding the model

(I)  Yi = β0 + β1Xi1 + β2Xi2 + ··· + βqXiq + εi.

Suppose we augment this with additional factors represented by Xq+1, Xq+2,…, Xp-1, where p – 1 > q, yielding the model

(II)  Yi = β0 + β1Xi1 + ··· + βqXiq + βq+1Xi,q+1 + ··· + βp-1Xi,p-1 + εi.

We say that model (I) is nested within model (II), in the sense that the regression parameters in model (I) represent a subset of the parameters in model (II).

We can evaluate whether (II) is an improvement on (I) – i.e., whether the reduction in SSE is significant relative to the number of additional parameters – by using a partial F test that compares (II) to (I).


Notation

SSR(X1,…,Xp-1): regression sum of squares including all covariates.

SSR(X1,…,Xq): regression sum of squares including only X1,…,Xq.

SSR(Xq+1,…,Xp-1 | X1,…,Xq) = SSE(X1,…,Xq) – SSE(X1,…,Xp-1): the additional or extra regression sum of squares due to the inclusion of Xq+1,…,Xp-1.
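As a concrete instance of this notation, using the ANOVA tables from Examples V.C and V.D (with X1 = weight and X2 = age):

SSR(X2 | X1) = SSE(X1) – SSE(X1, X2) = 135145 – 42806 = 92339,

which equals SSR(X1, X2) – SSR(X1) = 102571 – 10232.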


Comparing Nested Models with Partial F Tests

Considering the effect of adding to the model, how do we assess whether the additional covariates are “worth it”? We wish to assess the hypotheses

H0: βq+1 = βq+2 = ··· = βp-1 = 0, versus
HA: βk ≠ 0 for at least one k = q + 1,…, p – 1.

To make this comparison, we can use the F statistic

F = MSR(Xq+1,…,Xp-1 | X1,…,Xq) / MSE(X1,…,Xp-1)
  = [SSR(Xq+1,…,Xp-1 | X1,…,Xq) / (p – q – 1)] / MSE(X1,…,Xp-1).


Comparing Nested Models with Partial F Tests, continued

Note that the degrees of freedom in the numerator correspond to the difference in the number of parameters fit for each model.

Hence, under the null hypothesis, this statistic follows an F(p – q – 1, n – p) distribution.

The p-value is the right tail probability corresponding to the observed statistic.


Example V.E

Using the results in Example V.C and Example V.D, carry out an F test of the hypothesis of no age effect.

What are the test statistic and corresponding p-value?

What are the degrees of freedom associated with this test statistic?

How do these results compare to the confidence interval and t-test for the coefficient of age?
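A sketch of the computation, using only quantities already reported in Examples V.C and V.D:

F = [SSE(X1) – SSE(X1, X2)] / 1 ÷ MSE(X1, X2) = 92339 / 1945.73752 ≈ 47.5,

with 1 and 22 degrees of freedom. As a consistency check, this equals the square of the t statistic for age (6.89² ≈ 47.5), as it must when a single coefficient is tested. In SAS, the same partial F test can be requested with a TEST statement after the MODEL statement in PROC REG:

proc reg;
  model chol = weight age;
  test age = 0;  /* partial F test for the age coefficient */
run;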


Common Variable Types in Multiple Regression

A multivariable regression model can accommodate:

• non-linear (polynomial) effects,

• nominal categorical variables,

• interactions.

The accompanying three handouts describe such models, and provide examples analyzed using SAS to illustrate their application and interpretation.


General Guidelines for Model Selection

What do you do with a large data set that has many covariates? How might you go about finding which subset of the covariates adequately describes their effects on the outcome of interest? How might you decide when to treat a given covariate as discrete versus continuous? Which covariates might interact?

These questions are often best answered using a combination of both the art and science of model building. The former has to do with our substantive knowledge of the research question at hand (e.g., what’s been accomplished previously in your line of research), and the latter has to do with empiricism and formal statistical analyses.

The slides that follow contain some suggestions for achieving the balance of parsimony and sophistication that underlies a good model.


1. Understand your covariates.

In any data analysis setting, you must spend some quality time exploring the distributions of the outcome and predictor variables.

This involves performing simple cross-validations and univariate procedures to ensure consistency and correctness, and exploring two- or three-way (or higher-order) associations between the variables in order to discern the types of marginal associations that may exist.

Exploratory analysis generally involves a combination of descriptive, inferential, and plotting tools.


1. Understand your covariates (continued).

Many an analyst has found after much work on a problem that, for example, some of the data were either miscoded or coded in a way that the investigator neglected to understand. Models and their associated interpretations in that case need to be completely reassessed.

In addition, a thoughtful exploratory analysis helps you to avoid including covariates in the model that are very highly correlated. Including such a set of covariates results in the problem of collinearity, where your fitted model may be unreliable: small perturbations in the data may lead to large differences in the resulting parameter estimates, as well as their standard errors.


2. Start simple.

It’s a good idea to make a short list of some of the most important covariates in your data set, and begin by considering simple models that involve only those. Your list should be dictated by your familiarity with the problem at hand, as well as the primary hypotheses of interest. For example, in an observational public health study of people, it’s a good bet that you would want to account for measures like gender, race, age, and socioeconomic status.

Especially with respect to observational data, there is a tendency for researchers to consider everything at once. This can make the analysis initially overwhelming.


3. Use automated model selection procedures sparingly.

Procedures that automatically select a model for you, while popular in some settings, are completely ad hoc. Stepwise methods, for example, basically add covariates one at a time (or subtract one at a time), using an arbitrary significance level as the only criterion for inclusion versus exclusion. While such a tool might prove useful for exploratory purposes, it’s a terrible idea to wholly rely on these kinds of procedures.


4. Remember goodness-of-fit.

Goodness-of-fit is an aspect of data analysis that is all too often ignored. For example, investigators often simply treat covariates as continuous, assuming that their effects are linear, without bothering to check such assumptions through simple exploratory analyses.

Avoid running roughshod over your data by checking to make sure the final model you’ve selected is not missing something important.


Variable Inclusion versus Exclusion

There are several motivations that can inform variable inclusion. For example, you likely want to include a variable if:

• It represents a primary hypothesis of interest.

• It has been consistently used in prior analyses in the same line of research.

• It has a statistically significant predictive effect within your own analyses.


Formal Model Comparison Procedures

In terms of the science (i.e., empirical approaches) for model selection, we have already discussed important tools such as t tests for individual coefficients and partial F tests for one or more coefficients.

Aside from these, there are other strategies and metrics that make an objective attempt at balancing predictive power versus parsimony.

The issue: given P – 1 total predictor variables available in your dataset, how do we select the “best” subset of p – 1? This will yield p total regression coefficients (including the intercept) such that 1 ≤ p ≤ P.


Using the Coefficient of Determination

The so-called coefficient of multiple determination is conventionally denoted by R², and represents the percent of the total variation in Y that is explained by a given regression model with p – 1 predictor variables. Its definition is given by

R² = SSR/SSTO = 1 – SSE/SSTO.

You can see that, by construction, 0 ≤ R² ≤ 1.
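For instance, using the ANOVA output from Example V.C: R² = SSR/SSTO = 102571/145377 ≈ 0.7056, which matches the R-Square value reported by SAS.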


The Adjusted R²

Recall that whenever a predictor is added to a model, SSR increases, so that R² will also increase. To balance this increase against the worth of an additional coefficient, the adjusted R² is often also considered, defined as

R²a = 1 – [SSE/(n – p)] / [SSTO/(n – 1)].

Note that the adjusted value may actually decrease with an additional predictor, if the decrease in SSE is offset by the loss of a degree of freedom in the numerator.
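Again using Example V.C (n = 25, p = 3): R²a = 1 – (42806/22)/(145377/24) ≈ 1 – 1945.74/6057.38 ≈ 0.6788, matching the Adj R-Sq value reported by SAS.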


Selection Criteria Based on R² Measures

Let R²p denote the value of R² for a given set of p – 1 covariates, and R²a,p the corresponding adjusted value. One criterion is to choose a set of covariates that yields a “large” R²p.

Recall that since this value increases each time a variable is added to the model, under this criterion we need to make a subjective decision about whether the increase is sufficiently large to warrant the additional factor.

To make the decision more objective, another criterion is to select a model with p – 1 covariates that maximizes the adjusted R²a,p.


Mallows’ Cp

This metric attempts to quantify the total mean square error (MSE) across the sampled subjects, relative to the model fit. Note that MSE is a measure that combines variability and bias. Here, the bias represents deviations from the true underlying model for E{Y} that arise because important variables have not been included.

While the technical details behind the computation of Cp for a given model are not that critical, in terms of interpretation we note that if key factors are omitted then Cp > p. That is, generally speaking, models with little bias (which is more desirable) will yield values of Cp close to p.


AICp and SBCp

As with the adjusted R² and Mallows’ Cp, both the AICp and SBCp penalize models that have large numbers of predictors relative to their predictive power. Their definitions are respectively given by

AICp = n·log(SSEp) – n·log(n) + 2p, and

SBCp = n·log(SSEp) – n·log(n) + [log(n)]p.

The first term in these definitions will decrease as p increases. The second term is fixed for a given sample size. The third term in either will increase with larger p.


Comparing AICp and SBCp

Models with small SSEp will perform well using either of these criteria, provided that the penalties imposed by the third term are not too large.

Note that if n ≥ 8, then the penalty for SBCp is larger than that for AICp. Hence, the SBCp criterion favors models with fewer covariates.


Example V.E

To illustrate the use of these various criteria, we will use the Concord, NH, summer water-use data that we examined for Example IV.U, which are posted on the course website in the file concord.txt. The outcome variable Y is gallons of water used during the summer months in 1980. Predictors include annual income, years of education, retirement status (yes/no), and number of individuals in the household.

See the accompanying handout for SAS code, including commands to generate all possible regressions along with the model selection metrics discussed thus far.
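A minimal sketch of what such code might look like (the variable names here are hypothetical placeholders; the actual names are defined in the handout). In PROC REG, SELECTION=RSQUARE fits all possible subsets, and the ADJRSQ, CP, AIC, and SBC options print the corresponding selection criteria for each subset:

proc reg;
  model water = income educ retire hhsize
        / selection=rsquare adjrsq cp aic sbc;  /* all subsets, with criteria */
run;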


Collinearity

Collinearity describes strong correlation between two predictors, while multicollinearity describes a setting in which one predictor is highly correlated with a linear combination of two or more additional predictors. The two terms are often used interchangeably. When such a set or subset of explanatory variables is included in a model, we can experience problems with

• the conventional interpretation of regression coefficients (i.e., effects of predictive factors holding all other variables constant);

• higher sampling variability of fitted coefficients, making it more difficult to detect actual associations between the response variable and predictors;

• the numerical computation required for the least-squares fit.


Example V.F

Consider a case where we have a model for an outcome variable Y = (5, 7, 4, 3, 8), with two predictors X1 = (4, 5, 5, 4, 11) and X2 = (8, 10, 10, 8, 22). Pairwise scatterplots of these three variables accompany this example (figures not reproduced here). What is the relationship between Y and X1? Between Y and X2? What do you notice about X1 and X2? What is the correlation coefficient between the two predictors?


Example V.F (continued)

Suppose we consider first the model

Y = β0 + β1X1 + ε.

Because X2 = 2X1, we have perfect collinearity between the predictors. In general, this is true if any predictor variable can be expressed as a linear function of the others. In this case, suppose we now consider the two-variable model:

Y = β0 + α1X1 + α2X2 + ε = β0 + α1X1 + α2(2X1) + ε = β0 + (α1 + 2α2)X1 + ε.

Notice that for this second model the coefficients are no longer unique. That is, suppose we choose α2 = 0. Then α1 = β1 yields the original (simple) model. Now consider α2 = 2. Then α1 = β1 – 4 yields the original model.

Computationally speaking, given our data, the least-squares solution for the two-variable model will be indeterminate.


Example V.F (continued)

In fact, if we try to fit the two-variable model in SAS, this is the result:

The REG Procedure
Model: MODEL1
Dependent Variable: Y

Number of Observations Read           5
Number of Observations Used           5

Analysis of Variance

                              Sum of          Mean
Source              DF       Squares        Square    F Value    Pr > F

Model                1      10.89811      10.89811       5.19    0.1072
Error                3       6.30189       2.10063
Corrected Total      4      17.20000

Root MSE              1.44935    R-Square    0.6336
Dependent Mean        5.40000    Adj R-Sq    0.5115
Coeff Var            26.83990


Example V.F (continued)

NOTE: Model is not full rank. Least-squares solutions for the parameters are
      not unique. Some statistics will be misleading. A reported DF of 0 or B
      means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a
      linear combination of other variables as shown.

      X2 = 2 * X1

Parameter Estimates

                         Parameter       Standard
Variable         DF       Estimate          Error    t Value    Pr > |t|

Intercept         1        1.52830        1.81920       0.84      0.4625
X1                B        0.71698        0.31478       2.28      0.1072
X2                0              0              .          .          .


Near versus Perfect Collinearity

Of course, having perfectly correlated variables is an easily detectable problem, as your software package will yield an error when fitting models that include those variables.

The more practical worry in real-world research settings is the issue of variables that are highly – but not perfectly – correlated.

How can you detect such a potential problem?


Collinearity Diagnostics

• Compute the correlation matrix for your set of variables. A rule of thumb is to be very careful about including two predictors with an r² of around 0.8 or higher.

• Regress each predictor on all others in turn, and examine the value of the coefficient of determination R² for each model. This strategy gives rise to two useful diagnostic measures, with respect to each of the predictor variables: the so-called Variance Inflation Factor (VIF) and Tolerance.

Let Rk² represent the coefficient of determination for a regression of the kth predictor on the other predictors. It turns out that

Tolerance = 1 – Rk², and

VIF = (1 – Rk²)⁻¹.

A general rule of thumb is that predictors with VIF > 10 (equivalently, Tolerance < 0.10) are a concern.


Collinearity Diagnostics (continued)

• Examine the marginal effect on the fit of coefficients already in the model when an additional factor is added. For example, if a previously included variable has a highly significant coefficient whose magnitude, sign, and/or p-value dramatically changes with the addition of another factor, then those two predictors should be closely examined.

• (Not mentioned in the text.) Examine the eigenvalues of the correlation matrix of the predictors. Eigenvalues equal to zero or relatively close to zero indicate singularity or near-singularity in this matrix. The ratio of the largest to smallest eigenvalue is called the condition number of the matrix. A common rule of thumb is that a condition number > 30 represents a red flag.


Example V.G

This example uses PROC REG in SAS to produce some additional collinearity metrics mentioned on the previous slides for the Concord water data. We will specifically look at the Tolerance and VIF metrics for each predictor, along with the condition number of the matrix of correlations among the predictors.
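A minimal sketch of the relevant options (variable names again hypothetical): TOL and VIF print the tolerance and variance inflation factor for each predictor, and COLLINOINT prints the eigenvalues and condition indices of the predictor correlation matrix with the intercept adjusted out:

proc reg;
  model water = income educ retire hhsize / tol vif collinoint;
run;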


Residual and Influence Diagnostics

Recall that our residual diagnostics for simple linear regression involve examining the semistudentized residuals of the form

ei* = ei/√MSE = (Yi – Ŷi)/√MSE

to examine potentially outlying observations, along with possible violations of the model assumptions (e.g., linearity, and normality and independence of residuals).

In addition, there are other metrics that allow us to determine whether outlying observations may have undue influence on the overall model fit.


The “Hat” Matrix

Although we’ve thus far avoided the linear algebra underlying the least-squares fit, one interesting byproduct of the algebra is the so-called hat matrix H, which is an n x n matrix defined as

H = X(X′X)⁻¹X′,

where X is an n x p matrix, with ith row Xi given by Xi = (1, Xi1, Xi2,…, Xi,p-1).

It turns out that the diagonal elements of this matrix, denoted by hii for i = 1,…,n, are a useful tool for measuring the influence of the ith observation on the model fit.
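As a small illustration of the definition (a PROC IML sketch using the toy data from Example V.F, with an intercept and the single predictor X1), the leverage values are the diagonal elements of H:

proc iml;
  X = {1 4, 1 5, 1 5, 1 4, 1 11};  /* n x p design matrix: column of 1s and X1 */
  H = X * inv(X` * X) * X`;        /* hat matrix H = X(X'X)^{-1}X' */
  h = vecdiag(H);                  /* leverage values h_ii; they sum to p = 2 */
  print h;
quit;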


Some Properties of the hii: Identifying Outliers

• For i = 1,…,n, 0 ≤ hii ≤ 1 and Σi hii = p.

• It can also be shown that hii is a measure of the distance between the X values for the ith case and the averages of the X variables for all cases. This makes hii useful as a means of identifying outliers.

• Recall from our discussion of simple linear regression that an observation that is significantly outlying with respect to the distribution of the X variable can exert serious influence on the model fit.

• A couple of rules of thumb for identifying potentially outlying observations using hii:
  – If hii is more than twice as large as the average leverage value Σi hii/n = p/n, meaning that hii > 2p/n.
  – If hii is greater than 0.5 (hii between 0.2 and 0.5 could be considered borderline).


The hii and Residuals

The semistudentized residuals can be refined in a few key ways using the hii, to help identify outliers. Note that the observed residuals ei may actually have significantly different variances. We’ve talked about s² = MSE as an estimator of the model error variance σ², but it turns out that the estimator s²{ei} = MSE(1 – hii) can better account for these differences in variability. With this in mind:

• The studentized residual is defined as ei / s{ei}.

• The deleted residual is defined as the difference between the observed and predicted Yi, if the prediction is made from an analysis where the ith observation is deleted. We denote this difference by di, and it turns out that di = ei / (1 – hii). The formula for the standard error s{di} is likewise based on a model fit that excludes the ith observation (see section 10.2 in the text for more details).

• The studentized deleted residual therefore is given by di / s{di}.


The hii and Residuals, continued

Studentized deleted residuals (SDR) can be examined in the same way as semistudentized residuals in identifying potential outliers: SDR follow a t distribution with n – p – 1 degrees of freedom, so we identify observations with “large” SDR in terms of absolute value. Typically this is accomplished with a Bonferroni correction. For example, we compare each |SDR| to an upper-tail critical value from the t(n – p – 1) distribution, using a tail probability of α/2n.

What is the value of using the SDR as opposed to an uncorrected standardized residual? As the SDR for a given observation considers the model fit without that observation, there are occasions when the SDR identifies or detects a significant outlier that the semistudentized residual does not.
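Concretely, for the cholesterol model of Example V.C (n = 25, p = 3, α = 0.05), each |SDR| would be compared to the upper-tail critical value of the t(21) distribution with tail probability 0.05/(2 × 25) = 0.001.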


The hii and Influence Statistics: DFFITS

Once we have identified outlying observations with respect either to their X or Y values, the next step is to assess whether those observations have undue influence on the model fit.

(DFFITS)i is an observation-specific diagnostic that measures whether the fitted value including the ith observation is significantly different from the fitted value excluding the same observation. It is defined as

(DFFITS)i = (Ŷi – Ŷi(i)) / √[MSE(i)·hii],

where the second predicted value in the numerator is the prediction for the ith observation after deleting that observation from the analysis, and MSE(i) is the corresponding error variance for that analysis.

A guideline for identifying an influential case is that |DFFITS| > 1 for small to moderate datasets, or that |DFFITS| > 2√(p/n) for relatively large datasets.


The hii and Influence Statistics: DFBETAS

Another question regarding the influence of individual cases has to do with their leverage on the values of the fitted regression coefficients.

(DFBETAS)k(i) is an observation-specific diagnostic that measures how much the ith observation affects the fit of the kth regression coefficient. It is defined as:

(DFBETAS)k(i) = (bk – bk(i)) / √[MSE(i)·ckk],   k = 0, 1,…, p – 1,

where bk(i) is the fitted value of the kth coefficient after deleting the ith observation from the analysis, and ckk is the kth diagonal element of the matrix (X′X)⁻¹.

A guideline for identifying an influential case with respect to the coefficients is that |DFBETAS| > 1 for small to moderate datasets, or that |DFBETAS| > 2/√n for relatively large datasets.


Example V.H

This example uses the Concord water data along with PROC REG in SAS to illustrate and assess some of the influence diagnostics we have discussed. We will specifically look at the leverage values hii, the studentized deleted residuals, the DFFITS, and the DFBETAS.
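A minimal sketch of the request (variable names hypothetical, as before): the INFLUENCE option on the MODEL statement prints, for each observation, the hat diagonal hii, the studentized deleted residual (labeled RStudent), the DFFITS, and the DFBETAS:

proc reg;
  model water = income educ retire hhsize / influence;
run;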

