Intermediate R - Multiple Regression

MULTIPLE REGRESSION

Slides prepared by: Leilani Nora, Assistant Scientist, and Violeta Bartolome, Senior Associate Scientist, PBGB-CRIL

REGRESSION ANALYSIS
Used for explaining or modelling the relationship between a single dependent variable Y and one or more independent variables, X1, X2, ..., Xp.

Simple linear regression if p = 1:
Y = β0 + β1X1 + ε

Multiple regression if p > 1:
Y = β0 + β1X1 + β2X2 + ε

Y (response or dependent variable) - continuous
X1 and X2 (predictor, independent or explanatory variables) - can be continuous, discrete, or categorical.

Selection of Independent Variables
Extent to which the variable contributes to explaining the variation in Y.
Importance of the variable as a causal agent in the process under study.

Scope of the Model
Need to restrict the coverage of the model to some interval or region of values of the independent variable(s). The scope is determined either by the design of the investigation or by the range of the data at hand.

EXAMPLE
Y is family disbursement (in thousand pesos)
X1 = income : total family income
X2 = food   : food expenditure
X3 = house  : household operation expenditure
X4 = fulyt  : fuel and light expenditure
X5 = numem  : total number of family members
Annual data; all monetary variables are expressed in thousand pesos.

Linear Model
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + ε

ANOVA APPROACH TO REGRESSION ANALYSIS

SV          DF     SS     MS                Fc
Regression  p-1    SSR    MSR = SSR/(p-1)   MSR/MSE
Error       n-p    SSE    MSE = SSE/(n-p)
TOTAL       n-1    SST

COEFFICIENT OF DETERMINATION
One measure of model fit is R2, the coefficient of determination, which measures the reduction of the total variation in Y associated with the predictors stated in the model.

R2 = SSR/SST = 1 - SSE/SST

Values closer to 1 indicate better fit. For simple linear regression, R2 = r2.
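The identity SST = SSR + SSE and the R2 formula can be checked directly in R. A minimal sketch on simulated data (all object names here are hypothetical, not from the slides):
> set.seed(1)                        # simulated example
> n <- 50
> x1 <- rnorm(n); x2 <- rnorm(n)
> y <- 10 + 2*x1 + 3*x2 + rnorm(n)
> fit <- lm(y ~ x1 + x2)
> SST <- sum((y - mean(y))^2)        # total SS, df = n-1
> SSE <- sum(resid(fit)^2)           # error SS, df = n-p
> SSR <- SST - SSE                   # regression SS, df = p-1
> p <- length(coef(fit))             # p = 3 (intercept + 2 slopes)
> (SSR/(p - 1))/(SSE/(n - p))        # Fc, matches the F statistic in summary(fit)
> SSR/SST                            # R2 = 1 - SSE/SST, matches summary(fit)$r.squared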

DATAFRAME: Fies.csv Dataframe with 50 observations, 1 dependent variable (disburse) and 5 independent variables (income, food, house, fulyt, numem)

An alternative measure of fit is the residual standard error σ̂ = √(SSE/(n - p)), which is directly related to the standard errors of the estimates of β.

DATAFRAME: Fies.csv
Read the data file Fies.csv and fit the full model:
> DIS <- read.csv("Fies.csv")
> gfit <- lm(disburse ~ income + food + house + fulyt + numem, data = DIS)
> names(gfit)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
[8] "df.residual" "xlevels" "call" "terms" "model"

MULTICOLLINEARITY
Exists when 2 or more independent variables are correlated.
Effects of multicollinearity:
1. Imprecise estimates of the regression coefficients β
2. The usual interpretation of the regression coefficients is no longer applicable
3. t-tests may fail to reveal significant factors
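To see effect 1 concretely, here is a hedged sketch on simulated data (hypothetical names, not from the slides): two nearly collinear predictors inflate the standard errors of their coefficient estimates.
> set.seed(2)
> x1 <- rnorm(100)
> x2 <- x1 + rnorm(100, sd = 0.05)   # x2 is almost a copy of x1
> y <- 1 + x1 + x2 + rnorm(100)
> summary(lm(y ~ x1 + x2))$coefficients[, "Std. Error"]  # inflated SEs for x1, x2
> summary(lm(y ~ x1))$coefficients[, "Std. Error"]       # much smaller SE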

VARIANCE INFLATION FACTOR (VIF)
Most reliable way to examine multicollinearity.
Rule of thumb: if VIF > 10 (or > 5 to be very conservative) there is a multicollinearity problem.
vif() is one of the functions in package faraway which computes the variance inflation factor.
> vif(object)
# object - a data matrix of Xs or a model object

VARIANCE INFLATION FACTOR (VIF)
Check the correlation matrix of all independent variables:
> library(agricolae)
> Xs <- DIS[, c("income", "food", "house", "fulyt", "numem")]
> corr.Xs <- correlation(Xs)
> library(faraway)
> vif(Xs)   # or vif(gfit)
  income     food    house    fulyt    numem
2.691149 2.327383 1.237047 1.900161 1.380973

REMEDIAL MEASURES FOR MULTICOLLINEARITY
1. Drop predictors which are correlated with other predictors
2. Add some observations to break the collinearity pattern
3. Ridge regression - a variant of multiple regression whose objective is to solve multicollinearity
4. Regression using principal components
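A hedged sketch of measures 3 and 4, reusing the DIS data frame and variable names from the earlier slides; MASS::lm.ridge and prcomp() are one possible choice of tools, not necessarily the ones used in the original course:
> library(MASS)
> ridge <- lm.ridge(disburse ~ income + food + house + fulyt + numem, data = DIS, lambda = seq(0, 10, 0.1))
> select(ridge)                      # HKB, L-W and GCV choices of the ridge constant
> Xs <- DIS[, c("income", "food", "house", "fulyt", "numem")]
> pc <- prcomp(Xs, scale. = TRUE)    # orthogonal components break the collinearity
> pcfit <- lm(DIS$disburse ~ pc$x[, 1:3])   # regress on the first few components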

VARIABLE SELECTION
Intended to select the best subset of predictors: to explain the data in the simplest way, redundant predictors are removed.

CRITERION-BASED PROCEDURES
Fit the 2^p possible models and choose the best one according to some criterion. Also known as the all-possible-regressions procedure.
Criteria for comparing regression models
1. Akaike Information Criterion (AIC) and Bayes Information Criterion (BIC)
AIC = -2 log-likelihood + 2p
BIC = -2 log-likelihood + p log n
where -2 log-likelihood = n log(SSE/n), also known as the deviance.
Rule: the model with the lowest AIC or BIC is the best model.
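For lm fits, R's extractAIC() uses exactly this deviance form, so the formula can be verified by hand. A minimal sketch on simulated data (hypothetical names):
> set.seed(3)
> x <- rnorm(40); y <- 1 + 2*x + rnorm(40)
> fit <- lm(y ~ x)
> n <- length(y); p <- length(coef(fit)); SSE <- sum(resid(fit)^2)
> n*log(SSE/n) + 2*p                 # AIC by hand
> extractAIC(fit)[2]                 # same value from R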

CRITERION-BASED PROCEDURES
Criteria for comparing regression models
2. Adjusted R2 (Ra2):
Ra2 = 1 - ((n - 1)/(n - p)) (SSE/SST) = 1 - MSE/(SST/(n - 1))
Rule: the model with the highest adjusted R2 value is considered best.

3. Mallows Cp statistic:
Cp = SSEp/MSE + 2p - n
where MSE is from the model with all predictors and SSEp is from the model with p parameters. For the full model, Cp = p exactly; a model with bad fit will have Cp much bigger than p.
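Both criteria are easy to compute by hand for a submodel; a sketch with simulated data (hypothetical names), where MSE comes from the full model as stated above:
> set.seed(4)
> n <- 60
> x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
> y <- 1 + 2*x1 + 3*x2 + rnorm(n)    # x3 is irrelevant
> full <- lm(y ~ x1 + x2 + x3)
> sub <- lm(y ~ x1 + x2)             # submodel with p = 3 parameters
> MSE <- sum(resid(full)^2)/full$df.residual
> SSEp <- sum(resid(sub)^2); p <- length(coef(sub))
> SSEp/MSE + 2*p - n                 # Cp, close to p for an adequate submodel
> SST <- sum((y - mean(y))^2)
> 1 - (n - 1)/(n - p)*SSEp/SST       # Ra2, matches summary(sub)$adj.r.squared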

BACKWARD ELIMINATION
1. Fit a model with all predictors in the model.
2. Remove the predictor with the highest p-value greater than the chosen significance level αcrit.
3. Refit the model and repeat Step 2.
4. Stop when all p-values < αcrit.
This approach is known as the saturated model approach: we saturate the model with all terms and remove those that are insignificant relative to the presence of the others. (A small sketch of this loop follows.)
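A hedged sketch of steps 1-4, assuming the full model gfit from earlier; the loop logic (update() plus the coefficient table) is one way to implement it, not necessarily the original course's code:
> alpha.crit <- 0.05
> model <- gfit                      # step 1: start from the full model
> repeat {
+   ct <- summary(model)$coefficients
+   ct <- ct[rownames(ct) != "(Intercept)", , drop = FALSE]
+   if (nrow(ct) == 0 || max(ct[, "Pr(>|t|)"]) < alpha.crit) break     # step 4
+   worst <- rownames(ct)[which.max(ct[, "Pr(>|t|)"])]  # step 2: highest p-value
+   model <- update(model, as.formula(paste(". ~ . -", worst)))        # step 3: refit
+ }
> summary(model)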

FORWARD SELECTION
1. Fit the intercept-only (null) model, with no variables in the model.
2. For all predictors not in the model, check their p-values if they were added to the model. Choose the one with the lowest p-value less than αcrit.
3. Continue until no new predictors can be added.

STEPWISE REGRESSION
Combination of backward elimination and forward selection. At each stage a variable may be added or removed, and there are several variations on exactly how this is done.
Some drawbacks:
1. It is possible to miss the optimal model.
2. The procedures are not directly linked to the final objectives of prediction or explanation and may not help solve the problem.
3. It tends to pick smaller models than desirable for prediction or explanation.

regsubsets()
One of the functions in package leaps, which is used for regression subset selection.
> regsubsets(x, data, method = c("exhaustive", "backward", "forward", "seqrep"), force.in = NULL)
# x - design matrix or model formula
# method - method of variable selection: exhaustive search, forward, backward, or sequential replacement. (Note: for a small number of independent variables, the exhaustive method is recommended.)
# force.in - specifies one or more variables to be included in all models

subsets()
One of the functions in package car, which contains mostly functions for applied regression, linear models, and GLMs. subsets() finds optimal subsets of predictors and plots a measure of fit against subset size.
> subsets(object, statistic = c("bic", "cp", "adjr2", "rsq", "rss"))
# object - a regsubsets object produced by regsubsets() in the leaps package
# statistic - statistic to plot for each predictor subset: BIC, Cp, adjusted R2, unadjusted R2, or residual sum of squares

BACKWARD ELIMINATION : regsubsets()
> library(leaps)
> bwd <- regsubsets(disburse ~ income + food + house + fulyt + numem, data = DIS, method = "backward")
> bwds <- summary(bwd)
> names(bwds)
[1] "which"  "rsq"  "rss"  "adjr2"  "cp"  "bic"  "outmat"  "obj"

subsets()
> par(mfrow = c(2, 2))
> subsets(bwd, statistic = "rss")
> subsets(bwd, statistic = "adjr2")
> subsets(bwd, statistic = "cp")
> subsets(bwd, statistic = "bic")
> par(mfrow = c(1, 1))

> STAT.bwd <- rbind(bwds$rss, bwds$adjr2, bwds$cp, bwds$bic)
> rownames(STAT.bwd) <- c("rss", "adjr2", "cp", "bic")
> colnames(STAT.bwd) <- c("i", "i-h", "i-fd-h", "i-fd-h-n", "i-fd-fl-h-n")
> round(STAT.bwd, 4)
               i        i-h     i-fd-h   i-fd-h-n i-fd-fl-h-n
rss   66515.2057 48606.4305 37245.9663 34722.6207  34568.5604
adjr2     0.9033     0.9278     0.9435     0.9462      0.9452
cp       38.6627    17.8679     5.4079     4.1961      6.0000
bic    -110.0121  -121.7838  -131.1824  -130.7780   -127.0883

FORWARD SELECTION : regsubsets()
> fwd <- regsubsets(disburse ~ income + food + house + fulyt + numem, data = DIS, method = "forward")
> fwds <- summary(fwd)
> STAT.fwd <- rbind(fwds$rss, fwds$adjr2, fwds$cp, fwds$bic)
> rownames(STAT.fwd) <- c("rss", "adjr2", "cp", "bic")
> colnames(STAT.fwd) <- c("i", "i-h", "i-fd-h", "i-fd-h-n", "i-fd-fl-h-n")
> round(STAT.fwd, 4)
               i        i-h     i-fd-h   i-fd-h-n i-fd-fl-h-n
rss   66515.2057 48606.4305 37245.9663 34722.6207  34568.5604
adjr2     0.9033     0.9278     0.9435     0.9462      0.9452
cp       38.6627    17.8679     5.4079     4.1961      6.0000
bic    -110.0121  -121.7838  -131.1824  -130.7780   -127.0883

STEPWISE REGRESSION : step()
Used to select a formula-based model by AIC.
> step(object, direction = c("both", "forward", "backward"))
# object - an lm object
# direction - mode of stepwise search; can be one of "both", "backward", or "forward"

STEPWISE REGRESSION : step()
> library(MASS)
> stepw <- step(gfit, direction = "both")

Residual diagnostic plots (for a fitted model, here called trans2):
> par(mfrow = c(2, 2))
> qqnorm(trans2$res, ylab = "Raw Residuals")
> qqline(trans2$res)
> qqnorm(rstudent(trans2), ylab = "Studentized residuals")
> abline(0, 1)
> hist(trans2$res, 10)
> boxplot(trans2$res, main = "Boxplot of residuals")
> par(mfrow = c(1, 1))

OUTLIER
An observation that is unconditionally unusual in either its Y or X value.
Outliers can cause us to misinterpret patterns in plots, and can affect the visual resolution of the remaining data in plots (forcing observations into clusters).
They can have a strong influence on statistical models; deleting outliers from a regression model can sometimes give completely different results.

OUTLIER TESTS
An outlier test is useful to enable us to distinguish between truly unusual points and residuals which are large but not exceptional.
Before performing a test for outliers, check if the extreme observations are due to errors in computation or some other explainable causes.
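As a hedged illustration (this call is not shown in the original slides), the car package, already used above, provides a Bonferroni-adjusted test on the largest studentized residual:
> library(car)
> outlierTest(gfit)   # most extreme observation with its Bonferroni p-value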

IDENTIFYING INFLUENTIAL CASES
A popular measure of influential points is Cook's distance statistic (Di), used to ascertain whether or not outliers are influential cases.
An index plot of Di can be used to identify influential points.
Cook's original cutoff is 1.0 (too liberal); Fox proposed 4/(n - p - 1).

COOK'S DISTANCE : cooks.distance()
cooks.distance() is one of the functions of the package stats that computes the Cook's distance statistics.
> cooks.distance(model)
# model - an object returned by lm or glm
Illustration
> cook <- cooks.distance(gfit)
> plot(cook, ylab = "Cooks distances", xlim = c(0, 60))
> OBS <- 1:50                        # observation labels
> identify(1:50, cook, OBS)
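Fox's cutoff from the previous slide can also be applied directly; a short sketch, assuming n = 50 observations and p = 5 predictors as in the Fies data:
> cutoff <- 4/(50 - 5 - 1)           # Fox's 4/(n - p - 1)
> which(cook > cutoff)               # indices of potentially influential cases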

REMEDIAL MEASURES FOR INFLUENTIAL CASES
When an outlying observation is accurate but no explanation can be found for it, the following may be considered:
Check for any independent variable that may have an unusual distribution that produces extreme values.
Check if the dependent variable has properties that may lead to occasional large residuals.
Check the appropriateness of the model.
Apply an appropriate transformation.
Use robust approaches (refer to regression books).

SUMMARY
Use of lm() to fit multiple linear regression
Different methods of selecting independent variables
How to handle violations of the assumptions of regression:
Multicollinearity
Variance heterogeneity
Non-normality of residuals

How to handle outliers

Thank you!

