Regression Analysis

Date posted: 15-Nov-2014 · Uploaded by: nadiazaheer

Regression analysis, Week 2 (19th to 23rd Sept, 2011)
Page 1: Regression Analysis

Regression Analysis (Week 2: 19th to 23rd Sept, 2011)

Page 2: Regression Analysis

Course Map

Introduction to Quantitative Analysis, Ch1, RSH (1 Week)

Regression Models Ch4 (1week)

Decision Analysis, Ch3, RSH (2 Weeks)

Linear Programming Models: Graphical & Computer Methods, Ch7, RSH (2 Weeks)

Linear Programming Modeling Applications: With Computer Analyses in Excel, Ch8, RSH (2 Weeks)

Simulation Modeling, Ch15, RSH (2 Weeks)

Forecasting, Ch5, RSH. (2 Weeks)

Waiting Lines and Queuing Theory Models, Ch14, RSH. (2 Weeks)

Page 3: Regression Analysis

Regression Analysis

A very valuable tool for today's manager. Regression analysis is used to:

Understand the relationship between variables.

Predict the value of one variable based on another variable.

A regression model has:

a dependent, or response, variable (Y axis)

an independent, or predictor, variable (X axis)

Page 4: Regression Analysis

How to perform Regression analysis

Page 5: Regression Analysis

Regression Analysis

Triple A Construction Company renovates old homes in Albany. The company has found that its dollar volume of renovation work depends on the Albany area payroll.

Local Payroll ($100,000,000's)   Triple A Sales ($100,000's)
3                                6
4                                8
6                                9
4                                5
2                                4.5
5                                9.5

Page 6: Regression Analysis

Scatter Plot

[Scatter plot of Triple A Sales ($100,000's) against Local Payroll ($100,000,000's)]

Page 7: Regression Analysis

Regression Analysis Model

Create a scatter plot, then perform the regression analysis.

Y = β0 + β1X + ε

where:
Y = dependent variable (response)
X = independent variable (predictor)
β0 = intercept (value of Y when X = 0)
β1 = slope
ε = random error that cannot be predicted

Regression: understand and predict.

Page 8: Regression Analysis

Regression Analysis Model

Sample data are used to estimate the true values for the intercept and slope:

Ŷ = b0 + b1X

where Ŷ = predicted value of Y.

The difference between the actual value of Y and the predicted value (using sample data) is known as the error:

Error = (actual value) − (predicted value)

e = Y − Ŷ

Page 9: Regression Analysis

Regression Analysis Model

Sales (Y)   Payroll (X)   (X − X̄)²   (X − X̄)(Y − Ȳ)
6           3             1           1
8           4             0           0
9           6             4           4
5           4             0           0
4.5         2             4           5
9.5         5             1           2.5
Sums: 42    24            10          12.5

Ȳ = 42/6 = 7    X̄ = 24/6 = 4

Calculating the required parameters:

b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 12.5 / 10 = 1.25

b0 = Ȳ − b1X̄ = 7 − (1.25)(4) = 2

So, Ŷ = 2 + 1.25X
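The hand calculation above can be reproduced in a few lines. A minimal sketch (the slides use Excel, not Python; this is just the same arithmetic):

```python
# Least-squares slope and intercept for the Triple A Construction data,
# computed exactly as on the slide: b1 = sum((X-Xbar)(Y-Ybar)) / sum((X-Xbar)^2)
payroll = [3, 4, 6, 4, 2, 5]      # X, local payroll ($100,000,000's)
sales   = [6, 8, 9, 5, 4.5, 9.5]  # Y, Triple A sales ($100,000's)

n = len(payroll)
x_bar = sum(payroll) / n          # 24/6 = 4
y_bar = sum(sales) / n            # 42/6 = 7

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(payroll, sales)) \
     / sum((x - x_bar) ** 2 for x in payroll)   # 12.5 / 10 = 1.25
b0 = y_bar - b1 * x_bar                          # 7 - (1.25)(4) = 2

print(f"Y-hat = {b0} + {b1}X")   # Y-hat = 2.0 + 1.25X
```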

Page 10: Regression Analysis

Measuring the Fit of the Linear Regression Model

Page 11: Regression Analysis

Measuring the Fit of the Linear Regression Model

To understand how well X predicts Y, we evaluate the variability in the Y variable:

  SSR – regression variability, explained by the relationship between X and Y
+ SSE – unexplained variability, due to factors other than the regression
= SST – total variability about the mean

Coefficient of determination (r²) – proportion of explained variation

Correlation coefficient (r) – strength of the relationship between the Y and X variables

Standard error – standard deviation of the error around the regression line

Residual analysis – validation of the model

Test for linearity – significance of the regression model, i.e. whether a linear relationship exists

Page 12: Regression Analysis

Variability

[Scatter plot with fitted regression line y = 1.25x + 2 (R² = 0.6944), showing the SST, SSE, and SSR deviations about the mean Ȳ; X axis: Local Payroll ($100,000,000's)]

Page 13: Regression Analysis

Variability

- Sum of Squares Total (SST) measures the total variability in Y.
- Sum of the Squared Error (SSE) is less than the SST because the regression line reduces the variability.
- Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.

Errors (deviations) may be positive or negative. Summing the errors would be misleading, so we square the terms prior to summing.

SST = Σ(Y − Ȳ)²
SSE = Σe² = Σ(Y − Ŷ)²
SSR = Σ(Ŷ − Ȳ)²

For Triple A Construction:

SST = 22.5
SSE = 6.875
SSR = 15.625

Note: SST = SSR + SSE (explained variability + unexplained variability)
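The variability decomposition is easy to verify numerically. A quick sketch using the Triple A data and the fitted line Ŷ = 2 + 1.25X from the earlier slide:

```python
# Decomposing the variability for Triple A Construction.
payroll = [3, 4, 6, 4, 2, 5]
sales   = [6, 8, 9, 5, 4.5, 9.5]

y_bar = sum(sales) / len(sales)                  # mean of Y = 7
y_hat = [2 + 1.25 * x for x in payroll]          # predicted values

sst = sum((y - y_bar) ** 2 for y in sales)               # total variability
sse = sum((y - yh) ** 2 for y, yh in zip(sales, y_hat))  # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained

print(sst, sse, ssr)   # 22.5 6.875 15.625, and SST = SSR + SSE
```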

Page 14: Regression Analysis

Coefficient of Determination

The coefficient of determination (r²) is the proportion of the variability in Y that is explained by the regression equation:

r² = SSR / SST = 1 − SSE / SST

For Triple A Construction:

r² = 15.625 / 22.5 = 0.6944

About 69% of the variability in sales is explained by the regression based on payroll.

Note: 0 ≤ r² ≤ 1

SST, SSR, and SSE by themselves provide little direct interpretation; r² measures the usefulness of the regression.

Page 15: Regression Analysis

Correlation Coefficient

The correlation coefficient (r) measures the strength of the linear relationship:

r = (nΣXY − ΣXΣY) / √([nΣX² − (ΣX)²][nΣY² − (ΣY)²])

For Triple A Construction, r = 0.8333.

Note: −1 ≤ r ≤ 1

Scatter diagrams take different shapes for different values of r. r is shown as "Multiple R" in the Excel output file.
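The sums formula can be checked directly on the Triple A data. A small sketch (the slides compute this in Excel):

```python
import math

# Correlation coefficient via the sums formula:
# r = (n*SumXY - SumX*SumY) / sqrt([n*SumX^2 - (SumX)^2][n*SumY^2 - (SumY)^2])
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]
n = len(X)

sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)
syy = sum(y * y for y in Y)

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 4))   # 0.8333
```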

Page 16: Regression Analysis

Correlation Coefficient

Page 17: Regression Analysis

Standard Error

The mean squared error (MSE) is the estimate of the error variance of the regression equation:

s² = MSE = SSE / (n − k − 1)

where n = number of observations in the sample and k = number of independent variables.

For Triple A Construction, s = 1.31.

This is the estimate of the variance. Just like a standard deviation (which is measured around the mean), it measures the variation of Y around the regression line, i.e. the standard deviation of the error around the regression line. It has the same units as Y: here it means roughly ±1.31 × $100,000 of sales error in prediction.
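The standard error for Triple A follows directly from the SSE computed earlier. A quick sketch:

```python
import math

# Standard error of the estimate:
# s^2 = MSE = SSE / (n - k - 1), with SSE = 6.875, n = 6 observations, k = 1 predictor
sse, n, k = 6.875, 6, 1
mse = sse / (n - k - 1)    # 6.875 / 4 = 1.71875
s = math.sqrt(mse)         # about 1.31, in the same units as Y ($100,000's)
print(round(s, 2))         # 1.31
```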

Page 18: Regression Analysis

Test for Linearity

An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. β1 = 0). If the significance level for the F test is low, we reject H0 and conclude there is a linear relationship.

F = MSR / MSE, where MSR = SSR / k

For Triple A Construction:

MSR = 15.625 / 1 = 15.625

F = 15.625 / 1.7188 = 9.0909

The significance level for F = 9.0909 is 0.0394, indicating we reject H0 and conclude a linear relationship exists between sales and payroll.

The p-value is the significance level; alpha, the level of significance, equals 1 − confidence level. If p < alpha, reject the null hypothesis that there is no linear relationship between X and Y.
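The F statistic itself is just a ratio of the quantities already computed. A minimal sketch:

```python
# Overall significance test (H0: beta1 = 0): F = MSR / MSE, with MSR = SSR / k.
ssr, sse, n, k = 15.625, 6.875, 6, 1
msr = ssr / k                # 15.625
mse = sse / (n - k - 1)      # 1.71875
F = msr / mse
print(round(F, 4))           # 9.0909
```

The slide's significance level of 0.0394 comes from the F distribution with (k, n − k − 1) = (1, 4) degrees of freedom, e.g. via Excel's F.DIST.RT or scipy.stats.f.sf.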

Page 19: Regression Analysis

Computer Software for Regression

In Excel, use Tools / Data Analysis. This is an "add-in" option.

Page 20: Regression Analysis

Computer Software for Regression

Page 21: Regression Analysis

Computer Software for Regression

Multiple R is the correlation coefficient.

The standard error is the estimate of the variance. Just like a standard deviation (which is measured around the mean), it measures the variation of Y around the regression line, i.e. the standard deviation of the error around the regression line. Same units as Y: roughly ±1.31 × $100,000 of sales error in prediction.

p-value < alpha (0.05 or 0.1) means the relationship between X and Y is linear.

The adjusted R Square takes into account the number of independent variables in the model.

Page 22: Regression Analysis

ANOVA Table

Page 23: Regression Analysis

Residual Analysis: to verify the regression assumptions are correct

Page 24: Regression Analysis

Assumptions of the Regression Model

We make certain assumptions about the errors in a regression model which allow for statistical testing.

Assumptions:
- Errors are independent.
- Errors are normally distributed.
- Errors have a mean of zero.
- Errors have a constant variance.

A plot of the errors (actual value minus predicted value of Y), also called residuals in Excel, may highlight problems with the model.

PITFALLS: Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X = 0). A linear regression model may not be the best model, even in the presence of a significant F test.

Page 25: Regression Analysis

Constant Variance

Triple A Construction: the assumption is that errors have constant variance. Plot the residuals against the X values; the pattern should be random. A residual plot showing non-constant variation in the error is a violation.

Page 26: Regression Analysis

Normal distribution

Histogram of Residuals - Should look like a bell curve

Triple A Construction

It is not possible to see the bell curve with just 6 observations; more samples are needed.

Page 27: Regression Analysis

Zero Mean

Triple A Construction: the assumption is that errors have zero mean.

Page 28: Regression Analysis

Independent Errors

If samples are collected over a period of time rather than all at once, plot the residuals against time to see if any pattern (autocorrelation) exists. If there is substantial autocorrelation, the validity of the regression model becomes doubtful. Autocorrelation can also be checked using the Durbin–Watson statistic.

Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases over a period of 100 days. The data are collected over a period of time, so check for an autocorrelation (pattern) effect.

[Plot of residuals against time showing a cyclical pattern: a violation]
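The Durbin–Watson statistic mentioned above is simple to compute. A sketch on hypothetical residuals (not the slide's data): DW near 2 suggests no autocorrelation, near 0 positive autocorrelation, and near 4 negative autocorrelation.

```python
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # hypothetical, perfectly alternating

num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
dw = num / den
print(dw)   # alternating residuals push DW toward 4 (negative autocorrelation)
```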

Page 29: Regression Analysis

Residual analysis for validating assumptions

A nonlinear pattern in the residual plot is a violation.

Page 30: Regression Analysis

multiple regression

Page 31: Regression Analysis

Multiple Regression

Multiple regression models are similar to simple linear regression models except that they include more than one independent (X) variable:

Y = b0 + b1X1 + b2X2 + … + bnXn

Price Sq. Feet Age Condition

35000 1926 30 Good

47000 2069 40 Excellent

49900 1720 30 Excellent

55000 1396 15 Good

58900 1706 32 Mint

60000 1847 38 Mint

67000 1950 27 Mint

70000 2323 30 Excellent

78500 2285 26 Mint

79000 3752 35 Good

87500 2300 18 Good

93000 2525 17 Good

95000 3800 40 Excellent

97000 1740 12 Mint

Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.

Page 32: Regression Analysis

multiple regression

67% of the variation in sales price is explained by size and age.

H0 (no linear relationship) is rejected.

H0: β1 = 0 is rejected. H0: β2 = 0 is rejected.

Ŷ = 60815.45 + 21.91(size) − 1449.34(age)

Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.

For a 1900 square foot house that is 10 years old, the following prediction can be made:

$87,951 = 60815.45 + 21.91(1900) − 1449.34(10)
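The prediction is a direct plug-in of the fitted coefficients. A small sketch (coefficients taken from the slide):

```python
# Wilson Realty model: Y-hat = 60815.45 + 21.91*(size) - 1449.34*(age)
b0, b_size, b_age = 60815.45, 21.91, -1449.34

def predict(size, age):
    """Predicted listing price for a house of the given size (sq ft) and age (years)."""
    return b0 + b_size * size + b_age * age

print(round(predict(1900, 10)))   # 87951
```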

Page 33: Regression Analysis

binary or dummy variables

Page 34: Regression Analysis

Dummy Variables

Binary (or dummy) variables are special variables that are created for qualitative data.

- A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
- The number of dummy variables must equal one less than the number of categories of the qualitative variable.

Returning to Wilson Realty, let's evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.

X3 = 1 if the house is in excellent condition, 0 otherwise
X4 = 1 if the house is in mint condition, 0 otherwise

Note: if both X3 and X4 = 0, then the house is in good condition.
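The encoding rule above (one fewer dummy than categories, with "Good" as the baseline) can be sketched as:

```python
# Two dummy variables for Wilson Realty's three-level condition variable.
def condition_dummies(condition):
    """Return (X3, X4) for a house's condition; 'Good' is the baseline (0, 0)."""
    x3 = 1 if condition == "Excellent" else 0
    x4 = 1 if condition == "Mint" else 0
    return x3, x4

print(condition_dummies("Excellent"))  # (1, 0)
print(condition_dummies("Mint"))       # (0, 1)
print(condition_dummies("Good"))       # (0, 0) -- baseline category
```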

Page 35: Regression Analysis

dummy variables

Y = 48329.23 + 28.21 (size) – 1981.41(age) + 23684.62 (if mint) + 16581.32 (if excellent)

As more variables are added to the model, r² usually increases.

Page 36: Regression Analysis

model building

Page 37: Regression Analysis

adjusted r-Square

- As more variables are added to the model, r² usually increases.
- The adjusted r² takes into account the number of independent variables in the model.

The best model is a statistically significant model with a high r² and few variables.

Note: when variables are added to the model, the value of r² can never decrease; however, the adjusted r² may decrease.
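The standard adjustment formula is adj r² = 1 − (1 − r²)(n − 1)/(n − k − 1). A quick illustration on the Triple A simple regression (the adjusted value itself is computed here, not taken from the slides):

```python
# Adjusted r-squared penalizes extra predictors.
sst, sse, n, k = 22.5, 6.875, 6, 1
r2 = 1 - sse / sst                               # 15.625 / 22.5 = 0.6944
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # shrinks r2 for model size
print(round(r2, 4), round(adj_r2, 4))            # 0.6944 0.6181
```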

Page 38: Regression Analysis

multicollinearity

!  Collinearity and multicollinearity create problems in the coefficients.

!  The overall model prediction is still good; however, individual interpretation of the variables is questionable.

Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable.

Duplication of information occurs

When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not.

A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.

Page 39: Regression Analysis

non-linear regression

Page 40: Regression Analysis

Non-Linear Regression

Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are studying the impact of weight on miles per gallon (MPG).

Linear regression model: MPG = 47.8 − 8.2(weight)

F significance = .0003, r² = .7446

Page 41: Regression Analysis

Non-Linear Regression

Nonlinear (transformed variable) regression model:

MPG = 79.8 − 30.2(weight) + 3.4(weight)²

F significance = .0002, r² = .8478

Page 42: Regression Analysis

non-linear regression

We should not try to interpret the coefficients of the variables because of the correlation between weight and weight squared.

Normally we would interpret a coefficient as the change in Y that results from a 1-unit change in that variable, while holding all other variables constant. Obviously, holding one variable constant while changing the other is impossible in this example: if weight changes, then weight² must change as well.

This is an example of a problem that exists when multicollinearity is present.

Page 43: Regression Analysis

chapter assignments on LMS

Page 44: Regression Analysis

quiz in next class

Page 45: Regression Analysis

Case studies

