Linear Regression Analysis
Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R² and Adjusted R²
Overall Validity of the Model (F-test)
Testing for Individual Regressors (t-test)
Problem of Multicollinearity
Smoking and Lung Capacity
Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity.
We might ask a group of people about their smoking habits, and measure their lung capacities.

Cigarettes (X)   Lung Capacity (Y)
 0               45
 5               42
10               33
15               31
20               29
Scatter Plot of the Data
We can see that as smoking goes up, lung capacity tends to go down: the two variables change in opposite directions.
[Scatter plot: Cigarettes (X) vs. Lung Capacity (Y)]
Height and Weight
Consider the following data on the heights and weights of 5 women swimmers:

Height (inches): 62  64  65  66  68
Weight (pounds): 102 108 115 128 132

We can observe that weight increases with height.
[Scatter plot: Height (X) vs. Weight (Y)]
Sometimes two variables are related to each other.
The values of both variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population. For example:
Height – Weight
Advertising Expenditure – Sales Volume
Unemployment – Crime Rate
Rainfall – Food Production
Expenditure – Savings
We have already studied one measure of relationship between two variables: Covariance.
The covariance between two random variables X and Y is given by
Cov(X, Y) = E(XY) − E(X)E(Y)
For n paired observations on variables X and Y,
Cov(X, Y) = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
Properties of Covariance:
Cov(X + a, Y + b) = Cov(X, Y) [not affected by change in location]
Cov(aX, bY) = ab Cov(X, Y) [affected by change in scale]
Covariance can take any value from −∞ to +∞.
Cov(X, Y) > 0 means X and Y change in the same direction.
Cov(X, Y) < 0 means X and Y change in opposite directions.
If X and Y are independent, Cov(X, Y) = 0 [the converse may not be true].
Covariance is not unit free, so it is not a good measure of the relationship between two variables.
A better measure is the correlation coefficient. It is unit free and takes values in [−1, +1].
Correlation
Karl Pearson's correlation coefficient is given by
r_XY = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
When the joint distribution of X and Y is known:
Cov(X, Y) = E(XY) − E(X)E(Y)
Var(X) = E(X²) − [E(X)]²,  Var(Y) = E(Y²) − [E(Y)]²
When n paired observations on X and Y are available:
Cov(X, Y) = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
Var(X) = (1/n) Σᵢ (xᵢ − x̄)²,  Var(Y) = (1/n) Σᵢ (yᵢ − ȳ)²
Properties of the Correlation Coefficient
Corr(aX + b, cY + d) = Corr(X, Y) when a and c have the same sign; it is unit free.
It measures the strength of the relationship on a scale of −1 to +1.
So, it can be used to compare the relationships of various pairs of variables.
Values close to 0 indicate little or no correlation.
Values close to +1 indicate a very strong positive correlation.
Values close to −1 indicate a very strong negative correlation.
Scatter Diagram
[Scatter diagrams: positively correlated, negatively correlated, weakly correlated, strongly correlated, not correlated]
The correlation coefficient measures the strength of a linear relationship.
r = 0 does not necessarily imply that there is no relationship.
A relationship may be there, but it is not a linear one.
[Scatter plots: nonlinear patterns for which r ≈ 0]
x      y    (x−x̄)   (y−ȳ)   (x−x̄)²   (y−ȳ)²   (x−x̄)(y−ȳ)
1.25  125   −0.90    45     0.8100    2025     −40.50
1.75  105   −0.40    25     0.1600     625     −10.00
2.25   65    0.10   −15     0.0100     225      −1.50
2.00   85   −0.15     5     0.0225      25      −0.75
2.50   75    0.35    −5     0.1225      25      −1.75
2.25   80    0.10     0     0.0100       0       0.00
2.70   50    0.55   −30     0.3025     900     −16.50
2.50   55    0.35   −25     0.1225     625      −8.75
17.20 640    0        0     1.560      4450    −79.75
                          (= SSX)    (= SSY)  (= SSXY)

r = Cov(X, Y)/√(Var(X) Var(Y)) = SSXY/√(SSX · SSY) = −79.75/√(1.56 × 4450) = −0.957
Alternative Formulas for the Sums of Squares
SSX = Σx² − (Σx)²/n,  SSY = Σy² − (Σy)²/n,  SSXY = Σxy − (Σx)(Σy)/n

x      y     x²       y²      x·y
1.25  125   1.5625   15625   156.25
1.75  105   3.0625   11025   183.75
2.25   65   5.0625    4225   146.25
2.00   85   4.0000    7225   170.00
2.50   75   6.2500    5625   187.50
2.25   80   5.0625    6400   180.00
2.70   50   7.2900    2500   135.00
2.50   55   6.2500    3025   137.50
17.20 640  38.54     55650  1296.25

SSX = 38.54 − 17.20²/8 = 1.56
SSY = 55650 − 640²/8 = 4450
SSXY = 1296.25 − (17.20)(640)/8 = −79.75

r = Cov(X, Y)/√(Var(X) Var(Y)) = SSXY/√(SSX · SSY) = −79.75/√(1.56 × 4450) = −0.957
Smoking and Lung Capacity Example

Cigarettes (X)  Lung Capacity (Y)   X²    XY    Y²
  0             45                   0      0   2025
  5             42                  25    210   1764
 10             33                 100    330   1089
 15             31                 225    465    961
 20             29                 400    580    841
 50            180                 750   1585   6680
r_xy = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
     = [(5)(1585) − (50)(180)] / √{[(5)(750) − 50²][(5)(6680) − 180²]}
     = (7925 − 9000) / √{(3750 − 2500)(33400 − 32400)}
     = −1075 / √{(1250)(1000)}
     = −0.9615
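As a quick numeric check, the same computation can be scripted; a minimal sketch in Python, with the data hard-coded from the table above:

```python
# Check of the smoking / lung-capacity correlation (data from the slides).
import math

x = [0, 5, 10, 15, 20]        # cigarettes
y = [45, 42, 33, 31, 29]      # lung capacity
n = len(x)

sxy = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
sxx = n * sum(a * a for a in x) - sum(x) ** 2
syy = n * sum(b * b for b in y) - sum(y) ** 2

r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # -0.9615, matching the slide
```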
Regression Analysis
Having determined the correlation between X and Y, we wish to determine a mathematical relationship between them.
Dependent variable: the variable you wish to explain.
Independent variables: the variables used to explain the dependent variable.
Regression analysis is used to:
Predict the value of the dependent variable based on the value of the independent variable(s).
Explain the impact of changes in an independent variable on the dependent variable.
Types of Relationships
[Scatter plots: linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship]
Simple Linear Regression Analysis
The simplest mathematical relationship is linear:
Y = a + bX + error
Changes in Y are related to changes in X.
What are the most suitable values of a (intercept) and b (slope)?
[Line y = a + bx, with intercept a and slope b]
Method of Least Squares
The best fitted line is the one for which all the ERRORS are minimum.
[Scatter of points (xᵢ, yᵢ) around the line ŷ = a + bx; the vertical deviations are the errors]
We want to fit a line for which all the errors are minimum.
That is, we want to obtain values of a and b in Y = a + bX + error for which all the errors are minimum.
To minimize all the errors together, we minimize the sum of squares of errors (SSE):
SSE = Σᵢ₌₁ⁿ (Yᵢ − a − bXᵢ)²
To get the values of a and b which minimize SSE, we proceed as follows:
∂SSE/∂a = 0 ⇒ −2 Σᵢ (Yᵢ − a − bXᵢ) = 0 ⇒ Σᵢ Yᵢ = na + b Σᵢ Xᵢ   …(1)
∂SSE/∂b = 0 ⇒ −2 Σᵢ Xᵢ(Yᵢ − a − bXᵢ) = 0 ⇒ Σᵢ XᵢYᵢ = a Σᵢ Xᵢ + b Σᵢ Xᵢ²   …(2)
Eq. (1) and (2) are called normal equations.
Solve the normal equations to get a and b.
Solving the above normal equations,
Σᵢ Yᵢ = na + b Σᵢ Xᵢ
Σᵢ XᵢYᵢ = a Σᵢ Xᵢ + b Σᵢ Xᵢ²
we get
b = [n Σᵢ XᵢYᵢ − (Σᵢ Xᵢ)(Σᵢ Yᵢ)] / [n Σᵢ Xᵢ² − (Σᵢ Xᵢ)²] = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ (Xᵢ − X̄)² = SSXY/SSX
a = Ȳ − bX̄
The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b. Thus, the LSE of a and b are given by
b = SSXY/SSX,  a = Ȳ − bX̄
Also, the correlation coefficient between X and Y is
r_XY = Cov(X, Y)/√(Var(X) Var(Y)) = SSXY/√(SSX · SSY) = (SSXY/SSX) √(SSX/SSY) = b √(SSX/SSY)
Using the deviations table computed earlier for these data:
SSX = 1.560, SSY = 4450, SSXY = −79.75, X̄ = 2.15, Ȳ = 80.
r = SSXY/√(SSX · SSY) = −0.957
b = SSXY/SSX = −51.12
a = Ȳ − bX̄ = 189.91
Fitted line is Ŷ = 189.91 − 51.12X
[Scatter plot of the data with the fitted line]
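The fit above can be reproduced in a few lines; a minimal sketch in Python using b = SSXY/SSX and a = Ȳ − bX̄, with the x and y data hard-coded from the earlier table:

```python
# Least squares fit for the x-y data above (values from the slides).
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n            # 2.15 and 80

ssx = sum((xi - xbar) ** 2 for xi in x)        # 1.56
ssxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # -79.75

b = ssxy / ssx            # slope, about -51.12
a = ybar - b * xbar       # intercept, about 189.91
print(round(a, 2), round(b, 2))
print(round(a + b * 2.15, 2))  # predicted Y at X = 2.15, about 80.0
```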
Fitted line is Ŷ = 189.91 − 51.12X
189.91 is the estimated mean value of Y when the value of X is zero.
−51.12 is the change in the average value of Y as a result of a one-unit change in X.
We can predict the value of Y for some given value of X.
For example, at X = 2.15 the predicted value of Y is
189.91 − 51.12 × 2.15 = 80.002
Residuals: eᵢ = Yᵢ − Ŷᵢ
A residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of the residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model for the given data.
Coefficient of Determination
Total Sum of Squares:      SST = Σᵢ (Yᵢ − Ȳ)²
Regression Sum of Squares: SSR = Σᵢ (Ŷᵢ − Ȳ)²
Error Sum of Squares:      SSE = Σᵢ (Yᵢ − Ŷᵢ)²
Also, SST = SSR + SSE
The fraction of SST explained by the regression is given by R²:
R² = SSR/SST = 1 − (SSE/SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² is close to 1.
This means that the regression explains most of the variability in Y. (Fit is good.)
When SSE is close to SST, R² is close to 0.
This means that the regression does not explain much variability in Y. (Fit is not good.)
R² is the square of the correlation coefficient between X and Y. (Proof omitted.)
r = ±1, R² = 1: perfect linear relationship; 100% of the variation in Y is explained by X.
0 < R² < 1: weaker linear relationship; some but not all of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.
Coefficient of Determination: R² = (4450 − 370.54)/4450 ≈ 0.916
Correlation Coefficient: r = −0.957
Coefficient of Determination = (Correlation Coefficient)²
X     Y    Ŷ      (Y−Ȳ)  (Y−Ŷ)  (Ŷ−Ȳ)  (Y−Ȳ)²  (Y−Ŷ)²   (Ŷ−Ȳ)²
1.25  125  126.0   45    −1.0    46.0   2025     1.00    2116.00
1.75  105  100.5   25     4.5    20.5    625    20.25     420.25
2.25   65   74.9  −15    −9.9    −5.1    225    98.00      26.01
2.00   85   87.7    5    −2.2     7.7     25     4.84      59.29
2.50   75   62.1   −5    12.9   −17.7     25   166.41     313.29
2.25   80   74.9    0     5.1    −5.1      0    26.01      26.01
2.70   50   51.9  −30    −1.9   −28.1    900     3.61     789.61
2.50   55   62.1  −25    −7.1   −17.9    625    50.41     320.41
17.20 640                               4450   370.54    4079.46
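A sketch of the same R² computation in Python, reusing the rounded coefficients a = 189.91 and b = −51.12 from the earlier slide (small rounding differences from the table are expected):

```python
# Residuals, SSE, and R² for the fitted line above.
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
a, b = 189.91, -51.12
ybar = sum(y) / len(y)

yhat = [a + b * xi for xi in x]                            # predicted values
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))       # error sum of squares
sst = sum((yi - ybar) ** 2 for yi in y)                    # total sum of squares

r2 = 1 - sse / sst
print(round(r2, 3))                  # about 0.916
s_yx = (sse / (len(y) - 2)) ** 0.5   # standard error of the estimate (introduced later)
```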
Example:
Watching television reduces the amount of physical exercise, causing weight gain.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded. These data are listed here.
Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.
Y = −24.709 + 0.967 X and R² = 0.768
TV 42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight 18 6 0 -1 13 14 7 7 -9 8 8 5 3 14 -7
[Plot: observed Y and predicted Y for the 15 children]
Standard Error
Consider a dataset. All the observations cannot be exactly the same as the arithmetic mean (AM).
The variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
The variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
It is given by
2
)(
2
1
2
n
YY
n
SSES
n
i
ii
YX
Assumptions
The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (homoscedasticity):
Var(eᵢ) = σ², with E(eᵢ) = 0, where eᵢ = Yᵢ − Ŷᵢ.
No distributional assumption about the errors is required for the least squares method.
Independence
[Residual-vs-X plots: a patterned plot indicates dependence; a patternless plot indicates independence]
Equal Variance
[Residual plots: unequal variance (heteroscedastic) vs. equal variance (homoscedastic)]
TV Watching – Weight Gain Example
[Scatter plot of X and Y; scatter plot of X and residuals]
The Multiple Linear Regression Model
In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
We assume that Y is regressed on only one regressor variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X₁, X₂, X₃, …).
For example:
Cost -> Labor cost, Electricity cost, Raw material cost
Salary -> Education, Experience
Sales -> Cost, Advertising Expenditure
Example:
A distributor of frozen dessert pies wants to evaluate the factors which influence demand.
Dependent variable:
Y: Pie sales (units per week)
Independent variables:
X₁: Price (in $)
X₂: Advertising expenditure (in $100s)
Data are collected for 15 weeks.
Week  Pie Sales  Price ($)  Advertising ($100s)
 1    350        5.50       3.3
 2    460        7.50       3.3
 3    350        8.00       3.0
 4    430        8.00       4.5
 5    350        6.80       3.0
 6    380        7.50       4.0
 7    430        4.50       3.0
 8    470        6.40       3.7
 9    450        7.00       3.5
10    490        5.00       4.0
11    340        7.20       3.5
12    300        7.90       3.2
13    440        5.90       4.0
14    450        5.00       3.5
15    300        7.00       2.7
Using the given data, we wish to fit a linear function of the form
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ,  i = 1, 2, …, 15,
where
Y: Pie sales (units per week)
X₁: Price (in $)
X₂: Advertising expenditure (in $100s)
Fitting means we want to obtain the values of the regression coefficients, denoted by β.
The original values of the β's are not known.
We estimate them using the given data.
The Multiple Linear Regression Model
Examine the linear relationship between one dependent variable (Y) and two or more independent variables (X₁, X₂, …, X_k).
Multiple linear regression model with k independent variables:
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + … + β_k X_kᵢ + εᵢ,  i = 1, 2, …, n,
where β₀ is the intercept, β₁, …, β_k are the slopes, and εᵢ is the random error.
Multiple Linear Regression Equation
The intercept and slopes are estimated using the observed data.
Multiple linear regression equation with k independent variables:
Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + … + b_k X_kᵢ,  i = 1, 2, …, n,
where Ŷᵢ is the estimated value, b₀ is the estimate of the intercept, and b₁, …, b_k are the estimates of the slopes.
Multiple Regression Equation
Example with two independent variables:
Ŷ = b₀ + b₁X₁ + b₂X₂
[3-D sketch: the fitted plane over the (X₁, X₂) space]
Estimating Regression Coefficients
The multiple linear regression model
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + … + β_k X_kᵢ + εᵢ,  i = 1, 2, …, n,
in matrix notation is
Y = Xβ + ε,
where Y = (Y₁, …, Y_n)′ is the n×1 response vector, X is the n×(k+1) design matrix whose i-th row is (1, X₁ᵢ, X₂ᵢ, …, X_kᵢ), β = (β₀, β₁, …, β_k)′, and ε = (ε₁, …, ε_n)′.
Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (homoscedasticity): Var(εᵢ) = σ².
In the long run, the mean effect of the random errors is zero: E(εᵢ) = 0.
No assumption on the distribution of the random errors is required for the least squares method.
In order to find the estimate of β, we minimize
S(β) = Σᵢ εᵢ² = ε′ε = (Y − Xβ)′(Y − Xβ) = Y′Y − 2β′X′Y + β′X′Xβ.
We differentiate S(β) with respect to β and equate to zero, i.e.,
∂S/∂β = 0.
This gives
b = (X′X)⁻¹X′Y.
b is called the least squares estimator of β.
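A minimal NumPy sketch of b = (X′X)⁻¹X′Y for the pie-sales data tabulated earlier (the variable names are illustrative):

```python
# Least squares estimate via the normal equations, pie-sales data from the slides.
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                  3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

X = np.column_stack([np.ones(len(sales)), price, adv])  # design matrix with intercept
b = np.linalg.solve(X.T @ X, X.T @ sales)               # solves (X'X) b = X'Y
print(b.round(2))  # roughly [306.53, -24.98, 74.13], as on the next slide
```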
Example: Consider the pie example.
We want to fit the model
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ.
The variables are
Y: Pie sales (units per week)
X₁: Price (in $)
X₂: Advertising expenditure (in $100s)
Using the matrix formula, the least squares estimates (LSE) of the β's are obtained as below:
LSE of intercept β₀: b₀ = 306.53
LSE of slope β₁ (Price): b₁ = −24.98
LSE of slope β₂ (Advertising): b₂ = 74.13
Pie Sales = 306.53 − 24.98 Price + 74.13 Adv. Expend.
Sales = 306.53 − 24.98(X₁) + 74.13(X₂)
b₁ = −24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
b₂ = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while selling price is kept fixed.
Prediction:
Predict sales for a week in which
the selling price is $5.50 and
the advertising expenditure is $350.
Note that advertising is in $100s, so X₂ = 3.5.
Sales = 306.53 − 24.98 X₁ + 74.13 X₂
      = 306.53 − 24.98 (5.50) + 74.13 (3.5)
      = 428.62
Predicted sales is 428.62 pies.
Fitted model: Ŷ = 306.526 − 24.975 X₁ + 74.131 X₂

Y    X₁   X₂   Predicted Y  Residual
350  5.5  3.3  413.77       −63.80
460  7.5  3.3  363.81        96.15
350  8.0  3.0  329.08        20.88
430  8.0  4.5  440.28       −10.31
350  6.8  3.0  359.06        −9.09
380  7.5  4.0  415.70       −35.74
430  4.5  3.0  416.51        13.47
470  6.4  3.7  420.94        49.03
450  7.0  3.5  391.13        58.84
490  5.0  4.0  478.15        11.83
340  7.2  3.5  386.13       −46.16
300  7.9  3.2  346.40       −46.44
440  5.9  4.0  455.67       −15.70
450  5.0  3.5  441.09         8.89
300  7.0  2.7  331.82       −31.85
[Plot: observed Y and predicted Y for the 15 weeks]
Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression:
R² = SSR/SST = 1 − (SSE/SST)
R² is the proportion of variation in Y explained by the regression.
Total Sum of Squares:      SST = Σᵢ (Yᵢ − Ȳ)²
Regression Sum of Squares: SSR = Σᵢ (Ŷᵢ − Ȳ)²
Error Sum of Squares:      SSE = Σᵢ (Yᵢ − Ŷᵢ)²
Also, SST = SSR + SSE
Since SST = SSR + SSE, and all three quantities are non-negative,
0 ≤ SSR ≤ SST,
so 0 ≤ SSR/SST ≤ 1, i.e., 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good, and the X variables do not contribute to explaining the variability in Y.
When R² is close to 1, the linear fit is good.
In the previously discussed example, R² = 0.5215.
If we consider Y and X₁ only, R² = 0.1965.
If we consider Y and X₂ only, R² = 0.3095.
Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase occurs regardless of the contribution of the newly added regressor.
So, an adjusted value of R² is defined, called adjusted R²:
Adj R² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
This adjusted R² will only increase if the additional variable contributes to explaining the variation in Y.
For our example, Adjusted R² = 0.4417.
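A numeric check of R² and adjusted R², assuming the SSE and SST values reported for this model in the ANOVA table a few slides below:

```python
# R² and adjusted R² for the pie-sales model (SSE, SST from the ANOVA table).
sse, sst, n, k = 27033.31, 56493.33, 15, 2
r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(round(r2, 4), round(adj_r2, 4))  # 0.5215 and 0.4417
```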
F-Test for Overall Significance
We check if there is a linear relationship between all of the regressors (X₁, X₂, …, X_k) and the response (Y).
We use the F test statistic to test:
H₀: β₁ = β₂ = … = β_k = 0 (no regressor is significant)
H₁: at least one βᵢ ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
Assumptions:
n > k, Var(εᵢ) = σ², E(εᵢ) = 0.
The εᵢ's are independent. This implies Corr(εᵢ, εⱼ) = 0 for i ≠ j.
The εᵢ's have a normal distribution: εᵢ ~ N(0, σ²). [NEW ASSUMPTION]
The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE):
SST = Σᵢ (Yᵢ − Ȳ)²
SSE = Σᵢ eᵢ² = Σᵢ (Yᵢ − Ŷᵢ)²
SSR = SST − SSE
where the eᵢ's are called the residuals.
Analysis of Variance Table

Source             df      SS    MS                  F_c
Regression         k       SSR   MSR = SSR/k         MSR/MSE
Residual (Error)   n−k−1   SSE   MSE = SSE/(n−k−1)
Total              n−1     SST

Test statistic: F_c = MSR/MSE ~ F(k, n−k−1)
For the previous example, we wish to test H₀: β₁ = β₂ = 0 against H₁: at least one βᵢ ≠ 0.

Source             df   SS        MS        F       F(2,12)(0.05)
Regression          2   29460.03  14730.01  6.5386  3.89
Residual (Error)   12   27033.31   2252.78
Total              14   56493.33

Thus H₀ is rejected at the 5% level of significance.
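A sketch of this F test in Python; SciPy's F distribution supplies the 5% critical value:

```python
# Overall F test for the pie-sales model (sums of squares from the ANOVA table).
from scipy import stats

ssr, sse, n, k = 29460.03, 27033.31, 15, 2
msr, mse = ssr / k, sse / (n - k - 1)
F = msr / mse
crit = stats.f.ppf(0.95, k, n - k - 1)   # 5% critical value of F(2, 12)
print(round(F, 4), round(crit, 2))       # about 6.5386 and 3.89
# F > crit, so H0 (all slopes zero) is rejected at the 5% level.
```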
Individual Variables: Tests of Hypothesis
We test if there is a linear relationship between a particular regressor Xⱼ and Y.
Hypotheses:
H₀: βⱼ = 0 (no linear relationship)
H₁: βⱼ ≠ 0 (a linear relationship exists between Xⱼ and Y)
We use a two-tailed t-test.
If H₀: βⱼ = 0 is accepted, this indicates that the variable Xⱼ can be deleted from the model.
Test statistic:
T_c = bⱼ / √(σ̂² Cⱼⱼ),  σ̂² = MSE,
where T_c ~ Student's t with (n − k − 1) degrees of freedom,
bⱼ is the least squares estimate of βⱼ, and
Cⱼⱼ is the (j, j)-th element of the matrix (X′X)⁻¹.
(MSE is obtained from the ANOVA table.)
In our example, σ̂² = MSE = 2252.7755, and the Cⱼⱼ are the diagonal elements of (X′X)⁻¹ computed from the data.
To test H₀: β₁ = 0 against H₁: β₁ ≠ 0: T_c = −2.3057.
To test H₀: β₂ = 0 against H₁: β₂ ≠ 0: T_c = 2.8548.
Two-tailed critical values of t at 12 d.f. are
3.0545 for the 1% level of significance,
2.6810 for the 2% level of significance,
2.1788 for the 5% level of significance.
Since |T_c| > 2.1788 in both cases, each slope is significantly different from zero at the 5% level.
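These t statistics can be recomputed from the raw data; a sketch with NumPy (price, adv, sales as in the earlier sketch; the printed values should be reproduced up to rounding):

```python
# t statistics T_c = b_j / sqrt(MSE * C_jj) for the pie-sales model.
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                  3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

X = np.column_stack([np.ones(len(sales)), price, adv])
C = np.linalg.inv(X.T @ X)                  # (X'X)^(-1)
b = C @ X.T @ sales                         # least squares estimates
resid = sales - X @ b
mse = resid @ resid / (len(sales) - 2 - 1)  # SSE / (n - k - 1), here df = 12
t = b / np.sqrt(mse * np.diag(C))
print(t.round(4))  # slope t values should come out near -2.3057 and 2.8548
```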
Standard Error
Consider a dataset. All the observations cannot be exactly the same as the arithmetic mean (AM).
The variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
The variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
For a model with k regressors it is given by
S_YX = √( SSE/(n − k − 1) ) = √( Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² / (n − k − 1) )
Assumption of Linearity
[Residual plots: a curved residual pattern indicates a non-linear relationship; a patternless band indicates linearity]
Assumption of Equal Variance
We assume that Var(εᵢ) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷᵢ against the residuals eᵢ = Yᵢ − Ŷᵢ.
Residual Analysis for Equal Variance
[Residual-vs-Ŷ plots: a funnel shape indicates unequal variance; a uniform band indicates equal variance]
Assumption of Uncorrelated Residuals
The Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation. It is given by
d = Σᵢ₌₂ⁿ (eᵢ − eᵢ₋₁)² / Σᵢ₌₁ⁿ eᵢ²
The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values (d < 2) indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d above 3 or below 1 are alarming.
Residual Analysis for Independence (Uncorrelated Errors)
[Residual-vs-Ŷ plots: systematic patterns indicate dependence; a random scatter indicates independence]
Assumption of Normality
When we use the F test or t test, we assume that ε₁, ε₂, …, ε_n are normally distributed.
This assumption can be examined by a histogram of the residuals.
[Histograms of residuals: not normal vs. normal]
Normality can also be examined using a Q-Q plot or normal probability plot.
[Q-Q plots: not normal vs. normal]
Standardized Regression Coefficients
In a multiple linear regression, we may like to know which regressor contributes more.
We obtain standardized estimates of the regression coefficients.
For that, we first standardize the observations using the sample means and standard deviations:
Ȳ = (1/n) Σᵢ Yᵢ,   s_Y² = (1/(n−1)) Σᵢ (Yᵢ − Ȳ)²
X̄₁ = (1/n) Σᵢ X₁ᵢ,  s_X₁² = (1/(n−1)) Σᵢ (X₁ᵢ − X̄₁)²
X̄₂ = (1/n) Σᵢ X₂ᵢ,  s_X₂² = (1/(n−1)) Σᵢ (X₂ᵢ − X̄₂)²
Standardize all Y, X₁ and X₂ values as follows:
Standardized Yᵢ = (Yᵢ − Ȳ)/s_Y
Standardized X₁ᵢ = (X₁ᵢ − X̄₁)/s_X₁
Standardized X₂ᵢ = (X₂ᵢ − X̄₂)/s_X₂
Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless (unit-free) and can be compared.
Look for the regression coefficient having the highest magnitude: the corresponding regressor contributes the most.
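A sketch of the standardize-and-refit procedure in Python for the pie-sales data; since the standardized variables have mean zero, no intercept column is needed:

```python
# Standardized regression coefficients for the pie-sales model.
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                  3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

def z(v):
    """Standardize with the sample standard deviation (n - 1 divisor)."""
    return (v - v.mean()) / v.std(ddof=1)

Zx = np.column_stack([z(price), z(adv)])            # standardized regressors
beta, *_ = np.linalg.lstsq(Zx, z(sales), rcond=None)
print(beta.round(3))  # the slide reports about -0.461 and 0.570
```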
Standardized Data

Week  Pie Sales  Price  Advertising
 1    −0.78      −0.95  −0.37
 2     0.96       0.76  −0.37
 3    −0.78       1.18  −0.98
 4     0.48       1.18   2.09
 5    −0.78       0.16  −0.98
 6    −0.30       0.76   1.06
 7     0.48      −1.80  −0.98
 8     1.11      −0.18   0.45
 9     0.80       0.33   0.04
10     1.43      −1.38   1.06
11    −0.93       0.50   0.04
12    −1.56       1.10  −0.57
13     0.64      −0.61   1.06
14     0.80      −1.38   0.04
15    −1.56       0.33  −1.60

Fitted on the standardized data: Ŷ = 0 − 0.461 X₁ + 0.570 X₂
Since |−0.461| < |0.570|, X₂ contributes the most.
Note that:
Adjusted R² can be negative.
Adjusted R² is always less than or equal to R²:
Adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1)
The overall F statistic can be written in terms of R²:
F_c = [(n − k − 1)/k] · R²/(1 − R²)
Inclusion of an intercept term is not necessary; it depends on the problem, and the analyst may decide on this.
Example: The following data were collected on sales, the number of advertisements published, and advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

Sales (0,000 Rs)  Ads (Nos.)  Adv Ex (000 Rs)
43.6              12          13.9
38.0              11          12.0
30.1               9           9.3
35.3               7           9.7
46.4              12          12.3
34.2               8          11.4
30.2               6           9.3
40.7              13          14.3
38.5               8          10.2
22.6               6           8.4
37.6               8          11.2
35.2              10          11.1
ANOVA (b)
Model        Sum of Squares  df  Mean Square  F      Sig.
Regression   309.986          2  154.993      9.741  .006 (a)
Residual     143.201          9   15.911
Total        453.187         11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model        B      Std. Error  Beta   t      Sig.
(Constant)   6.584  8.542              .771   .461
No_Adv        .625  1.120       .234   .558   .591
Ex_Adv       2.139  1.470       .611  1.455   .180
a. Dependent Variable: Sales

The F test p-value < 0.05, so H₀ is rejected: not all β's are zero.
But all t-test p-values > 0.05, so no individual H₀ is rejected: β₀ = 0, β₁ = 0, β₂ = 0.
CONTRADICTION
Multicollinearity
We assume that the regressors are independent variables.
When we regress Y on regressors X₁, X₂, …, X_k, we assume that all regressors are statistically independent of each other.
All the regressors affect the values of Y, but one regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met, and we face the problem of multicollinearity.
Correlated regressors contribute redundant information to the model.
Including two highly correlated independent variables can adversely affect the regression results and can lead to unstable coefficients.
Some indications of strong multicollinearity:
Coefficient signs may not match prior expectations.
A large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes insignificant when a new independent variable is added.
The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
A coefficient has a large standard error even though the corresponding regressor is still significant.
MSE is very high and/or R² is very small.
Examples in which this might happen:
Miles per gallon vs. horsepower and engine size
Income vs. age and experience
Sales vs. number of advertisements and advertising expenditure
Variance Inflationary Factor:
VIFⱼ is used to measure the multicollinearity generated by variable Xⱼ. It is given by
VIFⱼ = 1 / (1 − Rⱼ²),
where Rⱼ² is the coefficient of determination of a regression model that uses Xⱼ as the dependent variable and all other X variables as the independent variables.
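A sketch of the VIF computation in Python, following the definition above (regress Xⱼ on the remaining regressors and form 1/(1 − Rⱼ²)):

```python
# Variance inflationary factor for one column of a regressor matrix.
import numpy as np

def vif(Xcols, j):
    """Xcols: 2-D array of regressors (no intercept column); j: column index."""
    y = Xcols[:, j]                          # treat X_j as the response
    others = np.delete(Xcols, j, axis=1)     # the remaining regressors
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2_j = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2_j)

# e.g. vif(np.column_stack([price, adv]), 0) for the pie-sales regressors
```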
If VIFⱼ > 5, Xⱼ is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X′X is singular.
The matrix X′X becomes singular when the columns of X have exact linear dependence, i.e., when some eigenvalue of X′X is zero.
Thus, a near-zero eigenvalue is also an indication of multicollinearity.
Methods of dealing with multicollinearity:
Collecting additional data
Variable elimination
We may use the method of variable elimination.
In practice, if Corr(X₁, X₂) is more than 0.7 or less than −0.7, we eliminate one of them.
Techniques:
Stepwise (based on ANOVA)
Forward Inclusion (based on correlation)
Backward Elimination (based on correlation)
Stepwise Regression
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅ + ε
Step 1: Run 5 simple linear regressions:
Y = β₀ + β₁X₁
Y = β₀ + β₂X₂
Y = β₀ + β₃X₃
Y = β₀ + β₄X₄
Y = β₀ + β₅X₅
Suppose X₄ gives the best model and is retained.
Step 2: Run 4 two-variable linear regressions:
Y = β₀ + β₄X₄ + β₁X₁
Y = β₀ + β₄X₄ + β₂X₂
Y = β₀ + β₄X₄ + β₃X₃
Y = β₀ + β₄X₄ + β₅X₅
Step 3: Suppose X₃ was the best addition in Step 2; run 3 three-variable linear regressions:
Y = β₀ + β₃X₃ + β₄X₄ + β₁X₁
Y = β₀ + β₃X₃ + β₄X₄ + β₂X₂
Y = β₀ + β₃X₃ + β₄X₄ + β₅X₅
Suppose none of these models have p-values < 0.05.
STOP: the best model is the one with X₃ and X₄ only. A code sketch of this idea follows.
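A simplified forward-selection sketch in the spirit of the stepwise procedure above; here adjusted R² is used as the add/stop criterion instead of the slides' p-values, which is an assumption of this sketch:

```python
# Forward selection: repeatedly add the regressor that most improves adjusted R².
import numpy as np

def adj_r2(X, y):
    """Adjusted R² of an intercept-plus-X least squares fit."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    n, k = A.shape[0], A.shape[1] - 1
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

def forward_select(X, y):
    """Return the chosen column indices and the final adjusted R²."""
    chosen, remaining, best = [], list(range(X.shape[1])), -np.inf
    while remaining:
        score, j = max((adj_r2(X[:, chosen + [j]], y), j) for j in remaining)
        if score <= best:          # no candidate improves the fit: stop
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return chosen, best
```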
Example (continued): the sales, number of advertisements, and advertising expenditure data for the 12 weeks given earlier are used to fit a regression model to predict the sales.
Summary Output 1: Sales vs. No_Adv

Model Summary
R      R Square  Adjusted R Square  Std. Error of the Estimate
.781   .610      .571               4.20570
a. Predictors: (Constant), No_Adv

ANOVA (b)
Model        Sum of Squares  df  Mean Square  F       Sig.
Regression   276.308          1  276.308      15.621  .003 (a)
Residual     176.879         10   17.688
Total        453.187         11
a. Predictors: (Constant), No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model        B       Std. Error  Beta   t      Sig.
(Constant)   16.937  4.982              3.400  .007
No_Adv        2.083   .527       .781   3.952  .003
a. Dependent Variable: Sales
Summary Output 2: Sales vs. Ex_Adv

Model Summary
R      R Square  Adjusted R Square  Std. Error of the Estimate
.820   .673      .640               3.84900
a. Predictors: (Constant), Ex_Adv

ANOVA (b)
Model        Sum of Squares  df  Mean Square  F       Sig.
Regression   305.039          1  305.039      20.590  .001 (a)
Residual     148.148         10   14.815
Total        453.187         11
a. Predictors: (Constant), Ex_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model        B      Std. Error  Beta   t      Sig.
(Constant)   4.173  7.109              .587   .570
Ex_Adv       2.872   .633       .820   4.538  .001
a. Dependent Variable: Sales
Summary Output 3: Sales vs. No_Adv & Ex_Adv

Model Summary
R      R Square  Adjusted R Square  Std. Error of the Estimate
.827   .684      .614               3.98888
a. Predictors: (Constant), Ex_Adv, No_Adv

ANOVA (b)
Model        Sum of Squares  df  Mean Square  F      Sig.
Regression   309.986          2  154.993      9.741  .006 (a)
Residual     143.201          9   15.911
Total        453.187         11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model        B      Std. Error  Beta   t      Sig.
(Constant)   6.584  8.542              .771   .461
No_Adv        .625  1.120       .234   .558   .591
Ex_Adv       2.139  1.470       .611  1.455   .180
a. Dependent Variable: Sales
Qualitative Independent Variables
Johnson Filtration, Inc., provides maintenance service for water filtration systems throughout southern Florida.
To estimate the service time and the service cost, the managers want to predict the repair time necessary for each maintenance request.
Repair time is believed to be related to two factors:
Number of months since the last maintenance service
Type of repair problem (mechanical or electrical)
Data for a sample of 10 service calls are given:

Service Call  Months Since Last Service  Type of Repair  Repair Time in Hours
 1            2                          electrical      2.9
 2            6                          mechanical      3.0
 3            8                          electrical      4.8
 4            3                          mechanical      1.8
 5            2                          electrical      2.9
 6            7                          electrical      4.9
 7            9                          mechanical      4.2
 8            8                          mechanical      4.8
 9            4                          electrical      4.4
10            6                          electrical      4.5

Let Y denote the repair time and X₁ the number of months since the last maintenance service.
The regression model that uses X₁ only to regress Y is
Y = β₀ + β₁X₁ + ε
Using the least squares method, we fitted the model
Ŷ = 2.1473 + 0.3041 X₁,  R² = 0.534
At the 5% level of significance, we reject
H₀: β₀ = 0 (using the t test)
H₀: β₁ = 0 (using the t and F tests)
X₁ alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable:
X₂ = 1 if the type of repair is electrical, 0 if the type of repair is mechanical.
The regression model that uses X₁ and X₂ to regress Y is
Y = β₀ + β₁X₁ + β₂X₂ + ε
Is the new model improved?
Summary
Multiple linear regression model: Y = Xβ + ε
The least squares estimate of β is given by b = (X′X)⁻¹X′Y
R² and adjusted R²
Using ANOVA (F test), we examine whether all β's are zero or not.
A t test is conducted for each regressor separately; using the t test, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality