Linear Regression Analysis
Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R² and Adjusted R²
Overall Validity of the Model (F-test)
Testing for Individual Regressors (t-test)
Problem of Multicollinearity
Smoking and Lung Capacity
Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity.
We might ask a group of people about their smoking habits, and measure their lung capacities.

Cigarettes (X)   Lung Capacity (Y)
 0               45
 5               42
10               33
15               31
20               29
Scatter Plot of the Data
We can see that as smoking goes up, lung capacity tends to go down: the two variables change in opposite directions.
[Scatter plot: Cigarettes (X) vs. Lung Capacity (Y)]
Height and Weight
Consider the following data on the heights and weights of 5 women swimmers:

Height (inches): 62  64  65  66  68
Weight (pounds): 102 108 115 128 132

We can observe that weight increases with height.
[Scatter plot: Height (X) vs. Weight (Y)]
Sometimes two variables are related to each other.
The values of both variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population. For example:
Height – Weight
Advertising Expenditure – Sales Volume
Unemployment – Crime Rate
Rainfall – Food Production
Expenditure – Savings
We have already studied one measure of relationship between two variables: Covariance.
The covariance between two random variables X and Y is given by
Cov(X, Y) = E(XY) − E(X)E(Y)
For n paired observations on variables X and Y,
Cov(X, Y) = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
Properties of Covariance:
Cov(X + a, Y + b) = Cov(X, Y) [not affected by change in location]
Cov(aX, bY) = ab Cov(X, Y) [affected by change in scale]
Covariance can take any value from −∞ to +∞.
Cov(X, Y) > 0 means X and Y change in the same direction.
Cov(X, Y) < 0 means X and Y change in opposite directions.
If X and Y are independent, Cov(X, Y) = 0 [the converse may not be true].
Covariance is not unit free, so it is not a good measure of the relationship between two variables.
A better measure is the correlation coefficient. It is unit free and takes values in [−1, +1].
Correlation
Karl Pearson's correlation coefficient is given by
r_XY = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
When the joint distribution of X and Y is known:
Cov(X, Y) = E(XY) − E(X)E(Y)
Var(X) = E(X²) − [E(X)]²,  Var(Y) = E(Y²) − [E(Y)]²
When n paired observations on X and Y are available:
Cov(X, Y) = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
Var(X) = (1/n) Σᵢ (xᵢ − x̄)²,  Var(Y) = (1/n) Σᵢ (yᵢ − ȳ)²
Properties of the Correlation Coefficient
Corr(aX + b, cY + d) = Corr(X, Y) when a and c have the same sign; it is unit free.
It measures the strength of the relationship on a scale of −1 to +1.
So, it can be used to compare the relationships of various pairs of variables.
Values close to 0 indicate little or no correlation.
Values close to +1 indicate a very strong positive correlation.
Values close to −1 indicate a very strong negative correlation.
Scatter Diagram
[Scatter diagrams: positively correlated, negatively correlated, weakly correlated, strongly correlated, not correlated]
The correlation coefficient measures the strength of a linear relationship.
r = 0 does not necessarily imply that there is no relationship.
A relationship may be there, but it is not a linear one.
[Scatter plots: nonlinear patterns for which r ≈ 0]
x      y    (x−x̄)   (y−ȳ)   (x−x̄)²   (y−ȳ)²   (x−x̄)(y−ȳ)
1.25  125   −0.90    45     0.8100    2025     −40.50
1.75  105   −0.40    25     0.1600     625     −10.00
2.25   65    0.10   −15     0.0100     225      −1.50
2.00   85   −0.15     5     0.0225      25      −0.75
2.50   75    0.35    −5     0.1225      25      −1.75
2.25   80    0.10     0     0.0100       0       0.00
2.70   50    0.55   −30     0.3025     900     −16.50
2.50   55    0.35   −25     0.1225     625      −8.75
17.20 640    0        0     1.560      4450    −79.75
                          (= SSX)    (= SSY)  (= SSXY)

r = Cov(X, Y)/√(Var(X) Var(Y)) = SSXY/√(SSX · SSY) = −79.75/√(1.56 × 4450) = −0.957
Alternative Formulas for the Sums of Squares
SSX = Σx² − (Σx)²/n,  SSY = Σy² − (Σy)²/n,  SSXY = Σxy − (Σx)(Σy)/n

x      y     x²       y²      x·y
1.25  125   1.5625   15625   156.25
1.75  105   3.0625   11025   183.75
2.25   65   5.0625    4225   146.25
2.00   85   4.0000    7225   170.00
2.50   75   6.2500    5625   187.50
2.25   80   5.0625    6400   180.00
2.70   50   7.2900    2500   135.00
2.50   55   6.2500    3025   137.50
17.20 640  38.54     55650  1296.25

SSX = 38.54 − 17.20²/8 = 1.56
SSY = 55650 − 640²/8 = 4450
SSXY = 1296.25 − (17.20)(640)/8 = −79.75

r = Cov(X, Y)/√(Var(X) Var(Y)) = SSXY/√(SSX · SSY) = −79.75/√(1.56 × 4450) = −0.957
Smoking and Lung Capacity Example

Cigarettes (X)  Lung Capacity (Y)   X²    XY    Y²
  0             45                   0      0   2025
  5             42                  25    210   1764
 10             33                 100    330   1089
 15             31                 225    465    961
 20             29                 400    580    841
 50            180                 750   1585   6680
r_xy = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
     = [(5)(1585) − (50)(180)] / √{[(5)(750) − 50²][(5)(6680) − 180²]}
     = (7925 − 9000) / √{(3750 − 2500)(33400 − 32400)}
     = −1075 / √{(1250)(1000)}
     = −0.9615
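As a quick numeric check, the same computation can be scripted; a minimal sketch in Python, with the data hard-coded from the table above:

```python
# Check of the smoking / lung-capacity correlation (data from the slides).
import math

x = [0, 5, 10, 15, 20]        # cigarettes
y = [45, 42, 33, 31, 29]      # lung capacity
n = len(x)

sxy = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
sxx = n * sum(a * a for a in x) - sum(x) ** 2
syy = n * sum(b * b for b in y) - sum(y) ** 2

r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # -0.9615, matching the slide
```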
Regression Analysis
Having determined the correlation between X and Y, we wish to determine a mathematical relationship between them.
Dependent variable: the variable you wish to explain.
Independent variables: the variables used to explain the dependent variable.
Regression analysis is used to:
Predict the value of the dependent variable based on the value of the independent variable(s).
Explain the impact of changes in an independent variable on the dependent variable.
Types of Relationships
[Scatter plots: linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship]
Simple Linear Regression Analysis
The simplest mathematical relationship is linear:
Y = a + bX + error
Changes in Y are related to changes in X.
What are the most suitable values of a (intercept) and b (slope)?
[Line y = a + bx, with intercept a and slope b]
Method of Least Squares
The best fitted line is the one for which all the ERRORS are minimum.
[Scatter of points (xᵢ, yᵢ) around the line ŷ = a + bx; the vertical deviations are the errors]
We want to fit a line for which all the errors are minimum.
That is, we want to obtain values of a and b in Y = a + bX + error for which all the errors are minimum.
To minimize all the errors together, we minimize the sum of squares of errors (SSE):
SSE = Σᵢ₌₁ⁿ (Yᵢ − a − bXᵢ)²
To get the values of a and b which minimize SSE, we proceed as follows:
∂SSE/∂a = 0 ⇒ −2 Σᵢ (Yᵢ − a − bXᵢ) = 0 ⇒ Σᵢ Yᵢ = na + b Σᵢ Xᵢ   …(1)
∂SSE/∂b = 0 ⇒ −2 Σᵢ Xᵢ(Yᵢ − a − bXᵢ) = 0 ⇒ Σᵢ XᵢYᵢ = a Σᵢ Xᵢ + b Σᵢ Xᵢ²   …(2)
Eq. (1) and (2) are called normal equations.
Solve the normal equations to get a and b.
Solving the above normal equations,
Σᵢ Yᵢ = na + b Σᵢ Xᵢ
Σᵢ XᵢYᵢ = a Σᵢ Xᵢ + b Σᵢ Xᵢ²
we get
b = [n Σᵢ XᵢYᵢ − (Σᵢ Xᵢ)(Σᵢ Yᵢ)] / [n Σᵢ Xᵢ² − (Σᵢ Xᵢ)²] = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ (Xᵢ − X̄)² = SSXY/SSX
a = Ȳ − bX̄
The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b. Thus, the LSE of a and b are given by
b = SSXY/SSX,  a = Ȳ − bX̄
Also, the correlation coefficient between X and Y is
r_XY = Cov(X, Y)/√(Var(X) Var(Y)) = SSXY/√(SSX · SSY) = (SSXY/SSX) √(SSX/SSY) = b √(SSX/SSY)
Using the deviations table computed earlier for these data:
SSX = 1.560, SSY = 4450, SSXY = −79.75, X̄ = 2.15, Ȳ = 80.
r = SSXY/√(SSX · SSY) = −0.957
b = SSXY/SSX = −51.12
a = Ȳ − bX̄ = 189.91
Fitted line is Ŷ = 189.91 − 51.12X
[Scatter plot of the data with the fitted line]
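The fit above can be reproduced in a few lines; a minimal sketch in Python using b = SSXY/SSX and a = Ȳ − bX̄, with the x and y data hard-coded from the earlier table:

```python
# Least squares fit for the x-y data above (values from the slides).
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n            # 2.15 and 80

ssx = sum((xi - xbar) ** 2 for xi in x)        # 1.56
ssxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # -79.75

b = ssxy / ssx            # slope, about -51.12
a = ybar - b * xbar       # intercept, about 189.91
print(round(a, 2), round(b, 2))
print(round(a + b * 2.15, 2))  # predicted Y at X = 2.15, about 80.0
```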
Fitted line is Ŷ = 189.91 − 51.12X
189.91 is the estimated mean value of Y when the value of X is zero.
−51.12 is the change in the average value of Y as a result of a one-unit change in X.
We can predict the value of Y for some given value of X.
For example, at X = 2.15 the predicted value of Y is
189.91 − 51.12 × 2.15 = 80.002
Residuals: eᵢ = Yᵢ − Ŷᵢ
A residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of the residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model for the given data.
Coefficient of Determination
Total Sum of Squares:      SST = Σᵢ (Yᵢ − Ȳ)²
Regression Sum of Squares: SSR = Σᵢ (Ŷᵢ − Ȳ)²
Error Sum of Squares:      SSE = Σᵢ (Yᵢ − Ŷᵢ)²
Also, SST = SSR + SSE
The fraction of SST explained by the regression is given by R²:
R² = SSR/SST = 1 − (SSE/SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² is close to 1.
This means that the regression explains most of the variability in Y. (Fit is good.)
When SSE is close to SST, R² is close to 0.
This means that the regression does not explain much variability in Y. (Fit is not good.)
R² is the square of the correlation coefficient between X and Y. (Proof omitted.)
r = ±1, R² = 1: perfect linear relationship; 100% of the variation in Y is explained by X.
0 < R² < 1: weaker linear relationship; some but not all of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.
Coefficient of Determination: R² = (4450 − 370.54)/4450 ≈ 0.916
Correlation Coefficient: r = −0.957
Coefficient of Determination = (Correlation Coefficient)²
X     Y    Ŷ      (Y−Ȳ)  (Y−Ŷ)  (Ŷ−Ȳ)  (Y−Ȳ)²  (Y−Ŷ)²   (Ŷ−Ȳ)²
1.25  125  126.0   45    −1.0    46.0   2025     1.00    2116.00
1.75  105  100.5   25     4.5    20.5    625    20.25     420.25
2.25   65   74.9  −15    −9.9    −5.1    225    98.00      26.01
2.00   85   87.7    5    −2.2     7.7     25     4.84      59.29
2.50   75   62.1   −5    12.9   −17.7     25   166.41     313.29
2.25   80   74.9    0     5.1    −5.1      0    26.01      26.01
2.70   50   51.9  −30    −1.9   −28.1    900     3.61     789.61
2.50   55   62.1  −25    −7.1   −17.9    625    50.41     320.41
17.20 640                               4450   370.54    4079.46
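A sketch of the same R² computation in Python, reusing the rounded coefficients a = 189.91 and b = −51.12 from the earlier slide (small rounding differences from the table are expected):

```python
# Residuals, SSE, and R² for the fitted line above.
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
a, b = 189.91, -51.12
ybar = sum(y) / len(y)

yhat = [a + b * xi for xi in x]                            # predicted values
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))       # error sum of squares
sst = sum((yi - ybar) ** 2 for yi in y)                    # total sum of squares

r2 = 1 - sse / sst
print(round(r2, 3))                  # about 0.916
s_yx = (sse / (len(y) - 2)) ** 0.5   # standard error of the estimate (introduced later)
```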
Example:
Watching television reduces the amount of physical exercise, causing weight gain.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded. These data are listed here.
Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.
Y = −24.709 + 0.967 X and R² = 0.768
TV 42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight 18 6 0 -1 13 14 7 7 -9 8 8 5 3 14 -7
[Plot: observed Y and predicted Y for the 15 children]
Standard Error
Consider a dataset. All the observations cannot be exactly the same as the arithmetic mean (AM).
The variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
The variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
It is given by
2
)(
2
1
2
n
YY
n
SSES
n
i
ii
YX
Assumptions
The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (homoscedasticity):
Var(eᵢ) = σ², with E(eᵢ) = 0, where eᵢ = Yᵢ − Ŷᵢ.
No distributional assumption about the errors is required for the least squares method.
Independence
[Residual-vs-X plots: a patterned plot indicates dependence; a patternless plot indicates independence]
Equal Variance
[Residual plots: unequal variance (heteroscedastic) vs. equal variance (homoscedastic)]
TV Watching – Weight Gain Example
[Scatter plot of X and Y; scatter plot of X and residuals]
The Multiple Linear Regression Model
In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
We assume that Y is regressed on only one regressor variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X₁, X₂, X₃, …).
For example:
Cost -> Labor cost, Electricity cost, Raw material cost
Salary -> Education, Experience
Sales -> Cost, Advertising Expenditure
Example:
A distributor of frozen dessert pies wants to evaluate the factors which influence demand.
Dependent variable:
Y: Pie sales (units per week)
Independent variables:
X₁: Price (in $)
X₂: Advertising expenditure (in $100s)
Data are collected for 15 weeks.
Week  Pie Sales  Price ($)  Advertising ($100s)
 1    350        5.50       3.3
 2    460        7.50       3.3
 3    350        8.00       3.0
 4    430        8.00       4.5
 5    350        6.80       3.0
 6    380        7.50       4.0
 7    430        4.50       3.0
 8    470        6.40       3.7
 9    450        7.00       3.5
10    490        5.00       4.0
11    340        7.20       3.5
12    300        7.90       3.2
13    440        5.90       4.0
14    450        5.00       3.5
15    300        7.00       2.7
Using the given data, we wish to fit a linear function of the form
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ,  i = 1, 2, …, 15,
where
Y: Pie sales (units per week)
X₁: Price (in $)
X₂: Advertising expenditure (in $100s)
Fitting means we want to obtain the values of the regression coefficients, denoted by β.
The original values of the β's are not known.
We estimate them using the given data.
The Multiple Linear Regression Model
Examine the linear relationship between one dependent variable (Y) and two or more independent variables (X₁, X₂, …, X_k).
Multiple linear regression model with k independent variables:
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + … + β_k X_kᵢ + εᵢ,  i = 1, 2, …, n,
where β₀ is the intercept, β₁, …, β_k are the slopes, and εᵢ is the random error.
Multiple Linear Regression Equation
The intercept and slopes are estimated using the observed data.
Multiple linear regression equation with k independent variables:
Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + … + b_k X_kᵢ,  i = 1, 2, …, n,
where Ŷᵢ is the estimated value, b₀ is the estimate of the intercept, and b₁, …, b_k are the estimates of the slopes.
Multiple Regression Equation
Example with two independent variables:
Ŷ = b₀ + b₁X₁ + b₂X₂
[3-D sketch: the fitted plane over the (X₁, X₂) space]
Estimating Regression Coefficients
The multiple linear regression model
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + … + β_k X_kᵢ + εᵢ,  i = 1, 2, …, n,
in matrix notation is
Y = Xβ + ε,
where Y = (Y₁, …, Y_n)′ is the n×1 response vector, X is the n×(k+1) design matrix whose i-th row is (1, X₁ᵢ, X₂ᵢ, …, X_kᵢ), β = (β₀, β₁, …, β_k)′, and ε = (ε₁, …, ε_n)′.
Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (homoscedasticity): Var(εᵢ) = σ².
In the long run, the mean effect of the random errors is zero: E(εᵢ) = 0.
No assumption on the distribution of the random errors is required for the least squares method.
In order to find the estimate of β, we minimize
S(β) = Σᵢ εᵢ² = ε′ε = (Y − Xβ)′(Y − Xβ) = Y′Y − 2β′X′Y + β′X′Xβ.
We differentiate S(β) with respect to β and equate to zero, i.e.,
∂S/∂β = 0.
This gives
b = (X′X)⁻¹X′Y.
b is called the least squares estimator of β.
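A minimal NumPy sketch of b = (X′X)⁻¹X′Y for the pie-sales data tabulated earlier (the variable names are illustrative):

```python
# Least squares estimate via the normal equations, pie-sales data from the slides.
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                  3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

X = np.column_stack([np.ones(len(sales)), price, adv])  # design matrix with intercept
b = np.linalg.solve(X.T @ X, X.T @ sales)               # solves (X'X) b = X'Y
print(b.round(2))  # roughly [306.53, -24.98, 74.13], as on the next slide
```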
Example: Consider the pie example.
We want to fit the model
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ.
The variables are
Y: Pie sales (units per week)
X₁: Price (in $)
X₂: Advertising expenditure (in $100s)
Using the matrix formula, the least squares estimates (LSE) of the β's are obtained as below:
LSE of intercept β₀: b₀ = 306.53
LSE of slope β₁ (Price): b₁ = −24.98
LSE of slope β₂ (Advertising): b₂ = 74.13
Pie Sales = 306.53 − 24.98 Price + 74.13 Adv. Expend.
Sales = 306.53 − 24.98(X₁) + 74.13(X₂)
b₁ = −24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
b₂ = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while selling price is kept fixed.
Prediction:
Predict sales for a week in which
the selling price is $5.50 and
the advertising expenditure is $350.
Note that advertising is in $100s, so X₂ = 3.5.
Sales = 306.53 − 24.98 X₁ + 74.13 X₂
      = 306.53 − 24.98 (5.50) + 74.13 (3.5)
      = 428.62
Predicted sales is 428.62 pies.
Fitted model: Ŷ = 306.526 − 24.975 X₁ + 74.131 X₂

Y    X₁   X₂   Predicted Y  Residual
350  5.5  3.3  413.77       −63.80
460  7.5  3.3  363.81        96.15
350  8.0  3.0  329.08        20.88
430  8.0  4.5  440.28       −10.31
350  6.8  3.0  359.06        −9.09
380  7.5  4.0  415.70       −35.74
430  4.5  3.0  416.51        13.47
470  6.4  3.7  420.94        49.03
450  7.0  3.5  391.13        58.84
490  5.0  4.0  478.15        11.83
340  7.2  3.5  386.13       −46.16
300  7.9  3.2  346.40       −46.44
440  5.9  4.0  455.67       −15.70
450  5.0  3.5  441.09         8.89
300  7.0  2.7  331.82       −31.85
[Plot: observed Y and predicted Y for the 15 weeks]
Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression:
R² = SSR/SST = 1 − (SSE/SST)
R² is the proportion of variation in Y explained by the regression.
Total Sum of Squares:      SST = Σᵢ (Yᵢ − Ȳ)²
Regression Sum of Squares: SSR = Σᵢ (Ŷᵢ − Ȳ)²
Error Sum of Squares:      SSE = Σᵢ (Yᵢ − Ŷᵢ)²
Also, SST = SSR + SSE
Since SST = SSR + SSE, and all three quantities are non-negative,
0 ≤ SSR ≤ SST,
so 0 ≤ SSR/SST ≤ 1, i.e., 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good, and the X variables do not contribute to explaining the variability in Y.
When R² is close to 1, the linear fit is good.
In the previously discussed example, R² = 0.5215.
If we consider Y and X₁ only, R² = 0.1965.
If we consider Y and X₂ only, R² = 0.3095.
Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase occurs regardless of the contribution of the newly added regressor.
So, an adjusted value of R² is defined, called adjusted R²:
Adj R² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
This adjusted R² will only increase if the additional variable contributes to explaining the variation in Y.
For our example, Adjusted R² = 0.4417.
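A numeric check of R² and adjusted R², assuming the SSE and SST values reported for this model in the ANOVA table a few slides below:

```python
# R² and adjusted R² for the pie-sales model (SSE, SST from the ANOVA table).
sse, sst, n, k = 27033.31, 56493.33, 15, 2
r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(round(r2, 4), round(adj_r2, 4))  # 0.5215 and 0.4417
```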
F-Test for Overall Significance
We check if there is a linear relationship between all of the regressors (X₁, X₂, …, X_k) and the response (Y).
We use the F test statistic to test:
H₀: β₁ = β₂ = … = β_k = 0 (no regressor is significant)
H₁: at least one βᵢ ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
Assumptions:
n > k, Var(εᵢ) = σ², E(εᵢ) = 0.
The εᵢ's are independent. This implies Corr(εᵢ, εⱼ) = 0 for i ≠ j.
The εᵢ's have a normal distribution: εᵢ ~ N(0, σ²). [NEW ASSUMPTION]
The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE):
SST = Σᵢ (Yᵢ − Ȳ)²
SSE = Σᵢ eᵢ² = Σᵢ (Yᵢ − Ŷᵢ)²
SSR = SST − SSE
where the eᵢ's are called the residuals.
Analysis of Variance Table

Source             df      SS    MS                  F_c
Regression         k       SSR   MSR = SSR/k         MSR/MSE
Residual (Error)   n−k−1   SSE   MSE = SSE/(n−k−1)
Total              n−1     SST

Test statistic: F_c = MSR/MSE ~ F(k, n−k−1)
For the previous example, we wish to test H₀: β₁ = β₂ = 0 against H₁: at least one βᵢ ≠ 0.

Source             df   SS        MS        F       F(2,12)(0.05)
Regression          2   29460.03  14730.01  6.5386  3.89
Residual (Error)   12   27033.31   2252.78
Total              14   56493.33

Thus H₀ is rejected at the 5% level of significance.
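A sketch of this F test in Python; SciPy's F distribution supplies the 5% critical value:

```python
# Overall F test for the pie-sales model (sums of squares from the ANOVA table).
from scipy import stats

ssr, sse, n, k = 29460.03, 27033.31, 15, 2
msr, mse = ssr / k, sse / (n - k - 1)
F = msr / mse
crit = stats.f.ppf(0.95, k, n - k - 1)   # 5% critical value of F(2, 12)
print(round(F, 4), round(crit, 2))       # about 6.5386 and 3.89
# F > crit, so H0 (all slopes zero) is rejected at the 5% level.
```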
Individual Variables: Tests of Hypothesis
We test if there is a linear relationship between a particular regressor Xⱼ and Y.
Hypotheses:
H₀: βⱼ = 0 (no linear relationship)
H₁: βⱼ ≠ 0 (a linear relationship exists between Xⱼ and Y)
We use a two-tailed t-test.
If H₀: βⱼ = 0 is accepted, this indicates that the variable Xⱼ can be deleted from the model.
Test statistic:
T_c = bⱼ / √(σ̂² Cⱼⱼ),  σ̂² = MSE,
where T_c ~ Student's t with (n − k − 1) degrees of freedom,
bⱼ is the least squares estimate of βⱼ, and
Cⱼⱼ is the (j, j)-th element of the matrix (X′X)⁻¹.
(MSE is obtained from the ANOVA table.)
In our example, σ̂² = MSE = 2252.7755, and the Cⱼⱼ are the diagonal elements of (X′X)⁻¹ computed from the data.
To test H₀: β₁ = 0 against H₁: β₁ ≠ 0: T_c = −2.3057.
To test H₀: β₂ = 0 against H₁: β₂ ≠ 0: T_c = 2.8548.
Two-tailed critical values of t at 12 d.f. are
3.0545 for the 1% level of significance,
2.6810 for the 2% level of significance,
2.1788 for the 5% level of significance.
Since |T_c| > 2.1788 in both cases, each slope is significantly different from zero at the 5% level.
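These t statistics can be recomputed from the raw data; a sketch with NumPy (price, adv, sales as in the earlier sketch; the printed values should be reproduced up to rounding):

```python
# t statistics T_c = b_j / sqrt(MSE * C_jj) for the pie-sales model.
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                  3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

X = np.column_stack([np.ones(len(sales)), price, adv])
C = np.linalg.inv(X.T @ X)                  # (X'X)^(-1)
b = C @ X.T @ sales                         # least squares estimates
resid = sales - X @ b
mse = resid @ resid / (len(sales) - 2 - 1)  # SSE / (n - k - 1), here df = 12
t = b / np.sqrt(mse * np.diag(C))
print(t.round(4))  # slope t values should come out near -2.3057 and 2.8548
```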
Standard Error
Consider a dataset. All the observations cannot be exactly the same as the arithmetic mean (AM).
The variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
The variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
For a model with k regressors it is given by
S_YX = √( SSE/(n − k − 1) ) = √( Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² / (n − k − 1) )
Assumption of Linearity
[Residual plots: a curved residual pattern indicates a non-linear relationship; a patternless band indicates linearity]
Assumption of Equal Variance
We assume that Var(εᵢ) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷᵢ against the residuals eᵢ = Yᵢ − Ŷᵢ.
Residual Analysis for Equal Variance
[Residual-vs-Ŷ plots: a funnel shape indicates unequal variance; a uniform band indicates equal variance]
Assumption of Uncorrelated Residuals
The Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation. It is given by
d = Σᵢ₌₂ⁿ (eᵢ − eᵢ₋₁)² / Σᵢ₌₁ⁿ eᵢ²
The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values (d < 2) indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d above 3 or below 1 are alarming.
Residual Analysis for Independence (Uncorrelated Errors)
[Residual-vs-Ŷ plots: systematic patterns indicate dependence; a random scatter indicates independence]
Assumption of Normality
When we use the F test or t test, we assume that ε₁, ε₂, …, ε_n are normally distributed.
This assumption can be examined by a histogram of the residuals.
[Histograms of residuals: not normal vs. normal]
Normality can also be examined using a Q-Q plot or normal probability plot.
[Q-Q plots: not normal vs. normal]
Standardized Regression Coefficients
In a multiple linear regression, we may like to know which regressor contributes more.
We obtain standardized estimates of the regression coefficients.
For that, we first standardize the observations using the sample means and standard deviations:
Ȳ = (1/n) Σᵢ Yᵢ,   s_Y² = (1/(n−1)) Σᵢ (Yᵢ − Ȳ)²
X̄₁ = (1/n) Σᵢ X₁ᵢ,  s_X₁² = (1/(n−1)) Σᵢ (X₁ᵢ − X̄₁)²
X̄₂ = (1/n) Σᵢ X₂ᵢ,  s_X₂² = (1/(n−1)) Σᵢ (X₂ᵢ − X̄₂)²
Standardize all Y, X₁ and X₂ values as follows:
Standardized Yᵢ = (Yᵢ − Ȳ)/s_Y
Standardized X₁ᵢ = (X₁ᵢ − X̄₁)/s_X₁
Standardized X₂ᵢ = (X₂ᵢ − X̄₂)/s_X₂
Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless (unit-free) and can be compared.
Look for the regression coefficient having the highest magnitude: the corresponding regressor contributes the most.
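A sketch of the standardize-and-refit procedure in Python for the pie-sales data; since the standardized variables have mean zero, no intercept column is needed:

```python
# Standardized regression coefficients for the pie-sales model.
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                  3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

def z(v):
    """Standardize with the sample standard deviation (n - 1 divisor)."""
    return (v - v.mean()) / v.std(ddof=1)

Zx = np.column_stack([z(price), z(adv)])            # standardized regressors
beta, *_ = np.linalg.lstsq(Zx, z(sales), rcond=None)
print(beta.round(3))  # the slide reports about -0.461 and 0.570
```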
Standardized Data

Week  Pie Sales  Price  Advertising
 1    −0.78      −0.95  −0.37
 2     0.96       0.76  −0.37
 3    −0.78       1.18  −0.98
 4     0.48       1.18   2.09
 5    −0.78       0.16  −0.98
 6    −0.30       0.76   1.06
 7     0.48      −1.80  −0.98
 8     1.11      −0.18   0.45
 9     0.80       0.33   0.04
10     1.43      −1.38   1.06
11    −0.93       0.50   0.04
12    −1.56       1.10  −0.57
13     0.64      −0.61   1.06
14     0.80      −1.38   0.04
15    −1.56       0.33  −1.60

Fitted on the standardized data: Ŷ = 0 − 0.461 X₁ + 0.570 X₂
Since |−0.461| < |0.570|, X₂ contributes the most.
Note that:
Adjusted R² can be negative.
Adjusted R² is always less than or equal to R²:
Adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1)
The overall F statistic can be written in terms of R²:
F_c = [(n − k − 1)/k] · R²/(1 − R²)
Inclusion of an intercept term is not necessary; it depends on the problem, and the analyst may decide on this.
Example: The following data were collected on sales, the number of advertisements published, and advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

Sales (0,000 Rs)  Ads (Nos.)  Adv Ex (000 Rs)
43.6              12          13.9
38.0              11          12.0
30.1               9           9.3
35.3               7           9.7
46.4              12          12.3
34.2               8          11.4
30.2               6           9.3
40.7              13          14.3
38.5               8          10.2
22.6               6           8.4
37.6               8          11.2
35.2              10          11.1
ANOVA (b)
Model        Sum of Squares  df  Mean Square  F      Sig.
Regression   309.986          2  154.993      9.741  .006 (a)
Residual     143.201          9   15.911
Total        453.187         11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model        B      Std. Error  Beta   t      Sig.
(Constant)   6.584  8.542              .771   .461
No_Adv        .625  1.120       .234   .558   .591
Ex_Adv       2.139  1.470       .611  1.455   .180
a. Dependent Variable: Sales

The F test p-value < 0.05, so H₀ is rejected: not all β's are zero.
But all t-test p-values > 0.05, so no individual H₀ is rejected: β₀ = 0, β₁ = 0, β₂ = 0.
CONTRADICTION
Multicollinearity
We assume that the regressors are independent variables.
When we regress Y on regressors X₁, X₂, …, X_k, we assume that all regressors are statistically independent of each other.
All the regressors affect the values of Y, but one regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met, and we face the problem of multicollinearity.
Correlated regressors contribute redundant information to the model.
Including two highly correlated independent variables can adversely affect the regression results and can lead to unstable coefficients.
Some indications of strong multicollinearity:
Coefficient signs may not match prior expectations.
A large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes insignificant when a new independent variable is added.
The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
A coefficient has a large standard error even though the corresponding regressor is still significant.
MSE is very high and/or R² is very small.
Examples in which this might happen:
Miles per gallon vs. horsepower and engine size
Income vs. age and experience
Sales vs. number of advertisements and advertising expenditure
Variance Inflationary Factor:
VIFⱼ is used to measure the multicollinearity generated by variable Xⱼ. It is given by
VIFⱼ = 1 / (1 − Rⱼ²),
where Rⱼ² is the coefficient of determination of a regression model that uses Xⱼ as the dependent variable and all other X variables as the independent variables.
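A sketch of the VIF computation in Python, following the definition above (regress Xⱼ on the remaining regressors and form 1/(1 − Rⱼ²)):

```python
# Variance inflationary factor for one column of a regressor matrix.
import numpy as np

def vif(Xcols, j):
    """Xcols: 2-D array of regressors (no intercept column); j: column index."""
    y = Xcols[:, j]                          # treat X_j as the response
    others = np.delete(Xcols, j, axis=1)     # the remaining regressors
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2_j = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2_j)

# e.g. vif(np.column_stack([price, adv]), 0) for the pie-sales regressors
```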
If VIFⱼ > 5, Xⱼ is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X′X is singular.
The matrix X′X becomes singular when the columns of X have exact linear dependence, i.e., when some eigenvalue of X′X is zero.
Thus, a near-zero eigenvalue is also an indication of multicollinearity.
Methods of dealing with multicollinearity:
Collecting additional data
Variable elimination
We may use the method of variable elimination.
In practice, if Corr(X₁, X₂) is more than 0.7 or less than −0.7, we eliminate one of them.
Techniques:
Stepwise (based on ANOVA)
Forward Inclusion (based on correlation)
Backward Elimination (based on correlation)
Stepwise Regression
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅ + ε
Step 1: Run 5 simple linear regressions:
Y = β₀ + β₁X₁
Y = β₀ + β₂X₂
Y = β₀ + β₃X₃
Y = β₀ + β₄X₄
Y = β₀ + β₅X₅
Suppose X₄ gives the best model and is retained.
Step 2: Run 4 two-variable linear regressions:
Y = β₀ + β₄X₄ + β₁X₁
Y = β₀ + β₄X₄ + β₂X₂
Y = β₀ + β₄X₄ + β₃X₃
Y = β₀ + β₄X₄ + β₅X₅
Step 3: Suppose X₃ was the best addition in Step 2; run 3 three-variable linear regressions:
Y = β₀ + β₃X₃ + β₄X₄ + β₁X₁
Y = β₀ + β₃X₃ + β₄X₄ + β₂X₂
Y = β₀ + β₃X₃ + β₄X₄ + β₅X₅
Suppose none of these models have p-values < 0.05.
STOP: the best model is the one with X₃ and X₄ only. A code sketch of this idea follows.
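A simplified forward-selection sketch in the spirit of the stepwise procedure above; here adjusted R² is used as the add/stop criterion instead of the slides' p-values, which is an assumption of this sketch:

```python
# Forward selection: repeatedly add the regressor that most improves adjusted R².
import numpy as np

def adj_r2(X, y):
    """Adjusted R² of an intercept-plus-X least squares fit."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    n, k = A.shape[0], A.shape[1] - 1
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

def forward_select(X, y):
    """Return the chosen column indices and the final adjusted R²."""
    chosen, remaining, best = [], list(range(X.shape[1])), -np.inf
    while remaining:
        score, j = max((adj_r2(X[:, chosen + [j]], y), j) for j in remaining)
        if score <= best:          # no candidate improves the fit: stop
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return chosen, best
```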
Example (continued): the sales, number of advertisements, and advertising expenditure data for the 12 weeks given earlier are used to fit a regression model to predict the sales.
Summary Output 1: Sales vs. No_Adv

Model Summary
R      R Square  Adjusted R Square  Std. Error of the Estimate
.781   .610      .571               4.20570
a. Predictors: (Constant), No_Adv

ANOVA (b)
Model        Sum of Squares  df  Mean Square  F       Sig.
Regression   276.308          1  276.308      15.621  .003 (a)
Residual     176.879         10   17.688
Total        453.187         11
a. Predictors: (Constant), No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model        B       Std. Error  Beta   t      Sig.
(Constant)   16.937  4.982              3.400  .007
No_Adv        2.083   .527       .781   3.952  .003
a. Dependent Variable: Sales
Summary Output 2: Sales vs. Ex_Adv

Model Summary
R      R Square  Adjusted R Square  Std. Error of the Estimate
.820   .673      .640               3.84900
a. Predictors: (Constant), Ex_Adv

ANOVA (b)
Model        Sum of Squares  df  Mean Square  F       Sig.
Regression   305.039          1  305.039      20.590  .001 (a)
Residual     148.148         10   14.815
Total        453.187         11
a. Predictors: (Constant), Ex_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model        B      Std. Error  Beta   t      Sig.
(Constant)   4.173  7.109              .587   .570
Ex_Adv       2.872   .633       .820   4.538  .001
a. Dependent Variable: Sales
Summary Output 3: Sales vs. No_Adv & Ex_Adv

Model Summary
R      R Square  Adjusted R Square  Std. Error of the Estimate
.827   .684      .614               3.98888
a. Predictors: (Constant), Ex_Adv, No_Adv

ANOVA (b)
Model        Sum of Squares  df  Mean Square  F      Sig.
Regression   309.986          2  154.993      9.741  .006 (a)
Residual     143.201          9   15.911
Total        453.187         11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model        B      Std. Error  Beta   t      Sig.
(Constant)   6.584  8.542              .771   .461
No_Adv        .625  1.120       .234   .558   .591
Ex_Adv       2.139  1.470       .611  1.455   .180
a. Dependent Variable: Sales
Qualitative Independent Variables
Johnson Filtration, Inc., provides maintenance service for water filtration systems throughout southern Florida.
To estimate the service time and the service cost, the managers want to predict the repair time necessary for each maintenance request.
Repair time is believed to be related to two factors:
Number of months since the last maintenance service
Type of repair problem (mechanical or electrical)
Data for a sample of 10 service calls are given:

Service Call  Months Since Last Service  Type of Repair  Repair Time in Hours
 1            2                          electrical      2.9
 2            6                          mechanical      3.0
 3            8                          electrical      4.8
 4            3                          mechanical      1.8
 5            2                          electrical      2.9
 6            7                          electrical      4.9
 7            9                          mechanical      4.2
 8            8                          mechanical      4.8
 9            4                          electrical      4.4
10            6                          electrical      4.5

Let Y denote the repair time and X₁ the number of months since the last maintenance service.
The regression model that uses X₁ only to regress Y is
Y = β₀ + β₁X₁ + ε
Using the least squares method, we fitted the model
Ŷ = 2.1473 + 0.3041 X₁,  R² = 0.534
At the 5% level of significance, we reject
H₀: β₀ = 0 (using the t test)
H₀: β₁ = 0 (using the t and F tests)
X₁ alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable:
X₂ = 1 if the type of repair is electrical, 0 if the type of repair is mechanical.
The regression model that uses X₁ and X₂ to regress Y is
Y = β₀ + β₁X₁ + β₂X₂ + ε
Is the new model improved?
Summary
Multiple linear regression model: Y = Xβ + ε
The least squares estimate of β is given by b = (X′X)⁻¹X′Y
R² and adjusted R²
Using ANOVA (F test), we examine whether all β's are zero or not.
A t test is conducted for each regressor separately; using the t test, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality