8/13/2019 Regression Explained
Regression Analysis
There are three kinds of data arrangements: time series, cross-sectional, and panel. Regression can therefore be run on all three. Based on the number of variables, regression is bivariate or multivariate.
Bivariate Regression
A measure of linear association that investigates a straight-line relationship.
Useful in estimation/forecasting.
Introduction to Regression Analysis

Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Bivariate Linear Regression
A measure of linear association that investigates a straight-line relationship.

Y = α + βX + ε

where
Y is the dependent variable
X is the independent variable
α and β are two constants to be estimated
ε is the error or residual term
Y intercept
An intercepted segment of a line; the point at which the regression line intercepts the Y-axis.
Slope
The inclination of a regression line as compared to a base line.
[Scatter plot: actual Y values plotted against X, with the fitted regression line Ŷ drawn through them]

Regression Line
Y = a + bX + e
Ŷ is used for the predicted value of Y
The Least-Square Method
The criterion of attempting to make the least amount of total error in prediction of Y from X. More technically, the procedure used in the least-squares method generates a straight line that minimizes the sum of squared deviations of the actual values from this predicted regression line.
The Least-Square Method
A relatively simple mathematical technique that ensures that the straight line will most closely represent the relationship between X and Y.
Regression - Least-Square Method

eᵢ = Yᵢ − Ŷᵢ  (the residual)

Yᵢ = actual value of the dependent variable
Ŷᵢ = estimated value of the dependent variable (Y hat)
n = number of observations

The line is chosen so that the sum of squared residuals,

Σᵢ₌₁ⁿ eᵢ² = Σ (yᵢ − ŷᵢ)² = Σ (yᵢ − (a + bxᵢ))²,

is a minimum.
Finding Out the Values of a and b

b = (n ΣXY − ΣX ΣY) / (n ΣX² − (ΣX)²)
a = Ȳ − b X̄

b = estimated slope of the line (the regression coefficient)
a = estimated intercept of the y-axis
Y = dependent variable
Ȳ = mean of the dependent variable
X = independent variable
X̄ = mean of the independent variable
n = number of observations
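As a quick check, the formulas above can be applied in a few lines of Python; this sketch uses the ten (X, Y) pairs from the R-square example that appears later in this deck:

```python
# Least-squares estimates of slope (b) and intercept (a), computed
# directly from the formulas above.
X = [3, 10, 11, 15, 22, 22, 23, 28, 28, 35]
Y = [40, 35, 30, 32, 19, 26, 24, 22, 18, 6]
n = len(X)

sum_x = sum(X)
sum_y = sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

# b = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - (sum(X))^2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# a = mean(Y) - b * mean(X)
a = sum_y / n - b * (sum_x / n)

print(round(b, 2), round(a, 1))  # -0.94 43.7
```

This reproduces the line of best fit quoted for that example (slope about −.94, intercept about 43.7).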
The other method of calculating a and b

Use of simultaneous (normal) equations:
ΣY = Na + b ΣX
ΣXY = a ΣX + b ΣX²
(where Y is the dependent variable and X is the independent variable)
F-Test (Regression)-Goodness of fit
A procedure to determine whether there is more variability explained by the regression or unexplained by the regression.
Total deviation = deviation explained by the regression + deviation unexplained by the regression.
Partitioning the Variance

Σ (Yᵢ − Ȳ)²  =  Σ (Ŷᵢ − Ȳ)²  +  Σ (Yᵢ − Ŷᵢ)²
Total variation = Explained variation + Unexplained variation (residual)

Ȳ = mean of the total group
Ŷᵢ = value predicted with the regression equation
Yᵢ = actual value
Sum of Squares

SSt = SSr + SSe
r² = SSr / SSt = 1 − SSe / SSt

The proportion of variance in Y that is explained by X (or vice versa) is referred to as the Coefficient of Determination, r². r² can also be calculated by squaring the correlation, r. This is also known as explained variance.
Calculating the Value of R Square

X    Y
3    40
10   35
11   30
15   32
22   19
22   26
23   24
28   22
28   18
35   6

Equation for line of best fit: ŷ = -.94x + 43.7
Correlation: r = -.94
X    Y    Predicted Y    Error    Error²    Y − Ȳ    (Y − Ȳ)²
3    40
10   35
11   30
15   32
22   19
22   26
23   24
28   22
28   18
35   6

Mean of Y: ___    Sum of Error²: ___    Sum of (Y − Ȳ)²: ___

Equation for line of best fit: ŷ = -.94x + 43.7
X    Y    Predicted Y    Error    Error²    Y − Ȳ    (Y − Ȳ)²
3    40   40.88           .88      .77      14.8     219.04
10   35   34.30          -.70      .49       9.8      96.04
11   30   33.36          3.36    11.29       4.8      23.04
15   32   29.60         -2.40     5.76       6.8      46.24
22   19   23.02          4.02    16.16      -6.2      38.44
22   26   23.02         -2.98     8.88        .8        .64
23   24   22.08         -1.92     3.69      -1.2       1.44
28   22   17.38         -4.62    21.34      -3.2      10.24
28   18   17.38          -.62      .38      -7.2      51.84
35    6   10.80          4.80    23.04     -19.2     368.64

Mean of Y: 25.2    Sum of Error²: 91.81    Sum of (Y − Ȳ)²: 855.60

Equation for line of best fit: ŷ = -.94x + 43.7
(Here Error = Predicted Y − Y.)
To Calculate R Squared

R² = 1 − (sum of squared distances between the actual and predicted Y values) / (sum of squared distances between the actual Y values and their mean)

R² = 1 − 91.81 / 855.60 = 1 − 0.11 ≈ .89
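The same arithmetic as the worked table can be done in a short Python sketch, using the fitted line ŷ = −.94x + 43.7:

```python
# R-squared for the worked example: R^2 = 1 - SSE/SST.
X = [3, 10, 11, 15, 22, 22, 23, 28, 28, 35]
Y = [40, 35, 30, 32, 19, 26, 24, 22, 18, 6]

y_hat = [-0.94 * x + 43.7 for x in X]   # predicted values from the fitted line
mean_y = sum(Y) / len(Y)                # 25.2

sse = sum((y - p) ** 2 for y, p in zip(Y, y_hat))   # sum of squared errors
sst = sum((y - mean_y) ** 2 for y in Y)             # total sum of squares

r_squared = 1 - sse / sst
print(round(sse, 2), round(sst, 2), round(r_squared, 2))  # 91.81 855.6 0.89
```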
X    Y
3    40
10   35
11   30
15   32
22   19
22   26
23   24
28   22
28   18
35   6

r = -.944

The value we got for R Squared was .89. Here's a short-cut: to find R Squared, just square r:

r² = (-.944) × (-.944) = .89
R Squared

To determine how well the regression line fits the data, we find a value called R-Squared (r²).
To find r², simply square the correlation r.
The closer r² is to 1, the better the line fits the data.
r² will always be a positive number.
Understanding the Output of Regression
Sample Data for House Price Model
House Price in $1000s (y)    Square Feet (x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
Regression Using Excel
Tools / Data Analysis / Regression
Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA        df    SS           MS           F         Significance F
Regression   1     18934.9348   18934.9348   11.0848   0.01039
Residual     8     13665.5652   1708.1957
Total        9     32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
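The headline numbers in this output (slope, intercept, R Square) can be reproduced from the raw house-price data with a short Python sketch:

```python
# Reproducing the Excel regression output for the house-price data
# using the least-squares formulas from earlier slides.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]   # in $1000s
n = len(sqft)

mean_x = sum(sqft) / n
mean_y = sum(price) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
s_xx = sum((x - mean_x) ** 2 for x in sqft)

b1 = s_xy / s_xx              # slope
b0 = mean_y - b1 * mean_x     # intercept

y_hat = [b0 + b1 * x for x in sqft]
sst = sum((y - mean_y) ** 2 for y in price)   # total sum of squares
ssr = sum((p - mean_y) ** 2 for p in y_hat)   # explained sum of squares
r_sq = ssr / sst

print(round(b0, 5), round(b1, 5), round(r_sq, 5))  # 98.24833 0.10977 0.58082
```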
Graphical Presentation

[Scatter plot of house price ($1000s) versus square feet, with the fitted regression line]

house price = 98.24833 + 0.10977 (square feet)
Slope = 0.10977
Intercept = 98.248
Interpretation of the Intercept, b0

b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of observed x values).

Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet.

house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Slope Coefficient, b1

b1 measures the estimated change in the average value of Y as a result of a one-unit change in X.

Here, b1 = .10977 tells us that the average value of a house increases by .10977 × $1000 = $109.77, on average, for each additional square foot of size.

house price = 98.24833 + 0.10977 (square feet)
Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA               df    SS           MS           F         Significance F
Regression (Main)   1     18934.9348   18934.9348   11.0848   0.01039
Residual (Error)    8     13665.5652   1708.1957
Total               9     32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

58.08% of the variation in house prices is explained by variation in square feet:

R² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

Adjusted R Square is used to test whether an additional independent variable improves the model.
Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

s = √( SSE / (n − k − 1) )

where
SSE = sum of squared errors
n = sample size
k = number of independent variables in the model
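A quick Python check of this formula for the house-price model (k = 1; the coefficients here are taken from the Excel output on the earlier slide):

```python
# Standard error of the estimate, s = sqrt(SSE / (n - k - 1)),
# for the house-price model with one independent variable (k = 1).
import math

sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n, k = len(sqft), 1

b0, b1 = 98.24833, 0.10977   # coefficients from the Excel output

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))
s = math.sqrt(sse / (n - k - 1))
print(round(s, 2))  # 41.33, matching the "Standard Error" in the output
```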
The Standard Deviation of the Regression Slope

The standard error of the regression slope coefficient (b1) is estimated by

s(b1) = s / √( Σ(x − x̄)² ) = s / √( Σx² − (Σx)² / n )

where:
s(b1) = estimate of the standard error of the least squares slope
s = √( SSE / (n − 2) ) = sample standard error of the estimate
Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA        df    SS           MS           F         Significance F
Regression   1     18934.9348   18934.9348   11.0848   0.01039
Residual     8     13665.5652   1708.1957
Total        9     32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

Thus, s = 41.33 means that a typical house-price prediction is off by about $41,330 (price is measured in $1000s).
Inference about the Slope: t Test

t test for a population slope: is there a linear relationship between x and y?

Null and alternative hypotheses:
H0: β1 = 0  (no linear relationship)
H1: β1 ≠ 0  (linear relationship does exist)

Test statistic:

t = (b1 − β1) / s(b1),    d.f. = n − 2

where:
b1 = sample regression slope coefficient
β1 = hypothesized slope
s(b1) = estimator of the standard error of the slope
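Combining the last two slides, a Python sketch computes s(b1) and the t statistic for the house-price data (β1 = 0 under H0):

```python
# t statistic for the slope: t = (b1 - beta1) / s_b1, with beta1 = 0.
import math

sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(sqft)

mean_x = sum(sqft) / n
mean_y = sum(price) / n
s_xx = sum((x - mean_x) ** 2 for x in sqft)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price)) / s_xx
b0 = mean_y - b1 * mean_x

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))
s = math.sqrt(sse / (n - 2))   # standard error of the estimate
s_b1 = s / math.sqrt(s_xx)     # standard error of the slope

t = (b1 - 0) / s_b1
print(round(s_b1, 5), round(t, 3))  # 0.03297 3.329
```

Both values match the Excel output (Standard Error 0.03297, t Stat 3.32938).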
Inference about the Slope: t Test (continued)

House Price in $1000s (y)    Square Feet (x)
245    1400
312    1600
279    1700
308    1875
199    1100
219    1550
405    2350
324    2450
319    1425
255    1700

Estimated regression equation:
house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098.
Does square footage of the house affect its sales price?
Inferences about the Slope: t Test Example

H0: β1 = 0
HA: β1 ≠ 0

              Coefficients   Standard Error   t Stat   P-value
Intercept     98.24833       58.03348         1.6929   0.1289
Square Feet   0.10977        0.03297          3.3293   0.0103

Test statistic: t = b1 / s(b1) = 3.329
d.f. = 10 − 2 = 8; at α/2 = .025, the critical values are ±2.3060.

Decision: Reject H0, since t = 3.329 falls in the upper rejection region (3.329 > 2.3060).

Conclusion: There is sufficient evidence that square footage affects house price.
Regression Analysis for Description

Confidence interval estimate of the slope:

b1 ± t(α/2) × s(b1),    d.f. = n − 2

Excel printout for house prices:

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858).
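A minimal Python check of this interval, using the coefficient and standard error from the printout and the t critical value for 8 d.f. (2.3060, from a t table):

```python
# 95% confidence interval for the slope: b1 ± t_(alpha/2) * s_b1.
b1 = 0.10977     # slope from the Excel output
s_b1 = 0.03297   # standard error of the slope
t_crit = 2.3060  # two-tailed critical value, d.f. = n - 2 = 8

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 4), round(upper, 4))  # 0.0337 0.1858
```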
Regression Analysis for Description
Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size.

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

This 95% confidence interval does not include 0.
Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance.
Multiple Regression
An extension of bivariate regression; multidimensional when three or more variables are involved.
It simultaneously investigates the effect of two or more independent variables on a single dependent variable.