Chapter 13
Multiple Regression and Model
Building
Multiple Regression Models
The General Multiple Regression Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
where
y is the dependent variable
$x_1, x_2, \dots, x_k$ are the independent variables
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$ is the deterministic portion of the model
$\beta_i$ determines the contribution of the independent variable $x_i$
Analyzing a Multiple Regression Model
1. Hypothesize the deterministic component of the model
2. Use sample data to estimate β0,β1,β2,… βk
3. Specify probability distribution of ε and estimate σ
4. Check that assumptions on ε are satisfied
5. Statistically evaluate model usefulness
6. If the model is useful, use it for prediction, estimation, and other
purposes
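The First-Order Model: Estimating
and Interpreting the β-Parameters
For
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$
the chosen fitted model
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k$
minimizes
$SSE = \sum (y - \hat{y})^2$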
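As a minimal sketch of the least squares criterion, the Python below computes SSE for two candidate sets of coefficients. The data are made up for illustration (generated without noise from E(y) = 1 + 2x1 + x2), and the function names are hypothetical.

```python
def predict(betas, xs):
    """Return y-hat = b0 + b1*x1 + ... + bk*xk for one observation."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))


def sse(betas, X, y):
    """Sum of squared errors: the sum of (y - y-hat)^2 over all observations."""
    return sum((yi - predict(betas, xi)) ** 2 for xi, yi in zip(X, y))


# Made-up data, generated exactly by y = 1 + 2*x1 + 1*x2 (no random error).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 2, 4, 6]

sse_exact = sse([1, 2, 1], X, y)    # coefficients that generated the data
sse_other = sse([1, 2, 0.5], X, y)  # a competing candidate

print(sse_exact, sse_other)
```

The candidate with the smaller SSE fits the sample better; the least squares estimates are the coefficients for which no other candidate has a smaller SSE.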
y = β0 + β1x1 + β2x2 + β3x3 + ε
where
y = Sales price (dollars)
x1 = Appraised land value (dollars)
x2 = Appraised improvements (dollars)
x3 = Area (square feet)
Plot of data for sample size n=20
Fit model to data
Interpret β estimates
$\hat{\beta}_1 = .8145$
E(y), the mean sale price of the property, is
estimated to increase .8145 dollars for every $1
increase in appraised land value, holding other
variables constant
$\hat{\beta}_2 = .8204$
E(y), the mean sale price of the property, is
estimated to increase .8204 dollars for every $1
increase in appraised improvements, holding other
variables constant
$\hat{\beta}_3 = 13.53$
E(y), the mean sale price of the property, is
estimated to increase 13.53 dollars for each additional
square foot of living area, holding other variables
constant
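The interpretations above can be sketched in Python. Only the three slope estimates come from the slides; the intercept is not shown there, so the sketch computes changes in the estimated mean price rather than the price itself. The function and constant names are hypothetical.

```python
# Slope estimates quoted in the slides.
BETA_LAND = 0.8145          # per dollar of appraised land value
BETA_IMPROVEMENTS = 0.8204  # per dollar of appraised improvements
BETA_AREA = 13.53           # per square foot of living area


def change_in_mean_price(d_land=0.0, d_improvements=0.0, d_area=0.0):
    """Estimated change in E(y) for the given changes in x1, x2, x3.

    Any variable left at its default of 0 is being held constant.
    """
    return (BETA_LAND * d_land
            + BETA_IMPROVEMENTS * d_improvements
            + BETA_AREA * d_area)


# 100 extra square feet, land and improvements held constant.
extra_area = change_in_mean_price(d_area=100)
# $1,000 more appraised land value, everything else held constant.
extra_land = change_in_mean_price(d_land=1000)
print(round(extra_area, 2), round(extra_land, 2))
```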
Given the model E(y) = 1 + 2x1 + x2, the effect
of x2 on E(y), holding x1 constant, is an increase of 1 in E(y) for every 1-unit increase in x2
Model Assumptions
Assumptions about Random Error ε
1. For any given set of values of x1, x2, …, xk, the random
error ε has a normal probability distribution with mean 0
and variance σ²
2. The random errors are independent
Estimator of σ² for a Multiple Regression Model
with k Independent Variables
$s^2 = \frac{SSE}{n - \text{Number of estimated } \beta \text{ parameters}} = \frac{SSE}{n-(k+1)}$
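A small Python sketch of this estimator follows. The SSE value is hypothetical; n = 20 and k = 3 match the sale-price example earlier in the slides.

```python
import math


def estimate_sigma(sse, n, k):
    """Return (s^2, s) where s^2 = SSE / (n - (k + 1)).

    n is the number of observations and k the number of independent
    variables, so k + 1 beta parameters are estimated.
    """
    df_error = n - (k + 1)
    if df_error <= 0:
        raise ValueError("need more observations than estimated parameters")
    s2 = sse / df_error
    return s2, math.sqrt(s2)


# Hypothetical SSE of 160 with n = 20 observations and k = 3 variables.
s2, s = estimate_sigma(sse=160.0, n=20, k=3)
print(s2, round(s, 4))
```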
Inferences about the β-Parameters
Two types of inferences can be made, using
either confidence intervals or hypothesis
testing
For any inferences to be made, the
assumptions made about the random error
term ε (normal distribution with mean 0 and
variance σ², independence of errors) must
be met
A 100(1-α)% Confidence Interval for a β-Parameter
$\hat{\beta}_i \pm t_{\alpha/2}\, s_{\hat{\beta}_i}$
where tα/2 is based on n-(k+1) degrees of freedom and
n = Number of observations
k+1 = Number of parameters in the model
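The interval can be sketched in Python as below. The estimate 13.53 is the area coefficient from the slides; the standard error of 6.0 is a hypothetical value chosen for illustration. The critical value 2.120 is the tabled t.025 for 16 degrees of freedom (n = 20, k = 3), passed in by hand so the sketch needs no statistics library.

```python
def beta_confidence_interval(beta_hat, se_beta, t_crit):
    """Return (lower, upper) for beta_hat +/- t_crit * se_beta.

    t_crit is t_(alpha/2) with n - (k + 1) degrees of freedom,
    looked up by the caller.
    """
    half_width = t_crit * se_beta
    return beta_hat - half_width, beta_hat + half_width


# 95% interval: beta-hat from the slides, hypothetical standard error.
lower, upper = beta_confidence_interval(beta_hat=13.53, se_beta=6.0, t_crit=2.120)
print(round(lower, 2), round(upper, 2))
```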
A Test of an Individual Parameter Coefficient
One-Tailed Test
H0: βi = 0
Ha: βi < 0 (or Ha: βi > 0)
Rejection region: t < -tα
(or t > tα when Ha: βi > 0)
Two-Tailed Test
H0: βi = 0
Ha: βi ≠ 0
Rejection region: |t| > tα/2
Test statistic: $t = \frac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$
where tα and tα/2 are based on n-(k+1) degrees of freedom
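A hedged Python sketch of the test: the statistic is the estimate divided by its standard error, and the decision compares it with a tabled critical value supplied by the caller. The standard error of 6.0 is hypothetical; 2.120 and 1.746 are the tabled t.025 and t.05 values for 16 degrees of freedom. The function names and the `alternative` labels are illustrative choices.

```python
def t_statistic(beta_hat, se_beta):
    """Test statistic for H0: beta_i = 0."""
    return beta_hat / se_beta


def reject_h0(t, t_crit, alternative="two-sided"):
    """Apply the rejection region for the chosen alternative hypothesis.

    t_crit is t_(alpha/2) for "two-sided" and t_alpha for the
    one-tailed alternatives "greater" (beta_i > 0) and "less" (beta_i < 0).
    """
    if alternative == "two-sided":
        return abs(t) > t_crit
    if alternative == "greater":
        return t > t_crit
    if alternative == "less":
        return t < -t_crit
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")


t = t_statistic(beta_hat=13.53, se_beta=6.0)
two_sided = reject_h0(t, t_crit=2.120)                       # alpha = .05
upper_tail = reject_h0(t, t_crit=1.746, alternative="greater")
lower_tail = reject_h0(t, t_crit=1.746, alternative="less")
print(round(t, 3), two_sided, upper_tail, lower_tail)
```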
An Excel Analysis
[Figure: Excel regression output, with callouts marking the columns to use for confidence intervals and the columns to use for hypotheses about parameter coefficients]
Checking the Overall Utility of a
Model
3 tests:
1. Multiple coefficient of determination R²
$R^2 = 1 - \frac{SSE}{SS_{yy}} = \frac{SS_{yy} - SSE}{SS_{yy}} = \frac{\text{Explained variability}}{\text{Total variability}}$
2. Adjusted multiple coefficient of determination
$R_a^2 = 1 - \left[\frac{n-1}{n-(k+1)}\right]\left(\frac{SSE}{SS_{yy}}\right) = 1 - \left[\frac{n-1}{n-(k+1)}\right](1 - R^2)$
3. Global F-test
Test statistic: $F = \frac{(SS_{yy} - SSE)/k}{SSE/[n-(k+1)]} = \frac{R^2/k}{(1-R^2)/[n-(k+1)]}$
Testing Global Usefulness of the Model: The
Analysis of Variance F-test
H0: β1 = β2 = … = βk = 0
Ha: At least one βi ≠ 0
Test statistic: $F = \frac{(SS_{yy} - SSE)/k}{SSE/[n-(k+1)]} = \frac{R^2/k}{(1-R^2)/[n-(k+1)]} = \frac{\text{Mean Square (Model)}}{\text{Mean Square (Error)}}$
where n is the sample size and k is the number of terms in the model
Rejection region: F > Fα, with k numerator degrees of freedom and [n-(k+1)] denominator degrees of freedom
Checking the Utility of a Multiple Regression Model
1. Conduct a test of overall model adequacy
using the F-test. If H0 is rejected, proceed to
step 2
2. Conduct t-tests on β parameters of particular
interest
Using the Model for Estimation and
Prediction
As in Simple Linear Regression, a prediction interval for an
individual value of y will be wider than a confidence interval for
the mean E(y) at the same values of the x's
Most statistics packages will print out both
confidence and prediction intervals
Model Building: Interaction Models
An Interaction Model Relating E(y) to Two Quantitative Independent Variables
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
where
$(\beta_1 + \beta_3 x_2)$ represents the change in E(y) for every 1-unit increase in x1, holding x2 fixed
$(\beta_2 + \beta_3 x_1)$ represents the change in E(y) for every 1-unit increase in x2, holding x1 fixed
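To illustrate how interaction makes the slope of one variable depend on the other, the Python sketch below evaluates E(y) = β0 + β1x1 + β2x2 + β3x1x2 with hypothetical coefficients and checks that a 1-unit increase in x1 changes E(y) by β1 + β3x2.

```python
def mean_response(betas, x1, x2):
    """E(y) = b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    b0, b1, b2, b3 = betas
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def slope_in_x1(betas, x2):
    """Change in E(y) per 1-unit increase in x1, holding x2 fixed."""
    return betas[1] + betas[3] * x2


betas = (1.0, 2.0, 1.0, 0.5)  # hypothetical coefficients

slope_at_0 = slope_in_x1(betas, x2=0)  # slope of x1 when x2 = 0
slope_at_4 = slope_in_x1(betas, x2=4)  # slope of x1 when x2 = 4
direct_change = mean_response(betas, 3, 4) - mean_response(betas, 2, 4)
print(slope_at_0, slope_at_4, direct_change)
```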
No interaction: the relationship between y
and xi is not affected by a second x
Interaction: the linear relationship
between y and xi depends on
another x
Model Building: Quadratic and
other Higher-Order Models
A Quadratic (Second-Order) Model
$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$
where
$\beta_0$ is the y-intercept of the curve
$\beta_1$ is a shift parameter
$\beta_2$ is the rate of curvature
Home Size-Electrical Usage Data
Size of Home, x (sq. ft.)    Monthly Usage, y (kilowatt-hours)
1,290                        1,182
1,350                        1,172
1,470                        1,264
1,600                        1,493
1,710                        1,571
1,840                        1,711
1,980                        1,804
2,230                        1,840
2,400                        1,95
2,930                        1,954
$\hat{y} = -1{,}216.1 + 2.3989x - .00045x^2$
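As a sketch, the fitted quadratic can be evaluated at a home size from the table. The signs of the coefficients are an assumption here: the extracted equation lost them, and a negative intercept with a negative x² coefficient is the combination that gives predictions on the same scale as the tabled usage values. The function name is hypothetical.

```python
def predicted_usage(x):
    """Estimated monthly usage (kilowatt-hours) for a home of x square feet.

    Coefficient magnitudes are from the slide; the signs are assumed.
    """
    return -1216.1 + 2.3989 * x - 0.00045 * x ** 2


y_hat_1600 = predicted_usage(1600)
residual_1600 = 1493 - y_hat_1600  # observed usage for the 1,600 sq. ft. home
print(round(y_hat_1600, 2), round(residual_1600, 2))
```

Because the x² coefficient is negative under this assumption, the curve bends downward: predicted usage rises with home size at a decreasing rate over the range of the data.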
A Complete Second-Order Model with Two
Quantitative Independent Variables
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$
where
$\beta_0$ is the y-intercept, the value of E(y) when x1 = x2 = 0
$\beta_1, \beta_2$: changes cause the surface to shift along the x1 and x2 axes
$\beta_3$ controls the rotation of the surface
$\beta_4, \beta_5$ control the type of surface and the rates of curvature
Model Building: Qualitative
(Dummy) Variable Models
Dummy variables – coded, qualitative variables
• Codes are in the form of (1, 0), 1 being the presence of a condition, 0 the absence
• Create dummy variables so that there is one fewer dummy variable than there are categories of the qualitative variable of interest
Gender dummy variable coded
as x = 1 if male, x=0 if female
If model is E(y)=β0+β1x ,
β1 captures the effect of being
male on the dependent variable
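A minimal Python sketch of (1, 0) dummy coding with one fewer dummy variable than categories. Treating the first listed level as the base level (all zeros) is an assumption of this sketch, and the three-level example labels are made up.

```python
def dummy_code(value, levels):
    """Return len(levels) - 1 indicator values for a qualitative value.

    levels[0] is treated as the base level and codes as all zeros;
    every other level gets a 1 in its own position.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


# Two categories -> one dummy variable (x = 1 if male, x = 0 if female).
male = dummy_code("male", ["female", "male"])
female = dummy_code("female", ["female", "male"])

# Three categories -> two dummy variables.
level_a = dummy_code("A", ["A", "B", "C"])
level_c = dummy_code("C", ["A", "B", "C"])
print(male, female, level_a, level_c)
```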
Model Building: Models with both
Quantitative and Qualitative Variables
Start with a first-order model with one quantitative
variable, E(y)=β0+β1x
Adding a qualitative variable
with no interaction,
E(y)=β0+β1x1+β2x2+β3x3
where x2 and x3 are dummy variables for the qualitative variable
Adding an interaction term,
E(y)=β0+β1x1+β2x2+β3x3+β4x1x2+β5x1x3
Main effect of x1: β1x1
Main effect of x2 and x3: β2x2+β3x3
Interaction: β4x1x2+β5x1x3
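With interaction terms, each level of the qualitative variable gets its own line in x1. The Python sketch below assumes x2 and x3 are the two dummy variables for a three-level qualitative variable and uses hypothetical coefficients.

```python
def line_for_level(betas, x2, x3):
    """Return (intercept, slope in x1) for the level coded by dummies (x2, x3).

    Model: E(y) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x1*x2 + b5*x1*x3.
    """
    b0, b1, b2, b3, b4, b5 = betas
    intercept = b0 + b2 * x2 + b3 * x3
    slope = b1 + b4 * x2 + b5 * x3
    return intercept, slope


betas = (10.0, 2.0, 5.0, -3.0, 0.5, 1.0)  # hypothetical coefficients

base_level = line_for_level(betas, x2=0, x3=0)
second_level = line_for_level(betas, x2=1, x3=0)
third_level = line_for_level(betas, x2=0, x3=1)
print(base_level, second_level, third_level)
```

Setting β4 = β5 = 0 recovers the no-interaction model, in which the three lines share one slope and differ only in intercept.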
Model Building: Comparing Nested
Models
Models are nested if one model contains all
the terms of the other model and at least
one additional term.
Complete (full) model – the more complex
model
Reduced model – the simpler model
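Complete model: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$
Reduced model: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$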
F-Test for Comparing Nested Models
Reduced model: $E(y) = \beta_0 + \beta_1 x_1 + \dots + \beta_g x_g$
Complete model: $E(y) = \beta_0 + \beta_1 x_1 + \dots + \beta_g x_g + \beta_{g+1} x_{g+1} + \dots + \beta_k x_k$
H0: βg+1 = βg+2 = … = βk = 0
Ha: At least one β under test is nonzero.
Test statistic: $F = \frac{(SSE_R - SSE_C)/(k-g)}{SSE_C/[n-(k+1)]} = \frac{(SSE_R - SSE_C)/\#\beta\text{'s tested in } H_0}{MSE_C}$
Rejection region: F > Fα, with k-g numerator degrees of freedom and
[n-(k+1)] denominator degrees of freedom
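A Python sketch of the nested-model F statistic. The SSE values and n are hypothetical; k = 5 and g = 3 match the complete second-order and interaction models compared in these slides.

```python
def nested_f(sse_reduced, sse_complete, n, k, g):
    """F statistic for H0: beta_(g+1) = ... = beta_k = 0.

    k is the number of terms in the complete model, g the number in the
    reduced model, and n the number of observations.
    """
    if not 0 <= g < k:
        raise ValueError("the reduced model must have fewer terms than the complete model")
    numerator = (sse_reduced - sse_complete) / (k - g)   # per beta tested
    mse_complete = sse_complete / (n - (k + 1))          # MSE of complete model
    return numerator / mse_complete


# Hypothetical SSEs: dropping the two squared terms raises SSE from 160 to 250.
f_nested = nested_f(sse_reduced=250.0, sse_complete=160.0, n=22, k=5, g=3)
print(f_nested)
```

The result would be compared with the tabled Fα for k-g = 2 numerator and n-(k+1) = 16 denominator degrees of freedom.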
Model Building: Stepwise
Regression
Used when there is a large set of candidate independent
variables
Software packages add variables one at a time, in
order of explanatory value.
Decisions are based on the largest t-values at each
step
The procedure is best used as a screening
procedure only
Residual Analysis: Checking the
Regression Assumptions
Regression Residual – the difference between an observed y value and its corresponding predicted value
$\hat{\varepsilon} = y - \hat{y}$
Properties of Regression Residuals
• The mean of the residuals equals zero
• The standard deviation of the residuals is equal to the
standard deviation of the fitted regression model
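The zero-mean property of residuals can be checked numerically. To stay short, the Python sketch below fits a straight line (one predictor) by least squares instead of a multiple regression model, using the first five rows of the home size and electrical usage table; the function name is hypothetical.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# First five rows of the home size / monthly usage table.
x = [1290, 1350, 1470, 1600, 1710]
y = [1182, 1172, 1264, 1493, 1571]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)
print([round(r, 1) for r in residuals], abs(mean_residual) < 1e-8)
```

The mean is zero only up to floating-point rounding, and only because the fitted model includes an intercept.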
Analyzing Residuals
Top plot: the residuals reveal a non-random pattern
with a curved shape
Second plot: after a second-order term is
added to the model, the residuals show a
random pattern, indicating a better
model
Identifying Outliers
Residual plots can reveal outliers
Outliers need to be checked to try
to determine if an error is involved
If an error is involved, or the observation
is not representative, the analysis can
be rerun after deleting the data point
to assess the effect.
[Figure: residual plot with an outlier labeled]
Checking for Normal Errors
[Figure: two panels, "With Outlier" and "Without Outlier"]
Checking for Equal Variances
A pattern in the residuals indicates a violation of the equal
variance assumption
This can point to the use of a transformation on the
dependent variable to stabilize the variance
Steps in Residual Analysis
1. Check for misspecified model by plotting
residuals against quantitative independent
variables
2. Examine residual plots for outliers
3. Check for non-normal error using frequency
distribution of residuals
4. Check for unequal error variances using plots
of residuals against predicted values
Some Pitfalls: Estimability,
Multicollinearity, and Extrapolation
Estimability – the number of levels of
observed x-values must be at least one more than
the order of the polynomial in x that you
want to fit
Multicollinearity – when two or more independent variables are correlated
Leads to confusing, misleading results and incorrect parameter estimate signs.
Can be identified by
– checking correlations among x’s
– non-significant t-tests for most/all x’s
– signs opposite from expected in the estimated β parameters
Can be addressed by
– Dropping one or more of the correlated variables in the model
– Restricting inferences to the range of the sample data, and not making inferences about individual β parameters based on t-tests.
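Checking correlations among the x's can be sketched in Python as below. The data are made up, and the 0.8 cutoff is an arbitrary illustrative threshold, not a rule from the slides; the function names are hypothetical.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def flag_correlated_pairs(columns, threshold=0.8):
    """Return (name1, name2, r) for every pair of x's with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = correlation(columns[names[i]], columns[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], r))
    return flagged


# Made-up data: x2 is nearly a multiple of x1, x3 is unrelated to both.
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],
    "x3": [5, 1, 4, 2, 3],
}
flagged = flag_correlated_pairs(columns)
print([(a, b, round(r, 3)) for a, b, r in flagged])
```

A flagged pair is a candidate for dropping one of the two variables, as the slide suggests.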
Extrapolation – use of model to predict
outside of range of sample data is
dangerous
Correlated Errors – most common when
working with time series data, values of y
and x’s observed over a period of time.
Solution is to develop a time series model.