Multiple Regression Analysis
Whereas simple linear regression has two variables (one dependent, one independent):

ŷ = a + bx

multiple linear regression has more than two variables (one dependent, many independent):

ŷ = a + b1x1 + b2x2 + … + bnxn

The problems and solutions are the same as bivariate regression, except there are more parameters to estimate.
In bivariate regression we fit a line through points plotted in 2-dimensional space.
In multiple regression with 3 variables we fit a plane through points plotted in 3-dimensional space.
Additional variables add additional dimensions to the variable space.
In addition to the assumptions of bivariate regression, multiple regression has the assumption of no multicollinearity among the independent variables.
Multicollinearity – when two or more of the independent variables are highly correlated, making it difficult to separate their effects on the dependent variable.
Example: Determine the strength of the relationship between Native American male standing height, average yearly minimum temperature, and annual temperature range.
Variables:
MHT         Male Standing Height (cm)   Dependent
AnnMinTemp  Annual Minimum Temp (ºF)    Independent
AnnRange    Annual Temp Range (ºF)      Independent
Model Summary
Model 1: R = .654, R² = .428, Adjusted R² = .416, Std. Error of the Estimate = 30.04066, Durbin-Watson = 1.683
Predictors: (Constant), AnnRange, AnnMinTemp
Dependent Variable: MHT
ANOVA
Regression: Sum of Squares = 63546.875, df = 2, Mean Square = 31773.438, F = 35.208, Sig. = .000
Residual:   Sum of Squares = 84829.502, df = 94, Mean Square = 902.442
Total:      Sum of Squares = 148376.4,  df = 96
Predictors: (Constant), AnnRange, AnnMinTemp
Dependent Variable: MHT
Coefficients
             B         Std. Error  Beta   t        Sig.  Tolerance  VIF
(Constant)   1665.620  15.964             104.334  .000
AnnMinTemp   4.492     .603        .855   7.446    .000  .462       2.166
AnnRange     1.565     .552        .325   2.834    .006  .462       2.166
Dependent Variable: MHT
41.6% of the variance in height is explained by minimum temperature and range (adjusted R²).
The model is significant.
The slopes are not zero. There is some collinearity.
The regression equation can be interpreted as follows:
Every 1ºF increase in minimum temperature adds 4.49 centimeters in male standing height, holding constant the temperature range.
Similarly, every increase of 1ºF in the annual temperature range adds 1.57 centimeters in male standing height, holding constant the minimum temperature.
Male Standing Height = 1665.6 + 4.49(ºF min temp) + 1.57(ºF temp range)
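The equation can be applied directly. A minimal sketch, using hypothetical climate values chosen for illustration only (they are not from the dataset):

```python
# Plug assumed values into the fitted equation above.
min_temp = -10.0   # assumed annual minimum temperature (ºF), illustrative
temp_range = 40.0  # assumed annual temperature range (ºF), illustrative

height = 1665.6 + 4.49 * min_temp + 1.57 * temp_range
print(height)  # 1683.5
```

Each coefficient contributes its slope times the value of its variable, added to the intercept.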
Normality of the residuals is one of the most important assumptions of linear regression. In this case the residuals are normally distributed.
The observed and predicted residuals do not display any systematic bias, which would indicate that the independent variables vary systematically with each other.
Tolerance is the proportion of the variance in a given independent variable that cannot be explained by the other independent variables. In this case 46.2% of the variance in one cannot be explained by the other, meaning that 53.8% of the variance IS shared, or collinear.
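With only two independent variables, each one's tolerance reduces to 1 − r², where r is their pairwise Pearson correlation, and VIF is its reciprocal. A quick check using r = -0.734 (the AnnMinTemp/AnnRange correlation reported later) reproduces the SPSS values up to rounding:

```python
# Tolerance = 1 - R^2 of regressing one predictor on the others;
# with two predictors, that R^2 is just the squared pairwise correlation.
r = -0.734
tolerance = 1 - r**2   # proportion of variance NOT shared
vif = 1 / tolerance    # variance inflation factor

print(round(tolerance, 3), round(vif, 3))  # 0.461 2.168
```

The small difference from the printed .462 and 2.166 comes from r itself being rounded to three decimals.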
This is why the standard error of the estimate is so large. The standard error of the estimate is the average error expressed in the original units (e.g. centimeters). 30 cm is about a foot of error in a person's height.
VIFs (variance inflation factors) higher than 2 are considered problematic (according to SPSS), and our VIFs are just over 2.1.
The standardized beta values indicate the strength of the relationship between each independent variable and the dependent variable. Minimum temperature is a much stronger predictor of height than annual range.
The question becomes: do these collinearity statistics rise to the level of indicating multicollinearity among the independent variables? In this example they do.
Correlations
AnnMinTemp × AnnRange: Pearson Correlation = -.734**, Sig. (2-tailed) = .000, N = 97
**. Correlation is significant at the 0.01 level (2-tailed).
Misspecification – an error in the regression equation due to the exclusion of an independent variable that influences the dependent variable OR the inclusion of an independent variable that does not influence the dependent variable.
Misspecification errors are common since it is difficult to know a priori what factors influence the dependent variable.
Misspecification is a hypothesis issue, not a statistical one.
Data Transformation
Often the association between two variables is not linear. Data transformation (log, etc.) is perfectly acceptable, but the type of transformation must be stated in your summary statement.
In this case, log transforming the population data created a linear relationship.
Converting to natural log is easy. For example, the mining town of Argentine has a population of 100; its natural log is:
Converting back to the original units is also easy:
ln(pop) = ln(100) = 4.60517

pop = e^4.60517 = 100
Calculator transformations:
Converting to a log: use the ln key.
Converting from a log: use the e^x key.
SPSS transformations:
Converting to a log: Transform>Compute variable> Arithmetic>Ln
Converting from a log: Transform>Compute variable> Arithmetic>Exp
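The same pair of conversions can be done in a couple of lines of Python, using the Argentine example from above:

```python
import math

pop = 100                     # Argentine's population
ln_pop = math.log(pop)        # natural log: 4.60517...
back = math.exp(ln_pop)       # back to original units: 100.0

print(round(ln_pop, 5))  # 4.60517
print(round(back, 5))    # 100.0
```

math.log is the natural log (base e), and math.exp reverses it exactly.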
Population and Elevation in Colorado Mining Towns.
The model is significant. What is the standard error of the estimate telling us? What are the units?
Population = 46852.9 + (-4.238)(Elevation)
Ln(population) = 33.108 – 0.003(elevation)
Population and Elevation in Colorado Mining Towns: Log Transformation
The model is significant. What is the standard error of the estimate telling us? What are the units?
Town          Population  Elevation (ft)  ln(Population)  ln(Predicted)  ln(Residual)
Argentine     100         11161           4.61            4.90195        -0.29678
Boreas        200         11535           5.30            3.95677        1.34155
Breckenridge  8000        9597            8.99            8.85453        0.13267
Buckskin Joe  500         10860           6.21            5.66264        0.55196
Chihuahua     200         10571           5.30            6.39301        -1.09469
Dudley        200         10400           5.30            6.82517        -1.52685
Fairplay      8000        9931            8.99            8.01043        0.97676
Hamilton      3000        9997            8.01            7.84364        0.16273
Horseshoe     800         10544           6.68            6.46125        0.22337
Lamartine     500         10485           6.21            6.61035        -0.39574
Lincoln       1500        10384           7.31            6.86560        0.44762
Montezuma     800         10358           6.68            6.93131        -0.24670
Mosquito      250         10720           5.52            6.01645        -0.49499
Park City     300         10587           5.70            6.35258        -0.64879
Parkville     10000       9944            9.21            7.97758        1.23276
Quartzville   200         11424           5.30            4.23729        1.06103
Rexford       50          11201           3.91            4.80086        -0.88884
Sacramento    100         11398           4.61            4.30300        0.30217
Saints John   200         10798           5.30            5.81933        -0.52101
Silverheels   150         10771           5.01            5.88757        -0.87693
Swandyke      200         11093           5.30            5.07380        0.22452
Silver Plume  5500        9825            8.61            8.27832        0.33418
Converting to Original Units from a Log Transformation
Town = Horseshoe
Population = 800
Elevation = 10,544 ft
Predicted ln(population) = 6.46125
ln(residual) = 0.22337
Converting to original units (people): population = e(6.46125) = 640
Converting the residual: residual ratio = e^0.22337 = 1.25028*
*This is the ratio of the observed value to the predicted value.
Original population = (640)(1.25028) = 800.2
Residual in original units (people): difference = 800 – 640 = 160
i.e. the equation under-predicted Horseshoe's population by 160 people.
Observed – Predicted = residual
Town       Population  Elevation (ft)  ln(Population)  ln(Predicted)  ln(Residual)
Argentine  100         11161           4.61            4.90195        -0.29678

Observed Population = 100
ln(Predicted Population) = 4.90195
ln(Residual) = -0.29678
Predicted population = e^(ln predicted)
Residual = Observed − Predicted
What are the predicted population and residual values, in the original units?
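Following the same steps as the Horseshoe example, the Argentine values convert back as a short sketch:

```python
import math

ln_predicted = 4.90195        # from the Argentine row
ln_residual = -0.29678
observed = 100

predicted = math.exp(ln_predicted)   # predicted population, original units
residual = observed - predicted      # observed - predicted
ratio = math.exp(ln_residual)        # observed/predicted ratio

print(round(predicted, 1), round(residual, 1))
```

The predicted population is about 135 people, so the residual is about -35: the equation over-predicted Argentine's population.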
Iterative Regression
If you are exploring a database for associations, one method is to use iterative regression.
Iterative Regression – an iterative procedure which either adds or removes variables from a regression model based on their significance.
IMPORTANT:
The SPSS stepwise procedure gives results that are inconsistent with the other methods. Due to this inconsistency it is recommended that the stepwise procedure not be used.
A better method of performing iterative regression is to use all variables with the enter procedure, then remove insignificant variables individually, OR use the backward or forward procedures.
Types of Iterative Regression:
Enter – all variables are entered in a single step.
Stepwise – independent variables are entered based on the smallest F probability. Variables already in the equation are removed if their probability of F becomes too large.
Backward – all variables are entered into the equation and then sequentially removed based on the smallest partial correlation.
Forward - A stepwise variable selection procedure in which variables are sequentially entered into the model.
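The backward procedure can be sketched in a few lines of NumPy. This is a simplified stand-in, not SPSS's implementation: SPSS removes variables by F probability, while this sketch uses a |t| < 2 rule of thumb (for removing one variable at a time, the two criteria rank variables the same way).

```python
import numpy as np

def backward_eliminate(X, y, names, t_thresh=2.0):
    """Backward-elimination sketch: fit OLS with all predictors,
    then repeatedly drop the slope with the smallest |t| below t_thresh."""
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])  # intercept + kept vars
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        mse = resid @ resid / (len(y) - A.shape[1])         # residual variance
        se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A))) # coefficient std errors
        t = np.abs(beta[1:] / se[1:])                       # slope t-ratios
        worst = int(np.argmin(t))
        if t[worst] >= t_thresh:                            # all slopes significant
            break
        keep.pop(worst)                                     # drop the weakest variable
    return [names[i] for i in keep]

# Demo on synthetic data: x0 and x1 drive y; x2 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
print(backward_eliminate(X, y, ["x0", "x1", "x2"]))
```

The genuine predictors survive elimination; the noise variable is usually the one removed.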
Harrisburg Housing Value (Iterative using the Enter procedure)
Not significant
Not significant
Predicted value ($) = -233435.212 + 19.515(Square Feet) + 143.475(Year Built) – 3848.55 (Bedrooms) + 10101.928(Half Baths) + 4.545(Parcel Size) – 12.126(Distance to Front St)
With insignificant variables removed.
No changes here.
All slopes are significant.
Standardized Coefficients
Standardized or beta coefficients are slope values that have been standardized so that their variances are 1. They can be used to determine which independent variables have a greater effect on the dependent variable when the variables are measured in different units.
In this case, Square Feet and Distance to Front Street are having the greatest effect.
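The relationship between unstandardized and beta coefficients is beta = b × (sx / sy). A minimal sketch with hypothetical data standing in for two housing predictors (the variable names and numbers are assumptions, not the Harrisburg data) shows the two equivalent routes to the betas:

```python
import numpy as np

def ols_slopes(X, y):
    """Least-squares slopes (intercept column added, then dropped)."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

# Hypothetical predictors measured in very different units.
rng = np.random.default_rng(1)
sqft = rng.normal(1800, 400, 300)
dist = rng.normal(500, 150, 300)
value = 50 * sqft - 30 * dist + rng.normal(0, 5000, 300)

X = np.column_stack([sqft, dist])
b = ols_slopes(X, value)                  # unstandardized slopes
beta = b * X.std(axis=0) / value.std()    # beta = b * (s_x / s_y)

# Identical betas come from regressing z-scores on z-scores:
Z = (X - X.mean(axis=0)) / X.std(axis=0)
zy = (value - value.mean()) / value.std()
print(np.allclose(beta, ols_slopes(Z, zy)))  # True
```

Because the betas share a unit-free scale, their magnitudes can be compared directly, which is exactly how Square Feet and Distance to Front Street were ranked above.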
705 ½ South Front Street
Value = $133,900
Square Feet = 2380
Parcel Size = 2975
Distance to Front Street = 84
Year Built = 1900
Bedrooms = 3
Half Baths = 1
Predicted value ($) = -233435.212 + 19.515(2380) + 143.475(1900) – 3848.55 (3) + 10101.928(1) + 4.545(2975) – 12.126(84)
Predicted value ($) = -233435.212 + 46445.7 + 272602.5 – 11545.65 +10101.928 +13521.375 – 1018.884
Predicted value ($) =109236.3
Residual ($) = 109236.3 – 133900 = -24663.7. This is not surprising considering that the r² was 0.591: over 40% of the variation in housing value is not explained by this model.
Mapping Regression Residuals
Temperature Recording Sites, Kyrgyzstan Region
Average yearly temperature is influenced by:
Elevation: 6.4˚C per 1000 m elevation change.
Latitude: 4.0˚C per 1000 km latitude change.
To what degree can we predict temperature based on both elevation and latitude?
Model Summary
Model 1: R = .824, R² = .679, Adjusted R² = .677, Std. Error of the Estimate = 3.48693
Predictors: (Constant), Elevation
Dependent Variable: Average Temperature
ANOVA
Regression: Sum of Squares = 4936.797, df = 1, Mean Square = 4936.797, F = 406.031, Sig. = .000
Residual:   Sum of Squares = 2334.466, df = 192, Mean Square = 12.159
Total:      Sum of Squares = 7271.264, df = 193
Dependent Variable: Average Temperature; Predictors: (Constant), Elevation
Coefficients
            B       Std. Error  Beta   t        Sig.
(Constant)  14.683  .390               37.691   .000
Elevation   -.005   .000        -.824  -20.150  .000
Dependent Variable: Average Temperature
Predicted Temperature = 14.683 – 0.005(Elevation)
Model: Elevation
The standard error of the estimate is about 3.5 ˚C, which is half of the 6.4 ˚C change per 1000 m of elevation.
This model is not very accurate.
Missing explanatory variable.
Model Summary
Model 1: R = .254, R² = .065, Adjusted R² = .060, Std. Error of the Estimate = 5.95185
Predictors: (Constant), Latitude
Dependent Variable: Average Temperature
ANOVA
Regression: Sum of Squares = 469.747, df = 1, Mean Square = 469.747, F = 13.260, Sig. = .000
Residual:   Sum of Squares = 6801.516, df = 192, Mean Square = 35.425
Total:      Sum of Squares = 7271.264, df = 193
Dependent Variable: Average Temperature; Predictors: (Constant), Latitude
Coefficients
            B       Std. Error  Beta   t       Sig.
(Constant)  30.470  6.002              5.077   .000
Latitude    -.531   .146        -.254  -3.641  .000
Dependent Variable: Average Temperature
Model: Latitude
This R² is very low.
The standard error of the estimate is about 6 ˚C, nearly as large as the 6.4 ˚C change per 1000 m of elevation.
This model is also not very accurate. By itself, latitude is not a good predictor of temperature.
This similarity in pattern suggests that elevation and latitude combined may produce a strong predictive model.
Model Summary
Model 1: R = .952, R² = .907, Adjusted R² = .906, Std. Error of the Estimate = 1.88058
Predictors: (Constant), Elevation, Latitude
Dependent Variable: Average Temperature
ANOVA
Regression: Sum of Squares = 6595.775, df = 2, Mean Square = 3297.887, F = 932.505, Sig. = .000
Residual:   Sum of Squares = 675.489, df = 191, Mean Square = 3.537
Total:      Sum of Squares = 7271.264, df = 193
Dependent Variable: Average Temperature; Predictors: (Constant), Elevation, Latitude
Model: Elevation + Latitude
Coefficients
            B       Std. Error  Beta   t        Sig.  Zero-order  Partial  Part   Tolerance  VIF
(Constant)  57.936  2.008              28.852   .000
Elevation   -.005   .000        -.949  -41.620  .000  -.824       -.949    -.918  .936       1.068
Latitude    -1.032  .048        -.494  -21.658  .000  -.254       -.843    -.478  .936       1.068
Dependent Variable: Average Temperature
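The combined equation is Predicted Temperature = 57.936 – 0.005(Elevation) – 1.032(Latitude). A quick sketch applying it to hypothetical station values (elevation 2000 m, latitude 40°; assumed for illustration, and note the printed coefficients are rounded, so any prediction is approximate):

```python
# Assumed station values, for illustration only.
elevation = 2000.0  # meters
latitude = 40.0     # degrees north

temp = 57.936 - 0.005 * elevation - 1.032 * latitude
print(round(temp, 3))  # 6.656
```

Both slopes are negative: higher and farther north both mean colder, matching the coefficient table.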
The standard error of the estimate is less than 2 ˚C, making this by far the best estimator.
This model is very accurate (R² = .907, about 91% of the variance explained).
Outlier?
Significantly over/under predicted locations.
There does not appear to be any spatial pattern to the distribution of residuals.
• The residuals appear to be spatially random.
• The number of large over/under predictions is about equal.
• It might be a good idea to examine large over/under predicted locations in greater detail.
Over-prediction
Name        Lon    Lat    Elev  Temp   Resid
Humrogi     71.33  38.28  1737  12.17  3.10
Dzhergetal  73.1   41.57  1800  10.43  5.10
Gasan-kuli  39.22  52.22  23    16.06  12.13

Under-prediction
Name     Lon    Lat    Elev  Temp   Resid
Kushka   62.35  35.28  57    15.23  -5.99
Susamyr  74     42.2   2087  -1.95  -5.06
Aksai    76.49  42.07  3135  -7.27  -4.86
An initial inspection does not show any locational influences, with the exception of Gasan-kuli, which is located far from the other sites.
Key Points:
1. Let theory drive your selection of independent variables.
   • Individual variable analyses (regressions) were misleading.
2. Use the tools available.
   • Both statistics and graphs.
3. Map residuals and look for patterns.
   • Patterns may be of interest.
   • The absence of patterns is NOT a failure.