LAUREN PARKER & WEI YING
MUSA 507: MIDTERM
Due: Oct. 17, 2014
1 Introduction
The purpose of this project is to develop a regression model that can accurately predict home sale prices at multiple locations within Philadelphia. Property is a significant part of the economy and is highly relevant to city planning; as such, housing market prediction has a great impact on housing trade and economic development. Accordingly, accurately predicting housing prices is a vital task that deserves close attention.
However, making accurate predictions is anything but easy, because the variability of actual home prices is driven by many factors. The first challenge of this project is therefore to determine which factors are related to home price, both by referring to the housing literature and by testing candidate variables' correlation with home prices. Additionally, the accuracy of prediction depends on the regression model we build, but linear regressions are a somewhat simple approximation of reality and can fail to capture the exact relationship between the independent variables and the dependent variable. Finding variables that are strongly related to home prices is thus an important, yet difficult, task for a good linear regression model.
Our overall modeling strategy is to build and improve a regression model using training sites, or observations for which we have home sale prices, and then to use our best model to predict sale prices for a test dataset, or sites for which we have no sale price. We set the existing home prices of the training sites as the dependent variable and compile several explanatory factors as independent variables. We check the relationship between each independent variable and home prices in order to make log transformations where necessary and remove outliers to improve the model. We then test the regression model within the training data and judge the residual error between actual and predicted values.
We included 12 variables in the final regression. The adjusted R2 of our model on the training samples is 0.56, which suggests a moderately explanatory regression. In our out-of-sample testing, we obtained an R2 of 0.5624 and a root mean square error (RMSE) of 0.8533, which indicates a reasonably strong model. Overall, we believe the regression model to be a fairly reliable way to predict home prices in Philadelphia.
2 Methodology
Data Collection and Processing
In addition to the data originally provided by Ken Steif, we tried to find other possible variables that may explain home
prices. Figure 1 presents all of our original candidate variables, data sources and processing steps. In total, we tested 44
variables to determine which were best at improving the explanatory power of the model. We also present maps of our
dependent variable (Figure 2) and several key independent variables (Figure 3).
Figure 1. Table of all variables tested for explanatory power.
Variable | Description | Data Source | Processing
Ken_ID | Unique ID | Ken Steif | n/a
SalePrice | Dependent variable; sale price in dollars | Ken Steif | n/a
SaleMonth | Month of the year of sale | Ken Steif | n/a
Frontage | Width of parcel | Ken Steif | n/a
Depth | Length of parcel | Ken Steif | n/a
GarageDum | Whether the parcel contains a garage | Ken Steif | n/a
ExtCond | Categorical variable describing condition of building exterior | Ken Steif | n/a
YearBuilt | Year of construction | Ken Steif | n/a
NumRooms | Number of rooms | Ken Steif | n/a
NumBedRms | Number of bedrooms | Ken Steif | n/a
NumBaths | Number of baths | Ken Steif | n/a
TotLivArea | Total internal square footage | Ken Steif | n/a
Pop | Population in block group of house point | Census Bureau | ArcGIS: spatial join (closest polygon)
MedInc | Median household income of block group of house | American Community Survey | ArcGIS: spatial join (closest polygon)
White | Number of white households in block group of house | Census Bureau | ArcGIS: spatial join (closest polygon)
AfAm | Number of African-American households in block group of house | Census Bureau | ArcGIS: spatial join (closest polygon)
AmIn | Number of American Indian households in block group of house | Census Bureau | ArcGIS: spatial join (closest polygon)
Asian | Number of Asian households in block group of house | Census Bureau | ArcGIS: spatial join (closest polygon)
HiPI | Number of Hawaiian and Pacific Islander households in block group of house | Census Bureau | ArcGIS: spatial join (closest polygon)
Other | Number of households of "Other" race in block group of house | Census Bureau | ArcGIS: spatial join (closest polygon)
TwoRace | Number of households of "Two Races" in block group of house | Census Bureau | ArcGIS: spatial join (closest polygon)
HH | Number of households in block group | Census Bureau | ArcGIS: spatial join (closest polygon)
OcHH | Number of occupied households in block group | Census Bureau | ArcGIS: spatial join (closest polygon)
VacHH | Number of vacant households in block group | Census Bureau | ArcGIS: spatial join (closest polygon)
PctWhite | Percent white households in block group | Census Bureau | ArcGIS: spatial join (closest polygon)
AveTaxMktVal | Average market value of houses nearest to Ken's house | City of Phila. Office of Property Assessment | ArcGIS: Generate Near Table tool; generate average market value of neighbor houses
SchDistFt | Distance (ft) to closest school | PASDA | ArcGIS: spatial join (closest point)
BusDistFt | Distance (ft) to closest bus stop | DVRPC | ArcGIS: spatial join (closest point)
StrDistFt | Distance (ft) to closest street intersection | DVRPC | ArcGIS: spatial join (closest polygon)
ParkDis | Distance (ft) to closest park point | Open Data Philly | ArcGIS: spatial join (closest polyline)
HealthDis | Distance (ft) to closest health center | Open Data Philly | ArcGIS: spatial join (closest point)
RecreaDis | Distance (ft) to closest recreation place | Open Data Philly | ArcGIS: spatial join (closest point)
AveSalePrKenNear | Average sale price of houses nearest Ken's house | Ken Steif | ArcGIS: Generate Near Table tool
EmpPctLabor | Employment rate within the labor force in census tract of house point | American Community Survey | Calculate the percentage
EmpPctTotal | Employment rate within the total population in census tract of house point | American Community Survey | Calculate the percentage
GMAAvePrice | Average sale price of Ken's houses within each Geographic Market Area (GMA) | Open Data Philly | ArcGIS: spatial join (average)
CenterCityDis | Distance (ft) to Philadelphia Center City | Open Data Philly | ArcGIS: spatial join (closest polygon)
ParkPolDis | Distance (ft) to closest park polygon | Open Data Philly | ArcGIS: spatial join (closest polygon)
CommerDis | Distance (ft) to closest commercial zone | Open Data Philly | ArcGIS: spatial join (closest polygon)
IndusDis | Distance (ft) to closest industry zone | Open Data Philly | ArcGIS: spatial join (closest polygon)
WaterDis | Distance (ft) to closest hydrology polygon | Open Data Philly | ArcGIS: spatial join (closest polygon)
ParkingDis | Distance (ft) to closest parking lot | Open Data Philly | ArcGIS: spatial join (closest point)
ParkNo | Number of parks within a 0.X-mile buffer of house point | Open Data Philly | ArcGIS: buffer; spatial join (count)
RecreNo | Number of recreation places within a 0.X-mile buffer of house point | Open Data Philly | ArcGIS: buffer; spatial join (count)
ParkingNo | Number of parking lots within a 0.X-mile buffer of house point | Open Data Philly | ArcGIS: buffer; spatial join (count)
Districts_Index | Dummy: 1 if Ken's house was in one of the districts; set as fixed variables (e.g., District_n) | Open Data Philly | ArcGIS: spatial join (point falls inside)
Condition_Index | Dummy variables indicating condition of building exterior (e.g., HighCond) | Ken Steif | Mark site condition as High condition, Medium condition, or Low condition
Figure 2. Map of the dependent variable (sale price in the training data) in Philadelphia.
Figure 3. Maps of selected independent variables.
Methods
Our methodology involved 6 main parts: 1) preparing our dependent and independent variables before running the first
regression model, 2) running the first regression model, 3) removing variables and improving the model based on results
of regression 1, 4) running our final regression, 5) performing validation tests on our training dataset, and 6) making sale
price predictions on our test dataset. A flow chart of our methodology is presented in Figure 4.
Figure 4. Flow chart illustrating the methodology of the regression analysis.
We began by taking a subset of the data to include only training sites (n = 12,155). We then removed any observations with very high sale prices (above $1,100,000); these observations were considered outliers and, as such, would reduce the predictive capability of the model. We then plotted a histogram of the sale price for all remaining observations to determine whether a log transformation would be appropriate. The distribution was heavily right-skewed, so we used a log transformation of the dependent variable (Figure 5).
Figure 5. Histograms of (a) sale price, and (b) log(sale price) with a more normal distribution.
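The pre-processing above can be sketched in R. This is a minimal illustration, assuming the combined data are in a data frame named dat (an illustrative name) and that test sites carry a missing SalePrice:

    # Keep only training sites (those with an observed sale price)
    train <- dat[!is.na(dat$SalePrice), ]
    # Drop high-price outliers above $1,100,000
    train <- train[train$SalePrice <= 1100000, ]
    # Compare the raw and log-transformed distributions (Figure 5)
    par(mfrow = c(1, 2))
    hist(train$SalePrice, main = "Sale Price", xlab = "Price ($)")
    hist(log(train$SalePrice), main = "log(Sale Price)", xlab = "log(Price)")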
Before running a regression model, we checked for correlation between each pair of candidate variables in an effort to omit any variables that were collinear. Figure 6 presents a sample correlation matrix for 15 representative variables.
Figure 6. Correlation matrix between a sample of independent variables.
As shown in the matrix above, using both total households (HH10) and total occupied households (OcHH10) in our regression, for example, would not be sound, as these two variables are highly collinear. Based on the correlation matrix, we removed total households as a candidate variable.
Key: 1. log(SalePrice), 2. TotLivArea, 3. Depth, 4. NumRooms, 5. MedInc10, 6. HH10, 7. OcHH10, 8. AveTaxMktVal, 9. EmpPctTotal, 10. GMAAvePrice, 11. CenterCityDis, 12. PctWhite, 13. BusDistFt, 14. ParkingDis, 15. HealthDis
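A sketch of this collinearity screen in R, assuming train from the pre-processing step and column names as in Figure 1 (the subset of variables shown is illustrative):

    vars <- c("TotLivArea", "Depth", "NumRooms", "MedInc10", "HH10", "OcHH10",
              "AveTaxMktVal", "EmpPctTotal", "GMAAvePrice", "PctWhite")
    # Pairwise Pearson correlations, rounded for readability
    round(cor(train[, vars], use = "complete.obs"), 2)
    # HH10 and OcHH10 correlate at nearly 1, so HH10 is dropped as a candidate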
We then plotted each remaining independent variable against the log of sale price to determine for which independent variables a log transformation was appropriate. An example of this process is shown in Figure 7.
Figure 7. Plot of (a) average GMA sale price and (b) log of average GMA sale price against log of sale price.
Once we had performed the necessary transformations, we ran our first regression:

log(SalePrice) = β0 + β1·TotLivArea + β2·Depth + β3·NumRooms + β4·NumBaths + β5·log(MedInc10) + β6·log(OcHH10) + β7·AveTaxMktVal + β8·EmpPctTotal + β9·log(GMAAvePrice) + β10·PctWhite + β11·HighCond + β12·MedCond + …
where β0 is the intercept, the βi are the regression coefficients for the independent variables, and the trailing terms (…) denote the additional candidate variables that were later removed (see below). This produced a fairly good R-squared value of 0.56, and most independent variables were highly statistically significant, but we wanted to look more deeply into our errors to determine whether there were ways to improve the model. We produced diagnostic plots of the fitted values against the observed dependent values (Figure 8a), the observed dependent values against the standardized residuals (Figure 8b), the fitted values against the standardized residuals (Figure 8c), and the observed independent variables against the standardized residuals (an example is shown in Figure 8d).
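In R, the regression and diagnostic plots might be sketched as follows. The formula lists only the twelve variables retained in the final model; regression 1 also included additional candidate terms, omitted here for brevity, and we assume the condition dummies HighCond and MedCond are coded 0/1:

    reg1 <- lm(log(SalePrice) ~ TotLivArea + Depth + NumRooms + NumBaths +
                 log(MedInc10) + log(OcHH10) + AveTaxMktVal + EmpPctTotal +
                 log(GMAAvePrice) + PctWhite + HighCond + MedCond,
               data = train)
    summary(reg1)

    # Diagnostic plots (assuming complete cases in the model columns)
    std_res <- rstandard(reg1)                 # standardized residuals
    plot(fitted(reg1), log(train$SalePrice))   # Figure 8a
    plot(log(train$SalePrice), std_res)        # Figure 8b
    plot(fitted(reg1), std_res)                # Figure 8c
    plot(train$TotLivArea, std_res)            # Figure 8d (one example)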
Figure 8. Diagnostic plots to facilitate improvements to regression 1.
As we can see from the plot in Figure 8(d), one outlier was greatly influencing the residuals for this independent variable. Using these plots, we removed such outliers and omitted any independent variables that exhibited high heteroscedasticity (i.e., standardized residuals that were clearly non-random). The remaining independent variables,
summarized in Figure 9, were used to develop our final regression. The final regression was then used to conduct out-of-
sample testing, cross-validation, and final predictions of sale price for our test dataset. The results of these steps are
described in the following section.
Figure 9. Summary of variables used in the final regression model.
Variable Mean St.Dev Min Max
SalePrice 113,980.00 135,817.50 1,000 2,138,000
Depth 8,048.57 3,223.00 100 62,900
NumRooms 48.085 25.639 0 140
NumBaths 8.929 5.393 0 51
MedInc10 39,038.54 19,649.74 4,432 178,194
OcHH10 486.427 198.922 40 1,535
AveTaxMktVal 135,091.60 272,913.70 2,500.00 27,684,000.00
EmpPctTotal 50.844 11.105 22.321 81.658
GMAAvePrice 105,227.80 115,500.00 2,750.00 1,283,938.00
TotLivArea 1,322.90 643.076 300 30,096
MedCond - - 0 1
LowCond - - 0 1
PctWhite 0.414 0.331 0 0.976
3 Results
After compiling the data, cleaning it, transforming variables, and removing outliers, we developed our final regression model1:

log(SalePrice) = β0 + β1·TotLivArea + β2·Depth + β3·NumRooms + β4·NumBaths + β5·log(MedInc10) + β6·log(OcHH10) + β7·AveTaxMktVal + β8·EmpPctTotal + β9·log(GMAAvePrice) + β10·PctWhite + β11·HighCond + β12·MedCond

where β0 is the intercept and β1 through β12 are the regression coefficients for the independent variables, all of which are defined in Figure 1. The results of this regression are illustrated in Figure 10.

1 In R, natural log transformations are specified by the operator log(). As such, references to log transformations in this report indicate natural log transformations and should not be confused with log base 10.
Figure 10. Linear regression results for log(sale price) of selected homes in Philadelphia.
Variable Estimate Std. Error t Value Pr(>|t|)
(Intercept) 1.83E+00 2.76E-01 6.634 3.40e-11 ***
TotLivArea 1.31E-04 1.69E-05 7.726 1.20e-14 ***
Depth 1.74E-05 2.78E-06 6.276 3.59e-10 ***
NumRooms -2.00E-03 4.57E-04 -4.388 1.15e-05 ***
NumBaths 9.20E-03 2.14E-03 4.304 1.69e-05 ***
log(MedInc10) 9.27E-02 2.20E-02 4.213 2.54e-05 ***
log(OcHH10) 6.51E-02 2.27E-02 2.863 0.0042 **
AveTaxMktVal 1.18E-06 1.42E-07 8.296 < 2e-16 ***
EmpPctTotal 1.08E-02 1.18E-03 9.164 < 2e-16 ***
log(GMAAvePrice) 6.30E-01 1.62E-02 38.996 < 2e-16 ***
PctWhite 3.95E-01 3.30E-02 11.984 < 2e-16 ***
HighCond -7.18E-01 4.47E-02 -16.089 < 2e-16 ***
MedCond -3.81E-01 3.82E-02 -9.964 < 2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8664 on 12138 degrees of freedom
Multiple R-squared: 0.5599, Adjusted R-squared: 0.5595
F-statistic: 1287 on 12 and 12139 DF, p-value: < 2.2e-16
The results indicate that the independent variables explain about 56% of the observed variability in sale price (R2 = 0.56). While this R-squared value is not regarded as high in most physical sciences, it should be noted that all of the included independent variables are highly statistically significant (p ≤ 0.01). Our results generally show logical relationships between our independent variables and the dependent variable. For example, it is intuitive that sale price would increase with total livable area. Because the dependent variable is log-transformed, a coefficient b implies approximately a 100·(e^b - 1) percent change in sale price per unit increase in the predictor: every additional square foot of livable area is associated with an increase of about 0.013% in sale price, and every additional bathroom with an increase of about 0.9%.
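This interpretation can be checked numerically from the estimates in Figure 10:

    # With log(SalePrice) as the response, a coefficient b implies roughly a
    # 100 * (exp(b) - 1) percent change in price per unit of the predictor
    b_area  <- 1.31e-04                # TotLivArea estimate (Figure 10)
    b_baths <- 9.20e-03                # NumBaths estimate (Figure 10)
    100 * (exp(b_area) - 1)            # ~0.013% per additional square foot
    100 * (exp(b_baths) - 1)           # ~0.92% per additional bathroom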
The results for the model on the training dataset can be further summarized as:
Adjusted R-squared: 0.5595
Root mean square error (RMSE): 0.8659
Mean absolute error: 0.6135
Mean absolute percent error (MAPE): 112.2%
These results suggest that, on average, our predicted values differ from our observed values by about 112%. To understand the accuracy of our model more deeply, we developed several diagnostic plots (Figure 11).
Figure 11. Plots of (a) observed versus predicted sales price, (b) observed sales price versus standardized residuals, (c)
fitted dependent values versus standardized residuals.
These plots suggest that our model predicts sale values only fairly well; they reveal two main issues: 1) there is not a perfect fit between observed and predicted values, and 2) the residuals exhibit a non-random pattern. There is strong variability in the plot of observed versus predicted values, but at the same time, the line of fit between the observed and predicted values is not far off from the optimal 45-degree line, suggesting that our fitted values do not deviate by an extreme amount from observed values. There is a pattern in the plot of observed sale values against standardized residuals, indicating that we may not meet the assumption of homoscedasticity (i.e., the residuals are not completely random), which is an issue that should be addressed. Finally, we do not see constant variation in the plot of fitted values against standardized residuals; furthermore, there seems to be a bifurcated trend. This suggests, again, that the residuals may not be completely random, and also that there may be an explanatory independent variable that we did not include. We looked more closely at the spatial distribution of our residuals to determine whether there was spatial auto-correlation (i.e., clustering) by mapping the residuals (Figure 12).
Figure 12. Map of (a) linear regression residuals and (b) results of Moran's I test.
Based on a visual inspection of the map of our residuals, we see that there may be some spatial auto-correlation in the form of clustering. For example, there is a cluster of high, positive residuals in South Philly, and perhaps the smallest errors (residuals closest to zero, in pale yellow) are in Northeast Philly. We ran a test for Moran's I in ArcGIS, which reinforced that our residuals were spatially auto-correlated. Due to the short timeframe of this project, we could not properly address the errors present in our model, but we recognize that including additional explanatory variables and controlling for the non-random nature of our residuals would be important follow-up work. Perhaps a spatial regression would be more appropriate.
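We ran the Moran's I test in ArcGIS; an equivalent check in R using the spdep package might look like the sketch below, where coords (a matrix of house coordinates) and res (the regression residuals) are illustrative names:

    library(spdep)
    nb <- knn2nb(knearneigh(coords, k = 5))   # 5 nearest neighbors per house
    lw <- nb2listw(nb, style = "W")           # row-standardized spatial weights
    moran.test(res, lw)                       # significant I => clustered residuals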
We can further explore the localized error in our model by mapping the mean absolute percent error (MAPE) by neighborhood. Figure 13 illustrates our aggregate MAPE by neighborhood. While most of the neighborhoods with low MAPE are areas that contain the most open space (e.g., the neighborhood encompassing the Wissahickon), there are some highly populated areas where the model predicted with high accuracy. Those neighborhoods include Center City East, Chinatown, Callowhill, and East Poplar.
Figure 13. Mean absolute percent error (MAPE) by neighborhood.
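A sketch of the aggregation behind Figure 13, assuming each training observation carries a Neighborhood field (an illustrative name) from a spatial join and reg1 is the fitted model from above:

    # Absolute percent error on the dollar scale (back-transformed predictions)
    train$ape <- abs(train$SalePrice - exp(fitted(reg1))) / train$SalePrice * 100
    # Mean absolute percent error (MAPE) by neighborhood
    aggregate(ape ~ Neighborhood, data = train, FUN = mean)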
Cross-Validation
We then performed a 10-fold cross validation on our training dataset, where each observation was predicted by
regressing 9 other observations. In this cross-validation, we see a dramatic improvement in our predictive ability (R-
squared = 0.8224). We theorize that this improvement is biased, in part, by using a very small sample (n=10).
Variables Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.53E+00 1.79E-01 36.55 < 2e-16 ***
Ken_ID 2.91E-04 2.17E-06 134.066 < 2e-16 ***
TotLivArea 3.43E-05 1.08E-05 3.186 0.00144 **
Depth -2.53E-05 1.79E-06 -14.101 < 2e-16 ***
NumRooms 5.79E-04 2.91E-04 1.99 0.04657 *
NumBaths 4.19E-03 1.36E-03 3.085 0.00204 **
logMedInc 5.79E-02 1.40E-02 4.146 3.41e-05 ***
logOcHH -1.29E-02 1.44E-02 -0.893 0.37195
AveTaxVal 1.06E-06 9.02E-08 11.739 < 2e-16 ***
EmpPct 3.68E-03 7.50E-04 4.91 9.24e-07 ***
logGMAPrice 1.92E-01 1.08E-02 17.885 < 2e-16 ***
PctWhite -3.56E-01 2.17E-02 -16.422 < 2e-16 ***
HighCond -1.92E-01 2.86E-02 -6.693 2.28e-11 ***
MedCond -6.10E-02 2.44E-02 -2.503 0.01232 *
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5501 on 12138 degrees of freedom
Multiple R-squared: 0.8226, Adjusted R-squared: 0.8224
F-statistic: 4330 on 13 and 12138 DF, p-value: < 2.2e-16
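A minimal sketch of the 10-fold procedure in R, assuming train and the model formula from above (the fold assignment shown is illustrative):

    set.seed(1)
    folds <- sample(rep(1:10, length.out = nrow(train)))
    pred  <- numeric(nrow(train))
    for (k in 1:10) {
      # Fit on the nine held-in folds, predict the held-out fold
      fit <- lm(formula(reg1), data = train[folds != k, ])
      pred[folds == k] <- predict(fit, newdata = train[folds == k, ])
    }
    cor(pred, log(train$SalePrice))^2   # cross-validated R-squared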
We see in the plots of Figure 14 that similar errors arise: there is a non-random distribution of residuals against observed and predicted values (i.e., heteroscedasticity), and there is non-constant variation.
Figure 14. Results of 10-fold cross validation of training dataset, with (a) observed sales prices against residuals, and (b)
predicted values against residuals.
Out-of-Sample Training Prediction
Next, we performed an out-of-sample prediction by randomly dividing our dataset into two subsets: 1) a sub-training dataset composed of 75% of the observations, and 2) a sub-testing dataset consisting of the remaining 25%. We fit a regression to the sub-training dataset using the same independent variables as in the model above, used it to predict prices in the sub-testing dataset, and obtained the following results:
Adjusted R-squared: 0.5624
Root mean square error (RMSE): 0.8533
Mean absolute error: 0.6174
Mean absolute percent error (MAPE): 106.2%
These results suggest, again, that while the model does have fairly strong explanatory power, there is still some error that has not been properly accounted for.
Prediction for Testing Dataset
We then used our final regression to predict values for sites for which we have no sale prices (i.e., the test dataset). We can see in Figure 15 how the spatial distribution of home prices is translated from the training dataset to the predicted dataset. Generally, in both maps, more expensive houses are clustered in Center City and Old City, and there are pockets of lower-priced houses in Mantua, southern West Philly, and areas around Olney. We interpret these similarities between the actual and predicted data as suggestive of a fairly accurate model.
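The final prediction step might be sketched as follows, assuming test holds the sites without sale prices and reg_final is the fitted final model (both names illustrative):

    # Back-transform log-scale predictions to dollars before mapping (Figure 15)
    test$PredSalePrice <- exp(predict(reg_final, newdata = test))
    summary(test$PredSalePrice)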
Figure 15. Map of predicted values in (a) the training dataset (n=12,152), (b) the test dataset (n=2,154).
4 Discussion
Depending on the application, an R-squared value of 0.56 could be considered either a breakthrough or something to be immediately discounted. As it relates to predicting home prices in Philadelphia, our model perhaps falls between
those two extremes. We recognize that our model has some residual error, only moderate explanatory power, positive
spatial auto-correlation, and possibly some multi-collinearity between independent variables. However, when
comparing the overall spatial distribution of prices in our actual versus predicted datasets, we see similar trends emerge, suggesting that our model can produce quality predictions. Also, there are areas across Philadelphia for which our
residual error was very low; more specifically, if we look at the mean absolute percent error (MAPE) by neighborhood
we see that there are several key neighborhoods close to Center City for which our error is very low.
In our analysis, we found that three of our independent variables strongly improved the model: employment rate,
median household income and the average price of homes within each geographic market area (GMA). We can imagine
why this may be true: it seems reasonable to think that those areas with a higher employment rate have residents who
can afford a more expensive house; likewise, households that earn a higher income may be willing to
spend more on a house. We believe that, despite our spatially-dependent errors, we have accounted for some spatial
variability in price by using the average price of homes within each GMA. The GMAs are already spatial in nature, in that
they spatially divide the city into groups of common housing markets, so it is not such a reach to expect that this variable
controlled for some of our spatial auto-correlation.
5 Conclusion
Given the short timeframe, we believe our model is a strong start. We tested 44 different variables and determined which of them best improved the explanatory power of the model. In the end, our model can explain about 56% of the variability in home prices in Philadelphia. We identified issues in our model, such as spatial auto-correlation and non-random patterns in our residuals, and recognize that these are areas where we could improve the model given more time. At this point, however, we do not recommend this model to Zillow. According to the Zillow website, their current forecasting model has a median error of 6%, with 86.2% of homes forecast within 20% of the sale price. Our current model does not predict within those narrow margins of error, and additional analyses would be required to improve our model to meet Zillow's target forecasting abilities.
In addition to spending more time controlling for our residuals, perhaps by performing a spatial regression, we could incorporate additional datasets that are used in Zillow's forecasting model. Provided we could procure them, we could try to include historical sale values, property tax rate, construction costs, and the percentage of loans that are subprime, as these datasets are included in Zillow's forecasting model. In our current analysis, we could not include these variables because they were either not publicly available at no cost or the geographic level at which we could obtain the data was too aggregate (e.g., county).