Prediction of house price using multiple regression

Date post: 26-Jan-2017
Upload: vinovk
Page 1: Prediction of house price using multiple regression

Prediction of House Price using Multiple Regression

By

Vinod Kumar Shanmugam

MATH 661 – APPLIED STATISTICS

PROFESSOR ARIDAMAN JAIN

FALL 2009

ABSTRACT

• This project focuses on predicting the selling price of a house from parameters such as year built, square footage, lot size, number of bedrooms and bathrooms, features, and Walk Score.

• The data is taken from www.zillow.com.

• What is zillow.com? Zillow is an online real estate service that aims to give people an edge in real estate by providing valuable tools and information.

• PROJECT OBJECTIVE: This project aims to construct a mathematical model using multiple regression to estimate the selling price of a house from a set of predictor variables.

• ANALYSIS SOFTWARE USED: SAS (Statistical Analysis System)

VARIABLES USED FOR ANALYSIS

LIST OF DEPENDENT AND INDEPENDENT VARIABLES

• We have 8 independent variables and 1 dependent variable. We screen the variables based on their correlation coefficient with price and on the amount of variability explained by the model (R-square).

STATISTICAL APPROACH

• The statistical approach used here is Multiple Regression.

• What is Multiple Regression? Multiple regression uses more than one independent variable to predict a dependent variable.

• EQUATION FOR MULTIPLE REGRESSION:

-> Y = b0 + b1*X1 + b2*X2 + ... + bp*Xp

-> X1, X2, ..., Xp are the independent variables, and Y is the housing price, the dependent variable being predicted or explained.

-> b0 is the constant, or intercept.

-> b1 is the slope (beta coefficient) for X1, b2 is the beta coefficient for X2, and so on.

• The coefficients b0, b1, ..., bp are estimated using the least-squares method.
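As a rough illustration of the least-squares step, the sketch below solves the normal equations (X'X) b = X'y for a small made-up data set. The slides used PROC REG in SAS; this Python version, the function name fit_least_squares, and the data are assumptions for illustration only.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    # Design matrix with a leading 1 for the intercept b0.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Made-up data generated exactly from Y = 1 + 2*X1 + 3*X2 (not the Zillow data).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
b = fit_least_squares(rows, y)
print([round(v, 6) for v in b])
```

Because the made-up Y values follow the equation exactly, the fitted coefficients should recover b0 = 1, b1 = 2 and b2 = 3 up to floating-point error. With real data, a statistics package would normally be used instead of a hand-written solver.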

EXPLORATORY DATA ANALYSIS

• The exploratory data analysis uses scatter plots of house price against each predictor variable, both with and without a natural log transform of the price variable.

• The log transformation of price is needed to make the relationship between price and the independent variables closer to linear, and thereby to improve prediction accuracy.

DISTRIBUTION OF HOUSING PRICE VARIABLE WITHOUT NATURAL LOG TRANSFORM

• Distribution

DISTRIBUTION OF HOUSING PRICE VARIABLE WITH NATURAL LOG TRANSFORM

• Distribution: 1) Normal probability plot 2) Histogram

1) After the natural log transform, the housing price appears very close to normally distributed. This helps make the relationship between housing price and the predictor variables closer to linear.

2) The distribution is much less skewed than before the transformation.
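The effect of the log transform on skewness can be checked numerically. The sketch below is only an illustration: the prices are hypothetical right-skewed values, not the project's Zillow data, and the skewness helper is an assumed name.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the cubed std deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed selling prices in dollars.
prices = [150_000, 165_000, 180_000, 200_000, 230_000, 280_000, 400_000, 750_000]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of price:      {skew_raw:.2f}")
print(f"skewness of log(price): {skew_log:.2f}")
```

For these values the skewness of log(price) is smaller than that of the raw price, which matches the qualitative observation on the slide.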

CORRELATION AND REGRESSION ANALYSIS:

• What is correlation? Correlation is a statistical relation between two or more variables such that systematic changes in the value of one variable are accompanied by systematic changes in the other. It is represented by r and ranges from -1 to +1.

• Pearson correlation coefficient: measures the association between the dependent variable price and other features of the house such as age, sqft, appliances_cnt, etc.

• The highlighted correlations have an absolute value greater than 0.5, indicating a strong positive or negative correlation. These variables are able to explain the variation of house price in the regression model better than the other variables. Automatic variable selection is done in SAS based on the amount of variability explained by the model.
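The screening rule described above (keep predictors whose correlation with price exceeds 0.5 in absolute value) can be sketched as follows. The data values and the walkscore column name are hypothetical; sqft and age are names taken from the slides.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for six houses (not the project's data).
log_price = [12.0, 12.2, 12.5, 12.7, 13.0, 13.4]
predictors = {
    "sqft": [1100, 1300, 1500, 1900, 2200, 2600],
    "age": [60, 52, 41, 30, 18, 5],
    "walkscore": [55, 70, 48, 66, 52, 61],
}

correlations = {name: pearson_r(vals, log_price) for name, vals in predictors.items()}
# Keep predictors with a strong positive or negative correlation.
selected = [name for name, r in correlations.items() if abs(r) > 0.5]

for name, r in correlations.items():
    print(f"{name:10s} r = {r:+.3f}")
print("selected:", selected)
```

Using the absolute value keeps strongly negative predictors such as age, which a plain "greater than 0.5" test would discard.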

MULTIPLE REGRESSION ANALYSIS:

• Multiple regression was run on the data set using the PROC REG procedure in SAS.

ANOVA TABLE:

Main Points from SAS output:

• The F-value is 37.32 and its p-value is < 0.05, so the regression model is significant.

• The p-values for the t-statistics of the selected variables are all <= 0.05, so all the variables are significant in the model.

• The R-square is 0.8092, which means 80.92% of the total variability is explained by the age, lotsizesqft, bedrooms, appliances_cnt and numfloors variables.

• The regression equation to predict the house price is

Identifying Outliers using residuals

• After identifying influential observations, the outliers were removed from the data. The top 3 and bottom 3 cases by residual were removed to see whether this improves the variability explained by the model. The R-square value increased from 0.8092 to 0.8322, so we retain the model refit without the outliers.
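A minimal sketch of the trimming step, assuming "top 3 and bottom 3" refers to the cases with the largest and smallest raw residuals (actual minus predicted). The slides do not say which SAS diagnostics were used, so the function name and the data below are hypothetical.

```python
def drop_extreme_residuals(actual, predicted, k=3):
    """Return the indices kept after dropping the k largest and k smallest residuals."""
    if len(actual) <= 2 * k:
        raise ValueError("need more than 2*k cases")
    residuals = [a - p for a, p in zip(actual, predicted)]
    # Case indices ordered from most negative to most positive residual.
    order = sorted(range(len(residuals)), key=residuals.__getitem__)
    return sorted(order[k:len(order) - k])

# Hypothetical log prices and model predictions for ten houses.
actual = [12.10, 12.30, 12.45, 12.60, 12.20, 12.80, 12.55, 12.35, 12.70, 12.90]
predicted = [12.08, 12.80, 12.44, 12.00, 12.23, 12.35, 12.95, 12.35, 13.05, 12.40]

kept = drop_extreme_residuals(actual, predicted, k=3)
print("cases kept for the refit:", kept)
```

The model would then be refit on the kept cases only, and the two R-square values compared as on the slide.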

Main Points from SAS output

• The F-value is 37.70 and its p-value is < 0.05, so the regression model is significant.

• The p-values for the t-statistics of the selected variables are all <= 0.05, so all the variables are significant in the model except numfloors. We could remove that variable from the model if we wanted to.

• The R-square is 0.8322, which means 83.22% of the total variability is explained by the age, lotsizesqft, bedrooms, appliances_cnt and numfloors variables after removing the outliers.
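The R-square and F-value reported by SAS are linked by a standard identity, F = (R²/p) / ((1 − R²)/(n − p − 1)). The sketch below applies it to the reported figures. The sample sizes (50 cases before and 44 after dropping 6 outliers) are an assumption: the slides do not state them, but they are consistent with the reported F-values for p = 5 predictors.

```python
def r_squared(actual, predicted):
    """Share of total variability explained: 1 - SSE / SST."""
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1 - sse / sst

def f_statistic(r2, n_obs, n_predictors):
    """Overall F statistic of a regression with an intercept."""
    df_error = n_obs - n_predictors - 1
    return (r2 / n_predictors) / ((1 - r2) / df_error)

# Toy check of r_squared on made-up numbers.
r2_toy = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])

# Reported R-square values; n_obs = 50 and 44 are assumed, not from the slides.
f_before = f_statistic(0.8092, n_obs=50, n_predictors=5)
f_after = f_statistic(0.8322, n_obs=44, n_predictors=5)
print(f"R-square (toy): {r2_toy:.2f}")
print(f"F before removing outliers: {f_before:.2f}")
print(f"F after removing outliers:  {f_after:.2f}")
```

Under these assumed sample sizes, the computed values agree with the reported F-values of 37.32 and 37.70 to within the rounding of R-square.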

FINAL MODEL

Explaining the effect of each independent variable selected by the regression model

• Interpreting regression coefficients: each regression coefficient measures the average change in Y per unit change in the corresponding independent variable, with the other independent variables held constant.

a) Start by comparing two houses with the same input values: they get the same output value, so there is no change.
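If the model is fitted on the natural log of price, as the log-transform slides suggest, a coefficient b translates into a multiplicative effect on price: changing X by delta multiplies the predicted price by exp(b * delta). The sketch below illustrates this under that assumption. The coefficient value is hypothetical, not the one estimated in the project.

```python
import math

def pct_change_in_price(coefficient, delta):
    """Percent change in predicted price when X changes by delta,
    assuming the model is ln(price) = b0 + b1*X1 + ... (an assumption)."""
    return (math.exp(coefficient * delta) - 1) * 100

# Hypothetical age coefficient: each extra year lowers ln(price) by 0.004.
b_age = -0.004

same_house = pct_change_in_price(b_age, 0)    # identical inputs
ten_years = pct_change_in_price(b_age, 10)    # house that is 10 years older
print(f"same age:       {same_house:+.2f}%")
print(f"10 years older: {ten_years:+.2f}%")
```

Two houses with identical inputs give a 0% difference, and a 10-year age gap with this hypothetical coefficient gives a drop of roughly 3.9% in predicted price.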

Explaining the effect of each independent variable selected by the regression model (Cont..)

• Explaining the age coefficient

• Explaining the lot coefficient

Explaining the effect of each independent variable selected by the regression model (Cont..)

• Predicting: New Case 1

• Predicting: New Case 2

• This will help the house seller or realtor suggest modifications to an existing house if they want a good selling price in the neighborhood.

• The X (independent) variables should lie within the minimum and maximum of the data set that was used to fit the regression model, because predictions outside that range (extrapolation) are unreliable.
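One way to enforce the range restriction is to check each input against the fitted range before predicting. The sketch below assumes a model on ln(price). The coefficients and ranges are hypothetical placeholders, and only the variable names age and bedrooms come from the slides.

```python
import math

def predict_log_price(coeffs, ranges, case):
    """Predict ln(price) for a new case, refusing to extrapolate."""
    for name, value in case.items():
        low, high = ranges[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the fitted range [{low}, {high}]")
    return coeffs["intercept"] + sum(coeffs[name] * value for name, value in case.items())

# Hypothetical coefficients and training ranges (not the project's estimates).
coeffs = {"intercept": 12.0, "age": -0.004, "bedrooms": 0.05}
ranges = {"age": (0, 90), "bedrooms": (1, 6)}

log_pred = predict_log_price(coeffs, ranges, {"age": 20, "bedrooms": 3})
print(f"predicted ln(price): {log_pred:.2f}")
print(f"predicted price:     {math.exp(log_pred):,.0f}")
```

Raising an error for out-of-range inputs is a deliberately strict choice; a warning would also be reasonable if occasional mild extrapolation is acceptable.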

PLOT OF ACTUAL VS PREDICTED VALUE

• BEFORE REMOVING OUTLIERS

• AFTER REMOVING OUTLIERS

• In the plot of actual vs predicted values after removing the outliers, the association between actual and predicted values looks quite linear.

CUMULATIVE DISTRIBUTION OF PREDICTION ERROR %

• The prediction error % is computed as abs(actual - predicted) * 100 / actual.

• The cumulative chart shows, relative to the actual selling price:

-> 70% of cases (0.7 on the y-axis) have less than 9% prediction error

-> 80% of cases have less than 10% prediction error

-> 90% of cases have less than 12% prediction error
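The error formula and the cumulative shares read off the chart can be computed directly. The sketch below uses five hypothetical actual and predicted prices, so its shares do not reproduce the 70/80/90% figures from the slide.

```python
def pct_errors(actual, predicted):
    """Prediction error % for each case: abs(actual - predicted) * 100 / actual."""
    return [abs(a - p) * 100 / a for a, p in zip(actual, predicted)]

def share_within(errors, threshold):
    """Fraction of cases whose prediction error % is below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)

# Hypothetical actual and predicted selling prices in dollars.
actual = [200_000, 250_000, 300_000, 180_000, 220_000]
predicted = [190_000, 255_000, 330_000, 171_000, 220_000]

errors = pct_errors(actual, predicted)
for threshold in (9, 10, 12):
    print(f"share of cases with error < {threshold}%: {share_within(errors, threshold):.0%}")
```

Evaluating share_within over a grid of thresholds gives the points of the cumulative distribution curve shown on the slide.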

CONCLUSION

• We are able to predict the house price with around 90% accuracy (a prediction error under about 10%) for most cases, and the model has a good R-square of 0.83, meaning that 83% of the variability is explained by the model. We are also able to interpret the model's coefficient estimates.

SCOPE OF THE PROJECT:

• In future, the latitude, longitude and elevation of the house could be included in the model to predict the house price more accurately. Future work could also add demographic variables such as income, number of children, education and age of the family group, to explain more of the variability in house prices and to predict them more effectively.