Post on 23-Aug-2014
Getting Started with Regression
Presented By: Tim Wilmath, MAI
Prepared For: Hillsborough County Property Appraiser’s Office
[Figure: scatter plot of Sales Prices versus Predicted Values]
History of Regression
Francis Galton created Regression Analysis in 1885 when he was attempting to predict a person’s height based on the height of his or her parents.
Galton found that children born to tall parents tended to be shorter than their parents, and children born to short parents tended to be taller than their parents. Both groups of children regressed toward the mean height of all children.
In 1922, a PhD student by the name of Casper G. Haas suggested using regression for farm land valuation.
What is truly remarkable is that Mr. Haas was using a technique that required a significant amount of calculation, calculation that today is done by sophisticated computer programs in seconds. Mr. Haas did these calculations by hand. In looking at this excerpt, it is remarkable how little the nomenclature and the statistical output have changed in more than 85 years.
Uses of Regression
• Predicting the Weather
• Predicting Election Results
• Predicting Sales Prices
What is Regression?
When Regression Analysis is used to predict sales prices or establish assessments it becomes an
Automated Sales Comparison Approach
Steps in Regression
1. Data Exploration and cleanup
2. Specifying the model
3. Calibrating the model
4. Interpreting the results
Data Exploration & Cleanup
Is there a pattern suggesting a relationship between variables?
Because of the potential for extreme values to influence the mean,
modelers often remove or “trim” extreme values.
[Figure: scatter plot of SALES PRICE versus HEATED AREA]
Note the outliers. These will adversely affect our final values if we don’t deal with them now.
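The trimming step described above can be sketched as follows. This is only a minimal illustration: it swaps in the common interquartile-range (IQR) fence on price per square foot as the trimming rule, and the sample sales, the 1.5 multiplier, and the function name `trim_outliers` are assumptions, not taken from the presentation.

```python
import statistics

def trim_outliers(sales, k=1.5):
    """Keep sales whose price per square foot lies within k * IQR of the quartiles."""
    ppsf = [price / area for area, price in sales]
    q1, _, q3 = statistics.quantiles(ppsf, n=4)  # quartiles of price per sq ft
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [sale for sale, rate in zip(sales, ppsf) if low <= rate <= high]

# Hypothetical (heated area, sale price) pairs; the 1,200 sf sale is an obvious outlier.
sales = [(1000, 75000), (1500, 110000), (2000, 150000), (2500, 190000),
         (3000, 230000), (1200, 700000), (1800, 135000), (2200, 160000)]
trimmed = trim_outliers(sales)
print(len(sales) - len(trimmed), "sale(s) trimmed")
```

Other trimming rules (fixed percentile cutoffs, for example) would follow the same pattern; only the computation of `low` and `high` changes.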
Model Specification
Specifying the model means picking the appropriate equation and the variables that will be used.
Models can be:
• Additive - Most common for residential properties
• Multiplicative - Often used for land valuation
• Hybrid - Most advanced
We are going to use an Additive Model in this presentation.
Regression Components
Dependent Variable:
• Sales Price
Independent Variables:
• Size
• Age
• Location
• Condition
• Lot size
• Construction
• Quality
• Amenities
Simple Regression
Simple Regression includes one Dependent Variable (sales price) and only one Independent Variable, such as Square Footage.
[Figure: scatter plot of SALES PRICE versus HEATED AREA]
Using this model, a 1,000 sf home would be valued at $75,000
Simple Regression
Simple Regression using only size as the independent variable will predict sales prices; however, it will treat all homes of the same size equally.
1,000 square feet - $75,000?
1,000 square feet - $75,000
Multiple Regression
We know square footage is an important variable, but what other variables should we include and how do we decide?
• Heated Area
• Quality
• Lot Size
• Exterior Wall Type
• Actual Age
• Effective Age
• Roof Type
• Heat/Ac Type
• Swimming Pool
• Garage
• Screen Porch
• View Location
Correlation Analysis
Pearson’s Correlation tells you the degree of relationship between variables.
Correlations
                                 SALEPRICE  BLDSIZE  BEDROOMS  DOCK
SALEPRICE  Pearson Correlation   1          0.855    0.557     0.142
           Sig. (2-tailed)       .          0        0         0
           N                     1367       1367     1367      1367
BLDSIZE    Pearson Correlation   0.855      1        0.659     0.062
           Sig. (2-tailed)       0          .        0         0.021
           N                     1367       1367     1367      1367
BEDROOMS   Pearson Correlation   0.557      0.659    1         0.037
           Sig. (2-tailed)       0          0        .         0.176
           N                     1367       1367     1367      1367
DOCK       Pearson Correlation   0.142      0.062    0.037     1
           Sig. (2-tailed)       0          0.021    0.176     .
           N                     1367       1367     1367      1367
Notice the high correlation between sales price and size
Very little correlation between sales price and dock
Correlation Analysis also helps identify “Collinearity”, which is a correlation between 2 independent variables. For example, the living area of a home is highly correlated to the number of bedrooms. It would only be necessary to have one of these variables in the model.
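As a rough illustration of what the correlation matrix reports, Pearson's correlation can be computed by hand as below. The five sample homes and the function name `pearson` are made-up assumptions for the sketch; they are not the 1,367 sales behind the table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)

# Hypothetical sample data (not from the presentation).
bldsize = [1000, 1500, 2000, 2500, 3000]
bedrooms = [2, 3, 3, 4, 4]
saleprice = [80000, 118000, 160000, 190000, 235000]

print("price vs size:    ", round(pearson(saleprice, bldsize), 3))
# A high value between two independent variables signals possible collinearity.
print("size vs bedrooms: ", round(pearson(bldsize, bedrooms), 3))
```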
Regression Equations
Y = mx + b
Y = b0 + b1 X1 + b2 X2 + . . . + bK XK
Running Regression
Statistical Software makes using Regression much easier, performing the necessary calculations quickly and accurately.
Let’s Run This!
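For a single independent variable, the calculation the software performs can be sketched with the ordinary least-squares formulas for the line Y = mx + b. This is a minimal sketch, not the statistical package used in the presentation; the three sample sales and the name `fit_simple_regression` are assumptions.

```python
def fit_simple_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical sales: building size (sf) and sale price.
sizes = [1000, 2000, 3000]
prices = [80000, 155000, 230000]
intercept, slope = fit_simple_regression(sizes, prices)
print("constant:", intercept, "per-sf adjustment:", slope)
```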
Regression Results
Model 1
The Output tells us how well our model is working
Model Summary
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .855(a)  .732      .731               25406.53266545
a Predictors: (Constant), BLDSIZE
The closer the Adj. R-Square is to “1” the better
The adjusted R2 statistic measures the proportion of the total variation in sales price that is explained by the Regression Model. It ranges from 0.00 to 1.00 with 1.00 being the desired value. A high number, say 0.910, means that approximately 91% of the variation in value can be explained by the model.
And - it gives us the coefficients (or adjustments)
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  6838.585     2195.717            3.115     .002
BLDSIZE     75.068       1.231       .855    60.997    .000
a Dependent Variable: SALEPRIC
$6,838
+ Bldsize x $75.07
= Property Value
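The Model 1 equation can be applied directly to a building size. The sketch below uses the constant and BLDSIZE coefficient from the Model 1 output; the function name `predict_model1` and the 1,000 sf example home are illustrative assumptions.

```python
# Coefficients taken from the Model 1 output.
CONSTANT = 6838.585
BLDSIZE_COEF = 75.068

def predict_model1(bldsize):
    """Property value = constant + building size x per-sf adjustment."""
    return CONSTANT + BLDSIZE_COEF * bldsize

# Hypothetical 1,000 sf home.
print(round(predict_model1(1000), 2))
```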
Regression Results
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  6838.585     2195.717            3.115     .002
BLDSIZE     75.068       1.231       .855    60.997    .000
a Dependent Variable: SALEPRIC
The output includes the coefficient and the “Constant”
The “Constant” (the intercept) represents the portion of value that is not explained by the variables included in the model.
Running Regression
Let’s add another variable to the model - Say Land Size
Let’s run this model and see if results improve.
Regression Results
Model 2
Model Summary
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .895(a)  .801      .801               21864.78975921
a Predictors: (Constant), LANDSF, BLDSIZE
Our Adj. R2 went up from .731 to .801!
We also have new coefficients (or adjustments)
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  6119.232     1889.914            3.238     .001
BLDSIZE     72.660       1.065       .828    68.237    .000
LANDSF      .382         .017        .266    21.887    .000
a Dependent Variable: SALEPRIC
$6,119
+ Bldsize x $72.66
+ Landsf x $0.382
= Property Value
Running Regression
Let’s add Age to the model
If Age is significant to value, the model should improve. Let’s run it.
Regression Results
Model 3
Model Summary
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .912(a)  .832      .832               20114.04445033
a Predictors: (Constant), AGE, LANDSF, BLDSIZE
Our Adj. R2 went up from .801 to .832!
Notice the age coefficient is negative
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  22855.587    2036.809            11.221    .000
BLDSIZE     67.276       1.037       .767    64.856    .000
LANDSF      .444         .017        .309    26.868    .000
AGE         -630.763     39.991      -.189   -15.773   .000
a Dependent Variable: SALEPRIC
$22,855
+ Bldsize x $67.28
+ Landsf x $0.44
+ Age x ($630.76)
= Property Value
Running Regression
Let’s add Building Quality to the model
We may have a problem. Let’s run it and see.
Regression Results
Model 4
Model Summary
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .924(a)  .854      .853               18784.15717760
a Predictors: (Constant), QUAL, LANDSF, AGE, BLDSIZE
Our Adj. R2 went up from .832 to .853 after adding quality, but notice the constant is now negative - that’s not good!
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  -45723.503   5199.675            -8.794    .000
BLDSIZE     59.808       1.103       .681    54.234    .000
LANDSF      .445         .015        .309    28.831    .000
AGE         -605.886     37.388      -.182   -16.205   .000
QUAL        26110.420    1842.475    .171    14.171    .000
a Dependent Variable: SALEPRIC
What do we do with this quality adjustment?
Regression Results
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  -45723.503   5199.675            -8.794    .000
BLDSIZE     59.808       1.103       .681    54.234    .000
LANDSF      .445         .015        .309    28.831    .000
AGE         -605.886     37.388      -.182   -16.205   .000
QUAL        26110.420    1842.475    .171    14.171    .000
a Dependent Variable: SALEPRIC
Quality        Resulting Adjustment
1 - Fair       1 x $26,110 = $26,110
2 - Average    2 x $26,110 = $52,220
3 - Good       3 x $26,110 = $78,330
4 - Excellent  4 x $26,110 = $104,440
5 - Superior   5 x $26,110 = $130,550
This doesn’t make sense because the codes 1, 2, 3, etc. were not meant to be a rank
A Note about Data Types
There are 3 primary types of property characteristics:
• Continuous: Based on a size or measurement. Examples: Square Footage or Lot Size
• Discrete: Specific pre-defined value. Examples: Roof Material, Building Quality
• Binary: Either the item is present or not. Examples: Corner Location, Lakefront Location
Transformations
To solve the problem we need to convert the “discrete” variable Quality into individual “binary” variables, which allows Regression to distinguish each type:
“Quality” BECOMES
• Fair - Yes/No
• Average - Yes/No
• Good - Yes/No
• Excellent - Yes/No
• Superior - Yes/No
Running Regression
Now that we have transformed the variable Quality we can put it back in the model
Notice we left “Average” out
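The conversion of the Quality code into yes/no variables, with “Average” left out as the base, might look like the sketch below. The code-to-label mapping follows the 1 to 5 quality codes shown earlier and the variable names FAIR, GOOD, EXCEL, and SUPERIOR used in the model output; the function name `quality_dummies` is an assumption.

```python
# Quality codes as listed in the presentation; "AVERAGE" is the omitted base class.
QUALITY_LABELS = {1: "FAIR", 2: "AVERAGE", 3: "GOOD", 4: "EXCEL", 5: "SUPERIOR"}
BASE = "AVERAGE"

def quality_dummies(code):
    """Return one 0/1 variable per quality class, excluding the base class."""
    return {label: int(code == c)
            for c, label in QUALITY_LABELS.items() if label != BASE}

print(quality_dummies(3))  # a Good-quality home
print(quality_dummies(2))  # an Average home: every binary variable is 0
```

Because an Average home has zeros in every binary variable, each quality coefficient is read as an adjustment relative to Average.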
Regression Results
Model 5
Model Summary
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .933(a)  .870      .869               17717.09739523
a Predictors: (Constant), SUPERIOR, EXCEL, AGE, FAIR, GOOD, LANDSF, BLDSIZE
Our Adj. R2 went up from .832 to .869.
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  35633.753    1922.792            18.532    .000
BLDSIZE     58.537       1.045       .667    56.031    .000
LANDSF      .419         .016        .291    26.342    .000
AGE         -625.742     35.363      -.188   -17.695   .000
FAIR        -25511.289   8693.178    -.031   -2.935    .003
GOOD        21095.623    1838.228    .127    11.476    .000
EXCEL       75844.967    12720.934   .059    5.962     .000
SUPERIOR    305671.839   18494.059   .169    16.528    .000
a Dependent Variable: SALEPRIC
These Quality adjustments are all relative to “Average”
Running Regression
Let’s transform Neighborhood into a binary and add it to the model
Notice we left out the “Base” Neighborhood (the most typical)
Regression Results
Model 6
Model Summary
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .936(a)  .875      .874               17391.93018134
a Predictors: (Constant), NB211006, BLDSIZE, EXCEL, FAIR, SUPERIOR, NB211002, NB211001, NB211005, AGE, LANDSF, GOOD, NB211003
Our Adj. R2 went up from .869 to .874.
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  40799.859    2299.668            17.742    .000
BLDSIZE     56.000       1.143       .638    48.980    .000
LANDSF      .423         .016        .294    25.753    .000
AGE         -671.493     37.221      -.201   -18.041   .000
FAIR        -33476.331   8602.963    -.041   -3.891    .000
GOOD        17371.495    2023.937    .105    8.583     .000
EXCEL       72617.618    12567.147   .057    5.778     .000
SUPERIOR    313444.055   18313.237   .173    17.116    .000
NB211001    14199.881    2321.457    .070    6.117     .000
NB211002    -3514.034    1657.862    -.025   -2.120    .034
NB211003    -1483.623    1244.877    -.015   -1.192    .234
NB211005    4044.357     2266.186    .021    1.785     .075
NB211006    1915.755     2601.773    .008    .736      .462
a Dependent Variable: SALEPRIC
These Neighborhood adjustments are all relative to our “Base” Neighborhood
Running Regression
Multiplicative Transformations combine two variables into one
Square Footage x Quality = SQFT1
Reflects the fact that quality may contribute greater value in larger homes and less value in smaller homes. In other words, without combining these variables, all Good Quality homes get the same adjustment regardless of their size. Let’s add this new combined variable to the model.
Since we combined SF and Quality, we remove them as stand-alone variables
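One way to build the combined variables is sketched below. It assumes, as an interpretation of the slide, that each of SQFT1 through SQFT5 holds the home's square footage when the home falls in that quality class (codes 1 to 5) and zero otherwise; the function name `sqft_by_quality` is hypothetical.

```python
def sqft_by_quality(bldsize, quality_code):
    """Multiply building size by each quality indicator (1 = Fair ... 5 = Superior)."""
    return {f"SQFT{k}": bldsize * int(quality_code == k) for k in range(1, 6)}

# Hypothetical 2,000 sf home of Good quality (code 3).
print(sqft_by_quality(2000, 3))
```

With variables built this way, each coefficient becomes a per-square-foot rate for its quality class rather than a fixed dollar amount.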
Regression Results
Model 7
Model Summary
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .938(a)  .880      .879               17065.96846831
a Predictors: (Constant), SQFT5, SQFT4, AGE, NB211002, SQFT2, SQFT1, NB211006, NB211001, NB211005, LANDSF, NB211003, SQFT3
Our Adj. R2 went up from .874 to .879.
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  43999.158    2299.663            19.133    .000
LANDSF      .418         .016        .291    25.996    .000
AGE         -660.473     36.505      -.198   -18.092   .000
NB211001    10975.273    2335.844    .054    4.699     .000
NB211002    -3611.418    1624.028    -.026   -2.224    .026
NB211003    -1250.573    1221.119    -.013   -1.024    .306
NB211005    6350.688     2243.206    .033    2.831     .005
NB211006    1923.311     2554.324    .008    .753      .452
SQFT1       21.119       8.533       .026    2.475     .013
SQFT2       53.673       1.169       .723    45.916    .000
SQFT3       63.139       1.074       .964    58.806    .000
SQFT4       77.267       3.557       .210    21.720    .000
SQFT5       108.100      2.941       .356    36.759    .000
a Dependent Variable: SALEPRIC
Notice the adjustments went from fixed dollar amounts to “per square foot”
Advanced Transformations
Exponential transformations - Raise variable to a power
Land Size ^ .75 = LAND75
Reflects the principle of diminishing returns. The unit price of land tends to decrease as size increases. Without this transformation, every square foot of land would get the same adjustment, regardless of lot size. Raising land size to the power of .75 reflects the curve shown below.
[Figure: SINGLE FAMILY LOT PRICES, PRICE PER SF (axis from $2.40 to $2.85) versus LOT SIZE]
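The effect of the power transformation can be seen in a short sketch. The .75 exponent comes from the slide; the lot sizes and the function name `land75` are illustrative assumptions.

```python
def land75(landsf):
    """Raise land size (square feet) to the power of .75."""
    return landsf ** 0.75

# Hypothetical lot sizes: the transformed units per square foot shrink as lots grow,
# so a single coefficient on LAND75 implies a declining price per square foot.
for size in (5000, 10000, 20000):
    print(size, round(land75(size), 1), round(land75(size) / size, 4))
```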
Running Regression
Let’s add our new transformed land variable to the model
Regression Results
Model 8
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  40782.649    2277.915            17.903    .000
AGE         -731.178     36.549      -.219   -20.005   .000
NB211001    10061.900    2314.108    .050    4.348     .000
NB211002    -3196.888    1609.968    -.023   -1.986    .047
NB211003    -1646.847    1211.025    -.017   -1.360    .174
NB211005    6714.691     2224.018    .035    3.019     .003
NB211006    -5595.936    2625.622    -.024   -2.131    .033
SQFT1       30.298       8.324       .038    3.640     .000
SQFT2       51.834       1.167       .698    44.421    .000
SQFT3       60.732       1.081       .927    56.177    .000
SQFT4       71.516       3.559       .194    20.094    .000
SQFT5       104.644      2.937       .345    35.625    .000
LAND75      12.233       .459        .314    26.668    .000
a Dependent Variable: SALEPRIC
Model Summary
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .939(a)  .882      .881               16919.04533480
a Predictors: (Constant), LAND75, NB211005, NB211001, SQFT4, NB211002, SQFT5, SQFT1, AGE, SQFT2, NB211006, NB211003, SQFT3
Our Adj. R2 went up from .879 to .881.
Running Regression
Let’s add garages, pools, and baths just to round out our model.
Regression Results
Model 9
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  29680.695    2885.532            10.286    .000
AGE         -705.817     38.491      -.212   -18.337   .000
NB211001    12374.064    2176.815    .061    5.684     .000
NB211002    -1094.891    1527.977    -.008   -.717     .474
NB211003    -938.838     1136.671    -.010   -.826     .409
NB211005    12639.946    2139.489    .066    5.908     .000
NB211006    852.109      2535.266    .004    .336      .737
SQFT1       31.388       7.815       .039    4.016     .000
SQFT2       44.166       1.365       .595    32.349    .000
SQFT3       52.939       1.265       .808    41.857    .000
SQFT4       60.447       3.561       .164    16.974    .000
SQFT5       94.723       2.943       .312    32.186    .000
LAND75      11.788       .433        .303    27.240    .000
BATHS       7714.093     1338.204    .076    5.765     .000
POOL        13359.275    1184.469    .105    11.279    .000
GARAGE      10.750       3.137       .038    3.427     .001
a Dependent Variable: SALEPRIC
Model Summary(b)
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .947(a)  .897      .895               15854.87728402
Our Adj. R2 went up from .881 to .895.
Regression Results
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  35633.753    1922.792            18.532    .000
BLDSIZE     58.537       1.045       .667    56.031    .000
LANDSF      .419         .016        .291    26.342    .000
AGE         -625.742     35.363      -.188   -17.695   .000
FAIR        -25511.289   8693.178    -.031   -2.935    .003
GOOD        21095.623    1838.228    .127    11.476    .000
EXCEL       75844.967    12720.934   .059    5.962     .000
SUPERIOR    305671.839   18494.059   .169    16.528    .000
a Dependent Variable: SALEPRIC
The “Beta” value in column 4 is the standardized coefficient, which puts each variable on a common scale so that relative contributions can be compared. It is used in stepwise regression in deciding which variable to add next.
Regression Results
The significance of each variable to the model can be determined by looking at the “t” values.
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  29680.695    2885.532            10.286    .000
AGE         -705.817     38.491      -.212   -18.337   .000
NB211001    12374.064    2176.815    .061    5.684     .000
NB211002    -1094.891    1527.977    -.008   -.717     .474
NB211003    -938.838     1136.671    -.010   -.826     .409
NB211005    12639.946    2139.489    .066    5.908     .000
NB211006    852.109      2535.266    .004    .336      .737
SQFT1       31.388       7.815       .039    4.016     .000
SQFT2       44.166       1.365       .595    32.349    .000
SQFT3       52.939       1.265       .808    41.857    .000
SQFT4       60.447       3.561       .164    16.974    .000
SQFT5       94.723       2.943       .312    32.186    .000
LAND75      11.788       .433        .303    27.240    .000
BATHS       7714.093     1338.204    .076    5.765     .000
POOL        13359.275    1184.469    .105    11.279    .000
GARAGE      10.750       3.137       .038    3.427     .001
a Dependent Variable: SALEPRIC
Rule of Thumb: “t” scores should be 2.0 or greater in absolute value (positive or negative)
NB211002, NB211003, and NB211006 are insignificant
Regression Results
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 1     B            Std. Error  Beta    t         Sig.
(Constant)  35633.753    1922.792            18.532    .000
BLDSIZE     58.537       1.045       .667    56.031    .000
LANDSF      .419         .016        .291    26.342    .000
AGE         -625.742     35.363      -.188   -17.695   .000
FAIR        -25511.289   8693.178    -.031   -2.935    .003
GOOD        21095.623    1838.228    .127    11.476    .000
EXCEL       75844.967    12720.934   .059    5.962     .000
SUPERIOR    305671.839   18494.059   .169    16.528    .000
a Dependent Variable: SALEPRIC
The “t-statistic” is calculated by dividing the coefficient of a variable by its standard error. For example, for the variable BLDSIZE, the “t-statistic” is calculated as follows:
58.537 / 1.045 = 56.0
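The t-statistic calculation and the 2.0 rule of thumb can be combined in a short sketch. The coefficients and standard errors below are the neighborhood rows from the Model 9 output; the function name `t_stat` and the use of an absolute-value cutoff are assumptions made for illustration.

```python
def t_stat(coefficient, std_error):
    """t-statistic = coefficient divided by its standard error."""
    return coefficient / std_error

# BLDSIZE from the Model 5 output, matching the worked example above.
print(round(t_stat(58.537, 1.045), 1))

# (coefficient, standard error) for the neighborhood variables in Model 9.
model9_neighborhoods = {
    "NB211001": (12374.064, 2176.815),
    "NB211002": (-1094.891, 1527.977),
    "NB211003": (-938.838, 1136.671),
    "NB211005": (12639.946, 2139.489),
    "NB211006": (852.109, 2535.266),
}

# Flag variables whose t-score is below 2.0 in absolute value.
insignificant = [name for name, (b, se) in model9_neighborhoods.items()
                 if abs(t_stat(b, se)) < 2.0]
print(insignificant)
```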
Regression Results
Model Summary(b)
Model   R        R Square  Adjusted R Square  Std. Error of the Estimate
1       .947(a)  .897      .895               15854.87728402
The “Standard Error of the Estimate” in the regression model tells us how much a sale estimate will typically vary from its actual value. This number alone is meaningless unless related to the average sales price in the sale sample. Dividing the Standard Error by the Average Sales Price produces the Coefficient of Variation (COV):
$15,854 / $134,043 = 11.8% COV
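The COV arithmetic is simple enough to check in a few lines. The standard error and average sales price are the figures quoted above; the function name `coefficient_of_variation` is an assumption.

```python
def coefficient_of_variation(std_error, average_sale_price):
    """COV (percent) = standard error of the estimate / average sales price x 100."""
    return 100 * std_error / average_sale_price

cov = coefficient_of_variation(15854.877, 134043)
print(round(cov, 1), "% COV")
```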
Regression Options
“Enter” is the default regression method in most statistical software programs. This method includes all variables “entered” by the modeler.
“Stepwise” multiple regression automatically eliminates redundant or insignificant variables.
Coefficients(a)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized Coefficients)
Model 4     B            Std. Error  Beta    t         Sig.
(Constant)  28624.283    2584.025            11.077    .000
AGE         -697.862     37.689      -.209   -18.516   .000
NB211001    12794.553    2071.093    .063    6.178     .000
NB211005    13302.885    1969.163    .069    6.756     .000
SQFT1       31.406       7.797       .039    4.028     .000
SQFT2       44.305       1.354       .597    32.723    .000
SQFT3       53.134       1.249       .811    42.525    .000
SQFT4       60.544       3.557       .164    17.023    .000
SQFT5       94.884       2.924       .313    32.446    .000
LAND75      11.891       .393        .305    30.243    .000
BATHS       7732.836     1332.987    .076    5.801     .000
POOL        13317.394    1179.165    .105    11.294    .000
GARAGE      10.586       3.047       .037    3.474     .001
a Dependent Variable: SALEPRIC
Notice that Stepwise Regression “kicked out” the neighborhoods that had low “t-scores”
Creating New Assessments
Once you have calibrated your model, the Regression software allows you to predict the new values (or assessments) using the coefficients (or adjustments) you created.
Reviewing Ratio Statistics
Once the new assessments are created using our final model, we can review the accuracy of our new values using traditional ratio statistics.
Ratio Statistics for ASSESS Unstandardized Predicted Value / SALEPRIC
Weighted Mean                               1.000
Price Related Differential                  1.008
Coefficient of Dispersion                   .079
Coefficient of Variation (Mean Centered)    11.1%
Coefficient of Variation (Median Centered)  11.2%
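For readers who want to see how such ratio statistics are commonly computed, here is a minimal sketch using the usual definitions (ratio = assessment / sale price; COD = average absolute deviation from the median ratio divided by the median; PRD = mean ratio divided by weighted mean ratio). The three sample parcels and the function name `ratio_statistics` are assumptions, and the statistical package in the presentation may differ in detail.

```python
import statistics

def ratio_statistics(assessed, sale_prices):
    """Compute common assessment ratio statistics, expressed as fractions."""
    ratios = [a / s for a, s in zip(assessed, sale_prices)]
    median = statistics.median(ratios)
    mean = statistics.mean(ratios)
    weighted_mean = sum(assessed) / sum(sale_prices)
    deviations = [abs(r - median) for r in ratios]
    return {
        "weighted_mean": weighted_mean,
        "cod": statistics.mean(deviations) / median,  # Coefficient of Dispersion
        "prd": mean / weighted_mean,                  # Price Related Differential
    }

# Hypothetical predicted assessments and sale prices.
assessed = [100000, 210000, 290000]
sale_prices = [100000, 200000, 300000]
result = ratio_statistics(assessed, sale_prices)
print({name: round(value, 3) for name, value in result.items()})
```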
Valuing the Population
Valuing the population requires transforming the same variables you used in the model, then applying the coefficients to those variables. This can be done internally within some CAMA systems, using Microsoft Excel or other spreadsheet software, or within the regression software.
Valuing the population is one of the most difficult aspects of regression modeling because changes in the physical attributes of any one parcel often require re-running the entire model and re-calculating values.
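The two steps (transform, then apply coefficients) can be sketched as below. For brevity the example uses the Model 5 coefficients from the earlier output rather than the final model; the parcel record, its field names, and the functions `transform` and `value_parcel` are hypothetical.

```python
# Coefficients from the Model 5 output (Average quality is the omitted base).
MODEL5 = {
    "BLDSIZE": 58.537, "LANDSF": 0.419, "AGE": -625.742,
    "FAIR": -25511.289, "GOOD": 21095.623,
    "EXCEL": 75844.967, "SUPERIOR": 305671.839,
}
CONSTANT = 35633.753
QUALITY_DUMMY = {1: "FAIR", 3: "GOOD", 4: "EXCEL", 5: "SUPERIOR"}  # code 2 = Average

def transform(parcel):
    """Apply the same transformations used in the model to one parcel record."""
    row = {"BLDSIZE": parcel["bldsize"], "LANDSF": parcel["landsf"], "AGE": parcel["age"]}
    for code, name in QUALITY_DUMMY.items():
        row[name] = int(parcel["quality"] == code)
    return row

def value_parcel(parcel):
    """Constant plus the sum of each coefficient times its transformed variable."""
    row = transform(parcel)
    return CONSTANT + sum(MODEL5[name] * amount for name, amount in row.items())

# Hypothetical parcel: 2,000 sf home on a 10,000 sf lot, 10 years old, Good quality.
parcel = {"bldsize": 2000, "landsf": 10000, "age": 10, "quality": 3}
print(round(value_parcel(parcel), 2))
```

Running every parcel in the population through the same `transform` step is what keeps the predicted values consistent with the model that produced the coefficients.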
Conclusion
Predicting assessments using Regression requires the appraiser to:
• Explore data to determine relationships and clean up outliers
• Specify which model and variables will be used
• Transform variables and run regression
• Review results, modify or add variables
• Create predicted assessments and review ratio statistics
• Value population using final coefficients
The End
[Figure: scatter plot of SALE PRICES versus Predicted Values]