Date post: | 18-Dec-2015 |
Upload: | poppy-rich |
FPP 10 kind of
Regression
1
Plan of attack
Introduce the regression model
Correctly interpret intercept and slope
Prediction
Pitfalls to avoid
2
Regression line
The correlation coefficient is a nice numerical summary of two quantitative variables
It indicates the direction and strength of the association
But does it quantify the association?
It would be of interest to do this for
Predictions
Understanding phenomena
3
Regression line
Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables
If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot
This line represents a mathematical model. Later we will make the mathematical model a statistical one.
4
Slope intercept form review
5
Regression line
Slope-intercept form notation
$y = mx + b$
Regression form notation
$\hat{y} = \alpha + \beta x$
6
Regression
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945
7
Which line is best?
Price = -90.2458 + 0.1598 SQFT (red)
Price = -300 + 0.3 SQFT (blue)
Price = 0 + 0.1 SQFT (green)
8
Which model to use
Different people might draw different lines by eye on a scatterplot
What are some ways we can determine which model(line) out of all the possible models(lines) is the “best” one?
What are some ways that we can numerically rank the different models? (i.e. the different lines)
This will come later in the course
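One possible way to rank candidate lines numerically is to add up the squared vertical distances between each point and the line, and prefer the line with the smallest total. The sketch below applies that idea to the three lines from the previous slide. The criterion is an assumption here (the slides defer the topic), the five data points are invented purely for illustration, and `sum_squared_residuals` is a hypothetical helper name.

```python
# Hypothetical (sqft, price in $1000s) pairs, invented for illustration only.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [75, 140, 235, 300, 395]

# The three candidate lines from the slide, stored as (intercept, slope).
lines = {
    "red": (-90.2458, 0.1598),
    "blue": (-300.0, 0.3),
    "green": (0.0, 0.1),
}


def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


scores = {
    name: sum_squared_residuals(a, b, sqft, price)
    for name, (a, b) in lines.items()
}

# Rank the lines from smallest to largest sum of squared residuals.
ranking = sorted(scores, key=scores.get)
for name in ranking:
    print(f"{name}: {scores[name]:.1f}")
```

With real data the ranking could differ; only the scoring idea carries over.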
9
Slope interpretation
The slope, β, of a regression line is almost always important for interpreting the data.
The slope is a rate of change. It is the mean amount of change in y-hat when x increases by 1
10
$\hat{y} = \alpha + \beta x$
Slope interpretation
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT (price in thousands of dollars)
r = 0.8718945
For every 1 sqft increase in the size of a home, the house price increases on average by $159.80
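A minimal sketch of the slope as a rate of change, using the fitted home-price equation. It assumes price is recorded in thousands of dollars, which is what the dollar amounts quoted on these slides imply. `predict_price` is a hypothetical helper name.

```python
INTERCEPT = -90.2458  # thousands of dollars
SLOPE = 0.1598        # thousands of dollars per square foot


def predict_price(sqft):
    """Predicted price in thousands of dollars from the fitted line."""
    return INTERCEPT + SLOPE * sqft


# Increasing x by 1 changes the prediction by exactly the slope,
# no matter which starting size is chosen.
change_per_sqft = predict_price(2001) - predict_price(2000)

# Increasing x by 100 changes the prediction by 100 times the slope.
change_per_100_sqft = predict_price(2100) - predict_price(2000)

print(f"{1000 * change_per_sqft:.2f} dollars per extra sqft")
print(f"{1000 * change_per_100_sqft:.2f} dollars per extra 100 sqft")
```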
11
Intercept interpretation
The intercept, α, of the regression line is the value of y-hat when x = 0. Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.
12
$\hat{y} = \alpha + \beta x$
Intercept interpretation
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT (price in thousands of dollars)
r = 0.8718945
If the size of a home were 0 sqft, the predicted average house price would be -$90,245.80
This doesn’t make much sense here because x (sqft) doesn’t take on values close to zero.
13
Prediction
Price of Homes Based on Square Feet
Price = -90.2458 + 0.1598 SQFT (price in thousands of dollars)
r = 0.8718945
For a 3500 sqft home we would predict the selling price to be
Price = -90.2458 + 0.1598(3500) = 469.0542, that is, $469,054.20
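The prediction above can be sketched as a small function that plugs a size into the fitted line and converts the result to dollars. The factor of 1000 assumes the response is in thousands of dollars. `predict_price_dollars` is a hypothetical name.

```python
def predict_price_dollars(sqft, intercept=-90.2458, slope=0.1598):
    """Predicted selling price in dollars; the fitted equation is in $1000s."""
    return 1000 * (intercept + slope * sqft)


predicted = predict_price_dollars(3500)
print(f"${predicted:,.2f}")  # $469,054.20
```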
14
OECD data: Income and unemployment in the U.S.
What is the relationship between households’ disposable income and the nation’s unemployment rate?
Data from the U.S., 1980 to 1998 (data provided by the economics department at Duke)
15
Disposable income vs unemployment rates
[Figure: Bivariate Fit of Real Household Disposable Income By Unemployment Rate. Scatter plot with linear fit; x-axis: Unemployment Rate (5 to 10); y-axis: Real Household Disposable Income (3,500,000 to 5,500,000)]
16
Disposable income and unemployment rates regression output
[Figure: Bivariate Fit of Household Disposable Income By Unemployment Rate. Scatter plot with linear fit; x-axis: Unemployment Rate (5 to 10); y-axis: Household Disposable Income (1,000,000 to 7,000,000)]
Linear Fit
Household Disposable Income = 8266987.1 - 664053.26 Unemployment Rate
Summary of Fit
RSquare                       0.507648
RSquare Adj                   0.478687
Root Mean Square Error        920643.7
Mean of Response              3833103
Observations (or Sum Wgts)    19
Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        1    1.48566e13        1.4857e13      17.5282    0.0006
Error       17    1.44089e13        8.4758e11
C. Total    18    2.92656e13
Parameter Estimates
Term                 Estimate     Std Error    t Ratio    Prob>|t|
Intercept            8266987.1    1079905       7.66      <.0001
Unemployment Rate    -664053.3    158611.4     -4.19      0.0006
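The numbers in this output are tied together by a few standard identities. The sketch below recomputes RSquare, the F ratio, the root mean square error and the slope's t ratio from the sums of squares, degrees of freedom, estimate and standard error reported above. The variable names are illustrative, and small differences from the printed values come from rounding in the output.

```python
import math

# Values copied from the Analysis of Variance table.
ss_model, ss_error, ss_total = 1.48566e13, 1.44089e13, 2.92656e13
df_model, df_error = 1, 17

# RSquare is the share of total variation attributed to the model.
r_squared = ss_model / ss_total

# Mean squares are sums of squares divided by their degrees of freedom.
ms_model = ss_model / df_model
ms_error = ss_error / df_error

f_ratio = ms_model / ms_error
rmse = math.sqrt(ms_error)

# t ratio for the slope: estimate divided by its standard error.
t_slope = -664053.3 / 158611.4

print(f"RSquare  = {r_squared:.4f}")
print(f"F ratio  = {f_ratio:.2f}")
print(f"RMSE     = {rmse:.0f}")
print(f"t ratio  = {t_slope:.2f} (t squared = {t_slope ** 2:.2f})")
```

With a single explanatory variable, the squared t ratio of the slope matches the F ratio, which is why both carry the same p-value (0.0006).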
17
Facts about regression
There is a close relationship between the correlation coefficient and the slope of a regression line
They have the same sign
They are proportional to each other
The intercept has no relationship with the correlation coefficient but here is the formula
$\beta = r \frac{\sigma_y}{\sigma_x}$ or $\beta = r \frac{SD_y}{SD_x}$
$\alpha = \mu_y - \beta \mu_x$
18
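A minimal sketch of these two formulas using only the Python standard library. The data are invented for illustration. Sample standard deviations are used here, which is an assumption; the ratio of the two SDs is the same whether sample or population versions are used, as long as both are of the same kind.

```python
import statistics

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

# Sample correlation coefficient r.
n = len(x)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
r = s_xy / ((n - 1) * sd_x * sd_y)

# Slope and intercept from the formulas on the slide.
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# Cross-check: the least-squares slope computed directly as S_xy / S_xx.
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope_direct = s_xy / s_xx

print(f"r = {r:.4f}, slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Because both standard deviations are positive, the slope always has the same sign as r.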
Facts about regression
The distinction between explanatory and response variable is essential in regression
If you have a slope computed using x as the explanatory variable and y as the response variable, you can’t “back solve” to get the slope and intercept for the regression model with x as the response and y as the explanatory variable.
If you want to predict x given y, then you must find the intercept and slope with y as the explanatory variable and x as the response
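A short sketch of why back-solving fails, on invented data. `fit_line` is a hypothetical helper that returns the least-squares intercept and slope. The line fitted with x as the response is compared with the line obtained by algebraically inverting the y-on-x fit.

```python
def fit_line(explanatory, response):
    """Least-squares (intercept, slope) for response = intercept + slope * explanatory."""
    n = len(explanatory)
    mean_e = sum(explanatory) / n
    mean_r = sum(response) / n
    s_er = sum((e - mean_e) * (r - mean_r) for e, r in zip(explanatory, response))
    s_ee = sum((e - mean_e) ** 2 for e in explanatory)
    slope = s_er / s_ee
    return mean_r - slope * mean_e, slope


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

a_yx, b_yx = fit_line(x, y)  # predict y from x
a_xy, b_xy = fit_line(y, x)  # predict x from y

# Back-solving y = a + b*x for x gives x = -a/b + (1/b)*y.
back_solved_slope = 1 / b_yx
back_solved_intercept = -a_yx / b_yx

print(f"fitted x-on-y line:     x = {a_xy:.2f} + {b_xy:.2f} y")
print(f"back-solved y-on-x line: x = {back_solved_intercept:.2f} + {back_solved_slope:.2f} y")
```

The two lines coincide only when the points fall exactly on a line (r = 1 or r = -1); in general the product of the two fitted slopes equals r squared.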
19
Facts about regression
R² (coefficient of determination) provides a one-number summary of how well the regression line fits the data
R² is the proportion of variation in the y’s explained by the regression line, often quoted as a percentage
R² lies between 0 and 1
Values near 1 indicate the regression predicts the y’s in the data set very closely
Values near 0 indicate the regression does not predict the y’s in the data set very closely
20
Facts about regression
Example:
The correlation coefficient between sale price and square feet was r = 0.8718945
Thus the coefficient of determination is R² = (0.8719)² ≈ 0.76
So 76% of the variability in sale price is explained by (taken into account by) the regression line with square feet.
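The sketch below checks the arithmetic for the home-price example and, on invented data, illustrates that "proportion of variation explained" (one minus the residual sum of squares over the total sum of squares) gives the same number as squaring the correlation in a one-variable regression. The data and variable names are assumptions for illustration.

```python
# Home-price example: square the reported correlation.
home_r_squared = 0.8718945 ** 2
print(f"home-price R^2 = {home_r_squared:.4f}")

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Least-squares line for predicting y from x.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Variation left unexplained by the line, and total variation in y.
ss_error = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_total = s_yy

r_squared = 1 - ss_error / ss_total
r = s_xy / (s_xx * s_yy) ** 0.5

print(f"1 - SSE/SST = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```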
21
Does regression fit data well?
A regression line is reasonable if
The association between the two variables is indeed linear
The points are randomly scattered around the line
The income/unemployment rate data are well described by a regression line.
22
Regression of AIDS rates per 1000 people on GNP per capita
The line is too low for GDP values near zero and too high for large GDP values.
We shouldn’t use this line for predictions
[Figure: Bivariate Fit of HIV/AIDS per 1000 By GDP/capita. Scatter plot with linear fit; x-axis: GDP/capita (0 to 30000); y-axis: HIV/AIDS per 1000 (0 to 200)]
Linear Fit
HIV/AIDS per 1000 = 34.840498 - 0.0013312 GDP/capita
Summary of Fit
RSquare                       0.073769
RSquare Adj                   0.051715
Root Mean Square Error        39.24748
Mean of Response              27.33318
Observations (or Sum Wgts)    44
Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        1    5152.577          5152.58        3.3450     0.0745
Error       42    64695.328         1540.36
C. Total    43    69847.905
Parameter Estimates
Term          Estimate     Std Error    t Ratio    Prob>|t|
Intercept     34.840498    7.201186      4.84      <.0001
GDP/capita    -0.001331    0.000728     -1.83      0.0745
23
Changing the response variable
When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line.
With monetary variables, typically this can be accomplished by taking logarithms.
24
Regression of log(AIDS) on log(GNP)
Much better fit
Predict log(AIDS) from log(GNP). Exponentiate to estimate AIDS
[Figure: Bivariate Fit of logaids By logGNP. Scatter plot with linear fit; x-axis: logGNP (6 to 10.5); y-axis: logaids (-1 to 6)]
Linear Fit
logaids = 8.8562593 - 0.8185802 logGNP
Summary of Fit
RSquare                       0.346571
RSquare Adj                   0.331013
Root Mean Square Error        1.213907
Mean of Response              2.312979
Observations (or Sum Wgts)    44
Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    8.8562593    1.398379      6.33      <.0001
logGNP       -0.81858     0.173436     -4.72      <.0001
[Figure: residual plot for the linear fit; x-axis: logGNP (6 to 10.5); y-axis: Residual (-3 to 3)]
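A sketch of the "predict on the log scale, then exponentiate" step using the fitted equation above. It assumes the logs are natural logarithms, which the axis range (logGNP from 6 to 10.5) suggests but the slides do not state. `predict_aids_rate` is a hypothetical name, and the input values below are arbitrary.

```python
import math

INTERCEPT = 8.8562593
SLOPE = -0.8185802


def predict_aids_rate(gnp_per_capita):
    """Back-transformed prediction of AIDS cases per 1000 (assumes natural logs)."""
    log_prediction = INTERCEPT + SLOPE * math.log(gnp_per_capita)
    return math.exp(log_prediction)


# Arbitrary example inputs, not taken from the data set.
low = predict_aids_rate(1000)
high = predict_aids_rate(2000)

# On the original scale the fit is a power curve, so doubling GNP
# multiplies the prediction by 2 ** SLOPE (roughly 0.57) at any level.
ratio = high / low
print(f"ratio of predictions when GNP doubles: {ratio:.3f}")
```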
25
Birth and death rates in 74 countries
[Figure: two scatter plots of death rate (5 to 30) against birth rate (10 to 50)]
26
Warnings about regression
Predicting y at values of x beyond the range of x in the data is called extrapolation
This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values
Extrapolated predictions can be badly wrong
27
Extrapolation
Diamond price and carat
The explanatory variable is size in carats and the response variable is price in dollars
Predict the price of the Hope Diamond
$\hat{y} = 48.88 + 2430.77(45.52) = 110697.53$ (dollars)
28
Extrapolation
The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4
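One simple safeguard is to record the range of x used to fit the line and flag any prediction outside it. The sketch below does this for the diamond line. The bounds 0.1 and 0.4 carats are hypothetical stand-ins, since the slides give only the approximate point where linearity breaks down, and `predict_with_range_check` is an illustrative name.

```python
import warnings


def predict_with_range_check(x, intercept, slope, x_min, x_max):
    """Return intercept + slope * x, warning when x lies outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        warnings.warn(
            f"x={x} is outside the observed range [{x_min}, {x_max}]; "
            "this prediction is an extrapolation",
            stacklevel=2,
        )
    return intercept + slope * x


# Hope Diamond (45.52 carats) with hypothetical fitting bounds of 0.1 to 0.4 carats.
hope_diamond = predict_with_range_check(45.52, 48.88, 2430.77, x_min=0.1, x_max=0.4)
print(f"{hope_diamond:,.2f}")
```

The function still returns the number, because the arithmetic is valid; the warning only signals that the linear model has no support at that x.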
29
Extrapolation
The green line is a linear fit using only diamonds smaller than 0.4 carats
The blue line is a linear fit using all carat sizes
The red curve is a quadratic fit
30
Lurking variable
A variable not being considered could be driving the relationship
In practice this is a difficult issue to tackle, especially when everything seems OK
31
Influential point
An outlier in either the x or y direction which, if removed, would markedly change the value of the slope and y-intercept.
32
Causality
On its own, regression only quantifies an association between x and y
It does not prove causality
In a carefully designed experiment (or, in some cases, an observational study), regression can be used to show causality.
33