Stat 112 Notes 11
• Today: Fitting Curvilinear Relationships (Chapter 5)
• Homework 3 due Friday.
• I will e-mail Homework 4 tonight, but it will not be due for two weeks (October 26th).
Curvilinear Relationships
• Relationship between Y and X is curvilinear if E(Y|X) is not a straight line.
• Linearity for simple linear regression model is violated for a curvilinear relationship.
• Approaches to estimating E(Y|X) for a curvilinear relationship:
– Polynomial Regression
– Transformations
Transformations
• Curvilinear relationship: E(Y|X) is not a straight line.
• Another approach to fitting curvilinear relationships is to transform Y or X.
• Transformations: Perhaps E(f(Y)|g(X)) is a straight line, where f(Y) and g(X) are transformations of Y and X, and a simple linear regression model holds for the response variable f(Y) and explanatory variable g(X).
Curvilinear Relationship
[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP — scatterplot of Life Expectancy (40–80) vs. Per Capita GDP (0–30,000) with linear fit, and a residual plot (residuals −25 to 15) vs. Per Capita GDP]
Y = Life Expectancy in 1999; X = Per Capita GDP (in US Dollars) in 1999. Data in gdplife.JMP.
The linearity assumption of simple linear regression is clearly violated: the increase in mean life expectancy for each additional dollar of GDP is smaller for large GDPs than for small GDPs. There are decreasing returns to increases in GDP.
Bivariate Fit of Life Expectancy By log Per Capita GDP
[Figure: scatterplot of Life Expectancy (40–80) vs. log Per Capita GDP (6–10) with linear fit, and a residual plot (residuals −25 to 15) vs. log Per Capita GDP]
Linear Fit: Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP
The mean of Life Expectancy given log Per Capita GDP appears to be approximately a straight line.
How do we use the transformation?
• Testing for association between Y and X: If the simple linear regression model holds for f(Y) and g(X), then Y and X are associated if and only if the slope in the regression of f(Y) on g(X) does not equal zero. The p-value for the test that the slope is zero is <.0001: strong evidence that per capita GDP and life expectancy are associated.
• Prediction and mean response: What would you predict the life expectancy to be for a country with a per capita GDP of $20,000?
Linear Fit: Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP

Parameter Estimates
Term                 Estimate   Std Error  t Ratio  Prob>|t|
Intercept            -7.97718   3.943378   -2.02    0.0454
log Per Capita GDP    8.729051  0.474257   18.41    <.0001

Ê(Y | X = 20,000) = Ê(Y | log X = log 20,000 = 9.9035)
                  = -7.9772 + 8.7291 × 9.9035 ≈ 78.47
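The plug-in calculation above can be checked with a few lines of Python (a sketch added for illustration; the coefficients are the JMP estimates reported above):

```python
import math

# Fitted line from the JMP output:
# Life Expectancy = -7.97718 + 8.729051 * log(Per Capita GDP)
b0, b1 = -7.97718, 8.729051

def predict_life_expectancy(gdp):
    """Predicted mean life expectancy for a given per capita GDP.
    Note: log means the natural log throughout these notes."""
    return b0 + b1 * math.log(gdp)

print(predict_life_expectancy(20000))  # about 78.47 years
```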
How do we choose a transformation?
• Tukey’s Bulging Rule.
• See Handout.
• Match curvature in data to the shape of one of the curves drawn in the four quadrants of the figure in the handout. Then use the associated transformations, selecting one for either X, Y or both.
Transformations in JMP
1. Use Tukey's Bulging rule (see handout) to determine transformations which might help.
2. After Fit Y by X, click the red triangle next to Bivariate Fit and click Fit Special. Experiment with transformations suggested by Tukey's Bulging rule.
3. Make residual plots of the residuals for the transformed model vs. the original X by clicking the red triangle next to Transformed Fit to … and clicking Plot Residuals. Choose transformations which make the residual plot have no pattern in the mean of the residuals vs. X.
4. Compare different transformations by looking for the transformation with the smallest root mean square error on the original y-scale. If using a transformation that involves transforming y, look at the root mean square error for the fit measured on the original scale.
Bivariate Fit of Life Expectancy By Per Capita GDP
[Figure: scatterplot of Life Expectancy (40–80) vs. Per Capita GDP (0–30,000) with four fits overlaid: Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, Transformed Fit Square]
Linear Fit: Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP
  RSquare 0.515026, RSquare Adj 0.510734, Root Mean Square Error 8.353485, Mean of Response 63.86957, Observations 115

Transformed Fit to Log: Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)
  RSquare 0.749874, RSquare Adj 0.74766, Root Mean Square Error 5.999128, Mean of Response 63.86957, Observations 115

Transformed Fit to Sqrt: Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)
  RSquare 0.636551, RSquare Adj 0.633335, Root Mean Square Error 7.231524, Mean of Response 63.86957, Observations 115

Transformed Fit Square: Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP
  Fit Measured on Original Scale: Sum of Squared Error 7597.7156, Root Mean Square Error 8.1997818, RSquare 0.5327083, Sum of Residuals -70.29942
By looking at the root mean square error on the original y-scale, we see that all of the transformations improve upon the untransformed model and that the transformation to log x is by far the best.
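The comparison can be sketched in Python. The data below are simulated to mimic the GDP/life-expectancy shape (they are not the real gdplife.JMP data), so only the ranking of the RMSEs, not their values, should match the JMP output:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data with a log-shaped mean, mimicking the GDP example
x = rng.uniform(500, 30000, 115)
y = -8 + 8.7 * np.log(x) + rng.normal(0, 3, x.size)

def rmse_original_scale(x_t, y_t, back=None):
    """Fit a least-squares line to (x_t, y_t) and report RMSE on the
    original y scale; `back` undoes a transformation of y, if any.
    (JMP's RMSE divides by n - 2 instead of n, a minor difference.)"""
    b1, b0 = np.polyfit(x_t, y_t, 1)
    yhat = b0 + b1 * x_t
    if back is not None:
        yhat = back(yhat)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

rmse_linear = rmse_original_scale(x, y)
rmse_log = rmse_original_scale(np.log(x), y)
rmse_sqrt = rmse_original_scale(np.sqrt(x), y)
rmse_square = rmse_original_scale(x, y ** 2,
                                  back=lambda v: np.sqrt(np.clip(v, 0, None)))
print(rmse_linear, rmse_log, rmse_sqrt, rmse_square)  # log x should win
```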
[Figure: residual plots of residuals (−25 to 15) vs. Per Capita GDP (0–30,000) for four models — Linear Fit, Transformation to Log X, Transformation to √X, Transformation to Y²]

The transformation to Log X appears to have mostly removed the trend in the mean of the residuals. This means that E(Y|X) ≈ β₀ + β₁ log X. There is still a problem of nonconstant variance.
Comparing models for curvilinear relationships
• In comparing two transformations, use the transformation with the lower RMSE on the original y-scale; if Y was transformed, use the RMSE from the fit measured on the original scale.
• In comparing transformations to polynomial regression models, compare RMSE of best transformation to best polynomial regression model (selected using the criterion from Note 10).
• If the transformation's RMSE is close to (e.g., within 1%) but not as small as the polynomial regression's, it is still reasonable to use the transformation on the grounds of parsimony.
Transformations and Polynomial Regression for Display.JMP

Model                 RMSE
Linear                51.59
log x                 41.31
1/x                   40.04
√x                    46.02
Fourth order poly.    37.79
The fourth order polynomial is the best polynomial regression model using the criterion from Notes 10. It is also the best model overall: it has the smallest RMSE by a considerable amount (more than a 1% advantage over the best transformation, 1/x).
Interpreting the Coefficient on Log X

Suppose E(Y|X) = β₀ + β₁ log X. Then, using the properties of logarithms,

E(Y | 2X) − E(Y | X) = (β₀ + β₁ log 2X) − (β₀ + β₁ log X)
                     = β₁ (log 2X − log X)
                     = β₁ log 2 ≈ 0.69 β₁

Thus, the interpretation of β₁ is that a doubling of X is associated with a β₁ log 2 ≈ 0.69 β₁ increase in the mean of Y. Similarly, a tripling of X is associated with a β₁ log 3 increase in the mean of Y.
For the life expectancy data, the Transformed Fit to Log is: Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP).
A doubling of GDP is associated with an 8.73 × log 2 = 8.73 × 0.69 ≈ 6.02 year increase in mean life expectancy.
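A quick numerical check of the doubling effect (a sketch, using the JMP slope estimate above; the notes round log 2 to 0.69, which gives 6.02, while the unrounded value is about 6.05):

```python
import math

b1 = 8.729051  # slope on log(Per Capita GDP) from the fitted model

effect_of_doubling = b1 * math.log(2)  # increase in mean Y when X doubles
effect_of_tripling = b1 * math.log(3)  # increase in mean Y when X triples
print(effect_of_doubling, effect_of_tripling)
```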
Log Transformation of Both X and Y variables
• It is sometimes useful to transform both the X and Y variables.
• A particularly common transformation is to transform X to log(X) and Y to log(Y)
E(log Y | X) = β₀ + β₁ log X,
so that on the original scale Y is predicted by exp(β₀ + β₁ log X) = exp(β₀) X^β₁.
Heart Disease-Wine Consumption Data (heartwine.JMP)
Bivariate Fit of Heart Disease Mortality By Wine Consumption
[Figure: scatterplot of Heart Disease Mortality (2–12) vs. Wine Consumption (0–80) with Linear Fit and Transformed Fit Log to Log overlaid]
Residual Plot for Simple Linear Regression Model
[Figure: residuals (−3 to 3) vs. Wine Consumption (0–80)]

Residual Plot for Log-Log Transformed Model
[Figure: residuals (−3 to 3) vs. Wine Consumption (0–80)]
Evaluating Transformed Y Variable Models

The residuals for a log-log transformation model on the original Y-scale are

eᵢ = Yᵢ − Ê(Y | Xᵢ) = Yᵢ − exp(b₀ + b₁ log Xᵢ)
The root mean square error and R² on the original Y-scale are shown in JMP under Fit Measured on Original Scale. To compare models with transformed Y variables to models with untransformed Y variables, use the root mean square error and R² computed on the original Y-scale for the transformed-Y models.
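The original-scale residual calculation can be sketched directly (illustration only; with the wine-data coefficients b₀ = 2.5555519, b₁ = -0.3555959, applying it to the heartwine.JMP data would reproduce the Fit Measured on Original Scale numbers up to the n vs. n − 2 convention):

```python
import numpy as np

def original_scale_residuals(y, x, b0, b1):
    """Residuals of a log-log fit computed on the original y scale:
    e_i = y_i - exp(b0 + b1 * log(x_i))."""
    return y - np.exp(b0 + b1 * np.log(x))

def original_scale_rmse(y, x, b0, b1):
    # Note: JMP's Fit Measured on Original Scale divides by n - 2;
    # this sketch divides by n.
    e = original_scale_residuals(y, x, b0, b1)
    return float(np.sqrt(np.mean(e ** 2)))
```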
Linear Fit: Heart Disease Mortality = 7.6865549 − 0.0760809 Wine Consumption
  RSquare 0.555872, RSquare Adj 0.528114, Root Mean Square Error 1.618923

Transformed Fit Log to Log: Log(Heart Disease Mortality) = 2.5555519 − 0.3555959 Log(Wine Consumption)
  Fit Measured on Original Scale: Sum of Squared Error 41.557487, Root Mean Square Error 1.6116274, RSquare 0.5598656
The log-log transformation provides slightly better predictions than the simple linear regression model.
Interpreting Coefficients in Log-Log Models

Assuming that

E(log Y | log X) = β₀ + β₁ log X

satisfies the simple linear regression model assumptions, then

Median(Y | X) = exp(β₀) exp(β₁ log X) = exp(β₀) X^β₁

Thus,

Median(Y | 2X) / Median(Y | X) = exp(β₀) exp(β₁ log 2X) / [exp(β₀) exp(β₁ log X)] = 2^β₁

Thus, a doubling of X is associated with a multiplicative change of 2^β₁ in the median of Y.

Transformed Fit Log to Log: Log(Heart Disease Mortality) = 2.5555519 − 0.3555959 Log(Wine Consumption). Doubling wine consumption is associated with multiplying median heart disease mortality by 2^(−0.356) ≈ 0.781.
Another interpretation of coefficients in log-log models

For a 1% increase in X,

Median(Y | 1.01X) / Median(Y | X) = exp(β₀) exp(β₁ log 1.01X) / [exp(β₀) exp(β₁ log X)] = 1.01^β₁

Because 1.01^β₁ ≈ 1 + 0.01 β₁, a 1% increase in X is associated with approximately a β₁ percent change in the median (or mean) of Y.

Transformed Fit Log to Log: Log(Heart Disease Mortality) = 2.5555519 − 0.3555959 Log(Wine Consumption). Increasing wine consumption by 1% is associated with about a 0.36% decrease in median heart disease mortality. Similarly, a 10% increase in X is associated with approximately a 10β₁ percent change: increasing wine consumption by 10% is associated with about a 3.6% decrease in median heart disease mortality. For large percentage changes (e.g., 50%, 100%), this approximation is not accurate.
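The exact multiplicative effects and the percentage approximation can be compared numerically (a sketch using the fitted slope above):

```python
b1 = -0.3555959  # slope from the log-log wine fit

# Exact multiplicative change in the median of Y for a given factor change in X
exact_1pct = 1.01 ** b1    # close to 1 + 0.01*b1: about a 0.36% decrease
exact_10pct = 1.10 ** b1   # about a 3.3% decrease (the approximation says 3.6%)
exact_double = 2.0 ** b1   # about 0.781; the linear approximation is poor here
print(exact_1pct, exact_10pct, exact_double)
```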
Another Example of Transformations: Y = Count of tree seeds, X = Weight of tree

Bivariate Fit of Seed Count By Seed weight (mg)
[Figure: scatterplot of Seed Count (−5,000–30,000) vs. Seed weight (−1,000–5,000 mg), shown twice: once alone and once with Linear Fit, Transformed Fit Log to Log, and Transformed Fit to Log overlaid]
Linear Fit: Seed Count = 6751.7179 − 2.1076776 Seed weight (mg)
  RSquare 0.220603, RSquare Adj 0.174756, Root Mean Square Error 6199.931, Mean of Response 4398.474, Observations 19

Transformed Fit Log to Log: Log(Seed Count) = 9.758665 − 0.5670124 Log(Seed weight (mg))
  Fit Measured on Original Scale: Sum of Squared Error 161960739, Root Mean Square Error 3086.6004, RSquare 0.8068273, Sum of Residuals 3142.2066

Transformed Fit to Log: Seed Count = 12174.621 − 1672.3962 Log(Seed weight (mg))
  RSquare 0.566422, RSquare Adj 0.540918, Root Mean Square Error 4624.247, Mean of Response 4398.474, Observations 19
By looking at the root mean square error on the original y-scale, we see that both of the transformations improve upon the untransformed model and that the transformation to log y and log x is by far the best.
Comparison of Transformations to Polynomials for Tree Data
[Figure: scatterplot of Seed Count (−5,000–30,000) vs. Seed weight (0–5,000 mg) with Transformed Fit Log to Log and Polynomial Fit Degree=6 overlaid]

Transformed Fit Log to Log: Log(Seed Count) = 9.758665 − 0.5670124 × Log(Seed weight (mg))
  Fit Measured on Original Scale: Root Mean Square Error 3086.6004

Polynomial Fit Degree=6: Seed Count = 1539.0377 + 2.453857×Seed weight (mg) − 0.0139213×(Seed weight (mg)−1116.51)^2 + 1.2747e-6×(Seed weight (mg)−1116.51)^3 + 1.0463e-8×(Seed weight (mg)−1116.51)^4 − 5.675e-12×(Seed weight (mg)−1116.51)^5 + 8.269e-16×(Seed weight (mg)−1116.51)^6
  Root Mean Square Error 6138.581
For the tree data, the log-log transformation is much better than polynomial regression.
Prediction using the log y/log x transformation
• What is the predicted seed count of a tree that weighs 50 mg?
• Math trick: exp{log(y)} = y (remember, by log we always mean the natural log, ln). Thus,

Ê(Y | X = 50) = exp{Ê(log Y | log X = log 50 = 3.912)}
             = exp{9.7587 − 0.5670 × 3.912}
             = exp{7.5406} ≈ 1882.96
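The back-transformed prediction can be reproduced in a few lines (a sketch using the JMP coefficients for the tree data):

```python
import math

# Fitted log-log model: log(Seed Count) = 9.758665 - 0.5670124 * log(Seed weight in mg)
b0, b1 = 9.758665, -0.5670124

def predict_seed_count(weight_mg):
    """Back-transform the log-scale prediction: exp{b0 + b1 * log(weight)}."""
    return math.exp(b0 + b1 * math.log(weight_mg))

print(predict_seed_count(50))  # about 1883 seeds
```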