Date post: | 17-Jan-2016 |
Upload: | angelina-johnson |
Stat 112 Notes 10
• Today: Fitting Curvilinear Relationships (Chapter 5)
• Homework 3 due Thursday.
Curvilinear Relationship
• Reconsider the simple regression problem of estimating the conditional mean of y given x, $E(y|x)$.
• For many problems, $E(y|x)$ is not linear.
• The linear regression model makes the restrictive assumption that the increase in the mean of y|x for a one unit increase in x equals $\beta_1$, whatever the value of x.
• Curvilinear relationship: $E(y|x)$ is a curve, not a straight line; the increase in the mean of y|x is not the same for all x.
• When the relationship is curvilinear, the residual plot from a simple linear regression will show a violation of linearity: there will be ranges of x for which the mean of the residuals is not approximately zero.
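A minimal sketch of this residual check in Python. The data are made up for illustration (a noise-free curve y = 5 + 3x - 0.1x^2 on x = 0, ..., 20) and the helper names are hypothetical. It only illustrates that a straight-line fit to curved data leaves residuals whose mean is far from zero over ranges of x.

```python
from statistics import mean

def simple_linear_fit(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up curvilinear data (an assumption, not one of the course data sets).
x = [float(i) for i in range(21)]
y = [5 + 3 * xi - 0.1 * xi ** 2 for xi in x]

b0, b1 = simple_linear_fit(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def mean_residual(lo, hi):
    """Mean of the residuals for observations with lo <= x <= hi."""
    return mean([r for xi, r in zip(x, residuals) if lo <= xi <= hi])

low = mean_residual(0, 5)
middle = mean_residual(7, 13)
high = mean_residual(15, 20)
print(f"mean residual, x in [0, 5]:   {low:.2f}")
print(f"mean residual, x in [7, 13]:  {middle:.2f}")
print(f"mean residual, x in [15, 20]: {high:.2f}")
```

The residuals average to zero overall, as they always do in least squares, but they are negative on average at both ends of the x range and positive in the middle. That is the pattern to look for in a residual plot.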
Example 1: How does rainfall affect yield?
• Data on average corn yield and rainfall in six U.S. states (1890-1927), cornyield.JMP
[Figure: Bivariate Fit of YIELD By RAINFALL (YIELD vs. RAINFALL), followed by the residual plot (Residual vs. RAINFALL)]
Residual plot indicates violation of linearity – mean of residuals is above zero for rainfall between about 10-12 and below zero for rainfall from about 13-17.
Example 2: How do people’s incomes change as they age?
• Weekly wages and age of 200 randomly chosen males between ages 18 and 70 from the 1998 March Current Population Survey
[Figure: Bivariate Fit of wage By age (wage vs. age)]
Example 3: Display.JMP
• A large chain of liquor stores would like to know how much display space in its stores to devote to a new wine. It collects sales and display space data from 47 of its stores.
[Figure: Bivariate Fit of Sales By DisplayFeet (Sales vs. DisplayFeet)]
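Polynomial Regression
• Add polynomial terms in x as additional explanatory variables in a multiple regression model:
$$E(Y \mid X = x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_K x^K$$
• In JMP, $(x - \bar{x})^2$ is used in the place of $x^2$, and similarly for higher powers:
$$E(Y \mid X = x) = \beta_0 + \beta_1 x + \beta_2 (x - \bar{x})^2 + \cdots + \beta_K (x - \bar{x})^K$$
This does not affect the $\hat{y}$ that is obtained from the multiple regression model.
• Quadratic model (K=2) is often sufficient.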
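A short Python check of why centering does not change $\hat{y}$, using the quadratic fit for the corn yield data reported later in these notes. Expanding $(x - \bar{x})^2$ turns the centered model into an ordinary quadratic in x, so both forms describe the same family of curves and least squares picks the same curve from it. The variable and function names are illustrative only.

```python
# Centered quadratic fit for the corn yield data, as reported by JMP.
B0, B1, B2 = 21.660175, 1.0572654, -0.2293639
X_BAR = 10.7842  # value JMP subtracts from RAINFALL before squaring

def centered_fit(x):
    """Fitted mean using the centered quadratic term (x - X_BAR)^2."""
    return B0 + B1 * x + B2 * (x - X_BAR) ** 2

# Expand (x - X_BAR)^2 to get the equivalent coefficients on 1, x, x^2.
A0 = B0 + B2 * X_BAR ** 2
A1 = B1 - 2 * B2 * X_BAR
A2 = B2

def uncentered_fit(x):
    """The same curve written with an uncentered x^2 term."""
    return A0 + A1 * x + A2 * x ** 2

rainfall_values = [6 + 0.5 * i for i in range(23)]  # 6.0, 6.5, ..., 17.0
max_gap = max(abs(centered_fit(x) - uncentered_fit(x)) for x in rainfall_values)
print(f"largest difference in fitted values: {max_gap:.2e}")
```

Only the intercept and the coefficient on x change between the two forms; the coefficient on the squared term is the same in both.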
Polynomial Regression in JMP
• Two ways to fit model:
– Create variables $(x - \bar{x})^2, (x - \bar{x})^3, \ldots, (x - \bar{x})^k$. Use Fit Model with variables $x, (x - \bar{x})^2, (x - \bar{x})^3, \ldots, (x - \bar{x})^k$ (we will illustrate this method when we apply polynomial regression when there is more than one explanatory variable).
– Use Fit Y by X. Click on red triangle next to Bivariate Analysis … and click Fit Polynomial instead of the usual Fit Line. This method produces nicer plots.
[Figure: Bivariate Fit of YIELD By RAINFALL (YIELD vs. RAINFALL)]

Linear Fit
YIELD = 23.552103 + 0.7755493 RAINFALL

Summary of Fit
RSquare                     0.16211
RSquare Adj                 0.138835
Root Mean Square Error      4.049471
Mean of Response            31.91579
Observations (or Sum Wgts)  38

Polynomial Fit Degree=2
YIELD = 21.660175 + 1.0572654 RAINFALL - 0.2293639 (RAINFALL-10.7842)^2

Summary of Fit
RSquare                     0.296674
RSquare Adj                 0.256484
Root Mean Square Error      3.762707
Mean of Response            31.91579
Observations (or Sum Wgts)  38

Parameter Estimates
Term                   Estimate    Std Error  t Ratio  Prob>|t|
Intercept              21.660175   3.094868   7.00     <.0001
RAINFALL               1.0572654   0.293956   3.60     0.0010
(RAINFALL-10.7842)^2   -0.229364   0.088635   -2.59    0.0140
[Figure: Bivariate Fit of wage By age (wage vs. age), showing the Linear Fit and the Polynomial Fit Degree=2]

Linear Fit
wage = 407.72321 + 6.5370642 age

Summary of Fit
RSquare                     0.049778
RSquare Adj                 0.044979
Root Mean Square Error      345.4422

Polynomial Fit Degree=2
wage = 356.39651 + 9.6873755 age - 0.4769883 (age-38.22)^2

Summary of Fit
RSquare                     0.095328
RSquare Adj                 0.086143
Root Mean Square Error      337.9155
Mean of Response            657.5698
Observations (or Sum Wgts)  200

Parameter Estimates
Term            Estimate    Std Error  t Ratio  Prob>|t|
Intercept       356.39651   81.21184   4.39     <.0001
age             9.6873755   2.223264   4.36     <.0001
(age-38.22)^2   -0.476988   0.151453   -3.15    0.0019
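Interpretation of coefficients in polynomial regression
• The usual interpretation of multiple regression coefficients doesn’t make sense in polynomial regression.
$$E(Y \mid X) = \beta_0 + \beta_1 X + \beta_2 (X - \bar{X})^2$$
• We can’t hold $X$ fixed and change $(X - \bar{X})^2$.
• The effect of increasing $X$ by one unit depends on the starting value $X = X^*$:
$$E(Y \mid X = X^* + 1) - E(Y \mid X = X^*) = [\beta_0 + \beta_1 (X^* + 1) + \beta_2 (X^* + 1 - \bar{X})^2] - [\beta_0 + \beta_1 X^* + \beta_2 (X^* - \bar{X})^2]$$
$$= \beta_1 + 2\beta_2 X^* + \beta_2 - 2\beta_2 \bar{X} = \beta_1 + \beta_2 [2 (X^* - \bar{X}) + 1]$$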
Interpretation of coefficients in wage data
Polynomial Fit Degree=2
wage = 356.39651 + 9.6873755 age - 0.4769883 (age-38.22)^2

Parameter Estimates
Term            Estimate    Std Error  t Ratio  Prob>|t|
Intercept       356.39651   81.21184   4.39     <.0001
age             9.6873755   2.223264   4.36     <.0001
(age-38.22)^2   -0.476988   0.151453   -3.15    0.0019

Change in Mean Wage Associated with One Year Increase in Age
Age             Change in Mean Wage
From 29 to 30   18.00
From 39 to 40   8.47
From 49 to 50   -1.07
From 59 to 60   -10.61
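The table can be reproduced from the fitted coefficients with the one-unit change formula $\beta_1 + \beta_2 [2 (X^* - \bar{X}) + 1]$. A minimal Python sketch follows. The function name is illustrative, and the inputs are the rounded values printed in the JMP output, so results may differ from the table in the last digit.

```python
# Coefficients from the JMP quadratic fit of wage on age.
B1 = 9.6873755    # coefficient on age
B2 = -0.4769883   # coefficient on (age - AGE_BAR)^2
AGE_BAR = 38.22   # value JMP subtracts from age before squaring

def change_in_mean_wage(start_age):
    """E(wage | age = start_age + 1) - E(wage | age = start_age)."""
    return B1 + B2 * (2 * (start_age - AGE_BAR) + 1)

changes = {age: change_in_mean_wage(age) for age in (29, 39, 49, 59)}
for age, change in changes.items():
    print(f"From {age} to {age + 1}: {change:.2f}")
```

The intercept drops out of the difference, and the sign of the change flips once the starting age is far enough above 38.22. This is why the estimated mean wage rises at younger ages and falls at older ages.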
Choosing the order in polynomial regression
• Is it necessary to include a kth order term $(X - \bar{X})^k$?
$$E(Y \mid X) = \beta_0 + \beta_1 X + \beta_2 (X - \bar{X})^2 + \cdots + \beta_k (X - \bar{X})^k$$
• Test $H_0: \beta_k = 0$ vs. $H_a: \beta_k \neq 0$.
• Choose largest k so that test still rejects $H_0$ (at 0.05 level).
• If we use $(X - \bar{X})^k$, always keep the lower order terms in the model.
• For corn yield data, use K=2 polynomial regression model.
• For income data, use K=2 polynomial regression model.
[Figure: Bivariate Fit of YIELD By RAINFALL (YIELD vs. RAINFALL), showing the Linear Fit, Polynomial Fit Degree=2, and Polynomial Fit Degree=3]

Polynomial Fit Degree=3
Parameter Estimates
Term                   Estimate    Std Error  t Ratio  Prob>|t|
Intercept              29.281281   5.625537   5.21     <.0001
RAINFALL               0.376709    0.511817   0.74     0.4668
(RAINFALL-10.7842)^2   -0.349335   0.114401   -3.05    0.0044
(RAINFALL-10.7842)^3   0.0517568   0.032202   1.61     0.1172
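A small Python sketch of the order-selection rule applied to the corn yield output. It uses only the estimates, standard errors, and p-values that JMP reports for the highest-order term of the degree-2 and degree-3 fits. The dictionary layout and names are illustrative.

```python
# (estimate, std error, reported Prob>|t|) for the highest-order term of each
# fit, copied from the JMP output for the corn yield data.
highest_order_term = {
    2: (-0.229364, 0.088635, 0.0140),  # (RAINFALL-10.7842)^2 in the degree-2 fit
    3: (0.0517568, 0.032202, 0.1172),  # (RAINFALL-10.7842)^3 in the degree-3 fit
}
ALPHA = 0.05

# The t ratio is the estimate divided by its standard error.
t_ratios = {k: est / se for k, (est, se, _) in highest_order_term.items()}

# H0: beta_k = 0 is rejected when the p-value is below ALPHA.
rejects = {k: p < ALPHA for k, (_, _, p) in highest_order_term.items()}

# Choose the largest k whose highest-order term is still significant.
chosen_k = max(k for k, reject in rejects.items() if reject)

for k in sorted(highest_order_term):
    print(f"k={k}: t ratio {t_ratios[k]:.2f}, reject H0: {rejects[k]}")
print(f"chosen order: {chosen_k}")
```

The cubic term is not significant at the 0.05 level while the quadratic term is, which gives the K=2 choice for the corn yield data.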
Transformations
• Curvilinear relationship: E(Y|X) is not a straight line.
• Another approach to fitting curvilinear relationships is to transform Y or x.
• Transformations: Perhaps E(f(Y)|g(X)) is a straight line, where f(Y) and g(X) are transformations of Y and X, and a simple linear regression model holds for the response variable f(Y) and explanatory variable g(X).
Curvilinear Relationship
[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP (Life Expectancy vs. Per Capita GDP), followed by the residual plot (Residual vs. Per Capita GDP)]
Y = Life Expectancy in 1999
X = Per Capita GDP (in US Dollars) in 1999
Data in gdplife.JMP
Linearity assumption of simple linear regression is clearly violated. The increase in mean life expectancy for each additional dollar of GDP is less for large GDPs than for small GDPs: decreasing returns to increases in GDP.
[Figure: Bivariate Fit of Life Expectancy By log Per Capita GDP (Life Expectancy vs. log Per Capita GDP), followed by the residual plot (Residual vs. log Per Capita GDP)]
Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP
The mean of Life Expectancy | log Per Capita GDP appears to be approximately a straight line.
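The fitted line in log Per Capita GDP builds in the decreasing returns seen in the untransformed plot. A brief Python sketch, assuming the log is the natural log (consistent with log 20,000 = 9.9035 used in the prediction slide). The function name and the two GDP comparisons are illustrative.

```python
import math

SLOPE = 8.729051  # coefficient on log Per Capita GDP from the JMP linear fit

def gain_in_mean_life_expectancy(gdp_from, gdp_to):
    """Difference in fitted mean life expectancy between two GDP levels.

    The intercept cancels, leaving SLOPE * (log(gdp_to) - log(gdp_from)).
    """
    return SLOPE * (math.log(gdp_to) - math.log(gdp_from))

gain_low = gain_in_mean_life_expectancy(1_000, 2_000)     # extra $1,000 at a low GDP
gain_high = gain_in_mean_life_expectancy(20_000, 21_000)  # extra $1,000 at a high GDP
print(f"$1,000 -> $2,000:   {gain_low:.2f} years")
print(f"$20,000 -> $21,000: {gain_high:.2f} years")
```

Under this fit, the same $1,000 increase is associated with a gain of roughly six years at the low end and well under one year at the high end.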
How do we use the transformation?
• Testing for association between Y and X: If the simple linear regression model holds for f(Y) and g(X), then Y and X are associated if and only if the slope in the regression of f(Y) on g(X) does not equal zero. The p-value for the test that the slope is zero is <.0001: strong evidence that per capita GDP and life expectancy are associated.
• Prediction and mean response: What would you predict the life expectancy to be for a country with a per capita GDP of $20,000?

Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP

Parameter Estimates
Term                 Estimate   Std Error  t Ratio  Prob>|t|
Intercept            -7.97718   3.943378   -2.02    0.0454
log Per Capita GDP   8.729051   0.474257   18.41    <.0001

$$\hat{E}(Y \mid X = 20{,}000) = \hat{E}(Y \mid \log X = \log 20{,}000) = \hat{E}(Y \mid \log X = 9.9035) = -7.9772 + 8.7291 \times 9.9035 = 78.47$$
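The prediction can be checked in a few lines of Python. This sketch assumes the natural log, which matches log 20,000 = 9.9035, and uses the coefficients from the JMP linear fit. The function name is illustrative.

```python
import math

# Coefficients from the JMP fit of Life Expectancy on log Per Capita GDP.
INTERCEPT = -7.97718
SLOPE = 8.729051

def predicted_life_expectancy(per_capita_gdp):
    """Fitted mean life expectancy: transform X with log, then apply the line."""
    return INTERCEPT + SLOPE * math.log(per_capita_gdp)

log_gdp = math.log(20_000)
prediction = predicted_life_expectancy(20_000)
print(f"log(20,000) = {log_gdp:.4f}")
print(f"predicted life expectancy = {prediction:.2f}")
```

The key step is to transform the new X value in the same way as the data before plugging it into the fitted line.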
How do we choose a transformation?
• Tukey’s Bulging Rule.
• See Handout.
• Match curvature in data to the shape of one of the curves drawn in the four quadrants of the figure in the handout. Then use the associated transformations, selecting one for either X, Y or both.
Transformations in JMP
1. Use Tukey’s Bulging rule (see handout) to determine transformations which might help.
2. After Fit Y by X, click red triangle next to Bivariate Fit and click Fit Special. Experiment with transformations suggested by Tukey’s Bulging rule.
3. Make residual plots of the residuals for transformed model vs. the original X by clicking red triangle next to Transformed Fit to … and clicking plot residuals. Choose transformations which make the residual plot have no pattern in the mean of the residuals vs. X.
4. Compare different transformations by looking for transformation with smallest root mean square error on original y-scale. If using a transformation that involves transforming y, look at root mean square error for fit measured on original scale.
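Step 4 can be sketched outside JMP as follows. This is an illustration in Python on made-up data, not the gdplife data: fit on the transformed scale, back-transform the fitted values to the original y-scale, and compute the root mean square error there. The divisor n - 2 is an assumption about how JMP computes this RMSE; it is consistent with the Fit Measured on Original Scale figures reported for the life expectancy example.

```python
import math
from statistics import mean

def simple_linear_fit(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rmse(y, fitted):
    """Root mean square error with n - 2 degrees of freedom."""
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return math.sqrt(sse / (len(y) - 2))

# Made-up data for which y^2 is roughly linear in x (an assumption).
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
noise = [0.05, -0.04, 0.03, -0.02, 0.02, -0.03, 0.04, -0.05]
y = [math.sqrt(2 + 3 * xi) + e for xi, e in zip(x, noise)]

# Untransformed fit: regress y on x.
a0, a1 = simple_linear_fit(x, y)
rmse_linear = rmse(y, [a0 + a1 * xi for xi in x])

# Transformed fit: regress y^2 on x, then back-transform to the y-scale.
c0, c1 = simple_linear_fit(x, [yi ** 2 for yi in y])
fitted_y = [math.sqrt(max(c0 + c1 * xi, 0.0)) for xi in x]
rmse_square = rmse(y, fitted_y)

print(f"RMSE, linear fit:                     {rmse_linear:.3f}")
print(f"RMSE, square fit (original y-scale):  {rmse_square:.3f}")
```

Both RMSE values are in the units of y, so they can be compared directly. An RMSE computed on the y^2 scale would not be comparable with the untransformed fit.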
[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP (Life Expectancy vs. Per Capita GDP), showing the Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, and Transformed Fit Square]

Linear Fit
Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP

Summary of Fit
RSquare                     0.515026
RSquare Adj                 0.510734
Root Mean Square Error      8.353485
Mean of Response            63.86957
Observations (or Sum Wgts)  115

Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)

Summary of Fit
RSquare                     0.749874
RSquare Adj                 0.74766
Root Mean Square Error      5.999128
Mean of Response            63.86957
Observations (or Sum Wgts)  115

Transformed Fit to Sqrt
Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)

Summary of Fit
RSquare                     0.636551
RSquare Adj                 0.633335
Root Mean Square Error      7.231524
Mean of Response            63.86957
Observations (or Sum Wgts)  115

Transformed Fit Square
Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP

Fit Measured on Original Scale
Sum of Squared Error        7597.7156
Root Mean Square Error      8.1997818
RSquare                     0.5327083
Sum of Residuals            -70.29942
By looking at the root mean square error on the original y-scale, we see that all of the transformations improve upon the untransformed model and that the transformation to log x is by far the best.
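The comparison can be laid out in a few lines of Python using the RMSE values from the JMP output. The last step checks that the Square fit's original-scale RMSE equals sqrt(SSE / (n - 2)), taking n = 115 as for the other fits. That value of n is an assumption, since the Square output does not list the number of observations.

```python
import math

# Root mean square error on the original y-scale, from the JMP output.
rmse_original_scale = {
    "Linear Fit": 8.353485,
    "Transformed Fit to Log": 5.999128,
    "Transformed Fit to Sqrt": 7.231524,
    "Transformed Fit Square": 8.1997818,
}

# The preferred fit is the one with the smallest RMSE on the y-scale.
best_fit = min(rmse_original_scale, key=rmse_original_scale.get)
for name, value in sorted(rmse_original_scale.items(), key=lambda kv: kv[1]):
    print(f"{name:<26} {value:.3f}")
print(f"smallest RMSE: {best_fit}")

# Check of the Square fit: SSE on the original scale, n = 115 (assumed),
# and two fitted coefficients give n - 2 degrees of freedom.
rmse_square_check = math.sqrt(7597.7156 / (115 - 2))
print(f"sqrt(SSE / (n - 2)) for the Square fit: {rmse_square_check:.4f}")
```

The Square fit's RMSE comes from the Fit Measured on Original Scale block, not from a summary on the squared scale. Only the original-scale figure is comparable with the other three fits.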
[Figure: four residual plots (Residual vs. Per Capita GDP): Linear Fit; Transformation to Log X; Transformation to $\sqrt{X}$; Transformation to $Y^2$]
The transformation to Log X appears to have mostly removed a trend in the mean of the residuals. This means that $E(Y \mid X) \approx \beta_0 + \beta_1 \log X$. There is still a problem of nonconstant variance.
Comparing models for curvilinear relationships
• In comparing two transformations, use the transformation with the lower RMSE. If y was transformed, use the RMSE for the fit measured on the original y-scale.
• In comparing transformations to polynomial regression models, compare the RMSE of the best transformation to that of the best polynomial regression model.
• If the transformation’s RMSE is larger than the polynomial regression’s RMSE but is within 1% of it, then it is still a good idea to use the transformation on the grounds of parsimony.
Transformations and Polynomial Regression for Display.JMP
Model                RMSE
Linear               51.59
log x                41.31
1/x                  40.04
$\sqrt{x}$           46.02
Fourth order poly.   37.79
Fourth order polynomial is the best polynomial regression model using the criterion on slide 10
Fourth order polynomial is the best model: it has the smallest RMSE by a considerable amount (more than a 1% advantage over the best transformation, 1/x).
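The 1% parsimony rule can be written as a small Python function and applied to the Display.JMP numbers. The function name and the `tolerance` parameter are illustrative. The RMSE values are those in the table above.

```python
def choose_model(rmse_transformation, rmse_polynomial, tolerance=0.01):
    """Prefer the transformation unless its RMSE exceeds the polynomial's
    RMSE by more than the given relative tolerance (1% by default)."""
    if rmse_transformation <= rmse_polynomial * (1 + tolerance):
        return "transformation"
    return "polynomial"

# RMSE values for Display.JMP. The remaining transformation in the table
# has a larger RMSE (46.02), so it cannot be the best transformation.
rmse_by_transformation = {"log x": 41.31, "1/x": 40.04}
rmse_fourth_order = 37.79

best_transformation = min(rmse_by_transformation, key=rmse_by_transformation.get)
best_rmse = rmse_by_transformation[best_transformation]
relative_gap = best_rmse / rmse_fourth_order - 1
choice = choose_model(best_rmse, rmse_fourth_order)

print(f"best transformation: {best_transformation} (RMSE {best_rmse})")
print(f"RMSE is {100 * relative_gap:.1f}% above the fourth order polynomial")
print(f"choice: {choice}")
```

Here the best transformation's RMSE is about 6% above the polynomial's, well outside the 1% band, so the rule selects the fourth order polynomial.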