04/22/23 330 lecture 16 1
STATS 330: Lecture 16
Case study
Aim of today’s lecture
To illustrate the modelling process using the evaporation data.
The Evaporation data
Data in data frame evap.df
Aims of the analysis:
• Understand relationships between explanatory variables and the response
• Be able to predict evaporation loss given the other variables
Case Study: Evaporation data
Recall from Lecture 15: variables are
evap: the amount of moisture evaporating from the soil in the 24 hour period (response)
maxst: maximum soil temperature over the 24 hour period
minst: minimum soil temperature over the 24 hour period
avst: average soil temperature over the 24 hour period
maxat: maximum air temperature over the 24 hour period
minat: minimum air temperature over the 24 hour period
avat: average air temperature over the 24 hour period
maxh: maximum humidity over the 24 hour period
minh: minimum humidity over the 24 hour period
avh: average humidity over the 24 hour period
wind: average wind speed over the 24 hour period
Modelling cycle
[Flowchart: Choose model (from plots, theory) → Fit model → Examine residuals. Bad fit: transform and choose the model again. Good fit: use model.]
Modelling cycle (2)
Our plan of attack:
1. Graphical check
   1. Suitability for regression
   2. Gross outliers
2. Preliminary fit
3. Model selection (for prediction)
4. Transforming if required
5. Outlier check
6. Use model for prediction
Step 1: Plots
Preliminary plots
• Want to get an initial idea of suitability of data for regression modelling
• Check for linear relationships, outliers
• Pairs plots, coplots
• Data looks OK to proceed, but evap/maxh plot looks curved
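A minimal sketch of the preliminary plots, using base R only. The data frame sim.df is a simulated stand-in (it is not the evaporation data), and the coefficients used to generate it are made up for illustration. With the real data, evap.df would take its place.

```r
# Simulated stand-in for evap.df (NOT the real evaporation data).
set.seed(330)
n <- 46
sim.df <- data.frame(
  avst = rnorm(n, mean = 85, sd = 5),
  maxh = rnorm(n, mean = 400, sd = 40),
  wind = rnorm(n, mean = 250, sd = 80)
)
# Hypothetical response with a curved dependence on maxh.
sim.df$evap <- 30 + 0.8 * (sim.df$avst - 85) -
  0.2 * (sim.df$maxh - 400) + 0.002 * (sim.df$maxh - 400)^2 +
  rnorm(n, sd = 5)

# Pairs plot: every variable against every other, with a smoother
# in each panel to make curvature easier to spot.
pairs(sim.df, panel = panel.smooth)

# Coplot: evap against maxh, conditioned on ranges of avst.
coplot(evap ~ maxh | avst, data = sim.df)

# Numerical companion to the pairs plot.
cor.mat <- round(cor(sim.df), 2)
cor.mat
```

The smoother in each panel is what makes a curved relationship, such as the evap/maxh one, stand out from a straight-line trend.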
Points to note
• avh has very few values
• Strong relationships between response and some variables (particularly maxh, avst)
• Not much relationship between response and minst, minat, wind
• Strong relationships between min, av and max
• No obvious outliers
Step 2: preliminary fit
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.074877 130.720826 -0.414 0.68164
avst 2.231782 1.003882 2.223 0.03276 *
minst 0.204854 1.104523 0.185 0.85393
maxst -0.742580 0.349609 -2.124 0.04081 *
avat 0.501055 0.568964 0.881 0.38452
minat 0.304126 0.788877 0.386 0.70219
maxat 0.092187 0.218054 0.423 0.67505
avh 1.109858 1.133126 0.979 0.33407
minh 0.751405 0.487749 1.541 0.13242
maxh -0.556292 0.161602 -3.442 0.00151 **
wind 0.008918 0.009167 0.973 0.33733
Residual standard error: 6.508 on 35 degrees of freedom
Multiple R-squared: 0.8463, Adjusted R-squared: 0.8023
F-statistic: 19.27 on 10 and 35 DF, p-value: 2.073e-11
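Output of this shape comes from fitting all ten explanatory variables at once. The sketch below shows one way to do that, assuming the data frame holds only the response and the ten explanatory variables. It runs on simulated stand-in data (sim.df, with made-up coefficients), so the estimates will not match the table above. Only the residual degrees of freedom, 46 - 10 - 1 = 35, are expected to agree.

```r
# Simulated stand-in with the same variable names as evap.df
# (NOT the real evaporation data).
set.seed(330)
vars <- c("avst", "minst", "maxst", "avat", "minat",
          "maxat", "avh", "minh", "maxh", "wind")
sim.df <- as.data.frame(
  matrix(rnorm(46 * 10), nrow = 46, dimnames = list(NULL, vars))
)
# Hypothetical response: depends on avst and maxh only.
sim.df$evap <- 30 + 5 * sim.df$avst - 4 * sim.df$maxh + rnorm(46, sd = 6)

# "evap ~ ." regresses evap on every other column in the data frame.
prelim.lm <- lm(evap ~ ., data = sim.df)
summary(prelim.lm)

# Residual degrees of freedom: n - (number of predictors) - 1.
df.residual(prelim.lm)
```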
[Figure: diagnostic plots for the preliminary fit. Panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage (with Cook's distance contours at 0.5 and 1). Labelled points: 2, 31, 33, 41.]
Findings
• Plots OK, normality dubious
• Gam plots indicated no transformations
• Point 31 has quite high Cook's distance but removing it doesn’t change regression much
• Model is OK. Could interpret coefficients, but variables highly correlated.
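The check on a high Cook's distance point can be sketched as follows: find the point, refit without it, and compare coefficients. This is an illustration on simulated stand-in data (sim.df, hypothetical coefficients), so the flagged point will not be point 31.

```r
# Simulated stand-in for evap.df (NOT the real evaporation data).
set.seed(330)
n <- 46
sim.df <- data.frame(avst = rnorm(n, 85, 5), maxh = rnorm(n, 400, 40))
sim.df$evap <- 30 + 0.8 * (sim.df$avst - 85) -
  0.2 * (sim.df$maxh - 400) + rnorm(n, sd = 6)

fit <- lm(evap ~ avst + maxh, data = sim.df)

# Cook's distance for every observation; pick the largest.
cd <- cooks.distance(fit)
worst <- unname(which.max(cd))
cd[worst]

# Refit the same formula with that observation removed.
refit <- lm(formula(fit), data = sim.df[-worst, ])

# Side-by-side coefficients: small changes suggest the point
# is not driving the regression.
round(cbind(all.points = coef(fit), one.removed = coef(refit)), 3)
```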
Step 3: Model selection
Use APR
Model selected was
evap ~ maxat + maxh + wind
However, this model does not fit all that well (outliers, non-normality)
Try “best AIC” model
evap ~ avst + maxst + maxat + minh + maxh
Now proceed to step 4
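The lecture does not show the code used for the selection. As a rough stand-in, base R's step() performs a stepwise search on AIC. This is a different technique from APR (taken here to mean all possible regressions): it is greedy rather than exhaustive, so it need not find the same model. The sketch uses simulated data (sim.df, hypothetical coefficients), not the evaporation data.

```r
# Simulated stand-in for evap.df (NOT the real evaporation data).
set.seed(330)
vars <- c("avst", "maxst", "maxat", "minh", "maxh", "wind")
sim.df <- as.data.frame(
  matrix(rnorm(46 * length(vars)), nrow = 46, dimnames = list(NULL, vars))
)
# Hypothetical response: depends on avst and maxh only.
sim.df$evap <- 30 + 5 * sim.df$avst - 4 * sim.df$maxh + rnorm(46, sd = 6)

full.lm <- lm(evap ~ ., data = sim.df)

# Backward stepwise selection: drop one term at a time while AIC improves.
best.lm <- step(full.lm, direction = "backward", trace = 0)

formula(best.lm)
c(full = AIC(full.lm), selected = AIC(best.lm))
```

A model chosen purely on AIC still needs the diagnostic checks of step 4, since a low AIC says nothing about outliers or non-normality.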
Step 4: Diagnostic checks
For a quick check, plot the regression object produced by lm
model1.lm<-lm(evap ~ avst + maxst + maxat + minh+maxh, data=evap.df)
plot(model1.lm)
[Figure: diagnostic plots from plot(model1.lm), annotated “Outliers?” and “Non-normal?”. Panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Residuals vs Leverage (with Cook's distance contours at 0.5 and 1). Labelled points: 2, 33, 38, 41.]
Conclusions?
• No real evidence of non-linearity, but will check further with gams
• Normal plot looks curved
• Some largish outliers
• Points 2, 41 have largish Cook's D
Checking linearity
Check for linearity with gams
> library(mgcv)
> plot(gam(evap ~ s(avst) + s(maxst) + s(maxat) + s(maxh) + s(minh), data=evap.df))
[Figure: gam smooth plots, one panel per term, with vertical-axis labels s(avst, 4.49), s(maxst, 1), s(maxat, 1), s(maxh, 3.69), s(minh, 2.04).]
Transform avst, maxh?
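The second number in each axis label, as in s(avst, 4.49), is the estimated degrees of freedom (edf) of that smooth. An edf of 1 corresponds to a straight line, and larger values indicate curvature. The sketch below shows how these numbers can be read off a fitted gam directly. It uses simulated stand-in data (sim.df, hypothetical coefficients) in which avst has a curved effect and maxst a linear one.

```r
library(mgcv)

# Simulated stand-in for evap.df (NOT the real evaporation data).
set.seed(330)
n <- 46
sim.df <- data.frame(avst = runif(n, 75, 95), maxst = runif(n, 130, 200))
# Hypothetical response: cubic in avst, linear in maxst.
sim.df$evap <- 30 + 0.02 * (sim.df$avst - 85)^3 -
  0.3 * (sim.df$maxst - 165) + rnorm(n, sd = 4)

sim.gam <- gam(evap ~ s(avst) + s(maxst), data = sim.df)

# Estimated degrees of freedom for each smooth term.
edf <- summary(sim.gam)$edf
names(edf) <- c("s(avst)", "s(maxst)")
round(edf, 2)

# The same numbers appear in the axis labels of the plots.
plot(sim.gam, pages = 1)
```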
Remedy
• Gam plots for avst and maxh are curved
• Try cubics in these variables
• Plots look better
• Cubic terms are significant
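One way to judge whether a cubic is worth adding is an F-test comparing the straight-line model with the cubic one. The sketch below is an illustration on simulated stand-in data (sim.df, hypothetical coefficients), not a reproduction of the lecture's analysis.

```r
# Simulated stand-in for evap.df (NOT the real evaporation data).
set.seed(330)
n <- 46
sim.df <- data.frame(avst = runif(n, 75, 95), maxst = runif(n, 130, 200))
# Hypothetical response: cubic in avst, linear in maxst.
sim.df$evap <- 30 + 0.02 * (sim.df$avst - 85)^3 -
  0.3 * (sim.df$maxst - 165) + rnorm(n, sd = 4)

# Straight-line model and a model with a cubic in avst.
linear.lm <- lm(evap ~ avst + maxst, data = sim.df)
cubic.lm  <- lm(evap ~ poly(avst, 3) + maxst, data = sim.df)

# F-test for the two extra terms (quadratic and cubic).
cmp <- anova(linear.lm, cubic.lm)
cmp
```

Because poly() uses orthogonal polynomials, the t-tests on its individual terms in summary() can also be read one at a time.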
> model2.lm<-lm(evap ~ poly(avst,3) + maxst + maxat + minh + poly(maxh,3), data=evap.df)
> summary(model2.lm)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      74.6521    25.4308   2.935  0.00577 **
poly(avst, 3)1   83.0106    27.3221   3.038  0.00441 **
poly(avst, 3)2   21.4666     8.3097   2.583  0.01399 *
poly(avst, 3)3   14.1680     7.2199   1.962  0.05749 .
maxst            -0.8167     0.1697  -4.814 2.65e-05 ***
maxat             0.4175     0.1177   3.546  0.00111 **
minh              0.4580     0.3253   1.408  0.16766
poly(maxh, 3)1  -89.0809    20.0297  -4.447 8.02e-05 ***
poly(maxh, 3)2  -10.7374     7.9265  -1.355  0.18398
poly(maxh, 3)3   15.1172     6.3209   2.392  0.02212 *
---
Residual standard error: 5.276 on 36 degrees of freedom
Multiple R-squared: 0.8961, Adjusted R-squared: 0.8701
F-statistic: 34.49 on 9 and 36 DF, p-value: 4.459e-15
New model
Let's now adopt model
lm(evap~poly(avst,3)+maxst+maxat+poly(maxh,3) + wind)
Outliers are not too bad but let's check
> influenceplots(model2.lm)
Deletion of points
• Points 2, 6, 7, 41 are affecting the fitted values and some coefficients.
• Removing these one at a time and refitting indicates that the cubics are not very robust, so we revert to the non-polynomial model.
• The coefficients of the non-polynomial model are fairly stable when we delete these points one at a time, so we decide to retain the points.
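Deleting points one at a time and refitting can be sketched as a short loop. The example runs on simulated stand-in data (sim.df, hypothetical coefficients), so rows 2, 6, 7 and 41 have no special status here. They are used only to mirror the point numbers in the lecture.

```r
# Simulated stand-in for evap.df (NOT the real evaporation data).
set.seed(330)
n <- 46
sim.df <- data.frame(avst = rnorm(n, 85, 5), maxh = rnorm(n, 400, 40))
sim.df$evap <- 30 + 0.8 * (sim.df$avst - 85) -
  0.2 * (sim.df$maxh - 400) + rnorm(n, sd = 6)

fit <- lm(evap ~ avst + maxh, data = sim.df)

# Row numbers to delete, one at a time.
pts <- c(2, 6, 7, 41)

# Refit without each point; one column of coefficients per deletion.
refits <- sapply(pts, function(i) {
  coef(lm(formula(fit), data = sim.df[-i, ]))
})
colnames(refits) <- paste0("drop.", pts)

# Coefficients from the full fit next to each refit.
round(cbind(all.points = coef(fit), refits), 3)

# Percentage change in each coefficient relative to the full fit.
pct.change <- 100 * (refits - coef(fit)) / abs(coef(fit))
round(pct.change, 1)
```

For a model with poly() terms the orthogonal basis is recomputed on each subset, so comparing fitted values is more informative there than comparing coefficients.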
Normality?
However, the normal plot for the non-polynomial model is not very straight – WB test has p-value 0.
Normality of polynomial model is better
Try predictions with both
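The code for the WB test is not shown in the lecture. As a stand-in available in base R, the Shapiro-Wilk test (shapiro.test) gives a comparable check of residual normality. It is a related but different test, so its p-values will not match the WB ones. The sketch uses simulated data (sim.df, hypothetical coefficients) with deliberately skewed errors.

```r
# Simulated stand-in for evap.df (NOT the real evaporation data).
set.seed(330)
n <- 46
sim.df <- data.frame(avst = rnorm(n, 85, 5), maxh = rnorm(n, 400, 40))
# Hypothetical response with skewed (exponential) errors.
sim.df$evap <- 30 + 0.8 * (sim.df$avst - 85) -
  0.2 * (sim.df$maxh - 400) + 6 * (rexp(n) - 1)

fit <- lm(evap ~ avst + maxh, data = sim.df)

# Standardized residuals, as used in the Normal Q-Q panel.
r <- rstandard(fit)

# Normal plot with a reference line.
qqnorm(r)
qqline(r)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(r)
sw$p.value
```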
predict.df = data.frame(avst = mean(evap.df$avst),
                        maxst = mean(evap.df$maxst),
                        maxat = mean(evap.df$maxat),
                        maxh = mean(evap.df$maxh),
                        minh = mean(evap.df$minh))
rbind(predict(model1.lm, predict.df, interval="p"),
      predict(model2.lm, predict.df, interval="p"))
       fit      lwr      upr
1 34.67391 21.75680 47.59103
1 32.38471 21.39857 43.37084
CV fit:
       fit      lwr      upr
1 34.67391 21.02628 48.32154