Process Design and Optimization of Bioprocesses with Quality by
Design Approach
Kyunghee Cho, Yunhao He
June 11, 2014
1 Introduction
In this project, we work with data from a pharmaceutical company that uses a new biochemical method to produce drugs. The ultimate goal is to improve the process and develop models to aid decision making. In general, we hope to extract more information from fewer measurements. Due to time constraints, we focus on prediction in the classical statistical setting, that is, predicting an output Y from X using a model developed from data X1, X2, · · · , Xn and Y1, Y2, · · · , Yn. Many models exist for this kind of prediction task; we use only a few of them in this project.
1.1 Data Description
In this part we briefly describe the data. The following shows what the raw data look like.
Batch.ID Run WD Ptime DO CO2 pH Stress GLC LAC GLN GLU NH4
1 1 539 0 0.0000000 50 46 7.12 2.2 7.00 0.67 1.88 1.32 3.55
2 1 539 1 0.8368056 50 21 7.10 2.2 5.95 1.52 1.29 1.67 3.97
3 1 539 2 1.7986111 50 23 6.97 2.2 5.13 2.06 1.31 1.80 4.11
4 1 539 3 2.8020833 50 33 6.97 2.2 3.85 2.16 1.28 2.64 4.57
5 1 539 4 3.8680556 50 39 6.98 2.2 3.02 1.93 1.73 2.57 4.45
6 1 539 5 4.8993056 50 40 7.02 2.2 2.27 1.71 2.07 3.53 4.20
OSM Xv Via Titer
1 322.0000 1.45 97.97297 0.03913624
2 315.4865 2.07 97.64151 0.05251863
3 308.0000 3.30 97.34513 0.05340000
4 305.0906 4.97 97.83465 0.10257388
5 302.0000 7.02 97.50000 0.14200000
6 298.0000 7.75 96.27329 0.19834530
In a Run of the experiment, all variables are measured from day WD 1 to WD 10, sometimes 11 or 12. The controlled variables in the experiments are DO, CO2, pH and Stress. Variables such as GLC, LAC, GLN, GLU, NH4, OSM, Xv and Via are measured throughout to monitor the state of the production. Titer is the output we are interested in.
1.2 Missing Data Imputation
The raw data contain quite a few missing values. To carry out a sensible statistical analysis, we imputed missing data for both the input values and the output values. Missing Titer values were interpolated by our client. Missing values in the input variables are imputed by MissForest, a recent method based on Random Forest. See [1].
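As an illustration, the imputation step can be sketched in a few lines of R (the data frame name `X.raw` is a placeholder for the raw input variables):

```r
## Minimal sketch: impute missing input values with missForest
library(missForest)

set.seed(1)                # missForest is stochastic
imp <- missForest(X.raw)   # iterative random-forest imputation
X.complete <- imp$ximp     # the completed data frame
imp$OOBerror               # out-of-bag estimate of the imputation error
```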
1.3 Data Standardization
It is often good practice to normalize the data to have mean 0 and variance 1. In our analysis, the data are standardized for all models except the linear models, since we have more sophisticated customized transformations for the linear regressions. As a result, the coefficients of the linear regressions and of MARS always differ in scale.
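In R this standardization is a one-liner (`dat` is a placeholder name for the data frame of numeric variables):

```r
## Standardize each column to mean 0 and variance 1, as done for all
## models except the linear models
dat.std <- as.data.frame(scale(dat))
round(colMeans(dat.std), 10)   # all numerically zero after scaling
```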
2 Information on Models Used
2.1 Linear Model
Linear regression is the most mature and widely used method. Although it is quite simple and intuitive, it sometimes has good predictive power. We can easily tell whether any model assumption is violated by looking at the diagnostic plots of the linear fit.
To make a linear model work best, it is necessary to specify the predictors manually. So we have to consider which transformations to apply, which interactions or polynomial orders to include and which to leave out. One easy way to do this is to fit a big model at the beginning and use stepwise selection by the AIC (or BIC) criterion1, which balances goodness of fit against model complexity.
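This strategy can be sketched as follows (the formula is illustrative only, not the model actually selected in Section 5):

```r
## Sketch: fit a big model first, then prune it by stepwise selection.
## k = log(n) gives the BIC penalty; the default k = 2 gives AIC.
big.fit <- lm(Titer ~ (DO + pHset + Stress + Xv0)^2, data = dat)
n <- nrow(dat)
bic.fit <- step(big.fit, direction = "both", k = log(n), trace = 0)
```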
To assess variable importance, a simple way is to look at p-values or t-values, provided the model assumptions are met.
2.2 Decision Tree
The decision tree is a scale-independent statistical model. It is easy to implement and handles interactions naturally. Its biggest advantage is that it can be visualized easily. See [5].
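A minimal sketch of fitting and visualizing such a tree with the rpart package (variable names as in our dataset, but the call is illustrative):

```r
## Sketch: a regression tree, plotted for interpretation
library(rpart)
tree.fit <- rpart(Titer ~ DO + pHset + Stress + Xv, data = dat)
plot(tree.fit); text(tree.fit)   # draw splits and leaf means
```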
2.3 Random Forest
All the information in this subsection is based on [4]. Random forest is an ensemble method that combines many decision trees, in each of which the data and variables are subsampled so that only part of them are used. Due to the implicit bootstrapping, the model suffers less from overfitting and has good predictive power.
Random forest has built-in mechanisms to estimate the importance of variables. The two measures are described in the following way:
• The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
• The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
The package randomForest has a function varImpPlot to plot the importance of variables easily.
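A sketch of obtaining both measures (the `importance = TRUE` flag is needed for the permutation measure; the data frame name is a placeholder):

```r
## Sketch: fit a random forest and plot both importance measures
library(randomForest)
rf.fit <- randomForest(Titer ~ ., data = dat, importance = TRUE)
varImpPlot(rf.fit)   # %IncMSE (permutation) and IncNodePurity
```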
2.4 MARS
All the information in this subsection is based on [3]. MARS, multivariate adaptive regression splines, is an adaptive extension of linear regression. The final model is a linear regression with terms like (xj − d)+, (d − xj)+ and higher-order interactions of such terms. The algorithm adds terms in a forward pass and then prunes the model back to where the GCV is minimized.
MARS can also estimate the importance of variables. From the earth vignette [3]:
• The nsubsets criterion counts the number of model subsets that include the variable. Variables that are included in more subsets are considered more important.
By ”subsets” we mean the subsets of terms generated by the pruning pass. There is one subset for each model size (from 1 to the size of the selected model) and the subset is the best set of terms for that model size. (These subsets are specified in $prune.terms in earth’s return value.) Only subsets that are smaller than or equal in size to the final model are used for estimating variable importance.
1See http://stat.ethz.ch/R-manual/R-devel/library/stats/html/step.html. For more information about BIC, see http://en.wikipedia.org/wiki/Bayesian_information_criterion
• The rss criterion first calculates the decrease in the RSS for each subset relative to the previous subset. (For multiple response models, RSS’s are calculated over all responses.) Then for each variable it sums these decreases over all subsets that include the variable. Finally, for ease of interpretation the summed decreases are scaled so the largest summed decrease is 100. Variables which cause larger net decreases in the RSS are considered more important.
• The gcv criterion is the same, but uses the GCV instead of the RSS. Adding a variable can increase the GCV, i.e., adding the variable has a deleterious effect on the model. When this happens, the variable could even have a negative total importance, and thus appear less important than unused variables.
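All three criteria are reported by the `evimp` function of the earth package; a minimal sketch (data frame name is a placeholder):

```r
## Sketch: variable importance from an earth (MARS) fit
library(earth)
mars.fit <- earth(Titer ~ ., data = dat)
evimp(mars.fit)   # nsubsets, gcv and rss criteria per variable
```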
2.5 Neural Network
The neural network is a flexible model which is, in a sense, an extension of linear regression. See en.wikipedia.org/wiki/Artificial_neural_network.
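A single-hidden-layer network for regression can be sketched with the nnet package (the size and decay values are illustrative tuning choices, not those used in our analysis):

```r
## Sketch: a one-hidden-layer network; linout = TRUE gives a linear
## output unit, appropriate for regression on standardized data
library(nnet)
nn.fit <- nnet(Titer ~ ., data = dat.std, size = 5, linout = TRUE,
               decay = 0.01, maxit = 500, trace = FALSE)
```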
3 Overview of Three Main Tasks
This statistical analysis comprises three main tasks, in each of which all three above-mentioned models were used, namely Linear Regression, Random Forest and MARS. Only for the third task was a different dataset used, in which missing Titer values are interpolated by a logistic function fitted within each run.
3.1 First Task: Blackbox Models
Maximizing the output of a useful product by controlling the experimental conditions is of primary interest. The first blackbox models are fitted using only the four controlled variables, namely DO, Stress, pHset and Xv0, as input variables to predict the output variable, Titer at day 10. Secondly, all other input variables at day 0 are used as input variables to predict Titer at day 10.
3.2 Second Task: Snapshot Models
In order to save the measurement cost of Titer, it is useful to predict the current value of Titer during an experiment using the current values of the input variables, which are relatively cheap to measure. The snapshot approach means each observation is considered to be independent, and input variables at day t are used to predict Titer at day t.
3.3 Third Task: History Models
The history approach can be regarded as an extension of both the blackbox models and the snapshot models. For the history models, not only the current values of the input variables at day t are considered in the model as predictors but also how they have changed over time in the past, that is, the history of the input variables. All Titer values in the future as well as the one at day t are to be predicted. In other words, an (i, j) history model uses input variables at days 0, 1, ..., i as predictors to predict the Titer value at day j (≥ i).
4 Model Comparison
In order to compare the prediction performance among the statistical models, cross-validation was used.
1. First, the data is randomly split into a training set and a test set by Run ID. That is, 30 runs are randomly sampled as a test set from among the 122 runs.
2. Then a model is fitted using the training set, the rest of the data.
3. MSE (mean squared error) is calculated in the test set.
4. This is repeated 5 times, and from the 5 resulting MSEs the RMSECV is calculated, which is defined as follows.

For blackbox or history models:

\[
\mathrm{RMSECV} = \sqrt{\frac{\sum_{k=1}^{5}\sum_{i=1}^{30}\bigl(y_i^{(k)} - \hat{y}_i^{(k)}\bigr)^2}{5 \cdot 30}}
\]

where $y_i^{(k)}$ and $\hat{y}_i^{(k)}$ are the true Titer value and the predicted one of the $i$th sample run in the $k$th cross-validation, respectively.
For snapshot models:

\[
\mathrm{RMSECV} = \sqrt{\frac{\sum_{k=1}^{5}\sum_{i=1}^{30}\sum_{j=1}^{n_i}\bigl(y_{ij}^{(k)} - \hat{y}_{ij}^{(k)}\bigr)^2}{5 \cdot 30 \cdot n_i}}
\]

where $y_{ij}^{(k)}$ and $\hat{y}_{ij}^{(k)}$ are the true Titer value and the predicted one of the $i$th sample run in the $k$th cross-validation at day $j$, respectively, and $n_i$ is the number of days observed in run $i$.
In case the target variable is transformed, the RMSECV is calculated based on the back-transformed values of both the true and the predicted values.
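The scheme above can be sketched in R as follows (a simplified sketch for a blackbox-style model; `dat` with a Run column and a Titer response is a placeholder):

```r
## Sketch of leave-30-runs-out cross-validation, repeated 5 times
set.seed(1)
runs <- unique(dat$Run)
sq.err <- c()
for (k in 1:5) {
  test.runs <- sample(runs, 30)                  # hold out 30 runs
  train <- subset(dat, !(Run %in% test.runs))
  test  <- subset(dat,   Run %in% test.runs)
  fit <- lm(Titer ~ . - Run, data = train)       # or any other model
  sq.err <- c(sq.err, (test$Titer - predict(fit, test))^2)
}
rmsecv <- sqrt(mean(sq.err))
```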
5 Blackbox Models
This is a prediction problem in the classical statistical setting. Before we use any statistical models, it is helpful to get an intuition of what the dataset looks like.
[Figure: "Final Titer in Different Runs without Transformation" — scatter plot of dat.simp$Titer (about 0.2 to 0.8) against dat.simp$Run (Run IDs about 600 to 1000).]
Since we are using only a small part of the raw data here, the number of observations is 122. If only prediction power is of interest, you can jump to Section 5.6.
5.1 Linear Model
5.1.1 Linear Model with 4 Controlled Variables
The following model was selected by the BIC model selection criterion. We can see that both DO and pHset exhibit a quadratic effect.
Call:
lm(formula = bb4lm$formula, data = if.data)
Residuals:
Min 1Q Median 3Q Max
-0.59583 -0.10260 -0.00799 0.10758 0.84827
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.695e-01 7.932e-02 -3.397 0.000952 ***
I((DO - 50)^2) -7.425e-04 8.316e-05 -8.929 1.18e-14 ***
log(Xv0) -3.114e-01 1.810e-01 -1.721 0.088092 .
I((pHset - 7.1)^2) -7.436e+00 9.961e-01 -7.465 2.17e-11 ***
I((DO - 50)^2):log(Xv0) 5.670e-04 1.714e-04 3.309 0.001270 **
I((DO - 50)^2):I((pHset - 7.1)^2) -1.852e-03 7.084e-04 -2.614 0.010202 *
log(Xv0):I((pHset - 7.1)^2) 8.604e+00 1.811e+00 4.750 6.23e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2011 on 109 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8462, Adjusted R-squared: 0.8377
F-statistic: 99.95 on 6 and 109 DF, p-value: < 2.2e-16
[1] 0.09764054
[Figure: diagnostic plots of the linear fit (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
The transformations for DO, Xv0 and pHset were chosen according to the termplot to give a reasonable fit.
> par(mfrow=c(2,2))
> termplot(bb4lmfit,partial.resid=TRUE)
[Figure: termplots with partial residuals for I((DO − 50)^2), log(Xv0) and I((pHset − 7.1)^2).]
5.1.2 Linear Model with All Variables
The following model was selected by the BIC model selection criterion. Apart from the 4 controlled variables, only pCO2, GLN and NH4 were selected.
Call:
lm(formula = bblmbic$formula, data = if.data)
Residuals:
Min 1Q Median 3Q Max
-0.86766 -0.09680 0.01636 0.09528 0.59879
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.559e+00 2.359e-01 -6.606 1.49e-09 ***
I((DO - 50)^2) -5.932e-04 4.454e-05 -13.319 < 2e-16 ***
pCO2 5.530e-03 2.287e-03 2.418 0.01726 *
GLN 1.264e-01 5.030e-02 2.513 0.01343 *
NH4 8.313e-02 2.345e-02 3.545 0.00058 ***
log(Xv0) 5.981e-01 1.377e-01 4.343 3.16e-05 ***
I((pHset - 7.1)^2) -5.336e+00 5.264e-01 -10.137 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2224 on 109 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8119, Adjusted R-squared: 0.8015
F-statistic: 78.41 on 6 and 109 DF, p-value: < 2.2e-16
[Figure: diagnostic plots of the linear fit with all variables (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
5.2 MARS
To allow for flexibility and nonlinear effects, we use MARS (multivariate adaptive regression splines).
5.2.1 MARS with 4 Controlled Variables
> source(file.path(cwd, "mars_simp.R"))
> summary(mars.con)
Call: earth(formula=Titer~DO+pHset+Xv+Stress, data=dat.simp,
trace=0)
coefficients
(Intercept) 0.69954039
h(-1.06901-DO) -0.22496222
h(DO-0.159614) -0.05639779
h(pHset-0.444749) -0.12694532
h(0.444749-pHset) -0.11450147
Selected 5 of 13 terms, and 2 of 4 predictors
Importance: DO, pHset, Xv-unused, Stress-unused
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 0.009779767 RSS 0.9944133 GRSq 0.6943894 RSq 0.7344234
> plot(mars.con, main = "Mars Using 4 Inputs.jpg")
[Figure: earth model selection and residual plots for the MARS fit with 4 inputs ("Mars Using 4 Inputs").]
5.2.2 MARS with All Variables
> summary(mars.simp)
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+
GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH, data=dat.simp,
trace=0)
coefficients
(Intercept) 0.56859816
h(-1.06901-DO) -0.23177287
h(DO-0.159614) -0.06620347
h(pHset-0.444749) -0.09650702
h(0.444749-pHset) -0.09711387
h(Via-0.355972) 0.33786922
h(1.41519-GLC) 0.14153165
h(-0.949741-LAC) 0.36807897
h(NH4- -0.263424) 0.10067628
h(-0.0456292-OSM) 0.35674962
Selected 10 of 24 terms, and 7 of 13 predictors
Importance: DO, pHset, NH4, Via, GLC, LAC, OSM, Xv-unused, Stress-unused, ...
Number of terms at each degree of interaction: 1 9 (additive model)
GCV 0.007086922 RSS 0.5955396 GRSq 0.7785388 RSq 0.84095
> plot(mars.simp, main = "Mars Using All Inputs.jpg")
[Figure: earth model selection and residual plots for the MARS fit with all inputs ("Mars Using All Inputs").]
Now we add the label (pCO2t) indicating whether or not CO2 is targeted to a certain level. As it turned out, this label is not important.
> summary(mars.simp2)
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+
GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+pCO2t,
data=dat.simp, trace=0)
coefficients
(Intercept) 0.80438564
h(DO-0.159614) -0.04110043
h(0.159614-DO) -0.09742652
h(pHset- -1.58274) -0.11389203
h(0.444749-pHset) -0.17420137
h(Via-0.340761) 0.29860268
h(-0.962971-LAC) 0.32408837
h(GLU- -1.50335) -0.81614339
h(GLU- -1.42405) 1.31071055
h(NH4- -0.292625) 0.10170289
h(-0.0456292-OSM) 0.52167298
Selected 11 of 27 terms, and 7 of 16 predictors
Importance: pHset, DO, NH4, OSM, GLU, Via, LAC, Xv-unused, Stress-unused, ...
Number of terms at each degree of interaction: 1 10 (additive model)
GCV 0.007404181 RSS 0.5975609 GRSq 0.7686247 RSq 0.8404102
> plot(mars.simp2,
+ main = "Mars Using All inputs and Additional CO2 Target Label.jpg")
[Figure: earth model selection and residual plots for the MARS fit with the additional CO2 target label ("Mars Using All Inputs and Additional CO2 Target Label").]
Fitting MARS, records 42, 47 and 91 are identified as outliers.
> dat.simp[c(42, 47, 91), ]
Run WD Ptime DO pCO2 pH Xv Via GLC
42 630 0 0 0.159614 0.1237602 1.287431 -1.580576 0.3932955 1.078003
47 635 0 0 1.388239 0.9739924 1.426846 -1.894227 0.4848425 1.317452
91 768 0 0 0.159614 0.5488763 1.357139 -1.645298 0.4860406 1.156190
LAC GLN GLU NH4 OSM Stress
42 -0.7579071 0.8434386 -1.671864 0.006689145 -0.045629237 -0.2096149
47 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617 -0.2096149
91 -0.9629707 0.3432373 -1.334834 1.247747682 0.145048862 -0.2096149
Xv0 pHset RunNo Discr.
42 0.3312174 0.03925127 BIOS-3.5L - 630 Standard conditions
47 -2.2207835 -2.39373696 BIOS-3.5L - 635 DO 70% / pH 6.70 / seeding 1 mio
91 -0.1953860 0.03925127 BIOS-3.5L - 768 WGE 15 - SE 3
Titer pCO2t
42 0.3510 0
47 0.0858 0
91 0.7650 0
Also, we can see the importance of the variables. The methods for importance estimation have been discussed in Section 2.
[Figure: "Importance of Variables" — MARS importance (nsubsets, sqrt gcv, sqrt rss) for DO, pHset, NH4, Via, GLC, LAC and OSM.]
5.3 Random Forest
Using Random Forest, we can easily see the importance of different variables.
5.3.1 Random Forest with 4 Controlled Variables
Call:
randomForest(formula = Titer ~ DO + pHset + Stress + Xv, data = dat.simp, importance = TRUE, mtry = 4)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 0.008918479
% Var explained: 71.66
5.3.2 Random Forest with All Variables
We can see that the ranking of variable importances is similar to that of MARS.
[Figure: random forest variable importance (IncNodePurity); from least to most important: Stress, pCO2, pH, GLU, GLC, GLN, Via, LAC, OSM, Xv, NH4, pHset, DO.]
> rf.simp
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH, data = dat.simp, mtry = 10)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.007168825
% Var explained: 77.22
5.4 Decision Tree
For completeness, we also include the decision tree model. It cannot outperform random forest, which is an extension of the decision tree, but it is easy to interpret.
5.4.1 Decision Tree with 4 Controlled Variables
n= 119
node), split, n, deviance, yval
* denotes terminal node
1) root 119 3.74435600 0.5739803
2) DO< -1.683324 13 0.10235720 0.2245385 *
3) DO>=-1.683324 106 1.85988900 0.6168364
6) pHset< -0.7717448 10 0.22043140 0.3785372 *
7) pHset>=-0.7717448 96 1.01244000 0.6416593
14) pHset>=0.6474983 8 0.02327288 0.5111250 *
15) pHset< 0.6474983 88 0.84046160 0.6535260
30) Stress< 0.3004491 79 0.70332710 0.6425430 *
31) Stress>=0.3004491 9 0.04395770 0.7499322 *
[Figure: the fitted regression tree with the 4 controlled variables.]
5.4.2 Decision Tree with All Variables
n= 119
node), split, n, deviance, yval
* denotes terminal node
1) root 119 3.74435600 0.5739803
2) DO< -1.683324 13 0.10235720 0.2245385 *
3) DO>=-1.683324 106 1.85988900 0.6168364
6) pHset< -0.7717448 10 0.22043140 0.3785372 *
7) pHset>=-0.7717448 96 1.01244000 0.6416593
14) NH4< 0.2658514 67 0.43613680 0.6091353
28) OSM>=-0.1028327 50 0.28442470 0.5858890
56) Via< 0.4785685 23 0.15613930 0.5504854 *
57) Via>=0.4785685 27 0.07489916 0.6160477 *
29) OSM< -0.1028327 17 0.04522386 0.6775066 *
15) NH4>=0.2658514 29 0.34168860 0.7168009
30) Xv< -1.573108 21 0.21895440 0.6899347
60) Xv>=-1.652965 12 0.13595400 0.6410000 *
61) Xv< -1.652965 9 0.01595145 0.7551809 *
31) Xv>=-1.573108 8 0.06778793 0.7873245 *
[Figure: the fitted regression tree with all variables.]
5.5 Neural Network
The neural network is comparatively hard to interpret. We include it here to compare its prediction accuracy with the other models using CV.
5.6 Cross Validation
Although many packages nowadays have built-in measures of test error, by implementing cross-validation ourselves we can compare the performance of the different models on the same ground. Here we use leave-30-runs-out cross-validation. Since it is a regression problem, the mean squared error is a good indicator of performance.
We can see the cross validation mean squared errors of different methods.
RMSECV of Different Models Using All Input Variables
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.078494 0.1060308 0.09046751 0.1237991 0.1362411
RMSECV of Different Models Using 4 Input Variables
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.09764054 0.117845 0.09327159 0.0972197 0.1150896
A surprising fact is that the more complicated models perform even worse; the linear model performs quite well.
6 Snapshot Models
6.1 Linear model
The following model was selected by the BIC model selection criterion. We can see that the number of selected variables is now much higher. In the snapshot model, the effects are too complex to be captured by just a few variables.
Call:
lm(formula = snlm$formula, data = ccFdata)
Residuals:
Min 1Q Median 3Q Max
-0.63033 -0.09694 0.00396 0.11494 0.48256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.492e+00 1.705e+00 0.875 0.381944
I((DO - 50)^2) 9.907e-04 3.332e-04 2.973 0.003066 **
pCO2 1.522e-01 2.442e-02 6.233 8.54e-10 ***
I((pH - 7.1)^2) 3.698e+01 4.867e+00 7.598 1.14e-13 ***
log(Xv) 7.013e-01 1.506e-01 4.658 3.93e-06 ***
Via -3.025e-02 9.257e-03 -3.268 0.001144 **
GLC -2.973e-01 1.400e-01 -2.123 0.034175 *
LAC -2.858e-01 9.404e-02 -3.039 0.002479 **
GLN 5.613e-01 5.873e-02 9.557 < 2e-16 ***
GLU -1.950e+00 3.485e-01 -5.596 3.31e-08 ***
NH4 -1.106e-01 2.415e-02 -4.580 5.64e-06 ***
OSM -8.029e-03 3.901e-03 -2.058 0.039999 *
Stress -5.076e-02 1.549e-02 -3.276 0.001113 **
I((DO - 50)^2):pCO2 1.418e-05 4.040e-06 3.510 0.000482 ***
I((DO - 50)^2):Via -8.782e-06 2.193e-06 -4.005 6.98e-05 ***
I((DO - 50)^2):GLC 8.047e-05 1.697e-05 4.741 2.65e-06 ***
I((DO - 50)^2):GLU -1.739e-04 3.561e-05 -4.883 1.34e-06 ***
I((DO - 50)^2):NH4 -6.375e-05 2.582e-05 -2.469 0.013824 *
pCO2:Via -6.531e-04 1.354e-04 -4.824 1.78e-06 ***
pCO2:GLC 3.431e-03 6.722e-04 5.105 4.44e-07 ***
pCO2:GLN -6.369e-03 1.483e-03 -4.293 2.05e-05 ***
pCO2:OSM -2.974e-04 5.367e-05 -5.540 4.50e-08 ***
I((pH - 7.1)^2):LAC 2.711e+00 3.745e-01 7.240 1.36e-12 ***
I((pH - 7.1)^2):NH4 6.535e-01 2.225e-01 2.937 0.003435 **
I((pH - 7.1)^2):OSM -1.449e-01 1.812e-02 -7.997 6.46e-15 ***
log(Xv):GLC 1.177e-01 1.607e-02 7.323 7.75e-13 ***
log(Xv):GLU -2.613e-01 3.265e-02 -8.000 6.29e-15 ***
Via:GLC -4.150e-03 8.275e-04 -5.015 6.97e-07 ***
Via:GLU 1.534e-02 1.466e-03 10.463 < 2e-16 ***
Via:Stress 6.505e-04 1.614e-04 4.031 6.26e-05 ***
GLC:LAC -2.587e-02 7.330e-03 -3.529 0.000448 ***
GLC:NH4 1.571e-02 4.848e-03 3.241 0.001255 **
GLC:OSM 9.333e-04 3.859e-04 2.419 0.015863 *
LAC:NH4 -4.049e-02 8.656e-03 -4.678 3.58e-06 ***
LAC:OSM 1.038e-03 2.115e-04 4.911 1.16e-06 ***
GLN:Stress -4.219e-03 1.587e-03 -2.658 0.008059 **
GLU:OSM 4.736e-03 8.420e-04 5.625 2.83e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1743 on 608 degrees of freedom
(740 observations deleted due to missingness)
Multiple R-squared: 0.9526, Adjusted R-squared: 0.9498
F-statistic: 339.2 on 36 and 608 DF, p-value: < 2.2e-16
[1] 0.07678454
[Figure: diagnostic plots of the snapshot linear model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
6.2 Random Forest
The importance ranking of the variables is now quite different from that of the black-box model. GLU, LAC, NH4 and GLN can be used together to estimate Titer at a given time, which gives a good “snapshot” of the process state.
[1] "Titer~DO+pCO2+Via+GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+Xv"
Call:
randomForest(formula = as.formula(snrf$formula), data = ccFdata, na.action = na.omit)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 0.003390071
% Var explained: 92.32
[1] 0.06970167
[Figure: random forest variable importance (IncNodePurity) for snrffit; from most to least important: GLU, LAC, GLN, Xv, pCO2, Via, NH4, OSM, pH, GLC, DO, Stress.]
6.3 MARS
MARS has a variable importance ranking similar to that of the random forest.
Call: earth(formula=as.formula(snmars$formula), data=subset(ccFdata,
!is.na(ccFdata$Titer)))
coefficients
(Intercept) 0.52558053
h(DO-70) 0.00384701
h(70-DO) 0.00081744
h(39-pCO2) -0.00387669
h(Via-95.1825) -0.01859399
h(1.64-GLC) -0.04381892
h(3.14-LAC) 0.06892429
h(GLN-2.09) 0.17083241
h(2.09-GLN) -0.07104502
h(GLU-4.49) 0.11840427
h(4.49-GLU) -0.13551714
h(NH4-5.19) -0.01831231
h(5.19-NH4) 0.04728241
h(OSM-305) 0.00128608
h(305-OSM) 0.00120199
h(Stress-21) -0.00161051
h(21-Stress) -0.00473324
h(7.11-pH) -0.20785734
h(Xv-2.96) -0.01598677
h(2.96-Xv) -0.10513829
Selected 20 of 25 terms, and 12 of 12 predictors
Importance: GLU, LAC, NH4, GLN, Xv, OSM, pH, Stress, Via, pCO2, DO, GLC
Number of terms at each degree of interaction: 1 19 (additive model)
GCV 0.004576446 RSS 2.605637 GRSq 0.8966136 RSq 0.9084545
[1] 0.07026124
[Figure: MARS model diagnostics — model-selection plot (GRSq/RSq vs. number of terms), cumulative distribution of absolute residuals, Residuals vs Fitted and Normal Q−Q plots (observations 586, 10 and 582 flagged), and the variable importance plot.]
7 History Model (Naive Way)
Let’s first use the input data only at time t and treat t simply as another parameter. Again, the target here is the Titer on day 10. We explore different models.
7.1 Linear Regression
Since we now have around 1300 observations, we can fit a model with more parameters. After this we can perform model selection using step.
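As an aside, the idea behind step can be sketched outside of R as well. The following Python fragment is illustrative only — the data and variable names are made up, not our dat.agg — and implements a simplified, forward-only version of stepwise selection by AIC on an ordinary least squares fit:

```python
import numpy as np

def ols_aic(X, y):
    """Fit OLS by least squares and return the Gaussian AIC."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1            # coefficients + error variance
    return n * np.log(rss / n) + 2 * k

def forward_step(candidates, y):
    """Greedy forward selection: add the column that lowers AIC most."""
    n = len(y)
    chosen, pool = [np.ones((n, 1))], dict(candidates)  # start from intercept
    best = ols_aic(np.hstack(chosen), y)
    while pool:
        scores = {name: ols_aic(np.hstack(chosen + [col]), y)
                  for name, col in pool.items()}
        name, aic = min(scores.items(), key=lambda kv: kv[1])
        if aic >= best:           # no candidate improves AIC: stop
            break
        best = aic
        chosen.append(pool.pop(name))
        yield name

# toy data: y depends on x1 only; x2 is pure noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=(200, 1))
x2 = rng.normal(size=(200, 1))
y = 2 * x1[:, 0] + rng.normal(scale=0.1, size=200)
selected = list(forward_step({"x1": x1, "x2": x2}, y))
print(selected)
```

R's step additionally considers dropping terms; this sketch only adds them.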
Call:
lm(formula = Titer ~ I(DO^2) + DO + pHset + I(pHset^2) + Via +
poly(GLC, 3) + poly(LAC, 3) + GLN + GLU + NH4 + OSM + poly(Ptime,
3) + DO:pHset + I(pHset^2):poly(Ptime, 3) + Via:poly(Ptime,
3) + poly(LAC, 3):poly(Ptime, 3) + GLN:poly(Ptime, 3) + GLU:poly(Ptime,
3) + NH4:poly(Ptime, 3) + OSM:poly(Ptime, 3), data = dat.agg)
Residuals:
Min 1Q Median 3Q Max
-0.296451 -0.047078 0.002983 0.044872 0.190200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.049e+00 5.200e-01 -2.017 0.044626 *
I(DO^2) -3.644e-02 6.218e-03 -5.860 1.26e-08 ***
DO 1.747e-02 7.374e-03 2.369 0.018520 *
pHset 9.137e-03 7.090e-03 1.289 0.198499
I(pHset^2) -3.318e-02 5.477e-03 -6.058 4.31e-09 ***
Via 1.341e-01 2.815e-02 4.763 3.02e-06 ***
poly(GLC, 3)1 -8.356e-01 1.905e-01 -4.387 1.62e-05 ***
poly(GLC, 3)2 -2.459e-01 1.455e-01 -1.690 0.092150 .
poly(GLC, 3)3 1.980e-01 1.425e-01 1.389 0.165837
poly(LAC, 3)1 -1.572e+02 4.936e+01 -3.184 0.001611 **
poly(LAC, 3)2 -1.203e+02 3.832e+01 -3.138 0.001878 **
poly(LAC, 3)3 -3.211e+01 1.041e+01 -3.084 0.002244 **
GLN 3.912e-02 8.812e-03 4.439 1.29e-05 ***
GLU 5.716e-02 2.163e-02 2.642 0.008686 **
NH4 5.605e-02 7.730e-03 7.252 3.82e-12 ***
OSM -2.508e-02 2.934e-02 -0.855 0.393294
poly(Ptime, 3)1 2.465e+01 7.910e+00 3.117 0.002013 **
poly(Ptime, 3)2 -4.044e+01 1.110e+01 -3.642 0.000321 ***
poly(Ptime, 3)3 5.017e+00 1.286e+00 3.903 0.000119 ***
DO:pHset 5.555e-03 2.814e-03 1.974 0.049324 *
I(pHset^2):poly(Ptime, 3)1 4.147e-01 8.829e-02 4.697 4.10e-06 ***
I(pHset^2):poly(Ptime, 3)2 1.152e-02 4.871e-02 0.236 0.813219
I(pHset^2):poly(Ptime, 3)3 -2.443e-01 1.301e-01 -1.878 0.061423 .
Via:poly(Ptime, 3)1 -1.151e+00 4.468e-01 -2.576 0.010507 *
Via:poly(Ptime, 3)2 2.055e+00 5.664e-01 3.628 0.000338 ***
Via:poly(Ptime, 3)3 -5.371e-01 3.308e-01 -1.624 0.105528
poly(LAC, 3)1:poly(Ptime, 3)1 2.404e+03 7.520e+02 3.197 0.001546 **
poly(LAC, 3)2:poly(Ptime, 3)1 1.869e+03 5.845e+02 3.199 0.001536 **
poly(LAC, 3)3:poly(Ptime, 3)1 5.054e+02 1.594e+02 3.172 0.001680 **
poly(LAC, 3)1:poly(Ptime, 3)2 -3.718e+03 1.095e+03 -3.395 0.000784 ***
poly(LAC, 3)2:poly(Ptime, 3)2 -2.955e+03 8.698e+02 -3.398 0.000776 ***
poly(LAC, 3)3:poly(Ptime, 3)2 -8.601e+02 2.569e+02 -3.347 0.000924 ***
poly(LAC, 3)1:poly(Ptime, 3)3 3.446e+02 1.065e+02 3.236 0.001353 **
poly(LAC, 3)2:poly(Ptime, 3)3 2.778e+02 8.433e+01 3.294 0.001110 **
poly(LAC, 3)3:poly(Ptime, 3)3 8.598e+01 2.539e+01 3.387 0.000805 ***
GLN:poly(Ptime, 3)1 2.287e-01 1.479e-01 1.547 0.123014
GLN:poly(Ptime, 3)2 -1.576e-01 1.730e-01 -0.911 0.363026
GLN:poly(Ptime, 3)3 -2.659e-01 1.555e-01 -1.710 0.088273 .
GLU:poly(Ptime, 3)1 -1.712e-01 3.530e-01 -0.485 0.628013
GLU:poly(Ptime, 3)2 -8.317e-01 4.445e-01 -1.871 0.062348 .
GLU:poly(Ptime, 3)3 -5.699e-01 2.756e-01 -2.068 0.039551 *
NH4:poly(Ptime, 3)1 -5.983e-01 1.399e-01 -4.278 2.57e-05 ***
NH4:poly(Ptime, 3)2 -3.588e-01 1.453e-01 -2.469 0.014148 *
NH4:poly(Ptime, 3)3 1.258e-01 1.830e-01 0.687 0.492460
OSM:poly(Ptime, 3)1 1.639e+00 3.637e-01 4.507 9.58e-06 ***
OSM:poly(Ptime, 3)2 1.621e+00 6.580e-01 2.464 0.014321 *
OSM:poly(Ptime, 3)3 2.464e-01 1.771e-01 1.391 0.165379
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07481 on 288 degrees of freedom
Multiple R-squared: 0.8548, Adjusted R-squared: 0.8317
F-statistic: 36.87 on 46 and 288 DF, p-value: < 2.2e-16
[Figure: diagnostic plots for the linear model — Residuals vs Fitted, Normal Q−Q and Scale−Location (observations 104, 258 and 323 flagged), and Residuals vs Leverage with Cook's distance contours (observations 135, 136 and 196 flagged).]
The linear model does not capture all the effects. We can identify some outliers.
> (observations <- unique(dat.agg[which(abs(lm.agg$residuals) > 0.2), "Run"]))
[1] 630 770
7.2 MARS
We perform model fitting and variable selection with MARS.
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+
GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+pdiff+
Ptime, data=dat.agg, trace=0, ncross=3, nfold=10)
coefficients
(Intercept) 0.74650082
h(DO- -1.06901) -0.03339099
h(-1.06901-DO) -0.22191908
h(pHset-0.444749) -0.12856755
h(0.444749-pHset) -0.10199167
h(0.580131-Xv) -0.03825405
h(0.810513-Stress) -0.09597828
h(pCO2- -0.471402) 0.01252083
h(Via-0.390253) 0.18390409
h(GLC-1.1171) -0.05487600
h(GLN- -0.203858) 0.02302284
h(-0.203858-GLN) 0.03187502
h(NH4- -0.0225122) 0.15374637
h(NH4-1.01414) -0.17152919
h(pH- -0.664392) 0.04304223
h(pH-1.70568) -0.23513685
Selected 16 of 26 terms, and 10 of 15 predictors
Importance: DO, pHset, NH4, Stress, Xv, pH, GLC, GLN, Via, pCO2, ...
Number of terms at each degree of interaction: 1 15 (additive model)
GCV 0.006192977 RSS 1.708448 GRSq 0.8142985 RSq 0.8461598 cv.rsq 0.7821464
Note: the cross-validation sd's below are standard deviations across folds
Cross validation: nterms 16.90 sd 1.88 nvars 9.33 sd 0.92
cv.rsq sd MaxErr sd
0.78 0.069 -0.29 0.19
[Figure: MARS diagnostics — variable importance plot, model-selection plot (GRSq/RSq vs. number of terms), cumulative distribution of absolute residuals, Residuals vs Fitted and Normal Q−Q plots (observations 104, 119 and 323 flagged).]
Let us take a look at the outliers.
> dat.agg[c(104, 119, 323), ]
Run WD Ptime DO pCO2 pH Xv Via
104 630 0 0.0000000 0.159614 0.1237602 1.287431 -1.580576 0.3932955
119 635 0 0.0000000 1.388239 0.9739924 1.426846 -1.894227 0.4848425
323 1066 1 0.7791667 0.159614 -1.0665650 1.217723 -1.525812 0.1411053
GLC LAC GLN GLU NH4 OSM
104 1.0780028 -0.7579071 0.8434386 -1.671864 0.006689145 -0.045629237
119 1.3174515 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617
323 0.6821795 -0.4139295 0.2494495 -1.790816 0.262201197 0.068777622
Stress Xv0 pHset RunNo
104 -0.2096149 0.3312174 0.03925127 BIOS-3.5L - 630
119 -0.2096149 -2.2207835 -2.39373696 BIOS-3.5L - 635
323 -0.2096149 -1.7751961 0.03925127 TACI-3L-1066
Discr. Titer pdiff pCO2t
104 Standard conditions 0.3510000 0.6432812 0
119 DO 70% / pH 6.70 / seeding 1 mio 0.0858000 0.6542345 0
323 STD condition (Control - no loop) 0.8432867 0.5062897 0
7.3 Random Forest
The parameter mtry = 10 is determined by tuneRF to optimize performance.
> library(randomForest)
> rf.agg <- randomForest(Titer ~ DO + pHset + Xv + Stress+ pCO2 + Via
+ + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH
+ + pdiff + Ptime, data = dat.agg, importance = TRUE, mtry = 10)
> rf.agg
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH + pdiff + Ptime, data = dat.agg, importance = TRUE, mtry = 10)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 0.005391989
% Var explained: 83.73
[Figure: OOB error of rf.agg vs. number of trees, and the two variable importance plots (%IncMSE and IncNodePurity); both rank DO, pHset, NH4 and LAC highest.]
In the two importance plots, the 7 most important variables are the same.
7.4 Decision Tree
plotcp is used to determine cp for the decision tree.
> library(rpart)
> tr.agg <- rpart(Titer ~ DO + pHset + Stress + Xv + pCO2
+ + Via + GLC + LAC + GLN + GLU + NH4 + OSM
+ + Stress + pH + pdiff + Ptime, data=dat.agg)
> (tr.agg <- prune(tr.agg, cp = 0.019))
n= 335
node), split, n, deviance, yval
* denotes terminal node
1) root 335 11.10534000 0.5719146
2) DO< -1.683324 39 0.30707160 0.2245385 *
3) DO>=-1.683324 296 5.47207100 0.6176837
6) pHset< -1.988239 18 0.17149480 0.2748954 *
7) pHset>=-1.988239 278 3.04856000 0.6398787
14) NH4< 0.3899572 202 1.52384000 0.6085717
28) LAC>=0.002812559 9 0.02199422 0.4344444 *
29) LAC< 0.002812559 193 1.21623800 0.6166916 *
15) NH4>=0.3899572 76 0.80050930 0.7230893 *
> plot(tr.agg)
> text(tr.agg)
[Figure: pruned regression tree, splitting on DO < −1.683, pHset < −1.988, NH4 < 0.39 and LAC >= 0.002813, with fitted Titer values 0.2245, 0.2749, 0.4344, 0.6167 and 0.7231 at the leaves.]
7.5 Neural Network
> library(nnet)
> nn.agg <- nnet(Titer ~ DO + pHset + Xv + Stress + pCO2 + Via
+ + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH + pdiff,
+ data = dat.agg, size = 4, skip = TRUE, decay = 4e-4,
+ lin.out = FALSE, maxit = 2000, trace = FALSE)
> nn.agg
a 14-4-1 network with 79 weights
inputs: DO pHset Xv Stress pCO2 Via GLC LAC GLN GLU NH4 OSM pH pdiff
output(s): Titer
options were - skip-layer connections decay=4e-04
7.6 Cross Validation
Because of the nature of our data, it is better to leave a whole run of the experiment out when performing cross validation. We still use the (root) mean squared error to assess performance.
Overall, the performance is better than in the first task.
RMSECVs of Different Models
Linear Model Mars Random Forest Decision Tree Neural Network
[1,] 0.09515759 0.09222569 0.07226036 0.1000236 0.09316403
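To make the scheme explicit, here is a small language-neutral sketch of leave-one-run-out cross validation (illustrative Python; the run identifiers, values and the trivial mean-predictor “model” are made up, not our data or models):

```python
import math
from collections import defaultdict

# toy long-format data: (run_id, feature, titer); all values are made up
data = [
    ("run_A", 1.0, 0.30), ("run_A", 2.0, 0.35), ("run_A", 3.0, 0.40),
    ("run_B", 1.5, 0.50), ("run_B", 2.5, 0.55),
    ("run_C", 1.2, 0.20), ("run_C", 2.2, 0.25), ("run_C", 3.2, 0.30),
]

def leave_one_run_out_rmse(data, fit, predict):
    """Hold out all rows of one run at a time — a run is never split
    between training and test — and pool the squared errors."""
    runs = defaultdict(list)
    for row in data:
        runs[row[0]].append(row)
    sq_errors = []
    for held_out in runs:
        train = [r for rid, rows in runs.items() if rid != held_out
                 for r in rows]
        model = fit(train)
        sq_errors += [(predict(model, r) - r[-1]) ** 2
                      for r in runs[held_out]]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# trivially simple "model": predict the mean training Titer
fit = lambda train: sum(r[-1] for r in train) / len(train)
predict = lambda model, row: model
rmse_cv = leave_one_run_out_rmse(data, fit, predict)
print(round(rmse_cv, 4))
```

Splitting within a run would leak information, since measurements from the same run are strongly correlated; that is why whole runs are held out.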
8 History Models
Here we fit history models and perform CV on them to measure their predictive power. The input of a history model always includes the 2 strictly controlled variables that are kept constant: DO and Stress. For a history model using the history up to day i, we additionally use the 10 variables pCO2, Xv, Via, GLC, pH, LAC, GLN, GLU, NH4, OSM from day 0 to day i. This gives 2 + 10 * (i + 1) input variables.
We do this by unfolding the raw data. In case of missing values, we impute them by applying missForest to the unfolded historical input values from day 0 to day i. This is done in the function ArrHist.
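The unfolding step itself is simple bookkeeping; the following sketch is illustrative Python — the column names and values are placeholders, not the actual data set:

```python
def unfold_history(records, monitored, up_to_day):
    """Turn the long per-day table into one wide row per run, with
    day-indexed copies of each monitored variable up to `up_to_day`."""
    runs = {}
    for rec in records:
        runs.setdefault(rec["run"], {})[rec["day"]] = rec
    wide = []
    for run, days in sorted(runs.items()):
        row = {"run": run}
        for day in range(up_to_day + 1):
            for var in monitored:
                # None marks a missing day/value, to be imputed afterwards
                row[f"{var}_{day}"] = days.get(day, {}).get(var)
        wide.append(row)
    return wide

# toy long-format records (made-up numbers)
records = [
    {"run": 539, "day": 0, "GLC": 7.00, "LAC": 0.67},
    {"run": 539, "day": 1, "GLC": 5.95, "LAC": 1.52},
    {"run": 540, "day": 0, "GLC": 6.80, "LAC": 0.70},
]
wide = unfold_history(records, ["GLC", "LAC"], up_to_day=1)
print(wide[0])
```

Each wide row then serves as one observation for the day-i history model, with the None entries filled in by the imputation step.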
We produce 2 tables for each statistical model: one is the table of variance explained in CV, the other is the RMSECV. We calculate the variance explained as 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)² in each run of cross validation and average over the runs to get the final result. Similarly, the RMSECV is also the mean over all cross validation runs.
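Both criteria are straightforward to compute; for concreteness, a small illustrative Python sketch with toy numbers (not taken from our data):

```python
import math

def variance_explained(y, y_hat):
    """1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def rmse(y, y_hat):
    """Root mean squared error of the predictions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

y     = [0.30, 0.35, 0.40, 0.50]   # toy observed Titer values
y_hat = [0.32, 0.33, 0.41, 0.48]   # toy predictions
print(round(variance_explained(y, y_hat), 3), round(rmse(y, y_hat), 4))
```

In our tables, each entry is the average of these per-fold values over all cross validation runs.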
Now we see how the models perform under this setting.
8.1 Linear model
The (i, j)th history linear model is defined as follows:

log(Titer_j) = β₀ + Σ_{t=0}^{i} [ β_{1,t} (DO_t − 50)² + β_{2,t} pCO2_t
                                + β_{3,t} (pH_t − 7.1)² + β_{4,t} log(Xv_t) + β_{5,t} Via_t
                                + β_{6,t} GLC_t + β_{7,t} LAC_t + β_{8,t} GLN_t
                                + β_{9,t} GLU_t + β_{10,t} NH4_t + β_{11,t} OSM_t + β_{12,t} Stress_t ]
And the following table shows the RMSECV values. The RMSECV value of the (i, j)th model is the
number at the ith row and the jth column.
RMSECV
Y0 Y1 Y2 Y3 Y4
X0 3.200970e-03 1.962524e-03 4.279477e-03 2.301885e-03 2.446558e-03
X0_1 1.545867e-02 1.659791e-02 2.354038e-03 1.230346e-03
X0_2 1.730547e-02 4.075473e-03 1.085479e-03
X0_3 4.041385e-03 2.398019e-03
X0_4 3.482082e-03
X0_5
X0_6
X0_7
X0_8
X0_9
X0_10
Y5 Y6 Y7 Y8 Y9
X0 2.051734e-03 3.416656e-03 6.081081e-03 7.219461e-03 1.941174e-02
X0_1 2.327897e-03 3.313077e-03 6.668536e-03 6.489212e-03 1.036755e-02
X0_2 3.283539e-03 1.628859e-03 5.512231e-03 3.368808e-03 7.914411e-03
X0_3 2.395535e-03 1.284514e-03 2.544108e-03 2.998240e-03 5.437181e-03
X0_4 1.389313e-03 1.658926e-03 2.458271e-03 2.497301e-03 1.091136e-02
X0_5 2.656460e-03 2.030364e-03 2.326792e-03 2.966961e-03 6.045417e-03
X0_6 2.639904e-03 2.232001e-03 3.808531e-03 1.056563e-02
X0_7 1.251427e-02 6.530875e-03 1.346802e-02
X0_8 3.184551e+02 7.913746e+13
X0_9 4.492133e+02
X0_10
Y10
X0 1.961133e-02
X0_1 5.580546e-02
X0_2 1.869011e-02
X0_3 1.359189e-02
X0_4 7.909511e-03
X0_5 1.306489e-02
X0_6 1.450043e-02
X0_7 3.412453e-02
X0_8 1.202324e+11
X0_9 1.613414e+00
X0_10 1.289649e+00
[Figure: RMSECV heat map for history models based on the linear model (history used X0–X0_10 vs. target day Y0–Y10).]
8.2 Random Forest
Variance explained in CV
Y0 Y1 Y2 Y3 Y4 Y5
X0 0.28324661 0.26155005 -0.09489543 -0.03088496 0.39081454 0.43354914
X0_1 0.37598610 0.21074302 0.35566190 0.31553364 0.37313831
X0_2 0.16472940 0.36469677 0.44168895 0.40601249
X0_3 0.25131256 0.35404242 0.51706127
X0_4 0.37300694 0.56438519
X0_5 0.54745573
X0_6
X0_7
X0_8
X0_9
X0_10
Y6 Y7 Y8 Y9 Y10
X0 0.44032714 0.55857068 0.57889201 0.64090502 0.55601625
X0_1 0.53608831 0.50026635 0.55488094 0.72923139 0.71702833
X0_2 0.69563541 0.67904955 0.69803823 0.66081810 0.71536101
X0_3 0.66675711 0.72303901 0.71583577 0.71933173 0.77114221
X0_4 0.77659188 0.77304970 0.77028128 0.80331752 0.74513898
X0_5 0.79546288 0.79299697 0.81030901 0.73912138 0.78821029
X0_6 0.76082620 0.76023416 0.79248812 0.82078319 0.79484015
X0_7 0.81875451 0.80675983 0.82533617 0.81928950
X0_8 0.86596990 0.82482254 0.80067588
X0_9 0.87802758 0.84462589
X0_10 0.81461628
RMSECV
Y0 Y1 Y2 Y3 Y4 Y5
X0 0.04226597 0.05005782 0.05790429 0.05367666 0.04880549 0.05140720
X0_1 0.03811918 0.05222506 0.03665541 0.04491827 0.04562487
X0_2 0.04978004 0.04001816 0.04554554 0.04125988
X0_3 0.04335484 0.03659673 0.04576400
X0_4 0.04272663 0.03664301
X0_5 0.03671531
X0_6
X0_7
X0_8
X0_9
X0_10
Y6 Y7 Y8 Y9 Y10
X0 0.06260631 0.07307958 0.08436972 0.09963318 0.10319883
X0_1 0.05939476 0.07320302 0.08885598 0.07992313 0.08665985
X0_2 0.04182959 0.06069709 0.06802562 0.09828617 0.08913791
X0_3 0.05082051 0.04949298 0.06647324 0.07765117 0.08482019
X0_4 0.03694516 0.05087756 0.05854120 0.06370206 0.08794341
X0_5 0.04183240 0.04704620 0.05592446 0.07192094 0.07872577
X0_6 0.03919570 0.05291936 0.05561075 0.07081846 0.07979003
X0_7 0.04348533 0.05150214 0.06694091 0.06891756
X0_8 0.04623633 0.06832347 0.07381787
X0_9 0.05664780 0.07072329
X0_10 0.06890835
[Figure: RMSECV heat map for history models based on the random forest (history used X0–X0_10 vs. target day Y0–Y10).]
8.3 MARS
Variance explained in CV
Y0 Y1 Y2 Y3 Y4
X0 0.069814132 -0.005095947 0.024849356 -0.034696230 -0.120319657
X0_1 0.007649867 -0.396661134 -0.275546932 -0.033114509
X0_2 -0.827848243 -0.204988340 -1.123241366
X0_3 -0.239974296 0.267732287
X0_4 -1.130404725
X0_5
X0_6
X0_7
X0_8
X0_9
X0_10
Y5 Y6 Y7 Y8 Y9
X0 0.363899116 0.330268801 0.508385257 0.305851870 0.599098171
X0_1 -0.150206083 0.147392088 0.375098045 0.089004911 0.342867469
X0_2 0.287235896 0.651413305 0.651947560 0.742131695 0.616545382
X0_3 0.233978160 0.517310655 0.669245506 0.804495843 0.768104648
X0_4 0.397724415 0.704167262 0.821355319 0.839638561 0.836162082
X0_5 0.512508645 0.686174900 0.814525758 0.846335693 0.780111923
X0_6 0.846201740 0.823482389 0.841185054 0.823600595
X0_7 0.763357781 0.908825907 0.814810324
X0_8 0.850118872 0.714621386
X0_9 0.598640731
X0_10
Y10
X0 -0.266671250
X0_1 0.099612543
X0_2 0.296458653
X0_3 0.673287674
X0_4 0.681350544
X0_5 0.706625242
X0_6 0.812505473
X0_7 0.789365935
X0_8 0.657600818
X0_9 0.742572597
X0_10 0.830697666
RMSECV
Y0 Y1 Y2 Y3 Y4 Y5
X0 0.05414481 0.06435345 0.04914698 0.04440181 0.04899524 0.04751189
X0_1 0.06349293 0.07040039 0.07104746 0.04787156 0.06328275
X0_2 0.08353868 0.05609517 0.06006556 0.05632823
X0_3 0.05677938 0.05112222 0.05670555
X0_4 0.07399238 0.04632260
X0_5 0.04234317
X0_6
X0_7
X0_8
X0_9
X0_10
Y6 Y7 Y8 Y9 Y10
X0 0.06136661 0.06598322 0.10313729 0.10811682 0.16777832
X0_1 0.06212712 0.09116696 0.10869171 0.12338118 0.17132883
X0_2 0.05033973 0.05957265 0.06570322 0.09632618 0.12433817
X0_3 0.05168125 0.06504277 0.06069211 0.07182945 0.10263122
X0_4 0.03954150 0.04350307 0.05051779 0.06939214 0.09432636
X0_5 0.04101469 0.04404178 0.04560743 0.06953788 0.08301654
X0_6 0.02986677 0.04677269 0.05327379 0.07087714 0.07679419
X0_7 0.04676462 0.04301543 0.07409887 0.07206619
X0_8 0.04281923 0.07792298 0.09770436
X0_9 0.09445278 0.08202713
X0_10 0.07658425
[Figure: RMSECV heat map for history models based on MARS (history used X0–X0_10 vs. target day Y0–Y10).]
9 Evaluation of Results
All models perform reasonably well. Random forest has the most stable performance, which is expected; it is also easy to implement.
There are some tuning parameters to consider, e.g., the number of trees and the sample size used for resampling. However, the default settings work well most of the time, and even if the number of trees is too large it is not a big problem.
Linear regression also does OK. Its performance relies heavily on variable selection, interaction specification and transformations, which is why the implementation of linear regression is not as straightforward as for the other models. A compromise is to fit a large model at the beginning and use stepwise selection to reduce the number of parameters. A more recent approach would be penalized linear regression, such as the lasso or ridge regression. See [2].
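For illustration of the penalized approach just mentioned, a minimal ridge regression sketch using the closed-form solution (toy data, not the model used in this project):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: beta = (X'X + lam*I)^{-1} X'y.
    The intercept is left unpenalized by zeroing its diagonal entry."""
    n, p = X.shape
    Xc = np.hstack([np.ones((n, 1)), X])   # prepend intercept column
    penalty = lam * np.eye(p + 1)
    penalty[0, 0] = 0.0                    # do not shrink the intercept
    return np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ y)

# toy data: only the first two of five predictors matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

b_small = ridge_fit(X, y, lam=0.1)
b_big   = ridge_fit(X, y, lam=1000.0)
# a heavier penalty shrinks the slope coefficients toward zero
print(np.abs(b_small[1:]).sum() > np.abs(b_big[1:]).sum())
```

The lasso replaces the squared penalty with an absolute-value one, which sets some coefficients exactly to zero and thus performs variable selection automatically.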
MARS is easy to implement, because it adjusts to nonlinearity automatically. Here its prediction power is not as good as that of the other two, which might result from overfitting. It is possible to tune its parameters. See [3].
References
[1] Daniel J. Stekhoven. Using the missForest package, 2011. Available from http://stat.ethz.ch/education/semesters/ss2012/ams/paper/missForest_1.2.pdf.
[2] Peter Bühlmann and Martin Mächler. Computational Statistics, 2014. Available from http://stat.ethz.ch/education/semesters/ss2014/CompStat/sk.pdf.
[3] Stephen Milborrow. Notes on the earth package, 2014. Available from http://cran.r-project.org/web/packages/earth/vignettes/earth-notes.pdf.
[4] Fortran original by Leo Breiman and Adele Cutler, R port by Andy Liaw and Matthew Wiener. Package ‘randomForest’, 2012. Available from http://stat-www.berkeley.edu/users/breiman/RandomForests.
[5] Terry M. Therneau and Elizabeth J. Atkinson, Mayo Foundation. An Introduction to Recursive Partitioning Using the RPART Routines, 2013. Available from http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf.