
Process Design and Optimization of Bioprocesses with Quality by

Design Approach

Kyunghee Cho, Yunhao He

June 11, 2014

1 Introduction

In this project, we work with data from a pharmaceutical company that uses a new biochemical method to produce drugs. The ultimate goal is to improve the process and develop models to aid decision making. In general, we hope to get more information out of fewer measurements. Due to time constraints, we focus on prediction in the classical statistical setting, that is, predicting an output Y from X using a model developed from data X1, X2, · · · , Xn and Y1, Y2, · · · , Yn. Many models exist for this kind of prediction task; we use only a few of them in this project.

1.1 Data Description

In this part we briefly describe the data. The following shows what the raw data look like.

  Batch.ID Run WD     Ptime DO CO2   pH Stress  GLC  LAC  GLN  GLU  NH4
1        1 539  0 0.0000000 50  46 7.12    2.2 7.00 0.67 1.88 1.32 3.55
2        1 539  1 0.8368056 50  21 7.10    2.2 5.95 1.52 1.29 1.67 3.97
3        1 539  2 1.7986111 50  23 6.97    2.2 5.13 2.06 1.31 1.80 4.11
4        1 539  3 2.8020833 50  33 6.97    2.2 3.85 2.16 1.28 2.64 4.57
5        1 539  4 3.8680556 50  39 6.98    2.2 3.02 1.93 1.73 2.57 4.45
6        1 539  5 4.8993056 50  40 7.02    2.2 2.27 1.71 2.07 3.53 4.20
       OSM   Xv      Via      Titer
1 322.0000 1.45 97.97297 0.03913624
2 315.4865 2.07 97.64151 0.05251863
3 308.0000 3.30 97.34513 0.05340000
4 305.0906 4.97 97.83465 0.10257388
5 302.0000 7.02 97.50000 0.14200000
6 298.0000 7.75 96.27329 0.19834530

In a Run of an experiment, all the variables are measured from day WD 1 to WD 10, sometimes 11 or 12. In the experiments the controlled variables are DO, CO2, pH and Stress. Variables such as GLC, LAC, GLN, GLU, NH4, OSM, Xv and Via are measured throughout to monitor the state of the production. Titer is the output we are interested in.

1.2 Missing Data Imputation

The raw data contain quite a few missing values. To carry out sensible statistical analysis, we imputed missing data for both the input values and the output values. Missing Titer values are interpolated by our client. Missing values in the input variables are imputed by MissForest, a recent method based on Random Forest. See [1].

1.3 Data Standardization

It is sometimes good practice to normalize the data to have mean 0 and variance 1. In our analysis, the data are standardized for all models except the linear models, since we have more sophisticated customized transformations for the linear regressions. As a result, the coefficients of the linear regressions and of MARS always differ in scale.


2 Information on Models Used

2.1 Linear Model

Linear regression is the most mature and widely used method. Although it is quite simple and intuitive, it sometimes has good prediction power. We can easily tell whether any model assumption is violated by looking at the plots of the linear fit.

To make a linear model work best, it is necessary to specify the predictors manually. So we have to consider which transformations to take, which interactions or orders to include and which to leave out. One easy way to do this is to fit a big model in the beginning and use stepwise selection by the AIC (or BIC) criterion¹, which measures the balance between goodness of fit and model complexity.
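As a minimal sketch of this procedure (on simulated data, with illustrative variable names rather than those of the report), step() carries out the stepwise search; k = 2 gives the AIC and k = log(n) gives the BIC:

```r
## Sketch of stepwise selection by AIC/BIC; data and names are illustrative.
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n, sd = 0.3)          # only x1 truly matters

fit.big <- lm(y ~ (x1 + x2 + x3)^2, data = d)     # big model with interactions
fit.aic <- step(fit.big, direction = "both", trace = 0)             # AIC (k = 2)
fit.bic <- step(fit.big, direction = "both", trace = 0, k = log(n)) # BIC
print(formula(fit.bic))
```

The BIC penalizes model complexity more heavily than the AIC, so it tends to select a smaller model.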

To assess variable importance, a simple way is to look at the p-values or t-values, given that the model assumptions are met.

2.2 Decision Tree

The decision tree is a scale-independent statistical model. It is easy to implement and deals with interactions naturally. Its biggest advantage is that it can be visualized easily. See [5].

2.3 Random Forest

All the information in this subsection is based on [4]. Random forest is an ensemble method that combines many decision trees, in each of which the data and variables are sampled so that only part of them are used. Due to the implicit bootstrapping, the model suffers less from overfitting and has good prediction power.

Random forest has built-in mechanisms to estimate the importance of variables. The two measures are described in the following way:

• The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).

• The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.

The package randomForest has a function varImpPlot to plot the importance of variables easily.
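A minimal sketch of how these two importance measures are obtained with the randomForest package (simulated data with illustrative names; in the report the response is Titer):

```r
## Sketch: fit a random forest and inspect both importance measures.
library(randomForest)
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- 2 * d$x1 + d$x2 + rnorm(200, sd = 0.2)

rf <- randomForest(y ~ ., data = d, importance = TRUE)
importance(rf)   # columns: %IncMSE (permutation measure), IncNodePurity
varImpPlot(rf)   # plots both measures side by side
```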

2.4 MARS

All the information in this subsection is based on [3]. MARS, multivariate adaptive regression splines, is an adaptive extension of linear regression. The final model is a linear regression with terms like (xj − d)+, (d − xj)+ and higher-order interactions of such terms. The algorithm adds terms in a forward pass and then prunes the model back to the point where the GCV is minimized.

MARS can also estimate the importance of variables. From the earth vignette [3]:

• The nsubsets criterion counts the number of model subsets that include the variable. Variables that are included in more subsets are considered more important.

By “subsets” we mean the subsets of terms generated by the pruning pass. There is one subset for each model size (from 1 to the size of the selected model) and the subset is the best set of terms for that model size. (These subsets are specified in $prune.terms in earth’s return value.) Only subsets that are smaller than or equal in size to the final model are used for estimating variable importance.

¹See http://stat.ethz.ch/R-manual/R-devel/library/stats/html/step.html. For more information about BIC, see http://en.wikipedia.org/wiki/Bayesian_information_criterion.


• The rss criterion first calculates the decrease in the RSS for each subset relative to the previous subset. (For multiple response models, RSS’s are calculated over all responses.) Then for each variable it sums these decreases over all subsets that include the variable. Finally, for ease of interpretation the summed decreases are scaled so the largest summed decrease is 100. Variables which cause larger net decreases in the RSS are considered more important.

• The gcv criterion is the same, but uses the GCV instead of the RSS. Adding a variable can increase the GCV, i.e., adding the variable has a deleterious effect on the model. When this happens, the variable could even have a negative total importance, and thus appear less important than unused variables.
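A minimal sketch of fitting MARS with the earth package and reading off these importance criteria via evimp() (simulated data with illustrative names):

```r
## Sketch: MARS fit and variable importance by nsubsets, gcv and rss.
library(earth)
set.seed(1)
d <- data.frame(x1 = runif(300), x2 = runif(300), x3 = runif(300))
d$y <- pmax(d$x1 - 0.5, 0) + 0.5 * d$x2 + rnorm(300, sd = 0.05)

fit <- earth(y ~ ., data = d)
summary(fit)   # hinge terms h(...) selected by the pruning pass
evimp(fit)     # variable importance by the nsubsets, gcv and rss criteria
```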

2.5 Neural Network

The neural network is a flexible model which is in a sense an extension of linear regression. See en.wikipedia.org/wiki/Artificial_neural_network.
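A minimal sketch of a single-hidden-layer network used for regression (simulated data; nnet is one of several R packages for this, not necessarily the one used in the report):

```r
## Sketch: single-hidden-layer neural network for regression with nnet.
library(nnet)
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) / 3 + 0.5 + rnorm(200, sd = 0.05)

## size = number of hidden units; linout = TRUE gives a linear output
## unit (regression); decay is a weight-decay penalty against overfitting.
fit  <- nnet(y ~ x, data = d, size = 5, linout = TRUE,
             decay = 1e-3, maxit = 500, trace = FALSE)
pred <- predict(fit, d)
```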

3 Overview of Three Main Tasks

This statistical analysis comprises three main tasks, in each of which all three of the above-mentioned models were used, specifically Linear Regression, Random Forest and MARS. Only for the third task was a different dataset used, in which missing Titer values are interpolated by a logistic function fitted within each run.

3.1 First Task: Blackbox Models

Maximizing the output of a useful product by controlling experimental conditions is of primary interest. The first blackbox models are fitted using only the four controlled variables, namely DO, Stress, pHset and Xv0, as input variables to predict the output variable, Titer at day 10. Secondly, all other input variables at day 0 are used as input variables to predict Titer at day 10.

3.2 Second Task: Snapshot Models

In order to save the measurement cost of Titer, it is useful to predict the current value of Titer during an experiment using the current values of the input variables, which are relatively cheaper to measure. The snapshot approach means that each observation is considered to be independent, and input variables at day t are used to predict Titer at day t.

3.3 Third Task: History Models

The history approach can be regarded as an extension of both the blackbox models and the snapshot models. For the history models, not only the current values of the input variables at day t are considered as predictors, but also how they have changed over time in the past, that is, the history of the input variables. All Titer values in the future, as well as the one at day t, are to be predicted. In other words, an (i, j) history model uses input variables at days 0, 1, ..., i as predictors to predict the Titer value at day j (≥ i).

4 Model Comparison

In order to compare the prediction performance of the three statistical models, cross-validation was used.

1. First, the data is randomly split into a training set and a test set by Run ID. That is, 30 runs are randomly sampled as a test set from among the 122 runs.

2. Then a model is fitted using the training set, the rest of the data.

3. MSE (mean squared error) is calculated in the test set.


4. This is repeated 5 times, and from the 5 resulting MSE’s the RMSECV is calculated, which is defined as follows.

For blackbox or history models:

RMSECV = \sqrt{ \frac{\sum_{k=1}^{5} \sum_{i=1}^{30} \big( y_i^{(k)} - \hat{y}_i^{(k)} \big)^2}{5 \cdot 30} }

where y_i^{(k)} and \hat{y}_i^{(k)} are the true Titer value and the predicted one of the ith sample run in the kth cross-validation, respectively.

For snapshot models:

RMSECV = \sqrt{ \frac{\sum_{k=1}^{5} \sum_{i=1}^{30} \sum_{j=1}^{n_i} \big( y_{ij}^{(k)} - \hat{y}_{ij}^{(k)} \big)^2}{5 \cdot 30 \cdot n_i} }

where y_{ij}^{(k)} and \hat{y}_{ij}^{(k)} are the true Titer value and the predicted one of the ith sample run in the kth cross-validation at day j, respectively.

In case the target variable is transformed, the RMSECV is calculated based on the back-transformed values of both the true and the predicted values.
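The definitions above amount to a root mean squared error pooled over all cross-validation repetitions; a minimal sketch of the computation (names illustrative):

```r
## Sketch: RMSECV pooled over all CV repetitions, computed on the
## back-transformed scale if the target was transformed (default: identity).
rmsecv <- function(y.true, y.pred, back = identity) {
  sqrt(mean((back(y.true) - back(y.pred))^2))
}

## Toy check: perfect predictions give RMSECV 0.
rmsecv(c(0.1, 0.2, 0.3), c(0.1, 0.2, 0.3))   # 0
```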

5 Blackbox Models

This is a prediction problem in a classical statistics setting. Before we use any statistical models, it is helpful to get an intuition of what the dataset looks like.

[Figure: Final Titer in Different Runs without Transformation — scatter plot of dat.simp$Titer (0.2 to 0.8) against dat.simp$Run (600 to 1000).]

Since we are using only a small part of the raw data here, the number of observations is 122. If only prediction power is of interest, you can jump to Section 5.6.


5.1 Linear Model

5.1.1 Linear Model with 4 Controlled Variables

The following model was selected by the BIC model selection criterion. We can see that both DO and pHset exhibit a quadratic effect.

Call:
lm(formula = bb4lm$formula, data = if.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59583 -0.10260 -0.00799  0.10758  0.84827

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                       -2.695e-01  7.932e-02  -3.397 0.000952 ***
I((DO - 50)^2)                    -7.425e-04  8.316e-05  -8.929 1.18e-14 ***
log(Xv0)                          -3.114e-01  1.810e-01  -1.721 0.088092 .
I((pHset - 7.1)^2)                -7.436e+00  9.961e-01  -7.465 2.17e-11 ***
I((DO - 50)^2):log(Xv0)            5.670e-04  1.714e-04   3.309 0.001270 **
I((DO - 50)^2):I((pHset - 7.1)^2) -1.852e-03  7.084e-04  -2.614 0.010202 *
log(Xv0):I((pHset - 7.1)^2)        8.604e+00  1.811e+00   4.750 6.23e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2011 on 109 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8462, Adjusted R-squared: 0.8377
F-statistic: 99.95 on 6 and 109 DF, p-value: < 2.2e-16

[1] 0.09764054

[Figure: diagnostic plots of the linear fit — Residuals vs Fitted, Normal Q−Q, Scale−Location and Residuals vs Leverage (with Cook's distance contours); observations 73, 43, 48 and 37 are flagged.]

The transformations for DO, Xv0 and pHset were chosen according to the termplot to give a reasonable fit.

> par(mfrow=c(2,2))
> termplot(bb4lmfit,partial.resid=TRUE)

[Figure: termplots with partial residuals for I((DO − 50)^2), log(Xv0) and I((pHset − 7.1)^2).]

5.1.2 Linear Model with All Variables

The following model was selected by the BIC model selection criterion. Apart from the 4 controlled variables, only NH4 and GLN enter the model.

Call:
lm(formula = bblmbic$formula, data = if.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.86766 -0.09680  0.01636  0.09528  0.59879

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        -1.559e+00  2.359e-01  -6.606 1.49e-09 ***
I((DO - 50)^2)     -5.932e-04  4.454e-05 -13.319  < 2e-16 ***
pCO2                5.530e-03  2.287e-03   2.418  0.01726 *
GLN                 1.264e-01  5.030e-02   2.513  0.01343 *
NH4                 8.313e-02  2.345e-02   3.545  0.00058 ***
log(Xv0)            5.981e-01  1.377e-01   4.343 3.16e-05 ***
I((pHset - 7.1)^2) -5.336e+00  5.264e-01 -10.137  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2224 on 109 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8119, Adjusted R-squared: 0.8015
F-statistic: 78.41 on 6 and 109 DF, p-value: < 2.2e-16

[Figure: diagnostic plots of the all-variable linear fit — Residuals vs Fitted, Normal Q−Q, Scale−Location and Residuals vs Leverage; observations 47, 48 and 55 are flagged.]

5.2 MARS

To allow for flexibility and nonlinear effects, we use MARS (multivariate adaptive regression splines).

5.2.1 MARS with 4 Controlled Variables

> source(file.path(cwd, "mars_simp.R"))
> summary(mars.con)
Call: earth(formula=Titer~DO+pHset+Xv+Stress, data=dat.simp, trace=0)

                  coefficients
(Intercept)         0.69954039
h(-1.06901-DO)     -0.22496222
h(DO-0.159614)     -0.05639779
h(pHset-0.444749)  -0.12694532
h(0.444749-pHset)  -0.11450147

Selected 5 of 13 terms, and 2 of 4 predictors
Importance: DO, pHset, Xv-unused, Stress-unused
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 0.009779767 RSS 0.9944133 GRSq 0.6943894 RSq 0.7344234

> plot(mars.con, main = "Mars Using 4 Inputs.jpg")

[Figure: model selection and residual plots of the 4-input MARS model — GRSq/RSq against number of terms, cumulative distribution of absolute residuals, residuals vs fitted and residual Q−Q plot; observations 42, 106 and 47 are flagged.]

5.2.2 MARS with All Variables

> summary(mars.simp)
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH, data=dat.simp, trace=0)

                   coefficients
(Intercept)          0.56859816
h(-1.06901-DO)      -0.23177287
h(DO-0.159614)      -0.06620347
h(pHset-0.444749)   -0.09650702
h(0.444749-pHset)   -0.09711387
h(Via-0.355972)      0.33786922
h(1.41519-GLC)       0.14153165
h(-0.949741-LAC)     0.36807897
h(NH4- -0.263424)    0.10067628
h(-0.0456292-OSM)    0.35674962

Selected 10 of 24 terms, and 7 of 13 predictors
Importance: DO, pHset, NH4, Via, GLC, LAC, OSM, Xv-unused, Stress-unused, ...
Number of terms at each degree of interaction: 1 9 (additive model)
GCV 0.007086922 RSS 0.5955396 GRSq 0.7785388 RSq 0.84095

> plot(mars.simp, main = "Mars Using All Inputs.jpg")

[Figure: model selection and residual plots of the all-input MARS model; observations 42, 47 and 93 are flagged.]

Now we add the label pCO2t, showing whether or not CO2 is targeted to a certain level. As it turned out, this label is not important.

> summary(mars.simp2)
Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+pCO2t, data=dat.simp, trace=0)

                   coefficients
(Intercept)          0.80438564
h(DO-0.159614)      -0.04110043
h(0.159614-DO)      -0.09742652
h(pHset- -1.58274)  -0.11389203
h(0.444749-pHset)   -0.17420137
h(Via-0.340761)      0.29860268
h(-0.962971-LAC)     0.32408837
h(GLU- -1.50335)    -0.81614339
h(GLU- -1.42405)     1.31071055
h(NH4- -0.292625)    0.10170289
h(-0.0456292-OSM)    0.52167298

Selected 11 of 27 terms, and 7 of 16 predictors
Importance: pHset, DO, NH4, OSM, GLU, Via, LAC, Xv-unused, Stress-unused, ...
Number of terms at each degree of interaction: 1 10 (additive model)
GCV 0.007404181 RSS 0.5975609 GRSq 0.7686247 RSq 0.8404102

> plot(mars.simp2,
+      main = "Mars Using All inputs and Additional CO2 Target Label.jpg")

[Figure: model selection and residual plots of the MARS model with the additional CO2 target label; observations 42, 47 and 93 are flagged.]

When fitting MARS, records 42, 47 and 91 are identified as outliers.

> dat.simp[c(42, 47, 91), ]
   Run WD Ptime       DO      pCO2       pH        Xv       Via      GLC
42 630  0     0 0.159614 0.1237602 1.287431 -1.580576 0.3932955 1.078003
47 635  0     0 1.388239 0.9739924 1.426846 -1.894227 0.4848425 1.317452
91 768  0     0 0.159614 0.5488763 1.357139 -1.645298 0.4860406 1.156190
          LAC       GLN       GLU          NH4          OSM     Stress
42 -0.7579071 0.8434386 -1.671864  0.006689145 -0.045629237 -0.2096149
47 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617 -0.2096149
91 -0.9629707 0.3432373 -1.334834  1.247747682  0.145048862 -0.2096149
          Xv0       pHset           RunNo                           Discr.
42  0.3312174  0.03925127 BIOS-3.5L - 630              Standard conditions
47 -2.2207835 -2.39373696 BIOS-3.5L - 635 DO 70% / pH 6.70 / seeding 1 mio
91 -0.1953860  0.03925127 BIOS-3.5L - 768                    WGE 15 - SE 3
    Titer pCO2t
42 0.3510     0
47 0.0858     0
91 0.7650     0

Also, we can see the importance of the variables. The methods for importance estimation have been discussed in Section 2.

[Figure: Importance of Variables — nsubsets counts and normalized sqrt gcv / sqrt rss for DO, pHset, NH4, Via, GLC, LAC and OSM.]

5.3 Random Forest

Using Random Forest, we can easily see the importance of different variables.

5.3.1 Random Forest with 4 Controlled Variables

Call:
randomForest(formula = Titer ~ DO + pHset + Stress + Xv, data = dat.simp, importance = TRUE, mtry = 4)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4

Mean of squared residuals: 0.008918479
% Var explained: 71.66

5.3.2 Random Forest with All Variables

We can see that the ranking of variable importance is similar to that of MARS.

[Figure: variable importance plot (IncNodePurity) of rf.simp — from least to most important: Stress, pCO2, pH, GLU, GLC, GLN, Via, LAC, OSM, Xv, NH4, pHset, DO.]

> rf.simp
Call:
randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH, data = dat.simp, mtry = 10)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10

Mean of squared residuals: 0.007168825
% Var explained: 77.22

5.4 Decision Tree

For completeness, we also include the decision tree model. It cannot outperform random forest, which is an extension of the decision tree, but it is easy to interpret.

5.4.1 Decision Tree with 4 Controlled Variables

n= 119

node), split, n, deviance, yval
      * denotes terminal node

 1) root 119 3.74435600 0.5739803
   2) DO< -1.683324 13 0.10235720 0.2245385 *
   3) DO>=-1.683324 106 1.85988900 0.6168364
     6) pHset< -0.7717448 10 0.22043140 0.3785372 *
     7) pHset>=-0.7717448 96 1.01244000 0.6416593
      14) pHset>=0.6474983 8 0.02327288 0.5111250 *
      15) pHset< 0.6474983 88 0.84046160 0.6535260
        30) Stress< 0.3004491 79 0.70332710 0.6425430 *
        31) Stress>=0.3004491 9 0.04395770 0.7499322 *

[Figure: decision tree with the 4 controlled variables — splits DO < −1.683, pHset < −0.7717, pHset >= 0.6475 and Stress < 0.3004; leaf values 0.2245, 0.3785, 0.5111, 0.6425 and 0.7499.]

5.4.2 Decision Tree with All Variables

n= 119

node), split, n, deviance, yval
      * denotes terminal node

 1) root 119 3.74435600 0.5739803
   2) DO< -1.683324 13 0.10235720 0.2245385 *
   3) DO>=-1.683324 106 1.85988900 0.6168364
     6) pHset< -0.7717448 10 0.22043140 0.3785372 *
     7) pHset>=-0.7717448 96 1.01244000 0.6416593
      14) NH4< 0.2658514 67 0.43613680 0.6091353
        28) OSM>=-0.1028327 50 0.28442470 0.5858890
          56) Via< 0.4785685 23 0.15613930 0.5504854 *
          57) Via>=0.4785685 27 0.07489916 0.6160477 *
        29) OSM< -0.1028327 17 0.04522386 0.6775066 *
      15) NH4>=0.2658514 29 0.34168860 0.7168009
        30) Xv< -1.573108 21 0.21895440 0.6899347
          60) Xv>=-1.652965 12 0.13595400 0.6410000 *
          61) Xv< -1.652965 9 0.01595145 0.7551809 *
        31) Xv>=-1.573108 8 0.06778793 0.7873245 *

[Figure: decision tree with all variables — splits on DO, pHset, NH4, OSM, Via and Xv; leaf values 0.2245, 0.3785, 0.5505, 0.616, 0.6775, 0.641, 0.7552 and 0.7873.]

5.5 Neural Network

The neural network is comparatively not so easy to interpret. We include it here to compare its prediction accuracy with the other models using CV.

5.6 Cross Validation

Although many packages nowadays have built-in measures of test error, by implementing cross validation ourselves we can compare the performance of the different models on the same ground. Here we use Leave-30-Runs-Out cross validation. Since this is a regression problem, the mean squared error is a good indicator of performance.
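One fold of this Leave-30-Runs-Out scheme can be sketched as follows (simulated data; Run and Titer mirror the report's column names, everything else is illustrative). The key point is that whole runs, not individual rows, are held out:

```r
## Sketch: one leave-30-runs-out fold — split by Run ID, not by row.
set.seed(1)
dat <- data.frame(Run = rep(1:122, each = 10), x = rnorm(1220))
dat$Titer <- 0.5 + 0.2 * dat$x + rnorm(1220, sd = 0.05)

test.runs <- sample(unique(dat$Run), 30)   # 30 whole runs held out
train <- dat[!(dat$Run %in% test.runs), ]
test  <- dat[  dat$Run %in% test.runs, ]

fit <- lm(Titer ~ x, data = train)         # any of the models could go here
mse <- mean((test$Titer - predict(fit, test))^2)
```

Splitting by Run ID keeps all observations from one run on the same side of the split, so the test error is not optimistically biased by within-run correlation.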

We can see the cross validation mean squared errors of different methods.

RMSECV of Different Models Using All Input Variables
     Linear Model      Mars Random Forest Decision Tree Neural Network
[1,]     0.078494 0.1060308    0.09046751     0.1237991      0.1362411

RMSECV of Different Models Using 4 Input Variables
     Linear Model     Mars Random Forest Decision Tree Neural Network
[1,]   0.09764054 0.117845    0.09327159     0.0972197      0.1150896

A surprising fact is that the more complicated models perform even worse. The linear model performs quite well.


6 Snapshot Models

6.1 Linear model

The following model was selected by the BIC model selection criterion. We can see that the number of selected variables is now much higher. In the snapshot model, the effects are too complex to be captured by just a few variables.

Call:
lm(formula = snlm$formula, data = ccFdata)

Residuals:
     Min       1Q   Median       3Q      Max
-0.63033 -0.09694  0.00396  0.11494  0.48256

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          1.492e+00  1.705e+00   0.875 0.381944
I((DO - 50)^2)       9.907e-04  3.332e-04   2.973 0.003066 **
pCO2                 1.522e-01  2.442e-02   6.233 8.54e-10 ***
I((pH - 7.1)^2)      3.698e+01  4.867e+00   7.598 1.14e-13 ***
log(Xv)              7.013e-01  1.506e-01   4.658 3.93e-06 ***
Via                 -3.025e-02  9.257e-03  -3.268 0.001144 **
GLC                 -2.973e-01  1.400e-01  -2.123 0.034175 *
LAC                 -2.858e-01  9.404e-02  -3.039 0.002479 **
GLN                  5.613e-01  5.873e-02   9.557  < 2e-16 ***
GLU                 -1.950e+00  3.485e-01  -5.596 3.31e-08 ***
NH4                 -1.106e-01  2.415e-02  -4.580 5.64e-06 ***
OSM                 -8.029e-03  3.901e-03  -2.058 0.039999 *
Stress              -5.076e-02  1.549e-02  -3.276 0.001113 **
I((DO - 50)^2):pCO2  1.418e-05  4.040e-06   3.510 0.000482 ***
I((DO - 50)^2):Via  -8.782e-06  2.193e-06  -4.005 6.98e-05 ***
I((DO - 50)^2):GLC   8.047e-05  1.697e-05   4.741 2.65e-06 ***
I((DO - 50)^2):GLU  -1.739e-04  3.561e-05  -4.883 1.34e-06 ***
I((DO - 50)^2):NH4  -6.375e-05  2.582e-05  -2.469 0.013824 *
pCO2:Via            -6.531e-04  1.354e-04  -4.824 1.78e-06 ***
pCO2:GLC             3.431e-03  6.722e-04   5.105 4.44e-07 ***
pCO2:GLN            -6.369e-03  1.483e-03  -4.293 2.05e-05 ***
pCO2:OSM            -2.974e-04  5.367e-05  -5.540 4.50e-08 ***
I((pH - 7.1)^2):LAC  2.711e+00  3.745e-01   7.240 1.36e-12 ***
I((pH - 7.1)^2):NH4  6.535e-01  2.225e-01   2.937 0.003435 **
I((pH - 7.1)^2):OSM -1.449e-01  1.812e-02  -7.997 6.46e-15 ***
log(Xv):GLC          1.177e-01  1.607e-02   7.323 7.75e-13 ***
log(Xv):GLU         -2.613e-01  3.265e-02  -8.000 6.29e-15 ***
Via:GLC             -4.150e-03  8.275e-04  -5.015 6.97e-07 ***
Via:GLU              1.534e-02  1.466e-03  10.463  < 2e-16 ***
Via:Stress           6.505e-04  1.614e-04   4.031 6.26e-05 ***
GLC:LAC             -2.587e-02  7.330e-03  -3.529 0.000448 ***
GLC:NH4              1.571e-02  4.848e-03   3.241 0.001255 **
GLC:OSM              9.333e-04  3.859e-04   2.419 0.015863 *
LAC:NH4             -4.049e-02  8.656e-03  -4.678 3.58e-06 ***
LAC:OSM              1.038e-03  2.115e-04   4.911 1.16e-06 ***
GLN:Stress          -4.219e-03  1.587e-03  -2.658 0.008059 **
GLU:OSM              4.736e-03  8.420e-04   5.625 2.83e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1743 on 608 degrees of freedom
(740 observations deleted due to missingness)
Multiple R-squared: 0.9526, Adjusted R-squared: 0.9498
F-statistic: 339.2 on 36 and 608 DF, p-value: < 2.2e-16

[1] 0.07678454

[Figure: diagnostic plots of the snapshot linear model — Residuals vs Fitted, Normal Q−Q and Scale−Location flag observations 90, 345 and 909; Residuals vs Leverage flags observations 530, 532 and 1377.]

6.2 Random Forest

The importance ranking of the variables now differs considerably from that of the black-box model. GLU, LAC, NH4 and GLN can be used together to estimate Titer, which provides a good “snapshot”.

[1] "Titer~DO+pCO2+Via+GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+Xv"

Call:

randomForest(formula = as.formula(snrf$formula), data = ccFdata, na.action = na.omit)

Type of random forest: regression

Number of trees: 500

No. of variables tried at each split: 4

Mean of squared residuals: 0.003390071

% Var explained: 92.32

[1] 0.06970167

[Figure: random-forest variable importance (IncNodePurity) for snrffit; GLU, LAC, GLN and Xv rank highest.]

6.3 MARS

MARS yields a variable-importance ranking similar to that of the random forest.

Call: earth(formula=as.formula(snmars$formula), data=subset(ccFdata,

!is.na(ccFdata$Titer)))

coefficients

(Intercept) 0.52558053

h(DO-70) 0.00384701

h(70-DO) 0.00081744

h(39-pCO2) -0.00387669

h(Via-95.1825) -0.01859399

h(1.64-GLC) -0.04381892

h(3.14-LAC) 0.06892429

h(GLN-2.09) 0.17083241

h(2.09-GLN) -0.07104502

h(GLU-4.49) 0.11840427

h(4.49-GLU) -0.13551714

h(NH4-5.19) -0.01831231

h(5.19-NH4) 0.04728241

h(OSM-305) 0.00128608

h(305-OSM) 0.00120199

h(Stress-21) -0.00161051

h(21-Stress) -0.00473324

h(7.11-pH) -0.20785734

h(Xv-2.96) -0.01598677

h(2.96-Xv) -0.10513829

Selected 20 of 25 terms, and 12 of 12 predictors


Importance: GLU, LAC, NH4, GLN, Xv, OSM, pH, Stress, Via, pCO2, DO, GLC

Number of terms at each degree of interaction: 1 19 (additive model)

GCV 0.004576446 RSS 2.605637 GRSq 0.8966136 RSq 0.9084545

[1] 0.07026124

[Figures: MARS model-selection plot (GRSq/RSq versus number of terms), cumulative distribution of absolute residuals, Residuals vs Fitted and Normal Q−Q plots (observations 586, 10 and 582 flagged), and variable importance, with GLU, LAC, NH4 and GLN ranking highest.]

7 History Model (Naive Way)

Let us first use the input data only at time t and treat t simply as another parameter. Again, the target is the Titer on day 10. We explore several models.

7.1 Linear Regression

Since we now have around 1300 observations, we can fit a model with more parameters. After this we can select a model using step.
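A sketch of this two-step procedure: fit a deliberately large model and let step() add and drop terms by AIC. The starting formula below is an illustrative assumption; the model actually selected is shown in the output that follows.

```r
## Sketch (assumes dat.agg as above): large starting model with all
## main effects interacted with a cubic polynomial in process time.
lm.full <- lm(Titer ~ (DO + pHset + Via + GLC + LAC + GLN + GLU +
                         NH4 + OSM) * poly(Ptime, 3),
              data = dat.agg)
## Stepwise selection in both directions, scored by AIC.
lm.agg <- step(lm.full, direction = "both", trace = 0)
summary(lm.agg)
```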

Call:

lm(formula = Titer ~ I(DO^2) + DO + pHset + I(pHset^2) + Via +

poly(GLC, 3) + poly(LAC, 3) + GLN + GLU + NH4 + OSM + poly(Ptime,

3) + DO:pHset + I(pHset^2):poly(Ptime, 3) + Via:poly(Ptime,

3) + poly(LAC, 3):poly(Ptime, 3) + GLN:poly(Ptime, 3) + GLU:poly(Ptime,

3) + NH4:poly(Ptime, 3) + OSM:poly(Ptime, 3), data = dat.agg)

Residuals:

Min 1Q Median 3Q Max

-0.296451 -0.047078 0.002983 0.044872 0.190200

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.049e+00 5.200e-01 -2.017 0.044626 *

I(DO^2) -3.644e-02 6.218e-03 -5.860 1.26e-08 ***

DO 1.747e-02 7.374e-03 2.369 0.018520 *

pHset 9.137e-03 7.090e-03 1.289 0.198499


I(pHset^2) -3.318e-02 5.477e-03 -6.058 4.31e-09 ***

Via 1.341e-01 2.815e-02 4.763 3.02e-06 ***

poly(GLC, 3)1 -8.356e-01 1.905e-01 -4.387 1.62e-05 ***

poly(GLC, 3)2 -2.459e-01 1.455e-01 -1.690 0.092150 .

poly(GLC, 3)3 1.980e-01 1.425e-01 1.389 0.165837

poly(LAC, 3)1 -1.572e+02 4.936e+01 -3.184 0.001611 **

poly(LAC, 3)2 -1.203e+02 3.832e+01 -3.138 0.001878 **

poly(LAC, 3)3 -3.211e+01 1.041e+01 -3.084 0.002244 **

GLN 3.912e-02 8.812e-03 4.439 1.29e-05 ***

GLU 5.716e-02 2.163e-02 2.642 0.008686 **

NH4 5.605e-02 7.730e-03 7.252 3.82e-12 ***

OSM -2.508e-02 2.934e-02 -0.855 0.393294

poly(Ptime, 3)1 2.465e+01 7.910e+00 3.117 0.002013 **

poly(Ptime, 3)2 -4.044e+01 1.110e+01 -3.642 0.000321 ***

poly(Ptime, 3)3 5.017e+00 1.286e+00 3.903 0.000119 ***

DO:pHset 5.555e-03 2.814e-03 1.974 0.049324 *

I(pHset^2):poly(Ptime, 3)1 4.147e-01 8.829e-02 4.697 4.10e-06 ***

I(pHset^2):poly(Ptime, 3)2 1.152e-02 4.871e-02 0.236 0.813219

I(pHset^2):poly(Ptime, 3)3 -2.443e-01 1.301e-01 -1.878 0.061423 .

Via:poly(Ptime, 3)1 -1.151e+00 4.468e-01 -2.576 0.010507 *

Via:poly(Ptime, 3)2 2.055e+00 5.664e-01 3.628 0.000338 ***

Via:poly(Ptime, 3)3 -5.371e-01 3.308e-01 -1.624 0.105528

poly(LAC, 3)1:poly(Ptime, 3)1 2.404e+03 7.520e+02 3.197 0.001546 **

poly(LAC, 3)2:poly(Ptime, 3)1 1.869e+03 5.845e+02 3.199 0.001536 **

poly(LAC, 3)3:poly(Ptime, 3)1 5.054e+02 1.594e+02 3.172 0.001680 **

poly(LAC, 3)1:poly(Ptime, 3)2 -3.718e+03 1.095e+03 -3.395 0.000784 ***

poly(LAC, 3)2:poly(Ptime, 3)2 -2.955e+03 8.698e+02 -3.398 0.000776 ***

poly(LAC, 3)3:poly(Ptime, 3)2 -8.601e+02 2.569e+02 -3.347 0.000924 ***

poly(LAC, 3)1:poly(Ptime, 3)3 3.446e+02 1.065e+02 3.236 0.001353 **

poly(LAC, 3)2:poly(Ptime, 3)3 2.778e+02 8.433e+01 3.294 0.001110 **

poly(LAC, 3)3:poly(Ptime, 3)3 8.598e+01 2.539e+01 3.387 0.000805 ***

GLN:poly(Ptime, 3)1 2.287e-01 1.479e-01 1.547 0.123014

GLN:poly(Ptime, 3)2 -1.576e-01 1.730e-01 -0.911 0.363026

GLN:poly(Ptime, 3)3 -2.659e-01 1.555e-01 -1.710 0.088273 .

GLU:poly(Ptime, 3)1 -1.712e-01 3.530e-01 -0.485 0.628013

GLU:poly(Ptime, 3)2 -8.317e-01 4.445e-01 -1.871 0.062348 .

GLU:poly(Ptime, 3)3 -5.699e-01 2.756e-01 -2.068 0.039551 *

NH4:poly(Ptime, 3)1 -5.983e-01 1.399e-01 -4.278 2.57e-05 ***

NH4:poly(Ptime, 3)2 -3.588e-01 1.453e-01 -2.469 0.014148 *

NH4:poly(Ptime, 3)3 1.258e-01 1.830e-01 0.687 0.492460

OSM:poly(Ptime, 3)1 1.639e+00 3.637e-01 4.507 9.58e-06 ***

OSM:poly(Ptime, 3)2 1.621e+00 6.580e-01 2.464 0.014321 *

OSM:poly(Ptime, 3)3 2.464e-01 1.771e-01 1.391 0.165379

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07481 on 288 degrees of freedom

Multiple R-squared: 0.8548, Adjusted R-squared: 0.8317

F-statistic: 36.87 on 46 and 288 DF, p-value: < 2.2e-16

[Figure: diagnostic plots for the linear model — Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage; observations 104, 258 and 323 are flagged, and 135, 136 and 196 show high leverage.]

The linear model does not capture all the effects. We can identify some outliers:

> (observations <- unique(dat.agg[which(abs(lm.agg$residuals) > 0.2), "Run"]))

[1] 630 770

7.2 MARS

We perform model fitting and variable selection with MARS.

Call: earth(formula=Titer~DO+pHset+Xv+Stress+pCO2+Via+

GLC+LAC+GLN+GLU+NH4+OSM+Stress+pH+pdiff+

Ptime, data=dat.agg, trace=0, ncross=3, nfold=10)

coefficients

(Intercept) 0.74650082

h(DO- -1.06901) -0.03339099

h(-1.06901-DO) -0.22191908

h(pHset-0.444749) -0.12856755

h(0.444749-pHset) -0.10199167

h(0.580131-Xv) -0.03825405

h(0.810513-Stress) -0.09597828

h(pCO2- -0.471402) 0.01252083

h(Via-0.390253) 0.18390409

h(GLC-1.1171) -0.05487600

h(GLN- -0.203858) 0.02302284

h(-0.203858-GLN) 0.03187502

h(NH4- -0.0225122) 0.15374637

h(NH4-1.01414) -0.17152919

h(pH- -0.664392) 0.04304223


h(pH-1.70568) -0.23513685

Selected 16 of 26 terms, and 10 of 15 predictors

Importance: DO, pHset, NH4, Stress, Xv, pH, GLC, GLN, Via, pCO2, ...

Number of terms at each degree of interaction: 1 15 (additive model)

GCV 0.006192977 RSS 1.708448 GRSq 0.8142985 RSq 0.8461598 cv.rsq 0.7821464

Note: the cross-validation sd's below are standard deviations across folds

Cross validation: nterms 16.90 sd 1.88 nvars 9.33 sd 0.92

cv.rsq sd MaxErr sd

0.78 0.069 -0.29 0.19

[Figures: MARS variable importance (DO and pHset rank highest), model-selection plot, cumulative distribution of absolute residuals, and Residuals vs Fitted and Normal Q−Q plots; observations 104, 119 and 323 are flagged.]

Let us take a look at the outliers.

> dat.agg[c(104, 119, 323), ]

Run WD Ptime DO pCO2 pH Xv Via

104 630 0 0.0000000 0.159614 0.1237602 1.287431 -1.580576 0.3932955

119 635 0 0.0000000 1.388239 0.9739924 1.426846 -1.894227 0.4848425

323 1066 1 0.7791667 0.159614 -1.0665650 1.217723 -1.525812 0.1411053

GLC LAC GLN GLU NH4 OSM

104 1.0780028 -0.7579071 0.8434386 -1.671864 0.006689145 -0.045629237

119 1.3174515 -0.9828156 0.9841203 -1.820554 -0.292624973 -0.007493617

323 0.6821795 -0.4139295 0.2494495 -1.790816 0.262201197 0.068777622

Stress Xv0 pHset RunNo

104 -0.2096149 0.3312174 0.03925127 BIOS-3.5L - 630

119 -0.2096149 -2.2207835 -2.39373696 BIOS-3.5L - 635

323 -0.2096149 -1.7751961 0.03925127 TACI-3L-1066

Discr. Titer pdiff pCO2t

104 Standard conditions 0.3510000 0.6432812 0

119 DO 70% / pH 6.70 / seeding 1 mio 0.0858000 0.6542345 0

323 STD condition (Control - no loop) 0.8432867 0.5062897 0

7.3 Random Forest

The parameter mtry = 10 is determined by tuneRF to optimize its performance.
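The tuning step can be sketched as follows; tuneRF starts from the default mtry and repeatedly scales it by stepFactor, keeping the value with the lowest out-of-bag error (the variable list and na.omit handling are assumptions, since the exact call is not shown in the report).

```r
library(randomForest)
## Predictors used in rf.agg below; tuneRF does not accept missing values,
## so we drop incomplete rows first (assumption).
vars <- c("DO", "pHset", "Xv", "Stress", "pCO2", "Via", "GLC",
          "LAC", "GLN", "GLU", "NH4", "OSM", "pH", "pdiff", "Ptime")
d <- na.omit(dat.agg[, c(vars, "Titer")])
## Search over mtry: scale by 1.5 each step, keep a step only if the
## OOB error improves by at least 1%.
tuneRF(d[, vars], d$Titer, stepFactor = 1.5, improve = 0.01, ntreeTry = 500)
```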

> library(randomForest)

> rf.agg <- randomForest(Titer ~ DO + pHset + Xv + Stress+ pCO2 + Via

+ + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH

+ + pdiff + Ptime, data = dat.agg, importance = TRUE, mtry = 10)

> rf.agg

Call:

randomForest(formula = Titer ~ DO + pHset + Xv + Stress + pCO2 + Via + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH + pdiff + Ptime, data = dat.agg, importance = TRUE, mtry = 10)

Type of random forest: regression

Number of trees: 500

No. of variables tried at each split: 10

Mean of squared residuals: 0.005391989

% Var explained: 83.73

[Figures: out-of-bag error of rf.agg versus number of trees, and variable importance plots (%IncMSE and IncNodePurity); DO, pHset and NH4 rank highest in both.]

In both importance plots, the seven most important variables are the same.

7.4 Decision Tree

plotcp is used to choose the complexity parameter cp for the decision tree.

> library(rpart)

> tr.agg <- rpart(Titer ~ DO + pHset + Stress + Xv + pCO2

+ + Via + GLC + LAC + GLN + GLU + NH4 + OSM

+ + Stress + pH + pdiff + Ptime, data=dat.agg)

> (tr.agg <- prune(tr.agg, cp = 0.019))

n= 335

node), split, n, deviance, yval

* denotes terminal node

1) root 335 11.10534000 0.5719146

2) DO< -1.683324 39 0.30707160 0.2245385 *

3) DO>=-1.683324 296 5.47207100 0.6176837

6) pHset< -1.988239 18 0.17149480 0.2748954 *

7) pHset>=-1.988239 278 3.04856000 0.6398787

14) NH4< 0.3899572 202 1.52384000 0.6085717

28) LAC>=0.002812559 9 0.02199422 0.4344444 *

29) LAC< 0.002812559 193 1.21623800 0.6166916 *

15) NH4>=0.3899572 76 0.80050930 0.7230893 *

> plot(tr.agg)

> text(tr.agg)

[Figure: pruned decision tree with splits on DO, pHset, NH4 and LAC.]

7.5 Neural Network

> nn.agg <- nnet(Titer ~ DO + pHset + Xv + Stress+ pCO2 + Via

+ + GLC + LAC + GLN + GLU + NH4 + OSM + Stress + pH + pdiff,

+ data = dat.agg, size = 4, skip = TRUE, decay = 4e-4,

+ lin.out = FALSE, maxit = 2000, trace = FALSE)

> nn.agg

a 14-4-1 network with 79 weights

inputs: DO pHset Xv Stress pCO2 Via GLC LAC GLN GLU NH4 OSM pH pdiff

output(s): Titer

options were - skip-layer connections decay=4e-04

7.6 Cross Validation

Because of the nature of our data, it is better to leave an entire experimental run out when performing cross validation. We still use MSE to assess performance.

Overall, the performance is better than in the first task.

RMSECVs of Different Models

Linear Model Mars Random Forest Decision Tree Neural Network

[1,] 0.09515759 0.09222569 0.07226036 0.1000236 0.09316403
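The leave-one-run-out scheme behind this table can be sketched as below; loro_rmse is a hypothetical helper (the report's actual CV code is not shown) that holds out all rows of one Run at a time, refits, and pools the squared prediction errors.

```r
## Leave-one-run-out RMSECV sketch. fit_fun takes a training data frame
## and returns a fitted model that supports predict(..., newdata=).
loro_rmse <- function(fit_fun, data) {
  runs <- unique(data$Run)
  se <- unlist(lapply(runs, function(r) {
    train <- subset(data, Run != r)   # all other runs
    test  <- subset(data, Run == r)   # the held-out run
    pred  <- predict(fit_fun(train), newdata = test)
    (test$Titer - pred)^2
  }))
  sqrt(mean(se, na.rm = TRUE))
}

## Example with a small linear model (illustrative formula):
## loro_rmse(function(d) lm(Titer ~ DO + pHset + NH4, data = d), dat.agg)
```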

8 History Models

Here we fit history models and perform CV on them to measure their predictive power. The input of a history model always includes the two strictly controlled variables that are kept constant: DO and Stress. For a history model using the history up to day i, we additionally use the other 10 variables pCO2, Xv, Via, GLC, pH, LAC, GLN, GLU, NH4 and OSM from day 0 to day i. This gives us 2 + 10(i + 1) input variables.

We do this by unfolding the raw data. Missing values are imputed with missForest on the unfolded historical input values from day 0 to day i. This is done in the function ArrHist.
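The unfolding step can be sketched as follows (ArrHist itself is not reproduced in the report, so this is only an illustration): the long per-day records are reshaped into one wide row per Run with day-indexed columns, and the result is imputed with missForest.

```r
library(reshape2)    # any long-to-wide reshaping tool would do (assumption)
library(missForest)

## Unfold days 0..i of the monitored variables into columns like "pCO2_0",
## "pCO2_1", ..., one row per Run, then impute any remaining NAs.
unfold_hist <- function(dat, i,
                        vars = c("pCO2", "Xv", "Via", "GLC", "pH",
                                 "LAC", "GLN", "GLU", "NH4", "OSM")) {
  long <- melt(subset(dat, WD <= i), id.vars = c("Run", "WD"),
               measure.vars = vars)
  wide <- dcast(long, Run ~ variable + WD)   # one column per variable/day
  if (anyNA(wide[-1]))                       # impute only if NAs are present
    wide[-1] <- missForest(wide[-1])$ximp
  wide
}
```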

We produce two tables for each statistical model: one for the variance explained in CV and one for the RMSECV. In each run of the cross validation we compute the variance explained as

1 − (∑ᵢ (yᵢ − ŷᵢ)²) / (∑ᵢ (yᵢ − ȳ)²),

and average over runs to obtain the final result. Similarly, RMSECV is the mean over all cross-validation runs.
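The two per-fold metrics defined above can be computed with a small helper (a sketch; names are illustrative):

```r
## y: observed Titer in the held-out fold, yhat: predictions for it.
cv_metrics <- function(y, yhat) {
  c(var_explained = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2),
    rmse          = sqrt(mean((y - yhat)^2)))
}
```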

Now we see how the models perform under this setting.

8.1 Linear model

The (i, j)th history linear model is defined as follows:

log(Titer_j) = β₀ + ∑_{t=0}^{i} [ β_{1,t} (DO_t − 50)² + β_{2,t} pCO2_t
  + β_{3,t} (pH_t − 7.1)² + β_{4,t} log(Xv_t) + β_{5,t} Via_t
  + β_{6,t} GLC_t + β_{7,t} LAC_t + β_{8,t} GLN_t
  + β_{9,t} GLU_t + β_{10,t} NH4_t + β_{11,t} OSM_t + β_{12,t} Stress_t ]
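This formula can be assembled programmatically; the sketch below assumes the day-indexed column names of the unfolded layout (e.g. "pCO2_0"), which is an assumption about the data layout rather than the report's actual code.

```r
## Build the day-0..i history formula for log(Titer), matching the model
## definition above term by term.
hist_formula <- function(i) {
  terms <- unlist(lapply(0:i, function(t)
    sprintf(c("I((DO_%d - 50)^2)", "pCO2_%d", "I((pH_%d - 7.1)^2)",
              "log(Xv_%d)", "Via_%d", "GLC_%d", "LAC_%d", "GLN_%d",
              "GLU_%d", "NH4_%d", "OSM_%d", "Stress_%d"), t)))
  as.formula(paste("log(Titer) ~", paste(terms, collapse = " + ")))
}
```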

The following table shows the RMSECV values; the RMSECV of the (i, j)th model is the number in the ith row and the jth column.

RMSECV


Y0 Y1 Y2 Y3 Y4

X0 3.200970e-03 1.962524e-03 4.279477e-03 2.301885e-03 2.446558e-03

X0_1 1.545867e-02 1.659791e-02 2.354038e-03 1.230346e-03

X0_2 1.730547e-02 4.075473e-03 1.085479e-03

X0_3 4.041385e-03 2.398019e-03

X0_4 3.482082e-03

X0_5

X0_6

X0_7

X0_8

X0_9

X0_10

Y5 Y6 Y7 Y8 Y9

X0 2.051734e-03 3.416656e-03 6.081081e-03 7.219461e-03 1.941174e-02

X0_1 2.327897e-03 3.313077e-03 6.668536e-03 6.489212e-03 1.036755e-02

X0_2 3.283539e-03 1.628859e-03 5.512231e-03 3.368808e-03 7.914411e-03

X0_3 2.395535e-03 1.284514e-03 2.544108e-03 2.998240e-03 5.437181e-03

X0_4 1.389313e-03 1.658926e-03 2.458271e-03 2.497301e-03 1.091136e-02

X0_5 2.656460e-03 2.030364e-03 2.326792e-03 2.966961e-03 6.045417e-03

X0_6 2.639904e-03 2.232001e-03 3.808531e-03 1.056563e-02

X0_7 1.251427e-02 6.530875e-03 1.346802e-02

X0_8 3.184551e+02 7.913746e+13

X0_9 4.492133e+02

X0_10

Y10

X0 1.961133e-02

X0_1 5.580546e-02

X0_2 1.869011e-02

X0_3 1.359189e-02

X0_4 7.909511e-03

X0_5 1.306489e-02

X0_6 1.450043e-02

X0_7 3.412453e-02

X0_8 1.202324e+11

X0_9 1.613414e+00

X0_10 1.289649e+00

[Figure: heat map of RMSECV for the history models based on the linear model (history used versus Titer day).]

8.2 Random Forest

Variance explained in CV

Y0 Y1 Y2 Y3 Y4 Y5

X0 0.28324661 0.26155005 -0.09489543 -0.03088496 0.39081454 0.43354914

X0_1 0.37598610 0.21074302 0.35566190 0.31553364 0.37313831

X0_2 0.16472940 0.36469677 0.44168895 0.40601249

X0_3 0.25131256 0.35404242 0.51706127

X0_4 0.37300694 0.56438519

X0_5 0.54745573

X0_6

X0_7

X0_8

X0_9

X0_10

Y6 Y7 Y8 Y9 Y10

X0 0.44032714 0.55857068 0.57889201 0.64090502 0.55601625

X0_1 0.53608831 0.50026635 0.55488094 0.72923139 0.71702833

X0_2 0.69563541 0.67904955 0.69803823 0.66081810 0.71536101

X0_3 0.66675711 0.72303901 0.71583577 0.71933173 0.77114221

X0_4 0.77659188 0.77304970 0.77028128 0.80331752 0.74513898

X0_5 0.79546288 0.79299697 0.81030901 0.73912138 0.78821029

X0_6 0.76082620 0.76023416 0.79248812 0.82078319 0.79484015

X0_7 0.81875451 0.80675983 0.82533617 0.81928950

X0_8 0.86596990 0.82482254 0.80067588

X0_9 0.87802758 0.84462589

X0_10 0.81461628

RMSECV


Y0 Y1 Y2 Y3 Y4 Y5

X0 0.04226597 0.05005782 0.05790429 0.05367666 0.04880549 0.05140720

X0_1 0.03811918 0.05222506 0.03665541 0.04491827 0.04562487

X0_2 0.04978004 0.04001816 0.04554554 0.04125988

X0_3 0.04335484 0.03659673 0.04576400

X0_4 0.04272663 0.03664301

X0_5 0.03671531

X0_6

X0_7

X0_8

X0_9

X0_10

Y6 Y7 Y8 Y9 Y10

X0 0.06260631 0.07307958 0.08436972 0.09963318 0.10319883

X0_1 0.05939476 0.07320302 0.08885598 0.07992313 0.08665985

X0_2 0.04182959 0.06069709 0.06802562 0.09828617 0.08913791

X0_3 0.05082051 0.04949298 0.06647324 0.07765117 0.08482019

X0_4 0.03694516 0.05087756 0.05854120 0.06370206 0.08794341

X0_5 0.04183240 0.04704620 0.05592446 0.07192094 0.07872577

X0_6 0.03919570 0.05291936 0.05561075 0.07081846 0.07979003

X0_7 0.04348533 0.05150214 0.06694091 0.06891756

X0_8 0.04623633 0.06832347 0.07381787

X0_9 0.05664780 0.07072329

X0_10 0.06890835

[Figure: heat map of RMSECV for the history models based on random forest.]

8.3 MARS

Variance explained in CV

Y0 Y1 Y2 Y3 Y4

X0 0.069814132 -0.005095947 0.024849356 -0.034696230 -0.120319657


X0_1 0.007649867 -0.396661134 -0.275546932 -0.033114509

X0_2 -0.827848243 -0.204988340 -1.123241366

X0_3 -0.239974296 0.267732287

X0_4 -1.130404725

X0_5

X0_6

X0_7

X0_8

X0_9

X0_10

Y5 Y6 Y7 Y8 Y9

X0 0.363899116 0.330268801 0.508385257 0.305851870 0.599098171

X0_1 -0.150206083 0.147392088 0.375098045 0.089004911 0.342867469

X0_2 0.287235896 0.651413305 0.651947560 0.742131695 0.616545382

X0_3 0.233978160 0.517310655 0.669245506 0.804495843 0.768104648

X0_4 0.397724415 0.704167262 0.821355319 0.839638561 0.836162082

X0_5 0.512508645 0.686174900 0.814525758 0.846335693 0.780111923

X0_6 0.846201740 0.823482389 0.841185054 0.823600595

X0_7 0.763357781 0.908825907 0.814810324

X0_8 0.850118872 0.714621386

X0_9 0.598640731

X0_10

Y10

X0 -0.266671250

X0_1 0.099612543

X0_2 0.296458653

X0_3 0.673287674

X0_4 0.681350544

X0_5 0.706625242

X0_6 0.812505473

X0_7 0.789365935

X0_8 0.657600818

X0_9 0.742572597

X0_10 0.830697666

RMSECV

Y0 Y1 Y2 Y3 Y4 Y5

X0 0.05414481 0.06435345 0.04914698 0.04440181 0.04899524 0.04751189

X0_1 0.06349293 0.07040039 0.07104746 0.04787156 0.06328275

X0_2 0.08353868 0.05609517 0.06006556 0.05632823

X0_3 0.05677938 0.05112222 0.05670555

X0_4 0.07399238 0.04632260

X0_5 0.04234317

X0_6

X0_7

X0_8

X0_9

X0_10

Y6 Y7 Y8 Y9 Y10

X0 0.06136661 0.06598322 0.10313729 0.10811682 0.16777832

X0_1 0.06212712 0.09116696 0.10869171 0.12338118 0.17132883

X0_2 0.05033973 0.05957265 0.06570322 0.09632618 0.12433817

X0_3 0.05168125 0.06504277 0.06069211 0.07182945 0.10263122

X0_4 0.03954150 0.04350307 0.05051779 0.06939214 0.09432636

X0_5 0.04101469 0.04404178 0.04560743 0.06953788 0.08301654

X0_6 0.02986677 0.04677269 0.05327379 0.07087714 0.07679419

X0_7 0.04676462 0.04301543 0.07409887 0.07206619

X0_8 0.04281923 0.07792298 0.09770436

X0_9 0.09445278 0.08202713

X0_10 0.07658425

[Figure: heat map of RMSECV for the history models based on MARS.]

9 Evaluation of Results

All models perform reasonably well. Random forest has the most stable performance, which is expected; it is also easy to apply.

There are some tuning parameters to consider, e.g., the number of trees and the resampling size. However, the default settings work well most of the time, and even if the number of trees is chosen too large it is not a big problem.

Linear regression also does reasonably well. Its performance relies heavily on variable selection, interaction specification and transformations, which is why linear regression is not as straightforward to apply as the other models. A compromise is to fit a large model at the beginning and use stepwise selection to reduce the number of parameters. A more recent approach is penalized linear regression, i.e., the lasso and ridge regression. See [2].
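The penalized alternative mentioned above could be sketched with glmnet, where the penalty strength is chosen by cross validation (the predictor list is an illustrative assumption; this was not part of the analysis in this report):

```r
library(glmnet)
## alpha = 1 gives the lasso penalty, alpha = 0 ridge regression.
vars <- c("DO", "pHset", "Xv", "Stress", "pCO2", "Via", "GLC",
          "LAC", "GLN", "GLU", "NH4", "OSM", "pH", "pdiff")
d <- na.omit(dat.agg[, c(vars, "Titer")])
cvfit <- cv.glmnet(as.matrix(d[, vars]), d$Titer, alpha = 1)
coef(cvfit, s = "lambda.min")   # coefficients at the CV-optimal lambda
```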

MARS is easy to apply because it adjusts to nonlinearity automatically. Here its predictive power is not as good as that of the other two models, which might result from overfitting. It is possible to tune its parameters. See [3].

References

[1] Daniel J. Stekhoven. Using the missForest package, 2011. Available from http://stat.ethz.ch/education/semesters/ss2012/ams/paper/missForest_1.2.pdf.

[2] Peter Bühlmann and Martin Mächler. Computational Statistics, 2014. Available from http://stat.ethz.ch/education/semesters/ss2014/CompStat/sk.pdf.

[3] Stephen Milborrow. Notes on the earth package, 2014. Available from http://cran.r-project.org/web/packages/earth/vignettes/earth-notes.pdf.

[4] Leo Breiman and Adele Cutler (Fortran original), Andy Liaw and Matthew Wiener (R port). Package ‘randomForest’, 2012. Available from http://stat-www.berkeley.edu/users/breiman/RandomForests.

[5] Terry M. Therneau and Elizabeth J. Atkinson, Mayo Foundation. An Introduction to Recursive Partitioning Using the RPART Routines, 2013. Available from http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf.

