Small data: practical modeling issues in human-model -omic data

Defense for the degree of Ph.D.
Einar Holsbø
February 8th, 2019

Act I: “Boy Bitten by a Lizard” (1590s)

“Can we predict breast cancer metastasis from blood samples?”

–Eiliv Lund, 4.5 years ago, quote made up

Metastasis is the spread of cancer in the body


[Figure: five-year survival probability (0.0 to 1.0) for various cancers, by stage at diagnosis: Local, Regional, Distant. Female breast is highlighted.]

Data source: Siegel, R. L., Miller, K. D. and Jemal, A. (2017), Cancer statistics, 2017. CA: A Cancer Journal for Clinicians, 67: 7-30. doi:10.3322/caac.21387




Goal: predict it, win the Nobel prize 🏅

Norwegian Women and Cancer

• Prospective population-based cohort that tracks 34% (170 000) of all Norwegian women born between 1943 and 1957.

• Data collection in NOWAC started in 1991. The cohort includes blood samples from 50 000 women, as well as more than 300 biopsies.

• Now contains various -omics material: microarray mRNA, miRNA, methylation, metabolomics, and RNA-seq.

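Prospective

[Figure: timeline of a prospective study. Participants are enrolled first and then followed as time passes.]

Nested case–control

[Figure: within the cohort, each case is paired with a control, forming case–control pairs (cc-pairs).]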

A prospective design is nice because recruitment is blinded to outcome and exposure.

Low bias

Gene expression

[Figure: DNA (base pairs AT, GC, CG, TA, ...) is transcribed to mRNA (U C G A A G ...), which is translated into some useful protein.]

Gene expression

[Figure: mRNA (U C G A A G) next to a light.] 💡 How much light do we see?

Data at a glance

dim(gene_expression)
## [1]    88 12404

summary(days_to_diagnosis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     6.0   117.8   189.5   186.8   269.2   358.0

summary(metastasis)
## FALSE  TRUE
##    66    22

table(metastasis, stratum)
##           stratum
## metastasis screening interval clinical
##      FALSE        43       10       13
##      TRUE          6        6       10

















These are “small data” & we should be careful with them

A computer scientist’s guide to precision medicine

• Step 1: pick some models

• Step 2: pick some scoring rules/performance metrics

• Step 3: “classification”

Scoring rule examples (aka. loss functions, aka. metrics)

• Accuracy: what proportion of our predictions did we get right?

• Precision: what proportion of our “success” predictions were correct?

• Recall: what proportion of the true successes did we detect?

All of these need a decision threshold: p > .5? Something else?

Decoupling score and decision threshold

• AUC: the probability of ranking a success higher than a failure (aka concordance probability)
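To make the concordance reading of AUC concrete, here is a minimal base-R sketch. The function name `concordance` and the toy data are illustrative assumptions, not taken from the talk; ties are counted as half a concordant pair, which is one common convention.

```r
# Concordance probability (AUC): the chance that a randomly chosen success
# gets a higher predicted probability than a randomly chosen failure.
# Ties count as half a concordant pair (a convention assumed here).
concordance <- function(p_hat, y) {
  pos <- p_hat[y == 1]
  neg <- p_hat[y == 0]
  diffs <- outer(pos, neg, "-")            # every success-failure pair
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Hypothetical toy data, not from the study
y     <- c(0, 0, 1, 0, 1, 1)
p_hat <- c(0.1, 0.4, 0.35, 0.2, 0.8, 0.7)

auc <- concordance(p_hat, y)
print(auc)  # 8 of the 9 success-failure pairs are ranked correctly
```

No threshold appears anywhere in the computation: only the ordering of the predicted probabilities matters.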

Just trying some methods & scores

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_d x_d$$

“lasso”:

$$\sum_i |\beta_i| \le t$$

“ridge”:

$$\sum_i \beta_i^2 \le t$$

Figures from Hastie, Tibshirani, and Friedman: The Elements of Statistical Learning

[Figure: “lasso” constraint region + “ridge” constraint region = “ElasticNet” constraint region]

“ElasticNet”:

$$\sum_i \left[\alpha\beta_i^2 + (1-\alpha)|\beta_i|\right] \le t$$

Tradeoff between penalty types, controls “roundness”
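As a small illustration of how alpha trades off the two penalty types, the sketch below evaluates the elastic net penalty exactly as written in the formula on this slide. The function name `enet_penalty` and the coefficient vector are hypothetical. One caveat worth stating: in this form alpha weights the ridge term, whereas the glmnet package uses the opposite convention, where alpha = 1 is the lasso.

```r
# Elastic net penalty in the slide's parameterization:
# alpha weights the squared (ridge) term, 1 - alpha the absolute (lasso) term.
enet_penalty <- function(beta, alpha) {
  sum(alpha * beta^2 + (1 - alpha) * abs(beta))
}

beta <- c(0.5, -2, 0)                       # hypothetical coefficients

lasso_pen <- enet_penalty(beta, alpha = 0)   # sum of |beta_i|  = 2.5
ridge_pen <- enet_penalty(beta, alpha = 1)   # sum of beta_i^2  = 4.25
mid_pen   <- enet_penalty(beta, alpha = 0.5) # halfway between  = 3.375

print(c(lasso = lasso_pen, ridge = ridge_pen, mid = mid_pen))
```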

Trying different alphas

[Figure: three panels of AUC against log(Lambda): ElasticNet, binomial family, alpha = 0.5; Lasso, binomial family (alpha = 1); Ridge, binomial family (alpha = 0).]

Figures show concordance (higher is better)

Trying different alphas

[Figure: the same three panels of AUC against log(Lambda), annotated with approximate best AUC values: .7, .8, .5.]

[Figure: best AUC for varying alpha. x-axis: alpha, 0.0 to 1.0; y-axis: AUC, 0.5 to 0.8.]

Finding the “best” parameter alpha by cross-validation

[Figure: best AUC against alpha.]

???????????? (this is the lizard)

❧ intermission ☙

Act II: When you are engulfed in flames

[Figure: best AUC against alpha, shown again.]

Some “technical” sources of variation

• The big classic one: sample size

• Scoring rule

• Validation procedure





Small data: sample size is more or less fixed in the human model

[Figure: typical sample sizes in transcriptomics, n = 1178; axis ticks at 4, 9, 21, 56, 176, 614, 3372, 18736.]

Ethics, economy, and logistics limit access to human observations.





Yet another scoring rule

The Brier score is the mean squared error of predicted probabilities:

$$n^{-1}\sum_i (\hat{p}_i - p_i)^2$$
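A minimal sketch of the Brier score in base R follows. It assumes the usual convention that the observed 0/1 outcome stands in for p_i; the function name `brier` is illustrative. The outcome counts are the ones printed by summary(metastasis) earlier in the talk.

```r
# Brier score: mean squared difference between forecast and 0/1 outcome
brier <- function(p_hat, y) mean((p_hat - y)^2)

# Outcome counts as in summary(metastasis): 66 FALSE, 22 TRUE
y <- c(rep(0, 66), rep(1, 22))

# Intercept-only ("null") forecast: the observed prevalence for everyone
p_null     <- rep(mean(y), length(y))
null_brier <- brier(p_null, y)

print(null_brier)  # 0.25 * 0.75 = 0.1875
```

For a constant forecast equal to the prevalence p, the score reduces to p(1 − p), which is why a model with no predictors at all can still have a respectable Brier score.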

Some risk surfaces

$$\log\frac{p}{1-p} = 1 + x, \quad x \sim U[-6, 6]$$

(risk = expected loss)

[Figure: contour plots of risk as a function of intercept and slope, both from −0.5 to 2.0, for data from the model above. Three panels, added one at a time: Brier, Accuracy, Concordance. A dot marks a point on each surface.]

Brighter is better
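A risk surface of this kind can be approximated by simulation. The sketch below does so for the Brier score only, under stated assumptions: the sample size, the grid spacing, and the helper name `brier_risk` are illustrative choices, and the empirical mean over simulated data stands in for the expected loss.

```r
set.seed(1)

# Simulate from the model on the slide: logit(p) = 1 + x, x ~ U[-6, 6]
n <- 20000
x <- runif(n, -6, 6)
y <- rbinom(n, 1, plogis(1 + x))

# Empirical Brier risk of the forecast plogis(intercept + slope * x)
brier_risk <- function(intercept, slope) {
  mean((plogis(intercept + slope * x) - y)^2)
}

# Evaluate on the same intercept/slope range as the contour plots
grid <- expand.grid(intercept = seq(-0.5, 2, by = 0.25),
                    slope     = seq(-0.5, 2, by = 0.25))
grid$risk <- mapply(brier_risk, grid$intercept, grid$slope)

# Grid point with the lowest estimated risk
best <- grid[which.min(grid$risk), ]
print(best)
```

Because the Brier score is a proper scoring rule, the lowest-risk grid point should land at or next to the true parameters (intercept 1, slope 1). Swapping the squared error for a 0/1 loss at a threshold gives the corresponding accuracy surface.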





Validation

• Holdout data

• Cross-validation

• Repeat CV

• The Bootstrap

Holdout data

[Figure: the data are split into a training part and a held-out part.]

i) Fit model

ii) Calculate score

Cross validation

[Figure: the data are split into folds; each fold is held out in turn.]

i) Fit model

ii) Score

iii) Fit model

iv) Score

&c., &c.

xi) Summarize by mean, sd

Repeated cross validation

It’s exactly what you’d expect
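The fit/score/summarize loop, and its repeated version, can be sketched in base R as below. This is a toy under stated assumptions: the data are simulated (one predictor, 88 rows to mirror the size of the study data), the model is a plain logistic regression via glm, and `cv_brier` is a hypothetical helper name.

```r
set.seed(2)

# Simulated stand-in data, not the study data
n   <- 88
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + dat$x))

# One round of k-fold cross-validation, scored by the Brier score
cv_brier <- function(dat, k = 5) {
  fold <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold labels
  scores <- sapply(seq_len(k), function(j) {
    fit   <- glm(y ~ x, family = binomial, data = dat[fold != j, ])  # fit model
    p_hat <- predict(fit, newdata = dat[fold == j, ], type = "response")
    mean((p_hat - dat$y[fold == j])^2)                               # score
  })
  mean(scores)                                                       # summarize
}

single   <- cv_brier(dat)
repeated <- replicate(20, cv_brier(dat))  # repeated CV: fresh folds each time

print(single)
print(c(mean = mean(repeated), sd = sd(repeated)))
```

The spread of `repeated` shows how much a single cross-validation estimate depends on the random fold assignment alone.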

Bootstrap

[Figure: the observed sample is resampled with replacement, again and again.]

&c., &c., &c.

$$\hat{F} \sim F$$

$$\hat{F}^* \sim \hat{F}$$

$$T(\hat{F}^*, \hat{F}) \sim T(\hat{F}, F)$$

“The bootstrap principle”
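In code, the bootstrap principle amounts to treating the observed sample as the population and resampling from it. The sketch below estimates the standard error of a sample median this way; the exponential toy data and the number of resamples are assumptions made for illustration.

```r
set.seed(3)

# Observed sample: its empirical distribution F-hat stands in for the unknown F
x <- rexp(50)

# Resample from F-hat (with replacement) and recompute the statistic each time
B <- 2000
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))

# The spread of the resampled statistic approximates its sampling variability
se_boot <- sd(boot_medians)
print(se_boot)
```

The same loop works for any statistic, including a model's Brier score or concordance, which is how the bootstrap is used as a validation procedure.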

Relative efficiency of two estimators

For two estimators, T_1 and T_2, of the same quantity:

$$\frac{\mathrm{Var}(T_1)}{\mathrm{Var}(T_2)}$$

All else being equal, pick the less variable one


[Figure: densities of error estimates (p=2, k=2) for four validation procedures: split sample, bootstrap, repeated cv, cv.]

Brier score estimated in different ways

Relative efficiency to split sample:

• Bootstrap: 3.5
• CV: 3.6
• Repeat CV: 3.6

Need 3–4 times as many obs. w/ split sample!
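The step from a relative efficiency of about 3.5 to "3–4 times as many observations" can be spelled out under one simplifying assumption, namely that the variance of each estimator shrinks in proportion to 1/n. The constants c_split and c_boot below are notation introduced for this sketch only.

```latex
\[
\mathrm{Var}(T_{\text{split}}) \approx \frac{c_{\text{split}}}{n},
\qquad
\mathrm{Var}(T_{\text{boot}}) \approx \frac{c_{\text{boot}}}{n},
\qquad
\frac{\mathrm{Var}(T_{\text{split}})}{\mathrm{Var}(T_{\text{boot}})}
  = \frac{c_{\text{split}}}{c_{\text{boot}}} \approx 3.5 .
\]
% To match the bootstrap's variance at sample size n,
% the split-sample estimate needs n' observations with
\[
\frac{c_{\text{split}}}{n'} = \frac{c_{\text{boot}}}{n}
\quad\Longrightarrow\quad
n' = \frac{c_{\text{split}}}{c_{\text{boot}}}\, n \approx 3.5\, n .
\]
```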

Some lessons

1. Small data: new observations are hard to get

2. Optimize a less weird scoring rule

3. Estimate with less variance


❧ intermission ☙

Act III: Hold Fast

Brier score + Bootstrap

[Thesis excerpt: chapter “Metastasis prediction”]

            model �.�     model �.�
LIMMA-t     .44 ± .30     .76 ± .20
SAM         .46 ± .26     .75 ± .24
ANOVA-fs    .51 ± .29     .75 ± .16
ANOVA-s     .41 ± .57     .75 ± .38
t-test      .65 ± 1.5     .74 ± .71
ANOVA-f     .44 ± .25     .72 ± .21

intercept   .5
stratum     .49 ± .055
lasso       .36 ± 1.4
ridge       .81 ± 3.3

Table �.�: AUC presented as point estimate plus/minus two standard errors. Measures the probability of forecasting a higher probability of metastasis for a randomly chosen metastasis case than for a randomly chosen non-metastasis case: higher is better. Model number refers to the equations in Section �.A.�. Model �.� includes stratum as a predictor. Below the break are the four baseline models.

The collected results for model �.� suggest some reason for optimism. Due to the size of the standard errors we must necessarily be uncertain about even the first significant digit of our point estimates. But even accounting for uncertainty there seems to be predictive information better than random guess. As in the simulations, there is not too much difference between the different methods, perhaps apart from the simple t-test, for which we observe much variance. Note that both SAM and LIMMA are flexible frameworks and we could have accounted for stratum and followup in either. Our comparison is between using this information and various ways of not using it, and there is no reason to believe that either framework should perform poorly if we were to use more refined models there.

Table �.� shows the predictor set stability as point estimate plus/minus two standard errors. Stability is in general very low, and the standard errors suggest that there is even some uncertainty to the order of magnitude of the point estimates. A possible interpretation is that the correlation between genes is such that many different genes hold similar information. It is at least clear that we need much more data if we want to find a stable set of predictor genes. If we take the point estimates at face value, Table �.� reflects the fact that we see lower uncertainty using ANOVA-f/fs in Tables �.� and �.�.

[Footnote, continued from another page:] issue mentioned in the preamble to this chapter. For details see Section �.�.� and Section �.�.

[Thesis excerpt: appendix “Variable selection methods”]

            model �.�     model �.�
t-test      .17 ± .45     .17 ± .33
ANOVA-fs    .27 ± .13     .18 ± .10
SAM         .34 ± .11     .20 ± .15
ANOVA-s     .33 ± .22     .20 ± .25
ANOVA-f     .31 ± .084    .21 ± .11
LIMMA-t     .35 ± .14     .20 ± .17

intercept   .19 ± .010
stratum     .22 ± .029
lasso       .27 ± .19
ridge       .23 ± .30

Table �.�: Brier scores presented as point estimate plus/minus two standard errors. Measures error in forecast probability: lower is better. Model number refers to the equations in Section �.A.�. Model �.� includes stratum as a predictor. Below the break are the four baseline models.

but it is noteworthy that the intercept-only model is among the best-calibrated. The uncertainty is large enough that it is difficult to say that any selection method is better than any other. It is clear that the interaction with detection method in model �.� improves calibration for all models. There is also lower uncertainty in the ANOVA-f/fs models.

AUC or concordance probability is a measure of a model’s ability to discriminate between outcomes: the higher the better. Brier score alone does not provide full information about predictive performance; the intercept-only model is well-calibrated but cannot be used for prediction at all. Random guess (or forecasting a constant for every observation) yields AUC of .5; perfect discrimination yields AUC of unity. Table �.� shows AUC as point estimate plus/minus two standard errors in decreasing order by model �.�. Again the clearest signal is that the added information from detection method is very important. Point estimates improve markedly and standard errors generally decrease. Also here does use of stratification and followup time in preselection reduce uncertainty.

The ridge regression baseline performance has a very good AUC point estimate, but the standard error is very large. Too large: it is a theorem that the upper bound on standard deviation in a variable ∈ [0, 1] is 1/2. This says something about the imperfection of the jackknife as an estimator of standard error. The blame lies at least in part with the correctional factor (n−1)/n in Equation �.�, which was originally defined heuristically. Since it is difficult to suggest a sensible alternative, we choose to live with this.

[Footnote:] This was really the result of nesting a cross-validation in the bootstrap: the methodology
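The bound on the standard deviation that the excerpt appeals to can be derived in two lines. The sketch below is the standard argument for a random variable X taking values in [0, 1] with mean μ; the symbols X and μ are notation introduced here.

```latex
\[
\mathrm{Var}(X) = \mathrm{E}[X^2] - \mu^2
  \;\le\; \mathrm{E}[X] - \mu^2
  \;=\; \mu(1-\mu)
  \;\le\; \tfrac{1}{4},
\qquad\text{since } X^2 \le X \text{ on } [0,1],
\]
\[
\text{so}\quad \mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)} \;\le\; \tfrac{1}{2}.
\]
```

An interval of plus/minus 3.3 for an AUC, as reported for the ridge baseline, corresponds to a standard error of 1.65, well above this bound.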

Concordance: Higher better, random guess is .5

Brier score: Lower better, null model is .19

















Concordance Brier







































In short: more lizards ahead

Reminder of likelihood penalties

$$\sum_i |\beta_i| \le t$$

$$\sum_i \beta_i^2 \le t$$

$$\sum_i \left[\alpha\beta_i^2 + (1-\alpha)|\beta_i|\right] \le t$$


Need to choose t (aka lambda)

Risky procedure

# nested cv in bootstrap
boot <- bootstrap_samples()
for (b in boot) {
  lambda <- cross_validate_glmnet(b)
}





[Figure: a bootstrap sample is split into a training part and a test part.]

i) train

ii) test

Bias toward !!!!!!!!!
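One way to see why cross-validating inside a bootstrap sample is risky: a bootstrap sample contains repeated observations, and an ordinary fold split can put copies of the same observation on both the training and the test side. The sketch below measures that overlap in base R. The sample size, the number of folds, and the "deduplicated" fold assignment are assumptions made for illustration; the latter is only one plausible reading of a deduplicated variant.

```r
set.seed(4)

n        <- 88
boot_idx <- sample(n, replace = TRUE)        # one bootstrap sample of row indices
fold     <- sample(rep(1:5, length.out = n)) # ordinary 5-fold split of that sample

# Per fold: share of test rows whose original observation is also in training
leak <- sapply(1:5, function(j) {
  mean(boot_idx[fold == j] %in% boot_idx[fold != j])
})
print(round(leak, 2))

# Deduplicated variant (assumption): assign folds per original observation,
# so that all copies of an observation end up in the same fold
uniq       <- unique(boot_idx)
obs_fold   <- sample(rep(1:5, length.out = length(uniq)))
fold_dedup <- obs_fold[match(boot_idx, uniq)]

leak_dedup <- sapply(1:5, function(j) {
  mean(boot_idx[fold_dedup == j] %in% boot_idx[fold_dedup != j])
})
print(leak_dedup)
```

With the ordinary split, part of every test fold has already been seen during training, so the test error looks better than it should and the tuning parameter chosen from it is biased accordingly. Keeping copies together removes the overlap.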

Risky procedure

[Figure: densities of the chosen shrinkage parameter lambda (p=100, k=1000) for three procedures: cv, cv in bootstrap, deduplicated cv in bootstrap.]

Instead choose lambda by AIC

[Figure: AIC as a function of shrinkage parameter. x-axis: lambda, 0.00 to 0.30; y-axis: AIC. Annotations: scatterplot smoother, max curvature.]

ElasticNet, alpha = .5

$$\sum_i \left[\alpha\beta_i^2 + (1-\alpha)|\beta_i|\right] \le t$$

Bootstrapped estimates

[Figure: two rows of histograms of bootstrapped estimates, one for a model that uses stratum information and one for a model that does not. Each row has three panels: Brier score (about 0.04 to 0.16), Concordance (about 0.65 to 0.95), and Stability (about 0.05 to 0.35).]


[Figure: calibration curve for predictions. x-axis: predicted metastasis probability; y-axis: proportion of metastases; both from 0.0 to 1.0. Shown: the expected curve and its middle 80%. Regions annotated: overestimation, underestimation.]
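A calibration curve compares predicted probabilities with the proportion of events actually observed among cases given similar predictions. The sketch below builds a binned version in base R. Everything in it is simulated: the forecasts are made deliberately overconfident so that the miscalibration is visible, and the ten equal-width bins are an arbitrary choice.

```r
set.seed(5)

# Simulated truth and outcomes, not study data
n      <- 2000
p_true <- runif(n)
y      <- rbinom(n, 1, p_true)

# Deliberately overconfident forecasts: log-odds stretched by a factor of 2
p_hat <- plogis(2 * qlogis(p_true))

# Bin the forecasts and compare mean forecast with observed proportion per bin
bins  <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
calib <- data.frame(
  predicted = as.vector(tapply(p_hat, bins, mean)),
  observed  = as.vector(tapply(y, bins, mean)),
  n         = as.vector(table(bins))
)
print(round(calib, 2))
```

For a well-calibrated model the two columns agree. Here the low bins have more events than predicted and the high bins fewer, the signature of forecasts that are too extreme.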

108 genes selected

[Thesis excerpt: section “Results”]

likely. This is a natural consequence of doing variable selection: “redundant” information may shrink out of the model.

Table �.�: Resampling selection probability for the 108 elasticnet-selected genes.

GRK5 (a)           0.853   C�orf��            0.290   ANO8               0.221   FBLN5      0.157
GPATCH4            0.682   LOC��              0.287   PTTG1IP            0.219   BLMH       0.156
GNGT2              0.474   RNF214             0.280   3NDg8gVCd... (b)   0.218   FCRL3      0.149
PDGFD (c)          0.467   SULT�A�            0.278   USF1               0.216   TDRD9      0.143
FAM24B             0.457   ZNF365             0.271   BCCIP              0.210   ACY1       0.142
PTPRN2             0.442   USE1               0.267   MGC29506           0.209   ZFP57      0.142
CBLB               0.440   DNMT3A             0.267   GRK5 (a)           0.207   SLIC1      0.138
PDCL               0.410   LOC��              0.266   WTIP               0.205   PICK1      0.135
RASA2              0.380   CNTNAP2            0.265   BCL10              0.204   RTN4IP1    0.134
C��orf��           0.376   IL2RA              0.265   DLGAP2             0.200   CDCA7L     0.132
TCEB1              0.374   CCT5               0.264   HRAS               0.199   BEX4       0.131
CAPN3              0.354   R3HDM1             0.263   RAD1               0.189   FCAR       0.130
STK19              0.351   MRPL43             0.260   PRKCE              0.187   ANKRD��    0.111
GUCY1A3            0.348   SLC38A1            0.256   UBAP2L             0.186   USP39      0.109
ZDHHC11            0.345   GNG8               0.255   BPI                0.186   KIAA0495   0.106
SULT�A�            0.336   PLA2G4C            0.251   DTX1               0.184   BRI3BP     0.106
Z6FIQGkeo... (d)   0.335   TCF4               0.248   LASS5              0.182   TUBA4A     0.105
FAM89A             0.328   uX15cu4f_... (e)   0.247   GSTT1              0.182   IDH1       0.102
rh13dQX04... (f)   0.324   C��orf��           0.245   SPATA20            0.182   DDX52      0.100
LANCL2             0.323   VCL                0.242   IGLL1              0.172   ANKRD��    0.094
SERPINE2           0.318   EZH2               0.242   SPG3A              0.172   TFG        0.087
ADIPOR2            0.314   PRPSAP2            0.237   PPAP2A             0.172   LILRA6     0.080
GPR177             0.312   ISY1               0.235   NOTCH2NL           0.172   C�orf��    0.078
PDGFD (c)          0.299   UGDH               0.234   TAF6               0.168   WDR60      0.075
LOC��              0.294   ABCF2              0.230   CCDC90B            0.166   AHCYL2     0.068
WEE1               0.293   C16orf5            0.229   LOC��              0.158   HAUS4      0.068
ITM2C              0.291   VAV3               0.225   CDH2               0.157   MAD2L2     0.053

a. Two probes map to the same gene GRK5. Combined selection probability is �.��, implying that both get selected together at least some of the time.
b. Illumina probe id 3NDg8gVCdQkNdcg.Ko, missing annotation.
c. Two probes map to the same gene PDGFD. Combined selection probability is �.��.
d. Illumina probe id Z6FIQGkeoCSiVAoKeg, missing annotation.
e. Illumina probe id uX15cu4f_VUIuXoST�, missing annotation.
f. Illumina probe id rh13dQX04hUS�uOpRQ, missing annotation.

Figure �.� shows the (log fold change) expression levels in each of the 108 selected genes for the metastasized and non-metastasized observations. The shaded area shows the middle .8 of the bootstrap distribution for difference in medians between the two groups; the white notch shows the expectation of this distribution, by which the genes are ordered. The black snake-shaped line marks the two group medians. The non-metastasized median is usually around zero, so the difference in medians is mostly dominated by the median fold change of the metastasized observations. In other words, for these genes the average case–control pair is similar in the non-metastasized group, while the average pair is dissimilar in the metastasized group.


Low selection frequencies: unstable signatures

Please turn to page 22 in the required reading

[Figure: Expression levels, group medians, and difference in group medians for selected genes. One row per selected gene; horizontal axis (log fold change) from −1.0 to 1.5. Legend: metastasized and non-metastasized observations, the two group medians, and the bootstrapped difference in medians with the middle 80% of the bootstrap distribution.]

Some genes tend to be selected together

[Figure: Co-selection heatmap; color scale from 0 to 0.6.]

data even under resampling.

Table �.�: Genes that tend to be selected together, ordered alphabetically.

ADIPOR2    FAM89A    LANCL2    PTPRN2         SULT1A3
C��orf��   GNG8      LOC��     R3HDM1         TCEB1
C�orf��    GNGT2     LOC��     RASA2          TCF4
CAPN3      GPATCH4   PDCL      rh13dQX04...   WEE1
CBLB       GRK5      PDGFD     SERPINE2       Z6FIQGkeo...
DNMT3A     GUCY1A3   PDGFD     STK19          ZDHHC11
FAM24B     ITM2C     PRPSAP2   SULT1A1        ZNF365
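One way to arrive at a list of genes that tend to be selected together is to count, over resamples, how often each pair of genes is selected in the same fitted model. The sketch below assumes that definition of co-selection and a boolean selection matrix as input; both are my assumptions for illustration, and the function name is hypothetical.

```python
import numpy as np


def co_selection_matrix(selected):
    """Fraction of resamples in which each pair of genes is selected together.

    selected: boolean array of shape (n_resamples, n_genes); entry [r, g] is
    True if gene g was selected by the model fitted to resample r.
    The diagonal holds each gene's own selection probability.
    """
    s = np.asarray(selected, dtype=float)
    n_resamples = s.shape[0]
    # (s.T @ s)[i, j] counts resamples where genes i and j are both selected.
    return s.T @ s / n_resamples
```

The off-diagonal entries are what a co-selection heatmap would display, and thresholding them gives a candidate list of co-selected genes.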

�.� Conclusion

We have demonstrated predictability of metastasis in these data. We can, with a high probability, rank case–control pairs in terms of predicted metastasis probability. However, we should not count the model itself as a reliable tool due to poor calibration and stability, and since these results stem from exploratory modeling we should be moderate in our expectations; further investigation is needed to establish reliable results.

We provide �� candidate predictor genes as an avenue for future research. We are currently investigating their biological properties. An interesting statistical investigation may be to review the importance of the stratification and how to build this into a shrinkage model, as the results in the appendix below indicate that this may lead to improvements. We believe, however, that it is necessary to obtain independent data to be able to make any inference stronger than general indication.

�.A Appendix: variable selection methods

In addition to the main results presented above we previously explored various ad-hoc variable selection schemes. The results of these explorations are not competitive compared with the above penalized likelihood model, but I present them here for completeness and comparison. To make the next sections complete we must define the followup time of a case. This is the number of days between provision of the blood sample and the eventual diagnosis of cancer. Although followup introduces a time aspect, these are not time series data in the strictest technical sense. Each observation stems from a different woman, so there should be no autocorrelation to speak of, and followup time is random.
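As a concrete reading of the followup definition above, here is a minimal sketch. The function name and argument names are hypothetical and the dates in any call are invented, not taken from the data.

```python
from datetime import date


def followup_days(blood_sample_date, diagnosis_date):
    """Days between provision of the blood sample and the cancer diagnosis."""
    days = (diagnosis_date - blood_sample_date).days
    if days < 0:
        # In a prospective design the sample precedes the diagnosis.
        raise ValueError("diagnosis precedes blood sample")
    return days
```

Each case contributes a single followup value, which is why the values are treated as one random quantity per woman rather than as a time series.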

At this point maybe call a biologist

https://commons.wikimedia.org/wiki/File:Biologist_Victoria_Achkasova_20150529.jpg

Thesis: Small data require particular care

• 1000s of measurements, maybe 100 observations

• Validation matters more than you think

• Model search difficult

• I suggest making more assumptions





















Thesis: Small data require particular care

Fast / Good / Cheap

Thesis: Small data require particular care

Agnostic modeling / Careful validation / Small data

NO FREE LUNCHES

Mosteller & Tukey’s green book

Naturally, we all desire an adequate assessment of both the indications and their uncertainties, but we shouldn’t refuse good cake only because we can’t have frosting too.

Closing curtain: Thank you.
