Small data: practical modeling issues in human-model -omic data Defense for the degree of Ph. D. Einar Holsbø February 8th, 2019
Page 1: Small data: practical modeling issues in human-model -omic ...3inar.github.io/pdfs/phd_main_small_data.pdf · Metastasis is the spread of cancer in the body 0.0 0.2 0.4 0.6 0.8 1.0

Small data: practical modeling issues in human-model -omic data

Defense for the degree of Ph.D.

Einar Holsbø

February 8th, 2019


Act I: “Boy Bitten by a Lizard” (1590s)


“Can we predict breast cancer metastasis from blood samples?”

–Eiliv Lund, 4.5 years ago, quote made up


Metastasis is the spread of cancer in the body

[Figure: five-year survival probability (0.0 to 1.0) for various cancers, by stage: Local, Regional, Distant. Female breast cancer is highlighted.]

Data source: Siegel, R. L., Miller, K. D. and Jemal, A. (2017), Cancer statistics, 2017. CA: A Cancer Journal for Clinicians, 67: 7-30. doi:10.3322/caac.21387


Goal: predict it, win the Nobel prize 🏅


Norwegian Women and Cancer

• Prospective population-based cohort that tracks 34% (170 000) of all Norwegian women born 1943–57.

• Data collection in NOWAC started in 1991. It includes blood samples from 50 000 women, as well as more than 300 biopsies.

• Now contains various -omics material: microarray mRNA, miRNA, methylation, metabolomics, and RNA-seq.

Prospective

[Figure: study timeline with the labels “Enrollment” and “Time →”]

Nested case–control

[Figure: four bracketed groups, each labelled “cc-pair” (case–control pair)]


A prospective design is nice because recruitment is blinded to outcome and exposure

Low bias

Gene expression

[Figure: DNA (base pairs AT GC CG TA TA CG …) → mRNA (U C G A A G …) → some useful protein]

[Figure: the mRNA fragment U C G A A G next to a light bulb 💡]

How much light do we see?


Data at a glance

    dim(gene_expression)
    ## [1]    88 12404

    summary(days_to_diagnosis)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##     6.0   117.8   189.5   186.8   269.2   358.0

    summary(metastasis)
    ## FALSE  TRUE
    ##    66    22

    table(metastasis, stratum)
    ##           stratum
    ## metastasis screening interval clinical
    ##      FALSE        43       10       13
    ##      TRUE          6        6       10


These are “small data” & we should be careful with them


A computer scientist’s guide to precision medicine

• Step 1: pick some models

• Step 2: pick some scoring rules/performance metrics

• Step 3: “classification”


Scoring rule examples (a.k.a. loss functions, a.k.a. metrics)

• Accuracy: what fraction of predictions did we get right?

• Precision: what fraction of our “success” predictions were correct?

• Recall: what fraction of the true successes did we detect?
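The three scoring rules in the list can be sketched in base R as follows. This is only an illustration: the `truth` and `predicted` vectors are made-up toy data, and `classification_scores` is a hypothetical helper name, not something from the talk.

```r
# Hypothetical toy data: TRUE = "success" (e.g. metastasis), FALSE = failure.
truth     <- c(TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE)

classification_scores <- function(truth, predicted) {
  tp <- sum(predicted & truth)     # predicted success, was a success
  fp <- sum(predicted & !truth)    # predicted success, was a failure
  fn <- sum(!predicted & truth)    # predicted failure, was a success
  tn <- sum(!predicted & !truth)   # predicted failure, was a failure
  c(accuracy  = (tp + tn) / length(truth),  # share of all predictions that are right
    precision = tp / (tp + fp),             # share of "success" predictions that are right
    recall    = tp / (tp + fn))             # share of true successes that were detected
}

scores <- classification_scores(truth, predicted)
print(scores)
```

All three need hard yes/no predictions, so a model that outputs probabilities has to be cut at some threshold before any of them can be computed.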


Which decision threshold? p > .5? Something else?


Decoupling score and decision threshold

• AUC: the probability of ranking a success higher than a failure (a.k.a. concordance probability)
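Read as a concordance probability, the AUC can be computed by brute force over all (success, failure) pairs. The sketch below assumes ties count one half, which is a common convention but not something the slide states; `concordance` and the toy data are hypothetical.

```r
# AUC as a concordance probability: the fraction of (success, failure) pairs
# in which the success received the higher score. Ties count one half
# (an assumption; the slide does not say how ties are handled).
concordance <- function(score, truth) {
  pos <- score[truth]             # scores given to true successes
  neg <- score[!truth]            # scores given to true failures
  diffs <- outer(pos, neg, "-")   # every success score minus every failure score
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Hypothetical toy data.
truth <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
score <- c(0.9,  0.6,  0.3,  0.5,   0.2,   0.3,   0.1)

auc <- concordance(score, truth)
print(auc)

# Only the ranking matters, so any increasing transform of the scores
# leaves the result unchanged. No decision threshold is involved.
auc_logit <- concordance(qlogis(score), truth)
```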


Just trying some methods & scores


$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_d x_d$$

“lasso”: $\sum_i |\beta_i| \le t$

“ridge”: $\sum_i \beta_i^2 \le t$

Figures from Hastie, Tibshirani, and Friedman: The Elements of Statistical Learning

[Figure: two constraint regions combined (+ =) into the “ElasticNet” region]

Figures from Hastie, Tibshirani, and Friedman: The Elements of Statistical Learning

“ElasticNet”: $\sum_i \left[\alpha\beta_i^2 + (1-\alpha)|\beta_i|\right] \le t$


α sets the tradeoff between the two penalty types and controls the “roundness” of the constraint region
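As a small worked example, the elastic net constraint can be evaluated directly for a coefficient vector. The sketch follows the formula as written on the slide, with α weighting the squared term; `elastic_net_penalty` and `beta` are hypothetical names and values. Note that R's glmnet package parameterizes the mix the other way round (there alpha = 1 is the lasso and alpha = 0 is ridge).

```r
# Left-hand side of the elastic net constraint as written on the slide:
#   sum_i [ alpha * beta_i^2 + (1 - alpha) * |beta_i| ]
# Here alpha = 0 gives the lasso penalty and alpha = 1 the ridge penalty
# (the opposite of glmnet's convention).
elastic_net_penalty <- function(beta, alpha) {
  stopifnot(alpha >= 0, alpha <= 1)
  sum(alpha * beta^2 + (1 - alpha) * abs(beta))
}

beta <- c(0.5, -1, 0, 2)                              # hypothetical coefficients

lasso_part <- elastic_net_penalty(beta, alpha = 0)    # sum of absolute values
ridge_part <- elastic_net_penalty(beta, alpha = 1)    # sum of squares
mixed      <- elastic_net_penalty(beta, alpha = 0.5)  # halfway between the two

print(c(lasso = lasso_part, ridge = ridge_part, mixed = mixed))
```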

Trying different alphas

[Figure: three panels of AUC against log(Lambda), binomial family: ElasticNet (alpha = .5), Lasso (alpha = 1), Ridge (alpha = 0). Annotated values: .7, .8, .5.]

Figures show concordance (higher is better)

[Figure: “best auc for varying alpha”: AUC (0.5 to 0.8) against alpha (0.0 to 1.0)]

Finding the “best” parameter alpha by cross-validation

????????????

(this is the lizard)


❧ intermission ☙


Act II: When you are engulfed in flames


[Figure: AUC against alpha]


Some “technical” sources of variation

• The big classic one: sample size

• Scoring rule

• Validation procedure


Small data: sample size is more or less fixed in the human model

[Figure: “Typical sample sizes in transcriptomics”, with axis ticks 4, 9, 21, 56, 176, 614, 3372, 18736; n = 1178]


Ethics, economy, and logistics limit access to human observations.


Yet another scoring rule

The Brier score is the mean squared error of the predicted probabilities

$$n^{-1}\sum_i (\hat{p}_i - p_i)^2$$
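A minimal sketch of the Brier score in base R. It assumes that the reference value for each observation is the observed 0/1 outcome, which is the usual choice in practice; `brier_score` and the toy vectors are hypothetical.

```r
# Brier score: mean squared difference between predicted probability and
# outcome. Assumption: the outcome is coded 1 for success, 0 for failure.
brier_score <- function(p_hat, outcome) {
  mean((p_hat - outcome)^2)
}

outcome <- c(1,   0,   1,   0)     # hypothetical observed outcomes
p_hat   <- c(0.8, 0.1, 0.6, 0.4)   # hypothetical predicted probabilities

brier <- brier_score(p_hat, outcome)
print(brier)

# An uninformative model that always predicts 0.5 scores 0.25,
# whatever the outcomes are.
coin_flip <- brier_score(rep(0.5, length(outcome)), outcome)
```

Lower is better, and unlike accuracy the score needs no decision threshold.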


Some risk surfaces

$$\log\frac{p}{1-p} = 1 + x, \qquad x \sim U[-6, 6]$$

(risk = expected loss)

[Figure: contour plots of risk over intercept and slope (both axes −0.5 to 2.0) for three scoring rules: Brier, Accuracy, Concordance. True model: log(p/(1−p)) = 1 + x, x ~ U[−6, 6].]

Brighter is better
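A risk surface of this kind can be approximated by simulation: draw a large sample from the true model, then evaluate the average loss of every candidate (intercept, slope) pair on it. The sketch below is one way to do that in base R, not a reconstruction of how the figures in the talk were made. The sample size, grid spacing, and function names are arbitrary choices, and the surfaces here hold plain losses, so lower is better.

```r
set.seed(1)

# Large sample from the true model: log(p / (1 - p)) = 1 + x, x ~ U[-6, 6].
n <- 20000
x <- runif(n, min = -6, max = 6)
y <- rbinom(n, size = 1, prob = plogis(1 + x))

# Monte Carlo estimate of the risk (expected loss) of a candidate model.
brier_risk <- function(intercept, slope) {
  mean((plogis(intercept + slope * x) - y)^2)
}
accuracy_risk <- function(intercept, slope) {
  mean((plogis(intercept + slope * x) > 0.5) != y)   # misclassification rate
}

# Evaluate both risks on a grid of candidates; rows are intercepts,
# columns are slopes.
grid <- seq(-0.5, 2, by = 0.25)
brier_surface    <- outer(grid, grid, Vectorize(brier_risk))
accuracy_surface <- outer(grid, grid, Vectorize(accuracy_risk))

# Grid point with the lowest estimated Brier risk.
best <- which(brier_surface == min(brier_surface), arr.ind = TRUE)
print(c(intercept = grid[best[1, "row"]], slope = grid[best[1, "col"]]))
```

Passing either matrix to `contour(grid, grid, brier_surface)` or `image()` gives a picture comparable in spirit to the panels described above.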


Validation

• Holdout data

• Cross-validation

• Repeated CV

• The Bootstrap


Holdout data


i) Fit model

ii) Calculate score
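The two holdout steps might look like this in base R. Everything about the data is an assumption made for illustration: the simulated model, the sample size of 200, the one-quarter holdout fraction, and the choice of plain logistic regression with the Brier score.

```r
set.seed(2)

# Hypothetical data simulated from a simple logistic model.
n <- 200
dat <- data.frame(x = runif(n, -6, 6))
dat$y <- rbinom(n, 1, plogis(1 + dat$x))

# Set a random quarter of the rows aside as holdout data.
test_idx <- sample(n, size = n / 4)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# i) Fit model on the training part only.
fit <- glm(y ~ x, family = binomial, data = train)

# ii) Calculate score (here the Brier score) on the held-out part.
p_hat <- predict(fit, newdata = test, type = "response")
holdout_brier <- mean((p_hat - test$y)^2)
print(holdout_brier)
```

With small data the price is visible right away: only 50 observations contribute to the score, and 50 fewer are available for fitting.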


Cross validation


i) Fit model

ii) Score


iii) Fit model

iv) Score


&c., &c.


xi) Summarize by mean, sd
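The fit/score loop and the final summary might look like this in R. It is a sketch under assumptions: simulated data, a logistic model, the Brier score as the score, and k = 5 folds (which would match ten fit/score steps followed by step xi).

```r
# k-fold cross-validation sketch (simulated data; names are illustrative).
set.seed(2)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(dat$x))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

cv_scores <- sapply(seq_len(k), function(j) {
  # Fit on every fold except j ...
  fit <- glm(y ~ x, family = binomial, data = dat[fold != j, ])
  # ... and score (Brier) on fold j.
  p_hat <- predict(fit, newdata = dat[fold == j, ], type = "response")
  mean((p_hat - dat$y[fold == j])^2)
})

# Summarize by mean, sd
c(mean = mean(cv_scores), sd = sd(cv_scores))
```

Every observation is used for testing exactly once and for fitting k - 1 times.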

Repeated cross validation

It’s exactly what you’d expect
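Spelled out as a sketch: run the whole k-fold procedure several times with fresh random fold assignments and summarize over the repeats. The helper `cv_brier`, the 20 repeats, and the simulated data are illustrative assumptions.

```r
# Repeated cross-validation sketch (simulated data; names are illustrative).
set.seed(3)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(dat$x))

# One full k-fold pass; returns the Brier score averaged over the folds.
cv_brier <- function(dat, k = 5) {
  fold <- sample(rep(seq_len(k), length.out = nrow(dat)))
  mean(sapply(seq_len(k), function(j) {
    fit <- glm(y ~ x, family = binomial, data = dat[fold != j, ])
    p_hat <- predict(fit, newdata = dat[fold == j, ], type = "response")
    mean((p_hat - dat$y[fold == j])^2)
  }))
}

# Each repeat draws a new random fold assignment.
repeats <- replicate(20, cv_brier(dat))
c(mean = mean(repeats), sd = sd(repeats))
```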

Bootstrap

&c., &c., &c.

$\hat{F} \sim F$

$\hat{F}^{*} \sim \hat{F}$

$T(\hat{F}^{*}, \hat{F}) \sim T(\hat{F}, F)$

“The bootstrap principle”
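A small R sketch of the principle in use: resampling the observed data with replacement stands in for drawing new data from F, so the spread of the resampled statistic approximates the spread of the statistic itself. The exponential sample, the choice of the median as statistic, and B = 2000 are assumptions made for illustration.

```r
# Bootstrap standard error of a sample median (illustrative sketch).
set.seed(4)
x <- rexp(50)                 # observed sample, standing in for draws from F
theta_hat <- median(x)        # statistic computed on the empirical distribution

B <- 2000
# Each resample is a draw from the empirical distribution (F-hat).
theta_star <- replicate(B, median(sample(x, replace = TRUE)))

# The spread of theta_star around theta_hat is used as an estimate of
# the spread of theta_hat around the true median.
boot_se <- sd(theta_star)
boot_se
```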

Relative efficiency of two estimators

For two estimators, $T_1$, $T_2$, of the same quantity:

$\frac{\mathrm{Var}(T_1)}{\mathrm{Var}(T_2)}$

All else being equal, pick the less variable one

[Figure: density of error estimates, p=2, k=2, one curve each for split sample, bootstrap, repeated cv, and cv]

Brier score estimated in different ways

Relative efficiency to split sample:

Bootstrap: 3.5

CV: 3.6

Repeat CV: 3.6

Need 3–4 times as many obs. w/ split sample!
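One way to read the "3–4 times as many observations" claim, assuming (this is not stated on the slide) that each estimator's variance shrinks roughly in proportion to $1/n$, with hypothetical constants $c_{\text{split}}$ and $c_{\text{boot}}$:

```latex
\mathrm{Var}(T_{\text{split}}) \approx \frac{c_{\text{split}}}{n}, \qquad
\mathrm{Var}(T_{\text{boot}}) \approx \frac{c_{\text{boot}}}{n}, \qquad
\frac{\mathrm{Var}(T_{\text{split}})}{\mathrm{Var}(T_{\text{boot}})}
  = \frac{c_{\text{split}}}{c_{\text{boot}}} = 3.5

\frac{c_{\text{split}}}{n_{\text{split}}} = \frac{c_{\text{boot}}}{n}
\;\iff\; n_{\text{split}} = 3.5\, n
```

Under that assumption, the split-sample estimate needs about 3.5 times the observations to be as stable as the bootstrap estimate.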

Some lessons

1. Small data: new observations are hard to get

2. Optimize a less weird scoring rule

3. Estimate with less variance

❧ intermission ☙

Act III: Hold Fast

Brier score + Bootstrap

            model �.�     model �.�
LIMMA-t     .44 ± .30     .76 ± .20
SAM         .46 ± .26     .75 ± .24
ANOVA-fs    .51 ± .29     .75 ± .16
ANOVA-s     .41 ± .57     .75 ± .38
t-test      .65 ± 1.5     .74 ± .71
ANOVA-f     .44 ± .25     .72 ± .21

intercept   .5
stratum     .49 ± .055
lasso       .36 ± 1.4
ridge       .81 ± 3.3

Table �.�: AUC presented as point estimate plus/minus two standard errors. Measures the probability of forecasting a higher probability of metastasis for a randomly chosen metastasis case than for a randomly chosen non-metastasis case: higher is better. Model number refers to the equations in Section �.A.�. Model �.� includes stratum as a predictor. Below the break are the four baseline models.

The collected results for model �.� suggest some reason for optimism. Due to the size of the standard errors we must necessarily be uncertain about even the first significant digit of our point estimates. But even accounting for uncertainty there seems to be predictive information better than random guess. As in the simulations, there is not too much difference between the different methods, perhaps apart from the simple t-test, for which we observe much variance. Note that both SAM and LIMMA are flexible frameworks and we could have accounted for stratum and followup in either. Our comparison is between using this information and various ways of not using it, and there is no reason to believe that either framework should perform poorly if we were to use more refined models there.

Table �.� shows the predictor set stability as point estimate plus/minus two standard errors. Stability is in general very low, and the standard errors suggest that there is even some uncertainty to the order of magnitude of the point estimates. A possible interpretation is that the correlation between genes is such that many different genes hold similar information. It is at least clear that we need much more data if we want to find a stable set of predictor genes. If we take the point estimates at face value, Table �.� reflects the fact that we see lower uncertainty using ANOVA-f/fs in Tables �.� and �.�.

            model �.�     model �.�
t-test      .17 ± .45     .17 ± .33
ANOVA-fs    .27 ± .13     .18 ± .10
SAM         .34 ± .11     .20 ± .15
ANOVA-s     .33 ± .22     .20 ± .25
ANOVA-f     .31 ± .084    .21 ± .11
LIMMA-t     .35 ± .14     .20 ± .17

intercept   .19 ± .010
stratum     .22 ± .029
lasso       .27 ± .19
ridge       .23 ± .30

Table �.�: Brier scores presented as point estimate plus/minus two standard errors. Measures error in forecast probability: lower is better. Model number refers to the equations in Section �.A.�. Model �.� includes stratum as a predictor. Below the break are the four baseline models.

... but it is noteworthy that the intercept-only model is among the best-calibrated. The uncertainty is large enough that it is difficult to say that any selection method is better than any other. It is clear that the interaction with detection method in model �.� improves calibration for all models. There is also lower uncertainty in the ANOVA-f/fs models.

AUC or concordance probability is a measure of a model’s ability to discriminate between outcomes: the higher the better. Brier score alone does not provide full information about predictive performance; the intercept-only model is well-calibrated but cannot be used for prediction at all. Random guess (or forecasting a constant for every observation) yields AUC of .5; perfect discrimination yields AUC of unity. Table �.� shows AUC as point estimate plus/minus two standard errors in decreasing order by model �.�. Again the clearest signal is that the added information from detection method is very important. Point estimates improve markedly and standard errors generally decrease. Here too, use of stratification and followup time in preselection reduces uncertainty.

The ridge regression baseline performance has a very good AUC point estimate, but the standard error is very large. Too large: it is a theorem that the upper bound on standard deviation in a variable $\in [0, 1]$ is $\frac{1}{2}$. This says something about the imperfection of the jackknife as an estimator of standard error. The blame lies at least in part with the correctional factor $\frac{n-1}{n}$ in Equation �.�, which was originally defined heuristically. Since it is difficult to suggest a sensible alternative, we choose to live with this.*

* This was really the result of nesting a cross-validation in the bootstrap: the methodology issue mentioned in the preamble to this chapter. For details see Section �.�.� and Section �.�.

Concordance: Higher better, random guess is .5

Brier score: Lower better, null model is .19
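Both scores are short to write down. The R sketch below uses hypothetical helper names (`brier`, `concordance`) and a made-up five-observation example; it is not the code behind the tables.

```r
# Brier score: mean squared difference between forecast probability and outcome.
brier <- function(p, y) mean((p - y)^2)

# Concordance (AUC): probability that a randomly chosen case gets a higher
# forecast than a randomly chosen non-case, counting ties as one half.
concordance <- function(p, y) {
  diffs <- outer(p[y == 1], p[y == 0], "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

y <- c(1, 1, 0, 0, 0)                 # toy outcomes (1 = case)
p <- c(0.9, 0.4, 0.5, 0.2, 0.1)       # toy forecast probabilities

brier(p, y)                           # 0.134
concordance(p, y)                     # 5/6

# A constant forecast has no discrimination at all ...
concordance(rep(mean(y), 5), y)       # 0.5
# ... but its Brier score is mean(y) * (1 - mean(y)), here 0.24.
brier(rep(mean(y), 5), y)
```

The last two lines show why the two scores are read together: a constant forecast sits at a concordance of .5 however good its Brier score looks.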

In short: more lizards ahead

Reminder of likelihood penalties

$\sum |\beta_i| \le t$

$\sum \beta_i^2 \le t$

$\sum \left[ \alpha \beta_i^2 + (1 - \alpha) |\beta_i| \right] \le t$

Need to choose t (aka lambda)
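For a concrete feel, the three penalty terms can be evaluated on a small coefficient vector. The values of `beta` and `alpha` below are made up for illustration.

```r
beta  <- c(0.5, -1.0, 0, 2.0)   # illustrative coefficients
alpha <- 0.3                    # mixing weight, placed as on the slide

lasso_penalty <- sum(abs(beta))                                  # 3.5
ridge_penalty <- sum(beta^2)                                     # 5.25
enet_penalty  <- sum(alpha * beta^2 + (1 - alpha) * abs(beta))   # 4.025

c(lasso = lasso_penalty, ridge = ridge_penalty, elastic_net = enet_penalty)
```

Note that the glmnet package documents its mixing parameter the other way around, with `alpha` weighting the absolute-value term and a factor of 1/2 on the squared term, so the slide's mixing weight is not directly interchangeable with glmnet's `alpha`.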

Page 97

Risky procedure

# nested cv in bootstrap
# (bootstrap_samples and cross_validate_glmnet are placeholder helpers)
boot <- bootstrap_samples()
for (b in boot) {
  # choose the shrinkage parameter by cross-validation within each resample
  lambda <- cross_validate_glmnet(b)
}

i) train

ii) test

Bias toward !!!!!!!!!
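Why the train/test split goes wrong inside a bootstrap resample can be sketched in a few lines of base R. This is only an illustration under assumed sizes (n = 100, k = 10 folds, both hypothetical) and does not reproduce the thesis code. The second half shows one possible reading of "deduplicated cv": copies of the same original observation are kept in the same fold.

```r
# Sketch: how often does a test-fold row also sit in the training folds
# when k-fold cross-validation is run on a bootstrap resample?
set.seed(1)                                 # arbitrary seed, reproducibility only
n <- 100                                    # hypothetical number of observations
k <- 10                                     # hypothetical number of CV folds

boot_idx <- sample(n, n, replace = TRUE)    # original row ids in the resample
fold <- sample(rep(1:k, length.out = n))    # random fold label per resampled row

# Per fold: share of test rows whose original id also appears in training
leak <- sapply(1:k, function(f) {
  test_ids  <- boot_idx[fold == f]
  train_ids <- boot_idx[fold != f]
  mean(test_ids %in% train_ids)
})
mean_leak <- mean(leak)

# Share of distinct originals in the resample (about 1 - exp(-1) on average)
unique_share <- length(unique(boot_idx)) / n

# Assumed deduplicated variant: assign folds per unique id, so copies stay together
ids <- unique(boot_idx)
id_fold <- sample(rep(1:k, length.out = length(ids)))
fold_dedup <- id_fold[match(boot_idx, ids)]
leak_dedup <- sapply(1:k, function(f) {
  mean(boot_idx[fold_dedup == f] %in% boot_idx[fold_dedup != f])
})
```

With plain folds a sizeable share of test rows has an identical copy in the training data, so the test error is optimistic; with grouped folds the overlap is zero by construction.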

Page 98

Risky procedure

[Figure: Chosen lambda, p=100, k=1000. Density of the chosen shrinkage parameter lambda (x-axis, 0.00 to 0.20) for three procedures: cv, cv in bootstrap, and deduplicated cv in bootstrap.]

Page 99

Instead choose lambda by AIC

[Figure: AIC as a function of shrinkage parameter. x-axis: lambda (0.00 to 0.30); y-axis: AIC (−250 to 0). The points are overlaid with a scatterplot smoother, and the point of max curvature is marked.]
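A rough sketch of the "scatterplot smoother, then max curvature" idea follows. Everything here is assumed for illustration: the AIC values are made up rather than taken from the thesis, `smooth.spline` stands in for whichever smoother was actually used, and both axes are rescaled to [0, 1] first because curvature depends on the units of the axes.

```r
# Sketch: pick lambda at the point of maximum curvature of a smoothed AIC curve.
set.seed(2)                                        # arbitrary seed
lambda <- seq(0.005, 0.30, length.out = 60)        # hypothetical lambda grid
aic <- -250 * exp(-lambda / 0.05) + rnorm(60, sd = 3)  # made-up AIC values

# Rescale both axes to [0, 1] so the curvature is comparable across units
x <- (lambda - min(lambda)) / diff(range(lambda))
y <- (aic - min(aic)) / diff(range(aic))

fit <- smooth.spline(x, y)                         # scatterplot smoother
d1 <- predict(fit, x, deriv = 1)$y                 # first derivative of the fit
d2 <- predict(fit, x, deriv = 2)$y                 # second derivative of the fit
curvature <- abs(d2) / (1 + d1^2)^(3 / 2)          # curvature of a plane curve

lambda_hat <- lambda[which.max(curvature)]         # lambda at the sharpest bend
```

The chosen value is sensitive to how much the smoother is allowed to wiggle, so it is worth plotting the fitted curve before trusting `lambda_hat`.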

Page 100

ElasticNet, alpha = .5

[figure] + [figure] = [figure]

Figures from Hastie, Tibshirani, and Friedman: The Elements of Statistical Learning

\[ \sum_i \left[ \alpha \beta_i^2 + (1 - \alpha) |\beta_i| \right] \le t \]
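As a hedged aside, the budget t in the elastic net constraint has a penalized (Lagrangian) counterpart, which is presumably what "t (aka lambda)" refers to. Writing \ell(\beta) for the log-likelihood of the model (a symbol assumed here, not taken from the slides), the two forms are:

\[
\hat\beta(t) = \arg\min_{\beta} \; -\ell(\beta)
\quad \text{subject to} \quad
\sum_i \left[ \alpha \beta_i^2 + (1 - \alpha) |\beta_i| \right] \le t
\]

\[
\hat\beta(\lambda) = \arg\min_{\beta} \; -\ell(\beta)
+ \lambda \sum_i \left[ \alpha \beta_i^2 + (1 - \alpha) |\beta_i| \right]
\]

For a convex loss, every budget t corresponds to some \lambda \ge 0, with a smaller t matching a larger \lambda. Note that the glmnet package uses a different parameterization, in which alpha multiplies the absolute-value term.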

Page 101

ElasticNet, alpha = .5

[Figure: Bootstrapped estimates. Three histograms with Frequency on the y-axis: Brier score (x-axis 0.04 to 0.16), Concordance (0.65 to 0.95), and Stability (0.05 to 0.35).]

Page 102

ElasticNet, alpha = .5

[Figure: Bootstrapped estimates. Histograms of Brier score, Concordance, and Stability.]

Thesis excerpt (chapter "Metastasis prediction", appendix "Variable selection methods"):

              model �.�      model �.�
  t-test      .17 ± .45      .17 ± .33
  ANOVA-fs    .27 ± .13      .18 ± .10
  SAM         .34 ± .11      .20 ± .15
  ANOVA-s     .33 ± .22      .20 ± .25
  ANOVA-f     .31 ± .084     .21 ± .11
  LIMMA-t     .35 ± .14      .20 ± .17

  intercept   .19 ± .010
  stratum     .22 ± .029
  lasso       .27 ± .19
  ridge       .23 ± .30

Table �.�: Brier scores presented as point estimate plus/minus two standard errors. Measures error in forecast probability: lower is better. Model number refers to the equations in Section �.A.�. Model �.� includes stratum as a predictor. Below the break are the four baseline models.

[…] but it is noteworthy that the intercept-only model is among the best-calibrated. The uncertainty is large enough that it is difficult to say that any selection method is better than any other. It is clear that the interaction with detection method in model �.� improves calibration for all models. There is also lower uncertainty in the ANOVA-f/fs models.

AUC or concordance probability is a measure of a model’s ability to discriminate between outcomes: the higher the better. Brier score alone does not provide full information about predictive performance; the intercept-only model is well-calibrated but cannot be used for prediction at all. Random guess (or forecasting a constant for every observation) yields AUC of .5; perfect discrimination yields AUC of unity. Table �.� shows AUC as point estimate plus/minus two standard errors in decreasing order by model �.�. Again the clearest signal is that the added information from detection method is very important. Point estimates improve markedly and standard errors generally decrease. Also here does use of stratification and followup time in preselection reduce uncertainty.

The ridge regression baseline performance has a very good AUC point estimate, but the standard error is very large. Too large: it is a theorem that the upper bound on standard deviation in a variable ∈ [0, 1] is 1/2. This says something about the imperfection of the jackknife as an estimator of standard error. The blame lies at least in part with the correctional factor (n−1)/n in Equation �.�, which was originally defined heuristically. Since it is difficult to suggest a sensible alternative, we choose to live with this.�

�. This was really the result of nesting a cross-validation in the bootstrap: the methodology issue mentioned in the preamble to this chapter. For details see Section �.�.� and Section �.�.

              model �.�      model �.�
  LIMMA-t     .44 ± .30      .76 ± .20
  SAM         .46 ± .26      .75 ± .24
  ANOVA-fs    .51 ± .29      .75 ± .16
  ANOVA-s     .41 ± .57      .75 ± .38
  t-test      .65 ± 1.5      .74 ± .71
  ANOVA-f     .44 ± .25      .72 ± .21

  intercept   .5
  stratum     .49 ± .055
  lasso       .36 ± 1.4
  ridge       .81 ± 3.3

Table �.�: AUC presented as point estimate plus/minus two standard errors. Measures the probability of forecasting a higher probability of metastasis for a randomly chosen metastasis case than for a randomly chosen non-metastasis case: higher is better. Model number refers to the equations in Section �.A.�. Model �.� includes stratum as a predictor. Below the break are the four baseline models.

The collected results for model �.� suggest some reason for optimism. Due to the size of the standard errors we must necessarily be uncertain about even the first significant digit of our point estimates. But even accounting for uncertainty there seems to be predictive information better than random guess. As in the simulations, there is not too much difference between the different methods, perhaps apart from the simple t-test, for which we observe much variance. Note that both SAM and LIMMA are flexible frameworks and we could have accounted for stratum and followup in either. Our comparison is between using this information and various ways of not using it, and there is no reason to believe that either framework should perform poorly if we were to use more refined models there.

Table �.� shows the predictor set stability as point estimate plus/minus two standard errors. Stability is in general very low, and the standard errors suggest that there is even some uncertainty to the order of magnitude of the point estimates. A possible interpretation is that the correlation between genes is such that many different genes hold similar information. It is at least clear that we need much more data if we want to find a stable set of predictor genes. If we take the point estimates at face value, Table �.� reflects the fact that we see lower uncertainty using ANOVA-f/fs in Tables �.� and �.�.
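The quantities in the thesis excerpt on this slide (Brier score, AUC as a concordance probability, and a jackknife standard error with the (n−1)/n factor) can be sketched as below. The function names and the toy vectors are hypothetical, and this is a textbook-style illustration rather than the thesis implementation.

```r
# Brier score: mean squared error of the forecast probabilities (lower is better)
brier <- function(p, y) mean((p - y)^2)

# AUC as concordance: probability that a random case gets a higher forecast
# than a random non-case; ties count one half
auc <- function(p, y) {
  cases <- p[y == 1]
  controls <- p[y == 0]
  cmp <- outer(cases, controls, "-")     # all case-minus-control differences
  mean((cmp > 0) + 0.5 * (cmp == 0))
}

# Jackknife standard error of a statistic, using the (n - 1) / n factor
jackknife_se <- function(x, stat) {
  n <- length(x)
  loo <- sapply(seq_len(n), function(i) stat(x[-i]))  # leave-one-out estimates
  sqrt((n - 1) / n * sum((loo - mean(loo))^2))
}

y <- c(1, 0, 1, 0)                       # toy outcomes (1 = metastasis)
p <- c(0.9, 0.2, 0.6, 0.7)               # toy forecast probabilities

brier(p, y)                              # 0.175
auc(p, y)                                # 0.75
auc(rep(0.3, 4), y)                      # 0.5, a constant forecast cannot rank

x <- c(2.1, 3.4, 1.9, 5.0, 4.2)          # toy sample
jackknife_se(x, mean)                    # equals sd(x) / sqrt(length(x))
```

For the sample mean the jackknife reproduces the usual standard error exactly; for statistics bounded in [0, 1], such as AUC, nothing in the formula prevents an estimate above 1/2, which is the imperfection the excerpt points out.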

Page 103

ElasticNet, alpha = .5

Use stratum information

Page 104

ElasticNet, alpha = .5

Does not

Page 106

ElasticNet, alpha = .5

[Figure: Calibration curve for predictions. x-axis: Predicted metastasis probability (0.0 to 1.0); y-axis: Proportion of metastases (0.0 to 1.0). Legend: expected, middle 80%. Annotated regions: Overestimation and Underestimation.]
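A calibration curve of this kind can be approximated by binning the predictions and comparing the mean forecast with the observed proportion in each bin. The sketch below uses simulated, well-calibrated forecasts (all values hypothetical) and ten equal-width bins, which is an assumption and not necessarily how the plot on the slide was made.

```r
# Sketch: binned calibration table for predicted probabilities.
set.seed(3)                                  # arbitrary seed
n <- 500                                     # hypothetical number of forecasts
p_hat <- runif(n)                            # hypothetical predicted probabilities
y <- rbinom(n, 1, p_hat)                     # outcomes simulated to be calibrated

bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

calib <- data.frame(
  predicted = as.numeric(tapply(p_hat, bins, mean)),  # mean forecast per bin
  observed  = as.numeric(tapply(y, bins, mean))       # event proportion per bin
)

# Negative gap: forecasts too high (overestimation); positive: too low
calib$gap <- calib$observed - calib$predicted
```

Plotting `observed` against `predicted` with the diagonal as reference gives the curve; points below the diagonal are overestimates and points above it are underestimates.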

Page 108

108 genes selected

Low selection frequencies: unstable signatures

Thesis excerpt (chapter "Metastasis prediction", section "Results"):

[…] likely. This is a natural consequence of doing variable selection: “redundant” information may shrink out of the model.

Table �.�: Resampling selection probability for the 108 elasticnet-selected genes.

  GRK5 (a)            0.853   C1orf115           0.290   ANO8                0.221   FBLN5      0.157
  GPATCH4             0.682   LOC������          0.287   PTTG1IP             0.219   BLMH       0.156
  GNGT2               0.474   RNF214             0.280   3NDg8gVCd... (b)    0.218   FCRL3      0.149
  PDGFD (c)           0.467   SULT1A�            0.278   USF1                0.216   TDRD9      0.143
  FAM24B              0.457   ZNF365             0.271   BCCIP               0.210   ACY1       0.142
  PTPRN2              0.442   USE1               0.267   MGC29506            0.209   ZFP57      0.142
  CBLB                0.440   DNMT3A             0.267   GRK5 (a)            0.207   SLIC1      0.138
  PDCL                0.410   LOC������          0.266   WTIP                0.205   PICK1      0.135
  RASA2               0.380   CNTNAP2            0.265   BCL10               0.204   RTN4IP1    0.134
  C11orf48            0.376   IL2RA              0.265   DLGAP2              0.200   CDCA7L     0.132
  TCEB1               0.374   CCT5               0.264   HRAS                0.199   BEX4       0.131
  CAPN3               0.354   R3HDM1             0.263   RAD1                0.189   FCAR       0.130
  STK19               0.351   MRPL43             0.260   PRKCE               0.187   ANKRD��    0.111
  GUCY1A3             0.348   SLC38A1            0.256   UBAP2L              0.186   USP39      0.109
  ZDHHC11             0.345   GNG8               0.255   BPI                 0.186   KIAA0495   0.106
  SULT1A�             0.336   PLA2G4C            0.251   DTX1                0.184   BRI3BP     0.106
  Z6FIQGkeo... (d)    0.335   TCF4               0.248   LASS5               0.182   TUBA4A     0.105
  FAM89A              0.328   uX15cu4f_... (e)   0.247   GSTT1               0.182   IDH1       0.102
  rh13dQX04... (f)    0.324   C20orf107          0.245   SPATA20             0.182   DDX52      0.100
  LANCL2              0.323   VCL                0.242   IGLL1               0.172   ANKRD��    0.094
  SERPINE2            0.318   EZH2               0.242   SPG3A               0.172   TFG        0.087
  ADIPOR2             0.314   PRPSAP2            0.237   PPAP2A              0.172   LILRA6     0.080
  GPR177              0.312   ISY1               0.235   NOTCH2NL            0.172   C6orf47    0.078
  PDGFD (c)           0.299   UGDH               0.234   TAF6                0.168   WDR60      0.075
  LOC������           0.294   ABCF2              0.230   CCDC90B             0.166   AHCYL2     0.068
  WEE1                0.293   C16orf5            0.229   LOC������           0.158   HAUS4      0.068
  ITM2C               0.291   VAV3               0.225   CDH2                0.157   MAD2L2     0.053

a. Two probes map to the same gene GRK5. Combined selection probability is 1.06, implying that both get selected together at least some of the time.
b. Illumina probe id 3NDg8gVCdQkNdcg.Ko, missing annotation.
c. Two probes map to the same gene PDGFD. Combined selection probability is 0.766.
d. Illumina probe id Z6FIQGkeoCSiVAoKeg, missing annotation.
e. Illumina probe id uX15cu4f_VUIuXoST�, missing annotation.
f. Illumina probe id rh13dQX04hUS�uOpRQ, missing annotation.

Figure �.� shows the (log fold change) expression levels in each of the 108 selected genes for the metastasized and non-metastasized observations. The shaded area shows the middle .8 of the bootstrap distribution for difference in medians between the two groups; the white notch shows the expectation of this distribution, by which the genes are ordered. The black snake-shaped line marks the two group medians. The non-metastasized median is usually around zero, so the difference in medians is mostly dominated by the median fold change of the metastasized observations. In other words, for these genes the average case–control pair is similar in the non-metastasized group, while the average pair is dissimilar in the metastasized group.
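Resampling selection probabilities like those tabulated on this slide are just column means of a selection indicator matrix. A minimal sketch, with a simulated B-by-p matrix standing in for "was gene j selected in resample b" (the gene names, B, and the underlying probabilities are all hypothetical):

```r
# Sketch: selection probability of each gene across B resamples.
set.seed(4)                                   # arbitrary seed
B <- 200                                      # hypothetical number of resamples
true_prob <- c(0.8, 0.5, 0.3, 0.2, 0.1, 0.05) # made-up selection propensities
p <- length(true_prob)

# sel[b, j] is TRUE when gene j ends up in the model fitted to resample b;
# here it is simulated, in practice it would come from the fitted models
sel <- matrix(runif(B * p) < rep(true_prob, each = B),
              nrow = B,
              dimnames = list(NULL, paste0("gene", 1:p)))

sel_prob <- colMeans(sel)                     # share of resamples selecting each gene
sort(sel_prob, decreasing = TRUE)
```

A gene that truly carries signal but is correlated with many others can still show a low value here, which is one way to read the low frequencies in the table.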

Page 109

Please turn to page 22 in the required reading

[Figure: Expression levels, group medians, and difference in group medians for selected genes. Two panels with one row per gene; x-axis from −1.0 to 1.5. Legend: observations (metastasized, non-metastasized), group medians, and bootstrapped difference in medians with the middle 80% of the bootstrap distribution.]

Gene labels, first panel: PRKCE, BRI3BP, LOC647460, PDCL, KIAA0495, DDX52, BCCIP, ACY1, RASA2, R3HDM1, FAM24B, WTIP, C16orf5, SULT1A3, LOC649210, DTX1, USP39, GRK5, ZFP57, TCEB1, GRK5, PDGFD, PDGFD, BCL10, USE1, BEX4, CDH2, SERPINE2, EZH2, GNG8, 3NDg8gVCd..., PRPSAP2, rh13dQX04..., PTPRN2, C11orf48, ISY1, NOTCH2NL, UBAP2L, ADIPOR2, CNTNAP2, PLA2G4C, GNGT2, VCL, TUBA4A, FCRL3, GSTT1, GUCY1A3, ITM2C, TCF4, SLC38A1, MGC29506, CBLB, IGLL1, VAV3

Gene labels, second panel: LOC654055, SPATA20, BPI, SULT1A1, GPR177, TFG, USF1, ANKRD57, HAUS4, LASS5, PTTG1IP, LOC731486, STK19, FCAR, PICK1, GPATCH4, CAPN3, RNF214, WDR60, ZNF365, C1orf115, FBLN5, C6orf47, CCT5, BLMH, MRPL43, Z6FIQGkeo..., ANKRD35, uX15cu4f_..., C20orf107, HRAS, MAD2L2, ABCF2, IL2RA, UGDH, TDRD9, LILRA6, LANCL2, ZDHHC11, FAM89A, RTN4IP1, RAD1, CDCA7L, AHCYL2, TAF6, PPAP2A, SPG3A, DLGAP2, WEE1, SLIC1, DNMT3A, CCDC90B, ANO8, IDH1

Page 110

Some genes tend to be selected together

[Figure: Co-selection heatmap; color scale from 0 to 0.6.]

Thesis excerpt (chapter "Metastasis prediction"):

[…] data even under resampling.

Table �.�: Genes that tend to be selected together, ordered alphabetically.

  ADIPOR2    FAM89A    LANCL2      PTPRN2        SULT1A3
  C11orf48   GNG8      LOC������   R3HDM1        TCEB1
  C1orf115   GNGT2     LOC������   RASA2         TCF4
  CAPN3      GPATCH4   PDCL        rh13dQX0...   WEE1
  CBLB       GRK5      PDGFD       SERPINE2      Z6FIQGkeo...
  DNMT3A     GUCY1A3   PDGFD       STK19         ZDHHC11
  FAM24B     ITM2C     PRPSAP2     SULT1A1       ZNF365
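Co-selection can be read off the same kind of indicator matrix: entry (i, j) is the share of resamples in which genes i and j were both selected. The sketch below uses a tiny hand-made matrix with hypothetical gene names; how the heatmap on this slide was actually computed is not stated.

```r
# Sketch: co-selection frequencies from a resample-by-gene selection matrix.
sel <- rbind(c(TRUE,  TRUE,  FALSE),
             c(TRUE,  TRUE,  FALSE),
             c(FALSE, FALSE, TRUE),
             c(TRUE,  FALSE, TRUE))
colnames(sel) <- c("geneA", "geneB", "geneC")   # hypothetical gene names

sel_num <- sel * 1                   # logical to 0/1, keeps the column names
co_sel <- crossprod(sel_num) / nrow(sel_num)

co_sel["geneA", "geneB"]             # 0.5: selected together in 2 of 4 resamples
co_sel["geneB", "geneC"]             # 0: never selected together
diag(co_sel)                         # marginal selection probabilities
```

The diagonal recovers each gene's own selection probability, so a heatmap of `co_sel` shows both pieces of information at once.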

�.� Conclusion

We have demonstrated predictability of metastasis in these data. We can, with a high probability, rank case–control pairs in terms of predicted metastasis probability. However we should not count the model itself as a reliable tool due to poor calibration and stability, and since these results stem from exploratory modeling we should be moderate in our expectations; further investigation is needed to establish reliable results.

We provide 108 candidate predictor genes as an avenue for future research. We are currently investigating their biological properties. An interesting statistical investigation may be to review the importance of the stratification and how to build this into a shrinkage model, as the results in the appendix below indicate that this may lead to improvements. We believe however that it is necessary to obtain independent data to be able to make any inference stronger than general indication.

�.A Appendix: variable selection methods

In addition to the main results presented above we previously explored various ad-hoc variable selection schemes. The results of these explorations are not competitive compared with the above penalized likelihood model, but I present them here for completeness and comparison. To make the next sections complete we must define the followup time of a case. This is the number of days between provision of the blood sample and the eventual diagnosis of cancer. Although followup introduces a time aspect, these are not time series data in the strictest technical sense. Each observation stems from a different woman, so there should be no autocorrelation to speak of, and followup time is random.

Page 111

At this point maybe call a biologist

https://commons.wikimedia.org/wiki/File:Biologist_Victoria_Achkasova_20150529.jpg

Page 112

Thesis: Small data require particular care

• 1000s of measurements, maybe 100 observations

• Validation matters more than you think

• Model search is difficult

• I suggest making more assumptions

Page 117

Thesis: Small data require particular care

[Diagram with three labels: Fast, Good, Cheap]

Page 119

Thesis: Small data require particular care

[Diagram with three labels: Agnostic modeling, Careful validation, Small data]

NO FREE LUNCHES

Page 120

Mosteller & Tukey’s green book

Naturally, we all desire an adequate assessment of both the indications and their uncertainties, but we shouldn’t refuse good cake only because we can’t have frosting too.

Page 121

Closing curtain: Thank you.

