
Innovative Paths to Better Medicines

Design Considerations in Molecular Biomarker Discovery Studies

Doris Damian and Robert McBurney

June 6, 2007


Confidential Information – Do Not Reproduce or Distribute

Outline of Presentation

• Introduction:

– Mass Spectrometry Data

– Studies objectives and questions

• Statistical Processing of MS Data

– Sample normalization

– Removal of peak-specific batch and other temporal trends

– Filtering of noisy peaks

• Design Considerations

– Power calculations for univariate biomarkers

– Power calculations for multivariate biomarkers (regression)



• Measurements: chemical compounds of different classes (proteins, lipids, polar and non-polar metabolites, amino acids, etc.)

• The variables constituting the data sets are peak intensities (peaks) identified by m/z and retention time. The peak intensities are proportional to the amount of analyte detected by the mass spectrometer. Note that p >> n: the number of peaks (p) far exceeds the number of samples (n)!

[Figure: peak intensity (5e+06 to 7e+06) versus sample (0 to 130), with biological samples and QC samples marked. Panels: MS of Individual Peaks; Total Ion Chromatogram; Selected Ion Chromatogram. Figure modified from: http://www.asms.org/whatisms/p13.html]

Mass Spectrometry Data



[Diagram of study stages: Objectives, Questions, Design, Experiment, Statistical Processing, Data Analysis]

Structure of a Molecular Biomarker Discovery Study



Objectives
• Diagnosis
• Elucidation of Mechanisms of Action (MoA)

Questions
• Diagnosis: What is a minimal set of biomarkers?
• MoA: What are all the biomarkers? What are the molecular pathways?

Biomarker: A characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic response(s) to a therapeutic intervention.

Studies Objectives and Questions
















• Sample normalization
– correction of baseline differences between samples
• Removal of peak-specific batch and other temporal trends
– due to instrument and processing limitations, samples are acquired sequentially in batches, so peaks exhibit batch-to-batch variation;
– instrument performance may become unstable over time, and samples may undergo degradation.
These are the main causes of the temporal variation observed in peak intensities.
• Filtering of noisy peaks
– for each biological sample, replicate measurements are obtained;
– the estimated correlation between these replicates is used as a filter for noisy data.

Statistical Processing

Presented at IBC’s Biomarkers and Molecular Diagnostic conferences September 2006



• Correction of baseline differences between samples.
• Based on Internal Standards (IS).
• Internal Standards are known exogenous compounds, added to the biological samples in fixed amounts (the same for all samples) at the beginning of the sample preparation stage.
• Used to account for sample variability (e.g., pipetting errors) during sample preparation and acquisition.

Sample Normalization



[Figure: Before Normalization: Sample Profiles of 6 Internal Standard Peaks. x-axis: IS Peak (1 to 6); y-axis: log(intensity), 14.0 to 17.5]

Typical Sample Profiles of IS Peaks – before Normalization



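• Normalization: the statistical procedure of multivariate scaling of samples based on (a subset of) IS peaks.
• $Y_{ij}$ = log(intensity) of IS peak $i$ in sample $j$; $i = 1,\dots,I$ indexes IS peaks; $j = 1,\dots,J$ indexes samples.
• ANOVA model: $Y_{ij} = \alpha_i + \beta_j + \varepsilon_{ij}$, with peak effect $\alpha_i$, sample effect $\beta_j$, and error $\varepsilon_{ij}$.
• The sample-specific factors, $\beta_j$, are estimated in this ANOVA model and removed from all peaks.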

Sample Normalization
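The normalization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a balanced table of IS log intensities with no missing values and a sum-to-zero constraint on the sample factors, in which case the least-squares estimate of each sample factor is the sample's mean over the IS peaks minus the grand mean. The function names and toy numbers are hypothetical.

```python
from statistics import mean


def estimate_sample_factors(is_log_intensity):
    """Estimate one additive factor per sample from IS peaks.

    is_log_intensity[i][j] is the log intensity of IS peak i in sample j.
    Assumes a balanced table (every IS peak measured in every sample).
    """
    n_samples = len(is_log_intensity[0])
    grand_mean = mean(v for row in is_log_intensity for v in row)
    # Two-way additive model with sum-to-zero sample effects:
    # factor_j = (mean over IS peaks in sample j) - (grand mean).
    return [
        mean(row[j] for row in is_log_intensity) - grand_mean
        for j in range(n_samples)
    ]


def normalize(log_intensity, sample_factors):
    """Subtract the sample factors from every peak (rows = peaks)."""
    return [
        [y - b for y, b in zip(row, sample_factors)]
        for row in log_intensity
    ]


# Toy data (hypothetical): 3 IS peaks, 4 samples with known shifts.
peak_levels = [15.0, 16.0, 17.0]
sample_shift = [0.0, 0.5, -0.3, -0.2]
is_peaks = [[a + b for b in sample_shift] for a in peak_levels]

factors = estimate_sample_factors(is_peaks)
normalized = normalize(is_peaks, factors)
```

In this toy case the estimated factors recover the shifts that were put in, and each IS peak becomes flat across samples. In a real study the same factors would be subtracted from all measured peaks, not only the IS peaks.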



[Figure: After Normalization: Sample Profiles of 6 Internal Standard Peaks. x-axis: IS Peak (1 to 6); y-axis: log(intensity), 14.0 to 17.5]

Through normalization, temporal trends common to all peaks are removed.

Typical Sample Profiles of IS Peaks – after Normalization



[Figure: Before Normalization: Temporal Profiles of 6 Internal Standard Peaks. x-axis: sample order (0 to 450); y-axis: log(intensity), 14.0 to 17.5]

Typical Temporal Profiles of IS Peaks – before Normalization



[Figure: After Normalization: Temporal Profiles of 6 Internal Standard Peaks. x-axis: sample order (0 to 450); y-axis: log(intensity), 14.0 to 17.5]

Typical Temporal Profiles of IS Peaks – after Normalization















[Figure, two panels: Before Normalization: Temporal Profile of Peak 41; After Normalization: Temporal Profile of Peak 41. x-axis: sample order (0 to 130); y-axis: log(intensity), 14.8 to 15.6. QC: black; biological samples: red]


Peak-Specific Temporal Trends – after Normalization



• The within-batch and between-batch patterns cause visible batch separations in the PCA plot of the normalized data.
• If one does not account for these intrinsic experimental trends, important biological effects may be obscured.

The Need for Batch Corrections

[Figure: PCA: Iris Plasma GC/MS Data Set After Normalization (colored by batch, numbered sequentially). Axes: t[1], first principal component; t[2], second principal component. Ellipse: Hotelling T2 (0.95)]

PCA Plot: Data Set after Normalization, Colored by Batch



• Based on QC samples (ideally)
– QC samples: a pool of material from the biological samples in a study, aliquoted into a set of identical samples that are acquired at specific intervals in each batch of samples.

[Figure, two panels: Before Normalization: Temporal Profile of Peak 41; After Normalization: Temporal Profile of Peak 41. x-axis: sample order (0 to 130); y-axis: log(intensity), 14.8 to 15.6]


Removal of Peak-Specific Temporal Trends



[Figure, two panels: After Normalization, Before Batch Correction: Temporal Profile of Peak 41; After Normalization, After Batch Correction: Temporal Profile of Peak 41. x-axis: sample order (0 to 130); y-axis: log(intensity), 14.8 to 15.6. QC: black; biological samples: red]


Temporal trend within batch $b$ ($b = 1,\dots,B$ batches): $f_b(t) = \gamma_{0,b} + \gamma_{1,b}\,t + \gamma_{2,b}\,t^2$, estimated based on QC samples within batch $b$.

Removal of Peak-Specific Temporal Trends
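A per-batch trend removal for one peak might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: the trend is taken to be a quadratic in within-batch run order, fitted by ordinary least squares to the QC samples of each batch, and the fitted trend is simply subtracted from every sample in that batch (so QC samples end up centered at zero; re-adding an overall QC level is a separate design choice). All names and the toy drift curves are hypothetical.

```python
def fit_quadratic(ts, ys):
    """Least-squares fit of y = c0 + c1*t + c2*t**2; returns (c0, c1, c2)."""
    # Normal equations A c = r with A[k][m] = sum(t**(k+m)), r[k] = sum(y * t**k).
    a = [[sum(t ** (k + m) for t in ts) for m in range(3)] for k in range(3)]
    r = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        r[col], r[pivot] = r[pivot], r[col]
        for row in range(col + 1, 3):
            f = a[row][col] / a[col][col]
            a[row] = [x - f * p for x, p in zip(a[row], a[col])]
            r[row] -= f * r[col]
    # Back substitution.
    c = [0.0, 0.0, 0.0]
    for k in (2, 1, 0):
        c[k] = (r[k] - sum(a[k][m] * c[m] for m in range(k + 1, 3))) / a[k][k]
    return tuple(c)


def remove_batch_trends(run_order, y, batch, is_qc):
    """Subtract, per batch, the quadratic trend fitted on that batch's QC samples.

    Needs at least 3 QC samples at distinct run positions in every batch.
    """
    corrected = list(y)
    for b in sorted(set(batch)):
        idx = [k for k in range(len(y)) if batch[k] == b]
        start = min(run_order[k] for k in idx)  # time measured within the batch
        qc = [k for k in idx if is_qc[k]]
        c0, c1, c2 = fit_quadratic([run_order[k] - start for k in qc],
                                   [y[k] for k in qc])
        for k in idx:
            s = run_order[k] - start
            corrected[k] = y[k] - (c0 + c1 * s + c2 * s ** 2)
    return corrected


# Toy data (hypothetical): 2 batches of 7 runs, QC at positions 0, 3, 6.
run_order = list(range(14))
batch = [1] * 7 + [2] * 7
is_qc = [k % 3 == 0 for k in range(7)] * 2


def drift(b, s):
    """Made-up instrument drift for batch b at within-batch position s."""
    if b == 1:
        return 15.0 + 0.05 * s - 0.004 * s ** 2
    return 15.4 - 0.03 * s + 0.002 * s ** 2


# Biological samples sit 0.3 log units above the QC pool in this toy example.
log_intensity = [drift(b, k % 7) + (0.0 if qc else 0.3)
                 for k, (b, qc) in enumerate(zip(batch, is_qc))]
corrected = remove_batch_trends(run_order, log_intensity, batch, is_qc)
```

After correction the two batches share a common level, and the 0.3 offset of the biological samples relative to the QC pool is preserved.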















• When the same sample is measured several times, we require the measurements to correlate well.
• The correlation between replicates $Y_1$ and $Y_2$ can be expressed as a tradeoff between the biological variance ($\sigma^2_{Bio}$) and the measurement error variance ($\sigma^2$):
$\rho = \mathrm{Corr}(Y_1, Y_2) = \dfrac{\sigma^2_{Bio}}{\sigma^2_{Bio} + \sigma^2}$
• Ideal case: no measurement error ($\sigma^2 = 0$), so $\rho = 1$.
• $\rho \ge 0.5$ exactly when $\sigma^2_{Bio} \ge \sigma^2$.
• The estimated correlation, $\hat{\rho}$, can be used to filter noisy peaks.

Correlations between Biological Replicates
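A replicate-correlation filter could be sketched as below. This is only an illustration: it assumes exactly two replicate measurements per biological sample, uses the Pearson correlation as the estimate, and treats the cutoff (`min_corr`, set to 0.5 here) as a hypothetical tuning parameter rather than a value prescribed by the presentation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def filter_noisy_peaks(replicate1, replicate2, min_corr=0.5):
    """Keep peaks whose two replicates correlate at least min_corr.

    replicate1[peak] and replicate2[peak] hold the log intensities of the
    same biological samples, in the same order, for the two replicates.
    """
    correlations = {
        peak: pearson(replicate1[peak], replicate2[peak])
        for peak in replicate1
    }
    kept = [peak for peak, r in correlations.items() if r >= min_corr]
    return kept, correlations


# Toy data (hypothetical): one reproducible peak and one noisy peak.
replicate1 = {
    "clean": [11.0, 11.4, 11.8, 12.2, 12.6],
    "noisy": [11.0, 11.4, 11.8, 12.2, 12.6],
}
replicate2 = {
    "clean": [11.05, 11.38, 11.83, 12.18, 12.62],
    "noisy": [11.9, 11.2, 12.4, 11.3, 11.8],
}
kept, correlations = filter_noisy_peaks(replicate1, replicate2)
```

Here the "clean" peak has a replicate correlation close to 1 and is kept, while the "noisy" peak has a correlation near zero and is filtered out.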



[Figure, two panels: replicate 1 versus replicate 2 (both axes 10.4 to 12.6). Peak 25: Estimated Correlation = 0.37. Peak 101: Estimated Correlation = 0.98]

[Figure: Distribution of Correlations between Replicates. Histogram of frequency (0 to 80) versus correlation (0.0 to 1.0)]

Examples of Correlations (two extremes)
















• The power in biomarker discovery studies is a function of:

– The sample size

– The separation between the groups (e.g., MFC)

– The proportion of biomarkers in the data set

– The false discovery rate (FDR) allowed

– The platform variability

– The within-group variability

– Other factors (e.g., other covariates in the model)?

Power Calculations

• Statistical power = probability of detecting biomarkers
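How these factors interact can be explored by simulation. The sketch below is one possible setup, not the method behind the presentation's power curves: it assumes log2-scale intensities with a common known standard deviation, a two-sided normal (z) test per peak, and the Benjamini-Hochberg procedure for FDR control. The defaults (`n_peaks`, `sd`, `n_sim`, `seed`) and the function names are hypothetical.

```python
import random
from math import log2, sqrt
from statistics import NormalDist, mean


def benjamini_hochberg(pvalues, fdr):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    n_reject = 0
    for rank, k in enumerate(order, start=1):
        if pvalues[k] <= fdr * rank / m:
            n_reject = rank  # largest rank that passes its threshold
    return set(order[:n_reject])


def simulate_fdr_power(n_per_group, mfc, prop_biomarkers, fdr,
                       n_peaks=200, sd=1.0, n_sim=50, seed=0):
    """Average fraction of true biomarkers discovered at the given FDR."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    n_true = round(prop_biomarkers * n_peaks)  # peaks 0..n_true-1 are biomarkers
    shift = log2(mfc)                          # group separation on the log2 scale
    se = sd * sqrt(2.0 / n_per_group)          # SE of a difference of two group means
    discovered = []
    for _ in range(n_sim):
        pvalues = []
        for k in range(n_peaks):
            delta = shift if k < n_true else 0.0
            # Shortcut: draw the difference of group means directly.
            z = rng.gauss(delta, se) / se
            pvalues.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
        rejected = benjamini_hochberg(pvalues, fdr)
        discovered.append(sum(1 for k in rejected if k < n_true) / n_true)
    return mean(discovered)


# Same effect size and FDR, two sample sizes per group (illustrative values).
power_small = simulate_fdr_power(n_per_group=6, mfc=2.0,
                                 prop_biomarkers=0.1, fdr=0.1)
power_large = simulate_fdr_power(n_per_group=30, mfc=2.0,
                                 prop_biomarkers=0.1, fdr=0.1)
```

Varying `mfc`, `fdr`, `prop_biomarkers`, and `sd` in turn shows how each factor in the list above moves the power. The numbers depend on the assumed test and variance, so they should not be read as reproducing the curves in the presentation.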






[Figure: schematic densities for healthy and diseased groups, and expected value y versus time (days, 0 to 7) for healthy and diseased]

[Figure: power (0.0 to 1.0) versus sample size per group (6 to 30). Curves for MFC = 1.7, MFC = 2.0, MFC = 3.0; solid: FDR 0.1; dashed: FDR 0.2. Proportion of Biomarkers = 90%]

Illustration I: Power Curves






[Figure: power and estimated FDR (0.0 to 1.0) versus sample size per group (6 to 30). Curves for MFC = 1.7, MFC = 2.0, MFC = 3.0; dotted: estimated FDR]

There is no loss in power (proportion of biomarkers discovered), BUT the FDR may be undesirable.

Power Curves Not Accounting for the FDR



Power Calculation for Multivariate Biomarkers (Regression)

Classical Setting
• n > p
• Linear regression model
• Parametric (F) test of model significance
• Computationally inexpensive

Biomarker Discovery Setting
• n << p
• Regression with constraints on parameters (elastic net)
• Dimensionality reduction needed (through cross-validation)
• Non-parametric (label permutations) test of model significance
• Computationally very expensive



Illustration: Power for Regression

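• Model: $Y = \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$
• Multivariate biomarker: $f(X) = X^T \beta$
• Parameter of interest: $\rho = \mathrm{Corr}(Y, f(X)) = \dfrac{1}{\sqrt{1 + \sigma^2 / \sum_{i=1}^{p} \beta_i^2}}$ (for uncorrelated predictors with unit variance)
• Test: $\rho = 0$
• Power = proportion of times that this hypothesis is rejected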
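The idea of estimating power as a rejection rate under a label-permutation test can be sketched as follows. This is a deliberately simplified illustration: it treats the multivariate biomarker score f(X) as known in advance and standard normal, so no elastic net fitting or cross-validation is performed. In the discovery setting the whole model fit would have to be repeated inside every permutation, which is where the computational cost comes from. Function names and defaults (`n_sim`, `n_perm`, `alpha`, `seed`) are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_pvalue(y, score, n_perm, rng):
    """P-value for Corr(y, score) = 0 obtained by permuting the labels y."""
    observed = abs(pearson(y, score))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(shuffled, score)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def simulate_regression_power(n_samples, rho, alpha=0.05,
                              n_sim=100, n_perm=199, seed=0):
    """Proportion of simulated data sets in which rho = 0 is rejected."""
    rng = random.Random(seed)
    # With Var(f(X)) = 1, Corr(Y, f(X)) = rho needs noise variance 1/rho^2 - 1.
    noise_sd = sqrt(1.0 / rho ** 2 - 1.0) if rho > 0 else 1.0
    signal = 1.0 if rho > 0 else 0.0
    rejections = 0
    for _ in range(n_sim):
        score = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # known f(X)
        y = [signal * s + rng.gauss(0.0, noise_sd) for s in score]
        if permutation_pvalue(y, score, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sim


power_null = simulate_regression_power(n_samples=30, rho=0.0)
power_alt = simulate_regression_power(n_samples=30, rho=0.75)
```

Because the score is known here, the power for a given rho is much higher than it would be when the biomarker has to be discovered among many other analytes, so these values are not comparable with results from a full high-dimensional analysis.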



Power Calculation – Regression

[Figure: power (0.0 to 1.0) versus number of samples (15 to 60). Curves for rho = 0.58, rho = 0.75, rho = 0.92]

rho     Number of Samples   Power
0.92    30                  0.50
0.92    38                  0.79
0.92    45                  0.96
0.75    30                  0.31
0.75    38                  0.46
0.75    45                  0.50
0.75    60                  0.70
0.00    30                  0.02

Biomarker with 10 Components (known in advance)

…10 minutes to calculate

Biomarker with 10 Components (buried among 90 other analytes)

…days to calculate



Thank you!
