
Post on 27-Mar-2015


Challenges In Progressing Biomarkers To Clinical Use

Proteomic Experiences

Chris Harbron

Technical Lead For High Dimensional Data

AstraZeneca

FDA Industry Statistics Workshop

September 2006

2 Gap Between Published Biomarkers And Biomarkers Being Approved For Use

3 Why Might This Be? Challenges

• Pressures from the contextual environment
• High quality data is essential
  – These are new technologies - not simple to use or analyse
  – Robust study design including:
  – Consistent sample collection and processing
  – Need to understand reproducibility between & within labs & within subjects
• Failure leads to poor data quality, frequently dominated by nuisance factors
• Rigorous validation is also essential
  – Occurs at many levels
  – Avoid overfitting data
• Omics may not do it alone
  – Applications will require combining -omics with other data types

4

Example : Case-Control Study

• Interest in identifying a peptidomic profile that could predict an adverse event
  – Potential use as a personalised medicine predictive marker
• Blood samples taken from subjects at start of treatment
• Subjects monitored for adverse event using a rigorous definition
• Subjects entered in cohorts
• Samples processed in batches within cohorts
• Analysed on an LC-MS/MS platform

5 LC-MS/MS Proteomics

[Figure: workflow] Clinical Plasma Samples → Preparation & Digestion → Peptides → Liquid Chromatography (Separation By Retention Time) → Mass Spectrometry (Separation By Mass/Charge, Measurement Of Intensity) → MS/MS → Protein Identification

[Figure panels: Ion intensity by Retention Time and Mass / Charge Ratio; MS/MS Fragment Ion intensity spectrum, Relative Abundance against m/z]

6 Distribution Of Average Intensities

[Figure: average intensity by Retention Time and Mass-Charge Ratio, shaded from Low Intensity to High Intensity]

~5,500,000 RT / MZ / Intensity Measurements Per Sample
~25,000 Common Peaks Per Sample

Pre-Processing
- Alignment Of Retention Times
- Scaling
- Binning
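The slides name the pre-processing steps but not how they were carried out. As a rough illustration only, the sketch below bins (retention time, m/z, intensity) triples onto a regular grid and scales each sample by its total intensity. The bin widths, the total-intensity scaling rule and the function names are assumptions made for this example, and retention-time alignment is left out entirely.

```python
from collections import defaultdict

def bin_measurements(measurements, rt_width, mz_width):
    """Sum the intensities of (rt, mz, intensity) triples that fall in the same grid cell."""
    grid = defaultdict(float)
    for rt, mz, intensity in measurements:
        cell = (int(rt // rt_width), int(mz // mz_width))
        grid[cell] += intensity
    return dict(grid)

def scale_to_total(grid):
    """Divide each binned intensity by the sample total (an assumed scaling rule)."""
    total = sum(grid.values())
    if total == 0:
        return dict(grid)
    return {cell: value / total for cell, value in grid.items()}

# Hypothetical measurements for one sample.
sample = [
    (12.1, 500.2, 40.0),
    (12.4, 500.7, 60.0),
    (30.0, 801.5, 100.0),
]
binned = bin_measurements(sample, rt_width=1.0, mz_width=1.0)
scaled = scale_to_total(binned)
```

The first two measurements share a cell, so the three raw measurements reduce to two binned values. This is the kind of reduction that takes millions of raw measurements down to a smaller set of common peaks.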

7 Proteomic Data: Exploratory Analysis - PCA

Considerable batch to batch variation

[Figure: PCA plot; legend: Cohort 1, Cohort 2, Cohort 3, Cohort 4; Control, Case, Non-Index Case]

8 Proteomic Data: Exploratory Analysis - PCA

Within all batches with both cases and controls, there is separation of cases and controls

9 Univariate Analyses Within Batches

Histograms Of t-Test p-Values

10 Global Test Of Agreement Between Batches Using A Permutation Test

[Figure panels: Observed, Permuted]

• Identify peaks where direction of effect agrees in all 3 batches
• Summarise by maximum p-value
• Global test of expected level due to multiple testing by permutation
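One way a global agreement test of this kind could be set up is sketched below. It is an illustration under stated assumptions, not the analysis behind the slides. Each peak is summarised by its weakest Welch |t| across batches, used here as a stand-in for the maximum p-value. Case/control labels are shuffled within each batch to obtain the count expected by chance. The data are simulated, and the threshold, function names and data layout are hypothetical.

```python
import random
from statistics import mean, variance

def welch_t(cases, controls):
    """Welch t-statistic, positive when cases are higher than controls."""
    se2 = variance(cases) / len(cases) + variance(controls) / len(controls)
    return 0.0 if se2 == 0 else (mean(cases) - mean(controls)) / se2 ** 0.5

def column(batch, peak, label):
    """Values of one peak for the samples in a batch carrying the given label."""
    return [row[peak] for row, lab in zip(batch["values"], batch["labels"]) if lab == label]

def count_consistent_peaks(batches, threshold):
    """Count peaks whose effect has the same sign in every batch and whose
    weakest |t| across batches still exceeds the threshold."""
    n_peaks = len(batches[0]["values"][0])
    count = 0
    for p in range(n_peaks):
        ts = [welch_t(column(b, p, 1), column(b, p, 0)) for b in batches]
        same_sign = all(t > 0 for t in ts) or all(t < 0 for t in ts)
        if same_sign and min(abs(t) for t in ts) > threshold:
            count += 1
    return count

def permutation_p_value(batches, threshold, n_perm, rng):
    """Compare the observed count with counts after shuffling labels within batches."""
    observed = count_consistent_peaks(batches, threshold)
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for batch in batches:
            labels = batch["labels"][:]
            rng.shuffle(labels)  # keeps the case/control counts of the batch
            shuffled.append({"values": batch["values"], "labels": labels})
        if count_consistent_peaks(shuffled, threshold) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

def simulate_batch(rng, offset, n_per_group=6, n_peaks=20, n_signal=3, effect=5.0):
    """Simulated batch: a batch-wide offset plus a case effect on the first peaks."""
    values, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(offset, 1.0) for _ in range(n_peaks)]
            if label == 1:
                for p in range(n_signal):
                    row[p] += effect
            values.append(row)
            labels.append(label)
    return {"values": values, "labels": labels}

rng = random.Random(0)
batches = [simulate_batch(rng, offset) for offset in (0.0, 2.0, 4.0)]
observed, p_value = permutation_p_value(batches, threshold=2.0, n_perm=200, rng=rng)
```

Shuffling within batches rather than across the whole data set keeps the batch structure intact, so the permuted counts reflect multiple testing alone and not batch differences.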

11 Typical Highly Significant Peak

[Figure: Intensity by Batches; legend: CASE, CONTROL, NIC]

Within each batch, cases are highly expressed compared to controls

Not possible to define a global cut-off between cases and controls
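A small numerical illustration of the cut-off problem, using invented intensities for a single peak in two hypothetical batches: within each batch a threshold separates cases from controls, but once the batches are pooled no single threshold does.

```python
def has_clean_cutoff(cases, controls):
    """True when a single threshold puts every case above every control."""
    return min(cases) > max(controls)

# Invented intensities for one peak; batch_b sits lower overall than batch_a.
batches = {
    "batch_a": {"cases": [10.0, 11.0], "controls": [6.0, 7.0]},
    "batch_b": {"cases": [5.0, 5.5], "controls": [2.0, 3.0]},
}

within_batch = {
    name: has_clean_cutoff(b["cases"], b["controls"]) for name, b in batches.items()
}
all_cases = [x for b in batches.values() for x in b["cases"]]
all_controls = [x for b in batches.values() for x in b["controls"]]
pooled = has_clean_cutoff(all_cases, all_controls)
```

Here the controls in `batch_a` are more intense than the cases in `batch_b`, so the pooled check fails even though both within-batch checks pass.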

12

Multivariate Analyses

• Identified consistent effect
• BUT, may be difficult to use as a predictive biomarker in a clinical setting due to batch variation
• Would a combination of markers, a peptidomic profile, work as a predictive biomarker?
• Use Random Forests to generate multivariate predictive models
• Assess predictive power using a nested cross-validation
  – Within and between batch prediction

13

Modelling Process

[Flowchart] Data, split into Training Set and Test Set

• Analyse Each Peak Within Each Batch
• Take Maximum p-Value For Each Peak
• Rank Peaks By p-Value
• Build Model With Top n Peaks
• Test Model In Test Set

Mixed Case-Control batches
  – Exclude Batches In Turn
  – Exclude Observations By LOO

[Figure: plotted against Number Of Peaks; labels: Control Only batches, Batch excluded, Observation excluded]
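The key property of this process is that peak ranking happens inside the training data only, so the held-out batch never influences which peaks are chosen. The sketch below shows the "exclude batches in turn" loop under several assumptions. A nearest-centroid classifier is swapped in for the Random Forest to keep the example dependency-free, the weakest Welch |t| across training batches replaces the maximum p-value, and the data are simulated. All function names are hypothetical.

```python
import random
from statistics import mean, variance

def welch_t(cases, controls):
    """Welch t-statistic, positive when cases are higher than controls."""
    se2 = variance(cases) / len(cases) + variance(controls) / len(controls)
    return 0.0 if se2 == 0 else (mean(cases) - mean(controls)) / se2 ** 0.5

def column(batch, peak, label):
    """Values of one peak for the samples in a batch carrying the given label."""
    return [row[peak] for row, lab in zip(batch["values"], batch["labels"]) if lab == label]

def rank_peaks(train_batches, n_top):
    """Score each peak by its weakest |t| across training batches (zero if the
    direction of effect disagrees) and return the n_top best peak indices."""
    n_peaks = len(train_batches[0]["values"][0])
    scores = []
    for p in range(n_peaks):
        ts = [welch_t(column(b, p, 1), column(b, p, 0)) for b in train_batches]
        agree = all(t > 0 for t in ts) or all(t < 0 for t in ts)
        scores.append(min(abs(t) for t in ts) if agree else 0.0)
    return sorted(range(n_peaks), key=lambda p: scores[p], reverse=True)[:n_top]

def fit_centroids(train_batches, peaks):
    """Nearest-centroid stand-in for the Random Forest: one mean vector per class."""
    centroids = {}
    for label in (0, 1):
        rows = [row for b in train_batches
                for row, lab in zip(b["values"], b["labels"]) if lab == label]
        centroids[label] = [mean(r[p] for r in rows) for p in peaks]
    return centroids

def predict(centroids, row, peaks):
    """Assign the class whose centroid is closest on the selected peaks."""
    def dist(label):
        return sum((row[p] - c) ** 2 for p, c in zip(peaks, centroids[label]))
    return min(centroids, key=dist)

def leave_one_batch_out(batches, n_top):
    """Hold out each batch in turn; select peaks and fit on the remaining batches."""
    results = []
    for i, test in enumerate(batches):
        train = batches[:i] + batches[i + 1:]
        peaks = rank_peaks(train, n_top)  # selection sees training data only
        centroids = fit_centroids(train, peaks)
        preds = [predict(centroids, row, peaks) for row in test["values"]]
        accuracy = mean(int(p == y) for p, y in zip(preds, test["labels"]))
        results.append({"held_out": i, "peaks": peaks, "accuracy": accuracy})
    return results

def simulate_batch(rng, offset, n_per_group=6, n_peaks=20, effect=6.0):
    """Simulated batch: peaks 0 and 1 carry a case effect; the whole batch is shifted."""
    values, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_group):
            row = [rng.gauss(offset, 1.0) for _ in range(n_peaks)]
            if label == 1:
                row[0] += effect
                row[1] += effect
            values.append(row)
            labels.append(label)
    return {"values": values, "labels": labels}

rng = random.Random(0)
batches = [simulate_batch(rng, offset) for offset in (0.0, 2.0, 4.0)]
results = leave_one_batch_out(batches, n_top=2)
```

A within-batch variant would hold out single observations instead of whole batches while keeping the same rule that ranking and fitting use only the remaining data. Because the simulated batches are shifted relative to one another, held-out-batch accuracy can be noticeably lower than within-batch accuracy.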

14 Leave One Out Cross Validation: Proteomic Model Predictions

[Figure legend]
Leave One Out Training Set Batches - Cases
Leave One Out Training Set Batches - Controls
Other Mixed Batch - Cases
Other Mixed Batch - Controls
Other Batches - Controls

15 Mask Data By Restricting To High Quality Regions Of Proteomic Space

[Figure: mask over Retention Time and Mass Charge Ratio]

TECHNICALLY
• Region of focus for instrument

EMPIRICALLY
• Lowest residual variability
• Highest average intensity
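The empirical criteria could be applied as a simple filter over peaks, as in the sketch below. The thresholds, the function name and the data are invented for illustration. Raw between-sample variance is used as a simplification of residual variability, which would normally be measured after accounting for factors such as batch and case/control status.

```python
from statistics import mean, variance

def empirical_mask(intensities, min_mean, max_variance):
    """Return indices of peaks with high average intensity and low variability.

    intensities: list of samples, each a list of per-peak intensities.
    """
    n_peaks = len(intensities[0])
    keep = []
    for p in range(n_peaks):
        values = [sample[p] for sample in intensities]
        if mean(values) >= min_mean and variance(values) <= max_variance:
            keep.append(p)
    return keep

# Invented intensities: three samples, three peaks.
samples = [
    [100.0, 5.0, 80.0],
    [102.0, 6.0, 20.0],
    [98.0, 4.0, 140.0],
]
kept = empirical_mask(samples, min_mean=50.0, max_variance=10.0)
```

In this toy data only the first peak is kept: the second is too faint and the third too variable.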

16

Analysis Of Unmasked Peaks

• Batch Effects Still Dominate
• Consistent Case-Control Effect

Can Identify Peaks Separating Cases & Controls Across Batches

17 Cross-Validation Predictions: Unmasked Peaks

[Figure legend]
Leave One Out Same Batch - Cases
Leave One Out Same Batch - Controls
Other Mixed Batch - Cases
Other Mixed Batch - Controls
Other Batches - Controls

• Good Predictions Within Same Batch
• Prediction Rate Falls When Extrapolated To Other Batches
• Need To Prospectively Test In Another Set Of Patients

18 How To Combine Other Non-omic Information Into A Biomarker?

• Combining different data types is challenging

• The “bigger” data type will dominate the modelling

• Greater signal in data, but doesn’t extrapolate as well

• Exploring options for turning the random part of random forests to our advantage

[Figure labels: Known Clinical Prognostic; Proteomic Peaks]

19 Proteomic Quality Control Consortium?

• MAQC recently reported a reproducibility study for microarrays
  – Wealth of valuable information
  – Mammoth effort
• Could we do the same for proteomics?
  – Less mature technology
  – Greater diversity of platforms
  – Diversity of pre-processing methodologies
  – Issues of identification making large scale comparisons challenging

20

Conclusions

• Complicated new technologies
• Many challenges
  – Technical, Data Quality, Data Analysis, Practical
• Essential role for statistics
• Need to integrate statistical approaches with understanding of technologies and biology
• Great potential
  – Better treatments for patients
  – Improved use of compounds
  – Greater biological understanding