Date post: 19-Mar-2020
Machine Learning and Pathway Analysis as basic tools in Systems Biology Adi L. Tarca 1,2 1. Department of Computer Science, Wayne State University, Detroit, MI, USA 2. Perinatology Research Branch, NICHD/NIH, Bethesda, MD, and Detroit, MI, USA 3. Center for Molecular Medicine and Genetics, Wayne State University,
Page 1: Machine Learning and Pathway Analysis as basic tools in Systems Biology
Adi L. Tarca (1,2)
1. Department of Computer Science, Wayne State University, Detroit, MI, USA
2. Perinatology Research Branch, NICHD/NIH, Bethesda, MD, and Detroit, MI, USA
3. Center for Molecular Medicine and Genetics, Wayne State University,

Page 2: Outline
• Lessons learned from the IMPROVER Diagnostic Signature Challenge
• Gene set/pathway analysis
• Approach of the PRB team in the Species Translation Challenge

Page 5: Machine Learning / Supervised Learning
• Automated process performed by a machine (computer)
• to approximate (learn) the relation between a set of predictors (X_j) and an outcome y
• using a set of examples (X, y)_i
• The model is expected to perform well when applied to new data (generalize)

Page 6: IMPROVER: Diagnostic Signature Challenge
• Assess and verify computational approaches that classify clinical samples based on transcriptomics data
• Participants built models from public data to predict 5 endpoints (Psoriasis, COPD, Lung cancer, MS stages, MS diagnosis)
• Compared to a previous initiative, MAQC-II (Shi et al. 2010, Nat Biotechnol), it was more stringent
Meyer P et al, Bioinformatics 28, 2012

Page 7: Model Performance Results in the Diagnostic Signature Challenge
- 54 teams participated
- The endpoint explained 69% of the variance (p < 0.05)
- Team/approach explained 8% (NS)
[Figure: prediction quality (Z-score) for BCM, CCEM and AUPR by endpoint / sub-challenge: Psoriasis, MSS, MSD, Lung cancer, COPD; y-axis ticks from -5 to 10]

Page 8: No team performed best in more than one sub-challenge
[Figure annotation: Using unrelated training dataset]

Page 9: Approach of the best overall team
[Diagram of the options shown at each step]
• Training sets: Set 1; Set 2; Set 1 & Set 2
• Preprocessing: All batches together, or within batch + batch effect correction; PLIER, RMA, MAS5, GCRMA, dCHIP
• Feature selection: Filter genes by moderated t-test & fold change; optimize the number of genes by cross-validated AUC
• Classification model: LDA, QDA, DLDA, neural networks, SVM, decision trees

Page 10: Strategies of 2nd and 3rd Best Overall Teams in DSC
• 2nd best overall team used:
  - unsupervised clustering of test samples
  - clustering based on features selected by Wilcoxon test
  - cluster labels assigned using prior information about the direction of change of a few known genes
• 3rd best overall team used:
  - LASSO regularized logistic regression
  - Regularization parameter optimized via LOO cross-validation
  - Features filtered by Wilcoxon test

Page 11: What explains the variability in performance data and what works best in general?

Page 12: Issues in the Analysis of Model Performance Data in the IMPROVER DSC
• Method descriptions were not detailed enough, resulting in missing data
• There were too many different methods for each modeling factor (e.g. over 15 types of classifiers)
• The training data differed between teams for the same endpoint

Page 13: A Post Challenge Survey
• Had the team used cross-validation to tune any of the parameters in their classification pipeline?
• Teams that had used cross-validation had better performance (p < 0.05): by 1.2 Z-score units for BCM and 1.9 for AUPR

Page 14: A Post Challenge Computational Experiment
- Fix the training datasets and everything else
- Vary the preprocessing (RMA, GCRMA, MAS5), feature selection (t-test, moderated t-test, Wilcoxon test) and classifier (LDA, kNN, SVM)
[Figure: two panels with axis (BCM+CCEM+AUPR)/3; annotation: Combination of the best overall team]

Page 15: Most Important Modeling Factor is Problem and Metric Dependent
Ideally, the exact prediction assessment procedure should be known in advance!

Page 16: Data Preprocessing: Together is Better than Separate
- 24 data points (2 preprocessing methods x 3 feature selections x 4 endpoints)
- BCM and AUPR improved on average by 6% and 4%, respectively (Wilcoxon p-value < 0.05)

Page 17: maPredictDSC R Package
• Implements the approach of the PRB team; available from Bioconductor
• Starts with raw (Affymetrix) gene expression data files and one annotation data frame assigning files to groups (disease, control, test)
• Tries 27 combinations of data preprocessing, feature selection and classifiers to guide model selection
• Uses N-fold cross-validation to determine the optimal number of features for each combination of methods
• Provides predictions for the test samples and a fitted model

Page 19: Conclusions IMPROVER DSC
• When gene expression differences are weak, no classification pipeline will work
• The No Free Lunch (NFL) theorem was proven right, again: there is no universally best approach to class prediction
• Using one's favorite methods can work well on average, yet the methods need to be used properly to avoid under- and over-fitting
• Finding the best model for a given problem requires trying many combinations of methods; the maPredictDSC package can help
• The importance of each step in the process (preprocessing, feature selection, classifier choice) is problem and metric dependent, so no shortcut can be confidently suggested

Page 20: Gene Set and Pathway Analysis

Page 21: Gene Set Analysis: Motivation
• A successful class-comparison experiment may result in hundreds or thousands of differentially expressed (DE) genes
• A widely used approach to interpret such results includes the following 2 steps:
  1) Staring at long lists of genes
  and
  2) Focusing on genes that we already know

Page 22: Gene Sets
• Examples of gene sets:
  – Gene Ontology Terms
  – Signaling and metabolic pathways (e.g. KEGG, BioCarta, Reactome)
  – Motif gene sets, etc. (GSEABase)

Page 23: Gene Set Analysis
• Methods to test association between a predefined set of variables (e.g. genes, proteins, etc.) and an outcome of interest
• One of the few options to extract meaning from hundreds or more DE genes in a given condition
• Can establish a link between a gene set and the outcome even when there are no DE genes by usual thresholds (e.g. *)
* Mootha et al., PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately down-regulated in human diabetes. Nat Genet, 2003.

Page 24: Over-Representation Analysis (ORA)
• Select a subset of genes as differentially expressed (DE)
• Test if a given gene set has more DE genes than expected by chance:
  – Hypergeometric (Tavazoie, Nat. Genet. 1999, Draghici et al., Genomics, 2003)
  – Binomial (Cho et al. Nat. Genet. 2001)
  – Chi-square (χ²)
• Results change as a function of the stringency of gene selection
• Ignores the correlation between genes
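The hypergeometric variant of ORA can be sketched in a few lines. This is only an illustration using the Python standard library: the function name and the gene counts are hypothetical, not taken from the slides, and the p-value is the upper tail of the hypergeometric distribution (probability of seeing at least the observed number of DE genes in the set).

```python
from math import comb

def ora_p_value(n_universe, n_de, set_size, de_in_set):
    """Upper-tail hypergeometric p-value: P(X >= de_in_set) when n_de genes
    are called DE out of n_universe and the gene set has set_size members."""
    upper = min(n_de, set_size)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(set_size, k) * comb(n_universe - set_size, n_de - k)
        for k in range(de_in_set, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 10,000 genes measured, 200 called DE,
# and a 50-gene set that contains 6 of the DE genes (1 expected by chance).
p = ora_p_value(10_000, 200, 50, 6)
print(f"ORA p-value: {p:.2e}")
```

Changing the DE cutoff changes `n_de` and `de_in_set`, which is why the slide notes that ORA results depend on the stringency of gene selection.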

Page 25: Combining pathway information with over-representation evidence
The impact analysis method (Pathway Express): Draghici S., Khatri P., Tarca A.L. et al., Genome Res, 2007
A Novel Signaling Pathway Impact Analysis (SPIA): Tarca A.L., Draghici S., Khatri P. et al., Bioinformatics, 2009

Page 26: Functional class scoring methods
• Use evidence from all genes in the gene set (not only DE)
• Provide a "unique" solution to a given problem
• Account for gene-gene correlations
• Typically are slower because they involve sample permutation
• Popular methods:
  • Gene Set Enrichment Analysis (GSEA) (Mootha et al., Nat. Genet., 2003; Subramanian et al., PNAS, 2005)
  • Gene Set Analysis (GSA): Efron B & Tibshirani R, Annals of Applied Statistics, 2006
• Give equal weight to all genes in a gene set
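To make the idea of scoring a whole gene set concrete, here is a minimal sketch of a Kolmogorov-Smirnov-style running-sum enrichment score over a ranked gene list. It is the unweighted form, in which every gene in the set contributes equally, and it is an illustration rather than the GSEA software itself; gene names are hypothetical. Significance would come from recomputing the score after permuting sample labels, which is omitted here.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of an unweighted running sum:
    step up at genes in the set, step down at all other genes."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranking, most up-regulated gene first.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
print(enrichment_score(ranked, {"g5", "g6"}))  # set concentrated at the bottom
```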

Page 27: Pathway Analysis with Down-weighting of Overlapping Genes (PADOG)
Extends the Gene Set Analysis (GSA):
1. Weights gene scores as a function of how often the gene appears across the gene sets
2. Uses moderated t-scores instead of ordinary t-scores for each gene
3. Computes the mean of absolute gene scores instead of the maxmean statistic
4. Computes significance by comparing observed pathway scores to the empirical null distribution obtained from phenotype permutations
Tarca AL., Draghici S., Bhatti G., Romero R., BMC Bioinformatics, 2012, 13:136.
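Steps 1 and 3 can be illustrated with a small sketch. The slide does not give the weighting formula, so the square-root form below, which maps the most frequent gene to weight 1 and the rarest to weight 2, is an assumption that should be checked against the cited paper. The gene sets and t-scores are hypothetical, and the permutation step (4) is omitted.

```python
from collections import Counter
from math import sqrt

def gene_weights(gene_sets):
    """Down-weight genes that appear in many gene sets (assumed formula)."""
    freq = Counter(g for genes in gene_sets.values() for g in set(genes))
    lo, hi = min(freq.values()), max(freq.values())
    if hi == lo:  # all genes equally frequent: no down-weighting possible
        return {g: 1.0 for g in freq}
    return {g: 1.0 + sqrt((hi - f) / (hi - lo)) for g, f in freq.items()}

def weighted_set_score(genes, t_scores, weights):
    """Mean of weighted absolute gene scores for one gene set."""
    vals = [abs(t_scores[g]) * weights[g] for g in genes if g in t_scores]
    return sum(vals) / len(vals)

# Hypothetical collection: g1 is shared by all three sets.
gene_sets = {"A": ["g1", "g2"], "B": ["g1", "g3"], "C": ["g1", "g4"]}
t_scores = {"g1": 3.0, "g2": -1.0, "g3": 0.5, "g4": 2.0}
weights = gene_weights(gene_sets)
score_a = weighted_set_score(gene_sets["A"], t_scores, weights)
print(weights, score_a)
```

The shared gene g1 keeps its raw score, while set-specific genes count double, so a pathway is rewarded for changes in genes that are specific to it.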

Page 28: PADOG
[Figure: gene frequency across 143 KEGG non-metabolic pathways]

Page 29: Assessing Gene Set Analysis Results
• Most gene set analysis methods are compared against a few competitor methods using a few real datasets or simulated data
• Performance is typically measured relying on literature citations for the relevance of a given gene set to a condition
• It is relatively easy to find 2-3 datasets on which one's method works better than a competitor method
• It is hard to figure out which method works best in general, if any
A.L. Tarca, G. Bhatti, R. Romero, PLoS ONE, 2013.

Page 30: Assessing Gene Set Analysis Results
Analysis of a colorectal cancer dataset. The Colorectal cancer pathway is expected to be relevant to this dataset (target pathway).
Pathway Name | Method 1 P-value | Method 2 P-value | Method 1 Rank | Method 2 Rank
Bile secretion | 0.023 | 0.012 | 1 | 2
Fatty acid elongation in mitochondria | 0.029 | 0.018 | 2 | 5
Colorectal cancer | 0.14 | 0.015 | 3 | 4
Ribosome biogenesis in eukaryotes | 0.2 | 0.03 | 4 | 7
Cell cycle | 0.21 | 0.014 | 5 | 3
Cyanoamino acid metabolism | 0.32 | 0.16 | 6 | 8
Purine metabolism | 0.58 | 0.022 | 7 | 6
Small cell lung cancer | 0.68 | 0.01 | 8 | 1
Fatty acid metabolism | 0.8 | 0.3 | 9 | 9
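The two quantities read off this table for the target pathway, its p-value and its rank, can be recomputed with a short sketch. The p-values are those shown on the slide; the function name and the rank-as-percentage convention (rank divided by the number of pathways) are assumptions made for illustration.

```python
def rank_of_target(p_values, target):
    """Rank of the target pathway (1 = smallest p-value) and rank in percent."""
    ordered = sorted(p_values, key=p_values.get)
    rank = ordered.index(target) + 1
    return rank, 100.0 * rank / len(ordered)

pathways = [
    "Bile secretion", "Fatty acid elongation in mitochondria",
    "Colorectal cancer", "Ribosome biogenesis in eukaryotes", "Cell cycle",
    "Cyanoamino acid metabolism", "Purine metabolism",
    "Small cell lung cancer", "Fatty acid metabolism",
]
method1 = dict(zip(pathways, [0.023, 0.029, 0.14, 0.2, 0.21, 0.32, 0.58, 0.68, 0.8]))
method2 = dict(zip(pathways, [0.012, 0.018, 0.015, 0.03, 0.014, 0.16, 0.022, 0.01, 0.3]))

for name, pvals in (("Method 1", method1), ("Method 2", method2)):
    rank, pct = rank_of_target(pvals, "Colorectal cancer")
    print(name, "p =", pvals["Colorectal cancer"], "rank =", rank, f"({pct:.1f}%)")
```

Method 2 gives the target pathway a much smaller p-value (0.015 vs 0.14) yet ranks it lower (4th vs 3rd), which is why p-value and rank are assessed separately.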

Page 31:
[Diagram: 42 Microarray datasets, Methods, Performance]
• 42 Microarray datasets, by target pathway (counts in parentheses):
  – KEGG Disease Pathways: Acute myeloid leukemia (3), Chronic myeloid leukemia (2), Colorectal cancer (5), Dilated cardiomyopathy (2), Endometrial cancer (1), Glioma (2), Huntington's disease (1), Prostate cancer (2), Renal cell carcinoma (2), Alzheimer's Disease (5), Non Small Cell Lung Cancer (2), Pancreatic cancer (3), Parkinson's disease (3), Thyroid cancer (2)
  – Metacore Disease Biomarker Networks: Diabetes Mellitus Type2 (1), Lupus Erythematosus Systemic (1), Pulmonary Disease Chronic Obstructive (2), Pancreatic Neoplasms (1), Ovarian Neoplasms 1 (2)
• Methods: ORA, GSEA, GSEAP, GLOBALTEST, SAFE, SIGPATHWAY Q1, SIGPATHWAY Q2, PLAGE, GSA, ZSCORE, MRGSE, GAGE, SSGSEA, PADOG, CAMERA, GSVA
• Performance

Page 32: Assessing Gene Set Analysis Results
• The methods are compared in their ability to:
  – produce low p-values for the target pathway (sensitivity)
  – rank the target pathway close to the top (prioritization)
  – produce no more than the expected number of false positives when phenotypes are permuted (specificity)
A.L. Tarca, G. Bhatti, R. Romero, PLoS ONE, 2013, in press.

Page 33:
- Ranking methods by median p-value of pathways expected to be relevant (a surrogate for sensitivity)
- Using sensitivity TP/(TP+FN) is similar but leads to ties in the ranking

Page 34:
- Ranking methods by median rank of pathways expected to be relevant (a surrogate for prioritization)

Page 35:
• Permute the phenotype of each of the 42 datasets
• Repeat 50 times
• Count how many pathways have p < α
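The permutation procedure above can be sketched as follows. Everything here is illustrative: the data are simulated noise, and the toy gene set test (a normal-approximation two-sample test on the per-sample mean expression of the set) is a stand-in chosen for brevity, not one of the methods compared in the slides.

```python
import random
from math import erfc, sqrt
from statistics import mean, variance

def set_p_value(expr, labels, genes):
    """Toy gene set test: z-test on per-sample mean expression of the set."""
    per_sample = [mean(expr[g][i] for g in genes) for i in range(len(labels))]
    a = [v for v, lab in zip(per_sample, labels) if lab == 1]
    b = [v for v, lab in zip(per_sample, labels) if lab == 0]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return erfc(abs(z) / sqrt(2))  # two-sided p-value under a normal approximation

def false_positive_rate(expr, labels, gene_sets, alpha=0.01, n_perm=50, seed=0):
    """Fraction of (permutation, gene set) pairs with p < alpha after shuffling labels."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        for genes in gene_sets.values():
            hits += set_p_value(expr, shuffled, genes) < alpha
            total += 1
    return hits / total

# Simulated dataset: 20 genes x 12 samples of pure noise, 3 hypothetical gene sets.
data_rng = random.Random(1)
gene_names = [f"g{j}" for j in range(20)]
expr = {g: [data_rng.gauss(0, 1) for _ in range(12)] for g in gene_names}
labels = [0] * 6 + [1] * 6
gene_sets = {"setA": gene_names[:5], "setB": gene_names[5:12], "setC": gene_names[12:]}
rate = false_positive_rate(expr, labels, gene_sets)
print(f"Fraction of pathways with p < 1% under permuted phenotype: {rate:.3f}")
```

A well-calibrated method should give a rate close to alpha; a rate well above alpha indicates inflated false positives.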

Page 36: A Ranking of Gene Set Analysis Methods
Method | Category | Sensitivity: median p-value | Prioritization: median rank (%) | Specificity: FP at α=1% | Overall rank in category
PLAGE | I | 0.0022 | 25.0 | 1.1% | 1
GLOBALTEST | I | 0.0001 | 27.9 | 2.0% | 2
PADOG | I | 0.0960 | 9.7 | 2.5% | 3
ORA | I | 0.0732 | 18.3 | 2.5% | 4
SAFE | I | 0.1065 | 18.8 | 1.3% | 5
SIGPATHWAYQ2 | I | 0.0565 | 38.0 | 0.9% | 6
GSA | I | 0.1420 | 21.0 | 1.3% | 7
SSGSEA | I | 0.0808 | 40.3 | 1.0% | 8
ZSCORE | I | 0.0950 | 39.8 | 1.0% | 9
GSEA | I | 0.1801 | 33.1 | 2.3% | 10
GSVA | I | 0.1986 | 51.5 | 1.1% | 11
CAMERA | I | 0.3126 | 43.0 | 0.5% | 12
MRGSE | II | 0.0100 | 18.8 | 4.9% | 1
GSEAP | II | 0.0644 | 36.2 | 15.8% | 2
GAGE | II | 0.0024 | 35.9 | 37.9% | 3
SIGPATHWAYQ1 | II | 0.1165 | 49.7 | 17.2% | 4

Page 37: Ranking Stability
Method | Rank in category | Sample size: small (n<22) | Sample size: large (n≥22) | Gene set size: small (N<66) | Gene set size: large (N≥66) | Effect: small (g<24.6%) | Effect: large (g≥24.6%)
PLAGE | 1 | 1 | 4 | 2 | 3 | 3 | 3
GLOBALTEST | 2 | 2 | 1 | 3 | 5 | 2 | 4
PADOG | 3 | 3 | 2 | 1 | 2 | 1 | 1
ORA | 4 | 4 | 3 | 5 | 1 | 5 | 2
SAFE | 5 | 7 | 5 | 4 | 8 | 4 | 6
SIGPATH.Q2 | 6 | 5 | 8 | 8 | 4 | 8 | 8
GSA | 7 | 9 | 6 | 7 | 6 | 6 | 11
SSGSEA | 8 | 8 | 7 | 6 | 12 | 9 | 5
ZSCORE | 9 | 6 | 10 | 10 | 7 | 7 | 9
GSEA | 10 | 10 | 9 | 9 | 11 | 10 | 7
GSVA | 11 | 11 | 11 | 11 | 9 | 11 | 10
CAMERA | 12 | 12 | 12 | 12 | 10 | 12 | 12
MRGSE | 1 | 1 | 1 | 1 | 2 | 1 | 1
GSEAP | 2 | 2 | 2 | 2 | 1 | 2 | 2
GAGE | 3 | 3 | 3 | 3 | 4 | 3 | 3
SIGPATH.Q1 | 4 | 4 | 4 | 4 | 3 | 4 | 4
• Ranking based on all 42 datasets was significantly correlated with all rankings based on half of the datasets (all p<0.0001)

Page 38: Conclusions Gene Set Analysis
• Gene set analysis methods are useful to reduce the complexity of high-throughput experiments
• The best methods for gene set prioritization are different from the best ones in terms of sensitivity
• PLAGE, GLOBALTEST and PADOG are best overall in category I, and MRGSE is best in category II
• Disease pathways (KEGG, Metacore) can be used as positive controls in gene set analysis in conjunction with a large number of datasets studying those diseases
• Our approach to gene set analysis assessment is less subjective than relying on literature citations

Page 39: Approach of team PRB (49) in the Species Translation Challenge

Page 40: Species Translation Challenge
• Sub-challenges SC1-3 were approachable via a machine learning framework
  • discriminating positive responses (proteins phosphorylated or pathways activated) against negatives
• Compared to IMPROVER DSC, the STC had these particularities:
  – More datasets (SC1: 16, SC2: 16, SC3: 246), so more challenging but potentially a better opportunity to assess performance
  – Did not involve data pre-processing (batch effect removal?)
  – Teams used the same data
  – Datasets were highly imbalanced (many responses with 0, 1, 2, etc. positives, and the remaining up to 26 negatives)

Page 41: SC1: Intra-Species Protein Phosphorylation Prediction
• Used the average gene expression in DME samples within batch as normalizer (to remove batch effects)
• Predictions set to 0 (negative) for phosphoproteins that were positive in fewer than 2 stimuli in the training set
• For all other phosphoproteins, fit an LDA model on the training data using a procedure* similar to the one we used in the IMPROVER DSC
(*) Tarca AL, Than NG, Romero R., Systems Biomedicine; 1(4), 2013.
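The first two bullets can be sketched as below. The data layout, function names and toy values are hypothetical, and subtracting the per-batch DME average assumes log-scale expression; the slide does not say whether a difference or a ratio was used.

```python
from collections import defaultdict

def dme_normalize(samples):
    """Subtract, gene by gene, the mean of the DME samples from the same batch.
    samples: list of dicts with keys 'batch', 'is_dme', 'expr' (list of floats)."""
    dme_rows = defaultdict(list)
    for s in samples:
        if s["is_dme"]:
            dme_rows[s["batch"]].append(s["expr"])
    baseline = {
        batch: [sum(col) / len(col) for col in zip(*rows)]
        for batch, rows in dme_rows.items()
    }
    return [
        [x - b for x, b in zip(s["expr"], baseline[s["batch"]])]
        for s in samples
    ]

def trainable_targets(train_labels, min_positives=2):
    """True if a phosphoprotein has enough positive stimuli to fit a model;
    the others are simply predicted as 0 (negative) for every test stimulus."""
    return {name: sum(y) >= min_positives for name, y in train_labels.items()}

samples = [
    {"batch": "A", "is_dme": True, "expr": [1.0, 2.0]},
    {"batch": "A", "is_dme": True, "expr": [3.0, 4.0]},
    {"batch": "A", "is_dme": False, "expr": [5.0, 5.0]},
]
normalized = dme_normalize(samples)
usable = trainable_targets({"P1": [1, 0, 0], "P2": [1, 1, 0]})
print(normalized[2], usable)
```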

Page 42: SC1: Intra-Species Protein Phosphorylation Prediction
• Genes are ranked by moderated t-test p-value & fold change (cutoff optimized between 1.25 and 4)
• Number of genes to include in the LDA model was determined by maximizing AUC+CCEM+BCM estimated by repeated cross-validation
[Diagram: rat gene expression (DME normalized; rows stimuli 1 to 26, columns gene 1 to gene N) is used to predict rat protein phosphorylation (Protein 1; 0 or 1 per stimulus)]
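Choosing the number of genes by cross-validation can be sketched with the standard library alone. Several simplifications are assumptions made to keep the example short: an ordinary Welch t-statistic replaces the moderated t-test, a nearest-centroid rule replaces LDA, plain accuracy replaces AUC+CCEM+BCM, and a single stratified 3-fold split replaces repeated cross-validation. The data are simulated, with only gene 0 carrying signal.

```python
import random
from statistics import mean, variance

def t_stat(a, b):
    """Welch t-statistic (ordinary, not moderated)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_genes(X, y):
    """Gene indices ordered by decreasing |t| between class 1 and class 0."""
    def score(j):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        return abs(t_stat(pos, neg))
    return sorted(range(len(X[0])), key=score, reverse=True)

def centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class with the nearest centroid on the selected genes."""
    dist = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        dist[lab] = sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in genes)
    return 1 if dist[1] < dist[0] else 0

def stratified_folds(y, n_folds, rng):
    """Deal the samples of each class round-robin into n_folds folds."""
    folds = [[] for _ in range(n_folds)]
    for lab in (0, 1):
        members = [i for i, l in enumerate(y) if l == lab]
        rng.shuffle(members)
        for k, i in enumerate(members):
            folds[k % n_folds].append(i)
    return folds

def cv_accuracy(X, y, n_genes, n_folds=3, seed=0):
    correct = 0
    for test in stratified_folds(y, n_folds, random.Random(seed)):
        train = [i for i in range(len(y)) if i not in test]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        genes = rank_genes(X_tr, y_tr)[:n_genes]  # ranked on training folds only
        correct += sum(centroid_predict(X_tr, y_tr, X[i], genes) == y[i] for i in test)
    return correct / len(y)

def best_gene_count(X, y, candidates):
    return max(candidates, key=lambda k: cv_accuracy(X, y, k))

# Simulated data: 12 samples x 10 genes; gene 0 separates the classes.
rng = random.Random(0)
y = [0] * 6 + [1] * 6
X = [[10.0 * lab + rng.gauss(0, 0.2)] + [rng.gauss(0, 1) for _ in range(9)] for lab in y]
candidates = [1, 2, 5, 10]
best = best_gene_count(X, y, candidates)
print("selected number of genes:", best)
```

Ranking the genes inside each training fold, rather than once on all samples, keeps the held-out samples from influencing feature selection.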

Page 43: MK03 (Mitogen-activated protein kinase 3) phosphorylation in rat
[Figure: training data, Negative (19) vs Positive (7); labeled stimuli: SEROTONIN, PROMETHAZINE, PMA, TGFA, MEPYRAMINE, PDGFB, EGF; genes: FRMD5, WISP1, SMTN, DIO3, GFRA1; plot axes PC1, PC2, PC3]

Page 44: MK03 (Mitogen-activated protein kinase 3) phosphorylation in rat
[Figure: same display as on Page 43, with the test data added to the training data]

Page 45: MK03 (Mitogen-activated protein kinase 3) phosphorylation in rat
[Figure: same display, with test data marked as Negative or Positive; labeled test stimuli: stim01, stim09]

Page 46: SC1: Intra-Species Protein Phosphorylation Prediction

Page 47: SC2: Inter-Species Protein Phosphorylation Prediction
• Same approach as in SC1 with these differences:
  – Predictors are not genes but phosphoproteins (1..16) at 5 and 25 mins (32 features)
  – We used the actual phosphorylation level (continuous) for model fitting
[Diagram: rat protein phosphorylation (rows stimuli 1 to 26; columns Prot. 1 to Prot. 16 at 5 min and at 25 min) is used to predict human protein phosphorylation (Protein 1; 0 or 1 per stimulus)]

Page 48: SC2: Inter-Species Protein Phosphorylation Prediction
• Should we have used gene expression + protein phosphorylation data?

Page 49: SC3: Inter-Species Pathway Perturbation Prediction – First submission
• Predictors were NES scores (continuous)
• If a pathway was not perturbed in at least 4/26 training stimuli in human (91% of the 246 pathways), it was predicted as non-perturbed in the test set
[Diagram: rat pathway NES (rows stimuli 1 to 26; columns Pathway 1 to Pathway 246) is used to predict human pathway perturbation (Pathway 1; 0 or 1 per stimulus)]

Page 50: SC3: Inter-Species Pathway Perturbation Prediction

• Result of the first submission (before the first deadline)
[Results table residue: Non-official | 0.20 | 0.57 | 0.53 | 3]

Page 51: SC3: Inter-Species Pathway Perturbation Prediction
• Result of the final submission

Page 52: SC3: Why did we need a new and riskier approach in SC3?
• A new approach would make editors more interested in the resulting paper
• Machine learning cannot work if a pathway is activated in 2 or fewer stimuli in human (79% of the 246 pathways)
• We assumed that the quality of predictions for each pathway would have equal weight in the scoring

Page 53:
Stimuli | Pathway | True Status | Method 1 | Method 2 || Stimuli | Pathway | True Status | Method 1 | Method 2
1 | 1 | 1 | 0 | 1 || 1 | 3 | 1 | 1 | 1
2 | 1 | 0 | 0 | 0 || 2 | 3 | 1 | 1 | 1
3 | 1 | 0 | 0 | 0 || 3 | 3 | 1 | 1 | 1
4 | 1 | 0 | 0 | 0 || 4 | 3 | 1 | 1 | 0
5 | 1 | 0 | 0 | 0 || 5 | 3 | 1 | 0 | 0
6 | 1 | 0 | 0 | 0 || 6 | 3 | 0 | 0 | 0
7 | 1 | 0 | 0 | 0 || 7 | 3 | 0 | 0 | 0
8 | 1 | 0 | 0 | 0 || 8 | 3 | 0 | 0 | 0
9 | 1 | 0 | 0 | 0 || 9 | 3 | 0 | 0 | 0
10 | 1 | 0 | 0 | 0 || 10 | 3 | 0 | 0 | 0
1 | 2 | 1 | 0 | 1 || 1 | 4 | 1 | 1 | 1
2 | 2 | 1 | 0 | 0 || 2 | 4 | 1 | 1 | 1
3 | 2 | 0 | 0 | 0 || 3 | 4 | 1 | 1 | 1
4 | 2 | 0 | 0 | 0 || 4 | 4 | 1 | 1 | 0
5 | 2 | 0 | 0 | 0 || 5 | 4 | 1 | 0 | 0
6 | 2 | 0 | 0 | 0 || 6 | 4 | 0 | 0 | 0
7 | 2 | 0 | 0 | 0 || 7 | 4 | 0 | 0 | 0
8 | 2 | 0 | 0 | 0 || 8 | 4 | 0 | 0 | 0
9 | 2 | 0 | 0 | 0 || 9 | 4 | 0 | 0 | 0
10 | 2 | 0 | 0 | 0 || 10 | 4 | 0 | 0 | 0

Pooled performance:
Method | Pearson | BAC
Method 1 | 0.72 | 0.807
Method 2 | 0.72 | 0.807

Separate then averaged:
Method | Pearson | BAC
Method 1 | NA | 0.7
Method 2 | 0.74 | 0.84
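The BAC values in this example can be reproduced directly from the 0/1 entries of the table. The sketch below takes BAC as the balanced accuracy, the mean of sensitivity and specificity, which is an assumption about the abbreviation that matches the reported numbers. It compares pooling all 40 predictions with scoring each pathway separately and then averaging.

```python
def bac(truth, pred):
    """Balanced accuracy: mean of sensitivity and specificity.
    Assumes both classes are present in truth."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
    pos = sum(truth)
    neg = len(truth) - pos
    return 0.5 * (tp / pos + tn / neg)

def pooled_bac(truth, pred):
    """Concatenate all pathways, then compute one BAC."""
    t = [v for k in truth for v in truth[k]]
    p = [v for k in truth for v in pred[k]]
    return bac(t, p)

def averaged_bac(truth, pred):
    """Compute BAC per pathway, then average."""
    return sum(bac(truth[k], pred[k]) for k in truth) / len(truth)

# Values per pathway (10 stimuli each), copied from the table.
truth = {1: [1] + [0] * 9, 2: [1, 1] + [0] * 8,
         3: [1] * 5 + [0] * 5, 4: [1] * 5 + [0] * 5}
method1 = {1: [0] * 10, 2: [0] * 10,
           3: [1] * 4 + [0] * 6, 4: [1] * 4 + [0] * 6}
method2 = {1: [1] + [0] * 9, 2: [1] + [0] * 9,
           3: [1] * 3 + [0] * 7, 4: [1] * 3 + [0] * 7}

for name, pred in (("Method 1", method1), ("Method 2", method2)):
    print(name, round(pooled_bac(truth, pred), 3), round(averaged_bac(truth, pred), 3))
```

Pooled scoring cannot tell the two methods apart, while per-pathway averaging rewards Method 2 for getting the rarely activated pathways 1 and 2 right.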

Page 54: SC3: A non-Machine Learning Approach to SC3
[Diagram elements: Human gene expression (test set) and Rat gene expression (test set), each with rows Gene 1 to Gene N and columns Control Replicate 1, Control Replicate 2, Stimulus01 Replicate 1, Stimulus01 Replicate 2; a mapping labeled f; Human gene sets collection; GSEAP analysis; resulting Q-values: Pathway 1: 0.01, Pathway 2: 0.03, Pathway 3: 0.26, ..., Pathway 246: 0.99]

Page 55: SC3: A non-Machine Learning Approach to SC3
[Diagram: same elements as on Page 54, without the mapping label f]

Page 56: SC3: Find a data driven ortholog for each human gene among rat genes
Human Training Set (genes sorted by t-score within each stimulus):
  Human Stimulus01: Gene 10 (-5.2), Gene 63 (-4.0), Gene 20 (-3.0), …, Gene 98 (+4.5), Gene 67 (+6.0)
  Human Stimulus02: Gene 60 (-2.2), Gene 20 (-1.0), Gene 10 (-0.9), …, Gene 76 (+3.5), Gene 10 (+4.0)
  . . .
  Human Stimulus26: Gene 60 (-2.2), Gene 20 (-1.0), Gene 34 (-0.9), …, Gene 76 (+3.5), Gene 10 (+4.0)
Rat Training Set (genes sorted by t-score within each stimulus):
  Rat Stimulus01: Gene 50 (-6.2), Gene 63 (-4.0), Gene 20 (-3.0), …, Gene 98 (+4.5), Gene 67 (+6.0)
  Rat Stimulus02: Gene 60 (-2.2), Gene 20 (-1.0), Gene 50 (-0.5), …, Gene 30 (+0.9), Gene 10 (+4.0)
  . . .
  Rat Stimulus26: Gene 60 (-2.2), Gene 20 (-1.0), Gene 34 (-0.9), …, Gene 76 (+3.5), Gene 50 (+4.0)

Page 57: SC3: Find a Data Driven Ortholog for each Human Gene Among Rat Genes
• For each human gene h (in gene sets 1..246)
  – Compute the rank distance to each rat gene (0.0 ≤ Rank ≤ 1.0)
  – Choose the rat gene r that minimizes D(h,r)
Rank | Gene | Stimulus 01 t-score
0.000 | Gene 50 | -6.2
0.001 | Gene 63 | -4.0
0.002 | Gene 20 | -3.0
0.500 | … | …
0.999 | Gene 98 | +4.5
1.000 | Gene 67 | +6.0
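The matching step can be sketched as follows. The slide does not define D(h,r), so the distance used here, the mean absolute difference between scaled ranks across stimuli, is an assumption; gene names and t-scores are hypothetical.

```python
def scaled_ranks(t_scores):
    """Map each gene to a rank in [0, 1]: lowest t-score -> 0.0, highest -> 1.0."""
    ordered = sorted(t_scores, key=t_scores.get)
    denom = max(len(ordered) - 1, 1)
    return {g: i / denom for i, g in enumerate(ordered)}

def data_driven_ortholog(h, human, rat):
    """human, rat: lists with one {gene: t-score} dict per stimulus, same order.
    Returns the rat gene whose rank profile is closest to that of human gene h."""
    h_ranks = [scaled_ranks(s)[h] for s in human]
    rat_ranks = [scaled_ranks(s) for s in rat]

    def distance(r):
        # Assumed form of D(h, r): mean absolute rank difference over stimuli.
        return sum(abs(a - b[r]) for a, b in zip(h_ranks, rat_ranks)) / len(h_ranks)

    return min(rat_ranks[0], key=distance)

# Two stimuli, three genes per species (hypothetical values).
human = [{"H1": -5.0, "H2": 0.0, "H3": 4.0}, {"H1": -2.0, "H2": 3.0, "H3": 0.5}]
rat = [{"R1": 6.0, "R2": -4.0, "R3": 0.1}, {"R1": 0.0, "R2": -3.0, "R3": 2.0}]
matches = {h: data_driven_ortholog(h, human, rat) for h in human[0]}
print(matches)
```

Working on ranks rather than raw t-scores makes the comparison insensitive to differences in scale between the human and rat measurements.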

Page 58: Acknowledgements
• Gustavo Stolovitzky, Raquel Norel, Erhan Bilal, Pablo Meyer, Jeremy J Rice: IBM Thomas J. Watson Research Center
• Stephanie Boue, Julia Hoeng, Florian Martin, Marja Talikka, Yang Xiang: Philip Morris International, Research & Development
• Mario Lauria: The Microsoft Research - University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
• Michael Unger, Kushal Kumar Dey, Preetam Nandy, Christoph Zechner, Heinz Koeppl: ETH Zurich, Switzerland
• IMPROVER DSC Collaborators

Page 59: Acknowledgements
• The Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, NIH/DHHS
• The IMPROVER Diagnostic Signature Challenge Grant from Philip Morris

Page 60: Thank you! / Questions?

