
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

Introduction to Systems Biology II

Amin Emad

NIH BD2K KnowEnG Center of Excellence in Big Data Computing Carl R. Woese Institute for Genomic Biology

Department of Computer Science University of Illinois at Urbana-Champaign

June, 2018

Systems Biology

•  Systems biology is the computational and mathematical modeling of complex biological systems (Wikipedia).

•  Studies the interactions between the components of biological systems such as genes, proteins, metabolites, etc. (i.e. biological networks), and how these interactions give rise to the function and behavior of that system (phenotype)

NIH Big Data Center of Excellence

Biological Networks

A graphical representation of the interactions of the components of a biological system


[Figure: the cell as a system, made up of a signaling network, a transcriptional regulatory network, a metabolic network, a gene co-expression network, and a protein interaction network. Source: Zhang (2009), BMIF310, Fall 2009]

•  Cell signaling networks

•  Gene regulatory networks

•  Protein-protein interaction networks

•  Gene co-expression networks

•  Metabolic networks

Biological Networks in Computational Biology


Analyzing network properties

Analyzing ‘omic’ data in light of networks

Reconstructing biological networks

Graph Theory, Machine Learning, Statistics

1) Analyzing network properties


What is a network/graph?


•  Graph: A representation of relationships among objects

•  A graph G(V, E) is a set of vertices (nodes) V and edges (links) E

Directed vs. Undirected:

Undirected graph

•  Protein-protein interactions

•  Co-expression network

Directed graph

•  Gene regulatory network

•  Signaling pathways

Graph Properties


Weighted vs. Unweighted:

•  Weights represent affinity in PPI, correlation coefficient in a co-expression network, confidence in a GRN, etc.

Weighted graph Unweighted graph

Graph Properties


Degree and degree distribution:

•  Degree: Number of connections of a node to other nodes

•  Indegree (outdegree) of a node in a directed graph is the number of edges entering (leaving) that node

•  Degree distribution of a network is the probability distribution of these degrees over the network: P(k) is the fraction of nodes with degree k
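The definitions above can be sketched in a few lines of Python. The edge list and function names below are hypothetical and only serve as an illustration; they are not taken from the slides.

```python
from collections import Counter

# Hypothetical undirected graph given as an edge list (illustrative only).
edges = [("A", "B"), ("A", "C"), ("A", "E"), ("A", "F"), ("F", "G")]


def degrees(edge_list):
    """Return {node: degree} for an undirected graph."""
    deg = Counter()
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def degree_distribution(deg):
    """Return {k: P(k)}, the fraction of nodes that have degree k."""
    counts = Counter(deg.values())
    n = len(deg)
    return {k: c / n for k, c in sorted(counts.items())}


def in_out_degrees(directed_edges):
    """Return ({node: indegree}, {node: outdegree}) for a directed graph.

    Nodes with a zero count are simply absent from the corresponding dict.
    """
    indeg, outdeg = Counter(), Counter()
    for src, dst in directed_edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return dict(indeg), dict(outdeg)


deg = degrees(edges)
dist = degree_distribution(deg)
# Reading the same edge list as directed (source, target) pairs.
indeg, outdeg = in_out_degrees(edges)
```

Here `dist` maps each observed degree to the fraction of nodes with that degree, so its values sum to 1.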

Graph Properties


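Adjacency matrix:

•  A matrix representation of the graph: entry (i, j) is nonzero (1, or the edge weight) if there is an edge from node i to node j, and 0 otherwise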

https://www.ebi.ac.uk/training/online/course/network-analysis-protein-interaction-data-introduction/introduction-graph-theory/graph-0
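As a minimal sketch, an adjacency matrix can be built from a weighted edge list as follows. The node names, weights, and the `adjacency_matrix` helper are assumptions made for illustration.

```python
def adjacency_matrix(nodes, edge_list, directed=False):
    """Build a dense adjacency matrix (list of lists) from (u, v, weight) triples."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0.0] * n for _ in range(n)]
    for u, v, w in edge_list:
        matrix[index[u]][index[v]] = w
        if not directed:
            # An undirected graph has a symmetric adjacency matrix.
            matrix[index[v]][index[u]] = w
    return matrix


# Hypothetical weighted graph (illustrative only).
nodes = ["A", "B", "C", "D"]
weighted_edges = [("A", "B", 1.0), ("A", "C", 0.5), ("C", "D", 2.0)]

A_undirected = adjacency_matrix(nodes, weighted_edges)
A_directed = adjacency_matrix(nodes, weighted_edges, directed=True)
```

For an unweighted graph, every weight can simply be set to 1.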

Graph Properties


Path and connectivity:

•  Path: A sequence of distinct edges connecting a sequence of vertices (e.g. GFAB, EAC, etc.)

•  Connected graph: A graph in which a path exists between any two nodes
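A breadth-first search is one common way to find a path and to test connectivity. The sketch below uses a hypothetical edge list, chosen so that the example paths GFAB and EAC exist; the actual graph on the slide is not recoverable here.

```python
from collections import deque


def build_adjacency(edge_list):
    """Return {node: set(neighbors)} for an undirected graph."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path(adj, source, target):
    """Breadth-first search; return a list of vertices, or None if no path exists."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in sorted(adj[node]):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


def is_connected(adj):
    """A graph is connected if every node is reachable from an arbitrary start node."""
    if not adj:
        return True
    start = next(iter(adj))
    return all(shortest_path(adj, start, node) is not None for node in adj)


# Hypothetical edges (illustrative only).
edges = [("G", "F"), ("F", "A"), ("A", "B"), ("E", "A"), ("A", "C")]
adj = build_adjacency(edges)
path_gb = shortest_path(adj, "G", "B")
path_ec = shortest_path(adj, "E", "C")
connected = is_connected(adj)
# Adding an isolated edge X-Y makes the graph disconnected.
disconnected = is_connected(build_adjacency(edges + [("X", "Y")]))
```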

Graph Properties


Important classes of graphs:

•  Tree: Any two vertices are connected by exactly one path (e.g. dendrogram in hierarchical clustering)

•  Complete graph: Each pair of vertices is connected by an edge
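Both classes can be checked with simple counting arguments: an undirected graph is a tree exactly when it is connected and has |V| - 1 edges, and a simple graph is complete when it has |V|(|V| - 1)/2 distinct edges. The sketch below assumes simple undirected graphs; the function names and example graphs are illustrative.

```python
def is_tree(nodes, edge_list):
    """A connected undirected graph with |V| - 1 edges is a tree."""
    if len(edge_list) != len(nodes) - 1:
        return False
    adj = {n: set() for n in nodes}
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    # Depth-first traversal from the first node to test connectivity.
    seen, stack = set(), [nodes[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node] - seen)
    return len(seen) == len(nodes)


def is_complete(nodes, edge_list):
    """Every pair of distinct vertices must be joined by an edge."""
    pairs = {frozenset(e) for e in edge_list if e[0] != e[1]}
    n = len(nodes)
    return len(pairs) == n * (n - 1) // 2


# Hypothetical examples (illustrative only).
star_nodes = ["root", "x", "y", "z"]
star_edges = [("root", "x"), ("root", "y"), ("root", "z")]
triangle_nodes = ["a", "b", "c"]
triangle_edges = [("a", "b"), ("b", "c"), ("a", "c")]

star_is_tree = is_tree(star_nodes, star_edges)
triangle_is_tree = is_tree(triangle_nodes, triangle_edges)
triangle_is_complete = is_complete(triangle_nodes, triangle_edges)
star_is_complete = is_complete(star_nodes, star_edges)
```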

2) Analyzing ‘omic’ data in light of biological networks


Analyzing ‘omic’ data in light of networks


How to analyze large ‘omic’ datasets?

Statistics Machine Learning



Machine learning is concerned with utilizing statistical techniques to give computers the ability to “learn”. However, it can do much more!


Machine Learning in Computational Biology


Some examples:

•  Predicting whether a patient is sensitive or resistant to a drug

•  Predicting the survival probability of a cancer patient

•  Identifying the subtypes of a disease

•  Identifying genes associated with a disease

•  etc.

Machine Learning

•  Supervised learning: Training examples are provided with desired inputs and outputs to help learn the desired rule

•  Unsupervised learning: No training examples exist, and the goal is to learn structure in the data

Machine Learning

•  Supervised learning: classification, regression, supervised feature selection

•  Unsupervised learning: clustering, dimensionality reduction

Unsupervised Machine Learning (Clustering)


•  We have a set of samples characterized using several features (e.g. expression of thousands of genes for tumor samples)

•  Goal: Group the samples such that those in the same group are more similar to each other than to those in other groups

•  Many methods exist such as K-means, hierarchical clustering, matrix factorization, etc.

•  Example: Identifying subtypes of breast cancer using transcriptomic data
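To make the clustering goal concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The two-feature toy data and the choice of initial centroids are assumptions for illustration; real expression data would have thousands of features and would normally use a library implementation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, n_iter=20):
    """Alternate assignment and centroid-update steps until the centroids stop moving."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical samples with two features each (illustrative only).
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

The result depends on the initial centroids, which is why practical implementations usually restart from several random initializations.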

Unsupervised ML (Dimensionality Reduction)


•  We have a set of samples characterized using several features

•  Goal: Reduce the number of features while preserving characteristics of the data

•  Many methods exist such as principal component analysis, linear discriminant analysis, etc.

•  Example: PCA identifies a few principal components, orthogonal to each other, such that they account for most of the variance in the data
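As a sketch of the PCA example, the first principal component can be found by power iteration on the sample covariance matrix. This is only one of several ways to compute it (library implementations typically use an SVD), and the toy data and function name are assumptions.

```python
import math


def first_principal_component(X, n_iter=200):
    """Return (direction, explained variance ratio, projected scores) for the first PC."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)] for i in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance_along_v = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    scores = [sum(r[j] * v[j] for j in range(d)) for r in Xc]
    return v, variance_along_v / total_variance, scores


# Hypothetical samples lying close to the line y = x (illustrative only).
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
pc1, explained, scores = first_principal_component(X)
```

In this toy example a single component captures nearly all of the variance, so the two features can be replaced by one score per sample.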

Supervised Machine Learning (Classification)


Classification:

•  We have a set of samples characterized using several features (e.g. expression of thousands of genes for tumor samples)

•  The samples belong to a set of known categories

•  Goal: Given a new sample, to which category does it belong?

•  Many methods exist such as KNN, SVM, logistic regression, decision trees, random forests, etc.
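Of the methods listed, KNN is the simplest to sketch: a new sample is assigned the majority label among its k nearest training samples. The feature values and labels below are made up for illustration.

```python
from collections import Counter


def knn_predict(train_X, train_y, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, x_new)), label)
        for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training samples with two features each (illustrative only).
train_X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
train_y = ["sensitive", "sensitive", "sensitive", "resistant", "resistant", "resistant"]

prediction_a = knn_predict(train_X, train_y, [1.0, 1.0])
prediction_b = knn_predict(train_X, train_y, [3.0, 3.0])
```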

Supervised Machine Learning (Classification)


Example:

•  We have ‘omic’ profiles and clinical information of breast cancer patients

•  We also know which patients were resistant to a drug and which ones were not

•  Given the ‘omic’ profiles and clinical information of a new patient, will they be resistant to the drug or not?

[Figure: samples characterized by ‘omic’ and clinical features, each labeled with a drug response outcome (✗ or ✓)]

Supervised Machine Learning (Regression)


•  We have a set of samples characterized using several features (e.g. expression of thousands of genes for tumor samples)

•  For each sample, we know a continuous-valued response (dependent variable) (e.g. number of years between diagnosis and occurrence of metastasis)

•  Goal: Estimate the relationship between the response and features and predict the value of response for a new sample

•  Many methods exist such as linear regression, LASSO, Elastic Net, Support vector regression, etc.
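The simplest case, linear regression with a single feature, has a closed-form least-squares solution, sketched below. The data are hypothetical; with thousands of genes as features, regularized methods such as LASSO or Elastic Net are usually preferred.

```python
def fit_line(x, y):
    """Ordinary least squares for y ≈ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical feature and response values (illustrative only).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(x, y)
# Predicted response for a new sample with feature value 5.0.
prediction = slope * 5.0 + intercept
```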

Supervised Machine Learning (Regression)


Example:

•  We have transcriptomic profiles of breast cancer patients

•  We also know the number of months between diagnosis and occurrence of metastasis

•  What is the relationship between gene expression and time of metastasis?

[Figure: samples-by-genes expression matrix]

https://www.cancer.gov/types/metastatic-cancer

Supervised Machine Learning (Feature Selection)


•  We have a set of samples characterized using several features

•  We know a continuous-valued or categorical response for samples

•  Goal: What are the features most predictive of the response?

Examples:

•  Differentially expressed genes (case vs. control)

•  Correlation analysis (GWAS)

•  etc.
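For a categorical (case vs. control) response, one simple way to rank genes is by the absolute value of a two-sample t-statistic. The sketch below uses Welch's t-statistic as an illustrative choice; the gene names and expression values are hypothetical, and a real analysis would also compute p-values and correct for multiple testing.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups (assumes at least one group has nonzero variance)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes_by_t(expr, is_case):
    """Rank genes by |t| between case and control samples, most differential first."""
    scores = {}
    for gene, values in expr.items():
        case = [v for v, c in zip(values, is_case) if c]
        control = [v for v, c in zip(values, is_case) if not c]
        scores[gene] = abs(welch_t(case, control))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical expression values: three case samples followed by three controls.
expr = {
    "gene_a": [5.1, 4.9, 5.3, 1.0, 1.2, 0.9],
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "gene_c": [3.0, 3.5, 2.8, 2.5, 2.9, 3.1],
}
is_case = [True, True, True, False, False, False]

ranking = rank_genes_by_t(expr, is_case)
```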

[Figure: samples-by-genes matrix alongside a response that may be continuous or categorical]

Network guided analysis


How can biological networks help?

•  When features correspond to genes or proteins (e.g. gene expression, mutation, etc.), these networks can provide information regarding the interactions and relationships of these features.

Network-guided gene prioritization using ProGENI


Background

•  Phenotypic properties of a cell are determined (partially) by expression of its genes and proteins

•  Gene expression profiling measures the activity of thousands of genes to create a global picture of cellular function


Background

•  Goal: Identifying genes whose basal mRNA expression determines the drug sensitivity in different samples (supervised feature selection)

•  Motivations:

•  Overcoming drug resistance

•  Revealing drug mechanism of action

•  Identifying novel drug targets

•  Predicting drug sensitivity of individuals



Gene prioritization


Examples of current methods:

•  Score each gene based on the correlation of its expression with drug response
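The correlation-based scoring described above can be sketched as follows: compute the Pearson correlation between each gene's expression and the drug response across samples, then rank genes by its absolute value. Gene names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_genes_by_correlation(expr, response):
    """Rank genes by |Pearson correlation| with the response, strongest first."""
    scores = {gene: abs(pearson(values, response)) for gene, values in expr.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical drug response (e.g. a sensitivity measure) for five samples.
response = [1.0, 2.0, 3.0, 4.0, 5.0]
expr = {
    "gene_a": [2.1, 3.9, 6.2, 8.0, 9.9],  # increases with the response
    "gene_b": [5.0, 4.1, 3.0, 2.2, 0.9],  # decreases with the response
    "gene_c": [1.0, 3.0, 2.0, 3.0, 1.0],  # uncorrelated with the response
}

ranking, scores = rank_genes_by_correlation(expr, response)
```

Using the absolute value keeps both positively and negatively correlated genes near the top of the list.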

•  Use multivariable regression algorithms such as Elastic Net to relate multiple genes’ expression values to drug response (e.g. modeling the response as a weighted sum \sum_i w_i x_i of the expression values x_i)

Shortcoming:

•  These methods do not incorporate prior information about the interaction of the genes


ProGENI

Hypothesis:

•  Since genes and proteins involved in drug MoA are functionally related, prior knowledge in the form of a gene interaction network (e.g. PPI) can improve the accuracy of the prioritization task

ProGENI

ProGENI: Network-guided gene prioritization

•  An algorithm that incorporates gene network information to improve prioritization accuracy


RESEARCH, Open Access

Knowledge-guided gene prioritization reveals new insights into the mechanisms of chemoresistance

Amin Emad, Junmei Cairns, Krishna R. Kalari, Liewei Wang and Saurabh Sinha

Abstract

Background: Identification of genes whose basal mRNA expression predicts the sensitivity of tumor cells to cytotoxic treatments can play an important role in individualized cancer medicine. It enables detailed characterization of the mechanism of action of drugs. Furthermore, screening the expression of these genes in the tumor tissue may suggest the best course of chemotherapy or a combination of drugs to overcome drug resistance.

Results: We developed a computational method called ProGENI to identify genes most associated with the variation of drug response across different individuals, based on gene expression data. In contrast to existing methods, ProGENI also utilizes prior knowledge of protein–protein and genetic interactions, using random walk techniques. Analysis of two relatively new and large datasets including gene expression data on hundreds of cell lines and their cytotoxic responses to a large compendium of drugs reveals a significant improvement in prediction of drug sensitivity using genes identified by ProGENI compared to other methods. Our siRNA knockdown experiments on ProGENI-identified genes confirmed the role of many new genes in sensitivity to three chemotherapy drugs: cisplatin, docetaxel, and doxorubicin. Based on such experiments and extensive literature survey, we demonstrate that about 73% of our top predicted genes modulate drug response in selected cancer cell lines. In addition, global analysis of genes associated with groups of drugs uncovered pathways of cytotoxic response shared by each group.

Conclusions: Our results suggest that knowledge-guided prioritization of genes using ProGENI gives new insight into mechanisms of drug resistance and identifies genes that may be targeted to overcome this phenomenon.

Keywords: Chemoresistance, Chemotherapy, Drug sensitivity, Gene interaction network, Gene prioritization, Network-based algorithm

Background

The goal of gene prioritization is to rank genes with respect to their relationship to a phenotype (e.g., occurrence of a disease, response to a drug, etc.), providing an experimentalist a way to prioritize genetic perturbation tests and leading to discovery of genes affecting the phenotype [1]. In the context of drug design and drug sensitivity, various gene prioritization techniques have been used to identify drug targets, reveal mechanisms of action (MoAs) of drugs, and identify genes associated with drug response, as well as for drug repositioning [2–5].

It has been previously shown that gene expression is the most informative currently available ‘omic’ feature with respect to drug sensitivity prediction [6], and it has been also successfully used to predict drug response in large clinical studies [7]. Basal gene expression of cancer cell lines (CCLs) has been used to rank genes by their role in cytotoxic drug resistance, utilizing correlation analysis [2, 8–11] or feature selection and regression techniques [12–16] to statistically associate drug response with gene expression profiles of cell lines. At the same time, many genes with key roles escape identification based on expression profiling alone, due to the complexity of drug MoA and noisy data [2], and due to the fact that current methods overlook known functional …


Emad et al. Genome Biology (2017) 18:153 DOI 10.1186/s13059-017-1282-3


ProGENI

Step 1: Generate new features representing expression of each gene and the activity level of their neighbors weighted proportional to their relevance


[Figure: ProGENI pipeline. Inputs: gene expressions (genes × cell lines), drug response (e.g. IC50), and the network. Overall procedure: randomly select 80% of cell lines, rank all genes (prioritization), repeat Nr times, and aggregate the ranked lists of genes. Prioritization: perform network transformation of gene expressions; identify response-correlated genes (RCG) and use them as the restart set for a random walk with restart (RWR); obtain the equilibrium probability distribution for the nodes; normalize w.r.t. the global network distribution; rank genes according to normalized probability scores.]

ProGENI






(Rosvall and Bergstrom, 2007)


ProGENI


Step 2: Find genes most correlated with drug response (RCG set)




ProGENI



Step 3: Score genes based on their relevance to the RCG set




Step 4: Remove network bias by normalizing scores w.r.t. scores corresponding to global network topology
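Steps 3 and 4 can be illustrated with a small random walk with restart (RWR). The sketch below is not the ProGENI implementation: the toy network, the restart probability, the unweighted edges, and in particular the normalization (subtracting the scores of a walk that restarts uniformly from all nodes) are assumptions made for illustration.

```python
def random_walk_with_restart(adj, restart_nodes, restart_prob=0.5, n_iter=200):
    """Iterate p <- r * restart + (1 - r) * (one random-walk step from p).

    adj maps each node to a list of neighbors; every node needs at least one.
    """
    nodes = sorted(adj)
    restart = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0) for n in nodes}
    p = dict(restart)
    for _ in range(n_iter):
        nxt = {n: restart_prob * restart[n] for n in nodes}
        for u in nodes:
            # Node u spreads its probability mass equally over its neighbors.
            share = (1.0 - restart_prob) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p


# Hypothetical network: a chain g1..g5 plus a hub connected to every gene.
adj = {
    "g1": ["g2", "hub"],
    "g2": ["g1", "g3", "hub"],
    "g3": ["g2", "g4", "hub"],
    "g4": ["g3", "g5", "hub"],
    "g5": ["g4", "hub"],
    "hub": ["g1", "g2", "g3", "g4", "g5"],
}

# Step 3 (sketch): relevance of every gene to a hypothetical RCG set {g1}.
p_rcg = random_walk_with_restart(adj, restart_nodes={"g1"})
# Step 4 (sketch): scores driven by network topology alone (restart from all nodes).
p_global = random_walk_with_restart(adj, restart_nodes=set(adj))
normalized = {n: p_rcg[n] - p_global[n] for n in adj}
```

Without the normalization, highly connected nodes such as `hub` tend to receive large scores regardless of the restart set; comparing against the global walk reduces that bias.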




Datasets

•  Human lymphoblastoid cell lines (LCL): gene expression (~17K genes of ~300 cell lines); drug response of 24 cytotoxic treatments

•  Publicly available dataset from GDSC: gene expression (~13K genes of ~600 cell lines from 13 tissues); drug response of 139 cytotoxic treatments

•  Publicly available prior knowledge: network of gene interactions (PPI and genetic interactions) from STRING (~1.5M edges, ~15.5K nodes)


Data Sources for Knowledge Network

•  Philosophy: Rely on existing collections

•  Protein-Protein Interactions (40 M)

•  Experimentally determined physical and genetic interactions

•  Literature-based co-occurrence

•  Many other types

•  Sources for experimental interactions (1.4 M)

[Figure: interactions among 12 genes]

Validation using drug response prediction

•  Genes ranked highly using a good prioritization method are good predictors of drug sensitivity



[Figure: validation pipeline. Repeat Nr times: divide samples (cell lines) into training and test sets; rank all genes; train an SVR on the training set using expression of highly ranked genes; predict drug sensitivity of the test set.]

Validation using drug response prediction


LCL Dataset                                       Pearson    Elastic Net
Num. drugs (out of 24) with ProGENI > Baseline    14         20
FDR (Wilcoxon signed-rank test)                   6.5E-3     9.6E-5

GDSC Dataset                                      Pearson    Elastic Net
Num. drugs (out of 139) with ProGENI > Baseline   66         110
FDR (Wilcoxon signed-rank test)                   9.1E-4     4.0E-21

[Figure: panels A–D, scatter plots of SPCI(ProGENI-SVR) versus SPCI(PCC-SVR) and SPCI(EN-SVR)]

Functional validation

We validated the role of 33 (out of 45) genes (73%) for three drugs.


A

Gene Symbol      Rank (ProGENI)   Rank (Pearson)   Absolute value of Pearson correlation coefficient   Evidence
TUBB6            2                2                0.2759                                              Direct (this study)
DYNC2H1          3                4                0.2680                                              Direct (this study)
CLDN3            4                7                0.2602                                              Direct (literature)
SPARC            5                8                0.2574                                              Direct (literature)
GJA1             6                6                0.2623                                              Direct (literature)
ITGA5            7                11               0.2466                                              Direct (literature)
TPM2             8                9                0.2567                                              Direct (literature)
MMP2             9                37               0.2160                                              Direct (literature)
AXL              12               15               0.2373                                              Direct (literature)
ENG              13               47               0.2089                                              Direct (literature)
ELK3             14               13               0.2394                                              Direct (this study)
TIMP1            15               29               0.2207                                              Direct (literature)
FSCN1            1                1                0.2879                                              Not found
FHL3             10               10               0.2477                                              Not found
MMP14            11               39               0.2143                                              Not found

B

Gene Symbol      Rank (ProGENI)   Rank (Pearson)   Absolute value of Pearson correlation coefficient   Evidence
CAV1             1                8                0.3713                                              Direct (literature)
YAP1             2                1                0.4148                                              Direct (literature)
WWTR1            3                4                0.4075                                              Direct (literature)
AXL              6                2                0.4098                                              Direct (literature)
MMP14            7                22               0.3525                                              Direct (literature)
CYR61            9                6                0.3791                                              Direct (literature)
CAV2             10               16               0.3566                                              Direct (literature)
GNG12            11               5                0.3792                                              Direct (this study)
CTSB             12               27               0.3462                                              Direct (literature)
FSTL1            14               17               0.3557                                              Direct (this study)
ST5              15               7                0.3782                                              Direct (this study)
PDGFC            4                13               0.3659                                              Not found
PTRF             5                3                0.4094                                              Not found
ITGB5            8                21               0.3534                                              Not found
PLAU             13               110              0.3033                                              Not found

C

Gene Symbol      Rank (ProGENI)   Rank (Pearson)   Absolute value of Pearson correlation coefficient   Evidence
ATF1             1                1                0.2000                                              Direct (this study)
MIS12            2                4                0.1887                                              Direct (this study)
OSBPL2           5                6                0.1865                                              Direct (this study)
CSNK2A1          7                1587             0.0752                                              Direct (literature)
PSIP1 (LEDGF)    8                46               0.1537                                              Direct (literature)
CAMK2A           9                6991             0.0157                                              Direct (literature)
CSNK2A2          10               4870             0.0347                                              Direct (literature)
GOSR1            11               6867             0.0167                                              Direct (this study)
MAPK8            13               7574             0.0112                                              Direct (literature)
SPI1             14               6287             0.0217                                              Direct (literature)
CREB1            15               665              0.1000                                              Direct (literature)
NOC3L            3                3                0.1893                                              Not found
IL27RA           4                2                0.1911                                              Not found
MGEA5            6                7                0.1814                                              Not found
WAPAL            12               8                0.1805                                              Not found

[Figure: panels A–C, experimental results in the BT549 and MDA-MB-231 cell lines, with all reported p-values at or below 0.0018; panel D, networks of genes for cisplatin, docetaxel, and doxorubicin]

How about other ML tasks?


•  Similar principles can be used for ML tasks other than feature selection/prioritization

•  “Network-smoothing” of the features used in ProGENI can be used as a preprocessing step to regression and classification algorithms

•  Network-smoothing can also be used for clustering and dimensionality reduction (e.g. Network-based stratification)

Network-based Stratification


Goal:

•  Stratification (clustering) of tumor samples based on somatic mutation profiles

Main Issue:

•  The mutation data is very sparse and most conventional clustering techniques fail to identify reasonable patterns

•  Although two tumors may not share the same somatic mutations, they may affect the same pathways and interaction networks

Value of network-guided analysis


Data sparsity:

•  Due to the sparsity of the data, all samples are at equal distance from each other

•  Pathway information clarifies the similarity among some samples

•  Conventional clustering methods can identify clusters based on network-smoothed features

NBS (Algorithm Overview)


•  Employs network smoothing to mitigate sparsity by transforming the binary gene-level somatic mutation vectors of patients into a continuous gene importance vector that captures the proximity of each gene in the network to all of the genes with somatic mutations in the patient sample

•  Bootstrap sampling enables robust clustering
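The effect of network smoothing on sparse mutation data can be sketched with a simple propagation rule, f ← α · (one network step from f) + (1 − α) · f0, where f0 is the binary mutation vector of a patient. The update rule as written, the value of α, the toy network of two disconnected "pathways", and the patient profiles are all assumptions for illustration, not the NBS implementation.

```python
import math


def smooth(profile, adj, alpha=0.7, n_iter=100):
    """Propagate a binary mutation profile over the network.

    adj maps each gene to a list of neighbors; every gene needs at least one.
    """
    genes = sorted(adj)
    f0 = {g: float(profile.get(g, 0)) for g in genes}
    f = dict(f0)
    for _ in range(n_iter):
        nxt = {g: (1.0 - alpha) * f0[g] for g in genes}
        for u in genes:
            # Gene u passes its current value equally to its neighbors.
            share = alpha * f[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        f = nxt
    return f


def distance(p, q):
    """Euclidean distance between two gene-indexed vectors."""
    return math.sqrt(sum((p[g] - q[g]) ** 2 for g in p))


# Hypothetical network with two pathways (triangles a1-a2-a3 and b1-b2-b3).
adj = {
    "a1": ["a2", "a3"], "a2": ["a1", "a3"], "a3": ["a1", "a2"],
    "b1": ["b2", "b3"], "b2": ["b1", "b3"], "b3": ["b1", "b2"],
}
# Hypothetical patients: p1 and p2 carry different mutations in the same pathway.
patients = {"p1": {"a1": 1}, "p2": {"a3": 1}, "p3": {"b2": 1}}

raw = {name: {g: float(m.get(g, 0)) for g in adj} for name, m in patients.items()}
smoothed = {name: smooth(m, adj) for name, m in patients.items()}

raw_12, raw_13 = distance(raw["p1"], raw["p2"]), distance(raw["p1"], raw["p3"])
smooth_12, smooth_13 = distance(smoothed["p1"], smoothed["p2"]), distance(smoothed["p1"], smoothed["p3"])
```

On the raw binary profiles all three patients are equally far apart; after smoothing, the two patients with mutations in the same pathway become closer to each other than to the third.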

3) Reconstruction of Biological Networks


Gene Co-expression Networks


•  Nodes represent genes

•  An edge exists between two genes that are highly co-expressed across different samples

[Figure: scatter plots of gene 1 versus gene 2 expression, computed from a samples-by-genes expression matrix]



•  Such networks provide a global view of co-expression patterns

•  However, they do not provide information on how these networks relate to the variation in a phenotypic outcome
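A basic co-expression network can be sketched by computing the pair-wise Pearson correlation between genes across samples and keeping only the strong correlations as edges. The threshold, the use of the absolute correlation, and the expression values below are illustrative assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(expr, threshold=0.8):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression of three genes across five samples.
expr = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.1, 3.9, 6.0, 8.2, 9.9],
    "gene_c": [3.0, 1.0, 4.0, 1.0, 5.0],
}

edges = coexpression_edges(expr)
```

With these values only `gene_a` and `gene_b` are connected; `gene_c` is only weakly correlated with both and is filtered out.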



How can we relate these networks to the phenotypic variation?


[Figure: gene-gene correlation: calculate pair-wise correlations from the samples-by-genes matrix and filter out small correlations. Gene-phenotype association: relate the same matrix to a continuous or categorical phenotype.]



Approach 1: In reconstructing the network, we can limit our samples to one manifestation of the phenotypic outcome

•  For example, build a Basal-like co-expression network by looking at the gene correlations across Basal breast cancer samples

•  Issues:

1.  Only works if we have a categorical phenotype

2.  Does not relate the network to the variation in the phenotypic outcome



Approach 2: If the phenotype is binary, reconstruct two networks (one for each manifestation of the phenotype) and compare the two to build a differential network

•  Shows changes in the co-expression pattern

[Figure: case network, control network, and the resulting differential network]



•  Issues:

1.  Becomes very cumbersome if the phenotype is not binary

2.  Does not work for continuous-valued phenotypes

3.  By dividing the samples into two groups, we will have less statistical power in identifying co-expression patterns

4.  Fails in the case shown below

[Figure: scatter plot of Gene 1 versus Gene 2 expression]



Approach 3: First, find genes associated with the phenotype and then reconstruct a context-specific network only using those genes

•  Issue: Ignores the strength (p-value) of gene-phenotype association


[Figure: pipeline over the samples-by-genes matrix with the steps "filter out non-DEGs", "calculate pair-wise correlations", and "filter out small correlations"]



Approach 4: Calculate p-values of gene-gene correlation and gene-phenotype associations separately and combine them using Fisher’s method or Stouffer’s method → Simplified-InPheRNo

•  Particularly useful for identifying transcription factor-gene-phenotype associations
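Both combination rules are short enough to sketch directly. For two p-values, Fisher's statistic −2(ln p1 + ln p2) follows a chi-square distribution with 4 degrees of freedom under the null, whose tail probability has a closed form; Stouffer's method averages the corresponding z-scores. The function names and example p-values are hypothetical, and this is not the InPheRNo code.

```python
import math
from statistics import NormalDist


def fisher_combine_two(p1, p2):
    """Fisher's method for two independent p-values (both must be > 0).

    X = -2 (ln p1 + ln p2) is chi-square with 4 d.o.f. under the null, and
    P(chi2_4 > x) = exp(-x / 2) * (1 + x / 2).
    """
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def stouffer_combine(p_values):
    """Stouffer's method (unweighted); each p-value must lie strictly in (0, 1)."""
    normal = NormalDist()
    z = sum(normal.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - normal.cdf(z)


# Hypothetical p-values for a gene-gene correlation and a gene-phenotype association.
p_gene_gene, p_gene_phenotype = 0.05, 0.05

p_fisher = fisher_combine_two(p_gene_gene, p_gene_phenotype)
p_stouffer = stouffer_combine([p_gene_gene, p_gene_phenotype])
```

In both cases two moderately small p-values combine into a smaller one, so an edge is supported only when both associations point the same way.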

[Figure: scatter plots of gene expression versus gene expression and of gene expression versus phenotype; the two p-values are combined]

Questions?


Regression algorithms

•  Lasso: learns a linear model from the training data using only a few features (sparse linear model)

\hat{\beta} = \arg\min_{\beta} \left( \|y - X\beta\|^2 + \lambda_1 \|\beta\|_1 \right)

•  Elastic Net: learns a linear model from the training data by linearly combining ridge and Lasso regression regularization terms (a generalization of both Lasso and ridge regression)

\hat{\beta} = \arg\min_{\beta} \left( \|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1 \right)

Regression algorithms

•  Kernel-SVR:

•  Linear SVR learns a linear model such that it has at most ε-deviation from the response values and is as flat as possible

(Smola and Schölkopf, 1998)

•  Kernel-SVR generalizes the idea to nonlinear models by mapping the features to a high-dimensional kernel space
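The "at most ε-deviation" idea corresponds to the ε-insensitive loss: errors smaller than ε cost nothing, and larger errors are penalized only by the amount they exceed ε. The sketch below shows the loss alone, with hypothetical values; it does not fit an SVR model.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y - f(x)| - epsilon)."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Hypothetical responses and predictions (illustrative only).
y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 2.0]

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
```

The first prediction falls inside the ε band and incurs no loss; the other two are charged only for the part of the error beyond ε.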


[Figure: ε-insensitive band (−ε to +ε) around the fitted function, with slack ζ for points outside the band]