National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
Introduction to Systems Biology II
Amin Emad
NIH BD2K KnowEnG Center of Excellence in Big Data Computing Carl R. Woese Institute for Genomic Biology
Department of Computer Science University of Illinois at Urbana-Champaign
June, 2018
Systems Biology
• Systems biology is the computational and mathematical modeling of complex biological systems (Wikipedia).
• Studies the interactions between the components of biological systems such as genes, proteins, metabolites, etc. (i.e. biological networks), and how these interactions give rise to the function and behavior of that system (phenotype)
NIH Big Data Center of Excellence 2
Biological Networks
A graphical representation of the interactions among the components of a biological system
[Figure: The cell as a system, with its signaling, transcriptional regulatory, metabolic, gene co-expression, and protein interaction networks. Source: Zhang (2009), BMIF310, Fall 2009]
• Cell signaling networks
• Gene regulatory networks
• Protein-protein interaction networks
• Gene co-expression networks
• Metabolic networks
Biological Networks in Computational Biology
• Analyzing network properties
• Analyzing ‘omic’ data in light of networks
• Reconstructing biological networks
Graph Theory, Machine Learning, Statistics
1) Analyzing network properties
What is a network/graph?
• Graph: A representation of relationships among objects
• A graph G(V, E) consists of a set of vertices (nodes) V and a set of edges (links) E
Directed vs. Undirected:
Undirected graph
• Protein-protein interactions
• Co-expression network
Directed graph
• Gene regulatory network
• Signaling pathways
Graph Properties
Weighted vs. Unweighted:
• Weights represent affinity in PPI, correlation coefficient in a co-expression network, confidence in a GRN, etc.
[Figure: a weighted graph and an unweighted graph]
Graph Properties
Degree and degree distribution:
• Degree: Number of connections of a node to other nodes
• Indegree (outdegree) of a node in a directed graph is the number of edges entering (leaving) that node
• Degree distribution of a network is the probability distribution of these degrees over the network: $P(k) = n_k / n$, where $n_k$ is the number of nodes with degree $k$ and $n$ is the total number of nodes
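The degree and degree distribution defined above can be sketched in a few lines of Python. The edge list is a made-up toy graph, and the function names are hypothetical. Note that nodes without any edge do not appear in an edge list, so this sketch ignores isolated nodes.

```python
from collections import Counter

# Hypothetical undirected toy graph, given as an edge list.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

def degrees(edge_list):
    """Return a dict mapping each node to its degree (undirected graph)."""
    deg = Counter()
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

def degree_distribution(edge_list):
    """Return P(k): the fraction of nodes that have degree k."""
    deg = degrees(edge_list)
    counts = Counter(deg.values())
    n = len(deg)
    return {k: c / n for k, c in sorted(counts.items())}

deg = degrees(edges)
dist = degree_distribution(edges)
print(deg)   # degree of every node
print(dist)  # fraction of nodes with each degree
```

For a directed graph, the same idea applies with two counters, one incremented for the source of each edge (outdegree) and one for the target (indegree).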
Graph Properties
Adjacency matrix:
• A matrix representation of the graph: entry (i, j) is nonzero when an edge connects node i to node j
https://www.ebi.ac.uk/training/online/course/network-analysis-protein-interaction-data-introduction/introduction-graph-theory/graph-0
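As a minimal sketch, an adjacency matrix can be built from a weighted edge list as follows. The gene names and weights are invented for illustration. For an undirected graph the matrix is symmetric; for a directed graph only the entry from source to target is filled.

```python
def adjacency_matrix(nodes, edge_list, directed=False):
    """Build a weighted adjacency matrix as a list of rows.

    edge_list holds (source, target, weight) triples; use weight 1.0
    for an unweighted graph.
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0.0] * n for _ in range(n)]
    for u, v, w in edge_list:
        matrix[index[u]][index[v]] = w
        if not directed:
            # Undirected edges are stored in both directions.
            matrix[index[v]][index[u]] = w
    return matrix

# Hypothetical toy network with three genes.
nodes = ["g1", "g2", "g3"]
weighted_edges = [("g1", "g2", 0.9), ("g2", "g3", 0.4)]

A_undirected = adjacency_matrix(nodes, weighted_edges)
A_directed = adjacency_matrix(nodes, weighted_edges, directed=True)
for row in A_undirected:
    print(row)
```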
Graph Properties
Path and connectivity:
• Path: A sequence of distinct edges connecting a sequence of vertices: GFAB, EAC, etc.
• Connected graph: A graph in which a path exists between any two nodes
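One common way to find a path and to test connectivity is breadth-first search. The sketch below uses a hypothetical graph chosen so that it contains the paths GFAB and EAC; it is not necessarily the graph shown on the slide.

```python
from collections import deque

def neighbors_map(edge_list):
    """Turn an undirected edge list into a dict: node -> set of neighbors."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def find_path(adj, source, target):
    """Breadth-first search; return a shortest path as a list of nodes, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def is_connected(adj):
    """A graph is connected if every node can be reached from the first one."""
    nodes = list(adj)
    if not nodes:
        return True
    return all(find_path(adj, nodes[0], t) is not None for t in nodes[1:])

# Hypothetical toy graph.
edges = [("G", "F"), ("F", "A"), ("A", "B"), ("E", "A"), ("A", "C")]
adj = neighbors_map(edges)
print(find_path(adj, "G", "B"))
print(is_connected(adj))
```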
Graph Properties
Important classes of graphs:
• Tree: Any two vertices are connected by exactly one path (e.g. dendrogram in hierarchical clustering)
• Complete graph: Each pair of vertices is connected by an edge
2) Analyzing ‘omic’ data in light of biological networks
Analyzing ‘omic’ data in light of networks
How to analyze large ‘omic’ datasets? With statistics and machine learning.
Machine learning is concerned with utilizing statistical techniques to give computers the ability to “learn”. However, it can do much more!
Machine Learning in Computational Biology
Some examples:
• Predicting whether a patient is sensitive or resistant to a drug
• Predicting the survival probability of a cancer patient
• Identifying the subtypes of a disease
• Identifying genes associated with a disease
• etc.
Machine Learning
• Supervised learning: Training examples are provided with desired inputs and outputs to help learn the desired rule
  • Classification
  • Regression
  • Supervised feature selection
• Unsupervised learning: No training examples exist and the goal is to learn structure in the data
  • Clustering
  • Dimensionality reduction
Unsupervised Machine Learning (Clustering)
• We have a set of samples characterized using several features (e.g. expression of thousands of genes for tumor samples)
• Goal: Group the samples such that those in the same group are more similar to each other than to those in other groups
• Many methods exist such as K-means, hierarchical clustering, matrix factorization, etc.
• Example: Identifying subtypes of breast cancer using transcriptomic data
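As an illustration of clustering, here is a minimal K-means sketch in plain Python. The data are invented (six samples described by the expression of two genes), and the initial centroids are passed in explicitly so that the run is deterministic; practical implementations usually pick them at random and restart several times.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def k_means(samples, initial_centroids, max_iter=100):
    """Alternate between assigning samples to the nearest centroid and
    recomputing each centroid, until the assignment stops changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(s, centroids[j]))
            for s in samples
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                centroids[j] = mean_point(members)
    return labels, centroids

# Hypothetical toy data: rows are samples, columns are two gene expression values.
samples = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
           [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]]
labels, centroids = k_means(samples, [samples[0], samples[3]])
print(labels)
print(centroids)
```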
Unsupervised ML (Dimensionality Reduction)
• We have a set of samples characterized using several features
• Goal: Reduce the number of features while preserving characteristics of the data
• Many methods exist such as principal component analysis, linear discriminant analysis, etc.
• Example: PCA identifies a few principal components, orthogonal to each other, that account for most of the variance in the data
Supervised Machine Learning (Classification)
Classification:
• We have a set of samples characterized using several features (e.g. expression of thousands of genes for tumor samples)
• The samples belong to a set of known categories
• Goal: Given a new sample, to which category does it belong?
• Many methods exist such as KNN, SVM, logistic regression, decision trees, random forests, etc.
Supervised Machine Learning (Classification)
Example:
• We have ‘omic’ profiles and clinical information of breast cancer patients
• We also know which patients were resistant to a drug and which ones were not
• Given the ‘omic’ profiles and clinical information of a new patient, will they be resistant to the drug or not?
[Figure: matrix of samples by ‘omic’ and clinical features, with a ✗ or ✓ drug-response label per patient]
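To make the classification setting concrete, the sketch below applies k-nearest neighbors (KNN), one of the methods listed earlier, to invented data. The feature values and the "sensitive"/"resistant" labels are hypothetical, and a real analysis would involve thousands of features and proper cross-validation.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, new_sample, k=3):
    """Predict the label of new_sample by majority vote among its
    k nearest training samples (Euclidean distance)."""
    distances = sorted(
        (math.dist(x, new_sample), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: each row is a patient, each column a feature.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
train_y = ["sensitive", "sensitive", "sensitive",
           "resistant", "resistant", "resistant"]

print(knn_predict(train_X, train_y, [1.1, 1.0]))
print(knn_predict(train_X, train_y, [4.0, 4.0]))
```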
Supervised Machine Learning (Regression)
• We have a set of samples characterized using several features (e.g. expression of thousands of genes for tumor samples)
• For each sample, we know a continuous-valued response (dependent variable) (e.g. number of years between diagnosis and occurrence of metastasis)
• Goal: Estimate the relationship between the response and features and predict the value of response for a new sample
• Many methods exist such as linear regression, LASSO, Elastic Net, Support vector regression, etc.
Supervised Machine Learning (Regression)
Example:
• We have transcriptomic profiles of breast cancer patients
• We also know the number of months between diagnosis and occurrence of metastasis
• What is the relationship between gene expression and time of metastasis?
[Figure: gene expression matrix (genes × samples). Image source: https://www.cancer.gov/types/metastatic-cancer]
Supervised Machine Learning (Feature Selection)
• We have a set of samples characterized using several features
• We know a continuous-valued or categorical response for samples
• Goal: What are the features most predictive of the response?
Examples:
• Differentially expressed genes (case vs. control)
• Correlation analysis (GWAS)
• etc.
[Figure: gene expression matrix (genes × samples) alongside a continuous or categorical response]
Network guided analysis
How can biological networks help?
• When features correspond to genes or proteins (e.g. gene expression, mutation, etc.), these networks can provide information regarding the interactions and relationships of these features.
[Figure: gene expression matrix (genes × samples)]
Network-guided gene prioritization using ProGENI
Background
• Phenotypic properties of a cell are determined (partially) by expression of its genes and proteins
• Gene expression profiling measures the activity of thousands of genes to create a global picture of cellular function
[Figure: gene expression matrix (genes × samples)]
Background
• Goal:
  • Identifying genes whose basal mRNA expression determines the drug sensitivity in different samples (supervised feature selection)
• Motivations:
  • Overcoming drug resistance
  • Revealing drug mechanism of action
  • Identifying novel drug targets
  • Predicting drug sensitivity of individuals
Gene prioritization
Examples of current methods:
• Score each gene based on the correlation of its expression with drug response
• Use multivariable regression algorithms such as Elastic Net to relate multiple genes’ expression values to drug response
Shortcoming:
• These methods do not incorporate prior information about the interaction of the genes
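The first baseline above, scoring each gene by the correlation of its expression with drug response, can be sketched as follows. The expression values and responses are invented, the gene names are placeholders, and ranking by the absolute value of the Pearson correlation is an assumption of this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / (sd_x * sd_y)

def rank_genes(expression, response):
    """Rank genes by |Pearson correlation| between expression and response."""
    scores = {gene: abs(pearson(values, response))
              for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: expression of three genes in five cell lines,
# and a drug response value (e.g. IC50) for each cell line.
expression = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [5, 3, 4, 1, 2],
    "geneC": [2, 2, 3, 2, 3],
}
response = [0.1, 0.2, 0.3, 0.4, 0.5]

ranking = rank_genes(expression, response)
for gene, score in ranking:
    print(gene, round(score, 3))
```

Each gene is scored on its own here, which is exactly the shortcoming noted above: no information about gene interactions enters the ranking.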
ProGENI
Hypothesis:
• Since genes and proteins involved in drug mechanism of action (MoA) are functionally related, prior knowledge in the form of a gene interaction network (e.g. PPI) can improve the accuracy of the prioritization task
ProGENI
ProGENI: Network-guided gene prioritization
• An algorithm that incorporates gene network information to improve prioritization accuracy
RESEARCH Open Access
Knowledge-guided gene prioritization reveals new insights into the mechanisms of chemoresistance
Amin Emad, Junmei Cairns, Krishna R. Kalari, Liewei Wang* and Saurabh Sinha*
Abstract
Background: Identification of genes whose basal mRNA expression predicts the sensitivity of tumor cells to cytotoxic treatments can play an important role in individualized cancer medicine. It enables detailed characterization of the mechanism of action of drugs. Furthermore, screening the expression of these genes in the tumor tissue may suggest the best course of chemotherapy or a combination of drugs to overcome drug resistance.
Results: We developed a computational method called ProGENI to identify genes most associated with the variation of drug response across different individuals, based on gene expression data. In contrast to existing methods, ProGENI also utilizes prior knowledge of protein–protein and genetic interactions, using random walk techniques. Analysis of two relatively new and large datasets including gene expression data on hundreds of cell lines and their cytotoxic responses to a large compendium of drugs reveals a significant improvement in prediction of drug sensitivity using genes identified by ProGENI compared to other methods. Our siRNA knockdown experiments on ProGENI-identified genes confirmed the role of many new genes in sensitivity to three chemotherapy drugs: cisplatin, docetaxel, and doxorubicin. Based on such experiments and extensive literature survey, we demonstrate that about 73% of our top predicted genes modulate drug response in selected cancer cell lines. In addition, global analysis of genes associated with groups of drugs uncovered pathways of cytotoxic response shared by each group.
Conclusions: Our results suggest that knowledge-guided prioritization of genes using ProGENI gives new insight into mechanisms of drug resistance and identifies genes that may be targeted to overcome this phenomenon.
Keywords: Chemoresistance, Chemotherapy, Drug sensitivity, Gene interaction network, Gene prioritization, Network-based algorithm
Background
The goal of gene prioritization is to rank genes with respect to their relationship to a phenotype (e.g., occurrence of a disease, response to a drug, etc.), providing an experimentalist a way to prioritize genetic perturbation tests and leading to discovery of genes affecting the phenotype [1]. In the context of drug design and drug sensitivity, various gene prioritization techniques have been used to identify drug targets, reveal mechanisms of action (MoAs) of drugs, and identify genes associated with drug response, as well as for drug repositioning [2–5].
It has been previously shown that gene expression is the most informative currently available ‘omic’ feature with respect to drug sensitivity prediction [6], and it has been also successfully used to predict drug response in large clinical studies [7]. Basal gene expression of cancer cell lines (CCLs) has been used to rank genes by their role in cytotoxic drug resistance, utilizing correlation analysis [2, 8–11] or feature selection and regression techniques [12–16] to statistically associate drug response with gene expression profiles of cell lines. At the same time, many genes with key roles escape identification based on expression profiling alone, due to the complexity of drug MoA and noisy data [2], and due to the fact that current methods overlook known functional
* Correspondence: Department of Molecular Pharmacology and Experimental Therapeutics, Gonda 19, Mayo Clinic Rochester, 200, 1st St. SW, Rochester, MN 55905, USA; Department of Computer Science and Institute of Genomic Biology, University of Illinois at Urbana-Champaign, 2122 Siebel Center, 201 N. Goodwin Ave, Urbana, IL 61801, USA
Emad et al. Genome Biology (2017) 18:153 DOI 10.1186/s13059-017-1282-3
(Rosvall and Bergstrom, 2007)
ProGENI
Step 1: Generate new features representing the expression of each gene and the activity level of its neighbors, weighted in proportion to their relevance
Step 2: Find genes most correlated with drug response (RCG set)
Step 3: Score genes based on their relevance to the RCG set
Step 4: Remove network bias by normalizing scores w.r.t. scores corresponding to global network topology
[Figure: ProGENI workflow. Inputs: gene expressions (genes × cell lines), network, drug response (e.g. IC50). (a) Randomly select 80% of cell lines, rank all genes, repeat Nr times, and aggregate the ranked lists of genes. (b) Prioritization: perform network transformation of gene expressions; identify response-correlated genes (RCG) and use them as the restart set for a RWR; obtain the equilibrium probability distribution for the nodes; normalize w.r.t. the global network distribution; rank genes according to normalized probability scores.]
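Steps 3 and 4 rely on a random walk with restart (RWR). The sketch below is a generic power-iteration RWR on an invented five-gene network, not the ProGENI implementation. The restart probability of 0.5, the toy network, and the choice to normalize by subtracting the scores of a walk that restarts uniformly at all nodes are assumptions made for illustration.

```python
def random_walk_with_restart(adj, restart_nodes, restart_prob=0.5,
                             tol=1e-10, max_iter=1000):
    """Return the equilibrium probability of each node for a walker that,
    at every step, jumps back to the restart set with probability
    restart_prob and otherwise moves to a neighbor chosen in proportion
    to the edge weight.

    adj: dict node -> dict neighbor -> weight (every neighbor is also a key).
    restart_nodes: set of nodes, all of which must be keys of adj.
    """
    nodes = list(adj)
    restart = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0)
               for n in nodes}
    strength = {n: sum(adj[n].values()) for n in nodes}
    p = dict(restart)
    for _ in range(max_iter):
        nxt = {n: restart_prob * restart[n] for n in nodes}
        for u in nodes:
            move = (1.0 - restart_prob) * p[u]
            if strength[u] == 0:
                # Isolated node: send the walker back to the restart set.
                for n in nodes:
                    nxt[n] += move * restart[n]
            else:
                for v, w in adj[u].items():
                    nxt[v] += move * w / strength[u]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p

# Hypothetical network: a chain A - B - C - D - E with unit weights.
network = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0, "E": 1.0},
    "E": {"D": 1.0},
}
rcg = {"A"}  # assumed set of response-correlated genes

p_rcg = random_walk_with_restart(network, rcg)
p_global = random_walk_with_restart(network, set(network))
# Assumed normalization: subtract the score due to network topology alone.
scores = {g: p_rcg[g] - p_global[g] for g in network}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Genes close to the restart set receive high probability, while the subtraction reduces the advantage of nodes that score highly only because of their position in the network.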
Datasets
• Human lymphoblastoid cell lines (LCL)
  • Gene expression (~17K genes of ~300 cell lines)
  • Drug response of 24 cytotoxic treatments
• Publicly available dataset from GDSC
  • Gene expression (~13K genes of ~600 cell lines from 13 tissues)
  • Drug response of 139 cytotoxic treatments
• Publicly available prior knowledge
  • Network of gene interactions (PPI and genetic interactions) from STRING (~1.5M edges, ~15.5K nodes)
Data Sources for Knowledge Network
• Philosophy: Rely on existing collections
• Protein-Protein Interactions (40 M)
  • Experimentally determined physical and genetic interactions
  • Literature-based co-occurrence
  • Many other types
• Sources for experimental interactions (1.4 M)
[Figure: Interactions among 12 genes]
Validation using drug response prediction
• Genes ranked highly using a good prioritization method are good predictors of drug sensitivity
[Figure: Validation workflow. (A, B) Gene prioritization: randomly select 80% of cell lines, rank all genes with the network-guided procedure (network transformation of gene expressions, RWR restarted from the response-correlated genes, normalization w.r.t. the global network distribution), repeat Nr times, and aggregate the ranked lists. (C) Divide samples into training and test sets; rank all genes; train a SVR on the training set using expression of highly ranked genes; predict drug sensitivity of the test set; repeat Nr times.]
Validation using drug response prediction
LCL Dataset                                          Pearson    Elastic Net
Num. Drugs (out of 24) with ProGENI > Baseline       14         20
FDR (Wilcoxon signed-rank test)                      6.5E-3     9.6E-5

GDSC Dataset                                         Pearson    Elastic Net
Num. Drugs (out of 139) with ProGENI > Baseline      66         110
FDR (Wilcoxon signed-rank test)                      9.1E-4     4.0E-21

[Figure: panels A–D, scatter plots comparing SPCI(ProGENI-SVR) with SPCI(PCC-SVR) and with SPCI(EN-SVR)]
Functional validation
We validated the role of 33 out of 45 genes (73%) for three drugs.
A
Gene Symbol   Rank (ProGENI)   Rank (Pearson)   Absolute value of Pearson correlation coefficient   Evidence
TUBB6 2 2 0.2759 Direct (this study)
DYNC2H1 3 4 0.2680 Direct (this study)
CLDN3 4 7 0.2602 Direct (literature)
SPARC 5 8 0.2574 Direct (literature)
GJA1 6 6 0.2623 Direct (literature)
ITGA5 7 11 0.2466 Direct (literature)
TPM2 8 9 0.2567 Direct (literature)
MMP2 9 37 0.2160 Direct (literature)
AXL 12 15 0.2373 Direct (literature)
ENG 13 47 0.2089 Direct (literature)
ELK3 14 13 0.2394 Direct (this study)
TIMP1 15 29 0.2207 Direct (literature)
FSCN1 1 1 0.2879 Not found
FHL3 10 10 0.2477 Not found
MMP14 11 39 0.2143 Not found
B
Gene Symbol   Rank (ProGENI)   Rank (Pearson)   Absolute value of Pearson correlation coefficient   Evidence
CAV1 1 8 0.3713 Direct (literature)
YAP1 2 1 0.4148 Direct (literature)
WWTR1 3 4 0.4075 Direct (literature)
AXL 6 2 0.4098 Direct (literature)
MMP14 7 22 0.3525 Direct (literature)
CYR61 9 6 0.3791 Direct (literature)
CAV2 10 16 0.3566 Direct (literature)
GNG12 11 5 0.3792 Direct (this study)
CTSB 12 27 0.3462 Direct (literature)
FSTL1 14 17 0.3557 Direct (this study)
ST5 15 7 0.3782 Direct (this study)
PDGFC 4 13 0.3659 Not found
PTRF 5 3 0.4094 Not found
ITGB5 8 21 0.3534 Not found
PLAU 13 110 0.3033 Not found
C
Gene Symbol   Rank (ProGENI)   Rank (Pearson)   Absolute value of Pearson correlation coefficient   Evidence
ATF1 1 1 0.2000 Direct (this study)
MIS12 2 4 0.1887 Direct (this study)
OSBPL2 5 6 0.1865 Direct (this study)
CSNK2A1 7 1587 0.0752 Direct (literature)
PSIP1 (LEDGF) 8 46 0.1537 Direct (literature)
CAMK2A 9 6991 0.0157 Direct (literature)
CSNK2A2 10 4870 0.0347 Direct (literature)
GOSR1 11 6867 0.0167 Direct (this study)
MAPK8 13 7574 0.0112 Direct (literature)
SPI1 14 6287 0.0217 Direct (literature)
CREB1 15 665 0.1000 Direct (literature)
NOC3L 3 3 0.1893 Not found
IL27RA 4 2 0.1911 Not found
MGEA5 6 7 0.1814 Not found
WAPAL 12 8 0.1805 Not found
[Figure: panels A–D. Results for the BT549 and MDA-MB-231 cell lines with their p-values (ranging from <0.0001 to 0.0018), and gene networks for cisplatin, docetaxel, and doxorubicin]
How about other ML tasks?
• Similar principles can be used for ML tasks other than feature selection/prioritization
• “Network-smoothing” of the features used in ProGENI can be used as a preprocessing step to regression and classification algorithms
• Network-smoothing can also be used for clustering and dimensionality reduction (e.g. Network-based stratification)
Network-based Stratification
Goal:
• Stratification (clustering) of tumor samples based on somatic mutation profiles
Main Issue:
• The mutation data is very sparse and most conventional clustering techniques fail to identify reasonable patterns
• Although two tumors may not share the same somatic mutations, their mutations may affect the same pathways and interaction networks
Value of network-guided analysis
Data sparsity:
• Due to the sparsity of the data, all samples are at equal distance from each other
• Pathway information clarifies the similarity among some samples
• Conventional clustering methods can identify clusters based on network-smoothed features
NBS (Algorithm Overview)
• Employs network smoothing to mitigate sparsity by transforming the binary gene-level somatic mutation vectors of patients into a continuous gene importance vector that captures the proximity of each gene in the network to all of the genes with somatic mutations in the patient sample
• Bootstrap sampling enables robust clustering
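The effect of network smoothing on sparse mutation data can be illustrated with a small sketch. The network (two separate three-gene pathways), the patients, the propagation rule, and the parameter alpha = 0.7 are all assumptions chosen for illustration; this is not the NBS implementation, and the bootstrap clustering step is omitted.

```python
import math

def smooth(profile, adj, alpha=0.7, n_iter=50):
    """Spread a binary mutation profile over the network.

    At each iteration every gene passes a fraction alpha of its score to
    its neighbors (split evenly) and the original profile is mixed back
    in with weight 1 - alpha.
    """
    nodes = list(adj)
    f0 = {g: float(profile.get(g, 0)) for g in nodes}
    f = dict(f0)
    for _ in range(n_iter):
        nxt = {g: (1.0 - alpha) * f0[g] for g in nodes}
        for u in nodes:
            if adj[u]:
                share = alpha * f[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            else:
                nxt[u] += alpha * f[u]  # isolated gene keeps its own score
        f = nxt
    return f

def cosine(a, b):
    """Cosine similarity of two profiles stored as dicts over the same genes."""
    dot = sum(a[g] * b[g] for g in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical network: two pathways that do not interact with each other.
network = {
    "g1": {"g2", "g3"}, "g2": {"g1", "g3"}, "g3": {"g1", "g2"},
    "g4": {"g5", "g6"}, "g5": {"g4", "g6"}, "g6": {"g4", "g5"},
}
# Each hypothetical patient carries one mutation; no two share a mutated gene.
patients = {"P1": {"g1": 1}, "P2": {"g2": 1}, "P3": {"g5": 1}}

raw = {name: {g: float(m.get(g, 0)) for g in network}
       for name, m in patients.items()}
smoothed = {name: smooth(m, network) for name, m in patients.items()}

print("raw      P1-P2:", cosine(raw["P1"], raw["P2"]))
print("raw      P1-P3:", cosine(raw["P1"], raw["P3"]))
print("smoothed P1-P2:", round(cosine(smoothed["P1"], smoothed["P2"]), 3))
print("smoothed P1-P3:", round(cosine(smoothed["P1"], smoothed["P3"]), 3))
```

On the raw profiles every pair of patients has zero similarity. After smoothing, P1 and P2 (mutations in the same pathway) become similar, while P1 and P3 remain dissimilar, so a conventional clustering method now has structure to work with.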
3) Reconstruction of Biological Networks
Gene Co-expression Networks
• Nodes represent genes
• An edge exists between two genes that are highly co-expressed across different samples
[Figure: scatter plots of gene 2 against gene 1 expression, computed from a gene expression matrix (genes × samples)]
Gene Co-expression Networks
• Such networks provide a global view of co-expression patterns
• But do not provide information on how these networks relate to the variation in a phenotypic outcome
Gene Co-expression Networks
How can we relate these networks to the phenotypic variation?
[Figure: gene expression matrix (genes × samples) → calculate pair-wise correlations → filter out small correlations → network of gene-gene correlations; separately, gene-phenotype association with a continuous or categorical phenotype]
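The basic reconstruction pipeline (calculate pair-wise correlations, then filter out small ones) can be sketched as below. The expression values are invented, and both the use of the absolute Pearson correlation and the threshold of 0.9 are assumptions; real analyses choose the cutoff with more care, for example based on p-values.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def coexpression_network(expression, threshold=0.9):
    """Return a dict {(gene1, gene2): correlation} holding only the gene
    pairs whose absolute correlation reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges

# Hypothetical expression of three genes across five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
}
network = coexpression_network(expression)
print(network)
```

Nothing in this construction uses the phenotype, which is the limitation that the approaches on the following slides try to address.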
Gene Co-expression Networks
Approach 1: In reconstructing the network, we can limit our samples to one manifestation of the phenotypic outcome
• For example, build a Basal-like co-expression network by looking at the gene correlations across Basal breast cancer samples
• Issues:
  1. Only works if we have a categorical phenotype
  2. Does not relate the network to the variation in the phenotypic outcome
Gene Co-expression Networks
Approach 2: If the phenotype is binary, reconstruct two networks (one for each manifestation of the phenotype) and compare the two to build a differential network
• Shows changes in the co-expression pattern
[Figure: case network, control network, and the resulting differential network]
Gene Co-expression Networks
• Issues:
  1. Becomes very cumbersome if phenotype is not binary
  2. Does not work for continuous-valued phenotypes
  3. By dividing the samples into two groups, we will have less statistical power in identifying co-expression patterns
  4. Fails in a case such as the one shown in the figure below
[Figure: scatter plot of Gene 2 against Gene 1 expression]
Gene Co-expression Networks
Approach 3: First, find genes associated with the phenotype and then reconstruct a context-specific network only using those genes
• Issue: Ignores the strength (p-value) of gene-phenotype association
[Figure: gene expression matrix (genes × samples) → filter out non-DEGs → calculate pair-wise correlations → filter out small correlations]
Gene Co-expression Networks
Approach 4: Calculate p-values of gene-gene correlation and gene-phenotype associations separately and combine them using Fisher’s method or Stouffer’s method → Simplified-InPheRNo
• Specifically useful to identify transcription factor-gene-phenotype associations
[Figure: scatter plots of gene expression against gene expression and of gene expression against phenotype; combine the two p-values]
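Both combination rules are short enough to write out. The sketch below implements the textbook forms of Fisher's method and the unweighted Stouffer's method using only the standard library; it is not the InPheRNo code, and the two example p-values are invented. Fisher's statistic, minus twice the sum of the log p-values, follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for an even number of degrees of freedom its tail probability has a closed form.

```python
import math
from statistics import NormalDist

def fisher_combine(p_values):
    """Fisher's method for k independent p-values.

    With X = -2 * sum(ln p_i) and 2k degrees of freedom, the chi-square
    tail probability is exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )

def stouffer_combine(p_values):
    """Unweighted Stouffer's method: average the z-scores of the p-values.

    Each p-value must lie strictly between 0 and 1.
    """
    std_normal = NormalDist()
    z = sum(std_normal.inv_cdf(1.0 - p) for p in p_values)
    z /= math.sqrt(len(p_values))
    return 1.0 - std_normal.cdf(z)

# Hypothetical p-values: gene-gene correlation and gene-phenotype association.
p_gene_gene = 0.01
p_gene_phenotype = 0.01

print(fisher_combine([p_gene_gene, p_gene_phenotype]))
print(stouffer_combine([p_gene_gene, p_gene_phenotype]))
```

Unlike Approach 3, a pair with a moderately strong gene-gene correlation and a very strong gene-phenotype association can still receive a small combined p-value.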
Questions?
Regression algorithms
• Lasso: learns a linear model from the training data using only a few features (sparse linear model)
• Elastic Net: learns a linear model from the training data by linearly combining ridge and Lasso regression regularization terms (a generalization of both Lasso and ridge regression)
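Lasso:
$$\hat{\beta} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 \right\}$$
Elastic Net:
$$\hat{\beta} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \right\}$$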
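To see why the Lasso penalty yields sparse models, consider the special case of a single feature. Minimizing the sum of squared errors plus lam1 times the absolute value of the coefficient has a closed-form solution based on soft-thresholding. The sketch below covers only this one-feature case with invented data; the function names are hypothetical, and general solvers (for example coordinate descent) apply the same operation feature by feature.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, and return exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_1d(x, y, lam1):
    """Minimize sum((y_i - x_i * beta)^2) + lam1 * |beta| over beta.

    Setting the (sub)gradient to zero gives
    beta = soft_threshold(sum(x*y), lam1 / 2) / sum(x*x).
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return soft_threshold(sxy, lam1 / 2.0) / sxx

# Hypothetical data in which y is exactly twice x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam1 in (0.0, 14.0, 56.0):
    print(lam1, lasso_1d(x, y, lam1))
```

With no penalty the least-squares coefficient is recovered, a moderate penalty shrinks it, and a large enough penalty sets it exactly to zero, which is how Lasso ends up using only a few features.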
Regression algorithms
• Kernel-SVR:
• Linear SVR learns a linear model such that it has at most ε-deviation from the response values and is as flat as possible
(Smola and Schölkopf, 1998)
• Kernel-SVR generalizes the idea to nonlinear models by mapping the features to a high-dimensional kernel space
[Figure: ε-insensitive tube (+ε, −ε) around the fitted function, with deviation ζ for points outside the tube]
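The idea of tolerating deviations of at most ε corresponds to the ε-insensitive loss used by SVR: errors inside the tube cost nothing, and errors outside it are penalized linearly by how far they exceed ε. A minimal sketch with invented values follows; the value ε = 0.1 is arbitrary.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Hypothetical responses and predictions.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 2.0]

losses = epsilon_insensitive_loss(y_true, y_pred)
print(losses)
```

Only the second and third samples contribute to the loss, because the first prediction lies within ε of its response value.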