Illuminating Genetic Networks with Random Forest
ANDREAS BEYER
University of Cologne
Outline
• Random Forest
• Applications
• QTL mapping
• Epistasis (analyzing model structure)
Random Forest: HOW DOES IT WORK?
Random Forest
[Figure: data matrix with samples as rows; response Y and predictors X as columns]
Random Forest
[Figure: decision tree splitting on predictor 1, predictor 2, and predictor 3; leaves predict low / class 1 or high / class 2]
Leo Breiman, 2001
RF uses CART
• Classification And Regression Trees
Breiman et al. (1984)
Splitting Rules
Classification: Gini Impurity
minimize: $G = 1 - \sum_{k=1}^{m} f_k^2$
$f_k$ = fraction of items labeled $k$
$m$ = number of possible values
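As a small worked illustration of the Gini impurity criterion, the sketch below computes the impurity of a node and the size-weighted impurity of a candidate split. The function names `gini_impurity` and `split_impurity` are made up for this example, and weighting each child by its size is the usual CART convention, assumed here rather than taken from the slide.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k f_k^2, where f_k is the fraction of items labeled k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split (assumed CART convention)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

# A pure node has impurity 0; a 50/50 node with two classes has impurity 0.5.
pure = gini_impurity(["a", "a", "a", "a"])
mixed = gini_impurity(["a", "a", "b", "b"])

# A split that separates the classes is preferred over one that does not.
good_split = split_impurity(["a", "a"], ["b", "b"])
bad_split = split_impurity(["a", "b"], ["a", "b"])
```

The splitting rule then amounts to picking, among all candidate splits, the one with the smallest value of `split_impurity`.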
Splitting Rules
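Regression: RSS
minimize: $RSS = \sum_{y_i \in Y_l} (y_i - \bar{y}_l)^2 + \sum_{y_i \in Y_r} (y_i - \bar{y}_r)^2$
(first sum: left node; second sum: right node)
$y_i$ = i'th item
$Y_l$, $Y_r$ = items in left (right) node
$n_l$, $n_r$ = number of items in left (right) node
$\bar{y}_l$, $\bar{y}_r$ = average of left (right) items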
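The RSS criterion can be illustrated with a minimal sketch that scans all thresholds on a single predictor and keeps the one with the smallest summed RSS of the left and right node. The helper names `rss` and `best_threshold` and the toy data are assumptions made for this example only.

```python
def rss(values):
    """Sum of squared deviations of the values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_threshold(x, y):
    """Scan midpoints between sorted predictor values; return (threshold, RSS_left + RSS_right)."""
    best = (None, float("inf"))
    candidates = sorted(set(x))
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (t, total)
    return best

# Toy data: the response jumps between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, split_rss = best_threshold(x, y)
```

On this toy data the chosen threshold falls in the gap between the two groups, where both child nodes are nearly constant.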
Decision trees …
• are nice to interpret,
• but generalize very poorly (large variance)
http://research.microsoft.com
Random Forest
[Figure: data matrix with samples as rows; response Y and predictors X as columns. Samples are drawn by bootstrap; predictors by random sampling]
Random Forest
Grow many trees!
Average predictions across all trees
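The two steps (grow many trees, average their predictions) can be sketched with scikit-learn, which is an assumption of this example and not necessarily the software used in the talk. The simulated data and all parameter values are illustrative. Note that scikit-learn draws the random subset of predictors at every split (`max_features`), while samples are bootstrapped once per tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated data (illustrative only): 200 samples, 5 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Grow many trees: each tree sees a bootstrap sample of the rows and
# considers a random half of the predictors at each split.
forest = RandomForestRegressor(
    n_estimators=100, bootstrap=True, max_features=0.5, random_state=0
)
forest.fit(X, y)

# Average predictions across all trees by hand.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
```

The hand-computed `averaged` vector matches `forest.predict(X)`, which shows that the forest prediction for regression is simply the mean over the individual trees.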
Benefits
• Works very well in practice
• Very broadly applicable
• Intuitive algorithm
• Robust; adding more trees does not lead to overfitting
• No assumptions about data
• Virtually no tuning needed
• Very easily parallelizable
• Accounts for complex interactions between features
Drawbacks
• Difficult interpretation
• What is the underlying ‘model’?
• Almost impossible to capture analytically
Predicting Protein-Protein Interactions
Human
Elefsinioti et al. 2011 Molec. Cell. Prot.
Sarac et al. 2012 Bioinformatics
Yeast
Proximity measure
Score similarities (or differences) of samples by how often they fall into the same leaf:
• very similar (together in 3/3 trees)
• quite similar (together in 2/3 trees)
• different (together in 0/3 trees)
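A minimal sketch of the proximity measure, assuming scikit-learn: `apply` returns the leaf each sample reaches in each tree, and the proximity of two samples is the fraction of trees in which they share a leaf. The two simulated groups and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Simulated data (illustrative only): two well-separated groups of 30 samples.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(30, 4)), rng.normal(4, 1, size=(30, 4))])
y = np.array([0] * 30 + [1] * 30)

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# leaves[i, t] is the index of the leaf that sample i reaches in tree t.
leaves = forest.apply(X)

# proximity[i, j] = fraction of trees in which samples i and j share a leaf.
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```

A dissimilarity for clustering can then be taken as `1 - proximity`.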
Weighted Clustering
Michaelson, Trump, et al. 2011 BMC Genomics
1. Learn model to predict outcome
Features → RF → Outcome
2. Use feature importance as weights
Proximity measure → PAM clustering → Clusters
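One way to read the two steps is sketched below, assuming scikit-learn and a weighted Euclidean distance. Both are assumptions of this example; the slide does not say which distance or software was used. The resulting distance matrix could then be handed to any PAM (k-medoids) implementation, which is left out here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Simulated data (illustrative only): only feature 0 is related to the outcome.
rng = np.random.default_rng(2)
n = 100
outcome = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 6))
X[:, 0] += 3.0 * outcome

# Step 1: learn a model that predicts the outcome from the features.
forest = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, outcome)

# Step 2: use the feature importances as weights (non-negative, summing to 1).
weights = forest.feature_importances_

# Weighted Euclidean distance between all pairs of samples (assumed form).
diff = X[:, None, :] - X[None, :, :]
weighted_dist = np.sqrt((weights * diff ** 2).sum(axis=2))
```

Features that do not help predict the outcome receive small weights, so they contribute little to the distances that drive the clustering.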
QTL Analysis
Genetic Association
ACCGTCCGACACGTTTGGACAAGTACGCTGCAACACACCCGTACCAATTTTGG
ACCGACCGACACGTTTGGACAAGTACGTTGCAACACACCCGTACCAATTTTGG
ACCGACCGACACGTTTGGACAAGTACGCTGCAACACACCCGTACCAAAATTGG
ACCGTCCCACACGTTTGGTCAAGTACGCTGCAACACACCCGTACCAATTTTGG
ACCGACCCACACGTTTGGTCAAATACGTTGCAACACACCCGTACCAATTTTGG
ACCGTCCGACACGTTTGGTCAAATACGTTGCAACACACCCGTACCAATTTTGG
ACCGTCCGACACGTTTGGTCAAGTACGCTGCAACACACCCGTACCAATTTTGG
Published Genome-Wide Associations through 2015
Published GWA at p ≤ 5 × 10^-8 for 17 trait categories
NHGRI GWA Catalog, www.genome.gov/GWAStudies, www.ebi.ac.uk/fgpt/gwas/
Quantitative Trait Loci (QTL)
• Sub-type of GWAS
• Must have several causal loci (complex trait)
• Why?
[Figure: trait values by allele (A vs. a)]
Standard approach: t-test
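The standard approach can be sketched as one t-test per marker, comparing trait values between the two alleles. The example assumes SciPy and simulated haploid genotypes with a single causal marker; the marker index and effect size are invented for illustration.

```python
import numpy as np
from scipy import stats

# Simulated data (illustrative only): 120 samples, 20 biallelic markers.
rng = np.random.default_rng(3)
n_samples, n_markers = 120, 20
genotypes = rng.integers(0, 2, size=(n_samples, n_markers))  # 0 = allele a, 1 = allele A
trait = 1.5 * genotypes[:, 4] + rng.normal(size=n_samples)   # marker 4 is causal here

# One two-sample t-test per marker: trait values of allele A vs. allele a.
p_values = np.array([
    stats.ttest_ind(trait[genotypes[:, j] == 1], trait[genotypes[:, j] == 0]).pvalue
    for j in range(n_markers)
])
best_marker = int(p_values.argmin())
```

Each marker is tested on its own, so this approach does not model several causal loci jointly.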
RF for QTL mapping
Feature matrix (genetic markers) → RF → Trait (e.g. body size)
Feature importance = importance of marker
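A minimal sketch of RF-based mapping, assuming scikit-learn and simulated markers. scikit-learn's impurity-based `feature_importances_` stands in for the importance measure here; the cited studies may have used a different measure, and the causal markers and effect sizes below are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated data (illustrative only): 200 samples, 30 biallelic markers,
# with two causal markers of different effect size.
rng = np.random.default_rng(4)
n_samples, n_markers = 200, 30
markers = rng.integers(0, 2, size=(n_samples, n_markers))
trait = (2.0 * markers[:, 7] + 1.0 * markers[:, 19]
         + rng.normal(scale=0.5, size=n_samples))

# Feature matrix = genetic markers, response = trait.
forest = RandomForestRegressor(n_estimators=300, random_state=4).fit(markers, trait)

# Importance of each marker, and markers ranked from most to least important.
importance = forest.feature_importances_
ranking = np.argsort(importance)[::-1]
```

Unlike marker-by-marker testing, all markers enter one model, so several causal loci are considered jointly.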
Pathway Consistency
Random Forest
Enrichment of gene pairs in same pathway
Michaelson et al. 2010 BMC Genomics
RF-based QTL Mapping
• Michaelson et al. 2010 BMC Syst. Biol.
  • Comparison using real data
• Ackermann et al. 2012 PLoS ONE
  • Comparison using simulated data (DREAM)
• Picotti, Clément-Ziza et al. 2013 Nature
  • Extracting epistatic interactions
• Ackermann et al. 2013 PLoS Genetics
  • Multiple cell types/conditions
• Clément-Ziza et al. 2014 Molec. Systems Biol.
  • Non-coding genes, antisense transcription
• Stephan et al. 2015 Nat. Commun.
  • Population substructure
• Valenzano et al. 2015 Cell
  • Mapping traits in fish
Epistasis with RF: GETTING A GRIP ON RF STRUCTURE
JAKE MICHAELSON, MATHIEU CLÉMENT-ZIZA, JAN GROßBACH, CORINNA SCHMALOHR
What is Epistasis?
[Figure: trait value by genotype (AB, Ab, aB, ab) in three panels: additive, epistatic, epistatic]
“Non-additive interaction between markers (predictors).”
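The definition can be made concrete with a small worked example on the four genotype means. The function name `interaction_contrast` and the numbers are illustrative assumptions: the contrast is zero when the effect of marker B is the same on both backgrounds of marker A, and non-zero otherwise.

```python
def interaction_contrast(means):
    """(AB - Ab) - (aB - ab): effect of B on background A minus effect of B on background a."""
    return (means["AB"] - means["Ab"]) - (means["aB"] - means["ab"])

# Additive: each capital allele adds 1.0 to the trait, independently of the other marker.
additive = {"AB": 3.0, "Ab": 2.0, "aB": 2.0, "ab": 1.0}

# Epistatic: the trait only changes when both capital alleles are present.
epistatic = {"AB": 3.0, "Ab": 1.0, "aB": 1.0, "ab": 1.0}

additive_contrast = interaction_contrast(additive)
epistatic_contrast = interaction_contrast(epistatic)
```

Here the additive example gives a contrast of 0 and the epistatic example gives 2.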
Problem with Random Forest
Interaction between variables
need to know model structure!
Finding Epistasis with Decision Trees
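[Figure: two decision trees, each splitting first on marker A (A vs. a) and then on marker B (B vs. b) within both branches; one tree is labeled epistatic, the other additive]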
→ compare slopes
Algorithm
1. Learn decision trees
2. Compute slopes (differences) of trait values at splits
3. Collect slopes for ‘left’ and ‘right’ sides (there will be many trees)
4. Compare distributions of slopes. Are they different?
[Figure: distributions of slopes on the left and right sides, shown for epistasis and for no epistasis]
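The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn trees, binary markers, a slope defined as the right-child mean minus the left-child mean at a split, and a Mann-Whitney U test for comparing the two slope distributions. The helper names and the simulated trait are hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor

def slopes_below(tree, root, marker):
    """Slopes (right-child mean minus left-child mean) of all splits on `marker` under `root`."""
    slopes, stack = [], [root]
    while stack:
        node = stack.pop()
        left, right = tree.children_left[node], tree.children_right[node]
        if left == -1:  # leaf node: nothing to collect
            continue
        if tree.feature[node] == marker:
            slopes.append(tree.value[right, 0, 0] - tree.value[left, 0, 0])
        stack.extend((left, right))
    return slopes

def compare_slopes(forest, a, b):
    """Collect slopes of marker b on the left and right side of every split on marker a."""
    left_side, right_side = [], []
    for est in forest.estimators_:
        tree = est.tree_
        for node in np.flatnonzero(tree.feature == a):
            left_side += slopes_below(tree, tree.children_left[node], b)
            right_side += slopes_below(tree, tree.children_right[node], b)
    # Assumed test for "are the distributions different?"
    p = stats.mannwhitneyu(left_side, right_side, alternative="two-sided").pvalue
    return np.mean(right_side) - np.mean(left_side), p

# Simulated data (illustrative only): markers 0 and 1 interact, markers 2 and 3 are additive.
rng = np.random.default_rng(5)
markers = rng.integers(0, 2, size=(400, 6))
trait = (2.0 * markers[:, 0] * markers[:, 1] + markers[:, 2] + markers[:, 3]
         + rng.normal(scale=0.3, size=400))

# Step 1: learn decision trees.
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                               random_state=5).fit(markers, trait)

# Steps 2-4: compute, collect, and compare slopes for two marker pairs.
epi_diff, epi_p = compare_slopes(forest, 0, 1)
add_diff, add_p = compare_slopes(forest, 2, 3)
```

For the interacting pair the slopes of marker 1 differ strongly between the two sides of marker 0, while for the additive pair the two slope distributions are centred on roughly the same value. Slopes from the same forest are not independent, so the p-value in this sketch should be read as a ranking score, not as a calibrated significance level.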
Validation on real data (Saccharomyces cerevisiae)
Using Costanzo et al. 2010 for validation
[Figure: RF vs. ANOVA; ROC curve (True Positive Rate vs. False Positive Rate) and precision-recall curve (Precision vs. Recall)]
Why is RF better?
A B C
Random Forest
• Extremely versatile
• Robust
• Can analyse structure
http://research.microsoft.com
Acknowledgements
People
Mathieu Clément-Ziza, Corinna Schmalohr, Jan Großbach, Johannes Stephan
Oliver Stegle (EBI)
Ruedi Aebersold (ETH), Paola Picotti
Jürg Bähler (UCL), Sam Marguerat, Xavi Masellach
Chris Workman (DTU), Manos Papadakis
Money
SystemsX.ch, The Swiss Initiative in Systems Biology