Genetic Regulatory Network Inference
Russell Schwartz
Department of Biological Sciences
Carnegie Mellon University
Why Study Network Inference?
It can help us understand how to interpret, and when to trust, biological networks
It is a model for many kinds of complex inference problems in systems biology and beyond
It is a great example of a machine learning problem, a kind of computer science central to much work in biology
Network inference is a good way of thinking about issues in data abstraction central to all computational thinking
Our Assumptions
We will focus specifically on transcriptional regulatory networks, assuming no cycles
We will assume, at least initially, that our data source is a set of microarray gene expression values
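[Figure: example regulatory network for cI and Cro with edges labeled +, ++, and --, shown with a clustered gene expression heat map* (rows: genes; columns: conditions)]
*Clustered gene expression data from NCBI Gene Expression Omnibus (GEO), entry GSE1037: M. H. Jones et al. (2004) Lancet 363(9411):775-81.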
Intuition Behind Network Inference
[Figure: gene expression matrix (rows: genes; columns: conditions) for genes 1 to 4, followed by several candidate networks over genes 1, 2, and 3 with edges labeled + or -]
Correlated expression implies common regulation, but that intuition still leaves a lot of ambiguity
Why Is Intuition Not Enough?
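Models are ambiguous: [Figure: several different candidate networks over genes 1, 2, 3, and 4]
Data are noisy: [Figure: clustered gene expression heat map*]
Data are sparse: number of models ~ $3^{m^2/2}$ vs. number of data points ~ $m$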
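To get a feel for how quickly the space of candidate networks outgrows the data, here is a minimal sketch. It assumes m genes and assumes each unordered gene pair can be in one of three states (one regulates the other, the reverse, or no edge), which gives 3 to the power m(m-1)/2, roughly 3^(m^2/2) for large m. The function name is hypothetical, and the count ignores the no-cycles restriction, so it somewhat overstates the number of allowed networks.

```python
def count_pairwise_models(m: int) -> int:
    """Number of networks if each unordered gene pair has 3 possible states."""
    n_pairs = m * (m - 1) // 2  # number of unordered gene pairs
    return 3 ** n_pairs

# Print how the model count grows with the number of genes
for m in (2, 4, 10, 20):
    print(m, count_pairwise_models(m))
```

With only two genes the count is three, but it grows far faster than any realistic number of measurements.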
A Next Step Beyond Intuition: Assuming a Binary Input Matrix
We will assume for the moment that genes only have two possible states: 0 (off) or 1 (on)
We will also assume that we want to find directionality but not strength of regulatory interactions
Binary input matrix (columns are conditions):
gene 1: 1 1 0 0 1 1 1 0
gene 2: 0 1 0 1 1 1 1 0
gene 3: 0 0 1 0 0 0 0 1
gene 4: 0 0 0 0 0 1 0 1
[Figure: example network over genes 1, 2, 3, and 4 with directed, unsigned edges]
Making it Even Simpler: Two Genes
Only three possible models to consider
Data (columns are conditions):
gene 1: 1 1 0 0 1 1 1 0
gene 2: 0 1 0 1 1 1 1 0
model 1 “G1 regulates G2”: 1 → 2
model 2 “G2 regulates G1”: 1 ← 2
model 3 “G1 and G2 are independent”: 1 and 2 with no edge
Judging a Model: Likelihood
Complicated inference problems like this are commonly described in terms of probabilities
We want to infer a model (which we will call M) using a data set (which we will call D)
Problems like this are commonly posed in terms of maximizing a likelihood function: $\Pr\{D \mid M\}$
We read this as “probability of the data given the model,” i.e., the probability that a given model would generate a given data set
What is the Probability of a Microarray?
We can describe the probability of a microarray as the product of the probabilities of all of its individual measurements:
$\Pr\{1\ 1\ 0\ 0\ 1\ 1\ 1\ 0\} = \Pr\{1\} \times \Pr\{1\} \times \Pr\{0\} \times \Pr\{0\} \times \Pr\{1\} \times \Pr\{1\} \times \Pr\{1\} \times \Pr\{0\}$
What is the Probability of One Measurement on a Microarray?
We can estimate $\Pr\{1\}$ and $\Pr\{0\}$ by counting how often each individual value occurs in the row 1 1 0 0 1 1 1 0:
$\Pr\{1\} = 5/8$
$\Pr\{0\} = 3/8$
Therefore:
$\Pr\{1\ 1\ 0\ 0\ 1\ 1\ 1\ 0\} = \Pr\{1\} \times \Pr\{1\} \times \Pr\{0\} \times \Pr\{0\} \times \Pr\{1\} \times \Pr\{1\} \times \Pr\{1\} \times \Pr\{0\} = 5/8 \times 5/8 \times 3/8 \times 3/8 \times 5/8 \times 5/8 \times 5/8 \times 3/8 = 0.00503$
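The counting estimate and the product over measurements can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, and the row used is the one implied by the worked product (five 1s and three 0s, in the order 1 1 0 0 1 1 1 0). Exact fractions are used so the 5/8 and 3/8 estimates are visible.

```python
from collections import Counter
from fractions import Fraction

def value_probabilities(row):
    """Estimate Pr{value} as the fraction of measurements with that value."""
    counts = Counter(row)
    return {value: Fraction(count, len(row)) for value, count in counts.items()}

def row_probability(row):
    """Probability of the whole row: product of its per-measurement probabilities."""
    probs = value_probabilities(row)
    result = Fraction(1)
    for value in row:
        result *= probs[value]
    return result

gene1 = [1, 1, 0, 0, 1, 1, 1, 0]
print(value_probabilities(gene1))     # estimated Pr{1} and Pr{0}
print(float(row_probability(gene1)))  # probability of the full row
```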
Evaluating One Model
data D =
gene 1 (G1): 1 1 0 0 1 1 1 0
gene 2 (G2): 0 1 0 1 1 1 1 0
model M = 1 and 2 with no edge (G1 and G2 are independent)
$\Pr\{D \mid M\} = \Pr\{G1\} \times \Pr\{G2\}$
$= 0.00503 \times 0.00503 = 2.5 \times 10^{-5}$
Adding in Regulation
How do we evaluate output probabilities for a regulated gene?
We need the notion of conditional probability: evaluating the probability of gene 2’s output given that we know gene 1’s output:
model M = 1 → 2
gene 1 (G1): 1 1 0 0 1 1 1 0
gene 2 (G2): 0 1 0 1 1 1 1 0
$\Pr\{G2=0 \mid G1=1\} = 1/5$
$\Pr\{G2=1 \mid G1=1\} = 4/5$
$\Pr\{G2=0 \mid G1=0\} = 2/3$
$\Pr\{G2=1 \mid G1=0\} = 1/3$
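The four conditional probabilities come from counting within each value of the regulator. A minimal sketch of that counting follows; the function name and the (child value, parent value) key layout are assumptions made for illustration, and the rows are the two-gene data consistent with the probabilities listed for this slide.

```python
from collections import Counter
from fractions import Fraction

def conditional_table(child, parent):
    """Estimate Pr{child=c | parent=p} by counting, keyed as (c, p)."""
    pair_counts = Counter(zip(parent, child))  # how often each (p, c) pair occurs
    parent_counts = Counter(parent)            # how often each parent value occurs
    return {(c, p): Fraction(n, parent_counts[p])
            for (p, c), n in pair_counts.items()}

gene1 = [1, 1, 0, 0, 1, 1, 1, 0]
gene2 = [0, 1, 0, 1, 1, 1, 1, 0]

table = conditional_table(child=gene2, parent=gene1)
for (c, p), prob in sorted(table.items()):
    print(f"Pr{{G2={c} | G1={p}}} = {prob}")
```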
Evaluating Another Model
data D =
gene 1 (G1): 1 1 0 0 1 1 1 0
gene 2 (G2): 0 1 0 1 1 1 1 0
model M = 1 → 2
$\Pr\{D \mid M\} = \Pr\{G1\} \times \Pr\{G2 \mid G1\}$
$= 0.00503 \times (1/5 \times 4/5 \times 2/3 \times 1/3 \times 4/5 \times 4/5 \times 4/5 \times 2/3)$
$= 6.1 \times 10^{-5}$
Evaluating Another Model
data D =
gene 1 (G1): 1 1 0 0 1 1 1 0
gene 2 (G2): 0 1 0 1 1 1 1 0
model M = 1 ← 2
$\Pr\{D \mid M\} = \Pr\{G1 \mid G2\} \times \Pr\{G2\}$
$= (1/3 \times 4/5 \times 2/3 \times 1/5 \times 4/5 \times 4/5 \times 4/5 \times 2/3) \times 0.00503$
$= 6.1 \times 10^{-5}$
Comparing the Models for Two Genes
$\Pr\{D \mid \text{G1 regulates G2}\} = 6.1 \times 10^{-5}$
$\Pr\{D \mid \text{G2 regulates G1}\} = 6.1 \times 10^{-5}$
$\Pr\{D \mid \text{G1 and G2 are independent}\} = 2.5 \times 10^{-5}$
Conclusion: Knowing the expression of gene 1 helps us predict the expression of gene 2 and vice versa; we can suggest there should be an edge between them but cannot decide the direction it should take
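The three-way comparison can be reproduced with a short script. This is a sketch under the same counting estimates used in the worked slides; the function names are hypothetical, and the rows are the two-gene data consistent with those worked probabilities.

```python
from collections import Counter
from math import prod

def marginal_likelihood(row):
    """Pr{row} for an unregulated gene, using counted value frequencies."""
    counts = Counter(row)
    return prod(counts[v] / len(row) for v in row)

def conditional_likelihood(child, parent):
    """Pr{child row | parent row}, using counted conditional frequencies."""
    pair_counts = Counter(zip(parent, child))
    parent_counts = Counter(parent)
    return prod(pair_counts[(p, c)] / parent_counts[p]
                for p, c in zip(parent, child))

g1 = [1, 1, 0, 0, 1, 1, 1, 0]
g2 = [0, 1, 0, 1, 1, 1, 1, 0]

likelihoods = {
    "G1 regulates G2": marginal_likelihood(g1) * conditional_likelihood(g2, g1),
    "G2 regulates G1": marginal_likelihood(g2) * conditional_likelihood(g1, g2),
    "G1 and G2 are independent": marginal_likelihood(g1) * marginal_likelihood(g2),
}
for name, value in likelihoods.items():
    print(f"{name}: {value:.2e}")
```

The tie between the two directed models is not an accident of this data set: with probabilities estimated by counting, both products reduce to the same product of observed joint frequencies of (gene 1, gene 2) pairs, so likelihood alone cannot orient a single edge.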
Generalizing to Many Genes
The same basic concepts let us evaluate the plausibility of any regulatory model
This is known as a Bayesian graphical model
data D =
gene 1 (G1): 1 1 0 0 1 1 1 0
gene 2 (G2): 0 1 0 1 1 1 1 0
gene 3 (G3): 0 0 1 0 0 0 0 1
gene 4 (G4): 0 0 0 0 0 1 0 1
model M = network with edges 1 → 2, 1 → 3, 2 → 4, 3 → 4
$\Pr\{D \mid M\} = \Pr\{G1\} \times \Pr\{G2 \mid G1\} \times \Pr\{G4 \mid G3, G2\} \times \Pr\{G3 \mid G1\}$
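The factorization generalizes to any acyclic network: each gene contributes the probability of its row given the rows of its regulators. Below is a minimal sketch, assuming a hypothetical `parents` dictionary that maps each gene to a tuple of its regulators, with probabilities estimated by counting as in the two-gene case. The four rows and the edge set (1 → 2, 1 → 3, 2 → 4, 3 → 4) are a reading of the slide's example and should be treated as an assumption.

```python
from collections import Counter
from math import prod

def network_likelihood(data, parents):
    """Pr{D | M}: product over genes of Pr{gene row | rows of its parents}."""
    n_conditions = len(next(iter(data.values())))
    total = 1.0
    for gene, row in data.items():
        parent_rows = [data[p] for p in parents.get(gene, ())]
        # Parent configuration seen at each condition; () if the gene has no parents
        configs = list(zip(*parent_rows)) if parent_rows else [()] * n_conditions
        joint_counts = Counter(zip(configs, row))
        config_counts = Counter(configs)
        total *= prod(joint_counts[(c, v)] / config_counts[c]
                      for c, v in zip(configs, row))
    return total

data = {
    1: [1, 1, 0, 0, 1, 1, 1, 0],
    2: [0, 1, 0, 1, 1, 1, 1, 0],
    3: [0, 0, 1, 0, 0, 0, 0, 1],
    4: [0, 0, 0, 0, 0, 1, 0, 1],
}
model = {2: (1,), 3: (1,), 4: (2, 3)}  # gene -> its regulators

print(network_likelihood(data, model))  # likelihood of the four-gene network
print(network_likelihood(data, {}))     # likelihood with no edges at all
```

One caveat worth keeping in mind: with counted probabilities, adding regulators can never lower the likelihood, so likelihood alone tends to favor densely connected networks.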
Adding Prior Knowledge
We can also build in any prior knowledge we have about the proper model (e.g., from the literature)
We can use that knowledge by simply multiplying each likelihood by our prior confidence in its validity:
$\Pr\{D \mid M\} \times \Pr\{e_{1,2}\} \times \Pr\{e_{1,3}\} \times \Pr\{e_{1,4}\} \times \Pr\{e_{3,2}\} \times \dots$
where D is the four-gene expression matrix, M is the four-gene network, and each $\Pr\{e_{i,j}\}$ is our prior confidence in the relationship the model draws between genes i and j
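A prior can break ties that the likelihood cannot. The sketch below multiplies a likelihood by prior factors, working in log space so that long products do not underflow. The function name and the two prior values are hypothetical, chosen only for illustration; the likelihood is the 6.1 x 10^-5 value shared by the two directed two-gene models.

```python
import math

def log_score(likelihood, prior_factors):
    """log(likelihood x product of prior factors), computed as a sum of logs."""
    return math.log(likelihood) + sum(math.log(p) for p in prior_factors)

likelihood = 6.1e-5  # same for "G1 regulates G2" and "G2 regulates G1"

# Hypothetical prior confidences, e.g. from the literature (made-up numbers)
prior_g1_regulates_g2 = 0.7
prior_g2_regulates_g1 = 0.1

score_12 = log_score(likelihood, [prior_g1_regulates_g2])
score_21 = log_score(likelihood, [prior_g2_regulates_g1])
print("prefer G1 regulates G2:", score_12 > score_21)
```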
Adding in Other Data Types
We can also incorporate other pieces of evidence in much the same way
Example: suppose we have microarrays and transcription factor (TF) binding site predictions:
$\Pr\{D, \text{ACGATCTCA…} \mid M\} = \Pr\{D \mid M\} \times \Pr\{\text{ACGATCTCA…} \mid M\}$
where D is the two-gene expression matrix and M is a two-gene network
$\Pr\{D \mid M\}$: evaluate as before
$\Pr\{\text{ACGATCTCA…} \mid M\}$: evaluate by a binding site prediction method (e.g., a position-specific scoring matrix, PSSM)
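As a rough illustration of the sequence term, the sketch below scores a sequence with a tiny probability-form PSSM and multiplies the result by an expression likelihood. Everything specific here is an assumption: the three-column matrix and the toy sequence are made up, the function names are hypothetical, and taking the best-scoring window as the sequence probability is one simple choice among many, not necessarily the method intended in the slides.

```python
from math import prod

# Hypothetical 3-position PSSM: Pr{base} at each site position (made-up numbers)
PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

def site_probability(site, pssm):
    """Probability of one candidate site: product of per-position base probabilities."""
    if len(site) != len(pssm):
        raise ValueError("site length must match PSSM width")
    return prod(column[base] for base, column in zip(site, pssm))

def best_site_probability(sequence, pssm):
    """Highest site probability over all windows of the sequence (an assumed heuristic)."""
    width = len(pssm)
    return max(site_probability(sequence[i:i + width], pssm)
               for i in range(len(sequence) - width + 1))

expression_likelihood = 6.1e-5  # Pr{D | M} from the two-gene example
sequence_probability = best_site_probability("TTACGT", PSSM)  # toy sequence
combined = expression_likelihood * sequence_probability
print(sequence_probability, combined)
```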
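Moving from Discrete to Real-Valued Data
We can also drop the need for discrete (on or off) data by making an assumption of how values vary in the absence of regulation, e.g., Gaussian:
[Figure: Gaussian curve with axis ticks at -1, 0, and 1]
$\Pr\{1.5,\ -0.3,\ 0.4,\ -1.2\} = \frac{1}{\sqrt{2\pi}} e^{-(1.5)^2/2} \times \frac{1}{\sqrt{2\pi}} e^{-(0.4)^2/2} \times \frac{1}{\sqrt{2\pi}} e^{-(-0.3)^2/2} \times \frac{1}{\sqrt{2\pi}} e^{-(-1.2)^2/2}$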
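The Gaussian version replaces counted frequencies with a density evaluated at each measurement. A minimal sketch follows, assuming a standard Gaussian (mean 0, variance 1) for an unregulated gene; the function names are hypothetical.

```python
import math

def standard_normal_pdf(x):
    """Density of a Gaussian with mean 0 and variance 1 at x."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def gaussian_likelihood(values):
    """Product of per-measurement Gaussian densities."""
    return math.prod(standard_normal_pdf(x) for x in values)

values = [1.5, -0.3, 0.4, -1.2]
print(gaussian_likelihood(values))
```

For long rows it is usually safer to sum log densities instead of multiplying densities, since the product shrinks toward zero quickly.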
Finding the Best Model
We now know how to compare different network models, but finding the best model is not easy: there are far too many possibilities to compare them all
Algorithms for model inference are a more complex topic than we can cover here, but there are some general approaches to be aware of:
optimization: many specialized methods exist for finding the best model without trying everything; solving hard problems of this type is a core concern in computer science
sampling: there are also many specialized methods for randomly generating solutions likely to be “good” and seeing what model features are preserved across most solutions; this is a core concern of statisticians
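To make the optimization idea concrete, here is a minimal sketch of greedy hill climbing, one simple search strategy (named here as an illustration, not as the method the slides have in mind). It starts from a network with no edges and keeps toggling single edges whenever that improves a score, rejecting any change that creates a cycle. The score is the log-likelihood plus a per-edge log prior; the prior value of 0.1, the function names, and the four-gene rows are all assumptions made for this example. Hill climbing can stop at a local optimum, so its result is not guaranteed to be the best model.

```python
import math
from collections import Counter
from itertools import permutations

DATA = {
    1: [1, 1, 0, 0, 1, 1, 1, 0],
    2: [0, 1, 0, 1, 1, 1, 1, 0],
    3: [0, 0, 1, 0, 0, 0, 0, 1],
    4: [0, 0, 0, 0, 0, 1, 0, 1],
}

def log_likelihood(data, parents):
    """Sum over genes of log Pr{gene row | parent rows}, estimated by counting."""
    n = len(next(iter(data.values())))
    total = 0.0
    for gene, row in data.items():
        parent_rows = [data[p] for p in sorted(parents[gene])]
        configs = list(zip(*parent_rows)) if parent_rows else [()] * n
        joint = Counter(zip(configs, row))
        config_counts = Counter(configs)
        total += sum(math.log(joint[(c, v)] / config_counts[c])
                     for c, v in zip(configs, row))
    return total

def is_acyclic(parents):
    """Repeatedly strip genes with no remaining parents; failure means a cycle."""
    remaining = {g: set(p) for g, p in parents.items()}
    while remaining:
        roots = [g for g, p in remaining.items() if not p]
        if not roots:
            return False
        for g in roots:
            del remaining[g]
        for p in remaining.values():
            p.difference_update(roots)
    return True

def score(data, parents, log_edge_prior):
    """Log-likelihood plus one log prior factor per edge in the network."""
    n_edges = sum(len(p) for p in parents.values())
    return log_likelihood(data, parents) + n_edges * log_edge_prior

def hill_climb(data, log_edge_prior=math.log(0.1)):
    """Toggle single edges while any toggle strictly improves the score."""
    parents = {g: frozenset() for g in data}
    best = score(data, parents, log_edge_prior)
    improved = True
    while improved:
        improved = False
        for src, dst in permutations(data, 2):
            candidate = dict(parents)
            candidate[dst] = parents[dst] ^ {src}  # add or remove edge src -> dst
            if not is_acyclic(candidate):
                continue
            s = score(data, candidate, log_edge_prior)
            if s > best + 1e-12:
                parents, best, improved = candidate, s, True
    return parents, best

best_parents, best_score = hill_climb(DATA)
print({gene: sorted(p) for gene, p in best_parents.items()}, best_score)
```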
Network Inference in Practice
The methods covered here are the key ideas behind how people really infer networks from complex data
The practice is usually more complicated, though: many kinds of data sources, specialized prior probabilities, lots of algorithmic tricks needed to get good results
If you really want to know the details, these topics are typically covered in a class on machine learning