Genetic Regulatory Network Inference

Russell Schwartz

Department of Biological Sciences

Carnegie Mellon University

Why Study Network Inference?

It can help us understand how to interpret and when to trust biological networks

It is a model for many kinds of complex inference problems in systems biology and beyond

It is a great example of a machine learning problem, a kind of computer science central to much work in biology

Network inference is a good way of thinking about issues in data abstraction central to all computational thinking

Our Assumptions

We will focus specifically on transcriptional regulatory networks, assuming no cycles

We will assume, at least initially, that our data source is a set of microarray gene expression values

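[Figure: example regulatory network involving cI and Cro, with edges marked + and -]

[Figure: heat map of clustered gene expression data*; axes: conditions, genes]

*Clustered gene expression data from NCBI Gene Expression Omnibus (GEO), entry GSE1037: M. H. Jones et al. (2004) Lancet 363(9411):775-81.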

Intuition Behind Network Inference

[Figure: heat map of gene expression data (axes: conditions, genes) alongside a series of candidate networks over genes 1, 2, 3, and 4 with edges marked + and -]

Correlated expression implies common regulation, but that intuition still leaves a lot of ambiguity

Why Is Intuition Not Enough?

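Models are ambiguous:

[Figure: several alternative networks over genes 1, 2, 3, and 4, followed by "…"]

Data are noisy:

[Figure: heat map of clustered gene expression data*]

Data are sparse:

$\sim 3^{m^2/2}$ models vs. $\sim m$ data points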
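To get a feel for how quickly the number of candidate models grows, here is a minimal sketch. It assumes the estimate comes from giving every pair of genes three possible states (no edge, an edge in one direction, or an edge in the other); that reading, and the function name `approximate_model_count`, are assumptions made for illustration. The count ignores the no-cycles restriction.

```python
def approximate_model_count(m):
    """Count edge configurations for m genes, assuming 3 states per gene pair."""
    pairs = m * (m - 1) // 2  # number of unordered gene pairs, roughly m^2 / 2
    return 3 ** pairs


for m in (2, 4, 10, 20):
    print(m, "genes:", approximate_model_count(m), "candidate models")
```

For two genes this gives exactly three models, and the count already exceeds any realistic number of measurements for a few dozen genes.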


A Next Step Beyond Intuition: Assuming a Binary Input Matrix

We will assume for the moment that genes have only two possible states: 0 (off) or 1 (on)

          conditions
gene 1    1 1 0 0 1 1 1 0
gene 2    0 1 0 1 1 1 1 0
gene 3    0 0 1 0 0 0 0 1
gene 4    0 0 0 0 0 1 0 1

We will also assume that we want to find the directionality but not the strength of regulatory interactions:

[Figure: candidate network over genes 1, 2, 3, and 4]

Making it Even Simpler: Two Genes

Only three possible models to consider

          conditions
gene 1    1 1 0 0 1 1 1 0
gene 2    0 1 0 1 1 1 1 0

[Figure: three two-gene diagrams, one per model]

model 1 “G1 regulates G2”

model 2 “G2 regulates G1”

model 3 “G1 and G2 are independent”

Judging a Model: Likelihood

Complicated inference problems like this are commonly described in terms of probabilities

We want to infer a model (which we will call M) using a data set (which we will call D)

Problems like this are commonly posed in terms of maximizing a likelihood function:

Pr{D|M}

We read this as “probability of the data given the model,” i.e., the probability that a given model would generate a given data set

What is the Probability of a Microarray?

We can describe the probability of a microarray as the product of the probabilities of all of its individual measurements:

Pr{1 1 0 0 1 1 1 0} = Pr{1} x Pr{1} x Pr{0} x Pr{0} x Pr{1} x Pr{1} x Pr{1} x Pr{0}

What is the Probability of One Measurement on a Microarray?

We can estimate Pr{1} and Pr{0} by counting how often each individual value occurs:

Pr{1} = 5/8

Pr{0} = 3/8

Therefore:

Pr{1 1 0 0 1 1 1 0} = Pr{1} x Pr{1} x Pr{0} x Pr{0} x Pr{1} x Pr{1} x Pr{1} x Pr{0} = 5/8 x 5/8 x 3/8 x 3/8 x 5/8 x 5/8 x 5/8 x 3/8 = 0.00503
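The counting estimate can be checked with a few lines of Python. This is a minimal sketch, assuming each measurement is independent and that the gene 1 row reads 1 1 0 0 1 1 1 0, the order implied by the product of 5/8 and 3/8 terms on this slide. The function name `marginal_likelihood` is illustrative only.

```python
from collections import Counter
from math import prod


def marginal_likelihood(values):
    """Probability of a row of binary measurements, each treated as independent."""
    counts = Counter(values)  # how often 0 and 1 occur
    n = len(values)
    return prod(counts[v] / n for v in values)


gene1 = [1, 1, 0, 0, 1, 1, 1, 0]
print(round(marginal_likelihood(gene1), 5))  # 0.00503
```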

Evaluating One Model

data D =
gene 1    1 1 0 0 1 1 1 0
gene 2    0 1 0 1 1 1 1 0

model M = [Figure: genes 1 and 2 with no edge between them]

Pr{D|M} = Pr{1 1 0 0 1 1 1 0} x Pr{0 1 0 1 1 1 1 0}

= 0.00503 x 0.00503 = 2.5 x 10^-5

Adding in Regulation

How do we evaluate output probabilities for a regulated gene?

We need the notion of conditional probability: evaluating the probability of gene 2’s output given that we know gene 1’s output:

[Figure: gene 1 regulates gene 2]

gene 1    1 1 0 0 1 1 1 0
gene 2    0 1 0 1 1 1 1 0

Pr{G2=0|G1=1} = 1/5

Pr{G2=1|G1=1} = 4/5

Pr{G2=0|G1=0} = 2/3

Pr{G2=1|G1=0} = 1/3
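The conditional probabilities can be estimated by counting how often each (gene 2, gene 1) combination occurs across conditions. A minimal sketch, assuming the gene 1 row reads 1 1 0 0 1 1 1 0 (the order consistent with the fractions on this slide); `conditional_table` is an illustrative name.

```python
from collections import Counter


def conditional_table(child, parent):
    """Estimate Pr{child = c | parent = p} by counting co-occurrences."""
    pair_counts = Counter(zip(child, parent))
    parent_counts = Counter(parent)
    return {(c, p): n / parent_counts[p] for (c, p), n in pair_counts.items()}


gene1 = [1, 1, 0, 0, 1, 1, 1, 0]
gene2 = [0, 1, 0, 1, 1, 1, 1, 0]

table = conditional_table(gene2, gene1)
for (c, p), prob in sorted(table.items()):
    print(f"Pr{{G2={c}|G1={p}}} = {prob:.3f}")
```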

Evaluating Another Model

data D =
gene 1    1 1 0 0 1 1 1 0
gene 2    0 1 0 1 1 1 1 0

model M = [Figure: gene 1 regulates gene 2]

Pr{D|M} = Pr{1 1 0 0 1 1 1 0} x Pr{0 1 0 1 1 1 1 0 | 1 1 0 0 1 1 1 0}

= 0.00503 x (1/5 x 4/5 x 2/3 x 1/3 x 4/5 x 4/5 x 4/5 x 2/3)

= 6.1 x 10^-5

Evaluating Another Model

data D =
gene 1    1 1 0 0 1 1 1 0
gene 2    0 1 0 1 1 1 1 0

model M = [Figure: gene 2 regulates gene 1]

Pr{D|M} = Pr{1 1 0 0 1 1 1 0 | 0 1 0 1 1 1 1 0} x Pr{0 1 0 1 1 1 1 0}

= (1/3 x 4/5 x 2/3 x 1/5 x 4/5 x 4/5 x 4/5 x 2/3) x 0.00503

= 6.1 x 10^-5

Comparing the Models for Two Genes

Pr{D | model 1: G1 regulates G2} = 6.1 x 10^-5

Pr{D | model 2: G2 regulates G1} = 6.1 x 10^-5

Pr{D | model 3: G1 and G2 are independent} = 2.5 x 10^-5

Conclusion: Knowing the expression of gene 1 helps us predict the expression of gene 2 and vice versa; we can suggest there should be an edge between them but cannot decide the direction it should take
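The three likelihoods can be reproduced in one short script. This is a sketch under the same counting estimates used in the slides, with the gene 1 row taken as 1 1 0 0 1 1 1 0; the helper names `marginal` and `conditional` are illustrative.

```python
from collections import Counter
from math import prod


def marginal(values):
    """Pr{row}: product of the observed frequency of each value."""
    counts, n = Counter(values), len(values)
    return prod(counts[v] / n for v in values)


def conditional(child, parent):
    """Pr{child row | parent row}: product of Pr{child value | parent value}."""
    pairs, parents = Counter(zip(child, parent)), Counter(parent)
    return prod(pairs[(c, p)] / parents[p] for c, p in zip(child, parent))


g1 = [1, 1, 0, 0, 1, 1, 1, 0]
g2 = [0, 1, 0, 1, 1, 1, 1, 0]

likelihoods = {
    "G1 regulates G2": marginal(g1) * conditional(g2, g1),
    "G2 regulates G1": marginal(g2) * conditional(g1, g2),
    "G1 and G2 are independent": marginal(g1) * marginal(g2),
}
for name, value in likelihoods.items():
    print(f"{name}: {value:.2e}")
```

The two models with an edge come out equal because, with counting estimates, Pr{G1} x Pr{G2|G1} and Pr{G2} x Pr{G1|G2} are both the joint frequency of the observed (G1, G2) pairs. This is why the data alone cannot decide the direction.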

Generalizing to Many Genes

The same basic concepts let us evaluate the plausibility of any regulatory model

This is known as a Bayesian graphical model

data D =
gene 1    1 1 0 0 1 1 1 0
gene 2    0 1 0 1 1 1 1 0
gene 3    0 0 1 0 0 0 0 1
gene 4    0 0 0 0 0 1 0 1

model M = [Figure: network over genes 1, 2, 3, and 4]

Pr{D|M} = Pr{G1} x Pr{G2|G1} x Pr{G4|G3,G2} x Pr{G3|G1}

where Gi is the row of expression values for gene i
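A sketch of how this factorization could be computed for an arbitrary network follows. The parent assignment (gene 1 regulates genes 2 and 3; genes 3 and 2 regulate gene 4) is read off the factorization on this slide, and the gene 1 row is taken as 1 1 0 0 1 1 1 0. The function name `network_likelihood` and the `parents` dictionary layout are illustrative choices, not part of the original material.

```python
from collections import Counter
from math import prod


def network_likelihood(data, parents):
    """Pr{D|M}: product over genes of Pr{gene row | parent rows}, by counting."""
    total = 1.0
    for gene, values in data.items():
        # One tuple of parent values per condition; () for unregulated genes.
        states = list(zip(*(data[p] for p in parents.get(gene, ()))))
        if not states:
            states = [()] * len(values)
        joint = Counter(zip(values, states))
        state_counts = Counter(states)
        total *= prod(joint[(v, s)] / state_counts[s] for v, s in zip(values, states))
    return total


data = {
    1: [1, 1, 0, 0, 1, 1, 1, 0],
    2: [0, 1, 0, 1, 1, 1, 1, 0],
    3: [0, 0, 1, 0, 0, 0, 0, 1],
    4: [0, 0, 0, 0, 0, 1, 0, 1],
}
parents = {2: (1,), 3: (1,), 4: (3, 2)}  # assumed reading of the slide's network

print(network_likelihood(data, parents))
print(network_likelihood(data, {}))  # all four genes independent
```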

Adding Prior Knowledge

We can also build in any prior knowledge we have about the proper model (e.g., from the literature)

We can use that knowledge by simply multiplying each likelihood by our prior confidence in its validity:

Pr{D|M} x Pr{e1} x Pr{e2} x Pr{e3} x Pr{e4} x …

where D is the four-gene expression matrix, M is the network over genes 1, 2, 3, and 4, and each Pr{ei} is our prior confidence in one pairwise relationship between genes in M

[Figure: each prior term is drawn as a small diagram of a pair of genes]
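As a small illustration of how priors change a comparison, the sketch below multiplies a likelihood by per-edge prior confidences. The likelihood 6.1 x 10^-5 is the value reported earlier for both two-gene models with an edge; the prior values 0.8 and 0.2 are hypothetical, chosen only to show the effect, and `score_with_priors` is an illustrative name.

```python
def score_with_priors(likelihood, edge_priors):
    """Multiply a model's likelihood by the prior confidence in each of its edges."""
    score = likelihood
    for prior in edge_priors.values():
        score *= prior
    return score


# Hypothetical priors: suppose the literature favors gene 1 regulating gene 2.
g1_regulates_g2 = score_with_priors(6.1e-5, {"1->2": 0.8})
g2_regulates_g1 = score_with_priors(6.1e-5, {"2->1": 0.2})

print(g1_regulates_g2 > g2_regulates_g1)  # True
```

With equal likelihoods, the prior is the only thing that separates the two directions.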

Adding in Other Data Types

We can also incorporate other pieces of evidence in much the same way

Example: suppose we have microarrays and TF binding site predictions:

Pr{D, ACGATCTCA… | M} = Pr{D | M} x Pr{ACGATCTCA… | M}

where D is the expression data for gene 1 (1 1 0 0 1 1 1 0) and gene 2 (0 1 0 1 1 1 1 0) and M is a network model over genes 1 and 2

Pr{D | M}: evaluate as before

Pr{ACGATCTCA… | M}: evaluate by a binding site prediction method (e.g., PSSM)

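Moving from Discrete to Real-Valued Data

We can also drop the need for discrete (on or off) data by making an assumption of how values vary in the absence of regulation, e.g., Gaussian:

[Figure: Gaussian curve with axis ticks at -1, 0, and 1]

$$\Pr\{1.5,\ 0.4,\ -0.3,\ -1.2\} = \frac{1}{\sqrt{2\pi}} e^{-(1.5)^2/2} \times \frac{1}{\sqrt{2\pi}} e^{-(0.4)^2/2} \times \frac{1}{\sqrt{2\pi}} e^{-(-0.3)^2/2} \times \frac{1}{\sqrt{2\pi}} e^{-(-1.2)^2/2}$$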
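The Gaussian version of the likelihood can be evaluated directly. This is a minimal sketch assuming a standard normal (mean 0, variance 1) for unregulated values, which is an assumption about the slide's example. Strictly speaking, the result is a product of densities, not a probability.

```python
from math import exp, pi, prod, sqrt


def standard_normal_density(x):
    """Density of a Gaussian with mean 0 and variance 1 at x."""
    return exp(-x * x / 2) / sqrt(2 * pi)


values = [1.5, 0.4, -0.3, -1.2]
likelihood = prod(standard_normal_density(x) for x in values)
print(likelihood)
```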

Finding the Best Model

We now know how to compare different network models, but finding the best model is not easy; there are far too many possibilities to compare them all

Algorithms for model inference are a more complex topic than we can cover here, but there are some general approaches to be aware of:

optimization: many specialized methods exist for finding the best model without trying everything; solving hard problems of this type is a core concern in computer science

sampling: there are also many specialized methods for randomly generating solutions likely to be “good” and seeing what model features are preserved across most solutions; this is a core concern of statisticians
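For a network as small as the four-gene example, it is still feasible to try everything, which makes the scale of the problem concrete. The sketch below enumerates every acyclic network over four genes and scores each by likelihood using counting estimates. It is not one of the specialized optimization or sampling methods mentioned above; it is a brute-force baseline, with all function names chosen for illustration and the gene 1 row taken as 1 1 0 0 1 1 1 0.

```python
from collections import Counter
from itertools import combinations, permutations
from math import prod

DATA = {
    1: [1, 1, 0, 0, 1, 1, 1, 0],
    2: [0, 1, 0, 1, 1, 1, 1, 0],
    3: [0, 0, 1, 0, 0, 0, 0, 1],
    4: [0, 0, 0, 0, 0, 1, 0, 1],
}


def family_likelihood(child, parent_ids):
    """Pr{child row | parent rows}, estimated by counting."""
    values = DATA[child]
    states = list(zip(*(DATA[p] for p in parent_ids))) or [()] * len(values)
    joint, totals = Counter(zip(values, states)), Counter(states)
    return prod(joint[(v, s)] / totals[s] for v, s in zip(values, states))


def likelihood(edges):
    """Pr{D|M} for the network given as a collection of (regulator, target) edges."""
    return prod(
        family_likelihood(gene, tuple(src for src, dst in edges if dst == gene))
        for gene in DATA
    )


def is_acyclic(edges):
    """Repeatedly strip genes with no incoming edges; a leftover means a cycle."""
    nodes, remaining = set(DATA), set(edges)
    while nodes:
        sources = {n for n in nodes if all(dst != n for _, dst in remaining)}
        if not sources:
            return False
        nodes -= sources
        remaining = {(src, dst) for src, dst in remaining if src not in sources}
    return True


all_edges = list(permutations(DATA, 2))  # 12 possible directed edges
dags = [
    edges
    for k in range(len(all_edges) + 1)
    for edges in combinations(all_edges, k)
    if is_acyclic(edges)
]
best = max(dags, key=likelihood)
print(len(dags), "acyclic models scored")
print("highest-likelihood edges:", best)
```

One caveat: with counting estimates, adding a regulator never lowers the likelihood, so the highest-scoring network found this way tends to be densely connected. This is one reason prior probabilities matter in practice.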

Network Inference in Practice

The methods covered here are the key ideas behind how people really infer networks from complex data

The practice is usually more complicated, though: many kinds of data sources, specialized prior probabilities, lots of algorithmic tricks needed to get good results

If you really want to know the details, these topics are typically covered in a class on machine learning