Problem Limited number of experimental replications. Postgenomic data intrinsically noisy. Poor...


Problem

• Limited number of experimental replications.

• Postgenomic data intrinsically noisy.

• Poor network reconstruction.

Problem

• Limited number of experimental replications.

• Postgenomic data intrinsically noisy.

• Can we improve the network reconstruction by systematically integrating different sources of biological prior knowledge?

• Which sources of prior knowledge are reliable?

• How do we trade off the different sources of prior knowledge against each other and against the data?

Overview of the talk

• Revision: Bayesian networks

• Integration of prior knowledge

• Empirical evaluation

Bayesian networks

• Marriage between graph theory and probability theory.

• Directed acyclic graph (DAG) representing conditional independence relations.

• It is possible to score a network in light of the data: P(D|M), D: data, M: network structure.

• We can infer how well a particular network explains the observed data.

P(A,B,C,D,E,F) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|D) P(F|C,D)
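A minimal sketch of how such a factorisation is evaluated, assuming binary nodes and the parent sets A: none, B: A, C: A, D: B and C, E: D, F: C and D. The conditional probability values in `p_true` are made up for illustration and do not come from the talk.

```python
from itertools import product

# Parent sets of the six-node example DAG (assumed: A has no parents,
# B and C depend on A, D on B and C, E on D, F on C and D).
PARENTS = {
    "A": (),
    "B": ("A",),
    "C": ("A",),
    "D": ("B", "C"),
    "E": ("D",),
    "F": ("C", "D"),
}


def p_true(node, parent_values):
    """Hypothetical local model: P(node = 1 | parents). Numbers are invented."""
    return 0.2 + 0.6 * sum(parent_values) / max(len(parent_values), 1)


def joint(assignment):
    """Joint probability as the product of one local term per node."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p1 = p_true(node, [assignment[p] for p in parents])
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob


# Summing the joint over all 2^6 configurations should give 1.
total = sum(
    joint(dict(zip(PARENTS, values)))
    for values in product((0, 1), repeat=len(PARENTS))
)
```

Each node contributes exactly one factor that depends only on its parents, which is why the joint can be stored as six small tables rather than one table with 2^6 entries.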

Bayesian networks versus causal networks

Bayesian networks represent conditional (in)dependence relations - not necessarily causal interactions.

True causal graph

Node A unknown

• Equivalence classes: networks with the same scores: P(D|M).

• Equivalent networks cannot be distinguished in light of the data.

Symmetry breaking

Prior knowledge

P(M|D) = P(D|M) P(M) / Z

D: data. M: network structure

P(D|M)

Prior knowledge:

B is a transcription factor with binding sites in the upstream regions of A and C

P(M|D) ~ P(D|M) P(M)
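A small numerical sketch of the symmetry-breaking idea, assuming three equivalent three-node networks over A, B and C. The log-likelihood value and the prior probabilities are invented for illustration; only the mechanism (equal P(D|M), unequal P(M)) reflects the slide.

```python
import math


def posterior(log_likelihood, log_prior):
    """Normalise P(D|M) P(M) over an explicitly listed set of models."""
    log_joint = {m: log_likelihood[m] + log_prior[m] for m in log_likelihood}
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    z = sum(math.exp(v - top) for v in log_joint.values())
    return {m: math.exp(v - top) / z for m, v in log_joint.items()}


# Three networks from one equivalence class: same skeleton, no v-structure.
models = ["A<-B->C", "A->B->C", "A<-B<-C"]
log_lik = {m: -120.0 for m in models}  # made-up value, equal by equivalence

flat_prior = {m: math.log(1.0 / 3.0) for m in models}
# Hypothetical prior encoding "B is a transcription factor regulating A and C".
informed_prior = {
    "A<-B->C": math.log(0.8),
    "A->B->C": math.log(0.1),
    "A<-B<-C": math.log(0.1),
}

post_flat = posterior(log_lik, flat_prior)
post_informed = posterior(log_lik, informed_prior)
```

With the flat prior the three networks stay tied; with the informed prior the posterior simply reproduces the prior, because the data cannot separate members of an equivalence class.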

Learning Bayesian networks

P(M|D) = P(D|M) P(M) / Z

M: Network structure. D: Data

Use TF binding motifs in promoter sequences

Biological prior knowledge matrix

Biological Prior Knowledge

Indicates some knowledge about the relationship between genes i and j

Biological prior knowledge matrix

Biological Prior Knowledge

Define the energy of a Graph G

Indicates some knowledge about the relationship between genes i and j
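The energy formula itself did not survive in this transcript. One plausible form, assumed here purely for illustration, is the total absolute deviation between the prior knowledge matrix B (entries in [0, 1], with 0.5 meaning "no knowledge") and the adjacency matrix G of the graph. The example matrices are invented.

```python
def energy(B, G):
    """Assumed energy: sum over ordered pairs i != j of |B[i][j] - G[i][j]|.

    B[i][j] is the prior belief in an edge i -> j; G[i][j] is 1 if the graph
    contains that edge and 0 otherwise.
    """
    n = len(G)
    return sum(
        abs(B[i][j] - G[i][j])
        for i in range(n)
        for j in range(n)
        if i != j
    )


# Hypothetical prior: edges 0 -> 1 and 1 -> 2 are believed, their reverses are not.
B = [
    [0.0, 0.9, 0.5],
    [0.1, 0.0, 0.9],
    [0.5, 0.1, 0.0],
]
G_agree = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]     # 0 -> 1 -> 2
G_disagree = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]  # 2 -> 1 -> 0

e_agree = energy(B, G_agree)        # 1.4
e_disagree = energy(B, G_disagree)  # 4.6
```

Under this assumption a graph that matches the prior knowledge has low energy, so a prior of the form exp(-β · energy) favours it.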

Notation

• Prior knowledge matrix:

P B (for “belief”)

• Network structure:

G (for “graph”) or M (for “model”)

• P: Probabilities

Prior distribution over networks

Energy of a network

Sample networks and hyperparameters from the posterior distribution

• Capture intrinsic inference uncertainty

• Learn the trade-off parameters automatically

P(M|D) = P(D|M) P(M) / Z

Prior distribution over networks

Energy of a network

Rewriting the energy

Energy of a network

Approximation of the partition function

Partition function of a perfect gas
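A sketch of what such an approximation can look like, under two assumptions that are not stated in the transcript: the energy is a sum of absolute deviations between B and G, and the acyclicity constraint is ignored so that the nodes behave like non-interacting particles. The partition function then factorises into one sum over parent sets per node. The function names and the example matrix are hypothetical.

```python
import math
from itertools import combinations, product


def local_energy(B, node, parents):
    """Energy contribution of one node given its parent set (assumed form)."""
    return sum(
        (1.0 - B[m][node]) if m in parents else B[m][node]
        for m in range(len(B))
        if m != node
    )


def log_partition_factorised(B, beta, max_fan_in=None):
    """log Z as a sum over nodes of the log of a sum over parent sets."""
    n = len(B)
    if max_fan_in is None:
        max_fan_in = n - 1
    log_z = 0.0
    for node in range(n):
        others = [m for m in range(n) if m != node]
        terms = [
            math.exp(-beta * local_energy(B, node, set(ps)))
            for k in range(max_fan_in + 1)
            for ps in combinations(others, k)
        ]
        log_z += math.log(sum(terms))
    return log_z


def log_partition_brute_force(B, beta):
    """log Z by enumerating every directed graph without self-loops (cycles allowed)."""
    n = len(B)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    total = 0.0
    for bits in product((0, 1), repeat=len(pairs)):
        e = sum(abs(B[i][j] - g) for (i, j), g in zip(pairs, bits))
        total += math.exp(-beta * e)
    return math.log(total)


B = [
    [0.0, 0.9, 0.5],
    [0.1, 0.0, 0.9],
    [0.5, 0.1, 0.0],
]
lz_fast = log_partition_factorised(B, beta=2.0)
lz_slow = log_partition_brute_force(B, beta=2.0)
```

For three nodes the two values agree, because the brute-force sum also includes cyclic graphs. The factorised form is an approximation only relative to the sum restricted to DAGs, and it scales to networks where enumeration is impossible.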

Multiple sources of prior knowledge
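A minimal sketch of how several sources could enter the prior, assuming each source k has its own prior knowledge matrix and its own trade-off parameter, and that the energies combine as a weighted sum. The absolute-deviation energy and all numbers are illustrative assumptions.

```python
def energy(B, G):
    """Assumed energy: total absolute deviation between B and G off the diagonal."""
    n = len(G)
    return sum(
        abs(B[i][j] - G[i][j])
        for i in range(n)
        for j in range(n)
        if i != j
    )


def log_prior_unnormalised(G, sources, betas):
    """-(beta_1 * E_1(G) + beta_2 * E_2(G) + ...), omitting the log partition function."""
    return -sum(beta * energy(B, G) for B, beta in zip(sources, betas))


# Two hypothetical sources that disagree about the direction of a single edge.
B1 = [[0.0, 0.9], [0.1, 0.0]]
B2 = [[0.0, 0.2], [0.8, 0.0]]
G = [[0, 1], [0, 0]]  # edge 0 -> 1

lp_both = log_prior_unnormalised(G, [B1, B2], betas=[2.0, 0.5])  # -(2.0*0.2 + 0.5*1.6)
lp_first_only = log_prior_unnormalised(G, [B1, B2], betas=[2.0, 0.0])
```

Setting a trade-off parameter to zero switches its source off entirely, which is the sense in which the inferred parameters can indicate how useful each source is.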

MCMC sampling scheme

Sample networks and hyperparameters from the posterior distribution

Metropolis-Hastings scheme

Proposal probabilities
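A hedged sketch of the two ingredients named on these slides: the Metropolis-Hastings acceptance probability with a proposal (Hastings) correction, and a random-walk update of one trade-off parameter with a uniform prior on [0, beta_max]. The structure moves of the full scheme are not shown. The toy target assumes six potential edges with prior belief 0.9 that are all present in the current graph, together with an absolute-deviation energy and a factorised partition function; none of these specifics come from the transcript.

```python
import math
import random


def mh_accept_prob(log_post_new, log_post_old, log_q_reverse=0.0, log_q_forward=0.0):
    """min(1, posterior ratio * proposal ratio), computed in log space.

    log_q_forward is log Q(new | old); log_q_reverse is log Q(old | new).
    """
    log_ratio = (log_post_new - log_post_old) + (log_q_reverse - log_q_forward)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def sample_beta(log_target, beta0, beta_max, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis for one hyperparameter; out-of-range proposals are rejected."""
    rng = random.Random(seed)
    beta, chain = beta0, []
    for _ in range(n_steps):
        proposal = beta + rng.uniform(-step, step)  # symmetric proposal
        in_range = 0.0 <= proposal <= beta_max
        if in_range and rng.random() < mh_accept_prob(log_target(proposal), log_target(beta)):
            beta = proposal
        chain.append(beta)
    return chain


def toy_log_target(beta):
    """Toy log P(beta | G): -beta * E(G) - log Z(beta) with E = 0.6 and six edges."""
    return -6.0 * math.log1p(math.exp(-0.8 * beta))


chain = sample_beta(toy_log_target, beta0=1.0, beta_max=30.0, n_steps=2000)
```

In this toy setting the graph agrees with the prior, so the target increases with β and the chain drifts towards larger values. A graph that contradicts the prior would pull β towards zero.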

Bayesian networks with biological prior knowledge

• Biological prior knowledge: information about the interactions between the nodes.

• We use two distinct sources of biological prior knowledge.

• Each source of biological prior knowledge is associated with its own trade-off parameter: β1 and β2.

• The trade-off parameter indicates how much biological prior information is used.

• The trade-off parameters are inferred. They are not set by the user!

Bayesian networks with two sources of prior

[Figure: Source 1, Source 2; BNs + MCMC; recovered networks and trade-off parameters]

Evaluation

• Can the method automatically evaluate how useful the different sources of prior knowledge are?

• Do we get an improvement in the regulatory network reconstruction?

• Is this improvement optimal?

Raf regulatory network

From Sachs et al Science 2005

Raf regulatory network

Evaluation: Raf signalling pathway

• Cellular signalling network of 11 phosphorylated proteins and phospholipids in human immune system cells

• Deregulation → carcinogenesis

• Extensively studied in the literature → gold standard network

Data and prior knowledge

Flow cytometry data

• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins

• 5400 cells have been measured under 9 different cellular conditions (cues)

• Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments
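The downsampling step can be sketched as follows, assuming each subset is drawn uniformly at random without replacement from the 5400 measured cells. Whether the original subsets were disjoint or balanced across the 9 conditions is not stated, so neither is enforced here; the function name and seed are hypothetical.

```python
import random


def downsample(n_total, subset_size, n_subsets, seed=0):
    """Draw n_subsets index sets, each sampled without replacement from range(n_total)."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_total), subset_size))
        for _ in range(n_subsets)
    ]


# 5 separate subsets of 100 cells each, taken from 5400 measurements.
subsets = downsample(n_total=5400, subset_size=100, n_subsets=5)
```

Each index list would then select 100 rows from the 5400 x 11 data matrix, giving five small data sets on which the reconstruction can be repeated.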

Microarray examples

• Spellman et al. (1998): cell cycle, 73 samples

• Tu et al. (2005): metabolic cycle, 36 samples

[Two plots, each with a time axis]

Data and prior knowledge

KEGG PATHWAYS are a collection of manually drawn pathway maps representing our knowledge of molecular interactions and reaction networks.

http://www.genome.jp/kegg/

Flow cytometry data and KEGG

Prior knowledge from KEGG

Prior distribution

The data and the priors

+ KEGG

+ Random

Evaluation

[Figure: BNs + MCMC, Source 1, Source 2]

Sampled values of the hyperparameters

Evaluation

How can we evaluate the reconstruction accuracy?

Flow cytometry data and KEGG

Evaluation

Learning the trade-off hyperparameter

• Repeat MCMC simulations for large set of fixed hyperparameters β

• Obtain AUC scores for each value of β

• Compare with the proposed scheme in which β is automatically inferred.
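A sketch of the scoring step, assuming AUC means the area under the ROC curve obtained by ranking candidate edges by their posterior probability and comparing against the gold-standard network. The edge scores below are invented, and the pairwise (Mann-Whitney) formulation is one standard way to compute the quantity, not necessarily the one used in the talk.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of true and false edges.

    scores: posterior probability of each candidate edge.
    labels: 1 if the edge is in the gold-standard network, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one true edge and one non-edge")
    # A true edge ranked above a non-edge counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical posterior edge probabilities from two runs with different fixed β.
labels = [1, 1, 0, 0]
auc_by_beta = {
    0.0: auc([0.9, 0.2, 0.6, 0.4], labels),  # 0.5
    5.0: auc([0.9, 0.7, 0.6, 0.4], labels),  # 1.0
}
```

Repeating this for each fixed β on the grid gives a curve of AUC against β, which can then be compared with the AUC obtained when β is sampled along with the network.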

Mean and standard deviation of the sampled trade-off parameter

Conclusion

• Bayesian scheme for the systematic integration of different sources of biological prior knowledge.

• The method can automatically evaluate how useful the different sources of prior knowledge are.

• We get an improvement in the regulatory network reconstruction.

• This improvement is close to optimal.

Thank you
