
This presentation responds to the need to analyze genomic expression data with models that permit latent variables, describe complex relationships, and can be scored rigorously against observational data.

Introduction: motivation, current models, new techniques.

Bayesian networks: modeling regulatory networks with them and scoring them using the Bayesian scoring metric.

Example: the galactose system.

Representing models with annotated edges.

Scoring annotated models of the galactose system.

Conclusions.

Motivation: the vast quantity of data generated by gene arrays creates an opportunity for research in fields such as medicine, biology and pharmacology.

Potential applications: understanding basic cellular processes, diagnosing and treating disease, and designing targeted therapeutics.

Data from expression arrays is inherently noisy.

Our knowledge of genetic regulatory networks is extremely limited, so any hypothesis about their structure or function may be incomplete.

Gene expression is regulated in a complex and combinatorial manner, yet most analysis of expression array data uses only pairwise measures.

Analysis is typically performed by clustering the expression profiles of a collection of genes using pairwise measures such as:

Correlation: given two data vectors, normalize them and take their dot product.

Euclidean distance: the square root of the sum of the squared differences in each dimension.

Mutual information: an information-theoretic measure of how much information A contains about B (and vice versa). It uses a discretized model, partitioning the expression levels into bins, and finds pairs of genes that have a one-to-one mapping under a permutation of the bin numbers.

Clusters can also be identified through common sequence motifs (see the previous presentations).
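The three pairwise measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented here, the correlation is the Pearson form (centre, scale, dot product), and the mutual information expects profiles that have already been discretized into bins.

```python
import math
from collections import Counter

def correlation(x, y):
    """Pearson correlation: centre and scale both vectors, then take the dot product."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    norm = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    return sum(a * b for a, b in zip(dx, dy)) / norm

def euclidean(x, y):
    """Square root of the sum of squared differences in each dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mutual_information(x_bins, y_bins):
    """Mutual information (in bits) between two discretized expression profiles."""
    n = len(x_bins)
    px, py = Counter(x_bins), Counter(y_bins)
    pxy = Counter(zip(x_bins, y_bins))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())
```

Two binned profiles such as [0, 0, 1, 1] and [1, 1, 0, 0] are related by a permutation of the bin numbers, so their mutual information is maximal (1 bit) even though the raw bin labels disagree everywhere.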

Data: a microarray of 8,600 human genes, grouped by expression level into the 5 clusters shown.

Noise in expression array data is typically not analyzed in detail, so the significance of alternative conclusions from these studies cannot be compared quantitatively.

Clustering does not permit models with latent variables, that is, variables that describe an unobserved value (such as protein levels) and allow predictions that can be verified later as data becomes available.

Employing Boolean models that are restricted to logical relationships between variables:

A graph $G(V,E)$ annotated with a set of states $X = \{x_i \mid i = 1, 2, \ldots, n\}$, together with a set of Boolean functions $B = \{b_i \mid i = 1, 2, \ldots, k\}$, $b_i : \{0,1\}^k \to \{0,1\}$.

Gate: each node $v_i$ has an associated function whose inputs are the states of the nodes connected to $v_i$.

The state of node $v_i$ at time $t$ is denoted $x_i(t)$. The state of that node at time $t+1$ is then given by $x_i(t+1) = b_i(x_{i1}, x_{i2}, \ldots, x_{ik})$, where the $x_{ij}$ are the states of the nodes connected to $v_i$. This is described in detail in Andrey’s presentation.

A network of three nodes: a, b and c. The expression of c depends directly on the expression of b, which depends directly on a. The influence of b and c on a is more complex: for example, a high level of expression of both b and c inhibits a.
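A synchronous Boolean update for a three-node network of this kind might look as follows. The concrete rules are an assumption made for illustration (b copies a, c copies b, and a is switched off only when both b and c are on); the slide does not give the exact Boolean functions.

```python
def step(state, rules):
    """Synchronous update: every node reads the states at time t."""
    return {node: rule(state) for node, rule in rules.items()}

# Hypothetical rules, chosen only to match the verbal description.
rules = {
    "a": lambda s: not (s["b"] and s["c"]),  # high b and c inhibit a
    "b": lambda s: s["a"],                   # b follows a
    "c": lambda s: s["b"],                   # c follows b
}

state = {"a": True, "b": False, "c": False}
trajectory = [state]
for _ in range(5):
    state = step(state, rules)
    trajectory.append(state)
```

Under these assumed rules the network returns to its starting state after five steps, illustrating how purely logical dependencies already produce cyclic behaviour.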

Bayesian networks are used to describe relationships between variables in a genetic regulatory network.

They describe arbitrary combinatorial control of gene expression and are not limited to pairwise interactions between genes.

They are useful in describing processes composed of locally interacting components (the value of each component depends directly on the values of a relatively small number of components).

They provide models of causal influence: by modeling the mechanism that generated the dependencies, they help predict the effect of an intervention in the domain, that is, setting the value of a variable in such a way that the manipulation itself does not affect the other variables.

Owing to their probabilistic nature, they are robust in the face of imperfect data or an imperfect model (small variations or noise do not change the outcome of the network much).

They permit latent variables that capture unobserved values.

The variables can be either discrete or continuous, and can represent mRNA concentration, protein concentration or genotypic information.

A variable describing an observed value is an information variable; a variable describing an unobserved value is a latent variable.

A Bayesian network describes the relationships between variables at both a qualitative and a quantitative level.

At a qualitative level, the relationships between variables are dependence and conditional independence, encoded in the structure of a directed acyclic graph G:

The vertices correspond to variables.

Directed edges represent dependencies between variables.

At a quantitative level, the relationships between variables are described by a family of joint probability distributions that are consistent with the independence assertions embedded in the graph.

Under the Markov assumption, each variable $X$ is independent of its non-descendants given its parents in $G$.

General formula for the joint probability distribution:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$$

where $\mathrm{Pa}(X_i)$ denotes the parents of $X_i$ in $G$.
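Under the Markov assumption the joint distribution is the product of each variable's conditional distribution given its parents. A minimal numerical sketch for a chain a -> b -> c of binary variables follows; all probability values are hypothetical and chosen only to show that the product of local tables defines a proper joint distribution.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain a -> b -> c.
p_a = {0: 0.7, 1: 0.3}                                    # P(a)
p_b_given_a = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}  # P(b | a), indexed [a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # P(c | b), indexed [b][c]

def joint(a, b, c):
    """P(a, b, c) = P(a) * P(b | a) * P(c | b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Summing over all eight assignments should give 1.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
```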

Discrete variables: when the variables take values from a finite set, $P(X \mid U_1, U_2, \ldots, U_k)$ can be represented as a table that specifies the probability of each value of $X$ for every joint assignment of its parents. The number of free parameters is exponential in the number of parents.
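The exponential growth of the table can be made concrete with a small count. The helper name `free_parameters` is an illustrative assumption; it uses the standard fact that each row of a conditional probability table sums to one, so a variable with r states contributes r - 1 free parameters per joint assignment of its parents.

```python
from math import prod

def free_parameters(child_states, parent_states):
    """Free parameters of a table P(X | U1..Uk): (r - 1) per parent configuration."""
    return (child_states - 1) * prod(parent_states)
```

For a binary variable, 2 binary parents need 4 parameters while 10 binary parents need 1024, which is why networks with small parent sets are so much easier to estimate from limited data.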

Continuous variables: no single representation can capture all possible densities. $P(X \mid U_1, U_2, \ldots, U_k)$ can be represented using a Gaussian conditional density:

$$P(X \mid u_1, \ldots, u_k) \sim N\left(a_0 + \sum_{i} a_i u_i,\ \sigma^2\right)$$

The mean depends linearly on the values of the parents. The variance is independent of the parents’ values.
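A linear Gaussian conditional density of this form can be evaluated directly. The function name and argument layout below are assumptions made for illustration; `a0` is the intercept, `a` the list of per-parent coefficients and `sigma` the standard deviation, which does not depend on the parents.

```python
import math

def linear_gaussian_pdf(x, parents, a0, a, sigma):
    """Density of X at x when X ~ N(a0 + sum(a_i * u_i), sigma^2)."""
    mean = a0 + sum(ai * ui for ai, ui in zip(a, parents))
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
```

Changing the parent values only shifts the mean of the bell curve; its width stays fixed.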

Hybrid networks: a mixture of discrete and continuous variables.

More than one graph can imply the same set of independencies.

Example: the two-node graphs X → Y and Y → X both have ind(G) = ∅.

Two graphs G, G’ are equivalent if ind(G) = ind(G’).

Two DAGs are equivalent if and only if they have the same underlying undirected graph and the same v-structures (pairs of directed edges converging on the same node).

[Figure: two equivalent two-node graphs over X and Y, with the edge pointing in opposite directions.]
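The equivalence criterion (same underlying undirected graph, same v-structures) is easy to check mechanically. The sketch below is an assumption-laden illustration: graphs are given as lists of (parent, child) pairs over the same node set, the function names are invented here, and a v-structure is counted only when the two parents are not themselves adjacent, which is the usual form of the theorem.

```python
from itertools import combinations

def skeleton(edges):
    """Underlying undirected graph: edge directions are dropped."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Triples (a, child, b) with a -> child <- b and a, b not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def equivalent(edges1, edges2):
    """True if both DAGs imply the same set of independencies."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))
```

For example, X -> Y and Y -> X are equivalent, while X -> Z <- Y is not equivalent to the chain X -> Z -> Y even though both share the same undirected graph.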

The approach to analyzing gene expression data using Bayesian network learning techniques is as follows:

Our modeling assumptions are presented.

Probability distributions over all possible states of the system are considered.

A state of the system is described using random variables.

Each random variable describes one of:

The expression level of an individual gene.

An experimental condition.

A temporal indicator (the time or stage at which the sample was taken).

A background variable (for example, which clinical procedure was used to obtain a biopsy sample).

Given all the states and the samples, a scoring technique is used to find the model that best matches the data (similar to the method in Inbar’s presentation).

Once such a model is found, queries about the system can be answered.

Given a database $D = \{X_1, \ldots, X_n\}$ of $n$ samples, where each sample $X_i = (X_{i1}, \ldots, X_{im})$ assigns values to $m$ variables, the task is to find a network $B = \langle G, \Theta \rangle$ that best matches $D$.

A likelihood function is calculated:

$$P(D \mid G) = \int P(D \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta$$

$$P(D \mid G, \Theta) = \prod_{h=1}^{n} P(C_h \mid G, \Theta)$$

$C_1, \ldots, C_n$ are the cases in the data set (under the assumption that cases occur independently).

$P(\Theta \mid G)$: the prior density of the probability assignments $\Theta$ given the graph structure.

Calculating the score of the model:

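$$\mathrm{BayesianScore}(G) = \log P(D \mid G) + \log P(G) + C$$

Here $C$ is a constant that does not depend on $G$ (it is distinct from the cases $C_h$ above).

$P(G)$: the prior, that is, the probability that a network has graph structure $G$.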
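For fully observed discrete data the marginal likelihood P(D | G) has a closed form under the assumptions of Cooper and Herskovits (listed in the references): a uniform prior over each conditional probability table, independent cases and no missing values. The sketch below computes its logarithm; the function name and data layout are illustrative assumptions, and it does not cover latent variables, which need the sampling or variational methods discussed later.

```python
from collections import defaultdict
from math import lgamma

def log_marginal_likelihood(data, parents, arity):
    """log P(D | G) for complete discrete data with uniform parameter priors.

    data:    list of dicts mapping variable name -> observed state
    parents: dict mapping variable name -> tuple of parent names (the graph G)
    arity:   dict mapping variable name -> number of states r
    Per variable and parent configuration j the contribution is
    log[(r - 1)! / (N_j + r - 1)!] + sum_k log(N_jk!).
    """
    total = 0.0
    for var, pa in parents.items():
        counts = defaultdict(lambda: defaultdict(int))
        for row in data:
            counts[tuple(row[p] for p in pa)][row[var]] += 1
        r = arity[var]
        for by_state in counts.values():
            n_j = sum(by_state.values())
            total += lgamma(r) - lgamma(n_j + r)
            total += sum(lgamma(n + 1) for n in by_state.values())
    return total
```

Adding log P(G) to this value gives the Bayesian score up to the constant; with a uniform structure prior, comparing two graphs reduces to comparing their marginal likelihoods.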

With 3 variables there are 25 possible belief-network structures (DAGs). Assuming that all of them are equally likely, the prior probability is uniform: $P(G) = 1/25$.
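The count of 25 can be verified by brute force: enumerate every set of directed edges over three labelled nodes and keep the acyclic ones. The helper names are illustrative; the enumeration is only practical for a handful of nodes, since the number of DAGs grows super-exponentially.

```python
from itertools import combinations, permutations

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes with no incoming edge; a cycle leaves nodes behind."""
    remaining = set(nodes)
    while remaining:
        sources = {n for n in remaining
                   if not any(v == n and u in remaining for u, v in edges)}
        if not sources:
            return False
        remaining -= sources
    return True

def count_dags(nodes):
    """Number of directed acyclic graphs over the given labelled nodes."""
    possible = list(permutations(nodes, 2))  # every ordered pair (u, v)
    return sum(1
               for k in range(len(possible) + 1)
               for edges in combinations(possible, k)
               if is_acyclic(nodes, edges))
```

Calling `count_dags(["a", "b", "c"])` gives 25, so a uniform prior assigns each structure probability 1/25.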


Finding the structure G that maximizes the score is NP-hard, so heuristics are needed.

One possibility is a local search that changes one edge at each move: a greedy hill-climbing algorithm that at each step performs the local change yielding the maximal gain, until it reaches a local maximum.

Performs well in practice.
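A greedy hill-climbing search over structures can be sketched as below. The move set (add, delete or reverse a single edge while staying acyclic) is a common choice and is an assumption here, as are the function names; `score` stands for any scoring function, such as the Bayesian score, that maps a set of edges to a number.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes with no incoming edge; a cycle leaves nodes behind."""
    remaining = set(nodes)
    while remaining:
        sources = {n for n in remaining
                   if not any(v == n and u in remaining for u, v in edges)}
        if not sources:
            return False
        remaining -= sources
    return True

def neighbours(nodes, edges):
    """Graphs one edge addition, deletion or reversal away that remain acyclic."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                      # delete u -> v
            flipped = (edges - {(u, v)}) | {(v, u)}     # reverse u -> v
            if is_acyclic(nodes, flipped):
                yield flipped
        elif (v, u) not in edges:
            added = edges | {(u, v)}                    # add u -> v
            if is_acyclic(nodes, added):
                yield added

def hill_climb(nodes, score, start=frozenset()):
    """Take the best single-edge move until no move improves the score."""
    current = frozenset(start)
    best = score(current)
    while True:
        candidates = [(score(g), g) for g in neighbours(nodes, current)]
        if not candidates:
            return current, best
        top_score, top_graph = max(candidates, key=lambda c: c[0])
        if top_score <= best:
            return current, best  # local maximum reached
        current, best = top_graph, top_score
```

The search only guarantees a local maximum; restarting from several random graphs is a common way to reduce the risk of a poor one.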

The score includes an inherent penalty for model complexity, balancing a model’s ability to explain the observed data against its ability to do so economically, and consequently guards against over-fitting models to data.

The model is permitted to be incomplete, containing additional degrees of freedom (while being penalized for them by the scoring metric). Scores improve as a model converges to one without such degrees of freedom.

It allows us to represent uncertainty about the precise dependencies between variables, since the result is a distribution rather than a single value.

Latent variables are instantiated by sampling from the distribution of possible values for each such variable, using MCMC (Markov chain Monte Carlo) methods. This becomes computationally prohibitive as networks become very large.

When $X$ is a latent variable, $E[f(X)] \approx \frac{1}{n} \sum_{t=1}^{n} f(X_t)$, where $f$ is a function of interest regarding $X$ and $n$ is the number of samples.

The law of large numbers ensures that increasing $n$ gives a better approximation of $E[f(X)]$.

This estimate assumes that the samples $\{X_t\}_{t=1}^{n}$ are independent, an assumption that does not hold here.

As a result, a Markov chain is formed, meaning a series of random variables $X_1, \ldots, X_n$ in which each sample is drawn from the distribution $P(X_{t+1} \mid X_t)$: given $X_t$, $X_{t+1}$ does not depend on $\{X_1, \ldots, X_{t-1}\}$.

$P(X_{t+1} \mid X_t)$ is the transition kernel of the chain.

One algorithm for generating $X_1, \ldots, X_n$ is the Metropolis-Hastings algorithm (also written “Hastings-Metropolis”).
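A random-walk Metropolis-Hastings sampler shows how such a chain is built and how the sample average approximates an expectation. This is a generic one-dimensional sketch, not the sampler used for the networks in this presentation: the target (a standard normal, given by its unnormalized log density), the step size and the function name are all assumptions.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)   # depends only on the current state
        log_accept = log_target(proposal) - log_target(x)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal                      # accept the move
        samples.append(x)                     # a rejected move repeats the old state
    return samples

# Estimate E[f(X)] for f(x) = x^2 under a standard normal target.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=50_000)
estimate = sum(s * s for s in samples) / len(samples)
```

Successive samples are correlated, so more of them are needed than with independent draws, but the average still converges to the expectation (here the true value is 1).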

Variational approximation methods can be used, either on their own or in conjunction with sampling, for example "search-based" methods, which consider node instantiations across the entire graph. The general hope in these methods is that a relatively small fraction of the (exponentially many) node instantiations contains a majority of the probability mass, and that by exploring the high probability instantiations (and bounding the unexplored probability mass) one can obtain reasonable bounds on probabilities.

Variational methods also yield upper and lower bounds on the score, often enabling the highest-scoring graph to be identified without resorting to sampling.

Example: a patient with a certain disease took certain pills at some point, which contributed to his eventual death. The question is to learn the probability that he would have stayed alive had he not taken the pills.

Predicting the effects of an intervention in the domain, not only the probability of observations.

If X causes Y, then manipulating the value of X affects the value of Y.

If Y causes X, then manipulating X will not affect Y.

[Figure: two networks over X and Y, one with the edge X → Y and one with the edge Y → X.]

The 2 networks are equivalent as Bayesian networks but not equivalent as causal networks.
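The difference between observing and intervening can be worked through numerically. All probabilities below are hypothetical; the point is only that both causal readings (X causes Y, or Y causes X) fit the same joint distribution, yet predict different outcomes when X is set by intervention.

```python
# Hypothetical joint distribution over binary X and Y.
p_x1 = 0.4
p_y1_given_x = {0: 0.2, 1: 0.9}

joint = {
    (x, y): (p_x1 if x else 1 - p_x1)
            * (p_y1_given_x[x] if y else 1 - p_y1_given_x[x])
    for x in (0, 1) for y in (0, 1)
}

# Observation: identical under both networks, since they encode the same joint.
p_y1 = joint[(0, 1)] + joint[(1, 1)]
p_y1_observe_x1 = joint[(1, 1)] / (joint[(1, 0)] + joint[(1, 1)])

# Intervention: set X = 1 from outside the system.
p_y1_do_x1_if_x_causes_y = p_y1_given_x[1]  # Y responds to the forced X
p_y1_do_x1_if_y_causes_x = p_y1             # Y keeps its own distribution
```

With these numbers, seeing X = 1 raises the probability of Y = 1 to 0.9 in both models, but forcing X = 1 leaves it at 0.48 if the true direction is Y -> X.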

A causal network can be interpreted as a Bayesian network under the causal Markov assumption:

Given the values of a variable’s immediate causes, it’s independent of its earlier causes.

The example is the genetic regulatory network responsible for controlling the genes necessary for galactose metabolism.

This is a fairly well understood system in yeast and so allows us the opportunity to evaluate our methodology in a setting where we can rely on accepted fact.

Example of genetic regulatory networks represented as Bayesian networks.

Boxed variables: mRNA levels that can be determined from expression array data.

Unboxed variables – protein levels. In this model they are treated as latent variables whose values cannot be measured directly.

The 2 networks represent two competing models of a portion of the galactose system in yeast; they differ in the dependence relationships they posit between the variables Gal80p, Gal4m and Gal4p.

Conclusions based on previous research:

It was originally proposed that Gal80 protein is a repressor of Gal4 transcription –shown in M1.

It is now clear that Gal4 is expressed constitutively and that its activity is inhibited by Gal80 protein – shown in M2.

Expression data for this analysis consisted of 52 genomes’ worth of Affymetrix S. cerevisiae GeneChip data.

The Bayesian scoring metric was used to evaluate these two competing networks.

Binary quantization was performed independently for each gene using a maximum likelihood separation technique.

Scoring results:

The model M1, in which Gal80p represses transcription of Gal4m, received a score of -44.0, while the model M2, in which Gal80p inhibits Gal4p activity, received a score of -34.5.

The score difference translates to the data being over 13,000 times more likely to be observed under M2 than under M1.
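The factor of 13,000 follows from the score difference if the scores are natural logarithms of probabilities, an assumption that is consistent with the quoted figure. A quick check:

```python
import math

score_m1 = -44.0
score_m2 = -34.5

# The difference of log scores is the log of the likelihood ratio.
ratio = math.exp(score_m2 - score_m1)  # e^9.5, roughly 13,360
```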

The score of the more complex model M1 or M2 was –35.4, lower than that of the currently accepted model.

The simplified versions of M1 and M2

Scores for equivalence classes of the three-variable galactose system

The models fall into two primary groupings based on their score:

Those that include an edge between Gal80 and Gal2, which score between -34.1 and -35.4.

Those that don’t, which score between -42.2 and -44.

This supports the claim that Gal80 and Gal2 are unlikely to be conditionally independent given Gal4.

The Bayesian network model is extended with the ability to annotate edges, in order to represent additional information about the type of dependence relationship between variables.

4 types of annotations in the context of binary variables:

An unannotated edge from X to Y: a dependence that can be arbitrary.

A positive edge from X to Y: higher values of X bias the distribution of Y higher. For all instantiations $\mathbf{u}$ of the other parents of Y, $P(Y=1 \mid X=1, \mathbf{u}) \ge P(Y=1 \mid X=0, \mathbf{u})$.

A negative edge from X to Y: higher values of X bias the distribution of Y lower. For all instantiations $\mathbf{u}$ of the other parents of Y, $P(Y=1 \mid X=0, \mathbf{u}) \ge P(Y=1 \mid X=1, \mathbf{u})$.

A negative/positive edge from X to Y: Y’s dependence on X is either positive or negative, but which of the two is unknown.

Edge annotations describe the relationship between a variable and a single parent, whereas Bayesian networks describe the relationship between a variable and all its parents; an added requirement is therefore that the implied constraints hold for all possible values of the other parents.
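Checking whether a conditional probability table respects an edge annotation amounts to comparing P(Y=1 | X=1, others) with P(Y=1 | X=0, others) for every assignment of the other parents. The sketch below assumes binary variables, a non-strict inequality, and a table keyed by tuples of parent values; the function name and layout are illustrative.

```python
from itertools import product

def satisfies_annotation(cpt, parent_index, annotation):
    """cpt maps a tuple of binary parent values to P(Y=1 | parents).

    annotation is '+', '-', '+/-' or None (unannotated edge).
    """
    n_parents = len(next(iter(cpt)))
    up = down = True
    for others in product((0, 1), repeat=n_parents - 1):
        low = others[:parent_index] + (0,) + others[parent_index:]
        high = others[:parent_index] + (1,) + others[parent_index:]
        up = up and cpt[high] >= cpt[low]      # positive in this context
        down = down and cpt[high] <= cpt[low]  # negative in this context
    return {"+": up, "-": down, "+/-": up or down, None: True}[annotation]
```

A table in which X raises Y for one value of the other parent and lowers it for the other satisfies none of the three annotations, so only an unannotated edge can describe it.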

Advantages to this extended model:

Allows us to represent finer degrees of refinement regarding the types of relationships between variables but doesn’t force us to.

Permits a model to evolve as more knowledge is gained about the types of influences that are present in the biological system under study: all edges can be initially unannotated, with ‘+’ and ‘-’ annotations added incrementally as activators and repressors are identified.

In the models M1 and M2, we allow the edges in each model to take on all possible combinations of annotations (‘+’, ‘-’, ‘+/-’):

Results:

In model M1 adding different kinds of annotations fails to change the score significantly, as the structure of the graph is limited in explaining the observed expression data.

The same effect is observed when the edge between Gal4 and Gal2 is considered in model M2; this is consistent with the result of figure 3, indicating that the coupling between Gal4 and Gal2 is weak.

In contrast, adding a ‘+’ annotation to the edge between Gal80 and Gal2 results in a score comparable with previously achieved results, but adding a ‘-’ annotation to the same edge worsens the score.

Conclusions:

This example illustrates that when the constraints implied by edge annotations cannot be satisfied by the data, scores result that are as poor as when the given structure is incorrect.

For this reason, annotations serve as a useful discriminator of the kinds of relationships present in the data.

Precautions when interpreting results:

Although Gal80 is known to act as a repressor in the cell, this effect is countered by the level of a factor that is currently unknown and remains unmodeled here.

A complete model would include the effect of this latent variable, so in such a model, given sufficient data, the edge between Gal80 and Gal2 would be labeled ‘-’.

Nevertheless, in the limited model here, a ‘+’ annotation for this edge is correct, as the level of Gal80 rises concomitantly with the level of Gal2 in our data.

While Bayesian networks are well suited to dealing robustly with noisy data, as noise increases, the score difference between correct and incorrect models goes down.

In the case of uninformative data, correct models will score as poorly as incorrect ones.

The ability of particular data to enhance the score difference between models suggests the possibility of performing “experimental suggestion” in the future, meaning that models could be used to generate suggestions for new experiments, yielding data that would optimally elucidate a given network.

Disadvantages:

A danger of drawing wrong conclusions given bad or scarce data or an incomplete model.

Given a small amount of data, it is difficult to find a model that expresses all features.

Reliance on assumptions such as the prior probability used in computing the score.

Reliance on assumptions such as the independence of all experiments performed.

“Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks”, Alexander J. Hartemink, David K. Gifford, Tommi S. Jaakkola, Richard A. Young.

“Using Bayesian Networks to Analyze Expression Data”, Nir Friedman, Iftach Nachman, Michal Linial, Dana Pe’er.

“A Bayesian Method for the Induction of Probabilistic Networks from Data”, Gregory F. Cooper, Edward Herskovits.

The presentation deals with using a model with a probabilistic nature to analyze genomic expression data for use with genetic regulatory networks that can be represented in a computational form.

First, an introduction presents the motivation for such a model, the obstacles previous models face in analyzing genomic expression data, the existing techniques, and an example of a previous model.

Then the new model, the Bayesian network, is introduced: it is defined, its characteristics are described, and a scoring method is presented for finding the network that best matches given expression data, together with the method's advantages and properties (such as its handling of latent variables).

Next, an example is described that applies this network to genomic expression data from genes involved in galactose metabolism in S. cerevisiae.

Afterwards, the network semantics are extended to include annotated edges, and the previous example of the galactose system is revisited with the modified network.

The presentation closes with conclusions regarding the use of the new model.

Date post:	02-Jan-2016
Upload:	cuthbert-willis