Project In Bioinformatics (236524)- Projects Presentation.

Presented by: Ma’ayan Fishelson

Proposed Projects

1. Performing haplotyping on the input data.

2. Creating a friendly user-interface for the statistical genetics program SimWalk2.

3. Performing approximate inference by using a heuristic which ignores extreme markers in the computation.

4. Performing approximate inference via Iterative Join-Graph Propagation.

Haplotyping

• Many applications require haplotype information.

• Unfortunately, the human genome is diploid, and therefore genotype information is collected rather than haplotype information.

Efficient and accurate computational methods for haplotype reconstruction from genotype data are in high demand.

AK1 locus on Chromosome 9 (Lange 97)

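[Figure: a pedigree of five individuals, numbered 1 to 5, typed at the AK1 locus. Observed genotypes: A1/A1, A2/A2, A1/A2, A1/A2, A2/A2. Inferred haplotype information (ordered genotypes): A1 | A2, A1 | A1, A2 | A2, A1 | A2, A2 | A2.]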

Project #1

• Goal of project #1: to perform haplotyping on the input data, i.e. to infer the most likely haplotypes for the individuals in the input pedigrees, via the MPE (Most Probable Explanation) query.


Haplotyping – General Definition

• Input: a set of multilocus phenotypes for the individuals of a pedigree.

– Note: some individuals may be untyped.

• Output: the most likely configuration of haplotypes (ordered genotypes) for the pedigree, i.e. a configuration with maximum probability.

– Note: several configurations may share the maximum probability.
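
To make the term "ordered genotype" concrete, the following Python sketch enumerates the haplotype pairs compatible with one individual's unordered multilocus genotype. It is only an illustration: the function name `ordered_genotypes` and the allele labels are hypothetical, and the sketch ignores the pedigree constraints and probabilities that the actual haplotyping query has to take into account.

```python
import itertools

def ordered_genotypes(genotype):
    """Yield (paternal, maternal) haplotype pairs for an unordered genotype.

    genotype: list of allele pairs, one pair per locus (the order within
    a pair carries no information).
    """
    # At each locus the two alleles can be phased in at most two ways.
    per_locus = [sorted({(a, b), (b, a)}) for a, b in genotype]
    for phase in itertools.product(*per_locus):
        paternal = tuple(p for p, _ in phase)
        maternal = tuple(m for _, m in phase)
        yield paternal, maternal

# Hypothetical three-locus genotype, heterozygous at two loci.
configs = list(ordered_genotypes([("A1", "A2"), ("B1", "B2"), ("C1", "C1")]))
for paternal, maternal in configs:
    print(" ".join(paternal), "|", " ".join(maternal))
```

Each heterozygous locus doubles the number of candidate orderings, so the search space grows quickly with the number of loci and individuals.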

Bayesian Network

• X = {X1,…,Xn} is a set of random variables.

• A BN is a pair (G,P):

– G is a directed acyclic graph over nodes that represent the random variables X.

– P = {Pi | 1 ≤ i ≤ n}. Pi, defined on $A_i \subseteq X$, is the conditional probability table associated with node Xi: Pi = P(Xi | pa(Xi)).

• The BN represents a probability distribution over X.

Bayesian Network - Example

[Figure: a Bayesian network over the nodes A, B, C, D, E, F.]

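Bayesian Network - Example (cont.)

[Figure: the same network, with each node annotated by its conditional probability table: P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C), P(F|E).]

P(A,B,C,D,E,F) = P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C)P(F|E)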
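
As a quick sanity check of the factorization, the Python sketch below multiplies one CPT entry per node and confirms that the resulting joint distribution sums to 1. The binary domains and the randomly generated CPT values are assumptions made for illustration; only the parent sets come from the example.

```python
import itertools
import random

# Parent sets read off the factorization
# P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C)P(F|E).
PARENTS = {"A": (), "B": ("A",), "C": ("A",), "D": ("A", "B"),
           "E": ("B", "C"), "F": ("E",)}

def random_cpt(n_parents, rng):
    """Random binary CPT: maps each parent assignment to P(X=1 | parents)."""
    return {pa: rng.random()
            for pa in itertools.product((0, 1), repeat=n_parents)}

def joint(assignment, cpts):
    """Multiply one CPT entry per node, as in the factorization."""
    p = 1.0
    for node, parents in PARENTS.items():
        p1 = cpts[node][tuple(assignment[q] for q in parents)]
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

rng = random.Random(0)
cpts = {node: random_cpt(len(pa), rng) for node, pa in PARENTS.items()}
total = sum(joint(dict(zip(PARENTS, vals)), cpts)
            for vals in itertools.product((0, 1), repeat=len(PARENTS)))
print(total)  # should be 1 up to floating-point rounding
```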

Moralized Graph

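[Figure: the Bayesian network built by SUPERLINK for three individuals a, b and c over two loci. Locus #1 variables: Ga[1,p], Ga[1,m], Gb[1,p], Gb[1,m], Gc[1,p], Gc[1,m], Sc[1,p], Sc[1,m], Pa[1], Pb[1], Pc[1]. Locus #2 variables: Ga[2,p], Ga[2,m], Gb[2,p], Gb[2,m], Gc[2,p], Gc[2,m], Sc[2,p], Sc[2,m], Pa[2], Pb[2], Pc[2].]

Bayesian network built by SUPERLINK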

Variables of the Bayesian Network

Three types of random variables:

• Genetic loci. The variables Gi[a,p] and Gi[a,m] represent the paternal and maternal alleles of individual i at locus a (orange nodes).

• Phenotypes. The variable Pi[a] denotes the value of phenotype a for individual i (yellow nodes).

• Selector variables. The variables Si[a,p] and Si[a,m] specify whether individual i received his alleles from the paternal or the maternal haplotype of his father and mother, respectively, at locus a (green nodes).

(Likelihood Computation with Value Abstraction with N. Friedman, D. Geiger, and N. Lotner).

Haplotyping – Specific to Superlink

• Find the most likely assignment to all the genetic loci variables (orange nodes).

• Method: use the MPE query, i.e. find $x^* = \arg\max_x P(x, e)$, using a bucket elimination algorithm, which is an algorithm for performing inference in a Bayesian network.

Bucket Elimination Alg. for MPE

Given an ordering of the variables X1,…,Xn:

– Distribute P1,…,Pn into buckets B1,…,Bn: $P_i \in B_j$ if $X_j$ is the highest in order in $A_i$.

– Backward part: process the buckets in reverse order, Bn → B1.

• When processing bucket Bi, multiply all the probability tables in Bi and eliminate the bucket’s variable Xi by keeping only the maximum joint probability entry for each possible assignment to the other variables.

• Store the value of Xi which maximizes the joint probability function for each possible assignment to the other variables.

• Place the resulting function in the bucket of the highest variable (in the order) that is in its scope.

– Forward part: process the buckets in the order B1 → Bn.

• When processing bucket Bi (after choosing the partial assignment (x1,…,xi-1)), choose the value of Xi which was recorded in the backward phase together with this assignment.
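
The two passes can be sketched in Python roughly as follows. This is a minimal illustration rather than Superlink's implementation: the function name `mpe_bucket_elimination`, the representation of a probability table as a `(scope, table)` pair, and the two-node network used to exercise it are all assumptions.

```python
import itertools

def mpe_bucket_elimination(factors, order, domains, evidence=None):
    """factors: list of (scope, table); table maps value tuples (in scope
    order) to probabilities. Returns (MPE assignment, its probability)."""
    evidence = evidence or {}
    pos = {v: k for k, v in enumerate(order)}
    buckets = {v: [] for v in order}
    for scope, table in factors:          # distribute tables into buckets
        buckets[max(scope, key=pos.get)].append((scope, table))

    def values(var):                      # evidence variables are fixed
        return [evidence[var]] if var in evidence else domains[var]

    recorded, mpe_value = {}, 1.0
    for var in reversed(order):           # backward part: Bn ... B1
        ctx = sorted({u for scope, _ in buckets[var] for u in scope
                      if u != var}, key=pos.get)
        h, arg = {}, {}
        for ctx_vals in itertools.product(*(values(u) for u in ctx)):
            env = dict(zip(ctx, ctx_vals))
            best = (-1.0, None)
            for x in values(var):
                env[var] = x
                p = 1.0
                for scope, table in buckets[var]:
                    p *= table[tuple(env[u] for u in scope)]
                if p > best[0]:
                    best = (p, x)
            h[ctx_vals], arg[ctx_vals] = best
        recorded[var] = (ctx, arg)
        if ctx:                           # pass the message down the order
            buckets[ctx[-1]].append((tuple(ctx), h))
        else:                             # constant: part of the MPE value
            mpe_value *= h[()]

    assignment = {}
    for var in order:                     # forward part: B1 ... Bn
        ctx, arg = recorded[var]
        assignment[var] = arg[tuple(assignment[u] for u in ctx)]
    return assignment, mpe_value

# Hypothetical two-node network A -> B; P(B|A) is keyed by (b, a).
domains = {"A": [0, 1], "B": [0, 1]}
factors = [(("A",), {(0,): 0.6, (1,): 0.4}),
           (("B", "A"), {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.9})]
assignment, value = mpe_bucket_elimination(factors, ["A", "B"], domains)
assignment_e, value_e = mpe_bucket_elimination(
    factors, ["A", "B"], domains, evidence={"B": 1})
print(assignment, value)
print(assignment_e, value_e)
```

Ties between equally probable values are broken by taking the first value encountered, so only one of several maximizing configurations is returned.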

Bayesian Network – Example Revisited

[Figure: the example network over A, B, C, D, E, F, annotated with its conditional probability tables P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C), P(F|E).]

P(A,B,C,D,E,F) = P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C)P(F|E)

Example – MPE

• Find: x0 = (a,b,c,d,e,f) such that:

$P(x^0) = \max_x P(a)\,P(b|a)\,P(c|a)\,P(d|a,b)\,P(e|b,c)\,P(f|e)$

• Suppose an order A,C,B,E,D,F, and evidence that F=1.

The distribution into buckets is as follows:

B1(A): P(A)

B2(C): P(C|A)

B3(B): P(B|A)

B4(E): P(E|B,C)

B5(D): P(D|A,B)

B6(F): P(F|E)

Example MPE (cont. 1)

• To process B6(F): assign F=1 and get hF(E) = P(F=1|E).

• Place hF(E) in bucket B4(E).

• Record F=1.

Buckets after this step:

B1(A): P(A)

B2(C): P(C|A)

B3(B): P(B|A)

B4(E): P(E|B,C), hF(E)

B5(D): P(D|A,B)

B6(F): F=1 (recorded)


• Process B5(D), compute: $h_D(a,b) = \max_d P(d|a,b)$

• Place hD(a,b) in bucket B3(B).

• Record the maximizing values Dopt(a,b).

Buckets after this step:

B1(A): P(A)

B2(C): P(C|A)

B3(B): P(B|A), hD(A,B)

B4(E): P(E|B,C), hF(E)

B5(D): Dopt(a,b) (recorded)

B6(F): F=1 (recorded)


• Process B4(E), compute: $h_E(b,c) = \max_e P(e|b,c)\,h_F(e)$

• Place hE(b,c) in bucket B3(B).

• Record Eopt(b,c).

Buckets after this step:

B1(A): P(A)

B2(C): P(C|A)

B3(B): P(B|A), hD(A,B), hE(B,C)

B4(E): Eopt(b,c) (recorded)

B5(D): Dopt(a,b) (recorded)

B6(F): F=1 (recorded)


• Process B3(B), compute: $h_B(a,c) = \max_b P(b|a)\,h_D(a,b)\,h_E(b,c)$

• Place hB(a,c) in bucket B2(C).

• Record Bopt(a,c).

Buckets after this step:

B1(A): P(A)

B2(C): P(C|A), hB(A,C)

B3(B): Bopt(a,c) (recorded)

B4(E): Eopt(b,c) (recorded)

B5(D): Dopt(a,b) (recorded)

B6(F): F=1 (recorded)


• Process B2(C), compute: $h_C(a) = \max_c P(c|a)\,h_B(a,c)$

• Place hC(a) in bucket B1(A).

• Record Copt(a).

Buckets after this step:

B1(A): P(A), hC(A)

B2(C): Copt(a) (recorded)

B3(B): Bopt(a,c) (recorded)

B4(E): Eopt(b,c) (recorded)

B5(D): Dopt(a,b) (recorded)

B6(F): F=1 (recorded)


• Compute the maximum value associated with A: $\max = \max_a P(a)\,h_C(a)$

• Record the value of A which produced this maximum.

Recorded so far: F=1, Dopt(a,b), Eopt(b,c), Bopt(a,c), Copt(a).

Traverse the variables in the opposite order (C,B,E,D) to determine the rest of the most probable assignment.
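
The worked example can be replayed step by step in Python. The sketch below is only an illustration: it assumes binary variables and randomly generated CPT values (the slides give no numbers), and it checks the result against a brute-force maximization over all assignments with F=1.

```python
import itertools
import random

rng = random.Random(1)
V = (0, 1)  # assumed binary domain for every variable

def cpt(n_parents):
    """Random CPT stored as {(x, *parent_values): P(x | parents)}."""
    table = {}
    for pa in itertools.product(V, repeat=n_parents):
        p1 = rng.random()
        table[(1,) + pa], table[(0,) + pa] = p1, 1.0 - p1
    return table

# Key layout: pB[(b, a)], pC[(c, a)], pD[(d, a, b)], pE[(e, b, c)], pF[(f, e)].
pA, pB, pC, pD, pE, pF = cpt(0), cpt(1), cpt(1), cpt(2), cpt(2), cpt(1)
pairs = list(itertools.product(V, V))

# Backward part: buckets B6(F) down to B1(A).
hF = {e: pF[(1, e)] for e in V}  # evidence F=1
hD = {(a, b): max(pD[(d, a, b)] for d in V) for a, b in pairs}
Dopt = {(a, b): max(V, key=lambda d: pD[(d, a, b)]) for a, b in pairs}
hE = {(b, c): max(pE[(e, b, c)] * hF[e] for e in V) for b, c in pairs}
Eopt = {(b, c): max(V, key=lambda e: pE[(e, b, c)] * hF[e]) for b, c in pairs}
hB = {(a, c): max(pB[(b, a)] * hD[(a, b)] * hE[(b, c)] for b in V)
      for a, c in pairs}
Bopt = {(a, c): max(V, key=lambda b: pB[(b, a)] * hD[(a, b)] * hE[(b, c)])
        for a, c in pairs}
hC = {a: max(pC[(c, a)] * hB[(a, c)] for c in V) for a in V}
Copt = {a: max(V, key=lambda c: pC[(c, a)] * hB[(a, c)]) for a in V}
a = max(V, key=lambda x: pA[(x,)] * hC[x])
mpe_value = pA[(a,)] * hC[a]

# Forward part: recover C, B, E, D from the recorded maximizers.
c = Copt[a]
b = Bopt[(a, c)]
e = Eopt[(b, c)]
d = Dopt[(a, b)]
f = 1

def joint(a, b, c, d, e, f):
    return (pA[(a,)] * pB[(b, a)] * pC[(c, a)]
            * pD[(d, a, b)] * pE[(e, b, c)] * pF[(f, e)])

brute = max(joint(*x, 1) for x in itertools.product(V, repeat=5))
print((a, b, c, d, e, f), mpe_value, brute)
```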

SimWalk2

• A statistical genetics computer application for haplotype, parametric linkage, non-parametric linkage (NPL), identity by descent (IBD) and mistyping analyses on any size of pedigree.

• Performs approximate computations using Markov Chain Monte Carlo (MCMC) and simulated annealing algorithms.

Project #2

• SimWalk2 requires 4 or 5 input files in order to run. These can be difficult for a non-expert to produce.

• Goal of project #2: to create a friendly web-based user interface for the program SimWalk2, using Java.

Performing Approximate Inference

• Algorithms for performing genetic linkage analysis are being improved constantly.

• However:

– Due to the enormous development in the human genome project, knowledge about many markers exists.

– Markers are highly polymorphic.

– Some disease models depend on multiple loci.

• Sometimes a model is too large for performing exact inference.

Project #3

• Goal of project #3: to provide the means for performing approximate inference via a heuristic which ignores extreme markers in the computations when these are too strenuous to be performed exactly.


• Currently Superlink performs exact inference.

General Outline of Heuristic Algorithm

1. Begin with the total number of markers as specified in the input.

2. Determine an elimination order for the problem as is.

3. Check the complexity of the elimination order found:

a. If it is greater than some determined threshold, clip off one of the extreme markers and return to step 2.

b. Else, continue to compute the likelihood.
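
The outline translates into a short loop. The Python sketch below is one hedged reading of it: `clip_markers`, the `complexity` callback (standing in for "find an elimination order and measure its cost"), and the clipping rule `farther_from_disease` are hypothetical names, and the marker count is used as a stand-in complexity measure purely for demonstration.

```python
def farther_from_disease(left, right, disease_pos):
    """One possible clipping rule: drop the extreme marker that lies
    farther from the disease locus (ties go to the left marker)."""
    if abs(left - disease_pos) >= abs(right - disease_pos):
        return left
    return right

def clip_markers(markers, disease_pos, complexity, threshold, choose):
    """markers: marker positions sorted along the chromosome.
    complexity: callable that determines an elimination order for the
    given markers and returns its cost (steps 2 and 3 of the outline)."""
    markers = list(markers)
    while len(markers) > 1 and complexity(markers) > threshold:
        # Step 3a: clip one of the two extreme markers, then re-evaluate.
        markers.remove(choose(markers[0], markers[-1], disease_pos))
    return markers  # step 3b: compute the likelihood on these markers

# Toy run: marker count as a placeholder for the elimination-order cost.
kept = clip_markers([1, 2, 3, 4, 5, 10], disease_pos=4,
                    complexity=len, threshold=4,
                    choose=farther_from_disease)
print(kept)
```

Passing the clipping rule as a parameter keeps the loop unchanged while different answers to the open question of which marker to clip are tried out.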

Open Questions

• Which marker to clip in each iteration?

Some possible options are:

– The marker farther from the disease locus.

– The less informative marker.

– A marker which is very close to its adjacent marker.

Project #4

• Another project which deals with approximate inference in a different way…

• Goal of project #4: to provide the means for performing approximate inference by using Iterative Join-Graph Propagation.

Pearl’s Polytree Algorithm (BP – Belief Propagation)

An exact inference algorithm for singly-connected networks.

Each node X computes BEL(x) = P(X=x|E) (E is the observed evidence) by combining messages from:

• its children, and

• its parents.

Loopy Belief Propagation (Iterative-BP)

• Uses Pearl’s polytree algorithm on a Bayesian network with loops.

• Initialization: all messages are initialized to a vector of ones.

• At each iteration: all nodes calculate their outgoing messages based on the incoming messages of their neighbors from the previous iteration.

• Stopping condition: convergence of messages. None of the beliefs in successive iterations changed by more than a small threshold (e.g., $10^{-4}$).
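
One possible reading of this scheme, sketched in Python. Note the swap: instead of Pearl's parent and child messages on the Bayesian network itself, the sketch runs the closely related sum-product updates on a factor graph with one factor per CPT, and it tests convergence on the messages rather than on the beliefs. The function name `loopy_bp` and the data layout are assumptions made for illustration.

```python
import itertools
import math

def loopy_bp(factors, domains, tol=1e-4, max_iters=100):
    """factors: list of (scope, table); table maps value tuples (in scope
    order) to non-negative numbers. Returns (beliefs, iterations used)."""
    edges = [(i, v) for i, (scope, _) in enumerate(factors) for v in scope]
    # Initialization: every message is a vector of ones.
    f2v = {e: [1.0] * len(domains[e[1]]) for e in edges}

    def normalize(vec):
        z = sum(vec)
        return [x / z for x in vec]

    def product_into(v, skip=None):
        """Product of factor-to-variable messages into v, minus one factor."""
        return [math.prod(f2v[(j, u)][k] for (j, u) in edges
                          if u == v and j != skip)
                for k in range(len(domains[v]))]

    for iteration in range(1, max_iters + 1):
        # Variable-to-factor messages from the previous iteration's inputs.
        v2f = {(i, v): normalize(product_into(v, skip=i)) for (i, v) in edges}
        new_f2v = {}
        for (i, v) in edges:
            scope, table = factors[i]
            msg = [0.0] * len(domains[v])
            for idx in itertools.product(
                    *(range(len(domains[u])) for u in scope)):
                w = table[tuple(domains[u][k] for u, k in zip(scope, idx))]
                for u, k in zip(scope, idx):
                    if u != v:
                        w *= v2f[(i, u)][k]
                msg[idx[scope.index(v)]] += w
            new_f2v[(i, v)] = normalize(msg)
        change = max(abs(x - y) for e in edges
                     for x, y in zip(new_f2v[e], f2v[e]))
        f2v = new_f2v
        if change < tol:  # stopping condition
            break
    beliefs = {v: normalize(product_into(v)) for v in domains}
    return beliefs, iteration

# Hypothetical singly-connected network A -> B, where the result is exact.
domains = {"A": [0, 1], "B": [0, 1]}
factors = [(("A",), {(0,): 0.6, (1,): 0.4}),
           (("B", "A"), {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.9})]
beliefs, iterations = loopy_bp(factors, domains)
print(beliefs, iterations)
```

On a network with loops the same code runs unchanged, but the returned beliefs are then only approximations and convergence is not guaranteed, hence the `max_iters` cap.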

Generalized Belief Propagation (GBP)

• An extension of IBP towards being an anytime algorithm.

• Can be significantly more accurate than ordinary IBP at an adjustable increased complexity.

• Central idea: improve the approximation by clustering some of the network’s nodes into super nodes and apply message passing between the super nodes rather than between the original singleton nodes.

Iterative Join-Graph Propagation (IJGP)

• A special class of GBP (Generalized Belief Propagation) algorithms.

• Pearl’s BP algorithm on trees was extended to a general propagation algorithm on trees of clusters – join-tree clustering (exact method).

• IJGP extends this idea to a join-graph, by applying join-tree message passing over the join-graph, iteratively.

• IJGP(i) – works on join-graphs having cluster size bounded by i variables.

i allows the user to control the tradeoff between time and accuracy.

Belief Network (BN)

A quadruple BN = < X, D, G, P >:

– X = {X1,…Xn} is a set of random variables.

– D = {D1,…,Dn} is the set of corresponding domains.

– G is a directed acyclic graph over X.

– P = {p1,…,pn}, where pi = P(Xi | pai) (pai are the parents of Xi in G), denote probability tables.

Join-Graph Decomposition

A triple D = < JG, χ, ψ > for BN = < X, D, G, P >:

– JG = (V, E) is a graph.

– χ, ψ are functions which associate with each vertex $v \in V$ two sets $\chi(v) \subseteq X$ and $\psi(v) \subseteq P$, such that:

1. Each function $p_i \in P$ is associated with exactly one vertex $v \in V$.

2. (connectedness) For each variable $X_i \in X$, the set of vertices $v \in V$ which are associated with it induces a connected sub-graph of JG.

Join-Graph Decomposition: Example (in this case, a tree)

[Figure: left, a Bayesian network over the nodes A, B, C, D, E, F, G; right, a join-tree decomposition of it with vertices 1, 2, 3, 4.]

χ(1) = {A, B, C}, ψ(1) = {p(a), p(b|a), p(c|a,b)}

χ(2) = {B, C, D, F}, ψ(2) = {p(d|b), p(f|c,d)}

χ(3) = {B, E, F}, ψ(3) = {p(e|b,f)}

χ(4) = {E, F, G}, ψ(4) = {p(g|e,f)}

Arc-Labeled Join-Graph Decomposition

A quadruple D = < JG, χ, ψ, θ > for BN = < X, D, G, P >:

– JG = (V, E) is a graph.

– χ, ψ are functions which associate with each vertex $v \in V$ two sets $\chi(v) \subseteq X$ and $\psi(v) \subseteq P$.

– θ associates with each edge $(u,v) \in E$ the set $\theta((u,v)) \subseteq X$, such that:

1. Each function $p_i \in P$ is associated with exactly one vertex $v \in V$.

2. (arc-connectedness) For each arc (u,v), $\theta((u,v)) \subseteq sep(u,v)$, such that for every $X_i \in X$, any 2 clusters containing Xi can be connected by a path whose every arc’s label contains Xi.

Minimal Arc-Labeled Join-Graph Decomposition

• An arc-labeled join graph is minimal if no variable can be deleted from any label while still satisfying the arc-connectedness property.

• A minimal arc-labeled join-graph does not contain any cycle relative to any single variable.

Definition - Eliminator

• Given 2 adjacent vertices u and v of JG, the eliminator of u with respect to v includes all the variables that appear in u and don’t appear on the arc (u,v).

elim(u,v) = χ(u) - θ((u,v)).
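
In code the eliminator is a plain set difference. The Python sketch below applies it to the four-cluster join-tree used as the running example in these slides (clusters ABC, BCDF, BEF, EFG with arc labels BC, BF, EF); the dictionary layout and the helper `label` are illustrative assumptions.

```python
def elim(chi_u, theta_uv):
    """Eliminator of u with respect to v: variables of u not on the arc."""
    return set(chi_u) - set(theta_uv)

# Clusters chi(u) and arc labels theta((u, v)) of the example join-tree.
chi = {1: {"A", "B", "C"}, 2: {"B", "C", "D", "F"},
       3: {"B", "E", "F"}, 4: {"E", "F", "G"}}
theta = {(1, 2): {"B", "C"}, (2, 3): {"B", "F"}, (3, 4): {"E", "F"}}

def label(u, v):
    """Arc labels are undirected, so look the pair up in either order."""
    return theta[(u, v)] if (u, v) in theta else theta[(v, u)]

print(sorted(elim(chi[1], label(1, 2))))  # variables summed out in h(1,2)
print(sorted(elim(chi[2], label(2, 1))))  # variables summed out in h(2,1)
print(sorted(elim(chi[2], label(2, 3))))  # variables summed out in h(2,3)
```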

Algorithm IJGP

• Input:

– An arc-labeled join-graph decomposition.

– Evidence variables var(e).

• Output:

– An augmented graph whose nodes are clusters containing the original CPTs and the messages received from neighbors.

– Approximations of P(Xi|e), for every $X_i \in X$.

Algorithm IJGP – 1 iteration

Apply message-passing in some topological order over the join graph, forward and back. When node u sends a message to a neighbor node v:

1. Compute individual functions: include in H(u,v) each function whose scope doesn’t contain variables in elim(u,v). Denote by A the remaining functions.

2. Compute the combined function: $h_{(u,v)} = \sum_{elim(u,v)} \prod_{f \in A} f$

3. Send all the functions to v: send h(u,v) and the individual functions H(u,v) to node v.
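
The three steps of sending one message can be sketched in Python as follows. This is an illustration under assumptions, not the reference implementation: functions are stored as `(scope, table)` pairs, variables are binary, the CPT values are random, and the name `send_message` is hypothetical. The example computes the message from a cluster holding p(a), p(b|a), p(c|a,b) with eliminator {A}.

```python
import itertools
import random

def send_message(functions, elim_vars, domains):
    """Return (H, (scope, h)): the functions passed on untouched and the
    combined function obtained by multiplying the rest and summing out
    elim_vars."""
    elim_vars = set(elim_vars)
    H = [f for f in functions if not set(f[0]) & elim_vars]   # step 1
    A = [f for f in functions if set(f[0]) & elim_vars]
    all_vars = sorted({v for scope, _ in A for v in scope})
    keep = [v for v in all_vars if v not in elim_vars]
    h = {}
    for vals in itertools.product(*(domains[v] for v in all_vars)):
        env = dict(zip(all_vars, vals))
        w = 1.0
        for scope, table in A:                                # product
            w *= table[tuple(env[v] for v in scope)]
        key = tuple(env[v] for v in keep)
        h[key] = h.get(key, 0.0) + w                          # step 2: sum
    return H, (tuple(keep), h)                                # step 3

rng = random.Random(0)
domains = {v: (0, 1) for v in "ABC"}

def random_cpt(child, parents):
    """Random binary CPT keyed by (child value, *parent values)."""
    scope = (child,) + tuple(parents)
    table = {}
    for pa in itertools.product((0, 1), repeat=len(parents)):
        p1 = rng.random()
        table[(1,) + pa], table[(0,) + pa] = p1, 1.0 - p1
    return scope, table

cluster1 = [random_cpt("A", ""), random_cpt("B", "A"), random_cpt("C", "AB")]
H, (scope, h) = send_message(cluster1, {"A"}, domains)
print(scope, h)
```

Because the cluster contains a complete set of CPTs for A, B and C, the message here is the marginal P(b,c), so its entries add up to 1.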

Execution of IJGP on a Join-Tree

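[Figure: a join-tree with clusters 1: ABC, 2: BCDF, 3: BEF, 4: EFG, connected in a chain 1–2–3–4 with arc labels BC, BF and EF.]

Messages sent along the arcs:

$h_{(1,2)}(b,c) = \sum_a p(a)\,p(b|a)\,p(c|a,b)$

$h_{(2,1)}(b,c) = \sum_{d,f} p(d|b)\,p(f|c,d)\,h_{(3,2)}(b,f)$

$h_{(2,3)}(b,f) = \sum_{c,d} p(d|b)\,p(f|c,d)\,h_{(1,2)}(b,c)$

$h_{(3,2)}(b,f) = \sum_e p(e|b,f)\,h_{(4,3)}(e,f)$

$h_{(3,4)}(e,f) = \sum_b p(e|b,f)\,h_{(2,3)}(b,f)$

$h_{(4,3)}(e,f) = p(G=g_e|e,f)$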

Algorithm IJGP – computing beliefs

• Compute P(Xi|e) for every $X_i \in X$:

– let u be a vertex in JG such that $X_i \in \chi(u)$.

– compute: $P(X_i|e) = \alpha \sum_{\chi(u) - \{X_i\}} \prod_{f \in cluster(u)} f$

where α is a normalizing constant and cluster(u) includes all the functions in u, including messages sent from its neighbors.

Bounded Join-Graphs

• Join-graphs with cluster size bounded by i.

• A partition-based approach to generating such decompositions: start from a given tree-decomposition and then partition the clusters until the decomposition has clusters bounded by i.

• Goal: to control the complexity of IJGP. The time and space complexity of 1 iteration of IJGP(i) is exponential in i.

Algorithm join-graph structuring(i)

Output: a join-graph with cluster size bounded by i.

1. Apply procedure schematic mini-bucket(i).

2. Associate each resulting mini-bucket with a node in the join-graph. The variables of the node are those appearing in the mini-bucket. The original functions of the node are those in the mini-bucket.

3. Keep the arcs created by the procedure (out-edges) and label them by the regular separator.

4. Connect the mini-bucket clusters belonging to the same bucket in a chain by in-edges labeled by the single variable of the bucket.

Procedure schematic mini-bucket(i)

1. Order the variables from X1 to Xn, and associate a bucket with each variable.

2. Place each CPT in the bucket of the highest index variable in its scope.

3. For j = n to 1 do:

– Partition the functions in bucket(Xj) into mini-buckets having at most i variables.

– For each mini-bucket mb create a function (message) f where $scope(f) = \{X \mid X \in mb\} - \{X_j\}$, and place scope(f) in the bucket of its highest index variable. mb needs to be connected with an arc to the bucket of f (which will be created later).
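
A schematic version of the procedure only needs the scopes of the functions, not their tables. The Python sketch below is a minimal illustration: the function name is hypothetical, and the greedy first-fit rule used to partition a bucket is an assumption, since the procedure does not prescribe a particular partitioning strategy. It is run on a seven-variable network with the ordering A, B, C, D, F, E, G and i = 3.

```python
def schematic_mini_bucket(scopes, order, i):
    """scopes: CPT scopes as sets of variable names; order: X1..Xn.
    Returns {variable: list of mini-buckets}, each a set of variables."""
    pos = {v: k for k, v in enumerate(order)}
    buckets = {v: [] for v in order}
    for s in scopes:  # place each CPT in the bucket of its highest variable
        buckets[max(s, key=pos.get)].append(set(s))
    result = {}
    for var in reversed(order):  # j = n down to 1
        minis = []
        for s in buckets[var]:
            # First-fit (assumed): join the first mini-bucket that stays
            # within i variables, otherwise open a new one.
            for mb in minis:
                if len(mb | s) <= i:
                    mb |= s
                    break
            else:
                minis.append(set(s))
        result[var] = minis
        for mb in minis:
            msg = mb - {var}  # scope(f) = variables of mb without Xj
            if msg:
                buckets[max(msg, key=pos.get)].append(msg)
    return result

order = list("ABCDFEG")
scopes = [{"G", "F", "E"}, {"E", "B", "F"}, {"F", "C", "D"}, {"D", "B"},
          {"C", "A", "B"}, {"B", "A"}, {"A"}]
clusters = schematic_mini_bucket(scopes, order, 3)
for var in reversed(order):
    print(var, [sorted(mb) for mb in clusters[var]])
```

With these inputs only the bucket of F has to be split, into the mini-buckets {F, C, D} and {B, F}; every other bucket fits within three variables.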

Build Join-Graph:Example

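[Figure: the Bayesian network over the nodes A, B, C, D, E, F, G.]

a. After applying schematic mini-bucket(3) (buckets listed from Xn down to X1):

G: (GFE)

E: (EBF) (EF)

F: (FCD) (BF)

D: (DB) (CD)

C: (CAB) (CB)

B: (BA) (AB) (B)

A: (A) (A)

b. After applying alg. join-graph structuring:

Clusters and their original functions: GFE with P(G|F,E); EBF with P(E|B,F); FCD with P(F|C,D); BF with no original function; CDB with P(D|B); CAB with P(C|A,B); BA with P(B|A); A with P(A).

Arcs and their labels: GFE–EBF: EF; EBF–BF: BF; FCD–BF: F; FCD–CDB: CD; CDB–CAB: CB; CAB–BA: BA; BA–A: A; BF–BA: B.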

IJGP(i) -summary

• As i increases, accuracy improves, at the cost of more processing time.

• This yields the anytime behavior of the algorithm.
