
Project In Bioinformatics (236524)- Projects Presentation.

Date post: 10-Jan-2016
Transcript
Page 1: Project In Bioinformatics (236524)- Projects Presentation.

Presented by: Ma’ayan Fishelson

Page 2: Project In Bioinformatics (236524)- Projects Presentation.

Proposed Projects

1. Performing haplotyping on the input data.

2. Creating a friendly user-interface for the statistical genetics program SimWalk2.

3. Performing approximate inference by using a heuristic which ignores extreme markers in the computation.

4. Performing approximate inference via Iterative Join-Graph Propagation.

Page 3: Project In Bioinformatics (236524)- Projects Presentation.
Page 4: Project In Bioinformatics (236524)- Projects Presentation.

Haplotyping

• Many applications require haplotype information.

• Unfortunately, the human genome is diploid, and therefore genotype information is collected rather than haplotype information.

Efficient and accurate computational methods for haplotype reconstruction from genotype data are in high demand.

Page 5: Project In Bioinformatics (236524)- Projects Presentation.

[Figure: a pedigree of five individuals (1–5) typed at the AK1 locus on Chromosome 9 (Lange 97). Observed genotypes: A1/A1, A2/A2, A1/A2, A1/A2, A2/A2. Inferred haplotype information: A1 | A2, A1 | A1, A2 | A2, A1 | A2, A2 | A2.]

Page 6: Project In Bioinformatics (236524)- Projects Presentation.

Project #1

• Goal of project #1: to perform haplotyping on the input data, i.e. to infer the most likely haplotypes for the individuals in the input pedigrees, via the MPE (Most Probable Explanation) query.

Page 7: Project In Bioinformatics (236524)- Projects Presentation.

Haplotyping – General Definition

• Input: a set of multilocus phenotypes for the individuals of a pedigree.

– Note: some individuals may be untyped.

• Output: the most likely configuration of haplotypes (ordered genotypes) for the pedigree, i.e. a configuration with maximum probability.

– Note: there may be several configurations with maximum probability.
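To make "haplotypes (ordered genotypes)" concrete, the following minimal sketch enumerates every ordered configuration that is consistent with one individual's unordered multilocus genotype. The function name `ordered_genotypes` and the allele labels are hypothetical, and the sketch ignores pedigree constraints and probabilities, which the actual haplotyping task must take into account.

```python
from itertools import product


def ordered_genotypes(genotype):
    """Enumerate (paternal haplotype, maternal haplotype) pairs.

    genotype -- list of unordered allele pairs, one pair per locus.
    """
    per_locus = []
    for a, b in genotype:
        # A homozygous locus has one ordering, a heterozygous locus has two.
        per_locus.append([(a, b)] if a == b else [(a, b), (b, a)])
    # Each choice fixes (paternal, maternal) alleles at every locus;
    # zip(*choice) regroups them into the two haplotypes.
    return [tuple(zip(*choice)) for choice in product(*per_locus)]


configs = ordered_genotypes([("A1", "A2"), ("B1", "B2")])
for paternal, maternal in configs:
    print(paternal, "|", maternal)
```

Two heterozygous loci give four candidate configurations. Haplotyping asks which of these candidates, taken jointly over the whole pedigree, has maximum probability.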

Page 8: Project In Bioinformatics (236524)- Projects Presentation.

Bayesian Network

• X = {X1,…,Xn} is a set of random variables.

• A BN is a pair (G,P):

– G is a directed acyclic graph over nodes that represent the random variables X.

– P = {Pi | 1 ≤ i ≤ n}. Pi, defined on Ai ⊆ X, is the conditional probability table associated with node Xi:

Pi = P(Xi | pa(Xi))

• The BN represents a probability distribution over X.

Page 9: Project In Bioinformatics (236524)- Projects Presentation.

Bayesian Network - Example

[Figure: a Bayesian network over the nodes A, B, C, D, E, F with the conditional probability tables P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C), P(F|E), shown next to its moralized graph.]

P(A,B,C,D,E,F) = P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C)P(F|E)
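The factorization of this example network can be checked numerically. The sketch below assumes that all six variables are binary and uses made-up table entries (none of the numbers come from the slides). It multiplies the six local tables and confirms that the product sums to one over all assignments.

```python
from itertools import product

# Hypothetical CPTs for binary variables; the numbers are illustrative only.
# Each table gives P(child = 1 | parents).
P_A = 0.3
P_B = {0: 0.2, 1: 0.7}                                        # key: a
P_C = {0: 0.5, 1: 0.1}                                        # key: a
P_D = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}    # key: (a, b)
P_E = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.8}    # key: (b, c)
P_F = {0: 0.25, 1: 0.75}                                      # key: e


def bern(p_one, value):
    """P(X = value) for a binary X with P(X = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e, f):
    """P(A,B,C,D,E,F) as the product of the six local tables."""
    return (bern(P_A, a) * bern(P_B[a], b) * bern(P_C[a], c)
            * bern(P_D[(a, b)], d) * bern(P_E[(b, c)], e) * bern(P_F[e], f))


total = sum(joint(*x) for x in product((0, 1), repeat=6))
print(total)  # should equal 1 up to rounding
```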

Page 10: Project In Bioinformatics (236524)- Projects Presentation.

[Figure: Bayesian network built by SUPERLINK for three individuals a, b and c at two loci. The locus #1 variables are Ga[1,p], Ga[1,m], Gb[1,p], Gb[1,m], Gc[1,p], Gc[1,m], Pa[1], Pb[1], Pc[1], Sc[1,p], Sc[1,m]. The locus #2 variables are Ga[2,p], Ga[2,m], Gb[2,p], Gb[2,m], Gc[2,p], Gc[2,m], Pa[2], Pb[2], Pc[2], Sc[2,p], Sc[2,m].]

Page 11: Project In Bioinformatics (236524)- Projects Presentation.

Variables of the Bayesian Network

Three types of random variables:

• Genetic loci. The variables Gi[a,p] and Gi[a,m] represent the paternal and maternal alleles of individual i at locus a (orange nodes).

• Phenotypes. The variable Pi[a] denotes the value of phenotype a for individual i (yellow nodes).

• Selector variables. The variables Si[a,p] and Si[a,m] specify whether individual i received his alleles from the paternal or the maternal haplotype of his father and mother, respectively, at locus a (green nodes).

(Likelihood Computation with Value Abstraction with N. Friedman, D. Geiger, and N. Lotner).

Page 12: Project In Bioinformatics (236524)- Projects Presentation.

Haplotyping – Specific to Superlink

• Find the most likely assignment to all the genetic loci variables (orange nodes).

• Method: use the MPE query – find $x^* = \arg\max_{x} P(x, e)$ using a bucket elimination algorithm, which is an algorithm for performing inference in a Bayesian network.

Page 13: Project In Bioinformatics (236524)- Projects Presentation.

Bucket Elimination Alg. for MPE

Given an ordering of the variables X1,…,Xn:

– Distribute P1,…,Pn into buckets B1,…,Bn: Pi ∈ Bj if Xj is the highest in order in Ai.

– Backward part: process the buckets in reverse order, Bn → B1.

• When processing bucket Bi, multiply all the probability tables in Bi and eliminate the bucket’s variable Xi by keeping only the maximum joint probability entry for each possible assignment to the other variables.

• Store the value of Xi which maximizes the joint probability function for each possible assignment to the other variables.

• Place the resulting function in the bucket of the highest variable (in the order) that is in its scope.

– Forward part: process the buckets in the order B1 → Bn.

• When processing bucket Bi (after choosing the partial assignment (x1,…,xi-1)), choose the value of Xi which was recorded in the backward phase together with this assignment.
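The backward and forward parts can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Superlink's implementation: the function name `mpe_bucket_elimination`, the representation of tables as (scope, function) pairs, and the three-variable chain A → B → C with its numbers are all hypothetical.

```python
from itertools import product


def mpe_bucket_elimination(order, domains, factors, evidence=None):
    """Max-product bucket elimination.

    order    -- variables X1..Xn as a list
    domains  -- dict: variable -> list of values
    factors  -- list of (scope, fn); fn maps an assignment dict to a number
    evidence -- dict: variable -> observed value
    Returns (maximum joint probability, maximizing assignment).
    """
    evidence = evidence or {}
    dom = {v: [evidence[v]] if v in evidence else list(domains[v]) for v in order}
    pos = {v: k for k, v in enumerate(order)}

    # Distribute each table into the bucket of the highest variable in its scope.
    buckets = {v: [] for v in order}
    for scope, fn in factors:
        buckets[max(scope, key=pos.get)].append((tuple(scope), fn))

    recorded = {}   # variable -> (other variables, {their values: best value})
    value = 1.0
    for var in reversed(order):                      # backward part
        funcs = buckets[var]
        others = sorted({v for s, _ in funcs for v in s if v != var}, key=pos.get)
        table, argmax = {}, {}
        for vals in product(*(dom[v] for v in others)):
            asg = dict(zip(others, vals))
            scores = {}
            for x in dom[var]:
                asg[var] = x
                p = 1.0
                for _, fn in funcs:
                    p *= fn(asg)
                scores[x] = p
            argmax[vals] = max(scores, key=scores.get)
            table[vals] = scores[argmax[vals]]
        recorded[var] = (others, argmax)
        if others:   # pass the new function to the highest remaining variable
            msg = lambda asg, o=tuple(others), t=table: t[tuple(asg[v] for v in o)]
            buckets[others[-1]].append((tuple(others), msg))
        else:        # a constant: fold it into the final value
            value *= table[()]

    assignment = {}
    for var in order:                                # forward part
        others, argmax = recorded[var]
        assignment[var] = argmax[tuple(assignment[v] for v in others)]
    return value, assignment


# Hypothetical chain A -> B -> C with evidence C = 1.
p_a = {0: 0.6, 1: 0.4}
p_b_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}   # key: (b, a)
p_c_b = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.1, (1, 1): 0.9}   # key: (c, b)
factors = [
    (("A",), lambda x: p_a[x["A"]]),
    (("A", "B"), lambda x: p_b_a[(x["B"], x["A"])]),
    (("B", "C"), lambda x: p_c_b[(x["C"], x["B"])]),
]
domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
value, assignment = mpe_bucket_elimination(["A", "B", "C"], domains, factors, {"C": 1})
print(value, assignment)
```

Evidence is handled by restricting the domain of an observed variable to its observed value, which has the same effect as assigning the value when the variable's bucket is processed.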

Page 14: Project In Bioinformatics (236524)- Projects Presentation.

Bayesian Network – Example Revisited

[Figure: the Bayesian network over the nodes A, B, C, D, E, F with the conditional probability tables P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C), P(F|E).]

P(A,B,C,D,E,F) = P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C)P(F|E)

Page 15: Project In Bioinformatics (236524)- Projects Presentation.

Example – MPE

• Find x0 = (a,b,c,d,e,f) such that:

$P(x^0) = \max_{x} P(A)\,P(B|A)\,P(C|A)\,P(D|A,B)\,P(E|B,C)\,P(F|E)$

• Suppose an order A,C,B,E,D,F, and evidence that F=1.

The distribution into buckets is as follows:

B1(A): P(A)
B2(C): P(C|A)
B3(B): P(B|A)
B4(E): P(E|B,C)
B5(D): P(D|A,B)
B6(F): P(F|E)

Page 16: Project In Bioinformatics (236524)- Projects Presentation.

Example MPE (cont. 1)

• To process B6(F): assign F=1 and get hF(E) = P(F=1|E).

• Place hF(E) in bucket B4(E).

• Record F=1.

Buckets after this step:

B1(A): P(A)
B2(C): P(C|A)
B3(B): P(B|A)
B4(E): P(E|B,C), hF(E)
B5(D): P(D|A,B)
B6(F): F=1

Page 17: Project In Bioinformatics (236524)- Projects Presentation.

Example MPE (cont. 2)

• Process B5(D), compute:

$h_D(a,b) = \max_{d} P(d|a,b)$

• Place hD(a,b) in bucket B3(B).

• Record the maximizing values Dopt(a,b).

Buckets after this step:

B1(A): P(A)
B2(C): P(C|A)
B3(B): P(B|A), hD(A,B)
B4(E): P(E|B,C), hF(E)
B5(D): Dopt(a,b)
B6(F): F=1

Page 18: Project In Bioinformatics (236524)- Projects Presentation.

Example MPE (cont. 3)

• Process B4(E), compute:

$h_E(b,c) = \max_{e} P(e|b,c)\,h_F(e)$

• Place hE(b,c) in bucket B3(B).

• Record Eopt(b,c).

Buckets after this step:

B1(A): P(A)
B2(C): P(C|A)
B3(B): P(B|A), hD(A,B), hE(B,C)
B4(E): Eopt(b,c)
B5(D): Dopt(a,b)
B6(F): F=1

Page 19: Project In Bioinformatics (236524)- Projects Presentation.

Example MPE (cont. 4)

• Process B3(B), compute:

$h_B(a,c) = \max_{b} P(b|a)\,h_D(a,b)\,h_E(b,c)$

• Place hB(a,c) in bucket B2(C).

• Record Bopt(a,c).

Buckets after this step:

B1(A): P(A)
B2(C): P(C|A), hB(A,C)
B3(B): Bopt(a,c)
B4(E): Eopt(b,c)
B5(D): Dopt(a,b)
B6(F): F=1

Page 20: Project In Bioinformatics (236524)- Projects Presentation.

Example MPE (cont. 5)

• Process B2(C), compute:

$h_C(a) = \max_{c} P(c|a)\,h_B(a,c)$

• Place hC(a) in bucket B1(A).

• Record Copt(a).

Buckets after this step:

B1(A): P(A), hC(A)
B2(C): Copt(a)
B3(B): Bopt(a,c)
B4(E): Eopt(b,c)
B5(D): Dopt(a,b)
B6(F): F=1

Page 21: Project In Bioinformatics (236524)- Projects Presentation.

Example MPE (cont. 6)

• Compute the maximum value associated with A:

$\max_{a} P(a)\,h_C(a)$

• Record the value of A which produced this maximum.

Bucket contents at this point:

B1(A): P(A), hC(A)
B2(C): Copt(a)
B3(B): Bopt(a,c)
B4(E): Eopt(b,c)
B5(D): Dopt(a,b)
B6(F): F=1

Traverse the variables in the opposite order (C,B,E,D) to determine the rest of the most probable assignment.
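The six steps of this example can be replayed directly in code. The sketch below assumes binary variables and hypothetical table entries (the slides give no numbers). It computes hF, hD, hE, hB and hC in the order used above, records the maximizers, and then reads the assignment back in the order C, B, E, D. The helper names `pr` and `max_out` are illustrative.

```python
from itertools import product

V = (0, 1)  # every variable is assumed binary

# Hypothetical CPTs (illustrative numbers): P(child = 1 | parents).
pA = 0.3
pB = {0: 0.2, 1: 0.7}                                        # key: a
pC = {0: 0.5, 1: 0.1}                                        # key: a
pD = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}    # key: (a, b)
pE = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.8}    # key: (b, c)
pF = {0: 0.25, 1: 0.75}                                      # key: e


def pr(p_one, x):
    """P(X = x) for a binary X with P(X = 1) = p_one."""
    return p_one if x == 1 else 1.0 - p_one


def max_out(score, keys):
    """For each key, return the best score over V and the value achieving it."""
    h, opt = {}, {}
    for k in keys:
        opt[k] = max(V, key=lambda x: score(k, x))
        h[k] = score(k, opt[k])
    return h, opt


pairs = list(product(V, V))

# Backward part: buckets F, D, E, B, C, A.
hF = {e: pr(pF[e], 1) for e in V}                                # B6(F): F=1
hD, Dopt = max_out(lambda ab, d: pr(pD[ab], d), pairs)           # B5(D)
hE, Eopt = max_out(lambda bc, e: pr(pE[bc], e) * hF[e], pairs)   # B4(E)
hB, Bopt = max_out(                                              # B3(B)
    lambda ac, b: pr(pB[ac[0]], b) * hD[(ac[0], b)] * hE[(b, ac[1])], pairs)
hC, Copt = max_out(lambda a, c: pr(pC[a], c) * hB[(a, c)], V)    # B2(C)
a = max(V, key=lambda x: pr(pA, x) * hC[x])                      # B1(A)
mpe_value = pr(pA, a) * hC[a]

# Forward part: read back the recorded maximizers in the order C, B, E, D.
c = Copt[a]
b = Bopt[(a, c)]
e = Eopt[(b, c)]
d = Dopt[(a, b)]
mpe = dict(A=a, B=b, C=c, D=d, E=e, F=1)
print(mpe, mpe_value)
```

A brute-force maximization over all 32 assignments with F=1 is a convenient check for a small network like this one.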

Page 22: Project In Bioinformatics (236524)- Projects Presentation.
Page 23: Project In Bioinformatics (236524)- Projects Presentation.

SimWalk2

• A statistical genetics computer application for haplotype, parametric linkage, non-parametric linkage (NPL), identity by descent (IBD) and mistyping analyses on any size of pedigree.

• Performs approximate computations using Markov Chain Monte Carlo (MCMC) and simulated annealing algorithms.

Page 24: Project In Bioinformatics (236524)- Projects Presentation.

Project #2

• SimWalk2 requires four or five input files in order to run. These can be difficult for a non-expert to produce.

• Goal of project #2: to create a friendly web-based user-interface for the program SimWalk2, using Java.

Page 25: Project In Bioinformatics (236524)- Projects Presentation.
Page 26: Project In Bioinformatics (236524)- Projects Presentation.

Performing Approximate Inference

• Algorithms for performing genetic linkage analysis are being improved constantly.

• However:

– Due to the enormous progress of the Human Genome Project, knowledge about many markers exists.

– Markers are highly polymorphic.

– Some disease models depend on multiple loci.

• Sometimes a model is too large for performing exact inference.

Page 27: Project In Bioinformatics (236524)- Projects Presentation.

Project #3

• Goal of project #3: provide the means for performing approximate inference via a heuristic which ignores extreme markers in the computations when these are too strenuous to be performed exactly.

• Currently Superlink performs exact inference.

Page 28: Project In Bioinformatics (236524)- Projects Presentation.

General Outline of Heuristic Algorithm

1. Begin with the total number of markers as specified in the input.

2. Determine an elimination order for the problem as is.

3. Check the complexity of the elimination order found:

a. If it is greater than some determined threshold, clip off one of the extreme markers and return to step 2.

b. Else, continue to compute the likelihood.
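The outline above amounts to a simple loop. In the sketch below the function `clip_extreme_markers` and both callbacks are hypothetical. `complexity` stands in for steps 2 and 3 (finding an elimination order and measuring its cost), and `choose_end` decides which of the two extreme markers to drop. The example rule drops the extreme marker that is farther from an assumed disease locus position.

```python
def clip_extreme_markers(markers, complexity, threshold, choose_end):
    """Drop extreme markers until the estimated complexity is acceptable.

    markers    -- marker positions in map order
    complexity -- callable: list of markers -> cost of the best order found
    threshold  -- largest cost for which exact computation is attempted
    choose_end -- callable: list of markers -> "left" or "right"
    """
    markers = list(markers)
    while len(markers) > 1 and complexity(markers) > threshold:
        if choose_end(markers) == "left":
            markers.pop(0)     # clip the leftmost marker
        else:
            markers.pop()      # clip the rightmost marker
    return markers


# Illustrative stand-ins, not Superlink's actual cost model.
DISEASE_POSITION = 27.0


def toy_complexity(markers):
    return 2 ** len(markers)


def farther_from_disease(markers):
    left = abs(markers[0] - DISEASE_POSITION)
    right = abs(markers[-1] - DISEASE_POSITION)
    return "left" if left >= right else "right"


kept = clip_extreme_markers([0.0, 10.0, 20.0, 30.0, 40.0],
                            toy_complexity, 8, farther_from_disease)
print(kept)
```

Swapping in a different `choose_end` rule changes which marker is clipped without touching the loop itself.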

Page 29: Project In Bioinformatics (236524)- Projects Presentation.

Open Questions

• Which marker to clip in each iteration?

Some possible options are:

– The marker farther from the disease locus.

– The less informative marker.

– A marker which is very close to its adjacent marker.

Page 30: Project In Bioinformatics (236524)- Projects Presentation.
Page 31: Project In Bioinformatics (236524)- Projects Presentation.

Project #4

• Another project which deals with approximate inference in a different way…

• Goal of project #4: to provide the means for performing approximate inference by using Iterative Join-Graph Propagation.

Page 32: Project In Bioinformatics (236524)- Projects Presentation.

Pearl’s Polytree Algorithm (BP – Belief Propagation)

An exact inference algorithm for singly-connected networks.

Each node X computes BEL(x) = P(X=x|E), where E is the observed evidence, by combining messages from:

• its children, and

• its parents.

Page 33: Project In Bioinformatics (236524)- Projects Presentation.

Loopy Belief Propagation (Iterative-BP)

• Uses Pearl’s polytree algorithm on a Bayesian network with loops.

• Initialization: all messages are initialized to a vector of ones.

• At each iteration: all nodes calculate their outgoing messages based on the incoming messages of their neighbors from the previous iteration.

• Stopping condition: convergence of messages, i.e. none of the beliefs changes between successive iterations by more than a small threshold (e.g., $10^{-4}$).
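The initialization, parallel update and stopping rule can be illustrated as follows. As a simplifying assumption, the sketch uses the pairwise Markov-network form of sum-product message passing rather than Pearl's π/λ messages on the directed network. The function `loopy_bp` and all potentials are hypothetical.

```python
def loopy_bp(unary, pairwise, tol=1e-4, max_iters=100):
    """Parallel sum-product message passing on a pairwise model.

    unary    -- {node: [phi(0), phi(1), ...]}
    pairwise -- {(i, j): psi} with psi[x_i][x_j]
    Returns (beliefs, number of iterations performed).
    """
    nbrs = {i: [] for i in unary}
    for i, j in pairwise:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def psi(i, j, xi, xj):
        if (i, j) in pairwise:
            return pairwise[(i, j)][xi][xj]
        return pairwise[(j, i)][xj][xi]

    # Initialization: every message is a vector of ones.
    msgs = {(i, j): [1.0] * len(unary[j]) for i in nbrs for j in nbrs[i]}

    def beliefs(m):
        out = {}
        for i in unary:
            b = list(unary[i])
            for k in nbrs[i]:
                b = [bx * mx for bx, mx in zip(b, m[(k, i)])]
            z = sum(b)
            out[i] = [bx / z for bx in b]
        return out

    old = beliefs(msgs)
    for iteration in range(1, max_iters + 1):
        new_msgs = {}
        for (i, j) in msgs:   # uses only messages from the previous iteration
            out = []
            for xj in range(len(unary[j])):
                total = 0.0
                for xi in range(len(unary[i])):
                    term = unary[i][xi] * psi(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:
                            term *= msgs[(k, i)][xi]
                    total += term
                out.append(total)
            z = sum(out)
            new_msgs[(i, j)] = [v / z for v in out]
        msgs = new_msgs
        new = beliefs(msgs)
        change = max(abs(p - q) for i in new for p, q in zip(new[i], old[i]))
        old = new
        if change < tol:      # stopping condition: beliefs have converged
            break
    return new, iteration


# A chain 0 - 1 - 2 (no loops), where the result is exact.
unary = {0: [0.7, 0.3], 1: [0.5, 0.5], 2: [0.2, 0.8]}
pairwise = {(0, 1): [[0.9, 0.1], [0.1, 0.9]], (1, 2): [[0.8, 0.2], [0.3, 0.7]]}
chain_beliefs, n_iters = loopy_bp(unary, pairwise)
print(chain_beliefs, n_iters)
```

On a graph with loops the same code still runs, but the returned beliefs are then only approximations and convergence is not guaranteed, which is why `max_iters` caps the number of iterations.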

Page 34: Project In Bioinformatics (236524)- Projects Presentation.

Generalized Belief Propagation (GBP)

• An extension of IBP towards being an anytime algorithm.

• Can be significantly more accurate than ordinary IBP at an adjustable increased complexity.

• Central idea: improve the approximation by clustering some of the network’s nodes into super nodes and applying message passing between the super nodes rather than between the original singleton nodes.

Page 35: Project In Bioinformatics (236524)- Projects Presentation.

Iterative Join-Graph Propagation (IJGP)

• A special class of GBP (Generalized Belief Propagation) algorithms.

• Pearl’s BP algorithm on trees was extended to a general propagation algorithm on trees of clusters – join-tree clustering (exact method).

• IJGP extends this idea to a join-graph, by applying join-tree message passing over the join-graph, iteratively.

• IJGP(i) – works on join-graphs having cluster size bounded by i variables.

i allows the user to control the tradeoff between time and accuracy.

Page 36: Project In Bioinformatics (236524)- Projects Presentation.

Belief Network (BN)

A quadruple BN = <X, D, G, P>:

– X = {X1,…Xn} is a set of random variables.

– D = {D1,…,Dn} is the set of corresponding domains.

– G is a directed acyclic graph over X.

– P = {p1,…,pn}, where pi = P(Xi | pai) (pai are the parents of Xi in G), denote probability tables.

Page 37: Project In Bioinformatics (236524)- Projects Presentation.

Join-Graph Decomposition

A triple D = <JG, χ, ψ> for BN = <X, D, G, P>:

– JG = (V, E) is a graph.

– χ, ψ are functions which associate with each vertex v ∈ V two sets χ(v) ⊆ X and ψ(v) ⊆ P, such that:

1. Each function pi ∈ P is associated with exactly one vertex v ∈ V.

2. (connectedness) For each variable Xi ∈ X, the set of vertices v ∈ V which are associated with it induces a connected sub-graph of JG.

Page 38: Project In Bioinformatics (236524)- Projects Presentation.

Join-Graph Decomposition: Example (in this case, a tree)

[Figure: a Bayesian network over the variables A, B, C, D, E, F, G, and a join-tree decomposition of it with four clusters numbered 1–4.]

χ(1) = {A, B, C}, ψ(1) = {p(a), p(b|a), p(c|a,b)}

χ(2) = {B, C, D, F}, ψ(2) = {p(d|b), p(f|c,d)}

χ(3) = {B, E, F}, ψ(3) = {p(e|b, f)}

χ(4) = {E, F, G}, ψ(4) = {p(g|e, f)}

Page 39: Project In Bioinformatics (236524)- Projects Presentation.

Arc-Labeled Join-Graph Decomposition

A quadruple D = <JG, χ, ψ, θ> for BN = <X, D, G, P>:

– JG = (V, E) is a graph.

– χ, ψ are functions which associate with each vertex v ∈ V two sets χ(v) ⊆ X and ψ(v) ⊆ P.

– θ associates with each edge (v,u) ∈ E the set θ((v,u)) ⊆ X, such that:

1. Each function pi ∈ P is associated with exactly one vertex v ∈ V.

2. (arc-connectedness) For each arc (u,v), θ((u,v)) ⊆ sep(u,v), such that for every Xi ∈ X, any 2 clusters containing Xi can be connected by a path whose every arc’s label contains Xi.

Page 40: Project In Bioinformatics (236524)- Projects Presentation.

Minimal Arc-Labeled Join-Graph Decomposition

• An arc-labeled join graph is minimal if no variable can be deleted from any label while still satisfying the arc-connectedness property.

• A minimal arc-labeled join-graph does not contain any cycle relative to any single variable.

Page 41: Project In Bioinformatics (236524)- Projects Presentation.

Definition - Eliminator

• Given 2 adjacent vertices u and v of JG, the eliminator of u with respect to v includes all the variables that appear in u and don’t appear on the arc (u,v).

elim(u,v) = χ(u) - θ((u,v)).
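As a small illustration, the eliminator is a plain set difference. The sketch below assumes the four-cluster join-tree from the decomposition example, with arc labels BC, BF and EF. The dictionary names `chi` and `theta` are hypothetical.

```python
# Cluster variables chi(v) and arc labels theta((u, v)) of the join-tree example.
chi = {
    1: {"A", "B", "C"},
    2: {"B", "C", "D", "F"},
    3: {"B", "E", "F"},
    4: {"E", "F", "G"},
}
theta = {
    frozenset((1, 2)): {"B", "C"},
    frozenset((2, 3)): {"B", "F"},
    frozenset((3, 4)): {"E", "F"},
}


def elim(u, v):
    """Variables of cluster u that do not appear on the arc (u, v)."""
    return chi[u] - theta[frozenset((u, v))]


print(sorted(elim(1, 2)))  # variables summed out when 1 sends to 2
print(sorted(elim(2, 3)))  # variables summed out when 2 sends to 3
```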

Page 42: Project In Bioinformatics (236524)- Projects Presentation.

Algorithm IJGP

• Input:

– An arc-labeled join graph decomposition.

– Evidence variables var(e).

• Output:

– An augmented graph whose nodes are clusters containing the original CPTs and the messages received from neighbors.

– Approximations of P(Xi|e), for every Xi ∈ X.

Page 43: Project In Bioinformatics (236524)- Projects Presentation.

Algorithm IJGP – 1 iteration

Apply message-passing in some topological order over the join graph, forward and back. When node u sends a message to a neighbor node v:

1. Compute individual functions: include in H(u,v) each function whose scope doesn’t contain variables in elim(u,v). Denote by A the remaining functions.

2. Compute the combined function:

$h_{(u,v)} = \sum_{elim(u,v)} \prod_{f \in A} f$

3. Send all the functions to v: send h(u,v) and the individual functions H(u,v) to node v.
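A single message computation can be sketched as below. The function `send_message`, the (scope, function) representation and the CPT numbers are hypothetical. The example computes the message from cluster 1 = {A, B, C} to cluster 2 in the join-tree example, where elim(1,2) = {A}.

```python
from itertools import product


def send_message(cluster_functions, elim_vars, domains):
    """Split a cluster's functions into H and A, then combine A.

    cluster_functions -- list of (scope, fn); fn maps an assignment dict to a number
    elim_vars         -- the eliminator elim(u, v)
    Returns (scope of h, table of h, individual functions H).
    """
    elim_vars = set(elim_vars)
    H = [(s, f) for s, f in cluster_functions if not elim_vars & set(s)]
    A = [(s, f) for s, f in cluster_functions if elim_vars & set(s)]
    scope_A = sorted({v for s, _ in A for v in s})
    keep = [v for v in scope_A if v not in elim_vars]
    # Only eliminator variables that occur in some function of A are summed.
    summed = [v for v in scope_A if v in elim_vars]
    h = {}
    for kept in product(*(domains[v] for v in keep)):
        asg = dict(zip(keep, kept))
        total = 0.0
        for vals in product(*(domains[v] for v in summed)):
            asg.update(zip(summed, vals))
            p = 1.0
            for _, f in A:
                p *= f(asg)
            total += p
        h[kept] = total
    return keep, h, H


def bern(p_one, x):
    """P(X = x) for a binary X with P(X = 1) = p_one."""
    return p_one if x == 1 else 1.0 - p_one


# Hypothetical CPTs of cluster 1: p(a), p(b|a), p(c|a,b), all binary.
p_b = {0: 0.1, 1: 0.8}                                        # P(B=1 | a)
p_c = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}    # P(C=1 | a, b)
cluster1 = [
    (("A",), lambda x: bern(0.4, x["A"])),
    (("A", "B"), lambda x: bern(p_b[x["A"]], x["B"])),
    (("A", "B", "C"), lambda x: bern(p_c[(x["A"], x["B"])], x["C"])),
]
domains = {v: [0, 1] for v in "ABC"}
scope, h12, H12 = send_message(cluster1, {"A"}, domains)
print(scope, h12, len(H12))
```

Because every function of cluster 1 mentions A, H(1,2) is empty here and the whole cluster is folded into the single combined message over B and C.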

Page 44: Project In Bioinformatics (236524)- Projects Presentation.

Execution of IJGP on a Join-Tree

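[Figure: the join-tree 1: ABC – 2: BCDF – 3: BEF – 4: EFG, with arc labels BC, BF and EF.]

Messages:

$h_{(1,2)}(b,c) = \sum_{a} p(a)\,p(b|a)\,p(c|a,b)$

$h_{(2,1)}(b,c) = \sum_{d,f} p(d|b)\,p(f|c,d)\,h_{(3,2)}(b,f)$

$h_{(2,3)}(b,f) = \sum_{c,d} p(d|b)\,p(f|c,d)\,h_{(1,2)}(b,c)$

$h_{(3,2)}(b,f) = \sum_{e} p(e|b,f)\,h_{(4,3)}(e,f)$

$h_{(3,4)}(e,f) = \sum_{b} p(e|b,f)\,h_{(2,3)}(b,f)$

$h_{(4,3)}(e,f) = p(G=g_e|e,f)$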

Page 45: Project In Bioinformatics (236524)- Projects Presentation.

Algorithm IJGP – computing beliefs

• Compute P(Xi|e) for every Xi ∈ X:

– let u be a vertex in JG such that Xi ∈ χ(u).

– compute:

$P(X_i|e) = \alpha \sum_{\chi(u) - \{X_i\}} \prod_{f \in cluster(u)} f$

where α is a normalizing constant and cluster(u) includes all the functions in u, including messages sent from its neighbors.

Page 46: Project In Bioinformatics (236524)- Projects Presentation.

Bounded Join-Graphs

• Join-graphs with cluster size bounded by i.

A partition-based approach to generating such decompositions: start from a given tree-decomposition and then partition the clusters until the decomposition has clusters bounded by i.

Goal: to control the complexity of IJGP. The time and space complexity of one iteration of IJGP(i) is exponential in i.

Page 47: Project In Bioinformatics (236524)- Projects Presentation.

Algorithm join-graph structuring(i)

Output: a join-graph with cluster size bounded by i.

1. Apply procedure schematic mini-bucket(i).

2. Associate each resulting mini-bucket with a node in the join-graph. The variables of the node are those appearing in the mini-bucket. The original functions of the node are those in the mini-bucket.

3. Keep the arcs created by the procedure (out-edges) and label them by the regular separator.

4. Connect the mini-bucket clusters belonging to the same bucket in a chain by in-edges labeled by the single variable of the bucket.

Page 48: Project In Bioinformatics (236524)- Projects Presentation.

Procedure schematic mini-bucket(i)

1. Order the variables from X1 to Xn, and associate a bucket with each variable.

2. Place each CPT in the bucket of the highest index variable in its scope.

3. For j = n to 1 do:

– Partition the functions in bucket(Xj) into mini-buckets having at most i variables.

– For each mini-bucket mb create a function (message) f where scope(f) = {X | X ∈ mb} − {Xj}, and place scope(f) in the bucket of its highest index variable. mb needs to be connected with an arc to the bucket of f (which will be created later).
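Since the procedure manipulates only scopes, it can be sketched without any probability tables. The code below is an illustration under stated assumptions: the function name is hypothetical, mini-buckets are formed with a greedy first-fit rule (the procedure does not prescribe a partitioning strategy), and the arcs between mini-buckets are not tracked. The example uses a seven-variable network with CPTs P(A), P(B|A), P(C|A,B), P(D|B), P(F|C,D), P(E|B,F), P(G|F,E), the ordering A, B, C, D, F, E, G, and i = 3.

```python
def schematic_mini_bucket(order, scopes, i):
    """Run the procedure on scopes only.

    order  -- variables X1..Xn as a list
    scopes -- iterable of CPT scopes (sets of variables)
    i      -- maximum number of variables in a mini-bucket
    Returns {variable: list of mini-buckets}; a mini-bucket is a list of scopes.
    """
    pos = {v: k for k, v in enumerate(order)}
    buckets = {v: [] for v in order}
    for scope in scopes:
        buckets[max(scope, key=pos.get)].append(frozenset(scope))

    result = {}
    for var in reversed(order):            # j = n down to 1
        minis = []                         # entries: [union of variables, scopes]
        for scope in buckets[var]:
            for mb in minis:
                if len(mb[0] | scope) <= i:    # first mini-bucket that fits
                    mb[0] |= scope
                    mb[1].append(scope)
                    break
            else:
                minis.append([set(scope), [scope]])
        result[var] = [mb[1] for mb in minis]
        for union, _ in minis:
            message_scope = frozenset(union - {var})
            if message_scope:              # place scope(f) in its highest bucket
                buckets[max(message_scope, key=pos.get)].append(message_scope)
    return result


order = list("ABCDFEG")
cpt_scopes = ["A", "AB", "ABC", "BD", "CDF", "BEF", "EFG"]
mini_buckets = schematic_mini_bucket(order, [set(s) for s in cpt_scopes], 3)
for var in reversed(order):
    groups = [" ".join("".join(sorted(s)) for s in mb) for mb in mini_buckets[var]]
    print(var + ":", " | ".join(groups))
```

With these inputs only the bucket of F has to be split, into the mini-buckets {F, C, D} and {B, F}, because their union would contain four variables.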

Page 49: Project In Bioinformatics (236524)- Projects Presentation.

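Build Join-Graph: Example

[Figure: the Bayesian network over the variables A, B, C, D, E, F, G.]

a. After applying schematic mini-bucket(3), with the buckets listed from Xn (top) to X1 (bottom):

G: (GFE)
E: (EBF) (EF)
F: (FCD) (BF)
D: (DB) (CD)
C: (CAB) (CB)
B: (BA) (AB) (B)
A: (A) (A)

b. After applying alg. join-graph structuring:

[Figure: a join-graph with the clusters GFE, EBF, BF, FCD, CDB, CAB, BA and A, holding the functions P(G|F,E), P(E|B,F), P(F|C,D), P(D|B), P(C|A,B), P(B|A) and P(A), with arcs labeled EF, BF, F, CD, CB, BA, B and A.]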

Page 50: Project In Bioinformatics (236524)- Projects Presentation.

IJGP(i) -summary

• As i is increased, we get a more accurate performance, requiring more time to process.

• This yields the anytime behavior of the algorithm.

