Motif Refinement using Hybrid Expectation Maximization Algorithm. Chandan Reddy, Yao-Chung Weng, Hsiao-Dong Chiang. School of Electrical and Computer Engr., Cornell University, Ithaca, NY 14853.
Transcript
Slide 1
Motif Refinement using Hybrid Expectation Maximization Algorithm
Chandan Reddy, Yao-Chung Weng, Hsiao-Dong Chiang
School of Electrical and Computer Engr., Cornell University, Ithaca, NY 14853
Slide 2
Motif Finding Problem
Motifs are certain patterns in DNA and protein sequences that are strongly conserved, i.e. they have important biological functions like gene regulation and gene interaction.
Finding these conserved patterns might be very useful for controlling the expression of genes.
The motif finding problem is to detect novel, over-represented unknown signals in a set of sequences (e.g. transcription factor binding sites in a genome).
Slide 3
Motif Finding Problem
Consensus pattern: CCGATTACCGA, an (l, d) = (11, 2) consensus pattern.
Slide 4
Problem Definition
Without any previous knowledge about the consensus pattern, discover all instances (alignment positions) of the motifs and then recover the final pattern, from which every instance differs by at most a given number of mutations.
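The "within a given number of mutations" condition can be illustrated with a minimal Python sketch. The function names and the three example instances are made up for illustration; only the consensus CCGATTACCGA and d = 2 come from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_valid_motif(consensus: str, instances: list[str], d: int) -> bool:
    """True if every instance is within d mutations of the consensus."""
    return all(hamming(consensus, inst) <= d for inst in instances)


consensus = "CCGATTACCGA"  # the (11, 2) pattern from the slides
# Hypothetical instances, invented for this example only.
instances = ["CCGATTACCGA", "CCGATTACGGA", "ACGATTACCGT"]
print(is_valid_motif(consensus, instances, d=2))
```

The hard part of the problem is that neither the consensus nor the instance positions are known in advance; this check only verifies a candidate answer.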
Slide 5
Complexity of the Problem
Let n be the length of each DNA sequence, l the length of the motif, t the number of sequences, and d the number of mutations in a motif.
The running time of a brute force approach: there are $(n-l+1)$ l-mers in each of the t sequences, so the total number of combinations of l-mers across the t sequences is $(n-l+1)^t$.
Typically, n is much larger than l, e.g. n = 600, t = 20.
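To get a feel for the size of the brute-force search space, the count can be evaluated directly. This is a small sketch; n = 600 and t = 20 are the slide's values, while l = 11 is an assumption borrowed from the (11, 2) example.

```python
def brute_force_combinations(n: int, l: int, t: int) -> int:
    """Number of ways to pick one l-mer start in each of t sequences of length n."""
    return (n - l + 1) ** t


# n and t come from the slide; l = 11 is assumed for illustration.
count = brute_force_combinations(n=600, l=11, t=20)
print(f"{len(str(count))} decimal digits")
```

Python integers have arbitrary precision, so the exact count is computed without overflow.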
Slide 6
Existing methodologies
Generative probabilistic representation (continuous): Gibbs Sampling, Expectation Maximization, Greedy CONSENSUS, HMM based
Mismatch representation (discrete consensus): Projection Methods, Multiprofiler, Suffix Trees
Slide 7
Existing methodologies
Global Solvers
Advantage: locate the neighborhood of globally optimal solutions.
Disadvantage: miss better solutions locally.
Examples: Random Projection, Pattern Branching, etc.
Local Solvers
Advantage: return the best solution in the neighborhood.
Disadvantage: rely heavily on initial conditions.
Examples: EM, Gibbs Sampling, Greedy CONSENSUS, etc.
Slide 8
Our Approach
Run a global solver (Random Projection) to estimate the neighborhood of a promising solution.
Using this neighborhood as the initial guess, apply a local solver (Expectation Maximization) to refine the solution toward the global optimal solution.
Perform an efficient neighborhood search to jump out of the convergence region and find other local solutions systematically.
A hybrid approach combines the advantages of both the global and local solvers.
Slide 9
Random Projection
Implements a hash function h(x) that maps each l-mer onto a k-dimensional space.
Hashes all possible l-mers in the t sequences into $4^k$ buckets, where each bucket corresponds to a unique k-mer.
After imposing certain conditions and setting a reasonable bucket threshold S, the buckets that exceed S are returned as the solution.
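The hashing step can be sketched in a few lines of Python. This is a simplified illustration, assuming h(x) reads k randomly chosen positions of the l-mer; the function names, the seed parameter, and the toy sequences are hypothetical, and the "certain conditions" mentioned on the slide are not modeled.

```python
import random
from collections import defaultdict


def random_projection_buckets(seqs, l, k, seed=0):
    """Hash every l-mer of every sequence by the letters at k random positions."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))  # defines the projection h(x)
    buckets = defaultdict(list)
    for i, seq in enumerate(seqs):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)  # one of 4**k k-mers
            buckets[key].append((i, start))
    return positions, buckets


def heavy_buckets(buckets, s):
    """Keep only the buckets whose size exceeds the threshold s."""
    return {key: hits for key, hits in buckets.items() if len(hits) > s}


# Toy input, invented for illustration.
seqs = ["ACGTACGT", "ACGTTTTT"]
positions, buckets = random_projection_buckets(seqs, l=4, k=2)
print(positions, heavy_buckets(buckets, s=2))
```

Each entry in a heavy bucket is a (sequence index, start position) pair, which is the kind of initial alignment a local solver can then refine.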
Slide 10
Expectation Maximization
Expectation Maximization is a local optimal solver in which we refine the solution yielded by the random projection methodology. The EM method iteratively updates the solution until it converges to a locally optimal one.
Follow these steps:
Compute the scoring function.
Iterate the Expectation step and the Maximization step.
Slide 11
Profile Space
j     k=b    k=1    k=2    k=3    k=4    ...  k=l
{A}   C_0,1  C_1,1  C_2,1  C_3,1  C_4,1  ...  C_l,1
{T}   C_0,2  C_1,2  C_2,2  C_3,2  C_4,2  ...  C_l,2
{G}   C_0,3  C_1,3  C_2,3  C_3,3  C_4,3  ...  C_l,3
{C}   C_0,4  C_1,4  C_2,4  C_3,4  C_4,4  ...  C_l,4
A profile is a matrix of probabilities, where the rows represent possible bases and the columns represent consecutive sequence positions.
Applying the coefficient formula to the profile space constructs the PSSM.
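A profile of this shape can be built from a set of aligned motif instances. The sketch below is an assumption-laden illustration: the coefficient formula is not given on the slide, so plain relative frequencies (count divided by the number of instances) are used instead, and the background column k=b is omitted. Function names and the example instances are hypothetical.

```python
BASES = "ATGC"  # row order used on the slide


def count_matrix(instances):
    """Counts of each base (rows) at each motif position (columns)."""
    l = len(instances[0])
    counts = {b: [0] * l for b in BASES}
    for inst in instances:
        for k, base in enumerate(inst):
            counts[base][k] += 1
    return counts


def profile(instances):
    """Relative frequencies; an assumed stand-in for the coefficient formula."""
    t = len(instances)
    return {b: [c / t for c in row] for b, row in count_matrix(instances).items()}


# Made-up aligned instances for illustration.
p = profile(["ACG", "ACT", "AGG", "TCG"])
for base in BASES:
    print(base, p[base])
```

Every column of the resulting matrix sums to 1, which is what makes it a matrix of probabilities.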
Slide 12
Scoring function: Maximum Likelihood
Slide 13
Expectation Step
The Expectation step returns the expected number of occurrences of the j-th residue at each position of the motif instance and in the overall sequence. The algorithm is as follows:
Obtain the probabilities indexed by k,j from the previous M-step iteration.
Use them to calculate the probability of every possible l-mer under the expected motif.
Given the probability of each l-mer, calculate the probability that it is the correct starting position using Bayes' formula.
Weighting each position of each l-mer by this probability, calculate the expected number of residue j at position k.
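The E-step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes exactly one motif occurrence per sequence, a uniform prior over start positions, and strictly positive probabilities. The names `e_step`, `q`, and `q0` are hypothetical.

```python
BASES = "ACGT"


def e_step(seqs, q, q0):
    """Expected residue counts for motif positions and for the background.

    q  : list of l dicts mapping base -> motif probability at that position
    q0 : dict mapping base -> background probability
    """
    l = len(q)
    exp_motif = [{b: 0.0 for b in BASES} for _ in range(l)]
    exp_bg = {b: 0.0 for b in BASES}
    for seq in seqs:
        starts = range(len(seq) - l + 1)
        # Likelihood ratio motif/background for the l-mer at each start.
        ratios = []
        for s in starts:
            r = 1.0
            for k in range(l):
                base = seq[s + k]
                r *= q[k][base] / q0[base]
            ratios.append(r)
        # Bayes' formula with a uniform prior over start positions.
        total = sum(ratios)
        weights = [r / total for r in ratios]
        for base in seq:
            exp_bg[base] += 1.0  # count everything as background first
        for s, w in zip(starts, weights):
            for k in range(l):
                base = seq[s + k]
                exp_motif[k][base] += w  # weighted motif count
                exp_bg[base] -= w        # remove that weight from background
    return exp_motif, exp_bg
```

With t sequences of length n, each motif column of the result sums to t and the background counts sum to t (n - l), which matches the denominators used in the Maximization step.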
Slide 14
Maximization Step
The Maximization step receives the expected values passed on by the E-step, calculates the new probabilities for k,j and 0,j, and returns them to the E-step.
At iteration q, the probability for k,j is E[k,j] / t, and the probability for 0,j is E[0,j] / (t (n - l)).
If the probabilities at iteration q equal those at iteration q-1, the iteration ends. All the locally optimal solution sites are returned, with the consensus made up of the j-th residue with the highest probability at the k-th position.
Otherwise, the probabilities from iteration q are used in iteration q+1 of the E-step.
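The update rules and the stopping test can be written down directly. This is a hedged sketch: the function names are hypothetical, and the tolerance `tol` is an assumption added because exact floating-point equality between iterations is rarely reached in practice.

```python
def m_step(exp_motif, exp_bg, t, n):
    """New probabilities from expected counts: E[k,j]/t and E[0,j]/(t(n-l))."""
    l = len(exp_motif)
    q = [{b: e / t for b, e in col.items()} for col in exp_motif]
    q0 = {b: e / (t * (n - l)) for b, e in exp_bg.items()}
    return q, q0


def converged(q_new, q_old, tol=1e-6):
    """Stopping test; tol is an assumed tolerance, not from the slides."""
    return all(
        abs(q_new[k][b] - q_old[k][b]) <= tol
        for k in range(len(q_new))
        for b in q_new[k]
    )


def consensus(q):
    """Residue with the highest probability at each motif position."""
    return "".join(max(col, key=col.get) for col in q)


# Made-up expected counts for t = 2 sequences of length n = 7 and l = 2.
exp_motif = [
    {"A": 2.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 1.5, "G": 0.0, "T": 0.0},
]
exp_bg = {"A": 3.0, "C": 3.0, "G": 2.0, "T": 2.0}
q, q0 = m_step(exp_motif, exp_bg, t=2, n=7)
print(consensus(q), q0)
```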
Slide 15
Slide 16
Basic Idea
One-to-one correspondence of the critical points:
Local Minimum  <->  Stable Equilibrium Point
Saddle Point   <->  Decomposition Point
Local Maximum  <->  Source
Slide 17
Theoretical Background
Practical Stability Boundary
The problem of finding all the Tier-1 stable equilibrium points of $x_s$ is the problem of finding all the decomposition points on its stability boundary.
Slide 18
Theoretical background
Theorem (Unstable manifold of type-1 equilibrium point): Let $x_s^1$ be a stable e.p. of the gradient system (2) and $x_d$ be a type-1 e.p. on the practical stability boundary $\partial A_p(x_s^1)$. Assume that there exist $\epsilon$ and $\delta$ such that $|\nabla f(x)| > \epsilon$ unless $x$ is within $\delta$ of the set $\{x : \nabla f(x) = 0\}$. If every e.p. of (1) is hyperbolic and its stable and unstable manifolds satisfy the transversality condition, then there exists another stable e.p. $x_s^2$ to which the one-dimensional unstable manifold of $x_d$ converges.
Our method finds the stability boundary between the two local minima and traces the stability boundary to find the saddle point. We used a new trajectory adjustment procedure to move along the practical stability boundary.
Slide 19
Definitions
Def 1: $x$ is said to be a critical point of (1) if it satisfies the condition $\nabla f(x) = 0$, where $f(x)$ is the objective function, assumed to be in $C^2(\mathbb{R}^n, \mathbb{R})$. The corresponding nonlinear dynamical system is -------- Eq. (1)
The solution curve of Eq. (1) starting from $x$ at time $t = 0$ is called a trajectory and is denoted by $\Phi(x, \cdot) : \mathbb{R} \to \mathbb{R}^n$. A state vector $x$ is called an equilibrium point (e.p.) of Eq. (3) if $f(x) = 0$.
Slide 20
Definitions (contd.)
Def 2: An equilibrium point is said to be hyperbolic if the Jacobian of f at point x has no eigenvalues with zero real part. A hyperbolic e.p. is a:
Stable e.p. if all the eigenvalues of its Jacobian have negative real part.
Unstable e.p. if some eigenvalues have positive real part.
Type-k e.p. if its Jacobian has exactly k eigenvalues with positive real part.
We propose to build a negative gradient system associated with (1) as shown below:
$dx/dt = -\nabla f(x)$ -------- Eq. (2)
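The classification in Def 2 can be tried on a toy problem. The sketch below assumes a made-up two-dimensional objective f(x, y) = (x^2 - 1)^2 + y^2, which is not from the slides; for the negative gradient system the Jacobian is minus the Hessian of f, so it is symmetric and its eigenvalues are real.

```python
import math


def hessian(x, y):
    """Hessian entries (f_xx, f_xy, f_yy) of the toy f = (x^2 - 1)^2 + y^2."""
    return 12 * x * x - 4, 0.0, 2.0


def jacobian_eigenvalues(x, y):
    """Eigenvalues of the Jacobian of -grad f, i.e. of minus the Hessian."""
    fxx, fxy, fyy = hessian(x, y)
    a, b, c = -fxx, -fxy, -fyy
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)  # closed form for a symmetric 2x2 matrix
    return mean - radius, mean + radius


def classify(x, y, tol=1e-12):
    """Label an equilibrium point as stable, type-k, or non-hyperbolic."""
    eig = jacobian_eigenvalues(x, y)
    if any(abs(e) <= tol for e in eig):
        return "non-hyperbolic"
    k = sum(e > 0 for e in eig)
    return "stable" if k == 0 else f"type-{k}"


for point in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.0)]:
    print(point, classify(*point))
```

For this toy function the two local minima are stable equilibrium points of the gradient system, and the saddle at the origin is a type-1 equilibrium point.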
Slide 21
Definitions (contd.)
A dynamical system is completely stable if every trajectory of the system leads to one of its stable equilibrium points.
Def 3: The stability region (or region of attraction) of a stable equilibrium point $x_s$ of a nonlinear dynamical system (1) is denoted by $A(x_s)$ and is
$A(x_s) = \{x \in \mathbb{R}^n : \lim_{t \to \infty} \Phi(x, t) = x_s\}$
The boundary of the stability region is called the stability boundary of $x_s$ and is represented as $\partial A(x_s)$.
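Membership in a stability region can be tested numerically by following a trajectory of the negative gradient system. The sketch below is an illustration under stated assumptions: the toy objective f(x, y) = (x^2 - 1)^2 + y^2, the forward Euler integrator, and the step size and step count are all choices made for this example, not taken from the slides.

```python
def neg_gradient(x, y):
    """Right-hand side of dx/dt = -grad f for the toy f = (x^2 - 1)^2 + y^2."""
    return -4 * x * (x * x - 1), -2 * y


def flow(x, y, step=0.01, n_steps=5000):
    """Forward Euler approximation of the trajectory starting at (x, y)."""
    for _ in range(n_steps):
        dx, dy = neg_gradient(x, y)
        x, y = x + step * dx, y + step * dy
    return x, y


# Two starting points on opposite sides of the line x = 0.
print(flow(0.3, 1.0))
print(flow(-0.3, 1.0))
```

For this toy system the stable equilibrium points are (1, 0) and (-1, 0), and the line x = 0 separates their stability regions, so it plays the role of the stability boundary.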
Slide 22
Definitions (contd.)
Def 4: The practical stability region of a stable equilibrium point $x_s$ of a nonlinear dynamical system (1) is denoted by $A_p(x_s)$. The practical stability boundary $\partial A_p(x_s)$ is a subset of its stability boundary. It eliminates the complex portion of the stability boundary which has no contact with the complement of the closure of the stability region.
Def 5: A decomposition point is a type-1 equilibrium point $x_d$ on the practical stability boundary of a stable equilibrium point $x_s$.
Slide 23
Theoretical background
Theorem 1 (Unstable manifold of type-1 equilibrium point): Let $x_s^1$ be a stable e.p. of the gradient system (2) and $x_d$ be a type-1 e.p. on the practical stability boundary $\partial A_p(x_s^1)$. Assume that there exist $\epsilon$ and $\delta$ such that $|\nabla f(x)| > \epsilon$ unless $x$ is within $\delta$ of the set $\{x : \nabla f(x) = 0\}$. If every e.p. of (1) is hyperbolic and its stable and unstable manifolds satisfy the transversality condition, then there exists another stable e.p. $x_s^2$ to which the one-dimensional unstable manifold of $x_d$ converges.
Our method finds the stability boundary between the two local minima and traces the stability boundary to find the saddle point. We used a new trajectory adjustment procedure to move along the practical stability boundary.
Slide 24
Our Method
Slide 25
Search Directions
Slide 26
Slide 27
Our Method
The exit point method is implemented so that EM can move out of its convergence region to seek out other local optimal solutions.
Construct a PSSM from initial alignments.
Calculate eigenvectors of the Hessian matrix.
Find exit points (or saddle points) along each eigenvector.
Apply EM from the new stability/convergence region.
Repeat first step.
Return max score $\{A, a_{1i}, a_{2j}\}$
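The exit-point idea can be illustrated on a toy problem. This sketch is not the authors' procedure: it swaps in plain gradient descent for EM as the local solver, uses the made-up objective f(x, y) = (x^2 - 1)^2 + y^2 in place of the likelihood score, takes the search direction as given rather than from Hessian eigenvectors, and treats the first point along the ray where the objective stops increasing as the exit point. All function names and step sizes are assumptions.

```python
def f(p):
    x, y = p
    return (x * x - 1) ** 2 + y * y


def grad(p):
    x, y = p
    return (4 * x * (x * x - 1), 2 * y)


def find_exit_point(start, direction, step=0.01, max_steps=10_000):
    """Walk from a local minimum along direction until f stops increasing."""
    prev = f(start)
    for i in range(1, max_steps + 1):
        p = tuple(s + i * step * d for s, d in zip(start, direction))
        val = f(p)
        if val < prev:  # f started to decrease: the previous point is the exit
            return tuple(s + (i - 1) * step * d for s, d in zip(start, direction))
        prev = val
    return None  # no exit point found along this direction


def descend(p, step=0.01, n_steps=5000):
    """Gradient descent, used here as a stand-in for the local solver."""
    for _ in range(n_steps):
        g = grad(p)
        p = tuple(pi - step * gi for pi, gi in zip(p, g))
    return p


def next_local_minimum(start, direction, push=0.05):
    """Step slightly past the exit point and run the local solver from there."""
    exit_point = find_exit_point(start, direction)
    if exit_point is None:
        return None
    beyond = tuple(e + push * d for e, d in zip(exit_point, direction))
    return descend(beyond)


print(find_exit_point((1.0, 0.0), (-1.0, 0.0)))
print(next_local_minimum((1.0, 0.0), (-1.0, 0.0)))
```

Starting from the minimum at (1, 0) and searching in the direction (-1, 0), the sketch crosses the boundary near x = 0 and the local solver then settles in the neighboring minimum; a direction with no boundary crossing returns None.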
Slide 28
Results
Slide 29
Improvements in the Alignment Scores
Motif   Original Pattern      Score  Second Tier Pattern   Score
(11,2)  AACGGTCGCAG           125.1  CCCGGGAGCTG           153.3
(11,2)  ATACCAGTTAC           145.7  ATACCAGGGTC           153.6
(13,3)  CTACGGTCGTCTT         142.6  CCTCGGGTTTGTC         158.7
(13,3)  GACGCTAGGGGGT         158.3  GACCTTGGGTATT         165.8
(15,4)  CCGAAAAGAGTCCGA       147.5  CCGAAAGGACTGCGT       176.2
(15,4)  TGGGTGATGCCTATG       164.6  TGAGAGATGCCTATG       170.4
(17,5)  TTGTAGCAAAGGCTAAA     143.3  CAGTAGCAAAGACTTCC     175.8
(17,5)  ATCGCGAAAGGTTGTGG     174.1  ATTGCGAAAGAATGTGG     178.3
(20,6)  CTGGTGATTGAGATCATCAT  165.9  CATTTAGCTGAGTTCACCTT  194.9
(20,6)  GGTCACTTAGTGGCGCCATG  216.3  CGTCACTTAGTCGCGCCATG  219.7
Slide 30
Improvements in the Alignment Scores
Motif   Original Pattern      Score  Second Tier Pattern   Score
(11,2)  TATCGCTGGGC           147.5  TCTCGCTGGGC           161.1
(13,3)  CACCTTGGTAATT         168.4  GACCATGGGTATT         181.5
(15,4)  ATGGCGTCCGCAATG       174.7  ATGGCGTCCGAAAGA       188.5
(17,5)  CGACACTTTCTCAATGT     178.8  CGACACTATCTTAAGAT     196.2
(20,6)  TCAAATAGACTAGAGGCGAC  189.0  TCTACTAGACTGGAGGCGGC  201.1
Random Projection method results
Slide 31
Performance Coefficient
K is the set of the residue positions of the planted motif instances, and P is the corresponding set of positions predicted.
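The formula itself did not survive on this slide. The sketch below assumes the definition commonly used in the planted-motif literature, the size of the intersection of K and P divided by the size of their union; treat that as an assumption rather than the slide's own formula. The function names and the example start positions are hypothetical.

```python
def motif_positions(starts, l):
    """Residue positions (sequence index, offset) covered by one l-mer per sequence."""
    return {(i, s + k) for i, s in enumerate(starts) for k in range(l)}


def performance_coefficient(planted_starts, predicted_starts, l):
    """Assumed definition: |K intersect P| / |K union P|."""
    K = motif_positions(planted_starts, l)
    P = motif_positions(predicted_starts, l)
    return len(K & P) / len(K | P)


# Made-up example: two sequences, motif length 4.
# Sequence 0 is predicted exactly; sequence 1 is off by two positions.
print(performance_coefficient([0, 5], [0, 7], l=4))
```

Under this definition a perfect prediction scores 1 and a prediction with no overlapping positions scores 0.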
Slide 32
Results
Different motifs and the average score using random starts. The first tier and second tier improvements on synthetic data.
Slide 33
Results
Different motifs and the average score using random projection. The first tier and second tier improvements on synthetic data.
Slide 34
Results
Different motifs and the average score using random projections, and the first tier and second tier improvements on real human sequences.
Slide 35
Results on Real data
Slide 36
Concluding discussion
Using a dynamical system approach, we have shown that the EM algorithm can be improved significantly.
In the context of motif finding, we see that there are many local optimal solutions and it is important to search the neighborhood space.
Try different global methods and other techniques like GibbsDNA.