Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
Motivation
• Genomic sequences are very long:
Human genome = 3 x 10^9 long
Mouse genome = 2.7 x 10^9 long
• Aligning genomic regions is useful for revealing common gene structure
Useful to compare regions > 1,000,000-long
Main Idea
Genomic regions of interest contain ordered islands of similarity, such as genes
1. Find local alignments
2. Chain an optimal subset of them
Outline
• Methods to FIND Local Alignments
Sorting k-long words
Suffix Trees
• Methods to CHAIN Local Alignments
Dynamic Programming
Sparse Dynamic Programming
Methods to FIND Local Alignments
1. Sorting k-long words: BLAST, BLAT, and the like
2. Suffix Trees
Finding Local Alignments: Sorting k-long words
Given sequences x, y:
1. Write down all (w, 0, i): w = xi+1…xi+k
(z, 1, j): z = yj+1…yj+k
2. Sort them lexicographically
3. Deduce all k-long matches between x and y
4. Extend to local alignments
Sorting k-long words: example
Let x, y be matched with 3-long words:
x = caggc: (cag,0,0), (agg,0,1), (ggc,0,2)
y = ggcag: (ggc,1,0), (gca,1,1), (cag,1,2)
Sorted: (agg,0,1),(cag,0,0),(cag,1,2),(gca,1,1),(ggc,0,2),(ggc,1,0)
Matches:
1. cag: x1x2x3 = y3y4y5
2. ggc: x3x4x5 = y1y2y3
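The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by BLAST or BLAT: the function name `kmer_matches` is made up for this example, offsets are 0-based as in the triples above, and step 4 (extension to local alignments) is left out.

```python
def kmer_matches(x, y, k):
    """Return (word, i, j) for every k-long word found at offset i in x and j in y."""
    # Step 1: write down all (word, source, offset) triples; source 0 = x, 1 = y.
    words = [(x[i:i + k], 0, i) for i in range(len(x) - k + 1)]
    words += [(y[j:j + k], 1, j) for j in range(len(y) - k + 1)]
    # Step 2: sort lexicographically, so that equal words become adjacent.
    words.sort()
    # Step 3: scan each run of equal words and pair x-offsets with y-offsets.
    matches = []
    start = 0
    while start < len(words):
        end = start
        while end < len(words) and words[end][0] == words[start][0]:
            end += 1
        run = words[start:end]
        xs = [pos for _, src, pos in run if src == 0]
        ys = [pos for _, src, pos in run if src == 1]
        matches += [(run[0][0], i, j) for i in xs for j in ys]
        start = end
    return matches


print(kmer_matches("caggc", "ggcag", 3))
# [('cag', 0, 2), ('ggc', 2, 0)]
```

The two reported triples are the cag and ggc matches listed above, written with 0-based offsets.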
Running time
• Worst case: O(NxM)
• In practice: a large value of k results in a short list of matches
Tradeoff:
Low k: worse running time
High k: significant alignments missed
PatternHunter:
Sampling non-consecutive positions increases the likelihood to detect a conserved region, for a fixed value of k – refer to Lecture 3
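A small sketch of what sampling non-consecutive positions means. The mask `"1101"` and the function name `spaced_words` are hypothetical, chosen only for illustration; this is not PatternHunter's actual seed. A `1` keeps the position, a `0` ignores it.

```python
def spaced_words(seq, mask):
    """Yield (word, offset), where word keeps only the positions with '1' in mask."""
    keep = [p for p, bit in enumerate(mask) if bit == "1"]
    for i in range(len(seq) - len(mask) + 1):
        yield "".join(seq[i + p] for p in keep), i


# The two sequences differ at their third letter, which the mask ignores,
# so the first sampled word is the same in both.
print(next(spaced_words("acgtac", "1101")))  # ('act', 0)
print(next(spaced_words("acatac", "1101")))  # ('act', 0)
```

The sampled words can then be sorted and matched exactly like consecutive k-long words.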
Suffix Trees
• Suffix trees are a method to find all maximal matches between two strings (and much more)
Example: x = dabdac
[Figure: suffix tree for x = dabdac, with leaves numbered 1 to 6]
Definition of a Suffix Tree
Definition:
For string x = x1…xm, a suffix tree is:
A rooted tree with m leaves
Leaf i: xi…xm
Each edge is a substring
No two edges out of a node start with the same letter
It follows that every substring of x corresponds to
an initial part of a path from the root to a leaf
Constructing a Suffix Tree
• Naïve algorithm: O( N^2 ) time
• Better algorithms: O( N ) time
(outside the scope of this class – too technical and not so interesting)
Memory: O( N ) but with a significant constant
Naïve Algorithm to Construct a Suffix Tree
1. Initialize tree T: a single root node r
2. Insert special symbol $ at end of x
3. For j = 1 to m
• Find longest match of xj…xm to T, starting from r
• Split edge where match stops: new node w
• Create edge (w, j), and label with unmatched portion of xj…xm
Example of Suffix Tree Construction
x = d a b d a $
1. Insert d a b d a $
2. Insert a b d a $
3. Insert b d a $
4. Insert d a $
5. Insert a $
6. Insert $
[Figure: the tree after each insertion, with leaves numbered 1 to 6 in insertion order]
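The naïve construction can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the names `Node`, `build_suffix_tree` and `leaves` are made up here, and each edge is stored as a pair of indices (start, end) into x$ rather than as a copied string.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first letter of edge label -> (start, end, child)
        self.leaf = leaf    # 1-based suffix number for a leaf, None otherwise


def build_suffix_tree(x):
    """Naive quadratic construction; edge (start, end) denotes s[start:end]."""
    s = x + "$"
    root = Node()
    for j in range(len(s)):  # insert the suffix s[j:]
        node, pos = root, j
        while True:
            c = s[pos]
            if c not in node.children:
                # No edge starts with the next letter: hang a new leaf here.
                node.children[c] = (pos, len(s), Node(leaf=j + 1))
                break
            start, end, child = node.children[c]
            k = 0  # walk along the edge while it agrees with the suffix
            while start + k < end and s[start + k] == s[pos + k]:
                k += 1
            if start + k == end:
                node, pos = child, pos + k  # whole edge matched: keep descending
                continue
            # The match stops inside the edge: split it at a new node w.
            w = Node()
            node.children[c] = (start, start + k, w)
            w.children[s[start + k]] = (start + k, end, child)
            w.children[s[pos + k]] = (pos + k, len(s), Node(leaf=j + 1))
            break
    return root, s


def leaves(node, s, prefix=""):
    """Return {leaf number: path label} for every leaf below node."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    out = {}
    for start, end, child in node.children.values():
        out.update(leaves(child, s, prefix + s[start:end]))
    return out


root, s = build_suffix_tree("dabda")
print(leaves(root, s))
```

For x = dabda the six leaves spell out the six suffixes of dabda$, and the root has four outgoing edges, starting with d, a, b and $.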
Faster Construction
Several algorithms
O( N ) time,
O( N ) memory with a big constant
Technical but not deep, outside the scope of this course
Optional: Gusfield, chapter 6
Memory to Store Suffix Tree
• Can store in O( N ) memory!
• Every edge is labeled with (i, j):
(i,j) denotes xi…xj
• Tree has O( N ) nodes
Proof:
1. # internal nodes ≤ # leaves – 1 (every internal node has at least two children)
2. # leaves = |x|
Application: find all matches between x, y
1. Build suffix tree for x, mark nodes with x
2. Insert y in suffix tree, mark all nodes y “passes from” with y
The path label of every node marked with both x and y is a common substring
Example of Suffix Tree construction
x = d a b d a $
y = a b a d a $
1. Construct tree for x
2. Insert a b a d a $
3. Insert b a d a $
4. Insert a d a $
5. Insert d a $
6. Insert a $
7. Insert $
[Figure: the tree for x with the suffixes of y inserted; each node is marked x, y, or both]
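The marking idea can be tried out with a deliberately simplified structure. The sketch below uses an uncompressed suffix trie (one letter per edge) instead of a suffix tree, so it needs quadratic space. It is meant only to show how nodes marked by both strings yield common substrings. The function name `common_substrings` and the marks 0 (for x) and 1 (for y) are choices made for this example.

```python
def common_substrings(x, y):
    """Return every non-empty substring shared by x and y, longest first."""
    root = {"marks": set(), "next": {}}
    # Insert every suffix of x (mark 0) and of y (mark 1), marking each node passed.
    for mark, s in ((0, x), (1, y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"marks": set(), "next": {}})
                node["marks"].add(mark)
    # The path label of a node carrying both marks is a common substring.
    found = []
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["next"].items():
            if child["marks"] == {0, 1}:
                found.append(label + ch)
                stack.append((child, label + ch))
    return sorted(found, key=lambda word: (-len(word), word))


print(common_substrings("dabda", "abada"))
# ['ab', 'da', 'a', 'b', 'd']
```

For the example strings the longest common substrings are ab and da.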
Application: String search on a database
Say we have a database D = { s1, s2, …, sn } (e.g., proteins)
Question: Given new string x, find all matches of x to database
1. Build suffix tree for {s1,…, sn}
2. All new queries x take O( |x| ) time (somewhat like BLAST)
Application: common substrings of k strings
To find the longest common substring of s1, s2, …sn
1. Build suffix tree for s1,…, sn
2. All nodes labeled {si1, …, sik} represent a match between si1, …, sik
Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)
The Problem: Find a Chain of Local Alignments
(x, y) → (x’, y’)
requires
x < x’ and y < y’
Each local alignment has a weight
FIND the chain with highest total weight
Quadratic Time Solution
• Build Directed Acyclic Graph (DAG):
Nodes: local alignments [(xa,xb) (ya,yb)] & score
Directed edges: local alignments that can be chained
• edge ( (xa, xb, ya, yb), (xc, xd, yc, yd) )
• xa < xb < xc < xd
• ya < yb < yc < yd
Each local alignment is a node vi with alignment score si
Quadratic Time Solution
Dynamic programming:
Initialization:
Find each node va s.t. there is no edge (u, va)
Set V(a) to be sa
Iteration:
For each vi, optimal path ending in vi has total score:
V(i) = max ( weight(vj, vi) + V(j) ), over all edges (vj, vi)
Termination:
Optimal global chain:
j = argmax ( V(j) ); trace chain from vj
Worst case time: quadratic
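A minimal sketch of this quadratic dynamic program. Two assumptions are made that the slides do not state: weight(vj, vi) is taken to be simply the score si of the alignment being appended, and the alignments are processed in order of their start coordinate xa, which is a valid topological order of the DAG. The function name and the four sample alignments are hypothetical.

```python
def chain_quadratic(alignments):
    """alignments: list of (xa, xb, ya, yb, score). Return (best total, chain indices)."""
    if not alignments:
        return 0, []
    # Any predecessor of i ends before i starts, so it also starts before i.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V = {}     # V[i]: best total score of a chain ending in alignment i
    back = {}  # back[i]: predecessor of i in that chain, or None
    for pos, i in enumerate(order):
        xa, _, ya, _, score = alignments[i]
        V[i], back[i] = score, None
        for j in order[:pos]:
            _, xd, _, yd, _ = alignments[j]
            # j may precede i only if it ends before i starts in both sequences.
            if xd < xa and yd < ya and V[j] + score > V[i]:
                V[i], back[i] = V[j] + score, j
    best = max(V, key=V.get)
    chain, node = [], best
    while node is not None:
        chain.append(node)
        node = back[node]
    return V[best], chain[::-1]


sample = [(1, 3, 1, 3, 5), (4, 6, 2, 4, 6), (5, 8, 5, 7, 3), (9, 11, 8, 10, 4)]
print(chain_quadratic(sample))
# (12, [0, 2, 3])
```

Alignment 1 has the highest individual score but overlaps alignment 0 in y and alignment 2 in x, so the best chain skips it.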
Sparse Dynamic Programming
Back to the LCS problem:
• Given two sequences x = x1, …, xm
y = y1, …, yn
• Find the longest common subsequence
• Quadratic solution with DP
• How about when “hits” xi = yj are sparse?
Sparse Dynamic Programming
[Figure: grid of hits between the two example sequences]
x = 15 3 24 16 20 4 24 3 11 18
y = 4 20 24 3 11 15 11 4 18 20
• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence
• Given a sequence over an ordered alphabet
x = x1, …, xm
• Find a subsequence
s = s1, …, sk
s1 < s2 < … < sk
Sparse LCS expressed as LIS
Create a sequence w
• Every matching point x-to-y, (i, j), is inserted into w as follows:
• For each position j of x, from smallest to largest, insert in w the points (i, j), in decreasing order of i
• The 11 example points are inserted in the order given
• Any two points (ya, xa), (yb, xb) can be chained iff
a is before b in w, and ya < yb
[Figure: the 11 matching points on the x-y grid, numbered 1 to 11 in their order of insertion into w]
Sparse LCS expressed as LIS
Create a sequence w
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w’s elements as ordered lexicographically, where
• (ya, xa) < (yb, xb) if ya < yb
Claim: An increasing subsequence of w is a common subsequence of x and y
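A short sketch of how w could be built from two sequences. The function name `match_points` and the small input sequences are hypothetical; positions are 1-based, and each point is written (i, j) with i the position in y and j the position in x, as in the sequence w above.

```python
from collections import defaultdict


def match_points(x, y):
    """Return all (i, j) with y[i] == x[j] (1-based), by increasing j, then decreasing i."""
    positions_in_y = defaultdict(list)
    for i, symbol in enumerate(y, start=1):
        positions_in_y[symbol].append(i)
    w = []
    for j, symbol in enumerate(x, start=1):
        # Decreasing i within one position of x: two points that share the same j
        # can then never both appear in a subsequence with increasing i.
        for i in reversed(positions_in_y.get(symbol, [])):
            w.append((i, j))
    return w


print(match_points([3, 1, 2, 1], [1, 3, 1, 2]))
# [(2, 1), (3, 2), (1, 2), (4, 3), (3, 4), (1, 4)]
```

Only the hits are enumerated, so the cost depends on the number of matching points rather than on the full n x m grid.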
Sparse Dynamic Programming for LIS
• Algorithm:
initialize empty array L
/* at each point, lj will contain the smallest last element among all j-long increasing subsequences found so far */
for i = 1 to |w|
binary search for w[i] in L, to find lj < w[i] ≤ lj+1
replace lj+1 with w[i] (append w[i] if L has no lj+1)
keep a backptr w[i] → lj
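The algorithm above can be sketched with Python's standard `bisect` module. The function name `longest_increasing_chain` is made up for this example; points are compared by their first coordinate only, as in the ordering defined for w, and the values of w and y are copied from the running example.

```python
from bisect import bisect_left


def longest_increasing_chain(w):
    """w: list of (i, j) points. Return a longest subsequence with strictly increasing i."""
    tails = []     # tails[k]: smallest i ending an increasing subsequence of length k + 1
    tail_idx = []  # index in w of the point currently stored at tails[k]
    back = [None] * len(w)
    for idx, (i, _) in enumerate(w):
        k = bisect_left(tails, i)  # tails[k - 1] < i <= tails[k]
        if k == len(tails):
            tails.append(i)
            tail_idx.append(idx)
        else:
            tails[k] = i
            tail_idx[k] = idx
        # Back pointer to the point that ends the best chain one element shorter.
        back[idx] = tail_idx[k - 1] if k > 0 else None
    chain = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        chain.append(w[idx])
        idx = back[idx]
    return chain[::-1]


w = [(4, 2), (3, 3), (10, 5), (2, 5), (8, 6), (1, 6),
     (3, 7), (4, 8), (7, 9), (5, 9), (9, 10)]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
chain = longest_increasing_chain(w)
print(chain)
print([y[i - 1] for i, _ in chain])
```

On the example w this returns the chain (1,6) (3,7) (4,8) (5,9) (9,10), which spells 4, 24, 3, 11, 18 in y. Each point costs one binary search, so the total time is O(|w| log |w|).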
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
L =
1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence:
s = 4, 24, 3, 11, 18
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj
L is implemented as a balanced binary tree
[Figure: a rectangle with its y-coordinates h and l marked on the y axis]
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of i:
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
a. j: rectangle in L, with largest lj ≤ li
b. If V(i) > V(j):
i. INSERT (li, V(i), i) in L
ii. REMOVE all other (lk, V(k), k) with V(k) ≤ V(i) & lk ≥ li
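The sweep above can be sketched as follows, under several assumptions that are not in the slides: each rectangle is given as (x_left, x_right, h, l, weight) with h ≤ l, a rectangle with no predecessor in L gets V(i) = w(i), and at equal x a left end is processed before a right end. L is kept as sorted Python lists searched with `bisect`, in place of a balanced binary tree. List insertion and deletion are linear, so this sketch does not reach the O(N log N) bound. The function name and sample rectangles are hypothetical.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_left, x_right, h, l, weight). Return (best total, chain indices)."""
    if not rects:
        return 0, []
    events = []
    for i, (x_left, x_right, _, _, _) in enumerate(rects):
        events.append((x_left, 0, i))   # 0 = left end, sorts before a right end at equal x
        events.append((x_right, 1, i))  # 1 = right end
    events.sort()
    ls, vs, ids = [], [], []  # L as parallel lists sorted by l; V increases with l
    V = [0] * len(rects)
    back = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            k = bisect_left(ls, h)  # entries ls[:k] have l < h
            V[i] = weight + (vs[k - 1] if k else 0)
            back[i] = ids[k - 1] if k else None
        else:
            k = bisect_right(ls, l)  # entries ls[:k] have l <= this rectangle's l
            if k and vs[k - 1] >= V[i]:
                continue  # an entry at least as good already ends at or before l
            start = bisect_left(ls, l)
            end = start
            while end < len(ls) and vs[end] <= V[i]:
                end += 1  # consecutive entries made obsolete by rectangle i
            ls[start:end] = [l]
            vs[start:end] = [V[i]]
            ids[start:end] = [i]
    best = max(range(len(rects)), key=lambda i: V[i])
    chain, node = [], best
    while node is not None:
        chain.append(node)
        node = back[node]
    return V[best], chain[::-1]


sample = [(1, 3, 1, 3, 5), (4, 6, 2, 4, 6), (5, 8, 5, 7, 3), (9, 11, 8, 10, 4)]
print(chain_rectangles(sample))
# (12, [0, 2, 3])
```

Replacing the three lists with a balanced search tree keyed on l would give the logarithmic INSERT, REMOVE and search that the algorithm relies on.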
Example
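[Figure: five rectangles in the x-y plane, labeled rectangle: weight as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; coordinate labels 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]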
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N
• Inserting to L takes log N
• All deletions are consecutive, so log N per deletion
• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree