Rapid Global Alignments

How to align genomic sequences in (more or less) linear time

Motivation

• Genomic sequences are very long:

Human genome = 3 × 10^9 long

Mouse genome = 2.7 × 10^9 long

• Aligning genomic regions is useful for revealing common gene structure

Useful to compare regions > 1,000,000-long

Main Idea

Genomic regions of interest contain ordered islands of similarity, such as genes

1. Find local alignments

2. Chain an optimal subset of them

Outline

• Methods to FIND Local Alignments

Sorting k-long words

Suffix Trees

• Methods to CHAIN Local Alignments

Dynamic Programming

Sparse Dynamic Programming

Methods to FIND Local Alignments

1. Sorting k-long words: BLAST, BLAT, and the like

2. Suffix Trees

Finding Local Alignments: Sorting k-long words

Given sequences x, y:

1. Write down all (w, 0, i): w = x_{i+1}…x_{i+k}

(z, 1, j): z = y_{j+1}…y_{j+k}

2. Sort them lexicographically

3. Deduce all k-long matches between x and y

4. Extend to local alignments

Sorting k-long words: example

Let x, y be matched with 3-long words:

x = caggc: (cag,0,0), (agg,0,1), (ggc,0,2)

y = ggcag: (ggc,1,0), (gca,1,1), (cag,1,2)

Sorted: (agg,0,1), (cag,0,0), (cag,1,2), (gca,1,1), (ggc,0,2), (ggc,1,0)

Matches:

1. cag: x1x2x3 = y3y4y5

2. ggc: x3x4x5 = y1y2y3
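The four steps above, minus the final extension step, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `kmer_matches` is an assumption, and offsets are 0-based as in the triples of the example.

```python
def kmer_matches(x, y, k):
    """Return (word, i, j) triples with x[i:i+k] == y[j:j+k] == word (0-based offsets)."""
    # Step 1: write down all (word, source, offset) triples; source 0 = x, 1 = y.
    triples = [(x[i:i + k], 0, i) for i in range(len(x) - k + 1)]
    triples += [(y[j:j + k], 1, j) for j in range(len(y) - k + 1)]
    # Step 2: sort lexicographically, so equal words become adjacent.
    triples.sort()
    # Step 3: scan each run of equal words and pair every x offset with every y offset.
    matches = []
    start = 0
    while start < len(triples):
        end = start
        while end < len(triples) and triples[end][0] == triples[start][0]:
            end += 1
        run = triples[start:end]
        xs = [pos for _, src, pos in run if src == 0]
        ys = [pos for _, src, pos in run if src == 1]
        matches.extend((run[0][0], i, j) for i in xs for j in ys)
        start = end
    return matches


print(kmer_matches("caggc", "ggcag", 3))
```

On the example sequences this reports cag at offsets (0, 2) and ggc at offsets (2, 0), which are the two matches listed above. The pairing in step 3 is where the O(N × M) worst case comes from: a word that occurs often in both sequences produces every pair of occurrences.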

Running time

• Worst case: O(N × M)

• In practice: a large value of k results in a short list of matches

Tradeoff:

Low k: worse running time

High k: significant alignments missed

PatternHunter:

Sampling non-consecutive positions increases the likelihood of detecting a conserved region, for a fixed value of k (refer to Lecture 3)
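A toy illustration of sampling non-consecutive positions, under stated assumptions: the helper name `spaced_words` and the seed pattern `1101` are made up for this sketch and are not PatternHunter's actual seed. A `1` in the seed means the position is sampled, a `0` means it is skipped, so `111` and `1101` both sample k = 3 letters.

```python
def spaced_words(seq, seed):
    """Return (word, offset) pairs, sampling only the positions marked '1' in seed."""
    keep = [p for p, c in enumerate(seed) if c == "1"]
    return [("".join(seq[i + p] for p in keep), i)
            for i in range(len(seq) - len(seed) + 1)]


# Two 4-letter regions that differ only at their third letter.
a, b = "acgt", "actt"
contiguous_a = {w for w, _ in spaced_words(a, "111")}
contiguous_b = {w for w, _ in spaced_words(b, "111")}
spaced_a = {w for w, _ in spaced_words(a, "1101")}
spaced_b = {w for w, _ in spaced_words(b, "1101")}
print(contiguous_a & contiguous_b)   # no shared contiguous 3-long word
print(spaced_a & spaced_b)           # the spaced seed still finds a shared word
```

Here the mismatch falls on the skipped position, so the spaced seed reports a hit that every contiguous 3-long word misses.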

Suffix Trees

• Suffix trees are a method to find all maximal matches between two strings (and much more)

Example: x = dabdac

[Figure: suffix tree of x = dabdac, with leaves numbered 1 to 6 by the starting position of their suffix]

Definition of a Suffix Tree

Definition:

For string x = x1…xm, a suffix tree is:

A rooted tree with m leaves

Leaf i: xi…xm

Each edge is a substring

No two edges out of a node start with the same letter

It follows that every substring corresponds to an initial part of a path from the root to a leaf

Constructing a Suffix Tree

• Naïve algorithm: O( N^2 ) time

• Better algorithms: O( N ) time

(outside the scope of this class – too technical and not so interesting)

Memory: O( N ) but with a significant constant

Naïve Algorithm to Construct a Suffix Tree

1. Initialize tree T: a single root node r

2. Insert special symbol $ at end of x

3. For j = 1 to m

• Find longest match of xj…xm to T, starting from r

• Split edge where match stops: new node w

• Create edge (w, j), and label with unmatched portion of xj…xm
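The naïve construction can be sketched as follows. This is a minimal sketch, not a production implementation: the names `Node`, `build_suffix_tree`, `leaves` and `contains` are assumptions, edge labels are stored as explicit strings (which costs more memory than index pairs into x), and `$` is assumed not to occur in x.

```python
class Node:
    def __init__(self):
        self.children = {}   # first letter of an edge label -> (label, child Node)
        self.leaf = None     # 1-based start position of the suffix (leaves only)


def build_suffix_tree(x):
    """Naive O(N^2) construction: insert the suffixes of x$ one at a time."""
    x += "$"
    root = Node()
    for j in range(len(x)):
        node, rest = root, x[j:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this letter: hang a new leaf here.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the remaining suffix.
            n = 0
            while n < len(label) and n < len(rest) and label[n] == rest[n]:
                n += 1
            if n == len(label):
                # Whole edge matched: continue from the child.
                node, rest = child, rest[n:]
                continue
            # Match stops inside the edge: split it at a new node w.
            w = Node()
            w.children[label[n]] = (label[n:], child)
            node.children[first] = (label[:n], w)
            node, rest = w, rest[n:]
    return root


def leaves(node):
    """Sorted leaf numbers below node."""
    if node.leaf is not None:
        return [node.leaf]
    return sorted(l for _, child in node.children.values() for l in leaves(child))


def contains(root, s):
    """True if s spells an initial part of some root-to-leaf path."""
    node = root
    while s:
        if s[0] not in node.children:
            return False
        label, child = node.children[s[0]]
        n = min(len(label), len(s))
        if label[:n] != s[:n]:
            return False
        node, s = child, s[n:]
    return True


root = build_suffix_tree("dabda")
print(leaves(root), contains(root, "abd"), contains(root, "dd"))
```

Because the terminal `$` is unique, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf.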

Example of Suffix Tree Construction

x = d a b d a $

1. Insert d a b d a $

2. Insert a b d a $

3. Insert b d a $

4. Insert d a $

5. Insert a $

6. Insert $

[Figure: the tree after each insertion; leaves are numbered 1 to 6 by the starting position of their suffix]

Faster Construction

Several algorithms

O( N ) time,

O( N ) memory with a big constant

Technical but not deep, outside the scope of this course

Optional: Gusfield, chapter 6

Memory to Store Suffix Tree

• Can store in O( N ) memory!

• Every edge is labeled with (i, j):

(i,j) denotes xi…xj

• Tree has O( N ) nodes

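Proof:

1. # leaves ≥ # internal nodes - 1 (every internal node other than the root has at least two children)

2. # leaves = |x|

Hence the total number of nodes is O( |x| ) = O( N )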

Application: find all matches between x, y

1. Build suffix tree for x, mark nodes with x

2. Insert y in suffix tree, mark all nodes y “passes through” with y

The path label of every node marked with both x and y is a common substring
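The marking idea can be tried out with a much simpler structure. The sketch below swaps in an uncompressed suffix trie (one node per letter) for the compacted suffix tree, so it needs O(N^2) time and memory and is only meant to show the marks at work. The function name `common_substrings` is an assumption.

```python
def common_substrings(x, y):
    """Return all non-empty substrings shared by x and y, longest first."""
    # A trie node is a pair [children dict, set of marks].
    root = [{}, set()]
    for mark, s in (("x", x), ("y", y)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node[0].setdefault(ch, [{}, set()])
                node[1].add(mark)          # this suffix passes through the node
    # Collect the path label of every node marked with both x and y.
    found = []
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node[0].items():
            if child[1] == {"x", "y"}:
                found.append(label + ch)
                stack.append((child, label + ch))
    return sorted(found, key=lambda s: (-len(s), s))


print(common_substrings("dabda", "abada"))
```

The search only descends into doubly marked nodes, which is safe because every ancestor of a doubly marked node is doubly marked as well.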

Example of Suffix Tree construction

x = d a b d a $

y = a b a d a $

1. Construct tree for x

2. Insert a b a d a $

3. Insert b a d a $

4. Insert a d a $

5. Insert d a $

6. Insert a $

7. Insert $

[Figure: the suffix tree of x with leaves 1 to 6; as each suffix of y is inserted, the nodes it passes through are marked y in addition to x]

Application: String search on a database

Say we have a database D = { s1, s2, …, sn } (e.g., proteins)

Question: Given new string x, find all matches of x to database

1. Build suffix tree for {s1,…, sn}

2. All new queries x take O( |x| ) time (somewhat like BLAST)
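The "index once, query many times" pattern can be sketched as below. As a loud assumption, the index here is an uncompressed suffix trie that stores an occurrence list at every node, which is far more memory-hungry than a real suffix tree. It does keep the property that answering a query only walks |x| letters. The names `build_index` and `query` are hypothetical.

```python
def build_index(database):
    """Suffix trie over all strings; each node lists its (string id, start) occurrences."""
    root = {"next": {}, "occ": []}
    for sid, s in enumerate(database):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"next": {}, "occ": []})
                node["occ"].append((sid, start))
    return root


def query(index, x):
    """Walk x from the root; the node reached knows every occurrence of x."""
    node = index
    for ch in x:
        node = node["next"].get(ch)
        if node is None:
            return []
    return node["occ"]


index = build_index(["dabda", "abada"])
print(query(index, "da"))
```

Each reported pair is (index of the database string, 0-based start of the match).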

Application: common substrings of k strings

To find the longest common substring of s1, s2, …sn

1. Build suffix tree for s1,…, sn

2. All nodes labeled {si1, …, sik} represent a match between si1, …, sik

Methods to CHAIN Local Alignments

Sparse Dynamic Programming: O(N log N)

The Problem: Find a Chain of Local Alignments

(x, y) → (x’, y’)

requires

x < x’ and y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Quadratic Time Solution

• Build Directed Acyclic Graph (DAG):

Nodes: local alignments [(xa, xb) (ya, yb)] & score

Directed edges: local alignments that can be chained

• edge ( (xa, xb, ya, yb), (xc, xd, yc, yd) )

• xa < xb < xc < xd

• ya < yb < yc < yd

Each local alignment is a node vi with alignment score si

Quadratic Time Solution

Dynamic programming:

Initialization: Find each node va s.t. there is no edge (u, va)

Set V(a) to be sa

Iteration: For each vi, the optimal path ending in vi has total score:

V(i) = max over edges (vj, vi) of ( weight(vj, vi) + V(j) )

Termination: Optimal global chain:

j = argmax ( V(j) ); trace chain from vj

Worst case time: quadratic
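A minimal sketch of this quadratic dynamic program, under two assumptions that go beyond the slides: the weight of an edge (vj, vi) is taken to be the score si of the alignment it enters, and each alignment is given as a tuple (xa, xb, ya, yb, score) with xa < xb and ya < yb. The function name and the sample alignments are made up for illustration.

```python
def chain_quadratic(alignments):
    """Return (best total score, indices of the chained alignments in order)."""
    if not alignments:
        return 0, []
    # Sorting by x start guarantees that every possible predecessor comes earlier.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V = {}       # V[i]: best score of a chain ending in alignment i
    back = {}    # back[i]: predecessor of i on that chain, or None
    for pos, i in enumerate(order):
        x_start, _, y_start, _, score = alignments[i]
        V[i], back[i] = score, None
        for j in order[:pos]:
            _, x_end, _, y_end, _ = alignments[j]
            # j can precede i only if it ends before i starts, in both x and y.
            if x_end < x_start and y_end < y_start and V[j] + score > V[i]:
                V[i], back[i] = V[j] + score, j
    best = max(V, key=V.get)
    chain = []
    i = best
    while i is not None:
        chain.append(i)
        i = back[i]
    return V[best], chain[::-1]


example = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3), (6, 8, 6, 9, 4)]
print(chain_quadratic(example))
```

The inner loop looks at every earlier alignment, which is where the quadratic cost comes from.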


Back to the LCS problem:

• Given two sequences x = x1, …, xm

y = y1, …, yn

• Find the longest common subsequence

Quadratic solution with DP

• How about when “hits” xi = yj are sparse?


[Figure: grid of hits, with x = 15 3 24 16 20 4 24 3 11 18 across the top and y = 4 20 24 3 11 15 11 4 18 20 down the side]

• Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead

Sparse Dynamic Programming – L.I.S.

• Longest Increasing Subsequence

• Given a sequence over an ordered alphabet

x = x1, …, xm

• Find the longest subsequence

s = s1, …, sk

with s1 < s2 < … < sk

Sparse LCS expressed as LIS

Create a sequence w

• Every matching point x-to-y, (i, j), is inserted into the sequence as follows:

• For each position j of x, from smallest to largest, insert in w the points (i, j), in decreasing order of the y-position i

• The 11 example points are inserted in the order given

• Any two points (ya, xa), (yb, xb) can be chained iff

a is before b in w, and ya < yb
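The insertion rule can be sketched as below; the function name `hits_as_sequence` is an assumption, and positions are 1-based to match the (i, j) points of the example. Looking up each symbol of x in a table of y positions means the work is proportional to the number of hits rather than to n × m.

```python
from collections import defaultdict


def hits_as_sequence(x, y):
    """List the matching points (i, j), with y[i] == x[j], in insertion order."""
    where_in_y = defaultdict(list)
    for i, symbol in enumerate(y, start=1):
        where_in_y[symbol].append(i)
    w = []
    for j, symbol in enumerate(x, start=1):              # positions of x, smallest first
        for i in reversed(where_in_y.get(symbol, [])):   # positions of y, largest first
            w.append((i, j))
    return w


x = [15, 3, 24, 16, 20, 4, 24, 3, 11, 18]
y = [4, 20, 24, 3, 11, 15, 11, 4, 18, 20]
print(hits_as_sequence(x, y))
```

Run on the x and y read off the example grid, the sketch returns the 11 points of the example preceded by one more, (6, 1), because both sequences contain 15 at those positions; that hit does not appear among the 11 example points. Inserting points with the same j in decreasing i is what prevents two hits on the same position of x from both entering an increasing subsequence.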

[Figure: the grid of x against y, with the 11 matching points numbered 1 to 11 in their order of insertion into w]

Sparse LCS expressed as LIS

Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered by their first coordinate, where

• (ya, xa) < (yb, xb) if ya < yb

Claim: An increasing subsequence of w is a common subsequence of x and y

Sparse Dynamic Programming for LIS

• Algorithm:

initialize empty array L

/* at each point, lj will contain the last element of the j-long increasing subsequence that ends with the smallest wi */

for i = 1 to |w|

binary search for w[i] in L, to find lj < w[i] ≤ lj+1

replace lj+1 with w[i] (append w[i] if it is larger than every element of L)

keep a backptr from w[i] to lj
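A runnable sketch of the algorithm above, using Python's standard `bisect` module for the binary search. The function name, the `key` parameter and the two helper arrays are assumptions of this sketch; the `key` hook is there so that pairs such as (y, x) can be compared on one coordinate only.

```python
from bisect import bisect_left


def longest_increasing_subsequence(w, key=lambda v: v):
    """Return one longest strictly increasing subsequence of w, compared by key."""
    tail_keys = []             # tail_keys[j]: smallest key ending a run of length j + 1
    tail_idx = []              # index in w of the element holding that key
    back = [None] * len(w)     # back pointer to the previous element of the run
    for i, item in enumerate(w):
        k = key(item)
        j = bisect_left(tail_keys, k)        # tail_keys[j-1] < k <= tail_keys[j]
        if j == len(tail_keys):
            tail_keys.append(k)
            tail_idx.append(i)
        else:
            tail_keys[j] = k
            tail_idx[j] = i
        back[i] = tail_idx[j - 1] if j > 0 else None
    # Trace the back pointers from the end of the longest run.
    out = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(w[i])
        i = back[i]
    return out[::-1]


print(longest_increasing_subsequence([15, 3, 24, 16, 20, 4, 24, 3, 11, 18]))
```

Each element costs one binary search, so the whole pass takes O(|w| log |w|) time. The back pointers matter: the array L itself is not guaranteed to spell out a valid subsequence, only its length is guaranteed.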

Sparse Dynamic Programming for LIS

Example: w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L =

1. (4,2)
2. (3,3)
3. (3,3) (10,5)
4. (2,5) (10,5)
5. (2,5) (8,6)
6. (1,6) (8,6)
7. (1,6) (3,7)
8. (1,6) (3,7) (4,8)
9. (1,6) (3,7) (4,8) (7,9)
10. (1,6) (3,7) (4,8) (5,9)
11. (1,6) (3,7) (4,8) (5,9) (9,10)

Longest common subsequence:

s = 4, 24, 3, 11, 18

Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

L is sorted by lj

L is implemented as a balanced binary tree

[Figure: a rectangle with its two y-coordinates, h and l, marked on the y-axis]

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of i:

a. j: rectangle in L, with largest lj < hi

b. V(i) = w(i) + V(j)

2. When on the rightmost end of i:

a. j: rectangle in L, with largest lj ≤ li

b. If V(i) > V(j):

i. INSERT (li, V(i), i) in L

ii. REMOVE all (lk, V(k), k) with V(k) ≤ V(i) & lk ≥ li
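The sweep can be sketched as follows, with several assumptions stated up front. Each rectangle is a tuple (x_left, x_right, h, l, weight) with h < l, so h is the y-coordinate where the rectangle starts and l the one where it ends. A plain Python list kept sorted with `bisect` stands in for the balanced binary tree, so insertions and removals here can cost O(N) each rather than O(log N). When nothing in L qualifies at a left end, V(j) is taken as 0. The function name and the sample rectangles are made up.

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """Return (best total weight, indices of the rectangles on the best chain)."""
    if not rects:
        return 0, []
    events = []
    for i, (x_left, x_right, _, _, _) in enumerate(rects):
        events.append((x_left, 0, i))    # 0 = left end; sorts before a right end at equal x
        events.append((x_right, 1, i))   # 1 = right end
    events.sort()
    ls, Vs, ids = [], [], []             # the list L as parallel arrays, sorted by l
    V = [0] * len(rects)
    back = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(ls, h) - 1           # entry with the largest l_j < h_i
            V[i] = weight + (Vs[p] if p >= 0 else 0)
            back[i] = ids[p] if p >= 0 else None
        else:
            p = bisect_right(ls, l) - 1          # entry with the largest l_j <= l_i
            if p < 0 or V[i] > Vs[p]:
                q = bisect_left(ls, l)
                end = q
                # Entries with l_k >= l_i and V(k) <= V(i) are consecutive from q on.
                while end < len(ls) and Vs[end] <= V[i]:
                    end += 1
                # Replace them with the new entry: INSERT and REMOVE in one step.
                ls[q:end], Vs[q:end], ids[q:end] = [l], [V[i]], [i]
    best = max(range(len(rects)), key=V.__getitem__)
    chain = []
    i = best
    while i is not None:
        chain.append(i)
        i = back[i]
    return V[best], chain[::-1]


example = [(0, 2, 0, 2, 5), (3, 5, 3, 5, 6), (1, 4, 6, 8, 3), (6, 8, 6, 9, 4)]
print(chain_rectangles(example))
```

The removals keep V strictly increasing along L, which is why a single lookup at a left end is enough to find the best predecessor.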

Example

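[Figure: five rectangles in the x-y plane, labeled rectangle: weight as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2, with coordinate ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]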

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N

• Inserting to L takes log N

• All deletions are consecutive, so log N per deletion

• Each element is deleted at most once: N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree