Date post: 02-Jan-2016
Upload: ada-glenn
www.strandls.com Read Alignment Algorithms
Transcript
Page 1: Www.strandls.com Read Alignment Algorithms.  The Problem 2 Given a very long reference sequence of length n and given several short strings.

www.strandls.com

Read Alignment Algorithms


The Problem

• Given a very long reference sequence of length n and given several short strings (reads) of length m each, m << n

• Find the best matching location for each read in the reference

• Where the best location is that which minimizes the number of mismatches

• We ignore insertions and deletions for the moment; those will come later

• Provided the number of mismatches is at most, say, 5% of m
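
The problem statement above can be sketched as a brute-force baseline that tries every location. This is only an illustration: the names `hamming` and `best_location` and the `max_percent` parameter are assumptions made here, not part of the slides, and the loop takes time proportional to n times m per read, which is exactly what the index-based methods on the following slides avoid.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def best_location(reference, read, max_percent=5):
    """Return (position, mismatches) for the best placement of read, or None.

    Only substitutions are counted; insertions and deletions are ignored.
    """
    m = len(read)
    budget = m * max_percent // 100      # at most 5% of m mismatches by default
    best = None
    for pos in range(len(reference) - m + 1):
        d = hamming(reference[pos:pos + m], read)
        if d <= budget and (best is None or d < best[1]):
            best = (pos, d)
    return best


reference = "TTTT" + "ACGTACGTAGCTAGCTAGGA" + "CCCC"
read = "ACGTATGTAGCTAGCTAGGA"            # one substitution relative to the reference
print(best_location(reference, read))
```

Ties are resolved in favour of the leftmost location; that choice is arbitrary.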


Indexing the Reference

• What if we do not allow any mismatches at all?

• Pre-process the reference sequence so that each query (finding the best matching location of a read) can be answered in time proportional to m and independent of n

• The resulting data structure is called an index

• Suffix trees are one possible index

• A suffix tree is a compacted trie of all suffixes of the reference sequence, with a $ marker at the end
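
As a rough illustration of the indexing idea, here is a naive, uncompressed suffix trie built from nested dictionaries. It is a sketch under stated assumptions: a real suffix tree compacts unary paths and is built in linear time, whereas this version uses quadratic space and is only practical for toy inputs. The names `build_suffix_trie` and `occurrences` and the `"pos"` leaf key are hypothetical.

```python
def build_suffix_trie(reference):
    """Naive, uncompressed trie of all suffixes of reference + '$'."""
    text = reference + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start              # leaf: where this suffix begins
    return root


def occurrences(trie, read):
    """All reference positions where read occurs exactly."""
    node = trie
    for ch in read:                      # m steps, independent of n
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]            # collect every leaf below the match point
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "pos":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("CGACG")
print(occurrences(trie, "CG"))
```

Walking down the trie takes m steps regardless of n; listing the occurrences afterwards costs extra time proportional to the size of the subtree below the match point.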


Suffix Trees

[Figure: suffix tree over the reference (labelled "C G A C G" on the slide), with the query "C G C" followed down from the root]


Space Required by Suffix Trees

• At most n-1 internal nodes plus n leaves, so at most 2n-1 nodes

• 2n-2 tree pointers + n pointers into the reference

• So ~3n pointers

• 36GB! (e.g. n ≈ 3 billion bases with 4-byte pointers)

• Can we make this smaller?
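
The space estimate can be reproduced with a few lines of arithmetic. The reference length of 3 billion bases and the 4-byte pointer size are assumptions chosen here because they match the 36GB figure; the slides do not state them. `index_bytes` is a hypothetical helper.

```python
def index_bytes(n, pointers_per_base, pointer_bytes=4):
    """Total bytes for an index that stores a fixed number of pointers per base."""
    return n * pointers_per_base * pointer_bytes


n = 3_000_000_000                        # assumed: roughly human-genome sized
print(index_bytes(n, 3) / 1e9, "GB")     # ~3n pointers for a plain suffix tree
```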


Indexing the Reference with Mismatches

• What if we allow mismatches?

• So we put the query through the suffix tree but get stuck: we can’t proceed further

• Next, resume by dropping the first character, but without redoing the work already done

• How?


Suffix Links in Suffix Trees

[Figure: suffix tree over the reference (labelled "C G A C G" on the slide) with suffix links, and the query "G C G"]


Indexing with Mismatches (Contd)

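• Take an internal node A, with string x leading down from the root to that node and branching into xa and xb

• Let x=cy, where c is the first character of x

• Then there exists a node B with string y leading down from the root to that node

• The suffix link from A leads to this node B

• Such a node exists because ya and yb, being suffixes of xa and xb, also occur in the reference, so the tree must branch after y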

• So if you get stuck, you follow the suffix link in constant time and continue from where you left off, to find the longest perfect-match substring starting at each position in the read

• Or alternatively, find all substrings of a certain minimum length that match

• Check explicitly for the number of mismatches at each of these locations
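
The last two bullets (find exact substring matches of a minimum length, then count mismatches explicitly at each implied location) can be sketched as follows. To keep the example short, a plain hash table of k-mers stands in for the suffix tree with suffix links; that substitution, and the names `build_kmer_index` and `seed_and_verify`, are assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_verify(reference, index, read, k, max_mismatches):
    """Use exact k-mer hits as seeds, then count mismatches over the whole read."""
    m = len(read)
    candidates = set()
    for offset in range(m - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset         # where the read would begin in the reference
            if 0 <= start <= len(reference) - m:
                candidates.add(start)
    results = []
    for start in sorted(candidates):
        window = reference[start:start + m]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            results.append((start, mismatches))
    return results


reference = "TTTT" + "ACGTACGTAGCTAGCTAGGA" + "CCCC"
read = "ACGTATGTAGCTAGCTAGGA"            # one substitution at read offset 5
index = build_kmer_index(reference, 8)
print(seed_and_verify(reference, index, read, 8, 1))
```

A read of length m with at most e mismatches always contains an exact stretch of length at least m/(e+1) rounded down, so choosing k no larger than that guarantees at least one seed at the true location.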


Space Required by Suffix Trees & Links

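• n-1 internal nodes plus n leaves, so 2n-1 nodes, plus up to n-1 suffix links (one per internal node)

• 2n-2 tree pointers + n-1 suffix links = 3n-3 pointers, + n pointers into the reference

• So ~4n pointers

• 48GB! (e.g. n ≈ 3 billion bases with 4-byte pointers)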

• Can we make this smaller? Can we fit this tree into an array?


A Succinct Data Structure

The Reference: C G A C $

All circular shifts, sorted lexicographically (the Burrows-Wheeler Transform table):

A C $ C G
C G A C $
C $ C G A
G A C $ C
$ C G A C

• Store only the first and last columns and the links back to the reference

• Used in bzip
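
The table on this slide can be reproduced with a naive sort of all rotations. This is a sketch for small inputs only (it materialises every rotation). The `ORDER` mapping places `$` after the letters because that is the order the slide's table uses; many implementations sort `$` first instead. `sorted_rotations` is a hypothetical name.

```python
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}   # '$' sorts last, as in the slide's table


def sorted_rotations(reference):
    """Return (start offsets, rotations) of reference + '$' in sorted order."""
    text = reference + "$"
    n = len(text)
    starts = sorted(range(n), key=lambda i: [ORDER[c] for c in text[i:] + text[:i]])
    rows = [text[i:] + text[:i] for i in starts]
    return starts, rows


starts, rows = sorted_rotations("CGAC")
first = "".join(row[0] for row in rows)    # first column of the table
last = "".join(row[-1] for row in rows)    # last column of the table
for offset, row in zip(starts, rows):
    print(" ".join(row), offset)
```

The `starts` list holds the links back to the reference: entry r is the offset in the reference at which row r begins.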


A Succinct Data Structure

The Reference: C G A C $

Sorted circular shifts, each with the index in the reference where it starts:

A C $ C G    2
C G A C $    0
C $ C G A    3
G A C $ C    1
$ C G A C    4

• The reference can be reconstructed from the first and last columns

• Claim: The ith G in the first column corresponds to the ith G in the last column! Likewise for A, C and T.
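
The claim is what makes reconstruction possible. A minimal sketch, assuming the same character order as the slide's table (`$` sorts last): a stable sort of the last column yields the first column and, at the same time, pairs the ith occurrence of each character in one column with the ith occurrence in the other. `invert_bwt` and `lf` are names chosen here.

```python
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}   # '$' sorts last, as in the slide's table


def invert_bwt(last):
    """Rebuild reference + '$' from the last column of the sorted-rotation table."""
    n = len(last)
    # Stable sort: first_rows[k] is the last-column row whose character sits at
    # row k of the first column, so equal characters keep their relative order.
    first_rows = sorted(range(n), key=lambda r: ORDER[last[r]])
    lf = [0] * n
    for first_row, last_row in enumerate(first_rows):
        lf[last_row] = first_row
    row = n - 1                          # the row beginning with '$' is the last row
    out = []
    for _ in range(n - 1):
        out.append(last[row])            # character that precedes this row's first character
        row = lf[row]                    # jump to the row that begins with that character
    return "".join(reversed(out)) + "$"


print(invert_bwt("G$ACC"))
```

The walk recovers the reference from right to left, one character per step.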


Proof of Claim

yG<xG if and only if Gy<Gx; that’s it!

So given a G in the first column, say corresponding to the string Gx

– Its rank r is trivial to find because the first column is sorted; just store counts for all 4 characters

– We need to locate the corresponding G in the last column

– In other words, the index of the string xG in the table

– Which is the rth G in the last column [The Select Query]

So given a G in the last column, say corresponding to the string xG

– Find its rank r among G’s in the last column [The Rank Query]

– We need to locate the corresponding G in the first column

– In other words, the index of the string Gx in the table

– Which is the rth G in the first column, trivial to find


Select and Rank Queries

Given a binary array

– SELECT: Given index i, find the ith 1

– RANK: Given index i, find how many 1s precede this location

Use a separate array for each of the 4 characters

RANK is easy: just keep counts at Δ milestones and answer queries by traversing to the nearest milestone in time Δ

– 4n/Δ bytes of storage, O(Δ) time

SELECT needs a bit more: keep milestones at every Δth 1 (Δ-rank milestones)

– Go to the nearest rank milestone and traverse from there

– May need to traverse quite a bit though

– So need an extra data structure to get to the next 1, which you store at Δ milestones

– So 8n/Δ bytes of storage, O(Δ) time

Of course we need the 4 n-bit binary arrays as well

So 4n bits + 48n/Δ bytes and O(Δ) time
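
A minimal sketch of milestone-based RANK and SELECT over one binary array, using a Python list of 0/1 values for readability. The class name `RankSelect` and its fields are assumptions. The SELECT here keeps only the Δ-rank milestones and scans forward from them, so it omits the extra "jump to the next 1" structure mentioned above and can be slow across long runs of zeros.

```python
class RankSelect:
    def __init__(self, bits, delta=4):
        self.bits = bits
        self.delta = delta
        self.rank_marks = []      # number of 1s before positions 0, delta, 2*delta, ...
        self.select_marks = []    # position of the 1st, (delta+1)-th, (2*delta+1)-th, ... 1
        ones = 0
        for i, b in enumerate(bits):
            if i % delta == 0:
                self.rank_marks.append(ones)
            if b:
                if ones % delta == 0:
                    self.select_marks.append(i)
                ones += 1
        self.ones = ones

    def rank(self, i):
        """Number of 1s in bits[0:i]."""
        if i <= 0 or not self.rank_marks:
            return 0
        block = min(i // self.delta, len(self.rank_marks) - 1)
        count = self.rank_marks[block]
        for j in range(block * self.delta, i):    # at most delta steps
            count += self.bits[j]
        return count

    def select(self, k):
        """Position of the k-th 1 (k counts from 1)."""
        if not 1 <= k <= self.ones:
            raise ValueError("k out of range")
        mark = (k - 1) // self.delta
        pos = self.select_marks[mark]             # this is the (mark*delta + 1)-th 1
        seen = mark * self.delta + 1
        while seen < k:                           # scan forward to the k-th 1
            pos += 1
            if self.bits[pos]:
                seen += 1
        return pos


bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rs = RankSelect(bits, delta=4)
print(rs.rank(7), rs.select(5))
```

For the four-letter alphabet one such structure would be kept per character, with bit i set when the last column holds that character at row i.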


String Matching using Rank-Selects

Given a string Gx

Assume inductively we have the band B of indices in the table corresponding to suffixes that begin with x

We want the band B’ that begins with Gx

Take the band B, take the last column, identify the rank of the first and last G in the last column, find their corresponding first column indices; that’s the band

– All doable using RANK alone

At the end you have the band containing all suffixes which begin with Gx

Unless of course, there are none, in which case the band will vanish at some point

We can use this to find matches for, say, all length-16 substrings of a read

So 4n bits + 48n/Δ bytes and O(mΔ) time per read
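
The band-narrowing step can be sketched as a backward search over the last column. Assumptions in this sketch: the alphabet is A, C, G, T with `$` sorting last (as in the earlier table), the table is built by naive rotation sorting, and RANK is computed by a plain `count` rather than with milestones. The names `build_last_column`, `backward_search` and `smaller` are hypothetical.

```python
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}   # '$' sorts last, as in the slide's table


def build_last_column(reference):
    """Return (last column, start offsets) of the sorted-rotation table."""
    text = reference + "$"
    n = len(text)
    starts = sorted(range(n), key=lambda i: [ORDER[c] for c in text[i:] + text[:i]])
    last = "".join(text[i - 1] for i in starts)    # character just before each row's start
    return last, starts


def backward_search(last, pattern):
    """Band [lo, hi) of table rows that begin with pattern (empty if lo == hi)."""
    smaller, total = {}, 0            # smaller[c]: how many characters sort before c
    for c in sorted(ORDER, key=ORDER.get):
        smaller[c] = total
        total += last.count(c)

    def rank(c, i):                   # occurrences of c in last[0:i]; naive stand-in for RANK
        return last[:i].count(c)

    lo, hi = 0, len(last)
    for c in reversed(pattern):       # extend the matched string to the left, one character at a time
        lo = smaller[c] + rank(c, lo)
        hi = smaller[c] + rank(c, hi)
        if lo >= hi:
            return (lo, lo)           # the band has vanished: no match
    return (lo, hi)


reference = "ACGTACGTTAGCACG"
last, starts = build_last_column(reference)
lo, hi = backward_search(last, "ACG")
print(sorted(starts[lo:hi]))
```

Each pattern character costs two RANK queries, which is where the O(mΔ) time per read comes from once RANK is answered from milestones.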


Identifying Indices in the Reference

We still have to go from a band in the table to indices in the reference

4n bytes if we store explicitly (one 4-byte index per row)

We can use the same trick, store explicitly at Δ milestones

Then, if the row for reference index i holds the string Gx, we can step to the row for index i+1, which holds xG, and so on till we get to a milestone

4n/Δ bytes storage

Time per index is O(Δ)
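
A sketch of the milestone trick for recovering reference indices. One deliberate difference from the slide: this version walks backward through the reference (from the row for index i to the row for index i-1) because that step needs only RANK, whereas the slide walks forward; either direction reaches a milestone within Δ steps. The names `build_index` and `locate` are assumptions, and RANK is a naive `count`.

```python
ORDER = {"A": 0, "C": 1, "G": 2, "T": 3, "$": 4}   # '$' sorts last, as in the slide's table


def build_index(reference, delta):
    """Return the last column, character counts, and sampled row -> index map."""
    text = reference + "$"
    n = len(text)
    starts = sorted(range(n), key=lambda i: [ORDER[c] for c in text[i:] + text[:i]])
    last = "".join(text[i - 1] for i in starts)
    # Store the reference index only for rows whose index is a multiple of delta.
    milestones = {row: pos for row, pos in enumerate(starts) if pos % delta == 0}
    smaller, total = {}, 0
    for c in sorted(ORDER, key=ORDER.get):
        smaller[c] = total
        total += last.count(c)
    return last, smaller, milestones


def locate(row, last, smaller, milestones):
    """Reference index at which the given table row starts."""
    steps = 0
    while row not in milestones:
        c = last[row]
        row = smaller[c] + last[:row].count(c)   # row for the previous reference index
        steps += 1
    return milestones[row] + steps


reference = "ACGTACGTTAGCACG"
last, smaller, milestones = build_index(reference, delta=4)
print([locate(row, last, smaller, milestones) for row in range(len(last))])
```

Index 0 is always a milestone, so the walk stops before it could wrap around the end of the reference.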


Sorting Circular Shifts

It remains to describe the construction of the table in the first place

Given a string S=x0 x1 x2 ….$

– Consider string S’=(x0 x1 x2) (x3 x4 x5) … followed by (x1 x2 x3) (x4 x5 x6) …

– Note (x2 x3 x4) and other triplets starting at 2 mod 3 are missing

– Rename S’ so identical tuples get the same number and distinct tuples get different numbers

– Recursively sort the suffixes of S’

• How does x0 x1 x2 … compare to x1 x2 x3 … ?

– Already available from recursion

• How does x0 x1 x2 … compare to x2 x3 x4 … ?

– Compare x0 , x2 and then x1 x2 … , x3 x4 …

– We have info for comparing all pairs of suffixes!

– Sort the 2 mod 3 suffixes and then merge them in

– Time T(n)= T(2n/3)+O(n), which is O(n)
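
The comparison rules and the final merge can be sketched in isolation. Important assumption: the recursive sort of the 0 mod 3 and 1 mod 3 suffixes is replaced here by Python's built-in sort on string slices, purely as a stand-in, so this sketch shows only how the 2 mod 3 suffixes are sorted and merged in, not the linear-time recursion. The function name `merge_step_suffix_array` is hypothetical.

```python
def merge_step_suffix_array(text):
    """Suffix array of text, built with the 'sample + merge' comparison rules."""
    n = len(text)
    code = [ord(c) for c in text] + [0, 0, 0]     # 0 pads positions past the end
    sample = [i for i in range(n) if i % 3 != 2]
    # Stand-in for the recursive step: sort the sample suffixes directly.
    sample_sorted = sorted(sample, key=lambda i: text[i:])
    rank = [-1] * (n + 3)                         # -1 for positions past the end
    for r, i in enumerate(sample_sorted):
        rank[i] = r
    # A 2 mod 3 suffix is its first character followed by a 0 mod 3 suffix.
    rest = sorted((i for i in range(n) if i % 3 == 2),
                  key=lambda i: (code[i], rank[i + 1]))

    def sample_is_smaller(a, b):
        """Compare sample suffix a with 2 mod 3 suffix b using known ranks only."""
        if a % 3 == 0:     # one character, then a 1 mod 3 suffix vs a 0 mod 3 suffix
            return (code[a], rank[a + 1]) < (code[b], rank[b + 1])
        # a is 1 mod 3: two characters, then a 0 mod 3 suffix vs a 1 mod 3 suffix
        return (code[a], code[a + 1], rank[a + 2]) < (code[b], code[b + 1], rank[b + 2])

    out, i, j = [], 0, 0
    while i < len(sample_sorted) and j < len(rest):
        if sample_is_smaller(sample_sorted[i], rest[j]):
            out.append(sample_sorted[i])
            i += 1
        else:
            out.append(rest[j])
            j += 1
    out.extend(sample_sorted[i:])
    out.extend(rest[j:])
    return out


print(merge_step_suffix_array("CGACGTTACG"))
```

Characters are compared by code point here, so a `$` terminator would sort before the letters; that differs from the table on the earlier slide and is only a convenience of this sketch.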


A Generalization: Difference Covers

[Figure: axis marked at v, 2v, 3v]

Set D of indices mod v

This string has size |D|n/v

Time taken to create this string is O(n |D|)

Sorting suffixes of this string gives the sorted order of all suffixes which begin at indices j such that j mod v is in D


A Generalization: Difference Covers

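D is a Difference Cover if distances between beads in D generate 0,1,…,v-1

For any 2 indices i and j, (i-j) mod v is then the distance between some two beads in D

So there is a shift x<v such that i+x and j+x both land on beads, i.e. (i+x) mod v and (j+x) mod v are both in D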
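
The definition can be checked directly for small v. In this sketch the function names are assumptions, and the example sets are standard small cases: {0, 1} mod 3 corresponds to the triplet construction on the earlier slide, and {1, 2, 4} mod 7 is a common textbook example.

```python
def is_difference_cover(D, v):
    """True if every residue 0..v-1 is a difference of two elements of D mod v."""
    diffs = {(a - b) % v for a in D for b in D}
    return diffs == set(range(v))


def common_shift(i, j, D, v):
    """Smallest x < v with (i + x) % v and (j + x) % v both in D, or None."""
    for x in range(v):
        if (i + x) % v in D and (j + x) % v in D:
            return x
    return None


print(is_difference_cover({0, 1}, 3))
print(is_difference_cover({1, 2, 4}, 7))
print(common_shift(10, 3, {1, 2, 4}, 7))
```

The shift is what lets two arbitrary suffixes be compared: after at most x < v character comparisons, both suffixes start at sampled positions whose relative order is already known.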


A Generalization: Difference Covers

There exists a Difference Cover of size 1.5*sqrt(v)!

[Figure: bead diagram with two spans, each labelled sqrt(v)]


Thank you
