+ All Categories
Home > Documents > Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan...

Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan...

Date post: 22-Jan-2016
Category:
Upload: kaylah-locks
View: 219 times
Download: 0 times
Share this document with a friend
41
Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis
Transcript
Page 1: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Suffix Trees Come of Age in Bioinformatics

Algorithms, Applications and Implementations

Dan Gusfield, U.C. Davis

Page 2: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

AABAAB

A

AB

A

A

B$

1

$4

B

A

A

$2

BA A

B$

3

B

$5

$

6

1 2 3 4 5 6

Suffix Tree

Every suffix of S is encoded by a root to leaf walk.Every substring encoded by a walk from the root.

S =

Page 3: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

AABAAB

A

AB

A

A

B$

1

$4

B

A

A

$2

BA A

B$

3

B

$5

$

6

4 1 5 2 3 6

1 2 3 4 5 6

SUFFIX ARRAY

The suffixes listed in lexicographic order.Found by lexicographic DFS

Page 4: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Basic Facts about Suffix Trees

• Suffix tree for a string of length n can be built in O(n) time.

• Basic implementations need about 25 bytes per character.

• Numerous applications in string algorithms achieving significant (sometimes amazing) speedups over naïve algorithms – O(m) time substring search for a string of length m in a string of length n, regardless of n.

Page 5: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Suffix Tree/Array Construction

• Weiner 1973

• McCreight ~ 1974

• Ukkonen ~ 1996

• Farach-Colton ~ 1999

• Manber-Myers ~ 1991 (Suffix-Arrays)

O(n log n time) with small space. O(n) time possible as on prior slide.

Page 6: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Bioinformatics Applications pre 1997

• A small number of applications

• Limited by space requirements of the tree, small computer memories, complication of the algorithms, poor locality of reference

Page 7: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Post 1997

• Much has changed, and Suffix trees and their relatives (particularly suffix arrays) are more widely applied in bioinformatics

• 20- 40 new publications with substantial connection to bioinformatics, post 1997

Page 8: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

New Results since 1997

• Fundamental Algorithms: Farach-Colton construction; simpler algorithm for least-common-ancestor.

• Implementation improvements (space) for suffix trees and arrays

• New variants of suffix trees – affix trees, virtual suffix trees

• New algorithms for tandem repeats

Page 9: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

New Results since 1997

• Approximate repeats and large-scale repeat structure

• Fast lookups in databases

• Substring frequencies

• Motif and pattern discovery

• Hybrid dynamic programming

• Oligo and probe construction

Page 10: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

New Results since 1997

• Genome fragment assembly and resequencing

• (multiple) whole genome comparison

• Large-scale sequence comparison in resequencing projects

• Misc. clever applications

• Related less-used data structures

Page 11: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Suffix trees collect together repeated substring prefixes

Example: Speeding up the use of Position Specific ScoringMatrices. Accelerating protein classification ISMB 2000

position 1 2 3 4 5 A 3 1 4 2 1 T 2 4 4 3 2 C 1 3 5 6 2 G 4 2 4 6 2

…AACTG…AACTG….AACTG…

AACTG

Walk around thetree to depth 5

PSSM

Starting locations of AACTG

Page 12: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Whole Genome Alignment with MUM’s:

• Delcher …Salzberg NAR ~ 1999• A MUM is a Maximal Unique Matching

substring in two strings S and S’• MUMMER finds all MUM’s; selects and

aligns non-overlapping pairs of MUM’s; and then recurses in the regions between adjacent selected MUMs.

• Key issue: finding MUMs efficiently

Page 13: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Spotting MUMs with a suffix tree

Example: S = …AATCCGTG…. S’ = ………GATCCGTA… 20 So ATCCGT is a MUM if it only occurs in these two places

root

Path spelling ATCCGT

7

S7S’20

Exactly two sibling leaves

Page 14: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Finding MUM’s in a suffix tree

• Build a suffix tree for the concatenation of S and S’, noting at each leaf whether the suffix is from the S part or the S’ part. Also note the prior character in S or S’ for that position.

• Do a DFS traversal of the S.T. to note at each node how many leaves below it are from S and how many from S’.

• A node corresponds to a MUM if and only if the count there is 1 and 1, and the two prior characters are different.

• For total string length n, the time used is O(n).

Page 15: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Reducing the space needed for MUM finding

MUMMER II - NAR spring 2002root

Path to a nodev, spellingxB, where Bis a string

v v’

Suffix link

Path to a node v’, spelling B

Page 16: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Space reduction when finding MUMs

• Given strings S and S’, build a suffix tree T, with suffix links , but for only the smaller of S and S’, say S.

• Using T, find for each position i in S’, the longest substring starting at i which occurs in S exactly once. Mark the end location in T of that match. Each such match is a candidate MUM.

Page 17: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Finding MUMs in less space

• To find the matches, walk S’ through T, using suffix links to reduce search time.

• After S’ has been completely processed, every marked position in T that is visited only once and has no marked position below it, corresponds to a MUM.

• So all MUMs can be found in O(n) time, but in much reduced space.

Page 18: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Finding Tandem Repeats

…TATAACTAACTAAGATT..

For a string of length n, a naïve algorithm might use (on the order of) n^3 operations to find all T.R.s

Stoye and Gusfield, Theoretical Computer Science, 2001

Page 19: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Example: Finding Tandem Repeats in Strings

TATAACTAACTAAGATT….

Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat.

To extend a chain, examine the first character of the T.R.and the character after the end of the T.R. If they are the same, a new T.R. exists on place to the right.

Page 20: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Example: Finding Tandem Repeats in Strings

TATAACTAACTAAGATT….

Branching Tandem Repeats

Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat.

Page 21: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Example: Finding Tandem Repeats in Strings

TATAACTAACTAAGATT….

Branching Tandem Repeats

Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat.

Page 22: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Example: Finding Tandem Repeats in Strings

TATAACTAACTAAGATT….

Branching Tandem Repeat

Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat.

So to find all T.R.s – find all Branching T.R’s

Use a suffix tree to efficiently find Branching T.R.’s

Page 23: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Finding Branching T.R.’s

• Every Branching T.R. ends at a node of the suffix tree – so examine each node.

String to here is length 8

5

2113 = 5 + 8

root

leaves 5 , 13 and 21 are below vv

Page 24: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Basic Algorithm

• Build the S.T. for string S; process it to find depth of each node, and do a DFS numbering of the nodes so that ancestry of two nodes can be determined in constant time. (Classic property of DFS)

• Check each internal node to see if it is at the first or second part of a tandem repeat.

Page 25: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

How to check

• From a node v of depth d(v), walk the subtree to collect the leaf numbers below v.

• If leaf k is collected, check if v is an ancestor of leaf k+d(v), and the next characters k+1 and k+d(v) + 1 differ. If so, then the string to v is the first part of a Branching T.R.

• Alternatively, focus on k-d(v) to see if the string to v is the second part of a Branching T.R.

Page 26: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

AABAAB

A

AB

A

A

B$

1

$4

B

A

A

$2

BA A

B$

3

B

$5

$

6

1 2 3 4 5 6

Suffix Tree

S =

Page 27: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Finding Tandem Repeats

• This approach gives O(n^2) time to find all branching tandem repeats.The time can be reduced to O(n log n) with a classic trick.

• Related more complex result: In O(n) time, one can mark in the suffix tree, the endpoints of all tandem repeat “species”. Gusfield and Stoye 2000 UCD techreport

Uses the fact (Frankel et al 1999) that there are only 2n tandem repeat species – complex proof.

Example: aaaaaaaaaa has only five tandem repeat species, but 25 tandem repeat occurrences

Page 28: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Space issues again

• Use “virtual suffix tree” or “extended suffix array” to reduce space

• Def: for positions i and j in string S, LCP(i,j) is the length of the longest matching substring starting at positions i and j in S.

Page 29: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

AABAAB

A

AB

A

A

B$

1

$4

B

A

A

$2

BA A

B$

3

B

$5

$

6

4 1 5 2 3 63 1 2 0 1

1 2 3 4 5 6

SUFFIX ARRAYWith LCP or stringdepth information

Node depths attime of descentafter a backup LCP of neighbors

Page 30: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Extended Suffix Array

• String S, the suffix array A for S, and the LCP array for A, completely determine the suffix tree T for S. This is the “extended suffix array” for S.

• The suffix array and the LCP array can be found in O(n log n) time and small space, without an explicit suffix tree.

• For many efficient algorithms using an explicit suffix tree, it is possible to simulate the algorithm using only the extended suffix array, greatly reducing space usage without increasing time.

• S. Kurtz WABI 2002

Page 31: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

String Barcoding – Oligo Construction

Uncovering Optimal Virus Signatures

Sam Rash and Dan Gusfield

RECOMB April 2002

Page 32: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Motivation

• Need for rapid virus detection– Given

• unknown virus • database known viruses

– Problem• identify unknown virus quickly

– Ideal solution• have sequence of

– viruses in database– unknown virus

• Solution– use BLAST (or any sequence similarity program/algorithm)

Page 33: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Motivation

– Real World• only have sequence for pathogens in database

– not possible to quickly sequence an unknown virus

• can test for presence small (<= 50 bp) strings in unknown virus

– substring tests

– Another Idea• String Barcoding

– use substring tests to uniquely identify each virus in the database

– acquire unique barcode for each virus in database

Page 34: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Problem Definition

• Formal Definition– given

• set of strings S

– goal• find set of strings S’, the testing set

– wlog, for each s1,s2 in S, there exists at least one u in S’ where u is a substring of only s1

– u is a signature substring

• minimize |S’|

– result• barcode for each element on S

Page 35: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Idea

• strings:1. cagtgc

2. cagttc

3. catgga

• Each node in the suffix tree has a corresponding set of string IDs below it

v1 - {1,2,3} v2 - {1,2,3} v3 - {3} v4 - {1} v5 - {3}

v6 - {1,2} v7 - {2} v8 - {1} v9 - {1,2,3} v10 - {1,2,3}

v11 - {1,2} v12 - {1} v13 - {2} v14 - {3} v15 - {1,2,3}

v16 - {2} v17 - {2} v18 - {1,3} v19 - {1} v20 - {3}

v21 - {1,2,3} v22 - {3} v23 - {2} v24 - {1,2} v25 - {1}

Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga

Page 36: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Idea

• strings:1. cagtgc

2. cagttc

3. catgga

• Each node in the suffix tree has a corresponding set of string IDs below it

v1 - {1,2,3} v2 - {1,2,3} v3 - {3} v4 - {1} v5 - {3}

v6 - {1,2} v7 - {2} v8 - {1} v9 - {1,2,3} v10 - {1,2,3}

v11 - {1,2} v12 - {1} v13 - {2} v14 - {3} v15 - {1,2,3}

v16 - {2} v17 - {2} v18 - {1,3} v19 - {1} v20 - {3}

v21 - {1,2,3} v22 - {3} v23 - {2} v24 - {1,2} v25 - {1}

Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga

Page 37: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Practical Implementation:

• If two substrings occur in exactly the same set of original strings, only one need be considered– Use strings from suffix tree for each uniquely labeled

node

• Build ILP as discussed• Solve ILP using CPLEX• Acquire barcode and signatures for each original

string– signature is the set of substring tests occurring in a

string

Page 38: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Implementation: ExampleminimizeV18 + V22 + V11 + V17 + V8 #objective functionstV18 + V22 + V11 + V17 + V8 >= 2 #this is the theoretical minimumV18 + V17 + V8 >= 1 #constraint to cover pair 1,2V22 + V11 + V8 >= 1 #constraint to cover pair 1,3V18 + V22 + V11 + V17 >= 1 #constraint to cover pair 2,3binaries #all variables are 0/1V18 V22 V11 V17 V8end

Figure 1.4 - barcodes

Figure 1.5 - signatures

tg (V18) atgga (V22)

cagtgc 1 0

cagttc 0 0

catgga 1 1

cagtgc {“tg”}

cagttc

catgga {“tg”, “atgga”}

Page 39: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Implementation: Extensions

• minimum and maximum lengths on signature substrings

• acquire barcodes/signatures for only a subset of input strings (wrt to whole set)

• minimum string edit distance between chosen signature substrings

• redundancy– require r signature substrings to differentiate each pair– adds a higher level of confidence that signatures remain

valid even with mutations

Page 40: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Results

• Works quickly on most moderately sized datasets (especially when redundancy >= 2)– dataset properties

• ~50k virus genomes taken from NCBI (Genbank)• 50-150 virus genomes• average length of each sequence ~1000 characters• total input size ranged from approximately 50,000 – 150,000

characters• increasing dataset size scaled approximately linearly

– reach 25% gap (at most 1/3 more than optimum) in just a few minutes

– reach small gap (often < 1%) in 4 hours

Page 41: Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Summary

• Numerous applications of Suffix Trees and relatives in Bioinformatics

• More yet to be found

• Suffix tree and array software at my UCD website.


Recommended