
SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size

• Eldar Giladi
• Michael G. Walker
• James Z. Wang
• Wayne Volkmuth

Bioinformatics Vol. 18 no. 6 – 2002, Pages 873-879

Norman Casagrande 2003 – IFT 6291

Outline

• Motivation
• Previous Related Research
• The SST Algorithm
• Computational Results
• Discussion

Motivation

• Searches for near-exact sequence matches are performed frequently in large-scale sequencing projects and in comparative genomics.

• The time and cost of performing these searches are prohibitive with current algorithms.

• Faster algorithms are desired.

Previous related research

• Needleman-Wunsch and Smith-Waterman
These algorithms perform global and local sequence alignment, respectively, using dynamic programming.

Time complexity: O(mn).

• m = length of query sequence
• n = sum of lengths of all sequences in the database

Previous related research

• FASTA
This algorithm identifies regions of local sequence similarity by first identifying candidate similar sequences based on shared k-tuples and then performing local alignment with the Smith-Waterman algorithm.

Time complexity: O(mn).

Previous related research

• BLAST

This algorithm identifies regions of local sequence similarity by first identifying candidate similar sequences that have k-tuples in common with the query sequence, and then extending the regions of similarity.

Time complexity: O(n).

The SST Algorithm

• Database partitioning with sliding windows
• Mapping windows into vector space
• Tree-structured index for database windows
• The search procedure

SST: Database Partitioning

• First step
- Database partitioned into overlapping windows.
- Fixed windows of length W, typically 25 ≤ W ≤ 1000.
- Overlap parameter ∆, typically 5 ≤ ∆ ≤ W/2.

SST: Database Partitioning

Example with W = 6 and ∆ = 2 (successive windows start ∆ positions apart):

Database sequence: A A C C G G T T A C G T A C G ...

Window 1: A A C C G G
Window 2: C C G G T T
Window 3: G G T T A C
...
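
The partitioning in the example above can be sketched in a few lines of Python. This is only an illustration: the function name `sliding_windows` and its parameters are made up here, and it assumes, as the figure suggests, that ∆ is the distance between the starts of successive windows.

```python
def sliding_windows(sequence, window_len, step):
    """Return (start, window) pairs for fixed-length windows.

    Hypothetical helper: `window_len` plays the role of W and `step`
    the role of the offset between window starts (Delta in the example).
    """
    last_start = len(sequence) - window_len
    # range() is empty when the sequence is shorter than one window.
    return [(s, sequence[s:s + window_len])
            for s in range(0, last_start + 1, step)]


windows = sliding_windows("AACCGGTTACGTACG", window_len=6, step=2)
for start, window in windows:
    print(start, window)
```

The first three windows printed are the three shown in the example; a trailing stretch shorter than W is simply not indexed in this sketch.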

SST: Query Partitioning

• Query sequence partitioned into
- non-overlapping windows, or
- windows which overlap by half of their length.

SST: Mapping Windows into Vector

• For each window, create a vector which counts the number of occurrences of each k-tuple.

• Tuple size k: 2 to 10. Typically 4 or 5 (found empirically).

SST: Mapping Windows into Vector

• Assume window: A A C C G G

• k = 2 → 4^2 = 16 possible tuples, so the vector has 16 entries.

• Resulting vector:

AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
 1  1  0  0  0  1  1  0  0  0  1  0  0  0  0  0
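
As a minimal sketch of this mapping, the following Python counts overlapping k-tuples in a window. The name `ktuple_vector` and the choice to skip tuples containing letters outside A, C, G, T are assumptions made for illustration, not details given on the slides.

```python
from itertools import product


def ktuple_vector(window, k, alphabet="ACGT"):
    """Count occurrences of every k-tuple over `alphabet` in `window`.

    Entries are ordered lexicographically (AA, AC, AG, AT, CA, ...),
    matching the order used in the example above.
    """
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {t: i for i, t in enumerate(tuples)}
    counts = [0] * len(tuples)
    for i in range(len(window) - k + 1):
        t = window[i:i + k]
        if t in index:  # assumption: ignore tuples with ambiguous letters
            counts[index[t]] += 1
    return counts


vec = ktuple_vector("AACCGG", k=2)
print(vec)
```

For the window AACCGG with k = 2 this reproduces the 16-entry vector of the example.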

SST: Creation of Tree-structured Index

• Distance between vectors as heuristic function for distance between sequences.

For example, with A = (0, 1) and B = (1, 0): $d(A, B) = \sum_i |A_i - B_i| = |0 - 1| + |1 - 0| = 2$

• Method: TSVQ (Gersho and Gray, 1992)
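
The distance used in the worked example is the sum of absolute coordinate differences. A small Python sketch follows; the function name `l1_distance` is illustrative only.

```python
def l1_distance(a, b):
    """Sum of absolute coordinate differences between two count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


d = l1_distance([0, 1], [1, 0])
print(d)
```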

SST: Creation of Tree-structured Index

• Select two centroids $X_A$ and $X_B$ and their corresponding partitions of the data into disjoint sets A and B using the following iterative procedure:

- Choose two initial values for $X_A$ and $X_B$.
- For each vector y in the database, compute the distance from the vector to each of the centroids. Assign y to set A if $d_A(y) < d_B(y)$, and to set B otherwise.

$d_A(y) = \sum_i |X_{A,i} - y_i|, \qquad d_B(y) = \sum_i |X_{B,i} - y_i|$

SST: Creation of Tree-structured Index

- Compute the new centroids:

$X_A = \frac{1}{|A|} \sum_{y \in A} y, \qquad X_B = \frac{1}{|B|} \sum_{y \in B} y$

where |A| = size of set A, |B| = size of set B

SST: Creation of Tree-structured Index

- Compute values for the terminating criteria:

$D_A = \sum_{y \in A} d_A(y), \qquad D_B = \sum_{y \in B} d_B(y)$

- Repeat until the change in DA and DB is less than a small threshold, or no vectors change partition.
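
The iterative procedure above (assign, recompute centroids, check the change in DA and DB) might look roughly like this in Python. It is a sketch under assumptions: the names `two_way_split`, `tol`, and `max_iter` are invented here, the initial centroids are passed in by the caller because the slides do not say how they are chosen, and an empty partition is treated as a reason to stop.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def two_way_split(vectors, x_a, x_b, tol=1e-9, max_iter=100):
    """Split `vectors` into sets A and B around two centroids."""
    prev = None
    set_a, set_b = [], []
    for _ in range(max_iter):
        set_a, set_b = [], []
        for y in vectors:
            # Assign y to A if d_A(y) < d_B(y), and to B otherwise.
            (set_a if l1(x_a, y) < l1(x_b, y) else set_b).append(y)
        if not set_a or not set_b:
            break  # assumption: stop on a degenerate split
        x_a, x_b = mean(set_a), mean(set_b)
        d_a = sum(l1(x_a, y) for y in set_a)
        d_b = sum(l1(x_b, y) for y in set_b)
        # Stop when D_A and D_B change by less than the threshold.
        if prev and abs(prev[0] - d_a) < tol and abs(prev[1] - d_b) < tol:
            break
        prev = (d_a, d_b)
    return x_a, x_b, set_a, set_b


data = [[0, 0], [0, 1], [5, 5], [6, 5]]
x_a, x_b, set_a, set_b = two_way_split(data, x_a=[0, 0], x_b=[6, 5])
print(x_a, x_b, set_a, set_b)
```

When no vector changes partition, the centroids and therefore DA and DB stay the same, so the threshold test also covers that stopping condition.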

SST: Creation of Tree-structured Index

• Recursively partition the sets A and B generated above using the same algorithm.

• The recursion terminates when the number of vectors in a set is smaller than a specified tolerance or when the algorithm fails to fragment a cluster into two substantial new clusters.

SST: Creation of Tree-structured Index

• TSVQ:
- Each leaf contains the set of vectors that are nearest neighbors to the centroid for that node.

- When the tree is balanced, the depth of the tree is proportional to the logarithm of the number of windows, and the number of windows is proportional to the size of the database.

- Average complexity of tree construction: O(n log n).
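
To make the recursion concrete, here is a compact Python sketch of building such a tree. Several choices are assumptions for illustration and are not taken from the slides: nodes are plain dictionaries, the initial centroids are the first vector and the vector farthest from it, the split loop stops when no vector changes partition, and "fails to fragment" is simplified to "one side is empty". The names `split`, `build`, `depth`, and `min_size` are hypothetical.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def split(vectors, max_iter=50):
    """Two-centroid split; returns None if the cluster does not fragment."""
    x_a = vectors[0]                                # assumed seeding rule
    x_b = max(vectors, key=lambda y: l1(x_a, y))
    set_a, set_b = [], list(vectors)
    for _ in range(max_iter):
        new_a = [y for y in vectors if l1(x_a, y) < l1(x_b, y)]
        new_b = [y for y in vectors if l1(x_a, y) >= l1(x_b, y)]
        if not new_a or not new_b:
            return None
        if new_a == set_a:                          # no vector changed partition
            break
        set_a, set_b = new_a, new_b
        x_a, x_b = mean(set_a), mean(set_b)
    return x_a, x_b, set_a, set_b


def build(vectors, min_size):
    """Recursively partition until a set has fewer than `min_size` vectors."""
    parts = split(vectors) if len(vectors) >= min_size else None
    if parts is None:
        return {"leaf": vectors}
    x_a, x_b, set_a, set_b = parts
    return {"x_a": x_a, "x_b": x_b,
            "a": build(set_a, min_size), "b": build(set_b, min_size)}


def depth(node):
    return 0 if "leaf" in node else 1 + max(depth(node["a"]), depth(node["b"]))


data = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
tree = build(data, min_size=4)
print(depth(tree), tree["a"]["leaf"], tree["b"]["leaf"])
```

In a real index each leaf would keep window identifiers (sequence and offset) alongside the vectors so that hits can be mapped back to the database.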

SST: The Search Procedure

• Begin at the root of the tree.
• Nodes are represented by their respective centroid.
• Select the branch whose centroid is the lesser distance from the query vector.
• Proceed recursively until reaching a terminal node.
• The vectors in the terminal node represent the database windows which are the nearest neighbors to the query window.
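
The descent described in these steps can be sketched as a short loop. The tree below is hand-built for illustration, with a made-up dictionary layout (`x_a`, `x_b`, `a`, `b`, `leaf`) and made-up window labels; sending ties to branch B is an assumption chosen to mirror the assignment rule used during construction.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def search(node, query):
    """Follow the nearer centroid at each level until a leaf is reached."""
    while "leaf" not in node:
        d_a = l1(node["x_a"], query)
        d_b = l1(node["x_b"], query)
        node = node["a"] if d_a < d_b else node["b"]  # ties go to B (assumption)
    return node["leaf"]


# Hypothetical two-level index; leaves hold labels of database windows.
tree = {
    "x_a": [0.5, 0.5], "x_b": [8.5, 8.5],
    "a": {"leaf": ["w0", "w1"]},
    "b": {
        "x_a": [8, 9], "x_b": [9, 8],
        "a": {"leaf": ["w2"]},
        "b": {"leaf": ["w3"]},
    },
}

near_origin = search(tree, [0, 1])
near_corner = search(tree, [9, 7])
print(near_origin, near_corner)
```

Only two distance computations are needed per level, which is why the cost of one lookup grows with the depth of the tree rather than with the number of indexed windows.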

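SST: The Search Procedure

• Query window (equal in length to the database window size): A G C C T G

• Vector (2-tuple counts, in the order AA ... TT): 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0

• At each node with branches A and B, compute the distances dA and dB from the query vector to the two centroids. If dA > dB, follow branch B; otherwise follow branch A.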

SST: Time Complexity

• Construction of the index: O(n log n)

• Search: O(m log n)
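
A rough back-of-the-envelope derivation of the search bound, under assumptions not stated on the slides: the tree is balanced, query windows do not overlap, W, ∆ and k are fixed, and the cost of examining the vectors in the final leaf is ignored. The symbol N (number of database windows) is introduced here only for illustration.

```latex
\[
N \approx \frac{n}{\Delta}, \qquad
\text{depth} \approx \log_2 N = O(\log n)
\]
\[
T_{\text{search}} \approx
\underbrace{\frac{m}{W}}_{\text{query windows}} \cdot
\underbrace{O(\log n)}_{\text{tree levels}} \cdot
\underbrace{O(4^k)}_{\text{two distances per level}}
= O(m \log n)
\quad \text{for fixed } W \text{ and } k
\]
```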

Computational Results

• Compare the computation time per query between BLAST and SST.

- When query windows do not overlap, for the database of 120,000 sequences, SST is 27 times faster than BLAST for search alone and 15 times faster when the time to build the tree index is included.

- When query windows overlap, for the same database, SST is 13.2 times faster than BLAST for search alone and 9.3 times faster when the time to build the tree index is included.

- A higher speedup is expected for larger databases.

Discussion

• SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as shotgun sequences or matching ESTs to genomic sequence.

• The accuracy is greatly improved when query windows overlap, but overlapping substantially slows down the algorithm.

Homework 7:

• Based on the current SST algorithm, describe strategies to further improve the speed and space complexity of the algorithm.

• SST is designed for fast searches of similar sequences. Discuss any drawbacks it may have.

