Genome-scale disk-based suffix tree indexing
Benjarath Phoophakdee
Mohammed J. Zaki
Compiled by:
Amit Mahajan, Chaitra Venus
Introduction…
• Growth in biological sequence databases
• Need for an effective and efficient index structure
• Suffix tree
  – Exact/approximate matching
  – Database querying
  – Longest common substrings, etc.
• In-memory construction algorithms
  – Naive construction: O(n²)
  – Can achieve linear time and space using:
    • suffix links
    • edge encoding
    • skip and count
  – Problem: they do not scale to large input sequences
Disk based Suffix trees
• “A Database Index to Large Biological Sequences”
  – Abandons suffix links (for better locality of reference)
  – Partitions the input based on fixed-length prefixes
  – Faces problems with partition size because of data skew
  – Use of bin packing for partitions: expensive to count frequencies for long prefixes
• “Practical Suffix Tree Construction”
  – TDD: similar to the above; drops suffix links
  – Reported to scale to human genome level
  – Random I/Os when input string size > memory
• ST-Merge (improvement to TDD)
  – Input string is split into smaller contiguous substrings
  – Apply TDD on each substring and then merge all trees
  – Does not have suffix links
• TOP-Q and DynaCluster
  – Only known algorithms that maintain suffix links and do not have the data skew problem
  – Experiments show that they do not scale to human genome level
Issue
• Problems with disk-based algorithms
  – Data skew
  – No suffix links
  – No scalability
The authors propose a novel disk-based suffix tree algorithm called TRELLIS
TRELLIS
• O(n²) time, O(n) space
• Idea:
  – Construct by partitioning and merging
  – Use variable-length prefixes
  – Recover suffix links in a separate post-construction phase
• Effectively scales up to human genome level
  – Can index the entire human genome using 2 GB of memory in 4 hours, and recover suffix links in 2 hours
• Has 4 different phases
  – Prefix Creation
  – Partitioning
  – Merging
  – Suffix Link Recovery
Prefix Creation Phase
• Problems with fixed-length prefixes
  – Cannot handle data skew
  – There is no defined way to compute an appropriate length
• TRELLIS makes use of variable-length prefixes.
P = {P0, P1, P2, …, Pm-1}
Use some threshold t to determine P such that freq(Pi) ≤ t
• Multi-scan approach to compute P
  – In the ith scan:
    • Process prefixes up to a certain length Li
      (See formula below to calculate Li)
    • EPi = set of prefixes that need further extension in the next scan (as their frequency > t)
    • Add to P only the shortest prefixes that meet the frequency threshold t, and reject their extensions
• Ex: With t = 10^6, only two stages were required for the human genome, with L1 = 8 and L2 = 16
• The resulting set P contained about 6400 prefixes with lengths in the range 4 to 16
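The multi-scan idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `create_prefixes` and the fixed `step` (how many characters each scan adds) are hypothetical stand-ins for the Li schedule, and the input is assumed to end with a unique terminator such as `$`.

```python
from collections import Counter

def create_prefixes(s, t, step=2):
    """Return a prefix-free set P with freq(p) <= t for every p in P.

    Assumptions for this sketch: s ends with a unique terminator,
    t >= 1, and each scan extends prefixes by `step` characters.
    """
    accepted = []
    frequent = {""}      # EP: prefixes whose frequency is still > t
    prev_len = 0
    while frequent:
        cur_len = prev_len + step
        counts = Counter()
        # One scan over s: count extensions of the still-frequent prefixes.
        for i in range(len(s)):
            if s[i:i + prev_len] not in frequent:
                continue
            for k in range(prev_len + 1, min(cur_len, len(s) - i) + 1):
                counts[s[i:i + k]] += 1
        # Accept the shortest prefixes meeting t; never extend accepted ones.
        alive = frequent
        for k in range(prev_len + 1, cur_len + 1):
            survivors = set()
            for p, c in counts.items():
                if len(p) == k and p[:-1] in alive:
                    if c <= t:
                        accepted.append(p)
                    else:
                        survivors.add(p)
            alive = survivors
        frequent = alive
        prev_len = cur_len
    return sorted(accepted)

print(create_prefixes("banana$", t=2))  # ['$', 'a$', 'an', 'b', 'n']
```

In the toy run, "a" occurs three times and is therefore extended to "an" and "a$", while "b", "n" and "$" already meet the threshold at length 1. Every suffix starts with exactly one prefix in the result.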
Partitioning Phase
• Divide the input string into r consecutive partitions, where r = (n+1) / t
• Suffix subtree TRi
  – Contains suffixes that start in partition Ri
  – Use Ukkonen’s algorithm* to build it
• Prefixed suffix subtree TRi,Pk
  – Split TRi into subtrees that contain only suffixes that have prefix Pk
  – At most m such subtrees
• Store these prefixed suffix subtrees on disk
* Proposed in the paper “Online construction of suffix trees” by E. Ukkonen
• The TRi obtained are implicit suffix trees (i.e. some suffixes end in the middle of an edge rather than at a leaf)
• To guarantee that TRi explicitly contains all suffixes from the ith partition
  – Continue to read some characters from the next partition Ri+1 until t leaves are obtained in TRi
  – Appending a special terminator character is not an option, as it would incur additional overhead during the merging phase
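The bookkeeping of the partitioning phase can be sketched as below. This is a simplified illustration: plain lists of suffix start positions stand in for the subtrees TRi,Pk that TRELLIS builds with Ukkonen's algorithm, and the name `partition_suffixes` is hypothetical. The string `s` is assumed to include its terminator, so `len(s)` plays the role of n+1.

```python
import math
from collections import defaultdict

def partition_suffixes(s, prefixes, t):
    """Group suffix start positions by (partition Ri, prefix Pk).

    `prefixes` is assumed to be a prefix-free set covering every suffix
    of s; lists of positions stand in for prefixed suffix subtrees.
    """
    r = math.ceil(len(s) / t)            # number of partitions
    buckets = []
    for i in range(r):
        groups = defaultdict(list)
        for start in range(i * t, min((i + 1) * t, len(s))):
            for p in prefixes:
                if s.startswith(p, start):
                    groups[p].append(start)
                    break
        buckets.append(dict(groups))     # would be written to disk per prefix
    return buckets

buckets = partition_suffixes("banana$", ["$", "a$", "an", "b", "n"], t=3)
print(buckets[0])  # {'b': [0], 'an': [1], 'n': [2]}
```

Each partition yields at most m groups, one per prefix, which matches the bound on the number of prefixed suffix subtrees per partition.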
Merging Phase
• For each prefix Pk in the set P
– Merge all prefixed suffix subtrees TRi,Pk to get the prefixed suffix tree TPk
• We get m Prefixed Suffix trees
• Store the resulting trees back to disk
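The per-prefix merge can be illustrated with sorted runs of suffix positions in place of trees. This is a stand-in technique, not the tree merge TRELLIS performs: `merge_prefix_groups` is a hypothetical name, and the per-partition `buckets` structure (a list of dicts mapping prefix to suffix start positions) is an assumption of this sketch.

```python
import heapq

def merge_prefix_groups(s, buckets, prefixes):
    """For each prefix Pk, merge its per-partition groups into one sorted run.

    Sorted lists of suffix positions stand in for prefixed suffix
    (sub)trees; comparing s[i:] directly is quadratic and only suits toys.
    """
    def suffix(i):
        return s[i:]

    merged = {}
    for p in prefixes:
        runs = [sorted(b.get(p, []), key=suffix) for b in buckets]
        merged[p] = list(heapq.merge(*runs, key=suffix))
    return merged

s = "banana$"
buckets = [
    {"b": [0], "an": [1], "n": [2]},
    {"an": [3], "n": [4], "a$": [5]},
    {"$": [6]},
]
prefixes = ["$", "a$", "an", "b", "n"]
merged = merge_prefix_groups(s, buckets, prefixes)
print(merged["an"])  # [3, 1]
```

Because the prefixes are disjoint, each prefix can be merged independently with only its own groups in memory. Concatenating the merged runs in sorted prefix order gives the suffix array of the toy string.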
Suffix Link Recovery Phase
• Why?
  – Suffix links are crucial for efficiency in many fast string processing algorithms
• Why in a separate phase?
  – TRELLIS may discard all suffix link information during the merge phase, as new internal nodes are created and some old ones are deleted
  – It is useful to discard suffix link information after partitioning, as it reduces the amount of data per node
  – Recovering links from scratch takes the same time as keeping the original link information
• TRELLIS recovers the suffix links of one prefixed suffix tree at a time
• Start with the children of the root
• Proceeding in a depth-first fashion, do the following for each internal node x
  – Locate p(x), the parent of x, and sl(p(x)), the parent's suffix link
  – Skip/count down from sl(p(x)) to locate sl(x); when found, add the link
  – Do this recursively for all children of x
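The recovery procedure above can be sketched on a small in-memory tree. Assumptions of this sketch: the tree is built by naive O(n²) insertion (not Ukkonen's algorithm and not on disk), it covers the whole string rather than one prefixed suffix tree, and the names `Node`, `build_suffix_tree`, `recover_suffix_links` and `path_label` are hypothetical.

```python
class Node:
    def __init__(self, start=0, end=0, parent=None):
        self.start, self.end = start, end   # edge label is s[start:end]
        self.parent = parent
        self.children = {}                  # first edge character -> Node
        self.link = None                    # suffix link, set during recovery

def build_suffix_tree(s):
    """Naive construction; s must end with a unique terminator."""
    root = Node()
    for i in range(len(s)):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:               # no matching edge: add a leaf
                node.children[s[j]] = Node(j, len(s), node)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k += 1
                j += 1
            if k == child.end:              # consumed the whole edge
                node = child
                continue
            mid = Node(child.start, k, node)    # mismatch: split the edge
            node.children[s[child.start]] = mid
            child.start, child.parent = k, mid
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, len(s), mid)
            break
    return root

def recover_suffix_links(s, root):
    """Depth-first recovery: reach sl(x) from sl(p(x)) by skip/count."""
    root.link = root
    stack = list(root.children.values())
    while stack:
        x = stack.pop()
        if not x.children:                  # leaves carry no suffix link
            continue
        if x.parent is root:
            node, a = root, x.start + 1     # drop the first character
        else:
            node, a = x.parent.link, x.start
        while a < x.end:                    # jump whole edges, no comparisons
            node = node.children[s[a]]
            a += node.end - node.start
        x.link = node
        stack.extend(x.children.values())   # parents are linked before children

def path_label(s, node):
    parts = []
    while node.parent is not None:
        parts.append(s[node.start:node.end])
        node = node.parent
    return "".join(reversed(parts))
```

For "banana$" the internal nodes are "a", "ana" and "na"; after recovery, "ana" links to "na", "na" links to "a", and "a" links to the root. The skip/count loop inspects only the first character of each edge, which is why a node's link is cheap to find once its parent's link is known.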
Choosing t
Note: t is also the threshold for the partition size
M ≥ n/4 + ((0.7 × 40) + 16)t + (0.7 × 40)t
where:
M = available main memory
n/4 = memory for the input (in compressed form)
# internal nodes = 0.7 × (# external nodes)
40 and 16 are the sizes of internal and external nodes, respectively
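Solving the inequality for t gives the largest threshold that fits in memory. A small sketch, assuming sizes in bytes; the function name `max_threshold` and its parameter names are hypothetical:

```python
def max_threshold(memory_bytes, n, internal_size=40, leaf_size=16,
                  internal_ratio=0.7):
    """Largest t with M >= n/4 + (0.7*40 + 16)*t + (0.7*40)*t.

    Defaults follow the constants on the slide; units are assumed
    to be bytes.
    """
    per_t = (internal_ratio * internal_size + leaf_size) \
            + internal_ratio * internal_size      # 72 bytes per unit of t
    return int((memory_bytes - n / 4) // per_t)

# Roughly human-genome scale: n = 3 * 10**9 characters, M = 2 * 10**9 bytes.
t = max_threshold(2 * 10**9, 3 * 10**9)
print(t)  # about 17.4 million
```

With these inputs the compressed input takes 0.75 × 10⁹ bytes, leaving 1.25 × 10⁹ bytes to be divided by 72 bytes per unit of t.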
Computational Complexity
• Prefix Creation Phase
  – O(nL) time, where L = longest prefix length
  – O(n + |Σ^(L+1)|) space
• Partitioning Phase
  – Input is broken into r partitions, each of size t
  – O(t) time/space for each => r × O(t) = O(n)
  – Disk I/Os: O(r × m), since at most m prefixed suffix subtrees can be created for each partition
• Merging Phase
  – Each merge operation can be O(p), where p = |longest common prefix|
  – Across all prefixes, merging = O(p × n), since the number of tree nodes in a suffix tree is bounded by n
  – In the worst case p can be O(n), therefore merge = O(n²)
  – Disk I/Os: O(r × m)
• Suffix Link Recovery Phase
  – Internal nodes in the final suffix trees are O(n)
  – Constant set of operations for each suffix link recovery
• Putting it all together…
  – O(n²) time, since the merge phase is the most expensive
  – O(n) space
Experimental Setup
• Compared to
  – TOP-Q and DynaCluster (maintain suffix links)
  – TDD (no suffix links)
• Performed on Linux with
  – 2 GB RAM for the human genome and 512 MB for others
  – 288 GB disk space
  – TRELLIS written in C++ and compiled with g++
  – Other algorithms obtained from their authors
Experimental Results
TRELLIS vs. TOP-Q and DynaCluster
For 200 Mbp, DynaCluster did not terminate even after 8 hours, TRELLIS took 13 min
TRELLIS vs. TDD
• TDD uses four different buffers (string, suffix, temp and tree)
• 200 Mbp requires only the last 2 buffers
• This saves the additional I/O incurred in other cases
• TDD is built using a memory-optimized suffix tree method
• The difference is not significant for the human genome, as TDD needs to be run in 64-bit mode
TRELLIS vs. TDD – Query time
• TDD does not store edge lengths; they are determined by examining children
• An internal node has a pointer to only one child, so all children are scanned linearly for every query
Conclusions
• TRELLIS
  – Solves the data skew problem: variable-length prefixes
  – Scales gracefully for very large sequences
  – No disk I/O overhead, as it works with suffix trees that are guaranteed to fit in memory
  – Exhibits faster construction and query times than other disk-based algorithms
Future Work
• Plan to make TRELLIS applicable to a wider range of alphabets (e.g. the English alphabet)
• No buffering strategy is required for the human genome, but one is planned for building a generalized suffix tree composed of many large genomes
• Parallelize TRELLIS, since its partitioning and merging steps seem ideally suited