Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by:...

Genome-scale disk-based suffix tree indexing

Benjarath Phoophakdee

Mohammed J. Zaki

Compiled by:

Amit Mahajan, Chaitra Venus

Introduction…

• Growth in biological sequence databases

• Need for an effective and efficient index structure

• Suffix Tree
– Exact/approx. matching
– Database querying
– Longest common substrings etc.

Introduction…

• In-memory construction algorithms
– Naive construction: O(n²) time
– Can achieve linear time and space using
  • suffix links
  • edge encoding
  • skip and count
– Problem: do not scale to large input sequences

Disk based Suffix trees

• “A Database Index to Large Biological Sequences”
– Abandons suffix links (for better locality of reference)
– Partitions the input based on fixed-length prefixes
– Partition sizes become uneven because of data skew
– Bin packing of partitions is possible, but counting frequencies for long prefixes is expensive

• “Practical Suffix Tree Construction”
– TDD: similar to the above; drops suffix links
– Reported to scale to human genome level
– Random I/Os when input string size > memory

• ST-Merge (improvement to TDD)
– Splits the input string into smaller contiguous substrings
– Applies TDD to each substring, then merges all the trees
– Does not have suffix links

• TOP-Q and DynaCluster
– Only known algorithms that maintain suffix links and do not have the data skew problem
– Experiments show that they do not scale to human genome level


Issue

• Problems with disk-based algorithms
– Data skew
– No suffix links
– No scalability

The authors propose a novel disk-based suffix tree algorithm called TRELLIS

TRELLIS

• O(n²) time, O(n) space
• Idea:
– Construct by partitioning and merging
– Use variable-length prefixes
– Recover suffix links in a separate post-construction phase
• Effectively scales up to human genome level
– Can index the entire human genome using 2 GB of memory in 4 hours, and recover suffix links in 2 hours

TRELLIS

• Has 4 different phases
– Prefix Creation
– Partitioning
– Merging
– Suffix Link Recovery

Prefix Creation Phase

• Problems with fixed-length prefixes
– Cannot handle data skew
– No well-defined way to compute an appropriate length

• TRELLIS makes use of variable length prefixes.

P = {P_0, P_1, P_2, …, P_{m-1}}

Use some threshold t to determine P such that freq(P_i) ≤ t


• Multi-scan approach to compute P
– i-th scan
  • Process prefixes up to a certain length L_i

(See formula below to calculate L_i)

  • EP_i = set of prefixes that need further extension in the next scan (as their frequency > t)
  • Add to P only the smallest-length prefixes that meet the frequency threshold t, and reject their extensions


• Ex: With t = 10^6, only two stages were required for the human genome, with L_1 = 8 and L_2 = 16

The resulting set P contained about 6400 prefixes with lengths in the range 4 to 16
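The prefix selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `variable_length_prefixes`, the `scan_lengths` parameter (standing in for the per-scan limits L_i), and the unique terminator `$` are all assumptions made for the example. For simplicity it makes one pass over the string per prefix length, whereas a real scan would count all lengths up to L_i in a single pass.

```python
from collections import Counter

def variable_length_prefixes(s, t, scan_lengths):
    """Pick the shortest prefixes whose frequency in s is at most t.

    s            -- input string, assumed to end in a unique terminator ('$')
    t            -- frequency threshold
    scan_lengths -- hypothetical per-scan length limits L_1 < L_2 < ...
    Returns (accepted, pending): accepted maps prefix -> frequency,
    pending holds prefixes still above t after the last scan.
    """
    accepted = {}
    pending = {""}      # plays the role of EP_i; starts with the empty prefix
    prev_limit = 0
    for limit in scan_lengths:
        # Simplification: one pass over s per length, not one pass per scan.
        for length in range(prev_limit + 1, limit + 1):
            counts = Counter(
                s[i:i + length]
                for i in range(len(s) - length + 1)
                if s[i:i + length - 1] in pending   # only extend pending prefixes
            )
            pending = set()
            for prefix, freq in counts.items():
                if freq <= t:
                    accepted[prefix] = freq   # meets the threshold: stop extending
                else:
                    pending.add(prefix)       # too frequent: extend further
        prev_limit = limit
    return accepted, pending

accepted, pending = variable_length_prefixes("banana$", t=2, scan_lengths=[1, 2])
```

Because extensions of an accepted prefix are never counted, the accepted set is prefix-free, so every suffix falls under exactly one prefix.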

Partitioning Phase

• Divide input string into r consecutive partitions where r = (n+1) / t

• Suffix Subtree T_{R_i}
– Contains suffixes that start in partition R_i
– Use Ukkonen’s algorithm* to build it
• Prefixed Suffix Subtree T_{R_i,P_k}
– Split T_{R_i} into subtrees that contain only suffixes that have prefix P_k
– At most m such subtrees
• Store these prefixed suffix subtrees on disk

* proposed in the paper “Online construction of suffix trees” – E. Ukkonen
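To make the partition-then-split step concrete, here is a rough sketch that buckets suffix start positions by partition and by prefix. It is a stand-in only: it stores lists of positions where TRELLIS would store subtrees built with Ukkonen's algorithm, and the name `partition_suffixes` is hypothetical. The input is assumed to include its terminator (so `len(s)` plays the role of n + 1), and the prefix set is assumed to be prefix-free.

```python
def partition_suffixes(s, prefixes, t):
    """Bucket suffix start positions by partition and by prefix.

    s        -- input string including its terminator
    prefixes -- prefix-free collection of prefixes (no prefix starts another)
    t        -- partition size
    Returns a list with one dict per partition: prefix -> start positions.
    """
    r = -(-len(s) // t)                     # number of partitions, rounded up
    partitions = []
    for k in range(r):
        buckets = {}
        for i in range(k * t, min((k + 1) * t, len(s))):
            for p in prefixes:
                if s.startswith(p, i):      # the suffix at i belongs to prefix p
                    buckets.setdefault(p, []).append(i)
                    break
        partitions.append(buckets)
    return partitions

parts = partition_suffixes("banana$", ["b", "n", "$", "an", "a$"], t=3)
```

Each partition yields at most one bucket per prefix, which mirrors the bound of at most m prefixed suffix subtrees per partition.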

Partitioning Phase

• The T_{R_i} obtained are implicit suffix trees (i.e. some suffixes end in the middle of an edge instead of at a leaf)

• To guarantee that T_{R_i} explicitly contains all suffixes from the i-th partition
– Continue to read characters from the next partition R_{i+1} until t leaves are obtained in T_{R_i}
– Appending a special terminal character is not an option, as it would incur additional overhead during the merging phase

Merging Phase

• For each prefix P_k in the set P
– Merge all Prefixed Suffix Subtrees T_{R_i,P_k} to get the Prefixed Suffix Tree T_{P_k}

• We get m Prefixed Suffix Trees

• Store the resulting trees back to disk
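The merge of two subtrees that share a prefix can be illustrated with a recursive merge of two tries. This sketch uses uncompressed tries stored as nested dicts, which is an assumption made to keep the example short. TRELLIS merges compressed suffix subtrees read from disk, where a merge may also have to split edges. The helper names are hypothetical.

```python
def build_trie(s, positions):
    """Uncompressed trie (nested dicts) of the suffixes of s at positions."""
    root = {}
    for i in positions:
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def merge_tries(a, b):
    """Merge trie b into trie a in place and return a."""
    for ch, child in b.items():
        if ch in a:
            merge_tries(a[ch], child)   # shared branch: descend and keep merging
        else:
            a[ch] = child               # branch only in b: graft it onto a
    return a

s = "banana$"
sub0 = build_trie(s, [1])   # suffix "anana$", prefix "an", from one partition
sub1 = build_trie(s, [3])   # suffix "ana$", prefix "an", from another partition
merged = merge_tries(sub0, sub1)
```

The work done by the merge grows with the length of the path the two inputs share, which is the common-prefix cost that the complexity analysis charges to the merging phase.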

Suffix Link Recovery Phase

• Why?
– Suffix links are crucial for efficiency in many fast string processing algorithms
• Why in a separate phase?
– TRELLIS may discard all suffix link information during the merge phase, as new internal nodes are created and some old ones are deleted
– It is useful to discard suffix link information after partitioning, as it reduces the amount of data per node
– Recovering links from scratch takes the same time as keeping the original link information

Suffix Link Recovery Phase

• TRELLIS recovers suffix links of one Prefixed Suffix Tree at a time
• Start with the children of the root
• Proceeding in a depth-first fashion, do the following for each internal node x
– Locate p(x), the parent of x, and its suffix link sl(p(x))
– Use skip and count from sl(p(x)) to locate sl(x); when found, add the link
– Do this recursively for all children of x
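The parent-based recovery rule can be tried out on a small example. The sketch below makes two simplifying assumptions: it works on an uncompressed suffix trie, so walking down from sl(p(x)) is a single child lookup rather than skip and count over an edge label, and it links every node instead of only internal ones. It also ignores the fact that in TRELLIS a link may point into a different prefixed suffix tree. All names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.link = None     # suffix link, filled in by recover_suffix_links

def build_suffix_trie(s):
    """Naive uncompressed suffix trie of s (quadratic; for illustration only)."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
    return root

def recover_suffix_links(root):
    """Depth-first pass: sl(x) is reached from sl(p(x)) via the edge p(x) -> x."""
    root.link = root
    stack = [(root, ch, x) for ch, x in root.children.items()]
    while stack:
        parent, ch, x = stack.pop()
        # Children of the root link back to the root; otherwise follow the
        # parent's link and take the same character down.
        x.link = root if parent is root else parent.link.children[ch]
        stack.extend((x, c, y) for c, y in x.children.items())
    return root

def locate(root, label):
    """Return the node whose path label is label."""
    node = root
    for ch in label:
        node = node.children[ch]
    return node

root = recover_suffix_links(build_suffix_trie("banana$"))
```

A node is pushed only after its own link is set, so sl(p(x)) is always available when x is processed.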

Choosing t

Note: t is also the threshold for partition size

M ≥ n/4 + ((0.7 × 40) + 16)t + (0.7 × 40)t

M = available main memory

n/4 = memory for input (in compressed form)

# internal nodes = 0.7 × (# external nodes)

40, 16 are the sizes (in bytes) of internal and external nodes
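Solving the memory bound for t gives the largest usable threshold. Since 0.7 × 40 = 28, the bound reads M ≥ n/4 + 72t. The helper below is a small sketch of that arithmetic. The name `max_threshold` is hypothetical, and it assumes that M and the node sizes are measured in bytes.

```python
def max_threshold(memory_bytes, n):
    """Largest integer t satisfying M >= n/4 + (28 + 16)*t + 28*t.

    Multiplying the inequality by 4 keeps everything in integers:
    4*M >= n + 288*t, hence t <= (4*M - n) / 288.
    Returns 0 when the compressed input alone does not fit in memory.
    """
    slack = 4 * memory_bytes - n
    return max(slack // 288, 0)

t_small = max_threshold(1000, 400)    # toy numbers: 100 bytes of input, 900 left
t_none = max_threshold(10, 400)       # input does not fit at all
```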

Computational Complexity

• Prefix Creation Phase
– O(nL) time, where L = longest prefix length
– O(n + |Σ|^(L+1)) space

• Partitioning Phase
– Input is broken into r partitions, each of size t
– O(t) time/space for each => r × O(t) = O(n)
– Disk I/Os: O(r × m), since at most m prefixed suffix subtrees can be created for each partition


• Merging Phase
– Each merge operation can be O(p), where p = |longest common prefix|
– Across all prefixes, merging = O(p × n), since the number of nodes in the suffix tree is O(n)
– In the worst case p can be O(n), therefore merge = O(n²)
– Disk I/Os: O(r × m)


• Suffix Link Recovery Phase
– Internal nodes in the final suffix trees are O(n)
– Constant set of operations for each suffix link recovery

• Putting it all together…
– O(n²) time, since the most expensive phase is the merge phase
– O(n) space
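As a sketch, the per-phase time bounds listed above can be summed in one line, taking the worst case p = O(n) for the longest common prefix and using L ≤ n for the longest prefix length:

```latex
T(n) \;=\; \underbrace{O(nL)}_{\text{prefix creation}}
      \;+\; \underbrace{O(n)}_{\text{partitioning}}
      \;+\; \underbrace{O(pn)}_{\text{merging}}
      \;+\; \underbrace{O(n)}_{\text{suffix link recovery}}
      \;=\; O(n^2) \qquad \text{when } p = O(n),\ L \le n .
```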

Experimental Setup

• Compared to
– TOP-Q and DynaCluster (maintain suffix links)
– TDD (no suffix links)

• Performed on Linux with
– 2 GB RAM for human genome and 512 MB for others
– 288 GB disk space
– TRELLIS written in C++ and compiled with g++
– Other algorithms obtained from their authors

Experimental Results

TRELLIS vs. TOP-Q and DynaCluster

For 200 Mbp, DynaCluster did not terminate even after 8 hours, whereas TRELLIS took 13 min


TRELLIS vs. TDD

• TDD uses four different buffers (string, suffix, temp and tree)

• 200 Mbp requires only the last 2 buffers
• Saves the additional I/O incurred in other cases


TRELLIS vs. TDD

• TDD is built using a memory-optimized suffix tree method
• Difference is not significant for the human genome, as TDD needs to be run in 64-bit mode


TRELLIS vs. TDD – Query time

• TDD does not store edge lengths; they are determined by examining the children

• An internal node has a pointer to only one child, so all children are scanned linearly for every query
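The cost of keeping only one child pointer per node can be seen in a first-child/next-sibling layout. The sketch below is a hypothetical layout chosen to illustrate the point, not TDD's actual on-disk format. The class and function names are made up for the example.

```python
class SiblingNode:
    """Hypothetical node layout: one pointer to the first child; the
    remaining children are reached through next_sibling."""
    def __init__(self, ch):
        self.ch = ch
        self.first_child = None
        self.next_sibling = None

def add_child(parent, ch):
    child = SiblingNode(ch)
    child.next_sibling = parent.first_child   # push onto the front of the list
    parent.first_child = child
    return child

def find_child(parent, ch):
    """Return (child, nodes_visited); linear in the number of children."""
    visited = 0
    child = parent.first_child
    while child is not None:
        visited += 1
        if child.ch == ch:
            return child, visited
        child = child.next_sibling
    return None, visited

root = SiblingNode("")
for ch in "ACGT":
    add_child(root, ch)
```

Every step of a query pays this scan at each node it passes through, whereas a node that stores direct pointers to its children can follow the right one immediately.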

Conclusions

• TRELLIS
– Solves the data skew problem: variable-length prefixes
– Scales gracefully for very large sequences
– No disk I/O overhead, as it works with suffix trees that are guaranteed to fit in memory
– Exhibits faster construction and query times than other disk-based algorithms

Future Work

• Plan to make TRELLIS applicable to a wider range of alphabets (Ex: the English alphabet)

• No buffering strategy is required for the human genome, but one is planned for building a generalized suffix tree composed of many large genomes

• Parallelize TRELLIS, since its partitioning and merging steps seem ideally suited

Date post:	18-Dec-2015
Category:	Documents
Upload:	alfred-summers