Date post: 18-Dec-2015
Page 1: Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.

Genome-scale disk-based suffix tree indexing

Benjarath Phoophakdee

Mohammed J. Zaki

Compiled by:

Amit Mahajan, Chaitra Venus

Introduction…

• Growth in biological sequence databases

• Need for an effective and efficient index structure

• Suffix tree
  – Exact/approximate matching
  – Database querying
  – Longest common substrings, etc.

Introduction…

• In-memory construction algorithms
  – Naive construction: O(n²) time
  – Can achieve linear time and space using
    • suffix links
    • edge encoding
    • skip and count
  – Problem: do not scale to large input sequences
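
To make the quadratic cost concrete, here is a minimal sketch that inserts every suffix character by character into an uncompressed trie. The nested-dict trie and the function names are assumptions chosen for illustration; they stand in for a real suffix tree and are not taken from the slides.

```python
def build_suffix_trie(text):
    """Insert every suffix of text one character at a time: O(n^2) time and space."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Exact matching: a pattern occurs iff it labels a path starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("GATTACA$")
print(contains(trie, "TTAC"))  # True
print(contains(trie, "TAG"))   # False
```

The linear-time algorithms avoid this blow-up by compressing edges and reusing work through suffix links, which this sketch deliberately omits.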

Disk based Suffix trees

• “A Database Index to Large Biological Sequences”
  – Abandons suffix links (for better locality of reference)
  – Partitions the input based on fixed-length prefixes
  – Partition sizes suffer from data skew
  – Uses bin packing for partitions: counting frequencies is expensive for long prefixes

• “Practical Suffix Tree Construction”
  – TDD: similar to the above; drops suffix links
  – Reported to scale to human genome level
  – Random I/Os when input string size > memory

Disk based Suffix trees

• ST-Merge (improvement to TDD)
  – Splits the input string into smaller contiguous substrings
  – Applies TDD to each substring, then merges all trees
  – Does not have suffix links

• TOP-Q and DynaCluster
  – Only known algorithms that maintain suffix links and do not have the data skew problem
  – Experiments show that they do not scale to human genome level

Issue

• Problems with disk-based algorithms
  – Data skew
  – No suffix links
  – No scalability

The authors propose a novel disk-based suffix tree algorithm called TRELLIS

TRELLIS

• O(n²) time, O(n) space

• Idea:
  – Construct by partitioning and merging
  – Use variable-length prefixes
  – Recover suffix links in a separate post-construction phase

• Effectively scales up to human genome level
  – Can index the entire human genome using 2 GB of memory in 4 hours, and recover suffix links in 2 hours

TRELLIS

• Has 4 different phases
  – Prefix Creation
  – Partitioning
  – Merging
  – Suffix Link Recovery

Prefix Creation Phase

• Problems with fixed-length prefixes
  – Cannot handle data skew
  – No well-defined way to compute an appropriate length

• TRELLIS makes use of variable-length prefixes:

P = {P_0, P_1, P_2, …, P_{m-1}}

A threshold t determines P such that freq(P_i) ≤ t for every P_i

Prefix Creation Phase

• Multi-scan approach to compute P
  – In the i-th scan:
    • Process prefixes up to a certain length L_i (the slide's formula for L_i is not reproduced here)
    • EP_i = set of prefixes that need further extension in the next scan (as their frequency > t)
    • Add to P only the shortest prefixes that meet the frequency threshold t, and reject their extensions
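
The multi-scan idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes each scan extends the pending prefixes by exactly one character (the slides extend up to a length L_i per scan), and the function name and sample string are hypothetical.

```python
from collections import Counter


def variable_length_prefixes(text, t):
    """Return {prefix: frequency} with every frequency <= t.

    Simplification (an assumption of this sketch): each scan lengthens the
    pending prefixes by one character rather than up to a length L_i.
    """
    accepted = {}    # the set P, with frequencies
    pending = {""}   # EP: prefixes whose frequency still exceeds t
    length = 0
    while pending:
        length += 1
        counts = Counter()
        # One scan over the input: count substrings that extend a pending prefix.
        for i in range(len(text) - length + 1):
            piece = text[i:i + length]
            if piece[:-1] in pending:
                counts[piece] += 1
        pending = set()
        for prefix, freq in counts.items():
            if freq <= t:
                accepted[prefix] = freq   # shortest prefix meeting the threshold
            else:
                pending.add(prefix)       # too frequent: extend in the next scan
    return accepted


# "$" is a unique terminator, so every suffix falls under exactly one prefix.
P = variable_length_prefixes("ACACACGTACAC$", t=3)
print(sorted(P.items()))
```

Because extensions are only generated for prefixes that are still pending, no accepted prefix is a prefix of another, and the frequencies of the accepted prefixes sum to the number of suffixes.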

Prefix Creation Phase

• Example: with t = 10⁶, only two stages were required for the human genome, with L_1 = 8 and L_2 = 16

• The resulting set P contained about 6400 prefixes, with lengths in the range 4 to 16

Partitioning Phase

• Divide the input string into r consecutive partitions, where r = (n+1) / t

• Suffix subtree T_{R_i}
  – Contains the suffixes that start in partition R_i
  – Built using Ukkonen’s algorithm*

• Prefixed suffix subtree T_{R_i,P_k}
  – Split T_{R_i} into subtrees that each contain only the suffixes with prefix P_k
  – At most m such subtrees

• Store these prefixed suffix subtrees on disk

* Proposed in the paper “Online construction of suffix trees” by E. Ukkonen

Partitioning Phase

• The T_{R_i} obtained are implicit suffix trees (i.e. some suffixes end in the middle of an edge instead of at a leaf)

• To guarantee that T_{R_i} explicitly contains all suffixes from the i-th partition
  – Continue reading characters from the next partition R_{i+1} until T_{R_i} has t leaves
  – Appending a special terminal character is not an option, as it would incur additional overhead during the merging phase
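
The bookkeeping of the partitioning phase can be sketched as below. As a loudly stated assumption, a plain list of suffix start positions stands in for each Ukkonen-built prefixed suffix subtree; the function name, the sample string, and the prefix set are hypothetical.

```python
def partition_suffixes(text, prefixes, t):
    """For each partition of size t, group suffix start positions by prefix.

    Assumption of this sketch: a list of start positions replaces each
    prefixed suffix subtree; no actual tree is built here.
    """
    partitions = []
    for begin in range(0, len(text), t):
        buckets = {}
        for start in range(begin, min(begin + t, len(text))):
            # The suffix is matched against the whole text, so it may run past
            # the partition boundary, mirroring the read-ahead into the next partition.
            for p in prefixes:
                if text.startswith(p, start):
                    buckets.setdefault(p, []).append(start)
                    break
        partitions.append(buckets)
    return partitions


text = "ACACACGTACAC$"
prefixes = ["G", "T", "$", "CA", "CG", "C$", "ACA", "ACG", "AC$"]  # hypothetical set P
for i, buckets in enumerate(partition_suffixes(text, prefixes, t=5)):
    print(i, buckets)
```

Each bucket corresponds to one unit that would be written to disk, so a run produces at most r × m of them.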

Merging Phase

• For each prefix P_k in the set P
  – Merge all prefixed suffix subtrees T_{R_i,P_k} to get the prefixed suffix tree T_{P_k}

• This yields m prefixed suffix trees

• Store the resulting trees back to disk
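
A minimal sketch of merging the subtrees for a single prefix follows. It assumes uncompressed nested-dict tries in place of real suffix subtrees, and the per-partition start positions are hypothetical; the point is only that common paths are shared and differing branches are grafted.

```python
def build_trie(text, starts):
    """Uncompressed trie over the suffixes of text that begin at the given positions."""
    root = {}
    for s in starts:
        node = root
        for ch in text[s:]:
            node = node.setdefault(ch, {})
    return root


def merge_into(target, source):
    """Recursively merge trie `source` into `target`, sharing common paths."""
    for ch, child in source.items():
        if ch in target:
            merge_into(target[ch], child)   # common path: descend into both
        else:
            target[ch] = child              # new branch: graft it as is
    return target


text = "ACACACGTACAC$"
# Start positions of the suffixes with prefix "ACA", one list per partition (hypothetical).
per_partition = [[0, 2], [8]]
subtrees = [build_trie(text, starts) for starts in per_partition]

merged = {}
for sub in subtrees:
    merge_into(merged, sub)

print(merged == build_trie(text, [0, 2, 8]))  # True
```

The work done at each step is proportional to the length of the path the two tries share, which is where the longest-common-prefix term in the merge cost comes from.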

Suffix Link Recovery Phase

• Why?
  – Suffix links are crucial for the efficiency of many fast string processing algorithms

• Why in a separate phase?
  – TRELLIS may discard all suffix link information during the merge phase, as new internal nodes are created and some old ones are deleted
  – Discarding suffix link information after partitioning is useful, as it reduces the amount of data per node
  – Recovering links from scratch takes the same time as keeping the original link information

Suffix Link Recovery Phase

• TRELLIS recovers the suffix links of one prefixed suffix tree at a time

• Start with the children of the root

• Proceeding in depth-first fashion, do the following for each internal node x
  – Locate p(x), the parent of x, and its suffix link sl(p(x))
  – Count down from sl(p(x)) to locate sl(x); when found, add the link
  – Do this recursively for all children of x
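
The recovery step can be illustrated on an uncompressed suffix trie, where "counting down" from sl(p(x)) reduces to following a single character. This is a simplified sketch under that assumption; the class and function names are hypothetical, and for simplicity it links every node, including leaves.

```python
class Node:
    def __init__(self, parent=None, char=""):
        self.parent = parent       # p(x)
        self.char = char           # label of the edge from p(x) to x
        self.children = {}
        self.suffix_link = None    # sl(x), filled in by recover_suffix_links


def build_trie(text):
    """Uncompressed trie of all suffixes of text (no suffix links yet)."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = Node(node, ch)
            node = node.children[ch]
    return root


def recover_suffix_links(root):
    """Depth-first pass: sl(x) is reached by walking down from sl(p(x))."""
    root.suffix_link = root
    stack = list(root.children.values())   # start with the children of the root
    while stack:
        x = stack.pop()
        if x.parent is root:
            x.suffix_link = root            # dropping the only character leaves the empty string
        else:
            x.suffix_link = x.parent.suffix_link.children[x.char]
        stack.extend(x.children.values())   # a parent is always linked before its children


def path_label(node):
    """String spelled out on the path from the root to node."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


root = build_trie("GATTACA$")
recover_suffix_links(root)
x = root.children["T"].children["A"]
print(path_label(x), "->", path_label(x.suffix_link))  # TA -> A
```

In a compressed suffix tree the single dictionary lookup becomes a skip/count walk over edge lengths, but the order of work is the same: the parent's link is known before any child is visited.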

Choosing t

Note: t is also the threshold for partition size

M ≥ n/4 + ((0.7 × 40) + 16)t + (0.7 × 40)t

• M = available main memory
• n/4 = memory for the input (in compressed form)
• 0.7 = ratio of internal to external nodes: # internal nodes = 0.7 × (# external nodes)
• 40 and 16 = sizes, in bytes, of internal and external nodes
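
Solving the inequality for t gives t ≤ (M − n/4) / 72, since ((0.7 × 40) + 16) + (0.7 × 40) = 72. The sketch below evaluates that bound; the function name and the sample figures (2 GiB of memory, a 3-billion-base input) are assumptions for illustration, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction
from math import floor


def max_threshold(memory_bytes, n, internal_size=40, leaf_size=16,
                  internal_ratio=Fraction(7, 10)):
    """Largest t satisfying M >= n/4 + ((0.7*40)+16)t + (0.7*40)t."""
    per_leaf = (internal_ratio * internal_size + leaf_size) + internal_ratio * internal_size
    budget = Fraction(memory_bytes) - Fraction(n, 4)   # memory left after the compressed input
    if budget <= 0:
        raise ValueError("the compressed input alone does not fit in memory")
    return floor(budget / per_leaf)


# Hypothetical figures: 2 GiB of RAM and an input of 3 billion bases.
print(max_threshold(2 * 2**30, 3_000_000_000))  # 19409495
```

Any t at or below this bound satisfies the constraint, so the value computed here is an upper limit rather than a recommended setting.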

Computational Complexity

• Prefix Creation Phase
  – O(nL) time, where L = longest prefix length
  – O(n + |Σ^(L+1)|) space

• Partitioning Phase
  – Input is broken into r partitions, each of size t
  – O(t) time/space for each partition ⇒ r × O(t) = O(n)
  – Disk I/Os: O(r × m), since at most m prefixed suffix subtrees can be created for each partition

Computational Complexity

• Merging Phase
  – Each merge operation can be O(p), where p = |longest common prefix|
  – Across all prefixes, merging is O(p × n), since the number of nodes in the suffix tree is bounded by n
  – In the worst case p can be O(n), therefore merging is O(n²)
  – Disk I/Os: O(r × m)

Computational Complexity

• Suffix Link Recovery Phase
  – The final suffix trees contain O(n) internal nodes
  – Each suffix link recovery takes a constant set of operations

• Putting it all together…
  – O(n²) time, since the merge phase is the most expensive
  – O(n) space

Experimental Setup

• Compared to
  – TOP-Q and DynaCluster (maintain suffix links)
  – TDD (no suffix links)

• Performed on Linux with
  – 2 GB RAM for the human genome and 512 MB for the others
  – 288 GB disk space
  – TRELLIS written in C++ and compiled with g++
  – Other algorithms obtained from their authors

Experimental Results

TRELLIS vs. TOP-Q and DynaCluster

For 200 Mbp, DynaCluster did not terminate even after 8 hours, whereas TRELLIS took 13 min

Experimental Results

TRELLIS vs. TDD

• TDD uses four different buffers (string, suffix, temp and tree)

• 200 Mbp requires only the last 2 buffers

• This saves the additional I/O incurred in other cases

Experimental Results

TRELLIS vs. TDD

• TDD is built using a memory-optimized suffix tree method

• The difference is not significant for the human genome, as TDD needs to be run in 64-bit mode

Experimental Results

TRELLIS vs. TDD – Query time

• TDD does not store edge lengths; they are determined by examining the children

• An internal node has a pointer to only one child, so all children are scanned linearly for every query
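
The cost of a single child pointer can be seen in a generic first-child/next-sibling layout. This sketch is an assumption used for illustration and is not TDD's actual on-disk format; all names are hypothetical.

```python
class SiblingNode:
    """Node that stores only its first child; the rest are reached through siblings."""

    def __init__(self, char):
        self.char = char
        self.first_child = None
        self.next_sibling = None


def add_child(node, ch):
    """Append a child at the end of the sibling chain."""
    child = SiblingNode(ch)
    if node.first_child is None:
        node.first_child = child
    else:
        last = node.first_child
        while last.next_sibling is not None:
            last = last.next_sibling
        last.next_sibling = child
    return child


def find_child_linear(node, ch):
    """Scan the sibling chain; returns (child or None, number of nodes inspected)."""
    steps = 0
    child = node.first_child
    while child is not None:
        steps += 1
        if child.char == ch:
            return child, steps
        child = child.next_sibling
    return None, steps


root = SiblingNode("")
for c in "ACGT":
    add_child(root, c)

print(find_child_linear(root, "A")[1])  # 1
print(find_child_linear(root, "T")[1])  # 4
```

Every step of a query pays this scan at each node on the path, whereas a node that indexes its children directly can jump to the right child in one lookup.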

Conclusions

• TRELLIS
  – Solves the data skew problem with variable-length prefixes
  – Scales gracefully to very large sequences
  – No disk I/O overhead, as it works with suffix trees that are guaranteed to fit in memory
  – Exhibits faster construction and query times than other disk-based algorithms

Future Work

• Plan to make TRELLIS applicable to a wider range of alphabets (e.g. the English alphabet)

• No buffering strategy is required for the human genome, but one is to be built for a generalized suffix tree composed of many large genomes

• Parallelize TRELLIS, since its partitioning and merging steps seem ideally suited to it