Memory-aware BWT by Segmenting Sequences


presented by Jiaying Wang

April 12, 2012

Northeastern University, China

The 14th Asia-Pacific Web Conference (APWeb)

Motivation

• Most interesting massive data sets contain string data (web data, record data, genome data, etc.)

• BWT as a full text index provides fast substring search over large text collections

• Building the BWT has an enormous memory cost: n log n + n log σ bits

Preliminaries

• text: T[0..n − 1], T[i] ∈ Σ, |Σ| = σ

• We add a $ to the end of the text. $ does not belong to Σ

• T[i..j] is the sequence starting at position i and ending at position j
– empty string iff i > j
– prefix iff i = 0
– suffix iff j = n − 1

Problem definition

• Let T[0..n−1] be a text and P[0..m−1] be a query. The subsequence matching problem is to find all the start positions of occurrences of P in T, i.e. {i | 0 ≤ i ≤ n − m, T[i..i+m−1] = P[0..m−1]}.

• We take the memory cost into account.

• The process should guarantee query efficiency and low memory cost at the same time.

BWT transformation

text: mississippi$

bwt: ipssm$pissii

Rotations in text order: mississippi$, ississippi$m, ssissippi$mi, sissippi$mis, issippi$miss, ssippi$missi, sippi$missis, ippi$mississ, ppi$mississi, pi$mississip, i$mississipp, $mississippi

Sorted rotations with the suffix array (SA), the first column (F) and the last column (L):

SA  F            L
11  $ mississipp i
10  i $mississip p
 7  i ppi$missis s
 4  i ssippi$mis s
 1  i ssissippi$ m
 0  m ississippi $
 9  p i$mississi p
 8  p pi$mississ i
 6  s ippi$missi s
 3  s issippi$mi s
 5  s sippi$miss i
 2  s sissippi$m i
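The transformation on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes naive suffix sorting, which is fine for a slide-sized example but far too slow and memory-hungry for large texts. The function names `suffix_array` and `bwt_from_sa` are illustrative and do not come from the talk.

```python
def suffix_array(text):
    """Return the suffix array of text + '$' by naive sorting.

    Assumes every character of text sorts after '$' in ASCII
    (true for letters and digits).
    """
    t = text + "$"
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt_from_sa(text, sa):
    """BWT[i] is the character that cyclically precedes suffix sa[i]."""
    t = text + "$"
    return "".join(t[i - 1] for i in sa)  # i == 0 wraps around to '$'


sa = suffix_array("mississippi")
bwt = bwt_from_sa("mississippi", sa)
print(sa)   # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(bwt)  # ipssm$pissii
```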

Backward search on BWT

l ← 0, h ← bwt.length

For i from pat.length − 1 down to 0
    k = pat[i]
    l = C[k] + occ(k, l)
    h = C[k] + occ(k, h)

Return h − l
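The pseudocode above can be turned into a small runnable sketch. It assumes the usual FM-index conventions, which the slide does not spell out: `C[c]` is the number of characters in the text that are smaller than `c`, and `occ(c, i)` is the number of occurrences of `c` in `bwt[0:i]`. The index is built by naive sorting and `occ` is a linear scan, so this only illustrates the logic, not the performance of a real implementation.

```python
def build_index(text):
    """Build SA, BWT and the C table for text + '$' (naive construction)."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = "".join(t[i - 1] for i in sa)
    C, total = {}, 0
    for c in sorted(set(t)):
        C[c] = total          # characters strictly smaller than c
        total += t.count(c)
    return sa, bwt, C


def occ(bwt, c, i):
    """Number of occurrences of c in bwt[0:i] (naive linear scan)."""
    return bwt.count(c, 0, i)


def backward_search(bwt, C, pat):
    """Return the half-open SA range [l, h) of suffixes that start with pat."""
    l, h = 0, len(bwt)
    for c in reversed(pat):
        if c not in C:
            return 0, 0
        l = C[c] + occ(bwt, c, l)
        h = C[c] + occ(bwt, c, h)
        if l >= h:
            return 0, 0
    return l, h


sa, bwt, C = build_index("mississippi")
l, h = backward_search(bwt, C, "ssi")
print(h - l, sorted(sa[l:h]))  # 2 [2, 5]
```

The number of occurrences is `h - l`, and the start positions are the suffix array entries in that range.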

searching "ssi"

[Figure: backward search for "ssi" traced on the sorted rotations of mississippi$, showing the F and L columns]

Memory cost analysis

• Building the BWT has an enormous memory cost.

• n log n + n log σ bits, about 5n bytes (1 GB of text → 5 GB of memory).

• For example: mississippi

mississippi → mississippi$

SA: 11 10 7 4 1 0 9 8 6 3 5 2

bwt: ipssm$pissii

12×4 + 12 = 12×5 bytes (4-byte SA entries plus 1-byte characters)
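The arithmetic on this slide can be checked with a short sketch. The 4-byte suffix array entry and 1-byte character sizes are assumptions chosen to match the "about 5n bytes" figure; the function names are illustrative.

```python
import math


def bwt_build_bits(n, sigma):
    """Theoretical cost from the slide: n log n + n log(sigma) bits."""
    return n * math.log2(n) + n * math.log2(sigma)


def bwt_build_bytes(n, sa_entry_bytes=4, char_bytes=1):
    """Practical cost, assuming 4-byte SA entries and 1-byte characters."""
    return n * (sa_entry_bytes + char_bytes)


print(bwt_build_bytes(12))             # mississippi$: 60 bytes
print(bwt_build_bytes(2**30) / 2**30)  # 1 GiB of text: 5.0 GiB
```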

Our idea(1/2)

mississippi

missis sippi

search ssi

Loading one segment at a time helps us save memory

How do we find an occurrence that is split across two segments?

Our idea(2/2)

mississippi

mississi issippi

search ssi

Oops, we find the same occurrence a second time

BWT on Overlapped Segments

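[Figure: the text T is divided into overlapping segments T1, T2, …, Tk (with lengths labelled L and l), and a separate index BWT1, BWT2, …, BWTk is built on each segment]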

Searching cases

• Prerequisite: query length ≤ l

• For the second case, we have to remove duplicate results

Filtering method

Filter interval f = l - m

All the occurrences starting at positions in a filter interval should be filtered.
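The overlap-and-filter idea can be sketched as follows. Several things here are assumptions rather than the authors' implementation: `str.find` stands in for the per-segment BWT search, `seg_len` and `overlap` are hypothetical parameter names for the segment length and the overlap l, and the filter drops every occurrence that lies entirely inside the leading overlap of a segment (local start ≤ l − m). The exact boundary convention behind f = l − m on the slide is not recoverable from the text.

```python
def find_all(segment, pat):
    """Stand-in for a per-segment BWT search: all start positions of pat."""
    hits, i = [], segment.find(pat)
    while i != -1:
        hits.append(i)
        i = segment.find(pat, i + 1)
    return hits


def overlapped_segments(text, seg_len, overlap):
    """Yield (start, segment); consecutive segments share `overlap` characters."""
    step = seg_len - overlap
    start = 0
    while True:
        yield start, text[start:start + seg_len]
        if start + seg_len >= len(text):
            break
        start += step


def search_overlapped(text, pat, seg_len, overlap):
    """Search one segment at a time and filter duplicates from the overlaps."""
    m = len(pat)
    if not 0 < m <= overlap < seg_len:
        raise ValueError("need 0 < len(pat) <= overlap < seg_len")
    result = []
    for start, seg in overlapped_segments(text, seg_len, overlap):
        for local in find_all(seg, pat):
            # An occurrence lying entirely inside the leading overlap was
            # already reported by the previous segment, so filter it out.
            if start > 0 and local + m <= overlap:
                continue
            result.append(start + local)
    return result


print(search_overlapped("mississippi", "ssi", seg_len=8, overlap=4))  # [2, 5]
```

With `seg_len=8` and `overlap=4` the segments are "mississi" and "issippi", as in the earlier example; the occurrence at position 5 is found in both segments and reported once.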


Searching algorithm

BWT on Disjoint Segments

[Figure: the text T is divided into disjoint segments T1, T2, …, Tk, and a separate index BWT1, BWT2, …, BWTk is built on each segment]

Searching cases

• For the second case, we need to
– 1 Find a suffix of the query that is a prefix of a segment.
– 2 Verify that the remaining prefix of the query matches the end of the left segment.
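The two steps above can be sketched on plain strings. This is an illustration under stated assumptions, not the authors' algorithm: `str.find`, `startswith` and `endswith` stand in for the BWT-based search, suffix check and prefix verification, `seg_len` is a hypothetical parameter name, and the query is assumed to be no longer than one segment so that an occurrence touches at most two segments. The loop over split points is quadratic in m, unlike the Θ(m) suffix check claimed in the talk.

```python
def find_all(segment, pat):
    """Stand-in for a per-segment BWT search: all start positions of pat."""
    hits, i = [], segment.find(pat)
    while i != -1:
        hits.append(i)
        i = segment.find(pat, i + 1)
    return hits


def search_disjoint(text, pat, seg_len):
    """Search disjoint segments, including occurrences that cross a boundary."""
    m = len(pat)
    if not 0 < m <= seg_len:
        raise ValueError("need 0 < len(pat) <= seg_len")
    segs = [text[i:i + seg_len] for i in range(0, len(text), seg_len)]
    hits = []
    for j, seg in enumerate(segs):
        start = j * seg_len
        # Case 1: the occurrence lies inside a single segment.
        hits.extend(start + p for p in find_all(seg, pat))
        # Case 2: the occurrence crosses the boundary with the left segment.
        if j > 0:
            left = segs[j - 1]
            for cut in range(1, m):
                # Step 1: a suffix of the query is a prefix of this segment.
                # Step 2: the remaining prefix ends the left segment.
                if seg.startswith(pat[cut:]) and left.endswith(pat[:cut]):
                    hits.append(start - cut)
    return sorted(hits)


print(search_disjoint("mississippi", "ssi", seg_len=6))  # [2, 5]
```

With `seg_len=6` the segments are "missis" and "sippi"; position 2 is found inside the first segment and position 5 is found by the boundary check.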

Suffix checking

Time complexity: Θ(m)

Prefix verification

• To verify the prefix, we can
– 1 keep the text (wastes space)
– 2 reconstruct the text on the fly (wastes a little time)
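Reconstructing text on the fly is possible because the BWT is invertible: following the LF mapping from the row that starts with $ yields the text from its last character backwards. The sketch below recovers only the last `count` characters of a segment, which is what a prefix verification against the left segment would need. The function name and the naive counting are illustrative assumptions, not the authors' code.

```python
def last_chars_from_bwt(bwt, count):
    """Recover the last `count` characters of the text from its BWT."""
    # C[c] = number of characters in the BWT strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt)):
        C[c] = total
        total += bwt.count(c)
    row = 0      # row 0 starts with '$', so bwt[0] is the last text character
    out = []
    for _ in range(count):
        c = bwt[row]
        if c == "$":
            break                             # reached the start of the text
        out.append(c)
        row = C[c] + bwt.count(c, 0, row)     # LF mapping (naive rank)
    return "".join(reversed(out))


print(last_chars_from_bwt("ipssm$pissii", 4))  # ippi
```

Each step costs one rank query, so verifying a prefix of length p takes p LF steps on the left segment's index.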

Searching algorithm

Analysis

• Overlap method
– Memory cost: (n + l + k) × (log σ + log(n + l + k) − log k)/k
– Time complexity: Θ(occ + δ + mk)

• Backwalk method
– Memory cost: n(log σ + log n − log k)/k bits
– Time complexity: Θ(occ + (η + k)m)
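The backwalk memory formula matches the cost of indexing a single segment of n/k characters, since (n/k)(log(n/k) + log σ) = n(log σ + log n − log k)/k. The sketch below evaluates that formula against the n log n + n log σ bits of a full index; the function names and the sample values of n, σ and k are illustrative assumptions, not figures from the experiments.

```python
import math


def full_bwt_bits(n, sigma):
    """Cost of indexing the whole text: n log n + n log(sigma) bits."""
    return n * (math.log2(n) + math.log2(sigma))


def backwalk_segment_bits(n, sigma, k):
    """Backwalk formula from the slide: n(log sigma + log n - log k)/k bits."""
    return n * (math.log2(sigma) + math.log2(n) - math.log2(k)) / k


n, sigma = 2**30, 4
for k in (1, 4, 16, 64):
    ratio = backwalk_segment_bits(n, sigma, k) / full_bwt_bits(n, sigma)
    print(f"k={k}: {ratio:.4f} of the full-index memory")
```

With k = 1 the formula reduces to the full-index cost, and the per-segment cost falls slightly faster than 1/k because the suffix array entries also get shorter.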

Experiment

• Environment
– C++ language
– PC with 2.93 GHz Intel Core CPU
– 4 GB main memory
– Ubuntu operating system (Linux distribution)

• Data sets
– English text from the Pizza&Chili Corpus
– Genome sequence from UCSC goldenPath

Performance on English

[Plots: memory cost, build time, query time (two panels)]

Performance on genome

[Plots: memory cost, build time, query time (two panels)]

More performance

Conclusion

• We propose a novel variant of BWT called S-BWT

• Our index uses less memory than BWT

• Two query methods based on S-BWT

• Our method is faster than the BWT method on large texts

Thank you!

Q&A
