Memory-aware BWT by Segmenting Sequences
presented by Jiaying Wang
April 12, 2012
Northeastern University, China
The 14th Asia-Pacific Web Conference (APWeb)
Motivation
• Most interesting massive data sets contain string data (web data, record data, genome data, etc.)
• BWT as a full text index provides fast substring search over large text collections
• Building the BWT has an enormous memory cost: n log n + n log σ bits
Preliminaries
• Text: T[0..n − 1], T[i] ∈ Σ, |Σ| = σ
• We append a $ to the end of the text; $ does not belong to Σ
• T[i..j] is the sequence starting at position i and ending at position j
  – empty string iff i > j
  – prefix iff i = 0
  – suffix iff j = n − 1
Problem definition
• Let T[0..n−1] be a text and P[0..m−1] be a query. The subsequence matching problem is to find all start positions of occurrences of P in T, i.e. {i | 0 ≤ i ≤ n − m; T[i..i+m−1] = P[0..m−1]}.
• We take the memory cost into account.
• The process should guarantee query efficiency and low memory cost at the same time.
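The problem definition can be pinned down with a brute-force reference, useful as ground truth when checking an index-based search. This is only a sketch: the function name `find_all` is hypothetical, and it uses no index at all.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every start position i with text[i:i+m] == pattern (brute force)."""
    n, m = len(text), len(pattern)
    # Valid start positions run from 0 to n - m inclusive.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]

print(find_all("mississippi", "ssi"))  # [2, 5]
```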
BWT transformation
text: mississippi$
Sorted rotations (F = first column, L = last column) with suffix array SA:
SA  F            L
11  $ mississipp i
10  i $mississip p
 7  i ppi$missis s
 4  i ssippi$mis s
 1  i ssissippi$ m
 0  m ississippi $
 9  p i$mississi p
 8  p pi$mississ i
 6  s ippi$missi s
 3  s issippi$mi s
 5  s sippi$miss i
 2  s sissippi$m i
bwt (column L): ipssm$pissii
LF: mapping from each character in L to its position in F
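A minimal sketch of how the SA and bwt values on this slide can be reproduced. It sorts whole suffixes directly, which is quadratic or worse and meant only for small examples; the function names are hypothetical, and it relies on `$` sorting before the letters (true in ASCII).

```python
def suffix_array(text: str) -> list[int]:
    # Sort suffix start positions by the suffix they begin (demo only, not efficient).
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_sa(text: str, sa: list[int]) -> str:
    # L[i] is the character just before suffix sa[i]; index -1 wraps to '$' for sa[i] == 0.
    return "".join(text[i - 1] for i in sa)

text = "mississippi$"
sa = suffix_array(text)
print(sa)                     # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(bwt_from_sa(text, sa))  # ipssm$pissii
```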
Backward search on BWT
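l ← 0, h ← bwt.length
For i from pat.length − 1 down to 0
    k = pat[i]
    l = C[k] + occ(k, l)
    h = C[k] + occ(k, h)
Return h − l
(C[k]: number of characters in the text smaller than k; occ(k, i): number of occurrences of k in bwt[0..i−1])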
searching "ssi"
(Figure: the sorted-rotation matrix of mississippi$ with its F and L columns; the LF mapping narrows the matching rows one query character at a time.)
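The backward search can be sketched in runnable form as follows. `C` and `occ` follow the slide; computing `occ` with a linear scan is an assumption made for brevity (a real index would use a rank structure), and the suffix array is hard-coded from the mississippi$ example to turn the row interval into positions.

```python
from collections import Counter

def backward_search(bwt: str, pattern: str) -> tuple[int, int]:
    """Return the half-open row interval [l, h) of rotations that start with pattern."""
    # C[c] = number of characters in the text that sort strictly before c.
    counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c: str, i: int) -> int:
        # Occurrences of c in bwt[0:i]; linear scan here, rank structure in practice.
        return bwt.count(c, 0, i)

    l, h = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        l = C[c] + occ(c, l)
        h = C[c] + occ(c, h)
        if l >= h:          # interval became empty: no occurrence
            return 0, 0
    return l, h

sa = [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]   # suffix array of "mississippi$"
l, h = backward_search("ipssm$pissii", "ssi")
print(h - l, sorted(sa[l:h]))  # 2 [2, 5]
```

For "ssi" the interval goes from [0, 12) to [1, 5) after "i", then [8, 10) after "s", then [10, 12) after the second "s", so there are two occurrences.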
Memory cost analysis
• Enormous memory cost for building BWT.
• n log n + n log σ bits, about 5n bytes (1G → 5G).
• For example: mississippi
mississippi → mississippi$
SA: 11 10 7 4 1 0 9 8 6 3 5 2
bwt: ipssm$pissii
12×4 + 12 = 12×5 bytes (4 bytes per SA entry, 1 byte per character)
Our idea(1/2)
mississippi
missis sippi
search ssi
Loading one segment at a time saves memory
How do we find an occurrence that is split across segments?
Our idea(2/2)
mississippi
mississi issippi
search ssi
Oops, we find another one
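The two splits of mississippi above can be replayed with a small sketch. Each segment is searched on its own with a brute-force helper standing in for a per-segment BWT; the names `find_all` and `search_segments` are hypothetical.

```python
def find_all(text, pattern):
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

def search_segments(segments, pattern):
    # segments: list of (global_start, segment_text); each one is searched independently.
    return [start + i for start, seg in segments for i in find_all(seg, pattern)]

disjoint = [(0, "missis"), (6, "sippi")]
overlapped = [(0, "mississi"), (4, "issippi")]
print(search_segments(disjoint, "ssi"))    # [2]
print(search_segments(overlapped, "ssi"))  # [2, 5, 5]
```

The disjoint split misses the occurrence at position 5 because it straddles the cut, while the overlapped split finds it but reports it twice.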
BWT on Overlapped Segments
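(Figure: the text T is divided into k overlapped segments T1, T2, …, Tk, with lengths labeled L and l; a BWT is built on each segment, giving BWT1, BWT2, …, BWTk.)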
Searching cases
• Prerequisite: query length ≤ l
• In the second case, we have to remove duplicate results
Filtering method
Filter interval f = l - m
All the occurrences starting at positions in a filter interval should be filtered.
Searching algorithm
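A minimal sketch of how a search over overlapped segments might look. Several things here are assumptions rather than the authors' algorithm: the parameter names `stride` and `overlap`, the brute-force `find_all` standing in for a per-segment BWT search, and the duplicate rule, which simply drops any hit that starts past the stride because the next segment reports it as well. The exact filter interval on the slides may be defined differently.

```python
def find_all(text, pattern):
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

def overlapped_segments(text, stride, overlap):
    # Segment j covers text[j*stride : j*stride + stride + overlap].
    return [(s, text[s:s + stride + overlap]) for s in range(0, len(text), stride)]

def search_overlapped(text, pattern, stride, overlap):
    if len(pattern) > overlap:
        raise ValueError("query must be no longer than the overlap")
    hits = []
    for start, seg in overlapped_segments(text, stride, overlap):
        for i in find_all(seg, pattern):   # stand-in for a per-segment BWT search
            if i < stride:                 # later hits are reported by the next segment
                hits.append(start + i)
    return hits

print(search_overlapped("mississippi", "ssi", stride=4, overlap=4))  # [2, 5]
```

Only one segment needs to be in memory at a time, which is the point of the segmentation.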
BWT on Disjoint Segments
(Figure: the text T is divided into k disjoint segments T1, T2, …, Tk; a BWT is built on each segment, giving BWT1, BWT2, …, BWTk.)
Searching cases
• For the second case, we need to
  – 1. Find a suffix of the query that is a prefix of a segment.
  – 2. Verify that the remaining prefix of the query ends the left segment.
Suffix checking
Time complexity: Θ(m)
Prefix verification
• To verify the prefix, we can
  – 1. Keep the text (wastes space).
  – 2. Revert the text from the BWT on the fly (wastes a little time).
Searching algorithm
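A minimal sketch of the two searching cases on disjoint segments. It is an illustration under stated assumptions, not the authors' algorithm: segments are plain strings, `find_all` stands in for a per-segment BWT search, every segment is assumed to be at least as long as the query, and the suffix check tries each cut point naively instead of in Θ(m).

```python
def find_all(text, pattern):
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

def search_disjoint(text, pattern, seg_len):
    m = len(pattern)
    if seg_len < m:
        raise ValueError("sketch assumes segments at least as long as the query")
    segments = [text[s:s + seg_len] for s in range(0, len(text), seg_len)]
    hits = []
    for j, seg in enumerate(segments):
        base = j * seg_len
        # Case 1: the occurrence lies entirely inside one segment.
        hits += [base + i for i in find_all(seg, pattern)]
        # Case 2: the occurrence crosses the boundary between segments j-1 and j.
        if j == 0:
            continue
        left = segments[j - 1]
        for cut in range(1, m):  # pattern[:cut] on the left, pattern[cut:] on the right
            if seg.startswith(pattern[cut:]) and left.endswith(pattern[:cut]):
                hits.append(base - cut)
    return sorted(hits)

print(search_disjoint("mississippi", "ssi", seg_len=6))  # [2, 5]
```

Here `left.endswith` plays the role of prefix verification with the text kept in memory; reverting the text from the BWT would replace that call.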
Analysis
• Overlap method
  – Memory cost: (n + l + k) × (log σ + log(n + l + k) − log(k))/k
  – Time complexity: Θ(occ + δ + mk)
• Backwalk method
  – Memory cost: n(log σ + log n − log k)/k bits
  – Time complexity: Θ(occ + (η + k)m)
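To get a feel for the numbers, the backwalk memory formula can be evaluated next to the n log n + n log σ cost of a BWT built on the whole text. The sizes below are hypothetical and chosen as powers of two so the arithmetic is exact; the function names are illustrative only.

```python
import math

def full_bwt_bits(n: int, sigma: int) -> float:
    # Suffix array (n log n) plus text (n log sigma), in bits.
    return n * (math.log2(n) + math.log2(sigma))

def segment_bits(n: int, sigma: int, k: int) -> float:
    # n (log sigma + log n - log k) / k bits, the backwalk formula with k segments.
    return n * (math.log2(sigma) + math.log2(n) - math.log2(k)) / k

n, sigma, k = 2**30, 256, 16   # hypothetical: 1 Gi characters, byte alphabet, 16 segments
print(full_bwt_bits(n, sigma) / 8 / 2**30)    # 4.75  (GiB)
print(segment_bits(n, sigma, k) / 8 / 2**20)  # 272.0 (MiB)
```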
Experiment
• Environment
  – C++ language
  – PC with 2.93 GHz Intel Core CPU
  – 4 GB main memory
  – Ubuntu operating system (Linux distribution)
• Data sets
  – English text from the Pizza&Chili Corpus
  – Genome sequence from UCSC goldenPath
Performance on English
(Figure panels: memory cost, build time, query time, query time)
Performance on genome
(Figure panels: memory cost, build time, query time, query time)
More performance
Conclusion
• We propose a novel variation of BWT called S-BWT
• Our index uses less memory than BWT
• Two query methods based on S-BWT
• Our method is faster than the BWT method on large texts
Thank you!
Q&A