High-throughput sequence alignment using Graphics Processing Units
Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney
UMD
Presented by Steve Rumble
Motivation
NGS technologies produce a ton of data
AB SOLiD: 22e6 25-mers
Others are even worse…
How does 200e6 50-mers sound?
Algorithms have been pushed hard, but typically assume the same workstation CPU
Wozniak and others showed Smith-Waterman (S-W) could be well-parallelised on special H/W. What of other algorithms/hardware?
Motivation
GPUs have recently evolved general purpose programmability (GPGPU)
E.g.: nVidia 8800 GTX
16 multiprocessors, 8 processors each => 128 stream processors
768MB onboard
1.35GHz clock
Almost a year old now…
Short GPU Overview
Highly parallel execution (hundreds of simultaneous operations)
Hundreds of gigaflops per chip!
Large on-board memories (up to 2GB)
Limitations:
No recursion (no stacks)
Each multiprocessor’s constituent processors execute same instruction
Thread Divergence due to conditionals hurts…
No direct host memory access
Small caches (locality is key)
High memory latency
No dynamic memory allocation (why one would ever do that, I don’t know)
Short GPU Overview
GPGPU environments
Previously had to reduce problems to graphics primitives… no more
Simplified C-like programming
Paper has very little detail, but they make it sound enticingly simple…
Each processor runs the same ‘kernel’
Muh-muh-muh… MUMmer!
Maximal Unique Match
Find longest match for each subsequence of a read (of reasonable length)
Employs Suffix Trees
MUMmerGPU
Plug-and-play replacement for MUMmer
MUMmer is not ‘arithmetic intensive’
Is the GPU a good fit?
Six-step process
1) Build Suffix Tree of reference genome (Ukkonen’s alg. – O(n)) on host CPU
2) Suffix Tree -> GPU Memory
3) Queries -> GPU Memory
4) Kick off the GPU…
5) Results -> Host Memory
6) Final processing on Host CPU
Suffix Trees
We want to find the longest match for a string (query) quickly
Suffix Trees permit O(m) string search, m = query length
Space complexity is O(n), n = length of the indexed string
But constants are apparently pretty big
Suffix Trees
Definition:
Each edge has an edge label
A substring of S
Non-empty (but can be just the terminating character)
A path label is the sequence of edge labels formed by traversing from root to leaf
1-1 correspondence of suffixes of S to path labels
Internal nodes have at least 2 children
n leaf nodes – one for each suffix of S
Suffix Trees
O(n) space
n leaf nodes => at most n – 1 internal nodes => n + (n – 1) + 1 = 2n nodes (worst case)
E.g., n = 3: n – 1 = 2 internal nodes, so 3 + 2 + root = 6 nodes
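A quick way to sanity-check the node count is to build a small suffix tree and count. The sketch below inserts every suffix naively in O(n^2) time, so it is not Ukkonen's algorithm; Node, build_suffix_tree and count_nodes are illustrative names, not anything from MUMmer.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (edge label, child)
        self.start = None   # start position of the suffix (leaves only)


def build_suffix_tree(s):
    """Naive O(n^2) construction: insert every suffix, splitting edges as needed."""
    if not s or s.count(s[-1]) != 1:
        raise ValueError("s must end with a unique terminating character such as '$'")
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            if rest[0] not in node.children:
                leaf = Node()
                leaf.start = i
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = node.children[rest[0]]
            k = 0  # length of the common prefix of the edge label and the remaining suffix
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def count_nodes(root):
    """Return (leaves, internal non-root nodes)."""
    leaves = internal = 0
    stack = [child for _, child in root.children.values()]
    while stack:
        node = stack.pop()
        if node.children:
            internal += 1
            stack.extend(child for _, child in node.children.values())
        else:
            leaves += 1
    return leaves, internal


if __name__ == "__main__":
    for s in ("TORONTO$", "BANANA$"):
        leaves, internal = count_nodes(build_suffix_tree(s))
        print(s, "leaves:", leaves, "internal:", internal, "bound 2n:", 2 * len(s))
```

For TORONTO$ (n = 8 counting the terminator) this gives 8 leaves and 2 internal nodes plus the root, comfortably under the 2n bound.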
Suffix Trees
Example: TORONTO$
‘$’ is terminating character
[Figure: suffix tree for TORONTO$; leaves are numbered with the start position of their suffix]
Suffix Trees
Example: TORONTO$
Searching for ‘ONT’
[Figure, shown four times as an animation: the suffix tree for TORONTO$]
‘ONT’ at position 3 in S
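The search itself is just a walk down from the root, consuming the pattern along edge labels; every leaf below the point where the pattern runs out is an occurrence. A minimal sketch, assuming a naively built tree (O(n^2) insertion, not Ukkonen) and the made-up names Node, build_suffix_tree and find_all:

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (edge label, child)
        self.start = None   # start position of the suffix (leaves only)


def build_suffix_tree(s):
    """Naive O(n^2) construction; s must end with a unique terminator."""
    if not s or s.count(s[-1]) != 1:
        raise ValueError("s must end with a unique terminating character such as '$'")
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            if rest[0] not in node.children:
                leaf = Node()
                leaf.start = i
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = node.children[rest[0]]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def find_all(root, pattern):
    """Return the sorted start positions of pattern, or [] if it does not occur."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return []
        label, child = node.children[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []
        node, rest = child, rest[k:]
    # Pattern consumed: every leaf below this point is an occurrence.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            positions.append(current.start)
        stack.extend(child for _, child in current.children.values())
    return sorted(positions)


if __name__ == "__main__":
    tree = build_suffix_tree("TORONTO$")
    print(find_all(tree, "ONT"))
```

The walk touches at most one character of the tree per character of the pattern, which is where the O(m) search time comes from (collecting the leaves afterwards costs extra, proportional to the number of hits).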
Suffix Trees
MUMmer wants to find all maximal unique matches for all suffixes:
E.g., for query ACCGTGCGTC, we want:
ACCGTGCGTC
CCGTGCGTC
CGTGCGTC
GTGCGTC
…
Up to some reasonable limit…
Don’t want to go back to root of tree each time…
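To make the goal concrete, here is a brute-force baseline that does restart from scratch for every query suffix: for each start position it reports the length of the longest prefix found anywhere in the reference. It uses plain substring tests rather than a suffix tree, and longest_match_lengths is a hypothetical name.

```python
def longest_match_lengths(reference, query):
    """For each start position i, the length of the longest prefix of query[i:]
    that occurs somewhere in reference (brute force, no suffix tree)."""
    lengths = []
    for i in range(len(query)):
        length = 0
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    print(longest_match_lengths("BANANA$", "ANAX"))
```

Note that consecutive results never drop by more than one: if a match of length L starts at position i, dropping its first character leaves a match of length L – 1 starting at i + 1. That is why a smarter traversal does not need to start over at the root each time.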
Suffix Trees
Suffix Links
All internal, non-root nodes have a suffix link to another node
If x is a single character and a is a (possibly empty) string, then the node v whose path-label is xa has a suffix link to the node v’ whose path-label is a.
Got that?
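A small sketch may help. Rather than building the tree, it leans on the fact that (with a unique terminator) the internal nodes correspond to the substrings that are followed by at least two different characters, and the suffix link of such a substring is the same substring with its first character dropped. This is a brute-force illustration only; branching_substrings and suffix_links are made-up names, and the empty string stands for the root.

```python
from collections import defaultdict


def branching_substrings(s):
    """Substrings of s followed by at least two distinct characters.
    With a unique terminator these are the path-labels of the internal,
    non-root nodes of the suffix tree."""
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            followers[s[i:j]].add(s[j])  # s[i:j] is followed by s[j]
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each internal node's path-label to the path-label its suffix link points at."""
    nodes = branching_substrings(s)
    links = {}
    for w in nodes:
        target = w[1:]  # drop the first character; "" means the root
        assert target == "" or target in nodes
        links[w] = target
    return links


if __name__ == "__main__":
    print(suffix_links("BANANA$"))
```

For BANANA$ the internal nodes are A, NA and ANA, with links ANA -> NA -> A -> root, so after matching ANA the search can hop to NA and carry on instead of restarting.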
Suffix Trees
Example: TORONTO$
Suffix Links… Don’t backtrack (bad ex.)
[Figure: suffix tree for TORONTO$]
Suffix Trees
Example: BANANA$
Better example of Suffix Links
[Figure: suffix tree for BANANA$; leaves are numbered with the start position of their suffix]
Suffix Trees
Example: BANANA$
Searching for suffixes of ‘ANANA’
[Figure, shown six times as an animation: the suffix tree for BANANA$]
Memory Limitations
Suffix trees take up a fair bit of memory
GPUs have 100’s of MBs, but this is still small
Divide the target sequence into ‘k’ segments with overlaps
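The splitting can be sketched as follows. The slides do not say how large the overlap is; the sketch assumes an overlap of (longest match of interest – 1) characters, which guarantees that any window of that length falls entirely inside some segment. split_with_overlap and its parameters are hypothetical.

```python
def split_with_overlap(seq, k, overlap):
    """Split seq into up to k segments; each segment is extended by `overlap`
    characters into the next one. Returns a list of (start offset, segment)."""
    if k <= 0 or overlap < 0:
        raise ValueError("k must be positive and overlap non-negative")
    if not seq:
        return []
    size = -(-len(seq) // k)  # ceiling division: characters owned by each segment
    return [(start, seq[start:start + size + overlap])
            for start in range(0, len(seq), size)]


if __name__ == "__main__":
    for start, segment in split_with_overlap("ACGTACGTTGCAACGTAGCT", 4, 4):
        print(start, segment)
```

Each segment gets its own suffix tree, and the start offsets let hits be mapped back to coordinates in the full sequence. The price is that the overlapping regions are indexed twice.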
Cache Optimisation
Memory latency high, cache performance crucial
We’re walking a tree here, not crunching numbers down an array
Can store read-only data in 2D textures; nVidia caching scheme optimises access
Re-order and squish tree nodes into ‘texel blocks’ such that:
Nodes near root are level-ordered (BFS)
Nodes further down are ordered with descendants
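One way to read that ordering rule: number the top few levels breadth-first, then lay out each remaining subtree contiguously, depth-first. The sketch below does that for a tree given as a dict of child lists. The depth cutoff (bfs_depth) is an assumption for illustration, since the slides do not say where the switch happens, and packing into 2D texel blocks is left out.

```python
from collections import deque


def layout_order(children, root, bfs_depth):
    """Return nodes in storage order: levels above bfs_depth breadth-first,
    then each deeper subtree contiguously in depth-first order."""
    order, deep_roots = [], []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth < bfs_depth:
            order.append(node)
            queue.extend((child, depth + 1) for child in children.get(node, []))
        else:
            deep_roots.append(node)  # laid out later, together with its descendants
    for subtree_root in deep_roots:
        stack = [subtree_root]
        while stack:
            node = stack.pop()
            order.append(node)
            stack.extend(reversed(children.get(node, [])))
    return order


if __name__ == "__main__":
    tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
    print(layout_order(tree, 0, bfs_depth=1))
```

The idea is that a query walking down from the root touches the shared top levels first, and once it commits to a subtree, the nodes it visits next sit near each other in memory.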
Cache Optimisation
[Figure: a tree with nodes numbered level by level]
1
2 3 4 5
6 7 8 9 10 11 12 13
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[Figure: a 4x8 grid of cells numbered in 2x2 blocks]
0 2 4 6 8 10 12 14
1 3 5 7 9 11 13 15
16 18 20 22 24 26 28 30
17 19 21 23 25 27 29 31
• Texture cache organized in 2x2 blocks.
• Try to place all children of a node in the same cache block
Shamelessly cribbed from: http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt
Cache Optimisation
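Reference Sequence stored in 4x2^16 blocks of a 2D array
Sequence: A B C D E F G H …
[Figure: the sequence fills each block column by column, four cells per column]
A E …
B F …
C G …
D H …
[Figure: a later block, drawn the same way]
α Φ …
β Χ …
Γ Ψ …
Δ Ω …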
Why? It worked well.
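Reading the figure as "fill each block column by column, four cells per column", the index arithmetic would look something like the sketch below. This is an interpretation of the slide, not the authors' code: block_coords and render are hypothetical names, and the defaults of 4 rows and 2^16 columns are taken from the slide text on the assumption that it means 4 x 2^16.

```python
def block_coords(i, rows=4, cols=2 ** 16):
    """Map sequence index i to (block, row, column), filling each block
    column by column with `rows` cells per column."""
    block, within = divmod(i, rows * cols)
    col, row = divmod(within, rows)
    return block, row, col


def render(seq, rows=4):
    """Show how the start of a sequence lands in the rows of one block."""
    return [" ".join(seq[r::rows]) for r in range(rows)]


if __name__ == "__main__":
    for line in render("ABCDEFGH"):
        print(line)
```

With this mapping, consecutive characters of the reference share a column, so a short run of the sequence stays within a few neighbouring cells of the 2D array.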
Cache Optimisation
Memory layouts heuristically determined
nVidia cache details not public
Cache optimisation improves execution speed ‘by several fold’.
Conclusions
GPGPU isn’t just good for ‘arithmetic intensive’ applications
5-11x speed-up for NGS data
Conclusions
Fine Print:
5-11x is for the Suffix Tree kernel on the GPU
Reality is different!
3.5x speed-up for real data in terms of total application runtime. Pretty constant across read lengths (35-700+ bp)
Careful management of memory layout is crucial
Authors claim several-fold performance increase (could be difference between some improvement and none)
Conclusions
Runtime dominated by serial parts of MUMmer
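The gap between the kernel and whole-application numbers is what Amdahl's law would predict. As a back-of-the-envelope sketch (my own model, not something from the paper), assume a fraction p of the original runtime is sped up by a factor k and the rest is untouched; the function names are made up.

```python
def overall_speedup(p, k):
    """Amdahl's law: total speed-up when a fraction p of the runtime is sped up k-fold."""
    return 1.0 / ((1.0 - p) + p / k)


def implied_parallel_fraction(total, k):
    """Invert Amdahl's law: the fraction p that yields `total` speed-up for kernel speed-up k."""
    return (1.0 - 1.0 / total) / (1.0 - 1.0 / k)


if __name__ == "__main__":
    for k in (5.0, 11.0):
        p = implied_parallel_fraction(3.5, k)
        print("kernel speed-up", k, "-> implied accelerated fraction", round(p, 3))
```

Pairing the 3.5x total with kernel speed-ups of 5x to 11x, this simple model puts the accelerated part at very roughly 79-89% of the original runtime; whatever stays serial caps the total, however fast the kernel gets.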
Food for Thought
8800 GTX costs ~$400, uses 100-150 watts
Quad Core 2 chip runs ~$250, uses 100-130 watts
Each core approx. 2x faster than their test CPU
MUMmerGPU maximally 3.5x faster than test CPU
What have we won here?
Food for Thought
Confusing reports
“Fast Exact String Matching on the GPU” (Schatz, Trapnell) claims up to 35x improvement
Earlier course paper (early/mid-2007)
Why from 35x down to 5-11x with MUMmerGPU?
My Impressions…
(…whatever they’re worth)
GPU is not a clear win (in this case)
Suffix trees seem unsuited:
Cache locality trouble
O(n) footprint, but multiplicative constants are still substantial
Host CPUs seem to be as good or better (in $ and watts)
My Impressions…
GPGPUs aren’t a great fit here
At least for this algorithm…
MUMmerGPU isn’t the order-of-magnitude win it claims to be
But this is a first-generation, general-purpose chip
geared toward number-crunching, not pointer-traversing
I don’t think we’ve seen the last (nor the best) of GPUs…