
CAAD BLASTP 2.0: NCBI BLASTP ACCELERATED WITH PIPELINED FILTERS

Atabak Mahram Martin C. Herbordt

Department of Electrical and Computer Engineering, Boston University; Boston, MA

email: [email protected], [email protected]

ABSTRACT

BLAST is a central application in bioinformatics and so has been the subject of numerous acceleration studies. The de facto standard version of this code, NCBI BLAST, uses complex heuristics which make it challenging to simultaneously achieve both high performance and exact agreement with the original output. In previous work, we have used novel FPGA-based filters that reduce the input database by over 99.99% without loss of sensitivity. In the present work there are two primary contributions. The first is a new mechanism to couple two of the filters in such a way that promising alignments can be found in a fraction of the previous time. The second is the pipelining of the three filters. This is a challenging load balancing problem since the work per filter drops by 5x-10x at both of the interfaces. Pipelining the filters has two benefits: it removes the need to reconfigure between passes and it reduces the off-chip bandwidth requirement. Together, these two enhancements more than double the performance over the previous best implementation. We currently have CAAD BLASTP working on Virtex-6 and Stratix-IV FPGAs with speed-ups of 9x and 15x, respectively, over the multithreaded original code running on an 8-core PC. We discuss FPGA features that cause this performance disparity. CAAD BLASTP scales easily and is appropriate for use in large FPGA-based servers.

1. INTRODUCTION

The fundamental insight that biologically significant polymers such as proteins and DNA can be abstracted into character strings (sequences) allows biologists to use approximate string matching (AM) to determine, for example, how a newly identified protein is related to those previously analyzed, and how it has diverged through mutation. Fast methods for AM, such as BLAST [1], are based on heuristics, and can match a typical sequence (a query) against a set of known sequences (e.g., the millions in the NR database) in just a few minutes, and only rarely cause significant matches to be missed [2]. Support for high performance BLAST has therefore received much attention, with, e.g., NCBI maintaining a large server that processes hundreds of thousands of searches per day [3]. For acceleration, FPGAs have probably been the most successful, with commercial products from TimeLogic [4] and Mitrionics [5] and several academic efforts [6, 7, 8, 9, 10, 11, 12]. GPU implementations described in published studies [13, 14] have so far only achieved performance similar to that of a multicore CPU.

This work was supported in part by the NIH through award #R01-RR023168-01A1. Web: www.bu.edu/caadlab.

Of the many versions of BLAST, NCBI BLAST [15] has become a de facto standard. This motivates the design criteria for accelerated BLAST codes: users expect not only that performance be significantly upgraded, but also that outputs match exactly those given by the original system. BLAST runs through several phases, details of which are given below, and returns some number of matches with respect to a statistical measure of likely significance. NCBI BLAST itself is a complex, highly-optimized system, consisting of tens of thousands of lines of code and a large number of heuristics beyond those of the original algorithm.

Creating an accelerated version that both matches the NCBI BLAST output and delivers significant acceleration is therefore challenging. A now commonly used method is prefiltering [16, 17]. The idea is to quickly reduce the size of the database to a small fraction, and then use the original NCBI BLAST code to process the query. Agreement is achieved as follows. The prefiltering is constructed to guarantee that its output is strictly more sensitive than the original code: that is, no matches are missed, but extra matches may be found. The latter can then be (optionally) removed by running NCBI BLAST on the reduced database.

In previous work we have described a filter that performs exhaustive ungapped alignment (EUA) at streaming rate [17, 6], the integration of EUA into NCBI BLASTP [10] (CAAD BLASTP), and the use of a two-hit filter in conjunction with the EUA filter [18]. The last of these achieves high performance by enabling a large number of database streams to be processed in parallel (30 or more for a Stratix-III), but also requires reconfiguration between phases. Both of these characteristics, high number of streams and reconfiguration, have become problematic with new generation devices and system configurations. First, in CAAD BLASTP the number of streams supported increases proportionally with FPGA feature count; the number of streams necessary to fully utilize the resources of Virtex-6/Stratix-IV generation FPGAs now exceeds the bandwidth typically provided to off-chip memory. And second, reconfiguration times continue to grow longer in parallel with FPGA size. When coupled with improved filter performance, we find that reconfiguration time now dominates the execution time.

In the present work there are two primary contributions. The first is pipelining the three filters: 2-hit, EUA, and Smith-Waterman (SW). This is a significant load balancing problem since the work per filter drops by 5x-10x at both of the interfaces, but has been achieved with little resource overhead. The second is a new mechanism to couple the two-hit and the EUA filters in such a way that roughly four times as many alignments can be skipped by the EUA filter as previously. The first contribution eliminates the need for reconfiguration and reduces the bandwidth requirement by more than a factor of 2. The second reduces by a factor of four the number of EUA filters needed to speed-match with the 2-hit filter.

The primary result is a transparent FPGA-accelerated NCBI BLASTP that achieves both output identical to the original and a factor of 9x to 15x improvement in performance over an 8-core PC; the speed-ups correspond to Virtex-6 and Stratix-IV FPGAs, respectively. The performance variation is due mostly to the difference in number of BRAM ports; this and other FPGA resource issues are discussed below. The overall significance of this work is that the new CAAD BLASTP is substantially faster than previous FPGA and GPU versions, processing queries with respect to the latest NR database in just a few seconds. This results in higher query throughput not only for BLAST, but for a number of more complex bioinformatics applications that use BLAST in the inner loop. CAAD BLASTP scales easily and is appropriate for use in large FPGA-based servers such as the Novo-G [19].

The rest of this manuscript is organized as follows. We begin with a review of BLAST, followed by an overview of NCBI BLAST, especially in how it differs from the original algorithm. Then comes the overall design, including the mechanisms we use to guarantee agreement. After that is a description of the filters themselves, followed by some practical concerns and results. We conclude with a discussion and future work.

2. BACKGROUND

2.1. Basics of AM for Biological Sequences

We briefly describe biological sequence matching and the classic BLAST algorithm. For details, please see one of the surveys (e.g., [20]). An alignment of two sequences is a one-to-one correspondence between their characters, without reordering, but with the possibility of some number of insertions or deletions (i.e., gaps or indels). In biological AM, an alignment score between two (sub)sequences is computed by combining the independently scored character matches, which themselves are determined a priori by biological significance. The highest scoring alignment between a query sequence of length m and a database of length n can be found in time O(mn) using dynamic programming (DP) techniques (e.g., Needleman-Wunsch and Smith-Waterman).
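To make the DP baseline concrete, the following is a minimal software sketch of the O(mn) recurrence in its Smith-Waterman (local alignment) form with a linear gap penalty; the match, mismatch, and gap values are illustrative assumptions rather than the substitution matrices (e.g., BLOSUM62) actually used for proteins.

# Minimal Smith-Waterman sketch with a linear gap penalty.
# The scores below are illustrative assumptions, not a real protein matrix.
def smith_waterman(query, subject, match=5, mismatch=-4, gap=-10):
    m, n = len(query), len(subject)
    # H[i][j] = best local alignment score ending at query[i-1], subject[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # align the two characters
                          H[i - 1][j] + gap,    # gap in the subject
                          H[i][j - 1] + gap)    # gap in the query
            best = max(best, H[i][j])
    return best

print(smith_waterman("MCLKKWYF", "CGWWWMYFC"))  # strings borrowed from Fig. 3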

For large databases, however, DP methods are impractical, motivating heuristic methods such as BLAST. BLAST is based on an observation about the typical distribution of high-scoring character matches in the DP alignment tableau: there are relatively few overall, and only a small fraction are promising. This promising fraction is often recognizable as proximate line segments along the main diagonal.

The original BLAST algorithm has three phases: identifying short sequences (words) with high match scores, extending those matches, and merging proximate extensions. In the first phase (seeding), the word size W is typically 3 for BLASTP and significance is determined using a scoring matrix and threshold score T (default 11). Nowadays, the preferred method of seeding depends on there being two hits on a diagonal (ungapped alignment) within a certain distance A (default 40).

In the second phase (extension), seeds are extended in both directions to form high-scoring segment pairs (HSPs). Extension stops when it ceases to be promising, i.e., when the drop off from the last maximum score exceeds a threshold X. An Evalue (expected value) is computed from the raw alignment score and other parameters. Database sequences with a sufficiently good Evalue, as selected by default or by user, are reported. The third phase is nowadays often replaced by a gapped extension based on DP; the O(nm) cost is not onerous when n is a small fraction of the original.
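The extension step can be sketched in a few lines; the following is an illustrative model of one-directional ungapped extension with an X-drop cutoff, where score() is a hypothetical stand-in for a substitution-matrix lookup (this is a sketch of the idea, not the NCBI implementation).

# Ungapped seed extension to the right with an X-drop cutoff (sketch).
def extend_right(query, subject, q_end, s_end, score, X):
    total = best = 0
    best_len = 0
    i = 0
    while q_end + i < len(query) and s_end + i < len(subject):
        total += score(query[q_end + i], subject[s_end + i])
        if total > best:
            best, best_len = total, i + 1
        elif best - total > X:          # drop-off from the last maximum exceeds X
            break
        i += 1
    return best, best_len               # score gained and length of the extension

toy_score = lambda a, b: 5 if a == b else -4    # assumed toy scoring function
print(extend_right("MCLKKWYF", "MCWKKWYF", 2, 2, toy_score, X=7))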

2.2. NCBI BLAST Overview

NCBI BLAST adds a number of phases and options, which we sketch here. There are two options, ungapped and gapped. Ungapped alignment proceeds initially as just presented. In gapped alignment, extension and evaluation are triggered only when ungapped alignment satisfies the ungapped threshold. In gapped extension, the extension drop-off threshold X also depends on gap-opening and gap-extension costs.

NCBI BLAST begins the evaluation phase by using an empirically determined cutoff score (cutoff) to keep only statistically significant HSPs. To improve sensitivity, a lower score is tolerated if there are multiple HSPs in a particular database sequence; the more HSPs, the lower the threshold. These multiple HSP scores are combined using Poisson and sum-of-scores methods for ungapped and gapped alignments, respectively. Finally, HSPs are organized into consistent groups and evaluated with the final threshold Evalue.

[Figure 1 diagram: the query and database enter NCBI BLAST, which precomputes the filtering criteria; the database streams through the ungapped Two-Hit filter (FPGA) and EUA filter (FPGA) and the gapped Smith-Waterman filter (FPGA), producing reduced databases (database', database''); the filtered database is formatted and returned, with the original search space info, to the NCBI BLAST modules, which produce the NCBI BLAST report.]

Fig. 1. High-level design of CAAD BLASTP.

2.3. Target Systems

We briefly state our assumptions about the target systems with FPGA-based accelerators. Nodes have 1-4 accelerator boards plugged into a high-speed connection (e.g., QPI, FSB, or PCI Express). The host node runs the main application program. Nodes communicate with the accelerators through function calls. Each accelerator board consists of 1-4 FPGAs, memory, and a bus interface. On-board memory is tightly coupled to each FPGA either through several interfaces (e.g., 6 x 32-bit, 3 x 64-bit) or a wide bus (128-bit) running at 333MHz or more. These interfaces can generally be virtualized into some number of streams. 4GB-64GB of memory per FPGA is currently standard.

3. CAAD BLASTP OVERVIEW

CAAD BLASTP uses three FPGA-based filters, all of which work on the same principle (see Figure 1). Each holds a copy of the query and executes as the database streams through it. The filter size is related to the query size. Generally the filter uses only a fraction of the chip area and so can be replicated some number of times. If the query is very large, then the filter is folded: it still operates correctly, but with a slowdown proportional to the number of folds.

Each filter thus runs in O(N), assuming that the query sequence is a small multiple of what can fit on a current FPGA, a characteristic of almost all proteins. Large protein databases such as NR currently have over 4GB of data: many off-the-shelf FPGA plug-in boards, e.g., the PROCe III from Gidel [21], can hold this in local memory and stream it through the FPGA in about a second. Integration into NCBI BLASTP is described in [10].

[Figure 2 diagram: query positions versus database position x, with hits from previous database positions recorded in per-diagonal Counters (Counter Frame), a Bit Vector, and a two-hit window of size A.]

Fig. 2. The Two-Hit filter is processing the xth 3-mer in the database sequence. There are three hits. The hit on alignment a_(x-j) is within A of a previous hit, and so is part of a two-hit event. This is determined by comparison with the corresponding Counter value; its bit in the Bit Vector will be set.

The Two-Hit Filter is based on the BLASTP two-hit seeding algorithm. All ungapped alignments are evaluated as to whether or not they contain a two-hit seed. A bit vector is generated containing a 1 or 0 at each position depending on whether or not the corresponding alignment contains a seed. The basic function is shown in Figure 2. The design is generally similar to that used in the Mercury BLAST seeding pass [7] and updated as described in [18].
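As a rough software reference model (not the hardware design), the bit-vector generation can be expressed with one last-hit counter per diagonal that is compared against the window A; word_hits and exact_word_hits below are hypothetical stand-ins for the real scored-neighborhood seeding.

# Software model of the two-hit bit vector: one bit per ungapped alignment
# (diagonal), set when two non-overlapping word hits fall on the same
# diagonal within distance A (cf. Fig. 2).
def two_hit_bitvector(query, subject, word_hits, A=40, W=3):
    n_diags = len(query) + len(subject) - 1
    last_hit = [None] * n_diags              # per-diagonal "Counter"
    bits = [0] * n_diags                     # the Bit Vector
    for q_pos, s_pos in sorted(word_hits(query, subject), key=lambda h: h[1]):
        d = s_pos - q_pos + len(query) - 1   # diagonal index
        prev = last_hit[d]
        if prev is not None and W <= s_pos - prev <= A:
            bits[d] = 1                      # this alignment contains a two-hit seed
        last_hit[d] = s_pos
    return bits

def exact_word_hits(query, subject, W=3):
    # toy seeding: exact W-mer matches only (real BLASTP scores a word neighborhood)
    return [(i, j) for i in range(len(query) - W + 1)
                   for j in range(len(subject) - W + 1)
                   if query[i:i + W] == subject[j:j + W]]

print(two_hit_bitvector("MCLKKWYF", "MCLAKWYF", exact_word_hits))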

The exhaustive ungapped alignment (EUA) filter is based on the TreeBLAST scheme described in [6] and shown in Figure 3. The TreeBLAST filter has two essential properties. First, it exhaustively evaluates every possible ungapped alignment between query and database. And second, it is fully pipelined: it evaluates the database at streaming rate.

Queries larger than the tree size can be handled through folding as shown in Figure 4. The key idea is that we can trade off space, and therefore parallelism, for latency. We take advantage of this idea when we combine the two filters as described in the next section.

The Exhaustive Gapped Alignment (SW) Filter is based on the Smith-Waterman algorithm and returns the highest scoring gapped local alignments for each sequence in the database. We base our SW filter on the version of Smith-Waterman described in [22].


[Figure 3 diagram: for each alignment of the streaming database (C G W W W M Y F C) against the query string (M C L K K W Y F), character-character match scores (e.g., 8, -2, -3, -3, -3, -1, 8, -2) form the score sequence of the alignment at the leaf nodes; the non-leaf (internal) nodes score the "covered" subsequences and output the maximum local alignment score for the alignment.]

Fig. 3. TreeBLAST structure for m = 8.

[Figure 4 diagram: a tree of leaf (L) and non-leaf (NL) nodes with control logic, shown with 1 fold and with 2 folds.]

Fig. 4. An 8-leaf tree folded once and twice. L = leaf node, NL = non-leaf node.

4. CAAD BLASTP DESIGN

The following scenario is for accelerated gapped NCBI BLAST. The Two-Hit pass provides "hints" to the EUA filter as to which alignments can be skipped. The EUA filter prunes at least 95% of the database so that it need not be processed by the SW filter. The SW filter prunes the database to 0.01% of the original. The reduced database is then processed by NCBI BLASTP.
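Functionally, the cascade can be viewed as successive set reductions, as in the following schematic model; the scoring functions and thresholds are hypothetical stand-ins, and the real filters run concurrently on the FPGA rather than as sequential passes.

# Schematic model of the filter cascade: each stage keeps only the
# database sequences that survive its threshold.
def filter_cascade(query, database, eua_score, sw_score, t_eua, t_sw):
    db1 = [s for s in database if eua_score(query, s) >= t_eua]  # EUA keeps <= 5%
    db2 = [s for s in db1 if sw_score(query, s) >= t_sw]         # SW keeps ~0.01% of original
    return db2   # db2 is then searched by unmodified NCBI BLASTP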

4.1. Fast Seed Finder

We describe the coupling between the first two filters, in particular, how stream processing is accelerated to "race past" an arbitrary number of 0s with only modest additional logic.

General Skipping. The idea behind general skipping is, on every cycle, to look ahead in the bit vector to find the next "one" (corresponding to the next alignment to be examined) and then slide the database the correct number of positions. Ideally, general skipping takes only the number of cycles equal to the number of ones in the bit vector. The additional hardware required, however, is complex. For the bit vector, the "look-ahead" logic is similar to a leading one detector used, e.g., in a floating point adder. For both the bit vector and the database stream, they must be able, on each cycle, to slide any number of positions up to the maximum number supported. This, in turn, requires that each register in the stream buffer have a multiplexor (MUX) that is large enough for every possible number of positions that could be skipped. It also requires complex routing logic. As a result, support for even a small range of choices makes the logic for general skipping more expensive than the original EUA filter.

Constant Skipping. The idea behind constant skipping is to limit the number of positions that can be skipped to a single number S that is determined experimentally. That is, the database stream skips either S positions or none (and advances either S positions or 1). If there is a sequence of S or more 0s, then S skipping is used; otherwise it is not. This scheme greatly simplifies the MUX logic, but has two drawbacks.

1) Only sequences of 0s of length S or greater can be taken advantage of; all shorter sequences of 0s are useless. This argues for a small S so that most runs of 0s can be used.
2) The maximum skip is also limited, which argues for a large S.

The optimal S is query dependent but generally follows the query length. A larger S works best with smaller queries: their alignments are much more likely to have been filtered.
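A simple cycle-count model, under the assumption that the stream advances one position per cycle unless the next S bit-vector entries are all 0, illustrates the trade-off around S (this is a software model, not the skipping hardware).

# Cycle-count model of constant skipping over a two-hit bit vector.
def constant_skip_cycles(bits, S):
    pos, cycles, n = 0, 0, len(bits)
    while pos < n:
        if pos + S <= n and bits[pos:pos + S].count(1) == 0:
            pos += S     # a run of S zeros is skipped in one cycle
        else:
            pos += 1     # otherwise advance one alignment per cycle
        cycles += 1
    return cycles

bits = [0] * 50 + [1] + [0] * 30 + [1] + [0] * 20   # assumed toy bit vector
print(constant_skip_cycles(bits, S=16), "cycles versus", len(bits), "without skipping")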

Folded Skipping. Recall that trees can be folded with the addition of a trivial amount of logic, and that a tree folded to 1/F its original size requires only 1/F the logic of the original, but takes F cycles per alignment rather than 1.

The idea behind folded skipping is to process unfiltered alignments in F cycles (as before), but to process the others in only 1 cycle. The control for this scheme is thus extremely simple: there is no need for complex look-ahead or routing logic. Rather, if the bit-vector value of an alignment is a 0, simply shift the database stream; if the value is a 1, then continue processing the alignment for a delay of another F-1 cycles. The hardware cost is a slight increase in control complexity; no other additional logic is needed.

The performance benefit of folded skipping can be demonstrated as follows. Assume that the bit vector for a size N database has O ones. Without skipping, an F-folded tree requires roughly F × N cycles to process the database. With skipping, the number of cycles is N + O × (F-1). If F is 16 and N/O is 20, then the speed-up is greater than 9×. This speed-up is independent of the distribution of 1s.
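The claimed speed-up follows directly from the two cycle counts; a quick check with the stated numbers:

# F*N cycles without skipping versus N + O*(F-1) cycles with folded skipping.
F = 16            # folding factor
N_over_O = 20     # database positions per unfiltered alignment (1 in the bit vector)
speedup = (F * N_over_O) / (N_over_O + (F - 1))
print(round(speedup, 2))   # ~9.14, i.e., greater than 9x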

Variable Folded Skipping. The drawback of folded skipping is that while 0s are processed F× as fast as 1s, they still take a cycle per character. Since the fraction of 0s (Z) is generally 98%-99% of the stream, processing these null alignments still takes Z / (Z + (1-Z) × F) of the cycles, or 75% to 85% for almost all query sequences.

[Figure 5 diagram: several Two-Hit filters share the database stream; a MUX and control logic (CTRL) couple each Db Stream i and its two-hit bit vector into the TreeBLAST unit, whose scores and indexes pass through a priority queue (PQ) to the dynamic programming module.]

Fig. 5. A single instance of a set of speed-matched, pipelined filters.

Variable folded skipping works as follows. During the F clock cycles required by the folded skipping mechanism when TreeBLAST is processing an alignment, a seed lookup module continues streaming the database until it finds the next unfiltered diagonal. The seed lookup module finds the next unfiltered alignment by implementing the constant skip mechanism with S=16. That is, each clock cycle it either skips one character or 16 consecutive characters until it finds the next unfiltered alignment. With a typical F = 16, 16 × 16 = 256 filtered diagonals (0s) can be skipped. The performance gain is dramatic: only a small number of cycles are spent processing 0s, improving performance of this phase by more than 4×. Note that variable folded skipping addresses another significant issue with the EUA filter: the need to process artifactual null alignments that are inserted as padding during start-up and tear-down of each database sequence.
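A quick check of the cycle shares involved, using the Z / (Z + (1-Z) × F) expression above and the assumed S = 16 lookup:

# Share of cycles spent on filtered (0) alignments with folded skipping alone,
# and the maximum number of 0s the S = 16 seed lookup can race past while
# TreeBLAST spends F = 16 cycles on one unfiltered alignment.
F, S = 16, 16
for Z in (0.98, 0.99):                  # typical fraction of 0s in the bit vector
    share = Z / (Z + (1 - Z) * F)
    print(Z, round(share, 2))           # ~0.75 and ~0.86
print(S * F, "filtered diagonals can be skipped per unfiltered alignment")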

4.2. Pipelined Filters

Figure 5 shows the overall scheme of a single filter bank: parallel database streams feed the two-hit filters, which in turn send the 0/1 stream to an EUA filter. A copy of the database is streamed to the EUA filter, where it is coupled with the 0/1 stream. This structure is then replicated some number of times depending on the size of the query and of the FPGA. In the final stage, the highest scoring database sequences from all of the banks are processed with a single DP module.

[Fig. 6 table: for query sizes 256, 512, 1024, and 2048, the fraction of 1s generated by the two-hit filter (0.008, 0.016, 0.020, and 0.027, respectively), the ratio of two-hit to EUA filters, and the resulting percentage of idle EUA cycles.]

Fig. 6. Balance between two-hit and EUA filters. The number of 1s generated by the two-hit filter is shown together with the query size. The optimal ratio for each query size range is shown in bold.

Query size   Reduction db to db'   SW filter folds (Virtex-6)   SW filter folds (Stratix-IV)
256          0.01                  7                            4
512          0.03                  4                            2
1024         0.05                  4                            2
2048         0.07                  5                            2

Fig. 7. Shown is the average fraction reduction of the database after the EUA filter (and before the SW filter) and the resulting optimal number of folds in the SW filter.

Speed matching between two-hit and EUA stages is accomplished as follows. The EUA filter processes data (a single sequence) from a single two-hit filter at a time. Processed sequences from the other two-hit filters in the bank are buffered. Through the mechanism described in the previous subsection, the EUA filter is capable of consuming 3 to 5 characters per cycle. That is, data from buffered filtered sequences are transferred to the EUA filter F characters at a time (currently F = 16). After processing the data of one two-hit filter, the EUA filter starts working on the next sequence from the next two-hit filter. In order to load balance, the database sequences are sorted by length and multiplexed among multiple two-hit filters. As a result, the time required to process successive sequences is nearly equal.
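A minimal sketch of the load balancing described above, assuming sequences are simply sorted by length and dealt round-robin to the two-hit streams of a bank (the actual multiplexing is done in hardware):

# Sort database sequences by length, then deal them round-robin across the
# two-hit streams so successive sequences on each stream take similar time.
def distribute(sequences, n_streams):
    streams = [[] for _ in range(n_streams)]
    for i, seq in enumerate(sorted(sequences, key=len)):
        streams[i % n_streams].append(seq)
    return streams

print([len(s) for s in distribute(["A" * k for k in (3, 9, 1, 7, 5, 2)], 3)])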

Coupling with the SW filter is accomplished as follows. For each database sequence, the EUA filter compares the maximum score generated with a constant threshold. If this score is larger than the threshold, the EUA filter writes the address of the sequence to a FIFO. The SW unit reads these addresses, streams the subject sequences, and calculates the maximum scores.
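The hand-off can be modeled as a FIFO of sequence addresses, as in this sketch; eua_max_score and sw_max_score are hypothetical stand-ins for the two filters' scorers.

from collections import deque

# EUA side: push addresses of sequences whose maximum ungapped score clears
# the threshold. SW side: pop addresses and rescore with gapped alignment.
def eua_to_sw(database, eua_max_score, sw_max_score, threshold):
    fifo = deque()
    for addr, seq in enumerate(database):
        if eua_max_score(seq) > threshold:
            fifo.append(addr)
    results = {}
    while fifo:
        addr = fifo.popleft()
        results[addr] = sw_max_score(database[addr])
    return results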

5. IMPLEMENTATION AND RESULTS

5.1. Replicating and Balancing the Components

We find the optimal number of two-hit filters per EUA filter by measuring the fraction of idle cycles in the EUA filter as a function of the number of two-hit filters and the query size. The results are shown in Figure 6: 3 to 5 two-hit filters per EUA filter is optimal.

The EUA filter enables the database to be reduced by at least 97% for most query sequences. The SW filter can therefore be compacted substantially through folding and still obtain adequate performance. The optimally folded SW filter consumes characters of the reduced database db' at the same rate that characters of the original database db are consumed by the two-hit filters. The raw results are shown in Figure 7. When integrated into the overall system, the number of folds is either 2, 4, or 8 (see Figures 8 and 9).

Component            ALMs     M9K       M144K
QS < 256: Reps = 5, 2-hit streams = 25
  5 Two-Hit          5771     250/100   0/9
  1 EUA (F=16)       4084     8         0
  1 SW (F=16)        1475     8/216     12/0
  Total              11363    266       12
QS < 512: Reps = 5, 2-hit streams = 15
  3 Two-Hit          3544     200/60    0/10
  1 EUA (F=16)       6888     16        0
  1 SW (F=8)         5296     32/240    12/0
  Total              15750    248       12
QS < 1024: Reps = 4, 2-hit streams = 12
  3 Two-Hit          3625     258/72    0/12
  1 EUA (F=16)       12540    32        0
  1 SW (F=8)         10473    64/272    12/0
  Total              26660    354       12
QS < 2048: Reps = 3, 2-hit streams = 9
  3 Two-Hit          3704     368/96    0/16
  1 EUA (F=16)       23998    64        0
  1 SW (F=8)         20412    128/336   12/0
  Total              48163    496       16
Total Available (Stratix IV)  325000   1610      60

Fig. 8. Per component resource utilization for the Altera Stratix-IV and the number of replications for the bank configurations shown for each query size.

From the preceding discussion we see that a speed-matched bank of filters contains from 3 to 5 two-hit filters and 1 EUA filter folded to effect 16× replication. A single SW filter is shared by all of the filter banks and folded to effect 2× to 8× replication. The number of filter banks that can fit on an FPGA is itself a function of query size and FPGA resources. Figure 8 shows the results for the Altera Stratix-IV EP4SE820H40I3 and Figure 9 shows the results for the Xilinx Virtex-6 XC6VLX760. We find that the Stratix-IV, depending on query size, can fit 5, 5, 4, or 3 filter banks for a total of 25, 15, 12, or 9 input streams. The Virtex-6 can fit, depending on query size, 3, 3, 2, or 1 filter banks for a total of 15, 9, 6, or 3 input streams. While these are similar sized chips, we find that the substantial difference in the number of input streams, which limits overall performance, is due to the number of BRAM ports.

Large queries are handled as special cases. For both the Virtex-6 and the Stratix-IV, queries greater than 4096 are run in software on the 8-core workstation. For queries with sizes between 2K and 4K the methods diverge. For the Stratix-IV we run straight Smith-Waterman with no prefiltering. For the Virtex-6 we again run on the host.

Component            Slices   Block RAM/FIFO
QS < 256: Reps = 3, 2-hit streams = 15
  5 Two-Hit          3921     159
  1 EUA (F=16)       2103     8
  1 SW (F=16)        1227     56
  Total              7313     223
QS < 512: Reps = 3, 2-hit streams = 9
  3 Two-Hit          2496     128
  1 EUA (F=16)       3590     16
  1 SW (F=8)         4595     80
  Total              10725    224
QS < 1024: Reps = 2, 2-hit streams = 6
  3 Two-Hit          2534     134
  1 EUA (F=16)       6304     32
  1 SW (F=8)         8804     112
  Total              17687    278
QS < 2048: Reps = 1, 2-hit streams = 3
  3 Two-Hit          2537     140
  1 EUA (F=16)       12064    64
  1 SW (F=8)         17152    176
  Total              31801    380
Total Available (Virtex-6)    118560   720

Fig. 9. Per component resource utilization for the Xilinx Virtex-6 and the number of replications for the bank configurations shown for each query size.

5.2. Performance

The updated design leaves the processing logic unchanged, so correctness follows from previous reports [10, 18]. The baseline code is the latest version of NCBI BLASTP, 2.2.25. We integrated the accelerator into the baseline code as previously described [10]. This code is multithreaded. For CPU-only reference we use a 3GHz 8-core Intel workstation. In all experiments we use the same version of the NR protein database (downloaded 9/2011), which has 4.74G residues (characters).

We have validated the designs for both Altera and Xilinx in simulation using ModelSim and obtained resource utilization and operating frequency post-place-and-route using standard vendor tool flows. We have also verified the correctness of the Virtex-6 design on the Convey HC-1ex, as well as a base operating frequency of 125MHz. We also anticipate increasing the operating frequency to 150MHz; the results in this subsection are for 125MHz. The maximum memory bandwidth required is currently less than 8GB/s, within reach of most common accelerators.
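As a rough, back-of-the-envelope check of that bandwidth figure, assuming one byte per residue, the 125MHz clock, the largest Stratix-IV configuration with 25 two-hit streams, and a second copy of each stream for the coupled EUA input:

# Rough bandwidth bound under the stated assumptions; not a measured figure.
streams, copies, bytes_per_char, f_hz = 25, 2, 1, 125e6
print(streams * copies * bytes_per_char * f_hz / 1e9, "GB/s")  # 6.25 GB/s, under 8 GB/s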

Performance results are shown in Figure 10. As is customary in BLAST performance measurements, we assume that the database is preloaded in memory. This matches both common usage of BLAST as a query server and as an inner loop function in a complex application. Overhead includes the time to load the query and the final pass of NCBI BLASTP on the reduced database. We observe speed-ups of roughly 9× for the Virtex-6 based system and 15× for the Stratix-IV.

Query Size    Weight   1 core   8 cores   Overhead   Stratix IV   Virtex 6
1-256         0.469    270      65        0.2        3.0          5.1
257-512       0.361    367      83        0.5        5.1          8.4
513-1024      0.14     512      104       0.9        6.3          12.6
1025-2048     0.026    712      141       1.7        8.4          25.3
2048-4095     0.005    820      177       2.0        75.9         177
4096+         0.0004   1382     292       0          292          292
Weighted total time    354      80        NA         5.3          8.9

Fig. 10. Performance for reference and accelerated implementations. All results in seconds. For each Query Size, the number reported is derived from a weighted average within the range.

6. DISCUSSION AND FUTURE WORK

We have described a new version of CAAD BLASTP, a multiphase, FPGA accelerated version of NCBI BLASTP. We have redesigned the interface logic between the stages. There are two particular contributions. The first is that the stages are now pipelined. This both avoids the need for reconfiguration and reduces the required memory bandwidth. The second drastically improves the efficiency of one of the filter stages, reducing the logic required there by a factor of four. We have so far found speed-ups of from 9× to 15× over an 8-core workstation, depending on the target FPGA.

When we map CAAD BLASTP to Virtex-7 and Stratix-V FPGAs (as such systems become available), performance will scale proportionally to the critical feature counts (mostly BRAM ports) and improved operating frequency.

7. REFERENCES

[1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, pp. 403-410, 1990.

[2] T. Lam, W. Sung, S. Tam, C. Wong, and S. Yiu, "Compressed indexing and local alignment of DNA," Bioinformatics, vol. 24, no. 6, pp. 791-797, 2008.

[3] S. McGinnis and T. Madden, "BLAST: at the core of a powerful and diverse set of sequence analysis tools," Nucleic Acids Research, vol. 32, Web Server Issue, 2004.

[4] TimeLogic web site, Time Logic Corp., www.timelogic.com, Accessed 10/2010.

[5] Mitrion-Accelerated NCBI BLAST for SGI BLAST, Mitrionics, Available at www.mitrionics.se/press, Accessed 1/2010.

[6] M. Herbordt, J. Model, B. Sukhwani, Y. Gu, and T. VanCourt, "Single pass streaming BLAST on FPGAs," Parallel Computing, vol. 33, no. 10-11, pp. 741-756, 2007.

[7] A. Jacob, J. Lancaster, J. Buhler, B. Harris, and R. Chamberlain, "Mercury BLASTP: Accelerating protein sequence alignment," ACM Transactions on Reconfigurable Technology and Systems, vol. 1, no. 2, 2008.

[8] D. Lavenier, L. Xinchun, and G. Georges, "Seed-based genomic sequence comparison using a FPGA/FLASH accelerator," in Proc. IEEE Conference on Field Programmable Technology, 2006, pp. 41-48.

[9] K. Muriki, K. Underwood, and R. Sass, "RC-BLAST: Towards an open source hardware implementation," in Proc. International Workshop on High Performance Computational Biology, 2005.

[10] J. Park, Y. Qui, and M. Herbordt, "CAAD BLASTP: NCBI BLASTP Accelerated with FPGA-Based Pre-Filtering," in Proc. IEEE Symp. on Field Programmable Custom Computing Machines, 2009, pp. 81-87.

[11] E. Sotiriades and A. Dollas, "A general reconfigurable architecture for the BLAST algorithm," Journal of VLSI Signal Processing, vol. 48, pp. 189-208, 2007.

[12] F. Xia, Y. Dou, and J. Xu, "Families of FPGA-based accelerators for BLAST algorithm with multi-seeds detection and parallel extension," in 2nd Int. Conf. Bioinformatics Research and Development, 2008, pp. 43-57.

[13] C. Ling and K. Benkrid, "Design and implementation of a CUDA-compatible GPU-based core for gapped BLAST algorithm," Procedia Computer Science, vol. 1, no. 1, 2010.

[14] P. Vouzis and N. Sahinidis, "GPU-BLAST: using graphics processors to accelerate protein sequence alignment," Bioinformatics, vol. 27, no. 2, pp. 182-188, 2011.

[15] http://blast.ncbi.nlm.nih.gov/Blast.cgi, Accessed 1/2009.

[16] P. Afratis, E. Sotiriades, G. Chrysos, S. Fytraki, and D. Pnevmatikatos, "A rate-based prefiltering approach to BLAST acceleration," in Proc. IEEE Conference on Field Programmable Logic and Applications, 2008.

[17] M. Herbordt, J. Model, B. Sukhwani, Y. Gu, and T. VanCourt, "Single pass, BLAST-like, approximate string matching on FPGAs," in Proc. IEEE Symp. on Field Programmable Custom Computing Machines, 2006.

[18] A. Mahram and M. Herbordt, "Fast and Accurate NCBI BLASTP: Acceleration with Multiphase FPGA-Based Prefiltering," in Proceedings of the 24th ACM International Conference on Supercomputing, 2010, pp. 73-82.

[19] A. George, H. Lam, and G. Stitt, "Novo-G: At the Forefront of Scalable Reconfigurable Computing," Computing in Science and Engineering, vol. 13, no. 1, 2011.

[20] I. Korf, M. Yandell, and J. Bedell, BLAST: An Essential Guide to the Basic Local Alignment Search Tool. O'Reilly and Associates, 2003.

[21] PROCStar III, Gidel Reconfigurable Computing, http://www.gidel.com/PROCStar

[22] E. Chow, T. Hunkapiller, and J. Peterson, "Biological information signal processor," in Proc. International Conference on Application Specific Systems, Architectures, and Processors, 1991, pp. 144-160.
