Page 1: Accurate Sequence Alignment Using Distributed Filtering on GPU … · 2012-11-27 · Title: Accurate Sequence Alignment Using Distributed Filtering on GPU Clusters Author: Reza Farivar

Accurate Sequence Alignment using Distributed Filtering on GPU Clusters

Reza Farivar*

Shivaram Venkataraman§

Roy H. Campbell*

*University of Illinois at Urbana-Champaign

§University of California, Berkeley


Problem Introduction

• What is Gene Sequencing?

• 1st Phase: getting the reads

• 2nd Phase: aligning them to a reference DNA

5/15/12 University of Illinois at Urbana-Champaign 2


Problem Description

• Human genome sequence

– About 3 billion nucleotides (A, C, T, G)

– Fewer than 30 chromosomes

– Unidentified bases (N)

• Reads

– Explore a sequence by designing and searching for short DNA reads

• Types of disagreements

– Insertions, deletions and mismatches


Why is the Problem So Hard?

• Divide reference DNA into patterns using a sliding window

• Brute force: (3 × 10^9) × (20 × 10^6) = 6 × 10^16 pattern-read comparisons

On a quad-core CPU, with each core working on two datasets in parallel using SSE, this would take 195 days to complete!
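
As a rough check of the brute-force estimate on this slide, the sketch below multiplies out the two factors and derives the comparison rate that a 195-day run would imply. Reading the factors as the number of reference patterns and the number of reads is an assumption; the slide quotes them without labels.

```python
# Back-of-the-envelope check of the brute-force cost quoted on the slide.
# Assumption: 3e9 is the number of sliding-window patterns in the reference
# and 20e6 is the number of reads; the slide gives both figures unlabeled.
NUM_PATTERNS = 3 * 10**9
NUM_READS = 20 * 10**6
RUNTIME_DAYS = 195  # quoted estimate for a quad-core CPU using SSE

total_comparisons = NUM_PATTERNS * NUM_READS
runtime_seconds = RUNTIME_DAYS * 24 * 60 * 60
implied_rate = total_comparisons / runtime_seconds  # comparisons per second

print(f"total comparisons: {total_comparisons:.1e}")
print(f"implied rate: {implied_rate:.2e} comparisons/s")
```

The implied rate is a few billion comparisons per second across the whole machine, which is why the talk turns to filtering rather than faster brute force.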


Our Approach

PHASE 1

• Generate masks for the filter phase of the algorithm

• Reduce the number of masks

• Filter the pattern-read pairs

PHASE 2

• Align the surviving pairs on an accelerator such as a GPU and discard the false positives


Phase 1 - Masks

• Related to the q-gram lemma

– Based on pigeon hole principle

• If n “disagreements” are spread out in a string and we divide the string into n+k divisions, at least k divisions will be exact matches
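
The pigeonhole argument can be checked empirically with a minimal sketch. It assumes the disagreements are substitutions only (insertions and deletions shift later positions and need extra care), and the helper names are illustrative, not taken from the talk.

```python
import random

def block_bounds(length, parts):
    """Return (start, end) pairs splitting `length` into `parts` contiguous blocks."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def count_exact_blocks(a, b, parts):
    """Count the blocks in which equal-length strings a and b agree exactly."""
    return sum(a[s:e] == b[s:e] for s, e in block_bounds(len(a), parts))

def substitute(s, n, rng):
    """Return a copy of s with exactly n positions changed to a different base."""
    chars = list(s)
    for pos in rng.sample(range(len(s)), n):
        chars[pos] = rng.choice([c for c in "ACGT" if c != chars[pos]])
    return "".join(chars)

rng = random.Random(0)
n, k = 3, 2  # n substitutions, n + k blocks
worst = None
for _ in range(1000):
    original = "".join(rng.choice("ACGT") for _ in range(30))
    mutated = substitute(original, n, rng)
    exact = count_exact_blocks(original, mutated, n + k)
    worst = exact if worst is None else min(worst, exact)

print("fewest exact blocks seen:", worst)
```

Each substitution can spoil at most one block, so the count of exact blocks never drops below k, whichever positions are hit.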


Distributed Filtering

• Using a sliding window, create patterns

• For each mask, make a “masked array” from all patterns

• For each read, apply each mask and search for the masked read in the corresponding masked array

• If found, the pattern/read pair is a potential match for Phase 2
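
The four steps above can be sketched on a single machine as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the use of Python's `bisect` for the search, and the block length of 2 are assumptions. The reference prefix, masks and read are the ones used in the example slides that follow.

```python
from bisect import bisect_left

def apply_mask(seq, mask, block_len):
    """Replace every block of `seq` whose mask bit is '0' with zeros."""
    out = []
    for i, bit in enumerate(mask):
        block = seq[i * block_len:(i + 1) * block_len]
        out.append(block if bit == "1" else "0" * len(block))
    return "".join(out)

def build_masked_arrays(reference, window, masks, block_len):
    """For each mask, build a sorted list of (masked pattern, position)."""
    patterns = [(reference[i:i + window], i)
                for i in range(len(reference) - window + 1)]
    return {mask: sorted((apply_mask(p, mask, block_len), pos)
                         for p, pos in patterns)
            for mask in masks}

def candidates(read, arrays, block_len):
    """Return reference positions whose masked pattern equals the masked read."""
    hits = set()
    for mask, arr in arrays.items():
        key = apply_mask(read, mask, block_len)
        i = bisect_left(arr, (key, -1))  # first entry with this masked key
        while i < len(arr) and arr[i][0] == key:
            hits.add(arr[i][1])
            i += 1
    return sorted(hits)

reference = "TGCCTTCATT"  # first ten bases of the example reference
arrays = build_masked_arrays(reference, window=6,
                             masks=["110", "101", "011"], block_len=2)
print(candidates("CCATCA", arrays, block_len=2))
```

The read CCATCA differs from the pattern CCTTCA (position 2) in one base, so it matches under mask 101 and is passed on as a candidate. In a distributed setting each masked array could live on a different node, which is presumably where the "distributed" in the title comes from.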


Creating the distributed filter

Reference fragment: TGCCTTCATTTCGTTATGTACCCAGTAGTCATAAAAGCACTAGCTTGCCAAGTT

Patterns (sliding window): TGCCTT, GCCTTC, CCTTCA, CTTCAT, TTCATT

Masks: 110, 101, 011

Sorted Masked Arrays

Mask 101: TG00TT, GC00TC, CC00CA, CT00AT, TT00TT

Mask 110: TGCC00, GCCT00, CCTT00, CTTC00, TTCA00

Mask 011: 00CCTT, 00CTTC, 00TTCA, 00TCAT, 00CATT

Distributed pigeon hole filter
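
Reading the nine mask digits on this slide as three 3-bit masks (110, 101, 011), they are exactly the ways of keeping two of three blocks, which is what the pigeonhole argument asks for with n = 1 disagreement and k = 2 exact divisions. A small sketch of that enumeration follows; the function name is hypothetical, and the talk does not say how its masks were actually generated.

```python
from itertools import combinations

def enumerate_masks(num_blocks, num_exact):
    """All bit strings of length num_blocks with exactly num_exact ones."""
    masks = []
    for kept in combinations(range(num_blocks), num_exact):
        masks.append("".join("1" if i in kept else "0"
                             for i in range(num_blocks)))
    return masks

masks = enumerate_masks(num_blocks=3, num_exact=2)
print(masks)  # ['110', '101', '011']
```

The number of masks grows as a binomial coefficient in the number of blocks, which is one reason the approach includes a step to reduce the number of masks.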


Masked Read Matching

A Short Read: CCATCA

Masks: 110, 101, 011

Masked read: CCAT00 (mask 110), CC00CA (mask 101), 00ATCA (mask 011)

Sorted Masked Arrays

Mask 101: TG00TT, GC00TC, CT00AT, CC00CA, TT00TT

Mask 110: TGCC00, GCCT00, CCTT00, CTTC00, TTCA00

Mask 011: 00CCTT, 00CTTC, 00TTCA, 00TCAT, 00CATT

Match found: CC00CA


Performance of filtering


Filter Passed %    Avg. No. of Patterns    Masks Used

85.38              13.6                    1-1-1-2

13.39              21.38                   1-1-1-3

2.84               46.66                   1-1-1-4


Phase 2: Post Filter Matching

• Aligning two short sequences

– Insertion, Deletion, Mismatch

– Example:

• X: kitten-

• Y: sitting

• Needleman-Wunsch

– Global Alignment

– Create a matrix in which each element holds the score of the optimal alignment of two shorter prefixes of the sequences, then trace back.


Needleman Wunsch Algorithm

1. Fill the matrix (forward pass)

– S(i,j) is the optimal alignment score for substrings X[0..i] and Y[0..j]

– S(i,j) = f(S(i-1,j), S(i,j-1), S(i-1,j-1), X[i], Y[j])

2. Trace-back (backward pass)

[Figure: example score matrix with axes X and Y; the rows shown are 1 1 1 1 / 1 1 1 2 / 1 2 2 2 / 1 2 2 2]
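
The two passes can be sketched as below. The slide leaves the recurrence f unspecified, so the scoring used here (+1 match, -1 mismatch, -1 gap) is an assumption chosen for illustration, and this is a plain CPU version rather than the GPU kernel discussed in the talk.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def sub(i, j):
        # Score for aligning x[i-1] with y[j-1].
        return match if x[i - 1] == y[j - 1] else mismatch

    # Forward pass: S[i][j] is the best score for aligning x[:i] with y[:j].
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Backward pass: walk from the bottom-right corner back to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = needleman_wunsch("kitten", "sitting")
print(score)
print(ax)
print(ay)
```

With this scoring the example pair from the previous slide aligns as kitten- over sitting: four matches, two mismatches and one gap.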


Limited Fast Memory

• NW in Shared Memory

– Amount of shared memory available per streaming multiprocessor (in Tesla) = 16 KB

– For 32-long strings, 1024 bytes are required per thread, so only 16 threads fit per SM

– For 128-long strings, only 1 thread fits per SM!

• Problem with 16 threads per SM

– Very low GPU utilization

– Unable to hide RAW (read-after-write) latency
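
The thread counts on this slide follow from simple division, sketched below. One byte per score-matrix cell is an assumption; it is the value that reproduces the quoted 1024 bytes for 32-long strings.

```python
SHARED_MEM_BYTES = 16 * 1024  # per streaming multiprocessor, as quoted on the slide
BYTES_PER_CELL = 1            # assumption: one byte per score-matrix cell

def threads_per_sm(string_len):
    """Threads that fit when each thread keeps a full score matrix in shared memory."""
    matrix_bytes = string_len * string_len * BYTES_PER_CELL
    return SHARED_MEM_BYTES // matrix_bytes

for length in (32, 128):
    print(length, threads_per_sm(length))
```

Because the matrix grows quadratically with string length, quadrupling the length from 32 to 128 cuts the resident threads by a factor of 16.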


Matrix in Global Memory

• Allows hundreds of threads per streaming multiprocessor

• Many threads can be used to hide the latency incurred by each thread (normal GPU practice)

PROBLEM

• In the best case, we are global memory bound

• The algorithm does not support coalesced memory accesses during back-tracing

• This results in serializing the threads


Algorithm Walk through

1. Divide the matrix into quadrants, store the boundaries

2. Start from the bottom right quadrant (i,j = 31)

– Fill the alignment matrix for the quadrant

– Back trace

– Decide on a new quadrant

3. Repeat 2 until i,j = 0
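
The walk-through can be sketched on the CPU as follows, to show the control flow only. This is not the authors' GPU kernel: the tile size of 8, the scoring (+1 match, -1 mismatch, -1 gap) and all function names are assumptions. The forward pass keeps only every eighth row and column, and each quadrant is recomputed from its stored top row and left column when the back trace reaches it.

```python
GAP = -1  # assumed linear gap penalty

def pair_score(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def forward_boundaries(x, y, tile):
    """Forward pass that keeps only every `tile`-th row and column of S."""
    n, m = len(x), len(y)
    rows = {}                                                 # rows[i][j] = S(i, j)
    cols = {j: [0] * (n + 1) for j in range(0, m + 1, tile)}  # cols[j][i] = S(i, j)
    cur = [j * GAP for j in range(m + 1)]                     # row 0
    for i in range(n + 1):
        if i > 0:
            prev, cur = cur, [i * GAP] + [0] * m
            for j in range(1, m + 1):
                cur[j] = max(prev[j - 1] + pair_score(x[i - 1], y[j - 1]),
                             prev[j] + GAP, cur[j - 1] + GAP)
        if i % tile == 0:
            rows[i] = cur
        for j in cols:
            cols[j][i] = cur[j]
    return rows, cols, cur[m]

def solve_tile(x, y, rows, cols, r0, c0, i, j):
    """Recompute S on rows r0..i and columns c0..j from the stored boundaries."""
    h, w = i - r0, j - c0
    t = [[0] * (w + 1) for _ in range(h + 1)]
    t[0] = rows[r0][c0:j + 1]
    for a in range(h + 1):
        t[a][0] = cols[c0][r0 + a]
    for a in range(1, h + 1):
        for b in range(1, w + 1):
            t[a][b] = max(t[a - 1][b - 1] + pair_score(x[r0 + a - 1], y[c0 + b - 1]),
                          t[a - 1][b] + GAP, t[a][b - 1] + GAP)
    return t

def tiled_align(x, y, tile=8):
    """Global alignment that back traces one recomputed quadrant at a time."""
    rows, cols, score = forward_boundaries(x, y, tile)
    ax, ay = [], []
    i, j = len(x), len(y)
    while i > 0 and j > 0:
        # Decide on the quadrant containing cell (i, j) and fill it.
        r0, c0 = (i - 1) // tile * tile, (j - 1) // tile * tile
        t = solve_tile(x, y, rows, cols, r0, c0, i, j)
        while i > r0 and j > c0:  # back trace inside the quadrant
            a, b = i - r0, j - c0
            if t[a][b] == t[a - 1][b - 1] + pair_score(x[i - 1], y[j - 1]):
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
            elif t[a][b] == t[a - 1][b] + GAP:
                ax.append(x[i - 1])
                ay.append("-")
                i -= 1
            else:
                ax.append("-")
                ay.append(y[j - 1])
                j -= 1
    # One sequence is exhausted: the rest of the path runs along the matrix edge.
    ax.extend(reversed(x[:i]))
    ax.extend("-" * j)
    ay.extend("-" * i)
    ay.extend(reversed(y[:j]))
    return score, "".join(reversed(ax)), "".join(reversed(ay))

print(tiled_align("kitten", "sitting", tile=4))
```

The trade-off is extra computation for less memory: every quadrant on the back-trace path is filled a second time, but only the boundaries and one quadrant need to be stored at once.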


Storing The Boundaries

• Storing in shared memory → only 2 warps per SM → 6% occupancy

• Storing in global memory → shared memory usage 81 bytes

– Threads per SM = 192 → 19% GPU utilization

[Figure: 32 × 32 alignment matrix for two 32-character sequences (A T C G A T T C G G A T ... and A T C G A A T T A A A T T T T ...), annotated "256 Bytes + 81 Bytes"]

• Calculation of each quadrant requires the first row and the first column of the quadrant
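
The occupancy figures on this slide can be reproduced with the arithmetic below, under stated assumptions: a limit of 1024 resident threads per SM (the limit for compute capability 1.3 parts such as the GTX 275 used in the experiments), whole-warp rounding, and charging only the 256 boundary bytes per thread in the shared-memory case. The byte counts themselves are taken from the slide; one reading consistent with them is 8 × 8 quadrants (4 stored rows and 4 stored columns of 32 bytes each, plus a 9 × 9 working quadrant), but that is an inference, not something the slide states.

```python
SHARED_MEM_BYTES = 16 * 1024   # per SM, as quoted earlier in the talk
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1024      # assumption: compute capability 1.3 limit

def resident_threads(bytes_per_thread):
    """Threads per SM that fit in shared memory, rounded down to whole warps."""
    fit = SHARED_MEM_BYTES // bytes_per_thread
    return min(fit // WARP_SIZE * WARP_SIZE, MAX_THREADS_PER_SM)

cases = (("boundaries in shared memory", 256),
         ("boundaries in global memory", 81))
for label, nbytes in cases:
    threads = resident_threads(nbytes)
    occupancy = 100 * threads / MAX_THREADS_PER_SM
    print(f"{label}: {threads} threads, {occupancy:.0f}% occupancy")
```

Under these assumptions the two cases come out at 64 threads (2 warps, about 6%) and 192 threads (about 19%), matching the slide.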


Experimental Results: Speedup

[Figure: speedup (y-axis, 4.5 to 10.5) versus number of pairs to compare (x-axis); series: GPU registers, GPU 32-bit coalesced, GPU 64-bit coalesced]

• NVIDIA GTX 275

• Intel Core i7 920


Run Time Break Down

[Figure: bar chart of run-time details in seconds (y-axis, 0 to 14); bars: Read from disk, Creating masked array, Binary search for all reads in one masked array, GPU processing time, Misc]


Future Work

• Scaling to public clouds

– Distributed filtering using a cluster of GPUs

– Integration with standard bioinformatics file formats

• Length of short sequences

– Currently 32, we are extending to 100 and longer


Backup Slides


Goals and Assumptions

• Goals:

– Make the kernel compute bound: minimize global memory accesses, ensure coalescing

– Maximize use of fast memory (shared memory, registers)

– Eliminate diverging flows

• Assumptions:

– If we eliminate memory bounds, we will have extra cycles to spare, so we can re-compute values when needed

– Bank-conflict-free shared memory access is as fast as registers


The Proposed Algorithm

• Dynamically construct smaller quadrants while backtracing

• Solve each quadrant in shared memory

[Figure: 32 × 32 alignment matrix for two 32-character sequences (A T C G A T T C G G A T ... and A T C G A A T T A A A T T T T ...), shown alongside the shorter sequences A T C C T A G G A and A T G C A A A G T]


Optimization with Registers

• Boundaries for 32-long strings take 256 bytes

• Trick: store the 256 bytes in 32 registers

• Problem: Registers are not array addressable

– Need to use the registers conditionally, which leads to divergent code

• Solution: We use registers as bulk storage, paging into shared memory

• Shared memory usage of 81 bytes gives 19% GPU utilization, with no global memory use


Runtime vs. Speedup

[Figure: normalized run-time (y-axis, 0 to 180) versus number of pairs to compare (x-axis); series: Normalized OpenMP CPU time, Normalized GPU time]


Experimental Results

• Compared to a regular Needleman-Wunsch implementation on an Intel i7 processor, our implementation on GPUs is 22x faster.

• Compared to a multithreaded implementation of the regular Needleman-Wunsch using OpenMP on 4 Intel i7 cores with Hyper-Threading support (8 HT cores), our implementation is up to 8x faster.

• What does this mean? Gene alignment for human DNA with 3 billion reads will now take 27 minutes instead of 3 hours!
