GeneIndex: an open source parallel program for enumerating and locating words in a genome


Huian Li, David Hart, Matthias Mueller, Ulf Markward, Craig Stewart

August 3, 2009

Contents
• Motivation
• Serial algorithm
• Parallel implementation
• Performance analysis
• Conclusion

Motivation

Question from a Biology professor:

Given a word length, is the computational task of scanning a DNA sequence and recording the positions of all possible words trivial?

5 10 15 20 25 30 * * * * * *

5’ TAGCCGTGGCGGAGCCTCTTGGCTTTGTTTATTC 3’

Serial algorithm
• Straightforward implementation: binary coding for A, C, G, T. For example:
  A: 00
  C: 01
  G: 10
  T: 11

T  A  G  C  C  G  T  G  G  C  G  G  A  G  C  C  T  C  T  T  G  ...
11 00 10 01 01 10 11 10 10 01 10 10 00 10 01 01 11 01 11 11 10 ...
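The 2-bit coding can be sketched in a few lines of Python. This is a minimal illustration rather than the GeneIndex source; the names `CODE`, `encode` and `to_bits` are hypothetical.

```python
# 2-bit codes taken from the slide: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(word: str) -> int:
    """Pack a DNA word into an integer, two bits per base, leftmost base first."""
    value = 0
    for base in word.upper():
        value = (value << 2) | CODE[base]
    return value


def to_bits(value: int, k: int) -> str:
    """Render an encoded word of length k as a zero-padded bit string."""
    return format(value, f"0{2 * k}b")


print(to_bits(encode("TAGC"), 4))
```

A word of length k occupies 2k bits, so a 4-letter word fits in one byte.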

Serial algorithm
• Given a sequence and a word length k, we list all possible words by scanning the sequence once from left to right.

T  A  G  C  C  G  T  G  G  C  G  G  A  G  C  C  T  C  T  T  G  ...
11 00 10 01 01 10 11 10 10 01 10 10 00 10 01 01 11 01 11 11 10 ...

T A G C   11001001
A G C C   00100101
G C C G   10010110
...
C T T G   01111110

Serial algorithm

T  A  G  C  C  G  T  G  G  C  G  G  A  G  C  C  T  C  T  T  G  ...
11 00 10 01 01 10 11 10 10 01 10 10 00 10 01 01 11 01 11 11 10 ...

T A G C   11001001
A G C C   00100101

ENCODE("AGCC") = 00100101 = ((ENCODE("TAGC") & MASK) << 2) | ENCODE('C')

MASK = 4^(k-1) - 1 = 111111 in binary (in the case of k = 4)
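The incremental update can be written as a single left-to-right scan. The sketch below is an illustration under assumptions: the function name `enumerate_words` is hypothetical, and positions are reported 0-based, since the slides do not state which origin GeneIndex uses.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def enumerate_words(sequence: str, k: int):
    """Yield (encoded_word, position) for every k-letter window, left to right.

    Positions are 0-based here (an assumption, not taken from the slides).
    """
    if len(sequence) < k:
        return
    mask = 4 ** (k - 1) - 1  # keeps the low 2*(k-1) bits, i.e. drops the oldest base
    word = 0
    for base in sequence[:k]:  # encode the first window from scratch
        word = (word << 2) | CODE[base]
    yield word, 0
    for pos in range(k, len(sequence)):
        # drop the leftmost base, shift, and append the new base
        word = ((word & mask) << 2) | CODE[sequence[pos]]
        yield word, pos - k + 1


for word, pos in enumerate_words("TAGCCG", 4):
    print(format(word, "08b"), pos)
```

Each step costs one AND, one shift and one OR regardless of k, which is why the whole scan is linear in the sequence length.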

Serial algorithm
• This essentially becomes a sorting problem, since each word is now converted into an integer.
• Each word is associated with its position information: (Encoded Word, Position)
• Sorting has to be stable, so that for the same word the positions stay in a defined order.
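A small sketch of the sorting step, assuming the (encoded word, position) pairs arrive in scan order. It relies on Python's built-in `sorted`, which is guaranteed stable, so equal words keep their positions in ascending order. The function name `index_by_word` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def index_by_word(pairs):
    """Group positions by encoded word using a stable sort on the word only."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable: ties keep scan order
    return {
        word: [pos for _, pos in group]
        for word, group in groupby(ordered, key=itemgetter(0))
    }


pairs = [(37, 0), (9, 1), (37, 2), (9, 3)]
print(index_by_word(pairs))
```

Sorting on the word alone, rather than on the whole pair, is what makes stability matter here.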

Serial algorithm
Implementation details:
• Words and positions are stored in long long integers (8 bytes = 64 bits).
• Hash table with a linked list for each entry.
• Space required for all words in the given sequence is pre-allocated, instead of calling malloc one by one.
• Mostly AND, OR and SHIFT-LEFT operations.
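The idea of pre-allocating all position storage can be illustrated with a two-pass counting layout. Note that this swaps in a different technique from the slide: instead of a hash table with linked lists, it uses one flat pre-allocated array plus per-word offsets. The names `build_index` and `lookup` are hypothetical, and the dense `4**k` table only makes sense for small k.

```python
from array import array


def build_index(encoded_words, k):
    """Two passes: count each word, then fill one pre-allocated position array."""
    n_words = 4 ** k
    counts = [0] * n_words
    for word in encoded_words:
        counts[word] += 1
    # offsets[w] is where the block of positions for word w starts
    offsets = [0] * (n_words + 1)
    for word in range(n_words):
        offsets[word + 1] = offsets[word] + counts[word]
    # one allocation of 64-bit signed integers for all positions
    positions = array("q", [0]) * len(encoded_words)
    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    for pos, word in enumerate(encoded_words):
        positions[cursor[word]] = pos
        cursor[word] += 1
    return offsets, positions


def lookup(offsets, positions, word):
    """Return all positions recorded for one encoded word."""
    return list(positions[offsets[word]:offsets[word + 1]])


# "TAGCC" with k = 1 encodes to T=3, A=0, G=2, C=1, C=1
offsets, positions = build_index([3, 0, 2, 1, 1], 1)
print(lookup(offsets, positions, 1))
```

Because the second pass visits positions in ascending order, each word's block is already sorted, which gives the stable ordering the previous slide asks for.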

Word frequencies

Word distribution

Motivation for parallel implementation

Another question from Biology professor:

How about the human genome?

Fact: the human genome includes about 3 billion DNA bases.

Parallel implementation: input
Large dataset input:
• Each process reads its own partition from the input file.
• The boundary area between neighboring processes has to be considered.

[Figure: example DNA sequence text divided among processes]
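One way to handle the boundary area is to let each process read k-1 extra characters past the end of its share, so that every word starting inside the share is complete. The sketch below assumes this overlap scheme; the slides only say that the boundary must be considered, and the function name `partition` is hypothetical.

```python
def partition(length: int, n_procs: int, k: int):
    """Return half-open (start, end) character ranges, one per process.

    Each process owns a contiguous block of word start positions and reads
    k-1 extra characters so that its last word is complete (an assumed scheme).
    """
    n_words = length - k + 1  # number of k-letter words in the sequence
    ranges = []
    for rank in range(n_procs):
        first = rank * n_words // n_procs        # first word start owned
        last = (rank + 1) * n_words // n_procs   # one past the last word start
        ranges.append((first, last + k - 1))
    return ranges


# The 34-base example sequence from the Motivation slide, 4 processes, k = 4
print(partition(34, 4, 4))
```

Neighboring ranges overlap by exactly k-1 characters, and each word is counted by exactly one process.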

Parallel implementation: load balancing
Computation and load balancing:
1. Each process deals with its own piece of data
2. All processes perform global sorting

✗ Straightforward implementation: binary tree merge sorting
✗ Possible solution, but could be problematic
• Ideal solution leading to load balancing

Parallel implementation: load balancing
Straightforward implementation: binary tree merge sorting

[Figure: the per-process word lists are merged pairwise in a binary tree into a single sorted list]

Word    P0      P1      P2      P3
AAAA    5, 9    19      4, 8    35
AAAC    22      12      67      46
...     ...     ...     ...     ...
TTTG    101     201     88      40
TTTT    80      26      53      30

Parallel implementation: load balancing
Possible solution but could be problematic:
Straightforward solution: partition the word range $[0, 4^k)$ equally, so each process has

$$\left[\frac{i \cdot 4^k}{n}, \frac{(i+1) \cdot 4^k}{n}\right), \quad i = 0, 1, \ldots, n-1$$

[Figure: the sorted word list AAAA, AAAC, ..., TTTG, TTTT split into equal ranges, one per process]

Parallel implementation: load balancing
Implementation of the straightforward solution:
• The problem is that some words occur more often than others, leading to different memory requirements for different processes.

AAAA: 5, 9      CAAA: 19        GAAA: 4, 8      TAAA: 35, 93
AAAC: 22, 37    CAAC: 12, 47    GAAC: 67, 72    TAAC: 46
...             ...             ...             ...
ATTG: 101       CTTG: 201       GTTG: 88        TTTG: 40, 87
ATTT: 80        CTTT: 26        GTTT: 53        TTTT: 15, 30
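The imbalance of the equal-range partition is easy to reproduce on the 34-base example sequence from the Motivation slide. The sketch below is illustrative only; `owner` and `loads` are hypothetical names, and with n = 4 and k = 4 the four ranges correspond to words starting with A, C, G and T.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def owner(word: int, n_procs: int, k: int) -> int:
    """Rank that owns `word` when [0, 4**k) is cut into n_procs equal ranges."""
    return word * n_procs // 4 ** k


def loads(sequence: str, n_procs: int, k: int):
    """Count how many word occurrences land on each process."""
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        word = 0
        for base in sequence[start:start + k]:
            word = (word << 2) | CODE[base]
        counts[owner(word, n_procs, k)] += 1
    return [counts[rank] for rank in range(n_procs)]


print(loads("TAGCCGTGGCGGAGCCTCTTGGCTTTGTTTATTC", 4, 4))
```

Even on this tiny input the loads differ by more than a factor of three, although every process owns the same number of possible words.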

Parallel implementation: load balancing
Ideal solution leading to load balancing:
• Partition the total number of words, L-k+1, equally, so each process has (L-k+1)/n words, where L is the length of the given sequence and k is the given word length.

Implementation:
1. After each process has scanned its own piece, we know that:

$$\sum_{x=0}^{4^k-1} f(W_x, P_i) = \frac{L-k+1}{n}, \quad i = 0, 1, \ldots, n-1$$

with $f(W_x, P_i)$ denoting the count of word $W_x$ in the piece scanned by process $P_i$.

2. We divide the word range $[0, 4^k)$ into many small divisions, with a total of d divisions (where d >> n):

$$\sum_{j=0}^{d-1} \; \sum_{x = j \cdot 4^k / d}^{(j+1) \cdot 4^k / d - 1} f(W_x, P_i) = \frac{L-k+1}{n}, \quad i = 0, 1, \ldots, n-1$$

Parallel implementation: load balancing
Implementation:
3. The number of words in each small division is:

$$T(i, j) = \sum_{x = j \cdot 4^k / d}^{(j+1) \cdot 4^k / d - 1} f(W_x, P_i), \quad i = 0, 1, \ldots, n-1, \; j = 0, 1, \ldots, d-1$$

4. The total number of words in each small division across all processes is:

$$T(j) = \sum_{i=0}^{n-1} \; \sum_{x = j \cdot 4^k / d}^{(j+1) \cdot 4^k / d - 1} f(W_x, P_i), \quad j = 0, 1, \ldots, d-1$$

Parallel implementation: load balancing
Implementation:
5. Find an array of boundaries B, such that:

$$\sum_{j=B(m)}^{B(m+1)-1} T(j) \approx \frac{L-k+1}{n}$$

where B(0) = 0, B(0<m<n) = any number in (0, d-1), and B(n) = d-1.

6. Process P_i should have all words in [B(i), B(i+1)), where i = 0, 1, ..., n-1.
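Steps 2 to 6 can be sketched as a histogram followed by a greedy cut of the cumulative totals. This is an illustration under stated assumptions, not the GeneIndex implementation: the greedy rule is one possible way to find B, the function names are hypothetical, the per-process histograms are passed in as plain lists (in an MPI program they would presumably be combined with a reduction), and the sketch uses d as the exclusive upper end of the last range so that the final division is included.

```python
def division_counts(word_counts: dict, k: int, d: int):
    """T(i, j) for one process: collapse per-word counts into d divisions."""
    total = 4 ** k
    t = [0] * d
    for word, count in word_counts.items():
        t[word * d // total] += count
    return t


def choose_boundaries(per_process_t, n_procs: int):
    """Return B(0..n) as division indices; process i gets divisions [B(i), B(i+1))."""
    d = len(per_process_t[0])
    # T(j): sum each division over all processes
    totals = [sum(t[j] for t in per_process_t) for j in range(d)]
    grand_total = sum(totals)
    boundaries = [0]
    running = 0
    for j, count in enumerate(totals):
        running += count
        # close a block once the running total reaches the next multiple
        # of grand_total / n_procs (compared without division)
        while (len(boundaries) < n_procs
               and running * n_procs >= len(boundaries) * grand_total):
            boundaries.append(j + 1)
    boundaries.append(d)  # assumption: exclusive end covering the last division
    return boundaries


# Two processes, four divisions, with a heavily populated first division
print(choose_boundaries([[5, 1, 1, 1], [5, 1, 1, 1]], 2))
```

With d much larger than n, each division holds a small share of the words, so the cuts can land close to the ideal (L-k+1)/n per process.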

Parallel implementation: load balancing

[Figure: the sorted word list AAAA, AAAC, ..., TTTG, TTTT divided among processes]

Parallel implementation: output
Output:
• Each process creates its own output file.
• If necessary, all files can be concatenated into one single file while keeping the order.

AAAA: 5, 9, 10
AAAC: 22, 37
...
TTTG: 15, 30
TTTT: 40, 87
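Producing lines in this format requires decoding each integer back into letters. A minimal sketch, assuming the same 2-bit coding as in the serial algorithm; `decode` and `format_entry` are hypothetical names.

```python
BASES = "ACGT"  # index equals the 2-bit code: A=0, C=1, G=2, T=3


def decode(word: int, k: int) -> str:
    """Turn an encoded word of length k back into its letters."""
    letters = []
    for shift in range(2 * (k - 1), -1, -2):  # leftmost base is in the high bits
        letters.append(BASES[(word >> shift) & 0b11])
    return "".join(letters)


def format_entry(word: int, positions, k: int) -> str:
    """Render one output line such as 'AAAA: 5, 9, 10'."""
    return f"{decode(word, k)}: {', '.join(map(str, positions))}"


print(format_entry(0, [5, 9, 10], 4))
```

Because each process holds a contiguous range of encoded words, writing entries in increasing integer order keeps the concatenated files in alphabetical word order.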

Testbed

Testbed specification
• Consists of 768 IBM JS21 blades
• On each blade:
  • 2 dual-core PowerPC CPUs @ 2.5 GHz
  • 8 GB memory
  • SUSE Linux Enterprise Server 9 (ppc)
• Interconnect: Myrinet
• Parallel environment: MPI

Performance analysis

Number of   Number of    D. melanogaster      H. sapiens
nodes       processes    k=6      k=25        k=6      k=25
1           1            95       23203       -        -
2           2            53       5770        -        -
4           4            29       1500        -        -
8           8            15       386         -        -
16          16           9        107         212      93672
32          32           6        32          118      24934
64          64           5        14          73       5998
128         128          9        11          59       1450
256         256          11       15          49       558
512         512          18       22          71       195

Timings of running against two datasets on BigRed using 1 PPN (seconds)

Number of   Number of    D. melanogaster      H. sapiens
nodes       processes    k=6      k=25        k=6      k=25
1           2            54       5958        -        -
2           4            31       1545        -        -
4           8            17       400         -        -
8           16           11       115         -        -
16          32           8        40          156      25661
32          64           6        19          101      6233
64          128          7        12          82       1506
128         256          25       16          61       576
256         512          20       27          81       198
512         1024         34       37          115      170

Timings with 2 processes per node (seconds)

Number of   Number of    D. melanogaster      H. sapiens
nodes       processes    k=6      k=25        k=6      k=25
1           4            37       1738        -        -
2           8            23       453         -        -
4           16           15       130         -        -
8           32           10       46          -        -
16          64           9        23          163      6649
32          128          11       17          121      1632
64          256          20       25          96       634
128         512          39       42          120      233
256         1024         79       90          180      187
512         2048         134      131         270      281

Timings with 4 processes per node (seconds)
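As a worked example, speedups can be derived from the H. sapiens k=25 timings at 1 process per node. Taking the 16-node run as the baseline is an assumption, chosen because it is the smallest configuration for which H. sapiens timings are reported; the function name `speedups` is hypothetical.

```python
# H. sapiens, k=25, 1 process per node: seconds per node count, from the timings
TIMINGS = {16: 93672, 32: 24934, 64: 5998, 128: 1450, 256: 558, 512: 195}


def speedups(timings: dict) -> dict:
    """Speedup of each run relative to the smallest node count (assumed baseline)."""
    base = timings[min(timings)]
    return {nodes: base / seconds for nodes, seconds in sorted(timings.items())}


for nodes, s in speedups(TIMINGS).items():
    print(f"{nodes:4d} nodes: {s:8.1f}x")
```

Going from 16 to 32 nodes cuts the time by a factor of about 3.8, more than the doubling in node count.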

[Figure: Scalability in terms of node numbers, 1 ppn, 16 to 512 nodes. Caption: Scalability of enumerating 6-mers in H. sapiens]

[Figure: Scalability in terms of node numbers, 1 ppn, 16 to 512 nodes. Caption: Scalability of enumerating 25-mers in H. sapiens]

Conclusion
• Addressed the questions from the biology professor.
• The complicated solution arose from memory restrictions.
• It can handle words of length up to 30.
• It can find often-repeated, rarely occurring, or even non-occurring words.
• It scales relatively well on large cluster machines.
• We recently developed a Java version for small DNA sequences, which was previously "our future work". It can zoom in or out to view distributions and frequencies interactively.

The End

Thank you
