
GeneIndex: an open source parallel program for enumerating and locating words in a genome

Date post: 26-Jun-2015
Page 1: GeneIndex: an open source parallel program for enumerating and locating words in a genome

GeneIndex: an open source parallel program for enumerating and locating words in a genome

Huian Li, David Hart, Matthias Mueller, Ulf Markward, Craig Stewart

August 3, 2009

Page 2: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Contents
• Motivation
• Serial algorithm
• Parallel implementation
• Performance analysis
• Conclusion

Page 3: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Motivation

Question from a Biology professor:

Given a word length, is the computational task of scanning a DNA sequence and recording the positions of all possible words trivial?

5 10 15 20 25 30 * * * * * *

5’ TAGCCGTGGCGGAGCCTCTTGGCTTTGTTTATTC 3’

Page 4: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Serial algorithm
• Straightforward implementation:

Binary coding for A, C, G, T. For example:
A: 00
C: 01
G: 10
T: 11

5 10 15 20 * * * *

T A G C C G T G G C G G A G C C T C T T G ...
110010010110111010011010001001011101111110 ...
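The 2-bit coding above can be sketched in a few lines. This is only an illustration in Python, not the GeneIndex source; the names `CODE`, `encode`, and `decode` are hypothetical, and the sketch assumes the input contains only the letters A, C, G, T.

```python
# 2-bit codes taken from the slide: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string equals the 2-bit code


def encode(word: str) -> int:
    """Pack a DNA word into an integer, two bits per base, first base in the high bits."""
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value


def decode(value: int, k: int) -> str:
    """Inverse of encode for a word of known length k."""
    return "".join(BASES[(value >> (2 * (k - 1 - i))) & 0b11] for i in range(k))


print(format(encode("TAGC"), "08b"))  # 11001001, as on the slide
```

Because each base takes two bits, a word of length k occupies 2k bits, so any word of up to 32 bases fits in a 64-bit integer.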

Page 5: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Serial algorithm
• Given a sequence and a word length k, in order to list all possible words, we scan the sequence once from left to right.

5 10 15 20 * * * *

T A G C C G T G G C G G A G C C T C T T G ...
110010010110111010011010001001011101111110 ...

T A G C  11001001
A G C C  00100101
G C C G  10010110
...
C T T G  01111110
...

Page 6: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Serial algorithm

5 10 15 20 * * * *

T A G C C G T G G C G G A G C C T C T T G ...
110010010110111010011010001001011101111110 ...

T A G C  11001001
A G C C  00100101

ENCODE("AGCC") = 00100101 = ((ENCODE("TAGC") & MASK) << 2) | ENCODE('C')

MASK = 4^(k-1) - 1 = 111111 (in the case of k = 4)
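The update rule above means each new word costs one AND, one SHIFT-LEFT, and one OR, rather than a full re-encoding. A minimal sketch of the left-to-right scan follows; `scan_words` is a hypothetical name, positions are assumed to be 1-based word starts (matching the ruler on the slides), and the input is assumed to contain only A, C, G, T.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def scan_words(sequence: str, k: int):
    """Yield (encoded_word, position) for every k-letter word, scanning left to right."""
    if len(sequence) < k:
        return
    mask = 4 ** (k - 1) - 1  # keeps the low 2*(k-1) bits, i.e. drops the oldest base
    word = 0
    for base in sequence[:k]:  # encode the first word in full
        word = (word << 2) | CODE[base]
    yield word, 1
    for idx in range(k, len(sequence)):
        # drop the leading base, shift, and append the incoming base
        word = ((word & mask) << 2) | CODE[sequence[idx]]
        yield word, idx - k + 2  # 1-based start of the current word


for word, pos in scan_words("TAGCCG", 4):
    print(pos, format(word, "08b"))
```

Note the explicit parentheses around `word & mask`: in C and in Python the shift operator binds more tightly than `&`, so the mask must be applied before shifting.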

Page 7: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Serial algorithm
• This essentially becomes a sorting problem, since each word is now converted into an integer.
• Each word is associated with its position information: (Encoded Word, Position)
• Sorting has to be stable so that, for identical words, their positions will be in a definite order.
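To illustrate why stability matters, here is a small sketch that sorts (Encoded Word, Position) pairs by word only and groups the positions. It relies on Python's built-in `sorted`, which is guaranteed stable; `build_index` is a hypothetical name, and GeneIndex itself uses a hash table rather than this approach.

```python
from itertools import groupby
from operator import itemgetter


def build_index(pairs):
    """Group positions by encoded word; `pairs` is an iterable of (encoded_word, position)."""
    # Sorting on the word alone: a stable sort keeps equal words in scan order,
    # so each position list comes out ascending without a second sort key.
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        word: [pos for _, pos in group]
        for word, group in groupby(ordered, key=itemgetter(0))
    }


pairs = [(0b0101, 1), (0b0001, 2), (0b0101, 3), (0b0001, 4)]
index = build_index(pairs)
print(index)
```

With an unstable sort the positions of a repeated word could come out in any order, and a second pass would be needed to restore them.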

Page 8: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Serial algorithm
Implementation details:
• Words & positions are stored in a long long integer (8 bytes = 64 bits)
• Hash table with a linked list for each entry
• Space required for all words in the given sequence is pre-allocated, instead of calling malloc one by one
• Mostly AND, OR and SHIFT-LEFT operations.

Page 9: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Word frequencies

Page 10: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Word distribution

Page 11: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Motivation for parallel implementation

Another question from Biology professor:

How about the human genome?

Fact: The human genome includes about 3 billion DNA bases.

Page 12: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: input
Large dataset input:
• Each process reads its own partition from the input file.
• Boundary area between neighboring processes has to be considered.

[Illustration: a block of lowercase DNA sequence text (agcatgcatgcatcgatcgatcgatgcatcgatgcatcgatacgatgcatgcta ...) split into contiguous partitions, one per process.]
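One way to handle the boundary area is to let each process read k-1 extra bases past the end of its partition, so that words straddling a boundary are seen exactly once. The sketch below works on an in-memory string rather than a file; `partition_bounds` and `words_with_positions` are hypothetical names, positions are 0-based here, and this is an assumption about how the boundary could be treated, not a description of the GeneIndex code.

```python
def partition_bounds(length: int, k: int, n: int, rank: int):
    """Return 0-based half-open slice bounds (start, stop) for process `rank` of `n`."""
    total_words = length - k + 1              # number of k-letter words in the sequence
    first = rank * total_words // n           # start of the first word owned by this rank
    last = (rank + 1) * total_words // n      # one past the start of the last owned word
    if last == first:                         # this rank owns no words
        return first, first
    return first, last + k - 1                # extend by k-1 bases into the neighbor's area


def words_with_positions(chunk: str, k: int, offset: int = 0):
    """List (word, global 0-based start) for every k-letter word in `chunk`."""
    return [(chunk[i:i + k], offset + i) for i in range(len(chunk) - k + 1)]


sequence = "TAGCCGTGGCGGAGCCTCTTGGCTTTGTTTATTC"
k, n = 4, 3
parallel = []
for rank in range(n):
    start, stop = partition_bounds(len(sequence), k, n, rank)
    parallel.extend(words_with_positions(sequence[start:stop], k, start))
print(parallel == words_with_positions(sequence, k))
```

Without the k-1 overlap, each boundary would lose k-1 words.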

Page 13: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: load balancing
Computation and load balancing:
1. Each process deals with its own piece of data
2. All processes perform global sorting

[X] Straightforward implementation: binary tree merge sorting
[X] Possible solution but could be problematic
Ideal solution leading to load balancing

Page 14: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: load balancing
Straightforward implementation: binary tree merge sorting

[Diagram: binary merge tree. Each of the processes P0, P1, P2, P3 starts with its own sorted word list spanning AAAA: ... TTTT:. The lists are merged pairwise onto P0 and P2, and finally into a single list on P0.]

Initial per-process lists shown at the leaves:

P0: AAAA: 5, 9   AAAC: 22   ...   TTTG: 101   TTTT: 80
P1: AAAA: 19     AAAC: 12   ...   TTTG: 201   TTTT: 26
P2: AAAA: 4, 8   AAAC: 67   ...   TTTG: 88    TTTT: 53
P3: AAAA: 35     AAAC: 46   ...   TTTG: 40    TTTT: 30

Page 15: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: load balancing
Possible solution but could be problematic:
Straightforward solution: partition the word range [0, 4^k) equally, so each process has

$$\left[\, 4^k \cdot \frac{i}{n},\; 4^k \cdot \frac{i+1}{n} \right)$$

where i = 0, 1, ..., n-1

[Diagram: four per-process word lists, each spanning AAAA:, AAAC:, CAAC:, ..., CTTT:, GAAA:, GTTT:, ..., TTTG:, TTTT:, cut into equal word sub-ranges.]
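The equal-range partition can be expressed as a one-line owner function. The sketch below also counts how many words each process would receive, which makes the weakness of this scheme visible on skewed input. The names `range_owner` and `load_per_rank` are hypothetical, and words are assumed to be already encoded as integers in [0, 4^k).

```python
def range_owner(word: int, k: int, n: int) -> int:
    """Rank that owns `word` when [0, 4**k) is cut into n equal sub-ranges."""
    # word lies in [4^k * i/n, 4^k * (i+1)/n)  <=>  i == floor(word * n / 4^k)
    return word * n // 4 ** k


def load_per_rank(words, k: int, n: int):
    """Number of words each rank would receive under the equal-range partition."""
    load = [0] * n
    for word in words:
        load[range_owner(word, k, n)] += 1
    return load


# Encoded 4-letter words: three copies of AAAA (0), one CAAA (64), one TTTT (255).
print(load_per_rank([0, 0, 0, 64, 255], k=4, n=4))  # [3, 1, 0, 1]
```

The word range is split evenly, but the number of words per rank is not, because some words occur far more often than others.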

Page 16: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: load balancing
Implementation of the straightforward solution:
• Problem is that some words occur more often than others, leading to different memory requests for different processes

Example, one word sub-range per column:

AAAA: 5, 9     CAAA: 19       GAAA: 4, 8     TAAA: 35, 93
AAAC: 22, 37   CAAC: 12, 47   GAAC: 67, 72   TAAC: 46
...            ...            ...            ...
ATTG: 101      CTTG: 201      GTTG: 88       TTTG: 40, 87
ATTT: 80       CTTT: 26       GTTT: 53       TTTT: 15, 30

Page 17: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: load balancing
Ideal solution leading to load balancing:
• Partition the total number of words, L-k+1, equally, so each process has (L-k+1)/n words, where L is the length of the given sequence and k is the given word length.

Implementation:
1. After each process has scanned its own piece, we know that:

$$\sum_{i=0}^{n-1} \sum_{x=0}^{4^k-1} f(W_x, P_i) = L - k + 1$$

where i = 0, 1, ..., n-1. (Here f(W_x, P_i) is read as the number of occurrences of word W_x in the piece scanned by process P_i.)

2. We divide the word range [0, 4^k) into many small divisions, d divisions in total (where d >> n):

$$\sum_{i=0}^{n-1} \sum_{j=0}^{d-1} \sum_{x = j \cdot 4^k/d}^{(j+1) \cdot 4^k/d - 1} f(W_x, P_i) = L - k + 1$$

where i = 0, 1, ..., n-1

Page 18: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: load balancing
Implementation:
3. The number of words in each small division is given by:

$$T(i, j) = \sum_{x = j \cdot 4^k/d}^{(j+1) \cdot 4^k/d - 1} f(W_x, P_i)$$

where i = 0, 1, ..., n-1 and j = 0, 1, ..., d-1

4. The total number of words in each small division across all processes will be:

$$T(j) = \sum_{i=0}^{n-1} \sum_{x = j \cdot 4^k/d}^{(j+1) \cdot 4^k/d - 1} f(W_x, P_i)$$

where j = 0, 1, ..., d-1

Page 19: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: load balancing
Implementation:
5. Find an array of boundaries B, such that:

$$\sum_{j = B(m)}^{B(m+1) - 1} T(j) \approx \frac{L - k + 1}{n}$$

where B(0) = 0, B(0<m<n) = any number in (0, d-1), and B(n) = d-1

6. Process P_i should have all words in [B(i), B(i+1)), where i = 0, 1, ..., n-1.
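Steps 2 to 6 can be sketched as follows. This is a simplified single-process illustration: in the parallel program the per-division totals T(j) would first have to be summed across all processes (for example with a reduction), which is not shown. The names `division_counts`, `boundaries`, and `rank_loads` are hypothetical, the greedy prefix-sum search is only one possible way to pick B, and the sketch uses d as the exclusive upper end B(n) so that the half-open ranges cover every division, whereas the slide writes B(n) = d-1.

```python
from bisect import bisect_left
from itertools import accumulate


def division_counts(words, k: int, d: int):
    """T(j): number of encoded words that fall in division j of [0, 4**k)."""
    counts = [0] * d
    for word in words:
        counts[word * d // 4 ** k] += 1
    return counts


def boundaries(T, n: int):
    """Pick B(0..n) so that each [B(m), B(m+1)) holds roughly sum(T)/n words."""
    total = sum(T)
    prefix = list(accumulate(T))  # prefix[j] = T(0) + ... + T(j)
    B = [0]
    for m in range(1, n):
        target = m * total / n
        # first division whose cumulative count reaches the target, made exclusive
        cut = bisect_left(prefix, target) + 1
        B.append(max(B[-1], min(cut, len(T))))
    B.append(len(T))  # exclusive end (assumption; the slide uses d-1)
    return B


def rank_loads(T, B):
    """Number of words assigned to each process under boundaries B."""
    return [sum(T[B[m]:B[m + 1]]) for m in range(len(B) - 1)]


T = [5, 1, 1, 1, 4, 4]      # made-up division totals, d = 6
B = boundaries(T, n=2)
print(B, rank_loads(T, B))  # [0, 4, 6] [8, 8]
```

Choosing d much larger than n gives the search finer granularity, so the per-process loads can get closer to (L-k+1)/n. A single very frequent word still cannot be split across divisions.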

Page 20: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: load balancing

[Diagram: four per-process word lists, each spanning AAAA:, AAAC:, CAAC:, ..., CTTT:, GAAA:, GTTT:, ..., TTTG:, TTTT:, divided into sub-ranges.]

Page 21: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Parallel implementation: output
Output:
• Each process creates its own output file
• If necessary, all files can be concatenated into one single file, while keeping the order

AAAA: 5, 9, 10
AAAC: 22, 37
... ...
TTTG: 15, 30
TTTT: 40, 87

Page 22: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Testbed

Page 23: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Testbed specification
• Consists of 768 IBM JS21 blades
• On each blade:
  • 2 dual-core PowerPC CPUs @ 2.5 GHz
  • 8 GB memory
  • SUSE Linux Enterprise Server 9 (ppc)
• Interconnect: Myrinet
• Parallel environment: MPI

Page 24: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Performance analysis

Nodes  Processes  D. melanogaster    H. sapiens
                  k=6     k=25       k=6     k=25
1      1          95      23203      -       -
2      2          53      5770       -       -
4      4          29      1500       -       -
8      8          15      386        -       -
16     16         9       107        212     93672
32     32         6       32         118     24934
64     64         5       14         73      5998
128    128        9       11         59      1450
256    256        11      15         49      558
512    512        18      22         71      195

Timings (in seconds) of runs against two datasets on BigRed using 1 PPN. A dash marks a cell with no value on the slide.

Page 25: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Performance analysis

Nodes  Processes  D. melanogaster    H. sapiens
                  k=6     k=25       k=6     k=25
1      2          54      5958       -       -
2      4          31      1545       -       -
4      8          17      400        -       -
8      16         11      115        -       -
16     32         8       40         156     25661
32     64         6       19         101     6233
64     128        7       12         82      1506
128    256        25      16         61      576
256    512        20      27         81      198
512    1024       34      37         115     170

Timings (in seconds) of runs against two datasets on BigRed using 2 PPN. A dash marks a cell with no value on the slide.

Page 26: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Performance analysis

Nodes  Processes  D. melanogaster    H. sapiens
                  k=6     k=25       k=6     k=25
1      4          37      1738       -       -
2      8          23      453        -       -
4      16         15      130        -       -
8      32         10      46         -       -
16     64         9       23         163     6649
32     128        11      17         121     1632
64     256        20      25         96      634
128    512        39      42         120     233
256    1024       79      90         180     187
512    2048       134     131        270     281

Timings (in seconds) of runs against two datasets on BigRed using 4 PPN. A dash marks a cell with no value on the slide.

Page 27: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Performance analysis

[Figure: Scalability in terms of node numbers. Speedup (y-axis, 0 to 40) versus number of nodes (x-axis: 16, 32, 64, 128, 256, 512), with one curve each for 1 ppn, 2 ppn, and 4 ppn.]

Scalability of enumerating 6-mers in H. sapiens

Page 28: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Performance analysis

[Figure: Scalability in terms of node numbers. Speedup (y-axis, 0 to 600) versus number of nodes (x-axis: 16, 32, 64, 128, 256, 512), with one curve each for 1 ppn, 2 ppn, and 4 ppn.]

Scalability of enumerating 25-mers in H. sapiens

Page 29: GeneIndex: an open source parallel program for enumerating and locating words in a genome

Conclusion
• Addressed questions from the biology professor
• The complicated solution arose from memory restrictions.
• It can handle words of length up to 30.
• It can find often-repeated words, rarely occurring words, or even non-occurring words.
• It scales relatively well on large cluster machines.
• We recently developed a Java version for small DNA sequences, which was "our future work". It can zoom in or zoom out to view distribution and frequencies interactively.

Page 30: GeneIndex: an open source parallel program for enumerating and locating words in a genome

The End

Thank you

