Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit
Accelerating Genome Assembly with Power8
Seung-Jong Park, Ph.D.
School of EECS, CCT, Louisiana State University
Agenda
The Genome Assembly Problem
Accelerating Graph Construction with POWER8
Accelerating Graph Simplification with CAPI Flash
4/1/2016
The Genome Assembly Problem
Challenges for Genome Assemblers
NGS Technologies Outpaced Moore’s Law
Software with Extreme Scalability
HPC Platform
• More Compute Cycles
• Extreme I/O Performance
• Huge Storage Space
[Figure: Genome → NGS → Reads (TBs) → HPC (data and compute intensive) → Re-constructed Genome (MBs/GBs)]
MapReduce-based Graph Construction
[Figure: Input reads (TAGTCGAGGCT, GGCTTTAGATC, TGAGGCTTTAG, TTTAGAGACAG, GATCCGATGAG) → Map emits k-mer:next-base pairs (e.g. TAGT:C, AGTC:G, GGCT:N) → Reduce merges the pairs for each k-mer (e.g. TAGA:G and TAGA:T become TAGA:G,T)]
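The map and reduce steps in the figure can be sketched in a few lines of Python. This is a minimal single-process illustration, not the authors' Hadoop code: the function names (`map_read`, `shuffle`, `reduce_kmer`, `build_graph`) are hypothetical, k = 4 is taken from the toy example, and the rule that the end-of-read marker `N` is dropped whenever a real successor exists is inferred from the slide's output (e.g. `GGCT:N` and `GGCT:T` reduce to `GGCT:T`).

```python
from collections import defaultdict

K = 4  # k-mer length used in the slide's toy example

# The five distinct reads shown in the figure.
READS = [
    "TAGTCGAGGCT",
    "GGCTTTAGATC",
    "TGAGGCTTTAG",
    "TTTAGAGACAG",
    "GATCCGATGAG",
]


def map_read(read, k=K):
    """Map step: emit (k-mer, next base) for every k-mer in one read.

    The last k-mer of a read has no following base, so it emits 'N'.
    """
    for i in range(len(read) - k + 1):
        next_base = read[i + k] if i + k < len(read) else "N"
        yield read[i:i + k], next_base


def shuffle(pairs):
    """Group emitted values by k-mer, standing in for the MapReduce shuffle."""
    grouped = defaultdict(set)
    for kmer, next_base in pairs:
        grouped[kmer].add(next_base)
    return grouped


def reduce_kmer(next_bases):
    """Reduce step: keep distinct successors; 'N' survives only if alone (assumed rule)."""
    real = sorted(b for b in next_bases if b != "N")
    return ",".join(real) if real else "N"


def build_graph(reads, k=K):
    """Run map, shuffle, and reduce over all reads and return {k-mer: successors}."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    return {kmer: reduce_kmer(bases) for kmer, bases in shuffle(pairs).items()}


graph = build_graph(READS)
for kmer in sorted(graph):
    print(f"{kmer}:{graph[kmer]}")
```

In a real Hadoop job the shuffle is the expensive part, because every k-mer of every read crosses the network and the disks, which is consistent with the shuffle sizes reported for the data sets in this deck.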
Accelerating Graph Construction with POWER8
Experimental Test Beds
System Type | IBM PKY Cluster | LSU SuperMikeII
Processor | Two 10-core IBM Power8 | Two 8-core Intel Sandy Bridge Xeon
Maximum #Nodes used in various experiments | 40 | 120
#Physical cores/node | 20 (8-way simultaneous multithreading) | 16 (Hyper-Threading disabled)
#vcores/node | 160 | 16
RAM/node (GB) | 256 | 32
#Disks/node | 5 | 3
#Disks/node used for shuffled data | 3 | 1
Total storage space/node used for shuffled data | 1.8 | 0.5
Network | 56Gbps InfiniBand (non-blocking) | 40Gbps InfiniBand (2:1 blocking)
Datasets
Genome data set | Input size | Shuffle data size | Output size
Rice genome | 12GB | 70GB | 50GB
Bumble bee genome | 90GB | 600GB | 95GB
Metagenome | 3.2TB | 20TB | 8.6TB
Hadoop Configurations
Hadoop Parameters | IBM Power8 | SuperMikeII
yarn.nodemanager.resource.cpu-vcores | 120 | 16
yarn.nodemanager.resource.memory-mb | 231000 | 29000
mapreduce.{map,reduce}.cpu.vcores | 4 | 2
mapreduce.{map,reduce}.memory.mb | 7000 | 3500
mapreduce.{map,reduce}.java.opts | 6500m | 3000m
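The settings above bound how many map or reduce tasks can run at once on each node. The sketch below works through that arithmetic; the helper name `containers_per_node` is hypothetical, and it assumes YARN enforces both the CPU and the memory limit and ignores the container used by the ApplicationMaster.

```python
def containers_per_node(node_vcores, node_mem_mb, task_vcores, task_mem_mb):
    """Upper bound on concurrent task containers, limited by CPU and by memory."""
    by_cpu = node_vcores // task_vcores
    by_mem = node_mem_mb // task_mem_mb
    return min(by_cpu, by_mem)


# Values taken from the Hadoop Configurations table.
power8 = containers_per_node(120, 231000, 4, 7000)
supermike = containers_per_node(16, 29000, 2, 3500)

print("IBM Power8 node:", power8, "concurrent tasks")
print("SuperMikeII node:", supermike, "concurrent tasks")
```

Under these assumptions the Power8 configuration is CPU-bound (120 / 4 vcores) while the SuperMikeII configuration hits the CPU and memory limits at the same task count.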
Hadoop Scalability with POWER8 SMTs
Tested with the small rice genome data set on 2 nodes
Almost linear scalability with increasing SMT threads
Rice Genome
Analyzing the small (12GB) data set
Small data size eliminates the impact of network and disk I/O
7.5x performance improvement per server
Bumble Bee Genome
Analyzing the medium-size (90GB) bumble bee genome
7.5x improvement in performance per server
Metagenome
Analyzing huge (3.2TB) metagenome data
Only 6.5 hours on 40-node IBM Power8 cluster
More than 9x improvement in terms of performance per server
Graph Simplification with Distributed NoSQL
TAGA:G,T
TAGT:C
TCCG:A
TCGA:G
TGAG:G
TTAG:A
TTTA:G
ACAG:N
AGAC:A
AGAG:A
AGAT:C
AGGC:T
AGTC:G
ATCC:G
ATGA:G
GACA:G
GAGA:C
GAGG:C
GATC:C
GATG:A
GCTT:T
GGCT:T
GTCG:A
CCGA:T
CGAG:G
CGAT:G
CTTT:A
Simplified (compacted) paths: TAGTCGAG, GAGGCTTTAGA
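The two strings at the end of the k-mer list are what you get by merging unbranched chains of k-mers into single sequences. The following is a minimal in-memory sketch of that compaction, not the authors' distributed NoSQL implementation. It assumes the merge rule that an edge u → v is collapsed only when u has exactly one successor and v has exactly one predecessor; the names `successors` and `compact` are hypothetical.

```python
# The reduced k-mer:successor list from the slide.
EDGES = """TAGA:G,T TAGT:C TCCG:A TCGA:G TGAG:G TTAG:A TTTA:G ACAG:N AGAC:A
AGAG:A AGAT:C AGGC:T AGTC:G ATCC:G ATGA:G GACA:G GAGA:C GAGG:C GATC:C
GATG:A GCTT:T GGCT:T GTCG:A CCGA:T CGAG:G CGAT:G CTTT:A"""

GRAPH = {}
for entry in EDGES.split():
    kmer, bases = entry.split(":")
    # 'N' marks "no successor", so it contributes no outgoing edge.
    GRAPH[kmer] = [b for b in bases.split(",") if b != "N"]


def successors(graph, kmer):
    """Neighbouring k-mers reached by shifting one base to the right."""
    return [kmer[1:] + base for base in graph.get(kmer, [])]


def compact(graph):
    """Merge unbranched chains of k-mers into single sequences."""
    preds = {kmer: [] for kmer in graph}
    for u in graph:
        for v in successors(graph, u):
            preds.setdefault(v, []).append(u)

    def mergeable(u, v):
        # Assumed rule: collapse u -> v only if it is u's sole exit and v's sole entry.
        return len(successors(graph, u)) == 1 and len(preds[v]) == 1

    def walk(start, visited):
        seq, node = start, start
        visited.add(start)
        while True:
            nxt = successors(graph, node)
            if len(nxt) != 1 or nxt[0] in visited or not mergeable(node, nxt[0]):
                return seq
            node = nxt[0]
            visited.add(node)
            seq += node[-1]  # each merged k-mer adds one new base

    visited, paths = set(), []
    # A chain starts at any node that cannot be merged into its predecessor.
    starts = [v for v in preds
              if not (len(preds[v]) == 1 and mergeable(preds[v][0], v))]
    for v in starts:
        paths.append(walk(v, visited))
    for v in preds:  # any node still unvisited lies on an isolated cycle
        if v not in visited:
            paths.append(walk(v, visited))
    return paths


for path in sorted(compact(GRAPH)):
    print(path)
```

On this toy graph the sketch yields four paths, including the two shown on the slide; chains end at GAGG (two predecessors) and at TAGA (two successors).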
Accelerating Simplification with IBM CAPI Flash
[Figure: NoSQL I/O throughput (keys/sec) and CAPI Flash I/O throughput (bytes/sec)]
Only 20 Power8 cores + CAPI: 500GB graph traversal in 7.5 hours
Computational Challenges – The Next Step
Graph building is the most expensive phase in terms of time and resources
The Obvious Solutions: Either use a single machine with LOTS of memory, or run on a cluster.
Idea: Use CAPI accelerated flash instead of main memory
Graph Construction on IBM CAPI Flash
[Figure: Input reads (TAGTCGAGGCT, GGCTTTAGATC, TGAGGCTTTAG, TTTAGAGACAG, GATCCGATGAG) → Map emits k-mer:next-base pairs, grouped into partitions by leading base (G; A and C; T) → each partition is sorted and duplicate k-mers are merged (e.g. TAGA:G and TAGA:T become TAGA:G,T) → results are stored through the NoSQL data engine APIs]
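The sort-based pipeline in the figure can be sketched as follows. This is an illustrative stand-in only: plain Python lists take the place of the CAPI flash store and the NoSQL data engine APIs, which are not shown here, and the function names are hypothetical. The three-way partitioning by leading base and the handling of the `N` marker follow what the figure shows for k = 4.

```python
from itertools import groupby

K = 4  # k-mer length used in the slide's toy example

READS = [
    "TAGTCGAGGCT",
    "GGCTTTAGATC",
    "TGAGGCTTTAG",
    "TTTAGAGACAG",
    "GATCCGATGAG",
]

# Partition layout as it appears in the figure: G | A and C | T.
PARTITION_OF = {"G": 0, "A": 1, "C": 1, "T": 2}


def map_read(read, k=K):
    """Emit (k-mer, next base) pairs; the last k-mer of a read gets 'N'."""
    for i in range(len(read) - k + 1):
        next_base = read[i + k] if i + k < len(read) else "N"
        yield read[i:i + k], next_base


def build_partitions(reads):
    """Route every emitted pair to a partition chosen by the k-mer's first base."""
    parts = [[], [], []]
    for read in reads:
        for kmer, next_base in map_read(read):
            parts[PARTITION_OF[kmer[0]]].append((kmer, next_base))
    return parts


def sort_and_merge(pairs):
    """Sort one partition, then merge adjacent records that share a k-mer."""
    merged = []
    for kmer, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        bases = sorted({base for _, base in group})
        real = [b for b in bases if b != "N"]  # 'N' kept only if no real successor
        merged.append((kmer, ",".join(real) if real else "N"))
    return merged


partitions = [sort_and_merge(p) for p in build_partitions(READS)]
for part in partitions:
    print(" ".join(f"{kmer}:{bases}" for kmer, bases in part))
```

Because each partition is sorted and merged independently, only one partition needs to be held in memory at a time, which matches the idea of using flash rather than main memory for the bulk of the data.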
Initial Results of Graph Construction
Compared 85GB bumblebee dataset on 8-node Hadoop cluster vs. a single node with CAPI-accelerated flash.
Hadoop Cluster (20 physical cores per node)
• Peak memory usage of 60GB per datanode
• 1 HDD per datanode
• 1 hr 56 mins
CAPI Accelerated Flash server (20 physical cores)
• Peak memory usage of 7GB
• 1 HDD and 1 CAPI card
• 3 hrs 44 mins
• Peak memory usage reduced by 60 times.
• Execution time reduced by 3.5 times per node.