BioPig for scalable analysis of big sequencing data

Date post: 25-May-2015
Upload: zhong-wang
Description:
This talk was adapted from my presentation at the Finishing in the Future 2011 meeting, Santa Fe, NM.
Transcript
Page 1: BioPig for scalable analysis of big sequencing data

BioPig: Hadoop-based Analytic Toolkit for Next-Generation Sequence Data

Zhong Wang, Ph.D.
Computational Biology Staff Scientist

Page 2: BioPig for scalable analysis of big sequencing data

Cellulase

The deep metagenome approach to discover cellulases for biofuel research

Page 3: BioPig for scalable analysis of big sequencing data

Large data, large reward

http://www.cazy.org/

Only 1% shared (>=95% identity)
50% validated activity

Science. 2011 Jan 28;331(6016):463-7.

Page 4: BioPig for scalable analysis of big sequencing data

Sequence data

More data would be even better

Page 5: BioPig for scalable analysis of big sequencing data

Rumen (2009): 17 Gb
Rumen (2010): 250 Gb
Rumen (2012): 1000 Gb

But, can analysis keep up with data growth?

Page 6: BioPig for scalable analysis of big sequencing data

Ideal solutions for the terabase problem

1. Scalable to 1 Tb?
2. Performance (within hours)?

Page 7: BioPig for scalable analysis of big sequencing data

High-Mem cluster

Input/Output (IO)
Memory

Page 8: BioPig for scalable analysis of big sequencing data

MP/MPI solution: k-mer counting

[Diagram, with labels 1 to 4: Raw data → Data slices → Each node/core has data and table slices → Count table]
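The slide's diagram can be read as a partitioned count: the raw reads are cut into data slices, and each worker also owns a slice of the k-mer table. The sketch below simulates that idea in a single Python process. It is not the MPI code from the talk: the round-robin slicing, the byte-sum `owner` function, and all function names are assumptions made for illustration, and a real MPI version would exchange k-mers between ranks over the network instead of writing into local lists.

```python
from collections import Counter


def kmers(read, k):
    """Return every overlapping substring of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def owner(kmer, n_workers):
    """Pick the worker that owns this k-mer's table entry.

    Hypothetical partitioning rule: a byte sum is used because it is
    deterministic across runs (Python's built-in hash() is salted for str).
    """
    return sum(kmer.encode()) % n_workers


def partitioned_count(reads, k, n_workers):
    # Cut the raw data into one slice per worker (round-robin, an assumption).
    data_slices = [reads[w::n_workers] for w in range(n_workers)]
    # Each worker holds one slice of the overall count table.
    table_slices = [Counter() for _ in range(n_workers)]
    for data_slice in data_slices:
        for read in data_slice:
            for kmer in kmers(read, k):
                # In MPI this would be a message to the owning rank.
                table_slices[owner(kmer, n_workers)][kmer] += 1
    # The final count table is the union of the disjoint table slices.
    count_table = Counter()
    for table in table_slices:
        count_table.update(table)
    return table_slices, count_table


table_slices, count_table = partitioned_count(["attagc", "gttagg"], k=4, n_workers=4)
```

Because every k-mer is routed to exactly one owner, the table slices never overlap, which is what lets the table grow beyond the memory of a single node.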

Page 9: BioPig for scalable analysis of big sequencing data

MP/MPI performance

MPI version
• 412 Gb, 4.5B reads
• 2.7 hours on 128x24 cores
• NERSC Hopper II

MP threaded version
• 268 Gb, 3B reads
• 5 days on 32 cores
• High-Mem cluster

Fast, scalable

Problems:
• Needs experienced software engineers
• Six months of development time
• If one node fails, all fail
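As a rough reading of the two runs above, the throughput can be worked out from the slide's own numbers. This is only a back-of-the-envelope sketch: it assumes "128x24 cores" means 3,072 cores in total and "5 days" means 120 hours, and the two runs used different datasets and machines, so the comparison is indicative at best.

```python
def throughput(gigabases, hours, cores):
    """Return (Gb per hour, Gb per core-hour) for one run."""
    per_hour = gigabases / hours
    per_core_hour = per_hour / cores
    return per_hour, per_core_hour


# Figures taken from the slide; the core total 128 * 24 is an assumption.
mpi_per_hour, mpi_per_core_hour = throughput(412, 2.7, 128 * 24)
mp_per_hour, mp_per_core_hour = throughput(268, 5 * 24, 32)

print(f"MPI:      {mpi_per_hour:.1f} Gb/h, {mpi_per_core_hour:.4f} Gb/core-hour")
print(f"Threaded: {mp_per_hour:.1f} Gb/h, {mp_per_core_hour:.4f} Gb/core-hour")
print(f"Wall-clock throughput ratio: {mpi_per_hour / mp_per_hour:.1f}x")
```

Under these assumptions the MPI run processes roughly 153 Gb per hour against about 2.2 Gb per hour for the threaded run, while the per-core figures are of the same order. That is consistent with the slide's point that MPI is fast and scalable but expensive to build.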

Page 10: BioPig for scalable analysis of big sequencing data

Hadoop/Map Reduce framework

• Google MapReduce
– Data-parallel programming model to process petabyte-scale data
– Generally has a map and a reduce step
• Apache Hadoop
– Distributed file system (HDFS) and job handling for scalability and robustness
– Data locality to bring compute to data, avoiding the network transfer bottleneck

Page 11: BioPig for scalable analysis of big sequencing data

Programmability: Hadoop vs Pig

Finding the top 5 websites young people visit
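The query on this slide is a short dataflow: filter users by age, join them with page visits, group by site, count, sort, and keep five. The Python sketch below shows that logic on in-memory lists so the steps are visible. The record layouts, the 18 to 25 age range, and the function name are assumptions for illustration; the transcript does not preserve the slide's actual Hadoop or Pig code.

```python
from collections import Counter


def top_sites(users, visits, min_age=18, max_age=25, n=5):
    """Return the n most visited URLs among users in the given age range.

    users:  iterable of (name, age) pairs    (hypothetical layout)
    visits: iterable of (name, url) pairs    (hypothetical layout)
    """
    # Filter: keep only the young users.
    young = {name for name, age in users if min_age <= age <= max_age}
    # Join + group + count: tally URLs visited by those users.
    clicks = Counter(url for name, url in visits if name in young)
    # Order by count descending and limit to n.
    return clicks.most_common(n)


users = [("amy", 22), ("bob", 40), ("cat", 19)]
visits = [("amy", "a.com"), ("bob", "b.com"), ("cat", "a.com"), ("cat", "c.com")]
result = top_sites(users, visits)
```

A high-level dataflow language expresses these same few steps directly, whereas hand-written MapReduce needs separate jobs and glue code for the join, the count, and the sort.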

Page 12: BioPig for scalable analysis of big sequencing data

BioPig: design goals

• Flexible
– Every dataset is unique, and data analysts have domain knowledge that is essential to optimize the analysis
– Pluggable modules that analysts can use to build custom analytic pipelines
• High-Level
– A domain-specific language enables data analysts to create custom pipelines
– Hides the details of parallelism (too complex for most people)
• Scalability
– Leverages data parallelism to speed up analytics
– Integrates external tools and applications where necessary
– Scales from 1 to hundreds of compute nodes with minimal effort and linear scalability
• Robustness
– Data and computation are replicated across nodes to combat failures

BioPIG

Page 13: BioPig for scalable analysis of big sequencing data

Runs on any hardware supporting Hadoop

• JGI Titanium (commodity Hadoop cluster)
– Up to 20 nodes, each with 16 cores, 32 GB RAM, 1.799 GHz, 1G Ethernet
• NERSC Magellan Cloud Testbed
– Up to 200 nodes, each with 8 cores, 24 GB RAM, 2.67 GHz Nehalem processors, 10 Gbit InfiniBand, GPFS
• Amazon AWS
– Elastic MapReduce with cluster compute nodes (23 GB of memory, 2 x Intel quad-core “Nehalem” architecture, 1690 GB of instance storage, 10G Ethernet)

Page 14: BioPig for scalable analysis of big sequencing data

BioPig Modules

Blast

Input/Output (FASTA, FASTQ)

K-mer Counter

Assembly

Page 15: BioPig for scalable analysis of big sequencing data

How k-mer count is implemented

Load: <id1, header, ‘attagc’>, <id2, header, ‘gttagg’>

Mapper: <id1, ‘atta’>, <id1, ‘ttag’>, <id2, ‘gtta’>, <id2, ‘ttag’>

Shuffle/sort: <‘atta’, id1>, <‘ttag’, id1, id2>, <‘gtta’, id2>, <‘tagg’, id2>

Reducer: <‘atta’, 1>, <‘ttag’, 2>, <‘gtta’, 1>, <‘tagg’, 1>

Merge: <‘atta’, 3>, <‘ttag’, 2>, <‘gtta’, 2>, <‘tagg’, 1>
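The stages on this slide can be sketched in plain Python to make the data flow concrete. This is an illustration of the map, shuffle, reduce, and merge pattern under stated assumptions, not BioPig's implementation: the function names are hypothetical, everything runs in one process, and the mapper here emits every 4-mer of each read, so `tagc` appears even though the slide's abbreviated example omits it.

```python
from collections import Counter, defaultdict


def map_phase(records, k):
    """Emit one (k-mer, read id) pair per overlapping k-mer in each read."""
    for read_id, _header, seq in records:
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k], read_id


def shuffle(pairs):
    """Group read ids by k-mer, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for kmer, read_id in pairs:
        groups[kmer].append(read_id)
    return groups


def reduce_phase(groups):
    """Count the occurrences of each k-mer."""
    return {kmer: len(read_ids) for kmer, read_ids in groups.items()}


def merge(partial_counts):
    """Sum per-reducer count tables into one table."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return dict(total)


records = [("id1", "header", "attagc"), ("id2", "header", "gttagg")]
counts = reduce_phase(shuffle(map_phase(records, k=4)))

# Hypothetical second partial table, standing in for another reducer's output.
merged = merge([counts, {"atta": 2, "gtta": 1}])
```

With the made-up second table, `merged` gives 3 for `atta` and 2 for `gtta`, matching the shape of the slide's Merge row.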

Page 16: BioPig for scalable analysis of big sequencing data

A 7-liner BioPig script for k-mer counting

Page 17: BioPig for scalable analysis of big sequencing data

Rumen metagenome gene discovery pipeline

1. Read preprocess (remove artifacts)
2. pigBlast (blast reads against known cellulases)
3. pigAssembler (assemble reads into contigs)
4. pigExtender (extend contigs into full-length enzymes)

Page 18: BioPig for scalable analysis of big sequencing data

Cloud solution to large data

BioPIG: BioPig-Blaster, BioPig-Assembler, BioPig-Extender

BioPig: 61 lines of code
MPI-extender: ~12,000 lines (vs 31 in BioPig)

Flexibility, Programmability, Scalability

Page 19: BioPig for scalable analysis of big sequencing data

Conclusions

Hadoop-based BioPig shows great potential for scalable analysis of very large sequence data. It is robust and easy to use.

Page 20: BioPig for scalable analysis of big sequencing data

Challenges in application

• IO optimization, e.g., reduce data copying
• Some problems do not easily fit into the map/reduce framework, e.g., graph-based algorithms
• Integration into existing frameworks, e.g., Galaxy

Page 21: BioPig for scalable analysis of big sequencing data

Acknowledgement

• Karan Bhatia
• Henrik Nordberg
• Kai Wang
• Rob Egan
• Alex Sczyrba
• Jeremy Brand @JGI/NERSC
• Shane Cannon @NERSC
