
Best practices at BGI for the Challenges in the Era of Big Genomics Data

Date post: 27-Nov-2014
Category:
Upload: bwainecho
Description:
My presentation for the workshop about the Best Practice Award BioIT on TriCon 2013

Xing Xu, Ph.D., Director of Cloud Computing Product

Challenges in the Era of Big Genomic Data and Our Practices in BGI

Topics for Today

About BGI

Challenges and Solutions
- Data transfer
- Cloud Computing
- Computational Algorithms and Infrastructure
- Data Storage

BGI

The world's largest genome sequencing center
- Started with the Human Genome Project in 1999 with only a few sequencers.
- Now more than 150 sequencers, with 6 TB/day sequencing throughput.

MODEL                  INSTALLATION
ABI 3730XL             16
Roche 454              1
ABI SOLiD 4            27
Solexa GA IIx          6
Illumina HiSeq 2000    135

BGI

The largest computing and storage center for genomics in China
- 20,000+ CPU cores
- 19 NVIDIA GPUs
- 220+ Tflops peak performance
- 17 PB data storage
- Storage and computation capability has increased 10,000-fold
- Still increasing …

BGI

One of the world's leading research institutes in genomics

Since 2007:
- 253 papers in high-impact journals, including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors
- 369 patent applications
- 254 software authorships

BGI

BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.

Challenges for Handling Big Data
- Exponential growth of data volume
- Complicated data analysis process
- Widely distributed data (images from omicsmaps.com)

BGI

Challenges and Solutions

Data transfer

Cloud Computing

Computational Algorithms and Infrastructure

Data Management


Solutions for data transfer

Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer

Solutions for data transfer: High speed data transfer

Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June 2012.

A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
- Right software: Aspera fasp data transfer protocol
- Right infrastructure: 10 Gb link between the US and China
- Right technology: RAM disk, IPv6
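As a quick sanity check on these figures, the average rate implied by a file size and a transfer time can be computed directly. The sketch below is illustrative only: the function name is hypothetical and it assumes decimal units (1 GB = 8 Gbit). For 24 GB in 30 seconds the average works out to 6.4 Gbit/s, so the ~8 Gbit/s on the slide presumably refers to a peak or differently measured rate.

```python
def average_gbps(size_gb: float, seconds: float) -> float:
    """Average throughput in Gbit/s, assuming 1 GB = 8 Gbit (decimal units)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_gb * 8 / seconds


rate = average_gbps(24, 30)       # figures quoted on the slide
link_utilization = rate / 10      # fraction of the 10 Gbit/s link
print(f"{rate:.1f} Gbit/s average, {link_utilization:.0%} of a 10 Gbit/s link")
```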

Solutions for data transfer: limits of high speed transfer

[Diagram: one Aspera server at BGI serving several Aspera clients]
- BGI (Aspera server): software license, expensive physical bandwidth
- Clients (Aspera client): free, but the bottleneck is on the client site
- Not a good solution for sharing

Solutions for data transfer

Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don’t move the data (Cloud Computing)

Solutions for cloud

Cloud Computing
- EasyGenomics, a Software as a Service (SaaS) platform for NGS data analysis

EasyGenomics™

EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.

[Platform diagram] Components:
- Algorithms, workflows, reports
- Computational resources
- Database, data management
- Web portal, simple UI
- High speed connection

A typical use case

Bioinformatics Workflow

Four steps: Upload, Create a Sample, Perform Analyses, Download Results

Algorithms: Carefully chosen, tested and optimized

Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly

Homepage

Four task portals

Status of recent works

Warning and Logging

Navigation Tabs

Sequencing Quality Report


Mapping Report


Create an Analysis

Selected sample(s)

• One selected sample => Single Analysis

• Multiple selected samples => Batch Analyses

Create an Analysis

Selectable modules

Predefined Settings

Shortcut

What’s new?

An internal version of EG is running automatically as a production system.

It integrates the new data delivery portal of the sequencing service.
- Aspera fasp download
- Accessible to all workflows on EasyGenomics

You can choose to deliver data to the EasyGenomics platform

Configuration file

Import Data from Sequencing Service


Imported Samples

Solutions for cloud

Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution

Two paths for the future cloud solution

From Software as a Service (SaaS) to Platform as a Service (PaaS)
To give flexibility to research users:
- Add their own tools (any tools)
- Integrate their own workflows (different combinations of modules)

One-Click SaaS solution
To give an automated solution to clinical users:
- Automated solution for repetitive work
- Fulfill very specific functions

Solutions for Algorithm and Infrastructure

Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)

• Fast Parallel Framework: Hadoop Streaming

• Reliable Storage System: HDFS

• Scalable Map/Reduce framework
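To make the map/reduce pattern behind these bullets concrete, here is a minimal local sketch in Python. It is not Gaea code: the tab-separated input format (read id, chromosome, position) and the function names are assumptions made for illustration. In a Hadoop Streaming job the mapper and reducer would read from standard input and the framework would do the sort between them; here `run_local` simulates that step.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (chromosome, 1) for every record; the input format is hypothetical."""
    for line in lines:
        read_id, chrom, pos = line.rstrip("\n").split("\t")
        yield chrom, 1


def reducer(sorted_pairs):
    """Sum the values of each key; pairs must arrive sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_local(lines):
    """Simulate the shuffle/sort that Hadoop performs between map and reduce."""
    return dict(reducer(sorted(mapper(lines))))


records = ["r1\tchr1\t100", "r2\tchr2\t250", "r3\tchr1\t900"]
counts = run_local(records)
print(counts)
```

Because the reducer only sees one key group at a time, the same logic scales out when the groups are spread over many nodes.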

[Pipeline diagram: resequencing steps and the tools available for each]
Raw Data
QC: SOAP-GaeaQC
Mapping: SOAPaligner, BWA, Bowtie, SOAP-GaeaAlignment
Remove PCR duplicates: SOAP-GaeaMarkDuplicate
Realignment: SOAP-GaeaRealignment
Identify Variations: SNP (SOAPsnp, SOAP-GaeaSNP, SAMtools), InDel (Dindel, SOAP-GaeaIndel)
Selection & Annotation

SOAP-Gaea: Hadoop based resequencing pipeline

[Diagram: MapReduce alignment. Labels: Reads, Reference, Key, Value, PositionMap, Aligning, Reduce]
- Distributed indexing for load balancing
- Flexible splitting tolerates more mismatches
- Dynamic programming for robust gap alignment
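For readers unfamiliar with dynamic-programming alignment, the sketch below shows the textbook global alignment score (Needleman-Wunsch with a linear gap penalty). It is only an illustration of the general technique; the slides do not describe SOAP-Gaea's actual algorithm or scoring, and the scores used here (+1 match, -1 mismatch, -2 gap) are arbitrary assumptions.

```python
def alignment_score(read: str, ref: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of read against ref (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of
    # `read` with the first j characters of `ref`.
    prev = [j * gap for j in range(len(ref) + 1)]
    for i in range(1, len(read) + 1):
        curr = [i * gap]
        for j in range(1, len(ref) + 1):
            same = read[i - 1] == ref[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(alignment_score("ACGT", "ACGT"))  # identical sequences
print(alignment_score("ACGT", "ACT"))   # one base missing from the reference
```

Keeping only two rows of the table keeps memory proportional to the reference length, which matters when many reads are aligned in parallel.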

SOAP-Gaea: Hadoop based resequencing pipeline

[Bar chart: analysis time, old pipeline vs. cloud-based pipeline]
- Old pipeline: two weeks
- Cloud-based pipeline: within 15 hrs (120 cores)

Data: Human 60X whole genome re-sequencing

Fast and Scalable

• The Hadoop implementation provides great scalability.
• Simply by providing more resources, the analysis can finish much faster.

SOAP-GaeaAlignment (one human sample from the 1000 Genomes Project)

Software              Mapping Rate   Confident Mapping Rate (MAPQ>=10)
Stampy                85.93%         70.00%
SOAP2                 79.14%         79.14%
Novoalign             82.53%         79.74%
BWA                   91.54%         84.78%
Bowtie                81.15%         81.15%
SOAP-GaeaAlignment    91.75%         85.20%

It’s not only FAST, but also ACCURATE
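The two columns in the table can be computed from per-read alignment results roughly as follows. This is a minimal sketch: the `(is_mapped, mapq)` tuples are a hypothetical simplification of an alignment record, and the function name is made up for illustration.

```python
def mapping_rates(reads, min_mapq=10):
    """Return (mapping rate, confident mapping rate) as percentages.

    `reads` is a list of (is_mapped, mapq) tuples, a hypothetical
    simplification of the fields found in an alignment record.
    """
    total = len(reads)
    if total == 0:
        raise ValueError("no reads given")
    mapped = sum(1 for is_mapped, _ in reads if is_mapped)
    confident = sum(1 for is_mapped, mapq in reads
                    if is_mapped and mapq >= min_mapq)
    return 100 * mapped / total, 100 * confident / total


reads = [(True, 60), (True, 3), (True, 25), (False, 0)]
rate, confident_rate = mapping_rates(reads)
print(f"mapping rate {rate:.2f}%, confident {confident_rate:.2f}%")
```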

SOAP-Hecate: Distributed de novo Genome Assembly

[Workflow diagram]
- Stages: Contig Extension, Scaffolding, Gap closing
- Labels under Assembly: constructing the de Bruijn graph, solving tiny repeats, merging bubbles, scaffolding, merging contigs
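As background for the de Bruijn graph step, the sketch below builds a small graph from reads in the textbook way: every k-mer contributes an edge from its first k-1 bases to its last k-1 bases. It is a single-machine illustration under that standard definition, not SOAP-Hecate's distributed implementation, and the reads are made up.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


graph = de_bruijn(["ACGTC", "CGTCA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Repeated edges are kept here, so their count can serve as a simple coverage estimate for each k-mer.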

                  SOAPdenovo v2   SOAP-Hecate v2.5 (84 cores)   SOAP-Hecate v2.5 (180 cores)
Data Size         670 GB          670 GB                        670 GB
No. of Servers    1               7                             15
Time              59 hours        59 hours                      38 hours
Memory Size       400 GB x 1      24 GB x 7                     24 GB x 15
Mode              Centralized     Distributed                   Distributed

* 80X human whole genome

SOAP-Hecate is scalable and uses much less memory

Scalability

Performance

Assembler           Scaffold N50
SOAP-Hecate         26,570,829
SOAPdenovo          117,000
ALLPATHS            211,000
Phusion2, phrap     495,000
Meraculous          486,000
ABySS               144,300

Tested on simulated data from Assemblathon 1 (Earl, Bradnam et al. 2011)
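For reference, N50 is the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of the usual calculation follows; the scaffold lengths in the example are invented.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("no lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


scaffolds = [100, 80, 60, 40, 20]  # made-up scaffold lengths
print(n50(scaffolds))
```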

Solutions for Algorithms

Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
- GPU-based acceleration: SOAP3 (aligner), GSNP (SNP caller), GAMA (population genetics tool)

SOAP3: ~20X speed up from SOAP2

[Timeline: SOAP, then SOAP2 (2008, 20-30x), then SOAP3 (2011, GPU version, 10-30x)]

[Bar charts comparing SOAP2 and SOAP3 on human and zebrafish data]
- Total Time (second): bar labels 1893.45, 10671.39, 211.53, 819.81
- Speedup: 14.12 (human), 14.6 (zebrafish)
- Alignment Ratio (%): bar labels 84.2, 64.49, 88.29, 76.55

In collaboration with the University of Hong Kong

GSNP: 50X faster than the CPU-based SOAPsnp

[Bar charts, log scale: elapsed time (sec.)]

Chromosome   GSNP   SOAPsnp
Chr. 1       527    21879
Chr. 21      73     3675

The elapsed time of all steps is included. GSNP is around 50x faster than single-threaded CPU-based SOAPsnp.
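The speedup behind the "around 50x" claim can be checked from the elapsed times on the charts (527 s vs. 21879 s for Chr. 1, 73 s vs. 3675 s for Chr. 21). The short calculation below is only a worked example; it gives roughly 42x for Chr. 1 and 50x for Chr. 21.

```python
elapsed = {  # seconds, as read from the charts: (GSNP, SOAPsnp)
    "Chr. 1": (527, 21879),
    "Chr. 21": (73, 3675),
}

# Speedup = CPU time divided by GPU time for the same chromosome.
speedups = {chrom: cpu / gpu for chrom, (gpu, cpu) in elapsed.items()}
for chrom, speedup in speedups.items():
    print(f"{chrom}: {speedup:.1f}x")
```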

Solutions for Data Management

Data Management
- Data management in BGI

Paradigm Shift

Traditional Model
- Business: determines what question to ask
- IT: structures the data to answer that question

Big Data Model
- IT: delivers a platform to enable creative discovery
- Business: explores what questions could be asked

Information Pyramid

[Pyramid diagram, bottom to top]
- Data: element
- Information: meaning
- Knowledge: context
- Decision: application
- Value: achievement
Steps between levels: organizing, refining, summarizing, utilizing

BGI Data Pyramid

iRODS (Data)
• Data Preservation
• Data Retrieval
• Data Sharing

Database (Information)
• BGI-SNP
• BGI-SV
• BGI-GaP
• Disease: HGVD/PMRD

Data Mining (Knowledge)
• Systems Biology
• Drug Discovery

Health/Clinical APP (Decision)
• Diagnosis of Genetic Diseases
• Drug of Choice

Data Flow

[Data flow diagram]
- Pipeline: Sequencer, Raw Data, Data Analysis, Analyzed Data, Data Warehousing, Personalized Analysis, Clinical Diagnosis
- Supporting systems: iRODS, LIMS, Metadata, BGI-DB, Knowledge Base, Public Resources
- Data warehousing topics: Variant (Gene), Disease, Drug

iRODS - Integrated Rule-Oriented Data System

* Access data with a web-based browser, the iRODS GUI, or command-line clients.

renci.org

Data Flow - iRODS

iRODS-based Data Management
• Contents: raw data, analyzed data and related metadata
• Data backup
• Fully integrated with LIMS
• Able to search and access any data according to the metadata from the BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc.
• Federation: integrate separate iRODS zones
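Metadata-driven search of this kind can be pictured with a small in-memory example. The sketch below does not use the iRODS API; the catalog records, paths, and the `search` helper are all hypothetical and only illustrate looking data up by attributes such as project, sample, or cohort.

```python
catalog = [  # hypothetical records; paths and values are made up
    {"path": "/zone/projA/s1.fq.gz", "project": "projA", "sample": "s1", "cohort": "c1"},
    {"path": "/zone/projA/s2.fq.gz", "project": "projA", "sample": "s2", "cohort": "c2"},
    {"path": "/zone/projB/s3.fq.gz", "project": "projB", "sample": "s3", "cohort": "c1"},
]


def search(records, **criteria):
    """Return paths of records whose metadata match every given key/value."""
    return [rec["path"] for rec in records
            if all(rec.get(key) == value for key, value in criteria.items())]


print(search(catalog, project="projA"))
print(search(catalog, project="projA", cohort="c2"))
```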

Data Flow - BGI-DB

BGI-DB
• A locus-specific database (LSDB) for all variants identified by BGI
• Manages all basic information generated from data analysis pipelines
• Links all detailed information about individual samples to each variant
• Easy to query information from samples with certain commonality (such as same phenotype, same cohort, etc.)
• Provides the raw information for further data mining steps

Data Flow - BGI-DW & BGI-KB

BGI Data Warehousing & Knowledge Base
• BGI data warehousing (BGI-DW) consists of a series of secondary databases related to variants, diseases and drugs
• BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through mining BGI-DB, BGI-DW and other public resources
• Periodically and automatically updated
• Provides APIs for bioinformaticians to query the information and generate individualized reports

Data Flow - Success Story

Diagnosis for Monogenic Disease
- Group samples into cohorts based on their phenotypes
- Calculate variant frequencies from certain cohorts and save them into the allele frequency database
- Query the allele frequency database to filter out common variants and identify disease-causal variants
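The frequency-based filtering described for monogenic disease diagnosis can be sketched as follows. This is an illustration under stated assumptions, not BGI's pipeline: the genotype layout (sample to alternate-allele counts, diploid samples), the variant names, and the 5% cutoff are all made up for the example.

```python
from collections import Counter


def allele_frequencies(cohort_genotypes):
    """Frequency of each variant among the alleles of a cohort.

    `cohort_genotypes` maps sample -> {variant: alt allele count (0, 1 or 2)},
    assuming diploid samples; the layout is hypothetical.
    """
    n_alleles = 2 * len(cohort_genotypes)
    counts = Counter()
    for genotype in cohort_genotypes.values():
        counts.update(genotype)  # adds the alt allele counts per variant
    return {variant: count / n_alleles for variant, count in counts.items()}


def rare_variants(patient_variants, frequency_db, max_af=0.01):
    """Keep variants absent from the database or below the frequency cutoff."""
    return [v for v in patient_variants if frequency_db.get(v, 0.0) < max_af]


controls = {  # made-up cohort
    "s1": {"chr1:1000A>G": 2, "chr2:500C>T": 1},
    "s2": {"chr1:1000A>G": 1},
    "s3": {"chr1:1000A>G": 1},
}
db = allele_frequencies(controls)
candidates = rare_variants(["chr1:1000A>G", "chr7:42G>A"], db, max_af=0.05)
print(candidates)
```

Variants that are common in the cohort are dropped, leaving the rare ones as candidates for further review.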

Summary of Our Practice in IT infrastructure

Data transfer
- Solution I: Hard drive shipment (w/ FedEx)
- Solution II: High Speed Data Transfer
- Solution III: Don’t move the data (Cloud Computing)

Cloud Computing
- EasyGenomics, a SaaS platform for NGS data analysis
- Two paths for the future cloud solution

Algorithm and Infrastructure
- Scale up with Hadoop / MapReduce
- GPU-based acceleration

Data Management
- Using the iRODS file system to manage big data

Acknowledgement

Development Team
- Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
- Flex Lab: Yan Li (Hecate), Zhi Zhang (GAEA, iRODS), etc.
- GPU Lab: Bingqiang Wang, etc.

Test & QA Team
- Xin Guan, Jingjuan Liu, etc.

PMO & IT Operation
- Wenjun Zeng, Litong Lai, Jing Tian, etc.

Product Team
- Xing Xu, Jing Guo, Fang Fang, etc.

Other BGI Teams

Collaborators:
- University of Hong Kong (HKU)
- Hong Kong University of Science and Technology (HKUST)
- Nvidia
- Aspera
- RENCI
- Tianjin Supercomputing Center

