Flink in Genomics
Efficient and scalable processing of raw Illumina BCL data
F. VERSACI L. PIREDDU G. ZANETTI
– FlinkForward 2016 –
13 September 2016
Francesco Versaci (CRS4) Flink in Genomics 13 September 2016 1 / 28
Outline
1 Introduction
2 BCL to FASTQ Conversion
3 Implementation in Flink
4 Evaluation and Final Considerations
CRS4
Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna

Research center in Sardinia, Italy
Focus on big data, biosciences, HPC, visual computing, energy and environment
Next-Generation Sequencing
Cost

Genome sequencing is now much cheaper than in the past
About 1000 euros per whole human genome

[Chart: cost per genome, in USD on a log scale from 10^3 to 10^8, over the years 2001-2016]
(Data from https://www.genome.gov/sequencingcosts/)
Next-Generation Sequencing
Applications

High-throughput DNA sequencing has many applications, including
- Research into understanding human genetic diseases
- Medicine, e.g., oncology, clinical pathology, ...
- Human phylogeny
- Personalized diagnostic applications

Huge amount of data
A single sequencer can produce 1 TB/day of data
Which needs to be converted, filtered, aggregated, reconstructed, analysed, ...
Standard pipeline
When using Illumina sequencers, the standard pipeline starts with two programs:
- bcl2fastq2: proprietary tool by Illumina (source available) to convert raw BCL data to FASTQ format
- BWA-MEM: free (GPLv3) aligner that reconstructs the full genomic sequence from the short reads generated by the sequencer

Problem
Both are parallel tools, but shared-memory (single node)
To exploit more nodes, data need to be distributed, failures must be handled, etc.
BCL converter
In this talk we present a distributed-memory BCL converter
- Developed within the Flink framework
- Written in Scala
- Efficient (i.e., speed comparable to bcl2fastq2)
- Scalable
- Can easily be integrated into existing Hadoop/YARN workflows
Outline
1 Introduction
2 BCL to FASTQ Conversion
3 Implementation in Flink
4 Evaluation and Final Considerations
Shotgun genome sequencing
DNA is a sequence of four bases: Adenine, Cytosine, Guanine and Thymine (A, C, G and T)
To reconstruct it, the genome is broken up into short fragments (reads)
The fragments are attached to a support (tile)
Fluorescent molecules are iteratively attached to the bases of the DNA fragments being sequenced
At each cycle, the machine acquires (optically) a single base from all the fragments
File organization
We adopt the file structure of Illumina HiSeq 3000/4000 machines
A single file refers to data obtained for a specific lane, tile and cycle combination
E.g., file L003/C80.1/s_3_1213.bcl.gz corresponds to data read from tile t = 1213, in lane l = 3, during cycle c = 80
Test dataset: 8 lanes × 112 tiles/lane × 210 cycles = 188,160 gzip-compressed BCL files = about 250 GB
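As an illustration, the lane/cycle/tile encoding of such paths can be recovered with a few lines of Scala. The object and regex below are our own sketch, not part of the converter's code:

```scala
// Hypothetical helper (not from the actual converter): extract lane, cycle
// and tile from an Illumina-style BCL path such as "L003/C80.1/s_3_1213.bcl.gz".
object BclPath {
  private val Pattern = """L(\d+)/C(\d+)\.\d+/s_(\d+)_(\d+)\.bcl(\.gz)?""".r

  def parse(path: String): Option[(Int, Int, Int)] = path match {
    case Pattern(lane, cycle, _, tile, _) =>
      Some((lane.toInt, cycle.toInt, tile.toInt))
    case _ => None
  }
}
```

For the example file above, parse returns Some((3, 80, 1213)).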
BCL to FASTQ conversion
BCL files are arrays of bytes
Each byte encodes a base (bits 0-1) and a quality score (bits 2-7)
.filter files specify which reads should be ignored
.locs files contain some metadata which needs to be attached to each read
To get the reads from the raw BCL data we need to perform a sort of matrix transposition
[Figure: matrix transposition. Per-cycle BCL files (File 1-5, one per cycle, Cycle 1-5), each holding one base per location (Location 1-5), are transposed into per-location reads (Read 1, Read 2, etc.)]
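In code, the transposition amounts to turning per-cycle arrays into per-location reads. A minimal, single-node Scala sketch (names illustrative; the real converter does this in a distributed fashion on raw bytes):

```scala
// Transpose: one Array[Char] per cycle in, one String per read location out.
def transpose(cycles: Seq[Array[Char]]): Seq[String] = {
  val numLocations = cycles.head.length
  // the read at location i is the i-th base of every cycle, in cycle order
  (0 until numLocations).map(i => cycles.map(_(i)).mkString)
}
```

For example, transpose(Seq("AC".toCharArray, "GT".toCharArray)) yields Seq("AG", "CT"): two cycles over two locations become two reads of two bases.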
Outline
1 Introduction
2 BCL to FASTQ Conversion
3 Implementation in Flink
4 Evaluation and Final Considerations
Implementation choices
The converter is written in Scala
We use sbt to handle compilation and dependencies
All the code is less than 1000 lines
No fancy IDEs, just Emacs as editor
Algorithmic overview
For each lane/tile combination:
- BCL files corresponding to different cycles are opened concurrently
- Bases and quality scores are extracted and filtered
- For each fragment a text header is added, containing various metadata and an index
- The fragments are then sorted by their indexes
- Since there can be read errors in the indices too, the repartition is fuzzy: a parameter sets the number of allowed misinterpreted symbols
- Finally, a gzip-compressed file for each index is written to disk
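For reference, the output FASTQ format is four text lines per fragment: a header, the bases, a "+" separator, and the quality string. A trivial sketch of the assembly step (the header content here is made up; real headers carry instrument/run/lane/tile metadata):

```scala
// Assemble one 4-line FASTQ record from its three text components.
def fastqRecord(header: String, bases: String, quals: String): String =
  s"@$header\n$bases\n+\n$quals\n"
```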
DataSet vs DataStream
Our data is static and read from a storage unit
This is not a typical streaming application
We have tried both the DataSet and DataStream structures
Using DataStream is faster
Perhaps because of its better overlap of I/O and computation?

Lesson Learned #1
Try DataStream even if it doesn't seem like a natural fit for your application
Data granularity
BCL files are arrays of bytes
It might seem natural to process them in Flink as DataStream[Byte]
But reading and writing single bytes is not efficient
We process data in bigger chunks (2048 bytes)
- This imposes a lower load on the streaming framework
- Better cache locality exploitation

Lesson Learned #2
Avoid fine granularity and read in larger chunks
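A minimal sketch of the chunked-reading idea (the chunk size and names are illustrative, not the converter's actual code):

```scala
import java.io.InputStream

// Expose an InputStream as chunks of up to 2048 bytes instead of single bytes.
def chunks(in: InputStream, chunkSize: Int = 2048): Iterator[Array[Byte]] =
  Iterator.continually {
    val buf = new Array[Byte](chunkSize)
    val n = in.read(buf)
    if (n < 0) Array.empty[Byte] else buf.take(n)  // trim the last, short chunk
  }.takeWhile(_.nonEmpty)
```

A 5000-byte input then arrives as three chunks of 2048, 2048 and 904 bytes.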
Job granularity
The job unit (mini-job) is the processing of one lane-tile combination
Mini-jobs run for about one minute on one core
We can choose to aggregate n mini-jobs into a single Flink job
And assign c cores to each Flink job
E.g., aggregate n = 16 mini-jobs and run them on c = 4 cores
What about launching one huge Flink job which handles all the work and cores?
Best results with n = 2 and c = 1 (with processor SMT = 2)

Lesson Learned #3
Keep Flink jobs reasonably small
Low-level optimizations
Use of ByteBuffer

To extract bases and quality scores we need to perform some bit masks and shifts
E.g., to get the quality score from byte b we can run

    val b: Byte = in.get
    val q: Byte = (0x21 + ((b & 0xFC) >>> 2)).toByte

We can obtain an 8x speed-up by grouping bytes into 64-bit longs and executing the equivalent operations:

    val r: Long = in.getLong
    val q: Long = 0x2121212121212121L +
      ((r & 0xFCFCFCFCFCFCFCFCL) >>> 2)

To interpret byte arrays as longs, we need to use the ByteBuffer class
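The two variants can be checked against each other. The sketch below is our own test harness (assuming Phred+33 scores, with the base in bits 0-1), using ByteBuffer to view 8 bytes as one long:

```scala
import java.nio.ByteBuffer

// Byte-wise reference version: quality = 0x21 + (b >>> 2), ignoring base bits.
def qualByte(b: Byte): Byte = (0x21 + ((b & 0xFC) >>> 2)).toByte

// 64-bit version: the mask zeroes each byte's low 2 bits, so the shift never
// leaks bits across byte boundaries, and 0x21 + 0x3F never carries.
def qualLong(bytes: Array[Byte]): Array[Byte] = {
  val r = ByteBuffer.wrap(bytes).getLong
  val q = 0x2121212121212121L + ((r & 0xFCFCFCFCFCFCFCFCL) >>> 2)
  ByteBuffer.allocate(8).putLong(q).array
}
```

For any 8 input bytes, qualLong produces the same bytes as mapping qualByte over them.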
Low-level optimizations
Use of look-up tables

We need to convert bases from numeric to ASCII notation
E.g., 0x0001020303020100 maps to "ACGTTGCA"
We can do it efficiently by compressing the input and using it as an index into a look-up table
E.g., 0x0001020303020100 is compressed to the index 0b0001101111100100 = 0x1BE4 and searched in the precomputed look-up table
The table has 2^16 = 65536 entries
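The same idea on a smaller scale, assuming the 2-bit codes A=0, C=1, G=2, T=3 (consistent with the example above): a 256-entry table for 4-base words, where the slides use 8 bases and a 2^16-entry table. Names are illustrative:

```scala
val bases = "ACGT"

// Precompute the string for every packed 8-bit index (4 bases x 2 bits each).
val lut: Array[String] = Array.tabulate(256) { i =>
  (3 to 0 by -1).map(k => bases((i >>> (2 * k)) & 3)).mkString
}

// Pack four 2-bit base codes into one index and look it up.
def decode4(codes: Array[Byte]): String = {
  val idx = codes.foldLeft(0)((acc, c) => (acc << 2) | (c & 3))
  lut(idx)
}
```

Here decode4(Array[Byte](0, 1, 2, 3)) returns "ACGT"; the single table lookup replaces four per-base conversions.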
Scheduling
To schedule the Flink jobs we use Scala Futures
Parallel and non-blocking

    // numTasks = number of Flink jobs
    implicit val ec = scala.concurrent.ExecutionContext
      .fromExecutor(Executors.newFixedThreadPool(numTasks))
    val miniJobs: Seq[MiniJob] = reader.getMiniJobs
    // fpar = number of mini-jobs per Flink job
    val flinkJobs = miniJobs.sliding(fpar, fpar)
      .map(fj => Future { runJobs(fj) })
    // convert Seq[Future] to Future[Seq]
    val flist = Future.sequence(flinkJobs)
    scala.concurrent.Await.result(flist, Duration.Inf)
Outline
1 Introduction
2 BCL to FASTQ Conversion
3 Implementation in Flink
4 Evaluation and Final Considerations
Hardware
Experiments run on the Amazon Elastic Compute Cloud (EC2)
Up to 14 instances of r3.8xlarge machines
- CPUs: 32 virtual cores (Xeon E5-2670 v2, 25 MB cache)
- RAM: 250 GB
- Disks: 2x 320 GB SSD
- Network: 10 Gb Ethernet
HDFS distributed among the n computing nodes
Each datanode using its two SSD disks
YARN running on the same n nodes
Flink running inside Hadoop/YARN
Results
Strong, absolute scalability (bcl2fastq2 as baseline)

Running time of Illumina bcl2fastq2: 57.1 minutes on a single node
bcl2fastq2 is written in C++ and exploits the Boost libraries
896 = 64 · 14 total Flink jobs
Nodes   Time (minutes)
1       58.4
2       30.2
4       15.2
6       10.5
8       8.4
14      5.4

[Plot: speed-up vs. number of nodes (1 to 14), comparing the BCL converter against bcl2fastq2 and perfect scalability]
I/O vs Computations
The program is CPU-bound on the tested hardware
Total I/O size (input + output): ≈ 500 GB
I/O rate on a single node: ≈ 150 MB/s
I/O rate on 14 nodes: ≈ 1.6 GB/s
Note: both input and output are gzip-compressed
Flink features – What we have used
Custom Flink Input/OutputFormat
Hadoop libraries to read/write files

    import org.apache.hadoop.fs.{FileSystem, FSDataInputStream, FSDataOutputStream, Path}
    import org.apache.hadoop.io.compress.{CompressionCodecFactory, CompressionInputStream}
    import org.apache.hadoop.io.compress.zlib.{ZlibCompressor, ZlibFactory}

DataStream
- map and flatMap
- MapFunction and FlatMapFunction
- filter
- split and select
Flink features – What was not available
Efficient zip of DataStreams

Problem
Given two data streams

    val names: DataStream[String]
    val ages: DataStream[Int]

join them as

    val combined: DataStream[(String, Int)]

Useful when reading data about the same object from different files
E.g., .bcl, .locs and .filter files
Inverse function of

    val names = combined.map(_._1)
    val ages = combined.map(_._2)
Flink features – What was not available
A smarter job scheduler

Remark
We're talking about Flink 1.0: it seems the new job scheduler is much smarter :)

It would be convenient for the job scheduler to be able to
- Pick jobs from some (priority?) queue
- Run them concurrently on the available Flink task slots
- Start a new job as soon as another one finishes
- Handle failures and retries
Future work
Integrate our converter into the Seal1 toolkit for short DNA read manipulation and analysis
Adopt Flink also in the second stage of the pipeline, i.e., have a Flink-based aligner
Thanks for your attention!
1 http://biodoop-seal.sourceforge.net/