Rethinking Data-Intensive Science Using Scalable Analytics Systems


Frank Austin Nothaft
UC Berkeley AMP/ASPIRE Lab, @fnothaft

With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson

Scientific revolutions are driven by data acquisition revolutions

Genome Sequencing

Source: NIH National Human Genome Research Institute

• 2014: ~230,000 genomes sequenced

• 15–250 GB/genome = ~30 TB/day = ~10 PB/year

• Human Genome Project: ~10 GB

• 1000 Genomes: 15 TB

• TCGA: 3 PB
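A rough consistency check, under the assumption of ~45 GB per sequenced genome (an assumption; actual sizes span the 15–250 GB range): 230,000 genomes × ~45 GB ≈ 10 PB per year, and 10 PB ÷ 365 days ≈ 30 TB per day.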

Sequencing advances line up well with scalable analytics software

Source: NIH National Human Genome Research Institute

(Timeline overlay on the sequencing cost curve: Google MapReduce → Hadoop MR → Spark → Parquet)

Mapping scientific systems to commodity analytics systems

• Contemporary scientific systems are custom-built

• This leads to rebuilding functionality that commodity systems already provide

• We have an opportunity to rethink the abstractions that scientific systems use:

• Migrate from a flat architecture to a stacked architecture

• Expose higher-level programming primitives

• Use commodity tools wherever possible

Common Traits of Legacy Data-Intensive Scientific Systems

1. Computation is workflow/pipeline oriented

2. Processing system has monolithic/flat architecture

3. Data is stored in flat files

Genomics Pipelines

Source: The Broad Institute of MIT and Harvard

Flat File Formats

• Scientific data is typically stored in application-specific file formats:

• Genomic reads: SAM/BAM, CRAM

• Genomic variants: VCF/BCF, MAF

• Genomic features: BED, NarrowPeak, GTF

• Centralized metadata makes it difficult to parallelize applications (illustrated below)
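To make the metadata problem concrete, here is a minimal sketch with simplified stand-ins (these are not the actual SAM/BAM data structures): a legacy record names its contig by an index into the file-level header, so a task holding only a slice of records cannot interpret them without first shipping and parsing the centralized header, while a schema-based record is self-describing.

case class SamRecordRef(refIndex: Int, start: Long)                   // header-relative
case class SchemaRecord(contig: Option[String], start: Option[Long])  // self-describing

// Legacy path: every parallel task needs the full header to resolve names.
def resolveContig(rec: SamRecordRef, headerRefs: IndexedSeq[String]): String =
  headerRefs(rec.refIndex)

// Schema path: the record carries its own contig name; no shared state.
def contigOf(rec: SchemaRecord): Option[String] = rec.contig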

Flat Architectures

• APIs present very bare-bones abstractions:

• GATK: a sorted iterator over the genome

• Why are flat architectures bad?

1. Trivial: low-level abstractions are not productive

2. Trivial: flat architectures create technical lock-in

3. Subtle: low-level abstractions can introduce bugs

The perils of flattening…

• The trivial:

• You can improve performance by pushing data access order into your data layout

• But now, you can’t easily compose pipeline stages that have different access orders

• The subtle:

• If you access data via a sorted iterator, will you implement your algorithm incorrectly? (see the sketch below)
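As a minimal sketch of the iterator pitfall, using a stripped-down Read type (illustrative, not the GATK API): pileup-style code over a coordinate-sorted iterator must hand-maintain the set of reads overlapping the current position, and every line of that bookkeeping is a chance for a silent bug.

case class Read(contig: String, start: Long, end: Long)

def pileups(sorted: Iterator[Read]): Iterator[(Long, Seq[Read])] = {
  var live = List.empty[Read]
  sorted.map { read =>
    // Eviction is easy to get subtly wrong: it must compare end (not start)
    // coordinates, reset at contig boundaries, and it silently assumes the
    // input really is coordinate-sorted; the iterator API checks none of this.
    live = live.filter(r => r.contig == read.contig && r.end > read.start)
    live = read :: live
    (read.start, live)
  }
}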

A green field approach

First, define a schema

In Avro IDL, a declaration like union { null, long } start = null; defines a nullable field; the default value takes the type of the first branch of the union, which is why the boolean fields below list boolean first and default to false:

record AlignmentRecord {
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}

The layered stack (top to bottom):

Application: Transformations
Presentation: Enriched Models
Evidence Access: MapReduce/DBMS
Schema: Data Models
Materialized Data: Columnar Storage
Data Distribution: Parallel FS
Physical Storage: Attached Storage

A schema provides a narrow waist

The AlignmentRecord schema shown above sits at the waist of the stack: every layer above it programs against the schema, and every layer below it materializes the schema.

Accelerate common access patterns

• In genomics, we commonly need to find observations that overlap in a coordinate plane

• This coordinate plane is genomics-specific, and is known a priori

• We can use our knowledge of the coordinate plane to implement a fast overlap join (sketched below)
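Below is a minimal sketch of a binned overlap join on Spark, with illustrative names and bin size (this is not ADAM’s exact API). Each region is replicated to every fixed-size genomic bin it touches, so any two overlapping regions are guaranteed to meet at some (contig, bin) key; a final filter keeps true overlaps and emits each pair exactly once.

import org.apache.spark.rdd.RDD

case class ReferenceRegion(contig: String, start: Long, end: Long) {
  def overlaps(that: ReferenceRegion): Boolean =
    contig == that.contig && start < that.end && that.start < end
}

object OverlapJoin {
  val binSize = 10000L  // illustrative; tune to the expected region sizes

  // Every (contig, bin) key a region touches; two overlapping regions
  // always share at least one key.
  private def bins(r: ReferenceRegion): Seq[(String, Long)] =
    ((r.start / binSize) to ((r.end - 1) / binSize)).map(b => (r.contig, b))

  def overlapJoin[A, B](left: RDD[(ReferenceRegion, A)],
                        right: RDD[(ReferenceRegion, B)]): RDD[(A, B)] = {
    val l = left.flatMap { case (reg, a) => bins(reg).map(k => (k, (reg, a))) }
    val r = right.flatMap { case (reg, b) => bins(reg).map(k => (k, (reg, b))) }
    l.join(r).flatMap { case ((_, bin), ((lr, a), (rr, b))) =>
      // Deduplicate: emit only from the bin containing the overlap's start.
      val startBin = math.max(lr.start, rr.start) / binSize
      if (lr.overlaps(rr) && bin == startBin) Some((a, b)) else None
    }
  }
}

Because the coordinate plane is known a priori, the join needs a single shuffle on the bin keys instead of a Cartesian product.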


Pick appropriate storage

• When accessing scientific datasets, we frequently slice and dice the dataset:

• Algorithms may touch subsets of columns

• We don’t always touch the whole dataset

• This is a good match for columnar storage (sketched below)
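As a minimal sketch of why columnar storage fits, assuming reads stored as Parquet at an illustrative path with the schema’s column names: the projection below deserializes only three columns, and the predicate can be pushed down to skip row groups whose statistics rule it out.

import org.apache.spark.sql.SparkSession

object ColumnarAccess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("columnar-access").getOrCreate()

    // Only contig, start, and mapq are read from disk; Parquet's row-group
    // statistics let the mapq filter skip whole chunks of the file.
    val highQuality = spark.read.parquet("/data/reads.parquet")
      .select("contig", "start", "mapq")
      .where("mapq >= 30")

    highQuality.show()
    spark.stop()
  }
}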


Is introducing a new data model really a good idea?

Source: XKCD, http://xkcd.com/927/

A subtle point: proper stack design can simplify backwards compatibility

To support legacy data formats, you define a way to serialize/deserialize the schema into/from the legacy flat file format.

Under a single schema, the lower layers simply fork:

Schema: Data Models → Materialized Data: Legacy File Format → Data Distribution
Schema: Data Models → Materialized Data: Columnar Storage → Data Distribution

The legacy file format is just another materialization of the schema. This is a view! (A sketch follows below.)
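A minimal sketch of that bridge, with hypothetical names (not ADAM’s actual converter API): one small codec per legacy format maps records into and out of the schema type, and every layer above materialized data is untouched.

// One implementation per legacy format (SAM/BAM, CRAM, ...); the layers
// above only ever see the schema type S.
trait LegacyFormatCodec[T, S] {
  def toSchema(legacyRecord: T): S
  def fromSchema(record: S): T
}

// Reading a legacy file is then just a map over its records.
def readLegacy[T, S](records: Iterator[T],
                     codec: LegacyFormatCodec[T, S]): Iterator[S] =
  records.map(codec.toSchema)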

A well-designed stack simplifies application design

Layer             | Example implementation                             | Role
Application       | Variant calling & analysis, RNA-seq analysis, etc. | Users define analyses via transformations
Presentation      | Enriched Read/Variant models                       | Enriched models provide convenient methods on common models
Evidence Access   | Spark, Spark SQL, Hadoop                           | The evidence access layer efficiently executes transformations
Schema            | Avro schemas for reads, variants, and genotypes    | Schemas define the logical structure of basic genomic objects
Materialized Data | Parquet and legacy formats                         | Common interfaces map the logical schema to bytes on disk
Data Distribution | HDFS, Tachyon, HPC file systems, S3                | The parallel file system layer coordinates distribution of data
Physical Storage  | Disk, SSD, block store, memory cache               | Decoupling storage enables a performance/cost tradeoff
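To ground the application layer, here is a minimal sketch assuming reads are already loaded as an RDD of a simplified record type (names illustrative): an analysis is just a transformation over the dataset, and the layers below decide how it executes.

import org.apache.spark.rdd.RDD

case class Read(contig: String, start: Long, mapq: Int)

// An application-layer "analysis": a pure transformation from one dataset
// to another; scheduling, storage, and distribution all live further down.
def highQualityReads(reads: RDD[Read]): RDD[Read] =
  reads.filter(_.mapq >= 30)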

How does this perform on real scientific data?

ADAM performs genomic preprocessing

Source: The Broad Institute of MIT and Harvard

ADAM’s Performance

• Achieves linear scalability out to 128 nodes for most tasks

• Up to a 3x improvement over current tools on a single node

Analysis run on Amazon EC2; the single node was an i2.8xlarge, and the cluster used r3.2xlarge instances. Scripts available at https://www.github.com/bigdatagenomics/bdg-services.git

Astronomy Pipelines

Source: The LSST Project

Astronomy Image Co-addition Performance

• Scales out to 16 nodes

• ~3x improvement over the extant tool on a single node

Analysis run on Amazon EC2; the cluster used c3.8xlarge (HPC-optimized) instances

Conclusions

• There is a huge increase in the amount of scientific data being processed

• Although scientific processing pipelines tend to be custom solutions, we can replace these pipelines with general, DBMS-backed solutions

• When we move to a general solution, we can gain performance without losing correctness

Acknowledgements

• ADAM (https://www.github.com/bigdatagenomics/adam):

• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Carl Yeksigian

• Cloudera: Uri Laserson

• Microsoft Research: Ravi Pandya

• UC Santa Cruz: Benedict Paten, David Haussler

• KIRA (https://www.github.com/BIDS/Kira):

• UC Berkeley: Zhao Zhang, Mike Franklin, Evan Sparks, Kyle Barbary, Oliver Zahn, Saul Perlmutter

• PoC code at https://github.com/zhaozhang/SparkMontage