DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with Hadoop and Spark

transcript

The Redemptive Power of Hadoop
Uri Laserson | @laserson | 14 November 2015

Scaling Up Genomics with Spark

We come in peace.

Pioneer plaque

What is genomics?

Organism

Organism Cell

Organism Cell Genome

Reference chromosome

Location

Ortelius, 1570

>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT

Bioinformatics!

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Pipelines!

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

Compressed text files (non-splittable)
Semi-structured
Poorly specified
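To make "semi-structured" concrete, here is a minimal Python sketch that parses one data line from the VCF example above. The helper name `parse_vcf_line` and the layout of the returned dictionary are illustrative assumptions, not part of any VCF library, and the sketch ignores the header's type declarations entirely.

```python
def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into a dict (hypothetical helper, not a library API)."""
    # Real VCF is tab-delimited; split() on any whitespace also handles the slide text.
    fields = line.split()
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags (e.g. DB, H2).
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    # Each sample column is ':'-separated and follows the order given by FORMAT.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": float(qual), "filter": flt,
        "info": info_dict, "samples": samples,
    }


record = parse_vcf_line(
    "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
    "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.",
    ["NA00001", "NA00002", "NA00003"],
)
print(record["info"], record["samples"]["NA00002"])
```

Even this small parser has to handle three nested delimiters (whitespace, `;`, `:`) and leaves every value as an untyped string, which is the kind of per-tool parsing burden the slide is pointing at.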

Global sort order

C | HPC (scheduler) | POSIX filesystem

Java | HPC (Queue) | POSIX filesystem

C++ | Single-node | SQLite

It’s file formats all the way down!

/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);

    final SAMFileWriter out = new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
                rec.setDuplicateReadFlag(false);
            }
        }
        recordInFileIndex++;
        if (!this.REMOVE_DUPLICATES || !rec.getDuplicateReadFlag()) {
            out.addAlignment(rec);
        }
    }
    return 0;
}

Method
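The method described in the comment above (group reads by their 5' ends, then flag all but one per group) can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the tool's implementation: the dictionary keys (`name`, `chrom`, `five_prime_pos`, `strand`, `quality`) and the "keep the highest quality read" rule are hypothetical simplifications.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates.

    Reads sharing (chrom, five_prime_pos, strand) are treated as copies of
    one fragment; the highest-quality read in each group is kept unflagged.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["five_prime_pos"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["quality"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


reads = [
    {"name": "r1", "chrom": "20", "five_prime_pos": 100, "strand": "+", "quality": 30},
    {"name": "r2", "chrom": "20", "five_prime_pos": 100, "strand": "+", "quality": 40},
    {"name": "r3", "chrom": "20", "five_prime_pos": 250, "strand": "-", "quality": 35},
]
duplicate_names = mark_duplicates(reads)
print(duplicate_names)
```

The logic itself is a plain group-by. The Java excerpt interleaves it with file-index bookkeeping and BAM writing.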

@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;

Method

Platform

[Diagram: pipeline ending in Variant Annotation]

It’s pipelines all the way down!

[Diagram: the pipeline, ending in Variant Annotation, running separately on Node 1, Node 2, and Node 3]

Manually running pipelines on HPC

$ bsub -q shared_12h python split_genotypes.py

$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv

$ bsub -q shared_12h python merge_maf.py
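The three job submissions above follow a manual scatter/gather pattern: split the input, aggregate each piece independently, then merge. The Python sketch below mimics that shape in one process. The function bodies are assumptions, since the contents of `split_genotypes.py`, `query_agg.py`, and `merge_maf.py` are not shown in the slides; the record layout (variant key plus alt-allele count) and the diploid assumption are likewise invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_genotypes(records, n_chunks):
    """Scatter: deal records into n_chunks roughly equal lists."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def query_agg(chunk):
    """Per-chunk work: count alternate alleles and total alleles per variant."""
    counts = {}
    for variant, alt_alleles in chunk:
        alt, total = counts.get(variant, (0, 0))
        counts[variant] = (alt + alt_alleles, total + 2)  # assumes diploid calls
    return counts


def merge(partials):
    """Gather: add up the per-chunk counts."""
    merged = {}
    for counts in partials:
        for variant, (alt, total) in counts.items():
            m_alt, m_total = merged.get(variant, (0, 0))
            merged[variant] = (m_alt + alt, m_total + total)
    return merged


# Made-up records: (variant key, number of alt alleles in one sample).
records = [("20:14370", 0), ("20:14370", 1), ("20:14370", 2), ("20:17330", 1)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(query_agg, split_genotypes(records, 4)))
totals = merge(partials)
print(totals)
```

On an HPC queue, each of these function calls is a separate job with its own intermediate file, and the person submitting them is responsible for ordering, retries, and cleanup.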

[Diagram: pipeline stages (Alignment, Dedup, Recalibrate, QC/Filter, Variant Annotation) distributed across Node 1 through Node 4]

[Diagram: pipeline stages (Alignment, Dedup, QC/Filter, Variant Calling, Variant Annotation) distributed across Node 1 through Node 4, with Recalibrate drawn separately]

How now, brown cow?

Why Are We Still Defining File Formats By Hand?

• Instead of defining custom file formats for each data type and access pattern…

• Parquet creates a compressed format for each Avro-defined data model

• Improvements over existing formats
  • ~20% for BAM
  • ~90% for VCF
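As a rough illustration of the idea (declare the data model once, derive the storage layout from it), the Python sketch below pivots schema-defined records into per-field columns and compresses both layouts. It uses only the standard library and is not Avro or Parquet: the `Genotype` dataclass is a hypothetical, simplified schema, and JSON plus zlib stand in for a real columnar encoding.

```python
import json
import zlib
from dataclasses import dataclass, asdict, fields


@dataclass
class Genotype:  # hypothetical, simplified schema; not ADAM's actual Avro model
    sample: str
    chrom: str
    pos: int
    call: int


def to_columns(records):
    """Pivot row-oriented records into one list per schema field."""
    return {f.name: [getattr(r, f.name) for r in records] for f in fields(Genotype)}


def from_columns(columns):
    """Rebuild the records from the per-field lists."""
    names = [f.name for f in fields(Genotype)]
    return [Genotype(*values) for values in zip(*(columns[n] for n in names))]


# Synthetic records, invented for illustration.
records = [Genotype(f"NA{i:05d}", "20", 14370 + i % 3, i % 3) for i in range(1000)]
columns = to_columns(records)

row_bytes = zlib.compress(json.dumps([asdict(r) for r in records]).encode())
col_bytes = zlib.compress(json.dumps(columns).encode())
# Sizes depend on the data; print them to compare the two layouts.
print(len(row_bytes), len(col_bytes))
```

Because the layout is derived from the schema, nothing here is specific to genotypes. Swapping in a different dataclass changes the stored columns without any new format code.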

YARN-managed Hadoop cluster

Spark executors, each computing:
$\prod_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$

Partial sums

Driver (application code):
$\prod_{i=1} \prod_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$

ContEst Algorithm
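The slide shows a double product over sites $i$ and reads $j$ being computed piecewise on executors and combined on the driver, with the executor outputs labelled "partial sums". One plausible reading, and it is an assumption, is that each executor sums log-probabilities for its share of the sites and the driver adds those sums. The Python sketch below shows that decomposition. `base_likelihood` is a toy stand-in for $P(b_{ij} \mid e_{ij}, f_i)$ and is not the ContEst model.

```python
import math


def base_likelihood(is_alt, error_rate, alt_freq):
    """Toy stand-in for P(b_ij | e_ij, f_i); NOT the ContEst model."""
    p_alt = alt_freq * (1.0 - error_rate) + (1.0 - alt_freq) * error_rate
    return p_alt if is_alt else 1.0 - p_alt


def site_log_likelihood(site):
    """Executor-side work: sum of log P over the d_i reads at site i."""
    alt_freq, reads = site
    return sum(math.log(base_likelihood(b, e, alt_freq)) for b, e in reads)


def partition_log_likelihood(partition):
    """Partial sum over all sites held by one partition."""
    return sum(site_log_likelihood(site) for site in partition)


# Each site is (f_i, [(b_ij, e_ij), ...]); values are made up for illustration.
sites = [
    (0.5, [(True, 0.01), (False, 0.02)]),
    (0.1, [(False, 0.01), (False, 0.01), (True, 0.05)]),
    (0.9, [(True, 0.001)]),
]
partitions = [sites[:2], sites[2:]]

# Driver-side work: add the partial sums from every partition.
total_log_likelihood = sum(partition_log_likelihood(p) for p in partitions)

# Reference computation: the double product evaluated directly in one place.
direct_product = 1.0
for alt_freq, reads in sites:
    for b, e in reads:
        direct_product *= base_likelihood(b, e, alt_freq)

print(total_log_likelihood, math.log(direct_product))
```

Working in log space keeps the per-partition results small and numerically stable, and because addition is associative the partitions can be combined in any order.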

Hadoop provides layered abstractions for data processing

HDFS (scalable, distributed storage)

YARN (resource management)

MapReduce Impala (SQL) Solr (search) Spark

ADAM quince guacamole …

Executing query in Hadoop: interactive Spark shell (ADAM)

// helper predicates
def inDbSnp(g: Genotype): Boolean = ???  // true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

// load data
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")

// apply predicates
val filteredRDD = genotypesRDD
  .filter(!inDbSnp(_))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

// join data
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

// group-by, aggregate (MAF)
val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))

// persist data
maf.saveAsNewAPIHadoopFile("path/to/output")

Executing query in Hadoop: distributed SQL

-- aggregate (UDAF)
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
-- "load" and join data
FROM genotypes g
  INNER JOIN samples s
    ON g.sample = s.sample
  INNER JOIN dnase d
    ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d.end
  LEFT OUTER JOIN dbsnp p
    ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt
-- apply predicates
WHERE s.study = "framingham"
  AND p.pos IS NULL
  AND g.polyphen IN ( "possibly damaging", "probably damaging" )
-- group-by
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
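Both versions of the query end with the same step: group genotype calls by variant and population, then aggregate each group into a minor allele frequency (MAF). The slides do not show how `computeMAF` or the `MAF` UDAF is defined, so the Python sketch below is one plausible definition under stated assumptions: calls are encoded as the number of alternate alleles (0, 1, or 2) in a diploid sample, and the rows are invented for illustration.

```python
from collections import defaultdict


def compute_maf(alt_allele_counts):
    """Minor allele frequency from per-sample alt-allele counts (0, 1, or 2)."""
    total_alleles = 2 * len(alt_allele_counts)  # assumes diploid samples
    alt_freq = sum(alt_allele_counts) / total_alleles
    return min(alt_freq, 1.0 - alt_freq)


# (variant, population, alt-allele count); rows are invented for illustration.
variant = ("20", 14370, "G", "A")
rows = [
    (variant, "EUR", 0),
    (variant, "EUR", 1),
    (variant, "AFR", 2),
    (variant, "AFR", 2),
    (variant, "AFR", 1),
]

# group-by (variant, population), then aggregate each group
groups = defaultdict(list)
for var, population, call in rows:
    groups[(var, population)].append(call)
maf = {key: compute_maf(calls) for key, calls in groups.items()}
print(maf)
```

Taking the minimum of the alternate and reference frequencies means the result never exceeds 0.5, whichever allele happens to be labelled "alt".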

• Hosted at Berkeley and the AMPLab

• Apache 2 License

• Contributors from both research and commercial organizations

• Core spatial primitives, variant calling

• Avro and Parquet for data models and file formats

Spark + Genomics = ADAM

Core Genomics Primitives: Spatial Join
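A spatial join pairs up records whose genomic regions overlap, for example variants that fall inside DNase sites. The Python sketch below bins regions by position so that only regions sharing a bin are compared, which loosely mirrors a partition-then-join strategy. The bin width, function names, and tuple layout `(chrom, start, end)` with half-open coordinates are assumptions for illustration, not ADAM's implementation.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical partition width in bases


def overlaps(a, b):
    """True if two half-open regions (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def bins_for(region):
    """All (chrom, bin index) keys touched by a half-open region."""
    chrom, start, end = region
    return [(chrom, b) for b in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1)]


def region_join(left, right):
    """Return sorted (left, right) pairs of overlapping regions."""
    binned = defaultdict(list)
    for r in right:
        for key in bins_for(r):
            binned[key].append(r)

    pairs = set()  # a set removes pairs found in more than one shared bin
    for l in left:
        for key in bins_for(l):
            for r in binned.get(key, []):
                if overlaps(l, r):
                    pairs.add((l, r))
    return sorted(pairs)


variants = [("20", 14370, 14371), ("20", 17330, 17331), ("20", 1110696, 1110697)]
dnase = [("20", 14000, 15000), ("20", 1110000, 1111000), ("21", 14000, 15000)]
joined = region_join(variants, dnase)
print(joined)
```

In a distributed setting the bin key would double as the partition key, so each bin's comparisons can run on a different executor.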

ADAM preliminary performance

Acknowledgements

UC Berkeley: Matt Massie, Frank Nothaft, Michael Heuer

Tamr: Timothy Danford

MSSM: Jeff Hammerbacher, Ryan Williams

Cloudera: Tom White, Sandy Ryza

Thank you
@laserson
laserson@cloudera.com
