Post on 15-Feb-2017
transcript
© Cloudera, Inc. All rights reserved.
The Redemptive Power of Hadoop
Uri Laserson | @laserson | 14 November 2015
Scaling Up Genomics with Spark
We come in peace.
Pioneer plaque
What is genomics?
Organism → Cell → Genome
Reference chromosome
Location
“… decoding the Book of Life”
Ortelius, 1570
Google Maps, 2015
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
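The four reads on this slide use a FASTA-style layout: a `>` header line naming the read, followed by its sequence. As a minimal sketch of how such records could be read (the function name `parse_fasta` is hypothetical and not from the talk):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {read name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequence lines may wrap, so append to the current record.
            records[name] += line
    return records


READS = """>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
"""

reads = parse_fasta(READS)
```

Each read here is 24 bases long; real sequencing runs produce very large numbers of such short reads, which is what motivates the rest of the talk.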
Bioinformatics!
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Pipelines!
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
• Compressed text files (non-splittable)
• Semi-structured
• Poorly specified
• Global sort order
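To make the "semi-structured" point concrete: in a VCF data line the INFO column mixes `key=value` pairs with bare flags, and each sample column can only be interpreted through the FORMAT column. The sketch below parses one data line from the example; `parse_vcf_line` is a hypothetical helper, and it splits on whitespace only because the slide text uses spaces (real VCF files are tab-delimited).

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into fixed columns, INFO, and per-sample calls."""
    fields = line.split()  # assumption: whitespace-separated, as on the slide
    chrom, pos, vid, ref, alt, qual, filt, info, fmt = fields[:9]

    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flags such as DB or H2 carry no value; record them as True.
        info_dict[key] = value if sep else True

    # Each sample column is decoded using the keys listed in FORMAT.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, call.split(":")))
        for name, call in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt, "info": info_dict, "samples": samples,
    }


line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")
record = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
```

Even this small parser has to handle three nested delimiters (whitespace, `;`, `:`) plus schema information that lives in the header, which is the kind of hand-rolled parsing the later slides argue against.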
C | HPC (scheduler) | POSIX filesystem
Java | HPC (Queue) | POSIX filesystem
C++ | Single-node | SQLite
It’s file formats all the way down!
Dedup
/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);

    final SAMFileWriter out = new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
                rec.setDuplicateReadFlag(false);
            }
        }
        recordInFileIndex++;
        if (!this.REMOVE_DUPLICATES || !rec.getDuplicateReadFlag()) {
            out.addAlignment(rec);
        }
    }
    return 0;
}
Method
Code
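For contrast, the "method" on its own can be written without any file handling. The sketch below is a simplified illustration, not Picard's implementation: it assumes single-end reads represented as dicts with hypothetical fields `chrom`, `five_prime_pos`, `strand` and `score`, treats reads sharing those coordinates as one duplicate group, and keeps the highest-scoring read in each group.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the set of indexes of reads flagged as duplicates.

    Reads sharing (chrom, 5' position, strand) form one group; the
    highest-scoring read in each group is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for index, read in enumerate(reads):
        key = (read["chrom"], read["five_prime_pos"], read["strand"])
        groups[key].append(index)

    duplicates = set()
    for indexes in groups.values():
        # max() returns the first best read on ties, so the result is deterministic.
        best = max(indexes, key=lambda i: reads[i]["score"])
        duplicates.update(i for i in indexes if i != best)
    return duplicates


reads = [
    {"chrom": "20", "five_prime_pos": 100, "strand": "+", "score": 30},
    {"chrom": "20", "five_prime_pos": 100, "strand": "+", "score": 45},
    {"chrom": "20", "five_prime_pos": 100, "strand": "-", "score": 20},
    {"chrom": "20", "five_prime_pos": 250, "strand": "+", "score": 40},
]
dups = mark_duplicates(reads)
```

Nothing in this version mentions BAM files, sort order, or spilling to disk; those concerns belong to the code and platform layers that the Java method mixes in with the algorithm.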
@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
Dedup
Method
Code
Platform
It’s pipelines all the way down!
Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 3: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Manually running pipelines on HPC
$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py
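These commands follow a split, per-chunk aggregate, merge pattern, with the user acting as the scheduler. As a rough sketch of the same pattern expressed in one program, the example below uses Python's `concurrent.futures`. The functions `split`, `aggregate` and `merge` are hypothetical stand-ins; the contents of the actual scripts are not shown in the talk.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(items, n_chunks):
    """Scatter: deal items round-robin into n_chunks lists."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def aggregate(chunk):
    """Per-chunk work: count genotype calls in one chunk."""
    return Counter(chunk)


def merge(partials):
    """Gather: combine the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


genotypes = ["0/0", "0/1", "1/1", "0/0", "0/1", "0/0", "1/1", "0/0"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map runs aggregate on the four chunks concurrently.
    partials = list(pool.map(aggregate, split(genotypes, 4)))
counts = merge(partials)
```

The point of the comparison is that the framework, not the person at the terminal, tracks which chunks exist and when the merge step may start.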
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Alignment → Dedup → Recalibrate → QC/Filter
Alignment → Dedup → Recalibrate → QC/Filter
Node 1
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2
Node 3
Alignment → Dedup → Recalibrate → QC/Filter
Alignment → Dedup → Recalibrate → QC/Filter
Node 4
Node 1
Alignment → Dedup → QC/Filter → Variant Calling → Variant Annotation
Node 2
Node 3
Alignment → Dedup → QC/Filter
Alignment → Dedup → QC/Filter
Node 4
Recalibrate
How now, brown cow?
Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model
• Improvements over existing formats
• ~20% for BAM
• ~90% for VCF
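One reason a columnar format can compress well is that values of the same field are stored next to each other, so repeated values form long runs. The toy example below illustrates that idea with plain Python and simple run-length encoding. It is not Parquet's actual encoding or API, and the record fields are made up for illustration.

```python
from itertools import groupby


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {key: [row[key] for row in records] for key in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


records = [
    {"chrom": "20", "pos": 14370, "filter": "PASS"},
    {"chrom": "20", "pos": 17330, "filter": "q10"},
    {"chrom": "20", "pos": 1110696, "filter": "PASS"},
    {"chrom": "20", "pos": 1230237, "filter": "PASS"},
]
columns = to_columns(records)
encoded_chrom = run_length_encode(columns["chrom"])
```

Here the `chrom` column collapses to a single (value, count) pair, whereas in a row-oriented text file the same value is repeated on every line.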
ContEst Algorithm
YARN-managed Hadoop cluster
Spark executors (partial sums): $\prod_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
Driver: $\prod_{i=1}^{N} \prod_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
Application code
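The diagram splits the likelihood into per-site products computed on the executors and a final combination on the driver. A product of many probabilities underflows quickly, so a common approach is to work with sums of logarithms, which would also fit the "Partial sums" label (that reading is an assumption). The sketch below uses plain Python lists in place of Spark partitions, and the probability values are made up.

```python
import math


def partial_log_likelihood(site_probs):
    """Executor side: sum log P(b_ij | e_ij, f_i) over the reads j at one site i."""
    return sum(math.log(p) for p in site_probs)


def total_log_likelihood(partitions):
    """Driver side: add up the per-site partial sums from every partition."""
    return sum(
        partial_log_likelihood(site)
        for partition in partitions
        for site in partition
    )


# Two partitions, each holding per-site lists of read probabilities (made-up values).
partitions = [
    [[0.9, 0.8], [0.99]],
    [[0.95, 0.7, 0.6]],
]
log_lik = total_log_likelihood(partitions)

# Direct product over all sites and reads, for comparison with exp(log_lik).
direct = math.prod(p for part in partitions for site in part for p in site)
```

Because addition is associative and commutative, the partial sums can be computed in any order on any executor, and the driver only has to add a handful of numbers.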
Hadoop provides layered abstractions for data processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduce | Impala (SQL) | Solr (search) | Spark
ADAM | quince | guacamole | …
bdg-formats (Avro/Parquet)
Executing query in Hadoop: interactive Spark shell (ADAM)
def inDbSnp(g: Genotype): Boolean = true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

// load data
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")

// apply predicates
val filteredRDD = genotypesRDD
  .filter(g => !inDbSnp(g))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

// join data
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

// group-by, aggregate (MAF)
val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))

// persist data
maf.saveAsNewAPIHadoopFile("path/to/output")
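The slide does not show `computeMAF`. As a sketch of what the group-by and aggregate step might compute, the example below groups genotype rows by (variant, population) and derives a minor allele frequency. It assumes diploid, biallelic genotypes encoded as the number of alternate alleles (0, 1 or 2); the row layout and function names are hypothetical.

```python
from collections import defaultdict


def compute_maf(alt_allele_counts):
    """Minor allele frequency for diploid, biallelic genotypes (0, 1 or 2 alt alleles)."""
    alt_freq = sum(alt_allele_counts) / (2 * len(alt_allele_counts))
    # The minor allele is whichever of ref/alt is rarer in this group.
    return min(alt_freq, 1 - alt_freq)


def maf_by_group(genotypes):
    """Group (variant, population, alt_count) rows by key, then aggregate."""
    groups = defaultdict(list)
    for variant, population, alt_count in genotypes:
        groups[(variant, population)].append(alt_count)
    return {key: compute_maf(counts) for key, counts in groups.items()}


genotypes = [
    ("20:14370:G:A", "pop1", 0),
    ("20:14370:G:A", "pop1", 1),
    ("20:14370:G:A", "pop2", 2),
    ("20:14370:G:A", "pop2", 2),
]
maf = maf_by_group(genotypes)
```

In the first group one of four alleles is alternate, giving a frequency of 0.25; in the second every allele is alternate, so the minor (reference) allele has frequency 0.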
Executing query in Hadoop: distributed SQL
-- aggregate (UDAF)
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
-- “load” and join data
FROM genotypes g
  INNER JOIN samples s
    ON g.sample = s.sample
  INNER JOIN dnase d
    ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d.end
  LEFT OUTER JOIN dbsnp p
    ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt
-- apply predicates
WHERE s.study = "framingham"
  AND p.pos IS NULL
  AND g.polyphen IN ( "possibly damaging", "probably damaging" )
-- group-by
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
Core Genomics Primitives: Spatial Join
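A spatial (region) join pairs records whose genomic intervals overlap, as in the DNase join from the earlier query. The sketch below is a naive single-machine version for illustration only, not ADAM's implementation: regions are hypothetical `(chrom, start, end)` tuples with half-open coordinates.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def region_join(left, right):
    """Pair every left and right (chrom, start, end) region that overlap."""
    # Index the right side by chromosome so only same-chromosome pairs are compared.
    by_chrom = defaultdict(list)
    for region in right:
        by_chrom[region[0]].append(region)

    pairs = []
    for chrom, start, end in left:
        for other in by_chrom[chrom]:
            if overlaps(start, end, other[1], other[2]):
                pairs.append(((chrom, start, end), other))
    return pairs


variants = [("20", 14370, 14371), ("20", 17330, 17331), ("21", 500, 501)]
dnase_sites = [("20", 14000, 15000), ("20", 20000, 21000), ("21", 100, 400)]
joined = region_join(variants, dnase_sites)
```

Unlike an equality join, an overlap join cannot simply hash on the key, which is why it is treated as a core primitive that a distributed implementation has to partition carefully.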
ADAM preliminary performance
Acknowledgements
UC Berkeley: Matt Massie, Frank Nothaft, Michael Heuer
Tamr: Timothy Danford
MSSM: Jeff Hammerbacher, Ryan Williams
Cloudera: Tom White, Sandy Ryza
Thank you
@laserson
laserson@cloudera.com