Date post: | 14-Jun-2015 |
Category: |
Technology |
Upload: | andy-petrella |
Lightning fast genomics
With Spark and ADAM
Who are we?
Andy
@Noootsab
@NextLab_be
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Xavier
@xtordoir
SilicoCloud
-> Physics
-> Data analysis
-> genomics
-> scalable systems
-> ...
Genomics
What is genomics about?
Medical Diagnostics
Drug response
Disease mechanisms
Genomics
What is genomics about?
- A human genome is a sequence of 3 billion nucleic acids (“bases”)
- About 1 base in 1,000 varies across the human population
- Genomes encode bio-molecules (tens of thousands)
- These molecules interact with each other
...and with the environment
→ Biological systems are very complex
Genomics
State of the art
- growing technological capacity
- cost reduction
- growing data
Genomics
State of the art
- I.T. becomes the bottleneck (cost and latency)
- data is sacrificed through sampling or cut-offs
ref: Andrea Sboner et al.
Genomics
Blocking points
- “legacy stack” not designed to scale (C, perl, …)
- HPC approach not a fit (data intensive)
Genomics
Future of genomics
- Personal genomes (e.g. 1,000,000 genomes for cancer
research)
- New sequencing technologies
- Sequence “stuff” as needed (e.g. microbiome,
diagnostics)
- medicalCondition = f(genomics, environmentHistory)
Genomics
Need for scalability → Scala & Spark
Need for simplicity and clarity → ADAM
Parquet 101
Columnar storage
Row oriented
Column oriented
Parquet 101
Columnar storage
> Homogeneous collocated data
> Better range access
> Better encoding
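The two layouts can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Parquet's on-disk format; the records, the field names and the helper functions are made up for the example.

```python
# Illustrative records (made-up data, not from the talk).
rows = [
    {"sample": "S1", "position": 100, "allele": "A"},
    {"sample": "S2", "position": 100, "allele": "A"},
    {"sample": "S3", "position": 100, "allele": "G"},
]

def to_columns(records):
    """Pivot row-oriented records into one list per field (column oriented)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse consecutive equal values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

columns = to_columns(rows)
# A query on a single field only has to touch that column.
positions = columns["position"]
# Homogeneous, collocated values lend themselves to simple encodings.
encoded_positions = run_length_encode(positions)
```

Run-length encoding is used here only as one simple example of an encoding that benefits from collocated, same-typed values.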
Parquet 101
Efficient encoding of nested typed structures
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}
Parquet 101
Efficient encoding of nested typed structures
Nested structure → Tree
Empty levels → Branch pruning
Repetitions → Metadata (index)
Types → Safe/Fast codec
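In the Dremel-style encoding that Parquet uses, each leaf column of the schema tree carries a repetition level and a definition level, and their maximum values follow directly from the schema. The sketch below computes those maxima for the Document schema; the nested-tuple representation of the schema and the function name are assumptions made for the example, not a Parquet API.

```python
# Each field is (name, repetition, children); children is None for a leaf.
DOCUMENT = [
    ("DocId", "required", None),
    ("Links", "optional", [
        ("Backward", "repeated", None),
        ("Forward", "repeated", None),
    ]),
    ("Name", "repeated", [
        ("Language", "repeated", [
            ("Code", "required", None),
            ("Country", "optional", None),
        ]),
        ("Url", "optional", None),
    ]),
]

def max_levels(fields, path=(), rep=0, defn=0):
    """Return {column path: (max repetition level, max definition level)}."""
    levels = {}
    for name, repetition, children in fields:
        # Only repeated fields raise the repetition level.
        r = rep + (1 if repetition == "repeated" else 0)
        # Optional and repeated fields can be absent, so they raise the definition level.
        d = defn + (0 if repetition == "required" else 1)
        column = path + (name,)
        if children is None:
            levels[".".join(column)] = (r, d)
        else:
            levels.update(max_levels(children, column, r, d))
    return levels

levels = max_levels(DOCUMENT)
```

For instance, Name.Language.Country sits under two repeated groups and is itself optional, which gives it a maximum repetition level of 2 and a maximum definition level of 3.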
Parquet 101
Efficient encoding of nested typed structures
ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Parquet 101
Optimized distributed storage (e.g. in HDFS)
ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/
Parquet 101
Efficient (schema based) serialization: AVRO
JSON Schema:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
IDL:
record User {
  string name;
  union { null, int } favorite_number = null;
  union { null, string } favorite_color = null;
}
Parquet 101
Efficient (schema based) serialization: AVRO
The JSON schema is part of the:
● protocol
● serialization
→ less metadata
Define: IDL → JSON
Send: Binary → JSON
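Why a shared schema means less metadata can be sketched with the User record: when both sides know the schema, only the values need to travel, in field order. The code below uses Python's standard json module and is not Avro's binary format; the sample user and the encode/decode helpers are made up for the illustration.

```python
import json

SCHEMA = json.loads("""
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
""")

FIELD_NAMES = [field["name"] for field in SCHEMA["fields"]]

def encode(record):
    """Keep only the values, in schema order; field names are not sent."""
    return json.dumps([record.get(name) for name in FIELD_NAMES])

def decode(payload):
    """Rebuild the record by pairing the schema's field names with the values."""
    return dict(zip(FIELD_NAMES, json.loads(payload)))

# Made-up record for the example.
user = {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}
payload = encode(user)
self_describing = json.dumps(user)
```

The payload is shorter than the self-describing JSON because the field names live in the schema instead of in every record.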
ADAM
Credits: AmpLab (UC Berkeley)
ADAM
Overview (Sequencing)
- DNA is a molecule
…or a Seq[Char] over the alphabet (A, T, G, C)
ADAM
Sequencing
- Massively parallel sequencing of random reads of 100-150
bases (20,000,000 reads per genome)
- 30-60x coverage for quality
- All this mess must be re-organised!
→ ADAM
ADAM
Variant Calling
- From an organized set of reads (ADAM Pileup)
- Detect variants (Variant Calling)
→ AVOCADO
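The two steps above can be sketched on toy data: pile up the bases of aligned reads per reference position, then flag positions where the reads disagree with the reference. This is a deliberately naive majority vote, not the method used by ADAM or Avocado; the reference, the reads and the min_depth parameter are all made up for the example.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGT"  # toy reference sequence (made up)

# Aligned reads as (start position on the reference, bases); made-up data.
READS = [
    (0, "ACGA"),
    (1, "CGAA"),
    (2, "GAAC"),
    (4, "ACGT"),
]

def pileup(reads):
    """Group the bases of all reads by the reference position they cover."""
    columns = defaultdict(list)
    for start, bases in reads:
        for offset, base in enumerate(bases):
            columns[start + offset].append(base)
    return columns

def call_variants(reference, columns, min_depth=2):
    """Report positions where the most frequent base differs from the reference."""
    variants = {}
    for position in sorted(columns):
        bases = columns[position]
        if len(bases) < min_depth:
            continue  # too few reads to say anything
        base, _count = Counter(bases).most_common(1)[0]
        if base != reference[position]:
            variants[position] = (reference[position], base)
    return variants

variants = call_variants(REFERENCE, pileup(READS))
```

Here the three reads covering position 3 all carry an A where the reference has a T, so that position is the only one reported.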
ADAM
Genomics specifications
- SAM, BAM, VCF
- Indexable
- libraries
- ~ scalable: hadoop-bam
ADAM
ADAM model
- schema based (Avro), libraries are generated
- no storage spec here!
ADAM
ADAM model
- Parquet storage
- evenly distribute data
- storage optimized for read/query
- better compression
ADAM
ADAM API
- ADAMContext provides functions to read from HDFS
ADAM
ADAM API
- Scala classes generated from Avro
- Data loaded as RDDs (Spark’s Resilient Distributed
Datasets)
- functions on RDDs (write to HDFS, genomic objects
manipulations)
ADAM
ADAM API
- e.g. reading genotypes
ADAM
ADAM Benchmark
- It scales!
- Data is more compact
- Read perf is better
- Code is simpler
As usual… let’s get some data.
Genomes relate to health and are private.
Still, there are options!
Stratification using 1000Genomes
http://www.1000genomes.org/
(now targeting 2,000 genomes)
ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
Stratification using 1000Genomes
Study genetic variations in populations (needs more contextual data for healthcare).
To validate the interest in ADAM, we’ll do some qualitative exploration of the data.
Question: is it possible to predict which subpopulation a given genome belongs to?
We can run an unsupervised algorithm on a massive number of genomes.
The idea is to find clusters that would match subpopulations.
Stratification using 1000Genomes
This matters because stratification reflects population histories: gene flows, selection, ...
Stratification using 1000Genomes
From the 200 TB of data, we’ll focus on chromosome 6, and only on its variants
ref: http://en.wikipedia.org/wiki/Chromosome
Genome Data
Data structure
Panel: Map[SampleID, Population]
Genome Data
Data structure
Genotypes in VCF format
Basically a text file. Ours were downloaded from S3.
Converted to ADAM Genotypes
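To make the "text file to genotype records" step concrete, here is a minimal sketch that splits one VCF data line into one record per sample. It assumes the FORMAT column is exactly GT and that no call is missing; the record keys are hypothetical and do not reproduce ADAM's Genotype schema, and the variant line itself is made up.

```python
# Tab-separated VCF header and one data line (made-up variant, two samples).
VCF_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00120\tNA18560"
VCF_LINE = "6\t12345\trs0000\tA\tG\t100\tPASS\t.\tGT\t0|1\t1|1"

def parse_genotypes(header, line):
    """Turn one VCF data line into one genotype record per sample."""
    columns = header.lstrip("#").split("\t")
    values = line.split("\t")
    # The first nine columns describe the variant, the rest are per-sample calls.
    fixed = dict(zip(columns[:9], values[:9]))
    alleles = [fixed["REF"]] + fixed["ALT"].split(",")
    records = []
    for sample, call in zip(columns[9:], values[9:]):
        # With FORMAT "GT", a call lists allele indices separated by | or /.
        indices = call.replace("/", "|").split("|")
        records.append({
            "sample": sample,
            "contig": fixed["CHROM"],
            "position": int(fixed["POS"]),
            "alleles": [alleles[int(i)] for i in indices],
        })
    return records

genotypes = parse_genotypes(VCF_HEADER, VCF_LINE)
```

Index 0 refers to the REF allele and higher indices to the ALT alleles, so "0|1" becomes the pair A, G in this example.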
Machine Learning model
Clustering: KMeans
ref: http://en.wikipedia.org/wiki/K-means_clustering
PreProcess = {A,C,T,G}² → {0,1,2}
Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰
Distance = Euclidean (L2) ⁽*⁾
⁽*⁾ MLlib restriction (SPARK-3012), although here L2 ~ L1
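A minimal sketch of the preprocessing and the distance, in plain Python. It assumes the {0,1,2} code is the number of non-reference alleles in the pair, which is one common convention and may differ from what the talk's code does; the genotypes below are made up.

```python
import math

def encode_genotype(alleles, reference):
    """Map a pair of alleles to 0, 1 or 2: the number of non-reference alleles."""
    return sum(1 for allele in alleles if allele != reference)

def euclidean(u, v):
    """L2 distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """L1 distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Made-up genotypes at three variant sites, with the reference base of each site.
references = ["A", "C", "T"]
sample_1 = [("A", "A"), ("C", "G"), ("G", "G")]
sample_2 = [("A", "T"), ("C", "C"), ("G", "G")]

v1 = [encode_genotype(g, r) for g, r in zip(sample_1, references)]
v2 = [encode_genotype(g, r) for g, r in zip(sample_2, references)]
```

When two vectors differ by at most 1 in each coordinate, as here, the squared L2 distance equals the L1 distance, which is one way to read the "L2 ~ L1" remark.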
Machine Learning model
MLlib, KMeans
MLlib:
● Machine Learning Algorithms
● Data structures (e.g. Vector)
Machine Learning model
MLlib KMeans
DataFrame Map:
● key = Sample
● value = Vector of genotype alleles (sorted by Variant)
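The clustering step itself can be sketched without Spark. The code below runs plain Lloyd iterations on a tiny sample-to-vector map; it is not MLlib's KMeans (which handles initialisation and distribution for you), and the sample names, vectors and starting centroids are made up.

```python
def squared_distance(u, v):
    """Squared L2 distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def closest(centroids, vector):
    """Index of the centroid nearest to vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], vector))

def mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for vector in vectors:
            groups[closest(centroids, vector)].append(vector)
        # Keep the previous centroid when a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# key = sample, value = genotype vector (made-up values, four variants only).
samples = {
    "sample_a": [0, 0, 2, 2],
    "sample_b": [0, 1, 2, 2],
    "sample_c": [2, 2, 0, 0],
    "sample_d": [2, 1, 0, 0],
}

vectors = list(samples.values())
centroids = kmeans(vectors, centroids=[vectors[0], vectors[2]])
clusters = {name: closest(centroids, vector) for name, vector in samples.items()}
```

Sorting the alleles by variant matters because every coordinate of the vector has to refer to the same variant across all samples.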
Mashup
prediction
Sample [NA20332] is in cluster #0 for population Some(ASW)
Sample [NA20334] is in cluster #2 for population Some(ASW)
Sample [HG00120] is in cluster #2 for population Some(GBR)
Sample [NA18560] is in cluster #1 for population Some(CHB)
Mashup
      #0   #1   #2
GBR    0    0   89
ASW   54    0    7
CHB    0   97    0
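A table like this one is a count of samples per (population, cluster) pair, built by joining the predictions with the panel. The sketch below shows that count plus a simple purity score; the purity measure is an addition for illustration and is not something the talk reports.

```python
from collections import Counter

def contingency(assignments):
    """Count samples per (population, cluster) pair."""
    return Counter(assignments)

def purity(table):
    """Share of samples that belong to the majority population of their cluster."""
    best = {}
    for (_population, cluster), count in table.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / sum(table.values())

# The four example predictions from the talk, as (population, cluster) pairs.
predictions = [("ASW", 0), ("ASW", 2), ("GBR", 2), ("CHB", 1)]
table = contingency(predictions)

# The non-zero cells of the population-by-cluster table.
slide_counts = {("GBR", 2): 89, ("ASW", 0): 54, ("ASW", 2): 7, ("CHB", 1): 97}
slide_purity = purity(slide_counts)
```

On the table's counts, 240 of the 247 samples sit in the cluster dominated by their own population; the 7 ASW samples in cluster #2 are the exceptions.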
Cluster
4 m3.xlarge instances (EC2)
16 cores + 60G
Cluster
Performances
Cluster
40 m3.xlarge
160 cores + 600G
Conclusions and future work
● ADAM and Spark provide tools to manipulate genomics data in a scalable way
● Simple APIs in Scala
● MLlib for machine learning
→ implement less naïve algorithms
→ cross medical and environmental data with genomes
Acknowledgements
Scala.IO
AmpLab, Matt Massie, Frank Nothaft
Vincent Botta
That’s all Folks
Apparently, we’re supposed to stay on stage
Waiting for questions
Hoping for none
Looking at the bar
And the lunch
Oh there are beers
And candies
who can read this?