Date post: | 17-Jul-2015 |
Category: |
Technology |
Upload: | mapr-technologies |
Hadoop as a Platform for Genomics
@AllenDay, Chief Scientist
Sungwook Yoon, Data Scientist
Data Science @MapR
®© 2014 MapR Technologies 2
DNA Sequencing, pre-2004
[Figure: trends over years for CPU transistors/mm², HDD GB/mm², and DNA bp/$ (pre-2004)]
DNA Sequencing, 2004 Disruption
[Figure: trends over years for CPU transistors/mm², HDD GB/mm², DNA bp/$ (pre-2004), and DNA bp/$ (post-2004)]
A similar disruption occurred for Internet traffic in the mid-1990s
Effect: Many DNA-Based Apps Coming…
• 2014: US$ 2B, mostly research, mostly chemical costs
• 2020: US$ 20B, mostly clinical, mostly analytics costs
Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
[Chart: 2014 vs. 2020, split into Clinical and Non-Clinical; vertical axis from 0 to 25]
1. What Kind of Analytics Apps?
2. How do they Work?
Target Audience
• Fluency in computing, math
• Basic knowledge of genetics, DNA
…so expect some encapsulated complexity
http://xkcd.com/803/
Clinical Sequencing Business Process Workflow
[Diagram labels: Physician, Patient, Clinic, blood/saliva, Clinical Lab, extract, Analytics]
Step 1: Identify all the Single Nucleotide Polymorphisms
• Currently ~12MM known SNPs
• Each person has a unique Genotype
– Typically 3-5MM SNPs
– Relative to a reference human – diff this.human other.human, essentially
• Inherited from parents
• Inexpensive to find as sequencing costs have plummeted
http://learn.genetics.utah.edu/content/pharma/snips/
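The "diff this.human other.human" idea can be sketched in a few lines. This is only a toy illustration, assuming two already-aligned sequences of equal length; the function name `find_snps` and the example sequences are hypothetical, and real pipelines start from sequencer reads rather than plain strings.

```python
from typing import List, Tuple

def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]

# Made-up sequences standing in for a reference human and a patient.
reference = "ACGTACGTAC"
patient = "ACGTTCGTAA"
snps = find_snps(reference, patient)
print(snps)
```

The result is a short list of differences, which is why a genotype can be stored relative to the reference instead of as a full genome.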
Step 2: Characterize all the SNPs (ML, AI)
Other data & algorithms
JOIN
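The JOIN in this step can be pictured as attaching annotation records to each SNP found in Step 1. A minimal sketch, assuming a hypothetical in-memory annotation table keyed by (position, alternate base); the key format, function name, and annotation text are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

SnpKey = Tuple[int, str]  # (position, alternate base); hypothetical key format

def annotate_snps(
    patient_snps: List[SnpKey],
    annotations: Dict[SnpKey, str],
) -> List[Tuple[SnpKey, Optional[str]]]:
    """Left join: keep every patient SNP, with None where nothing is known."""
    return [(snp, annotations.get(snp)) for snp in patient_snps]

# Hypothetical annotation table standing in for "other data & algorithms".
annotations = {
    (4, "T"): "altered response to drug A",
    (7, "G"): "no known effect",
}
patient_snps = [(4, "T"), (9, "A")]
annotated = annotate_snps(patient_snps, annotations)
print(annotated)
```

A left join is used here so that uncharacterized SNPs stay visible instead of silently dropping out.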
Step 3: Use Genotype to Customize Therapy
Pop. Freq | Drug A Response | Drug B Response
10% | Good | Good
30% | Poor | Fair
30% | Excellent | Poor
30% | Good, but Toxic | Fair
Innovation Opportunities
“Nil nocere” – do no harm
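One way to read the table is as a lookup that picks a drug per genotype group. The sketch below is an assumption-laden illustration, not a clinical rule: the ranking of responses, the tie-breaking toward drug A, and the function name `choose_drug` are all invented here.

```python
# Assumed ordering of responses, from worst to best.
RANK = {"Poor": 0, "Fair": 1, "Good": 2, "Excellent": 3}

# Rows from the table: (population frequency, drug A response, drug B response).
GROUPS = [
    (0.10, "Good", "Good"),
    (0.30, "Poor", "Fair"),
    (0.30, "Excellent", "Poor"),
    (0.30, "Good, but Toxic", "Fair"),
]

def choose_drug(response_a: str, response_b: str) -> str:
    """Pick the better-ranked drug, never one marked toxic ("do no harm")."""
    candidates = {"A": response_a, "B": response_b}
    safe = {drug: resp for drug, resp in candidates.items() if "Toxic" not in resp}
    if not safe:
        return "none"
    # Ties go to the alphabetically first drug (an arbitrary choice).
    return max(sorted(safe), key=lambda drug: RANK[safe[drug]])

choices = [choose_drug(a, b) for _, a, b in GROUPS]
print(choices)
```

Under these assumptions the second group still ends up with only a "Fair" option, which is the kind of gap the "Innovation Opportunities" label points at.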
Jan 30: Obama Unveils “Precision Medicine” Initiative
“Most medical treatments have been designed for the ‘average patient’ … treatments can be very successful for some patients but not for others.”
http://www.msnbc.com/msnbc/obama-seeks-215-million-personalized-medicine
Application: Forensic Analysis
http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
Moore’s Law #Dataviz: Lara Croft 230=>40,000 Polygons (1996-2014)
http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
1. What Kind of Analytics Apps?
2. How do they Work?
Genome Sequencing in a Nutshell
[Diagram labels: Reference Human, de novo sequencing + assembly, Reference Genome; Patient, resequencing, Patient Genotype]
Population-Scale Genome Biobanking
GATK: Typical Tool for DNA=>Genotype Conversion
Advantages
• No consensus alternative… yet
• Works!
• Already deployed and being used to save lives
Disadvantages
• Map-Reduce but not Hadoop (and no plans to support)
• Compute context cannot span multiple nodes
• Inefficient use of shared memory (even within one node)
• Inefficient asymmetric joins. No leverage of context, data locality
GATK: flat after chromosome split
Big Picture
N DNA Input Records
All SNPs
Catalog still growing; Genotype space huge ≫ 8E37
Personal input is fixed N records and trivial to cut into P partitions
A good implementation: scales O(N) ~ F(N,P)
But GATK is SLOW: scales O(N) ~ F(Genotypes)
GATK parallelization metrics / DEAD END attempts: https://github.com/allenday/sequencing-utils
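The claim that a fixed set of N input records is trivial to cut into P partitions can be illustrated with a small sketch. Everything here is an assumption for illustration: the helper names, the toy reads, and the per-record work (counting G and C bases). Threads are used only to show the structure; a real deployment would spread partitions across cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def partition(records: List[str], p: int) -> List[List[str]]:
    """Cut N records into P contiguous, near-equal partitions."""
    n = len(records)
    bounds = [(i * n) // p for i in range(p + 1)]
    return [records[bounds[i]:bounds[i + 1]] for i in range(p)]

def count_gc(chunk: List[str]) -> int:
    """Stand-in for per-record work: count G and C bases in a chunk."""
    return sum(read.count("G") + read.count("C") for read in chunk)

# Made-up input records.
records = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT", "GATC", "CCCA"]
parts = partition(records, p=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_gc, parts))
total = sum(partial_counts)
print([len(part) for part in parts], partial_counts, total)
```

Because each partition is processed independently and the partial results are simply summed, the work per worker depends only on N and P.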
Bigger Picture: Human Suffering
• Widely disliked. Reduction of suffering is good business.
Even Bigger
• Is it morally wrong to allow others to suffer?
• If you agree, and there’s a way to reduce suffering, then…
• We can argue there is a moral imperative to build the most efficient, dependable, inexpensive solution possible
From Feasible to Easy & Efficient
Two Phases of Genome Data Analysis
• Batch Sequence Processing
– Align the reads to the correct locations
– Detect variants correctly through statistical modeling
• Genome / Phenome Data Analysis
– Find relevant Genotypes for Phenotypes
– Find relevant Phenotypes for Genotypes
Genome Processing Requirements
• Big Storage: 2TB per person
• Big Memory
• Algorithms: Sorting, Group By, Clustering, Sparse Matrix, Forward Backward
• Distributed Processing
• Affordable Hardware
Which Free SW Has This Solution?
Genome Processing Needs More Than Hadoop
• Strong In Memory Computation
• Strong Sparse Matrix Computation
Which Free SW Has This Solution?
Still One More
Genome Data Format Definition
Records 1-3: (A 1 Z) (B 1 Z) (C 1 Z)
Row-based layout: A 1 Z | B 1 Z | C 1 Z
Column-based layout: A B C | 1 1 1 | Z Z Z
Sorting, Group, MLlib
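The difference between the row-based and column-based layouts can be shown by transposing the three example records. A minimal sketch; the column names and the helper `to_columns` are made up, since the slide does not name the fields.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, str]

def to_columns(rows: List[Record], names: Tuple[str, str, str]) -> Dict[str, list]:
    """Transpose row-based records into one list per column."""
    return {name: list(values) for name, values in zip(names, zip(*rows))}

# The three records from the slide, stored row by row.
rows = [("A", 1, "Z"), ("B", 1, "Z"), ("C", 1, "Z")]
columns = to_columns(rows, ("key", "count", "tag"))  # column names are hypothetical
print(columns)
```

In the column-based form, an operation that touches one field (for example sorting or grouping on `key`) reads a single contiguous list instead of scanning every record.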
ADAM Pipeline
Data Workflow: FastQ, BAM, ADAM, ADAM-VCF, VCF
Compute Engines: Avocado, ADAM, ADAM, Aligner
Super Fast
• In-memory
• Scalable compute context
A pipeline in a genomics data workflow is a sequence of data transformations from DNA sequence reads to variant calls.
Scale with Machines
From ADAM Tech Report
That’s a lot, but it is just a start
• Why do we want sequencing?
– To catch criminals ??
• Police State??
• Deeper, wider genome study may reveal
– Future medicine
– Cure for diseases
– Maybe … find Heroes??
Variants Accumulate – Need a Scalable Variant Store
ADAM ADAM-VCF
Genome × Phenome Analysis
For a given population, a given SNP 𝛿, and a given phenotype ϕ: count the number of occurrences as the value of the matrix.
[Figure: matrix with SNP rows (𝛿1, 𝛿3, 𝛿5) and phenotype columns (ϕ1, ϕ3, ϕ5); SPARSE, Billion+ Genotypes × Billion+ Phenotypes]
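The count matrix described on this slide can be sketched as a dictionary that stores only the nonzero cells. This assumes "occurrences" means the number of people who carry the SNP and also show the phenotype; the population data, identifiers, and function name below are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical population: each person has a set of SNP ids and a set of phenotype ids.
population: List[Tuple[Set[str], Set[str]]] = [
    ({"d1", "d3"}, {"p1"}),
    ({"d1"}, {"p1", "p5"}),
    ({"d5"}, {"p3"}),
]

def cooccurrence_counts(people: List[Tuple[Set[str], Set[str]]]) -> Dict[Tuple[str, str], int]:
    """Count, for each (SNP, phenotype) pair, how many people have both."""
    counts: Counter = Counter()
    for snps, phenotypes in people:
        for snp in snps:
            for phenotype in phenotypes:
                counts[(snp, phenotype)] += 1
    return dict(counts)

matrix = cooccurrence_counts(population)
print(sorted(matrix.items()))
```

Only four of the nine possible cells are stored here, which is the property that matters when both axes run into the billions.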
Interpreting Genome × Phenome Matrix Factorization Result
• Row vectors of X represent
– Archetype set of phenotypes
• Column vectors of Y represent
– Archetype set of genotypes
[Figure: the SNP (𝛿1, 𝛿3, 𝛿5) × phenotype (ϕ1, ϕ3, ϕ5) matrix, with the Principal Column Vector labeled Archetype Genotypes and the Principal Row Vector labeled Archetype Phenotypes]
A sparse matrix package is actively developed in the Spark community
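To make the principal row and column vectors concrete, here is a rank-1 factorization by power iteration in plain Python. It is a sketch under assumptions: the slide does not define X and Y or name an algorithm, so the method, the function names, and the toy matrix are all choices made here for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]

def normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def principal_vectors(m: Matrix, iterations: int = 100) -> Tuple[List[float], float, List[float]]:
    """Rank-1 approximation m ~ s * outer(u, v) by power iteration.

    u is the principal column vector (one weight per genotype row);
    v is the principal row vector (one weight per phenotype column).
    """
    rows, cols = len(m), len(m[0])
    v = normalize([1.0] * cols)
    for _ in range(iterations):
        u = normalize([sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)])
        v = normalize([sum(m[i][j] * u[i] for i in range(rows)) for j in range(cols)])
    s = sum(u[i] * m[i][j] * v[j] for i in range(rows) for j in range(cols))
    return u, s, v

# Toy count matrix: rows are SNPs, columns are phenotypes (values are made up).
counts = [
    [2.0, 4.0, 0.0],
    [1.0, 2.0, 0.0],
    [3.0, 6.0, 0.0],
]
u, s, v = principal_vectors(counts)
approx = [[s * ui * vj for vj in v] for ui in u]
print([[round(x, 6) for x in row] for row in approx])
```

The toy matrix is exactly rank 1, so the approximation reproduces it; on real data the same vectors would capture only the dominant genotype and phenotype pattern.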
Toward Heroes: Genome × Phenome Tensor
• Aggregating over individuals with a matrix could ignore the correlations among genotypes and phenotypes
• Maintain individual identity
[Figure: two panels, each with axes Variants and Phenotypes]
Tensor Factorization (Parafac)
[Figure: tensor with axes Genome Variants and Phenome ≈ Principal Variants 1 combined with Principal Phenotypes 1]
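A single PARAFAC component can be fitted with alternating least squares. The sketch below is a minimal rank-1 version in plain Python, assuming a dense individual × variant × phenotype tensor; the axis order, function names, and toy factors are assumptions, and a real analysis would use several components and sparse storage.

```python
from typing import List, Tuple

Tensor = List[List[List[float]]]  # indexed [individual][variant][phenotype]

def sq_norm(vec: List[float]) -> float:
    """Squared Euclidean norm."""
    return sum(x * x for x in vec)

def parafac_rank1(t: Tensor, iterations: int = 50) -> Tuple[List[float], List[float], List[float]]:
    """Fit t[i][j][k] ~ a[i] * b[j] * c[k] by alternating least squares."""
    n_i, n_j, n_k = len(t), len(t[0]), len(t[0][0])
    a, b, c = [1.0] * n_i, [1.0] * n_j, [1.0] * n_k
    for _ in range(iterations):
        # Each update is the least-squares solution with the other two factors fixed.
        scale = sq_norm(b) * sq_norm(c)
        a = [sum(t[i][j][k] * b[j] * c[k] for j in range(n_j) for k in range(n_k)) / scale
             for i in range(n_i)]
        scale = sq_norm(a) * sq_norm(c)
        b = [sum(t[i][j][k] * a[i] * c[k] for i in range(n_i) for k in range(n_k)) / scale
             for j in range(n_j)]
        scale = sq_norm(a) * sq_norm(b)
        c = [sum(t[i][j][k] * a[i] * b[j] for i in range(n_i) for j in range(n_j)) / scale
             for k in range(n_k)]
    return a, b, c

# Toy rank-1 tensor built from made-up factors, so the fit can be checked.
true_a, true_b, true_c = [1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 1.0]
tensor = [[[x * y * z for z in true_c] for y in true_b] for x in true_a]
a, b, c = parafac_rank1(tensor)
rebuilt = [[[x * y * z for z in c] for y in b] for x in a]
max_error = max(
    abs(rebuilt[i][j][k] - tensor[i][j][k])
    for i in range(2) for j in range(3) for k in range(2)
)
print(max_error < 1e-9)
```

Here `b` plays the role of the principal variants and `c` the principal phenotypes, while `a` keeps a weight per individual instead of aggregating individuals away.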
From Imaginable to Possible
Genome needs Hadoop
[Diagram labels: DNA Sequencer, Reads, Reference Genome, Variant Calling, Genotype/Phenotype/Individual Matrix, Medical Records, Patient, Cure & Prevent Disease]
Scalable Variant Store – Data Mining
Model P ~ F(G)
• G: Genotypes
• P: Med Record Phenotypes, e.g. disease risk, drug response
Fortunately, this has already been done…
Largest Biometric Database in the World
1.2B PEOPLE
Why Create Aadhaar?
• India: 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay income tax, <20% have bank accounts
– ~800 million mobile, ~200-300 million migrant workers
• Govt. spends about $25-40 billion on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities causing leakage of 30-40%
Standardize identity => Stop leakage
Aadhaar Biometric Capture & Index
Raw Digital Fingerprint
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
• 900MM people loaded in 4 years
• In production
– 1MM registrations/day
– 200+ trillion lookups/day
• All built on MapR-DB (HBase)
How Does this Relate to Genomics?
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants
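The "same data shape and size" comparison is easier to judge with the totals written out. This is back-of-envelope arithmetic on the figures quoted on this slide only; decimal megabytes are assumed, and no bytes-per-variant figure is assumed, so the genome side is left as a record count.

```python
# Aadhaar side: figures as quoted on the slide.
aadhaar_people = 1_000_000_000
minutiae_bytes_per_person = 5 * 10**6  # 5 MB, decimal megabytes assumed
aadhaar_total_bytes = aadhaar_people * minutiae_bytes_per_person

# Genome side: figures as quoted on the slide.
genome_people = 7_000_000_000
variants_per_person = 3_000_000
total_variant_records = genome_people * variants_per_person

print(aadhaar_total_bytes / 10**15, "PB of fingerprint minutiae")
print(f"{total_variant_records:.1e} variant records")
```

Both sides are "people × per-person features" tables, with billions of rows and millions of values per row.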
How Does this Relate to Genomics?
F^-1(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
• Aadhaar: 1B humans, 5MB minutiae
• Genome: 7B humans, ~3M variants
• Genome: variant × phenotype
• Common variant => effect-causing gene F^-1(x)!
Same data set operations
Genotype/Phenotype/Individual Matrix
[Figure: two matrices compared (≈): individuals × fingerprint minutiae, captioned "Find genetic basis of fingerprints"; medical records × genetic variants, captioned "Find genetic basis of disease"]
Thanks! Questions?
@allenday, @mapr
[email protected], [email protected]
linkedin.com/in/allenday