Dr. Denis Bauer & Lynn Langit
Genomic-scale Data Pipelines
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Team
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai, PhD
Natalie Twine, PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
Big Data in 2025…Petabytes?
[Bar chart: Astronomy, YouTube; bar values 1000, 17, 2000; axis 0 to 2500]
Genome holds the blueprint for every cell
It affects looks, disease risk, and behavior
GENOMIC Big Data in 2025 - Exabytes
[Bar chart: Astronomy, YouTube, Genomic; bar values 1, 0.17, 2, 20; axis 0 to 25]
VCF Data
Genomic Research Workflow
https://www.projectmine.com/about/
Focus
Finding the disease gene(s)
Spot the variant that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
[Figure: variants in Gene1 and Gene2 compared between cases and controls]
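The (oversimplified) rule above can be sketched in a few lines of Python. This is only an illustration: the variant IDs, sample names and the 0/1 genotype encoding are made up, and real analyses use statistical association rather than an all-or-nothing filter.

```python
from typing import Dict, List

# Hypothetical toy data: 1 = sample carries the variant, 0 = it does not.
# Variant IDs and sample names are invented for illustration.
GENOTYPES: Dict[str, Dict[str, int]] = {
    "gene1_var_a": {"case1": 1, "case2": 1, "ctrl1": 0, "ctrl2": 0},
    "gene1_var_b": {"case1": 1, "case2": 0, "ctrl1": 0, "ctrl2": 0},
    "gene2_var_c": {"case1": 1, "case2": 1, "ctrl1": 1, "ctrl2": 0},
}


def candidate_variants(
    genotypes: Dict[str, Dict[str, int]],
    cases: List[str],
    controls: List[str],
) -> List[str]:
    """Return variants carried by every case and by no control."""
    hits = []
    for variant, calls in genotypes.items():
        in_all_cases = all(calls.get(sample, 0) == 1 for sample in cases)
        in_no_controls = all(calls.get(sample, 0) == 0 for sample in controls)
        if in_all_cases and in_no_controls:
            hits.append(variant)
    return hits


result = candidate_variants(GENOTYPES, ["case1", "case2"], ["ctrl1", "ctrl2"])
print(result)
```

With millions of variants and thousands of samples, the same per-variant test becomes a distributed computation, which is what motivates the pipeline choices that follow.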
Cloud Data Pipeline Pattern
Problem
• Define business problem
Data
• Quality
• Quantity
• Location
Candidate Technologies
• Ingest
• Clean
• Analyze
• Predict
• Visualize
Build MVPs
• Iterate
• Learn
• Assemble
Assemble Pipeline
• Validate sections
• Test at scale
Machine Learning Pipeline Pattern
What is CSIRO’s solution?
• For scale at reasonable cost: use Apache Hadoop
• For scale at speed: use Apache Spark
• For usability in bioinformatics: create a domain-specific ML API (library)
• For global use: leverage Cloud Pipeline Patterns
GWAS Analysis with Variant-Spark
On-premise Cluster with Apache Hadoop & Spark
Genomics Analysts
CSIRO corporate data center
Why Apache Spark?
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
Supervised ML: Wide Random Forests
Solving Important Questions…Cancer genomics?
DEMO: Who is a Hipster?
VariantSpark & Databricks Notebook
databricks Notebook
Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
[Figure: methods compared by accuracy (low to high) and speed (low to high)]
Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~$8620.00
• Yarn Cluster
• 12 workers
• 16 x Intel Xeon [email protected] CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6GB / executor (0.75TB)
• Synthetic dataset
Whole Genome Range
GWAS Range
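The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above. The slide does not say how the AWS costs were derived, so the implied hourly rate assumes that each quoted cost corresponds to the upper end of its runtime range.

```python
import math

# Cluster memory quoted on the slide: 128 executors with 6 GB each.
EXECUTORS = 128
GB_PER_EXECUTOR = 6

total_gb = EXECUTORS * GB_PER_EXECUTOR
total_tb = total_gb / 1024

# The two runtime/cost estimates for 100K trees:
# (min hours, max hours, quoted AWS cost in USD)
estimates = {
    "first": (5, 50, 215.50),
    "second": (200, 2000, 8620.00),
}

cost_ratio = estimates["second"][2] / estimates["first"][2]
hours_ratio = estimates["second"][1] / estimates["first"][1]

# Assumption (not stated on the slide): each quoted cost corresponds
# to the upper end of its runtime range.
implied_rates = {
    name: cost / max_hours
    for name, (_, max_hours, cost) in estimates.items()
}

print(f"executor memory: {total_gb} GB = {total_tb} TB")
print(f"cost ratio: {cost_ratio}, runtime ratio: {hours_ratio}")
print(f"implied USD per cluster-hour: {implied_rates}")
```

Under that assumption both estimates imply the same hourly cluster price, so the cost difference comes entirely from the runtime difference.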
Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
Future Directions for VariantSpark RF
Additional feature types
Unordered Categorical
For Scores - Continuous
Different feature ranges
Small and Big Inputs
For Gene Expression analysis
Genome editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy
Editing does not work every time, e.g. only 7 in 10 embryos were mutation free
Aim: Develop a computational guidance framework so that edits succeed the first time, every time
Ma et al. Nature 2017 *
* Controversy around the paper – stay tuned
Make process parallel and scalable
• SPEED: Each search can be broken down into parallel tasks to then only take seconds
• SCALE: Researchers might want to search the target for one gene or 100,000
Scalability + Agility =
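The SPEED point above (break each search into parallel tasks) can be sketched as follows. This is not GT-Scan2's code. It assumes, for illustration only, that a candidate target is 20 nucleotides followed by an NGG motif. Local threads stand in for independent workers here, whereas in a serverless deployment each chunk would go to its own function invocation.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Assumption for illustration: a candidate target is 20 nucleotides
# followed by an NGG motif (23 characters in total).
SITE_LEN = 23
TARGET = re.compile(r"(?=([ACGT]{20}[ACGT]GG))")  # lookahead finds overlapping sites

Hit = Tuple[int, str]


def split_with_overlap(seq: str, chunk_size: int) -> List[Tuple[int, str]]:
    """Split seq into (offset, text) chunks overlapping by SITE_LEN - 1,
    so a site that straddles a chunk boundary is still found exactly once."""
    return [
        (start, seq[start:start + chunk_size + SITE_LEN - 1])
        for start in range(0, len(seq), chunk_size)
    ]


def search_chunk(chunk: Tuple[int, str]) -> List[Hit]:
    """Find all candidate sites in one chunk, reported in sequence coordinates."""
    offset, text = chunk
    return [(offset + m.start(), m.group(1)) for m in TARGET.finditer(text)]


def parallel_search(seq: str, chunk_size: int = 50, workers: int = 4) -> List[Hit]:
    """Search all chunks concurrently and merge the hits in position order."""
    chunks = split_with_overlap(seq, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(search_chunk, chunks))
    return sorted(hit for hits in per_chunk for hit in hits)


# Made-up demo sequence
demo = ("ACGT" * 10 + "TGG" + "TTTT" * 5) * 3
hits = parallel_search(demo)
print(len(hits), "candidate sites found")
```

Because the chunks are independent, the same code covers the SCALE point: searching one gene or 100,000 only changes the number of chunks, not the logic.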
One of the first Serverless Applications in Research
Featured in
This is My Architecture
GT-Scan2
Considering Services for GT-Scan2
• Use AWS Step Functions
  • Simplify workflow
  • Simplify task timeouts
  • Simplify task failures
• Must evaluate costs
  • SNS vs. Step Functions
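To illustrate how Step Functions moves workflow order, timeouts and failure handling out of the function code, here is a minimal sketch that builds an Amazon States Language definition as a Python dictionary. The state names and Lambda ARNs are hypothetical and do not describe the actual GT-Scan2 workflow.

```python
import json
from typing import Optional


def lambda_task(arn: str, next_state: Optional[str] = None, timeout_s: int = 60) -> dict:
    """Build a Task state with a timeout, retries and a failure handler."""
    state = {
        "Type": "Task",
        "Resource": arn,
        "TimeoutSeconds": timeout_s,  # task timeout handled by the service
        "Retry": [{
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        # After retries are exhausted, route any error to one failure state.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ReportFailure"}],
    }
    if next_state is not None:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


# Hypothetical ARN prefix and function names, for illustration only.
ARN = "arn:aws:lambda:us-east-1:123456789012:function:"

state_machine = {
    "Comment": "Sketch of a three-step search workflow (hypothetical)",
    "StartAt": "FindTargets",
    "States": {
        "FindTargets": lambda_task(ARN + "find-targets", next_state="ScoreTargets"),
        "ScoreTargets": lambda_task(ARN + "score-targets", next_state="StoreResults"),
        "StoreResults": lambda_task(ARN + "store-results"),
        "ReportFailure": {
            "Type": "Fail",
            "Error": "SearchFailed",
            "Cause": "A task failed after retries",
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

With SNS-based chaining, the equivalent retry and timeout logic would live inside each function. That is the trade-off to weigh against the per-transition cost of Step Functions.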
Cloud Data Pipeline Pattern
Stages: Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
1. Analyze/GWAS (Spark)
• Data: vcf -> S3/Hadoop
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook SQL, R or Python
2. Search/GTScan2 (Serverless)
• Data: S3/fastq -> DynamoDB; S3/fastq, bed
• Ingest: S3
• ETL: Lambda
• Analyze: Lambda
• Viz: Lambda/API Gateway
Spark Pipeline Pattern
Jupyter Notebook
Serverless Architecture Pattern
[Diagram: Users, API Gateway, Lambda functions 1 to 3, buckets with objects, DynamoDB, Step Functions]
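As a rough sketch of the request path in this pattern (Users to API Gateway to a Lambda function to a data store), the handler below returns a response in the shape API Gateway's Lambda proxy integration expects. The job IDs and stored fields are hypothetical, and an in-memory dictionary stands in for a DynamoDB table so the example runs without AWS access.

```python
import json

# Stand-in for a DynamoDB table (hypothetical job IDs and fields).
RESULTS = {"job-1": {"status": "done", "targets": 3}}


def _response(status: int, payload: dict) -> dict:
    """Build a response in the Lambda proxy integration format."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # body must be a string
    }


def handler(event: dict, context: object) -> dict:
    """Look up a job result requested via GET /...?job_id=<id>."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    job_id = params.get("job_id")
    if job_id is None:
        return _response(400, {"error": "job_id is required"})
    item = RESULTS.get(job_id)
    if item is None:
        return _response(404, {"error": "unknown job"})
    return _response(200, item)


# Local invocation with a minimal fake event
print(handler({"queryStringParameters": {"job_id": "job-1"}}, None))
```

The function holds no state between calls, which is what lets the platform run as many copies in parallel as there are requests.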
Cloud Genomic Data Pipelines
• Problem #1 – Analyze
  • Find the mutated genes
  • Solution: Spark-based machine learning
• Problem #2 – Scan
  • Find the nucleotide (DNA letters)
  • Solution: Serverless
Genomics Big Data Pipelines
Dr. Denis Bauer & Lynn Langit