
Genomic-scale Data Pipelines - YOW! Conferences

Date post: 20-May-2020
Dr. Denis Bauer & Lynn Langit Genomic-scale Data Pipelines
Transcript
Page 1: Genomic-scale Data Pipelines - YOW! Conferences...Problem Data Candidate Technologies Build MVPs Assemble Pipeline 1. Analyze/GWAS vcf -> S3/Hadoop Ingest ETL Analyze Viz S3 -> Databricks

Dr. Denis Bauer & Lynn Langit

Genomic-scale Data Pipelines

Page 2:

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Transformational Bioinformatics Team

Denis Bauer, PhD

Oscar Luo, PhD

Rob Dunne, PhD

Piotr Szul

Team

Aidan O’Brien

Laurence Wilson, PhD

Adrian White

Andy Hindmarch

Collaborators

David Levy

News

Software

Dan Andrews

Kaitao Lai, PhD

Natalie Twine, PhD

Arash Bayat

John Hildebrandt

Mia Chapman

Ian Blair

Kelly Williams

Jules Damji

Gaetan Burgio

Lynn Langit

Page 3:

Big Data in 2025…Petabytes?

[Bar chart. Categories: Astronomy, Twitter, YouTube. Data labels: 1000, 17, 2000. Axis: 0 to 2500.]

Page 4:

Genome holds the blueprint for every cell

Page 5:

It affects looks, disease risk, and behavior

Page 6:

GENOMIC Big Data in 2025 - Exabytes

[Bar chart. Categories: Astronomy, Twitter, YouTube, Genomic. Data labels: 1, 0.17, 2, 20. Axis: 0 to 25.]

Page 7:

VCF Data

Page 8:


Genomic Research Workflow

https://www.projectmine.com/about/

Focus

Page 9:

Finding the disease gene(s)

Spot the variant that is…

• common amongst all affected

• absent in all unaffected*

* oversimplified

cases

controls

Gene1 Gene2
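
The (oversimplified) rule on this slide can be sketched as a set filter: keep a variant only if every case carries it and no control does. The sample names, variant names and carrier sets below are hypothetical toy data, not taken from the talk, and real association testing is statistical rather than all-or-nothing.

```python
from typing import Dict, List, Set

# Hypothetical toy data: which samples carry which variant (illustration only).
CARRIERS: Dict[str, Set[str]] = {
    "Gene1:variantA": {"case1", "case2", "case3"},
    "Gene1:variantB": {"case1", "control1"},
    "Gene2:variantC": {"case2", "case3", "control2"},
}


def candidate_variants(
    carriers: Dict[str, Set[str]], cases: Set[str], controls: Set[str]
) -> List[str]:
    """Return variants carried by all cases and by no controls."""
    hits = []
    for variant, samples in carriers.items():
        in_all_cases = cases <= samples                  # every affected sample carries it
        in_no_controls = samples.isdisjoint(controls)    # no unaffected sample carries it
        if in_all_cases and in_no_controls:
            hits.append(variant)
    return sorted(hits)


cases = {"case1", "case2", "case3"}
controls = {"control1", "control2"}
print(candidate_variants(CARRIERS, cases, controls))
```

With this toy data only the Gene1 variant shared by all three cases survives the filter.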

Page 10:

Cloud Data Pipeline Pattern

Problem

• Define business problem

Data

• Quality

• Quantity

• Location

Candidate Technologies

• Ingest

• Clean

• Analyze

• Predict

• Visualize

Build MVPs

• Iterate

• Learn

• Assemble

Assemble Pipeline

• Validate sections

• Test at scale

Page 11:

Cloud Data Pipeline Pattern

Candidate Technologies

• Ingest

• Clean

• Analyze

• Predict

• Visualize

Build MVPs

• Iterate

• Learn

• Assemble

Assemble Pipeline

• Validate sections

• Test at scale

Page 12:

Machine Learning Pipeline Pattern

Page 13:

What is CSIRO’s solution?

• For scale at reasonable cost: use Apache Hadoop

• For scale at speed: use Apache Spark

• For usability in bioinformatics: create a domain-specific ML API (library)

• For global use: leverage Cloud Pipeline Patterns

Page 14:

GWAS Analysis with Variant-Spark

On-premise Cluster with Apache Hadoop & Spark

Genomics Analysts

CSIRO corporate data center

Page 15:

Why Apache Spark?

Page 16:

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

Cited

4

Page 17:

Supervised ML: Wide Random Forests

Page 18:

Solving Important Questions…Cancer genomics?

Page 19:

DEMO: Who is a Hipster?

Page 20:

VariantSpark & Databricks Notebook


Databricks Notebook

Page 21:

Performance – Faster and More Accurate

VariantSpark is the only method to scale to 100% of the genome

[Chart axes: Accuracy (low to high), Speed (low to high)]

Page 22:

Scaling to 50 M variables and 10 K samples


100K trees: 5 – 50h

AWS: ~$215.50

100K trees: 200 – 2000h

AWS: ~$8620.00

• YARN cluster

• 12 workers

• 16 x Intel Xeon [email protected] CPU

• 128 GB of RAM

• Spark 1.6.1 on YARN

• 128 executors

• 6 GB / executor (0.75 TB)

• Synthetic dataset

Whole Genome Range

GWAS Range
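
The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; the transcript does not preserve which figure pair belongs to which range label, so the pairs are called A and B here, and dividing the quoted cost by the upper-bound runtime is an assumption about how the cost was derived, not something the slide states.

```python
# Figures quoted on the slide. Labels "A" and "B" are placeholders because the
# transcript does not say which pair belongs to which range.
runs = {
    "A": {"hours": (5, 50), "aws_cost_usd": 215.50},
    "B": {"hours": (200, 2000), "aws_cost_usd": 8620.00},
}


def implied_hourly_rate(run: dict) -> float:
    """Quoted cost divided by the upper-bound runtime (an assumption)."""
    return run["aws_cost_usd"] / run["hours"][1]


# How much more expensive and how much slower is B than A?
cost_ratio = runs["B"]["aws_cost_usd"] / runs["A"]["aws_cost_usd"]
time_ratio = runs["B"]["hours"][1] / runs["A"]["hours"][1]

# Total executor memory: 128 executors at 6 GB each, expressed in TB (1024 GB).
executor_memory_tb = 128 * 6 / 1024

print(round(cost_ratio, 2), round(time_ratio, 2))
for name, run in runs.items():
    print(name, round(implied_hourly_rate(run), 2))
print(executor_memory_tb)
```

Both ratios come out at about 40 and both pairs imply the same hourly rate, which suggests (but does not confirm) that the two cost estimates were computed from the upper-bound runtimes at a single cluster price. The executor memory total agrees with the 0.75 TB quoted in the cluster description.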

Page 23:

Try it out: VariantSpark Notebook

https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html

Page 24:

Future Directions for VariantSpark RF

Additional feature types

Unordered Categorical

For Scores -Continuous

Different feature ranges

Small and Big Inputs

For Gene Expression analysis

Page 25:

Genome editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy

Editing does not work every time, e.g. only 7 in 10 embryos were mutation-free

Aim: Develop a computational guidance framework to enable successful edits the first time, every time

Ma et al. Nature 2017 *

* Controversy around the paper – stay tuned

Page 26:

Make process parallel and scalable

• SPEED: Each search can be broken down into parallel tasks to then only take seconds

• SCALE: Researchers might want to search the target for one gene or 100,000
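
The idea of breaking a search into independent parallel tasks can be sketched locally with a thread pool. This is only a stand-in: the talk runs such tasks on serverless functions, and the motif search, gene names and sequences below are hypothetical toy examples rather than the actual GT-Scan2 scoring.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def find_sites(sequence: str, motif: str = "GG") -> List[int]:
    """Return 0-based start positions of motif, overlapping matches included."""
    return [
        i
        for i in range(len(sequence) - len(motif) + 1)
        if sequence.startswith(motif, i)
    ]


def search_all(genes: Dict[str, str], max_workers: int = 4) -> Dict[str, List[int]]:
    """Run one independent search task per gene and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = list(pool.map(find_sites, genes.values()))
    return dict(zip(genes.keys(), results))


# Hypothetical toy input: one gene or 100,000 would use the same call.
genes = {"geneA": "ACGGTGGA", "geneB": "TTTT", "geneC": "GGG"}
print(search_all(genes))
```

Because each gene is searched independently, the same function works unchanged whether the input holds one gene or many; only the number of workers needs to grow.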

Scalability + Agility =

Page 27:

One of the first Serverless Applications in Research


Featured in

This is My Architecture

Page 28:

GT-Scan2

Page 29:

Considering Services for GT-Scan2

• Use AWS Step Functions

  • Simplify workflow

  • Simplify task timeouts

  • Simplify task failures

• Must evaluate costs

  • SNS vs. Step Functions
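
To illustrate how Step Functions can take over workflow ordering, timeouts and failure handling, here is a minimal state machine definition in Amazon States Language, built as a Python dictionary. The state names and the placeholder Lambda ARNs are hypothetical and do not describe the real GT-Scan2 workflow; `TimeoutSeconds`, `Retry` and `Catch` are standard fields of a Task state.

```python
import json

# Hypothetical two-step workflow (illustration only, not GT-Scan2's definition).
definition = {
    "Comment": "Hypothetical search workflow",
    "StartAt": "FindTargets",
    "States": {
        "FindTargets": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:find-targets",
            "TimeoutSeconds": 60,  # task timeout handled by the service
            "Retry": [{
                "ErrorEquals": ["States.Timeout"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            # Any remaining error is routed to a dedicated failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SearchFailed"}],
            "Next": "ScoreTargets",
        },
        "ScoreTargets": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:score-targets",
            "TimeoutSeconds": 60,
            "End": True,
        },
        "SearchFailed": {
            "Type": "Fail",
            "Error": "SearchFailed",
            "Cause": "A search task failed after retries",
        },
    },
}


def dangling_targets(defn: dict) -> list:
    """Return transition targets that do not name an existing state."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


print(dangling_targets(definition))
print(json.dumps(definition, indent=2)[:80])
```

The small `dangling_targets` helper is a local sanity check written for this sketch, not part of any AWS SDK. The cost comparison with SNS mentioned on the slide still has to be made separately, since Step Functions and SNS are priced differently.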

Page 30:

Cloud Data Pipeline Pattern

Problem -> Data -> Candidate Technologies -> Build MVPs -> Assemble Pipeline

1. Analyze/GWAS

• Data: vcf -> S3/Hadoop

• Ingest: S3 -> Databricks DBFS

• ETL: Apache Spark

• Analyze: Variant-Spark ML

• Viz: Notebook (SQL, R or Python)

• Pipeline: Spark

2. Search/GTScan2

• Data: S3/fastq -> DynamoDB; S3/fastq, bed

• Ingest: S3

• ETL: Lambda

• Analyze: Lambda

• Viz: Lambda/API Gateway

• Pipeline: Serverless

Page 31:

Spark Pipeline Pattern


Jupyter Notebook

Page 32:

Serverless Architecture Pattern

Lambda function 1

Lambda function 2

Lambda function 3

buckets with objects

DynamoDB

API Gateway

Users

Step Functions
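
One link in this pattern, a Lambda function triggered by an object landing in a bucket, can be sketched as below. The `handler(event, context)` signature and the `Records[].s3.bucket.name` / `Records[].s3.object.key` fields follow the standard S3 event notification shape; the job item layout, bucket name and key are hypothetical, and a real function would persist the items to DynamoDB through the AWS SDK rather than return them.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Turn each uploaded object in an S3 event into a queued job item."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Hypothetical item shape; a real function would write this to DynamoDB.
        jobs.append({"job_id": f"{bucket}/{key}", "status": "queued"})
    return {"statusCode": 200, "jobs": jobs}


# Hypothetical sample event for local testing (no AWS access required).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "uploads/gene+1.fastq"}}}
    ]
}
print(handler(sample_event, None))
```

Keeping the handler free of SDK calls makes it easy to test locally; the storage and API layers from the diagram can then be wired in around it.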

Page 33:

Cloud Genomic Data Pipelines

• Problem #1 – Analyze

  • Find the mutated genes

  • Solution: Spark-based machine learning

• Problem #2 – Scan

  • Find the nucleotide (DNA letters)

  • Solution: Serverless

Page 34:

Genomics Big Data Pipelines


Dr. Denis Bauer & Lynn Langit

