#Code2Cure: A field guide for software engineers on their journey to the world of genomics.

Date post: 19-Jul-2015
Upload: amirhossein-kiani

#Code2Cure: Engineering Genomics

@mirkiani

A field guide for software engineers on their journey to the world of genomics.

Amirhossein Kiani

Sr. Lead Software Engineer

[email protected]

Image courtesy of http://circos.ca

DISCLAIMER: The views expressed in this talk are mine alone and not those of my employer.

Bina products are for Research Use Only. Not for use in diagnostic procedures.

Also, I’m a computer scientist by training, trying to help those with a similar background learn about the field of genomics. The scientific concepts in this talk have therefore been heavily simplified.

https://www.youtube.com/watch?v=G1ZLyGW8rKY


www.bina.com

Why Genomics?

Past: $3,000,000,000 and 13 years

http://en.wikipedia.org/wiki/Human_Genome_Project

Present: $1,000 and 24 hours

Future


Why Genomics?

Some things we could do with genomics:

• Carrier Screening

• Prenatal Screening

• Newborn Screening

• Inherited Disease

• Infectious Disease

• Cancer Diagnostics

• Microbiome

• Personalized Medicine


But I have no genomics background!

It’s ok.


My personal story…


Now

Then


What is a cell, what is DNA?

http://en.wikipedia.org/wiki/Cell_%28biology%29

http://en.wikipedia.org/wiki/DNA


Image courtesy of Pinterest

Image courtesy of Tumblr


Crash Course on Genomics

The study of the structure, function, and evolution of genomes.

http://en.wikipedia.org/wiki/Genomics

http://en.wikipedia.org/wiki/RNA

http://en.wikipedia.org/wiki/Protein

DNA → RNA → Protein → You!


How do we figure out what’s in DNA?

Like everything else, we turn the analog signal into a digital one and then analyze it.

http://en.wikipedia.org/wiki/DNA_sequencing

http://en.wikipedia.org/wiki/FASTQ_format

Illumina, Ion Torrent, Genia, …

Primary Analysis

FASTQ Format
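A FASTQ file stores each read as four lines: an `@`-prefixed name, the base sequence, a `+` separator, and one quality character per base. The sketch below is a minimal reader under that assumption; the names `FastqRecord` and `parse_fastq` are made up for illustration, and the Phred+33 quality offset is assumed rather than stated in the talk.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # Phred scores, one per base


def parse_fastq(lines: Iterable[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Each quality character encodes a Phred score as an ASCII offset.
        yield FastqRecord(header[1:], seq, [ord(c) - phred_offset for c in qual])


example = """@read1
ACGT
+
II5!
""".splitlines()

records = list(parse_fastq(example))
```

A higher Phred score means the sequencer is more confident in that base call, which is why downstream tools weigh bases by quality.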


Image courtesy of PersonalGenomes.org


RAW Data to Variants (Secondary Analysis)

Step 1. Alignment

http://en.wikipedia.org/wiki/DNA_sequencing

http://en.wikipedia.org/wiki/FASTQ_format


Image courtesy of Wall Woodworks

Image courtesy of Wallpaper Up


From “Raw” DNA to “Variants” (Secondary Analysis)

Step 1. Short-Read Sequence Alignment

http://en.wikipedia.org/wiki/Reference_genome

http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism

http://en.wikipedia.org/wiki/Indel

http://en.wikipedia.org/wiki/Structural_variation

AACACACCCAAGGGGGAAACTTTGGTCCACCCAAGGGGGAAACCCAAGGGGGAAACTTTG

Reference Genome (~3B bases)

ACTTTGGTCCACCCAAGG

AAGGGGGACACCCAAGGACACCC__GGGGGAAACT

GGACACCCAAGGGGGAA

ACCCAAGGGGGACACCC

ACCC__GGGGGAAACTTTG

AACACACCC__GGGGGAA

Coverage

Deletion Single Nucleotide Polymorphism
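The idea on this slide, stacking aligned reads over the reference and looking at each column, can be sketched as a toy pileup. The reference and reads below are invented for illustration (they are not the sequences on the slide), `_` marks a deleted base as in the slide's notation, and the majority-vote rule is a simplification; real variant callers use statistical models.

```python
from collections import Counter
from typing import Dict, List, Tuple

reference = "ACGTACGTAC"

# (start position on the reference, aligned read); toy data.
aligned_reads: List[Tuple[int, str]] = [
    (0, "ACGTTC"),
    (2, "GTTC_T"),
    (3, "TTC_TAC"),
    (5, "C_TAC"),
]


def pileup(reads: List[Tuple[int, str]]) -> Dict[int, Counter]:
    """Count the bases observed at every reference position."""
    columns: Dict[int, Counter] = {}
    for start, read in reads:
        for offset, base in enumerate(read):
            columns.setdefault(start + offset, Counter())[base] += 1
    return columns


def call_variants(ref: str, reads: List[Tuple[int, str]], min_fraction: float = 0.5):
    """Report positions where most reads disagree with the reference."""
    calls = []
    for pos, counts in sorted(pileup(reads).items()):
        base, count = counts.most_common(1)[0]
        depth = sum(counts.values())  # coverage at this position
        if base != ref[pos] and count / depth > min_fraction:
            kind = "deletion" if base == "_" else "SNP"
            calls.append((pos, kind, ref[pos], base, depth))
    return calls


calls = call_variants(reference, aligned_reads)
```

Here `depth` is the coverage from the slide: the more reads stacked over a position, the more evidence there is for or against a variant.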


From “Raw” DNA to “Variants” (Secondary Analysis)

• Burrows-Wheeler Aligner (BWA)

• Uses Burrows-Wheeler transform (also used in bzip2)

• Uses Smith-Waterman algorithm

• Written in C

• Uses ~4GB memory for human genome
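For readers who have not met the Smith-Waterman algorithm named in the list above, here is a minimal sketch of its scoring recurrence for local alignment. The scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions for illustration; BWA's own implementation is far more optimized and uses different parameters.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b."""
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The dynamic-programming table costs time proportional to the product of the two lengths, which is why aligners only run it on short candidate regions rather than on the whole genome.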

http://bio-bwa.sourceforge.net

http://bioinformatics.oxfordjournals.org/content/25/14/1754.full.pdf+html

$ bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

Example
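The Burrows-Wheeler transform behind BWA's index can be illustrated with a naive version that sorts all rotations of the text. This is only a teaching sketch: it takes quadratic memory, whereas real tools build the transform from a suffix array, and the function names here are made up.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


encoded = bwt("banana")
decoded = inverse_bwt(bwt("GATTACA"))
```

The transform tends to group identical characters together, which helps compression, and it is reversible, which is what lets an aligner search the reference through a compact index.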


From “Raw” DNA to “Variants” (Secondary Analysis)

Alignment

FASTQ → SAM

Convert to Binary

BGZF compression (samtools)

BAM File

BAM File Index

http://samtools.github.io/hts-specs/SAMv1.pdf

http://samtools.github.io
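SAM is a tab-separated text format with eleven mandatory fields per alignment, and BAM is its compressed binary equivalent. The sketch below parses one SAM line by hand to show what those fields are; `SamRecord` and `parse_sam_line` are hypothetical names, the example read is invented, and in practice a library or samtools would do this.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. "4M"
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10])


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)


record = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
```

The BAM index is what makes a sorted file randomly accessible, so a viewer can jump to one region without reading every record.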


From “Raw” DNA to “Variants” (Secondary Analysis)

BAM File

BAM File Index

http://www.broadinstitute.org/igv

https://github.com/ekg/freebayes

http://arxiv.org/abs/1207.3907

https://www.broadinstitute.org/gatk

Visualize

Variant Calling

$ freebayes -f ref.fa aln.bam > var.vcf

Example

Integrative Genomics Viewer (IGV)


From “Raw” DNA to “Variants” (Secondary Analysis)


… and here are your variants (VCF file)!

http://samtools.github.io/hts-specs/VCFv4.2.pdf
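A VCF file is tab-separated text: `##` meta lines, one `#CHROM` header line, then one line per variant site starting with CHROM, POS, ID, REF and ALT. The following sketch reads those first columns and labels each variant by allele length. The example lines are invented, the names `Variant`, `classify` and `parse_vcf` are made up, and symbolic alleles such as `<DEL>` are not handled.

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    kind: str


def classify(ref: str, alt: str) -> str:
    """Label a variant by comparing reference and alternate allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt):
        return "deletion"
    if len(ref) < len(alt):
        return "insertion"
    return "MNP"  # same length, more than one base


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the header, and blanks
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one site may list several alternates
            yield Variant(chrom, int(pos), ref, alt, classify(ref, alt))


example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t101\t.\tA\tT\t50\tPASS\t.",
    "chr1\t205\t.\tCAG\tC\t48\tPASS\t.",
]

variants = list(parse_vcf(example_vcf))
```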


What do we do with variant calls then?

Zooming in on the Central Dogma of Molecular Biology:

• There is redundancy in the genetic code: several codons encode the same amino acid.

• But a mutation can still change the protein that is produced.


Image courtesy of Wikipedia
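The redundancy point above can be made concrete with the standard genetic code: 64 codons map to 20 amino acids plus stop, so some single-base changes leave the protein untouched while others alter it. This sketch builds the standard codon table and classifies a substitution; the function name and labels are illustrative assumptions, not any tool's output.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base change inside one codon (position is 0, 1 or 2)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "stop gained"
    return "non-synonymous"
```

For example, GAA and GAG both encode glutamate (E), so a change between them is synonymous, while GAG to GTG swaps glutamate for valine (V).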


What do we do with variant calls then?

Annotation & Interpretation

• Functional annotation: figure out whether the mutation is likely to be harmful (use SnpEff)

• Synonymous

• Non-Synonymous

• Frame-shift

• …

• Put in the context of existing findings

• dbSNP

• ClinVar

• COSMIC

• ESP

• 1000 Genomes

• …

http://snpeff.sourceforge.net

http://www.ncbi.nlm.nih.gov/SNP
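Putting a variant "in the context of existing findings" is, at its simplest, a join between your variant calls and static databases keyed by position and alleles. The sketch below shows that join with two in-memory dictionaries. Both tables and every value in them are invented placeholders; they do not reflect the real schema or contents of dbSNP, ClinVar, or any other resource listed above.

```python
from typing import Dict, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

# Hypothetical stand-ins for static annotation databases; all entries invented.
EXAMPLE_CLINICAL_DB: Dict[VariantKey, dict] = {
    ("chr1", 101, "A", "T"): {"clinical_significance": "example: benign"},
}
EXAMPLE_FREQUENCY_DB: Dict[VariantKey, dict] = {
    ("chr1", 101, "A", "T"): {"population_frequency": 0.12},
    ("chr1", 205, "CAG", "C"): {"population_frequency": 0.001},
}


def annotate(variant: VariantKey, databases: List[Dict[VariantKey, dict]]) -> dict:
    """Merge whatever each database knows about this exact variant."""
    merged: dict = {}
    for database in databases:
        merged.update(database.get(variant, {}))
    return merged


databases = [EXAMPLE_CLINICAL_DB, EXAMPLE_FREQUENCY_DB]
known = annotate(("chr1", 101, "A", "T"), databases)
novel = annotate(("chr2", 1, "G", "C"), databases)
```

A variant that matches nothing comes back with an empty annotation, which is itself informative: it has not been reported in the sources consulted.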


CASE STUDY:


Statistics

Data Analytics

Bioinformatics

Genomics

Big Data Technologies

Compute and Data Science


Bringing three disciplines together


Case Study: Bina GMS


Sequencing → 2º Analysis → 3º Analysis → Interpretation

Meaningful Results & Clinical Relevance

20+ databases with 140+ annotations:

HGMD // PGMD // ClinVar

COSMIC // dbNSFP // TRANSFAC

1000 Genomes and more.

Tools & Workflows for:

WGS // WES // RNA-seq

Somatic Mutations

Multi-sample

Gene Panels

Bina Products are for Research Use Only


Bina RAVE Architecture (1)


Secure REST Interface

Portal Server(s)

Portal Backend (Distributed)

• Workflow Definition

• Templates

• QC/Monitoring

• System Management/Updates

Task Dependency Graphs

Distributed Workflow Orchestration

Secure Push Interface

Workflow Generation

Interactive UI // Command Line SDK

Executor

Dynamic Scheduling

Local Storage

Execution Engine

Executor Nodes / VMs

Network Storage – Input/Output Data

Static Scheduling

Workflows

Tools

Commands


Bina RAVE Architecture (2)

Workflows (DNA, RNA, ..)

Tools (BWA, GATK, SVs)

Services (Logging, Storage, Caching, Streaming)

Commands (Samtools, GATK, URL, ..)

Genome-aware – Workflow Generation

Distributed Coordination

Task Graph

JSON Request (UI/CMD/SDK)

Nodes / VMs

Executor

Dynamic Scheduling

Graph

Triggers

Updates

Genome-aware – Distributed Execution Framework

Syncing all Nodes

Dependency Graph

Task Status

Network storage – Input/output data

Local storage

• Dependency Aware Execution

• Locality Aware Execution (Caching)

• Streaming Through “Engines”

• In-Memory Computation
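Dependency-aware execution, the first item in the list above, boils down to running a task only once everything it depends on has finished. The sketch below shows the idea with Python's standard `graphlib` module (Python 3.9+). The task names and the `run_workflow` helper are made up for illustration and say nothing about how Bina's engine is actually implemented.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Each task maps to the set of tasks it depends on (hypothetical pipeline).
workflow: Dict[str, Set[str]] = {
    "align": set(),
    "sort_and_index": {"align"},
    "call_variants": {"sort_and_index"},
    "call_structural_variants": {"sort_and_index"},
    "annotate": {"call_variants", "call_structural_variants"},
}


def run_workflow(graph: Dict[str, Set[str]],
                 run_task: Callable[[str], None]) -> List[List[str]]:
    """Run tasks in dependency order; return the waves that could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # also raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        for task in ready:
            run_task(task)
            sorter.done(task)
        waves.append(ready)
    return waves


executed: List[str] = []
waves = run_workflow(workflow, executed.append)
```

Tasks in the same wave have no dependencies on each other, so a distributed scheduler is free to place them on different nodes.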

Output (VCF, SV)

Input (BAM, FASTQ)

Static Scheduling


Bina AAiM Architecture

Annotation and Indexing Engine

Input VCF

UI/CMD

Clinical Annotations

Genomic Context

Functional Impact Prediction

Population Frequency

Distributed Execution Framework

Annotation (Join static DBs)

Indexing & Functional Filters

MapReduce Jobs

Analytics Engine

NoSQL Data Store

Indices

Metadata Store

Tumor/Normal

Pedigree

Queries, Filters, Variant Sets, Reports

Bina Secondary

Cohort Study

Proband


What next?

http://www.genomicsengland.co.uk

http://www.personalgenomes.org

• Apply this process to different domains and applications

• Come up with ways of ranking variants

• Keep learning from data

• Sequence everyone!

• Genomics England 100,000 Genomes Project

• Personal Genome Project

• Decrease cost

• Increase accuracy

• Make the technology faster and more usable!

Map of sequencers around the globe: http://omicsmaps.com


Challenges in Genomics

• Accuracy

• Gold standard? Which tool is best? There are so many!

• NIST, DREAM Challenges

• Need to speak the same language… interoperability

• Global Alliance

• APIs, formats, metadata, …

• Regulations

• HIPAA, CLIA: security, accuracy, anonymity and encryption

• Scalability

• Storage

• Need terabytes

• Each genome could be up to 1 TB

• Computation

• We still pretty much have no idea what most of DNA is doing…

• Can’t run on a single machine; need to scale to many nodes

• Need to leverage cloud technologies

• Provenance and auditability

• Importance of usability

• Different personas

• Errors are very expensive (life and death)

• Better visualization → faster discovery → faster cure
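To get a feel for the storage bullets above, here is a back-of-the-envelope estimate of raw read data for one genome. The 30x coverage and the two bytes per sequenced base (one for the base call, one for its quality score) are my assumptions for illustration, not figures from the talk; aligned files, intermediate outputs and deeper sequencing add to the total.

```python
GENOME_BASES = 3_000_000_000  # ~3B bases, the reference genome size quoted earlier


def raw_read_bytes(coverage: int, bytes_per_base: int = 2) -> int:
    """Rough uncompressed size of the reads for one genome."""
    return GENOME_BASES * coverage * bytes_per_base


def to_terabytes(num_bytes: int) -> float:
    return num_bytes / 1e12


one_genome = raw_read_bytes(coverage=30)  # assumed 30x coverage
cohort = 1000 * one_genome                # a hypothetical 1,000-genome study
```

Under these assumptions one genome is about 0.18 TB of raw reads and a 1,000-genome cohort is about 180 TB, before any derived files are counted.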


Why should software engineers move to genomics?

Because genomics needs you, and you need genomics.

Work on something that matters! (#Code2Cure)

Things that SWEs do very well:

• Automation

• Elegant solutions for complex problems

• Enabling non-savvy users by making the technology robust and accessible

• Scale

• Optimization

• Building production-grade platforms

• Tested

• Robust

• Secure

THESE ARE ALL NEEDED IN GENOMICS YESTERDAY!


Image courtesy of http://silvsoul.blogspot.com


Open projects/resources to check out/contribute to

Projects/Conferences

• Galaxy -- http://galaxyproject.org

• Arvados -- https://arvados.org

• Open Bio Conference -- http://www.open-bio.org

• BioVis -- http://www.biovis.net

• Biopython -- http://biopython.org

• Global Alliance for Genomics and Health -- http://ga4gh.org

• Rosalind Project -- http://rosalind.info

Blogs/Websites

• http://bcb.io

• http://nextgenseek.com/

• http://ngs-expert.com/

• http://seqanswers.com/

• http://core-genomics.blogspot.com

• http://www.genomesunzipped.org

• http://genomeweb.com


Thank you. And I hope you consider moving to genomics!

http://info.bina.com/code2cure-community

@mirkiani

Amirhossein Kiani

Sr. Lead Software Engineer

[email protected]

