Date post: | 19-Jul-2015 |
Category: |
Data & Analytics |
Upload: | amirhossein-kiani |
#Code2Cure: Engineering Genomics
@mirkiani
A field guide for software engineers on their journey to the world of genomics.
Amirhossein Kiani
Sr. Lead Software Engineer
Image courtesy of http://circos.ca
DISCLAIMER: The views expressed in this talk are mine alone and not
those of my employer.
Bina products are for Research Use Only. Not for use in diagnostic
procedures.
Also, I’m a computer scientist by training, trying to help people with a
similar background learn about the field of genomics. The scientific
concepts in this talk are therefore heavily simplified.
www.bina.com
Why Genomics?
Past: $3,000,000,000, 13 years
http://en.wikipedia.org/wiki/Human_Genome_Project
Present: $1000, 24 hours
Future
Why Genomics?
Some things we could do with genomics:
• Carrier Screening
• Prenatal Screening
• Newborn Screening
• Inherited Disease
• Infectious Disease
• Cancer Diagnostics
• Microbiome
• Personalized Medicine
What is a cell, what is DNA?
http://en.wikipedia.org/wiki/Cell_%28biology%29
http://en.wikipedia.org/wiki/DNA
Image courtesy of Pinterest
Image courtesy of Tumblr
Crash Course on Genomics
The field of studying the structure of genomes.
http://en.wikipedia.org/wiki/Genomics
http://en.wikipedia.org/wiki/RNA
http://en.wikipedia.org/wiki/Protein
DNA → RNA → Protein → You!
How do we figure out what’s in DNA?
As with everything else, we convert the analog signal to digital and then
analyze it.
http://en.wikipedia.org/wiki/DNA_sequencing
http://en.wikipedia.org/wiki/FASTQ_format
Illumina, Ion Torrent, Genia, …
Primary Analysis
FASTQ Format
Image courtesy of PersonalGenomes.org
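To make the FASTQ format concrete, here is a minimal sketch of a reader for the common four-line-per-record layout. It assumes Phred+33 quality encoding and unwrapped sequence lines; the names `FastqRecord` and `parse_fastq` are made up for this illustration and are not part of any tool mentioned in the talk.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumption: Phred+33 encoding, so '!' means quality 0 and 'I' means 40.
        yield FastqRecord(header[1:], sequence, [ord(c) - 33 for c in quality])


# A tiny in-memory example instead of a real sequencer output file.
records = list(parse_fastq(io.StringIO("@read1\nACGT\n+\nII5!\n")))
```

Each base comes with its own quality score, which is what later steps use to decide how much to trust a mismatch.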
RAW Data to Variants (Secondary Analysis)
Step 1. Alignment
http://en.wikipedia.org/wiki/DNA_sequencing
http://en.wikipedia.org/wiki/FASTQ_format
Image courtesy of Wall Woodworks
Image courtesy of Wallpaper Up
From “Raw” DNA to “Variants” (Secondary Analysis)
Step 1. Short-Read Sequence Alignment
http://en.wikipedia.org/wiki/Reference_genome
http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism
http://en.wikipedia.org/wiki/Indel
http://en.wikipedia.org/wiki/Structural_variation
AACACACCCAAGGGGGAAACTTTGGTCCACCCAAGGGGGAAACCCAAGGGGGAAACTTTG
Reference Genome (~3B bases)
ACTTTGGTCCACCCAAGG
AAGGGGGACACCCAAGGACACCC__GGGGGAAACT
GGACACCCAAGGGGGAA
ACCCAAGGGGGACACCC
ACCC__GGGGGAAACTTTG
AACACACCC__GGGGGAA
Coverage
Deletion Single Nucleotide Polymorphism
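The "coverage" label in the figure above is the number of reads stacked over each reference position. A minimal sketch, assuming each aligned read is reduced to a hypothetical `(start, length)` pair with 0-based starts and no gaps:

```python
from typing import List, Tuple


def coverage(reference_length: int, alignments: List[Tuple[int, int]]) -> List[int]:
    """Return per-base read depth for gapless alignments given as (start, length)."""
    depth = [0] * reference_length
    for start, length in alignments:
        # Clip reads that run past the end of the reference.
        for position in range(start, min(start + length, reference_length)):
            depth[position] += 1
    return depth


# Three short reads over a 10-base toy reference.
depth = coverage(10, [(0, 4), (2, 5), (3, 3)])
```

Higher depth at a position means more independent observations of that base, which is why variant callers care about it.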
From “Raw” DNA to “Variants” (Secondary Analysis)
• Burrows-Wheeler Aligner (BWA)
• Uses Burrows-Wheeler transform (also used in bzip2)
• Uses Smith-Waterman algorithm
• Written in C
• Uses ~4GB memory for human genome
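For intuition about the Burrows-Wheeler transform mentioned above, here is a naive sketch that sorts all rotations of the text. This is only the textbook definition; real aligners build a compressed index over the transform rather than materializing rotations, and the function names here are invented for the example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic memory, toy sizes only)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and re-sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


encoded = bwt("banana")
```

The transform groups similar contexts together, which is what makes it useful both for compression and for fast substring search.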
http://bio-bwa.sourceforge.net
http://bioinformatics.oxfordjournals.org/content/25/14/1754.full.pdf+html
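The Smith-Waterman algorithm named in the list above finds the best-scoring local alignment between two sequences using dynamic programming. The sketch below computes only the score, with a linear gap penalty; the scoring values are illustrative assumptions, not BWA's defaults.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


score = smith_waterman_score("TTACGTTT", "GGACGTGG")
```

Only two rows of the matrix are kept, so memory grows with the shorter dimension; recovering the alignment itself would require keeping the full matrix or traceback pointers.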
Example:
$ bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
From “Raw” DNA to “Variants” (Secondary Analysis)
Alignment
FASTQ → SAM
Convert to Binary
BGZF compression (samtools)
BAM File
BAM File Index
http://samtools.github.io/hts-specs/SAMv1.pdf
http://samtools.github.io
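A SAM alignment line is tab-separated text with eleven mandatory fields, which is what BAM stores in binary form. The following sketch parses the fields most often needed and decodes two of the flag bits; `SamRecord` and `parse_sam_line` are names made up for this example, and optional tags are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment operations, e.g. "4M"
    seq: str
    qual: str

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse_strand(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a '@' header line) of a SAM file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


record = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
```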
From “Raw” DNA to “Variants” (Secondary Analysis)
BAM File
BAM File Index
http://www.broadinstitute.org/igv
https://github.com/ekg/freebayes
http://arxiv.org/abs/1207.3907
https://www.broadinstitute.org/gatk
Visualize: Integrative Genomics Viewer (IGV)
Variant Calling
Example:
$ freebayes -f ref.fa aln.bam > var.vcf
From “Raw” DNA to “Variants” (Secondary Analysis)
… and here are your variants (VCF file)!
http://samtools.github.io/hts-specs/VCFv4.2.pdf
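To give a feel for what a VCF file contains, here is a minimal sketch that reads the first eight tab-separated columns of a data line and labels each alternate allele by type. It is a simplification: symbolic alleles, genotype columns, and INFO parsing are left out, and the function names are invented for the example.

```python
from typing import Dict, List


def classify_allele(ref: str, alt: str) -> str:
    """Label a REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNP"  # same length, more than one base


def parse_vcf_line(line: str) -> List[Dict]:
    """Return one dict per ALT allele; header lines ('#...') yield nothing."""
    if line.startswith("#"):
        return []
    chrom, pos, _id, ref, alt, _qual, _filter, _info = line.rstrip("\n").split("\t")[:8]
    return [
        {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": allele,
         "type": classify_allele(ref, allele)}
        for allele in alt.split(",")
    ]


snp = parse_vcf_line("chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30")
deletion = parse_vcf_line("chr1\t200\t.\tCAG\tC\t40\tPASS\tDP=25")
```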
What do we do with variant calls then?
Zooming in on the Central Dogma of Molecular Biology:
• The genetic code is redundant: several codons encode the same amino acid.
• But a mutation can still change the encoded protein.
Image courtesy of Wikipedia
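The redundancy point can be illustrated with the standard codon table: a single-base change sometimes leaves the amino acid unchanged and sometimes does not. This is a sketch on coding-strand DNA codons; the helper names are made up, and real annotation also has to account for strand, transcripts, and splicing.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each of the three codon positions.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate coding-strand DNA into one-letter amino acids ('*' marks a stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def classify_substitution(codon: str, index: int, new_base: str) -> str:
    """Say whether changing one base of a codon changes the amino acid."""
    mutated = codon[:index] + new_base + codon[index + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "non-synonymous"


silent = classify_substitution("GCT", 2, "C")    # GCT and GCC both encode alanine
changing = classify_substitution("GAG", 1, "T")  # glutamate becomes valine
```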
What do we do with variant calls then?
Annotation & Interpretation
• Functional annotation: figure out whether the mutation is likely to be harmful (e.g., with SnpEff)
• Synonymous
• Non-Synonymous
• Frame-shift
• …
• Put in the context of existing findings
• dbSNP
• ClinVar
• COSMIC
• ESP
• 1000 Genomes
• …
http://snpeff.sourceforge.net
http://www.ncbi.nlm.nih.gov/SNP
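Putting a variant "in the context of existing findings" is, at its simplest, a join on the variant's coordinates and alleles. The sketch below uses a tiny in-memory dictionary as a stand-in for a database such as dbSNP or ClinVar; the entries are invented for illustration and are not real records. It also flags frame-shifts, assuming the variant lies inside a coding region.

```python
from typing import Dict, List, Optional, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)

# Hypothetical stand-in for a curated variant database; not real data.
KNOWN_VARIANTS: Dict[Variant, Dict[str, str]] = {
    ("chr1", 1000, "A", "G"): {"source": "example-db", "significance": "benign"},
    ("chr2", 5000, "CT", "C"): {"source": "example-db", "significance": "pathogenic"},
}


def is_frameshift(ref: str, alt: str) -> bool:
    """An indel shifts the reading frame when its length is not a multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0


def annotate(variants: List[Variant],
             known: Dict[Variant, Dict[str, str]] = KNOWN_VARIANTS) -> List[Dict]:
    """Attach a frame-shift flag and any matching database entry to each variant."""
    annotated = []
    for variant in variants:
        _chrom, _pos, ref, alt = variant
        entry: Optional[Dict[str, str]] = known.get(variant)
        annotated.append({"variant": variant,
                          "frameshift": is_frameshift(ref, alt),
                          "known": entry})
    return annotated


results = annotate([("chr2", 5000, "CT", "C"), ("chr3", 42, "A", "T")])
```

A variant with no database match is not necessarily harmless; it may simply never have been observed or studied.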
Statistics
Data Analytics
Bioinformatics
Genomics
Big Data Technologies
Compute and Data Science
Bringing three disciplines together
Case Study: Bina GMS
Sequencing 2º Analysis 3º Analysis Interpretation
Meaningful Results
& Clinical Relevance
20+ DBs including
140+ annotations:
HGMD // PGMD // ClinVar
COSMIC // dbNSFP // TRANSFAC
1000 Genomes and more.
Tools & Workflows for:
WGS // WES // RNAseq
Somatic Mutations
Multi sample
Gene Panels
Bina Products are for Research Use Only
Bina RAVE Architecture (1)
Secure REST Interface
Portal Server(s)
Portal Backend (Distributed)
• Workflow Definition
• Templates
• QC/Monitoring
• System Management/Updates
Task Dependency Graphs
Distributed
Workflow
Orchestration
Secure Push
Interface
Workflow Generation
Interactive UI // Command Line SDK
Executor
Dynamic
Scheduling
Local Storage
Execution Engine
Executor Nodes / VMs
Network Storage – Input/Output Data
Static
Scheduling
Workflows
Tools
Commands
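The task dependency graphs in this architecture can be illustrated with a topological sort: tasks whose inputs are ready can be handed to executors together. This sketch uses Python's standard `graphlib` module (Python 3.9+); the task names are hypothetical, and nothing here reflects how Bina's own scheduler is implemented.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical two-sample workflow: each task maps to the tasks it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "align_sample1": set(),
    "align_sample2": set(),
    "sort_sample1": {"align_sample1"},
    "sort_sample2": {"align_sample2"},
    "joint_call": {"sort_sample1", "sort_sample2"},
    "annotate": {"joint_call"},
}


def execution_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into batches; every task in a batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


batches = execution_batches(DEPENDENCIES)
```

A dynamic scheduler would call `done` as individual tasks finish instead of waiting for a whole batch, which keeps nodes busy when task durations differ.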
Bina RAVE Architecture (2)
Workflows (DNA, RNA ..)
Tools (BWA, GATK, SVs)
Services (Logging, Storage, Caching, Streaming)
Commands
(Samtools, GATK, URL,..)
Genome-aware – Workflow Generation
Distributed Coordination
Task Graph
JSON Request
(UI/CMD/SDK)
Nodes / VMs
Executor
Dynamic
scheduling
Graph
Triggers
Updates
Genome-aware – Distributed Execution Framework
Syncing all
Nodes
Dependency
Graph
Task Status
Network storage – Input/output data
Local storage
• Dependency Aware Execution
• Locality Aware Execution (Caching)
• Streaming Through “Engines”
• In-Memory Computation
Output (VCF, SV)
Input (BAM, FASTQ)
Static
Scheduling
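One way to read "locality aware execution" is: prefer the node that already holds a task's inputs in local storage, so less data moves over the network. The sketch below is a deliberately simple guess at that idea, with hypothetical node and file names; it is not a description of the actual framework.

```python
from typing import Dict, List, Set


def choose_node(task_inputs: List[str], node_caches: Dict[str, Set[str]]) -> str:
    """Pick the node with the most task inputs already cached locally.

    Ties are broken by node name so the choice is deterministic.
    """
    wanted = set(task_inputs)

    def cached_count(node: str) -> int:
        return len(wanted & node_caches[node])

    # max() returns the first maximal element, so sorting fixes the tie-break.
    return max(sorted(node_caches), key=cached_count)


caches = {
    "node-a": {"ref.fa"},
    "node-b": {"ref.fa", "sample1.bam"},
    "node-c": set(),
}
chosen = choose_node(["sample1.bam", "ref.fa"], caches)
```

A production scheduler would also weigh file sizes and current node load, since one cached multi-gigabyte BAM matters more than several small files.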
Bina AAiM Architecture
Annotation and Indexing Engine
Input VCF
UI/CMD
Clinical
Annotations
Genomic
Context
Prediction
Func. Impact
Population
Frequency
Distributed Execution
Framework
Annotation
(Join static DBs)
Indexing &
Functional Filters
MapReduce Jobs
Analytics Engine
NoSQL
Data Store
Indices
Metadata
Store
Tumor/Normal
Pedigree
Queries, Filters, Variant Sets, Reports
Bina
Secondary
Cohort Study
Proband
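The Tumor/Normal and Pedigree boxes suggest set-style queries over variant calls. As a rough sketch, assuming each sample is reduced to a set of hypothetical `(chromosome, position, ref, alt)` tuples, candidate somatic and de novo variants fall out of set differences. Real analyses also model genotypes, coverage, and sequencing error, none of which is shown here.

```python
from typing import Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def somatic_candidates(tumor: Set[Variant], normal: Set[Variant]) -> Set[Variant]:
    """Variants seen in the tumor sample but not in the matched normal sample."""
    return tumor - normal


def de_novo_candidates(proband: Set[Variant], mother: Set[Variant],
                       father: Set[Variant]) -> Set[Variant]:
    """Variants seen in the proband but in neither parent."""
    return proband - (mother | father)


# Toy call sets, invented for illustration.
tumor = {("chr1", 100, "A", "G"), ("chr7", 55, "C", "T")}
normal = {("chr1", 100, "A", "G")}
somatic = somatic_candidates(tumor, normal)
```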
What next?
http://www.genomicsengland.co.uk
http://www.personalgenomes.org
• Apply this process to different domains and applications
• Come up with ways of ranking variants
• Keep learning from data
• Sequence everyone!
• Genomics England 100,000 Genomes Project
• Personal Genome Project
• Decrease cost
• Increase accuracy
• Make the technology faster and more usable!
Map of sequencers around the globe: http://omicsmaps.com
Challenges in Genomics
• Accuracy
• Gold standard? Which tool is best? There are so many!
• NIST, DREAM Challenge
• Need to speak the same language… interoperability
• Global Alliance
• API, format, meta data, …
• Regulations
• HIPAA, CLIA: security, accuracy, anonymity and encryption
• Scalability
• Storage
• Need terabytes
• Each genome could be up to 1 TB
• Computation
• We still have little idea what most of the DNA is doing…
• Can’t run on a single machine. Need to scale to many nodes
• Need to leverage cloud technologies
• Provenance and auditability
• Importance of usability
• Different personas
• Errors are very expensive (life and death)
• Better visualization → faster discovery → faster cure
Why should software engineers move to genomics?
Because genomics needs you, and you need genomics.
Work on something that matters! (#Code2Cure)
Things that SWEs do very well:
• Automation
• Elegant solutions for complex problems
• Enabling non-technical users by making the technology robust and accessible
• Scale
• Optimization
• Building production-grade platforms
• Tested
• Robust
• Secure
THESE ARE ALL NEEDED IN GENOMICS YESTERDAY!
Image courtesy of http://silvsoul.blogspot.com
Open projects/resources to check out/contribute to
Projects/Conferences
• Galaxy -- http://galaxyproject.org
• Arvados -- https://arvados.org
• Open Bio Conference -- http://www.open-bio.org
• BioVis -- http://www.biovis.net
• Biopython -- http://biopython.org
• Global Alliance for Genomics and Health -- http://ga4gh.org
• Rosalind Project -- http://rosalind.info
Blogs/Websites
• http://bcb.io
• http://nextgenseek.com/
• http://ngs-expert.com/
• http://seqanswers.com/
• http://core-genomics.blogspot.com
• http://www.genomesunzipped.org
• http://genomeweb.com
Thank you. And I hope you consider moving to genomics!
http://info.bina.com/code2cure-community
@mirkiani
Amirhossein Kiani
Sr. Lead Software Engineer