genomeinabottle.org
Genome in a Bottle Consortium GIAB/GRC Pre-ASHG Workshop
October 5, 2015
Reference Materials for Clinical Applications of Human Genome Sequencing
Justin Zook and Marc Salit
National Institute of Standards and Technology
Sequencing technologies and bioinformatics pipelines disagree
O’Rawe et al. Genome Medicine 2013, 5:28
Who is right?
Is anyone right?
GIAB Scope
• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human genome variant calls.
• A principal motivation for this consortium is to enable performance assessment of sequencing and science-based regulatory oversight of clinical sequencing.
Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
– integration of data from multiple platforms
– sequencing and analysis
• Enable regulated applications
Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods
NGS Validation Process using Genomes in Bottles
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
Analytical Process
Genome in a Bottle Scope
Pre-Analytical Process
Clinical Interpretation
GIAB Data
Genome in a Bottle Consortium (GIAB)
Hosted by US National Institute of Standards and Technology
Goal: Provide infrastructure to assess confidence in human variant calls
• Appropriately consented, widely available DNA samples, distributed by the Coriell Institute
– Also, QCed Reference Material (RM) versions from controlled lots will be available from NIST
– Also, PGP samples are commercially available
• High-accuracy reference data for these samples
• Tools to facilitate their use
– With the Global Alliance Data Working Group Benchmarking Team
ga4gh.org
GIAB Selected Samples
• CEPH/Utah Pedigree 1463: NA12877, NA12878, NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, NA12889, NA12890, NA12891, NA12892, NA12893
– NA12878 is available as NIST RM 8398
• Ashkenazi Jewish Trio (new, Personal Genome Project): NA24149, NA24143, NA24385
• Asian (Han Chinese) Trio (new, Personal Genome Project): NA24694, NA24695, NA24631
Note: Illumina and RTG have used data from the pedigree to improve variant calls in the specific GIAB samples.
NIST Human Genome Reference Materials (RMs)
• NIST RM 8398 is available!
– tinyurl.com/giabpilot
– DNA isolated from large growth cell cultures
– Stable, homogeneous
– Best for regulated uses
– DNA from same cell line at Coriell (NA12878)
• New AJ and Asian Samples
– Available from Coriell now
– NIST RM available in 2016
Integrated 14 datasets from 5 platforms to establish Reference SNP/indel Calls for NA12878
Zook et al., Nature Biotechnology, 2014.
Integration Methods to Establish Reference Variant Calls for NA12878
Candidate Variants from Each Platform
Identify Concordant Variants
Identify Characteristics of Systematic Error
Arbitrate Using Evidence of Systematic Error
Exclude regions potentially biased for all short reads (e.g., repeats, SVs)
Zook et al., Nature Biotechnology, 2014.
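The first two integration steps (collect candidate variants from each platform, then identify the concordant ones) can be illustrated with a minimal sketch. This is not the published integration pipeline: the function name, the data layout, and the choice to treat a missing call as homozygous reference are all assumptions made for illustration, and the arbitration step is left out.

```python
def split_by_concordance(callsets):
    """Split candidate variant sites into concordant and discordant sets.

    callsets maps a platform name to a dict of
    (chrom, pos, ref, alt) -> genotype string, e.g. "0/1".
    """
    # Union of every site called by at least one platform
    all_sites = set().union(*(calls.keys() for calls in callsets.values()))
    concordant, discordant = {}, {}
    for site in all_sites:
        # Assumption: a platform with no call at a site is treated as "0/0"
        genotypes = {name: calls.get(site, "0/0") for name, calls in callsets.items()}
        if len(set(genotypes.values())) == 1:
            concordant[site] = next(iter(genotypes.values()))
        else:
            # Discordant sites would go on to arbitration
            discordant[site] = genotypes
    return concordant, discordant
```

Discordant sites returned by this sketch correspond to the ones that would need arbitration using evidence of systematic error.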
Assigning confidence to genomic regions for NA12878
High-confidence (77%)
• Platforms agree or we understand the systematic biases causing disagreement
• At least some methods have no evidence of systematic errors
• Mendelian inheritance consistent
Lower confidence (23%)
• In a region known to be difficult for current technologies
– Segmental Dups
– Repeats, Low Complexity
– High/Low GC
– Etc.
• Evidence of systematic error across many platforms
• Inconsistent inheritance
Zook et al., Nature Biotechnology, 2014.
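The "Mendelian inheritance consistent" criterion can be sketched for a single site in a trio. This is a simplified illustration, not the consortium's method: the function name is hypothetical, it assumes diploid autosomal genotypes with no missing alleles, and a true de novo mutation would be reported as inconsistent.

```python
def is_mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can be formed by taking
    one allele from each parent.

    Genotypes are strings such as "0/1" or "1|1" (diploid, no missing alleles).
    """
    def alleles(genotype):
        return genotype.replace("|", "/").split("/")

    a, b = alleles(child)
    f, m = alleles(father), alleles(mother)
    # Either allele may have come from either parent
    return (a in f and b in m) or (b in f and a in m)
```

For example, a child called 1/1 with a 0/0 father is inconsistent, which would count as evidence against placing the site in the high-confidence set.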
Using high-confidence NIST-GIAB genotypes for NA12878
• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
• Data on FTP now well-organized
GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of the data underlying each call
Uses of GIAB NA12878
Oncology – Molecular and Cellular Tumor Markers
“Next Generation” Sequencing (NGS) guidelines for somatic genetic variant detection
www.bioplanet.com/gcat
Global Alliance for Genomics and Health Benchmarking Task Team
• Formed June 2014 to develop methods and tools for comparing variant calls to a benchmark
• Developed standardized definitions for performance metrics like TP, FP, and FN
• Initial focus on germline SNPs/indels
• Developing benchmarking tools
– Comparison engine
– Pluggable web interface with modules for reporting/calculation of metrics and for visualization/user interface
• Working with Genome in a Bottle Consortium to host data and calls from their well-characterized genomes
www.bioplanet.com/gcat
Example User Interface
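Once TP, FP, and FN counts are defined, summary metrics follow directly. The sketch below uses the standard definitions of precision and recall (sensitivity). Which metrics the Benchmarking Task Team tools actually report is not stated on the slide, so the function name and the choice of metrics are assumptions.

```python
def benchmark_metrics(tp, fp, fn):
    """Compute precision and recall from counts against a benchmark callset.

    tp: calls that match the benchmark
    fp: calls absent from the benchmark (inside high-confidence regions)
    fn: benchmark variants that were not called
    """
    nan = float("nan")
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan
    return {"precision": precision, "recall": recall}
```

With 90 true positives, 10 false positives, and 30 false negatives, precision is 90/100 and recall is 90/120.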
Global Alliance for Genomics and Health Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
How should we interpret this complex variant on chr21?
Global Alliance for Genomics and Health Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
Beyond simple T/F classification: Genotype errors
Truth | Callset | Description | Proposed Name(s) | CM#1 region match | CM#2 allele match | CM#3 genotype match
0/1 | 1/1 | zygosity/genotype error | GE | TP | 1TP, 1GE | FN
1/1 | 0/1 | zygosity/genotype error | GE | TP | 1TP, 1GE | FN
1/2 | 0/1, 1/1, 0/2, 2/2 | common allele, FN allele | GE_FN | TP | 1TP, 1GE, 1FN | FN
0/1 | 1/2 | common allele, FP allele | GE_FP | TP | 1TP, 1GE, 1FP | FP, FN
1/1 | 1/2 | common allele, FP allele | GE_FP | TP | 1TP, 1GE, 1FP | FP, FN
1/2 | 1/3 | common allele, FP allele, FN allele | GE_FP_FN | TP | 1TP, 1GE, 1FP, 1FN | FP, FN
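The proposed names in the table can be derived from the sets of non-reference alleles in the truth and call genotypes. The following is a sketch under assumptions: the function names and the "MATCH" label are hypothetical, alleles are assumed to be already normalized to the same representation in both callsets, and cases outside the table (no shared alternate allele, or a half-call) return None.

```python
def alt_alleles(genotype):
    """Return the set of non-reference alleles in a genotype such as "1/2"."""
    parts = genotype.replace("|", "/").split("/")
    return {a for a in parts if a not in ("0", ".")}

def classify_genotype_mismatch(truth, call):
    """Name a truth/call genotype pair using the proposed GE_* labels."""
    truth_parts = sorted(truth.replace("|", "/").split("/"))
    call_parts = sorted(call.replace("|", "/").split("/"))
    if "." in truth_parts or "." in call_parts:
        return None  # no-calls and half-calls are handled separately
    if truth_parts == call_parts:
        return "MATCH"  # hypothetical label for an exact genotype match
    truth_alts, call_alts = alt_alleles(truth), alt_alleles(call)
    if not (truth_alts & call_alts):
        return None  # no common allele: not covered by the table
    name = "GE"
    if call_alts - truth_alts:
        name += "_FP"  # the call contains an allele absent from the truth
    if truth_alts - call_alts:
        name += "_FN"  # the truth contains an allele missing from the call
    return name
```

For instance, truth 1/2 against call 1/3 shares allele 1, adds allele 3, and misses allele 2, which gives GE_FP_FN.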
Global Alliance for Genomics and Health Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
Beyond simple T/F classification: no-calls and half-calls
Truth | Callset | Description | Proposed Name(s) | CM#1 region match | CM#2 allele match | CM#3 genotype match
0/1 | ./1 | half-call, TP allele | HC_TP | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE | TP
1/1 | ./1 | half-call, TP allele | HC_TP | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE | FN
0/1, 1/1 | ./0 | half-call, FN allele(s) | HC_FN | NC, NCV, TP | 1NC, 1NCV, 1FN | FN
1/2 | ./0 | half-call, FN allele(s) | HC_FN | NC, NCV, TP | 1NC, 2NCV, 2FN | FN
1/2 | ./1, ./2 | half-call, TP allele, FN allele | HC_TP_FN | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE, 1FN | FN
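The half-call names in the table follow from which truth alleles the single called allele confirms or leaves uncalled. This is a minimal sketch, assuming normalized alleles and exactly one missing allele in the call. The function name is hypothetical, and combinations the table does not list (for example, a half-call carrying an allele absent from the truth) return None rather than an invented label.

```python
def classify_half_call(truth, call):
    """Name a half-call such as "./1" using the proposed HC_* labels."""
    call_parts = call.replace("|", "/").split("/")
    if call_parts.count(".") != 1:
        return None  # not a half-call
    truth_alts = {a for a in truth.replace("|", "/").split("/") if a not in ("0", ".")}
    called_alts = {a for a in call_parts if a not in ("0", ".")}
    if called_alts - truth_alts:
        return None  # FP allele in a half-call: not covered by the table
    name = "HC"
    if called_alts & truth_alts:
        name += "_TP"  # the called allele is present in the truth
    if truth_alts - called_alts:
        name += "_FN"  # at least one truth allele was not called
    return name if name != "HC" else None
```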
Stratifying False Positives
Figure panels: GC Content; TR Unit 1; TR Unit 2; TR Unit 3; TR Unit 4; TR Unit <7; TR Unit >=7
Credit: Abby Beeler, Ellie Wood
GA4GH - Stratification
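One way to stratify false positives by GC content is to compute the GC fraction of the sequence flanking each false-positive site and count sites per bin. The sketch below illustrates that idea only: the bin edges, function names, and use of flanking sequence are assumptions and do not describe the GA4GH stratification itself.

```python
from bisect import bisect_right
from collections import Counter

GC_BIN_EDGES = (0.3, 0.5, 0.7)  # hypothetical cut points between GC bins

def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / counted

def count_by_gc_bin(flanking_sequences, edges=GC_BIN_EDGES):
    """Count sequences per GC bin; bin i covers edges[i-1] <= GC < edges[i]."""
    counts = Counter()
    for seq in flanking_sequences:
        counts[bisect_right(edges, gc_fraction(seq))] += 1
    return counts
```

Comparing these per-bin counts with the number of benchmark sites in each bin would show whether false positives are enriched at extreme GC content.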
Public data from GIAB AJ PGP Trio
Long reads/"Linked" reads
• ~70/30/30x PacBio
– ~11kb N50
• BioNano
• 10X Genomics
• Moleculo
• Complete Genomics LFR
• Oxford Nanopore
Short reads
• 300x Illumina paired-end
• 15x Illumina 6kb mate-pair
• Complete Genomics
• SOLiD 5500W
• Ion Proton Exome
http://biorxiv.org/content/early/2015/09/15/026468
GIAB Analysis Group – New Data Sets
Leaders
• Francisco de la Vega
– Annai Systems
• Chris Mason
– Weill Cornell Medical Center
• Tina Graves
– Washington University
• Valerie Schneider
– NCBI
• and Justin and Marc
Status
• Analysis Group Responsibilities:
– https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
– https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
– https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
– https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting data and analyses on GIAB FTP site
• Recruiting people to help with the work
Goal: Establish and distribute a set of authoritative benchmark variant calls of all types and sizes, as well as homozygous reference regions, on GIAB PGP trios
Data Release Policy: Real-time, Open, Public Release
Individual Datasets
• Uploaded to GIAB FTP site as it is collected
• Includes raw reads, aligned reads, and variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel, and homozygous reference calls
• Then develop SV and non-SV calls
• Released calls are versioned
• Preliminary callsets will be made available to be critiqued
Analysis Progress: AJ Trio
• SNPs/indels
– Several candidate callsets
– NIST working on integration
– Plan to use 10X/Moleculo/PacBio for difficult-to-map regions
• Assembly
– 2 de novo assemblies of AJ trio (MHAP/PBcR and Falcon/Bionano)
– Will be used by at least 2 groups for SV calling
• Structural variants
– Candidate calls being generated by 15+ groups with >20 different algorithms and 6 datasets
– 3 integration methods: Bina-MetaSV, DNAnexus/Baylor-Parliament, NIST-svclassify
– Parliament: ~7k SVs with evidence in PacBio and Illumina
• Long-range Phasing
– 2 phased calls so far (CG LFR and 10X)
– Integration methods needed
Proposed approach to form high-confidence SV (and non-SV) calls
• Generate candidate calls from multiple methods
• Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manually inspect discordant calls
• Integrate new and revised calls
• Combine integrated calls (with heuristics and/or machine learning) to generate high-confidence calls
Timeline: August 30, 2015; Nov 1, 2015; Jan 1, 2016; Jan 26, 2016
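Comparing candidate SV calls from multiple methods needs some rule for deciding when two calls describe the same event. A common generic choice is reciprocal overlap, sketched below. This is not how Parliament, MetaSV, or svclassify work internally: the function names, the interval layout, and the 50% threshold are all assumptions for illustration.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between intervals a and b.

    Each interval is (chrom, start, end) with end > start.
    """
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))

def supporting_methods(candidate, calls_by_method, min_overlap=0.5):
    """List the methods with at least one call matching the candidate."""
    return sorted(
        name
        for name, calls in calls_by_method.items()
        if any(reciprocal_overlap(candidate, call) >= min_overlap for call in calls)
    )
```

A candidate supported by only one method under this rule would be a natural target for the manual inspection of discordant calls.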
Acknowledgments
• FDA – Elizabeth Mansfield, Computing staff
• Many members of Genome in a Bottle
– New members welcome!
– Sign up on website for email newsletters
Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
www.bioplanet.com/gcat - exome comparison tool
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - GeT-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking work group
– ga4gh.org/#/benchmarking-team
Twice yearly workshop
– Winter: January 28-29, 2016 at Stanford University, California, USA
– Summer at NIST, Maryland, USA
Public Meetings
Justin Zook: [email protected]
Marc Salit: [email protected]