Genome in a Bottle Consortium
Progress Update
January 27, 2014
Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
Whole Genome RMs vs. Current Validation Methods
• Sanger confirmation
– Limited by number of sites (and sometimes it’s wrong)
• High depth NGS confirmation
– May have same systematic errors
• Genotyping microarrays
– Limited to known (easier) variants
– Problems with neighboring “complex” variants, duplications
• Mendelian inheritance
– Can’t account for some systematic errors
• Simulated data
– Generally not very representative of errors in real data
• Ti/Tv
– Varies by region of genome, and only gives overall statistic
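To make the Ti/Tv point concrete, here is a minimal sketch of how the ratio is computed. The function names and the toy call set are assumptions for illustration, not part of the GIAB pipeline. The result is a single number for the whole call set, which is why it cannot flag individual wrong calls.

```python
# Purines and pyrimidines; a transition stays within one class.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)

def titv_ratio(snps):
    """Ti/Tv for an iterable of (ref, alt) single-base substitutions."""
    ti = tv = 0
    for ref, alt in snps:
        if ref == alt:
            continue  # not a variant
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")

# Hypothetical toy call set, not real data: 3 transitions, 2 transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(titv_ratio(example_snps))
```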
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform– take advantage of strengths of each platform
• Avoid bias towards any particular bioinformatics algorithms
Integrate 14 Datasets from 5 platforms
Integration of Data to Form Highly Confident Genotype Calls
Find all possible variant sites
Find concordant sites across multiple datasets
Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
Even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications, SVs, or long repeats
[Flowchart stages: Candidate variants, Concordant variants, Find characteristics of bias, Arbitrate using evidence of bias, Confidence Level]
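The arbitration steps above can be sketched roughly as follows. This is a simplified illustration, not the consortium's implementation: the single `atypicality` score per dataset, the cutoff, and the minimum number of typical datasets are all hypothetical stand-ins for the "atypical characteristics" the slide refers to.

```python
from dataclasses import dataclass

@dataclass
class Call:
    dataset: str        # dataset label (hypothetical names)
    genotype: str       # e.g. "0/1"
    atypicality: float  # assumed score; higher = more evidence of bias

def arbitrate(calls, in_difficult_region=False, typical_cutoff=1.0, min_typical=2):
    """Return (genotype, confident) for one candidate site."""
    if not calls:
        raise ValueError("no calls at this site")
    # Drop the most atypical dataset first, then the next, until all agree.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # last element is the most atypical one left
    genotype = remaining[0].genotype
    # Even when the survivors agree, require enough of them to look typical
    # and require the site to lie outside known difficult regions.
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    confident = n_typical >= min_typical and not in_difficult_region
    return genotype, confident

# Toy site: two typical datasets agree, one atypical dataset disagrees.
site = [Call("A", "0/1", 0.2), Call("B", "0/1", 0.4), Call("C", "1/1", 3.0)]
print(arbitrate(site))
print(arbitrate(site, in_difficult_region=True))
```

The `in_difficult_region` flag stands in for the check against known segmental duplications, SVs, and long repeats.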
Verification of “Highly Confident” Genotype accuracy
• Sanger sequencing
– 100% accuracy but only 100s of sites
• X Prize Fosmid sequencing
– Sometimes call only part of a complex variant
• Microarrays
– Differences appear to be FP or FN in arrays
• Broad 250bp HaplotypeCaller
– Very highly concordant
• Platinum genomes pedigree SNPs
– Some systematic errors are inherited; different representations of complex variants
• Real Time Genomics SNPs and indels
– Some interesting sites called by RTG complex caller
GCAT – Interactive Performance Metrics
• NIST is working with GCAT to use our highly confident variant calls
• Assess performance of many combinations of mappers and variant callers
• www.bioplanet.com/gcat
Improvement of FreeBayes over 1 year with indels
Why do calls differ from our highly confident genotypes?
Apparent False Positives
• Platform-specific systematic sequencing errors for SNPs
• Analysis-specific
• Difficult to map regions
• Indels in long homopolymers
Apparent False Negatives
• Different complex variant representation
• Near indels
• Inside repeats
Complex variants have multiple correct unphased representations
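[Figure: the same complex variant as aligned by BWA, ssaha2, CGTools, and Novoalign against the reference (Ref); labeled representations include a T insertion and a TCTCT insertion]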
                              FP SNPs       FP MNPs      FP indels
Traditional comparison        0.38% (610)   100% (915)   6.5% (733)
Comparison with realignment   0.15% (249)   4.2% (38)    2.6% (298)
• ~225,000 highly confident variants are within 10bp of another variant
• FPs and FNs are significantly enriched for complex variants
• RTG vcfeval can fix this issue!
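One way to see why position-by-position comparison produces apparent FPs and FNs is to apply each call set to the reference and compare the resulting sequences. The sketch below does this for toy data. It is not RTG vcfeval's algorithm, and it makes a simplifying assumption that all variants in a call set sit on the same haplotype. The helper names and sequences are hypothetical.

```python
def apply_variants(ref: str, variants):
    """Apply non-overlapping (pos, ref_allele, alt_allele) edits, 0-based pos."""
    out = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF mismatch at {pos}")
        out.append(ref[cursor:pos])   # untouched reference before the variant
        out.append(alt_allele)        # substituted sequence
        cursor = pos + len(ref_allele)
    out.append(ref[cursor:])
    return "".join(out)

def same_haplotype(ref, calls_a, calls_b):
    """True if both call sets spell the same sequence, whatever their records."""
    return apply_variants(ref, calls_a) == apply_variants(ref, calls_b)

# One MNP versus two adjacent SNPs: different records, same sequence.
print(same_haplotype("ACGTA", [(1, "CG", "TA")], [(1, "C", "T"), (2, "G", "A")]))

# A CT insertion placed at either end of a CT repeat: same sequence again.
print(same_haplotype("GCTCTA", [(0, "G", "GCT")], [(4, "T", "TCT")]))
```

A record-level comparison would score each of these pairs as a mismatch, although both sides describe the same sequence.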
Reasons we exclude regions from high-confidence set
Structural variant analytical approach
Approach                    Tools
Depth of coverage (DOC)     Control-FREEC, CnD
Paired-end mapping (PEM)    Breakdancer
Split read (SR)             Pindel
Assembly based (AS)         Velvet, ABySS
Combination                 Genome-STRiP
SVMerge: List of structural variant calls
Validation parameters for each SV
• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and standard deviation)
• # of discordant paired-ends
• Soft clipping of the reads (mean and standard deviation)
• Mapping quality (mean and standard deviation)
• # of heterozygous and homozygous SNP genotype calls
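The validation parameters listed for each SV could be gathered per candidate region along these lines. This is a sketch only: the `Read` record, the "het"/"hom" genotype labels, and the use of the population standard deviation are assumptions, and real inputs would come from BAM and VCF files rather than in-memory lists.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class Read:
    insert_size: int    # paired-end distance
    soft_clipped: int   # number of soft-clipped bases
    mapq: int           # mapping quality
    discordant: bool    # pair flagged as discordant (assumed precomputed)

def mean_sd(values):
    """Mean and population standard deviation of a sequence of numbers."""
    values = list(values)
    return mean(values), pstdev(values)

def sv_region_summary(depths, reads, genotypes):
    """Summarize one candidate SV region.

    depths: per-base coverage values; genotypes: "het"/"hom" SNP calls.
    """
    return {
        "coverage": mean_sd(depths),
        "insert_size": mean_sd(r.insert_size for r in reads),
        "soft_clipping": mean_sd(r.soft_clipped for r in reads),
        "mapping_quality": mean_sd(r.mapq for r in reads),
        "n_discordant_pairs": sum(r.discordant for r in reads),
        "n_het_snps": sum(g == "het" for g in genotypes),
        "n_hom_snps": sum(g == "hom" for g in genotypes),
    }

# Toy values, not real data.
reads = [Read(300, 0, 60, False), Read(500, 10, 20, True)]
print(sv_region_summary([10, 30], reads, ["het", "hom", "het"]))
```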
Challenges with assessing performance
• All variant types are not equal
• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
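The last point can be illustrated with a toy truth-versus-call genotype table. The category labels and counts below are hypothetical. The same table gives different "sensitivity" values depending on whether a het site called hom-alt counts as detected, which is one reason a two-category measure is ambiguous here.

```python
from collections import Counter

VARIANT = ("het", "hom_alt")

def genotype_table(pairs):
    """Count (truth, call) genotype pairs."""
    return Counter(pairs)

def site_sensitivity(table):
    """Fraction of true variant sites called as any variant (genotype ignored)."""
    truth_variant = sum(n for (t, c), n in table.items() if t in VARIANT)
    detected = sum(n for (t, c), n in table.items() if t in VARIANT and c in VARIANT)
    return detected / truth_variant

def genotype_sensitivity(table):
    """Fraction of true variant sites called with exactly the right genotype."""
    truth_variant = sum(n for (t, c), n in table.items() if t in VARIANT)
    exact = sum(n for (t, c), n in table.items() if t in VARIANT and t == c)
    return exact / truth_variant

# Hypothetical counts: 10 true variant sites.
pairs = (
    [("het", "het")] * 6
    + [("het", "hom_alt")] * 2      # site found, genotype wrong
    + [("het", "hom_ref")] * 1      # site missed
    + [("hom_alt", "hom_alt")] * 1
)
table = genotype_table(pairs)
print(site_sensitivity(table), genotype_sensitivity(table))
```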
Pedigree calls
• RTG and Illumina Platinum Genomes working on this
• Sequence NA12878, husband, and 11 children to identify high confidence variants
– Identify cross-over events
– Determine if genotypes are consistent with inheritance
• Should we integrate these with the NIST high-confidence genotypes?
• Should we find larger families for future genomes?
• See afternoon presentations!
Source: Mike Eberle, Illumina
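The per-site inheritance check can be sketched as below. It assumes diploid autosomal genotypes given as allele tuples and no de novo mutations. It does not model cross-over events, and the function names are hypothetical. Since some systematic errors are inherited, a consistent site is not necessarily a correct one.

```python
from itertools import product

def mendelian_consistent(mother, father, child):
    """True if the child could inherit one allele from each parent.

    Genotypes are unphased allele tuples, e.g. (0, 1).
    """
    possible = {tuple(sorted(p)) for p in product(mother, father)}
    return tuple(sorted(child)) in possible

def consistent_in_family(mother, father, children):
    """True if every child's genotype is consistent with the parents'."""
    return all(mendelian_consistent(mother, father, c) for c in children)

# Toy genotypes at one site: het mother, hom-ref father.
print(mendelian_consistent((0, 1), (0, 0), (0, 1)))
print(mendelian_consistent((0, 1), (0, 0), (1, 1)))
print(consistent_in_family((0, 1), (0, 0), [(0, 0), (1, 0), (0, 1)]))
```

With 11 children, a genotyping error in a parent is more likely to show up as an inconsistency in at least one child than it would be in a single trio.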
Pedigree Calls in Uncertain Regions
GIAB Characterization of pilot RM
• NIST – 300x 150x150bp HiSeq (from 6 vials)
• NIST – 100x 75bp ECC SOLiD 5500W
• Illumina – 50x 100x100bp HiSeq
• Complete Genomics – Normal and LFR (non-RM)
• Garvan Institute – Illumina exome
• NCI – Ion Proton whole genome
• INOVA – Infinium SNP/CNV array
Homogeneity and Stability
Homogeneity
• Multiplex first and last vial
– 3 libraries x 33x HiSeq each
• Multiplex 4 random vials
– 2 libraries x 12.5x HiSeq each
• Compare variability due to:
– vial
– library
– day
– flow cell
– lane
– sampling
• Run PFGE on each vial for size
Stability
• Run PFGE to detect DNA degradation
• Freeze-thaw 2 and 5 times
• Vortex for 10s
• 4°C for 2 and 8 weeks
• 37°C for 2 and 8 weeks
FTP site and Amazon S3
• NCBI is hosting fastq, bam, and vcf files on the giab ftp site
• These data are mirrored to Amazon S3, so we encourage you to take advantage of this!
Pilot Reference Material
• High-confidence calls are available on the ftp site and are already being used
• NIST plans to release this as a NIST Reference Material in the next couple of months
Future Directions
• Characterize more “difficult” regions/variants
• Structural variants
• Compare to pedigree calls
• Examine potentially clinically relevant regions/variants in RMs
• Use long-read technologies
– Moleculo
– CG LFR
– PacBio
– BioNano Genomics
– future technologies??
• Use glia/platypus to realign reads to candidate variants
• Analyze interlaboratory study data
• Characterize PGP genomes
– Ashkenazim trio
– son in Asian trio
– DNA at NIST in Jan-Feb 2014
– Volunteers to sequence?
• Select future genomes
• Tumor-normal?
Topic #1: Moving beyond the easy regions/variants
Presentations
• Emerging Technologies
– PacBio
– Complete Genomics LFR
– Moleculo
– BioNano Genomics
• Structural Variants
– Bina Technologies
Topics
• Structural Variants
• Phasing
• Validation
• Where should we set the threshold(s) for confidence?
Topic #2: Cancer and Future Genomes
Cancer
• Spike-ins
• Mixtures of normal cell lines
• Tumor-normal cell line pair
• Transcriptome controls
Priorities for Future Genomes
• Diverse ancestry groups
• Larger families
• Recruitment with consent for commercialization
• How many genomes?
• Should the parents be NIST Reference Materials, or only the child?
Working Group Questions
RM Selection & Design
• Spike-in controls
• FFPE
• Commercial RMs
• ABRF interlaboratory study
• Should we prioritize one or two genomes?
RM Characterization
• Production mode for new trios
– Pilot was characterized by Illumina, SOLiD, Ion Proton, and Complete Genomics
– What resources should we invest in measurements for each new family?
Working Group Questions
Bioinformatics
• Storing data/pipelines
– Suggestions for ftp structure
– Data submission/accessioning process
– Data model for genomic data
– Archiving pipelines and reproducible research
• GRCh38
• How to use pedigree calls for pilot genome?
• Clones for targeted regions (hard regions if not whole genome)
• In which difficult regions should we focus our characterization?
Performance Metrics
• Target audience
• Requirements for user interface
– Establishing truth set(s)
– Inputs/Outputs
– Visualization
• Integration with GeT-RM