Post on 25-Jun-2015
© 2010 Illumina, Inc. All rights reserved.
Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro,
GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
Platinum Genomes: Towards a comprehensive truth data set
Michael A. Eberle
Morten Kallberg, Han-Yu Chuang
2
Platinum Genome project: Goals
Problem: No comprehensive truth set of variant calls for validation
Solution: Sequence and analyze large family pedigree
Use Mendelian inheritance to identify good / bad variant calls
– Including SNPs, indels & SVs
Aggressively incorporate variant calls
– Incorporate multiple algorithms and sequencing technologies
– Do not limit this just to what is currently easy to call
Make the data available publicly
– Both raw data and processed calls with accuracy assessment
Re-assess algorithms against better truth data
– Better and more comprehensive truth data will allow for rapid advances in software
3
Using inheritance to detect conflicts: trio analysis
When we do a trio analysis like this, only 50% of each parent's DNA is passed on to
the child, so many of the variants will only be called in one parent
– Have no power to detect false positives in the parents
A trio analysis is also not very sensitive for detecting errors
– For example, if father is AC and mother is AC then the child can be AA, AC or CC and
still be consistent with Mendelian inheritance
– Many errors occur at sites that are systematically het, but trio analysis assumes that
these are correct
[Figure: MOM and DAD, each with two chromosomes (mother's chromosomes, father's chromosomes), and CHILD. Child receives blue chromosome from mother and green chromosome from father: e.g. typical trio analysis]
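The weakness of the trio check can be made concrete with a small sketch. The function name `mendelian_consistent` and the two-letter genotype strings are illustrative assumptions, not part of the original analysis:

```python
from itertools import product

def mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed from one allele of each parent."""
    return any(sorted(f + m) == sorted(child) for f, m in product(father, mother))

# Both parents het A/C: every possible child genotype passes the check,
# so a wrong call at a systematically het site goes unnoticed.
for child in ("AA", "AC", "CC"):
    print(child, mendelian_consistent("AC", "AC", child))

# An error is only caught when it breaks transmission, e.g. AA x AA -> AC.
print(mendelian_consistent("AA", "AA", "AC"))
```

With AC x AC parents all three child genotypes are accepted, which is why a single trio has little power to flag errors at such sites.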
4
Using inheritance to determine accuracy: larger pedigree
[Figure, built up over several slides: possible genotype (GT) patterns compared with the observed genotypes, two alleles per sample. Columns: MOM, DAD and CHILDREN 1, 3, 2, 6, 5, 4, 7]

                     MOM  DAD  1    3    2    6    5    4    7    # Errors / Hamming Distance
Possible GT pattern  AT   AA   TA   AA   TA   AA   TA   AA   AA   6
Possible GT pattern  TA   AA   AA   TA   AA   TA   AA   TA   TA   5
Possible GT pattern  AA   AT   AT   AT   AA   AA   AA   AT   AA   0
Possible GT pattern  AA   TA   AA   AA   AT   AT   AT   AA   AT   7
OBSERVED GENOTYPES   AA   AT   AT   AT   AA   AA   AA   AT   AA

100% consistent therefore we predict that all genotypes are correct
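The error counts on this slide can be reproduced with a short sketch. It assumes the 18 alleles in each row pair up into nine genotypes in the column order MOM, DAD, 1, 3, 2, 6, 5, 4, 7, and that genotypes are unphased (so "TA" equals "AT"); the helper names are illustrative:

```python
# Allele rows from the slide, grouped into one genotype per sample.
PATTERNS = [
    "AT AA TA AA TA AA TA AA AA",
    "TA AA AA TA AA TA AA TA TA",
    "AA AT AT AT AA AA AA AT AA",
    "AA TA AA AA AT AT AT AA AT",
]
OBSERVED = "AA AT AT AT AA AA AA AT AA"

def genotypes(row):
    # Sort alleles so that "TA" and "AT" compare as the same unphased genotype.
    return ["".join(sorted(gt)) for gt in row.split()]

def hamming(row_a, row_b):
    # Number of samples whose genotype differs between the two rows.
    return sum(a != b for a, b in zip(genotypes(row_a), genotypes(row_b)))

distances = [hamming(p, OBSERVED) for p in PATTERNS]
print(distances)  # [6, 5, 0, 7]
```

A distance of 0 to one pattern means the observed calls are fully consistent with the transmission of the parental haplotypes.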
10
Platinum Genomes - CEPH/Utah Pedigree 1463
All 17 members sequenced to at least 50x depth (PCR-Free protocol)
– SNPs & indels called using BWA + GATK + VQSR
Each member of the trio highlighted in bold is sequenced to 200x
An additional 200x technical replicate was done for NA12882
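[Pedigree figure, sample IDs by generation]
Grandparents: 12889, 12890, 12891, 12892
Parents: 12877, 12878
Children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893
Highlighted trio: 12877, 12878, 12882
Analysis of SNPs in the parents and 11 children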
11
50x raw data was aligned and variants called using BWA + GATK + VQSR
– Accurate calls were supplemented with accurate variant calls made by Cortex using
the same sequence data and accurate CGI calls made across the same pedigree
First step is to define the inheritance of the parental chromosomes to the eleven
children everywhere in the genome
– Identified 709 crossover events between the parents and eleven children
Define accurate variants as those where the genotypes are 100% consistent
with the transmission of the parental haplotypes
– At any position of the genome there are only 16 possible combinations of genotypes
(biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
– 3^13 (~1.6M) possible genotype combinations
Subsequent analysis mostly excludes all variants that are homozygous
alternative across the last two generations of this pedigree (~750k)
– Mostly will be accurate but for these “trivially consistent” sites we cannot differentiate
accurate from systematic errors or validate ploidy
Analysis of the data
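The counts of 16 consistent combinations versus 3^13 possible ones can be checked with a minimal sketch. The inheritance vector below is made up for illustration (the real one comes from the crossover analysis), and genotypes are coded as alt-allele counts; both choices are assumptions:

```python
from itertools import product

# Hypothetical inheritance vector at one locus: for each of the 11 children,
# (index of maternal haplotype inherited, index of paternal haplotype inherited).
INHERITANCE = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 0), (1, 1),
               (0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

def consistent_patterns(inheritance):
    """All genotype vectors (mother, father, children) consistent with transmission."""
    patterns = set()
    # Each of the four parental haplotypes carries allele 0 (ref) or 1 (alt).
    for m0, m1, f0, f1 in product((0, 1), repeat=4):
        mother, father = (m0, m1), (f0, f1)
        # Genotypes are stored as alt-allele counts (0, 1 or 2), i.e. unphased.
        row = [sum(mother), sum(father)]
        row += [mother[i] + father[j] for i, j in inheritance]
        patterns.add(tuple(row))
    return patterns

patterns = consistent_patterns(INHERITANCE)
print(len(patterns))  # 16
print(3 ** 13)        # 1594323
```

Only 16 of roughly 1.6M genotype vectors over 13 samples fit the inheritance, so a set of calls is unlikely to match by chance.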
12
[Flowchart]
Inputs: Set A, Set B, Set C
Compare Against Inheritance
– NO CONFLICTS: Score (plat./gold), stored in db w/score
– CONFLICTS: Assess Problem
  – BIOLOGY: Score (gold/silver), stored in db w/comments
  – BAD: Comment, stored in db w/comments
Input all possible data and use the inheritance to separate good from bad:
variants are unlikely to accidentally match inheritance
13
Cataloging the accurate SNPs
14
Accurate SNP positions based on the pedigree analysis
[Bar chart. X-axis: GATK Site Description* (All Pass, Filtered). Y-axis: counts (millions), 0.0 to 3.5. Legend (pedigree analysis): Correct, Problematic. Value labels: 408,915 and 3,217,748]
Additional 754,014 SNPs are “trivially consistent”, i.e. all 13 samples are hom alt.
Note on the Filtered sites: normally might exclude these from our analysis because the variant caller filtered some of the calls
*Filtered means that at least one variant call was called but quality filtered
15
Hamming distance for the “accurate” SNPs to the 2nd best solution
[Histogram. X-axis: Hamming distance (0 to 13). Y-axis: percent (0 to 60)]
At these sites >85% of the positions would require at least four (very specific) genotype errors to have erroneously ended up with the observed predicted-accurate calls
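One way to read this histogram: for a site that matches one consistent pattern exactly, count how many genotype calls would have to change to reach the next-closest consistent pattern. A minimal sketch, assuming a made-up inheritance vector for 11 children and genotypes coded as alt-allele counts (both are illustrative and not taken from the slides):

```python
from itertools import product

# Hypothetical inheritance vector: (maternal, paternal) haplotype index per child.
INHERITANCE = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 0), (1, 1),
               (0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

def pattern(m, f, inheritance):
    """Alt-allele counts for mother, father and each child."""
    return (sum(m), sum(f)) + tuple(m[i] + f[j] for i, j in inheritance)

def second_best_distance(observed, inheritance):
    """Smallest Hamming distance from `observed` to any other consistent pattern."""
    distances = []
    for m0, m1, f0, f1 in product((0, 1), repeat=4):
        candidate = pattern((m0, m1), (f0, f1), inheritance)
        d = sum(a != b for a, b in zip(candidate, observed))
        if d > 0:  # skip the pattern that matches exactly
            distances.append(d)
    return min(distances)

# Example site: mother is het, with the alt allele on her haplotype 0.
observed = pattern((1, 0), (0, 0), INHERITANCE)
print(second_best_distance(observed, INHERITANCE))  # 6
```

In this toy case six specific genotype errors would be needed to turn any other consistent pattern into the observed calls.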
16
Using other call sets for a more comprehensive catalogue
[Bar chart. X-axis: Cortex, CGI. Y-axis: counts (x1000), 0 to 60. Legend (pedigree analysis): Unique, Common. Value labels: 22,922 (0.6%) and 57,270 (1.6%)]
17
Concordance between “pedigree-accurate” GTs
Comparison*      # Sites      # Diff GTs   # Same GTs    GT Concordance
GATK & Cortex    2,053,136    5            26,690,763    99.99998%
GATK & CGI       3,146,399    19           40,903,168    99.99995%
Cortex & CGI     1,890,718    7            24,579,327    99.99997%
*Excluding sites where alleles did not match or all samples homozygous alternative
Includes 763,085 GT calls and 264,771 positions quality filtered by GATK
Attempting to validate a sample of the sites that are unique to a single call set
– Targeting ~300 per call set
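The GT concordance column for the SNP comparisons follows from the two genotype counts. A minimal sketch, assuming concordance is defined as same / (same + different), which reproduces the percentages reported above:

```python
# (# Diff GTs, # Same GTs) for each pairwise SNP comparison, copied from the table.
COMPARISONS = {
    "GATK & Cortex": (5, 26_690_763),
    "GATK & CGI": (19, 40_903_168),
    "Cortex & CGI": (7, 24_579_327),
}

def concordance_percent(n_diff, n_same):
    # Share of compared genotype calls that agree, as a percentage.
    return 100.0 * n_same / (n_same + n_diff)

for name, (n_diff, n_same) in COMPARISONS.items():
    print(f"{name}: {concordance_percent(n_diff, n_same):.5f}%")
```

Each site contributes 13 genotype comparisons (one per sample), which is why the genotype counts are about 13 times the site counts.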
18
Indel analysis
19
Accurate GATK indel positions based on pedigree
[Bar chart. X-axis: Site Description (All Pass, Filtered). Y-axis: counts (thousands), 0 to 250. Legend (pedigree analysis): Correct, Problematic. Value labels: 141,508 and 240,490]
Additional 115,587 indels are “trivially consistent”, i.e. all 13 samples are hom alt.
20
Using other call sets for a more comprehensive catalogue
[Bar chart. X-axis: Cortex, CGI. Y-axis: counts (x1000), 0 to 60. Legend (pedigree analysis): Unique, Common. Value labels: 39,335 (10%) and 9,637 (2.4%)]
21
Concordance between overlapping “accurate” indels
Comparison*      # Sites    # Diff GTs   # Same GTs   GT Concordance
GATK & Cortex    96,228     43           1,250,921    99.997%
GATK & CGI       219,445    2,817        2,514,785    99.901%
Cortex & CGI     78,050     198          1,014,650    99.981%
*Excluding sites where alleles did not match or all samples homozygous alternative
Attempting to validate a sample of the sites that are unique to a single call set
– Targeting ~300 per call set
22
CNVs
23
Conflict mode: Hemizygous deletions
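[Figure: possible GT patterns compared with the observed genotypes, two alleles per sample. Columns: MOM, DAD and children 1, 3, 2, 6, 5, 4, 7]

                     MOM  DAD  1    3    2    6    5    4    7    # Errors / Hamming Distance
Possible GT pattern  AT   AA   TA   AA   TA   AA   TA   AA   AA   6
Possible GT pattern  TA   AA   AA   TA   AA   TA   AA   TA   TA   7
Possible GT pattern  AA   AT   AT   AT   AA   AA   AA   AT   AA   2
Possible GT pattern  AA   TA   AA   AA   AT   AT   AT   AA   AT   7
OBSERVED GENOTYPES   AA   AT   AT   TT   AA   AA   AA   TT   AA

“Best” solution still indicates multiple errors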
24
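Conflict mode: Hemizygous deletions
[Figure: possible GT patterns that include a deleted allele ("-"), compared with the observed genotypes. Columns: MOM, DAD and children 1, 3, 2, 6, 5, 4, 7. # Errors / Hamming distance values shown: 6, 5, 0, 7]

                     MOM  DAD  1    3    2    6    5    4    7
Possible GT pattern  A-   AT   -T   AT   -A   AA   -A   AT   AA
Possible GT pattern  -A   TA   AA   -A   AT   -T   AT   -A   -T
Possible GT pattern  -A   AT   AT   -T   AA   -A   AA   -T   -A
Possible GT pattern  A-   TA   -A   AA   -T   AT   -T   AA   AT
OBSERVED GENOTYPES   AA   AT   AT   TT   AA   AA   AA   TT   AA

100% consistent therefore we predict that there is a deletion
Hamming distance will be less when including deletions so need to be careful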
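The deletion argument can be checked directly on the genotypes from these two slides. This sketch assumes that a hemizygous sample (one allele plus "-") is reported by the caller as homozygous for the remaining allele; the helper names are illustrative, and the rows are in the column order MOM, DAD, 1, 3, 2, 6, 5, 4, 7:

```python
def observed(row):
    """Genotypes a caller would report: a lone allele next to '-' looks homozygous."""
    calls = []
    for gt in row.split():
        alleles = gt.replace("-", "")
        if len(alleles) == 1:   # hemizygous: one copy is deleted
            alleles *= 2        # reported as e.g. "AA" or "TT"
        calls.append("".join(sorted(alleles)))
    return calls

OBSERVED = "AA AT AT TT AA AA AA TT AA"
BEST_WITHOUT_DELETION = "AA AT AT AT AA AA AA AT AA"
WITH_DELETION = "-A AT AT -T AA -A AA -T -A"

def hamming(a, b):
    # Number of samples whose reported genotype differs.
    return sum(x != y for x, y in zip(observed(a), observed(b)))

print(hamming(BEST_WITHOUT_DELETION, OBSERVED))  # 2
print(hamming(WITH_DELETION, OBSERVED))          # 0
```

Without a deletion the best pattern still needs two genotype errors; allowing one deleted maternal haplotype explains the observed calls with none.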
25
Read depth of 5,180 SNPs predicted to overlap deletions
[Histogram. X-axis: depth (0 to 100). Y-axis: counts (0 to 5000). Legend: Haploid, Diploid, Hom Del. Genotype labels: AA, AB, BB, A-, AB, -B]
Depth shown for positions where the genotypes indicate that the SNP overlaps a deletion. Large number of children allows us to more reliably separate errors from deletions.
26
Have many potential large deletions to validate…
5,180 SNPs are predicted to overlap a hemizygous deletion
These SNPs cluster into ~902 unique events
– Clusters show evidence for ~279 deletions >1kb segregating in this pedigree
– Largest event is >152kb with 274 SNPs supporting the call
Have begun validating these events beyond just visual inspection
– 132 overlap with previously reported events (1kGP)
– Working to define the breakpoints for wet lab validation
Incorporating other calling methods (Cortex, breakdancer…)
Some SNPs also support the presence of duplications in a single parent
27
We have sequenced a large pedigree and used the inheritance information to
create a catalogue of ~4.45M accurate SNP calls
– Over 3.7M biallelic SNPs agree with transmission of parental chromosomes
– Over 750k homozygous alternative SNPs are trivially accurate across the pedigree
Have also called indels using four different methods to produce over 550k
“accurate” indel calls across the pedigree
– Over 428k bi-allelic indels agree with transmission of parental chromosomes
– Over 110k homozygous alternative indels are trivially accurate across the pedigree
Concordance for the bi-allelic, pedigree-accurate calls is >99.9999% for SNPs
and 99.9% for indels between call sets
SVs are in progress (just deletions right now)
The SNP and indel results presented here can be used for comparison
– Incorporating homozygous reference calls across the pedigree for completeness
– May see immediate gains by testing new algorithms against a better truth set
Summary
28
Acknowledgements
Morten Kallberg – alignment & variant calling
Han-Yu Chuang – analysis of SNP calls
Phil Tedder – validation of de novo SNPs
Sean Humphray
Epameinondas Fritzilas
Wendy Wong
David Bentley
Elliott Margulies