Post on 24-Jun-2015
Comparing Variant Calls
Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine
In collaboration with Real Time Genomics, Inc.
GENOME-IN-A-BOTTLE WORKSHOP
rtgTools v1.0
A toolkit to compare and analyze VCFs
• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files
Compiled Java code is freely available at the GiaB repository:
ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
Issues in representation of complex calls
Indel in homopolymer
Reference CAAAAAAG
Baseline  C..AAAAG
Called    CAAAA..G
After replay:
Baseline  CAAAAG
Called    CAAAAG
MNPs
Reference CAACGTAAG
Baseline  CAATGTCAG
Called    CAATGTCAG
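The "replay" idea on this slide can be sketched in a few lines of Python: apply each call set to the reference and compare the resulting haplotypes rather than the variant records. This is a minimal illustration, not the rtgTools implementation. The `replay` function and the (position, REF, ALT) encodings are assumptions made for the example. In particular, the split of the MNP into one block substitution versus two SNPs is a hypothetical representation difference, since the slide shows only the resulting sequences.

```python
def replay(reference, variants):
    """Apply non-overlapping (pos, ref, alt) edits to a reference string.

    pos is 0-based and ref must equal the reference bases starting at pos.
    """
    result = reference
    # Work right to left so earlier coordinates stay valid after each edit.
    for pos, ref, alt in sorted(variants, reverse=True):
        if result[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        result = result[:pos] + alt + result[pos + len(ref):]
    return result


# Indel in a homopolymer: the same 2 bp deletion placed at either end of the run.
homopolymer_ref = "CAAAAAAG"
baseline_indel = [(0, "CAA", "C")]   # left-aligned placement (assumed encoding)
called_indel = [(4, "AAA", "A")]     # right-shifted placement (assumed encoding)

# MNP: one block substitution versus two separate SNPs (hypothetical encodings).
mnp_ref = "CAACGTAAG"
as_mnp = [(3, "CGTA", "TGTC")]
as_snps = [(3, "C", "T"), (6, "A", "C")]

print(replay(homopolymer_ref, baseline_indel), replay(homopolymer_ref, called_indel))
print(replay(mnp_ref, as_mnp), replay(mnp_ref, as_snps))
```

Comparing replayed haplotypes sidesteps the need to normalize every caller's representation before comparison.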
Issues in representation of complex calls
Dinucleotide repeat
Reference ACGTACCAGATATCACAACATATATATA
Baseline  ACGGACCAG..ATCACAACATATATATATA
Called    ACGGACCAGAT..CACAACATATATATATA
After replay:
Baseline  ACGGACCAGATCACAACATATATATATA
Called    ACGGACCAGATCACAACATATATATATA
Best path Link mutations ROC
Comparison of variant call set with baseline set
Basic rules
• Match the baseline and called sequences so as to maximize true positives and minimize false positives and false negatives.
• True positives + false negatives = total calls in the baseline
• Heterozygous calls match only when both calls are heterozygous and their alleles agree
Path creation
• A path is a selection of a subset of calls
• Best path: the path that maximizes true positives and minimizes errors
• In theory there is an exponential number of paths; in practice this can be solved by dynamic programming
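To make the path definition concrete, here is a deliberately naive sketch that enumerates every pair of subsets and keeps the pair that replays to the same haplotype while including the most calls. It is the exponential search the slide warns about, not the dynamic programming used in practice, and it covers only the homozygous (single haplotype) case. The function names and the toy variants are assumptions for illustration.

```python
from itertools import combinations


def replay(reference, variants):
    """Return the haplotype after applying (pos, ref, alt) edits, or None if edits overlap."""
    result, limit = reference, len(reference)
    for pos, ref, alt in sorted(variants, reverse=True):
        if pos + len(ref) > limit:      # runs into the edit to its right (or off the end)
            return None
        result = result[:pos] + alt + result[pos + len(ref):]
        limit = pos
    return result


def subsets(items):
    """Yield every subset of items as a tuple, smallest first."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def best_path(reference, baseline, called):
    """Brute force: the largest pair of subsets whose replayed haplotypes agree."""
    best = ((), ())
    for b_sub in subsets(baseline):
        target = replay(reference, b_sub)
        if target is None:
            continue
        for c_sub in subsets(called):
            if replay(reference, c_sub) == target:
                if len(b_sub) + len(c_sub) > len(best[0]) + len(best[1]):
                    best = (b_sub, c_sub)
    return best


reference = "CAAAAAAGTC"
baseline = [(0, "CAA", "C"), (8, "T", "G")]   # toy calls, 0-based positions
called = [(4, "AAA", "A"), (9, "C", "A")]

b_used, c_used = best_path(reference, baseline, called)
false_negatives = [v for v in baseline if v not in b_used]
false_positives = [v for v in called if v not in c_used]
print(b_used, c_used, false_negatives, false_positives)
```

With n baseline and m called variants this tries 2^(n+m) combinations, which is why a real comparison needs the dynamic programming formulation.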
Path creation - simple homozygous case
[Figure: baseline and called variants at positions a to h along the reference, with the best path drawn between them. Variants left off the best path are labelled "False positive (excluded)" or "False negative (excluded)".]
Path creation - simple heterozygous case (non-phased)
[Figure: baseline and called variants at positions a to f along the reference, with the best path drawn between them. Variants left off the best path are labelled "False positive (excluded)" or "False negative (excluded)".]
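For the non-phased heterozygous case, one way to picture the matching rule is to try every assignment of heterozygous variants to the two haplotypes and compare the resulting unordered haplotype pairs. The sketch below does that by brute force. The zygosity labels ("het", "hom"), the function names and the example calls are assumptions for illustration, and this is not how rtgTools implements the search.

```python
from itertools import product


def replay(reference, edits):
    """Apply non-overlapping (pos, ref, alt) edits, right to left."""
    result = reference
    for pos, ref, alt in sorted(edits, reverse=True):
        result = result[:pos] + alt + result[pos + len(ref):]
    return result


def diploid_haplotypes(reference, variants):
    """Return every unordered haplotype pair reachable by phasing the het variants.

    variants: (pos, ref, alt, zygosity) with zygosity "hom" or "het".
    """
    hets = [v[:3] for v in variants if v[3] == "het"]
    homs = [v[:3] for v in variants if v[3] == "hom"]
    pairs = set()
    for phase in product((0, 1), repeat=len(hets)):
        haps = [list(homs), list(homs)]      # hom variants go on both copies
        for side, variant in zip(phase, hets):
            haps[side].append(variant)       # het variants go on one copy
        pairs.add(tuple(sorted(replay(reference, h) for h in haps)))
    return pairs


def unphased_match(reference, baseline, called):
    """True if some phasing of each call set yields the same pair of haplotypes."""
    return bool(diploid_haplotypes(reference, baseline)
                & diploid_haplotypes(reference, called))


reference = "CAAAAAAG"
baseline_het = [(0, "CAA", "C", "het")]
called_het = [(4, "AAA", "A", "het")]   # same deletion, different placement
called_hom = [(4, "AAA", "A", "hom")]   # same allele, wrong zygosity

print(unphased_match(reference, baseline_het, called_het))
print(unphased_match(reference, baseline_het, called_hom))
```

The second comparison fails even though the allele is identical, which mirrors the rule that both calls must be heterozygous and their alleles must agree.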
Why is weighting needed?
TP + FN = Total_baseline
Reference CAACAACTATCCTC....ATCT....GC
Baseline CAACAACTATCCTCATCTATCTATCTGC
Called CAACAACTATCCTCATCTATCTATCTGC
Sync points
Reference ACAGTCACGG
Baseline  ACGGTCACTG
Called    ACGGTTACGG

Reference AC AGT CAC GG
Baseline  AC GGT CAC TG
Called    AC GGT TAC GG
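The segmentation shown on this slide can be reproduced with a simple rule: merge the reference spans of all baseline and called variants into clusters, then cut the sequence at the start of each cluster. This is only one plausible reading of how sync points are placed, offered as a sketch. The slide does not define the rule, and `cluster_starts` and `split_at` are hypothetical helper names.

```python
def cluster_starts(spans):
    """Merge overlapping half-open (start, end) spans and return each cluster's start."""
    starts = []
    current_end = -1
    for start, end in sorted(spans):
        if start >= current_end:          # begins a new cluster
            starts.append(start)
            current_end = end
        else:                             # overlaps the current cluster
            current_end = max(current_end, end)
    return starts


def split_at(sequence, cuts):
    """Split a sequence at the given 0-based cut positions."""
    bounds = [0] + [c for c in cuts if 0 < c < len(sequence)] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


reference = "ACAGTCACGG"
baseline_hap = "ACGGTCACTG"
called_hap = "ACGGTTACGG"

# SNP spans read off the slide: baseline differs at 2 and 8, called at 2 and 5.
baseline_spans = [(2, 3), (8, 9)]
called_spans = [(2, 3), (5, 6)]

cuts = cluster_starts(baseline_spans + called_spans)
print(split_at(reference, cuts))
print(split_at(baseline_hap, cuts))
print(split_at(called_hap, cuts))
```

Cutting all three sequences at the same coordinates works here only because every variant is a SNP. With indels the haplotypes would need their own coordinates.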
Weighting
[Equation: weight as a function of B and C]
where B is the number of baseline variants between the current (S_n) and previous (S_{n-1}) sync points, and C is the number of called variants between the same two sync points.
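The weighting formula itself is not reproduced in this transcript. The sketch below assumes the weight of each called variant is B / C, which is consistent with the 0.5 weights on the "Complex weighting" slide but is an assumption, not a confirmed definition. The interval list is likewise one reading of that slide: four intervals with one baseline and one called variant each, and one interval where two called variants match a single baseline variant.

```python
def called_weights(intervals):
    """Return one weight per called variant, given (B, C) counts per sync interval.

    B = matched baseline variants in the interval, C = matched called variants.
    Assumes weight = B / C for each called variant (not confirmed by the slides).
    """
    weights = []
    for b, c in intervals:
        if c == 0:
            continue                      # no called variants to weight here
        weights.extend([b / c] * c)
    return weights


# One reading of the "Complex weighting" slide (assumed interval counts).
intervals = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 2)]

weights = called_weights(intervals)
weighted_tp = sum(weights)
baseline_total = sum(b for b, _ in intervals)

print(weights)
print(weighted_tp, baseline_total)
```

Under this assumption the weighted true positives sum to the number of matched baseline variants, so TP + FN still equals the baseline total even when the called set splits one baseline variant into several records.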
Simple homozygous weighting
[Figure: baseline and called variants at positions a to f, with sync points marked and one variant each labelled "False positive (excluded)" and "False negative (excluded)". Weights shown: 1 1 1 1 1 1, then 1 and 1.]

Type  Weighted total
TP    6
FP    1
FN    1
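Simple heterozygous case (non-phased) weighting
[Figure: baseline and called variants at positions a to f, with a sync point marked and variants labelled "False positive (excluded)" and "False negative (excluded)". Weights shown: 1 1 1 1, then 1 and 2.]

Type  Weighted total
TP    4
FP    1
FN    2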
Complex weighting
[Figure: called and baseline variants at positions a to f, with a sync point marked. Weights shown: 1 1 1 1 0.5 0.5.]

Type  Weighted total
TP    5
FP    0
FN    0
ROC Plot
http://biorxiv.org/content/early/2014/01/24/001958
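A ROC curve of this kind can be built by sorting the called variants by score and accumulating weighted true positives against false positives as the threshold is lowered. The sketch below shows that accumulation only. The tuple layout, the scores and the weights are made up for illustration, and it makes no claim about the exact output format of vcfeval or rocplot.

```python
def roc_points(calls):
    """Return cumulative (false_positives, true_positives) points, best score first.

    calls: iterable of (score, is_true_positive, weight).
    """
    points = [(0.0, 0.0)]
    tp = fp = 0.0
    for score, is_tp, weight in sorted(calls, key=lambda c: c[0], reverse=True):
        if is_tp:
            tp += weight
        else:
            fp += weight
        points.append((fp, tp))
    return points


# Hypothetical scored calls: (score, matched the baseline?, weight).
calls = [
    (50, True, 1.0),
    (45, True, 0.5),
    (40, True, 0.5),
    (30, False, 1.0),
    (20, True, 1.0),
]

points = roc_points(calls)
for fp, tp in points:
    print(fp, tp)
```

Each point corresponds to one score threshold, so a caller's curve shows how many true positives are gained for each additional false positive accepted.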
Acknowledgements
RTG, Hamilton, New Zealand: John Cleary, Len Trigg, Mehul Rathoud
Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)
This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA.
© 2014 Real Time Genomics, Inc. All rights reserved.