Avoiding Nonsense Results in Your NGS Variant Studies
James Lyons-Weiler, PhD
Scientific Director / Senior Research Scientist
Bioinformatics Analysis Core
Genomics & Proteomics Core Laboratories
University of Pittsburgh
Pittsburgh, PA
May 1, 2014
Two Parts
• Identifying sites with low genotypic signal increases concordance among variant callers
• Hazards in finding differentially expressed genes in RNASeq – how to do it more robustly.
23andMe: High risk of RA and psoriasis
GTL: Low risk of RA and psoriasis
NYTimes Article, etc.
Data were from an Illumina HiSeq 2000
Among-method average concordance: 57.5% overall; 32.7% at high coverage
O’Rawe et al.
TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)
SEQUENCER
MAPPER
VARIANT CALLERS
LOW CONCORDANCE (O’Rawe et al., 2013)
Consensus Analysis (e.g., 2/3, 3/4, set analysis)
Information Theory (-> modeling)
Improve Callers (fix errors, modeling)
Bake Offs
Simulations
Spiked Ins
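The consensus approach named above (e.g., 2 of 3 callers agreeing) can be sketched as follows. This is a minimal illustration, not the procedure used in the talk: the caller names, site labels, and genotypes are hypothetical toy data, and `consensus_calls` is a helper introduced only for this example.

```python
from collections import Counter

def consensus_calls(calls_by_caller, min_votes):
    """Keep a site's genotype only if at least min_votes callers report it."""
    votes = {}
    for calls in calls_by_caller.values():
        for site, genotype in calls.items():
            votes.setdefault(site, Counter())[genotype] += 1
    consensus = {}
    for site, counter in votes.items():
        genotype, n_votes = counter.most_common(1)[0]
        if n_votes >= min_votes:
            consensus[site] = genotype
    return consensus

# Hypothetical calls from three callers (toy data, not from O'Rawe et al.)
calls = {
    "caller_a": {"chr1:100": "AG", "chr1:200": "TT", "chr1:300": "CG"},
    "caller_b": {"chr1:100": "AG", "chr1:200": "TT"},
    "caller_c": {"chr1:100": "AG", "chr1:200": "CT", "chr1:300": "CC"},
}

two_of_three = consensus_calls(calls, min_votes=2)
three_of_three = consensus_calls(calls, min_votes=3)
print(two_of_three)
print(three_of_three)
```

Raising `min_votes` trades sensitivity for agreement: sites on which callers split are dropped rather than resolved.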
Entropy of Base Distributions
[Figure: three example base distributions over A, T, C, G. Two panels: low entropy, high enthalpy. One panel: high entropy, low enthalpy.]
Boltzmann Entropy
• s = k ln w (Planck)
• w = antiln(s/k)
http://schneider.ncifcrf.gov/images/boltzmann/boltzmann-tomb-4.html
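One way to read the relation above for a sequenced site is to set k = 1 and take s to be the Shannon entropy (in nats) of the observed base frequencies, so that w = exp(s). That reading is an assumption, not something the slides state. It does agree with the deck's statements that homozygotes give w = 1 and heterozygotes give w = 2, but the exact formula used by the author may differ in detail. The function name `w_from_counts` is hypothetical.

```python
import math

def w_from_counts(counts):
    """Return w = exp(Shannon entropy in nats) of base counts (A, T, C, G).

    Assumes k = 1 and that s is the Shannon entropy of the observed
    base frequencies; the slides do not confirm this exact definition.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads at this site")
    entropy = 0.0
    for c in counts:
        if c > 0:  # terms with zero count contribute nothing
            p = c / total
            entropy -= p * math.log(p)
    return math.exp(entropy)

# Clean homozygote, balanced heterozygote, and uniform noise
for counts in [(200, 0, 0, 0), (100, 100, 0, 0), (50, 50, 50, 50)]:
    print(counts, round(w_from_counts(counts), 6))
```

Under this reading w behaves like an effective number of bases present at the site, ranging from 1 (one base only) to 4 (all four bases equally represented).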
Rank-Sorted Distribution of w (O’Rawe et al. data)
Homozygotes w = 1
Heterozygotes w = 2
Example w Density Distribution
w and FBVC
A    T    C    G    w         pw      Zygosity       Genotype
200  0    0    0    1         0       Homozygote     AA
16   158  13   13   2.102558  0       Homozygote     TT
100  100  0    0    2         0       Heterozygote   AT
58   30   1    111  2.768507  0       Heterozygote   AG
28   80   14   78   3.303636  0       Heterozygote   TG
76   38   29   57   3.758733  0       Heterozygote   AG
33   49   60   58   3.895496  0.0126  Heterozygote?  CG?
50   50   50   50   4         1       noise          unknown
Operational* Equiprobable Null Distribution
{f(A) = f(T) = f(G) = f(C)}
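The slides do not say how the significance pw is computed against this null. One plausible sketch, offered purely as an assumption, is a Monte Carlo test: draw the same number of reads with each base equally likely, compute w for each draw, and report the fraction of draws whose w is no larger than the observed w. The names `w_from_counts` and `pw_monte_carlo`, the number of simulations, and the definition of w as exp(Shannon entropy) are all illustrative choices.

```python
import math
import random
from collections import Counter

def w_from_counts(counts):
    """w = exp(Shannon entropy in nats) of the base counts (assumed definition)."""
    total = sum(counts)
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    return math.exp(entropy)

def pw_monte_carlo(counts, n_sim=2000, seed=0):
    """Estimate P(w_null <= w_observed) under f(A) = f(T) = f(G) = f(C)."""
    rng = random.Random(seed)
    n_reads = sum(counts)
    w_obs = w_from_counts(counts)
    hits = 0
    for _ in range(n_sim):
        draw = Counter(rng.choices("ATCG", k=n_reads))  # equiprobable bases
        w_sim = w_from_counts([draw[base] for base in "ATCG"])
        if w_sim <= w_obs + 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / n_sim

print(pw_monte_carlo((200, 0, 0, 0)))     # clean homozygote
print(pw_monte_carlo((50, 50, 50, 50)))   # indistinguishable from noise
```

A small pw means the site's base distribution is far more ordered than equiprobable noise would produce, so a genotype call is supportable; a pw near 1 means the site carries no genotypic signal.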
Convergence of significance (pw)
What We Expect
TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)
SEQUENCER
MAPPER
VARIANT/BASE CALLERS
Genotypic Signal Filtering
INCREASED CONCORDANCE
Phom Function
Caller    Subset    Concordance w/ FBVC  Hom    Het
gatk      ALL       0.5762               11868  17670
gatk      pw<=0.05  0.9976               11282  5676
gatk      pw>0.05   0.0074               586    11994
samtools  ALL       0.5649               11541  18799
samtools  pw<=0.05  0.9917               11489  5761
samtools  pw>0.05   0.0002               52     13038
snver     ALL       0.6006               11904  16729
snver     pw<=0.05  0.9934               11812  5470
snver     pw>0.05   0.0007               92     11259
From the O’Rawe et al. generated results
FBVC = frequency-based variant caller (Lyons-Weiler et al.)
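The stratified comparison reported here (concordance over all sites, over sites with pw <= 0.05, and over sites with pw > 0.05) can be sketched as below. The site records are hypothetical toy data, and the field names `caller`, `fbvc`, and `pw` are assumptions made for the example; this is not the Gconf implementation.

```python
def concordance(sites, keep=lambda site: True):
    """Fraction of selected sites where the caller's genotype matches FBVC's."""
    selected = [s for s in sites if keep(s)]
    if not selected:
        return float("nan")
    agree = sum(1 for s in selected if s["caller"] == s["fbvc"])
    return agree / len(selected)

# Hypothetical per-site records (toy data)
sites = [
    {"caller": "AA", "fbvc": "AA", "pw": 0.0},
    {"caller": "AG", "fbvc": "AG", "pw": 0.0},
    {"caller": "TT", "fbvc": "TT", "pw": 0.01},
    {"caller": "CG", "fbvc": "CC", "pw": 0.03},
    {"caller": "AT", "fbvc": "AA", "pw": 0.40},
    {"caller": "GG", "fbvc": "CG", "pw": 0.90},
]

overall = concordance(sites)
high_signal = concordance(sites, keep=lambda s: s["pw"] <= 0.05)
low_signal = concordance(sites, keep=lambda s: s["pw"] > 0.05)
print(overall, high_signal, low_signal)
```

Reporting the three numbers side by side shows how much of the overall disagreement is concentrated in sites that lack genotypic signal.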
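Comparison    Tx                                     Signal    %Concordance
FBVC_vs_FBVC  Marked                                 ALL       85.64
FBVC_vs_FBVC  Marked                                 pw<=0.05  91.08
FBVC_vs_FBVC  Marked                                 pw>0.05   35.66
FBVC_vs_FBVC  Realigned                              ALL       83.82
FBVC_vs_FBVC  Realigned                              pw<=0.05  91.69
FBVC_vs_FBVC  Realigned                              pw>0.05   28.21
FBVC_vs_FBVC  Recalibrated                           ALL       93.14
FBVC_vs_FBVC  Recalibrated                           pw<=0.05  ***99.39
FBVC_vs_FBVC  Recalibrated                           pw>0.05   48.53
FBVC_vs_FBVC  Reduced                                ALL       21.54
FBVC_vs_FBVC  Reduced                                pw<=0.05  24.57
FBVC_vs_FBVC  Reduced                                pw>0.05   4.25
FBVC_vs_FBVC  Marked-Realigned                       ALL       76.91
FBVC_vs_FBVC  Marked-Realigned                       pw<=0.05  86.11
FBVC_vs_FBVC  Marked-Realigned                       pw>0.05   15.44
FBVC_vs_FBVC  Marked-Realigned-Recalibrated          ALL       76.73
FBVC_vs_FBVC  Marked-Realigned-Recalibrated          pw<=0.05  85.99
FBVC_vs_FBVC  Marked-Realigned-Recalibrated          pw>0.05   15.34
FBVC_vs_FBVC  Marked-Realigned-Recalibrated-Reduced  ALL       19.98
FBVC_vs_FBVC  Marked-Realigned-Recalibrated-Reduced  pw<=0.05  22.9
FBVC_vs_FBVC  Marked-Realigned-Recalibrated-Reduced  pw>0.05   2.66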
Lifescope reads (red)
Shrimp2 reads (blue)
Mappers must be systematically evaluated
Part 2: Good and Bad News for RNASeq (and everything else):
The Bad News:
Fold Change is Biased.
The Good News:
We have identified a much less biased method.
T-test is not appropriate for small N, large P data
(such as RNASeq)
Fold Change > 2.0
Delta > 25
FC(A/B) is Blind to Large Portions of Your Data
FC(A/B)
Delta (and J5: Patel & Lyons-Weiler, 2004)
Ratios Are Hard to Interpret as Biological Differences
Gene A B delta (A-B) FC(A/B)
gene1 5 3 2 1.667
gene2 50 30 20 1.667
gene3 500 300 200 1.667
gene4 5000 3000 2000 1.667
gene5 50000 30000 20000 1.667
A-B is a difference.
A/B is a quotient.
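The arithmetic in the table can be checked directly. The short sketch below recomputes delta and fold change for the five rows; the variable names are chosen for the example only.

```python
# Expression values (gene, A, B) copied from the table above
rows = [
    ("gene1", 5, 3),
    ("gene2", 50, 30),
    ("gene3", 500, 300),
    ("gene4", 5000, 3000),
    ("gene5", 50000, 30000),
]

deltas = [a - b for _, a, b in rows]        # difference A - B
fold_changes = [a / b for _, a, b in rows]  # quotient A / B

for (gene, a, b), delta, fc in zip(rows, deltas, fold_changes):
    print(f"{gene}\tdelta={delta}\tFC={fc:.3f}")
```

Every row has the same fold change (about 1.667) while delta spans four orders of magnitude, so a fixed fold-change cutoff treats a difference of 2 units and a difference of 20000 units identically.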
Log2 Transformation Does Not Help
Reveals Minor Delta (& J5) Bias
Pink = FC(A/B); Black = Delta
G-Thresholding J5
FC Bias in Amyotrophic Lateral Sclerosis
[Figure: scatter plot, Control (x-axis, 0 to 200000) vs. ALS (y-axis, 0 to 350000); series: DEGy, FCDEGy]
Black circles = FC(A/B). Pink = Gthr-J5 genes
Black circles = FC(A/B). Pink = Gthr-J5 genes
FC(A/B) Bias in Alcohol-Induced Hepatitis
Conclusions
• Not all NGS/HTS sites have sufficient genotypic signal to warrant a base call. High coverage alone does not provide a solution.
• By measuring genotypic signal, we can determine which sites we can call with confidence.
• Fold change (FC(A/B)) is blind to highly expressed genes and should be abandoned altogether as a measure of differential expression, even for single-gene or single-protein studies!
• Published microarray data sets analyzed to date using only FC(A/B) are a gold mine for re-analysis using less biased methods.
Credits and Contact
• pw, pHom, etc.: James Lyons-Weiler, Alan Twaddle, Rahil Sethi
– (MS in preparation)
– Our software is called Gconf (not yet available)
• Fold-Change Bias: James Lyons-Weiler, Tamanna Sultana, Rick Jordan, Rahil Sethi
– (Paper in review)
– For now, read:
• Mariani TJ, Budhraja V, Mecham BH, Gu CC, Watson MA, Sadovsky Y. 2003. A variable fold change threshold determines significance for expression microarrays. FASEB J. 17:321-3. doi: 10.1096/fj.02-0351fje
• Pearson K. 1897. On a form of spurious correlation that may arise when indices are used for the measurement of organs. Proc Roy Soc Lond. 60:489-498. doi: 10.1098/rspl.1896.0076