Date post: 16-Jul-2020
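[P-value annotations extracted from the poster's figures; panel assignment is not recoverable (ns = not significant): 2.20e-206, 3.55e-92, 2.95e-07, 2.32e-08, 1.49e-08, 3.74e-23, ns, 4.19e-05, 2.48e-08, 1.53e-06, ns, 2.32e-08]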

MAPPIN: Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants

Nehal Gosalia (1,2), Aris N. Economides (1,2), Frederick E. Dewey (1), and Suganthi Balasubramanian (1)

(1) Regeneron Genetics Center, (2) Regeneron Pharmaceuticals, Tarrytown, NY

BACKGROUND

• An average exome can contain 10,500-13,500 nonsynonymous single nucleotide variants (nsSNVs) [1,2], which is lower than expected, suggesting negative selection [3,4]

• A major challenge with whole exome sequencing (WES) is differentiating benign from disease-causing variants

• The HGMD and OMIM® databases show that nsSNVs account for ~45% of disease-causing mutations [5,6], making it critical to identify them

• Many algorithms predict the pathogenicity of nsSNVs; however, none of them can distinguish dominant from recessive disease-causing mutations [7]

• It is important to differentiate between heterozygous dominant-acting variants and heterozygous carrier variants

MAPPIN

• Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants

• Training data sets:
- Pathogenic variants from UniProt, ExoVar [7], subdivided using known dominant and recessive disease-causing genes from OMIM® [6] and others [8,9]
- Haploinsufficient genes subset from dominant genes using haploinsufficiency predictions [10]
- Benign variants from ClinVar [11] refined by a) criteria provided, multiple submitters, no conflict, b) reviewed by expert panel, and c) practice guideline (★★-★★★★)

[Figure 2 workflow diagram. Box labels: Input: VCF file (chr, pos, ref, alt) -> Variant Annotation: 99 evolutionary, functional, network, and allele frequency features -> Annotation Output. Training sets: ExoVar (pathogenic), ClinVar (benign) -> Random Forest Classifier -> Prediction Output: scores for Dominant, Recessive, and Benign classes.]

Category | Features
Evolutionary | GERP score [12], paralogs [13], pseudogenes [14] and other gene annotation metrics, dN/dS rates, average heterozygosity of nsSNVs and synonymous SNVs, nonsynonymous and synonymous SNP density
Functional | Transcript length, variant affecting all/some transcript isoforms, single exon gene, protein domain annotations [13], GTEx expression in individual tissues [15]
Network | Protein-protein interactions (BioGRID) [16], number of networks and interfaces, interactions with known dominant or recessive disease-causing genes (OMIM®) [6]
Allele Frequency | 1000G [17], ESP6500 [18], ExAC [19], pLI score by gene (measure of haploinsufficiency based on constraint, ExAC) [20]

Table 1. Features annotated within MAPPIN and used for predictions, subdivided into categories based on the type of annotation.

Figure 1. Adapted from Li et al., PLoS Genetics, 2013 [7]. Several prediction algorithms and combinations were tested on a dataset composed of known dominant and recessive disease-causing mutations. The figure shows that existing algorithms are unable to call dominant or recessive mutations confidently (AUCs ~0.55).

RESULTS

MAPPIN was trained under two models:
i. Haploinsufficient model, composed of genes causing dominant diseases through haploinsufficiency (multiclass AUC = 0.96)
ii. All dominant model, composed of all dominant disease-causing genes (multiclass AUC = 0.91)

FEATURE IMPORTANCE PLOT

PERFORMANCE OF FEATURE SUBSETS

VALIDATION ON TWO MENDELIAN DATASETS: CMG AND DDDS

Figure 2. Workflow for MAPPIN. The user inputs a VCF file, which is annotated with 99 features and then run through a prediction model based on a random forest classifier trained on benign and pathogenic variants.
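The input and output ends of this workflow can be sketched as follows. This is only an illustration under stated assumptions, not MAPPIN's code: the names `Variant`, `parse_vcf_line`, and `call_class`, the highest-score decision rule, and the example scores are all hypothetical, and the annotation and random forest steps are omitted.

```python
from typing import Dict, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_line(line: str) -> Variant:
    """Extract chr, pos, ref, alt from one tab-delimited VCF data line."""
    fields = line.rstrip("\n").split("\t")
    # Standard VCF column order: CHROM, POS, ID, REF, ALT, ...
    return Variant(fields[0], int(fields[1]), fields[3], fields[4])


def call_class(scores: Dict[str, float]) -> str:
    """Pick the class with the highest score (an assumed decision rule)."""
    return max(scores, key=scores.get)


variant = parse_vcf_line("1\t12345\t.\tA\tG\t.\t.\t.")
# Made-up class scores standing in for the classifier's output
example_scores = {"dominant": 0.7, "recessive": 0.2, "benign": 0.1}
print(variant, call_class(example_scores))
```

Keeping the three class scores rather than only the top call lets downstream steps apply their own thresholds.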

DIFFERENTIATING BETWEEN DOMINANT & RECESSIVE DISEASE-CAUSING VARIANTS

DOMINANT AND RECESSIVE DISCRIMINATION FOR HGMD VARIANTS

Figure 6. Violin plots of score distributions for HGMD [5] variants in dominant and recessive genes. Training variants were excluded from the comparison, and variants were subset using genes from Berg et al. [24]. MAPPIN dominant (A) and recessive (B) class scores for HGMD variants in dominant and recessive genes. CADD (C) and Eigen (D) phred scores [23] for HGMD variants in dominant and recessive genes.

REFERENCES

1. Levy et al., The diploid genome sequence of an individual human. PLoS Biology, 2007
2. Ng et al., Genetic variation in an individual human exome. PLoS Genetics, 2008
3. Cargill, M. et al., Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genetics, 1999
4. Stephens, J.C. et al., Haplotype variation and linkage disequilibrium in 313 human genes. Science, 2001
5. Stenson et al., The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Human Genetics, 2014
6. Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD). World Wide Web URL: http://omim.org/
7. Li et al., Predicting Mendelian disease-causing non-synonymous single nucleotide variants in exome sequencing studies. PLoS Genetics, 2013
8. Blekhman, R. et al., Natural selection on genes that underlie human disease susceptibility. Current Biology, 2008
9. Boone, P.M. et al., Deletions of recessive disease genes: CNV contribution to carrier states and disease-causing alleles. Genome Research, 2013
10. Huang et al., Characterising and predicting haploinsufficiency in the human genome. PLoS Genetics, 2010
11. Landrum, M.J. et al., ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 2015
12. Cooper, G.M. et al., Distribution and intensity of constraint in mammalian genomic sequence. Genome Research, 2005
13. Flicek et al., Ensembl 2014. Nucleic Acids Research, 2014
14. GENCODE, Pei, B. et al., The GENCODE pseudogene resource. Genome Biology, 2012
15. Lonsdale et al., The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 2013
16. Stark, C. et al., BioGRID: a general repository for interaction datasets. Nucleic Acids Research, 2006 (version 3.4.128)
17. The 1000 Genomes Project Consortium, An integrated map of genetic variation from 1,092 human genomes. Nature, 2012
18. Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/), 2015
19. Lek et al., Analysis of protein-coding genetic variation in 60,706 humans. Nature, 2016
20. Samocha et al., A framework for the interpretation of de novo mutation in human disease. Nature Genetics, 2014
21. Chong et al., The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. American Journal of Human Genetics, 2015
22. Deciphering Developmental Disorders Study, Large-scale discovery of novel genetic causes of developmental disorders. Nature, 2015
23. Liu et al., dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Human Mutation, 2016
24. Berg et al., An informatics approach to analyzing the incidentalome. Genetics in Medicine, 2013

Figure 3. Precision and recall values for training data. A. Precision was calculated as true positives over all predicted positives (true positives plus false positives) for a 10-fold cross-validation. B. Recall was calculated as true positives over the sum of true positives and false negatives.

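[Figure 3 bar charts. Panel A: precision; panel B: recall. Classes: Benign, Recessive, Dominant. Axis range: 0.5 to 1.0. Legend: Haploinsufficient Model, All Dominant Model. Bar labels in extraction order (class and model assignment not recoverable): precision 0.88, 0.85, 0.87, 0.79, 0.80, 0.71; recall 0.94, 0.95, 0.87, 0.79, 0.74, 0.62.]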
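The per-class precision and recall definitions in the Figure 3 caption can be written out for a three-class problem as below. This is a minimal sketch, not the poster's evaluation code: the function name `precision_recall` and the example labels are made up for illustration, and cross-validation is left out.

```python
from typing import Dict, List

CLASSES = ["benign", "recessive", "dominant"]


def precision_recall(truth: List[str], predicted: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-class precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    results = {}
    pairs = list(zip(truth, predicted))
    for cls in CLASSES:
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        results[cls] = {
            # Report 0.0 when a class was never predicted or never present
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return results


# Made-up labels for illustration only
truth = ["benign", "benign", "recessive", "dominant", "dominant", "recessive"]
predicted = ["benign", "recessive", "recessive", "dominant", "recessive", "recessive"]
for cls, metrics in precision_recall(truth, predicted).items():
    print(cls, metrics["precision"], metrics["recall"])
```

In this toy example the recessive class has perfect recall but only 0.5 precision, because two non-recessive variants were also called recessive.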

Figure 4. Feature importance plot. First, out-of-bag (OOB) prediction error is calculated for each tree. Next, OOB error is calculated after permuting each feature. Finally, to derive the mean decrease in accuracy, the difference between the two is averaged across all trees and normalized by the standard deviation of the differences.
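The final step of that calculation can be sketched as follows, assuming the per-tree OOB errors before and after permuting one feature are already available. The function name `mean_decrease_accuracy`, the use of the sample standard deviation, the zero-spread fallback, and the example error values are assumptions for illustration, not details from the poster.

```python
from statistics import mean, stdev
from typing import List


def mean_decrease_accuracy(oob_error: List[float], permuted_error: List[float]) -> float:
    """Average per-tree increase in OOB error after permutation,
    divided by the standard deviation of those per-tree differences."""
    diffs = [perm - base for base, perm in zip(oob_error, permuted_error)]
    spread = stdev(diffs)  # sample standard deviation (an assumption)
    # Fall back to 0.0 when every tree shows the same difference (an assumption)
    return mean(diffs) / spread if spread > 0 else 0.0


# Made-up per-tree errors for a four-tree forest
oob = [0.10, 0.12, 0.08, 0.11]
permuted = [0.15, 0.20, 0.10, 0.17]
print(mean_decrease_accuracy(oob, permuted))
```

A feature whose permutation raises the error consistently across trees gets a high score; one whose effect varies widely from tree to tree is penalized by the normalization.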

Table 2. Precision values using different subsets of features. The precision values were calculated based on the training data under the haploinsufficient model using a 10-fold cross-validation.

• 68 variants from the Centers for Mendelian Genomics (CMG) [21], which are working towards identifying the genetic basis of Mendelian diseases

• 158 variants from the Deciphering Developmental Disorders Study (DDDS) [22], which includes 1,133 children presenting with severe, undiagnosed developmental disorders, of whom 28% were identified with possibly pathogenic variants

Table 3. MAPPIN prediction accuracy for two Mendelian datasets. Table showing the prediction accuracies for pathogenicity and mode of inheritance for Mendelian validation datasets from CMG and DDDS. CMG (genes not in training) and DDDS (genes not in training) are pathogenicity and mode of inheritance results after excluding all CMG and DDDS genes from the training data.

Dataset | Pathogenicity Prediction Accuracy | Inheritance Prediction Accuracy
CMG | 68/68 (100%) | 45/64 (70.3%)
DDDS | 138/158 (87.3%) | 124/158 (78.5%)
CMG (genes not in training) | 68/68 (100%) | 45/64 (70.3%)
DDDS (genes not in training) | 138/158 (87.3%) | 125/158 (79.1%)
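The percentages in Table 3 follow directly from the counts, as the short check below shows. The helper name `accuracy_percent` and the dictionary layout are illustrative; the counts are copied from the table. The poster does not say why the CMG inheritance denominator is 64 rather than 68.

```python
from typing import Dict, Tuple


def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


# (correct, total) counts copied from Table 3
table3: Dict[str, Dict[str, Tuple[int, int]]] = {
    "CMG": {"pathogenicity": (68, 68), "inheritance": (45, 64)},
    "DDDS": {"pathogenicity": (138, 158), "inheritance": (124, 158)},
    "CMG (genes not in training)": {"pathogenicity": (68, 68), "inheritance": (45, 64)},
    "DDDS (genes not in training)": {"pathogenicity": (138, 158), "inheritance": (125, 158)},
}

for dataset, counts in table3.items():
    rates = {task: accuracy_percent(*pair) for task, pair in counts.items()}
    print(dataset, rates)
```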

MAPPIN FEATURES


Figure 5. Violin plots of score distributions for CMG and DDDS dominant and recessive disease-causing variants. MAPPIN dominant (A) and recessive (B) class scores for CMG and DDDS genes annotated as dominant or recessive. CADD (C) and Eigen (D) phred scores [23] for CMG and DDDS genes annotated as dominant or recessive.

CONCLUSIONS & APPLICATIONS

• To our knowledge, this is the first nsSNV prediction algorithm that predicts both pathogenicity and mode of inheritance by classifying variants into three groups

• Mode of inheritance predictions are useful because they allow the genotype to be taken into account when prioritizing variants
- This prevents heterozygous carrier variants from being classified as equally pathogenic as dominant-acting heterozygous mutations

• In Mendelian family-based analysis, MAPPIN would be useful for variant prioritization and interpretation, especially in cases where there is not enough information to identify the inheritance pattern

• For population genetics, MAPPIN annotations and predictions can support interpretation of variant and phenotype associations and variant aggregation for gene burden-based association testing
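One way a predicted mode of inheritance could be combined with genotype during prioritization is sketched below. This rule is hypothetical and is not described on the poster: the function name `prioritize`, the genotype codes `"het"` and `"hom"`, and the priority labels are assumptions, and compound heterozygosity is deliberately ignored.

```python
def prioritize(predicted_class: str, genotype: str) -> str:
    """Hypothetical triage rule combining a predicted class with genotype.

    predicted_class: "benign", "recessive", or "dominant"
    genotype: "het" (heterozygous) or "hom" (homozygous alternate)
    """
    if genotype not in ("het", "hom"):
        raise ValueError(f"unknown genotype: {genotype}")
    if predicted_class == "benign":
        return "low"
    if predicted_class == "dominant":
        # A single altered copy may be enough to cause disease
        return "high"
    if predicted_class == "recessive":
        # A single recessive-class allele suggests carrier status
        return "high" if genotype == "hom" else "carrier"
    raise ValueError(f"unknown class: {predicted_class}")


for cls in ("dominant", "recessive", "benign"):
    print(cls, prioritize(cls, "het"), prioritize(cls, "hom"))
```

Under this rule a heterozygous variant in the recessive class is flagged as a carrier state rather than ranked alongside a heterozygous dominant-class variant.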
