MASTER'S DEGREE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
ESCUELA NACIONAL DE SALUD - INSTITUTO DE SALUD CARLOS III
NutVar 2: Sequence-based functional annotation of truncating variants from Genome data
Manuel Tardáguila Sancho
2013-2014
Venue: Lab Telenti, Centre Hospitalaire Universitaire Vaudois (CHUV), Institut de Microbiologie (IMUL) - Swiss Institute of Bioinformatics (SIB)
Supervisors: Antonio Rausell (SIB), Amalio Telenti (J. Craig Venter Institute) and Ioannis Xenarios (SIB)
Master's supervisor: Michael Tress (CNIO)
Date: January, 2015
INDEX
OBJECTIVES
INTRODUCTION
MATERIALS & METHODS
RESULTS
    Assembly of the training and test sets and variant effect prediction
    Construction of Pre-Calculated tables (Pre-C tables) and obtention of files necessary to annotate features of molecular damage
    Annotation of gene and sequence based features
    Machine Learning
DISCUSSION
    NutVar2 versus implementations in SnpEff and VEP
    NutVar2 versus NutVar1
    Future Perspectives
CONCLUSIONS
REFERENCES
ANNEXES
OBJECTIVES
• Obtain a training set for the classifier using up-to-date data from large sequencing projects and from the ClinVar database.
• Extend the analysis of NutVar to splice donor/acceptor disruption sites.
• Add new functionalities to broaden the range of potential users.
• Extend protein feature analysis to functional sites.
• Develop an annotation pipeline for all the features.
• Train and test a classifier using the Naïve Bayes algorithm.
INTRODUCTION
Recent large sequencing projects report that stop-gains, frameshifts and splice donor/acceptor disruptions giving rise to truncations are collectively prevalent in humans. Current computational approaches evaluating their impact on health rely on gene-level characteristics such as evolutionary conservation and functional redundancy across the genome. However, sequence-based features such as loss of functional domains, isoform-specific truncation and onset of nonsense-mediated decay provide cues that improve the ranking of potentially pathogenic truncations. Here, we present NutVar2, the second release of a classifier that produces a severity ranking relying on sequence-based features. New characteristics in NutVar2 include an expanded training set comprising common truncations from the ESP, 1000 Genomes and 10GK projects and pathogenic truncations extracted from ClinVar, evaluation of the loss of protein functional sites, and the use of ENSEMBL as the basis for transcript annotation. The sequence-based score ranks variants according to their potential to cause disease and complements existing gene-based pathogenicity scores.
The amount of data coming out of large-scale human exome sequencing projects is growing exponentially [1] as we enter the era of personal genomics. Even though humans differ in only 0.1% of their DNA sequence, the study of these nucleotide changes is the basis for ascertaining the genomic predisposition to disease. Therefore, a series of tools is required to prioritize, according to their pathogenic potential, the clinical study of variants that are new or not previously associated with disease.
Setting aside large structural variations (SVs) and copy number variations (CNVs), variants that have a profound effect on protein functionality are often associated with disease [2-4]. Among these are the so-called ‘Loss-of-Function’ (LoF) variants, which mainly arise from the truncation that follows a stop-gain, a frameshift or the destruction of a splice donor/acceptor site. However, truncating variants have proven to be surprisingly prevalent in the general population, with the average individual carrying between 100 and 200 of them [2, 5]. Furthermore, 20 of these truncations are estimated to occur in the homozygous state [2].
Until the advent of NutVar [4], the assessment of the pathogenicity of truncations remained focused on features derived from the gene affected by the variant. These ‘gene-based’ features reflect evolutionary constraints, such as sequence conservation or tolerance to functional variation, and are used to evaluate the pathogenic potential of the gene. Under this approach, truncations are assumed to be severe irrespective of their position in the gene, so their outcome depends solely on the gene being affected. A general conclusion of these studies is that the more conserved the affected gene, the greater the likelihood that the truncation is pathogenic [2].
Two studies pioneered the use of gene-based features and showed their classifying power. First, MacArthur et al [2] identified 2,951 LoF candidate variants from the 1000 Genomes Project [5] and validated them using arrays whenever possible (n = 1,877). The authors focused on producing a high-confidence set of truncations leading to LoF, eliminating up to 25% of candidates that were likely sequencing/mapping errors, annotation/reference sequence errors, or variants unlikely to cause genuine LoF. Interestingly, the authors pointed out that ‘accurate functional interpretation requires integrating multiple variants on the same haplotype’. This means that the sequence context of an a priori truncation needs to be evaluated carefully to determine whether or not it actually produces a truncation. For instance, SNPs leading to stop-gains are prone to associate with SNPs in adjacent nucleotides that revert their effect, frameshifts are prone to associate with frameshifts in adjacent sequence that restore the original reading frame, and indels close to or spanning exon splice sites can be rescued by alternative splice sites. With the surviving 1,285 high-confidence LoF variants, MacArthur et al built a gene score based on the observation that genes in which these variants were more prevalent shared a series of characteristics: (i) they were relatively less evolutionarily conserved, (ii) they had more paralogs and (iii) they had lower connectivity in both protein-protein interaction and gene interaction networks. All of these pointed to the redundancy and/or dispensability of these genes, the typical example being human olfactory receptors.
Shortly afterwards, Petrovski et al provided a different but related gene-based score [3]. The authors developed ‘an intolerance scoring system (RVIS) that assesses whether genes have relatively more or less functional genetic variation than expected based on the apparently neutral variation found in the gene’ [3]. Their primary motivation was to distinguish between two types of genes: genes for which variants affecting functionality are rarely observed and, when observed, tend to accumulate in patients with a certain disease, and genes which carry functional variants at high frequencies. Using a logistic regression model, they observed that genes causing Mendelian diseases are more intolerant to functional variants.
Both of these scores share two central caveats: they rely solely on gene-based features, and they tend to misclassify variants affecting genes that are under positive selection, such as genes of the Innate Immune Response (IIR). First of all, the assumption that a truncation will lead to LoF irrespective of its position is too simplistic; depending on its position, the truncation will affect a greater or smaller number of protein domains and functional sites. In addition, the genomic coordinate of the truncation dictates the isoform(s) it affects. This is of special importance as most genes have a principal isoform [6]. Furthermore, the Nonsense-Mediated Decay (NMD) machinery abrogates the expression of transcripts with premature termination codons located more than 50 bp upstream of the last exon-exon junction. In summary, the overall effect of the truncation needs to be inferred from a series of sequence-based features inherent to the position of the variant. Both groups use preliminary strategies to deal with this caveat: MacArthur et al discard every truncation lying in the last 5% of the sequence, assuming it will not entail LoF, and Petrovski et al combine their intolerance score with PolyPhen2, a sequence-based estimator of the potential impact of missense variants.
Second, variants affecting genes that undergo positive selection (for example IIR genes) tend to be misclassified. The gene score of MacArthur et al tends to classify functional variants in these genes as pathogenic when some of them are not. The authors of the RVIS score suggest an alternative to deal with this problem: depending on the nature of the disease being studied (according to the classification of diseases by Goh et al [7]), the association between intolerance score and disease should be interpreted differently. For example, developmental diseases are caused by genes with the scores most intolerant to functional variation (low RVIS). On the other hand, immunological diseases are caused by genes that are very tolerant to functional variation (high RVIS).
All these caveats were addressed by the first release of NutVar, developed by Rausell et al [4]. The software focuses on truncations, extracting the sequence features mentioned above and providing a classifier that can be blended with either the MacArthur or the RVIS gene score. The authors showed the gain in accuracy when both scores were combined, and validated the decrease in transcript levels for variants eliciting NMD using correlated transcriptomics and genomics data from the 1000 Genomes Project and the Geuvadis Project [4]. The authors analyzed genes of the IIR in detail and showed that NutVar complemented gene-based scores and, in some instances, classified the truncations in these genes better than gene-based scores alone.
Here, we present NutVar2, the second release of the classifier. New characteristics in NutVar2 include an expanded training set comprising common truncations from the ESP, 1000 Genomes and 10GK projects and pathogenic truncations extracted from ClinVar, evaluation of the loss of protein functional sites, and the use of ENSEMBL as the basis for transcript annotation. These improvements and extended functionalities over NutVar1 account for a gain in classification accuracy.
MATERIALS & METHODS
Scripts and original files
All the scripts and files needed to run the NutVar2 annotation phase are provided along with this report as a .gz file. In the NutVar2 directory, scripts are placed in the bin/ folder, whereas data/ contains the files needed to run them. The bin/ directory is divided in turn into scripts needed to build the Pre-C tables and the training set (build_table/ and Training_set/), scripts used exclusively to run VEP (VEP/), scripts used exclusively to run SnpEff (snpeff/) and scripts common to both programs in the pipeline (shared/). The data/ folder includes files generated in the construction of the Pre-C tables (build_tables/) and the training set (Training_set/), intermediate files generated in the annotation process (intermediate/), files obtained from external sources (external/) and a folder with the final annotation matrix (final/). Finally, a test folder with an example that can be used to run the whole pipeline is provided.
The map of dependencies provided in the annexes includes the commands needed to run the different scripts. The paths to the external ftp sites are also displayed in the map of dependencies.
Calculation of the Allele Frequency (AF) in the Training Set
An R script (Beta_distribution.R) was implemented to calculate the 95% credible intervals for the AF of each variant. HOM = homozygous for the variant, HET = heterozygous for the variant, WT = wild type. Briefly, the count of the allele of interest was calculated as (2*HOM+HET) and the total number of alleles as (2*(HOM+HET+WT)). For each variant, the qbeta function of R was used to calculate the 0.025 and 0.975 quantiles, using 1 + allele of interest as shape parameter 1 and 1 + total alleles - allele of interest as shape parameter 2. The ncp was set to 0.
Only variants for which the lower limit of the credible interval was greater than 0.01 were accepted as common.
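Written out, the calculation described above corresponds to the following quantities. This is a restatement rather than the content of Beta_distribution.R; the symbols k, n, p and q are introduced here for illustration, and reading the two "1 +" shape terms as a uniform Beta(1, 1) prior is an interpretation, not something stated in the script description.

```latex
\begin{align}
k &= 2\,\mathrm{HOM} + \mathrm{HET}, \qquad
n = 2\,(\mathrm{HOM} + \mathrm{HET} + \mathrm{WT}), \\
p \mid k, n &\sim \mathrm{Beta}(1 + k,\; 1 + n - k), \\
\mathrm{CI}_{95\%} &= \bigl[\, q_{0.025},\; q_{0.975} \,\bigr], \qquad
q_{\alpha} = F^{-1}_{\mathrm{Beta}(1 + k,\, 1 + n - k)}(\alpha), \\
\text{common} &\iff q_{0.025} > 0.01 .
\end{align}
```

In R, the two limits correspond to a call of the form qbeta(c(0.025, 0.975), 1 + k, 1 + n - k).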
Minimal Representation
The Minimal Representation of each variant was calculated as explained in the guidelines established in [8].
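The two core operations of the conversion (undoing joint calls and eliminating padding nucleotides) can be sketched as follows. This is a minimal illustration and not the NutVar2 script: the function names are hypothetical, the trimming order (suffix first, then prefix) is an assumption that may differ in detail from the guidelines in [8], and symbolic alleles such as <DEL> are not handled. The example record is the one used in Figure 5.

```python
def split_joint_call(pos, ref, alt_field):
    """Split a comma-separated ALT field into one (pos, ref, alt) record per allele."""
    return [(pos, ref, alt.strip()) for alt in alt_field.split(",")]


def minimal_representation(pos, ref, alt):
    """Trim padding nucleotides shared by REF and ALT, keeping at least one base in each."""
    # Trim the shared suffix first; the start position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, shifting the start position to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Joint call from Figure 5: position 1231299, REF CGG, ALT "CTT, AGG".
records = split_joint_call(1231299, "CGG", "CTT, AGG")
minimal = [minimal_representation(*record) for record in records]
```

Under these assumptions the CTT allele becomes GG > TT at position 1231300 and the AGG allele becomes C > A at position 1231299.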
Selection of Protein features
For functional sites, sites annotated in UniProt as ‘by similarity’ or ‘probable’, or lacking a label (highest level of acceptance), were accepted as valid. Sites labelled as ‘potential’ were discarded.
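The selection rule above can be expressed as a small filter. This is a sketch under assumptions: the tuple layout (feature type, protein position, qualifier) and the example sites are hypothetical, and any qualifier other than the three accepted cases is discarded, which goes slightly beyond the text (it only names ‘potential’ as rejected).

```python
# Qualifiers accepted as valid; the empty string stands for "no label".
ACCEPTED_QUALIFIERS = {"", "by similarity", "probable"}


def keep_site(qualifier):
    """Return True when a functional site passes the evidence filter."""
    label = (qualifier or "").strip().lower()
    return label in ACCEPTED_QUALIFIERS


# Hypothetical sites: (feature type, protein position, qualifier).
sites = [
    ("ACT_SITE", 57, None),
    ("BINDING", 102, "By similarity"),
    ("MOD_RES", 15, "Potential"),
    ("METAL", 88, "Probable"),
]
kept = [site for site in sites if keep_site(site[2])]
```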
Naive Bayesian Classifier
Cross-validation was performed as leave-one-out (k-fold with k equal to the number of examples), as there were few examples in the final training set. No variant filtering was performed.
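To make the leave-one-out scheme concrete, the sketch below pairs it with a very small naive Bayes classifier. It is an illustration only and not the NutVar2 classifier: it assumes binary features (a Bernoulli model with add-one smoothing, chosen for brevity), whereas the NutVar2 features are not all binary, and the toy data and all names are hypothetical.

```python
import math


def train_bernoulli_nb(rows, labels):
    """Estimate class priors and per-feature Bernoulli parameters with add-one smoothing."""
    n_features = len(rows[0])
    model = {}
    for cls in set(labels):
        members = [row for row, label in zip(rows, labels) if label == cls]
        # P(feature = 1 | class), smoothed so that no probability is exactly 0 or 1.
        probs = [
            (sum(row[j] for row in members) + 1) / (len(members) + 2)
            for j in range(n_features)
        ]
        model[cls] = (len(members) / len(rows), probs)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one binary feature vector."""
    def log_posterior(cls):
        prior, probs = model[cls]
        return math.log(prior) + sum(
            math.log(p if x else 1 - p) for x, p in zip(row, probs)
        )
    return max(model, key=log_posterior)


def leave_one_out_accuracy(rows, labels):
    """Hold out each example once, train on the rest, and score the held-out prediction."""
    hits = 0
    for i in range(len(rows)):
        model = train_bernoulli_nb(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:])
        hits += predict(model, rows[i]) == labels[i]
    return hits / len(rows)


# Hypothetical toy data: two binary features per variant, P = pathogenic, N = non-pathogenic.
rows = [[1, 1], [1, 1], [1, 0], [0, 0], [0, 0], [0, 1]]
labels = ["P", "P", "P", "N", "N", "N"]
accuracy = leave_one_out_accuracy(rows, labels)
```

With few training examples, leave-one-out uses all but one example for every fit, which is the motivation given in the text.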
Plotting NutVar2 features against class variable
The mean value of each feature was plotted for the two labels of the class variable (Pathogenic and non-Pathogenic). Error bars represent the standard deviation (STD) of the mean. Of note, the STD was assumed to be equal between the two classes.
[Figure 1: workflow diagram. User vcf file -> minimal representation -> variant effect (SnpEff, VEP or both; e.g. stop_gained) -> annotation of sequence features (molecular damage; Pre-C tables and external files) -> annotation of gene features (gene essentiality, innate immune gene) -> classifier -> rank of truncating variants according to pathogenicity score (biological readout).]
Figure 1. Workflow of NutVar2. The variants from the user are converted to minimal representation and their effect is assessed using SnpEff, VEP or both. Next, sequence features of molecular damage are annotated using Pre-C tables. Gene-based features assessing gene essentiality and affiliation to the innate immune response are added afterwards. Finally, the probability of the truncating variants being pathogenic is estimated by means of a classifier trained and tested on a known set of common and pathogenic truncating variants. Truncating variants are ranked according to their score of potential pathogenicity.
RESULTS
NutVar2 has been designed as a standalone tool to be downloaded and run on the user's own set of .vcf files. Figure 1 gives an overview of the NutVar2 workflow: the user's data are first converted from the original vcf file to minimal representation [8], and the effect of each variant is determined using SnpEff [12], VEP [13] or both. Sequence features of molecular damage are then annotated for the variant based on Pre-Calculated (Pre-C) tables built by the software. Afterwards, a layer of gene-based features assessing gene essentiality and membership of the innate immune response is added to the annotation. Finally, variants are passed to a classifier previously trained with a set of known pathogenic and common (Minor Allele Frequency, MAF > 1%, assumed to be non-Pathogenic) truncating variants that have been annotated with NutVar2. The result is a probability of the variant being pathogenic and a ranking of the truncating variants according to this score.

[Figure 2: development diagram. (1) TRAINING SET: ES Project, 1000 G Project, 10 GK Project, ClinVar. (2) VARIANT EFFECT: SnpEff, VEP. (3) TRANSCRIPT FEATURES (ENSEMBL, CCDS, APPRIS, Pervasive): % sequence affected; isoforms affected (constitutive or alternative); principal isoform?; pervasive isoform?; Nonsense-Mediated-Decay degradation (NMD + NMD down frameshift). (4) PROTEIN FEATURES (UniProt, InterPro): max percentage of a domain affected; functional sites. (5) ANNOTATION. (6) MACHINE LEARNING.]
Figure 2. Development of NutVar2. The training set (1) is constructed from a set of truncating variants classified as pathogenic (obtained from ClinVar) and truncating variants known to be common, obtained from large sequencing projects. The effect of each variant is assessed using SnpEff, VEP or both (2), and a series of features of molecular damage at the transcript level (3) or at the protein level (4) are annotated (5). For each pre-classified variant, features of molecular damage and features of gene essentiality [2, 3] are used to train a classifier (6) under a naive Bayes paradigm. The sensitivity and specificity of the classifier are assessed afterwards by ROC curves.

ASN.1 terms                    ClinVar and VCF
0 - unknown                    Uncertain significance
1 - untested                   not provided (includes the cases where data are not available or unknown)
2 - non-Pathogenic             Benign
3 - probable-non-Pathogenic    Likely benign
4 - probable-pathogenic        Likely pathogenic
5 - pathogenic                 Pathogenic
6 - drug-response              drug response
7 - histocompatibility         histocompatibility
255 - other                    other

Table 1. ClinVar codes for the clinical significance of variants. Obtained from [11].
The development of NutVar2 can be divided into four main stages: first, assembling a large set of pathogenic and common truncating variants to train the classifier (steps 1 and 2 in Figure 2). Second, building a series of Pre-C tables that speed up the annotation process (steps 3 and 4 in Figure 2). Third, developing a series of scripts to annotate the set of truncating variants assembled in the first stage (step 5 in Figure 2). Finally, building the code for the classifier and training and testing it with the data obtained from all the previous stages (step 6 in Figure 2).
We now analyze these four stages in detail.
1. Assembly of the training and test sets and variant effect prediction
Selection of non–Pathogenic and Pathogenic variants for the Training set
First of all, a file comprising 1,523,770 variants from large sequencing projects (ESP, 1000 Genomes, 10GK, in-house projects) was obtained from the project supervisor, Antonio Rausell. The file was in vcf format, and the INFO field included, for every variant, counts of the individuals observed to be homozygous (HOM), heterozygous (HET) and wild type (WT) (see Figure 3).
From this large pool of variants we wanted to select truncating variants that are common (Allele Frequency, AF, > 1%). We had to account for statistical uncertainty, as the size of the population sample used to calculate AF differed widely from variant to variant, and for genetic uncertainty, which arises from the underlying stochastic evolutionary process that gave rise to the population sampled; we therefore applied a hierarchical Bayesian model to estimate AF [9]. First, to account for statistical uncertainty, we assumed that alleles are sampled independently within populations and that the samples are drawn independently across loci and populations [9]. Second, to account for genetic uncertainty, we assumed a parametric form for the among-population allele frequency distribution. It is natural to assume that population allele frequencies follow a Beta distribution [9]. Using the Beta distribution (see Materials & Methods) we obtained 95% credible intervals for each variant.
Consequently, the AF of a given variant lies with 95% probability between the limits of the credible interval we calculated. Next, only variants for which the lower limit of the credible interval was greater than 0.01 (1%) were accepted as common. By means of this stringent criterion we eliminated variants with high statistical and genetic uncertainty.
A total of 70,042 variants were found to be common and were therefore assumed to be non-Pathogenic.

[Figure 3: flow diagram. ES Project, 1000 G Project and 10 GK Project -> 1,523,770 variants -> assume a Beta distribution for the AF of each variant -> calculate upper and lower limits of the 95% credible interval -> keep variants with lower limit > 0.01 -> 70,042 common variants (AF > 1%), assumed to be non-pathogenic.]
Figure 3. Selection of non-Pathogenic variants from the pool of variants from large sequencing projects. An initial pool of 1,523,770 variants, along with their allele counts, was assembled from different sequencing projects. The AF of each variant was assumed to be Beta distributed, and 95% credible intervals were calculated for each variant. Variants with the lower limit of the credible interval greater than 0.01 (1%) were accepted as common (AF > 1%) and therefore non-Pathogenic.

[Figure 4: flow diagram. ClinVar, 63,060 variants -> select variants reported as ‘likely Pathogenic’ (4) or ‘Pathogenic’ (5) -> discard variants with conflicting reports (variants also classified as ‘benign’ (2) or ‘likely benign’ (3)) -> restrict to variants reported at least once as ‘Pathogenic’ -> 14,571 Pathogenic variants.]
Figure 4. Selection of pathogenic variants from the ClinVar set. From the initial 63,060 variants in the pool, only those reported as Pathogenic or likely Pathogenic were selected. Variants showing conflicting reports were then removed from this subset. Finally, we restricted our analysis to variants classified at least once as Pathogenic. The final number of selected Pathogenic variants was 14,571.
Next, a set of 63,060 variants was obtained from ClinVar [10]. ClinVar classifies the clinical significance of variants according to a numeric code (Table 1). Different submissions may assign different clinical significances to a given variant. Consequently, some variants exhibit conflicting reports: they are classified by some authors as Pathogenic and by others as Benign.
In order to discard these conflicting variants, we first selected from the original set the variants that have been classified at least once as Pathogenic (5) or likely Pathogenic (4) (Figure 4). From this subset, variants that have been classified at least once as Benign (2) or likely Benign (3) were excluded. Finally, we restricted ourselves to variants that have been reported at least once as Pathogenic. The final subset of ClinVar Pathogenic variants amounted to 14,571 variants.
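The three selection steps can be summarised as a predicate over the set of clinical significance codes reported for a variant. This is a sketch rather than the script used in this work; the function name, the dictionary layout and the example variants are hypothetical, while the numeric codes are those of Table 1.

```python
PATHOGENIC, LIKELY_PATHOGENIC = 5, 4
BENIGN, LIKELY_BENIGN = 2, 3


def is_training_pathogenic(codes):
    """Apply the three selection steps to the codes reported for one variant."""
    codes = set(codes)
    # Step 1: at least one 'Pathogenic' or 'likely Pathogenic' report.
    if not codes & {PATHOGENIC, LIKELY_PATHOGENIC}:
        return False
    # Step 2: no conflicting 'Benign' or 'likely Benign' report.
    if codes & {BENIGN, LIKELY_BENIGN}:
        return False
    # Step 3: at least one report that is plainly 'Pathogenic'.
    return PATHOGENIC in codes


# Hypothetical variants with the codes assigned by different submissions.
reports = {
    "var_a": [5],
    "var_b": [4],
    "var_c": [5, 2],
    "var_d": [5, 4, 0],
    "var_e": [3],
}
selected = sorted(v for v, codes in reports.items() if is_training_pathogenic(codes))
```

Taken together, the steps keep a variant when it has at least one Pathogenic (5) report and no Benign (2) or likely Benign (3) report.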
Conversion to minimal representation and variant effect prediction
Once the Pathogenic and Non-Pathogenic variants of the training set had been selected, they were converted from their standard vcf format to minimal representation [8] (Figure 5). This conversion decomposes joint calls, eliminates padding nucleotides and modifies genomic coordinates if necessary. It also transforms deletion alternative alleles annotated as <DEL> or < . > into their minimal representation form using the ENSEMBL API. This transformation is a convenient way to avoid errors in variant identification due to the addition of padding nucleotides in the joint calling of vcf files [8].

[Figure 5: worked example. (1) Original .vcf file with a joint call: 1 1231299 rs19800 CGG CTT, AGG. (2) Undo joint calls: 1 1231299 rs19800 CGG CTT and 1 1231299 rs19800 CGG AGG. (3) Eliminate padding nucleotides: 1 1231300 rs19800 GG TT and 1 1231299 rs19800 C A. (4) Conversion of ALT alleles <DEL> or < . > to minimal representation (ENSEMBL API).]
Figure 5. Modifications to reach minimal representation. Joint allele calls, if present in the original vcf file, are separated into simple calls (2), and padding nucleotides added to make joint calls coherent are eliminated (3). If any cryptic alternative allele (<DEL> or < . >) is present, it is converted to minimal representation via the ENSEMBL API (4).

Comparison of SnpEff and Variant Effect Predictor (VEP)

                                 SnpEff                                                              VEP
Version                          3.6                                                                 Release 75
Genome assembly                  GRCh37.75                                                           GRCh37.75
Running time                     2 hours                                                             overnight, parallelized in 10 chunks
Run features                     Loss-of-Function (LOF and NMD prediction); NextProt information     Pathogenicity probability; protein domain information; SIFT and PolyPhen scores
Terminology of variant effect    Sequence Ontology (S.O.)                                            Sequence Ontology (S.O.)

Table 2. Characteristics of SnpEff and VEP runs on 1000G phase 1 set.

                                             Same in both    SnpEff exclusive    VEP exclusive
Total variants                               91.13%          1.77%               8%
Stop_gained                                  99%             0.15%               0.75%
Frameshift_variant                           77.32%          16.77%              5.91%
Splice_variants (acceptor, donor, region)    88.80%          8.77%               5.34%
Missense_variants                            >99%            --                  --
Synonymous_variants                          >99%            --                  --

Table 3. Results of SnpEff and VEP runs on 1000G phase 1 set. SnpEff and VEP agreed on the consequences of 91.13% of the variants in the set. Variants with a consequence reported exclusively by SnpEff or by VEP amounted to 1.77% and 8%, respectively. Agreement between the two predictors was greater for synonymous and missense variants (>99%), while truncating variants showed in general a lower degree of agreement (99% for stop gains, 88% for splice variants and 77% for frameshifts).

Once the variants were in their minimal representation form, their effect was assessed using SnpEff [12] and/or VEP [13]. Supporting two variant effect predictors is an improvement over the first release of NutVar, in which only SnpEff was used. SnpEff is much faster than VEP but less comprehensive. VEP is becoming a standard in the field of clinical genomics, so we wanted to feature it in order to reach a wider audience of users. Besides, the possibility of running both on the same set of variants and restricting the results to variants predicted to have the same effect by both could provide a way to avoid false positives.
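Restricting the results to variants for which both predictors report the same effect amounts to an intersection of the two annotation sets. The sketch below assumes, for simplicity, one Sequence Ontology term per variant and predictor (in practice each tool can report several effects per variant); the function name, the variant keys and the example calls are hypothetical.

```python
def compare_predictions(snpeff, vep):
    """Split calls into those shared by both predictors and those made by only one."""
    concordant = {v: e for v, e in snpeff.items() if vep.get(v) == e}
    snpeff_only = {v: e for v, e in snpeff.items() if v not in concordant}
    vep_only = {v: e for v, e in vep.items() if v not in concordant}
    return concordant, snpeff_only, vep_only


# Hypothetical calls keyed by a variant identifier.
snpeff_calls = {
    "var_1": "stop_gained",
    "var_2": "frameshift_variant",
    "var_3": "missense_variant",
}
vep_calls = {
    "var_1": "stop_gained",
    "var_2": "splice_region_variant",
    "var_3": "missense_variant",
}
concordant, snpeff_only, vep_only = compare_predictions(snpeff_calls, vep_calls)
```

Only the concordant calls would be passed on when the user chooses to run both predictors and keep their agreement.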
Before deciding to use both of them in our pipeline, we ran a comparative analysis using the 1000 Genomes phase 1 vcf [14]. As shown in Table 2, SnpEff requires much less time to process the data than VEP (2 hours versus an overnight run parallelized in ten chunks). Both programs were run with the option to obtain protein information for the effect of the variant: NextProt in the case of SnpEff and ENSEMBL information in the case of VEP. In addition, VEP provides SIFT and PolyPhen scores to evaluate the impact of missense variants. The user can later mine all these pieces of information from the resulting files.
Importantly, we also obtained the SnpEff prediction of loss-of-function for each variant and the VEP prediction of pathogenicity. These two scores pursue the same goal as NutVar2, that is, evaluating when the molecular damage is severe (loss-of-function, SnpEff) and providing a probability score of pathogenicity (VEP). The SnpEff prediction of loss-of-function is based on MacArthur et al [2] (Pablo Cingolani, personal communication [15]), while the basis for the VEP predictor of probability could not be ascertained. These scores can later be compared with NutVar2 results.
As shown in Table 3 the overall convergence in assessing the effect of a variant between VEP and SnpEff is 91.13%. For missense and synonymous variants both programs show greater accordance (>99%) than for truncating variants, 99%, 88% and 77% for stop gains, splice and frameshift variants respectively.
The Pathogenic and Non-Pathogenic variants of the training set were annotated separately using SnpEff and VEP. Consequently, the following stages in the development of NutVar2 were carried out separately, depending on whether SnpEff or VEP had been used to assess the effect of the variants.
2. Construction of Pre-Calculated tables (Pre-C tables) and obtention of files necessary to annotate features of molecular damage
For NutVar2 to run fast, a series of Pre-C tables and external files need to be built in advance. These Pre-C tables and files (listed in Table 5) contain transcript and protein annotations that allow NutVar2 to rapidly calculate the features of molecular damage for the training set obtained in the previous section. The construction of the Pre-C tables for transcript features and protein features was therefore the second main stage of NutVar2 development.
Transcript Features Pre-C tables and files
A major improvement in NutVar2 is the use of ENSEMBL as the source of transcript annotation, whereas NutVar1 was restricted to transcripts belonging to the smaller CCDS subset. From the whole set of human transcripts of the GRCh37.75 release, we restricted ourselves to protein coding transcripts (ENSTs) from protein coding genes (ENSGs) that lie on autosomes and sex chromosomes (ENSGs on “patch chromosomes” were discarded).
A series of Pre-C tables containing sequence characteristics of every transcript was first derived from the ENSEMBL gtf file (gtf_tabladef_sorted_by_SYMBOL.txt, gtf_output_ENSG.txt and gtf_output_ENST.txt, see Table 5). These Pre-C tables were instrumental in building the first main Pre-C table of transcript features: a table of intervals of genomic coordinates occupied by coding positions of all the ENSTs for every ENSG (Figure 6). This Pre-C table, called ENST_table_full_condensed.txt (see Table 5), is central to the annotation process explained later.
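As an illustration of how a table of coding intervals per gene and transcript could be derived from a gtf file, the sketch below collects CDS features grouped by gene_id and transcript_id. It is not the NutVar2 script and makes no claim about the layout of ENST_table_full_condensed.txt: the function name is hypothetical, the example lines use made-up identifiers, the filter keeps chromosomes 1-22, X and Y as a stand-in for discarding patch chromosomes, and no biotype filtering is attempted.

```python
import re
from collections import defaultdict

MAIN_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def coding_intervals(gtf_lines):
    """Collect sorted CDS intervals per transcript, grouped by gene, from GTF lines."""
    table = defaultdict(lambda: defaultdict(list))
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, _source, feature, start, end, _score, _strand, _frame, attrs = (
            line.rstrip("\n").split("\t")
        )
        # Keep coding segments on autosomes and sex chromosomes only.
        if feature != "CDS" or chrom not in MAIN_CHROMOSOMES:
            continue
        fields = dict(ATTRIBUTE.findall(attrs))
        table[fields["gene_id"]][fields["transcript_id"]].append((int(start), int(end)))
    return {
        gene: {tx: sorted(segments) for tx, segments in transcripts.items()}
        for gene, transcripts in table.items()
    }


# Hypothetical GTF lines (identifiers and coordinates are invented for the example).
example = [
    '1\tprotein_coding\tCDS\t100\t200\t.\t+\t0\tgene_id "ENSG_A"; transcript_id "ENST_A1";',
    '1\tprotein_coding\tCDS\t50\t80\t.\t+\t0\tgene_id "ENSG_A"; transcript_id "ENST_A1";',
    '1\tprotein_coding\texon\t40\t220\t.\t+\t.\tgene_id "ENSG_A"; transcript_id "ENST_A1";',
    'PATCH_1\tprotein_coding\tCDS\t10\t20\t.\t+\t0\tgene_id "ENSG_B"; transcript_id "ENST_B1";',
]
intervals = coding_intervals(example)
```

A variant position can then be checked against the intervals of every transcript of its gene to decide which isoforms it affects.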
In addition, a second major Pre-C table for the annotation process, a table of Nonsense-Mediated-Decay window regions, was developed at this stage (Figure 6; called NMD_table, see Table 5). This Pre-C table allows NutVar2 to determine swiftly whether or not a truncating variant leads to NMD and therefore to transcript downregulation.

[Figure 6: diagram of sources (ENSEMBL, CCDS, APPRIS, Pervasive) and steps.
1. Create a pre-calculated table of intervals of genomic coordinates occupied by coding positions of all the protein coding transcripts for every gene in ENSEMBL.
2. Create a pre-calculated table of Nonsense-Mediated-Decay window regions, from the start codon to 50 bp (non-inclusive) from the last exon-exon junction.
3. Selection of longest isoform, principal isoform (APPRIS) and isoform most expressed across different tissues (Pervasive isoform).]

Version                              Range and coverage
ENSEMBL release 75 (GRCh37.75)       20,314 protein coding genes on autosomes and sex chromosomes
ENSEMBL release 75 (GRCh37.75)       81,732 protein coding transcripts
APPRIS Gencode 19                    17,902 genes for which there is a Principal Isoform
Pervasive GRCh37.75                  5,227 genes for which there is a Pervasive isoform

Figure 6. Transcript features annotated from the ENSEMBL set of protein coding transcripts. The set of protein coding transcripts of ENSEMBL is used to create two Pre-C tables: first, a table of intervals of genomic coordinates covering all the coding positions of every protein-coding transcript in ENSEMBL; second, a table of Nonsense-Mediated-Decay (NMD) susceptible regions, covering all coding positions up to 50 bp (not included) from the last exon-exon junction. The APPRIS server is used to select the principal isoform of each gene, while the longest isoform is calculated from ENSEMBL data. To select the pervasive isoform, the isoform most expressed across different tissues, we use the table provided by Gonzalez-Porta et al [16]. The Consensus Coding Sequence (CCDS) set of transcripts is displayed because the software includes the possibility to limit the results to CCDS transcripts.
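The NMD window test can be sketched in transcript coding coordinates. This is a simplified illustration, not the construction of NMD_table: it assumes that the last exon-exon junction lies within the coding region (junctions in the 3' UTR are ignored), it reads "50 bp (non-inclusive)" as a window ending 50 bases before the last coding base upstream of that junction, and the function names and exon lengths are hypothetical.

```python
def nmd_window_end(coding_exon_lengths, distance=50):
    """Return the last coding position (1-based) inside the NMD window, or 0 if there is none."""
    if len(coding_exon_lengths) < 2:
        return 0  # a single coding exon has no exon-exon junction
    # Coding position of the last base before the final exon-exon junction.
    last_junction = sum(coding_exon_lengths[:-1])
    return max(last_junction - distance, 0)


def triggers_nmd(ptc_position, coding_exon_lengths, distance=50):
    """True when a premature stop at this 1-based coding position falls inside the window."""
    return 1 <= ptc_position <= nmd_window_end(coding_exon_lengths, distance)


exons = [120, 90, 60]  # hypothetical coding exon lengths, 5' to 3'
window_end = nmd_window_end(exons)
```

Precomputing the window end once per transcript turns the per-variant check into a single comparison, which is the speed-up that a pre-calculated table provides.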
Finally, two key files were obtained from external sources: the APPRIS selection of principal isoforms in Gencode 19 [6] (appris_principal_isoform in Table 5) and the table of pervasive isoforms (isoforms most expressed across different tissues) obtained from Gonzalez-Porta et al [16]. These files are vital to ascertain the importance of an ENST in the context of its gene; truncating variants affecting the principal or pervasive isoform of a given gene have a greater potential to be pathogenic.
Protein Features Pre-C tables and files
By building a Pre-C table of protein functional sites and domains we aimed at quantifying the loss in overall protein features produced by a truncating variant. In order to do so, we mapped functional sites and protein domains to genomic coordinates.
We chose UniProt-SwissProt [17], which comprises manually curated annotations, as the basis for protein features, though we plan to implement TrEMBL (automatic annotation without manual curation) in the near future.
The process of data extraction from the UniProt-SwissProt complete file is detailed in Figure 7 and Table 4. For each ENSEMBL gene, SwissProt annotates protein features for only one isoform, the “displayed isoform”, amounting to a total of 18,955 displayed isoforms. For each displayed isoform we extracted the UniProt identifier (AC number), the InterPro identifiers of protein domains, and the protein coordinates and types of selected functional sites (Figure 7). InterPro identifiers were later used to extract domain protein coordinates from InterPro [18]. The types of functional sites selected are listed in Table 4. This selection contains types of small functional sites whose disruption may impair protein function because they are active catalytic sites, sites of post-translational modification, or small motifs needed to bind substrates and cofactors.
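As an illustration of this extraction step, the sketch below parses one entry of a UniProt text file. It assumes the column layout of the UniProtKB flat-file format in use at the time (two-letter line code, feature key followed by start and end positions on FT lines); the function name and the example entry are made up for the example and are not taken from NutVar2.

```python
SELECTED_SITES = {"BINDING", "METAL", "NP_BIND", "DNA_BIND", "CA_BIND",
                  "MOD_RES", "LIPID", "CARBOHYD", "NON_STD", "ACT_SITE"}


def parse_entry(lines):
    """Return (AC number, InterPro IDs, sites) for one flat-file entry."""
    ac, interpro_ids, sites = None, [], []
    for line in lines:
        code, rest = line[:2], line[5:].strip()
        if code == "AC" and ac is None:
            ac = rest.split(";")[0].strip()          # primary accession
        elif code == "DR" and rest.startswith("InterPro;"):
            interpro_ids.append(rest.split(";")[1].strip())
        elif code == "FT":
            fields = rest.split(None, 3)
            if len(fields) < 3 or fields[0] not in SELECTED_SITES:
                continue                             # key not in Table 4
            if not (fields[1].isdigit() and fields[2].isdigit()):
                continue                             # fuzzy coordinates
            description = fields[3] if len(fields) > 3 else ""
            if "Potential" in description:
                continue                             # 'Potential' sites are not used
            sites.append((fields[0], int(fields[1]), int(fields[2])))
    return ac, interpro_ids, sites


# Made-up entry, for illustration only.
example_entry = [
    "AC   P00000;",
    "DR   InterPro; IPR003961; FN3_dom.",
    "FT   MOD_RES      27     27       Phosphoserine.",
    "FT   CARBOHYD     38     38       N-linked (GlcNAc...) (Potential).",
    "FT   DOMAIN       60    150       Example.",
]
```

Calling `parse_entry(example_entry)` keeps the modified residue at position 27, drops the ‘Potential’ glycosylation site and ignores the DOMAIN line, since domain coordinates are taken from InterPro instead.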
Domain Transformations
InterPro provides protein domain coordinates from different domain annotators (Pfam, G3DSA, etc.) for each combination of AC number (identifying the protein) and InterPro identifier (identifying the domain). As a result, the same protein domain can be defined by different boundaries depending on the domain annotator selected (Figure 8 B)). In addition, repetitions of the same domain can be found along the same polypeptide (Figure 8 A)). Consequently, to correctly assess the number and extent of the domains lost after a truncation we needed to establish domain boundaries and to number domain repetitions (Figure 8). First, domain repetitions obtained from the same domain annotator were numbered (Figure 8 A)). Second, when different domain annotators displayed overlapping coordinates delimiting a protein domain, the outermost coordinates were used to define a “collapsed form” of the domain resulting from the predictions of all the domain annotators (Figure 8 B)). These two sequential transformations may take place for the same domain in the same polypeptide, as exemplified in Figure 8 C) for the IPR003961 domain of the human Tenascin-C protein.
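The two transformations can be sketched in a few lines. This is a simplified illustration: it merges overlapping predictions first and numbers the resulting repetitions afterwards, which may differ from the order used in NutVar2, and the function name and the `IPR..._n` labelling scheme are assumptions made for the example.

```python
def collapse_and_number(hits, interpro_id):
    """Collapse overlapping predictions and number the repetitions.

    hits: list of (annotator, start, end) protein coordinates for one
    protein (AC number) and one InterPro identifier.
    """
    merged = []
    for _, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous domain: keep the outermost end.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Number the collapsed repetitions along the peptide sequence.
    return [(f"{interpro_id}_{i}", s, e)
            for i, (s, e) in enumerate(merged, start=1)]
```

For two repetitions predicted by two annotators with slightly different boundaries, the function returns two collapsed domains, each delimited by the outermost coordinates of its overlapping predictions.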
Mapping protein coordinates to genomic coordinates
Once the types and coordinates of the selected functional sites and the coordinates and IDs of the collapsed and numbered InterPro domains had been obtained, we proceeded to map them to genomic coordinates.
This process was divided into two steps. First, we selected the ENSEMBL transcript (ENST) whose peptide sequence matched that of the displayed isoform whose sites and domains we wanted to map. Second, we obtained the genomic coding positions of the resulting ENST (using the Pre-C tables carrying transcript features) and retrieved the positions covered by the protein features.
[Figure 7, panel A): flow of protein information to genomic coordinates. From UniProt-SwissProt: UniProt AC number, InterPro domain IDs, site info and coordinates. From InterPro: domain coordinates. Protein coordinates (example: Ser 27, P) are mapped through ENSEMBL to genomic coordinates (example: T 1:495112, T 1:495113, C 1:495200, with an intervening intron).]
B)
Site UniProt code | Description
BINDING | Describes the interaction between a single amino acid and another chemical entity
METAL | Indicates at which position the protein binds a given metal ion. By definition each ‘Metal binding’ subsection refers to a single amino acid
NP_BIND | Describes a region in the protein which binds nucleotide phosphates. It always involves more than one amino acid and includes all residues involved in nucleotide-binding
DNA_BIND | Specifies the position and type of each DNA-binding domain present within the protein
CA_BIND | Specifies the position(s) of the calcium-binding region(s) within the protein. One common calcium-binding motif is the EF-hand, but other calcium-binding motifs also exist
MOD_RES | Position and type of each modified residue excluding lipids, glycans and protein cross-links. For example: phosphorylation, methylation, acetylation...
LIPID | Specifies the position(s) and the type of covalently attached lipid group(s)
CARBOHYD | Specifies the position and type of each covalently attached glycan group (mono-, di-, or polysaccharide)
NON_STD | Describes the occurrence of the non-standard amino acids selenocysteine (Sec) or pyrrolysine (Pyl) in the protein sequence
ACT_SITE | Indicates the residues directly involved in catalysis, used for enzymes
Figure 7 and Table 4. Scheme of the flow of protein information to genomic coordinates and types of functional sites selected. A) For each protein, the InterPro domain IDs present in the peptide and the types and protein coordinates of functional sites are extracted from the complete UniProt-SwissProt database files obtained from the FTP server. The UniProt AC number and the InterPro IDs are used to retrieve domain protein coordinates from InterPro. Site and domain coordinates are then mapped to genomic coordinates as described in the text. B) Types of functional sites selected by the software; functional sites deemed ‘Potential’ are not used. Adapted from UniProt documentation [19].
[Figure 8. Panels: A) domains numbered (the repetitions IPR # A become IPR # A1, IPR # A2 and IPR # A3); B) domain collapsed (the predictions PS51406, PF00147, SSF56496 and SM00186 are merged into IPR # B); C) human Tenascin IPR003961, repetitions # 1 to # 10.]
Figure 8. Domain transformations undertaken by the software. Based on domain coordinates, the software performs the following transformations. First, it numbers the repetitions of the same domain along the peptide sequence (A). Second, when different domain predictors define the same InterPro domain, the outermost coordinates are used to delimit a “collapsed” domain resulting from all the overlapping predictions (B). C) shows an example of both transformations on the same InterPro domain (IPR003961 of the human Tenascin protein, UniProt AC P24821).
The process is detailed in Figure 9. Matching the peptide sequence of the displayed isoform against all the ENSEMBL peptide sequences of the same gene gave rise to three types of outcome. The displayed isoform matched exactly at least one of the ENSEMBL peptide sequences in 18,074 cases, whereas in 137 cases at least one of the ENSEMBL peptide sequences completely contained the displayed isoform and showed a longer N-terminus (Figure 9). For 712 displayed isoforms there was no match among the corresponding ENSEMBL peptides. This is due to differences in sequence composition between ENSEMBL and UniProt; for instance, there can be single-residue changes or intrasequence loops (Figure 9). These 712 unmatched isoforms were later aligned against their cognate ENSEMBL peptide sequences using BLAST.
To convert protein coordinates to genomic coordinates we first calculated the offset between the displayed isoform and the longer ENSEMBL peptide sequence in the 137 instances where the ENSEMBL peptide N-terminus was longer. A longer N-terminus in the ENSEMBL peptide sequence implies that the offset between both forms has to be added to the protein coordinates of every feature in the displayed isoform to allow its correct reallocation in ENSEMBL (Figure 9).
Once this had been done we carried on with the process, this time also for the 18,074 displayed isoforms with an exactly matching ENSEMBL peptide sequence. The protein coordinates of the domains and sites were converted to array indexes indicating the beginning of the feature and its length, accounting for the equivalence between one residue and a three-bp codon (Figure 9). Then, the coding genomic coordinates of the ENST giving rise to the matching ENSEMBL peptide sequence were retrieved from the Pre-C tables of transcript features obtained previously. Next, we used the indexes to extract the genomic coordinates occupied by the feature from the array of genomic coding positions of the ENST.
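The matching step can be sketched as below. This is a hedged illustration rather than the actual FindCorrespondence routine: the function name is hypothetical, the "longer N-terminus" case is modelled as the ENSEMBL peptide ending with the full displayed isoform, and the first matching transcript is returned when several match.

```python
def find_correspondence(displayed, ensembl_peptides):
    """Return (ENST id, N-terminal offset) or None when nothing matches.

    displayed: peptide sequence of the UniProt displayed isoform.
    ensembl_peptides: dict mapping ENST id -> peptide sequence (same gene).
    """
    # Outcome 1: exact correspondence, no offset needed.
    for enst, seq in ensembl_peptides.items():
        if seq == displayed:
            return enst, 0
    # Outcome 2: ENSEMBL peptide contains the displayed isoform and has
    # a longer N-terminus; the extra residues are the offset.
    for enst, seq in ensembl_peptides.items():
        if seq.endswith(displayed):
            return enst, len(seq) - len(displayed)
    # Outcome 3: no correspondence; these cases go to BLAST realignment.
    return None
```

The returned offset is the number of residues to add to every feature coordinate of the displayed isoform before it is converted to array indexes.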
[Figure 9. Scheme: the peptide sequence of the UniProt displayed isoform of a given gene is compared (FindCorrespondence) with the peptide sequences of all the ENSEMBL transcripts of the same gene, taken from the FTP file with the sequences of all the ENSEMBL transcripts. Outcomes: exact correspondence (18,074/18,955); ENSEMBL sequence with a longer N-terminus, handled with an offset (137/18,955); no correspondence because of residue changes, a longer UniProt N-terminus or internal loops, re-aligned using BLAST (712/18,955). Example shown: Ser 27 (P) mapped to T 1:495112, T 1:495113 and C 1:495200 across an intron.]
Figure 9. Mapping of protein features to genomic coordinates. The peptide sequence of the displayed isoform is retrieved from UniProt-SwissProt and matched against the peptide sequences of all the ENSEMBL proteins corresponding to the same gene. Out of 18,955 displayed isoforms, 18,074 have at least one exactly corresponding ENSEMBL peptide sequence, 137 have at least one corresponding ENSEMBL peptide with a longer N-terminus (denoted in the figure as “offset”) and 712 do not match any of the ENSEMBL peptide sequences for the same gene. Some reasons for this lack of matching are enumerated in the figure: single-residue differences along the sequences, internal loops, a longer UniProt N-terminus, etc. The 712 unmatched displayed isoforms and the corresponding ENSEMBL peptide sequences are aligned afterwards using BLAST (see Figure 10). For the remaining isoforms, the offset is first applied to the protein feature coordinates in the case of the 137 displayed isoforms with an N-terminus offset. Then, for all of them, feature coordinates are converted to array indexes indicating feature start and length, multiplied by three to account for the amino acid-codon equivalence. Next, the coding genomic coordinates of the corresponding ENSEMBL peptide sequence are extracted from ENSEMBL. Finally, the array indexes denoting feature start and length are used to select the corresponding coding positions from the array of total coding positions.
For instance, a Serine amenable to phosphorylation in position 1 of a given displayed isoform was converted to two indexes indicating the beginning of the feature ((1-1)*3 = 0) and its length ((1-1+1)*3 = 3), both obtained by multiplying by three. This means that, in the array of genomic coordinates corresponding to the ENST giving rise to the ENSEMBL peptide, the Serine starts in position 0 of the array (first coding position) and extends for two more positions, for a total of 3 bp. Should the Serine be in position 3 instead, the indexes (begin 6, length 3) would select the genomic coordinates occupying positions 6, 7 and 8 of the array of coding positions of the ENST, correctly locating the protein feature in genomic coordinates.
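The index arithmetic of this example can be written as a short function. It is a sketch under stated assumptions: the function name and the `offset` parameter are illustrative, protein coordinates are 1-based and inclusive, and `coding_positions` is the array of genomic coding positions of the ENST in transcript order.

```python
def feature_to_genomic(start_aa, end_aa, coding_positions, offset=0):
    """Select the genomic coding positions covered by a protein feature.

    start_aa, end_aa: 1-based inclusive protein coordinates of the feature.
    offset: residues to add when the ENSEMBL peptide has a longer N-terminus.
    """
    begin = (start_aa + offset - 1) * 3      # index of the first base
    length = (end_aa - start_aa + 1) * 3     # three bases per residue
    return coding_positions[begin:begin + length]
```

Because the array holds coding positions only, a codon split by an intron is handled without special cases: the slice simply returns positions from both exons.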
For the 712 UniProt displayed isoforms without a matching ENSEMBL peptide sequence, we aligned each of them against the ENSEMBL peptide sequences corresponding to the same gene using BLAST. The results of these alignments were complex: there were mismatches due to different residue composition, displayed isoforms with a longer N-terminus or C-terminus, and even intrasequence loops both in displayed isoforms and in ENSEMBL peptide sequences (Figure 10). Furthermore, for some displayed isoforms these mismatches and loops appeared combined.
To overcome the difficulties posed by these alignments we pre-processed the protein feature coordinates of the displayed isoform in three stages. First, we selected as the ENSEMBL peptide sequence matching the displayed isoform the one with the highest BLAST bit score. The bit score, S’, is derived from the raw alignment score, S, taking the statistical properties of the scoring system into account. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches [20].
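For reference, the standard relation between the raw score and the bit score is S’ = (λS - ln K) / ln 2, where λ and K are the statistical parameters of the scoring system. The sketch below applies it and picks the best candidate; the function names and the parameter values used in the usage note are illustrative assumptions, not values taken from this work.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def best_hit(hits):
    """Return the ENST id with the highest bit score.

    hits: list of (enst_id, bit_score) pairs for one displayed isoform.
    """
    return max(hits, key=lambda h: h[1])[0] if hits else None
```

With illustrative parameters lam = 0.267 and k = 0.041, a raw score of 100 corresponds to roughly 43 bits. In practice BLAST reports the bit score directly, so only the selection step is needed.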
Secondly, we excluded every protein domain or site lying entirely in regions that were present in the UniProt displayed isoform but absent from the ENSEMBL peptide sequence we had selected. For instance, features lying entirely in a region of the N-terminus of a displayed isoform that was absent from the selected ENSEMBL peptide sequence were discarded. Features lying only partially in such regions (for example protein domains) were adapted: their boundaries were adjusted to the regions present in the ENSEMBL peptide sequence (Figure 10).
Thirdly, as intrasequence loops introduce offsets affecting only the features C-terminal to them, we calculated a compound offset (resulting from the general offset and the loop offsets N-terminal to the feature) for every feature and applied it to its coordinates (Figure 10).
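The second and third stages can be sketched together if the BLAST alignment is represented as a list of ungapped aligned blocks. This representation, the function name and the tuple layout are assumptions made for the example; within each block, the difference between the ENSEMBL and UniProt start coordinates plays the role of the compound offset.

```python
def map_feature(start, end, blocks):
    """Map a UniProt feature onto the selected ENSEMBL peptide.

    start, end: 1-based inclusive feature coordinates in the displayed isoform.
    blocks: ungapped aligned segments (uniprot_start, uniprot_end,
    ensembl_start), sorted along the sequence.
    Returns (ensembl_start, ensembl_end) or None if the feature is discarded.
    """
    covered = []
    for u_start, u_end, e_start in blocks:
        lo, hi = max(start, u_start), min(end, u_end)
        if lo <= hi:
            # Compound offset of every shift N-terminal to this block.
            offset = e_start - u_start
            covered.append((lo + offset, hi + offset))
    if not covered:
        return None  # feature lies entirely in a region absent from ENSEMBL
    # Boundaries adjusted to the regions present in the ENSEMBL peptide.
    return covered[0][0], covered[-1][1]
```

For example, with `blocks = [(11, 50, 1), (51, 100, 61)]` the displayed isoform has 10 extra N-terminal residues and the ENSEMBL peptide carries a 20-residue loop: a feature at 1-8 is discarded, a feature at 5-20 is trimmed to ENSEMBL 1-10, and a feature at 60-70 is shifted to 70-80.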
Finally, we proceeded to convert protein coordinates to genomic coordinates as described above.
[Figure 10. Scheme: the 712 (of 18,955) displayed isoforms with no correspondence are realigned using BLAST. Alignment outcomes shown, each with its offset: residue mismatches, longer UniProt N-terminus, longer UniProt C-terminus and intrasequence loops. 685 displayed isoforms are rescued.]
Figure 10. BLAST alignment of the 712 UniProt displayed isoforms that do not match any of the ENSEMBL peptide sequences of their gene. Each displayed isoform is aligned against all the ENSEMBL peptide sequences of its gene using BLAST. Alignments show residue mismatches, a longer N- or C-terminus in the displayed isoform, intrasequence loops, and combinations of some or all of them. For every alignment the best candidate ENSEMBL peptide is selected based on its bit score. A candidate ENSEMBL peptide is selected for 685 of the initial 712 displayed isoforms. To map their protein features (sites and domains) to genomic coordinates, features lying entirely in regions absent from the ENSEMBL peptide are first discarded, and features lying partially in them are adapted so that their boundaries lie within the limits of the ENSEMBL peptide sequence. Second, for every protein feature a compound offset resulting from all the partial offsets N-terminal to the feature is calculated and applied to the feature coordinates. Finally, we proceed as explained in Figure 9 to obtain the genomic coordinates of the feature.
Addition of gene based features to the variants
As shown in Figure 1, in addition to features of molecular damage NutVar2 requires a series of gene based features. These provide two lines of information: the essentiality of the gene and whether it belongs to the Innate Immune Response. All of these gene features were obtained from external files (Table 6).
Gene essentiality prioritizes genes: genes with a low tolerance to functional variation [3] or rarely observed to carry loss-of-function variants [2] are presumed to be essential genes under active purifying selection. Therefore, functional variants appearing in these genes have a greater potential to be pathogenic.
Membership of a gene in the Innate Immune Response set is added as a gene feature because these genes are thought to undergo positive selection, and thus the amount of functional variation is expected to be higher than that observed under neutral evolution [4, 21]. Therefore, functional variants appearing in these genes may have a lower potential to be pathogenic.
Finally, a summary of all the Pre-C tables and files described in this section is displayed in Table 5 (Pre-C tables and external files carrying sequence based features) and Table 6 (external files carrying gene based features).
3. Annotation of gene and sequence based features
Once stages 1 and 2 of NutVar2 development were completed, that is, once the training set had been divided into Pathogenic and Non-Pathogenic variants and the Pre-C tables and files needed for a swift annotation of sequence and gene features had been built, the annotation phase started.
We developed a series of scripts to calculate the features displayed in Table 7 for all the variants of the training set. These features can be broadly divided into five categories. The first category consists of informative features (first 7 rows of Table 7) that describe the overall effect of the variant in the gene. As the effect of the variant depends on the transcript(s) it affects, a variant can lead to truncations in certain isoforms and to other effects in other isoforms of the same gene. That information can be retrieved from this set of informative features.
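As an illustration, the informative features can be derived from a per-isoform list of predicted effects. The sketch below is not the NutVar2 script: the function name and the input layout are assumptions, while the feature names follow Table 7.

```python
from collections import Counter

EFFECTS = ("stop-gained", "frameshift", "splice",
           "coding-synonymous", "missense")


def isoform_ratios(effects_by_isoform, all_isoforms):
    """Compute the informative features for one variant.

    effects_by_isoform: dict ENST id -> effect label, only for the isoforms
    bearing the variant. all_isoforms: every isoform of the gene.
    """
    n = len(all_isoforms)
    counts = Counter(effects_by_isoform.values())
    features = {
        "NumIsoformsInQueryGene": n,
        "ratioIsoformsBearingTheVariant": len(effects_by_isoform) / n,
    }
    for effect in EFFECTS:
        # Ratios are taken over all the isoforms of the gene.
        features[f"ratioAffectedIsoforms_{effect}"] = counts[effect] / n
    return features
```

For a gene with four isoforms in which the variant is a stop-gain in two and a missense change in one, the function reports 0.75 for isoforms bearing the variant, 0.5 for stop-gained and 0.25 for missense.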
Secondly, transcript features of molecular damage (rows 8 to 14), which have been explained before. Of note, for frameshift variants NutVar2 calculates the position of derived stop-gains in the new reading frame elicited by the variant. Should they exist, NutVar2 calculates whether they entail NMD. The result is the feature “ratioAffectedIsoformsTargetedby_derived_NMD”.
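The search for a derived stop-gain can be sketched as a scan of the mutated coding sequence in the new reading frame. The function below is a simplified illustration with hypothetical names; it works on the coding sequence only, so a derived stop located beyond the original stop codon would not be found.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def derived_stop_index(cds, pos, deleted=0, inserted=""):
    """Find the first stop codon created by a frameshift.

    cds: coding sequence starting at the start codon.
    pos: 0-based index in the CDS where the change starts.
    deleted: number of bases removed; inserted: bases added.
    Returns the 0-based index (in the mutated CDS) of the first base of the
    first in-frame stop codon at or after the change, or None.
    """
    mutated = cds[:pos] + inserted + cds[pos + deleted:]
    first_codon = pos - pos % 3          # start of the codon hit by the change
    for i in range(first_codon, len(mutated) - 2, 3):
        if mutated[i:i + 3] in STOP_CODONS:
            return i
    return None
```

Once the derived stop is located, its genomic coordinate can be compared against the NMD window of the transcript in the same way as for a direct stop-gain.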
Thirdly, protein features of molecular damage (rows 15 to 24), which have been explained before. These features are intended to calculate protein damage globally (percentage of total domain and site positions affected by the truncation) and functionally (maximum percentage of a domain or site affected, and number of domains and sites completely affected by the truncation). Whether the variant lies within the boundaries of a domain or site feature (the “domain matched” and “site matched” features) is of special importance in the case of sites, where partial disruption of the site may completely abrogate its function.
Fourthly, the class variable Pathogenic or Non-Pathogenic (rows 25 and 26). The Pathogenicity Tag counts the number of times a variant has been annotated as Pathogenic in ClinVar (as explained in the first section of the results, conflicting variants have been left out). Only variants with a Pathogenicity Tag equal to or greater than 1 are accepted as Pathogenic in the training of the classifier. The Credible Tag displays the result of comparing the limits of the 95% credible interval with the threshold of 0.01 (1%) to ascertain whether variants are rare (highCI <= 0.01), unknown (lowCI < 0.01 and highCI > 0.01) or common (lowCI > 0.01). Only common variants (Credible Tag = 1) are assumed to be ‘Non-Pathogenic’ and used to train the classifier.
Name of File | Description | Main use | Origin
gtf_tabladef_sorted_by_SYMBOL.txt | Pre-C table of intervals of genomic coordinates listing all the protein coding ENSTs and the character of the interval of coordinates (UTR_5_prime, START, CDS, STOP and UTR_3_prime). Includes ENST-CCDS equivalences (typed “NaNein” when absent) | Source of equivalence ENST-CCDS | Pre-C from ENSEMBL gtf file. See *FigureXX*
gtf_output_ENSG.txt | Pre-C table showing the equivalence of protein coding ENSGs that lie in autosomal or sex chromosomes (assembly patches are discarded) to HGNC symbols | Source of equivalence ENSG-HGNC symbol | Pre-C from ENSEMBL gtf file. See *FigureXX*
gtf_output_ENST.txt | Pre-C table of all the protein coding ENSTs belonging to every ENSG present in gtf_output_ENSG.txt | Source of all protein coding ENSTs | Pre-C from ENSEMBL gtf file. See *FigureXX*
NMD_table.txt | Pre-C table of Nonsense-Mediated-Decay (NMD) window regions, from start codon to 50 bp (non-inclusive) from last exon-exon junction | Calculate NMD from variant coordinate | Pre-C from ENSEMBL gtf file. See *FigureXX*
ENST_table_full_condensed.txt | Pre-C table of intervals of genomic coordinates occupied by coding positions of all the protein coding transcripts for every ENSG | Locate variant coordinate within ENST context | Pre-C from ENSEMBL gtf file. See *FigureXX*
Pervasive.txt | Table indicating the pervasive isoform for every ENSG | Select Pervasive isoform | Gonzalez-Porta et al*
appris_principal_isoform.txt | Table indicating the APPRIS Principal isoform for every ENSG | Select Principal isoform | Appris Server
ALL_ISOFORMS_PROTEIN_table_full.txt | Pre-C table of intervals of genomic coordinates occupied by every InterPro domain and UniProt functional site previously transformed (numbered and collapsed) | Calculate protein features affected by a truncation | Pre-C from ENSEMBL, UniProt and InterPro
ALL_ISOFORMS_DOMAIN_table_full.txt | Pre-C table of intervals of genomic coordinates occupied by all InterPro domains, labelled as DOMAIN, and UniProt functional sites, labelled as SITE, previously transformed (numbered and collapsed) | Calculate the total amount of DOMAIN and SITE positions affected | Pre-C from ENSEMBL, UniProt and InterPro
Table 5. Files and Pre-C tables needed to run the analysis. ENSG = ENSEMBL Gene, ENST = ENSEMBL Transcript, HGNC = HUGO Gene Nomenclature Committee, CCDS = Consensus Coding Sequence, UTR_5_prime = 5 prime UTR, START = start codon, CDS = coding sequence, STOP = stop codon, UTR_3_prime = 3 prime UTR. *Gonzalez-Porta et al [16], Appris server [6].
Name of File | Description | Main use | Origin
pRDG2.txt | Probability of recessive disease causation as provided by MacArthur et al | Assess gene essentiality | MacArthur et al
RVIS2.txt | Intolerance scoring system that assesses whether genes have relatively more or less functional genetic variation than expected based on the apparently neutral variation found in the gene | Assess gene essentiality | Petrovski et al
Genes_AllInnateImmunity.txt | Genes involved in Innate Immunity | Selection of genes involved in Innate Response | Lab Telenti (in-house)
Genes_Antiviral.txt | Genes involved in antiviral response | Selection of genes involved in Innate Response | Lab Telenti (in-house)
Genes_ISGs.txt | Genes involved in Innate Immunity | Selection of genes involved in Innate Response | Lab Telenti (in-house)
Genes_OMIMrecessive.txt | Genes known to cause disease only through autosomal recessive inheritance | Assess gene essentiality | Lab Telenti (in-house)
Table 6. Files carrying gene based features. MacArthur et al [2], Petrovski et al [3].
Row | Feature | Description
1 | NumIsoformsInQueryGene | Number of total isoforms in the gene
2 | ratioIsoformsBearingTheVariant | Ratio of isoforms bearing the variant from all the isoforms of the gene. Ranges from 0 to 1
3 | ratioAffectedIsoforms_stop-gained | Ratio of isoforms bearing the variant that give rise to a stop gained from all the isoforms of the gene. Ranges from 0 to 1
4 | ratioAffectedIsoforms_frameshift | Ratio of isoforms bearing the variant that give rise to a frameshift from all the isoforms of the gene. Ranges from 0 to 1
5 | ratioAffectedIsoforms_splice | Ratio of isoforms bearing the variant that give rise to a splice donor or splice acceptor abrogation from all the isoforms of the gene. Ranges from 0 to 1
6 | ratioAffectedIsoforms_coding-synonymous | Ratio of isoforms bearing the variant that give rise to a synonymous change from all the isoforms of the gene. Ranges from 0 to 1
7 | ratioAffectedIsoforms_missense | Ratio of isoforms bearing the variant that give rise to a missense mutation from all the isoforms of the gene. Ranges from 0 to 1
8 | ratioAffectedIsoformsTargetedbyNMD | Ratio of isoforms bearing the variant that are targeted by Nonsense-Mediated-Decay. Meaningful only for stop-gains (where it ranges from 0 to 1); for the rest it is set to “NaN”
9 | ratioAffectedIsoformsTargetedby_derived_NMD | Ratio of isoforms bearing the variant for which a downstream stop-gain is created and which are targeted by Nonsense-Mediated-Decay. Meaningful only for frameshifts (where it ranges from 0 to 1); for the rest it is set to “NaN”
10 | IsPrincipalIsoformAffected | Indicates whether the Principal Isoform predicted by Appris is affected by the variant. 0 -> not affected, 1 -> affected, NaN -> no principal isoform predicted for the gene
11 | IsWithinLongestCCDS | Indicates whether the variant affects the longest isoform of the gene. 1 -> affected, 0 -> not affected
12 | IsWithinPervasiveIsoform | Indicates whether the variant affects the pervasive isoform of the gene. 1 -> affected, 0 -> not affected, NaN -> no pervasive isoform for the gene
13 | LongestCCDSLength | Length of the longest transcript affected
14 | PercentagePrincipalOrLongestCCDSAffected | Percentage of the longest transcript that lies 3’ downstream of the coordinate of the variant
15 | DomainINFOAvailable | Indicates whether the longest transcript has associated InterPro domain info. 0 -> no domain info, 1 -> domain info present
16 | PercentageOfDomainPositionsAffected | Percentage of the total positions with associated InterPro domain info that lie 3’ downstream of the coordinate of the variant. Ranges from 0 to 100, unless DomainINFOAvailable equals 0, in which case it is set to NaN
17 | maxPercDomainAffected | Maximum percentage of a domain potentially affected by the mutation (mapping 3’ downstream of the variant). Ranges from 0 to 100, unless DomainINFOAvailable equals 0, in which case it is set to NaN
18 | NumberOfDomains100Damage | Number of complete domains that lie 3’ downstream of the variant and are potentially affected by it. Integer, unless DomainINFOAvailable equals 0, in which case it is set to NaN
19 | DomainMatched | Is the variant within the boundaries of an InterPro domain? YES = 1; NO = 0, unless DomainINFOAvailable equals 0, in which case it is set to NaN
20 | SiteINFOAvailable | Indicates whether the longest transcript has associated SwissProt site info. 0 -> no site info, 1 -> site info present
21 | PercentageOfSitePositionsAffected | Percentage of the total positions with associated SwissProt site info that lie 3’ downstream of the coordinate of the variant. Ranges from 0 to 100, unless SiteINFOAvailable equals 0, in which case it is set to NaN
22 | maxPercSiteAffected | Maximum percentage of a site potentially affected by the mutation (mapping 3’ downstream of the variant). Ranges from 0 to 100, unless SiteINFOAvailable equals 0, in which case it is set to NaN
23 | NumberOfSites100Damage | Number of complete sites that lie 3’ downstream of the variant and are potentially affected by it. Integer, unless SiteINFOAvailable equals 0, in which case it is set to NaN
24 | SiteMatched | Is the variant within the boundaries of a site? YES = 1; NO = 0, unless SiteINFOAvailable equals 0, in which case it is set to NaN
25 | Pathogenicity_Tag | Number of times a variant has been reported as Pathogenic in ClinVar
26 | Credible_Tag(-1,rare, 0,not_credible,1,common) | Tag indicating the result of comparing the limits of the 95% credible interval obtained from the Beta distribution with the threshold of 0.01. -1 = rare (highCI <= 0.01), 0 = unknown (lowCI < 0.01, highCI > 0.01), 1 = common (lowCI > 0.01)
27 | Location_Tag | Internal annotation reference
28 | IsInnateImmunity | Gene based feature. See Table 6
29 | IsAntiviral | Gene based feature. See Table 6
30 | IsISG | Gene based feature. See Table 6
31 | IsOMIMrecessive | Gene based feature. See Table 6
32 | pRDG_score | Gene based feature. See Table 6
33 | RVIS_score | Gene based feature. See Table 6
Table 7. Features calculated by NutVar2 for each variant. lowCI = lower limit of the Credible Interval, highCI = higher limit of the Credible Interval.
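The class-variable rules (Pathogenicity Tag and Credible Tag) can be sketched as two small functions. The function names are hypothetical. Two points are assumptions that the text does not settle: a lower limit exactly equal to the threshold is treated as unknown, and a variant that is both reported as Pathogenic and common is labelled Pathogenic.

```python
def credible_tag(low_ci, high_ci, threshold=0.01):
    """Compare the 95% credible interval limits with the frequency threshold."""
    if high_ci <= threshold:
        return -1   # rare: the whole interval is at or below the threshold
    if low_ci > threshold:
        return 1    # common: the whole interval is above the threshold
    return 0        # unknown: the interval spans the threshold


def training_label(pathogenicity_tag, tag):
    """Return the class used for training, or None if the variant is left out."""
    if pathogenicity_tag >= 1:
        return "Pathogenic"
    if tag == 1:
        return "Non-Pathogenic"
    return None
```

Variants that are neither reported as Pathogenic nor credibly common receive no label and are therefore excluded from training.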
Finally, the fifth subgroup of features comprises the gene-based features (rows 28 to 33, see Table 6); row 27 is an internal annotation reference.
The resulting matrix of data (summarized in Table 8) was the input for training the classifier.
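To make the protein damage features of Table 7 (rows 15 to 19) concrete, the sketch below computes them from the array indexes of the domains of a transcript. The function name and input layout are assumptions, and the position of the variant itself is counted as affected, which the text does not specify. The same logic would apply to sites (rows 20 to 24).

```python
import math


def domain_damage(variant_index, domains):
    """Compute domain damage features for a truncating variant.

    variant_index: 0-based index of the variant in the array of coding
    positions of the transcript.
    domains: list of (begin, length) array indexes, one per domain.
    """
    if not domains:
        nan = math.nan
        return {"DomainINFOAvailable": 0,
                "PercentageOfDomainPositionsAffected": nan,
                "maxPercDomainAffected": nan,
                "NumberOfDomains100Damage": nan,
                "DomainMatched": nan}
    # Positions of each domain at or 3' downstream of the variant.
    affected = [max(0, begin + length - max(begin, variant_index))
                for begin, length in domains]
    lengths = [length for _, length in domains]
    return {
        "DomainINFOAvailable": 1,
        "PercentageOfDomainPositionsAffected": 100 * sum(affected) / sum(lengths),
        "maxPercDomainAffected": max(100 * a / n for a, n in zip(affected, lengths)),
        "NumberOfDomains100Damage": sum(a == n for a, n in zip(affected, lengths)),
        "DomainMatched": int(any(b <= variant_index < b + n for b, n in domains)),
    }
```

For two 30-position domains starting at indexes 0 and 60, a variant at index 75 affects 25% of all domain positions and 50% of the second domain, and it lies within a domain.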
4. Machine Learning
Distribution of feature values among the Pathogenic and non-Pathogenic classes
Figure 11 depicts the mean of every annotated feature for the two possible values of the class variable, Pathogenic and non-Pathogenic. It is divided into stop-gains (A), frameshifts (B) and splice donor/acceptor disruptions (C).
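A per-class mean of this kind can be computed as in the sketch below. It is only an illustration: the function name and data layout are assumptions, NaN values are skipped as "not applicable", and any scaling of the features applied before plotting is not reproduced.

```python
import math


def class_specific_means(rows, labels):
    """Mean of every feature within each class.

    rows: list of dicts mapping feature name -> numeric value.
    labels: class of each row, in the same order.
    Returns a dict keyed by (class, feature).
    """
    totals, counts = {}, {}
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            if isinstance(value, float) and math.isnan(value):
                continue  # NaN marks a feature that does not apply
            key = (label, feature)
            totals[key] = totals.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}
```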
As expected, in all three types of truncations, Pathogenic variants lie more often in genes with longer isoforms than in genes with shorter isoforms (feature “LongestCCDSLength”). Long isoforms are prone to contain multiple domains, so a truncation occurring in them is more likely to entail severe functional consequences than one in shorter isoforms with fewer domains.
Variant | Pathogenic / Non Pathogenic | At least Domain INFO | At least Site INFO | Target of NMD
Stop gains (initial 26,070) | 2,848 Pathogenic | 2,550 | 1,434 | 2,393
Stop gains (initial 26,070) | 407 Non Pathogenic | 319 | 121 | 249
Frameshifts (initial 31,145) | 2,222 Pathogenic | 1,783 | 1,100 | 1,016
Frameshifts (initial 31,145) | 2,891 Non Pathogenic | 2,125 | 942 | 803
Splice acceptor/donor abrogations (initial 1,211) | 577 Pathogenic | 522 | 285 | 529
Splice acceptor/donor abrogations (initial 1,211) | 29 Non Pathogenic | 27 | 5 | 22
Table 8. Statistics of truncating variants used to train the classifier. From the initial large pools, the stringent criteria leave much reduced subsets of Pathogenic and Non Pathogenic variants. We show the number of variants having at least domain info (column 3) or site info (column 4) and the number eliciting NMD in at least one transcript (column 5; in the case of frameshift variants the number refers to the feature “ratioAffectedIsoformsTargetedby_derived_NMD”) to display the coverage of the features.
In the case of stop gains (Figure 11 A) the mean values of most features are greater in Pathogenic variants. This is the case for the features NMD, Principal isoform affected, longest CCDS affected, percentage of sequence affected, percentage of domain positions affected, maximum percentage of a domain affected, number of sites with 100% damage and site matched. That Pathogenic variants score higher than non-Pathogenic ones for these features is consistent with our a priori expectations and has also been observed in NutVar1 [4]. As for the rest of the features in stop-gains, derived NMD does not apply to them and is thus set to 0, and “Pervasive isoform” seems to be more targeted in non-Pathogenic variants, but this might be due to its lack of comprehensiveness (there is Pervasive information for only 5,227 genes). The features number of domains with 100% damage, domain matched, percentage of site positions affected and maximum percentage of a site affected associate with non-pathogenicity, contrary to what we would have expected.
In the case of frameshifts (Figure 11 B), higher means in the features involving NMD (NMD and derived NMD) and in those reflecting the importance of the isoform targeted (principal isoform, longest isoform and Pervasive isoform) associate with the Pathogenic label. NMD applies to frameshifts and splice truncations because in some instances these truncations are also classified as stop-gains. However, all the features reflecting the percentage of sequence affected and the damage in protein domains (except the domain matched feature) behave inversely to stop-gains: the mean is greater in Non-Pathogenic variants. This behaviour is also observed for the percentage of site positions affected and the maximum percentage of a domain affected.
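The comparison of class-specific means described above can be illustrated with a minimal sketch. The function name `class_specific_means` and the toy values are hypothetical and are not taken from the NutVar2 code; only the feature names follow Figure 11.

```python
from statistics import mean

def class_specific_means(rows, label_key="label"):
    """Return {label: {feature: mean}} for the numeric features in `rows`."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    result = {}
    for label, members in by_label.items():
        features = [k for k in members[0] if k != label_key]
        result[label] = {f: mean(m[f] for m in members) for f in features}
    return result

# Toy, made-up values (not thesis data); feature names follow Figure 11.
toy = [
    {"label": "Pathogenic", "PercentageOfDomainPositionsAffected": 80.0, "DomainMatched": 1},
    {"label": "Pathogenic", "PercentageOfDomainPositionsAffected": 60.0, "DomainMatched": 1},
    {"label": "Non-pathogenic", "PercentageOfDomainPositionsAffected": 20.0, "DomainMatched": 0},
    {"label": "Non-pathogenic", "PercentageOfDomainPositionsAffected": 40.0, "DomainMatched": 1},
]
means = class_specific_means(toy)
```

A feature whose mean is higher in the Pathogenic class than in the Non-pathogenic class corresponds to the behaviour expected a priori for most damage-related features.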
In the case of splice disruptions (Figure 11 C), the means of the features behave more similarly to stop-gains, but the small number of common splice disruptions analyzed (29) might be distorting our results.
Evaluation of the classifier performance
NutVar2 was built using the Naïve Bayes classification paradigm. Performance was evaluated by constructing ROC curves for the test subsets of the training data (leave-one-out cross-validation). The ROC curves are displayed in Figure 12.
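As an illustration of this evaluation scheme, the following sketch combines a Naïve Bayes classifier with leave-one-out cross-validation and computes the area under the ROC curve. It assumes Gaussian likelihoods for every feature, which is an assumption made for this example and not necessarily the variant used in NutVar2; all function names and the toy data are hypothetical.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean/variance of each class."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        n = len(rows)
        stats = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / n
            var = sum((v - mu) ** 2 for v in col) / n + 1e-6  # floor avoids zero variance
            stats.append((mu, var))
        model[label] = (n / len(y), stats)
    return model

def log_posterior(model, x, label):
    """Unnormalized log posterior of `label` given the feature vector `x`."""
    prior, stats = model[label]
    lp = math.log(prior)
    for v, (mu, var) in zip(x, stats):
        lp += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
    return lp

def loo_scores(X, y, positive=1):
    """Leave-one-out: score each instance with a model trained on all the others."""
    scores = []
    for i in range(len(X)):
        model = fit_gaussian_nb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        negative = next(lab for lab in model if lab != positive)
        scores.append(log_posterior(model, X[i], positive)
                      - log_posterior(model, X[i], negative))
    return scores

def auc(scores, y, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, lab in zip(scores, y) if lab == positive]
    neg = [s for s, lab in zip(scores, y) if lab != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy, made-up data: two features, three instances per class.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]
toy_auc = auc(loo_scores(X, y), y)
```

The sketch assumes two classes and at least two training instances per class in every fold.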
The results show the combination of sequence-based features with gene-based features from MacArthur et al. (panels A, B and C) or RVIS (panels D, E and F).
For stop-gains, the classifier using sequence-based features alone showed 68% accuracy (AUC = 0.68; Figure 12 A and D, SB), while its combination with gene-based features raised the accuracy to 80% and 76% (Figure 12 A and D, SB+GB, with MacArthur et al. and RVIS respectively). These results show no increase in overall performance for stop-gains with respect to the first release of NutVar [4].
For frameshifts, the accuracy was 77% with sequence-based features alone and 84% with the combination of sequence- and gene-based features from MacArthur et al. (Figure 12 B, SB and SB+GB). This represents an 11% increase in the Area Under the Curve (AUC) with respect to the first release of NutVar [4]. With RVIS as the gene-based feature, the combination reached an AUC of 0.67 (Figure 12 E).
Splice acceptor/donor disruptions show an accuracy of 91% with sequence-based features alone and 88% with the combination of sequence- and gene-based features from MacArthur et al. (Figure 12 C, SB and SB+GB); the combination with RVIS reached 87% (Figure 12 F). These variants were not analyzed in the first release of NutVar [4].
[Figure 11: bar charts, panels A) to C). Y axis: class-specific mean. X axis (features): ratioAffectedIsoformsTargetedbyNMD, ratioAffectedIsoformsTargetedby_derived_NMD, IsPrincipalIsoformAffected, IsWithinLongestCCDS, IsWithinPervasiveIsoform, LongestCCDSLength, PercentagePrincipalOrLongestCCDSAffected, PercentageOfDomainPositionsAffected, maxPercDomainAffected, NumberOfDomains100Damage, DomainMatched, PercentageOfSitePositionsAffected, maxPercSiteAffected, NumberOfSites100Damage, SiteMatched. Series: Pathogenic, Non Pathogenic.]
Figure 11. Mean values of sequence-based features in pathogenic and non-pathogenic variants. The fifteen sequence-based features were plotted against the values of the class variable (Pathogenic and ‘Non-Pathogenic’) for all the truncations in the training set. A) Stop-gains, B) Frameshifts and C) Splice donor/acceptor disruptions. The mean value of every feature in the Pathogenic and ‘Non-Pathogenic’ series is shown. Error bars depict the standard deviation of the mean; note, however, that the standard deviation is assumed to be equal between the two values of the class variable.
[Figure 12: ROC plots, panels A) to F). X axis: False Positive Rate, FP/(TN+FP). Y axis: True Positive Rate, TP/(TP+FN). AUC values from the panel legends:]
Panel  Gene-based features  SB    SB(r)  GB    GB+SB  GB+SB(r)  Samples
A)     MacArthur et al.     0.68  0.50   0.78  0.80   0.78      ---
B)     MacArthur et al.     0.77  0.50   0.81  0.84   0.81      ---
C)     MacArthur et al.     0.91  0.51   0.71  0.88   0.72      ---
D)     RVIS                 0.68  0.50   0.73  0.76   0.73      2834 Pos, 374 Neg
E)     RVIS                 0.77  0.50   0.55  0.67   0.55      2223 Pos, 2885 Neg
F)     RVIS                 0.91  0.51   0.71  0.87   0.69      578 Pos, 38 Neg
Figure 12. ROC curves evaluating the performance of NutVar2. A) Stop-gains, B) Frameshifts and C) Splice disruptions using the gene-based features from MacArthur et al. D) Stop-gains, E) Frameshifts and F) Splice disruptions using the RVIS score. SB = Sequence Based, GB = Gene Based. Dashed curves correspond to a randomization test in which the rows of the sequence-based features are shuffled column-wise (denoted by SB(r)).
DISCUSSION
NutVar2 versus implementations in SnpEff and VEP
Currently, large sequencing initiatives report that stop-gains, frameshifts and splice donor/acceptor disruptions that give rise to truncations are collectively prevalent in humans. The need to prioritize the study of these truncations according to their possible impact on human health is growing with the influx of genomic Big Data. NutVar2 provides a ranking of the severity of truncations relying on sequence-based features that can be combined with gene-based features to produce an overall score. Other software has been developed with the same objective. Two of the most widely used predictors of variant effects, SnpEff [12] and VEP [13], currently offer as a feature the prediction of the loss of function and of the pathogenicity of a variant, respectively. As mentioned earlier, the prediction of loss of function (LoF) in SnpEff is based on MacArthur et al. [2, 15]. Therefore, the main criterion for predicting LoF in SnpEff is the amount of sequence affected by the truncation, with variants affecting the last 5% of the target isoform being discarded [15] and the rest labelled as LoF positive. Given the different criteria involved in the NutVar2 score (importance of the isoform targeted, amount of sequence truncated, protein domains and functional sites lost), it is a much more refined score in terms of the subtleties LoF may depend upon.
Although it was not possible to ascertain the basis of the VEP prediction of pathogenicity, it must rely heavily on pre-existing databases of variants with known pathogenic consequences such as OMIM and ClinVar [11]. VEP predictions for the training set of NutVar2 display 99% and 91% agreement with the ClinVar classification for Pathogenic and ‘Likely Pathogenic’ variants respectively (n=15,370 and n=3,192, respectively). Furthermore, this prediction showed a lack of comprehensiveness, as most of the truncations in the training set without a ClinVar classification also lacked a VEP prediction (n=154,919, of which only 267 were classified in ClinVar). In summary, in our hands the VEP predictor of pathogenicity largely reflects pre-existing data.
To further demonstrate the advantages of NutVar2 over the SnpEff and VEP implementations, and as part of a practical exercise for a job offer, a series of Roche 454 human DNA reads were trimmed, mapped to GRCh37.75, subjected to VEP and SnpEff, and their potential pathogenicity evaluated with NutVar2. The reads corresponded to frameshift variants of the BRCA1 and BRCA2 genes, implicated in breast cancer.
POS          REF>ALT  GENE   ClinVar CLNSIG     SnpEff LoF  VEP Pathog.  NutVar Rank Per. (SB)  NutVar Rank Per. (SB+GB)
13:32906565  CA>C     BRCA2  ---                True(50%)   ---          55.10                  96.39
13:32906576  CA>C     BRCA2  ---                True(50%)   ---          55.09                  96.39
13:32906602  GA>G     BRCA2  Pathog.            True(50%)   ---          55.03                  96.37
13:32907202  TA>T     BRCA2  ---                True(50%)   ---          52.83                  96.12
13:32911357  CA>C     BRCA2  ---                True(33%)   ---          48.63                  95.79
13:32911442  GA>G     BRCA2  ---                True(33%)   ---          48.36                  95.74
13:32912345  GA>G     BRCA2  Pathog.            True(33%)   ---          45.27                  95.43
13:32912655  CT>C     BRCA2  Pathog.            True(33%)   ---          44.25                  95.35
13:32913079  GA>G     BRCA2  ---                ---         Pathog.      42.75                  95.27
13:32913422  GA>G     BRCA2  Pathog.            True(33%)   ---          41.64                  95.21
13:32913836  CA>C     BRCA2  Pathog.            True(33%)   ---          40.18                  95.04
13:32953632  CA>C     BRCA2  ---                True(33%)   ---          28.79                  93.66
13:32953640  GA>G     BRCA2  ---                ---         ---          28.76                  93.66
13:32954022  CA>C     BRCA2  Pathog.            True(33%)   Pathog.      28.44                  93.60
13:32972589  GA>G     BRCA2  Likely Pathogenic  ---         ---          5.53                   78.05
17:41245118  G>GGT    BRCA1  ---                True(19%)   ---          95.11                  99.93
17:41245586  CT>C     BRCA1  Pathog.            True(19%)   Pathog.      96.10                  99.93
17:41246531  C>CG     BRCA1  ---                True(32%)   ---          97.83                  99.95
Table 9. Human frameshift variants in genes BRCA1 and BRCA2. POS = position in the format Chromosome:Genomic coordinate, REF>ALT = change in nucleotides between the reference allele (REF) and the observed variant (ALT), ClinVar CLNSIG = clinical significance obtained from ClinVar, SnpEff LoF = result of the prediction of loss of function obtained from SnpEff, VEP Pathog. = result of the VEP prediction of pathogenicity. NutVar Rank Per. (SB) = percentiles of the NutVar2 rank based only on sequence-based features. NutVar Rank Per. (SB+GB) = percentiles of the NutVar2 rank combining sequence- and gene-based features.
As shown in Table 9, the sequence-based features of NutVar2 produce a rank, for which percentiles are shown (Table 9, NutVar Rank Per. (SB)), reflecting the molecular impact of the truncations in BRCA1 and BRCA2. While the percentiles of the BRCA2 variants range from 5 to 55, those of BRCA1 stay at 95 or above. This means that the variants affecting BRCA2 have a lower molecular impact. However, when gene-based features are added (NutVar Rank Per. (SB+GB)), the percentiles for both genes rise to 93 or above (except for a ‘Likely Pathogenic’ variant in BRCA2, 13:32972589 GA>G). The greater likelihood of being pathogenic when sequence and gene features are combined comes as no surprise given the essentiality of both genes and their widely acknowledged role in breast cancer.
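How a score can be turned into a rank percentile is sketched below. The definition used here (percentage of reference scores less than or equal to the query score) is an assumption made for illustration; the exact definition used by NutVar2 is not stated in this section, and the function name and values are hypothetical.

```python
from bisect import bisect_right

def percentile_rank(score, reference_scores):
    """Percentage of reference scores that are less than or equal to `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

# Toy, made-up reference distribution of classifier scores.
reference = [0.1, 0.2, 0.5, 0.9]
rank_mid = percentile_rank(0.5, reference)
rank_low = percentile_rank(0.0, reference)
```

Under this definition, a variant in a high percentile scores above most of the variants in the reference set, which is how the BRCA1 rows of Table 9 would be read.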
The VEP prediction of pathogenicity for these BRCA1 and BRCA2 frameshift variants is also displayed in Table 9 (VEP Pathog.). It is absent for most of the variants and, when present, it matches the ClinVar classification (ClinVar CLNSIG) in almost all cases.
The LoF prediction of SnpEff is also shown in Table 9. The SnpEff estimate indicates whether or not there is LoF (True or False) and the percentage of the total transcripts in the gene affected by it [15]. It is difficult to appraise from this indicator the possible outcome of the variants in terms of impact on health: for BRCA2, LoF affecting more transcripts generally correlates with a higher NutVar2 percentile, whereas for BRCA1 LoF affects few transcripts and yet the NutVar2 percentile is very high. Even for variants whose implication in breast cancer is established in ClinVar (17:41245586 CT>C), the LoF prediction of SnpEff is difficult to interpret (True (19%)).
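The kind of indicator discussed here can be sketched as follows. This is a simplified reading of the criterion described in the text (truncations in the last 5% of an isoform are discarded; the percentage of affected transcripts is reported), not SnpEff's actual implementation; the function name, the input encoding and the values are hypothetical.

```python
def lof_call(truncation_fractions, tail_fraction=0.05):
    """Return (is_lof, percentage of transcripts affected).

    `truncation_fractions` holds, for every transcript of the gene, the relative
    position (0..1) of the truncation along the coding sequence, or None when the
    transcript is not hit. Truncations within the last `tail_fraction` of a
    transcript are not counted as LoF for that transcript.
    """
    affected = [f for f in truncation_fractions
                if f is not None and f <= 1.0 - tail_fraction]
    percentage = 100.0 * len(affected) / len(truncation_fractions)
    return (len(affected) > 0, percentage)

# Toy example: three transcripts, one truncated early, one not hit, one hit in its tail.
call = lof_call([0.2, None, 0.97])
```

The output carries no information on which isoform is hit or on the domains lost, which is why a value such as True (19%) is hard to interpret on its own.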
In summary, the implementations developed for SnpEff and VEP lack the comprehensiveness or the pathogenicity-focused scope of NutVar2 and provide worse indicators for prioritizing the study of truncations.
NutVar2 versus NutVar1
The main upgrades between NutVar1 and NutVar2 are: the inclusion of splice acceptor/donor variants, a bigger training set, the adaptation to the minimal representation format, the dual use of SnpEff and VEP, the use of ENSEMBL as the basis for transcript annotation, new transcript features (Pervasive isoform) and new protein features (functional sites). All these features and improvements allow for the use of new classification algorithms, which are currently being developed for NutVar2, and broaden its range of potential users.
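The idea behind the minimal representation of a variant can be sketched as follows: bases shared by REF and ALT are trimmed first from the right and then from the left, keeping at least one base in each allele. This is an illustrative sketch of the general technique, not the code used in NutVar2; the function name and the examples are hypothetical.

```python
def minimal_representation(pos, ref, alt):
    """Trim shared trailing, then shared leading, bases from REF and ALT."""
    # Trim from the right while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim from the left, shifting the position for every base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# A deletion written with redundant trailing bases, and a SNV written as a 3-mer.
deletion = minimal_representation(1001, "CTCC", "CCC")
snv = minimal_representation(100, "GAT", "GCT")
```

Symbolic alleles such as <DEL> carry no sequence to compare and would need to be handled before applying this trimming.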
NutVar2 is devised as a standalone tool that can be distributed and downloaded from the web server and applied to the user's own VCF files without the need to perform any transformation. It is designed to be robust (allowing for single or joint variant calling) and to include as many variants as possible (inclusion of <DEL> and < . > alternative alleles). Importantly, all of the files and Pre-C tables used are built from files downloaded directly from the FTP servers of ENSEMBL, UniProt and InterPro. This has been done to make it easier to update to changes such as the recent new assembly of the human genome (GRCh38). Again, our goal has been to increase the adaptability of our software to the user's needs.
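A minimal sketch of how the ALT column of a VCF data line might be split and classified is given below. It assumes the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, ...) and treats "." as a missing allele and any allele enclosed in angle brackets as symbolic; how NutVar2 itself processes these alleles is not described here, and the function name is hypothetical.

```python
def split_alt_alleles(vcf_line):
    """Return one (chrom, pos, ref, alt, kind) tuple per ALT allele of a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt_field = fields[0], int(fields[1]), fields[3], fields[4]
    records = []
    for alt in alt_field.split(","):  # joint calling may list several ALT alleles
        if alt == ".":
            kind = "missing"
        elif alt.startswith("<") and alt.endswith(">"):
            kind = "symbolic"
        else:
            kind = "sequence"
        records.append((chrom, pos, ref, alt, kind))
    return records

# Made-up data line with three alternative alleles.
records = split_alt_alleles("13\t32906565\t.\tCA\tC,<DEL>,.\t.\tPASS\t.\n")
```

Splitting multi-allelic lines into one record per allele lets each alternative allele be annotated separately.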
The inclusion of VEP as an alternative to SnpEff, or even the use of both of them, seeks to widen the range of potential users of NutVar2. This functionality also offers the possibility of restricting the analysis to variants predicted to have the same effect by both programs, thus decreasing the rate of false positives due to annotation errors.
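Restricting the analysis to concordant predictions can be sketched as a simple intersection. The sketch assumes that the effects reported by both programs have already been mapped to a common vocabulary and keyed by (variant, transcript); the function name, keys and labels are hypothetical.

```python
def concordant_variants(snpeff_effects, vep_effects):
    """Keep the (variant, transcript) keys to which both annotators assign the same effect."""
    return {key: effect for key, effect in snpeff_effects.items()
            if vep_effects.get(key) == effect}

# Made-up annotations for three variant/transcript pairs.
snpeff = {
    ("13:32906565:CA>C", "T1"): "frameshift",
    ("17:41245586:CT>C", "T2"): "frameshift",
    ("1:1000:G>A", "T3"): "stop_gained",
}
vep = {
    ("13:32906565:CA>C", "T1"): "frameshift",
    ("17:41245586:CT>C", "T2"): "splice_donor",
}
agreed = concordant_variants(snpeff, vep)
```

Pairs annotated by only one program, or annotated differently by the two, are dropped from the analysis.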
The use of ENSEMBL instead of CCDS as the basis for transcript annotation reflects our effort to extend the reach of the NutVar prediction to as many protein-coding transcripts as possible. Nevertheless, users wanting to restrict the analysis to CCDS transcripts can do so in the pipeline.
The mining of new features with respect to NutVar1 (Pervasive isoform, number of domains/sites with 100% damage and total of domain/site positions affected) was undertaken to increase the number of feature variables available for classification. Our goal has been to offer more cues for training the classifier. Some of them, such as “number of domains with a 100% damage” (Figure 11), may have predictive value and will be instrumental in the development of a new classification algorithm that we are currently evaluating. In this sense, we are planning to move to a TAN-Bayes or a decision tree (C4.5) paradigm, because both of them exploit the mutual information that we think some of our features share (Figure 11).
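The mutual information between two discretized features can be estimated from their joint frequencies, as in the sketch below. TAN classifiers typically rely on the mutual information between features conditioned on the class; the unconditional quantity is shown here for simplicity. The function name and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi

# Two identical binary features share one bit; two independent ones share none.
mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Features with high mutual information violate the independence assumption of Naïve Bayes, which is the motivation for considering paradigms that model such dependencies.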
The results of NutVar2 performance (Figure 12) show a great improvement for frameshifts, no increase in AUC for stop-gains and a high accuracy in the classification of splice truncations (around 90%).
It should be noted, however, that to compare two classifiers both must use the same learning/test set. NutVar1 was trained with a smaller training set comprising the variants available at the time. In fact, when the NutVar2 training set was used to train NutVar1, a similar increase in AUC was observed for frameshifts (data not shown), arguing against the capacity of the features we have selected to increase the accuracy of the classifier under the Naïve Bayes paradigm. As mentioned before, we are planning to change the classification algorithm to try to profit from the new features we have mined.
[Figure 13: two bar-chart panels, one bar per gene. Axes: sequence-based pathogenicity and gene-based pathogenicity (MacArthur in one panel, RVIS in the other), each on a 0 to 100 scale. Legend: Homozygous stop-gain (MAF>=0.1), Heterozygous stop-gain (MAF>=0.1), Heterozygous stop-gain (MAF<0.1). Genes shown: VPS13B, MOB3C, PDE4DIP, IL17RB, CYP2A7, CPN2, OR2L8, DISP1, IL34, KRTAP4−5, C17orf107, FAM187B, C5orf20, TRPM1, AKR1E2, OR4D1, RFPL1, OR10X1, INTS4, CACNA2D4, PRAMEF10, FAM187B, SLC6A18, OR5AR1, NOL4, PCNXL2, ATP5G2, SLC6A9, FCGR1A, TTN, EFCAB13, COL4A4, LAMA3, ENPEP, CASP12, SLC22A10, FCRL3, MROH2B, TRPM4, ALDH1L1.]
Figure 13. The genome of J. Craig Venter analyzed with NutVar1. MAF = Minor Allele Frequency in the general population. Orange = truncations in homozygosis; green and gray = truncations in heterozygosis.
With respect to the impressive accuracy of the classifier for splice variants, we believe the classifier is overfitted due to the paucity of Non-Pathogenic instances (27 versus 522 Pathogenic). We plan to increase the set of splice variants either by using new releases of large sequencing projects and/or by reassessing the effect derived from SnpEff and VEP.
Future Perspectives
With the myriad of genomics data coming from NGS technologies expected to keep growing exponentially every year, NutVar2 provides a tool for end users to rank and prioritize the study of the variants in their own VCF files. NutVar2 limits its scope to truncating variants for the time being; however, some of its features, for instance the principal isoform feature or the gene-based features, can also be applied to missense variants. These features can be combined with current scores that assess the functional consequences of amino acid substitutions, such as PolyPhen, SIFT or Condel, in an effort to extend the reach of our tool to missense variants.
Restricting ourselves to truncations, we envisage that NutVar will produce results such as the one displayed in Figure 13. A public genome (in this case J. Craig Venter's) with phased information that allows haplotypes to be constructed was used by Antonio Rausell in combination with NutVar1 to create a graph of Venter's truncations plotting the sequence-based score against the gene-based score. In addition, as haplotypes could be constructed, the truncations are colored according to whether they are in homozygosis (orange) or in heterozygosis (green or gray depending on the MAF). Figure 13 provides a very visual and direct approach to ascertaining the outcome of the truncations present in Venter's genome: most of the truncations affecting important genes in homozygosis have low molecular impact, consistent with little functional consequence. As the molecular impact of the truncations in homozygosis grows (bars to the right in Figure 13), the essentiality of the genes involved decreases; in living individuals, severe truncations in homozygosis lie in “dispensable” genes.
Truncations in heterozygosis (green and gray) behave differently, with some severe truncations affecting essential genes (the two rightmost bars). The reason for this tolerance is the existence of a backup copy for those variants in the diploid genome of the individual. However, there are genes that so far have never been observed to tolerate severe truncations in heterozygosis. These haploinsufficient genes, which require a high degree of functionality in both alleles of the locus, comprise a special set of ‘essential genes’. Rausell et al. have extensively studied these genes and have recently submitted a manuscript (currently under evaluation) to which I contributed [23].
Last but not least, we envisage that the progressive completion of systems biology networks will entail their inclusion as a third type of feature in the classifier. Currently, we evaluate variants as single entities unrelated to other variants. It is true that essential genes tend to be central genes in protein-protein networks and thus that information is partially captured in gene-based features. However, the combinations and interactions between genes are not completely captured in evolutionary parameters. Therefore, these features are very likely to play a future role in functional genomics as a means of evaluating the overall effect of variants in the biological network.
CONCLUSIONS
• NutVar2 works as a standalone tool that can be distributed and downloaded.
• The implementations and extended functionalities in NutVar2 increase its reach of potential users and provide new cues that can be used to change its classification paradigm.
REFERENCES
1. Thayer AM: Next-Gen Sequencing Is A Numbers Game. In.; 2014.
2. MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB, Tyler-Smith C: A systematic survey of loss-of-function variants in human protein-coding genes. Science 2012, 335(6070):823-828.
3. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB: Genic intolerance to functional variation and the interpretation of personal genomes. PLoS genetics 2013, 9(8):e1003709.
4. Rausell A, Mohammadi P, McLaren PJ, Bartha I, Xenarios I, Fellay J, Telenti A: Analysis of stop-gain and frameshift variants in human innate immunity genes. PLoS computational biology 2014, 10(7):e1003757.
5. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA: An integrated map of genetic variation from 1,092 human genomes. Nature 2012, 491(7422):56-65.
6. Rodriguez JM, Maietta P, Ezkurdia I, Pietrelli A, Wesselink JJ, Lopez G, Valencia A, Tress ML: APPRIS: annotation of principal and alternative splice isoforms. Nucleic acids research 2013, 41(Database issue):D110-117.
7. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease network. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(21):8685-8690.
8. Minikel E: Converting variants to their minimal representation. In.; 2014.
9. [http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=7&ved=0CFMQFjAG&url=http%3A%2F%2Fwww.tiem.utk.edu%2F~gross%2FMath589Spring2010%2FChapter2.ppt&ei=XcGmVNGTG4XqUtybhKgL&usg=AFQjCNHkrVWlEGpTacpsBcP_ht-KgjC3-g&sig2=8BxusLAJ0379cpw0vL1cwA&bvm=bv.82001339,d.d24]
10. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR: ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic acids research 2014, 42(Database issue):D980-985.
11. Clinical Significance [http://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/]
12. snpEff [http://snpeff.sourceforge.net/]
13. Variant Effect Predictor (VEP) [http://www.ensembl.org/info/docs/tools/vep/index.html]
14. ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz]
15. Cingolani P: We adopted a definition for LoF variants expected to correlate with complete loss of function of the affected transcripts. In. Edited by Tardaguila M; 2014.
16. Gonzalez-Porta M, Frankish A, Rung J, Harrow J, Brazma A: Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome biology 2013, 14(7):R70.
17. UniProt main [http://www.uniprot.org/]
18. InterPro main [http://www.ebi.ac.uk/interpro/]
19. [http://www.uniprot.org/help/function_section]
20. BLAST Glossary [http://www.ncbi.nlm.nih.gov/books/NBK62051/]
21. Schaffner SS, P.: Evolutionary adaptation in the human lineage. In: Nature Education. 2008.
22. Larrañaga P: Supervised Classification Lesson. In. Edited by Tardáguila M; 2014.
23. István Bartha AR, Paul McLaren, Manuel Tardaguila, Pejman Mohammadi, Jacques Fellay, Amalio Telenti: Heterozygous gene truncation delineates the human haploinsufficient genome. In.; 2014.
ANNEXES:
• MAP OF DEPENDENCIES OF THE CONSTRUCTION OF THE PRE-CALCULATED TABLE (I)
• MAP OF DEPENDENCIES OF THE ANNOTATION PHASE OF NUTVAR2 (II)
• MAP OF DEPENDENCIES OF THE CONSTRUCTION AND ANNOTATION OF THE TRAINING SET (III)
BLAST SOFTWARE
ENSEMBL release 75 gtf
DOWNLOAD
Homo_sapiens.GRCh37.75.gtf
3_parseo_del_gtf_8.0.pl
gtf_output_ENSG.txt
gtf_output_ENST.txt
gtf_output_EXON.txt
gtf_output_CDS.txt
gtf_output_start_codons.txt
gtf_output_UTR.txt
gtf_output_Selenocysteine.txt
4_def_conversor_de_ficheros_UTR_CDS_etc_múltiples_en_individuales.pl
gtf_output_stop_codons.txt
gtf_output_JOINED_CDS.txt
gtf_output_JOINED_start_codons.txt
gtf_output_JOINED_UTR.txt
gtf_output_JOINED_Selenocysteine.txt
gtf_output_JOINED_stop_codons.txt
5_gtf_ENSG_ENST_EON.pl
gtf_ENSG_ENST_EXON.txt
6_gtf_tabladef_3.0.pl
gtf_tabladef.txt
sort -t $'\t' -k3,3 data/build_tables/gtf_tabladef.txt > data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt
gtf_tabladef_sorted_by_SYMBOL.txt
7_Position_ENST_converter.pl
TRANSCRIPTS_table.txt
8_TRANSCRIPT_TABLE_condenser_1_4.0.pl
ENST_table_midC.txt
ENST_table_full_condensed.txt
ENSEMBL
UNIPROT & INTERPRO
UNIPROT release-2014_07
DOWNLOAD
uniprot_sprot.dat
INTERPRO release-2014_07
DOWNLOAD
protein2ipr.dat
HUMAN.fasta
14_parse_del_ID_UNIPROT_human_16_prueba.pl
Equiv_ENSG_seq.txt
15_new_mapping.pl
ENSEMBL_includes_UNIPROT.txt
19_NEW_PROTEIN_POSITION_CONVERTER_DEF_corregida_dfeature.pl
PROTEIN_condensed.txt
18_NEW_2_PROTEINAS_CONDENSER.pl
PROTEIN_Re_mapping.txt
18_NEW_3_full_condensed_per_domain.pl
PROTEIN_full_condensed_feature.txt
22_MAPEO_ISOFORMAS_NO_DISPLAYED.pl
ALL_ISOFORMS_PROTEIN_table.txt
23_ISOFORMAS_NO_DISPLAYED_CONDENSER_midC.pl
ALL_ISOFORMS_PROTEIN_table_midC.txt
ALL_ISOFORMS_PROTEIN_table_full.txt
24_TABLA_DOMAIN.pl
ALL_ISOFORMS_DOMAIN_table_full.txt
25_NMD_table.pl
NMD_table.txt
OTHERS
appris_principal_isoform_gencode_19_15_10_2014.txt
Pervasive.txt
pRDG2.txt
Genes_AllInnateImmunity.txt
Genes_Antiviral.txt
Genes_ISGs.txt
Genes_OMIMrecessive.txt
RVIS2.txt
ALL_ISOFORMS_DOMAIN_table_midC.txt
db.fasta
no_aligned.fasta
scp -r data/build_tables/db.fasta /home/bioinfo/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/db.fasta
makeblastdb -in db.fasta -parse_seqids -dbtype prot
blastp -db db.fasta -query /home/bioinfo/Dropbox/nutvar2/data/build_tables/no_aligned.fasta -outfmt 3 > resultsfmt3.out
blastp -db db.fasta -query /home/bioinfo/Dropbox/nutvar2/data/build_tables/no_aligned.fasta -outfmt 7 > resultsfmt7.out
resultsfmt3.out resultsfmt7.out
16_Jalview_multiple_alignment.pl
selected_alignments.txt
16_Jalview_multiple_alignment_parteII.pl
aligned_features_coordinates.txt
16_NEW_SELECT_HUMAN_FROM_INTERPRO.pl
protein2ipr_human.dat
17_INTERPRO_UNIPROT_SITES_DOMAINS_COORDINATES_MULTIDOMAIN.pl
site_coordinates.txt multidomain.txt
17_MULTIDOMAIN_parte_II_2.0.pl
multidomain_midC.txt
sort -k3,3 -k4,4 -k 1n data/build_tables/multidomain_midC.txt > data/build_tables/multidomain_midC_sorted.txt
17_MULTIDOMAIN_parte_II_2.0.pl
multidomain_midC_sorted.txt
17_MULTIDOMAIN_parte_III.pl
domain_cordinates.txt
18_Realocator_coordinates.pl
aligned_features_coordinates_midC.txt
aligned_no_display_offset.txt
18_Realocator_coordinates_parteII.pl
UNIPROT_PLUS.txt
ENSEMBL release 75 gtf
DOWNLOAD
Homo_sapiens.GRCh37.75.cdna.all.fa
Homo_sapiens.GRCh37.75.cds.all.fa
26_cDNA_genomic_coordinate.pl
cDNA_genomic_coordinates.txt
27_CDS_genomic_coordinate.pl
CDS_genomic_coordinates.txt
28_compresor_CDS.pl
CDS_genomic_coordinates_full_compresed.txt
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/ Homo_sapiens.GRCh37.75.cdna.all.fa.gz
perl bin/build_tables/26_cDNA_genomic_coordinate.pl data/build_tables/NMD_table.txt ~/../../media/bioinfo/9631bd0c-4897-4987-8651-994c6463aa73/external_files_high_volume/Human_genome/Homo_sapiens.GRCh37.75.cdna.all.fa data/build_ta-bles/cDNA_genomic_coordinates.txt
perl bin/build_tables/27_CDS_genomic_coordinate.pl ~/../../media/bioinfo/9631bd0c-4897-4987-8651-994c6463aa73/external_fi-les_high_volume/Human_genome/Homo_sapiens.GRCh37.75.cds.all.fa data/build_tables/cDNA_genomic_coordinates.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/CDS_genomic_coordinates.txt
perl bin/build_tables/28_compresor_CDS.pl ~/Escritorio/Proyecto_clasificador/Raw_Data/CDS_genomic_coordinates.txt ~/Es-critorio/Proyecto_clasificador/Raw_Data/CDS_genomic_coordinates_full_compresed.txt
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/ Homo_sapiens.GRCh37.75.cds.all.fa.gz
ORDERSwget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
perl bin/build_tables/5_gtf_ENSG_ENST_EON.pl data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_EXON.txt data/build_tables/gtf_ENSG_ENST_EXON.txt
perl bin/build_tables/6_gtf_tabladef_3.0.pl data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_JOINED_CDS.txt data/build_tables/gtf_output_JOINED_UTR.txt data/build_tables/gtf_output_JOI-NED_START.txt data/build_tables/gtf_output_JOINED_STOP.txt data/build_tables/gtf_tabladef.txt
perl bin/build_tables/8_TRANSCRIPT_TABLE_condenser_1_4.0.pl ~/Escritorio/Proyecto_clasificador/Raw_Data/TRANS-CRIPTS_table.txt data/build_tables/ENST_table_midC.txt data/build_tables/ENST_table_full_condensed.txt
perl bin/build_tables/New_mapping_scripts/14_parse_del_ID_UNIPROT_human_16_prueba_API_independent.pl data/build_ta-bles/gtf_output_ENSG.txt data/external/HUMAN.fasta data/external/uniprot_sprot_human.dat data/external/Homo_sapiens.GRCh37.75.pep.all.fa data/build_tables/Equiv_ENSG_seq.txt
perl bin/build_tables/New_mapping_scripts/15_new_mapping.pl data/build_tables/Equiv_ENSG_seq.txt data/build_tables/EN-SEMBL_includes_UNIPROT.txt data/build_tables/db.fasta data/build_tables/no_aligned.fasta
scp -r data/build_tables/db.fasta /home/bioinfo/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/db.fasta
makeblastdb -in db.fasta -parse_seqids -dbtype prot
perl bin/build_tables/3_parseo_del_gtf_8.0.pl data/external/Homo_sapiens.GRCh37.75.gtf data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_EXON.txt data/build_tables/gtf_output_CDS.txt data/build_tables/gtf_output_start_codons.txt data/build_tables/gtf_output_stop_codons.txt data/build_tables/gtf_output_UTR.txt data/build_tables/gtf_output_Selenocysteine.txt
perl bin/build_tables/4_def_conversor_de_ficheros_UTR_CDS_etc_m√∫ltiples_en_individuales.pl data/build_tables/gtf_output_UTR.txt data/build_tables/gtf_output_start_codons.txt data/build_tables/gtf_output_CDS.txt data/build_tables/gtf_output_Selenocysteine.txt data/build_tables/gtf_output_stop_codons.txt data/build_tables/gtf_output_JOINED_UTR.txt data/build_tables/gtf_output_JOINED_START.txt data/build_tables/gtf_output_JOINED_CDS.txt data/build_tables/gtf_output_JOI-NED_Seleno.txt data/build_tables/gtf_output_JOINED_STOP.txt
perl bin/build_tables/New_mapping_scripts/16_Jalview_multiple_alignment_parteII.pl data/build_tables/selected_alignments.txt data/build_tables/no_aligned.fasta data/build_tables/db.fasta ~/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/resultsfmt3.out data/build_tables/aligned_features_coordinates.txt
perl bin/build_tables/New_mapping_scripts/16_NEW_SELECT_HUMAN_FROM_INTERPRO.pl ~/../../media/bioinfo/Elements/Proyecto_clasificador_28_07_2014_NO_TOCAR/Documentos/UNIPROT/HUMAN.fasta ~/../../media/bioinfo/Elements/Proyec-to_clasificador_28_07_2014_NO_TOCAR/Documentos/INTERPRO/protein2ipr.dat data/external/protein2ipr_human.dat
sort -t $’\t’ -k3,3 data/build_tables/gtf_tabladef.txt > data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt
perl bin/build_tables/7_Position_ENST_converter.pl data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt data/build_tables/TRANSCRIPTS_table.txt
wget ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2014_07/ + protein2ipr.dat + HUMAN.fasta
perl bin/build_tables/New_mapping_scripts/17_INTERPRO_UNIPROT_SITES_DOMAINS_COORDINATES_MULTIDOMAIN.pl ~/../../media/bioinfo/Elements/Proyecto_clasificador_28_07_2014_NO_TOCAR/Documentos/UNIPROT/HUMAN.fasta ~/../../me-dia/bioinfo/Elements/Proyecto_clasificador_28_07_2014_NO_TOCAR/Documentos/UNIPROT/uniprot_sprot.dat data/external/protein2ipr_human.dat data/build_tables/site_coordinates.txt data/build_tables/multidomain.txt
perl bin/build_tables/18_NEW_3_full_condensed_per_domain.pl data/build_tables/PROTEIN_condensed.txt data/build_tables/PROTEIN_full_condensed_feature.txt
perl bin/build_tables/22_MAPEO_ISOFORMAS_NO_DISPLAYED.pl data/build_tables/gtf_output_ENSG.txt data/build_tables/PROTEIN_full_condensed_feature.txt data/build_tables/ENST_table_full_condensed.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/ALL_ISOFORMS_PROTEIN_table.txt
perl bin/build_tables/23_ISOFORMAS_NO_DISPLAYED_CONDENSER_midC.pl ~/Escritorio/Proyecto_clasificador/Raw_Data/ALL_ISOFORMS_PROTEIN_table.txt data/build_tables/ALL_ISOFORMS_PROTEIN_table_midC.txt data/build_tables/ALL_ISOFORMS_PROTEIN_table_full.txt
perl bin/build_tables/25_NMD_table.pl data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_ENSG_ENST_EXON.txt data/build_tables/NMD_table.txt
perl bin/build_tables/24_TABLA_DOMAIN_2.0.pl data/build_tables/ALL_ISOFORMS_PROTEIN_table_full.txt data/build_tables/ALL_ISOFORMS_DOMAIN_table_midC.txt data/build_tables/ALL_ISOFORMS_DOMAIN_table_full.txt
blastp -db db.fasta -query /home/bioinfo/Dropbox/nutvar2/data/build_tables/no_aligned.fasta -outfmt 3 > resultsfmt3.out
blastp -db db.fasta -query /home/bioinfo/Dropbox/nutvar2/data/build_tables/no_aligned.fasta -outfmt 7 > resultsfmt7.out
perl bin/build_tables/New_mapping_scripts/16_Jalview_multiple_alignment.pl ~/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/resultsfmt7.out data/build_tables/selected_alignments.txt
From: /home/bioinfo/SOFTWARE/BLAST/ncbi-blast-2.2.30+/db/db.fasta
perl bin/build_tables/New_mapping_scripts/17_MULTIDOMAIN_parte_II_2.0.pl data/build_tables/multidomain.txt data/build_tables/multidomain_midC.txt
sort -k3,3 -k4,4 -k 1n data/build_tables/multidomain_midC.txt > data/build_tables/multidomain_midC_sorted.txt
perl bin/build_tables/New_mapping_scripts/17_MULTIDOMAIN_parte_III.pl data/build_tables/multidomain_midC_sorted.txt data/build_tables/domain_coordinates.txt
perl bin/build_tables/18_NEW_2_PROTEINAS_CONDENSER.pl ~/Escritorio/Proyecto_clasificador/Raw_Data/PROTEIN.txt data/build_tables/PROTEIN_condensed.txt
perl bin/build_tables/New_mapping_scripts/18_Realocator_coordinates.pl data/build_tables/aligned_features_coordinates.txt data/build_tables/aligned_features_coordinates_midC.txt data/build_tables/aligned_no_display_offset.txt
perl bin/build_tables/New_mapping_scripts/18_Realocator_coordinates_parteII.pl data/build_tables/site_coordinates.txt data/build_tables/domain_coordinates.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/ENSEMBL_includes_UNIPROT.txt data/build_tables/aligned_no_display_offset.txt data/build_tables/UNIPROT_PLUS.txt
perl bin/build_tables/New_mapping_scripts/19_NEW_PROTEIN_POSITION_CONVERTER_DEF_corregida_dfeature.pl data/build_tables/UNIPROT_PLUS.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/TRANSCRIPTS_table.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/PROTEIN_Re_mapping.txt
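The table-building commands above sort gtf_tabladef.txt on its third, tab-separated column. The following is a minimal sketch of that sort on a toy table; the file name and the rows are hypothetical and only illustrate the separator and key options. In Bash a literal tab can be written as $'\t'; the sketch builds the tab with printf so that it also runs under a plain POSIX shell.

```shell
#!/usr/bin/env bash
# Toy three-column, tab-separated table (made-up rows, not real annotation data).
tmpdir=$(mktemp -d)
printf 'id2\tx\tgeneC\nid1\ty\tgeneA\nid3\tz\tgeneB\n' > "$tmpdir/toy.txt"

# Build a literal tab portably; in Bash this is equivalent to $'\t'.
tab=$(printf '\t')

# -t sets the field separator, -k3,3 restricts the sort key to column 3 only.
LC_ALL=C sort -t "$tab" -k3,3 "$tmpdir/toy.txt" > "$tmpdir/toy_sorted.txt"

# Column 3 of the first sorted row.
first_symbol=$(head -n 1 "$tmpdir/toy_sorted.txt" | cut -f3)
echo "$first_symbol"
```

Without -t, sort splits fields on blank-to-non-blank transitions, so any column containing spaces would shift the key; naming the tab explicitly avoids that.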
[Figure: workflow map of the annotation pipeline, distinguishing scripts shared by both annotators, snpEff-exclusive scripts, VEP-exclusive scripts and the CCDS version. Recoverable labels: ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf, 2_script_minimal_representation_vcf_7.0.pl, OUT_FILE_minimal_representation.vcf; 10X variant_effec_output_second_round.txt; gtf_tabladef_4_sorted_by_SYMBOL.txt; ALL_ISOFORMS_DOMAIN_table_full.txt, ENST_table_full_condensed.txt, ALL_ISOFORMS_PROTEIN_table_full.txt, CDS_genomic_coordinates_full_compresed.txt; 32_key_PROTEINS_8.0_GLOBAL_3.0.pl, SORT and 32_key_PROTEINS_8.0_GLOBAL_ParteII.pl with the snpeff_detailed_ProtAndSite and snpeff_DOMAINS Pre_step, Pre_step_ordered and Post_step files; 53BIS_Fuse_Matrix\&Gene_based.pl with pRDG2.txt, Genes_AllInnateImmunity.txt, Genes_Antiviral.txt, Genes_ISGs.txt, Genes_OMIMrecessive.txt and RVIS2.txt; Matrix_vep_added_gene_based_scores.txt and Matrix_vep_CCDS_added_gene_based_scores.txt.]
ORDERS
perl bin/shared/2_Script_minimal_representation_vcf_7.0.pl test/example.vcf data/intermediate/example_mr.vcf
(2 h 40 min for the 1GK set. Requires internet to undo <DEL>.)
perl ~/Escritorio/Proyecto_clasificador/SOFTWARE/ensembl-tools-release-75/scripts/variant_effect_predictor/variant_effect_predictor.pl -i data/intermediate/example_mr.vcf --offline --output_file data/intermediate/vep_example_mr.vcf --everything --vcf --cache --dir ~/Escritorio/Proyecto_clasificador/SOFTWARE/./.vep/tmp/
(Overnight for 10 chunks of the 1GK set.)
perl bin/VEP/24_VEP_parser_def_minus_heather.pl data/intermediate/vep_example_mr.vcf data/intermediate/out_vep_parsed.txt
perl bin/shared/25_Downstream_frameshift_API_independent_5.0.pl data/intermediate/out_snpeff_parsed.txt ~/Escritorio/Proyecto_clasificador/Raw_Data/CDS_genomic_coordinates_full_compresed.txt data/intermediate/snpeff_derived_PTCS_API_independent.txt
perl bin/shared/26_NEW_EXTRA_key_%_sequence_2.0.pl data/intermediate/out_snpeff_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/snpeff_percentage.txt
perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl data/intermediate/snpeff_detailed_ProtAndSite_Pre_step_ordered.txt data/intermediate/snpeff_detailed_ProtAndSite_Post_step.txt
sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 data/intermediate/snpeff_DOMAINS_Pre_step.txt > data/intermediate/snpeff_DOMAINS_Pre_step_ordered.txt
perl bin/shared/27_key_NMD_5.0_DERIVED_STOPS_2.0.pl data/intermediate/snpeff_derived_PTCS.txt data/intermediate/out_snpeff_parsed.txt data/build_tables/NMD_table.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/snpeff_derived_NMD.txt
perl bin/shared/38_global_feature_table_1_4_paralell.pl data/intermediate/snpeff_NMD.txt data/intermediate/snpeff_derived_NMD.txt data/intermediate/snpeff_detailed_ProtAndSite_Post_step.txt data/intermediate/snpeff_DOMAINS_Post_step.txt data/intermediate/snpeff_percentage.txt data/intermediate/snpeff_first_table.txt
perl bin/shared/40_tabla_PEJMAN_16_def_2.0.pl data/intermediate/snpeff_first_table.txt data/intermediate/snpeff_NMD.txt data/intermediate/snpeff_derived_NMD.txt data/intermediate/gtf_output_ENST.txt data/intermediate/gtf_output_ENSG.txt data/build_tables/ENST_table_full_condensed.txt data/external/appris_principal_isoform_gencode_19_15_10_2014.txt data/external/Pervasive.txt data/intermediate/Matrix_snpeff.txt
perl bin/shared/41_CCDS_collapser_3.0.pl data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt data/intermediate/snpeff_NMD.txt data/intermediate/snpeff_NMD_CCDS.txt data/intermediate/snpeff_derived_NMD.txt data/intermediate/snpeff_derived_NMD_CCDS.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENSG_CCDS.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_ENST_CCDS.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ENST_table_full_condensed_CCDS.txt data/external/appris_principal_isoform_gencode_19_15_10_2014.txt data/external/appris_principal_isoform_gencode_19_15_10_2014_CCDS.txt data/external/Pervasive.txt data/external/Pervasive_CCDS.txt data/intermediate/snpeff_first_table.txt data/intermediate/snpeff_first_table_CCDS.txt
perl bin/shared/42_tabla_PEJMAN_15.0_version_paralel_4.0.pl data/intermediate/snpeff_first_table_CCDS.txt data/intermediate/snpeff_NMD_CCDS.txt data/intermediate/snpeff_derived_NMD_CCDS.txt data/build_tables/gtf_output_ENST_CCDS.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/ENST_table_full_condensed_CCDS.txt data/external/appris_principal_isoform_gencode_19_15_10_2014_CCDS.txt data/external/Pervasive_CCDS.txt data/intermediate/Matrix_snpeff_CCDS.txt
java -Xmx4g -jar ~/Escritorio/Proyecto_clasificador/SOFTWARE/snpEff/snpEff.jar eff -c ~/Escritorio/Proyecto_clasificador/SOFTWARE/snpEff/snpEff.config -v GRCh37.75 -lof -csvStats -nextProt -sequenceOntology data/intermediate/example_mr.vcf > data/intermediate/example_mr.eff.vcf
perl bin/snpEff/24_snpEff_parser_def_minus_heather.pl data/intermediate/example_mr.eff.vcf data/intermediate/out_snpeff_parsed.txt
perl bin/shared/53BIS_Fuse_Matrix\&Gene_based.pl data/intermediate/Matrix_snpeff.txt data/external/pRDG2.txt data/external/Genes_AllInnateImmunity.txt data/external/Genes_Antiviral.txt data/external/Genes_ISGs.txt data/external/Genes_OMIMrecessive.txt data/external/RVIS2.txt data/final/Matrix_snpeff_added_gene_based_scores.txt
perl bin/shared/53BIS_Fuse_Matrix\&Gene_based.pl data/intermediate/Matrix_snpeff_CCDS.txt data/external/pRDG2.txt data/external/Genes_AllInnateImmunity.txt data/external/Genes_Antiviral.txt data/external/Genes_ISGs.txt data/external/Genes_OMIMrecessive.txt data/external/RVIS2.txt data/final/Matrix_snpeff_CCDS_added_gene_based_scores.txt
perl bin/shared/27_key_NMD_5.0_3.0.pl data/intermediate/out_snpeff_parsed.txt data/build_tables/NMD_table.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/snpeff_NMD.txt
perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_snpeff_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ALL_ISOFORMS_PROTEIN_table_full.txt data/intermediate/snpeff_detailed_ProtAndSite_Pre_step.txt
perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_snpeff_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ALL_ISOFORMS_DOMAIN_table_full.txt data/intermediate/snpeff_DOMAINS_Pre_step.txt
sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 data/intermediate/snpeff_detailed_ProtAndSite_Pre_step.txt > data/intermediate/snpeff_detailed_ProtAndSite_Pre_step_ordered.txt
perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl data/intermediate/snpeff_DOMAINS_Pre_step_ordered.txt data/intermediate/snpeff_DOMAINS_Post_step.txt
perl bin/shared/25_Downstream_frameshift_6.0.pl data/intermediate/out_vep_parsed.txt data/intermediate/vep_derived_PTCS.txt
perl bin/shared/26_NEW_EXTRA_key_%_sequence_2.0.pl data/intermediate/out_vep_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/vep_percentage.txt
perl bin/shared/27_key_NMD_5.0_3.0.pl data/intermediate/out_vep_parsed.txt data/build_tables/NMD_table.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/vep_NMD.txt
perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_vep_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ALL_ISOFORMS_PROTEIN_table_full.txt data/intermediate/vep_detailed_ProtAndSite_Pre_step.txt
sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 data/intermediate/vep_detailed_ProtAndSite_Pre_step.txt > data/intermediate/vep_detailed_ProtAndSite_Pre_step_ordered.txt
perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl data/intermediate/vep_detailed_ProtAndSite_Pre_step_ordered.txt data/intermediate/vep_detailed_ProtAndSite_Post_step.txt
perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_vep_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ALL_ISOFORMS_DOMAIN_table_full.txt data/intermediate/vep_DOMAINS_Pre_step.txt
sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 data/intermediate/vep_DOMAINS_Pre_step.txt > data/intermediate/vep_DOMAINS_Pre_step_ordered.txt
perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl data/intermediate/vep_DOMAINS_Pre_step_ordered.txt data/intermediate/vep_DOMAINS_Post_step.txt
perl bin/shared/27_key_NMD_5.0_DERIVED_STOPS_2.0.pl data/intermediate/vep_derived_PTCS.txt data/intermediate/out_vep_parsed.txt data/build_tables/NMD_table.txt data/build_tables/ENST_table_full_condensed.txt data/intermediate/vep_derived_NMD.txt
perl bin/shared/38_global_feature_table_1_4_paralell.pl data/intermediate/vep_NMD.txt data/intermediate/vep_derived_NMD.txt data/intermediate/vep_detailed_ProtAndSite_Post_step.txt data/intermediate/vep_DOMAINS_Post_step.txt data/intermediate/vep_percentage.txt data/intermediate/vep_first_table.txt
perl bin/shared/40_tabla_PEJMAN_16_def_2.0.pl data/intermediate/vep_first_table.txt data/intermediate/vep_NMD.txt data/intermediate/vep_derived_NMD.txt data/intermediate/gtf_output_ENST.txt data/intermediate/gtf_output_ENSG.txt data/build_tables/ENST_table_full_condensed.txt data/external/appris_principal_isoform_gencode_19_15_10_2014.txt data/external/Pervasive.txt data/intermediate/Matrix_vep.txt
perl bin/shared/41_CCDS_collapser_3.0.pl data/build_tables/gtf_tabladef_sorted_by_SYMBOL.txt data/intermediate/vep_NMD.txt data/intermediate/vep_NMD_CCDS.txt data/intermediate/vep_derived_NMD.txt data/intermediate/vep_derived_NMD_CCDS.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/gtf_output_ENSG_CCDS.txt data/build_tables/gtf_output_ENST.txt data/build_tables/gtf_output_ENST_CCDS.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/ENST_table_full_condensed_CCDS.txt data/external/appris_principal_isoform_gencode_19_15_10_2014.txt data/external/appris_principal_isoform_gencode_19_15_10_2014_CCDS.txt data/external/Pervasive.txt data/external/Pervasive_CCDS.txt data/intermediate/vep_first_table.txt data/intermediate/vep_first_table_CCDS.txt
perl bin/shared/42_tabla_PEJMAN_15.0_version_paralel_4.0.pl data/intermediate/vep_first_table_CCDS.txt data/intermediate/vep_NMD_CCDS.txt data/intermediate/vep_derived_NMD_CCDS.txt data/build_tables/gtf_output_ENST_CCDS.txt data/build_tables/gtf_output_ENSG.txt data/build_tables/ENST_table_full_condensed_CCDS.txt data/external/appris_principal_isoform_gencode_19_15_10_2014_CCDS.txt data/external/Pervasive_CCDS.txt data/intermediate/Matrix_vep_CCDS.txt
perl bin/shared/53BIS_Fuse_Matrix\&Gene_based.pl data/intermediate/Matrix_vep.txt data/external/pRDG2.txt data/external/Genes_AllInnateImmunity.txt data/external/Genes_Antiviral.txt data/external/Genes_ISGs.txt data/external/Genes_OMIMrecessive.txt data/external/RVIS2.txt data/final/Matrix_vep_added_gene_based_scores.txt
perl bin/shared/53BIS_Fuse_Matrix\&Gene_based.pl data/intermediate/Matrix_vep_CCDS.txt data/external/pRDG2.txt data/external/Genes_AllInnateImmunity.txt data/external/Genes_Antiviral.txt data/external/Genes_ISGs.txt data/external/Genes_OMIMrecessive.txt data/external/RVIS2.txt data/final/Matrix_vep_CCDS_added_gene_based_scores.txt
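The protein/site and domain annotation commands above follow the same three-step pattern for each annotator: 32_key_PROTEINS_8.0_GLOBAL_3.0.pl writes a Pre_step file, sort orders it on its first eight columns, and 32_key_PROTEINS_8.0_GLOBAL_ParteII.pl writes the Post_step file. The sketch below is a dry run that only prints those commands; the function name print_feature_steps and the variable plan are hypothetical helpers, not part of NutVar2, and the file names are taken from the listing above.

```shell
#!/usr/bin/env bash
# Dry run: print (do not execute) the Pre_step -> sort -> Post_step commands.
# print_feature_steps is a hypothetical helper introduced for illustration.
print_feature_steps() {
  tool=$1    # "snpeff" or "vep"
  label=$2   # "detailed_ProtAndSite" or "DOMAINS"
  table=$3   # pre-calculated table under data/build_tables
  pre="data/intermediate/${tool}_${label}_Pre_step.txt"
  ordered="data/intermediate/${tool}_${label}_Pre_step_ordered.txt"
  post="data/intermediate/${tool}_${label}_Post_step.txt"
  echo "perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_3.0.pl data/intermediate/out_${tool}_parsed.txt data/build_tables/ENST_table_full_condensed.txt data/build_tables/${table} ${pre}"
  echo "sort -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -k7,7 -k8,8 ${pre} > ${ordered}"
  echo "perl bin/shared/32_key_PROTEINS_8.0_GLOBAL_ParteII.pl ${ordered} ${post}"
}

# Collect the printed commands for both annotators and both feature tables.
plan=$(
  for tool in snpeff vep; do
    print_feature_steps "$tool" detailed_ProtAndSite ALL_ISOFORMS_PROTEIN_table_full.txt
    print_feature_steps "$tool" DOMAINS ALL_ISOFORMS_DOMAIN_table_full.txt
  done
)
printf '%s\n' "$plan"
```

Printing the plan first makes it easy to compare the generated paths against the listing before anything is run.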
[Figure: workflow map for ClinVar. Recoverable labels: DOWNLOAD, ClinVarFullRelease_2014-08.xml, clinvar_20140807.vcf; 20_Script_minimal_representation_vcf_ClinVAR_9.0.pl, prueba_clinvar2.txt; 21_Clinvar_no_repetitions_2.0.pl, clinvar_Antonio_25_08_2014.vcf; OUTPUTVCF_CollapsedPerGene_Variants_PlusCountsFromIstvan.txt, "In house developed", 50_Antonio\&Istvan_Counts_parser.pl, JOIN_vcf.vcf; 51_Fuse_features_Permisiveness1.0.pl, OUT_plus_Intervals_of_confidence_plus_TAGS.txt, Train\&Test.txt; Matrix_snpeff_CCDS.txt, "Annotation (see map II)", 52_Fuse_Matrix_feaures_Permisiveness_2.0.pl, Matrix_snpeff_CCDS_added_tags.txt; pRDG2.txt, Genes_AllInnateImmunity.txt, Genes_Antiviral.txt, Genes_ISGs.txt, Genes_OMIMrecessive.txt, RVIS2.txt, 53_Fuse_Matrix\&Permisiveness_Gene_based.pl, Matrix_snpeff_CCDS_added_tag_added_scores.txt.]
ORDERS (ClinVar)
wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive/2014/clinvar_20140807.vcf
wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive/2014/ClinVarFullRelease_2014-08.xml
perl ../../Scripts_Ad_Hoc/Perl/20_Script_minimal_representation_vcf_ClinVAR_9.0.pl ClinVarFullRelease_2014-08.xml clinvar_20140807.vcf prueba_clinvar2.txt 1>lista_xml1.txt
perl ~/Escritorio/Proyecto_clasificador/Scripts_Ad_Hoc/Perl/21_Clinvar_no_repetitions_2.0.pl prueba_clinvar2.txt clinvar_Antonio_25_08_2014.vcf
perl ../../Scripts_Ad_Hoc/Perl/50_Antonio\&Istvan_Counts_parser.pl OUTPUTVCF_CollapsedPerGene_Variants_PlusCountsFromIstvan.txt clinvar_Antonio_25_08_2014.vcf JOIN_vcf.vcf
perl ../../Scripts_Ad_Hoc/Perl/51_Fuse_features_Permisiveness1.0.pl JOIN_vcf.vcf OUT_plus_Intervals_of_confidence_plus_TAGS.txt Train\&Test.txt
perl ../../../Scripts_Ad_Hoc/Perl/52_Fuse_Matrix_feaures_Permisiveness_2.0.pl Matrix_snpeff_CCDS.txt ../Train\&Test.txt Matrix_snpeff_CCDS_added_tags.txt 1>lista_52.txt
perl ../../../Scripts_Ad_Hoc/Perl/53_Fuse_Matrix\&Permisiveness_Gene_based.pl Matrix_snpeff_CCDS_added_tags.txt ../Gene_based/pRDG2.txt ../Gene_based/Genes_AllInnateImmunity.txt ../Gene_based/Genes_Antiviral.txt ../Gene_based/Genes_ISGs.txt ../Gene_based/Genes_OMIMrecessive.txt ../Gene_based/RVIS2.txt Matrix_snpeff_CCDS_added_tag_added_scores.txt
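Several script and file names in the commands above contain an ampersand (for example Train\&Test.txt), which is why it is written with a backslash: an unescaped & would end the command and send it to the background. A minimal sketch, run in a temporary directory with an empty placeholder file, showing that the backslash form and the single-quoted form name the same file:

```shell
#!/usr/bin/env bash
# Work in a scratch directory so no project file is touched.
workdir=$(mktemp -d)

# Backslash-escaped ampersand, as written in the commands above.
touch "$workdir"/Train\&Test.txt

# Single quotes protect the ampersand equally well and name the same file.
if [ -e "$workdir/"'Train&Test.txt' ]; then
  same_file=yes
else
  same_file=no
fi
echo "$same_file"
```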