
Research Article

Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm

David N. Olivieri 1 and Francisco Gambón-Deza 2

1 Department of Computer Science, University of Vigo, Ourense 32004, Spain
2 Department of Immunology, Hospital of Meixoeiro, Vigo, Spain

Correspondence should be addressed to David N. Olivieri; [email protected]

Received 12 June 2018; Revised 25 December 2018; Accepted 15 January 2019; Published 11 February 2019

Academic Editor: Andrzej Kloczkowski

Copyright © 2019 David N. Olivieri and Francisco Gambón-Deza. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In jawed vertebrates, variable (V) genes code for antigen-binding regions of B and T lymphocyte receptors, which generate a specific response to foreign pathogens. Obtaining the detailed repertoire of these genes across the jawed vertebrate kingdom would help to understand their evolution and function. However, annotations of V-genes are known for only a few model species since their extraction is not amenable to standard gene finding algorithms. Also, the more evolutionarily distant a taxon is from such model species, the less homology there is between their V-gene sequences. Here, we present an iterative supervised machine learning algorithm that begins by training on a small set of known and verified V-gene sequences. The algorithm successively discovers homologous unaligned V-exons from a larger set of whole genome shotgun (WGS) datasets from many taxa. Upon each iteration, newly uncovered V-genes are added to the training set for the next predictions. This iterative learning/discovery process terminates when the number of new sequences discovered is negligible. This process is akin to “online” or reinforcement learning and is proven to be useful for discovering homologous V-genes from taxa successively more distant from the original set. Results are demonstrated for 14 primate WGS datasets and validated against Ensembl annotations. This algorithm is implemented in the Python programming language and is freely available at http://vgenerepertoire.org.

1. Introduction

A hallmark of an adaptive immune system (AIS) is its ability to generate a large and specific response to foreign pathogens. This is accomplished through a recognition machinery of two molecular structures, immunoglobulins (IGs) and T-cell (lymphocyte) receptors (TCRs). IGs and TCRs recognize an antigen (Ag) through different mechanisms. IG binds to an antigen in soluble form, while TCR binds to an antigen with the major histocompatibility complex (MHC) molecule [1, 2]. Antigen-binding sites in both the IG and TCR molecules possess similar recognition domains, called variable (V) domains. These domains are coded by V-genes.

Jawed vertebrate species contain multiple V-genes located within seven genomic loci. V-genes share a common sequence homology (either orthologous across species or paralogous due to gene duplication). Most jawed vertebrates have three loci for genes that encode the IG chains (IGH for heavy (H) chains and IGK and IGL for κ and λ chains, respectively) and four loci for genes that encode the TCR chains (TRA, TRB, TRG, and TRD coding for the TCR α-, β-, γ-, and δ-chains, respectively). In each locus, there is a variable number of each of these V-genes. To generate the immunoglobulin or TCR chains, one of these genes is brought to the proximity of the exons that encode the constant regions through a recombination process. This process is complex (since additional D and J gene sequences are involved) and is the basis for the wide diversity of these molecules, required for adaptive immunity. More details of the structure and function of these molecules are described elsewhere [2–4].

Motivation for V-gene finder algorithm: Knowing the detailed structure of these genes and the molecules they encode, as well as the entire repertoire that each species possesses, would help to understand the evolution of the adaptive immune system. Nonetheless, these variable (V) gene repertoires have only been annotated for a few model taxa. The maturity and breadth of genome sequencing projects of >150 jawed vertebrate species provide an exciting opportunity to identify the full set of V-gene repertoires (i.e., the set of V-genes possessed by each species) across the entire jawed vertebrate kingdom.

Hindawi, Computational and Mathematical Methods in Medicine, Volume 2019, Article ID 3780245, 13 pages, https://doi.org/10.1155/2019/3780245

Context for questions in immunology: In brief, there are several fundamental questions that a full understanding of the V-gene Ig/TCR germline repertoire would help answer. First, it is not known why the number of V-genes varies so widely between species (for example, some species belonging to the Chiroptera family have >300 V-genes, while others, such as Cetacea, possess very few such genes, [...] 150 mammal species, 20 reptiles, >100 fish with an average genome coverage >15–20× (depending upon the sequencing technology) and N50 (>20 kbp) which is sufficient for uncovering approximately >90% of the V-gene repertoire of a species [5]. The patches of the genome assemblies that are still incomplete represent the only limiting factor for uncovering the full V-exon repertoires. With maturity of these projects, however, it can be expected that the full gene repertoires can be annotated.

Structure of germline IG/TCR loci: In jawed vertebrates (i.e., mammals, reptiles, fish, and birds), functional V-gene isotypes, corresponding to either Ig or TCR receptor molecules, are found in seven separate genomic loci. For immunoglobulin chains, there are three V-gene loci: one heavy chain (IGHV), and two light chains, referred to as κ (IGKV) and λ (IGLV). For the TCR chains, there are two types: α/β and γ/δ. The TCR α/β is composed of two chains (α and β), whose variable regions are coded in two loci, TRAV and TRBV, respectively. In a similar way, the variable regions of TCR γ/δ are also encoded by the loci TRGV and TRDV (the locus TRDV is found in the same chromosomal location as TRAV). The number of V-genes in each locus varies considerably between different chains and across different species. Additionally, varying numbers of pseudogenes (sequences that either contain stop codons or have alterations in their reading frame and are not functionally expressed V-genes) exist throughout these loci [8–10].

At present, the vast majority of genome sequencing projects exist either as WGS contigs or scaffolds (i.e., segments of the DNA that have been neither assembled nor associated at the chromosome level). Thus, the IG or TCR locus of each individual V-gene must be inferred from sequence homology. In a molecular phylogenetic tree analysis, the V-genes from the same locus would belong to the same clade. This same classification could be automated with statistical machine learning, as will be shown.


Other gene finding software for V-genes: There are several bioinformatic software packages for automatically identifying genes [11] (for example, geneid [12]). However, these general algorithms, which are effective for identifying most genes, are not valid for discovering V-exons. The reason is that these algorithms use a general rule for the start/stop of exons with AG/GT signals, whereas the exon boundaries of the V-exons are more complex and variable due to the needs of the VDJ recombination mechanism. In the case of the V-exon, the GT motif does not mark the exon termination boundary; rather, there is a CACAGTG motif that is only partially conserved.

Our previous algorithm: VgenExtractor: Our previous algorithm assumed that V sequences must contain conserved sequence motifs near specific positions, i.e., when the amino acid length is >80, there exists a cysteine (C) between positions 15 and 28, a tryptophan (W) between positions 25 and 40, and the YYC motif (Y∗), or variants, is found in the last 15 amino acids. The algorithm also takes advantage of the highly conserved canonical recombination signal sequence (RSS) motif. Knowing, to a very high degree, the exon structure obviates the need for applying a general (and genome wide) gene finding algorithm (e.g., mgene, Augustus, Craig, fgenesh, geneid, and others) that attempts to discover all protein coding genes, given wide variations of genomic segment types (i.e., intergenic, 5′ untranslated region (UTR), coding exon, intron, or 3′ UTR). Instead, in-frame exons are identified between a nearly universal -AG- start motif and the RSS canonical -CAC- motif (i.e., a shortened version of the generally conserved motif).
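The positional motif rules quoted above can be illustrated with a short Python sketch. This is only a reading of the rules as stated in the text, not the VgenExtractor source: the function name `looks_like_v_exon` and the `tail_motifs` parameter are hypothetical, positions are taken as 1-based and inclusive, and only the literal YYC motif is checked by default because the accepted variants are not listed here.

```python
def looks_like_v_exon(aa: str, tail_motifs=("YYC",)) -> bool:
    """Check a translated exon against the conserved-motif rules in the text.

    Assumptions: positions are 1-based and inclusive; `tail_motifs` is a
    hypothetical parameter, since the accepted YYC variants are not listed.
    """
    if len(aa) <= 80:                # the rules apply when AA length is > 80
        return False
    has_cys = "C" in aa[14:28]       # cysteine between positions 15 and 28
    has_trp = "W" in aa[24:40]       # tryptophan between positions 25 and 40
    tail = aa[-15:]                  # last 15 amino acids
    has_tail_motif = any(motif in tail for motif in tail_motifs)
    return has_cys and has_trp and has_tail_motif
```

A rule set of this kind is fast to evaluate but rigid, which is the limitation the following paragraphs discuss.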

Results from our previous algorithm: VgenExtractor: Predictions of our previous algorithm, VgenExtractor [5], described briefly above, provided a minimum confidence region for discovering V-genes in other species whose genomes are only partially annotated (more than 150 mammals and 12 reptile species; public repository http://vgenerepertoire.org). While still representing incomplete immune repertoires, this large set of V-gene sequences has yielded heretofore unavailable information about the evolutionary origins of these IG and TCR repertoires. One example is the identification of ancestral clades found both in reptiles [13] and mammals, suggesting that V-genes in extant taxa are descendants of an ancestral immunoglobulin (Ig) recognition progenitor gene [14] that coincided with the rise of jawed vertebrates and has been maintained since then throughout their evolution [15]. These gene sequences also provide detailed clues of repertoire adaptation and divergence amongst orders. In primates, evolutionarily conserved TCR clades were identified [16] that were later seen to exist throughout all present-day mammals [17].

Despite the success of the VgenExtractor pipeline, the method has several drawbacks. First, a class of sequences can be overlooked since it is probable that some V-gene sequences may not obey canonical amino acid (AA) motif conservation rules; Iguanidae, for example, possess V-genes that lack the canonical tryptophan at position 41 (tryp-41) [18]. The VgenExtractor algorithm also produces a set of false positives, requiring a Blastp pipeline step to remove nonhomologous sequences and to classify V-gene sequences into their respective loci (i.e., IGHV, IGLV, IGKV, TRA/DV, TRBV, and TRGV). The Blastp step requires sequence alignment and depends on the completeness of the Nonredundant (nr) protein database. These deficiencies may account for only 10% error in mammalian taxa, but for more distant orders (e.g., reptiles, birds, and bony fish), V-gene sequences may deviate substantially from their conserved brethren and not have sufficient representation in the blast NP database for ortholog determination.

Difference between VgenExtractor and new ML approach: No a priori assumptions are made about V-gene sequences, and a supervised learning algorithm was developed that starts with a known annotated set of V-genes (from humans) and iteratively discovers new sequences, gradually incorporating newly learned sequences into the next learning iteration. Such iterative algorithms are commonly applied in other machine learning tasks such as face detection, voice recognition, and natural text processing. This iterative learning methodology is termed online because it continually learns new information (here sequences) and thus adaptively learns in more distant situations (in this case, more V-gene sequences from more distant taxa). Figure 1 illustrates the general iterative steps of the VgeneFinder workflow.

2. Methods

Here, details of the iterative algorithm are described. In particular, this includes the entire pipeline for extracting candidate exons from the WGS files, forming multiresolution feature vectors, training Random Forest classifiers, and the iterative training/prediction process. First, the genome sets used for the study are described. Next, the details of the algorithm are provided.

2.1. Genome Datasets. To demonstrate the iterative bootstrap learning method and validate the VgeneFinder software, WGS datasets of 16 primates (including human) obtained from the NCBI were used (M. mulatta was left out for validation comparisons). A detailed summary of the accession numbers and relevant assembly parameters can be found in Table 1. All WGS datasets had coverage >15× and N50 values >20k, representing an adequate threshold for identifying V-genes [13].

A listing of the primate species used (with WGS abbreviation and N50 value) is as follows. Lemuriformes: D. madagascariensis (AGTM01, 3.6 kbp), O. garnettii (AQR03, 27.1 kbp), and M. murinus (AAHY01, 21.7 kbp); Tarsiiformes: T. syrichta (ABRT01, 38.17 kbp); New World monkeys: C. jacchus (ACFV01, 29.3 kbp) and S. boliviensis (AGCE01, 38,823 kbp); Old World monkeys: M. mulatta (AANU01, 25.7 kbp), M. fascicularis (CAEC01, 8.9 kbp), C. sabaeus (AQIB01, 90.5 kbp), and P. anubis (AHZZ01, 40.3 kbp); and hominids: N. leucogenys (ADFV01, 35.2 kbp), P. abelii (ABGA01, 15.6 kbp), G. gorilla (CABD02), P. paniscus (AJFE01, 66.8 kbp), and P. troglodytes (AACZ03, 50.7 kbp). Table 1 provides details of the WGS statistics, particularly indicating the sequencing technology used, the coverage, and the N50 values.


2.2. [...] 300M years of evolution). Search algorithms, such as BLAST, are extremely useful for obtaining genes with high homology but are not reliable when the sequences differ significantly. Thus, an algorithm that can infer sequences quite different from the known V-genes is needed. We implemented such an algorithm as a Python-based software package called VgeneFinder. The VgeneFinder software tool is an improvement over our previous method because it discovers V-genes with a probabilistic, alignment-free method, translating these sequences into numerical feature vectors and then applying a Random Forest classifier to determine whether the sequences are valid V-genes and to which loci they belong. Moreover, V-genes from distant taxa are incorporated into the system's knowledge base, allowing for more robust V-gene homology discovery. Figure 1 illustrates the iterative steps of the VgeneFinder workflow.

[Figure 1 flow diagram. Initialization: start with the initial training set of known V-genes, T0 = {υ1, ..., υk}, D0 = Ø, n = 0. Training steps: convert each deduced AA sequence to a multiresolution feature vector and train a Random Forest for each locus k. Prediction steps: search for candidate exons between AG and CAC in the hit region of each contig, convert each deduced AA sequence to a multiresolution feature vector, and compute the class probability of each exon for each locus. Iteration: add the new sequences to the training set for the next iteration, Tn+1 = Tn + Dn, and repeat while |Nn − Nn−1| > ε, where Nn is the number of sequences at iteration n. Output: valid V-genes with locations in contig and locus assignment.]

Figure 1: Iterative workflow for predicting V-gene repertoire from WGS datasets. The algorithm bootstraps from a small set of initial V-gene sequences (step 1); these sequences are converted from nucleotide to amino acid sequences so that a multiresolution (MR) feature vector is constructed. Random Forests are trained for each MR level, and the training matrices are saved for each MR level. In the prediction phase, the collection of exons, obtained from different unconnected contigs in the WGS files, is processed with Random Forests (for each multiresolution level) to determine those that have sufficient probability (homologous to the training sets) of being V-genes. The results are a set of V-exons classified into their respective loci.


The iterative training/prediction process stops when no more genes are discovered. At this point, the algorithm has a high specificity for predicting homologous V-gene sequences with a low false-positive rate ([...]

interval tree data structure (with the Banyan Python library) that groups overlapping intervals defined by sequence start/stop contig positions.
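The grouping of overlapping candidate intervals can be sketched without an interval tree. The example below swaps in a plain sort-and-sweep pass from the Python standard library rather than the Banyan library named in the text, so it only illustrates the grouping step; the function name `group_overlapping` and the inclusive `(start, stop)` convention are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, stop) contig positions, assumed inclusive

def group_overlapping(intervals: List[Interval]) -> List[List[Interval]]:
    """Group intervals that overlap, directly or through a chain of overlaps."""
    groups: List[List[Interval]] = []
    current_end = None
    for start, stop in sorted(intervals):
        if current_end is not None and start <= current_end:
            groups[-1].append((start, stop))     # overlaps the running group
            current_end = max(current_end, stop)
        else:
            groups.append([(start, stop)])       # begin a new group
            current_end = stop
    return groups
```

Each resulting group corresponds to one genomic region in which a single best candidate can later be chosen by its predicted probability.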

An alternative method to the brute force enumeration of all intervals between the AG-CAC motifs is to use TBLASTN. While TBLASTN can act as a rough filter on potential V-genes, it is neither specific enough to discriminate V-genes and determine loci nor able to detect the exon boundaries correctly. To illustrate, Figure 3 shows histograms of negative TBLASTN hits (with a search using eValue = 1.0 and queries from consensus sequences from each IG and TR locus) against the Macaca mulatta WGS AANU01, together with hits that are positively identified as V-genes by the VgeneFinder algorithm. The plots demonstrate that a simple filter based on the eValue score is not adequate for identifying V-genes.

2.3. Multiresolution Feature Vectors. As known from protein homology studies, the numerical representation of the V-exon AA sequences is critical for classification. Three numerical feature vector transforms were studied: a simple vector based on the occurrence frequency of AA and pairs, a vector that uses physicochemical properties of AA, and a hybrid vector that uses the two methods at different scales of the MR sequence. The transformation vector based on AA occurrence frequency (the AA pairs method) is formed by concatenating two vectors: the histogram of each AA and the histogram of each pair of AA; with 20 AA, the resulting feature vector has a length of 440 integer values. For the transformation based on physicochemical properties, AAindex1 [21] is used together with a normalization procedure (PDT) [22] that captures the position of AA and neighboring correlations. The resulting vector is a normalized 500 element vector of floating point values. This PDT method can also capture longer correlations; however, in practice, no improvements were seen in sequence discrimination results.
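As a hedged illustration of the AA pairs transform, the sketch below concatenates a histogram of single amino acids with a histogram of consecutive amino acid pairs. The names `AMINO_ACIDS`, `PAIRS`, and `aa_pair_features` are hypothetical. Note that this plain construction gives 20 + 400 = 420 counts, so the length of 440 quoted in the text presumably includes components that are not described here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                        # 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def aa_pair_features(seq: str) -> list:
    """Concatenate single-residue counts with consecutive-pair counts."""
    single = [seq.count(a) for a in AMINO_ACIDS]
    pair_counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in pair_counts:          # skip pairs with non-standard symbols
            pair_counts[pair] += 1
    return single + [pair_counts[p] for p in PAIRS]
```

Because the vector depends only on counts, two sequences can be compared without aligning them.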

Because V-gene peptide sequences are relatively long (i.e., ∼90 AA) and such feature methods work best for shorter sequences, we developed a multiresolution (MR) sequence decomposition data structure, $S_{ij}$, shown in Figure 2(d). In this structure, the original AA-deduced V-gene sequence is recursively subdivided into j sequences at hierarchical level i, to each of which one of the transformation methods is applied. Such a structure allows for flexibility in applying transforms to the levels of the hierarchy. In the hybrid transform method, the AA pairs and PDT transforms are applied to different levels of this hierarchy. In another structure tested, the PDT was applied with different correlation lengths, λ, at each scale, so that longer correlations are captured on the highest layer in the hierarchy, while the bottommost layer captures the immediate neighboring correlations. When combined, the resulting data structure is a tree reminiscent of wavelet transformations, where each decomposition captures a different level of structure. Note that this method obviates the need for sequences to be aligned for classification.
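A minimal sketch of such a recursive decomposition is given below. It assumes binary halving at each level, which the text does not specify, and the function name `multiresolution_split` is hypothetical; element `tree[i][j]` plays the role of the j-th subsequence at level i.

```python
def multiresolution_split(seq: str, levels: int) -> list:
    """Return tree[i][j]: level i holds 2**i contiguous pieces of seq.

    Assumption: each piece is halved at every level (binary subdivision).
    """
    tree = [[seq]]                       # level 0 is the full sequence
    for _ in range(levels):
        next_level = []
        for piece in tree[-1]:
            mid = len(piece) // 2
            next_level.extend([piece[:mid], piece[mid:]])
        tree.append(next_level)
    return tree
```

A feature transform can then be applied to every piece at every level, giving one feature vector (and later one classifier) per position (i, j) in the tree.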

[Figure 2 panels: (a) exon between the start and stop motifs; (b) candidate start/stop combinations; (c) overlapping exon intervals with start positions x, stop positions y, and probabilities p; (d) multiscale tree; (e) flow diagram: T0 = {d1, ..., dk}, D0 = Ø, n = 0; Train[Tn]; rj = max{Pi(Ei | Mi)}, Dn = {r1, ..., rm}; Tn+1 = Tn + Dn.]

Figure 2: Process of obtaining candidate exon sequences. (a) The definition of an in-frame exon sequence between the -AG- start motif and the RSS canonical -CAC- motif. (b) Identification of all sequence possibilities between the AG-CAC motifs. (c) Examples of overlapping exon intervals; candidates are reduced with an interval tree, while best candidate V-genes are chosen by maximum probability. (d) Multiscale decomposition of a sequence stored as a recursive tree structure. (e) High-level flow diagram of steps of the iterative bootstrap training process: n is the iteration step, Tn is the set of V-exons used in training Random Forests (using >100 random trees and default parameters from the sklearn library) for each level, Dn are the new exons that have been discovered at step n and will be added to the n + 1 iteration for training, and Ej and Mi represent exon intervals and training matrices, respectively, for which maximum likelihood criteria are applied.
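Panels (a) and (b) describe a brute force enumeration of candidate exons between AG and CAC motifs, which might be sketched as follows. This is an illustration under stated assumptions, not the VgeneFinder code: "in-frame" is taken here to mean that the candidate length is a multiple of three and that no stop codon occurs when reading from the first base after AG; the length bounds are hypothetical defaults (240 nt corresponds to the >80 AA rule mentioned earlier); only the forward strand is scanned.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_stop_codon(exon: str) -> bool:
    """True if a stop codon occurs in frame 0 of the exon."""
    return any(exon[i:i + 3] in STOP_CODONS for i in range(0, len(exon) - 2, 3))

def candidate_exons(dna: str, min_len: int = 240, max_len: int = 400):
    """Yield (start, stop, exon) for open stretches between AG and CAC.

    Assumptions: the exon begins right after AG and ends right before CAC;
    min_len and max_len are hypothetical bounds in nucleotides.
    """
    starts = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    stops = [m.start() for m in re.finditer("(?=CAC)", dna)]
    for s in starts:
        for e in stops:
            length = e - s
            if min_len <= length <= max_len and length % 3 == 0:
                exon = dna[s:e]
                if not has_stop_codon(exon):
                    yield s, e, exon
```

The nested loop is quadratic in the number of motif hits, which is one reason to restrict the search to small regions of a contig.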


Training and prediction of the Random Forest classifier is performed separately at each MR level (i, j). From the training matrices $M_{ij}$ for the i multiresolution levels (i, j), binary probabilities are calculated for each locus, $L_k$, from the ensemble classifier, so that $p_{ij}(L_k) = \{[p_0, p_1] \mid \forall (i, j) < n\}$, where $[p_0, p_1]$ represents the background and signal probability, respectively. Therefore, the probability for a candidate AA sequence, $S_c$, is expressed as the set $P_s = \{(p_{ij})_0 \cdots (p_{ij})_K \mid \forall L_k\}$. The locus with maximum probability is chosen by maximum likelihood: $L_p = \arg\max_k (P_s)$.

The probabilities at each multiresolution (MR) level provide additional degrees of freedom for applying intuitive restriction criteria for selecting valid V-gene sequences. In particular, demanding that the probabilities from each sequence segment are within a range ε of each other, $|p_{i+1,j} - p_{i,k}| < \varepsilon \;\; \forall i, j, k$, is equivalent to demanding that the sequences are homologous throughout. The value of ε in practice was chosen empirically to be ≈ 0.17, by observing many bootstrap training/prediction runs and comparing predictions with genes identified by VgenExtractor; the value of ε, together with the overall probability threshold, is a free parameter and controls the homology bandwidth for discovering sequences far from the median homology of the training set. Nonetheless, the choice of these parameter values does not significantly affect the results of most of the machine learning predictions. A further important condition that guarantees that the exon boundaries correspond to functional V-genes is imposed on the subsequences at the extreme ends of the AA translated exon, corresponding to the left-most (L) and right-most (R) sequences of the lowest MR level (n), or $S_{nL}$ and $S_{nR}$, respectively. This condition corresponds to $p_{i^*}(L),\, p_{i^*}(R) > \tau$, where $i^*$ is the maximum subdivision, L and R refer to the left-most and right-most subsequences, and τ is the threshold (in practice τ ≈ 0.7).
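The selection criteria above can be made concrete with a small sketch. It assumes that the signal probabilities for one candidate are stored per locus as a list of levels (level 0 first, finest level last), and it ranks passing loci by their level-0 probability; both the data layout and that ranking choice are assumptions for illustration, as are the function names.

```python
def homologous_throughout(levels, eps=0.17):
    """Check |p(i+1, j) - p(i, k)| < eps for all segments of adjacent levels."""
    return all(abs(p - q) < eps
               for upper, lower in zip(levels, levels[1:])
               for p in upper for q in lower)

def ends_supported(levels, tau=0.7):
    """Left-most and right-most segments of the finest level must exceed tau."""
    finest = levels[-1]
    return finest[0] > tau and finest[-1] > tau

def assign_locus(per_locus, eps=0.17, tau=0.7):
    """Return the best locus that passes both criteria, or None.

    per_locus maps a locus name to its list of per-level signal probabilities.
    Ranking by the level-0 probability is an assumption of this sketch.
    """
    passing = {k: lv for k, lv in per_locus.items()
               if homologous_throughout(lv, eps) and ends_supported(lv, tau)}
    if not passing:
        return None
    return max(passing, key=lambda k: passing[k][0][0])
```

Raising `eps` or lowering `tau` widens the set of accepted sequences, which is the "homology bandwidth" trade-off described in the text.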

2.4. Online Iterative Learning. Given the numerical representation of the AA sequence, supervised machine learning is used to train a Random Forest ensemble classifier [23] from a small initial set of known V-genes from Homo sapiens and Mus musculus, obtained from the IMGT [24] and Ensembl [25]. Binary training, consisting of positive (functional V-genes) and background (random) sequences, is performed for each locus and at each multiresolution level. The background sequences are selected randomly with a signal ratio of 3:1 and shuffled for each multiresolution level training matrix, $M_{ij}$. From the initial training matrices, V-gene prediction is carried out with 14 WGS primate datasets; positively selected V-genes are incorporated into the subsequent round of training. This online (i.e., incremental and iterative addition of new training data) supervised learning procedure is repeated until no additional new genes are discovered upon further iterations. Figure 2(e) shows the general steps of this procedure in a flow diagram.
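The control flow of this iterative procedure can be sketched independently of the classifier. In the example below, `train` and `predict` are caller-supplied functions standing in for the Random Forest training and prediction steps; these names, the `eps` stopping tolerance, and the `max_iter` safety limit are hypothetical, and the sketch shows only the loop structure.

```python
def bootstrap_discover(initial, candidates, train, predict, eps=0, max_iter=20):
    """Grow a training set by repeatedly adding positively predicted candidates.

    train(training_set) returns a model; predict(model, candidate) returns
    True for an accepted candidate. Stops when the set grows by <= eps items.
    """
    training = set(initial)
    history = [len(training)]            # N_n, the set size at each iteration
    for _ in range(max_iter):
        model = train(training)
        discovered = {c for c in candidates
                      if c not in training and predict(model, c)}
        training |= discovered           # T_{n+1} = T_n + D_n
        history.append(len(training))
        if abs(history[-1] - history[-2]) <= eps:
            break
    return training, history
```

With a toy model that accepts any integer within distance 1 of a training member, starting from `{0}` with candidates `[1, 2, 3, 10]` the loop picks up 1, then 2, then 3, and never reaches 10, mirroring how successive rounds reach progressively more distant sequences.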

2.5. Practical Implementation. VgeneFinder is a multithreaded application (using a MapReduce design pattern) that concurrently divides large WGS contigs into smaller overlapping chunks for V-exon search and processing (in practice, the chunk size is 20 kbp, with an overlap of 1 kbp). In each chunk, a map processing phase identifies candidate exon intervals, which are then combined in a reduction phase, thereby removing possible duplicates from the overlaps. For each candidate, the MR predictions are made for each V-gene isotype. As mentioned previously, WGS Fasta files are approximately 3G, consisting of ≈ 3 × 10^5 contigs with N50 >15 kbp, but average contig sizes are ≈100–200 kbp. The average processing time for the WGS files of primates is approximately 2.5 minutes/contig on a modest desktop Linux PC (i.e., a 4-core Intel Core i5-2400 CPU at 3.10 GHz, running the Linux kernel 3.2).
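The chunking and duplicate removal can be sketched in a few lines. This is a single-threaded illustration of the map and reduce phases with the chunk size and overlap quoted above; the function names are hypothetical, hits are assumed to be reported as `(start, stop)` pairs in contig coordinates, and `chunk` is assumed to be larger than `overlap`.

```python
def overlapping_chunks(length, chunk=20_000, overlap=1_000):
    """Yield (start, end) windows covering [0, length) with the given overlap."""
    step = chunk - overlap               # assumes chunk > overlap
    start = 0
    while True:
        end = min(start + chunk, length)
        yield start, end
        if end == length:
            break
        start += step

def merge_hits(per_chunk_hits):
    """Reduce phase: combine hits from all chunks and drop duplicates.

    Each hit is a (start, stop) pair in contig coordinates, so an exon found
    in two overlapping chunks appears twice and is kept only once.
    """
    return sorted(set(hit for hits in per_chunk_hits for hit in hits))
```

In a multithreaded setting, each window from `overlapping_chunks` would be handed to a worker for the map phase, and `merge_hits` would run once all workers have returned.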

3. Results

The bootstrap learning algorithm iteratively improves the ensemble class probabilities for predicting each locus. The process was applied to 14 WGS primate datasets, independently testing each of the feature vector transforms. Binary training with a Random Forest classifier was carried out for each V-gene isotype, k (resulting in matrices $M_{ij}(k)$), using the sklearn [26] library with 500 trees and a signal/background ratio of 3:1 (as described in Methods). For all candidate exon sequences at each iteration t, sequences were converted to an MR structure $S_{ij}$ and binary predictions made for each locus with $M_{ij}(k, t)$. The predicted class probabilities, $p_{ij}(k)$, obtained from each MR level were combined into a single score, which served as the basis for selecting sequences with respect to an adaptive threshold. The MR score, $\tau = N - 1/N \sum_k \sum_m \left(1 - \exp\left(|p_{00} - p_{km}|^2/\sigma\right)\right)$, is degraded if the probability at different MR levels ($p_{km}$) deviates significantly from the probability $p_{00}$ of the zeroth-level MR sequence.

Figure 3: TBLASTN hits compared with the V-gene sequences positively identified by VgeneFinder: (a) log(eValue) scores and (b) TBLASTN scores for all candidates. The negative candidate histograms (red) are candidate sequences that VgeneFinder has discarded. The V-gene sequence histograms (yellow) are positively identified by VgeneFinder. The plots show that, based on TBLASTN homology alone, there would be no way to separate positive and negative sequences; TBLASTN and similar homology methods are not effective for this task, nor could they be used to automatically classify the exons into their respective loci.

The distributions of predicted sequences based on their MR scores are visualized with a histogram and fit to a kernel density estimation (KDE). Figure 4(a) shows KDE probability distribution results from successive learning/prediction iterations of the bootstrap process corresponding to the AA-pair transform (Section 2); the KDE results for the other feature transforms behave similarly. As can be seen, in the first iteration step, the KDE distributions are broad and have low mean probabilities. Upon successive iterations, the mean probability of predicted sequences moves towards higher values and the KDE distributions of all loci are more sharply peaked, indicating that predictions of V-genes have a high specificity and (with constant area) most sequences are under the peak of the distribution.

Figure 4(b) (top) shows results of the total number of true-positive (TP) V-genes as a function of iteration t, comparing two feature vector transforms: AA pairs and PDT. Figure 4(b) (bottom) shows the number of sequences discarded at each iteration whose probability was below threshold. Figure 4(c) shows the phylogenetic tree of the TRAV loci at each iteration step. From these plots, it is clear that the best method for forming the feature vector is the AA pairs method (i.e., the vector based on the occurrence frequency of amino acids and pairs of consecutive amino acids). Figure 5 shows a more detailed view of the TRAV locus in the iterative discovery of V-genes.

Figure 4: (a) Density distributions of the iterative learning algorithm of VgeneFinder for successive iterations using 14 WGS primate datasets. (b) Number of total sequences as a function of iterations for two different feature vector transforms; the AA frequency transform considers consecutive pairs of amino acids, while the AA physicochemical transform forms a feature vector using physical properties depending on the position of amino acids. (c) The number of sequences that are below the prediction threshold as a function of iteration, indicating that exons which are quite distant from the initial training set (but nonetheless viable V-genes) are gradually included as the iterative process evolves. (d) Example of TRAV multispecies tree for starting set (with H. sapiens) and 2 iterations (see more detailed view in Figure 5(c)).

3.1. Validation of VgeneFinder with Known Sequences. To validate VgeneFinder, we compared the genes found by this software with the available V-gene annotations of the IMGT and with our previous software, VgenExtractor. The sequences annotated by the IMGT (and deposited in the Ensembl database) were obtained through laborious multiple experimental methods. As such, these sequence annotations are accepted by the scientific community as gold standards.

As described previously, other standard gene finding software is not valid for discovering V-genes because the V-exon boundaries do not follow canonical rules. As such, the automatic V-exon annotations provided in new genome projects that use classic gene finding software have significant deficiencies in reporting the actual number of V-exons. Through validation with the known IMGT sequences, our software accurately automates V-exon annotation and can be used to identify V-genes in newly available genomes.

3.2. Multispecies Trees and Comparison with VgenExtractor. The predicted sequences obtained by applying the iterative algorithm to the 14 WGS primate datasets were used to construct a multispecies V-gene tree. In particular, phylogenetic trees were constructed using clustalO [27] alignment and FastTree [28] with the WAG matrix and 500 bootstraps to produce Newick files. For the tree construction, we used a maximum likelihood algorithm and the LG matrix. Finally, we used MEGA (ver. 5) [29] (https://www.megasoftware.net/) and FigTree (http://tree.bio.ed.ac.uk/software/figtree/) to produce tree graphics. Figure 6 shows the resulting trees at different iteration steps, starting with the initial training set (consisting of H. sapiens and M. musculus). These results provide a separate test of the VgeneFinder loci classification since the predicted loci form the well-defined clades as expected.

All VgenExtractor sequences were processed with the AA-pair transform feature vector, and scores were calculated with the VgeneFinder predictor. All sequences of VgenExtractor are detected; however, many are discarded because of low MR classification scores. Phylogenetic comparisons are shown in Supplementary Materials (available here). Figure 7(a) summarizes the results for sequences that did not agree (sequences predicted by VgenExtractor but discarded by VgeneFinder and those found by VgeneFinder but not found by VgenExtractor). Low scores for sequences indicate that they are far from the homology in the training set, not necessarily that they are nonfunctional V-genes. Moreover, the VgeneFinder score provides a homology metric, indicating which V-gene sequences can be considered with high confidence. Such information was not available previously with the VgenExtractor tool.

3.3. Validation from the Prediction of V-genes from Known Genes in Macaca mulatta. We validated the algorithm by studying the nonhuman primate, rhesus macaque (Macaca mulatta), whose genome is complete in the IG/TCR locus. The rhesus macaque is one of the most studied primates (apart from H. sapiens) because it is an ideal laboratory surrogate model for human disease and treatment. As such, the genome of the macaque is known in great detail, sharing approximately 93% of genes with H. sapiens, with complete chromosome reconstruction (21 pairs) and 3097.37 Mb. Gene annotation WGS pipelines have taken advantage of the alignment with the human genome, uncovering a large number of coding/noncoding genes. Nonetheless, the V-gene repertoire in this species has not been fully annotated yet.

In the training phase with VgeneFinder on the 14 WGS primates, the genome of M. mulatta was excluded so that

[Figure 5, panels (a)-(c): TRAV tree images; leaf labels have the form V50RF-JZKE01143890.1-trav.]

Figure 5: Two iterations of the TRAV tree using the bootstrap method and showing the branch labels of each taxon. With each iteration, more branches are discovered from the 14 WGS primate data and included in subsequent training. The VgeneFinder algorithm classifies the V-genes according to their loci. Here only the V-genes pertaining to the TRAV locus are shown.


[Figure 6 panels: Initialization (Homo sapiens); Iteration 1 (7:1); Iteration 3 (15:1). Clade labels: IGHV, IGKV, IGLV, TRAV/TRDV, TRBV, TRGV.]

Figure 6: Phylogenetic trees of the amino acid sequences of V-exons for each iteration step. (a) Positively identified V-exon sequences are classified into their respective locus; the clearly delineated clades (i.e., IGHV, IGLV, IGKV, TRAV/D, TRBV, and TRGV) show that this classification is correct. The V-exon sequences were aligned with Clustal Omega [27]. For constructing the phylogenetic trees, a maximum likelihood algorithm with the WAG matrix and 500 bootstrap replicates were realized for validation. Rooting was performed at the midpoint, and linearization provided by MEGA [29] was applied to improve the visualization of the trees. In the initial iteration (b), only known V-exon sequences from humans and mouse were used in the training set. From this training, predictions were made by processing 14 WGS of primates; the discovered sequences from these primates were used to retrain Random Forests, thereby refining the possibility of including V-genes that are more distant in homology. In the third iteration (c), the program VgeneFinder uncovered 15 times more sequences than from the start of the iteration. For illustration, sequences from a small section of the TRAV are amplified (inset). More details of the branch distances can be found in Supplementary Materials.



prediction validation could be carried out and compared with V-gene annotations. IG and TCR V-gene annotations are available from the Ensembl repository [25] as a WGS assembly (MMul ver. 1.0) that maps to chromosomes and/or scaffolds. For each gene in Ensembl, the corresponding protein transcript was downloaded from the UniProt database. The protein transcript sequences were saved in FASTA format for direct comparison with the sequences obtained with VgeneFinder and VgenExtractor from the nucleotide chromosome or scaffold segments.
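A direct comparison between annotated and predicted protein sequences stored in FASTA format could look like the following sketch. The matching rule is an assumption made for illustration: the text does not state how sequences were paired, so a generic similarity ratio from Python's standard library and a hypothetical cut-off of 0.9 are used, and the example sequences are invented.

```python
from difflib import SequenceMatcher


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of identifier -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # identifier is the first token
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def best_matches(annotated, predicted, min_ratio=0.9):
    """For each annotated protein, report the most similar prediction.

    Returns identifier -> (predicted identifier or None, similarity ratio).
    min_ratio is a hypothetical cut-off for calling a true positive.
    """
    hits = {}
    for a_name, a_seq in annotated.items():
        scored = [
            (SequenceMatcher(None, a_seq, p_seq, autojunk=False).ratio(), p_name)
            for p_name, p_seq in predicted.items()
        ]
        ratio, p_name = max(scored, default=(0.0, None))
        hits[a_name] = (p_name, ratio) if ratio >= min_ratio else (None, ratio)
    return hits


if __name__ == "__main__":
    # Toy sequences, not real V-genes.
    annotated = read_fasta(">ENS1 toy\nACDEFGHIKL\n>ENS2 toy\nWWWWWYYYYY\n")
    predicted = read_fasta(">V1\nACDEFG\nHIKL\n>V2\nMMMMMMMMMM\n")
    print(best_matches(annotated, predicted))
```

In practice an alignment-based identity measure would likely be preferable to a generic string ratio; the sketch only shows the bookkeeping of annotated versus predicted sets.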

All annotated sequences were used in the validation, except for three sequences (TRAV12-1, TRAV12-2, and TRBV4-1), which are only partial transcripts that do not reach the minimum length. Nonetheless, the Ensembl annotations are far from complete. At present, five IGHV sequences are located on nonchromosomal scaffolds of the assembly, eight IGLV are in Chr10, four IGKV are in Chr13, 16 TRAV are in Chr7, and nine TRBV are in Chr3. No TRGV sequences are found, and there is one delta chain, TRDV, found in a scaffold region.

A summary of the comparison results between the VgeneFinder algorithm and VgenExtractor is shown in Table 2. VgeneFinder detects nearly 100% of the Ensembl annotated genes; the exception is IGLV1-51, which is only a partial sequence and whose functionality is questionable (Supplementary Materials). Figure 8(a) shows a detailed comparison of the two methods with the Ensembl TRAV and TRBV loci, in segments of Chr7 and Chr3, respectively. The discrepancy between VgeneFinder and VgenExtractor in detecting Ensembl sequences can be understood from the sequence alignments (Figure 8(b)); two sequences (ENS-TRAV40/1-83 and ENS-TRBV5-3/1-77) were in fact detected by VgeneFinder but not by VgenExtractor because they lack conserved motifs (i.e., ENS-TRAV40 lacks a cysteine between locations 15 and 28, and ENS-TRBV5-3 lacks a common Y∗ motif in the last 15 AA). Detailed IG comparisons and phylogenetic trees are shown in Supplementary Materials.
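To illustrate why a rule-based extractor can miss such sequences, the sketch below checks the two motif conditions mentioned above. It is not the VgenExtractor implementation: positions are assumed to be 1-based and inclusive, the tail pattern `Y.C` is only a placeholder for the Y∗ motif, and the test sequences are artificial.

```python
import re


def has_cysteine_in_window(seq, start=15, end=28):
    """True if a cysteine occurs at positions start..end (1-based, inclusive)."""
    return "C" in seq[start - 1:end]


def has_tail_motif(seq, pattern=r"Y.C", tail=15):
    """True if `pattern` matches within the last `tail` residues.

    The default pattern is a placeholder, not the motif used by VgenExtractor.
    """
    return re.search(pattern, seq[-tail:]) is not None


def passes_motif_filter(seq):
    """A strict rule-based filter rejects a sequence missing either motif."""
    return has_cysteine_in_window(seq) and has_tail_motif(seq)


if __name__ == "__main__":
    # Artificial sequences built only to exercise the two rules.
    canonical = "A" * 20 + "C" + "A" * 50 + "YYC" + "AR"
    no_cysteine = "A" * 71 + "YYC" + "AR"
    no_tail = "A" * 20 + "C" + "A" * 55
    for name, s in [("canonical", canonical),
                    ("no_cysteine", no_cysteine),
                    ("no_tail", no_tail)]:
        print(name, passes_motif_filter(s))
```

A probabilistic classifier has no such hard constraint, which is consistent with VgeneFinder recovering the two noncanonical sequences.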

4. Discussion and Conclusions

The evolution of the vast majority of V-genes found throughout jawed vertebrate orders has progressed with a high degree of conservation at particular positions along the germline sequence. Structural or functional requirements of the resulting antigen-binding V domains may be responsible for such canonical motifs. Previous methods have exploited this structure but are unable to identify V-genes having less common motifs and cannot be extended to more distant species, such as bony fish, with additional IG and TCR isotypes. The iterative learning algorithm of VgeneFinder provides an alignment-free probabilistic method for obtaining V-genes with high specificity for homologous genes but can be used to gradually expand the original set to evolutionarily distant taxa. The probabilistic scores of the classifier provide an alignment-free homology distance

Table 2: Prediction comparisons with the annotated genes of M. mulatta obtained from the Ensembl (ENS) repositories. Prediction results of the total and true positives (TPs) against ENS of VgeneFinder (MRV) and VgenExtractor (VE) are shown.

Locus   Gen. loc.   ENS   Total (MRV/VE)   TP (MRV/VE)
TRAV    Chr7        12    46/43            12/11
TRBV    Chr3        8     56/53            8/7
IGHV    Scaffold    3     32/31            3/3
IGKV    Chr13       3     35/31            3/3
IGLV    Chr10       8     41/36            7/6
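The fraction of Ensembl annotations recovered by each method follows directly from the ENS and TP columns of Table 2. The short script below performs that arithmetic; the variable and function names are chosen here for illustration.

```python
# Values copied from Table 2:
# locus -> (ENS annotated, TP VgeneFinder (MRV), TP VgenExtractor (VE))
TABLE2 = {
    "TRAV": (12, 12, 11),
    "TRBV": (8, 8, 7),
    "IGHV": (3, 3, 3),
    "IGKV": (3, 3, 3),
    "IGLV": (8, 7, 6),
}


def totals(table):
    """Sum the ENS, MRV true-positive, and VE true-positive columns."""
    ens = sum(row[0] for row in table.values())
    tp_mrv = sum(row[1] for row in table.values())
    tp_ve = sum(row[2] for row in table.values())
    return ens, tp_mrv, tp_ve


def recall(tp, annotated):
    """Fraction of annotated genes that were recovered."""
    return tp / annotated


if __name__ == "__main__":
    ens, tp_mrv, tp_ve = totals(TABLE2)
    print(f"VgeneFinder:   {tp_mrv}/{ens} = {recall(tp_mrv, ens):.3f}")
    print(f"VgenExtractor: {tp_ve}/{ens} = {recall(tp_ve, ens):.3f}")
```

Summed over the five loci, this gives 33 of 34 annotated genes for VgeneFinder and 30 of 34 for VgenExtractor.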

[Figure 7 panels: (a) curves over an axis from 0.0 to 1.0 with legend labels "VgenExtractor excluded; below threshold" and "VgeneFinder only"; (b) mean number of genes per locus (IGHV, IGLV, IGKV, TRAV, TRBV, TRDV, TRGV) for each WGS dataset, with asterisks marking MResVgene.]

Figure 7: Prediction results comparing VgeneFinder and VgenExtractor. (a) The class probability of sequences predicted by VgeneFinder (right curve) that were not predicted by VgenExtractor and those not accepted by VgeneFinder (left curve) having class probabilities

metric which can serve as a confidence score for V-gene sequences. This quantitative metric can be used to rule out sequences when no other information is available, such as gene expression transcripts.
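The iterative expansion of the training set can be summarized by the following self-contained sketch. It is a schematic of the general idea, not VgeneFinder itself: a simple shared 3-mer score stands in for the Random Forest class probability, and the function names, threshold, and toy sequences are hypothetical.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fit_kmer_model(training):
    """Toy 'model': the union of k-mers seen in the training set."""
    model = set()
    for s in training:
        model |= kmers(s)
    return model


def kmer_score(model, seq):
    """Toy score in [0, 1]: fraction of the sequence's k-mers known to the model."""
    ks = kmers(seq)
    return len(ks & model) / len(ks) if ks else 0.0


def iterative_discovery(seed, candidates, fit, score,
                        threshold=0.5, min_new=1, max_iter=10):
    """Grow a training set by repeatedly accepting high-scoring candidates.

    Stops when fewer than `min_new` sequences are added in an iteration.
    Returns the final training set and the number accepted per iteration.
    """
    training = set(seed)
    pool = set(candidates) - training
    history = []
    for _ in range(max_iter):
        model = fit(training)                     # retrain on the current set
        new = {s for s in pool if score(model, s) >= threshold}
        history.append(len(new))
        if len(new) < min_new:                    # negligible discovery: stop
            break
        training |= new                           # add discoveries to training
        pool -= new
    return training, history


if __name__ == "__main__":
    training, history = iterative_discovery(
        seed={"ACDEFG"},
        candidates={"ACDEFH", "CDEFHK", "WWWWWW"},
        fit=fit_kmer_model, score=kmer_score, threshold=0.7,
    )
    print(sorted(training), history)
```

In the toy run, the second candidate is accepted only after the first has been added to the training set, which mirrors how retraining lets more distant homologs be reached in later iterations.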

Apart from the iterative ensemble learning process itself, there are two features that contribute to the success of this algorithm. The first is the multiresolution decomposition of the deduced amino acid sequences, and the other is the choice of the feature vector transformation. Because these V-genes are relatively long homologous germline exon sequences (≈300 bp), a single transformation does not provide enough local information of the sequence to properly distinguish homology; the prediction probabilities from multiple levels of the sequence probe the sequence at multiple scales. Finally, while the iterative online learning method was applied here to V-genes, it is general and could be used more broadly for homologous gene discovery in situations where the exon structure is well understood.
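One way to picture a multiresolution decomposition with a pair-based feature transform is sketched below. The specific choices are assumptions made for illustration only: segments are obtained by halving the sequence at each level, and the feature vector is the normalized count of adjacent amino acid pairs, which may differ from the transform actually used in VgeneFinder.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
PAIR_INDEX = {p: i for i, p in enumerate(PAIRS)}


def pair_frequencies(segment):
    """Normalized counts of adjacent amino acid pairs in one segment."""
    vec = [0.0] * len(PAIRS)
    n = 0
    for a, b in zip(segment, segment[1:]):
        idx = PAIR_INDEX.get(a + b)
        if idx is not None:        # skip non-standard residues such as X
            vec[idx] += 1.0
            n += 1
    return [v / n for v in vec] if n else vec


def split_even(seq, parts):
    """Split seq into `parts` contiguous segments of near-equal length."""
    bounds = [round(i * len(seq) / parts) for i in range(parts + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(parts)]


def multiresolution_features(seq, levels=3):
    """Feature vectors per level; level k holds 2**k segment vectors."""
    return {k: [pair_frequencies(s) for s in split_even(seq, 2 ** k)]
            for k in range(levels)}


if __name__ == "__main__":
    seq = AMINO_ACIDS * 5          # artificial 100-residue sequence
    feats = multiresolution_features(seq)
    for level, vectors in feats.items():
        print(level, len(vectors), len(vectors[0]))
```

A separate classifier could then be trained per level, with the per-level probabilities combined into a single score; how VgeneFinder combines them is not specified in this section.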

Data Availability

All genome WGS data used in this study were obtained from the public repository at NCBI (http://www.ncbi.nlm.nih.gov) with the detailed accession numbers provided in the manuscript. The genes extracted by our software described in this study have been deposited online in the VgeneRepertoire.org repository (see description at https://doi.org/10.1101/002139).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Supplementary Materials

supp_materialsV2.pdf: this is an updated document, which includes new figures requested by the reviewer. sequences_alignments.zip: these are genome sequence files that are relevant for reproducing some of the results. (Supplementary Materials: http://downloads.hindawi.com/journals/cmmm/2019/3780245.f1.zip)

References

[1] C. Janeway, P. Travers, M. Walport, and M. Shlomchik, Immunobiology: The Immune System in Health and Disease, Garland Science, New York, NY, USA, 2005.

[2] T. W. Mak and M. E. Saunders, The Immune Response: Basic and Clinical Principles, Academic Press, San Diego, CA, USA, 2005.

[3] M.-P. Lefranc and G. Lefranc, The Immunoglobulin FactsBook, Academic Press, San Diego, CA, USA, 2001.

[4] M.-P. Lefranc and G. Lefranc, The T Cell Receptor FactsBook, Academic Press, San Diego, CA, USA, 2001.

[5] D. Olivieri, J. Faro, B. von Haeften, C. Sánchez-Espinel, and F. Gambón-Deza, “An automated algorithm for extracting functional immunologic V-genes from genomes in jawed vertebrates,” Immunogenetics, vol. 65, no. 9, pp. 691–702, 2013.

[Figure 8 panels: (a) TRAV genes (TRAV4, TRAV5, TRAV6, TRAV8-4, TRAV8-6, TRAV17, TRAV19, TRAV25, TRAV27, TRAV30, TRAV40, TRAV41) at positions 84.3M to 85.1M; (b) TRBV genes (TRBV5-1, TRBV5-3, TRBV11-1, TRBV5-4, TRBV5-6, TRBV13, TRBV11-3, TRBV18) at positions 179.3M to 180.1M. Tracks: Ensembl, MResVgene, VgenExtractor.]

Figure 8: V-genes in Macaca mulatta. Comparison of V-genes obtained from VgeneFinder and VgenExtractor for TRAV and TRBV against the Ensembl annotations. The gene annotations in M. mulatta are limited as described in the text. This comparison shows that our software tools correctly identify all the known annotated genes as well as identify the rest of the V-gene repertoire. The comparison between VgeneFinder and VgenExtractor shows that VgeneFinder is able to uncover sequences which are not canonical (as seen in the alignments).

[6] M. L. Metzker, “Sequencing technologies - the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2009.

[7] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, no. 1, 2013.

[8] J. P. Cannon, R. N. Haire, J. P. Rast, and G. W. Litman, “The phylogenetic origins of the antigen-binding receptors and somatic diversification mechanisms,” Immunological Reviews, vol. 200, no. 1, pp. 12–22, 2004.

[9] S. Das, M. Hirano, C. McCallister, R. Tako, and N. Nikolaidis, “Comparative genomics and evolution of immunoglobulin-encoding loci in tetrapods,” in Advances in Immunology, F. W. Alt, Ed., vol. 111, pp. 143–178, Academic Press, San Diego, CA, USA, 2011.

[10] T. Ota and M. Nei, “Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family,” Molecular Biology and Evolution, vol. 11, no. 3, pp. 469–82, 1994.

[11] M. Yandell and D. Ence, “A beginner’s guide to eukaryotic genome annotation,” Nature Reviews Genetics, vol. 13, no. 5, pp. 329–342, 2012.

[12] T. Alioto, E. Blanco, G. Parra, and R. Guigó, “Using geneid to identify genes,” Current Protocols in Bioinformatics, vol. 64, no. 1, p. e56, 2018.

[13] D. N. Olivieri, B. von Haeften, C. Sánchez-Espinel, J. Faro, and F. Gambón-Deza, “Genomic V exons from whole genome shotgun data in reptiles,” Immunogenetics, vol. 66, no. 7-8, pp. 479–492, 2014.

[14] A. L. Hughes, “The evolution of functionally novel proteins after gene duplication,” in Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 256, no. 1346, pp. 119–24, 1994.

[15] M. F. Flajnik and M. Kasahara, “Origin and evolution of the adaptive immune system: genetic events and selective pressures,” Nature Reviews Genetics, vol. 11, no. 1, pp. 47–59, 2009.

[16] D. N. Olivieri and F. Gambón-Deza, “V genes in primates from whole genome sequencing data,” Immunogenetics, vol. 67, no. 4, pp. 211–228, 2015.

[17] D. N. Olivieri, S. Gambón-Cerdá, and F. Gambón-Deza, “Evolution of V genes from the TRV loci of mammals,” Immunogenetics, vol. 67, no. 7, pp. 371–384, 2015.

[18] D. N. Olivieri, E. Garet, O. Estevez, C. Sánchez-Espinel, and F. Gambón-Deza, “Genomic structure and expression of immunoglobulins in Squamata,” Molecular Immunology, vol. 72, pp. 81–91, 2016.

[19] A. Hassanin, R. Golub, S. M. Lewis, and G. E. Wu, “Evolution of the recombination signal sequences in the Ig heavy-chain variable region locus of mammals,” in Proceedings of the National Academy of Sciences, vol. 97, no. 21, pp. 11415–11420, 2000.

[20] Y. N. Lee, F. W. Alt, J. Reyes, M. Gleason, A. A. Zarrin, and D. Jung, “Differential utilization of T cell receptor TCRα/TCRδ locus variable region gene segments is mediated by accessibility,” in Proceedings of the National Academy of Sciences, vol. 106, no. 41, pp. 17487–17492, 2009.

[21] S. Kawashima and M. Kanehisa, “AAindex: amino acid index database,” Nucleic Acids Research, vol. 28, no. 1, p. 374, 2000.

[22] B. Liu, X. Wang, Q. Chen, Q. Dong, and X. Lan, “Using amino acid physicochemical distance transformation for fast protein remote homology detection,” PLoS One, vol. 7, no. 9, Article ID e46633, 2012.

[23] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[24] M.-P. Lefranc, “Immunoglobulins: 25 years of immunoinformatics and IMGT-ONTOLOGY,” Biomolecules, vol. 4, no. 4, pp. 1102–1139, 2014.

[25] J. Herrero, M. Muffato, K. Beal et al., “Ensembl comparative genomics resources,” Database, vol. 2016, article bav096, 2016.

[26] F. Pedregosa, G. Varoquaux, A. Gramfort et al., “Scikit-learn: machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[27] F. Sievers and D. Higgins, “Clustal omega, accurate alignment of very large numbers of sequences,” in Multiple Sequence Alignment Methods, pp. 105–116, Springer, Berlin, Germany, 2014.

[28] M. N. Price, P. S. Dehal, and A. P. Arkin, “FastTree 2 - approximately maximum-likelihood trees for large alignments,” PLoS One, vol. 5, no. 3, Article ID e9490, 2010.

[29] K. Tamura, D. Peterson, N. Peterson, G. Stecher, M. Nei, and S. Kumar, “MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods,” Molecular Biology and Evolution, vol. 28, no. 10, pp. 2731–2739, 2011.

