Ivan Dimitrov
Copyright © 1997 Ivo Ivanov
School of Pharmacy, Medical University of Sofia
Application of machine learning techniques for allergenicity prediction
2nd Regional Conference “Supercomputing Applications in Science and Industry”, Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011
Allergen processing pathways
C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005, 271-283
FAO and WHO Codex alimentarius guidelines for evaluating potential allergenicity for novel proteins
A query protein is potentially allergenic if it:
- has > 35% sequence similarity over a window of 80 amino acids
or
- has an identity of 6 to 8 contiguous amino acids
when compared with known allergens.
Codex Principles and Guidelines on Foods Derived from Biotechnology. 2003 Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization.
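The two criteria can be sketched in code. This is only an illustration under simplifying assumptions: real screening tools use a proper alignment search for the 80-residue window, whereas the sketch below compares ungapped windows position by position and counts identical residues as a crude proxy for similarity. The function names, the handling of sequences shorter than the window, and the toy sequences are all hypothetical.

```python
def shares_contiguous_match(query, allergen, k=6):
    """True if query and allergen share an identical stretch of k residues."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


def max_window_identity(query, allergen, window=80):
    """Highest ungapped percent identity between any query window and any
    allergen segment of the same length (a stand-in for a real alignment).
    Assumption: the window shrinks to the shorter sequence if needed."""
    w = min(window, len(query), len(allergen))
    if w == 0:
        return 0.0
    best = 0.0
    for q_start in range(len(query) - w + 1):
        q_win = query[q_start:q_start + w]
        for a_start in range(len(allergen) - w + 1):
            a_win = allergen[a_start:a_start + w]
            matches = sum(a == b for a, b in zip(q_win, a_win))
            best = max(best, 100.0 * matches / w)
    return best


def flag_potential_allergen(query, known_allergens, k=6, window=80,
                            threshold=35.0):
    """Apply both criteria against every known allergen sequence."""
    for allergen in known_allergens:
        if shares_contiguous_match(query, allergen, k):
            return True
        if max_window_identity(query, allergen, window) > threshold:
            return True
    return False


# Toy sequences, invented for illustration only.
known = ["MKTAYIAKQRQISFVKSHFSRQ"]
print(flag_potential_allergen("GGGAYIAKQGGG", known))
print(flag_potential_allergen("GGGGGGGGGGGG", known))
```

The exact-match test is cheap because the allergen k-mers are stored in a set; the window test is quadratic and is kept simple here on purpose.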
Bioinformatics approaches to allergen prediction
1. Sequence-alignment search of query protein
Extensive databases of known allergen proteins and the FAO/WHO guidelines:
- Structural Database of Allergenic Proteins
- Allermatch
Characteristics:
- High sensitivity (true positives / (true positives + false negatives))
- Many false positives and low precision (true positives / (true positives + false positives))
- Discovery of novel antigens is restricted by their lack of similarity to known allergens.
Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362; Fiers et al. BMC Bioinformatics 2004, 5, 133
Bioinformatics approaches to allergen prediction
2. Identification of conserved allergenicity-related linear motifs
- Comparing allergens to non-allergens with the MEME motif discovery tool
- Clustering of known allergens, wavelet analysis and hidden Markov model
- Detection based on Automated Selection of Allergen-Representative Peptides (DASARP)
- Motif search by Support Vector Machines (SVM), MEME/MAST, IgE epitopes and Allergen-Representative Peptides (ARP)
- Iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine
Both approaches are based on the assumption that allergenicity is a linearly coded property.
Stadler and Stadler FASEB J. 2003, 17, 1141-1143; Saha and Raghava Nucleic Acids Research 2006, 34, 202-209; Li et al. Bioinformatics 2004, 20, 2572-2578; Muh et al. PLoS ONE 2009, 4 (6), art. no. e5861; Björklund et al. Bioinformatics 2005, 21, 39–50
AIM of the study
To create an alignment-free method for in silico identification of allergens based on the main chemical properties of amino acid sequences, and to implement it as a web server.
Obstacles:
- Allergens are proteins of different lengths.
- The choice of appropriate descriptors to represent the physicochemical properties of amino acid sequences.
The z-scales
z1: hydrophobicity; z2: molecular size; z3: polarity
Example: …Phe – Arg – Trp…
        z1      z2      z3
Phe   -4.22    1.94    1.08
Arg    3.62    2.60   -3.60
Trp   -4.36    3.94    0.69
Hellberg et al. J. Med. Chem. 1987; 30, 1126-1135
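ACC transformation
Auto-covariance:
\[ ACC_{jj}(lag) = \sum_{i=1}^{n-lag} \frac{Z_{j,i} \times Z_{j,i+lag}}{n - lag} \]
Cross-covariance:
\[ ACC_{jk}(lag) = \sum_{i=1}^{n-lag} \frac{Z_{j,i} \times Z_{k,i+lag}}{n - lag}, \quad j \neq k \]
Example: the protein Phe – Arg – Trp – Phe – Arg – Trp (n = 6), with each residue described by z1 z2 z3.
ACC11(1) = (sum of z1 at position i × z1 at position i+1 over the 5 pairs of neighbouring residues) / 5
ACC13(1) = (sum of z1 at position i × z3 at position i+1 over the 5 pairs of neighbouring residues) / 5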
j and k index the z-scales (j, k = 1, 2, 3); i is the amino acid position (i = 1, 2, …, n − lag); n is the number of amino acids in the sequence; lag is the distance between the two positions being compared.
Wold et al. Anal. Chim. Acta 1993, 277, 239-253
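The ACC transformation can be sketched as follows: each residue is replaced by its three z values, and for every lag the products of z-scale j at position i and z-scale k at position i + lag are averaged over the n − lag available pairs. The sketch is a minimal illustration, not the authors' implementation. The function name `acc_transform` is hypothetical, the table holds only the three residues whose z values appear on the z-scales slide, and `max_lag=5` is an assumption inferred from the 45 variables (3 × 3 scale pairs × 5 lags) used later in the talk.

```python
# z1, z2, z3 values for the three residues shown on the z-scales slide.
# A real table would cover all 20 amino acids.
Z_SCALES = {
    "F": (-4.22, 1.94, 1.08),   # Phe
    "R": (3.62, 2.60, -3.60),   # Arg
    "W": (-4.36, 3.94, 0.69),   # Trp
}


def acc_transform(sequence, max_lag=5, scales=Z_SCALES):
    """Convert a protein of any length into a fixed-length vector.

    The output is ordered by lag, then by j, then by k, so element 0 is
    ACC11(1), element 2 is ACC13(1), and so on. Terms with j == k are
    auto-covariances; terms with j != k are cross-covariances.
    """
    z = [scales[aa] for aa in sequence]
    n = len(z)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_scales = len(z[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_scales):
            for k in range(n_scales):
                total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
                features.append(total / (n - lag))
    return features


vector = acc_transform("FRWFRW")   # the six-residue example protein
print(len(vector))                 # 3 * 3 * 5 = 45 variables
print(vector[0], vector[2])        # ACC11(1) and ACC13(1)
```

Because the vector length depends only on the number of scales and lags, proteins of different lengths end up as rows of the same width, which is what the classifiers in the following slides require.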
Preliminary study
Data: 595 food allergens from the CSL allergen database (http://allergen.csl.gov.uk) and 595 non-allergens from the NCBI database (http://www.ncbi.nlm.nih.gov/)
Training set: 475 food allergens, 475 non-allergens
Test set: 120 food allergens, 120 non-allergens
ACC transformation of z descriptors: matrix with 45 variables (3² × 5) and 950 observations
Statistical methods, machine learning:
- PLS - discriminant analysis
- Logistic regression
- Naïve Bayes algorithm
- Decision tree algorithm
- k Nearest Neighbours
External validation: sensitivity, specificity, accuracy
Results from preliminary study
\[ \text{sensitivity} = \frac{TP}{TP + FN} \]
\[ \text{specificity} = \frac{TN}{TN + FP} \]
\[ \text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]
TP – true positive, FP – false positive, TN – true negative, FN – false negative
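The three validation measures follow directly from the four counts. A minimal sketch, with a hypothetical function name and made-up counts that are not results from the study:

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) as fractions."""
    sensitivity = tp / (tp + fn)                 # share of allergens found
    specificity = tn / (tn + fp)                 # share of non-allergens found
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all correct calls
    return sensitivity, specificity, accuracy


# Invented counts for a test set of 120 allergens and 120 non-allergens.
sens, spec, acc = classification_metrics(tp=100, fp=30, tn=90, fn=20)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}, accuracy {acc:.1%}")
```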
[Figure: bar chart of sensitivity, specificity and accuracy (%, axis 0 to 100) by algorithm: PLS-DA, logistic regression, decision tree, Naïve Bayes, kNN (k=3), kNN (k=5).]
Web servers on the test set
- AlgPred: SVM with single amino acid composition; SVM with dipeptide composition
- Evaller
- APPEL
- Allerhunter
Test set: 120 food allergens, 120 non-allergens
[Figure: bar chart of sensitivity, specificity and accuracy (%, axis 0 to 100) by server: AlgPred (SVM, single amino acid composition), AlgPred (SVM, dipeptide composition), Evaller, APPEL, Allerhunter, kNN (k=5).]
Saha and Raghava, Nucleic Acids Research 2006, 34, 202-209
Barrio et al., Nucleic Acids Research 2007, 35, 694-700
http://jing.cz3.nus.edu.sg/cgi-bin/APPEL
Muh et al., PLoS ONE 2009, 4 (6), art. no. e5861
Conclusions from the preliminary study
1. The model developed by the k Nearest Neighbours method shows the best performance on the test set compared with the other methods. It has a good balance between specificity and sensitivity, and the highest accuracy. kNN was used further in the study.
2. Allerhunter is the best performing of the known servers for allergen prediction. kNN needs further improvement.
3. A large imbalance exists between sensitivity and specificity for almost all servers. This indicates that the dataset needs improvement too.
The kNN algorithm
Training set: 475 allergens, 475 non-allergens. ACC transformation of z descriptors gives a matrix of 45 variables (3² × 5) and 950 observations.
Unknown protein: ACC transformation of z descriptors gives a vector with 45 variables (3² × 5).
1. Calculate the Euclidean distance between the vector and each observation
2. Sort the distances in ascending order
3. Determine the k nearest neighbours
4. Determine the class of the unknown protein according to the majority of the nearest neighbours
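The kNN procedure described on this slide (Euclidean distance to every training observation, ascending sort, k nearest neighbours, majority vote) can be sketched in a few lines. This is a minimal illustration rather than the authors' code: `knn_predict` is a hypothetical name, and the two-dimensional toy vectors stand in for the 45-variable ACC vectors.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    # Euclidean distance between the query vector and each observation.
    distances = [(math.dist(vector, query), label)
                 for vector, label in zip(train_vectors, train_labels)]
    # Sort the distances in ascending order and keep the k nearest.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent class among the neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data, invented for illustration only.
train_vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                 (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["non-allergen"] * 3 + ["allergen"] * 3

print(knn_predict(train_vectors, train_labels, (0.95, 1.0), k=3))
print(knn_predict(train_vectors, train_labels, (0.05, 0.1), k=3))
```

With two classes, an odd k avoids tied votes, which is one reason odd values of k are a common choice.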
Next: Extend the data sets
Allergens
Sources: CSL allergen database, FARRP allergen database, SDAP database, ADFS database
684 food allergens, 1157 inhalant allergens, 553 toxins, venom or salivary allergens
Non-allergens
Source: NCBI database
1. Identify the allergen species
2. Collect the proteins from the allergen species
3. Create a local database
4. BLAST search against all allergens
684 non-allergens of food origin, 1157 non-allergens of inhalant origin, 553 non-allergens from species with toxins, venom or salivary allergens
http://allergen.csl.gov.uk
http://www.allergenonline.org/
http://fermi.utmb.edu/SDAP/
http://allergen.nihs.go.jp/ADFS/index.jsp
http://www.ncbi.nlm.nih.gov/
Next: kNN optimization
684 food allergens, 684 non-allergens
Training set: 528 allergens, 528 non-allergens
Test set: 156 allergens, 156 non-allergens
Machine learning: k nearest neighbours
External validation: sensitivity, specificity, accuracy
[Figure: sensitivity, specificity and accuracy (%, axis 50 to 100) versus the number of nearest neighbours, k = 3, 5, 7, 9, 11, 13, 15, 17, 19.]
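The optimization shown here amounts to repeating the external validation for a range of odd k values and comparing the three measures. A minimal sketch under stated assumptions: the function names are hypothetical, label 1 stands for allergen and 0 for non-allergen, and the toy vectors are invented and much smaller than the real 45-variable data.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """train is a list of (vector, label) pairs; majority vote of k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def sweep_k(train, test, k_values):
    """Return {k: (sensitivity, specificity, accuracy)} on the test set."""
    results = {}
    for k in k_values:
        tp = fp = tn = fn = 0
        for vector, label in test:
            predicted = knn_predict(train, vector, k)
            if label == 1 and predicted == 1:
                tp += 1
            elif label == 1:
                fn += 1
            elif predicted == 1:
                fp += 1
            else:
                tn += 1
        results[k] = (tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(test))
    return results


# Invented toy data: two well separated groups.
train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0), ((0.5, 0.5), 0),
         ((3, 3), 1), ((3, 4), 1), ((4, 3), 1), ((4, 4), 1), ((3.5, 3.5), 1)]
test = [((0.2, 0.2), 0), ((0.6, 0.4), 0), ((3.8, 3.8), 1), ((3.2, 3.6), 1)]

for k, (sens, spec, acc) in sweep_k(train, test, [1, 3, 5]).items():
    print(k, sens, spec, acc)
```

The test set is kept out of the neighbour search entirely, so each k is judged on proteins the model has not seen.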
kNN models
Food data set: 684 food allergens, 684 non-allergens
Training set: 528 allergens, 528 non-allergens
Test set: 156 allergens, 156 non-allergens
kNN, k = 3, external validation
Inhalant data set: 1157 inhalant allergens, 1157 non-allergens
Training set: 933 allergens, 933 non-allergens
Test set: 224 allergens, 224 non-allergens
kNN, k = 3, external validation
External validation: sensitivity, specificity, accuracy
[Figure: kNN models. Bar chart of sensitivity, specificity and accuracy (%, axis 0 to 100) for: kNN, food training and test set; kNN, food training set on inhalant test set; kNN, inhalant training and test set; kNN, inhalant training set on food test set; kNN, aggregated training and test set.]
AllerTOP web tool for allergenicity prediction
Training set: 1952 food, inhalant and other allergens and 1952 non-allergens
ACC transformation of z descriptors
kNN model
External validation
AllerTOP
http://www.pharmfac.net/allertop
Server performance on the united test set
Two of the servers from the preliminary study, APPEL and Evaller, were not available during the recent study. The results for the Allerhunter server were obtained with a smaller test set because it cannot process short sequences (<21 amino acids).
United test set of 441 food and inhalant allergens and 441 non-allergens
[Figure: bar chart of sensitivity, specificity and accuracy (%, axis 0 to 100) by server: AllerTOP (kNN, k=3), Allerhunter, AlgPred (SVM, amino acid composition), AlgPred (SVM, dipeptide composition), AlgPred (ARP).]
Conclusions
1. An alignment-free method for in silico prediction of allergens based on the main physicochemical properties of proteins was developed.
2. The method uses z descriptors to represent the amino acids in the protein sequences and the ACC transformation to convert proteins into uniform vectors.
3. The k Nearest Neighbours method showed the best performance among the classification algorithms tested in the study: PLS - discriminant analysis, logistic regression, Naïve Bayes and decision tree.
4. The kNN algorithm was optimized and its performance was compared with the freely available web servers for allergen prediction.
5. The kNN algorithm was implemented on a web server, freely available at: http://www.pharmfac.net/allertop
Drug Design Group
Irini Doytchinova, Ivan Dimitrov, Mariyana Atanasova, Panaiot Garnev
Funding: National Research Fund, Ministry of Education and Science, Bulgaria, Grant 02-1/2009
Acknowledgements
Darren R. Flower, Aston University, Birmingham, UK
School of Pharmacy, Medical University of Sofia