Ivan Dimitrov School of Pharmacy Medical University of Sofia Application of machine learning...

Ivan Dimitrov

Copyright © 1997 Ivo Ivanov

School of Pharmacy, Medical University of Sofia

Application of machine learning techniques for allergenicity prediction

2nd Regional Conference “Supercomputing Applications in Science and Industry”, Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011

Allergen processing pathways

C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005, 271-283

FAO and WHO Codex alimentarius guidelines for evaluating potential allergenicity for novel proteins

A query protein is potentially allergenic if it:

- has > 35% sequence similarity over a window of 80 amino acids

or

- has an identity of 6 to 8 contiguous amino acids

when compared with known allergens.

Codex Principles and Guidelines on Foods Derived from Biotechnology. 2003 Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization.
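The contiguous-identity criterion can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names and the toy sequences are hypothetical, the word length is set to 6 (the lower bound of the 6 to 8 range), and the 35% / 80-residue window criterion is not covered because it requires a sequence-alignment tool.

```python
def shares_contiguous_identity(query: str, allergen: str, k: int = 6) -> bool:
    """Return True if the two sequences share at least one identical run of k residues."""
    if len(query) < k or len(allergen) < k:
        return False
    # All k-residue words of the known allergen.
    allergen_words = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    # Slide a window of k residues along the query and look each word up.
    return any(query[i:i + k] in allergen_words for i in range(len(query) - k + 1))


def flag_by_contiguous_rule(query: str, known_allergens: dict, k: int = 6) -> list:
    """Return the names of known allergens that share a k-residue identity with the query."""
    return [name for name, seq in known_allergens.items()
            if shares_contiguous_identity(query, seq, k)]


# Toy sequences, invented for illustration only (not real allergens).
toy_allergens = {"toy_allergen_1": "CCMKVLAAGCC", "toy_allergen_2": "GGGGGGGGGG"}
print(flag_by_contiguous_rule("AAAMKVLAAGTT", toy_allergens))
```

A short word length such as 6 makes chance matches more likely, which is consistent with the many false positives reported for alignment-based screening.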

Bioinformatics approaches to allergen prediction

1. Sequence-alignment search of query protein

Extensive databases of known allergen proteins and the FAO/WHO guidelines:
- Structural Database of Allergenic Proteins
- Allermatch

Characteristics:
- High sensitivity (true positives / (true positives + false negatives))
- Many false positives and therefore low precision (true positives / (true positives + false positives))
- Discovery of novel antigens is restricted by their lack of similarity to known allergens.

Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362
Fiers et al. BMC Bioinformatics 2004, 5, 133

Bioinformatics approaches to allergen prediction

2. Identification of conserved allergenicity-related linear motifs

- Comparing allergens to non-allergens by MEME motif discovery tool
- Clustering of known allergens, wavelet analysis and hidden Markov model
- Automated Selection of Allergen-Representative Peptides (DASARP)
- Motif search by Support Vector Machines (SVM), MEME/MAST, IgE epitopes and Allergen-Representative Peptides (ARP)
- Iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine

Both approaches are based on the assumption that allergenicity is a linearly coded property.

Stadler and Stadler FASEB J. 2003, 17, 1141-1143
Saha and Raghava Nucleic Acids Research 2006, 34, 202-209
Li et al. Bioinformatics 2004, 20, 2572-2578
Muh et al. PLoS ONE 2009, 4 (6), art. no. e5861
Björklund et al. Bioinformatics 2005, 21, 39–50

AIM of the study

To create an alignment-free method for in silico identification of allergens based on the main chemical properties of amino acid sequences and to implement it in a web server.

Obstacles:

- Allergens are proteins of different lengths.
- The choice of appropriate descriptors to represent the physicochemical properties of amino acid sequences.

The z-scales

z1: hydrophobicity
z2: molecular size
z3: polarity

…Phe – Arg – Trp…

Residue    z1      z2      z3
Phe       -4.22    1.94    1.08
Arg        3.62    2.60   -3.60
Trp       -4.36    3.94    0.69

Hellberg et al. J. Med. Chem. 1987; 30, 1126-1135
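As a minimal sketch of how a sequence is turned into z descriptors, the Python snippet below maps each residue to its (z1, z2, z3) triple. The lookup table holds only the three residues shown on this slide, with the values as printed there; a full table for all 20 amino acids would have to be taken from the cited literature. The names `Z_SCALES` and `encode_z` are hypothetical.

```python
import numpy as np

# z1, z2, z3 for the three residues shown on the slide (one-letter codes).
# Assumption: the other 17 amino acids would be added from the cited z-scale tables.
Z_SCALES = {
    "F": (-4.22, 1.94, 1.08),   # Phe
    "R": (3.62, 2.60, -3.60),   # Arg
    "W": (-4.36, 3.94, 0.69),   # Trp
}


def encode_z(sequence: str) -> np.ndarray:
    """Return an (n, 3) array; row i holds z1, z2, z3 of residue i."""
    missing = sorted(set(sequence) - set(Z_SCALES))
    if missing:
        raise ValueError(f"no z-scale values for residues: {missing}")
    return np.array([Z_SCALES[aa] for aa in sequence], dtype=float)


print(encode_z("FRW"))
```

The resulting matrix still has one row per residue, so proteins of different lengths give matrices of different sizes.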

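ACC transformation

Auto-covariance:

$$ACC_{jj}(lag) = \sum_{i}^{n-lag} \frac{Z_{j,i} \times Z_{j,i+lag}}{n-lag}$$

Cross-covariance:

$$ACC_{jk}(lag) = \sum_{i}^{n-lag} \frac{Z_{j,i} \times Z_{k,i+lag}}{n-lag}, \quad j \neq k$$

j, k are the z-scales (j, k = 1, 2, 3); i is the amino acid position; n is the number of amino acids in the sequence; lag is the distance between the two positions being paired.

Example: protein Phe – Arg – Trp – Phe – Arg – Trp (n = 6, lag = 1, so n - lag = 5)

ACC11(1): the sum of the products of z1 at each position and z1 at the next position, divided by 5.

ACC13(1): the sum of the products of z1 at each position and z3 at the next position, divided by 5.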

Wold et al. Anal. Chim. Acta 1993, 277, 239-253
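The ACC formulas can be sketched in a few lines of NumPy. This is an illustration, not the authors' implementation: the function name `acc_transform` is hypothetical, and the choice of lags 1 to 5 is an assumption inferred from the 45 variables (3 x 3 scale pairs x 5 lags) mentioned later in the talk.

```python
import numpy as np


def acc_transform(z: np.ndarray, max_lag: int = 5) -> np.ndarray:
    """Convert an (n, 3) matrix of z descriptors into a fixed-length ACC vector.

    For each lag, entry [j, k] of the 3 x 3 block is
    sum_i Z[j, i] * Z[k, i + lag] / (n - lag):
    auto-covariance when j == k, cross-covariance otherwise.
    """
    n, _ = z.shape
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    blocks = []
    for lag in range(1, max_lag + 1):
        # Pair every position i with position i + lag and average the products.
        block = z[:-lag].T @ z[lag:] / (n - lag)
        blocks.append(block.ravel())
    return np.concatenate(blocks)


# Phe - Arg - Trp - Phe - Arg - Trp, using the z values shown on the z-scales slide.
frw = np.array([[-4.22, 1.94, 1.08], [3.62, 2.60, -3.60], [-4.36, 3.94, 0.69]])
z6 = np.vstack([frw, frw])
acc = acc_transform(z6)
print(acc.shape)        # 3 * 3 * 5 = 45 values, whatever the protein length
print(round(acc[0], 3)) # ACC11(1) = -8.744
```

Because the output length depends only on the number of scales and lags, proteins of any length above `max_lag` map to vectors of the same size.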

Preliminary study

Data sets:
- 595 food allergens from CSL allergen database (http://allergen.csl.gov.uk)
- 595 non-allergens from NCBI database (http://www.ncbi.nlm.nih.gov/)

Training set: 475 food allergens, 475 non-allergens

ACC transformation of z descriptors: matrix with 45 variables (3^2 x 5) and 950 observations

Test set (external validation): 120 food allergens, 120 non-allergens

Statistical methods, machine learning:
- PLS - discriminant analysis
- Logistic regression
- Naïve - Bayes algorithm
- Decision tree algorithm
- k Nearest Neighbours

Evaluation: sensitivity, specificity, accuracy

Results from preliminary study

$$\mathrm{sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

TP – true positive, FP – false positive, TN – true negative, FN – false negative
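The three measures follow directly from the confusion-matrix counts, as the short Python sketch below shows. The function name is hypothetical and the counts in the example are invented for a 120 + 120 test set; they are not results from the study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # share of allergens recognised
        "specificity": tn / (tn + fp),               # share of non-allergens recognised
        "accuracy": (tp + tn) / (tp + fp + tn + fn), # share of all proteins classified correctly
    }


# Invented counts: 120 allergens (tp + fn) and 120 non-allergens (tn + fp).
metrics = classification_metrics(tp=90, fp=24, tn=96, fn=30)
print(metrics)
```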

[Bar chart: sensitivity, specificity and accuracy (%) by algorithm: PLS-DA, logistic regression, decision tree, Naïve-Bayes, kNN (k=3), kNN (k=5)]

Web servers on the test set

- AlgPred
  - SVM with single amino acid composition
  - SVM with dipeptide composition
- Evaller
- APPEL
- Allerhunter

Test set: 120 food allergens, 120 non-allergens

Measures: sensitivity, specificity, accuracy

[Bar chart: sensitivity, specificity and accuracy (%) by server: AlgPred (SVM, single amino acid composition), AlgPred (SVM, dipeptide composition), Evaller, APPEL, Allerhunter, kNN (k=5)]

Saha and Raghava Nucleic Acids Research 2006, 34, 202-209
Barrio et al. Nucleic Acids Research 2007, 35, 694-700
http://jing.cz3.nus.edu.sg/cgi-bin/APPEL
Muh et al. PLoS ONE 2009, 4 (6), art. no. e5861

Conclusions from the preliminary study

1. The model developed by the k Nearest Neighbours method shows the best performance on the test set compared with the other methods. It has a good balance between specificity and sensitivity, and the highest accuracy. kNN was used further in the study.

2. The server Allerhunter is the best performing among the known servers for allergen prediction. kNN needs further improvement.

3. A large imbalance exists between sensitivity and specificity for almost all servers. This indicates that the dataset needs improvement too.

The kNN algorithm

Training set: 475 allergens, 475 non-allergens

ACC transformation of z descriptors: matrix of 45 variables (3^2 x 5) and 950 observations

Unknown protein: ACC transformation of z descriptors gives a vector with 45 variables (3^2 x 5)

1. Calculate the Euclidean distance between the vector and each observation
2. Sort the distances in ascending order
3. Determine the k nearest neighbours
4. Determine the class of the unknown protein according to the majority of the nearest neighbours
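The kNN steps can be sketched as follows. This is a minimal illustration: the function name `knn_predict` is hypothetical, and the two-dimensional toy points stand in for the 45-variable ACC vectors.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x: np.ndarray, train_y: list, query: np.ndarray, k: int = 3) -> str:
    """Classify one query vector by majority vote among its k nearest training vectors."""
    # Euclidean distance between the query and every training observation.
    distances = np.linalg.norm(train_x - query, axis=1)
    # Sort the distances in ascending order and keep the k nearest neighbours.
    nearest = np.argsort(distances, kind="stable")[:k]
    # The majority class among the neighbours is the predicted class.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration (2 variables instead of 45).
train_x = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6], [5.0, 5.0], [5.2, 4.8], [4.7, 5.3]])
train_y = ["non-allergen"] * 3 + ["allergen"] * 3
print(knn_predict(train_x, train_y, np.array([4.5, 5.0]), k=3))
```

With two classes, an odd k avoids tied votes.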

Next: Extend the data sets

Allergens:
- Sources: CSL allergen database, FARRP allergen database, SDAP database, ADFS database
- 684 food, 1157 inhalant, 553 toxins, venom or salivary allergens

Non-allergens:
- NCBI database: allergen species, then proteins from allergen species
- Create local database
- BLAST search against all allergens
- 684 non-allergens from food origin, 1157 non-allergens from inhalant origin, 553 non-allergens from species with toxins, venom or salivary allergens

http://allergen.csl.gov.uk
http://www.allergenonline.org/
http://fermi.utmb.edu/SDAP/
http://allergen.nihs.go.jp/ADFS/index.jsp
http://www.ncbi.nlm.nih.gov/

Next: kNN optimization

684 food allergens, 684 non-allergens

Training set: 528 allergens, 528 non-allergens

Test set (external validation): 156 allergens, 156 non-allergens

Machine learning: k nearest neighbours


[Line chart: sensitivity, specificity and accuracy (%, axis from 50 to 100) against the number of nearest neighbours, k = 3, 5, 7, 9, 11, 13, 15, 17, 19]
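A scan over odd values of k, as in the optimization above, might look like the sketch below. It is an assumption-laden illustration: the function name is hypothetical, classes are coded 1 (allergen) and 0 (non-allergen), only accuracy is reported, and the toy data are invented and trivially separable, so they say nothing about the real data sets.

```python
import numpy as np


def knn_accuracy_by_k(train_x, train_y, test_x, test_y, ks=range(3, 20, 2)):
    """Return {k: accuracy on the test set} for each k; labels are 0 or 1."""
    # Distance from every test vector to every training vector: shape (n_test, n_train).
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    # Training indices ordered from nearest to farthest, computed once for all k.
    order = np.argsort(dists, axis=1, kind="stable")
    results = {}
    for k in ks:
        # Number of class-1 votes among the k nearest neighbours of each test vector.
        votes = train_y[order[:, :k]].sum(axis=1)
        predicted = (2 * votes > k).astype(int)
        results[k] = float((predicted == test_y).mean())
    return results


# Invented toy data: class 0 lies near x = 0..19, class 1 near x = 100..119.
train_x = np.array([[float(x), 0.0] for x in list(range(20)) + list(range(100, 120))])
train_y = np.array([0] * 20 + [1] * 20)
test_x = np.array([[0.5, 0.0], [5.5, 0.0], [100.5, 0.0], [110.5, 0.0]])
test_y = np.array([0, 0, 1, 1])
scan = knn_accuracy_by_k(train_x, train_y, test_x, test_y)
print(scan)
```

Sorting the neighbours once and slicing the first k columns avoids recomputing distances for every k.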

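kNN models

Food: 684 food allergens, 684 non-allergens
- Training set: 528 allergens, 528 non-allergens
- Test set: 156 allergens, 156 non-allergens

Inhalant: 1157 inhalant allergens, 1157 non-allergens
- Training set: 933 allergens, 933 non-allergens
- Test set: 224 allergens, 224 non-allergens

A kNN model (k = 3) was built on each training set and assessed by external validation on the test sets.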

kNN models

[Bar chart: sensitivity, specificity and accuracy (%) for five cases: kNN, food training and test set; kNN, food training set on inhalant test set; kNN, inhalant training and test set; kNN, inhalant training set on food test set; kNN, aggregated training and test set]

AllerTOP web tool for allergenicity prediction

Training set: 1952 food, inhalant and other allergens and 1952 non-allergens

kNN model, external validation

AllerTOP

http://www.pharmfac.net/allertop

Server performance on the united test set

Two of the servers from the preliminary study, APPEL and Evaller, were not available during the present study. The results for the Allerhunter server were obtained with a smaller test set because it cannot process short sequences (< 21 amino acids).

United test set of 441 food and inhalant allergens and 441 non-allergens

[Bar chart: sensitivity, specificity and accuracy (%) by server: AllerTOP (kNN, k=3), Allerhunter, AlgPred (SVM, amino acid composition), AlgPred (SVM, dipeptide composition), AlgPred (ARP)]

Conclusions

1. An alignment-free method for in silico prediction of allergens based on the main physicochemical properties of proteins was developed.

2. The method uses z descriptors to represent the amino acids in the protein sequences and the ACC transformation to convert proteins into uniform vectors.

3. The k Nearest Neighbours classification method showed the best performance among the algorithms tested in the study: PLS - discriminant analysis, Logistic regression, Naïve - Bayes and Decision Tree algorithm.

4. The kNN algorithm was optimized and its performance was compared to the freely available web servers for prediction of allergens.

5. The kNN algorithm was implemented on a web server, freely available at: http://www.pharmfac.net/allertop

Drug Design Group

Irini Doytchinova, Ivan Dimitrov, Mariyana Atanasova, Panaiot Garnev

Funding: National Research Fund, Ministry of Education and Science, Bulgaria, Grant 02-1/2009

Acknowledgements

Darren R. Flower, Aston University, Birmingham, UK

School of Pharmacy, Medical University of Sofia
