CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS
TECHNICAL UNIVERSITY OF DENMARK DTU
T cell epitope predictions using bioinformatics
(Neural networks and hidden Markov models)
Morten Nielsen, CBS, BioCentrum, DTU
Processing of intracellular proteins
http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm
MHC binding
What makes a peptide a potential and effective epitope?
• Part of a pathogen protein
• Successful processing
  – Proteasome cleavage
  – TAP binding
• Binds to MHC molecule
• Protein function
  – Early in replication
• Sequence conservation in evolution
From proteins to immunogens
Lauemøller et al., 2000
20% processed, 0.5% bind MHC, 50% CTL response
=> 1/2000 peptides are immunogenic
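The 1/2000 estimate is the product of the three fractions on the slide. A minimal check, assuming the three steps act as independent filters (the variable names are illustrative and not from the slides):

```python
# Fractions quoted on the slide (Lauemøller et al., 2000)
p_processed = 0.20   # fraction of peptides that are processed
p_bind_mhc = 0.005   # fraction that bind MHC
p_ctl = 0.50         # fraction that give a CTL response

# Assumption: the three steps are treated as independent filters
p_immunogenic = p_processed * p_bind_mhc * p_ctl
one_in = round(1 / p_immunogenic)
print(p_immunogenic, one_in)
```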
MHC Class I and II
• Class I
  – Peptides 8-12 amino acids long
  – Intracellular pathogen presentation
  – Broad range of bioinformatics prediction tools
• Class II
  – Peptides 13+ amino acids long
  – Intravesicular pathogen presentation
  – Few prediction tools
MHC class I with peptide
http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm
Prediction of HLA binding specificity
• Simple motifs
  – Allowed/non-allowed amino acids
• Extended motifs
  – Amino acid preferences (SYFPEITHI)
  – Anchor/preferred/other amino acids
• Hidden Markov models
  – Peptide statistics from sequence alignment
• Neural networks
  – Can take sequence correlations into account
SYFPEITHI database
• Anchors: required for binding
• Auxiliary anchors: help binding
Pattern recognition
• 10 peptides from MHCpep database
  – Bind to the MHC complex A*0201
    ALAKAAAAM
    ALAKAAAAM
    ALAKAAAAM
    ALAKAAAAV
    ALAKAAAAV
    GMNERPILV
    GILGFVFTM
    TLNAWVKVV
    KLNEPVLLL
    AVVPFIVSV
• Which of the following are most likely to bind?
  1. FLLTRILTI
  2. WLDQVPFSV
  3. TVILGVLLL
• Regular expression
  – X1[LMIV]2X3…X8[MVL]9
  – Predicts that 2 and 3 will bind and 1 will not bind
  – Cannot tell whether 2 or 3 is more likely to bind
• The truth is that 1 and 2 bind, and 1 binds the strongest. 3 does not bind
• A probabilistic model can capture this!
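The regular-expression motif can be tried directly on the three candidate peptides. This is a minimal sketch, assuming that X stands for "any amino acid" (written as `.`); the function name `matches_motif` is illustrative only.

```python
import re

# Motif from the slide: X1 [LMIV]2 X3...X8 [MVL]9
# Assumption: X means "any amino acid", written here as "."
MOTIF = re.compile(r"^.[LMIV].{6}[MVL]$")

candidates = ["FLLTRILTI", "WLDQVPFSV", "TVILGVLLL"]

def matches_motif(peptide: str) -> bool:
    """Return True if the 9-mer satisfies the anchor motif."""
    return MOTIF.match(peptide) is not None

predictions = {pep: matches_motif(pep) for pep in candidates}
for pep, hit in predictions.items():
    print(pep, "predicted binder" if hit else "predicted non-binder")
```

The motif gives a yes/no answer only, which is why it can neither rank peptides 2 and 3 nor recover peptide 1, whose position 9 (I) falls outside the allowed set.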
Probability estimation
ALAKAAAAM
ALAKAAAAM
ALAKAAAAM
ALAKAAAAV
ALAKAAAAV
GMNERPILV
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
$p_1(A) = \frac{6}{10}, \quad p_1(G) = \frac{2}{10}, \ldots$
$p_9(M) = \frac{4}{10}, \quad p_9(V) = \frac{5}{10}, \ldots$
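The position-specific frequencies can be obtained by counting each column of the ten aligned peptides. A minimal sketch, with no sequence weighting or pseudo counts (the helper name `position_frequencies` is illustrative):

```python
from collections import Counter

# The ten 9-mer binders listed on the slide
peptides = [
    "ALAKAAAAM", "ALAKAAAAM", "ALAKAAAAM", "ALAKAAAAV", "ALAKAAAAV",
    "GMNERPILV", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(seqs):
    """Return one dict per position mapping amino acid -> observed frequency."""
    n = len(seqs)
    length = len(seqs[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in seqs)
        freqs.append({aa: c / n for aa, c in counts.items()})
    return freqs

freqs = position_frequencies(peptides)
print(freqs[0]["A"], freqs[0]["G"])  # position 1
print(freqs[8]["M"], freqs[8]["V"])  # position 9
```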
Weight matrices
• Estimate amino acid frequencies from alignment
• Now a weight matrix is given as
  $W_{ij} = \log(p_{ij}/q_j)$
  – Here i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.
• In nature not all amino acids are found equally often
  – $P_A = 0.07$, $P_W = 0.013$
  – Finding 6% A is hence not significant, but 6% W is highly significant
• W is an L x 20 matrix, where L is the motif length
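The A versus W comparison can be made concrete by evaluating the log-odds for an observed frequency of 6%. A minimal sketch; the slide does not state the base of the logarithm, so base 2 is an assumption here, and `log_odds` is an illustrative name.

```python
import math

def log_odds(p_obs: float, q_background: float) -> float:
    """Weight-matrix element W = log(p/q); base 2 is an assumption."""
    return math.log2(p_obs / q_background)

q = {"A": 0.07, "W": 0.013}   # background frequencies from the slide
w_A = log_odds(0.06, q["A"])  # 6% A is slightly below background
w_W = log_odds(0.06, q["W"])  # 6% W is far above background
print(round(w_A, 2), round(w_W, 2))
```

The weight for A comes out slightly negative and the weight for W clearly positive, matching the statement that only the W observation is significant.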
Scoring sequences to a weight matrix
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1  0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3  0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4 -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5 -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6 -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7  1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8 -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9 -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5
Which peptide is most likely to bind? Which peptide second?
Peptide    Score
ILYQVPFSV   15.0
ALPYWNFAT   -3.4
MTAQWWLDA    0.8
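Scoring a peptide against the weight matrix means summing one matrix element per position. The sketch below uses the matrix values as printed on the slide; because those entries are rounded to one decimal, the sums may differ from the quoted scores in the last digit. The names `ROWS`, `matrix` and `score` are illustrative.

```python
# Column order and row values as printed on the slide
AMINO_ACIDS = "A R N D C Q E G H I L K M F P S T W Y V".split()

ROWS = """
0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7
-1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4
0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0
-0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7
-1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0
-0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1
1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5
-2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1
-0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
"""

# One dict per motif position: amino acid -> weight
matrix = [
    dict(zip(AMINO_ACIDS, map(float, line.split())))
    for line in ROWS.strip().splitlines()
]

def score(peptide: str) -> float:
    """Sum the position-specific weights along a 9-mer."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

peptides = ["ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"]
scores = {pep: round(score(pep), 1) for pep in peptides}
ranking = sorted(peptides, key=score, reverse=True)
print(scores, ranking)
```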
Weight-matrix construction: example from real life
• 10 peptides from MHCpep database
• Bind the MHC complex
• Estimate sequence motif and weight matrix
• Evaluate on 528 peptides (not included in training)
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
Pseudo-count and sequence weighting
ALAKAAAAM  }
ALAKAAAAN  }
ALAKAAAAR  }  Similar sequences, weight 1/5
ALAKAAAAT  }
ALAKAAAAV  }
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Limited number of data
• Poor or biased sampling of sequence space
• I is not found at position P9. Does this mean that I is forbidden?
• No! Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9
Low count correction using Blosum matrices
Blosum62 substitution frequencies
#     I     L     V
L  0.12  0.38  0.10
V  0.16  0.13  0.27
• Every time, for instance, L/V is observed, I is also likely to occur
• Estimate low (pseudo) count correction using this approach
• As more data are included the pseudo count correction becomes less important
$N_L = 2, \quad N_V = 2, \quad N_{eff} = 12$
$p_I^s = \frac{2 \cdot 0.12 + 2 \cdot 0.16}{12} = 0.05$
$p = \frac{N_{eff} \cdot p_I + \beta \cdot p_I^s}{N_{eff} + \beta}$
$p = \frac{12 \cdot 0 + 10 \cdot 0.05}{12 + 10} = 0.02$
$N_{eff}$: number of sequences; $\beta$: weight on prior or pseudo count
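The worked numbers on this slide can be reproduced in a few lines. A minimal sketch using only the values quoted above (N_L = 2, N_V = 2, N_eff = 12, weight on the pseudo count 10); the function names are illustrative.

```python
def pseudo_frequency(neighbour_counts, substitution_freq, n_eff):
    """Pseudo frequency of a target amino acid: sum_j N_j * q(target | j) / N_eff."""
    return sum(
        neighbour_counts[j] * substitution_freq[j] for j in neighbour_counts
    ) / n_eff

def corrected_frequency(p_obs, p_pseudo, n_eff, beta):
    """Blend observed and pseudo frequency: (N_eff*p + beta*p_s) / (N_eff + beta)."""
    return (n_eff * p_obs + beta * p_pseudo) / (n_eff + beta)

counts_p9 = {"L": 2, "V": 2}          # observed counts from the slide
q_I_given = {"L": 0.12, "V": 0.16}    # Blosum62 frequencies of I given L or V
n_eff, beta = 12, 10

p_s = pseudo_frequency(counts_p9, q_I_given, n_eff)
p = corrected_frequency(0.0, p_s, n_eff, beta)  # I is never observed, so p_obs = 0
print(round(p_s, 2), round(p, 2))
```

As N_eff grows relative to the pseudo count weight, the observed frequency dominates, which is the sense in which the correction becomes less important with more data.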
Example from real life (cont.)
• Raw sequence counting
  – No sequence weighting
  – No pseudo count
  – Prediction accuracy 0.45
• Sequence weighting
  – No pseudo count
  – Prediction accuracy 0.5
Example from real life (cont.)
• Sequence weighting and pseudo count
  – Prediction accuracy 0.60
• Sequence weighting, pseudo count and anchor weighting
  – Prediction accuracy 0.72
Example from real life (cont.)
• Sequence weighting, pseudo count and anchor weighting
  – Prediction accuracy 0.72
• Motif found on all data (485)
  – Prediction accuracy 0.79
Training on small data sets
[Figure panels: Class I; Class II]
Using a biased weight matrix with differential weight on anchor positions gives reliable performance for N ~ 20-50 (Lundegaard et al., 2004)
How to predict
• The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
  – Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
• Artificial neural networks (ANN) are ideally suited to take such correlations into account
Neural networks
• Neural networks can learn higher order correlations!
  – What does this mean?
    0 0 => 0
    0 1 => 1
    1 0 => 1
    1 1 => 0
No linear function can learn this pattern
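The claim that no linear function reproduces this pattern can be probed numerically. The sketch below scans a grid of weights for a single thresholded linear unit; the bias term `b` and the grid range are assumptions added for illustration, and a finite grid search illustrates rather than proves the point.

```python
import itertools

# The XOR pattern from the slide
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def linear_unit(x1, x2, w1, w2, b):
    """Single linear unit with a hard threshold at zero (bias b is an assumption)."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# Scan weights and bias from -5 to 5 in steps of 0.25
grid = [k / 4 for k in range(-20, 21)]
solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if all(linear_unit(x1, x2, w1, w2, b) == y for (x1, x2), y in XOR.items())
]
print(len(solutions))
```

The search finds no weight setting that classifies all four input pairs correctly, in line with the general argument: the four required inequalities contradict each other.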
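Learning higher order correlation
Target pattern: 0 0 => 0; 1 0 => 1; 1 1 => 0; 0 1 => 1
Network without hidden layer (inputs X1, X2; weights W1, W2; output O):
$x_1 \cdot w_1 + x_2 \cdot w_2 = o$
Has no solution!
Network with one hidden layer (inputs X1, X2; weights W11, W12, W21, W22; hidden neurons h1, h2; output weights V1, V2; output O):
$x_1 \cdot w_{11} + x_2 \cdot w_{21} = h_1$
$x_1 \cdot w_{12} + x_2 \cdot w_{22} = h_2$
$h_1 \cdot v_1 + h_2 \cdot v_2 = o$
Solution:
$v_1 = -v_2, \quad w_{11} = w_{12}, \quad w_{21} = w_{22}$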
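One concrete network that satisfies the three solution conditions is sketched below. The slide gives only the weight relations, so the step activation and the three threshold values are assumptions chosen for illustration; all weights are set to 1 apart from v2 = -1.

```python
def step(z: float, threshold: float) -> int:
    """Hard-threshold activation (an assumption; the slide shows no activation)."""
    return 1 if z > threshold else 0

# Weights obeying the slide's solution: w11 = w12, w21 = w22, v1 = -v2
w11 = w12 = 1.0
w21 = w22 = 1.0
v1, v2 = 1.0, -1.0

# Assumed thresholds: h1 fires for "at least one input", h2 for "both inputs"
T_H1, T_H2, T_OUT = 0.5, 1.5, 0.5

def network(x1: int, x2: int) -> int:
    h1 = step(x1 * w11 + x2 * w21, T_H1)
    h2 = step(x1 * w12 + x2 * w22, T_H2)
    return step(h1 * v1 + h2 * v2, T_OUT)

outputs = {(x1, x2): network(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(outputs)
```

With identical input weights the two hidden neurons differ only in their thresholds, and the opposite-sign output weights subtract the "both inputs" case from the "at least one input" case.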
Mutual information
$I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \cdot \log \frac{P(aa_i, aa_j)}{P(aa_i) \cdot P(aa_j)}$
Example for positions P1 and P6:
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS
P(G1) = 2/9 = 0.22, ..
P(V6) = 4/9 = 0.44, ..
P(G1,V6) = 2/9 = 0.22, P(G1)*P(V6) = 8/81 = 0.10
log(0.22/0.10) > 0
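The mutual information between two positions can be computed directly from the nine peptides shown. A minimal sketch that estimates the probabilities from raw counts; the natural logarithm is an assumption, since the slide does not give the base, and `mutual_information` is an illustrative name.

```python
import math
from collections import Counter

# The nine peptides from the slide
peptides = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]

def mutual_information(seqs, i, j):
    """I(i,j) = sum over observed pairs of P(a,b) * log(P(a,b) / (P(a) * P(b)))."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    total = 0.0
    for (a, b), c in count_ij.items():
        joint = c / n
        total += joint * math.log(joint / ((count_i[a] / n) * (count_j[b] / n)))
    return total

# Positions are 0-based in code: P1 -> index 0, P6 -> index 5
p_g1 = sum(s[0] == "G" for s in peptides) / len(peptides)
p_v6 = sum(s[5] == "V" for s in peptides) / len(peptides)
p_g1_v6 = sum(s[0] == "G" and s[5] == "V" for s in peptides) / len(peptides)
mi_p1_p6 = mutual_information(peptides, 0, 5)
print(round(p_g1, 2), round(p_v6, 2), round(p_g1_v6, 2), mi_p1_p6 > 0)
```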
Epitope predictions: mutual information
[Figure panels: 313 binding peptides; 313 random peptides]
Choice of method
• Neural networks are superior when trained on large data sets
• Simple and extended motif methods when little or no data is available
• HMM/weight matrices with position-specific differential weight otherwise
  – Increase weight on anchor positions
Evaluation of prediction accuracy
• ROC curves
  – True positive proportion = TP/AP
  – False positive proportion = FP/AN
• Pearson correlation
[Figure: ROC curves with Aroc = 0.5 and Aroc = 0.8]
Construction of ROC curves
True positive proportion = TP/AP
False positive proportion = FP/AN
[Figure: ROC curves with Aroc = 0.5 and Aroc = 0.8]
Number  Sequence   Assignment  Prediction
1       ILYQVPFSV  0.853       12.137
2       YLEPGPVTV  0.647       11.509
3       GLMTAVYLV  0.798       10.021
4       YLDLALMSV  0.842        9.632
5       GLYSSTVPV  0.697        9.335
6       HLYQGCQVV  0.539        9.265
7       RMYGVLPWI  0.689        8.948
8       FLPWHRLFL  0.564        8.926
9       LLPSLFLLL  0.554        8.890
10      ILSSLGLPV  0.638        8.491
11      FLLTRILTI  0.803        8.343
12      ILDEAYVMA  0.494        6.084
13      VVMGTLVAL  0.589        5.935
14      MALLRLPLV  0.634        4.761
15      MLQDMAILT  0.527        4.450
16      KILSVFFLA  0.851        3.578
17      ILTVILGVL  0.451        3.358
18      ALAKAAAAA  0.563        2.849
19      LVSLLTFMI  0.301        1.193
20      ALPYWNFAT  0.323        0.994
Assignment > 0.5: actual positives, AP (16). Assignment < 0.5: actual negatives, AN (4).
Counts at successive prediction cutoffs, moving down the ranked list:
• After peptide 3: TP=3, FP=0
• After peptide 12: TP=11, FP=1
• After peptide 20: TP=16, FP=4
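The ROC curve for the 20 peptides in the table can be traced by walking down the list in order of decreasing prediction score and counting true and false positives. A minimal sketch, assuming the 0.5 cutoff on the assignment column defines the actual positives; the function names are illustrative and the area is computed with the trapezoid rule.

```python
# (assignment, prediction) pairs from the table, in the order listed
data = [
    (0.853, 12.137), (0.647, 11.509), (0.798, 10.021), (0.842, 9.632),
    (0.697, 9.335), (0.539, 9.265), (0.689, 8.948), (0.564, 8.926),
    (0.554, 8.890), (0.638, 8.491), (0.803, 8.343), (0.494, 6.084),
    (0.589, 5.935), (0.634, 4.761), (0.527, 4.450), (0.851, 3.578),
    (0.451, 3.358), (0.563, 2.849), (0.301, 1.193), (0.323, 0.994),
]

def roc_points(rows, cutoff=0.5):
    """Return (FP/AN, TP/AP) after each peptide, ranked by prediction score."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    ap = sum(1 for a, _ in ranked if a > cutoff)  # actual positives
    an = len(ranked) - ap                         # actual negatives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for assignment, _ in ranked:
        if assignment > cutoff:
            tp += 1
        else:
            fp += 1
        points.append((fp / an, tp / ap))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under the ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

points = roc_points(data)
a_roc = area_under_curve(points)
print(points[3], points[12], points[20], a_roc)
```

The entries `points[3]`, `points[12]` and `points[20]` correspond to the three cutoffs listed on the slide.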
Epitope predictions: sequence motif and HMMs
• Sequence motif: cc 0.76, Aroc 0.92
• HMM: cc 0.80, Aroc 0.95
Epitope prediction: neural networks
• cc 0.91, Aroc 0.98
Evaluation of prediction accuracy
[Figure: Pearson correlation (Pear) and Aroc for the Motif, Hmm and ANN methods; vertical axis from 0 to 1]
Location of class I epitopes
GP1200 protein
Structure (1GM9)
Hepatitis C virus. Epitope predictions
MHC Class II binding
• TEPITOPE. Virtual matrices (Hammer, J., Current Opinion in Immunology 7, 263-269, 1995)
• PROPRED. Quantitative matrices (Singh H, Raghava GP, Bioinformatics 2001 Dec;17(12):1236-7)
  – Web interface http://www.imtech.res.in/raghava/propred
• Gibbs sampler (Nielsen et al., Bioinformatics 2004. Improved prediction of MHC class I and II epitopes using a novel Gibbs sampler approach)
MHC class II prediction
• Complexity of problem
  – Peptides of different length
  – Weak motif signal
• Alignment crucial
• Gibbs Monte Carlo sampler
RFFGGDRGAPKRGYLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPAGSLFVYNITTNKYKAFLDKQSALLSSDITASVNCAKPKYVHQNTLKLATGFKGEQGPKGEPDVFKELKVHHANENISRYWAIRTRSGGITYSTNEIDLQLSQEDGQTIE
Class II binding motif
RFFGGDRGAPKRG YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK PKYVHQNTLKLAT GFKGEQGPKGEP DVFKELKVHHANENI SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTI
[Figure panels: Gibbs sampler motif; Alignment by Gibbs sampler]
MHC class II predictions: allele DRB1_0401
[Figure: accuracy (vertical axis, 0 to 0.9) of Tepitope and Gibbs on the benchmark sets MHCbench1 to MHCbench8, Southwood and Geluk]
Summary
• The binding motif of class I MHC is well characterized by HMM/weight matrices
  – Even when limited data is available
• Neural networks can be trained to predict MHC binding with high accuracy
  – NN can include higher order sequence correlations
• The MHC class II peptide binding motif can be described using a Gibbs sampler algorithm