T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Center for Biological Sequence Analysis, Technical University of Denmark (DTU)

T cell Epitope predictions using bioinformatics

(Neural Networks and hidden Markov models)

Morten Nielsen, CBS, BioCentrum,

Processing of intracellular proteins

http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

MHC binding

What makes a peptide a potential and effective epitope?

• Part of a pathogen protein
• Successful processing
  – Proteasome cleavage
  – TAP binding
• Binds to MHC molecule
• Protein function
  – Early in replication
• Sequence conservation in evolution

From proteins to immunogens

Lauemøller et al., 2000

20% processed × 0.5% bind MHC × 50% CTL response

=> 1/2000 peptides are immunogenic
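The 1/2000 figure is the product of the three fractions on the slide. A minimal check in Python (the variable names are illustrative, not from the slides):

```python
from fractions import Fraction

# Fractions quoted on the slide
processed = Fraction(20, 100)      # 20% of peptides are processed
bind_mhc = Fraction(5, 1000)       # 0.5% bind MHC
ctl_response = Fraction(50, 100)   # 50% give a CTL response

immunogenic = processed * bind_mhc * ctl_response
print(immunogenic)  # 1/2000
```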

MHC Class I and II

• Class I
  – Peptides 8-12 amino acids long
  – Intracellular pathogen presentation
  – Broad range of bioinformatics prediction tools
• Class II
  – Peptides 13+ amino acids long
  – Intravesicular pathogen presentation
  – Few prediction tools

MHC class I with peptide

http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

Prediction of HLA binding specificity

• Simple motifs
  – Allowed/non-allowed amino acids
• Extended motifs
  – Amino acid preferences (SYFPEITHI)
  – Anchor/preferred/other amino acids
• Hidden Markov models
  – Peptide statistics from sequence alignment
• Neural networks
  – Can take sequence correlations into account

Syfpeithi database

• Anchors: required for binding
• Auxiliary anchors: help binding

Pattern recognition

• 10 peptides from MHCpep database
  – Bind to the MHC complex A*0201
• Which of the following are most likely to bind?
  1. FLLTRILTI
  2. WLDQVPFSV
  3. TVILGVLLL
• Regular expression
  – X1[LMIV]2X3…X8[MVL]9
  – 2 and 3 will bind and 1 will not bind
  – Cannot tell whether 2 is more likely to bind than 3
• Truth is that 1 and 2 bind, and 1 binds the strongest. 3 does not bind
• A probabilistic model can capture this!

ALAKAAAAM
ALAKAAAAM
ALAKAAAAM
ALAKAAAAV
ALAKAAAAV
GMNERPILV
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
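The regular-expression motif can be tried directly on the three candidate peptides. The sketch below assumes the slide's pattern means "any residue at positions 1 and 3 to 8, L/M/I/V at position 2, M/V/L at position 9"; the Python pattern string is a translation of that reading, not something given on the slide.

```python
import re

# X1 [LMIV]2 X3..X8 [MVL]9, written as a Python regular expression for 9-mers
MOTIF = re.compile(r"^.[LMIV].{6}[MVL]$")

peptides = {1: "FLLTRILTI", 2: "WLDQVPFSV", 3: "TVILGVLLL"}
matches = {n: bool(MOTIF.match(p)) for n, p in peptides.items()}
print(matches)  # {1: False, 2: True, 3: True}
```

The match is all-or-nothing, which is why the motif can neither rank peptides 2 and 3 nor recover peptide 1, whose position 9 (I) falls outside the allowed set.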

Probability estimation

ALAKAAAAM
ALAKAAAAM
ALAKAAAAM
ALAKAAAAV
ALAKAAAAV
GMNERPILV
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

$p_1(A) = \frac{6}{10}$, $p_1(G) = \frac{2}{10}$

$p_9(M) = \frac{4}{10}$, $p_9(V) = \frac{5}{10}$
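These frequencies are plain column counts over the ten peptides. A minimal sketch of the counting (the helper name `position_frequencies` is made up for illustration):

```python
from collections import Counter

peptides = [
    "ALAKAAAAM", "ALAKAAAAM", "ALAKAAAAM", "ALAKAAAAV", "ALAKAAAAV",
    "GMNERPILV", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(seqs, pos):
    """Return {amino acid: observed frequency} at 1-based position pos."""
    counts = Counter(s[pos - 1] for s in seqs)
    return {aa: n / len(seqs) for aa, n in counts.items()}

p1 = position_frequencies(peptides, 1)
p9 = position_frequencies(peptides, 9)
print(p1["A"], p1["G"], p9["M"], p9["V"])  # 0.6 0.2 0.4 0.5
```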

Weight matrices

• Estimate amino acid frequencies from alignment
• Now a weight matrix is given as
  $W_{ij} = \log(p_{ij}/q_j)$
  – Here i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.
• In nature not all amino acids are found equally often
  – $P_A = 0.07$, $P_W = 0.013$
  – Finding 6% A is hence not significant, but 6% W is highly significant
• W is an L x 20 matrix, where L is the motif length
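The A versus W comparison can be made concrete by evaluating the log-odds weight for an observed frequency of 6%. The slide does not state the base of the logarithm; the natural log is assumed here, which changes the scale of the weights but not their sign.

```python
import math

# Background frequencies quoted on the slide
q = {"A": 0.07, "W": 0.013}
p_observed = 0.06  # 6% observed at some motif position

# W_ij = log(p_ij / q_j); natural log is an assumption
weights = {aa: math.log(p_observed / q[aa]) for aa in q}
for aa, w in weights.items():
    print(aa, round(w, 2))
```

The weight for A comes out slightly negative (6% is below its background), while the weight for W is clearly positive.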

Scoring sequences to a weight matrix

      A     R     N     D     C     Q     E     G     H     I     L     K     M     F     P     S     T     W     Y     V
1   0.6   0.4  -3.5  -2.4  -0.4  -1.9  -2.7   0.3  -1.1   1.0   0.3   0.0   1.4   1.2  -2.7   1.4  -1.2  -2.0   1.1   0.7
2  -1.6  -6.6  -6.5  -5.4  -2.5  -4.0  -4.7  -3.7  -6.3   1.0   5.1  -3.7   3.1  -4.2  -4.3  -4.2  -0.2  -5.9  -3.8   0.4
3   0.2  -1.3   0.1   1.5   0.0  -1.8  -3.3   0.4   0.5  -1.0   0.3  -2.5   1.2   1.0  -0.1  -0.3  -0.5   3.4   1.6   0.0
4  -0.1  -0.1  -2.0   2.0  -1.6   0.5   0.8   2.0  -3.3   0.1  -1.7  -1.0  -2.2  -1.6   1.7  -0.6  -0.2   1.3  -6.8  -0.7
5  -1.6  -0.1   0.1  -2.2  -1.2   0.4  -0.5   1.9   1.2  -2.2  -0.5  -1.3  -2.2   1.7   1.2  -2.5  -0.1   1.7   1.5   1.0
6  -0.7  -1.4  -1.0  -2.3   1.1  -1.3  -1.4  -0.2  -1.0   1.8   0.8  -1.9   0.2   1.0  -0.4  -0.6   0.4  -0.5  -0.0   2.1
7   1.1  -3.8  -0.2  -1.3   1.3  -0.3  -1.3  -1.4   2.1   0.6   0.7  -5.0   1.1   0.9   1.3  -0.5  -0.9   2.9  -0.4   0.5
8  -2.2   1.0  -0.8  -2.9  -1.4   0.4   0.1  -0.4   0.2  -0.0   1.1  -0.5  -0.5   0.7  -0.3   0.8   0.8  -0.7   1.3  -1.1
9  -0.2  -3.5  -6.1  -4.5   0.7  -0.8  -2.5  -4.0  -2.6   0.9   2.8  -3.0  -1.8  -1.4  -6.2  -1.9  -1.6  -4.9  -1.6   4.5

Which peptide is most likely to bind? Which peptide second?

ILYQVPFSV   15.0
ALPYWNFAT   -3.4
MTAQWWLDA    0.8
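A peptide's score is the sum of the matrix entries for its residue at each position. The sketch below copies the matrix values from the slide and sums them; the function name `score` and the data layout are illustrative choices.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order of the matrix

# One row per peptide position (1..9), values copied from the slide
ROWS = """
0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7
-1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4
0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0
-0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7
-1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0
-0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1
1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5
-2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1
-0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
"""

WEIGHTS = [[float(x) for x in line.split()] for line in ROWS.strip().splitlines()]

def score(peptide):
    """Sum the weight of each residue at its position in the 9-mer."""
    return sum(WEIGHTS[i][AMINO_ACIDS.index(aa)] for i, aa in enumerate(peptide))

for pep in ("ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"):
    print(pep, round(score(pep), 1))
```

With the one-decimal entries shown, the first two peptides reproduce the slide's 15.0 and -3.4. The third sums to about 0.7 rather than 0.8, presumably because the slide's score was computed from unrounded weights.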

Weight-matrix construction: example from real life

• 10 peptides from MHCpep database
• Bind the MHC complex
• Estimate sequence motif and weight matrix
• Evaluate on 528 peptides (not included in training)

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Pseudo-count and sequence weighting

ALAKAAAAM  }
ALAKAAAAN  }
ALAKAAAAR  }  Similar sequences, weight 1/5
ALAKAAAAT  }
ALAKAAAAV  }
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

• Limited number of data
• Poor or biased sampling of sequence space
• I is not found at position P9. Does this mean that I is forbidden?
• No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9

Low count correction using Blosum matrices

Blosum62 substitution frequencies

#     I     L     V
L  0.12  0.38  0.10
V  0.16  0.13  0.27

• Every time, for instance, L/V is observed, I is also likely to occur
• Estimate low (pseudo) count correction using this approach
• As more data are included the pseudo count correction becomes less important

$p = \frac{N_{eff} \cdot p_I + \beta \cdot p_I^{s}}{N_{eff} + \beta}$

$N_L = 2$, $N_V = 2$, $N_{eff} = 12$

$p_I^{s} = \frac{2 \cdot 0.12 + 2 \cdot 0.16}{12} = 0.05$

$p = \frac{12 \cdot 0 + 10 \cdot 0.05}{12 + 10} = 0.02$

$N_{eff}$: number of sequences; $\beta$: weight on prior or pseudo count
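The worked numbers can be reproduced with two small functions. This is a sketch of the slide's arithmetic only: the function names are hypothetical, and just the two Blosum62 entries quoted on the slide (L to I, V to I) are used.

```python
def pseudo_frequency(counts, subst_to_target, n_eff):
    """Pseudo frequency of the target residue from observed neighbours."""
    return sum(counts[aa] * subst_to_target[aa] for aa in counts) / n_eff

def corrected_frequency(p_obs, p_pseudo, n_eff, beta):
    """Mix observed and pseudo frequency, weighted by n_eff and beta."""
    return (n_eff * p_obs + beta * p_pseudo) / (n_eff + beta)

counts = {"L": 2, "V": 2}        # N_L and N_V from the slide
to_I = {"L": 0.12, "V": 0.16}    # Blosum62 frequencies for L->I and V->I

p_s = pseudo_frequency(counts, to_I, n_eff=12)
p = corrected_frequency(p_obs=0.0, p_pseudo=p_s, n_eff=12, beta=10)
print(round(p_s, 2), round(p, 2))  # 0.05 0.02
```

As `n_eff` grows relative to `beta`, the observed frequency dominates, which is the sense in which the correction becomes less important with more data.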

Example from real life (cont.)

• Raw sequence counting
  – No sequence weighting
  – No pseudo count
  – Prediction accuracy 0.45
• Sequence weighting
  – No pseudo count
  – Prediction accuracy

• Sequence weighting and pseudo count
  – Prediction accuracy 0.60
• Sequence weighting, pseudo count and anchor weighting
  – Prediction accuracy

• Sequence weighting, pseudo count and anchor weighting
  – Prediction accuracy
• Motif found on all data (485)
  – Prediction accuracy

Training on small data sets

[Figure panels: Class I, Class II]

Using a biased weight matrix with differential weight on anchor positions gives reliable performance for N ~ 20-50 (Lundegaard et al., 2004)

How to predict

• The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
  – Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
• Artificial neural networks (ANN) are ideally suited to take such correlations into account

Neural networks

• Neural networks can learn higher order correlations!
  – What does this mean?

0 0 => 0
0 1 => 1
1 0 => 1
1 1 => 0

No linear function can learn this pattern

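Learning higher order correlation

0 0 => 0; 1 0 => 1
1 1 => 0; 0 1 => 1

Without a hidden layer:

$x_1 \cdot w_1 + x_2 \cdot w_2 = o$

Has no solution!

With a hidden layer:

$x_1 \cdot w_{11} + x_2 \cdot w_{21} = h_1$

$x_1 \cdot w_{12} + x_2 \cdot w_{22} = h_2$

$h_1 \cdot v_1 + h_2 \cdot v_2 = o$

Solution:

$v_1 = -v_2$, $w_{11} = w_{12}$, $w_{21} = w_{22}$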
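The contrast can be checked numerically. The sketch below rests on assumptions the slide does not spell out: each unit applies a step (threshold) activation and has its own bias, and the specific weights and thresholds are one illustrative choice that satisfies v1 = -v2, w11 = w12, w21 = w22. A coarse grid search finds no single threshold unit that reproduces the pattern, while the two-hidden-unit network does.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z):
    """Threshold activation (an assumption; not stated on the slide)."""
    return 1 if z > 0 else 0

def linear_unit(x1, x2, w1, w2, b):
    return step(x1 * w1 + x2 * w2 + b)

# Coarse grid search over weights and bias for a single unit
grid = [k / 2 for k in range(-6, 7)]  # -3.0 ... 3.0 in steps of 0.5
linear_hits = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(linear_unit(x1, x2, w1, w2, b) == y for (x1, x2), y in XOR.items())
]

def two_layer(x1, x2):
    w11 = w12 = 1.0          # w11 = w12
    w21 = w22 = 1.0          # w21 = w22
    v1, v2 = 1.0, -1.0       # v1 = -v2
    h1 = step(x1 * w11 + x2 * w21 - 0.5)  # on if at least one input is on
    h2 = step(x1 * w12 + x2 * w22 - 1.5)  # on only if both inputs are on
    return step(h1 * v1 + h2 * v2 - 0.5)

print(len(linear_hits))                                  # 0
print(all(two_layer(*x) == y for x, y in XOR.items()))   # True
```

The two hidden units differ only in their thresholds; the output unit fires when the first is on and the second is off.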

Mutual information

$I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \cdot \log\left[\frac{P(aa_i, aa_j)}{P(aa_i) \cdot P(aa_j)}\right]$

ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS

P(G1) = 2/9 = 0.22, ..
P(V6) = 4/9 = 0.44, ..
P(G1,V6) = 2/9 = 0.22, P(G1)*P(V6) = 8/81 = 0.10

log(0.22/0.10) > 0
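The mutual information between two positions can be computed directly from the nine peptides. A minimal sketch follows, assuming the natural logarithm (the slide does not give the base) and summing only over residue pairs that actually occur; `mutual_information` is an illustrative name.

```python
import math
from collections import Counter

peptides = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]

def mutual_information(seqs, i, j):
    """I(i,j) for 1-based positions i and j, from observed frequencies."""
    n = len(seqs)
    pair = Counter((s[i - 1], s[j - 1]) for s in seqs)
    p_i = Counter(s[i - 1] for s in seqs)
    p_j = Counter(s[j - 1] for s in seqs)
    return sum(
        (c / n) * math.log((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in pair.items()
    )

n_g1 = sum(s[0] == "G" for s in peptides)                     # 2
n_v6 = sum(s[5] == "V" for s in peptides)                     # 4
n_g1_v6 = sum(s[0] == "G" and s[5] == "V" for s in peptides)  # 2
mi_1_6 = mutual_information(peptides, 1, 6)
print(n_g1, n_v6, n_g1_v6, mi_1_6 > 0)  # 2 4 2 True
```

With only nine sequences the estimate is noisy, so the value is best read as an illustration of the formula rather than as evidence of a real correlation.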

Epitope predictions: mutual information

[Figure panels: 313 binding peptides; 313 random peptides]

Choice of method

• Neural networks are superior when trained on many data
• Simple and extended motif method when little or no data is available
• HMM/weight matrices with position specific differential weight otherwise
  – Increase weight on anchor positions

Evaluation of prediction accuracy

• ROC curves
  – True positive proportion = TP/AP
  – False positive proportion = FP/AN
  – Example curves: Aroc = 0.5, Aroc = 0.8
• Pearson correlation

Construction of ROC curves

True positive proportion = TP/AP

False positive proportion = FP/AN

[Figure: ROC curves, Aroc = 0.5 and Aroc = 0.8]

Number  Sequence   Assignment  Prediction
 1      ILYQVPFSV  0.853       12.137
 2      YLEPGPVTV  0.647       11.509
 3      GLMTAVYLV  0.798       10.021
 4      YLDLALMSV  0.842        9.632
 5      GLYSSTVPV  0.697        9.335
 6      HLYQGCQVV  0.539        9.265
 7      RMYGVLPWI  0.689        8.948
 8      FLPWHRLFL  0.564        8.926
 9      LLPSLFLLL  0.554        8.890
10      ILSSLGLPV  0.638        8.491
11      FLLTRILTI  0.803        8.343
12      ILDEAYVMA  0.494        6.084
13      VVMGTLVAL  0.589        5.935
14      MALLRLPLV  0.634        4.761
15      MLQDMAILT  0.527        4.450
16      KILSVFFLA  0.851        3.578
17      ILTVILGVL  0.451        3.358
18      ALAKAAAAA  0.563        2.849
19      LVSLLTFMI  0.301        1.193
20      ALPYWNFAT  0.323        0.994

Assignment > 0.5: AP (16); assignment < 0.5: AN (4)

Top 3 predictions: TP=3, FP=0
Top 12 predictions: TP=11, FP=1
All 20 predictions: TP=16, FP=4
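The TP/FP counts and the area under the curve can be recomputed from the table. The sketch below ranks peptides by prediction score, treats an assignment above 0.5 as an actual positive (as on the slide), and integrates the resulting step curve with the trapezoid rule; the function names are illustrative.

```python
# (sequence, assignment, prediction) rows copied from the slide
DATA = [
    ("ILYQVPFSV", 0.853, 12.137), ("YLEPGPVTV", 0.647, 11.509),
    ("GLMTAVYLV", 0.798, 10.021), ("YLDLALMSV", 0.842, 9.632),
    ("GLYSSTVPV", 0.697, 9.335), ("HLYQGCQVV", 0.539, 9.265),
    ("RMYGVLPWI", 0.689, 8.948), ("FLPWHRLFL", 0.564, 8.926),
    ("LLPSLFLLL", 0.554, 8.890), ("ILSSLGLPV", 0.638, 8.491),
    ("FLLTRILTI", 0.803, 8.343), ("ILDEAYVMA", 0.494, 6.084),
    ("VVMGTLVAL", 0.589, 5.935), ("MALLRLPLV", 0.634, 4.761),
    ("MLQDMAILT", 0.527, 4.450), ("KILSVFFLA", 0.851, 3.578),
    ("ILTVILGVL", 0.451, 3.358), ("ALAKAAAAA", 0.563, 2.849),
    ("LVSLLTFMI", 0.301, 1.193), ("ALPYWNFAT", 0.323, 0.994),
]

def roc_counts(rows, cutoff=0.5):
    """Cumulative (TP, FP) after each peptide, ranked by prediction score."""
    ranked = sorted(rows, key=lambda r: r[2], reverse=True)
    tp = fp = 0
    counts = [(0, 0)]
    for _, assignment, _ in ranked:
        if assignment > cutoff:
            tp += 1
        else:
            fp += 1
        counts.append((tp, fp))
    return counts

def area_under_roc(counts):
    """Trapezoid-rule area of the (FP/AN, TP/AP) curve."""
    ap, an = counts[-1]
    pts = [(fp / an, tp / ap) for tp, fp in counts]
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

counts = roc_counts(DATA)
aroc = area_under_roc(counts)
print(counts[3], counts[12], counts[20])  # (3, 0) (11, 1) (16, 4)
print(aroc)
```

For these 20 peptides the area works out to 58/64, about 0.91: of the 64 positive/negative pairs, 58 have the positive ranked above the negative.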

Epitope predictions: sequence motif and HMMs

• Sequence motif: cc 0.76, Aroc 0.92
• HMM: cc 0.80, Aroc 0.95

Epitope prediction: neural networks

cc: 0.91, Aroc: 0.98

Evaluation of prediction accuracy

[Bar chart: Pearson correlation and Aroc for the Motif, HMM and ANN methods]

Location of class I epitopes

GP1200 protein, structure (1GM9)

Hepatitis C virus. Epitope predictions

MHC Class II binding

• TEPITOPE. Virtual matrices (Hammer, J., Current Opinion in Immunology 7, 263-269, 1995)
• PROPRED. Quantitative matrices (Singh H, Raghava GP, Bioinformatics 2001 Dec;17(12):1236-7)
  – Web interface http://www.imtech.res.in/raghava/propred
• Gibbs sampler (Nielsen et al., Bioinformatics 2004. Improved prediction of MHC class I and II epitopes using a novel Gibbs sampler approach)

MHC class II prediction

• Complexity of problem
  – Peptides of different length
  – Weak motif signal
• Alignment crucial
• Gibbs Monte Carlo sampler

RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTIE

Class II binding motif

RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTI

[Figure panels: Alignment by Gibbs sampler; Gibbs sampler motif]

MHC class II predictions, allele DRB1_0401

[Bar chart: performance (axis 0 to 0.9) of Tepitope and Gibbs on the MHCbench1 to MHCbench8, Southwood and Geluk data sets]

Summary

• Binding motif of class I MHC binding is well characterized by HMM/weight matrices
  – This holds even when limited data is available
• Neural networks can be trained to predict MHC binding with high accuracy
  – NN can include higher order sequence correlations
• MHC Class II peptide binding motif can be described using a Gibbs sampler algorithm
