T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Date post: 19-Mar-2016
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models) Morten Nielsen, CBS, BioCentrum, DTU
Transcript
Page 1: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

T cell Epitope predictions using bioinformatics
(Neural Networks and hidden Markov models)

Morten Nielsen, CBS, BioCentrum, DTU

Page 2: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Processing of intracellular proteins 

http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

MHC binding

Page 3: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

What makes a peptide a potential and effective epitope?
• Part of a pathogen protein
• Successful processing
– Proteasome cleavage
– TAP binding
• Binds to MHC molecule
• Protein function
– Early in replication
• Sequence conservation in evolution

Page 4: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

From proteins to immunogens

Lauemøller et al., 2000

20% processed, 0.5% bind MHC, 50% CTL response

=> 1/2000 peptides are immunogenic
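
The 1/2000 figure follows from multiplying the three fractions on the slide. Treating the three steps as independent is an assumption made here for the arithmetic:

```latex
\[
0.20 \times 0.005 \times 0.50 = 5 \times 10^{-4} = \frac{1}{2000}
\]
```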

Page 5: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

MHC Class I and II
• Class I
– Peptides 8-12 amino acids long
– Intracellular pathogen presentation
– Broad range of bioinformatics prediction tools
• Class II
– Peptides 13+ amino acids long
– Intravesicular pathogen presentation
– Few prediction tools

Page 6: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

MHC class I with peptide

http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

Page 7: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Prediction of HLA binding specificity
• Simple motifs
– Allowed/non-allowed amino acids
• Extended motifs
– Amino acid preferences (SYFPEITHI)
– Anchor/preferred/other amino acids
• Hidden Markov models
– Peptide statistics from sequence alignment
• Neural networks
– Can take sequence correlations into account

Page 8: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

SYFPEITHI database
• Anchors: required for binding
• Auxiliary anchors: help binding

Page 9: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Pattern recognition
• 10 peptides from MHCpep database
– Bind to the MHC complex A*0201
• Which of the following are most likely to bind?
1. FLLTRILTI
2. WLDQVPFSV
3. TVILGVLLL
• Regular expression
– X1[LMIV]2X3…X8[MVL]9
– 2 and 3 will bind and 1 will not bind
– Cannot tell whether 2 is more likely to bind than 3
• Truth is that 1 and 2 bind, and 1 binds the strongest. 3 does not bind
• A probabilistic model can capture this!

ALAKAAAAM
ALAKAAAAM
ALAKAAAAM
ALAKAAAAV
ALAKAAAAV
GMNERPILV
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
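
The regular-expression motif can be checked mechanically against the three candidate peptides. This is a minimal sketch using Python's standard `re` module; the names `MOTIF` and `matches_motif` are illustrative and not from the slides.

```python
import re

# Simple motif from the slide: any residue at P1, [LMIV] at P2,
# any residue at P3-P8, and [MVL] at P9.
MOTIF = re.compile(r"^.[LMIV].{6}[MVL]$")

candidates = ["FLLTRILTI", "WLDQVPFSV", "TVILGVLLL"]


def matches_motif(peptide: str) -> bool:
    """Return True if a 9-mer satisfies the allowed/non-allowed motif."""
    return MOTIF.match(peptide) is not None


for number, peptide in enumerate(candidates, start=1):
    print(number, peptide, matches_motif(peptide))
```

The motif gives only a yes/no answer per peptide, which is why it cannot rank peptide 2 against peptide 3 and why it rejects peptide 1 outright.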

Page 10: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Probability estimation

ALAKAAAAM
ALAKAAAAM
ALAKAAAAM
ALAKAAAAV
ALAKAAAAV
GMNERPILV
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

$p_1(A) = \frac{6}{10},\; p_1(G) = \frac{2}{10},\; \ldots$

$p_9(M) = \frac{4}{10},\; p_9(V) = \frac{5}{10},\; \ldots$
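
The position-specific frequencies can be reproduced by counting residues column by column in the ten training peptides. This is a small sketch; the function name `position_frequencies` is an illustrative choice.

```python
from collections import Counter

# The ten training 9-mers shown on the slide.
peptides = [
    "ALAKAAAAM", "ALAKAAAAM", "ALAKAAAAM", "ALAKAAAAV", "ALAKAAAAV",
    "GMNERPILV", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def position_frequencies(seqs):
    """Return one dict per position mapping amino acid -> observed frequency."""
    length = len(seqs[0])
    freqs = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in seqs)
        freqs.append({aa: n / len(seqs) for aa, n in counts.items()})
    return freqs


p = position_frequencies(peptides)
print(p[0]["A"], p[0]["G"])  # position 1 (index 0)
print(p[8]["M"], p[8]["V"])  # position 9 (index 8)
```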

Page 11: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Weight matrices
• Estimate amino acid frequencies from alignment
• Now a weight matrix is given as
$W_{ij} = \log(p_{ij}/q_j)$
– Here i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.
• In nature not all amino acids are found equally often
– $P_A = 0.07$, $P_W = 0.013$
– Finding 6% A is hence not significant, but 6% W highly significant
• W is an L x 20 matrix, where L is the motif length
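
The contrast between 6% A and 6% W can be made concrete by evaluating the log-odds weight with the two background frequencies quoted above. The slide does not state the base of the logarithm; base 2 is an assumption here, and `log_odds` is an illustrative name.

```python
import math


def log_odds(p: float, q: float) -> float:
    """Weight-matrix entry log(p/q); base 2 is an assumption made here."""
    return math.log2(p / q)


# Background frequencies quoted on the slide.
q_A, q_W = 0.07, 0.013

# Observing either amino acid in 6% of the peptides at a motif position:
w_A = log_odds(0.06, q_A)  # observed less often than background: negative
w_W = log_odds(0.06, q_W)  # observed far more often than background: positive
print(round(w_A, 2), round(w_W, 2))
```

The sign of the weight shows whether a residue is enriched or depleted relative to background, regardless of the base chosen.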

Page 12: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Scoring sequences to a weight matrix

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1  0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3  0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4 -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5 -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6 -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7  1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8 -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9 -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5

Which peptide is most likely to bind? Which peptide second?

ILYQVPFSV  15.0
ALPYWNFAT  -3.4
MTAQWWLDA   0.8
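
Scoring a peptide means summing, for each position, the matrix entry of the residue found there. The sketch below uses the matrix values as displayed on the slide; `AMINO_ACIDS`, `WEIGHTS` and `score` are illustrative names. Because the displayed entries are rounded to one decimal, the sums can differ from the slide's scores in the last digit: the first two peptides give 15.0 and -3.4, while the third sums to about 0.7 against the slide's 0.8.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows are motif positions 1-9; columns follow AMINO_ACIDS (values from the slide).
WEIGHTS = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0,
     0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0,
     5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0,
     0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1,
     -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2,
     -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8,
     0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, -0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6,
     0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, -0.0,
     1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9,
     2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]


def score(peptide: str) -> float:
    """Sum the matrix entries selected by the residue at each position."""
    return sum(
        WEIGHTS[pos][AMINO_ACIDS.index(aa)] for pos, aa in enumerate(peptide)
    )


for peptide in ["ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"]:
    print(peptide, round(score(peptide), 1))
```

Ranking by score puts ILYQVPFSV first and MTAQWWLDA second, matching the order implied by the slide's scores.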

Page 13: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Weight-matrix construction: example from real life
• 10 peptides from MHCpep database
• Bind the MHC complex
• Estimate sequence motif and weight matrix
• Evaluate on 528 peptides (not included in training)

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Page 14: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Pseudo-count and sequence weighting

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
} Similar sequences, weight 1/5
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

• Limited number of data
• Poor or biased sampling of sequence space
• I is not found at position P9. Does this mean that I is forbidden?
• No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9
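
One way to arrive at a weight of 1/5 for the similar peptides is to cluster the sequences by identity and give each sequence the weight 1 / (cluster size). The sketch below assumes single-linkage clustering and a 62% identity threshold. Both are illustrative choices, not taken from the slides, and `identity` and `cluster_weights` are hypothetical helper names.

```python
peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length peptides agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(seqs, threshold=0.62):
    """Single-linkage clustering; each sequence gets weight 1 / cluster size."""
    cluster_of = list(range(len(seqs)))  # start with one cluster per sequence
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                # Merge the cluster containing j into the cluster containing i.
                old, new = cluster_of[j], cluster_of[i]
                cluster_of = [new if c == old else c for c in cluster_of]
    sizes = {c: cluster_of.count(c) for c in set(cluster_of)}
    return [1 / sizes[c] for c in cluster_of]


weights = cluster_weights(peptides)
for peptide, weight in zip(peptides, weights):
    print(peptide, weight)
```

With these settings the five near-identical peptides share one cluster and receive weight 1/5 each, while the remaining five keep weight 1.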

Page 15: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Low count correction using Blosum matrices

Blosum62 substitution frequencies
#     I     L     V
L  0.12  0.38  0.10
V  0.16  0.13  0.27

• Every time for instance L/V is observed, I is also likely to occur
• Estimate low (pseudo) count correction using this approach
• As more data are included the pseudo count correction becomes less important

$$p = \frac{N_{\mathrm{eff}} \cdot p_I + \beta \cdot p_I^s}{N_{\mathrm{eff}} + \beta}$$

$$N_L = 2,\quad N_V = 2,\quad N_{\mathrm{eff}} = 12$$

$$p_I^s = \frac{2 \cdot 0.12 + 2 \cdot 0.16}{12} = 0.05$$

$$p = \frac{12 \cdot 0 + 10 \cdot 0.05}{12 + 10} = 0.02$$

Neff: Number of sequences
β: Weight on prior or pseudo count
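
The worked numbers above can be reproduced directly. The sketch uses the slide's values for the counts, Neff, β and the Blosum62 frequencies; the variable names are illustrative.

```python
# Blosum62 substitution frequencies toward I, as listed on the slide.
blosum_to_I = {"L": 0.12, "V": 0.16}

counts = {"L": 2, "V": 2}  # N_L and N_V from the slide
n_eff = 12                 # Neff, the number of sequences
beta = 10                  # weight on the prior (pseudo count)
observed_I = 0.0           # I is never observed at this position

# Pseudo frequency of I: observed L and V counts, each weighted by its
# substitution frequency toward I.
pseudo_I = sum(counts[aa] * blosum_to_I[aa] for aa in counts) / n_eff

# Blend of the observed frequency and the pseudo frequency.
p_I = (n_eff * observed_I + beta * pseudo_I) / (n_eff + beta)
print(round(pseudo_I, 2), round(p_I, 2))
```

As Neff grows relative to β, the observed frequency dominates the blend, which is the sense in which the correction becomes less important with more data.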

Page 16: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Example from real life (cont.)
• Raw sequence counting
– No sequence weighting
– No pseudo count
– Prediction accuracy 0.45
• Sequence weighting
– No pseudo count
– Prediction accuracy 0.5

Page 17: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Example from real life (cont.)
• Sequence weighting and pseudo count
– Prediction accuracy 0.60
• Sequence weighting, pseudo count and anchor weighting
– Prediction accuracy 0.72

Page 18: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Example from real life (cont.)
• Sequence weighting, pseudo count and anchor weighting
– Prediction accuracy 0.72
• Motif found on all data (485)
– Prediction accuracy 0.79

Page 19: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Training on small data sets

[Figure panels: Class I, Class II]

Using a biased weight matrix with differential weight on anchor positions gives reliable performance for N~20-50 (Lundegaard et al. 2004)

Page 20: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

How to predict
• The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
– Two adjacent amino acids may for example compete for the space in a pocket in the MHC molecule.
• Artificial neural networks (ANN) are ideally suited to take such correlations into account

Page 21: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Neural networks
• Neural networks can learn higher order correlations!
– What does this mean?

0 0 => 0
0 1 => 1
1 0 => 1
1 1 => 0

No linear function can learn this pattern

Page 22: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

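Learning higher order correlation

0 0 => 0; 1 0 => 1
1 1 => 0; 0 1 => 1

Network without hidden layer (inputs x1, x2; weights w1, w2; output o):
x1 • w1 + x2 • w2 = o
Has no solution!

Network with hidden layer (inputs x1, x2; weights w11, w12, w21, w22; hidden units h1, h2; weights v1, v2; output o):
x1 • w11 + x2 • w21 = h1
x1 • w12 + x2 • w22 = h2
h1 • v1 + h2 • v2 = o
Solution: v1 = −v2, w11 = w12, w21 = w22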
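
The weight constraints v1 = −v2, w11 = w12, w21 = w22 can be checked with a hand-built network. The sketch below assumes a step activation on the hidden units and two thresholds (0.5 and 1.5). Neither the activation nor the thresholds appear on the slide, and `step` and `xor_net` are illustrative names.

```python
def step(x: float) -> int:
    """Threshold activation (an assumption; the slide does not name one)."""
    return 1 if x > 0 else 0


# Weights chosen to satisfy the slide's constraints:
# w11 = w12, w21 = w22 and v1 = -v2.
w11 = w12 = 1.0
w21 = w22 = 1.0
v1, v2 = 1.0, -1.0

# Hypothetical thresholds: h1 fires when at least one input is on,
# h2 fires only when both inputs are on.
t1, t2 = 0.5, 1.5


def xor_net(x1: int, x2: int) -> int:
    h1 = step(x1 * w11 + x2 * w21 - t1)
    h2 = step(x1 * w12 + x2 * w22 - t2)
    return int(h1 * v1 + h2 * v2)


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "=>", xor_net(x1, x2))
```

Both hidden units see the same weighted sum of the inputs; only the differing thresholds and the opposite-sign output weights let the network separate "exactly one input on" from "both on".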

Page 23: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Mutual information

$$I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \cdot \log\left[\frac{P(aa_i, aa_j)}{P(aa_i) \cdot P(aa_j)}\right]$$

P(G1) = 2/9 = 0.22, ..
P(V6) = 4/9 = 0.44, ..
P(G1,V6) = 2/9 = 0.22
P(G1)*P(V6) = 8/81 = 0.10

log(0.22/0.10) > 0

Peptides (positions P1 and P6 highlighted on the slide):
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS
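
The mutual information between two positions can be computed from the nine peptides by counting single and paired residues. This is a minimal sketch: the natural logarithm is an assumption (the slide does not give the base), and `mutual_information` is an illustrative name.

```python
import math
from collections import Counter

peptides = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]


def mutual_information(seqs, i: int, j: int) -> float:
    """Sum of P(a,b) * log(P(a,b) / (P(a) * P(b))) over observed residue pairs."""
    n = len(seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    count_ij = Counter((s[i], s[j]) for s in seqs)
    total = 0.0
    for (a, b), n_ab in count_ij.items():
        joint = n_ab / n
        total += joint * math.log(joint / ((count_i[a] / n) * (count_j[b] / n)))
    return total


# Positions are 0-based in the code: P1 is index 0 and P6 is index 5.
mi_p1_p6 = mutual_information(peptides, 0, 5)
print(round(mi_p1_p6, 3))
```

Pairs that never occur contribute nothing to the sum, so only observed pairs are visited.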

Page 24: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Epitope predictions: mutual information

[Figure panels: 313 binding peptides, 313 random peptides]

Page 25: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Choice of method
• Neural networks are superior when trained on many data
• Simple and extended motif method when little or no data is available
• HMM/weight matrices with position specific differential weight otherwise
– Increase weight on anchor positions

Page 26: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Evaluation of prediction accuracy

True positive proportion = TP/(AP)
False positive proportion = FP/(AN)

[Figure panels: ROC curves (Aroc = 0.5 and Aroc = 0.8; labels TP, FP, AP, AN) and Pearson correlation]

Page 27: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Construction of ROC curves

True positive proportion = TP/(AP)
False positive proportion = FP/(AN)

[Figure: ROC curves, Aroc = 0.5 and Aroc = 0.8]

Number  Sequence   Assignment  Prediction
1       ILYQVPFSV  0.853       12.137
2       YLEPGPVTV  0.647       11.509
3       GLMTAVYLV  0.798       10.021
4       YLDLALMSV  0.842        9.632
5       GLYSSTVPV  0.697        9.335
6       HLYQGCQVV  0.539        9.265
7       RMYGVLPWI  0.689        8.948
8       FLPWHRLFL  0.564        8.926
9       LLPSLFLLL  0.554        8.890
10      ILSSLGLPV  0.638        8.491
11      FLLTRILTI  0.803        8.343
12      ILDEAYVMA  0.494        6.084
13      VVMGTLVAL  0.589        5.935
14      MALLRLPLV  0.634        4.761
15      MLQDMAILT  0.527        4.450
16      KILSVFFLA  0.851        3.578
17      ILTVILGVL  0.451        3.358
18      ALAKAAAAA  0.563        2.849
19      LVSLLTFMI  0.301        1.193
20      ALPYWNFAT  0.323        0.994

Assignment > 0.5: AP (16); assignment < 0.5: AN (4)

Top 3 predictions: TP=3, FP=0
Top 12 predictions: TP=11, FP=1
All 20 predictions: TP=16, FP=4
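
The ROC curve for the table can be built by sorting the peptides by prediction score and counting true and false positives while the threshold is lowered one peptide at a time. The sketch below uses the 20 (assignment, prediction) pairs from the table with the 0.5 assignment cutoff; `roc_points` and `area_under_curve` are illustrative names.

```python
# (assignment, prediction) pairs from the table on the slide, in table order.
data = [
    (0.853, 12.137), (0.647, 11.509), (0.798, 10.021), (0.842, 9.632),
    (0.697, 9.335), (0.539, 9.265), (0.689, 8.948), (0.564, 8.926),
    (0.554, 8.890), (0.638, 8.491), (0.803, 8.343), (0.494, 6.084),
    (0.589, 5.935), (0.634, 4.761), (0.527, 4.450), (0.851, 3.578),
    (0.451, 3.358), (0.563, 2.849), (0.301, 1.193), (0.323, 0.994),
]


def roc_points(pairs, cutoff=0.5):
    """Lower the prediction threshold step by step; return (FP, TP) counts."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    tp = fp = 0
    points = [(0, 0)]
    for assignment, _prediction in ranked:
        if assignment > cutoff:
            tp += 1  # actual positive ranked above the threshold
        else:
            fp += 1  # actual negative ranked above the threshold
        points.append((fp, tp))
    return points


def area_under_curve(points):
    """Trapezoidal area under TP/AP plotted against FP/AN."""
    an, ap = points[-1]  # the final point counts all negatives and positives
    area = 0.0
    for (fp0, tp0), (fp1, tp1) in zip(points, points[1:]):
        area += (fp1 - fp0) / an * (tp0 + tp1) / (2 * ap)
    return area


points = roc_points(data)
aroc = area_under_curve(points)
print(points[3], points[12], points[-1], aroc)
```

The entries `points[3]`, `points[12]` and `points[-1]` correspond to the three TP/FP checkpoints listed on the slide.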

Page 28: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Epitope predictions: sequence motif and HMMs

Sequence motif: cc: 0.76, Aroc: 0.92
HMM: cc: 0.80, Aroc: 0.95

Page 29: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Epitope prediction. Neural Networks 

cc: 0.91, Aroc: 0.98

Page 30: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Evaluation of prediction accuracy

[Bar chart: Pearson correlation (Pear) and Aroc for Motif, Hmm and ANN; vertical axis from 0 to 1]

Page 31: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Location of class I epitopes

GP1200 protein
Structure (1GM9)

Page 32: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Hepatitis C virus. Epitope predictions 

Page 33: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

MHC Class II binding
• TEPITOPE. Virtual matrices (Hammer, J., Current Opinion in Immunology 7, 263-269, 1995)
• PROPRED. Quantitative matrices (Singh H, Raghava GP, Bioinformatics 2001 Dec;17(12):1236-7)
– Web interface http://www.imtech.res.in/raghava/propred
• Gibbs sampler (Nielsen et al., Bioinformatics 2004. Improved prediction of MHC class I and II epitopes using a novel Gibbs sampler approach)

Page 34: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

MHC class II prediction
• Complexity of problem
– Peptides of different length
– Weak motif signal
• Alignment crucial
• Gibbs Monte Carlo sampler

RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTIE

Page 35: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Class II binding motif

Alignment by Gibbs sampler:
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTI

[Figure: Gibbs sampler motif]
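
The idea of aligning peptides of different length by Monte Carlo sampling can be illustrated with a strongly simplified sketch. It is not the published algorithm of Nielsen et al. 2004. It assumes a 9-residue core, a uniform background, add-one pseudo counts, a linear cooling schedule and a Metropolis acceptance rule, and all function names and parameter values are hypothetical. The peptide list uses the entries as they are separated on the slide.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CORE = 9  # assumed length of the binding core

peptides = [
    "RFFGGDRGAPKRG", "YLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA",
    "GSLFVYNITTNKYKAFLDKQ", "SALLSSDITASVNCAK", "PKYVHQNTLKLAT",
    "GFKGEQGPKGEP", "DVFKELKVHHANENI", "SRYWAIRTRSGGI",
    "TYSTNEIDLQLSQEDGQTI",
]


def information(cores):
    """Summed relative entropy of the core columns against a uniform background."""
    total = 0.0
    for pos in range(CORE):
        column = [core[pos] for core in cores]
        for aa in AMINO_ACIDS:
            # The add-one pseudo count keeps unseen residues away from log(0).
            p = (column.count(aa) + 1) / (len(column) + len(AMINO_ACIDS))
            total += p * math.log(p * len(AMINO_ACIDS))
    return total


def gibbs_align(seqs, steps=2000, t_start=0.5, t_end=0.01, seed=0):
    """Metropolis sampling over core offsets with a linearly cooled temperature."""
    rng = random.Random(seed)
    offsets = [rng.randrange(len(s) - CORE + 1) for s in seqs]
    cores = [s[o:o + CORE] for s, o in zip(seqs, offsets)]
    energy = information(cores)
    for step in range(steps):
        temperature = t_start + (t_end - t_start) * step / steps
        k = rng.randrange(len(seqs))                      # peptide to move
        new_offset = rng.randrange(len(seqs[k]) - CORE + 1)
        trial = cores[:k] + [seqs[k][new_offset:new_offset + CORE]] + cores[k + 1:]
        trial_energy = information(trial)
        delta = trial_energy - energy
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            offsets[k], cores, energy = new_offset, trial, trial_energy
    return offsets, cores, energy


offsets, cores, energy = gibbs_align(peptides)
for offset, core in zip(offsets, cores):
    print(offset, core)
```

With only nine peptides the sampled cores should be read as an illustration of the search procedure, not as a reliable binding motif.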

Page 36: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

MHC class II predictions, allele DRB1_0401

[Bar chart: accuracy (axis from 0 to 0.9) of Tepitope and Gibbs on the data sets MHCbench1, MHCbench2, MHCbench3, MHCbench4, MHCbench5, MHCbench6, MHCbench7, MHCbench8, Southwood and Geluk]

Page 37: T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)

Summary
• Binding motif of class I MHC is well characterized by HMM/weight matrices
– This holds even when limited data is available
• Neural networks can be trained to predict MHC binding with high accuracy
– NN can include higher order sequence correlations
• MHC Class II peptide binding motif can be described using a Gibbs sampler algorithm

