Machine learning in immunology
A brief overview
Vadim Nazarov
Genomics of Adaptive Immunity Lab, IBCH RAS
National Research University Higher School of Economics
Table of contents
1. Introduction to immunology
2. Introduction to deep learning
3. MHC:peptide binding affinity prediction
4. TCR-peptide binding prediction
5. TCR CD4/CD8 classification
6. TCR repertoire comparison using high-dimensional features
7. Conclusion
Immune system
• Recognizes foreign and dangerous substances from the environment (mainly microbes).
• Is involved in the elimination of old and damaged cells of the body.
• Attacks tumor and virus-infected cells.
Two branches of immune system
• Innate, nonspecific – very quickly recognizes most foreign substances and eliminates them. No memory or learning.
• Adaptive, specific – high degree of specificity in the distinction between self and non-self. The reaction takes several days to be effectively triggered. It learns and memorizes the pathogen landscape.
TCR chains
αβ chain – "classic" adaptive immunity (virus detection)
γδ chain – terra incognita (phagocytosis, invariant cells)
Different generation processes!
Deep network architecture ideas
Fully connected / dense networks (DNN)
Convolutional neural networks (CNN)
Recurrent neural networks (RNN)
NetMHCpan
Paper: just google "netMHCpan paper"
Features:
• One-hot encoding
• BLOSUM encoding
• Lengths
• Indels
Pseudo-sequences – pan-allele approach
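The one-hot peptide encoding listed above can be sketched in a few lines; this is a minimal illustration (the helper name and the example peptide are my own, not NetMHCpan's code):

```python
# Minimal sketch of one-hot encoding a peptide over the 20 standard
# amino acids. Each position becomes a 20-dimensional binary vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(peptide):
    """Encode a peptide as a list of 20-dim one-hot vectors."""
    aa_index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for aa in peptide:
        vec = [0] * len(AMINO_ACIDS)
        vec[aa_index[aa]] = 1
        encoded.append(vec)
    return encoded

enc = one_hot("SIINFEKL")  # 8 positions -> 8 one-hot vectors
```

A BLOSUM encoding replaces each one-hot row with the corresponding row of a BLOSUM substitution matrix, which lets the network share information between similar amino acids.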
Model: DNN with 60 hidden neurons
F1 score - 0.8
F1 = 2 · precision · recall / (precision + recall)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
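As a worked example of these metrics, computed from raw confusion counts (the numbers are illustrative, not the paper's):

```python
# Precision, recall, and F1 from true/false positive and false
# negative counts.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 80 true positives, 20 false positives, 20 false negatives:
# precision = recall = 0.8, so F1 = 0.8
score = f1_score(80, 20, 20)
```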
Imputation
MICE: average multiple imputations generated using Gibbs sampling from the joint distribution of columns.
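The chained-equations idea behind MICE can be sketched as follows. This is a simplified toy version: it fills missing values with deterministic regression predictions and iterates, instead of drawing from posterior distributions as full Gibbs-sampled MICE does:

```python
import numpy as np

# Toy sketch of chained-equations imputation: start from column
# means, then repeatedly regress each incomplete column on all the
# others and refill its missing entries with the predictions.
def mice_impute(X, n_iter=10):
    X = X.astype(float).copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            obs = ~missing[:, j]
            # least-squares fit of column j on the other columns
            A = np.column_stack([others[obs], np.ones(obs.sum())])
            coef, *_ = np.linalg.lstsq(A, X[obs, j], rcond=None)
            A_mis = np.column_stack([others[~obs], np.ones((~obs).sum())])
            X[~obs, j] = A_mis @ coef
    return X
```

In practice one would average several such imputations with random draws; scikit-learn's `IterativeImputer` implements a closely related scheme.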
mhcflurry
Paper: http://biorxiv.org/content/biorxiv/early/2016/05/22/054775.full.pdf
Features:
• Embeddings (per-pseudo-sequence!)
Model: DNN with 60 neurons
F1 score - 0.79
Our approach - results
• F1 0.81 (on a subset of the dataset)
• Global models – prediction of binding affinities for unseen MHCs (mean F1 0.72)
• Better models for the per-pseudo-sequence approach.
Problem
Paper: http://biorxiv.org/content/early/2017/03/20/118539.full.pdf+html
Immunogenicity prediction.
Methods and results
Features:
• One-hot encoding of V/J
• The average CDR3 basicity, hydrophobicity, helicity, isoelectric point
• The absolute count of each individual amino acid in the CDR3 sequence
• The total mass of the amino acids in the CDR3 sequence
• Numerical features encoding individual amino acid basicity, hydrophobicity, helicity, isoelectric point, and mutation stability were also created for each position
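The count-based feature from this list can be sketched directly; the helper name and example CDR3 sequence are illustrative, not the paper's code:

```python
# Absolute count of each amino acid in a CDR3 sequence, giving a
# fixed-length 20-dimensional feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_counts(cdr3):
    """Count each of the 20 standard amino acids in a CDR3."""
    return [cdr3.count(aa) for aa in AMINO_ACIDS]

counts = aa_counts("CASSLGQAYEQYF")
```

The averaged physicochemical features (basicity, hydrophobicity, etc.) are built the same way, replacing each residue with a per-amino-acid property value and taking the mean.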
Accuracy: 75.90%
Analysis of feature importances
Problem
Paper: http://www.jleukbio.org/content/99/3/505.short
In-silico detection of CD4 / CD8 TCRs. Exploratory pre-analysis.
Methods and results
• CDR3 to k-mers, k-mers to Atchley factors
• Support Vector Machines classifier (different lengths? accuracy?)
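The first step, splitting a CDR3 into overlapping k-mers, can be sketched as below (the function name and example sequence are my own; mapping each k-mer to Atchley factors then just replaces every residue with its five published factor values):

```python
# All overlapping k-mers of a sequence, the unit that is then
# encoded with Atchley factors and fed to the SVM.
def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

trimers = kmers("CASSLGQ", 3)  # ['CAS', 'ASS', 'SSL', 'SLG', 'LGQ']
```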
Problem
Paper: http://biorxiv.org/content/early/2017/04/20/128025
Comparison of TCR repertoires and in-silico detection of the subrepertoires that contribute most to the inter-sample differences.
Data
• 8 repertoires
• Repertoire – a table with CDR nucleotide/amino acid sequence, V gene, J gene, and abundance columns.
t-SNE
1. Construct a probability distribution over the dataset in such a way that similar objects have a high probability of being "picked".
2. Define a similar probability distribution over the points in the low-dimensional map (2-dimensional), and minimize the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map.
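Step 1 can be sketched on a toy dataset. This simplification uses a single fixed Gaussian bandwidth for all points, whereas real t-SNE searches for a per-point bandwidth matching a target perplexity:

```python
import numpy as np

# Gaussian affinities: similar points get a high probability of
# being "picked" as a pair; the result is a joint distribution
# over all ordered pairs.
def affinities(X, sigma=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)   # a point never picks itself
    return P / P.sum()         # normalize to a distribution

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = affinities(X)  # P[0,1] is large, P[0,2] is tiny
```

Step 2 then builds an analogous distribution Q from the low-dimensional points (with a Student-t kernel) and moves those points to minimize KL(P || Q).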
Methods and results
• Smith-Waterman on all pairs of sequences
• Transformation of the pairwise similarity matrix into a dissimilarity matrix using:

S(i,j) = 1 − 2 · D(i,j) / (D(i,i) + D(j,j))

• Apply t-SNE
• Extract subrepertoires and motifs
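The similarity-to-dissimilarity transform can be sketched as a small matrix operation; the toy score matrix below is illustrative (in practice D holds Smith-Waterman alignment scores, with self-alignments on the diagonal):

```python
import numpy as np

# S(i,j) = 1 - 2 * D(i,j) / (D(i,i) + D(j,j)): self-alignments
# map to 0, and less similar pairs map to larger values.
def to_dissimilarity(D):
    D = np.asarray(D, dtype=float)
    diag = np.diag(D)
    return 1.0 - 2.0 * D / (diag[:, None] + diag[None, :])

D = np.array([[10.0, 6.0],
              [6.0, 8.0]])
S = to_dissimilarity(D)  # S[0,0] = 0, S[0,1] = 1 - 12/18 = 1/3
```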
Vadim I. Nazarov
Genomics of Adaptive Immunity Lab, IBCH RAS
National Research University Higher School of Economics
email: [email protected]
telegram: @vadimnazarov