Date post: | 17-Jul-2015 |
Learning how antibodies are drafted and revised
Frederick “Erick” Matsen
Fred Hutchinson Cancer Research Center
@ematsen
http://matsen.fredhutch.org/
with Trevor Bedford (FH), Connor McCoy, Vladimir Minin (UW), and Duncan Ralph (FH)
RV144 HIV trial: 2003-2009
26,676 volunteers enrolled
16,395 volunteers randomized
125 infections
$105,000,000 and 6 years
Prospective studies are expensive, slow, and entail complex moral issues.
This does not lend itself to rapid vaccine development.
How might we guide vaccine development without disease exposure?
Vaccines manipulate the adaptive immune system
What can we learn from antibody-making B cells without battle-testing them through disease exposure?
B cell diversification process
[Figure: V genes, D genes, and J genes undergo VDJ rearrangement (including erosion and nontemplated insertion) to give a naive B cell; with antigen, somatic hypermutation and affinity maturation give an experienced B cell.]
Big aim: reconstruct from memory reads
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
[Figure labels: reality, inference]
Why reconstruct B cell lineages?
1. Vaccine design ("This one is really good. How can we elicit it?")
2. Vaccine assay
3. Evolutionary analysis to learn about underlying mechanisms
Goal 1: find rearrangement groups
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
[Figure labels: reality, rearrangement groups]
VDJ annotation problem: from where did each nucleotide come?
[Figure: a read annotated by origin. Biological process: 3' V deletion, VD insertion, 5' D deletion, 3' D deletion, DJ insertion, 5' J deletion, somatic hypermutation. Sequencing: sequencing primer, sequencing error. Inference works back from the read to these events.]
This is a key first step in BCR sequence analysis.
Data: Illumina reads from CDR3 locus
[Figure: a read annotated by origin, covering the biological process (V, D, and J deletions, VD and DJ insertions, somatic hypermutation) and sequencing (primer, sequencing error).]
Total of about 15 million unique 130 nt sequences from memory B cell populations of three healthy individuals A, B, and C.
Detour: write HMM inference package
We wanted to use HMMoC by G Lunter (Bioinf 2007)…
then tried extending StochHMM by Lott & Korf (Bioinf 2014)…
but it ended up being a complete rewrite by Duncan to make ham:
https://github.com/psathyrella/ham
Takes HMM description in concise & intuitive YAML format (for CpG example, 440 chars for ham vs 5,961 for HMMoC XML)
Slightly faster and more memory efficient than HMMoC
Continuous integration via Docker
Then write BCR annotation package:
https://github.com/psathyrella/partis
Distributions are reproducibly weird!
[Figure: deletion length distributions (x-axis: bases, 0 to 10; y-axis: frequency) for individuals A, B, and C. Panels: IGHV2-70*12 V 3' deletion; IGHD1-14*01 D 5' deletion; IGHD7-27*01 D 3' deletion; IGHJ4*02 J 5' deletion.]
Distributions are reproducibly weird!
[Figure: per-position mutation frequency (x-axis: position, 200 to 250; y-axis: mutation freq) for individuals A, B, and C. Panels: IGHV3-23D*01; IGHV3-33*06.]
Only insertions look simple
[Figure: insertion length distributions (x-axis: bases; y-axis: frequency) for individuals A, B, and C. Panels: VD insertion; DJ insertion.]
Simulate sequences to benchmark
[Figure: a read annotated by origin, covering the biological process (V, D, and J deletions, VD and DJ insertions, somatic hypermutation), sequencing (primer, sequencing error), and inference.]
Simulation code independent from inference code.
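To make the simulated process concrete, here is a minimal sketch of a toy rearrangement simulator. It is not the simulation code used for the benchmark: the germline strings, the function names `rearrange` and `hypermutate`, and the uniform choice of inserted and mutated bases are all assumptions made for illustration.

```python
import random

BASES = "ACGT"

def rearrange(v, d, j, v3_del, d5_del, d3_del, j5_del, vd_len, dj_len, rng):
    """Join V, D, and J after trimming their ends and adding random insertions."""
    v_part = v[: len(v) - v3_del]          # 3' V deletion
    d_part = d[d5_del : len(d) - d3_del]   # 5' D and 3' D deletions
    j_part = j[j5_del:]                    # 5' J deletion
    vd_ins = "".join(rng.choice(BASES) for _ in range(vd_len))  # VD insertion
    dj_ins = "".join(rng.choice(BASES) for _ in range(dj_len))  # DJ insertion
    return v_part + vd_ins + d_part + dj_ins + j_part

def hypermutate(seq, rate, rng):
    """Swap each base for a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Hypothetical toy germline segments, far shorter than real genes.
V, D, J = "ACGTACGTAC", "GGGTTT", "CCATGGAA"

rng = random.Random(0)
naive = rearrange(V, D, J, v3_del=2, d5_del=1, d3_del=1, j5_del=3,
                  vd_len=3, dj_len=2, rng=rng)
read = hypermutate(naive, rate=0.05, rng=rng)
```

In a real benchmark the deletion lengths, insertion lengths, and mutation positions would be drawn from the per-gene empirical distributions shown earlier rather than fixed by hand.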
Incorporating this complexity is good
[Figure: distribution of Hamming distance (x-axis: 0 to 15; y-axis: frequency), panel title HTTN, comparing partis (k=5), partis (k=1), ighutil, iHMMunealign, igblast, and imgt.]
but there are still a number of errors.
Remember goal: find rearrangement groups
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
[Figure labels: reality, rearrangement groups]
Say we are given two sequences
[Figure: an HMM drawn as a die-rolling game with transition probabilities p and 1-p. Emitting two sequences from one path is a double roll of a single die per turn (double roll ↔ pair HMM), versus two independent die-rolling games.]
Do two sequences come from a single rearrangement event?
The forward algorithm for HMMs gives the probability of generating an observed sequence $x$ from a given HMM:
$$P(x) = \sum_{\text{paths } \sigma} P(x; \sigma),$$
and the pair HMM version gives the probability of generating two sequences $x$ and $y$ from the same path through the HMM (summed across paths):
$$P(x, y) = \sum_{\text{paths } \sigma} P(x, y; \sigma).$$
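The sum over paths can be sketched for a single sequence as below. This is a generic textbook forward recursion, not ham's implementation; the two-state model (`germline`, `insert`) and all of its probabilities are made up for illustration, and there is no explicit end state. A brute-force sum over every path is included to show that the recursion computes the same quantity.

```python
from itertools import product

def forward_probability(obs, states, start, trans, emit):
    """Return P(obs) by dynamic programming over hidden states (obs non-empty)."""
    # alpha[s] = P(symbols so far, current hidden state is s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())

def brute_force_probability(obs, states, start, trans, emit):
    """Same quantity, computed by explicitly enumerating every path sigma."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
        total += p
    return total

# Hypothetical two-state model; every number here is an illustrative assumption.
STATES = ("germline", "insert")
START = {"germline": 0.9, "insert": 0.1}
TRANS = {"germline": {"germline": 0.8, "insert": 0.2},
         "insert": {"germline": 0.3, "insert": 0.7}}
EMIT = {"germline": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "insert": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

p_x = forward_probability("ACGA", STATES, START, TRANS, EMIT)
```

The recursion costs time proportional to sequence length times the square of the number of states, while the brute-force sum grows exponentially with length. The pair version would emit one symbol for each of the two sequences at every step.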
Do sets of sequences come from a single rearrangement event?
$$\frac{P(A \cup B \mid \text{single rearrangement})}{P(A, B \mid \text{independent rearrangements})} = \frac{P(A \cup B)}{P(A)\,P(B)}$$
Use this for agglomerative clustering; stop when the ratio < 1.
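The merge-until-the-ratio-drops-below-one rule can be sketched as a greedy loop. This is only an illustration of the stopping criterion, not the partis clustering code: `agglomerate` is a hypothetical name, and `toy_log_lik` is a made-up stand-in for the log of the HMM-derived probability that a set of sequences shares one rearrangement (it works on numbers, not sequences).

```python
from itertools import combinations

def agglomerate(items, log_lik):
    """Greedily merge clusters while the best log likelihood ratio is >= 0."""
    clusters = [(item,) for item in items]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged = clusters[i] + clusters[j]
            # log of P(A ∪ B) / (P(A) P(B))
            log_ratio = log_lik(merged) - log_lik(clusters[i]) - log_lik(clusters[j])
            if best is None or log_ratio > best[0]:
                best = (log_ratio, i, j)
        log_ratio, i, j = best
        if log_ratio < 0:  # ratio < 1: stop merging
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

def toy_log_lik(cluster):
    """Hypothetical stand-in for log P(cluster | single rearrangement)."""
    mean = sum(cluster) / len(cluster)
    return -1.0 - sum((x - mean) ** 2 for x in cluster)

groups = agglomerate([0.0, 0.1, 5.0, 5.2], toy_log_lik)
```

With the toy likelihood, the two tight pairs merge and the final merge is rejected because its ratio falls below one. Each pass evaluates every pair of clusters, so a real implementation over millions of reads needs additional shortcuts.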
First, investigate BCR mutation patterns
[Figure: with antigen, a naive B cell becomes an experienced B cell through clonal expansion, somatic hypermutation, and affinity maturation.]
Use two-taxon “trees” for model fitting
Note: we know the ancestral state within V, D, and J.
[Figure: germline and read aligned over the V, D, and J segments; the regions between segments are marked IGNORE.]
Our “trees” have an observed read on the bottom and the corresponding “ancestral” germline sequence on top, connected by a branch representing some amount of divergence.
Model fit: General Time Reversible
[Figure: fitted substitution matrices for individuals A, B, and C, one each for IGHV, IGHD, and IGHJ. Rows: germline base (T, C, G, A); columns: read base (A, G, C, T). Off-diagonal entries range from 0.06 to 1.15.]
Best model according to AIC/BIC… has different matrices and fixed rate multipliers for the different segments.
[Figure: mutation model. Sequence $i$ (Seq. 1, 2, 3) has branch length $t_i$ on V, $r_D t_i$ on D, and $r_J t_i$ on J.]
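A minimal sketch of fitting one branch length per read with fixed per-segment rate multipliers follows. To stay short it swaps in the Jukes-Cantor (JC69) substitution model, which has a closed-form transition probability, in place of the segment-specific GTR matrices described above; the function names, the toy sequences, and the golden-section search (which assumes a single peak in the likelihood) are assumptions for illustration. The multipliers 3.36 and 0.62 are among the values shown on the slides.

```python
import math

def jc69_log_prob(germline, read, t):
    """Log P(read | germline) under JC69 with branch length t (subs/site)."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay  # probability of one specific other base
    return sum(math.log(p_same if g == r else p_diff)
               for g, r in zip(germline, read))

def log_lik(segments, t, multipliers):
    """Sum segment log-likelihoods, scaling t by each segment's multiplier."""
    return sum(jc69_log_prob(germ, read, multipliers[name] * t)
               for name, (germ, read) in segments.items())

def ml_branch_length(segments, multipliers, lo=1e-9, hi=2.0, iters=100):
    """Golden-section search for the t that maximizes log_lik on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if log_lik(segments, c, multipliers) > log_lik(segments, d, multipliers):
            b = d
        else:
            a = c
    return (a + b) / 2.0

# Hypothetical (germline, read) pairs for each segment of one read.
segments = {
    "V": ("ACGTACGTACGTACGTACGT", "ACGTACGTACGAACGTACGT"),
    "D": ("GGGTTTGGG", "GGGATTGGC"),
    "J": ("CCATGGAACC", "CCATGGAACC"),
}
multipliers = {"V": 1.0, "D": 3.36, "J": 0.62}
t_hat = ml_branch_length(segments, multipliers)
```

Because all three segments share one `t`, the mutated D segment and the unmutated J segment both inform the same estimate, each weighted by its multiplier.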
Branch length distribution under this best model
[Figure: histograms of ML branch length (x-axis: 0.0 to 0.4; y-axis: count) for individuals A, B, and C.]
Rate multipliers annotated on the three panels:
IGHD rate: 3.36, IGHJ rate: 0.62
IGHD rate: 4.44, IGHJ rate: 0.62
IGHD rate: 3.88, IGHJ rate: 0.63
D segments evolve substantially faster than V.
J segments evolve more slowly than V.
Individual A has a higher mutational load.
Next consider selection (Goal 2 con’t)
[Figure: with antigen, a naive B cell becomes an experienced B cell through clonal expansion, somatic hypermutation, and affinity maturation.]
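Synonymous: CCA and CCT (Pro, Pro)
Nonsynonymous: ACC and ATC (Thr, Ile)
For selection: synonymous and nonsynonymous changes are distinguished.
In antibodies: some changes are more likely and others less likely (examples on the slide: AAC, AAG; GTG, GTC).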
Would like per-site selection inference
$$\omega \equiv \frac{dN}{dS} \equiv \frac{\text{rate of nonsynonymous substitution}}{\text{rate of synonymous substitution}}$$
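The synonymous versus nonsynonymous distinction that feeds dN and dS can be sketched with the standard genetic code. The function name `classify` is an assumption for illustration; the codon pairs are the ones that appear on the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify(codon_from, codon_to):
    """Label a codon change by whether the encoded amino acid changes."""
    if codon_from == codon_to:
        return "identical"
    if CODON_TABLE[codon_from] == CODON_TABLE[codon_to]:
        return "synonymous"
    return "nonsynonymous"

examples = {
    ("CCA", "CCT"): classify("CCA", "CCT"),  # Pro, Pro
    ("ACC", "ATC"): classify("ACC", "ATC"),  # Thr, Ile
    ("AAC", "AAG"): classify("AAC", "AAG"),  # Asn, Lys
    ("GTG", "GTC"): classify("GTG", "GTC"),  # Val, Val
}
```

Counting these labels alone does not give ω, since the number of possible synonymous and nonsynonymous changes differs by site and the mutation process is not uniform.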
[Figure: per-position mutation frequency for IGHV3-23D*01 (x-axis: position, 200 to 250; y-axis: mutation freq) for individuals A, B, and C.]
Productive vs. out-of-frame receptors
Each cell may carry two IGH alleles, but only one is expressed.
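[Figure: two V D J rearrangements; in one of them an insertion disrupts the reading frame.]
$$\omega \equiv \frac{dN}{dS} \equiv \frac{\text{rate of nonsynonymous substitution}}{\text{rate of synonymous substitution}}$$
[Figure: per-site λ for out-of-frame reads (log scale, 0.1 to 1.0) against site (IMGT numbering, 70 to 100) for individuals A, B, and C.]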
Out-of-frame reads can be used to infer neutral mutation rate!
$\omega_l$ is a ratio of rates in terms of the observed neutral process:
$\lambda_l^{(N-I)}$: nonsynonymous in-frame rate for site $l$
$\lambda_l^{(N-O)}$: nonsynonymous out-of-frame rate for site $l$
$\lambda_l^{(S-I)}$: synonymous in-frame rate for site $l$
$\lambda_l^{(S-O)}$: synonymous out-of-frame rate for site $l$
$$\omega_l = \frac{\lambda_l^{(N-I)} / \lambda_l^{(N-O)}}{\lambda_l^{(S-I)} / \lambda_l^{(S-O)}}$$
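As a worked instance of the ratio, the sketch below evaluates $\omega_l$ from four per-site rates. The function name `omega` and the rate values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def omega(n_in, n_out, s_in, s_out):
    """Selection ratio for one site from the four rates defined above.

    n_in / n_out: nonsynonymous in-frame / out-of-frame rates
    s_in / s_out: synonymous in-frame / out-of-frame rates
    """
    return (n_in / n_out) / (s_in / s_out)

# Hypothetical site: nonsynonymous changes are seen half as often in-frame as
# out-of-frame, while synonymous changes are seen equally often in both.
w = omega(n_in=0.2, n_out=0.4, s_in=0.3, s_out=0.3)
```

Dividing each in-frame rate by its out-of-frame counterpart removes the site-specific mutation rate, so what remains reflects selection rather than mutability.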
Renaissance count (Lemey, Minin… 2012)
[Figure: five aligned sequences (seq-1 to seq-5) on a tree, with codon translations, and one sampled mutation history at the codon that varies between ACC (Thr) and ATC (Ile).]
Use sampled mutation histories to estimate rates...
but such estimates can be unstable.
Empirical Bayes regularization to stabilize estimates
Say we are doing a per-county smoking survey.
zero smokers? Really?
Use all of the data to fit a prior distribution of smoking prevalence, then combine it with each county's observations to obtain a per-county posterior.
Estimating selection coefficient $\omega_l$
$\lambda_l^{(N-I)}$: nonsynonymous in-frame rate for site $l$
$\lambda_l^{(N-O)}$: nonsynonymous out-of-frame rate for site $l$
$\lambda_l^{(S-I)}$: synonymous in-frame rate for site $l$
$\lambda_l^{(S-O)}$: synonymous out-of-frame rate for site $l$
$$\omega_l = \frac{\lambda_l^{(N-I)} / \lambda_l^{(N-O)}}{\lambda_l^{(S-I)} / \lambda_l^{(S-O)}}$$
Overall IGHV selection map
[Figure, individual A: median ω per site (log scale, 0.1 to 10.0) and per-site counts of purifying, neutral, and diversifying classifications, against site (IMGT numbering, 75 to 105).]
Distribution of classifications across IGHV genes
Distribution of median estimates of ω
Similar across individuals
[Figure: per-site counts of purifying, neutral, and diversifying classifications against site (IMGT numbering, 75 to 105), one panel each for individuals A, B, and C.]
Conclusion
B cell receptors are “drafted” and “revised” randomly, but
… with remarkably consistent deletion and insertion patterns
… with remarkably consistent substitution and selection
We can learn about these processes using model-based inference.
Paper on annotation with partis is up on arXiv
Selection analysis paper
Thank you
Trevor Bedford, Connor McCoy, Vladimir Minin & Duncan Ralph
Phil Bradley for doing structural work
Molecular work done by Paul Lindau in Phil Greenberg’s lab with support from Harlan Robins and Adaptive Biotechnologies
Adaptive Biotechnologies computational biology team
National Science Foundation and National Institutes of Health
University of Washington Center for AIDS Research (CFAR)
University of Washington eScience Institute
W. M. Keck Foundation
Measuring clustering agreement
[Figure: two clusterings $C_x$ and $C_y$ of the same items shown as colorings, once with good agreement and once with bad agreement.]
Intuition: how much variability is there in the color for $C_x$ amongst the items of a given color under $C_y$?
Mutual information $I$
Think of cluster identity under $C_x$ for a uniformly selected point as a random variable $X$ (similarly for $C_y$ and $Y$):
$$I(X; Y) = H(X) - H(X \mid Y)$$
where $H(X)$ is the entropy of $X$ (ignoring $Y$), and $H(X \mid Y)$ is the entropy of $X$ given the value for $Y$.
$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\, p(y)}\right)$$
$$\mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - E\{\mathrm{MI}(U, V)\}}{\max\{H(U), H(V)\} - E\{\mathrm{MI}(U, V)\}}$$
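The mutual information between two clusterings can be computed directly from label counts, as in the sketch below. The function names and the toy label vectors are assumptions for illustration. The adjustment term $E\{\mathrm{MI}(U, V)\}$ is not implemented here.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) in nats for the cluster label of a uniformly selected item."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(labels_x, labels_y):
    """I(X; Y) in nats from the joint distribution of two label vectors."""
    n = len(labels_x)
    joint = Counter(zip(labels_x, labels_y))
    px = Counter(labels_x)
    py = Counter(labels_y)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Hypothetical clusterings of four items.
cx = [0, 0, 1, 1]
same_up_to_names = ["a", "a", "b", "b"]   # good agreement
unrelated = [0, 1, 0, 1]                  # bad agreement

mi_good = mutual_information(cx, same_up_to_names)
mi_bad = mutual_information(cx, unrelated)
```

Identical partitions give $I(X; Y) = H(X)$ regardless of how the clusters are named, and unrelated partitions give zero. For the adjusted version, scikit-learn provides `sklearn.metrics.adjusted_mutual_info_score`.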
Estimates of the mutational process are quite consistent between individuals
(Each point is a single entry for one of the matrices for a pair of individuals.)
Branch length differences between productive and unproductive rearrangements
Unproductive rearrangements are more likely to be either unchanged from germline or more divergent.
Sites are generally under purifying selection
[Figure: histograms of median log10(ω) (x-axis: -1 to 1; y-axis: count, 0 to 800) for individuals A, B, and C, colored by classification: purifying, diversifying, neutral.]
Distribution of amino acids at the beginning of CDR3
Selection for aromatic amino acids?
Frequency: left of line = out-of-frame, right of line = in-frame
Stabilize with empirical Bayes regularization
Assume that $\lambda_l$, the substitution rate at site $l$, comes from a Gamma distribution with shape $\alpha$ and rate $\beta$:
$$\lambda_l \sim \mathrm{Gamma}(\alpha, \beta).$$
Model total substitution counts (sampled via stochastic mapping) for a site as Poisson with rate $\lambda_l$:
$$C_l \sim \mathrm{Poisson}(\lambda_l),$$
Fit $\hat{\alpha}$ and $\hat{\beta}$ to all data, then draw rates $\lambda_l$ from the posterior:
$$\lambda_l \mid C_l \sim \mathrm{Gamma}(C_l + \hat{\alpha},\ 1 + \hat{\beta}).$$
We extended this regularization to the case of non-constant coverage.
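The Gamma-Poisson update can be sketched as follows for the constant-coverage case. The slides do not say how $\hat{\alpha}$ and $\hat{\beta}$ are fitted, so this sketch assumes a simple method-of-moments fit; the function names and the count vector are hypothetical.

```python
import random
import statistics

def fit_gamma_prior(counts):
    """Method-of-moments fit of Gamma(shape=alpha, rate=beta) to all counts.

    Under the model the marginal mean is alpha/beta and the marginal
    variance is alpha/beta + alpha/beta**2.
    """
    mean = statistics.mean(counts)
    var = statistics.variance(counts)
    if var <= mean:
        raise ValueError("counts are not overdispersed; moment fit undefined")
    beta = mean / (var - mean)
    alpha = mean * beta
    return alpha, beta

def posterior(count, alpha, beta):
    """Posterior Gamma (shape, rate) for one site: (C_l + alpha, 1 + beta)."""
    return count + alpha, 1.0 + beta

def posterior_mean(count, alpha, beta):
    shape, rate = posterior(count, alpha, beta)
    return shape / rate

def draw_rate(count, alpha, beta, rng):
    """Draw lambda_l from the posterior; gammavariate takes shape and scale."""
    shape, rate = posterior(count, alpha, beta)
    return rng.gammavariate(shape, 1.0 / rate)

# Hypothetical per-site substitution counts.
counts = [0, 1, 2, 3, 10, 0, 4, 2]
alpha_hat, beta_hat = fit_gamma_prior(counts)
shrunk = [posterior_mean(c, alpha_hat, beta_hat) for c in counts]
sample = draw_rate(counts[0], alpha_hat, beta_hat, random.Random(0))
```

Sites with zero observed substitutions receive a small positive rate and the site with ten is pulled toward the overall mean, which is the same effect as in the per-county smoking example.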
Sequence counts
status         A            B            C
functional     4,139,983    4,861,800    3,748,306
out-of-frame   533,919      794,845      558,246
stop           104,525      169,423      112,901
Simulation results for selection inference
[Figure: simulation results by site (0 to 100). Panel 1: ω (log scale, 0.1 to 10.0), with points marked by type (purifying, neutral, diversifying) and by whether a synonymous change is possible (yes, no). Panel 2: proportion (0 to 1) by type (N, S). Panel 3: coverage (0 to 1000).]
Random facts
Mean length of D segment in individual A’s naive repertoire is 16.61.
Subject A’s naive sequences were 37% CDR3.
Divergence between the various germ-line V genes:
> summary(dist.dna(allele_01, pairwise.deletion=TRUE, model='raw'))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.003846 0.201300 0.344600 0.304700 0.384900 0.539500