The Application of NGS to HLA Typing
Challenges in Data Interpretation
Marcelo A. Fernández Viña, Ph.D.
Department of Pathology
Medical School
Stanford University
The HLA system
• High degree of polymorphism at most of the
expressed loci (function)
• Lack of a single predominant allele, high degree of
heterozygosity (function)
• Strong linkage disequilibrium (unknown, function?)
GENOMIC ORGANIZATION OF THE HLA GENES
HLA-A
HLA-B
HLA-C
HLA-DQA1
HLA-DQB1
HLA-DRB1
HLA-DPA1
HLA-DPB1
DRB1*08 Alleles
[Figure: DRB1*08 alleles; labels: Intron 1, Intron 2]
HLA coverage over WGS
[Figure: average and minimum coverage at HLA-A, B, C, DPA1, DPB1, DQA1, DQB1 and DRB1 in Samples 1 to 4; y-axis: coverage, 0 to 180]
Why not whole-genome sequencing?
• Inadequate coverage of complex genomic regions,
such as HLA. Conventional WGS (30x avg.
coverage) provides only sparse coverage of HLA.
Complexities due to:
– Indels
– GC-rich regions, secondary structure
– Paralogous genes
– Repeat regions across HLA loci
• Cost. Using WGS, to achieve adequate coverage of HLA would require >1,000X avg. coverage
J.Immunol. 1992 Jun 15;148(12):4043-53.
HLA-J, a second inactivated class I HLA gene related to HLA-G and HLA-A.
Implications for the evolution of the HLA-A-related genes.
Messer G, Zemmour J, Orr HT, Parham P, Weiss EH, Girdlestone J.
Ragoussis and co-workers described a class I HLA gene that maps to within 50 kb of HLA-A. Comparison of the nucleotide sequences of HLA-J alleles shows this gene is more related to HLA-G, A, and H.
All alleles of HLA-J are pseudogenes because of deleterious mutations that produce translation termination either in exon 2 or exon 4.
HLA-J appears, like HLA-H, to be an inactivated gene that results from duplication of an Ag-presenting locus related to HLA-A.
Evolutionary relationships as assessed by construction of trees suggest the four modern loci: HLA-A, G, H, and J were formed by successive duplications from a common ancestral gene.
In this scheme one intermediate locus gave rise to HLA-A and H, the other to HLA-G and J.
Alleles at different HLA loci (genes and pseudogenes)
share nucleotide sequences
HLA_A and HLA-H (pseudogene)
AA Codon 30 35 40 45 50
A*24:02:01:01 GGC TAC GTG GAC GAC ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AGG ATG GAG CCG CGG GCG CCG
A*01:01:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -A- --- --- --- --- --- ---
A*02:01:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
A*25:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
A*32:01:01 --- --- --- --- --- --- --- --- --- --- --T --- --- --- --- --- --- --- --- --- --- --- --- --- ---
H*01:01:01:01 GGC TAC GTG GAC GAT ACG CAG TTC GTG CGG TTC GAC AGC GAC GCC GCG AGC CAG AGG ATG GAG CCG CGG GCG CCG
HLA-A, B and HLA-H (pseudogene)
AA Codon 80 85 90
B*57:01:01 GAG AAC CTG CGG ATC GCG CTC CGC TAC TAC AAC CAG AGC GAG GCC G
B*07:02:01 --- -G- --- --- -A- CT- -G- G-- --- --- --- --- --- --- --- -
B*08:01:01 --- -G- --- --- -A- CT- -G- G-- --- --- --- --- --- --- --- -
B*15:17:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -
B*35:01:01:01 --- -G- --- --- -A- CT- -G- G-- --- --- --- --- --- --- --- -
B*44:02:01:01 --- --- --- --C -C- --- --- --- --- --- --- --- --- --- --- -
B*51:01:01:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -
AA Codon 80 85 90
H*01:01:01:01 GAG AAC CTG CGG ATC GCG CTC CGC TAC TAC AAC CAG AGC GAG GGC G
AA Codon 80 85 90
A*24:02:01:01 GAG AAC CTG CGG ATC GCG CTC CGC TAC TAC AAC CAG AGC GAG GCC G
A*01:01:01:01 -C- --- --- G-- -C- CT- -G- G-- --- --- --- --- --- --- -A- -
A*02:01:01:01 -T- G-- --- G-- -C- CT- -G- G-- --- --- --- --- --- --- --- -
A*25:01:01 --- -G- --- --- --- --- --- --- --- --- --- --- --- --- -A- -
A*32:01:01 --- -G- --- --- --- --- --- --- --- --- --- --- --- --- --- -
DRB Gene Content varies in Haplotypes Bearing Different
DRB1 Allele-Sero-Groups (Copy Number Variation has been
known in HLA for more than 3 decades)
Haplo-groups
DR1 DQB1 DQA1 DRB1 DRB6 DRB9 DRA HLA-B
DR51 DQB1 DQA1 DRB1 DRB6 DRB5 DRB9 DRA HLA-B
DR52 DQB1 DQA1 DRB1 DRB2 DRB3 DRA HLA-B
DR8 DQB1 DQA1 DRB1 DRB9 DRA HLA-B
DR53 DQB1 DQA1 DRB1 DRB7 DRB8 DRB4 DRB9 DRA HLA-B
Nature. 1986 Jul 3-9;322(6074):67-70.
Polymorphism of human Ia antigens: gene conversion between two DR beta
loci results in a new HLA-D/DR specificity.
Gorski J, Mach B.
Molecular mapping of the DR beta-chain region allows true allelic comparisons of the two expressed DR beta-chain loci, DR beta I and DR beta III.
At the more polymorphic locus, DR beta I, the allelic differences are clustered and may result from gene conversion events over very short distances.
The gene encoding the HLA-DR3/Dw3 specificity has been generated by a gene conversion involving the DR beta I and the DR beta III loci of the HLA-DRw6/Dw18 haplotype, as recipient and donor gene, respectively.
The generation of HLA-DR polymorphism within the DRw52 supertypic group can thus be accounted for by a succession of gene duplication, divergence and gene conversion.
Alleles at different HLA-DRB loci share nucleotide
sequences
AA Codon 10 15 20 25
DRB1*01:01:01 CA CGT TTC TTG TGG CAG CTT AAG TTT GAA TGT CAT TTC TTC AAT GGG ACG GAG CGG GTG CGG TTG CTG GAA AGA
DRB1*01:03 -- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
DRB1*03:01:01:01 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- -AC --- --C ---
DRB1*04:02:01 -- --- --- --- GA- --- G-- --A CA- --G --- --- --- --- --C --- --- --- --- --- --- --C --- --C ---
DRB1*07:01:01:01 -- --- --- C-- --- --- GG- --- -A- A-G --- --- --- --- --C --- --- --- --- --- -A- --C --- --- ---
DRB1*11:01:01 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---
DRB1*11:30 -- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---
DRB1*13:01:01 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---
DRB3*01:01:02:01 -- --- --- --- GA- -T- -G- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- -AC --- --C ---
DRB3*02:02:01:01 -- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --G ---
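The alignment above uses the usual convention that a dash means "same base as the reference row" (here DRB1*01:01:01). A minimal Python sketch, assuming that convention, expands dashed rows into full sequences so that shared segments between DRB1 and DRB3 alleles can be checked directly. The function names are hypothetical, and only a seven-codon excerpt of the rows is used.

```python
def expand_row(reference: str, row: str) -> str:
    """Replace each '-' in an aligned row with the reference base in the same column."""
    if len(reference) != len(row):
        raise ValueError("rows must be aligned to the same length")
    return "".join(ref if base == "-" else base for ref, base in zip(reference, row))


def differing_codons(seq_a: str, seq_b: str) -> list[int]:
    """Return 0-based indices of space-separated codons that differ between two rows."""
    return [i for i, (a, b) in enumerate(zip(seq_a.split(), seq_b.split())) if a != b]


# Excerpt of the table: the seven codons starting at "TTG TGG CAG ..."
REFERENCE = "TTG TGG CAG CTT AAG TTT GAA"       # DRB1*01:01:01
ROWS = {
    "DRB1*03:01:01:01": "--- GA- T-C TC- -C- -C- --G",
    "DRB1*11:30":       "--- GA- -T- --- --- -C- --G",
    "DRB3*02:02:01:01": "--- GA- -T- --- --- -C- --G",
}

expanded = {name: expand_row(REFERENCE, row) for name, row in ROWS.items()}
for name, seq in expanded.items():
    print(f"{name:18s} {seq}")

# DRB1*11:30 and DRB3*02:02:01:01 are identical over this excerpt,
# although they belong to different loci.
print(differing_codons(expanded["DRB1*11:30"], expanded["DRB3*02:02:01:01"]))
```

Over this excerpt a short read cannot tell DRB1*11:30 from DRB3*02:02:01:01, which is why reads from other loci have to be considered during mapping.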
Alleles at different HLA-DRB loci share nucleotide sequences
Importance of determining Phase
AA Codon 10 15 20 25
DRB1*01:01:01 CA CGT TTC TTG TGG CAG CTT AAG TTT GAA TGT CAT TTC TTC AAT GGG ACG GAG CGG GTG CGG TTG CTG GAA AGA
DRB1*03:01:01:02 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- -AC --- --C ---
DRB1*13:01:01:01 -- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---
DRB1*13:67 -- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --C ---
DRB3*01:01:02:01 -- --- --- --- GA- -T- -G- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- -AC --- --C ---
DRB3*02:02:01:02 -- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- --- --- --- --- --- --C --- --G ---
AA Codon 30 35 40 45 50
DRB1*01:01:01 TGC ATC TAT AAC CAA GAG GAG TCC GTG CGC TTC GAC AGC GAC GTG GGG GAG TAC CGG GCG GTG ACG GAG CTG GGG
DRB1*03:01:01:02 -A- T-- C-- --- --G --- --- AA- --- --- --- --- --- --- --- --- --- -T- --- --- --- --- --- --- ---
DRB1*13:01:01:01 -A- T-- C-- --- --G --- --- AA- --- --- --- --- --- --- --- --- --- -T- --- --- --- --- --- --- ---
DRB1*13:67 -A- T-- C-- --- --G --- --- AA- --- --- --- --- --- --- --- --- --- -T- --- --- --- --- --- --- ---
DRB3*01:01:02:01 -A- T-- C-- --- --G --- --- -T- C-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
DRB3*02:02:01:02 CA- T-- C-- --- --G --- --- -A- -C- --- --- --- --- --- --- --- --- --- --- --- --- -G- --- --- ---
AA Codon 55 60 65 70 75
DRB1*01:01:01 CGG CCT GAT GCC GAG TAC TGG AAC AGC CAG AAG GAC CTC CTG GAG CAG AGG CGG GCC GCG GTG GAC ACC TAC TGC
DRB1*03:01:01:02 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -A- --- -G- CG- --- --- -AT --- ---
DRB1*13:01:01:01 --- --- --- --- --- --- --- --- --- --- --- --- A-- --- --A G-C GA- --- --- --- --- --- --- --- ---
DRB1*13:67 --- --- --- --- --- --- --- --- --- --- --- --- A-- --- --A G-C GA- --- --- --- --- --- --- --- ---
DRB3*01:01:02:01 --- --- -TC --- --- -C- --- --- --- --- --- --- --- --- --- --- -A- --- -G- CG- --- --- -AT --- ---
DRB3*02:02:01:02 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -A- --- -G- CA- --- --- -AT --- ---
AA Codon 80 85 90
DRB1*01:01:01 AGA CAC AAC TAC GGG GTT GGT GAG AGC TTC ACA GTG CAG CGG CGA G
DRB1*03:01:01:02 --- --- --- --- --- --- -TG --- --- --- --- --- --- --- --- -
DRB1*13:01:01:01 --- --- --- --- --- --- -TG --- --- --- --- --- --- --- --- -
DRB1*13:67 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -
DRB3*01:01:02:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -
DRB3*02:02:01:02 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -
HLA typing using high throughput sequencing
technologies.
Whole-gene amplification.
Exon-wise amplification of few exons.
Sequencing workflow
Ex 1 Ex 2 Ex 3 Ex 4
Fragmentation
Ligate barcoded adaptors
Size select and purify 300-340 bp fragments
Sequencing library Q/C
Ex 1 Ex 2 Ex 3 Ex 4
What would be the ideal sequencing machine?
• High-throughput
• Accurate
• Long read length
• Simple to use
• Able to detect all types of genomic changes (SNPs, insertions or deletions, large-scale rearrangements, methylation)
High-throughput sequencing technologies: an
overview
• All platforms share core similarities:
– DNA templates are spatially segregated, no physical separation step
– DNA is sequenced through synthesis, rather than termination
– DNA sequence is decoded by the emission of light or pH change
• Platforms differ by:
– Specific method used to generate template libraries
– Chemistries/approaches used to generate the sequence signal (light)
– Throughput (amount of bases sequenced per run)
– Length of sequence read
– Error modalities and error rates (e.g. homopolymer regions)
The Platforms that we Tested
• 454-Roche: exon coverage only – Multiplexing – Workflow was demanding (2011-2012)
• PGM-Ion Torrent: Instrument problems – reads too short – homopolymer problems (2012-2013)
• Pacific Biosciences: Extremely long reads! – Throughput and workflow still in development; appears to be simple – Base calling (15 percent error) by consensus – homopolymer problems (2014)
• Illumina: Lower error rate – robust instruments (2011-2015). Various systems
Examples of ambiguities: exon shuffling, segmental
exchange, substitutions in untested segments
Potential benefits of next-generation
sequencing for HLA typing
• Clonal template amplification in vitro to eliminate
problem of sequencing heterozygous DNA
• Sufficiently long read length (300+ bp) to cover entire
exon (or more) in phase
• Increased sequence coverage of HLA genes
• Capability to multiplex patient specimens
• Potential to complete run and data analysis within one
week
Practical Advantages of Extending
Sequence Coverage
• Test complete gene
• No Assumptions made
• Transplantation: Detect mismatches thought to be
absent
• Mapping of Disease Susceptibility Factors
Many allele groups in HLA-A include one allele with an
insertion of an extra ‘C’ after a run of seven ‘C’s
A*01010101 AC CCC CCC .AAG ACA CAT ATG ACC CAC CAC
A*0104N -- --- --- C--- --- --- --- --- --- ---
A*02010101 -- G-- --- .--A --G --- --- --T --- ---
A*03010101 -- --- --- .--- --- --- --- --- --- ---
A*0321N -- --- --- C--- --- --- --- --- --- ---
A*310102 -- --- --- .--- --G --- --- --T --- ---
A*3114N -- --- --- C--- --G --- --- --T --- ---
C
A*02010101 AAAACGCATATGACTCACCAC
A*0321N CAAGACACATATGACCCACCA
MAARMSMMWWWK
A*03010101 AAGACACATATGACCCACCAC
A*02null CAAAACGCATATGACTCACCA
MARAMMSMWWWK
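The mixed-base strings shown above (MAARMSMMWWWK and MARAMMSMWWWK) follow from overlaying the two allele sequences position by position: once one allele carries the extra ‘C’, the two sequences are out of register and almost every position reads as a heterozygous base. A minimal Python sketch reproducing them with the standard IUPAC two-base ambiguity codes; the function name is hypothetical.

```python
# Standard IUPAC codes for one base or a mix of two bases
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def heterozygous_read(allele_1: str, allele_2: str) -> str:
    """Overlay two sequences and report each position as an IUPAC code."""
    return "".join(IUPAC[frozenset((a, b))] for a, b in zip(allele_1, allele_2))


# Sequences copied from the slide (the null allele is shifted by the inserted 'C')
print(heterozygous_read("AAAACGCATATGACTCACCAC", "CAAGACACATATGACCCACCA")[:12])
print(heterozygous_read("AAGACACATATGACCCACCAC", "CAAAACGCATATGACTCACCA")[:12])
```

The two allele combinations give different ambiguity strings, so the pattern itself carries information about which allele group bears the insertion.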
Resolution of common and well documented
null alleles (clinically relevant)
Locus  Allele     Related allele  Difference  Change         Resolution   Alternative
A      0104N      010101          Exon 4      ins 1          routine SBT
A      0253N      020101          Exon 2      PTC            routine SBT
A      2409N      240201          Exon 4      PTC            routine SBT
A      2411N      240201          Exon 4      ins 1          routine SBT
A      6811N      680102          Exon 1      del 1          ad hoc SSP
B      15010102N  150101          Intron 1    del 10         ad hoc SSP   extend reading by SBT
B      4022N      400201          Exon 3      PTC            routine SBT
B      4423N      440201          Exon 3      PTC            routine SBT
B      5111N      510101          Exon 4      ins 1          routine SBT
Cw     0409N      040101          Exon 7      del 1          ad hoc SSP
Cw     0507N      050101          Exon 3      del 2          routine SBT
DRB4   01030102N  010301          Intron 1    splicing site  ad hoc SSP   extend reading by SBT
DRB5   0108N      0102            Exon 3      del 19         ad hoc SSP
DRB5   0110N      0102            Exon 2      del 2          routine SBT
Cw*0401/Cw*0409N if B*4403 is present
PTC = premature termination codon; ins = nucleotide insertion; del = nucleotide deletion
DRB5*0102/0108N if possible haplotype is DRB1*1502-DQB1*050188
Detection of C*04:09N (common) and
A*31:14N (rare) alleles in a single pass
A*31:01:02 (red line) shows interrupted
coverage at the beginning of exon 4, while
A*31:14N (blue line), which differs from
A*31:01:02 by a one-base insertion, shows
continuous coverage.
C*04:01:01:01 (red line) shows interrupted
coverage at the end of exon 7, while
C*04:09N (blue line), which differs from
C*04:01:01:01 by a one-base deletion, shows
continuous coverage.
HLA Typing by NGS
• Wang C, Krishnakumar S, Wilhelmy J, Babrzadeh F, Stepanyan L, Su LF, Levinson D, Fernandez-Viña MA, Davis RW, Davis MM, Mindrinos M. High-throughput, high-fidelity HLA genotyping with deep sequencing. Proc Natl Acad Sci U S A. 2012 May 29;109(22):8676-81. doi: 10.1073/pnas.1206614109. Epub 2012 May 15. PubMed PMID: 22589303; PubMed Central PMCID: PMC3365218.
• New methodology that leverages the power of next-generation sequencing (NGS) and long-range PCR
• Interrogated the entire sequences of the class I genes and most of the extent of the class II genes in more than 9,000 subjects
1. Sample Collection
2. Long-Range PCR
3. Quantification
& Pooling
4. Fragmentation
5. Library preparation
& Pooling
6. Sequencing
7. Data analysis
5’UTR 1 2 3 4 5 6 7 3’UTR
5’UTR 1 2 3 4 5 6 7 3’UTR8
5’UTR 1 2 3 4 5 6 7 3’UTR8
5’UTR 1 2 3 4 3’UTR
5’UTR 1 2 3 4 5 3’UTR6
5’UTR 1 2 3 4 3’UTR
5’UTR 1 2 3 4 5 3’UTR
5’UTR 1 2 3 4 5 3’UTR6
HLA-B
HLA-A
HLA-C
HLA-DQA1
HLA-DQB1
HLA-DPB1
HLA-DPA1
HLA-DRB1, 3, 4, 5
8
NGS HLA TYPING SYSTEMS
Data Analysis
Shotgun sequencing
Genotype calling
[Figure: coverage (y-axis) vs. position (x-axis) for reads mapped to A*02:01:01:01 and A*02:07. Panels: genomic mapping, cDNA mapping.]
One nucleotide difference in exon 3 distinguishes A*02:01:01:01 (A) from A*02:07 (G). The HLA-A type of cell line BM9
is A*02:01:01:01. The top left panel shows the coverage plot when sequencing reads are mapped to the
A*02:01:01:01 and A*02:07 genomic sequences. The top right panel shows the coverage plot when sequencing
reads are mapped to the A*02:01:01:01 and A*02:07 cDNA sequences.
High-throughput, High resolution HLA
genotyping
Data Analysis Steps
• De-multiplexing
– Identical barcodes at both ends of paired-end reads
• Lowers the chance of cross-contamination
• Mapping
– Competitive mapping
• Reads are mapped against all available reference sequences, including those from pseudogenes; the best
alignments are passed on.
• Filtering
– Best alignments
– Identical alignments (for cDNA only)
– Paired-end alignment
• Genotype calling
– Limited number of candidates (top 10 in each category: number of reads mapped, minimal coverage,
minimal central coverage)
– Enumerate the possible combinations of homozygous and heterozygous sets
– Rank those combinations on the aggregated number of reads mapped, minimal coverage and minimal central
coverage
• De novo assembly
– Local de novo assembly can be performed to capture SNPs for novel alleles
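The genotype-calling step above (enumerate homozygous and heterozygous combinations of candidates, then rank them) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the pipeline's actual code: each candidate allele is represented only by the set of read IDs mapped to it, the ranking uses only the aggregated number of reads (the coverage criteria are left out), and ties are broken in favour of the homozygous call. The function name and the read IDs are hypothetical.

```python
from itertools import combinations


def call_genotype(candidates: dict[str, set[str]]) -> tuple[str, ...]:
    """Pick the one- or two-allele combination that explains the most reads.

    candidates maps an allele name to the set of read IDs mapped to it.
    """
    names = sorted(candidates)
    # Homozygous sets (one reference) plus heterozygous sets (two references)
    combos = [(name,) for name in names] + list(combinations(names, 2))

    def score(combo: tuple[str, ...]) -> tuple[int, int]:
        reads = set().union(*(candidates[name] for name in combo))
        # More reads explained is better; on a tie prefer fewer alleles
        return (len(reads), -len(combo))

    return max(combos, key=score)


example = {
    "A*02:01:01:01": {"r1", "r2", "r3", "r4", "r5", "r6"},
    "A*02:07": {"r1", "r2", "r3", "r4", "r5"},
    "A*24:02:01:01": {"r7", "r8", "r9", "r10"},
}
print(call_genotype(example))
```

Restricting the enumeration to the top candidates in each category, as the slide describes, keeps the number of pairs small; with 30 candidates there are at most 30 + 435 combinations to score.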
Paired-end Sequencing
[Figure: paired-end reads (~500 bp) aligned against Reference Sequence 1 and Reference Sequence 2; Reference Sequence 2 is marked ✕]
Central Read Definition
Using Central Reads Coverage
On the regular coverage plot, the two candidates look similar. On the central read coverage plot, the wrong
candidate has much lower coverage than the authentic candidate.
Complement Logic Resolves Difficult
Alleles
C*03:03:01 and C*03:04:01:01 differ by a single base at the end of exon 2. Because some B alleles and C alleles
are similar in this region, cDNA alignment shows little difference between
the two candidates. With genomic alignment and the paired-end filter, the difference between the two
candidates is greatly amplified, providing definite evidence to call one versus the other.
Using Complement Logics
cDNA alignment genomic alignment
Some short exons, such as exon 6 of some C alleles, are identical to those of B alleles. With cDNA
alignment, it is hard to tell whether an alignment is authentic. With genomic alignment and
paired-end reads, the neighboring polymorphic sites provide sufficient information to decide.
Maximize Usage of Computing Power for Speed
[Diagram: raw reads are split by barcode (B1 to B5); each barcode is split further (B1.1 to B1.5; B1.A to B1.D; PA, PB)]
Step: level of parallelism
De-multiplexing: one process *
Mapping: M processes per barcode *****
Merging, filtering, de-multiplexing: one process per barcode ***
Genotype calling: several processes per barcode *****
Streaming SIMD Extensions-vectorized implementation of the Smith-Waterman algorithm
User Friendly Interface
• The data analysis pipeline runs with a single
command: hla_pipeline.py -c config.file
• Results are reviewed graphically through a web page.
• The two components will be merged into a
single standalone program in about the next 6 months.
Interface
Sample Info
Genotypes
Candidate
Commenting
Counting Logics
Interface
Coverage plot Central Read Coverage plot
Reference alignment Read tiling pattern
Phasing Strategy
de novo Assembly
GCCAATGATGCACTGACTAGCCTAGCCACCC
GCCAATGATGCACTGACTAGCCTAGCCACCC
GCCAATGATGCACTGACTAGCCTAGCCACCC
TGCACTGACTAGCCTAGCCACCCGATCAGCTCC
TGCACTGACTAGCCTAGCCACCCGATCAGCTCC
TGCACTGACTAGCCTAGCCACCCGATCAGCTCC
CTAGCCACCCGATCAGCTCCGATCGATCGGG
CTAGCCACCCGATCAGCTCCGATCGATCGGG
CTAGCCACCCGATCAGCTCCGATCGATCGGG
CTAGCCTAGCCACCCGATCAGCTCCGATC
CTAGCCTAGCCACCCGATCAGCTCCGATC
CTAGCCTAGCCACCCGATCAGCTCCGATC
CCGATCGATCGGGCATCGATCGATCGG
CCGATCGATCGGGCATCGATCGATCGG
CCGATCGATCGGGCATCGATCGATCGG
GCCAATGATGCACTGACTAGCCTAGCCACCC
GCCAATGATGCACTGACTAGCCTAGCCACCC
GCCAATGATGCACTGACTAGCCTAGCCACCC
TGCACTGACTAGCCTAGCCACCCGATCAGCTCC
TGCACTGACTAGCCTAGCCACCCGATCAGCTCC
TGCACTGACTAGCCTAGCCACCCGATCAGCTCC
CTAGCCACCCGATCAGCTCCGATCGATCGGG
CTAGCCACCCGATCAGCTCCGATCGATCGGG
CTAGCCACCCGATCAGCTCCGATCGATCGGG
CTAGCCTAGCCACCCGATCAGCTCCGATC
CTAGCCTAGCCACCCGATCAGCTCCGATC
CTAGCCTAGCCACCCGATCAGCTCCGATC
CCGATCGATCGGGCATCGATCGATCGG
CCGATCGATCGGGCATCGATCGATCGG
CCGATCGATCGGGCATCGATCGATCGG
Multiple fragments of similar sequences generated by NGS
Clustering of fragments based on similar sequences to create contiguous sequence
Phasing Analysis
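Step 1: Identify true polymorphic sites
• The ratio of minor-allele to major-allele reads at a position must be above a set threshold for the position to be considered a true polymorphic site
• The polymorphic sites are determined by a statistical model
Example 1: site 1 has 5 x "T" and 6 x "G"; site 2 has 5 x "G" and 6 x "A"; site 3 has 5 x "C" and 6 x "A". All 3 sites are true polymorphic sites.
Example 2: site 1 has 10 x "T" and 1 x "G" ("G" = noise); site 2 has 1 x "G" and 10 x "A" ("G" = noise); site 3 has 5 x "C" and 6 x "A" (true polymorphic site).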
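The read counts in this example can be turned into a small worked check. The Python sketch below is an assumption-laden stand-in for the statistical model mentioned on the slide: it simply compares the minor-to-major read ratio at a position with a fixed cutoff. The cutoff value (0.25) and the function name are hypothetical.

```python
from collections import Counter


def is_polymorphic(base_counts: dict[str, int], min_ratio: float = 0.25) -> bool:
    """Call a site polymorphic when the second most frequent base is
    supported by at least min_ratio times as many reads as the most frequent one.

    min_ratio is an illustrative value, not one given in the slides.
    """
    ranked = Counter(base_counts).most_common(2)
    if len(ranked) < 2:
        return False  # only one base observed
    (_, major), (_, minor) = ranked
    return major > 0 and minor / major >= min_ratio


# Counts taken from the slide
print(is_polymorphic({"T": 5, "G": 6}))    # balanced counts
print(is_polymorphic({"T": 10, "G": 1}))   # single "G" read treated as noise
print(is_polymorphic({"C": 5, "A": 6}))    # balanced counts
```

A fixed ratio is the simplest possible rule; at low depth a single erroneous read can pass it, which is one reason a statistical model that accounts for depth is preferable.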
Build Phase Resolved “Contigs”
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
Polymorphic Sites
T/G G/A C/A
Step 1: Identify polymorphic sites
Step 2: Determine which polymorphisms are
linked together to resolve two contigs
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTTCCAATGATGCCCTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
CCATGTGCCAATAATGCACTGTGCATGCATCG
T-G-C are linked
G-A-A are linked
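The linkage shown here (T-G-C on one contig, G-A-A on the other) can be reproduced with a short Python sketch. It assumes the idealized situation drawn on the slide: every read is error-free, aligned, and spans all polymorphic sites. Real reads overlap only partially, so an actual phasing step has to chain sites through overlapping reads. The function names are hypothetical; the two read sequences are copied from the slide.

```python
from collections import Counter


def polymorphic_columns(reads: list[str]) -> list[int]:
    """Return the 0-based columns at which the aligned reads disagree."""
    return [i for i in range(len(reads[0])) if len({read[i] for read in reads}) > 1]


def linked_alleles(reads: list[str]) -> Counter:
    """Count the combination of bases each read carries at the polymorphic columns."""
    columns = polymorphic_columns(reads)
    return Counter("-".join(read[i] for i in columns) for read in reads)


READ_1 = "CCATGTTCCAATGATGCCCTGTGCATGCATCG"
READ_2 = "CCATGTGCCAATAATGCACTGTGCATGCATCG"
reads = [READ_1] * 5 + [READ_2] * 6   # 5 and 6 copies, as on the slide

print(polymorphic_columns(reads))
print(linked_alleles(reads))
```

Only two base combinations occur across all reads, so the three sites resolve into exactly two phased contigs.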
Best Matching Alleles
Dynamic Phasing
Calling polymorphisms from de novo assembled, mapped, paired-end sequences
Phase Resolved Consensus
Build phased contig sequences based on polymorphic linkage
Consensus Alignment
Compare contig sequences back to the database to find the best match
Best Matching
Alleles
• “Detail Review” window can be used for in-depth review of HLA genotyping
• “Detail Review” window displays the contig alignment browser as well as other reference parameters
(eReads and xReads)
• “Contig alignment” browser indicates phased blocks
Build Phased Contig Blocks
Summary
• Broad coverage (exons & introns) and deep sequencing (> 50)
• Paired-end sequencing
• Mapping
• Phasing:
– Complement logic (cDNA vs. genomic)
– Paired-end sequencing
– Central read logic
– Build Contig blocks
Coverage variance
[Figure: coverage (y-axis) vs. position (x-axis) plots for HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1 and HLA-DRB1]
Genotype calling
Mapping
Reads are mapped onto the IMGT-HLA
references, including non-classical
HLA genes and pseudogenes, with
NCBI BLASTN.
Filtering
Alignments are filtered out, in this order, if they are sub-best hits,
contain mismatches or gaps, or are
shorter than 50 bp; alignments in which a
reference is mapped to only one
end of a paired-end read are also removed when another
reference is mapped to both ends of
that paired-end read.
Genotyping
Compute MCOR and MCCR for each
mapped reference. Eliminate references
with either MCOR = 0 or MCCR = 0.
Enumerate combinations of either
one reference (homozygous) or two
references (heterozygous), and pick the combination with the maximum number of reads.
[Figure: reads tiled over references 01:02:01, 01:03:01 and 01:02:02, with L and R marked on a read; criterion shown: L/R<=2 or R/L<=2; panels A to D]
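The L/R criterion in this figure can be illustrated with a short Python sketch. The interpretation is an assumption: L and R are taken to be the lengths of a read to the left and to the right of the position being examined, and a read counts as "central" for that position when neither arm is more than twice as long as the other. Read literally, "L/R<=2 or R/L<=2" is satisfied by every read, so the sketch requires both ratios to be at most 2. Function names and coordinates are hypothetical.

```python
def is_central(read_start: int, read_end: int, position: int, max_ratio: float = 2.0) -> bool:
    """Return True if the read covers `position` with balanced left and right arms.

    read_start is inclusive and read_end is exclusive (0-based coordinates).
    """
    if not read_start <= position < read_end:
        return False
    left = position - read_start        # bases to the left of the position
    right = read_end - 1 - position     # bases to the right of the position
    if left == 0 or right == 0:
        return False                    # position sits at the very end of the read
    return left / right <= max_ratio and right / left <= max_ratio


def central_coverage(reads: list[tuple[int, int]], position: int) -> int:
    """Count the reads that are central for `position`."""
    return sum(is_central(start, end, position) for start, end in reads)


reads = [(0, 101), (40, 141), (45, 146)]
print(central_coverage(reads, 50))
```

All three reads cover position 50, but only the first one covers it near its middle. A wrong candidate tends to be supported by read ends only, which is presumably why central read coverage separates it from the authentic candidate.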
Data Analysis
• Determination of number of reads
• Bar codes specific for sample and locus (amplicon)
• Barcodes specific for sample (early pooling)
• Informatics:
• Mapping of Reads
• Phasing Reads
• Insertions and Deletions
• Homozygous and Heterozygous Positions
• Reads from other Loci
• Hybrid alleles, Novel alleles
Data Analysis
• Determination of number of reads
• Bar codes specific for sample and locus (amplicon)
- Technically unwieldy
- Easier interpretation by Software (reads are assigned to the locus)
• Barcodes specific for sample (early pooling)
- Technically simple
- Software needs to be more sophisticated
need to phase longer sequence stretches
Data Analysis
Informatics:
• Mapping of Reads
• Phasing Reads
• Reads from other Loci (Highly homologous genes, DQA2, DPA2, DQB2, DPB2, DRB2/6/7/8/9)
• Alleles with incomplete references (in general rare)
• Hybrid alleles
• Novel alleles
Pseudogene Disambiguation
(Alpha sample SBC060)
SBT Result: A*02:01 NGS Result
HLA-H (novel)
A*02:01 TAC CAC CAG TAC GCC TAC GAC GGC AAG GAT TAC ATC GCC CTG AAA GAG GAC CTG
CGC TCT TGG
H*01:01 GAC CAC CAG TAC GCC TAC GAC AGC AAG GAT TAC ATC GCT CTG AAA GAG GAC CTG
CGC TCC TGG
Hybrid allele carrying sequences of two loci
AA Codon 1 5 10 15 20
DRB1*01:01:01 CTG GCT TTG GCT GGG GAC ACC CGA C|CA CGT TTC TTG TGG CAG CTT AAG TTT GAA TGT CAT TTC TTC AAT GGG ACG
DRB1*14:54:01 --- --- --- --- --- --- --- A-- -|-- --- --- --- GA- T-C TC- -C- -C- --G --- --- --- --- --- --- ---
DRB1*14:141 --- --- --- --- --- --- --- A-- -|-- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- ---
DRB3*02:02:01:02 --- --- --C --- --- --- --- --- -|-- --- --- --- GA- -T- --- --- -C- --G --- --- --- --- --- --- ---
AA Codon 25 30 35 40 45
DRB1*01:01:01 GAG CGG GTG CGG TTG CTG GAA AGA TGC ATC TAT AAC CAA GAG GAG TCC GTG CGC TTC GAC AGC GAC GTG GGG GAG
DRB1*14:54:01 --- --- --- --- --C --- --C --- -A- T-- C-- --- --G --- --- -T- --- --- --- --- --- --- --- --- ---
DRB1*14:141 --- --- --- --- --C --- --G --- CA- T-- C-- --- --G --- --- -A- -C- --- --- --- --- --- --- --- ---
DRB3*02:02:01:02 --- --- --- --- --C --- --G --- CA- T-- C-- --- --G --- --- -A- -C- --- --- --- --- --- --- --- ---
AA Codon 50 55 60 65 70
DRB1*01:01:01 TAC CGG GCG GTG ACG GAG CTG GGG CGG CCT GAT GCC GAG TAC TGG AAC AGC CAG AAG GAC CTC CTG GAG CAG AGG
DRB1*14:54:01 --- --- --- --- --- --- --- --- --- --- -C- --G --- C-- --- --- --- --- --- --- --- --- --- -G- ---
DRB1*14:141 --- --- --- --- -G- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -A-
AA Codon 75 80 85 90 95
DRB1*01:01:01 CGG GCC GCG GTG GAC ACC TAC TGC AGA CAC AAC TAC GGG GTT GGT GAG AGC TTC ACA GTG CAG CGG CGA G|TT GAG
DRB1*14:54:01 --- --- -A- --- --- --- --T --- --- --- --- --- --- --- -TG --- --- --- --- --- --- --- --- -|-C C-T
DRB1*14:141 --- -G- CA- --- --- -AT --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -|-C C-T
DRB3*02:02:01:02 --- -G- CA- --- --- -AT --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -|-C C-T
AA Codon 100 105 110 115 120
DRB1*01:01:01 CCT AAG GTG ACT GTG TAT CCT TCA AAG ACC CAG CCC CTG CAG CAC CAC AAC CTC CTG GTC TGC TCT GTG AGT GGT
DRB1*14:54:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --G --- --- --T --- --- --- ---
DRB1*14:141 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --G --- --- --T --- --- --- ---
DRB3*02:02:01:02 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --G ---
AA Codon 125 130 135 140 145
DRB1*01:01:01 TTC TAT CCA GGC AGC ATT GAA GTC AGG TGG TTC CGG AAC GGC CAG GAA GAG AAG GCT GGG GTG GTG TCC ACA GGC
DRB1*14:54:01 --- --- --- --- --- --- --- --- --- --- --- --- --T --- --- --- --- --- A-- --- --- --- --- --- ---
DRB1*14:141 --- --- --- --- --- --- --- --- --- --- --- --- --T --- --- --- --- --- A-- --- --- --- --- --- ---
DRB3*02:02:01:02 --- --- --- --- --- --C --- --- --- --- --- --- --- --- --A --- --- --- --- --- --- --- --- --- ---
Characterization of a rare allele with incomplete
sequence
B*15:147 derives from B*15:01:01:01
SBT/SSO vs NGS
Identifying a novel allele
S-101, Reference Type Result: B*13@, B*38@, one allele is an exon 4 variant ?
NGS Result:
B*13:02:01, B*38:02:01 _Exon 4 variant A to G, Lys to Arg, codon 186.
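The reported change (A to G, Lys to Arg) can be checked against the standard genetic code. The slide does not give the codon sequence at position 186, so the Python sketch below considers both lysine codons (AAA and AAG) and lists what each single A-to-G substitution would encode. The function name is hypothetical, and the table holds only the codons needed here.

```python
# Subset of the standard genetic code needed for this check
CODON_TO_AA = {
    "AAA": "Lys", "AAG": "Lys",
    "AGA": "Arg", "AGG": "Arg",
    "GAA": "Glu", "GAG": "Glu",
}


def a_to_g_outcomes(codon: str) -> list[tuple[int, str, str]]:
    """For each A in the codon, return (1-based position, mutant codon, amino acid)."""
    outcomes = []
    for i, base in enumerate(codon):
        if base == "A":
            mutant = codon[:i] + "G" + codon[i + 1:]
            outcomes.append((i + 1, mutant, CODON_TO_AA[mutant]))
    return outcomes


for lys_codon in ("AAA", "AAG"):
    print(lys_codon, a_to_g_outcomes(lys_codon))
```

For either lysine codon, only an A-to-G change at the second codon position yields arginine, which is consistent with the substitution reported for the exon 4 variant.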
DPB1 Hybrid Alleles
[Figure: gene map with exons E2, E3, E4 and introns I2, I3; X(AAGG) marked]
DPB1*463:01/DPA1*01:03:01:05
Recombination area: 244 - 2761 bp from exon 2
DPB1*04:02:01:01/DPA1*01:03:01:05
DPB1*03:01:01/DPA1*01:03:01:03
Characterization of a Novel allele through the
evaluation of unmapped reads
Functional Significance
Subject with two closely related alleles included in the DPB1*04:02:01:01G
DPB1*04:02:01G:
DPB1*04:02:01:01
DPB1*04:02:01:02
DPB1*105:01
DPB1*463:01
DPB1*571:01
Identical Antigen Recognition Site Structure
Different levels of Expression (we propose)
AA Codon -25 -20 -15 -10 -5
DPB1*105:01 ATG ATG GTT CTG CAG GTT TCT GCG GCC CCC CGG ACA GTG GCT CTG ACG GCG TTA CTG ATG GTG CTG CTC ACA TCT
DPB1*414:01 *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
AA Codon 1 5 10 15 20
DPB1*105:01 GTG GTC CAG GGC AGG GCC ACT CCA G|AG AAT TAC CTT TTC CAG GGA CGG CAG GAA TGC TAC GCG TTT AAT GGG ACA
DPB1*414:01 *** *** *** *** *** *** *** *** *|-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
DPB1*463:01 --- --- --- --- --- --- --- --- -|-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
AA Codon 25 30 35 40 45
DPB1*105:01 CAG CGC TTC CTG GAG AGA TAC ATC TAC AAC CGG GAG GAG TTC GTG CGC TTC GAC AGC GAC GTG GGG GAG TTC CGG
DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
AA Codon 50 55 60 65 70
DPB1*105:01 GCG GTG ACG GAG CTG GGG CGG CCT GAT GAG GAG TAC TGG AAC AGC CAG AAG GAC ATC CTG GAG GAG AAG CGG GCA
DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- G-- --- ---
DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
AA Codon 75 80 85 90 95
DPB1*105:01 GTG CCG GAC AGG ATG TGC AGA CAC AAC TAC GAG CTG GGC GGG CCC ATG ACC CTG CAG CGC CGA G|TC CAG CCT AGG
DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -|-- --- --- -A-
DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- -|-- --- --- -A-
AA Codon 100 105 110 115 120
DPB1*105:01 GTG AAT GTT TCC CCC TCC AAG AAG GGG CCC TTG CAG CAC CAC AAC CTG CTT GTC TGC CAC GTG ACG GAT TTC TAC
DPB1*414:01 --- --C --- --- --- --- --- --- --- --- C-- --- --- --- --- --- --- --- --- --- --- --A --- --- ---
DPB1*463:01 --- --C --- --- --- --- --- --- --- --- C-- --- --- --- --- --- --- --- --- --- --- --A --- --- ---
AA Codon 125 130 135 140 145
DPB1*105:01 CCA GGC AGC ATT CAA GTC CGA TGG TTC CTG AAT GGA CAG GAG GAA ACA GCT GGG GTC GTG TCC ACC AAC CTG ATC
DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
AA Codon 150 155 160 165 170
DPB1*105:01 CGT AAT GGA GAC TGG ACC TTC CAG ATC CTG GTG ATG CTG GAA ATG ACC CCC CAG CAG GGA GAT GTC TAC ACC TGC
DPB1*414:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --C --- --- -T- ---
DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --C --- --- -T- ---
AA Codon 175 180 185 190 195
DPB1*105:01 CAA GTG GAG CAC ACC AGC CTG GAT AGT CCT GTC ACC GTG GAG TGG A|AG GCA CAG TCT GAT TCT GCC CGG AGT AAG
DPB1*414:01 --- --- --- --- --- --- --- --C --- --- --- --- --- --- --- -|-- --- --- --- --- --- --- --- --- ---
DPB1*463:01 --- --- --- --- --- --- --- --C --- --- --- --- --- --- --- -|-- --- --- --- --- --- --- --- --- ---
AA Codon 200 205 210 215 220
DPB1*105:01 ACA TTG ACG GGA GCT GGG GGC TTC GTG CTG GGG CTC ATC ATC TGT GGA GTG GGC ATC TTC ATG CAC AGG AGG AGC
DPB1*414:01 --- --- --- --- --- --- --- --- A-- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
DPB1*463:01 --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
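A minimal sketch of how one row of the alignment above can be expanded into an explicit sequence. It assumes the conventional reading of such alignments: a dash means identity with the first (reference) row, an asterisk means no sequence is available, and a vertical bar marks an exon boundary. The function name `expand_alignment` is hypothetical and not part of any typing software.

```python
def expand_alignment(reference, aligned):
    """Replace each '-' in an aligned row with the base at the same
    column of the reference row. Explicit bases, '*', '|' and spaces
    are kept exactly as written."""
    expanded = []
    for ref_char, char in zip(reference, aligned):
        expanded.append(ref_char if char == "-" else char)
    return "".join(expanded)


# Three codons copied from the DPB1*105:01 and DPB1*463:01 rows above.
reference_row = "GTG AAT GTT"
aligned_row = "--- --C ---"
explicit_row = expand_alignment(reference_row, aligned_row)

# Columns where the aligned allele carries an explicit base.
differences = [
    (column, ref_char, char)
    for column, (ref_char, char) in enumerate(zip(reference_row, aligned_row))
    if char in "ACGT"
]
```

Here `explicit_row` is `GTG AAC GTT`, and `differences` holds the single T to C substitution at its column in the row.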
Possible eSTR proximal to intron-2 splicing site
STR length may play a regulatory role in the expression of DPB1
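[Figure: DPB1 gene map from 5' UTR to 3' UTR, with exon labels E1 (101), E2 (264), E3 (282), E4 (111), E5 (22) and introns I2, I3. The DPB1 fragment E2/E4 (~5.1 kb) lies between primers F_DPB1 and R_DPB1 and contains the X(AAGG) repeat in intron 2; the figure also carries the label 4172 - 4227 bp.]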
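A minimal sketch of how the length of the (AAGG) repeat could be measured in an intron 2 sequence, assuming the repeat is read as the longest uninterrupted tandem run of the motif. The function name and the example sequence are hypothetical; the slides do not state how the STR length was scored.

```python
import re


def longest_tandem_run(sequence, motif="AAGG"):
    """Return the number of motif copies in the longest uninterrupted
    tandem run of `motif` found in `sequence` (0 if the motif is absent)."""
    pattern = "(?:" + re.escape(motif) + ")+"
    runs = re.findall(pattern, sequence.upper())
    return max((len(run) // len(motif) for run in runs), default=0)


# Made-up sequence: a run of three AAGG copies, then a single copy.
example_sequence = "TTAAGGAAGGAAGGCCAAGGT"
repeat_count = longest_tandem_run(example_sequence)
```

Comparing `repeat_count` between the two alleles of a sample gives the kind of short/long contrast shown in the STR analysis slides; any cut-off between "short" and "long" would have to come from the data.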
DPB1 Intron 2 eSTR(AAGG)
Low
High
DPB1*02:01:04e1 L_AA
DPB1*02:01:02v5 L_AA
DPB1*02:01:02v6 L_AA
DPB1*02:01:02 L_AA
DPB1*02:01:02v3 L_AA
DPB1*02:01:02v7 L_AA
DPB1*02:02 L_AA
DPB1*02:01:02v2 L_AA
DPB1*02:01:02v1 L_AA
DPB1*02:01:02v4 L_AA
DPB1*04:02:01:01 L_AA
DPB1*04:02:01:02 L_AA
DPB1*04:01:01:01v1 L_AA
DPB1*04:01:31 L_AA
DPB1*04:01:31 L_AA
DPB1*04:01:01:01v4 L_AA
DPB1*04:01:01:01v5 L_AA
DPB1*04:01:01:01 L_AA
DPB1*04:01:01:01v3 L_AA
DPB1*464:01
DPB1*04:01:01:02 L_AA
DPB1*398:01
DPB1*30:01
DPB1*58:01e1
DPB1*17:01e1 L_AA
DPB1*17:01x1
DPB1*19:01e1 H_GA
DPB1*39:01x1
DPB1*11:01:01e1 H_GA
DPB1*27:01e1
DPB1*13:01:01/DPB1*107:01e1 H_GA
DPB1*85:01e1 H_GA
DPB1*01:01:01e1 H_GA
DPB1*296:01e1
DPB1*15:01:01e1 H_GA
DPB1*18:01e1 H_GA
DPB1*05:01:01e1 H_GA
DPB1*414:01e1
DPB1*463:01
DPB1*16:01:01 H_GG
DPB1*21:01e1
DPB1*06:01e1 H_GG
DPB1*09:01:01e1 H_GG
DPB1*104:01e1
DPB1*03:01:01 H_GG
DPB1*14:01:01e1 H_GG
Intron2 (-43)
STR Analysis : Short - Short
DPB1* 01:01:01e1, 05:01:01e1
STR Analysis : Short - Long
DPB1* 02:01:02, 13:01:01e1
STR Analysis : Long - Long
DPB1* 02:01:02, 02:01:02v3
Data Analysis
• Determination of number of reads
• Barcodes specific for sample and locus (amplicon)
• Barcodes specific for sample (early pooling)
• Informatics:
• Mapping of Reads
• Phasing Reads
• Insertions and Deletions
• Homozygous and Heterozygous Positions
• Reads from other Loci
• Hybrid alleles, Novel alleles
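The first two items above (counting reads, and barcodes specific for sample and locus) can be sketched as a simple demultiplexing step. This is an illustration only: the barcode sequences, sample names, and read strings are made up, and real pipelines also tolerate sequencing errors in the barcode.

```python
from collections import Counter

# Hypothetical barcode-to-(sample, locus) assignments, for illustration only.
BARCODES = {
    "ACGT": ("sample1", "HLA-A"),
    "TGCA": ("sample1", "DPB1"),
    "GATC": ("sample2", "HLA-A"),
}
BARCODE_LENGTH = 4


def count_reads(reads):
    """Count reads per (sample, locus) using the leading barcode of each read.
    Reads with an unknown barcode are counted as unassigned."""
    counts = Counter()
    for read in reads:
        key = BARCODES.get(read[:BARCODE_LENGTH], ("unassigned", "unassigned"))
        counts[key] += 1
    return counts


# Made-up reads: barcode followed by insert sequence.
reads = ["ACGTTTGACC", "ACGTGGGTCA", "TGCAAAGGAAGG", "GATCCCATGA", "NNNNACGTAC"]
read_counts = count_reads(reads)
```

With barcodes specific for sample only (early pooling), the lookup would return just the sample, and the locus would have to be assigned later by mapping.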
Typing two DRB5 alleles
All reads need to be accounted for
Correct genotype: DRB5*01:01:01, DRB5*01:08N
DRB5*01:02, 01:08N | DRB5*01:01:01, 01:02 | DRB5*01:01:01, 01:08N
DRB5*01:02/01:08N: identical in exon 2, differ by a 19 nt indel in exon 3
DRB5*01:01:01/01:02: identical in exon 3, differ by 3 nt substitutions in exon 2
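The DRB5 example can be sketched as a check that every observed read variant is carried by at least one allele of a candidate genotype. The labels X/Y (exon 2) and P/Q (exon 3) are hypothetical stand-ins for the actual sequences; they only encode the relationships stated above (01:02 and 01:08N share exon 2, 01:01:01 and 01:02 share exon 3).

```python
from itertools import combinations

# Hypothetical per-exon labels encoding which alleles share which exon sequence.
ALLELES = {
    "DRB5*01:01:01": {"exon2": "X", "exon3": "P"},
    "DRB5*01:02": {"exon2": "Y", "exon3": "P"},
    "DRB5*01:08N": {"exon2": "Y", "exon3": "Q"},  # Q = exon 3 with the 19 nt indel
}


def unexplained(genotype, observed):
    """Return the observed (exon, variant) pairs that no allele in the genotype carries."""
    carried = {(exon, v) for allele in genotype for exon, v in ALLELES[allele].items()}
    return sorted(observed - carried)


def consistent_genotypes(observed):
    """Return the allele pairs that account for every observed read variant."""
    return [g for g in combinations(sorted(ALLELES), 2) if not unexplained(g, observed)]


# Reads in this sample show both exon 2 variants and both exon 3 variants.
observed = {("exon2", "X"), ("exon2", "Y"), ("exon3", "P"), ("exon3", "Q")}
calls = consistent_genotypes(observed)
```

Only DRB5*01:01:01 with DRB5*01:08N leaves no reads unexplained; each of the other two pairings leaves one set of reads unaccounted for, which is the point of the slide.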
Must Know
• Amplicon: size (homogeneous or variable according to allele families)
• Preferential amplifications (locus or allele families)
• Primers: multiplexed or single location
• Other genes co-amplified (DRB)
• Software: Binning of reads (to a given locus, to a given allele family). No binning (possible interference in allele assignment)
• Phasing: reads covering informative SNPs, Central Reads, Assembly
• Utilization of reads
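The phasing item above (reads covering informative SNPs) can be sketched by counting which base combinations occur together on single reads that span two heterozygous positions. The read representation (start coordinate plus sequence, no indels) and the example reads are simplifying assumptions for illustration.

```python
from collections import Counter


def phase_two_sites(reads, pos_a, pos_b):
    """Count base pairs seen on the same read at two positions.
    `reads` is a list of (start, sequence) tuples in reference coordinates;
    reads that do not cover both positions are ignored."""
    counts = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= pos_a < end and start <= pos_b < end:
            counts[(seq[pos_a - start], seq[pos_b - start])] += 1
    return counts


# Made-up reads; positions 0 and 4 are the two heterozygous sites.
reads = [(0, "ACGTA"), (0, "TCGTC"), (2, "GTA"), (0, "ACGTA")]
haplotype_counts = phase_two_sites(reads, 0, 4)
```

Here the two supported phases are A with A and T with C; the read starting at position 2 does not span both sites and contributes nothing, which is why read length and amplicon design matter for phasing.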
Homozygous allele? Not exactly
[Figure: gene map of DRB1, DQA1 and DQB1]
DRB1            DQA1            DQB1                   Count
*13:02:01       *01:02:01:04    *06:04:01/*06:09:01    21/553
*15:01:01:01    *01:02:01:03    *06:02:01/*06:03:01    423/553
[Figure: gene map of DRB1, DQA1 and DQB1 with labelled distances of ~40 Kb and ~10 Kb]
DRB1                   DQA1                   DQB1
*01:01:01/*01:03       *01:01:01              *05:01:01:0x
*10:01:01/*14:54:01    *01:05/*01:04:01:01    *05:01:01:02
*01:02:01              *01:01:02              *05:01:01:01
DQB1*05:01:01:0x = DQB1*05:01:01:01 (intron 4) + DQB1*05:01:01:02 (intron 2)
My Thought Process for Genotype Assignment
• Examine genotype assigned by software through mapping
• Perfect match with reference vs. no full match at genomic level
• Check match with reference vs. no full match at exon level
• Check completeness of reference
• Identify novel allele; see closest allele and check differences and sequences
• Examine phasing by another method (central reads, paired-end reads, assembly)
• Check LD tables (may help identify dropouts)
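The last step, checking LD tables for possible dropouts, can be sketched as a lookup of expected partner alleles. The two DRB1 to DQB1 associations below are taken from the haplotype table earlier in this section; treating them as a complete LD table, and the function name itself, are assumptions made only for illustration.

```python
# Expected DQB1 partners per DRB1 allele (from the haplotype table in this section).
LD_TABLE = {
    "DRB1*13:02:01": {"DQB1*06:04:01", "DQB1*06:09:01"},
    "DRB1*15:01:01:01": {"DQB1*06:02:01", "DQB1*06:03:01"},
}


def possible_dropouts(drb1_alleles, dqb1_alleles):
    """Return DRB1 alleles whose expected DQB1 partners are all missing
    from the typed DQB1 alleles, which may point to allele dropout."""
    observed = set(dqb1_alleles)
    flagged = []
    for allele in drb1_alleles:
        expected = LD_TABLE.get(allele)
        if expected is not None and not (expected & observed):
            flagged.append(allele)
    return flagged


# A sample typed as heterozygous at DRB1 but apparently homozygous at DQB1.
flags = possible_dropouts(
    ["DRB1*13:02:01", "DRB1*15:01:01:01"],
    ["DQB1*06:02:01"],
)
```

A flag is a prompt to re-examine the reads, not proof of dropout: the sample may carry a rare haplotype that the table does not list.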
Barcode performance
[Figure: bar chart of read count per barcode; y-axis: read count (0 to 1.8e+06), x-axis: barcode]
Data Analysis
• Solid and simple logic
– error is minor
• Accurate
– User-friendly interface for reviewing results
• Fast
– Less than 2 hours per sequencing run (12-24 samples)
• Ability to pick up new alleles
• Stand-alone desktop solution
• Ability to evaluate genotype assignment by a second method
Our experience
• Allele calls were made by the software with virtually no operator evaluation
• Fourth field data: in most instances no previous information
• Haplotype associations stronger than expected
• Several common allele subtypes distinguished at the fourth field
• Specific allele associations became apparent without any assumptions made
• These studies show the robustness and comprehensive coverage provided by the typing system
Summary of State of the Art NGS for HLA
• Application to HLA typing is feasible
• Processes have been optimized
• Current methods are appropriate for both Registry Typing and small scale quick TAT
• Extremely accurate and comprehensive
• Great developments in the informatics and analysis
• Completion of sequences of common alleles will be helpful
• Studies in families may unravel limitations