Hands-on Workshop: Variant Interpretation and Classification
Mark Routbort, MD, PhD, on behalf of the Informatics Subdivision of AMP
Workshop Scope
Questions for any sequence variant
• Is it real? Believe it
• What is it? Name/describe it
• What does it mean? Understand it
Workshop Examples
• Positional noise & thresholding artifacts
• Degenerate alignments
• Missed calls
• Phasing issues
• Complex mutations
Outline
• Viewing variants
  • Overview of IGV/genomics viewers
  • Review of basic bioinformatics file types
    • Reference sequences/genomes
    • Sequence alignment maps
    • Annotation tracks
• Clinical NGS examples
• Brief summary of classification/knowledge
Viewing variants
•All of the examples used in this workshop will be available online
•Viewing them and working through them yourself requires IGV
IGV – Integrative Genomics Viewer
• Open source software from the Broad Institute
• Highly capable and well maintained genomic viewer
• The Java-based version is the most full-featured (there is also a JavaScript-based version which can be embedded in web pages directly and viewed in nearly all browsers)
• Latest version downloads: https://software.broadinstitute.org/software/igv/download
• Help/feature chat group: https://groups.google.com/forum/#!forum/igv-help
The Website
• http://variantworkshop.org
• Hosts the sequence files we will use today
• Has a simple set of links to session files that will load the examples into a running instance of IGV
Generic anatomy of a genome browser
Tracks representing different features (annotations, assay regions, etc) or characteristics of samples (sequence, coverage depth, variant calls, etc)
Genomic position(s)
Start up IGV, display a test message
• First, start up IGV. IGV must be running for URL-based session files to load. Once IGV is running, click on this link:
  • Load test message for IGV
• If all is well, you will see a little message about AMP in the alignment reads shown
• Color our reads
  • Right click on alignment sequence and choose “Color alignments by – Read Strand”
Anatomy of IGV
• Reference genome
• Contig (chromosome)
• Navigation entry
• Contig axis/map
• Sequence alignments track
• Sequence coverage track
• Reference sequence
Reference sequence
• A consensus, baseline, wild-type, or comparator DNA sequence
• Used as the comparator to define ‘what has changed’: sequence variants, structural rearrangements
• Reference alleles – REF
• Alternative alleles/variants – ALT
Reference genome
• A set of reference sequences for an organism or species under study
• Divided into ‘contigs’ – pieces of DNA that are part of the same physical package
  • ‘Shotgun’ sequencing – if sequences have overlaps, they are part of the same contig
  • In eukaryotes, contigs are more or less chromosomes
  • Human contigs (major): chr1 – chr22, chrX, chrY, chrM
• Interesting point – since a reference genome is generally a consensus created by sequencing many individual members of a species, it is very likely that no individual member actually has a perfect consensus sequence
  • We all have variants
IGV and the reference genome
• Everything starts with the reference genome
• Most clinical labs are still using hg19/GRCh37
• Latest build is actually GRCh38
• For most clinical sequencing this doesn’t make much difference because clinical labs are generally reporting at the gene & protein level (c. & p. in HGVS), not the genome level
  • BRAF c.1799T>A (p.V600E) is the nomenclature of the common BRAF mutation in melanoma, PTC, HCL regardless of the reference genome
IGV and the reference genome
• A reference genome at the minimum includes sequence data for each of the contigs
  • NNNNNNNNNNNNNNATCGCGCGCGTAGCTGANNNNNNNNNNNN
  • What are the N’s?
• Optionally, reference genomes (GENOME files) may include
  • gene level projections/transcript mappings
  • a cytomap/karyotype projection
• We will start with the simplest possible (artificial) reference genome
The simplest reference genome
• FASTA file
  • FASTA is a simple, old text format for representing DNA sequences
  • Can include multiple sequences (in the setting of a reference genome, multiple contigs)
• Sample FASTA genome
  • How many contigs are there?
  • What are the contig names?
  • How many base pairs are in each contig?
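The three questions above can also be answered programmatically. The following is a minimal sketch, assuming a plain-text FASTA where each contig starts with a ">" header line; the two-contig genome shown is a made-up illustration, not the workshop's sample file, and `read_fasta` is a hypothetical helper name.

```python
def read_fasta(text):
    """Return {contig_name: sequence} from FASTA-formatted text."""
    contigs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the contig name is the first word after '>'
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines per contig
            contigs[name].append(line)
    return {n: "".join(parts) for n, parts in contigs.items()}


# Hypothetical two-contig genome (illustration only)
example = """>chrA demo contig
GGGCATCATCATGGG
>chrB
ACGTACGT
ACGT
"""

genome = read_fasta(example)
print("Number of contigs:", len(genome))
for contig, seq in genome.items():
    print(contig, len(seq), "bp")
```

The contig names and lengths reported this way are what IGV uses to build the contig drop-down and axis.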
Adding a sequence alignment track
• SAM – sequence alignment map file (.SAM)
  • Uncompressed human readable format
• BAM – binary (sequence) alignment map file (.BAM)
  • Compressed binary format, not human readable
• Both can be indexed (.SAI and .BAI)
  • Index is always binary and is a fraction of the size of the .SAM/.BAM
  • Contains information about where alignments that map to a particular position are located in the sequence alignment map
Adding a sequence alignment track
• Sample sequence alignment map (SAM)
• This is about the simplest possible SAM file; the full specification allows for a lot more detail about reads
[Figure: sample SAM record with callouts for read name, contig, alignment start, mapping quality, CIGAR, and read (sequence)]
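To make the callouts concrete, here is a minimal sketch that splits one SAM alignment line into its eleven mandatory tab-separated fields. The record itself is a hypothetical example (a read carrying the 3 bp deletion from the artificial genome), not a line copied from the workshop file, and `parse_sam_line` is an illustrative helper rather than part of any library.

```python
# The 11 mandatory SAM columns, in order
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    # These columns are integers in the SAM format
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record


# Hypothetical record: 3 matches, a 3 bp deletion, then 9 matches
example_line = "read1\t0\tchrA\t1\t60\t3M3D9M\t*\t0\t0\tGGGCATCATGGG\t*"
rec = parse_sam_line(example_line)

print("Read name:      ", rec["qname"])
print("Contig:         ", rec["rname"])
print("Alignment start:", rec["pos"])   # 1-based leftmost position
print("Mapping quality:", rec["mapq"])
print("CIGAR:          ", rec["cigar"])
print("Read (sequence):", rec["seq"])
```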
Adding a VCF track
• Sample VCF (variant call format file)
• This is about the simplest possible VCF, with one entry
Putting the simple example together
• Load simple example - an artificial genome and sequence alignment
• This is a link to a “session” file which tells IGV to
  • Load our reference genome
  • Load our sequence alignment map
  • Display VCF file as an annotation track
• How many different sequences are shown?
  • This is a trick question
  • One sequence, three alignments
  • Can we prove this?
Review - Ambiguity/degeneracy in alignment
Reference     GGGCATCATCATGGG
Alt (sample)  GGGCATCATGGG

Alignment 1
Ref GGGCATCATCATGGG
Alt GGG---CATCATGGG
Alignment 2
Ref GGGCATCATCATGGG
Alt GGGCAT---CATGGG
Alignment 3
Ref GGGCATCATCATGGG
Alt GGGCATCAT---GGG
One CAT got deleted – which one?
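One way to prove that the three alignments describe the same sequence is to apply each deletion to the reference and compare the results. This is a small sketch using the sequences from this slide; the `delete` helper is illustrative.

```python
REF = "GGGCATCATCATGGG"
ALT = "GGGCATCATGGG"


def delete(ref, start, length):
    """Remove `length` bases beginning at 0-based index `start`."""
    return ref[:start] + ref[start + length:]


# 0-based index of the first deleted base in Alignments 1, 2 and 3
starts = [3, 6, 9]
results = [delete(REF, s, 3) for s in starts]

for number, seq in enumerate(results, start=1):
    print("Alignment", number, "->", seq)

# All three deletions produce exactly the sample sequence
all_same = all(seq == ALT for seq in results)
print("Same sequence from all three alignments:", all_same)
```

Because the deleted unit sits in a repeat, the sequencer cannot tell which copy was lost; the choice of alignment is a convention, not an observation.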
Describing sequence variants
• VCF (variant call format) uses genomic coordinates and is not normative. Synonymous and equally acceptable VCF representations of the same variant:
Chr3 1111333333 A T
Chr3 1111333333 AT TT
Chr3 1111333332 CA CT
• HGVS (Human Genome Variation Society) nomenclature is normative: the only acceptable representation is
g.1111333333A>T
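The equivalence of padded VCF records can be checked by applying each one to the reference and comparing the resulting sequences. The sketch below uses a short made-up reference (the real Chr3 sequence is not part of these slides) with three records built on the same pattern as above: the bare substitution, the substitution padded with the following base, and the substitution padded with the preceding base.

```python
def apply_record(reference, pos, ref, alt):
    """Apply one VCF-style replacement (1-based pos) to a reference string."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    return reference[:start] + alt + reference[start + len(ref):]


# Hypothetical 6-base reference; position 4 is the A that changes to T
toy_reference = "GGCATG"
records = [
    (4, "A", "T"),    # bare substitution
    (4, "AT", "TT"),  # same change, padded with the following base
    (3, "CA", "CT"),  # same change, padded with the preceding base
]

outcomes = [apply_record(toy_reference, *rec) for rec in records]
print(outcomes)
print("All records equivalent:", len(set(outcomes)) == 1)
```

This is why comparing VCF records as raw text is unreliable: records usually need to be normalized (trimmed and consistently aligned) before they are compared.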
HGVS
• Standardized nomenclature to promote portability, enduring meaning, and accuracy
• Human Genome Variation Society (HGVS): http://varnomen.hgvs.org
NM_004333.4(BRAF):c.1799T>A (p.V600E)
• NM_004333.4 – gene reference sequence (GenBank or EMBL accession + version #)
• BRAF – HGNC gene symbol
• c. – “c”oding DNA sequence
• 1799T>A – thymine to adenine at coding DNA position 1799
• p. – “p”rotein impact (predicted)
• V600E – valine to glutamate at codon 600
BRAF mutation analysis: Mutation detected in codon 600, exon 15 (GTG to GAG) of the BRAF gene that would change the encoded amino acid from Valine to Glutamate (p.Val600Glu)
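Because the HGVS form is structured, its parts can be separated mechanically, which is not true of the free-text style above. The following is a rough sketch only: the regular expression handles nothing beyond a simple coding-DNA substitution with an accession, version, and gene symbol, and `parse_simple_substitution` is a hypothetical helper, not a function from any HGVS library.

```python
import re

# Matches e.g. NM_004333.4(BRAF):c.1799T>A (simple substitutions only)
SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r"\((?P<gene>[^)]+)\):c\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_simple_substitution(text):
    """Return the parts of a simple c. substitution, or None if no match."""
    compact = re.sub(r"\s+", "", text)  # tolerate stray spaces
    match = SUBSTITUTION.match(compact)
    if match is None:
        return None
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    fields["position"] = int(fields["position"])
    return fields


parsed = parse_simple_substitution("NM_004333.4 (BRAF): c. 1799T>A")
print(parsed)
```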
A case of colorectal (CRC) adenocarcinoma (and lots of parameters to adjust)
• 50-gene amplicon hotspot NGS panel
• Tumor only – no germline control
• 4 called variants post-filtering
• Load colorectal adenocarcinoma example
A case of colorectal (CRC) adenocarcinoma
• When germline is not available for comparison, comparison to other samples on the same assay can be highly useful
  • Background noise
  • Platform specific artifacts
  • Contamination
Parameter and display play
• IGV has NUMEROUS parameters that affect the display of alignments and the highlighting of the coverage track
• Also many right-click context menu display & sorting options
• “Top Few” that tend to be useful/important in routine clinical review and where default settings are very problematic for somatic NGS
• We’ll do this with the current CRC case
Parameter play - downsampling
• Downsampling improves performance by loading less data
• View → Preferences → Alignments → Downsample reads
• Turn OFF downsampling (uncheck the box)
• NB: IGV defaults to downsampling turned on, to 100 reads. This is not appropriate for most labs doing high-depth somatic sequencing.
• If you are confused or not able to see a variant in the reads, check to see if you are downsampling!
Parameter play – highlighting variants
• View → Preferences → Alignments → Coverage track options → Coverage allele-fraction threshold
• IGV will color positions in the coverage track where a variant allele fraction exceeds the coverage allele-fraction threshold
  • Insertions and deletions are not counted as variant alleles and do not color the track
• This can be helpful in drawing attention to variants or just noisy positions
• The threshold may need to be tweaked to the sensitivity and error correction of the library & sequencing modality
  • UMID/consensus calling/liquid biopsy may warrant lower thresholds
• NB: IGV defaults to 0.2 here (20%) – this is much too high for somatic laboratories/assays
• This assay has a nominal sensitivity of about 5% - we will set the highlight to be about half of this – say 0.02
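The effect of the threshold can be illustrated with a toy calculation. This is a simplified sketch, not IGV's actual implementation (which has additional options such as quality weighting): it counts the bases observed at one position and reports any non-reference base whose fraction exceeds the threshold. The pileup counts are invented for illustration.

```python
from collections import Counter


def flagged_alleles(bases, ref_base, threshold):
    """Return {base: fraction} for non-reference bases above the threshold."""
    counts = Counter(bases)
    depth = sum(counts.values())
    flagged = {}
    for base, n in counts.items():
        fraction = n / depth
        if base != ref_base and fraction > threshold:
            flagged[base] = fraction
    return flagged


# Hypothetical position at 1000x depth: 970 A (reference), 25 T, 5 G
pileup = "A" * 970 + "T" * 25 + "G" * 5

at_default = flagged_alleles(pileup, "A", threshold=0.2)   # 20% setting
at_lowered = flagged_alleles(pileup, "A", threshold=0.02)  # 2% setting

print("Highlighted at 0.2: ", at_default)
print("Highlighted at 0.02:", at_lowered)
```

With the 0.2 setting the 2.5% variant is not highlighted at all; at 0.02 it is, while the 0.5% background base still stays quiet.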
Parameter play – soft clipping
• View → Preferences → Alignments → Show soft-clipped bases
• Soft-clipped bases are bases that are present in the read sequence but not considered part of the alignment by the upstream bioinformatics pipeline
  • These may be adapter sequences, barcodes, or inadvertently clipped patient sequence
• Some pipelines ‘hard-clip’ – simply remove unaligned bases. This is poor practice.
• For pipelines that soft-clip, it can be incredibly helpful in some settings to show the soft-clipped bases, as they may provide hidden support for sequence variants – we will have an example of this soon
Parameter play – show center line
• Preferences → Alignments → Show center line
• If on, shows the exact center position, very useful for getting absolute nucleotide positions/c. position
• May want to turn off for figures
• Let’s make sure it’s on and calculate the c. of the APC mutation
  • Navigate to APC mutation only (cannot move track sideways or zoom in/out in multi-locus view)
• To calculate c. position:
  • Ensure alignment is correct (3’ alignment)
  • What codon is it (be careful of multiple reference transcripts)?
  • Is it a positive strand (left to right) or negative strand (right to left) gene – how can you tell?
    • Zoom out until you see the intronic arrows
  • c.1 is the A of the initiating methionine codon (ATG) that starts translation
  • So c. position = [Codon #] * 3 – correction factor, where correction factor = 0 for the third nucleotide of the codon, 1 for the second nucleotide, and 2 for the first nucleotide
  • Using NM_000038 as the canonical transcript for APC, what is the c. number?
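The codon arithmetic can be written as a tiny function. This is a sketch of the formula on this slide, checked against the BRAF example used earlier (codon 600, second nucleotide, c.1799); the function name is illustrative, and the APC answer is left as the exercise.

```python
def c_position(codon_number, nucleotide_in_codon):
    """Coding DNA (c.) position of nucleotide 1, 2 or 3 within a codon."""
    if nucleotide_in_codon not in (1, 2, 3):
        raise ValueError("nucleotide_in_codon must be 1, 2 or 3")
    # Correction factor: 2 for the first, 1 for the second, 0 for the third
    correction = 3 - nucleotide_in_codon
    return codon_number * 3 - correction


# BRAF V600E changes the second nucleotide of codon 600 (GTG to GAG)
print(c_position(600, 2))  # c.1799

# Sanity check: the A of the initiating ATG is c.1
print(c_position(1, 1))
```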
Parameter play – feature flanking region
• If you use multi-locus view, this may be important
• Preferences → General
  • Amount of flanking sequence added to regions in multi-locus view
• For targeted clinical sequencing, you want this number to be small, or you won’t be able to see details (too zoomed out)
Display play
• Right click on alignments – color reads
• Right click on alignments – sort reads
  • Sorting by base can be HIGHLY useful
    • Very low frequency events – e.g. liquid biopsy
    • To establish / interrogate phase relationships
Collapsing Sidetrack
• IGV can show 3 visual modes for tracks
  • Collapsed (smallest)
  • Squished
  • Expanded (largest)
• What these modes do depends a little on the kind of track:
  • Sequence tracks never actually overlap even if collapsed
  • Annotation tracks may overlap/consolidate when collapsed
• Hovering over an annotation track may indicate it is collapsed (more than one annotation is seen)
• Right-clicking on a track will show a context menu that gives the visual mode and allows it to be changed
• This is particularly important for genes with more than one reference transcript
  • Does APC have more than one reference transcript? How many?
The importance of alignment
• Load lung adenocarcinoma example #1
• How many mutations?
• HGVS says that if sequence variants are separated by at least one base pair of wild-type sequence, they should be expressed individually
  • E.g. c.[4_6del;8C>T] – a 3 bp deletion slightly separated from a SNV
• HGVS also states that all alignments need to be as 3’ as possible (‘right-aligned’)
  • Whether this is right or left on the IGV screen depends on whether the gene is on the plus strand (‘genomic sense’ or ‘positive’) or negative strand (‘genomic antisense’ or ‘negative’)
• What strandedness is EGFR? (sense, positive)
• Is the deletion 3’ aligned? (no)
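For a gene on the plus strand, 3’ alignment means pushing a deletion as far right as it can go without changing the resulting sequence. Here is a minimal sketch of that shift, demonstrated on the artificial CAT-repeat reference from the alignment review rather than on EGFR; the function name is illustrative. For a minus-strand gene the same idea applies in the opposite genomic direction.

```python
def right_align_deletion(reference, start, length):
    """Shift a deletion (0-based start) rightward while the result is unchanged.

    Moving the deletion one base to the right gives the same sequence
    whenever the base entering the deletion on the right equals the base
    leaving it on the left.
    """
    while (start + length < len(reference)
           and reference[start] == reference[start + length]):
        start += 1
    return start


REF = "GGGCATCATCATGGG"

# The leftmost placement (first CAT, index 3) shifts to the last CAT (index 9)
shifted = right_align_deletion(REF, 3, 3)
print("Right-aligned start (0-based):", shifted)

# A deletion that is already right-aligned does not move
print("Already aligned:", right_align_deletion(REF, 9, 3))
```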
The importance of being aligned
Alignment can be more than misleading
• 50-gene amplicon hotspot NGS panel
• Tumor only – no germline control
• 80% tumor cellularity
• Nearly all melanomas have mutations on our panel
• No mutations were called
• Manually reviewed NRAS, KIT, BRAF amplicons
• Load melanoma with no mutation calls example
A case of melanoma
• Is there a BRAF V600E mutation?
• Why was this not called?
• Why is that phenomenon so extreme – could there be an explanation?
• And a setting we could tweak to prove our hypothesis?
The importance of alignment
Missing reads?
Showing soft-clipped bases . . .
A case of melanoma
Conclusions:
• A novel BRAF mutation is present: NM_004333.4(BRAF):c.[1799T>A;1802_1813del] p.V600_W604delinsE
• Almost missed because a deletion near the amplicon ends resulted in suboptimal alignment and soft-clipping of reverse reads
• We can infer driver functionality (but not responsiveness to inhibition)
Phasing – a second case of melanoma
• Tumor & germline paired sequence
• Load melanoma with three mutation calls example
• 3 mutation calls in BRAF (highly pipeline dependent)
Phasing is challenging
• In cis – same allele/same read, always. One mutation/one protein.
• In trans – distinct allele/different read, always. Two mutations/two proteins.
• Subclonal – one allele subsumes the other – mutations are in cis when both are present, but reads containing only the truncal mutation are also present. Two mutations/two proteins, but different than in trans.
• Optimally, pipelines give options to select from
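The three definitions above can be expressed as a decision on read counts. This is an idealized sketch only: it assumes counts are restricted to reads spanning both variant positions, and it uses exact zeros where a real pipeline would need a noise tolerance. The function and argument names are hypothetical.

```python
def classify_phase(reads_a_only, reads_b_only, reads_both):
    """Classify the phase of variants A and B from spanning-read counts."""
    if reads_both > 0 and reads_a_only == 0 and reads_b_only == 0:
        # The variants are always seen together
        return "in cis"
    if reads_both == 0 and reads_a_only > 0 and reads_b_only > 0:
        # The variants are never seen together
        return "in trans"
    if reads_both > 0 and (reads_a_only == 0) != (reads_b_only == 0):
        # One (truncal) variant is also seen alone; the other never is
        return "subclonal"
    return "indeterminate"


print(classify_phase(0, 0, 50))    # always together
print(classify_phase(40, 35, 0))   # never together
print(classify_phase(30, 0, 20))   # A alone and A+B, but never B alone
print(classify_phase(10, 10, 10))  # pattern not explained by one model
```

The last case is a reminder that noisy or mixed data may not fit any of the three patterns cleanly, which is why manual review in IGV (sorting reads by base) remains useful.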
Phasing can be helpful/informative
• Phasing with (either in cis or fully in trans) a known germline polymorphism strongly supports germline origin
• Subclonal phasing with a known germline polymorphism strongly supports somatic origin
• Too many alleles? (>2 for autosomal loci, possibly >1 for chrX/Y in men)
  • At least one is somatic!
Phasing can be helpful/informative
• 52 y/o with lung adenocarcinoma
• Treatment naïve
• Load lung adenocarcinoma example #3
• EGFR L858R is a known common driver missense mutation
• EGFR T790M is a known common treatment-induced resistance mutation
• EGFR c.2361G>A is a very common silent germline polymorphism
Phasing can be helpful/informative
• Conclusion – this is a germline EGFR T790M mutation
  • (Probably) pathogenic germline mutation associated in the literature with hereditary lung cancer
Sequence variants can be complex
• 47 y/o with lung adenocarcinoma, several mutations called in EGFR exon 19
• Load lung adenocarcinoma example #2
How many mutations do you see here?
• Four: 18 bp deletion and 3 SNVs
• Three: 18 bp deletion, SNV, and a dinucleotide mutation
• Two: 18 bp deletion + non-frameshifting 5 bp delins
• One: 26 bp complex non-frameshifting delins
Reasonable HGVS representations:
NM_005228.3(EGFR):c.[2240_2257del;2261A>G;2264_2265delinsAT] p.L747_A755delinsSRD (nominal)
NM_005228.3(EGFR):c.2240_2265delinsCGAGAGAT p.L747_A755delinsSRD
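A quick arithmetic check shows why both representations are non-frameshifting and describe the same net change in length. The sketch below works only from the coordinates and inserted bases written in the two HGVS strings; it does not use the EGFR reference sequence, and the helper names are illustrative.

```python
def span_length(start, end):
    """Number of bases in an inclusive c. coordinate range."""
    return end - start + 1


def is_in_frame(deleted_bp, inserted_bp):
    """True when the net length change is a multiple of 3."""
    return (inserted_bp - deleted_bp) % 3 == 0


# Single complex event: c.2240_2265delinsCGAGAGAT
deleted_single = span_length(2240, 2265)
inserted_single = len("CGAGAGAT")
net_single = inserted_single - deleted_single

# Separate events: 2240_2257del, 2261A>G (no length change), 2264_2265delinsAT
net_separate = (-span_length(2240, 2257)
                + 0
                + (len("AT") - span_length(2264, 2265)))

print("Deleted:", deleted_single, "bp; inserted:", inserted_single, "bp")
print("Net change (single delins):  ", net_single)
print("Net change (separate events):", net_separate)
print("In frame:", is_in_frame(deleted_single, inserted_single))
```

Both descriptions remove a net 18 bp (six codons), consistent with a protein-level delins rather than a frameshift.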
Thresholding: Danger
[Figure: plot of VAF (axis ticks 0 to 10) against sample # (axis ticks 0 to 200)]
Thresholding
• VAF- or variant-count-based thresholding is common in variant calling pipelines and is a common source of outlier-based false positive calls, which may appear plausible when viewed in isolation from the underlying noise
  • VAFs may be outliers due to limited DNA quality
  • Variant counts may be outliers due to CNVs
• Protections:
  • Have a positional error model for thresholding
  • Review multiple samples at a novel lower-quality positional call
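As one possible illustration of a positional error model, a call can be compared with the background VAFs observed at the same position in other samples on the same assay, instead of against a single global cutoff. This is a deliberately simple sketch: the mean plus 3 standard deviations rule and all VAF values below are assumptions for illustration, not a validated method or real data.

```python
from statistics import mean, stdev


def exceeds_background(sample_vaf, background_vafs, n_sd=3.0):
    """True if the sample VAF is above mean + n_sd * SD of the background."""
    cutoff = mean(background_vafs) + n_sd * stdev(background_vafs)
    return sample_vaf > cutoff


# Hypothetical VAFs (%) at the same position in other samples
noisy_position = [1.8, 2.2, 2.0, 2.4, 1.6]
quiet_position = [0.1, 0.0, 0.2, 0.1, 0.1]

# The same 2.5% observation means very different things at each position
call_at_noisy = exceeds_background(2.5, noisy_position)
call_at_quiet = exceeds_background(2.5, quiet_position)

print("2.5% VAF at the noisy position stands out:", call_at_noisy)
print("2.5% VAF at the quiet position stands out:", call_at_quiet)
```

A flat 2% threshold would call both; a position-aware comparison calls only the one that rises clearly above its own background.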
Understanding positional noise to avoid false positive outliers (sample comparison)
• Liquid biopsy platform using UMIDs and high levels of noise reduction to sequence ccfDNA – nominal sensitivity about 0.2%
• Several TP53 mutation calls made
• Load positional noise example
A case of myeloproliferative disorder
• 81-gene leukemia/MDS/MPD panel run
• JAK2, MPL were wild-type
• Several called variants in CALR
A case of myeloproliferative disorder
• Load myeloproliferative/myelofibrosis example
• Do you see the variants? Real? Pathogenic?
• Sometimes you need to copy the reads and mock things up in a text editor to be sure. Let’s try it!
A case of myeloproliferative disorder
• Short in-frame deletions
• Often well-tolerated (similar to missense)
• Prone to occur in triplet repeat areas
• Can be relatively common germline polymorphisms
A case of myeloproliferative disorder
• The apparent dinucleotide subheterozygous variant is an alignment artifact that is allele specific to the polymorphic allele
• Always question if multiple alignments around indels might be the same sequence, whether somatic or germline
Extreme trans complexity, two examples
• Most commonly seen in the context of circulating DNA, whether heme malignancy or liquid biopsy
• Case 1: Refractory acute myeloid leukemia patient
  • 4 distinct NRAS mutations called
  • Load AML example
Extreme trans complexity, two examples
• Case 2: Patient with BRAF V600E positive metastatic cholangiocarcinoma
  • Had received dabrafenib/trametinib with initial good response followed by progression
  • Liquid biopsy shows truncal BRAF V600E plus at least 6 low level mutations in KRAS & NRAS
  • Load liquid biopsy example
Variant classification – high level and only my opinions
• ExAC – best source for population polymorphisms
  • http://exac.broad.org
  • Allows VCF level URLs, e.g. http://exac.broadinstitute.org/variant/22-46615880-T-C
• ClinVar – best source for pathogenicity
  • https://www.ncbi.nlm.nih.gov/clinvar/
• COSMIC – most comprehensive literature based summary
  • https://cancer.sanger.ac.uk/cosmic
  • Some caution regarding germline inclusion is needed
Variant classification – what I know
• Occasionally, phasing with known germline polymorphisms or the presence of ‘too many alleles’ can be used to definitively ascribe somatic origin even without germline comparison
  • Does not mean pathogenic …
• For germline mutations,
  • Frameshift mutations (indels whose length is not divisible by 3) and nonsense point mutations can generally be inferred to be deleterious (rare exceptions for mutations at the extreme carboxyl end of the protein)
• For somatic mutations,
  • Deleterious mutations in tumor suppressor genes can generally be inferred to be oncogenic
• That’s about it!
Summary
• Trust but verify your pipeline, alignments, and annotations
• Review positive calls – they may be more complex than called
• Review pertinent negatives
• Use Occam’s razor when approaching complex indels
• Strand bias is useful as a discriminator but does not rule out a true mutation. Knowing about soft-clipping can save you
• Be wary of phasing but know that it can also help you
• Be wary of thresholding artifacts and calling an outlier in the noise as something real