Date post: | 15-Jul-2015 |
Category: |
Health & Medicine |
Upload: | gabe-rudy |
Clinical Grade Annotations: Public Data Resources for
Interpreting Genomic Variants
February 19, 2015
Gabe Rudy
@gabeinformatics
VP Product Management and Engineering
Golden Helix
My Background
Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Thousands of users worldwide
- Over 800 customer citations in journals
Products I Build with My Team
- SNP & Variation Suite (SVS)
- SNP, CNV, NGS tertiary analysis
- Import and deal with all flavors of upstream data
- VarSeq
- Annotate and filter variants in gene panels, exomes and
genomes for clinical labs and researchers.
- GenomeBrowse (Free!)
- Visualization of everything with genomic coordinates.
All standardized file formats.
Agenda
1 How I Met My Exomes
2 Getting High Quality Variant Calls
3 Clinical Grade Candidate Variant Identification
4 Data Sharing and the Maturing of Public Resources
5 NGS Clinical Utopia: Are We There Yet?
Exome Sequencing in Consumer Genomics
Exomes done as part of Pilot Program
80x coverage
Raw data with no interpretation
Erin
JIA
Gabe (me)
Ethan
Research or clinical grade?
Total Reads: 140M
Unique Align: 87%
Mean Target: 105x
% Target at 2x: 97%
% Target at 10x: 94%
% Target at 20x: 89%
% Target at 30x: 83%
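The "% Target at Nx" rows are the share of targeted bases covered by at least N reads. A minimal sketch of that calculation, assuming a hypothetical list of per-base depths (in practice this would come from a coverage tool); the function name and the toy depths are illustrative and are not the data behind the table above.

```python
def fraction_at_depth(depths, thresholds=(2, 10, 20, 30)):
    """Return {threshold: fraction of target bases with depth >= threshold}."""
    if not depths:
        raise ValueError("no target bases supplied")
    total = len(depths)
    return {t: sum(d >= t for d in depths) / total for t in thresholds}


# Toy per-base depths for a tiny hypothetical target region (not real data).
toy_depths = [0, 5, 12, 25, 40, 80, 105, 150]
coverage = fraction_at_depth(toy_depths)

for threshold, fraction in coverage.items():
    print(f"% Target at {threshold}x: {fraction:.1%}")
```

Reporting several thresholds rather than only the mean matters because a high mean depth can hide poorly covered targets.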
PSPH mis-alignment
Splice Mutation
GRCh38 – Here Now, but no hurry
A better human reference
- Revised Cambridge Reference
Sequence (rCRS) MT
- Has centromere models
- ~2000 incorrect alleles fixed
- ~100 assembly gaps updated
NCBI Annotations 106 on 38
- dbSNP 141, ClinVar,
RefSeqGene
- Ensembl 76 on both
No Population Catalogs
- Some being ported (by Ensembl, dbSNP)
My Exome, GRCh37 vs GRCh38
Ts/Tv: 2.06558 (GRCh37), 2.10171 (GRCh38)
[Bar chart: My Exome variant counts stacked by type (snps, mnps, indels, complex), GRCh37 vs GRCh38; y-axis 270,000 to 340,000; bar labels: 331,824, 319,442]
Blog Post
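The Ts/Tv figures above are the ratio of transitions (A<->G, C<->T) to transversions (all other single-base changes) among called SNVs. A minimal sketch of the calculation, assuming variants are already available as hypothetical (ref, alt) pairs; the toy list is made up and does not reproduce the exome results.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snvs):
    """Compute transitions / transversions over (ref, alt) pairs.

    Anything that is not a single-base A/C/G/T substitution is skipped.
    """
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ts / tv


# Toy input: four transitions, two transversions, one indel that is skipped.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
            ("A", "C"), ("G", "T"), ("AT", "A")]
ratio = ts_tv_ratio(toy_snvs)
print(f"Ts/Tv = {ratio:.2f}")
```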
Baylor Workflow - Clinical Exomes Paper
Disease gene related
Medically actionable
deleterious variants
Deleterious variants in ACMG
gene list
Deleterious variants
VUS in dominant gene or
homozygous in recessive
gene
Deleterious variant in gene
with no known disease
Annotate, Then Filter and Interpret
Data Sources to Replicate Workflow
1000 Genomes (Phase 1)
“ESP” (NHLBI 6500 Exomes v2)
HGMD (Public vs Professional)
Variant’s Protein Coding Effect
RNA Splicing Effect (dbscSNV)
- −3 to +8 at the 5’, −12 to +2 at the 3’
Gene Lists:
- Single-Gene Disorder (OMIM with Inheritance)
- Medically Actionable (114 genes NHLBI study)
- Dominant Inheritance (MedGen)
- ACMG Carrier Panel (ACMG Incidental
Findings guidelines)
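The splicing windows listed above (−3 to +8 at the 5’ site, −12 to +2 at the 3’ site) can be expressed as a simple range check. A minimal sketch, assuming offsets are signed positions relative to the exon/intron boundary with no position 0, and using the hypothetical labels "donor" for the 5’ site and "acceptor" for the 3’ site; computing the offset from genomic coordinates and transcript strand is left out.

```python
# Inclusive offset windows around each splice site, as given on the slide.
SPLICE_WINDOWS = {
    "donor": (-3, 8),      # 5' splice site
    "acceptor": (-12, 2),  # 3' splice site
}


def in_splice_window(site, offset):
    """Return True if a signed offset falls inside the window for the site.

    Assumption: offsets skip 0 (…, -2, -1, +1, +2, …), so 0 is rejected.
    """
    if offset == 0:
        raise ValueError("offset 0 is not defined in this convention")
    low, high = SPLICE_WINDOWS[site]
    return low <= offset <= high


print(in_splice_window("donor", 8))       # last position inside the 5' window
print(in_splice_window("donor", 9))       # just outside
print(in_splice_window("acceptor", -12))  # first position inside the 3' window
```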
My Exome Analyzed
Start: 235,689
[Filter-chain diagram; node counts in extraction order: 847; 234,842; 224,914; 9,928; 9,069; 807; 859; 40; 242; 13; 59; 565; 0; 624; 624; 255; 20; 20; 20; 0; 0; 598; 644]
• Pathogenic OTC Variant
• What if I got this through BabySeq?
Annotating against Transcripts
RefSeqGenes – Versioned on RNA sequence
- Annotated against human reference by “Annotation Releases”
- Last on GRCh37 was 105 (2013-08-20) – GRCh38 release 106 (2014-01-17)
- 84,950 transcripts, most are “predicted” (“XM_” and non-coding)
- Standard in US for reporting variation (NM_016335.4:c.123C>T etc)
- UCSC grabs RNA from RefSeq directly and maps to their genome references
“continuously”
Ensembl – Versioned on Alignment
- GENCODE: Well curated subset of high-quality, validated transcripts
- V75 last version of GRCh37, 2014-06-27
- Many specific bio-types, but protein_coding used for annotation
- Has mappings to RefSeq IDs, but
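Because RefSeq transcripts are versioned, a name such as NM_016335.4:c.123C>T carries the accession, its version, and the coding-sequence change as separate pieces. A minimal sketch that splits out those pieces with a regular expression; it handles only simple coding substitutions, is not a full HGVS parser, and the function and field names are illustrative.

```python
import re

# Matches only simple coding substitutions such as NM_016335.4:c.123C>T.
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)"
    r":c\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_simple_hgvs(name):
    """Split a simple c. substitution into accession, version, position, ref, alt."""
    match = HGVS_SUBSTITUTION.match(name)
    if match is None:
        raise ValueError(f"not a simple coding substitution: {name!r}")
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    fields["position"] = int(fields["position"])
    return fields


parsed = parse_simple_hgvs("NM_016335.4:c.123C>T")
print(parsed)
```

Keeping the version as its own field makes it easy to notice when two sources describe a variant against different versions of the same transcript.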
Reference Sequence Versus Gene Sequence
EMG1 on GRCh37
“Gap” of the mRNA coding sequence versus reference seq:
Handled differently by 3 different “gene alignments”
Reference Sequence Versus Gene Sequence
EMG1 on GRCh38
Reference sequence patched, no gap
Alignments agree
RefSeq Accession Not Sufficient for Var-Tx Interaction
RefSeq defines transcripts as mRNA sequence
NCBI “Annotation Releases” (like v105) provides alignments using “Splign”
UCSC pulls RefSeq mRNA and aligns it themselves using “BLAT”
They can choose equally valid but different alignments for the same accession
This alignment of NM_052814.3 places the exon at dramatically different loci.
Will result in different annotations of any variant overlapping these exons
COSMIC
Does not provide data in an
easy-to-use form for NGS
Just announced a change in
licensing, effective in March
- Access to the COSMIC website will
stay free for all users.
- The new licensing strategy will
charge for-profit organisations to
download COSMIC datasets.
- Download by academic and non-
profit organisations will remain free
2015 Roadmap:
- GRCh38
- More curation
- Visualization improvements
ClinVar
Submitters:
- OMIM: Johns Hopkins
- Samuels
- Lab for Molecular Medicine
- Invitae
- Emory Genetics Lab
Star rating system
- 0-4 stars – level of review
ClinVar is designed to provide a freely accessible,
public archive of reports of the relationships
among human variations and phenotypes, with
supporting evidence.
ClinVitae: ClinVar and Friends by Invitae
Sources:
- ClinVar (62,913)
- Emory (13,365)
- ARUP (2,850)
- Carver Mut (199)
- K Cunningham (581)
79,907 variants, 9,189 genes
- 32,523 Pathogenic
- 38,796 Likely Pathogenic
Provided in HGVS
- 59,878 after mapping to genomic space
BRCA: The back door to Myriad’s database
1995 – Patent issued
to Myriad Genetics
June 2013 – Patents
invalidated by ruling
A lab setting up Dx
has a lot of catching up to do
“Free the Data” and
other ways in which
Myriad’s data is in
ClinVar, etc.
Sharing Clinical Reports Project
BRCA: In my wife
HGMD
Data mines academic
papers for reported
functional variants
Also takes
submissions,
corrections reviewed by
team
First available in 1996
- Originally 10k variants
- 105k in Public (2014)
- 148k in “Pro” (2014)
Left-Align Delta F508 to Make it Match
Left-Align Annotations
Using a Smith-Waterman algorithm to left-align variants
from public databases shows non-obvious differences
NGS alignment and variant calling are always left-aligned
Left-align your databases so they can be annotated
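To illustrate what left-alignment does to an indel in a repeat, here is a minimal sketch. It uses the simple trim-and-shift normalization found in common VCF normalizers, not the Smith-Waterman approach named on the slide, and it works on a toy sequence rather than a real locus. Positions are 0-based indexes into the sequence, and all names are illustrative.

```python
def left_align(ref_seq, pos, ref, alt):
    """Shift an indel as far left as possible and trim shared bases.

    ref_seq: reference sequence string
    pos:     0-based index in ref_seq of the first base of `ref`
    ref/alt: VCF-style alleles (both non-empty, sharing an anchor base)
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    if ref_seq[pos:pos + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pull in the base to the left.
        if not ref or not alt:
            if pos == 0:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            ref, alt = ref_seq[pos] + ref, ref_seq[pos] + alt
            changed = True

    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CT repeat; the same 2-base deletion written two ways.
toy_ref = "AGCTCTCTG"
right_aligned = (5, "TCT", "T")   # deletion described at the right end of the repeat
print(left_align(toy_ref, *right_aligned))
```

Two database records that look different can only be recognized as the same deletion after both are normalized to the same position and alleles.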
Changes in Monthly Updates
• 36 variants went missing from the
December to January release
• Some were Pathogenic
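Dropped records like these can be spotted by diffing consecutive releases. A minimal sketch, assuming each release has already been loaded into a hypothetical dict keyed by (chromosome, position, ref, alt) with the clinical significance as the value; the records below are made up for illustration.

```python
def missing_between(old_release, new_release):
    """Return variant keys present in the old release but absent from the new one."""
    return sorted(set(old_release) - set(new_release))


# Made-up records keyed by (chrom, pos, ref, alt); not real database entries.
december = {
    ("1", 100, "A", "G"): "Pathogenic",
    ("2", 200, "C", "T"): "Benign",
    ("3", 300, "G", "GA"): "Pathogenic",
}
january = {
    ("2", 200, "C", "T"): "Benign",
}

dropped = missing_between(december, january)
dropped_pathogenic = [key for key in dropped if december[key] == "Pathogenic"]

print(f"{len(dropped)} variants dropped, {len(dropped_pathogenic)} were Pathogenic")
```

The comparison is only meaningful if both releases use the same normalized representation for each variant.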
ClinVar’s VCF File
• ClinVar currently relies on their
dbSNP identifier mappings to
“build” VCF files
• There are ~14,000 small variants
in their database without dbSNP
identifiers, and thus missing from
the VCF
• ~5K Pathogenic
• Often these variants are in newer
dbSNP builds, and the ClinVar
mappings are just not updated.
• This variant was in ClinVar, with
genomic coordinates, but no
RSID:
- HGVS(c.): NM_002894.2:c.298C>T
- Chromosome:Start:Stop: 18:20548818:20548818
- (Recently RSID was added)
dbSNP 141 Had Allele Errors
I reported the issue
7/22/2014
Confirmed, 8/12
generated better VCF
and placed in “test”
folder
Found more issues
Replaced official VCF
in 02/09/2014
We waited until fixed
to publish official
support
NM_002626.4:c.1877G>C in PFKL
NP_002617.3:p.Arg626Pro missense mutation
Predicted damaging by 4/5 functional predictions
VEST3: 0.948, GERP++: 4.59
ExAC and 1kG have a G>A, but G>C is novel
Variants in region are extremely rare (G>A ExAC 4 of 122,364 alleles) – 0.003%
No ClinVar variants for gene
OMIM entry has no known disease association
PubMed search shows few recent articles: Most recent 1998 paper showed
- phosphofructokinase (PFKL) overexpressed in Down syndrome (DS)
- Transgenic PFKL mice had an abnormal glucose metabolism with reduced clearance
rate from blood and enhanced metabolic rate in brain.
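The 0.003% figure above is the allele count divided by the number of alleles observed. A minimal sketch of that arithmetic; the function name is illustrative.

```python
def allele_frequency(allele_count, allele_number):
    """Allele frequency = observed alternate alleles / total alleles genotyped."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    return allele_count / allele_number


# Counts quoted on the slide: 4 of 122,364 alleles.
af = allele_frequency(4, 122_364)
print(f"{af:.3%}")
```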
35 LoF Variants, None Homozygous
Training
Most variants are rare or novel
- Training to interpret these is
extensive
MD/Pathology background is
insufficient
Need a PhD in molecular
genetics
There are only 500 board-certified
Clinical Molecular Geneticists
since certification started
Let’s share in the learning
process
Baylor Exome Sign-Out
Phenotyping and Matchmaking Portals
Diagnosis often requires finding
another family to confirm a novel
gene to phenotype association
Finding a second family:
- Social media
- PhenoDB
- PhenomeCentral.org
- Orphanet – Resources on over 6000 rare
diseases and orphan drugs.
- European centric: GEN2PHEN (G2P)
Matt Might found a second
family with NGLY1
deficiency through a blog
post that went viral.
N-Glycanase Deficiency
http://www.ngly1.org/
Matthew Might and Matt Wilsey. The
shifting model in clinical diagnostics:
how next-generation sequencing and
families are altering the way rare
diseases are discovered, studied,
and treated. Genetics in Medicine.
March 2014.
Thank you
Heidi Rehm – Chief Laboratory Director at
Laboratory for Molecular Medicine,
PCPGM
Joel Parker – Cancer Genetics, UNC
Chapel Hill
Gerry Higgins – VP, Pharmacogenomic
Science, Assure Rx Health
Frank Schacherer – Chief Technical
Officer, BIOBASE
Reece Hart – Computational Biologist,
Invitae (now 23andMe)
Greta Linse Peterson – Director of Product
Management and Quality, Golden Helix
Questions?