The Clinical Significance of Transcript Alignment Discrepancies

Date post: 28-Aug-2014
Upload: reece-hart
Description:
Gene transcripts are the lens through which we understand variants that are identified by genome sequencing, reported in scientific literature, and communicated on clinical reports. An accurate, shared representation of transcripts is essential to communicating variants reliably. This talk presents observations of significant discrepancies between sources of transcripts that will lead to discrepancies in the clinical interpretation of variants, and tools that we have released to contend with these complexities.
Transcript
Page 1: The Clinical Significance of Transcript Alignment Discrepancies

Ⓒ 2014 Invitae

Reece Hart, Ph.D.
[email protected]

Human Variome Project Meeting 2014, Paris

The Clinical Significance of Transcript Alignment Discrepancies
… and tools to help you deal with them.

Page 2: The Clinical Significance of Transcript Alignment Discrepancies

The fidelity of transcript-genome mapping matters.

Variants are identified and computed on in genome coordinates.

Variants are analyzed and communicated using transcript coordinates.

genome to transcript (g. to c.)

transcript to genome (c. to g.)
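The two mapping directions can be illustrated with a minimal sketch. It assumes a hypothetical plus-strand transcript whose exons are given as 0-based, half-open genomic intervals, and it ignores UTRs, strand, and indels; the names `g_to_t`, `t_to_g`, and `EXONS` are illustrative and do not come from the talk or from any library.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # hypothetical 0-based, half-open genomic interval

def g_to_t(exons: List[Exon], g_pos: int) -> Optional[int]:
    """Genome to transcript: 0-based transcript offset, or None if intronic."""
    consumed = 0  # transcript bases contributed by earlier exons
    for start, end in exons:
        if start <= g_pos < end:
            return consumed + (g_pos - start)
        consumed += end - start
    return None

def t_to_g(exons: List[Exon], t_pos: int) -> int:
    """Transcript to genome: genomic position of a 0-based transcript offset."""
    if t_pos < 0:
        raise ValueError("transcript offset must be non-negative")
    for start, end in exons:
        length = end - start
        if t_pos < length:
            return start + t_pos
        t_pos -= length
    raise ValueError("transcript offset beyond last exon")

# Made-up exon structure, for illustration only.
EXONS = [(1000, 1100), (1500, 1650), (2000, 2080)]
```

Both functions depend entirely on the exon table, so two sources that disagree on exon coordinates will disagree on every mapped position downstream of the difference.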

Page 3: The Clinical Significance of Transcript Alignment Discrepancies

Motivation 1: Discordant exon coordinates
NCBI and UCSC report different coordinates for NM_052813.3, exon 12

[Figure: exon 12 as aligned by UCSC (BLAT) and by NCBI (Splign); exon 12 is displaced by 322 nt.]

Consequences:
1. An assay that targets the wrong genomic region will generate uninformative sequence data.
2. A genomic variant will be interpreted as exonic when it is intronic, or vice versa.

Page 4: The Clinical Significance of Transcript Alignment Discrepancies

Motivation 2: indels confound mapping
NM_006158.3 (NEFL) contains indel in CDS

Page 5: The Clinical Significance of Transcript Alignment Discrepancies

Challenges and Solutions in Transcript Management

➢ Biological
● Alternative splicing
● Paralogs
● Natural polymorphisms
● Alternative references

➢ Technical / Logistical
● Multiple transcript sources
● Multiple alignment methods
● Multiple references
● Genome-transcript sequence differences
● Historical transcript alignments

➢ Existing resources
● RefSeq, UCSC, Ensembl
● Locus Reference Genomic
● Mutalyzer

➢ See also
● McCarthy DJ, et al. Genome Medicine 6:26 (2014).
● Garla V, et al. Bioinformatics 27(3): 416–8 (2010).

Page 6: The Clinical Significance of Transcript Alignment Discrepancies

Universal Transcript Archive (UTA)

➢ Single database of:
● Multiple transcripts and versions
● … from multiple sources
● … aligned to multiple references
● … by multiple alignment methods

➢ Freely available!
● Apache licensed
● Public PostgreSQL database instance at uta.invitae.com:5432
● Local installation instructions
● Code at http://bitbucket.org/invitae/uta/

Page 7: The Clinical Significance of Transcript Alignment Discrepancies

Our Bermuda Triangle

[Diagram: a triangle whose corners are RefSeq (NM), Ensembl (ENST), and Genome (GRCh37).]

RefAgree: Do transcript and genome sequences agree?

Transcript Equivalence: Which RefSeq and Ensembl transcripts are equivalent?

➊ SNV
➋ Indel
➍ Historical Transcripts

Page 8: The Clinical Significance of Transcript Alignment Discrepancies

Universal Transcript Archive (UTA)
Multiple sources, multiple versions, multiple alignment methods in one database

[Schema diagram: an "exon set" table (rows below) linked to an "exons" table.]

transcript    reference     method
NM_01234.4    NM_01234.4    self
NM_01234.4    NC_000012.3   splign
NM_01234.5    NM_01234.5    self
NM_01234.5    NC_000012.3   splign
NM_01234.5    AC_45678.9    splign
NM_01234.5    NC_000012.3   blat
ENST012345    ENST012345    self
ENST012345    NC_000012.3   genebuild

Page 9: The Clinical Significance of Transcript Alignment Discrepancies

Universal Transcript Archive (UTA)
Multiple sources, multiple versions, multiple alignment methods in one database

exon alignments
NM_01234.4    NC_000012.3   0   50=
NM_01234.4    NC_000012.3   1   100=1X49=   ➊
NM_01234.4    NC_000012.3   2   5=1I44=     ➋

Alignments use coordinates from source databases.
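The exon alignment strings above (50=, 100=1X49=, 5=1I44=) read as extended CIGAR strings. The sketch below parses them and maps a transcript offset within an exon to a reference offset. It assumes the SAM-style convention with the transcript as query (I consumes transcript bases only, D consumes reference bases only); whether UTA stores its alignments in that orientation is not stated on the slide, and the function names are illustrative.

```python
import re
from typing import List, Optional, Tuple

_CIGAR_RE = re.compile(r"(\d+)([=XID])")

def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split an extended CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in _CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return ops

def aligned_lengths(cigar: str) -> Tuple[int, int]:
    """Return (transcript length, reference length) spanned by the alignment."""
    tx_len = ref_len = 0
    for n, op in parse_cigar(cigar):
        if op in "=X":      # match or mismatch: both sequences advance
            tx_len += n
            ref_len += n
        elif op == "I":     # assumed: bases present only in the transcript
            tx_len += n
        else:               # "D", assumed: bases present only in the reference
            ref_len += n
    return tx_len, ref_len

def tx_to_ref_offset(cigar: str, tx_off: int) -> Optional[int]:
    """Map a 0-based transcript offset to a reference offset.

    Returns None when the base lies in an insertion and has no reference
    counterpart.
    """
    tx = ref = 0
    for n, op in parse_cigar(cigar):
        if op in "=X":
            if tx_off < tx + n:
                return ref + (tx_off - tx)
            tx += n
            ref += n
        elif op == "I":
            if tx_off < tx + n:
                return None
            tx += n
        else:
            ref += n
    raise ValueError("transcript offset beyond the alignment")
```

Under these assumptions the third exon (5=1I44=) spans 50 transcript bases but only 49 reference bases, so every position after the insertion shifts by one. A mapper that ignores the CIGAR would report those positions one base off.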

Page 10: The Clinical Significance of Transcript Alignment Discrepancies


Page 11: The Clinical Significance of Transcript Alignment Discrepancies


Page 12: The Clinical Significance of Transcript Alignment Discrepancies

“RefAgree” Statistics by Protein Coding Transcript
Sequence concordance between RefSeq and GRCh37 primary assembly

34531 NM transcripts (Jan 2014)
  760 (2.2%) with length discrepancies
 3481 (10%) with substitutions ➊
  321 (0.9%) with deletions ➋
  255 (0.7%) with insertions ➋

cf. Garla V, et al. Bioinformatics 27(3): 416–8 (2010).

Page 13: The Clinical Significance of Transcript Alignment Discrepancies

NCBI (Splign) v. UCSC (BLAT) Alignment Statistics
Splign and BLAT provide significantly different exon structures for 886 transcripts

32358 transcripts w/exon structures
Are Splign and BLAT similar?
  Y: 31472 (97.3%) transcripts
  N: 886 (2.7%) transcripts

“similar” means either
1) identical exon coordinates, or
2) coordinates that differ only by short 3' terminal artifacts
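The "similar" test can be sketched as a predicate over two exon structures. The slide does not say how short a 3' terminal artifact must be, so `max_tail` below is a hypothetical threshold; the sketch also assumes a plus-strand transcript with exons listed 5' to 3', so the 3' terminus is the end coordinate of the last exon.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates

def similar(a: List[Exon], b: List[Exon], max_tail: int = 10) -> bool:
    """Return True if two exon structures are 'similar' in the slide's sense.

    max_tail is an assumed cutoff for a "short" 3' terminal difference.
    """
    if a == b:
        return True  # 1) identical exon coordinates
    if not a or len(a) != len(b) or a[:-1] != b[:-1]:
        return False  # any difference outside the last exon is significant
    (a_start, a_end), (b_start, b_end) = a[-1], b[-1]
    # 2) only the 3' end of the last exon differs, and only slightly
    return a_start == b_start and abs(a_end - b_end) <= max_tail
```

For example, two structures that differ by 3 nt at the end of the final exon count as similar, while an internal exon displaced by 322 nt does not.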

Page 14: The Clinical Significance of Transcript Alignment Discrepancies

Characterization of transcript discrepancies
Whether alignments provided by NCBI and UCSC agree with GRCh37 primary sequence.

            Splign T   Splign F
BLAT T          14         18
BLAT F         545        311

886 transcripts with significant discrepancies

Page 15: The Clinical Significance of Transcript Alignment Discrepancies

Characterization of transcript discrepancies
Reference agreement (blue) and alignment “simplicity” (green)

[Figure: the 2×2 reference-agreement table (Splign T/F by BLAT T/F: 14, 18, 545, 311), with each cell subdivided into a further Splign T/F by BLAT T/F table for alignment “simplicity”. Sub-table values as extracted, in document order:
  200(0), 4(97), 90(82), 16(84)
  6(41), 12(180)
  434(7), 110(652)
  14(11)]

886 transcripts with significant discrepancies

Page 16: The Clinical Significance of Transcript Alignment Discrepancies

Summary of Splign-BLAT gene-wise coordinate deltas.

delta     # genes   # ACMG must report
=0          15206   44
>=1           183    8   LDLR, MYL2, PRKAG2, SDHB, SDHC, TGFBR1, TGFBR2, WT1 (all trivial diffs)
>=10          116    0
>=25            6    0
>=50            5    0
>=250          13    0
>=1000         94    2   MYH7, TNNI3
ND                   3   APOV, MYHBPC3, NTRK

delta ≝ minimum per gene of maximum per transcript of difference of exon coordinates between NCBI and UCSC.
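The delta definition can be written out as a short sketch. The record layout, the gene and transcript names, and the handling of transcripts whose exon counts differ (treated here as not determined) are assumptions made for illustration; the slide does not define ND.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Exon = Tuple[int, int]
Record = Tuple[str, str, List[Exon], List[Exon]]  # gene, tx, NCBI exons, UCSC exons

def transcript_delta(ncbi: List[Exon], ucsc: List[Exon]) -> Optional[int]:
    """Maximum absolute coordinate difference over corresponding exons.

    Returns None when the exon counts differ (an assumed reading of ND).
    """
    if len(ncbi) != len(ucsc):
        return None
    return max(
        (max(abs(ns - us), abs(ne - ue)) for (ns, ne), (us, ue) in zip(ncbi, ucsc)),
        default=0,
    )

def gene_deltas(records: Iterable[Record]) -> Dict[str, Optional[int]]:
    """Per gene: the minimum over transcripts of the per-transcript maximum."""
    per_gene: Dict[str, List[int]] = defaultdict(list)
    for gene, _tx, ncbi, ucsc in records:
        deltas = per_gene[gene]  # registers the gene even if nothing is comparable
        d = transcript_delta(ncbi, ucsc)
        if d is not None:
            deltas.append(d)
    return {gene: (min(ds) if ds else None) for gene, ds in per_gene.items()}

# Made-up records, for illustration only.
RECORDS: List[Record] = [
    ("GENE_A", "tx1", [(100, 200), (300, 400)], [(100, 200), (300, 400)]),
    ("GENE_A", "tx2", [(100, 200)], [(100, 522)]),
    ("GENE_B", "tx3", [(10, 20)], [(12, 20)]),
    ("GENE_C", "tx4", [(1, 2)], [(1, 2), (5, 6)]),
]
```

Taking the minimum per gene means a gene lands in the =0 row as soon as one of its transcripts has identical coordinates in both sources, even if other transcripts of the same gene disagree.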

Page 17: The Clinical Significance of Transcript Alignment Discrepancies

HGVS Python Package
http://bitbucket.org/invitae/hgvs/

➢ Parser
● HGVS → Python object
● Based on a Parsing Expression Grammar

➢ Formatter
● Python object → HGVS

➢ Validator
● intrinsic & extrinsic validation

➢ Mapping tools (indel-aware!)
● g. ↔ c. → p. (m, n, r also supported)
● transcript-to-transcript liftover
● uses UTA data
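The parse and format round trip can be illustrated with a toy model. This is not the hgvs package or its Parsing Expression Grammar: it is a regular-expression sketch that handles only simple c., g., and n. substitutions with an optional intronic offset, and the class and function names are invented for the example.

```python
import re
from dataclasses import dataclass

# Toy pattern for substitutions such as NM_182763.2:c.688+403C>T.
_SUB_RE = re.compile(
    r"(?P<ac>[A-Z]+_\d+\.\d+):(?P<kind>[cgn])\."
    r"(?P<pos>\d+)(?P<offset>[+-]\d+)?(?P<ref>[ACGT])>(?P<alt>[ACGT])"
)

@dataclass(frozen=True)
class Substitution:
    ac: str      # accession with version, e.g. NM_182763.2
    kind: str    # coordinate type: c, g, or n
    pos: int     # base position
    offset: int  # intronic offset, 0 when absent
    ref: str
    alt: str

    def __str__(self) -> str:
        """Formatter: Python object to HGVS string."""
        offset = f"{self.offset:+d}" if self.offset else ""
        return f"{self.ac}:{self.kind}.{self.pos}{offset}{self.ref}>{self.alt}"

def parse_substitution(text: str) -> Substitution:
    """Parser: HGVS string to Python object (simple substitutions only)."""
    m = _SUB_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"unsupported or malformed variant: {text!r}")
    return Substitution(
        ac=m["ac"],
        kind=m["kind"],
        pos=int(m["pos"]),
        offset=int(m["offset"] or 0),
        ref=m["ref"],
        alt=m["alt"],
    )
```

Rejecting a string that lacks the coordinate type, such as NM_001197320.1:281C>T, is a small instance of intrinsic validation: the check needs nothing beyond the string itself.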

Page 18: The Clinical Significance of Transcript Alignment Discrepancies

Example: Variant liftover between transcripts

Map
➀ from NM_182763.2:c.688+403C>T
➁ to NC_000001.10:g.150550916G>A
➂ to NM_001197320.1:c.281C>T
with Splign alignments

[Figure: NM_001197320.1 (NP_001184249.1) and NM_182763.2 (NP_877495.1) aligned to NC_000001.10.]
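The same three-step route (c. on one transcript, to g., to c. on another) can be sketched on made-up data. The exon tables below are hypothetical, and the model is deliberately narrow: plus strand only, c.1 at the first exon base, no UTRs, no indels. The slide's example has complementary alleles in c. and g. (C>T versus G>A), which suggests a minus-strand transcript that this sketch does not handle. It is not the hgvs package's mapper.

```python
import re
from typing import List, Tuple

Exon = Tuple[int, int]  # hypothetical 0-based, half-open genomic interval

def c_to_g(exons: List[Exon], c: str) -> int:
    """Map a c. position such as '110' or '100+50' to a genomic position."""
    m = re.fullmatch(r"(\d+)([+-]\d+)?", c)
    if m is None:
        raise ValueError(f"unsupported position: {c!r}")
    remaining = int(m.group(1)) - 1  # 0-based offset into the spliced transcript
    offset = int(m.group(2) or 0)    # intronic offset from the anchoring exon base
    for start, end in exons:
        if remaining < end - start:
            return start + remaining + offset
        remaining -= end - start
    raise ValueError("position beyond last exon")

def g_to_c(exons: List[Exon], g: int) -> str:
    """Map a genomic position to a c. position, using +/- offsets in introns."""
    consumed = 0  # exonic bases before the current exon
    for i, (start, end) in enumerate(exons):
        if start <= g < end:
            return str(consumed + (g - start) + 1)
        consumed += end - start
        if i + 1 < len(exons) and end <= g < exons[i + 1][0]:
            from_donor = g - (end - 1)          # distance past the last exonic base
            to_acceptor = exons[i + 1][0] - g   # distance before the next exon
            if from_donor <= to_acceptor:
                return f"{consumed}+{from_donor}"
            return f"{consumed + 1}-{to_acceptor}"
    raise ValueError("position outside the transcript")

def liftover(src: List[Exon], dst: List[Exon], c: str) -> str:
    """Lift a c. position from one transcript to another via the genome."""
    return g_to_c(dst, c_to_g(src, c))

# Two made-up transcripts of one gene; TX_B has an extra exon inside TX_A's intron.
TX_A = [(1000, 1100), (2000, 2100)]
TX_B = [(1000, 1100), (1140, 1160), (2000, 2100)]
```

Here the intronic position 100+50 on TX_A lifts over to the exonic position 110 on TX_B, mirroring the slide's move from an intronic c.688+403 on one transcript to an exonic c.281 on the other.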

Page 19: The Clinical Significance of Transcript Alignment Discrepancies

Developer Info

Testing
➢ 91% code coverage
➢ 25665 test variants
● ~200 hand curated, rest from dbSNP
● 23436 sub, 1254 del, 908 ins, 45 delins, 22 dup
● 44 distinct transcripts, many selected for difficulty

Upcoming issues (all issues are publicly readable)
➢ multi-variant alleles
➢ release LRG
➢ GRCh38
➢ API changes

Page 20: The Clinical Significance of Transcript Alignment Discrepancies

Acknowledgements

➢ Vince Fusaro
➢ John Garcia
➢ Emily Hare
➢ Kevin Jacobs
➢ Geoff Nilsen
➢ Rudy Rico
➢ Jody Westbrook

http://bitbucket.com/invitae/
➢ Code (Python)
➢ Documentation & Examples
➢ Issues
➢ BED files
➢ Code testing is public

Or just: pip install hgvs

Page 21: The Clinical Significance of Transcript Alignment Discrepancies


Page 22: The Clinical Significance of Transcript Alignment Discrepancies

UTA solves four issues with transcript management.

[Diagram: RefSeq NM_01234.4, RefSeq NM_01234.5, UCSC NM_01234.5, and the genome reference, annotated with the issues below.]

➊ SNV: transcript ≠ genome reference
➋ InDel: transcript ≠ genome reference
➌ Exon coordinate differences between sources for same accession
➍ Historical transcript alignments no longer available

Page 23: The Clinical Significance of Transcript Alignment Discrepancies
Page 24: The Clinical Significance of Transcript Alignment Discrepancies

ENSTs equivalent with NMs

=> select N.hgnc, N.es_fingerprint, N.tx_ac, E.tx_ac
   from uta_20140210.tx_exon_set_summary_mv N
   join uta_20140210.tx_exon_set_summary_mv E
     on N.es_fingerprint = E.es_fingerprint
    and N.tx_ac ~ '^NM_' and E.tx_ac ~ '^ENST'
    and N.alt_aln_method = 'transcript'
    and E.alt_aln_method = 'transcript';

┌─────────┬──────────────────────────────────┬────────────────┬─────────────────┐
│  hgnc   │          es_fingerprint          │     tx_ac      │      tx_ac      │
├─────────┼──────────────────────────────────┼────────────────┼─────────────────┤
│ AFF2    │ db0e20be1a2bb687c33227d2e6bf9d53 │ NM_002025.3    │ ENST00000370460 │
│ UBE3A   │ d1eace7da295c45378fa5f898f2f03f6 │ NM_130838.1    │ ENST00000438097 │
│ ANXA8L1 │ 1f6fd4f3fe9854aa468489ec7f507512 │ NM_001098845.1 │ ENST00000359178 │
│ APOL5   │ 939a9e9e4a46ef9aef862cf9b369afe6 │ NM_030642.1    │ ENST00000249044 │
│ ARID4B  │ 524fc954d10b08a4014e86aee81d0358 │ NM_016374.5    │ ENST00000264183 │
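The query pairs transcripts whose exon sets share a fingerprint. How UTA actually computes es_fingerprint is not shown on the slide; the sketch below assumes one plausible scheme, an MD5 digest of the sorted exon coordinates, and uses made-up transcripts to show the grouping idea.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]

def exon_set_fingerprint(exons: List[Exon]) -> str:
    """Digest of an exon structure (assumed scheme, not necessarily UTA's)."""
    text = ";".join(f"{start},{end}" for start, end in sorted(exons))
    return hashlib.md5(text.encode("ascii")).hexdigest()

def equivalent_pairs(transcripts: Dict[str, List[Exon]]) -> List[Tuple[str, str]]:
    """Pair NM_ and ENST accessions that share an exon-set fingerprint."""
    by_fingerprint: Dict[str, List[str]] = defaultdict(list)
    for ac, exons in transcripts.items():
        by_fingerprint[exon_set_fingerprint(exons)].append(ac)
    pairs = []
    for acs in by_fingerprint.values():
        nms = [ac for ac in acs if ac.startswith("NM_")]
        ensts = [ac for ac in acs if ac.startswith("ENST")]
        pairs.extend((nm, enst) for nm in nms for enst in ensts)
    return sorted(pairs)

# Made-up transcripts, for illustration only.
TRANSCRIPTS = {
    "NM_01234.4": [(0, 50), (50, 200)],
    "NM_01234.5": [(0, 50), (50, 200), (200, 250)],
    "ENST012345": [(0, 50), (50, 200), (200, 250)],
}
```

Comparing fixed-length digests instead of exon lists is what lets the self-join in the SQL above match on a single indexed column.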

