Deep phenotyping to aid identification of coding & non-coding rare disease variants

Melissa Haendel, PhD

March 2017

@monarchinit @ontowonka

Acknowledgments

Charité: Max Schubach, Sebastian Koehler

Univ of Milan: Giorgio Valentini

RTI: Jim Balhoff

OHSU: Kent Shefchek, John Letaw, Julie McMurry, Nicole Vasilevsky, Matt Brush, Tom Conlin, Dan Keith

Genomics England/Queen Mary: Damian Smedley, Julius Jacobsen

Jackson Laboratory: Peter Robinson

Stanford: Shruti Marwaha, Matthew Wheeler, Euan Ashley

Lawrence Berkeley: Chris Mungall, Suzanna Lewis, Jeremy Nguyen, Seth Carbon

Garvan: Tudor Groza

https://monarchinitiative.org/page/team

The genome is sequenced, but we still don't know very much about what it does

OMIM: 3,398 Mendelian diseases with no known genetic basis

ClinVar: at least 120,000* variants with no known pathogenicity

*This is more than twice what it was in 2016!

Prevailing clinical genomic pipelines leverage only a tiny fraction of the available data

Figure labels: patient exome/genome; patient clinical phenotypes; public genomic data; public clinical phenotype and disease data; possible diseases; diagnosis & treatment

Under-utilized data: patient environment; public environment and disease data; patient omics phenotypes; public omics phenotypes and correlations

The Human Phenotype Ontology

11,813 phenotype terms

127,125 rare disease-phenotype annotations

136,268 common disease-phenotype annotations

bit.ly/hpo-paper

Adding other species’ data helps fill knowledge gaps in human genome

More species = more coverage

19,008 human protein-coding genes (number in ExAC DB as per Lek et al. Nature 2016)

9,739 (51%)

14,779 (78%)

Even inclusion of just four species boosts phenotypic coverage of genes by 38 percentage points (51% to 89%)

Combined = 89%

2,195 + 7,544 + 7,235 = 16,974 (union of coverage in any species)

Mungall et al Nucleic Acids Research bit.ly/monarch-nar-2016
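The coverage figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes, as the percentages suggest, that 9,739 is the number of genes with human phenotype data and 14,779 the number covered by the other species; the variable names and the labels on the three subsets are an illustrative reading, not taken from the slides.

```python
# Gene counts quoted on the slide; the subset labels are an assumed reading.
TOTAL_GENES = 19_008   # human protein-coding genes (ExAC, Lek et al. 2016)
HUMAN_ONLY = 2_195     # assumed: phenotype data in human only
BOTH = 7_544           # assumed: phenotype data in human and other species
OTHER_ONLY = 7_235     # assumed: phenotype data in other species only


def pct(count, total=TOTAL_GENES):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


human = HUMAN_ONLY + BOTH                  # genes covered by human data
other = BOTH + OTHER_ONLY                  # genes covered by other species
combined = HUMAN_ONLY + BOTH + OTHER_ONLY  # union of coverage in any species

coverage = {
    "human": pct(human),
    "other_species": pct(other),
    "combined": pct(combined),
}
gain_points = coverage["combined"] - coverage["human"]
print(human, other, combined, coverage, gain_points)
```

Under this reading the three subset counts reproduce every total and percentage shown on the slide, including the 38-point gain.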

Phenotypic profile matching

Combining G2P data for variant prioritization

Whole exome

Remove off-target and common variants

Variant score from allele freq and pathogenicity

Phenotype score from phenotypic similarity

PHIVE score to give final candidates

Mendelian filters
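The pipeline steps above can be sketched roughly as follows. This is not Exomiser's API: the class, the function names, the 1% frequency cutoff, the Jaccard overlap used in place of a real phenotypic similarity measure, and the simple average used to combine the two scores are all assumptions made for illustration. Mendelian (inheritance) filters are omitted.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    allele_freq: float    # population allele frequency, 0..1
    pathogenicity: float  # predicted pathogenicity, 0..1
    on_target: bool = True


def variant_score(v):
    """Assumed form: rarer and more pathogenic variants score higher."""
    return v.pathogenicity * (1.0 - v.allele_freq)


def phenotype_score(patient_terms, gene_terms):
    """Jaccard overlap of term sets, a stand-in for phenotypic similarity."""
    if not patient_terms or not gene_terms:
        return 0.0
    return len(patient_terms & gene_terms) / len(patient_terms | gene_terms)


def prioritize(variants, patient_terms, gene_phenotypes, max_freq=0.01):
    """Filter, score, and rank variants; max_freq is an assumed cutoff."""
    ranked = []
    for v in variants:
        if not v.on_target or v.allele_freq > max_freq:
            continue  # remove off-target and common variants
        p = phenotype_score(patient_terms, gene_phenotypes.get(v.gene, set()))
        combined = (variant_score(v) + p) / 2  # assumed: simple average
        ranked.append((combined, v.gene))
    return sorted(ranked, reverse=True)


# Placeholder phenotype terms and genes, invented for this example.
patient = {"T1", "T2", "T3"}
gene_phenotypes = {  # could include model-organism annotations
    "GENE_A": {"T1", "T2", "T3"},
    "GENE_B": {"T9"},
}
variants = [
    Variant("GENE_A", 0.0001, 0.80),
    Variant("GENE_B", 0.0001, 0.95),
    Variant("GENE_C", 0.2000, 0.99),  # common, so it is filtered out
]
results = prioritize(variants, patient, gene_phenotypes)
print([gene for _, gene in results])
```

In this toy run GENE_B has the more damaging variant, but GENE_A ranks first because its phenotype annotations match the patient, which is the effect the phenotype score is meant to capture.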

Exomiser results for UDP diagnosed patients

Inclusion of phenotype data improves variant prioritization

In 60% of the first 1000 genomes at GEL, Exomiser predicts the top candidate

In 86% of cases, Exomiser predicts within the top 5
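Figures of this kind are top-k rates: the share of cases in which the causal candidate is ranked at or above position k. A minimal sketch follows, with invented ranks and a hypothetical function name; it does not reproduce the GEL data.

```python
def top_k_rate(ranks, k):
    """Fraction of cases whose causal candidate is ranked within the top k."""
    if not ranks:
        return 0.0
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Invented ranks for ten cases; None means the causal gene was not returned.
example_ranks = [1, 1, 3, 1, 7, 2, 1, None, 1, 5]
top1 = top_k_rate(example_ranks, 1)
top5 = top_k_rate(example_ranks, 5)
print(top1, top5)
```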

Example case solved by Exomiser

Figure labels: Phenotypic profile; Genes; Heterozygous, missense mutation, STIM-1; N/A; Stim1Sax/Sax

Ranked STIM-1 variant maximally pathogenic based on cross-species G2P data, in the absence of traditional data sources

http://bit.ly/exomiser

How to make sense of whole genomes

…when there are 3.5 Billion base pairs and so little is known about non-coding regions?

bit.ly/genomiser-2016

1) Gather all evidence at each position (3.5B)

• ancestral conservation
• GC content
• Max methylation, acetylation, trimethylation levels
• DNase hypersensitivity
• Enhancer attributes (robust, permissive)
• # overlapping transcription factor binding sites
• # rare variants (< 0.5% AF) +/- 500 nt
• # common variants (> 0.5% AF) +/- 500 nt
• Overlapping CNVs (ISCA, dbVar, DGV)
• (… 26 features in total)
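As an illustration of how a per-position feature from the list above might be computed, the sketch below counts rare and common variants within 500 nt either side of a position. The function name, the input layout, and the example data are hypothetical; only the 0.5% allele-frequency threshold and the 500 nt flank come from the slide.

```python
from bisect import bisect_left, bisect_right


def window_counts(position, variants, flank=500, af_threshold=0.005):
    """Count rare and common variants within +/- flank nt of position.

    variants: list of (position, allele_frequency) tuples sorted by position.
    Returns (rare_count, common_count).
    """
    positions = [p for p, _ in variants]
    lo = bisect_left(positions, position - flank)
    hi = bisect_right(positions, position + flank)
    window = variants[lo:hi]
    rare = sum(1 for _, af in window if af < af_threshold)
    common = sum(1 for _, af in window if af > af_threshold)
    return rare, common


# Invented variants on one chromosome: (position, allele frequency).
known = [
    (100, 0.001), (450, 0.200), (600, 0.250),
    (900, 0.0004), (1400, 0.003), (2100, 0.300),
]
rare, common = window_counts(1000, known)
print(rare, common)
```

Sorting once and using binary search keeps each lookup cheap, which matters when the same query is repeated at a very large number of positions.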


2) Predict negative controls

> 5% prevalence

14.7 M putative non-deleterious positions

Highly conserved in ancestral genomes


3) Hand-curate positives from literature

We curated 453 regulatory mutations judged as pathogenic by reported phenotypes (HPO) and other metrics


4) Address positive-negative imbalance

14.7 M putative non-deleterious

453 known regulatory mutations

36,000 negative examples are available for every positive one


Synthetically oversample positives & undersample negatives

14.7 M putative non-deleterious

453 known regulatory mutations

1) Partition negatives into 100 groups

2) Add to each negative group, all 453 known positives

3) In each group, oversample positives AND undersample negatives
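The three steps above can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: synthetic positives are made by interpolating between two randomly chosen real positives (a simplification of SMOTE-style oversampling, which normally uses nearest neighbours), and `over_factor`, `under_ratio`, and all function names are hypothetical knobs introduced here.

```python
import random


def interpolate(a, b, rng):
    """Synthetic point on the segment between two positive examples."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


def balanced_partitions(positives, negatives, n_parts, over_factor,
                        under_ratio, seed=0):
    """Yield one rebalanced (positives, negatives) set per negative partition.

    over_factor: synthetic positives generated per real positive (assumed).
    under_ratio: negatives kept per positive after oversampling (assumed).
    """
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    for i in range(n_parts):
        part = shuffled[i::n_parts]          # 1) partition the negatives
        pos = list(positives)                # 2) add all known positives
        for p in positives:                  # 3a) oversample positives
            for _ in range(over_factor):
                pos.append(interpolate(p, rng.choice(positives), rng))
        keep = min(len(part), under_ratio * len(pos))
        neg = rng.sample(part, keep)         # 3b) undersample negatives
        yield pos, neg


# Toy data: 5 positives and 1,000 negatives with two features each.
data_rng = random.Random(1)
positives = [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(5)]
negatives = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(1000)]
sets = list(balanced_partitions(positives, negatives, n_parts=10,
                                over_factor=2, under_ratio=3))
print(len(sets), len(sets[0][0]), len(sets[0][1]))
```

Each resulting set is far closer to balanced than the raw data. A typical next step for schemes of this kind is to train one classifier per set and combine their predictions, though the slides do not spell that out.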

Strongest predictors of deleterious mutation

• Higher DNase hypersensitivity
• Greater methylation
• Richer GC content
• Higher ratio of rare:common variation
• Higher conservation


5) Benchmark using synthetic genomes

10,235 simulated disease genomes using 1000 Genomes data

Novel Regulatory Mendelian Mutation (ReMM) scoring method

Genomiser + ReMM outperforms other methods/tools across non-coding region types

bit.ly/genomiser-2016

www.monarchinitiative.org

Leadership: Melissa Haendel, Chris Mungall, Peter Robinson, Tudor Groza, Damian Smedley, Sebastian Köhler, Julie McMurry

Funding: NIH Office of Director: 2R24OD011883; NHGRI UDP: HHSN268201300036C, HHSN268201400093P; NCATS: UDN U01TR001395, Biomedical Data Translator: 1OT3TR002019; E-RARE 2015: Hipbi-RD 01GM1608

http://www.monarchinitiative.org/

Date post:	05-Apr-2017
Upload:	mhaendel