Deep phenotyping to aid identification of coding & non-coding rare disease variants
Melissa Haendel, PhD
March 2017
@monarchinit @ontowonka [email protected]
Acknowledgments
Charité: Max Schubach, Sebastian Koehler
Univ of Milan: Giorgio Valentini
RTI: Jim Balhoff
OHSU: Kent Shefchek, John Letaw, Julie McMurry, Nicole Vasilevsky, Matt Brush, Tom Conlin, Dan Keith
Genomics England/Queen Mary: Damian Smedley, Julius Jacobsen
Jackson Laboratory: Peter Robinson
Stanford: Shruti Marwaha, Matthew Wheeler, Euan Ashley
Lawrence Berkeley: Chris Mungall, Suzanna Lewis, Jeremy Nguyen, Seth Carbon
Garvan: Tudor Groza
https://monarchinitiative.org/page/team
The genome is sequenced, but...
…we still don’t know very much about what it does
OMIM: 3,398 Mendelian diseases with no known genetic basis
ClinVar: at least 120,000* variants with no known pathogenicity
*This is > twice what it was in 2016!
Prevailing clinical genomic pipelines leverage only a tiny fraction of the available data
PATIENT EXOME/ GENOME
PATIENT CLINICAL PHENOTYPES
PUBLIC GENOMIC DATA
PUBLIC CLINICAL PHENOTYPE, DISEASE DATA
POSSIBLE DISEASES
DIAGNOSIS & TREATMENT
PATIENT ENVIRONMENT
PUBLIC ENVIRONMENT, DISEASE DATA
PATIENT OMICS PHENOTYPES
PUBLIC OMICS PHENOTYPES, CORRELATIONS
Under-utilized data
The Human Phenotype Ontology
11,813 phenotype terms
127,125 rare disease-phenotype annotations
136,268 common disease-phenotype annotations
bit.ly/hpo-paper
Adding other species’ data helps fill knowledge gaps in human genome
More species = more coverage
19,008: number of human protein-coding genes in ExAC DB as per Lek et al. Nature 2016
9,739 of 19,008 = 51%
14,779 of 19,008 = 78%
2,195 + 7,544 + 7,235 = 16,974 (union of coverage in any species); combined = 89%
Even inclusion of just four species boosts phenotypic coverage of genes by 38 percentage points (51% → 89%)
Mungall et al Nucleic Acids Research bit.ly/monarch-nar-2016
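The coverage figures above can be checked with a few lines of arithmetic. This is a minimal sketch: the gene counts come from the slide, but the labels on the three regions (human only, both, other species only) are an assumption inferred from how the numbers add up, not something the slide states.

```python
# Gene counts are from the slide; the region labels are an interpretation.
TOTAL_GENES = 19_008   # human protein-coding genes (ExAC, Lek et al. 2016)
human_only = 2_195     # assumed: phenotype data from human only
both = 7_544           # assumed: phenotype data from human and other species
other_only = 7_235     # assumed: phenotype data from other species only

human = human_only + both               # genes covered by human data
other = both + other_only               # genes covered by other species
union = human_only + both + other_only  # genes covered by any species


def pct(n: int) -> int:
    """Share of all protein-coding genes, rounded to a whole percent."""
    return round(100 * n / TOTAL_GENES)


gain_points = pct(union) - pct(human)
print(human, other, union)                   # 9739 14779 16974
print(pct(human), pct(other), pct(union))    # 51 78 89
print(gain_points)                           # 38
```

Under this reading, the 38 is a difference in percentage points (51% to 89%), not a relative increase.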
Phenotypic profile matching
Combining G2P data for variant prioritization
Whole exome
Remove off-target and common variants
Variant score from allele freq and pathogenicity
Phenotype score from phenotypic similarity
PHIVE score to give final candidates
Mendelian filters
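The steps above can be sketched as a small filter-score-rank pipeline. This is an illustration only, not Exomiser's implementation: the `Variant` fields, the `MAX_AF` cut-off, the scoring formulas, the simple averaging of the two scores, and the gene names are all hypothetical, and the Mendelian (inheritance) filters are omitted.

```python
from dataclasses import dataclass

MAX_AF = 0.01  # assumed cut-off above which a variant counts as "common"


@dataclass
class Variant:
    gene: str
    allele_freq: float    # population allele frequency, 0..1
    pathogenicity: float  # predicted pathogenicity, 0..1
    on_target: bool       # falls inside the captured exome regions


def variant_score(v: Variant) -> float:
    """Hypothetical variant score: rarer and more pathogenic scores higher."""
    rarity = 1.0 - v.allele_freq / MAX_AF
    return rarity * v.pathogenicity


def prioritize(variants, phenotype_score):
    """Filter, then rank genes by the mean of variant and phenotype score."""
    kept = [v for v in variants if v.on_target and v.allele_freq < MAX_AF]
    ranked = [
        (v.gene, (variant_score(v) + phenotype_score.get(v.gene, 0.0)) / 2)
        for v in kept
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up inputs: GENE_C is common, GENE_D is off-target.
variants = [
    Variant("GENE_A", 0.0, 0.90, True),
    Variant("GENE_B", 0.0005, 0.95, True),
    Variant("GENE_C", 0.20, 0.99, True),
    Variant("GENE_D", 0.0, 0.99, False),
]
# Made-up phenotypic similarity between the patient and each gene's profile.
phenotype_score = {"GENE_A": 0.8, "GENE_B": 0.2}

ranked = prioritize(variants, phenotype_score)
print([gene for gene, _ in ranked])  # ['GENE_A', 'GENE_B']
```

In this toy input GENE_B has the slightly higher variant score on its own, so the phenotype score is what moves GENE_A to the top.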
Exomiser results for UDP diagnosed patients
Inclusion of phenotype data improves variant prioritization
In 60% of first 1000 genomes at GEL, Exomiser predicts top candidate
In 86% of cases, Exomiser predicts within top 5
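Figures of this kind are top-k rates over solved cases. A minimal sketch of how such a rate could be computed is below; the function name and the list of ranks are hypothetical and are not GEL data.

```python
def top_k_rate(causal_ranks, k):
    """Fraction of cases whose known causal variant was ranked within the top k."""
    hits = sum(1 for rank in causal_ranks if rank <= k)
    return hits / len(causal_ranks)


# Hypothetical ranks of the causal variant in ten solved cases (1 = top candidate).
causal_ranks = [1, 1, 3, 1, 7, 2, 1, 1, 12, 4]

top1 = top_k_rate(causal_ranks, 1)
top5 = top_k_rate(causal_ranks, 5)
print(top1, top5)  # 0.5 0.8
```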
Example case solved by Exomiser
[Figure: phenotypic profile and gene comparison. Surviving labels: "Heterozygous, missense mutation", "STIM-1", "N/A", "Stim1Sax/Sax"]
Ranked STIM-1 variant maximally pathogenic based on cross-species G2P data, in the absence of traditional data sources
http://bit.ly/exomiser
How to make sense of whole genomes
…when there are 3.5 Billion base pairs and so little is known about non-coding regions?
bit.ly/genomiser-2016
1) Gather all evidence at each position (3.5B)
• ancestral conservation
• GC content
• Max methylation, acetylation, trimethylation levels
• DNAse hypersensitivity
• Enhancer attributes (robust, permissive)
• # overlapping transcription factor binding sites
• # rare variants (< 0.5% AF) +/- 500 nt
• # common variants (> 0.5% AF) +/- 500 nt
• Overlapping CNVs (ISCA, dbVAR, DGV)
• (… 26 features in total)
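Two of the features listed above, the rare and common variant counts within +/- 500 nt, can be sketched as a window count over sorted positions. This is only an illustration under assumptions: the input format (a list of `(position, allele_freq)` pairs on one chromosome), the function names, and the example values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_in_window(sorted_positions, pos, flank=500):
    """Count positions in the closed interval [pos - flank, pos + flank]."""
    lo = bisect_left(sorted_positions, pos - flank)
    hi = bisect_right(sorted_positions, pos + flank)
    return hi - lo


def variant_density_features(variants, pos, flank=500, af_cutoff=0.005):
    """Rare and common variant counts around one genomic position.

    Variants with an allele frequency exactly at the cut-off fall in neither
    class, mirroring the strict < and > on the slide.
    """
    rare = sorted(p for p, af in variants if af < af_cutoff)
    common = sorted(p for p, af in variants if af > af_cutoff)
    return {
        "rare": count_in_window(rare, pos, flank),
        "common": count_in_window(common, pos, flank),
    }


# Made-up variants: (position, allele frequency)
variants = [
    (100, 0.001), (600, 0.2), (1400, 0.0001),
    (1500, 0.3), (1501, 0.01), (2100, 0.004),
]
features = variant_density_features(variants, pos=1000)
print(features)  # {'rare': 1, 'common': 2}
```

Sorting once and using binary search keeps each lookup logarithmic, which matters when the same count is needed at billions of positions.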
2) Predict negative controls
> 5% prevalence
14.7 M putative non-deleterious positions
Highly conserved in ancestral genomes
3) Hand-curate positives from literature
We curated 453 regulatory mutations judged as pathogenic by reported phenotypes (HPO) and other metrics
4) Address positive-negative imbalance
14.7 M putative non-deleterious
453 known regulatory mutations
36,000 negative examples are available for every positive one
Synthetically oversample positives & undersample negatives
1) Partition negatives into 100 groups
2) Add to each negative group, all 453 known positives
3) In each group, oversample positives AND undersample negatives
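The three steps above can be sketched as follows. This is a simplified illustration, not the published method: oversampling here is plain resampling with replacement rather than generation of synthetic examples, and the function name, parameters (`oversample`, `neg_per_pos`), and toy data are assumptions.

```python
import random


def build_balanced_groups(negatives, positives, n_groups=100,
                          oversample=2, neg_per_pos=3, seed=0):
    """Return one (positives, negatives) training set per negative partition."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)

    # 1) Partition negatives into n_groups disjoint parts.
    parts = [shuffled[i::n_groups] for i in range(n_groups)]

    groups = []
    for part in parts:
        # 2) Every group receives all known positives.
        # 3a) Oversample positives (stand-in: draw extra copies with replacement).
        extra = rng.choices(positives, k=len(positives) * (oversample - 1))
        pos = list(positives) + extra
        # 3b) Undersample negatives to a fixed multiple of the positives.
        k = min(len(part), neg_per_pos * len(pos))
        neg = rng.sample(part, k)
        groups.append((pos, neg))
    return groups


# Toy data: 10,000 negatives and 5 positives, split into 10 groups.
negatives = list(range(10_000))
positives = ["p0", "p1", "p2", "p3", "p4"]
groups = build_balanced_groups(negatives, positives, n_groups=10)

print(len(groups))                           # 10
print(len(groups[0][0]), len(groups[0][1]))  # 10 30
```

Each group could then train its own classifier, with predictions combined across groups, so that every negative partition contributes while no single model sees the full imbalance.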
Strongest predictors of deleterious mutation
• Higher DNAse hypersensitivity
• Greater methylation
• Richer GC content
• Higher ratio of rare:common variation
• Higher conservation
5) Benchmark using synthetic genomes
10,235 simulated disease genomes using 1000 Genomes data
Novel Regulatory Mendelian Mutation (ReMM) scoring method
Genomiser + ReMM outperforms other methods/tools across non-coding region types
www.monarchinitiative.org
Leadership: Melissa Haendel, Chris Mungall, Peter Robinson, Tudor Groza, Damian Smedley, Sebastian Köhler, Julie McMurry
Funding: NIH Office of Director: 2R24OD011883; NHGRI UDP: HHSN268201300036C, HHSN268201400093P; NCATS: UDN U01TR001395, Biomedical Data Translator: 1OT3TR002019; E-RARE 2015: Hipbi-RD 01GM1608