Gene information consolidation
Pathways integration into SuperPaths
Gene set descriptor enrichment
NGS Gene-phenotype prioritization
Disease unification and annotation
The human gene digital compendium
>150 Data Sources
ProtoNet, HomoloGene, AceView, AB, KEGG, WormBase, FlyBase, InterPro, SOURCE, GeneLynx, NCBI, dbSNP, HORDE, GeneTests, BLOCKS, PDB, MIPS, RZPD, IMGT, LEIDEN, PupaSNP, euGenes, GeneAtlas, HGMD, BCGD, MTDB, MGD, DOts, UCSC, Doctors guide, GO, UniProt, SwissProt, TrEMBL, OMIM, GDB, Ensembl, EntrezGene, bioalma, GeneLoc, TGDB, ATLAS, PubMed, Crow21, GenBank, HUGO, UniGene, MINT, GAD, GeneNote
~148,000 GeneCards entries with “deep links”
• Automatic mining
• Inter-source integration
Number of Genes   Category
21,360            Protein-coding
98,609            RNA genes
16,337            Pseudogenes
1,819             Genetic loci
18                Gene Clusters
9,770             Uncategorized
147,913           Total
Gene-centric information 18 sections – including Summaries, Aliases, Diseases, Genomics, Expression, Pathways, Localization, Publications
The human gene compendium
Stelzer et al, Cur. Prot. Bioinformatics 2016
ncRNA: 23 data sources
Heterogeneity of data
ncRNA only / All genes
All classes
ncRNA class-specific
Belinky et al, Bioinformatics 2013
127k entries / 15k entries
Clustering
~99,000 ncRNAs
Positional integration
Unification of ncRNAs in GeneCards
The ncRNA Grand Unification
15 data sources
v3: ~70k genes
v4.7: ~150k genes
[Pie chart of gene counts by category: Protein coding, RNA genes, Pseudogenes, Genetic loci, Gene clusters, Uncategorized. Values as extracted: 21,360; 98,609; 16,337; 980132; 7,350]
[Bar chart: count (log scale, 10 to 100,000) per RNA gene class (classes 0 to 32), before unification vs. after unification. Annotations: X8, X30, X2]
Enhanced ncRNA classes in GeneCards
epileptic encephalopathy, early infantile, 1
early infantile epileptic encephalopathy with suppression bursts
spasticity - intellectual disability - x-linked epilepsy
infantile epileptic-dyskinetic encephalopathy
infantile spasms
All names for the same disease
Genes for this disease
No disease symbols
No systematic name management
CDKL5, ARX, POMC, LINC00581, RDXP2, FOXG1, PVALB, SLC25A22, PHGDH, PNKP, IDUA,
TPH1, SPTAN1, MECP2, SLC1A3, STXBP1, ABAT, FLNA, PTCH1, RDX, TLE1
OMIM: 6834
Disease Ontology: 6047
NIH Rare Diseases: 6416
[Venn diagram region counts: 2190, 512]
Three major disease name sources show low overlap: integration is needed!
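MalaCards: 19,552
[Venn diagram of MalaCards against the three sources (6047, 6834, 6416). Region counts as extracted: 6497, 502, 614, 983, 2545, 3467, 2511, 405, 3963, 2533]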
Disease Ontology
Diseasecard
Wikipedia
Orphanet
MedlinePlus
Copenhagen DISEASES
GeneTests
GeneReviews
NIH Rare Diseases
Genetics Home Reference
OMIM
NovoSeek
GTR
Name integration from 15 sources
A disease compendium
Rappaport et al. (2014) Cur. Prot. Bioinformatics
85,000 disease strings (names+aliases)
~19,600 disease names
15 info sources
65,000 aliases
Source hierarchy
Name integration process
Textual canonicalization
Example:
• Liver Cancer
• Hepatic Carcinoma
• Neoplasm of Liver
Canonicalization rules:
• Remove non-alphanumeric characters
• Replace identities (juvenile = childhood)
• Remove suffixes (‘s, s, ies)
• Eliminate less-informative words (syndrome)
• Eliminate prefixes (resistance to)
• Translate Greek characters
• Unify word order (liver failure = failure, liver)
• Unify, but leave in, Roman/Arabic/Latin numerals
Textual canonicalization
85,000 disease strings
~19,600 disease names
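The canonicalization rules listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables (`SYNONYMS`, `GREEK`, `STOPWORDS`, `PREFIXES`) and the function names are hypothetical, the suffix stripping is deliberately crude, and none of this is the actual MalaCards implementation.

```python
import re

# Hypothetical lookup tables; the real MalaCards lists are not given in the slides.
SYNONYMS = {"juvenile": "childhood"}          # "replace identities"
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}
STOPWORDS = {"syndrome"}                      # "less-informative words"
PREFIXES = ("resistance to ",)


def strip_plural(word: str) -> str:
    """Crude suffix removal for 'ies' and a trailing 's'."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def canonicalize(name: str) -> str:
    """Map a disease string to a canonical key; equal keys suggest the same disease."""
    text = name.lower()
    for greek, latin in GREEK.items():        # translate Greek characters
        text = text.replace(greek, latin)
    text = text.replace("'s", "").replace("’s", "")   # possessive suffix
    for prefix in PREFIXES:                   # eliminate prefixes
        if text.startswith(prefix):
            text = text[len(prefix):]
    text = re.sub(r"[^a-z0-9 ]+", " ", text)  # drop non-alphanumerics, keep numerals
    words = [SYNONYMS.get(w, w) for w in text.split()]
    words = [strip_plural(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(words))            # unify word order
```

With these assumptions, `canonicalize("Liver failure")` and `canonicalize("Failure, Liver")` yield the same key, so the two strings would be merged as aliases of one disease name.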
MalaCards annotation schemes
• Interrogating disease resources for classifications, symptoms, variations, drugs…
• Searching GeneCards for publications, associated genes
• Querying disease-related gene-sets in GeneAnalytics / GeneCards for pathways, phenotypes, compounds, and GO terms
• Searching within MalaCards itself, e.g. for related diseases, organ categories…
Number of Genes   Category
4,470             Cancer diseases
4,821             Fetal diseases
17,334            Genetic diseases
712               Infectious diseases
2,594             Metabolic diseases
29,140            Rare diseases
Disease-centric information 14 sections – including Summaries, Aliases, Genes, Anatomical context, Pathways, Drugs & therapeutics, Publications
The human disease compendium
Stelzer et al, Cur. Prot. Bioinformatics 2016
General disease categories
Cancer
Fetal
Genetic (Rare)
Genetic (common)
Rare (Non-genetic)
MalaCards affords category overviews
Categories: Genetic (Rare), Genetic (common), Rare (Non-genetic), Infectious, Fetal, Cancer
Sources for gene information
Elite genes
Supported also by specific evidence such as:
OMIM: “Molecular basis known”
Orphanet: “Causative mutation”
Humsavar: “Causative variation”
GeneTests: “Genetic Tests”
ClinVar: “Pathogenic”
10,178 genes
Total: 18,864 diseases
7,987 with elite genes
7,338 without genes
3,593 with non-elite genes
Non-elite genes
GeneCards searches in sections such as: Summaries, Pathways, Function, Publications
The challenge of confederating diseases with genes
dis 1 dis 2 dis 3 dis 4 dis 5 dis 6 dis 7 dis 8 dis 9 dis 10
gen 1 1 1 1
gen 2 1
gen 3 1 1
gen 4 1
gen 5 1
gen 6 1 1 1
gen 7
gen 8 1
gen 9 1
gen 10 1 1
The gene-disease matrix
11,791 diseases, 10,464 genes
The gene-disease matrix: “mutual monogamy”
The gene-disease matrix:
Disease with several genes
Gene with several diseases
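The three situations named on these slides (one-to-one “mutual monogamy”, a disease with several genes, a gene with several diseases) can be read off a sparse gene-disease matrix. A minimal sketch follows; the association pairs are hypothetical and are not MalaCards data.

```python
from collections import defaultdict

# Hypothetical gene-disease associations (illustrative only, not MalaCards data).
pairs = [
    ("gen1", "dis1"), ("gen1", "dis4"), ("gen1", "dis7"),
    ("gen2", "dis2"),
    ("gen3", "dis4"), ("gen3", "dis9"),
    ("gen4", "dis5"),
]

# Sparse matrix stored as two adjacency maps.
diseases_of = defaultdict(set)
genes_of = defaultdict(set)
for gene, disease in pairs:
    diseases_of[gene].add(disease)
    genes_of[disease].add(gene)

# "Mutual monogamy": the gene has one disease and the disease has one gene.
monogamous = sorted(
    (g, d) for g, d in pairs
    if len(diseases_of[g]) == 1 and len(genes_of[d]) == 1
)
# Genes with several diseases, and diseases with several genes.
multi_disease_genes = sorted(g for g, ds in diseases_of.items() if len(ds) > 1)
multi_gene_diseases = sorted(d for d, gs in genes_of.items() if len(gs) > 1)
```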
[Bar chart: genes with disease vs. no disease per gene category, log scale from 1 to 1,000,000]
Total genes: 23k, 123k, 1.7k, 135, 20k, 1k
Fraction with disease: .36, .003, .73, .06, .02, .07
The ncRNA gene prospect
Gene category
Esteller M., Non-coding RNAs in human disease, Nature Reviews 2011
Disease                                Involved ncRNAs                        ncRNA type
Beckwith–Wiedemann syndrome            lncRNAs H19 and KCNQ1OT1               lncRNA
Silver–Russell syndrome                lncRNA H19                             lncRNA
Deafness                               miR-96                                 miRNA
Alzheimer's disease                    miR-29, miR-146 and miR-107            miRNA
Alzheimer's disease                    ncRNA antisense transcript for BACE1   lncRNA
Rheumatoid arthritis                   miR-146a                               miRNA
Transient neonatal diabetes mellitus   lncRNA HYMAI                           lncRNA
Amyotrophic lateral sclerosis          miR-206                                miRNA
ncRNAs in human diseases
Pathway sources
Existing redundancy and ambiguity:
Same genes with different pathway names
Different genes with same pathway name
Pathway unification
Creating SuperPaths: unification in GeneCards of 12 pathway sources
Belinky et al. Database (Oxford). 2015
Pairwise gene compositional similarity between pathways as a basis for combination of hierarchical and nearest neighborhood clustering
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Jaccard similarity index
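As a small worked illustration of the Jaccard index between two pathways, treating each pathway as a set of gene symbols. The function name and the example gene sets are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    both = a | b
    return len(a & b) / len(both) if both else 0.0


# Two hypothetical pathways sharing 2 of 4 distinct genes.
pathway_1 = {"HK1", "PFKM", "PKM"}
pathway_2 = {"PFKM", "PKM", "ENO1"}
similarity = jaccard(pathway_1, pathway_2)   # 2 shared / 4 total
```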
The unification algorithm
A combination of hierarchical clustering and nearest neighbor graph representation with 2 cutoffs, T1 and T2
[Plot: redundancy vs. informativeness optimizer over the cutoffs T1 and T2, axes from 0 to 1]
T1 = 0.3, T2 = 0.7
(Join all pathway pairs with J > T2 (hierarchical, H), and each pathway's best match if J > T1 (nearest neighbor, NN))
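The two-cutoff joining rule can be sketched as follows. This is a simplified stand-in, assuming that "join" means merging into connected components with a union-find structure; the published SuperPaths procedure uses hierarchical clustering and is not reproduced here. The function names and the example pathways are hypothetical, and the cutoffs are those shown on the slide.

```python
from itertools import combinations

T1, T2 = 0.3, 0.7  # cutoffs shown on the slide


def jaccard(a, b):
    both = a | b
    return len(a & b) / len(both) if both else 0.0


def superpaths(pathways, t1=T1, t2=T2):
    """Group pathways: join every pair with J > t2, plus each best match with J > t1."""
    names = sorted(pathways)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path compression
            n = parent[n]
        return n

    def union(a, b):
        parent[find(a)] = find(b)

    best = {n: (0.0, None) for n in names}  # best-scoring neighbor per pathway
    for a, b in combinations(names, 2):
        j = jaccard(pathways[a], pathways[b])
        if j > t2:                          # "join all J > T2"
            union(a, b)
        for x, y in ((a, b), (b, a)):
            if j > best[x][0]:
                best[x] = (j, y)
    for n, (j, partner) in best.items():    # "the best if J > T1"
        if partner is not None and j > t1:
            union(n, partner)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())


# Hypothetical pathways (gene memberships are illustrative only).
example = {
    "glycolysis_A": {"HK1", "PFKM", "PKM", "GAPDH"},
    "glycolysis_B": {"HK1", "PFKM", "PKM", "GAPDH", "ENO1"},
    "gluconeogenesis": {"PKM", "GAPDH", "ENO1", "PCK1", "G6PC1"},
    "apoptosis": {"CASP3", "CASP9", "BAX"},
}
groups = superpaths(example)
```

In this toy input the two glycolysis entries join through T2 (J = 0.8), gluconeogenesis joins its nearest neighbor through T1 (J = 3/7), and apoptosis stays alone. Raising T1 makes the result more informative but more redundant, which is the trade-off the optimizer plot addresses.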
A powerful gene set analysis tool
Categorized results
• Tissue/cell branded expression data
• Disease association
• Pathways and SuperPaths
• Gene Ontologies (GO) connection
• Compounds (from several sources)
Unique matching algorithms
Tissue Expression Example
PRPH2
Gene-set
Gene-Set Analysis
Photoreceptor-Like Cells
Mature Rod Cells
AIPL1, CRB1, FAM161A, UNC119
Eye
PHO,RP1,
RPGRIP1
RPE65
RPE65
2% 25%
The augmented functional genome
Requires effective bioinformatic tools
Whole exome sequencing
Whole genome sequencing
Regulatory
ncRNA genes
Introns and UTRs
Exome
Challenges: annotation, filtration, gene-based interpretation
The NGS interpretation cycle
Patient
NGS
VCF
Variant-gene pairs: V1 G1, V2 G2, V3 G2, V4 G3, V5 G4, V6 G4, V7 G5
Phenotypes
Interpretation program
Variant (V) calling, annotation and filtering
G4 (V5)
Genes (G) and other genomic elements
DNA
Genome mapping
Medium list
Two-tier variant-containing gene sifting
25k NGS coding variants
1) Filtering
• Genetic model
• Population frequency
• Protein damage
• Known variants
~100 genes (medium list)
2) Phenotype- and gene-based interpretation
1-5 genes
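The first tier (filtering) can be sketched as below. Everything here is an assumption for illustration: the `Variant` fields, the frequency cutoff, and the simplified recessive model (homozygous variants only, ignoring compound heterozygotes). The "known variants" criterion is omitted, and this is not the filtering code of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    population_freq: float   # allele frequency in a reference population
    damaging: bool           # predicted protein damage
    zygosity: str            # "hom" or "het"


# Hypothetical cutoff; appropriate values depend on the disease model.
MAX_FREQ = 0.01


def tier1_genes(variants, recessive=True):
    """Return the genes that still carry a variant after tier-1 filtering."""
    kept = set()
    for v in variants:
        if v.population_freq > MAX_FREQ or not v.damaging:
            continue                      # common or benign-looking variant
        if recessive and v.zygosity != "hom":
            continue                      # simplified genetic model
        kept.add(v.gene)
    return kept


# Hypothetical input using the V/G labels of the interpretation-cycle slide.
variants = [
    Variant("G1", 0.20, True, "hom"),    # too common
    Variant("G2", 0.001, False, "hom"),  # not damaging
    Variant("G3", 0.001, True, "het"),   # fails the recessive model
    Variant("G4", 0.0005, True, "hom"),  # passes
]
medium_list = tier1_genes(variants)
```

The surviving genes form the medium list that the second tier then ranks against the phenotype.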
VarElect - A medium gene list interpreter
Gene A
Gene B
Phenotype
VarElect interpretation
Direct mode
Indirect mode
Gene-gene relations
Guilt by association via sharing of
Super-Pathways
Mouse phenotypes, tissue expression, paralogy, publications, protein-protein interaction, drugs/compounds, domains
        geneA   geneB   geneC
geneA   -       S1      S2
geneB   S1      -       S3
geneC   S2      S3      -
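The difference between direct and indirect mode can be sketched with a toy example. The annotation strings, the `shared` map, and the function names are hypothetical; real VarElect scoring is more elaborate and is not reproduced here. A gene is a direct hit when its own annotation mentions the phenotype keyword, and an indirect hit when a gene it shares an attribute with (S1, S2, S3 in the matrix) is a direct hit.

```python
# Hypothetical annotations and gene-gene links (illustrative only).
annotations = {
    "geneA": "regulates vascular permeability; linked to capillary leak",
    "geneB": "cytoskeletal adaptor protein",
    "geneC": "ribosomal subunit",
}
shared = {                      # gene -> {partner gene: shared attribute}
    "geneB": {"geneA": "S1"},
    "geneC": {},
}


def direct_hit(gene, keyword):
    """True if the gene's own annotation text mentions the keyword."""
    return keyword.lower() in annotations.get(gene, "").lower()


def interpret(genes, keyword):
    """Label each input gene as a direct or indirect match to the phenotype keyword."""
    results = {}
    for gene in genes:
        if direct_hit(gene, keyword):
            results[gene] = ("direct", None)
            continue
        via = [(partner, link)
               for partner, link in shared.get(gene, {}).items()
               if direct_hit(partner, keyword)]
        if via:                 # guilt by association
            results[gene] = ("indirect", via[0])
    return results


report = interpret(["geneB", "geneC"], "capillary leak")
```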
Stelzer et al, BMC Genomics 2016
The VarElect user web interface - data entry
User Name
diarrhea
Disease: Atypical syndromic congenital diarrhea
VarElect in action - Retinitis Pigmentosa with Epilepsy
Input 2: Phenotype keywords - epilepsy OR macular OR retinitis
Zhu X, Oz Levi D, Genetics in Medicine, 2015
Input 1: 63 Genes
Output: CLN6 is 1st
Terra Incognita
37 Genes
No genes are DIRECTLY related
Phenotype keywords: “capillary leak”
Danit Oz Levi (with Lancet)
Example of indirect mode - Capillary leak syndrome
TLN1 (talin 1) is an excellent candidate gene
• Strong splice site mutation
• Passed segregation analysis in family
• Talin binds integrins, coupling them to actin, and has a known role in vascular endothelium permeability
What are genomic enhancers
• Distant-acting DNA elements that regulate transcription
• Mostly occur in intergenic genome regions
• Mediated by DNA looping via a large protein complex
• Contain binding site sequences for transcription factors
• Weak DNA signatures - identification is challenging
Enhancer
Transcription begins
DNA loop
Promoter
The significance of enhancers
Estimated count: 400,000 (10x the number of functional genes)
Gene expression fine-tuning much beyond promoters
Enhancers in development and human disease
Example: Preaxial polydactyly (“many fingers” in Greek)
Visel, A., Nature, 2009
Enhanceropathies – Enhancer-related diseases
SHH – Sonic Hedgehog
Motivation
• A unified compendium of enhancer elements
• Combined information on gene-enhancer relations
• Use in WGS interpretation
GeneHancer – an integrated enhancer database
Different data sources
Automated data mining
Part of the GeneCards Suite
Fishilevich et al, Database (Oxford), 2017
Enhancer integration from 4 sources
284,834 unified scored enhancers
Histone modifications, open chromatin
TFBSs
Histone modifications, open chromatin
Enhancer RNA (eRNA)
Transgenic mouse assays, in-vivo validation
434,139 total entries
Inter-source overlaps are significant
• Elite enhancers - ~94,000 enhancers (33%) derived from multiple sources
• Enhancer confidence score - number of sources, source scores, conserved regions, TFBSs.
Rational prediction of an enhancer’s target genes is essential for biological interpretation
5 methods for enhancer-gene associations
• Methods are predictions/inferences
• Combination is optimal
Chromosome conformation capture
Hi-C - Combination of DNA proximity ligation with high-throughput sequencing
Capture Hi-C (CHi-C) - Promoter-targeted Hi-C
Mifsud, B., et al., Nature Genetics, 2015
Lieberman-Aiden, E., Science, 2009
Expression Quantitative Trait Loci (eQTLs)
Genetic association between an enhancer’s SNP genotype and the expression of a potential target gene
The GTEx Consortium, Science, 2015
Nica, A., et al., Philosophical Transactions of the Royal Society B, 2013
Gene expression intensity
SNP genotype
Enhancer
Variant Gene
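The eQTL idea, an association between the genotype of a SNP inside an enhancer and the expression of a candidate target gene, can be illustrated with a plain least-squares slope. This is a sketch under assumptions: genotypes are coded as alternate-allele counts (0, 1, 2), the data are made up, and a real eQTL analysis would add significance testing and covariates.

```python
def eqtl_slope(genotypes, expression):
    """Least-squares slope of expression on genotype dosage (0, 1 or 2 alt alleles)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    var = sum((g - mean_g) ** 2 for g in genotypes)
    return cov / var


# Hypothetical samples: expression rises with each copy of the variant allele.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope = eqtl_slope(genotypes, expression)
```

A slope clearly different from zero, as in this toy data, is the kind of signal that links the enhancer harboring the SNP to the gene.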
Genomic distance
• Immediate neighbors were connected
• This is in line with the distance dependence of other methods
Enhancer – gene association by distance
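Distance-based association can be sketched as a nearest-neighbor lookup. This assumes that "immediate neighbors" means the closest gene on each side of the enhancer along the same chromosome, which the slide does not spell out; the coordinates, gene labels, and function name are hypothetical.

```python
from bisect import bisect_left


def neighbor_genes(enhancer_pos, tss_sorted):
    """Return the genes flanking an enhancer position.

    tss_sorted: list of (position, gene) tuples for one chromosome,
    sorted by position.
    """
    positions = [p for p, _ in tss_sorted]
    i = bisect_left(positions, enhancer_pos)
    neighbors = []
    if i > 0:
        neighbors.append(tss_sorted[i - 1][1])   # nearest gene upstream
    if i < len(tss_sorted):
        neighbors.append(tss_sorted[i][1])       # nearest gene downstream
    return neighbors


# Hypothetical transcription start sites on one chromosome.
tss = [(1000, "G1"), (5000, "G2"), (9000, "G3")]
flanking = neighbor_genes(6000, tss)
```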
Overlap among target gene predictions
Distance
TFs co-expression
eQTLs
CHi-C
eRNA co-expression
~1,000,000 scored enhancer-gene pairs
• ~75,000 Elite (supported by >1 method)
• ~40,000 Double elite (both enhancer and gene link are elite)
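The elite and double-elite labels follow directly from counting supporting evidence. A minimal sketch, assuming an enhancer is elite when it derives from more than one source and a link is elite when more than one method supports it; the method keys and source labels are placeholders, and no scores are modeled.

```python
# Placeholder keys for the five association methods named on the slide.
METHODS = {"distance", "tf_coexpression", "eqtl", "chic", "erna_coexpression"}


def classify(pair_methods, enhancer_sources):
    """Classify an enhancer-gene pair by its supporting methods and enhancer sources."""
    link_elite = len(pair_methods & METHODS) > 1     # supported by >1 method
    enhancer_elite = len(enhancer_sources) > 1       # derived from multiple sources
    if link_elite and enhancer_elite:
        return "double elite"
    if link_elite:
        return "elite"
    return "non-elite"


label = classify({"eqtl", "chic"}, {"source_1", "source_2"})
```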
Enhancer experimental validation
1. How good is our enhancer set?
121/175 (69%) overlap with GeneHancer
2. How good is our gene set?
100/121 (83%) overlap with GeneHancer
56% of the validated enhancer-gene pairs are Elite, as compared to 7.4% expected at random
Validated enhancers have higher enhancer scores
Comparison to 175 published single-enhancer studies
Exome sequencing Gene Phenotype
Coding variants
(25,000)
Whole genome
sequencing
Non-Coding variants (4,000,000)
Enhancer
GeneHancer for sequence analyses of diseases
WGS – Whole Genome Sequencing
The interpretation challenge - Which variant causes the disease?
180 kb CNV
Input 1: 18 genes, 31 enhancers
Input 2: phenotype keywords - skin, dermatol*, …
Output: Highest-scoring gene is a target of an enhancer
Candidate enhancer for a developmental skin disease
Enhancer Candidate target gene