© 2013 Illumina, Inc. All rights reserved. Illumina, IlluminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, NuPCR, SeqMonitor, Solexa, TruSeq, TruSight, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE
BaseSpace Correlation Engine: Curation Process, Bioset-Bioset Correlation & Meta-Analysis Calculations
Largest Collection of Curated Genomic Data
- Preprocessed, Normalized, and Tagged Datasets
Easy To Use Web Platform For Exploring It
- Billions of associations and correlations pre-computed
- Rich Biomedical Ontology and Semantic Structure
Applications To Correlate Private & Public Data
Secure Collaboration & Data Sharing
SaaS platform – Big Data Frameworks
API Suite for integrations to and from internal resources
Illumina’s Enterprise Informatics mission is to integrate genomic information to help researchers and clinicians develop unprecedented biomedical insights.
Aggregate, Curate, Correlate & Integrate
Correlation Engine - Public Content is Added Continuously
Over 22,000 studies & over 169,000 molecular lists (26 March 2018)
Data Types: comparisons at gene symbol level enable different molecular data type comparisons
Species: ortholog gene clusters enable cross-species comparisons
Correlation Engine: Aggregate, Curate
Processed & Integrated: > 22,000 Studies, > 169,000 Bioset and Biogroup lists, 13 Species, 12 data types
NextBio Research: Correlate
Bioset versus Bioset Scoring in Curated Studies app
‘Disease _vs_ Normal’ list versus ‘Treatment _vs_ Control’ list
anti-TNF Etanercept treatment reverses expression of psoriasis disease profiles
Opportunities arise to deeply investigate:
• specific gene activities
• divergent pathways
• on-target and off-target effects
Example: ‘Disease _vs_ Normal’ bioset versus ‘Treatment _vs_ Control’ bioset (screenshot with numbered callouts 1 to 5)
https://enterprise.ussc.informatics.illumina.com/c/search/as/?type=bioset&id=65254#tf=Etanercept
Explanation – Step 1
Positive correlations
Negative correlations
• Subset pair enrichment score is the negative logarithm of the p-value, with a sign reflecting the consistency of the subset directions
• score for b1+b2+ is –ln(0.999) = 0.001, sign = +
• score for b1-b2- is –ln(1) = 0, sign = +
• score for b1+b2- is –ln(1.7e-307) = 706.363, sign = -
• score for b1-b2+ is –ln(8.1e-135) = 308.757, sign = -
Explanation – Step 2
Subset scores are signed based on consistency of subset directions
Positive correlation is the average of b1+b2+ and b1-b2- = score_positive
Negative correlation is the average of b1+b2- and b1-b2+ = score_negative
• Overall score is the sum of score_positive and score_negative
• Score_positive = (0.001 + 0) / 2 = 0.0005
• Score_negative = -(706.363 + 308.757) / 2 = -507.560
• Overall score = 0.0005 + (-507.560) ≈ -507.560, which means the correlation is negative, and the magnitude is equivalent to an overlap p-value of exp(-507.560) = 3.712E-221, displayed as 3.7E-221 in the user interface.
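The arithmetic in the two explanation steps can be checked with a few lines of Python. This is a minimal sketch using only the four subset p-values quoted on the slide; the function names `subset_score` and `overall_score` are illustrative and are not part of the Correlation Engine.

```python
import math

def subset_score(p_value: float) -> float:
    """Unsigned subset pair score: negative natural log of the p-value."""
    return -math.log(p_value)

def overall_score(p_up_up, p_down_down, p_up_down, p_down_up):
    # Concordant subset pairs (b1+b2+, b1-b2-) carry a positive sign.
    score_positive = (subset_score(p_up_up) + subset_score(p_down_down)) / 2
    # Discordant subset pairs (b1+b2-, b1-b2+) carry a negative sign.
    score_negative = -(subset_score(p_up_down) + subset_score(p_down_up)) / 2
    return score_positive + score_negative

# p-values from the Etanercept example slide
score = overall_score(0.999, 1.0, 1.7e-307, 8.1e-135)
equivalent_p = math.exp(-abs(score))

print(round(score, 2))         # negative sign: the biosets are anti-correlated
print(f"{equivalent_p:.1e}")   # magnitude expressed as an overlap p-value
```

A negative overall score indicates that the discordant subset pairs dominate, which is the reversal pattern described for the treatment versus disease comparison.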
Rank-Based Directional Enrichment
Processed experimental comparisons are first set to ranks
Absolute Fold Change ranking | Lowest p-value ranking
http://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote-data-correlation-enrichment.pdf
Rank-Based Directional Enrichment
http://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote-data-correlation-enrichment.pdf
1. Each bioset is divided into directional subsets
2. Enrichment score for each subset pair is computed (details next slide)
3. Subset scores are signed based on consistency of subset directions
   Positive correlation is the average of b1+b2+ and b1-b2- = score_positive
   Negative correlation is the average of b1+b2- and b1-b2+ = score_negative
4. Overall score is the sum of score_positive and score_negative
Rank-Based Directional Enrichment
1 - Both b1 and b2 are sorted by ranks
2 - Up to 1% of platform P1 from b1 is defined as b1 top ranking genes (b1’)
3 - b2 is scanned in rank order to identify genes matching the b1 top ranking genes b1’
4 - At each matching rank, a Fisher’s exact test is performed to assess the significance of the enrichment of b1’ in the scanned portion of b2 (b2’)
5 - When the b2 scan finishes, the best p-value is multiplied by a multiple testing correction factor; if the corrected p-value is worse than the p-value of a single Fisher’s exact test, the p-value of the single Fisher’s exact test is used
6 - Steps 2-5 are run in the reverse direction, so that b1 is scanned to evaluate enrichment of b2 top ranking genes
7 - Subset pair p-value is defined as the geometric mean of the p-values from both scans
8 - Subset pair enrichment score is the negative logarithm of the p-value with a sign reflecting the consistency of the subset directions
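The scan in steps 1 to 7 can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the Correlation Engine implementation: both biosets are assumed to share one platform of `universe` genes, the Fisher test is taken as one-sided (a hypergeometric upper tail), the multiple testing correction factor is assumed to be the number of tests performed, and the "single Fisher's exact test" is assumed to be the test against the whole scanned list. All function names are hypothetical.

```python
import math

def enrichment_p(hits, universe, n_top, depth):
    """One-sided Fisher's exact p-value, P(X >= hits), for `depth` genes drawn
    from `universe` genes of which `n_top` are top-ranking genes."""
    total = math.comb(universe, depth)
    tail = sum(math.comb(n_top, k) * math.comb(universe - n_top, depth - k)
               for k in range(hits, min(n_top, depth) + 1))
    return tail / total

def scan(top_genes, ranked, universe):
    """Steps 3-5: scan `ranked` in rank order and test at every matching rank."""
    top = set(top_genes)
    hits, p_values = 0, []
    for depth, gene in enumerate(ranked, start=1):
        if gene in top:
            hits += 1
            p_values.append(enrichment_p(hits, universe, len(top), depth))
    if not p_values:
        return 1.0  # no overlap at all
    # ASSUMPTION: correction factor = number of Fisher tests performed.
    corrected = min(1.0, min(p_values) * len(p_values))
    # ASSUMPTION: the single test compares the top genes with the whole list.
    single = enrichment_p(hits, universe, len(top), len(ranked))
    return min(corrected, single)

def subset_pair_p(b1, b2, universe, top_fraction=0.01):
    """Steps 2, 6 and 7: scan both directions, then take the geometric mean."""
    n_top = max(1, int(universe * top_fraction))
    forward = scan(b1[:n_top], b2, universe)
    reverse = scan(b2[:n_top], b1, universe)
    return math.sqrt(forward * reverse)

# Toy ranked gene lists on a hypothetical 1000-gene platform.
universe = 1000
b1 = [f"g{i}" for i in range(50)]
b2 = [f"g{i}" for i in range(5)] + [f"g{i}" for i in range(500, 545)]
unrelated = [f"g{i}" for i in range(600, 650)]

p_similar = subset_pair_p(b1, b2, universe)
p_unrelated = subset_pair_p(b1, unrelated, universe)
print(p_similar, -math.log(p_similar))  # step 8: score is -ln(p)
print(p_unrelated)
```

Lists that share their top-ranked genes yield a very small subset pair p-value, while lists with no overlap return 1.0 and therefore a score of zero.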
Fisher’s exact test for b1’ vs b2’
How does the Correlation Engine score results?
Fisher's exact test
– Are there nonrandom associations between two categorical variables?
– Is there a nonrandom enrichment for genes common to both gene lists?
Contingency table
                         | In bioset b2’     | Not in bioset b2’                   | Totals
Mapped to bioset b1’     | b1’∩b2’           | b1’∩p2 - b1’∩b2’                    | b1’∩p2
Not mapped to bioset b1’ | b2’∩p1 - b1’∩b2’  | p1∩p2 - b1’∩p2 - b2’∩p1 + b1’∩b2’   | p1∩p2 - b1’∩p2
Totals                   | b2’∩p1            | p1∩p2 - b2’∩p1                      | p1∩p2
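The four cells of the contingency table follow directly from set intersections. The sketch below builds them from toy gene sets; the function name and the example gene identifiers are hypothetical, and b1’ and b2’ are assumed to be subsets of their own platforms p1 and p2.

```python
def contingency_table(b1_top, b2_scanned, p1, p2):
    """Return ((a, b), (c, d)) for Fisher's exact test of b1' versus b2'."""
    shared_platform = len(p1 & p2)     # p1 ∩ p2, the grand total
    b1_on_p2 = len(b1_top & p2)        # b1' ∩ p2, first row total
    b2_on_p1 = len(b2_scanned & p1)    # b2' ∩ p1, first column total
    a = len(b1_top & b2_scanned)       # b1' ∩ b2'
    b = b1_on_p2 - a                   # in b1' but not in b2'
    c = b2_on_p1 - a                   # in b2' but not in b1'
    d = shared_platform - b1_on_p2 - b2_on_p1 + a  # in neither list
    return (a, b), (c, d)

# Toy platforms that overlap on genes g2..g9
p1 = {f"g{i}" for i in range(0, 10)}
p2 = {f"g{i}" for i in range(2, 12)}
b1_top = {"g0", "g2", "g3", "g4"}
b2_scanned = {"g3", "g4", "g5", "g11"}

table = contingency_table(b1_top, b2_scanned, p1, p2)
print(table)
```

The four cells always sum to the size of the shared platform p1∩p2, which is a quick consistency check on the table.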
Running Fisher vs. GSEA
Similarities
– Dynamically detects the most significant enrichment signal in a ranked signature set
Differences
– Running Fisher computes p-values iteratively in directional subsets by Fisher's exact test rather than by permutations
– More flexible in computing correlation scores for data of different sizes and filter thresholds
– Is more permissive/less stringent than GSEA in scoring correlations
$$\mathrm{ScoreConcept} = \frac{\mathrm{NormalizedBiosetCount}}{\mathrm{BackgroundCount} \times \mathrm{AverageWeightedRank}}$$
In a bioset-bioset correlation, the score is the negative log of the best p-value computed by the Running Fisher algorithm.
NormalizedBiosetCount is the sum of the scores of the query bioset against each bioset tagged with that concept, divided by the best score:
$$\mathrm{NormalizedBiosetCount} = \frac{\mathrm{Score}_1 + \mathrm{Score}_2 + \mathrm{Score}_3 + \dots + \mathrm{Score}_n}{\mathrm{Score}_{\mathrm{best}}}$$
– Value range is 1 to n
– More scores increase the value
– Highly concordant scores increase the value, with limit n
– Discordant scores minimize the added value of additional scores
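– If a bioset-bioset score is below the minimum cut-off, it is not counted. The cut-off is defined as a fraction of the maximum score, $\mathrm{Score}_{\mathrm{Max}}$ (the AverageWeightedRank slide gives 1E-6 as the imposed cutoff relative to a maximum score of 1).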
$$\mathrm{ScoreConcept} = \frac{\mathrm{NormalizedBiosetCount}}{\mathrm{BackgroundCount} \times \mathrm{AverageWeightedRank}}$$
BackgroundCount is the number of biosets tagged with the term
– Normalization step that reduces bias towards popular concepts
– Prevents a concept with a large number of low scores from surpassing concepts with relatively few biosets with excellent scores
– If there are n biosets with identical qualifying scores, then
$$\frac{\mathrm{NormalizedBiosetCount}}{\mathrm{BackgroundCount}} = \frac{n}{n} = 1$$
AverageWeightedRank is the average rank of tagged biosets relative to all other correlated biosets
– All biosets that are tagged with the concept are scored against all other correlated biosets and ranked
– Concept biosets that rank highly with strong scores will trend closer down to a value of 1. Concept biosets with a range of ranks and low scores will inflate this factor, decreasing overall score.
AverageWeightedRank continued
$$\mathrm{AverageWeightedRank} = \frac{1}{N}\left(\frac{\mathrm{Rank}_1}{\mathrm{Score}_1} + \frac{\mathrm{Rank}_2}{\mathrm{Score}_2} + \dots + \frac{\mathrm{Rank}_N}{\mathrm{Score}_N}\right)$$
– Rank = bioset rank against all other biosets (1, 4, 6, 1000, …)
– The maximum score is set to 1 and all other scores are scaled proportionally, down to 1E-6, the imposed cutoff for an accepted bioset (cutoff reference: “Computation of New Scores for Categories Incorporating both Scaled Log Scores and Authority Level”)
– Concept biosets that rank highly with strong scores will trend down towards a value of 1, improving the ScoreConcept. Concept biosets with a range of ranks and low scores will inflate this factor, decreasing the overall ScoreConcept.
Simplified explanation:
$$\mathrm{ScoreConcept} = \frac{\mathrm{NormalizedBiosetCount}}{\mathrm{BackgroundCount} \times \mathrm{AverageWeightedRank}}$$
– NormalizedBiosetCount: sum of the ratio of scores to the best score. Range is 1 to n, where n = theoretical max = number of biosets
– BackgroundCount: number of biosets scored, n
– AverageWeightedRank: the smaller the rank number, the better the ScoreConcept. The weighted ranks are improved (decreased) proportionally by stronger associated scores.
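The three factors can be combined in a short Python sketch. Several details are assumptions made for illustration rather than statements about the product: scores are scaled to the best score among the concept's own biosets, the 1E-6 cutoff is applied to the scaled scores, and BackgroundCount defaults to the number of biosets that pass the cutoff unless a value is supplied. The function name `concept_score` is hypothetical.

```python
def concept_score(tagged, background_count=None, cutoff=1e-6):
    """tagged: list of (rank, score) pairs for biosets tagged with one concept.
    rank is the bioset's rank among all correlated biosets; score is -ln(p)."""
    best = max(score for _, score in tagged)
    # Scale so the best score is 1, then drop biosets below the cutoff.
    scaled = [(rank, score / best) for rank, score in tagged]
    kept = [(rank, s) for rank, s in scaled if s >= cutoff]

    normalized_bioset_count = sum(s for _, s in kept)   # between 1 and n
    if background_count is None:
        background_count = len(kept)                    # ASSUMPTION, see lead-in
    # Weak scores (s close to 0) inflate the weighted rank of a bioset.
    average_weighted_rank = sum(rank / s for rank, s in kept) / len(kept)

    return normalized_bioset_count / (background_count * average_weighted_rank)

# Two high-ranking, strongly scoring biosets versus two low-ranking weak ones.
strong = concept_score([(1, 100.0), (2, 90.0)])
weak = concept_score([(50, 10.0), (400, 1.0)])
print(strong, weak)

# n identical scores: NormalizedBiosetCount / BackgroundCount = n / n = 1,
# so the concept score reduces to 1 / AverageWeightedRank.
print(concept_score([(1, 50.0), (2, 50.0), (3, 50.0)]))
```

In this sketch a single bioset at rank 1 gives the maximum value of 1, and every other configuration scores lower as ranks grow or scores weaken.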
Scoring concept terms in Disease, Pharmaco and Knockdown Atlases
Meta-Analysis Scoring Example
Breast Cancer: Distinctions between Clinical Phenotypes
Breast Cancer: Clinically-informed cancer genomics research
ER- vs ER+, PR- vs PR+, triple - vs triple +, HER2+ vs HER2-
Top 4 genes down-regulated in 3 phenotypes, up-regulated in 1 phenotype
Top 3 genes up-regulated in 3 phenotypes, down-regulated in 1 phenotype
Top 1 gene (ESR1) down-regulated in all 4 phenotypes
https://enterprise.ussc.informatics.illumina.com/c/search/adv.nb?ids=86347%2C13883%2C85801%2C68605
The top-scoring Meta-Analysis gene is determined from the individual scores of each gene across the selected biosets (based upon the highest ranking feature for a gene).
The Overall Gene Score is the sum of the Bioset gene scores. The top meta-analysis Overall Gene Score is then normalized to 100 in the UI, and the lower ranking level 1 gene scores in the UI are scaled linearly (the export report contains the underlying numerical details, as in the following slides).
If directionality filters are set, the Overall Gene Score sums the matching Bioset gene scores and subtracts the mismatching Bioset gene scores (there is no reward or penalty for matching or mismatching Biosets set to ‘Absent’). The top meta-analysis Overall Gene Score is then normalized to 100, and lower ranking meta-analysis genes are scaled accordingly (the export report contains numerical details).
The UI presents the highest specificity genes first (in certain cases, genes present in 3 out of 4 biosets may outscore genes with 4 out of 4 specificity).
Individual bioset scores show up in the level 2 UI results. These scores are not scaled; they show the original scores for a gene within bioset #1, #2, #3, etc. The individual scores contribute to the meta score. Within a single bioset, the top gene typically gets a rank of 1 and a score very close to 100.
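The scoring rules above can be sketched as follows. The gene names and score values are hypothetical and chosen only to make the arithmetic easy to follow; the function names are illustrative, and the sketch assumes the top Overall Gene Score is positive.

```python
def overall_gene_score(bioset_results, filters=None):
    """bioset_results: one (score, direction) pair per selected bioset.
    filters: optional list of 'up', 'down' or 'absent', one per bioset."""
    total = 0.0
    for index, (score, direction) in enumerate(bioset_results):
        wanted = filters[index] if filters else None
        if wanted is None:
            total += score      # no directional filter: plain sum
        elif wanted == "absent":
            continue            # no reward or penalty for 'Absent' biosets
        elif direction == wanted:
            total += score      # direction matches the filter
        else:
            total -= score      # direction mismatches the filter
    return total

def scale_to_100(overall_scores):
    """Set the top Overall Gene Score to 100 and scale the rest linearly."""
    top = max(overall_scores.values())
    return {gene: 100.0 * score / top for gene, score in overall_scores.items()}

# Hypothetical per-bioset (score, direction) results for two genes.
genes = {
    "GENE_A": [(100.0, "up"), (90.0, "up"), (80.0, "up"), (70.0, "down")],
    "GENE_B": [(100.0, "up"), (90.0, "up"), (80.0, "up"), (70.0, "up")],
}
filters = ["up", "up", "up", "down"]

overall = {gene: overall_gene_score(res, filters) for gene, res in genes.items()}
ui_scores = scale_to_100(overall)
print(overall)     # GENE_B has one mismatch, so its Bioset 4 score is subtracted
print(ui_scores)
```

With no filters the same function reduces to a plain sum of the bioset scores, which matches the unfiltered case described above.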
Export Report contains numerical details for Meta-Analysis Gene Scoring
No directional filters set
Scale top result to 100 and normalize all others for presentation in UI as ‘Score’
Meta-Analysis Gene Scoring Examples with Directional Filters Set
Directional Filters set: No mismatches (Bs1-up, Bs2-up, Bs3-up, Bs4-down)
– Overall Gene Score = Score Bioset 1 + Score Bioset 2 + Score Bioset 3 + Score Bioset 4
– Normalize to top ranked Overall Gene Score
Directional Filters set: 1 mismatch in Bs1-up, Bs2-up, Bs3-up, Bs4-down
– Overall Gene Score = Score Bioset 1 + Score Bioset 2 + Score Bioset 3 - Score Bioset 4
– Normalize to top ranked Overall Gene Score
Directional Filters set: 1 mismatch (Bs1-down, Bs2-down, Bs3-up, Bs4-absent)
– Overall Gene Score = Score Bioset 1 + Score Bioset 2 - Score Bioset 3 (no reward for a match in Bs4)
– Normalize to top ranked Overall Gene Score
Score values shown in the slide figure (as extracted): 207.7635784; 99.96134599.51491026; 393.1234968376.0915713
https://enterprise.nextbio.com/c/search/adv.nb?ids=860254,832867,205168,224836,65254,710162,59051,860242,832846,596584,124588,661607
Gene expression of psoriasis skin lesions from 7 independent ‘disease _vs_ normal’ biosets shows reversal of RNA expression compared to 5 biologic ‘treated _vs_ untreated’ biosets
Curated Studies > Bioset Query > Meta-Analysis Gene Results
7 Disease vs Normal RNA experiments: GSE51440, GSE53552, GSE18686, GSE2737, GSE14905, GSE30999, GSE13355
5 Treated vs Untreated RNA experiments: GSE51440, GSE53552, GSE31652, GSE11903, GSE30768
Guselkumab::anti-IL23 (Janssen)
Brodalumab::anti-IL17R (Amgen/AZ)
Etanercept::anti-TNF (Amgen)
Ixekizumab::LY2439821::anti-IL17A antibody (Lilly)
Efalizumab::anti-CD11a (Genentech/Merck Serono)
https://enterprise.nextbio.com/c/search/advBg.nb?ids=860254,832867,205168,224836,65254,710162,59051,860242,832846,596584,124588,661607
Gene expression of psoriasis skin lesions from ‘disease _vs_ normal’ biosets shows reversal of RNA expression compared to 5 biologic ‘treated _vs_ untreated’ biosets
Curated Studies > Bioset Query > Meta-Analysis Biogroup Results
7 Disease vs Normal RNA experiments: GSE51440, GSE53552, GSE18686, GSE2737, GSE14905, GSE30999, GSE13355
5 Treated vs Untreated RNA experiments: GSE51440, GSE53552, GSE31652, GSE11903, GSE30768
Guselkumab::anti-IL23 (Janssen)
Brodalumab::anti-IL17R (Amgen/AZ)
Etanercept::anti-TNF (Amgen)
Ixekizumab::LY2439821::anti-IL17A antibody (Lilly)
Efalizumab::anti-CD11a (Genentech/Merck Serono)