Curation Process Bioset-Bioset Correlation Meta-Analysis...

© 2013 Illumina, Inc. All rights reserved. Illumina, IlluminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, NuPCR, SeqMonitor, Solexa, TruSeq, TruSight, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE

BaseSpace Correlation Engine: Curation Process, Bioset-Bioset Correlation & Meta-Analysis Calculations


Largest Collection of Curated Genomic Data
- Preprocessed, Normalized, and Tagged Datasets

Easy To Use Web Platform For Exploring It
- Billions of associations and correlations pre-computed
- Rich Biomedical Ontology and Semantic Structure

Applications To Correlate Private & Public Data

Secure Collaboration & Data Sharing

SaaS platform – Big Data Frameworks

API Suite for integrations to and from internal resources

Illumina’s Enterprise Informatics mission is to integrate genomic information to help researchers and clinicians develop unprecedented biomedical insights.

Aggregate, Curate, Correlate & Integrate


Correlation Engine - Public Content is Added Continuously: over 22,000 studies & over 169,000 molecular lists (26 March 2018)

Data Types: comparisons at gene symbol level enable different molecular data type comparisons

Species: ortholog gene clusters enable cross-species comparisons


Correlation Engine: Aggregate, Curate

Processed & Integrated: > 22,000 Studies, > 169,000 Bioset and Biogroup lists, 13 Species, 12 data types


NextBio Research: Correlate


Example: Bioset versus Bioset Scoring in Curated Studies app

‘Disease _vs_ Normal’ list compared with ‘Treatment _vs_ Control’ list

Anti-TNF Etanercept treatment reverses expression of psoriasis disease profiles. Opportunities arise to deeply investigate:

• specific gene activities
• divergent pathways
• on-target and off-target effects

https://enterprise.ussc.informatics.illumina.com/c/search/as/?type=bioset&id=65254#tf=Etanercept


Explanation – Step 1

• Subset pair enrichment score is the negative natural logarithm of the p-value, with a sign reflecting the consistency of the subset directions

Positive correlations (subsets in the same direction, sign = +)

• score for b1+b2+ is –ln(0.999) = 0.001
• score for b1-b2- is –ln(1) = 0

Negative correlations (subsets in opposite directions, sign = -)

• score for b1+b2- is –ln(1.7e-307) = 706.363
• score for b1-b2+ is –ln(8.1e-135) = 308.757

Explanation – Step 2

Subset scores are signed based on consistency of subset directions.
Positive correlation is the average of b1+b2+ and b1-b2- = score_positive
Negative correlation is the average of b1+b2- and b1-b2+ = score_negative

• Overall score is the sum of score_positive and score_negative
• Score_positive = (0.001 + 0) / 2 = 0.0005
• Score_negative = -(706.363 + 308.757) / 2 = -507.560
• Overall score = 0.0005 + (-507.560) ≈ -507.560, which means the correlation is negative, and the magnitude is equivalent to an Overlap p-value of exp(-507.560) = 3.712E-221, displayed as 3.7E-221 in the user interface.
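The arithmetic of Steps 1 and 2 can be checked with a short script. This is a minimal sketch that only reuses the four subset-pair p-values quoted in the example; the function and variable names are illustrative and are not part of the Correlation Engine.

```python
import math

def subset_pair_score(p_value, same_direction):
    """Signed score: -ln(p), positive when both subsets share a direction."""
    magnitude = -math.log(p_value)
    return magnitude if same_direction else -magnitude

# Subset-pair p-values from the Etanercept vs psoriasis example.
p_values = {
    ("+", "+"): 0.999,
    ("-", "-"): 1.0,
    ("+", "-"): 1.7e-307,
    ("-", "+"): 8.1e-135,
}

scores = {
    pair: subset_pair_score(p, pair[0] == pair[1])
    for pair, p in p_values.items()
}

# Average the concordant pairs and the discordant pairs separately.
score_positive = (scores[("+", "+")] + scores[("-", "-")]) / 2
score_negative = (scores[("+", "-")] + scores[("-", "+")]) / 2

overall = score_positive + score_negative
equivalent_p = math.exp(-abs(overall))  # magnitude expressed as a p-value

print(f"overall score = {overall:.3f}, equivalent p-value = {equivalent_p:.1e}")
```

The overall score comes out at about -507.56, so the correlation is negative, and the equivalent p-value is about 3.7E-221.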


http://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote-data-correlation-enrichment.pdf

Rank-Based Directional Enrichment

Processed experimental comparisons are first set to ranks:

• Absolute Fold Change ranking
• Lowest p-value ranking


Rank-Based Directional Enrichment


1. Each bioset is divided into directional subsets
2. Enrichment score for each subset pair is computed (details next slide)
3. Subset scores are signed based on consistency of subset directions
   Positive correlation is the average of b1+b2+ and b1-b2- = score_positive
   Negative correlation is the average of b1+b2- and b1-b2+ = score_negative
4. Overall score is the sum of score_positive and score_negative
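Step 1 of this list, together with the ranking described earlier, can be sketched as follows. The sketch assumes ranking by absolute fold change; the `Feature` type, the gene names, and the fold-change values are hypothetical.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    gene: str
    fold_change: float  # signed: > 0 is up-regulated, < 0 is down-regulated

def directional_subsets(bioset):
    """Rank features by absolute fold change, then split them by direction."""
    ranked = sorted(bioset, key=lambda f: abs(f.fold_change), reverse=True)
    up = [f.gene for f in ranked if f.fold_change > 0]
    down = [f.gene for f in ranked if f.fold_change < 0]
    return up, down

# Hypothetical bioset; the values are made up for illustration.
b1 = [
    Feature("G1", 2.5),
    Feature("G2", -8.0),
    Feature("G3", 4.0),
    Feature("G4", -1.5),
]
b1_up, b1_down = directional_subsets(b1)
print(b1_up, b1_down)
```

Two biosets split this way give the four subset pairs scored in steps 2 and 3: b1+b2+, b1-b2-, b1+b2-, and b1-b2+.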


Rank-Based Directional Enrichment

1 - Both b1 and b2 are sorted by ranks
2 - Up to 1% of platform P1 from b1 is defined as b1 top ranking genes (b1’)
3 - b2 is scanned in rank order to identify matching genes with b1 top ranking genes b1’
4 - At each matching rank, a Fisher’s exact test is performed to assess the significance of the enrichment of b1’ in the b2 scanned portion b2’
5 - When the b2 scan finishes, the best p-value is multiplied by a multiple testing correction factor; if the multiple testing corrected p-value is worse than the p-value of a single Fisher’s exact test, the p-value of the single Fisher’s exact test is used
6 - Steps 2-5 are run in the reverse direction, so that b1 is scanned to evaluate enrichment of b2 top ranking genes
7 - Subset pair p-value is defined as the geometric mean of the p-values from both scans
8 - Subset pair enrichment score is the negative logarithm of the p-value with a sign reflecting the consistency of the subset directions
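Steps 2 to 7 can be sketched in plain Python. This is not the Correlation Engine implementation. It makes several assumptions that the slides do not state: the Fisher test is one-sided (enrichment only), the correction factor is the number of tests performed in the scan, the "single Fisher's exact test" is the test over the whole scanned list, and every gene is measured on both platforms so one universe size serves both directions. All function names are illustrative.

```python
import math

def fisher_enrichment_p(overlap, size_top, size_scanned, universe):
    """One-sided Fisher's exact test: P(overlap or more) under the null."""
    tail = sum(
        math.comb(size_top, i) * math.comb(universe - size_top, size_scanned - i)
        for i in range(overlap, min(size_top, size_scanned) + 1)
    )
    return tail / math.comb(universe, size_scanned)

def top_ranked(ranked, platform_size, fraction=0.01):
    """Step 2: up to 1% of the platform (at least one gene, an assumption)."""
    return ranked[: max(1, int(platform_size * fraction))]

def scan(top_genes, ranked_other, universe):
    """Steps 3-5: test enrichment at every rank where a top gene matches."""
    top = set(top_genes)
    best_p, matches, tests = 1.0, 0, 0
    for depth, gene in enumerate(ranked_other, start=1):
        if gene in top:
            matches += 1
            tests += 1
            p = fisher_enrichment_p(matches, len(top), depth, universe)
            best_p = min(best_p, p)
    # Assumed correction factor: the number of tests performed in the scan.
    corrected = min(1.0, best_p * max(tests, 1))
    # Assumed "single test": enrichment over the whole scanned list.
    single = fisher_enrichment_p(matches, len(top), len(ranked_other), universe)
    return min(corrected, single)

def subset_pair_p(b1, b2, p1_size, p2_size, universe):
    """Steps 6-7: scan in both directions and take the geometric mean."""
    p_forward = scan(top_ranked(b1, p1_size), b2, universe)
    p_reverse = scan(top_ranked(b2, p2_size), b1, universe)
    return math.sqrt(p_forward * p_reverse)

# Hypothetical ranked subsets on a 1000-gene platform.
genes = [f"g{i}" for i in range(50)]
p_identical = subset_pair_p(genes, genes, 1000, 1000, 1000)
p_disjoint = subset_pair_p(genes, [f"h{i}" for i in range(50)], 1000, 1000, 1000)
print(p_identical, p_disjoint)
```

Identical rankings give a very small p-value, while subsets with no shared genes give a p-value of 1. Step 8 then turns the p-value into a signed score with `-math.log(p)`.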



How does the Correlation Engine score results?

Fisher’s exact test for b1’ vs b2’
– Are there nonrandom associations between two categorical variables?
– Is there a nonrandom enrichment for genes common to both gene lists?

Contingency table (each entry is a gene count; p1 and p2 are the platforms of b1 and b2)

                          | In bioset b2’     | Not in bioset b2’                   | Totals
Mapped to bioset b1’      | b1’∩b2’           | b1’∩p2 - b1’∩b2’                    | b1’∩p2
Not mapped to bioset b1’  | b2’∩p1 - b1’∩b2’  | p1∩p2 - b1’∩p2 - b2’∩p1 + b1’∩b2’   | p1∩p2 - b1’∩p2
Totals                    | b2’∩p1            | p1∩p2 - b2’∩p1                      | p1∩p2
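The cells of the contingency table can be derived from gene sets as in the sketch below. It assumes that b1’ lies within platform p1 and b2’ within platform p2, and that the test is one-sided; the slides state neither. The gene identifiers are arbitrary integers chosen for illustration.

```python
import math

def contingency_cells(b1_top, b2_scanned, p1, p2):
    """Return the four inner cells (a, b, c, d) of the table from gene sets."""
    both = len(b1_top & b2_scanned)   # b1' ∩ b2'
    b1_on_p2 = len(b1_top & p2)       # b1' ∩ p2 (first row total)
    b2_on_p1 = len(b2_scanned & p1)   # b2' ∩ p1 (first column total)
    universe = len(p1 & p2)           # p1 ∩ p2 (grand total)
    return (
        both,
        b1_on_p2 - both,
        b2_on_p1 - both,
        universe - b1_on_p2 - b2_on_p1 + both,
    )

def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact p-value: overlap of at least a."""
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(
        math.comb(row1, i) * math.comb(n - row1, col1 - i)
        for i in range(a, min(row1, col1) + 1)
    )
    return tail / math.comb(n, col1)

# Hypothetical platforms and subsets.
p1 = set(range(100))          # genes measured on platform 1
p2 = set(range(50, 150))      # genes measured on platform 2
b1_top = {0, 1, 50, 51, 52, 53}
b2_scanned = {52, 53, 54, 120}

cells = contingency_cells(b1_top, b2_scanned, p1, p2)
p_value = fisher_right_tail(*cells)
print(cells, p_value)
```

Only genes present on both platforms enter the table, which is why every margin is intersected with p1 or p2.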


Running Fisher vs. GSEA

Similarities
– Dynamically detects the most significant enrichment signal in a ranked signature set

Differences
– Running Fisher computes p-values iteratively in directional subsets by Fisher’s exact test rather than by permutations
– More flexible in computing correlation scores for data of different sizes and filter thresholds
– Is more permissive/less stringent than GSEA in scoring correlations


$$\mathrm{Score}_{\mathrm{concept}} = \frac{\mathrm{NormalizedBiosetCount}}{\mathrm{BackgroundCount} \times \mathrm{AverageWeightedRank}}$$

In a bioset-bioset correlation, the score is the negative log of the best p-value computed by the Running Fisher algorithm.

NormalizedBiosetCount is the sum of the scores of the query bioset against each bioset tagged with that concept, divided by the best score:

$$\mathrm{NormalizedBiosetCount} = \frac{\mathrm{Score}_1 + \mathrm{Score}_2 + \mathrm{Score}_3 + \dots + \mathrm{Score}_n}{\mathrm{Score}_{\mathrm{best}}}$$

– Value range will be 1 – n
  • More scores increase the value
  • Highly concordant scores increase this value, with limit n
  • Discordant scores minimize the added value of additional scores

– If the bioset-bioset score is below the minimum cut-off, it is not counted

$$\text{Cut-off} = \frac{\mathrm{Score}_{\mathrm{max}}}{1\mathrm{E}6}$$
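A minimal sketch of NormalizedBiosetCount, assuming the cut-off is the best score divided by 1E6 and that the list of scores is non-empty. The function name and the example scores are hypothetical.

```python
def normalized_bioset_count(scores, cutoff_ratio=1e-6):
    """Sum of qualifying scores divided by the best score."""
    best = max(scores)
    cutoff = best * cutoff_ratio  # assumed cut-off: best score / 1E6
    return sum(s for s in scores if s >= cutoff) / best

# Three concordant scores reach the limit n = 3.
concordant = normalized_bioset_count([100.0, 100.0, 100.0])

# One strong score and two weak ones add little beyond 1.
discordant = normalized_bioset_count([100.0, 1.0, 1.0])

print(concordant, discordant)
```

The two calls show the range described above: n identical scores give n, while scores far below the best one barely raise the value above 1.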


$$\mathrm{Score}_{\mathrm{concept}} = \frac{\mathrm{NormalizedBiosetCount}}{\mathrm{BackgroundCount} \times \mathrm{AverageWeightedRank}}$$

BackgroundCount is the number of biosets tagged with the term
– Normalization step that reduces bias towards popular concepts. Prevents a concept with a large number of low scores from surpassing concepts with relatively few biosets with excellent scores
– If there are n biosets with identical qualifying scores, then

$$\frac{\mathrm{NormalizedBiosetCount}}{\mathrm{BackgroundCount}} = \frac{n}{n} = 1$$

AverageWeightedRank is the average rank of tagged biosets relative to all other correlated biosets

– All biosets that are tagged with the concept are scored against all other correlated biosets and ranked

– Concept biosets that rank highly with strong scores will trend closer down to a value of 1. Concept biosets with a range of ranks and low scores will inflate this factor, decreasing overall score.


AverageWeightedRank continued

$$\mathrm{AverageWeightedRank} = \frac{\dfrac{\mathrm{Rank}_1}{\mathrm{Score}_1} + \dfrac{\mathrm{Rank}_2}{\mathrm{Score}_2} + \dots + \dfrac{\mathrm{Rank}_n}{\mathrm{Score}_n}}{N}$$

– Rank = bioset rank against all other biosets (1, 4, 6, 1000, …)

– Max score is set to 1, and all other scores are scaled proportionally, down to 1E-6, the imposed cutoff for an accepted bioset (cutoff reference “Computation of New Scores for Categories Incorporating both Scaled Log Scores and Authority Level”)

– Concept biosets that rank highly with strong scores will trend closer down to a value of 1, improving the concept score. Concept biosets with a range of ranks and low scores will inflate this factor, decreasing the overall concept score.

Simplified explanation:

$$\mathrm{Score}_{\mathrm{concept}} = \frac{\mathrm{NormalizedBiosetCount}}{\mathrm{BackgroundCount} \times \mathrm{AverageWeightedRank}}$$

– NormalizedBiosetCount: sum of the ratio of scores to the best score. Range is 1 – n, where n = theoretical max = number of biosets
– BackgroundCount: number of biosets scored, n
– AverageWeightedRank: the smaller the rank number, the better the concept score. The weighted ranks are improved (decreased) proportionally by stronger associated scores.
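The three factors can be combined in a short sketch. It assumes that scores are scaled to the best score among the concept's own biosets, that N in AverageWeightedRank is the number of qualifying biosets, and that the cut-off is 1E-6 of the best score. The function name, ranks, and scores are hypothetical.

```python
def concept_score(tagged, background_count, cutoff_ratio=1e-6):
    """tagged: (rank, score) pairs for the biosets tagged with the concept."""
    best = max(score for _, score in tagged)
    # Scale scores so the best one is 1 and drop those below the cut-off.
    kept = [
        (rank, score / best)
        for rank, score in tagged
        if score >= best * cutoff_ratio
    ]
    normalized_count = sum(score for _, score in kept)
    # Each rank is divided by its scaled score, so weak scores inflate it.
    average_weighted_rank = sum(rank / score for rank, score in kept) / len(kept)
    return normalized_count / (background_count * average_weighted_rank)

# Concept with two top-ranked, strongly scoring biosets.
few_excellent = concept_score([(1, 100.0), (2, 95.0)], background_count=2)

# Concept with twenty biosets of equal score that sit lower in the ranking.
many_lower = concept_score(
    [(rank, 20.0) for rank in range(10, 30)], background_count=20
)

print(few_excellent, many_lower)
```

In this example the concept with fewer but better-ranked biosets scores higher, because BackgroundCount cancels the advantage of having many biosets and AverageWeightedRank penalizes the lower ranks.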



Scoring concept terms in Disease, Pharmaco and Knockdown Atlases



Meta-Analysis Scoring Example
Breast Cancer: Distinctions between Clinical Phenotype


Breast Cancer: Clinically-informed cancer genomics research
ER- vs ER+, PR- vs PR+, triple - vs triple +, HER2+ vs HER2-

Top 4 genes down-regulated in 3 phenotypes, up-regulated in 1 phenotype

Top 3 genes up-regulated in 3 phenotypes, down-regulated in 1 phenotype

Top 1 gene (ESR1) down-regulated in all 4 phenotypes

https://enterprise.ussc.informatics.illumina.com/c/search/adv.nb?ids=86347%2C13883%2C85801%2C68605


The Meta-Analysis gene score takes into account the individual scores of a gene across the selected biosets (based upon the highest ranking feature for a gene).

The Overall Gene Score is the sum of the Bioset gene scores. The top meta-analysis Overall Gene Score is then normalized to 100 in the UI, and the lower ranking level 1 gene scores in the UI are scaled linearly (the export report contains the underlying numerical details, as in the following slides).

If directionality filters are set, then the Overall Gene Score sums the matching Bioset gene scores and subtracts the mismatching Bioset gene scores (there is no reward or penalty for matching or mismatching Biosets set to ‘Absent’). The top meta-analysis Overall Gene Score is then normalized to 100, and lower ranking meta-analysis genes are scaled accordingly (the export report contains numerical details).

The UI presents the highest specificity genes first (in certain cases, genes present in 3 out of 4 biosets may outscore genes with 4 out of 4 specificity).

Individual bioset scores show up in level 2 UI results. These scores are not scaled; they show the original scores for a gene within bioset #1, #2, #3, etc. The individual scores contribute to the meta score; the top gene in a single bioset typically gets a rank of 1 and a score very close to 100.
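The scoring rules above can be sketched as follows. The data layout, the function names, the gene names, and all scores are hypothetical. The sketch also assumes that a bioset with no filter simply adds its score and that the top Overall Gene Score is positive.

```python
def overall_gene_score(gene_scores, filters=None):
    """gene_scores: {bioset: (direction, score)}.
    filters: {bioset: 'up' | 'down' | 'absent'}; missing means no filter."""
    filters = filters or {}
    total = 0.0
    for bioset, (direction, score) in gene_scores.items():
        wanted = filters.get(bioset)
        if wanted is None or wanted == direction:
            total += score      # no filter, or direction matches the filter
        elif wanted == "absent":
            continue            # no reward or penalty for 'Absent'
        else:
            total -= score      # direction mismatches the filter
    return total

def normalize_to_100(overall):
    """Scale so the top Overall Gene Score is 100 and others follow linearly."""
    top = max(overall.values())
    return {gene: 100.0 * score / top for gene, score in overall.items()}

filters = {"Bs1": "up", "Bs2": "up", "Bs3": "up", "Bs4": "down"}

# Hypothetical per-bioset directions and scores for two genes.
genes = {
    "GENE_A": {"Bs1": ("up", 100.0), "Bs2": ("up", 90.0),
               "Bs3": ("up", 80.0), "Bs4": ("down", 70.0)},  # no mismatches
    "GENE_B": {"Bs1": ("up", 100.0), "Bs2": ("up", 90.0),
               "Bs3": ("up", 80.0), "Bs4": ("up", 70.0)},    # 1 mismatch
}

overall = {gene: overall_gene_score(s, filters) for gene, s in genes.items()}
ui_scores = normalize_to_100(overall)
print(overall, ui_scores)
```

GENE_A matches all four filters and sums to 340, while GENE_B mismatches in Bs4 and drops to 200, which becomes about 58.8 after scaling to the top gene.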


Export Report contains numerical details for Meta-Analysis Gene Scoring: No directional Filters set

Scale top result to 100 and normalize all others for presentation in UI as ‘Score’


Meta-Analysis Gene Scoring Examples with Directional Filters Set

Directional Filters set: Bs1-up, Bs2-up, Bs3-up, Bs4-down
• No mismatches: Overall Gene Score = Score Bioset 1 + Score Bioset 2 + Score Bioset 3 + Score Bioset 4
• 1 mismatch (in Bs4): Overall Gene Score = Score Bioset 1 + Score Bioset 2 + Score Bioset 3 - Score Bioset 4

Directional Filters set: Bs1-down, Bs2-down, Bs3-up, Bs4-absent
• 1 mismatch (in Bs3): Overall Gene Score = Score Bioset 1 + Score Bioset 2 - Score Bioset 3 (no reward for a match in Bs4)

In each case, normalize to the top ranked Overall Gene Score.



https://enterprise.nextbio.com/c/search/adv.nb?ids=860254,832867,205168,224836,65254,710162,59051,860242,832846,596584,124588,661607

Gene expression of psoriasis skin lesions from 7 independent ‘disease _vs_ normal’ biosets shows reversal of RNA expression compared to 5 biologic ‘treated _vs_ untreated’ biosets

Curated Studies > Bioset Query > Meta-Analysis Gene Results

7 Disease vs Normal RNA experiments: GSE51440, GSE53552, GSE18686, GSE2737, GSE14905, GSE30999, GSE13355

5 Treated vs Untreated RNA experiments: GSE51440, GSE53552, GSE31652, GSE11903, GSE30768

• Guselkumab::anti-IL23 (Janssen)
• Brodalumab::anti-IL17R (Amgen/AZ)
• Etanercept::anti-TNF (Amgen)
• Ixekizumab::LY2439821::anti-IL17R antibody (Lilly)
• Efalizumab::anti-CD11a (Genentech/Merck Serono)


https://enterprise.nextbio.com/c/search/advBg.nb?ids=860254,832867,205168,224836,65254,710162,59051,860242,832846,596584,124588,661607

Gene expression of psoriasis skin lesions from ‘disease _vs_ normal’ biosets shows reversal of RNA expression compared to 5 biologic ‘treated _vs_ untreated’ biosets

Curated Studies > Bioset Query > Meta-Analysis Biogroup Results

7 Disease vs Normal RNA experiments: GSE51440, GSE53552, GSE18686, GSE2737, GSE14905, GSE30999, GSE13355

5 Treated vs Untreated RNA experiments: GSE51440, GSE53552, GSE31652, GSE11903, GSE30768

• Guselkumab::anti-IL23 (Janssen)
• Brodalumab::anti-IL17R (Amgen/AZ)
• Etanercept::anti-TNF (Amgen)
• Ixekizumab::LY2439821::anti-IL17R antibody (Lilly)
• Efalizumab::anti-CD11a (Genentech/Merck Serono)

Date post:	02-Oct-2020