Statistical Data Fusion to Prioritize Lists of Genes

Bert Coessens, Stein Aerts

Departement ESAT - SCD, Katholieke Universiteit Leuven

Promotor: Bart De Moor
Assessor: Yves Moreau

Database Issues in Biological Databases (DBiBD), January 8-9, 2005

Context

Linkage Analysis

Positional Cloning

NEFL

RAB7

GARS

GIB1

LMNA

High-throughput technologies

Concept

Pathology / Biological process / …

- Gene Expression
- Literature
- Anatomical Expression
- Gene Regulation
- Protein Domains
- Functional Annotation
- Evolutionary Conservation
- …

Concept

Model with multiple submodels

Training genes

Training set

Choose submodels

TRAIN

Candidate genes

Test set

One ranking for each submodel

Combined ranking

Order statistics

SCORE

gene i

Order Statistics

Given a set of n rank ratios for gene i

- what is the probability of getting these ratios by chance alone?

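Joint probability density function of all n order statistics, integrated up to the observed rank ratios:

$$Q(r_1, r_2, \ldots, r_n) = n! \int_0^{r_1} \int_{s_1}^{r_2} \cdots \int_{s_{n-1}}^{r_n} ds_n \, ds_{n-1} \cdots ds_1$$

Recursive formula, with $V_0 = 1$ and $Q(r_1, r_2, \ldots, r_n) = n! \, V_n$:

$$V_k = \sum_{i=1}^{k} (-1)^{i-1} \frac{V_{k-i}}{i!} \, r_{n-k+1}^{i}$$

Complexity $O(n^2)$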
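
The recursion for Q can be sketched in a few lines of Python. This is a minimal sketch and not the Endeavour implementation: the function name `q_statistic` is hypothetical, and the code assumes that the rank ratios are sorted in ascending order before the formula is applied, that V_0 = 1, and that Q = n! V_n.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Evaluate Q(r_1, ..., r_n) with the O(n^2) recursion on V_k.

    Assumption: rank ratios lie in (0, 1] and are sorted ascending here.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0 = 1 (assumed starting value of the recursion)
    for k in range(1, n + 1):
        r_k = r[n - k]  # r_{n-k+1} in the 1-based notation of the formula
        v_k = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * r_k ** i
            for i in range(1, k + 1)
        )
        v.append(v_k)
    return factorial(n) * v[n]


if __name__ == "__main__":
    # Hypothetical rank ratios of one gene in three submodel rankings.
    print(q_statistic([0.05, 0.20, 0.10]))
```

As a sanity check, for two rank ratios the integral reduces to 2·r1·r2 − r1², so two ratios of 0.5 give 0.25, and rank ratios that are all equal to 1 give Q = 1.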

Statistical Validation: Setup

29 lists of disease genes from OMIM

5 lists of random genes from the human genome

For each disease or random gene set do:
  For each gene in the set do:
    a. Leave one gene out
    b. TRAIN all submodels on the set minus the left-out gene
    c. Create a test set by adding the left-out gene to [9, 49, 99] random genes
    d. SCORE the test set with all trained submodels
    e. RANK the genes in the test set according to their order statistics p-value
  end
end

Calculate for a certain cut-off x the number of
- TP: number of left-out genes ranked above x
- FP: number of genes other than the left-out gene ranked above x
- TN: number of genes other than the left-out gene ranked below x
- FN: number of left-out genes ranked below x

Calculate sensitivity and specificity using the above-mentioned values, plot (1 - specificity) versus sensitivity to obtain a Rank ROC plot, and calculate the area under the curve.
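
The counting scheme above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used for the validation: the function name `rank_roc` is hypothetical, "ranked above x" is read as "within the top x positions", all validation runs are assumed to use the same test set size (at least 2), and the area is computed with the trapezoidal rule.

```python
def rank_roc(left_out_ranks, test_set_size):
    """Return Rank ROC points (1 - specificity, sensitivity) and the AUC.

    left_out_ranks: rank (1 = best) of the left-out gene in each validation run.
    test_set_size:  number of genes per test set (left-out gene + random genes).
    """
    runs = len(left_out_ranks)
    points = [(0.0, 0.0)]
    for x in range(1, test_set_size + 1):
        tp = sum(1 for rank in left_out_ranks if rank <= x)  # left-out genes in top x
        fn = runs - tp                                       # left-out genes below x
        fp = runs * x - tp                                   # other genes in top x
        tn = runs * (test_set_size - x) - fn                 # other genes below x
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((1.0 - specificity, sensitivity))
    # Trapezoidal rule over consecutive ROC points.
    auc = sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return points, auc


if __name__ == "__main__":
    # Hypothetical ranks of four left-out genes in test sets of 10 genes.
    _, auc = rank_roc([1, 2, 1, 5], test_set_size=10)
    print(auc)
```

A perfect prioritization (every left-out gene ranked first) gives an area of 1, and a left-out gene that always ends up last gives an area of 0.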

Statistical Validation: Disease genes

- 29 human diseases (OMIM) = 29 gene sets
- 627 disease genes with Ensembl identifier in total
- average gene set contains 19 genes
- smallest gene set = ALS with 4 genes
- largest gene set = leukemia with 113 genes

Statistical Validation: Submodels

Textual data: TXTGate

Sequence similarity: BLAST

Rank genes according to e-value

Example: Presenilin 1 vs. Presenilin 2, e-value = $10^{-133}$

Statistical Validation: Submodels

Functional annotation: GO

Functional annotation: KEGG

Set of genes: GO IDs, observed frequencies

Full genome: GO IDs, expected frequencies

Statistical Validation: Submodels

Protein information: InterPro

Protein information: BIND

Training genes + interaction partners

Test gene + interaction partners

Overlap?
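
The slide only asks whether the two groups overlap; how the overlap is turned into a score is not stated. The sketch below is therefore an assumption throughout: it counts the shared genes and reports them as a fraction of the test gene's group. The function name `partner_overlap`, the `partners` dictionary (a stand-in for a BIND lookup), and the gene identifiers are all hypothetical.

```python
def partner_overlap(training_genes, test_gene, partners):
    """Compare (training genes + partners) with (test gene + partners).

    partners: dict mapping a gene to the set of its interaction partners
              (hypothetical stand-in for interaction data such as BIND).
    Returns (number of shared genes, fraction of the test-gene group shared).
    """
    training_group = set(training_genes)
    for gene in training_genes:
        training_group |= partners.get(gene, set())
    test_group = {test_gene} | partners.get(test_gene, set())
    shared = training_group & test_group
    return len(shared), len(shared) / len(test_group)


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    partners = {
        "geneA": {"geneX", "geneY"},
        "geneB": {"geneY", "geneZ"},
        "geneT": {"geneY", "geneQ"},
    }
    print(partner_overlap(["geneA", "geneB"], "geneT", partners))
```

Other overlap measures, for example one normalized by the size of both groups, would fit the same slide equally well.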

Statistical Validation: Submodels

Gene expression: Microarray data

Gene expression: ESTs

- Model is average expression profile of training genes
- Score test gene by calculating Pearson correlation

Human gene expression atlas: Su et al.
47 normal human tissues
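
The expression submodel described above can be sketched as follows. This is a minimal sketch, not the Endeavour code: the function names are hypothetical, each profile is assumed to be a list of expression values over the same tissues in the same order, and profiles with zero variance are not handled.

```python
from math import sqrt


def average_profile(profiles):
    """Model: element-wise mean of the training genes' expression profiles."""
    count = len(profiles)
    return [sum(values) / count for values in zip(*profiles)]


def pearson(x, y):
    """Pearson correlation of two equally long profiles (non-zero variance assumed)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def score_test_gene(training_profiles, test_profile):
    """Score a test gene by its correlation with the average training profile."""
    return pearson(average_profile(training_profiles), test_profile)


if __name__ == "__main__":
    # Made-up expression values over three tissues.
    training = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
    print(score_test_gene(training, [10.0, 20.0, 30.0]))
```

Test genes can then be ranked by this score, with the highest correlation first.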

Statistical Validation: Submodels

Cis-regulatory elements: TFBSs

Cis-regulatory elements: TFBS modules

- Check human-mouse CNS blocks in upstream sequence of a test gene

- Compare found motifs with motifs in training set

ModuleSearcher: searches for the best combination of 3 TFs in the 300 bp upstream sequence (US) of the genes in the training set

ModuleScanner: scores the test gene with the model

Statistical Validation: Similarity

Statistical meta-analysis

Vector-based similarity

Fisher's method
Assume there are m independent tests of $H_0$.
1. For the i-th test calculate the corresponding p-value, $p_i$.
2. If $p_i$ has a uniform distribution on [0,1], then $-2\sum_{i=1}^{m} \log p_i$ has a $\chi^2$ distribution with 2m degrees of freedom.
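
Fisher's method can be sketched with the standard library alone. This is an illustrative sketch: the function name `fisher_combined_p` is hypothetical, the p-values are assumed to be independent and strictly positive, and the tail probability uses the closed form that holds for a chi-square distribution with an even number of degrees of freedom (2m).

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine m independent p-values with Fisher's method.

    Returns (statistic, combined p-value), where the statistic is
    -2 * sum(log p_i) and the combined p-value is the upper tail of a
    chi-square distribution with 2m degrees of freedom.
    """
    m = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    half = statistic / 2.0
    # For 2m degrees of freedom: P(X > x) = exp(-x/2) * sum_{k<m} (x/2)^k / k!
    term = 1.0
    total = 1.0
    for k in range(1, m):
        term *= half / k
        total += term
    return statistic, exp(-half) * total


if __name__ == "__main__":
    # Hypothetical p-values from three independent tests.
    print(fisher_combined_p([0.04, 0.20, 0.10]))
```

With a single p-value the combined p-value equals the input, which is a quick way to check the tail computation.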

[Figure: vector space with axes T1, T2, T3]

- Euclidean distance
- Pearson correlation
- Cosine similarity
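
The three vector-based measures can be sketched side by side. This is a minimal sketch with hypothetical function names; vectors are assumed to be equally long lists of numbers with non-zero length (and, for the Pearson correlation, non-zero variance). The Pearson correlation is written here as the cosine similarity of the mean-centered vectors, which makes the relation between the two measures explicit.

```python
from math import sqrt


def euclidean_distance(u, v):
    """Straight-line distance between two vectors (smaller = more similar)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_similarity(centered_u, centered_v)


if __name__ == "__main__":
    # Made-up three-dimensional vectors.
    u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 7.0]
    print(euclidean_distance(u, v), cosine_similarity(u, v), pearson_correlation(u, v))
```

Note that Euclidean distance is a dissimilarity, so rankings based on it sort in the opposite direction from the other two measures.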

Statistical Validation: Correlation

Statistical Validation: Rank ROC

Statistical Validation: Submodel Rank ROC

Statistical Validation: Bias towards known genes

Endeavour Application: Screenshot

Endeavour Application: Architecture

ESAT Web server

Linux cluster

Java RMI

SOAP messages

Conclusions and Future

Conclusions
- Allows integration of heterogeneous data
- Solves problem of uncertainty
- Solves multiple testing problem (Bonferroni correction)
- Allows for cut-offs with statistical significance

Future
- Different weighting for different submodels
- Explore mathematical modeling techniques (neural nets, SVM)
- Add more information models
- Define best combination of submodels

Acknowledgements

Bart De Moor, Stein Aerts, Yves Moreau

Patrick Glenisson, Steven Van Vooren, Joke Allemeersch

Endeavour Application: Demo

1. Load training set
2. Add submodels
3. Train submodels
4. Load candidate genes
5. Score candidate genes with all submodels
6. Results of scoring
7. Ranking visualized in sprintplot