Identification of overlapping biclusters using Probabilistic Relational Models

Tim Van den Bulcke

Hui Zhao

Kristof Engelen

Bart De Moor

Kathleen Marchal

PMCB Workshop

Thursday, 26 July, 2007.

Overview

• Biclustering and biology

• Probabilistic Relational Models

• ProBic biclustering model

• Algorithm

• Results

• Conclusion

Biclustering and biology

• Definition in the context of gene expression data: A bicluster is a subset of genes which show a similar expression profile under a subset of conditions.

[Figure: expression data matrix with axes labelled genes and conditions]
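
As a minimal illustration of the definition above, a bicluster can be viewed as the submatrix picked out by a subset of genes and a subset of conditions. The matrix values and the helper name `submatrix` are made up for this sketch and do not come from the presentation.

```python
# Toy expression matrix: rows are genes, columns are conditions (values invented).
data = [
    [0.1, 5.0, -0.3, 5.1],   # gene 0
    [1.2, -0.7, 0.4, 0.2],   # gene 1
    [-0.8, 4.9, 0.9, 5.0],   # gene 2
    [0.3, 0.6, -1.1, -0.4],  # gene 3
]


def submatrix(data, genes, conds):
    """Return the expression values restricted to the given genes and conditions."""
    return [[data[g][c] for c in conds] for g in genes]


# Genes 0 and 2 show a similar profile under conditions 1 and 3 only.
bicluster = submatrix(data, genes=[0, 2], conds=[1, 3])
print(bicluster)
```

Outside the selected conditions, genes 0 and 2 look like the rest of the matrix, which is why clustering on all conditions at once could miss them.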

Biclustering and biology

Why bi-clustering?*

• Only a small set of the genes participates in a cellular process.

• A cellular process is active only in a subset of the conditions.

• A single gene may participate in multiple pathways that may or may not be coactive under all conditions.

* From: Madeira et al. (2004) Biclustering Algorithms for Biological Data Analysis: A Survey

Probabilistic Relational Models (PRMs)

[Figure: example relational schema with classes Patient, Treatment, Virus strain and Contact]

Image: free interpretation from Segal et al., Rich probabilistic models

Probabilistic Relational Models (PRMs)

• Traditional approaches “flatten” relational data

– Causes bias

– Centered around one view of the data

– Lose relational structure

• PRMs

– Extension of Bayesian networks

– Combine advantages of probabilistic reasoning with relational logic

[Figure: the Patient and Contact classes being flattened]
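
One way to see why flattening can cause bias is a toy join of two tables. The sketch below assumes a hypothetical `age` attribute and invented records; it only illustrates that a patient with many contacts is over-represented once the tables are flattened into one.

```python
# Hypothetical relational data: two patients, four contacts (all values invented).
patients = {"p1": {"age": 40}, "p2": {"age": 60}}
contacts = [
    {"id": "k1", "patient": "p1"},
    {"id": "k2", "patient": "p1"},
    {"id": "k3", "patient": "p1"},
    {"id": "k4", "patient": "p2"},
]


def flatten(patients, contacts):
    """Join each contact with its patient's attributes (one row per contact)."""
    return [
        {"contact": c["id"], "patient": c["patient"], **patients[c["patient"]]}
        for c in contacts
    ]


flat = flatten(patients, contacts)

# Mean age per patient versus mean age over the flattened rows.
true_mean = sum(p["age"] for p in patients.values()) / len(patients)
flat_mean = sum(row["age"] for row in flat) / len(flat)
print(true_mean, flat_mean)
```

Patient p1 appears three times in the flattened table, so any statistic computed on it is pulled towards p1, and the link between rows sharing a patient is no longer explicit.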

ProBic biclustering model: notation

• g: gene

• c: condition

• e: expression

• g.B_k: gene-bicluster assignment for gene g to bicluster k

• c.B_k: condition-bicluster assignment for condition c to bicluster k

• e.Level: expression level value

• G, C, E (capital letters): sets of all genes, conditions and expression levels, respectively

• μ_{g.B, c.B, c}, σ_{g.B, c.B, c}: Normal distribution parameters for condition c, with gene-bicluster and condition-bicluster assignments g.B and c.B

ProBic biclustering model

• Dataset instance

Gene

ID   B1          B2
g1   ? (0 or 1)  ? (0 or 1)
g2   ? (0 or 1)  ? (0 or 1)

Condition

ID   B1          B2
c1   ? (0 or 1)  ? (0 or 1)
c2   ? (0 or 1)  ? (0 or 1)

Expression

g.ID  c.ID  level
g1    c1    -2.4
g1    c2    (missing value)
g2    c1    1.6
g2    c2    0.5
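
The dataset instance can be mirrored in a few plain Python structures. This is only an illustrative encoding: using `None` for the unknown assignments and for the missing expression value is an assumption of this sketch, not something the slides prescribe.

```python
# Unknown bicluster assignments (0 or 1) are encoded as None in this sketch.
genes = {
    "g1": {"B1": None, "B2": None},
    "g2": {"B1": None, "B2": None},
}
conditions = {
    "c1": {"B1": None, "B2": None},
    "c2": {"B1": None, "B2": None},
}

# One record per (gene, condition) pair; the missing measurement is None.
expression = [
    {"g": "g1", "c": "c1", "level": -2.4},
    {"g": "g1", "c": "c2", "level": None},
    {"g": "g2", "c": "c1", "level": 1.6},
    {"g": "g2", "c": "c2", "level": 0.5},
]

# Only observed expression levels would contribute to a likelihood.
observed = [e for e in expression if e["level"] is not None]
print(len(observed), "observed expression values")
```

Because each expression value is its own record, a missing measurement simply contributes nothing, without any need to impute it.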

ProBic biclustering model

• Relational schema and PRM model

Notation:

• g: gene

• c: condition

• e: expression

• g.B_k: gene-bicluster assignment for gene g to bicluster k (0 or 1, unknown)

• c.B_k: condition-bicluster assignment for condition c to bicluster k (0 or 1, unknown)

• e.Level: expression level value (continuous, known)

[Figure: relational schema with classes Gene (B1, B2), Condition (ID, B1, B2) and Expression (level)]

P(e.level | g.B1, g.B2, c.B1, c.B2, c.ID) = Normal( μ_{g.B, c.B, c.ID}, σ_{g.B, c.B, c.ID} )
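
The conditional distribution of e.level can be sketched as a lookup of (μ, σ) by the joint assignment (g.B, c.B, c.ID), followed by a Normal log-density. The parameter table, its values, and the fallback to a Normal(0, 1) background are assumptions made for this illustration only.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log-density of Normal(mu, sigma) at x."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)


# Hypothetical parameter table keyed by (g.B, c.B, c.ID); values are invented.
params = {
    ((1, 0), (1, 0), "c1"): (-2.0, 0.3),
}
BACKGROUND = (0.0, 1.0)  # assumed fallback when no entry exists


def expression_logprob(level, g_b, c_b, c_id, params):
    """log P(e.level | g.B, c.B, c.ID) under the lookup-table sketch."""
    mu, sigma = params.get((g_b, c_b, c_id), BACKGROUND)
    return normal_logpdf(level, mu, sigma)


# Gene and condition both in bicluster 1, versus the same value under the background.
in_bicluster = expression_logprob(-2.4, (1, 0), (1, 0), "c1", params)
background = expression_logprob(-2.4, (0, 0), (0, 0), "c1", params)
print(in_bicluster, background)
```

With these invented parameters, the value -2.4 is far more probable under the bicluster distribution than under the background.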

ProBic biclustering model

[Figure: the PRM model combined with the database instance yields the ground Bayesian network. Ground network nodes: g1.B1, g1.B2, g2.B1, g2.B2, c1.B1, c1.B2, c2.B1, c2.B2, c1.ID, c2.ID, level_{1,1}, level_{2,1}, level_{2,2}. The database instance and the PRM model are those of the two preceding slides.]

ProBic biclustering model

• ProBic posterior ( ~ likelihood x prior ):

[Equation image not recovered. Its labelled factors:]

– Expression level conditional probabilities

– Prior on the expression level parameters, the (μ, σ)'s

– Prior on condition-to-bicluster assignments

– Prior on gene-to-bicluster assignments
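
Read together, the labelled factors suggest a posterior of roughly the following shape. This is a reconstruction from the labels and the notation slide, not the equation shown in the talk; in particular, the assumption that the assignment priors factorize over genes and over conditions is ours.

```latex
\[
P(G.B,\, C.B,\, \mu,\, \sigma \mid E.\mathrm{level}) \;\propto\;
\underbrace{\prod_{e \in E} P(e.\mathrm{level} \mid g.B,\, c.B,\, c.\mathrm{ID})}_{\text{expression level conditional probabilities}}
\cdot
\underbrace{P(\mu, \sigma)}_{\text{prior on the } (\mu, \sigma)\text{'s}}
\cdot
\underbrace{\prod_{c \in C} P(c.B)}_{\text{prior condition-to-bicluster assignments}}
\cdot
\underbrace{\prod_{g \in G} P(g.B)}_{\text{prior gene-to-bicluster assignments}}
\]
```

Here g and c in the first product denote the gene and the condition that the expression value e belongs to.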

Algorithm: choices

• Different approaches possible

• Only approximate algorithms are tractable:

– MCMC methods (e.g. Gibbs sampling)

– Expectation-Maximization (soft, hard assignment)

– Variational approaches

– Simulated annealing, genetic algorithms, …

• We chose a hard assignment Expectation-Maximization algorithm (E.-M.)

– Natural decomposition of the model in E.-M. steps

– Efficient

– Good convergence properties for this model

– Extensible

Algorithm: Expectation-Maximization

• Maximization step:

– Maximize posterior w.r.t. μ, σ values (model parameters), given the current gene-bicluster and condition-bicluster assignments (= the hidden variables)

• Expectation step:

– Maximize posterior w.r.t. gene-bicluster and condition-bicluster assignments, given the current model parameters

– Two-step approach:

  • Step 1: maximize posterior w.r.t. C.B, given G.B and μ, σ values

  • Step 2: maximize posterior w.r.t. G.B, given C.B and μ, σ values
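
The alternation of steps can be sketched for a single bicluster on a tiny, hand-made matrix. This is a heavily simplified stand-in, not the ProBic implementation: it assumes flat priors, a fixed bicluster standard deviation `SIGMA_BIC` (so only μ is re-estimated), a fixed Normal(0, 1) background, and initialization from two seed genes. All names and values are hypothetical.

```python
import math

SIGMA_BIC = 0.5  # assumed fixed bicluster std-dev (the slides estimate sigma as well)

# Toy data: genes 0-2 share a profile under conditions 0-1 (values invented).
DATA = [
    [3.0, -3.0, 0.8, -0.9],
    [3.1, -2.9, -1.0, 0.7],
    [2.9, -3.1, 0.1, 1.2],
    [-0.5, 0.9, -0.7, -1.1],
    [1.0, -0.2, 1.1, 0.3],
    [-1.2, 0.4, -0.2, 0.6],
]


def log_normal(x, mu, sigma):
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)


def gain(x, mu):
    """Log-likelihood gain of the bicluster model over the N(0, 1) background."""
    return log_normal(x, mu, SIGMA_BIC) - log_normal(x, 0.0, 1.0)


def hard_em(data, seed_genes, max_iter=20):
    n_genes, n_conds = len(data), len(data[0])
    genes, conds = set(seed_genes), set(range(n_conds))
    for _ in range(max_iter):
        # M-step: per-condition mean over the genes currently in the bicluster.
        mu = [sum(data[g][c] for g in genes) / len(genes) for c in range(n_conds)]
        # E-step 1: hard condition assignments, given genes and mu.
        new_conds = {c for c in range(n_conds)
                     if sum(gain(data[g][c], mu[c]) for g in genes) > 0}
        # E-step 2: hard gene assignments, given conditions and mu.
        new_genes = {g for g in range(n_genes)
                     if sum(gain(data[g][c], mu[c]) for c in new_conds) > 0}
        changed = (new_genes, new_conds) != (genes, conds)
        genes, conds = new_genes, new_conds
        if not changed or not genes:  # converged, or the bicluster emptied out
            break
    return genes, conds


print(hard_em(DATA, seed_genes={0, 1}))
```

Fixing σ keeps the bicluster model from trivially beating the background in every condition; in the full model that role is played by the priors, which this sketch leaves out.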

Algorithm: Expectation-Maximization

• Expectation step 1: condition-bicluster assignment

– Independent per condition

– Evaluate function for every condition and for every bicluster assignment, e.g. 200 conditions, 30 biclusters: 200 × 2^30 ≈ 200 billion evaluations ~ a lot

– But can be performed very efficiently:

  • Partial solutions can be reused among different bicluster assignments

  • Only evaluate potentially good solutions: use Apriori-like approach

  • Avoid background evaluations
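
The size of the naive search is easy to check: each condition can be in or out of each bicluster, so there are 2 to the power of the number of biclusters candidate assignments per condition. The function name below is hypothetical.

```python
def naive_evaluations(n_conditions, n_biclusters):
    """Number of (condition, assignment) pairs if every combination is evaluated."""
    return n_conditions * 2 ** n_biclusters


total = naive_evaluations(200, 30)
print(f"{total:,}")  # 214,748,364,800, i.e. roughly 200 billion
```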

Algorithm: initialization

• Initialization options:

– Multiple random initializations

– Initialize biclusters with (nearly) complete dataset

– Initialize all biclusters simultaneously

– Init/converge one bicluster at a time, then add next (still allowing first bicluster to change)

• Best results:

– One initialization: initialize biclusters with (nearly) complete dataset

– Iteratively add one bicluster and run E.-M.

Algorithm: example

[Two slides of figures; images not recovered]

Algorithm properties

• Speed:

– 500 genes, 200 conditions, 2 biclusters: 2 min.

– Scaling:

  • ~ #genes · #conditions · 2^(#biclusters) (worst case)

  • ~ #genes · #conditions · (#biclusters)^p (in practice), p = 1..3

Overview

• Biclustering and biology

• Probabilistic Relational Models

• ProBic biclustering model

• Algorithm

• Results

– Noise sensitivity

– Bicluster shape

– Overlap

• Conclusion

Results: noise sensitivity

• Setup:

– Simulated dataset: 500 genes x 200 conditions

– Background distribution: Normal(0,1)

– Bicluster distributions: Normal( rnd(N(0,1)), σ ), varying σ

– Shapes: three 50x50 biclusters

[Figure panels: ordered, randomized]
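
A dataset of this kind could be generated along the following lines. The sketch reads "Normal( rnd(N(0,1)), σ )" as one mean drawn from N(0, 1) per bicluster condition, and places the biclusters in non-overlapping blocks; both are assumptions, as are the function name and its parameters.

```python
import random


def simulate(n_genes=500, n_conds=200, shapes=((50, 50),) * 3, sigma=0.2, seed=0):
    """Return (data, biclusters): N(0, 1) background with planted biclusters."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_conds)] for _ in range(n_genes)]
    biclusters = []
    g0 = c0 = 0  # top-left corner of the next block (non-overlapping placement)
    for n_g, n_c in shapes:
        genes = range(g0, g0 + n_g)
        conds = range(c0, c0 + n_c)
        for c in conds:
            mu = rng.gauss(0.0, 1.0)  # assumed: one random mean per condition
            for g in genes:
                data[g][c] = rng.gauss(mu, sigma)
        biclusters.append((set(genes), set(conds)))
        g0 += n_g
        c0 += n_c
    return data, biclusters


data, biclusters = simulate(sigma=0.2)
print(len(data), len(data[0]), len(biclusters))
```

Passing `shapes=((80, 10), (10, 80), (20, 20))` would give differently shaped biclusters in the same background, and shuffling rows and columns would give the randomized view of the ordered matrix.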

Results: noise sensitivity

[Figure: four panels, Precision (genes), Recall (genes), Precision (conditions) and Recall (conditions), each plotted against σ, with labels A and B]

Precision = TP / (TP+FP), Recall = TP / (TP+FN)
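
The two scores can be computed directly from the set of genes (or conditions) a method reports and the set that was planted. The function name and the example sets are illustrative; returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def precision_recall(found, true):
    """Precision and recall of a found set of genes (or conditions) against the truth."""
    tp = len(found & true)   # correctly reported members
    fp = len(found - true)   # reported but not in the true bicluster
    fn = len(true - found)   # in the true bicluster but not reported
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Example: 4 genes reported, 2 of them correct, 1 true gene missed.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5})
print(p, r)
```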

Results: bicluster shape independence

• Setup:

– Dataset: 500 genes x 200 conditions

– Background distribution: N(0,1)

– Bicluster distributions: N( rnd(N(0,1)), 0.2 )

– Shapes: 80x10, 10x80, 20x20

[Figure: results of the bicluster shape independence experiment; image not recovered]

Results: 10 biclusters

[Figure: image not recovered]

Overlap examples

• Two biclusters (50 genes, 50 conditions)

• Overlap: 25 genes, 25 conditions

• Two biclusters (10 genes, 80 conditions)

• Overlap: 2 genes, 40 conditions

Near future

• Automated definition of algorithm parameter settings

• Application to biological datasets

– Dataset normalization

• Extend model with different overlap models

• Model extension from biclusters to regulatory modules: include motif + ChIP-chip data

[Figure: extended relational schema with classes Promoter, Gene, Array and Expression; attribute labels S1–S4, R1–R3, M1–M2, P1–P3 and level; example motif sequence TTCAATACAGG]

Conclusion

• Noise robustness

• Naturally deals with missing values

• Independent of bicluster shape

• Simultaneous identification of multiple overlapping biclusters

• Can be used in a query-driven way

• Extensible

Acknowledgements

KULeuven:

• whole BioI group, ESAT-SCD

– Hui Zhao

– Thomas Dhollander

• whole CMPG group (Centre of Microbial and Plant Genetics)

– Kristof Engelen

– Kathleen Marchal

UGent:

• whole Bioinformatics & Evolutionary Genomics group

– Tom Michoel

