Hierarchical multilabel classification trees for gene function prediction


Leander Schietgat, Hendrik Blockeel, Jan Struyf
Katholieke Universiteit Leuven (Belgium)

Amanda Clare
University of Aberystwyth (Wales)

Sašo Džeroski
Jožef Stefan Institute, Ljubljana (Slovenia)

Probabilistic Modeling and Machine Learning in Structural and Systems Biology

Tuusula, Finland, 17-18 June 2006

Overview

The application: gene function prediction

The machine learning context: hierarchical multilabel classification

Decision trees for HMC: the Clus-HMC algorithm

Experimental results

Conclusions

Gene Function Prediction

Task
  Given a data set with descriptions of genes and the functions they have
  Learn a model that can predict for a new gene what functions it performs

Genes can have multiple functions

These functions are hierarchically organised

[Figure: example class hierarchy with top-level classes c1, c2 and c3; c2 has the subclasses c21 and c22]

Machine Learning

Classifier
  predicts for unseen instances the class to which they belong
  learned with already classified training examples

Different techniques
  decision trees
  support vector machines
  Bayesian networks
  …

Hierarchical Multilabel Classification

Normal classification setting
  only predicts a single class

HMC
  predict multiple classes at once
  classes are organized in a hierarchy

Hierarchy constraint
  instances of a class must be instances of its superclasses
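The hierarchy constraint can be sketched in a few lines of Python. This is only an illustration, assuming that classes are written as path-style identifiers such as 5/1 or 40/3, where everything before the last "/" names the superclass; the function names are hypothetical and not part of any system mentioned in the talk.

```python
def ancestors(label, sep="/"):
    """Return all proper superclasses of a path-style label such as '40/3'."""
    parts = label.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts))]


def close_under_hierarchy(labels, sep="/"):
    """Add every superclass so that the label set satisfies the constraint."""
    closed = set(labels)
    for label in labels:
        closed.update(ancestors(label, sep))
    return closed


def satisfies_constraint(labels, sep="/"):
    """True if every superclass of every label is also in the set."""
    label_set = set(labels)
    return all(a in label_set for l in label_set for a in ancestors(l, sep))
```

For example, a gene annotated only with 40/3 is implicitly also an instance of 40, so close_under_hierarchy({"40/3"}) adds the superclass, while satisfies_constraint({"40/16"}) is false because 40 is missing.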

Two HMC approaches

1. Learn a model for each class and combine the predictions

Advantage
  a lot of machine learning algorithms available

Disadvantages
  efficiency
  skewed class distributions
  hierarchical relationships

[Figure: separate models m1, m2, …, mn, one per class, answering c1?, c2?, …, cn?]

Two HMC approaches (cont'd)

2. Learn a single model that predicts all the classes together

Advantages
  faster to learn
  easier to interpret
  hierarchy constraint automatically imposed
  selection of features relevant for all classes

Disadvantage
  may have worse predictive performance

[Figure: a single model predicting the class vector [c1, c2, …, cn]]

Related work on HMC

Barutcuoglu et al. (2006)
  learn classes separately with SVMs and combine the predictions with Naïve Bayes

Clare (2003)
  extension of the C4.5 decision tree method that learns all classes together

A lot of work in the area of text classification
  Rousu et al. (2005) give an overview of SVM methods that learn a single model for all classes

            Gene function prediction   Text classification
Approach 1  Barutcuoglu et al.         …
Approach 2  Clare                      …

Why decision trees?

fast to build
fast to use
accurate predictions
easy to interpret

training examples

Gene  ND  HS  …  MF?
G1    25  29  …
G2    32  40  …  +
G3    19  0   …
G4    44  45  …  +
…     …   …   …  …

[Figure: decision tree built from the training examples, with the tests "Nitrogen depletion <= -2.74?" and "Heat shock > 1.28?", yes/no branches, and three leaves labelled Positive, Positive and Negative]

Decision trees for HMC

The Clus system
  created by Jan Struyf
  propositional DT learner, implemented in Java
  uses ideas of:
    C4.5 [Quinlan93] and CART [Breiman84]
    Predictive Clustering Trees [Blockeel98]

Heuristic for HMC
  look for the test that minimizes the intra-cluster variance (= generalisation of CART)
  can be used for HMC (Clus-HMC) …
  … as well as binary classification (Clus-SC ~ CART)
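A minimal sketch of a variance-based split heuristic, to make the idea concrete. It assumes each example carries a 0/1 label vector over the classes and measures variance as the mean squared Euclidean distance to the mean vector. The slides do not say which distance Clus-HMC uses, so the unweighted distance, the function names, and the toy data are all assumptions made for illustration.

```python
def variance(label_vectors):
    """Mean squared Euclidean distance of the label vectors to their mean vector."""
    if not label_vectors:
        return 0.0
    n = len(label_vectors)
    mean = [sum(column) / n for column in zip(*label_vectors)]
    return sum(
        sum((value - m) ** 2 for value, m in zip(vector, mean))
        for vector in label_vectors
    ) / n


def split_score(examples, attribute, threshold):
    """Size-weighted variance of the two subsets produced by attribute > threshold."""
    yes = [labels for features, labels in examples if features[attribute] > threshold]
    no = [labels for features, labels in examples if features[attribute] <= threshold]
    return (len(yes) * variance(yes) + len(no) * variance(no)) / len(examples)


def best_test(examples, attributes):
    """Try every observed value as a threshold; return (score, attribute, threshold)."""
    return min(
        (split_score(examples, attribute, features[attribute]), attribute, features[attribute])
        for attribute in attributes
        for features, _ in examples
    )


# Toy data, invented for illustration: two attributes, label vectors over [c1, c2].
examples = [
    ({"ND": 1, "HS": 5}, [1, 0]),
    ({"ND": 2, "HS": 1}, [1, 0]),
    ({"ND": 3, "HS": 6}, [0, 1]),
    ({"ND": 4, "HS": 2}, [0, 1]),
]
```

On the toy data, the test ND > 2 separates the two label vectors perfectly, so it gets the lowest score. With a single 0/1 class the same computation reduces to the variance of a binary target, which is in line with the slide's remark that the heuristic generalises CART.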

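Decision trees for HMC (cont'd)

[Figure: instead of separate trees answering c1?, c2?, …, cn?, a single tree handles all classes at once; its nodes are annotated with sets of classes such as c1; c1,c21,c22; c2,c21,c22; c1,c2,c21; c1,c3]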

Experiments in yeast functional genomics

Saccharomyces cerevisiae, or baker’s/brewer’s yeast

MIPS FunCat hierarchy
  250 functions of yeast genes

  1 METABOLISM
    1/1 amino acid metabolism
    1/2 nitrogen and sulfur metabolisms
  2 ENERGY
    2/1 glycolysis and gluconeogenesis
  …

12 datasets [Clare03]
  Sequence structure (seq)
  Phenotype growth (pheno)
  Secondary structure (struc)
  Homology search (hom)
  Microarray data: cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)

Example run

each leaf contains multiple classes

which classes to predict?

problem: different class frequencies

use of threshold

precision-recall curves: independent of a specific threshold
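The use of a threshold can be sketched as follows: a leaf predicts every class whose proportion among its training examples reaches the threshold. This is an illustrative sketch only; the function names, the choice of "at least" rather than "strictly above", and the example leaf are assumptions.

```python
from collections import Counter


def leaf_proportions(label_sets):
    """Fraction of the leaf's training examples that carry each class."""
    counts = Counter(c for labels in label_sets for c in set(labels))
    return {c: k / len(label_sets) for c, k in counts.items()}


def predict(label_sets, threshold):
    """Predict the classes occurring in the leaf with proportion >= threshold."""
    return {c for c, p in leaf_proportions(label_sets).items() if p >= threshold}


# Hypothetical leaf with three training examples.
leaf = [
    {"40", "40/3", "40/16"},
    {"40", "40/3", "40/16"},
    {"40", "40/16"},
]
```

Here class 40/3 occurs in two of the three examples, so it is predicted at a threshold of 50% but dropped at 100%. If the training label sets respect the hierarchy constraint, a superclass is always at least as frequent as its subclasses, so the thresholded prediction respects the constraint as well.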

[Figure: training data table with one row per gene (G1, G2, …, G6, …). The "description" columns are Name, A1, A2, …, An; the "functions" columns are the classes 1, …, 5, 5/1, …, 40, 40/3, 40/16, …, with an x for each function a gene has]

[Figure: example tree with the tests "nitrogen_depletion > 5" and "37C_to_25C_shock > 1.28". Its leaves hold training examples with label sets such as {1,5,5/1,3,3/5}, {5,5/1,40}, {5,5/1,40,40/3}, {40,40/3,40/16} and {40,40/16}]

Predictions of the three leaves for different thresholds

p=0%     40,40/3,40/16    5,5/1,40,40/3    1,5,5/1,3,3/5
p=50%    40,40/3,40/16    5,5/1,40         1,5
p=100%   40,40/16         5,5/1,40         1,5

Comparison of Clus-HMC with [Clare03]

Average precision-recall curves

PRECISION = proportion of (instance, class) predictions that are correct

RECALL = proportion of true (instance, class) cases that are predicted
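These two definitions translate directly into set operations over (instance, class) pairs. The sketch below is only an illustration; the function name, the convention of returning 1.0 when a denominator is empty, and the toy data are assumptions.

```python
def precision_recall(predicted, actual):
    """Precision and recall over sets of (instance, class) pairs."""
    correct = len(predicted & actual)
    # Assumed convention: an empty denominator yields 1.0.
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(actual) if actual else 1.0
    return precision, recall


# Toy data, invented for illustration.
actual = {("G1", "40"), ("G1", "40/3"), ("G2", "5"), ("G2", "5/1")}
predicted = {("G1", "40"), ("G2", "5"), ("G2", "40")}
precision, recall = precision_recall(predicted, actual)
```

In the toy data two of the three predicted pairs are correct and two of the four true pairs are found. Sweeping the prediction threshold and recomputing both numbers at each setting gives one point per threshold on a precision-recall curve.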

Extracting rules

e.g. predictions for class 40/3 in “gasch1” dataset

IF   Nitrogen_Depletion_8_h <= -2.74 AND
     Nitrogen_Depletion_2_h > -1.94 AND
     1point5_mM_diamide_5_min > -0.03 AND
     1M_sorbitol___45_min_ > -0.36 AND
     37C_to_25C_shock___60_min > 1.28
THEN 40,40/3

Precision: 0.97
Recall: 0.15
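Read as code, the rule above is a conjunction of threshold tests on a gene's attribute values. The sketch below copies the attribute names and thresholds from the rule; the data layout (a dict per gene), the function name, and the example values are hypothetical.

```python
import operator

# Conditions copied from the rule: (attribute, comparison, threshold).
RULE_CONDITIONS = [
    ("Nitrogen_Depletion_8_h", operator.le, -2.74),
    ("Nitrogen_Depletion_2_h", operator.gt, -1.94),
    ("1point5_mM_diamide_5_min", operator.gt, -0.03),
    ("1M_sorbitol___45_min_", operator.gt, -0.36),
    ("37C_to_25C_shock___60_min", operator.gt, 1.28),
]
RULE_PREDICTION = {"40", "40/3"}


def apply_rule(gene):
    """Return the predicted classes if all conditions hold, else an empty set."""
    if all(compare(gene[name], value) for name, compare, value in RULE_CONDITIONS):
        return set(RULE_PREDICTION)
    return set()


# Hypothetical attribute values, made up for illustration only.
example_gene = {
    "Nitrogen_Depletion_8_h": -3.0,
    "Nitrogen_Depletion_2_h": -1.0,
    "1point5_mM_diamide_5_min": 0.2,
    "1M_sorbitol___45_min_": 0.0,
    "37C_to_25C_shock___60_min": 1.5,
}
```

A gene that fails any single test gets no prediction from this rule, which is consistent with a rule that is very precise but covers few genes.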

HMC vs. single classification

Tree sizes
  on average, HMC tree: 24 nodes; SC tree: 33 nodes (250 such trees)

Time to grow trees
  a single SC tree is grown faster than a single HMC tree
  but 250 single trees have to be built
  HMC is on average 37 times faster

Predictive performance
  next slide

HMC vs. single classification

Average precision-recall curves

Explanation of the results

The classes are not independent

different trees for different classes actually share structure
  explains some complexity reduction achieved by Clus-HMC

one class carries information on other classes
  this increases the signal-to-noise ratio
  provides better guidance when learning the tree (explaining good predictive performance)
  avoids overfitting (explaining further reduction of tree size)

this was confirmed empirically

Conclusions

HMC decision trees are a useful tool for gene function prediction
  fast to learn
  high interpretability

Compared to regular tree learning, HMC tree learning:
  is even faster
  yields trees that:
    are smaller
    are easier to interpret
    have equal or better predictive performance

Further work

Comparison to other HMC learning algorithms
  kernel methods studied by Rousu et al. and Barutcuoglu et al.
  other suggestions are welcome!

Use a more advanced hierarchy such as Gene Ontology
  thousands of classes, spread over 19 levels
  how to handle the part_of relationship?
    if a function A is part_of a function B, then does a gene with function A also have function B?
    gene “has” function B X vs. gene “is involved” in function B

Questions?
