
The MORPH Algorithm

Date post: 24-Feb-2016
Page 1: The MORPH Algorithm

The MORPH Algorithm

MORPH = MOdule guided Ranking of candidate PatHway genes

high throughput data

Slides: Rachel E. Bell, June 2013

Page 2: The MORPH Algorithm

Motivation

Challenges in studying biological pathways

• Identify missing pathway members
• Information gaps on participating genes:
  a) e.g. nature of interactions between metabolites and gene expression
  b) understanding control mechanisms, feedback, cross-talk
• Many genes in genome(s) have unknown function

Page 3: The MORPH Algorithm

Biological Pathways: Overview

What is a pathway?
A series of interactions between genes (proteins) involved in performing a certain biological function

Cell input = extracellular/endogenous, e.g.: stress, changes in pH, UV exposure, nutrients
Cell output = response, e.g.: transcription of genes, sucrose degradation

Page 4: The MORPH Algorithm

MORPH Algorithm: Overview

INPUT: High throughput data of gene expression, networks and biological pathways

ALGORITHM: Machine learning and validation methods

OUTPUT: Predict genes involved in biological pathways

Page 5: The MORPH Algorithm

Other methods for functional prediction

Coexpression-based methods (& possibly pathways)
e.g.: ACT, GeneCat, ATTED-II, MapMan
Assumptions:
1) Similar expression patterns -> similar function or regulation
2) Pathway genes -> coordinated expression

Network-based methods (& gene expression)
e.g.: Markov random field (MRF) models, k-nearest neighbours (k-NN), ADOMETA: coexpression, phylogeny, clustering on chromosome, metabolic networks
Assumption: Closer nodes -> common functions

Page 6: The MORPH Algorithm

Introduction: MORPH Algorithm

MORPH uses pathway information, gene expression data and network information

Compared to other methods, MORPH:
• offers robustness (performs well on many pathways)
• increases network coverage
• can be applied to different organisms

Page 7: The MORPH Algorithm

Talk outline

1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks
2. Types of clustering (modules) methods
3. The MORPH algorithm and validation
4. Results
5. Comparison to other methods
6. Summary

Page 8: The MORPH Algorithm

MORPH Introduction

MORPH was developed on 2 model organisms:

• Arabidopsis thaliana
• Solanum lycopersicum (tomato)

Page 9: The MORPH Algorithm

MORPH Input: Arabidopsis thaliana

Pathways: 66 AraCyc, 164 MapMan
Preprocessing: filter out pathways with <10 genes that have expression data
Total: 230 pathways, 2 sets

Gene expression datasets: seedlings, tissues (leaves, roots, flowers, seeds), seed developmental stages, DS1
Preprocessing: filter low variance and detection call, average replicates, normalize to controls, standardize experiments
Total: 216 GE profiles, 4 datasets, ~12500 genes

Page 10: The MORPH Algorithm

MORPH Input: Arabidopsis thaliana

Metabolic (MD) Network (AraCyc)
Nodes = metabolic genes (enzymes)
Edges = nodes share a metabolite (reactant or product)
Preprocessing: remove most common metabolites (they connect enzymes with weak functional associations)
Total: 1987 genes, 56244 interactions

PPI Network (PAIR & Interactome Map databases)
Nodes = genes (proteins)
Edges = interactions between proteins
Preprocessing: unite (predicted & experimental) interactions from both databases
Total: 4642 genes, 149229 interactions
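The metabolic network construction described on this slide can be illustrated with a small sketch. It is not the authors' code: the function name `build_metabolic_network`, the parameter `n_common_to_drop` and the toy enzyme-to-metabolite mapping are all assumptions made for illustration, and the slide does not say how many common metabolites were removed.

```python
from collections import Counter
from itertools import combinations


def build_metabolic_network(enzyme_metabolites, n_common_to_drop=1):
    """Link enzymes that share a metabolite, after dropping the most common ones.

    enzyme_metabolites: dict mapping an enzyme gene to its reactants/products.
    n_common_to_drop: hypothetical cutoff; the slide gives no number.
    """
    # Count how many enzymes touch each metabolite.
    counts = Counter(m for mets in enzyme_metabolites.values() for m in set(mets))
    dropped = {m for m, _ in counts.most_common(n_common_to_drop)}

    # Index enzymes by each remaining metabolite.
    by_metabolite = {}
    for enzyme, mets in enzyme_metabolites.items():
        for m in set(mets) - dropped:
            by_metabolite.setdefault(m, set()).add(enzyme)

    # Every pair of enzymes sharing a remaining metabolite becomes an edge.
    edges = set()
    for enzymes in by_metabolite.values():
        edges.update(combinations(sorted(enzymes), 2))
    return edges, dropped


# Toy data (invented): ATP touches every enzyme and would link all of them.
toy = {
    "E1": ["ATP", "glucose"],
    "E2": ["ATP", "glucose", "G6P"],
    "E3": ["ATP", "G6P"],
    "E4": ["ATP", "pyruvate"],
}
edges, dropped = build_metabolic_network(toy, n_common_to_drop=1)
```

In the toy data, removing ATP leaves only the edges supported by glucose and G6P, which is the kind of weak, hub-driven association the preprocessing step is meant to discard.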

Page 11: The MORPH Algorithm

Talk outline

1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks
2. Types of clustering (modules) methods
3. The MORPH algorithm and validation
4. Results
5. Comparison to other methods
6. Summary

Page 12: The MORPH Algorithm

MORPH Goal

Given a specific biological pathway, MORPH seeks candidate genes that participate in (or regulate) the pathway.

A key step in MORPH is the partitioning of genes into modules (clusters).

MORPH receives 3 types of input:

1. Pathways
2. Gene expression data
3. Partitioning into modules

Page 13: The MORPH Algorithm

Assumptions of clustering data into modules

Q: Why use modules?

• Modules reflect broad functions

• Some functions are related to target pathway

• Pathway genes -> more coordinated expression than random genes

Page 14: The MORPH Algorithm

Input: Partitioning Gene Modules and Networks

Different strategies for partitioning genes

Expression based clustering
• SOM = self-organizing map (partitions all genes)
• CLICK = CLuster Identification via Connectivity Kernels (partitions most genes)

Network based clustering
• Matisse*
• Markov cluster algorithm (MCL)

Annotation based clustering
• Enzyme/not enzyme
• Orthologs in rice & maize/no orthologs

Page 15: The MORPH Algorithm

Input: Partitioning Networks

Goal: construct modules using gene expression data and networks

Reminder: MATISSE seeks connected sub-networks with high expression similarity (Ulitsky & Shamir, 2007)

Figure legend: interaction; high expression similarity

Problem: low coverage of MD network

Page 16: The MORPH Algorithm

Input: Partitioning Networks - MATISSE*

Motivation - overcome low coverage of networks

MATISSE* (modified MATISSE)
• Add genes with high correlation
• Repeat until module correlation <0.4
• Connectivity ignored

Results:
• Matisse* increased MD network coverage to ~4500 genes
• Matisse* performed similarly to Matisse
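One possible reading of the MATISSE* expansion step is sketched below. The slide only says that highly correlated genes are added until the module correlation drops below 0.4 and that connectivity is ignored, so the greedy order, the use of mean pairwise Pearson correlation as "module correlation", and the names `expand_module` and `min_corr` are assumptions rather than the published procedure.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_corr(genes, expr):
    """Average correlation over all gene pairs in a module."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs)


def expand_module(seed, unassigned, expr, min_corr=0.4):
    """Greedily add the best-correlated gene while the module stays coherent.

    Network connectivity is deliberately not checked, as on the slide.
    """
    module = list(seed)
    pool = set(unassigned)
    while pool:
        best = max(
            sorted(pool),
            key=lambda g: sum(pearson(expr[g], expr[m]) for m in module) / len(module),
        )
        if mean_pairwise_corr(module + [best], expr) < min_corr:
            break  # adding more genes would push module correlation below the cutoff
        module.append(best)
        pool.remove(best)
    return module


# Toy expression profiles (invented): "d" is anti-correlated with the seed.
expr = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, 2, 3, 5],
    "d": [4, 3, 2, 1],
    "e": [1, 3, 2, 4],
}
module = expand_module(seed=["a", "b"], unassigned=["c", "d", "e"], expr=expr)
```

Here the seed module grows by "c" and then "e", and stops before "d" because including it would drop the mean pairwise correlation below 0.4.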

Page 17: The MORPH Algorithm

Summary: Methods of Partitioning Gene Modules and Networks

Gene expression-based clustering
Clustering algorithm | Method
SOM | Co-expression
CLICK | Co-expression

Modules using network data
Clustering algorithm | Network
Markov cluster process (MCL) | PPI
MATISSE* | PPI
MATISSE* | MD network

Annotation-based clustering
Bipartition | Categories
Enzymes | Y/N
Orthologs | Y/N

No clustering - single module

Total of 8 clustering solutions

Page 18: The MORPH Algorithm

Talk outline

1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks
2. Types of clustering (modules) methods
3. The MORPH algorithm and validation
4. Results
5. Comparison to other methods
6. Summary

Page 19: The MORPH Algorithm

MORPH = MOdule guided Ranking of candidate PatHway genes

MORPH is an algorithm for prioritizing novel candidate genes in a given specific pathway.

Input:
1. Pathway genes S = {s1, s2, …, sl}
2. Gene expression profiles
3. Partition solution for genes with gene expression data: k modules = M1, …, Mk
4. Similarity function (D): Pearson/Spearman

Page 20: The MORPH Algorithm

Module-Guided Ranking Algorithm

Step #1: Partition genes into k modules M1, M2, …, Mk

Step #2:
• Identify pathway genes s1, s2, …, sl and candidate genes g
• Ignore modules with no pathway genes
• Add a module for non-partitioned pathway genes

Step #3: Analyze each module separately

Page 21: The MORPH Algorithm

Module-Guided Ranking Algorithm

Step #4: For each candidate gene g in module Mi, calculate its mean similarity with the pathway genes sj in that module, using gene expression data. This provides a ranking within the module.

Figure labels: candidate genes; pre-defined module; pathway genes in module; similarity function (Pearson's corr.)

Page 22: The MORPH Algorithm

Module-Guided Ranking Algorithm

Step #5: Standardize the mean similarity scores within each module, using the mean and stdev of the mean similarity scores of all candidate genes in module Mi:
z(g) = (score(g) - mean) / stdev

Step #6: Rank all candidate genes (using standardized z-scores)
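Steps #1 to #6 can be sketched as follows. This is an illustration under stated assumptions, not the MORPH implementation: the function name `morph_rank` and the toy data are invented, Pearson correlation is used as the similarity function D, the population standard deviation is used for standardization, and the extra module for non-partitioned pathway genes (Step #2) is omitted for brevity.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def morph_rank(pathway, modules, expr):
    """Rank candidates by module-standardized mean similarity to pathway genes."""
    pathway = set(pathway)
    z_scores = {}
    for module in modules:  # Step 3: each module is analyzed separately
        members = [g for g in module if g in pathway]
        candidates = [g for g in module if g not in pathway]
        # Step 2: ignore modules with no pathway genes
        # (also skipped here: modules too small to standardize).
        if not members or len(candidates) < 2:
            continue
        # Step 4: mean similarity to the pathway genes inside this module.
        raw = {g: mean(pearson(expr[g], expr[s]) for s in members) for g in candidates}
        # Step 5: standardize within the module.
        mu, sigma = mean(raw.values()), pstdev(raw.values())
        if sigma == 0:
            continue
        for g, score in raw.items():
            z_scores[g] = (score - mu) / sigma
    # Step 6: one global ranking across modules, best first.
    return sorted(z_scores, key=z_scores.get, reverse=True)


# Toy data (invented): s1 and s2 are the known pathway genes.
expr = {
    "s1": [1, 2, 3, 4], "s2": [1, 2, 3, 5],
    "g1": [2, 4, 6, 8], "g2": [4, 3, 2, 1], "g3": [1, 3, 2, 4],
    "g4": [3, 1, 4, 1], "g5": [2, 7, 1, 8],
}
modules = [["s1", "s2", "g1", "g2", "g3"], ["g4", "g5"]]
ranking = morph_rank(["s1", "s2"], modules, expr)
```

The second module contains no pathway genes and contributes no candidates. Standardizing per module is what makes scores from different modules comparable before the global ranking in Step #6.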

Page 23: The MORPH Algorithm

How do we assess predictions of many pathways?

Given a clustering solution AND a gene expression dataset, run the algorithm for each pathway (Arabidopsis thaliana: 230 pathways)

Assessment of pathways using Leave-One-Out Cross-Validation (LOOCV) procedure

Page 24: The MORPH Algorithm

Leave-One-Out Cross-Validation (LOOCV) procedure (Kharchenko et al., 2006)

LOOCV generates a SELF-RANK for each pathway gene

Definition: the SELF-RANK of a gene is its position in the ranking when it is left out of the algorithm calculation

Meaning: the self-rank of a pathway gene = its overall strength of association with the remaining pathway genes

Page 25: The MORPH Algorithm

Self-Rank Curve: AUSR score

LOOCV procedure

For each pathway S:
1. Remove one gene (v) -> S\{v}
2. Consider S\{v} = training set (v is the held-out test gene)
3. Generate ranking of v using S\{v}
4. Repeat for every v

• Calculate self-rank for all v in S
• Create self-rank plot
• Self-rank threshold of k=1..1000
• Calculate area under self-rank curve (AUSR)

Figure 2: Self-rank plot of the Carotenoid Biosynthetic Pathway (13 genes; SOM clustering solution), compared with a random gene set of 13 genes; x-axis: self-rank threshold (k)

AUSR score assesses pathway solutions (given input combinations – discussed next)
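The LOOCV and AUSR computation above can be sketched as below. Several details are assumptions: the scoring function `score` is a hypothetical stand-in for the module-guided ranking, ties are broken in favour of the held-out gene, and the area is normalized by the number of thresholds so that it falls between 0 and 1.

```python
def self_ranks(pathway, genes, score):
    """LOOCV: rank each held-out pathway gene among the candidate genes.

    score(gene, reference_genes) is any ranking score, higher is better.
    """
    pathway = list(pathway)
    in_pathway = set(pathway)
    candidates = [g for g in genes if g not in in_pathway]
    ranks = {}
    for v in pathway:
        rest = [s for s in pathway if s != v]  # S\{v}
        v_score = score(v, rest)
        # Self-rank = 1 + number of candidates that outscore the held-out gene.
        ranks[v] = 1 + sum(score(g, rest) > v_score for g in candidates)
    return ranks


def ausr(ranks, max_k=1000):
    """Area under the self-rank curve, normalized to [0, 1] (an assumption)."""
    n = len(ranks)
    # Curve value at threshold k: fraction of pathway genes with self-rank <= k.
    curve = [sum(r <= k for r in ranks.values()) / n for k in range(1, max_k + 1)]
    return sum(curve) / max_k


# Toy example (invented): genes are scored by closeness to the reference mean.
values = {"p1": 1.0, "p2": 1.1, "p3": 5.0, "c1": 1.05, "c2": 3.0, "c3": 9.0}


def score(gene, reference):
    centre = sum(values[r] for r in reference) / len(reference)
    return -abs(values[gene] - centre)


ranks = self_ranks(["p1", "p2", "p3"], list(values), score)
toy_ausr = ausr(ranks, max_k=4)  # small threshold range for the toy data
```

A pathway whose genes consistently obtain low self-ranks produces a curve that rises quickly, and therefore an AUSR close to 1.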

Page 26: The MORPH Algorithm

Talk outline

1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks
2. Types of clustering (modules) methods
3. The MORPH algorithm and validation
4. Results
5. Comparison to other methods
6. Summary

Page 27: The MORPH Algorithm

FIGURE 3: Comparison of 2 gene expression datasets

Plotted: AUSR(seedlings) - AUSR(DS1)

Different: gene expression dataset
Same: MD network, Matisse*, 66 AraCyc pathways

Different input produces different AUSR scores

Inspired adoption of selection (learning configuration)

Page 28: The MORPH Algorithm

Learning Configuration

Definition: learning configuration = combination of a gene expression dataset (4) AND a clustering solution (8)

Every pathway is tested with each gene expression dataset and partitioning solution (modules)

Total of 4x8 = 32 combinations

Page 29: The MORPH Algorithm

Machine Learning

MORPH applies a selection procedure

LOOCV is used to select the optimal learning configuration (i.e. dataset and clustering) for each examined pathway.

LOOCV guards against overfitting, since the test gene is left out.
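The selection procedure amounts to a search over the 32 learning configurations. The sketch below assumes a hypothetical `evaluate(pathway, dataset, clustering)` callback that returns the LOOCV AUSR for one configuration; the labels are taken from the slides, but the function names and the toy scores are invented.

```python
from itertools import product

# Labels follow the slides; the exact strings are illustrative.
DATASETS = ["seedlings", "tissues", "seed development", "DS1"]
CLUSTERINGS = [
    "SOM", "CLICK", "MCL (PPI)", "MATISSE* (PPI)", "MATISSE* (MD)",
    "enzymes", "orthologs", "single module",
]


def all_configurations():
    """Every (dataset, clustering) pair: 4 x 8 = 32 learning configurations."""
    return list(product(DATASETS, CLUSTERINGS))


def select_configuration(pathway, evaluate):
    """Pick the configuration with the highest LOOCV AUSR for this pathway."""
    return max(all_configurations(), key=lambda cfg: evaluate(pathway, *cfg))


# Toy evaluation (invented numbers) standing in for a real LOOCV run.
def toy_evaluate(pathway, dataset, clustering):
    return 0.9 if (dataset, clustering) == ("seedlings", "MATISSE* (MD)") else 0.5


best = select_configuration("carotenoid biosynthesis", toy_evaluate)
```

Because the choice is made separately for each pathway, two pathways may end up with different datasets and clustering solutions.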

Page 30: The MORPH Algorithm

Comparison of selection process to other 'fixed' configurations

66 AraCyc metabolic pathways

Figure 4: The average AUSR for each learning combination (gene expr. dataset + clustering solution)

Results
• Better: enzymes or MD network
• Poorer: PPI network, no clustering, SOM, CLICK & Orthologs
(metabolic genes had higher corr.)

Selection improved on all configurations

Page 31: The MORPH Algorithm

Robustness of selection method

Real vs. Random Pathways (66 AraCyc pathways)

Random pathways: randomly selected sets with same size (repeated 100 times for each size)

Results
• 29/66 real pathways: AUSR > maximal random score
• AUSR > 0.75: 15/66 real pathways, 0 random

Figure 5: AUSR Scores of Real and Random Pathways (x-axis: sizes; y-axis: AUSR)
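The random-pathway baseline can be sketched as follows, assuming a hypothetical `score_set(genes)` function that returns the AUSR of a gene set. The names `max_random_score` and `count_above_random`, the fixed seed and the toy numbers are illustrative only.

```python
import random


def max_random_score(genes, size, score_set, n_repeats=100, seed=0):
    """Best score among n_repeats random gene sets of the given size."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    return max(score_set(rng.sample(genes, size)) for _ in range(n_repeats))


def count_above_random(real_scores, sizes, genes, score_set):
    """Count real pathways that beat the maximal random score for their size."""
    return sum(
        real_scores[p] > max_random_score(genes, sizes[p], score_set)
        for p in real_scores
    )


# Toy setup (invented): the stand-in score always lies between 0 and 1.
genes = list(range(50))


def toy_score(gene_set):
    return sum(gene_set) / (len(gene_set) * 49)


real_scores = {"pathway A": 1.5, "pathway B": -1.0}
sizes = {"pathway A": 10, "pathway B": 12}
n_above = count_above_random(real_scores, sizes, genes, toy_score)
```

Comparing against the maximum of 100 random sets is a strict test: a real pathway is counted only if it beats every random set of the same size.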

Page 32: The MORPH Algorithm

Talk outline

1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks
2. Types of clustering (modules) methods
3. The MORPH algorithm and validation
4. Results
5. Comparison to other methods
6. Summary

Page 33: The MORPH Algorithm

Comparison of MORPH to other methods: Arabidopsis thaliana pathways

Input:
Gene expression: seeds, tissues, seedlings, DS1
Networks: PPI and MD networks
Pathways: AraCyc, MapMan

Methods compared:
• Coexpression (no network data) methods using reference datasets: ACT, DS1
• Markov Random Field (MRF) methods (network data)
  CMRF = total # of pathway genes in neighbourhood
  WMRF = total similarity with pathway genes in neighbourhood
• k-Nearest Neighbour (k-NN) (network data)

Figures 4B & 4C: 66 AraCyc pathways; 164 MapMan pathways
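The two neighbourhood quantities named on this slide can be written down directly. This sketch covers only those quantities, not a full Markov random field model; the function names, the adjacency-dict representation and the toy similarity values are assumptions.

```python
def cmrf_score(gene, neighbours, pathway):
    """CMRF quantity: number of pathway genes among the gene's network neighbours."""
    return sum(1 for n in neighbours.get(gene, ()) if n in pathway)


def wmrf_score(gene, neighbours, pathway, similarity):
    """WMRF quantity: total similarity to pathway genes among the neighbours."""
    return sum(similarity(gene, n) for n in neighbours.get(gene, ()) if n in pathway)


# Toy network and similarities (invented).
neighbours = {"g": ["s1", "s2", "x"], "h": ["x"]}
pathway = {"s1", "s2"}
toy_similarity = {("g", "s1"): 0.9, ("g", "s2"): 0.5}


def similarity(a, b):
    return toy_similarity.get((a, b), 0.0)


g_count = cmrf_score("g", neighbours, pathway)
g_weighted = wmrf_score("g", neighbours, pathway, similarity)
h_count = cmrf_score("h", neighbours, pathway)
```

Both scores are zero for any gene outside the network or without pathway neighbours, which is the coverage limitation that MORPH's module-based approach is meant to address.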

Page 34: The MORPH Algorithm

AraCyc pathways with AUSR>0.8

MapMan pathways with AUSR>0.7

k-NN predictor complements MORPH

Figure 4D & 4E: Comparison to other methods

Page 35: The MORPH Algorithm

My analysis: AUSR scores of MORPH and k-NN

k-NN yields twice as many pathways as MORPH with high AUSR (>0.9): 6 compared to 3

Data retrieved from Supplemental Data Set 3

Page 36: The MORPH Algorithm

Carotenoid Pathway and the MORPH Candidate genes

Carotenoids are antioxidants, perform stress response functions

Candidate Genes (Numbered Octagons)

• 8/25 top candidates have predicted functions, with few details of their roles in plants

• Other predictions include genes with similar functions, e.g. response to oxidative stress

SQE3 – catalyzes the precursor of a pathway whose expression is coordinated with the carotenoid pathway

SPS2 – plastoquinone pathway, essential for the carotenoid pathway

Page 37: The MORPH Algorithm

Comparison of MORPH to other methods: 93 Tomato pathways

Figure 7: Predictors include MORPH, k-NN, MRF-based, and coexpression based classifiers.
(A) Average and median AUSR scores.
(B) The number of pathways that had AUSR score above 0.7

Page 38: The MORPH Algorithm

Talk outline

1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks
2. Types of clustering (modules) methods
3. The MORPH algorithm and validation
4. Results
5. Comparison to other methods
6. Summary

Page 39: The MORPH Algorithm

Summary: Advantages of MORPH

1. Robust – different pathways
2. k-NN considers only genes in the network; MORPH increases network coverage
3. k-NN is more dependent on sub-network diameter (higher diameter, lower AUSR); MORPH is more robust
4. Self-rank threshold of k=1000 for AUSR ignores poor pathway gene correlations
5. Potentially useful predictions

Page 40: The MORPH Algorithm

Summary: Drawbacks of MORPH

1. If pathway genes are not coherent, it is better to select the best/top module(s) than to average
2. Dependent on input quality (e.g. AraCyc > MapMan)
3. Predicts close pathways (drawback/advantage)
4. Requires known pathway info for predictions

Page 41: The MORPH Algorithm

Questions?

Page 42: The MORPH Algorithm
Page 43: The MORPH Algorithm

Top AUC scores for tested pathways

Pathway | Spearman AUC | Pearson AUC | Size
photosynthesis light reactions | 0.995115 | 0.994654 | 26
Chlorophyllide biosynthesis I | 0.952 | 0.950643 | 14
Carotenoids Core pathway | 0.859312 | 0.868158 | 13
tRNA charging pathway | 0.832438 | 0.831844 | 32
gluconeogenesis | 0.831634 | 0.833135 | 30
triacylglycerol degradation | 0.78642 | 0.770003 | 12
cysteine biosynthesis I | 0.785097 | 0.787916 | 11
fatty acid β-oxidation II (core pathway) | 0.746601 | 0.752534 | 15
glycolysis I | 0.742482 | 0.747914 | 44
glycolysis IV (plant cytosol) | 0.730273 | 0.74716 | 44
Calvin-Benson-Bassham cycle | 0.723338 | 0.729027 | 29
glucosinolate biosynthesis from homomethionine | 0.721732 | 0.721641 | 11
homogalacturonan biosynthesis | 0.720999 | 0.729749 | 12
glucosinolate biosynthesis from hexahomomethionine | 0.719277 | 0.719277 | 11
glucosinolate biosynthesis from pentahomomethionine | 0.719277 | 0.719277 | 11
ethylene biosynthesis from methionine | 0.709665 | 0.766496 | 12

Page 44: The MORPH Algorithm

MORPH Classifications

3 types of input data:
• Pathway genes (s1, s2, …, sl)
• Gene expression
• Partition of gene expression data into k modules = M1, …, Mk

66 Arabidopsis thaliana pathways, 4 datasets, 8 partitioning methods

Page 45: The MORPH Algorithm
