
Computational detection of cis-regulatory modules

Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor

Katholieke Universiteit Leuven, Belgium

Slides by Chulyun Kim

Presented by Saurabh Sinha

Contents

Introduction

Methods

  Methodology overview
  Score functions
  ModuleSearch algorithm

Results

Conclusions


Motivation

The transcriptional regulation of a metazoan gene depends on the cooperative action of multiple transcription factors

These factors bind to cis-regulatory modules (CRMs) located in the neighborhood of the gene

By integrating multiple signals, CRMs confer an organism-specific spatial and temporal rate of transcription

Related Works

Yuh et al., 1998: Working with combinations of factors makes it possible to integrate multiple inputs, and this further provides cross-coupling of signal transduction and gene regulatory pathways

Bray et al., 2003: AVID, an alignment algorithm designed to identify functional noncoding segments

Aerts et al., 2003: delineation of putative regions containing CRMs in large intergenic sequences

Thijs et al., 2002: detecting DNA motifs by their statistical over-representation in a set of sequences

Aerts et al., 2003: detecting over-represented hits of known TFBSs

Recently, exploiting colocalization to find true binding sites in a particular gene has yielded valuable hypotheses regarding transcriptional regulation

Problem

To find the best combination of transcription factor binding sites (TFBSs) that occur several times across multiple coregulated human genes

Specifically, within regions that are syntenic with the respective mouse orthologous genes


Methodology Overview

Data

Human-mouse orthologous pairs

10 kb of sequence upstream of the coding sequence of the human and mouse gene, from Ensembl release 9

18,778 pairs with successful selection

Alignment and Parsing

Alignment

Each 10 kb pair was aligned with AVID

Parsing

The alignment output was parsed using VISTA

Select regions with at least 75% identity in windows of 100 bp

33,282 regions in total, stored in the Syntenic fastA database
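The window criterion above can be sketched in a few lines. This is only an illustration of "at least 75% identity in windows of 100 bp" on an already aligned pair; it is not VISTA's actual implementation, and the function name `conserved_windows`, the gap symbol `-`, and the choice to merge overlapping windows are all assumptions made for the example.

```python
def conserved_windows(aln_a, aln_b, window=100, min_identity=0.75):
    """Return merged (start, end) column ranges of a pairwise alignment in
    which a sliding window reaches the identity threshold.

    aln_a, aln_b: aligned sequences of equal length, gaps written as '-'
    (assumed convention). Ranges are 0-based, end-exclusive.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    if n < window:
        return []
    # 1 where both sequences carry the same base, 0 for mismatches and gaps
    match = [int(a == b and a != "-")
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # slide the window one column to the right
            count += match[start + window - 1] - match[start - 1]
        if count >= min_identity * window:
            end = start + window
            if regions and start <= regions[-1][1]:
                # extend the previous region instead of opening a new one
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

For example, two 200-column sequences that agree only in their first 100 columns give a single region covering columns 0 to 125, because the last qualifying window starts at column 25.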

Background Model and MotifScanner

Background Model

A 3rd-order Markov model is calculated from the Syntenic fastA database

Used for scoring and for generating artificial datasets

MotifScanner

All syntenic regions are scanned to predict transcription factor binding sites (TFBSs)

TRANSFAC: frequency matrices

All occurrences are stored in GFF format in the Syntenic GFF database

PO   A   C   G   T
01  12   4   3   1   A
02   3   2  11   4   G
03  11   2   4   3   A
...

GFF (Gene-Finding Format or General Feature Format): a protocol for the transfer of feature information

Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
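To make the eight GFF fields concrete, here is a minimal sketch that writes and reads one tab-separated record. The `GffRecord` type and both helper functions are hypothetical names introduced for illustration, and the sketch assumes the score field is always numeric (GFF also allows `.` for a missing score, which this code does not handle).

```python
from typing import NamedTuple


class GffRecord(NamedTuple):
    """One predicted site, in the field order listed on the slide."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: float
    strand: str
    frame: str


def format_gff(rec: GffRecord) -> str:
    """Serialize a record as one tab-separated GFF line."""
    return "\t".join(str(value) for value in rec)


def parse_gff(line: str) -> GffRecord:
    """Parse the first eight tab-separated fields of a GFF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF line needs at least 8 tab-separated fields")
    return GffRecord(fields[0], fields[1], fields[2],
                     int(fields[3]), int(fields[4]), float(fields[5]),
                     fields[6], fields[7])
```

A record such as `GffRecord("region_0001", "MotifScanner", "misc_feature", 120, 131, 8.5, "+", ".")` (all values invented) survives a round trip through `format_gff` and `parse_gff` unchanged.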

Coregulated Genes

Sets of coexpressed genes, from the SOURCE database, for cyclin B2

Dataset of gene expression during the cell cycle in a human cancer cell line

44 genes might share a common cis-regulatory element

Of these, 34 had an Ensembl identifier

Among them, 13 genes have at least one syntenic region with the respective mouse gene

32 regions in total


Scoring single TFBSs

Combining a position-specific frequency matrix (PSFM) Θ and a higher-order background model Bm

How likely it is that the segment is generated by the motif model with respect to the background

x is a segment [b1, b2, … , bw]

bj is the nucleotide found at position j in x

Θ(bj, j) is the probability of finding bj at position j according to the PSFM

P(bj | s, Bm) is the probability of finding bj in the sequence s according to the background model
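The definitions above suggest a likelihood ratio between the motif model and the background. The sketch below assumes the score is the log of that ratio, summed over the w positions of the segment; the slide's own formula is not preserved in this transcript, so the exact form is an assumption. The pseudocount, the uniform fallback of 0.25 for unknown contexts, and all function names are likewise illustrative choices.

```python
import math


def psfm_from_counts(counts, pseudocount=1.0):
    """Turn per-position base counts into probabilities.
    The pseudocount (an assumption) keeps every probability above zero."""
    psfm = []
    for col in counts:
        total = sum(col.get(b, 0) + pseudocount for b in "ACGT")
        psfm.append({b: (col.get(b, 0) + pseudocount) / total for b in "ACGT"})
    return psfm


def background_prob(seq, pos, bg, order):
    """P(b_pos | previous `order` bases) looked up in a dict
    context -> {base: probability}. Falls back to a uniform 0.25 when the
    context is too short or unseen (assumed behaviour)."""
    if pos < order:
        return 0.25
    return bg.get(seq[pos - order:pos], {}).get(seq[pos], 0.25)


def site_score(seq, start, psfm, bg, order=3):
    """Assumed log-likelihood ratio of the segment starting at `start`:
    sum over j of log( Theta(b_j, j) / P(b_j | s, Bm) )."""
    score = 0.0
    for j, column in enumerate(psfm):
        base = seq[start + j]
        score += math.log(column[base] / background_prob(seq, start + j, bg, order))
    return score
```

With the three count rows shown on the TRANSFAC slide and an empty background dictionary (so every background probability is 0.25), the segment `AGA` scores above zero and `TTT` scores below zero, which matches the intended reading: positive scores favour the motif model over the background.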

Matrix similarity

Redundancy of motif models

There can be multiple matrices describing the same TF

There can be distinct TFs with similar PSFMs

Kullback-Leibler distance between two motif models

Θ1(j,b) is the probability of finding base b at position j in Motif 1

w is the length of the motif

A is the set of all possible alignments for an allowed shift

The motif models can be grouped into classes depending on a threshold on this average distance
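As a rough illustration of comparing two motif models, the sketch below computes a symmetrised Kullback-Leibler distance averaged over the aligned columns and takes the minimum over the allowed shifts. The symmetrisation, the averaging, the `min_overlap` parameter, and the function names are assumptions; the slide's exact formula is not preserved here. All probabilities must be strictly positive, for example after adding pseudocounts.

```python
import math


def column_kl(p, q):
    """Kullback-Leibler divergence between two base distributions.
    Requires strictly positive probabilities in both columns."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in "ACGT")


def motif_distance(m1, m2, max_shift=2, min_overlap=1):
    """Minimum, over the allowed shifts, of the average symmetrised KL
    distance between the overlapping columns of two motif models.

    m1, m2: lists of {base: probability} dicts, one per position.
    Returns math.inf if no shift gives at least `min_overlap` columns.
    """
    best = math.inf
    for shift in range(-max_shift, max_shift + 1):
        # pair column j of m1 with column j + shift of m2
        pairs = [(m1[j], m2[j + shift])
                 for j in range(len(m1)) if 0 <= j + shift < len(m2)]
        if len(pairs) < min_overlap:
            continue
        total = sum(column_kl(p, q) + column_kl(q, p) for p, q in pairs)
        best = min(best, total / (2 * len(pairs)))
    return best
```

Two matrices would then be placed in the same class when `motif_distance` falls below a chosen threshold. Setting `min_overlap` to a sizeable fraction of the motif length avoids spuriously small distances that come from aligning only one or two columns.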

Module Score Function

Binding sites and motif models (frequency matrices)

CRMs and CRM models

CRMs: clusters of actual binding sites on a sequence

CRM models: sets of motif models

The score of a CRM model m on a set of sequences s = (s1, …, sn)

The score of a CRM model m on a sequence s

m is a collection of motif models Θ1, …, Θl

t is a set of matching binding sites

For each motif model Θi there is a count over the occurring TFBSs of Θi in sequence s; if the number of occurrences is q, the count can take any value in 0, …, q

Each term refers to the kth instance of Θi on sequence s, scored with the single-TFBS score

b(t) is a boolean function expressing whether the given combination of TFBSs is valid or not

  Different TFBSs must not overlap
  The sites must lie within the specified window length (distance constraint)

p(t) is the penalization function of CRMs: the number of occurring sites divided by the number of motif models l

The score does not take the motif order into account
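The pieces described above (a validity check b(t), a penalization p(t), and a sum of single-site scores) can be put together in a small brute-force sketch. It makes several simplifying assumptions that are not taken from the slides: at most one site per motif model is selected, the score on one sequence is the maximum over valid combinations of p(t) times the summed site scores, and the score on a set of sequences is the sum of the per-sequence scores. All names are hypothetical.

```python
from itertools import product


def is_valid(sites, window):
    """b(t): sites must not overlap and must all fit in `window` bp.
    Each site is a (start, end, score) tuple, end-exclusive."""
    ordered = sorted(sites)
    if max(end for _, end, _ in ordered) - ordered[0][0] > window:
        return False
    return all(nxt[0] >= cur[1] for cur, nxt in zip(ordered, ordered[1:]))


def module_score(sites_by_motif, window):
    """Best penalized score of a CRM model on one sequence.

    sites_by_motif: dict motif name -> list of (start, end, score) hits.
    Assumption: each motif model contributes zero or one site.
    """
    n_models = len(sites_by_motif)
    options = [[None] + list(hits) for hits in sites_by_motif.values()]
    best = 0.0
    for combo in product(*options):
        chosen = [site for site in combo if site is not None]
        if not chosen or not is_valid(chosen, window):
            continue
        penalty = len(chosen) / n_models          # p(t)
        best = max(best, penalty * sum(score for _, _, score in chosen))
    return best


def module_score_on_set(per_sequence_sites, window):
    """Assumed combination rule: sum the per-sequence scores."""
    return sum(module_score(sites, window) for sites in per_sequence_sites)
```

Because the sites are sorted by position before checking, and the sum does not depend on which motif comes first, the order of motifs within the window has no effect on the score.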


ModuleSearch

Since the order of sites is not considered, CRM models can be sorted in alphabetical order

nΘ, the number of sites a module should contain, is given

Search for the best CRM model on a set of coregulated genes

Typical best-first / branch-and-bound search

Starting from the empty model, incomplete models are expanded by adding a motif model from a different class, until no incomplete model has an overestimate heuristic score greater than the score of the current best complete model

The model with the best heuristic score is expanded first

Heuristic score: based on the score function without penalization of m

An overestimate heuristic value bounds the rise in score from CRM model m to the best child CRM model

[Θi] is a CRM model containing one matrix Θi

t = ( ) (Θl +1 , …, Θe)

A boolean function expresses whether the motif models added to m all belong to different classes
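The search strategy described on this slide can be sketched as a generic best-first branch-and-bound over sorted motif combinations. This is not the ModuleSearch implementation: the `score` callable, the `single_bound` table, and the way the optimistic bound is built (sum of per-motif upper bounds, ignoring the class constraint for motifs not yet added) are assumptions. The sketch is only correct when `score(model)` never exceeds the sum of `single_bound` over the motifs in `model`.

```python
import heapq


def module_search(motifs, motif_class, score, single_bound, n_sites):
    """Return (best_model, best_score) for models of exactly n_sites motifs,
    no two of which share a class.

    motifs: iterable of motif names
    motif_class: dict name -> class label
    score: callable taking a tuple of names, returns the true score
    single_bound: dict name -> upper bound on that motif's contribution
    """
    names = sorted(motifs)  # order is irrelevant, so enumerate sorted tuples only

    def optimistic(model, next_idx):
        # bound of the model itself plus the largest bounds still available
        rest = sorted((single_bound[m] for m in names[next_idx:]), reverse=True)
        return (sum(single_bound[m] for m in model)
                + sum(rest[:n_sites - len(model)]))

    best_model, best_score = None, float("-inf")
    heap = [(-optimistic((), 0), (), 0)]  # max-heap via negated bounds
    while heap:
        neg_bound, model, next_idx = heapq.heappop(heap)
        if -neg_bound <= best_score:
            break  # no incomplete model can beat the best complete one
        used = {motif_class[m] for m in model}
        for i in range(next_idx, len(names)):
            if motif_class[names[i]] in used:
                continue  # only add motif models from a new class
            child = model + (names[i],)
            if len(child) == n_sites:
                value = score(child)
                if value > best_score:
                    best_model, best_score = child, value
            else:
                bound = optimistic(child, i + 1)
                if bound > best_score:
                    heapq.heappush(heap, (-bound, child, i + 1))
    return best_model, best_score
```

The heap always yields the incomplete model with the highest optimistic bound, so the loop can stop as soon as that bound no longer exceeds the best complete score. A tighter bound prunes more of the search space but must remain an overestimate, otherwise the best model may be missed.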


Semi-Artificial Sequences

Artificial sequences were generated by sampling symbols from the background model
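Sampling from the background model can be illustrated as follows: estimate a 3rd-order Markov model from training sequences, then draw each new symbol conditioned on the previous three. The pseudocount, the uniform start and fallback, and the `plant_sites` helper (which overwrites background with known sites, one plausible way to obtain "semi-artificial" sequences) are assumptions for the sketch, not details taken from the slides.

```python
import random
from collections import defaultdict


def train_markov(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.
    Returns dict context -> {base: probability}. Requires order >= 1."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if set(context + base) <= set("ACGT"):  # skip N and other symbols
                counts[context][base] += 1
    return {ctx: {b: c / sum(row.values()) for b, c in row.items()}
            for ctx, row in counts.items()}


def sample_sequence(model, length, order=3, rng=None):
    """Sample a sequence symbol by symbol from the Markov model.
    The first `order` bases and unseen contexts are drawn uniformly."""
    rng = rng or random.Random()
    seq = [rng.choice("ACGT") for _ in range(order)]
    while len(seq) < length:
        row = model.get("".join(seq[-order:]))
        if row is None:
            seq.append(rng.choice("ACGT"))
        else:
            bases = list(row)
            seq.append(rng.choices(bases, weights=[row[b] for b in bases])[0])
    return "".join(seq[:length])


def plant_sites(seq, sites):
    """Overwrite background with (position, site_string) pairs."""
    chars = list(seq)
    for pos, site in sites:
        if pos < 0 or pos + len(site) > len(chars):
            raise ValueError("site does not fit in the sequence")
        chars[pos:pos + len(site)] = site
    return "".join(chars)
```

Passing a seeded `random.Random` makes the generated dataset reproducible, which helps when measuring how often a search recovers the planted sites.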

Detecting Modules in Microarray Clusters

Selected gene cluster around cyclin B2

The best module model in the cluster selected by ModuleSearcher, with window = 100 bp and nΘ = 4: [NFY, STAF, TCF4, CEBPA]


Conclusions

Scoring functions for modules in syntenic regions and an algorithm to find the best-scoring module were proposed

The authors tested the proposed algorithm on artificial data and showed that it could find the hidden modules with high sensitivity

They predicted a module in a set of coexpressed genes and validated the prediction using the same approach
