Computational detection of cis-regulatory modules
Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor
Katholieke Universiteit Leuven, Belgium
Slides by Chulyun Kim
Presented by Saurabh Sinha
Contents
Introduction
Methods: Methodology overview, Score functions, ModuleSearch algorithm
Results
Conclusions
Motivation
The transcriptional regulation of a metazoan gene depends on the cooperative action of multiple transcription factors
These factors bind to cis-regulatory modules (CRMs) located in the neighborhood of the gene
By integrating multiple signals, CRMs confer an organism-specific spatial and temporal rate of transcription
Related Works
Yuh et al., 1998: working with combinations of factors makes it possible to integrate multiple inputs, and this further provides cross-coupling of signal transduction and gene regulatory pathways
Bray et al., 2003: AVID, an alignment algorithm designed to identify functional noncoding segments
Aerts et al., 2003: delineation of putative regions containing CRMs in large intergenic sequences
Thijs et al., 2002: detecting DNA motifs by their statistical over-representation in a set of sequences
Aerts et al., 2003: detecting over-represented hits of known TFBSs
Recently, exploiting colocalization to find true binding sites in a particular gene has yielded valuable hypotheses regarding transcriptional regulation
Problem
To find the best combination of transcription factor binding sites (TFBSs) that occurs several times across multiple coregulated human genes
Specifically, within regions that are syntenic with the respective orthologous mouse genes
Methodology Overview
Data
Human-mouse orthologous pairs
10 kb of sequence upstream of the coding sequence of the human and mouse genes, from Ensembl release 9
18,778 pairs with successful selection
Alignment and Parsing
Alignment
Each 10 kb pair was aligned with AVID
Parsing
The alignment output was parsed using VISTA
Select regions with at least 75% identity in windows of 100 bp
33,282 regions in total
Syntenic fastA database
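The window-based selection step can be sketched as follows. This is a minimal illustration and not VISTA's actual implementation: the function name `conserved_windows`, the half-open column coordinates, and the rule that gap columns never count as matches are all assumptions made for the example.

```python
def conserved_windows(human, mouse, window=100, min_identity=0.75):
    """Return merged (start, end) alignment-column ranges covered by at
    least one window whose identity reaches the threshold.

    `human` and `mouse` are the two rows of a pairwise alignment
    (equal length, '-' for gaps). Coordinates are half-open.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned rows must have the same length")
    # 1 where both rows carry the same base, 0 otherwise (gaps never match).
    matches = [int(a == b and a != "-") for a, b in zip(human, mouse)]
    regions = []
    running = sum(matches[:window])
    for start in range(0, len(matches) - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            running += matches[start + window - 1] - matches[start - 1]
        if running / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlapping or touching windows are merged into one region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

With 150 identical columns followed by 100 mismatching columns, every window starting at columns 0 to 75 still has at least 75 matches, so a single merged region is reported.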
Background Model and MotifScanner
Background Model
A 3rd-order Markov model is calculated from the Syntenic fastA database
Used for scoring and for generating artificial datasets
MotifScanner
All syntenic regions are scanned to predict transcription factor binding sites (TFBSs)
TRANSFAC: frequency matrices
All occurrences are stored in GFF format in the Syntenic GFF database
PO   A   C   G   T
01  12   4   3   1   A
02   3   2  11   4   G
03  11   2   4   3   A
…
GFF (Gene-Finding Format or General Feature Format): a protocol for the transfer of feature information
Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
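A minimal sketch of reading one GFF record with the eight fields listed above. It assumes tab-separated fields and a "." placeholder for a missing score; the `GffRecord` type, the `parse_gff_line` helper, and the sample line are hypothetical and are not taken from the Syntenic GFF database.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str


def parse_gff_line(line: str) -> GffRecord:
    """Parse the first eight tab-separated GFF fields of one line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand, frame,
    )


# Hypothetical record, for illustration only.
record = parse_gff_line("region_1\tMotifScanner\tmisc_feature\t120\t131\t8.4\t+\t.")
```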
Coregulated Genes
Sets of coexpressed genes
From the SOURCE database for cyclin B2
Dataset of gene expression during the cell cycle in a human cancer cell line
44 genes might share a common cis-regulatory element
Of these, 34 had an Ensembl identifier
Among them, 13 genes have at least one syntenic region with the respective mouse gene
32 regions in total
Scoring single TFBSs
Combining a position-specific frequency matrix (PSFM) Θ and a higher-order background model Bm
How likely it is that the segment is generated by the motif model with respect to the background
x is a segment [b1, b2, …, bw]
bj is the nucleotide found at position j in x
Θ(bj, j) is the probability of finding bj at position j according to the PSFM
P(bj | s, Bm) is the probability of finding bj in the sequence according to the background model
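One way to combine these quantities is a log-likelihood ratio, i.e. the sum over j = 1..w of log( Θ(bj, j) / P(bj | s, Bm) ). The slide's own formula is not reproduced in this text, so this form is an assumption consistent with the definitions above. The sketch below also assumes a background model stored as a dictionary from a 3-base context to base probabilities; `tfbs_score`, `toy_psfm`, and `uniform_bg` are hypothetical names and values.

```python
import math
from itertools import product


def tfbs_score(seq, start, psfm, background, order=3):
    """Sum over motif positions of log(PSFM probability / background probability).

    `psfm` is a list of per-position dicts {base: probability};
    `background` maps each `order`-base context to {base: probability}.
    """
    if start < order or start + len(psfm) > len(seq):
        raise ValueError("segment needs `order` bases of left context and must fit in seq")
    score = 0.0
    for j, column in enumerate(psfm):
        pos = start + j
        base = seq[pos]
        context = seq[pos - order:pos]  # the preceding `order` bases in s
        score += math.log(column[base] / background[context][base])
    return score


# Toy inputs (hypothetical values, for illustration only).
toy_psfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform_bg = {"".join(c): dict.fromkeys("ACGT", 0.25) for c in product("ACGT", repeat=3)}
example_score = tfbs_score("CCCAGT", 3, toy_psfm, uniform_bg)
```

A positive score means the segment is better explained by the motif model than by the background; a segment that matches the motif poorly scores below zero.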
Matrix similarity
Redundancy of motif models
There can be multiple matrices describing the same TF
There can be distinct TFs with similar PSFMs
Kullback-Leibler distance between two motif models
Θ1(j,b) is the probability of finding base b at position j in Motif 1
w is the length of the motif
A is the set of all possible alignments for an allowed shift
The motif models can be grouped into classes depending on a threshold on this average distance
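A hedged sketch of such a distance. The slide's formula is not reproduced in this text, so the choices here are assumptions: the Kullback-Leibler divergence is symmetrized, averaged over the overlapping columns, and minimized over the allowed shifts. All probabilities must be strictly positive (for example after adding pseudocounts). The names `column_kl` and `motif_distance` are hypothetical.

```python
import math


def column_kl(p, q):
    """Kullback-Leibler divergence between two columns {base: probability}."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in "ACGT")


def motif_distance(m1, m2, max_shift=1):
    """Smallest average symmetrized KL distance over the allowed shifts.

    `m1` and `m2` are lists of columns; a shift of k aligns column j of
    m1 with column j + k of m2, and only overlapping columns are compared.
    """
    best = math.inf
    for shift in range(-max_shift, max_shift + 1):
        pairs = [
            (m1[j], m2[j + shift])
            for j in range(len(m1))
            if 0 <= j + shift < len(m2)
        ]
        if not pairs:
            continue
        total = sum(column_kl(p, q) + column_kl(q, p) for p, q in pairs)
        best = min(best, total / (2 * len(pairs)))
    return best


# Toy columns (hypothetical values): each strongly prefers one base.
col_a = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
col_c = {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}
col_g = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
col_t = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}
```

Two matrices that describe the same site but are offset by one column get distance 0 once a shift of one is allowed, which is why shifted alignments matter when grouping matrices into classes.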
Module Score Function
A binding site and a motif model (a frequency matrix)
CRMs and CRM models
CRMs: clusters of actual binding sites on a sequence
CRM models: sets of motif models
The score of a CRM model m on a set of sequences s = (s1, …, sn)
The score of a CRM model m on a sequence s
m is a collection of motif models Θ1, …, Θl
t is a set of matching binding sites
An index counts over the occurring TFBSs of model Θi in sequence s; if the number of occurrences is q, the index can take any value in 0, …, q
The kth instance of Θi on sequence s is scored with the single-TFBS score
b(t) is a boolean function expressing whether the given combination of TFBSs is valid or not
Validity checks: overlap between different TFBSs, and whether the sites lie within the specified window length (distance constraint)
p(t) is the penalization function for CRMs: the number of occurring sites divided by the number of motif models l
The score does not take the motif order into account
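The ingredients above can be combined in a small exhaustive sketch for one sequence. Since the slide's formula is not reproduced in this text, the combination rule is an assumption: take the maximum, over all valid combinations t, of p(t) times the sum of the single-TFBS scores. The sketch also assumes that a valid combination has no overlapping sites and that all sites fit within one window. The name `module_score` and the half-open (start, end, score) tuples are hypothetical.

```python
from itertools import product


def module_score(hits_per_motif, window=100):
    """Best penalized score of a CRM model on one sequence.

    `hits_per_motif` holds one list per motif model; each hit is a
    (start, end, score) tuple with half-open coordinates. For each motif
    model the combination uses either no site or exactly one of its hits.
    """
    n_models = len(hits_per_motif)
    best = 0.0
    options = [[None] + list(hits) for hits in hits_per_motif]
    for combo in product(*options):
        sites = sorted(h for h in combo if h is not None)
        if not sites:
            continue
        # b(t): reject overlapping sites and combinations wider than the window.
        overlap = any(a[1] > b[0] for a, b in zip(sites, sites[1:]))
        span = max(s[1] for s in sites) - sites[0][0]
        if overlap or span > window:
            continue
        # p(t): number of occurring sites divided by the number of motif models.
        penalty = len(sites) / n_models
        best = max(best, penalty * sum(s[2] for s in sites))
    return best


# Hypothetical hits for two motif models on one sequence.
toy_hits = [[(10, 20, 5.0)], [(30, 40, 4.0), (500, 510, 9.0)]]
toy_module_score = module_score(toy_hits, window=100)
```

In the toy input the strongest single hit (score 9.0) lies too far from the other model's site, so the best valid combination pairs the two nearby sites; using the distant hit alone is halved by the penalization.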
ModuleSearch
Since the order of sites is not considered, CRM models can be sorted in alphabetical order
nΘ, the number of sites a module should contain, is given
Search for the best CRM model on a set of coregulated genes
Typical best-first / branch-and-bound search
Starting from the empty model, expand incomplete models by adding a motif model from a different class, until there are no incomplete models whose overestimate heuristic score is greater than the score of the current best complete model
The model having the best heuristic score is expanded first
Heuristic Score
Components: the score function of m without penalization, and an overestimate heuristic value of the rise in score from CRM model m to the best child CRM model
[Θi] is a CRM model containing one matrix Θi
t = ( ) (Θl +1 , …, Θe)
A boolean function expresses whether the classes of the motif models added to m are all different
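A generic best-first / branch-and-bound skeleton in the spirit of the search described above. It is a sketch under stated assumptions, not the authors' ModuleSearch code: `score` and `upper_bound` are supplied by the caller, `upper_bound` must never underestimate the score of any completion of a model, and the toy weights, class labels, and additive score are hypothetical stand-ins for the real module score and heuristic.

```python
import heapq


def best_first_search(classes, n_theta, score, upper_bound):
    """Find the best complete model of `n_theta` motif models.

    `classes` maps motif name -> class label; a model never holds two
    motifs of one class. Models are tuples kept in sorted order, so each
    unordered combination is generated only once.
    """
    names = sorted(classes)
    best_model, best_score = None, float("-inf")
    heap = [(-upper_bound(()), ())]  # max-heap via negated bounds
    while heap:
        neg_bound, model = heapq.heappop(heap)
        if -neg_bound <= best_score:
            break  # no remaining model can beat the current best
        if len(model) == n_theta:
            s = score(model)
            if s > best_score:
                best_model, best_score = model, s
            continue
        used = {classes[m] for m in model}
        for name in names:
            if (model and name <= model[-1]) or classes[name] in used:
                continue
            child = model + (name,)
            bound = upper_bound(child)
            if bound > best_score:
                heapq.heappush(heap, (-bound, child))
    return best_model, best_score


# Toy problem (hypothetical): additive score, optimistic bound.
toy_weights = {"A1": 3.0, "A2": 2.9, "B": 2.0, "C": 1.5}
toy_classes = {"A1": "a", "A2": "a", "B": "b", "C": "c"}


def toy_score(model):
    return sum(toy_weights[m] for m in model)


def toy_bound(model, n_theta=2):
    # Assume every open slot could be filled by the heaviest motif.
    return toy_score(model) + (n_theta - len(model)) * max(toy_weights.values())


toy_result = best_first_search(toy_classes, 2, toy_score, toy_bound)
```

The two heaviest toy motifs share a class, so the search returns the best pair from different classes and stops as soon as the most promising model left in the queue cannot exceed it.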
Semi-Artificial Sequences
Artificial sequences were generated by sampling symbols from the background model
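A minimal sketch of estimating a 3rd-order Markov background model and sampling an artificial sequence from it. The pseudocount, the uniform start of the sequence, the uniform fallback for unseen contexts, and the names `train_markov` and `sample_sequence` are assumptions made for the example.

```python
import random
from collections import defaultdict


def train_markov(sequences, order=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip positions that involve ambiguous symbols such as N.
            if base in "ACGT" and all(c in "ACGT" for c in context):
                counts[context][base] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {b: c / total for b, c in row.items()}
    return model


def sample_sequence(model, length, order=3, rng=None):
    """Sample symbols one at a time, conditioning on the last `order` bases."""
    rng = rng or random.Random()
    seq = [rng.choice("ACGT") for _ in range(order)]  # uniform start
    uniform = dict.fromkeys("ACGT", 0.25)  # fallback for unseen contexts
    while len(seq) < length:
        probs = model.get("".join(seq[-order:]), uniform)
        seq.append(rng.choices("ACGT", weights=[probs[b] for b in "ACGT"])[0])
    return "".join(seq[:length])


toy_model = train_markov(["ACGTACGTACGT"])
toy_sequence = sample_sequence(toy_model, 50, rng=random.Random(0))
```

Known binding sites can then be implanted into such sequences at chosen positions to obtain test data in which the hidden module is known.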
Detecting Modules in Microarray Clusters
Selected gene cluster around cyclin B2
The best module model in the cluster selected by ModuleSearcher, with window = 100 bp and nΘ = 4: [NFY, STAF, TCF4, CEBPA]
Conclusions
Scoring functions for modules in syntenic regions and an algorithm to find the best-scoring module were proposed
The authors tested the proposed algorithm on artificial data and showed that it could find the hidden modules with high sensitivity
They predicted a module in a set of coexpressed genes and validated the prediction using the same approach