Periodic clusters
Non periodic clusters
That was only the beginning…
The human cell cycle
G1-Phase S-Phase
G2-Phase M-Phase
The proliferation cluster genes are cell cycle periodic
[Figure: expression of the proliferation cluster genes across samples; x-axis: Samples (5 to 45), y-axis: Gene Expression (-4 to 4); labels: G2/M, G1/S, CHR]
Distribution of cell cycle periodicity
[Figure: histogram; x-axis: CCP score (1 to 10), y-axis: Proportion (0 to 0.8); series: All genes, Proliferation genes]
[Figure: promoter map from 200 bp upstream to the TSS showing motif positions: NFY, E2F, ELK1, CDE, CHR]
The cell cycle motifs are enriched among the periodic genes
Not in the cluster, mutated in cancer
Tabach et al. Mol Sys Biol 2005
Potential regulatory motifs in 3’ UTRs
Finding 3’ UTR elements associated with high/low transcript stability (in yeast)
[Figure: workflow from the entire genome, through clustering of expression profiles (x-axis: Time/tissues, y-axis: Expression), to motif finding; example motifs: AAGCTTCC, CCTACAAC]
Diagnosing motifs using expression
Reverse the inference flow
Once we reverse the inference order we can
• Enumerate and score all possible k-mer motifs
• Examine the effect of “mutations” on motifs
• Examine the effect of motif location within the promoter
• Examine the effect of motif combinations, and of distances within a combination
• More?
…But the correlation between gene cluster and motifs is imprecise in both directions:
• there are genes in the cluster without the motif
• and many genes with the motif do not respond.
• If gene control is multifactorial, groups of genes defined by a common motif will not be mutually disjoint
• Partitioning the data into disjoint clusters will cause loss of information.
A k-mer enumeration method: score every possible k-mer for an association with expression level
Ag is the expression level of gene g
C is a basal expression level (same for all g)
Nμg is an integer equal to the number of occurrences of motif μ in gene g
M is a set of motifs
Fμ is the increase/decrease in expression level caused by the presence of motif μ (same for all g)
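The definitions above suggest a linear model of the form Ag = C + Σ Fμ·Nμg (sum over μ in M); that exact form is an assumption here, since the equation itself is not in the text. The sketch below fits the model for one k-mer at a time by least squares and ranks k-mers by the fraction of expression variance explained. The function names, the toy promoters and the use of r² as the score are all illustrative choices, not the published method.

```python
from itertools import product

def count_kmer(seq, kmer):
    """Count (possibly overlapping) occurrences of kmer in seq."""
    k = len(kmer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer)

def fit_single_motif(counts, expression):
    """Least-squares fit of A_g = C + F * N_g; returns (C, F, r_squared)."""
    n = len(counts)
    mean_n = sum(counts) / n
    mean_a = sum(expression) / n
    s_nn = sum((x - mean_n) ** 2 for x in counts)
    s_aa = sum((y - mean_a) ** 2 for y in expression)
    s_na = sum((x - mean_n) * (y - mean_a) for x, y in zip(counts, expression))
    if s_nn == 0 or s_aa == 0:
        return None  # no variation in counts or expression: nothing to fit
    F = s_na / s_nn
    C = mean_a - F * mean_n
    return C, F, s_na ** 2 / (s_nn * s_aa)

def score_kmers(promoters, expression, k=3):
    """Score every possible k-mer; k-mers with constant counts are skipped."""
    scores = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        counts = [count_kmer(s, kmer) for s in promoters]
        fit = fit_single_motif(counts, expression)
        if fit is not None:
            scores[kmer] = fit
    return scores

# Toy data (made up): expression was built as 1 + 2 * (number of ACG sites).
promoters = ["ACGACGTT", "TTTTACGT", "TTTTTTTT", "ACGACGACG"]
expression = [5.0, 3.0, 1.0, 7.0]
scores = score_kmers(promoters, expression)
best = max(scores, key=lambda m: scores[m][2])
```

Fitting one motif at a time keeps the sketch short; a multi-motif fit would add motifs iteratively or solve a joint regression.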
Motif characterization through Expression Coherence (EC)
[Figure: two panels of expression level versus time for genes sharing a motif, one with EC score = 0.05 and one with EC score = 0.5; motif occurrences identified with ScanACE (Hughes et al.)]
Expression coherence score, intuition
[Figure: four example point sets (1 to 4) scored against a threshold distance D: EC1 = 0, EC2 = 0.66, EC3 = 0.2, EC4 = 0.2]
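One way to make the intuition concrete is to take the EC score as the fraction of gene pairs whose expression profiles lie closer than the threshold distance D. The sketch below assumes Euclidean distance between profiles normalized to mean 0 and variance 1; the function names and the toy profiles are hypothetical.

```python
from itertools import combinations
import math

def normalize(profile):
    """Scale a profile to mean 0 and variance 1 (flat profiles become zeros)."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if sd == 0:
        return [0.0] * n
    return [(x - mean) / sd for x in profile]

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def expression_coherence(profiles, threshold):
    """Fraction of gene pairs whose normalized profiles are closer than threshold."""
    normed = [normalize(p) for p in profiles]
    pairs = list(combinations(normed, 2))
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs if euclidean(a, b) < threshold)
    return close / len(pairs)

# Toy profiles (made up): three genes with the same shape, then a mixed set.
tight = [[1, 2, 3, 4], [2, 4, 6, 8], [0, 1, 2, 3]]
mixed = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 1, 3]]
ec_tight = expression_coherence(tight, threshold=1.0)
ec_mixed = expression_coherence(mixed, threshold=1.0)
```

In the mixed set only the first two genes form a close pair, so one of six pairs counts and the score is low, which matches the sparse versus dense pictures in the figure.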
Interaction of motifs
[Figure: three panels of expression level versus time: genes with only M1 (EC = 0.05), genes with only M2 (EC = 0.05), and genes with M1 AND M2 (EC = 0.23); phase label: G2]
Synergistic motifs
A combination of two motifs is called ‘synergistic’ if the expression coherence score of the genes that have both motifs is significantly higher than the scores of the genes that have only one of the motifs
SFF, Mcm1
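The comparison behind this definition can be sketched as follows: split the genes by motif content, compute an EC score for each group, and compare. This is only an illustration: profiles are assumed to be already normalized, EC is taken as the fraction of gene pairs closer than a threshold, and the fixed `margin` is a stand-in for a proper significance test, which the slide does not specify.

```python
from itertools import combinations
import math

def ec_score(profiles, threshold):
    """Fraction of profile pairs whose Euclidean distance is below threshold."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(1 for a, b in pairs if dist(a, b) < threshold) / len(pairs)

def synergy_scores(genes, threshold):
    """genes: list of (has_m1, has_m2, profile). Returns EC per motif group."""
    both = [p for m1, m2, p in genes if m1 and m2]
    only_m1 = [p for m1, m2, p in genes if m1 and not m2]
    only_m2 = [p for m1, m2, p in genes if m2 and not m1]
    return {
        "both": ec_score(both, threshold),
        "only_m1": ec_score(only_m1, threshold),
        "only_m2": ec_score(only_m2, threshold),
    }

def looks_synergistic(scores, margin=0.1):
    """Crude check: EC with both motifs exceeds either single-motif EC by margin."""
    return scores["both"] > max(scores["only_m1"], scores["only_m2"]) + margin

# Toy data (made up): genes with both motifs share a profile, the rest do not.
genes = [
    (True, True, [0.0, 1.0, 0.0]), (True, True, [0.0, 1.0, 0.1]),
    (True, True, [0.0, 0.9, 0.0]),
    (True, False, [0.0, 1.0, 0.0]), (True, False, [1.0, 0.0, 0.0]),
    (True, False, [0.0, 0.0, 1.0]),
    (False, True, [1.0, 0.0, 0.0]), (False, True, [0.0, 0.0, 1.0]),
    (False, True, [0.0, 1.0, 0.0]),
]
scores = synergy_scores(genes, threshold=0.5)
synergistic = looks_synergistic(scores)
```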
A global map of combinatorial expression control
[Figure: network of synergistic motif pairs. Nodes: mRPE72, SWI5, SFF', MCM1, SFF, MCM1', ECB, SCB, MCB, PAC, mRRPE, mRRSE3, GCN4, BAS1, LYS14, RAP1, mRPE34, mRPE57, mRPE6, mRPE58, STRE, RPN4, ABF1, PDR, CCA, PHO4, AFT1, STE12, MIG1, CSRE, HAP234, ALPHA1', ALPHA1, ALPHA2, mRPE8, mRPE69. Conditions: heat shock, cell cycle, sporulation, diauxic shift, MAPK signaling, DNA damage]
Pilpel et al. Nature Genetics 2001
Deduced network properties
• High connectivity
• Hubs
• Alternative partners in various conditions
[Figure: transcription factors (Mbp1, Ndt80, Ume6, Fkh1, Swi4) and motifs (MCB, MSE, URS1, SCB, MCM1', SFF'); y-axes: Correlation (-1 to 1) and Expression Coherence (0.2 to 0.8); phase labels: G1, G2; annotations: Sufficiency, Necessity]
Ho et al. Nature. 2002
TF-TF interaction
Hierarchy
Detect the effect of mutations in a motif
Distance and orientation of motifs affect expression profiles
[Figure: expression coherence (0 to 2) and 1 - correlation (-0.5 to 0) plotted against distance in bp (-200 to 200); regions labeled “mRRPE is closer” and “PAC is closer”; motif positions drawn relative to the ATG; bin labels: 36 19 8 14 20 2 3 7 1 2 0]
Some typical expression patterns
A Bayesian approach (conditional probability)
Xi could be “1” to denote:
• The presence of motif m
• Its distance from the TSS is < N
• It is on the coding strand
• It neighbors another motif m’
Or “0” otherwise
ei = being expressed in pattern i
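To illustrate the conditional probability idea, the sketch below estimates P(ei | X) by simple counting: for each combination of binary features X, it tabulates how often genes with that combination fall into each expression pattern. This is a plain frequency estimate, not the Bayesian network learning itself; the function name and the toy genes are hypothetical.

```python
from collections import Counter, defaultdict

def pattern_given_features(genes):
    """Estimate P(pattern | features) by counting.

    genes: iterable of (features, pattern), where features is a tuple of 0/1
    values such as (has_motif, near_tss). Returns {features: {pattern: prob}}.
    """
    counts = defaultdict(Counter)
    for features, pattern in genes:
        counts[tuple(features)][pattern] += 1
    table = {}
    for features, pattern_counts in counts.items():
        total = sum(pattern_counts.values())
        table[features] = {p: n / total for p, n in pattern_counts.items()}
    return table

# Toy data (made up): features are (has_motif, near_tss), patterns are labels.
genes = [
    ((1, 1), 4), ((1, 1), 4),
    ((1, 0), 4), ((1, 0), 2),
    ((0, 0), 2),
]
table = pattern_given_features(genes)
```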
Example: two rRNA processing motifs
The two motifs work together
The two motifs’ orientation matters
The procedure
• Given that P(N|D) = P(N) * P(D|N) / P(D):
• Search the space of possible networks N for one that maximizes the above probability
• Impossible to enumerate all possible networks
• Use cross-validation: partition the data into 5 gene sets, learn the rules from all but one set and test them on the left-out set, each time.
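The cross-validation step can be sketched as below. The learner here is a deliberately trivial stand-in (it predicts the most common pattern for each feature vector) and is not the Bayesian network search described above; all function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; each of the k folds is left out once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def learn_rules(train):
    """Toy learner: map each feature vector to its most common pattern."""
    by_features = defaultdict(Counter)
    for features, pattern in train:
        by_features[features][pattern] += 1
    return {f: c.most_common(1)[0][0] for f, c in by_features.items()}

def cross_validate(genes, k=5):
    """Fraction of left-out genes whose pattern is predicted correctly."""
    correct = 0
    total = 0
    for train, test in k_fold_splits(genes, k):
        rules = learn_rules(train)
        for features, pattern in test:
            if rules.get(features) == pattern:
                correct += 1
            total += 1
    return correct / total

# Toy data (made up): the pattern is fully determined by the features.
genes = [((1, 1), 4)] * 10 + [((0, 0), 1)] * 10
accuracy = cross_validate(genes)
splits = list(k_fold_splits(genes))
```

Because each gene is tested exactly once by rules learned without it, the accuracy reflects how well the rules generalize rather than how well they memorize.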
For example: what does it take to belong to expression pattern (4)?
• Need to have RRPE and PAC
• If PAC is not within 140 bp of the ATG, but RRPE is within 240 bp, then the probability of pattern 4 is 22%
• If PAC is within 140 bp and RRPE is within 240 bp, then the probability is 100%
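Written as code, the rule for pattern 4 reads as a small decision function. The sketch assumes that "need to have" means the probability is 0 when either motif is missing, and it returns None for the case the slide gives no number for (RRPE farther than 240 bp). The function and parameter names are hypothetical.

```python
def pattern4_probability(has_rrpe, has_pac, rrpe_dist=None, pac_dist=None):
    """Probability of expression pattern 4 according to the stated rules.

    Distances are in bp from the ATG. Returns None where no number is stated.
    """
    if not (has_rrpe and has_pac):
        return 0.0  # assumption: both motifs are strictly required
    rrpe_near = rrpe_dist <= 240
    pac_near = pac_dist <= 140
    if rrpe_near and pac_near:
        return 1.0
    if rrpe_near and not pac_near:
        return 0.22
    return None  # RRPE farther than 240 bp: not stated on the slide

p_both_near = pattern4_probability(True, True, rrpe_dist=200, pac_dist=100)
p_pac_far = pattern4_probability(True, True, rrpe_dist=200, pac_dist=300)
p_rrpe_far = pattern4_probability(True, True, rrpe_dist=500, pac_dist=100)
p_no_pac = pattern4_probability(True, False, rrpe_dist=200)
```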
Inferring various logical conditions (“gates”) on motif combinations
The Bayesian network predicts expression profiles very accurately
Can make useful predictions in worm
The modern synthetic approach
Motif discovery from evolutionary conservation data
Species: S. cerevisiae; S. mikatae, S. kudriavzevii, S. bayanus; S. castellii, S. kluyveri
S. mikatae, S. kudriavzevii and S. bayanus: intergenic sequences average 59 to 67% identity to their S. cerevisiae orthologs in global alignments
S. castellii and S. kluyveri: ~40% identity to S. cerevisiae
Nucleotide conservation in promoters is highest close to the TSS
TATA-containing genes
All genes
A set of discovered motifs
NATURE | VOL 434 | 17 MARCH 2005
The data
• Examined intergenic regions of human, mouse, rat and dog
• ~18,000 genes
• “Promoters”: 4 kb centered on the TSS
• 3’ UTRs based on RNA annotations
• 64 Mb and 15 Mb in total, respectively
• Negative control: introns, ~120 Mb
• % of alignable sequence: promoters: 51% (44% upstream and 58% downstream of the TSS), 3’ UTRs: 73%, introns: 34%, entire genome: 28%
The phylogenetic trees
Questions:
• How would addition of species affect analyses?
• What if the sequences were not only mammalian?
An example: a known binding site of Err-α in the GABPA promoter
Questions:
• What is the “meaning” of the other conserved positions?
Discovery of new motifs: exhaustive enumeration of all 6-mers
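A minimal sketch of exhaustive 6-mer enumeration over aligned sequences: for every 6-mer found in the reference species, count its occurrences and how many of them are identical in all aligned species. The conserved fraction used here is a simplification chosen for illustration; the published analysis may score conservation differently. The function name, the alignment format and the toy alignments are assumptions.

```python
from collections import Counter
from itertools import product

def conservation_rates(alignments, k=6):
    """Conserved fraction for every k-mer seen in the reference rows.

    alignments: list of alignments, each a list of equal-length strings with
    the reference species first and '-' for gaps.
    Returns {kmer: (occurrences, conserved_occurrences, conserved_fraction)}.
    """
    total = Counter()
    conserved = Counter()
    for rows in alignments:
        ref = rows[0]
        for i in range(len(ref) - k + 1):
            kmer = ref[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # skip windows containing gaps or ambiguous bases
            total[kmer] += 1
            if all(row[i:i + k] == kmer for row in rows[1:]):
                conserved[kmer] += 1
    table = {}
    for letters in product("ACGT", repeat=k):  # all 4**k possible k-mers
        kmer = "".join(letters)
        if total[kmer]:
            table[kmer] = (total[kmer], conserved[kmer],
                           conserved[kmer] / total[kmer])
    return table

# Toy alignments (made up), reference species first in each.
alignments = [
    ["TGACCTTG", "TGACCTTG", "TGACCTAG"],
    ["AATGACCT", "CATGACCT"],
]
rates = conservation_rates(alignments)
```

Ranking the 4,096 possible 6-mers by conserved fraction, against a background such as the intron control, is then a matter of sorting this table.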
Targets of new motifs showed defined expression patterns
Motifs often show clear positional bias – close to TSS
The same methods applied to 3’ UTRs reveal strand-specific motifs