Modeling Dependencies in Protein-DNA Binding Sites
1 School of Computer Science & Engineering
2 Hadassah Medical School
The Hebrew University, Jerusalem, Israel
Yoseph Barash 1
Gal Elidan 1
Nir Friedman 1
Tommy Kaplan 1,2
[Figure: a gene and its promoter, with a binding site marked in the promoter]
Dependent positions in binding sites
Pros: Biology suggests dependencies
- Single amino-acid interacts with two nucleotides
- Change in conformation of protein or DNA
Cons: Modeling dependencies is harder
- Additional parameters
- Requires more data, not as robust
A?C?T
To model or not to model dependencies? [Man & Stormo 2001, Bulyk et al, 2002, Benos et al, 2002]
Most approaches assume position independence
Data driven approach
Can we learn dependencies from available genomic data? Yes
Do dependency models perform better? Yes
Outline
- Flexible models of dependencies
- Learning from (un)aligned sequences
- Systematic evaluation
- Biological insights
How to model binding sites?
How to represent a distribution of binding sites, $P(X_1,\dots,X_5)$?
Profile: Independency model
$P(X_1,\dots,X_5) = P(X_1)\,P(X_2)\,P(X_3)\,P(X_4)\,P(X_5)$
Tree: Direct dependencies
$P(X_1,\dots,X_5) = P(X_1)\,P(X_2|X_3)\,P(X_3|X_1)\,P(X_4)\,P(X_5|X_3)$
Mixture of Profiles: Global dependencies
$P(X_1,\dots,X_5) = \sum_T P(T)\,P(X_1|T)\,P(X_2|T)\,P(X_3|T)\,P(X_4|T)\,P(X_5|T)$
Mixture of Trees: Both types of dependencies
$P(X_1,\dots,X_5) = \sum_T P(T)\,P(X_1|T)\,P(X_2|T,X_3)\,P(X_3|T,X_1)\,P(X_4|T)\,P(X_5|T,X_3)$
[Figure: graphical model for each of the four model types over positions X1 to X5, with a hidden variable T in the two mixture models]
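As a rough illustration of how the profile and tree factorizations differ when scoring a site, here is a minimal sketch. The parent map (X1 -> X3, X3 -> X2, X3 -> X5, X4 independent) is one reading of the tree on the slide, and the function name `log2_prob` and all probability tables are hypothetical toy values, not learned parameters.

```python
import math

ALPHABET = "ACGT"

def log2_prob(site, marginals, conditionals, parents):
    """Log2-probability of `site` under a profile or tree factorization.

    parents[i] is the index of the parent of position i, or None for a root.
    marginals[i][x] = P(X_i = x), used for root positions.
    conditionals[i][p][x] = P(X_i = x | X_parent = p), used otherwise.
    """
    total = 0.0
    for i, x in enumerate(site):
        par = parents[i]
        if par is None:
            total += math.log2(marginals[i][x])
        else:
            total += math.log2(conditionals[i][site[par]][x])
    return total

# Toy parameters (assumed): uniform marginals, and a conditional table in
# which a child copies its parent's nucleotide with probability 0.7.
uniform = {b: 0.25 for b in ALPHABET}
copy = {p: {x: (0.7 if x == p else 0.1) for x in ALPHABET} for p in ALPHABET}

marginals = [uniform] * 5
profile_parents = [None] * 5
tree_parents = [None, 2, 0, None, 2]      # 0-based: X2|X3, X3|X1, X5|X3
tree_conditionals = [None, copy, copy, None, copy]

site = "AAACA"
profile_ll = log2_prob(site, marginals, [None] * 5, profile_parents)
tree_ll = log2_prob(site, marginals, tree_conditionals, tree_parents)
print(profile_ll, tree_ll)
```

A profile is simply the special case in which every position is a root, which is why one function can score both model types.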
Learning models: Aligned binding sites
Learning based on methods for probabilistic graphical models (Bayesian networks)
GCGGGGCCGGGC
TGGGGGCGGGGT
AGGGGGCGGGGG
TAGGGGCCGGGC
TGGGGGCGGGGT
AAAGGGCCGGGC
GGGAGGCCGGGA
GCGGGGCGGGGC
GAGGGGACGAGT
CCGGGGCGGTCC
ATGGGGCGGGGC
[Figure: aligned binding sites -> learning machinery (select maximum likelihood model) -> models: Profile, Tree, Mixture of Profiles, Mixture of Trees]
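For the simplest of the four models, the profile, maximum likelihood learning from aligned sites reduces to counting nucleotides per column. The sketch below assumes this counting estimate with a small pseudocount for smoothing (strict maximum likelihood corresponds to `pseudocount=0`); the function names are hypothetical and the four sites are a short excerpt of the aligned examples.

```python
import math
from collections import Counter

def learn_profile(sites, pseudocount=1.0, alphabet="ACGT"):
    """Estimate per-position nucleotide probabilities from aligned sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    profile = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        profile.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return profile

def log2_likelihood(site, profile):
    """Log2-probability of one site under the profile (positions independent)."""
    return sum(math.log2(col[x]) for col, x in zip(profile, site))

sites = ["GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC"]
profile = learn_profile(sites)
print(profile[2])                         # third column: G in all four sites
print(log2_likelihood("GCGGGGCGGGGC", profile))
```

Learning tree and mixture models needs more machinery (structure search, hidden variables), which this sketch does not attempt.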
Evaluation using aligned data
Estimate generalization of each model:
Cross-validation: split the data set into a training set and a test set
Test: how probable is the site given the model?
Data set:
ATGGGGCGGGGC GTGGGGCGGGGC ATGGGGCGGGGC GTGGGGCGGGGC GCGGGGCGGGGC GAGGGGACGAGT CCGGGGCGGTCC ATGGGGCGGGGC
GCGGGGCCGGGC TGGGGGCGGGGT AGGGGGCGGGGG TAGGGGCCGGGC TGGGGGCGGGGT TGGGGGCCGGGC
Test log-likelihood: -20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39, -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43
Test avg. LL = -20.77
95 TFs with ≥ 20 binding sites from TRANSFAC database [Wingender et al, 2001]
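The reported test average can be checked directly from the 14 per-site test log-likelihood values shown on the slide. The variable names below are illustrative.

```python
# Per-site test log-likelihoods as listed on the slide.
test_ll = [-20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39,
           -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43]

avg_ll = sum(test_ll) / len(test_ll)
print(round(avg_ll, 2))
```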
Arabidopsis ABA binding factor 1
Profile: test LL per instance -19.93
Mixture of Profiles (components weighted 76% and 24%): test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)
Tree: test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)
[Figure: the learned profile, the two mixture components, and the learned tree, drawn over positions X4 to X12]
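The fold figures quoted for this example are consistent with reading the log-likelihoods as base-2 logarithms, which is an assumption here since the slides do not state the base. Under that assumption, a gain of d in test LL per instance corresponds to a 2^d fold-change in likelihood. The helper name is hypothetical.

```python
def fold_change(delta_ll_bits):
    """Fold-change in likelihood for a log2-likelihood gain (assumes base 2)."""
    return 2.0 ** delta_ll_bits

mixture_gain = -18.70 - (-19.93)   # Mixture of Profiles vs. Profile
tree_gain = -18.47 - (-19.93)      # Tree vs. Profile

print(round(mixture_gain, 2), fold_change(mixture_gain))
print(round(tree_gain, 2), fold_change(tree_gain))
```

With natural logarithms the same gains would give more than 3-fold and 4-fold, which would not match the slide's "> 2-fold" and "> 2.5-fold" as closely.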
Likelihood improvement over profiles
[Figure: fold-change in likelihood over the profile model (log scale, 0.5 to 128) for each of the 95 aligned TRANSFAC data sets (x-axis, 10 to 90); data sets are marked as significant (paired t-test) or not significant]
Significant improvement in generalization
Data often exhibits dependencies
Sources of data: Gene annotation (e.g. Hughes et al, 2000)
Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)
ChIP (e.g. Simon et al, 2001; Lee et al, 2002)
Motif finding problem
Input: A set of potentially co-regulated genes
Output: A common motif in their promoters
Evaluation for unaligned data
EM algorithm
Learning models: unaligned data
Use EM algorithm to simultaneously
- Identify binding site positions
- Learn a dependency model
[Figure: unaligned data -> EM loop alternating between "learn a model" and "identify binding sites" -> models: Profile, Tree, Mixture of Profiles, Mixture of Trees]
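The alternation between identifying site positions and learning a model can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a plain profile rather than a dependency model, and it assumes exactly one site per sequence and a uniform background. The function name `em_motif`, its parameters, and the toy sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"

def em_motif(seqs, width, iters=50, pseudo=0.5, seed=0):
    """EM for a profile motif with one site per sequence (assumed)."""
    rng = random.Random(seed)
    # Initialise from one randomly chosen window, softened by pseudocounts.
    s = rng.choice(seqs)
    start = rng.randrange(len(s) - width + 1)
    profile = [{b: ((1.0 if b == c else 0.0) + pseudo) / (1.0 + 4 * pseudo)
                for b in ALPHABET} for c in s[start:start + width]]
    posteriors = []
    for _ in range(iters):
        counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
        posteriors = []
        for seq in seqs:
            # E-step: posterior over the site start position in this sequence.
            # With a uniform background, it is proportional to the window's
            # probability under the current profile.
            weights = []
            for j in range(len(seq) - width + 1):
                w = 1.0
                for i in range(width):
                    w *= profile[i][seq[j + i]] / 0.25
                weights.append(w)
            z = sum(weights)
            post = [w / z for w in weights]
            posteriors.append(post)
            # Accumulate expected nucleotide counts for the M-step.
            for j, p in enumerate(post):
                for i in range(width):
                    counts[i][seq[j + i]] += p
        # M-step: re-estimate the profile from expected counts.
        profile = [{b: col[b] / sum(col.values()) for b in ALPHABET}
                   for col in counts]
    return profile, posteriors

seqs = ["ATATGGGCGGTATA", "TTGGGCGGATATAT", "ATTATAGGGCGGTA", "TATATATGGGCGGA"]
profile, posteriors = em_motif(seqs, width=6)
best_starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
print(best_starts)
```

EM converges to a local optimum that depends on the starting window, so in practice one would try several initializations and keep the highest-likelihood result.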
ChIP location analysis [Lee et al, 2002]
Yeast genome-wide location experiments
Target genes for 106 TFs in 146 experiments
[Figure: table of genes (YAL001C, YAL002W, YAL003W, YAL005C, ..., YAL010C, YAL012C, YAL013W, YPR201W; # genes ~ 6000) marked + or - as targets of each TF, with columns such as ABF1 Targets and ZAP1 Targets]
Example: Models learned for ABF1 (YPD)
Autonomously replicating sequence-binding factor 1
[Figure: known profile (from TRANSFAC), learned profile, and learned mixture of profiles with its components labeled 43 and 492]
Evaluating Performance
Detect target genes on a genomic scale:
[Figure: a promoter sequence (ACGTAT…AGGGATGCGAGC) with positions -1000, -473 and 0 marked. Below it, a plot of p-value (log scale, 10^-8 to 10^-1) against promoter position (-180 to -60) for the Profile and Mix of Trees models, with the threshold "Bonferroni corrected p-value ≤ 0.01" and a biologically verified site marked]
Example: Gal4 regulates Gal80
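The thresholding step can be sketched as below. It assumes the Bonferroni correction multiplies each per-position p-value by the number of positions scanned; the slides do not say which count is used, and how the raw p-values are obtained from model scores is not covered. The function name and the example p-values are hypothetical.

```python
def bonferroni_hits(p_values, alpha=0.01):
    """Return (position, corrected p-value) for positions passing the cutoff.

    Assumes one test per scanned position, so each raw p-value is
    multiplied by len(p_values) and capped at 1.
    """
    n = len(p_values)
    hits = []
    for pos, p in enumerate(p_values):
        corrected = min(1.0, p * n)
        if corrected <= alpha:
            hits.append((pos, corrected))
    return hits

# Hypothetical raw p-values for four promoter positions.
raw = [0.2, 1e-6, 0.03, 5e-5]
hits = bonferroni_hits(raw)
print(hits)
```

A gene would then be called a target if its promoter contains at least one position that passes the corrected threshold.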
Evaluation using ChIP location data [Lee et al, 2002]
Evaluate using a 5-fold cross-validation test:
[Figure: the data set of genes (YAL001C, YAL002W, YAL003W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W) with +/- target labels is split into training and test sets. Predictions on each test set are compared with the true labels and marked as correct (√), false negative (FN) or false positive (FP)]
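A 5-fold cross-validation loop of this kind can be sketched as follows. The fold assignment (gene i goes to fold i mod k) and the helper names are assumptions for illustration; the model training and prediction steps are left out, and the labels in the usage lines are made up.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; item i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test

def tally(true_labels, predicted_labels):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, fp, fn

folds = list(k_fold_indices(10, k=5))
print(folds[0])
print(tally([True, False, True, False], [True, True, False, False]))
```

Each gene is held out exactly once, so the per-fold tallies can be summed to give counts over the whole data set.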
Example: ROC curve of HSF1
[Figure: ROC curves for Profile, Tree, Mixture of Profiles and Mixture of Trees; x-axis: False Positive Rate (0% to 5%), y-axis: True Positive Rate (Sensitivity, 0% to 90%); an annotation marks ~60 FP]
Improvement in sensitivity & specificity: Tree vs. Profile
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: scatter plot of Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60) over the 105 unaligned data sets from Lee et al.; numbers shown on the plot: 30, 615, 3. Inset: the True and Predicted target sets, overlapping in TP]
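The two measures, as defined on the slide, can be computed from the sets of true and predicted target genes. Note that TP / Predicted is what is more commonly called precision or positive predictive value. The function name and the gene sets below are hypothetical examples.

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Sensitivity = TP / True and specificity = TP / Predicted (slide's usage)."""
    true_set = set(true_targets)
    predicted_set = set(predicted_targets)
    tp = len(true_set & predicted_set)
    sensitivity = tp / len(true_set) if true_set else 0.0
    specificity = tp / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity

# Hypothetical target sets for one transcription factor.
true_targets = {"YAL001C", "YAL003W", "YAL005C"}
predicted_targets = {"YAL001C", "YAL005C", "YAL012C", "YAL013W"}
sens, spec = sensitivity_specificity(true_targets, predicted_targets)
print(sens, spec)
```

The Δ values in the scatter plots would then be the difference in each measure between a dependency model and the profile model on the same data set.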
Improvement in sensitivity & specificity: Mixture of Profiles vs. Profile
[Figure: scatter plot of Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60) over the 105 unaligned data sets from Lee et al.; numbers shown on the plot: 52, 1718, 0]
Improvement in sensitivity & specificity: Mixture of Trees vs. Profile
[Figure: scatter plot of Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60) over the 105 unaligned data sets from Lee et al.; numbers shown on the plot: 84, 162, 1]
“Is it worthwhile to model dependencies?”
Evaluation clearly supports this
What about the underlying biology? (with Prof. Hanah Margalit, Hadassah Medical School)
Distance between dependent positions
Tree models learned from the aligned data sets
[Figure: histogram of the number of dependencies (0 to 50) by distance between the dependent positions (1 to 11), split into Weak (< 0.3 bits), Medium (< 0.7 bits) and Strong; an annotation reads "< 1/3 of the dependencies"]
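One common way to measure the strength of a dependency between two positions in bits is their mutual information, estimated from the aligned sites. The slides give only the thresholds, so using mutual information here is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information_bits(sites, i, j):
    """Empirical mutual information (in bits) between positions i and j."""
    n = len(sites)
    count_i = Counter(s[i] for s in sites)
    count_j = Counter(s[j] for s in sites)
    count_ij = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
    return mi

def strength(bits):
    """Bin a dependency using the thresholds from the histogram legend."""
    if bits < 0.3:
        return "Weak"
    if bits < 0.7:
        return "Medium"
    return "Strong"

dependent = ["AA", "CC", "AA", "CC"]       # second position copies the first
independent = ["AA", "AC", "CA", "CC"]     # all combinations equally often
print(mutual_information_bits(dependent, 0, 1), strength(1.0))
print(mutual_information_bits(independent, 0, 1), strength(0.0))
```

Estimates from small samples are biased upward, so a dependency seen in only a few dozen sites should be read with caution.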
Structural families
Dependency models vs. Profile on aligned data sets
[Figure: fold-change in likelihood (log scale, 0.5 to 128) for data sets grouped by structural family: Zinc finger, bZIP, bHLH, Helix Turn Helix, β Sheet, others ???]
Conclusions
- Flexible framework for learning dependencies
- Dependencies are found in many cases
- It is worthwhile to model them: better learning and binding site prediction
http://compbio.cs.huji.ac.il/TFBN
Future work
- Link to the underlying structural biology
- Incorporate as part of other regulatory mechanism models