Modeling Dependencies in Protein-DNA Binding Sites

1 School of Computer Science & Engineering
2 Hadassah Medical School
The Hebrew University, Jerusalem, Israel

Yoseph Barash 1

Gal Elidan 1

Nir Friedman 1

Tommy Kaplan 1,2

[Figure: a gene, its promoter, and a binding site.]

Dependent positions in binding sites

Pros: Biology suggests dependencies
  Single amino-acid interacts with two nucleotides
  Change in conformation of protein or DNA

Cons: Modeling dependencies is harder
  Additional parameters
  Requires more data, not as robust

A?C?T

To model or not to model dependencies? [Man & Stormo, 2001; Bulyk et al, 2002; Benos et al, 2002]

Most approaches assume position independence

Data driven approach:
  Can we learn dependencies from available genomic data? Yes
  Do dependency models perform better? Yes

Outline
  Flexible models of dependencies
  Learning from (un)aligned sequences
  Systematic evaluation
  Biological insights

How to model binding sites?

Represent a distribution of binding sites: $P(X_1, X_2, X_3, X_4, X_5) = \; ?$

Profile: Independency model
$P(X_1 \ldots X_5) = P(X_1)\,P(X_2)\,P(X_3)\,P(X_4)\,P(X_5)$

Tree: Direct dependencies
$P(X_1 \ldots X_5) = P(X_1)\,P(X_2 \mid X_3)\,P(X_3 \mid X_1)\,P(X_4)\,P(X_5 \mid X_3)$

Mixture of Profiles: Global dependencies
$P(X_1 \ldots X_5) = \sum_T P(T)\,P(X_1 \mid T)\,P(X_2 \mid T)\,P(X_3 \mid T)\,P(X_4 \mid T)\,P(X_5 \mid T)$

Mixture of Trees: Both types of dependencies
$P(X_1 \ldots X_5) = \sum_T P(T)\,P(X_1 \mid T)\,P(X_2 \mid X_3, T)\,P(X_3 \mid X_1, T)\,P(X_4 \mid T)\,P(X_5 \mid X_3, T)$
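The four model families can be compared by how each one scores a single site. The following is a minimal sketch, not the authors' implementation: the function names, the toy parameters, and the particular tree structure in `parents` are all hypothetical and chosen only for illustration.

```python
import math

ALPHABET = "ACGT"

def profile_log_prob(site, profile):
    """log2 P(site) under a profile: every position is independent."""
    return sum(math.log2(profile[i][base]) for i, base in enumerate(site))

def tree_log_prob(site, parents, root_dists, cond_dists):
    """log2 P(site) under a tree: each position has at most one parent.

    parents[i] is the parent index of position i, or None for a root.
    root_dists[i][base] = P(X_i = base) for root positions.
    cond_dists[i][parent_base][base] = P(X_i = base | X_parent = parent_base).
    """
    total = 0.0
    for i, base in enumerate(site):
        p = parents[i]
        if p is None:
            total += math.log2(root_dists[i][base])
        else:
            total += math.log2(cond_dists[i][site[p]][base])
    return total

def mixture_log_prob(site, weights, component_log_probs):
    """log2 P(site) under a mixture: sum over the hidden component T."""
    total = sum(w * 2.0 ** f(site) for w, f in zip(weights, component_log_probs))
    return math.log2(total)

# Toy parameters (illustrative values, not taken from the slides).
uniform = {b: 0.25 for b in ALPHABET}
profile = [uniform] * 5
parents = [None, 2, 0, None, 2]  # hypothetical tree over five positions
root_dists = {0: uniform, 3: uniform}
# A child repeats its parent's base with probability 0.7.
copy = {pb: {b: (0.7 if b == pb else 0.1) for b in ALPHABET} for pb in ALPHABET}
cond_dists = {1: copy, 2: copy, 4: copy}

site = "AAAAA"
profile_ll = profile_log_prob(site, profile)
tree_ll = tree_log_prob(site, parents, root_dists, cond_dists)
mix_ll = mixture_log_prob(
    site, [0.76, 0.24],
    [lambda s: profile_log_prob(s, profile)] * 2,
)
```

Under these toy parameters the tree assigns a higher probability to `AAAAA` than the uniform profile does, because three of its positions agree with their parents. A mixture of trees would pass tree scoring functions to `mixture_log_prob` instead of profile ones.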

Learning models: Aligned binding sites

Learning based on methods for probabilistic graphical models (Bayesian networks)

Aligned binding sites:
GCGGGGCCGGGC
TGGGGGCGGGGT
AGGGGGCGGGGG
TAGGGGCCGGGC
TGGGGGCGGGGT
AAAGGGCCGGGC
GGGAGGCCGGGA
GCGGGGCGGGGC
GAGGGGACGAGT
CCGGGGCGGTCC
ATGGGGCGGGGC

[Figure: the aligned binding sites are passed to the learning machinery, which produces the models (profile, tree, mixture of profiles, mixture of trees).]

Select maximum likelihood model
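For the simplest of the models, the profile, maximum-likelihood learning from aligned sites reduces to counting bases per column. The sketch below assumes 12-bp sites read from the example above; the `pseudocount` parameter and the function name are assumptions added here to avoid zero probabilities, and setting `pseudocount=0` gives the plain maximum-likelihood estimate.

```python
from collections import Counter

ALPHABET = "ACGT"

def learn_profile(sites, pseudocount=1.0):
    """Estimate one base distribution per column from aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("aligned sites must all have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)  # missing bases count as 0
        profile.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return profile

# First four example sites, assuming each site is 12 bp long.
sites = ["GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC"]
profile = learn_profile(sites)
```

Learning a tree or a mixture needs more machinery (structure search, hidden components), which this sketch does not attempt.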

Evaluation using aligned data

95 TFs with ≥ 20 binding sites from TRANSFAC database [Wingender et al, 2001]

Estimate generalization of each model:

Cross-validation: split the data set into a training set and a test set
Test: how probable is the site given the model?

[Figure: test log-likelihood of each test site: -20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39, -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43. Test avg. LL = -20.77]
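The test score is the average log-likelihood of held-out sites. As a small illustration, the sketch below averages the per-site values shown on the slide and adds a generic k-fold splitter; the helper name `k_fold_splits` and the choice of `k` are assumptions, since the slide does not state the number of folds used for aligned data.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; every item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

# Per-site test log-likelihoods as listed on the slide.
test_ll = [-20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39,
           -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43]
avg_ll = sum(test_ll) / len(test_ll)
print(round(avg_ll, 2))  # -20.77

splits = list(k_fold_splits(list(range(10)), k=5))
```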

Example: Arabidopsis ABA binding factor 1

Profile: Test LL per instance -19.93

Mixture of Profiles (components weighted 76% and 24%): Test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)

Tree (figure shows positions X4 to X12): Test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)
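The fold improvements follow from the log-likelihood differences. The sketch below assumes the log-likelihoods are in base 2, which is an inference rather than something the slide states: 2^1.23 ≈ 2.35 and 2^1.46 ≈ 2.75 match the quoted "> 2-fold" and "> 2.5-fold" closely. The function name is hypothetical.

```python
def fold_change(ll_model, ll_baseline, base=2.0):
    """Likelihood ratio implied by a difference of per-instance log-likelihoods."""
    return base ** (ll_model - ll_baseline)

profile_ll = -19.93
mixture_fold = fold_change(-18.70, profile_ll)  # difference of +1.23
tree_fold = fold_change(-18.47, profile_ll)     # difference of +1.46
```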

Likelihood improvement over profiles

[Figure: fold-change in likelihood (log scale, 0.5 to 128) for each of the 95 TRANSFAC aligned data sets (x-axis 10 to 90); data sets are marked as significant (paired t-test) or not significant.]

Significant improvement in generalization

Data often exhibits dependencies

Evaluation for unaligned data

Sources of data:
  Gene annotation (e.g. Hughes et al, 2000)
  Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)
  ChIP (e.g. Simon et al, 2001; Lee et al, 2002)

Motif finding problem
  Input: A set of potentially co-regulated genes
  Output: A common motif in their promoters

EM algorithm

Learning models: unaligned data

Use the EM algorithm to simultaneously:
  Identify binding site positions
  Learn a dependency model

[Figure: starting from unaligned data, the procedure alternates between learning a model (profile, tree, mixture of profiles, or mixture of trees) and identifying binding sites.]
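The alternation between locating sites and refitting the model can be sketched with the simplest case. The code below is an illustrative EM loop, not the authors' procedure: it assumes exactly one site per sequence, a uniform background, and a profile motif rather than a dependency model, and every name and default in it is hypothetical. A dependency model would replace the M-step with fitting a tree or a mixture to the weighted sites.

```python
import math
import random

ALPHABET = "ACGT"

def em_motif(seqs, width, iters=50, pseudocount=0.5, seed=0):
    """EM for one motif occurrence per sequence with a profile motif."""
    rng = random.Random(seed)
    # Random initial profile: one probability table per motif column.
    profile = []
    for _ in range(width):
        raw = [rng.random() + 0.5 for _ in ALPHABET]
        total = sum(raw)
        profile.append({b: r / total for b, r in zip(ALPHABET, raw)})
    posteriors = []
    for _ in range(iters):
        # E-step: posterior over the site's start position in each sequence.
        posteriors = []
        for s in seqs:
            scores = []
            for start in range(len(s) - width + 1):
                # Log-likelihood ratio of motif vs. uniform background.
                scores.append(sum(math.log(profile[i][s[start + i]] / 0.25)
                                  for i in range(width)))
            top = max(scores)
            weights = [math.exp(x - top) for x in scores]
            z = sum(weights)
            posteriors.append([w / z for w in weights])
        # M-step: re-estimate the profile from expected base counts.
        counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for start, p in enumerate(post):
                for i in range(width):
                    counts[i][s[start + i]] += p
        profile = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
    # Report the most probable start position in each sequence.
    starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return profile, starts

# Toy promoters (made up for illustration).
seqs = ["TTACGGGGCGGGGCATT", "GGGGCGGGGCTTTATAT", "ATATAGGGGCGGGGCAA"]
profile, starts = em_motif(seqs, width=10)
```

EM converges to a local optimum, so in practice the result depends on the starting point; the random initialization here is only one possible choice.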

ChIP location analysis [Lee et al, 2002]

Yeast genome-wide location experiments
Target genes for 106 TFs in 146 experiments

[Table: one row per gene (YAL001C, YAL002W, YAL003W, YAL005C, ..., YAL010C, YAL012C, YAL013W, YPR201W; # genes ~ 6000) and one column per TF (ABF1 Targets, ..., ZAP1 Targets); each cell marks the gene as a target (+) or not (–).]

Example: Models learned for ABF1 (YPD)
Autonomously replicating sequence-binding factor 1

[Figure: known profile (from TRANSFAC), learned profile, and learned mixture of profiles with two components labeled 43 and 492.]

Evaluating Performance

Detect target genes on a genomic scale:

[Figure: a promoter sequence (ACGTAT…AGGGATGCGAGC) scanned from -1000 to 0, with position -473 marked. Plot of p-value (log scale, 10^-1 to 10^-8) against promoter position (-180 to -60) for the Profile and Mix of Trees models. Threshold: Bonferroni corrected p-value ≤ 0.01. Example: Gal4 regulates Gal80, with the biologically verified site marked.]
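The Bonferroni threshold mentioned above can be illustrated with a short sketch. It assumes that the number of tests equals the number of scanned promoter positions, which the slide does not specify, and the function names and example p-values are hypothetical.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

def significant_hits(p_values, alpha=0.01):
    """Indices of positions whose corrected p-value is at most alpha."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if bonferroni(p, n) <= alpha]

# Made-up per-position p-values for a four-position scan.
hits = significant_hits([1e-6, 0.5, 0.004, 0.002])
```

Here positions 0 and 3 pass the corrected threshold of 0.01, while 0.004 × 4 = 0.016 does not.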

Evaluation using ChIP location data [Lee et al, 2002]

Evaluate using a 5-fold cross-validation test:

[Figure: the data set of genes (YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W) with true target labels (+/–) is split into training and test sets. Predictions on the test set are compared with the true labels, and each gene is marked as correct (√), false negative (FN), or false positive (FP).]

Example: ROC curve of HSF1

[Figure: ROC curves for Profile, Tree, Mixture of Profiles, and Mixture of Trees. x-axis: False Positive Rate (0% to 5%); y-axis: True Positive Rate (Sensitivity) (0% to 90%). A marker on the plot reads ~60 FP.]

Improvement in sensitivity & specificity

105 unaligned data sets from Lee et al.

Sensitivity = TP / True
Specificity = TP / Predicted

[Figure: Tree vs. Profile. Scatter plot of Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60). Numbers shown on the plot: 30, 615, 3. Inset diagram: overlap (TP) of the True and Predicted sets.]
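The two measures as defined on the slide can be computed directly from true and predicted target labels. This is a minimal sketch with made-up labels and a hypothetical function name. Note that "specificity" as defined here (TP / Predicted) is what is more commonly called precision or positive predictive value.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (TP / True, TP / Predicted) as defined on the slide."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t and p)
    n_true = sum(1 for t in true_labels if t)
    n_pred = sum(1 for p in predicted_labels if p)
    sensitivity = tp / n_true if n_true else 0.0
    specificity = tp / n_pred if n_pred else 0.0
    return sensitivity, specificity

# Made-up labels for eight genes: True means "target of the TF".
true_labels = [True, False, True, False, False, True, False, False]
predicted = [True, False, True, False, False, False, True, False]
sens, spec = sensitivity_specificity(true_labels, predicted)
```

In this toy case there are two true positives, one false negative, and one false positive, so both measures equal 2/3.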

[Figure: Mixture of Profiles vs. Profile. Scatter plot of Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60). Numbers shown on the plot: 52, 1718, 0.]

[Figure: Mixture of Trees vs. Profile. Scatter plot of Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60). Numbers shown on the plot: 84, 162, 1.]

“Is it worthwhile to model dependencies?” Evaluation clearly supports this

What about the underlying biology? (with Prof. Hanah Margalit, Hadassah Medical School)

Distance between dependent positions

Tree models learned from the aligned data sets

[Figure: histogram of the number of dependencies (0 to 50) by distance between the dependent positions (1 to 11), split into Weak (< 0.3 bits), Medium (< 0.7 bits), and Strong. An annotation on the plot reads "< 1/3 of the dependencies".]
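The slide grades dependencies in bits without naming the measure. One natural candidate, assumed here and not confirmed by the source, is the mutual information between two positions across the aligned sites. The sketch below computes it and applies the slide's thresholds; the function names and toy sites are hypothetical.

```python
import math
from collections import Counter

def mutual_information_bits(sites, i, j):
    """Empirical mutual information (in bits) between columns i and j."""
    n = len(sites)
    count_i = Counter(s[i] for s in sites)
    count_j = Counter(s[j] for s in sites)
    count_ij = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for (a, b), c in count_ij.items():
        mi += (c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
    return mi

def strength(bits):
    """Bucket a dependency using the thresholds shown on the slide."""
    if bits < 0.3:
        return "weak"
    if bits < 0.7:
        return "medium"
    return "strong"

dependent = mutual_information_bits(["AA", "CC", "AA", "CC"], 0, 1)
independent = mutual_information_bits(["AA", "AC", "CA", "CC"], 0, 1)
```

With few sites, empirical mutual information is biased upward, so a real analysis would need some correction or significance test.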

Structural families

[Figure: Dependency models vs. Profile on aligned data sets. Fold-change in likelihood (log scale, 0.5 to 128) across the data sets (x-axis 10 to 90), marked as significant (paired t-test) or not significant, and grouped by structural family: Zinc finger, bZIP, bHLH, Helix Turn Helix, β Sheet, others, ???.]

Conclusions
  Flexible framework for learning dependencies
  Dependencies are found in many cases
  It is worthwhile to model them: better learning and binding site prediction

http://compbio.cs.huji.ac.il/TFBN

Future work
  Link to the underlying structural biology
  Incorporate as part of other regulatory mechanism models