Date post: | 13-Dec-2015 |
Upload: | jordan-washington |
Outline
• Re-Introduction of my problem
• Current state of affairs
• Known dependency factor 1 – Rotamer
• Known dependency factor 2 – Water
• Known dependency factor 3 – DNA flexibility
• Some thoughts on what to do next
Re-Introduction
• I am working on finding a dependency model of TF-DNA binding.
• What is TF-DNA binding? – If you ask this, you may be in the wrong room.
• It is known that different TFs prefer to bind different DNA sequences.
• Classic example: the TATA-box binding protein binds the sequence “TATA”.
Re-Introduction (2)
• It is commonly assumed that each position in T-A-T-A contributes independently to the binding energy.
• That is to say, some guys from the TF will bind the first “T”, some others will bind the second “A”, and so on.
• If the sequence becomes CATA, then it depends on how much the guys who bind the 1st position like the new “C”. If they are OK with it, the binding energy may change a little but the TF still binds.
• Otherwise, too bad.
Re-Introduction (3)
• One such model, a very popular one, is the PSSM (position-specific scoring matrix) model.
• It has been shown to be very good at estimating the real binding sites of many TFs.
• However, some were curious whether the model holds for all TFs.
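To make the independence assumption concrete, here is a minimal sketch of a log-odds PSSM in Python. The aligned sites, the pseudocount, and the uniform background of 0.25 are illustrative assumptions, not data or settings from any of the cited work; the function names `build_pssm` and `score` are hypothetical.

```python
import math

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length aligned binding sites."""
    length = len(sites[0])
    pssm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2((counts[b] / total) / background) for b in "ACGT"})
    return pssm

def score(pssm, seq):
    """Additive score: each position contributes independently of the others."""
    return sum(column[base] for column, base in zip(pssm, seq))

# Hypothetical aligned sites, for illustration only.
sites = ["TATA", "TATA", "TATA", "CATA", "TATT"]
pssm = build_pssm(sites)
for candidate in ("TATA", "CATA", "GGGG"):
    print(candidate, round(score(pssm, candidate), 3))
```

Because the score is a plain sum over columns, changing TATA to CATA changes only the first term of the sum, which is exactly the "how much do the guys at position 1 like the new C" picture.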
Current state of affairs
• There are quite a few publications which try to show that there are measurable dependencies among the positions.
– RECOMB 2003-Modeling dependencies in Protein-DNA binding sites
• Multi PSSM, Tree, Multi Tree. Bayesian network based training.
– Bioinformatics 2004-Modeling within-motif dependence for transcription factor binding site predictions
• PSSM with pairwise correlated positions using Bayes Factor. Gibbs sampling based.
– BIBE 2006-Discovering DNA Motifs with Nucleotide Dependency
• PSSM with multi-positions, heuristic.
– Bioinformatics 2007-Position dependencies in transcription factor binding sites
• Checks dependencies within a set of aligned binding sites with different statistical measures.
Current state of affairs (2)
– Bioinformatics 2008-Context-dependent DNA recognition code for C2H2 zinc-finger transcription factors
• Neural network based.
– PLoSCompBio 2008-A Feature-Based Approach to Modeling Protein-DNA Interactions
• Feature based – currently only considers pairwise position dependency features.
– NAR 2010-On the detection and refinement of transcription factor binding sites using ChIP-Seq data
• Similar to Bioinformatics 2004.
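As one example of a statistical measure of dependence between two positions in a set of aligned binding sites, the sketch below computes the plug-in mutual information between two columns. This is a generic measure chosen for illustration, not necessarily the one used in any of the papers listed; the sites are hypothetical and the function name `mutual_information` is an assumption.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Plug-in mutual information (in bits) between columns i and j."""
    n = len(sites)
    col_i = Counter(s[i] for s in sites)
    col_j = Counter(s[j] for s in sites)
    joint = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Compare the observed joint frequency with what independence predicts.
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Hypothetical sites: positions 0 and 1 always co-vary, position 3 varies freely.
sites = ["AAGT", "AAGC", "CCGT", "CCGC"]
print(mutual_information(sites, 0, 1))  # dependent pair
print(mutual_information(sites, 0, 3))  # independent pair
```

With only a handful of sites the plug-in estimate is biased upward, so in practice it would need a significance test or a correction before any pair is called dependent.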
Current state of affairs (3)
• However, they have a similar framework
– Start with a set of “known” binding sequences
– Try to guess a model with and without dependencies
– Train the model using the dataset (possibly making gradual changes to the model during the training)
– Compare which model is better
– They will list down the positions with dependencies – most are consecutive positions, but some are quite distant.
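The common framework above can be sketched as follows: fit one model that treats every position independently and one that models a chosen pair of positions jointly, then compare them on held-out sites. Held-out log-likelihood is an assumption made here for simplicity; the cited papers use their own training and comparison procedures. All sequences are hypothetical, and the helper names are illustrative.

```python
import math
from collections import Counter

def _loglik(train, test, key, n_outcomes, pc):
    """Log-likelihood of test sites for one model component.

    `key` extracts the modeled symbol (a base or a pair of bases) from a site.
    """
    counts = Counter(key(s) for s in train)
    denom = len(train) + n_outcomes * pc
    return sum(math.log((counts[key(s)] + pc) / denom) for s in test)

def loglik_independent(train, test, pc=0.5):
    """Every position is its own component (the PSSM assumption)."""
    length = len(train[0])
    return sum(_loglik(train, test, lambda s, p=p: s[p], 4, pc) for p in range(length))

def loglik_pair(train, test, i, j, pc=0.5):
    """Positions i and j are modeled jointly; all others stay independent."""
    length = len(train[0])
    total = _loglik(train, test, lambda s: (s[i], s[j]), 16, pc)
    for p in range(length):
        if p not in (i, j):
            total += _loglik(train, test, lambda s, p=p: s[p], 4, pc)
    return total

# Hypothetical data in which positions 0 and 1 always co-vary.
train = ["AAT", "CCT", "AAT", "CCT", "AAG", "CCG"]
test = ["AAT", "CCG", "CCT", "AAG"]
print("independent:", loglik_independent(train, test))
print("pair (0,1): ", loglik_pair(train, test, 0, 1))
```

Scoring on held-out sites rather than on the training set keeps the pair model from winning purely because it has more parameters.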
Current state of affairs (4)
• Well, these approaches just fit a model to a set of sequences known to bind. The binding energy is not really taken into account.
• So others, with more $$$ in their lab, did huge biological experiments to see whether the experimental binding energies of some TFs exhibit a dependency pattern.
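A simple way to test for dependency in measured binding energies is to compare a double mutant against the sum of its two single mutants. The sketch below assumes binding free energies are available for the wild type, both single mutants, and the double mutant; the numbers are made up for illustration and the function name `non_additivity` is hypothetical.

```python
def non_additivity(dg_wt, dg_mut1, dg_mut2, dg_double):
    """Return the coupling term: ddG(double) - (ddG(mut1) + ddG(mut2)).

    A value of zero means the two positions contribute additively
    (independently); a non-zero value indicates a dependency.
    """
    ddg1 = dg_mut1 - dg_wt
    ddg2 = dg_mut2 - dg_wt
    ddg_double = dg_double - dg_wt
    return ddg_double - (ddg1 + ddg2)

# Hypothetical binding free energies (same units throughout), not measurements.
coupling = non_additivity(dg_wt=-10.0, dg_mut1=-8.5, dg_mut2=-9.0, dg_double=-8.0)
print(coupling)
```

In real data the coupling term has to be compared against the experimental error before it is read as evidence of dependency.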
Current state of affairs (5)
• Hence some more papers:
– NAR 2002-Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors
– NAR 2002-Additivity in protein-DNA interactions-how good an approximation is it?
– Nature Biotechnology 2006-Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities
– Science 2009-Diversity and Complexity in DNA Recognition by Transcription Factors
– PLoSCompBio 2009-Inferring Binding Energies from Selected Binding Sites
Current state of affairs (7)
• Yet none of the publications I have read so far gives concrete evidence on HOW such dependencies could happen.
• We are now trying to find out what happens at the physical level when two positions in the DNA are dependent.
Known dependency factor 1 – Rotamer
• Recently there was an experiment involving the zinc finger TF Zif268, which has been one of the most popular zinc finger modeling targets.
Known dependency factor 1 – Rotamer
• They changed the wild-type DNA sequence GCG to ACG, CCG, AAG, and CAG.
• We try to see whether a program that can change the side chains of the TF to conform to the new DNA sequence can approximate the change in the binding energy.
• We tried FoldX – it does rotamer checks, but we are not sure whether it is optimal.
FoldX results
Total energy | Backbone Hbond | Sidechain Hbond | Van der Waals | Electrostatics | Solvation Polar | Solvation Hydrophobic
0 | 0 | 0 | 0 | 0 | 0 | 0
4.23 | -0.36 | 5.01 | 2.08 | 2.25 | -5.13 | 0.95
4.28 | 0 | 4.37 | 0.06 | 1 | -1.23 | -0.17
-0.02 | -0.01 | 1.96 | 0.87 | 0.29 | -3.1 | -0.1
4 | -0.35 | 6.81 | 3.14 | 2.38 | -8.67 | 1.17
4.39 | 0 | 5.58 | 1.28 | 1.55 | -4.15 | -0.13
Known dependency factor 1 – Rotamer
• However, the rotamers that FoldX predicts do not coincide with the diagrams.
• Either FoldX is not optimal, or the homology modeling done in the paper is not accurate.
• But given the close agreement between the predicted and experimental differences in binding affinity, the paper's models are most probably (more) correct.
• I am still checking on that.
Known dependency factor 2 – Water
• What the NAR paper explicitly computes are the solvation penalties (the circles, rectangles and triangles in the diagram).
• They claim that the water-mediated H-bonds are not that crucial.
• We can see that FoldX does compute hydration to a certain extent. Yet its rotamer search may not be good enough.
Known dependency factor 3 – DNA flexibility
• G-C will have a higher roll angle – making it less stable (weaker stacking energy) and easier to “open”.
• Several works show that different dinucleotide steps have different bending and twisting energies.
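To see why step-dependent flexibility produces position dependency, consider a deformation energy that is summed over consecutive dinucleotide steps. The sketch below uses placeholder step costs (every step costs 1.0 except "TA" at 0.2, in arbitrary units); these are assumptions for illustration, not measured bending or twisting energies, and the function name `deformation_energy` is hypothetical.

```python
from itertools import product

def deformation_energy(seq, step_cost):
    """Sum the cost of every consecutive dinucleotide step in seq."""
    return sum(step_cost[seq[k:k + 2]] for k in range(len(seq) - 1))

# Placeholder step costs in arbitrary units, NOT experimental values.
step_cost = {a + b: 1.0 for a, b in product("ACGT", repeat=2)}
step_cost["TA"] = 0.2  # assume one step is much cheaper to deform

# Effect of the same T -> C change at position 0 in two different contexts.
delta_next_to_a = deformation_energy("CATA", step_cost) - deformation_energy("TATA", step_cost)
delta_next_to_g = deformation_energy("CGTA", step_cost) - deformation_energy("TGTA", step_cost)
print(delta_next_to_a, delta_next_to_g)
```

The same substitution at position 0 costs a different amount depending on the base at position 1, which a per-position additive model cannot represent.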
Known dependency factor 3 – DNA flexibility
• TATA binding protein actually binds TATA not because it generates the best binding energy.
• The bindings are mostly non-specific.
Conclusion
• Up to now, these 3 factors are the known/most probable causes of dependency between DNA positions.
• The challenge would be to combine all of these into one scoring function that is simple enough to run on large datasets.