In Silico Transcription Factor Binding Site Prediction How...

Date post: 13-Aug-2020
Page 1: In Silico Transcription Factor Binding Site Prediction How ...pieterdb/MASTERS/TFBS_prediction_how_… · In Silico Transcription Factor Binding Site Prediction ... reflect in vivo

31/03/2014

1

Pieter De Bleser, Ph.D.

[email protected]

In Silico Transcription Factor Binding Site Prediction –

How To Improve?

1 Credits: R. Bruskiewich and F. Brinkman, MBB with material from Wyeth W. Wasserman and Shannan Ho Sui; Stewart MacArthur - DNA Motif Finding

PSSM - Detecting binding sites in a single sequence

Raw Scores

A [-0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]
C [-0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]
G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]
T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

Sp1

Relative Scores

Max_score = 15.2 (sum of highest column scores)

Min_score = -10.3 (sum of lowest column scores)

Abs_score = 13.4 (sum of column scores)

$$\text{Rel\_score} = \frac{\text{Abs\_score} - \text{Min\_score}}{\text{Max\_score} - \text{Min\_score}} \times 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \times 100\% \approx 93\%$$

Empirical p-value Scores

[Figure: histogram of relative scores; x-axis: Relative Score (0.0 to 1.0), y-axis: Frequency (0.0 to 0.3)]

$$\text{Empirical p-value} = \frac{\text{Area to right of value}}{\text{Area under entire curve}}$$
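The scoring steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slides: the function names are made up here, the matrix is the Sp1 matrix as transcribed above, and the empirical p-value is approximated as the fraction of background scores at or above a given score. The matrix as transcribed need not reproduce the slide's Max_score and Min_score, so the slide's own numbers are passed in directly for the worked relative score.

```python
# Sp1 PSSM as transcribed above: one row per base, one column per motif position.
PSSM = {
    "A": [-0.2284, 0.4368, -1.5, -1.5, -1.5, 0.4368, -1.5, -1.5, -0.2284, 0.4368],
    "C": [-0.2284, -0.2284, -1.5, -1.5, 1.5128, -1.5, -0.2284, -1.5, -0.2284, -1.5],
    "G": [1.2348, 1.2348, 2.1222, 2.1222, 0.4368, 1.2348, 1.5128, 1.7457, 1.7457, -1.5],
    "T": [0.4368, -0.2284, -1.5, -1.5, -0.2284, 0.4368, 0.4368, 0.4368, -1.5, 1.7457],
}
WIDTH = len(PSSM["A"])


def abs_score(site):
    """Sum the column scores of one candidate site (positions treated as independent)."""
    return sum(PSSM[base][i] for i, base in enumerate(site))


def rel_score(abs_s, max_s, min_s):
    """Rescale an absolute score to 0..100 % between the minimum and maximum score."""
    return 100.0 * (abs_s - min_s) / (max_s - min_s)


def scan(sequence):
    """Score every window of motif width; returns (start, site, absolute score) tuples."""
    return [
        (i, sequence[i:i + WIDTH], abs_score(sequence[i:i + WIDTH]))
        for i in range(len(sequence) - WIDTH + 1)
    ]


def empirical_p_value(score, background_scores):
    """Fraction of background scores at or above `score` (area to the right / total area)."""
    return sum(s >= score for s in background_scores) / len(background_scores)


hits = scan("ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC")
best = max(hits, key=lambda h: h[2])
print(best)

# Worked example with the numbers given on the slide.
print(round(rel_score(13.4, 15.2, -10.3)))  # 93
```

Only the forward strand is scanned here; a real scan would usually score the reverse complement as well.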


The Good…

• Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound!

• Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy

[Figure: binding energy plotted against PSSM score]

…The Bad…

• Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence

– This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average human gene size)


…and the Ugly!

Human cardiac α-actin gene analyzed with a set of profiles (each line represents a TFBS prediction).

Red boxes are protein-coding exons (TFBS predictions excluded in this analysis).

Futility Conjecture: TFBS predictions, as a collective group, are almost always wrong!

True binding sites are defined by properties not incorporated into the PSSM profile scores.

PSSMs - Conclusions

• PSSMs accurately reflect in vitro binding properties of DNA binding proteins

• Suitable binding sites occur at a rate far too frequent to reflect in vivo function

• Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity

• Unfiltered predictions are too noisy for most applications

• Note: Organisms with short regulatory sequences are less problematic (e.g. yeast and bacteria)

Example of a bioinformatics challenge that needs more biological information to make predictions!


Phylogenetic footprinting

Phylogenetic footprinting relies upon two major concepts:

1. The function and DNA binding preferences of transcription factors are well-conserved between diverse species.

2. Important non-coding DNA sequences that are essential for regulating gene expression will show differential selective pressure. A slower rate of change occurs in TFBS than in other, less critical, parts of the non-coding genome.

http://en.wikipedia.org/wiki/Phylogenetic_footprinting

Phylogenetic footprinting - Protocol:

1. Decide on the gene of interest.

2. Carefully choose species with orthologous genes.

3. Decide on the length of the upstream, or possibly downstream, region to examine.

4. Align the sequences.

5. Look for conserved regions and analyze them.


Phylogenetic footprinting

[Figure: actin gene compared between human and mouse; x-axis: 200 bp window start position (human sequence)]

• Align orthologous gene sequences (e.g. with LAGAN). For the first window of x bp (here 200 bp) of sequence 1, determine the % identity with sequence 2. Slide the window across the first sequence, recording the % identity with the second sequence in each window.

• Observe high identity in exons, lower identity in 5' and 3' UTRs

• An additional conserved region could be a regulatory region
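The sliding-window comparison can be sketched as follows, assuming the two orthologous sequences have already been aligned (for example with LAGAN) and are available as equal-length strings with '-' for gaps. The function name and the toy alignment are illustrative only, and window positions are alignment columns rather than positions in the human sequence.

```python
def window_identity(aln1, aln2, window=200, step=1):
    """Percent identity per window along two aligned, equal-length sequences.

    Gap characters ('-') never count as matches. Returns (start, % identity)
    pairs, where start is a 0-based alignment column.
    """
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    match = [int(a == b and a != "-") for a, b in zip(aln1.upper(), aln2.upper())]
    results = []
    for start in range(0, len(match) - window + 1, step):
        identical = sum(match[start:start + window])
        results.append((start, 100.0 * identical / window))
    return results


# Toy alignment (made up for illustration), scanned with a 4-column window.
human = "ACGTACGT--ACGT"
mouse = "ACGTTCGTAAACGA"
for start, pct in window_identity(human, mouse, window=4, step=2):
    print(start, pct)
```

Plotting the identity against the window start gives the kind of profile shown on the slide, with peaks where the two species are conserved.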

Phylogenetic footprinting - Multi-species

[Figure: multi-species comparison; source: http://genome.ucsc.edu]


Phylogenetic footprinting

Dramatic Reduction in Spurious Hits!

[Figure: actin, alpha cardiac; tracks: Human, Mouse]

Phylogenetic Footprinting Tools

Phylogenetic Footprinting Servers

CONTRA (http://www.dmbr.ugent.be/prx/bioit2-public/contrav2/)

FOOTER (http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php)

CONSITE (http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/ or http://consite.genereg.net/)

rVISTA (http://rvista.dcode.org/)

SNPs in TFBS Analysis

RAVEN (http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home)

Prokaryotes or Yeast

PRODORIC (http://prodoric.tu-bs.de/)

YEASTRACT (http://www.yeastract.com/index.php)

Software Packages

MotifLab (http://motiflab.org)

TOUCAN (http://homes.esat.kuleuven.be/~saerts/software/toucan.php)

Programming Tools

TFBS (http://tfbs.genereg.net/)


How to improve?

PhysBinder: an integrative tool based on random forest and flexible inclusion of biophysical properties improves prediction of transcription factor binding sites.

The noncontacted spacer region in the DNA-binding site for the E2 protein encoded by the HPV type 16 genome – role of nucleotides INSIDE of the TF binding motif

Zhang Y et al. PNAS 2004;101:8337-8341

E2 is a homodimer in which each monomer inserts an alpha-helix into the major groove of an "ACCG" sequence (direct readout), holding the DNA in an arc and leaving the spacer NNNN uncontacted by the E2 protein. This spacer is responsible for the indirect readout effect: it is not bound by E2, but its flexibility (if AT-rich) allows the DNA to be bent "around" E2. The higher the bendability of this region, the higher the affinity and the strength of the interaction.


Genomic Regions Flanking E-Box Binding Sites Influence DNA Binding Specificity of bHLH Transcription Factors through DNA Shape – role of nucleotides located OUTSIDE of the TF binding motif

Raluca Gordân, Ning Shen, Iris Dror, Tianyin Zhou, John Horton, Remo Rohs, Martha L. Bulyk. Cell Reports, Volume 3, Issue 4, 2013, 1093–1104.

Modelling transcription factor binding specificity (I)

• Origins of transcription factor specificity:

– Direct Readout (a.k.a. Base Readout): direct interactions between protein amino acids and DNA base pairs in the binding site (hydrogen bonds, electrostatics, …). Captured with varying success by 'classical' (probabilistic) models for TF specificity such as the PWM.

– Indirect Readout (a.k.a. Shape Readout + Flexibility): sequence-dependent conformation and flexibility of DNA. Major and minor groove widths differ between sequences. Some sequences have intrinsic bends or increased flexibility relative to others.

• Open questions:

1. How do specific transcription factors utilize the structural information?

2. How can such information be incorporated in predictive models, especially on a genomic scale?


Modelling transcription factor binding specificity (II)

• Sequence-based methods:

– Model the binding specificity from a collection of aligned sequences known to bind the TF in vitro or in vivo

– Treat DNA as a uniform static structure that is independent of the nucleotide sequence

– The PWM method [Stormo] takes into account only the nucleotide frequency at each position of the TFBS and assumes independence between those positions (additivity).

– Recently, it was shown that for most TFs, dependencies exist between nucleotide positions in their binding sites [Hu; ChIP-Seq].

• Structure-based methods:

– Use information from available crystal structures of TF–DNA complexes [e.g. Angarica]

– Some are valuable for comparative modelling and seem promising for TFBS prediction

– None of the structure-based methods has yet offered substantial improvement over the PWM method!

The PhysBinder Approach

• Sequence-based method

• Uses the random forest (RF) algorithm with features that cover:

– Nucleotide positional dependencies -> NPD model

– Nucleotide sequence-dependent structural characteristics -> structural model

• Combines the NPD model and the structural model and tries to integrate the PWM score in the combined model

• The goal is to find the feature combination(s) that maximize(s) the classification accuracy for each transcription factor binding model individually


Random Forest algorithm

http://www.ualberta.ca/~drr3/random-forest.html

• Methodology

Ensemble of unpruned classification or regression trees (CARTs), each grown on a bootstrap sample of the training data and using random feature selection in the tree induction process.

Disadvantage: the embedded feature selection procedure cannot handle large numbers of irrelevant features. Comprehensive filter feature selection and wrapper-based feature selection are therefore applied before the final model is trained.
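To make the three ingredients named above concrete (bootstrap samples, random feature selection at each split, unpruned trees), here is a deliberately small pure-Python sketch. It is a toy for illustration and not the implementation used by PhysBinder; all function names, the Gini split criterion, the square-root rule for the number of features tried, and the synthetic data are assumptions.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(rows, labels, n_try, rng):
    """Grow an unpruned tree; each split considers only n_try randomly chosen features."""
    if len(set(labels)) == 1:
        return labels[0]                       # pure leaf
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:                           # sampled features are constant: majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], n_try, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], n_try, rng))


def predict_tree(tree, row):
    """Walk from the root to a leaf; internal nodes are (feature, threshold, left, right)."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def train_forest(rows, labels, n_trees=50, seed=0):
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    return forest


def confidence(forest, row):
    """Fraction of trees voting for the positive class (label 1)."""
    return sum(predict_tree(t, row) == 1 for t in forest) / len(forest)


# Synthetic data: feature 0 carries the signal, features 1-3 are noise.
data_rng = random.Random(1)
rows, labels = [], []
for _ in range(60):
    y = data_rng.randint(0, 1)
    rows.append([y + data_rng.gauss(0, 0.3), data_rng.random(),
                 data_rng.random(), data_rng.random()])
    labels.append(y)

forest = train_forest(rows, labels)
print(confidence(forest, [1.0, 0.5, 0.5, 0.5]), confidence(forest, [0.0, 0.5, 0.5, 0.5]))
```

The vote fraction returned by `confidence` plays the role of a per-sequence confidence score. With many irrelevant features, fewer splits land on informative ones, which is the weakness the feature selection steps are meant to address.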

Structural features – Pearson correlation analysis

[Figure: correlation matrix of the structural features. Red: no correlation, Yellow: slight correlation, White: high correlation]

The structural characteristics are correlated to some extent: the feature selection procedures and the RF algorithm decide which features are most relevant for each TF.


Building the Model – (A) Input

Positive sequences (P):

1. ChIP-Seq peak regions (bed), …

2. Extract genomic sequences (fasta)

3. Align sequences using MEME: extend 50 bp upstream and downstream w.r.t. the start of the found motif.

4. Use 100 to build the model; use the remainder for validation.

Negative sequences (N):

Randomly select 1000 background sequences with a length of 100 bp from the genome of interest.
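Drawing the negative set can be sketched as below. The genome is represented as a single string for simplicity, the function name is made up, and skipping windows that contain unknown bases (N) is an assumption that the slide does not state.

```python
import random


def sample_background(genome, n=1000, length=100, seed=0, max_attempts=100000):
    """Draw n random windows of the given length from a genome string."""
    if length > len(genome):
        raise ValueError("window is longer than the genome")
    rng = random.Random(seed)
    seqs = []
    for _ in range(max_attempts):
        if len(seqs) == n:
            break
        start = rng.randrange(len(genome) - length + 1)
        window = genome[start:start + length]
        if "N" not in window.upper():      # assumption: skip windows with unknown bases
            seqs.append(window)
    return seqs


# Toy "genome" of random bases, standing in for the genome of interest.
toy_rng = random.Random(42)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(5000))
negatives = sample_background(toy_genome)
print(len(negatives), len(negatives[0]))
```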

(B) Calculation of the structural and NPD profiles

Each nucleotide sequence, from either class, is converted into multiple series of values; each series provides values for a specific DNA structural characteristic at all positions of the TFBS and its context (structural model), or simply consists of one-base or two-base parts of the sequence (NPD).
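A rough sketch of this conversion follows. The property table is a placeholder with invented numbers and does not correspond to any measured DNA structural characteristic; the function names and the (position, letters) feature keys are likewise assumptions made for illustration.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

# Placeholder property table: these numbers are invented for illustration only
# and are NOT measured values of any real DNA structural characteristic.
TOY_PROPERTY = {d: float(i) for i, d in enumerate(DINUCS)}


def structural_profile(seq, table=TOY_PROPERTY):
    """One value per dinucleotide step: a series for a single structural characteristic."""
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def npd_features(seq):
    """Positional one-base and two-base features, keyed by (position, letters)."""
    feats = {}
    for i, base in enumerate(seq):
        feats[(i, base)] = 1
    for i in range(len(seq) - 1):
        feats[(i, seq[i:i + 2])] = 1
    return feats


site = "ACGTG"
print(structural_profile(site))
print(sorted(npd_features(site)))
```

In a real model there would be one such series per structural characteristic, each computed over the TFBS and its flanking context.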


(C) Filtering out relevant features

• Basic selection of relevant features (i.e. positions) is made by statistical comparison of the distributions of values for positive and negative sequences, with mild thresholds:

– In order not to exclude too many features

– To permit detection of their interactions later on
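A filter step of this kind might look like the sketch below. The slide does not say which statistical test is used, so Welch's t statistic is an assumption here, as are the function names and the cutoff of 1.0 that stands in for a "mild" threshold.

```python
from statistics import mean, variance


def welch_t(pos, neg):
    """Absolute-value-ready Welch t statistic for two samples of one feature."""
    diff = mean(pos) - mean(neg)
    se = (variance(pos) / len(pos) + variance(neg) / len(neg)) ** 0.5
    if se == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / se


def filter_features(pos_rows, neg_rows, min_abs_t=1.0):
    """Keep feature indices whose value distributions differ between the classes.

    A low cutoff keeps the filter mild, so that few features are lost early.
    """
    keep = []
    for f in range(len(pos_rows[0])):
        t = welch_t([r[f] for r in pos_rows], [r[f] for r in neg_rows])
        if abs(t) >= min_abs_t:
            keep.append(f)
    return keep


# Toy values: feature 0 differs between the classes, feature 1 does not.
pos_rows = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3]]
neg_rows = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.3]]
print(filter_features(pos_rows, neg_rows))
```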

(D) Wrapper-based feature selection

Further selection is performed through cross-validation performance evaluation with the RF algorithm. Per characteristic, redundant features are removed by sequential backwards elimination (SBE). Several models with one characteristic might be merged through best incremental ranked subset (BIRS). The final NPD model and final structural model can be merged into one integrative model.
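The backward elimination idea can be sketched generically as follows. The `evaluate` callback stands in for the cross-validated RF performance mentioned above; the toy scoring function and the feature names are hypothetical and only serve to make the example runnable.

```python
def backward_elimination(features, evaluate, tolerance=0.0):
    """Sequential backward elimination.

    Repeatedly drop the feature whose removal hurts the evaluation score least,
    as long as the score does not fall more than `tolerance` below the best seen.
    `evaluate` maps a list of features to a score (higher is better).
    """
    selected = list(features)
    best = evaluate(selected)
    while len(selected) > 1:
        candidates = [(evaluate([f for f in selected if f != drop]), drop)
                      for drop in selected]
        score, drop = max(candidates)
        if score < best - tolerance:
            break
        selected.remove(drop)
        best = max(best, score)
    return selected


# Hypothetical stand-in for cross-validated accuracy: two features help,
# and every feature carries a small cost.
USEFUL = {"pos3_CG": 0.3, "pos5_roll": 0.2}


def toy_score(feats):
    return 0.5 + sum(USEFUL.get(f, 0.0) for f in feats) - 0.01 * len(feats)


print(backward_elimination(["pos1_A", "pos3_CG", "pos5_roll", "pos7_twist"], toy_score))
```

The merging of single-characteristic models through BIRS is not covered by this sketch.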


(E) TFBS prediction

The resulting model can be used by RF to predict the likelihood that a nucleotide sequence is a TFBS, after converting the sequence into series of the features contained in the model.

Evaluation of performance of the classification models

• Based on their prediction scores:

– For both the structural method and the NPD method this is the RF confidence score, which is assigned to each sequence and indicates the certainty with which this sequence is predicted to belong to either the positive or the negative class.

– For PWMs, we used the matrix similarity score.

• Visualized by ROC curves and precision-recall curves.

Each ROC and precision-recall curve shown is derived from a threshold-based average of 20 curves. Data for each of these 20 curves were obtained by training the model with a randomly taken subset of 80% of the data and testing that trained model on the remaining 20%.

Principal component analysis was performed on the full models to select a top five feature set for each TF (default parameters).
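Two building blocks of this evaluation, a random 80/20 split and the points of a ROC curve, can be sketched as below. The function names are illustrative, and the threshold-based averaging over the 20 curves is not shown.

```python
import random


def random_split(items, train_fraction=0.8, rng=random):
    """Shuffle a copy of the items and split it into training and test parts."""
    items = list(items)
    rng.shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def roc_points(pos_scores, neg_scores):
    """(FPR, TPR) pairs obtained by lowering the threshold through all observed scores."""
    points = [(0.0, 0.0)]
    for thr in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        tpr = sum(s >= thr for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= thr for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


# Toy prediction scores for held-out positive and negative sequences.
print(roc_points([0.9, 0.8], [0.7, 0.1]))

# Twenty repetitions of the 80/20 split on toy sequence identifiers.
rng = random.Random(0)
splits = [random_split(range(50), rng=rng) for _ in range(20)]
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```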


Classification model accuracy – HIF1

[Figure: panels A and B]

HIF1   REF_TPR(FPR0.01): 1 +- 0

                                      PWM       NPD       struct    NPD_struct  NPD_struct_PWM
PWM             0.015 +- 0.01423      1
NPD             0.00364 +- 0.00619    4.37E-03  1
struct          0.00909 +- 0.01216    1.44E-01  1.74E-01  1
NPD_struct      0.00682 +- 0.00715    8.10E-02  1.23E-01  9.07E-01  1
NPD_struct_PWM  0.00182 +- 0.00373    5.40E-04  4.00E-01  3.83E-02  0.0144      1

Classification models applied: PWM (black), NPD (green), struct (blue), NPD_struct (purple), NPD_struct_PWM (orange)

Classification model accuracy – p53

Classification models applied: PWM (black), NPD (green), struct (blue), NPD_struct (purple), NPD_struct_PWM (orange)

[Figure: panels A and B]

P53   REF_TPR(FPR0.01): 0.976 +- 0.04

                                      PWM       NPD       struct    NPD_struct  NPD_struct_PWM
PWM             0.021 +- 0.0097       1
NPD             0.0391 +- 0.0113      1.90E-05  1
struct          0.075 +- 0.02072      8.97E-08  2.27E-06  1
NPD_struct      0.021 +- 0.0117       8.79E-01  7.48E-05  1.04E-07  1
NPD_struct_PWM  0.00920 +- 0.00755    3.60E-04  7.71E-08  6.05E-08  1.11E-03    1


Classification model accuracy – SP1

Classification models applied: PWM (black), NPD (green), struct (blue), NPD_struct (purple), NPD_struct_PWM (orange)

[Figure: panels A and B]

SP1   REF_TPR(FPR0.01): 1 +- 0

                                      PWM       NPD       struct    NPD_struct  NPD_struct_PWM
PWM             0.00977 +- 0.00371    1
NPD             0.00802 +- 0.00411    2.05E-01  1
struct          0.00329 +- 0.00270    3.36E-06  4.34E-04  1
NPD_struct      0.00288 +- 0.00279    2.75E-06  1.52E-04  5.77E-01  1
NPD_struct_PWM  0.00422 +- 0.00295    1.31E-05  3.83E-03  3.06E-01  1.58E-01    1

Classification model accuracy – STAT1

Classification models applied: PWM (black), NPD (green), struct (blue), NPD_struct (purple), NPD_struct_PWM (orange)

[Figure: panels A and B]

STAT1   REF_TPR(FPR0.01): 0.957 +- 0.026

                                      PWM       NPD       struct    NPD_struct  NPD_struct_PWM
PWM             0.0176 +- 0.00773     1
NPD             0.00993 +- 0.0047     1.30E-03  1
struct          0.01447 +- 0.00506    1.80E-01  5.85E-03  1
NPD_struct      0.00658 +- 0.0041     1.24E-05  2.94E-02  2.74E-05  1
NPD_struct_PWM  0.00885 +- 0.00460    2.98E-04  4.77E-01  9.60E-04  1.35E-01    1


Classification model accuracy – TBP

Classification models applied: PWM (black), NPD (green), struct (blue), NPD_struct (purple), NPD_struct_PWM (orange)

[Figure: panels A and B]

TBP   REF_TPR(FPR0.01): 0.95 +- 0.064

                                      PWM       NPD       struct    NPD_struct  NPD_struct_PWM
PWM             0.03835 +- 0.01678    1
NPD             0.00938 +- 0.00591    6.24E-07  1
struct          0.01449 +- 0.00677    9.36E-06  2.09E-02  1
NPD_struct      0.00910 +- 0.00675    4.65E-07  7.09E-01  1.07E-02  1
NPD_struct_PWM  0.00994 +- 0.00579    7.95E-07  7.02E-01  3.99E-02  4.07E-01    1

Visualization of our integrative model for SP1.

• Both the structural model and the NPD model include features at positions that precede the actual TFBS.

• The background genomic sequence in which SP1 binding sites are embedded is very similar to the consensus sequence of such sites.

• A PWM would thus predict many TFBSs, whereas the NPD model and structural model can look beyond position-independent nucleotide frequencies, each in its own way.


Biological relevance of the top 5 selected features (PCA)

'HIF1': 3 out of the 5 top features are dinucleotide features. Together, the dinucleotides build the pattern 5′-CGTG-3′, known as the hypoxia-response element (HRE). This is the most important determinant of HIF1 binding and is fully conserved in every HIF1 binding site.

'P53': the majority of the top features deal with the DNA conformation and the tendency towards the A- or B-DNA conformation. A shift to a non-standard B-DNA conformation can drastically alter the binding capacity of P53 and is likely responsible for the specific binding to the wide variety of P53 motifs.

'SP1' distorts the B-structure of the DNA toward a more A-DNA-like structure. Two global features of the SP1 model confirm the importance of DNA conformational features. The two CC dinucleotide features are an indication of the cytosine enrichment in the canonical SP1 recognition element (CCCGCC).

'STAT1' shows a strong preference for sequences containing two palindromic half-sites (TTC…GAA), leading to a dyad symmetry. The dinucleotide features for AA, TT, GA and TC are included; together they build TTTC…GAAA, the most specific variant of all STAT1 binding motifs (Ehret et al.).

'TBP' is one of the best-known DNA benders, and it was shown that the unbound TATA box is already pre-bent. Four of the top five features describe DNA bending properties, confirming the tendency of TBP to bend the DNA.

Precision-Recall curves: influence of the number of background sequences

[Figure: panels for HIF1, p53, SP1, STAT1 and TBP. Classification models applied: PWM (black), RF (green)]


Conclusions

1. Inherent structural properties of DNA are involved in specific recognition by TFs to an extent that depends on each TF

2. A purely structural model often performs worse than the NPD model, which describes both base readout and a large portion of shape readout.

– The relative importance of the simpler NPD characteristic consequently cannot be ignored when analyzing TFBS binding patterns in the eukaryotic models.

3. Structural properties contain information other than the nucleotide sequence and can be used to further improve classification accuracy.

4. The PWM score, which merely represents base readout in its simplest form, is sometimes complementary to the model combining the structural model and the NPD model.

5. Most importantly, we present an integrative approach that can easily combine two or three different approaches to establish the best possible prediction of TFBSs.

PhysBinder - Implementation

PhysBinder is a novel online tool that is based on a flexible and extensible algorithm for the prediction of TFBSs.

Broos S, Soete A, Hooghe B, Moran R, van Roy F, De Bleser P. PhysBinder: improving the prediction of transcription factor binding sites by flexible inclusion of biophysical properties. Nucleic Acids Res. 2013 Apr 24. [Epub ahead of print] PubMed PMID: 23620286.

Hooghe B, Broos S, van Roy F, De Bleser P. A flexible integrative approach based on random forest improves prediction of transcription factor binding sites. Nucleic Acids Res. 2012 Aug;40(14):e106. doi: 10.1093/nar/gks283. Epub 2012 Apr 5. PubMed PMID: 22492513; PubMed Central PMCID: PMC3413102

http://bioit.dmbr.ugent.be/physbinder/index.php


PhysBinder - Validation

ZEB Transcription Factor Switching in Melanoma Cells Results in Mitf Loss, and is Associated with Melanoma Progression: Is Mitf a target gene for Zeb1?

Downregulation of MITF promoter activity in the B16 melanoma cell line upon overexpression of Zeb1. An empty vector was used as a control (CTRL).

Identification of highly conserved ZEB1 binding sites in the MITF 5' region using PhysBinder.


Input

[Screenshots of the input page, callouts 1 and 2]


Input

Genomic regions can be fetched from a variety of species and associated genome assemblies.

Input

[Screenshot of the input page, callouts 1 and 2]

You can enter multiple locations at once, one per line…


Input

[Screenshot of the input page, callout 3]


Input - Summary

Sequences can be uploaded by one of the following means:

1. pasting a set of FASTA-formatted sequences in the input field

2. indicating genomic regions in the 'Fetch genomic regions' text field

3. uploading a file with FASTA-formatted sequences

Select a threshold

Precalculated thresholds ("Max. Precision", "Average", "Max. F-Measure") are calculated on an external control set. Alternatively, one can enter a custom threshold. Scores and thresholds range from 1 to 1000. The score is calculated by the Random Forest algorithm and indicates the confidence the algorithm has in the result. We calculated thresholds to decide which results are valid in three ways:

1. The Max. Precision threshold guarantees a minimal number of false positive predictions, while assuring that the positive predictions are of top quality.

2. The Max. F-Measure threshold tries to balance precision and recall (% of identified true positives). The F-Measure combines precision and recall into one value (their harmonic mean in the case of F1), reaching its best value at 1 and its worst value at 0.

3. The Average threshold is the average of the Max. Precision threshold and the Max. F-Measure threshold. It is a good starting point if you have no idea about the most suitable threshold.

Users who want to use their own custom score should take a look at the ROC curves on the models page to get an idea of what to expect from each score.
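How such thresholds could be derived from a control set is sketched below. This is an assumption-laden illustration and not PhysBinder's actual procedure: the function names are made up, the scores are toy values on the 1 to 1000 scale, ties are resolved by taking the lowest qualifying threshold, and the "Average" threshold is read as the mean of the other two. The keys reuse the "ppv", "f1" and "avg" shorthand accepted by the tool's threshold field.

```python
def precision_recall_f1(pos_scores, neg_scores, threshold):
    """Precision, recall and F1 when everything scoring >= threshold is called a hit."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(pos_scores)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pick_thresholds(pos_scores, neg_scores):
    """Scan all thresholds from 1 to 1000 and pick the three precalculated ones."""
    grid = range(1, 1001)
    stats = {t: precision_recall_f1(pos_scores, neg_scores, t) for t in grid}
    max_f = max(grid, key=lambda t: stats[t][2])
    # Highest precision; among equally precise thresholds prefer higher recall.
    max_p = max(grid, key=lambda t: (stats[t][0], stats[t][1]))
    return {"ppv": max_p, "f1": max_f, "avg": (max_p + max_f) / 2}


# Toy control set: scores of known sites and of background sequences.
known_sites = [900, 800, 650, 400]
background = [700, 300, 200, 100]
print(pick_thresholds(known_sites, background))
```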


Select a threshold

Just select one of the three precalculated thresholds ("Max. Precision", "Average", "Max. F-Measure") or set a custom threshold (enter a number between 1 and 1000). The default threshold is "Average".

Select a transcription factor binding site model

Currently we offer more than 60 different vertebrate TFBS models. The number of models will continue to grow in the future. All models are built from available ENCODE ChIP-Seq experiments and other sources (e.g. http://www.ncbi.nlm.nih.gov/geo/). It is possible to filter models by species name or by model type, or to use a custom search term.

Model type:

1. DE (Direct Evidence): models built from experimental data that clearly contain a consensus motif that has been reported in literature.

2. PAF (Possibly Associated Factor): models built from ChIP-Seq data that clearly contain a sensible motif that has NOT been associated with the transcription factor yet. These are often factors associated with the transcription factor. It is also possible that the model represents the actual transcription factor with a consensus sequence that is not yet known.


Select a transcription factor binding site model

You can select models in the models window. We use color codes to indicate the type of model: green means DE; yellow stands for PAF; human is blue and mouse is grey.

Filter for species name. Currently we have human and mouse models.

Select a transcription factor binding site model


Select a transcription factor binding site model

Filtering for evidence type. Choose "DE" or "PAF".

Select a transcription factor binding site model

Enter a search term in the input field to search for a certain transcription factor. It is possible to search for aliases of the transcription factor. Just make sure you select the "Include Aliases in Search Terms" checkbox.

If you click on the grey triangle at the bottom of a model icon, all aliases will appear.


Optional arguments

Additional options that need an extra explanation.

- Email address: It is possible to provide an email address but this is not required. If an email address is provided, an email will be sent when the calculations are finished. If the calculations result in an error, you will get an error report. We will not use your email address for any purpose other than informing you about your calculations.

- Use as filter: For performance reasons, it is possible to pre-filter the sequences using a short PWM with mild thresholds in order to get a maximum recall. This will really increase the speed! In order to limit the load on our servers, we decided to enable this option by default. Unless you have a specific reason not to use this filter step, it is best to keep it turned on.

Output - Summary

By default, the summary section is hidden. When the user clicks on the green arrow, a table with some statistics about the results is shown. The summary section on the results page gives an indication of the number of hits that exceed the chosen threshold for each model. The results can be sorted according to the model or according to the input sequence.

By default, the summary table is ordered by model and lists the number of hits per sequence. You can also order the table by sequence ID. Each sequence link is clickable.


Output – Change thresholds

Thresholds can be changed by entering a new value in the "Change thresholds" field. You can enter any value between 1 and 1000. You can also type "avg", "ppv", or "f1" to get, respectively, the average threshold, the maximum-precision threshold, or the maximum F-measure threshold for each model.

Click on the green arrow to the right of the "Set Threshold" line and enter a value in the threshold field.
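To make the "ppv" and "f1" keywords concrete: given a set of scored sites with known labels, every candidate threshold has a precision (PPV) and an F-measure, and one can pick the threshold that maximizes either. The Python sketch below shows that idea on made-up scores and labels. It is only an illustration of the two criteria; how PhysBinder derives its per-model thresholds is not described on these slides.

```python
def best_thresholds(scored):
    """scored: list of (score, is_true_site) pairs.

    Returns the threshold maximizing precision ('ppv') and the one
    maximizing the F1 measure ('f1'). Ties keep the lowest threshold.
    """
    total_pos = sum(1 for _, label in scored if label)
    best = {"ppv": (None, -1.0), "f1": (None, -1.0)}
    for t in sorted({score for score, _ in scored}):
        # Everything scoring at least t is called a hit.
        called = [label for score, label in scored if score >= t]
        tp = sum(called)
        precision = tp / len(called)
        recall = tp / total_pos if total_pos else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        if precision > best["ppv"][1]:
            best["ppv"] = (t, precision)
        if f1 > best["f1"][1]:
            best["f1"] = (t, f1)
    return {name: value[0] for name, value in best.items()}


# Invented example data on the 1 to 1000 score scale.
example = [(950, True), (900, True), (800, False),
           (700, True), (600, False), (400, False)]
print(best_thresholds(example))
```

On this toy data the maximum-precision threshold is stricter than the maximum-F1 threshold, which is the usual pattern: "ppv" favors fewer, more reliable hits, while "f1" balances precision against recall.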

Output – Detailed results

The results are visualized in this section. Per sequence, hits that exceed the threshold are indicated with a colored bar. The bar is shaded from a light color (low score) to a dark color (high score). An arrow indicates the orientation of the binding site (forward or reverse). More information is displayed by clicking on the arrow. Binding sites of models can be dynamically shown or hidden by clicking on the corresponding checkboxes. Nucleotides in a gray font were not scanned due to model limits. Repeats from RepeatMasker and Tandem Repeats Finder are shown in lower case; non-repeating sequence is in upper case (as used in the UCSC Genome Browser for downloadable genome data).
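Because repeats are encoded purely by letter case, it is easy to check programmatically whether a predicted site falls inside a repeat. The Python sketch below is one way to do that for a soft-masked sequence; the function names and the example sequence are invented for illustration.

```python
import re


def repeat_intervals(seq):
    """Return 0-based, half-open (start, end) spans of lower-case bases."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


def hit_in_repeat(hit_start, hit_end, seq):
    """True if the half-open hit [hit_start, hit_end) overlaps any repeat."""
    return any(
        start < hit_end and hit_start < end
        for start, end in repeat_intervals(seq)
    )


masked = "ACGTacgtacGGCCttAA"  # invented soft-masked sequence
print(repeat_intervals(masked))
print(hit_in_repeat(8, 12, masked))
```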


Output – Detailed results

Show/hide hits of a model by clicking on the checkboxes above each sequence. Toggling this checkbox will only show/hide the hits on this sequence. If no hits are found for a particular model, no checkbox for this model is shown.

Toggling checkboxes on the side of the screen will affect all sequences.

Output – Working with genomic regions.

It is possible to map the different sequences to a human or mouse reference genome. This can be done by clicking on the "blat" button below each sequence. If the sequences were fetched from UCSC on the input page, this is not necessary because the sequence location is already known. If the sequence location is known, either from the input page or by blatting the sequence, some extra options are available:

1. The sequence can be visualized in the UCSC Genome Browser. In order to do this, click on the "Map to UCSC" button.
2. A BED file with the genomic regions is available for download. Just click on the "Get BED file" button.
3. ENCODE data available for the region can be integrated into the results. Click on the "show ENCODE TFBS" button to fetch all ENCODE regions for this sequence.
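For readers unfamiliar with BED: it is a tab-delimited format whose first six columns are chromosome, 0-based start, exclusive end, name, score (0 to 1000) and strand. Once the genomic position of the query sequence is known, a hit at a given offset in the query translates directly into a BED line. The Python sketch below illustrates that conversion. The hit tuples and coordinates are invented, and the columns in PhysBinder's actual BED download may differ.

```python
def hits_to_bed(chrom, region_start, hits):
    """Convert hits on a query sequence into six-column BED lines.

    region_start: 0-based genomic start of the query sequence.
    hits: iterable of (offset, length, model, score, strand), where
          offset is the 0-based position of the hit within the query.
    """
    lines = []
    for offset, length, model, score, strand in hits:
        start = region_start + offset   # BED starts are 0-based
        end = start + length            # BED ends are exclusive
        lines.append(
            "\t".join([chrom, str(start), str(end), model, str(score), strand])
        )
    return "\n".join(lines)


# Invented example: two hits on a query located at chr1:1000 (0-based).
print(hits_to_bed("chr1", 1000, [(12, 10, "SP1", 870, "+"),
                                 (40, 6, "MYC", 655, "-")]))
```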


If the sequence originates from the human or mouse genome, blat can be used to look for the genomic region.

Select the correct reference genome for your sequence. If your sequences are human, select hg19. If the sequences come from mouse, select mm10. Press the Blat button to continue. It can take a while to process the results.

Output – Working with genomic regions.


When blat is executed, some other options become available. Select the blat result with the highest overlap from the drop-down list (indicated in %). It is now possible to map your sequences to UCSC, to download a BED file and to integrate ENCODE data.

Clicking the "map to UCSC" button will visualize all TFBS hits in the UCSC Genome Browser.


Output – Working with genomic regions.

Binding sites are visualized as a custom track in the genome browser. The original query sequence is shown as a black bar.

Downloading the results is done by clicking the "get BED file" button.

Output – Working with ENCODE data

The human ENCODE TFBS ChIP-seq data sets are integrated into the PhysBinder web tool. ENCODE data can be added to the results by clicking on the "show ENCODE TFBS" button. The ENCODE data are then shown as grey bars alongside the PhysBinder predictions, so that the genomic context of the predictions becomes immediately clear. The different ENCODE tracks can be toggled on or off using the ENCODE checkboxes.


Output – Working with ENCODE data

After you click on the "show ENCODE TFBS" button, overlapping TFBS ChIP-seq sets are indicated on the sequence as grey bars.

When no ENCODE tracks are found in the sequence, a message is displayed below the sequence.

Output – Working with ENCODE data

If ENCODE tracks for this genomic region are found, a list of checkboxes will appear next to the sequence. The ENCODE tracks are visualized as simple grey bars below the sequence. In regions with many ENCODE tracks, the grey bars will be stacked and the area becomes darker.
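The darkening of stacked bars corresponds to a per-position coverage depth: the number of ENCODE regions that overlap each nucleotide of the sequence. The Python sketch below computes that depth for a list of intervals. The interval values are invented, and this is only an illustration of the idea, not the rendering code used by the web tool.

```python
def coverage_depth(seq_length, tracks):
    """Count, for each position, how many regions cover it.

    tracks: list of 0-based, half-open (start, end) intervals relative
            to the sequence; parts outside the sequence are clipped.
    """
    depth = [0] * seq_length
    for start, end in tracks:
        for pos in range(max(start, 0), min(end, seq_length)):
            depth[pos] += 1
    return depth


# Invented example: three overlapping regions on a 10-nt sequence.
print(coverage_depth(10, [(0, 5), (3, 8), (4, 12)]))
```

Positions with the highest depth are the ones that appear darkest in the display.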


Output – Working with ENCODE data

The different ENCODE tracks can be switched on or off by clicking the corresponding checkboxes. You can switch on or off multiple tracks by clicking the "select all" or "deselect all" option below the checkboxes. Do not forget to press the update button!

Output – Create publication graphics

For each sequence in the results section we offer a download of the FASTA-file and feature-color-file with the binding sites. Both files can be used in Jalview to create custom visualizations of the results.

Click on the download links for the FASTA file and the feature color file to save them locally. This way you can create graphics in Jalview (http://www.jalview.org/).
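For orientation, a Jalview features file is a tab-delimited text file: it starts with lines that assign a color to each feature type, followed by one line per feature giving description, sequence ID, sequence index (-1 when the ID is used), start, end and feature type, with 1-based inclusive coordinates. The Python sketch below writes such a file under the assumption that the feature color file follows this standard layout. The exact contents of PhysBinder's file are not shown here, and the sequence ID, coordinates and colors are invented.

```python
def jalview_features(seq_id, sites, colours):
    """Build the text of a Jalview features file.

    sites:   list of (feature_type, start, end), 1-based inclusive.
    colours: dict mapping feature_type to a hex RRGGBB color string.
    """
    # Color definitions come first: featureType <TAB> colour
    lines = [f"{ftype}\t{colour}" for ftype, colour in colours.items()]
    for ftype, start, end in sites:
        # description, sequenceId, sequenceIndex (-1 = use the ID),
        # start, end, featureType
        lines.append(
            "\t".join([f"{ftype} site", seq_id, "-1", str(start), str(end), ftype])
        )
    return "\n".join(lines) + "\n"


# Invented example: one SP1 site on a sequence named "promoter".
print(jalview_features("promoter", [("SP1", 5, 14)], {"SP1": "ff0000"}))
```

The sequence ID in the features file has to match the ID in the FASTA header for Jalview to attach the features to the right sequence.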


Example

Kyo et al. identified a core promoter of 181 bp responsible for the transcriptional activity of the TERT gene, which encodes the catalytic subunit of telomerase. This 181-bp region, consisting of the 5’-UTR and the upstream promoter region, contains two E-boxes bound by MYC in vivo. Between these E-boxes, Kyo et al. discovered and validated five GC-boxes that are bound by SP1.

Kyo S et al. Nucl. Acids Res. 2000;28:669-677

Sequences of the hTERT core promoter and consensus motifs for factor binding sites.

All predicted TFBS match the experimentally determined locations reported by Kyo et al.


Example – Visualization of the models in the UCSC Genome Browser

FAQ

Q.: I need a genome-wide list of genes possibly targeted by transcription factor X, transcription factor Y,…

A1.: Use PhysBinder-CLI* (recommended, but it depends on the availability of good ChIP-Seq data to build a model)

“Identification of human and mouse target genes for GATA1, GATA2 and, if possible, TAL1” => We make models on request*. Just point us to a good ChIP-Seq data set.

A2.: Use PWM-based methods (faster, but mind the futility theorem)

*Please contact us: [email protected]


Exercise

• Using PhysBinder, explore the vicinity of the TAL1 (hg19) TSS and promoter for GATA1 and GATA2 transcription factor binding sites.

• Map the results to the UCSC genome browser and add tracks for transcription factor binding data and histone marks that are often found near active regulatory elements.

Exercise - Solution


You can download the FASTA file and feature color file of each result. These files can be used to make publication graphics in Jalview.

Output – Create publication graphics

This is what the FASTA file looks like:


This is what the feature color file looks like:

In Jalview, click File > Input Alignment > From File.

Select the FASTA file and click Open.


By default the sequence will be visualized on one line. You can change this behaviour by clicking “Wrap” in the “Format” menu.

Sometimes annotations are shown in between the sequence lines. Turn this on or off with the “Show Annotations” option in the “View” menu.


Load the predicted binding sites by going to the “File” menu and clicking the “Load Features / Annotations” option.


Select the feature color file and click Open.

The binding sites will be shown on top of the sequences.


Change the settings of the different features by going to the “Feature Settings…” option in the “View” menu.

Here you can change the names and colors of the different binding sites or show/hide sites.


Create an image for use in publications or other purposes by going to File > Export Image > PNG or other format.

