Introduction Prediction Validation Analysis Summary
Identification and analysis of functional transcription factor binding sites
Troy W. Whitfield and Zhiping Weng
July 20, 2010
Whitfield and Weng TFBS Function
Outline
1 Outline
2 Introduction
3 TFBS Prediction
4 Validation of TFBS
5 Further analysis
6 Summary
7 Acknowledgements
Introduction
Identify and functionally annotate transcription factor binding sites (TFBS) at base pair resolution.
Predict TF binding sites.
Mutate informative bases within the TFBS.
Measure the effect on promoter activity.
ChIP sequencing peaks
[Plot: P(length), from 0 to 0.004, versus DNA fragment length (bp), from 1 to 10000; series: K562 MACS.]
Figure 1: Distribution of ChIP-seq DNA fragment lengths for the GATA1 transcription factor. ENCODE consortium data were reported by the Yale/UC Davis/Harvard team in K562 cells. Peak calling was done using MACS [Zhang et al., 2008]. ChIP-seq peaks are much larger than TFBS footprint(s).
Assessing PWM predictiveness: GABP
[Plot: true positive rate versus false positive rate, both from 0 to 1; legend: zlab_bidirect_GABP, GABP.M00341, MGGAAGTG_GABP.M1028.]
Figure 2: ROC curves for the GABP transcription factor using existing motifs. ChIP-seq data were reported by the lab of Richard Myers (HudsonAlpha Institute).
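As a rough illustration of how a ROC curve of this kind can be built, the sketch below sweeps a score threshold over two sets of PWM scores. It assumes each ChIP-seq peak and each control sequence has already been reduced to a single best-match score; the function names and the score values are hypothetical and are not taken from the slides.

```python
def roc_curve(chip_scores, control_scores):
    """Return (false positive rate, true positive rate) points for a threshold sweep."""
    # Sweep thresholds from the highest observed score to the lowest.
    thresholds = sorted(set(chip_scores) | set(control_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in chip_scores) / len(chip_scores)
        fpr = sum(s >= t for s in control_scores) / len(control_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical best-match scores (arbitrary units), for illustration only.
chip = [8.1, 6.4, 7.7, 3.2, 9.0]
control = [1.5, 4.0, 2.2, 3.9, 0.7]
roc = roc_curve(chip, control)
print(round(auc(roc), 3))
```

A PWM whose curve lies closer to the top-left corner (larger area under the curve) separates ChIP-seq peaks from controls more effectively.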
Identifying TF binding sites
[Plot: P(score), from 0 to 0.2, versus score (a.u.), from -10 to 20; legend: ChIP hits, Control.]
Figure 3: Score distributions for ChIP-seq peaks (called using MACS [Zhang et al., 2008]) and control fragments. ChIP-seq data for the GABP transcription factor were reported by the lab of Richard Myers (HudsonAlpha Institute).
Identifying TF binding sites
The PWMs that best accounted for high-throughput ChIP data were used to identify TF binding sites.
The statistical significance of a computed score, S, for a putative TF binding site was calculated as $p(S) = \int_{S}^{\infty} P_c(S')\,dS'$, where $P_c(S)$ is the probability distribution of scores for a set of control sequences. Predicted TF binding sites with small p(S) were experimentally tested.
PWM discovery and refinement will enhance our ability to accurately identify TF binding sites.
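The tail integral p(S) can be estimated directly from a finite set of control scores. The sketch below is a minimal version under the assumption that the control distribution is represented by a plain list of scores; the function name and the values are hypothetical.

```python
import bisect


def empirical_p_value(score, control_scores):
    """Estimate p(S) as the fraction of control scores that are >= score."""
    ordered = sorted(control_scores)
    # bisect_left gives the index of the first control score >= score.
    first_at_or_above = bisect.bisect_left(ordered, score)
    return (len(ordered) - first_at_or_above) / len(ordered)


# Hypothetical control scores (arbitrary units), for illustration only.
control = [-4.2, -1.0, 0.3, 0.9, 1.7, 2.5, 3.1, 4.8, 6.0, 7.4]
for s in (2.0, 6.5, 9.0):
    print(s, empirical_p_value(s, control))
```

With a finite control set the estimate reaches exactly zero for scores above every control score, so a pseudocount or a much larger control set may be preferable when very small p-values matter.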
Experimental validation
Mutations were made to as many as 5 bases in each TFBS, chosen because they caused the greatest reduction in the computed score.
Luciferase reporter assays were carried out using transient transfection of promoter constructs.
Measurements were made using a total of 9 replicates per construct and were analyzed using a mixed-effects model.
From our most recent sets of ∼500 TFBS predictions, ∼350, or 70%, were experimentally validated in each of 4 human cell lines: K562, HepG2, HT-1080 and HCT-116.
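One way to choose the mutated bases can be sketched as follows, assuming an additive log-odds PWM so that each position contributes independently to the score. The toy 4-column matrix, the function names and the example site are all hypothetical; the slides do not specify the actual scoring scheme.

```python
def score_site(site, pwm):
    """Additive PWM score: sum of the per-position weights of the observed bases."""
    return sum(column[base] for base, column in zip(site, pwm))


def design_mutant(site, pwm, max_mutations=5):
    """Mutate up to max_mutations positions, choosing those with the largest score drop."""
    drops = []
    for i, (base, column) in enumerate(zip(site, pwm)):
        worst = min(column, key=column.get)  # lowest-scoring base at this position
        drops.append((column[base] - column[worst], i, worst))
    drops.sort(key=lambda d: -d[0])  # largest reduction first
    mutant = list(site)
    for drop, i, worst in drops[:max_mutations]:
        if drop > 0:  # skip positions that already hold the worst base
            mutant[i] = worst
    return "".join(mutant)


# Hypothetical log-odds weights for a 4-bp motif, for illustration only.
pwm = [
    {"A": 1.2, "C": -0.8, "G": -1.5, "T": -0.3},
    {"A": -2.0, "C": 0.1, "G": 1.6, "T": -0.9},
    {"A": -1.1, "C": -0.4, "G": 1.4, "T": -1.8},
    {"A": 1.0, "C": -0.2, "G": -0.6, "T": -0.5},
]
wild_type = "AGGA"
mutant = design_mutant(wild_type, pwm, max_mutations=2)
print(wild_type, score_site(wild_type, pwm))
print(mutant, score_site(mutant, pwm))
```

Because the score is additive in this sketch, picking the positions with the largest individual drops also gives the largest total reduction for a fixed number of mutations.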
Experimental validation
Table 1: Summary of functional tests of 650 predicted TF binding sites in K562 cells. Approximately 1/3 of functionally validated TF binding sites were shown to repress transcription.

Transcription factor   No. TFBS val./pred.   TFBS act./rep.
ATF3                   3/5                   1/2
ATF6                   5/8                   4/1
CTCF                   115/171               73/42
GABP                   20/28                 18/2
GATA1                  4/4                   4/0
GATA2                  59/82                 37/22
JunD                   2/3                   2/0
MAX                    2/3                   1/1
STAT1                  34/48                 27/7
STAT2                  18/23                 14/4
USF1                   2/2                   2/0
YY1                    89/102                49/40
Various other          68/154                47/21
Total                  431/650               283/148
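The headline fractions follow from the Total row of Table 1. The short calculation below uses only those four reported totals; the variable names are illustrative.

```python
# Totals row of Table 1 (K562 cells), as reported on the slide.
validated, predicted = 431, 650
activating, repressing = 283, 148

validation_rate = validated / predicted
repressor_fraction = repressing / validated

print(f"validated: {validation_rate:.1%}")
print(f"repressing among validated: {repressor_fraction:.1%}")
```

The repressing fraction comes out close to the 1/3 quoted in the table caption.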
MAX
Validated 2 MAX binding sites, comprising 1 repressor and 1 activator.
[Box plots of luminosity for wild-type (WT) and mutant (MT) constructs; panels: C16orf35, PPP2R4.]
Figure 4: Box plots for validated (p < 0.05) MAX binding sites. Gene annotations appear above the boxes.
GABP
Validated 20 GABP binding sites, comprising 2 repressors and 18 activators.
[Box plots of luminosity for wild-type (WT) and mutant (MT) constructs; panels: LILRB1, ZNF687, PISD, AVPR2, PSMB4, BUD13, CHPF, ERGIC3, C21orf59, GART, FLJ46020, LOC168850.]
Figure 5: Box plots for validated (p < 0.05) GABP binding sites. Gene annotations appear above the boxes.
GABP
[Box plots of luminosity for wild-type (WT) and mutant (MT) constructs; panels: CLDN12, MEN1, ZNF259, HYPK, LENG1, C20orf44, NFS1, SYNJ1.]
Figure 6: Additional box plots for validated (p < 0.05) GABP binding sites. Gene annotations appear above the boxes.
Further analysis
Can conservation distinguish functionally validated from non-validated TF binding sites?
Among functionally validated TF binding sites, can conservation be used to distinguish sites activating transcription from those that repress transcription?
What other genomic (e.g. distance from TSS, nearby enrichment of binding sites for other TFs) or epigenomic features (e.g. histone modifications) correlate with the function of TF binding sites?
How specific are functional TF binding sites to the cell lines in which they are experimentally validated?
Conservation in the binding sites of various TFs
Within the predicted binding sites of a given TF, it is difficult to distinguish experimentally validated from non-validated TFBS predictions.
[Plot: log10(p-value PhastCons) versus log10(p-value phyloP), both axes from 0 to -1.2; legend: Vertebrates, Primate, Mammal.]
Figure 7: Assessing and comparing the ability of PhastCons and PhyloP scores to distinguish experimentally validated from non-validated TFBS predictions in the following TFs: CTCF, E2F4, GABP, GATA2, STAT1, STAT2 and YY1.
Conservation in the binding sites of various TFs
Among the predicted binding sites of a given TF, conservation appears to be higher among experimentally validated than non-validated sites, although the difference is not statistically significant, owing to the small number of sites.
[Box plots of phyloP score (unitless), from -1 to 4, for non-validated (N) and validated (V) sites; panels: CTCF, E2F4, GABP, GATA2, STAT1, STAT2, YY1.]
Figure 8: Box plots of PhyloP scores for validated (p < 0.05) and non-validated binding sites in several transcription factors.
Genomic conservation in TF binding sites
Over a larger number of diverse TF binding sites, however, the enhanced genomic conservation in experimentally validated versus non-validated binding sites is highly statistically significant.
For example, aggregating the seven transcription factors displayed on the previous slide, a Kolmogorov-Smirnov test between the conservation scores of validated and non-validated TFBS predictions gives p < 0.01, even for conservation among primates.
The power of genomic conservation to distinguish validated from non-validated TFBS predictions is even greater when all 80 transcription factors for which TFBS predictions were made are considered.
Repressing and activating TF binding sites are generally equally conserved.
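The comparison described above can be sketched with a two-sample Kolmogorov-Smirnov test. The version below is a self-contained illustration that uses the standard asymptotic approximation for the p-value; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used. The conservation scores are hypothetical and are not data from the slides.

```python
import math


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic two-sided p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk both sorted samples, tracking the gap between the empirical CDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Asymptotic Kolmogorov tail with a small-sample correction (an approximation).
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return d, 1.0  # the series converges poorly here and the tail is ~1
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical per-site conservation scores, for illustration only.
validated = [1.9, 2.4, 0.8, 3.1, 2.2, 1.5, 2.8, 0.9]
non_validated = [0.2, 1.1, -0.4, 0.7, 1.6, 0.1, 0.5, -0.2]
d, p = ks_two_sample(validated, non_validated)
print(f"D = {d:.3f}, p = {p:.4f}")
```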
TF binding sites in relation to the transcription start site
[Plot: P_|N|, from 0 to 0.005, versus |N|/bp, from 0 to 1400; legend: Non-validated, Functionally validated. Inset: P_M, from 0 to 1, versus M/bp, from 0 to 1200.]
Figure 9: Distinguishing between validated and non-validated TF binding sites from transient transfection assays. Here, $P_{|N|} = P_{-N} + P_{N}$ is the probability of finding a validated TFBS at a distance of |N| base pairs from the transcription start site. Plotted in the inset is the cumulative probability, $P_M = \sum_{N=0}^{M} P_{|N|}$. The two distributions can be distinguished with $p < 1.5 \times 10^{-3}$ using a Kolmogorov-Smirnov test: validated TF binding sites tend to be closer to the TSS.
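The two quantities defined in the caption can be computed from signed TFBS positions relative to the TSS. The sketch below assumes one integer offset per binding site, with negative values upstream; the function names and positions are hypothetical.

```python
from collections import Counter


def distance_distribution(positions, max_distance):
    """Return P_|N| for N = 0..max_distance from signed TSS-relative positions."""
    # Folding with abs() combines the contributions at -N and +N.
    counts = Counter(abs(p) for p in positions)
    total = len(positions)
    return [counts.get(n, 0) / total for n in range(max_distance + 1)]


def cumulative(p_abs):
    """Return P_M, the running sum of P_|N| for N = 0..M, for every M."""
    out, running = [], 0.0
    for p in p_abs:
        running += p
        out.append(running)
    return out


# Hypothetical signed offsets (bp) of binding sites from the TSS.
positions = [-3, 3, -1, 0, 2, -2, 5, -3]
p_abs = distance_distribution(positions, max_distance=5)
p_cum = cumulative(p_abs)
print(p_abs)
print(p_cum)
```

Computing these curves separately for the two groups of sites gives the pair of distributions that the Kolmogorov-Smirnov test compares.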
TF binding sites in relation to the transcription start site
[Plot: P_|N|, from 0 to 0.006, versus |N|/bp, from 0 to 1400; legend: Repressors, Activators. Inset: P_M, from 0 to 1, versus M/bp, from 0 to 1200.]
Figure 10: Distinguishing between activating and repressing TF binding sites, all experimentally validated by transient transfection assays. Here, $P_{|N|} = P_{-N} + P_{N}$ is the probability of finding a validated TFBS at a distance of |N| base pairs from the transcription start site. Plotted in the inset is the cumulative probability, $P_M = \sum_{N=0}^{M} P_{|N|}$. The two distributions can be distinguished with $p < 8 \times 10^{-3}$ using a Kolmogorov-Smirnov test: activating TF binding sites tend to be closer than repressing TF binding sites to the TSS.
Over-representation of additional motifs on promoters with functionally validated TF binding sites: CTCF example
Table 2: Promoters with functional CTCF binding sites are enriched in a different set of motifs than promoters with non-functional binding sites. A set of ∼13000 human promoters was used as background.

Transcription factor   p-value
ELF5                   0.003
Myf                    0.016
Gfi                    0.04
Over-representation of additional motifs on promoters with functionally validated TF binding sites: YY1 example
Table 3: Over-represented motifs on promoters with functionally validated YY1 binding sites. A set of ∼13000 human promoters was used as background. Over-represented motifs present on promoters with functional YY1 binding sites were not over-represented on promoters with a predicted but non-functional CTCF binding site.

Transcription factor   p-value
SRY                    < 0.001
NFYA                   < 0.001
GABPA                  0.002
Nkx2-5                 0.006
CREB1                  0.009
SOX5                   0.01
AR                     0.011
EWSR1-FLI1             0.014
ELK4                   0.015
SP1                    0.017
FOXI1                  0.02
FOXD3                  0.022
SOX9                   0.022
Klf4                   0.025
STAT1                  0.027
Histone modifications and TFBS functional validation
[Plot: p-value(L), from 0.1 to 1, versus L/bp, from 10 to 10^4; legend: H4K20me1, H3K9me1, H3K9ac, H3K4me3, H3K4me2, H3K4me1, H3K36me3, H3K27me3, H3K27ac.]
Figure 11: Distinguishing between promoters with functionally validated and non-validated TF binding sites. The p-values are computed by applying a Kolmogorov-Smirnov (KS) test to histone modification signals, averaged over base pairs out to a distance L away from the TSS. Before applying the KS test, signals are grouped according to whether or not they have a validated (in K562 cells) TFBS. For promoters with validated TF binding sites, there are significantly (p < 0.05) higher H3K4me1 and H3K9me1 signals for 300 bp < L < 2000 bp.
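The averaging step in this analysis can be sketched as below. The sketch assumes a per-base signal track for each promoter, stored as a list, and a window that extends L bp on both sides of the TSS; the slides do not state whether the window is symmetric, so that choice, the function name and the values are all assumptions made for illustration.

```python
def mean_signal_within(signal, tss_index, L):
    """Average the per-base signal over positions within L bp of the TSS."""
    lo = max(0, tss_index - L)                # clip the window at the track start
    hi = min(len(signal), tss_index + L + 1)  # and at the track end
    window = signal[lo:hi]
    return sum(window) / len(window)


# Hypothetical histone-mark signal around one TSS (index 3), illustration only.
track = [0.0, 0.0, 4.0, 8.0, 4.0, 0.0, 0.0]
for L in (0, 1, 3):
    print(L, mean_signal_within(track, tss_index=3, L=L))
```

Repeating this for every promoter at a fixed L yields one averaged value per promoter; splitting those values by whether the promoter carries a validated TFBS gives the two samples that a KS test would compare at that L.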
Histone modifications and activation versus repression
[Plot: p-value(L), from 0.1 to 1, versus L/bp, from 10 to 10^4; legend: H4K20me1, H3K9me1, H3K9ac, H3K4me3, H3K4me2, H3K4me1, H3K36me3, H3K27me3, H3K27ac.]
Figure 12: Distinguishing between promoters with functionally validated TF binding sites which activate or repress transcription. The p-values are computed by applying a Kolmogorov-Smirnov (KS) test to histone modification signals, averaged over base pairs out to a distance L away from the TSS. Before applying the KS test, signals are grouped according to whether the TF binding activates or represses transcription (in K562 cells). For promoters with activating TF binding sites, there are significantly (p < 0.05) higher H3K4me1 and signals for 300 bp < L < 1000 bp.
Cell line specificity
Figure 13: Venn diagram for functionally validated (p < 0.05) TF binding sites in four different cell lines.
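The region counts of such a Venn diagram can be tabulated from per-cell-line sets of validated sites. The sketch below assumes each site has a unique identifier; the identifiers and set memberships are hypothetical and do not reflect the actual results.

```python
def venn_regions(validated_by_cell_line):
    """Count sites validated in exactly each combination of cell lines."""
    lines = sorted(validated_by_cell_line)
    all_sites = set().union(*validated_by_cell_line.values())
    regions = {}
    for site in all_sites:
        # The key lists every cell line in which this site was validated.
        key = tuple(name for name in lines if site in validated_by_cell_line[name])
        regions[key] = regions.get(key, 0) + 1
    return regions


# Hypothetical site identifiers per cell line, for illustration only.
validated_sites = {
    "K562": {"s1", "s2", "s3"},
    "HepG2": {"s2", "s3", "s4"},
    "HT-1080": {"s3"},
    "HCT-116": {"s3", "s5"},
}
regions = venn_regions(validated_sites)
for combo, count in sorted(regions.items()):
    print(" & ".join(combo), count)
```

Sites whose key contains a single cell line are the cell-line-specific ones; the key with all four names counts sites validated everywhere.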
Summary
We have carried out approximately 36000 (500 × 18 × 4) functional assays on our predicted TF binding sites. In each of 4 cell lines (K562, HepG2, HT-1080 and HCT-116), approximately 70% of predicted TF binding sites were functionally validated.
Approximately 1/3 of validated TF binding sites repress transcription of the genes that they regulate.
Functional validation of predicted TF binding sites is cell line specific.
Validated TF binding sites are significantly more conserved than non-validated predictions.
Validated TF binding sites tend to be closer to the TSS than TF binding sites that were not validated.
Functionally validated TF binding sites that activate transcription tend to be closer to the TSS than those that repress transcription.
Functionally validated TF binding sites can be distinguished from non-validated sites by the statistical over-representation of additional and different TF motifs.
Histone modifications can distinguish validated from non-validated TFBS predictions and activation from repression.
Acknowledgements
Weng Lab: Jie Wang
Myers Lab: E. Christopher Partridge
SwitchGear Genomics: Patrick Collins, Nathan Trinklein
NHGRI