Transcription factor binding site prediction in vivo using DNA sequence and shape features
Anthony Mathelier, Lin Yang, Tsu-Pei Chiu, Remo Rohs, and Wyeth Wasserman
@AMathelier
REGSYSGEN2015 Nov. 17th
Centre for Molecular Medicine and Therapeutics
Transcriptional regulation of gene expression
[Figure: schematic of transcriptional regulation, labeling DNA, nucleosome (histone octamer), TFs, regulatory proteins, cohesin, enhancer, promoters, TSS, RNA Pol II, and RNA transcripts.]
A. Mathelier, W. Shi, and W.W. Wasserman, Trends in Genetics, 2015.
- Transcription of genes is turned on/off by transcription factors (TFs).
- TFs bind to DNA at transcription factor binding sites (TFBSs).
Modeling TFBS using position frequency matrices (PFMs)
Known binding sites:
GTAACAAT
GTAAACAT
GTAAACAA
GTAAACAA
GTAAACAT
GTAAACAA
GTAAACAC
GTCAACAG
GTAAACAT
GTAAACAA
GTAAACAT
TTAAGTAA
ATAAACAA
CTAAACAG
GTAAACAT
GTAAACAA
GTAAACAT
GTAAACAC
GTAAACAT
GTAAACAG
Position Frequency Matrix:
A [ 10   0 190 210 180  15 210  70]
C [ 10   0  20   0  15 180   0  25]
G [175   0   0   0  15   0   0  35]
T [ 15 210   0   0   0  15   0  80]
PFMs - PWMs
Classically, position weight matrices (PWMs) are derived from PFMs to model TFBSs, assuming that nucleotide positions within a TFBS are independent of one another.
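As an illustration of the PFM-to-PWM step, here is a minimal sketch that turns the counts of the PFM shown on this slide into log2-odds weights and scores a candidate site. The pseudocount of 1 and the uniform background of 0.25 are assumptions made for the example, not values taken from the talk, and the function names are hypothetical.

```python
import math

# Counts copied from the PFM on this slide: one row per nucleotide,
# one column per position of the 8-bp site.
PFM = {
    "A": [10, 0, 190, 210, 180, 15, 210, 70],
    "C": [10, 0, 20, 0, 15, 180, 0, 25],
    "G": [175, 0, 0, 0, 15, 0, 0, 35],
    "T": [15, 210, 0, 0, 0, 15, 0, 80],
}


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights.

    The pseudocount and the uniform background are assumptions of this sketch.
    """
    length = len(pfm["A"])
    pwm = {nt: [] for nt in "ACGT"}
    for i in range(length):
        total = sum(pfm[nt][i] for nt in "ACGT") + 4 * pseudocount
        for nt in "ACGT":
            freq = (pfm[nt][i] + pseudocount) / total
            pwm[nt].append(math.log2(freq / background))
    return pwm


def score(pwm, site):
    """Sum the per-position weights; the sum encodes the independence assumption."""
    return sum(pwm[nt][i] for i, nt in enumerate(site))


pwm = pfm_to_pwm(PFM)
# Highest-weight nucleotide at each position.
best = "".join(max("ACGT", key=lambda nt: pwm[nt][i]) for i in range(8))
print(best, round(score(pwm, best), 2))
```

Because the score is a plain sum over positions, a PWM cannot express that the preferred nucleotide at one position depends on its neighbor, which is the limitation the next slide addresses.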
Modeling TFBS using Transcription Factor Flexible Models
>HNF4A 1
...AGTTCAAAGTTCA...
>HNF4A 2
...AGTCCAAAGTTCA...
...
>HNF4A 73554
...CTTGGAACCGGGG...
>HNF4A 73555
...GGCAAGGTTCATA...
ChIP-seq sequences
[Figure: TFFM state diagram with a background state (BG, with bg/bg and bg/fg transitions) and one state per motif position (position 1, position 2, ..., position n), each with an emission table (E0, E1, E2, ..., En) indexed by dinucleotide (AA, AC, AG, AT, TA, TC, ...). Sequence logos over positions 1 to 14, y-axis from 0 to 2 bits.]
A. Mathelier and W.W. Wasserman, PLoS Computational Biology, 2013.
TFFMs
TFFMs model the sequence properties of TFBSs from ChIP-seq data by capturing dependencies between successive nucleotides (dinucleotides).
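To make the idea of successive dinucleotide dependencies concrete, the sketch below fits a position-specific first-order Markov chain to a few aligned sites, so that the probability of a nucleotide depends on the nucleotide just before it. This is a deliberately simplified stand-in, not the TFFM implementation: the toy sites, the pseudocount, and all function names are assumptions for illustration.

```python
import math


def train_first_order(sites, pseudocount=1.0):
    """Estimate P(x_0) and P(x_{i+1} | x_i) at each position from aligned sites."""
    length = len(sites[0])
    start = {nt: pseudocount for nt in "ACGT"}
    # trans[i][prev][cur]: count of `cur` at position i+1 given `prev` at position i.
    trans = [
        {p: {c: pseudocount for c in "ACGT"} for p in "ACGT"}
        for _ in range(length - 1)
    ]
    for site in sites:
        start[site[0]] += 1
        for i in range(length - 1):
            trans[i][site[i]][site[i + 1]] += 1
    # Normalize counts into probabilities.
    total = sum(start.values())
    start = {nt: n / total for nt, n in start.items()}
    for table in trans:
        for row in table.values():
            row_total = sum(row.values())
            for cur in row:
                row[cur] /= row_total
    return start, trans


def log_prob(model, site):
    """Log-probability of a site under the position-specific first-order chain."""
    start, trans = model
    lp = math.log(start[site[0]])
    for i in range(len(site) - 1):
        lp += math.log(trans[i][site[i]][site[i + 1]])
    return lp


# Hypothetical aligned sites, invented for this example.
toy_sites = ["AGTTCA", "AGTCCA", "AGTTCA", "AGGTCA"]
model = train_first_order(toy_sites)
print(log_prob(model, "AGTTCA"), log_prob(model, "CCCCCC"))
```

Unlike the PWM score, each term here is conditioned on the previous nucleotide, so two sites with the same per-position nucleotide frequencies can receive different scores.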
DNA shape features
The DNAshape tool predicts DNA shape features of a DNA sequence.
Genome-wide DNA shape features available on GBshape are:
- Minor Groove Width (MGW)
- Roll
- Propeller Twist (ProT)
- Helix Twist (HelT)
T. Zhou et al., Nucl. Acids Res., 2013.
T.P. Chiu et al., Nucl. Acids Res., 2015.
Using DNA shape to model TFBSs
Previous studies showed the importance of DNA shape for modeling TFBSs using data from:
- SELEX-seq experiments.
- Protein-binding microarray experiments.
- BunDLE-seq experiments.
N. Abe et al., Cell, 2015. T. Zhou et al., PNAS, 2015. M. Levo et al., Genome Res., 2015.
Aims of our study:
- Construct computational models from large-scale in vivo data (ChIP-seq) by combining DNA sequence and shape features.
- Show TFBS prediction improvements on in vivo data.
- Analyze whether improvements induced by DNA shape are specific to TF families.
- Analyze position-specific DNA shape importance at TFBSs.
Combining TFFMs and DNA shapes at TFBSs
[Figure: feature vector built for each TFFM hit, composed of the hit score followed by the MGW, ProT, Roll, and HelT values.]
We used an ensemble machine learning approach to combine DNA sequence and shape features.
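A minimal sketch of how such a feature vector could be assembled for one hit is given below. It assumes that each shape feature contributes one value per position (or per base-pair step) along the hit; the function name, the dictionary layout, and all numeric values are invented for illustration and are not taken from the study.

```python
# Order of the shape features as listed on the slide.
SHAPE_FEATURES = ("MGW", "ProT", "Roll", "HelT")


def build_feature_vector(hit_score, shape):
    """Concatenate the hit score with the shape values of the hit.

    `shape` maps each feature name to a list of values along the hit
    (an assumed layout for this sketch).
    """
    vector = [hit_score]
    for name in SHAPE_FEATURES:
        vector.extend(shape[name])
    return vector


# Placeholder numbers for a hypothetical 3-bp hit; not real DNAshape output.
toy_shape = {
    "MGW": [5.1, 4.8, 4.2],
    "ProT": [-6.0, -8.5, -7.1],
    "Roll": [-1.2, 0.4],
    "HelT": [34.1, 35.6],
}
vector = build_feature_vector(0.87, toy_shape)
print(len(vector), vector)
```

Vectors of this form, computed for bound and background sequences, would then be passed to the classifier; the talk only states that an ensemble method was used, so the choice of learner is left open here.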
DNA shape features improve TFBS prediction in vivo
[Figure, panels A and B: results on 400 human ENCODE ChIP-seq data sets.]
Combining TFFM scores and DNA shape features improves the discriminative power: the AUROC difference is > 0.05 in 107 cases.
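For readers unfamiliar with the metric, the sketch below computes the AUROC as the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, then compares two models on the same data. The scores are invented toy values, and the talk does not specify how the negative set was built, so this is only an illustration of the comparison, not of the actual evaluation pipeline.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores from a sequence-only model and a sequence + shape model.
seq_only = auroc([0.9, 0.6, 0.4], [0.7, 0.5, 0.2])
seq_shape = auroc([0.9, 0.8, 0.6], [0.7, 0.3, 0.2])
difference = seq_shape - seq_only
print(round(seq_only, 3), round(seq_shape, 3), difference > 0.05)
```

The pairwise loop is quadratic in the number of sequences, which is fine for a demonstration; a rank-based computation would be preferable on full ChIP-seq data sets.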
DNA shape features are important for specific TF families
[Figure, panels B and C.]
Data sets from the E2F and MADS-domain TF families are enriched for strong improvements when considering DNA shape features.
Validation on independent plant MADS-domain TFs
Incorporating DNA shape features significantly improves TFBS prediction for plant MADS-domain TFs.
ProT position-specific importance for MADS-domain TFs
[Figure, panels A and B: AGL15 sequence logo (y-axis in bits).]
ProT is of critical importance for predicting TFBSs associated with plant MADS-domain TFs in a position-specific manner.
Conclusions
- Our analyses of ChIP-seq data represent the in vivo counterpart of the published in vitro studies.
- We can construct computational models combining DNA sequence and shape features from ChIP-seq data to improve TFBS prediction in vivo.
- Incorporating DNA shape information is most beneficial when applied to the E2F and MADS-domain TF families.
- ProT is critical for MADS-domain TF binding specificity in a position-specific manner.
Acknowledgements
- Wyeth Wasserman
- Remo Rohs
- Lin Yang
- Tsu-Pei Chiu
- Francois Parcy
- Oriol Fornes
- Chih-Yu Chen
Centre for Molecular Medicine and Therapeutics
Thank you