MedGen 505
Gene Regulation Bioinformatics
Wyeth W. Wasserman
www.cisreg.ca
CMMT
Overview
• TFBS Prediction with Motif Models
• Improving Specificity of Predictions
• Analysis of Sets of Co-Expressed and Co-Regulated Genes
Transcription Factor Binding Sites (over-simplified for pedagogical purposes)
[Figure labels: TATA, URE, URF, Pol-II]
Teaching a computer to find TFBS…
Laboratory Discovery of TFBS
[Figure: seven LUCIFERASE reporter constructs and their ACTIVITY]
Representing Binding Sites for a TF
• A single site: AAGTTAATGA
• Set of binding sites:
AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
• A set of sites represented as a consensus: VDRTWRWWSHD (IUPAC degenerate DNA)
• A matrix describing a set of sites:
A 14 16 4 0 1 19 20 1 4 13 4 4 13 12 3
C 3 0 0 0 0 0 0 0 7 3 1 0 3 1 12
G 4 3 17 0 0 2 0 0 9 1 3 0 5 2 2
T 0 2 0 21 20 0 1 20 1 4 13 17 0 6 4
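As a rough illustration of how an IUPAC consensus such as VDRTWRWWSHD can be used to search a sequence, the sketch below expands each degenerate code into a character class and reports every matching start position. The function names and the test sequence are hypothetical and chosen only for this example.

```python
import re

# Standard IUPAC degenerate DNA codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Turn an IUPAC consensus string into a regular-expression string."""
    return "".join("[" + IUPAC[code] + "]" for code in consensus)

def find_sites(sequence, consensus):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead is used so that overlapping matches are also reported.
    pattern = "(?=" + consensus_to_regex(consensus) + ")"
    return [m.start() for m in re.finditer(pattern, sequence)]

# Hypothetical test sequence with one embedded match starting at position 2.
print(find_sites("CCAAGTTAATGAACC", "VDRTWRWWSHD"))
```

A consensus match is all-or-nothing, which is one reason the matrix representation is usually preferred: it keeps the per-position counts that the consensus throws away.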
PFMs to PWMs
Add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic
f matrix
A 5 0 1 0 0
C 0 2 2 4 0
G 0 3 1 0 4
T 0 0 1 1 1
w matrix
A 1.6 -1.7 -0.2 -1.7 -1.7
C -1.7 0.5 0.5 1.3 -1.7
G -1.7 1.0 -0.2 -1.7 1.3
T -1.7 -1.7 -0.2 -0.2 -0.2
$w(b,i) = \log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$
Example: score of TGCTG = 0.9
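The conversion from the f matrix to the w matrix can be sketched as follows. The slide does not spell out s(n) or the log base, so this sketch assumes a pseudocount of sqrt(N) x p(b) per base (N = number of sites in the column), a uniform background p(b) = 0.25, normalization of counts to frequencies, and log base 2. Under those assumptions the values round to the w matrix shown above; the function names are hypothetical.

```python
import math

BASES = "ACGT"

def pfm_to_pwm(pfm, background=None):
    """Convert a count matrix {base: [counts per column]} to log2 weights."""
    background = background or {b: 0.25 for b in BASES}
    width = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(width):
        n = sum(pfm[b][i] for b in BASES)          # number of sites in column
        for b in BASES:
            pseudo = math.sqrt(n) * background[b]  # assumed form of s(n)
            freq = (pfm[b][i] + pseudo) / (n + math.sqrt(n))
            pwm[b].append(math.log2(freq / background[b]))
    return pwm

def score(site, pwm):
    """Sum the column weights for one candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

# The f matrix from the slide.
f_matrix = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
w_matrix = pfm_to_pwm(f_matrix)
print(round(score("TGCTG", w_matrix), 1))
```

Working on a log scale is what makes the site score a simple sum of column weights.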
Performance of Profiles
• 95% of predicted sites bound in vitro (Tronche 1997)
• MyoD binding sites predicted about once every 600 bp (Fickett 1995)
• The Futility Conjecture
– Nearly 100% of predicted transcription factor binding sites have no function in vivo
JASPAR
AN OPEN-ACCESS DATABASE
OF TF BINDING PROFILES
PROBLEM: Too many spurious predictions
Actin, alpha cardiac
Terms
• Specificity – The portion of predictions that are correct
• Sensitivity – The portion of “positives” that are detected
• The detection of TFBS is limited by terrible specificity. Why?
Method #1: Phylogenetic Footprinting
70,000,000 years of evolution reveals most regulatory regions
Phylogenetic Footprinting
[Figure: FoxC2; axis ticks from 0% to 100%]
Phylogenetic Footprinting to Identify Functional Segments
[Figure: % identity (y-axis) vs. 200 bp window start position in the human sequence (x-axis). Actin gene compared between human and mouse with DPB.]
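A plot like this can be produced by sliding a fixed-size window along a pairwise alignment and computing the percent identity in each window. The sketch below assumes the two sequences are already aligned (equal length, gaps as "-"); it is a generic illustration, not the DPB program, and positions are alignment columns rather than human sequence coordinates.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Return (window start, % identity) pairs along a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both sequences carry the same base, 0 for mismatches and gaps.
    matches = [1 if a == b and a != "-" else 0
               for a, b in zip(aligned_a, aligned_b)]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, 100.0 * identical / window))
    return profile

# Tiny hypothetical alignment, using a 4-column window for readability.
print(window_identity("ACGTACGT", "ACGTTCGT", window=4, step=4))
```

Windows whose identity exceeds a chosen cutoff would then be kept as candidate regulatory segments.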
Phylogenetic Footprinting Dramatically Reduces Spurious Hits
Human
Mouse
Actin, alpha cardiac
Performance: Human vs. Mouse
• Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)
• 75-90% of defined sites detected with conservation filter (SENSITIVITY), while only 11-16% of total predictions retained (SELECTIVITY)
ConSite (www.cisreg.ca)
NEW: Ortholog Sequence Retrieval Service
Emerging Issues
• Multiple sequence comparisons
– Incorporate phylogenetic trees
– Visualization
• Analysis of closely related species
– Phylogenetic shadowing
• Genome rearrangements
– Inversion compatible alignment algorithm
• Higher order models of TFBS
Online Resources for Phylogenetic Footprinting
• Linked to TFBS
– ConSite
– rVISTA
• Alignments
– Blastz
– Lagan
– Avid
– ORCA
• Visualization
– Sockeye
– Vista Browser
– PipMaker
Method #2: Discrimination of Regulatory Modules
TFs do NOT act in isolation
Layers of Complexity in Metazoan Transcription
Diverse and non-uniform use of terms: Partial glossary for tutorial
• Promoter – Sufficient to support the initiation of transcription; orientation dependent; includes TSS
• Regulatory Regions
– Proximal – adjacent to promoter
– Distal – some distance away from promoter (vague)
– May be positive (enhancing) or negative (repressing)
• TSS – transcription start site
• TFBS – single transcription factor binding site
• Modules – Sets of TFBS that function together
[Figure: gene schematic with labels Distal Regulatory Region, Proximal Regulatory Region, Promoter Region, TFBS, TATA, TSS, EXON]
Detecting Clusters of TF Binding Sites
• Trained Methods
– Sufficient examples of real clusters to establish weights on the relative importance of each TF
• Statistical Over-Representation of Combinations
– Binding profiles available for a set of biologically motivated TFs
Training for the detection of liver cis-regulatory modules (CRMs)
Models for Liver TFs…
HNF1
C/EBP
HNF3
HNF4
Logistic Regression Analysis
“logit”
Optimize vector to maximize the distance between output values for positive and negative training data.
Output value is:
$p(x) = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}}$
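The output value can be sketched in a few lines. Here the logit is assumed to be a weighted sum of per-TF scores plus an intercept; the weights, the intercept and the function names are hypothetical placeholders, since the trained values are not given on the slide.

```python
import math

def logit(tf_scores, weights, intercept=0.0):
    """Weighted sum of per-TF scores (the quantity the training optimizes)."""
    return intercept + sum(w * s for w, s in zip(weights, tf_scores))

def probability(logit_value):
    """p(x) = e^logit / (1 + e^logit), written to avoid overflow."""
    if logit_value >= 0:
        return 1.0 / (1.0 + math.exp(-logit_value))
    e = math.exp(logit_value)
    return e / (1.0 + e)

# Hypothetical scores for four TF models in one sequence window.
window_scores = [0.8, 0.1, 0.6, 0.3]
weights = [1.5, 0.5, 1.0, 0.7]          # assumed, not trained values
print(probability(logit(window_scores, weights, intercept=-2.0)))
```

Training then amounts to choosing the weight vector so that positive and negative training windows receive well-separated output values.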
Performance of the Liver Model
• Performance
– Sensitivity: 60% of known CRMs detected
– Specificity: 1 prediction/35,000 bp
• Limitations
– Applies to genes expressed late in hepatocyte differentiation
– Requires 10-15 genes in positive training set
– This model doesn’t account for multiple sites for the same TF (new methods from several groups address this limit)
[Figure: UGT1A1. Liver module model score (y-axis) vs. “window” position in sequence (x-axis); series: Wildtype, Other]
Making better predictions
• Profiles make far too many false predictions to have predictive value in isolation
• Phylogenetic footprinting eliminates ~90% of false predictions
• Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context
Method #3: Higher Order Models
Position-position dependence
What is a higher-order background model?
Zero-order: p(A)=0.29, p(C)=0.21, p(G)=0.21, p(T)=0.29
$P(\text{seq}) = \prod_{i=1}^{N} P(\text{nucleotide}_i)$
First-order: [Figure: first-order model diagram with nucleotide labels A, C, G, T]
mth-order: The chance of drawing base x is dependent on the identity of the previous m bases
Probabilistic Methods for Pattern Discovery
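To make the difference between background orders concrete, the sketch below scores a sequence under the zero-order frequencies given above and under a first-order model whose transition probabilities are estimated from a training sequence. The add-one pseudocount, the training string and the function names are assumptions made for this illustration.

```python
import math

BASES = "ACGT"
ZERO_ORDER = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}

def log_prob_zero_order(seq, p=ZERO_ORDER):
    """log P(seq) when every base is drawn independently."""
    return sum(math.log(p[base]) for base in seq)

def train_first_order(training_seq, pseudocount=1.0):
    """Estimate P(current base | previous base) from dinucleotide counts."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for prev, cur in zip(training_seq, training_seq[1:]):
        counts[prev][cur] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
            for a in BASES}

def log_prob_first_order(seq, transitions, p=ZERO_ORDER):
    """log P(seq) when each base depends on the one before it."""
    total = math.log(p[seq[0]])            # first base has no predecessor
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(transitions[prev][cur])
    return total

transitions = train_first_order("ATATATAT")   # hypothetical training data
print(log_prob_zero_order("ATAT"), log_prob_first_order("ATAT", transitions))
```

An AT-repeat looks far less surprising under the first-order model trained on AT-rich sequence, which is the kind of effect a higher-order background is meant to capture.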
Linking co-expressed genes to candidate transcription factors
Deciphering Regulation of Co-Expressed Genes
oPOSSUM Procedure
1. Set of co-expressed genes
2. Automated sequence retrieval from EnsEMBL
3. Phylogenetic Footprinting (ORCA)
4. Detection of transcription factor binding sites
5. Statistical significance of binding sites
6. Putative mediating transcription factors
Statistical Methods for Identifying Over-represented TFBS
• Z scores
– Based on the number of occurrences of the TFBS relative to background
– Normalized for sequence length
– Simple binomial distribution model
• Fisher exact probability scores
– Based on the number of genes containing the TFBS relative to background
– Hypergeometric probability distribution
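The two measures can be sketched as below. This is a simplified reading of the bullets, not the oPOSSUM implementation: the Z-score assumes a per-nucleotide background hit rate applied to the target sequence length, and the Fisher score is a one-tailed hypergeometric probability that treats the target and background gene sets as disjoint. All names and numbers are hypothetical.

```python
import math

def z_score(target_hits, target_length, background_hits, background_length):
    """Binomial Z-score for site occurrences, normalized for sequence length."""
    rate = background_hits / background_length      # hits per nucleotide
    expected = rate * target_length
    std_dev = math.sqrt(target_length * rate * (1.0 - rate))
    return (target_hits - expected) / std_dev

def fisher_one_tailed(target_with, target_total, background_with, background_total):
    """P(at least `target_with` target genes contain the site) by chance."""
    all_genes = target_total + background_total
    all_with = target_with + background_with
    denominator = math.comb(all_genes, target_total)
    p_value = 0.0
    for k in range(target_with, min(all_with, target_total) + 1):
        p_value += (math.comb(all_with, k)
                    * math.comb(all_genes - all_with, target_total - k)) / denominator
    return p_value

# Hypothetical counts: 20 hits in 1,000 bp of target vs. 100 in 10,000 bp.
print(z_score(20, 1000, 100, 10000))
# Hypothetical counts: 2 of 2 target genes vs. 0 of 2 background genes.
print(fisher_one_tailed(2, 2, 0, 2))
```

The Z-score rewards many sites overall, while the Fisher score rewards sites spread across many genes, so the two rankings can disagree.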
The oPOSSUM Database
• Orthologous genes: 8468
• Promoter pairs: 6911
• Promoters with TFBS: 6758
• Total # of TFBS predictions: 1638293
• Overall failure rate: 20.2%
Validation using Reference Gene Sets
TFs with experimentally-verified sites in the reference sets.
A. Muscle-specific (23 input; 16 analyzed)
TF Rank Z-score Fisher
SRF 1 21.41 1.18e-02
MEF2 2 18.12 8.05e-04
c-MYB_1 3 14.41 1.25e-03
Myf 4 13.54 3.83e-03
TEF-1 5 11.22 2.87e-03
deltaEF1 6 10.88 1.09e-02
S8 7 5.874 2.93e-01
Irf-1 8 5.245 2.63e-01
Thing1-E47 9 4.485 4.97e-02
HNF-1 10 3.353 2.93e-01
B. Liver-specific (20 input; 12 analyzed)
TF Rank Z-score Fisher
HNF-1 1 38.21 8.83e-08
HLF 2 11.00 9.50e-03
Sox-5 3 9.822 1.22e-01
FREAC-4 4 7.101 1.60e-01
HNF-3beta 5 4.494 4.66e-02
SOX17 6 4.229 4.20e-01
Yin-Yang 7 4.070 1.16e-01
S8 8 3.821 1.61e-02
Irf-1 9 3.477 1.69e-01
COUP-TF 10 3.286 2.97e-01
Application to Microarray Data Sets
1. NF-κB inhibition microarray study
Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)
TF Class Rank Z-score Fisher No. Genes
p65 REL 1 36.57 5.66e-12 62
NF-kappaB REL 2 32.58 5.82e-11 61
c-REL REL 3 26.02 8.59e-08 63
Irf-2 TRP-CLUSTER 4 20.39 5.74e-04 6
SPI-B ETS 5 16.59 1.23e-03 135
Irf-1 TRP-CLUSTER 6 15.4 9.55e-04 23
Sox-5 HMG 7 15.38 2.56e-02 126
p50 REL 8 14.72 2.23e-03 19
Nkx HOMEO 9 13.66 2.29e-03 111
Bsap PAIRED 10 13.2 9.92e-02 1
FREAC-4 FORKHEAD 11 12.05 1.66e-03 92
n-MYC bHLH-ZIP 25 6.695 1.84e-03 102
ARNT bHLH 26 6.695 1.84e-03 102
HNF-3beta FORKHEAD 29 5.948 3.32e-03 47
SOX17 HMG 31 5.406 8.60e-03 79
C-Myc SAGE Data
• c-Myc transcription factor dimerizes with the Max protein
• Key regulator of cell proliferation, differentiation and apoptosis
• Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells
• They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR
Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)
TF Class Rank Z-score Fisher No. Genes
Myc-Max bHLH-ZIP 1 21.68 5.35e-03 7
Staf ZN-FINGER, C2H2 2 20.17 1.70e-02 2
Max bHLH-ZIP 3 18.32 2.16e-02 12
SAP-1 ETS 4 13.23 1.61e-04 13
USF bHLH-ZIP 5 11.90 1.84e-01 16
SP1 ZN-FINGER, C2H2 6 11.68 4.40e-02 12
n-MYC bHLH-ZIP 7 11.11 1.55e-01 20
ARNT bHLH 8 11.11 1.55e-01 20
Elk-1 ETS 9 10.92 3.88e-03 19
Ahr-ARNT bHLH 10 10.17 1.11e-01 25
C-Fos Microarray Experiment
• In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line
• We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs
Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)
TF Class Rank Z-score Fisher No. Genes
c-FOS bZIP 1 17.53 2.60e-05 45
RREB-1 ZN-FINGER, C2H2 2 8.899 1.41e-01 1
PPARgamma-RXRal NUCLEAR RECEPTOR 3 3.991 2.98e-01 1
CREB bZIP 4 3.626 1.25e-01 10
E2F Unknown 5 2.965 7.67e-02 15
NF-kappaB REL 6 2.915 1.04e-01 17
SRF MADS 7 2.707 2.24e-01 2
MEF2 MADS 8 2.634 1.32e-01 13
c-REL REL 9 2.467 5.79e-02 22
Staf ZN-FINGER, C2H2 10 2.385 3.74e-01 1
Ahr-ARNT bHLH 15 1.716 2.57e-03 63
deltaEF1 ZN-FINGER, C2H2 23 0.271 5.39e-03 75
Elk-1 ETS 21 0.7875 8.12e-03 37
MZF_1-4 ZN-FINGER, C2H2 27 -0.2421 5.41e-03 73
n-MYC bHLH-ZIP 30 -0.8738 8.20e-03 51
ARNT bHLH 31 -0.8738 8.20e-03 51
oPOSSUM Server
http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum
INPUT A LIST OF CO-EXPRESSED GENES
SELECT YOUR TFBS PROFILES
SELECT:
1. CONSERVATION
2. PSSM MATCH THRESHOLD
3. PROMOTER REGION
4. STATISTICAL MEASURE
de novo Discovery of TF Binding Sites
Pattern Discovery
de novo Pattern Discovery
• Exhaustive – e.g. YMF (Sinha & Tompa)
– Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections
• Monte Carlo/Gibbs Sampling – e.g. AnnSpec (Workman & Stormo)
– Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics
Exhaustive methods
Word based methods: How likely are X words in a set of sequences, given sequence characteristics?
CCCGCCGGAATGAAATCTGATTGACATTTTCC
>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP55011 (-) Ce hsp 16K-1 B; range
Over-representation
How many words of type ’AGGAGTGA’ are found in our sequences?
$P(w \text{ begins in } i) = \prod_{j=1}^{k} p(a_j)$
$E(X_w) = (n - k + 1) \prod_{j=1}^{k} p(a_j)$
$Z_w = \frac{X_w - E(X_w)}{\sqrt{\mathrm{Var}(X_w)}}$
How likely is this result?
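The over-representation calculation for a single word can be sketched as follows, reading n as the sequence length, k as the word length, a_1...a_k as the letters of the word and X_w as the observed count. The slide does not give the variance, so this sketch assumes a simple binomial variance and ignores the correction needed for self-overlapping words; function names are hypothetical.

```python
import math

def word_probability(word, base_freq):
    """Probability that the word begins at a given position (zero-order model)."""
    p = 1.0
    for letter in word:
        p *= base_freq[letter]
    return p

def expected_count(word, seq_length, base_freq):
    """E(X_w) = (n - k + 1) * product of the base probabilities."""
    return (seq_length - len(word) + 1) * word_probability(word, base_freq)

def word_z_score(observed, word, seq_length, base_freq):
    """Z-score of the observed count under an assumed binomial variance."""
    p = word_probability(word, base_freq)
    positions = seq_length - len(word) + 1
    variance = positions * p * (1.0 - p)   # assumption: overlaps are ignored
    return (observed - positions * p) / math.sqrt(variance)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_count("AGGAGTGA", 10000, uniform))
print(word_z_score(5, "AGGAGTGA", 10000, uniform))
```

A large positive Z-score flags a word that occurs more often than the background composition alone would predict.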
Exhaustive methods
Find all words of length 7 in the yeast genome
Make a lookup table:
AAACCTTT 456
TTTTTTTT 57788
GATAGGCA 589
Etc...
GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
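Building such a lookup table is a single pass over the sequence. The sketch below uses a dictionary-based counter keyed by word; the function name and the short test string are hypothetical, and a real genome would be read from a file rather than typed inline.

```python
from collections import Counter

def count_words(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Short hypothetical fragment; the same call works on a whole genome string.
table = count_words("GTCTTTATCTTCAAAGTTGTCTGTCC", 3)
print(table.most_common(3))
```

The same table built from a set of promoters can then be compared word by word against the genome-wide table.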
The Gibbs Sampling algorithm
Two data structures used:
1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and corresponding background frequencies p_{i,1}, ..., p_{i,4}
2) Current positions of site startpoints in the N sequences a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}.
One starting point in each sequence is chosen randomly initially.
Example sequence: tgacttcctgatctctagacctcatgacctct
Iteration step
A. Remove one sequence z from the set. Update the current pattern according to
$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}$
where b_j is the pseudocount for symbol j and B is the sum of all pseudocounts in the column.
B. ’Score’ the current pattern against each possible occurrence a_k in z. Draw a new a_k with probabilities based on the respective score divided by the background model.
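The two steps can be put together in a compact sampler. This is a minimal sketch under stated assumptions, not any published implementation: a fixed pattern width, one site per sequence, equal pseudocounts b_j = 0.5, a uniform background, sequences visited in order rather than at random, and uppercase A/C/G/T input. Function names and the example sequences are hypothetical.

```python
import random

BASES = "ACGT"

def build_profile(seqs, starts, width, skip, pseudo=0.5):
    """Step A: column frequencies q[i][b] from all sequences except `skip`."""
    n_used = len(seqs) - 1                      # N - 1
    total_pseudo = pseudo * len(BASES)          # B, sum of pseudocounts
    profile = []
    for i in range(width):
        counts = {b: pseudo for b in BASES}     # start each count at b_j
        for index, (seq, start) in enumerate(zip(seqs, starts)):
            if index != skip:
                counts[seq[start + i]] += 1     # c_{i,j}
        profile.append({b: counts[b] / (n_used + total_pseudo) for b in BASES})
    return profile

def gibbs_sample(seqs, width, iterations=100, background=None, seed=0):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    background = background or {b: 0.25 for b in BASES}
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):
            q = build_profile(seqs, starts, width, skip=z)
            # Step B: weight each candidate start by pattern/background ratio.
            weights = []
            for start in range(len(seqs[z]) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seqs[z][start + i]
                    ratio *= q[i][base] / background[base]
                weights.append(ratio)
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

example = ["TTGACCTCAA", "GGTGACCTGG", "ATGACCTTTC", "CCCTGACCTA"]
print(gibbs_sample(example, width=6, iterations=50))
```

Because new positions are drawn at random rather than chosen greedily, repeated runs with different seeds can settle on different alignments, which is why such samplers are usually run several times.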
Applied Pattern Discovery is Acutely Sensitive to Noise
[Figure: pattern similarity vs. true MEF2 profile (y-axis, 10 to 18) against sequence length (x-axis, 0 to 600); reference: true Mef2 binding sites]
Four Approaches to Improve Sensitivity
• Better background models
– Higher-order properties of DNA
• Phylogenetic Footprinting
– Human:Mouse comparison eliminates ~75% of sequence
• Regulatory Modules
– Architectural rules
• Limit the types of binding profiles allowed
– TFBS patterns are NOT random
Enhancing pattern detection sensitivity
TFBSs are not randomly drawn
• Information segmentation: information content distributions of TFBS are distinctly non-random (Wasserman et al 2000)
• Palindromicity, dyads (van Helden et al 2000)
• Variable gaps (Hu 2003)
Pattern discovery methods using biochemical constraints
Some profile constraints have been explored…
• Segmentation of informative columns
• Palindromic patterns
Our Hypothesis
• Point 1: Structurally-related DNA binding domains interact with similar target sequences
• Exceptions exist (e.g. Zn-fingers)
• Point 2: There are a finite number of binding domains used in human TFs
• Approximately 20-25
• Idea: We could use the shared binding properties for each family to focus pattern detection methods
• Constrain the range of patterns sought
Comparison of profiles requires alignment and a scoring function
• Scoring function based on sum of squared differences
• Align frequency matrices with modified Needleman-Wunsch algorithm
• Calculate empirical p-values based on simulated set of matrices
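A simplified version of the profile comparison can be sketched as below. Two things here are assumptions rather than the method on the slide: the column score is taken as 2 minus the sum of squared differences (so identical columns score 2), and the alignment is an ungapped sliding comparison standing in for the modified Needleman-Wunsch algorithm. Matrices are lists of per-column frequency dictionaries, and all names are hypothetical.

```python
BASES = "ACGT"

def column_score(col_a, col_b):
    """Similarity of two frequency columns: 2 minus the sum of squared differences."""
    return 2.0 - sum((col_a[b] - col_b[b]) ** 2 for b in BASES)

def best_ungapped_alignment(matrix_a, matrix_b, min_overlap=2):
    """Slide matrix_b along matrix_a; return (best score, offset of matrix_b)."""
    best = (float("-inf"), None)
    first = -(len(matrix_b) - min_overlap)
    last = len(matrix_a) - min_overlap
    for offset in range(first, last + 1):
        score, overlap = 0.0, 0
        for i in range(len(matrix_a)):
            j = i - offset                      # matching column in matrix_b
            if 0 <= j < len(matrix_b):
                score += column_score(matrix_a[i], matrix_b[j])
                overlap += 1
        if overlap >= min_overlap and score > best[0]:
            best = (score, offset)
    return best

def pure(base):
    """Helper: a column that is 100% one base."""
    return {b: 1.0 if b == base else 0.0 for b in BASES}

profile = [pure("A"), pure("C"), pure("G")]
print(best_ungapped_alignment(profile, profile))
```

An empirical p-value would then come from repeating the comparison against a large set of simulated matrices and asking how often a score at least this high occurs.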
[Figure: frequency (y-axis) vs. score (x-axis)]
Intra-family comparisons more similar than inter-family
TF Database(JASPAR)
COMPARE
Match to bHLH
Jackknife Test 87% correct
Independent Test Set 93% correct
FBPs enhance sensitivity of pattern detection
REVIEWING THE TOP POINTS
Orientation
Regulatory regions problem space
Sets of binding sites
AATCACCAAATCACCAAATCACCAAATCACCAAATCTCCCAATCTCCGAATCACACAATCATCAAATCTCACAATCTCTGAGTCCCCAAATCCCGGAATCTGAGAATCCATAATTCAGCCAATAACTTGATAACCTAATTAGACGATTACAGGATTAGCGATTCTTCCTATGAACAGATTAAAAAGACCCCA
Specificity profiles for binding sites
A [ -2 0 -2 -0.415 0.585 -2 -2 2.088 -2 -2 -1 0.585 ]
C [ 1 0.585 0 0 -1 -2 -2 -2 2.088 -2 0.585 0.807 ]
G [ 0.585 0.322 0.807 1.585 1 -2 2 -2 -2 2.088 -2 0 ]
T [ 0.319 0.322 1 -2 0 2.088 -1 -2 -2 -2 1.459 -0.415 ]
Clusters of binding sites
Transcription factors
Transcription factor binding sites
Regulatory nucleotide sequences
[Figure labels: TATA, URE, URF, Pol-II]
Analysis of regulatory regions with TFBS
Detecting binding sites in a single sequence
Scanning a sequence against a PWM
Sp1
A [ -0.2284 0.4368 -1.5 -1.5 -1.5 0.4368 -1.5 -1.5 -0.2284 0.4368 ]
C [ -0.2284 -0.2284 -1.5 -1.5 1.5128 -1.5 -0.2284 -1.5 -0.2284 -1.5 ]
G [ 1.2348 1.2348 2.1222 2.1222 0.4368 1.2348 1.5128 1.7457 1.7457 -1.5 ]
T [ 0.4368 -0.2284 -1.5 -1.5 -0.2284 0.4368 0.4368 0.4368 -1.5 1.7457 ]
ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC
Abs_score = 13.4 (sum of column scores)
Calculating the relative score
Max_score = 15.2 (sum of highest column scores)
Min_score = -10.3 (sum of lowest column scores)
$\text{Rel\_score} = \frac{\text{Abs\_score} - \text{Min\_score}}{\text{Max\_score} - \text{Min\_score}} \times 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \times 100\% = 93\%$
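Scanning with a relative-score threshold can be sketched as follows. The small three-column matrix is a made-up toy, not the Sp1 profile, and only the forward strand is scanned; the function names are hypothetical. The final line plugs the Abs_score, Min_score and Max_score values quoted above into the same formula.

```python
BASES = "ACGT"

def score_bounds(pwm):
    """Return (Min_score, Max_score): sums of lowest and highest column scores."""
    width = len(pwm["A"])
    lowest = sum(min(pwm[b][i] for b in BASES) for i in range(width))
    highest = sum(max(pwm[b][i] for b in BASES) for i in range(width))
    return lowest, highest

def relative_score(abs_score, min_score, max_score):
    """Rescale an absolute score to a 0-100% range."""
    return 100.0 * (abs_score - min_score) / (max_score - min_score)

def scan(sequence, pwm, threshold=75.0):
    """Return (start, Abs_score, Rel_score) for windows at or above threshold."""
    width = len(pwm["A"])
    min_score, max_score = score_bounds(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        abs_score = sum(pwm[base][i] for i, base in enumerate(window))
        rel = relative_score(abs_score, min_score, max_score)
        if rel >= threshold:
            hits.append((start, abs_score, rel))
    return hits

# Hypothetical toy matrix that prefers the word ACG.
toy_pwm = {
    "A": [1.0, -1.0, -1.0],
    "C": [-1.0, 1.0, -1.0],
    "G": [-1.0, -1.0, 1.0],
    "T": [-1.0, -1.0, -1.0],
}
print(scan("TTACGTT", toy_pwm, threshold=100.0))
print(round(relative_score(13.4, -10.3, 15.2)))
```

Lowering the threshold raises sensitivity but quickly multiplies the number of hits, which is the specificity problem discussed next.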
Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%
Ouch.
Low specificity of profiles:
• too many hits
• great majority not biologically significant
A dramatic improvement in the percentage of biologically significant detections
Scanning a single sequence
Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions
Analysis of regulatory regions with TFBS
Phylogenetic Footprints
Pattern Discovery
Concluding Thoughts
• Bioinformatics is often constrained by our understanding of biochemistry rather than computational or statistical limitations
• Evolution has a powerful influence on the performance of many bioinformatics methods
• Computational predictions have value, but only if you understand the limitations of the methods