MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman

www.cisreg.ca

CMMT

Overview

• TFBS Prediction with Motif Models

• Improving Specificity of Predictions

• Analysis of Sets of Co-Expressed and Co-Regulated Genes

Transcription Factor Binding Sites (over-simplified for pedagogical purposes)

[Figure: promoter schematic with URE and TATA elements, bound by URF and Pol-II]

Teaching a computer to find TFBS…

Laboratory Discovery of TFBS

[Figure: series of seven LUCIFERASE reporter constructs and their measured ACTIVITY]

Representing Binding Sites for a TF

• A set of sites represented as a consensus
  VDRTWRWWSHD (IUPAC degenerate DNA)

• A matrix describing a set of sites

  A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
  C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
  G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
  T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

• A single site
  AAGTTAATGA

Set of binding sites

  AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
  GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
  AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
  AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
  AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
  AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
  AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
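The three representations above (single site, consensus, count matrix) can all be derived from one aligned set of sites. The sketch below is illustrative only: the function names and the `min_fraction` rule for choosing an IUPAC letter are assumptions, not the method used to produce the consensus shown on the slide.

```python
# Map each set of allowed bases to its IUPAC degenerate DNA code.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def count_matrix(sites):
    """Count each base at each position of equal-length, aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            matrix[base][i] += 1
    return matrix

def consensus(matrix, min_fraction=0.25):
    """Hypothetical rule: keep every base seen in >= min_fraction of sites."""
    letters = []
    for i in range(len(matrix["A"])):
        total = sum(matrix[b][i] for b in "ACGT")
        kept = frozenset(b for b in "ACGT" if matrix[b][i] / total >= min_fraction)
        if not kept:  # threshold too strict: fall back to the most common base
            kept = frozenset(max("ACGT", key=lambda b: matrix[b][i]))
        letters.append(IUPAC[kept])
    return "".join(letters)

# First four sites from the slide.
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
print(consensus(count_matrix(sites)))
```

A consensus discards the counts, which is why the matrix form is preferred for scoring.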

PFMs to PWMs

Add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

f matrix

  A   5  0  1  0  0
  C   0  2  2  4  0
  G   0  3  1  0  4
  T   0  0  1  1  1

w matrix

  A   1.6 -1.7 -0.2 -1.7 -1.7
  C  -1.7  0.5  0.5  1.3 -1.7
  G  -1.7  1.0 -0.2 -1.7  1.3
  T  -1.7 -1.7 -0.2 -0.2 -0.2

$w(b,i) = \log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$

Example site score: TGCTG = 0.9 (sum of the w matrix entries for T, G, C, T, G at positions 1 to 5)
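The conversion can be sketched in a few lines. The slide does not state the log base, the pseudocount function s(n), or the background p(b); the sketch assumes log base 2, a pseudocount of sqrt(N)/4 per base (so the column is normalized by N + sqrt(N)), and p(b) = 0.25. Under those assumptions the f matrix above gives the w matrix values to one decimal place.

```python
import math

BASES = "ACGT"

def pfm_to_pwm(pfm, background=None):
    """Convert a count matrix to log-odds weights.

    Assumptions (not stated on the slide): log base 2, pseudocount
    sqrt(N)/4 per base, uniform background unless one is supplied.
    """
    background = background or {b: 0.25 for b in BASES}
    pwm = {b: [] for b in BASES}
    for i in range(len(pfm["A"])):
        n = sum(pfm[b][i] for b in BASES)      # depth of this column
        pseudo = math.sqrt(n) / len(BASES)     # assumed s(n)
        for b in BASES:
            corrected = (pfm[b][i] + pseudo) / (n + math.sqrt(n))
            pwm[b].append(math.log2(corrected / background[b]))
    return pwm

def score_site(pwm, site):
    """Sum the weight of the observed base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(site))

f_matrix = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
w_matrix = pfm_to_pwm(f_matrix)
print([round(x, 1) for x in w_matrix["A"]])     # first row of the w matrix
print(round(score_site(w_matrix, "TGCTG"), 1))  # score of the example site
```

Working in log space is what makes the site score a simple sum over positions.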

Performance of Profiles

• 95% of predicted sites bound in vitro (Tronche 1997)

• MyoD binding sites predicted about once every 600 bp (Fickett 1995)

• The Futility Conjecture
  – Nearly 100% of predicted transcription factor binding sites have no function in vivo

JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES

PROBLEM: Too many spurious predictions

[Figure: predicted binding sites for "Actin, alpha cardiac"]

Terms

• Specificity – The proportion of predictions that are correct (strictly, this is the positive predictive value)

• Sensitivity – The proportion of “positives” that are detected

• The detection of TFBS is limited by terrible specificity. Why?

Method #1: Phylogenetic Footprinting

70,000,000 years of evolution reveals most regulatory regions

Phylogenetic Footprinting

[Figure: FoxC2 conservation plot, vertical axis from 0% to 100%]

Phylogenetic Footprinting to Identify Functional Segments

[Figure: % Identity (y-axis) against 200 bp Window Start Position in the human sequence (x-axis). Actin gene compared between human and mouse with DPB.]
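A percent-identity profile like the one plotted can be computed from any pairwise alignment. This is a minimal sketch, assuming two already-aligned strings of equal length with `-` for gaps. It is not the DPB program named on the slide, and it indexes windows by alignment column rather than by human sequence position.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Return (window_start, percent_identity) pairs along an alignment.

    A column counts as identical only when both sequences carry the
    same base; gap columns never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identity = 100.0 * sum(matches[start:start + window]) / window
        profile.append((start, identity))
    return profile

# Toy example with a short window; real analyses would use ~200 bp windows.
print(window_identity("ACGTACGT", "ACGTTTTT", window=4, step=4))
```

Windows whose identity exceeds a chosen cutoff are the candidate conserved segments in which predicted sites are retained.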

Phylogenetic Footprinting Dramatically Reduces Spurious Hits

[Figure: Human and Mouse tracks for "Actin, alpha cardiac"]

Performance: Human vs. Mouse

• Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)

• 75-90% of defined sites detected with conservation filter (SENSITIVITY), while only 11-16% of total predictions retained (SELECTIVITY)

ConSite (www.cisreg.ca)

NEW: Ortholog Sequence Retrieval Service

Emerging Issues

• Multiple sequence comparisons
  – Incorporate phylogenetic trees
  – Visualization

• Analysis of closely related species
  – Phylogenetic shadowing

• Genome rearrangements
  – Inversion compatible alignment algorithm

• Higher order models of TFBS

OnLine Resources for Phylogenetic Footprinting

• Linked to TFBS
  – ConSite
  – rVISTA

• Alignments
  – Blastz
  – Lagan
  – Avid
  – ORCA

• Visualization
  – Sockeye
  – Vista Browser
  – PipMaker

Method #2: Discrimination of Regulatory Modules

TFs do NOT act in isolation

Layers of Complexity in Metazoan Transcription

Diverse and non-uniform use of terms: Partial glossary for tutorial

• Promoter – Sufficient to support the initiation of transcription; orientation dependent; includes TSS

• Regulatory Regions
  – Proximal – adjacent to promoter
  – Distal – some distance away from promoter (vague)
  – May be positive (enhancing) or negative (repressing)

• TSS – transcription start site

• TFBS – single transcription factor binding site

• Modules – Sets of TFBS that function together

[Figure: gene schematic showing EXON, TATA, TSS and TFBS elements, labeled Promoter Region, Proximal Regulatory Region and Distal Regulatory Region (Distal R.R.)]

Detecting Clusters of TF Binding Sites

• Trained Methods
  – Sufficient examples of real clusters to establish weights on the relative importance of each TF

• Statistical Over-Representation of Combinations
  – Binding profiles available for a set of biologically motivated TFs

Training for the detection of liver cis-regulatory modules (CRMs)

Models for Liver TFs…

HNF1

C/EBP

HNF3

HNF4

Logistic Regression Analysis

“logit”

Optimize vector to maximize the distance between output values for positive and negative training data.

Output value is:

$p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}$
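The output value can be sketched as follows, assuming the logit is a weighted sum of per-TF scores for a sequence window. The TF names come from the liver model slide, but the weights, intercept and scores here are hypothetical placeholders, not fitted values from the tutorial.

```python
import math

def logistic(logit):
    """p = e^logit / (1 + e^logit), written to avoid overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)

def module_probability(tf_scores, weights, intercept=0.0):
    """Combine per-TF scores for one window into a single output value."""
    logit = intercept + sum(weights[tf] * tf_scores[tf] for tf in weights)
    return logistic(logit)

# Hypothetical weights and window scores, for illustration only.
weights = {"HNF1": 1.2, "C/EBP": 0.4, "HNF3": 0.8, "HNF4": 0.6}
window_scores = {"HNF1": 2.0, "C/EBP": 0.5, "HNF3": 1.0, "HNF4": 0.0}
print(module_probability(window_scores, weights, intercept=-3.0))
```

Training amounts to choosing the weight vector so that known modules score near 1 and negative training sequences score near 0.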

Performance of the Liver Model

• Performance
  – Sensitivity: 60% of known CRMs detected
  – Specificity: 1 prediction/35,000 bp

• Limitations
  – Applies to genes expressed late in hepatocyte differentiation
  – Requires 10-15 genes in positive training set
  – This model doesn’t account for multiple sites for the same TF
    • New methods from several groups address this limit

UGT1A1

[Figure: Liver Module Model Score (y-axis) against “Window” Position in Sequence (x-axis); legend: Wildtype, Other]

Making better predictions

• Profiles make far too many false predictions to have predictive value in isolation

• Phylogenetic footprinting eliminates ~90% of false predictions

• Algorithms for detection of clusters of binding sites perform better, especially when they can be trained on known examples for the target context

Method #3: Higher Order Models

Position-position dependence

What is a higher-order background model?

Zero-order: p(A)=0.29, p(C)=0.21, p(G)=0.21, p(T)=0.29

$P(\mathrm{seq}) = \prod_{i=1}^{N} P(\mathrm{nucleotide}_i)$

First-order: [Figure: diagram over the bases A, T, C and G]

mth-order: The chance of drawing base x is dependent on the identity of the previous m bases

Probabilistic Methods for Pattern Discovery
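The difference between a zero-order and an mth-order background can be made concrete with a short sketch. The zero-order frequencies are those on the slide; the training routine, the add-one pseudocount, and the choice to score the first m bases with the zero-order model are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

ZERO_ORDER = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}

def zero_order_log_prob(seq, freqs=ZERO_ORDER):
    """log P(seq) as a product of independent base probabilities."""
    return sum(math.log(freqs[base]) for base in seq)

def train_markov(training_seq, order=1, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from a training sequence."""
    counts = defaultdict(Counter)
    for i in range(order, len(training_seq)):
        counts[training_seq[i - order:i]][training_seq[i]] += 1

    def prob(context, base):
        seen = counts[context]
        return (seen[base] + pseudocount) / (sum(seen.values()) + 4 * pseudocount)

    return prob

def markov_log_prob(seq, prob, order=1, freqs=ZERO_ORDER):
    """Score the first `order` bases with the zero-order model, the rest in context."""
    log_p = zero_order_log_prob(seq[:order], freqs)
    for i in range(order, len(seq)):
        log_p += math.log(prob(seq[i - order:i], seq[i]))
    return log_p

prob = train_markov("ACACACAC", order=1)
print(math.exp(zero_order_log_prob("AC")))   # 0.29 * 0.21
print(math.exp(markov_log_prob("AC", prob)))  # 0.29 * P(C | A)
```

A higher-order background makes repetitive or compositionally biased words less surprising, which reduces false pattern calls.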

Linking co-expressed genes to candidate transcription factors

Deciphering Regulation of Co-Expressed Genes

oPOSSUM Procedure

1. Set of co-expressed genes
2. Automated sequence retrieval from EnsEMBL
3. Phylogenetic Footprinting (ORCA)
4. Detection of transcription factor binding sites
5. Statistical significance of binding sites
6. Putative mediating transcription factors

Statistical Methods for Identifying Over-represented TFBS

• Z scores
  – Based on the number of occurrences of the TFBS relative to background
  – Normalized for sequence length
  – Simple binomial distribution model

• Fisher exact probability scores
  – Based on the number of genes containing the TFBS relative to background
  – Hypergeometric probability distribution
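Both statistics can be sketched with the standard library. The exact normalization used by oPOSSUM is not given on the slide, so the formulas below are textbook versions under stated assumptions: the z-score treats each position as a binomial trial at the background hit rate, and the Fisher score is a one-tailed hypergeometric probability with the background gene set disjoint from the co-expressed set. All parameter names are hypothetical.

```python
import math

def binomial_z_score(fg_hits, fg_length, bg_hits, bg_length):
    """Z-score for site occurrences, normalized for sequence length."""
    rate = bg_hits / bg_length                     # background hits per position
    expected = fg_length * rate
    sd = math.sqrt(fg_length * rate * (1.0 - rate))
    return (fg_hits - expected) / sd

def fisher_one_tailed(fg_with, fg_total, bg_with, bg_total):
    """P(at least fg_with genes with a site) under the hypergeometric model."""
    population = fg_total + bg_total
    with_site = fg_with + bg_with
    denominator = math.comb(population, fg_total)
    upper = min(fg_total, with_site)
    numerator = sum(
        math.comb(with_site, k) * math.comb(population - with_site, fg_total - k)
        for k in range(fg_with, upper + 1)
    )
    return numerator / denominator

print(binomial_z_score(fg_hits=30, fg_length=1000, bg_hits=100, bg_length=10000))
print(fisher_one_tailed(fg_with=8, fg_total=10, bg_with=2, bg_total=10))
```

The two measures answer different questions: the z-score rewards many sites overall, while the Fisher score rewards sites spread across many genes.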

The oPOSSUM Database

• Orthologous genes: 8468

• Promoter pairs: 6911

• Promoters with TFBS: 6758

• Total # of TFBS predictions: 1638293

• Overall failure rate: 20.2%

Validation using Reference Gene Sets

TFs with experimentally-verified sites in the reference sets.

A. Muscle-specific (23 input; 16 analyzed)

  TF          Rank  Z-score  Fisher
  SRF            1  21.41    1.18e-02
  MEF2           2  18.12    8.05e-04
  c-MYB_1        3  14.41    1.25e-03
  Myf            4  13.54    3.83e-03
  TEF-1          5  11.22    2.87e-03
  deltaEF1       6  10.88    1.09e-02
  S8             7  5.874    2.93e-01
  Irf-1          8  5.245    2.63e-01
  Thing1-E47     9  4.485    4.97e-02
  HNF-1         10  3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)

  TF          Rank  Z-score  Fisher
  HNF-1          1  38.21    8.83e-08
  HLF            2  11.00    9.50e-03
  Sox-5          3  9.822    1.22e-01
  FREAC-4        4  7.101    1.60e-01
  HNF-3beta      5  4.494    4.66e-02
  SOX17          6  4.229    4.20e-01
  Yin-Yang       7  4.070    1.16e-01
  S8             8  3.821    1.61e-02
  Irf-1          9  3.477    1.69e-01
  COUP-TF       10  3.286    2.97e-01

Application to Microarray Data Sets

1. NF-κB inhibition microarray study

Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)

  TF         Class        Rank  Z-score  Fisher    No. Genes
  p65        REL             1  36.57    5.66e-12   62
  NF-kappaB  REL             2  32.58    5.82e-11   61
  c-REL      REL             3  26.02    8.59e-08   63
  Irf-2      TRP-CLUSTER     4  20.39    5.74e-04    6
  SPI-B      ETS             5  16.59    1.23e-03  135
  Irf-1      TRP-CLUSTER     6  15.4     9.55e-04   23
  Sox-5      HMG             7  15.38    2.56e-02  126
  p50        REL             8  14.72    2.23e-03   19
  Nkx        HOMEO           9  13.66    2.29e-03  111
  Bsap       PAIRED         10  13.2     9.92e-02    1
  FREAC-4    FORKHEAD       11  12.05    1.66e-03   92
  n-MYC      bHLH-ZIP       25  6.695    1.84e-03  102
  ARNT       bHLH           26  6.695    1.84e-03  102
  HNF-3beta  FORKHEAD       29  5.948    3.32e-03   47
  SOX17      HMG            31  5.406    8.60e-03   79

C-Myc SAGE Data

• c-Myc transcription factor dimerizes with the Max protein

• Key regulator of cell proliferation, differentiation and apoptosis

• Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells

• They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR

Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)

  TF        Class             Rank  Z-score  Fisher    No. Genes
  Myc-Max   bHLH-ZIP             1  21.68    5.35e-03    7
  Staf      ZN-FINGER, C2H2      2  20.17    1.70e-02    2
  Max       bHLH-ZIP             3  18.32    2.16e-02   12
  SAP-1     ETS                  4  13.23    1.61e-04   13
  USF       bHLH-ZIP             5  11.90    1.84e-01   16
  SP1       ZN-FINGER, C2H2      6  11.68    4.40e-02   12
  n-MYC     bHLH-ZIP             7  11.11    1.55e-01   20
  ARNT      bHLH                 8  11.11    1.55e-01   20
  Elk-1     ETS                  9  10.92    3.88e-03   19
  Ahr-ARNT  bHLH                10  10.17    1.11e-01   25

C-Fos Microarray Experiment

• In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line

• We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs

Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)

  TF               Class             Rank  Z-score  Fisher    No. Genes
  c-FOS            bZIP                 1  17.53    2.60e-05   45
  RREB-1           ZN-FINGER, C2H2      2  8.899    1.41e-01    1
  PPARgamma-RXRal  NUCLEAR RECEPTOR     3  3.991    2.98e-01    1
  CREB             bZIP                 4  3.626    1.25e-01   10
  E2F              Unknown              5  2.965    7.67e-02   15
  NF-kappaB        REL                  6  2.915    1.04e-01   17
  SRF              MADS                 7  2.707    2.24e-01    2
  MEF2             MADS                 8  2.634    1.32e-01   13
  c-REL            REL                  9  2.467    5.79e-02   22
  Staf             ZN-FINGER, C2H2     10  2.385    3.74e-01    1
  Ahr-ARNT         bHLH                15  1.716    2.57e-03   63
  deltaEF1         ZN-FINGER, C2H2     23  0.271    5.39e-03   75
  Elk-1            ETS                 21  0.7875   8.12e-03   37
  MZF_1-4          ZN-FINGER, C2H2     27  -0.2421  5.41e-03   73
  n-MYC            bHLH-ZIP            30  -0.8738  8.20e-03   51
  ARNT             bHLH                31  -0.8738  8.20e-03   51

oPOSSUM Server

http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum

INPUT A LIST OF CO-EXPRESSED GENES

SELECT YOUR TFBS PROFILES

SELECT:
1. CONSERVATION
2. PSSM MATCH THRESHOLD
3. PROMOTER REGION
4. STATISTICAL MEASURE

de novo Discovery of TF Binding Sites

Pattern Discovery

de novo Pattern Discovery

• Exhaustive – e.g. YMF (Sinha & Tompa)
  – Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections

• Monte Carlo/Gibbs Sampling – e.g. AnnSpec (Workman & Stormo)
  – Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

Exhaustive methods

Word based methods: How likely are X words in a set of sequences, given sequence characteristics?

CCCGCCGGAATGAAATCTGATTGACATTTTCC
>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP55011 (-) Ce hsp 16K-1 B; range

Over-representation

How many words of type ‘AGGAGTGA’ are found in our sequences?

$P(w \text{ begins in } i) = \prod_{j=1}^{k} p(a_j)$

$E(X_w) = (n - k + 1) \prod_{j=1}^{k} p(a_j)$

$Z_w = \frac{X_w - E(X_w)}{\sqrt{\mathrm{Var}(X_w)}}$

How likely is this result?
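The over-representation score for a single word can be sketched directly from the expectation formula, with k the word length and n the sequence length. The slide does not give the variance term; the sketch assumes a simple binomial variance that ignores overlaps between word occurrences, and the function names are hypothetical.

```python
import math

def word_probability(word, base_freqs):
    """P(w begins at a given position) = product of its base probabilities."""
    p = 1.0
    for base in word:
        p *= base_freqs[base]
    return p

def count_occurrences(word, seq):
    """Count occurrences of word in seq, including overlapping ones."""
    k = len(word)
    return sum(seq[i:i + k] == word for i in range(len(seq) - k + 1))

def word_z_score(word, seq, base_freqs):
    """Z_w = (X_w - E(X_w)) / sqrt(Var(X_w)), with an assumed binomial variance."""
    k, n = len(word), len(seq)
    p_w = word_probability(word, base_freqs)
    expected = (n - k + 1) * p_w
    variance = expected * (1.0 - p_w)   # assumption: independent start positions
    return (count_occurrences(word, seq) - expected) / math.sqrt(variance)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
promoter = "TTCAAATTTTAACGCCGGAATAATCTCCTATT"  # one upstream sequence from the example set
print(word_z_score("CCGGAAT", promoter, uniform))
```

In practice the score is computed for every word over the whole promoter collection, and words are ranked by Z.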

Exhaustive methods

Find all words of length 7 in the yeast genome

Make a lookup table:

  AAACCTTT  456
  TTTTTTTT  57788
  GATAGGCA  589
  Etc...

GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
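Building such a lookup table takes one pass over the sequence. A minimal sketch, assuming the genome is available as a plain uppercase string; `word_table` is a hypothetical helper name, and only the forward strand is counted.

```python
from collections import Counter

def word_table(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Start of the sequence excerpt shown above.
excerpt = "GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTC"
table = word_table(excerpt, 7)
for word, count in table.most_common(3):
    print(word, count)
```

For k = 7 there are at most 4^7 = 16,384 distinct words, so the table fits comfortably in memory even for a whole genome.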

Two data structures used:

1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and corresponding background frequencies p_{i,1}, ..., p_{i,4}

2) Current positions of site startpoints in the N sequences a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}.

One starting point in each sequence is chosen randomly initially.

The Gibbs Sampling algorithm

tgacttcctgatctctagacctcatgacctct

Iteration step

Remove one sequence z from the set. Update the current pattern according to

$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}$

where c_{i,j} is the count of symbol j in column i, b_j is the pseudocount for symbol j, and B is the sum of all pseudocounts in the column.

‘Score’ the current pattern against each possible occurrence a_k in z. Draw a new a_k with probabilities based on respective score divided by the background model.

[Figure: panels A and B showing sequence z, with example sequence tgacttcctgatctctagacctcatgacctct]
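The two steps of the iteration can be put together in a compact sampler. This is a teaching sketch, not AnnSpec or any published implementation: it assumes a uniform background, the same pseudocount for every symbol, uppercase ACGT input, a fixed number of sweeps instead of a convergence test, and one site per sequence.

```python
import random

BASES = "ACGT"

def gibbs_sample(seqs, width, iterations=100, pseudocount=0.5, seed=0):
    """Return one sampled site start per sequence."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in BASES}          # assumed uniform background
    total_pseudo = pseudocount * len(BASES)        # B in the update formula
    # One starting point in each sequence is chosen randomly initially.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):
            # Update step: rebuild q from every sequence except z.
            q = []
            for i in range(width):
                counts = {b: 0 for b in BASES}
                for k, seq in enumerate(seqs):
                    if k != z:
                        counts[seq[starts[k] + i]] += 1
                q.append({b: (counts[b] + pseudocount) / (len(seqs) - 1 + total_pseudo)
                          for b in BASES})
            # Sampling step: weight each candidate start in z by pattern / background.
            weights = []
            for a in range(len(seqs[z]) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seqs[z][a + i]
                    ratio *= q[i][base] / background[base]
                weights.append(ratio)
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["TGACTTCCTGATCT", "CTAGACCTCATGAC", "ACGTTGACTTCCAA"]
starts = gibbs_sample(seqs, width=6)
print([s[a:a + 6] for s, a in zip(seqs, starts)])
```

Because each new start is drawn at random rather than chosen greedily, the sampler can escape weak local patterns, but results vary with the seed and should be compared across several runs.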

Applied Pattern Discovery is Acutely Sensitive to Noise

[Figure: PATTERN SIMILARITY vs. TRUE MEF2 PROFILE (y-axis, 10 to 18) against SEQUENCE LENGTH (x-axis, 0 to 600); True Mef2 Binding Sites]

Four Approaches to Improve Sensitivity

• Better background models
  – Higher-order properties of DNA

• Phylogenetic Footprinting
  – Human:Mouse comparison eliminates ~75% of sequence

• Regulatory Modules
  – Architectural rules

• Limit the types of binding profiles allowed
  – TFBS patterns are NOT random

Information segmentation
Information content distributions of TFBS are distinctly non-random (Wasserman et al 2000)

Palindromicity, dyads (van Helden et al 2000)

Variable gaps (Hu 2003)

TFBSs are not randomly drawn

Enhancing pattern detection sensitivity

Pattern discovery methods using biochemical constraints

Some profile constraints have been explored…

• Segmentation of informative columns

• Palindromic patterns

Our Hypothesis

• Point 1: Structurally-related DNA binding domains interact with similar target sequences

• Exceptions exist (e.g. Zn-fingers)

• Point 2: There are a finite number of binding domains used in human TFs

• Approximately 20-25

• Idea: We could use the shared binding properties for each family to focus pattern detection methods

• Constrain the range of patterns sought

Comparison of profiles requires alignment and a scoring function

• Scoring function based on sum of squared differences

• Align frequency matrices with modified Needleman-Wunsch algorithm

• Calculate empirical p-values based on simulated set of matrices

[Figure: histogram of Frequency (y-axis) against Score (x-axis)]
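The scoring function can be illustrated with a simplified comparison. The sketch below uses the sum of squared differences between columns, but it replaces the modified Needleman-Wunsch alignment with a plain ungapped search over offsets, and it omits the empirical p-value step. Profiles are assumed to be lists of columns, each column a list of four base frequencies; all names are hypothetical.

```python
def column_ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns."""
    return sum((a - b) ** 2 for a, b in zip(col_a, col_b))

def best_ungapped_alignment(profile_a, profile_b, min_overlap=3):
    """Slide profile_a along profile_b; return (offset, mean SSD) of the best overlap.

    offset is the column of profile_b aligned to column 0 of profile_a.
    Lower mean SSD means more similar profiles. Returns None if no
    offset gives at least min_overlap aligned columns.
    """
    best = None
    for offset in range(-(len(profile_a) - min_overlap),
                        len(profile_b) - min_overlap + 1):
        pairs = [(profile_a[i], profile_b[i + offset])
                 for i in range(len(profile_a))
                 if 0 <= i + offset < len(profile_b)]
        if len(pairs) < min_overlap:
            continue
        score = sum(column_ssd(a, b) for a, b in pairs) / len(pairs)
        if best is None or score < best[1]:
            best = (offset, score)
    return best

# Toy profiles: profile_b is profile_a with one extra leading column.
profile_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
profile_b = [[0, 0, 0, 1]] + profile_a
print(best_ungapped_alignment(profile_a, profile_b))
```

A full implementation would also try the reverse complement of one profile and allow gaps, which is what the dynamic-programming alignment provides.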

Intra-family comparisons more similar than inter-family

TF Database (JASPAR)

COMPARE

Match to bHLH

Jackknife Test 87% correct

Independent Test Set 93% correct

FBPs enhance sensitivity of pattern detection

REVIEWING THE TOP POINTS

Orientation

Regulatory regions problem space

Sets of binding sites

  AATCACCA AATCACCA AATCACCA AATCACCA AATCTCCC AATCTCCG
  AATCACAC AATCATCA AATCTCAC AATCTCTG AGTCCCCA AATCCCGG
  AATCTGAG AATCCATA ATTCAGCC AATAACTT GATAACCT AATTAGAC
  GATTACAG GATTAGCG ATTCTTCC TATGAACA GATTAAAA AGACCCCA

Specificity profiles for binding sites

  A [ -2      0      -2      -0.415   0.585  -2      -2       2.088  -2      -2      -1       0.585 ]
  C [  1      0.585   0       0      -1      -2      -2      -2       2.088  -2       0.585   0.807 ]
  G [  0.585  0.322   0.807   1.585   1      -2       2      -2      -2       2.088  -2       0     ]
  T [  0.319  0.322   1      -2       0       2.088  -1      -2      -2      -2       1.459  -0.415 ]

Clusters of binding sites

Transcription factors

Transcription factor binding sites

Regulatory nucleotide sequences

[Figure: promoter schematic with URE and TATA elements, bound by URF and Pol-II]

Analysis of regulatory regions with TFBS

Detecting binding sites in a single sequence

Scanning a sequence against a PWM

  A [ -0.2284   0.4368  -1.5     -1.5     -1.5      0.4368  -1.5     -1.5     -0.2284   0.4368 ]
  C [ -0.2284  -0.2284  -1.5     -1.5      1.5128  -1.5     -0.2284  -1.5     -0.2284  -1.5    ]
  G [  1.2348   1.2348   2.1222   2.1222   0.4368   1.2348   1.5128   1.7457   1.7457  -1.5    ]
  T [  0.4368  -0.2284  -1.5     -1.5     -0.2284   0.4368   0.4368   0.4368  -1.5      1.7457 ]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

Abs_score = 13.4 (sum of column scores)

Sp1

Calculating the relative score

[Figure: the Sp1 matrix shown twice, with the highest and the lowest score in each column highlighted]

Max_score = 15.2 (sum of highest column scores)

Min_score = -10.3 (sum of lowest column scores)

$\mathrm{Rel\_score} = \frac{\mathrm{Abs\_score} - \mathrm{Min\_score}}{\mathrm{Max\_score} - \mathrm{Min\_score}} \cdot 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \cdot 100\% = 93\%$
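Scanning and relative scoring can be sketched together. The function names and the two-column toy matrix are hypothetical, only the forward strand is scanned, and the final line simply plugs the slide's stated scores (13.4, -10.3, 15.2) into the relative score formula.

```python
def abs_score(pwm, site):
    """Sum of the column scores for the bases in the site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

def score_bounds(pwm):
    """Return (Min_score, Max_score): sums of lowest and highest column scores."""
    width = len(pwm["A"])
    min_score = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    max_score = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    return min_score, max_score

def rel_score(score, min_score, max_score):
    """Rescale an absolute score to a 0-100% range."""
    return 100.0 * (score - min_score) / (max_score - min_score)

def scan(sequence, pwm, threshold=75.0):
    """Report (start, site, rel_score) for every window at or above threshold."""
    width = len(pwm["A"])
    min_score, max_score = score_bounds(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        rel = rel_score(abs_score(pwm, site), min_score, max_score)
        if rel >= threshold:
            hits.append((start, site, rel))
    return hits

# Toy two-column matrix (hypothetical) whose best site is "AG".
toy_pwm = {"A": [1, -1], "C": [-1, -1], "G": [-1, 1], "T": [-1, -1]}
print(scan("TAGA", toy_pwm, threshold=100.0))
print(round(rel_score(13.4, -10.3, 15.2)))  # the slide's worked example
```

Lowering the threshold increases sensitivity at the cost of many more hits, which is the trade-off the following slides illustrate.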

Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%

Ouch.

Low specificity of profiles:
• too many hits
• great majority not biologically significant

A dramatic improvement in the percentage of biologically significant detections

[Figure: scanning a single sequence, compared with scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions]

Analysis of regulatory regions with TFBS

Phylogenetic Footprints

Pattern Discovery

[Figure: plot with axis ticks from 10 to 18 and from 0 to 600]

Concluding Thoughts

• Bioinformatics is often constrained by our understanding of biochemistry rather than computational or statistical limitations

• Evolution has a powerful influence on the performance of many bioinformatics methods

• Computational predictions have value, but only if you understand the limitations of the methods
