
MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Date post: 16-Dec-2015
Page 1: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

MedGen 505

Gene Regulation Bioinformatics

Wyeth W. Wasserman

www.cisreg.ca

Page 2: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Overview

• TFBS Prediction with Motif Models

• Improving Specificity of Predictions

• Analysis of Sets of Co-Expressed and Co-Regulated Genes

Page 3: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Transcription Factor Binding Sites (over-simplified for pedagogical purposes)

[Figure labels: URE, TATA, URF, Pol-II]

Page 4: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Teaching a computer to find TFBS…

Page 5: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Laboratory Discovery of TFBS

[Figure: seven LUCIFERASE reporter constructs, shown against ACTIVITY]

Page 6: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Representing Binding Sites for a TF

• A set of sites represented as a consensus
  – VDRTWRWWSHD (IUPAC degenerate DNA)

• A matrix describing a set of sites

  A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
  C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
  G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
  T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

• A single site
  – AAGTTAATGA

Set of binding sites

AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA
GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA
AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA
AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA
AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA
AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA
AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
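A minimal sketch of how a count matrix can be tallied from aligned sites. The function name `count_matrix` is hypothetical, and the three example sites are simply the first three 10-mers of the site list above; the sketch is not meant to reproduce the 15-column matrix shown on the slide.

```python
def count_matrix(sites):
    """Return {base: [count at each position]} for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm


# First three 10-mers from the site list (illustrative subset only).
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA"]
pfm = count_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the result sums to the number of sites, which is the "depth" that later steps use to weight confidence in the pattern.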

Page 7: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

PFMs to PWMs

Add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

f matrix

  A  5 0 1 0 0
  C  0 2 2 4 0
  G  0 3 1 0 4
  T  0 0 1 1 1

w matrix

  A   1.6 -1.7 -0.2 -1.7 -1.7
  C  -1.7  0.5  0.5  1.3 -1.7
  G  -1.7  1.0 -0.2 -1.7  1.3
  T  -1.7 -1.7 -0.2 -0.2 -0.2

$$w(b,i) = \log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$$

Example score: TGCTG = 0.9
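The conversion can be sketched in a few lines. The slide does not spell out the pseudocount function s(n) or the log base, so this sketch makes three assumptions: a total pseudocount of sqrt(N) per column shared according to the background, a uniform background p(b) = 0.25, and log base 2. Under those assumptions the values come out matching the w matrix to one decimal place; the names `pfm_to_pwm` and `score` are hypothetical.

```python
import math


def pfm_to_pwm(pfm, background=None):
    """Convert a count matrix to log2-odds weights (assumed sqrt(N) pseudocounts)."""
    bases = list(pfm)
    background = background or {b: 0.25 for b in bases}
    width = len(pfm[bases[0]])
    pwm = {b: [] for b in bases}
    for i in range(width):
        n = sum(pfm[b][i] for b in bases)      # number of sites in column i
        total_pseudo = math.sqrt(n)            # assumption: pseudocount mass
        for b in bases:
            s = total_pseudo * background[b]
            corrected = (pfm[b][i] + s) / (n + total_pseudo)
            pwm[b].append(math.log2(corrected / background[b]))
    return pwm


def score(word, pwm):
    """Sum the column weights for a word of the same width as the matrix."""
    return sum(pwm[base][i] for i, base in enumerate(word))


pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
pwm = pfm_to_pwm(pfm)
print(round(score("TGCTG", pwm), 1))  # 0.9
```

Working in log space is what makes the third feature useful: scoring a candidate site is a sum of table lookups rather than a product of probabilities.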

Page 8: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Performance of Profiles

• 95% of predicted sites bound in vitro (Tronche 1997)

• MyoD binding sites predicted about once every 600 bp (Fickett 1995)

• The Futility Conjecture
  – Nearly 100% of predicted transcription factor binding sites have no function in vivo

Page 9: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

JASPAR

AN OPEN-ACCESS DATABASE

OF TF BINDING PROFILES

Page 10: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

PROBLEM: Too many spurious predictions

Actin, alpha cardiac

Page 11: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Terms

• Specificity – The portion of predictions that are correct

• Sensitivity – The portion of “positives” that are detected

• The detection of TFBS is limited by terrible specificity. Why?
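The two terms can be written as simple ratios. This is a sketch using the definitions given on this slide (the slide's "specificity" is the fraction of predictions that are correct); the counts in the example are hypothetical and are not taken from any study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real sites ("positives") that were detected."""
    return true_pos / (true_pos + false_neg)


def fraction_of_predictions_correct(true_pos, false_pos):
    """Fraction of predictions that are correct (the slide's "specificity")."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts: 30 of 40 real sites found among 1000 predictions.
print(sensitivity(true_pos=30, false_neg=10))
print(fraction_of_predictions_correct(true_pos=30, false_pos=970))
```

With numbers like these, sensitivity looks acceptable while almost every individual prediction is wrong, which is the situation the last bullet describes.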


Page 12: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Method #1: Phylogenetic Footprinting

70,000,000 years of evolution reveals most regulatory regions

Page 13: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Phylogenetic Footprinting

[Figure: FoxC2 plot; vertical axis from 0% to 100%]

Page 14: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Phylogenetic Footprinting to Identify Functional Segments

[Figure: % identity against 200 bp window start position (human sequence). Actin gene compared between human and mouse with DPB.]
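A sliding-window identity profile of this kind can be sketched as follows. The sketch assumes the two sequences have already been aligned (equal length, with "-" for gaps) and reports window starts in alignment columns rather than human sequence coordinates; it is not the DPB program, and `window_identity` is a hypothetical name.

```python
def window_identity(aligned_a, aligned_b, window=200, step=1):
    """Return [(start_column, percent_identity)] for each window of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        # Count matching columns; a gap aligned to a gap is not a match.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


print(window_identity("ACGTACGT", "ACGTTCGT", window=4, step=4))
# [(0, 100.0), (4, 75.0)]
```

Windows whose identity exceeds a chosen cutoff are the candidate functional segments; the cutoff itself is a tuning choice that the slide does not specify.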

Page 15: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Phylogenetic Footprinting Dramatically Reduces Spurious Hits

Human

Mouse

Actin, alpha cardiac

Page 16: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Performance: Human vs. Mouse

• Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)

• 75-90% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

SELECTIVITY SENSITIVITY

Page 17: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

ConSite (www.cisreg.ca)

NEW: Ortholog Sequence Retrieval Service

Page 18: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Emerging Issues

• Multiple sequence comparisons
  – Incorporate phylogenetic trees
  – Visualization

• Analysis of closely related species
  – Phylogenetic shadowing

• Genome rearrangements
  – Inversion compatible alignment algorithm

• Higher order models of TFBS

Page 19: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

OnLine Resources for Phylogenetic Footprinting

• Linked to TFBS
  – ConSite
  – rVISTA

• Alignments
  – Blastz
  – Lagan
  – Avid
  – ORCA

• Visualization
  – Sockeye
  – Vista Browser
  – PipMaker

Page 20: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Method #2: Discrimination of Regulatory Modules

TFs do NOT act in isolation

Page 21: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Layers of Complexity in Metazoan Transcription

Page 22: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Diverse and non-uniform use of terms: Partial glossary for tutorial

• Promoter – Sufficient to support the initiation of transcription; orientation dependent; includes TSS

• Regulatory Regions
  – Proximal – adjacent to promoter
  – Distal – some distance away from promoter (vague)
  – May be positive (enhancing) or negative (repressing)

• TSS – transcription start site

• TFBS – single transcription factor binding site

• Modules – Sets of TFBS that function together

[Figure: gene schematic with labels EXON, TFBS, TATA, TSS, Promoter Region, Proximal Regulatory Region, Distal Regulatory Region (Distal R.R.)]

Page 23: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Detecting Clusters of TF Binding Sites

• Trained Methods
  – Sufficient examples of real clusters to establish weights on the relative importance of each TF

• Statistical Over-Representation of Combinations
  – Binding profiles available for a set of biologically motivated TFs

Page 24: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Training for the detection of liver cis-regulatory modules (CRMs)

Page 25: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Models for Liver TFs…

HNF1

C/EBP

HNF3

HNF4

Page 26: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Logistic Regression Analysis

“logit”

Optimize vector to maximize the distance between output values for positive and negative training data.

Output value is:

$$p(x) = \frac{e^{\mathrm{logit}}}{1 + e^{\mathrm{logit}}}$$
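The output function can be sketched directly. The slide does not define how the logit is built, so this sketch assumes the usual logistic-regression form: an intercept plus a weighted sum of per-TF scores for a sequence window. The function name, weights and scores are hypothetical.

```python
import math


def crm_probability(tf_scores, weights, intercept=0.0):
    """Logistic output p(x) = e^logit / (1 + e^logit) for one sequence window."""
    # Assumption: the logit is a linear combination of per-TF scores.
    logit = intercept + sum(w * x for w, x in zip(weights, tf_scores))
    return math.exp(logit) / (1.0 + math.exp(logit))


# Hypothetical best-hit scores for four liver TF models in one window.
print(crm_probability([1.2, 0.4, 0.0, 2.1], weights=[0.8, 0.5, 0.3, 0.9], intercept=-2.0))
```

Training then amounts to choosing the weight vector so that windows from known modules score near 1 and background windows score near 0.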

Page 27: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Performance of the Liver Model

• Performance
  – Sensitivity: 60% of known CRMs detected
  – Specificity: 1 prediction/35,000 bp

• Limitations
  – Applies to genes expressed late in hepatocyte differentiation
  – Requires 10-15 genes in positive training set
  – This model doesn’t account for multiple sites for the same TF
    • New methods from several groups address this limit

Page 28: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

[Figure: UGT1A1. Liver Module Model Score against “Window” Position in Sequence; legend: Wildtype, Other]

Page 29: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Making better predictions

• Profiles make far too many false predictions to have predictive value in isolation

• Phylogenetic footprinting eliminates ~90% of false predictions

• Algorithms for detection of clusters of binding sites perform better, especially when it is possible to train on known examples for the target context

Page 30: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Method #3: Higher Order Models

Position-position dependence

Page 31: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

What is a higher-order background model?

Zero-order: p(A)=0.29, p(C)=0.21, p(G)=0.21, p(T)=0.29

$$P(\mathrm{seq}) = \prod_{i=1}^{N} P(\mathrm{nucleotide}_i)$$

First-order: [Figure: diagram with nodes labelled A, T, C, G]

m-th order: The chance of drawing base x is dependent on the identity of the previous m bases

Probabilistic Methods for Pattern Discovery(7)
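The difference between the two background models can be sketched as follows. The zero-order probabilities are the ones quoted on the slide; the first-order transition table is estimated from a training sequence with add-one pseudocounts, which is an assumption made for this illustration, as are all function names.

```python
import math
from collections import Counter, defaultdict

ZERO_ORDER = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}


def log_prob_zero_order(seq, probs=ZERO_ORDER):
    """log P(seq) when every base is drawn independently."""
    return sum(math.log(probs[base]) for base in seq)


def train_first_order(training_seq, pseudo=1.0):
    """Estimate P(current | previous) from adjacent base pairs."""
    counts = defaultdict(Counter)
    for prev, cur in zip(training_seq, training_seq[1:]):
        counts[prev][cur] += 1
    model = {}
    for prev in "ACGT":
        total = sum(counts[prev].values()) + 4 * pseudo
        model[prev] = {cur: (counts[prev][cur] + pseudo) / total for cur in "ACGT"}
    return model


def log_prob_first_order(seq, model, start_probs=ZERO_ORDER):
    """log P(seq) when each base depends on the one before it."""
    logp = math.log(start_probs[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(model[prev][cur])
    return logp


model = train_first_order("ACGTTTTTAAAACGCGTTTT")  # hypothetical training sequence
print(log_prob_zero_order("TTTT"), log_prob_first_order("TTTT", model))
```

An m-th order model generalises this by keying the table on the previous m bases instead of one.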

Page 32: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Linking co-expressed genes to candidate transcription factors

Page 33: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Deciphering Regulation of Co-Expressed Genes

Page 34: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

oPOSSUM Procedure

1. Set of co-expressed genes
2. Automated sequence retrieval from EnsEMBL
3. Phylogenetic Footprinting
4. Detection of transcription factor binding sites
5. Statistical significance of binding sites
6. Putative mediating transcription factors

Slide label: ORCA

Page 35: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Statistical Methods for Identifying Over-represented TFBS

• Z scores
  – Based on the number of occurrences of the TFBS relative to background
  – Normalized for sequence length
  – Simple binomial distribution model

• Fisher exact probability scores
  – Based on the number of genes containing the TFBS relative to background
  – Hypergeometric probability distribution
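Both measures can be sketched with the standard library. The slide gives only the outline, so the details here are assumptions: the Z score uses a plain binomial model over the number of positions scanned, and the Fisher score is a one-sided hypergeometric tail in which the background gene count includes the target set. Function and parameter names are hypothetical and this is not the oPOSSUM implementation.

```python
import math


def z_score(observed_hits, n_positions, background_rate):
    """Binomial Z score: hits in the target set versus the background hit rate."""
    expected = n_positions * background_rate
    sd = math.sqrt(n_positions * background_rate * (1.0 - background_rate))
    return (observed_hits - expected) / sd


def fisher_one_sided(genes_total, genes_with_site, target_total, target_with_site):
    """P(at least target_with_site genes carry the site) under the hypergeometric."""
    denom = math.comb(genes_total, target_total)
    upper = min(genes_with_site, target_total)
    p = 0.0
    for k in range(target_with_site, upper + 1):
        p += (math.comb(genes_with_site, k)
              * math.comb(genes_total - genes_with_site, target_total - k)) / denom
    return p


# Hypothetical numbers for illustration only.
print(z_score(observed_hits=30, n_positions=1000, background_rate=0.01))
print(fisher_one_sided(genes_total=10, genes_with_site=5, target_total=4, target_with_site=4))
```

The two scores answer different questions: the Z score rewards many sites overall, even if concentrated in a few genes, while the Fisher score rewards sites spread across many genes.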

Page 36: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

The oPOSSUM Database

• Orthologous genes:  8468

• Promoter pairs:  6911

• Promoters with TFBS:  6758

• Total # of TFBS predictions:  1638293

• Overall failure rate:  20.2%

Page 37: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Validation using Reference Gene Sets

TFs with experimentally-verified sites in the reference sets.

A. Muscle-specific (23 input; 16 analyzed)

TF           Rank  Z-score  Fisher
SRF          1     21.41    1.18e-02
MEF2         2     18.12    8.05e-04
c-MYB_1      3     14.41    1.25e-03
Myf          4     13.54    3.83e-03
TEF-1        5     11.22    2.87e-03
deltaEF1     6     10.88    1.09e-02
S8           7     5.874    2.93e-01
Irf-1        8     5.245    2.63e-01
Thing1-E47   9     4.485    4.97e-02
HNF-1        10    3.353    2.93e-01

B. Liver-specific (20 input; 12 analyzed)

TF           Rank  Z-score  Fisher
HNF-1        1     38.21    8.83e-08
HLF          2     11.00    9.50e-03
Sox-5        3     9.822    1.22e-01
FREAC-4      4     7.101    1.60e-01
HNF-3beta    5     4.494    4.66e-02
SOX17        6     4.229    4.20e-01
Yin-Yang     7     4.070    1.16e-01
S8           8     3.821    1.61e-02
Irf-1        9     3.477    1.69e-01
COUP-TF      10    3.286    2.97e-01

Page 38: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Application to Microarray Data Sets

1. NF-κB inhibition microarray study

Page 39: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Genes Significantly Down-regulated by the NF-κB inhibitor (326 input; 179 analyzed)

TF Class Rank Z-score Fisher No. Genes

p65 REL 1 36.57 5.66e-12 62

NF-kappaB REL 2 32.58 5.82e-11 61

c-REL REL 3 26.02 8.59e-08 63

Irf-2 TRP-CLUSTER 4 20.39 5.74e-04 6

SPI-B ETS 5 16.59 1.23e-03 135

Irf-1 TRP-CLUSTER 6 15.4 9.55e-04 23

Sox-5 HMG 7 15.38 2.56e-02 126

p50 REL 8 14.72 2.23e-03 19

Nkx HOMEO 9 13.66 2.29e-03 111

Bsap PAIRED 10 13.2 9.92e-02 1

FREAC-4 FORKHEAD 11 12.05 1.66e-03 92

n-MYC bHLH-ZIP 25 6.695 1.84e-03 102

ARNT bHLH 26 6.695 1.84e-03 102

HNF-3beta FORKHEAD 29 5.948 3.32e-03 47

SOX17 HMG 31 5.406 8.60e-03 79

Page 40: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

C-Myc SAGE Data

• c-Myc transcription factor dimerizes with the Max protein

• Key regulator of cell proliferation, differentiation and apoptosis

• Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells

• They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR

Page 41: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Induced Genes after Ectopic Expression of c-Myc (SAGE) (53 input; 36 analyzed)

TF Class Rank Z-score Fisher No. Genes

Myc-Max bHLH-ZIP 1 21.68 5.35e-03 7

Staf ZN-FINGER, C2H2 2 20.17 1.70e-02 2

Max bHLH-ZIP 3 18.32 2.16e-02 12

SAP-1 ETS 4 13.23 1.61e-04 13

USF bHLH-ZIP 5 11.90 1.84e-01 16

SP1 ZN-FINGER, C2H2 6 11.68 4.40e-02 12

n-MYC bHLH-ZIP 7 11.11 1.55e-01 20

ARNT bHLH 8 11.11 1.55e-01 20

Elk-1 ETS 9 10.92 3.88e-03 19

Ahr-ARNT bHLH 10 10.17 1.11e-01 25

Page 42: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

C-Fos Microarray Experiment

• In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. compared the gene expression profiles of fibroblasts transformed by c-fos to the parental 208F rat fibroblast cell line

• We mapped the list of 252 induced Affymetrix Rat Genome U34A GeneChip sequences to 136 human orthologs

Page 43: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Induced Genes after Ectopic Expression of c-Fos (Affymetrix) (136 input; 86 analyzed)

TF Class Rank Z-score Fisher No. Genes

c-FOS bZIP 1 17.53 2.60e-05 45

RREB-1 ZN-FINGER, C2H2 2 8.899 1.41e-01 1

PPARgamma-RXRal NUCLEAR RECEPTOR 3 3.991 2.98e-01 1

CREB bZIP 4 3.626 1.25e-01 10

E2F Unknown 5 2.965 7.67e-02 15

NF-kappaB REL 6 2.915 1.04e-01 17

SRF MADS 7 2.707 2.24e-01 2

MEF2 MADS 8 2.634 1.32e-01 13

c-REL REL 9 2.467 5.79e-02 22

Staf ZN-FINGER, C2H2 10 2.385 3.74e-01 1

Ahr-ARNT bHLH 15 1.716 2.57e-03 63

deltaEF1 ZN-FINGER, C2H2 23 0.271 5.39e-03 75

Elk-1 ETS 21 0.7875 8.12e-03 37

MZF_1-4 ZN-FINGER, C2H2 27 -0.2421 5.41e-03 73

n-MYC bHLH-ZIP 30 -0.8738 8.20e-03 51

ARNT bHLH 31 -0.8738 8.20e-03 51

Page 44: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

oPOSSUM Server

Page 45: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum

INPUT A LIST OF CO-EXPRESSED GENES

Page 46: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

SELECT YOUR TFBS PROFILES

Page 47: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

SELECT:

1. CONSERVATION
2. PSSM MATCH THRESHOLD
3. PROMOTER REGION
4. STATISTICAL MEASURE

Page 48: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

de novo Discovery of TF Binding Sites

Page 49: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Pattern Discovery

Page 50: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

de novo Pattern Discovery

• Exhaustive – e.g. YMF (Sinha & Tompa)
  – Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections

• Monte Carlo/Gibbs Sampling – e.g. AnnSpec (Workman & Stormo)
  – Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

Page 51: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Exhaustive methods

Word based methods: How likely are X words in a set of sequences, given sequence characteristics?

CCCGCCGGAATGAAATCTGATTGACATTTTCC
>EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
TTCAAATTTTAACGCCGGAATAATCTCCTATT
>EP63009 (+) Ce Cuticle Col-12; range -100 to -75
TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
>EP63010 (+) Ce Cuticle Col-13; range -100 to -75
TATCGTCATTCTCCGCCTCTTTTCTT
>EP11013 (+) Ce vitellogenin 2; range -100 to -75
GCTTATCAATGCGCCCGGAATAAAACGCTATA
>EP11014 (+) Ce vitellogenin 5; range -100 to -75
CATTGACTTTATCGAATAAATCTGTT
>EP11015 (-) Ce vitellogenin 4; range -100 to -75
ATCTATTTACAATGATAAAACTTCAA
>EP11016 (+) Ce vitellogenin 6; range -100 to -75
ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
>EP11017 (+) Ce calmodulin cal-2; range -100 to -75
TTTCAAATCCGGAATTTCCACCCGGAATTACT
>EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
>EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
ACTGAACTTGTCTTCAAATTTCAACACCGGAA
>EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
>EP55011 (-) Ce hsp 16K-1 B; range

Page 52: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Over-representation

How many words of type ‘AGGAGTGA’ are found in our sequences?

$$P(w \text{ begins in } i) = \prod_{j=1}^{k} p(a_j)$$

$$E(X_w) = (n - k + 1) \prod_{j=1}^{k} p(a_j)$$

$$Z_w = \frac{X_w - E(X_w)}{\sqrt{\mathrm{Var}(X_w)}}$$

How likely is this result?

Exhaustive methods
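The over-representation calculation can be sketched as below, for a word of length k in a sequence of length n under independent base probabilities. The slide does not give the variance term, so this sketch assumes a simple binomial variance that ignores overlapping occurrences of the word; the function names and the example numbers are hypothetical.

```python
import math


def word_probability(word, base_probs):
    """P(word begins at a given position) = product of its base probabilities."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p


def expected_count(word, seq_length, base_probs):
    """E(X_w) = (n - k + 1) * product of base probabilities."""
    return (seq_length - len(word) + 1) * word_probability(word, base_probs)


def word_z_score(observed, word, seq_length, base_probs):
    """Z score of the observed count (assumed binomial variance, no overlap term)."""
    positions = seq_length - len(word) + 1
    p = word_probability(word, base_probs)
    variance = positions * p * (1.0 - p)
    return (observed - positions * p) / math.sqrt(variance)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_count("AGGAGTGA", 100000, uniform))
print(word_z_score(12, "AGGAGTGA", 100000, uniform))
```

Words that overlap themselves, such as a run of a single base, have a larger true variance than this approximation gives, so their Z scores would be overstated here.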

Page 53: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Exhaustive methods

Find all words of length 7 in the yeast genome

Make a lookup table:

AAACCTTT 456
TTTTTTTT 57788
GATAGGCA 589

Etc...

GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA
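A lookup table of this kind can be sketched with a counter over every overlapping word. The example input is only the opening stretch of the sequence shown above, and `word_counts` is a hypothetical name; a whole-genome table would be built the same way, one chromosome at a time.

```python
from collections import Counter


def word_counts(sequence, k):
    """Count every overlapping word of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


table = word_counts("GTCTTTATCTTCAAAGTTGTCTGTCC", 7)
for word, count in table.most_common(3):
    print(word, count)
```

Once the table exists, the observed count for any word is a single dictionary lookup, which is what makes the exhaustive comparison against expected counts practical.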

Page 54: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

The Gibbs Sampling algorithm

Two data structures used:

1) Current pattern nucleotide frequencies $q_{i,1}, \ldots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \ldots, p_{i,4}$

2) Current positions of site start points in the N sequences $a_1, \ldots, a_N$, i.e. the alignment that contributes to $q_{i,j}$.

One starting point in each sequence is chosen randomly initially.

[Figure: example sequence tgacttcctgatctctagacctcatgacctct]

Probabilistic Methods for Pattern Discovery

Page 55: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

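Iteration step

A. Remove one sequence z from the set. Update the current pattern according to

$$q_{i,j} = \frac{c_{i,j} + b_j}{N - 1 + B}$$

where $c_{i,j}$ is the count of symbol j at pattern position i in the remaining N - 1 sites, $b_j$ is the pseudocount for symbol j, and $B$ is the sum of all pseudocounts in the column.

B. ‘Score’ the current pattern against each possible occurrence $a_k$ in z. Draw a new $a_k$ with probabilities based on the respective score divided by the background model.

[Figure: example sequence tgacttcctgatctctagacctcatgacctct, with sequence z marked]

Probabilistic Methods for Pattern Discovery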
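The two steps can be sketched as a small site sampler. This is a simplified illustration under stated assumptions: a fixed pattern width, exactly one site per sequence, a uniform background, equal pseudocounts for every base, and a fixed number of sweeps instead of a convergence test. The function names and toy sequences are hypothetical, and the result depends on the random seed.

```python
import random

ALPHABET = "acgt"


def column_freqs(seqs, starts, width, skip, pseudo=0.5):
    """q[i][j] = (c_ij + b_j) / (N - 1 + B), leaving sequence `skip` out."""
    n_used = len(seqs) - 1
    total_pseudo = pseudo * len(ALPHABET)  # B, the sum of pseudocounts
    q = []
    for i in range(width):
        counts = {b: 0 for b in ALPHABET}
        for k, (seq, a) in enumerate(zip(seqs, starts)):
            if k != skip:
                counts[seq[a + i]] += 1
        q.append({b: (counts[b] + pseudo) / (n_used + total_pseudo) for b in ALPHABET})
    return q


def gibbs(seqs, width, sweeps=50, seed=0, background=None):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    bg = background or {b: 0.25 for b in ALPHABET}
    # One starting point in each sequence is chosen randomly.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for z in range(len(seqs)):
            q = column_freqs(seqs, starts, width, skip=z)  # step A
            weights = []
            for a in range(len(seqs[z]) - width + 1):      # step B
                ratio = 1.0
                for i in range(width):
                    base = seqs[z][a + i]
                    ratio *= q[i][base] / bg[base]
                weights.append(ratio)
            # Draw the new start in proportion to pattern/background ratio.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


toy = ["tgacttcctgatctct", "agacctcatgacctct", "ccatgacctgatagga"]
print(gibbs(toy, width=6))
```

Drawing the new position at random, rather than always taking the best-scoring one, is what lets the sampler escape a poor initial alignment.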

Page 56: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Applied Pattern Discovery is Acutely Sensitive to Noise

[Figure: True Mef2 Binding Sites. PATTERN SIMILARITY vs. TRUE MEF2 PROFILE (vertical axis, 10 to 18) against SEQUENCE LENGTH (horizontal axis, 0 to 600)]

Page 57: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Four Approaches to Improve Sensitivity

• Better background models
  – Higher-order properties of DNA

• Phylogenetic Footprinting
  – Human:Mouse comparison eliminates ~75% of sequence

• Regulatory Modules
  – Architectural rules

• Limit the types of binding profiles allowed
  – TFBS patterns are NOT random

Page 58: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Information segmentation

Information content distributions of TFBS are distinctly non-random

(Wasserman et al 2000)

Palindromicity, dyads (van Helden et al 2000)

Variable gaps (Hu 2003)

TFBSs are not randomly drawn

Enhancing pattern detection sensitivity

Page 59: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Pattern discovery methods using biochemical constraints

Page 60: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Some profile constraints have been explored…

• Segmentation of informative columns

• Palindromic patterns

Page 61: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Our Hypothesis

• Point 1: Structurally-related DNA binding domains interact with similar target sequences

• Exceptions exist (e.g. Zn-fingers)

• Point 2: There are a finite number of binding domains used in human TFs

• Approximately 20-25

• Idea: We could use the shared binding properties for each family to focus pattern detection methods

• Constrain the range of patterns sought

Page 62: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Comparison of profiles requires alignment and a scoring function

• Scoring function based on sum of squared differences

• Align frequency matrices with modified Needleman-Wunsch algorithm

• Calculate empirical p-values based on simulated set of matrices
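The column-level part of such a scoring function can be sketched as a sum of squared differences between two frequency columns. Only that part is shown: aligning whole matrices with a Needleman-Wunsch variant and deriving empirical p-values are not covered, and the function names are hypothetical.

```python
def normalise(counts):
    """Turn a column of base counts into frequencies that sum to 1."""
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}


def column_ssd(col_a, col_b):
    """Sum of squared differences between two frequency columns (0 = identical)."""
    return sum((col_a[b] - col_b[b]) ** 2 for b in "ACGT")


col1 = normalise({"A": 14, "C": 3, "G": 4, "T": 0})
col2 = normalise({"A": 16, "C": 0, "G": 3, "T": 2})
print(column_ssd(col1, col2))
```

For normalised columns the value ranges from 0 (identical) to 2 (each column puts all its weight on a different base), so it can be turned into a similarity by subtracting it from 2.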

[Figure: Frequency plotted against Score]

Page 63: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Intra-family comparisons more similar than inter-family

TF Database(JASPAR)

COMPARE

Match to bHLH

Jackknife Test 87% correct

Independent Test Set 93% correct

Page 64: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Page 65: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

FBPs enhance sensitivity of pattern detection

Page 66: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .
Page 67: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

REVIEWING THE TOP POINTS

Page 68: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Orientation

Regulatory regions problem space

Sets of binding sites

AATCACCAAATCACCAAATCACCAAATCACCAAATCTCCCAATCTCCGAATCACACAATCATCAAATCTCACAATCTCTGAGTCCCCAAATCCCGGAATCTGAGAATCCATAATTCAGCCAATAACTTGATAACCTAATTAGACGATTACAGGATTAGCGATTCTTCCTATGAACAGATTAAAAAGACCCCA

Specificity profiles for binding sites

A [ -2     0     -2     -0.415  0.585 -2     -2      2.088 -2     -2     -1      0.585 ]
C [  1     0.585  0      0     -1     -2     -2     -2      2.088 -2      0.585  0.807 ]
G [  0.585 0.322  0.807  1.585  1     -2      2     -2     -2      2.088 -2      0     ]
T [  0.319 0.322  1     -2      0      2.088 -1     -2     -2     -2      1.459 -0.415 ]

Clusters of binding sites

Transcription factors

Transcription factor binding sites

Regulatory nucleotide sequences

[Figure labels: URE, TATA, URF, Pol-II]

Page 69: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Analysis of regulatory regions with TFBS

Detecting binding sites in a single sequence

Scanning a sequence against a PWM

Sp1

A [ -0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]
C [ -0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]
G [  1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]
T [  0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]

ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

Abs_score = 13.4 (sum of column scores)

Calculating the relative score

(The slide repeats the matrix twice, once with the highest and once with the lowest score in each column highlighted.)

Max_score = 15.2 (sum of highest column scores)

Min_score = -10.3 (sum of lowest column scores)

$$\mathrm{Rel\_score} = \frac{\mathrm{Abs\_score} - \mathrm{Min\_score}}{\mathrm{Max\_score} - \mathrm{Min\_score}} \cdot 100\% = \frac{13.4 - (-10.3)}{15.2 - (-10.3)} \cdot 100\% = 93\%$$

Scanning 1300 bp of human insulin receptor gene with Sp1 at rel_score threshold of 75%

Ouch.
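Scanning with a relative-score threshold can be sketched as follows. The `rel_score` function follows the formula on this slide and is checked against the slide's own numbers; the `scan` function and the small five-column toy matrix are hypothetical stand-ins used for illustration, not the Sp1 profile.

```python
def rel_score(abs_score, min_score, max_score):
    """Relative score in percent: position of abs_score between min and max."""
    return 100.0 * (abs_score - min_score) / (max_score - min_score)


def scan(sequence, pwm, threshold=75.0):
    """Return (start, word, rel_score) for every window at or above the threshold."""
    width = len(next(iter(pwm.values())))
    max_score = sum(max(pwm[b][i] for b in pwm) for i in range(width))
    min_score = sum(min(pwm[b][i] for b in pwm) for i in range(width))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        abs_score = sum(pwm[base][i] for i, base in enumerate(window))
        rel = rel_score(abs_score, min_score, max_score)
        if rel >= threshold:
            hits.append((start, window, round(rel, 1)))
    return hits


print(round(rel_score(13.4, -10.3, 15.2)))  # 93, as on the slide

# Hypothetical toy matrix (5 columns), not a real TF profile.
toy_pwm = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}
print(scan("TTAGCCGTT", toy_pwm))
```

Lowering the threshold raises the number of hits quickly, which is the behaviour the slide's 1300 bp insulin receptor example is meant to show. A full scanner would also score the reverse strand, which this sketch omits.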

Page 70: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

Low specificity of profiles:
• too many hits
• great majority not biologically significant

A dramatic improvement in the percentage of biologically significant detections

Scanning a single sequence

Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions

Analysis of regulatory regions with TFBS

Phylogenetic Footprints

Page 71: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Pattern Discovery

[Figure: plot with axis ticks 10 to 18 and 0 to 600]

Page 72: MedGen 505 Gene Regulation Bioinformatics Wyeth W. Wasserman .

CMMT

Concluding Thoughts

• Bioinformatics is often constrained by our understanding of biochemistry rather than computational or statistical limitations

• Evolution has a powerful influence on the performance of many bioinformatics methods

• Computational predictions have value, but only if you understand the limitations of the methods

