CS262 Lecture 17, Win07, Batzoglou Gene Regulation and Microarrays.

Date post: 20-Dec-2015

CS262 Lecture 17, Win07, Batzoglou

Gene Regulation and Microarrays


Overview

• A. Gene Expression and Regulation

• B. Measuring Gene Expression: Microarrays

• C. Finding Regulatory Motifs


Cells respond to environment

Cell responds to environment: various external messages

Genome is fixed – Cells are dynamic

• A genome is static

Every cell in our body has a copy of same genome

• A cell is dynamic

Responds to external conditions

Most cells follow a cell cycle of division

• Cells differentiate during development

• Gene expression varies according to:

Cell type

Cell cycle

External conditions

Location

slide credits: M. Kellis


Where gene regulation takes place

• Opening of chromatin

• Transcription

• Translation

• Protein stability

• Protein modifications


Transcriptional Regulation

• Efficient place to regulate:

No energy wasted making intermediate products

• However, slowest response time

After a receptor notices a change:

1. Cascade message to nucleus

2. Open chromatin & bind transcription factors

3. Recruit RNA polymerase and transcribe

4. Splice mRNA and send to cytoplasm

5. Translate into protein


Transcription Factors Binding to DNA

Transcription regulation:

• Transcription factors bind DNA

• Binding recognizes DNA substrings:

• Regulatory motifs


Promoter and Enhancers

• Promoter necessary to start transcription

• Enhancers can affect transcription from afar

Gene Regulation with TFs

[Figure, three animation frames: DNA carrying a Regulatory Element and a Gene. Labels: Transcription Factor (Protein), Regulatory Element, Gene, RNA polymerase, New protein.]

[Figure: a long stretch of raw genomic DNA sequence, with annotated regions: Promoter motifs, 3’ UTR motifs, Exons, Introns.]


Example: A Human heat shock protein

• TATA box: positioning transcription start

• TATA, CCAAT: constitutive transcription

• GRE: glucocorticoid response

• MRE: metal response

• HSE: heat shock element

[Figure: promoter of heat shock hsp70, drawn from position -158 to 0 upstream of the GENE. Labeled elements: TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1.]


The Cell as a Regulatory Network

• Genes = wires

• Motifs = gates

[Figure: two genes drawn as logic circuits]

gene D (inputs A, B, C): If C then D; If B then NOT D; If A and B then D → Make D

gene B (input D): If D then B → Make B


The Cell as a Regulatory Network (2)


DNA Microarrays

Measuring gene transcription in a high-throughput fashion


What is a microarray


• A 2D array of DNA sequences from thousands of genes

• Each spot has many copies of same gene

• Measure number of hybridizations per spot

Result:

• Thousands of “experiments” (one per gene) in one go

• Perform many microarrays for different conditions:

Time during cell cycle

Temperature

Nutrient level


Goal of Microarray Experiments

• Measure level of gene expression across many different conditions:

Expression Matrix $M$: {genes} × {conditions}:

$M_{ij} = |\text{gene}_i|$ in $\text{condition}_j$

• Group genes into coregulated sets

Observe cells under different conditions

Find genes with similar expression profiles

• Potentially regulated by same TF

slide credits: M. Kellis


Clustering vs. Classification

• Clustering

Idea: Groups of genes that share similar function have similar expression patterns

• Hierarchical clustering

• k-means

• Bayesian approaches

• Projection techniques: Principal Component Analysis, Independent Component Analysis

• Classification

Idea: A cell can be in one of several states (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal)

Can we train an algorithm to use the gene expression patterns to determine which state a cell is in?

• Support Vector Machines

• Decision Trees

• Neural Networks

• K-Nearest Neighbors


Clustering Algorithms

• Hierarchical

• K-means

[Figure: the same eight points a, b, c, d, e, f, g, h grouped two ways: by a hierarchical tree, and by K-means around centers c1, c2, c3.]

slide credits: M. Kellis


Hierarchical clustering

• Bottom-up algorithm:

Initialization: each point in a separate cluster

• At each step:

Choose the pair of closest clusters

Merge

• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y

• Avoids the problem of specifying the number of clusters


slide credits: M. Kellis


Distance between clusters

• $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$

Single-link method

• $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$

Complete-link method

• $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$

Average-link method

• $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$

Centroid method


slide credits: M. Kellis
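The bottom-up procedure and the four cluster distances above can be sketched in a few lines. This is a minimal illustration, not the CLUSTER implementation: the function names `cluster_distance` and `agglomerate`, the use of Euclidean distance for D, and the stopping rule (a target number of clusters) are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

def cluster_distance(X, Y, method="single"):
    """CD(X, Y) between two clusters (lists of points) for the four linkage rules."""
    if method == "centroid":
        cx = [sum(c) / len(X) for c in zip(*X)]
        cy = [sum(c) / len(Y) for c in zip(*Y)]
        return dist(cx, cy)
    pairwise = [dist(x, y) for x in X for y in Y]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")

def agglomerate(points, n_clusters, method="single"):
    """Start with one cluster per point; repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # combinations() yields i < j, so deleting index j after the merge is safe
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, n_clusters=3, method="average")
```

Recording the sequence of merges instead of stopping at `n_clusters` would give the full tree, which is why the hierarchical method does not need the number of clusters in advance.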


Results of Clustering Gene Expression

• CLUSTER is simple and easy to use

• De facto standard for microarray analysis

Time: $O(N^2 M)$

N: # genes

M: # conditions


K-Means Clustering Algorithm

• Each cluster $X_i$ has a center $c_i$

• Define the clustering cost criterion

• $\mathrm{COST}(X_1, \dots, X_k) = \sum_{i=1}^{k} \sum_{x \in X_i} |x - c_i|^2$

• Algorithm tries to find clusters $X_1 \dots X_k$ and centers $c_1 \dots c_k$ that minimize COST

• K-means algorithm:

Initialize centers

Repeat:

• Compute best clusters for given centers

• → Attach each point to the closest center

• Compute best centers for given clusters

• → Choose the centroid of points in cluster

Until the changes in COST are “small”


slide credits: M. Kellis
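The loop above can be written out directly. The sketch below is one possible reading, under stated assumptions: centers are initialized by sampling K data points, distance is Euclidean, and “small” is a hypothetical tolerance `tol` on the drop in COST. The function name `kmeans` and all parameter names are illustrative.

```python
import random
from math import dist

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Alternate between assigning points to centers and recomputing centroids."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # assumption: initialize from the data
    prev_cost = float("inf")
    for _ in range(max_iter):
        # Best clusters for given centers: attach each point to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Best centers for given clusters: centroid of the points in each cluster.
        for c, members in enumerate(clusters):
            if members:                      # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
        cost = sum(dist(p, centers[c]) ** 2
                   for c, members in enumerate(clusters) for p in members)
        if prev_cost - cost < tol:           # change in COST is "small"
            break
        prev_cost = cost
    return centers, clusters, cost

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters, cost = kmeans(points, k=2)
```

Neither step can increase COST, so the loop terminates, but only at a local minimum; rerunning with several seeds is a common precaution.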

K-Means Algorithm

• Randomly initialize clusters

• Assign data points to nearest clusters

• Recalculate clusters

• Repeat … until convergence

Time: $O(KNM)$ per iteration

N: # genes

M: # conditions


Mixture of Gaussians – Probabilistic K-means

• Data is modeled as mixture of K Gaussians $N(\mu_1, \sigma^2 I), \dots, N(\mu_K, \sigma^2 I)$

Prior probabilities $\alpha_1, \dots, \alpha_K$

• Different $\sigma_i$ for every Gaussian $i$, or even different covariance matrices are possible, but learning becomes harder

$P(x) = \sum_{i} P(x \mid N(\mu_i, \sigma^2 I))\, \alpha_i$

Use EM to learn parameters
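A minimal sketch of EM for this model, assuming a single known σ shared by all Gaussians so that only the means and the priors are learned. The name `fit_mixture` and its parameters are hypothetical. Because σ is shared, the Gaussian normalizing constant cancels in the responsibilities and is omitted.

```python
from math import exp

def fit_mixture(points, init_means, sigma=1.0, n_iter=50):
    """EM for a mixture of spherical Gaussians N(mu_i, sigma^2 I) with fixed sigma."""
    means = [tuple(m) for m in init_means]
    k, dims = len(means), len(points[0])
    priors = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of Gaussian i for point x (soft assignment)
        resp = []
        for x in points:
            w = [priors[i] * exp(-sum((a - b) ** 2 for a, b in zip(x, means[i]))
                                 / (2 * sigma ** 2))
                 for i in range(k)]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: responsibility-weighted means and priors
        for i in range(k):
            weight = sum(r[i] for r in resp)
            means[i] = tuple(sum(r[i] * x[d] for r, x in zip(resp, points)) / weight
                             for d in range(dims))
            priors[i] = weight / len(points)
    return means, priors, resp

points = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (11.0,)]
means, priors, resp = fit_mixture(points, init_means=[(2.0,), (8.0,)])
```

As σ shrinks, the responsibilities approach 0 or 1 and the updates reduce to the hard assignments of K-means, which is the sense in which this is a probabilistic K-means.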


Analysis of Clustering Data

• Statistical Significance of Clusters

Gene Ontology http://www.geneontology.org/

KEGG http://www.genome.jp/kegg/

• Regulatory motifs responsible for common expression

• Regulatory Networks

• Experimental Verification


Evaluating clusters – Hypergeometric Distribution

• N genes, p labeled ++, (N-p) ––

• Cluster: k genes, m labeled ++

• P-value of single cluster containing k genes of which at least r are ++

$$P(\mathrm{pos} \ge r) = \sum_{m \ge r} \frac{\binom{p}{m} \binom{N-p}{k-m}}{\binom{N}{k}}$$

Each term of the sum: Prob a random set of k genes has m ++ and k-m –– genes

The whole sum: P-value that at least r genes are ++ in the cluster

slide credits: M. Kellis
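The P-value above is a direct sum of binomial coefficients. A short sketch, using only the standard library; the function name `hypergeom_pvalue` is an assumption, and the argument names follow the slide (N, p, k, r).

```python
from math import comb

def hypergeom_pvalue(N, p, k, r):
    """P(at least r of k randomly chosen genes are ++), given p ++ genes among N."""
    upper = min(k, p)  # a cluster cannot hold more ++ genes than k or p
    favorable = sum(comb(p, m) * comb(N - p, k - m) for m in range(r, upper + 1))
    return favorable / comb(N, k)

# Example: 10 genes, 3 labeled ++; a cluster of 2 genes with at least 1 ++ gene
pvalue = hypergeom_pvalue(N=10, p=3, k=2, r=1)
```

When many clusters and many labels (for example Gene Ontology categories) are tested at once, the individual P-values would normally need a multiple-testing correction; that step is not shown here.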


Finding Regulatory Motifs


Regulatory Motif Discovery

[Figure: DNA of a group of co-regulated genes, sharing a common subsequence]

Find motifs within groups of co-regulated genes

slide credits: M. Kellis


Characteristics of Regulatory Motifs

• Tiny

• Highly Variable

• ~Constant Size

Because a constant-size transcription factor binds

• Often repeated

• Low-complexity-ish


Sequence Logos

Height of each letter proportional to its frequency

Height of all letters proportional to information content at that position
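The two height rules can be made concrete for one motif position. The sketch below assumes the usual definition of information content for DNA, 2 bits minus the Shannon entropy of the column, without any small-sample correction; the function name `logo_heights` is illustrative.

```python
from math import log2

def logo_heights(column_freqs):
    """column_freqs: dict letter -> frequency at one motif position (sums to 1).

    Returns (per-letter heights, total stack height in bits).
    """
    entropy = -sum(f * log2(f) for f in column_freqs.values() if f > 0)
    info = log2(4) - entropy  # 2 bits for a fully conserved DNA position
    heights = {letter: f * info for letter, f in column_freqs.items()}
    return heights, info

conserved, info_conserved = logo_heights({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0})
uniform, info_uniform = logo_heights({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
```

A fully conserved position gives a single letter of height 2 bits, while a position with no preference gives a stack of height 0.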


Problem Definition

Given a collection of promoter sequences $s_1, \dots, s_N$ of genes with common expression

Probabilistic

Motif: $M_{ij}$; $1 \le i \le W$, $1 \le j \le 4$

$M_{ij}$ = Prob[ letter j, pos i ]

Find best M, and positions $p_1, \dots, p_N$ in sequences

Combinatorial

Motif M: $m_1 \dots m_W$

Some of the $m_i$’s blank

Find M that occurs in all $s_i$ with $\le k$ differences


Discrete Approaches to Motif Finding


Discrete Formulations

Given sequences $S = \{x_1, \dots, x_n\}$

• A motif W is a consensus string $w_1 \dots w_K$

• Find motif W* with “best” match to $x_1, \dots, x_n$

Definition of “best”:

$d(W, x_i)$ = min hamming dist. between W and any word in $x_i$

$d(W, S) = \sum_i d(W, x_i)$


Exhaustive Searches

1. Pattern-driven algorithm:

For W = AA…A to TT…T ($4^K$ possibilities)

Find d( W, S )

Report W* = argmin( d(W, S) )

Running time: $O(K N 4^K)$

(where $N = \sum_i |x_i|$)

Advantage: Finds provably “best” motif W

Disadvantage: Time
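The pattern-driven search is short enough to write out in full. In this sketch the helper names `hamming`, `d_seq`, `d_set`, and `pattern_driven` are illustrative, and the three example sequences are made up so that each contains ACGT exactly.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_seq(W, x):
    """d(W, x): minimum Hamming distance between W and any K-long word of x."""
    K = len(W)
    return min(hamming(W, x[s:s + K]) for s in range(len(x) - K + 1))

def d_set(W, S):
    """d(W, S): sum of d(W, x_i) over all sequences."""
    return sum(d_seq(W, x) for x in S)

def pattern_driven(S, K):
    """Try all 4^K words from AA...A to TT...T and keep the best one."""
    candidates = ("".join(t) for t in product("ACGT", repeat=K))
    return min(candidates, key=lambda W: d_set(W, S))

S = ["TTACGTAA", "GGACGTCC", "ACGAACGT"]
best = pattern_driven(S, K=4)
```

Each of the $4^K$ candidates is compared against every window of every sequence, which is where the $O(K N 4^K)$ running time comes from; the search is practical only for short motifs.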


Exhaustive Searches

2. Sample-driven algorithm:

For W = any K-long word occurring in some xi

Find d( W, S )

Report W* = argmin( d( W, S ) )

or, Report a local improvement of W*

Running time: $O(K N^2)$

Advantage: Time

Disadvantage: If the true motif is weak and does not occur in data

then a random motif may score better than any instance of true motif
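The sample-driven variant changes only the candidate set: instead of all $4^K$ words, it scores the K-long words that actually occur in the data. A self-contained sketch with hypothetical helper names; candidates are sorted only to make tie-breaking deterministic.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def d_set(W, S):
    """d(W, S): sum over sequences of the best Hamming match of W."""
    K = len(W)
    return sum(min(hamming(W, x[s:s + K]) for s in range(len(x) - K + 1))
               for x in S)

def sample_driven(S, K):
    """Score only the K-long words that occur in some sequence of S."""
    candidates = {x[s:s + K] for x in S for s in range(len(x) - K + 1)}
    return min(sorted(candidates), key=lambda W: d_set(W, S))

S = ["TTACGTAA", "GGACGTCC", "ACGAACGT"]
best = sample_driven(S, K=4)
```

There are at most N candidates and each costs $O(KN)$ to score, giving $O(K N^2)$. The returned word is always a substring of the data, which is exactly the weakness noted above when no exact copy of the true motif is present.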


MULTIPROFILER

• Extended sample-driven approach

Given a K-long word W, define:

$N_\alpha(W)$ = words W’ in S s.t. $d(W, W') \le \alpha$

Idea:

Assume W is occurrence of true motif W*

Will use Nα(W) to correct “errors” in W


MULTIPROFILER

Assume W differs from true motif W* in at most L positions

Define:

A wordlet G of W is an L-long pattern with blanks, differing from W

L is smaller than the word length K

Example:

K = 7; L = 3

W = ACGTTGA

G = --A--CG


MULTIPROFILER

Algorithm:

For each W in S:

For L = 1 to Lmax

1. Find the α-neighbors of W in S → $N_\alpha(W)$

2. Find all “strong” L-long wordlets G in $N_\alpha(W)$

3. For each wordlet G,

a. Modify W by the wordlet G → W’

b. Compute d(W’, S)

Report W* = argmin d(W’, S)

Step 2 above: Smaller motif-finding problem; Use exhaustive search


Expectation Maximization in Motif Finding

Expectation Maximization

Algorithm (sketch):

1. Given genomic sequences find all K-long words

2. Assume each word is motif or background

3. Find likeliest

Motif Model

Background Model

classification of words into either Motif or Background

[Figure: all K-long words, split into motif and background]


Expectation Maximization

Given sequences $x_1, \dots, x_N$,

• Find all K-long words $X_1, \dots, X_n$

• Define motif model:

$M = (M_1, \dots, M_K)$

$M_i = (M_{i1}, \dots, M_{i4})$ (assume {A, C, G, T})

where $M_{ij}$ = Prob[ letter j occurs in motif position i ]

• Define background model:

$B = (B_1, \dots, B_4)$

$B_j$ = Prob[ letter j in background sequence ]


Expectation Maximization

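• Define

$Z_{i1}$ = { 1, if $X_i$ is motif; 0, otherwise }

$Z_{i2}$ = { 0, if $X_i$ is motif; 1, otherwise }

• Given a word $X_i = x[s] \dots x[s+K-1]$,

$P[X_i, Z_{i1} = 1] = \lambda\, M_{1,x[s]} \cdots M_{K,x[s+K-1]}$

$P[X_i, Z_{i2} = 1] = (1 - \lambda)\, B_{x[s]} \cdots B_{x[s+K-1]}$

Let $\lambda_1 = \lambda$; $\lambda_2 = (1 - \lambda)$

[Figure: a word is generated by the motif model $M_1 \dots M_K$ with probability $\lambda$ and by the background model B with probability $1 - \lambda$]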


Expectation Maximization

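Define:

Parameter space $\theta = (M, B)$

$\theta_1$: Motif; $\theta_2$: Background

Objective:

Maximize log likelihood of model:

$$\log P(X_1, \dots, X_n, Z \mid \theta, \lambda) = \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log\big(\lambda_j P(X_i \mid \theta_j)\big)$$

$$= \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log P(X_i \mid \theta_j) + \sum_{i=1}^{n} \sum_{j=1}^{2} Z_{ij} \log \lambda_j$$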


Expectation Maximization

• Maximize expected likelihood, in iteration of two steps:

Expectation:

Find expected value of log likelihood:

$$E[\log P(X_1, \dots, X_n, Z \mid \theta, \lambda)]$$

Maximization:

Maximize expected value over $\theta$, $\lambda$

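Expectation Maximization: E-step

Expectation:

Find expected value of log likelihood:

$$E[\log P(X_1, \dots, X_n, Z \mid \theta, \lambda)] = \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \log P(X_i \mid \theta_j) + \sum_{i=1}^{n} \sum_{j=1}^{2} E[Z_{ij}] \log \lambda_j$$

where expected values of Z can be computed as follows:

$$E[Z_{ij}] = \mathrm{Prob}[Z_{ij} = 1] = \frac{\lambda_j P(X_i \mid \theta_j)}{\lambda P(X_i \mid \theta_1) + (1 - \lambda) P(X_i \mid \theta_2)} = Z^*_{ij}$$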


Expectation Maximization: M-step

Maximization:

Maximize expected value over $\theta$ and $\lambda$ independently

For $\lambda$, this has the following solution: (we won’t prove it)

$$\lambda^{NEW} = \arg\max_{\lambda} \sum_{i=1}^{n} \big( Z^*_{i1} \log \lambda + Z^*_{i2} \log(1 - \lambda) \big) = \frac{1}{n} \sum_{i=1}^{n} Z^*_{i1}$$

Effectively, $\lambda^{NEW}$ is the expected # of motifs per position, given our current parameters

Expectation Maximization: M-step

• For $\theta = (M, B)$, define

$c_{jk}$ = E[ # times letter k appears in motif position j ]

$c_{0k}$ = E[ # times letter k appears in background ]

• $c_{jk}$ values are calculated easily from Z* values

It then follows:

$$M^{NEW}_{jk} = \frac{c_{jk}}{\sum_{k=1}^{4} c_{jk}} \qquad B^{NEW}_{k} = \frac{c_{0k}}{\sum_{k=1}^{4} c_{0k}}$$

to not allow any 0’s, add pseudocounts


Initial Parameters Matter!

Consider the following artificial example:

6-mers $X_1, \dots, X_n$: (n = 2000)

990 words “AAAAAA”

990 words “CCCCCC”

20 words “ACACAC”

Some local maxima:

$\lambda$ = 49.5%; B = 100/101 C, 1/101 A; M = 100% AAAAAA

$\lambda$ = 1%; B = 50% C, 50% A; M = 100% ACACAC


Overview of EM Algorithm

1. Initialize parameters $\theta = (M, B)$, $\lambda$:

Try different values of $\lambda$ from $N^{-1/2}$ up to $1/(2K)$

2. Repeat:

a. Expectation

b. Maximization

3. Until change in $\theta = (M, B)$, $\lambda$ falls below $\epsilon$

4. Report results for several “good” $\lambda$
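The E-step and M-step formulas can be combined into one short loop. The sketch below is an illustration under assumptions, not the lecture's code: it runs a fixed number of iterations instead of testing for a change below a threshold, the pseudocount value is arbitrary, and the names `em_motif`, `M0`, `B0` are hypothetical. It is run on the artificial 6-mer data set from the "Initial Parameters Matter!" slide with one particular starting point.

```python
from math import prod

ALPHABET = "ACGT"

def em_motif(words, M, B, lam, n_iter=30, pseudo=1e-3):
    """words: K-long strings; M: list of K dicts letter->prob; B: dict; lam: motif prior."""
    K = len(words[0])
    for _ in range(n_iter):
        # E-step: Z[i] = Z*_i1 = P(word i is a motif occurrence | current parameters)
        Z = []
        for w in words:
            p_motif = lam * prod(M[j][c] for j, c in enumerate(w))
            p_back = (1 - lam) * prod(B[c] for c in w)
            Z.append(p_motif / (p_motif + p_back))
        # M-step: lambda is the average Z*; M and B are normalized expected counts
        lam = sum(Z) / len(words)
        new_M = []
        for j in range(K):
            counts = {a: pseudo for a in ALPHABET}   # pseudocounts avoid zeros
            for w, z in zip(words, Z):
                counts[w[j]] += z
            total = sum(counts.values())
            new_M.append({a: counts[a] / total for a in ALPHABET})
        b_counts = {a: pseudo for a in ALPHABET}
        for w, z in zip(words, Z):
            for c in w:
                b_counts[c] += 1 - z
        total = sum(b_counts.values())
        M, B = new_M, {a: b_counts[a] / total for a in ALPHABET}
    return M, B, lam

words = ["AAAAAA"] * 990 + ["CCCCCC"] * 990 + ["ACACAC"] * 20
M0 = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1} for _ in range(6)]  # start near AAAAAA
B0 = {a: 0.25 for a in ALPHABET}
M, B, lam = em_motif(words, M0, B0, lam=0.5)
```

Starting from a model biased toward A, the run settles on an A-rich motif with $\lambda$ close to one half. Other starting points for M, B, and $\lambda$ lead to other fixed points, which is why the overview tries several initial values of $\lambda$.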


Gibbs Sampling in Motif Finding


Gibbs Sampling

• Given: $x^1, \dots, x^N$, motif length K, background B,

• Find:

Model M

Locations $a_1, \dots, a_N$ in $x^1, \dots, x^N$

Maximizing log-odds likelihood ratio:

$$\sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{M(k, x^i_{a_i + k})}{B(x^i_{a_i + k})}$$


Gibbs Sampling

• AlignACE: first statistical motif finder

• BioProspector: improved version of AlignACE

Algorithm (sketch):

1. Initialization:

a. Select random locations in sequences x1, …, xN

b. Compute an initial model M from these locations

2. Sampling Iterations:

a. Remove one sequence xi

b. Recalculate model

c. Pick a new location of motif in xi according to probability the location is a motif occurrence


Gibbs Sampling

Initialization:

• Select random locations $a_1, \dots, a_N$ in $x^1, \dots, x^N$

• For these locations, compute M:

$$M_{kj} = \frac{1}{N + B} \left( \sum_{i=1}^{N} \big[ x^i_{a_i + k} = j \big] + \beta_j \right)$$

where $\beta_j$ are pseudocounts to avoid 0s,

and $B = \sum_j \beta_j$

• That is, $M_{kj}$ is the number of occurrences of letter j in motif position k, over the total


Gibbs Sampling

Predictive Update:

• Select a sequence x = xi

• Remove xi, recompute model:

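$$M_{kj} = \frac{1}{(N - 1) + B} \left( \sum_{s=1,\, s \ne i}^{N} \big[ x^s_{a_s + k} = j \big] + \beta_j \right)$$

where $\beta_j$ are pseudocounts to avoid 0s,

and $B = \sum_j \beta_j$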


Gibbs Sampling

Sampling:

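For every K-long word $x_j, \dots, x_{j+K-1}$ in x:

$Q_j$ = Prob[ word | motif ] = $M(1, x_j) \cdots M(K, x_{j+K-1})$

$P_j$ = Prob[ word | background ] = $B(x_j) \cdots B(x_{j+K-1})$

Let

$$A_j = \frac{Q_j / P_j}{\sum_{j=1}^{|x| - K + 1} Q_j / P_j}$$

Sample a random new position $a_i$ according to the probabilities $A_1, \dots, A_{|x|-K+1}$.

[Figure: probability $A_j$ plotted against position, from 0 to |x|]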


Gibbs Sampling

Running Gibbs Sampling:

1. Initialize

2. Run until convergence

3. Repeat 1,2 several times, report common motifs
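The initialization, predictive update, and sampling steps fit together as follows. This is a bare sketch of the sampler described on the preceding slides, not AlignACE or BioProspector: it uses a fixed number of sweeps instead of a convergence test, a uniform pseudocount for every letter, and a 0th-order background supplied by the caller. All names (`build_model`, `gibbs`, `n_sweeps`) and the example sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"

def build_model(seqs, locs, K, skip=None, pseudo=1.0):
    """M[k][j]: smoothed frequency of letter j at motif position k, leaving out seqs[skip]."""
    M = []
    for k in range(K):
        counts = {a: pseudo for a in ALPHABET}       # pseudocounts beta_j
        for s, (x, a) in enumerate(zip(seqs, locs)):
            if s != skip:
                counts[x[a + k]] += 1
        total = sum(counts.values())                 # (N or N-1) + B
        M.append({a: counts[a] / total for a in ALPHABET})
    return M

def gibbs(seqs, K, background, n_sweeps=100, seed=0):
    rng = random.Random(seed)
    # Initialization: one random motif location per sequence
    locs = [rng.randrange(len(x) - K + 1) for x in seqs]
    for _ in range(n_sweeps):
        for i, x in enumerate(seqs):
            M = build_model(seqs, locs, K, skip=i)   # predictive update without x_i
            weights = []
            for j in range(len(x) - K + 1):
                q = p = 1.0
                for k in range(K):
                    q *= M[k][x[j + k]]              # Q_j: word under the motif model
                    p *= background[x[j + k]]        # P_j: word under the background
                weights.append(q / p)
            # Sample a new location with probability proportional to Q_j / P_j
            locs[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return locs, build_model(seqs, locs, K)

seqs = ["TTGACGTCATT", "CCTGACGTCAG", "ATGACGTCAGG", "GGGTGACGTCA"]
background = {a: 0.25 for a in ALPHABET}
locs, M = gibbs(seqs, K=8, background=background)
```

Because the sampler is stochastic, a single run may settle on a shifted or spurious alignment; repeating from several seeds and keeping the motifs that recur corresponds to step 3 above.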


Advantages / Disadvantages

• Very similar to EM

Advantages:

• Easier to implement

• Less dependent on initial parameters

• More versatile, easier to enhance with heuristics

Disadvantages:

• More dependent on all sequences to exhibit the motif

• Less systematic search of initial parameter space


Repeats, and a Better Background Model

• Repeat DNA can be confused as motif

Especially low-complexity CACACA…, AAAAA, etc.

Solution:

more elaborate background model

0th order: B = { pA, pC, pG, pT }

1st order: B = { P(A|A), P(A|C), …, P(T|T) }

Kth order: B = { $P(X \mid b_1 \dots b_K)$; $X, b_i \in \{A, C, G, T\}$ }

Has been applied to EM and Gibbs (up to 3rd order)
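A higher-order background model is a table of conditional letter frequencies. The sketch below estimates one from training sequences with add-one pseudocounts; the function names, the pseudocount, and the fallback to a uniform distribution for unseen contexts are assumptions made for the example.

```python
from collections import defaultdict

ALPHABET = "ACGT"

def train_background(seqs, order, pseudo=1.0):
    """Estimate P(X | previous `order` letters); order 0 uses the empty context ''."""
    counts = defaultdict(lambda: {a: pseudo for a in ALPHABET})
    for x in seqs:
        for i in range(order, len(x)):
            counts[x[i - order:i]][x[i]] += 1
    return {ctx: {a: c[a] / sum(c.values()) for a in ALPHABET}
            for ctx, c in counts.items()}

def word_probability(word, model, order):
    """Probability of word[order:] given its first `order` letters."""
    prob = 1.0
    for i in range(order, len(word)):
        dist = model.get(word[i - order:i])
        # assumption: contexts never seen in training fall back to uniform
        prob *= dist[word[i]] if dist else 1.0 / len(ALPHABET)
    return prob

training = ["CACACACA"]
b0 = train_background(training, order=0)
b1 = train_background(training, order=1)
p_repeat = word_probability("CACA", b1, order=1)
```

Under the 1st-order model, a CA repeat is assigned a high background probability, so its motif-versus-background ratio drops and it is less likely to be reported as a motif.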


Limits of Motif Finders

• Given upstream regions of coregulated genes:

Increasing length makes motif finding harder: random motifs clutter the true ones

Decreasing length makes motif finding harder: true motif missing in some sequences

Motif Challenge problem:

Find a (15,4) motif in N sequences of length

0

gene???


Example Application: Motifs in Yeast

Group:

Tavazoie et al. 1999, G. Church’s lab, Harvard

Data:

• Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)

• 15 time points across two cell cycles

1. Clustering genes according to common expression

• K-means clustering -> 30 clusters, 50-190 genes/cluster

• Clusters correlate well with known function

2. AlignACE motif finding

• 600-long upstream regions


Motifs in Periodic Clusters


Motifs in Non-periodic Clusters


Motifs are preferentially conserved across evolution

[Figure: multiple alignment of one intergenic region across four yeast species (Scer, Spar, Smik, Sbay), with fully conserved columns starred. Annotations: genes GAL10 and GAL1; binding sites labeled GAL4, MIG1, and TBP; “Factor footprint”; “Conservation island”.]

slide credits: M. Kellis

Is this enough to discover motifs? No.


Comparison-based Regulatory Motif Discovery

Study known motifs

Derive conservation rules

Discover novel motifs

slide credits: M. Kellis


Known motifs are frequently conserved

• Across the human promoter regions, the Err motif:

appears 434 times

is conserved 162 times

Conservation rate: 37%

[Figure: Err motif instances aligned across Human, Dog, Mouse, Rat]

• Compare to random control motifs

Conservation rate of control motifs: 6.8%

Err enrichment: 5.4-fold

Err p-value < $10^{-50}$ (25 standard deviations under binomial)

Motif Conservation Score (MCS)

slide credits: M. Kellis
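The numbers on this slide can be checked with a few lines of arithmetic. The sketch assumes that “25 standard deviations under binomial” means a z-score of the conserved count against a binomial model whose success probability is the control conservation rate; the variable names are illustrative.

```python
from math import sqrt

occurrences = 434        # Err motif instances in human promoters
conserved = 162          # instances that are conserved
control_rate = 0.068     # conservation rate of random control motifs

rate = conserved / occurrences          # about 0.37
enrichment = rate / control_rate        # about 5.4 to 5.5 fold

# Binomial null: each instance is conserved independently with prob. control_rate
expected = occurrences * control_rate
std_dev = sqrt(occurrences * control_rate * (1 - control_rate))
z_score = (conserved - expected) / std_dev   # about 25 standard deviations
```

The unrounded ratio comes out slightly above the 5.4-fold quoted on the slide, which is consistent with dividing the rounded rates 37% and 6.8%.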


Finding conserved motifs in whole genomes

M. Kellis PhD Thesis on yeasts, X. Xie & M. Kellis on mammals

1. Define seed “mini-motifs”

2. Filter and isolate mini-motifs that are more conserved than average

3. Extend mini-motifs to full motifs

4. Validate against known databases of motifs & annotations

5. Report novel motifs


slide credits: M. Kellis


Test 1: Intergenic conservation

[Scatter plot. Axes: Total count, Conserved count. Label: CGG-11-CCG]

slide credits: M. Kellis


Test 2: Intergenic vs. Coding

[Scatter plot. Axes: Coding Conservation, Intergenic Conservation. Labels: CGG-11-CCG; Higher Conservation in Genes]

slide credits: M. Kellis


Test 3: Upstream vs. Downstream

[Scatter plot. Axes: Downstream Conservation, Upstream Conservation. Labels: CGG-11-CCG; Most Patterns; Downstream motifs?]

slide credits: M. Kellis


Constructing full motifs

[Figure: 2,000 mini-motifs that pass Test 1, Test 2, and Test 3 go through Extend, Collapse, and Merge steps, yielding 72 full motifs]

slide credits: M. Kellis


Summary for promoter motifs

Columns: Rank, Discovered Motif, Known TF motif, Tissue Enrichment, Distance bias

1 RCGCAnGCGY NRF-1 Yes Yes

2 CACGTG MYC Yes Yes

3 SCGGAAGY ELK-1 Yes Yes

4 ACTAYRnnnCCCR Yes Yes

5 GATTGGY NF-Y Yes Yes

6 GGGCGGR SP1 Yes Yes

7 TGAnTCA AP-1 Yes

8 TMTCGCGAnR Yes Yes

9 TGAYRTCA ATF3 Yes Yes

10 GCCATnTTG YY1 Yes

11 MGGAAGTG GABP Yes Yes

12 CAGGTG E12 Yes

13 CTTTGT LEF1 Yes

14 TGACGTCA ATF3 Yes Yes

15 CAGCTG AP-4 Yes

16 RYTTCCTG C-ETS-2 Yes Yes

17 AACTTT IRF1(*) Yes

18 TCAnnTGAY SREBP-1 Yes Yes

19 GKCGCn(7)TGAYG Yes Yes

20 GTGACGY E4F1 Yes Yes

21 GGAAnCGGAAnY Yes Yes

22 TGCGCAnK Yes Yes

23 TAATTA CHX10 Yes

24 GGGAGGRR MAZ Yes

25 TGACCTY ERRA Yes

• 174 promoter motifs

70 match known TF motifs

115 expression enrichment

60 show positional bias

75% have evidence

• Control sequences

< 2% match known TF motifs

< 5% expression enrichment

< 3% show positional bias

< 7% false positives

Most discovered motifs are likely to be functional

New (no known TF motif listed): ranks 4, 8, 19, 21, 22

slide credits: M. Kellis

