Motif Finding


Yueyi Irene Liu
CS374 Lecture
Oct. 17, 2002

Outline

• Background biology
• Motif-finding methods
  – Word enumeration
  – Gibbs sampling
  – Random projection
  – Phylogenetic footprinting
  – Reducer

Regulation of Gene Expression

• Chromatin structure
• Transcription initiation
• Transcript processing and modification
• RNA transport
• Transcript stability
• Translation initiation
• Post-translational modification
• Protein transport
• Control of protein stability

Typical Structure of a Eukaryotic mRNA Gene

Control of Transcription Initiation

• A conserved pattern that is found in two or more sequences

• Can be found in
  – DNA (e.g., transcription factor binding sites)
  – Protein
  – RNA

Models for Representing Motifs

• Regular expression
  – Consensus: TGACGCA
  – Degenerate: WGACRCA
• Position Specific Matrix

Aligned sites:

TGACGCA
TGACGCA
AGACGCA
TGACACA
AGACGCA

     1    2    3    4    5    6    7
A    0.4  0    1    0    0.2  0    1
T    0.6  0    0    0    0    0    0
G    0    1    0    0    0.8  0    0
C    0    0    0    1    0    1    0
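The position specific matrix above can be rebuilt from the five aligned sites by counting bases column by column. The following is a minimal Python sketch of that counting; the function name `position_matrix` is an illustrative choice, not something from the lecture.

```python
from collections import Counter

def position_matrix(sites, alphabet="ATGC"):
    """Return {base: [frequency at each position]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [] for base in alphabet}
    for column in zip(*sites):              # one column per motif position
        counts = Counter(column)
        for base in alphabet:
            matrix[base].append(counts[base] / len(sites))
    return matrix

# The five aligned sites from the slide.
sites = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
matrix = position_matrix(sites)
for base in "ATGC":
    print(base, matrix[base])
```

Each row printed corresponds to one row of the matrix on the slide.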

Where to look for motifs?

• Gene families: a set of genes controlled by a common transcription factor or common environmental stimulus

• How do you construct gene families?
  – Microarray experiments

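Microarrays

[Figure: microarray experiment. Known DNA sequences are placed on a glass slide; mRNA is isolated from the cells of interest and from a reference sample. The resulting data is a table indexed by gene and by experiment.]

Resulting data:

3.25  3.01  1.30  0.70
6.73  2.89  0.92  0.67
1.14  1.15  0.60  0.23
2.12  6.12  0.07  0.02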

Motif-finding Methods

• Goal: Look for motifs (5-15 bp) in the data set
• Methods:
  – Word enumeration method
  – Gibbs sampling
  – Random projection
  – Phylogenetic footprinting
  – Reducer

Word Enumeration

• For every word w, calculate:
  – Expected frequency, based on the entire upstream region of the yeast genome
    • E.g., $P(\text{ATTGA}) = (0.4)^4 (0.1)^1$, given P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
    • Expected number of occurrences of ATTGA: $E = n \cdot P(\text{ATTGA})$
  – Observed frequency O in the data set
  – Statistical significance of enrichment: $Z = (O - E) / \sqrt{n p (1 - p)} \sim N(0, 1)$, where p = P(w)
• Disadvantage: only exact words are considered
  – E.g., instances of the degenerate motif YCTGCA are counted separately as TCTGCA and CCTGCA
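The enrichment score above can be sketched in a few lines of Python. This is a rough illustration only: the slide does not define n, so the sketch assumes n is the number of positions at which the word could start, and the two toy sequences are made up for the example.

```python
import math

def word_probability(word, base_probs):
    """Probability of a word under independent base frequencies."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p

def count_occurrences(word, sequences):
    """Count exact (possibly overlapping) occurrences of word."""
    w = len(word)
    return sum(
        1
        for seq in sequences
        for i in range(len(seq) - w + 1)
        if seq[i:i + w] == word
    )

def z_score(word, sequences, base_probs):
    # Assumption: n = number of possible start positions in the data set.
    n = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    p = word_probability(word, base_probs)
    observed = count_occurrences(word, sequences)
    expected = n * p
    return (observed - expected) / math.sqrt(n * p * (1 - p))

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
toy_sequences = ["ATTGACCATTGA", "GGATTGATT"]   # hypothetical data
print(word_probability("ATTGA", background))
print(z_score("ATTGA", toy_sequences, background))
```

Because only exact matches are counted, TCTGCA and CCTGCA would each get their own score here, which is the disadvantage noted on the slide.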

Gibbs Sampling

• Matrix to capture a motif
• Goal: find the best $a_k$ (motif start position in sequence k) to maximize the difference between motif and background base distribution.

Liu, X

Gibbs Sampling (Lawrence, et al, 1993)

• Step 1: Pick a random start position in each sequence, compute the current motif matrix

• Step 2: Iterative update
  – Take one sequence out, update the motif matrix
  – Calculate the fitness score of each position of the left-out sequence
  – Pick a start position in the left-out sequence, sampled with weight $A_x$
  – Take out another sequence, …, until convergence

• Step 3: Reset starting positions

Liu, X
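The three steps above can be sketched as a short Python routine. This is a simplified illustration, not the implementation from Lawrence et al.: it assumes exactly one site per sequence, adds a pseudocount to avoid zero probabilities, and runs a fixed number of sweeps instead of testing for convergence. The names `gibbs_sample`, `motif_matrix`, and `fitness` are hypothetical. The fitness of a candidate site is taken as its probability under the motif matrix divided by its probability under the background.

```python
import random

BASES = "ACGT"

def motif_matrix(sites, width, pseudocount=1.0):
    """Column-wise base probabilities (pseudocount is an assumption)."""
    matrix = []
    for j in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix

def fitness(sub, matrix, background):
    """Motif probability of sub divided by its background probability."""
    q = p = 1.0
    for j, base in enumerate(sub):
        q *= matrix[j][base]
        p *= background[base]
    return q / p

def gibbs_sample(seqs, width, background, sweeps=200, rng=None):
    rng = rng or random.Random()
    # Step 1: random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):          # fixed sweeps stand in for convergence
        for i, seq in enumerate(seqs):
            # Step 2: leave sequence i out and rebuild the motif matrix.
            others = [s[a:a + width]
                      for k, (s, a) in enumerate(zip(seqs, starts)) if k != i]
            matrix = motif_matrix(others, width)
            weights = [fitness(seq[a:a + width], matrix, background)
                       for a in range(len(seq) - width + 1)]
            # Sample the new start position in proportion to its fitness.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Step 3 would correspond to calling `gibbs_sample` several times with fresh random starts and keeping the best-scoring result.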

Gibbs Sampling Initialization

Pick random start position, compute motif matrix

Liu, X

Gibbs Sampling Iteration Steps

1) Take out one sequence, calculate the fitness score of every subsequence relative to the current motif

[Figure: the left-out sequence, with its new start position a1' still unknown]

Liu, X

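Fitness Score

• $A_x = Q_x / P_x$
  – $Q_x$: probability of generating subsequence x from the current motif
  – $P_x$: probability of generating subsequence x from the background

Current motif:

     1    2    3
A    0.1  0.3  0.7
T    0.1  0.2  0.1
G    0.7  0.4  0.1
C    0.1  0.1  0.1

Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1

X = GGA: $Q_x = 0.7 \times 0.4 \times 0.7 = 0.196$, $P_x = 0.1 \times 0.1 \times 0.4 = 0.004$, so $A_x = 0.196 / 0.004 = 49$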
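The same ratio can be computed mechanically for any subsequence. A small Python sketch, using the motif matrix and background from this slide (the function name `fitness_score` is illustrative):

```python
def fitness_score(x, motif, background):
    """A_x = Q_x / P_x for subsequence x."""
    q = 1.0   # probability of x under the current motif
    p = 1.0   # probability of x under the background
    for position, base in enumerate(x):
        q *= motif[base][position]
        p *= background[base]
    return q / p

motif = {
    "A": [0.1, 0.3, 0.7],
    "T": [0.1, 0.2, 0.1],
    "G": [0.7, 0.4, 0.1],
    "C": [0.1, 0.1, 0.1],
}
background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

print(fitness_score("GGA", motif, background))   # approximately 49
```

A score well above 1 means the subsequence looks much more like the motif than like background.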

Gibbs Sampling Iteration Steps

2) Pick a new start position by sampling from the fitness scores

[Figure: Sample from Fitness Score. x-axis: starting position of motif in sequence (0, 1, 2, …, 12, …)]

Liu, X

Recent Development

• Random Projection
• Phylogenetic Footprinting
• Reducer

Random Projection (Buhler, 2002)

• (l, d)-motif problem:
  – M is an (unknown) motif of length l
  – Each occurrence of M is corrupted by exactly d point substitutions in random positions
• No known biological motif is exactly an (l, d)-motif

Example instances:

CCcaAG
CCcgAG
CCgcAG
CCtaAG
CCtgAG
CtATgG
CCctAc
tCtTAG
CaAcAG
CCAgAa

Random Projection Algorithm

• Guiding principle: Some instances of a motif agree on a subset of positions.

• Use information from multiple motif instances to construct model.

Example: (7,2) motif ATGCGTC, with instances in input sequences x(1), x(2), x(5), x(8):

...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac...

Buhler, J

k-Projections

• Choose k positions in a string of length l.
• Concatenate the nucleotides at the chosen k positions to form a k-tuple.
• In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace.

Example: l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13)

ATGGCATTCAGATTC → TGCTGAT

Buhler, J
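A k-projection is just a selection of characters by position. A minimal Python sketch, assuming the positions are 1-based as in the slide's example (the function name `project` is illustrative):

```python
def project(ltuple, positions):
    """Concatenate the letters of ltuple at the given 1-based positions."""
    return "".join(ltuple[p - 1] for p in positions)

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))
```

Running this on the slide's 15-mer reproduces the 7-tuple TGCTGAT.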

Random Projection Algorithm

• Choose a projection by selecting k positions uniformly at random.

• For each l-tuple in input sequences, hash into bucket based on letters at k selected positions.

• Recover motif from bucket containing multiple l-tuples.

Input sequence x(i): …TCAATGCACCTAT...

l-tuple TGCACCT → bucket TGCT

Buhler, J

Example

• l = 7 (motif size), k = 4 (projection size)
• Choose projection (1,2,5,7)

Input sequence: ...TAGACATCCGACTTGCCTTACTAC...

Buckets (two of the hashed l-tuples shown): ATCCGAC, GCCTTAC

Buhler, J

Hashing and Buckets

• Hash function h(x) obtained from k positions of projection.
• Buckets are labeled by values of h(x).
• Enriched buckets: contain more than s l-tuples, for some parameter s.

Bucket labels: ATTC, CATC, GCTC, ATGC

Buhler, J
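The hashing step can be sketched in Python as follows. This is an illustration under simple assumptions: positions are 1-based, input is upper-cased before hashing, and a bucket counts as enriched when it holds more than s l-tuples, as stated above. The function names are hypothetical. The four input sequences are the (7,2) motif instances shown earlier in the lecture.

```python
import random
from collections import defaultdict

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random (1-based, sorted)."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

def hash_into_buckets(sequences, l, positions):
    """Hash every l-tuple into the bucket named by its projected letters."""
    buckets = defaultdict(list)
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - l + 1):
            ltuple = seq[start:start + l]
            key = "".join(ltuple[p - 1] for p in positions)
            buckets[key].append(ltuple)
    return buckets

def enriched(buckets, s):
    """Keep buckets holding more than s l-tuples."""
    return {key: tuples for key, tuples in buckets.items() if len(tuples) > s}

sequences = ["ccATCCGACca", "ttATGAGGCtc", "ctATAAGTCgc", "tcATGTGACac"]
buckets = hash_into_buckets(sequences, l=7, positions=(1, 2, 5, 7))
print(enriched(buckets, s=3))
```

With projection (1,2,5,7), all four motif instances agree on the projected positions and land in the same bucket, which is the situation the algorithm hopes a random projection will produce.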

Motif Refinement

• How do we recover the motif from the sequences in the enriched buckets?
• k nucleotides are known from the hash value of the bucket.
• Use information in the other l-k positions as a starting point for a local refinement scheme, e.g. EM or Gibbs sampler

Bucket contents ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC → local refinement algorithm → candidate motif ATGCGTC

Buhler, J

Parameter Selection

• Projection size k
  – Choose k small so several motif instances hash to the same bucket. ($k < l - d$)
  – Choose k large to avoid contamination by spurious l-mers. ($4^k > t(n - l + 1)$)
• Bucket threshold s: (s = 3, s = 4)

Buhler, J
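The two constraints on k can be combined to list the admissible projection sizes. The sketch below assumes that t is the number of input sequences and n the length of each, which the slide does not spell out; the example values are hypothetical.

```python
def admissible_projection_sizes(l, d, t, n):
    """All k with k < l - d and 4**k > t * (n - l + 1)."""
    return [k for k in range(1, l - d) if 4 ** k > t * (n - l + 1)]

# Hypothetical setting: a (15, 4)-motif in 20 sequences of length 600.
print(admissible_projection_sizes(l=15, d=4, t=20, n=600))
```

Smaller k in this range makes it more likely that several instances share a bucket; larger k keeps spurious l-mers out.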

Recent Development

Conservation of Regulatory Elements Upstream of the ApoAI Gene

[Figure: aligned upstream sequences from mouse, rabbit, human, and chicken, with the conserved hepatic site C, CCAAT box, and TATA box marked]

Substring Parsimony Problem

Given:
• orthologous upstream sequences $S_1, \dots, S_n$
• phylogenetic tree T of the n species
• size k of the motif, threshold d

Problem: Find all sets of substrings $s_1, \dots, s_n$ of $S_1, \dots, S_n$, each of size k, such that the parsimony score of $s_1, \dots, s_n$ on T is at most d

Blanchette, M

Parsimony Score

$$\min_{\text{all possible labelings of internal nodes}} \; \sum_{(u,v) \in E(T)} d(l(u), l(v))$$

• $l(v)$ – label of node v
• $d(l_1, l_2)$ – Hamming distance

[Figure: tree T]

Blanchette, M

String Parsimony Problem

S1: AAAGCATTC
S2: TACGCACCC
S3: GAAGCAGGG

[Figure: tree over leaves S1, S2, S3, with node labels AAGCA and ACGCA]

Algorithm: version I

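• Root the tree at arbitrary internal node r
• Compute table $W_u$ of size $4^k$ for each node u, where $W_u[s]$ is the best parsimony score for the subtree rooted at u when u is labeled with s:

$$W_u[s] = \begin{cases} 0 & \text{if } u \text{ is a leaf and } s \text{ is a substring of } S_u \\ +\infty & \text{if } u \text{ is a leaf and } s \text{ is not a substring of } S_u \\ \sum_{v \in Child(u)} \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right) & \text{if } u \text{ is not a leaf} \end{cases}$$

• Direct implementation of this recursion gives $O(n \cdot k \cdot (4^{2k} + l))$, where l is the average sequence length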

Blanchette, M
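The recursion of version I can be written down almost directly. The Python sketch below is a naive illustration, not the authors' code: a tree is represented as nested tuples with sequence strings at the leaves (a representation chosen here for brevity), and the example uses the three sequences S1, S2, S3 from the earlier slide on a star tree with one internal node, which is an assumption since the slide's tree figure did not survive.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def substrings(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def parsimony_table(node, k, labels):
    """W_u[s] for every label s, following the recursion above."""
    if isinstance(node, str):                       # leaf: node is S_u
        present = substrings(node, k)
        return {s: (0 if s in present else float("inf")) for s in labels}
    table = dict.fromkeys(labels, 0)
    for child in node:                              # sum over Child(u)
        child_table = parsimony_table(child, k, labels)
        # Labels with infinite cost can never attain the minimum.
        finite = [(t, w) for t, w in child_table.items() if w != float("inf")]
        for s in labels:
            table[s] += min(w + hamming(s, t) for t, w in finite)
    return table

def best_parsimony_score(tree, k):
    labels = ["".join(p) for p in product("ACGT", repeat=k)]
    return min(parsimony_table(tree, k, labels).values())

# Hypothetical star tree: one internal node (the root) with three leaves.
tree = ("AAAGCATTC", "TACGCACCC", "GAAGCAGGG")
print(best_parsimony_score(tree, k=5))
```

For this input the best score is 1, attained for instance by labeling the root AAGCA, which occurs in S1 and S3 and is one substitution away from ACGCA in S2. The table has $4^k$ entries per node, so this direct version is only practical for small k.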

Algorithm: version II

• Define $X_{(u,v)}[s]$ – best parsimony score for the subtree consisting of edge (u, v) and the subtree rooted at v, when u is labeled s

$$X_{(u,v)}[s] = \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right)$$

$$W_u[s] = \sum_{v \in Child(u)} X_{(u,v)}[s]$$

Blanchette, M

Algorithm: version II (continued)

• Update $X_{(u,v)}$ in phases: in phase p maintain set $B_p$ of sequences t, such that $X_{(u,v)}[t] = p$

• Define:
  – $R_a = \{s : W_v[s] = a\}$
  – $N(s) = \{t \in \Sigma^k : d(s, t) = 1\}$

• Start in phase m and let $B_m = R_m$

• Update:

$$B_p = \Big( R_p \cup \bigcup_{s \in B_{p-1}} N(s) \Big) \setminus \bigcup_{j < p} B_j$$

• Computation of $X_{(u,v)}$ takes $O(k \cdot 4^k)$

Blanchette, M

Improvements

• Reduce the size of $B_p$ when sequences contribute to $X_{(u,v)}$ greater than threshold d
  – In phase p, only care for sequence s in $X_{(u,v)}[s]$ if a bound relating p and d holds [inequality garbled in the transcript; it ranges over w in Child(u), w ≠ v, and uses X[s] where it has been computed and 1 otherwise]
  – Leads to significant reductions in stages d/2 … d
• Reduce the number of substrings inserted in W at the leaves
  – For substring s of $S_i$, if its best match against any $S_j$ has Hamming distance at least d, s can be discarded

Blanchette, M

Results

• Practical limit on k = 10
• There appeared to be a threshold $d_0$ with very few solutions below and many above
• Algorithm found ~80% of known binding sites
• Performed better than ClustalW, MEME, Consensus

Blanchette, M

Recent Development

Reducer (Bussemaker, et al 2001)

• Links motif finding to expression level
• $A_g = C + \sum_{u=1}^{M} F_u N_{ug}$
  – $A_g$: gene expression level (logarithm of expression ratio)
  – M: number of significant motifs
  – $N_{ug}$: number of occurrences of motif u in gene g
  – C: baseline expression level (same for all genes)
  – $F_u$: increase/decrease of expression level caused by presence of motif u

Reducer (Cont’d)

Expression vector: log ratio of expression levels

Gene1  Gene2  Gene3  Gene4  …  GeneN
1.3    -3.7   10.3   4.5    …  -2.3

Motif vector: number of times that motif occurs in the upstream region of the gene

       Gene1  Gene2  Gene3  Gene4  …  GeneN
AAAAA  2      0      5      3      …  0
AAAAT  5      3      2      1      …  5
…

Liu, X

Reducer (Cont’d)

• Normalize expression (A) and motif (n) vectors

• Linear regression between A vector and every n vector to find the best fit n to A

• Step-wise regression to combine effects of motifs
  – Subtract the effect of one motif
  – Find the next best motif

Liu, X
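The step-wise procedure above can be sketched in plain Python. This is a simplified illustration, not the published method: "normalize" is taken here to mean only centering each vector on its mean (which absorbs the baseline C), the best motif at each step is the one explaining the most residual variance in a one-variable least-squares fit, and the five-gene data set is made up.

```python
def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def fit(residual, counts):
    """Least-squares slope F and explained sum of squares for one motif."""
    n = centered(counts)
    snn = sum(x * x for x in n)
    if snn == 0:
        return 0.0, 0.0
    sna = sum(x * r for x, r in zip(n, residual))
    slope = sna / snn
    return slope, slope * sna

def stepwise_motifs(expression, motif_counts, steps):
    residual = centered(expression)          # centering removes baseline C
    candidates = dict(motif_counts)
    chosen = []
    for _ in range(min(steps, len(candidates))):
        # Find the motif whose count vector best fits the current residual.
        best = max(candidates, key=lambda m: fit(residual, candidates[m])[1])
        slope, _explained = fit(residual, candidates[best])
        n = centered(candidates.pop(best))
        # Subtract the effect of this motif before looking for the next one.
        residual = [r - slope * x for r, x in zip(residual, n)]
        chosen.append((best, slope))
    return chosen

# Hypothetical toy data for five genes.
expression = [2.5, -0.5, 10.0, 6.5, -1.5]
motif_counts = {"AAAAA": [2, 0, 5, 3, 0], "AAAAT": [5, 3, 2, 1, 5]}
print(stepwise_motifs(expression, motif_counts, steps=2))
```

In the toy data the first motif picked has a positive F (its presence goes with higher expression) and the second a negative F.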

Acknowledgement

• People from whom I borrowed slides:
  – Xiaole Liu (Reducer)
  – Olga Troyanskaya (Microarray)
  – Jeremy Buhler (Random projections)
  – Mathieu Blanchette (Phylogenetic footprinting)
  – Various web sources

[Figure: cDNA microarray workflow. cDNA clones (probes) → PCR product amplification and purification → printing onto the microarray (0.1 nl/spot) → hybridise mRNA target to microarray → excitation with laser 1 and laser 2, emission, scanning → overlay images and normalise → analysis]

Information Content of Motifs

• Uncertainty

• Information = Hbefore - Hafter
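The uncertainty formula did not survive in this transcript. The sketch below assumes the usual Shannon entropy in bits, with a uniform distribution over the four bases as the "before" state; both are assumptions rather than something the slide confirms.

```python
import math

def uncertainty(probs):
    """Shannon entropy in bits (assumed definition of uncertainty H)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information(column, before=(0.25, 0.25, 0.25, 0.25)):
    """Information = H_before - H_after for one motif position."""
    return uncertainty(before) - uncertainty(column)

print(information([0.0, 1.0, 0.0, 0.0]))   # fully conserved position
print(information([0.4, 0.6, 0.0, 0.0]))   # two bases share the position
```

Under these assumptions a fully conserved position carries 2 bits, and a position split between two bases carries roughly 1 bit.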

Improvement on Original Gibbs sampler

• 0 ~ n copies of sites in each sequence
• Iterative masking to find multiple motifs
• Use higher order Markov models to improve motif specificity

Clinical Importance of Defects in Regulatory Elements

Burkitt’s Lymphoma

Statistical Methods

• Expectation Maximization (EM)
  – MEME
• Gibbs sampling
  – BioProspector
  – AlignACE

Motifs are not limited to DNA

• RNA motifs
  – RNA–RNA interaction motifs, e.g., intron-exon splice sites
  – RNA–protein interaction motifs, e.g., binding of proteins to the RNA polyA tail
• Protein motifs
  – E.g., helix-turn-helix motif

Sequence Logo

Why is this Problem Hard?

• Motif information content low
• Hamming distance between each motif instance high
