
Motif Finding

Date post: 19-Mar-2016
Upload: afya
Motif Finding Yueyi Irene Liu CS374 Lecture Oct. 17, 2002
Transcript
Page 1: Motif Finding

Motif Finding

Yueyi Irene Liu
CS374 Lecture
Oct. 17, 2002

Page 2: Motif Finding

Outline

• Background biology
• Motif-finding methods
– Word enumeration
– Gibbs sampling
– Random projection
– Phylogenetic footprinting
– Reducer

Page 3: Motif Finding
Page 4: Motif Finding

Regulation of Gene Expression

• Chromatin structure
• Transcription initiation
• Transcript processing and modification
• RNA transport
• Transcript stability
• Translation initiation
• Post-translational modification
• Protein transport
• Control of protein stability

Page 5: Motif Finding

Typical Structure of a Eukaryotic mRNA Gene

Page 6: Motif Finding

Control of Transcription Initiation

Page 7: Motif Finding

Motif

• A conserved pattern that is found in two or more sequences

• Can be found in
– DNA (e.g., transcription factor binding sites)
– Protein
– RNA

Page 8: Motif Finding

Models for Representing Motifs

• Regular expression
– Consensus: TGACGCA
– Degenerate: WGACRCA

• Position Specific Matrix

Aligned sites:
TGACGCA
TGACGCA
AGACGCA
TGACACA
AGACGCA

     1    2    3    4    5    6    7
A   0.4   0    1    0   0.2   0    1
T   0.6   0    0    0    0    0    0
G    0    1    0    0   0.8   0    0
C    0    0    0    1    0    1    0
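The matrix above can be reproduced from the five aligned sites by counting bases column by column. The following is a minimal sketch; the function name `position_frequency_matrix` is a hypothetical helper, not something named in the slides.

```python
def position_frequency_matrix(sites, alphabet="ATGC"):
    """Return {base: [frequency at position 1, 2, ...]} for equal-length sites."""
    width = len(sites[0])
    return {
        base: [sum(site[j] == base for site in sites) / len(sites)
               for j in range(width)]
        for base in alphabet
    }

# The five aligned sites shown on the slide.
sites = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
matrix = position_frequency_matrix(sites)
for base in "ATGC":
    print(base, matrix[base])
```

Each column sums to 1, and positions 1 and 5 are the only ones with more than one base, which is what the degenerate pattern WGACRCA expresses (W = A or T, R = A or G).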

Page 9: Motif Finding

Where to look for motifs?

• Gene families: a set of genes controlled by a common transcription factor or common environmental stimulus

• How do you construct gene families?
– Microarray experiments

Page 10: Motif Finding

Microarrays

[Figure: known DNA sequences are spotted on a glass slide; mRNA is isolated from the cells of interest and from a reference sample. Resulting data: a matrix of values with axes labeled "genes" and "experiments".]

3.25 3.01 1.30 0.70
6.73 2.89 0.92 0.67
1.14 1.15 0.60 0.23
2.12 6.12 0.07 0.02

Page 11: Motif Finding

Motif-finding Methods

• Goal: Look for motifs (5-15 bp) in the data set
• Methods:
– Word enumeration method
– Gibbs sampling
– Random projection
– Phylogenetic footprinting
– Reducer

Page 12: Motif Finding

Word Enumeration

• For every word w, calculate:
– Expected frequency based on entire upstream region of the yeast genome
• E.g., $P(\text{ATTGA}) = (0.4)^4 (0.1)^1$, given P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
• Expected number of occurrences of ATTGA: $E = n \cdot P(\text{ATTGA})$
– Observed frequency in the data set
– Statistical significance of enrichment: $Z = (O - E) / \sqrt{n p (1 - p)} \sim N(0, 1)$
– Disadvantage: only considers exact words
• E.g., YCTGCA (Y = C or T) matches both TCTGCA and CCTGCA, which are counted as two separate words
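The calculation on this slide can be sketched as follows. This is an illustration under stated assumptions: n is taken to be the number of word-length windows in the data set (the slide does not define n), base probabilities are independent per position, and `word_z_score` is a hypothetical name.

```python
from math import sqrt

def word_z_score(word, seqs, base_prob):
    """Return (p, n, expected, observed, z) for one exact word."""
    # p: probability of the word under the independent-base background.
    p = 1.0
    for base in word:
        p *= base_prob[base]
    size = len(word)
    # n: number of windows where the word could start (assumption, see lead-in).
    n = sum(max(len(s) - size + 1, 0) for s in seqs)
    observed = sum(
        1 for s in seqs for i in range(len(s) - size + 1) if s[i:i + size] == word
    )
    expected = n * p
    z = (observed - expected) / sqrt(n * p * (1 - p))
    return p, n, expected, observed, z

background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
p, n, expected, observed, z = word_z_score("ATTGA", ["ATTGAATTGA"], background)
print(p, n, expected, observed, z)
```

With the slide's background, P(ATTGA) = 0.4^4 × 0.1 = 0.00256. A degenerate pattern such as YCTGCA would have to be scored by summing the counts of its exact words, which is the limitation the slide points out.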

Page 13: Motif Finding

Gibbs Sampling

• Matrix to capture a motif
• Goal: find the best start positions a1, …, ak (one per sequence) to maximize the difference between motif and background base distribution.

[Figure: sequences with motif start positions a1, a2, a3, a4, …, ak]

Liu, X

Page 14: Motif Finding

Gibbs Sampling (Lawrence, et al, 1993)

• Step 1: Pick random start position, compute current motif matrix

• Step 2: Iterative update
– Take one sequence out, update motif matrix
– Calculate fitness score of each position of the left-out sequence
– Pick start position in the left-out sequence based on weight Ax
– Take out another sequence, …, until convergence

• Step 3: Reset starting position
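Steps 1 and 2 can be sketched as below. This is a simplified illustration, not the published implementation: it assumes exactly one site per sequence, adds a pseudocount of 1 to every matrix cell so no probability is zero, runs a fixed number of sweeps instead of testing convergence, and omits Step 3. The names `motif_matrix` and `gibbs_sampler` are hypothetical.

```python
import random

BASES = "ACGT"

def motif_matrix(seqs, starts, w, skip, pseudo=1.0):
    """Per-column base frequencies of the current sites, leaving sequence `skip` out."""
    cols = [{b: pseudo for b in BASES} for _ in range(w)]
    for i, (seq, start) in enumerate(zip(seqs, starts)):
        if i == skip:
            continue
        for j in range(w):
            cols[j][seq[start + j]] += 1
    for col in cols:
        total = sum(col.values())
        for b in BASES:
            col[b] /= total
    return cols

def gibbs_sampler(seqs, w, background, iters=200, seed=0):
    rng = random.Random(seed)
    # Step 1: pick a random start position in every sequence.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        for i, seq in enumerate(seqs):
            # Step 2a: take sequence i out and rebuild the motif matrix.
            matrix = motif_matrix(seqs, starts, w, skip=i)
            # Step 2b: fitness A_x = Q_x / P_x for every candidate start.
            weights = []
            for a in range(len(seq) - w + 1):
                q = p = 1.0
                for j in range(w):
                    q *= matrix[j][seq[a + j]]
                    p *= background[seq[a + j]]
                weights.append(q / p)
            # Step 2c: sample the new start in proportion to its fitness.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Sampling the new position, rather than always taking the highest-scoring one, is what lets the procedure move away from a poor random start.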

Liu, X

Page 15: Motif Finding

Gibbs Sampling Initialization

Pick random start position, compute motif matrix

[Figure: the start positions a1, a2, a3, a4, …, ak are replaced by randomly chosen positions a1', a2', a3', a4', …, ak']

Liu, X

Page 16: Motif Finding

Gibbs Sampling Iteration Steps

1) Take out one sequence, calculate the fitness score of every subsequence relative to the current motif

[Figure: the first sequence is taken out and its start position a1' is unknown ("?"); a2', a3', a4', …, ak' define the current motif]

Liu, X

Page 17: Motif Finding

Fitness Score

• Ax = Qx / Px
– Qx: probability of generating subsequence x from current motif
– Px: probability of generating subsequence x from background

Current motif:

     1    2    3
A   0.1  0.3  0.7
T   0.1  0.2  0.1
G   0.7  0.4  0.1
C   0.1  0.1  0.1

Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1

X = GGA: Q? P?
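The question posed for X = GGA can be worked through directly from the definitions of Qx and Px. The sketch below multiplies the per-position probabilities; `fitness` is a hypothetical helper name.

```python
def fitness(x, motif, background):
    """Return (Q_x, P_x, A_x) for subsequence x."""
    q = p = 1.0
    for j, base in enumerate(x):
        q *= motif[base][j]      # probability of this base at motif position j
        p *= background[base]    # probability of this base under the background
    return q, p, q / p

# Current motif and background as given on the slide.
motif = {
    "A": [0.1, 0.3, 0.7],
    "T": [0.1, 0.2, 0.1],
    "G": [0.7, 0.4, 0.1],
    "C": [0.1, 0.1, 0.1],
}
background = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

q, p, a = fitness("GGA", motif, background)
print(q, p, a)
```

By hand: Q = 0.7 × 0.4 × 0.7 = 0.196 and P = 0.1 × 0.1 × 0.4 = 0.004, so the fitness score is A = 0.196 / 0.004 = 49.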

Page 18: Motif Finding

Gibbs Sampling Iteration Steps

2) Pick new start position sampling from fitness score

[Figure: "Sample from Fitness Score". Bar plot of fitness (y-axis, 0 to 5) against starting position of motif in sequence (x-axis, 0 to 12, …). The sampled position becomes a1''; a2', a3', a4', …, ak' are unchanged.]

Liu, X

Page 19: Motif Finding

Recent Development

• Random Projection
• Phylogenetic Footprinting
• Reducer

Page 20: Motif Finding

Random Projection (Buhler, 2002)

• (l, d)-motif problem:
– M is an (unknown) motif of length l
– Each occurrence of M is corrupted by exactly d point substitutions in random positions
• No known biological motif is an (l, d)-motif

CCcaAG

CCcgAG

CCgcAG

CCtaAG

CCtgAG

CtATgG

CCctAc

tCtTAG

CaAcAG

CCAgAa

Page 21: Motif Finding

Random Projection Algorithm

• Guiding principle: Some instances of a motif agree on a subset of positions.

• Use information from multiple motif instances to construct model.

M = ATGCGTC, a (7,2) motif

x(1): ...ccATCCGACca...
x(2): ...ttATGAGGCtc...
x(5): ...ctATAAGTCgc...
x(8): ...tcATGTGACac...

Buhler, J

Page 22: Motif Finding

k-Projections

• Choose k positions in string of length l.
• Concatenate nucleotides at chosen k positions to form k-tuple.
• In l-dimensional Hamming space, projection onto k dimensional subspace.

ATGGCATTCAGATTC → TGCTGAT (l = 15, k = 7)

P = (2, 4, 5, 7, 11, 12, 13)

Buhler, J
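The projection in this example is just a selection of characters at the chosen positions. A minimal sketch, assuming 1-based positions as on the slide (`project` is a hypothetical name):

```python
def project(s, positions):
    """Concatenate the letters of s at the given 1-based positions."""
    return "".join(s[i - 1] for i in positions)

print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))  # TGCTGAT
```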

Page 23: Motif Finding

Random Projection Algorithm

• Choose a projection by selecting k positions uniformly at random.

• For each l-tuple in input sequences, hash into bucket based on letters at k selected positions.

• Recover motif from bucket containing multiple l-tuples.

Input sequence x(i): …TCAATGCACCTAT...

l-tuple TGCACCT → bucket TGCT

Buhler, J

Page 24: Motif Finding

Example

• l = 7 (motif size), k = 4 (projection size)
• Choose projection (1,2,5,7)

Input sequence: ...TAGACATCCGACTTGCCTTACTAC...

Buckets:
ATCCGAC → ATGC
GCCTTAC → GCTC

Buhler, J
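The hashing step in this example can be sketched as follows: every l-mer of every input sequence is placed in the bucket named by its letters at the projected positions. The function names are hypothetical, and the threshold parameter `s` follows the "more than s l-tuples" definition of an enriched bucket used in these slides.

```python
from collections import defaultdict

def hash_lmers(seqs, l, positions):
    """Map each projected k-tuple to the list of l-mers that hash to it."""
    buckets = defaultdict(list)
    for seq in seqs:
        for i in range(len(seq) - l + 1):
            lmer = seq[i:i + l]
            key = "".join(lmer[p - 1] for p in positions)  # 1-based positions
            buckets[key].append(lmer)
    return buckets

def enriched_buckets(buckets, s):
    """Keep only buckets holding more than s l-mers."""
    return {key: lmers for key, lmers in buckets.items() if len(lmers) > s}

# The four motif instances from the earlier slide, with their flanking bases.
seqs = ["CCATCCGACCA", "TTATGAGGCTC", "CTATAAGTCGC", "TCATGTGACAC"]
buckets = hash_lmers(seqs, l=7, positions=(1, 2, 5, 7))
print(enriched_buckets(buckets, s=3))
```

All four instances of the (7,2) motif agree at positions 1, 2, 5, and 7, so they land together in bucket ATGC even though no two of them are identical.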

Page 25: Motif Finding

Hashing and Buckets

• Hash function h(x) obtained from k positions of projection.

• Buckets are labeled by values of h(x).
• Enriched buckets: contain more than s l-tuples, for some parameter s.

Bucket labels: ATTC, CATC, GCTC, ATGC

Buhler, J

Page 26: Motif Finding

Motif Refinement

• How do we recover the motif from the sequences in the enriched buckets?
• k nucleotides are known from hash value of bucket.
• Use information in other l-k positions as starting point for local refinement scheme, e.g. EM or Gibbs sampler

Bucket ATGC: ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC → local refinement algorithm → candidate motif ATGCGTC

Buhler, J

Page 27: Motif Finding

Parameter Selection

• Projection size k
– Choose k small so several motif instances hash to same bucket. (k < l - d)
– Choose k large to avoid contamination by spurious l-mers. ($4^k > t (n - l + 1)$)
• Bucket threshold s: (s = 3, s = 4)
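The two constraints on k can be checked together. The sketch below assumes that t is the number of input sequences and n is the length of each sequence; the slide does not define these symbols, so that reading is an assumption. The values t = 20 and n = 600 are illustrative only and do not come from the slides.

```python
def candidate_projection_sizes(l, d, t, n):
    """Return every k with k < l - d and 4**k > t * (n - l + 1)."""
    return [k for k in range(1, l - d) if 4 ** k > t * (n - l + 1)]

# Illustrative values: a (15, 4)-motif in 20 sequences of length 600.
print(candidate_projection_sizes(l=15, d=4, t=20, n=600))  # [7, 8, 9, 10]
```

Here t(n - l + 1) = 20 × 586 = 11720 l-mers, and 4^7 = 16384 is the first power of 4 above it, so the lower bound gives k ≥ 7 while k < l - d gives k ≤ 10.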

Buhler, J

Page 28: Motif Finding

Recent Development

• Random Projection
• Phylogenetic Footprinting
• Reducer

Page 29: Motif Finding

Conservation of Regulatory Elements Upstream of the ApoAI Gene

[Figure: aligned upstream sequences from mouse, rabbit, human, and chicken, with conserved elements labeled: hepatic site C, CCAAT box, TATA box]

Page 30: Motif Finding

AAGCA

AAGCA ACGCA

AAGCA

AAGCA

Page 31: Motif Finding

Substring Parsimony Problem

Given:
• orthologous upstream sequences S1, …, Sn
• phylogenetic tree T of the n species
• size k of the motif, threshold d

Problem: Find all sets of substrings s1, …, sn of S1, …, Sn, each of size k, such that the parsimony score of s1, …, sn on T is at most d

Blanchette, M

Page 32: Motif Finding

Parsimony Score

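Tree T: [Figure: phylogenetic tree with leaves s1, s2, s3, s4, s5, s6 and a labeled internal node s`34]

Parsimony score = minimum, over all possible labelings of internal nodes, of

$$\sum_{(u,v) \in E(T)} d(l(u), l(v))$$

• l(v) – label of node v
• d(l1, l2) – Hamming distance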

Blanchette, M

Page 33: Motif Finding

String Parsimony Problem

S1: AAAGCATTC

S2: TACGCACCC

S3: GAAGCAGGG

S1 S2 S3

AAGCA

AAGCA ACGCA

AAGCA

AAGCA

k = 5

d = 1

Page 34: Motif Finding

Algorithm: version I

• Root the tree at arbitrary internal node r
• Compute table $W_u$ of size $4^k$ for each node u, where $W_u[s]$ – best parsimony score for subtree rooted at u when u is labeled with s

$$W_u[s] = \begin{cases} +\infty & \text{if } u \text{ is a leaf and } s \text{ is not a substring of } S_u \\ 0 & \text{if } u \text{ is a leaf and } s \text{ is a substring of } S_u \\ \sum_{v \in Child(u)} \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right) & \text{if } u \text{ is not a leaf} \end{cases}$$

• Direct implementation of this recursion gives $O(n \cdot k \cdot (4^{2k} + l))$, where l – average sequence length

Blanchette, M
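The recursion for W can be sketched directly, using the three sequences S1, S2, S3 with k = 5 from the example slide. The tree shape ((S1, S2), S3) is an assumption made for illustration, as are the nested-tuple tree encoding and the function names. Leaf tables store only the finite (zero) entries; any word absent from a leaf table is treated as +∞.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def w_table(node, k, words):
    """W_u[s] for the subtree rooted at node (leaf = sequence, internal = tuple)."""
    if isinstance(node, str):
        # Leaf: score 0 for every length-k substring, +infinity (omitted) otherwise.
        return {node[i:i + k]: 0 for i in range(len(node) - k + 1)}
    child_tables = [w_table(child, k, words) for child in node]
    # Internal node: sum over children of min_t (W_v[t] + d(s, t)).
    return {
        s: sum(min(w + hamming(s, t) for t, w in table.items())
               for table in child_tables)
        for s in words
    }

def best_parsimony(tree, k):
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    table = w_table(tree, k, words)
    best = min(table.values())
    return best, sorted(s for s, score in table.items() if score == best)

tree = (("AAAGCATTC", "TACGCACCC"), "GAAGCAGGG")  # assumed shape ((S1, S2), S3)
print(best_parsimony(tree, k=5))
```

For this input the best score is 1, reached with root label AAGCA: S1 and S3 both contain AAGCA, and S2 contains ACGCA at Hamming distance 1. The inner minimum over all 4^k words for every one of the 4^k labels is the 4^{2k} term in the running time, which version II of the algorithm is designed to avoid.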

Page 35: Motif Finding

Algorithm: version II

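• Define $X_{(u,v)}[s]$ – best parsimony score for subtree consisting of edge (u, v) and the subtree rooted at v, when u is labeled s

$$X_{(u,v)}[s] = \min_{t \in \Sigma^k} \left( W_v[t] + d(s, t) \right)$$

$$W_u[s] = \sum_{v \in Child(u)} X_{(u,v)}[s]$$

[Figure: node u labeled s, with children v and w]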

Blanchette, M

Page 36: Motif Finding

Algorithm: version II (continued)

• Update $X_{(u,v)}$ in phases: in phase p maintain set $B_p$ of sequences t, such that $X_{(u,v)}[t] = p$
• Define:
– $R_a = \{s : W_v[s] = a\}$
– $N(s) = \{t \in \Sigma^k : d(s, t) = 1\}$
• Start in phase m and let $B_m = R_m$
• Update: $B_{p+1} = \left( R_{p+1} \cup \bigcup_{s \in B_p} N(s) \right) \setminus \bigcup_{j \le p} B_j$
• Computation of $X_{(u,v)}$ takes $O(k \cdot 4^k)$

Blanchette, M

Page 37: Motif Finding

Improvements

• Reduce the size of Bp when sequences contribute to X(u, v) greater than threshold d
– In phase p, only care for sequence X(u, v)[s] if [condition garbled in the source transcript; it involves p, d, and X(u, w)[s] for the other children w ∈ Child(u), w ≠ v, using X(u, w)[s] where it has already been computed]
– Leads to significant reductions in stages d/2 … d
• Reduce the number of substrings inserted in W at the leaves
– For substring s of Si, if its best match against any Sj has Hamming distance at least d, s can be discarded

Blanchette, M

Page 38: Motif Finding

Results

• Practical limit on k = 10
• There appeared to be a threshold d0 with very few solutions below and many above
• Algorithm found ~80% of known binding sites
• Performed better than ClustalW, MEME, Consensus

Blanchette, M

Page 39: Motif Finding

Recent Development

• Random Projection
• Phylogenetic Footprinting
• Reducer

Page 40: Motif Finding

Reducer (Bussemaker, et al 2001)

• Links motif finding to expression level
• $A_g = C + \sum_{u=1}^{M} F_u N_{ug}$
– $A_g$: gene expression level (logarithm of expression ratio)
– M: number of significant motifs
– $N_{ug}$: number of occurrences of motif u in gene g
– C: baseline expression level (same for all genes)
– $F_u$: increase/decrease of expression level caused by presence of motif u

Page 41: Motif Finding

Reducer (Cont’d)

Expression vector: log ratio of expression levels

Gene1  Gene2  Gene3  Gene4  …  GeneN
1.3    -3.7   10.3   4.5    …  -2.3

Motif vector: number of times that motif occurs in the upstream region of the gene

        Gene1  Gene2  Gene3  Gene4  …  GeneN
AAAAA   2      0      5      3      …  0
AAAAT   5      3      2      1      …  5
…

Liu, X

Page 42: Motif Finding

Reducer (Cont’d)

• Normalize expression (A) and motif (n) vectors

• Linear regression between A vector and every n vector to find the best fit n to A

• Step-wise regression to combine effects of motifs
– Subtract the effect of one motif
– Find the next best motif
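The procedure above can be sketched in plain Python. This is an illustration under assumptions, not the published Reducer code: "best fit" is taken to mean the largest absolute correlation between the centered vectors, and the five values shown on the previous slide are treated as a complete toy data set of five genes (the slide elides the genes in between). All function names are hypothetical.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def centered(v):
    m = mean(v)
    return [x - m for x in v]

def correlation(a, n):
    a, n = centered(a), centered(n)
    return sum(x * y for x, y in zip(a, n)) / sqrt(
        sum(x * x for x in a) * sum(y * y for y in n))

def fit(a, n):
    """Least-squares fit a ~ C + F * n; returns (C, F)."""
    ac, nc = centered(a), centered(n)
    f = sum(x * y for x, y in zip(ac, nc)) / sum(y * y for y in nc)
    return mean(a) - f * mean(n), f

def stepwise(expression, counts, steps):
    """Repeatedly pick the best-fitting motif and subtract its effect."""
    residual, remaining, chosen = list(expression), dict(counts), []
    for _ in range(steps):
        u = max(remaining, key=lambda m: abs(correlation(residual, remaining[m])))
        c, f = fit(residual, remaining[u])
        residual = [a - (c + f * x) for a, x in zip(residual, remaining.pop(u))]
        chosen.append((u, f))
    return chosen

expression = [1.3, -3.7, 10.3, 4.5, -2.3]               # toy A vector
counts = {"AAAAA": [2, 0, 5, 3, 0], "AAAAT": [5, 3, 2, 1, 5]}
print(stepwise(expression, counts, steps=2))
```

On this toy input the first motif selected is AAAAA, whose counts rise and fall with the expression values; the fitted F for the chosen motif plays the role of $F_u$ in the model.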

Liu, X

Page 43: Motif Finding

Acknowledgement

• People from whom I borrowed slides:
– Xiaole Liu (Reducer)
– Olga Troyanskaya (Microarray)
– Jeremy Buhler (Random projections)
– Mathieu Blanchette (Phylogenetic footprinting)
– Various web sources

Page 44: Motif Finding
Page 45: Motif Finding

[Figure: cDNA microarray workflow. cDNA clones (probes) → PCR product amplification and purification → printing (0.1 nl/spot) → microarray. Hybridise target (mRNA) to microarray → excitation (laser 1, laser 2) → emission → scanning → analysis: overlay images and normalise.]

Page 46: Motif Finding

Information Content of Motifs

• Uncertainty

• Information = Hbefore - Hafter
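The uncertainty formula on this slide did not survive in the transcript. The sketch below assumes it is the Shannon entropy in bits, H = -Σ p log2 p, and that H_before is the entropy of a uniform DNA background (2 bits); both are assumptions, and `column_information` is a hypothetical name.

```python
from math import log2

def column_information(column, h_before=2.0):
    """Information (bits) of one motif column: H_before - H_after."""
    h_after = -sum(p * log2(p) for p in column.values() if p > 0)
    return h_before - h_after

print(column_information({"A": 1.0}))              # fully conserved position
print(column_information({"A": 0.4, "T": 0.6}))    # partly conserved position
print(column_information({b: 0.25 for b in "ACGT"}))  # uninformative position
```

Under these assumptions a fully conserved position carries 2 bits and a uniform position carries 0 bits.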

Page 47: Motif Finding

Improvement on Original Gibbs sampler

• 0 ~ n copies of sites in each sequence
• Iterative masking to find multiple motifs
• Use higher order Markov models to improve motif specificity

Page 48: Motif Finding

Clinical Importance of Defects in Regulatory Elements

Burkitt’s Lymphoma

Page 49: Motif Finding

Statistical Methods

• Expectation Maximization (EM)
– MEME
• Gibbs sampling
– BioProspector
– AlignACE

Page 50: Motif Finding

Motifs are not limited to DNA

• RNA motifs
– RNA–RNA interaction motifs, e.g., intron-exon splice sites
– RNA–protein interaction motifs, e.g., binding of proteins to RNA polyA tail
• Protein motifs
– E.g., helix-turn-helix motif

Page 51: Motif Finding

Sequence Logo

Page 52: Motif Finding

Why is this Problem Hard?

• Motif information content low
• Hamming distance between each motif instance high

