
Motif Finding

Date post: 08-Jan-2016
Upload: lorene
Page 1: Motif Finding

Motif Finding

PSSMs

Expectation Maximization

Gibbs Sampling

Page 2: Motif Finding

Complexity of Transcription

Page 3: Motif Finding

Representing Binding Sites for a TF

A set of sites represented as a consensus VDRTWRWWSHD (IUPAC degenerate DNA)

A matrix describing a set of sites:

A 14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C  3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G  4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T  0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

A single site AAGTTAATGA

Set of binding sites:

AAGTTAATGA
CAGTTAATAA
GAGTTAAACA
CAGTTAATTA
GAGTTAATAA
CAGTTATTCA
GAGTTAATAA
CAGTTAATCA
AGATTAAAGA
AAGTTAACGA
AGGTTAACGA
ATGTTGATGA
AAGTTAATGA
AAGTTAACGA
AAATTAATGA
GAGTTAATGA
AAGTTAATCA
AAGTTGATGA
AAATTAATGA
ATGTTAATGA
AAGTAAATGA
AAGTTAATGA
AAGTTAATGA
AAATTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA


Page 4: Motif Finding

Nucleic acid codes

code description

A Adenine

C Cytosine

G Guanine

T Thymine

U Uracil

R Purine (A or G)

Y Pyrimidine (C, T, or U)

M C or A

K T, U, or G

W T, U, or A

S C or G

B C, T, U, or G (not A)

D A, T, U, or G (not C)

H A, T, U, or C (not G)

V A, C, or G (not T, not U)

N Any base (A, C, G, T, or U)
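The code table above can be turned directly into a matcher for degenerate consensus strings. The following is a minimal Python sketch (Python is used here for illustration; the slides themselves use MATLAB later on). It assumes DNA only, so U is left out, and the names `IUPAC`, `matches`, and `find_sites` are hypothetical helpers, not part of any library.

```python
# IUPAC nucleotide codes from the table above, restricted to DNA (U omitted).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(consensus, site):
    """Return True if every base of `site` is allowed by the code at that position."""
    if len(consensus) != len(site):
        return False
    return all(base in IUPAC[code] for code, base in zip(consensus, site))


def find_sites(consensus, sequence):
    """Return the 0-based start positions where `consensus` matches `sequence`."""
    k = len(consensus)
    return [i for i in range(len(sequence) - k + 1)
            if matches(consensus, sequence[i:i + k])]
```

For example, `matches("CGGNNNTANCGG", "CGGATATACCGG")` is true because N accepts any base, while `matches("W", "G")` is false because W allows only A or T.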

Page 5: Motif Finding

From frequencies to log scores

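f matrix (counts of each base at each position):

A 5 0 1 0 0
C 0 2 2 4 0
G 0 3 1 0 4
T 0 0 1 1 1

w matrix (log scores):

A  1.6 -1.7 -0.2 -1.7 -1.7
C -1.7  0.5  0.5  1.3 -1.7
G -1.7  1.0 -0.2 -1.7  1.3
T -1.7 -1.7 -0.2 -0.2 -0.2

Each entry of the w matrix is derived from the f matrix:

w(b,i) = \log_2 \frac{f(b,i) + s(N)}{p(b)}

where f(b,i) is the frequency of base b at position i, s(N) is a pseudocount that depends on the number of sites N, and p(b) is the background frequency of base b.

The score of a site is the sum of its entries in the w matrix:

TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9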
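The slide does not say how s(N) is chosen. As a sketch, the Python below assumes a pseudocount of sqrt(N)/4 per base (so sqrt(N) per column) and a uniform background of 0.25. Both are assumptions made for illustration, picked because they reproduce the w matrix above to one decimal place. `log_score_matrix` and `score` are hypothetical names.

```python
import math


def log_score_matrix(counts, background=0.25):
    """Convert a count matrix {base: [counts per position]} into log2 scores.

    Assumption: the pseudocount is sqrt(N)/4 per base, where N is the number
    of sites (the total of any column). This is not stated on the slide.
    """
    n_sites = sum(row[0] for row in counts.values())
    pseudo = math.sqrt(n_sites) / 4
    total = n_sites + math.sqrt(n_sites)  # column total after adding pseudocounts
    return {base: [math.log2(((c + pseudo) / total) / background) for c in row]
            for base, row in counts.items()}


def score(w, site):
    """Sum the log scores of `site`, one matrix entry per position."""
    return sum(w[base][i] for i, base in enumerate(site))


f = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
w = log_score_matrix(f)
print(round(score(w, "TGCTG"), 1))  # 0.9
```

Because the scores are logarithms, the score of a whole site is a sum rather than a product, which is why TGCTG can be scored by adding five table entries.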

Page 6: Motif Finding

TFs do not act alone

http://www.bioinformatics.ca/

Page 7: Motif Finding

PSSMs for Liver TFs…

HNF1

C/EBP

HNF3

HNF4

Page 8: Motif Finding

PSSMs for Helix-Turn-Helix Motif

Page 9: Motif Finding

Promoter…

Page 10: Motif Finding

Promoter Weight Matrices (PWM)

Page 11: Motif Finding

E.Coli PWMs

Page 12: Motif Finding

Motif Logo

Motifs can mutate on less important bases. The five motifs below have mutations in positions 3 and 5.

Position: 1234567
          TGGGGGA
          TGAGAGA
          TGGGGGA
          TGAGAGA
          TGAGGGA

Representations called motif logos illustrate the conserved regions of a motif.

http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html

Page 13: Motif Finding

Example: Calmodulin-Binding Motif (calcium-binding proteins)

Page 14: Motif Finding

Sequence Motifs

• Motifs represent a short common sequence
– Regulatory motifs (TF binding sites)
– Functional site in proteins (DNA binding motif)

http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html

Page 15: Motif Finding

Regulatory Motifs

• Transcription factors bind to regulatory motifs
• Motifs are 6 – 20 nucleotides long
• Activators and repressors
• Usually located near the target gene, mostly upstream

Page 16: Motif Finding

Challenges

• How to recognize a regulatory motif?
• Can we identify new occurrences of known motifs in genome sequences?
• Can we discover new motifs within upstream sequences of genes?

Page 17: Motif Finding

Motif Representation

Exact motif: CGGATATA

Consensus: represent only deterministic nucleotides.

Example: HAP1 binding sites in 5 sequences. Consensus motif: CGGNNNTANCGG, where N stands for any nucleotide.

Representing only the consensus loses information. How can this be avoided?

CGGATATACCGG

CGGTGATAGCGG

CGGTACTAACGG

CGGCGGTAACGG

CGGCCCTAACGG

------------

CGGNNNTANCGG
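The consensus above can be derived mechanically from the five aligned sites: keep a base where every site agrees and write N elsewhere. A minimal Python sketch follows; the function name `consensus` is hypothetical, and this strict all-or-N rule is only one possible convention.

```python
def consensus(sites):
    """Keep a base where all aligned sites agree; write N everywhere else."""
    return "".join(col[0] if len(set(col)) == 1 else "N"
                   for col in zip(*sites))


hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]
print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Position 9 illustrates the information loss the slide mentions: A occurs there in three of the five sites, but the consensus records only N.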

Page 18: Motif Finding

1 2 3 4 5

A 10 25 5 70 60

C 30 25 80 10 15

T 50 25 5 10 5

G 10 25 10 10 20

PSPM – Position Specific Probability Matrix

Represents a motif of length k (here k = 5). Count the number of occurrences of each nucleotide at each position.

Page 19: Motif Finding

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

PSPM – Position Specific Probability Matrix

Defines Pi(n) for n ∈ {A,C,G,T} and i = {1,..,k}. Pi(A) – frequency of nucleotide A in position i.

Page 20: Motif Finding

Identification of Known Motifs within Genomic Sequences

Motivation:
• Identification of new genes controlled by the same TF.
• Infer the function of these genes.
• Enable better understanding of the regulation mechanism.

Page 21: Motif Finding

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

PSPM – Position Specific Probability Matrix

Each k-mer is assigned a probability. Example: P(TCCAG)=0.5*0.25*0.8*0.7*0.2

Page 22: Motif Finding

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

Detecting a Known Motif within a Sequence using PSPM

The PSPM is moved along the query sequence. At each position the sub-sequence is scored for a match to the PSPM.

Example: sequence = ATGCAAGTCT…

Page 23: Motif Finding

The PSPM is moved along the query sequence. At each position the sub-sequence is scored for a match to the PSPM.

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA
0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

Detecting a Known Motif within a Sequence using PSPM

Page 24: Motif Finding

The PSPM is moved along the query sequence. At each position the sub-sequence is scored for a match to the PSPM.

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA
0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4

Position 2: TGCAA
0.5*0.25*0.8*0.7*0.6 = 0.042

1 2 3 4 5

A 0.1 0.25 0.05 0.7 0.6

C 0.3 0.25 0.8 0.1 0.15

T 0.5 0.25 0.05 0.1 0.05

G 0.1 0.25 0.1 0.1 0.2

Detecting a Known Motif within a Sequence using PSPM

Page 25: Motif Finding

Detecting a Known Motif within a Sequence using PSSM

Is it a random match, or is it indeed an occurrence of the motif?

PSPM -> PSSM (Position Specific Scoring Matrix)

Odds score matrix: Oi(n), where n ∈ {A,C,G,T} and i = {1,..,k}, is defined as Pi(n)/P(n), where P(n) is the background frequency of nucleotide n.

As Oi(n) increases, the odds are higher that n at position i is part of a real motif.

Page 26: Motif Finding

PSSM as Odds Score Matrix

Assumption: the background frequency of each nucleotide is 0.25.

Original PSPM (Pi):

  1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6

Odds Matrix (Oi):

  1 2 3 4 5
A 0.4 1 0.2 2.8 2.4

Going to log scale we get an additive score. Log odds Matrix (log2 Oi):

  1 2 3 4 5
A -1.322 0 -2.322 1.485 1.263

Page 27: Motif Finding

1 2 3 4 5

A -1.32 0 -2.32 1.48 1.26

C 0.26 0 1.68 -1.32 -0.74

T 1 0 -2.32 -1.32 -2.32

G -1.32 0 -1.32 -1.32 -0.32

Calculating using Log Odds Matrix

Log odds ≈ 0 implies random match; log odds > 0 implies real match (?).

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA
-1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7; odds = 2^-2.7 = 0.15

Position 2: TGCAA
1 + 0 + 1.68 + 1.48 + 1.26 = 5.42; odds = 2^5.42 = 42.8

Page 28: Motif Finding

Calculating the probability of a match

ATGCAAG

Position 1: ATGCA = 0.15
Position 2: TGCAA = 42.8
Position 3: GCAAG = 0.18

P(i) = S_i / \sum_j S_j, where S_i is the odds score at position i.

Example: P(1) = 0.15 / (0.15 + 42.8 + 0.18) = 0.003

P(1) = 0.003
P(2) = 0.992
P(3) = 0.004
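The sliding-window calculation on the last few slides can be sketched in Python as follows. The PSPM is the one from the slides and the background is the uniform 0.25 assumed there. The function names are hypothetical.

```python
import math

# PSPM from the slides: one list of per-position probabilities per base.
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}
BACKGROUND = 0.25  # uniform background, as assumed on the slides


def window_probability(window):
    """Probability of a k-mer under the PSPM (product over positions)."""
    p = 1.0
    for i, base in enumerate(window):
        p *= PSPM[base][i]
    return p


def window_log_odds(window):
    """Additive log2 odds score of a k-mer against the background."""
    return sum(math.log2(PSPM[base][i] / BACKGROUND)
               for i, base in enumerate(window))


def scan(sequence, k=5):
    """Return (window, odds, normalized probability) for every window."""
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    odds = [2 ** window_log_odds(w) for w in windows]
    total = sum(odds)
    return [(w, o, o / total) for w, o in zip(windows, odds)]


for window, odds, prob in scan("ATGCAAG"):
    print(window, round(odds, 2), round(prob, 3))
```

Without intermediate rounding the odds come out as about 0.15, 43.0 and 0.18. The slide's 42.8 for TGCAA results from rounding each log-odds entry to two decimals before summing.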

Page 29: Motif Finding

Building a PSSM

Collect all known sequences that bind a certain TF.

Align all sequences (using multiple sequence alignment).

Compute the frequency of each nucleotide in each position (PSPM).

Incorporate background frequency for each nucleotide (PSSM).

Page 30: Motif Finding

Finding new Motifs

We are given a group of genes, which presumably contain a common regulatory motif.

We know nothing of the TF that binds to the putative motif.

The problem: discover the motif.

Page 31: Motif Finding

Example

Predicting the cAMP Receptor Protein (CRP) binding site motif

Page 32: Motif Finding

GGATAACAATTTCACAAGTGTGTGAGCGGATAACAAAAGGTGTGAGTTAGCTCACTCCCCTGTGATCTCTGTTACATAGACGTGCGAGGATGAGAACACAATGTGTGTGCTCGGTTTAGTTCACCTGTGACACAGTGCAAACGCGCCTGACGGAGTTCACAAATTGTGAGTGTCTATAATCACGATCGATTTGGAATATCCATCACATGCAAAGGACGTCACGATTTGGGAGCTGGCGACCTGGGTCATGTGTGATGTGTATCGAACCGTGTATTTATTTGAACCACATCGCAGGTGAGAGCCATCACAGGAGTGTGTAAGCTGTGCCACGTTTATTCCATGTCACGAGTGTTGTTATACACATCACTAGTGAAACGTGCTCCCACTCGCATGTGATTCGATTCACA

Extract experimentally defined CRP Binding Sites

Page 33: Motif Finding

GGATAACAATTTCACATGTGAGCGGATAACAATGTGAGTTAGCTCACTTGTGATCTCTGTTACACGAGGATGAGAACACACTCGGTTTAGTTCACCTGTGACACAGTGCAAACCTGACGGAGTTCACAAGTGTCTATAATCACGTGGAATATCCATCACATGCAAAGGACGTCACGGGCGACCTGGGTCATGTGTGATGTGTATCGAATTTGAACCACATCGCAGGTGAGAGCCATCACATGTAAGCTGTGCCACGTTTATTCCATGTCACGTGTTATACACATCACTCGTGCTCCCACTCGCATGTGATTCGATTCACA

Create a Multiple Sequence Alignment

Page 34: Motif Finding

A C G T

1 -0.43 0.1 -0.46 0.55

2 1.37 0.12 -1.59 -11.2

3 1.69 -1.28 -11.2 -1.43

4 -1.28 0.12 -11.2 1.32

5 0.91 -11.2 -0.46 0.47

6 1.53 -1.38 -1.48 -1.43

7 0.9 -0.48 -11.2 0.12

8 -1.37 -1.28 -11.2 1.68

9 -11.2 -11.2 1.73 -0.56

10 -11.2 -0.51 -11.2 1.72

11 -0.48 -11.2 1.72 -11.2

12 1.56 -1.59 -11.2 -0.46

13 -0.51 -0.38 -0.55 0.88

14 -11.2 0.5 0.57 0.13

15 0.17 -0.51 0.12 0.12

16 0.9 -11.2 0.5 -0.48

17 0.17 0.16 0.06 -0.48

18 -0.4 -0.38 0.82 -0.48

19 -1.38 -1.28 -11.2 1.68

20 -1.48 1.7 -11.2 -1.38

21 1.5 -1.38 -1.43 -1.28

Generate a PSSM

Page 35: Motif Finding

Shannon Entropy

Expected variation per column can be calculated

Low entropy means higher conservation

Page 36: Motif Finding

Entropy

The entropy (H) for a column is:

H = -\sum_{a \in \text{residues}} f_a \log(p_a)

a: a residue; f_a: frequency of residue a in a column; p_a: probability of residue a in that column

Page 37: Motif Finding

Entropy

entropy measures can determine which evolutionary distance (PAM250, BLOSUM80, etc) should be used

Entropy yields amount of information per column (discussed with sequence logos in a bit)

Page 38: Motif Finding

Log-odds score

Profiles can also indicate log-odds score: Log2(observed:expected)

Result is a bit score

Page 39: Motif Finding

Matlab

multialign

1 Enter an array of sequences.

seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};

2 Promote terminations with gaps in the alignment.

multialign(seqs,'terminalGapAdjust',true)

ans =

--CACGTAACATCTC--
ACGACGTAACATCTTCT
-AAACGTAACATCTCGC

Page 40: Motif Finding

Matlab

3 Compare alignment without termination gap adjustment.

multialign(seqs)

ans =

CA--CGTAACATCT--C

ACGACGTAACATCTTCT

AA-ACGTAACATCTCGC

Page 41: Motif Finding

Matlab

>> a={'ATATAGGAG','AATTATAGA','TTAGAGAAA'}

a =

'ATATAGGAG' 'AATTATAGA' 'TTAGAGAAA'

Page 42: Motif Finding

Char function

>> cseq=char(a)

cseq =

ATATAGGAG

AATTATAGA

TTAGAGAAA

Page 43: Motif Finding

Double function

>> intseq=double(cseq)

intseq =

65 84 65 84 65 71 71 65 71

65 65 84 84 65 84 65 71 65

84 84 65 71 65 71 65 65 65

Page 44: Motif Finding

double

>> double('A')
ans = 65
>> double('C')
ans = 67
>> double('G')
ans = 71
>> double('T')
ans = 84

Page 45: Motif Finding

Initiate PSPM matrix

>> Pspm=zeros(4,size(intseq,2))   % one row per nucleotide (A, C, G, T), one column per position

Pspm =

0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

Page 46: Motif Finding

Use a for loop to count each nucleotide at each position

>> for i = 1:size(intseq,2)
Pspm(1,i)=length(find(intseq(:,i)==65));   % A
Pspm(2,i)=length(find(intseq(:,i)==67));   % C
Pspm(3,i)=length(find(intseq(:,i)==71));   % G
Pspm(4,i)=length(find(intseq(:,i)==84));   % T
end
>> Pspm

Pspm =

2 1 2 0 3 0 2 2 2
0 0 0 0 0 0 0 0 0
0 0 0 1 0 2 1 1 1
1 2 1 2 0 1 0 0 0

Page 47: Motif Finding

Add pseudocounts

>> Pspmp=Pspm+1

Pspmp =

3 2 3 1 4 1 3 3 3

1 1 1 1 1 1 1 1 1

1 1 1 2 1 3 2 2 2

2 3 2 3 1 2 1 1 1

Page 48: Motif Finding

Normalize to get frequencies

>> Pspmnorm=Pspmp./repmat(sum(Pspmp),4,1)

Pspmnorm =

Columns 1 through 7

0.4286 0.2857 0.4286 0.1429 0.5714 0.1429 0.4286
0.1429 0.1429 0.1429 0.1429 0.1429 0.1429 0.1429
0.1429 0.1429 0.1429 0.2857 0.1429 0.4286 0.2857
0.2857 0.4286 0.2857 0.4286 0.1429 0.2857 0.1429

Columns 8 through 9

0.4286 0.4286
0.1429 0.1429
0.2857 0.2857
0.1429 0.1429

Page 49: Motif Finding

Calculate odds score

>> Pswm=Pspmnorm/0.25

Pswm =

Columns 1 through 7

1.7143 1.1429 1.7143 0.5714 2.2857 0.5714 1.7143
0.5714 0.5714 0.5714 0.5714 0.5714 0.5714 0.5714
0.5714 0.5714 0.5714 1.1429 0.5714 1.7143 1.1429
1.1429 1.7143 1.1429 1.7143 0.5714 1.1429 0.5714

Columns 8 through 9

1.7143 1.7143
0.5714 0.5714
1.1429 1.1429
0.5714 0.5714

Page 50: Motif Finding

Log odds ratio

>> logPswm=log2(Pswm)

logPswm =

Columns 1 through 7

 0.7776  0.1926  0.7776 -0.8074  1.1926 -0.8074  0.7776
-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
-0.8074 -0.8074 -0.8074  0.1926 -0.8074  0.7776  0.1926
 0.1926  0.7776  0.1926  0.7776 -0.8074  0.1926 -0.8074

Columns 8 through 9

 0.7776  0.7776
-0.8074 -0.8074
 0.1926  0.1926
-0.8074 -0.8074

Page 51: Motif Finding

Estimate the probability of the given sequence to belong to the defined PSWM

>> Unknown='TTAAGAAGG'

Unknown =

TTAAGAAGG

>> intunknown=double(Unknown)

intunknown =

84 84 65 65 71 65 65 71 71

Page 52: Motif Finding

Get the index of the PSWM for the unknown sequence

>> intunknown(intunknown==65)=1;   % A -> row 1
>> intunknown(intunknown==67)=2;   % C -> row 2
>> intunknown(intunknown==71)=3;   % G -> row 3
>> intunknown(intunknown==84)=4;   % T -> row 4
>> intunknown

intunknown =

4 4 1 1 3 1 1 3 3

Page 53: Motif Finding

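Calculate the log odds-ratio of the Unknown 'TTAAGAAGG'

Each base must be looked up in its own column. Indexing with logPswm(intunknown) alone is linear indexing and reads only the first column of the matrix, so the row indices are paired with the column numbers first:

>> idx=sub2ind(size(logPswm),intunknown,1:length(intunknown));
>> logunknown=logPswm(idx)

logunknown =

Columns 1 through 7

0.1926 0.7776 0.7776 -0.8074 -0.8074 -0.8074 0.7776

Columns 8 through 9

0.1926 0.1926

>> Punknown=sum(logunknown)

Punknown =

0.4887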

Page 54: Motif Finding

Is this a significant score or just random similarity?

>> cseq

cseq =

ATATAGGAG
AATTATAGA
TTAGAGAAA

>> Unknown

Unknown =

TTAAGAAGG

Page 55: Motif Finding

What would be the maximum score?

>> logPswm

logPswm =

Columns 1 through 7

 0.7776  0.1926  0.7776 -0.8074  1.1926 -0.8074  0.7776
-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
-0.8074 -0.8074 -0.8074  0.1926 -0.8074  0.7776  0.1926
 0.1926  0.7776  0.1926  0.7776 -0.8074  0.1926 -0.8074

Columns 8 through 9

 0.7776  0.7776
-0.8074 -0.8074
 0.1926  0.1926
-0.8074 -0.8074

>> maxscore=max(logPswm)

maxscore =

Columns 1 through 7

0.7776 0.7776 0.7776 0.7776 1.1926 0.7776 0.7776

Columns 8 through 9

0.7776 0.7776

>> totalmaxscore=sum(maxscore)

totalmaxscore =

7.4135

Page 56: Motif Finding

Write a function using the above statements to scan a sequence

Write a function named ‘logodds’ that calculates the log-odds ratio of a given alignment.

Write a function named ‘scanmotif’ that calls ‘logodds’ and searches through a sequence using a sliding window, calculating the log-odds score of each subsequence and storing these scores. The function should allow selection of a maximum number of locations that are likely to contain the motif, based on the scores obtained.
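One possible shape for these two functions is sketched below in Python rather than MATLAB, purely as an illustration. It assumes a pseudocount of 1 per base and a uniform background of 0.25, as in the preceding steps. The helper `score` and the parameter names `pseudocount`, `background`, and `max_hits` are assumptions, not part of the exercise statement.

```python
import math

BASES = "ACGT"


def logodds(alignment, pseudocount=1, background=0.25):
    """Return {base: [log2 odds per column]} for equal-length aligned sites."""
    width = len(alignment[0])
    matrix = {b: [] for b in BASES}
    for i in range(width):
        column = [seq[i] for seq in alignment]
        total = len(column) + pseudocount * len(BASES)
        for b in BASES:
            freq = (column.count(b) + pseudocount) / total
            matrix[b].append(math.log2(freq / background))
    return matrix


def score(matrix, window):
    """Sum the log-odds entries, looking each base up in its own column."""
    return sum(matrix[b][i] for i, b in enumerate(window))


def scanmotif(alignment, sequence, max_hits=3):
    """Slide the matrix along `sequence`; return the best (start, score) pairs."""
    matrix = logodds(alignment)
    width = len(alignment[0])
    hits = [(i, score(matrix, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]
    # Highest score first; ties keep their left-to-right order.
    return sorted(hits, key=lambda h: h[1], reverse=True)[:max_hits]


alignment = ["ATATAGGAG", "AATTATAGA", "TTAGAGAAA"]
matrix = logodds(alignment)
best = sum(max(matrix[b][i] for b in BASES) for i in range(9))
print(round(best, 4))  # 7.4135
```

Start positions are 0-based here, unlike MATLAB's 1-based indexing. Returning only the top `max_hits` windows is one simple way to satisfy the "maximum number of locations" requirement; a score threshold would be an alternative.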

Page 57: Motif Finding

Position Specific Scoring Matrices (PSSMs) incorporate information theory to indicate the information contained within each column of a multiple alignment.

Information is a logarithmic transformation of the frequency of each residue in the motif.

Page 58: Motif Finding

PSSMs and Pseudocounts

Problem: PSSMs are only as good as the initial MSA
– Some residues may be underrepresented
– Other columns may be too conserved

Solution: introduce pseudocounts to get a better indication

Page 59: Motif Finding

Pseudocounts

New estimated probability:

P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}

P_ca: probability of residue a in column c
n_ca: count of a's in column c
b_ca: pseudocount of a's in column c
N_c: total count in column c
B_c: total pseudocount in column c

Page 60: Motif Finding

PSSMs and pseudocounts

probabilities converted into a log-odds form (usually log2 so the information

can be reported in bits) and placed in the PSSM.

Page 61: Motif Finding

Searching PSSMs

value for the first residue in the sequence occurring in the first column is calculated by searching the PSSM

the value for the residue occurring in each column is calculated

Page 62: Motif Finding

Searching PSSMs

values are added (since they are logarithms) to produce a summed log odds score, S

S can be converted to an odds score using the formula 2^S

odds scores for each position can be summed together and normalized to produce a probability of the motif occurring at each location.

Page 63: Motif Finding

Information in PSSMs

Information theory: amount of information contained within each sequence.

No information: the amount of uncertainty can be measured as log2(20) = 4.32 for amino acids, since there are 20 amino acids. For nucleic acid sequences, the amount of uncertainty can be measured as log2(4) = 2.

Page 64: Motif Finding

Information in PSSMs

If a column is completely conserved then the uncertainty is 0 – there is only one choice.

Two residues occurring with equal probability: 1 bit of uncertainty in deciding which residue it is.

Page 65: Motif Finding

Measure of Uncertainty

Measured as the entropy

H_c = -\sum_{a \in \text{residues}} f_{ca} \log(p_{ca})

Page 66: Motif Finding

Relative Entropy

Relative entropy takes into account the overall composition of the organism being studied:

R_c = \sum_{a \in \text{residues}} f_{ca} \log_2(p_{ca} / b_a)

b_a is the background frequency of residue a in the organism

Page 67: Motif Finding

PSSM Uncertainty

Uncertainty for the whole model is summed over all columns:

H = \sum_{c \in \text{all columns}} H_c

Page 68: Motif Finding

Sequence Logos

Information in PSSMs can be viewed visually

Sequence logos illustrate information in each column of a motif

height of logo is calculated as the amount by which uncertainty has been decreased

Page 69: Motif Finding

Sequence Logos

Page 70: Motif Finding

Statistical Methods

Commonly used methods for locating motifs:

Expectation-Maximization (EM) Gibbs Sampling

Page 71: Motif Finding

Expectation-Maximization

Begin with a set of sequences with an unknown signal in common
– Signal may be subtle
– Approximate length of signal must be given

Randomly assign locations of this motif in each sequence

Page 72: Motif Finding

Expectation-Maximization

Two steps:
– Expectation Step
– Maximization Step

Page 73: Motif Finding

Expectation-Maximization

Expectation step
– Residue frequencies for each position calculated
– Residues not in a motif are background

Frequencies used to determine the probability of finding a site at any position in a sequence to fit the motif model

Page 74: Motif Finding

Maximization Step

Determine the location in each sequence that maximally aligns to the motif pattern

Once a new motif location is found for each sequence, the motif pattern is revised in the expectation step

E-M continues until the solution converges

Page 75: Motif Finding

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCTCCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTGTCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAGAAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTCGGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGCAGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGAGCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCACATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCTTCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGCGCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCCCATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGGGATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAGTCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGACCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGCATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGTAGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTCCCAGCACACACACTTATCCAGTGGTAAATACACATCATTCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGATACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGATGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAGCAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAACTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAAGAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCTTGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACTGGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGTCAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTGCCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCAGGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTGCTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC

Page 76: Motif Finding

Residue Counts

Given motif alignment, count for each location is calculated:

Page 77: Motif Finding

Residue Frequencies

The counts are then converted to frequencies:

Page 78: Motif Finding

Example Maximization Step

Consider the first sequence:

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT

There are 41 residues; for a motif of width 6 there are 41-6+1 = 36 sites to consider
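The loop described on the preceding slides can be sketched in Python as below. This is a simplified illustration, not MEME's implementation: it assumes exactly one site per sequence, a uniform background of 0.25 instead of a background estimated from non-motif residues, a fixed iteration count instead of a convergence test, and soft (weighted) site assignments. All names are hypothetical.

```python
import random

BASES = "ACGT"


def em_motif(seqs, width, iterations=50, pseudocount=1.0, seed=0):
    """Alternate between re-estimating the motif model and the site weights."""
    rng = random.Random(seed)
    # Start from random weights over the len(s) - width + 1 candidate sites.
    weights = []
    for s in seqs:
        raw = [rng.random() for _ in range(len(s) - width + 1)]
        total = sum(raw)
        weights.append([x / total for x in raw])

    for _ in range(iterations):
        # Residue frequencies per motif position, weighted by site probability.
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for s, w in zip(seqs, weights):
            for start, z in enumerate(w):
                for j in range(width):
                    counts[j][s[start + j]] += z
        model = [{b: col[b] / sum(col.values()) for b in BASES}
                 for col in counts]

        # Probability of each candidate site under the model vs. background.
        new_weights = []
        for s in seqs:
            odds = []
            for start in range(len(s) - width + 1):
                o = 1.0
                for j in range(width):
                    o *= model[j][s[start + j]] / 0.25
                odds.append(o)
            total = sum(odds)
            new_weights.append([o / total for o in odds])
        weights = new_weights

    # Report the most probable start position in each sequence.
    starts = [max(range(len(w)), key=w.__getitem__) for w in weights]
    return model, starts
```

For the 41-residue sequence above and `width=6`, each pass evaluates 36 candidate sites. Results depend on the random start, so in practice the procedure would be repeated from several seeds.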

Page 79: Motif Finding

MEME Software

One of three motif models:

OOPS: One expected occurrence per sequence

ZOOPS: Zero or one expected occurrence per sequence

TCM: Any number of occurrences of the motif

Page 80: Motif Finding

Gibbs Sampling

Similar to the E-M algorithm

Combines E-M and simulated annealing

Goal: Find most probable pattern by sampling from motif probabilities to maximize ratio of model:background probabilities

Page 81: Motif Finding

Predictive Update Step

random motif start position chosen for all sequences except one

Initial alignment used to calculate residue frequencies for motif and background

similar to the Expectation Step of EM

Page 82: Motif Finding

Sampling Step

ratio of model:background probabilities normalized and weighted

motif start position chosen based on a random sampling with the given weights

Different than E-M algorithm

Page 83: Motif Finding

Gibbs Sampling

process repeated until residue frequencies in each column do not change

The sampling step is then repeated for a different initial random alignment

Sampling allows escape from local maxima

Page 84: Motif Finding

Gibbs Sampling

Dirichlet priors (pseudocounts) are added into the nucleotide counts to improve performance

shifting routine shifts motif a few bases to the left or the right

A range of motif sizes is checked
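The predictive update and sampling steps can be sketched in Python as follows. This is a bare-bones illustration under several assumptions: one site per sequence, a uniform background of 0.25, a flat pseudocount standing in for the Dirichlet prior, and a fixed number of sweeps. The shifting routine, the range of motif sizes, and the restarts from different random alignments are left out. The function name is hypothetical.

```python
import random

BASES = "ACGT"


def gibbs_motif(seqs, width, sweeps=200, pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(sweeps):
        for held in range(len(seqs)):
            # Predictive update: residue frequencies from all other sequences.
            counts = [{b: pseudocount for b in BASES} for _ in range(width)]
            for idx, (s, st) in enumerate(zip(seqs, starts)):
                if idx == held:
                    continue
                for j in range(width):
                    counts[j][s[st + j]] += 1
            model = [{b: col[b] / sum(col.values()) for b in BASES}
                     for col in counts]

            # Sampling: weight each start in the held-out sequence by its
            # model:background ratio, then draw a start at random.
            s = seqs[held]
            weights = []
            for st in range(len(s) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= model[j][s[st + j]] / 0.25
                weights.append(ratio)
            starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Drawing the new start at random in proportion to the weights, instead of always taking the best-scoring one, is what lets the sampler leave a poor local maximum.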

Page 85: Motif Finding

Gibbs Sampler Web Interface

http://bayesweb.wadsworth.org/gibbs/gibbs.html

