Algorithms for Bioinformatics
Lecture 2: Exhaustive search and randomized algorithms for motif discovery
13.9.2017
These slides are based on previous years’ slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen
These slides use material from http://bix.ucsd.edu/bioalgorithms/slides.php
Outline
DNA Sequence Motifs
Motif Finding Problem
Brute Force Motif Finding
Median String Problem
Greedy Motif Search
Randomized Algorithms
2 / 59
DNA Sequence Motif
I A DNA sequence motif is a recurring pattern in DNA presumed to have some biological function.
I Often the occurrences of the motif in DNA are binding sites where proteins such as transcription factors can attach.
3 / 59
Transcription Factors
I Every gene contains a regulatory region typically stretching 100-1000 bp upstream of the transcriptional start site.
I A transcription factor (TF) is a regulatory protein that can bind to a regulatory region and promote or inhibit the transcription of the gene.
I A single TF typically regulates multiple (co-regulated) genes.
4 / 59
Transcription Factor Binding Sites and Motifs
I A Transcription Factor Binding Site (TFBS) is the DNA location where the TF binds.
I May be located anywhere in the regulatory region.
I A TF binds to a specific kind of DNA sequence.
  I Similar but not identical in different co-regulated genes.
  I Some positions are more important and thus conserved while other positions allow variation.
I Motif is a pattern that describes the TFBS sequences of a given TF.
5 / 59
TFBSs
[Figure: five co-regulated genes, each preceded by a binding site in its regulatory region; the five sites read ATCCCG, ATCCCG, GGTCCT, ATGCCG and ATGCCC.]
6 / 59
Motif Representations
I Set of sequences
  TGGGGGA
  TGAGAGA
  TGGGGGA
  TGAGAGA
  TGAGGGA
I Consensus sequence TGAGGGA
I Profile matrix
  A  0 0 3 0 2 0 5
  C  0 0 0 0 0 0 0
  G  0 5 2 5 3 5 0
  T  5 0 0 0 0 0 0
I Motif logo
7 / 59
Identifying Motifs
I Identify a set of co-regulated genes.
I Extract their regulatory regions.
I Find a pattern shared by all of the regions: the motif.
I The motif can be used for finding other (previously unknown) genes regulated by the same TF.
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
9 / 59
The Motif Finding Problem: Formulation
I Goal: Find the best motif in a set of DNA sequences
I Input: A collection of t strings Dna and an integer k
I Output: The collection of t k-mers, one from each input string, forming the best motif.
  I We will later consider what the “best” motif is.
k = 8
t = 5
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
n = 69
10 / 59
Planted (k, d)-Motif Problem
Artificially generated problem instance
1. Generate a random k-mer M (the motif)
2. Generate t random sequences of length n
3. Implant M into a random position in each of the t sequences
4. In each implanted copy of M, make d random mutations
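The four steps above can be sketched in Python. This is only an illustration under stated assumptions, not the generator used for these slides: the name `plant_motif`, the uniform choice of symbols, overwriting k symbols instead of inserting the k-mer, and mutating d distinct positions to a different symbol are all choices made here.

```python
import random

def plant_motif(k, d, t, n, rng=None):
    """Return (motif, sequences, positions) for a planted (k, d)-motif instance."""
    rng = rng or random.Random()
    alphabet = "ACGT"
    # Step 1: random k-mer M
    motif = "".join(rng.choice(alphabet) for _ in range(k))
    sequences, positions = [], []
    for _ in range(t):
        # Step 2: random sequence of length n
        seq = [rng.choice(alphabet) for _ in range(n)]
        # Step 3: random position for the implanted copy
        pos = rng.randrange(n - k + 1)
        # Step 4: mutate d distinct positions of the copy (assumption: each
        # mutation changes the symbol, so the copy differs from M in exactly d places)
        copy = list(motif)
        for j in rng.sample(range(k), d):
            copy[j] = rng.choice([c for c in alphabet if c != copy[j]])
        seq[pos:pos + k] = copy
        sequences.append("".join(seq))
        positions.append(pos)
    return motif, sequences, positions
```

Calling `plant_motif(15, 4, 10, 73)` produces an instance of the same shape as the example on the following slides.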
11 / 59
Example: Generate sequences
gatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc
tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtac
cctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccg
ttggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag
gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttata
tgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgca
gcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttca
ttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt
ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaa
caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagc
12 / 59
Example: Implant motif AAAAAAAAGGGGGGG
gatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc
tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGG
cctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccg
ttggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag
gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttata
tgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgca
gcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttca
ttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt
ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaa
caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGG
13 / 59
Where are the implanted motifs?
gatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc
tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggg
cctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg
ttggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag
gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttata
tgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgca
gcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttca
ttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt
ttggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaa
caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggg
14 / 59
Example: Add four mutations per motif occurrence
gatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc
tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGG
cctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccg
ttggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag
gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttata
tgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgca
gcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttca
ttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt
ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaa
caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGG
15 / 59
Where are the implanted motifs???
gatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc
tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggg
cctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccg
ttggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggag
gcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttata
tgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgca
gcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttca
ttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgt
ttggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaa
caacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgg
16 / 59
Why is finding (15,4)-motifs hard?
gatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgcc
tttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGG
I Occurrences can differ in up to 2d = 8 positions (7 in this pair)
  AgAAgAAAGGttGGG
  ..|..|||.|..|||
  cAAtAAAAcGGcGGG
I Difficult to recognize similarity even if you know the positions
17 / 59
Planted (k, d)-Motif Problem
I Common benchmark or challenge problem for motif finding
I Adjustable difficulty
I For example, the planted (15, 4)-motif problem is difficult but not impossible
Real data is noisier and more complex
I No upper bound d on the number of mutations
I Some positions have little variation, others have a lot
18 / 59
Scoring Motifs
I The output of motif finding is a set of k-mers, one from each input sequence
I In a desired output, the k-mers are as similar to each other as possible, representing a coherent motif
I How to measure the goodness of the motif?
20 / 59
Consensus Motif and Score
Motif occurrences   a G g t a c T t
                    C c A t a c g t
                    a c g t T A g t
                    a c g t C c A t
                    C c g t a c g G
Counts          A   3 0 1 0 3 1 1 0
                C   2 4 0 0 1 4 0 0
                G   0 1 4 0 0 0 3 1
                T   0 0 0 5 1 0 1 4
Consensus           A C G T A C G T
Score               2+1+1+0+2+1+2+1=10
I t motif occurrences, one from each sequence
I Count symbols in each column
I Consensus formed by most frequent symbols
I Score is the number of differences from consensus
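As a small check of the table above, the consensus and score can be computed column by column. The function names are illustrative, and breaking ties by first occurrence is an assumption (the example has no ties).

```python
from collections import Counter

def consensus(motifs):
    """Most frequent symbol in each column (ties go to the symbol seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*motifs))

def score(motifs):
    """Number of symbols that differ from the consensus."""
    cons = consensus(motifs)
    return sum(a != b for m in motifs for a, b in zip(m, cons))

# The five motif occurrences from the slide, with case ignored
motifs = [m.upper() for m in
          ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]]
# consensus(motifs) is "ACGTACGT" and score(motifs) is 10, as on the slide
```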
21 / 59
Consensus Motif and Score
I Good model for planted motif problem
I In real data, mutations should be expensive in conserved positions but much cheaper in variable positions
22 / 59
BruteForceMotifSearch
I Compute the score for every possible combination of motifs
I Output the set of motifs with the smallest score
23 / 59
Running Time of BruteForceMotifSearch
I $(n - k + 1)$ different k-mers in each sequence
I $(n - k + 1)^t$ different combinations of motifs
I $kt$ time to compute score for one set of motifs
I $kt(n - k + 1)^t = O(ktn^t)$ time in total
I E.g. for t = 20, n = 600, k = 15 we must perform approximately $10^{58}$ computations; it would take billions of years
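A minimal sketch of BruteForceMotifSearch, usable only on tiny inputs because it enumerates all $(n - k + 1)^t$ combinations. The function names and the toy input are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def score(motifs):
    # Column by column: count symbols that differ from the most frequent one
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))

def brute_force_motif_search(dna, k):
    # All k-mers of each string, then every combination of one k-mer per string
    kmers_per_string = [[s[i:i + k] for i in range(len(s) - k + 1)] for s in dna]
    return min(product(*kmers_per_string), key=score)

dna = ["TTACGTT", "ACGCC", "GGACG"]       # toy input: ACG is the only shared 3-mer
best = brute_force_motif_search(dna, 3)   # ("ACG", "ACG", "ACG"), score 0

# Size of the computation for the example on the slide
k, t, n = 15, 20, 600
operations = k * t * (n - k + 1) ** t     # on the order of 10^58
```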
24 / 59
The Median String Problem
I Given a set of t DNA sequences, find a pattern that appears in all t sequences with the minimum total number of mismatches
I This pattern will be the shared motif
26 / 59
Hamming Distance
I The Hamming distance d(v, w) is the number of mismatches between two k-mers v and w
I For example: d(AAAAAA, ACAAAC) = 2
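In Python the Hamming distance is a one-line count of mismatching positions. The function name and the length check are choices made here.

```python
def hamming_distance(v, w):
    """Number of positions where the equal-length strings v and w differ."""
    if len(v) != len(w):
        raise ValueError("k-mers must have equal length")
    return sum(a != b for a, b in zip(v, w))
```

For the example above, `hamming_distance("AAAAAA", "ACAAAC")` is 2.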
27 / 59
Computing Score
Motifs              a G g t a c T t   2
                    C c A t a c g t   2
                    a c g t T A g t   2
                    a c g t C c A t   2
                    C c g t a c g G   2
Score(Motifs)       2+1+1+0+2+1+2+1=10
Consensus(Motifs)   A C G T A C G T
I Score is the number of mismatching symbols
I Can be computed column by column or row by row
I Row sums are Hamming distances
28 / 59
Computing Score
Define
I Motifs = {Motif_1, Motif_2, ..., Motif_t}
I $d(Pattern, Motifs) = \sum_{i=1}^{t} d(Pattern, Motif_i)$
Then
I Score(Motifs) = d(Consensus(Motifs),Motifs)
29 / 59
Best Match Distance
I Assume |String| > |Pattern| = k
I The best match distance d(Pattern, String) is the smallest Hamming distance d(Pattern, Motif) between Pattern and any k-mer Motif in String
I Example: d(ACGTACGT, gcaaaAGGTACTTccaa) = 2
Generalize for a set of strings
I Dna = {Dna_1, Dna_2, ..., Dna_t}
I $d(Pattern, Dna) = \sum_{i=1}^{t} d(Pattern, Dna_i)$
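The two definitions above translate directly into code. This is a sketch with illustrative function names; it compares symbols case-sensitively, so the example string is upper-cased first.

```python
def hamming(v, w):
    return sum(a != b for a, b in zip(v, w))

def best_match_distance(pattern, string):
    """Smallest Hamming distance between pattern and any k-mer of string."""
    k = len(pattern)
    return min(hamming(pattern, string[i:i + k])
               for i in range(len(string) - k + 1))

def total_distance(pattern, dna):
    """d(Pattern, Dna): sum of best match distances over all strings."""
    return sum(best_match_distance(pattern, s) for s in dna)

example = best_match_distance("ACGTACGT", "gcaaaAGGTACTTccaa".upper())  # 2
```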
30 / 59
The Median String Problem
I Goal: Given a set of DNA sequences, find a median string
I Input: A collection of strings Dna and an integer k
I Output: A k-mer Pattern minimizing d(Pattern, Dna) among all k-mers Pattern
31 / 59
Motif Finding Problem = Median String Problem
I Input: Dna, k
I Motifs: output of Motif Finding (t k-mers)
I Pattern: output of Median String
I Score(Motifs) = d(Pattern,Dna)
Why?
I If Score(Motifs) < d(Pattern, Dna), we could choose Consensus(Motifs) as a better Pattern
I If Score(Motifs) > d(Pattern, Dna), we could choose the best match occurrences of Pattern as better Motifs
32 / 59
Median String Algorithm
MedianString(DNA, k)
BestPattern ← AAA...A
for each k-mer Pattern from AAA...A to TTT...T do
  if d(Pattern, DNA) < d(BestPattern, DNA) then
    BestPattern ← Pattern
return BestPattern
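One possible Python rendering of MedianString, assuming upper-case input over the alphabet ACGT. `itertools.product` enumerates the k-mers in the same order as the pseudocode, from AAA...A to TTT...T; the helper names are illustrative.

```python
from itertools import product

def best_match_distance(pattern, string):
    k = len(pattern)
    return min(sum(a != b for a, b in zip(pattern, string[i:i + k]))
               for i in range(len(string) - k + 1))

def median_string(dna, k):
    best_pattern, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=k):       # all 4^k k-mers
        pattern = "".join(letters)
        distance = sum(best_match_distance(pattern, s) for s in dna)
        if distance < best_distance:                 # first minimum wins on ties
            best_pattern, best_distance = pattern, distance
    return best_pattern
```

On the toy input `["TTACGTT", "ACGCC", "GGACG"]` with k = 3 this returns `"ACG"`, the only 3-mer that occurs in all three strings.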
33 / 59
Running Time of MedianString
I $4^k$ different k-mers
I $O(k \cdot n)$ time to compute the best match distance to one string
I $O(knt4^k)$ time in total
I E.g. for t = 20, n = 600, k = 15 this is about $2 \cdot 10^{14}$; still a lot but much less than $10^{58}$
I Reformulating a problem can help!
34 / 59
Search Space
I BruteForceMotifSearch and MedianString algorithms have exponential running time
I This is because the search space, the set of possible solutions, is exponential
  I $n^t$ different ways to choose Motifs
  I $4^k$ different ways to choose Pattern
36 / 59
Exploring Only Part of Search Space
Branch and bound algorithms (covered in study groups)
I Avoid regions that cannot improve solution
I Still exponential in the worst case
Greedy algorithms
I Search the most promising directions
I No guarantee of finding an optimal solution
Randomized algorithms
I Add randomness to greedy search
I Avoids getting stuck in a dead end
37 / 59
Profile Matrix
Motifs                 a  G  g  t  a  c  T  t
                       C  c  A  t  a  c  g  t
                       a  c  g  t  T  A  g  t
                       a  c  g  t  C  c  A  t
                       C  c  g  t  a  c  g  G
Count(Motifs)      A   3  0  1  0  3  1  1  0
                   C   2  4  0  0  1  4  0  0
                   G   0  1  4  0  0  0  3  1
                   T   0  0  0  5  1  0  1  4
Profile(Motifs)    A  .6  0 .2  0 .6 .2 .2  0
                   C  .4 .8  0  0 .2 .8  0  0
                   G   0 .2 .8  0  0  0 .6 .2
                   T   0  0  0  1 .2  0 .2 .8
Consensus(Motifs)      A  C  G  T  A  C  G  T
I Profile represents the probability of each nucleotide in each position
I More detailed summary of the set of motifs than consensus
38 / 59
k-Mer Probabilities
Profile   A  .6  0 .2  0 .6 .2 .2  0
          C  .4 .8  0  0 .2 .8  0  0
          G   0 .2 .8  0  0  0 .6 .2
          T   0  0  0  1 .2  0 .2 .8
The probability of a k-mer given a profile
I Pr(AGGTACTT | Profile) = .6 · .2 · .8 · 1 · .6 · .8 · .2 · .8 = 0.0073728
I Measures how well the k-mer matches the motif
I Does 0.0073728 imply a good match?
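The probability is the product of one profile entry per position. A short sketch, with the profile stored as a dictionary of rows (a representation chosen here for illustration):

```python
profile = {
    "A": [.6, 0, .2, 0, .6, .2, .2, 0],
    "C": [.4, .8, 0, 0, .2, .8, 0, 0],
    "G": [0, .2, .8, 0, 0, 0, .6, .2],
    "T": [0, 0, 0, 1, .2, 0, .2, .8],
}

def kmer_probability(kmer, profile):
    """Multiply the profile entry of each symbol at its position."""
    p = 1.0
    for j, symbol in enumerate(kmer):
        p *= profile[symbol][j]
    return p

p = kmer_probability("AGGTACTT", profile)   # about 0.0073728
```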
39 / 59
Profile-Most Probable k-mer
I The k-mer with the highest probability in a string
I Considered the best matching motif
I Example: The Profile-most probable 8-mer in gcaaaAGGTACTTccaa is AGGTACTT
I Pr(AGGTACTT | Profile) = 0.0073728
Profile   A  .6  0 .2  0 .6 .2 .2  0
          C  .4 .8  0  0 .2 .8  0  0
          G   0 .2 .8  0  0  0 .6 .2
          T   0  0  0  1 .2  0 .2 .8
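Finding the profile-most probable k-mer is a scan over all k-mers of the string. In this sketch ties go to the leftmost k-mer, which is an assumption; the slides do not specify a tie rule.

```python
profile = {
    "A": [.6, 0, .2, 0, .6, .2, .2, 0],
    "C": [.4, .8, 0, 0, .2, .8, 0, 0],
    "G": [0, .2, .8, 0, 0, 0, .6, .2],
    "T": [0, 0, 0, 1, .2, 0, .2, .8],
}

def kmer_probability(kmer, profile):
    p = 1.0
    for j, symbol in enumerate(kmer):
        p *= profile[symbol][j]
    return p

def most_probable_kmer(string, k, profile):
    """k-mer of string with the highest probability under profile."""
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    return max(kmers, key=lambda kmer: kmer_probability(kmer, profile))

best = most_probable_kmer("gcaaaAGGTACTTccaa".upper(), 8, profile)  # "AGGTACTT"
```

Every other 8-mer of this string hits a zero entry of the profile, so AGGTACTT is the only one with nonzero probability.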
40 / 59
Problem: Zero Probabilities
Profile     A  .6  0 .2  0 .6 .2 .2  0
            C  .4 .8  0  0 .2 .8  0  0
            G   0 .2 .8  0  0  0 .6 .2
            T   0  0  0  1 .2  0 .2 .8
Consensus       A  C  G  T  A  C  G  T
Pr(TCGTACGT | Profile) = 0 · .8 · .8 · 1 · .6 · .8 · .6 · .8 = 0
I Only one mismatch compared to consensus
I Should this probability really be 0?
41 / 59
Pseudocounts
I Add one to all counts
I Avoids zero counts
Count         A  3 0 1 0 3 1 1 0
              C  2 4 0 0 1 4 0 0
              G  0 1 4 0 0 0 3 1
              T  0 0 0 5 1 0 1 4
PseudoCount   A  4 1 2 1 4 2 2 1
              C  3 5 1 1 2 5 1 1
              G  1 2 5 1 1 1 4 2
              T  1 1 1 6 2 1 2 5
42 / 59
Laplace’s Rule of Succession
I Use pseudocounts instead of counts to compute probabilities
I As if we had seen one occurrence of each symbol before the main data
PseudoCount   A  4   1   2   1   4   2   2   1
              C  3   5   1   1   2   5   1   1
              G  1   2   5   1   1   1   4   2
              T  1   1   1   6   2   1   2   5
Profile       A  4/9 1/9 2/9 1/9 4/9 2/9 2/9 1/9
              C  3/9 5/9 1/9 1/9 2/9 5/9 1/9 1/9
              G  1/9 2/9 5/9 1/9 1/9 1/9 4/9 2/9
              T  1/9 1/9 1/9 6/9 2/9 1/9 2/9 5/9
Pr(TCGTACGT | Profile) = 1/9 · 5/9 · 5/9 · 6/9 · 4/9 · 5/9 · 4/9 · 5/9 = 60000/43046721 = 0.001393834
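The pseudocount profile can be reproduced exactly with rational arithmetic. This sketch uses Python's `fractions.Fraction`; the function names are illustrative.

```python
from fractions import Fraction

def profile_with_pseudocounts(motifs):
    """Laplace's rule: (count + 1) / (t + 4) for each symbol in each column."""
    t = len(motifs)
    profile = {s: [] for s in "ACGT"}
    for column in zip(*motifs):
        for s in "ACGT":
            profile[s].append(Fraction(column.count(s) + 1, t + 4))
    return profile

def kmer_probability(kmer, profile):
    p = Fraction(1)
    for j, symbol in enumerate(kmer):
        p *= profile[symbol][j]
    return p

motifs = ["AGGTACTT", "CCATACGT", "ACGTTAGT", "ACGTCCAT", "CCGTACGG"]
profile = profile_with_pseudocounts(motifs)
p = kmer_probability("TCGTACGT", profile)   # 60000/43046721, about 0.0013938
```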
43 / 59
Greedy Motif Search
Solve Motif Finding problem
I For each input string, choose the profile-most probable k-mer as the motif
  I Greedy choice
I The profile is computed from motifs of earlier strings
I In the first string, try all k-mers
44 / 59
Greedy Motif Search
GreedyMotifSearch(DNA, k, t)
BestMotifs ← the first k-mer of each string in DNA
for each k-mer Motif in the first string in DNA do
  Motif_1 ← Motif
  for i ← 2 to t do
    form Profile from Motif_1, ..., Motif_{i−1}
    Motif_i ← Profile-most probable k-mer in the i-th string in DNA
  Motifs ← Motif_1, ..., Motif_t
  if Score(Motifs) < Score(BestMotifs) then
    BestMotifs ← Motifs
return BestMotifs
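A compact Python sketch of GreedyMotifSearch. It assumes upper-case input, builds profiles with Laplace pseudocounts (the pseudocode leaves the kind of profile open), and takes t from the length of the input instead of a separate argument; all helper names are illustrative.

```python
from collections import Counter

def score(motifs):
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))

def profile_of(motifs):
    # One dict per column, with pseudocounts (assumption, see lead-in)
    t = len(motifs)
    return [{s: (col.count(s) + 1) / (t + 4) for s in "ACGT"}
            for col in zip(*motifs)]

def probability(kmer, profile):
    p = 1.0
    for symbol, column in zip(kmer, profile):
        p *= column[symbol]
    return p

def most_probable_kmer(string, k, profile):
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    return max(kmers, key=lambda kmer: probability(kmer, profile))

def greedy_motif_search(dna, k):
    best_motifs = [s[:k] for s in dna]
    first = dna[0]
    for i in range(len(first) - k + 1):        # try every k-mer of the first string
        motifs = [first[i:i + k]]
        for string in dna[1:]:                 # greedy choice in each later string
            motifs.append(most_probable_kmer(string, k, profile_of(motifs)))
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
    return best_motifs
```

On the toy input `["TTACGTT", "ACGCC", "GGACG"]` with k = 3, the iteration that starts from ACG in the first string picks ACG in the other two as well, giving score 0.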
45 / 59
Performance of GreedyMotifSearch
I Running time $O(n \cdot t \cdot k \cdot (t + n))$
  I polynomial, not exponential
I May not find the best motifs
  I Early choices may lead in a wrong direction
46 / 59
Randomized Algorithms
I Make random choices during computation
I Use random number generator to “toss coins” or to “roll dice”
Why Does Randomness Help?
I If a greedy algorithm fails for some input, it will always fail for that input
I If a randomized algorithm fails, it is unlikely to fail again in the same way
I We can run it many times and choose the best output
48 / 59
Monte Carlo and Las Vegas Algorithms
Monte Carlo algorithm
I May return an incorrect or suboptimal result
I Returns a correct answer or a good approximation with high probability (if repeated sufficiently many times)
Las Vegas algorithm
I Always returns a correct/optimal result
I Very long runtime is possible but very unlikely
49 / 59
Turning Monte Carlo into Las Vegas
1. Run the Monte Carlo algorithm
2. If the result is good, stop. Otherwise return to Step 1.
I Requires that a correct or optimal result can be easily recognized
  I This is not the case with the Motif Finding problem
I The following algorithms are Monte Carlo algorithms
50 / 59
Randomized Motif Search
Improving a set of motifs
I Starting with a set of motifs (one from each sequence)
  1. Compute a profile from the motifs
  2. Find the profile-most probable motif in each sequence
I The result is a potentially better set of motifs
I Repeat this as long as the set of motifs keeps improving
Randomization
I Start with a random set of motifs
51 / 59
Randomized Motif Search
RandomizedMotifSearch(DNA, k, t)
randomly select k-mers Motifs = (Motif_1, ..., Motif_t), one from each string in DNA
BestMotifs ← Motifs
while forever do
  Profile ← Profile(Motifs)
  for i ← 1 to t do
    Motif_i ← Profile-most probable k-mer in the i-th string in DNA
  Motifs ← Motif_1, ..., Motif_t
  if Score(Motifs) < Score(BestMotifs) then
    BestMotifs ← Motifs
  else
    return BestMotifs
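A sketch of RandomizedMotifSearch in Python, together with a wrapper that repeats it and keeps the best result. The pseudocount profile, the explicit `rng` argument, and the wrapper `repeated_search` are assumptions added here for illustration.

```python
import random
from collections import Counter

def score(motifs):
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))

def profile_of(motifs):
    t = len(motifs)
    return [{s: (col.count(s) + 1) / (t + 4) for s in "ACGT"}
            for col in zip(*motifs)]

def probability(kmer, profile):
    p = 1.0
    for symbol, column in zip(kmer, profile):
        p *= column[symbol]
    return p

def most_probable_kmer(string, k, profile):
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    return max(kmers, key=lambda kmer: probability(kmer, profile))

def randomized_motif_search(dna, k, rng):
    # Random start: one k-mer from each string
    starts = [rng.randrange(len(s) - k + 1) for s in dna]
    motifs = [s[i:i + k] for s, i in zip(dna, starts)]
    best_motifs = motifs
    while True:
        profile = profile_of(motifs)
        motifs = [most_probable_kmer(s, k, profile) for s in dna]
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
        else:
            return best_motifs          # no improvement: stop

def repeated_search(dna, k, runs, seed=0):
    """Run the Monte Carlo search several times and keep the best output."""
    rng = random.Random(seed)
    return min((randomized_motif_search(dna, k, rng) for _ in range(runs)),
               key=score)
```

The loop terminates because the score strictly decreases in every round that continues and can never drop below 0.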
52 / 59
Why Does Randomized Motif Search Work?
I If Motifs is a random set, the expectation is that Profile(Motifs) has about the same probability 0.25 for each symbol in each column
I If Motifs contains some of the true motifs, it is not random and Profile(Motifs) reflects this
I Then Profile(Motifs) is more likely to match the other true motifs
I Thus we might need just a few of the true motifs in the initial set
I This will happen eventually if repeated many times (may require thousands of repeats)
53 / 59
Gibbs Sampler
I Gibbs Sampler is a more refined randomized algorithm
I Compared to Randomized Motif Search, Gibbs Sampler is
  I More cautious
  I More randomized
54 / 59
Gibbs Sampler Is More Cautious
I Randomized Motif Search might get some true motifs right but throw them all away in the next round
I Gibbs Sampler changes just one motif in each round
55 / 59
Gibbs Sampler Is More Randomized
I Randomized Motif Search uses randomness only in the beginning
I Gibbs Sampler uses randomness in every round
  I Choose a random motif to discard
  I Replace it with a random motif (from the same sequence)
  I The second random choice is biased: a profile-randomly generated k-mer
56 / 59
Profile-Randomly Generated k-Mer
I Given a Profile and a String
  1. Compute probabilities of all k-mers in String
  2. Choose one of the k-mers randomly but biased by the probabilities
I The probabilities with respect to Profile do not usually sum up to 1 and have to be normalized: Replace $p_1, \ldots, p_n$ with $p_1/C, \ldots, p_n/C$, where $C = \sum_{i=1}^{n} p_i$
I Example
  I p1 = 0.1, p2 = 0.2, p3 = 0.3
  I C = 0.1 + 0.2 + 0.3 = 0.6
  I p1/C = 1/6, p2/C = 1/3, p3/C = 1/2
  I p1/C + p2/C + p3/C = 1/6 + 1/3 + 1/2 = 1
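The normalization and the biased choice can be sketched as below. `random.choices` accepts unnormalized weights, so the explicit `normalize` is shown only to mirror the example; the function names and the column-wise profile representation are assumptions. The choice fails if every k-mer has probability 0, which cannot happen with a pseudocount profile.

```python
import random

def normalize(probabilities):
    """Divide by the sum C so that the values add up to 1."""
    c = sum(probabilities)
    return [p / c for p in probabilities]

def probability(kmer, profile):
    # profile: one dict per column mapping symbol -> probability
    p = 1.0
    for symbol, column in zip(kmer, profile):
        p *= column[symbol]
    return p

def profile_random_kmer(string, k, profile, rng=random):
    """Pick a k-mer of string at random, biased by its probability."""
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    weights = [probability(kmer, profile) for kmer in kmers]
    return rng.choices(kmers, weights=weights, k=1)[0]

normalized = normalize([0.1, 0.2, 0.3])   # approximately [1/6, 1/3, 1/2]
```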
57 / 59
Gibbs Sampler
GibbsSampler(DNA, k , t,N)
randomly select k-mers Motifs = (Motif_1, ..., Motif_t), one in each string in DNA
BestMotifs ← Motifs
for j ← 1 to N do
  i ← Random(t)
  Profile ← profile matrix constructed from all strings in Motifs except for Motif_i
  Motif_i ← profile-randomly generated k-mer in the i-th sequence in DNA
  if Score(Motifs) < Score(BestMotifs) then
    BestMotifs ← Motifs
return BestMotifs
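A Python sketch of GibbsSampler. It assumes upper-case input and a pseudocount profile (so that every k-mer has nonzero probability), uses 0-based indices where the pseudocode uses Random(t), and takes an explicit `rng`; the helper names are illustrative.

```python
import random
from collections import Counter

def score(motifs):
    return sum(len(col) - max(Counter(col).values()) for col in zip(*motifs))

def profile_of(motifs):
    t = len(motifs)
    return [{s: (col.count(s) + 1) / (t + 4) for s in "ACGT"}
            for col in zip(*motifs)]

def probability(kmer, profile):
    p = 1.0
    for symbol, column in zip(kmer, profile):
        p *= column[symbol]
    return p

def profile_random_kmer(string, k, profile, rng):
    kmers = [string[i:i + k] for i in range(len(string) - k + 1)]
    weights = [probability(kmer, profile) for kmer in kmers]
    return rng.choices(kmers, weights=weights, k=1)[0]

def gibbs_sampler(dna, k, n_rounds, rng):
    t = len(dna)
    starts = [rng.randrange(len(s) - k + 1) for s in dna]
    motifs = [s[i:i + k] for s, i in zip(dna, starts)]
    best_motifs = list(motifs)
    for _ in range(n_rounds):
        i = rng.randrange(t)                               # motif to discard
        profile = profile_of(motifs[:i] + motifs[i + 1:])  # all motifs except the i-th
        motifs[i] = profile_random_kmer(dna[i], k, profile, rng)
        if score(motifs) < score(best_motifs):
            best_motifs = list(motifs)                     # copy: motifs changes in place
    return best_motifs
```

A call such as `gibbs_sampler(dna, 3, 50, random.Random(1))` returns one k-mer per input string; different seeds can give different results.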
58 / 59
Gibbs Sampler
I Because of randomness in every round, Gibbs Sampler can keep on running without getting stuck in a single solution
I However, it may end up exploring the same small set of solutions repeatedly: It gets stuck in a local optimum
I This can be corrected by restarting from a new random set of motifs every now and then
59 / 59