Why Greed Works for Shortest Common Superstring Problem?
Bin Ma
University of Western Ontario
Shortest Common Superstring (SCS)
Given n strings s1, s2, ..., sn, find the shortest superstring s so that each si is a substring of s.
Background:
  Data compression
  Sequence reconstruction, especially DNA assembly
Greedy Algorithm
Algorithm: Repeatedly merge the two strings with the longest overlap, until only one string is left.
The algorithm works very well in practice.
Is there any proof or explanation for the good performance?
Example (figure):
thatwasawholelotof
             lotoffoodandfun
thatwasawholelotoffoodandfun
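The greedy rule can be sketched in a few lines of Python. This is a minimal illustration, not the implementation analysed in the talk: the function names, the tie-breaking rule, and the initial removal of contained strings are assumptions made here. The input is the phrase from the example, read as the two fragments "thatwasawholelotof" and "lotoffoodandfun".

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: repeatedly merge the pair with the longest overlap."""
    # Assumption: duplicates and strings contained in another input are dropped
    # first, since any superstring of the remaining strings already covers them.
    unique = list(dict.fromkeys(strings))
    pool = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(pool) > 1:
        # Choose the ordered pair (i, j) with the longest overlap. Ties are
        # broken arbitrarily (here: in favour of the larger indices).
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(pool)
            for j, b in enumerate(pool)
            if i != j
        )
        merged = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""


fragments = ["thatwasawholelotof", "lotoffoodandfun"]
print(greedy_superstring(fragments))  # thatwasawholelotoffoodandfun
```

Each round scans all ordered pairs, so the sketch is quadratic per merge. That is fine for illustration but too slow for real assembly data.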
History
Gallant et al. 1980: first proposed the problem and showed that it is NP-hard.
Tarhio and Ukkonen 1988: conjectured that Greedy has approximation ratio 2.
Blum et al. 1994: Greedy has ratio 4.
Frieze and Szpankowski 1998: Greedy's average performance is asymptotically optimal.
Romero et al. 2004: Greedy has ratio 1.014 on simulated data.
Kaplan and Shafrir 2005: Greedy has ratio 3.5.
Vassilevska 2005: SCS is Max-SNP hard.
Aside: a series of publications from 1990 to 2000 proposed different algorithms achieving ratios 4, 3, 2.75, 2.67, ..., 2.5.
The Question
Why does the simple greedy algorithm work well in practice?
Frieze and Szpankowski 1998: Greedy's average performance is asymptotically optimal.
  Each input string is independently random.
  Hence two strings do not have a long overlap (or the probability is small).
  The optimal solution is not much better than concatenation.
But practical cases have long overlaps.
We need a better model to generate the random instances.
Smoothed Analysis
Spielman and Teng, JACM 2004, "Why the Simplex Algorithm Usually Takes Polynomial Time".
For any given input matrix, a small perturbation (adding small Gaussian noise to each coefficient) generates an easy instance with high probability.
Or, in other words, the average running time over a local subset of instances is polynomial.
This is stronger than average-case analysis.
[Figure labels: All instances; An arbitrary instance; perturbed instances]
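The three kinds of analysis can be written side by side. The notation below is illustrative and not taken from the slides: T_A(x) stands for the cost of algorithm A on input x, D_n for the set of inputs of size n, and g for a random perturbation scaled by sigma. This is the simplified textbook form of the definition, not necessarily the exact one used by Spielman and Teng.

```latex
% Illustrative notation (an assumption, not from the slides):
%   T_A(x)  cost of algorithm A on input x
%   D_n     inputs of size n
%   g       random perturbation, scaled by \sigma
\begin{align*}
\text{worst case:}   \quad & \max_{x \in D_n} T_A(x) \\
\text{average case:} \quad & \mathbb{E}_{x \in D_n}\bigl[T_A(x)\bigr] \\
\text{smoothed:}     \quad & \max_{x \in D_n} \mathbb{E}_{g}\bigl[T_A(x + \sigma g)\bigr]
\end{align*}
```

The smoothed measure still takes a maximum over all starting instances, but averages only over the small neighbourhood of each one.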
Our Result
Roughly: the Greedy algorithm is a PTAS under smoothed analysis.
Instance generation: a parent sequence and the substring locations.
Perturbation: randomly mutate each letter of the parent sequence with probability p = O(log(nm)/m), without changing the substring locations.
A given instance:
ACGTAAGGTTTTAGCGTTTTAGAT
ACGTAAGGTTTTAG
      GGTTTTAGCGTTTT
               GTTTTAGAT

A perturbed instance:
ACCTAAGGTGTTAGCGATTTAGCT
ACCTAAGGTGTTAG
      GGTGTTAGCGATTT
               GATTTAGCT
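The perturbation model can be sketched as follows. Several details are assumptions made for this sketch and are not stated on the slides: a mutated letter is replaced by a different letter drawn uniformly from the alphabet, m is taken to be the length of the input strings, and the constant c hidden in the O(.) is a placeholder. The parent sequence and the cut positions (0, 6 and 15) are read off the example above.

```python
import math
import random


def mutation_probability(n, m, c=1.0):
    """p = c * log(n*m) / m, capped at 1. The constant c is a placeholder."""
    return min(1.0, c * math.log(n * m) / m)


def perturb_instance(parent, locations, p, alphabet="ACGT", rng=None):
    """Mutate each parent letter with probability p, then re-cut the substrings.

    `locations` holds (start, end) pairs, end exclusive, and is left unchanged.
    Assumption: a mutated letter is replaced by a different letter chosen
    uniformly at random from `alphabet`.
    """
    rng = rng or random.Random()
    mutated = "".join(
        rng.choice([c for c in alphabet if c != ch]) if rng.random() < p else ch
        for ch in parent
    )
    return mutated, [mutated[start:end] for start, end in locations]


parent = "ACGTAAGGTTTTAGCGTTTTAGAT"
locations = [(0, 14), (6, 20), (15, 24)]
p = mutation_probability(n=len(locations), m=14)
mutated_parent, strings = perturb_instance(parent, locations, p, rng=random.Random(0))
print(mutated_parent)
print(strings)
```

Because the substrings are cut from the mutated parent at the original locations, every overlap implied by those locations survives the perturbation by construction.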
Our Result
Theorem: Starting from any given instance, for instances generated by the above perturbation, the simple greedy algorithm has average ratio 1 + epsilon.
The analysis is done with any given instance, similar to worst-case analysis.
  Each perturbed instance has a very similar structure to the given instance.
  The perturbation is so insignificant that a practical distribution should not prefer one instance to another.
The theorem explains why Greed usually has good performance.
Properties of the Perturbation
Property 1. Long overlaps in the parent sequence are preserved (these are called consistent overlaps).
Property 2. An inconsistent overlap should not be very long.
[Figure: the given instance and the perturbed instance, repeated from the "Our Result" slide.]
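Both properties can be checked on the example instance. The sketch below compares, for every ordered pair of strings, the overlap implied by the parent locations with the overlap actually present in the strings. The locations (0, 14), (6, 20), (15, 24) are read off the figure, and the helper names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def consistent_overlap(loc_a, loc_b):
    """Overlap implied by the parent locations (start, end), end exclusive."""
    (start_a, end_a), (start_b, end_b) = loc_a, loc_b
    if start_a < start_b and end_a < end_b:
        return max(0, end_a - start_b)
    return 0


locations = [(0, 14), (6, 20), (15, 24)]
given = ["ACGTAAGGTTTTAG", "GGTTTTAGCGTTTT", "GTTTTAGAT"]
perturbed = ["ACCTAAGGTGTTAG", "GGTGTTAGCGATTT", "GATTTAGCT"]


def overlap_table(strings):
    """Rows of (i, j, consistent overlap, actual overlap) for all i != j."""
    return [
        (i, j, consistent_overlap(locations[i], locations[j]), overlap(a, b))
        for i, a in enumerate(strings)
        for j, b in enumerate(strings)
        if i != j
    ]


for name, strings in (("given", given), ("perturbed", perturbed)):
    for i, j, consistent, actual in overlap_table(strings):
        print(name, i, j, "consistent:", consistent, "actual:", actual)
```

In the given instance, strings 0 and 2 share an overlap of length 7 (GTTTTAG) although they do not overlap in the parent. In the perturbed instance this inconsistent overlap shrinks to length 1, while the consistent overlaps of length 8 and 5 are unchanged.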
Sketch of Proof
Lemma 2. With high probability, all inconsistent overlaps are shorter than epsilon*m.
Lemma 3. Greedy is a PTAS when the condition of Lemma 2 is satisfied.
Idea: Greedy will use the long consistent overlaps first. The short overlaps do not affect the total length very much.
The theorem is a corollary of Lemma 3.
Observations
Under worst-case analysis, SCS is Max-SNP hard. But under smoothed analysis, the simple greedy algorithm is a PTAS.
  The smoothed complexity of an algorithm can be better than the lower bound of the problem.
Smoothed analysis provides a powerful middle ground between worst-case and average-case analysis.
To be meaningful, the perturbation should be insignificant.
  Otherwise there is no difference from (or it is even worse than) average-case analysis.
Another Result in the Paper
Some letters are undetermined by the DNA sequencer, e.g. ACGNTT.
SCS with wildcards: same as SCS except that input strings may contain wildcards.
SCS with wildcards cannot be approximated within ratio n^(1/7-epsilon).
But the greedy algorithm is still a PTAS under smoothed analysis.
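To make the wildcard variant concrete, here is a small sketch of an overlap test in which the wildcard letter matches any single letter. The slides do not spell out the matching rule, so this rule, the choice of N as the wildcard, and the function names are assumptions.

```python
def letters_match(x, y, wildcard="N"):
    """Two letters match if they are equal or if either one is the wildcard."""
    return x == y or x == wildcard or y == wildcard


def wildcard_overlap(a, b, wildcard="N"):
    """Longest k such that the last k letters of `a` match the first k of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if all(letters_match(x, y, wildcard) for x, y in zip(a[-k:], b[:k])):
            return k
    return 0


print(wildcard_overlap("ACGNTT", "GATTCA"))  # 4: GNTT matches GATT
```

Matching with wildcards is not transitive (N matches both A and C, but A does not match C), so overlaps computed this way behave less predictably than exact overlaps.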