
244_Ma

Date post: 09-Jan-2016
Upload: fabioreinoso
Description:
DNA sequence problem


Transcript
  • Bin Ma, University of Western Ontario
    Why Greed Works for the Shortest Common Superstring Problem?

  • Shortest Common Superstring (SCS)
    Given n strings s1, s2, ..., sn, find the shortest superstring s such that each si is a substring of s.
    Background:
      Data compression
      Sequence reconstruction, esp. DNA assembly

  • Greedy Algorithm
    Algorithm: Repeatedly merge the two strings with the longest overlap, until only one string is left.

    The algorithm works very well in practice.
    Is there any proof or explanation for the good performance?

    thatwasawholelotof
                 lotoffoodandfun
    thatwasawholelotoffoodandfun
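The merge rule on this slide can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration rather than the talk's implementation: the function names `overlap` and `greedy_superstring` are made up for this example, ties between equally long overlaps are broken arbitrarily (by scan order), and strings contained in other strings are dropped up front, which is a common simplifying assumption.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the ordered pair with the longest overlap."""
    # Assumption: duplicates and strings contained in another string are
    # removed first, since they never lengthen a superstring.
    uniq = list(dict.fromkeys(strings))
    pool = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(pool)), 2):
            k = overlap(pool[i], pool[j])
            if k > best_k:  # ties keep the first pair found
                best_k, best_i, best_j = k, i, j
        # Glue pool[best_j] onto pool[best_i], sharing the overlapping part.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0] if pool else ""


print(greedy_superstring(["thatwasawholelotof", "lotoffoodandfun"]))
```

Reading the slide's example text as the two fragments `thatwasawholelotof` and `lotoffoodandfun`, the longest overlap is `lotof`, and the sketch merges them into `thatwasawholelotoffoodandfun`.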

  • History
    Gallant et al. 1980: first proposed the problem; showed it is NP-hard.
    Tarhio and Ukkonen 1988: conjectured that Greedy is a ratio-2 approximation.
    Blum et al. 1994: Greedy has ratio 4.
    Frieze and Szpankowski 1998: Greedy's average performance is asymptotically optimal.
    Romero et al. 2004: Greedy achieves ratio 1.014 on simulated data.
    Kaplan and Shafrir 2005: Greedy has ratio 3.5.
    Vassilevska 2005: SCS is Max-SNP hard.
    Aside: a series of publications in 1990-2000 proposed different algorithms achieving ratios 4, 3, 2.75, 2.67, ..., 2.5.

  • The Question
    Why does the simple greedy algorithm work well in practice?
    Frieze and Szpankowski 1998: Greedy's average performance is asymptotically optimal.
      Each input string is independently random.
      Hence two strings are unlikely to have a long overlap.
      The optimal solution is not much better than concatenation.
    But practical cases have long overlaps.
    We need a better model to generate the random instances.

  • Smoothed Analysis
    Spielman and Teng, JACM 2004: "Why the Simplex Algorithm Usually Takes Polynomial Time."
    For any given input matrix, a small perturbation (adding small Gaussian noise to each coefficient) generates an easy instance with high probability.
    Put differently, the average over a local subset of instances is polynomial.
    This is stronger than average-case analysis.

    [Figure: the set of all instances, an arbitrary instance, and the perturbed instances around it]

  • Our Result
    Roughly: the Greedy algorithm is a PTAS under smoothed analysis.
    Instance generation: a parent sequence and the substring locations.
    Perturbation: randomly mutate each letter of the parent sequence with probability p = O(log(nm)/m), without changing the substring locations.

    A given instance:
      ACGTAAGGTTTTAGCGTTTTAGAT   (parent sequence)
      ACGTAAGGTTTTAG
            GGTTTTAGCGTTTT
                     GTTTTAGAT

    A perturbed instance:
      ACCTAAGGTGTTAGCGATTTAGCT   (parent sequence)
      ACCTAAGGTGTTAG
            GGTGTTAGCGATTT
                     GATTTAGCT
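The perturbation model on this slide can be illustrated with a short Python sketch. Everything named here is hypothetical: `perturb_instance`, its parameters, and the half-open `(start, end)` encoding of substring locations are choices made for this example. The slide also does not say what a mutated letter becomes, so the sketch assumes it is replaced by a different letter chosen uniformly from the alphabet.

```python
import random


def perturb_instance(parent, locations, p, alphabet="ACGT", rng=None):
    """Mutate each letter of parent independently with probability p,
    then re-extract the substrings at the unchanged locations."""
    rng = rng or random.Random()
    mutated = []
    for ch in parent:
        if rng.random() < p:
            # Assumption: a mutation picks a *different* letter uniformly.
            ch = rng.choice([c for c in alphabet if c != ch])
        mutated.append(ch)
    new_parent = "".join(mutated)
    # Locations are half-open (start, end) index pairs into the parent.
    return new_parent, [new_parent[start:end] for start, end in locations]


parent = "ACGTAAGGTTTTAGCGTTTTAGAT"
locations = [(0, 14), (6, 20), (15, 24)]
new_parent, substrings = perturb_instance(parent, locations, p=0.15)
print(new_parent)
print(substrings)
```

The three locations above are the ones that reproduce the substrings in the slide's figure from its parent sequence. The value `p=0.15` is only for display; the slide's p = O(log(nm)/m) is an asymptotic rate, not a concrete number.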

  • Our Result
    Theorem: Starting from any given instance, for instances generated by the above perturbation, the simple greedy algorithm has average ratio 1+epsilon.
    The analysis is done with any given instance, similar to worst-case analysis.
    Each perturbed instance has a very similar structure to the given instance.
      The perturbation is so insignificant that a practical distribution should not prefer one instance to another.
    The theorem explains why Greedy usually performs well.

  • Properties of the Perturbation
    Property 1. Long overlaps in the parent sequence are preserved (called consistent overlaps).
    Property 2. Inconsistent overlaps should not be very long.

    A given instance:
      ACGTAAGGTTTTAGCGTTTTAGAT   (parent sequence)
      ACGTAAGGTTTTAG
            GGTTTTAGCGTTTT
                     GTTTTAGAT

    A perturbed instance:
      ACCTAAGGTGTTAGCGATTTAGCT   (parent sequence)
      ACCTAAGGTGTTAG
            GGTGTTAGCGATTT
                     GATTTAGCT

  • Sketch of Proof
    Lemma 1: The probability of having a length-k inconsistent overlap is less than (1-p)^k.
    Proof: At the pointed position, whether X is mutated is independent of the other positions. The probability of match is
  • Sketch of Proof
    Lemma 2: With high probability, all inconsistent overlaps are shorter than epsilon*m.
    Lemma 3: Greedy is a PTAS when Lemma 2's condition is satisfied.
    Idea: Greedy will use the long consistent overlaps first. The short overlaps do not affect the total length very much.
    The theorem is a corollary of Lemma 3.
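To see why the mutation rate p = O(log(nm)/m) fits the threshold epsilon*m, here is a short worked bound. It is an illustration under stated assumptions, not the paper's proof: the constant c stands for whatever is hidden in the O(·), the logarithm is taken to be natural, and the exponent d bounding the number of candidate inconsistent overlaps is hypothetical.

```latex
% Lemma 1 bound, simplified with 1 - p \le e^{-p}:
(1-p)^k \;\le\; e^{-pk}.

% Assume p = c \ln(nm)/m and take k = \epsilon m:
(1-p)^{\epsilon m} \;\le\; e^{-c\,\epsilon \ln(nm)} \;=\; (nm)^{-c\epsilon}.

% If there are at most (nm)^d candidate inconsistent overlaps
% (d is an assumed exponent), a union bound gives
\Pr[\text{some inconsistent overlap has length} \ge \epsilon m]
  \;\le\; (nm)^{d}\,(nm)^{-c\epsilon} \;=\; (nm)^{\,d - c\epsilon},

% which tends to 0 as nm grows whenever c > d/\epsilon.
```

Under these assumptions, a sufficiently large constant in the mutation rate makes every long inconsistent overlap unlikely at once, which is the shape of statement Lemma 2 makes.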

  • Observations
    Under worst-case analysis, SCS is Max-SNP hard. But under smoothed analysis, the simple greedy algorithm is a PTAS.
    The smoothed complexity of an algorithm can be better than the lower bound of the problem.
    Smoothed analysis provides a powerful middle ground between worst-case and average-case analysis.
    To be meaningful, the perturbation should be insignificant.
      Otherwise it is no different from (or even worse than) average-case analysis.

  • Another Result in the Paper
    Some letters are left undetermined by the DNA sequencer, e.g. ACGNTT.
    SCS with wildcards: same as SCS, except that input strings may contain wildcards.
    SCS with wildcards cannot be approximated with ratio n^(1/7-epsilon).
    But the greedy algorithm is still a PTAS under smoothed analysis.
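As a small illustration of what changes when wildcards are allowed, the overlap test can be relaxed so that an undetermined letter is compatible with anything. This sketch assumes that the wildcard symbol is `N` (as in the slide's ACGNTT example) and that it matches any single letter; the names `compatible` and `wildcard_overlap` are invented for this example, and the paper's exact definition may differ.

```python
def compatible(x: str, y: str, wildcard: str = "N") -> bool:
    """Two letters are compatible if equal or if either is the wildcard."""
    return x == y or x == wildcard or y == wildcard


def wildcard_overlap(a: str, b: str, wildcard: str = "N") -> int:
    """Longest k such that the last k letters of a are position-wise
    compatible with the first k letters of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if all(compatible(x, y, wildcard) for x, y in zip(a[-k:], b[:k])):
            return k
    return 0


print(wildcard_overlap("ACGNTT", "GATTC"))
```

Here the suffix `GNTT` lines up with the prefix `GATT`, with `N` standing in for `A`. Without wildcards the same pair only overlaps in a single letter.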
