Abstract
There has recently been a resurgence of interest in the shortest common superstring problem due to its important applications in molecular biology (e.g., recombination of DNA) and data compression. The problem is NP-hard, but it has been known for some time that greedy algorithms work well for it. More precisely, a recent sequence of papers proved that in the worst case a greedy algorithm produces a superstring that is at most β times (2 ≤ β ≤ 4) longer than optimal. We analyze the problem in a probabilistic framework, and consider the optimal total overlap \(O_n^{opt}\) and the overlap \(O_n^{gr}\) produced by various greedy algorithms. These turn out to be asymptotically equivalent. We show that in several cases, with high probability, \(\lim_{n \to \infty} \frac{O_n^{opt}}{n \log n} = \lim_{n \to \infty} \frac{O_n^{gr}}{n \log n} = \frac{1}{H}\), where n is the number of original strings and H is the entropy of the underlying alphabet. Our results hold provided that none of the strings is too short. Finally, we provide several generalizations and extensions of our basic result.
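For concreteness, the following is a minimal sketch of one standard greedy heuristic of the kind the abstract refers to: repeatedly merge the two strings with the largest suffix-prefix overlap, accumulating the total overlap. The abstract does not say which greedy variants the paper analyzes, so this should be read as an illustration only; the names `overlap` and `greedy_superstring` are hypothetical, and the quadratic pair search is chosen for clarity rather than speed.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Return (superstring, total_overlap) built by greedy pairwise merging.

    Assumption for this sketch: duplicates and strings contained in
    another input string are discarded first, since they add nothing
    to the superstring.
    """
    pool = list(dict.fromkeys(strings))  # drop duplicates, keep order
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]

    total_overlap = 0
    while len(pool) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(pool)), 2):
            k = overlap(pool[i], pool[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j

        # Merge the pair, writing the shared part only once.
        merged = pool[best_i] + pool[best_j][best_k:]
        total_overlap += best_k
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)

    return (pool[0] if pool else ""), total_overlap


if __name__ == "__main__":
    superstring, total = greedy_superstring(["abc", "bcd", "cde"])
    print(superstring, total)
```

The superstring length equals the sum of the input lengths minus the accumulated overlap, which is why the abstract states its result in terms of the overlaps \(O_n^{opt}\) and \(O_n^{gr}\) rather than superstring lengths.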
This work was supported by CCR-9225008.
This research was supported in part by NSF Grants CCR-9201078, NCR-9206315 and NCR-9415491, and in part by NATO Collaborative Grant CGR.950060.
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Frieze, A., Szpankowski, W. (1996). Greedy algorithms for the shortest common superstring that are asymtotically optimal. In: Diaz, J., Serna, M. (eds) Algorithms — ESA '96. ESA 1996. Lecture Notes in Computer Science, vol 1136. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61680-2_56
DOI: https://doi.org/10.1007/3-540-61680-2_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61680-1
Online ISBN: 978-3-540-70667-0
eBook Packages: Springer Book Archive