Greedy algorithms for the shortest common superstring that are asymptotically optimal

Frieze, Alan; Szpankowski, Wojciech

doi:10.1007/3-540-61680-2_56

Alan Frieze¹ & Wojciech Szpankowski²

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1136)

Included in the following conference series:

European Symposium on Algorithms

Abstract

There has recently been a resurgence of interest in the shortest common superstring problem due to its important applications in molecular biology (e.g., recombination of DNA) and data compression. The problem is NP-hard, but it has been known for some time that greedy algorithms work well for it. More precisely, a recent sequence of papers proved that, in the worst case, a greedy algorithm produces a superstring whose length is at most β times (2 ≤ β ≤ 4) the optimal length. We analyze the problem in a probabilistic framework and consider the optimal total overlap \(O_n^{opt}\) and the overlap \(O_n^{gr}\) produced by various greedy algorithms. These turn out to be asymptotically equivalent. We show that in several cases, with high probability, \(\lim_{n \to \infty} \frac{O_n^{opt}}{n \log n} = \lim_{n \to \infty} \frac{O_n^{gr}}{n \log n} = \frac{1}{H}\), where n is the number of original strings and H is the entropy of the underlying alphabet. Our results hold provided that none of the strings is too short. Finally, we provide several generalizations and extensions of our basic result.
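
The greedy idea in the abstract can be illustrated with a minimal Python sketch. It assumes the classic merge heuristic (repeatedly merge the pair of strings with the largest suffix-prefix overlap), which is not necessarily any of the specific greedy variants analyzed in the paper. The function names `overlap`, `greedy_superstring`, `entropy` and `predicted_overlap` are hypothetical and chosen only for this illustration, and `predicted_overlap` simply evaluates the leading-order term n log n / H from the abstract, so it is meaningful only for large n under the probabilistic model.

```python
import math
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Classic greedy merge (an illustrative assumption, not the paper's exact variants).

    Returns (superstring, total_overlap).
    """
    # Drop duplicates, then drop strings contained in another string.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]
    total_overlap = 0
    while len(items) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(items)), 2):
            k = overlap(items[i], items[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the shared part only once.
        merged = items[best_i] + items[best_j][best_k:]
        total_overlap += best_k
        items = [s for idx, s in enumerate(items) if idx not in (best_i, best_j)]
        items.append(merged)
    return (items[0] if items else ""), total_overlap


def entropy(probs):
    """Shannon entropy in nats of a symbol distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def predicted_overlap(n: int, probs) -> float:
    """Leading-order estimate n * log(n) / H (natural log, to match entropy())."""
    return n * math.log(n) / entropy(probs)
```

For example, `greedy_superstring(["abc", "bcd", "cde"])` returns `("abcde", 4)`: the three strings have total length 9, and the two merges save 2 characters each. The logarithm in `predicted_overlap` uses the same base as `entropy`, so the ratio does not depend on the choice of base.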

This work was supported by CCR-9225008.

This research was supported in part by NSF Grants CCR-9201078, NCR-9206315 and NCR-9415491, and in part by NATO Collaborative Grant CGR.950060.

Author information

Authors and Affiliations

¹ Dept. of Mathematics, Carnegie Mellon University, Pittsburgh, PA 15213, USA (Alan Frieze)
² Dept. of Computer Science, Purdue University, W. Lafayette, IN 47907, USA (Wojciech Szpankowski)

Editor information

Josep Diaz, Maria Serna

About this paper

Cite this paper

Frieze, A., Szpankowski, W. (1996). Greedy algorithms for the shortest common superstring that are asymptotically optimal. In: Diaz, J., Serna, M. (eds) Algorithms - ESA '96. ESA 1996. Lecture Notes in Computer Science, vol 1136. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61680-2_56

DOI: https://doi.org/10.1007/3-540-61680-2_56
Published: 06 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61680-1
Online ISBN: 978-3-540-70667-0
eBook Packages: Springer Book Archive
