Combinatorial algorithms for DNA sequence assembly

Kececioglu, J. D.; Myers, E. W.

doi:10.1007/BF01188580

Published: February 1995

Algorithmica, Volume 13, pages 7–51 (1995)

Abstract

The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates, and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates of as high as 10%.
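To make the formulation in the abstract concrete, the following is a minimal sketch of the well-known greedy merge heuristic for the shortest common superstring problem, extended to try both orientations of each fragment. It is not the four-phase method of the paper: it assumes error-free fragments and exact suffix-prefix overlaps, and the names `reverse_complement`, `overlap`, and `greedy_assemble` are illustrative, not taken from the article.

```python
def reverse_complement(s):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(s))


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Greedily merge the pair of fragments with the largest exact overlap.

    Assumption: fragments are error-free. Each fragment may come from
    either strand, so both orientations are tried for every pair.
    Returns the list of contigs left when no overlaps remain.
    """
    # Drop duplicates and fragments contained in a longer one (either strand).
    frags = []
    for f in sorted(set(fragments), key=lambda s: (-len(s), s)):
        rc = reverse_complement(f)
        if not any(f in g or rc in g for g in frags):
            frags.append(f)

    while len(frags) > 1:
        best = (0, None, None, None)  # (overlap length, i, j, merged string)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i == j:
                    continue
                for a_or in (a, reverse_complement(a)):
                    for b_or in (b, reverse_complement(b)):
                        k = overlap(a_or, b_or)
                        if k > best[0]:
                            best = (k, i, j, a_or + b_or[k:])
        if best[0] == 0:
            break  # no overlapping pair left; remaining contigs are disjoint
        _, i, j, merged = best
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags


# Usage: three reads from a short sequence; the third is from the opposite strand.
target = "ATGCGTACGTTAGC"
fragments = ["ATGCGTAC", "TACGTTA", "GCTAA"]
contigs = greedy_assemble(fragments)
```

Because orientation is unknown, the result can only be expected to match the source sequence up to reverse complementation. With sequencing errors, the exact `overlap` test would have to be replaced by an approximate alignment score, which is the harder setting the abstract describes.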

Author information

Authors and Affiliations

J. D. Kececioglu: Department of Computer Science, The University of Georgia, Athens, GA 30602, USA
E. W. Myers: Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA

Additional information

Communicated by E. W. Myers.

This research was supported by the National Library of Medicine under Grant R01-LM4960, by a postdoctoral fellowship from the Program in Mathematics and Molecular Biology of the University of California at Berkeley under National Science Foundation Grant DMS-8720208, and by a fellowship from the Centre de recherches mathématiques of the Université de Montréal.

About this article

Cite this article

Kececioglu, J.D., Myers, E.W. Combinatorial algorithms for DNA sequence assembly. Algorithmica 13, 7–51 (1995). https://doi.org/10.1007/BF01188580

Received: 19 October 1992
Revised: 08 February 1993
Issue Date: February 1995
DOI: https://doi.org/10.1007/BF01188580
