Abstract
The trend toward very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Program, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since even the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, we present a four-phase approach that is based on rigorous design criteria and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and can list a series of alternative solutions when several appear equally good. Moreover, it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed nonrepetitive sequences of length 50,000 sampled at error rates as high as 10%.
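To make the underlying superstring formulation concrete, the following is a minimal Python sketch of a generic greedy merge heuristic for the error-free shortest common superstring problem. It is not the four-phase method of the paper: it assumes exact overlaps, ignores sequencing errors, and treats every fragment as already correctly oriented. The function names (`reverse_complement`, `overlap`, `greedy_superstring`) and the sample fragments are hypothetical and chosen only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over the alphabet ACGT."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Greedy heuristic: repeatedly merge the pair with the largest exact overlap.

    Assumption: fragments are error-free and all given in the same orientation.
    """
    # Drop duplicates and fragments that are substrings of another fragment.
    unique = list(dict.fromkeys(fragments))
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]

    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Quadratic scan over ordered pairs; fine for a small illustration.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the shared overlap only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)

    return frags[0] if frags else ""


if __name__ == "__main__":
    sample = ["ACGTAC", "TACGGA", "GGATTC"]  # hypothetical fragments
    print(greedy_superstring(sample))
    print(reverse_complement("ACGTAC"))
```

In a real assembly setting each fragment may come from either strand, so a procedure would also have to consider `reverse_complement(f)` for every fragment `f`, and overlaps would have to be scored approximately rather than exactly. The sketch omits both complications, which are the ones the abstract identifies as making the problem harder than the plain superstring problem.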
Additional information
Communicated by E. W. Myers.
This research was supported by the National Library of Medicine under Grant R01-LM4960, by a postdoctoral fellowship from the Program in Mathematics and Molecular Biology of the University of California at Berkeley under National Science Foundation Grant DMS-8720208, and by a fellowship from the Centre de recherches mathématiques of the Université de Montréal.
About this article
Cite this article
Kececioglu, J.D., Myers, E.W. Combinatorial algorithms for DNA sequence assembly. Algorithmica 13, 7–51 (1995). https://doi.org/10.1007/BF01188580
DOI: https://doi.org/10.1007/BF01188580