Abstract
In this paper, we present a novel graph-theoretical approach for representing a wide variety of sequence analysis problems within a single model. The model allows incorporation of the operations “insertion”, “deletion”, and “substitution”, and of various parameters such as relative distances and weights. We refer to the underlying problem as the minimum weight common mutated sequence (MWCMS) problem. The MWCMS model has many applications, including the multiple sequence alignment problem, phylogenetic analysis, the DNA sequencing problem, and the sequence comparison problem, which together encompass a core set of very difficult problems in computational biology. Thus, the model presented in this paper lays out a mathematical modeling framework that allows one to investigate theoretical and computational issues and to forge new advances for these distinct but related problems.
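To make the three operations and their weights concrete, the following is a minimal sketch of the classical weighted edit distance between two sequences, computed by dynamic programming. It is not the MWCMS model itself, which handles many sequences at once. The function name and the default unit costs are assumptions chosen for illustration only.

```python
def weighted_edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Minimum total weight of insertions, deletions, and substitutions
    that turn `source` into `target` (illustrative costs, not the paper's)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]

    # Turning a prefix of `source` into the empty sequence needs only deletions.
    for i in range(1, rows):
        dist[i][0] = i * del_cost
    # Building a prefix of `target` from the empty sequence needs only insertions.
    for j in range(1, cols):
        dist[0][j] = j * ins_cost

    for i in range(1, rows):
        for j in range(1, cols):
            # Matching symbols cost nothing; mismatches pay the substitution weight.
            change = 0.0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,    # delete source[i-1]
                dist[i][j - 1] + ins_cost,    # insert target[j-1]
                dist[i - 1][j - 1] + change,  # keep or substitute
            )
    return dist[-1][-1]
```

Raising `sub_cost` above `ins_cost + del_cost` makes the recurrence prefer a deletion followed by an insertion, which shows how the choice of weights changes which mutations are selected.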
Through the introduction of supernodes and the multi-layer supergraph, we proved that MWCMS is \(NP\)-complete. Furthermore, we showed that a conflict graph derived from the multi-layer supergraph has the property that a solution to the associated node-packing problem on the conflict graph corresponds to a solution of the MWCMS problem. In this case, we proved that when the number of input sequences is a constant, MWCMS is polynomial-time solvable. We also demonstrated that some well-known combinatorial problems can be viewed as special cases of the MWCMS problem. In particular, we presented theoretical results implied by the MWCMS theory for the minimum weight supersequence problem, the minimum weight superstring problem, and the longest common subsequence problem.
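As a rough illustration of what a node-packing problem on a conflict graph asks for, the sketch below selects a maximum-weight set of nodes such that no two selected nodes share an edge. It is a brute-force enumeration on a tiny hypothetical graph and does not reproduce the construction of the multi-layer supergraph or its conflict graph. The function name, node labels, and weights are assumptions.

```python
from itertools import combinations


def max_weight_node_packing(weights, edges):
    """Return (nodes, total_weight) of a maximum-weight node packing.

    `weights` maps node -> weight and `edges` lists conflicting node pairs.
    Exhaustive search: exponential in the number of nodes, toy inputs only.
    """
    nodes = list(weights)
    conflicts = {frozenset(edge) for edge in edges}
    best_set, best_weight = frozenset(), 0.0

    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            # Skip any subset that contains two conflicting nodes.
            if any(frozenset(pair) in conflicts for pair in combinations(subset, 2)):
                continue
            total = sum(weights[node] for node in subset)
            if total > best_weight:
                best_set, best_weight = frozenset(subset), total
    return best_set, best_weight


# Hypothetical conflict graph: a path a - b - c - d with node weights.
packing, weight = max_weight_node_packing(
    {"a": 3, "b": 2, "c": 4, "d": 1},
    [("a", "b"), ("b", "c"), ("c", "d")],
)
```

Node packing is hard on general graphs, so exhaustive search like this does not scale. The polynomial-time result stated above for a constant number of input sequences depends on structure that this toy example does not capture.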
Two integer programming formulations were presented, and a simple yet elegant decomposition heuristic was introduced. The integer programming instances have proven to be computationally intensive. Consequently, research involving simultaneous column and row generation and parallel computing will be explored. The heuristic algorithm, introduced herein for multiple sequence alignment, overcomes the order-dependent drawbacks of many existing algorithms and is capable of returning good sequence alignments within reasonable computational time. It is able to return the optimal alignment for multiple sequences of length less than 1500 base pairs within 30 minutes. Its decomposition-based design lends itself naturally to parallel distributed computing, and we continue to explore its flexibility and scalability in a massively parallel environment.
Cite this article
Lee, E.K., Easton, T. & Gupta, K. Novel evolutionary models and applications to sequence alignment problems. Ann Oper Res 148, 167–187 (2006). https://doi.org/10.1007/s10479-006-0085-9
DOI: https://doi.org/10.1007/s10479-006-0085-9