Abstract
Maximum Weight Trace (MWT) is an optimization problem for multiple sequence alignment. Its input is a set of sequences together with weights on pairs of letters from different sequences, and it seeks a multiple sequence alignment that maximizes the sum of the weights of the pairs of letters that appear in the same column. MWT was introduced by Kececioglu in 1993 and subsequently proven to be NP-hard, and both heuristics and exact methods for it have since been developed. Unfortunately, none of the existing MWT methods scales to even moderate-sized datasets. Here we propose the MWT-AM problem (MWT for Alignment Merging), an extension of MWT for use in a divide-and-conquer setting, where we seek a merged alignment of a set of disjoint alignments that optimizes the MWT score. We present variations of GCM (the Graph Clustering Merger, originally developed for the MAGUS multiple sequence alignment method) that are specifically designed for MWT-AM. We show that the best of these variants, which we refer to as GCM-MWT, performs well under the MWT-AM criterion. We compare GCM-MWT to two other methods for merging alignments, T-Coffee and MAFFT-merge, and find that GCM-MWT produces more accurate merged alignments. GCM-MWT is available in open source form at https://github.com/vlasmirnov/MAGUS.
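To make the MWT objective concrete, the following minimal sketch computes the MWT score of a given alignment: it walks the columns and adds up the weight of every pair of letters that share a column. The data representation is an assumption made for illustration only (rows as gapped strings, a letter identified by its sequence index and its position in the ungapped sequence, weights stored in a dictionary keyed with the lower sequence index first), and the function name `mwt_score` is hypothetical. It is not taken from the GCM-MWT or MAGUS code.

```python
from typing import Dict, List, Tuple

# A letter is identified by (sequence index, position in the ungapped sequence).
Letter = Tuple[int, int]


def mwt_score(
    alignment: List[str],
    weights: Dict[Tuple[Letter, Letter], float],
    gap: str = "-",
) -> float:
    """Sum the weights of all letter pairs that share an alignment column.

    Assumes every weight key is (a, b) with a's sequence index lower than b's;
    pairs without an entry contribute zero.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all alignment rows must have the same length")

    next_pos = [0] * len(alignment)  # next ungapped position in each sequence
    total = 0.0
    for col in range(n_cols):
        # Collect the letters present in this column, in sequence order.
        letters: List[Letter] = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != gap:
                letters.append((seq_idx, next_pos[seq_idx]))
                next_pos[seq_idx] += 1
        # Add the weight of every pair of letters in the column.
        for i in range(len(letters)):
            for j in range(i + 1, len(letters)):
                total += weights.get((letters[i], letters[j]), 0.0)
    return total


# Toy input: two sequences, ACG and ATG, with three weighted letter pairs.
toy_weights = {
    ((0, 0), (1, 0)): 2.0,  # A in sequence 0 with A in sequence 1
    ((0, 1), (1, 1)): 5.0,  # C in sequence 0 with T in sequence 1
    ((0, 2), (1, 2)): 1.5,  # G in sequence 0 with G in sequence 1
}
gapped_score = mwt_score(["AC-G", "A-TG"], toy_weights)
ungapped_score = mwt_score(["ACG", "ATG"], toy_weights)
```

In this toy input the gapped alignment places C and T in different columns and so forfeits their weight, whereas the ungapped alignment collects all three weights. An MWT method searches over alignments for the one with the highest such score, and MWT-AM restricts that search to alignments that merge a given set of disjoint alignments.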
Supported by the University of Illinois.
References
Cannone, J.J., et al.: The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinf. 3(1), 1–31 (2002). https://doi.org/10.1186/1471-2105-3-2
Edgar, R.C.: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinf. 5(1), 113 (2004)
Fiduccia, C.M., Mattheyses, R.M.: A linear-time heuristic for improving network partitions. In: 19th Design Automation Conference, pp. 175–181. IEEE (1982)
Katoh, K., Kuma, K.I., Toh, H., Miyata, T.: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33(2), 511–518 (2005)
Kececioglu, J.: The maximum weight trace problem in multiple sequence alignment. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) CPM 1993. LNCS, vol. 684, pp. 106–119. Springer, Heidelberg (1993). https://doi.org/10.1007/BFb0029800
Kececioglu, J.D., Lenhof, H.P., Mehlhorn, K., Mutzel, P., Reinert, K., Vingron, M.: A polyhedral approach to sequence alignment problems. Discrete Appl. Math. 104(1–3), 143–186 (2000)
Koller, G., Raidl, G.R.: An evolutionary algorithm for the maximum weight trace formulation of the multiple sequence alignment problem. In: Yao, X., et al. (eds.) PPSN 2004. LNCS, vol. 3242, pp. 302–311. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30217-9_31
Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956)
Liu, K., Raghavan, S., Nelesen, S., Linder, C.R., Warnow, T.: Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324(5934), 1561–1564 (2009)
Liu, K., et al.: SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst. Biol. 61(1), 90 (2012)
Mirarab, S., Nguyen, N., Guo, S., Wang, L.S., Kim, J., Warnow, T.: PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol. 22(5), 377–386 (2015)
Mirarab, S., Warnow, T.: FASTSP: linear time calculation of alignment accuracy. Bioinformatics 27(23), 3250–3258 (2011)
Modzelewski, M., Dojer, N.: MSARC: multiple sequence alignment by residue clustering. Alg. Mol. Biol. 9(1), 12 (2014)
Moreno-Centeno, E., Karp, R.M.: The implicit hitting set approach to solve combinatorial optimization problems with an application to multigenome alignment. Oper. Res. 61(2), 453–468 (2013)
Notredame, C., Higgins, D.G., Heringa, J.: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1), 205–217 (2000)
Reinert, K., Lenhof, H.P., Mutzel, P., Mehlhorn, K., Kececioglu, J.D.: A branch-and-cut algorithm for multiple sequence alignment. In: Proceedings of the First Annual International Conference on Computational Molecular Biology (RECOMB), pp. 241–250 (1997)
Satuluri, V., Parthasarathy, S.: Scalable graph clustering using stochastic flows: applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 737–746 (2009)
Smirnov, V., Warnow, T.: MAGUS: multiple sequence alignment using graph clustering. Bioinformatics (2020)
Smirnov, V., Warnow, T.: Phylogeny estimation given sequence length heterogeneity. Syst. Biol. 70(2), 268–282 (2020)
Stoye, J., Evers, D., Meyer, F.: Rose: generating sequence families. Bioinformatics 14(2), 157–163 (1998)
Van Dongen, S.M.: A cluster algorithm for graphs. Technical report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam (2000)
Wallace, I.M., O'Sullivan, O., Higgins, D.G., Notredame, C.: M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 34(6), 1692–1699 (2006)
Wheeler, T.J., Kececioglu, J.D.: Multiple alignment by aligning alignments. Bioinformatics 23(13), i559–i568 (2007)
Acknowledgments
This work was supported in part by NSF ABI-1458652 to TW and by the Debra and Ira Cohen fellowship to VS.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zaharias, P., Smirnov, V., Warnow, T. (2021). The Maximum Weight Trace Alignment Merging Problem. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science, vol 12715. Springer, Cham. https://doi.org/10.1007/978-3-030-74432-8_12
DOI: https://doi.org/10.1007/978-3-030-74432-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74431-1
Online ISBN: 978-3-030-74432-8
eBook Packages: Computer Science, Computer Science (R0)