RRCA: Ultra-Fast Multiple In-species Genome Alignments

Wandelt, Sebastian; Leser, Ulf

doi:10.1007/978-3-319-07953-0_20


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8542)

Included in the following conference series:

International Conference on Algorithms for Computational Biology


Abstract

Multiple sequence alignment (MSA) is an important method in bioinformatics, used, for instance, to reconstruct phylogenetic trees or to identify functional domains within genes. Finding an optimal MSA is computationally intractable, and therefore many alignment heuristics have been proposed. However, computing MSAs for sequences at chromosome or genome scale in reasonable time with good alignment results remains an open challenge.

In this paper we propose RRCA, a very fast method to compute high-quality in-species MSAs at genome scale. RRCA uses referential compression to efficiently find long common subsequences in the to-be-aligned sequences. A collinear subcollection of these subsequences is used for an initial alignment, and the subsequences not yet covered are aligned recursively following the same approach. Our evaluation shows that RRCA achieves MSAs of similar quality to current state-of-the-art methods, while often being orders of magnitude faster on all our datasets. For instance, RRCA aligns eight human Chromosome 22 sequences (around 50 MB each) within one minute on a consumer computer, a task that takes competitors hours to days.
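The three steps named in the abstract (finding long common subsequences by parsing one sequence against a reference, keeping a collinear subset as anchors, and recursing on the uncovered gaps) can be illustrated with a small pairwise sketch. This is not the authors' implementation: the function names `find_matches`, `collinear_chain`, and `align`, the k-mer index, the quadratic chaining step, the halving of `min_len`, and the padding base case are all assumptions chosen for brevity, and the sketch aligns only two sequences rather than many.

```python
def find_matches(ref, seq, min_len):
    """Greedily parse seq into long exact matches against ref.

    Returns (ref_start, seq_start, length) triples, sorted and
    non-overlapping in seq. Assumed stand-in for referential compression.
    """
    index = {}  # k-mer -> start positions in ref
    for i in range(len(ref) - min_len + 1):
        index.setdefault(ref[i:i + min_len], []).append(i)
    matches, j = [], 0
    while j + min_len <= len(seq):
        best_pos, best_len = -1, 0
        for i in index.get(seq[j:j + min_len], []):
            n = min_len
            while i + n < len(ref) and j + n < len(seq) and ref[i + n] == seq[j + n]:
                n += 1  # extend the seed match to the right
            if n > best_len:
                best_pos, best_len = i, n
        if best_len:
            matches.append((best_pos, j, best_len))
            j += best_len
        else:
            j += 1
    return matches


def collinear_chain(matches):
    """Pick the heaviest subset of matches that is ordered and
    non-overlapping in both sequences (simple O(n^2) chaining)."""
    if not matches:
        return []
    score = [m[2] for m in matches]
    prev = [-1] * len(matches)
    for b, (rb, _, nb) in enumerate(matches):
        for a in range(b):
            ra, _, na = matches[a]
            if ra + na <= rb and score[a] + nb > score[b]:
                score[b], prev[b] = score[a] + nb, a
    b = max(range(len(matches)), key=score.__getitem__)
    chain = []
    while b != -1:
        chain.append(matches[b])
        b = prev[b]
    return chain[::-1]


def align(ref, seq, min_len=8):
    """Return two equal-length gapped strings for ref and seq."""
    if min_len < 1 or not ref or not seq:
        # Base case (assumption): no anchors left, pad the shorter side.
        width = max(len(ref), len(seq))
        return ref.ljust(width, "-"), seq.ljust(width, "-")
    chain = collinear_chain(find_matches(ref, seq, min_len))
    if not chain:
        return align(ref, seq, min_len // 2)  # retry with shorter anchors
    out_ref, out_seq, ri, si = [], [], 0, 0
    for r, s, n in chain:
        gap_ref, gap_seq = align(ref[ri:r], seq[si:s], min_len)  # recurse on gap
        out_ref += [gap_ref, ref[r:r + n]]
        out_seq += [gap_seq, seq[s:s + n]]
        ri, si = r + n, s + n
    gap_ref, gap_seq = align(ref[ri:], seq[si:], min_len)
    return "".join(out_ref + [gap_ref]), "".join(out_seq + [gap_seq])
```

Calling `align(a, b, min_len=4)` on two short, highly similar strings returns a pair of gapped strings from which removing `-` recovers the inputs. The sketch favours clarity over speed; genome-scale inputs would need a compressed index and a faster chaining algorithm.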



Author information

Authors and Affiliations

Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Berlin, Germany
Sebastian Wandelt & Ulf Leser


Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Rovira i Virgili University, Avinguda Catalunya, 35, 43002, Tarragona, Spain
Adrian-Horia Dediu & Carlos Martín-Vide
Fachbereich 07, Institut für Informatik, Justus-Liebig-Universität, Arndtstraße 2, 35392, Gießen, Germany
Bianca Truthe

About this paper

Cite this paper

Wandelt, S., Leser, U. (2014). RRCA: Ultra-Fast Multiple In-species Genome Alignments. In: Dediu, AH., Martín-Vide, C., Truthe, B. (eds) Algorithms for Computational Biology. AlCoB 2014. Lecture Notes in Computer Science, vol 8542. Springer, Cham. https://doi.org/10.1007/978-3-319-07953-0_20


DOI: https://doi.org/10.1007/978-3-319-07953-0_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07952-3
Online ISBN: 978-3-319-07953-0
eBook Packages: Computer Science, Computer Science (R0)
