Optimization of consistency-based multiple sequence alignment using Big Data technologies

Lladós, Jordi; Cores, Fernando; Guirado, Fernando

doi:10.1007/s11227-018-2424-4

Published: 24 May 2018

Volume 75, pages 1310–1322, (2019)

The Journal of Supercomputing

Abstract

With the advent of high-throughput next-generation sequencing technologies, the volume of genetic data to be processed has increased significantly. It is becoming essential for multiple sequence alignment (MSA) applications to achieve large-scale alignments with thousands of sequences or even whole genomes. However, all current MSA tools exhibit scalability issues as the number of sequences increases. The main drawback of these methods is that errors made in early pairwise alignments are propagated to the final result, affecting the accuracy of the global alignment. Using consistency information improves the final result and makes its accuracy more stable. However, such methods are severely limited by the memory required to store the consistency information. In previous work, the authors analyzed the structure and distribution of the data stored in the constraint library and demonstrated that it could be reduced without losing accuracy, which makes it possible to increase the number of sequences to be aligned. However, the execution time needed to obtain the constraint library also increases greatly with the number of sequences. In the present paper, the authors apply Big Data technologies to take advantage of the high degree of parallelism provided by the MapReduce paradigm in order to considerably reduce the library calculation time. Moreover, the Big Data infrastructure provides a distributed storage system to improve library scalability and machine-learning algorithms to enhance the consistency selection policies.
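
To illustrate why the library calculation lends itself to the MapReduce paradigm, the following minimal sketch assumes a T-Coffee-style triplet extension, in which the weight of a residue pair (x, y) is its direct weight plus, for every intermediate residue z aligned to both, the minimum of the weights of (x, z) and (z, y). It is not the T-Coffee-MEL implementation: the function names (`map_by_residue`, `reduce_triplets`, `extend_library`), the tuple layout, and the in-memory grouping that stands in for the shuffle stage are all hypothetical, chosen only to show that each intermediate residue can be processed independently.

```python
from collections import defaultdict
from itertools import combinations

# Assumed data layout (hypothetical): a residue is (sequence_id, position)
# and a primary-library constraint is (residue_a, residue_b, weight).


def map_by_residue(constraint):
    """Map step: emit the constraint once under each of its two residues."""
    a, b, w = constraint
    yield a, (b, w)
    yield b, (a, w)


def reduce_triplets(z, neighbours):
    """Reduce step: residue z links every pair of its neighbours (x, y).

    The contribution of z to the pair (x, y) is min(W(x, z), W(z, y)).
    Pairs within the same sequence are skipped.
    """
    for (x, wx), (y, wy) in combinations(sorted(neighbours), 2):
        if x[0] != y[0]:
            yield (x, y), min(wx, wy)


def extend_library(primary):
    """Run the map, group (shuffle) and reduce steps in memory."""
    grouped = defaultdict(list)
    for constraint in primary:
        for key, value in map_by_residue(constraint):
            grouped[key].append(value)

    extended = defaultdict(int)
    # Direct weights from the primary library.
    for a, b, w in primary:
        extended[tuple(sorted((a, b)))] += w
    # Triplet contributions; each key z could be handled by a separate worker.
    for z, neighbours in grouped.items():
        for pair, w in reduce_triplets(z, neighbours):
            extended[pair] += w
    return dict(extended)


if __name__ == "__main__":
    primary = [
        (("A", 0), ("B", 0), 10),
        (("A", 0), ("C", 0), 6),
        (("B", 0), ("C", 0), 8),
    ]
    for pair, weight in sorted(extend_library(primary).items()):
        print(pair, weight)
```

Because the reduce step for one intermediate residue never reads the neighbours of another, the keys can be spread across workers. This independence is what a MapReduce framework exploits.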

Notes

The T-Coffee-MEL sources and installation instructions can be found at: github.com/jllados/TCoffee-MEL.

Acknowledgements

This work has been supported by the MEyC-Spain under contracts TIN2014-53234-C2-2-R, TIN2017-84553-C2-2-R, and TIN2016-81840-REDT.

Author information

Authors and Affiliations

INSPIRES Research Center, Universitat de Lleida, Lleida, Spain
Jordi Lladós, Fernando Cores & Fernando Guirado

Corresponding author

Correspondence to Jordi Lladós.

About this article

Cite this article

Lladós, J., Cores, F. & Guirado, F. Optimization of consistency-based multiple sequence alignment using Big Data technologies. J Supercomput 75, 1310–1322 (2019). https://doi.org/10.1007/s11227-018-2424-4

Published: 24 May 2018
Issue Date: 01 March 2019
DOI: https://doi.org/10.1007/s11227-018-2424-4
