An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop

Cattaneo, Giuseppe; Petrillo, Umberto Ferraro; Giancarlo, Raffaele; Roscigno, Gianluca

doi:10.1007/s11227-016-1835-3

An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop

Published: 08 August 2016

Volume 73, pages 1467–1483, (2017)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Giuseppe Cattaneo¹,
Umberto Ferraro Petrillo²,
Raffaele Giancarlo³ &
…
Gianluca Roscigno¹

526 Accesses
Explore all metrics

Abstract

Alignment-free methods are one of the mainstays of biological sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, a fundamental and routine task in computational biology and bioinformatics. They have gained popularity since, even on standard desktop machines, they are faster than methods based on alignments. However, with the advent of Next-Generation Sequencing Technologies, datasets whose size, i.e., number of sequences and their total length, is a challenge to the execution of alignment-free methods on those standard machines are quite common. Here, we propose the first paradigm for the computation of k-mer-based alignment-free methods for Apache Hadoop that extends the problem sizes that can be processed with respect to a standard sequential machine while also granting a good time performance. Technically, as opposed to a standard Hadoop implementation, its effectiveness is achieved thanks to the incremental management of a persistent hash table during the map phase, a task not contemplated by the basic Hadoop functions and that can be useful also in other contexts.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics

Article Open access 18 April 2019

Benchmarking Spark Distributed Data Structures: A Sequence Analysis Case Study

References

Allen F, Almasi G, Andreoni W, Beece D, Berne BJ, Bright A, Brunheroto J, Cascaval C, Castanos J, Coteus P et al (2001) Blue Gene: a vision for protein science using a petaflop supercomputer. IBM Syst J 40(2):310–327
Article Google Scholar
Apostolico A, Giancarlo R (1998) Sequence alignment in molecular biology. J Comput Biol 5(2):173–196
Article MATH Google Scholar
Audano P, Vannberg F (2014) KAnalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics 30(14):2070–2072
Article Google Scholar
Boden M, Schöneich M, Horwege S, Lindner S, Leimeister C, Morgenstern B (2013) Alignment-free sequence comparison with spaced k-mers. OASIcs-OpenAccess Series in Informatics, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik 34:24–34
MATH Google Scholar
Cattaneo G, Roscigno G, Ferraro Petrillo U (2014) A scalable approach to source camera identification over Hadoop. In: 28th IEEE International Conference on Advanced Information Networking and Applications (AINA), IEEE, pp 366–373
Cattaneo G, Ferraro Petrillo U, Giancarlo R, Roscigno G (2015) Alignment-free sequence comparison over Hadoop for computational biology. In: 44rd International Conference on Parallel Processing Workshops (ICCPW 2015), IEEE, pp 1–9
Chan CX, Bernard G, Poirion O, Hogan JM, Ragan MA (2014) Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Reports 4:6504
Article Google Scholar
Chor B, Horn D, Goldman N, Levy Y, Massingham T et al (2009) Genomic DNA k-mer spectra: models and modalities. Genome Biol 10(10):R108
Article Google Scholar
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. Operating Systems Design and Implementation (OSDI) pp 137–150
Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10):1569–1576
Article Google Scholar
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, New York
Book MATH Google Scholar
Ekanayake J, Pallickara S, Fox G (2008) MapReduce for data intensive scientific analyses. In: 2008 IEEE Fourth International Conference on eScience, pp 277–284
Elsayed T, Lin J, Oard DW (2008) Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pp 265–268
Fan H, Ives AR, Surget-Groba Y, Cannon CH (2015) An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genom 16(1):1–18
Article Google Scholar
Ferragina P, Giancarlo R, Greco V, Manzini G, Valiente G (2007) Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinform 8:252
Article Google Scholar
Giancarlo R, Scaturro D, Utro F (2009) Textual data compression in computational biology: a synopsis. Bioinformatics 25(13):1575–1586
Article MATH Google Scholar
Giancarlo R, Lo Bosco G, Pinello L, Utro F (2013) A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis. BMC Bioinform 14(1):1–14
Article Google Scholar
Giancarlo R, Rombo SE, Utro F (2014) Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Briefings Bioinform 15(3):390–406
Article Google Scholar
Greco V, Giancarlo R (2007) Grid-K: A cometa VO service for compression-based classification of biological sequences and structures. Symposium GRID Open Days at the University of Palermo, Italy pp 87–93
Gunarathne T, Wu TL, Qiu J, Fox G (2010) MapReduce in the clouds for science. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), IEEE, pp 565–572
Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York
Book MATH Google Scholar
Haubold B (2014) Alignment-free phylogenetics and population genetics. Briefings Bioinform 15(3):407–418
Article Google Scholar
Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister CA, Morgenstern B (2014) Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res 42(W1):7–11
Article Google Scholar
Huang K, Brady A, Mahurkar A, White O, Gevers D, Huttenhower C, Segata N (2013) MetaRef: a pan-genomic database for comparative and community microbial genomics
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Article MATH MathSciNet Google Scholar
Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B (2014) Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30(14):1991–1999
Article MATH Google Scholar
Li KB (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19(12):1585–1586
Article Google Scholar
Lloyd S, Snell Q (2011) Accelerated large-scale multiple sequence alignment. BMC Bioinform 12:466
Article Google Scholar
Marçais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6):764–770
Article Google Scholar
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA et al (2000) A whole-genome assembly of drosophila. Science 287(5461):2196–2204
Article Google Scholar
Nordberg H, Bhatia K, Wang K, Wang Z (2013) BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29(23):3014–3019
Article Google Scholar
Schatz MC (2009) Cloudburst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11):1363–1369
Article Google Scholar
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, pp 1–10
Sims GE, Kim SH (2011) Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proceed Nat Acad Sci 108(20):8329–8334
Article Google Scholar
Talia D, Trunfio P, Marozzo F (2015) Data analysis in the cloud: models, techniques and applications, 1st edn. Elsevier Science Publishers B. V, Amsterdam
Google Scholar
Taylor RC (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform 11(Suppl 12):1–6
Article Google Scholar
Torney DC, Burks C, Davison D, Sirotkin KM (1990) Computation of d2: a measure of sequence dissimilarity. In: Computers and DNA: the proceedings of the Interface between Computation Science and Nucleic Acid Sequencing Workshop, Redwood City, Calif.: Addison-Wesley Pub. Co
Vinga S (2014) Editorial: alignment-free methods in computational biology. Brief Bioinform 15(3):341–342
Article Google Scholar
Vinga S, Almeida J (2003) Alignment-free sequence comparison-a review. Bioinformatics 19:513–523
Article Google Scholar
Vouzis PD, Sahinidis NV (2010) GPU-BLAST: Using graphics processors to accelerate protein sequence alignment. Bioinformatics
Warnke J, Pawaskar S, Ali H (2012) An energy-aware Bioinformatics application for assembling short reads in high performance computing systems. In: 2012 International Conference onHigh Performance Computing and Simulation (HPCS), pp 154–160
Wong AK, You M (1985) Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans Patt Anal Mach Intel 7(5):599–609
Article MATH Google Scholar
Yang K, Zhang L (2008) Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucl Acids Res 36(5):1–9
Article Google Scholar

Download references

Acknowledgments

We would like to thank the Department of Statistical Sciences of University of Rome-La Sapienza for computing time on the TeraStat cluster and Nicola Segata for providing the meta-genomic dataset. We also would like to thank the referees for comments that helped in the presentation of our results.

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Salerno, Via Giovanni Paolo II, 132, 84084, Fisciano, SA, Italy
Giuseppe Cattaneo & Gianluca Roscigno
Dipartimento di Scienze Statistiche, Università di Roma “Sapienza”, P.le Aldo Moro 5, 00185, Rome, Italy
Umberto Ferraro Petrillo
Dipartimento di Matematica ed Informatica, Università degli Studi di Palermo, Via Archirafi, 34, 90123, Palermo, PA, Italy
Raffaele Giancarlo

Authors

Giuseppe Cattaneo
View author publications
You can also search for this author inPubMed Google Scholar
Umberto Ferraro Petrillo
View author publications
You can also search for this author inPubMed Google Scholar
Raffaele Giancarlo
View author publications
You can also search for this author inPubMed Google Scholar
Gianluca Roscigno
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Umberto Ferraro Petrillo.

Ethics declarations

Funding

MIUR PRIN Project: 2010RTFWBH_003 “Data-Centric Genomic Computing (GenData 2020)” and Unipa Progetto di Ateneo 2012-ATE-0298 “Metodi Formali ed Algoritmici per la Bioinformatica su Scala Genomica”.

Additional information

A paper related to this work was presented at the 8th International Workshop on Parallel Programming Models and Systems Software for High-End Computing [6].

Rights and permissions

Reprints and permissions

About this article

Cite this article

Cattaneo, G., Petrillo, U.F., Giancarlo, R. et al. An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop. J Supercomput 73, 1467–1483 (2017). https://doi.org/10.1007/s11227-016-1835-3

Download citation

Published: 08 August 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s11227-016-1835-3

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics

Benchmarking Spark Distributed Data Structures: A Sequence Analysis Case Study

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Funding

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now