Abstract
Alignment-free methods are one of the mainstays of biological sequence comparison, i.e., the assessment of how similar two biological sequences are to each other, a fundamental and routine task in computational biology and bioinformatics. They have gained popularity because, even on standard desktop machines, they are faster than alignment-based methods. However, with the advent of Next-Generation Sequencing technologies, it is now quite common to encounter datasets whose size, i.e., the number of sequences and their total length, makes running alignment-free methods on such machines challenging. Here, we propose the first paradigm for computing k-mer-based alignment-free methods on Apache Hadoop; it extends the problem sizes that can be processed with respect to a standard sequential machine while also delivering good time performance. Technically, and unlike a standard Hadoop implementation, its effectiveness comes from the incremental management of a persistent hash table during the map phase, a task that the basic Hadoop functions do not provide for and that can also be useful in other contexts.
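The idea of a hash table that persists across the map phase can be illustrated with a minimal sketch in plain Python. This is not the implementation described in the article and it does not use the Hadoop API: the class `KmerMapper`, the functions `reduce_counts` and `squared_euclidean`, and the choice of the squared Euclidean distance are all assumptions made for illustration. The sketch also assumes that each call to `map` receives a whole sequence, so no k-mer spans two inputs.

```python
from collections import Counter, defaultdict


class KmerMapper:
    """Toy mapper (hypothetical) that keeps one hash table alive across map() calls."""

    def __init__(self, k):
        self.k = k
        # The table lives as long as the mapper object, i.e., the whole map task.
        self.table = Counter()

    def map(self, seq_id, sequence):
        # Update the table incrementally; nothing is emitted per k-mer.
        for i in range(len(sequence) - self.k + 1):
            self.table[(seq_id, sequence[i:i + self.k])] += 1

    def cleanup(self):
        # Emit the aggregated counts once, when the map task ends.
        for (seq_id, kmer), count in self.table.items():
            yield seq_id, (kmer, count)


def reduce_counts(pairs):
    """Merge partial counts from all mappers into one k-mer profile per sequence."""
    profiles = defaultdict(Counter)
    for seq_id, (kmer, count) in pairs:
        profiles[seq_id][kmer] += count
    return profiles


def squared_euclidean(p, q):
    """Squared Euclidean distance between two k-mer count profiles (illustrative choice)."""
    return sum((p[w] - q[w]) ** 2 for w in set(p) | set(q))


if __name__ == "__main__":
    mapper = KmerMapper(k=2)
    mapper.map("s1", "ACGT")
    mapper.map("s2", "ACGA")
    profiles = reduce_counts(mapper.cleanup())
    print(squared_euclidean(profiles["s1"], profiles["s2"]))
```

Accumulating counts in the table and emitting them only in `cleanup` means that each mapper outputs one record per distinct k-mer rather than one record per k-mer occurrence, which is the kind of saving the incremental table is meant to provide.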






References
Allen F, Almasi G, Andreoni W, Beece D, Berne BJ, Bright A, Brunheroto J, Cascaval C, Castanos J, Coteus P et al (2001) Blue Gene: a vision for protein science using a petaflop supercomputer. IBM Syst J 40(2):310–327
Apostolico A, Giancarlo R (1998) Sequence alignment in molecular biology. J Comput Biol 5(2):173–196
Audano P, Vannberg F (2014) KAnalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics 30(14):2070–2072
Boden M, Schöneich M, Horwege S, Lindner S, Leimeister C, Morgenstern B (2013) Alignment-free sequence comparison with spaced k-mers. OASIcs-OpenAccess Series in Informatics, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik 34:24–34
Cattaneo G, Roscigno G, Ferraro Petrillo U (2014) A scalable approach to source camera identification over Hadoop. In: 28th IEEE International Conference on Advanced Information Networking and Applications (AINA), IEEE, pp 366–373
Cattaneo G, Ferraro Petrillo U, Giancarlo R, Roscigno G (2015) Alignment-free sequence comparison over Hadoop for computational biology. In: 44th International Conference on Parallel Processing Workshops (ICPPW 2015), IEEE, pp 1–9
Chan CX, Bernard G, Poirion O, Hogan JM, Ragan MA (2014) Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Reports 4:6504
Chor B, Horn D, Goldman N, Levy Y, Massingham T et al (2009) Genomic DNA k-mer spectra: models and modalities. Genome Biol 10(10):R108
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. Operating Systems Design and Implementation (OSDI) pp 137–150
Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A (2015) KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10):1569–1576
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, New York
Ekanayake J, Pallickara S, Fox G (2008) MapReduce for data intensive scientific analyses. In: 2008 IEEE Fourth International Conference on eScience, pp 277–284
Elsayed T, Lin J, Oard DW (2008) Pairwise document similarity in large collections with MapReduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pp 265–268
Fan H, Ives AR, Surget-Groba Y, Cannon CH (2015) An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genom 16(1):1–18
Ferragina P, Giancarlo R, Greco V, Manzini G, Valiente G (2007) Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. BMC Bioinform 8:252
Giancarlo R, Scaturro D, Utro F (2009) Textual data compression in computational biology: a synopsis. Bioinformatics 25(13):1575–1586
Giancarlo R, Lo Bosco G, Pinello L, Utro F (2013) A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis. BMC Bioinform 14(1):1–14
Giancarlo R, Rombo SE, Utro F (2014) Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Briefings Bioinform 15(3):390–406
Greco V, Giancarlo R (2007) Grid-K: A cometa VO service for compression-based classification of biological sequences and structures. Symposium GRID Open Days at the University of Palermo, Italy pp 87–93
Gunarathne T, Wu TL, Qiu J, Fox G (2010) MapReduce in the clouds for science. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), IEEE, pp 565–572
Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York
Haubold B (2014) Alignment-free phylogenetics and population genetics. Briefings Bioinform 15(3):407–418
Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister CA, Morgenstern B (2014) Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res 42(W1):7–11
Huang K, Brady A, Mahurkar A, White O, Gevers D, Huttenhower C, Segata N (2013) MetaRef: a pan-genomic database for comparative and community microbial genomics
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B (2014) Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30(14):1991–1999
Li KB (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 19(12):1585–1586
Lloyd S, Snell Q (2011) Accelerated large-scale multiple sequence alignment. BMC Bioinform 12:466
Marçais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6):764–770
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA et al (2000) A whole-genome assembly of Drosophila. Science 287(5461):2196–2204
Nordberg H, Bhatia K, Wang K, Wang Z (2013) BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29(23):3014–3019
Schatz MC (2009) Cloudburst: highly sensitive read mapping with MapReduce. Bioinformatics 25(11):1363–1369
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, pp 1–10
Sims GE, Kim SH (2011) Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proceed Nat Acad Sci 108(20):8329–8334
Talia D, Trunfio P, Marozzo F (2015) Data analysis in the cloud: models, techniques and applications, 1st edn. Elsevier Science Publishers B. V, Amsterdam
Taylor RC (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform 11(Suppl 12):1–6
Torney DC, Burks C, Davison D, Sirotkin KM (1990) Computation of d2: a measure of sequence dissimilarity. In: Computers and DNA: the proceedings of the Interface between Computation Science and Nucleic Acid Sequencing Workshop, Redwood City, Calif.: Addison-Wesley Pub. Co
Vinga S (2014) Editorial: alignment-free methods in computational biology. Briefings Bioinform 15(3):341–342
Vinga S, Almeida J (2003) Alignment-free sequence comparison-a review. Bioinformatics 19:513–523
Vouzis PD, Sahinidis NV (2010) GPU-BLAST: Using graphics processors to accelerate protein sequence alignment. Bioinformatics
Warnke J, Pawaskar S, Ali H (2012) An energy-aware Bioinformatics application for assembling short reads in high performance computing systems. In: 2012 International Conference on High Performance Computing and Simulation (HPCS), pp 154–160
Wong AK, You M (1985) Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans Patt Anal Mach Intel 7(5):599–609
Yang K, Zhang L (2008) Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Res 36(5):1–9
Acknowledgments
We would like to thank the Department of Statistical Sciences of University of Rome-La Sapienza for computing time on the TeraStat cluster and Nicola Segata for providing the meta-genomic dataset. We also would like to thank the referees for comments that helped in the presentation of our results.
Funding
MIUR PRIN Project: 2010RTFWBH_003 “Data-Centric Genomic Computing (GenData 2020)” and Unipa Progetto di Ateneo 2012-ATE-0298 “Metodi Formali ed Algoritmici per la Bioinformatica su Scala Genomica”.
Additional information
A paper related to this work was presented at the 8th International Workshop on Parallel Programming Models and Systems Software for High-End Computing [6].
About this article
Cite this article
Cattaneo, G., Petrillo, U.F., Giancarlo, R. et al. An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop. J Supercomput 73, 1467–1483 (2017). https://doi.org/10.1007/s11227-016-1835-3
DOI: https://doi.org/10.1007/s11227-016-1835-3