Abstract
In this work we propose an approach to improve the performance of a current methodology for pangenomic analyses, which computes k-mer based sequence similarity via the Jaccard index. Recent studies have shown that this measure performs well at retrieving homology among genetic sequences belonging to a group of genomes.
Our improvement is obtained by exploiting a suitable k-mer representation, which enables a fast and memory-efficient computation of sequence similarity. Experimental results on genomes of living organisms from different species provide evidence that the approach improves on a state-of-the-art methodology in terms of both running time and memory requirements.
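As a rough illustration of the measure discussed above, the following minimal sketch computes a set-based Jaccard index over the k-mers of two sequences. It is not the representation proposed in the paper, which the abstract does not detail. The DNA alphabet, the 2-bit integer packing of k-mers, and the function names `kmer_set` and `jaccard` are all assumptions made for the example.

```python
from typing import Set

# 2-bit code per nucleotide (illustrative assumption, not taken from the paper)
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_set(seq: str, k: int) -> Set[int]:
    """Return the distinct k-mers of seq, each packed into an integer."""
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    kmers: Set[int] = set()
    value = 0   # rolling integer encoding of the current window
    valid = 0   # consecutive unambiguous bases seen so far
    for base in seq.upper():
        code = _CODE.get(base)
        if code is None:
            # Ambiguous symbol (e.g. N): restart the window
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            kmers.add(value)
    return kmers


def jaccard(a: str, b: str, k: int) -> float:
    """Set-based Jaccard index |A ∩ B| / |A ∪ B| over the k-mers of a and b."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = len(ka | kb)
    if union == 0:
        return 0.0  # neither sequence contains a valid k-mer
    return len(ka & kb) / union
```

Packing each k-mer into an integer avoids storing substrings, which is one common way to reduce memory; whether it matches the authors' choice is not stated in the abstract.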
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Bonnici, V., Cracco, A., Franco, G. (2022). A k-mer Based Sequence Similarity for Pangenomic Analyses. In: Nicosia, G., et al. Machine Learning, Optimization, and Data Science. LOD 2021. Lecture Notes in Computer Science(), vol 13164. Springer, Cham. https://doi.org/10.1007/978-3-030-95470-3_3
DOI: https://doi.org/10.1007/978-3-030-95470-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-95469-7
Online ISBN: 978-3-030-95470-3
eBook Packages: Computer Science (R0)