
A k-mer Based Sequence Similarity for Pangenomic Analyses

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 13164)

Abstract

In this work we propose an approach that improves the performance of a current methodology for pangenomic analyses: computing k-mer based sequence similarity via the Jaccard index. Recent studies have shown that this measure performs well at retrieving homology among genetic sequences belonging to a group of genomes.
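As a minimal sketch of the baseline measure named above, the snippet below computes a plain set-based Jaccard index over the k-mers of two sequences. The function names `kmers` and `jaccard` are hypothetical, and the abstract does not state which exact variant of the index the paper uses, so this is an illustration of the general idea rather than the authors' method.

```python
from __future__ import annotations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    # If seq is shorter than k, the range is empty and so is the set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int) -> float:
    """Set-based Jaccard index |A ∩ B| / |A ∪ B| of two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Neither sequence contains a k-mer; define the similarity as 0.
        return 0.0
    return len(ka & kb) / len(union)


if __name__ == "__main__":
    # Two toy sequences sharing the 3-mers ACG and CGT.
    print(jaccard("ACGTAC", "ACGTTC", 3))
```

With k = 3 the two toy sequences each have four distinct k-mers, two of which are shared, so the index is 2/6.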

The improvement is obtained by exploiting a suitable k-mer representation, which enables fast and memory-efficient computation of sequence similarity. Experimental results on genomes of living organisms from different species provide evidence that the approach improves on a state-of-the-art methodology in terms of both running time and memory requirements.
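The abstract does not say which k-mer representation is used. Purely as an illustration of how a compact representation can reduce memory, the sketch below assumes a DNA alphabet and packs each k-mer into an integer with 2 bits per base, updating the value incrementally as the window slides. The names `CODE` and `packed_kmers` are hypothetical, and this is not claimed to be the representation adopted in the paper.

```python
from __future__ import annotations

# Assumed 2-bit code for the DNA alphabet (an illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def packed_kmers(seq: str, k: int) -> set[int]:
    """Encode every k-mer of seq (k >= 1) as an integer, 2 bits per base.

    Windows containing a symbol outside {A, C, G, T} are skipped.
    """
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    out: set[int] = set()
    value = 0   # packed encoding of the current window
    valid = 0   # consecutive valid bases seen so far
    for ch in seq:
        code = CODE.get(ch)
        if code is None:
            # Unknown symbol: restart the window after it.
            value, valid = 0, 0
            continue
        # Shift in the new base and drop the oldest one.
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            out.add(value)
    return out


if __name__ == "__main__":
    print(sorted(packed_kmers("ACGT", 2)))
```

Integer-encoded k-mers can be compared and hashed without allocating substrings, which is one reason such encodings are often cheaper than string sets.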


Author information

Corresponding author

Correspondence to Vincenzo Bonnici.

Copyright information

© 2022 Springer Nature Switzerland AG

Cite this paper

Bonnici, V., Cracco, A., Franco, G. (2022). A k-mer Based Sequence Similarity for Pangenomic Analyses. In: Nicosia, G., et al. Machine Learning, Optimization, and Data Science. LOD 2021. Lecture Notes in Computer Science, vol 13164. Springer, Cham. https://doi.org/10.1007/978-3-030-95470-3_3

  • DOI: https://doi.org/10.1007/978-3-030-95470-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-95469-7

  • Online ISBN: 978-3-030-95470-3

  • eBook Packages: Computer Science, Computer Science (R0)
