Skip to main content

USTAR: Improved Compression of k-mer Sets with Counters Using de Bruijn Graphs

  • Conference paper
  • First Online:
Bioinformatics Research and Applications (ISBRA 2023)

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 14248))

Included in the following conference series:

  • 661 Accesses

Abstract

A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. Finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into a de Bruijn graph and then find a compact representation of the graph through the smallest path cover.

In this paper, we present USTAR, a tool for compressing a set of k-mers and their counts. USTAR exploits the node connectivity and density of the de Bruijn graph enabling a more effective path selection for the construction of the path cover. We demonstrate the usefulness of USTAR in the compression of read datasets. USTAR can improve the compression of UST, the best algorithm, from 2.3% up to 26,4%, depending on the k-mer size.

The code of USTAR and the complete results are available at the repository https://github.com/enricorox/USTAR.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 69.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 89.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Andreace, F., Pizzi, C., Comin, M.: Metaprob 2: metagenomic reads binning based on assembly using minimizers and k-mers statistics. J. Comput. Biol. 28(11), 1052–1062 (2021). https://doi.org/10.1089/cmb.2021.0270

    Article  CAS  PubMed  Google Scholar 

  2. Bankevich, A., et al.: Spades: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Bradley, P., Den Bakker, H.C., Rocha, E.P., McVean, G., Iqbal, Z.: Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol. 37(2), 152–159 (2019)

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Břinda, K., Baym, M., Kucherov, G.: Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol. 22(1), 1–24 (2021)

    Article  Google Scholar 

  5. Cavattoni, M., Comin, M.: Classgraph: improving metagenomic read classification with overlap graphs. J. Comput. Biol. 30(6), 633–647 (2023). https://doi.org/10.1089/cmb.2022.0208, pMID: 37023405

  6. Chikhi, R., Holub, J., Medvedev, P.: Data structures to represent a set of k-long DNA sequences. ACM Comput. Surv. (CSUR) 54(1), 1–22 (2021)

    Article  Google Scholar 

  7. Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201–i208 (2016)

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Conway, T.C., Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479–486 (2011)

    Google Scholar 

  9. Denti, L., Previtali, M., Bernardini, G., Schönhuth, A., Bonizzoni, P.: Malva: genotyping by mapping-free allele detection of known variants. Iscience 18, 20–27 (2019)

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Harris, R.S., Medvedev, P.: Improved representation of sequence bloom trees. Bioinformatics 36(3), 721–727 (2020)

    Article  CAS  PubMed  Google Scholar 

  11. Kokot, M., Długosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33(17), 2759–2761 (2017)

    Article  CAS  PubMed  Google Scholar 

  12. Marchet, C., Iqbal, Z., Gautheret, D., Salson, M., Chikhi, R.: Reindeer: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics 36(Supplement_1), i177–i185 (2020)

    Google Scholar 

  13. Marcolin, M., Andreace, F., Comin, M.: Efficient k-mer indexing with application to mapping-free SNP genotyping. In: Lorenz, R., Fred, A.L.N., Gamboa, H. (eds.) Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2022, Volume 3: BIOINFORMATICS, 9–11 February 2022, pp. 62–70 (2022)

    Google Scholar 

  14. Monsu, M., Comin, M.: Fast alignment of reads to a variation graph with application to SNP detection. J. Integr. Bioinform. 18(4), 20210032 (2021)

    Article  PubMed  PubMed Central  Google Scholar 

  15. Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using minhash. Genome Biol. 17(1), 1–14 (2016)

    Article  Google Scholar 

  16. Pandey, P., Almodaresi, F., Bender, M.A., Ferdman, M., Johnson, R., Patro, R.: Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 7(2), 201–207 (2018)

    Article  CAS  PubMed  Google Scholar 

  17. Pandey, P., Bender, M.A., Johnson, R., Patro, R.: Squeakr: an exact and approximate k-mer counting system. Bioinformatics 34(4), 568–575 (2018)

    Article  CAS  PubMed  Google Scholar 

  18. Pinho, A.J., Pratas, D.: Mfcompress: a compression tool for fasta and multi-fasta data. Bioinformatics 30(1), 117–118 (2014)

    Article  CAS  PubMed  Google Scholar 

  19. Qian, J., Comin, M.: Metacon: unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage. BMC Bioinform. 20(367) (2019). https://doi.org/10.1186/s12859-019-2904-4

  20. Rahman, A., Chikhi, R., Medvedev, P.: Disk compression of k-mer sets. Algorithms Mol. Biol. 16(1), 1–14 (2021)

    Article  Google Scholar 

  21. Rahman, A., Medvedev, P.: Representation of \(k\)-mer sets using spectrum-preserving string sets. In: Schwartz, R. (ed.) RECOMB 2020. LNCS, vol. 12074, pp. 152–168. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45257-5_10

    Chapter  Google Scholar 

  22. Rhie, A., Walenz, B.P., Koren, S., Phillippy, A.M.: Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020)

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013)

    Article  CAS  PubMed  Google Scholar 

  24. Storato, D., Comin, M.: K2mem: discovering discriminative k-mers from sequencing data for metagenomic reads classification. IEEE/ACM Trans. Comput. Biol. Bioinf. 19(1), 220–229 (2022). https://doi.org/10.1109/TCBB.2021.3117406

    Article  Google Scholar 

  25. Sun, C., Harris, R.S., Chikhi, R., Medvedev, P.: Allsome sequence bloom trees. J. Comput. Biol. 25(5), 467–479 (2018)

    Article  CAS  PubMed  Google Scholar 

  26. Sun, C., Medvedev, P.: Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics 35(3), 415–420 (2019)

    Article  CAS  PubMed  Google Scholar 

  27. Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15(3), 1–12 (2014)

    Article  Google Scholar 

Download references

Acknowledgments

Authors are supported by the National Recovery and Resilience Plan (NRRP), National Biodiversity Future Center - NBFC, NextGenerationEU.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Matteo Comin .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Rossignolo, E., Comin, M. (2023). USTAR: Improved Compression of k-mer Sets with Counters Using de Bruijn Graphs. In: Guo, X., Mangul, S., Patterson, M., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2023. Lecture Notes in Computer Science(), vol 14248. Springer, Singapore. https://doi.org/10.1007/978-981-99-7074-2_16

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-7074-2_16

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7073-5

  • Online ISBN: 978-981-99-7074-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics