USTAR: Improved Compression of k-mer Sets with Counters Using de Bruijn Graphs

Rossignolo, Enrico; Comin, Matteo

doi:10.1007/978-981-99-7074-2_16

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 14248)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. Finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into a de Bruijn graph and then find a compact representation of the graph through the smallest path cover.
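The pipeline described above (reads to k-mers, k-mers to a de Bruijn graph, graph to a path cover) can be illustrated with a minimal Python sketch. It is not USTAR's or UST's algorithm: it uses a plain greedy extension, considers the forward strand only (no reverse complements), assumes k ≥ 2, and the function names `count_kmers`, `greedy_path_cover`, and `counts_along` are hypothetical names chosen for this illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (forward strand only) occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_path_cover(kmers):
    """Cover a node-centric de Bruijn graph with vertex-disjoint paths.

    Nodes are k-mers; an edge u -> v exists when u[1:] == v[:-1].
    Each path is spelled as one string, so every k-mer appears exactly
    once across the returned strings. Simple greedy heuristic, assumed
    here for illustration only; it does not guarantee a minimum cover.
    """
    unvisited = set(kmers)
    strings = []
    for seed in sorted(kmers):  # sorted only to make the result deterministic
        if seed not in unvisited:
            continue
        unvisited.remove(seed)
        k = len(seed)
        path = seed
        # Extend to the right while an unvisited successor exists.
        extended = True
        while extended:
            extended = False
            suffix = path[-(k - 1):]
            for base in "ACGT":
                nxt = suffix + base
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    path += base
                    extended = True
                    break
        # Extend to the left while an unvisited predecessor exists.
        extended = True
        while extended:
            extended = False
            prefix = path[:k - 1]
            for base in "ACGT":
                prev = base + prefix
                if prev in unvisited:
                    unvisited.remove(prev)
                    path = base + path
                    extended = True
                    break
        strings.append(path)
    return strings


def counts_along(strings, counts, k):
    """List the k-mer counts in the order the k-mers appear in each string."""
    return [[counts[s[i:i + k]] for i in range(len(s) - k + 1)] for s in strings]


if __name__ == "__main__":
    k = 3
    counts = count_kmers(["ACGTAC", "GTACG"], k)
    strings = greedy_path_cover(counts)
    print(strings, counts_along(strings, counts, k))
```

Storing the spelled paths instead of the individual k-mers saves space because consecutive k-mers on a path share k-1 characters. The fewer and longer the paths, the fewer characters are written, which is why the choice of seed and extension matters for the final size.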

In this paper, we present USTAR, a tool for compressing a set of k-mers and their counts. USTAR exploits the node connectivity and density of the de Bruijn graph, enabling a more effective path selection for the construction of the path cover. We demonstrate the usefulness of USTAR in the compression of read datasets. USTAR improves on the compression of UST, the best algorithm, by 2.3% up to 26.4%, depending on the k-mer size.

The code of USTAR and the complete results are available at the repository https://github.com/enricorox/USTAR.

Acknowledgments

The authors are supported by the National Recovery and Resilience Plan (NRRP), National Biodiversity Future Center (NBFC), NextGenerationEU.

Author information

Authors and Affiliations

Department of Information Engineering, University of Padua, Padua, Italy
Enrico Rossignolo & Matteo Comin

Corresponding author

Correspondence to Matteo Comin.

Editor information

Editors and Affiliations

University of North Texas, Denton, TX, USA
Xuan Guo
University of Southern California, Los Angeles, CA, USA
Serghei Mangul
Georgia State University, Atlanta, GA, USA
Murray Patterson
Georgia State University, Atlanta, GA, USA
Alexander Zelikovsky

About this paper

Cite this paper

Rossignolo, E., Comin, M. (2023). USTAR: Improved Compression of k-mer Sets with Counters Using de Bruijn Graphs. In: Guo, X., Mangul, S., Patterson, M., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2023. Lecture Notes in Computer Science, vol 14248. Springer, Singapore. https://doi.org/10.1007/978-981-99-7074-2_16

DOI: https://doi.org/10.1007/978-981-99-7074-2_16
Published: 08 October 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7073-5
Online ISBN: 978-981-99-7074-2
eBook Packages: Computer Science, Computer Science (R0)
