DOI: 10.1145/3543377.3543389
research-article

Quill: A Memory Efficient k-mer Counting and k-mer Querying Tool for Commodity Clusters

Published: 08 August 2022

Abstract

K-mer counting is a fundamental and widely used process in bioinformatics. With next-generation sequencing and other advances in sequencing techniques, newly generated large sequence datasets demand efficient k-mer counting techniques that can make full use of available resources. We present Quill, a memory-efficient k-mer counting and querying tool for commodity clusters. While existing distributed-memory solutions require high-performance clusters, Quill performs k-mer counting on a conventional computer cluster without relying on high-performance network interfaces or parallel file systems. Furthermore, Quill provides an additional advantage when k-mer counts are required for multiple values of k on the same dataset. Quill shows linear scaling for the k-mer counting stage when tested on a commodity cluster. The performance gain is more evident when Quill is run with k values up to 22 and 28. Thus, Quill can be viewed as a cost-effective k-mer counting solution that can effectively use the combined computing power of a cluster of commodity-grade computers. Quill is freely available at https://github.com/CSE-Optimizers/k-mer_counter.
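For readers new to the task, the following is a minimal single-machine sketch of what k-mer counting and querying mean, including counting for several values of k on the same sequence. It is an illustration only and not Quill's implementation: the function names `count_kmers` and `query_kmer`, the use of a Python `Counter`, and the toy sequence are all assumptions made for this example, and it does nothing to address the distributed or memory-efficiency aspects described above.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of a sequence.

    Hypothetical helper for illustration; not part of Quill.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    counts = Counter()
    # Slide a window of width k across the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts


def query_kmer(counts: Counter, kmer: str) -> int:
    """Return how often a k-mer was seen (0 if it never occurred)."""
    return counts.get(kmer.upper(), 0)


# Toy input; real datasets hold millions of reads.
read = "ACGTACGT"

# Counting for several k values on the same data, as mentioned in the abstract.
tables = {k: count_kmers(read, k) for k in (2, 3)}

print(query_kmer(tables[3], "ACG"))  # prints 2
print(query_kmer(tables[3], "TTT"))  # prints 0
```

A sequence of length n yields n - k + 1 k-mers, so the toy read above contributes six 3-mers and seven 2-mers. In this naive sketch each k value requires a full pass over the data.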

Supplementary Material

Presentation slides (ICBBT-Presentation.pptx)

Information & Contributors

Information

Published In

ICBBT '22: Proceedings of the 14th International Conference on Bioinformatics and Biomedical Technology
May 2022
190 pages
ISBN:9781450396387
DOI:10.1145/3543377
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Distributed computing
  2. K-mer counting
  3. Parallel computing
  4. Performance engineering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICBBT 2022
