Quill: A Memory Efficient k-mer Counting and k-mer Querying Tool for Commodity Clusters

Authors:

Budvin Edippuliarachchi,

Damika Gamlath,

Ruchin Amaratunga,

Gunavaran Brihadiswaran,

Sanath Jayasena

ICBBT '22: Proceedings of the 14th International Conference on Bioinformatics and Biomedical Technology

Pages 79 - 88

https://doi.org/10.1145/3543377.3543389

Published: 08 August 2022

Abstract

K-mer counting is a fundamental and widely used bioinformatics process. With next-generation sequencing and other advances in sequencing techniques, newly generated large sequence datasets demand efficient k-mer counting techniques that can make full use of available resources. We present Quill, a memory-efficient k-mer counting and querying tool for commodity clusters. While existing distributed-memory solutions require high-performance clusters, Quill performs k-mer counting on a conventional computer cluster without relying on high-performance network interfaces or parallel file systems. Furthermore, Quill provides an additional advantage when k-mer counting is required for multiple k values on the same dataset. Quill shows linear scaling for the k-mer counting stage when tested on a commodity cluster. The performance gain is more evident when executed with k values up to 22 and 28. Thus, Quill can be viewed as a cost-effective k-mer counting solution that effectively uses the combined computing power of a cluster of commodity-grade computers. Quill is freely available at https://github.com/CSE-Optimizers/k-mer_counter.
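
For readers unfamiliar with the task, the following is a minimal single-machine sketch of what k-mer counting computes, including counting for several k values over the same reads. It is not Quill's distributed method, which the abstract does not detail; the function names `count_kmers` and `count_for_k_values` and the toy read are assumptions made purely for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across the given reads.

    Naive in-memory illustration only; not Quill's algorithm.
    """
    counts = Counter()
    for read in reads:
        # A read shorter than k yields an empty range and contributes nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def count_for_k_values(reads, k_values):
    """Repeat the count for several k values over the same reads."""
    return {k: count_kmers(reads, k) for k in k_values}


if __name__ == "__main__":
    toy_reads = ["ACGTACGT"]  # hypothetical toy input
    for k, table in count_for_k_values(toy_reads, [3, 4]).items():
        print(k, dict(table))
```

This naive version rescans the input once per k value and keeps every count in one machine's memory, which is the kind of cost that motivates memory-efficient, cluster-based tools.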

Published In

ICBBT '22: Proceedings of the 14th International Conference on Bioinformatics and Biomedical Technology

May 2022

190 pages

ISBN:9781450396387

DOI:10.1145/3543377

Copyright © 2022 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICBBT 2022

ICBBT 2022: 2022 14th International Conference on Bioinformatics and Biomedical Technology

May 27 - 29, 2022

Xi'an, China
