Effective Parallel Multicore-Optimized K-mers Counting Algorithm

Farkaš, Tomáš; Kubán, Peter; Lucká, Mária

doi:10.1007/978-3-662-49192-8_38

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9587)

Included in the following conference series:

International Conference on Current Trends in Theory and Practice of Informatics

Abstract

For many bioinformatics applications it is crucial to know the frequencies of all subsequences of length k (k-mers) extracted from the short reads obtained during DNA sequencing. We present an effective parallel algorithm for k-mer counting based on a nested bucket sort, in which the partition sizes and the number of buckets per partition are precomputed. The algorithm is designed for multicore architectures and combines MPI (Open MPI) with POSIX threads to achieve very good performance. In our experiments it outperforms existing solutions in running time on the genome of Drosophila melanogaster (SRX040485).
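
The counting idea can be illustrated with a small, single-threaded sketch. It is not the authors' implementation: the function name `count_kmers_bucketed`, the `prefix_len` parameter, and the choice of a k-mer prefix as the bucket key are assumptions made for illustration, and the MPI and POSIX-thread parallelism as well as the precomputed partition sizes are left out. It only shows the general pattern of distributing k-mers into buckets, ordering each bucket, and counting runs of equal k-mers.

```python
def count_kmers_bucketed(reads, k, prefix_len=2):
    """Count k-mers by bucketing on a prefix, then ordering each bucket.

    Hypothetical illustration only: the bucket key (a k-mer prefix) and
    all names here are assumptions, not taken from the paper.
    """
    # Distribution step: put every k-mer into the bucket for its prefix.
    buckets = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets.setdefault(kmer[:prefix_len], []).append(kmer)

    # Ordering step: sort each bucket on its own, then count runs of
    # identical k-mers, which are adjacent after sorting.
    counts = {}
    for prefix in sorted(buckets):
        bucket = sorted(buckets[prefix])
        run_start = 0
        for j in range(1, len(bucket) + 1):
            if j == len(bucket) or bucket[j] != bucket[run_start]:
                counts[bucket[run_start]] = j - run_start
                run_start = j
    return counts


print(count_kmers_bucketed(["ACGTAC", "CGTA"], 3))
# prints {'ACG': 1, 'CGT': 2, 'GTA': 2, 'TAC': 1}
```

Because the buckets are independent of one another once the distribution step is done, each bucket could in principle be handed to a separate worker, which is the property a parallel version would exploit.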

Notes

1. Here, sorting means classifying the data by some criterion, whereas ordering means arranging the data in non-increasing or non-decreasing order.
2. The reduction in time comes from the final comparison-based \(O(n \log n)\) sorting algorithm, which needs less time to sort k groups of l elements than one group of \(k \cdot l\) elements. This gives an overall time complexity of \(O(n + n \log \frac{n}{b})\), where b is the number of buckets. For reasonably high values of b the complexity tends to \(O(n)\). Keep in mind that time complexity is a much less accurate guide than measuring actual performance.
3. Drosophila melanogaster (SRX040485): http://www.ebi.ac.uk/ena/data/view/SRX040485.
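
The estimate in note 2 can be reconstructed as follows. This is a sketch under an assumption that the note does not state explicitly, namely that the n elements are spread evenly over the b buckets, so that each bucket holds about \(\frac{n}{b}\) elements. Distributing the elements into buckets takes \(O(n)\), and each bucket is then ordered with a comparison-based algorithm:

\[
T(n) = O(n) + b \cdot O\!\left(\frac{n}{b} \log \frac{n}{b}\right) = O\!\left(n + n \log \frac{n}{b}\right)
\]

The same reasoning explains the comparison between k groups of l elements and one group of \(n = k \cdot l\) elements:

\[
k \cdot l \log l = n \log l \;<\; n \log (k \cdot l) = n \log n \quad \text{for } k > 1
\]

If b grows in proportion to n, say \(b = \frac{n}{c}\) for a constant c, then \(\log \frac{n}{b} = \log c\) is constant and the bound reduces to \(O(n)\).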

Acknowledgements

This work was partially supported by the Institute of Informatics and Software Engineering, FIIT STU, Intelligent analysis of big data by semantic-oriented and bio-inspired methods in parallel environment, the Scientific Grant Agency of the Slovak Republic, grant No. VG 1/0752/14, the project DNApuzzleDNA, FIIT STU, which allowed us to use high-performance computing on the STU cluster (project number 26230120002), and by the Research and Development Operational Programme as part of the project “International Centre of Excellence for Research of Intelligent and Secure Information-Communication Technologies and Systems”, ITMS 26240120039.

Author information

Authors and Affiliations

Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Ilkovičova 2, 842 16, Bratislava, Slovakia
Tomáš Farkaš, Peter Kubán & Mária Lucká

Corresponding author

Correspondence to Peter Kubán.

Editor information

Editors and Affiliations

University of Latvia, Riga, Latvia
Rūsiņš Mārtiņš Freivalds
University of Paderborn, Paderborn, Germany
Gregor Engels
University of Genoa, Genoa, Italy
Barbara Catania

About this paper

Cite this paper

Farkaš, T., Kubán, P., Lucká, M. (2016). Effective Parallel Multicore-Optimized K-mers Counting Algorithm. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science, vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_38

DOI: https://doi.org/10.1007/978-3-662-49192-8_38
Published: 08 January 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49191-1
Online ISBN: 978-3-662-49192-8
eBook Packages: Computer Science, Computer Science (R0)
