Effective Parallel Multicore-Optimized K-mers Counting Algorithm

  • Conference paper
SOFSEM 2016: Theory and Practice of Computer Science (SOFSEM 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9587)

Abstract

For many bioinformatics applications it is crucial to know the frequencies of all subsequences of length k (k-mers) in the reads (short reads) obtained by DNA sequencing. We present an efficient parallel algorithm for k-mer counting based on a nested bucket sort, in which the partition sizes and the number of buckets per partition are precomputed. The algorithm is designed for multicore architectures and combines the MPI framework (Open MPI) with POSIX threads to achieve very good performance. In our experiments it outperforms existing solutions in running time on the genome of Drosophila melanogaster (SRX040485).
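
The idea of counting k-mers by bucketing can be illustrated with a minimal single-process sketch. It is not the authors' implementation: it uses no MPI or threads, does not precompute partition sizes, and the function name `count_kmers_bucketed` and the `prefix_len` parameter are hypothetical names chosen for this example. K-mers are distributed into buckets by their prefix, each bucket is sorted on its own, and runs of equal k-mers are counted.

```python
from collections import defaultdict
from itertools import groupby


def count_kmers_bucketed(reads, k, prefix_len=2):
    """Count k-mers by bucketing on a prefix, then sorting each bucket.

    Illustrative only: prefix_len is an assumed tuning parameter that
    controls the number of buckets (at most 4**prefix_len for DNA).
    """
    buckets = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[kmer[:prefix_len]].append(kmer)

    counts = {}
    for bucket in buckets.values():
        # Buckets are independent, so they could be sorted in parallel.
        bucket.sort()
        # After sorting, equal k-mers are adjacent and form runs.
        for kmer, run in groupby(bucket):
            counts[kmer] = sum(1 for _ in run)
    return counts


if __name__ == "__main__":
    example_reads = ["ACGTACGT", "CGTACG"]
    for kmer, n in sorted(count_kmers_bucketed(example_reads, k=3).items()):
        print(kmer, n)
```

Because no k-mer can appear in two buckets, the per-bucket results can simply be merged, which is what makes this kind of scheme a natural fit for parallel execution.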

Notes

  1. Here, sorting means classification by some criterion, whereas ordering means arranging the data in non-increasing or non-decreasing order.

  2. The reduction in running time comes from the comparison-based \(O(n \log n)\) final sorting algorithm, which needs less time to sort k groups of l elements than one group of \(k \cdot l\) elements. The overall time complexity is therefore \(O(n + n \log \frac{n}{b})\), where b is the number of buckets. For reasonably high values of b the complexity tends to \(O(n)\). Keep in mind that asymptotic time complexity is a much coarser guide than measured performance.

  3. Drosophila melanogaster (SRX040485): http://www.ebi.ac.uk/ena/data/view/SRX040485.
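
The estimate in note 2 can be checked with a small numeric sketch. It assumes that the n elements are spread evenly over the b buckets, counts one unit of work per element for the distribution step, and uses base-2 logarithms; the function names are hypothetical. The numbers are operation-count estimates, not measured running times.

```python
import math


def single_sort_cost(n):
    """Estimated cost of one comparison sort over all n elements."""
    return n * math.log2(n)


def bucketed_sort_cost(n, b):
    """Estimated cost of distributing n elements into b equally
    filled buckets and then comparison-sorting each bucket."""
    per_bucket = n / b
    # n for the distribution pass, plus b sorts of n/b elements each,
    # which simplifies to n + n * log2(n / b).
    return n + b * per_bucket * math.log2(per_bucket)


if __name__ == "__main__":
    n = 2 ** 20
    for b in (1, 2 ** 10, 2 ** 20):
        print(b, single_sort_cost(n), bucketed_sort_cost(n, b))
```

With b equal to n the logarithmic term vanishes and the estimate reduces to n, which matches the note's remark that the complexity tends to \(O(n)\) for high b.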

References

  1. Audano, P., Vannberg, F.: KAnalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics (2014). doi:10.1093/bioinformatics/btu152. Accessed 18 March 2014

  2. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970). doi:10.1145/362686.362692

  3. Chikhi, R., Medvedev, P.: Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2014)

  4. Compeau, P.E., Pevzner, P.A., Tesler, G.: How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. 29(11), 987–991 (2011). doi:10.1038/nbt.2023

  5. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn., pp. 174–177. MIT Press and McGraw-Hill, Cambridge, New York (2001). ISBN: 0-262-03293-7. Section 8.4: Bucket sort

  6. Deorowicz, S., Debudaj-Grabysz, A., Grabowski, S.: Disk-based k-mer counting on a PC. BMC Bioinf. 14, 160 (2013)

  7. Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. CoRR abs/1407.1507 (2014)

  8. Gabriel, E., Fagg, G.E., Bosilca, G., et al.: Open MPI: goals, concept, and design of a next generation MPI implementation. In: Proceedings: 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary (2004)

  9. Farkaš, T.: Parallel Bucket sort algorithm for ordering short DNA sequences. In: IIT.SRC 2015: Student Research Conference, Bratislava, pp. 77–82 (2015). ISBN: 978-80-227-4342-6

  10. Hollerith, H.: U.S. Pat. Nos. 395781, 395782, 395783

  11. Marais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)

  12. McIlroy, P.M., et al.: Engineering radix sort. Comput. Syst. 6(1), 5–27 (1993)

  13. Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinform. 12, 333 (2011)

  14. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Nat. Acad. Sci. U.S.A. 98(17), 9748–9753 (2001)

  15. Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013)

  16. Roy, R.S., Bhattacharya, D., Schliep, A.: Turtle: identifying frequent k-mers with cache-efficient algorithms. Bioinformatics (2014). doi:10.1093/bioinformatics/btu132

  17. Shendure, J., Ji, H.: Next-generation DNA sequencing. Nat. Biotechnol. 26(10), 1135–1145 (2008)

  18. Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C., Brown, C.T.: These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS ONE 9(7), e101271 (2014). doi:10.1371/journal.pone.0101271

Acknowledgements

This work was partially supported by the Institute of Informatics and Software Engineering, FIIT STU, Intelligent analysis of big data by semantic-oriented and bio-inspired methods in parallel environment, the Scientific Grant Agency of the Slovak Republic, grant No. VG 1/0752/14, project DNApuzzleDNA, FIIT STU, which allowed us to use high-performance computing on the STU cluster (project number 26230120002), and by the Research and Development Operational Programme as part of the project “International Centre of Excellence for Research of Intelligent and Secure Information-Communication Technologies and Systems”, ITMS 26240120039.

Author information

Corresponding author

Correspondence to Peter Kubán.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

Cite this paper

Farkaš, T., Kubán, P., Lucká, M. (2016). Effective Parallel Multicore-Optimized K-mers Counting Algorithm. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science, vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_38

  • DOI: https://doi.org/10.1007/978-3-662-49192-8_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-49191-1

  • Online ISBN: 978-3-662-49192-8

  • eBook Packages: Computer Science, Computer Science (R0)
