K-mer Mapping and RDBMS Indexes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11347)

Abstract

K-mer Mapping, an internal step of de novo NGS genome fragment assembly methods, is computationally challenging because of its high main memory consumption. We present a study of index-based methods for this problem in an RDBMS environment. We propose an ad hoc I/O cost model and analyze the performance of hash and B-tree index structures. Furthermore, we present a novel hashing-based index that takes into account the notion of minimum substrings. Experiments with an actual RDBMS implementation on a sugarcane dataset show that one can obtain considerable performance gains while reducing main memory requirements.
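To make the notion of minimum substrings concrete, here is a minimal Python sketch. It is not the index proposed in the paper, whose details are not given in this abstract. It assumes the usual definition from minimum substring partitioning: the minimum substring of a k-mer is its lexicographically smallest substring of length p. The function names, the bucket layout, and the parameters `k` and `p` are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield (position, k-mer) for every length-k window of the read."""
    for i in range(len(read) - k + 1):
        yield i, read[i:i + k]


def minimum_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def build_buckets(reads, k, p):
    """Group k-mer occurrences into buckets keyed by minimum substring.

    Illustrative layout (an assumption, not the paper's schema):
    bucket key -> k-mer -> list of (read_id, position).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for read_id, read in enumerate(reads):
        for pos, kmer in kmers(read, k):
            buckets[minimum_substring(kmer, p)][kmer].append((read_id, pos))
    return buckets


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "CGTACG"]  # toy data, not from the paper
    for key, entries in sorted(build_buckets(toy_reads, k=4, p=2).items()):
        print(key, dict(entries))
```

Adjacent k-mers of a read often share the same minimum substring, so they tend to fall into the same bucket. Under these assumptions, that is why such a key is attractive for a hash-based partition: each bucket can be processed separately rather than holding every k-mer in main memory at once.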

Author information

Corresponding author

Correspondence to Elvismary Molina de Armas.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

de Armas, E.M., Ferreira, P.C.G., Haeusler, E.H., de Holanda, M.T., Lifschitz, S. (2020). K-mer Mapping and RDBMS Indexes. In: Kowada, L., de Oliveira, D. (eds) Advances in Bioinformatics and Computational Biology. BSB 2019. Lecture Notes in Computer Science, vol 11347. Springer, Cham. https://doi.org/10.1007/978-3-030-46417-2_7

  • DOI: https://doi.org/10.1007/978-3-030-46417-2_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46416-5

  • Online ISBN: 978-3-030-46417-2

  • eBook Packages: Computer Science, Computer Science (R0)
