Abstract
Metagenomics poses several challenges, one of which is data management. A central concept in this context is the k-mer: any subsequence of length k of a DNA (sub)sequence. This work focuses on indexing k-mers and on supporting box queries, in which a query string of length k may allow multiple nucleobases at each position. A novel index structure, ND-GiST, is introduced that can handle box queries. Compared with a full table scan and the traditional B-tree, the performance results of ND-GiST are encouraging.
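To make the terms k-mer and box query concrete, the following minimal Python sketch extracts k-mers from a sequence and answers a box query by scanning every record, which corresponds to the full-table-scan baseline mentioned in the abstract. It is not the ND-GiST index itself; the function names (`kmers`, `matches_box`, `box_query_scan`) and the representation of a box as one set of allowed nucleobases per position are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Set


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every length-k substring (k-mer) of a DNA sequence, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def matches_box(kmer: str, box: Sequence[Set[str]]) -> bool:
    """Return True if each base of the k-mer is allowed at its position.

    `box` holds one set of allowed nucleobases per position (hypothetical
    representation chosen for this sketch).
    """
    return len(kmer) == len(box) and all(
        base in allowed for base, allowed in zip(kmer, box)
    )


def box_query_scan(records: List[str], box: Sequence[Set[str]]) -> List[str]:
    """Answer a box query by checking every stored k-mer (full table scan)."""
    return [record for record in records if matches_box(record, box)]


# Example: 3-mers of a short sequence, queried with a box that allows
# A or C at position 1, C or G at position 2, and G or T at position 3.
records = list(kmers("ACGTACGA", 3))
box = [{"A", "C"}, {"C", "G"}, {"G", "T"}]
hits = box_query_scan(records, box)
print(hits)
```

A scan like this inspects every stored k-mer for each query; an index structure aims to avoid that by pruning groups of records that cannot match the box.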
Notes
- 2. The records are listed in the order of insertion into the tree.
Acknowledgments
The project has been supported by the European Union’s Horizon 2020 research and innovation program under grant agreement no. 643476 (COMPARE), by the Novo Nordisk Foundation Interdisciplinary Synergy Programme [Grant NNF15OC0016584] and by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Szalai-Gindl, J.M., Kiss, A., Halász, G., Dobos, L., Csabai, I. (2019). ND-GiST: A Novel Method for Disk-Resident k-mer Indexing. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S. (eds) New Knowledge in Information Systems and Technologies. WorldCIST'19 2019. Advances in Intelligent Systems and Computing, vol 931. Springer, Cham. https://doi.org/10.1007/978-3-030-16184-2_63
DOI: https://doi.org/10.1007/978-3-030-16184-2_63
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16183-5
Online ISBN: 978-3-030-16184-2
eBook Packages: Intelligent Technologies and Robotics (R0)