Abstract
Philippe et al. (2011) proposed a data structure called Gk arrays for indexing and querying large collections of high-throughput sequencing data in main memory. The data structure supports versatile queries for counting, locating, and analysing the coverage profile of k-mers in short-read data. The main drawback of Gk arrays is their space consumption, which can easily reach tens of gigabytes of main memory even for moderate-size inputs. We propose a compressed variant of Gk arrays that supports the same set of queries in both near-optimal time and near-optimal space. In practice, the compressed Gk arrays scale up to much larger inputs, with query times that are highly competitive with those of their non-compressed predecessor. The main applications include variant calling, error correction, coverage profiling, and sequence assembly.
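To make the query types named above concrete, the following is a minimal sketch of counting, locating, and coverage-profile queries over a read collection. It uses a plain Python dictionary, which is neither the Gk arrays of Philippe et al. nor the compressed variant proposed here, and it makes no attempt at space efficiency. All function names (`build_kmer_index`, `count_occurrences`, `count_reads`, `locate`, `coverage_profile`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to the list of (read_id, offset) pairs where it occurs.

    Illustrative only: a hash map of explicit positions, not Gk arrays.
    """
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index


def count_occurrences(index, kmer):
    """Counting query: total occurrences of kmer across all reads."""
    return len(index.get(kmer, []))


def count_reads(index, kmer):
    """Counting query: number of distinct reads that contain kmer."""
    return len({read_id for read_id, _ in index.get(kmer, [])})


def locate(index, kmer):
    """Locating query: every (read_id, offset) at which kmer occurs."""
    return list(index.get(kmer, []))


def coverage_profile(index, read, k):
    """For each k-mer start position in read, the number of reads sharing it."""
    return [count_reads(index, read[i:i + k]) for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG", "TTACGT"]
    index = build_kmer_index(reads, k=3)
    print(count_occurrences(index, "ACG"))
    print(locate(index, "TTA"))
    print(coverage_profile(index, reads[2], k=3))
```

Storing every position explicitly, as this sketch does, is the kind of memory cost that grows quickly with input size, which is the drawback the compressed variant is meant to address.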
References
Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight BWT construction for very large string collections. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 219–231. Springer, Heidelberg (2011)
Beller, T., Gog, S., Ohlebusch, E., Schnattinger, T.: Computing the longest common prefix array based on the Burrows-Wheeler transform. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 197–208. Springer, Heidelberg (2011)
Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., Vingron, M.: q-gram Based Database Searching Using a Suffix Array (QUASAR). In: 3rd Int. Conf. on Computational Molecular Biology, pp. 77–83. ACM Press (1999)
Chikhi, R., Lavenier, D.: Localized genome assembly from reads to scaffolds: Practical traversal of the paired string graph. In: Przytycka, T.M., Sagot, M.-F. (eds.) WABI 2011. LNCS, vol. 6833, pp. 39–48. Springer, Heidelberg (2011)
Claude, F., Fariña, A., Martínez-Prieto, M.A., Navarro, G.: Compressed q-gram indexing for highly repetitive biological sequences. In: Proc. 10th IEEE Intl. Conf. on Bioinformatics and Bioengineering, pp. 86–91 (2010)
Conway, T.C., Bromage, A.J.: Succinct Data Structures for Assembling Large Genomes. Bioinformatics 27(4), 479–486 (2011)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS), pp. 390–398. IEEE Computer Society (2000)
Fischer, J., Mäkinen, V., Navarro, G.: Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci. 410(51), 5354–5364 (2009)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: 14th Ann. ACM-SIAM Symp. on Discrete Algorithms, pp. 841–850 (2003)
Hazelhurst, S., Lipták, Z.: Kaboom! A new suffix array based algorithm for clustering expression data. Bioinformatics 27(24), 3348–3355 (2011)
Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., Yiu, S.-M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48(1), 23–36 (2007)
Hon, W.-K., Sadakane, K.: Space-economical algorithms for finding maximal unique matches. In: Apostolico, A., Takeda, M. (eds.) CPM 2002. LNCS, vol. 2373, pp. 144–152. Springer, Heidelberg (2002)
Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-k string retrieval problems. In: FOCS, pp. 713–722. IEEE Computer Society (2009)
Jacobson, G.: Succinct Static Data Structures. PhD thesis, Carnegie–Mellon (1989)
Li, H.: Implementation of BCR, https://github.com/lh3/ropebwt
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)
Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011)
Melsted, P., Pritchard, J.: Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics 12(1), 333 (2011)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1) (2007)
Philippe, N., Salson, M., Commes, T., Rivals, E.: CRAC: an integrated approach to read analysis. Genome Biology (in press, 2013)
Philippe, N., Salson, M., Lecroq, T., Léonard, M., Commes, T., Rivals, E.: Querying large read collections in main memory: a versatile data structure. BMC Bioinformatics 12, 242 (2011)
Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics, advance access (January 2013)
Salmela, L., Schröder, J.: Correcting errors in short reads by multiple alignments. Bioinformatics 27(11), 1455–1461 (2011)
Sirén, J.: Compressed Full-Text Indexes for Highly Repetitive Collections. PhD thesis, Dept. of Computer Science, Report A-2012-5, University of Helsinki (2012)
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett. 17(2), 81–84 (1983)
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Välimäki, N., Rivals, E. (2013). Scalable and Versatile k-mer Indexing for High-Throughput Sequencing Data. In: Cai, Z., Eulenstein, O., Janies, D., Schwartz, D. (eds) Bioinformatics Research and Applications. ISBRA 2013. Lecture Notes in Computer Science, vol 7875. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38036-5_24
DOI: https://doi.org/10.1007/978-3-642-38036-5_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38035-8
Online ISBN: 978-3-642-38036-5
eBook Packages: Computer Science (R0)