Abstract
Full-text search refers to techniques for searching a document, or a document collection, in a full-text database. To speed up such searches, the given text should be indexed. The FM-index is a celebrated compressed data structure for full-text pattern searching. After a first wave of interest in its theoretical development, the last few years have seen a surge of interest in practical FM-index variants. These enhancements often concern the bit-vector representation, augmented with an efficient data structure for rank queries. In this work, we propose a new, cache-friendly implementation of the rank primitive and advocate a very simple architecture of the FM-index, which trades compression ratio for speed. Experimental results show that our variants are 2–3 times faster than the fastest known ones, at the price of typically using 1.5–5 times more space.
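To make the rank primitive concrete, the sketch below shows one generic way to keep rank queries cache-friendly: each cumulative counter is stored next to the bits it summarises, so a query reads one small contiguous region. It is an illustration only and not the implementation proposed in the paper. The class name `InterleavedRank` and the block size (one counter plus seven 64-bit words) are assumptions chosen for the example.

```python
WORD_BITS = 64
WORDS_PER_BLOCK = 7  # assumption: 1 counter + 7 payload words per block


class InterleavedRank:
    """Bit vector whose rank counters are interleaved with the bits."""

    def __init__(self, bits):
        self.n = len(bits)
        # Flat layout: [count, w0..w6, count, w0..w6, ...]
        self.data = []
        block_bits = WORD_BITS * WORDS_PER_BLOCK
        ones_so_far = 0
        # n + 1 guarantees a block exists for the query rank1(n).
        for start in range(0, self.n + 1, block_bits):
            self.data.append(ones_so_far)  # ones strictly before this block
            for w in range(WORDS_PER_BLOCK):
                base = start + w * WORD_BITS
                word = 0
                for j, b in enumerate(bits[base:base + WORD_BITS]):
                    if b:
                        word |= 1 << j
                self.data.append(word)
                ones_so_far += bin(word).count("1")

    def rank1(self, i):
        """Return the number of 1 bits in positions [0, i)."""
        if not 0 <= i <= self.n:
            raise IndexError(i)
        block, offset = divmod(i, WORD_BITS * WORDS_PER_BLOCK)
        base = block * (WORDS_PER_BLOCK + 1)
        count = self.data[base]  # precomputed prefix count
        full_words, rem = divmod(offset, WORD_BITS)
        for w in range(full_words):  # whole words inside the block
            count += bin(self.data[base + 1 + w]).count("1")
        if rem:  # partial word: mask off the bits at or after position i
            mask = (1 << rem) - 1
            count += bin(self.data[base + 1 + full_words] & mask).count("1")
        return count
```

In a compiled language the counter and its payload words would typically be sized to fill one cache line, and the per-word counts would use a hardware popcount instruction; the Python version only demonstrates the layout and the query logic.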
Acknowledgments
We thank Simon Gog for providing the FM-FB-V5 and FM-hybrid-FB_8 sources and helping us in running sdsl-lite, and Shaun D. Jackman for a remark concerning the ABySS de novo genome assembler.
The work was supported by the Polish National Science Centre upon decision DEC-2013/09/B/ST6/03117.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Grabowski, S., Raniszewski, M., Deorowicz, S. (2017). FM-index for Dummies. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_16
DOI: https://doi.org/10.1007/978-3-319-58274-0_16
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58273-3
Online ISBN: 978-3-319-58274-0
eBook Packages: Computer Science, Computer Science (R0)