Abstract
We revisit the non-overlapping indexing problem, seeking an efficient repetition-aware solution. The task is to index a text T[1..n] so that, given a query pattern P[1..p], we can report a largest set of non-overlapping occurrences of P in T. A previous index by Cohen and Porat [ISAAC 2009] takes linear space and achieves optimal \(O(p+\mathsf {occ_{no}})\) query time, where \(\mathsf {occ_{no}}\) denotes the output size. We present an index of size O(r), where r denotes the number of runs in the Burrows-Wheeler Transform (BWT) of T. The parameter r is significantly smaller than n for highly repetitive texts. The query time of our index is \(O(p\log \log _w \sigma +\textsf{sort}(\mathsf {occ_{no}}))\), where \(\sigma \) denotes the alphabet size, w denotes the machine word size in bits, and \(\textsf{sort}(x)\) denotes the time for sorting x integers within the range [1, n].
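To make the two central notions concrete, the following is a minimal Python sketch and not the index described in the abstract. It assumes 0-based positions (the abstract uses 1-based), a `$` sentinel smaller than every text symbol, and two hypothetical helper names, `non_overlapping_occurrences` and `bwt_runs`. The first function is a naive baseline that scans the text greedily from left to right, which yields a largest set of pairwise non-overlapping occurrences because all occurrences have the same length p. The second counts r by building the BWT from sorted rotations, which is only practical for short strings.

```python
def non_overlapping_occurrences(text: str, pattern: str) -> list[int]:
    """Return 0-based start positions of a largest set of pairwise
    non-overlapping occurrences of pattern in text (naive greedy scan)."""
    if not pattern:
        return []
    chosen = []
    start = text.find(pattern)
    while start != -1:
        chosen.append(start)
        # The next occurrence must begin at or after the end of this one.
        start = text.find(pattern, start + len(pattern))
    return chosen


def bwt_runs(text: str, sentinel: str = "$") -> int:
    """Count maximal equal-letter runs in the BWT of text + sentinel.

    Assumes the sentinel does not occur in text and sorts before every
    symbol of text. Sorting all rotations is quadratic or worse, so this
    is for illustration only.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rotation[-1] for rotation in rotations)
    # A new run starts at position 0 and wherever the symbol changes.
    return sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])


if __name__ == "__main__":
    print(non_overlapping_occurrences("aaaaa", "aa"))  # [0, 2]
    print(bwt_runs("banana"))                          # 5 (BWT is "annb$aa")
    print(bwt_runs("ab" * 50))                         # 3, although n = 100
```

The last call illustrates why r is an attractive space measure: the text of length 100 consists of one repeated block, and its BWT has only three runs.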
References
Alstrup, S., Brodal, G.S., Rauhe, T.: Optimal static range reporting in one dimension. In: Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, 6–8 July 2001, Heraklion, Crete, Greece, pp. 476–482 (2001). https://doi.org/10.1145/380752.380842
Bannai, H., Gagie, T., I, T.: Refining the r-index. Theor. Comput. Sci. 812, 96–108 (2020). https://doi.org/10.1016/j.tcs.2019.08.005
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. SRC Research Report, 124 (1994)
Cohen, H., Porat, E.: Range non-overlapping indexing. In: Proceedings of the Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, 16–18 December 2009, pp. 1044–1053 (2009). https://doi.org/10.1007/978-3-642-10631-6_105
Crochemore, M.: String-matching on ordered alphabets. Theor. Comput. Sci. 92(1), 33–47 (1992)
Crochemore, M., Iliopoulos, C.S., Kubica, M., Rahman, M.S., Walen, T.: Improved algorithms for the range next value problem and applications. In: Proceedings of the STACS 2008, 25th Annual Symposium on Theoretical Aspects of Computer Science, Bordeaux, France, 21–23 February 2008, pp. 205–216 (2008). https://doi.org/10.4230/LIPIcs.STACS.2008.1359
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005). https://doi.org/10.1145/1082036.1082039
Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Czumaj, A. (ed.) Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, 7–10 January 2018, pp. 1459–1477. SIAM (2018). https://doi.org/10.1137/1.9781611975031.96
Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), 2:1–2:54 (2020). https://doi.org/10.1145/3375890
Ganguly, A., Shah, R., Thankachan, S.V.: Succinct non-overlapping indexing. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 185–195. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19929-0_16
Ganguly, A., Shah, R., Thankachan, S.V.: Succinct non-overlapping indexing. Algorithmica 82(1), 107–117 (2020). https://doi.org/10.1007/s00453-019-00605-5
Giuliani, S., Inenaga, S., Lipták, Z., Prezza, N., Sciortino, M., Toffanello, A.: Novel results on the number of runs of the Burrows-Wheeler-Transform. In: Bureš, T., et al. (eds.) SOFSEM 2021. LNCS, vol. 12607, pp. 249–262. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-67731-2_18
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005). https://doi.org/10.1137/S0097539702402354
Hooshmand, S., Abedin, P., Külekci, M.O., Thankachan, S.V.: Non-overlapping indexing - cache obliviously. In: Navarro, G., Sankoff, D., Zhu, B. (eds.) Annual Symposium on Combinatorial Pattern Matching, CPM 2018, 2–4 July 2018 - Qingdao, China. LIPIcs, vol. 105, pp. 8:1–8:9. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018). https://doi.org/10.4230/LIPIcs.CPM.2018.8
Hooshmand, S., Abedin, P., Külekci, M.O., Thankachan, S.V.: I/O-efficient data structures for non-overlapping indexing. Theor. Comput. Sci. 857, 1–7 (2021). https://doi.org/10.1016/j.tcs.2020.12.006
Keller, O., Kopelowitz, T., Lewenstein, M.: Range non-overlapping indexing and successive list indexing. In: Dehne, F., Sack, J.-R., Zeh, N. (eds.) WADS 2007. LNCS, vol. 4619, pp. 625–636. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73951-7_54
Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler transform conjecture. In: Irani, S. (ed.) 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, 16–19 November 2020, pp. 1002–1013. IEEE (2020). https://doi.org/10.1109/FOCS46700.2020.00097
Kociumaka, T., Navarro, G., Prezza, N.: Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory 69(4), 2074–2092 (2023). https://doi.org/10.1109/TIT.2022.3224382
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993). https://doi.org/10.1137/0222058
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2 (2007). https://doi.org/10.1145/1216370.1216372
Nekrich, Y., Navarro, G.: Sorted range reporting. In: Fomin, F.V., Kaski, P. (eds.) SWAT 2012. LNCS, vol. 7357, pp. 271–282. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31155-0_24
Nishimoto, T., Tabei, Y.: Optimal-time queries on BWT-runs compressed indexes. In: Bansal, N., Merelli, E., Worrell, J. (eds.) 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, 12–16 July 2021, Glasgow, Scotland (Virtual Conference). LIPIcs, vol. 198, pp. 101:1–101:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021). https://doi.org/10.4230/LIPIcs.ICALP.2021.101
Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007). https://doi.org/10.1145/1290672.1290680
Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007). https://doi.org/10.1007/s00224-006-1198-x
Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, 15–17 October 1973, pp. 1–11 (1973). https://doi.org/10.1109/SWAT.1973.13
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
Acknowledgements
This research is supported in part by the U.S. National Science Foundation (NSF) award CCF-2315822.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gibney, D., Macnichol, P., Thankachan, S.V. (2023). Non-overlapping Indexing in BWT-Runs Bounded Space. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_21
DOI: https://doi.org/10.1007/978-3-031-43980-3_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer Science (R0)