Abstract
We introduce a new variant of the popular Burrows-Wheeler transform (BWT), called the Geometric Burrows-Wheeler Transform (GBWT), which converts a text into a set of points in 2-dimensional geometry. We also introduce a reverse transform, called Points2Text, which converts a set of points into text. Using these two transforms, we show a strong equivalence between data structural problems in geometric range searching and text pattern matching. This allows us to apply the lower bounds known in the field of orthogonal range searching to problems in compressed text indexing. In addition, we give the first succinct (compact) index for I/O-efficient pattern matching in external memory, and show how this index can be further improved to achieve higher-order entropy compressed space.
Notes
Strictly speaking, the optimal first term would be \(|P|/(B\log_{|\varSigma|} n)\), because the block size is measured in words, the word size is assumed to be \(\log n\) bits, and each character of the pattern takes \(\log|\varSigma|\) bits.
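The arithmetic behind this note can be spelled out as follows. This is only a restatement of the assumptions given in the note (a block of \(B\) words, \(\log n\) bits per word, \(\log|\varSigma|\) bits per character), and rounding is ignored.

```latex
\[
\frac{\overbrace{B \log n}^{\text{bits per block}}}
     {\underbrace{\log|\varSigma|}_{\text{bits per character}}}
  \;=\; B \log_{|\varSigma|} n \quad \text{characters per block},
\qquad\text{so reading } P \text{ takes about }
\frac{|P|}{B \log_{|\varSigma|} n} \text{ I/Os.}
\]
```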
The notation \(\tilde{O}\) ignores poly-logarithmic factors. Precisely, \(\tilde {O}(f(n)) \equiv O(f(n)\log^{O(1)} n)\).
For simplicity, we assume that n is a power of B, so that \(\log_B n\) is an integer. Otherwise, we simply consider the range of values in A as \([1, n']\), where \(n' = B^{\lceil\log_{B} n \rceil}\), so that both the space and query bounds in our proposed scheme follow.
For simplicity, we assume n is a multiple of d. Otherwise, T is first padded with enough copies of the special character $ at the end to make its length a multiple of d.
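The two normalizations in the notes above (rounding n up to \(n' = B^{\lceil\log_{B} n \rceil}\), and padding T with $ to a multiple of d) can be illustrated with a minimal sketch. The function names `round_up_to_power` and `pad_to_multiple` are hypothetical and chosen only for this illustration; they do not come from the paper.

```python
def round_up_to_power(n: int, B: int) -> int:
    """Return n' = B ** ceil(log_B n), the smallest power of B that is >= n.

    Uses integer multiplication only, which avoids floating-point
    rounding errors that math.log can introduce near exact powers.
    """
    if n < 1 or B < 2:
        raise ValueError("require n >= 1 and B >= 2")
    power = 1
    while power < n:
        power *= B
    return power


def pad_to_multiple(text: str, d: int, sentinel: str = "$") -> str:
    """Append copies of the sentinel until len(text) is a multiple of d."""
    if d < 1:
        raise ValueError("require d >= 1")
    shortfall = (-len(text)) % d  # 0 when the length is already a multiple of d
    return text + sentinel * shortfall
```

For example, `pad_to_multiple("banana", 4)` appends two sentinel characters, while a text whose length is already a multiple of d is returned unchanged.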
For simplicity, we assume that d is an integer. If not, we can slightly modify the data structures without affecting the overall complexity.
Without loss of generality, we assume here that \(|\varSigma| < \sqrt{n}\). The parameters can be appropriately adjusted for the more general case when \(|\varSigma| = O(n^{1-\epsilon})\) for any fixed \(\epsilon > 0\).
Here, we make a slight modification: one extra bit is spent for each meta-character, so that if our kth-order encoding of the next \(o(\log_{|\varSigma|} n)\) characters already exceeds \(0.5\log n\) bits, we instead encode the next \(0.5\log_{|\varSigma|} n\) characters (i.e., more characters) in plain form. The extra bit indicates whether the plain encoding or the kth-order encoding is used.
As mentioned, there is also an extra bit of overhead per meta-character; however, we will soon see that the number of meta-characters is \(O((nH_k + o(n\log|\varSigma|))/\log n)\), so this overhead is negligible.
Note that when we switch back to a node in \(\Delta_{sbt}\), we choose the top-most node in \(\Delta_{sbt}\) corresponding to the node v.
Note that choosing a larger d allows more sparsification, but in such cases it is not possible to design the Four-Russians data structure for small patterns.
About this article
Cite this article
Chien, YF., Hon, WK., Shah, R. et al. Geometric BWT: Compressed Text Indexing via Sparse Suffixes and Range Searching. Algorithmica 71, 258–278 (2015). https://doi.org/10.1007/s00453-013-9792-1
DOI: https://doi.org/10.1007/s00453-013-9792-1