Adaptive Succinctness

Arroyuelo, Diego; Raman, Rajeev

doi:10.1007/s00453-021-00872-1

Published: 04 October 2021

Volume 84, pages 694–718, (2022)
Algorithmica

Abstract

Representing a static set of integers S, with \(|S| = n\), from a finite universe \(U = [1{..}u]\) is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of \(\mathsf {rank}\) and \(\mathsf {select}\) on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures. Although there is an information-theoretic lower bound of \({\mathcal {B}}(n, u)= \lg {u\atopwithdelims ()n}\) bits on the space needed to represent S, this applies to worst-case (random) sets S, and sets found in practical applications are compressible. We focus on the case where elements of S contain runs of \(\ell >1\) consecutive elements, one that occurs in many practical situations. Let \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\) denote the class of \({u\atopwithdelims ()n}\) distinct sets of \(n\) elements over the universe \([1{..}u]\). Let also \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\subset {\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\) contain the sets whose \(n\) elements are arranged in \(g \le n\) runs of \(\ell _i \ge 1\) consecutive elements from U for \(i=1,\ldots , g\), and let \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g,r}\subset {\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\) contain all sets that consist of g runs, such that \(r \le g\) of them have at least 2 elements. This paper yields the following insights and contributions related to \(\mathsf {rank}\)/\(\mathsf {select}\) succinct data structures:

- We introduce new compressibility measures for sets, including:
  - \({\mathcal {B}}_1(g,n,u)= \lg {|{\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}|} = \lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-1 \atopwithdelims ()g-1}}\), and
  - \({\mathcal {B}}_2(r, g, n,u)= \lg {|{\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g,r}|} =\lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-g-1 \atopwithdelims ()r-1}} + \lg {{g\atopwithdelims ()r}}\),
  such that \({\mathcal {B}}_2(r, g, n,u)\le {\mathcal {B}}_1(g,n,u)\le {\mathcal {B}}(n, u)\).
- We give data structures that use space close to bounds \({\mathcal {B}}_1(g,n,u)\) and \({\mathcal {B}}_2(r, g, n,u)\) and support \(\mathsf {rank}\) and \(\mathsf {select}\) in \(\mathrm {O}(1)\) time.
- We provide additional measures involving entropy-coding run lengths and gaps between items, and data structures to support \(\mathsf {rank}\) and \(\mathsf {select}\) using space close to these measures.
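
To get a feel for how far apart the three measures can be, the following minimal Python sketch evaluates the closed forms stated above for one set of parameters. The function names and the sample values of u, n, g and r are illustrative assumptions, not taken from the paper; the sketch also assumes \(1 \le r \le g\) and \(n \ge g + r\), so that every binomial coefficient is positive.

```python
from math import comb, log2


def B(n, u):
    """Worst-case bound: lg C(u, n) bits."""
    return log2(comb(u, n))


def B1(g, n, u):
    """Bound for sets of n elements arranged in g runs."""
    return log2(comb(u - n + 1, g)) + log2(comb(n - 1, g - 1))


def B2(r, g, n, u):
    """Bound for g runs of which r have at least 2 elements.

    Assumes 1 <= r <= g and n >= g + r (illustrative restriction).
    """
    return (log2(comb(u - n + 1, g))
            + log2(comb(n - g - 1, r - 1))
            + log2(comb(g, r)))


# Hypothetical parameters chosen only for illustration.
u, n, g, r = 100_000, 1_000, 50, 40
print(f"B  = {B(n, u):.1f} bits")
print(f"B1 = {B1(g, n, u):.1f} bits")
print(f"B2 = {B2(r, g, n, u):.1f} bits")
```

With few runs relative to n, the run-aware measures should come out much smaller than the worst-case bound, which is the situation the abstract targets.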

Notes

For example, if we choose every element in \(U \) to be in \(S \) with probability 0.5, then \(\texttt {GAP}(S) \sim 0.81u\), less than the Shannon lower bound for \(S \).
Since \(\texttt {GAP}(S)\) and \(\texttt {RLE}(S)\) are not achievable, this statement is imprecise.
\([k\not \in {\hat{L}} ]\) is Iverson bracket notation: it equals 1 if \(k\not \in {\hat{L}} \) is true, and 0 otherwise.
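
The constant in the first note can be reproduced numerically. This excerpt does not define \(\texttt {GAP}(S)\), so the sketch below assumes the common definition in which each gap \(g_i\) between consecutive elements costs \(\lfloor \lg g_i \rfloor + 1\) bits. Under that assumption, and with every element of \(U\) chosen independently with probability p, gaps are geometrically distributed and the expected cost per universe element is a short series. The function names are hypothetical.

```python
from math import log2


def expected_gap_bits_per_universe_element(p=0.5, max_gap=200):
    """Expected GAP(S) / u when each element is in S with probability p.

    Assumption (not stated in this excerpt): a gap of size k costs
    floor(lg k) + 1 bits, which is k.bit_length() for k >= 1.
    Gaps are geometric: P(gap = k) = p * (1 - p) ** (k - 1).
    The series is truncated at max_gap; the tail is negligible for p = 0.5.
    """
    bits_per_gap = sum(p * (1 - p) ** (k - 1) * k.bit_length()
                       for k in range(1, max_gap + 1))
    # About p * u elements are chosen, each paying bits_per_gap on average.
    return p * bits_per_gap


def shannon_bits_per_universe_element(p=0.5):
    """Binary entropy H(p): the Shannon bound per universe element."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))


print(expected_gap_bits_per_universe_element())
print(shannon_bits_per_universe_element())
```

For p = 0.5 the first value is close to the 0.81 of the note and lies below the entropy of 1 bit per universe element, which agrees with the second note's remark that this measure is not achievable.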

References

1. Andersson, A., Thorup, M.: Dynamic ordered sets with exponential search trees. J. ACM 54(3), 13 (2007)
2. Arroyuelo, D., Raman, R.: Adaptive succinctness. In: Proceedings of the 26th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 11811, pp. 467–481. Springer (2019)
3. Arroyuelo, D., Oyarzún, M., González, S., Sepulveda, V.: Hybrid compression of inverted lists for reordered document collections. Inf. Process. Manag. 54(6), 1308–1324 (2018)
4. Barbay, J.: From time to space: fast algorithms that yield small and fast data structures. In: Space-Efficient Data Structures, Streams, and Algorithms: Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, LNCS 8066, pp. 97–111. Springer (2013)
5. Blandford, D.K., Blelloch, G.E.: Dictionaries using variable-length keys and data, with applications. In: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1–10. SIAM (2005)
6. Blandford, D.K., Blelloch, G.E.: Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algorithms 4(2), 17:1-17:25 (2008)
7. Boldi, P., Vigna, S.: The webgraph framework I: compression techniques. In: Proceedings of the 13th International Conference on World Wide Web (WWW), pp. 595–602 (2004)
8. Boldi, P., Vigna, S.: The webgraph framework II: codes for the world-wide web. In: Proceedings of the Data Compression Conference (DCC), p. 528 (2004)
9. Bona, M.: A Walk Through Combinatorics: An Introduction to Enumeration and Graph Theory, 4th edn. World Scientific, Singapore (2016)
10. Bookstein, A., Klein, S.T.: Construction of optimal graphs for bit-vector compression. In: Proceedings of the 13th International Conference on Research and Development in Information Retrieval (SIGIR), pp. 327–342 (1990)
11. Cafagna, F., Böhlen, M.H.: Disjoint interval partitioning. VLDB J. 26(3), 447–466 (2017)
12. Chen, Y., Chen, Y.: An efficient algorithm for answering graph reachability queries. In: Proceedings of the 24th International Conference on Data Engineering (ICDE), pp. 893–902 (2008)
13. Chen, Y., Chen, Y.: Decomposing DAGs into spanning trees: a new way to compress transitive closures. In: Proceedings of the 27th International Conference on Data Engineering (ICDE), pp. 1007–1018 (2011)
14. Chen, Y., Shen, W.: An efficient method to evaluate intersections on big data sets. Theoret. Comput. Sci. 647, 1–21 (2016)
15. Clark, D.R., Munro, J.I.: Efficient suffix trees on secondary storage (extended abstract). In: Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 383–391 (1996)
16. Clark, D.: Compact pat trees. Ph.D. thesis, University of Waterloo (1997)
17. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)
18. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2006)
19. de Berg, M., Cheong, O., van Kreveld, M.J., Overmars, M.H.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer, New York (2008)
20. Delpratt, O., Rahman, N., Raman, R.: Engineering the LOUDS succinct tree representation. In: Proceedings of the 5th International Workshop on Experimental Algorithms (WEA), pp. 134–145 (2006)
21. Demaine, E.D., López-Ortiz, A., Munro, J.I.: Adaptive set intersections, unions, and differences. In: Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 743–752. ACM/SIAM (2000)
22. Dignös, A., Böhlen, M.H., Gamper, J.: Overlap interval partition join. In: Proceedings of the International Conference on Management of Data (SIGMOD), pp. 1459–1470 (2014)
23. Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21(2), 246–260 (1974)
24. Estivill-Castro, V., Wood, D.: A survey of adaptive sorting algorithms. ACM Comput. Surv. 24(4), 441–476 (1992)
25. Foschini, L., Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: experiments with compressing suffix arrays and applications. ACM Trans. Algorithms 2(4), 611–639 (2006)
26. Fraenkel, A.S., Klein, S.T.: Novel compression of sparse bit-strings: preliminary report. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words, NATO ASI Series (Series F: Computer and Systems Sciences), vol. 12. Springer (1985)
27. Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-Runs bounded space. J. ACM 67(1), 2:1-2:54 (2020)
28. Gao, D., Jensen, C.S., Snodgrass, R.T., Soo, M.D.: Join operations in temporal databases. VLDB J. 14(1), 2–29 (2005)
29. Golomb, S.: Run-length encodings (corresp.). IEEE Trans. Inf. Theory 12(3), 399–401 (1966)
30. Golynski, A., Raman, R., Rao, S.S.: On the redundancy of succinct data structures. In: Proceedings of the 11th Scandinavian Workshop on Algorithm Theory (SWAT), LNCS 5124, pp. 148–159. Springer (2008)
31. Golynski, A., Orlandi, A., Raman, R., Rao, S.S.: Optimal indexes for sparse bit vectors. Algorithmica 69(4), 906–924 (2014)
32. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850 (2003)
33. Gupta, A., Hon, W.K., Shah, R., Vitter, J.S.: Compressed data structures: dictionaries and data-aware measures. Theoret. Comput. Sci. 387(3), 313–331 (2007)
34. Huo, H., Chen, L., Zhao, H., Vitter, J.S., Nekrich, Y., Yu, Q.: A data-aware FM-index. In: Proceedings of the 17th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 10–23 (2015)
35. Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS), pp. 549–554 (1989)
36. Jakobsson, M.: Huffman coding in bit-vector compression. Inf. Process. Lett. 7(6), 304–307 (1978)
37. Jansson, J., Sadakane, K., Sung, W.: Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2), 619–631 (2012)
38. Johnson, D.S., Krishnan, S., Chhugani, J., Kumar, S., Venkatasubramanian, S.: Compressing large boolean matrices using reordering techniques. In: Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pp. 13–23 (2004)
39. Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nord. J. Comput. 12(1), 40–66 (2005)
40. Moffat, A., Zobel, J.: Parameterised compression for sparse bitmaps. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 274–285 (1992)
41. Navarro, G.: Compact Data Structures: A Practical Approach. Cambridge University Press, Cambridge (2016)
42. Golynski, A., Grossi, R., Gupta, A., Raman, R., Rao, S.S.: On the size of succinct indices. In: Proceedings of the 15th Annual European Symposium on Algorithms (ESA), LNCS 4698, pp. 371–382. Springer (2007)
43. Ottaviano, G., Venturini, R.: Partitioned Elias-Fano indexes. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 273–282 (2014)
44. Pǎtraşcu, M., Thorup, M.: Time-space trade-offs for predecessor search. In: Proceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pp. 232–240 (2006)
45. Pǎtraşcu, M., Viola, E.: Cell-probe lower bounds for succinct partial sums. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 117–122 (2010)
46. Pǎtraşcu, M.: Succincter. In: Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 305–313 (2008)
47. Pibiri, G.E., Venturini, R.: Techniques for inverted index compression. ACM Comput. Surv. 53(6), 125:1-125:36 (2021)
48. Quinlan, A.R., Robins, G., Hall, I.M., Skadron, K., Layer, R.M.: Binary Interval Search: a scalable algorithm for counting interval intersections. Bioinformatics 29(1), 1–7 (2012)
49. Rahman, N., Raman, R.: Rank and select operations on binary strings. In: Encyclopedia of Algorithms (2008)
50. Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007)
51. Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1230–1239 (2006)
52. Soo, M.D., Snodgrass, R.T., Jensen, C.S.: Efficient evaluation of the valid-time natural join. In: Proceedings of the 10th International Conference on Data Engineering (ICDE), pp. 282–292 (1994)

Acknowledgements

The first author was funded by ANID (Millennium Science Initiative Program, Code ICN17_002).

Author information

Authors and Affiliations

Millennium Institute for Foundational Research on Data (IMFD), Santiago, Chile
Diego Arroyuelo
Department of Informatics, Universidad Técnica Federico Santa María, Santiago, Chile
Diego Arroyuelo
Department of Informatics, University of Leicester, University Road, Leicester, LE1 7RH, UK
Rajeev Raman

Corresponding author

Correspondence to Diego Arroyuelo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Proofs from Sect. 3

A.1 Proof of Theorem 2

Proof

Let us consider a characteristic bit vector (cbv) \(C_{\!S}= \mathbf {0}^{z_1}\mathbf {1}^{l_1}\mathbf {0}^{z_2}\mathbf {1}^{l_2}\cdots \mathbf {0}^{z_g}\mathbf {1}^{l_g}\mathbf {0}^{z_{g+1}}\) (for \(z_1,z_{g+1} \ge 0\), \(z_2,\ldots ,z_g, l_1,\ldots , l_g >0\)) of length \(u\), with \(n\) \(\mathbf {1}\)s grouped into g 1-runs. This corresponds to a set \(S \subseteq U\) of \(n\) elements arranged in g maximal runs. Think of \(C_{\!S}\) as consisting of \(g+1\) distinguishable “bins”, each of the form \(\mathbf {0}^{z_i}\mathbf {1}^{l_i}\), except for the last bin, which contains only \(\mathbf {0}\)s (and can be empty). Let us count how many ways there are to distribute the \(n\) \(\mathbf {1}\)s among the first g bins, and the \(u- n\) \(\mathbf {0}\)s among all \(g+1\) bins.

1. To count the number of ways in which the \(\mathbf {1}\)s can be distributed among the first g bins, note that each bin must have at least one \(\mathbf {1}\). This leaves only \(n-g\) \(\mathbf {1}\)s, which can be distributed in \({n-g+g-1 \atopwithdelims ()g-1} = {n-1 \atopwithdelims ()g-1}\) different ways. An alternative way to get this is to count the number of compositions of the integer \(n\) into g parts: each such composition is an ordered tuple \(\langle m_1,\ldots , m_g \rangle \) such that \(m_i>0\) and \(m_1 + \cdots + m_g = n\). It turns out that this number is \({n- 1 \atopwithdelims ()g-1}\) [9, see Corollary 5.3].
2. Similarly, to count the number of ways in which the \(u-n\) \(\mathbf {0}\)s can be distributed among the \(g+1\) bins, recall that each bin must contain at least one \(\mathbf {0}\), which separates it from the previous \(\mathbf {1}\)-run in the bit vector. The only exceptions are the first and last bins, as the bit vector does not necessarily start and end with \(\mathbf {0}\)s. To reduce the number of special cases, we prefix the bit vector with a dummy \(\mathbf {0}\), increasing the universe size to \(u+1\). For the \(\mathbf {0}\)s at the end of \(C_{\!S}\), on the other hand, we consider two cases:
   (a) \(C_{\!S}\) finishes in \(\mathbf {1}\): the \(\mathbf {0}\)s must be distributed among g distinguishable bins. Since each bin must have at least one \(\mathbf {0}\), we are left with \(u+1-n-g\) \(\mathbf {0}\)s, which can be distributed in \({u+ 1 - n- g +g- 1 \atopwithdelims ()g-1} = {u- n\atopwithdelims ()g-1}\) different ways.
   (b) \(C_{\!S}\) finishes in \(\mathbf {0}\): in this case, we append an additional bin (we now have \(g+1\) of them) that can contain only \(\mathbf {0}\)s. Since each of the \(g+1\) bins must contain at least one \(\mathbf {0}\), we are left with \(u+1-n-(g+1)\) \(\mathbf {0}\)s, which can be distributed in \({u+1-n-(g+1)+g \atopwithdelims ()g} = {u- n\atopwithdelims ()g}\) different ways.
   From (a) and (b) we obtain, by Pascal's rule, \({u- n\atopwithdelims ()g-1}+{u- n\atopwithdelims ()g}={u-n+1 \atopwithdelims ()g}\), which is the total number of ways of distributing the \(\mathbf {0}\)s into \(g+1\) bins.

Combining the results from items 1 and 2, we obtain \({n-1 \atopwithdelims ()g-1}{u-n+ 1 \atopwithdelims ()g}\) different characteristic bit vectors of length \(u\) with \(n\) \(\mathbf {1}\)s arranged in g runs. \(\square \)
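
The count can be checked by exhaustive enumeration for small universes. The following is a minimal sketch, not part of the paper: it lists every n-subset of \([1{..}u]\), counts its maximal runs, and compares the tally with the closed form. The helper names are hypothetical.

```python
from itertools import combinations
from math import comb


def count_runs(s):
    """Number of maximal runs of consecutive integers in the sorted tuple s."""
    return sum(1 for i, x in enumerate(s) if i == 0 or x != s[i - 1] + 1)


def brute_force_count(u, n, g):
    """Number of n-subsets of [1..u] whose elements form exactly g runs."""
    return sum(1 for s in combinations(range(1, u + 1), n)
               if count_runs(s) == g)


def theorem2_count(u, n, g):
    """Closed form: C(n-1, g-1) * C(u-n+1, g)."""
    return comb(n - 1, g - 1) * comb(u - n + 1, g)


# Compare both counts for every feasible (u, n, g) with u <= 10.
for u in range(1, 11):
    for n in range(1, u + 1):
        for g in range(1, n + 1):
            assert brute_force_count(u, n, g) == theorem2_count(u, n, g)
print("closed form matches enumeration for all u <= 10")
```

Note that `math.comb(k, j)` returns 0 when j > k, so infeasible combinations (more runs than the zeros can separate) are handled without special cases.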

A.2 Proof of Theorem 3

To prove Theorem 3, we need the following result:

Lemma 4

There are \({n-g-1 \atopwithdelims ()r-1}{g \atopwithdelims ()r}\) distinct compositions of an integer \(n\) into g parts \(\langle m_1,\ldots , m_g \rangle \), such that \(m_i > 0\) for all i, \(m_1+\cdots +m_g = n\), and exactly \(r\le g\) of these \(m_i\) are \(\ge 2\).

Proof

Consider g distinct, initially empty bins \(G_1, \ldots , G_g\), and \(n\) identical balls. For \(i=1,\ldots , g\), let \(m_i\) be the size of \(G_i\) (initially \(m_i=0\)). Since \(m_i>0\) must hold, we assign a single ball to each bin. From the \(n-g\) remaining balls, we assign another ball to each of the first r bins \(G_1,\ldots , G_r\) (since r parts are \(\ge 2\)). The remaining \(n-g-r\) balls can be distributed into these r bins in \({n-g-r+r-1 \atopwithdelims ()r-1} = {n-g-1 \atopwithdelims ()r-1}\) distinct ways. Now, consider that the r bins of size \(\ge 2\) are not necessarily \(G_1,\ldots ,G_r\), but can be any r of the g bins. There are \({g \atopwithdelims ()r}\) ways of choosing r bins out of g, hence the lemma follows. \(\square \)
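
The lemma can be sanity-checked by listing compositions directly. The sketch below is an illustration under stated assumptions rather than part of the proof: it assumes \(n > g\) and \(r \ge 1\), so that the binomial coefficients have non-negative arguments, and the function names are hypothetical.

```python
from math import comb


def compositions(n, g):
    """Yield every ordered g-tuple of positive integers that sums to n."""
    if g == 1:
        if n >= 1:
            yield (n,)
        return
    # The first part can take any value that leaves at least 1 per remaining part.
    for first in range(1, n - g + 2):
        for rest in compositions(n - first, g - 1):
            yield (first,) + rest


def brute_force_count(n, g, r):
    """Compositions of n into g parts with exactly r parts >= 2."""
    return sum(1 for c in compositions(n, g)
               if sum(m >= 2 for m in c) == r)


def lemma4_count(n, g, r):
    """Closed form: C(n-g-1, r-1) * C(g, r), assuming n > g and r >= 1."""
    return comb(n - g - 1, r - 1) * comb(g, r)


for n in range(2, 10):
    for g in range(1, n):          # n > g, so at least one part is >= 2
        for r in range(1, g + 1):
            assert brute_force_count(n, g, r) == lemma4_count(n, g, r)
print("Lemma 4 count matches enumeration for all n < 10")
```

For example, with n = 7, g = 3 and r = 2, one part equals 1 and the other two sum to 6 with each at least 2, giving 3 choices for the position of the 1 and 3 ways to split 6, that is 9 compositions.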

Proof of Theorem 3

As proved in Lemma 4, an integer \(n\) has \({n-g-1 \atopwithdelims ()r-1}{g \atopwithdelims ()r}\) distinct compositions into g parts such that exactly r of these parts are \(\ge 2\). This is the number of ways of distributing the \(\mathbf {1}\)s in the characteristic bit vector \(C_{\!S}[1{..}u]\) while satisfying the imposed restrictions. It must be combined with the number of ways to distribute the \(u- n\) \(\mathbf {0}\)s, which is \({u-n+1 \atopwithdelims ()g}\) according to the proof of Theorem 2. Multiplying both counts completes the proof. \(\square \)
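
As with Theorem 2, the resulting product can be compared against exhaustive enumeration on small inputs. The sketch below reads "r of these parts" as exactly r, as in Lemma 4, and assumes \(n > g\) and \(r \ge 1\); the helper names are hypothetical. It also checks that summing over r recovers the Theorem 2 count.

```python
from itertools import combinations
from math import comb


def run_lengths(s):
    """Lengths of the maximal runs of consecutive integers in sorted tuple s."""
    lengths = []
    for i, x in enumerate(s):
        if i > 0 and x == s[i - 1] + 1:
            lengths[-1] += 1      # x extends the current run
        else:
            lengths.append(1)     # x starts a new run
    return lengths


def brute_force_count(u, n, g, r):
    """n-subsets of [1..u] with g runs, exactly r of which have length >= 2."""
    total = 0
    for s in combinations(range(1, u + 1), n):
        lengths = run_lengths(s)
        if len(lengths) == g and sum(l >= 2 for l in lengths) == r:
            total += 1
    return total


def theorem3_count(u, n, g, r):
    """Closed form: C(u-n+1, g) * C(n-g-1, r-1) * C(g, r)."""
    return comb(u - n + 1, g) * comb(n - g - 1, r - 1) * comb(g, r)


u = 10
for n in range(2, u + 1):
    for g in range(1, n):                      # n > g keeps arguments >= 0
        for r in range(1, g + 1):
            assert brute_force_count(u, n, g, r) == theorem3_count(u, n, g, r)
        # Summing over r gives the Theorem 2 count for g runs.
        by_r = sum(theorem3_count(u, n, g, r) for r in range(1, g + 1))
        assert by_r == comb(n - 1, g - 1) * comb(u - n + 1, g)
print("Theorem 3 count matches enumeration for u = 10")
```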

About this article

Cite this article

Arroyuelo, D., Raman, R. Adaptive Succinctness. Algorithmica 84, 694–718 (2022). https://doi.org/10.1007/s00453-021-00872-1

Received: 18 September 2020
Accepted: 31 August 2021
Published: 04 October 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s00453-021-00872-1