Abstract
Let \(\mathcal {D} = \{\mathsf {T}_1,\mathsf {T}_2, \ldots ,\mathsf {T}_D\}\) be a collection of D string documents of n characters in total, that are drawn from an alphabet set \(\varSigma =[\sigma ]\). The top-k document retrieval problem is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1\ldots p],k)\), can return the k documents of \(\mathcal{D}\) most relevant to the pattern P. The relevance is captured using a predefined ranking function, which depends on the set of occurrences of P in \(\mathsf {T}_d\). For example, it can be the term frequency (i.e., the number of occurrences of P in \(\mathsf {T}_d\)), or it can be the term proximity (i.e., the distance between the closest pair of occurrences of P in \(\mathsf {T}_d\)), or a pattern-independent importance score of \(\mathsf {T}_d\) such as PageRank. Linear space and optimal query time solutions already exist for the general top-k document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space efficient data structures for term proximity based retrieval have been evasive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only o(n) bits on top of any compressed suffix array of \(\mathcal{D}\) and solves queries in \(O((p+k) {{\mathrm{polylog}}}\,\,n)\) time. We also show that scores that consist of a weighted combination of term proximity, term frequency, and document importance, can be handled using twice the space required to represent the text collection.

Similar content being viewed by others
Notes
If \(D=o(n)\), which we assume for simplicity in this paper. Otherwise it is \(D\log (n/D)+O(D)+o(n)=O(n)\) bits.
Using \(O(n/\log ^\epsilon n)\) bits and no special implementation for operations \(\mathsf {SA}^{-1}[\mathsf {SA}[i]\pm 1]\).
Except for \(t=0\), which has 2 positions.
Using perfect hashing to move in constant time towards the children.
References
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, 2nd edn. Addison-Wesley, Reading (2011)
Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. In: Proceedings of the 19th ESA, pp. 748–759 (2011)
Belazzougui, D., Navarro, G., Valenzuela, D.: Improved compressed indexes for full-text document retrieval. J. Discrete Algorithms 18, 3–13 (2013)
Benson, G., Waterman, M.: A fast method for fast database search for all \(k\)-nucleotide repeats. Nucleic Acids Res. 22(22), 4828–4836 (1994)
Broschart, A., Schenkel, R.: Index tuning for efficient proximity-enhanced query processing. In: INEX, pp. 213–217 (2009)
Büttcher, S., Clarke, C.L.A., Cormack, G.: Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge (2010)
de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer, Berlin (2008)
Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string search in external memory and its applications. J. ACM. 46(2), 236–280 (1999)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), Art. No. 20 (2007)
Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and applications to information retrieval. Theor. Comput. Sci. 426–427, 25–41 (2012)
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On position restricted substring searching in succinct space. J. Discrete Algorithms 17, 109–114 (2012)
Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Faster compressed top-k document retrieval. In: Proceedings of the 23rd DCC, pp. 341–350 (2013)
Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Space-efficient frameworks for top-k string retrieval. J. ACM. 61(2), 9 (2014)
Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-\(k\) string retrieval problems. In: Proceedings of the 50th FOCS, pp. 713–722 (2009)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM. 48(3), 407–430 (2001)
Munro, J.I., Navarro, G., Nielsen, J.S., Shah, R., Thankachan, S.V.: Top-k term-proximity in succinct space. In: Proceedings of the 25th ISAAC, pp. 169–180 (2014)
Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the 13th SODA, pp. 657–666 (2002)
Navarro, G.: Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4), Art. No. 52 (2014)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), Art. No. 2 (2007)
Navarro, G., Nekrich, Y.: Top-\(k\) document retrieval in optimal time and linear space. In: Proceedings of the 23rd SODA, pp. 1066–1078 (2012)
Navarro, G., Russo, L.: Fast fully-compressed suffix trees. In: Proceedings of the 24th DCC, pp. 283–291 (2014)
Navarro, G., Thankachan, S.V.: Faster top-\(k\) document retrieval in optimal space. In: Proceedings of the 20th SPIRE, LNCS 8214, pp. 255–262 (2013)
Navarro, G., Thankachan, S.V.: Top-\(k\) document retrieval in compact space and near-optimal time. In: Proceedings of the 24th ISAAC, LNCS 8283, pp. 394–404 (2013)
Navarro, G., Thankachan, S.V.: New space/time tradeoffs for top-\(k\) document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014)
Nekrich, Y., Navarro, G.: Sorted range reporting. In: Proceedings of the 13th SWAT, LNCS 7357, pp. 271–282 (2012)
Pǎtraşcu, M.: Succincter. In: Proceedings of the 49th FOCS, pp. 305–313 (2008)
Raman, R., Raman, V., Srinivasa, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), Art. No. 43 (2007)
Schenkel, R., Broschart, A., Hwang, S.-W., Theobald, M., Weikum, G.: Efficient text proximity search. In: SPIRE, pp. 287–299 (2007)
Shah, R., Sheng, C., Thankachan, S.V., Vitter, J.S.: Top-k document retrieval in external memory. In: Proceedings of the 21st ESA, LNCS 8125, pp. 803–814 (2013)
Weiner, P.: Linear pattern matching algorithm. In: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Yan, H., Shi, S., Zhang, F., Suel, T., Wen, J.-R.: Efficient term proximity search with term-pair indexes. In: CIKM, pp. 1229–1238 (2010)
Zhu, M., Shi, S., Li, M., Wen, J.-R.: Effective top-k computation in retrieving structured documents with term-proximity support. In: CIKM, pp. 771–780 (2007)
Zhu, M., Shi, S., Yu, N., Wen, J.-R.: Can phrase indexing help to process non-phrase queries? In: CIKM, pp. 679–688 (2008)
Author information
Authors and Affiliations
Corresponding author
Additional information
Funded in part by NSERC of Canada and the Canada Research Chairs program, Fondecyt Grant 1-140796, Chile, and NSF Grants CCF–1017623, CCF–1218904 MADALGO, Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation, grant DNRF84. An early partial version of this paper appeared in Proc. ISAAC 2014 [18].
Rights and permissions
About this article
Cite this article
Munro, J.I., Navarro, G., Nielsen, J.S. et al. Top-k Term-Proximity in Succinct Space. Algorithmica 78, 379–393 (2017). https://doi.org/10.1007/s00453-016-0167-2
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-016-0167-2