Top-k Term-Proximity in Succinct Space

Algorithmica

Abstract

Let \(\mathcal{D} = \{\mathsf{T}_1,\mathsf{T}_2, \ldots ,\mathsf{T}_D\}\) be a collection of D string documents of n characters in total, drawn from an alphabet \(\varSigma =[\sigma ]\). The top-k document retrieval problem is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1\ldots p],k)\), returns the k documents of \(\mathcal{D}\) most relevant to the pattern P. Relevance is captured by a predefined ranking function, which depends on the set of occurrences of P in \(\mathsf{T}_d\). For example, it can be the term frequency (the number of occurrences of P in \(\mathsf{T}_d\)), the term proximity (the distance between the closest pair of occurrences of P in \(\mathsf{T}_d\)), or a pattern-independent importance score of \(\mathsf{T}_d\) such as PageRank. Linear-space solutions with optimal query time already exist for the general top-k document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. Space-efficient data structures for term-proximity-based retrieval, however, have remained elusive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only \(o(n)\) bits on top of any compressed suffix array of \(\mathcal{D}\) and answers queries in \(O((p+k)\,\mathrm{polylog}\,n)\) time. We also show that scores consisting of a weighted combination of term proximity, term frequency, and document importance can be handled using twice the space required to represent the text collection.
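To make the two pattern-dependent ranking functions concrete, the following is a naive Python sketch that scans every document at query time. It is only a reference baseline for the problem definition, not the succinct data structure of the paper. The function names are hypothetical. Two choices are assumptions, not statements from the abstract: a smaller closest-pair distance ranks higher, and documents with fewer than two occurrences are left out of the proximity ranking. The pattern is assumed to be non-empty.

```python
import heapq

def occurrences(text, pattern):
    """Start positions of all (possibly overlapping) occurrences of pattern."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions

def term_frequency(text, pattern):
    """Number of occurrences of pattern in text."""
    return len(occurrences(text, pattern))

def term_proximity(text, pattern):
    """Distance between the closest pair of occurrences of pattern.

    Returns infinity when there are fewer than two occurrences
    (an assumption of this sketch).
    """
    pos = occurrences(text, pattern)
    if len(pos) < 2:
        return float("inf")
    # Positions are sorted, so the closest pair is always adjacent.
    return min(b - a for a, b in zip(pos, pos[1:]))

def top_k_by_proximity(docs, pattern, k):
    """Indices of up to k documents with the smallest term proximity.

    Ties are broken by the smaller document index.
    """
    scored = []
    for d, text in enumerate(docs):
        score = term_proximity(text, pattern)
        if score != float("inf"):
            scored.append((score, d))
    return [d for _, d in heapq.nsmallest(k, scored)]

docs = ["abracadabra", "banana", "xyz", "aa"]
print(term_frequency(docs[0], "a"))      # occurrences of "a" in document 0
print(top_k_by_proximity(docs, "a", 2))  # two documents with the closest pair
```

This baseline takes time proportional to the total collection length per query, which is exactly the cost that an index answering queries in \(O((p+k)\,\mathrm{polylog}\,n)\) time is designed to avoid.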

Notes

  1. This holds if \(D=o(n)\), which we assume for simplicity in this paper; otherwise it is \(D\log (n/D)+O(D)+o(n)=O(n)\) bits.

  2. Using \(O(n/\log ^\epsilon n)\) bits and no special implementation for operations \(\mathsf {SA}^{-1}[\mathsf {SA}[i]\pm 1]\).

  3. Except for \(t=0\), which has 2 positions.

  4. Using perfect hashing to move in constant time towards the children.

References

  1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, 2nd edn. Addison-Wesley, Reading (2011)

  2. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. In: Proceedings of the 19th ESA, pp. 748–759 (2011)

  3. Belazzougui, D., Navarro, G., Valenzuela, D.: Improved compressed indexes for full-text document retrieval. J. Discrete Algorithms 18, 3–13 (2013)

  4. Benson, G., Waterman, M.: A method for fast database search for all \(k\)-nucleotide repeats. Nucleic Acids Res. 22(22), 4828–4836 (1994)

  5. Broschart, A., Schenkel, R.: Index tuning for efficient proximity-enhanced query processing. In: INEX, pp. 213–217 (2009)

  6. Büttcher, S., Clarke, C.L.A., Cormack, G.: Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge (2010)

  7. de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer, Berlin (2008)

  8. Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string search in external memory and its applications. J. ACM 46(2), 236–280 (1999)

  9. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), Art. No. 20 (2007)

  10. Gagie, T., Navarro, G., Puglisi, S.J.: New algorithms on wavelet trees and applications to information retrieval. Theor. Comput. Sci. 426–427, 25–41 (2012)

  11. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

  12. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On position restricted substring searching in succinct space. J. Discrete Algorithms 17, 109–114 (2012)

  13. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Faster compressed top-k document retrieval. In: Proceedings of the 23rd DCC, pp. 341–350 (2013)

  14. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: Space-efficient frameworks for top-k string retrieval. J. ACM 61(2), Art. No. 9 (2014)

  15. Hon, W.-K., Shah, R., Vitter, J.S.: Space-efficient framework for top-\(k\) string retrieval problems. In: Proceedings of the 50th FOCS, pp. 713–722 (2009)

  16. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  17. Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)

  18. Munro, J.I., Navarro, G., Nielsen, J.S., Shah, R., Thankachan, S.V.: Top-k term-proximity in succinct space. In: Proceedings of the 25th ISAAC, pp. 169–180 (2014)

  19. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proceedings of the 13th SODA, pp. 657–666 (2002)

  20. Navarro, G.: Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4), Art. No. 52 (2014)

  21. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), Art. No. 2 (2007)

  22. Navarro, G., Nekrich, Y.: Top-\(k\) document retrieval in optimal time and linear space. In: Proceedings of the 23rd SODA, pp. 1066–1078 (2012)

  23. Navarro, G., Russo, L.: Fast fully-compressed suffix trees. In: Proceedings of the 24th DCC, pp. 283–291 (2014)

  24. Navarro, G., Thankachan, S.V.: Faster top-\(k\) document retrieval in optimal space. In: Proceedings of the 20th SPIRE, LNCS 8214, pp. 255–262 (2013)

  25. Navarro, G., Thankachan, S.V.: Top-\(k\) document retrieval in compact space and near-optimal time. In: Proceedings of the 24th ISAAC, LNCS 8283, pp. 394–404 (2013)

  26. Navarro, G., Thankachan, S.V.: New space/time tradeoffs for top-\(k\) document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014)

  27. Nekrich, Y., Navarro, G.: Sorted range reporting. In: Proceedings of the 13th SWAT, LNCS 7357, pp. 271–282 (2012)

  28. Pǎtraşcu, M.: Succincter. In: Proceedings of the 49th FOCS, pp. 305–313 (2008)

  29. Raman, R., Raman, V., Srinivasa, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), Art. No. 43 (2007)

  30. Schenkel, R., Broschart, A., Hwang, S.-W., Theobald, M., Weikum, G.: Efficient text proximity search. In: SPIRE, pp. 287–299 (2007)

  31. Shah, R., Sheng, C., Thankachan, S.V., Vitter, J.S.: Top-k document retrieval in external memory. In: Proceedings of the 21st ESA, LNCS 8125, pp. 803–814 (2013)

  32. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)

  33. Yan, H., Shi, S., Zhang, F., Suel, T., Wen, J.-R.: Efficient term proximity search with term-pair indexes. In: CIKM, pp. 1229–1238 (2010)

  34. Zhu, M., Shi, S., Li, M., Wen, J.-R.: Effective top-k computation in retrieving structured documents with term-proximity support. In: CIKM, pp. 771–780 (2007)

  35. Zhu, M., Shi, S., Yu, N., Wen, J.-R.: Can phrase indexing help to process non-phrase queries? In: CIKM, pp. 679–688 (2008)

Author information

Corresponding author

Correspondence to Jesper Sindahl Nielsen.

Additional information

Funded in part by NSERC of Canada and the Canada Research Chairs program; Fondecyt Grant 1-140796, Chile; NSF Grants CCF–1017623 and CCF–1218904; and MADALGO, Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation, grant DNRF84. An early partial version of this paper appeared in Proc. ISAAC 2014 [18].

About this article

Cite this article

Munro, J.I., Navarro, G., Nielsen, J.S. et al. Top-k Term-Proximity in Succinct Space. Algorithmica 78, 379–393 (2017). https://doi.org/10.1007/s00453-016-0167-2

  • DOI: https://doi.org/10.1007/s00453-016-0167-2
