Efficient Index for Retrieving Top-k Most Frequent Documents

  • Conference paper
String Processing and Information Retrieval (SPIRE 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5721)

Abstract

In the document retrieval problem [9], we are given in advance a collection of documents (strings) of total length D, and the goal is to build an index over these documents such that, for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension of this problem, which we call top-k frequent document retrieval: instead of listing all documents containing P, the goal is to identify the k documents with the most occurrences of P. This problem forms a basis for search engine tasks that retrieve documents ranked by the TF-IDF metric.
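To make the query semantics concrete, the following is a minimal brute-force sketch, not the index proposed in the paper. It scans every document on each query, so it only serves as a reference definition of what a top-k frequent document query should return. The function names, the choice to count overlapping occurrences, and the tie-breaking rule (smaller document id first) are assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps (an assumption)."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_frequent_documents(documents, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Documents that do not contain the pattern are omitted.
    Ties are broken by smaller doc_id (a hypothetical convention).
    """
    hits = []
    for doc_id, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            hits.append((doc_id, freq))
    # Order by descending frequency, then ascending doc_id.
    return heapq.nsmallest(k, hits, key=lambda pair: (-pair[1], pair[0]))


docs = ["banana", "bandana", "cabana", "apple"]
print(top_k_frequent_documents(docs, "an", 2))
```

Each query here costs time proportional to the total collection length D, which is exactly the cost that a precomputed index is meant to avoid.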

A related problem was studied in [9], where the emphasis was on retrieving all documents in which the number of occurrences of the pattern P exceeds some frequency threshold f. However, from the information retrieval point of view, it is hard for a user to specify such a threshold value f and to anticipate how many documents will be returned. We develop additional building blocks that help the user overcome this limitation. These are used to derive an efficient index for the top-k frequent document retrieval problem, answering queries in O(|P| + log D log log D + k) time and taking O(D log D) space. Our approach is based on a novel use of the suffix tree, called the induced generalized suffix tree (IGST).
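The contrast between a threshold query and a top-k query can be sketched over a simple sorted-suffix index. This is an illustrative stand-in, not the IGST: it stores every suffix explicitly (quadratic space in the worst case) and its query time grows with the number of occurrences of P, a dependence that does not appear in the bound stated above. All names are hypothetical, and the inclusive threshold (at least f occurrences) is an assumption of the sketch.

```python
from collections import Counter


def build_index(documents):
    """Sort all suffixes of all documents, remembering each suffix's document id."""
    entries = sorted(
        (doc[i:], doc_id)
        for doc_id, doc in enumerate(documents)
        for i in range(len(doc))
    )
    suffixes = [suffix for suffix, _ in entries]
    doc_ids = [doc_id for _, doc_id in entries]
    return suffixes, doc_ids


def suffix_range(suffixes, pattern):
    """Binary search for the half-open range of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffixes)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if suffixes[mid][:m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffixes)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if suffixes[mid][:m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def document_frequencies(index, pattern):
    """Map doc_id -> number of occurrences of pattern in that document."""
    suffixes, doc_ids = index
    start, end = suffix_range(suffixes, pattern)
    return Counter(doc_ids[start:end])


def threshold_query(index, pattern, f):
    """Documents with at least f occurrences; the output size is hard to predict."""
    freqs = document_frequencies(index, pattern)
    return sorted(d for d, c in freqs.items() if c >= f)


def top_k_query(index, pattern, k):
    """At most k (doc_id, frequency) pairs, most frequent first."""
    freqs = document_frequencies(index, pattern)
    return sorted(freqs.items(), key=lambda dc: (-dc[1], dc[0]))[:k]


index = build_index(["banana", "bandana", "cabana", "apple"])
print(threshold_query(index, "an", 2))
print(top_k_query(index, "an", 1))
```

With the threshold form, the caller must guess f and cannot bound the result size in advance; with the top-k form, the result size is capped by k regardless of how the frequencies are distributed.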

This work is supported in part by Taiwan NSC Grant 96-2221-E-007-082-MY3 (W. Hon) and US NSF Grant CCF–0621457 (R. Shah).

References

  1. Bialynicka-Birula, I., Grossi, R.: Rank-Sensitive Data Structures. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 79–90. Springer, Heidelberg (2005)

  2. Boyer, R.S., Moore, J.S.: A Fast String Searching Algorithm. Communications of the ACM 20(10), 762–772 (1977)

  3. Hon, W.K., Shah, R., Vitter, J.S.: Ordered Pattern Matching: Towards Full-Text Retrieval. Technical Report TR-06-008, Department of CS, Purdue University (2006)

  4. Karp, R.M., Rabin, M.O.: Efficient Randomized Pattern-Matching Algorithms. Technical Report TR-31-81, Aiken Computational Laboratory, Harvard University (1981)

  5. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast Pattern Matching in Strings. SIAM Journal on Computing 6(2), 323–350 (1977)

  6. Mäkinen, V., Navarro, G.: Position-Restricted Substring Searching. In: Correa, J.R., Hevia, A., Kiwi, M. (eds.) LATIN 2006. LNCS, vol. 3887, pp. 703–714. Springer, Heidelberg (2006)

  7. Matias, Y., Muthukrishnan, S., Sahinalp, S.C., Ziv, J.: Augmenting Suffix Trees, with Applications. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461, pp. 67–78. Springer, Heidelberg (1998)

  8. McCreight, E.M.: A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM 23(2), 262–272 (1976)

  9. Muthukrishnan, S.: Efficient Algorithms for Document Retrieval Problems. In: Proceedings of Symposium on Discrete Algorithms, pp. 657–666 (2002)

  10. Sadakane, K.: Succinct representations of lcp information and improvements in the compressed suffix arrays. In: Proceedings of Symposium on Discrete Algorithms, pp. 225–232 (2002)

  11. Välimäki, N., Mäkinen, V.: Space-Efficient Algorithms for Document Retrieval. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 205–215. Springer, Heidelberg (2007)

  12. Weiner, P.: Linear Pattern Matching Algorithms. In: Proceedings of Symposium on Switching and Automata Theory, pp. 1–11 (1973)

  13. Willard, D.E.: Log-Logarithmic Worst-Case Range Queries are Possible in Space Θ(N). Information Processing Letters 17(2), 81–84 (1983)

  14. Witten, I., Moffat, A., Bell, T.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos (1999)

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hon, WK., Shah, R., Wu, SB. (2009). Efficient Index for Retrieving Top-k Most Frequent Documents. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds) String Processing and Information Retrieval. SPIRE 2009. Lecture Notes in Computer Science, vol 5721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03784-9_18

  • DOI: https://doi.org/10.1007/978-3-642-03784-9_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03783-2

  • Online ISBN: 978-3-642-03784-9

  • eBook Packages: Computer Science, Computer Science (R0)
