Efficient Index for Retrieving Top-k Most Frequent Documents

Hon, Wing-Kai; Shah, Rahul; Wu, Shih-Bin

doi:10.1007/978-3-642-03784-9_18

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5721)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

In the document retrieval problem [9], we are given a collection of documents (strings) of total length D in advance, and the goal is to build an index over these documents such that, for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension of this problem, which we call top-k frequent document retrieval: instead of listing all documents containing P, the focus is on identifying the k documents with the most occurrences of P. This problem forms a basis for search engine tasks that retrieve documents ranked by the TF-IDF metric.
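To make the query semantics concrete, the following is a minimal brute-force sketch of what a top-k frequent document query returns. It is not the index proposed in the paper: it scans every document at query time, and it only serves as a reference definition of the expected output. The function names, the choice to count overlapping occurrences, and the tie-breaking by document id are all assumptions made for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_frequent_documents(documents, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Documents that do not contain the pattern are never reported.
    Ties are broken by smaller doc_id (an illustrative assumption).
    """
    matches = []
    for doc_id, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            matches.append((freq, doc_id))
    best = heapq.nsmallest(k, matches, key=lambda t: (-t[0], t[1]))
    return [(doc_id, freq) for freq, doc_id in best]


documents = ["banana bandana", "ananas", "cabana", "grape"]
print(top_k_frequent_documents(documents, "ana", 2))  # [(0, 3), (1, 2)]
```

This baseline spends time proportional to the total length D on every query, which is the cost an index for the problem is meant to avoid.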

A related problem was studied in [9], where the emphasis was on retrieving all documents in which the number of occurrences of the pattern P exceeds some frequency threshold f. However, from the information retrieval point of view, it is hard for a user to specify such a threshold value f and to anticipate how many documents will be output. We develop some additional building blocks that help the user overcome this limitation. These are used to derive an efficient index for the top-k frequent document retrieval problem, answering queries in O(|P| + log D log log D + k) time and taking O(D log D) space. Our approach is based on a novel use of the suffix tree, called the induced generalized suffix tree (IGST).
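The difference between the two query types can be illustrated with a small brute-force sketch. It does not use a suffix tree or the IGST; it only contrasts the outputs. With a threshold query, the output size depends on a value of f that the user must guess, whereas a top-k query returns at most k documents by construction. All names here are hypothetical, and "exceeds" is read as strictly greater than f, which is an assumption.

```python
def pattern_frequencies(documents, pattern):
    """Map doc_id -> number of (possibly overlapping) occurrences of pattern."""
    if not pattern:
        return {}
    freqs = {}
    for doc_id, doc in enumerate(documents):
        count = sum(
            1
            for i in range(len(doc) - len(pattern) + 1)
            if doc.startswith(pattern, i)
        )
        if count:
            freqs[doc_id] = count
    return freqs


def threshold_query(freqs, f):
    """Documents whose frequency exceeds f (assumed to mean strictly greater)."""
    return sorted(d for d, c in freqs.items() if c > f)


def top_k_query(freqs, k):
    """The k most frequent documents; ties broken by doc_id (an assumption)."""
    return sorted(freqs, key=lambda d: (-freqs[d], d))[:k]


documents = ["abababab", "abab", "ab", "xyz", "abxab"]
freqs = pattern_frequencies(documents, "ab")

# The size of the threshold answer varies with f and is hard to predict.
for f in (0, 1, 2, 4):
    print(f, threshold_query(freqs, f))

# The top-k answer has a size the user controls directly.
print(top_k_query(freqs, 2))
```

On this toy collection the threshold answer shrinks from four documents to none as f grows from 0 to 4, while the top-2 answer always has two entries.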

This work is supported in part by Taiwan NSC Grant 96-2221-E-007-082-MY3 (W. Hon) and US NSF Grant CCF–0621457 (R. Shah).

References

1. Bialynicka-Birula, I., Grossi, R.: Rank-Sensitive Data Structures. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 79–90. Springer, Heidelberg (2005)
2. Boyer, R.S., Moore, J.S.: A Fast String Searching Algorithm. Communications of the ACM 20(10), 762–772 (1977)
3. Hon, W.K., Shah, R., Vitter, J.S.: Ordered Pattern Matching: Towards Full-Text Retrieval. Technical Report TR-06-008, Department of CS, Purdue University (2006)
4. Karp, R.M., Rabin, M.O.: Efficient Randomized Pattern-Matching Algorithms. Technical Report TR-31-81, Aiken Computation Laboratory, Harvard University (1981)
5. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast Pattern Matching in Strings. SIAM Journal on Computing 6(2), 323–350 (1977)
6. Mäkinen, V., Navarro, G.: Position-Restricted Substring Searching. In: Correa, J.R., Hevia, A., Kiwi, M. (eds.) LATIN 2006. LNCS, vol. 3887, pp. 703–714. Springer, Heidelberg (2006)
7. Matias, Y., Muthukrishnan, S., Sahinalp, S.C., Ziv, J.: Augmenting Suffix Trees, with Applications. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461, pp. 67–78. Springer, Heidelberg (1998)
8. McCreight, E.M.: A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM 23(2), 262–272 (1976)
9. Muthukrishnan, S.: Efficient Algorithms for Document Retrieval Problems. In: Proceedings of Symposium on Discrete Algorithms, pp. 657–666 (2002)
10. Sadakane, K.: Succinct Representations of lcp Information and Improvements in the Compressed Suffix Arrays. In: Proceedings of Symposium on Discrete Algorithms, pp. 225–232 (2002)
11. Välimäki, N., Mäkinen, V.: Space-Efficient Algorithms for Document Retrieval. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 205–215. Springer, Heidelberg (2007)
12. Weiner, P.: Linear Pattern Matching Algorithms. In: Proceedings of Symposium on Switching and Automata Theory, pp. 1–11 (1973)
13. Willard, D.E.: Log-Logarithmic Worst-Case Range Queries are Possible in Space Θ(N). Information Processing Letters 17(2), 81–84 (1983)
14. Witten, I., Moffat, A., Bell, T.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos (1999)

Author information

Authors and Affiliations

Department of Computer Science, National Tsing Hua University, Taiwan
Wing-Kai Hon & Shih-Bin Wu
Department of Computer Science, Louisiana State University, LA, USA
Rahul Shah

Editor information

Editors and Affiliations

Swedish Institute of Computer Science, Kista, Sweden
Jussi Karlgren
Department of Computer Science and Engineering, Helsinki University of Technology, P.O. Box 5400, 02015 HUT, Espoo, Finland
Jorma Tarhio
Department of Computer Sciences, University of Tampere, Tampere, Finland
Heikki Hyyrö

About this paper

Cite this paper

Hon, WK., Shah, R., Wu, SB. (2009). Efficient Index for Retrieving Top-k Most Frequent Documents. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds) String Processing and Information Retrieval. SPIRE 2009. Lecture Notes in Computer Science, vol 5721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03784-9_18

DOI: https://doi.org/10.1007/978-3-642-03784-9_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03783-2
Online ISBN: 978-3-642-03784-9
eBook Packages: Computer Science, Computer Science (R0)
