Space-Efficient Frameworks for Top-k String Retrieval

Published: 24 April 2014

Abstract

The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents in which that word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an arbitrary string (a partial word, a multiword phrase, or more generally any sequence of characters), then word boundaries are no longer relevant and we need a different approach. In string retrieval settings, we are given a set 𝒟 = {d_1, d_2, d_3, …, d_D} of D strings with n characters in total, taken from an alphabet Σ = [σ], and the task of the search engine, for a given query pattern P of length p, is to report the “most relevant” strings in 𝒟 containing P. The query may also consist of two or more patterns. The notion of relevance can be captured by a function score(P, d_r), which indicates how relevant document d_r is to the pattern P. Some example score functions are the frequency of pattern occurrences, the proximity between pattern occurrences, or the pattern-independent PageRank of the document.
The first formal framework to study such retrieval problems was given by Muthukrishnan [SODA 2002]. He considered two metrics for relevance: frequency and proximity. He took a threshold-based approach on these metrics and gave data structures that use O(n log n) words of space. We study this problem in a somewhat more natural top-k framework. Here, k is a part of the query, and the top k most relevant (highest-scoring) documents are to be reported in sorted order of score. We present the first linear-space framework (i.e., using O(n) words of space) that is capable of handling arbitrary score functions with near-optimal O(p + k log k) query time. The query time can be made an optimal O(p + k) if sorted order is not necessary. Further, we derive compact-space and succinct-space indexes (for some specific score functions). This space compression comes at the cost of higher query time. Finally, we extend our framework to handle the case of multiple patterns. Apart from providing a robust framework, our results also improve many earlier results in index space or query time or both.
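
To make the query semantics concrete, the following is a minimal brute-force sketch of a top-k query, not the index described in the abstract. It scans every document on each query, so its cost grows with the total text length n rather than with p and k. The names count_occurrences and top_k, the 0-based document numbering, and the tie-breaking rule (lower document number first) are assumptions made for illustration only. The default score is the frequency of pattern occurrences; any other function score(P, d_r) could be passed in.

```python
import heapq
from typing import Callable, List, Tuple


def count_occurrences(pattern: str, doc: str) -> int:
    """Frequency score: number of (possibly overlapping) occurrences of pattern in doc."""
    if not pattern:
        return 0
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping occurrences are counted.
        start = doc.find(pattern, start + 1)
    return count


def top_k(
    docs: List[str],
    pattern: str,
    k: int,
    score: Callable[[str, str], float] = count_occurrences,
) -> List[Tuple[int, float]]:
    """Return up to k (document number, score) pairs, highest score first.

    Only documents that contain the pattern are candidates. Ties are broken
    by lower document number (an illustrative choice, not from the paper).
    """
    candidates = [
        (score(pattern, doc), r)
        for r, doc in enumerate(docs)
        if pattern in doc
    ]
    # nlargest returns its result sorted from largest to smallest key.
    best = heapq.nlargest(k, candidates, key=lambda t: (t[0], -t[1]))
    return [(r, s) for s, r in best]


if __name__ == "__main__":
    collection = ["banana", "bandana", "cabana", "apple"]
    print(top_k(collection, "ana", 2))
```

Because heapq.nlargest returns its results in decreasing order of key, this sketch corresponds to the sorted variant of the query; dropping the ordering requirement is what the abstract refers to when it states the O(p + k) bound.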

References

[1]
Alok Aggarwal and Jeffrey Scott Vitter. 1988. The input/output complexity of sorting and related problems. Commun. ACM 31, 9, 1116--1127.
[2]
D. Belazzougui and G. Navarro. 2011. Improved compressed indexes for full-text document retrieval. In Proceedings of the International Symposium on String Processing and Information Retrieval. 386--397.
[3]
Djamal Belazzougui, Gonzalo Navarro, and Daniel Valenzuela. 2013. Improved compressed indexes for full-text document retrieval. J. Discr. Algor. 18, 3--13.
[4]
M. A. Bender and M. Farach-Colton. 2000. The LCA problem revisited. In Proceedings of the Latin American Symposium on Theoretical Informatics. 88--94.
[5]
Manuel Blum, Robert W. Floyd, Vaughan R. Pratt, Ronald L. Rivest, and Robert Endre Tarjan. 1973. Time bounds for selection. J. Comput. System Sci. 7, 4, 448--461.
[6]
M. R. Brown and R. E. Tarjan. 1979. A fast merging algorithm. J. ACM 26, 2, 211--226.
[7]
B. Chazelle. 1988. A functional approach to data structures and its use in multidimensional searching. SIAM J. Comput. 17, 3, 427--462.
[8]
Yu Feng Chien, Wing Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2013. Geometric BWT: Compressed text indexing via sparse suffixes and range searching. Algorithmica. To appear.
[9]
D. R. Clark. 1996. Compact Pat Trees. Ph.D. Dissertation, University of Waterloo.
[10]
H. Cohen and E. Porat. 2010. Fast set intersection and two-patterns matching. Theoret. Comput. Sci. 411, 40--42, 3795--3800.
[11]
Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. 2004. Dictionary matching and indexing with errors and don't cares. In Proceedings of the Symposium on Theory of Computing. 91--100.
[12]
J. S. Culpepper, G. Navarro, S. J. Puglisi, and A. Turpin. 2010. Top-k ranked document search in general text databases. In Proceedings of the European Symposium on Algorithms. 194--205.
[13]
J. Shane Culpepper, Matthias Petri, and Falk Scholer. 2012. Efficient in-memory top-k document retrieval. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval. 225--234.
[14]
Martin Farach. 1997. Optimal suffix tree construction with large alphabets. In Proceedings of the Symposium on Foundations of Computer Science. 137--143.
[15]
P. Ferragina and G. Manzini. 2005. Indexing compressed text. J. ACM 52, 4, 552--581.
[16]
P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. 2007. Compressed representations of sequences and full-text indexes. ACM Trans. Algor. 3, 2.
[17]
Johannes Fischer, Travis Gagie, Tsvi Kopelowitz, Moshe Lewenstein, Veli Mäkinen, Leena Salmela, and Niko Välimäki. 2012. Forbidden patterns. In Proceedings of the Latin American Symposium on Theoretical Informatics. 327--337.
[18]
J. Fischer and V. Heun. 2007. A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In Proceedings of the Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies. 459--470.
[19]
Johannes Fischer and Volker Heun. 2011. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40, 2, 465--492.
[20]
G. N. Frederickson. 1993. An optimal algorithm for selection in a min-heap. Inf. Comput. 104, 2, 197--214.
[21]
Michael L. Fredman, János Komlós, and Endre Szemerédi. 1984. Storing a sparse table with O(1) worst case access time. J. ACM 31, 3, 538--544.
[22]
Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. 1999. Cache-oblivious algorithms. In Proceedings of the Symposium on Foundations of Computer Science. 285--298.
[23]
Travis Gagie, Kalle Karhu, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén. 2013. Document listing on repetitive collections. In Proceedings of the Symposium on Combinatorial Pattern Matching. 107--119.
[24]
T. Gagie, G. Navarro, and S. J. Puglisi. 2010. Colored range queries and document retrieval. In Proceedings of the International Symposium on String Processing and Information Retrieval. 67--81.
[25]
Travis Gagie, Gonzalo Navarro, and Simon J. Puglisi. 2012. New algorithms on wavelet trees and applications to information retrieval. Theoret. Comput. Sci. 426, 25--41.
[26]
Travis Gagie, Simon J. Puglisi, and Andrew Turpin. 2009. Range quantile queries: Another virtue of wavelet trees. In Proceedings of the International Symposium on String Processing and Information Retrieval. 1--6.
[27]
Alexander Golynski, J. Ian Munro, and S. Srinivasa Rao. 2006. Rank/select operations on large alphabets: A tool for text indexing. In Proceedings of the Symposium on Discrete Algorithms. 368--373.
[28]
R. Grossi, A. Gupta, and J. S. Vitter. 2003. High-order entropy-compressed text indexes. In Proceedings of the Symposium on Discrete Algorithms. 841--850.
[29]
R. Grossi and J. S. Vitter. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407.
[30]
Wing-Kai Hon, Manish Patil, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2013a. Indexes for document retrieval with relevance. In Space-Efficient Data Structures, Streams, and Algorithms. 351--362.
[31]
Wing Kai Hon, Manish Patil, Rahul Shah, and Shih Bin Wu. 2010a. Efficient index for retrieving top-k most frequent documents. J. Disc. Algor. 8, 4, 402--417.
[32]
Wing Kai Hon, Rahul Shah, and Sharma V. Thankachan. 2012. Towards an optimal space-and-query-time index for top-k document retrieval. In Proceedings of the Symposium on Combinatorial Pattern Matching. 173--184.
[33]
Wing Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2010b. String retrieval for multi-pattern queries. In Proceedings of the International Symposium on String Processing and Information Retrieval. 55--66.
[34]
Wing Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2012c. On position restricted substring searching in succinct space. J. Disc. Algor. 17, 109--114.
[35]
Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2012c. Document listing for queries with excluded pattern. In Proceedings of the Symposium on Combinatorial Pattern Matching. 185--195.
[36]
Wing Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. 2009. Space-efficient framework for top-k string retrieval problems. In Proceedings of the Symposium on Foundations of Computer Science. 713--722.
[37]
Wing Kai Hon, Sharma V. Thankachan, Rahul Shah, and Jeffrey Scott Vitter. 2013. Faster compressed top-k document retrieval. In Proceedings of the Data Compression Conference. 341--350.
[38]
Bo-June (Paul) Hsu and Giuseppe Ottaviano. 2013. Space-efficient data structures for top-k completion. In Proceedings of the International Conference on World Wide Web. 583--594.
[39]
M. Karpinski and Y. Nekrich. 2011. Top-K color queries for document retrieval. In Proceedings of the Symposium on Discrete Algorithms. 401--411.
[40]
D. E. Knuth, J. H. Morris, and V. R. Pratt. 1977. Fast pattern matching in strings. SIAM J. Comput. 6, 2, 323--350.
[41]
Roberto Konow and Gonzalo Navarro. 2013. Faster compact top-k document retrieval. In Proceedings of the Data Compression Conference. 351--360.
[42]
U. Manber and G. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.
[43]
Y. Matias, S. Muthukrishnan, S. C. Sahinalp, and J. Ziv. 1998. Augmenting suffix trees, with applications. In Proceedings of the European Symposium on Algorithms. 67--78.
[44]
E. M. McCreight. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272.
[45]
J. I. Munro, V. Raman, and S. S. Rao. 2001. Space efficient suffix trees. J. Algor. 39, 2, 205--222.
[46]
S. Muthukrishnan. 2002. Efficient algorithms for document retrieval problems. In Proceedings of the Symposium on Discrete Algorithms. 657--666.
[47]
Gonzalo Navarro. 2013. Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences. CoRR abs/1304.6023.
[48]
G. Navarro and V. Mäkinen. 2007. Compressed full-text indexes. Comput. Surveys 39, 1.
[49]
Gonzalo Navarro and Yakov Nekrich. 2012a. Sorted range reporting. In Proceedings of the 13th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT). 271--282.
[50]
G. Navarro and Y. Nekrich. 2012b. Top-k document retrieval in optimal time and linear space. In Proceedings of the Symposium on Discrete Algorithms. 1066--1077.
[51]
G. Navarro, S. J. Puglisi, and D. Valenzuela. 2011. Practical compressed document retrieval. In Proceedings of the Symposium on Experimental Algorithms. 193--205.
[52]
Gonzalo Navarro and Sharma V. Thankachan. 2013. Faster top-k document retrieval in optimal space. In Proceedings of the International Symposium on String Processing and Information Retrieval. To appear.
[53]
L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab.
[54]
Rasmus Pagh. 2001. Low redundancy in static dictionaries with constant query time. SIAM J. Comput. 31, 2, 353--363.
[55]
M. Patil, S. V. Thankachan, R. Shah, W. K. Hon, J. S. Vitter, and S. Chandrasekaran. 2011. Inverted indexes for phrases and strings. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval. 555--564.
[56]
Mihai Patrascu. 2008. Succincter. In Proceedings of the Symposium on Foundations of Computer Science. 305--313.
[57]
R. Raman, V. Raman, and S. S. Rao. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algor. 3, 4.
[58]
K. Sadakane. 2007a. Compressed suffix trees with full functionality. Theory of Computing Systems, 589--607.
[59]
K. Sadakane. 2007b. Succinct data structures for flexible text retrieval systems. J. Disc. Algor. 5, 1, 12--22.
[60]
K. Sadakane and G. Navarro. 2010. Fully-functional succinct trees. In Proceedings of the Symposium on Discrete Algorithms. 134--149.
[61]
R. Shah and M. Farach-Colton. 2002. Undiscretized dynamic programming: Faster algorithms for facility location and related problems on trees. In Proceedings of the Symposium on Discrete Algorithms. 108--115.
[62]
Rahul Shah, Cheng Sheng, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2013. Top-k document retrieval in external memory. In Proceedings of the European Symposium on Algorithms. 803--814.
[63]
Dekel Tsur. 2013. Top-k document retrieval in optimal space. Inf. Process. Lett. 113, 12, 440--443.
[64]
N. Välimäki and V. Mäkinen. 2007. Space-efficient algorithms for document retrieval. In Proceedings of the Symposium on Combinatorial Pattern Matching. 205--215.
[65]
Jeffrey Scott Vitter. 2008. Algorithms and data structures for external memory. Found. Trends Theoret. Comput. Sci. 2, 4, 305--474.
[66]
P. Weiner. 1973. Linear pattern matching algorithms. In Proceedings of the Symposium on Switching and Automata Theory. 1--11.
[67]
Dan E. Willard. 1983. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett. 17, 2, 81--84.
[68]
I. Witten, A. Moffat, and T. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan-Kaufmann, San Francisco, CA.

Published In

Journal of the ACM  Volume 61, Issue 2
April 2014
206 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/2605175

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 April 2014
Accepted: 01 November 2013
Revised: 01 September 2013
Received: 01 June 2012
Published in JACM Volume 61, Issue 2

Author Tags

  1. String matching
  2. document retrieval
  3. top-k queries

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2023) Ranked Document Retrieval in External Memory. ACM Transactions on Algorithms 19:1 (1-12). DOI: 10.1145/3559763. Online publication date: 9-Mar-2023
  • (2022) Generic Techniques for Building Top-k Structures. ACM Transactions on Algorithms 18:4 (1-23). DOI: 10.1145/3546074. Online publication date: 10-Oct-2022
  • (2022) String indexing for top-k close consecutive occurrences. Theoretical Computer Science 927 (133-147). DOI: 10.1016/j.tcs.2022.06.004. Online publication date: Aug-2022
  • (2022) Faster repetition-aware compressed suffix trees based on Block Trees. Information and Computation 285:PB. DOI: 10.1016/j.ic.2021.104749. Online publication date: 1-May-2022
  • (2022) Gapped Indexing for Consecutive Occurrences. Algorithmica 85:4 (879-901). DOI: 10.1007/s00453-022-01051-6. Online publication date: 20-Oct-2022
  • (2021) A framework for designing space-efficient dictionaries for parameterized and order-preserving matching. Theoretical Computer Science 854 (52-62). DOI: 10.1016/j.tcs.2020.11.036. Online publication date: Jan-2021
  • (2020) Ranked document selection. Theoretical Computer Science 812 (149-159). DOI: 10.1016/j.tcs.2019.10.008. Online publication date: Apr-2020
  • (2019) A Guide to Designing Top-k Indexes. ACM SIGMOD Record 48:2 (6-17). DOI: 10.1145/3377330.3377332. Online publication date: 19-Dec-2019
  • (2019) Lempel-Ziv Compressed Structures for Document Retrieval. Information and Computation. DOI: 10.1016/j.ic.2019.01.006. Online publication date: Jan-2019
  • (2019) Faster Repetition-Aware Compressed Suffix Trees Based on Block Trees. String Processing and Information Retrieval (434-451). DOI: 10.1007/978-3-030-32686-9_31. Online publication date: 7-Oct-2019
