Abstract
Given a collection \(\mathcal D\) of string documents \(\{d_1,d_2,\ldots,d_{|\mathcal D|}\}\) of total length n, which may be preprocessed, a fundamental task is to retrieve the documents most relevant to a given query. The query consists of a set of m patterns \(\{P_1, P_2, \ldots, P_m\}\). To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter K), so that the task is to report all documents that match the query with score greater than the threshold (or, respectively, the K documents with the highest scores).
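To make the query semantics concrete, the following is a minimal brute-force sketch, not the data structure of this paper. It assumes one possible reading of the occurrence-based score: a document matches only if every pattern occurs in it, and its score is the total number of (possibly overlapping) occurrences of all patterns. The function names `count_occurrences`, `score`, `report_above_threshold`, and `report_top_k` are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def score(doc: str, patterns: List[str]) -> int:
    """Assumed score: total occurrences, or 0 if some pattern is absent."""
    counts = [count_occurrences(doc, p) for p in patterns]
    return sum(counts) if all(c > 0 for c in counts) else 0


def report_above_threshold(
    docs: List[str], patterns: List[str], threshold: int
) -> List[Tuple[int, int]]:
    """Return (document index, score) for every score above the threshold."""
    scored = [(i, score(d, patterns)) for i, d in enumerate(docs)]
    return [(i, s) for i, s in scored if s > threshold]


def report_top_k(
    docs: List[str], patterns: List[str], k: int
) -> List[Tuple[int, int]]:
    """Return the k matching documents with the highest scores."""
    scored = [(i, score(d, patterns)) for i, d in enumerate(docs)]
    matching = [(i, s) for i, s in scored if s > 0]
    # Sort by descending score; break ties by document index.
    return sorted(matching, key=lambda pair: (-pair[1], pair[0]))[:k]


docs = ["abracadabra", "banana", "cabana", "abcab"]
patterns = ["ab", "a"]
above = report_above_threshold(docs, patterns, 3)
top2 = report_top_k(docs, patterns, 2)
```

This baseline scans every document for every pattern, so its query time grows with the total length n regardless of the output size; it serves only as a reference point for what the indexed solutions must compute.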
When the documents are strings (without word boundaries), traditional inverted-index-based solutions may not be applicable. The single-pattern retrieval case has been well solved by [14,9]. For two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], which takes \(\tilde{O}(n^{3/2})\) space. In this paper, we give the first linear-space (and partly succinct) data structures, which answer multi-pattern queries in \(O(\sum |P_i|) + \tilde {O}(t^{1/m} n^{1-1/m})\) time, where t is the number of output occurrences. In the particular case of two patterns, we achieve the bound of \(O(|P_1| + |P_2| + \sqrt{nt}\log^2 n)\). We also show space-time trade-offs for our data structures. Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest.
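As a quick consistency check, the two-pattern bound can be read off from the general one by setting m = 2. The short derivation below is only a sketch of that substitution; it assumes that the polylogarithmic factor hidden by the \(\tilde{O}\) notation is the \(\log^2 n\) stated for two patterns.

```latex
\[
  t^{1/m}\, n^{1-1/m} \Big|_{m=2}
  \;=\; t^{1/2}\, n^{1/2}
  \;=\; \sqrt{nt},
\]
\[
  O\Big(\sum_{i=1}^{2} |P_i|\Big) + \tilde{O}\big(\sqrt{nt}\big)
  \;=\; O\big(|P_1| + |P_2| + \sqrt{nt}\,\log^2 n\big).
\]
```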
This work is supported in part by Taiwan NSC Grant 96-2221-E-007-082 and US NSF Grants CCF-1017623 and CCF-0621457.
References
1. Bender, M.A., Farach-Colton, M.: The LCA Problem Revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000)
2. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30(1-7), 107–117 (1998)
3. Cohen, H., Porat, E.: Fast Set Intersection and Two Patterns Matching. In: LATIN (2010)
4. Ferragina, P., Giancarlo, R., Manzini, G.: The Myriad Virtues of Wavelet Trees. Inf. and Comp. 207(8), 849–866 (2009)
5. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3(2) (2007)
6. Grossi, R., Gupta, A., Vitter, J.S.: High-Order Entropy-Compressed Text Indexes. In: SODA, pp. 841–850 (2003)
7. Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SICOMP 35(2), 378–407 (2005)
8. Hon, W.K., Shah, R., Vitter, J.S.: Ordered Pattern Matching: Towards Full-Text Retrieval. Tech Report TR-06-008, Dept. of CS, Purdue University (2006)
9. Hon, W.K., Shah, R., Vitter, J.S.: Space-Efficient Framework for Top-k String Retrieval Problems. In: FOCS, pp. 713–722 (2009)
10. Mäkinen, V., Navarro, G.: Rank and Select Revisited and Extended. TCS 387(3), 332–347 (2007)
11. Manber, U., Myers, G.: Suffix Arrays: A New Method for On-Line String Searches. SICOMP 22(5), 935–948 (1993)
12. Matias, Y., Muthukrishnan, S., Sahinalp, S.C., Ziv, J.: Augmenting Suffix Trees, with Applications. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461, pp. 67–78. Springer, Heidelberg (1998)
13. Munro, J.I., Raman, V.: Succinct Representation of Balanced Parentheses and Static Trees. SICOMP 31(3), 762–776 (2001)
14. Muthukrishnan, S.: Efficient Algorithms for Document Retrieval Problems. In: SODA, pp. 657–666 (2002)
15. Raman, R., Raman, V., Rao, S.S.: Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets. TALG 3(4) (2007)
16. Sadakane, K.: Compressed Suffix Trees with Full Functionality. TCS, 589–607 (2007)
17. Sadakane, K.: Succinct Data Structures for Flexible Text Retrieval Systems. JDA 5(1), 12–22 (2007)
18. Välimäki, N., Mäkinen, V.: Space-Efficient Algorithms for Document Retrieval. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 205–215. Springer, Heidelberg (2007)
19. Weiner, P.: Linear Pattern Matching Algorithms. In: Proc. Switching and Automata Theory, pp. 1–11 (1973)
20. Wu, S.B., Hon, W.K., Shah, R.: Efficient Index for Retrieving Top-k Most Frequent Documents. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 182–193. Springer, Heidelberg (2009)
21. Yu, C.C., Hon, W.K., Wang, B.F.: Efficient Data Structures for the Orthogonal Range Successor Problem. In: Ngo, H.Q. (ed.) COCOON 2009. LNCS, vol. 5609, pp. 96–105. Springer, Heidelberg (2009)
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hon, WK., Shah, R., Thankachan, S.V., Vitter, J.S. (2010). String Retrieval for Multi-pattern Queries. In: Chavez, E., Lonardi, S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16321-0_6
DOI: https://doi.org/10.1007/978-3-642-16321-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16320-3
Online ISBN: 978-3-642-16321-0
eBook Packages: Computer Science, Computer Science (R0)