Abstract
Let \(\mathcal{D}=\{\mathsf{T}_1,\mathsf{T}_2,\dots,\mathsf{T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total. Given two patterns \(P\) and \(Q\), the forbidden pattern document listing problem asks for the set \(\mathcal{D}' \subseteq \mathcal{D}\) of documents that contain \(P\) but not \(Q\). The \(\mathsf{top\text{-}}k\) forbidden pattern query \((P,Q,k)\) asks for the \(k\) documents in \(\mathcal{D}'\) that are most relevant to \(P\). For typical relevance functions (such as document importance, term frequency, and term proximity), we present a linear-space index with worst-case query time \(O(|P|+|Q|+\sqrt{nk})\) for the \(\mathsf{top\text{-}}k\) problem. As a corollary, we obtain a linear-space solution with \(O(|P|+|Q|+\sqrt{nt})\) query time for the document listing problem, where \(t\) is the number of documents reported. We conjecture that any significant improvement over the results in this paper is highly unlikely.
This research is funded in part by the National Science Foundation (NSF) under Grants CCF–1017623 and CCF–1218904.
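To make the query semantics in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the index presented in the paper: it scans every document on each query and is intended only as a reference definition of what a \((P,Q,k)\) query returns. The function name `top_k_forbidden`, the choice of term frequency as the relevance function, and the tie-breaking by document index are all assumptions made for illustration; both patterns are assumed to be non-empty.

```python
from typing import List


def top_k_forbidden(docs: List[str], p: str, q: str, k: int) -> List[int]:
    """Return indices of up to k documents that contain p but not q,
    ranked by the term frequency of p (an assumed relevance function).

    Brute-force reference only: every document is scanned per query.
    """

    def term_frequency(text: str, pattern: str) -> int:
        # Count occurrences of pattern in text, allowing overlaps.
        count = 0
        start = text.find(pattern)
        while start != -1:
            count += 1
            start = text.find(pattern, start + 1)
        return count

    # D' from the abstract: documents containing p but not q.
    candidates = [
        (term_frequency(text, p), i)
        for i, text in enumerate(docs)
        if p in text and q not in text
    ]
    # Highest term frequency first; ties broken by smaller document index
    # (the tie-breaking rule is an assumption, not taken from the paper).
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [i for _, i in candidates[:k]]


if __name__ == "__main__":
    docs = ["abab", "ab", "abcab", "xyz", "ababab"]
    print(top_k_forbidden(docs, "ab", "c", 2))
    # Setting k to the number of documents lists all of D',
    # which corresponds to the document listing problem.
    print(top_k_forbidden(docs, "ab", "c", len(docs)))
```

Setting `k` to the number of documents turns the sketch into the listing query, mirroring how the abstract derives the listing bound from the \(\mathsf{top\text{-}}k\) result by replacing \(k\) with \(t\).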
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Biswas, S., Ganguly, A., Shah, R., Thankachan, S.V. (2015). Ranked Document Retrieval with Forbidden Pattern. In: Cicalese, F., Porat, E., Vaccaro, U. (eds) Combinatorial Pattern Matching. CPM 2015. Lecture Notes in Computer Science, vol 9133. Springer, Cham. https://doi.org/10.1007/978-3-319-19929-0_7
DOI: https://doi.org/10.1007/978-3-319-19929-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19928-3
Online ISBN: 978-3-319-19929-0
eBook Packages: Computer Science (R0)