Top-k Ranked Document Search in General Text Databases

Culpepper, J. Shane; Navarro, Gonzalo; Puglisi, Simon J.; Turpin, Andrew

doi:10.1007/978-3-642-15781-3_17

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6347)

Included in the following conference series:

European Symposium on Algorithms

Abstract

Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words, and traditional indexing approaches are not easily adapted or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions about the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with the new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.
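
To make the problem statement concrete, the following is a minimal Python sketch of word-free top-k retrieval. It is a naive baseline, not either of the two algorithms presented in the paper. It assumes that similarity is simply the number of occurrences of the query pattern in each document, and that the separator character never appears in a document. The names `build_index`, `pattern_range`, and `top_k` are hypothetical and chosen for illustration only.

```python
import heapq
from collections import Counter

SEP = "\x00"  # assumption: this character never occurs in any document


def build_index(docs):
    """Concatenate the documents and build a suffix array plus a document array."""
    text = SEP.join(docs) + SEP
    # doc_of[i] = number of the document that owns text position i
    doc_of = []
    for d, doc in enumerate(docs):
        doc_of.extend([d] * (len(doc) + 1))
    # Naive suffix sorting; acceptable for a toy example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of


def pattern_range(text, sa, pattern):
    """Return [lo, hi): the suffix-array interval of suffixes starting with pattern."""
    m = len(pattern)

    def boundary(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return boundary(False), boundary(True)


def top_k(index, pattern, k):
    """Return up to k (document, frequency) pairs, highest frequency first."""
    text, sa, doc_of = index
    lo, hi = pattern_range(text, sa, pattern)
    # Count occurrences per document by scanning the whole interval.
    counts = Counter(doc_of[sa[j]] for j in range(lo, hi))
    # Ties are broken in favour of the lower document number.
    return heapq.nlargest(k, counts.items(), key=lambda kv: (kv[1], -kv[0]))


index = build_index(["abracadabra", "banana", "cabbage"])
print(top_k(index, "a", 2))
print(top_k(index, "ab", 2))
```

The pattern here is an arbitrary substring, so no tokenization into words is needed. The sketch scans every occurrence of the pattern to count frequencies, which costs time proportional to the total number of occurrences; avoiding that kind of exhaustive work is the sort of cost that dedicated top-k structures aim to reduce.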

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT Univ., Australia
J. Shane Culpepper, Simon J. Puglisi & Andrew Turpin
Department of Computer Science, Univ. of Chile
Gonzalo Navarro

Editor information

Editors and Affiliations

Department of Mathematics and Computing Science, TU Eindhoven, Eindhoven, The Netherlands
Mark de Berg
Institute for Computer Science, J.W. Goethe University, 60325, Frankfurt/Main, Germany
Ulrich Meyer

About this paper

Cite this paper

Culpepper, J.S., Navarro, G., Puglisi, S.J., Turpin, A. (2010). Top-k Ranked Document Search in General Text Databases. In: de Berg, M., Meyer, U. (eds) Algorithms – ESA 2010. ESA 2010. Lecture Notes in Computer Science, vol 6347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15781-3_17

DOI: https://doi.org/10.1007/978-3-642-15781-3_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15780-6
Online ISBN: 978-3-642-15781-3
eBook Packages: Computer Science, Computer Science (R0)
