Abstract
We present a method for optimizing phrase search based on inverted indexes. Our approach adds selected two-term phrases to an existing index. Whereas competing approaches often rely on the analysis of query logs, our approach works out of the box and uses only the information contained in the index. Our method is competitive in terms of query performance and can even improve on other approaches for difficult queries. Moreover, it gives performance guarantees for arbitrary queries. We further propose using a phrase index as a substitute for the positional index of an in-memory search engine working with short documents. We support our conclusions with experiments using a high-performance main-memory search engine, and we give evidence that classical disk-based systems can also profit from our approach.
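The idea of adding selected two-term phrases to an existing index can be illustrated with a minimal Python sketch. The abstract does not state how phrases are selected, so the cost model here (the length of the shorter of the two document lists, compared against a hypothetical threshold `min_cost`) is purely an assumption, as are all function names. The sketch is not the authors' implementation; it only shows that candidate phrases and their costs can be derived from the index alone, without a query log.

```python
from collections import defaultdict


def build_index(docs):
    """Positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_docs(index, first, second):
    """Documents in which `second` directly follows `first` (positional lookup)."""
    a, b = index.get(first, {}), index.get(second, {})
    hits = []
    for doc_id in sorted(a.keys() & b.keys()):
        following = set(b[doc_id])
        if any(p + 1 in following for p in a[doc_id]):
            hits.append(doc_id)
    return hits


def adjacent_pairs(index):
    """All two-term phrases that occur, recovered from the index alone."""
    by_doc = defaultdict(dict)
    for term, postings in index.items():
        for doc_id, positions in postings.items():
            for pos in positions:
                by_doc[doc_id][pos] = term
    pairs = set()
    for slots in by_doc.values():
        for pos, term in slots.items():
            if pos + 1 in slots:
                pairs.add((term, slots[pos + 1]))
    return pairs


def build_phrase_index(index, min_cost):
    """Materialize lists for phrases whose positional lookup looks expensive.

    ASSUMPTION: cost = length of the shorter document list of the two terms.
    The selection criterion used in the paper is not given in the abstract.
    """
    phrases = {}
    for first, second in adjacent_pairs(index):
        cost = min(len(index[first]), len(index[second]))
        if cost >= min_cost:
            phrases[(first, second)] = phrase_docs(index, first, second)
    return phrases


def search_pair(index, phrases, first, second):
    """Use the precomputed phrase list if present, else fall back to positions."""
    if (first, second) in phrases:
        return phrases[(first, second)]
    return phrase_docs(index, first, second)


docs = ["new york is big", "york is new", "new york new york", "big apple"]
index = build_index(docs)
phrases = build_phrase_index(index, min_cost=3)
print(search_pair(index, phrases, "new", "york"))
print(search_pair(index, phrases, "big", "apple"))
```

Under this assumed cost model, every phrase that is not materialized has a cheap positional lookup by construction, which is one way a bound on query cost for arbitrary two-term queries could arise.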
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Transier, F., Sanders, P. (2008). Out of the Box Phrase Indexing. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_20
DOI: https://doi.org/10.1007/978-3-540-89097-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89096-6
Online ISBN: 978-3-540-89097-3
eBook Packages: Computer Science (R0)