Out of the Box Phrase Indexing

  • Conference paper
String Processing and Information Retrieval (SPIRE 2008)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5280)

Abstract

We present a method for optimizing phrase search based on inverted indexes. Our approach adds selected two-term phrases to an existing index. Whereas competing approaches often rely on the analysis of query logs, ours works out of the box and uses only the information contained in the index. It is competitive in terms of query performance and can even improve on other approaches for difficult queries. Moreover, it gives performance guarantees for arbitrary queries. We further propose using a phrase index as a substitute for the positional index of an in-memory search engine working with short documents. We support our conclusions with experiments using a high-performance main-memory search engine, and we give evidence that classical disk-based systems can also profit from our approach.
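To make the general idea concrete, the following is a minimal sketch of a two-term phrase index layered on top of a positional inverted index. It is not the authors' implementation. The abstract does not say how phrases are selected, so the rule used here (index a pair when both terms occur in at least `min_df` documents) is only an assumption that illustrates using index statistics rather than query logs. All function names are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a list of document strings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def build_phrase_index(docs, index, min_df):
    """Add posting sets for selected adjacent term pairs.

    Assumed selection rule (not taken from the paper): a pair is indexed
    when both of its terms appear in at least min_df documents, i.e. when
    the fallback positional intersection would touch long posting lists.
    """
    phrases = defaultdict(set)
    for doc_id, text in enumerate(docs):
        terms = text.lower().split()
        for a, b in zip(terms, terms[1:]):
            if len(index[a]) >= min_df and len(index[b]) >= min_df:
                phrases[(a, b)].add(doc_id)
    return phrases


def phrase_search(a, b, index, phrases):
    """Return sorted ids of documents that contain term a directly followed by b."""
    if (a, b) in phrases:
        # Fast path: the pair was precomputed, no position checks needed.
        return sorted(phrases[(a, b)])
    # Fallback: intersect the two posting lists and verify adjacency.
    docs_a = index.get(a, {})
    docs_b = index.get(b, {})
    hits = []
    for doc_id in docs_a.keys() & docs_b.keys():
        positions_b = set(docs_b[doc_id])
        if any(p + 1 in positions_b for p in docs_a[doc_id]):
            hits.append(doc_id)
    return sorted(hits)
```

In this sketch, queries over frequent term pairs are answered directly from the phrase postings, while rare pairs fall back to the positional check, where the posting lists are short anyway. Replacing the positional index entirely, as the abstract proposes for short documents, would require a different fallback and is not shown here.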

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Transier, F., Sanders, P. (2008). Out of the Box Phrase Indexing. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_20

  • DOI: https://doi.org/10.1007/978-3-540-89097-3_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89096-6

  • Online ISBN: 978-3-540-89097-3

  • eBook Packages: Computer Science, Computer Science (R0)
