
Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing

Conference paper in Advances in Information Retrieval (ECIR 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6611)

Included in the following conference series: European Conference on Information Retrieval (ECIR)

Abstract

Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method of pruning inverted lists derived from the formulation of unigram language models for retrieval. Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2P2N) test, we statistically compare the frequency of a word's occurrence within a given document to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can significantly decrease index size and query time with less compromise to retrieval effectiveness than similar heuristic methods. Furthermore, we give a formal statistical justification for such methods.
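The abstract describes the pruning criterion only at a high level, so the following is a minimal illustrative sketch of how a two-sample two-proportion z-test could be applied to a single posting. The function names (two_proportion_z, keep_posting), the pooled-variance formulation, and the one-sided critical value of 1.645 are assumptions chosen for illustration, not the paper's exact procedure.

```python
from math import sqrt

# Sketch of the pruning criterion suggested by the abstract: compare a term's
# relative frequency within a document to its relative frequency in the whole
# collection with a two-sample two-proportion z-test, and keep the posting
# only if the within-document rate is significantly higher.

def two_proportion_z(tf_d: int, doc_len: int, ctf: int, coll_len: int) -> float:
    """z statistic for H0: the term's rate in the document equals its rate
    in the collection (pooled-variance two-proportion test)."""
    p1 = tf_d / doc_len                        # within-document proportion
    p2 = ctf / coll_len                        # collection-wide proportion
    p = (tf_d + ctf) / (doc_len + coll_len)    # pooled proportion under H0
    se = sqrt(p * (1.0 - p) * (1.0 / doc_len + 1.0 / coll_len))
    return (p1 - p2) / se if se > 0 else 0.0

def keep_posting(tf_d: int, doc_len: int, ctf: int, coll_len: int,
                 z_crit: float = 1.645) -> bool:
    """Keep the (term, document) posting only if the term occurs significantly
    more often in the document than in the collection (one-sided test; the
    critical value 1.645 corresponds to alpha = 0.05 and is illustrative)."""
    return two_proportion_z(tf_d, doc_len, ctf, coll_len) > z_crit

# Example: a term appearing 12 times in a 300-word document versus 5,000 times
# in a 50-million-word collection is far above its collection rate, so the
# posting is kept; a term whose document rate matches the collection rate
# would be pruned.
print(keep_posting(12, 300, 5_000, 50_000_000))   # True
```

In this sketch, raising z_crit prunes more aggressively (smaller index, greater risk to effectiveness), which mirrors the trade-off the abstract describes between index size and retrieval quality.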




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Thota, S.L., Carterette, B. (2011). Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_54


  • DOI: https://doi.org/10.1007/978-3-642-20161-5_54

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20160-8

  • Online ISBN: 978-3-642-20161-5

  • eBook Packages: Computer Science (R0)
