Abstract
Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method of pruning inverted lists derived from the formulation of unigram language models for retrieval. Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2P2N) test, we statistically compare the frequency of occurrence of a word within a given document to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can significantly reduce index size and query time with less compromise to retrieval effectiveness than similar heuristic methods. Furthermore, we give a formal statistical justification for such methods.
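The pruning decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard pooled two-proportion z statistic, a one-sided decision rule (keep a term only if its in-document rate is significantly higher than its collection rate), and a threshold of 1.96. The function names `two_proportion_z` and `prune_document`, the parameter `z_threshold`, and the toy counts are all hypothetical.

```python
import math


def two_proportion_z(tf, doc_len, cf, coll_len):
    """Pooled two-proportion z statistic.

    Compares the term's rate in the document (tf / doc_len) with its
    rate in the collection (cf / coll_len).
    """
    p_doc = tf / doc_len
    p_coll = cf / coll_len
    pooled = (tf + cf) / (doc_len + coll_len)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / doc_len + 1.0 / coll_len))
    if se == 0.0:
        # Degenerate case (term absent or present everywhere): no evidence.
        return 0.0
    return (p_doc - p_coll) / se


def prune_document(term_freqs, coll_freqs, coll_len, z_threshold=1.96):
    """Return the postings of one document that survive pruning.

    Assumption: a term is kept only when its z statistic exceeds
    z_threshold; all other terms are dropped from this document's
    entries in the inverted lists.
    """
    doc_len = sum(term_freqs.values())
    kept = {}
    for term, tf in term_freqs.items():
        z = two_proportion_z(tf, doc_len, coll_freqs[term], coll_len)
        if z > z_threshold:
            kept[term] = tf
    return kept


# Toy data (hypothetical): "the" occurs at the same rate in the document
# and in the collection, so it carries no document-specific signal.
doc = {"pruning": 5, "the": 6, "index": 4}
collection = {"pruning": 50, "the": 400_000, "index": 200}
kept = prune_document(doc, collection, coll_len=1_000_000)
```

Raising `z_threshold` prunes more aggressively and yields a smaller index; where to set it against retrieval effectiveness is the trade-off the abstract refers to.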
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Thota, S.L., Carterette, B. (2011). Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_54
DOI: https://doi.org/10.1007/978-3-642-20161-5_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20160-8
Online ISBN: 978-3-642-20161-5
eBook Packages: Computer Science, Computer Science (R0)