Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing

Thota, Sree Lekha; Carterette, Ben

doi:10.1007/978-3-642-20161-5_54

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6611)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method of pruning inverted lists derived from the formulation of unigram language models for retrieval. Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2N2P) test, we statistically compare the frequency of occurrence of a word within a given document to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can be used to significantly decrease the size of the index and the query time with less compromise to retrieval effectiveness than similar heuristic methods. Furthermore, we give a formal statistical justification for such methods.
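The pruning decision described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact test statistic, variance estimate, or significance threshold, so everything below is an assumption: the code uses the textbook pooled two-proportion z statistic, a one-sided critical value of 1.645 (5% level), and hypothetical names (`two_proportion_z`, `prune_document`) and toy counts. It also treats the document and the collection as independent samples, which is a simplification.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Textbook pooled two-proportion z statistic for x1/n1 versus x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Degenerate case (term absent or ubiquitous in both samples): no evidence.
        return 0.0
    return (p1 - p2) / se


def prune_document(term_counts, collection_counts, collection_len, z_crit=1.645):
    """Keep only terms whose within-document rate is significantly above
    their collection rate; all other postings for this document are pruned.

    z_crit is an assumed one-sided 5% threshold, not a value from the paper.
    """
    doc_len = sum(term_counts.values())
    kept = {}
    for term, tf in term_counts.items():
        z = two_proportion_z(tf, doc_len, collection_counts[term], collection_len)
        if z > z_crit:
            kept[term] = tf
    return kept


# Toy data (hypothetical): a 40-token document and a 1,000,000-token collection.
doc = {"the": 14, "of": 6, "a": 7, "pruning": 8, "index": 5}
coll = {"the": 350_000, "of": 150_000, "a": 175_000, "pruning": 50, "index": 400}

kept = prune_document(doc, coll, 1_000_000)
print(sorted(kept))
```

In this toy example the function words occur in the document at exactly their collection rate, so their z statistic is zero and their postings are dropped, while the two topical terms are kept. Raising `z_crit` prunes more aggressively, which is one plausible way to trade index size against effectiveness.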

Author information

Authors and Affiliations

Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Sree Lekha Thota & Ben Carterette

Editor information

Editors and Affiliations

Information School, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Paul Clough
CLARITY: Centre for Sensor Web Technologies, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
Colum Foley, Cathal Gurrin & Hyowon Lee
Centre for Next Generation Localisation, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
Gareth J. F. Jones
TNO Human Factors, Brassersplein 2, 2612 CT, Delft, The Netherlands
Wessel Kraaij
Yahoo! Research, 177 Diagonal, 08018, Barcelona, Spain
Vanessa Murdock

About this paper

Cite this paper

Thota, S.L., Carterette, B. (2011). Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_54

DOI: https://doi.org/10.1007/978-3-642-20161-5_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20160-8
Online ISBN: 978-3-642-20161-5
eBook Packages: Computer Science, Computer Science (R0)
