DOI: 10.1145/2808194.2809472
research-article

On Divergence Measures and Static Index Pruning

Published: 27 September 2015

Abstract

We study the problem of static index pruning in a renowned divergence minimization framework, using a range of divergence measures, such as f-divergence and Rényi divergence, as the objective. We show that many well-known divergence measures are convex in pruning decisions and can therefore be minimized exactly using an efficient algorithm. Our approach allows postings to be prioritized according to the amount of information they contribute to the index; specifying a different divergence measure models that contribution on a different returns curve. In our experiment on GOV2 data, Rényi divergence of order infinity appears to be the most effective. This divergence measure significantly outperforms many standard methods and achieves the same retrieval effectiveness as the full data using only 50% of the postings. When top-k precision is the only concern, 10% of the data is sufficient to achieve the accuracy that one would usually expect from a full index.
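
For readers unfamiliar with the divergence measures named in the abstract, the following is a minimal sketch of their textbook definitions for discrete distributions. It is not the authors' pruning algorithm, and it does not show how the paper maps an index to a probability distribution. The function names, the toy distributions, and the assumption that every `q[i]` is strictly positive are illustrative choices made here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence: sum_i p_i * log(p_i / q_i), in nats.

    Assumes q_i > 0 wherever p_i > 0. Terms with p_i == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def f_divergence(p, q, f):
    """f-divergence: sum_i q_i * f(p_i / q_i), for convex f with f(1) = 0.

    Assumes q_i > 0 for all i, and that f is defined at every ratio p_i / q_i.
    """
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha >= 0, assuming q_i > 0 for all i.

    alpha == 1 is defined as the KL divergence (the limiting case), and
    alpha == inf is the log of the largest ratio p_i / q_i.
    """
    if alpha == 1:
        return kl_divergence(p, q)
    if math.isinf(alpha):
        return math.log(max(pi / qi for pi, qi in zip(p, q) if pi > 0))
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)


if __name__ == "__main__":
    # Toy distributions, chosen only for illustration.
    p = [0.5, 0.5]
    q = [0.25, 0.75]
    for order in (0.5, 1, 2, float("inf")):
        print(order, renyi_divergence(p, q, order))
```

Choosing `f(t) = t * log(t)` in `f_divergence` recovers the KL divergence, and the Rényi divergence is non-decreasing in its order, so order infinity is the largest member of the family for a given pair of distributions.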

    Published In

    ICTIR '15: Proceedings of the 2015 International Conference on The Theory of Information Retrieval
    September 2015
    402 pages
    ISBN: 9781450338332
    DOI: 10.1145/2808194
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. f-divergence
    2. Rényi divergence
    3. static index pruning

    Qualifiers

    • Research-article

    Conference

    ICTIR '15
    Acceptance Rates

    ICTIR '15 Paper Acceptance Rate: 29 of 57 submissions, 51%
    Overall Acceptance Rate: 235 of 527 submissions, 45%

    Cited By

    • (2018) Exploring Size-Speed Trade-Offs in Static Index Pruning. 2018 IEEE International Conference on Big Data (Big Data), pages 1093-1100. DOI: 10.1109/BigData.2018.8622177. Online publication date: Dec-2018
    • (2017) An Empirical Analysis of Pruning Techniques. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2023-2026. DOI: 10.1145/3132847.3133151. Online publication date: 6-Nov-2017
    • (2017) Kullback-Leibler Divergence Revisited. Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, pages 117-124. DOI: 10.1145/3121050.3121062. Online publication date: 1-Oct-2017
    • (2016) Improved methods for static index pruning. 2016 IEEE International Conference on Big Data (Big Data), pages 686-695. DOI: 10.1109/BigData.2016.7840661. Online publication date: Dec-2016
