poster

Adaptive term frequency normalization for BM25

Published: 24 October 2011

Abstract

A key component of BM25 contributing to its success is its sublinear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize and show empirically that, in order to optimize retrieval performance, this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function to predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
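
To make the role of k1 concrete, here is a minimal sketch of the commonly used BM25 TF component, (k1 + 1) * tf / (k1 * norm + tf), with the usual length normalization controlled by b. It is not the paper's estimation method: the abstract does not spell out the information gain measure, so the per-term k1 values in `term_k1`, the term names, and the helper `score_tf` are hypothetical and chosen purely for illustration.

```python
def bm25_tf(tf: float, k1: float = 1.2, b: float = 0.75,
            doc_len: float = 1.0, avg_doc_len: float = 1.0) -> float:
    """Commonly used BM25 TF component with document length normalization.

    The value grows sublinearly in tf and is bounded above by k1 + 1.
    """
    if tf <= 0:
        return 0.0
    # Length normalization factor; equals 1 for an average-length document.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (k1 * norm + tf)


# Hypothetical term-specific k1 values (assumption, not taken from the paper).
term_k1 = {"term_a": 0.6, "term_b": 2.0}


def score_tf(term: str, tf: float, default_k1: float = 1.2) -> float:
    """Look up a per-term k1, falling back to a term-independent constant."""
    return bm25_tf(tf, k1=term_k1.get(term, default_k1))


# Compare how quickly the TF component saturates under each k1.
for tf in (1, 2, 5, 20):
    print(tf, round(score_tf("term_a", tf), 3), round(score_tf("term_b", tf), 3))
```

With a small k1 the component saturates after only a few occurrences, while a larger k1 keeps rewarding repeated occurrences for longer; a term-specific k1 lets this shape differ from term to term.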

    Published In

    CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
    October 2011
    2712 pages
    ISBN:9781450307178
    DOI:10.1145/2063576

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. adaptation
    2. bm25
    3. information gain
    4. term frequency

    Qualifiers

    • Poster

    Conference

    CIKM '11
    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2024) Knowledge-based Retrieval Methods for Enhancing Aerospace Model Software Documentation. Proceedings of the 2024 3rd International Conference on Artificial Intelligence and Intelligent Information Processing (372-378). DOI: 10.1145/3707292.3707392. Online publication date: 25-Oct-2024
    • (2024) Retrieval for Extremely Long Queries and Documents with RPRS: A Highly Efficient and Effective Transformer-based Re-Ranker. ACM Transactions on Information Systems 42:5 (1-32). DOI: 10.1145/3631938. Online publication date: 29-Apr-2024
    • (2024) From Claim to Evidence: Verifying Chinese Health Claims with Medical Literature. Natural Language Processing and Chinese Computing (171-183). DOI: 10.1007/978-981-97-9440-9_14. Online publication date: 2-Nov-2024
    • (2023) ExpFinder: A hybrid model for expert finding from text-based expertise data. Expert Systems with Applications 211 (118691). DOI: 10.1016/j.eswa.2022.118691. Online publication date: Jan-2023
    • (2023) Cross-Genre Retrieval for Information Integrity: A COVID-19 Case Study. Advanced Data Mining and Applications (495-509). DOI: 10.1007/978-3-031-46677-9_34. Online publication date: 5-Nov-2023
    • (2022) A Hybrid Approach to Recommending Universal Decimal Classification Codes for Cataloguing in Slovenian Digital Libraries. IEEE Access 10 (85595-85605). DOI: 10.1109/ACCESS.2022.3198706. Online publication date: 2022
    • (2021) Automatic Solution Summarization for Crash Bugs. Proceedings of the 43rd International Conference on Software Engineering (1286-1297). DOI: 10.1109/ICSE43902.2021.00117. Online publication date: 22-May-2021
    • (2020) JASSjr: The Minimalistic BM25 Search Engine for Teaching and Learning Information Retrieval. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (2185-2188). DOI: 10.1145/3397271.3401413. Online publication date: 25-Jul-2020
    • (2020) Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants. Advances in Information Retrieval (28-34). DOI: 10.1007/978-3-030-45442-5_4. Online publication date: 8-Apr-2020
    • (2019) A topic-based term frequency normalization framework to enhance probabilistic information retrieval. Computational Intelligence 36:2 (486-521). DOI: 10.1111/coin.12248. Online publication date: 20-Nov-2019