poster

Adaptive term frequency normalization for BM25

Authors:

Yuanhua Lv,

ChengXiang Zhai

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

Pages 1985 - 1988

https://doi.org/10.1145/2063576.2063871

Published: 24 October 2011

Abstract

A key component of BM25 contributing to its success is its sublinear term frequency (TF) normalization formula. The scale and shape of this TF normalization component are controlled by a parameter k1, which is generally set to a term-independent constant. We hypothesize and show empirically that, in order to optimize retrieval performance, this parameter should be set in a term-specific way. Following this intuition, we propose an information gain measure to directly estimate the contributions of repeated term occurrences, which is then exploited to fit the BM25 function to predict a term-specific k1. Our experimental results show that the proposed approach, without needing any training data, can efficiently and automatically estimate a term-specific k1, and is more effective and robust than the standard BM25.
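To make the role of k1 concrete, the following is a minimal sketch of the commonly cited Okapi BM25 TF component, together with a lookup that swaps in a per-term k1. It is not the authors' estimation procedure: the information gain measure is not specified in the abstract and is not reproduced here. The function names, the default values k1 = 1.2 and b = 0.75, and the entries of `term_k1` are illustrative assumptions.

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Commonly cited BM25 TF component: tf*(k1+1) / (tf + k1*(1 - b + b*dl/avdl)).

    The value grows sublinearly with tf and is bounded above by k1 + 1.
    Defaults (k1=1.2, b=0.75) are conventional choices, not values from the paper.
    """
    if tf <= 0:
        return 0.0
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + length_norm)


# Hypothetical per-term k1 values, invented purely for illustration.
# The paper estimates these from an information gain measure instead.
term_k1 = {"retrieval": 0.6, "the": 2.0}


def bm25_tf_adaptive(term, tf, default_k1=1.2, **kwargs):
    """Same TF component, but k1 is looked up per term (assumed interface)."""
    return bm25_tf(tf, k1=term_k1.get(term, default_k1), **kwargs)


if __name__ == "__main__":
    for tf in (1, 2, 3, 10):
        print(tf, round(bm25_tf(tf), 4), round(bm25_tf_adaptive("retrieval", tf), 4))
```

A smaller k1 makes the curve saturate sooner, so repeated occurrences of that term add little to the score; a larger k1 lets additional occurrences keep contributing for longer.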

Index Terms

Adaptive term frequency normalization for BM25
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking


Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

October 2011

2712 pages

ISBN:9781450307178

DOI:10.1145/2063576

Editors:
Bettina Berendt,
Arjen de Vries,
Wenfei Fan,
Craig Macdonald (University of Glasgow, UK),
Iadh Ounis (University of Glasgow, UK),
Ian Ruthven (University of Strathclyde, UK)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

CIKM '11

CIKM '11: International Conference on Information and Knowledge Management

October 24 - 28, 2011

Glasgow, Scotland, UK

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
