short-paper

Verboseness Fission for BM25 Document Length Normalization

Authors:
Aldo Lipani, National Institute of Informatics, Tokyo, Japan
Mihai Lupu, Vienna University of Technology, Vienna, Austria
Allan Hanbury, Vienna University of Technology, Vienna, Austria
Akiko Aizawa, National Institute of Informatics, Tokyo, Japan

ICTIR '15: Proceedings of the 2015 International Conference on The Theory of Information Retrieval, September 2015, Pages 385–388
https://doi.org/10.1145/2808194.2809486

Published: 27 September 2015

ABSTRACT

BM25 is probably the best-known term weighting model in Information Retrieval. Depending on the formula variant, it has two or three parameters (k₁, b, and k₃). This paper addresses b, the document length normalization parameter. Previous work discusses two cases for length normalization: multi-topicality and verboseness. We observe that there are actually three: multi-topicality, verboseness with word repetition (repetitiveness), and verboseness with synonyms. Based on this observation, we propose and test a new length normalization method that removes the need for a b parameter in BM25. Testing the new method on a set of purposefully varied test collections, we obtain results statistically indistinguishable from the optimal results, which removes the need for ground-truth-based optimization.
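
For orientation, the role of b in the classic formulation can be sketched as follows. This is a minimal illustration of the commonly cited Okapi BM25 term weight, not the new normalization method the paper proposes (which the abstract does not specify). The function name, argument names, and the default values k1=1.2 and b=0.75 are assumptions chosen for the example, and the k₃ query-term component is omitted.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Classic BM25 weight of one term in one document (illustrative sketch).

    tf          -- term frequency in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    df          -- number of documents containing the term
    n_docs      -- number of documents in the collection
    """
    # Inverse document frequency with the usual +0.5 smoothing.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: b = 0 ignores document length entirely,
    # b = 1 scales the term frequency by the full length ratio.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)

# Same term frequency, documents of different length.
short_doc = bm25_term_weight(tf=3, doc_len=50, avg_doc_len=100, df=10, n_docs=1000)
long_doc = bm25_term_weight(tf=3, doc_len=200, avg_doc_len=100, df=10, n_docs=1000)
```

With b > 0 the longer document receives the lower weight for the same term frequency, and with b = 0 the two weights coincide. Choosing b normally requires tuning against relevance judgments, which is the step the abstract says the proposed method makes unnecessary.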

Index Terms

Information systems > Information retrieval > Retrieval models and ranking

Published in
ICTIR '15: Proceedings of the 2015 International Conference on The Theory of Information Retrieval
September 2015
402 pages
ISBN: 9781450338332
DOI: 10.1145/2808194
General Chairs: James Allan (University of Massachusetts Amherst, USA), Bruce Croft (University of Massachusetts Amherst, USA)
Program Chairs: Arjen de Vries (CWI Amsterdam, The Netherlands), Chengxiang Zhai (University of Illinois at Urbana-Champaign, USA)
Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
ICTIR '15 paper acceptance rate: 29 of 57 submissions, 51%. Overall acceptance rate: 209 of 482 submissions, 43%.