ABSTRACT
BM25 is probably the best-known term weighting model in Information Retrieval. Depending on the formula variant, it has two or three parameters (k1, b, and k3). This paper addresses b, the document length normalization parameter. We observe that the two cases previously discussed for length normalization (multi-topicality and verboseness) are actually three: multi-topicality, verboseness with word repetition (repetitiveness), and verboseness with synonyms. Based on this observation, we propose and test a new length normalization method that removes the need for a b parameter in BM25. Testing the new method on a set of purposefully varied test collections, we obtain results statistically indistinguishable from the optimal results, thereby removing the need for ground-truth-based optimization.
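For orientation, the role that b plays in conventional BM25 can be sketched as follows. This is the standard textbook term weight, not the method proposed in the paper. The function name, the default values (k1 = 1.2, b = 0.75), and the particular IDF variant are assumptions chosen for illustration, and the query-side k3 factor is omitted.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Conventional BM25 weight of one term in one document.

    Illustrative sketch only: the k3 query-term factor is left out and
    the IDF form below is one common variant (an assumption here).
    """
    # Length normalization factor: b = 0 ignores document length,
    # b = 1 normalizes fully by doc_len / avg_doc_len.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturating term-frequency component controlled by k1.
    tf_part = tf * (k1 + 1.0) / (tf + k1 * norm)
    # Classic Robertson/Sparck Jones style inverse document frequency.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return idf * tf_part


if __name__ == "__main__":
    # Same term frequency, documents of different lengths.
    for b in (0.0, 0.75, 1.0):
        short = bm25_term_score(3, 100, 100, 10, 1000, b=b)
        long_ = bm25_term_score(3, 500, 100, 10, 1000, b=b)
        print(f"b={b}: average-length doc {short:.3f}, long doc {long_:.3f}")
```

With b = 0 both documents receive the same score. As b grows, the longer document is penalized more heavily, which is why b normally has to be tuned per collection against relevance judgments.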
Index Terms
- Verboseness Fission for BM25 Document Length Normalization