Abstract
In the December 2017 issue of SIGIR Forum, Fuhr presented ten "Thou Shalt Not"s (i.e., warnings against bad practices) for IR experimenters. While his article provides a great deal of good material for discussion, the objective of the present article is to argue that not all of his recommendations should be treated as absolute truths: researchers should be aware that other views exist, and conference programme chairs and journal editors should be very careful when prescribing guidelines for evaluation practices.
- Norbert Fuhr. Some common mistakes in IR evaluation, and how they can be avoided. SIGIR Forum, 51(3):32--41, 2017.
- S. S. Stevens. On the theory of scales of measurement. Science, New Series, 103(2684):677--680, 1946.
- Jeff Sauro and James R. Lewis. Quantifying the User Experience: Practical Statistics for User Research (2nd Edition). Morgan Kaufmann, 2016.
- Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422--446, 2002.
- Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of ACM CIKM 2009, pages 621--630, 2009.
- Stephen Robertson. A new interpretation of average precision. In Proceedings of ACM SIGIR 2008, pages 689--690, 2008.
- Tetsuya Sakai and Stephen Robertson. Modelling a user population for designing information retrieval metrics. In Proceedings of EVIA 2008, pages 30--41, 2008.
- Tetsuya Sakai. Metrics, statistics, tests. In PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), pages 116--163, 2014a.
- Tetsuya Sakai and Zhaohao Zeng. Which diversity evaluation measures are "good"? In Proceedings of ACM SIGIR 2019, pages 595--604, 2019.
- Alistair Moffat and Justin Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems, 27(1), 2008.
- Justin Zobel, Alistair Moffat, and Laurence A.F. Park. Against recall: Is it persistence, cardinality, density, coverage, or totality? SIGIR Forum, 43(1):3--8, 2009.
- Tetsuya Sakai. A simple and effective approach to score standardisation. In Proceedings of ACM ICTIR 2016, pages 95--104, 2016.
- Julián Urbano, Harlley Lima, and Alan Hanjalic. A new perspective on score standardization. In Proceedings of ACM SIGIR 2019, pages 1061--1064, 2019.
- William Webber, Alistair Moffat, and Justin Zobel. Score standardization for inter-collection comparison of retrieval systems. In Proceedings of ACM SIGIR 2008, pages 51--58, 2008a.
- G.E.P. Box. Robustness in the strategy of scientific model building. In Robert L. Launer and Graham N. Wilkinson, editors, Robustness in Statistics, pages 201--236. Academic Press, 1979.
- Tetsuya Sakai and Ruihua Song. Diversified search evaluation: Lessons from the NTCIR-9 INTENT task. Information Retrieval, 16(4):504--529, 2013.
- Ben Carterette. Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems, 30(1), 2012.
- Tetsuya Sakai. Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power. Springer, 2018.
- Chris Buckley and Ellen M. Voorhees. Retrieval system evaluation. In Ellen M. Voorhees and Donna K. Harman, editors, TREC: Experiment and Evaluation in Information Retrieval, chapter 3, pages 53--75. The MIT Press, 2005.
- Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of TREC-3, 1995.
- Stephen Robertson. On GMAP - and other transformations. In Proceedings of ACM CIKM 2006, pages 78--83, 2006.
- William Webber, Alistair Moffat, Justin Zobel, and Tetsuya Sakai. Precision-at-ten considered redundant. In Proceedings of ACM SIGIR 2008, pages 695--696, 2008b.
- Tetsuya Sakai. Statistical reform in information retrieval? SIGIR Forum, 48(1):3--12, 2014b.
- Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, and Ze Zhong Wu. The SIGIR 2019 open-source IR replicability challenge (OSIRRC 2019). In Proceedings of ACM SIGIR 2019, pages 1432--1434, 2019.
- Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. CENTRE@CLEF2019: Sequel in the systematic reproducibility realm. In Proceedings of CLEF 2019 (LNCS 11696), pages 287--300, 2019.
- Tetsuya Sakai, Nicola Ferro, Ian Soboroff, Zhaohao Zeng, Peng Xiao, and Maria Maistro. Overview of the NTCIR-14 CENTRE task. In Proceedings of NTCIR-14, pages 494--509, 2019.