On Fuhr's guideline for IR evaluation

Published: 19 February 2021

Abstract

In the December 2017 issue of SIGIR Forum, Fuhr presented ten "Thou Shalt Not"s (i.e., warnings against bad practices) for IR experimenters. While his article provides much good material for discussion, the objective of the present article is to argue that not all of his recommendations should be treated as absolute truths: researchers should be aware that there are other views, and conference programme chairs and journal editors should be very careful when prescribing guidelines for evaluation practices.

References

  1. Norbert Fuhr. Some common mistakes in IR evaluation, and how they can be avoided. SIGIR Forum, 51(3):32--41, 2017.
  2. S. S. Stevens. On the theory of scales of measurement. Science, New Series, 103(2684):677--680, 1946.
  3. Jeff Sauro and James R. Lewis. Quantifying the User Experience: Practical Statistics for User Research (2nd Edition). Morgan Kaufmann, 2016.
  4. Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422--446, 2002.
  5. Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of ACM CIKM 2009, pages 621--630, 2009.
  6. Stephen Robertson. A new interpretation of average precision. In Proceedings of ACM SIGIR 2008, pages 689--690, 2008.
  7. Tetsuya Sakai and Stephen Robertson. Modelling a user population for designing information retrieval metrics. In Proceedings of EVIA 2008, pages 30--41, 2008.
  8. Tetsuya Sakai. Metrics, statistics, tests. In PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), pages 116--163, 2014a.
  9. Tetsuya Sakai and Zhaohao Zeng. Which diversity evaluation measures are "good"? In Proceedings of ACM SIGIR 2019, pages 595--604, 2019.
  10. Alistair Moffat and Justin Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems, 27(1), 2008.
  11. Justin Zobel, Alistair Moffat, and Laurence A.F. Park. Against recall: Is it persistence, cardinality, density, coverage, or totality? SIGIR Forum, 43(1):3--8, 2009.
  12. Tetsuya Sakai. A simple and effective approach to score standardisation. In Proceedings of ACM ICTIR 2016, pages 95--104, 2016.
  13. Julián Urbano, Harlley Lima, and Alan Hanjalic. A new perspective on score standardization. In Proceedings of ACM SIGIR 2019, pages 1061--1064, 2019.
  14. William Webber, Alistair Moffat, and Justin Zobel. Score standardization for inter-collection comparison of retrieval systems. In Proceedings of ACM SIGIR 2008, pages 51--58, 2008a.
  15. G.E.P. Box. Robustness in the strategy of scientific model building. In Robert L. Launer and Graham N. Wilkinson, editors, Robustness in Statistics, pages 201--236. Academic Press, 1979.
  16. Tetsuya Sakai and Ruihua Song. Diversified search evaluation: Lessons from the NTCIR-9 INTENT task. Information Retrieval, 16(4):504--529, 2013.
  17. Ben Carterette. Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems, 30(1), 2012.
  18. Tetsuya Sakai. Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power. Springer, 2018.
  19. Chris Buckley and Ellen M. Voorhees. Retrieval system evaluation. In Ellen M. Voorhees and Donna K. Harman, editors, TREC: Experiment and Evaluation in Information Retrieval, chapter 3, pages 53--75. The MIT Press, 2005.
  20. Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. 1995.
  21. Stephen Robertson. On GMAP - and other transformations. In Proceedings of ACM CIKM 2006, pages 78--83, 2006.
  22. William Webber, Alistair Moffat, Justin Zobel, and Tetsuya Sakai. Precision-at-ten considered redundant. In Proceedings of ACM SIGIR 2008, pages 695--696, 2008b.
  23. Tetsuya Sakai. Statistical reform in information retrieval? SIGIR Forum, 48(1):3--12, 2014b.
  24. Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, and Ze Zhong Wu. The SIGIR 2019 open-source IR replicability challenge (OSIRRC 2019). In Proceedings of ACM SIGIR 2019, pages 1432--1434, 2019.
  25. Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. CENTRE@CLEF2019: Sequel in the systematic reproducibility realm. In Proceedings of CLEF 2019 (LNCS 11696), pages 287--300, 2019.
  26. Tetsuya Sakai, Nicola Ferro, Ian Soboroff, Zhaohao Zeng, Peng Xiao, and Maria Maistro. Overview of the NTCIR-14 CENTRE task. In Proceedings of NTCIR-14, pages 494--509, 2019.


Published in

ACM SIGIR Forum, Volume 54, Issue 1
June 2020, 148 pages
ISSN: 0163-5840
DOI: 10.1145/3451964

Copyright © 2021. Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• research-article
