Abstract
In the December 2017 issue of SIGIR Forum, Fuhr presented ten "Thou Shalt Not"s (i.e., warnings against bad practices) for IR experimenters. While his article provides a great deal of good material for discussion, the objective of the present article is to argue that not all of his recommendations should be treated as absolute truths: researchers should be aware that other views exist, and conference programme chairs and journal editors should be very careful when prescribing guidelines for evaluation practices.
- Norbert Fuhr. Some common mistakes in IR evaluation, and how they can be avoided. SIGIR Forum, 51(3):32--41, 2017.
- S. S. Stevens. On the theory of scales of measurement. Science, New Series, 103(2684):677--680, 1946.
- Jeff Sauro and James R. Lewis. Quantifying the User Experience: Practical Statistics for User Research (2nd Edition). Morgan Kaufmann, 2016.
- Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422--446, 2002.
- Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of ACM CIKM 2009, pages 621--630, 2009.
- Stephen Robertson. A new interpretation of average precision. In Proceedings of ACM SIGIR 2008, pages 689--690, 2008.
- Tetsuya Sakai and Stephen Robertson. Modelling a user population for designing information retrieval metrics. In Proceedings of EVIA 2008, pages 30--41, 2008.
- Tetsuya Sakai. Metrics, statistics, tests. In PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), pages 116--163, 2014a.
- Tetsuya Sakai and Zhaohao Zeng. Which diversity evaluation measures are "good"? In Proceedings of ACM SIGIR 2019, pages 595--604, 2019.
- Alistair Moffat and Justin Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems, 27(1), 2008.
- Justin Zobel, Alistair Moffat, and Laurence A.F. Park. Against recall: Is it persistence, cardinality, density, coverage, or totality? SIGIR Forum, 43(1):3--8, 2009.
- Tetsuya Sakai. A simple and effective approach to score standardisation. In Proceedings of ACM ICTIR 2016, pages 95--104, 2016.
- Julián Urbano, Harlley Lima, and Alan Hanjalic. A new perspective on score standardization. In Proceedings of ACM SIGIR 2019, pages 1061--1064, 2019.
- William Webber, Alistair Moffat, and Justin Zobel. Score standardization for inter-collection comparison of retrieval systems. In Proceedings of ACM SIGIR 2008, pages 51--58, 2008a.
- G.E.P. Box. Robustness in the strategy of scientific model building. In Robert L. Launer and Graham N. Wilkinson, editors, Robustness in Statistics, pages 201--236. Academic Press, 1979.
- Tetsuya Sakai and Ruihua Song. Diversified search evaluation: Lessons from the NTCIR-9 INTENT task. Information Retrieval, 16(4):504--529, 2013.
- Ben Carterette. Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Transactions on Information Systems, 30(1), 2012.
- Tetsuya Sakai. Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power. Springer, 2018.
- Chris Buckley and Ellen M. Voorhees. Retrieval system evaluation. In Ellen M. Voorhees and Donna K. Harman, editors, TREC: Experiment and Evaluation in Information Retrieval, chapter 3, pages 53--75. The MIT Press, 2005.
- Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of TREC-3, 1995.
- Stephen Robertson. On GMAP - and other transformations. In Proceedings of ACM CIKM 2006, pages 78--83, 2006.
- William Webber, Alistair Moffat, Justin Zobel, and Tetsuya Sakai. Precision-at-ten considered redundant. In Proceedings of ACM SIGIR 2008, pages 695--696, 2008b.
- Tetsuya Sakai. Statistical reform in information retrieval? SIGIR Forum, 48(1):3--12, 2014b.
- Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, and Ze Zhong Wu. The SIGIR 2019 open-source IR replicability challenge (OSIRRC 2019). In Proceedings of ACM SIGIR 2019, pages 1432--1434, 2019.
- Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, and Ian Soboroff. CENTRE@CLEF2019: Sequel in the systematic reproducibility realm. In Proceedings of CLEF 2019 (LNCS 11696), pages 287--300, 2019.
- Tetsuya Sakai, Nicola Ferro, Ian Soboroff, Zhaohao Zeng, Peng Xiao, and Maria Maistro. Overview of the NTCIR-14 CENTRE task. In Proceedings of NTCIR-14, pages 494--509, 2019.