Abstract
This paper points out mistakes frequently found in IR publications: MRR and ERR violate basic requirements for a metric, MAP is based on unrealistic assumptions, the numbers shown overstate the precision of the results, relative improvements of arithmetic means are inappropriate, the simple holdout method yields unreliable results, hypotheses are often formulated only after the experiment, significance tests frequently ignore the multiple comparisons problem, effect sizes are ignored, reproducing the experiments may be nearly impossible, and authors sometimes claim proof by experimentation.
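To make two of these pitfalls concrete, here is a minimal Python sketch (not part of the paper; the ranks and the number of tests are assumed example values). It shows why averaging reciprocal ranks treats rank differences very unevenly, and how running many uncorrected significance tests inflates the chance of a spurious "significant" result:

```python
# Illustrative sketch of two pitfalls named in the abstract.
# Plain Python 3; all numbers are made-up example values.

# 1) Reciprocal rank is an ordinal quantity: the gap between ranks 1 and 2
#    is 45x the gap between ranks 9 and 10, so its arithmetic mean (MRR)
#    weights rank improvements very unevenly.
def rr(rank: int) -> float:
    return 1.0 / rank

print(rr(1) - rr(2))    # 0.5
print(rr(9) - rr(10))   # ~0.011

# 2) Multiple comparisons: k independent tests at alpha = 0.05 inflate
#    the probability of at least one false positive across the family.
alpha, k = 0.05, 20
family_wise_error = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {family_wise_error:.2f}")  # ~0.64
print(f"Bonferroni-corrected alpha     = {alpha / k:.4f}")          # 0.0025
```

The Bonferroni division shown here is only the crudest correction; the point is that the per-test alpha no longer bounds the family-wise error rate once several systems or variants are compared on the same data.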