Abstract
This paper points out mistakes frequently found in IR publications: MRR and ERR violate basic requirements for a metric, MAP is based on unrealistic assumptions, the numbers shown overstate the precision of the results, relative improvements of arithmetic means are inappropriate, the simple holdout method yields unreliable results, hypotheses are often formulated only after the experiment, significance tests frequently ignore the multiple comparisons problem, effect sizes are ignored, reproducing the experiments may be nearly impossible, and authors sometimes claim proof by experimentation.
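To make two of these pitfalls concrete, here is a minimal Python sketch (not part of the paper; the ranks and the number of tests are assumed example values). It shows why averaging reciprocal ranks treats rank differences very unevenly, and how running many uncorrected significance tests inflates the chance of a spurious "significant" result:

```python
# Illustrative sketch of two pitfalls named in the abstract.
# Plain Python 3; all numbers are made-up example values.

# 1) Reciprocal rank is an ordinal quantity: the gap between ranks 1 and 2
#    is 45x the gap between ranks 9 and 10, so its arithmetic mean (MRR)
#    weights rank improvements very unevenly.
def rr(rank: int) -> float:
    return 1.0 / rank

print(rr(1) - rr(2))    # 0.5
print(rr(9) - rr(10))   # ~0.011

# 2) Multiple comparisons: k independent tests at alpha = 0.05 inflate
#    the probability of at least one false positive across the family.
alpha, k = 0.05, 20
family_wise_error = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {family_wise_error:.2f}")  # ~0.64
print(f"Bonferroni-corrected alpha     = {alpha / k:.4f}")          # 0.0025
```

The Bonferroni division shown here is only the crudest correction; the point is that the per-test alpha no longer bounds the family-wise error rate once several systems or variants are compared on the same data.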