
Some Common Mistakes In IR Evaluation, And How They Can Be Avoided

Published: 22 February 2018

Abstract

This paper points out some mistakes frequently found in IR publications: MRR and ERR violate basic requirements for a metric, MAP is based on unrealistic assumptions, the reported numbers overstate the precision of the results, relative improvements of arithmetic means are inappropriate, the simple holdout method yields unreliable results, hypotheses are often formulated after the experiment, significance tests frequently ignore the multiple comparisons problem, effect sizes are ignored, reproducibility of the experiments might be nearly impossible, and authors sometimes claim proof by experimentation.
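To make one of these points concrete, the multiple comparisons problem can be illustrated with a short sketch. It is not taken from the paper: the function names are hypothetical, the family-wise error rate formula assumes independent tests, and Holm's step-down procedure is used here only as one well-known correction among several.

```python
from typing import List, Sequence


def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down correction; returns reject decisions in input order."""
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


if __name__ == "__main__":
    # Illustrative p-values only, not results from any experiment.
    print(round(familywise_error_rate(0.05, 10), 3))
    print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

Under the independence assumption, ten uncorrected tests at the 0.05 level already carry roughly a 40% chance of at least one spurious "significant" result, which is why comparing many systems or settings without a correction can mislead.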


  • Published in

    ACM SIGIR Forum, Volume 51, Issue 3
    December 2017
    157 pages
    ISSN: 0163-5840
    DOI: 10.1145/3190580

    Copyright © 2018 held by the owner/author(s)

    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • column
