DOI: 10.1145/3121050.3121058
research-article

Are IR Evaluation Measures on an Interval Scale?

Published: 01 October 2017

Abstract

In this paper, we formally investigate whether IR evaluation measures are on an interval scale, a property needed to safely compute the basic statistics, such as mean and variance, that we use daily to compare IR systems. We address this question within the framework of the representational theory of measurement, relying on the notion of difference structure, i.e. a total equi-spaced ordering on the system runs. We found that the most popular set-based measures, i.e. precision, recall, and F-measure, are on an interval scale. For rank-based measures under a strongly top-heavy ordering, we found that only RBP with p = 1/2 is on an interval scale, while RBP for other values of p, AP, DCG, and ERR are not. Moreover, under a weakly top-heavy ordering, we found that none of RBP, AP, DCG, and ERR is on an interval scale.
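
The RBP claim can be illustrated with a small numerical check. The sketch below is not the paper's proof. It assumes binary relevance, fixed-length runs, and the standard RBP definition, and it models the strongly top-heavy ordering as plain lexicographic order on the relevance vectors, which is a simplifying assumption. The function names are hypothetical. It tests whether consecutive runs in that order are separated by one constant, positive RBP difference.

```python
from fractions import Fraction
from itertools import product


def rbp(run, p):
    """Rank-biased precision of a binary run: (1 - p) * sum_i r_i * p^(i-1)."""
    # enumerate() starts at 0, so the exponent i corresponds to rank i + 1.
    return (1 - p) * sum(r * p ** i for i, r in enumerate(run))


def successive_gaps(n, p):
    """RBP differences between consecutive length-n runs in lexicographic order.

    Assumption: lexicographic order on (r_1, ..., r_n) stands in for the
    strongly top-heavy ordering discussed in the abstract.
    """
    runs = sorted(product((0, 1), repeat=n))
    scores = [rbp(run, p) for run in runs]
    return [b - a for a, b in zip(scores, scores[1:])]


def is_equispaced(n, p):
    """True if every consecutive pair of runs differs by the same positive amount."""
    gaps = successive_gaps(n, p)
    return len(set(gaps)) == 1 and gaps[0] > 0


if __name__ == "__main__":
    # Exact rational arithmetic avoids floating-point noise in the comparison.
    for p in (Fraction(1, 4), Fraction(1, 2), Fraction(4, 5)):
        print(p, is_equispaced(4, p))
```

With p = 1/2 each run maps to the binary fraction spelled out by its relevance vector, so all 2^n runs land on a regular grid with step 2^-n. For the other values tried here the gaps differ, and for p > 1/2 some are negative, meaning RBP no longer even preserves the assumed ordering.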

Published In

ICTIR '17: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval
October 2017
348 pages
ISBN:9781450344906
DOI:10.1145/3121050

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Paper

Author Tags

  1. evaluation measures
  2. interval scale
  3. representational theory of measurement

Qualifiers

  • Research-article

Conference

ICTIR '17

Acceptance Rates

ICTIR '17 Paper Acceptance Rate 27 of 54 submissions, 50%;
Overall Acceptance Rate 235 of 527 submissions, 45%

Article Metrics

  • Downloads (Last 12 months): 8
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 13 Feb 2025

Cited By

  • (2024) How much freedom does an effectiveness metric really have? Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24874. Online publication date: 15-Feb-2024
  • (2022) Towards Formally Grounded Evaluation Measures for Semantic Parsing-based Knowledge Graph Question Answering. Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, 3-12. DOI: 10.1145/3539813.3545146. Online publication date: 23-Aug-2022
  • (2021) Proof by experimentation? ACM SIGIR Forum 54:2, 1-4. DOI: 10.1145/3483382.3483385. Online publication date: 20-Aug-2021
  • (2021) Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2283-2287. DOI: 10.1145/3404835.3463034. Online publication date: 11-Jul-2021
  • (2021) MS MARCO: Benchmarking Ranking Models in the Large-Data Regime. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1566-1576. DOI: 10.1145/3404835.3462804. Online publication date: 11-Jul-2021
  • (2021) Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales. IEEE Access 9, 136182-136216. DOI: 10.1109/ACCESS.2021.3116857. Online publication date: 2021
  • (2020) Exploiting Stopping Time to Evaluate Accumulated Relevance. Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, 169-176. DOI: 10.1145/3409256.3409832. Online publication date: 14-Sep-2020
  • (2020) On the nature of information access evaluation metrics: a unifying framework. Information Retrieval Journal. DOI: 10.1007/s10791-020-09374-0. Online publication date: 29-May-2020
  • (2019) Statistical Significance Testing in Information Retrieval. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 505-514. DOI: 10.1145/3331184.3331259. Online publication date: 18-Jul-2019
  • (2019) A General Theory of IR Evaluation Measures. IEEE Transactions on Knowledge and Data Engineering 31:3, 409-422. DOI: 10.1109/TKDE.2018.2840708. Online publication date: 1-Mar-2019
