ABSTRACT
In this paper we present a formal framework for defining and studying the properties of utility-oriented measurements of retrieval effectiveness, such as AP, RBP, ERR, and many other popular IR evaluation measures. The framework builds on the representational theory of measurement, which provides the foundations of the modern theory of measurement in both the physical and social sciences, and thus explicitly links IR evaluation to a broader context. The framework is minimal, in the sense that it relies on just one axiom, from which the other properties are derived. Finally, it contributes to a better understanding, and a clear separation, of which issues stem from the inherent problems of comparing systems in terms of retrieval effectiveness and which stem from the expected numerical properties of a measurement.
Index Terms
- Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness