Abstract
We address two problems: (1) assessing the confidence of the standard point estimates of precision, recall and F-score, and (2) comparing the results, in terms of precision, recall and F-score, obtained using two different methods. To do so, we use a probabilistic setting that yields posterior distributions on these performance indicators rather than point estimates. This framework is applied to the case where different methods are run on different datasets from the same source, as well as to the standard situation where competing results are obtained on the same data.
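The idea of replacing point estimates with posterior distributions can be illustrated with a minimal sketch. The abstract does not spell out the model, so everything here is an assumption made for illustration: the counts of true positives, false positives and false negatives are given a symmetric Dirichlet-style prior (the hypothetical parameter `prior`), and posterior samples are drawn through independent Gamma variates, which gives Beta-distributed precision and recall. The function names are hypothetical, and the comparison helper treats the two systems as independent, which is closer to the "different datasets from the same source" case than to the paired, same-data case.

```python
import random


def posterior_samples(tp, fp, fn, prior=1.0, n=10000, seed=0):
    """Draw joint posterior samples of (precision, recall, F1).

    Assumption (not taken from the abstract): each count gets an
    independent Gamma variate with shape `count + prior`. Ratios of
    such variates are Beta distributed, so precision follows
    Beta(tp + prior, fp + prior) and recall Beta(tp + prior, fn + prior).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u = rng.gammavariate(tp + prior, 1.0)  # true positives
        v = rng.gammavariate(fp + prior, 1.0)  # false positives
        w = rng.gammavariate(fn + prior, 1.0)  # false negatives
        precision = u / (u + v)
        recall = u / (u + w)
        f1 = 2 * u / (2 * u + v + w)
        samples.append((precision, recall, f1))
    return samples


def credible_interval(values, mass=0.95):
    """Return an equal-tailed interval holding roughly `mass` of the samples."""
    ordered = sorted(values)
    lo = int((1 - mass) / 2 * len(ordered))
    hi = int((1 + mass) / 2 * len(ordered)) - 1
    return ordered[lo], ordered[hi]


def prob_first_better(samples_a, samples_b, index=2):
    """Estimate P(metric of A > metric of B), treating A and B as independent.

    `index` selects the metric: 0 = precision, 1 = recall, 2 = F1.
    """
    wins = sum(a[index] > b[index] for a, b in zip(samples_a, samples_b))
    return wins / len(samples_a)


if __name__ == "__main__":
    # Illustrative counts only; they are not results from the paper.
    system_a = posterior_samples(tp=80, fp=20, fn=20, seed=1)
    system_b = posterior_samples(tp=70, fp=30, fn=25, seed=2)
    print("95% interval on F1 for A:", credible_interval([s[2] for s in system_a]))
    print("P(F1 of A > F1 of B):", prob_first_better(system_a, system_b))
```

An interval of this kind shows how much confidence a point estimate deserves, and the estimated probability that one system beats the other gives a direct way to compare two methods under the stated independence assumption.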
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Goutte, C., Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_25
DOI: https://doi.org/10.1007/978-3-540-31865-1_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science (R0)