Abstract
While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to inflated false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. In this study, we extend their investigation to include recommendation system evaluation data as well as multiple comparison procedures that control the false discovery rate (FDR).
Partly supported by the National Science Foundation under Grant 17-51278.
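To make the contrast in the abstract concrete, the sketch below (illustrative only, with synthetic per-topic scores; not the paper's experimental code) compares several hypothetical systems against a baseline using paired t-tests and then adjusts the resulting p-values two ways: a familywise error rate correction (Holm) and an FDR correction (Benjamini-Hochberg), both via the standard statsmodels routine.

```python
# Minimal sketch of multiple-comparison correction for system comparisons.
# Data, system names, and effect sizes are synthetic placeholders.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n_topics = 50

# Hypothetical per-topic effectiveness scores (e.g., nDCG) for a baseline
# and four candidate systems with small, varying improvements.
baseline = rng.beta(5, 3, size=n_topics)
systems = {
    f"sys{i}": np.clip(baseline + rng.normal(0.01 * i, 0.05, n_topics), 0, 1)
    for i in range(1, 5)
}

# One paired t-test per candidate-vs-baseline comparison.
names = list(systems)
pvals = [ttest_rel(systems[name], baseline).pvalue for name in names]

# Familywise error rate control (Holm) vs. FDR control (Benjamini-Hochberg).
reject_holm, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
reject_bh, p_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for name, p, rh, rb in zip(names, pvals, reject_holm, reject_bh):
    print(f"{name}: raw p={p:.4f}  reject (Holm)={rh}  reject (BH)={rb}")
```

With many comparisons, the FDR procedure will typically reject more hypotheses than the familywise correction, which is the trade-off between false discoveries and statistical power that the study examines.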
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ihemelandu, N., Ekstrand, M.D. (2024). Multiple Testing for IR and Recommendation System Experiments. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56062-0
Online ISBN: 978-3-031-56063-7