Abstract
Information retrieval (IR) system evaluation aims at comparing IR systems either (1) to one another with respect to a single test collection, or (2) across multiple collections. In the first case, the evaluation environment (test collection and evaluation metrics) stays the same, while in the second case the environment changes. Different evaluation environments may, in fact, be seen as evolutionary versions of some given evaluation environment. In this work, we propose a methodology to predict a statistically significant change in the performance of an IR system (i.e. result delta \(\mathcal {R}\varDelta \)) by quantifying the differences between test collections (i.e. knowledge delta \(\mathcal {K}\varDelta \)). In a first phase, we quantify the differences between the document collections (i.e. \(\mathcal {K}_{d}\varDelta \)) of the test collections by means of TF-IDF and Language Model (LM) representations. We use the \(\mathcal {K}_{d}\varDelta \) to train SVM classification models that predict statistically significant performance changes of various IR systems on evolving test collections derived from the Robust and TREC-COVID collections. We evaluate our approach against our previous \(\mathcal {K}_{d}\varDelta \) experiments.
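The first phase described in the abstract — representing each document collection and taking the difference between representations as \(\mathcal {K}_{d}\varDelta \) features — can be illustrated with a minimal pure-Python sketch. This is a toy illustration, not the authors' implementation: the collection-level TF-IDF aggregation, the toy data, and the element-wise difference are all assumptions made here for clarity; note 2's shared vocabulary across collections is the one detail taken from the paper.

```python
import math
from collections import Counter

def tfidf_vector(docs, vocab):
    """Aggregate TF-IDF vector for a document collection.
    docs: list of token lists; vocab: vocabulary shared across all
    collections (note 2: a common vocabulary makes them comparable)."""
    n = len(docs)
    df = Counter()  # document frequency per term
    tf = Counter()  # collection-level term frequency
    for doc in docs:
        df.update(set(doc))
        tf.update(doc)
    # Smoothed IDF; the exact weighting scheme is an assumption here.
    return [tf[t] * math.log((1 + n) / (1 + df[t])) for t in vocab]

def kd_delta(vec_a, vec_b):
    """Knowledge delta: element-wise difference between the two
    collection representations, usable as classifier features."""
    return [a - b for a, b in zip(vec_a, vec_b)]

# Hypothetical toy collections, for illustration only.
coll_a = [["virus", "vaccine"], ["vaccine", "trial"]]
coll_b = [["virus", "mask"], ["mask", "policy"]]
vocab = sorted({t for d in coll_a + coll_b for t in d})
delta = kd_delta(tfidf_vector(coll_a, vocab), tfidf_vector(coll_b, vocab))
```

In the full pipeline such delta vectors would be fed to an SVM classifier that predicts whether a system's performance change across the two collections is statistically significant.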
Notes
- 1. Recall that, in our work, a test collection TC together with a set of appropriate metrics forms an Evaluation Environment, EE.
- 2. In order for the test collections to be comparable, we take as our vocabulary all tokens across all test collections.
- 3. where \(L = Q1 - 1.5 * (Q3 - Q1)\) and \(U = Q3 + 1.5 * (Q3 - Q1)\).
- 4. Where we apply a min-max normalization to the entries of these rows.
- 5. In our previous work we defined different types of \(\mathcal {R}\varDelta \); in this paper, \(\mathcal {R}\varDelta \) coincides with \(\mathcal {R}_{e}\varDelta \) in [5].
- 6.
- 7.
- 8. The full set of results for the 56 classifiers can be found here: https://owncloud.tuwien.ac.at/index.php/s/opUP9QlFEUHlfsx.
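Notes 3 and 4 can be made concrete with a short sketch: note 3's \(L\) and \(U\) are the standard Tukey fences for outlier detection, and note 4 applies min-max normalization to rows of the feature matrix. The quartile interpolation method and the helper names below are assumptions; the paper specifies only the fence and normalization formulas.

```python
def iqr_fences(values):
    """Tukey fences from note 3: L = Q1 - 1.5*(Q3-Q1), U = Q3 + 1.5*(Q3-Q1).
    Quartiles via linear-interpolation percentiles (an assumption; the
    paper does not specify the quartile method)."""
    def percentile(sorted_vals, p):
        k = (len(sorted_vals) - 1) * p
        lo, hi = int(k), min(int(k) + 1, len(sorted_vals) - 1)
        return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)
    s = sorted(values)
    q1, q3 = percentile(s, 0.25), percentile(s, 0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def min_max(row):
    """Min-max normalization from note 4, mapping entries to [0, 1];
    constant rows map to all zeros (a defensive choice made here)."""
    lo, hi = min(row), max(row)
    return [(x - lo) / (hi - lo) for x in row] if hi > lo else [0.0] * len(row)
```

For example, in `[1, 2, 3, 4, 100]` the fences are \(L = -1\) and \(U = 7\), so 100 falls outside the upper fence and would be flagged as an outlier.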
References
Amati, G.: Frequentist and Bayesian approach to information retrieval. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. Lecture Notes in Computer Science, vol. 3936, pp. 13–24. Springer, Berlin (2006). https://doi.org/10.1007/11735106_3
Ferro, N., Kim, Y., Sanderson, M.: Using collection shards to study retrieval performance effect sizes. ACM Trans. Inf. Syst. (TOIS) 37(3), 1–40 (2019)
Ferro, N., Silvello, G.: Towards an anatomy of IR system component performances. J. Assoc. Inf. Sci. Technol. 69, 187–200 (2018). https://doi.org/10.1002/asi.23910
Galuščáková, P., et al.: LongEval-Retrieval: French-English dynamic test collection for continuous web search evaluation. arXiv preprint arXiv:2303.03229 (2023)
González-Sáez, G.N., Mulhem, P., Goeuriot, L.: Towards the evaluation of information retrieval systems on evolving datasets with pivot systems. In: Candan, K.S., et al. (eds.) CLEF 2021. Lecture Notes in Computer Science, vol. 12880, pp. 91–102. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-85251-1_8
González-Sáez, G., et al.: Towards result delta prediction based on knowledge deltas for continuous IR evaluation. In: Faggioli, G., Ferro, N., Mothe, J., Raiber, F. (eds.) QPP++ 2023: Query Performance Prediction and Its Evaluation in New Tasks Workshop, CEUR Workshop Proceedings, vol. 3366, pp. 20–24, Aachen (2023). http://ceur-ws.org/Vol-3366/#paper-04
Hauff, C.: Predicting the effectiveness of queries and retrieval systems. In: SIGIR Forum, vol. 44, p. 88 (2010)
He, B., Ounis, I.: Query performance prediction. Inf. Syst. 31(7), 585–594 (2006). https://doi.org/10.1016/j.is.2005.11.003
Heafield, K., Pouzyrevsky, I., Clark, J.H., Koehn, P.: Scalable modified Kneser-Ney language model estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 690–696 (2013)
Jurafsky, D., Martin, J.H.: Speech and Language Processing, 2nd edn. Prentice-Hall Inc, Upper Saddle River (2009)
Kanoulas, E.: A short survey on online and offline methods for search quality evaluation. In: Russian Summer School on Information Retrieval (2015)
Macdonald, C., Tonellotto, N.: Declarative experimentation in information retrieval using PyTerrier. In: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, pp. 161–168 (2020)
Rashidi, L., Zobel, J., Moffat, A.: Evaluating the predictivity of IR experiments. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021, pp. 1667–1671. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3404835.3463040
Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2021). https://doi.org/10.1162/tacl_a_00349
Sanderson, M.: Test collection based evaluation of information retrieval systems. Now Publishers Inc (2010)
Sanderson, M., Turpin, A., Zhang, Y., Scholer, F.: Differences in effectiveness across sub-collections. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 1965–1969. Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2396761.2398553
Voorhees, E., et al.: TREC-COVID: constructing a pandemic information retrieval test collection. In: ACM SIGIR Forum, vol. 54, no. 1, pp. 1–12. ACM New York (2021)
Voorhees, E.M.: The TREC 2005 robust track. In: ACM SIGIR Forum, vol. 40, pp. 41–48. ACM, New York (2006)
Wang, L.L., et al.: CORD-19: the COVID-19 open research dataset. arXiv (2020)
Zhao, Y., Scholer, F., Tsegay, Y.: Effective pre-retrieval query performance prediction using similarity and variability evidence. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 52–64. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_8
Acknowledgement
This work is supported by the ANR Kodicare bilateral project, grant ANR-19-CE23-0029 of the French Agence Nationale de la Recherche, and by the Austrian Science Fund (FWF), grant I-4471-N.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
El-Ebshihy, A., et al. (2023). Predicting Retrieval Performance Changes in Evolving Evaluation Environments. In: Arampatzis, A., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023. Lecture Notes in Computer Science, vol. 14163. Springer, Cham. https://doi.org/10.1007/978-3-031-42448-9_3