Predicting Retrieval Performance Changes in Evolving Evaluation Environments

  • Conference paper
  • In: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2023)

Abstract

Information retrieval (IR) systems evaluation aims at comparing IR systems either (1) against one another on a single test collection, or (2) across multiple test collections. In the first case, the evaluation environment (test collection and evaluation metrics) stays the same, while in the second case the environment changes. Different evaluation environments may, in fact, be seen as evolutionary versions of some given evaluation environment. In this work, we propose a methodology to predict the statistically significant change in the performance of an IR system (i.e. result delta \(\mathcal {R}\varDelta \)) by quantifying the differences between test collections (i.e. knowledge delta \(\mathcal {K}\varDelta \)). In a first phase, we quantify differences between the document collections (i.e. \(\mathcal {K}_{d}\varDelta \)) of the test collections by means of TF-IDF and Language Model (LM) representations. We use the \(\mathcal {K}_{d}\varDelta \) to train SVM classification models that predict significant performance changes of various IR systems on evolving test collections derived from the Robust and TREC-COVID collections. We evaluate our approach against our previous \(\mathcal {K}_{d}\varDelta \) experiments.
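
To make the pipeline sketched in the abstract concrete, the snippet below derives TF-IDF-based \(\mathcal {K}_{d}\varDelta \) features for pairs of document collections and trains an SVM to predict whether the corresponding performance change is significant. The specific feature construction (mean TF-IDF vectors over a shared vocabulary, absolute element-wise differences), the toy collections, and the labels are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): TF-IDF knowledge-delta features
# between pairs of document collections, fed to an SVM that predicts
# whether an IR system's effectiveness change is statistically significant.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def collection_vector(docs, vectorizer):
    """Represent a document collection as the mean of its TF-IDF vectors."""
    return np.asarray(vectorizer.transform(docs).mean(axis=0)).ravel()

# Hypothetical pairs of evolving document collections (old vs. new version).
collection_pairs = [
    (["covid spread model", "vaccine trial protocol"],
     ["variant genome sequencing", "vaccine efficacy study"]),
    (["political news article", "sports match report"],
     ["election coverage analysis", "league match summary"]),
]
# Hypothetical labels: 1 = the system's performance change is significant.
labels = [1, 0]

# Shared vocabulary across all collections (cf. note 2 below).
vectorizer = TfidfVectorizer().fit(
    [doc for pair in collection_pairs for docs in pair for doc in docs]
)

# Knowledge delta K_d: absolute difference of the two collection vectors.
X = np.array([
    np.abs(collection_vector(old, vectorizer) - collection_vector(new, vectorizer))
    for old, new in collection_pairs
])

# SVM classifier predicting whether the result delta is significant.
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```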

Notes

  1. Recall that, in our work, a test collection (TC) together with a set of appropriate metrics forms an Evaluation Environment (EE).

  2. In order for the test collections to be comparable, we consider as our vocabulary all tokens across all test collections.

  3. where \(L = Q_1 - 1.5 \times (Q_3 - Q_1)\) and \(U = Q_3 + 1.5 \times (Q_3 - Q_1)\).

  4. We apply a min-max normalization to the entries of these rows (a brief numeric sketch of the computations in notes 3 and 4 follows this list).

  5. In our previous work, we defined different types of \(\mathcal {R}\varDelta \); in this paper, \(\mathcal {R}\varDelta \) coincides with \(\mathcal {R}_{e}\varDelta \) in [5].

  6. https://ir.nist.gov/covidSubmit/data.html.

  7. https://scikit-learn.org/stable/.

  8. The full set of results for the 56 classifiers can be found here: https://owncloud.tuwien.ac.at/index.php/s/opUP9QlFEUHlfsx.
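
As a concrete reading of notes 3 and 4, the snippet below computes the outlier bounds L and U from the quartiles and applies a min-max normalization to a row of values. The input numbers are made up for illustration; this is a minimal sketch, not the authors' code.

```python
import numpy as np

# Toy delta values; purely illustrative.
values = np.array([0.12, 0.35, 0.40, 0.41, 0.55, 0.90])

# Note 3: outlier bounds from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
lower = q1 - 1.5 * (q3 - q1)   # L = Q1 - 1.5 * (Q3 - Q1)
upper = q3 + 1.5 * (q3 - q1)   # U = Q3 + 1.5 * (Q3 - Q1)
outliers = values[(values < lower) | (values > upper)]

# Note 4: min-max normalization of the entries of a row, rescaled to [0, 1].
row = np.array([3.0, 7.5, 1.2, 9.9])
normalized = (row - row.min()) / (row.max() - row.min())

print(lower, upper, outliers)
print(normalized)
```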

References

  1. Amati, G.: Frequentist and Bayesian approach to information retrieval. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. Lecture Notes in Computer Science, vol. 3936, pp. 13–24. Springer, Berlin (2006). https://doi.org/10.1007/11735106_3

  2. Ferro, N., Kim, Y., Sanderson, M.: Using collection shards to study retrieval performance effect sizes. ACM Trans. Inf. Syst. (TOIS) 37(3), 1–40 (2019)

  3. Ferro, N., Silvello, G.: Towards an anatomy of IR system component performances. J. Assoc. Inf. Sci. Technol. 69, 187–200 (2018). https://doi.org/10.1002/asi.23910

  4. Galuščáková, P., et al.: LongEval-Retrieval: French-English dynamic test collection for continuous web search evaluation. arXiv preprint arXiv:2303.03229 (2023)

  5. González-Sáez, G.N., Mulhem, P., Goeuriot, L.: Towards the evaluation of information retrieval systems on evolving datasets with pivot systems. In: Candan, K.S., et al. (eds.) CLEF 2021. Lecture Notes in Computer Science, vol. 12880, pp. 91–102. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-85251-1_8

  6. González-Sáez, G., et al.: Towards result delta prediction based on knowledge deltas for continuous IR evaluation. In: Faggioli, G., Ferro, N., Mothe, J., Raiber, F. (eds.) The QPP++ 2023: Query Performance Prediction and Its Evaluation in New Tasks Workshop (QPP++), pp. 20–24, no. 3366 in CEUR Workshop Proceedings, Aachen (2023). http://ceur-ws.org/Vol-3366/#paper-04

  7. Hauff, C.: Predicting the effectiveness of queries and retrieval systems. In: SIGIR Forum, vol. 44, p. 88 (2010)

  8. He, B., Ounis, I.: Query performance prediction. Inf. Syst. 31(7), 585–594 (2006). https://doi.org/10.1016/j.is.2005.11.003

  9. Heafield, K., Pouzyrevsky, I., Clark, J.H., Koehn, P.: Scalable modified Kneser-Ney language model estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 690–696 (2013)

  10. Jurafsky, D., Martin, J.H.: Speech and Language Processing, 2nd edn. Prentice-Hall Inc., Upper Saddle River (2009)

  11. Kanoulas, E.: A short survey on online and offline methods for search quality evaluation. In: Russian Summer School on Information Retrieval (2015)

  12. Macdonald, C., Tonellotto, N.: Declarative experimentation in information retrieval using PyTerrier. In: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, pp. 161–168 (2020)

  13. Rashidi, L., Zobel, J., Moffat, A.: Evaluating the predictivity of IR experiments. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021, pp. 1667–1671. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3404835.3463040

  14. Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2021). https://doi.org/10.1162/tacl_a_00349

  15. Sanderson, M.: Test collection based evaluation of information retrieval systems. Now Publishers Inc (2010)

  16. Sanderson, M., Turpin, A., Zhang, Y., Scholer, F.: Differences in effectiveness across sub-collections. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 1965–1969. Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2396761.2398553

  17. Voorhees, E., et al.: TREC-COVID: constructing a pandemic information retrieval test collection. In: ACM SIGIR Forum, vol. 54, no. 1, pp. 1–12. ACM New York (2021)

  18. Voorhees, E.M.: The TREC 2005 robust track. In: ACM SIGIR Forum, vol. 40, pp. 41–48. ACM, New York (2006)

  19. Wang, L.L., et al.: CORD-19: the COVID-19 open research dataset. ArXiv (2020)

  20. Zhao, Y., Scholer, F., Tsegay, Y.: Effective pre-retrieval query performance prediction using similarity and variability evidence. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 52–64. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_8

Acknowledgement

This work is supported by the ANR Kodicare bilateral project, grant ANR-19-CE23-0029 of the French Agence Nationale de la Recherche, and by the Austrian Science Fund (FWF), grant I-4471-N.

Author information

Corresponding author

Correspondence to Alaa El-Ebshihy.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

El-Ebshihy, A. et al. (2023). Predicting Retrieval Performance Changes in Evolving Evaluation Environments. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023. Lecture Notes in Computer Science, vol 14163. Springer, Cham. https://doi.org/10.1007/978-3-031-42448-9_3

  • DOI: https://doi.org/10.1007/978-3-031-42448-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42447-2

  • Online ISBN: 978-3-031-42448-9

  • eBook Packages: Computer Science (R0)
