ABSTRACT
Classical evaluation of information retrieval (IR) systems assesses a system on a static test collection. In Web search, however, the evaluation environment (EE) changes continuously, so the assumption of a static test collection does not reflect this evolving reality. Moreover, changes in the EE, such as the document set, the topic set, the relevance judgments, and the chosen metrics, have an impact on performance measurement [1, 4]. To the best of our knowledge, there is no way to compare two versions of a search engine across evolving EEs.
We aim to propose a continuous framework for evaluating different versions of a search engine in different evaluation environments. The classical paradigm relies on a controlled test collection (i.e., a set of topics, a corpus of documents, and relevance assessments) as a stable and meaningful EE that guarantees the reproducibility of system results. We define the different EEs as a dynamic test collection (DTC): a list of test collections based on the controlled evolution of a static test collection. The DTC allows us to quantify and relate the differences between the test collection elements, called the Knowledge delta (KΔ), and the performance differences between systems evaluated on these varying test collections, called the Result delta (RΔ). Finally, the continuous evaluation is characterized by KΔs and RΔs. Relating the changes in both deltas will allow us to interpret the evolution of system performance. The expected contributions of the thesis are: (i) a pivot strategy based on RΔ to compare systems evaluated in different EEs; (ii) a formalization of the DTC to simulate continuous evaluation and provide significant RΔs in evolving contexts; and (iii) a continuous evaluation framework that incorporates KΔ to explain the RΔ of the evaluated systems.
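As an illustration, KΔ and RΔ can be sketched on toy data. The choices below (Jaccard distance over document sets for KΔ, a signed score difference for RΔ, and all identifiers) are illustrative assumptions for this sketch, not the definitive definitions from the thesis.

```python
# Illustrative sketch only: KΔ modeled as a Jaccard distance between the
# document sets of two EEs; RΔ as a signed performance difference.

def knowledge_delta(docs_a: set, docs_b: set) -> float:
    """One possible KΔ: Jaccard distance between two document sets."""
    union = docs_a | docs_b
    if not union:
        return 0.0
    return 1.0 - len(docs_a & docs_b) / len(union)

def result_delta(score_a: float, score_b: float) -> float:
    """One possible RΔ: signed score difference across two EEs."""
    return score_b - score_a

# Toy example: the corpus drifts between two snapshots of the EE.
ee1_docs = {"d1", "d2", "d3", "d4"}
ee2_docs = {"d3", "d4", "d5", "d6"}
k_delta = knowledge_delta(ee1_docs, ee2_docs)  # 1 - 2/6, i.e. two thirds of the union changed
r_delta = result_delta(0.41, 0.35)             # the metric score dropped by 0.06
```

Richer KΔ definitions would also cover topic and judgment drift; the point of the sketch is only that both deltas are quantities one can compute and then relate.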
It is not possible to directly measure the RΔ of two systems evaluated in different EEs, because performance variations depend on the changes in the EEs [1]. To estimate this RΔ, we propose to use a reference system, called the pivot system, which is evaluated within both EEs considered. The RΔ value is then measured as the relative distance between the pivot system and each evaluated system. Our results [2, 3] show that the pivot strategy improves the correctness of the ranking of systems (RoS) evaluated in two EEs (i.e., its similarity with the RoS obtained on the ground truth), compared to the RoS constructed from the absolute performance values of each system evaluated in the different EEs. The correctness of the RoS depends on the system chosen as pivot and on the metric.
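The pivot strategy above can be sketched as follows. The pivot-relative score, the toy scores, and a hand-rolled Kendall's tau for comparing rankings of systems are all illustrative assumptions of this sketch, not the exact formulation from the cited papers.

```python
# Illustrative sketch of the pivot strategy: systems A and B are each
# evaluated in a different EE, while the pivot P is evaluated in both.
from itertools import combinations

def pivot_relative_score(system_score: float, pivot_score: float) -> float:
    """Score of a system relative to the pivot evaluated in the same EE."""
    return system_score - pivot_score

def kendall_tau(rank_a: list, rank_b: list) -> float:
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        d = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy scores: absolute values are not comparable across EEs,
# but the pivot-relative values give a common reference point.
score_a_ee1, pivot_ee1 = 0.42, 0.38
score_b_ee2, pivot_ee2 = 0.47, 0.45
rel_a = pivot_relative_score(score_a_ee1, pivot_ee1)  # A beats the pivot by 0.04
rel_b = pivot_relative_score(score_b_ee2, pivot_ee2)  # B beats the pivot by 0.02
ranking = sorted(["A", "B"], key={"A": rel_a, "B": rel_b}.get, reverse=True)
```

Note that a ranking built from the absolute scores would put B first (0.47 > 0.42), whereas the pivot-relative ranking puts A first; the correctness of either RoS against a ground-truth RoS can then be measured with a rank correlation such as Kendall's tau.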
Our proposal then moves to continuous evaluation: the repeated assessment of the same or different versions of a Web search engine across evolving EEs. Current test collections do not consider the evolution of documents, topics, and relevance judgments. We require a DTC to extract the RΔs of the compared systems and to relate them to the changes in the EEs (KΔ). We provide a method to build a DTC from static test collections based on controlled features, as a way to better simulate the evolving EE. According to our preliminary experiments, a system evaluated on our proposed DTC shows more variable performance, and larger RΔs, than when it is evaluated on several random shards or bootstraps of documents.
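A minimal sketch of building a DTC from a static collection follows. The choice of document year as the controlled feature, and the field names, are assumptions made for illustration; the actual method may use other controlled features and may also evolve topics and judgments.

```python
# Illustrative sketch: split a static corpus into an ordered list of
# sub-collections (one per epoch of a controlled feature), each acting
# as the document set of one evaluation environment in the DTC.
from collections import defaultdict

def build_dtc(documents: list, feature=lambda d: d["year"]) -> list:
    """Group documents by a controlled feature and order the groups."""
    epochs = defaultdict(list)
    for doc in documents:
        epochs[feature(doc)].append(doc)
    # Each element of the returned list is one EE's corpus.
    return [epochs[key] for key in sorted(epochs)]

corpus = [
    {"id": "d1", "year": 2019}, {"id": "d2", "year": 2019},
    {"id": "d3", "year": 2020}, {"id": "d4", "year": 2021},
]
dtc = build_dtc(corpus)  # three epochs: the 2019, 2020, and 2021 corpora
```

Unlike random shards or bootstraps, a feature-driven split like this preserves a controlled, interpretable direction of change between consecutive collections, which is what makes the resulting KΔs usable to explain the observed RΔs.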
As future work, we will integrate the KΔs to formalize an explainable continuous evaluation framework. The pivot strategy tells us when the performance of the system is improving across EEs. The DTC provides us with the required EEs to identify significant RΔs, and the inclusion of KΔs in the framework will define a set of factors that explain the system's performance changes.
[1] Nicola Ferro and Mark Sanderson. 2019. Improving the Accuracy of System Performance Estimation by Using Shards. In ACM SIGIR'19. 805-814.
[2] G. González-Sáez, P. Mulhem, and L. Goeuriot. 2021. Towards the Evaluation of Information Retrieval Systems on Evolving Datasets with Pivot Systems. In CLEF. Springer, 91-102.
[3] G. González-Sáez, L. Goeuriot, and P. Mulhem. 2021. Addressing Different Evaluation Environments for Information Retrieval through Pivot Systems. In CORIA'21.
[4] Ellen M. Voorhees, Daniel Samarov, and Ian Soboroff. 2017. Using Replicates in Information Retrieval Evaluation. ACM Transactions on Information Systems (TOIS) 36, 2 (2017).
Index Terms
- Continuous Result Delta Evaluation of IR Systems