ABSTRACT
Classical evaluation of information retrieval (IR) systems assesses a system on a static test collection. In Web search, however, the evaluation environment (EE) changes continuously, so the assumption of a static test collection does not reflect this evolving reality. Moreover, changes in the EE, such as the document set, the topic set, the relevance judgments, and the chosen metrics, have an impact on performance measurement [1, 4]. To the best of our knowledge, there is no way to compare two versions of a search engine across evolving EEs.
We aim to propose a continuous framework for evaluating different versions of a search engine in different evaluation environments. The classical paradigm relies on a controlled test collection (i.e., a set of topics, a corpus of documents, and relevance assessments) as a stable and meaningful EE that guarantees the reproducibility of system results. We define the different EEs as a dynamic test collection (DTC): a list of test collections based on the controlled evolution of a static test collection. The DTC allows us to quantify and relate the differences between the test collection elements, called the Knowledge delta (KΔ), and the performance differences between systems evaluated on these varying test collections, called the Result delta (RΔ). Finally, the continuous evaluation is characterized by KΔs and RΔs. Relating the changes in both deltas will allow us to interpret the evolution of system performance. The expected contributions of the thesis are: (i) a pivot strategy based on RΔ to compare systems evaluated in different EEs; (ii) a formalization of the DTC to simulate continuous evaluation and provide significant RΔs in evolving contexts; and (iii) a continuous evaluation framework that incorporates KΔ to explain the RΔ of the evaluated systems.
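As an illustration, KΔ and RΔ can be sketched on toy data. The choices below (Jaccard distance over document sets for KΔ, a signed score difference for RΔ, and all identifiers) are illustrative assumptions for this sketch, not the definitive definitions from the thesis.

```python
# Illustrative sketch only: KΔ modeled as a Jaccard distance between the
# document sets of two EEs; RΔ as a signed performance difference.

def knowledge_delta(docs_a: set, docs_b: set) -> float:
    """One possible KΔ: Jaccard distance between two document sets."""
    union = docs_a | docs_b
    if not union:
        return 0.0
    return 1.0 - len(docs_a & docs_b) / len(union)

def result_delta(score_a: float, score_b: float) -> float:
    """One possible RΔ: signed score difference across two EEs."""
    return score_b - score_a

# Toy example: the corpus drifts between two snapshots of the EE.
ee1_docs = {"d1", "d2", "d3", "d4"}
ee2_docs = {"d3", "d4", "d5", "d6"}
k_delta = knowledge_delta(ee1_docs, ee2_docs)  # 1 - 2/6, i.e. two thirds of the union changed
r_delta = result_delta(0.41, 0.35)             # the metric score dropped by 0.06
```

Richer KΔ definitions would also cover topic and judgment drift; the point of the sketch is only that both deltas are quantities one can compute and then relate.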
It is not possible to directly measure the RΔ of two systems evaluated in different EEs, because performance variations depend on the changes in the EEs [1]. To estimate this RΔ, we propose to use a reference system, called the pivot system, which is evaluated within both EEs considered. The RΔ value is then measured as the relative distance between the pivot system and each evaluated system. Our results [2, 3] show that the pivot strategy improves the correctness of the ranking of systems (RoS) evaluated in two EEs (i.e., its similarity with the RoS obtained on the ground truth), compared to the RoS constructed from the absolute performance values of each system evaluated in the different EEs. The correctness of the RoS depends on the system chosen as pivot and on the metric.
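The pivot strategy above can be sketched as follows. The pivot-relative score, the toy scores, and a hand-rolled Kendall's tau for comparing rankings of systems are all illustrative assumptions of this sketch, not the exact formulation from the cited papers.

```python
# Illustrative sketch of the pivot strategy: systems A and B are each
# evaluated in a different EE, while the pivot P is evaluated in both.
from itertools import combinations

def pivot_relative_score(system_score: float, pivot_score: float) -> float:
    """Score of a system relative to the pivot evaluated in the same EE."""
    return system_score - pivot_score

def kendall_tau(rank_a: list, rank_b: list) -> float:
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        d = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Toy scores: absolute values are not comparable across EEs,
# but the pivot-relative values give a common reference point.
score_a_ee1, pivot_ee1 = 0.42, 0.38
score_b_ee2, pivot_ee2 = 0.47, 0.45
rel_a = pivot_relative_score(score_a_ee1, pivot_ee1)  # A beats the pivot by 0.04
rel_b = pivot_relative_score(score_b_ee2, pivot_ee2)  # B beats the pivot by 0.02
ranking = sorted(["A", "B"], key={"A": rel_a, "B": rel_b}.get, reverse=True)
```

Note that a ranking built from the absolute scores would put B first (0.47 > 0.42), whereas the pivot-relative ranking puts A first; the correctness of either RoS against a ground-truth RoS can then be measured with a rank correlation such as Kendall's tau.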
Our proposal then moves to continuous evaluation: the repeated assessment of the same or different versions of a Web search engine across evolving EEs. Current test collections do not consider the evolution of documents, topics, and relevance judgments. We require a DTC to extract the RΔs of the compared systems and to relate them to the changes in the EEs (KΔ). We provide a method to build a DTC from static test collections based on controlled features, as a way to better simulate the evolving EE. According to our preliminary experiments, a system evaluated on our proposed DTC shows more variable performance, and larger RΔs, than when it is evaluated on several random shards or bootstraps of documents.
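A minimal sketch of building a DTC from a static collection follows. The choice of document year as the controlled feature, and the field names, are assumptions made for illustration; the actual method may use other controlled features and may also evolve topics and judgments.

```python
# Illustrative sketch: split a static corpus into an ordered list of
# sub-collections (one per epoch of a controlled feature), each acting
# as the document set of one evaluation environment in the DTC.
from collections import defaultdict

def build_dtc(documents: list, feature=lambda d: d["year"]) -> list:
    """Group documents by a controlled feature and order the groups."""
    epochs = defaultdict(list)
    for doc in documents:
        epochs[feature(doc)].append(doc)
    # Each element of the returned list is one EE's corpus.
    return [epochs[key] for key in sorted(epochs)]

corpus = [
    {"id": "d1", "year": 2019}, {"id": "d2", "year": 2019},
    {"id": "d3", "year": 2020}, {"id": "d4", "year": 2021},
]
dtc = build_dtc(corpus)  # three epochs: the 2019, 2020, and 2021 corpora
```

Unlike random shards or bootstraps, a feature-driven split like this preserves a controlled, interpretable direction of change between consecutive collections, which is what makes the resulting KΔs usable to explain the observed RΔs.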
As future work, we will integrate the KΔs to formalize an explainable continuous evaluation framework. The pivot strategy tells us when the performance of the system is improving across EEs. The DTC provides us with the required EEs to identify significant RΔs, and the inclusion of KΔs in the framework will define a set of factors that explain the system's performance changes.
[1] Nicola Ferro and Mark Sanderson. 2019. Improving the Accuracy of System Performance Estimation by Using Shards. In ACM SIGIR'19. 805-814.
[2] G. González-Sáez, P. Mulhem, and L. Goeuriot. 2021. Towards the Evaluation of Information Retrieval Systems on Evolving Datasets with Pivot Systems. In CLEF. Springer, 91-102.
[3] G. González-Sáez, L. Goeuriot, and P. Mulhem. 2021. Addressing Different Evaluation Environments for Information Retrieval through Pivot Systems. In CORIA'21.
[4] Ellen M. Voorhees, Daniel Samarov, and Ian Soboroff. 2017. Using Replicates in Information Retrieval Evaluation. ACM Transactions on Information Systems (TOIS) 36, 2 (2017).
Index Terms
- Continuous Result Delta Evaluation of IR Systems