DOI: 10.1145/3077136.3080644

On the Reusability of "Living Labs" Test Collections: A Case Study of Real-Time Summarization

Published: 07 August 2017

Abstract

Information retrieval test collections are typically built using data from large-scale evaluations in international forums such as TREC, CLEF, and NTCIR. Previous validation studies on pool-based test collections for ad hoc retrieval have examined whether they can be reused to accurately assess the effectiveness of systems that did not participate in the original evaluation. To our knowledge, the reusability of test collections derived from "living labs" evaluations, based on logs of user activity, has not been explored. In this paper, we perform a "leave-one-out" analysis of human judgment data derived from the TREC 2016 Real-Time Summarization Track and show that those judgments do not appear to be reusable. While this finding is limited to one specific evaluation, it does call into question the reusability of test collections built from living labs in general, and at the very least suggests the need for additional work in validating such experimental instruments.
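To make the methodology concrete, the sketch below illustrates one way a leave-one-out reusability analysis of this kind can be set up: for each participating system, the judgments that only it contributed are withheld, every run is re-scored against the reduced judgments, and the held-out system's change in rank is recorded. The data structures, the toy precision_at_k metric, and the function names are hypothetical placeholders; the paper's actual analysis operates on the TREC 2016 Real-Time Summarization judgments and the track's official measures.

```python
# Minimal sketch of a leave-one-out reusability analysis over judgment data,
# in the spirit of the study described above. All data structures, names, and
# the toy metric are assumptions for illustration only.

def precision_at_k(run, qrels, k=10):
    """Toy effectiveness metric: fraction of the top-k results judged relevant.
    Unjudged documents count as non-relevant, as in standard pooled evaluation."""
    scores = []
    for topic, docs in run.items():
        hits = sum(qrels.get((topic, d), 0) > 0 for d in docs[:k])
        scores.append(hits / k)
    return sum(scores) / len(scores) if scores else 0.0

def leave_one_out_rank_shifts(runs, qrels, contributions):
    """runs:          {system: {topic: [doc, ...]}} ranked results per system
    qrels:         {(topic, doc): relevance grade}
    contributions: {system: {(topic, doc), ...}} pairs judged only because
                   that system surfaced them.
    For each system, drop its unique contributions from the judgments,
    re-rank all systems, and report how far the held-out system moves."""
    def ranking(judgments):
        return sorted(runs, key=lambda s: precision_at_k(runs[s], judgments),
                      reverse=True)

    full_ranking = ranking(qrels)
    shifts = {}
    for held_out in runs:
        reduced = {pair: rel for pair, rel in qrels.items()
                   if pair not in contributions.get(held_out, set())}
        shifts[held_out] = (ranking(reduced).index(held_out)
                            - full_ranking.index(held_out))
    return shifts
```

In this sketch, a large positive shift for a held-out system means it drops in the ranking once its own contributions are removed from the judgments, i.e., non-contributing systems would be systematically underestimated; that is the failure mode a reusability analysis of this kind probes.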


Cited By

  • (2022) Toward Cranfield-inspired reusability assessment in interactive information retrieval evaluation. Information Processing & Management 59(5), 103007. DOI: 10.1016/j.ipm.2022.103007
  • (2020) Update Delivery Mechanisms for Prospective Information Needs. Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, 308-312. DOI: 10.1145/3343413.3377988
  • (2020) Reproducible Online Search Experiments. Advances in Information Retrieval, 597-601. DOI: 10.1007/978-3-030-45442-5_77

        Published In

        SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
        August 2017
        1476 pages
        ISBN:9781450350228
        DOI:10.1145/3077136

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. interleaved evaluations
        2. push notifications
        3. reusability
        4. user studies

        Qualifiers

        • Short-paper



        Acceptance Rates

        SIGIR '17 paper acceptance rate: 78 of 362 submissions (22%)
        Overall acceptance rate: 792 of 3,983 submissions (20%)

