ABSTRACT
The high cost of constructing test collections has led many researchers to develop intelligent document selection methods that find relevant documents with fewer judgments than the standard pooling method requires. In this paper, we conduct a comprehensive set of experiments to evaluate six bandit-based document selection methods in terms of the evaluation reliability, fairness, and reusability of the resultant test collections. In our experiments, the best-performing method varies across test collections, underscoring the importance of using diverse test collections for an accurate performance analysis. Our experiments with six test collections also show that Move-To-Front is the most robust of the methods we investigate.
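For readers unfamiliar with the baseline, the sketch below illustrates the classic Move-To-Front adjudication strategy: runs are kept in a priority queue, the front run contributes documents to be judged as long as they turn out relevant, and a run is demoted to the back as soon as it yields a non-relevant document. This is a minimal Python sketch of the general technique, not the experimental code behind the study; the `runs`, `judge`, and `budget` names are illustrative, and `judge` stands in for a human assessor.

```python
import collections

def move_to_front(runs, judge, budget):
    """Minimal Move-To-Front pooling sketch (illustrative only).

    runs   : dict mapping run id -> ranked list of document ids
    judge  : callable doc_id -> bool, a stand-in relevance oracle
    budget : total number of judgments allowed
    Returns the collected judgments as {doc_id: bool}.
    """
    queue = collections.deque(runs.keys())   # runs in priority order
    cursors = {r: 0 for r in runs}           # next unjudged rank per run
    qrels = {}

    while queue and len(qrels) < budget:
        run = queue[0]
        ranking = runs[run]
        # skip documents already judged via other runs
        while cursors[run] < len(ranking) and ranking[cursors[run]] in qrels:
            cursors[run] += 1
        if cursors[run] >= len(ranking):     # run exhausted: retire it
            queue.popleft()
            continue
        doc = ranking[cursors[run]]
        cursors[run] += 1
        qrels[doc] = judge(doc)
        if not qrels[doc]:                   # non-relevant: demote the run
            queue.popleft()
            queue.append(run)
    return qrels
```

A toy invocation, with a hypothetical relevance oracle: `move_to_front({"bm25": ["d1", "d2", "d3"], "lm": ["d2", "d4", "d1"]}, judge=lambda d: d in {"d1", "d4"}, budget=4)`. The shared `qrels` dictionary captures the key cost saving: a document judged once counts for every run that retrieved it.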