Abstract
Assessors make preference judgments faster and more consistently than graded judgments. Preference judgments can also recognize distinctions between items that appear equivalent under graded judgments. Unfortunately, preference judgments can require more than linear effort to fully order a pool of items, and evaluation measures for preference judgments are not as well established as those for graded judgments, such as NDCG. In this article, we explore the assessment process for partial preference judgments, with the aim of identifying and ordering the top items in the pool, rather than fully ordering the entire pool. To measure the performance of a ranker, we compare its output to this preferred ordering by applying a rank similarity measure. We demonstrate the practical feasibility of this approach by crowdsourcing partial preferences for the TREC 2019 Conversational Assistance Track, replacing NDCG with a new measure named compatibility. This new measure has its most striking impact when comparing modern neural rankers, where it is able to recognize significant improvements in quality that would otherwise be missed by NDCG.
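Below is a minimal sketch of how such a comparison could be computed, assuming compatibility is taken to be a top-weighted rank similarity in the spirit of rank-biased overlap (RBO), normalized against the preferred ordering itself. The function names, the persistence parameter p, and the example documents are illustrative assumptions, not the article's own code.

```python
# Sketch: scoring a ranker's output by its top-weighted similarity to the
# preferred ordering of the top items. Assumes an RBO-style overlap measure;
# names and parameter choices here are illustrative.

def rbo(ranking, ideal, p=0.95, depth=None):
    """Top-weighted overlap between two rankings, with weight p**(d-1) at depth d."""
    if depth is None:
        depth = len(ranking)
    score = 0.0
    seen_ranking, seen_ideal = set(), set()
    for d in range(1, depth + 1):
        if d <= len(ranking):
            seen_ranking.add(ranking[d - 1])
        if d <= len(ideal):
            seen_ideal.add(ideal[d - 1])
        overlap = len(seen_ranking & seen_ideal)
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

def compatibility(ranking, ideal, p=0.95):
    """Normalize by the score the preferred ordering gives itself, so a ranking
    that begins with the preferred items in order scores 1.0."""
    best = rbo(ideal, ideal, p=p, depth=len(ranking))
    return rbo(ranking, ideal, p=p) / best if best > 0 else 0.0

# Example: a preferred ordering of the top items induced by partial preference
# judgments, compared against one system's ranking (document ids are made up).
ideal = ["d3", "d7", "d1"]
system = ["d3", "d1", "d9", "d7", "d2"]
print(round(compatibility(system, ideal), 3))
```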