Abstract
Assessors make preference judgments faster and more consistently than graded judgments. Preference judgments can also recognize distinctions between items that appear equivalent under graded judgments. Unfortunately, preference judgments can require more than linear effort to fully order a pool of items, and evaluation measures for preference judgments are not as well established as those for graded judgments, such as NDCG. In this article, we explore the assessment process for partial preference judgments, with the aim of identifying and ordering the top items in the pool, rather than fully ordering the entire pool. To measure the performance of a ranker, we compare its output to this preferred ordering by applying a rank similarity measure. We demonstrate the practical feasibility of this approach by crowdsourcing partial preferences for the TREC 2019 Conversational Assistance Track, replacing NDCG with a new measure named compatibility. This new measure has its most striking impact when comparing modern neural rankers, where it is able to recognize significant improvements in quality that would otherwise be missed by NDCG.
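Below is a minimal sketch of how such a comparison could be computed, assuming compatibility is taken to be a top-weighted rank similarity in the spirit of rank-biased overlap (RBO), normalized against the preferred ordering itself. The function names, the persistence parameter p, and the example documents are illustrative assumptions, not the article's own code.

```python
# Sketch: scoring a ranker's output by its top-weighted similarity to the
# preferred ordering of the top items. Assumes an RBO-style overlap measure;
# names and parameter choices here are illustrative.

def rbo(ranking, ideal, p=0.95, depth=None):
    """Top-weighted overlap between two rankings, with weight p**(d-1) at depth d."""
    if depth is None:
        depth = len(ranking)
    score = 0.0
    seen_ranking, seen_ideal = set(), set()
    for d in range(1, depth + 1):
        if d <= len(ranking):
            seen_ranking.add(ranking[d - 1])
        if d <= len(ideal):
            seen_ideal.add(ideal[d - 1])
        overlap = len(seen_ranking & seen_ideal)
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score

def compatibility(ranking, ideal, p=0.95):
    """Normalize by the score the preferred ordering gives itself, so a ranking
    that begins with the preferred items in order scores 1.0."""
    best = rbo(ideal, ideal, p=p, depth=len(ranking))
    return rbo(ranking, ideal, p=p) / best if best > 0 else 0.0

# Example: a preferred ordering of the top items induced by partial preference
# judgments, compared against one system's ranking (document ids are made up).
ideal = ["d3", "d7", "d1"]
system = ["d3", "d1", "d9", "d7", "d2"]
print(round(compatibility(system, ideal), 3))
```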