A Preference Judgment Tool for Authoritative Assessment

ABSTRACT
Preference judgments have been established as an effective method for the offline evaluation of information retrieval systems, with advantages over graded or binary relevance judgments. Graded judgments assign each document a pre-defined grade level, while preference judgments present a pair of items side by side and ask the assessor to indicate which is better. However, preference judgments may require a larger number of assessments, and the evaluation measures that can exploit them are limited. In this study, we present a new preference judgment tool called JUDGO, designed for expert assessors and researchers. The tool is supported by a new heap-like preference judgment algorithm that assumes transitivity and allows for ties. An earlier version of the tool was employed by NIST to determine up to the top-10 best items for each of the 38 topics of the TREC 2022 Health Misinformation track, with over 2,200 judgments collected. The current version has been applied in a separate research study to collect almost 10,000 judgments, with multiple assessors completing each topic. The code and resources are available at https://judgo-system.github.io.
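The algorithm itself is specified in the paper; as a rough illustration of how a heap-based top-k selection can be driven by pairwise preferences that allow ties, consider the minimal Python sketch below. Everything in it is an assumption for illustration: the `judge` oracle (simulated here with hidden scores, where JUDGO would instead record a human side-by-side judgment), the memoization cache, and the helper names `top_k_preferences`, `sift_down`, and `pop_max` are hypothetical, not JUDGO's actual implementation.

```python
import random

def make_simulated_judge(scores):
    """Build a preference oracle backed by hidden scores (simulation only).
    In a tool like JUDGO, this call would be a human side-by-side judgment."""
    def judge(a, b):
        # 1: a preferred, -1: b preferred, 0: tie
        if scores[a] > scores[b]:
            return 1
        if scores[a] < scores[b]:
            return -1
        return 0
    return judge

def compare(a, b, judge, cache):
    """Memoize judgments so each pair is shown to the assessor at most once."""
    if (a, b) not in cache:
        cache[(a, b)] = judge(a, b)
        cache[(b, a)] = -cache[(a, b)]
    return cache[(a, b)]

def sift_down(heap, i, judge, cache):
    """Restore the max-heap property at index i using judged preferences.
    Ties cause no swap, so tied items may sit at either level of the heap."""
    n = len(heap)
    while True:
        best = i
        for child in (2 * i + 1, 2 * i + 2):
            if child < n and compare(heap[child], heap[best], judge, cache) > 0:
                best = child
        if best == i:
            return
        heap[i], heap[best] = heap[best], heap[i]
        i = best

def pop_max(heap, judge, cache):
    """Remove and return the root, then re-sift the displaced leaf."""
    top = heap[0]
    heap[0] = heap[-1]
    heap.pop()
    if heap:
        sift_down(heap, 0, judge, cache)
    return top

def top_k_preferences(items, k, judge):
    """Heap-select roughly the top-k items, grouping tied items into tiers."""
    heap = list(items)
    cache = {}
    # Build the heap bottom-up.
    for i in range(len(heap) // 2 - 1, -1, -1):
        sift_down(heap, i, judge, cache)
    ranked, found = [], 0
    while heap and found < k:
        top = pop_max(heap, judge, cache)
        tier = [top]
        # Items judged tied with the popped maximum share its rank.
        while heap and compare(heap[0], top, judge, cache) == 0:
            tier.append(pop_max(heap, judge, cache))
        ranked.append(tier)
        found += len(tier)
    return ranked

# Usage: rank 12 simulated documents and print the top-5 as tiers.
docs = [f"doc{i:02d}" for i in range(12)]
scores = {d: random.randint(0, 3) for d in docs}  # hidden "true" relevance
judge = make_simulated_judge(scores)
for rank, tier in enumerate(top_k_preferences(docs, 5, judge), start=1):
    print(rank, tier)
```

Because transitivity is assumed, each judged pair can be cached and reused across sift operations, and tied items surface as a tier at the same rank rather than being forced into an arbitrary order; this is, in broad strokes, why a heap-like procedure needs far fewer judgments than exhaustive pairwise comparison (roughly O(n + k log n) comparisons versus O(n^2)).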