DOI: 10.1145/3459637.3482428
Research article

Evaluating Relevance Judgments with Pairwise Discriminative Power

Published: 30 October 2021

Abstract

Relevance judgments play an essential role in the evaluation of information retrieval systems. As many different relevance judgment settings have been proposed in recent years, an evaluation metric that can compare relevance judgments collected under different annotation settings has become a necessity. Traditional metrics such as κ, Krippendorff's α, and Φ mainly focus on inter-assessor consistency to evaluate the quality of relevance judgments. They encounter the "reliable but useless" problem when employed to compare different annotation settings (e.g., binary vs. 4-grade judgments). Meanwhile, other popular metrics such as discriminative power (DP) are not designed to compare relevance judgments across annotation settings; they therefore suffer from limitations such as requiring result ranking lists from different systems. How to design an evaluation metric that compares relevance judgments under different grade settings thus needs further investigation. In this work, we propose a novel metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections. By leveraging a small number of document-level preference tests, PDP estimates the ability of relevance judgments to separate ranking lists of varying quality. With comprehensive experiments on both synthetic and real-world datasets, we show that PDP maintains a high degree of consistency with annotation quality across various grade settings. Compared with existing metrics (e.g., Krippendorff's α, Φ, and DP), it provides reliable evaluation results with affordable additional annotation effort.
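To make the intuition behind preference-based evaluation of judgment quality concrete, the sketch below compares two hypothetical annotation settings against a handful of document-level preference tests by counting how often each judgment set ranks the preferred document strictly higher than its counterpart. This is only an illustrative proxy for the idea described in the abstract, assuming made-up toy data (binary_judgments, graded_judgments, and preference_tests are invented for illustration); it is not the paper's exact definition or estimation procedure for PDP.

```python
# Illustrative sketch (not the paper's exact formulation): given graded
# relevance judgments from two annotation settings and a small set of
# document-level preference tests, estimate how often each judgment set
# orders the preferred document strictly above the other one.

from typing import Dict, List, Tuple


def preference_agreement(
    judgments: Dict[str, int],           # doc_id -> relevance grade
    preferences: List[Tuple[str, str]],  # (preferred_doc, other_doc) pairs
) -> float:
    """Fraction of preference pairs the graded judgments resolve in the
    right direction; ties count as unresolved and receive no credit."""
    if not preferences:
        return 0.0
    correct = sum(
        1 for preferred, other in preferences
        if judgments.get(preferred, 0) > judgments.get(other, 0)
    )
    return correct / len(preferences)


# Hypothetical toy data: a binary setting vs. a 4-grade setting for the
# same five documents, plus three side-by-side preference tests.
binary_judgments = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1}
graded_judgments = {"d1": 3, "d2": 1, "d3": 2, "d4": 0, "d5": 3}
preference_tests = [("d1", "d3"), ("d3", "d2"), ("d5", "d4")]

for name, judg in [("binary", binary_judgments), ("4-grade", graded_judgments)]:
    print(name, round(preference_agreement(judg, preference_tests), 2))
```

On this toy data the binary setting resolves two of the three preference pairs (0.67) because it cannot separate d1 from d3, while the 4-grade setting resolves all three (1.00), echoing the abstract's point that an annotation setting can be internally consistent yet less able to discriminate between documents of different quality.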

    Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
October 2021, 4966 pages
ISBN: 9781450384469
DOI: 10.1145/3459637

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. evaluation metric
    2. preference test
    3. relevance judgment
