DOI: 10.1145/3291992.3291995

Pairwise Crowd Judgments: Preference, Absolute, and Ratio

Published: 11 December 2018

Abstract

Relevance judgments are conventionally formed by small numbers of experts using ordinal relevance scales defined by two or more relevance categories. Such judgments often contain many ties: documents in the same category that cannot be separated by relevance. Here we explore the use of crowd-sourcing and combined three-way relevance assessments using pairwise preference, absolute relevance, and relevance ratio, with forced-choice testing and embedded quality control processes, seeking to reduce assessment ties and to increase judgment consistency. In particular, the crowd-sourced judgments from these three approaches were normalized into numeric relevance scores, and compared against judgments arising via three previous techniques: NIST binary; Sormunen; and magnitude estimation. The relationship between generated judgment reliability and the number of document pairs assessed was also explored, as was the effect that factors such as document length, topic difficulty, number of documents judged, and assessment time have on assessment reliability. Lastly, we investigate the extent to which the methodology used to collect judgments affects the ability of an experiment to discriminate between IR systems.
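
As an aside for readers of this abstract: one simple way that pairwise preference judgments are often reduced to per-document numeric scores is a win-rate aggregation over all comparisons a document takes part in. The sketch below illustrates only that assumption; the normalization procedure actually used in the paper is not specified in the abstract, and the function and variable names here are hypothetical.

```python
from collections import defaultdict

def win_rate_scores(preferences):
    """preferences: iterable of (winner_doc_id, loser_doc_id) pairs,
    one per crowd-sourced forced-choice comparison.
    Returns a dict mapping each document id to the fraction of its
    comparisons that it won (a crude numeric relevance score)."""
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for winner, loser in preferences:
        wins[winner] += 1
        comparisons[winner] += 1
        comparisons[loser] += 1
    return {doc: wins[doc] / comparisons[doc] for doc in comparisons}

# Hypothetical example: workers compare documents A, B, C for one topic.
judged_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
print(win_rate_scores(judged_pairs))
# -> {'A': 1.0, 'B': 0.333..., 'C': 0.0}; A wins every comparison it appears in.
```

More elaborate aggregations (for example, Bradley-Terry style models) infer latent scores rather than raw win rates, but the win-rate view is enough to see why forced-choice pairwise data can yield a finer-grained ordering than a small number of ordinal categories.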

References

[1] O. Alonso, D. E. Rose, and B. Stewart. Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2):9--15, 2008.
[2] P. Bailey, N. Craswell, I. Soboroff, P. Thomas, A. P. de Vries, and E. Yilmaz. Relevance assessment: Are judges exchangeable and does it matter? In Proc. SIGIR, pages 667--674, 2008.
[3] S. Bozóki, J. Fülöp, and L. Rónyai. On optimal completion of incomplete pairwise comparison matrices. Math. and Comp. Modelling, 52(1):318--333, 2010.
[4] B. Carterette and D. Petkova. Learning a ranking from pairwise preferences. In Proc. SIGIR, pages 629--630, 2006.
[5] B. Carterette, P. N. Bennett, D. M. Chickering, and S. T. Dumais. Here or there: Preference judgments for relevance. In Proc. ECIR, pages 16--27, 2008.
[6] P. Chandar and B. Carterette. Using preference judgments for novel document retrieval. In Proc. SIGIR, pages 861--870, 2012.
[7] X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In Proc. WSDM, pages 193--202, 2013.
[8] C. L. A. Clarke, F. Scholer, and I. Soboroff. The TREC 2005 terabyte track. In Proc. TREC, 2005.
[9] T. Damessie, F. Scholer, and J. S. Culpepper. The influence of topic difficulty, relevance level, and document ordering on relevance judging. In Proc. ADCS, 2016.
[10] R. V. Katter. The influence of scale form on relevance judgments. Inf. Str. & Retri., 4(1):1--11, 1968.
[11] M. Lease and E. Yilmaz. Crowdsourcing for information retrieval: Introduction to the special issue. Inf. Retr., 16(2):91--100, 2013.
[12] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Sys., 27(1):2.1--2.27, 2008.
[13] K. Roitero, E. Maddalena, G. Demartini, and S. Mizzaro. On fine-grained relevance scales. In Proc. SIGIR, pages 675--684, 2018.
[14] F. Scholer and A. Turpin. Metric and relevance mismatch in retrieval evaluation. In Proc. AIRS, pages 50--62, 2009.
[15] F. Scholer, A. Turpin, and M. Wu. Measuring user relevance criteria. In Proc. EVIA, pages 50--62, 2008.
[16] F. Scholer, A. Turpin, and M. Sanderson. Quantifying test collection quality based on the consistency of relevance judgements. In Proc. SIGIR, pages 1063--1072, 2011.
[17] E. Sormunen. Liberal relevance criteria of TREC: Counting on negligible documents? In Proc. SIGIR, pages 324--330, 2002.
[18] A. Turpin and F. Scholer. Modelling disagreement between judges for information retrieval system evaluation. In Proc. ADCS, page 51, 2009.
[19] A. Turpin, F. Scholer, S. Mizzaro, and E. Maddalena. The benefits of magnitude estimation relevance assessments for information retrieval evaluation. In Proc. SIGIR, pages 565--574, 2015.
[20] E. M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Inf. Proc. & Man., 36(5):697--716, 2000.
[21] E. M. Voorhees and D. Harman. Overview of the eighth Text REtrieval Conference. In Proc. TREC, 1999.
[22] Z. Yang, A. Moffat, and A. Turpin. How precise does document scoring need to be? In Proc. AIRS, pages 279--291, 2016.

    Information

    Published In

    ADCS '18: Proceedings of the 23rd Australasian Document Computing Symposium
    December 2018
    78 pages
    ISBN:9781450365499
    DOI:10.1145/3291992
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    In-Cooperation

    • Department of Information Science, University of Otago, Dunedin, New Zealand

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Pairwise preference
    2. crowd-sourcing
    3. relevance assessment
    4. test collection

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ADCS '18: 23rd Australasian Document Computing Symposium
    December 11 - 12, 2018
    Dunedin, New Zealand

    Cited By

    • (2024) Comparing point-wise and pair-wise relevance judgment with brain signals. Journal of the Association for Information Science and Technology. DOI: 10.1002/asi.24936. Online publication date: 18-Jun-2024.
    • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems, 42(1):1--26. DOI: 10.1145/3597201. Online publication date: 22-May-2023.
    • (2023) Perspectives on Large Language Models for Relevance Judgment. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pages 39--50. DOI: 10.1145/3578337.3605136. Online publication date: 9-Aug-2023.
    • (2023) A Preference Judgment Tool for Authoritative Assessment. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3100--3104. DOI: 10.1145/3539618.3591801. Online publication date: 19-Jul-2023.
    • (2023) Preference-Based Offline Evaluation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 1248--1251. DOI: 10.1145/3539597.3572725. Online publication date: 27-Feb-2023.
    • (2022) Preferences on a Budget: Prioritizing Document Pairs when Crowdsourcing Relevance Judgments. In Proceedings of the ACM Web Conference 2022, pages 319--327. DOI: 10.1145/3485447.3511960. Online publication date: 25-Apr-2022.
    • (2022) Human Preferences as Dueling Bandits. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 567--577. DOI: 10.1145/3477495.3531991. Online publication date: 6-Jul-2022.
    • (2022) Batch Evaluation Metrics in Information Retrieval: Measures, Scales, and Meaning. IEEE Access, 10:105564--105577. DOI: 10.1109/ACCESS.2022.3211668. Online publication date: 2022.
    • (2022) Shallow pooling for sparse labels. Information Retrieval Journal, 25(4):365--385. DOI: 10.1007/s10791-022-09411-0. Online publication date: 20-Jul-2022.
    • (2021) Evaluating Relevance Judgments with Pairwise Discriminative Power. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 261--270. DOI: 10.1145/3459637.3482428. Online publication date: 26-Oct-2021.
