Are Secondary Assessors Uncertain When They Disagree About Relevance Judgements?

ABSTRACT
The collection of relevance judgements by assessors is important for many information retrieval (IR) tasks. In addition to the construction of test collections, relevance judging is critical to e-discovery and other applications where many assessors are hired to judge relevance. It is well known that assessors may differ in their judgements of a given document. One possible cause of such a difference is that an assessor is uncertain about a judgement and thus is, in effect, guessing the document's relevance. If assessors are aware of their uncertainty and can self-report their level of certainty, then uncertain relevance judgements can be targeted for adjudication by additional assessors. In this paper, we report on a user study with 48 participants that tests our hypothesis that assessors will be uncertain about their relevance judgements when they are likely to disagree with each other. We found that assessors judge low-consensus documents, i.e. documents known to produce assessor disagreement, with almost as much certainty as high-consensus documents. In particular, assessor self-reported uncertainty is predictive of disagreement only for high-consensus documents and not for low-consensus documents.
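To make the adjudication idea in the abstract concrete, the following is a minimal sketch, not taken from the paper, of how self-reported certainty could be used to triage judgements for review. The certainty scale, the threshold value, and the `Judgement` and `needs_adjudication` names are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

@dataclass
class Judgement:
    doc_id: str
    assessor_id: str
    relevant: bool   # the assessor's binary relevance judgement
    certainty: int   # self-reported certainty, 1 (guessing) to 5 (sure)

# Hypothetical threshold: judgements at or below this certainty are
# treated as possible guesses and routed to additional assessors.
CERTAINTY_THRESHOLD = 2

def needs_adjudication(j: Judgement) -> bool:
    """Flag a judgement for adjudication by another assessor when the
    original assessor reports low certainty."""
    return j.certainty <= CERTAINTY_THRESHOLD

judgements = [
    Judgement("doc-001", "a1", relevant=True,  certainty=5),
    Judgement("doc-002", "a1", relevant=False, certainty=1),
]
to_review = [j for j in judgements if needs_adjudication(j)]
print([j.doc_id for j in to_review])  # ['doc-002']
```

One caveat that follows from the paper's own finding: a triage rule like this would catch disagreement mainly on high-consensus documents, since self-reported uncertainty was not predictive of disagreement on low-consensus documents.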