ABSTRACT
Growing interest in online collections of digital books and video content motivates the development and optimization of suitable retrieval systems. However, traditional methods for collecting the relevance assessments needed to tune system performance are challenged by the nature of the digital items in such collections: assessors face considerable effort in reviewing and assessing content through extensive reading, browsing, and within-document searching. This extra strain stems from the length and cohesion of each digital item and the dispersion of topics within it. We propose a method for the collective gathering of relevance assessments that uses a social game model to stimulate participants' engagement. The game provides incentives for assessors to follow a predefined review procedure and makes provisions for quality control of the collected relevance judgments. We discuss the approach in detail and present the results of a pilot study, conducted on a book corpus, that validates the approach. Our analysis reveals intricate relationships between the affordances of the system, the incentives of the social game, and the behavior of the assessors. We show that the proposed game design achieves its two designated goals: the incentive structure motivates endurance in assessors, and the review process encourages truthful assessment.
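The abstract does not spell out the quality-control mechanism, so the following is only a minimal sketch of one common approach such a system could use: collect redundant judgments per item from several game participants and score each assessor by agreement with the per-item majority label. All names and the sample data here are hypothetical illustrations, not the paper's actual procedure.

```python
# Hypothetical sketch (not the paper's actual mechanism): aggregate
# redundant relevance judgments from game participants and use
# inter-assessor agreement as a simple quality-control signal.
from collections import defaultdict

def majority_label(labels):
    """Return the most frequent relevance label among assessors."""
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1
    return max(counts, key=counts.get)

def assessor_agreement(judgments):
    """
    judgments: list of (assessor, item, label) triples.
    Returns, per assessor, the fraction of their labels that match the
    per-item majority label -- a crude proxy for judgment quality.
    """
    by_item = defaultdict(list)
    for assessor, item, label in judgments:
        by_item[item].append(label)

    consensus = {item: majority_label(votes) for item, votes in by_item.items()}

    hits, totals = defaultdict(int), defaultdict(int)
    for assessor, item, label in judgments:
        totals[assessor] += 1
        hits[assessor] += (label == consensus[item])
    return {a: hits[a] / totals[a] for a in totals}

if __name__ == "__main__":
    sample = [
        ("anna", "book1/page12", "relevant"),
        ("ben",  "book1/page12", "relevant"),
        ("carl", "book1/page12", "irrelevant"),
        ("anna", "book2/page3",  "irrelevant"),
        ("ben",  "book2/page3",  "irrelevant"),
    ]
    print(assessor_agreement(sample))
    # -> {'anna': 1.0, 'ben': 1.0, 'carl': 0.0}
```

In a game setting, such an agreement score could feed the incentive structure (e.g., awarding points only for judgments that later match consensus), which is one way a design can discourage careless or untruthful assessment.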
Recommendations
On information retrieval metrics designed for evaluation with incomplete relevance assessments
Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, ...
Liberal relevance criteria of TREC: counting on negligible documents?
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. Most test collections (like TREC and CLEF) for experimental research in information retrieval apply binary relevance assessments. This paper introduces a four-point relevance scale and reports the findings of a project in which TREC-7 and TREC-8 ...
Relevance assessments, bibliometrics, and altmetrics: a quantitative study on PubMed and arXiv
Relevance is a key element for analyzing bibliometrics and information retrieval (IR). In both domains, relevance decisions are discussed theoretically and sometimes evaluated in empirical studies. IR research is often based on test collections ...