ABSTRACT
Growing interest in online collections of digital books and video content motivates the development and optimization of suitable retrieval systems. However, traditional methods for collecting the relevance assessments needed to tune system performance are challenged by the nature of the digital items in such collections: assessors face considerable effort in reviewing and assessing content through extensive reading, browsing, and within-document searching. This extra strain stems from the length and cohesion of each digital item and the dispersion of topics within it. We propose a method for the collective gathering of relevance assessments that uses a social game model to stimulate participants' engagement. The game provides incentives for assessors to follow a predefined review procedure and makes provisions for quality control of the collected relevance judgments. We discuss the approach in detail and present the results of a pilot study, conducted on a book corpus, that validates the approach. Our analysis reveals intricate relationships between the affordances of the system, the incentives of the social game, and the behavior of the assessors. We show that the proposed game design achieves its two designated goals: the incentive structure motivates endurance in assessors, and the review process encourages truthful assessment.
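The abstract does not spell out the quality-control mechanism, so the following is only a minimal sketch of one common approach such a system could use: collect redundant judgments per item from several game participants and score each assessor by agreement with the per-item majority label. All names and the sample data here are hypothetical illustrations, not the paper's actual procedure.

```python
# Hypothetical sketch (not the paper's actual mechanism): aggregate
# redundant relevance judgments from game participants and use
# inter-assessor agreement as a simple quality-control signal.
from collections import defaultdict

def majority_label(labels):
    """Return the most frequent relevance label among assessors."""
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1
    return max(counts, key=counts.get)

def assessor_agreement(judgments):
    """
    judgments: list of (assessor, item, label) triples.
    Returns, per assessor, the fraction of their labels that match the
    per-item majority label -- a crude proxy for judgment quality.
    """
    by_item = defaultdict(list)
    for assessor, item, label in judgments:
        by_item[item].append(label)

    consensus = {item: majority_label(votes) for item, votes in by_item.items()}

    hits, totals = defaultdict(int), defaultdict(int)
    for assessor, item, label in judgments:
        totals[assessor] += 1
        hits[assessor] += (label == consensus[item])
    return {a: hits[a] / totals[a] for a in totals}

if __name__ == "__main__":
    sample = [
        ("anna", "book1/page12", "relevant"),
        ("ben",  "book1/page12", "relevant"),
        ("carl", "book1/page12", "irrelevant"),
        ("anna", "book2/page3",  "irrelevant"),
        ("ben",  "book2/page3",  "irrelevant"),
    ]
    print(assessor_agreement(sample))
    # -> {'anna': 1.0, 'ben': 1.0, 'carl': 0.0}
```

In a game setting, such an agreement score could feed the incentive structure (e.g., awarding points only for judgments that later match consensus), which is one way a design can discourage careless or untruthful assessment.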
Recommendations
On information retrieval metrics designed for evaluation with incomplete relevance assessments
Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, ...
Liberal relevance criteria of TREC: counting on negligible documents?
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. Most test collections (like TREC and CLEF) for experimental research in information retrieval apply binary relevance assessments. This paper introduces a four-point relevance scale and reports the findings of a project in which TREC-7 and TREC-8 ...
Relevance assessments, bibliometrics, and altmetrics: a quantitative study on PubMed and arXiv
Relevance is a key element for analyzing bibliometrics and information retrieval (IR). In both domains, relevance decisions are discussed theoretically and sometimes evaluated in empirical studies. IR research is often based on test collections ...