Abstract
Crowdsourcing relevance judgments for test collection construction is attractive because it can be less expensive than hiring high-quality assessors. A problem with all crowdsourced judgments, even those formed from the consensus of multiple workers, is that they will differ from the judgments produced by high-quality assessors. For two TREC test collections, we simulated errors in sets of judgments and measured the effect of these errors on effectiveness measures. We found that some measures appear to be more tolerant of errors than others. We also found that achieving high rank correlation in the ranking of retrieval systems requires conservative judging for average precision (AP) and nDCG, while precision at rank 10 requires neutral judging behavior. Conservative judging avoids mistakenly judging non-relevant documents as relevant at the cost of judging some relevant documents as non-relevant. In addition, while conservative judging maximizes rank correlation for AP and nDCG, minimizing the error in the measures' values requires more liberal behavior. Depending on its nature, a set of crowdsourced judgments may be better suited to some effectiveness measures than others, and some effectiveness measures will require higher levels of judgment quality than others.
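The paper's experiments use TREC runs and qrels; the following is only a minimal, self-contained sketch (not the authors' code) of the general approach the abstract describes: perturb a set of gold judgments under assumed true-positive and false-positive rates that model conservative, neutral, and liberal judging, rescore a set of (here synthetic) system rankings with precision at 10, and compare the resulting system ranking to the gold ranking with Kendall's tau. All data, parameter values, and function names below are illustrative assumptions.

# Minimal sketch (assumed, not the authors' code) of the simulation idea in the abstract:
# perturb gold relevance judgments under different judging behaviors and check how
# stable the system ranking stays, measured here with precision@10 and Kendall's tau.
import random
from itertools import combinations

def simulate_judgments(gold, tpr, fpr, rng):
    """Simulate an error-prone assessor: a relevant document is judged relevant
    with probability tpr; a non-relevant document is mistakenly judged relevant
    with probability fpr. Conservative judging = low fpr (fewer false positives);
    liberal judging = high tpr and high fpr."""
    return {doc: (rng.random() < tpr) if rel else (rng.random() < fpr)
            for doc, rel in gold.items()}

def precision_at_k(ranking, judgments, k=10):
    """Fraction of the top-k documents judged relevant."""
    return sum(judgments[doc] for doc in ranking[:k]) / k

def kendall_tau(a, b):
    """Kendall's tau-a between two score vectors over the same systems
    (ties counted as neither concordant nor discordant; a simplification)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)

rng = random.Random(42)
docs = [f"d{i}" for i in range(200)]
gold = {d: rng.random() < 0.3 for d in docs}                 # synthetic qrels
systems = [rng.sample(docs, len(docs)) for _ in range(20)]   # synthetic run rankings

gold_scores = [precision_at_k(run, gold) for run in systems]
# Illustrative (tpr, fpr) settings for conservative, neutral, and liberal judging.
for label, tpr, fpr in [("conservative", 0.60, 0.05),
                        ("neutral", 0.75, 0.15),
                        ("liberal", 0.90, 0.40)]:
    noisy = simulate_judgments(gold, tpr, fpr, rng)
    noisy_scores = [precision_at_k(run, noisy) for run in systems]
    print(f"{label:12s} tau vs. gold ranking: {kendall_tau(gold_scores, noisy_scores):.3f}")

The same loop can be repeated over many random trials and extended to AP and nDCG to compare how quickly each measure's system ranking degrades as judging behavior moves away from the gold standard.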
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Li, L., Smucker, M.D. (2014). Tolerance of Effectiveness Measures to Relevance Judging Errors. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_13
DOI: https://doi.org/10.1007/978-3-319-06028-6_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science, Computer Science (R0)