
Tolerance of Effectiveness Measures to Relevance Judging Errors

  • Conference paper
Advances in Information Retrieval (ECIR 2014)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8416)

Included in the following conference series: European Conference on Information Retrieval (ECIR)

Abstract

Crowdsourcing relevance judgments for test collection construction is attractive because it can be more affordable than hiring high-quality assessors. A problem faced by all crowdsourced judgments – even judgments formed from the consensus of multiple workers – is that they will differ from the judgments produced by high-quality assessors. For two TREC test collections, we simulated errors in sets of judgments and then measured the effect of these errors on effectiveness measures. We found that some measures appear to be more tolerant of errors than others. We also found that achieving high rank correlation in the ranking of retrieval systems requires conservative judging for average precision (AP) and nDCG, while precision at rank 10 requires neutral judging behavior. Conservative judging avoids mistakenly judging non-relevant documents as relevant at the cost of judging some relevant documents as non-relevant. In addition, we found that while conservative judging behavior maximizes rank correlation for AP and nDCG, minimizing the error in the measures’ values requires more liberal behavior. Depending on the nature of a set of crowdsourced judgments, the judgments may be better suited to some effectiveness measures than others, and some effectiveness measures will require higher levels of judgment quality than others.
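As a rough illustration of the kind of simulation the abstract describes (not the authors' actual code), the Python sketch below injects judging errors into a set of binary gold judgments at chosen rates, scores a ranked list with average precision under both the gold and the noisy judgments, and compares system rankings with a simplified Kendall's tau. All names here (simulate_errors, the qrels layout, the error rates) are illustrative assumptions; a high relevant-to-non-relevant rate with a low non-relevant-to-relevant rate models the conservative judging behavior discussed above, and the reverse models liberal judging.

```python
import random
from itertools import combinations


def simulate_errors(qrels, p_rel_to_nonrel, p_nonrel_to_rel, rng):
    """Flip gold binary judgments with the given error rates.

    Conservative judging: high p_rel_to_nonrel, low p_nonrel_to_rel.
    Liberal judging: the reverse. qrels maps (topic, doc) -> 0 or 1.
    """
    noisy = {}
    for key, rel in qrels.items():
        if rel == 1 and rng.random() < p_rel_to_nonrel:
            noisy[key] = 0
        elif rel == 0 and rng.random() < p_nonrel_to_rel:
            noisy[key] = 1
        else:
            noisy[key] = rel
    return noisy


def average_precision(ranked_docs, topic, qrels):
    """AP of one ranked list; unjudged documents count as non-relevant."""
    num_rel = sum(rel for (t, _), rel in qrels.items() if t == topic)
    if num_rel == 0:
        return 0.0
    hits, ap = 0, 0.0
    for i, doc in enumerate(ranked_docs, start=1):
        if qrels.get((topic, doc), 0) == 1:
            hits += 1
            ap += hits / i
    return ap / num_rel


def kendall_tau(scores_a, scores_b):
    """Simplified Kendall's tau between two system-score dicts (no tie correction)."""
    systems = list(scores_a)
    concordant = discordant = 0
    for s1, s2 in combinations(systems, 2):
        d = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy gold judgments for one topic: d0..d4 relevant, d5..d19 not.
    qrels = {("t1", f"d{i}"): int(i < 5) for i in range(20)}
    # Conservative errors: relevant docs are often missed, few false positives.
    noisy = simulate_errors(qrels, p_rel_to_nonrel=0.3, p_nonrel_to_rel=0.05, rng=rng)
    run = [f"d{i}" for i in range(20)]          # one system's ranked list for t1
    print(average_precision(run, "t1", qrels))  # AP under gold judgments
    print(average_precision(run, "t1", noisy))  # AP under simulated errors
    # Over many systems, per-system scores under gold vs. noisy judgments would be
    # compared with kendall_tau to see how well each measure preserves the ranking.
    print(kendall_tau({"sysA": 0.8, "sysB": 0.6, "sysC": 0.4},
                      {"sysA": 0.7, "sysB": 0.65, "sysC": 0.3}))
```

With real TREC runs and qrels, sweeping the two error rates would trace out how quickly each measure's system ranking and values degrade, which is the comparison the paper reports.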

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, L., Smucker, M.D. (2014). Tolerance of Effectiveness Measures to Relevance Judging Errors. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_13


  • DOI: https://doi.org/10.1007/978-3-319-06028-6_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06027-9

  • Online ISBN: 978-3-319-06028-6

  • eBook Packages: Computer Science, Computer Science (R0)
