DOI: 10.1145/3498366.3505815

The Crowd is Made of People: Observations from Large-Scale Crowd Labelling

Published: 14 March 2022

ABSTRACT

Like many other researchers, at Microsoft Bing we use external “crowd” judges to label results from a search engine—especially, although not exclusively, to obtain relevance labels for offline evaluation in the Cranfield tradition. Crowdsourced labels are relatively cheap, and hence very popular, but are prone to disagreements, spam, and various biases which appear to be unexplained “noise” or “error”. In this paper, we provide examples of problems we have encountered running crowd labelling at large scale and around the globe, for search evaluation in particular. We demonstrate effects due to the time of day and day of week that a label is given; fatigue; anchoring; exposure; left-side bias; task switching; and simple disagreement between judges. Rather than simple “error”, these effects are consistent with well-known physiological and cognitive factors. “The crowd” is not some abstract machinery, but is made of people. Human factors that affect people’s judgement behaviour must be considered when designing research evaluations and in interpreting evaluation metrics.
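
The abstract reports effects tied to the time of day and day of week at which a label is given, among other human factors. As an illustration only (not the authors' analysis), the sketch below shows one common way to probe for such effects: a linear mixed-effects model with a random intercept per judge, written in Python with statsmodels. The file name and column names (crowd_labels.csv, label, timestamp, judge_id) are assumptions made for the example.

```python
# Hypothetical sketch: test whether crowd labels drift with time of day or
# day of week, while a random intercept per judge absorbs each judge's
# stable leniency or severity. Treats the ordinal label as numeric for a
# quick check; this is not the paper's method, only one plausible approach.
import pandas as pd
import statsmodels.formula.api as smf

# Assumed schema: one row per label, with a numeric label value, a timestamp,
# and an anonymised judge identifier.
labels = pd.read_csv("crowd_labels.csv", parse_dates=["timestamp"])
labels["hour"] = labels["timestamp"].dt.hour        # hour of day (judge-local time if available)
labels["dow"] = labels["timestamp"].dt.day_name()   # day of week

# Fixed effects for hour and day of week; random intercept grouped by judge.
model = smf.mixedlm("label ~ C(hour) + C(dow)", data=labels, groups=labels["judge_id"])
result = model.fit()
print(result.summary())
```

Under these assumptions, systematically higher or lower labels at particular hours or days would show up as significant fixed-effect coefficients rather than being absorbed into per-judge variation.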


Published in

CHIIR '22: Proceedings of the 2022 Conference on Human Information Interaction and Retrieval
March 2022, 399 pages
ISBN: 9781450391863
DOI: 10.1145/3498366

        Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance rate: 55 of 163 submissions, 34%
