ABSTRACT
Like many other researchers, at Microsoft Bing we use external “crowd” judges to label results from a search engine—especially, although not exclusively, to obtain relevance labels for offline evaluation in the Cranfield tradition. Crowdsourced labels are relatively cheap, and hence very popular, but are prone to disagreements, spam, and various biases that can appear as unexplained “noise” or “error”. In this paper, we give examples of problems we have encountered running crowd labelling at large scale and around the globe, for search evaluation in particular. We demonstrate effects due to the time of day and day of week at which a label is given; fatigue; anchoring; exposure; left-side bias; task switching; and simple disagreement between judges. Rather than simple “error”, these effects are consistent with well-known physiological and cognitive factors. “The crowd” is not some abstract machinery, but is made of people. Human factors that affect people’s judgement behaviour must be considered when designing research evaluations and when interpreting evaluation metrics.
The Crowd is Made of People: Observations from Large-Scale Crowd Labelling