ABSTRACT
Like many other researchers, at Microsoft Bing we use external “crowd” judges to label results from a search engine—especially, although not exclusively, to obtain relevance labels for offline evaluation in the Cranfield tradition. Crowdsourced labels are relatively cheap, and hence very popular, but are prone to disagreements, spam, and various biases that can appear as unexplained “noise” or “error”. In this paper, we give examples of problems we have encountered running crowd labelling at large scale and around the globe, for search evaluation in particular. We demonstrate effects due to the time of day and day of week at which a label is given; fatigue; anchoring; exposure; left-side bias; task switching; and simple disagreement between judges. Rather than simple “error”, these effects are consistent with well-known physiological and cognitive factors. “The crowd” is not some abstract machinery, but is made of people. Human factors that affect people’s judgement behaviour must be considered when designing research evaluations and when interpreting evaluation metrics.
The Crowd is Made of People: Observations from Large-Scale Crowd Labelling