
Resolvable vs. Irresolvable Disagreement: A Study on Worker Deliberation in Crowd Work

Published: 01 November 2018

Abstract

Crowdsourced classification of data typically assumes that objects can be unambiguously classified into categories. In practice, many classification tasks are ambiguous due to various forms of disagreement. Prior work shows that exchanging verbal justifications can significantly improve answer accuracy over aggregation techniques. In this work, we study how worker deliberation affects resolvability and accuracy using case studies with both an objective and a subjective task. Results show that case resolvability depends on various factors, including the level of and reasons for the initial disagreement, as well as the amount and quality of deliberation activities. Our work reinforces the finding that deliberation can increase answer accuracy and underscores the importance of verbal discussion in this process. We contribute a new public data set on worker deliberation for text classification tasks, and discuss considerations for the design of deliberation workflows for classification.
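For context on the aggregation baselines the abstract alludes to, the following is a minimal sketch of majority-vote label aggregation in Python. This is an illustration only: the paper does not prescribe this code, and majority voting stands in here for aggregation techniques generally.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate one item's crowd labels by majority vote.

    Ties are resolved arbitrarily here, which is one way
    ambiguous items can end up with unstable labels.
    """
    return Counter(labels).most_common(1)[0][0]

# Three workers disagree on a hypothetical text-classification item;
# aggregation picks the majority label without any deliberation.
print(majority_vote(["sarcastic", "sincere", "sarcastic"]))  # -> sarcastic
```

Deliberation workflows differ from this baseline by having disagreeing workers exchange justifications before (or instead of) mechanically aggregating their votes.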




Published in

Proceedings of the ACM on Human-Computer Interaction, Volume 2, Issue CSCW (November 2018), 4104 pages
EISSN: 2573-0142
DOI: 10.1145/3290265
Copyright © 2018 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


