Abstract
Crowdsourced classification of data typically assumes that objects can be unambiguously classified into categories. In practice, many classification tasks are ambiguous, leading to various forms of disagreement among workers. Prior work shows that exchanging verbal justifications can significantly improve answer accuracy over standard aggregation techniques. In this work, we study how worker deliberation affects case resolvability and answer accuracy, using case studies with both an objective and a subjective task. Results show that case resolvability depends on several factors, including the level of initial disagreement and the reasons behind it, as well as the amount and quality of deliberation activities. Our work reinforces the finding that deliberation can increase answer accuracy, and underscores the importance of verbal discussion in this process. We contribute a new public data set on worker deliberation for text classification tasks, and discuss considerations for the design of deliberation workflows for classification.
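The "aggregation techniques" the abstract refers to are typically simple label-pooling rules such as majority voting. As an illustrative sketch (the function name and tie-handling policy are hypothetical, not taken from the paper), the following shows why some disagreements cannot be resolved by voting alone, which is exactly where a deliberation workflow would step in:

```python
from collections import Counter

def aggregate_labels(labels):
    """Majority-vote aggregation; returns (label, resolved_flag).

    A tie (e.g. a 2-2 split) is flagged as unresolved -- the kind of
    case that could be routed to worker deliberation instead of being
    decided arbitrarily.
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None, False   # tied vote: irresolvable by aggregation alone
    return counts[0][0], True  # clear majority winner

# A 3-1 split resolves; a 2-2 split would return (None, False).
label, resolved = aggregate_labels(["spam", "spam", "ham", "spam"])
```

In a deliberation workflow, the unresolved cases would be sent back to workers along with each other's verbal justifications, rather than being settled by a coin flip.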
References
- Christopher R. Bilder and Thomas M. Loughin. 2004. Testing for Marginal Independence between Two Categorical Variables with Multiple Responses. Biometrics 60, 1 (March 2004), 241--248.
- Joseph Chee Chang, Saleema Amershi, and Ece Kamar. 2017. Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems - CHI '17. ACM Press, New York, New York, USA, 2334--2346.
- Norman Dalkey and Olaf Helmer. 1963. An Experimental Application of the DELPHI Method to the Use of Experts. Management Science 9, 3 (April 1963), 458--467.
- Heidi Danker-Hopfe, Peter Anderer, Josef Zeitlhofer, Marion Boeck, Hans Dorn, Georg Gruber, Esther Heller, Erna Loretz, Doris Moser, Silvia Parapatics, Bernd Saletu, Andrea Schmidt, and Georg Dorffner. 2009. Interrater reliability for sleep scoring according to the Rechtschaffen & Kales and the new AASM standard. Journal of Sleep Research 18, 1 (March 2009), 74--84.
- Jeff Donahue and Kristen Grauman. 2011. Annotator rationales for visual recognition. In 2011 International Conference on Computer Vision. IEEE, 1395--1402.
- Shayan Doroudi, Ece Kamar, Emma Brunskill, and Eric Horvitz. 2016. Toward a Learning Science for Complex Crowdsourcing Tasks. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems - CHI '16. ACM Press, New York, New York, USA, 2623--2634.
- Ryan Drapeau, Lydia B. Chilton, Jonathan Bragg, and Daniel S. Weld. 2016. MicroTalk: Using Argumentation to Improve Crowdsourcing Accuracy. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP).
- Anca Dumitrache, Lora Aroyo, and Chris Welty. 2018. Crowdsourcing Ground Truth for Medical Relation Extraction. ACM Transactions on Interactive Intelligent Systems 8, 2 (July 2018), 1--20.
- Elena Filatova. 2012. Irony and Sarcasm: Corpus Generation and Analysis Using Crowdsourcing. In Proceedings of the Eighth International Conference on Language Resources and Evaluation - LREC '12, Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association (ELRA), 392--398.
- Deen G. Freelon, Travis Kriplean, Jonathan Morgan, W. Lance Bennett, and Alan Borning. 2012. Facilitating Diverse Political Engagement with the Living Voters Guide. Journal of Information Technology & Politics 9, 3 (July 2012), 279--297.
- Snehalkumar (Neil) S. Gaikwad, Mark Whiting, Karolina Ziulkoski, Alipta Ballav, Aaron Gilbee, Senadhipathige S. Niranga, Vibhor Sehgal, Jasmine Lin, Leonardy Kristianto, Angela Richmond-Fuller, Jeff Regino, Durim Morina, Nalin Chhibber, Dinesh Majeti, Sachin Sharma, Kamila Mananova, Dinesh Dhakal, William Dai, Victoria Purynova, Samarth Sandeep, Varshine Chandrakanthan, Tejas Sarma, Adam Ginzberg, Sekandar Matin, Ahmed Nasser, Rohit Nistala, Alexander Stolzoff, Kristy Milland, Vinayak Mathur, Rajan Vaish, Michael S. Bernstein, Catherine Mullings, Shirish Goyal, Dilrukshi Gamage, Christopher Diemert, Mathias Burton, and Sharon Zhou. 2016. Boomerang: Rebounding the Consequences of Reputation Feedback on Crowdsourcing Platforms. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology - UIST '16. ACM Press, New York, New York, USA, 625--637.
- Luciana Garbayo. 2014. Epistemic Considerations on Expert Disagreement, Normative Justification, and Inconsistency Regarding Multi-criteria Decision Making. Constraint Programming and Decision Making 539 (2014), 35--45. http://link.springer.com/10.1007/978-3-319-04280-0_5
- Melody Guan, Varun Gulshan, Andrew Dai, and Geoffrey Hinton. 2018. Who said what: Modeling individual labelers improves classification. In AAAI Conference on Artificial Intelligence. https://arxiv.org/pdf/1703.08774.pdf
- Danna Gurari and Kristen Grauman. 2017. CrowdVerge: Predicting If People Will Agree on the Answer to a Visual Question. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems - CHI '17. ACM Press, New York, New York, USA, 3511--3522.
- Francis T. Hartman and Andrew Baldwin. 1995. Using Technology to Improve Delphi Method. Journal of Computing in Civil Engineering 9, 4 (October 1995), 244--249.
- David W. Hosmer and Stanley Lemeshow. 1980. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods 9, 10 (1980), 1043--1069.
- Alan M. Jones. 1973. Victims of Groupthink: A Psychological Study of Foreign Policy Decisions and Fiascoes. The ANNALS of the American Academy of Political and Social Science 407, 1 (May 1973), 179--180.
- Sanjay Kairam and Jeffrey Heer. 2016. Parting Crowds: Characterizing Divergent Interpretations in Crowdsourced Annotation Tasks. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing - CSCW '16. ACM Press, New York, New York, USA, 1635--1646.
- Sara Kiesler and Lee Sproull. 1992. Group decision making and communication technology. Organizational Behavior and Human Decision Processes 52, 1 (June 1992), 96--123.
- Jonathan Krause, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S. Corrado, Lily Peng, and Dale R. Webster. 2018. Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy. Ophthalmology (March 2018).
- Travis Kriplean, Caitlin Bonnar, Alan Borning, Bo Kinney, and Brian Gill. 2014. Integrating on-demand fact-checking with public dialogue. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing - CSCW '14. ACM Press, New York, New York, USA, 1188--1199.
- Travis Kriplean, Jonathan Morgan, Deen Freelon, Alan Borning, and Lance Bennett. 2012a. Supporting reflective public thought with ConsiderIt. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work - CSCW '12. ACM Press, New York, New York, USA, 265.
- Travis Kriplean, Michael Toomim, Jonathan Morgan, Alan Borning, and Andrew Ko. 2012b. Is this what you meant?: Promoting listening on the web with Reflect. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems - CHI '12. ACM Press, New York, New York, USA, 1559.
- Weichen Liu, Sijia Xiao, Jacob T. Browne, Ming Yang, and Steven P. Dow. 2018. ConsensUs: Supporting Multi-Criteria Group Decisions by Visualizing Points of Disagreement. ACM Transactions on Social Computing 1, 1 (January 2018), 4:1--4:26.
- Peter McCullagh and John Nelder. 1989. Generalized Linear Models (2nd ed.). Chapman & Hall/CRC.
- Tyler McDonnell, Matthew Lease, Tamer Elsayed, and Mucahid Kutlu. 2016. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP).
- Jeryl L. Mumpower and Thomas R. Stewart. 1996. Expert Judgement and Expert Disagreement. Thinking & Reasoning 2, 2--3 (July 1996), 191--212.
- Joaquin Navajas, Tamara Niella, Gerry Garbulsky, Bahador Bahrami, and Mariano Sigman. 2018. Aggregated knowledge from a small number of debates outperforms the wisdom of large crowds. Nature Human Behaviour (January 2018).
- Charlan Nemeth. 1977. Interactions Between Jurors as a Function of Majority vs. Unanimity Decision Rules. Journal of Applied Social Psychology 7, 1 (March 1977), 38--56.
- Gerhard Osius and Dieter Rojek. 1992. Normal Goodness-of-Fit Tests for Multinomial Models with Large Degrees of Freedom. J. Amer. Statist. Assoc. 87, 420 (December 1992), 1145--1152.
- Shengying Pan, Kate Larson, Joshua Bradshaw, and Edith Law. 2016. Dynamic Task Allocation Algorithm for Hiring Workers that Learn. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016). New York, 3825--3831.
- Pranav Rajpurkar, Awni Y. Hannun, Masoumeh Haghpanahi, Codie Bourn, and Andrew Y. Ng. 2017. Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks. (July 2017). http://arxiv.org/abs/1707.01836
- Richard S. Rosenberg and Steven van Hout. 2013. The American Academy of Sleep Medicine Inter-scorer Reliability Program: Sleep Stage Scoring. Journal of Clinical Sleep Medicine (January 2013).
- Harold Sackman. 1974. Delphi assessment: Expert opinion, forecasting, and group process. Technical Report. RAND Corporation, Santa Monica, CA.
- Manali Sharma, Di Zhuang, and Mustafa Bilgic. 2015. Active Learning with Rationales for Text Classification. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT).
- Miriam Solomon. 2006. Groupthink versus The Wisdom of Crowds: The Social Epistemology of Deliberation and Dissent. The Southern Journal of Philosophy 44, S1 (March 2006), 28--42.
- Thérèse A. Stukel. 1988. Generalized Logistic Models. J. Amer. Statist. Assoc. 83, 402 (June 1988), 426--431.
- Ainur Yessenalina, Yejin Choi, and Claire Cardie. 2010. Automatically Generating Annotator Rationales to Improve Sentiment Classification. In Proceedings of the ACL 2010 Conference Short Papers (ACLShort '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 336--341. http://dl.acm.org/citation.cfm?id=1858842.1858904
- Omar F. Zaidan, Jason Eisner, and Christine D. Piatko. 2007. Using "Annotator Rationales" to Improve Machine Learning for Text Categorization. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). 260--267.
- Omar F. Zaidan, Jason Eisner, and Christine D. Piatko. 2008. Machine learning with annotator rationales to reduce annotation cost. In Proceedings of the NIPS 2008 Workshop on Cost Sensitive Learning.
- Amy X. Zhang, Lea Verou, and David Karger. 2017. Wikum: Bridging Discussion Forums and Wikis Using Recursive Summarization. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing - CSCW '17. ACM Press, New York, New York, USA, 2082--2096.
Index Terms
- Resolvable vs. Irresolvable Disagreement: A Study on Worker Deliberation in Crowd Work