ABSTRACT
Peer consistency evaluation is often used in games with a purpose (GWAPs) to evaluate workers against the outputs of other workers, without relying on gold standard answers. Despite its popularity, the reliability of peer consistency evaluation has never been systematically tested, so it remains unclear whether it can serve as a general evaluation method for human computation systems. We present experimental results showing that human computation systems using peer consistency evaluation can produce outcomes even better than those of systems that evaluate workers against gold standard answers. We also show that, even without any actual evaluation, simply telling workers that their answers will be used as future evaluation standards significantly enhances their performance. These results have important implications for methods that improve the reliability of human computation systems.
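To make the contrast concrete, below is a minimal sketch in Python of the two evaluation schemes the abstract compares. All names, the scoring rule, and the answer format are illustrative assumptions, not the authors' implementation: gold-standard evaluation scores a worker against known correct answers, while peer consistency evaluation scores a worker by agreement with another worker's answer to the same task, as in output-agreement GWAPs.

```python
import random

def score_gold(worker_answers, gold_answers):
    """Gold-standard evaluation: credit the worker for each answer that
    matches the known correct label for that task."""
    return sum(
        worker_answers[task] == gold
        for task, gold in gold_answers.items()
        if task in worker_answers
    )

def score_peer_consistency(worker_answers, peer_answers_by_task, rng=random):
    """Peer consistency evaluation: no gold labels are needed. The worker
    is credited whenever their answer agrees with a randomly drawn peer's
    answer to the same task."""
    score = 0
    for task, answer in worker_answers.items():
        peers = peer_answers_by_task.get(task, [])
        if peers and rng.choice(peers) == answer:
            score += 1
    return score

# Illustrative usage (hypothetical image-labeling data):
gold = {"img1": "cat", "img2": "dog"}
worker = {"img1": "cat", "img2": "cat"}
peers = {"img1": ["cat", "cat", "dog"], "img2": ["dog", "dog"]}
print(score_gold(worker, gold))               # 1
print(score_peer_consistency(worker, peers))  # 1 with probability 2/3, else 0
```

Comparing against a single randomly drawn peer, rather than a peer majority, mirrors the output-agreement setup of games like the ESP Game; a majority-vote variant would simply replace `rng.choice(peers)` with the modal peer answer.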