ABSTRACT
Peer consistency evaluation is often used in games with a purpose (GWAPs) to evaluate workers against the outputs of other workers, without relying on gold standard answers. Despite its popularity, the reliability of peer consistency evaluation has never been systematically tested, so it remains unclear whether it can serve as a general evaluation method for human computation systems. We present experimental results showing that human computation systems using peer consistency evaluation can produce outcomes even better than those of systems that evaluate workers against gold standard answers. We also show that, even without any actual evaluation, simply telling workers that their answers will be used as future evaluation standards significantly enhances their performance. These results have important implications for methods that improve the reliability of human computation systems.
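To make the contrast concrete, below is a minimal sketch in Python of the two evaluation schemes the abstract compares. All names, the scoring rule, and the answer format are illustrative assumptions, not the authors' implementation: gold-standard evaluation scores a worker against known correct answers, while peer consistency evaluation scores a worker by agreement with another worker's answer to the same task, as in output-agreement GWAPs.

```python
import random

def score_gold(worker_answers, gold_answers):
    """Gold-standard evaluation: credit the worker for each answer that
    matches the known correct label for that task."""
    return sum(
        worker_answers[task] == gold
        for task, gold in gold_answers.items()
        if task in worker_answers
    )

def score_peer_consistency(worker_answers, peer_answers_by_task, rng=random):
    """Peer consistency evaluation: no gold labels are needed. The worker
    is credited whenever their answer agrees with a randomly drawn peer's
    answer to the same task."""
    score = 0
    for task, answer in worker_answers.items():
        peers = peer_answers_by_task.get(task, [])
        if peers and rng.choice(peers) == answer:
            score += 1
    return score

# Illustrative usage (hypothetical image-labeling data):
gold = {"img1": "cat", "img2": "dog"}
worker = {"img1": "cat", "img2": "cat"}
peers = {"img1": ["cat", "cat", "dog"], "img2": ["dog", "dog"]}
print(score_gold(worker, gold))               # 1
print(score_peer_consistency(worker, peers))  # 1 with probability 2/3, else 0
```

Comparing against a single randomly drawn peer, rather than a peer majority, mirrors the output-agreement setup of games like the ESP Game; a majority-vote variant would simply replace `rng.choice(peers)` with the modal peer answer.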