DOI: 10.1145/3336191.3371875
Tutorial

Practice of Efficient Data Collection via Crowdsourcing: Aggregation, Incremental Relabelling, and Pricing

Published: 22 January 2020

Abstract

In this tutorial, we share unique industry experience in efficient data labelling via crowdsourcing, presented by leading researchers and engineers from Yandex. We will introduce data labelling via public crowdsourcing marketplaces and present the key components of efficient label collection. This will be followed by a practice session, where participants will choose one of the real label collection tasks, experiment with settings for the labelling process, and launch their label collection projects on Yandex.Toloka, one of the largest crowdsourcing marketplaces. The projects will be run on real crowds within the tutorial session. Finally, participants will receive feedback on their projects and practical advice on making them more efficient. We expect our tutorial to attract an audience with a wide range of backgrounds and interests. No specific prerequisite knowledge or skills are required. We invite beginners, advanced specialists, and researchers to learn how to collect labelled data efficiently.
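The aggregation component named in the title builds on collecting several redundant labels per task and combining them into a single answer. As a minimal, purely illustrative sketch (the data layout and function name below are our own assumptions, not taken from the tutorial materials), the snippet aggregates redundant labels by simple majority vote; the tutorial itself covers stronger, worker-skill-aware aggregation models.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Aggregate redundant crowd labels by per-task majority vote.

    `labels` is an iterable of (task_id, worker_id, label) triples, mirroring
    a typical crowdsourcing export; worker_id is ignored by this baseline.
    Ties are broken arbitrarily by Counter.most_common.
    """
    votes = defaultdict(list)
    for task_id, _worker_id, label in labels:
        votes[task_id].append(label)
    return {task: Counter(ls).most_common(1)[0][0] for task, ls in votes.items()}

# Toy example: three hypothetical workers label two tasks with some noise.
raw_labels = [
    ("task1", "w1", "cat"), ("task1", "w2", "cat"), ("task1", "w3", "dog"),
    ("task2", "w1", "dog"), ("task2", "w2", "dog"), ("task2", "w3", "dog"),
]
print(majority_vote(raw_labels))  # {'task1': 'cat', 'task2': 'dog'}
```

In this framing, incremental relabelling is the decision of whether it is worth buying another redundant label for a given task, which is what connects aggregation to the pricing component of the tutorial.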




Published In

WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining
January 2020
950 pages
ISBN:9781450368223
DOI:10.1145/3336191
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. aggregation
  2. crowdsourcing
  3. data collection
  4. efficient crowdsourcing pipeline
  5. incremental relabelling
  6. practice
  7. pricing

Qualifiers

  • Tutorial

Conference

WSDM '20

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%



Cited By

  • (2023) Assessing Credibility Factors of Short-Form Social Media Posts: A Crowdsourced Online Experiment. Proceedings of the 15th Biannual Conference of the Italian SIGCHI Chapter, 1-14. DOI: 10.1145/3605390.3605406. Online publication date: 20-Sep-2023.
  • (2023) Extending Label Aggregation Models with a Gaussian Process to Denoise Crowdsourcing Labels. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 729-738. DOI: 10.1145/3539618.3591685. Online publication date: 19-Jul-2023.
  • (2022) REGROW: Reimagining Global Crowdsourcing for Better Human-AI Collaboration. Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, 1-7. DOI: 10.1145/3491101.3503725. Online publication date: 27-Apr-2022.
  • (2021) Aggregation Techniques in Crowdsourcing: Multiple Choice Questions and Beyond. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 4842-4844. DOI: 10.1145/3459637.3482032. Online publication date: 26-Oct-2021.
  • (2020) Prediction of Hourly Earnings and Completion Time on a Crowdsourcing Platform. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3172-3182. DOI: 10.1145/3394486.3403369. Online publication date: 23-Aug-2020.
  • (2020) Random Sampling-Arithmetic Mean: A Simple Method of Meteorological Data Quality Control Based on Random Observation Thought. IEEE Access, 8, 226999-227013. DOI: 10.1109/ACCESS.2020.3045434. Online publication date: 2020.
