Abstract
Crowdsourcing can harness human intelligence to handle computer-hard tasks in a relatively economical way. The answers collected from crowd workers vary in quality, owing to task difficulty, worker capability, incentives, and other factors. To maintain high-quality answers while reducing cost, various strategies have been developed that model tasks, workers, or both. Nevertheless, they typically assume that worker capability remains static throughout task assignment and completion. In fact, crowd workers can improve their capability by gradually completing tasks from easy to hard, akin to human beings’ intrinsic self-paced learning ability. In this paper, we study crowdsourcing with self-paced workers, whose capability progressively improves as they scrutinize and complete tasks from easy to hard. We introduce a Self-paced Crowd-worker model (SPCrowder). In SPCrowder, workers first complete a set of golden tasks with known ground truths; the resulting feedback helps workers grasp the basic patterns of the tasks and stimulates self-paced learning, while also enabling the estimation of worker quality and task difficulty. SPCrowder then uses a task difficulty model to dynamically measure task difficulty, ranks tasks from easy to hard, and assigns them to self-paced workers by maximizing a benefit criterion. In this way, an ordinary worker becomes capable of handling hard tasks after completing some easier, related ones. Extensive experiments on semi-simulated and real crowdsourcing datasets show that SPCrowder outperforms competitive methods in quality control and budget saving, and that crowd workers indeed possess a self-paced learning ability, which boosts answer quality and saves budget.
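The pipeline described in the abstract (calibrate workers on golden tasks, then assign remaining tasks easy-to-hard while worker capability grows) can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the benefit rule, the difficulty scores, the `growth` increment, and all function names here are illustrative assumptions.

```python
def estimate_worker_quality(golden_answers, golden_truths):
    """Fraction of golden tasks answered correctly (hypothetical estimator;
    the paper's estimation of worker quality may differ)."""
    correct = sum(a == t for a, t in zip(golden_answers, golden_truths))
    return correct / len(golden_truths)

def assign_self_paced(tasks, worker_quality, growth=0.02):
    """Assign tasks from easy to hard; the worker's capability grows with
    each completed task, mimicking self-paced learning.

    tasks: dict mapping task id -> difficulty score in [0, 1]
    Returns the ordered assignment schedule and the final capability.
    """
    schedule = []
    capability = worker_quality
    # Rank tasks easy -> hard by their difficulty score.
    for task_id, difficulty in sorted(tasks.items(), key=lambda kv: kv[1]):
        # Illustrative benefit criterion: assign only when the worker's
        # current capability exceeds the task's difficulty.
        if capability - difficulty > 0:
            schedule.append(task_id)
            capability = min(1.0, capability + growth)  # self-paced improvement
    return schedule, capability

# Golden-task phase: worker answers 3 of 4 golden tasks correctly.
golden_truths = ["A", "B", "A", "C"]
golden_answers = ["A", "B", "C", "C"]
q0 = estimate_worker_quality(golden_answers, golden_truths)  # 0.75

# Remaining tasks with (assumed) difficulty scores.
tasks = {"t1": 0.2, "t2": 0.5, "t3": 0.74, "t4": 0.78}
schedule, q_final = assign_self_paced(tasks, q0)
print(q0, schedule, round(q_final, 2))
```

Note how the ordering matters in this toy setting: with a static capability of 0.75, task `t4` (difficulty 0.78) would never be assigned, but after completing the easier tasks the worker's capability has grown past 0.78, so `t4` becomes feasible.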






Acknowledgements
We appreciate the authors who kindly shared their source code and datasets with us for the experiments. This work is supported by the National Key Research and Development Project of China (No. 2019YFB1705900), the Innovation Method Fund of China (No. 2020IM020100), NSFC (62031003 and 62072380), and the Shenzhen Polytechnic Youth Innovation Project under Grant No. 6019310007K0.
Cite this article
Kang, X., Yu, G., Domeniconi, C. et al. Self-paced annotations of crowd workers. Knowl Inf Syst 64, 3235–3263 (2022). https://doi.org/10.1007/s10115-022-01759-5