
Self-paced annotations of crowd workers

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Crowdsourcing can harness human intelligence to handle computer-hard tasks in a relatively economical way. The answers collected from crowd workers vary in quality, owing to task difficulty, worker capability, incentives, and other factors. To maintain high-quality answers while reducing cost, various strategies have been developed that model tasks, workers, or both. Nevertheless, they typically assume that workers' capability remains static while all the tasks are assigned and completed. In fact, crowd workers can improve their capability by gradually completing tasks from easy to hard, akin to human beings' intrinsic self-paced learning ability. In this paper, we study crowdsourcing with self-paced workers, whose capability progressively improves as they scrutinize and complete tasks from easy to hard. We introduce a Self-paced Crowd-worker model (SPCrowder). In SPCrowder, workers first complete a set of golden tasks with known ground truths; the feedback on these tasks helps workers grasp the underlying patterns of the tasks and stimulates self-paced learning, and it also serves to estimate worker quality and task difficulty. SPCrowder then uses a task difficulty model to dynamically measure the difficulty of tasks, ranks them from easy to hard, and assigns tasks to self-paced workers by maximizing a benefit criterion. In this way, an ordinary worker becomes capable of handling hard tasks after completing some easier, related ones. We conducted extensive experiments on semi-simulated and real crowdsourcing datasets; SPCrowder outperforms competitive methods in quality control and budget saving. Crowd workers indeed possess self-paced learning ability, which boosts quality and saves budget.
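A minimal sketch of the workflow the abstract describes: calibrate workers on golden tasks, estimate task difficulty, then assign tasks from easy to hard as worker capability grows. The formulas here (accuracy on golden tasks as worker quality, quality-weighted disagreement as difficulty, a fixed capability growth rate) are illustrative assumptions, not the paper's actual model or benefit criterion:

```python
import numpy as np

def worker_quality(golden_answers, golden_truths):
    """Worker quality = accuracy on golden tasks with known ground truths."""
    golden_answers = np.asarray(golden_answers)  # shape: (n_workers, n_golden)
    golden_truths = np.asarray(golden_truths)    # shape: (n_golden,)
    return (golden_answers == golden_truths).mean(axis=1)

def task_difficulty(answers, qualities):
    """Task difficulty = quality-weighted disagreement among collected answers."""
    n_workers, n_tasks = answers.shape
    difficulty = np.zeros(n_tasks)
    for j in range(n_tasks):
        votes = {}
        for i in range(n_workers):
            label = answers[i, j]
            votes[label] = votes.get(label, 0.0) + qualities[i]
        # An easy task is one where a single label clearly dominates;
        # difficulty = 1 - dominance of the strongest label.
        difficulty[j] = 1.0 - max(votes.values()) / sum(votes.values())
    return difficulty

def assign_easy_to_hard(difficulty, initial_quality, growth=0.01):
    """Assign tasks in increasing difficulty; the worker's capability grows
    a little with each completed task (self-paced improvement)."""
    capability = initial_quality
    assigned = []
    for j in np.argsort(difficulty):          # easy -> hard
        if difficulty[j] <= capability:       # toy stand-in for the benefit criterion
            assigned.append(int(j))
            capability = min(1.0, capability + growth)
    return assigned, capability

# Toy usage: 3 workers answer 4 golden tasks with binary labels.
golden = np.array([[1, 0, 1, 1],
                   [1, 1, 1, 0],
                   [0, 0, 1, 1]])
truth = np.array([1, 0, 1, 1])
q = worker_quality(golden, truth)             # -> [1.0, 0.5, 0.75]
d = task_difficulty(golden, q)
tasks, final_cap = assign_easy_to_hard(d, initial_quality=q[1])
```

In this toy version the benefit criterion is reduced to a threshold test (assign a task only when its difficulty does not exceed the worker's current capability); the paper's criterion is richer, but the easy-to-hard ordering and the gradual capability growth are the essential ingredients.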




Acknowledgements

We appreciate the authors who kindly shared their source code and datasets with us for the experiments. This work was supported by the National Key Research and Development Project of China (No. 2019YFB1705900), the Innovation Method Fund of China (No. 2020IM020100), NSFC (62031003 and 62072380), and the Shenzhen Polytechnic Youth Innovation Project (No. 6019310007K0).

Author information


Corresponding author

Correspondence to Guoxian Yu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kang, X., Yu, G., Domeniconi, C. et al. Self-paced annotations of crowd workers. Knowl Inf Syst 64, 3235–3263 (2022). https://doi.org/10.1007/s10115-022-01759-5


