Abstract
Crowdsourcing can harness human intelligence to handle computer-hard tasks in a relatively economical way. The answers collected from crowd workers vary in quality, owing to task difficulty, worker capability, incentives, and other factors. To maintain high-quality answers while reducing cost, various strategies have been developed that model tasks, workers, or both. Nevertheless, they typically assume that worker capability remains static throughout task assignment and completion. In fact, crowd workers can improve their capability by gradually completing tasks from easy to hard, akin to human beings’ intrinsic self-paced learning ability. In this paper, we study crowdsourcing with self-paced workers, whose capability progressively improves as they scrutinize and complete tasks from easy to hard. We introduce a Self-paced Crowd-worker model (SPCrowder). In SPCrowder, workers first complete a set of golden tasks with known ground truths; the resulting feedback helps workers grasp the basic patterns of the tasks and stimulates self-paced learning, while also enabling the estimation of worker quality and task difficulty. SPCrowder then uses a task difficulty model to dynamically measure task difficulty, ranks tasks from easy to hard, and assigns them to self-paced workers by maximizing a benefit criterion. In this way, an ordinary worker becomes capable of handling hard tasks after completing some easier, related ones. Extensive experiments on semi-simulated and real crowdsourcing datasets show that SPCrowder outperforms competitive methods in quality control and budget saving, and that crowd workers indeed possess a self-paced learning ability, which boosts answer quality and saves budget.
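The pipeline described in the abstract (calibrate workers on golden tasks, then assign remaining tasks easy-to-hard while worker capability grows) can be illustrated with a minimal sketch. This is not the paper's actual algorithm: the benefit rule, the difficulty scores, the `growth` increment, and all function names here are illustrative assumptions.

```python
def estimate_worker_quality(golden_answers, golden_truths):
    """Fraction of golden tasks answered correctly (hypothetical estimator;
    the paper's estimation of worker quality may differ)."""
    correct = sum(a == t for a, t in zip(golden_answers, golden_truths))
    return correct / len(golden_truths)

def assign_self_paced(tasks, worker_quality, growth=0.02):
    """Assign tasks from easy to hard; the worker's capability grows with
    each completed task, mimicking self-paced learning.

    tasks: dict mapping task id -> difficulty score in [0, 1]
    Returns the ordered assignment schedule and the final capability.
    """
    schedule = []
    capability = worker_quality
    # Rank tasks easy -> hard by their difficulty score.
    for task_id, difficulty in sorted(tasks.items(), key=lambda kv: kv[1]):
        # Illustrative benefit criterion: assign only when the worker's
        # current capability exceeds the task's difficulty.
        if capability - difficulty > 0:
            schedule.append(task_id)
            capability = min(1.0, capability + growth)  # self-paced improvement
    return schedule, capability

# Golden-task phase: worker answers 3 of 4 golden tasks correctly.
golden_truths = ["A", "B", "A", "C"]
golden_answers = ["A", "B", "C", "C"]
q0 = estimate_worker_quality(golden_answers, golden_truths)  # 0.75

# Remaining tasks with (assumed) difficulty scores.
tasks = {"t1": 0.2, "t2": 0.5, "t3": 0.74, "t4": 0.78}
schedule, q_final = assign_self_paced(tasks, q0)
print(q0, schedule, round(q_final, 2))
```

Note how the ordering matters in this toy setting: with a static capability of 0.75, task `t4` (difficulty 0.78) would never be assigned, but after completing the easier tasks the worker's capability has grown past 0.78, so `t4` becomes feasible.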






Acknowledgements
We appreciate the authors who kindly shared their source code and datasets with us for the experiments. This work is supported by the National Key Research and Development Project of China (No. 2019YFB1705900), the Innovation Method Fund of China (No. 2020IM020100), NSFC (62031003 and 62072380), and the Shenzhen Polytechnic Youth Innovation Project under Grant No. 6019310007K0.
Cite this article
Kang, X., Yu, G., Domeniconi, C. et al. Self-paced annotations of crowd workers. Knowl Inf Syst 64, 3235–3263 (2022). https://doi.org/10.1007/s10115-022-01759-5