
Learning from crowds with active learning and self-healing

  • Original Article
Neural Computing and Applications

Abstract

With the development of crowdsourcing, acquiring data for supervised learning from annotators all over the world has become simple and economical. To improve accuracy, it is natural to obtain multiple noisy labels (i.e., a multiple label set) for each example from the crowd. Consensus algorithms can then infer an estimated ground truth from the multiple label set of each example. This estimated ground truth, also called an integrated label, may itself be incorrect. That is, a dataset constructed by integrating the multiple noisy labels of each example in a crowdsourcing dataset (called an integrated dataset) still contains noise. To further improve the data quality of an integrated dataset, and thereby the performance of a model learned from it, this paper proposes a framework that combines active learning with the self-healing of a model. With active learning, a limited number of examples from the integrated dataset that are most likely to be noisy are selected for the oracle to correct; with the self-healing of a model, the data quality of the integrated dataset can also be improved automatically. From our experimental results on eight simulated crowdsourcing datasets with three popular consensus algorithms, we draw the following conclusions. (1) Our proposed framework does improve the performance of a model learned from the integrated dataset. (2) The simple active learning selection strategy based on uncertainty estimation can identify noisy examples in the integrated dataset. (3) Self-healing is efficient and effective at improving the data quality of the integrated dataset, and thus improves the accuracy of a model learned from it. We further investigate our proposed framework on a real-world crowdsourcing dataset collected from Amazon Mechanical Turk, and the above conclusions hold.
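The three ingredients named in the abstract (label integration, uncertainty-based selection, and self-healing) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the consensus algorithms, the uncertainty score, or the self-healing rule. The sketch assumes majority voting for integration, a least-confidence score for selection, and a hypothetical rule that relabels an example when the model disagrees with its integrated label with probability at or above a threshold. All function names and the `threshold` parameter are illustrative.

```python
from collections import Counter


def majority_vote(label_sets):
    """Integrate each example's multiple noisy labels into a single label."""
    integrated = []
    for labels in label_sets:
        counts = Counter(labels)
        top = max(counts.values())
        # Break ties deterministically by taking the smallest tied label.
        integrated.append(min(l for l, c in counts.items() if c == top))
    return integrated


def select_most_uncertain(probs, budget):
    """Return indices of the `budget` examples the model is least sure about.

    Assumption: uncertainty is scored as 1 - max class probability
    (least-confidence sampling); the paper may use a different score.
    """
    uncertainty = [1.0 - max(p) for p in probs]
    order = sorted(range(len(probs)), key=lambda i: uncertainty[i], reverse=True)
    return order[:budget]


def self_heal(probs, integrated, threshold=0.9):
    """Hypothetical self-healing rule: adopt the model's prediction when it
    disagrees with the integrated label and its probability >= threshold."""
    healed = list(integrated)
    for i, p in enumerate(probs):
        pred = max(range(len(p)), key=lambda k: p[k])
        if pred != integrated[i] and p[pred] >= threshold:
            healed[i] = pred
    return healed


# Three examples, each labeled by three crowd workers (binary labels).
label_sets = [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
integrated = majority_vote(label_sets)

# Class probabilities from some model trained on the integrated dataset
# (made-up numbers, for illustration only).
probs = [[0.10, 0.90], [0.45, 0.55], [0.95, 0.05]]

to_query = select_most_uncertain(probs, budget=1)  # sent to the oracle
healed = self_heal(probs, integrated)              # corrected automatically
```

In this sketch the oracle budget is spent on the examples the model finds ambiguous, while confident disagreements are corrected without human effort; how the paper balances the two is not stated in the abstract.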



Acknowledgements

This work was supported by the U.S. National Science Foundation under Grant No. IIS-1115417, the National Natural Science Foundation of China under Grant Nos. 61472267, 61170020, and 61440053, and the Natural Science Foundation of Hubei Province under Grant No. 2014CFB913.

Author information

Corresponding author

Correspondence to Zhenyu Shu.

Ethics declarations

Conflict of interest

We declare that we have no conflicts of interest related to this work. The manuscript has been approved by all authors for publication.

About this article

Cite this article

Shu, Z., Sheng, V.S. & Li, J. Learning from crowds with active learning and self-healing. Neural Comput & Applic 30, 2883–2894 (2018). https://doi.org/10.1007/s00521-017-2878-y

  • DOI: https://doi.org/10.1007/s00521-017-2878-y
