Abstract
With the development of crowdsourcing, acquiring labeled data for supervised learning from annotators all over the world has become simple and economical. To improve accuracy, it is natural to obtain multiple noisy labels (i.e., a multiple label set) for each example from the crowd. Consensus algorithms can then infer an estimated ground truth from the multiple label set of each example. This estimated ground truth, also called an integrated label, may itself be noisy. That is, a dataset constructed by integrating the multiple noisy labels of each example in a crowdsourcing dataset (called an integrated dataset) still contains noise. To further improve the data quality of an integrated dataset, and thereby the performance of a model learned from it, this paper proposes a framework that combines active learning with the self-healing of a model. With active learning, a limited number of examples from the integrated dataset that are most likely to be noisy are selected for the oracle to correct; with the self-healing of a model, the data quality of the integrated dataset can also be improved automatically. From our experimental results on eight simulated crowdsourcing datasets with three popular consensus algorithms, we draw the following conclusions. (1) Our proposed framework does improve the performance of a model learned from the integrated dataset. (2) The simple active learning selection strategy based on uncertainty estimation can identify noisy examples in the integrated dataset. (3) Self-healing is efficient and effective in improving the data quality of the integrated dataset, and thus improves the accuracy of a model learned from it. We further investigate the proposed framework on a real-world crowdsourcing dataset collected from Amazon Mechanical Turk, and the above conclusions hold.
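The pipeline described above (integrate multiple noisy labels, pick the most uncertain integrated labels, let an oracle correct them) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: majority voting stands in for the consensus algorithm, uncertainty is approximated by the share of votes that agree with the integrated label (the paper's strategy may instead rely on a model's class probability estimates), and the function names `majority_vote`, `select_for_oracle`, and `correct_with_oracle` are hypothetical. The self-healing step is not sketched because the abstract does not specify how it works.

```python
from collections import Counter

def majority_vote(label_sets):
    """Integrate each multiple label set by majority voting.
    Assumption: ties are broken by taking the smallest label."""
    integrated = []
    for labels in label_sets:
        counts = Counter(labels)
        top = max(counts.values())
        integrated.append(min(l for l, c in counts.items() if c == top))
    return integrated

def select_for_oracle(label_sets, integrated, budget):
    """Return the indices of the `budget` most uncertain examples.
    Assumption: uncertainty is measured by vote agreement, i.e. the
    fraction of crowd labels equal to the integrated label (lower
    agreement means the integrated label is more likely noisy)."""
    agreement = [
        labels.count(y) / len(labels)
        for labels, y in zip(label_sets, integrated)
    ]
    ranked = sorted(range(len(label_sets)), key=lambda i: (agreement[i], i))
    return ranked[:budget]

def correct_with_oracle(integrated, selected, oracle):
    """Replace the selected integrated labels with the oracle's labels."""
    corrected = list(integrated)
    for i in selected:
        corrected[i] = oracle(i)
    return corrected

# Toy data (illustrative only): four examples, five crowd labels each.
label_sets = [
    [1, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
]
truth = [1, 0, 0, 1]  # hypothetical ground truth known only to the oracle

integrated = majority_vote(label_sets)
selected = select_for_oracle(label_sets, integrated, budget=2)
corrected = correct_with_oracle(integrated, selected, lambda i: truth[i])
```

In this toy run the two examples with the narrowest vote margin are the ones whose integrated labels are wrong, so a budget of two oracle queries is enough to repair the integrated dataset. Real crowdsourcing data will not be this tidy.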
Acknowledgements
This work was supported by the U.S. National Science Foundation under Grant No. IIS-1115417, the National Natural Science Foundation of China under Grant Nos. 61472267, 61170020, and 61440053, and the Natural Science Foundation of Hubei Province under Grant No. 2014CFB913.
Ethics declarations
Conflict of interest
We declare that we have no conflicts of interest related to this work. The manuscript has been approved by all authors for publication.
Cite this article
Shu, Z., Sheng, V.S. & Li, J. Learning from crowds with active learning and self-healing. Neural Comput & Applic 30, 2883–2894 (2018). https://doi.org/10.1007/s00521-017-2878-y
DOI: https://doi.org/10.1007/s00521-017-2878-y