Abstract
In many supervised learning problems, determining the true labels of training instances is expensive, laborious, or even practically impossible. As an alternative, it is much easier to collect multiple subjective (possibly noisy) labels from human labelers, especially via crowdsourcing services such as Amazon's Mechanical Turk. The collected labels are then aggregated to estimate the true labels. To reduce the negative effects of novices, spammers, and malicious labelers, the accuracies of the labelers must be taken into account. However, in the absence of true labels, we lack the main source of information for estimating labeler accuracies. This paper demonstrates that the agreements and disagreements among labeler opinions are a useful source of information that facilitates the accuracy estimation problem. We formulate this estimation problem as an optimization problem whose goal is to minimize the differences between the analytical probabilities of disagreement, computed from the estimated accuracies, and the empirical probabilities of disagreement, computed from the provided labels. We present an efficient semi-exhaustive search method to solve this optimization problem. Our experiments on simulated data and three real datasets show that the proposed method is a promising approach in this emerging area. The source code of the proposed method is available for download at http://ceit.aut.ac.ir/~amirkhani.
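The objective described in the abstract can be sketched in a few lines, under strong simplifying assumptions: binary labels, conditionally independent labelers, and a single accuracy parameter per labeler (the paper's actual model and search procedure may differ). All function names and the coarse accuracy grid below are illustrative, not taken from the paper. For independent labelers with accuracies p_i and p_j, the analytical probability that they disagree on an item is p_i(1 − p_j) + (1 − p_i)p_j; the estimation then fits accuracies so these analytical values match the observed pairwise disagreement rates.

```python
import itertools
import numpy as np

def empirical_disagreement(labels):
    """Observed pairwise disagreement rates.
    labels: (n_items, n_labelers) matrix of 0/1 labels."""
    n = labels.shape[1]
    d = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        d[i, j] = np.mean(labels[:, i] != labels[:, j])
    return d

def analytical_disagreement(p):
    """P(labelers i and j disagree) for independent labelers with
    accuracies p[i], p[j]: p_i*(1-p_j) + (1-p_i)*p_j."""
    n = len(p)
    d = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        d[i, j] = p[i] * (1 - p[j]) + (1 - p[i]) * p[j]
    return d

def grid_search_accuracies(labels, grid=np.linspace(0.05, 0.95, 19)):
    """Exhaustive search over a coarse accuracy grid (feasible only for
    a handful of labelers; the paper uses a semi-exhaustive method)."""
    target = empirical_disagreement(labels)
    best, best_err = None, np.inf
    for p in itertools.product(grid, repeat=labels.shape[1]):
        err = np.sum((analytical_disagreement(np.array(p)) - target) ** 2)
        if err < best_err:
            best, best_err = np.array(p), err
    return best
```

Note that the objective is symmetric in p and 1 − p (flipping every accuracy leaves the disagreement probabilities unchanged), so extra information or a constraint such as "most labelers are better than random" is needed to pick the right mode.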
Notes
AMT (found at http://www.mturk.com) is an online labor market where uncertified workers complete small tasks in exchange for small payments.
Other references are reported in the rest of this section.
Different papers use different (possibly conflicting) epithets for these categories. In this paper, we use the terms proposed in [17].
These two parameters model the accuracy of the labeler separately for each class.
We adopt a simple approach for convenience. Yet more realistic assumptions and more advanced techniques can be applied.
This information is explicitly used in [29].
When there is no clear majority, this information is useless and we must use other types of information.
In this case one of the two labelers must be better than the other, so the equality information is uninformative.
References
Howe J (2006) The rise of crowdsourcing. Wired Mag 14:1–4
Pontin J (2007) Artificial intelligence: with help from the humans. The New York Times
Vondrick C, Patterson D, Ramanan D (2013) Efficiently scaling up crowdsourced video annotation. Int J Comput Vis 101:184–204
Jones GJ (2013) An introduction to crowdsourcing for language and multimedia technology research. In: Information retrieval meets information visualization. Springer, Berlin, pp 132–154
Marcus A, Wu E, Karger DR et al (2011) Crowdsourced databases: query processing with people. In: 5th biennial conference on innovative data systems research, Asilomar, CA, USA, pp 211–214
Snow R, O’Connor B, Jurafsky D, Ng A (2008) Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In: The conference on empirical methods in natural language processing, pp 254–263
Zaidan OF, Callison-Burch C (2011) Crowdsourcing translation: professional quality from non-professionals. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 1220–1229
Bernstein MS, Little G, Miller RC et al (2010) Soylent: a word processor with a crowd inside. In: Proceedings of the 23rd annual ACM symposium on user interface software and technology, pp 313–322
Urbano J, Morato J, Marrero M, Martín D (2010) Crowdsourcing preference judgments for evaluation of music similarity tasks. In: ACM SIGIR workshop on crowdsourcing for search evaluation, pp 9–16
Grady C, Lease M (2010) Crowdsourcing document relevance assessment with mechanical turk. In: Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk, pp 172–179
Ahn LV, Maurer B, McMillen C et al (2008) reCAPTCHA: human-based character recognition via web security measures. Science 321(5895):1465–1468
Muhammadi J, Rabiee HR (2013) Crowd computing: a survey. arXiv:1301.2774
Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 614–622
Dekel O, Shamir O (2009) Good learners for evil teachers. In: Proceedings of the 26th international conference on machine learning, Montreal, Canada, pp 233–240
Raykar VC, Yu S, Zhao LH et al (2010) Learning from crowds. J Mach Learn Res 11:1297–1322
Whitehill J, Ruvolo P, Wu T et al (2009) Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In: Advances in neural information processing systems, pp 2035–2043
Raykar VC, Yu S (2012) Eliminating spammers and ranking annotators for crowdsourced labeling tasks. J Mach Learn Res 13:491–518
Ipeirotis PG, Provost F, Wang J (2010) Quality management on amazon mechanical turk. In: The ACM SIGKDD workshop on human computation. ACM, New York, pp 64–67
Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Appl Stat 28:20–28
Karger DR, Oh S, Shah D (2011) Iterative learning for reliable crowdsourcing systems. In: Neural information processing systems (NIPS)
Khattak FK, Salleb-Aouissi A (2011) Quality control of crowd labeling through expert evaluation. In: Proceedings of the NIPS 2nd workshop on computational social science and the wisdom of crowds
Wauthier FL, Jordan MI (2011) Bayesian bias mitigation for crowdsourcing. In: Neural information processing systems (NIPS)
Smyth P, Fayyad U, Burl M et al (1995) Inferring ground truth from subjective labelling of venus images. In: Advances in neural information processing systems, pp 1085–1092
Wiebe JM, Bruce RF, O’Hara TP (1999) Development and use of a gold-standard data set for subjectivity classifications. In: Proceedings of the 37th annual meeting of the association for computational linguistics on computational linguistics, pp 246–253
Eagle N (2009) Txteagle: mobile crowdsourcing. In: Internationalization, design and global development. Springer, Berlin, pp 447–456
Yan Y, Rosales R, Fung G et al (2010) Modeling annotator expertise: learning when everybody knows a bit of something. In: International conference on artificial intelligence and statistics, pp 932–939
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc, Ser B 39:1–38
Kajino H, Tsuboi Y, Kashima H (2012) A convex formulation for learning from crowds. In: Proceedings of the 26th AAAI conference on artificial intelligence, pp 73–79
Karger DR, Oh S, Shah D (2012) Budget-optimal task allocation for reliable crowdsourcing systems. arXiv:1110.3564v3
Dietterich TG (2000) Ensemble methods in machine learning. In: Multiple classifier systems. Springer, Berlin, pp 1–15
Dietterich TG (2000) An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn 40:139–157
Valentini G, Dietterich TG (2004) Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods. J Mach Learn Res 5:725–775
Cho SB, Won H-H (2007) Cancer classification using ensemble of neural networks with multiple significant gene subsets. Appl Intell 26:243–250
Lin Z, Hao Z, Yang X, Liu X (2009) Several SVM ensemble methods integrated with under-sampling for imbalanced data learning. In: Advanced data mining and applications. Springer, Berlin, pp 536–544
Maudes J, Rodríguez JJ, García-Osorio C, Pardo C (2011) Random projections for linear SVM ensembles. Appl Intell 34:347–359
Canuto AM, Santos AM, Vargas RR (2011) Ensembles of ARTMAP-based neural networks: an experimental study. Appl Intell 35:1–17
Khor K-C, Ting C-Y, Phon-Amnuaisuk S (2012) A cascaded classifier approach for improving detection rates on rare attack categories in network intrusion detection. Appl Intell 36:320–329
Lee H, Kim E, Pedrycz W (2012) A new selective neural network ensemble with negative correlation. Appl Intell 37:488–498
Wang C-W, You W-H (2013) Boosting-SVM: effective learning with reduced data dimension. Appl Intell 39(3):465–474
Park S, Lee SR (2014) Red tides prediction system using fuzzy reasoning and the ensemble method. Appl Intell 40(2):244–255
Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ (2013) On the effect of calibration in classifier combination. Appl Intell 38(4):566–585
Sakar CO, Kursun O, Gurgen F (2014) Ensemble canonical correlation analysis. Appl Intell 40(2):291–304
Fahim M, Fatima I, Lee S, Lee Y-K (2013) EEM: evolutionary ensembles model for activity recognition in smart homes. Appl Intell 38:88–98
Gamberger D, Lavrac N, Dzeroski S (2000) Noise detection and elimination in data preprocessing: experiments in medical domains. Appl Artif Intell 14:205–223
Lallich S, Muhlenbach F, Zighed DA (2002) Improving classification by removing or relabeling mislabeled instances. In: Foundations of intelligent systems. Springer, Berlin, pp 5–15
Verbaeten S, Van Assche A (2003) Ensemble methods for noise elimination in classification problems. In: Multiple classifier systems. Springer, Berlin, pp 317–325
Guan D, Yuan W, Lee Y-K, Lee S (2011) Identifying mislabeled training data with the aid of unlabeled data. Appl Intell 35:345–358
Young J, Ashburner J, Ourselin S (2013) Wrapper methods to correct mislabelled training data. In: International workshop on pattern recognition in neuroimaging (PRNI), pp 170–173
Wasserman L (2003) All of statistics: a concise course in statistical inference. Springer, New York
Cite this article
Amirkhani, H., Rahmati, M. Agreement/disagreement based crowd labeling. Appl Intell 41, 212–222 (2014). https://doi.org/10.1007/s10489-014-0516-2