Abstract
In this article, we propose an improvement over the GLAD algorithm that increases its efficiency and accuracy on problems with large datasets. The GLAD algorithm allows practitioners to learn from instances labeled by multiple annotators, taking into account both the quality of their annotations and the difficulty of each instance. However, due to its number of parameters, the model does not scale easily to large datasets, especially when execution time is limited. Our proposal, CGLAD, addresses these problems by clustering the vectors obtained from a factorization of the annotation matrix. This approach drastically reduces the number of parameters of the model, making the GLAD strategy easier to apply, and more efficient, on multiple-annotator problems.
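The pipeline the abstract describes can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: binary labels encoded as ±1 with 0 for missing annotations, a plain SVD standing in for the matrix factorization, and Lloyd's k-means as the clustering step; all function and parameter names here are illustrative. Instances that land in the same cluster would then share a single GLAD difficulty parameter, which is what reduces the parameter count.

```python
import numpy as np

def cluster_instances(A, n_factors=2, n_clusters=3, n_iter=20, seed=0):
    """Factorize the instance-by-annotator annotation matrix A
    (entries in {-1, +1}, 0 = missing) and cluster the resulting
    instance vectors, so a per-cluster difficulty can replace
    GLAD's per-instance difficulty. Returns one cluster id per row."""
    rng = np.random.default_rng(seed)
    # Low-rank factorization of the annotation matrix (SVD here as a
    # stand-in for the factorization used in the paper).
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    X = U[:, :n_factors] * S[:n_factors]  # one vector per instance
    # Plain k-means (Lloyd's algorithm) on the instance vectors.
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Toy usage: 10 instances annotated by 5 annotators, labels in {-1, +1}.
A = np.sign(np.random.default_rng(1).standard_normal((10, 5)))
labels = cluster_instances(A, n_factors=2, n_clusters=3)
```

After this step, a GLAD-style EM procedure would estimate one difficulty parameter per cluster instead of one per instance, so the parameter count grows with the number of clusters rather than with the dataset size.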
This work has been partially funded by the Spanish Research Agency (AEI) and FEDER (UE) through project TIN2016-77902-C3-1-P. Enrique G. Rodrigo has also been funded by the FPU scholarship FPU15/02281 by MECD.
Notes
1. The datasets are available in .csv and .parquet format at the following link: http://bit.ly/ideal2018-cglad.
2. In this paper we did not use parameter tuning, as it is not trivial in real datasets (because of the lack of ground truth), and in order to obtain more general results.
3. In problems with large datasets, GLAD fails to converge due to the high number of parameters to optimize, so it does not reach a good enough solution and the iteration process stops quickly. Because of this, CGLAD and GLAD cannot be fairly compared in terms of execution time on the bigger datasets.
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Rodrigo, E.G., Aledo, J.A., Gamez, J.A. (2018). CGLAD: Using GLAD in Crowdsourced Large Datasets. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds.) Intelligent Data Engineering and Automated Learning – IDEAL 2018. Lecture Notes in Computer Science, vol. 11314. Springer, Cham. https://doi.org/10.1007/978-3-030-03493-1_81
Print ISBN: 978-3-030-03492-4
Online ISBN: 978-3-030-03493-1