Abstract
In this article, we propose an improvement over the GLAD algorithm that increases its efficiency and accuracy on problems with large datasets. The GLAD algorithm allows practitioners to learn from instances labeled by multiple annotators, taking into account both the quality of their annotations and the difficulty of each instance. However, due to its number of parameters, the model does not scale easily to large datasets, especially when execution time is limited. Our proposal, CGLAD, addresses these problems by clustering the vectors obtained from a factorization of the annotation matrix. This approach drastically reduces the number of parameters of the model, making the GLAD strategy easier to apply, and more efficient, on multiple-annotator problems.
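The pipeline the abstract describes can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: binary labels encoded as ±1 with 0 for missing annotations, a plain SVD standing in for the matrix factorization, and Lloyd's k-means as the clustering step; all function and parameter names here are illustrative. Instances that land in the same cluster would then share a single GLAD difficulty parameter, which is what reduces the parameter count.

```python
import numpy as np

def cluster_instances(A, n_factors=2, n_clusters=3, n_iter=20, seed=0):
    """Factorize the instance-by-annotator annotation matrix A
    (entries in {-1, +1}, 0 = missing) and cluster the resulting
    instance vectors, so a per-cluster difficulty can replace
    GLAD's per-instance difficulty. Returns one cluster id per row."""
    rng = np.random.default_rng(seed)
    # Low-rank factorization of the annotation matrix (SVD here as a
    # stand-in for the factorization used in the paper).
    U, S, _ = np.linalg.svd(A, full_matrices=False)
    X = U[:, :n_factors] * S[:n_factors]  # one vector per instance
    # Plain k-means (Lloyd's algorithm) on the instance vectors.
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Toy usage: 10 instances annotated by 5 annotators, labels in {-1, +1}.
A = np.sign(np.random.default_rng(1).standard_normal((10, 5)))
labels = cluster_instances(A, n_factors=2, n_clusters=3)
```

After this step, a GLAD-style EM procedure would estimate one difficulty parameter per cluster instead of one per instance, so the parameter count grows with the number of clusters rather than with the dataset size.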
This work has been partially funded by the Spanish Research Agency (AEI) and FEDER (UE) through project TIN2016-77902-C3-1-P. Enrique G. Rodrigo has also been funded by the FPU scholarship FPU15/02281 by MECD.
Notes
1. The datasets are available in .csv and .parquet format at the following link: http://bit.ly/ideal2018-cglad.
2. In this paper we did not use parameter tuning, as it is not trivial in real datasets (because of the lack of ground truth), and in order to obtain more general results.
3. In problems with large datasets, GLAD fails to converge due to the high number of parameters to optimize, so it does not reach a good enough solution and the iteration process stops quickly. Because of this, CGLAD and GLAD cannot be fairly compared in terms of execution time on the bigger datasets.
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Rodrigo, E.G., Aledo, J.A., Gamez, J.A. (2018). CGLAD: Using GLAD in Crowdsourced Large Datasets. In: Yin, H., Camacho, D., Novais, P., Tallón-Ballesteros, A. (eds.) Intelligent Data Engineering and Automated Learning – IDEAL 2018. Lecture Notes in Computer Science, vol. 11314. Springer, Cham. https://doi.org/10.1007/978-3-030-03493-1_81
Print ISBN: 978-3-030-03492-4
Online ISBN: 978-3-030-03493-1