Abstract
Learning from noisy labels has attracted growing interest in the era of big data. In crowdsourcing practice, however, extracting ground truth labels from the noisy labels provided by crowds remains challenging. In this paper, we propose a latent variable model, built on a probabilistic logistic matrix factorization model and the classical Gaussian mixture model, for inferring ground truth labels from noisy, crowdsourced ones. In contrast to previous works, the proposed model incorporates item heterogeneity and allows for vector space embeddings of both items and worker labels. Moreover, we derive a tractable mean-field variational inference algorithm to approximate the model posterior. We also investigate the related MAP approximation to the model posterior in order to identify links to existing works. Empirically, we demonstrate that the proposed method achieves good inference accuracy while preserving meaningful uncertainty measures in the embeddings, and therefore better reflects the intrinsic structure of the data.
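To make the combination of logistic matrix factorization and a Gaussian mixture more concrete, the following is a minimal generative sketch in Python. It is not the specification used in the paper, which the abstract does not give. It assumes that each mixture component corresponds to one ground truth class, that item embeddings are drawn around their component mean, and that a worker's label distribution is a softmax over inner products between the item embedding and that worker's per-label embeddings. The function name `sample_crowd_labels`, its parameters, and the scale constants are hypothetical.

```python
import numpy as np

def sample_crowd_labels(n_items=200, n_workers=10, n_classes=3, dim=2, seed=0):
    """Sample noisy labels from an assumed generative process (illustration only)."""
    rng = np.random.default_rng(seed)

    # Assumed Gaussian mixture over item embeddings: one component per true class.
    means = 3.0 * rng.standard_normal((n_classes, dim))
    truth = rng.integers(0, n_classes, size=n_items)  # latent ground truth labels
    items = means[truth] + 0.5 * rng.standard_normal((n_items, dim))

    # Assumed embedding for every (worker, label) pair.
    worker_labels = rng.standard_normal((n_workers, n_classes, dim))

    # Logistic (softmax) factorization:
    # logits[i, j, k] = <items[i], worker_labels[j, k]>.
    logits = np.einsum("id,jkd->ijk", items, worker_labels)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Draw one label per (item, worker) pair by inverse-CDF sampling.
    u = rng.random((n_items, n_workers, 1))
    labels = (u > probs.cumsum(axis=-1)).sum(axis=-1)
    labels = np.minimum(labels, n_classes - 1)  # guard against rounding in cumsum
    return truth, labels, probs
```

Under these assumptions, inference runs in the opposite direction: given only `labels`, one would approximate the posterior over the embeddings and over `truth`, for example with the mean-field variational algorithm mentioned in the abstract.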
Acknowledgements
We thank the reviewers for providing valuable comments. Junhui Wang’s research is supported in part by HKRGC Grants GRF-11303918 and GRF-11300919.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Yao, W., Lee, W., Wang, J. (2021). Learning from Crowds via Joint Probabilistic Matrix Factorization and Clustering in Latent Space. In: Dong, Y., Mladenić, D., Saunders, C. (eds) Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12460. Springer, Cham. https://doi.org/10.1007/978-3-030-67667-4_33
DOI: https://doi.org/10.1007/978-3-030-67667-4_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67666-7
Online ISBN: 978-3-030-67667-4
eBook Packages: Computer Science, Computer Science (R0)