Abstract
Semi-supervised learning aims to train accurate prediction models on both labeled and unlabeled data. Its success strongly depends on how pseudo-labeled data are selected. The standard approach is to select instances based on the pseudo-label confidence values they receive from the prediction models. In this paper we argue that this is an indirect approach w.r.t. the main goal of semi-supervised learning. Instead, we propose a direct approach that selects pseudo-labeled instances based on their individual contributions to the performance of the prediction models. These individual contributions are computed as Shapley values w.r.t. characteristic functions related to model performance. Experiments show that our approach outperforms the standard one when used in semi-supervised wrappers.
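To make the selection criterion concrete, the following is a minimal sketch, not the authors' exact implementation: a permutation-sampling Monte Carlo estimate of data Shapley values for pseudo-labeled instances, with validation accuracy standing in for the characteristic function. The classifier, the permutation budget, and the keep-if-positive rule are illustrative assumptions.

```python
# Minimal sketch of Shapley-value selection of pseudo-labeled instances.
# Assumptions (not taken from the paper): a scikit-learn classifier,
# validation accuracy as the characteristic function, a fixed permutation
# budget, and keeping instances with positive estimated value.
import numpy as np
from sklearn.linear_model import LogisticRegression


def _accuracy(X, y, X_val, y_val):
    """Characteristic function: validation accuracy of a model trained on (X, y)."""
    model = LogisticRegression(max_iter=1000).fit(np.asarray(X), np.asarray(y))
    return model.score(X_val, y_val)


def shapley_select(X_lab, y_lab, X_pseudo, y_pseudo, X_val, y_val,
                   n_permutations=50, seed=0):
    """Monte Carlo data-Shapley estimates for pseudo-labeled instances;
    returns the indices of instances with positive estimated value."""
    rng = np.random.default_rng(seed)
    n = len(X_pseudo)
    phi = np.zeros(n)
    for _ in range(n_permutations):
        order = rng.permutation(n)
        X_cur, y_cur = list(X_lab), list(y_lab)
        prev = _accuracy(X_cur, y_cur, X_val, y_val)
        for i in order:  # add pseudo-labeled instances one by one
            X_cur.append(X_pseudo[i])
            y_cur.append(y_pseudo[i])
            score = _accuracy(X_cur, y_cur, X_val, y_val)
            phi[i] += score - prev  # marginal contribution of instance i
            prev = score
    phi /= n_permutations
    return np.flatnonzero(phi > 0)
```

The selected instances are then added to the labeled set at exactly the point in a semi-supervised wrapper where a confidence threshold would otherwise decide.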
Notes
1. For example, if the prediction model is a probabilistic classifier, the confidence value is the posterior probability of the label for an unlabeled instance (see the code sketch after these notes).
2. Note that \(T = L \cup L' \cup UL\) in formula (1) in our case.
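For contrast, the standard confidence-based selection described in note 1 can be sketched as follows; the scikit-learn-style interface and the 0.95 threshold are illustrative assumptions, not values from the paper.

```python
# Standard confidence-based selection (note 1): the confidence of a pseudo-label
# is the posterior probability the classifier assigns to its predicted label.
import numpy as np


def confidence_select(model, X_unlabeled, threshold=0.95):
    proba = model.predict_proba(X_unlabeled)       # posterior class probabilities
    confidence = proba.max(axis=1)                 # confidence of the predicted label
    pseudo_labels = model.classes_[proba.argmax(axis=1)]
    keep = confidence >= threshold                 # keep only high-confidence instances
    return np.flatnonzero(keep), pseudo_labels[keep]
```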
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Courtnage, C., Smirnov, E. (2021). Shapley-Value Data Valuation for Semi-supervised Learning. In: Soares, C., Torgo, L. (eds.) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol. 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88941-8
Online ISBN: 978-3-030-88942-5