Shapley-Value Data Valuation for Semi-supervised Learning

  • Conference paper
  • First Online:
Discovery Science (DS 2021)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 12986))

Abstract

Semi-supervised learning aims at training accurate prediction models on labeled and unlabeled data. Its success strongly depends on how pseudo-labeled data are selected. The standard approach is to select instances based on the pseudo-label confidence values that they receive from the prediction models. In this paper, we argue that this is an indirect approach w.r.t. the main goal of semi-supervised learning. Instead, we propose a direct approach that selects pseudo-labeled instances based on their individual contributions to the performance of the prediction models. The individual instance contributions are computed as Shapley values w.r.t. characteristic functions related to model performance. Experiments show that our approach outperforms the standard one when used in semi-supervised wrappers.
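The abstract itself contains no code; the following is only a minimal sketch of the general idea of valuing data points by Shapley values, in the spirit of Monte Carlo data valuation (cf. Ghorbani and Zou [6]). The function name, data layout, and the toy characteristic function are illustrative assumptions, not the paper's implementation; in practice `v` would be, e.g., validation accuracy of a model retrained on the coalition.

```python
import random

def monte_carlo_shapley(points, v, num_permutations=200, seed=0):
    """Estimate the Shapley value of each data point w.r.t. a
    characteristic function v (a set function, e.g. model performance
    on a validation set when trained on the given coalition).

    points: list of hashable data-point identifiers
    v: callable mapping a set of points to a real-valued performance
    """
    rng = random.Random(seed)
    shapley = {p: 0.0 for p in points}
    for _ in range(num_permutations):
        perm = points[:]
        rng.shuffle(perm)                # random arrival order
        coalition = set()
        prev_value = v(coalition)        # value of the empty coalition
        for p in perm:
            coalition.add(p)
            value = v(coalition)
            shapley[p] += value - prev_value   # marginal contribution
            prev_value = value
    # average marginal contributions over sampled permutations
    return {p: s / num_permutations for p, s in shapley.items()}
```

A semi-supervised wrapper could then keep only pseudo-labeled instances whose estimated Shapley value is positive, i.e. those that on average improve the characteristic function, instead of thresholding a confidence score.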


Notes

  1. For example, if the prediction model is a probabilistic classifier, the confidence value is the posterior probability of the label for an unlabeled instance.

  2. Note that T equals \(L \cup L^\prime \cup UL\) in formula 1 in our case.

  3. Because we use exactly the same cross-validation setup, we do not rerun the experiments with the 14 classifiers from [17]; instead, we process the experimental data from [17] together with ours.
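The standard, confidence-based selection rule from Note 1 can be sketched as follows. This is a hedged illustration only; the function name, the `(instance_id, class_probabilities)` layout, and the threshold value are assumptions, not taken from the paper.

```python
def select_by_confidence(probas, threshold=0.9):
    """Standard pseudo-label selection: keep unlabeled instances whose
    highest posterior probability exceeds a threshold.

    probas: list of (instance_id, class_probabilities) pairs, where
            class_probabilities is a list summing to 1
    returns: list of (instance_id, pseudo_label, confidence)
    """
    selected = []
    for idx, dist in probas:
        label = max(range(len(dist)), key=dist.__getitem__)  # argmax class
        conf = dist[label]                                   # posterior of that class
        if conf >= threshold:
            selected.append((idx, label, conf))
    return selected
```

This rule is exactly the "indirect" baseline the paper argues against: it ranks instances by the model's own posterior probabilities rather than by their measured contribution to model performance.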

References

  1. Berthelot, D., Carlini, N., Goodfellow, I.J., Papernot, N., Oliver, A., Raffel, C.: MixMatch: a holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems 32, pp. 5050–5060 (2019)

  2. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT 1998, pp. 92–100. ACM (1998)

  3. Deng, C., Guo, M.Z.: Tri-training and data editing based semi-supervised clustering algorithm. In: Gelbukh, A., Reyes-Garcia, C.A. (eds.) MICAI 2006. LNCS (LNAI), vol. 4293, pp. 641–651. Springer, Heidelberg (2006). https://doi.org/10.1007/11925231_61

  4. Deng, C., Guo, M.: A new co-training-style random forest for computer aided diagnosis. J. Intell. Inf. Syst. 36(3), 253–281 (2011)

  5. Dheeru, D., Casey, G.: UCI machine learning repository (2017)

  6. Ghorbani, A., Zou, J.: Data Shapley: equitable valuation of data for machine learning. In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019. Proceedings of Machine Learning Research, vol. 97, pp. 2242–2251. PMLR (2019)

  7. Hady, M., Schwenker, F.: Co-training by committee: a new semi-supervised learning framework. In: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), pp. 563–572. IEEE Computer Society (2008)

  8. Halder, A., Ghosh, S., Ghosh, A.: Ant based semi-supervised classification. In: Dorigo, M., et al. (eds.) ANTS 2010. LNCS, vol. 6234, pp. 376–383. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15461-4_34

  9. He, J., Gu, J., Shen, J., Ranzato, M.: Revisiting self-training for neural sequence generation. In: Proceedings of the 8th International Conference on Learning Representations, ICLR 2020 (2020). OpenReview.net

  10. He, W., Jiang, Z.: Semi-supervised learning with the EM algorithm: a comparative study between unstructured and structured prediction. CoRR, abs/2008.12442 (2020)

  11. Huang, T., Yu, Y., Guo, G., Li, K.: A classification algorithm based on local cluster centers with a few labeled training examples. Knowl. Based Syst. 23(6), 563–571 (2010)

  12. Ishii, M.: Semi-supervised learning by selective training with pseudo labels via confidence estimation. CoRR, abs/2103.08193 (2021)

  13. Jia, R., et al.: Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endow. 12(11), 1610–1623 (2019)

  14. Li, M., Zhou, Z.H.: SETRED: self-training with editing. In: Ho, T.B., Cheung, D., Liu, H. (eds.) PAKDD 2005. LNCS (LNAI), vol. 3518, pp. 611–621. Springer, Heidelberg (2005). https://doi.org/10.1007/11430919_71

  15. Li, M., Zhou, Z.H.: Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans. Syst. Man Cybern. Part A 37(6), 1088–1098 (2007)

  16. Shapley, L.S.: A value for n-person games. Ann. Math. Stud. 28, 307–317 (1953)

  17. Triguero, I., García, S., Herrera, F.: Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst. 42(2), 245–284 (2013). https://doi.org/10.1007/s10115-013-0706-y

  18. van Engelen, J.E., Hoos, H.H.: A survey on semi-supervised learning. Mach. Learn. 109(2), 373–440 (2019). https://doi.org/10.1007/s10994-019-05855-6

  19. Wang, J., Luo, S., Zeng, X.: A random subspace method for co-training. In: Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, pp. 195–200. IEEE (2008)

  20. Wang, Y., Xu, X., Zhao, H., Hua, Z.: Semi-supervised learning based on nearest neighbor rule and cut edges. Knowl. Based Syst. 23(6), 547–554 (2010)

  21. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196. Morgan Kaufmann Publishers/ACL (1995)

  22. Yaslan, Y., Cataltepe, Z.: Co-training with relevant random subspaces. Neurocomputing 73(10–12), 1652–1661 (2010)

  23. Zhou, Y., Goldman, S.: Democratic co-learning. In: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), pp. 594–602. IEEE Computer Society (2004)

  24. Zhou, Z.H., Li, M.: Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng. 17(11), 1529–1541 (2005)

  25. Zhu, X., Schleif, F.M., Hammer, B.: Adaptive conformal semi-supervised vector quantization for dissimilarity data. Pattern Recogn. Lett. 49, 138–145 (2014)

Author information

Correspondence to Evgueni Smirnov.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Courtnage, C., Smirnov, E. (2021). Shapley-Value Data Valuation for Semi-supervised Learning. In: Soares, C., Torgo, L. (eds) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol. 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-88942-5_8

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88941-8

  • Online ISBN: 978-3-030-88942-5

  • eBook Packages: Computer Science; Computer Science (R0)
