K-Fold Cross-Valuation for Machine Learning Using Shapley Value

He, Qiangqiang; Zhang, Mujie; Zhang, Jie; Yang, Shang; Wang, Chongjun

doi:10.1007/978-3-031-44213-1_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14256)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Data valuation using the Shapley value has recently attracted significant attention. Existing approaches typically value the training data by using the model’s performance on a validation set as the utility function. However, because the validation set is often a small subset of the complete dataset, a dataset shift between the training and validation sets can bias the valuation. To address this issue, this paper proposes a k-fold cross-validation method based on the Shapley value. Specifically, the dataset is divided into $k$ subsets, and each subset serves in turn as the validation set for computing the Shapley values of the training set formed by the remaining $k-1$ subsets. Each data instance therefore receives $k-1$ valuations, one for each fold in which it belongs to the training set, and their average is taken as its final valuation. Because the cost of computing exact Shapley values grows exponentially with the number of data points, we propose Monte Carlo permutation, incremental learning, and batch data valuation techniques. Together they approximate the true Shapley value as closely as possible while reducing computation time. Extensive experiments demonstrate the effectiveness of our method, especially when the validation set contains noise and outliers.
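
The procedure described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`knn1_accuracy`, `mc_shapley`, `k_fold_cross_valuation`) are hypothetical, the utility is assumed to be the validation accuracy of a 1-nearest-neighbour classifier (chosen because it needs no training), and the incremental learning and batch valuation speed-ups mentioned in the abstract are not shown.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[Tuple[float, ...], int]  # (feature vector, class label)
Utility = Callable[[Sequence[Point], Sequence[Point]], float]


def knn1_accuracy(train: Sequence[Point], valid: Sequence[Point]) -> float:
    """Assumed utility: accuracy of a 1-NN classifier on the validation fold."""
    if not train:
        return 0.0  # an empty training set is given zero utility
    correct = 0
    for x, y in valid:
        nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        correct += nearest[1] == y
    return correct / len(valid)


def mc_shapley(train: Sequence[Point], valid: Sequence[Point],
               utility: Utility, n_perm: int, rng: random.Random) -> List[float]:
    """Monte Carlo permutation estimate of each training point's Shapley value."""
    n = len(train)
    values = [0.0] * n
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        subset: List[Point] = []
        prev = utility(subset, valid)
        for i in order:
            # Marginal contribution of point i given the points that precede it.
            subset.append(train[i])
            cur = utility(subset, valid)
            values[i] += (cur - prev) / n_perm
            prev = cur
    return values


def k_fold_cross_valuation(data: Sequence[Point], k: int,
                           utility: Utility = knn1_accuracy,
                           n_perm: int = 20, seed: int = 0) -> List[float]:
    """Average each point's Shapley estimate over the k-1 folds where it is trained on."""
    if k < 2 or len(data) < k:
        raise ValueError("need k >= 2 and at least k data points")
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k disjoint, non-empty index sets
    totals = [0.0] * len(data)
    for f in range(k):
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        train = [data[i] for i in train_idx]
        valid = [data[i] for i in folds[f]]
        for i, v in zip(train_idx, mc_shapley(train, valid, utility, n_perm, rng)):
            totals[i] += v
    # Every point is in the training set of exactly k-1 folds.
    return [t / (k - 1) for t in totals]


if __name__ == "__main__":
    toy = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
           ((1.0, 1.0), 1), ((0.9, 1.1), 1), ((1.1, 0.9), 1)]
    for point, value in zip(toy, k_fold_cross_valuation(toy, k=3)):
        print(point, round(value, 3))
```

Within a single fold, the marginal contributions along each permutation telescope, so the estimates sum to the utility of the full training set minus that of the empty set. The number of utility evaluations is `k * n_perm * n_train`, which is where a cheaper model update or valuing data in batches would matter for larger models.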

Acknowledgement

This paper is supported by the National Natural Science Foundation of China (Grant Nos. 62192783 and U1811462) and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.

Author information

Authors and Affiliations

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Qiangqiang He, Mujie Zhang, Jie Zhang, Shang Yang & Chongjun Wang

Corresponding author

Correspondence to Chongjun Wang.

Editor information

Editors and Affiliations

Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
Lancaster University, Lancaster, UK
Plamen Angelov
Teesside University, Middlesbrough, UK
Chrisina Jayne

Cite this paper

He, Q., Zhang, M., Zhang, J., Yang, S., Wang, C. (2023). K-Fold Cross-Valuation for Machine Learning Using Shapley Value. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14256. Springer, Cham. https://doi.org/10.1007/978-3-031-44213-1_5

DOI: https://doi.org/10.1007/978-3-031-44213-1_5
Published: 22 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44212-4
Online ISBN: 978-3-031-44213-1
eBook Packages: Computer Science, Computer Science (R0)
