Effects of single and multiple imputation strategies on addressing over-fitting issues caused by imbalanced data from various scenarios

Yang, Jiaxi; Wang, Yihan; Yang, Ye; Ding, Kai; Na, Chongning; Yang, Yao

doi:10.1007/s10489-024-05295-3

Published: 12 February 2024

Applied Intelligence, Volume 54, pages 2812–2830 (2024)

Abstract

Missing values are a critical issue in most machine learning tasks because they can alter the distribution of the training data and consequently lead to overfitting. The theoretical framework for missing value imputation has reached a considerable level of maturity, and numerous imputation models have been proposed. However, little research has examined the underlying causes of missing values or scenarios in which imbalanced data is significantly correlated with the target variable due to business logic. In this study, we conducted simulation studies to evaluate the performance of six imputation models on six datasets under three missing mechanisms (random dropout, imbalance dropout based on features, and imbalance dropout based on labels) in order to identify an appropriate approach for imbalanced missing data with certain patterns. By recognizing the missing pattern and imputing the data with a suitable imputation method, we significantly mitigated the overfitting caused by missingness in a real-world application.
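The three missing mechanisms named in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure, which this page does not spell out; it assumes that "random dropout" removes values with one fixed probability, that feature-based dropout ties the probability to another feature, and that label-based dropout ties it to the target class. The column names (`age`, `income`, `label`), the rates, and the helper functions `dropout` and `missing_rate` are hypothetical.

```python
import random


def dropout(rows, column, rate_of, rng):
    """Return copies of rows with `column` set to None with probability rate_of(row)."""
    return [
        {**row, column: None} if rng.random() < rate_of(row) else dict(row)
        for row in rows
    ]


def missing_rate(rows, column, keep=lambda row: True):
    """Fraction of the selected rows whose `column` is missing."""
    selected = [row for row in rows if keep(row)]
    return sum(row[column] is None for row in selected) / len(selected)


rng = random.Random(0)

# Toy data (assumed): `income` is the feature to be masked,
# `age` is a second feature, `label` is a binary target.
rows = [
    {
        "age": rng.randint(18, 80),
        "income": rng.gauss(50, 10),
        "label": rng.random() < 0.3,
    }
    for _ in range(10000)
]

# Random dropout: every row has the same chance of losing the value.
random_rows = dropout(rows, "income", lambda row: 0.2, rng)

# Imbalance dropout based on features: the chance depends on another feature.
feature_rows = dropout(
    rows, "income", lambda row: 0.4 if row["age"] > 50 else 0.05, rng
)

# Imbalance dropout based on labels: the chance depends on the target class.
label_rows = dropout(
    rows, "income", lambda row: 0.4 if row["label"] else 0.05, rng
)

print("random:", missing_rate(random_rows, "income"))
print("label=True:", missing_rate(label_rows, "income", lambda r: r["label"]))
print("label=False:", missing_rate(label_rows, "income", lambda r: not r["label"]))
```

Under the label-based mask, missingness itself carries information about the target, which is one way a model trained on such data could overfit to the missing pattern rather than to the feature values.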

Data Availability

The datasets used in this paper are partially available. Five public datasets were selected from the UCI Machine Learning Repository and Kaggle; one private dataset, named Car, cannot be shared publicly due to commercial confidentiality terms. The five public datasets are as follows. The Adult dataset is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/2/adult. The Bank dataset is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/222/bank+marketing. The Churn dataset is available on Kaggle at https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset. The Credit dataset is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients. The Online dataset is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (grant nos. 2022YFF0712400, 2022YFB4501500, and 2022YFB4501503) and the National Natural Science Foundation of China (grant no. 12201580). Jiaxi Yang, Yihan Wang, and Ye Yang contributed equally to this work. We thank Yao Yang and Jiaxi Yang for proposing the novel ideas and designing the entire experimental work, Ye Yang and Yihan Wang for the coding and experimental work, and Jiaxi Yang and Yihan Wang for their hard work on this manuscript. We also thank Kai Ding for providing GPU resources, which greatly accelerated the deep-learning-based data imputation. Chongning Na and Yao Yang supervised the research.

Author information

Jiaxi Yang and Yihan Wang contributed equally to this work.

Authors and Affiliations

Zhejiang Lab, Hangzhou, 311100, Zhejiang, China
Jiaxi Yang, Yihan Wang, Kai Ding, Chongning Na & Yao Yang
College of Control Science and Engineering, Zhejiang University, Hangzhou, 310058, Zhejiang, China
Ye Yang

Corresponding author

Correspondence to Yao Yang.

Ethics declarations

Conflict of Interest

The authors have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, J., Wang, Y., Yang, Y. et al. Effects of single and multiple imputation strategies on addressing over-fitting issues caused by imbalanced data from various scenarios. Appl Intell 54, 2812–2830 (2024). https://doi.org/10.1007/s10489-024-05295-3

Accepted: 28 January 2024
Published: 12 February 2024
Issue Date: February 2024
DOI: https://doi.org/10.1007/s10489-024-05295-3
