Effects of single and multiple imputation strategies on addressing over-fitting issues caused by imbalanced data from various scenarios

Applied Intelligence

Abstract

Missing values are a critical issue in most machine learning tasks because they can alter the distribution of the training data and consequently lead to overfitting. The theoretical framework for missing value imputation is considerably mature, and numerous imputation models have been proposed. However, limited research has addressed the underlying causes of missing values or scenarios in which imbalanced data is significantly correlated with target variables due to business logic. In this study, we conducted simulation studies to evaluate the performance of six imputation models on six datasets under three missing mechanisms (random dropout, imbalance dropout based on features, and imbalance dropout based on labels) in order to identify an appropriate approach for handling imbalanced missing data with certain patterns. By recognizing the missing pattern and imputing the data with a suitable imputation method, the overfitting caused by missingness was significantly mitigated in a real-world application.
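The three missing mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the masks were generated, so everything below is an assumption made for illustration: the function names, the use of a median split on a "driver" feature, the binary labels, and the dropout rates are hypothetical and are not the authors' procedure.

```python
import random
import statistics

MISSING = None  # placeholder for a dropped value (an illustrative choice)


def random_dropout(rows, rate, rng):
    """Drop every cell independently with the same probability."""
    return [[MISSING if rng.random() < rate else v for v in row] for row in rows]


def feature_dropout(rows, col, driver, rate_high, rate_low, rng):
    """Drop `col` more (or less) often when the `driver` feature is above its median.

    The median split is an assumed rule, not taken from the paper.
    """
    cut = statistics.median(r[driver] for r in rows)
    out = []
    for r in rows:
        rate = rate_high if r[driver] > cut else rate_low
        new = list(r)
        if rng.random() < rate:
            new[col] = MISSING
        out.append(new)
    return out


def label_dropout(rows, labels, col, rate_pos, rate_neg, rng):
    """Drop `col` at a rate that depends on the (assumed binary) label."""
    out = []
    for r, y in zip(rows, labels):
        rate = rate_pos if y == 1 else rate_neg
        new = list(r)
        if rng.random() < rate:
            new[col] = MISSING
        out.append(new)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
    labels = [0, 1, 0, 1]
    print(random_dropout(rows, 0.3, rng))
    print(feature_dropout(rows, col=0, driver=1, rate_high=0.8, rate_low=0.1, rng=rng))
    print(label_dropout(rows, labels, col=0, rate_pos=0.8, rate_neg=0.1, rng=rng))
```

In the second and third functions the missingness is unevenly distributed across feature values or classes, which is the kind of imbalanced pattern the study sets out to examine. Passing an explicit `random.Random` instance keeps each simulated mask reproducible.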

Data Availability

The datasets used in this paper are partially available. Five public datasets were selected from the UCI Machine Learning Repository and Kaggle, and one private dataset, named Car, cannot be shared publicly due to commercial confidentiality terms. The five public datasets are as follows. The Adult dataset is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/2/adult. The Bank dataset is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/222/bank+marketing. The Churn dataset is available on Kaggle at https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset. The Credit dataset is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients. The Online dataset is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (grant no. 2022YFF0712400), the National Key Research and Development Program of China (grant nos. 2022YFB4501500, 2022YFB4501503), and the National Natural Science Foundation of China (grant no. 12201580). Thanks to Jiaxi Yang, Yihan Wang, and Ye Yang, who contributed equally to this work. Specifically, we acknowledge Yao Yang and Jiaxi Yang for coming up with novel ideas and designing the entire experimental work. Furthermore, we gratefully thank Ye Yang and Yihan Wang for their coding and experimental work. Thanks also go to Jiaxi Yang and Yihan Wang for their hard work on this manuscript. Last but not least, sincere thanks are given to Kai Ding for his GPU resources, which greatly helped to accelerate the deep-learning-based data imputation. Chongning Na and Yao Yang supervised the research.

Author information

Corresponding author

Correspondence to Yao Yang.

Ethics declarations

Conflict of Interest

The authors have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, J., Wang, Y., Yang, Y. et al. Effects of single and multiple imputation strategies on addressing over-fitting issues caused by imbalanced data from various scenarios. Appl Intell 54, 2812–2830 (2024). https://doi.org/10.1007/s10489-024-05295-3

  • DOI: https://doi.org/10.1007/s10489-024-05295-3
