Incomplete data modeling based on alternate update of clustering and autoencoder for missing value imputation

  • Original Article
Neural Computing and Applications

Abstract

Missing values are widespread in real-world datasets and restrict the performance of data mining. In this paper, we propose a joint optimization framework that mines attribute associations and category structures in incomplete datasets, so that missing values are imputed with a fuller understanding of the data structure. Because attribute correlations differ among sample categories, we partition the incomplete data into fuzzy subsets by fuzzy clustering. Within each subset, a tracking-removed autoencoder is constructed as a submodel to fit the regression relationships among attributes. Because fuzzy clustering and regression modeling influence each other, we further propose a training scheme that treats missing values as variables and iteratively optimizes the two processes. The framework decomposes the complex imputation task into simpler sub-tasks, one per fuzzy subset, in which the attribute associations are more explicit. Moreover, the training scheme exploits the complementary nature of the clustering and regression processes to reduce imputation errors. Experimental results on artificial and real datasets illustrate the effectiveness of the proposed framework.
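The alternating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: membership-weighted ridge regressions stand in for the tracking-removed autoencoders, each attribute is predicted from the other attributes only, and every name and parameter here (`alternate_impute`, `fit_weighted_ridge`, `n_clusters`, `lam`, the toy data) is an assumption made for illustration.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Fuzzy c-means membership update for fixed centers (rows sum to 1)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fit_weighted_ridge(A, y, w, lam=1e-3):
    """Weighted ridge regression with intercept (assumed stand-in submodel)."""
    A1 = np.hstack([A, np.ones((A.shape[0], 1))])
    G = A1.T @ (A1 * w[:, None]) + lam * np.eye(A1.shape[1])
    return np.linalg.solve(G, A1.T @ (w * y))

def alternate_impute(X_nan, n_clusters=2, n_iter=20, m=2.0, seed=0):
    """Alternate fuzzy clustering and per-cluster regression on NaN-marked data."""
    rng = np.random.default_rng(seed)
    mask = np.isnan(X_nan)
    # Missing entries are treated as variables, initialised with column means.
    X = np.where(mask, np.nanmean(X_nan, axis=0), X_nan)
    n, p = X.shape
    centers = X[rng.choice(n, n_clusters, replace=False)]
    for _ in range(n_iter):
        # Clustering step: update memberships and centers on the current fill.
        U = fuzzy_memberships(X, centers, m)
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Regression step: one submodel per cluster and per target attribute.
        pred = np.zeros_like(X)
        for k in range(n_clusters):
            for j in range(p):
                others = np.delete(np.arange(p), j)
                # Rows whose target is missing get zero training weight.
                w = W[:, k] * (~mask[:, j])
                beta = fit_weighted_ridge(X[:, others], X[:, j], w)
                pred[:, j] += U[:, k] * (X[:, others] @ beta[:-1] + beta[-1])
        # Only the missing-value variables are updated; observed data stay fixed.
        X[mask] = pred[mask]
    return X

# Toy usage: two shifted groups with linearly related attributes, 10% missing.
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=(200, 1))
X_full = np.hstack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))
X_full[100:] += 3.0
X_nan = X_full.copy()
X_nan[rng.random(X_full.shape) < 0.1] = np.nan
X_imp = alternate_impute(X_nan)
```

In this sketch the final estimate for a missing entry is the membership-weighted combination of the submodel predictions, so a sample lying between two fuzzy subsets borrows from both regressions.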

Data availability

My manuscript has associated data in a data repository.

Funding

This work is supported by the National Natural Science Foundation of China (62076050, 62073056), the National Key R&D Program of China (2022YFF0610900) and the Fundamental Research Funds for the Central Universities (DUT22LAB129).

Author information

Corresponding author

Correspondence to Liyong Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lai, X., Zhang, Z., Zhang, L. et al. Incomplete data modeling based on alternate update of clustering and autoencoder for missing value imputation. Neural Comput & Applic 37, 1523–1540 (2025). https://doi.org/10.1007/s00521-024-10646-9

  • DOI: https://doi.org/10.1007/s00521-024-10646-9