Incomplete data modeling based on alternate update of clustering and autoencoder for missing value imputation

Lai, Xiaochen; Zhang, Zheng; Zhang, Liyong; Lu, Wei; Li, ZhuoHan

doi:10.1007/s00521-024-10646-9

Original Article
Published: 26 November 2024

Volume 37, pages 1523–1540 (2025)
Neural Computing and Applications

Xiaochen Lai¹,
Zheng Zhang¹,
Liyong Zhang²,
Wei Lu² &
…
ZhuoHan Li²

Abstract

Missing values are widespread in real-world datasets and restrict the performance of data mining. In this paper, we propose a joint optimization framework that mines attribute associations and category structures in incomplete datasets, aiming to impute missing values with a full understanding of the data structure. Because attribute correlations differ among sample categories, we partition incomplete data into fuzzy subsets by fuzzy clustering. Within each subset, a tracking-removed autoencoder is constructed as a submodel to fit the regression relationships among attributes. Because fuzzy clustering and regression modeling influence each other, we further propose a training scheme based on missing value variables that iteratively optimizes the two processes. The framework decomposes the complex imputation task into simpler sub-tasks through fuzzy clustering, within which the attribute associations are more explicit. Moreover, the training scheme exploits the complementary nature of the clustering and regression processes to reduce imputation errors. Experimental results on artificial and real datasets illustrate the effectiveness of the proposed framework.
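
The alternating scheme summarized in the abstract can be illustrated with a rough sketch. This is not the authors' implementation. It assumes a standard fuzzy c-means membership update, and it swaps the tracking-removed autoencoders for weighted ridge regressors that predict each attribute from the other attributes only, on the assumption that a reconstructed attribute should not depend on its own input. All function names and parameters (`impute`, `n_clusters`, `m`, `ridge`, `n_iter`) are hypothetical.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0, eps=1e-9):
    """Fuzzy c-means memberships: each row sums to 1 across clusters."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def update_centers(X, U, m=2.0):
    """Membership-weighted cluster centers."""
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]

def fit_submodel(X, w, ridge=1e-3):
    """Weighted ridge regression of every attribute on the remaining ones.

    Stand-in (assumption) for a per-cluster autoencoder: attribute j is
    predicted without using attribute j as an input.
    """
    n, d = X.shape
    coefs = []
    for j in range(d):
        others = np.delete(np.arange(d), j)
        A = np.hstack([X[:, others], np.ones((n, 1))])  # d-1 inputs + bias
        Aw = A * w[:, None]
        beta = np.linalg.solve(A.T @ Aw + ridge * np.eye(d), Aw.T @ X[:, j])
        coefs.append(beta)
    return coefs

def predict(X, coefs):
    """Reconstruct every attribute from the other attributes."""
    n, d = X.shape
    out = np.empty_like(X)
    for j, beta in enumerate(coefs):
        others = np.delete(np.arange(d), j)
        out[:, j] = X[:, others] @ beta[:-1] + beta[-1]
    return out

def impute(X_missing, n_clusters=2, m=2.0, n_iter=20, seed=0):
    """Alternate clustering and regression, treating missing entries as variables."""
    rng = np.random.default_rng(seed)
    mask = np.isnan(X_missing)
    # Initial guess for the missing value variables: column means.
    X = np.where(mask, np.nanmean(X_missing, axis=0), X_missing)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Clustering step on the currently completed data.
        U = fuzzy_memberships(X, centers, m)
        centers = update_centers(X, U, m)
        # Regression step: one submodel per fuzzy subset.
        recon = np.zeros_like(X)
        for k in range(n_clusters):
            coefs = fit_submodel(X, U[:, k] ** m)
            recon += U[:, k:k + 1] * predict(X, coefs)
        # Only missing entries are updated; observed values stay fixed.
        X = np.where(mask, recon, X)
    return X
```

Calling `impute(X_missing)` on a float array whose missing entries are `np.nan` returns a completed array of the same shape. Each sample's imputed values are a membership-weighted blend of the per-cluster predictions, which is one simple way to combine submodels and is a design choice of this sketch.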

Data availability

My manuscript has associated data in a data repository.

Funding

This work is supported by the National Natural Science Foundation of China (62076050, 62073056), the National Key R&D Program of China (2022YFF0610900) and the Fundamental Research Funds for the Central Universities (DUT22LAB129).

Author information

Authors and Affiliations

School of Software, Dalian University of Technology, Dalian, 116600, China
Xiaochen Lai & Zheng Zhang
School of Control Science and Engineering, Dalian University of Technology, Dalian, 116024, China
Liyong Zhang, Wei Lu & ZhuoHan Li

Corresponding author

Correspondence to Liyong Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lai, X., Zhang, Z., Zhang, L. et al. Incomplete data modeling based on alternate update of clustering and autoencoder for missing value imputation. Neural Comput & Applic 37, 1523–1540 (2025). https://doi.org/10.1007/s00521-024-10646-9

Received: 02 January 2024
Accepted: 07 October 2024
Published: 26 November 2024
Issue Date: January 2025
DOI: https://doi.org/10.1007/s00521-024-10646-9
