Abstract
Missing values (MVs) are ubiquitous in real-world datasets, so MV imputation, which aims to recover them, is a fundamental data preprocessing step for many data analytics and mining tasks. A typical way to impute MVs is to exploit the correlations among the attributes of the data. However, these correlations are usually complex and therefore difficult to identify. Accordingly, we develop a new deep learning model called MIssing Data Imputation denoising Autoencoder (MIDIA), which imputes the MVs in a given dataset by exploring non-linear correlations between missing values and non-missing values. Additionally, to account for different data missing patterns, we propose two MV imputation approaches based on the MIDIA model, namely MIDIA-Sequential and MIDIA-Batch. MIDIA-Sequential imputes the MVs attribute by attribute, training an independent MIDIA model for each incomplete attribute. By contrast, MIDIA-Batch imputes all MVs in one batch by training a single unified MIDIA model. Finally, we evaluate the proposed approaches experimentally against existing MV imputation algorithms. The results show that both MIDIA-Sequential and MIDIA-Batch achieve significantly higher imputation accuracy than existing solutions and can handle various data missing patterns and data types. Specifically, MIDIA-Sequential performs better than MIDIA-Batch on data with a monotone missing pattern, while MIDIA-Batch performs better than MIDIA-Sequential on data with a general missing pattern.
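To make the batch-style idea concrete, the following is a minimal sketch of imputing all missing entries with one shared denoising autoencoder. It is not the MIDIA architecture or training procedure: the function name `impute_with_dae`, the single tanh hidden layer, the zero-fill placeholder, the masking-noise rate, the loss restricted to observed entries, and all hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np

def impute_with_dae(X, hidden=8, epochs=300, lr=0.05, drop=0.2, seed=0):
    """Impute NaNs in X (n x d, observed values scaled to [0, 1]) with one
    shared denoising autoencoder. Hypothetical sketch, not the MIDIA model."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)              # True where a value is present
    X0 = np.where(observed, X, 0.0)      # assumption: zero-fill as placeholder
    n, d = X0.shape

    # One hidden layer; the decoder mirrors the encoder's shape.
    W1 = rng.normal(0.0, 0.1, (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d))
    b2 = np.zeros(d)

    for _ in range(epochs):
        # Denoising step: randomly zero out inputs, then try to rebuild them.
        keep = rng.random(X0.shape) > drop
        Xc = X0 * keep
        H = np.tanh(Xc @ W1 + b1)        # encoder
        Y = H @ W2 + b2                  # decoder (linear output)

        # Squared-error gradient, counted only on observed entries.
        G = 2.0 * observed * (Y - X0) / n
        GH = (G @ W2.T) * (1.0 - H ** 2)  # backpropagate through tanh

        # Full-batch gradient descent updates.
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (Xc.T @ GH)
        b1 -= lr * GH.sum(axis=0)

    # Reconstruct from the uncorrupted input; keep observed values as they are.
    Y = np.tanh(X0 @ W1 + b1) @ W2 + b2
    return np.where(observed, X, Y)

# Toy data with correlated attributes and about 10% of entries removed.
rng = np.random.default_rng(1)
a = rng.random(200)
X = np.column_stack([a, 1.0 - a, a ** 2, 0.5 * a + 0.25])
X[rng.random(X.shape) < 0.1] = np.nan
X_imp = impute_with_dae(X)
```

A sequential variant in the spirit of MIDIA-Sequential would instead loop over the incomplete attributes and fit a separate model for each one; how the attributes are ordered and how each model is trained are not specified in the abstract, so they are left out here.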
Notes
We only consider the number of hidden layers in the Encoder, since the Decoder is symmetric to the Encoder.
Acknowledgements
This work is supported by the China Postdoctoral Science Foundation (2019M661077), the National Science Foundation (Grant No. IIS-1717084), the National Natural Science Foundation of China (Grant Nos. 61772102, 61751205), and the Liaoning Revitalization Talents Program (XLYC1807158).
Additional information
Responsible editor: Shuiwang Ji.
About this article
Cite this article
Ma, Q., Lee, WC., Fu, TY. et al. MIDIA: exploring denoising autoencoders for missing data imputation. Data Min Knowl Disc 34, 1859–1897 (2020). https://doi.org/10.1007/s10618-020-00706-8
DOI: https://doi.org/10.1007/s10618-020-00706-8