MIDIA: exploring denoising autoencoders for missing data imputation

Ma, Qian; Lee, Wang-Chien; Fu, Tao-Yang; Gu, Yu; Yu, Ge

doi:10.1007/s10618-020-00706-8

MIDIA: exploring denoising autoencoders for missing data imputation

Published: 25 July 2020

Volume 34, pages 1859–1897, (2020)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Qian Ma ORCID: orcid.org/0000-0001-6473-9523¹,
Wang-Chien Lee²,
Tao-Yang Fu²,
Yu Gu³ &
…
Ge Yu³

1532 Accesses
12 Citations
Explore all metrics

Abstract

Due to the ubiquitous presence of missing values (MVs) in real-world datasets, the MV imputation problem, aiming to recover MVs, is an important and fundamental data preprocessing step for various data analytics and mining tasks to effectively achieve good performance. To impute MVs, a typical idea is to explore the correlations amongst the attributes of the data. However, those correlations are usually complex and thus difficult to identify. Accordingly, we develop a new deep learning model called MIssing Data Imputation denoising Autoencoder (MIDIA) that effectively imputes the MVs in a given dataset by exploring non-linear correlations between missing values and non-missing values. Additionally, by considering various data missing patterns, we propose two effective MV imputation approaches based on the proposed MIDIA model, namely MIDIA-Sequential and MIDIA-Batch. MIDIA-Sequential imputes the MVs attribute-by-attribute sequentially by training an independent MIDIA model for each incomplete attribute. By contrast, MIDIA-Batch imputes the MVs in one batch by training a uniform MIDIA model. Finally, we evaluate the proposed approaches by experimentation in comparison with existing MV imputation algorithms. The experimental results demonstrate that both MIDIA-Sequential and MIDIA-Batch achieve significantly higher imputation accuracy compared with existing solutions, and the proposed approaches are capable of handling various data missing patterns and data types. Specifically, MIDIA-Sequential performs better than MIDIA-Batch for data with monotone missing pattern, while MIDIA-Batch performs better than MIDIA-Sequential for data with general missing pattern.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 3

Fig. 8

Explainable AI: A Brief Survey on History, Research Areas, Approaches and Challenges

Deep learning modelling techniques: current progress, applications, advantages, and challenges

Article Open access 17 April 2023

Shams Forruque Ahmed, Md. Sakib Bin Alam, … Amir H. Gandomi

Machine Learning in Healthcare Analytics: A State-of-the-Art Review

Article 04 April 2024

Surajit Das, Samaleswari P. Nayak, … Sarat Chandra Nayak

Notes

https://archive.ics.uci.edu/ml/datasets/Air+Quality.
http://archive.ics.uci.edu/ml/datasets/Adult.
http://archive.ics.uci.edu/ml/datasets/Car+Evaluation.
We only consider the number of hidden layers in Encoder since the Decoder is symmetric with the Encoder.

References

Aittokallio T (2010) Dealing with missing values in large-scale studies: microarray data imputation and beyond. Brief Bioinform 11(2):253–264
Google Scholar
Anagnostopoulos C, Triantafillou P (2014) Scaling out big data missing value imputations: pythia vs. godzilla. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 651–660
Andridge RR, Little RJA (2010) A review of hot deck imputation for survey non-response. Int Stat Rev 78(1):40–64
Google Scholar
Audigier V, Husson F, Josse J (2016) Multiple imputation for continuous variables using a bayesian principal component analysis. J Stat Comput Simul 86(11):2140–2156
MathSciNet MATH Google Scholar
Baldi P (2012) Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, pp 37–50
Bergstra J, Desjardins G, Lamblin P, Bengio Y (2009) Quadratic polynomials learn better image features. Technical report, p 1337
Bertsimas D, Pawlowski C, Zhuo YD (2017) From predictive methods to missing data imputation: an optimization approach. J Mach Learn Res 18(1):7133–7171
MathSciNet MATH Google Scholar
Borovicka T, Jirina-Jr M, Kordik P, Jirina M (2012) Selecting representative data sets. In: Advances in data mining knowledge discovery and applications, pp 43–70
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp 177–186
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38
MathSciNet MATH Google Scholar
Dong X, Gabrilovich E, Heitz G et al (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 601–610
Gharibshah Z, Zhu XQ, Hainline A, Conway M (2020) Deep learning for user interest and response prediction in online display advertising. Data Sci Eng 5(1):12–26
Google Scholar
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of international conference on artificial intelligence and statistics, pp 249–256
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Proceedings of international conference on artificial intelligence and statistics, pp 315–323
Han J, Moraga C (1995) The influence of the sigmoid function parameters on the speed of backpropagation learning. In: Proceedings of international workshop on artificial neural networks, pp 195–201
Jain YK, Bhandare SK (2011) Min max normalization based data perturbation method for privacy protection. Int J Comput Commun Technol 2(8):45–50
Google Scholar
Jing XY, Qi FM, Wu F, Xu BW (2016) Missing data imputation based on low-rank recovery and semi-supervised regression for software effort estimation. In: Proceedings of IEEE/ACM international conference on software engineering, pp 607–618
Joenssen DW, Bankhofer U (2012) Hot deck methods for imputing missing data—the effects of limiting donor usage. In: International workshop on machine learning and data mining in pattern recognition, pp 63–75
Jonathan ACS, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR (2009) Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ Br Med J 339(7713):157–160
Google Scholar
Kim KY, Kim BJ, Yi GS (2004) Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinform 5:160
Google Scholar
Kim H, Golub GH, Park H (2005) Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 21(2):187–198
Google Scholar
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Discov Eng 17(4):491–502
Google Scholar
Lovedeep G, Wang K (2017) Multiple imputation using deep denoising autoencoders. CoRR arXiv:1705.02737
Magnani M (2004) Techniques for dealing with missing data in knowledge discovery tasks. Obtido 15(01):2007. http://magnanim.web.cs.unibo.it/index.html
McNeish D (2017) Missing data methods for arbitrary missingness with small samples. J Appl Stat 44(1):24–39
MathSciNet Google Scholar
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of international conference on international conference on machine learning, pp 807–814
Qin Y, Zhang S, Zhu X et al (2009) POP algorithm: Kernel-based imputation to treat missing values in knowledge discovery from databases. Expert Syst Appl 36(2):2794–2804
Google Scholar
Raghunathan TE, Lepkowski JM, Hoewyk JV, Solenberger P (2001) A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodol 27(1):85–96
Google Scholar
Rahman G, Islam Z (2011) A decision tree-based missing value imputation technique for data pre-processing. In: Proceedings of Australasian data mining conference, pp 41–50
Sinclair JM, Wilkes GA, Krebs WA (2001) Collins concise dictionary. HarperCollins, New York
Google Scholar
Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Inf Process Manag 45(4):427–437
Google Scholar
Troyanskaya OG, Cantor MN, Sherlock G et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525
Google Scholar
Verboven S, Branden KV, Goos P (2007) Sequential imputation for missing values. Comput Biol Chem 31(5–6):320–327
MATH Google Scholar
Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of international conference on machine learning, pp 1096–1103
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(12):3371–3408
MathSciNet MATH Google Scholar
Vito SD, Massera E, Piga M et al (2008) On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sens Actuators B Chem 129(2):750–757
Google Scholar
Wang QH, Rao JNK (2002a) Empirical likelihood-based inference in linear models with missing data. Scand J Stat 29(3):563–576
MathSciNet MATH Google Scholar
Wang QH, Rao JNK (2002b) Empirical likelihood-based inference under imputation for missing response data. Ann Stat 30(3):896–924
Article MathSciNet MATH Google Scholar
Yuan YC (2010) Multiple imputation for missing data: concepts and new development, vol 49. SAS Institute Inc, Rockville, pp 1–11
Google Scholar
Zhang S (2008) Parimputation: from imputation and null-imputation to partially imputation. IEEE Intell Inform Bull 9(1):32–38
MathSciNet Google Scholar
Zhang Y, Liu YC (2009) Data imputation using least squares support vector machines in urban arterial streets. IEEE Signal Process Lett 16(5):414–417
MathSciNet Google Scholar
Zhang CQ, Zhu XF, Zhang JL, Qin YS, Zhang SC (2007) GBKII: an imputation method for missing values. In: Proceedings of Pacific-Asia conference on knowledge discovery and data mining, pp 1080–1087
Zhang X, Song X, Wang H et al (2008) Sequential local least squares imputation estimating missing value of microarray data. Comput Biol Med 38(10):1112–1120
Google Scholar
Zhou XB, Wang XD, Dougherty ER (2003) Construction of genomic networks using mutual-information clustering and reversible-jump markov-chain-monte-carlo predictor design. Signal Process 83(4):745–761
MATH Google Scholar
Zhu X, Zhang S, Jin Z et al (2011) Missing value estimation for mixed-attribute data sets. IEEE Trans Knowl Data Eng 23(1):110–121
Google Scholar

Download references

Acknowledgements

This work is supported by the China Postdoctoral Science Foundation (2019M661077), the National Science Foundation (Grant No. IIS-1717084), the National Natural Science Foundation of China (Grant Nos. 61772102, 61751205), and the Liaoning Revitalization Talents Program (XLYC1807158).

Author information

Authors and Affiliations

College of Information Science and Technology, Dalian Maritime University, Dalian, China
Qian Ma
Department of Computer Science and Engineering, The Pennsylvania State University, State College, USA
Wang-Chien Lee & Tao-Yang Fu
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yu Gu & Ge Yu

Authors

Qian Ma
View author publications
You can also search for this author in PubMed Google Scholar
Wang-Chien Lee
View author publications
You can also search for this author in PubMed Google Scholar
Tao-Yang Fu
View author publications
You can also search for this author in PubMed Google Scholar
Yu Gu
View author publications
You can also search for this author in PubMed Google Scholar
Ge Yu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Qian Ma.

Additional information

Responsible editor: Shuiwang Ji.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ma, Q., Lee, WC., Fu, TY. et al. MIDIA: exploring denoising autoencoders for missing data imputation. Data Min Knowl Disc 34, 1859–1897 (2020). https://doi.org/10.1007/s10618-020-00706-8

Download citation

Received: 16 August 2019
Accepted: 17 July 2020
Published: 25 July 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s10618-020-00706-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

MIDIA: exploring denoising autoencoders for missing data imputation

Abstract

Access this article

Similar content being viewed by others

Explainable AI: A Brief Survey on History, Research Areas, Approaches and Challenges

Deep learning modelling techniques: current progress, applications, advantages, and challenges

Machine Learning in Healthcare Analytics: A State-of-the-Art Review

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

MIDIA: exploring denoising autoencoders for missing data imputation

Abstract

Access this article

Similar content being viewed by others

Explainable AI: A Brief Survey on History, Research Areas, Approaches and Challenges

Deep learning modelling techniques: current progress, applications, advantages, and challenges

Machine Learning in Healthcare Analytics: A State-of-the-Art Review

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation