Abstract
Real-world datasets often contain missing values, which hinders the use of many machine learning (ML) estimators. To overcome this limitation in a data analysis pipeline, incomplete data points may be deleted during data preprocessing. A better alternative, however, is data imputation.
Several methods based on Artificial Neural Networks (ANN) have recently been proposed as successful alternatives to classical discriminative imputation methods. Among these ANN imputation methods are those that rely on Generative Adversarial Networks (GAN).
This paper presents three GAN-based data imputation methods: SGAIN, WSGAIN-CP and WSGAIN-GP. The methods were tested on datasets with different missing-value probabilities, where the values are missing completely at random (MCAR). The evaluation shows that the new methods match or outperform competitive state-of-the-art imputation methods in different respects: response time, data imputation quality, or the accuracy of post-imputation tasks (e.g., prediction or classification).
M. G. Naik—The second author is a participant in the BIH Charité Digital Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin, the Berlin Institute of Health and the German Research Foundation (DFG).
Notes
1. The first author used synthetic data to develop and test ML models for the kidney disease pilot (see https://www.bigmedilytics.eu/pilot/kidney-disease/) of BigMedilytics (an EU-funded project supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 780495).
Acknowledgements
The first author thanks the ALGORITMI research centre, Universidade do Minho, where he conducts part of his research as an external collaborator. The third author developed his work at ALGORITMI research centre, Universidade do Minho, supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Neves, D.T., Naik, M.G., Proença, A. (2021). SGAIN, WSGAIN-CP and WSGAIN-GP: Novel GAN Methods for Missing Data Imputation. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12742. Springer, Cham. https://doi.org/10.1007/978-3-030-77961-0_10
DOI: https://doi.org/10.1007/978-3-030-77961-0_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77960-3
Online ISBN: 978-3-030-77961-0
eBook Packages: Computer Science, Computer Science (R0)