SGAIN, WSGAIN-CP and WSGAIN-GP: Novel GAN Methods for Missing Data Imputation

Neves, Diogo Telmo; Naik, Marcel Ganesh; Proença, Alberto

doi:10.1007/978-3-030-77961-0_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12742)

Included in the following conference series:

International Conference on Computational Science

Abstract

Real-world datasets often have missing values, which hinders the use of a large number of machine learning (ML) estimators. To overcome this limitation in a data analysis pipeline, data points with missing values may be deleted in a data preprocessing stage. A better alternative, however, is data imputation.
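The contrast between the two preprocessing options can be illustrated with a minimal sketch. The toy matrix and the use of column means as the imputed values are assumptions made for illustration only; they are not the methods proposed in this paper.

```python
import numpy as np

# Toy data, invented for illustration: 4 samples, 3 features, NaN = missing.
X = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [np.nan, 11.0, 12.0],
])

# Option 1: deletion keeps only the rows without any missing value.
complete_rows = ~np.isnan(X).any(axis=1)
X_deleted = X[complete_rows]

# Option 2: imputation fills each missing cell; the column mean is used
# here as a simple stand-in for a real imputation method.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

print(X_deleted.shape)  # (1, 3): three of the four samples are lost
print(X_imputed.shape)  # (4, 3): every sample is kept
```

In this toy case deletion discards three of four samples even though only one value per row is missing, which is the loss that imputation avoids.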

Several methods based on Artificial Neural Networks (ANN) have recently been proposed as successful alternatives to classical discriminative imputation methods. Among these ANN imputation methods are the ones that rely on Generative Adversarial Networks (GAN).

This paper presents three GAN-based data imputation methods: SGAIN, WSGAIN-CP and WSGAIN-GP. These methods were tested on datasets with different missing-value probabilities, where the values are missing completely at random (MCAR). The evaluation shows that the new methods match or outperform competitive state-of-the-art imputation methods in different respects: response time, data imputation quality, or the accuracy of post-imputation tasks (e.g., prediction or classification).
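An MCAR experiment of this kind can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: the helper names `mcar_mask` and `rmse_on_missing`, the synthetic data, the missing rates, the choice of RMSE as the quality measure, and the mean-imputation baseline are all assumptions introduced here.

```python
import numpy as np

def mcar_mask(shape, miss_rate, rng):
    """Return a mask with 1 = observed and 0 = missing.

    Under MCAR every cell is dropped independently with probability
    miss_rate, regardless of its value or of any other cell.
    """
    return (rng.random(shape) >= miss_rate).astype(float)

def rmse_on_missing(X_true, X_imputed, mask):
    """Root mean squared error computed only over the missing cells."""
    missing = 1.0 - mask
    n_missing = missing.sum()
    if n_missing == 0:
        return 0.0
    squared_error = ((X_true - X_imputed) * missing) ** 2
    return float(np.sqrt(squared_error.sum() / n_missing))

rng = np.random.default_rng(0)
# Synthetic values in [0, 1), standing in for a min-max scaled dataset.
X_true = rng.random((1000, 5))

for miss_rate in (0.2, 0.5, 0.8):  # illustrative settings only
    mask = mcar_mask(X_true.shape, miss_rate, rng)
    X_observed = np.where(mask == 1, X_true, np.nan)
    # Placeholder imputer: column means of the observed values.
    col_means = np.nanmean(X_observed, axis=0)
    X_imputed = np.where(mask == 1, X_true, col_means)
    print(miss_rate, rmse_on_missing(X_true, X_imputed, mask))
```

Any imputation method could replace the mean-imputation step; measuring the error only on the masked cells keeps observed values, which are copied through unchanged, from diluting the score.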

M. G. Naik—The second author is a participant in the BIH Charité Digital Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin, the Berlin Institute of Health and the German Research Foundation (DFG).

Notes

1. The first author used synthetic data to develop and test ML models for the kidney disease pilot (see https://www.bigmedilytics.eu/pilot/kidney-disease/) of BigMedilytics (an EU-funded project supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 780495).
2. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html.
3. https://github.com/jsyoon0823/GAIN.
4. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
5. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html.
6. https://github.com/dtneves/ICCS_2021.

Acknowledgements

The first author thanks the ALGORITMI research centre, Universidade do Minho, where he conducts part of his research as an external collaborator. The third author developed his work at ALGORITMI research centre, Universidade do Minho, supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020.

Author information

Authors and Affiliations

Intelligent Analytics for Mass Data (IAM), German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
Diogo Telmo Neves
Charité – Universitätsmedizin, and Berlin Institute of Health, Berlin, Germany
Diogo Telmo Neves & Marcel Ganesh Naik
Centro ALGORITMI, Universidade do Minho, Braga, Portugal
Diogo Telmo Neves & Alberto Proença

Corresponding author

Correspondence to Diogo Telmo Neves.

Editor information

Editors and Affiliations

AGH University of Science and Technology, Krakow, Poland
Maciej Paszynski
Ludwig-Maximilians-Universität München, Munich, Germany
Dieter Kranzlmüller
University of Amsterdam, Amsterdam, The Netherlands
Valeria V. Krzhizhanovskaya
University of Tennessee at Knoxville, Knoxville, TN, USA
Jack J. Dongarra
University of Amsterdam, Amsterdam, The Netherlands
Peter M. A. Sloot

Cite this paper

Neves, D.T., Naik, M.G., Proença, A. (2021). SGAIN, WSGAIN-CP and WSGAIN-GP: Novel GAN Methods for Missing Data Imputation. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12742. Springer, Cham. https://doi.org/10.1007/978-3-030-77961-0_10

DOI: https://doi.org/10.1007/978-3-030-77961-0_10
Published: 09 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77960-3
Online ISBN: 978-3-030-77961-0
eBook Packages: Computer Science, Computer Science (R0)
