SGAIN, WSGAIN-CP and WSGAIN-GP: Novel GAN Methods for Missing Data Imputation

  • Conference paper
Computational Science – ICCS 2021 (ICCS 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12742)

Abstract

Real-world datasets often have missing values, which prevents the use of many machine learning (ML) estimators. To overcome this limitation in a data analysis pipeline, data points with missing values may be deleted in a preprocessing stage. A better alternative, however, is data imputation.
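To illustrate why deletion can be costly, the following is a minimal sketch. It assumes that every cell is missing independently with the same probability, which is an assumption made here for illustration and not a setting taken from the paper. The function name `complete_row_fraction`, the feature count and the missingness rates are all hypothetical.

```python
def complete_row_fraction(p_missing: float, n_features: int) -> float:
    """Expected fraction of rows with no missing cell.

    Assumes every cell is missing independently with probability
    p_missing (an illustrative assumption, not the paper's setup).
    """
    if not 0.0 <= p_missing <= 1.0:
        raise ValueError("p_missing must lie in [0, 1]")
    # A row survives deletion only if all of its cells are observed.
    return (1.0 - p_missing) ** n_features


if __name__ == "__main__":
    # Hypothetical table with 10 features and a few missingness rates.
    for p in (0.1, 0.2, 0.5):
        kept = complete_row_fraction(p, n_features=10)
        print(f"p_missing={p:.1f}: about {kept:.1%} of rows survive deletion")
```

Under these assumptions, a table with 10 features and 20% missing cells keeps only about 11% of its rows after deletion, which is one reason imputation may be preferable.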

Several methods based on Artificial Neural Networks (ANN) have recently been proposed as successful alternatives to classical discriminative imputation methods. Among them are methods that rely on Generative Adversarial Networks (GAN).

This paper presents three GAN-based data imputation methods: SGAIN, WSGAIN-CP and WSGAIN-GP. The methods were tested on datasets with different missing-value probabilities, where the values are missing completely at random (MCAR). The evaluation shows that they match or outperform competitive state-of-the-art imputation methods in different respects: response time, data imputation quality, or the accuracy of post-imputation tasks (e.g., prediction or classification).
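The MCAR setting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`mcar_mask`, `mean_impute`, `rmse_on_missing`), the synthetic data, the missingness rates and the choice of RMSE as a quality measure are all hypothetical, and column-mean imputation is only a stand-in baseline for where a GAN-based imputer would be plugged in.

```python
import math
import random


def mcar_mask(n_rows, n_cols, p_missing, rng):
    """Return a 0/1 mask (1 = observed, 0 = missing).

    Under MCAR every cell is dropped with the same probability,
    independently of any data value.
    """
    return [[0 if rng.random() < p_missing else 1 for _ in range(n_cols)]
            for _ in range(n_rows)]


def mean_impute(data, mask):
    """Fill each missing cell with the mean of the observed cells in its column.

    A simple stand-in baseline, not one of the GAN methods.
    """
    n_rows, n_cols = len(data), len(data[0])
    imputed = [row[:] for row in data]
    for j in range(n_cols):
        observed = [data[i][j] for i in range(n_rows) if mask[i][j] == 1]
        fill = sum(observed) / len(observed) if observed else 0.0
        for i in range(n_rows):
            if mask[i][j] == 0:
                imputed[i][j] = fill
    return imputed


def rmse_on_missing(truth, imputed, mask):
    """Root mean squared error over the masked-out cells only."""
    errors = [(truth[i][j] - imputed[i][j]) ** 2
              for i in range(len(truth))
              for j in range(len(truth[0]))
              if mask[i][j] == 0]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data already in [0, 1), so no extra scaling is needed here.
    truth = [[rng.random() for _ in range(5)] for _ in range(200)]
    for p in (0.2, 0.5, 0.8):  # hypothetical missingness rates
        mask = mcar_mask(200, 5, p, rng)
        imputed = mean_impute(truth, mask)
        print(f"p_missing={p}: RMSE={rmse_on_missing(truth, imputed, mask):.3f}")
```

Because the ground truth is known before masking, the error can be measured on exactly the cells that were removed, and the same loop can be repeated for each missingness rate and each imputer under comparison.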

M. G. Naik—The second author is a participant in the BIH Charité Digital Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin, the Berlin Institute of Health and the German Research Foundation (DFG).

Notes

  1. The first author used synthetic data to develop and test ML models for the kidney disease pilot (see https://www.bigmedilytics.eu/pilot/kidney-disease/) of BigMedilytics (an EU-funded project supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 780495).

  2. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html.

  3. https://github.com/jsyoon0823/GAIN.

  4. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

  5. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html.

  6. https://github.com/dtneves/ICCS_2021.

Acknowledgements

The first author thanks the ALGORITMI research centre, Universidade do Minho, where he conducts part of his research as an external collaborator. The third author developed his work at ALGORITMI research centre, Universidade do Minho, supported by FCT – Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020.

Author information

Corresponding author

Correspondence to Diogo Telmo Neves.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Neves, D.T., Naik, M.G., Proença, A. (2021). SGAIN, WSGAIN-CP and WSGAIN-GP: Novel GAN Methods for Missing Data Imputation. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12742. Springer, Cham. https://doi.org/10.1007/978-3-030-77961-0_10

  • DOI: https://doi.org/10.1007/978-3-030-77961-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77960-3

  • Online ISBN: 978-3-030-77961-0

  • eBook Packages: Computer Science, Computer Science (R0)
