Numerical Data Imputation: Choose kNN over Deep Learning

  • Conference paper
Similarity Search and Applications (SISAP 2022)

Abstract

Artificial neural networks (ANNs) are now ubiquitous in data science. In this respect, Deep-Learning (DL) methods have been developed to address missing data problems. The present study compares state-of-the-art DL Generative Adversarial Network (GAN) models with the well-established kNN algorithm (1951) for numerical data imputation. Using real-world and generated datasets in various missing data scenarios, we show that the good old kNN algorithm is still competitive with powerful DL algorithms for numerical data imputation. This review consolidates the emerging consensus that numerical data imputation does not necessarily require powerful or heavy DL tools.
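
To make the kNN side of the comparison concrete, here is a minimal sketch of kNN imputation for numerical data. It is not the authors' implementation (their code is in the repository listed in the Notes). The function name `knn_impute`, the toy data, and the distance rule (Euclidean distance over the columns observed in both rows, rescaled by the fraction of columns used) are assumptions made for illustration.

```python
import math


def knn_impute(rows, k=3):
    """Return a copy of rows where each None is replaced by the mean of that
    column over the k nearest rows that have the column observed."""

    def distance(a, b):
        # Euclidean distance over columns observed in both rows, rescaled by
        # (total columns / shared columns). Returns inf if nothing is shared.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        squared = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(squared * len(a) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if math.isfinite(d[0])]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no usable neighbour exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data (hypothetical): one missing value in the second row.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [8.0, 9.0, 10.0],
]
filled = knn_impute(data, k=2)
print(filled[1])
```

With `k=2`, the missing entry is filled from the two rows closest to it on the observed columns, so the distant fourth row does not contribute. Distances are always computed on the original data, so earlier imputed values never influence later ones.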

Notes

  1. Full code available at: https://github.com/DeltaFloflo/imputation_comparison.

Author information

Corresponding author

Correspondence to Florian Lalande.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lalande, F., Doya, K. (2022). Numerical Data Imputation: Choose kNN over Deep Learning. In: Skopal, T., Falchi, F., Lokoč, J., Sapino, M.L., Bartolini, I., Patella, M. (eds) Similarity Search and Applications. SISAP 2022. Lecture Notes in Computer Science, vol 13590. Springer, Cham. https://doi.org/10.1007/978-3-031-17849-8_1

  • DOI: https://doi.org/10.1007/978-3-031-17849-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17848-1

  • Online ISBN: 978-3-031-17849-8

  • eBook Packages: Computer Science, Computer Science (R0)
