Numerical Data Imputation: Choose kNN over Deep Learning

Lalande, Florian; Doya, Kenji

doi:10.1007/978-3-031-17849-8_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13590)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

Artificial neural networks (ANNs) are now ubiquitous in data science. In this respect, Deep-Learning (DL) methods have been developed to address missing data problems. The present study compares state-of-the-art DL Generative Adversarial Network (GAN) models with the well-established kNN algorithm (1951) for numerical data imputation. Using real-world and generated datasets in various missing data scenarios, we show that the good old kNN algorithm is still competitive with powerful DL algorithms for numerical data imputation. This review consolidates the emerging consensus that numerical data imputation does not necessarily require powerful or heavy DL tools.
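To make the kNN side of the comparison concrete, the following is a minimal sketch of kNN imputation for numerical data. It is not the authors' implementation. The function names `nan_distance` and `knn_impute`, the default `k=2`, the rescaled distance over jointly observed coordinates, and the toy data are all assumptions made for illustration.

```python
import math


def nan_distance(a, b):
    """Euclidean distance over coordinates observed in both rows,
    rescaled by the fraction of coordinates used (an assumed convention).
    Returns inf when the two rows share no observed coordinate."""
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Return a copy of rows in which each NaN is replaced by the mean of
    that column over the k nearest rows where the column is observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: other rows in which column j is observed.
            donors = [
                (nan_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and not math.isnan(other[j])
            ]
            # Drop donors that share no observed coordinate with this row.
            donors = [d for d in donors if math.isfinite(d[0])]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # the NaN is left in place if no donor exists
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


nan = float("nan")
# Hypothetical toy data: the second row is missing its middle value.
data = [
    [1.0, 2.0, 3.0],
    [1.1, nan, 3.1],
    [0.9, 2.2, 2.9],
    [8.0, 9.0, 10.0],
]
imputed = knn_impute(data, k=2)
print(imputed[1])
```

In this toy example the two nearest donors of the incomplete row are the first and third rows, so the missing entry is filled with the mean of 2.0 and 2.2. Distances are computed on the original rows rather than on already imputed values, which keeps the result independent of the order in which rows are processed.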

Notes

1. Full code available at: https://github.com/DeltaFloflo/imputation_comparison

Author information

Authors and Affiliations

Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna-son, Okinawa, Japan
Florian Lalande & Kenji Doya

Authors

Florian Lalande
Kenji Doya

Corresponding author

Correspondence to Florian Lalande.

Editor information

Editors and Affiliations

Charles University, Prague, Czech Republic
Tomáš Skopal
ISTI-CNR, Pisa, Italy
Fabrizio Falchi
Charles University, Prague, Czech Republic
Jakub Lokoč
University of Torino, Torino, Italy
Maria Luisa Sapino
University of Bologna, Bologna, Italy
Ilaria Bartolini
University of Bologna, Bologna, Italy
Marco Patella

About this paper

Cite this paper

Lalande, F., Doya, K. (2022). Numerical Data Imputation: Choose kNN over Deep Learning. In: Skopal, T., Falchi, F., Lokoč, J., Sapino, M.L., Bartolini, I., Patella, M. (eds) Similarity Search and Applications. SISAP 2022. Lecture Notes in Computer Science, vol 13590. Springer, Cham. https://doi.org/10.1007/978-3-031-17849-8_1

DOI: https://doi.org/10.1007/978-3-031-17849-8_1
Published: 28 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17848-1
Online ISBN: 978-3-031-17849-8
eBook Packages: Computer Science, Computer Science (R0)
