A review of the current publication trends on missing data imputation over three decades: direction and future research

  • Review
Neural Computing and Applications

Abstract

Studies on missing data have increased over the past few decades. Missing data is an uncontrollable phenomenon that can occur during data collection in practically any research field. Numerous missing data imputation techniques are well documented in the literature. However, very few studies have systematically examined the evolutionary nuances of a specific area while offering insight into the emerging imputation methods in that field. The primary objective of this paper is to provide a comprehensive review of studies concerning missing data imputation methods in classification problems from several viewpoints: (a) publication trends (by year, subject area, country, document language, and author), (b) keyword analysis, (c) the most cited documents, and (d) the most influential authors. A bibliometric analysis was conducted using VOSviewer and Harzing's Publish or Perish software, covering 430 journal articles indexed in Scopus from 1991 to June 2021. One of the findings reveals an emerging trend toward missing data imputation methods based on random forest and nearest neighbor. Overall, this review is a resource for gaining insight into the available imputation techniques at a glance.

Availability of data and materials

The papers analyzed in this study are available in the Scopus database.

Acknowledgements

Not applicable.

Funding

This work was supported under the Collaborative Research Grant (CRG) scheme between Universiti Teknologi Malaysia (Q. K130000.2456.08G27) and Universiti Malaysia Perlis (9023-00013). This work was also funded by the Ministry of Higher Education, Malaysia, under the Fundamental Research Grant Scheme (FRGS/1/2021/TK0/UTM/02/45).

Author information

Contributions

FAA conducted the literature search and review, analyzed the data extracted from the Scopus database, and wrote the first draft of the manuscript. KRJ and WZAWM provided direction for the bibliometric review and critiqued the content. SM revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Farah Adibah Adnan.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethical approval and consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Adnan, F.A., Jamaludin, K.R., Wan Muhamad, W.Z.A. et al. A review of the current publication trends on missing data imputation over three decades: direction and future research. Neural Comput & Applic 34, 18325–18340 (2022). https://doi.org/10.1007/s00521-022-07702-7

  • DOI: https://doi.org/10.1007/s00521-022-07702-7
