
Empirical comparison of supervised learning techniques for missing value imputation

  • Regular Paper
  • Published:
Knowledge and Information Systems

Abstract

Many data mining algorithms cannot handle incomplete datasets in which some data samples are missing attribute values. To solve this problem, missing value imputation is usually performed; it commonly reasons from the observed or complete data to produce estimated replacements for the missing values. In general, missing value imputation methods can be classified into statistical and machine learning methods. Statistical methods typically substitute the mean for continuous attributes or the mode for discrete attributes, whereas machine learning methods are based on supervised learning techniques. However, it remains unclear which supervised learning technique performs best for missing value imputation. This paper compares five well-known supervised learning techniques, namely k-nearest neighbor, the multilayer perceptron neural network (MLP), the classification and regression tree (CART), naïve Bayes, and the support vector machine, and examines their imputation results for categorical, numerical, and mixed data types. The experimental results demonstrate that CART outperforms the other methods for categorical datasets, whereas the MLP is optimal for numerical and mixed datasets in terms of classification accuracy. However, when computational cost is a factor, CART is superior to the MLP because it provides reasonably accurate imputation results while requiring the least time to perform missing value imputation. Moreover, CART produces the lowest root-mean-squared error of all the methods.
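To make the distinction between the two families of methods concrete, the sketch below places a mean/mode (statistical) imputer next to a CART-style imputer that trains a decision tree on the complete rows to predict the missing entries of one attribute. This is only an illustrative sketch in Python with pandas and scikit-learn, not the setup used in the paper (the experiments were run in Weka; see the notes below), and the toy dataset, column names, and helper functions are hypothetical.

```python
# Illustrative sketch: statistical (mean/mode) vs. CART-based imputation.
# Hypothetical toy data; assumes pandas and scikit-learn are installed.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy mixed-type dataset with missing values encoded as NaN.
df = pd.DataFrame({
    "age":    [25.0, 32.0, np.nan, 41.0, 37.0, np.nan],
    "income": [30e3, 52e3, 47e3, np.nan, 61e3, 45e3],
    "sector": ["it", "it", "retail", np.nan, "retail", "it"],
})

def impute_statistical(data: pd.DataFrame) -> pd.DataFrame:
    """Statistical baseline: mean for numeric columns, mode for the rest."""
    out = data.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

def impute_cart(data: pd.DataFrame, target: str) -> pd.DataFrame:
    """Supervised imputation: a decision tree (CART) trained on the rows
    where `target` is observed predicts its missing values from the other
    attributes (pre-imputed and one-hot encoded for simplicity)."""
    out = data.copy()
    features = pd.get_dummies(impute_statistical(out.drop(columns=[target])))
    missing = out[target].isna()
    if pd.api.types.is_numeric_dtype(out[target]):
        model = DecisionTreeRegressor(random_state=0)   # regression tree
    else:
        model = DecisionTreeClassifier(random_state=0)  # classification tree
    model.fit(features[~missing], out.loc[~missing, target])
    out.loc[missing, target] = model.predict(features[missing])
    return out

print(impute_statistical(df))
print(impute_cart(df, target="age"))
print(impute_cart(df, target="sector"))
```

In this kind of scheme, each incomplete attribute is treated in turn as the prediction target, which mirrors how the supervised learners compared in the paper are used as imputation models; the pre-imputation and one-hot encoding steps above are simplifying assumptions of the sketch, not part of the paper's procedure.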



Notes

  1. http://archive.ics.uci.edu/ml/index.php.

  2. http://www.cs.waikato.ac.nz/ml/weka/.

  3. The computing environment was a PC with an Intel® Core™ i7-2600 CPU @ 3.40 GHz and 4 GB of RAM.


Acknowledgements

This work was supported by the Ministry of Science and Technology of Taiwan (MOST 105-2410-H-008-043-MY3).

Author information


Corresponding author

Correspondence to Chih-Fong Tsai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Tsai, C.-F., Hu, Y.-H. Empirical comparison of supervised learning techniques for missing value imputation. Knowl Inf Syst 64, 1047–1075 (2022). https://doi.org/10.1007/s10115-022-01661-0

