
Empirical comparison of supervised learning techniques for missing value imputation

  • Regular Paper
  • Published:
Knowledge and Information Systems

Abstract

Many data mining algorithms cannot handle incomplete datasets in which some data samples are missing attribute values. To solve this problem, missing value imputation is usually performed; it commonly reasons from the observed or complete data to produce estimated replacements for the missing values. In general, missing value imputation methods can be classified into statistical and machine learning methods. Statistical methods typically substitute the mean for continuous attributes or the mode for discrete attributes, whereas machine learning methods are based on supervised learning techniques. However, it remains unclear which supervised learning technique performs best for missing value imputation. This paper compares five well-known supervised learning techniques, namely k-nearest neighbor, the multilayer perceptron neural network (MLP), the classification and regression tree (CART), naïve Bayes, and the support vector machine, and examines their imputation results for categorical, numerical, and mixed data types. The experimental results demonstrate that CART outperforms the other methods for categorical datasets, whereas the MLP is optimal for numerical and mixed datasets in terms of classification accuracy. However, when computational cost is a factor, CART is superior to the MLP because it provides reasonably accurate imputation results while requiring the least time to perform missing value imputation. Moreover, CART produces the lowest root-mean-squared error of all the methods.
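To make the distinction between the two families of methods concrete, the sketch below places a mean/mode (statistical) imputer next to a CART-style imputer that trains a decision tree on the complete rows to predict the missing entries of one attribute. This is only an illustrative sketch in Python with pandas and scikit-learn, not the setup used in the paper (the experiments were run in Weka; see the notes below), and the toy dataset, column names, and helper functions are hypothetical.

```python
# Illustrative sketch: statistical (mean/mode) vs. CART-based imputation.
# Hypothetical toy data; assumes pandas and scikit-learn are installed.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy mixed-type dataset with missing values encoded as NaN.
df = pd.DataFrame({
    "age":    [25.0, 32.0, np.nan, 41.0, 37.0, np.nan],
    "income": [30e3, 52e3, 47e3, np.nan, 61e3, 45e3],
    "sector": ["it", "it", "retail", np.nan, "retail", "it"],
})

def impute_statistical(data: pd.DataFrame) -> pd.DataFrame:
    """Statistical baseline: mean for numeric columns, mode for the rest."""
    out = data.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

def impute_cart(data: pd.DataFrame, target: str) -> pd.DataFrame:
    """Supervised imputation: a decision tree (CART) trained on the rows
    where `target` is observed predicts its missing values from the other
    attributes (pre-imputed and one-hot encoded for simplicity)."""
    out = data.copy()
    features = pd.get_dummies(impute_statistical(out.drop(columns=[target])))
    missing = out[target].isna()
    if pd.api.types.is_numeric_dtype(out[target]):
        model = DecisionTreeRegressor(random_state=0)   # regression tree
    else:
        model = DecisionTreeClassifier(random_state=0)  # classification tree
    model.fit(features[~missing], out.loc[~missing, target])
    out.loc[missing, target] = model.predict(features[missing])
    return out

print(impute_statistical(df))
print(impute_cart(df, target="age"))
print(impute_cart(df, target="sector"))
```

In this kind of scheme, each incomplete attribute is treated in turn as the prediction target, which mirrors how the supervised learners compared in the paper are used as imputation models; the pre-imputation and one-hot encoding steps above are simplifying assumptions of the sketch, not part of the paper's procedure.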



Notes

  1. http://archive.ics.uci.edu/ml/index.php.

  2. http://www.cs.waikato.ac.nz/ml/weka/.

  3. The computing environment was a PC with an Intel® Core™ i7-2600 CPU @ 3.40 GHz and 4 GB of RAM.


Acknowledgements

This work was supported by the Ministry of Science and Technology of Taiwan (MOST 105-2410-H-008-043-MY3).

Author information


Corresponding author

Correspondence to Chih-Fong Tsai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Tsai, C.-F., Hu, Y.-H. Empirical comparison of supervised learning techniques for missing value imputation. Knowl Inf Syst 64, 1047–1075 (2022). https://doi.org/10.1007/s10115-022-01661-0

