Missing data imputation using decision trees and fuzzy clustering with iterative learning

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Various imputation approaches have been proposed to address the issue of missing values in data mining and machine learning applications. To improve the accuracy of missing data imputation, this paper proposes a new method called DIFC, which integrates the merits of decision trees and fuzzy clustering into an iterative learning approach. To compare the performance of DIFC against five effective imputation methods, extensive experiments are conducted on six widely used datasets with numerical and categorical missing data and with various amounts and types of missing values. The experimental results show that DIFC outperforms the other methods in terms of imputation accuracy. Further experiments on the effect of missing value types demonstrate the robustness of DIFC in dealing with different types of missing values. This paper contributes to missing data imputation research by providing an accurate and robust method.
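
The combination named above (partition the records, cluster fuzzily within each partition, and re-estimate the missing cells over repeated passes) can be illustrated with a toy sketch. This is not the DIFC algorithm itself, whose details are not given in this abstract; it is only a minimal illustration under stated assumptions. The names `memberships`, `fcm_refine`, `impute`, and `leaf_of` are hypothetical. `leaf_of` is a hand-written rule standing in for the leaves of a trained decision tree, only numerical attributes are handled, and each leaf is assumed to hold at least one observed value per column.

```python
import math
import random


def memberships(row, centers, m=2.0):
    """Fuzzy c-means membership of `row` in each cluster (values sum to 1)."""
    d = [max(math.dist(row, c), 1e-12) for c in centers]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** p for dj in d) for di in d]


def fcm_refine(rows, missing, n_clusters=2, m=2.0, n_iter=50, tol=1e-6, seed=0):
    """Alternate fuzzy clustering and re-imputation on one group of records.

    rows: numeric rows whose missing cells already hold a first guess.
    missing: set of (row_index, col_index) cells that are estimates.
    """
    rng = random.Random(seed)
    rows = [r[:] for r in rows]
    k = min(n_clusters, len(rows))
    centers = [r[:] for r in rng.sample(rows, k)]
    for _ in range(n_iter):
        u = [memberships(r, centers, m) for r in rows]
        # Update each center as the membership-weighted mean of the rows.
        for c in range(k):
            w = [ui[c] ** m for ui in u]
            total = sum(w)
            centers[c] = [sum(wi * r[j] for wi, r in zip(w, rows)) / total
                          for j in range(len(rows[0]))]
        # Re-estimate every missing cell from the centers, weighted by membership.
        change = 0.0
        for i, j in missing:
            new = sum(u[i][c] * centers[c][j] for c in range(k))
            change = max(change, abs(new - rows[i][j]))
            rows[i][j] = new
        if change < tol:
            break
    return rows


def impute(data, leaf_of, **fcm_args):
    """data: rows with None for missing cells; leaf_of: row -> leaf label.

    Assumption: leaf_of only inspects attributes that are never missing.
    """
    groups = {}
    for i, row in enumerate(data):
        groups.setdefault(leaf_of(row), []).append(i)
    completed = [row[:] for row in data]
    n_cols = len(data[0])
    for idx in groups.values():
        sub, missing = [], set()
        for a, i in enumerate(idx):
            filled = []
            for j in range(n_cols):
                if data[i][j] is None:
                    # First guess: mean of the observed values in this leaf.
                    obs = [data[t][j] for t in idx if data[t][j] is not None]
                    filled.append(sum(obs) / len(obs))
                    missing.add((a, j))
                else:
                    filled.append(data[i][j])
            sub.append(filled)
        refined = fcm_refine(sub, missing, **fcm_args) if missing else sub
        for a, i in enumerate(idx):
            completed[i] = refined[a]
    return completed


# Toy data: two well-separated groups, one missing cell in each.
data = [
    [1.0, 1.1], [1.2, 0.9], [0.9, None], [1.1, 1.0],
    [5.0, 5.2], [5.1, None], [4.9, 5.0], [5.2, 4.8],
]
completed = impute(data, leaf_of=lambda r: r[0] < 3.0)
print(completed[2][1], completed[5][1])
```

Because the first guess comes from the record's own leaf and every later estimate is a weighted average of cluster centers in that leaf, each imputed value stays within the range of the values observed there. In this sketch the cluster memberships are computed from the current estimates as well as the observed cells, which is a simplification chosen for brevity.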



Acknowledgements

This project was supported through an Australian Government Research Training Program Scholarship. The authors are grateful to the editor and the anonymous reviewers for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Sanaz Nikfalazar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nikfalazar, S., Yeh, CH., Bedingfield, S. et al. Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowl Inf Syst 62, 2419–2437 (2020). https://doi.org/10.1007/s10115-019-01427-1

  • DOI: https://doi.org/10.1007/s10115-019-01427-1
