Imputing missing value through ensemble concept based on statistical measures

  • Regular Paper
Knowledge and Information Systems

Abstract

Many datasets contain missing values in their attributes, and data mining techniques are not applicable in the presence of missing values. Missing value management is therefore an important preprocessing step in a data mining task. One of the most important categories of missing value management techniques is missing value imputation. This paper presents a new imputation technique based on statistical measures. The proposed technique employs an ensemble of estimators, built separately on the positively and negatively correlated observed attributes, to estimate the missing values. Each estimator proposes a value for a missing entry based on the mean and variance of that feature, both estimated from the feature's non-missing values. The final consensus value for a missing entry is the weighted aggregation of the values proposed by the different estimators. The primary weight is the attribute correlation; a secondary, smaller weight depends on a kernel function of measures such as kurtosis, skewness, the number of samples involved, and their composition. For evaluation, missing values are deliberately introduced at random at different levels. The experiments indicate that the proposed technique achieves good accuracy in comparison with classical methods.
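The abstract does not give the estimator formulas, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that each observed attribute in a row acts as one estimator, that an estimator maps the attribute's standardized value onto the target feature's mean and standard deviation (with the sign flipped for negatively correlated attributes), and that the estimates are combined using the absolute correlation as the weight. The secondary kernel-based weight (kurtosis, skewness, sample count) is omitted, and the function name `ensemble_impute` is hypothetical.

```python
import numpy as np

def ensemble_impute(X):
    """Fill NaNs with a correlation-weighted ensemble (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    # Mean and standard deviation of each feature, from its non-missing values.
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    Z = (X - mean) / std  # standardized values (NaN where X is missing)

    n_features = X.shape[1]
    # Pairwise correlations, using only rows where both features are observed.
    corr = np.zeros((n_features, n_features))
    for a in range(n_features):
        for b in range(n_features):
            if a == b:
                continue
            both = ~np.isnan(X[:, a]) & ~np.isnan(X[:, b])
            if both.sum() > 2 and X[both, a].std() > 0 and X[both, b].std() > 0:
                corr[a, b] = np.corrcoef(X[both, a], X[both, b])[0, 1]

    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        observed = ~np.isnan(X[i])          # attributes available in this row
        weights = np.abs(corr[j, observed])  # assumed primary weight: |correlation|
        if weights.sum() == 0:
            filled[i, j] = mean[j]           # fall back to plain mean imputation
            continue
        # One estimate per observed attribute; negative correlation flips the sign.
        estimates = mean[j] + np.sign(corr[j, observed]) * std[j] * Z[i, observed]
        filled[i, j] = np.sum(weights * estimates) / weights.sum()
    return filled
```

In this sketch the positively and negatively correlated attributes differ only in the sign applied to the standardized value; a fuller version could aggregate the two groups separately before combining them, as the abstract suggests.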

Acknowledgements

We thank the anonymous reviewers for their very useful comments and suggestions.

Author information

Corresponding author

Correspondence to Samad Nejatian.

About this article

Cite this article

Jenghara, M.M., Ebrahimpour-Komleh, H., Rezaie, V. et al. Imputing missing value through ensemble concept based on statistical measures. Knowl Inf Syst 56, 123–139 (2018). https://doi.org/10.1007/s10115-017-1118-1

  • DOI: https://doi.org/10.1007/s10115-017-1118-1
