A robust missing value imputation method for noisy data

Applied Intelligence

Abstract

Missing data imputation is an important research topic in data mining. Previous work has seldom considered the impact of noise, although real-world data are often noisy. In this paper, we systematically investigate the impact of noise on imputation methods and propose a new imputation approach, RIBG (robust imputation based on GMDH), which introduces the mechanism of the Group Method of Data Handling (GMDH) to deal with incomplete, noisy data. We compare RIBG with four commonly used imputation methods on nine benchmark datasets. The experimental results demonstrate that noise has a great impact on the effectiveness of imputation techniques and that RIBG is more robust to noise than the four benchmark methods.
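The abstract does not spell out how RIBG works, so the following is only a minimal sketch of the general GMDH idea it builds on: candidate models are fitted on one part of the data and selected by their error on a separate checking part (an external criterion). This is not the authors' RIBG algorithm. The one-input linear candidates, the single selection layer, the synthetic data, and the names `fit_line`, `mse`, and `impute_column` are all assumptions made for illustration, and the sketch assumes values are missing only in the target column.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # constant predictor: fall back to the mean
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given points."""
    a, b = model
    return mean((a + b * x - y) ** 2 for x, y in zip(xs, ys))


def impute_column(rows, target):
    """Fill None entries in column `target` (hypothetical helper).

    Each other column yields one candidate model, fitted on the training
    half of the complete rows. The candidate with the lowest error on the
    checking half (the external criterion) is used for imputation.
    """
    complete = [r for r in rows if r[target] is not None]
    half = len(complete) // 2
    train, check = complete[:half], complete[half:]
    best = None
    for j in range(len(rows[0])):
        if j == target:
            continue
        model = fit_line([r[j] for r in train], [r[target] for r in train])
        err = mse(model, [r[j] for r in check], [r[target] for r in check])
        if best is None or err < best[0]:
            best = (err, j, model)
    _, j, (a, b) = best
    filled = [list(r) for r in rows]
    for r in filled:
        if r[target] is None:
            r[target] = a + b * r[j]
    return filled, j


# Synthetic, made-up data: column 2 depends on column 0 plus small noise;
# column 1 is irrelevant.
rng = random.Random(0)
rows, truth, masked = [], {}, []
for i in range(40):
    x0, x1 = rng.random(), rng.random()
    y = 1.0 + 2.0 * x0 + rng.gauss(0.0, 0.05)
    if i % 4 == 0:  # hide every fourth target value
        truth[i] = y
        masked.append(i)
        y = None
    rows.append([x0, x1, y])

filled, chosen = impute_column(rows, target=2)
```

Selecting on held-out data rather than on training error is what tends to keep such models from fitting the noise; a full GMDH network would use polynomial candidates over pairs of inputs and stack several selection layers.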

Author information

Correspondence to Changzheng He.

Additional information

This work was supported by the National Natural Science Foundation of China (Grant No. 70771067) and the NSFC/RS (Royal Society of the UK) International Joint Project (Grant No. 70911130133).

About this article

Cite this article

Zhu, B., He, C. & Liatsis, P. A robust missing value imputation method for noisy data. Appl Intell 36, 61–74 (2012). https://doi.org/10.1007/s10489-010-0244-1

  • DOI: https://doi.org/10.1007/s10489-010-0244-1
