Abstract
Missing data estimation is an important strategy for improving performance when learning from incomplete data, especially when records with missing values cannot be discarded. However, most existing algorithms focus on data missing at random (MAR) or missing completely at random (MCAR), and less attention has been paid to data not missing at random (NMAR). In this paper, an information decomposition imputation (IDIM) algorithm using a fuzzy membership function is proposed to address the missing value problem under NMAR. First, the proposed IDIM algorithm is presented with detailed examples. Then, the approach is evaluated in extensive experiments against several typical algorithms. The experimental results demonstrate that the proposed algorithm achieves higher accuracy than existing imputation approaches in terms of normalized root mean square error (NRMSE) and TP+TN evaluation under different missing strategies.
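To make the NMAR setting and the NRMSE criterion concrete, the following is a minimal Python sketch. It is not the IDIM algorithm. It assumes one simple NMAR mechanism (larger values are more likely to be missing), uses plain mean imputation as a stand-in baseline, and normalizes the RMSE by the standard deviation of the true values, which is one common convention and may differ from the normalization used in the paper. The names `nmar_mask` and `nrmse` are illustrative only.

```python
import numpy as np

def nmar_mask(x, rate, rng):
    """Pick entries to hide, with probability growing with the value itself.

    Assumed NMAR mechanism for illustration: the chance that an entry is
    missing depends on its own (unobserved) value via its rank.
    """
    ranks = np.argsort(np.argsort(x))            # rank 0 .. n-1 of each value
    weights = (ranks + 1) / (ranks + 1).sum()    # larger values get larger weight
    n_missing = int(round(rate * x.size))
    idx = rng.choice(x.size, size=n_missing, replace=False, p=weights)
    mask = np.zeros(x.size, dtype=bool)
    mask[idx] = True                             # True marks a missing entry
    return mask

def nrmse(true_vals, imputed_vals):
    """RMSE over the imputed entries divided by the std of the true values."""
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    rmse = np.sqrt(np.mean((imputed_vals - true_vals) ** 2))
    return rmse / np.std(true_vals)

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=500)    # synthetic complete attribute
mask = nmar_mask(x, rate=0.2, rng=rng)

# Baseline: fill every missing entry with the mean of the observed entries.
observed_mean = x[~mask].mean()
imputed = np.full(mask.sum(), observed_mean)
score = nrmse(x[mask], imputed)
print(f"observed mean = {observed_mean:.3f}, NRMSE = {score:.3f}")
```

Because the hidden entries are systematically larger than the observed ones, the observed mean underestimates them. This kind of bias is what separates NMAR from MAR and MCAR, where the observed data remain representative of the missing values.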
Cite this article
Liu, S., Dai, H. & Gan, M. Information-decomposition-model-based missing value estimation for not missing at random dataset. Int. J. Mach. Learn. & Cyber. 9, 85–95 (2018). https://doi.org/10.1007/s13042-015-0354-5
DOI: https://doi.org/10.1007/s13042-015-0354-5