Abstract
Because incompleteness limits how data can be used, missing values in a database should be estimated to make data mining and analysis more accurate. Besides ignoring missing values or setting them to defaults, many imputation methods have been proposed, but each has its limitations. This paper proposes a probabilistic method to estimate missing values. We construct a Bayesian network in a novel way to identify the dependencies in a dataset, then use Bayesian reasoning to find the most probable substitution for each missing value. The benefits of this method are that (1) irrelevant attributes can be ignored during estimation; (2) the network is built with no target attribute, so all attributes are handled in one model; and (3) probability information can be obtained to measure the accuracy of the imputation. Experimental results show that our construction algorithm is effective and that the quality of the filled values is better than that of the mode imputation method and the kNN method. We also verify experimentally the effectiveness of the probabilities given by our method.
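The idea of choosing the most probable substitution and reporting its probability can be illustrated with a minimal sketch. This is not the paper's algorithm: the network construction step is not shown, the parent set of the incomplete attribute is simply assumed to be known, and the toy data, the function names `fit_cpt` and `impute`, and the Laplace smoothing parameter `alpha` are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

def fit_cpt(rows, target, parents):
    """Count target values per combination of parent values (complete rows only)."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[target] is None or any(row[p] is None for p in parents):
            continue  # skip rows that cannot contribute to the estimate
        key = tuple(row[p] for p in parents)
        counts[key][row[target]] += 1
    return counts

def impute(row, target, parents, counts, domain, alpha=1.0):
    """Return (most probable value, estimated probability) for row[target].

    Assumes the parent attributes of `row` are observed; alpha is an
    assumed Laplace smoothing constant.
    """
    key = tuple(row[p] for p in parents)
    seen = counts.get(key, Counter())
    total = sum(seen.values()) + alpha * len(domain)
    probs = {v: (seen[v] + alpha) / total for v in domain}
    best = max(probs, key=probs.get)
    return best, probs[best]

# Hypothetical toy table; None marks a missing value.
rows = [
    {"weather": "rain", "day": "mon", "umbrella": "yes"},
    {"weather": "rain", "day": "tue", "umbrella": "yes"},
    {"weather": "rain", "day": "wed", "umbrella": "yes"},
    {"weather": "sun", "day": "mon", "umbrella": "no"},
    {"weather": "sun", "day": "tue", "umbrella": "no"},
    {"weather": "sun", "day": "thu", "umbrella": "no"},
    {"weather": "sun", "day": "fri", "umbrella": "no"},
    {"weather": "rain", "day": "fri", "umbrella": None},
]

target = "umbrella"
parents = ["weather"]  # assumed dependency; "day" is treated as irrelevant
domain = sorted({r[target] for r in rows if r[target] is not None})

counts = fit_cpt(rows, target, parents)
value, prob = impute(rows[-1], target, parents, counts, domain)

# Baseline for comparison: mode imputation ignores all other attributes.
mode_value = Counter(
    r[target] for r in rows if r[target] is not None
).most_common(1)[0][0]

print(value, round(prob, 2), mode_value)
```

On this toy table the conditional estimate fills the gap with "yes" at probability 0.8, while mode imputation would fill it with "no". The returned probability plays the role the abstract describes: a measure of how much confidence to place in the filled value.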
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jin, L., Wang, H., Gao, H. (2013). Imputation for Categorical Attributes with Probabilistic Reasoning. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds) Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38562-9_9
DOI: https://doi.org/10.1007/978-3-642-38562-9_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38561-2
Online ISBN: 978-3-642-38562-9
eBook Packages: Computer Science, Computer Science (R0)