Imputation for Categorical Attributes with Probabilistic Reasoning

Jin, Lian; Wang, Hongzhi; Gao, Hong

doi:10.1007/978-3-642-38562-9_9

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7923)

Abstract

Because incompleteness limits how data can be used, missing values in a database should be estimated to make data mining and analysis more accurate. Beyond ignoring incomplete records or substituting default values, many imputation methods have been proposed, but all of them have limitations. This paper proposes a probabilistic method for estimating missing values. We construct a Bayesian network in a novel way to identify the dependencies in a dataset, then use Bayesian reasoning to find the most probable substitution for each missing value. The method has three benefits: (1) irrelevant attributes can be ignored during estimation; (2) the network is built with no target attribute, so all attributes are handled in one model; (3) probability information is available to measure the accuracy of the imputation. Experimental results show that our construction algorithm is effective and that the filled values are of higher quality than those produced by mode imputation and the kNN method. We also verify experimentally the effectiveness of the probabilities given by our method.
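The core idea of choosing the most probable substitution, together with a probability that measures confidence in it, can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the paper learns the dependency structure as a Bayesian network, whereas the sketch assumes the relevant parent attributes are already known and estimates a single conditional distribution by counting. The function names (`fit_conditional`, `impute`) and the toy attributes (`weather`, `season`, `umbrella`) are hypothetical.

```python
from collections import Counter, defaultdict

def fit_conditional(rows, target, parents):
    """Count target values for each observed combination of parent values."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[target] is None or any(row[p] is None for p in parents):
            continue  # use only rows where target and parents are all observed
        counts[tuple(row[p] for p in parents)][row[target]] += 1
    return counts

def impute(row, target, parents, counts):
    """Return (most probable value, its estimated conditional probability)."""
    key = tuple(row[p] for p in parents)
    dist = counts.get(key)
    if not dist:
        return None, 0.0  # parent combination never seen; leave the gap
    value, n = dist.most_common(1)[0]
    return value, n / sum(dist.values())

# Toy data (hypothetical). "season" plays the role of an irrelevant
# attribute: it is simply not listed as a parent, so it is ignored.
rows = [
    {"weather": "rain", "season": "winter", "umbrella": "yes"},
    {"weather": "rain", "season": "summer", "umbrella": "yes"},
    {"weather": "rain", "season": "winter", "umbrella": "no"},
    {"weather": "sun", "season": "summer", "umbrella": "no"},
    {"weather": "sun", "season": "winter", "umbrella": "no"},
    {"weather": "rain", "season": "summer", "umbrella": None},
]

counts = fit_conditional(rows, "umbrella", ["weather"])
value, prob = impute(rows[-1], "umbrella", ["weather"], counts)
print(value, round(prob, 2))  # yes 0.67
```

The returned probability (2 of the 3 observed "rain" rows have `umbrella = "yes"`) is the kind of information the abstract refers to in benefit (3). A full Bayesian network would additionally combine evidence across several dependent attributes.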

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, China
Lian Jin, Hongzhi Wang & Hong Gao

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Jianyong Wang
Management Science and Information Systems Department, Rutgers, the State University of New Jersey, 1, Washington Park, 07102, Newark, NJ, USA
Hui Xiong
Department of Information Engineering, Nagoya University, 464-8601, Nagoya, Japan
Yoshiharu Ishikawa
Department of Computer Science, Hong Kong Baptist University, Hong Kong
Jianliang Xu
School of Information Science and Engineering, Yanshan University, Qinhuangdao, China
Junfeng Zhou

About this paper

Cite this paper

Jin, L., Wang, H., Gao, H. (2013). Imputation for Categorical Attributes with Probabilistic Reasoning. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds) Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38562-9_9

DOI: https://doi.org/10.1007/978-3-642-38562-9_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38561-2
Online ISBN: 978-3-642-38562-9
eBook Packages: Computer Science (R0)
