Imputation for Categorical Attributes with Probabilistic Reasoning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7923)

Abstract

Because incompleteness limits how data can be used, missing values in a database should be estimated to make data mining and analysis more accurate. Beyond ignoring missing values or setting them to defaults, many imputation methods have been proposed, but each has its limitations. This paper proposes a probabilistic method for estimating missing values. We construct a Bayesian network in a novel way to identify the dependencies in a dataset, then use Bayesian reasoning to find the most probable substitution for each missing value. The benefits of this method are that (1) irrelevant attributes can be ignored during estimation; (2) the network is built with no target attribute, so all attributes are handled in one model; and (3) probability information is available to measure the accuracy of the imputation. Experimental results show that our construction algorithm is effective and that the quality of the filled values exceeds that of mode imputation and the kNN method. We also verify experimentally the effectiveness of the probabilities given by our method.
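
The idea of choosing the most probable substitution and reporting its probability can be illustrated with a minimal sketch. This is not the paper's network-construction algorithm: it assumes the set of relevant parent attributes for the missing attribute is already known, and it estimates the conditional distribution by counting with Laplace smoothing. The function name `impute_most_probable`, the `alpha` parameter, and the toy weather-style data are all hypothetical.

```python
from collections import Counter

def impute_most_probable(rows, target, parents, record, alpha=1.0):
    """Return (value, probability) for record[target] given its parent values.

    Assumption: `parents` lists the attributes the target depends on; in the
    paper these dependencies would come from the learned Bayesian network.
    P(target | parents) is estimated by counting, with Laplace smoothing.
    """
    # Observed domain of the target attribute, in a fixed order.
    domain = sorted({r[target] for r in rows if r[target] is not None})
    # Complete rows that agree with the record on every parent attribute.
    matches = [
        r for r in rows
        if r[target] is not None and all(r[p] == record[p] for p in parents)
    ]
    counts = Counter(r[target] for r in matches)
    total = len(matches) + alpha * len(domain)
    probs = {v: (counts[v] + alpha) / total for v in domain}
    best = max(probs, key=probs.get)
    return best, probs[best]

# Toy data (hypothetical): "play" depends on "outlook"; "windy" is irrelevant.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
]
incomplete = {"outlook": "sunny", "windy": "yes", "play": None}
value, prob = impute_most_probable(rows, "play", ["outlook"], incomplete)
print(value, prob)
```

Here the irrelevant attribute `windy` is left out of the conditioning set, which mirrors benefit (1), and the returned probability plays the role of the confidence measure in benefit (3).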

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jin, L., Wang, H., Gao, H. (2013). Imputation for Categorical Attributes with Probabilistic Reasoning. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds) Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38562-9_9

  • DOI: https://doi.org/10.1007/978-3-642-38562-9_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38561-2

  • Online ISBN: 978-3-642-38562-9

  • eBook Packages: Computer Science, Computer Science (R0)
