
Cleaning Missing Data Based on the Bayesian Network

  • Conference paper
Web-Age Information Management (WAIM 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7901)

Abstract

To guarantee data quality, it is necessary to clean the missing data that are prevalent in real-world databases. Many existing data cleaning methods derive the correct value for each missing data item by incorporating additional information, such as functional dependencies or integrity constraints. In this paper, we propose a method for cleaning missing data items without additional information by adopting the Bayesian network (BN) as the framework for representing and inferring probability distributions. First, we learn a Bayesian network, called IBN, from the complete part of the given incomplete database. Then, we infer the probability distribution of each missing data item by Gibbs sampling over the IBN. Consequently, we obtain all possible values with their corresponding probabilities (i.e., confidence degrees), by which we clean the incomplete database. Experimental results show the efficiency, accuracy and precision of our method.
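The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the network structure (`PARENTS`), the attribute domains (`DOMAIN`), the Laplace smoothing parameter `alpha`, and the toy rows are all hypothetical, and the structure is assumed to be given rather than learned, since the preview does not describe how the IBN structure is obtained. The sketch estimates conditional probability tables from the complete rows only, then runs Gibbs sampling over the missing attributes of one row, drawing each from its Markov-blanket conditional.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy schema: an assumed DAG (attribute -> parents) and binary domains.
PARENTS = {"A": [], "B": ["A"], "C": ["B"]}
DOMAIN = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}

def learn_cpts(rows, alpha=1.0):
    """Estimate P(X | parents(X)) from the complete rows, with Laplace smoothing."""
    complete = [r for r in rows if None not in r.values()]
    counts = {x: defaultdict(Counter) for x in PARENTS}
    for r in complete:
        for x, ps in PARENTS.items():
            counts[x][tuple(r[p] for p in ps)][r[x]] += 1

    def prob(x, value, assignment):
        c = counts[x][tuple(assignment[p] for p in PARENTS[x])]
        return (c[value] + alpha) / (sum(c.values()) + alpha * len(DOMAIN[x]))

    return prob

def gibbs_impute(row, prob, n_samples=2000, burn_in=200, seed=0):
    """Return an estimated distribution over values for each missing attribute of row."""
    rng = random.Random(seed)
    missing = [x for x, v in row.items() if v is None]
    state = dict(row)
    for x in missing:  # random initial assignment for the missing attributes
        state[x] = rng.choice(DOMAIN[x])
    tallies = {x: Counter() for x in missing}
    for step in range(burn_in + n_samples):
        for x in missing:
            weights = []
            for v in DOMAIN[x]:
                state[x] = v
                # Markov-blanket conditional: P(x | parents) * prod P(child | its parents)
                w = prob(x, v, state)
                for child, ps in PARENTS.items():
                    if x in ps:
                        w *= prob(child, state[child], state)
                weights.append(w)
            state[x] = rng.choices(DOMAIN[x], weights=weights)[0]
            if step >= burn_in:
                tallies[x][state[x]] += 1
    return {x: {v: tallies[x][v] / n_samples for v in DOMAIN[x]} for x in missing}

# Toy data: A, B and C mostly agree; the last row has a missing B.
rows = (
    [{"A": 1, "B": 1, "C": 1}] * 40
    + [{"A": 0, "B": 0, "C": 0}] * 40
    + [{"A": 1, "B": 0, "C": 0}] * 5
    + [{"A": 0, "B": 1, "C": 1}] * 5
    + [{"A": 1, "B": None, "C": 1}]
)
prob = learn_cpts(rows)
dist = gibbs_impute(rows[-1], prob)
print(dist)  # candidate values for B with their estimated probabilities
```

The returned dictionary plays the role of the "possible values with confidence degrees" described above: a cleaning step could keep the whole distribution, as in a probabilistic database, or fill in the most probable value.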




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Duan, L., Yue, K., Qian, W., Liu, W. (2013). Cleaning Missing Data Based on the Bayesian Network. In: Gao, Y., et al. Web-Age Information Management. WAIM 2013. Lecture Notes in Computer Science, vol 7901. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39527-7_34

  • DOI: https://doi.org/10.1007/978-3-642-39527-7_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-39526-0

  • Online ISBN: 978-3-642-39527-7

  • eBook Packages: Computer Science, Computer Science (R0)
