Tracing Errors in Probabilistic Databases Based on the Bayesian Network

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9050)

Abstract

Data in probabilistic databases may not be absolutely correct and, worse, may be erroneous. Many existing data cleaning methods can detect errors in traditional databases, but they fall short of guiding us to errors in probabilistic databases, especially databases with complex correlations among data. In this paper, we propose a method for tracing errors in probabilistic databases that adopts the Bayesian network (BN) as the framework for representing the correlations among data. We first develop techniques to construct an augmented Bayesian network (ABN) for an anomalous query, representing the correlations among input data, intermediate data and output data in the query execution. Inspired by the notion of blame in causal models, we then define a notion of blame for ranking candidate errors. Next, we provide an efficient method for computing the degree of blame of each candidate error based on probabilistic inference over the ABN. Experimental results show the effectiveness and efficiency of our method.
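
As a rough illustration of the idea of ranking candidate errors by probabilistic inference, the following minimal sketch scores each input tuple of a toy query by how much the probability of the output drops when that tuple is assumed absent. Everything in it is an assumption made for illustration: the tuple names, the prior probabilities, the query lineage, the independence of the inputs, and the score itself, which is a simple stand-in and not the paper's definition of blame or its ABN construction.

```python
from itertools import product

# Hypothetical prior probabilities that each input tuple is present.
# Inputs are treated as independent here, which is a simplification.
PRIORS = {"t1": 0.9, "t2": 0.6, "t3": 0.2}


def output(world):
    """Toy query lineage: (t1 AND t2) OR (t1 AND t3)."""
    join_a = world["t1"] and world["t2"]  # intermediate node
    join_b = world["t1"] and world["t3"]  # intermediate node
    return join_a or join_b               # output node


def prob_output(evidence=None):
    """P(output = True | evidence), by enumerating all possible worlds."""
    evidence = evidence or {}
    names = list(PRIORS)
    num = den = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # Skip worlds that contradict the evidence.
        if any(world[k] != v for k, v in evidence.items()):
            continue
        weight = 1.0
        for name in names:
            weight *= PRIORS[name] if world[name] else 1.0 - PRIORS[name]
        den += weight
        if output(world):
            num += weight
    return num / den


def rank_candidates():
    """Rank inputs by the drop in P(output) when the input is set absent."""
    base = prob_output()
    scores = {n: base - prob_output({n: False}) for n in PRIORS}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_candidates():
        print(f"{name}: {score:.3f}")
```

In this toy setting, t1 ranks first because the output cannot hold without it. Exhaustive enumeration is exponential in the number of inputs and is used only to keep the sketch short; a realistic setting would rely on a Bayesian network inference algorithm instead.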



Author information

Corresponding author

Correspondence to Kun Yue.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Duan, L., Yue, K., Jin, C., Xu, W., Liu, W. (2015). Tracing Errors in Probabilistic Databases Based on the Bayesian Network. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_7

  • DOI: https://doi.org/10.1007/978-3-319-18123-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18122-6

  • Online ISBN: 978-3-319-18123-3

  • eBook Packages: Computer Science, Computer Science (R0)
