Tracing Errors in Probabilistic Databases Based on the Bayesian Network

Duan, Liang; Yue, Kun; Jin, Cheqing; Xu, Wenlin; Liu, Weiyi

doi:10.1007/978-3-319-18123-3_7

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9050)

Abstract

Data in probabilistic databases may not be absolutely correct and, worse, may be erroneous. Many existing data cleaning methods can detect errors in traditional databases, but they fall short of guiding us to errors in probabilistic databases, especially those with complex correlations among data. In this paper, we propose a method for tracing errors in probabilistic databases that adopts the Bayesian network (BN) as the framework for representing correlations among data. We first develop techniques to construct an augmented Bayesian network (ABN) for an anomalous query, representing the correlations among input data, intermediate data, and output data during query execution. Inspired by the notion of blame in causal models, we then define a notion of blame for ranking candidate errors. Next, we provide an efficient method for computing the degree of blame of each candidate error based on probabilistic inference over the ABN. Experimental results show the effectiveness and efficiency of our method.
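
The abstract does not give the ABN construction or the formal definition of blame, so the following is only an illustrative sketch of the general idea: model input, intermediate, and output data as network nodes, condition on an anomalous output, and rank input tuples by inference. Everything in it is an assumption made for illustration: the three tuples and their probabilities, the toy query (a join followed by a union), brute-force enumeration in place of the paper's efficient inference, and the score itself (posterior minus prior), which is a stand-in and not the paper's degree of blame.

```python
from itertools import product

# Hypothetical existence probabilities of three independent input tuples.
PRIOR = {"t1": 0.9, "t2": 0.3, "t3": 0.5}


def output_present(world):
    """Deterministic nodes of the toy network.

    Intermediate node: j = t1 AND t2 (a join).
    Output node:       o = j OR t3  (a union).
    """
    joined = world["t1"] and world["t2"]
    return joined or world["t3"]


def world_probability(world):
    """Probability of one possible world under tuple independence."""
    p = 1.0
    for name, present in world.items():
        p *= PRIOR[name] if present else 1.0 - PRIOR[name]
    return p


def posterior_given_output():
    """P(tuple present | output present), by enumerating all possible worlds."""
    names = list(PRIOR)
    evidence = 0.0
    joint = {n: 0.0 for n in names}
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if not output_present(world):
            continue  # keep only worlds consistent with the anomalous output
        p = world_probability(world)
        evidence += p
        for n in names:
            if world[n]:
                joint[n] += p
    return {n: joint[n] / evidence for n in names}


def rank_candidates():
    """Rank tuples by how far the anomalous output shifts belief in them.

    This shift is an assumed stand-in score, not the paper's blame measure.
    """
    post = posterior_given_output()
    shift = {n: post[n] - PRIOR[n] for n in PRIOR}
    return sorted(shift, key=shift.get, reverse=True), shift


if __name__ == "__main__":
    ranking, shift = rank_candidates()
    for name in ranking:
        print(f"{name}: shift = {shift[name]:.4f}")
```

Enumeration is exponential in the number of tuples, so it only works for toy inputs; it is used here to keep the sketch self-contained.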

Author information

Authors and Affiliations

Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, China
Liang Duan, Kun Yue, Wenlin Xu & Weiyi Liu
Institute of Massive Computing, East China Normal University, Shanghai, China
Cheqing Jin

Corresponding author

Correspondence to Kun Yue.

Editor information

Editors and Affiliations

Universität München, München, Germany
Matthias Renz
University of Southern California, Los Angeles, USA
Cyrus Shahabi
University of Queensland, Brisbane, Australia
Xiaofang Zhou
Monash University, Clayton, Australia
Muhammad Aamir Cheema

About this paper

Cite this paper

Duan, L., Yue, K., Jin, C., Xu, W., Liu, W. (2015). Tracing Errors in Probabilistic Databases Based on the Bayesian Network. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_7

DOI: https://doi.org/10.1007/978-3-319-18123-3_7
Published: 09 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science, Computer Science (R0)
