Abstract
Data in probabilistic databases may be imprecise or, worse, erroneous. Many existing data cleaning methods can detect errors in traditional databases, but they offer little guidance for finding errors in probabilistic databases, especially those with complex correlations among data. In this paper, we propose a method for tracing errors in probabilistic databases that adopts the Bayesian network (BN) as the framework for representing correlations among data. We first develop techniques to construct an augmented Bayesian network (ABN) for an anomalous query, representing the correlations among input data, intermediate data, and output data in the query execution. Inspired by the notion of blame in causal models, we then define a notion of blame for ranking candidate errors. Next, we provide an efficient method for computing the degree of blame of each candidate error based on probabilistic inference over the ABN. Experimental results show the effectiveness and efficiency of our method.
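To make the idea of ranking candidate errors by probabilistic inference concrete, the following is a minimal sketch and not the paper's method. The abstract does not spell out the ABN construction or the formal definition of blame, so everything here is an assumption made for illustration: the tuple names `t1`, `t2`, `t3`, their prior probabilities, the toy query `o = (t1 AND t2) OR t3`, the independence of the input tuples, and the score itself (how far the posterior of each input moves from its prior once the anomalous output is observed). Inference is done by brute-force enumeration, which is only feasible for tiny networks.

```python
from itertools import product

# Hypothetical prior probabilities that each input tuple is present.
# Inputs are assumed independent here; the paper targets correlated data.
PRIORS = {"t1": 0.9, "t2": 0.4, "t3": 0.2}


def output_present(world):
    """Deterministic toy query: o = (t1 AND t2) OR t3."""
    joined = world["t1"] and world["t2"]  # intermediate node (join result)
    return joined or world["t3"]          # output node


def world_probability(world):
    """Probability of one complete assignment of the input tuples."""
    p = 1.0
    for name, present in world.items():
        p *= PRIORS[name] if present else 1.0 - PRIORS[name]
    return p


def posteriors_given_output():
    """Return P(o) and P(t_i present | o present) by enumerating all worlds."""
    names = list(PRIORS)
    evidence = 0.0
    joint = {n: 0.0 for n in names}
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if not output_present(world):
            continue  # keep only worlds consistent with the observed output
        p = world_probability(world)
        evidence += p
        for n in names:
            if world[n]:
                joint[n] += p
    return evidence, {n: joint[n] / evidence for n in names}


evidence, post = posteriors_given_output()

# Illustrative score (an assumption, not the paper's blame): posterior shift.
ranking = sorted(post, key=lambda n: post[n] - PRIORS[n], reverse=True)

for name in ranking:
    print(f"{name}: prior={PRIORS[name]:.3f} posterior={post[name]:.3f}")
```

In this toy setting the input whose belief changes most after observing the output is ranked first. A real implementation would replace the enumeration with an inference algorithm over the network and use the paper's own definition of blame.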
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Duan, L., Yue, K., Jin, C., Xu, W., Liu, W. (2015). Tracing Errors in Probabilistic Databases Based on the Bayesian Network. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_7
DOI: https://doi.org/10.1007/978-3-319-18123-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science (R0)