Abstract
Numerous applications produce noisy data (sensor data, experimental data, data from uncurated sources, and the output of information extraction, among others), which has led to a surge of interest in the development of probabilistic databases. Most probabilistic database models proposed to date, however, fail to meet the challenges of real-world applications on two counts: (1) they often restrict the kinds of uncertainty that the user can represent; and (2) their query processing algorithms often cannot scale to the needs of the application. In this work, we define a probabilistic database model, PrDB, that models uncertain data using graphical models, a state-of-the-art probabilistic modeling technique developed within the statistics and machine learning communities. We show how this results in a rich, complex, yet compact probabilistic database model that can capture the commonly occurring uncertainty models (tuple uncertainty, attribute uncertainty) as well as more complex ones (correlated tuples and attributes), and that allows compact representation (shared and schema-level correlations). In addition, we show how query evaluation in PrDB translates into inference in an appropriately augmented graphical model, which allows us to use any of the many exact and approximate inference algorithms developed within the graphical modeling community. While probabilistic inference provides a generic approach to answering queries, we show how the use of shared correlations, together with a novel inference algorithm that we developed based on bisimulation, can speed up query processing significantly. We present a comprehensive experimental evaluation of the proposed techniques and show that even with a few shared correlations, significant speedups are possible.
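The idea that query evaluation becomes inference in an augmented graphical model can be illustrated with a minimal Python sketch. It is not PrDB's implementation: the two tuples `s1` and `s2`, their probabilities, the result variable `r`, and the helper `marginal` are all hypothetical, and brute-force enumeration stands in for the exact and approximate inference algorithms the abstract refers to. Each uncertain tuple gets a binary existence variable, a factor correlates the two tuples, and a Boolean query ("does at least one tuple exist?") adds a deterministic factor tying `r` to the tuples.

```python
from itertools import product

def marginal(factors, variables, target, value):
    """Brute-force P(target = value) over binary variables (illustrative only)."""
    total = 0.0
    match = 0.0
    for assignment in product((0, 1), repeat=len(variables)):
        world = dict(zip(variables, assignment))
        weight = 1.0
        # The weight of a possible world is the product of all factors.
        for scope, fn in factors:
            weight *= fn(*(world[v] for v in scope))
        total += weight
        if world[target] == value:
            match += weight
    return match / total

# Hypothetical uncertain data: s1 exists with probability 0.6,
# and s2 is positively correlated with s1.
s2_given_s1 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}
data_factors = [
    (("s1",), lambda s1: 0.6 if s1 else 0.4),
    (("s1", "s2"), lambda s1, s2: s2_given_s1[(s1, s2)]),
]

# Augmentation for the query: r is true exactly when s1 or s2 exists.
query_factor = (("s1", "s2", "r"),
                lambda s1, s2, r: 1.0 if r == int(s1 or s2) else 0.0)

p_result = marginal(data_factors + [query_factor], ["s1", "s2", "r"], "r", 1)
print(round(p_result, 4))
```

With these made-up numbers the only world in which the query fails is the one where neither tuple exists, so the result probability is 1 - 0.4 * 0.9, roughly 0.64. Enumeration is exponential in the number of variables, which is why scalable inference matters for larger databases.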
Cite this article
Sen, P., Deshpande, A. & Getoor, L. PrDB: managing and exploiting rich correlations in probabilistic databases. The VLDB Journal 18, 1065–1090 (2009). https://doi.org/10.1007/s00778-009-0153-2
DOI: https://doi.org/10.1007/s00778-009-0153-2