PrDB: managing and exploiting rich correlations in probabilistic databases

Sen, Prithviraj; Deshpande, Amol; Getoor, Lise

doi:10.1007/s00778-009-0153-2

Special Issue Paper
Published: 15 July 2009

Volume 18, pages 1065–1090 (2009)

The VLDB Journal

Abstract

Due to numerous applications producing noisy data, e.g., sensor data, experimental data, data from uncurated sources, information extraction, etc., there has been a surge of interest in the development of probabilistic databases. Most probabilistic database models proposed to date, however, fail to meet the challenges of real-world applications on two counts: (1) they often restrict the kinds of uncertainty that the user can represent; and (2) the query processing algorithms often cannot scale up to the needs of the application. In this work, we define a probabilistic database model, PrDB, that uses graphical models, a state-of-the-art probabilistic modeling technique developed within the statistics and machine learning community, to model uncertain data. We show how this results in a rich, complex yet compact probabilistic database model, which can capture the commonly occurring uncertainty models (tuple uncertainty, attribute uncertainty), more complex models (correlated tuples and attributes) and allows compact representation (shared and schema-level correlations). In addition, we show how query evaluation in PrDB translates into inference in an appropriately augmented graphical model. This allows us to easily use any of a myriad of exact and approximate inference algorithms developed within the graphical modeling community. While probabilistic inference provides a generic approach to solving queries, we show how the use of shared correlations, together with a novel inference algorithm that we developed based on bisimulation, can speed query processing significantly. We present a comprehensive experimental evaluation of the proposed techniques and show that even with a few shared correlations, significant speedups are possible.
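
As a rough illustration of the idea that query evaluation reduces to inference over a graphical model, the sketch below builds a toy factor representation of two correlated uncertain tuples and answers a Boolean query by brute-force enumeration of possible worlds. It is not PrDB's algorithm or API: the tuple names, factor values, and the helper functions `world_weight` and `query_probability` are all hypothetical, and a real system would replace the enumeration with an exact or approximate inference algorithm.

```python
from itertools import product

# Hypothetical toy database: two uncertain tuples, each with a binary
# existence variable (1 = the tuple is present in a possible world).
variables = ["t1", "t2"]

# Each factor is (scope, table): the table maps an assignment of the
# scope's variables to a non-negative weight. Values are made up.
factors = [
    # Tuple uncertainty on t1 alone.
    (("t1",), {(0,): 0.4, (1,): 0.6}),
    # Correlation: t2 is unlikely to exist when t1 exists.
    (("t1", "t2"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1}),
]


def world_weight(world):
    """Unnormalized weight of one possible world: the product of all factors."""
    weight = 1.0
    for scope, table in factors:
        weight *= table[tuple(world[v] for v in scope)]
    return weight


def query_probability(predicate):
    """Probability that a Boolean query holds, by enumerating every world."""
    worlds = [
        dict(zip(variables, values))
        for values in product((0, 1), repeat=len(variables))
    ]
    total = sum(world_weight(w) for w in worlds)
    satisfied = sum(world_weight(w) for w in worlds if predicate(w))
    return satisfied / total


# Query 1: does at least one tuple exist?
p_any = query_probability(lambda w: w["t1"] == 1 or w["t2"] == 1)
# Query 2: do both tuples exist (e.g. both needed for a join result)?
p_both = query_probability(lambda w: w["t1"] == 1 and w["t2"] == 1)
print(p_any, p_both)
```

Enumeration is exponential in the number of variables, which is why the choice of inference algorithm, and the reuse of repeated (shared) factors, matters for scalability.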

Author information

Authors and Affiliations

Computer Science Department, University of Maryland, College Park, MD, 20742, USA
Prithviraj Sen, Amol Deshpande & Lise Getoor

Corresponding author

Correspondence to Amol Deshpande.

About this article

Cite this article

Sen, P., Deshpande, A. & Getoor, L. PrDB: managing and exploiting rich correlations in probabilistic databases. The VLDB Journal 18, 1065–1090 (2009). https://doi.org/10.1007/s00778-009-0153-2

Received: 15 September 2008
Revised: 07 June 2009
Accepted: 10 June 2009
Published: 15 July 2009
Issue Date: October 2009
DOI: https://doi.org/10.1007/s00778-009-0153-2
