PrDB: managing and exploiting rich correlations in probabilistic databases

  • Special Issue Paper
  • The VLDB Journal

Abstract

Many applications produce noisy data, e.g., sensor data, experimental data, data from uncurated sources, and information extraction output, and this has led to a surge of interest in the development of probabilistic databases. Most probabilistic database models proposed to date, however, fail to meet the challenges of real-world applications on two counts: (1) they often restrict the kinds of uncertainty that the user can represent; and (2) their query processing algorithms often cannot scale up to the needs of the application. In this work, we define a probabilistic database model, PrDB, that uses graphical models, a state-of-the-art probabilistic modeling technique developed within the statistics and machine learning community, to model uncertain data. We show how this results in a rich yet compact probabilistic database model, which captures the commonly occurring uncertainty models (tuple uncertainty, attribute uncertainty) as well as more complex models (correlated tuples and attributes), and which allows compact representation (shared and schema-level correlations). In addition, we show how query evaluation in PrDB translates into inference in an appropriately augmented graphical model. This allows us to use any of the many exact and approximate inference algorithms developed within the graphical modeling community. While probabilistic inference provides a generic approach to solving queries, we show how the use of shared correlations, together with a novel inference algorithm that we developed based on bisimulation, can speed up query processing significantly. We present a comprehensive experimental evaluation of the proposed techniques and show that even with a few shared correlations, significant speedups are possible.
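The idea that query evaluation reduces to inference over factors can be illustrated with a toy example. The following is a minimal sketch, not PrDB's implementation: the relation, the variable names (`t1`, `t2`, `t3`), the probability tables, and the helper functions are all hypothetical, and brute-force enumeration of possible worlds stands in for the exact and approximate inference algorithms the abstract refers to. Each tuple has a Boolean existence variable (tuple uncertainty), one factor correlates `t1` and `t2`, and a single `prior` table is reused by two factors in the spirit of a shared correlation.

```python
from itertools import product

# Hypothetical relation: tuple id -> sensor reading.
rows = {"t1": "hot", "t2": "hot", "t3": "cold"}

# Factor tables map an assignment of tuple-existence variables to a weight.
# `prior` is defined once and reused by two factors (a shared table).
prior = {(True,): 0.6, (False,): 0.4}
# Correlation between t1 and t2: t2 is much more likely to exist if t1 does.
cond = {(True, True): 0.9, (True, False): 0.1,
        (False, True): 0.2, (False, False): 0.8}

factors = [(("t1",), prior), (("t1", "t2"), cond), (("t3",), prior)]


def world_weight(world):
    """Product of all factor entries selected by one possible world."""
    weight = 1.0
    for variables, table in factors:
        weight *= table[tuple(world[v] for v in variables)]
    return weight


def query_probability(predicate):
    """Probability that `predicate` holds on the readings present in a world.

    Enumerates every possible world, which is exponential in the number of
    tuples and only workable for toy inputs.
    """
    names = sorted(rows)
    total = 0.0
    satisfied = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = world_weight(world)
        total += weight
        present = [rows[n] for n in names if world[n]]
        if predicate(present):
            satisfied += weight
    return satisfied / total


# Boolean query: does at least one tuple with reading "hot" exist?
p_hot = query_probability(lambda present: "hot" in present)
print(round(p_hot, 4))
```

In this sketch the answer depends on the correlation: treating `t1` and `t2` as independent with the same marginals would give a different probability for the same query, which is the kind of error a model restricted to independent tuples would introduce.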

Author information

Corresponding author

Correspondence to Amol Deshpande.

About this article

Cite this article

Sen, P., Deshpande, A. & Getoor, L. PrDB: managing and exploiting rich correlations in probabilistic databases. The VLDB Journal 18, 1065–1090 (2009). https://doi.org/10.1007/s00778-009-0153-2

  • DOI: https://doi.org/10.1007/s00778-009-0153-2
