Databases with uncertainty and lineage

  • Special Issue Paper
  • The VLDB Journal

Abstract

This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require the features in tandem. Fundamentally, lineage enables simple and consistent representation of uncertain data, it correlates uncertainty in query results with uncertainty in the input data, and query processing with lineage and uncertainty together presents computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We define two notions of ULDB minimality, data-minimal and lineage-minimal, and study minimization of ULDB representations under both notions. With lineage, derived relations are no longer self-contained: their uncertainty depends on uncertainty in the base data. We provide an algorithm for the new operation of extracting a database subset in the presence of interconnected uncertainty. We also show how ULDBs enable a new approach to query processing in probabilistic databases. Finally, we describe the current state of the Trio system, our implementation of ULDBs under development at Stanford.
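
The abstract's remark that derived relations are no longer self-contained can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the paper's formal model or the Trio API: the names `XTuple`, `possible_worlds`, and `derived_present` are hypothetical, the sample data is invented, and lineage is assumed to be a plain conjunction of base alternatives.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class XTuple:
    """Hypothetical uncertain tuple: a set of mutually exclusive alternatives."""
    name: str
    alternatives: tuple      # candidate values, at most one holds in any world
    maybe: bool = False      # True if the tuple may be absent altogether


def possible_worlds(xtuples):
    """Enumerate every choice of one alternative (or absence) per base tuple."""
    options = []
    for xt in xtuples:
        choices = [(xt.name, i) for i in range(len(xt.alternatives))]
        if xt.maybe:
            choices.append((xt.name, None))  # None marks "tuple absent"
        options.append(choices)
    return [dict(combo) for combo in product(*options)]


def derived_present(lineage, world):
    """Assumption: a derived alternative exists in a world exactly when
    every base alternative in its lineage was chosen in that world."""
    return all(world.get(name) == idx for name, idx in lineage)


# Invented base data: a witness report with two alternatives, and an
# uncertain vehicle registration that may be absent.
saw = XTuple("saw1", (("Amy", "Honda"), ("Amy", "Toyota")))
drives = XTuple("drives1", (("Jimmy", "Toyota"),), maybe=True)

# A derived tuple (Amy accuses Jimmy) produced by joining on the car.
# Its lineage points at the base alternatives it was derived from.
lineage = [("saw1", 1), ("drives1", 0)]

worlds = possible_worlds([saw, drives])
supporting = [w for w in worlds if derived_present(lineage, w)]
print(len(worlds), len(supporting))
```

In this sketch the derived tuple carries no uncertainty information of its own. Whether it appears in a given world is determined entirely by following its lineage back to the base data, which is the dependency the abstract describes.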

Author information

Corresponding author

Correspondence to Omar Benjelloun.

Additional information

This work was supported by the National Science Foundation under grants IIS-0324431, IIS-1098447, and IIS-9985114, by DARPA Contract #03-000225, and by a grant from the Boeing Corporation.

About this article

Cite this article

Benjelloun, O., Das Sarma, A., Halevy, A. et al. Databases with uncertainty and lineage. The VLDB Journal 17, 243–264 (2008). https://doi.org/10.1007/s00778-007-0080-z

  • DOI: https://doi.org/10.1007/s00778-007-0080-z
