Abstract
This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require the features in tandem. Fundamentally, lineage enables simple and consistent representation of uncertain data and correlates uncertainty in query results with uncertainty in the input data, and query processing with lineage and uncertainty together presents computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We define two notions of ULDB minimality, data-minimal and lineage-minimal, and study minimization of ULDB representations under both notions. With lineage, derived relations are no longer self-contained: their uncertainty depends on uncertainty in the base data. We provide an algorithm for the new operation of extracting a database subset in the presence of interconnected uncertainty. We also show how ULDBs enable a new approach to query processing in probabilistic databases. Finally, we describe the current state of the Trio system, our implementation of ULDBs under development at Stanford.
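To make concrete the idea that a derived relation's uncertainty depends on uncertainty in the base data, the following is a minimal Python sketch. It is an illustration only: the abstract gives no concrete syntax, so the data layout (mutually exclusive alternatives per tuple, `None` for a tuple that may be absent, lineage as a set of base choices) and all names such as `base`, `derived`, and `possible_worlds` are assumptions, not the paper's definitions or the Trio API.

```python
from itertools import product

# Hypothetical base data: each tuple id maps to mutually exclusive alternatives.
# None as an alternative models a tuple that may be absent altogether.
base = {
    "saw1": [("Amy", "Honda"), ("Amy", "Toyota")],
    "drives1": [("Jimmy", "Toyota"), None],
}

# Hypothetical derived tuple. Its lineage lists the (tuple id, alternative index)
# pairs that must all be chosen for the derived tuple to exist.
derived = {
    "accuses1": {
        "value": ("Amy", "Jimmy"),
        "lineage": {("saw1", 1), ("drives1", 0)},
    },
}


def possible_worlds(base, derived):
    """Enumerate every combination of base choices and the derived tuples it implies."""
    ids = sorted(base)
    worlds = []
    # Pick exactly one alternative index for every base tuple.
    for choice in product(*(range(len(base[i])) for i in ids)):
        chosen = set(zip(ids, choice))
        # Keep only base tuples whose chosen alternative is not "absent".
        base_tuples = {i: base[i][k] for i, k in chosen if base[i][k] is not None}
        # A derived tuple is present only if its whole lineage was chosen.
        derived_tuples = {
            name: spec["value"]
            for name, spec in derived.items()
            if spec["lineage"] <= chosen
        }
        worlds.append((base_tuples, derived_tuples))
    return worlds
```

Calling `possible_worlds(base, derived)` enumerates four combinations of base choices, and the derived tuple appears in exactly one of them. Under these assumptions the derived relation cannot be interpreted on its own, only together with the base data it points to.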
Additional information
This work was supported by the National Science Foundation under grants IIS-0324431, IIS-1098447, and IIS-9985114, by DARPA Contract #03-000225, and by a grant from the Boeing Corporation.
About this article
Cite this article
Benjelloun, O., Das Sarma, A., Halevy, A. et al. Databases with uncertainty and lineage. The VLDB Journal 17, 243–264 (2008). https://doi.org/10.1007/s00778-007-0080-z
DOI: https://doi.org/10.1007/s00778-007-0080-z