DOI: 10.1145/1951365.1951412
research-article

Synopses for probabilistic data over large domains

Published: 21 March 2011

Abstract

Many real-world applications produce data with uncertainties drawn from measurements over a continuous domain. Recent research on probabilistic databases has focused mainly on managing and querying discrete data whose domain is limited to a small number of values (i.e., on the order of 10). As the domain grows, current methods fail because they explicitly store each value/probability pair, so they cannot be extended to continuous-valued attributes. In this paper, we provide a scalable, accurate, space-efficient probabilistic data synopsis for uncertain attributes defined over a continuous domain. Our synopsis construction methods are all error-aware, ensuring that the synopsis accurately represents the underlying data within a limited space budget. Additionally, we provide approximate query results over the synopsis with error bounds.
We provide an extensive experimental evaluation to show that our proposed methods improve upon the current state of the art in terms of construction time and query accuracy. In particular, our synopsis can be constructed in $O(N^2)$ time (where $N$ is the number of tuples in the database). We also demonstrate the ability of our synopsis to answer a variety of interesting queries on a real data set and show that our query error is reduced by up to an order of magnitude over the previous state-of-the-art method.
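To make the idea of a space-bounded synopsis with approximate query answers concrete, here is a minimal illustrative sketch. It is not the paper's synopsis or construction algorithm: the normal distribution, the equal-width histogram, and all function names are assumptions chosen only for illustration. It summarizes one uncertain attribute's density with a fixed number of buckets, answers a range-probability query from the summary, and compares the estimate with the exact answer (a measured error, not a guaranteed bound).

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def build_histogram(mu, sigma, lo, hi, buckets):
    """Summarize the density on [lo, hi] as `buckets` equal-width probability masses."""
    width = (hi - lo) / buckets
    edges = [lo + i * width for i in range(buckets + 1)]
    masses = [
        normal_cdf(edges[i + 1], mu, sigma) - normal_cdf(edges[i], mu, sigma)
        for i in range(buckets)
    ]
    return edges, masses

def range_probability(edges, masses, a, b):
    """Estimate P(a <= X <= b), assuming mass is spread uniformly inside each bucket."""
    total = 0.0
    for i, mass in enumerate(masses):
        left, right = edges[i], edges[i + 1]
        overlap = max(0.0, min(b, right) - max(a, left))
        total += mass * overlap / (right - left)
    return total

# Hypothetical uncertain attribute: a sensor reading modeled as Normal(20, 2).
mu, sigma = 20.0, 2.0
edges, masses = build_histogram(mu, sigma, lo=10.0, hi=30.0, buckets=8)

estimate = range_probability(edges, masses, 19.0, 22.0)
exact = normal_cdf(22.0, mu, sigma) - normal_cdf(19.0, mu, sigma)
error = abs(estimate - exact)
print(f"estimate={estimate:.4f} exact={exact:.4f} abs_error={error:.4f}")
```

The synopsis here stores eight numbers regardless of how finely the domain is measured, whereas an explicit value/probability list grows with the number of distinct values. Raising `buckets` trades space for accuracy.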

Cited By

  • (2019) Online Computing Quantile Summaries Over Uncertain Data Streams. IEEE Access 7, 10916-10926. DOI: 10.1109/ACCESS.2019.2891550
  • (2013) A Histogram Method for Summarizing Multi-dimensional Probabilistic Data. Procedia Computer Science 19, 971-976. DOI: 10.1016/j.procs.2013.06.135

    Published In

    EDBT/ICDT '11: Proceedings of the 14th International Conference on Extending Database Technology
    March 2011
    587 pages
    ISBN:9781450305280
    DOI:10.1145/1951365

    Sponsors

    • Microsoft Research

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data synopsis
    2. probabilistic databases

    Qualifiers

    • Research-article

    Conference

    EDBT/ICDT '11: EDBT/ICDT '11 joint conference
    Sponsor: Microsoft Research
    March 21 - 24, 2011
    Uppsala, Sweden

    Acceptance Rates

    Overall Acceptance Rate 7 of 10 submissions, 70%
