DOI: 10.1145/1247480.1247513

Sketching probabilistic data streams

Published: 11 June 2007

Abstract

The management of uncertain, probabilistic data has recently emerged as a useful paradigm for dealing with the inherent unreliabilities of several real-world application domains, including data cleaning, information integration, and pervasive, multi-sensor computing. Unlike conventional data sets, a set of probabilistic tuples defines a probability distribution over an exponential number of possible worlds (i.e., "grounded", deterministic databases). This "possible-worlds" interpretation allows for clean query semantics but also raises hard computational problems for probabilistic database query processors. To further complicate matters, in many scenarios (e.g., large-scale process and environmental monitoring using multiple sensor modalities), probabilistic data tuples arrive and need to be processed in a streaming fashion; that is, using limited memory and CPU resources and without the benefit of multiple passes over a static probabilistic database. Such probabilistic data streams raise a host of new research challenges for stream-processing engines that, to date, remain largely unaddressed.
In this paper, we propose the first space- and time-efficient algorithms for approximating complex aggregate queries (including the number of distinct values and join/self-join sizes) over probabilistic data streams. Following the possible-worlds semantics, such aggregates essentially define probability distributions over the space of possible aggregation results, and our goal is to characterize such distributions through efficient approximations of their key moments (such as expectation and variance). Our algorithms offer strong randomized estimation guarantees while using only sublinear space in the size of the stream(s), and rely on novel, concise streaming sketch synopses that extend conventional sketching ideas to the probabilistic streams setting. Our experimental results verify the effectiveness of our approach.
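To make the possible-worlds semantics concrete, the following minimal Python sketch computes the expected number of distinct values over a tiny probabilistic stream in two ways. It is an illustration only, not the sketch synopses proposed in the paper. It assumes a simple model in which each tuple is a (value, probability) pair that exists independently of the others; the function names and the sample stream are hypothetical. The brute-force version enumerates every possible world, which is exponential in the stream length. The one-pass version uses the identity E[distinct] = sum over values v of (1 - product of (1 - p) over tuples carrying v), but it keeps one counter per distinct value, so its space is linear rather than sublinear.

```python
from collections import defaultdict
from itertools import product


def expected_distinct_bruteforce(stream):
    """Enumerate all 2^n possible worlds and average the distinct count."""
    total = 0.0
    for mask in product([False, True], repeat=len(stream)):
        world_prob = 1.0
        values = set()
        for present, (value, p) in zip(mask, stream):
            # A tuple is in this world with probability p, absent with 1 - p.
            world_prob *= p if present else 1.0 - p
            if present:
                values.add(value)
        total += world_prob * len(values)
    return total


def expected_distinct_one_pass(stream):
    """Single pass: track, per value, the probability that it never appears."""
    absent = defaultdict(lambda: 1.0)
    for value, p in stream:
        absent[value] *= 1.0 - p
    # A value contributes 1 to the distinct count unless all its tuples are absent.
    return sum(1.0 - q for q in absent.values())


# Hypothetical stream: value "a" arrives twice, value "b" once.
stream = [("a", 0.5), ("b", 0.5), ("a", 0.5)]
print(expected_distinct_bruteforce(stream))
print(expected_distinct_one_pass(stream))
```

Both functions agree on this input, which shows that the expectation can be obtained without materializing the possible worlds. Reducing the per-value state to sublinear space is the part that requires sketching techniques such as those the abstract describes.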


Published In

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN:9781595936868
DOI:10.1145/1247480
General Chairs: Lizhu Zhou, Tok Wang Ling
Program Chair: Beng Chin Ooi
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data streams
  2. uncertain data

Qualifiers

  • Article

Conference

SIGMOD/PODS07
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
