Article

Processing set expressions over continuous update streams

Authors:
Sumit Ganguly, Bell Laboratories, Lucent Technologies, Murray Hill, NJ
Minos Garofalakis, Bell Laboratories, Lucent Technologies, Murray Hill, NJ
Rajeev Rastogi, Bell Laboratories, Lucent Technologies, Murray Hill, NJ
SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003, Pages 265–276
https://doi.org/10.1145/872757.872790

Published: 09 June 2003

ABSTRACT

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., one comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions).

Estimating the cardinality of set expressions defined over several (possibly distributed) update streams is one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both router R₁ and R₂ but not router R₃?". Earlier work has addressed only very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.
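To make the query in the abstract concrete, the following is a minimal sketch of the exact answer that the paper's estimators approximate. It is not the 2-level hash sketch: it stores a counter for every item, so its memory grows with the number of distinct items, which is what a small-space synopsis is meant to avoid. The class `UpdateStream`, the function `both_r1_r2_not_r3`, the sample IP addresses, and the reading of "present in a stream" as "net count is positive" are all assumptions made for illustration.

```python
from collections import Counter


class UpdateStream:
    """Exact net-occurrence counts for one update stream (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def update(self, item, delta):
        """Apply an insertion (delta=+1) or a deletion (delta=-1) of `item`."""
        self.counts[item] += delta
        if self.counts[item] == 0:
            del self.counts[item]  # every insertion of this item has been cancelled

    def distinct(self):
        """Items whose net count is currently positive (assumed meaning of 'present')."""
        return {item for item, count in self.counts.items() if count > 0}


def both_r1_r2_not_r3(r1, r2, r3):
    """Exact |(R1 & R2) - R3| over the current contents of three streams."""
    return len((r1.distinct() & r2.distinct()) - r3.distinct())


# Illustrative update streams for three routers (made-up addresses).
r1, r2, r3 = UpdateStream(), UpdateStream(), UpdateStream()
for ip in ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"]:
    r1.update(ip, +1)
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.4"]:
    r2.update(ip, +1)
r3.update("10.0.0.2", +1)
r3.update("10.0.0.1", +1)
r3.update("10.0.0.1", -1)  # a deletion cancels the earlier insertion at R3

before = both_r1_r2_not_r3(r1, r2, r3)

# Deleting both occurrences of 10.0.0.1 at R1 removes it from the answer.
r1.update("10.0.0.1", -1)
r1.update("10.0.0.1", -1)
after = both_r1_r2_not_r3(r1, r2, r3)
```

The point of the sketch is the role of deletions: an item leaves the answer only once its net count in a stream reaches zero, so a method that handles insert-only streams cannot simply be reused.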

Index Terms

Processing set expressions over continuous update streams
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Database transaction processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published in
SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
June 2003
702 pages
ISBN: 158113634X
DOI: 10.1145/872757
Conference Chair: Zachary Ives, University of Pennsylvania
General Chair: Yannis Papakonstantinou, University of California, San Diego
Program Chair: Alon Halevy, University of Washington
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SIGMOD '03 Paper Acceptance Rate: 53 of 342 submissions, 15%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%