ABSTRACT
There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., one comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions). Estimating the cardinality of set expressions defined over several (possibly distributed) update streams is one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both routers R1 and R2 but not router R3?". Earlier work has addressed only very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.
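The abstract does not spell out the layout of the synopsis, so the following is only a minimal sketch of how a counter-based, two-level hashed summary can absorb deletions without rescanning the stream. Everything in it is an assumption made for illustration: the class name `TwoLevelSketch`, the parameters `levels` and `s`, the BLAKE2-based helper `_h`, and the singleton test. It is not the paper's estimator. The first level assigns an item to bucket `b` with probability about 2^-(b+1); the second level keeps `s` pairs of counters per bucket, indexed by one hashed bit each.

```python
import hashlib


def _h(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item` under `seed` (illustrative choice)."""
    digest = hashlib.blake2b(item.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


class TwoLevelSketch:
    """Hypothetical two-level counter sketch; all names are assumptions."""

    def __init__(self, levels: int = 32, s: int = 16, seed: int = 0):
        self.levels, self.s, self.seed = levels, s, seed
        # counts[b][j][bit]: net number of items in first-level bucket b
        # whose j-th second-level hash bit equals `bit`.
        self.counts = [[[0, 0] for _ in range(s)] for _ in range(levels)]

    def _bucket(self, item: str) -> int:
        v = _h(self.seed, item)
        # Index of the least-significant set bit, capped at the last bucket.
        b = (v & -v).bit_length() - 1 if v else self.levels - 1
        return min(b, self.levels - 1)

    def update(self, item: str, delta: int) -> None:
        """Apply an insertion (delta=+1) or a deletion (delta=-1)."""
        b = self._bucket(item)
        for j in range(self.s):
            bit = _h(self.seed + 1 + j, item) & 1
            self.counts[b][j][bit] += delta

    def is_empty(self, b: int) -> bool:
        # Assumes net item frequencies never go negative.
        c0, c1 = self.counts[b][0]
        return c0 + c1 == 0

    def is_singleton(self, b: int) -> bool:
        """True if bucket b looks like it holds one distinct item.

        A true singleton always passes; a bucket with two or more distinct
        items passes only if they agree on all s hashed bits.
        """
        if self.is_empty(b):
            return False
        return all((c0 == 0) != (c1 == 0) for c0, c1 in self.counts[b])


# Usage: a deletion cancels the matching insertion exactly.
r1 = TwoLevelSketch()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]:
    r1.update(ip, +1)
r1.update("10.0.0.2", -1)

net = TwoLevelSketch()
for ip in ["10.0.0.1", "10.0.0.3"]:
    net.update(ip, +1)
assert r1.counts == net.counts
```

Because every update only adds to or subtracts from counters, the state depends on the net contents of the stream rather than on its history. Sketches built with the same `seed` share hash functions, so the same item lands in the same bucket in each of them, which is what would allow buckets from different streams (e.g., R1, R2, R3) to be compared.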