DOI: 10.1145/872757.872790
Article

Processing set expressions over continuous update streams

Published: 09 June 2003

ABSTRACT

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., one comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions). Estimating the cardinality of set expressions defined over several (possibly distributed) update streams is one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both router R1 and R2 but not router R3?". Earlier work has addressed only very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.
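
To make the query semantics concrete, the following is a minimal sketch of an exact, linear-space baseline for the example query in the abstract (distinct sources seen at R1 and R2 but not R3). It is not the paper's 2-level hash sketch: it stores every item, which is precisely what a small-space synopsis is meant to avoid. The class `UpdateStream`, the function `example_cardinality`, and the sample addresses are hypothetical names chosen for illustration.

```python
from collections import Counter


class UpdateStream:
    """Exact net-frequency state for one update stream (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def update(self, item, delta):
        # delta is +1 for an insertion and -1 for a deletion.
        self.counts[item] += delta
        if self.counts[item] == 0:
            del self.counts[item]

    def distinct(self):
        # An item belongs to the stream's set while its net frequency is positive.
        return {x for x, c in self.counts.items() if c > 0}


def example_cardinality():
    r1, r2, r3 = UpdateStream(), UpdateStream(), UpdateStream()
    for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]:
        r1.update(ip, +1)
    for ip in ["10.0.0.2", "10.0.0.3", "10.0.0.4"]:
        r2.update(ip, +1)
    r3.update("10.0.0.3", +1)
    r3.update("10.0.0.9", +1)
    r1.update("10.0.0.1", +1)  # repeated insertion: the distinct count is unchanged
    r3.update("10.0.0.3", -1)  # deletion: 10.0.0.3 is no longer in R3
    # |(R1 intersect R2) minus R3|
    return len((r1.distinct() & r2.distinct()) - r3.distinct())


print(example_cardinality())
```

Because the deletion removes 10.0.0.3 from R3, that address counts toward the answer again. Handling this effect without revisiting past items is what makes general update streams harder than insert-only streams.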

        • Published in

          SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
          June 2003
          702 pages
          ISBN: 158113634X
          DOI: 10.1145/872757

          Copyright © 2003 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • Article

          Acceptance Rates

          SIGMOD '03 Paper Acceptance Rate: 53 of 342 submissions, 15%
          Overall Acceptance Rate: 785 of 4,003 submissions, 20%
