DOI: 10.1145/378580.378687
Article

Estimating simple functions on the union of data streams

Published: 03 July 2001

ABSTRACT

Massive data sets often arise as physically distributed, parallel data streams. We present algorithms for estimating simple functions on the union of such data streams, while using only logarithmic space per stream. Each processor observes only its own stream, and communicates with the other processors only after observing its entire stream. This models the set-up in current network monitoring products. Our algorithms employ a novel coordinated sampling technique to extract a sample of the union; this sample can be used to estimate aggregate functions on the union. The technique can also be used to estimate aggregate functions over the distinct “labels” in one or more data streams, e.g., to determine the zeroth frequency moment (i.e., the number of distinct labels) in one or more data streams. Our space and time bounds are the best known for these problems, and our logarithmic space bounds for coordinated sampling contrast with polynomial lower bounds for independent sampling. We relate our distributed streams model to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms.
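The coordinated sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' exact algorithm and does not reproduce their space or time bounds. The assumptions here, none of which come from the abstract, are: every processor shares one hash function (SHA-256 with a common seed), a label's "level" is the number of trailing zero bits of its hash, and each processor keeps only labels at or above its current level, raising the level whenever its sample outgrows a fixed capacity. The names `label_level`, `CoordinatedSampler`, and `estimate_distinct_union` are hypothetical.

```python
import hashlib


def label_level(label: str, seed: int = 0) -> int:
    """Trailing zero bits of a 64-bit hash; level >= k happens with probability 2**-k."""
    digest = hashlib.sha256(f"{seed}:{label}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return 64
    return (h & -h).bit_length() - 1


class CoordinatedSampler:
    """One per stream. All samplers must use the same seed (the coordination)."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.seed = seed
        self.level = 0
        self.sample = set()

    def observe(self, label: str) -> None:
        if label_level(label, self.seed) < self.level:
            return  # label is not selected at the current sampling level
        self.sample.add(label)
        # Too many labels kept: raise the level and drop labels below it.
        while len(self.sample) > self.capacity:
            self.level += 1
            self.sample = {
                x for x in self.sample
                if label_level(x, self.seed) >= self.level
            }


def estimate_distinct_union(samplers) -> int:
    """Combine per-stream samples after the streams end; estimate distinct labels."""
    top = max(s.level for s in samplers)
    seed = samplers[0].seed
    merged = set()
    for s in samplers:
        # Each sampler holds every label it saw with level >= its own level,
        # so filtering at the highest level yields a sample of the union.
        merged |= {x for x in s.sample if label_level(x, seed) >= top}
    return len(merged) * (2 ** top)


# Usage: two processors observe their own streams, then communicate once.
a = CoordinatedSampler(capacity=16)
b = CoordinatedSampler(capacity=16)
for label in ["a", "b", "c", "a"]:
    a.observe(label)
for label in ["b", "d"]:
    b.observe(label)
small_estimate = estimate_distinct_union([a, b])
```

Because every processor applies the same hash, a label that appears in several streams is either kept by all of them or dropped by all of them, so duplicates across streams collapse when the samples are combined. Independent per-stream sampling lacks this property, which is the contrast the abstract draws.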


Published in

SPAA '01: Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
July 2001
340 pages
ISBN: 1581134096
DOI: 10.1145/378580

Copyright © 2001 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SPAA '01 paper acceptance rate: 34 of 93 submissions, 37%
Overall acceptance rate: 447 of 1,461 submissions, 31%
