ABSTRACT
Massive data sets often arise as physically distributed, parallel data streams. We present algorithms for estimating simple functions on the union of such data streams, while using only logarithmic space per stream. Each processor observes only its own stream, and communicates with the other processors only after observing its entire stream. This models the set-up in current network monitoring products. Our algorithms employ a novel coordinated sampling technique to extract a sample of the union; this sample can be used to estimate aggregate functions on the union. The technique can also be used to estimate aggregate functions over the distinct “labels” in one or more data streams, e.g., to determine the zeroth frequency moment (i.e., the number of distinct labels) in one or more data streams. Our space and time bounds are the best known for these problems, and our logarithmic space bounds for coordinated sampling contrast with polynomial lower bounds for independent sampling. We relate our distributed streams model to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms.
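To illustrate how coordinated sampling can support an estimate of the number of distinct labels in a union of streams, here is a minimal Python sketch. It is written under stated assumptions and is not the algorithm as given in the paper: the names `level` and `CoordinatedSampler`, the `capacity` and `seed` parameters, and the use of SHA-256 as the shared hash function are all hypothetical choices made for illustration. Each processor keeps only labels whose hash "level" is at least its current threshold, and raises the threshold whenever its sample outgrows the space budget. Because all processors use the same hash function, their samples can be combined after the streams have been observed.

```python
import hashlib


def level(label, seed=0):
    """Number of trailing zero bits in a shared hash of the label.

    A label has level >= l with probability about 2**-l.
    SHA-256 is an assumption; any hash shared by all processors would do.
    """
    digest = hashlib.sha256(f"{seed}:{label}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return 64
    return (h & -h).bit_length() - 1


class CoordinatedSampler:
    """Hypothetical per-stream sampler; all streams must share `seed`."""

    def __init__(self, capacity=64, seed=0):
        self.capacity = capacity  # maximum number of labels kept
        self.seed = seed
        self.lvl = 0              # current sampling level
        self.sample = set()       # distinct labels with level >= lvl

    def observe(self, label):
        # Keep the label only if it passes the current level.
        if level(label, self.seed) >= self.lvl:
            self.sample.add(label)
            self._shrink()

    def _shrink(self):
        # Raise the level until the sample fits in the space budget.
        while len(self.sample) > self.capacity:
            self.lvl += 1
            self.sample = {x for x in self.sample
                           if level(x, self.seed) >= self.lvl}

    def merge(self, other):
        # Combine two samples taken with the same hash function.
        merged = CoordinatedSampler(self.capacity, self.seed)
        merged.lvl = max(self.lvl, other.lvl)
        merged.sample = {x for x in self.sample | other.sample
                         if level(x, self.seed) >= merged.lvl}
        merged._shrink()
        return merged

    def estimate_distinct(self):
        # Each sampled label stands for about 2**lvl distinct labels.
        return len(self.sample) * (2 ** self.lvl)


if __name__ == "__main__":
    a, b = CoordinatedSampler(), CoordinatedSampler()
    for x in range(0, 10000):
        a.observe(x)
    for x in range(5000, 15000):
        b.observe(x)
    print(a.merge(b).estimate_distinct())
```

In this sketch, merging two samplers yields the same sample that a single sampler would hold after observing both streams back to back, which is the sense in which the sampling is coordinated. Samples drawn with independent hash functions would not combine this way.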