Article

Estimating simple functions on the union of data streams

Authors:
Phillip B. Gibbons, Information Sciences Research Center, Bell Laboratories, Murray Hill, NJ
Srikanta Tirthapura, Computer Science Department, Brown University, Providence, RI

SPAA '01: Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures, July 2001, Pages 281–291
https://doi.org/10.1145/378580.378687

Published: 03 July 2001

ABSTRACT

Massive data sets often arise as physically distributed, parallel data streams. We present algorithms for estimating simple functions on the union of such data streams, while using only logarithmic space per stream. Each processor observes only its own stream, and communicates with the other processors only after observing its entire stream. This models the set-up in current network monitoring products. Our algorithms employ a novel coordinated sampling technique to extract a sample of the union; this sample can be used to estimate aggregate functions on the union. The technique can also be used to estimate aggregate functions over the distinct “labels” in one or more data streams, e.g., to determine the zeroth frequency moment (i.e., the number of distinct labels) in one or more data streams. Our space and time bounds are the best known for these problems, and our logarithmic space bounds for coordinated sampling contrast with polynomial lower bounds for independent sampling. We relate our distributed streams model to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms.
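The coordinated sampling idea can be illustrated with a minimal sketch. This is not the authors' exact algorithm, only one plausible reading of the abstract: every stream hashes labels with the same function, keeps a label only if its hash clears the stream's current sampling level, and raises the level whenever the sample outgrows a fixed capacity. Because the hash is shared, samples taken independently can be merged after the fact into a sample of the union. The names `hash_level`, `CoordinatedSample`, and `capacity` are hypothetical, and SHA-256 is an assumed stand-in for whatever hash family the paper analyzes.

```python
import hashlib


def hash_level(label: str) -> int:
    """Trailing zero bits of a shared 64-bit hash (assumption: SHA-256 prefix)."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return 64
    # h & -h isolates the lowest set bit; its position is the trailing-zero count.
    return (h & -h).bit_length() - 1


class CoordinatedSample:
    """Hypothetical bounded sample of distinct labels, coordinated via a shared hash."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.level = 0      # a label is kept only if hash_level(label) >= level
        self.labels = {}    # label -> its hash level

    def _shrink(self) -> None:
        # Raise the level until the sample fits; each step keeps roughly half.
        while len(self.labels) > self.capacity:
            self.level += 1
            self.labels = {x: l for x, l in self.labels.items() if l >= self.level}

    def add(self, label: str) -> None:
        l = hash_level(label)
        if l >= self.level:
            self.labels[label] = l
            self._shrink()

    def merge(self, other: "CoordinatedSample") -> "CoordinatedSample":
        # Align both samples to the higher level, then take their union.
        merged = CoordinatedSample(min(self.capacity, other.capacity))
        merged.level = max(self.level, other.level)
        for sample in (self, other):
            for x, l in sample.labels.items():
                if l >= merged.level:
                    merged.labels[x] = l
        merged._shrink()
        return merged

    def estimate_distinct(self) -> int:
        # Each surviving label stands for about 2**level distinct labels.
        return len(self.labels) * (2 ** self.level)


if __name__ == "__main__":
    a, b = CoordinatedSample(16), CoordinatedSample(16)
    for label in ["x", "y", "x"]:
        a.add(label)
    for label in ["y", "z"]:
        b.add(label)
    # Fewer distinct labels than the capacity, so the level stays 0 and the count is exact.
    print(a.merge(b).estimate_distinct())
```

In this sketch the merged sample is the same one a single observer of the concatenated streams would have built, which is the property that independent per-stream sampling lacks. The estimate is exact while the sample has not overflowed and approximate otherwise; no accuracy guarantee is claimed for this simplified version.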

Published in
SPAA '01: Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
July 2001
340 pages
ISBN:1581134096
DOI:10.1145/378580
Chairman:
Arnold Rosenberg
Univ. of Massachusetts
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- Article
Conference

Acceptance Rates
SPAA '01 Paper Acceptance Rate: 34 of 93 submissions, 37%
Overall Acceptance Rate: 447 of 1,461 submissions, 31%
Article Metrics
- Total Citations: 144
- Total Downloads: 672
- Downloads (Last 12 months): 16
- Downloads (Last 6 weeks): 0