DOI: 10.1145/2254756.2254798

Don't let the negatives bring you down: sampling from streams of signed updates

Published: 11 June 2012

Abstract

Random sampling has been proven time and time again to be a powerful tool for working with large data. Queries over the full dataset are replaced by approximate queries over the smaller (and hence easier to store and manipulate) sample. The sample constitutes a flexible summary that supports a wide class of queries. But in many applications, datasets are modified with time, and it is desirable to update samples without requiring access to the full underlying datasets. In this paper, we introduce and analyze novel techniques for sampling over dynamic data, modeled as a stream of modifications to weights associated with each key.
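As a concrete illustration of answering a query from a sample instead of the full dataset, the following is a minimal sketch of weight-proportional Poisson sampling with Horvitz-Thompson adjusted weights. It is a standard textbook technique over static data and is not the scheme introduced in the paper; the names `poisson_sample`, `estimate_subset_sum`, and the threshold parameter `tau` are assumptions chosen for this example.

```python
import random


def poisson_sample(weights, tau, rng):
    """Include each key independently with probability min(1, w / tau).

    Returns a dict mapping each sampled key to its adjusted weight w / p,
    so that the expected adjusted weight of every key equals its true weight.
    `tau` is a hypothetical threshold parameter: larger tau, smaller sample.
    """
    sample = {}
    for key, w in weights.items():
        if w <= 0:
            continue  # keys with no weight are never sampled
        p = min(1.0, w / tau)
        if rng.random() < p:
            sample[key] = w / p  # Horvitz-Thompson adjusted weight
    return sample


def estimate_subset_sum(sample, predicate):
    """Estimate the total weight of keys matching `predicate` from the sample."""
    return sum(adj for key, adj in sample.items() if predicate(key))


if __name__ == "__main__":
    rng = random.Random(42)
    data = {f"key{i}": float(i % 7 + 1) for i in range(1000)}
    sample = poisson_sample(data, tau=50.0, rng=rng)
    exact = sum(w for k, w in data.items() if k.endswith("3"))
    approx = estimate_subset_sum(sample, lambda k: k.endswith("3"))
    print(len(sample), exact, approx)
```

The predicate is applied only after sampling, which is what makes the sample a flexible summary: the same sample can serve many subset-sum queries that were not known when it was drawn.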
While sampling schemes designed for stream applications can often readily accommodate positive updates to the dataset, much less is known for the case of negative updates, where weights are reduced or items deleted altogether. We primarily consider the turnstile model of streams, and extend classic schemes to incorporate negative updates. Perhaps surprisingly, the modifications to handle negative updates turn out to be natural and seamless extensions of the well-known positive update-only algorithms. We show that they produce unbiased estimators, and we relate their performance to the behavior of corresponding algorithms on insert-only streams with different parameters. A careful analysis is necessitated, in order to account for the fact that sampling choices for one key now depend on the choices made for other keys.
In practice, our solutions turn out to be efficient and accurate. Compared to recent algorithms for Lp sampling which can be applied to this problem, they are significantly more reliable, and dramatically faster.
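To make the data model in the abstract concrete, the sketch below applies a stream of signed updates to per-key weights and returns the exact aggregated dataset that a sample is meant to summarize. It is only a baseline for illustration, not the paper's sampling algorithm: it stores every key, which is what a streaming sampler is designed to avoid. The function name `apply_updates` is hypothetical, and rejecting updates that would drive a weight below zero is an assumption of this example (a strict turnstile reading); the paper may treat such updates differently.

```python
def apply_updates(updates):
    """Aggregate a stream of (key, delta) pairs into the current weights.

    Positive deltas insert or increase weight, negative deltas reduce it,
    and a key whose weight reaches zero is deleted altogether.
    Assumption for this sketch: a weight may never become negative.
    """
    weights = {}
    for key, delta in updates:
        new_weight = weights.get(key, 0) + delta
        if new_weight < 0:
            raise ValueError(f"update would make weight of {key!r} negative")
        if new_weight == 0:
            weights.pop(key, None)  # the key has been fully deleted
        else:
            weights[key] = new_weight
    return weights


if __name__ == "__main__":
    stream = [("x", 3), ("y", 2), ("x", -1), ("y", -2), ("z", 4)]
    print(apply_updates(stream))
```

A sample maintained over such a stream has to stay consistent with these aggregated weights without ever materializing them, which is where negative updates make the problem harder than the insert-only case.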

Published In

SIGMETRICS '12: Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
June 2012
450 pages
ISBN: 9781450310970
DOI: 10.1145/2254756
  • ACM SIGMETRICS Performance Evaluation Review, Volume 40, Issue 1
    June 2012
    433 pages
    ISSN: 0163-5999
    DOI: 10.1145/2318857

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data streams
  2. deletions
  3. sampling
  4. updates

Qualifiers

  • Research-article

Conference

SIGMETRICS '12

Acceptance Rates

Overall Acceptance Rate 459 of 2,691 submissions, 17%
