Sampling from Dense Streams without Penalty

Improved Bounds for Frequency Moments and Heavy Hitters

  • Conference paper
Computing and Combinatorics (COCOON 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8591)

Abstract

We investigate the ability to sample relatively small amounts of data from a stream and approximately calculate statistics on the original stream. McGregor et al. [29] provide worst-case theoretical bounds showing space costs for sampling that are inversely correlated with the sampling rate. While the lower bound of McGregor et al. cannot be improved in the general case, we show that it is possible to improve the space bound for a stream \(D\) of domain \(n\) when the average positive frequency \(\mu = F_1/F_0\) is sufficiently large. We consider the following range of parameters: \(\mu \ge \log(n)\) and sample rate \(p \ge C_k \mu^{-1} \log(n)\), where \(C_k\) is a constant. On these streams we improve the bound from \(\tilde{O} ({1 \over p} n^{1-2/k})\) to \(\tilde{O} (n^{1-2/k})\), thus giving a polynomial improvement in space for sufficiently large \(\mu\) and \(p^{-1}\).
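To make the quantities in the abstract concrete, here is a minimal Python sketch assuming the standard definition of the frequency moments, \(F_k = \sum_i f_i^k\) with \(f_i\) the number of occurrences of item \(i\), so that \(F_0\) is the number of distinct items and \(F_1\) is the stream length. The function names, the toy stream, and the sampling rate are hypothetical choices made for illustration. The code computes moments exactly and applies plain Bernoulli sampling; it is not the small-space algorithm of the paper.

```python
import random
from collections import Counter


def frequency_moment(stream, k):
    """Exact F_k: sum of (frequency ** k) over distinct items; F_0 counts distinct items."""
    counts = Counter(stream)
    if k == 0:
        return len(counts)
    return sum(f ** k for f in counts.values())


def average_positive_frequency(stream):
    """mu = F_1 / F_0: mean frequency over items that occur at least once."""
    return frequency_moment(stream, 1) / frequency_moment(stream, 0)


def bernoulli_sample(stream, p, seed=0):
    """Keep each stream element independently with probability p (illustrative sampler)."""
    rng = random.Random(seed)
    return [x for x in stream if rng.random() < p]


# Hypothetical dense toy stream: 50 distinct items, each appearing 200 times.
stream = [i for i in range(50) for _ in range(200)]
mu = average_positive_frequency(stream)

# Sample at rate p and rescale: E[len(sample)] = p * F_1, so len(sample) / p estimates F_1.
p = 0.1
sample = bernoulli_sample(stream, p)
f1_estimate = len(sample) / p
```

In this toy stream every item is frequent, so a 10% sample still sees each item many times. That is the intuition behind the dense regime the abstract describes, where \(\mu\) is large relative to \(\log(n)\) and \(p^{-1}\).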

References

  1. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)

  2. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA, pp. 633–634 (2002)

  3. Bar-Yossef, Z.: The complexity of massive data set computations. PhD thesis, Berkeley, CA, USA, AAI3183783 (2002)

  4. Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D.: An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci. 68(4), 702–732 (2004)

  5. Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: Rolim, J.D.P., Vadhan, S.P. (eds.) RANDOM 2002. LNCS, vol. 2483, pp. 1–10. Springer, Heidelberg (2002)

  6. Bhattacharyya, S., Madeira, A., Muthukrishnan, S., Ye, T.: How to scalably and accurately skip past streams. In: Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop, ICDEW 2007, pp. 654–663. IEEE Computer Society, Washington, DC (2007)

  7. Braverman, V., Katzman, J., Seidell, C., Vorsanger, G.: Approximating large frequency moments with \(o(n^{1-2/k})\) bits. CoRR, abs/1401.1763 (2014)

  8. Braverman, V., Ostrovsky, R.: Smooth histograms for sliding windows. In: Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2007, pp. 283–293. IEEE Computer Society, Washington, DC (2007)

  9. Braverman, V., Ostrovsky, R.: Zero-one frequency laws. In: Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, pp. 281–290. ACM, New York (2010)

  10. Braverman, V., Ostrovsky, R.: Approximating large frequency moments with pick-and-drop sampling. In: Raghavendra, P., Raskhodnikova, S., Jansen, K., Rolim, J.D.P. (eds.) APPROX/RANDOM 2013. LNCS, vol. 8096, pp. 42–57. Springer, Heidelberg (2013)

  11. Braverman, V., Ostrovsky, R.: Generalizing the layering method of Indyk and Woodruff: Recursive sketches for frequency-based vectors on streams. In: Raghavendra, P., Raskhodnikova, S., Jansen, K., Rolim, J.D.P. (eds.) APPROX/RANDOM 2013. LNCS, vol. 8096, pp. 58–70. Springer, Heidelberg (2013)

  12. Braverman, V., Ostrovsky, R., Vilenchik, D.: How hard is counting triangles in the streaming model? In: Fomin, F.V., Freivalds, R., Kwiatkowska, M., Peleg, D. (eds.) ICALP 2013, Part I. LNCS, vol. 7965, pp. 244–254. Springer, Heidelberg (2013)

  13. Braverman, V., Ostrovsky, R., Vorsanger, G.: Weighted sampling without replacement from data streams (2013) (submitted)

  14. Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. In: PODS, pp. 147–156 (2009)

  15. Chakrabarti, A., Khot, S., Sun, X.: Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In: IEEE Conference on Computational Complexity, pp. 107–117 (2003)

  16. Chaudhuri, S., Motwani, R., Narasayya, V.: On random sampling over joins. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD 1999, pp. 263–274. ACM, New York (1999)

  17. Coppersmith, D., Kumar, R.: An improved data stream algorithm for frequency moments. In: SODA, pp. 151–156 (2004)

  18. Cormode, G., Datar, M., Indyk, P., Muthukrishnan, S.: Comparing data streams using hamming norms (how to zero in). IEEE Trans. on Knowl. and Data Eng. 15(3), 529–540 (2003)

  19. Feigenbaum, J., Kannan, S., Strauss, M., Viswanathan, M.: An approximate l1-difference algorithm for massive data streams. In: FOCS 1999: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, p. 501. IEEE Computer Society, Washington, DC (1999)

  20. Ganguly, S.: Estimating frequency moments of data streams using random linear combinations. In: Jansen, K., Khanna, S., Rolim, J.D.P., Ron, D. (eds.) APPROX and RANDOM 2004. LNCS, vol. 3122, pp. 369–380. Springer, Heidelberg (2004)

  21. Ganguly, S., Cormode, G.: On estimating frequency moments of data streams. In: Charikar, M., Jansen, K., Reingold, O., Rolim, J.D.P. (eds.) APPROX and RANDOM. LNCS, vol. 4627, pp. 479–493. Springer, Heidelberg (2007)

  22. Indyk, P., Woodruff, D.: Optimal approximations of the frequency moments of data streams. In: STOC 2005: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pp. 202–208. ACM, New York (2005)

  23. Johnson, N.L., Kemp, A.W., Kotz, S.: Univariate discrete distributions. Wiley-Interscience (2005)

  24. Kane, D.M., Nelson, J., Woodruff, D.P.: On the exact space complexity of sketching and streaming small norms. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010 (2010)

  25. Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: PODS 2010: Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems of Data, pp. 41–52. ACM, New York (2010)

  26. Knuth, D.E.: The art of computer programming, fundamental algorithms, 3rd edn., vol. 1. Addison Wesley Longman Publishing Co., Inc., Redwood City (1997)

  27. Li, P.: Compressed counting. In: SODA 2009: Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 412–421. Society for Industrial and Applied Mathematics, Philadelphia (2009)

  28. McGregor, A.: Open problems in data streams and related topics. In: IITK Workshop on Algorithms for Data Streams (2006), http://www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf (2007)

  29. McGregor, A., Pavan, A., Tirthapura, S., Woodruff, D.: Space-efficient estimation of statistics over sub-sampled streams. In: Proceedings of the 31st Symposium on Principles of Database Systems, PODS 2012, pp. 273–282. ACM, New York (2012)

  30. Rusu, F., Dobra, A.: Sketching sampled data streams. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE 2009, pp. 381–392. IEEE Computer Society, Washington, DC (2009)

  31. Vazirani, V.V.: Approximation algorithms. Springer-Verlag New York, Inc., New York (2001)

  32. Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11(1), 37–57 (1985)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Braverman, V., Vorsanger, G. (2014). Sampling from Dense Streams without Penalty. In: Cai, Z., Zelikovsky, A., Bourgeois, A. (eds) Computing and Combinatorics. COCOON 2014. Lecture Notes in Computer Science, vol 8591. Springer, Cham. https://doi.org/10.1007/978-3-319-08783-2_2

  • DOI: https://doi.org/10.1007/978-3-319-08783-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08782-5

  • Online ISBN: 978-3-319-08783-2

  • eBook Packages: Computer Science (R0)
