Hierarchical Sampling from Sketches: Estimating Functions over Data Streams

Algorithmica

Abstract

We present a randomized procedure named Hierarchical Sampling from Sketches (Hss) for estimating a class of functions of the form \(\varPsi(\mathcal {S})=\sum_{i=1}^{n}\psi(\vert {f_{i}}\vert )\) over the frequency vector f of update streams. We illustrate this by applying the Hss technique to design nearly space-optimal algorithms for estimating the pth moment of the frequency vector, for real p≥2, and for estimating the entropy of a data stream.
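To make the target quantity concrete, the following minimal sketch computes \(\varPsi(\mathcal {S})\) exactly by storing the whole frequency vector. It is not the Hss procedure, which this abstract does not specify; it is only a linear-space baseline for the values that a streaming estimator would approximate. The function names, the (item, delta) update format, and the use of the usual empirical-entropy definition (with F1 the sum of the |f_i|) are assumptions made for illustration.

```python
import math
from collections import defaultdict


def frequency_vector(updates):
    """Accumulate (item, delta) updates into a frequency vector f.

    The (item, delta) format is an assumed encoding of an update stream;
    delta may be negative (deletions).
    """
    f = defaultdict(int)
    for item, delta in updates:
        f[item] += delta
    return f


def psi_sum(f, psi):
    """Exact Psi(S) = sum_i psi(|f_i|), taken over items with f_i != 0."""
    return sum(psi(abs(v)) for v in f.values() if v != 0)


def frequency_moment(f, p):
    """p-th frequency moment: psi(x) = x**p."""
    return psi_sum(f, lambda x: x ** p)


def entropy(f):
    """Empirical entropy in bits (assumed definition):
    psi(x) = (x / F1) * log2(F1 / x), with F1 = sum_i |f_i|.
    """
    f1 = sum(abs(v) for v in f.values())
    if f1 == 0:
        return 0.0
    return psi_sum(f, lambda x: (x / f1) * math.log2(f1 / x))


# Hypothetical update stream: item "b" is inserted and then deleted.
updates = [("a", 3), ("b", 1), ("a", -1), ("c", 2), ("b", -1)]
f = frequency_vector(updates)
print(frequency_moment(f, 2))  # second moment of f = {a: 2, b: 0, c: 2}
print(entropy(f))
```

This baseline uses space proportional to the number of distinct items, which is the cost that sketch-based estimators aim to avoid.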

Author information

Corresponding author

Correspondence to Sumit Ganguly.

Additional information

Preliminary versions of this paper appeared as the following conference publications: “Simpler algorithm for estimating frequency moments of data streams,” Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh and Chandan Saha, Proceedings of the ACM Symposium on Discrete Algorithms, 2006, pp. 708–713; and “Estimating entropy over data streams,” Lakshminath Bhuvanagiri and Sumit Ganguly, Proceedings of the European Symposium on Algorithms, LNCS, vol. 4168, pp. 148–159, Springer, 2006.

About this article

Cite this article

Ganguly, S., Bhuvanagiri, L. Hierarchical Sampling from Sketches: Estimating Functions over Data Streams. Algorithmica 53, 549–582 (2009). https://doi.org/10.1007/s00453-008-9260-5

  • DOI: https://doi.org/10.1007/s00453-008-9260-5
