Abstract
We present a randomized procedure, Hierarchical Sampling from Sketches (Hss), for estimating a class of functions of the form \(\varPsi(\mathcal{S})=\sum_{i=1}^{n}\psi(\vert f_{i}\vert)\) over the frequency vector f of an update stream \(\mathcal{S}\). We illustrate the technique by using it to design nearly space-optimal algorithms for estimating the pth moment of the frequency vector, for real p ≥ 2, and for estimating the entropy of a data stream.
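To make the target quantity concrete, the sketch below computes \(\sum_{i}\psi(\vert f_{i}\vert)\) exactly by storing the whole frequency vector. It is not the Hss procedure, only a small reference computation of the values that a streaming algorithm would approximate in far less space. The helper names, the toy stream, the base-2 logarithm, the assumption that ψ(0) = 0, and the reduction of entropy to ψ(x) = x log x through the identity H = log F₁ − (1/F₁) Σ |fᵢ| log |fᵢ| are illustrative choices, not taken from the paper.

```python
import math
from collections import defaultdict


def frequency_vector(stream):
    """Accumulate an update stream of (item, delta) pairs into the vector f."""
    f = defaultdict(int)
    for item, delta in stream:
        f[item] += delta
    return f


def psi_sum(f, psi):
    """Exact sum of psi(|f_i|) over nonzero entries (assumes psi(0) = 0)."""
    return sum(psi(abs(v)) for v in f.values() if v != 0)


# Hypothetical toy stream with insertions and one deletion.
stream = [("a", 3), ("b", 2), ("a", -1), ("c", 4), ("b", 2)]
f = frequency_vector(stream)  # a: 2, b: 4, c: 4

f1 = psi_sum(f, lambda x: x)       # first moment, psi(x) = x
f2 = psi_sum(f, lambda x: x ** 2)  # second moment, psi(x) = x^2
f3 = psi_sum(f, lambda x: x ** 3)  # third moment, psi(x) = x^3

# Entropy via psi(x) = x * log2(x), using H = log2(F1) - (1/F1) * sum psi(|f_i|).
xlogx = psi_sum(f, lambda x: x * math.log2(x))
entropy = math.log2(f1) - xlogx / f1

print(f1, f2, f3, entropy)
```

An exact computation like this needs space proportional to the number of distinct items, which is the cost that sketch-based estimators are designed to avoid.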
Additional information
Preliminary versions of this paper appeared as the following conference publications: “Simpler algorithm for estimating frequency moments of data streams,” by Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh and Chandan Saha, in Proceedings of the ACM Symposium on Discrete Algorithms, 2006, pp. 708–713; and “Estimating entropy over data streams,” by Lakshminath Bhuvanagiri and Sumit Ganguly, in Proceedings of the European Symposium on Algorithms, LNCS, vol. 4168, pp. 148–159, Springer, 2006.
About this article
Cite this article
Ganguly, S., Bhuvanagiri, L. Hierarchical Sampling from Sketches: Estimating Functions over Data Streams. Algorithmica 53, 549–582 (2009). https://doi.org/10.1007/s00453-008-9260-5