Abstract
We investigate the ability to sample relatively small amounts of data from a stream and approximately calculate statistics on the original stream. McGregor et al. [29] provide worst-case theoretical bounds showing space costs for sampling that grow inversely with the sampling rate. While the lower bound of McGregor et al. cannot be improved in the general case, we show that it is possible to improve the space bound for a stream D over a domain of size n when the average positive frequency \(\mu = F_1/F_0\) is sufficiently large. We consider the following range of parameters: \(\mu \ge \log(n)\) and sampling rate \(p \ge C_k \mu^{-1} \log(n)\), where \(C_k\) is a constant. On these streams we improve the bound from \(\tilde{O} ({1 \over p} n^{1-2/k})\) to \(\tilde{O} (n^{1-2/k})\), thus giving a polynomial improvement in space for sufficiently large \(\mu\) and \(p^{-1}\).
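The quantities named in the abstract can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' algorithm: it computes the frequency moments \(F_0\), \(F_1\) and \(F_k\) exactly, derives the average positive frequency \(\mu = F_1/F_0\), and draws a Bernoulli sample at rate p. The function names and the toy stream are hypothetical, and the \(F_1\) estimate uses plain rescaling by 1/p rather than any space-efficient sketch.

```python
import random
from collections import Counter


def frequency_moment(stream, k):
    """F_k = sum over distinct items of frequency**k; F_0 counts distinct items."""
    counts = Counter(stream)
    if k == 0:
        return len(counts)
    return sum(c ** k for c in counts.values())


def average_positive_frequency(stream):
    """mu = F_1 / F_0: mean frequency over items that occur at least once.

    Assumes a non-empty stream (F_0 > 0).
    """
    return frequency_moment(stream, 1) / frequency_moment(stream, 0)


def bernoulli_sample(stream, p, rng):
    """Keep each stream element independently with probability p."""
    return [x for x in stream if rng.random() < p]


# Hypothetical dense stream: 50 distinct items, each appearing 200 times.
stream = [i for i in range(50) for _ in range(200)]
rng = random.Random(0)
rng.shuffle(stream)

mu = average_positive_frequency(stream)  # F_1 / F_0 = 10000 / 50
p = 0.1
sample = bernoulli_sample(stream, p, rng)

# Rescaling the sample length by 1/p gives an unbiased estimate of F_1.
f1_estimate = len(sample) / p
```

In this toy stream \(\mu\) is large relative to \(\log(n)\), which is the dense regime the abstract describes. Estimating \(F_k\) for larger k from the sample needs more care than this rescaling.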
References
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)
Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA, pp. 633–634 (2002)
Bar-Yossef, Z.: The complexity of massive data set computations. PhD thesis, Berkeley, CA, USA, AAI3183783 (2002)
Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D.: An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci. 68(4), 702–732 (2004)
Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: Rolim, J.D.P., Vadhan, S.P. (eds.) RANDOM 2002. LNCS, vol. 2483, pp. 1–10. Springer, Heidelberg (2002)
Bhattacharyya, S., Madeira, A., Muthukrishnan, S., Ye, T.: How to scalably and accurately skip past streams. In: Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop, ICDEW 2007, pp. 654–663. IEEE Computer Society, Washington, DC (2007)
Braverman, V., Katzman, J., Seidell, C., Vorsanger, G.: Approximating large frequency moments with o(n^{1-2/k}) bits. CoRR, abs/1401.1763 (2014)
Braverman, V., Ostrovsky, R.: Smooth histograms for sliding windows. In: Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2007, pp. 283–293. IEEE Computer Society, Washington, DC (2007)
Braverman, V., Ostrovsky, R.: Zero-one frequency laws. In: Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, pp. 281–290. ACM, New York (2010)
Braverman, V., Ostrovsky, R.: Approximating large frequency moments with pick-and-drop sampling. In: Raghavendra, P., Raskhodnikova, S., Jansen, K., Rolim, J.D.P. (eds.) APPROX/RANDOM 2013. LNCS, vol. 8096, pp. 42–57. Springer, Heidelberg (2013)
Braverman, V., Ostrovsky, R.: Generalizing the layering method of Indyk and Woodruff: Recursive sketches for frequency-based vectors on streams. In: Raghavendra, P., Raskhodnikova, S., Jansen, K., Rolim, J.D.P. (eds.) APPROX/RANDOM 2013. LNCS, vol. 8096, pp. 58–70. Springer, Heidelberg (2013)
Braverman, V., Ostrovsky, R., Vilenchik, D.: How hard is counting triangles in the streaming model? In: Fomin, F.V., Freivalds, R., Kwiatkowska, M., Peleg, D. (eds.) ICALP 2013, Part I. LNCS, vol. 7965, pp. 244–254. Springer, Heidelberg (2013)
Braverman, V., Ostrovsky, R., Vorsanger, G.: Weighted sampling without replacement from data streams (2013) (submitted)
Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. In: PODS, pp. 147–156 (2009)
Chakrabarti, A., Khot, S., Sun, X.: Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In: IEEE Conference on Computational Complexity, pp. 107–117 (2003)
Chaudhuri, S., Motwani, R., Narasayya, V.: On random sampling over joins. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD 1999, pp. 263–274. ACM, New York (1999)
Coppersmith, D., Kumar, R.: An improved data stream algorithm for frequency moments. In: SODA, pp. 151–156 (2004)
Cormode, G., Datar, M., Indyk, P., Muthukrishnan, S.: Comparing data streams using hamming norms (how to zero in). IEEE Trans. on Knowl. and Data Eng. 15(3), 529–540 (2003)
Feigenbaum, J., Kannan, S., Strauss, M., Viswanathan, M.: An approximate l1-difference algorithm for massive data streams. In: FOCS 1999: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, p. 501. IEEE Computer Society, Washington, DC (1999)
Ganguly, S.: Estimating frequency moments of data streams using random linear combinations. In: Jansen, K., Khanna, S., Rolim, J.D.P., Ron, D. (eds.) APPROX and RANDOM 2004. LNCS, vol. 3122, pp. 369–380. Springer, Heidelberg (2004)
Ganguly, S., Cormode, G.: On estimating frequency moments of data streams. In: Charikar, M., Jansen, K., Reingold, O., Rolim, J.D.P. (eds.) APPROX and RANDOM. LNCS, vol. 4627, pp. 479–493. Springer, Heidelberg (2007)
Indyk, P., Woodruff, D.: Optimal approximations of the frequency moments of data streams. In: STOC 2005: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pp. 202–208. ACM, New York (2005)
Johnson, N.L., Kemp, A.W., Kotz, S.: Univariate discrete distributions. Wiley-Interscience (2005)
Kane, D.M., Nelson, J., Woodruff, D.P.: On the exact space complexity of sketching and streaming small norms. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010 (2010)
Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: PODS 2010: Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems of Data, pp. 41–52. ACM, New York (2010)
Knuth, D.E.: The art of computer programming, fundamental algorithms, 3rd edn., vol. 1. Addison Wesley Longman Publishing Co., Inc., Redwood City (1997)
Li, P.: Compressed counting. In: SODA 2009: Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 412–421. Society for Industrial and Applied Mathematics, Philadelphia (2009)
McGregor, A.: Open problems in data streams and related topics. In: IITK Workshop on Algorithms for Data Streams (2006), http://www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf (2007)
McGregor, A., Pavan, A., Tirthapura, S., Woodruff, D.: Space-efficient estimation of statistics over sub-sampled streams. In: Proceedings of the 31st Symposium on Principles of Database Systems, PODS 2012, pp. 273–282. ACM, New York (2012)
Rusu, F., Dobra, A.: Sketching sampled data streams. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE 2009, pp. 381–392. IEEE Computer Society, Washington, DC (2009)
Vazirani, V.V.: Approximation algorithms. Springer-Verlag New York, Inc., New York (2001)
Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11(1), 37–57 (1985)
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Braverman, V., Vorsanger, G. (2014). Sampling from Dense Streams without Penalty. In: Cai, Z., Zelikovsky, A., Bourgeois, A. (eds) Computing and Combinatorics. COCOON 2014. Lecture Notes in Computer Science, vol 8591. Springer, Cham. https://doi.org/10.1007/978-3-319-08783-2_2
DOI: https://doi.org/10.1007/978-3-319-08783-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08782-5
Online ISBN: 978-3-319-08783-2
eBook Packages: Computer Science (R0)