A unifying framework for ℓ₀-sampling algorithms

Cormode, Graham; Firmani, Donatella

doi:10.1007/s10619-013-7131-9

Published: 25 July 2013

Volume 32, pages 315–335, (2014)
Distributed and Parallel Databases

Abstract

The problem of building an ℓ₀-sampler is to sample near-uniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry and graph algorithms. In this paper, we abstract a set of steps for building an ℓ₀-sampler, based on sampling, recovery and selection. We analyze the implementation of an ℓ₀-sampler within this framework, and show how prior constructions of ℓ₀-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of ℓ₀-samplers.
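
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a pairwise-independent hash for the sampling step and exact 1-sparse recovery at each level, where the constructions discussed in the paper may use stronger hashing and s-sparse recovery. The names `OneSparse`, `L0Sampler` and the modulus `P` are hypothetical and chosen for this example only.

```python
import random

P = (1 << 61) - 1  # Mersenne prime: modulus for the hash and the fingerprint (assumed choice)


class OneSparse:
    """Recovery step: three linear counters that identify a 1-sparse vector."""

    def __init__(self, z):
        self.z, self.w, self.iw, self.fp = z, 0, 0, 0

    def update(self, i, delta):
        self.w += delta                      # sum of counts
        self.iw += i * delta                 # index-weighted sum of counts
        self.fp = (self.fp + delta * pow(self.z, i, P)) % P  # fingerprint

    def recover(self):
        """Return (index, count) if exactly one index is nonzero, else None."""
        if self.w == 0 or self.iw % self.w != 0:
            return None
        i = self.iw // self.w
        # The fingerprint test rejects vectors that only look 1-sparse
        # (it can err with probability roughly i / P).
        if i < 0 or self.w * pow(self.z, i, P) % P != self.fp:
            return None
        return i, self.w


class L0Sampler:
    def __init__(self, n, seed=0):
        rng = random.Random(seed)
        self.a = rng.randrange(1, P)
        self.b = rng.randrange(P)
        z = rng.randrange(2, P)
        # One recovery structure per level; level j keeps about a 2^-j fraction of indices.
        self.levels = [OneSparse(z) for _ in range(n.bit_length() + 1)]

    def _hash(self, i):
        return (self.a * i + self.b) % P

    def update(self, i, delta):
        """Sampling step: index i reaches level j only if its hash is below P / 2^j."""
        h = self._hash(i)
        for j, level in enumerate(self.levels):
            if h >= (P >> j):
                break  # levels are nested, so deeper levels exclude i as well
            level.update(i, delta)

    def sample(self):
        """Selection step: report the item from the first level where recovery succeeds."""
        for level in self.levels:
            hit = level.recover()
            if hit is not None:
                return hit
        return None  # the sampler is allowed to fail


if __name__ == "__main__":
    s = L0Sampler(n=1000, seed=1)
    s.update(5, 3)
    s.update(7, 2)
    s.update(7, -2)    # deleting item 7 removes it from the support
    print(s.sample())  # (5, 3): the only index left in the support
```

Because the levels are nested, any level holding exactly one surviving index holds the index with the smallest hash value, so the output is either that index or a failure. Using a single recovery structure per level keeps the example short at the cost of a higher failure probability.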

Notes

More generally, we also seek solutions so that, given sketches of vectors a and b, we can form a sketch of (a+b) and sample from the ℓ₀-distribution on (a+b). All the algorithms that we discuss have this property.
We note that tighter bounds are possible via a similar construction and a more involved analysis: adapting the approach of [11] improves the log term from log(s/δ_r) to log(1/δ_r), and the analysis of [26] further improves it to log_s(1/δ_r).
Jowhari et al. [18] first present their algorithm assuming a random oracle, and then they remove this assumption through the use of the pseudo-random generator of Nisan [23].
This level is ⌈log(2N/k)⌉ for the ℓ₀-sampler with k-wise independence, and ⌈log N/ϵ⌉ for the variant with pairwise independence.
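
The mergeability property in the first note follows whenever the sketch is a linear function of the input vector. The short Python sketch below illustrates this for one simple linear summary (a count, an index-weighted count and a polynomial fingerprint). The summary, the constants `P` and `Z`, and the function names are assumptions made for this example, not the sketches used in the paper.

```python
P = (1 << 61) - 1  # prime modulus for the fingerprint (assumed choice)
Z = 1234567        # evaluation point; both sketches must share it


def sketch(vec):
    """Summarize a sparse vector given as {index: integer count}."""
    w = sum(vec.values())                                   # sum of counts
    iw = sum(i * c for i, c in vec.items())                 # index-weighted sum
    fp = sum(c * pow(Z, i, P) for i, c in vec.items()) % P  # fingerprint
    return (w, iw, fp)


def merge(sa, sb):
    """Sketch of (a+b), computed from the sketches of a and b alone."""
    return (sa[0] + sb[0], sa[1] + sb[1], (sa[2] + sb[2]) % P)


def add_vectors(a, b):
    """Entry-wise sum of two sparse vectors, dropping zero entries."""
    out = dict(a)
    for i, c in b.items():
        out[i] = out.get(i, 0) + c
    return {i: c for i, c in out.items() if c != 0}


if __name__ == "__main__":
    a = {3: 2, 8: 1}
    b = {3: 1, 8: -1}  # cancels index 8 of a
    print(merge(sketch(a), sketch(b)) == sketch(add_vectors(a, b)))  # True
```

Each component is a sum over entries, so adding two sketches component-wise gives exactly the sketch of the summed vector, including when entries cancel.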

References

1. Achlioptas, D.: Database-friendly random projections. In: ACM Principles of Database Systems, pp. 274–281 (2001)
2. Ahn, K.J., Guha, S., McGregor, A.: Analyzing graph structure via linear measurements. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 459–467 (2012)
3. Barkay, N., Porat, E., Shalem, B.: Feasible Sampling of Non-strict Turnstile Data Streams (2012). arXiv:1209.5566
4. Beyer, K., Gemulla, R., Haas, P.J., Reinwald, B., Sismanis, Y.: Distinct-value synopses for multiset operations. Commun. ACM 52(10), 87–95 (2009)
5. Cormode, G., Firmani, D.: On unifying the space of ℓ₀ sampling algorithms. In: Meeting on Algorithm Engineering & Experiments, pp. 163–172 (2013)
6. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. In: International Conference on Very Large Data Bases, pp. 3–20 (2008)
7. Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: ACM SIGMOD International Conference on Management of Data, pp. 35–46 (2004)
8. Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: International Conference on Very Large Data Bases, pp. 25–36 (2005)
9. Cormode, G., Garofalakis, M., Haas, P., Jermaine, C.: Synopses for Massive Data: Samples, Histograms, Wavelets and Sketches. Now Publishers, Hanover (2012)
10. Dasgupta, S., Gupta, A.: An Elementary Proof of the Johnson–Lindenstrauss Lemma. International Computer Science Institute, Berkeley (1999). Tech. Rep. TR-99-006
11. Eppstein, D., Goodrich, M.T.: Space-efficient straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters. In: Workshop on Algorithms and Data Structures, pp. 637–648 (2007)
12. Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Symposium on Computational Geometry, pp. 142–149 (2005)
13. Ganguly, S.: Counting distinct items over update streams. Theor. Comput. Sci. 378(3), 211–222 (2007)
14. Gilbert, A.C., Strauss, M.J., Tropp, J.A., Vershynin, R.: One sketch for all: fast algorithms for compressed sensing. In: ACM Symposium on Theory of Computing, pp. 237–246 (2007)
15. Indyk, P.: A small approximately min-wise independent family of hash functions. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 454–456 (1999)
16. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM Symposium on Theory of Computing, pp. 604–613 (1998)
17. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206 (1984)
18. Jowhari, H., Sağlam, M., Tardos, G.: Tight bounds for l_p samplers, finding duplicates in streams, and related problems. In: ACM Principles of Database Systems, pp. 49–58 (2011)
19. Kane, D.M., Nelson, J., Woodruff, D.P.: An optimal algorithm for the distinct elements problem. In: ACM Principles of Database Systems, pp. 41–52 (2010)
20. Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009)
21. Metwally, A., Agrawal, D., El Abbadi, A.: Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In: EDBT, pp. 618–629 (2008)
22. Monemizadeh, M., Woodruff, D.P.: 1-pass relative-error l_p-sampling with applications. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1143–1160 (2010)
23. Nisan, N.: Pseudorandom generators for space-bounded computations. In: ACM Symposium on Theory of Computing, pp. 204–212 (1990)
24. Patrascu, M., Thorup, M.: The power of simple tabulation hashing. In: ACM Symposium on Theory of Computing, pp. 1–10 (2011)
25. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with Sawzall. Sci. Program. 13(4), 277–298 (2005)
26. Price, E.: Efficient sketches for the set query problem. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 41–56 (2011)
27. Schmidt, J.P., Siegel, A., Srinivasan, A.: Chernoff–Hoeffding bounds for applications with limited independence. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 331–340 (1993)

Author information

Authors and Affiliations

University of Warwick, Coventry, UK
Graham Cormode
Sapienza University of Rome, Rome, Italy
Donatella Firmani

Corresponding author

Correspondence to Graham Cormode.

Additional information

Communicated by: Feifei Li and Suman Nath.

This paper is an extended version of [5].

About this article

Cite this article

Cormode, G., Firmani, D. A unifying framework for ℓ₀-sampling algorithms. Distrib Parallel Databases 32, 315–335 (2014). https://doi.org/10.1007/s10619-013-7131-9

Published: 25 July 2013
Issue Date: September 2014
DOI: https://doi.org/10.1007/s10619-013-7131-9
