Coordinated Sampling

Cohen, Edith

doi:10.1007/978-1-4939-2864-4_576

Years and Authors of Summarized Original Work

1972; Brewer, Early, Joyce
1997; Cohen
1997; Broder
2013; Cohen, Kaplan
2014; Cohen

Problem Definition

Data is often sampled to address resource constraints on storage, bandwidth, or processing. Even when we have the resources to store the full data set, processing queries exactly over the data can be very expensive, so we may opt for fast approximate answers obtained from a much smaller sample.

Our focus here is on data sets that have the form of a set of keys from some universe and multiple instances, which are assignments of nonnegative values to keys. We denote by v_hi the value of key h in instance i. Examples of data sets with this form include measurements of a set of parameters; snapshots of a state of a system; logs of requests, transactions, or activity; IP flow records in different time periods; and occurrences of terms in a set of documents. Typically, this matrix is very sparse – the vast majority...
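As a rough illustration of this data model (not code from the entry), the following minimal Python sketch stores each instance as a sparse mapping from keys to nonnegative values, so that absent keys implicitly have value 0. The instance names, the IP-address keys, and the helper functions `value` and `density` are all hypothetical and chosen only for the example.

```python
from typing import Dict, Hashable

# Hypothetical toy data: each instance i maps a key h to its value v_hi.
# Keys that do not appear in an instance implicitly have value 0,
# which keeps the representation compact when the matrix is sparse.
Instance = Dict[Hashable, float]

instances: Dict[str, Instance] = {
    "period_1": {"10.0.0.1": 120.0, "10.0.0.2": 3.0},
    "period_2": {"10.0.0.2": 7.0, "10.0.0.3": 45.0},
}


def value(h: Hashable, i: str) -> float:
    """Return v_hi, the value of key h in instance i (0 if h is absent)."""
    return instances[i].get(h, 0.0)


def density() -> float:
    """Return the fraction of (key, instance) entries that are nonzero."""
    keys = set().union(*instances.values())  # all keys seen in any instance
    nonzero = sum(
        1 for inst in instances.values() for v in inst.values() if v > 0
    )
    return nonzero / (len(keys) * len(instances))
```

Here `value("10.0.0.1", "period_2")` is 0 because that key is absent from the second instance, and `density()` reports how much of the key-by-instance matrix is actually populated.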

Recommended Reading

Beyer KS, Haas PJ, Reinwald B, Sismanis Y, Gemulla R (2007) On synopses for distinct-value estimation under multiset operations. In: SIGMOD, Beijing. ACM, pp 199–210
Brewer KRW, Early LJ, Joyce SF (1972) Selecting several samples from a single population. Aust J Stat 14(3):231–239
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences, Salerno. IEEE, pp 21–29
Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Proceedings of the 11th annual symposium on combinatorial pattern matching, Montreal. LNCS, vol 1848. Springer, pp 1–10
Byers JW, Considine J, Mitzenmacher M, Rost S (2004) Informed content delivery across adaptive overlay networks. IEEE/ACM Trans Netw 12(5):767–780
Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55:441–453
Cohen E (2013) All-distances sketches, revisited: HIP estimators for massive graphs analysis. Tech. Rep. cs.DS/1306.3284, arXiv http://arxiv.org/abs/1306.3284
Cohen E (2014) Distance queries from sampled data: accurate and efficient. In: ACM KDD, New York. Full version: http://arxiv.org/abs/1203.4903
Cohen E (2014) Estimation for monotone sampling: competitiveness and customization. In: ACM PODC, Paris. Full version: http://arxiv.org/abs/1212.0243
Cohen E (2014) Variance competitiveness for monotone estimation: tightening the bounds. Tech. Rep. cs.ST/1406.6490, arXiv http://arxiv.org/abs/1406.6490
Cohen E, Kaplan H (2007) Spatially-decaying aggregation over a network: model and algorithms. J Comput Syst Sci 73:265–288. Full version of a SIGMOD 2004 paper
Cohen E, Kaplan H (2007) Summarizing data using bottom-k sketches. In: ACM PODC, Portland
Cohen E, Kaplan H (2008) Tighter estimation using bottom-k sketches. In: Proceedings of the 34th international conference on very large data bases (VLDB), Auckland. http://arxiv.org/abs/0802.3448
Cohen E, Kaplan H (2009) Leveraging discarded samples for tighter estimation of multiple-set aggregates. In: ACM SIGMETRICS, Seattle
Cohen E, Kaplan H (2013) What you can do with coordinated samples. In: The 17th international workshop on randomization and computation (RANDOM), Berkeley. Full version: http://arxiv.org/abs/1206.5637
Cohen E, Wang YM, Suri G (1995) When piecewise determinism is almost true. In: Proceedings of the Pacific Rim international symposium on fault-tolerant systems, Newport Beach, pp 66–71
Cohen E, Kaplan H, Sen S (2009) Coordinated weighted sampling for estimating aggregates over multiple weight assignments. In: Proceedings of the VLDB endowment, Lyon, France, vol 2(1–2). Full version: http://arxiv.org/abs/0906.4560
Cohen E, Delling D, Fuchs F, Goldberg A, Goldszmidt M, Werneck R (2013) Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths. In: ACM COSN, Boston
Cohen E, Delling D, Pajor T, Werneck RF (2014) Sketch-based influence maximization and computation: scaling up with guarantees. In: ACM CIKM, Shanghai. Full version: http://research.microsoft.com/apps/pubs/?id=226623
Cohen E, Delling D, Pajor T, Werneck RF (2014) Timed influence: computation and maximization. Tech. Rep. cs.SI/1410.6976, arXiv http://arxiv.org/abs/1410.6976
Das A, Datar M, Garg A, Rajaram S (2007) Google news personalization: scalable online collaborative filtering. In: WWW, Banff, Alberta, Canada
Duffield N, Thorup M, Lund C (2007) Priority sampling for estimating arbitrary subset sums. J Assoc Comput Mach 54(6)
Efraimidis PS, Spirakis PG (2006) Weighted random sampling with a reservoir. Inf Process Lett 97(5):181–185
Gibbons PB (2001) Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: International conference on very large databases (VLDB), Roma, pp 541–550
Gibbons P, Tirthapura S (2001) Estimating simple functions on the union of data streams. In: Proceedings of the 13th annual ACM symposium on parallel algorithms and architectures, Crete Island. ACM
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the 25th international conference on very large data bases (VLDB’99), Edinburgh
Hadjieleftheriou M, Yu X, Koudas N, Srivastava D (2008) Hashed samples: selectivity estimators for set similarity selection queries. In: Proceedings of the 34th international conference on very large data bases (VLDB), Auckland
Hájek J (1981) Sampling from a finite population. Marcel Dekker, New York
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260):663–685
Indyk P (2001) Stable distributions, pseudorandom generators, embeddings and data stream computation. In: Proceedings of the 41st IEEE annual symposium on foundations of computer science, Redondo Beach. IEEE, pp 189–197
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th annual ACM symposium on theory of computing, Texas. ACM, pp 604–613
Mosk-Aoyama D, Shah D (2006) Computing separable functions via gossip. In: ACM PODC, Denver
Ohlsson E (1998) Sequential Poisson sampling. J Off Stat 14(2):149–162
Ohlsson E (2000) Coordination of PPS samples over time. In: The 2nd international conference on establishment surveys. American Statistical Association, pp 255–264
Rosén B (1972) Asymptotic theory for successive sampling with varying probabilities without replacement, I. Ann Math Stat 43(2):373–397. http://www.jstor.org/stable/2239977
Rosén B (1997) Asymptotic theory for order sampling. J Stat Plan Inference 62(2):135–158
Saavedra PJ (1995) Fixed sample size PPS approximations with a permanent random number. In: Proceedings of the section on survey research methods, Alexandria. American Statistical Association, pp 697–700
Szegedy M (2006) The DLT priority sampling is essentially optimal. In: Proceedings of the 38th annual ACM symposium on theory of computing, Seattle. ACM

Author information

Authors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Edith Cohen
Stanford University, Stanford, CA, USA
Edith Cohen

Corresponding author

Correspondence to Edith Cohen.

Editor information

Editors and Affiliations

Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, USA
Ming-Yang Kao

About this entry

Cite this entry

Cohen, E. (2016). Coordinated Sampling. In: Kao, MY. (eds) Encyclopedia of Algorithms. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2864-4_576

DOI: https://doi.org/10.1007/978-1-4939-2864-4_576
Published: 22 April 2016
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-2863-7
Online ISBN: 978-1-4939-2864-4
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
