Years and Authors of Summarized Original Work
- 1972; Brewer, Early, Joyce
- 1997; Cohen
- 1997; Broder
- 2013; Cohen, Kaplan
- 2014; Cohen
Problem Definition
Data is often sampled to address resource constraints on storage, bandwidth, or processing. Even when we have the resources to store the full data set, processing queries exactly over the data can be very expensive, so we may opt for fast approximate answers obtained from a much smaller sample.
Our focus here is on data sets that have the form of a set of keys from some universe and multiple instances, which are assignments of nonnegative values to keys. We denote by $v_{hi}$ the value of key $h$ in instance $i$. Examples of data sets with this form include measurements of a set of parameters; snapshots of a state of a system; logs of requests, transactions, or activity; IP flow records in different time periods; and occurrences of terms in a set of documents. Typically, this matrix is very sparse – the vast majority...
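The keys-by-instances data model described above can be illustrated with a small sketch. This is only one possible representation, not taken from the entry: the instance names, the keys, the values, and the helper functions `value`, `all_keys`, `instance_total`, and `density` are all hypothetical. It stores only nonzero entries, which is a common way to hold a sparse matrix of this kind.

```python
# Hypothetical toy data: instance -> {key: nonnegative value}.
# Only nonzero entries are stored, so absent keys implicitly have value 0.
instances = {
    "day1": {"10.0.0.1": 120.0, "10.0.0.7": 3.0},
    "day2": {"10.0.0.1": 80.0, "10.0.0.9": 15.0},
}


def value(key, instance):
    """Return the value of key h in instance i (0 if the key is absent)."""
    return instances[instance].get(key, 0.0)


def all_keys():
    """Keys that have a stored (nonzero) value in at least one instance."""
    return set().union(*(vals.keys() for vals in instances.values()))


def instance_total(instance):
    """An exact query: the sum of all values in one instance."""
    return sum(instances[instance].values())


def density():
    """Fraction of (key, instance) pairs that hold a stored value."""
    stored = sum(len(vals) for vals in instances.values())
    return stored / (len(all_keys()) * len(instances))
```

On real data the exact query scans every stored entry of an instance, which is the cost that answering from a much smaller sample is meant to avoid.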
Recommended Reading
Beyer KS, Haas PJ, Reinwald B, Sismanis Y, Gemulla R (2007) On synopses for distinct-value estimation under multiset operations. In: SIGMOD, Beijing. ACM, pp 199–210
Brewer KRW, Early LJ, Joyce SF (1972) Selecting several samples from a single population. Aust J Stat 14(3):231–239
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences, Salerno. IEEE, pp 21–29
Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Proceedings of the 11th annual symposium on combinatorial pattern matching, Montreal. LNCS, vol 1848. Springer, pp 1–10
Byers JW, Considine J, Mitzenmacher M, Rost S (2004) Informed content delivery across adaptive overlay networks. IEEE/ACM Trans Netw 12(5):767–780
Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55:441–453
Cohen E (2013) All-distances sketches, revisited: HIP estimators for massive graphs analysis. Tech. Rep. cs.DS/1306.3284, arXiv http://arxiv.org/abs/1306.3284
Cohen E (2014) Distance queries from sampled data: accurate and efficient. In: ACM KDD, New York. Full version: http://arxiv.org/abs/1203.4903
Cohen E (2014) Estimation for monotone sampling: competitiveness and customization. In: ACM PODC, Paris. Full version: http://arxiv.org/abs/1212.0243
Cohen E (2014) Variance competitiveness for monotone estimation: tightening the bounds. Tech. Rep. cs.ST/1406.6490, arXiv http://arxiv.org/abs/1406.6490
Cohen E, Kaplan H (2007) Spatially-decaying aggregation over a network: model and algorithms. J Comput Syst Sci 73:265–288. Full version of a SIGMOD 2004 paper
Cohen E, Kaplan H (2007) Summarizing data using bottom-k sketches. In: ACM PODC, Portland
Cohen E, Kaplan H (2008) Tighter estimation using bottom-k sketches. In: Proceedings of the 34th international conference on very large data bases (VLDB), Auckland. http://arxiv.org/abs/0802.3448
Cohen E, Kaplan H (2009) Leveraging discarded samples for tighter estimation of multiple-set aggregates. In: ACM SIGMETRICS, Seattle
Cohen E, Kaplan H (2013) What you can do with coordinated samples. In: The 17th international workshop on randomization and computation (RANDOM), Berkeley. Full version: http://arxiv.org/abs/1206.5637
Cohen E, Wang YM, Suri G (1995) When piecewise determinism is almost true. In: Proceedings of the pacific rim international symposium on fault-tolerant systems, Newport Beach, pp 66–71
Cohen E, Kaplan H, Sen S (2009) Coordinated weighted sampling for estimating aggregates over multiple weight assignments. In: Proceedings of the VLDB endowment, Lyon, France, vol 2(1–2). Full version: http://arxiv.org/abs/0906.4560
Cohen E, Delling D, Fuchs F, Goldberg A, Goldszmidt M, Werneck R (2013) Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths. In: ACM COSN, Boston
Cohen E, Delling D, Pajor T, Werneck RF (2014) Sketch-based influence maximization and computation: scaling up with guarantees. In: ACM CIKM, Shanghai. Full version: http://research.microsoft.com/apps/pubs/?id=226623
Cohen E, Delling D, Pajor T, Werneck RF (2014) Timed influence: computation and maximization. Tech. Rep. cs.SI/1410.6976, arXiv http://arxiv.org/abs/1410.6976
Das A, Datar M, Garg A, Rajaram S (2007) Google news personalization: scalable online collaborative filtering. In: WWW, Banff, Alberta, Canada
Duffield N, Thorup M, Lund C (2007) Priority sampling for estimating arbitrary subset sums. J Assoc Comput Mach 54(6)
Efraimidis PS, Spirakis PG (2006) Weighted random sampling with a reservoir. Inf Process Lett 97(5):181–185
Gibbons PB (2001) Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: International conference on very large databases (VLDB), Roma, pp 541–550
Gibbons P, Tirthapura S (2001) Estimating simple functions on the union of data streams. In: Proceedings of the 13th annual ACM symposium on parallel algorithms and architectures, Crete Island. ACM
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the 25th international conference on very large data bases (VLDB’99), Edinburgh
Hadjieleftheriou M, Yu X, Koudas N, Srivastava D (2008) Hashed samples: selectivity estimators for set similarity selection queries. In: Proceedings of the 34th international conference on very large data bases (VLDB), Auckland
Hájek J (1981) Sampling from a finite population. Marcel Dekker, New York
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260):663–685
Indyk P (2001) Stable distributions, pseudorandom generators, embeddings and data stream computation. In: Proceedings of the 41st IEEE annual symposium on foundations of computer science, Redondo Beach. IEEE, pp 189–197
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th annual ACM symposium on theory of computing, Texas. ACM, pp 604–613
Mosk-Aoyama D, Shah D (2006) Computing separable functions via gossip. In: ACM PODC, Denver
Ohlsson E (1998) Sequential Poisson sampling. J Off Stat 14(2):149–162
Ohlsson E (2000) Coordination of PPS samples over time. In: The 2nd international conference on establishment surveys. American Statistical Association, pp 255–264
Rosén B (1972) Asymptotic theory for successive sampling with varying probabilities without replacement, I. Ann Math Stat 43(2):373–397. http://www.jstor.org/stable/2239977
Rosén B (1997) Asymptotic theory for order sampling. J Stat Plan Inference 62(2):135–158
Saavedra PJ (1995) Fixed sample size PPS approximations with a permanent random number. In: Proceedings of the section on survey research methods, Alexandria. American Statistical Association, pp 697–700
Szegedy M (2006) The DLT priority sampling is essentially optimal. In: Proceedings of the 38th annual ACM symposium on theory of computing, Seattle. ACM
© 2016 Springer Science+Business Media New York
Cohen, E. (2016). Coordinated Sampling. In: Kao, MY. (eds) Encyclopedia of Algorithms. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2864-4_576
DOI: https://doi.org/10.1007/978-1-4939-2864-4_576
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-2863-7
Online ISBN: 978-1-4939-2864-4
eBook Packages: Computer Science; Reference Module Computer Science and Engineering