Coordinated Sampling

  • Reference work entry
Encyclopedia of Algorithms

Years and Authors of Summarized Original Work

  • 1972; Brewer, Early, Joyce

  • 1997; Cohen

  • 1997; Broder

  • 2013; Cohen, Kaplan

  • 2014; Cohen

Problem Definition

Data is often sampled to address resource constraints on storage, bandwidth, or processing. Even when we have the resources to store the full data set, processing queries exactly over the data can be very expensive, so we may opt for fast approximate answers obtained from a much smaller sample.

Our focus here is on data sets that have the form of a set of keys from some universe and multiple instances, which are assignments of nonnegative values to keys. We denote by $v_{hi}$ the value of key $h$ in instance $i$. Examples of data sets with this form include measurements of a set of parameters; snapshots of a state of a system; logs of requests, transactions, or activity; IP flow records in different time periods; and occurrences of terms in a set of documents. Typically, this matrix is very sparse – the vast majority...
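The data model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the entry's own method: the instance names, keys, and values are made up, the helper names (`value`, `shared_rank`, `coordinated_sample`) are hypothetical, and the sampling step shows one common way to coordinate samples (a per-key hash shared by all instances, keeping the k keys of smallest hash), which the text above does not spell out.

```python
import hashlib

# Sparse representation: instance id -> {key: nonnegative value}.
# Keys that are not stored have value 0. All data here is invented.
instances = {
    "monday": {"10.0.0.1": 5.0, "10.0.0.2": 1.0, "10.0.0.7": 3.0},
    "tuesday": {"10.0.0.1": 4.0, "10.0.0.7": 3.0, "10.0.0.9": 2.0},
}


def value(instances, h, i):
    """Return v_hi, the value of key h in instance i (0 if not stored)."""
    return instances[i].get(h, 0.0)


def shared_rank(h):
    """Map key h to a pseudo-random number between 0 and 1.

    The number depends only on the key, so every instance sees the same
    rank for the same key. This shared randomness is the assumed
    coordination mechanism in this sketch.
    """
    digest = hashlib.sha256(h.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def coordinated_sample(instance, k):
    """Keep the k keys with positive value that have the smallest shared rank."""
    active = [h for h, v in instance.items() if v > 0]
    return set(sorted(active, key=shared_rank)[:k])


# One sample per instance, all driven by the same per-key ranks.
samples = {i: coordinated_sample(kv, 2) for i, kv in instances.items()}
```

Because the rank of a key does not depend on the instance, a key that is sampled in one instance tends to be sampled in other instances where it is also present, so samples of similar instances overlap heavily. This sketch ignores the values when sampling; weighted variants would scale the rank by the value.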

Recommended Reading

  1. Beyer KS, Haas PJ, Reinwald B, Sismanis Y, Gemulla R (2007) On synopses for distinct-value estimation under multiset operations. In: SIGMOD, Beijing. ACM, pp 199–210

  2. Brewer KRW, Early LJ, Joyce SF (1972) Selecting several samples from a single population. Aust J Stat 14(3):231–239

  3. Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences, Salerno. IEEE, pp 21–29

  4. Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Proceedings of the 11th annual symposium on combinatorial pattern matching, Montreal. LNCS, vol 1848. Springer, pp 1–10

  5. Byers JW, Considine J, Mitzenmacher M, Rost S (2004) Informed content delivery across adaptive overlay networks. IEEE/ACM Trans Netw 12(5):767–780

  6. Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55:441–453

  7. Cohen E (2013) All-distances sketches, revisited: HIP estimators for massive graphs analysis. Tech. Rep. cs.DS/1306.3284, arXiv http://arxiv.org/abs/1306.3284

  8. Cohen E (2014) Distance queries from sampled data: accurate and efficient. In: ACM KDD, New York. Full version: http://arxiv.org/abs/1203.4903

  9. Cohen E (2014) Estimation for monotone sampling: competitiveness and customization. In: ACM PODC, Paris. Full version: http://arxiv.org/abs/1212.0243

  10. Cohen E (2014) Variance competitiveness for monotone estimation: tightening the bounds. Tech. Rep. cs.ST/1406.6490, arXiv http://arxiv.org/abs/1406.6490

  11. Cohen E, Kaplan H (2007) Spatially-decaying aggregation over a network: model and algorithms. J Comput Syst Sci 73:265–288. Full version of a SIGMOD 2004 paper

  12. Cohen E, Kaplan H (2007) Summarizing data using bottom-k sketches. In: ACM PODC, Portland

  13. Cohen E, Kaplan H (2008) Tighter estimation using bottom-k sketches. In: Proceedings of the 34th international conference on very large data bases (VLDB), Auckland. http://arxiv.org/abs/0802.3448

  14. Cohen E, Kaplan H (2009) Leveraging discarded samples for tighter estimation of multiple-set aggregates. In: ACM SIGMETRICS, Seattle

  15. Cohen E, Kaplan H (2013) What you can do with coordinated samples. In: The 17th international workshop on randomization and computation (RANDOM), Berkeley. Full version: http://arxiv.org/abs/1206.5637

  16. Cohen E, Wang YM, Suri G (1995) When piecewise determinism is almost true. In: Proceedings of the pacific rim international symposium on fault-tolerant systems, Newport Beach, pp 66–71

  17. Cohen E, Kaplan H, Sen S (2009) Coordinated weighted sampling for estimating aggregates over multiple weight assignments. In: Proceedings of the VLDB endowment, Lyon, France, vol 2(1–2). Full version: http://arxiv.org/abs/0906.4560

  18. Cohen E, Delling D, Fuchs F, Goldberg A, Goldszmidt M, Werneck R (2013) Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths. In: ACM COSN, Boston

  19. Cohen E, Delling D, Pajor T, Werneck RF (2014) Sketch-based influence maximization and computation: scaling up with guarantees. In: ACM CIKM, Shanghai. Full version: http://research.microsoft.com/apps/pubs/?id=226623

  20. Cohen E, Delling D, Pajor T, Werneck RF (2014) Timed influence: computation and maximization. Tech. Rep. cs.SI/1410.6976, arXiv http://arxiv.org/abs/1410.6976

  21. Das A, Datar M, Garg A, Rajaram S (2007) Google news personalization: scalable online collaborative filtering. In: WWW, Banff, Alberta, Canada

  22. Duffield N, Thorup M, Lund C (2007) Priority sampling for estimating arbitrary subset sums. J Assoc Comput Mach 54(6)

  23. Efraimidis PS, Spirakis PG (2006) Weighted random sampling with a reservoir. Inf Process Lett 97(5):181–185

  24. Gibbons PB (2001) Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: International conference on very large databases (VLDB), Roma, pp 541–550

  25. Gibbons P, Tirthapura S (2001) Estimating simple functions on the union of data streams. In: Proceedings of the 13th annual ACM symposium on parallel algorithms and architectures, Crete Island. ACM

  26. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of the 25th international conference on very large data bases (VLDB’99), Edinburgh

  27. Hadjieleftheriou M, Yu X, Koudas N, Srivastava D (2008) Hashed samples: selectivity estimators for set similarity selection queries. In: Proceedings of the 34th international conference on very large data bases (VLDB), Auckland

  28. Hájek J (1981) Sampling from a finite population. Marcel Dekker, New York

  29. Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260):663–685

  30. Indyk P (2001) Stable distributions, pseudorandom generators, embeddings and data stream computation. In: Proceedings of the 41st IEEE annual symposium on foundations of computer science, Redondo Beach. IEEE, pp 189–197

  31. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th annual ACM symposium on theory of computing, Texas. ACM, pp 604–613

  32. Mosk-Aoyama D, Shah D (2006) Computing separable functions via gossip. In: ACM PODC, Denver

  33. Ohlsson E (1998) Sequential Poisson sampling. J Off Stat 14(2):149–162

  34. Ohlsson E (2000) Coordination of PPS samples over time. In: The 2nd international conference on establishment surveys. American Statistical Association, pp 255–264

  35. Rosén B (1972) Asymptotic theory for successive sampling with varying probabilities without replacement, I. Ann Math Stat 43(2):373–397. http://www.jstor.org/stable/2239977

  36. Rosén B (1997) Asymptotic theory for order sampling. J Stat Plan Inference 62(2):135–158

  37. Saavedra PJ (1995) Fixed sample size PPS approximations with a permanent random number. In: Proceedings of the section on survey research methods, Alexandria. American Statistical Association, pp 697–700

  38. Szegedy M (2006) The DLT priority sampling is essentially optimal. In: Proceedings of the 38th annual ACM symposium on theory of computing, Seattle. ACM


Author information

Correspondence to Edith Cohen.

Copyright information

© 2016 Springer Science+Business Media New York

About this entry

Cite this entry

Cohen, E. (2016). Coordinated Sampling. In: Kao, MY. (eds) Encyclopedia of Algorithms. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-2864-4_576
