Abstract
Sample coordination, where similar instances have similar samples, was proposed by statisticians four decades ago as a way to maximize overlap in repeated surveys. Coordinated sampling has since been used for summarizing massive data sets.
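As an illustration of what coordination means, the following is a minimal sketch and not the construction used in the paper. It assumes one common realization: each key gets a single hash-derived random seed shared across all instances, and a key is sampled when its weight exceeds the seed times a threshold. The names `seed`, `coordinated_sample`, `tau`, and `salt`, and the example data, are hypothetical.

```python
import hashlib


def seed(key: str, salt: str = "demo") -> float:
    """Map a key to a reproducible pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def coordinated_sample(
    instance: dict[str, float], tau: float, salt: str = "demo"
) -> dict[str, float]:
    """Keep key k when its weight exceeds seed(k) * tau.

    If the hash behaves like a uniform random number, a key of weight w
    is kept with probability min(1, w / tau). Because seed(k) does not
    depend on the instance, samples of different instances are coordinated.
    """
    return {k: w for k, w in instance.items() if w > seed(k, salt) * tau}


# Two similar instances (assumed data): the second raises every weight by 10%.
monday = {f"key{i}": float(i % 7 + 1) for i in range(200)}
tuesday = {k: 1.1 * w for k, w in monday.items()}

s_mon = coordinated_sample(monday, tau=10.0)
s_tue = coordinated_sample(tuesday, tau=10.0)

# Every key sampled on Monday is also sampled on Tuesday, since its
# weight only grew while its seed stayed the same.
overlap = len(s_mon.keys() & s_tue.keys())
```

Using a different `salt` per instance would make the samples independent instead, which typically yields a much smaller overlap between samples of similar instances.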
The usefulness of a sampling scheme hinges on the scope and accuracy with which queries posed over the original data can be answered from the sample. We aim here to gain a fundamental understanding of the limits and potential of coordination. Our main result is a precise characterization, in terms of simple properties of the estimated function, of the queries for which estimators with desirable properties exist. The properties we consider are unbiasedness, nonnegativity, finite variance, and bounded estimates.
Since a single estimator generally cannot be optimal (that is, minimize variance simultaneously) for all data, we propose variance competitiveness, which means that the expectation of the squared estimate on any data is not too far from the minimum possible for that data. Perhaps surprisingly, we show how to construct a variance competitive estimator for any function for which an unbiased nonnegative estimator exists.
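One way to write the variance competitiveness requirement is sketched below. The notation is an assumption, since the abstract does not fix any: $\mathbf{v}$ denotes the data, $f$ the estimated function, $\hat f$ the proposed estimator, and $c \ge 1$ the competitive ratio, with the infimum taken over all unbiased nonnegative estimators $\hat g$ of $f$.

```latex
\[
  \mathbb{E}\bigl[\hat f^{2} \mid \mathbf{v}\bigr]
  \;\le\;
  c \cdot \inf_{\hat g}\, \mathbb{E}\bigl[\hat g^{2} \mid \mathbf{v}\bigr]
  \qquad \text{for all data } \mathbf{v}.
\]
```

For an unbiased estimator, $\operatorname{Var}[\hat g \mid \mathbf{v}] = \mathbb{E}[\hat g^{2} \mid \mathbf{v}] - f(\mathbf{v})^{2}$, so minimizing the expected square on given data is the same as minimizing the variance on that data.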
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cohen, E., Kaplan, H. (2013). What You Can Do with Coordinated Samples. In: Raghavendra, P., Raskhodnikova, S., Jansen, K., Rolim, J.D.P. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX RANDOM 2013. Lecture Notes in Computer Science, vol 8096. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40328-6_32
DOI: https://doi.org/10.1007/978-3-642-40328-6_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40327-9
Online ISBN: 978-3-642-40328-6
eBook Packages: Computer Science, Computer Science (R0)