ABSTRACT
Sampling is used in statistical surveys to select a subset of individuals from a population in order to estimate properties of that population. In stratified sampling, the surveyed population is partitioned into homogeneous subgroups (strata) and individuals are selected within each subgroup, which reduces the required sample size. In this paper we consider sampling of large-scale, distributed online social networks, and we show how to handle cases where several surveys are conducted in parallel: in some surveys it may be desirable to share individuals to reduce costs, while in others sharing should be minimized, e.g., to prevent survey fatigue. Multi-survey stratified sampling is the task of choosing the individuals for several surveys in parallel, according to sharing constraints and without bias. We present a scalable distributed algorithm, designed for the MapReduce framework, for answering stratified-sampling queries over the population of a social network. We also present an algorithm that effectively answers multi-survey stratified-sampling queries, and we show how to implement it using MapReduce. An experimental evaluation illustrates the efficiency of our algorithms and their effectiveness for multi-survey stratified sampling.
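As a rough illustration of what a single stratified-sampling query involves, the following minimal sketch simulates a map step and a reduce step in plain Python. It is not the algorithm of the paper: the function names (`map_phase`, `reduce_phase`), the `age_group` attribute, and the per-stratum quotas are all hypothetical, and everything runs in one process rather than on a MapReduce cluster.

```python
import random
from collections import defaultdict

def map_phase(records, stratum_of):
    """Map step (hypothetical): emit a (stratum, record) pair per individual."""
    for record in records:
        yield stratum_of(record), record

def reduce_phase(pairs, quotas, seed=0):
    """Reduce step (hypothetical): group by stratum, then draw a uniform
    sample without replacement of the requested size from each stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for stratum, record in pairs:
        groups[stratum].append(record)
    sample = {}
    for stratum, members in groups.items():
        # Never request more individuals than the stratum contains.
        k = min(quotas.get(stratum, 0), len(members))
        sample[stratum] = rng.sample(members, k)
    return sample

# Toy population: 30 users split into two assumed strata by age group.
users = [{"id": i, "age_group": "18-29" if i % 3 else "30+"} for i in range(30)]
quotas = {"18-29": 4, "30+": 2}  # assumed sample size per stratum
sample = reduce_phase(map_phase(users, lambda u: u["age_group"]), quotas)
```

Within each stratum every individual has the same chance of being selected, which is the unbiasedness property the abstract refers to; the sharing constraints of the multi-survey setting are not modeled here.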
Index Terms
- Stratified-sampling over social networks using mapreduce