DOI: 10.1145/2588555.2588577
research-article

Stratified-sampling over social networks using mapreduce

Published: 18 June 2014

ABSTRACT

Sampling is used in statistical surveys to select a subset of individuals from a population in order to estimate properties of that population. In stratified sampling, the surveyed population is partitioned into homogeneous subgroups and individuals are selected within each subgroup, to reduce the sample size. In this paper we consider sampling of large-scale, distributed online social networks, and we show how to handle cases where several surveys are conducted in parallel: in some surveys it may be desirable to share individuals to reduce costs, while in others sharing should be minimized, e.g., to prevent survey fatigue. Multi-survey stratified sampling is the task of choosing the individuals for several surveys, in parallel, according to sharing constraints and without bias. We present a scalable distributed algorithm, designed for the MapReduce framework, for answering stratified-sampling queries over the population of a social network. We also present an algorithm that effectively answers multi-survey stratified-sampling queries, and we show how to implement it using MapReduce. An experimental evaluation illustrates the efficiency of our algorithms and their effectiveness for multi-survey stratified sampling.
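To make the idea of stratified sampling in a map/reduce style concrete, the following is a minimal single-machine sketch. It is not the algorithm presented in the paper: it assumes a single survey, a fixed sample size per stratum, and per-stratum reservoir sampling in the reduce step. The names `map_phase`, `reduce_phase`, `stratum_of`, and `sizes` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def map_phase(records, stratum_of):
    """Map step: tag each record with its stratum key.

    `stratum_of` is a hypothetical caller-supplied function that assigns
    a record to a subgroup (for example, an age range).
    """
    for record in records:
        yield stratum_of(record), record


def reduce_phase(pairs, sizes, seed=0):
    """Reduce step: keep a uniform random sample per stratum.

    `sizes` maps a stratum key to the number of individuals wanted from
    that stratum (an assumption of this sketch). Each stratum is sampled
    with classic reservoir sampling, so the records are read only once.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for stratum, record in pairs:
        k = sizes.get(stratum, 0)
        seen[stratum] += 1
        if len(reservoirs[stratum]) < k:
            # Reservoir not full yet: always keep the record.
            reservoirs[stratum].append(record)
        else:
            # Replace a random slot with probability k / seen[stratum].
            j = rng.randrange(seen[stratum])
            if j < k:
                reservoirs[stratum][j] = record
    return dict(reservoirs)


# Toy population: 20 users in the "18-29" stratum and 10 in "30+".
users = [{"id": i, "age_group": "18-29" if i % 3 else "30+"} for i in range(30)]
sample = reduce_phase(
    map_phase(users, lambda u: u["age_group"]),
    {"18-29": 4, "30+": 2},
)
```

In a real MapReduce deployment the pairs for one stratum would be routed to a reducer by key, and sharing constraints across several surveys would need additional coordination that this sketch does not attempt.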

        Published in

          SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
          June 2014
          1645 pages
          ISBN: 9781450323765
          DOI: 10.1145/2588555

          Copyright © 2014 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          SIGMOD '14 Paper Acceptance Rate: 107 of 421 submissions, 25%
          Overall Acceptance Rate: 785 of 4,003 submissions, 20%
