ABSTRACT
The properties of online social networks (OSNs) are of great interest to the general public as well as to IT professionals. Often the raw data are not available, and the summaries released by service providers are sketchy. Sampling is therefore needed to reveal the hidden properties of the underlying data. While uniform random sampling is often preferred, some properties, such as the top bloggers, need to be obtained using PPS (probability proportional to size) sampling. Although PPS sampling can be approximated using a simple random walk, that approach is not efficient because only one sample is taken in each step. This paper introduces an efficient sampling method, called star sampling, that takes all the neighbours of a sampled node as valid samples. It is more efficient than random walk sampling by a factor of the average degree. We derive the estimator and its variance, and verify the result locally on six large real networks where the ground truth is known and the estimates can be evaluated.
We then apply our method to Weibo, the Chinese version of Twitter, whose properties are rarely studied despite its enormous size and influence. Along with other conventional metrics such as size and degree distribution, we demonstrate that star sampling can identify the top ten thousand bloggers efficiently. In general, the estimated follower number is consistent with the claimed number, but there are cases where the follower numbers are inflated by a factor of up to 132.
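The star sampling idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the paper's estimator, so the one below is an assumption: centre nodes are drawn uniformly at random with replacement, the graph is undirected, and the total number of nodes is known. Under those assumptions, a node of degree d is a neighbour of a random centre with probability d / N, so its sample count scaled by N / n estimates d. The function names and the toy graph are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def star_sample(adj, n_stars, rng):
    """Draw n_stars centre nodes uniformly at random (with replacement)
    and count every neighbour of each centre as one sample."""
    nodes = list(adj)
    counts = Counter()
    for _ in range(n_stars):
        centre = rng.choice(nodes)
        counts.update(adj[centre])  # all neighbours are valid samples
    return counts


def estimate_degrees(counts, n_nodes, n_stars):
    """Assumed estimator: a node of degree d is a neighbour of a uniform
    random centre with probability d / n_nodes, so
    count * n_nodes / n_stars is an unbiased estimate of d."""
    return {v: c * n_nodes / n_stars for v, c in counts.items()}


def build_toy_graph():
    """Hypothetical test graph: hub node 0 linked to nodes 1..20,
    which also form a ring. Degrees: node 0 has 20, the rest have 3."""
    adj = {i: set() for i in range(21)}
    for i in range(1, 21):
        adj[0].add(i)
        adj[i].add(0)
        j = i % 20 + 1  # next node on the ring
        adj[i].add(j)
        adj[j].add(i)
    return adj


if __name__ == "__main__":
    graph = build_toy_graph()
    n_stars = 20000
    counts = star_sample(graph, n_stars, random.Random(0))
    est = estimate_degrees(counts, len(graph), n_stars)
    top = max(est, key=est.get)
    print("highest estimated degree:", top, round(est[top], 2))
```

In this sketch, a claimed follower count could be compared with the estimated degree of the same node; a large ratio between the two would flag a candidate for inflation. A real follower graph is directed and far larger, so this only conveys the counting argument.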
Index Terms
- Detect inflated follower numbers in OSN using star sampling