Abstract
Weibo, the Chinese counterpart of Twitter, has attracted hundreds of millions of users. Like other Online Social Networks (hereafter OSNs), Weibo has a large number of fake accounts, which are created to sell following links to customers who want to boost their follower counts. These bogus accounts are difficult to identify individually, especially when they are created by sophisticated programs or controlled directly by human beings. This paper proposes a novel fake account detection method based on the very purpose of these accounts: they are created to follow their targets en masse, which results in high overlap between the follower lists of their customers. This paper investigates the top Weibo accounts whose follower lists duplicate or nearly duplicate each other (hereafter called near-duplicates). Discovering near-duplicates is a challenging task: the network is large, the data in its entirety are not available, and pair-wise comparison is very expensive. We developed a sampling-based approach to discover all the near-duplicates among the top accounts, which have at least 50,000 followers. In the experiment, we found 395 near-duplicates, which lead us to 11.90 million fake accounts (4.56% of total users) that send 741.10 million links (9.50% of all edges). Furthermore, we characterize four typical structures of the spammers, cluster these spammers into 34 groups, and analyze the properties of each group.
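The overlap idea in the abstract can be illustrated with a minimal sketch. It assumes Jaccard similarity as the overlap measure (the acknowledgments mention Jaccard similarity, but the estimator itself is not given on this page) and assumes that follower lists are compared only on a uniform random sample of users. The function names, the toy data, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def near_duplicates(followers, sample, threshold=0.5):
    """Return account pairs whose sampled follower sets overlap heavily.

    followers: dict mapping account name -> set of follower ids
    sample:    set of sampled user ids (assumed to be a uniform random sample)
    threshold: hypothetical cut-off, not a value from the paper
    """
    # Restrict each follower list to the sampled users only.
    restricted = {acct: fs & sample for acct, fs in followers.items()}
    pairs = []
    for x, y in combinations(sorted(restricted), 2):
        score = jaccard(restricted[x], restricted[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs


# Toy data: accounts "a" and "b" share most followers, "c" shares none.
followers = {
    "a": {1, 2, 3, 4, 5, 6},
    "b": {1, 2, 3, 4, 5, 7},
    "c": {8, 9, 10},
}
sample = {1, 2, 3, 5, 7, 8, 9}
result = near_duplicates(followers, sample)
```

In this sketch, sampling only shrinks each follower list. The loop still visits every pair of accounts, so it does not address the pair-wise comparison cost that the abstract describes.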
Acknowledgments
This work is supported by an NSERC Discovery grant. We would like to thank Hao Wang for collecting the uniform random sample of Weibo used in this paper, and for his participation in calculating the Jaccard similarity on these data.
Cite this article
Zhang, Y., Lu, J. Discover millions of fake followers in Weibo. Soc. Netw. Anal. Min. 6, 16 (2016). https://doi.org/10.1007/s13278-016-0324-2
DOI: https://doi.org/10.1007/s13278-016-0324-2