Abstract
Twitter has been the subject of analysis for a variety of interesting and challenging problems, one of which is clustering its users by their interests. Many graph clustering approaches look at either the structure or the contents of the graph. However, on complex real-world data such as Twitter data, structural approaches may produce many separate user clusters with similar interests. Content-based clustering approaches also produce inferior results on Twitter data, because tweets are limited in length and contain a lot of garbled text. Hence, neither kind of approach can be applied directly to Twitter data in practical applications. In the study reported in this paper, we clustered Twitter users by their interests, using both the structure of the graph generated from Twitter data and the contents of the tweets. Specifically, we clustered Twitter users with an unsupervised structural approach, merged similar clusters with a content-based approach, expanded the graph and ranked users with Personalized PageRank, and determined the topic of each cluster from hashtag frequency. Combining these approaches gave better results than existing techniques and is suited to practical applications.
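As a rough illustration of the last two steps named above, the following Python sketch ranks users with Personalized PageRank seeded on one cluster and labels the cluster with its most frequent hashtag. It is not the authors' implementation: the dict-based graph format, the function names `personalized_pagerank` and `cluster_topic`, the damping factor of 0.85, and the toy data are all assumptions made for this example.

```python
from collections import Counter


def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration for Personalized PageRank (illustrative sketch).

    adj:   dict mapping each node to a list of out-neighbours; every
           neighbour must also be a key of adj (assumed toy format).
    seeds: set of nodes that receive the teleport mass, e.g. the
           members of one user cluster.
    """
    nodes = list(adj)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iters):
        # With probability 1 - alpha the walk jumps back to a seed.
        nxt = {n: (1.0 - alpha) * teleport[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seed set.
                for s in seeds:
                    nxt[s] += alpha * rank[n] / len(seeds)
        rank = nxt
    return rank


def cluster_topic(hashtags_by_user, members):
    """Return the most frequent hashtag among the cluster members."""
    counts = Counter(
        tag for user in members for tag in hashtags_by_user.get(user, [])
    )
    return counts.most_common(1)[0][0] if counts else None


# Toy follow graph (hypothetical): an edge u -> v means "u follows v".
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": ["a"]}
ranks = personalized_pagerank(graph, seeds={"a"})

hashtags = {"a": ["#nba", "#nba"], "b": ["#nba", "#soccer"], "c": ["#soccer"]}
topic = cluster_topic(hashtags, members=["a", "b", "c"])  # '#nba'
```

Because the teleport mass goes only to the seed users, a user who is not reachable from the seeds (here `d`) receives a score of zero, which is what makes the ranking specific to one cluster.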
References
Naik, A., Maeda, H., Kanojia, V., Fujita, S.: Scalable Twitter User Clustering Approach Boosted by Personalized PageRank, pp. 472–485. Springer, Cham (2017)
Xu, X., Yuruk, N., Feng, Z., Schweiger, T. A.: Scan: a structural clustering algorithm for networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 824–833. ACM (2007)
Shiokawa, H., Fujiwara, Y., Onizuka, M.: Scan++: efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proc. VLDB Endow. 8(11), 1178–1189 (2015)
Latapy, M., Magnien, C., Del Vecchio, N.: Basic notions for the analysis of large two-mode networks. Soc. Netw. 30(1), 31–48 (2008)
Newman, M.E., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
Ding, C.H., He, X., Zha, H., Gu, M., Simon, H.D.: A min–max cut algorithm for graph partitioning and data clustering. In: Proceedings IEEE International Conference on Data Mining, 2001. ICDM 2001, pp. 107–114. IEEE, San Jose (2001). http://ieeexplore.ieee.org/document/989507/
Zhang, Y., Wu, Y., Yang, Q.: Community discovery in twitter based on user interests. J. Comput. Inf. Syst. 8(3), 991–1000 (2012)
Hayashi, K., Maehara, T., Toyoda, M., Kawarabayashi, K.-I.: Real-time top-r topic detection on twitter with topic hijack filtering. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Ser. KDD ’15, pp. 417–426 (2015)
Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web (1999)
Haveliwala, T.: Topic-sensitive PageRank. In: Proceedings of the 11th International Conference on World Wide Web, Honolulu, Hawaii, USA, pp. 517–526 (2002)
Andersen, R., Lang, K.J.: Communities from seed sets. In: Proceedings of the 15th International Conference on World Wide Web, pp. 223–232. ACM (2006)
Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th International Conference on World Wide Web, pp. 695–704. ACM (2008)
Cha, M., Haddadi, H., Benevenuto, F., Gummadi, P.K.: Measuring user influence in twitter: the million follower fallacy. In: ICWSM, vol. 10, pp. 10–17 (2010)
Avnit, A.: The million followers fallacy (2009). http://blog.pravdam.com/the-million-followers-fallacy-guest-post-by-adi-avnit/. Online accessed 2 Aug 2016
Weng, J., Lim, E.-P., Jiang, J., He, Q.: Twitterrank: finding topic-sensitive influential Twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, Ser. WSDM ’10, pp. 261–270 (2010)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Graph-tool. https://graph-tool.skewed.de/. Online accessed 20 Jan 2016
Bayon Clustering Tool. https://code.google.com/archive/p/bayon/. Online accessed 3 Feb 2016
Trec-9 Results, Appendix A. In: Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009) (2009). http://trec.nist.gov/pubs/trec18/appendices/measures.pdf
Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. (TOIS) 20(4), 422–446 (2002)
Acknowledgements
We thank the real-time search team at Yahoo! JAPAN for all their support in carrying out this work. We also thank everyone involved in the evaluation of the results, without which this work would have been incomplete.
Additional information
This paper is an extended version of the PAKDD’2017 long presentation paper “Scalable Twitter User Clustering Approach Boosted by Personalized PageRank” [1].
Appendix
See Table 9.
About this article
Cite this article
Naik, A., Maeda, H., Kanojia, V. et al. Scalable Twitter user clustering approach boosted by Personalized PageRank. Int J Data Sci Anal 6, 297–309 (2018). https://doi.org/10.1007/s41060-017-0089-3
DOI: https://doi.org/10.1007/s41060-017-0089-3