Scalable Twitter user clustering approach boosted by Personalized PageRank

Naik, Anup; Maeda, Hideyuki; Kanojia, Vibhor; Fujita, Sumio

doi:10.1007/s41060-017-0089-3

Regular Paper
Published: 29 December 2017

International Journal of Data Science and Analytics, Volume 6, pages 297–309 (2018)

Anup Naik¹ (ORCID: orcid.org/0000-0001-5457-5713), Hideyuki Maeda¹, Vibhor Kanojia¹ & Sumio Fujita¹

Abstract

Twitter has been the focus of analysis for a variety of interesting and challenging problems, one of which is clustering its users by their interests. Many graph clustering approaches consider either the structure or the contents of the graph. However, on complex real-world data such as Twitter data, structural approaches may produce many separate user clusters that share similar interests. Content-based clustering approaches also produce inferior results on Twitter data, because tweets are limited in length and contain a lot of garbled text. Hence, for practical applications, neither kind of approach can be applied directly to Twitter data. In the study reported in this paper, we clustered Twitter users on the basis of their interests, considering both the structure of the graph generated from Twitter data and the contents of the tweets. In short, we (1) clustered Twitter users with an unsupervised structural approach, (2) merged similar clusters with a content-based approach, (3) expanded the graph and ranked users with Personalized PageRank, and (4) determined the topic of each cluster from its hashtag frequency. Combining these approaches gave better results than the existing techniques and is suitable for practical applications.
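The abstract names Personalized PageRank and hashtag frequency but gives no formulas or parameters. The following is therefore only a minimal, generic sketch of those two steps, not the authors' implementation. It assumes a plain power-iteration Personalized PageRank whose teleport distribution is uniform over the members of one cluster, and a simple most-frequent-hashtag label. The function names, the damping factor `alpha=0.85`, the iteration count, and the toy follow graph and hashtags are all hypothetical.

```python
from collections import Counter


def personalized_pagerank(graph, seeds, alpha=0.85, iterations=50):
    """Generic power-iteration Personalized PageRank (illustrative only).

    graph: dict mapping a user to the list of users they point to
           (hypothetical follow graph).
    seeds: users of one cluster; the random walk teleports only to them.
    alpha: damping factor (assumed value, not taken from the paper).
    """
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    seeds = set(seeds)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Every node first receives its share of the teleport mass.
        nxt = {n: (1.0 - alpha) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph.get(n, [])
            if out:
                # Spread the remaining mass evenly over out-neighbours.
                share = alpha * rank[n] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: return its mass to the seed set.
                for s in seeds:
                    nxt[s] += alpha * rank[n] / len(seeds)
        rank = nxt
    return rank


def label_cluster(hashtags_by_user, members, top_k=1):
    """Return the top_k most frequent hashtags among the cluster members."""
    counts = Counter(
        tag for user in members for tag in hashtags_by_user.get(user, [])
    )
    return [tag for tag, _ in counts.most_common(top_k)]


# Toy data (hypothetical): users "a" and "b" form the seed cluster.
follow_graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["c"]}
hashtags = {"a": ["#python", "#ml"], "b": ["#ml"], "c": ["#ml", "#python"]}

ranks = personalized_pagerank(follow_graph, seeds=["a", "b"])
# Users ordered by their relevance to the seed cluster, highest first.
ranking = sorted(ranks, key=ranks.get, reverse=True)
label = label_cluster(hashtags, members=["a", "b"])
```

In this sketch, a user outside the cluster who is reachable from the seeds (here "c") receives a non-zero score and could be considered when expanding the cluster, while a user the seeds never reach (here "d") scores zero.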

References

Naik, A., Maeda, H., Kanojia, V., Fujita, S.: Scalable Twitter User Clustering Approach Boosted by Personalized PageRank, pp. 472–485. Springer, Cham (2017)
Xu, X., Yuruk, N., Feng, Z., Schweiger, T. A.: Scan: a structural clustering algorithm for networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 824–833. ACM (2007)
Shiokawa, H., Fujiwara, Y., Onizuka, M.: Scan++: efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proc. VLDB Endow. 8(11), 1178–1189 (2015)
Latapy, M., Magnien, C., Del Vecchio, N.: Basic notions for the analysis of large two-mode networks. Soc. Netw. 30(1), 31–48 (2008)
Newman, M.E., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
Ding, C.H., He, X., Zha, H., Gu, M., Simon, H.D.: A min–max cut algorithm for graph partitioning and data clustering. In: Proceedings IEEE International Conference on Data Mining, 2001. ICDM 2001, pp. 107–114. IEEE, San Jose (2001). http://ieeexplore.ieee.org/document/989507/
Zhang, Y., Wu, Y., Yang, Q.: Community discovery in twitter based on user interests. J. Comput. Inf. Syst. 8(3), 991–1000 (2012)
Hayashi, K., Maehara, T., Toyoda, M., Kawarabayashi, K.-I.: Real-time top-r topic detection on twitter with topic hijack filtering. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Ser. KDD ’15, pp. 417–426 (2015)
Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web (1999)
Haveliwala, T.: Topic-sensitive PageRank. In: Proceedings of the 11th International Conference on World Wide Web, Honolulu, Hawaii, USA, pp. 517–526 (2002)
Andersen, R., Lang, K.J.: Communities from seed sets. In: Proceedings of the 15th International Conference on World Wide Web, pp. 223–232. ACM (2006)
Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th International Conference on World Wide Web, pp. 695–704. ACM (2008)
Cha, M., Haddadi, H., Benevenuto, F., Gummadi, P.K.: Measuring user influence in twitter: the million follower fallacy. In: ICWSM, vol. 10, pp. 10–17 (2010)
Avnit, A.: The million followers fallacy (2009). http://blog.pravdam.com/the-million-followers-fallacy-guest-post-by-adi-avnit/. Online accessed 2 Aug 2016
Weng, J., Lim, E.-P., Jiang, J., He, Q.: Twitterrank: finding topic-sensitive influential Twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, Ser. WSDM ’10, pp. 261–270 (2010)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Graph-tool. https://graph-tool.skewed.de/. Online accessed 20 Jan 2016
Bayon Clustering Tool. https://code.google.com/archive/p/bayon/. Online accessed 3 Feb 2016
Trec-9 Results, Appendix A. In: Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009) (2009). http://trec.nist.gov/pubs/trec18/appendices/measures.pdf
Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. (TOIS) 20(4), 422–446 (2002)

Acknowledgements

We thank the real-time search team at Yahoo! JAPAN for all their support in carrying out this work. We thank all the people involved in evaluation of the results, without which this work would have been incomplete.

Author information

Authors and Affiliations

Yahoo Japan Corporation, Tokyo, Japan
Anup Naik, Hideyuki Maeda, Vibhor Kanojia & Sumio Fujita

Corresponding author

Correspondence to Anup Naik.

Additional information

This paper is an extended version of the PAKDD’2017 long presentation paper “Scalable Twitter User Clustering Approach Boosted by Personalized PageRank” [1].

Appendix

See Table 9.

Table 9 List of frequently used terms and their meanings

About this article

Cite this article

Naik, A., Maeda, H., Kanojia, V. et al. Scalable Twitter user clustering approach boosted by Personalized PageRank. Int J Data Sci Anal 6, 297–309 (2018). https://doi.org/10.1007/s41060-017-0089-3

Received: 12 May 2017
Accepted: 11 December 2017
Published: 29 December 2017
Issue Date: December 2018
DOI: https://doi.org/10.1007/s41060-017-0089-3
