Scalable Twitter user clustering approach boosted by Personalized PageRank

  • Regular Paper
International Journal of Data Science and Analytics

Abstract

Twitter has been analyzed in connection with many interesting and challenging problems, one of which is clustering its users by their interests. Many graph clustering approaches consider either the structure of the graph or its contents. However, on complex real-world data such as Twitter data, structural approaches may produce many separate user clusters with similar interests. Content-based clustering approaches also produce inferior results on Twitter data, because tweets are limited in length and contain a lot of garbled text. Hence, these clustering approaches cannot be used directly on Twitter data in practical applications. In the study reported in this paper, we clustered Twitter users on the basis of their interests, looking at both the structure of the graph generated from Twitter data and the contents of the tweets. Specifically, we clustered Twitter users with an unsupervised structural approach, merged similar clusters with a content-based approach, expanded the graph and ranked users with Personalized PageRank, and determined the topic of each cluster from its hashtag frequencies. The combined approach produced better results than existing techniques and is suitable for practical applications.
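The last two steps named in the abstract, ranking users with Personalized PageRank and labeling a cluster by hashtag frequency, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so the graph representation (a dict of outgoing edges), the damping value `alpha`, the handling of dangling nodes, and the function names `personalized_pagerank` and `label_cluster` are all assumptions made for illustration.

```python
from collections import Counter


def personalized_pagerank(graph, seeds, alpha=0.85, iterations=50):
    """Power-iteration Personalized PageRank (illustrative sketch).

    graph: dict mapping each node to the list of nodes it links to.
    seeds: nodes that receive all teleport mass, e.g. members of one cluster.
    alpha: probability of following an edge instead of teleporting (assumed value).
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    seeds = [s for s in dict.fromkeys(seeds) if s in nodes]
    if not seeds:
        raise ValueError("at least one seed must be a node of the graph")

    # Teleport distribution: uniform over the seed set, zero elsewhere.
    teleport = {n: 0.0 for n in nodes}
    for s in seeds:
        teleport[s] = 1.0 / len(seeds)

    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * teleport[n] for n in nodes}
        dangling = 0.0
        for n in nodes:
            out = graph.get(n, [])
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Assumption: nodes without outgoing edges return their mass to the seeds.
                dangling += alpha * rank[n]
        for s in seeds:
            nxt[s] += dangling * teleport[s]
        rank = nxt
    return rank


def label_cluster(hashtags_per_tweet):
    """Return the most frequent hashtag (case-insensitive) in a cluster, or None."""
    counts = Counter(tag.lower() for tags in hashtags_per_tweet for tag in tags)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical toy data: a -> b -> c -> a, plus d -> a; the cluster seed is "a".
toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
toy_rank = personalized_pagerank(toy_graph, seeds=["a"])
toy_label = label_cluster([["AI", "ml"], ["ai"], ["ML", "ai"]])
```

Restricting the teleport mass to the seed set is what makes the ranking "personalized": nodes that cannot be reached from the seeds, such as `d` in the toy graph, end up with no score, while nodes close to the seeds score highest.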

References

  1. Naik, A., Maeda, H., Kanojia, V., Fujita, S.: Scalable Twitter User Clustering Approach Boosted by Personalized PageRank, pp. 472–485. Springer, Cham (2017)

  2. Xu, X., Yuruk, N., Feng, Z., Schweiger, T.A.: SCAN: a structural clustering algorithm for networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 824–833. ACM (2007)

  3. Shiokawa, H., Fujiwara, Y., Onizuka, M.: SCAN++: efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proc. VLDB Endow. 8(11), 1178–1189 (2015)

  4. Latapy, M., Magnien, C., Del Vecchio, N.: Basic notions for the analysis of large two-mode networks. Soc. Netw. 30(1), 31–48 (2008)

  5. Newman, M.E., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)

  6. Ding, C.H., He, X., Zha, H., Gu, M., Simon, H.D.: A min–max cut algorithm for graph partitioning and data clustering. In: Proceedings IEEE International Conference on Data Mining, 2001. ICDM 2001, pp. 107–114. IEEE, San Jose (2001). http://ieeexplore.ieee.org/document/989507/

  7. Zhang, Y., Wu, Y., Yang, Q.: Community discovery in twitter based on user interests. J. Comput. Inf. Syst. 8(3), 991–1000 (2012)

  8. Hayashi, K., Maehara, T., Toyoda, M., Kawarabayashi, K.-I.: Real-time top-r topic detection on twitter with topic hijack filtering. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Ser. KDD ’15, pp. 417–426 (2015)

  9. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web (1999)

  10. Haveliwala, T.: Topic-sensitive PageRank. In: Proceedings of the 11th International Conference on World Wide Web, Honolulu, Hawaii, USA, pp. 517–526 (2002)

  11. Andersen, R., Lang, K.J.: Communities from seed sets. In: Proceedings of the 15th International Conference on World Wide Web, pp. 223–232. ACM (2006)

  12. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th International Conference on World Wide Web, pp. 695–704. ACM (2008)

  13. Cha, M., Haddadi, H., Benevenuto, F., Gummadi, P.K.: Measuring user influence in twitter: the million follower fallacy. In: ICWSM, vol. 10, pp. 10–17 (2010)

  14. Avnit, A.: The million followers fallacy (2009). http://blog.pravdam.com/the-million-followers-fallacy-guest-post-by-adi-avnit/. Online accessed 2 Aug 2016

  15. Weng, J., Lim, E.-P., Jiang, J., He, Q.: Twitterrank: finding topic-sensitive influential Twitterers. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, Ser. WSDM ’10, pp. 261–270 (2010)

  16. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  17. Graph-tool. https://graph-tool.skewed.de/. Online accessed 20 Jan 2016

  18. Bayon Clustering Tool. https://code.google.com/archive/p/bayon/. Online accessed 3 Feb 2016

  19. Trec-9 Results, Appendix A. In: Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009) (2009). http://trec.nist.gov/pubs/trec18/appendices/measures.pdf

  20. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002)

Acknowledgements

We thank the real-time search team at Yahoo! JAPAN for all their support in carrying out this work. We also thank everyone involved in the evaluation of the results, without whom this work would have been incomplete.

Author information

Corresponding author

Correspondence to Anup Naik.

Additional information

This paper is an extended version of the PAKDD’2017 long presentation paper “Scalable Twitter User Clustering Approach Boosted by Personalized PageRank” [1].

Appendix

See Table 9.

Table 9 List of frequently used terms and their meanings

About this article

Cite this article

Naik, A., Maeda, H., Kanojia, V. et al. Scalable Twitter user clustering approach boosted by Personalized PageRank. Int J Data Sci Anal 6, 297–309 (2018). https://doi.org/10.1007/s41060-017-0089-3

  • DOI: https://doi.org/10.1007/s41060-017-0089-3
