A comparison study of clustering algorithms for microblog posts

Cluster Computing

Abstract

Clustering is a popular unsupervised learning approach for topic analysis in text mining. In this paper, we present a comparison study of clustering algorithms for microblog posts, including weighting and the programming model. Our experimental data were crawled from Sina Weibo in China and consist of 74,662 microblog posts on 14 topics about Internet technology. We first preprocess these posts. We then propose a manual sampling based dynamic incremental clustering algorithm (MS-DICA) to extract topic threads from the crawled microblogs. We evaluate the proposed algorithm from four aspects and compare it experimentally with the traditional k-means algorithm in terms of accuracy and efficiency. The experimental results show that MS-DICA is effective for topic thread extraction: its accuracy is close to that of traditional k-means, and its running speed improves by more than five times. In addition, we use the MapReduce programming model on the Hadoop distributed computing platform to run the k-means algorithm in parallel and speed up clustering.
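For readers unfamiliar with the k-means baseline mentioned in the abstract, the following is a minimal sketch of k-means over weighted term vectors. It is not the authors' implementation and it is not MS-DICA. The TF-IDF weighting, the squared Euclidean distance, the toy posts, and the function names (`tfidf_vectors`, `sq_dist`, `centroid`, `kmeans`) are all assumptions made for illustration, and the posts are assumed to be tokenized and non-empty already.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn tokenized posts into sparse TF-IDF dicts (term -> weight).

    Assumption: plain TF-IDF; the abstract does not specify the weighting.
    """
    n = len(docs)
    # Document frequency: number of posts containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, init_indices, max_iter=100):
    """Lloyd-style k-means; initial centroids are the vectors at init_indices."""
    centroids = [dict(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each post goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    return labels


# Hypothetical toy posts, already tokenized (not from the paper's data set).
docs = [
    ["cloud", "storage", "server"],
    ["cloud", "server", "hadoop"],
    ["phone", "camera", "screen"],
    ["phone", "screen", "battery"],
]
labels = kmeans(tfidf_vectors(docs), init_indices=[0, 2])
print(labels)
```

The initial centroids are fixed by index here so that the run is deterministic. In practice, k-means is usually started from random or sampled seeds, and the assignment step is the part that can be distributed across workers in a MapReduce-style job.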

Acknowledgments

This work is partly supported by National Social Science Fund Project 15BGL048, by Project 61303029 funded by the National Natural Science Foundation of China, and by Chinese 863 Project 2015AA015403.

Author information

Corresponding author

Correspondence to Lin Li.

About this article

Cite this article

Li, L., Ye, J., Deng, F. et al. A comparison study of clustering algorithms for microblog posts. Cluster Comput 19, 1333–1345 (2016). https://doi.org/10.1007/s10586-016-0589-2

  • DOI: https://doi.org/10.1007/s10586-016-0589-2
