Abstract
Short text streams, such as social media comments, are generated continuously, so effective clustering methods are essential for extracting useful information from them. However, existing research does not address topic concentration in clustering: multiple topics end up mixed in a single cluster, which makes it hard to summarize the cluster center. To tackle this issue, this paper proposes TEDM, a novel topic-enhanced clustering method based on the Dirichlet model. The method clusters dynamically and uses topic information to improve the sampling of documents, so that documents on the same topic are more likely to be grouped together. To extract topic terms, TEDM constructs a dynamic word relation graph that is updated as documents arrive, which lets it cope with changes in topics over time. Extensive experiments demonstrate that TEDM outperforms state-of-the-art methods on multiple real-world datasets.
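To make the idea of a word relation graph that follows the stream more concrete, the sketch below keeps a weighted co-occurrence graph, ages its edges as each document arrives, and ranks words by weighted degree. It is only an illustration under stated assumptions, not the construction used in TEDM: the class name `WordRelationGraph`, the `decay` forgetting factor, and the degree-based scoring are all hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import combinations


class WordRelationGraph:
    """Toy co-occurrence graph over a document stream (illustrative only)."""

    def __init__(self, decay=0.9):
        # Hypothetical forgetting factor; not a parameter taken from the paper.
        self.decay = decay
        # Maps an alphabetically ordered word pair to its edge weight.
        self.edges = defaultdict(float)

    def update(self, tokens):
        # Age every existing edge so that stale topics fade out over time.
        for pair in self.edges:
            self.edges[pair] *= self.decay
        # Reinforce edges between distinct words that co-occur in this document.
        for a, b in combinations(sorted(set(tokens)), 2):
            self.edges[(a, b)] += 1.0

    def topic_terms(self, k=3):
        # Score each word by the total weight of its edges (weighted degree);
        # this scoring rule is an assumption made for the sketch.
        score = defaultdict(float)
        for (a, b), weight in self.edges.items():
            score[a] += weight
            score[b] += weight
        # Break ties alphabetically so the ranking is deterministic.
        ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
        return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    graph = WordRelationGraph(decay=0.5)
    stream = [
        ["apple", "iphone", "launch"],
        ["apple", "iphone", "price"],
        ["storm", "flood", "warning"],
    ]
    for doc in stream:
        graph.update(doc)
        print(graph.topic_terms(k=2))
```

With a small `decay`, terms from older documents lose weight quickly, so the ranked terms shift toward whatever the most recent documents discuss. A larger value would keep earlier topics visible for longer.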
Data Availability
The Tweets and Tweets-T datasets are available at http://trec.nist.gov/data/microblog. The News and News-T datasets are available at https://news.google.com/news.
Notes
Tweets dataset: http://trec.nist.gov/data/microblog.
News website: https://news.google.com/news/.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Liu, K., He, J. & Chen, Y. A topic-enhanced dirichlet model for short text stream clustering. Neural Comput & Applic 36, 8125–8140 (2024). https://doi.org/10.1007/s00521-024-09480-w
DOI: https://doi.org/10.1007/s00521-024-09480-w