ABSTRACT
Short text stream clustering is an important but challenging task because social media platforms generate massive amounts of text. The proposed method clusters a stream of texts based on word pairs that occur frequently in the texts (the two words need not be consecutive). It detects outliers in the clusters and reassigns them to appropriate clusters using the semantic similarity between each outlier and the clusters, with similarity thresholds that are computed dynamically. In this way, the proposed method deals efficiently with the concept drift problem. Experimental results demonstrate that the proposed approach outperforms state-of-the-art short text stream clustering algorithms by a statistically significant margin on several short text datasets.
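The two steps named in the abstract, clustering by frequent word pairs and reassigning outliers, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `min_support` parameter, and the batch (non-streaming) setup are hypothetical. Two further simplifications are assumptions of this sketch: Jaccard word overlap stands in for the semantic similarity the abstract refers to, and the threshold is simply the mean of the outliers' best similarity scores, which is only one possible way to compute a threshold from the data.

```python
from collections import Counter, defaultdict
from itertools import combinations


def word_pairs(text):
    """Return all unordered pairs of distinct words in a text (need not be adjacent)."""
    tokens = sorted(set(text.lower().split()))
    return list(combinations(tokens, 2))


def cluster_by_frequent_pairs(texts, min_support=2):
    """Group texts by their most frequent word pair; texts with no frequent pair are outliers."""
    pair_counts = Counter(p for t in texts for p in word_pairs(t))
    clusters = defaultdict(list)
    outliers = []
    for idx, text in enumerate(texts):
        candidates = [p for p in word_pairs(text) if pair_counts[p] >= min_support]
        if candidates:
            # max() keeps the first maximum, so ties go to the alphabetically first pair.
            clusters[max(candidates, key=lambda p: pair_counts[p])].append(idx)
        else:
            outliers.append(idx)
    return dict(clusters), outliers


def jaccard(a, b):
    """Word-overlap similarity; a stand-in (assumption) for semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reassign_outliers(texts, clusters, outliers):
    """Move each outlier to its most similar cluster if it clears a data-driven threshold."""
    tokens = [set(t.lower().split()) for t in texts]
    best = {}
    for o in outliers:
        scored = [
            (sum(jaccard(tokens[o], tokens[m]) for m in members) / len(members), key)
            for key, members in clusters.items()
        ]
        best[o] = max(scored, key=lambda s: s[0]) if scored else (0.0, None)
    if not best:
        return clusters, []
    # Assumed threshold rule: mean of the outliers' best similarity scores.
    threshold = sum(sim for sim, _ in best.values()) / len(best)
    remaining = []
    for o, (sim, key) in best.items():
        if key is not None and sim > 0 and sim >= threshold:
            clusters[key].append(o)
        else:
            remaining.append(o)
    return clusters, remaining


if __name__ == "__main__":
    sample = [
        "apple iphone release",
        "new apple iphone",
        "football match tonight",
        "football match score",
        "iphone rumor",
        "weather sunny today",
    ]
    found, out = cluster_by_frequent_pairs(sample)
    found, left = reassign_outliers(sample, found, out)
    print(found, left)
```

All similarity scores are computed before any outlier is moved, so the result does not depend on the order in which outliers are processed. A streaming version would instead update pair counts and thresholds incrementally as texts arrive.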
Index Terms
- Short Text Stream Clustering via Frequent Word Pairs and Reassignment of Outliers to Clusters