short-paper

Short Text Stream Clustering via Frequent Word Pairs and Reassignment of Outliers to Clusters

Authors:
Md Rashadul Hasan Rakib, Dalhousie University, Nova Scotia, Canada
Norbert Zeh, Dalhousie University, Nova Scotia, Canada
Evangelos Milios, Dalhousie University, Nova Scotia, Canada

DocEng '20: Proceedings of the ACM Symposium on Document Engineering 2020, September 2020, Article No.: 13, Pages 1–4, https://doi.org/10.1145/3395027.3419589

Published: 29 September 2020

ABSTRACT

Short text stream clustering is an important but challenging task, since massive amounts of text are generated on different social media platforms. Given a stream of texts, the proposed method clusters them based on word pairs that occur frequently in the texts (the two words of a pair need not be consecutive). It then detects outliers in the clusters and reassigns each outlier to an appropriate cluster using the semantic similarity between the outlier and the clusters, subject to dynamically computed similarity thresholds. In this way, the proposed method deals efficiently with the concept drift problem. Experimental results demonstrate that the proposed approach outperforms state-of-the-art short text stream clustering algorithms by a statistically significant margin on several short text datasets.
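The two stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`word_pairs`, `cluster_by_frequent_pair`, `reassign_outliers`) is hypothetical. It rests on three loudly stated assumptions: texts are processed as one batch rather than a true stream, bag-of-words cosine similarity stands in for the semantic similarity the paper uses, and the "dynamic" threshold of a cluster is taken to be the mean minus one standard deviation of its members' similarities to the cluster centroid.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def bag(text):
    """Bag-of-words vector of a text (lower-cased, whitespace tokenized)."""
    return Counter(text.lower().split())


def word_pairs(text):
    """Unordered pairs of distinct words; the words need not be adjacent."""
    words = sorted(set(text.lower().split()))
    return list(combinations(words, 2))


def cosine(a, b):
    """Cosine similarity of two Counters (ASSUMED stand-in for semantic similarity)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_by_frequent_pair(texts):
    """Key each text by the most frequent word pair it contains."""
    pair_freq = Counter(p for t in texts for p in word_pairs(t))
    clusters = defaultdict(list)
    for t in texts:
        pairs = word_pairs(t)
        # Ties are broken alphabetically; texts without any pair share the key None.
        key = max(pairs, key=lambda p: (pair_freq[p], p)) if pairs else None
        clusters[key].append(t)
    return dict(clusters)


def reassign_outliers(clusters):
    """Move low-similarity members to the cluster they resemble most."""
    centroids = {k: sum((bag(t) for t in ts), Counter()) for k, ts in clusters.items()}
    thresholds, kept, outliers = {}, {k: [] for k in clusters}, []
    for k, ts in clusters.items():
        sims = [cosine(bag(t), centroids[k]) for t in ts]
        # ASSUMPTION: per-cluster threshold = mean - one standard deviation.
        thresholds[k] = mean(sims) - pstdev(sims)
        for t, s in zip(ts, sims):
            if s < thresholds[k]:
                outliers.append((k, t))
            else:
                kept[k].append(t)
    for home, t in outliers:
        best = max(centroids, key=lambda k: cosine(bag(t), centroids[k]))
        accepted = cosine(bag(t), centroids[best]) >= thresholds[best]
        # An outlier that no cluster accepts stays in its original cluster.
        kept[best if accepted else home].append(t)
    return kept
```

Calling `reassign_outliers(cluster_by_frequent_pair(texts))` on a list of strings returns a dictionary from word pair to member texts. Recomputing the thresholds on every call is what makes them depend on the current contents of each cluster in this sketch; a streaming version would update pair counts and centroids incrementally instead.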

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        1. Cluster analysis

Published in

DocEng '20: Proceedings of the ACM Symposium on Document Engineering 2020
September 2020
130 pages
ISBN:9781450380003
DOI:10.1145/3395027

Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
outlier reassignment
text stream clustering
word pair
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%