GOWSeqStream: an integrated sequential embedding and graph-of-words for short text stream clustering

Vo, Tham

doi:10.1007/s00521-021-06563-w

Original Article
Published: 28 October 2021

Volume 34, pages 4321–4341, (2022)
Neural Computing and Applications

Tham Vo ORCID: orcid.org/0000-0001-7291-4168¹

Abstract

Recently proposed non-parametric Bayesian techniques model short textual documents through a multinomial distribution over the bag-of-words (BOW) representation, also known as the mixture model-based approach. Although existing models can effectively deal with topic/concept drift and textual sparsity, they are unable to exploit the semantic sequential representation of text or the co-occurrence relationships between words. To meet these challenges, we propose a novel approach called GOWSeqStream. The proposed model jointly integrates graph-of-words (GOW) and deep sequential encoding within the Dirichlet Process Mixture Model (DPMM) framework to improve text clustering performance. Extensive experiments on real-world benchmark datasets demonstrate the effectiveness of GOWSeqStream in comparison with recent state-of-the-art baselines. Results in terms of the standard NMI metric show that GOWSeqStream outperforms recent well-known text stream clustering baselines such as MStream, NPMM and OSDM.
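To make the graph-of-words idea concrete, the following is a minimal sketch of one common way to capture word co-occurrence in a short document: link two distinct words whenever they fall within a small sliding window. It is an illustration only, not the construction used in GOWSeqStream; the function name `build_graph_of_words`, the undirected edges, and the window size of 2 are all assumptions made for this example.

```python
from collections import Counter


def build_graph_of_words(tokens, window=2):
    """Return undirected co-occurrence edge weights for one tokenized document.

    Two distinct words are linked when they appear within `window` positions
    of each other; the edge weight counts how often that happens.
    (Hypothetical helper: window size and undirected edges are assumptions.)
    """
    edges = Counter()
    for i, left in enumerate(tokens):
        # Pair the current word with the next `window` words.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sort the pair so (a, b) and (b, a) share one edge.
                edges[tuple(sorted((left, right)))] += 1
    return edges


doc = "short text stream clustering of short text".split()
graph = build_graph_of_words(doc, window=2)

# List the heaviest edges first.
for (a, b), weight in graph.most_common(3):
    print(a, b, weight)
```

A plain bag-of-words would record only that "short" and "text" each occur twice; the edge weights above additionally record that the two words occur next to each other, which is the kind of co-occurrence relationship the abstract says BOW-based mixture models miss.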

Notes

20-Newsgroups dataset: http://qwone.com/~jason/20Newsgroups/.
Tweet-Set dataset: http://trec.nist.gov/data/microblog.html.
Google News website: https://news.google.com/news/
NLP-Toolkit: https://www.nltk.org/.
Word2Vec & pretrained word embeddings data: https://code.google.com/archive/p/word2vec/.
DTM model (C/C++): https://github.com/blei-lab/dtm.
MStream model (Python): https://github.com/jackyin12/MStream.
OSDM model (Python): https://github.com/JayKumarr/OSDM.
VNTC dataset: https://github.com/duyvuleo/VNTC.

Acknowledgement

This research is funded by Thu Dau Mot University, Binh Duong, Vietnam under grant number DT21.1-069.

Author information

Authors and Affiliations

Thu Dau Mot University, Binh Duong, Vietnam
Tham Vo

Corresponding author

Correspondence to Tham Vo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Vo, T. GOWSeqStream: an integrated sequential embedding and graph-of-words for short text stream clustering. Neural Comput & Applic 34, 4321–4341 (2022). https://doi.org/10.1007/s00521-021-06563-w

Received: 02 February 2021
Accepted: 20 September 2021
Published: 28 October 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s00521-021-06563-w
