Skip to main content
Log in

GOWSeqStream: an integrated sequential embedding and graph-of-words for short text stream clustering

  • Original Article
  • Published:
Neural Computing and Applications Aims and scope Submit manuscript

Abstract

Recently, the proposed non-parametric Bayesian based techniques which aim to model short-length textual documents through the multinomial distribution on the bag-of-words (BOW), aka mixture model-based approach. Although existing model can effectively deal with the topic/concept drift and textual sparsity problems, they are unable to exploit the semantic sequential representation of text as well as the co-occurrence relationships between words. To meet these challenges, we propose a novel approach called as GOWSeqStream. Our proposed model is a joint integration of graph-of-words (GOW) and deep sequential encoding within the Dirichlet Process Mixture Model (DPMM) framework to improve the performance of text clustering task. Extensive experiments in benchmark real-world datasets demonstrate the effectiveness of our proposed GOWSeqStream model in comparing with recent state-of-the-art baselines. Experimental outputs in terms of NMI standard metric demonstrate the outperformances of proposed GOWSeqStream model over the recent well-known text stream clustering baselines, such as MStream, NPMM and OSDM.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Similar content being viewed by others

Notes

  1. 20-Newsgroups dataset: http://qwone.com/~jason/20Newsgroups/.

  2. Tweet-Set dataset: http://trec.nist.gov/data/microblog.html.

  3. Google News website: https://news.google.com/news/

  4. NLP-Toolkit: https://www.nltk.org/.

  5. Word2Vec & pretrained word embeddings data:https://code.google.com/archive/p/word2vec/.

  6. DTM model (C/C + +): https://github.com/blei-lab/dtm.

  7. MStream model (Python): https://github.com/jackyin12/MStream.

  8. OSDM model (Python): https://github.com/JayKumarr/OSDM.

  9. VNTC dataset: https://github.com/duyvuleo/VNTC.

References

  1. Ganguli I, Sil J, Sengupta N (2021) Nonparametric method of topic identification using granularity concept and graph-based modeling. Neural Comput Appl 1–21

  2. Hassani A, Iranmanesh A, Mansouri N (2021)Text mining using nonnegative matrix factorization and latent semantic analysis. Neural Comput Appl 1–22

  3. Nakamura T, Shirakawa M, Hara T, Nishio S (2019) Wikipedia-based relatedness measurements for multilingual short text clustering. ACM Trans Asian Low-Resour Language Inf Process (TALLIP) 18(2):1–25

    Article  Google Scholar 

  4. Ruan YP, Ling ZH, Zhu X (2020) Condition-transforming variational autoencoder for generating diverse short text conversations. ACM Trans Asian Low-Resour Language Inf Process (TALLIP) 19(6):1–13

    Article  Google Scholar 

  5. Zhao S, Gao Y, Ding G, Chua TS (2017) Real-time multimedia social event detection in microblog. IEEE Trans Cybernet 48(11):3218–3231

    Article  Google Scholar 

  6. Pham P, Nguyen LT, Vo B, & Yun U (2021) Bot2Vec: a general approach of intra-community oriented representation learning for bot detection in different types of social networks. Inf Syst 101771

  7. Blei DM, & Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning

  8. Amoualian H, Clausel M, Gaussier E, & Amini MR (2016) Streaming-lda: A copula-based approach to modeling topic dependencies in document streams. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining

  9. Du N, Farajtabar M, Ahmed A, Smola AJ, & Song L (2015) Dirichlet-hawkes processes with applications to clustering continuous-time document streams. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining

  10. Yin J and Wang J (2015) A text clustering algorithm using an online clustering scheme for initialization. In: ACM International Conference on Knowledge Discovery and Data Mining

  11. Zhao Y, Liang S, Ren Z, Ma J, Yilmaz E, and de Rijke M (2016) Explainable user clustering in short text streams. In: International ACM conference on research and de- velopment in information retrieval

  12. Liang S, Yilmaz E, & Kanoulas E (2016) Dynamic clustering of streaming short documents. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining

  13. Livieris IE, Stavroyiannis S, Iliadis L, Pintelas P (2021) Smoothing and stationarity enforcement framework for deep learning time-series forecasting. Neural Comput Appl 1–15

  14. Yin J, Chao D, Liu Z, Zhang W, Yu X, Wang J (2018) Model-based clustering of short text streams. In: ACM international conference on knowledge discovery and data mining

  15. Chen J, Gong Z, Liu W (2020) A Dirichlet process biterm-based mixture model for short text stream clustering. Appl Intell 1–11

  16. Ameur MSH, Belkebir R, Guessoum A (2020) Robust arabic text categorization by combining convolutional and recurrent neural networks. ACM Trans Asian Low-Resour Language Inf Process (TALLIP) 19(5):1–16

    Article  Google Scholar 

  17. Kumar J, Shao J, Uddin S, Ali W (2020) An online semantic-enhanced dirichlet model for short text stream clustering. In: Proceedings of the 58th annual meeting of the association for computational linguistics

  18. Chen J, Gong Z, Liu W (2019) A nonparametric model for online topic discovery with word embeddings. Inf Sci 504:32–47

    Article  MathSciNet  Google Scholar 

  19. Liu Y, Che W, Wang Y, Zheng B, Qin B, Liu T (2019) Deep contextualized word embeddings for universal dependency parsing. ACM Trans Asian Low-Resour Language Inf Process (TALLIP) 19(1):1–17

    Google Scholar 

  20. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint http://arxiv.org/abs/1301.3781

  21. Pirbhulal S, Pombo N, Felizardo V, Garcia N, Sodhro AH, Mukhopadhyay SC (2019) Towards machine learning enabled security framework for iot-based healthcare. In: 2019 13th international conference on sensing technology (ICST), IEEE

  22. AHMAD Ijaz et al (2020) Machine learning meets communication networks: current trends and future challenges. IEEE Access 8:223418–223460

  23. Lin Y, Jin X, Chen J, Sodhro AH, Pan Z (2019) An analytic computation-driven algorithm for decentralized multicore systems. Futur Gener Comput Syst 96:101–110

    Article  Google Scholar 

  24. Talat R, Obaidat MS, Muzammal M, Sodhro AH, Luo Z, Pirbhulal S (2020) A decentralised approach to privacy preserving trajectory mining. Futur Gener Comput Syst 102:382–392

    Article  Google Scholar 

  25. Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

  26. Wei X, Sun J, Wang X (2007) Dynamic mixture models for multiple time-series. IJCAI 7:2909–2914

    Google Scholar 

  27. Iwata T, Watanabe S, Yamada T, Ueda N (2009) Topic tracking model for analyzing consumer purchase behavior. In: Twenty-first international joint conference on artificial intelligence

  28. Ahmed A, Xing E (2008) Dynamic non-parametric mixture models and the recurrent chinese restaurant process: with applications to evolutionary clustering. In: Proceedings of the 2008 SIAM international conference on data mining. Society for industrial and applied mathematics

  29. Aggarwal CC, Philip SY, Han J, & Wang J (2003) in A framework for clustering evolving data streams. In: Proceedings 2003 VLDB conference

  30. Zhong S (2005) Efficient streaming text clustering. Neural Netw 18(5–6):790–798

    Article  Google Scholar 

  31. Cao F, Estert M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining

  32. Shou L, Wang Z, Chen K, Chen G (2013) Sumblr: continuous summarization of evolving tweet streams. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval

  33. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022

    MATH  Google Scholar 

  34. Aggarwal CC, Philip SY (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196

    Article  Google Scholar 

  35. Yan X, Han J (2002) gspan: graph-based substructure pattern mining. In: Proceedings of IEEE international conference on data mining

  36. Huan J, Wang W, Prins J (2003) Efficient mining of frequent subgraphs in the presence of isomorphism. In: Third IEEE international conference on data mining

  37. Duan T, Lou Q, Srihari SN, & Xie X (2019) Sequential embedding induced text clustering, a non-parametric bayesian approach. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining

  38. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K & Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

  39. Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies

  40. Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International conference on machine learning (PMLR)

  41. Hoang VCD, Dinh D, Le Nguyen N, Ngo HQ (2007) A comparative study on vietnamese text classification methods. In: 2007 IEEE international conference on research, innovation and vision for the future

  42. Vu T, Nguyen DQ, Nguyen DQ, Dras M, Johnson M (2018) Vncorenlp: a Vietnamese natural language processing toolkit. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: demonstrations

Download references

Acknowledgement

This research is funded by Thu Dau Mot University, Binh Duong, Vietnam under grant number DT21.1-069.

Funding

This research is funded by Thu Dau Mot University, Binh Duong, Vietnam under grant number DT21.1–069.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tham Vo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Vo, T. GOWSeqStream: an integrated sequential embedding and graph-of-words for short text stream clustering. Neural Comput & Applic 34, 4321–4341 (2022). https://doi.org/10.1007/s00521-021-06563-w

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00521-021-06563-w

Keywords

Navigation