Abstract
Long text clustering is of great significance and practical value in data mining tasks such as information retrieval, text integration, and data compression. Compared with short text clustering, long text clustering involves richer semantic information to represent and process, making it a challenging problem. Most recent techniques either rely solely on dynamic word embeddings from pre-training as a form of transfer learning or simply combine transformers with traditional clustering methods, and they remain difficult to extend to long text because of the quadratic computational complexity of self-attention. In this paper, a novel model combining transfer learning and dynamic feedback, called deep embedded clustering with transformer (DEC-transformer), is proposed. To better capture the semantic relationships between sentences in documents, a novel transfer learning technique is first applied to long text clustering for pre-training. Unlike previous work, a two-stage training task is designed that treats semantic representation and text clustering as a unified process, and the parameters are dynamically optimized through adaptive feedback to further improve efficiency. Experimental results on the test sets show that the proposed model achieves substantial gains in accuracy over several benchmarks. It is also robust, maintaining good performance on noisy datasets.
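The deep embedded clustering component that the abstract builds on can be illustrated with a minimal sketch. The soft assignment via a Student's t kernel, the sharpened target distribution, and the KL objective below follow the standard DEC formulation; the NumPy implementation and function names are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def soft_assign(z, mu, alpha=1.0):
    # Soft cluster assignment: Student's t kernel between embeddings
    # z (n, d) and cluster centroids mu (k, d); rows of q sum to 1.
    dist2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # Sharpen assignments so high-confidence points dominate training,
    # normalized per cluster to prevent large clusters from distorting
    # the embedding space.
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(p, q, eps=1e-12):
    # Mean per-sample KL divergence KL(P || Q), the clustering objective.
    return float((p * np.log((p + eps) / (q + eps))).sum() / p.shape[0])
```

In a DEC-style loop, the encoder (here, a transformer producing document embeddings) and the centroids would be updated by gradient descent on this KL loss, with the target distribution recomputed periodically.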









Data availability statement
The datasets used in the experiments are open-source and available at the following links:
Fudan Corpus: https://www.heywhale.com/mw/dataset/5d3a9c86cf76a600360edd04/content
SogouCS Corpus: https://www.sogou.com/labs/resource/cs.php
Notes
The code of this work is available at https://github.com/Uchiha-Monroe/DEC-transformer.
The data are available at https://www.kesci.com/home/dataset/5d3a9c86cf76a600360edd04.
The data are available at https://hyper.ai/datasets/9270.
Funding
This work was supported by The National Natural Science Foundation of China (No. 61806221).
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Ao Zou, Gang Chen and Wenning Hao. The first draft of the manuscript was written by Ao Zou and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Zou, A., Hao, W., Chen, G. et al. DEC-transformer: deep embedded clustering with transformer on Chinese long text. Pattern Anal Applic 26, 1349–1362 (2023). https://doi.org/10.1007/s10044-023-01161-z