Abstract
Short text clustering is a challenging task in Natural Language Processing (NLP), since it is hard to learn discriminative representations from such limited information. In this paper, fused multi-embedded features are employed to enrich the representations of short texts. A denoising autoencoder with an attention layer is then adopted to extract low-dimensional features from the multi-embeddings while resisting the disturbance of noisy texts. Furthermore, we propose a novel distribution estimation that jointly utilizes the soft cluster assignment and a prior target distribution transition to better fine-tune the encoder. Combining the above components, we propose a deep multi-embedded self-supervised model (DMESSM) for short text clustering. Head-to-head comparisons with state-of-the-art methods on benchmark datasets show that our method outperforms them.
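The self-supervised fine-tuning described in the abstract builds on the soft cluster assignment and auxiliary target distribution scheme introduced in DEC (Xie et al., 2016). As a rough orientation only, the sketch below shows that standard scheme in NumPy, together with a naive averaging fusion of two embedding views as one simple meta-embedding baseline. It is not the paper's implementation: DMESSM's attention-based denoising autoencoder and its prior target distribution transition are not reproduced here, and all names (`soft_assignment`, `emb_a`, etc.) are illustrative.

```python
import numpy as np

def soft_assignment(z, centroids, alpha=1.0):
    """Student's t-kernel soft cluster assignment (DEC-style).

    z: (n, d) encoded texts; centroids: (k, d).
    Returns q of shape (n, k) whose rows sum to 1.
    """
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary target: square q and renormalize by
    per-cluster frequency, emphasizing confident assignments."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_loss(p, q, eps=1e-10):
    """KL(P || Q): the self-training objective minimized during fine-tuning."""
    return float((p * np.log((p + eps) / (q + eps))).sum())

# Toy usage: fuse two hypothetical embedding views by averaging,
# then compute the self-training loss on the fused representation.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 16))   # e.g., one sentence-embedding view
emb_b = rng.normal(size=(8, 16))   # e.g., another view, same dimension
z = (emb_a + emb_b) / 2.0          # stand-in for the encoder output
centroids = z[rng.choice(8, size=3, replace=False)]  # e.g., k-means init
q = soft_assignment(z, centroids)
p = target_distribution(q)
print("KL(P||Q) =", kl_loss(p, q))
```

In the full pipeline, p is periodically recomputed from q and the encoder is updated by gradient descent on the KL objective, so that the representation and the cluster assignments improve jointly.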
Change history
07 September 2021
Due to an oversight, the second affiliation of three co-authors was omitted in the originally published version. The revised version has the correct affiliations of all co-authors.
Notes
- 1.
Our code is available at https://github.com/zkharryhhhh/DMESSM.
Acknowledgments
This work was supported by the National Key Research and Development Program of China under Grant 2019YFB1405100, and the National Natural Science Foundation of China under Grants 61802380 and 62076232.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, K., Lian, Z., Li, J., Li, H., Hu, X. (2021). Short Text Clustering with a Deep Multi-embedded Self-supervised Model. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2021. Lecture Notes in Computer Science, vol. 12895. Springer, Cham. https://doi.org/10.1007/978-3-030-86383-8_12
DOI: https://doi.org/10.1007/978-3-030-86383-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86382-1
Online ISBN: 978-3-030-86383-8