Abstract
Inferring coherent and diverse latent topics from short texts is crucial in topic modeling. Existing approaches leverage the Generalized Pólya Urn (GPU) model to incorporate external knowledge and improve topic modeling performance. While the GPU scheme successfully promotes similar words within the same topic, it has two major limitations. First, it assumes that similar words contribute equally to the same topic, disregarding the distinctiveness of different words. Second, it assumes that a given word should receive the same promotion in every topic, overlooking the variation in word importance across topics. To address these limitations, we propose a novel Adaptive Pólya Urn (APU) scheme, which builds a topic-word correlation from both external and local knowledge, and the Adaptive Pólya Urn Dirichlet Multinomial Mixture (APU-DMM) model, which uses this topic-word correlation as an adaptive weight in the topic inference process. Our extensive experimental study on three benchmark datasets shows that our model outperforms eight baseline methods in terms of topic coherence and topic diversity. The code is available at https://github.com/ddwangr/APUDMM.
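The contrast between a fixed and an adaptive promotion can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above): the function names, the promotion amount `mu`, and the hand-written `weight` table are hypothetical, and the abstract does not specify how the actual topic-word correlation is computed.

```python
# Toy illustration only; all names and numbers below are assumptions.

def gpu_update(counts, topic, word, similar, mu=0.3):
    """GPU-style update: the sampled word gains 1 and every similar
    word gains the same fixed amount mu, whatever the topic is."""
    counts[topic][word] += 1.0
    for other in similar.get(word, ()):
        counts[topic][other] += mu

def adaptive_update(counts, topic, word, similar, weight):
    """Adaptive-style update: each similar word gains an amount that
    depends on both the topic and the word (hypothetical weights)."""
    counts[topic][word] += 1.0
    for other in similar.get(word, ()):
        counts[topic][other] += weight[topic][other]

vocab = ["apple", "fruit", "iphone"]   # word ids 0, 1, 2
similar = {0: [1, 2]}                  # "apple" is similar to both others

# Hypothetical topic-word weights: topic 0 is food-like, topic 1 tech-like.
weight = [[0.0, 0.8, 0.1],
          [0.0, 0.1, 0.8]]

gpu_counts = [[0.0] * len(vocab) for _ in range(2)]
apu_counts = [[0.0] * len(vocab) for _ in range(2)]

# Assign one occurrence of "apple" to each topic under both schemes.
for topic in (0, 1):
    gpu_update(gpu_counts, topic, 0, similar)
    adaptive_update(apu_counts, topic, 0, similar, weight)

print(gpu_counts)  # both topics receive identical promotions
print(apu_counts)  # promotions differ by topic and by word
```

Under the fixed scheme both topics end up with the same counts for "fruit" and "iphone", which mirrors the two limitations named in the abstract; with topic-dependent weights the two topics diverge.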
Acknowledgement
This work was supported by the National Key R&D Program of China: Research on the applicability of port food risk traceability, early warning and emergency assessment models (No. 2019YFC1605504).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, M.J. et al. (2024). Topic Modeling for Short Texts via Adaptive Pólya Urn Dirichlet Multinomial Mixture. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1968. Springer, Singapore. https://doi.org/10.1007/978-981-99-8181-6_28
DOI: https://doi.org/10.1007/978-981-99-8181-6_28
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8180-9
Online ISBN: 978-981-99-8181-6
eBook Packages: Computer Science (R0)