Abstract
Extracting meaningful and coherent topics from short texts is an important task for many real world applications. Biterm topic model (BTM) is a popular topic model for short texts by explicitly model word co-occurrence patterns in the corpus level. However, BTM ignores the fact that a topic is usually described by a few words in a given corpus. In other words, the topic word distribution in topic model should be highly sparse. Understanding the sparsity in topic word distribution may get more coherent topics and improve the performance of BTM. In this paper, we propose a sparse biterm topic model (SparseBTM) which combines a spike and slab prior into BTM to explicitly model the topic sparsity. Experiments on two short texts datasets show that our model can get comparable topic coherent scores and higher classification and clustering performance than BTM.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Asuncion, A., Welling, M., Smyth, P., Teh, Y.W.: On smoothing and inference for topic models. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pp. 27–34 (2009)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Chen, W., Wang, J., Zhang, Y., Yan, H., Li, X.: User based aggregation for biterm topic model. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 489–494 (2015)
Cheng, X., Yan, X., Lan, Y., Guo, J.: BTM: topic modeling over short texts. IEEE Trans. Knowl. Data Eng. 26(12), 2928–2941 (2014)
Doshi-Velez, F., Wallace, B.C., Adams, R.: Graph-sparse LDA: a topic model with structured sparsity. In: 29th AAAI Conference on Artificial Intelligence (2015)
Heiler, M., Schnörr, C.: Learning sparse representations by non-negative matrix factorization and sequential cone programming. J. Mach. Learn. Res. 7, 1385–1407 (2006)
Huang, J., Peng, M., Li, P., Hu, Z., Xu, C.: Improving biterm topic model with word embeddings. World Wide Web 23(6), 3099–3124 (2020). https://doi.org/10.1007/s11280-020-00823-w
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Li, C., Wang, H., Zhang, Z., Sun, A., Ma, Z.: Topic modeling for short texts with auxiliary word embeddings. In: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 165–174 (2016)
Li, X., Zhang, A., Li, C., Guo, L., Wang, W., Ouyang, J.: Relational biterm topic model: short-text topic modeling using word embeddings. Comput. J. 62(3), 359–372 (2019)
Li, X., Zhang, J., Ouyang, J.: Dirichlet multinomial mixture with variational manifold regularization: topic modeling over short texts. Proc. AAAI Conf. Artif. Intell. 33, 7884–7891 (2019)
Lin, H., Zuo, Y., Liu, G., Li, H., Wu, J., Wu, Z.: A pseudo-document-based topical n-grams model for short texts. World Wide Web 23(6), 3001–3023 (2020)
Lin, T., Tian, W., Mei, Q., Cheng, H.: The dual-sparse topic model: mining focused topics and focused terms in short text. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 539–550 (2014)
Lu, H.Y., Xie, L.Y., Kang, N., Wang, C.J., Xie, J.Y.: Don’t forget the quantifiable relationship between words: using recurrent neural network for short text topic discovery. In: 31st AAAI Conference on Artificial Intelligence (2017)
Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 889–892 (2013)
Peng, M., et al.: Sparse topical coding with sparse groups. In: Cui, B., Zhang, N., Xu, J., Lian, X., Liu, D. (eds.) WAIM 2016. LNCS, vol. 9658, pp. 415–426. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39937-9_32
Peng, M., Xie, Q., Wang, H., Zhang, Y., Tian, G.: Bayesian sparse topical coding. IEEE Trans. Knowl. Data Eng. 31(6), 1080–1093 (2018)
Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web, pp. 91–100 (2008)
Quan, X., Kit, C., Ge, Y., Pan, S.J.: Short and sparse text topic modeling via self-aggregation. In: 24th International Joint Conference on Artificial Intelligence (2015)
Ročková, V., George, E.I.: The spike-and-slab LASSO. J. Am. Stat. Assoc. 113(521), 431–444 (2018)
Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the 8th ACM International Conference on Web Search and Data Mining, pp. 399–408 (2015)
She, J., Chen, L.: TOMOHA: topic model-based hashtag recommendation on Twitter. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 371–372 (2014)
Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Sharing clusters among related groups: hierarchical dirichlet processes. In: Advances in Neural Information Processing Systems, pp. 1385–1392 (2005)
Vitale, D., Ferragina, P., Scaiella, U.: Classification of short texts by deploying topical annotations. In: Baeza-Yates, R., et al. (eds.) ECIR 2012. LNCS, vol. 7224, pp. 376–387. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28997-2_32
Wang, C., Blei, D.M.: Decoupling sparsity and smoothness in the discrete hierarchical dirichlet process. In: Advances in Neural Information Processing Systems, pp. 1982–1989 (2009)
Wei, X., Croft, W.B.: LDA-based document models for ad-hoc retrieval. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 178–185 (2006)
Wu, X., Li, C., Zhu, Y., Miao, Y.: Short text topic modeling with topic distribution quantization and negative sampling decoder. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1772–1782 (2020)
Wu, X., Cai, Y., Li, Q., Xu, J., Leung, H.: Combining weighted category-aware contextual information in convolutional neural networks for text classification. World Wide Web 23(5), 2815–2834 (2020)
Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456 (2013)
Yang, G., Wen, D., Chen, N.S., Sutinen, E., et al.: A novel contextual topic model for multi-document summarization. Expert Syst. Appl. 42(3), 1340–1352 (2015)
Yang, Y., et al.: Dataless short text classification based on biterm topic model and word embeddings. In: 29th International Joint Conference on Artificial Intelligence (2020)
Yin, J., Wang, J.: A dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 233–242 (2014)
Zhu, J., Xing, E.P.: Sparse topical coding. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pp. 831–838 (2011)
Zuo, Y., et al.: Topic modeling of short texts: a pseudo-document view. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2105–2114 (2016)
Acknowledgments
This work was supported by National Natural Science Foundation of China (No. 62076100), National Key Research and Development Program of China (Standard knowledge graph for epidemic prevention and production recovering intelligent service platform and its applications), the Fundamental Research Funds for the Central Universities, SCUT (No. D2201300, D2210010), the Science and Technology Programs of Guangzhou(201902010046), the Science and Technology Planning Project of Guangdong Province (No. 2020B0101100002).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, B., Cai, Y., Zhang, H. (2021). Sparse Biterm Topic Model for Short Texts. In: U, L.H., Spaniol, M., Sakurai, Y., Chen, J. (eds) Web and Big Data. APWeb-WAIM 2021. Lecture Notes in Computer Science(), vol 12858. Springer, Cham. https://doi.org/10.1007/978-3-030-85896-4_19
Download citation
DOI: https://doi.org/10.1007/978-3-030-85896-4_19
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85895-7
Online ISBN: 978-3-030-85896-4
eBook Packages: Computer ScienceComputer Science (R0)