Sparse Biterm Topic Model for Short Texts

Zhu, Bingshan; Cai, Yi; Zhang, Huakui

doi:10.1007/978-3-030-85896-4_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12858)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

Extracting meaningful and coherent topics from short texts is an important task for many real-world applications. The biterm topic model (BTM) is a popular topic model for short texts that explicitly models word co-occurrence patterns at the corpus level. However, BTM ignores the fact that a topic is usually described by only a few words in a given corpus. In other words, the topic-word distribution in a topic model should be highly sparse. Accounting for the sparsity of the topic-word distribution may yield more coherent topics and improve the performance of BTM. In this paper, we propose a sparse biterm topic model (SparseBTM), which incorporates a spike-and-slab prior into BTM to explicitly model topic sparsity. Experiments on two short-text datasets show that our model achieves topic coherence scores comparable to those of BTM and higher classification and clustering performance than BTM.
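The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the full model is not shown in this preview, so the function names, the `slab` and `smooth` parameters, and the toy vocabulary are all assumptions made for illustration. The first function extracts biterms (unordered word pairs that co-occur in one short text). The second shows one common spike-and-slab construction from sparse topic models, in which binary selectors switch words on or off for a topic and a small smoothing term keeps the Dirichlet prior well defined. In such a model the selectors would typically be Bernoulli draws; here they are passed in as a fixed set.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short text.

    The whole text is treated as a single context window, so a text with
    n tokens yields n * (n - 1) / 2 biterms.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def sparse_topic_prior_mean(vocab, focus_words, slab=1.0, smooth=0.01):
    """Mean of a Dirichlet prior whose parameters follow a spike-and-slab pattern.

    Assumed construction (hypothetical, for illustration only):
      selector b_w = 1 if the word is in focus_words, else 0
      Dirichlet parameter for word w = slab * b_w + smooth
    The mean of a Dirichlet is each parameter divided by the parameter sum,
    so words with b_w = 0 keep only a tiny share of the probability mass.
    """
    params = {w: slab * (w in focus_words) + smooth for w in vocab}
    total = sum(params.values())
    return {w: p / total for w, p in params.items()}


if __name__ == "__main__":
    text = ["visit", "apple", "store"]
    print(extract_biterms(text))

    vocab = ["apple", "store", "visit", "game"]
    mean = sparse_topic_prior_mean(vocab, focus_words={"apple", "store"})
    for word, prob in mean.items():
        print(f"{word}: {prob:.4f}")
```

With the toy inputs above, nearly all prior mass falls on the two selected words, which is the sense in which a spike-and-slab prior makes a topic-word distribution sparse.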

Notes

1. http://jwebpro.sourceforge.net/data-web-snippets.tar.gz
2. Can be downloaded from https://github.com/pgcool/iDocNADEe/
3. https://radimrehurek.com/gensim/models/coherencemodel.html
4. https://github.com/xiaohuiyan/BTM

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62076100), the National Key Research and Development Program of China (Standard knowledge graph for epidemic prevention and production recovering intelligent service platform and its applications), the Fundamental Research Funds for the Central Universities, SCUT (No. D2201300, D2210010), the Science and Technology Programs of Guangzhou (No. 201902010046), and the Science and Technology Planning Project of Guangdong Province (No. 2020B0101100002).

Author information

Authors and Affiliations

Key Laboratory of Big Data and Intelligent Robot, South China University of Technology, Ministry of Education, Guangzhou, China
Bingshan Zhu, Yi Cai & Huakui Zhang
South China University of Technology, Guangzhou, China
Bingshan Zhu, Yi Cai & Huakui Zhang

Corresponding author

Correspondence to Yi Cai.

Editor information

Editors and Affiliations

University of Macau, Macau, China
Leong Hou U
University of Caen Normandie, Caen, France
Marc Spaniol
Osaka University, Osaka, Japan
Yasushi Sakurai
South China University of Technology, Guangzhou, China
Junying Chen

Cite this paper

Zhu, B., Cai, Y., Zhang, H. (2021). Sparse Biterm Topic Model for Short Texts. In: U, L.H., Spaniol, M., Sakurai, Y., Chen, J. (eds) Web and Big Data. APWeb-WAIM 2021. Lecture Notes in Computer Science, vol 12858. Springer, Cham. https://doi.org/10.1007/978-3-030-85896-4_19

DOI: https://doi.org/10.1007/978-3-030-85896-4_19
Published: 19 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85895-7
Online ISBN: 978-3-030-85896-4
eBook Packages: Computer Science, Computer Science (R0)
