Sparse Biterm Topic Model for Short Texts

  • Conference paper
  • Web and Big Data (APWeb-WAIM 2021)

Abstract

Extracting meaningful and coherent topics from short texts is an important task for many real-world applications. The biterm topic model (BTM) is a popular topic model for short texts that explicitly models word co-occurrence patterns at the corpus level. However, BTM ignores the fact that a topic is usually described by only a few words in a given corpus. In other words, the topic-word distribution in a topic model should be highly sparse. Accounting for the sparsity of the topic-word distribution may yield more coherent topics and improve the performance of BTM. In this paper, we propose a sparse biterm topic model (SparseBTM), which incorporates a spike-and-slab prior into BTM to explicitly model topic sparsity. Experiments on two short-text datasets show that our model achieves comparable topic coherence scores and higher classification and clustering performance than BTM.
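As a rough illustration of the corpus-level word co-occurrence patterns that BTM works with, the following minimal Python sketch extracts biterms, taken here to mean unordered word pairs that co-occur in the same short text, and counts them over a toy corpus. The function names `extract_biterms` and `corpus_biterm_counts` and the example documents are hypothetical and are not taken from the paper; the sketch covers only the biterm construction step, not SparseBTM inference or the spike-and-slab prior.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text."""
    # Sort each pair so that (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Count biterms over the whole corpus rather than per document."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy corpus (assumed data, for illustration only).
docs = [
    ["apple", "iphone", "release"],
    ["apple", "iphone", "price"],
]
counts = corpus_biterm_counts(docs)
print(counts[("apple", "iphone")])  # 2
```

Pooling the pairs across all documents is what lets a biterm-based model draw on co-occurrence statistics that a single short text is too sparse to provide.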

Notes

  1. http://jwebpro.sourceforge.net/data-web-snippets.tar.gz.

  2. Can be downloaded from https://github.com/pgcool/iDocNADEe/.

  3. https://radimrehurek.com/gensim/models/coherencemodel.html.

  4. https://github.com/xiaohuiyan/BTM.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62076100), the National Key Research and Development Program of China (Standard knowledge graph for epidemic prevention and production recovering intelligent service platform and its applications), the Fundamental Research Funds for the Central Universities, SCUT (No. D2201300, D2210010), the Science and Technology Programs of Guangzhou (201902010046), and the Science and Technology Planning Project of Guangdong Province (No. 2020B0101100002).

Corresponding author

Correspondence to Yi Cai.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhu, B., Cai, Y., Zhang, H. (2021). Sparse Biterm Topic Model for Short Texts. In: U, L.H., Spaniol, M., Sakurai, Y., Chen, J. (eds) Web and Big Data. APWeb-WAIM 2021. Lecture Notes in Computer Science, vol 12858. Springer, Cham. https://doi.org/10.1007/978-3-030-85896-4_19

  • DOI: https://doi.org/10.1007/978-3-030-85896-4_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85895-7

  • Online ISBN: 978-3-030-85896-4

  • eBook Packages: Computer Science, Computer Science (R0)
