PSLDA: a novel supervised pseudo document-based topic model for short texts

  • Research Article
Frontiers of Computer Science

Abstract

Online social media applications such as Twitter and Weibo have produced a huge volume of short texts. However, efficiently mining semantic topics from short texts remains challenging because of the sparseness of word occurrences and the diversity of topics. To address these problems, we propose a novel supervised pseudo-document-based maximum entropy discrimination latent Dirichlet allocation model (PSLDA for short). Specifically, we first assume that short texts are generated from normal-sized latent pseudo documents, and that the topic distributions are sampled from the pseudo documents. In this way, the model reduces the sparseness of word occurrences and the diversity of topics because it implicitly aggregates short texts into longer, higher-level pseudo documents. To make full use of the label information in the training data, we introduce labels into the model and further propose a supervised topic model that learns a reasonable distribution of topics. Extensive experiments demonstrate that the proposed method outperforms several state-of-the-art methods.
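
The abstract states only that short texts are generated from latent pseudo documents and that topic distributions are sampled from those pseudo documents. The sketch below illustrates one plausible reading of that generative story and is not the authors' implementation. The function names, the symmetric Dirichlet hyperparameters `alpha` and `beta`, the fixed text length, and the uniform assignment of texts to pseudo documents are all assumptions made for illustration. The label-dependent (maximum entropy discrimination) part of the model is omitted.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_texts, num_pseudo_docs, num_topics, vocab_size,
                    words_per_text=8, alpha=0.5, beta=0.1, seed=0):
    """Generate short texts through latent pseudo documents (illustrative only)."""
    rng = random.Random(seed)
    # One word distribution per topic.
    topic_word = [sample_dirichlet(beta, vocab_size, rng)
                  for _ in range(num_topics)]
    # One topic distribution per pseudo document, NOT per short text.
    # Many short texts share each distribution, which is what eases sparsity.
    pseudo_topic = [sample_dirichlet(alpha, num_topics, rng)
                    for _ in range(num_pseudo_docs)]
    corpus = []
    for _ in range(num_texts):
        # Assumption: each short text belongs to one pseudo document,
        # chosen uniformly at random.
        p = rng.randrange(num_pseudo_docs)
        words = []
        for _ in range(words_per_text):
            # Topic comes from the pseudo document's topic distribution.
            z = rng.choices(range(num_topics), weights=pseudo_topic[p])[0]
            # Word comes from the chosen topic's word distribution.
            w = rng.choices(range(vocab_size), weights=topic_word[z])[0]
            words.append(w)
        corpus.append((p, words))
    return corpus


if __name__ == "__main__":
    for pseudo_doc, words in generate_corpus(5, 2, 3, 20):
        print(pseudo_doc, words)
```

Because the number of pseudo documents is much smaller than the number of short texts, each topic distribution is estimated from far more words than a single short text provides.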

Author information

Corresponding author

Correspondence to Guozhu Jia.

Additional information

Mingtao Sun is a PhD candidate in the School of Economics and Management, Beihang University, China. His research interests include big data processing and education administration.

Xiaowei Zhao is currently pursuing the PhD degree in Computer Science with Beihang University, China. Her main research interests include transfer learning and sentiment analysis.

Jingjing Lin is currently a senior student at the School of Instrumentation and Optoelectronic Engineering, Beihang University, China. Her research interests include text classification, natural language inference, and sentiment analysis.

Jian Jing received the MS degree in the Engineering of Computer Technology from Beihang University, China in 2021. His research interests include knowledge reasoning, algorithms, and big data processing.

Deqing Wang received the PhD degree in computer science from Beihang University, China in 2013. He is currently an Associate Professor with the School of Computer Science and the Deputy Chief Engineer with the National Engineering Research Center for Science Technology Resources Sharing and Service, Beihang University, China. His research focuses on text categorization and data mining for software engineering and machine learning.

Guozhu Jia received the PhD degree from Aalborg University, Denmark. He is currently a Professor in the School of Economics and Management, Beihang University, China, and a member of the Expert Committee of the China Manufacturing Servitization Alliance. He is also a director of the China Innovation Method Society.

About this article

Cite this article

Sun, M., Zhao, X., Lin, J. et al. PSLDA: a novel supervised pseudo document-based topic model for short texts. Front. Comput. Sci. 16, 166350 (2022). https://doi.org/10.1007/s11704-021-0606-3

  • DOI: https://doi.org/10.1007/s11704-021-0606-3
