A Pseudo-document-based Topical N-grams model for short texts

World Wide Web

Abstract

In recent years, short text topic modeling has drawn considerable attention from interdisciplinary researchers. Various customized topic models have been proposed to tackle the semantic sparsity of short texts. Most (if not all) of them follow the bag-of-words assumption, which is inadequate because word order and phrases are often critical to capturing the meaning of texts. On the other hand, while some existing topic models are sensitive to word order, they do not perform well on short texts because of severe data sparsity. To address these issues, we propose the Pseudo-document-based Topical N-Grams model (PTNG), which alleviates the data sparsity problem of short texts while remaining sensitive to word order. Extensive experiments on three real-world data sets against state-of-the-art baselines demonstrate that PTNG learns high-quality topics according to UCI coherence scores and more discriminative semantic representations of short texts according to classification results.
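For readers unfamiliar with the UCI coherence score mentioned above, the following is a minimal sketch of the idea: average the pointwise mutual information (PMI) over all pairs of a topic's top words. It is not the authors' evaluation code. The function name `uci_coherence`, the smoothing constant `eps`, and the use of document-level co-occurrence on a toy corpus are assumptions made for illustration; the metric is usually computed with sliding-window counts over a large external reference corpus.

```python
import math
from itertools import combinations


def uci_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words (illustrative sketch).

    Assumption: co-occurrence is counted per document, not per sliding
    window, and `documents` stands in for an external reference corpus.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2 = prob(w1), prob(w2)
        if p1 == 0 or p2 == 0:
            continue  # skip pairs with a word unseen in the reference corpus
        # eps keeps the logarithm finite when the pair never co-occurs.
        scores.append(math.log((prob(w1, w2) + eps) / (p1 * p2)))
    return sum(scores) / len(scores) if scores else 0.0


docs = [["apple", "banana"], ["apple", "banana"], ["cat", "dog"], ["cat", "dog"]]
coherent = uci_coherence(["apple", "banana"], docs)
incoherent = uci_coherence(["apple", "dog"], docs)
```

On this toy corpus, words that always appear together score higher than words that never co-occur, which is the property the metric relies on when ranking topics.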

Notes

  1. http://acube.di.unipi.it/tmn-dataset/

  2. http://jgibblda.sourceforge.net

  3. http://www.csie.ntu.edu.tw/~cjlin/liblinear/

Acknowledgements

Dr. Junjie Wu’s work was partially supported by the National Key R&D Program of China (2019YFB2101804) and the National Natural Science Foundation of China (NSFC) (U1636210, 71725002, 71531001). Dr. Guannan Liu was supported in part by NSFC under Grant 71701007. Dr. Yuan Zuo was partially supported by NSFC under Grant 71901012 and by the China Postdoctoral Science Foundation under Grant 2018M640045. Dr. Hong Li was partially supported by NSFC under Grant 71471009. Dr. Zhiang Wu was supported by Industry Projects in Jiangsu S&T Pillar Program under Grant No. BE201910.

Author information

Corresponding author

Correspondence to Yuan Zuo.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lin, H., Zuo, Y., Liu, G. et al. A Pseudo-document-based Topical N-grams model for short texts. World Wide Web 23, 3001–3023 (2020). https://doi.org/10.1007/s11280-020-00814-x

  • DOI: https://doi.org/10.1007/s11280-020-00814-x
