A Pseudo-document-based Topical N-grams model for short texts

Lin, Hao; Zuo, Yuan; Liu, Guannan; Li, Hong; Wu, Junjie; Wu, Zhiang

doi:10.1007/s11280-020-00814-x

Published: 23 July 2020

World Wide Web, volume 23, pages 3001–3023 (2020)

Abstract

In recent years, short text topic modeling has drawn considerable attention from interdisciplinary researchers. Various customized topic models have been proposed to tackle the semantic sparseness of short texts. Most (if not all) of them follow the bag-of-words assumption, which is inadequate because word order and phrases are often critical to capturing the meaning of texts. On the other hand, while some existing topic models are sensitive to word order, they do not perform well on short texts because of severe data sparseness. To address these issues, we propose the Pseudo-document-based Topical N-Grams model (PTNG), which alleviates the data sparsity problem of short texts while remaining sensitive to word order. Extensive experiments on three real-world data sets against state-of-the-art baselines demonstrate that PTNG learns high-quality topics according to UCI coherence scores and yields more discriminative semantic representations of short texts according to classification results.
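
The UCI coherence score mentioned in the abstract is commonly defined as the average pointwise mutual information (PMI) over pairs of a topic's top words. The following is a minimal Python sketch of that idea, not the authors' evaluation code: the function name `uci_coherence`, the smoothing constant `eps`, and the use of whole-document co-occurrence (rather than sliding windows over an external reference corpus) are assumptions made for illustration.

```python
import math
from itertools import combinations


def uci_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words.

    Assumption: probabilities are estimated from document-level
    co-occurrence in `documents` (a list of token lists); published
    evaluations typically use sliding windows over a reference corpus.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    if n_docs == 0 or len(top_words) < 2:
        raise ValueError("need at least one document and two top words")

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    for w in top_words:
        if prob(w) == 0.0:
            raise ValueError(f"word {w!r} never occurs in the corpus")

    pairs = list(combinations(top_words, 2))
    total = 0.0
    for w1, w2 in pairs:
        # eps keeps the logarithm finite when a pair never co-occurs.
        total += math.log((prob(w1, w2) + eps) / (prob(w1) * prob(w2)))
    return total / len(pairs)


# Hypothetical toy corpus, for illustration only.
docs = [
    ["apple", "banana"],
    ["apple", "banana"],
    ["cherry"],
    ["cherry", "apple"],
]
coherent = uci_coherence(["apple", "banana"], docs)
incoherent = uci_coherence(["banana", "cherry"], docs)
```

In this toy corpus, "apple" and "banana" co-occur often, so their score is higher than that of "banana" and "cherry", which never appear together. Higher values indicate that a topic's top words tend to co-occur.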

Notes

http://acube.di.unipi.it/tmn-dataset/
http://jgibblda.sourceforge.net
http://www.csie.ntu.edu.tw/~cjlin/liblinear/

Acknowledgements

Dr. Junjie Wu’s work was partially supported by the National Key R&D Program of China (2019YFB2101804) and the National Natural Science Foundation of China (U1636210, 71725002, 71531001). Dr. Guannan Liu was supported in part by NSFC under Grant 71701007. Dr. Yuan Zuo was partially supported by the National Natural Science Foundation of China (NSFC) under Grant 71901012, and by the China Postdoctoral Science Foundation under Grant 2018M640045. Dr. Hong Li was partially supported by NSFC under Grant 71471009. Dr. Zhiang Wu was supported by Industry Projects in Jiangsu S&T Pillar Program under Grant No. BE201910.

Author information

Authors and Affiliations

School of Economics and Management, Beihang University, Beijing, 100191, China
Hao Lin, Yuan Zuo, Guannan Liu, Hong Li & Junjie Wu
Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, 100191, China
Junjie Wu
Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, Beihang University, Beijing, 100191, China
Junjie Wu
Jiangsu Provincial Key Laboratory of E-Business, Nanjing University of Finance and Economics, Nanjing, China
Zhiang Wu

Corresponding author

Correspondence to Yuan Zuo.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lin, H., Zuo, Y., Liu, G. et al. A Pseudo-document-based Topical N-grams model for short texts. World Wide Web 23, 3001–3023 (2020). https://doi.org/10.1007/s11280-020-00814-x

Received: 06 July 2019
Revised: 19 January 2020
Accepted: 30 March 2020
Published: 23 July 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s11280-020-00814-x
