Incorporating word embeddings into topic modeling of short text

Gao, Wang; Peng, Min; Wang, Hua; Zhang, Yanchun; Xie, Qianqian; Tian, Gang

doi:10.1007/s10115-018-1314-7

Regular Paper
Published: 18 December 2018

Volume 61, pages 1123–1145, (2019)

Knowledge and Information Systems

Wang Gao¹, Min Peng¹, Hua Wang², Yanchun Zhang², Qianqian Xie¹ & Gang Tian¹

Abstract

Short texts have become the prevalent format of information on the Internet. Inferring the topics of this type of message has become a critical and challenging task for many applications. Because short texts contain so few words, conventional topic models (e.g., latent Dirichlet allocation and its variants) suffer from a severe data sparsity problem, which makes topic modeling of short texts difficult and unreliable. Recently, word embeddings have proved effective at capturing semantic and syntactic information about words, which can be used to induce similarity measures and semantic correlations among words. Inspired by this, in this paper we design a novel model for short text topic modeling, referred to as the Conditional Random Field regularized Topic Model (CRFTM). CRFTM not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment. Experimental results on two real-world datasets show that our method can extract more coherent topics and significantly outperforms state-of-the-art baselines on several evaluation metrics.
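As a rough illustration of how word embeddings can induce a similarity measure among words, the sketch below scores word pairs by cosine similarity and keeps those above a cutoff. It is not the CRFTM implementation: the toy vectors, the `related_pairs` helper, and the 0.9 threshold are all assumptions made up for this example, and the abstract does not state which similarity measure or cutoff the authors use.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_pairs(embeddings, threshold):
    """Return word pairs (in alphabetical order) whose cosine similarity is >= threshold.

    Hypothetical helper: such pairs are candidates for words that a model
    might encourage to share a topic assignment.
    """
    pairs = []
    for w1, w2 in combinations(sorted(embeddings), 2):
        if cosine_similarity(embeddings[w1], embeddings[w2]) >= threshold:
            pairs.append((w1, w2))
    return pairs


# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings (e.g., from word2vec) typically have hundreds of dimensions.
toy_embeddings = {
    "soccer": [0.9, 0.1, 0.0],
    "football": [0.8, 0.2, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}

pairs = related_pairs(toy_embeddings, threshold=0.9)
print(pairs)
```

With these toy vectors, only the two within-theme pairs clear the threshold, while cross-theme pairs such as "soccer" and "stock" fall far below it. In practice the cutoff would need tuning for the embedding model at hand.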

Notes

Code of CRFTM: http://github.com/nonobody/CRFTM.
http://acube.di.unipi.it/tmn-dataset/.
http://github.com/jacoxu/StackOverflow.
Stop word list is from NLTK: http://www.nltk.org/.
http://jgibblda.sourceforge.net.
http://code.google.com/p/word2vec.
https://radimrehurek.com/gensim/models/doc2vec.html.
http://aksw.org/Projects/Palmetto.html.
http://scikit-learn.org/.

Acknowledgements

We thank the anonymous reviewers for their very useful comments and suggestions. This research was partially supported by the National Science Foundation of China (NSFC, Nos. 61472291 and 61772382).

Author information

Authors and Affiliations

School of Computer Science, Wuhan University, Wuhan, China
Wang Gao, Min Peng, Qianqian Xie & Gang Tian
Centre for Applied Informatics, Victoria University, Melbourne, Australia
Hua Wang & Yanchun Zhang

Corresponding authors

Correspondence to Min Peng or Gang Tian.

About this article

Cite this article

Gao, W., Peng, M., Wang, H. et al. Incorporating word embeddings into topic modeling of short text. Knowl Inf Syst 61, 1123–1145 (2019). https://doi.org/10.1007/s10115-018-1314-7

Received: 10 September 2017
Revised: 26 July 2018
Accepted: 28 November 2018
Published: 18 December 2018
Issue Date: 01 November 2019
DOI: https://doi.org/10.1007/s10115-018-1314-7
