Short texts have become the prevalent format of information on the Internet. Inferring the topics of such messages is a critical and challenging task for many applications. Because short texts contain so few words, conventional topic models (e.g., latent Dirichlet allocation and its variants) suffer from a severe data sparsity problem, which makes topic modeling of short texts difficult and unreliable. Recently, word embeddings have proven effective at capturing semantic and syntactic information about words, which can be used to induce similarity measures and semantic correlations among words. Inspired by this, in this paper we design a novel model for short text topic modeling, referred to as the Conditional Random Field regularized Topic Model (CRFTM). CRFTM not only provides a generalized solution that alleviates the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment. Experimental results on two real-world datasets show that our method extracts more coherent topics and significantly outperforms state-of-the-art baselines on several evaluation metrics.
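To make the idea of embedding-induced word similarity concrete, the following is a minimal sketch and not the CRFTM implementation. It assumes cosine similarity as the similarity measure and flags word pairs in one short text whose similarity exceeds a threshold; such pairs are the kind of "semantically related words" that a regularizer could encourage to share a topic. The vectors in `toy_embeddings`, the helper `related_pairs`, and the threshold of 0.8 are hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)

def related_pairs(words, embeddings, threshold):
    """Return word pairs whose embedding similarity is at least `threshold`.

    Hypothetical helper: the threshold and the pairwise scan are
    illustrative assumptions, not the procedure used in the paper.
    """
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            sim = cosine_similarity(embeddings[words[i]], embeddings[words[j]])
            if sim >= threshold:
                pairs.append((words[i], words[j]))
    return pairs

# Toy 3-dimensional vectors standing in for pretrained word embeddings.
toy_embeddings = {
    "soccer": [0.9, 0.1, 0.0],
    "football": [0.8, 0.2, 0.1],
    "election": [0.0, 0.1, 0.9],
}

short_text = ["soccer", "football", "election"]
pairs = related_pairs(short_text, toy_embeddings, threshold=0.8)
print(pairs)
```

With these toy vectors only "soccer" and "football" pass the threshold, so they would be candidates for a shared topic assignment, while "election" is left unconstrained.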

Code of CRFTM: http://github.com/nonobody/CRFTM.
Stop word list is from NLTK: http://www.nltk.org/.
We thank the anonymous reviewers for their very useful comments and suggestions. This research was partially supported by the National Natural Science Foundation of China (NSFC, Nos. 61472291 and 61772382).
Cite this article
Gao, W., Peng, M., Wang, H. et al. Incorporating word embeddings into topic modeling of short text. Knowl Inf Syst 61, 1123–1145 (2019). https://doi.org/10.1007/s10115-018-1314-7