Abstract
Topic models have been widely used to infer latent topics in text documents. However, unsupervised topic models often produce incoherent topics, which confuse users in applications. Incorporating prior domain knowledge into topic models is an effective strategy for extracting coherent and meaningful topics. In this paper, we go one step further and explore how different forms of prior semantic relations between words can be encoded into models to improve the performance of the topic modeling process. We develop a novel topic model, called Mixed Word Correlation Knowledge-based Latent Dirichlet Allocation, to infer latent topics from a text corpus. Specifically, the proposed model mines two forms of lexical semantic knowledge based on recent progress in word embedding, which represents the semantic information of words in a continuous vector space. To incorporate the generated prior knowledge, a Mixed Markov Random Field is constructed over the latent topic layer to regularize the topic assignment of each word during the topic sampling process. Experimental results on two public benchmark datasets show that the proposed approach outperforms several state-of-the-art baseline models.
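As a rough illustration of how lexical knowledge might be mined from word embeddings, the sketch below collects word pairs whose embedding vectors have high cosine similarity. The abstract does not state how the two forms of knowledge are actually derived, so the similarity measure, the threshold, the toy vectors, and the helper names `cosine` and `correlated_pairs` are all assumptions made for illustration, not the authors' procedure.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def correlated_pairs(embeddings, threshold):
    """Return word pairs whose cosine similarity is at least `threshold`.

    `embeddings` maps each word to its vector. Both the function and the
    threshold rule are hypothetical; they only illustrate one way that
    word-correlation knowledge could be read off an embedding space.
    """
    pairs = []
    for w1, w2 in combinations(sorted(embeddings), 2):
        if cosine(embeddings[w1], embeddings[w2]) >= threshold:
            pairs.append((w1, w2))
    return pairs


# Toy 3-dimensional vectors, invented for this example (not real embeddings).
toy_embeddings = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "engine": [0.0, 0.1, 0.9],
    "motor": [0.1, 0.0, 0.95],
}

pairs = correlated_pairs(toy_embeddings, threshold=0.9)
print(pairs)
```

In a model of the kind the abstract describes, pairs like these would serve as edges in the random field over topic assignments, encouraging correlated words to share a topic; how the edges enter the sampling equations is not specified here.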
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 91646102, L1824039, L1724034, L1724026, L1524015, L1624045), the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (16JDGC011), the Construction Project of China Knowledge Center for Engineering Sciences and Technology (No. CKCEST-2019-2-13), the UK–China Industry Academia Partnership Program (UK-CIAPP/260), the Tsinghua University Project of Volvo-supported Green Economy and Sustainable Development (20153000181) and the Tsinghua Initiative Research Project (2016THZW).
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Chen, J., Zhang, K., Zhou, Y. et al. A novel topic model for documents by incorporating semantic relations between words. Soft Comput 24, 11407–11423 (2020). https://doi.org/10.1007/s00500-019-04604-0
DOI: https://doi.org/10.1007/s00500-019-04604-0