Abstract
Although typically used in classic topic models, surface words cannot represent meaning on their own. Consequently, redundancy is common in those topics, which may, for instance, include synonyms. To face this problem, we present SemLDA, an extended topic model that incorporates semantics from an external lexical-semantic knowledge base. SemLDA is introduced and explained in detail, pointing out where semantics is included both in the pre-pocessing and generative phase of topic distributions. As a result, instead of topics as distributions over words, we obtain distributions over concepts, each represented by a set of synonymous words. In order to evaluate SemLDA, we applied preliminary qualitative tests automatically against a state-of-the-art classical topic model. The results were promising and confirm our intuition towards the benefits of incorporating general semantics in a topic model.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
Boyd-Graber, J., Blei, D., Zhu, X.: A topic model for word sense disambiguation. In: Proceedings of 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 1024–1033. ACL Press, Prague, Czech Republic, June 2007
Brody, S., Lapata, M.: Bayesian word sense induction. In: Proceedings of 12th Conference of the European Chapter of the Association for Computational Linguistics. EACL 2009, pp. 103–111. ACL Press (2009)
Chemudugunta, C., Holloway, A., Smyth, P., Steyvers, M.: Modeling documents by combining semantic concepts with unsupervised statistical learning. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 229–244. Springer, Heidelberg (2008)
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. JASIS 41(6), 391–407 (1990)
Flaherty, P., Giaever, G., Kumm, J., Jordan, M.I., Arkin, A.P.: A latent variable model for chemogenomic profiling. Bioinformatics 21(15), 3286–3293 (2005)
Gonçalo Oliveira, H., de Paiva, V., Freitas, C., Rademaker, A., Real, L., oes, A.S.: As wordnets do português. In: Simões, A., Barreiro, A., Santos, D., Sousa-Silva, R., Tagnin, S.E.O. (eds.) Linguística, Informática e Tradução: Mundos que se Cruzam, OSLa, vol. 7, no. 1, pp. 397–424. University of Oslo (2015)
Guo, W., Diab, M.: Semantic topic models: combining word distributional statistics and dictionary definitions. In: EMNLP, pp. 552–561. ACL Press (2011)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57. ACM (1999)
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Machine learning 37(2), 183–233 (1999)
Miller, G.A.: Wordnet: a lexical database for english. Communications of the ACM 38(11), 39–41 (1995)
Miller, G.A., Chodorow, M., Landes, S., Leacock, C., Thomas, R.G.: Using a semantic concordance for sense identification. In: Proceedings of ARPA Human Language Technology Workshop. Plainsboro, NJ, USA (1994)
Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP 2011, pp. 262–272. ACL Press (2011)
Navigli, R.: Word sense disambiguation: A survey. ACM Computing Surveys 41(2), 1–69 (2009)
Newman, D., Bonilla, E.V., Buntine, W.: Improving topic coherence with regularized topic models. In: Advances in Neural Information Processing Systems, pp. 496–504 (2011)
Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. HLT 2010, pp. 100–108. ACL Press (2010)
Rajagopal, D., Olsher, D., Cambria, E., Kwok, K.: Commonsense-based topic modeling. In: Proceedings of the 2nd International Workshop on Issues of Sentiment Discovery and Opinion Mining, p. 6. ACM (2013)
Tang, G., Xia, Y., Sun, J., Zhang, M., Zheng, T.F.: Topic models incorporating statistical word senses. In: Gelbukh, A. (ed.) CICLing 2014, Part I. LNCS, vol. 8403, pp. 151–162. Springer, Heidelberg (2014)
Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical dirichlet processes. Journal of the american statistical association 101(476) (2006)
Wang, C., Blei, D., Li, F.F.: Simultaneous image classification and annotation. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2009, pp. 1903–1910. IEEE (2009)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Ferrugento, A., Alves, A., Gonçalo Oliveira, H., Rodrigues, F. (2015). Towards the Improvement of a Topic Model with Semantic Knowledge. In: Pereira, F., Machado, P., Costa, E., Cardoso, A. (eds) Progress in Artificial Intelligence. EPIA 2015. Lecture Notes in Computer Science(), vol 9273. Springer, Cham. https://doi.org/10.1007/978-3-319-23485-4_76
Download citation
DOI: https://doi.org/10.1007/978-3-319-23485-4_76
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23484-7
Online ISBN: 978-3-319-23485-4
eBook Packages: Computer ScienceComputer Science (R0)