Towards the Improvement of a Topic Model with Semantic Knowledge

Ferrugento, Adriana; Alves, Ana; Gonçalo Oliveira, Hugo; Rodrigues, Filipe

doi:10.1007/978-3-319-23485-4_76

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9273)

Abstract

Although classic topic models typically rely on surface words, these cannot represent meaning on their own. Consequently, the resulting topics are often redundant and may, for instance, include synonyms. To address this problem, we present SemLDA, an extended topic model that incorporates semantics from an external lexical-semantic knowledge base. SemLDA is introduced and explained in detail, pointing out where semantics is included in both the pre-processing and the generative phase of topic distributions. As a result, instead of topics as distributions over words, we obtain distributions over concepts, each represented by a set of synonymous words. To evaluate SemLDA, we ran preliminary qualitative tests, applied automatically, against a state-of-the-art classical topic model. The results were promising and confirm our intuition about the benefits of incorporating general semantics in a topic model.
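
To make the contrast between word-level and concept-level topics concrete, the following minimal sketch merges the probability mass of synonymous words in a single topic into one concept. It is only an illustration and not the SemLDA model, which includes semantics during pre-processing and in the generative phase rather than as a post-hoc merge. The synonym sets, the concept labels, the topic probabilities, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: each concept is a set of synonymous words.
SYNSETS = {
    "car.n": {"car", "automobile", "auto"},
    "road.n": {"road", "route"},
}


def word_to_concept(synsets):
    """Invert the concept -> words mapping into a word -> concept lookup."""
    index = {}
    for concept, words in synsets.items():
        for word in words:
            index[word] = concept
    return index


def collapse_topic(word_probs, synsets):
    """Sum the probabilities of synonymous words under their shared concept.

    Words that are not covered by any synset are kept as their own entry.
    """
    index = word_to_concept(synsets)
    concept_probs = defaultdict(float)
    for word, prob in word_probs.items():
        concept_probs[index.get(word, word)] += prob
    return dict(concept_probs)


# A made-up word-level topic in which three synonyms compete for mass.
topic = {"car": 0.30, "automobile": 0.20, "auto": 0.10,
         "road": 0.25, "traffic": 0.15}
concepts = collapse_topic(topic, SYNSETS)

for concept, prob in sorted(concepts.items(), key=lambda kv: -kv[1]):
    print(f"{concept}: {prob:.2f}")
```

In this toy topic the three synonyms occupy three separate entries at the word level, while the concept-level view has a single entry for them, which is the kind of redundancy the abstract refers to.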

Author information

Authors and Affiliations

CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
Adriana Ferrugento, Ana Alves, Hugo Gonçalo Oliveira & Filipe Rodrigues
Coimbra Institute of Engineering, Polytechnic Institute of Coimbra, Coimbra, Portugal
Ana Alves

Corresponding author

Correspondence to Adriana Ferrugento.

Editor information

Editors and Affiliations

ISEC - Coimbra Institute of Engineering, Polytechnic Institute of Coimbra, Coimbra, Portugal
Francisco Pereira
CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
Penousal Machado
CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
Ernesto Costa
CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
Amílcar Cardoso

About this paper

Cite this paper

Ferrugento, A., Alves, A., Gonçalo Oliveira, H., Rodrigues, F. (2015). Towards the Improvement of a Topic Model with Semantic Knowledge. In: Pereira, F., Machado, P., Costa, E., Cardoso, A. (eds) Progress in Artificial Intelligence. EPIA 2015. Lecture Notes in Computer Science, vol 9273. Springer, Cham. https://doi.org/10.1007/978-3-319-23485-4_76

DOI: https://doi.org/10.1007/978-3-319-23485-4_76
Published: 25 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23484-7
Online ISBN: 978-3-319-23485-4
eBook Packages: Computer Science, Computer Science (R0)
