Towards the Improvement of a Topic Model with Semantic Knowledge

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9273))

Abstract

Classic topic models typically operate on surface words, but surface words cannot represent meaning on their own. Consequently, the resulting topics are often redundant and may, for instance, include several synonyms. To address this problem, we present SemLDA, an extended topic model that incorporates semantics from an external lexical-semantic knowledge base. We introduce SemLDA and explain it in detail, pointing out where semantics is included in both the pre-processing phase and the generative phase of topic distributions. As a result, instead of topics as distributions over words, we obtain topics as distributions over concepts, each represented by a set of synonymous words. To evaluate SemLDA, we ran preliminary qualitative tests, applied automatically, against a state-of-the-art classical topic model. The results were promising and confirm our intuition about the benefits of incorporating general semantics in a topic model.
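The shift from words to concepts can be illustrated with a minimal sketch. It is not the SemLDA procedure itself, which this abstract does not specify. The lexicon `TOY_SYNSETS`, the concept identifiers, and the helper `to_concept_counts` are all hypothetical, and the sketch assumes each word has exactly one sense, which a real model could not assume.

```python
from collections import Counter

# Hypothetical toy lexicon: each concept is a set of synonymous words.
# A real system would draw these from a lexical-semantic knowledge base.
TOY_SYNSETS = {
    "c_car": {"car", "automobile", "auto"},
    "c_doctor": {"doctor", "physician"},
}

# Inverted index: word -> concept id. Assumes one sense per word, which
# sidesteps the ambiguity a real model would have to handle.
WORD_TO_CONCEPT = {
    word: concept_id
    for concept_id, words in TOY_SYNSETS.items()
    for word in words
}


def to_concept_counts(tokens):
    """Map each token to its concept id (unknown words are kept as-is) and count."""
    return Counter(WORD_TO_CONCEPT.get(token, token) for token in tokens)


doc = ["car", "automobile", "physician", "doctor", "road"]
word_counts = Counter(doc)             # five distinct surface words
concept_counts = to_concept_counts(doc)  # synonyms collapse into shared concepts
```

In this toy document the five distinct surface words reduce to three units, two concepts and one unmapped word, which is the kind of synonym redundancy the abstract describes.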


Author information

Corresponding author

Correspondence to Adriana Ferrugento.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ferrugento, A., Alves, A., Gonçalo Oliveira, H., Rodrigues, F. (2015). Towards the Improvement of a Topic Model with Semantic Knowledge. In: Pereira, F., Machado, P., Costa, E., Cardoso, A. (eds) Progress in Artificial Intelligence. EPIA 2015. Lecture Notes in Computer Science, vol 9273. Springer, Cham. https://doi.org/10.1007/978-3-319-23485-4_76

  • DOI: https://doi.org/10.1007/978-3-319-23485-4_76

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23484-7

  • Online ISBN: 978-3-319-23485-4

  • eBook Packages: Computer Science, Computer Science (R0)
