DOI: 10.1145/2911451.2914720

Short paper

Topic Quality Metrics Based on Distributed Word Representations

Published: 07 July 2016

ABSTRACT

Automated evaluation of topic quality remains an important unsolved problem in topic modeling and represents a major obstacle to developing and evaluating new topic models. Previous approaches have formulated the problem as variations on the coherence and/or mutual information of a topic's top words. In this work, we propose several new metrics that evaluate topic quality with the help of distributed word representations; our experiments suggest that the new metrics match human judgement, the gold standard in this case, better than previously developed approaches.
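One family of embedding-based quality metrics can be sketched as the average pairwise cosine similarity of a topic's top-word vectors: a coherent topic should have mutually similar embeddings. This is a minimal illustration, not the paper's exact metric; the toy 2-dimensional embeddings and word lists below are invented for the example.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def topic_quality(top_words, embeddings):
    # Average pairwise cosine similarity over a topic's top words.
    vecs = [embeddings[w] for w in top_words if w in embeddings]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    if not pairs:
        return 0.0
    return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

# Toy 2-d "embeddings": the first topic's words point the same way,
# the second topic mixes in an unrelated direction.
emb = {
    "cat": [1.0, 0.1], "dog": [0.9, 0.2], "pet": [1.0, 0.0],
    "tax": [0.0, 1.0],
}
coherent = topic_quality(["cat", "dog", "pet"], emb)
mixed = topic_quality(["cat", "dog", "tax"], emb)
```

In practice the embeddings would come from a pretrained model such as GloVe or Polyglot, and the metric would be averaged over all topics of the model under evaluation.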


Published in

SIGIR '16: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2016, 1296 pages
ISBN: 9781450340694
DOI: 10.1145/2911451

      Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '16 paper acceptance rate: 62 of 341 submissions (18%). Overall acceptance rate: 792 of 3,983 submissions (20%).
