ABSTRACT
Automated evaluation of topic quality remains an important unsolved problem in topic modeling and represents a major obstacle for development and evaluation of new topic models. Previous attempts at the problem have been formulated as variations on the coherence and/or mutual information of top words in a topic. In this work, we propose several new metrics for evaluating topic quality with the help of distributed word representations; our experiments suggest that the new metrics are a better match for human judgement, which is the gold standard in this case, than previously developed approaches.
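The paper does not reproduce its metric definitions in this abstract, but a common way to score topic quality with distributed word representations is to average the pairwise cosine similarities of a topic's top words in embedding space. The sketch below is a minimal illustration of that idea, using hypothetical toy vectors rather than real pretrained embeddings; `embedding_coherence` and the sample words are assumptions for demonstration, not the authors' exact metric.

```python
import numpy as np

def embedding_coherence(topic_words, embeddings):
    """Score a topic as the mean pairwise cosine similarity
    of its top words' embedding vectors (higher = more coherent)."""
    vecs = [embeddings[w] for w in topic_words if w in embeddings]
    sims = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            a, b = vecs[i], vecs[j]
            sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims)) if sims else 0.0

# Toy 2-D embeddings: a semantically coherent topic should score
# higher than one containing an off-topic intruder word.
emb = {
    "dog": np.array([1.0, 0.1]),
    "cat": np.array([0.9, 0.2]),
    "pet": np.array([0.95, 0.15]),
    "stock": np.array([0.0, 1.0]),
}
print(embedding_coherence(["dog", "cat", "pet"], emb))    # high similarity
print(embedding_coherence(["dog", "cat", "stock"], emb))  # lower similarity
```

In practice one would substitute pretrained vectors (e.g. GloVe or Polyglot embeddings, as cited by the paper) for the toy dictionary above.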