DOI: 10.1145/3227609.3227659
research-article

Thesaurus-Based Topic Models and Their Evaluation

Published: 25 June 2018

Abstract

In this paper we study thesaurus-based topic models and evaluate them in terms of topic coherence. A thesaurus-based topic model boosts the scores of related terms found in the same text, which encourages these terms to fall into the same topics. We evaluate several variants of such models. In the first step, we manually evaluate the obtained topics. In the second step, we study whether the collected manual data can be used to evaluate new variants of thesaurus-based models, propose a method for doing so, and select its best parameters by cross-validation. In the third step, we apply the resulting evaluation method to estimate how word frequencies influence the addition of thesaurus relations when generating topic models.
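The idea of boosting related terms that occur in the same text can be illustrated with a minimal sketch. This is not the authors' formula, which the abstract does not give. It assumes a simple additive boost, and the function name `boost_related_terms`, the `weight` parameter, and the toy thesaurus are all hypothetical.

```python
from collections import Counter


def boost_related_terms(tokens, related, weight=1.0):
    """Return per-document term scores with co-occurring related terms boosted.

    tokens  -- list of terms in one document
    related -- dict mapping a term to a set of thesaurus-related terms
               (a hypothetical toy thesaurus, not a real resource)
    weight  -- hypothetical strength of the boost (an assumption)
    """
    counts = Counter(tokens)
    scores = {}
    for term, n in counts.items():
        # Add the counts of related terms that actually occur in this document.
        bonus = sum(counts[r] for r in related.get(term, ()) if r in counts)
        scores[term] = n + weight * bonus
    return scores


# Toy thesaurus: three mutually related terms.
toy_thesaurus = {
    "car": {"vehicle", "automobile"},
    "vehicle": {"car", "automobile"},
    "automobile": {"car", "vehicle"},
}
doc = ["car", "vehicle", "car", "road", "bank"]
scores = boost_related_terms(doc, toy_thesaurus, weight=0.5)
print(scores)
```

In this sketch, "car" and "vehicle" reinforce each other because both appear in the document, while "road" and "bank" keep their raw counts. Feeding such boosted scores into a topic model in place of raw frequencies is one way related terms could be pushed toward the same topics.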

Cited By

  • (2025) Text mining technologies applied to free-text answers of students in e-assessment. Discover Computing 28:1. DOI: 10.1007/s10791-024-09496-9. Online publication date: 17-Jan-2025
  • (2020) A Topic Learning Pipeline for Curating Brain Cognitive Researches. IEEE Access 8 (191758-191774). DOI: 10.1109/ACCESS.2020.3032173. Online publication date: 2020

Published In

WIMS '18: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics
June 2018
398 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content-based analysis
  2. thesaurus
  3. topic models

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WIMS '18

Acceptance Rates

Overall Acceptance Rate 140 of 278 submissions, 50%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 20 Jan 2025
