ABSTRACT
In this paper, we advance the state of the art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space and (ii) the proposal of a new TF-IDF-based strategy, particularly developed to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we exceed the baselines (with a few ties) in almost all cases, with gains of more than 50% against the best baselines (achieving up to 80% against some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
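The core idea described above can be sketched in a few lines: group each vocabulary term with its nearest neighbors in a pre-trained embedding space (by cosine similarity above a threshold) to form a meta-word, then weight a meta-word in a document by its similarity to the document's member words. This is a minimal illustrative sketch, not the paper's exact formulation: the toy vectors, the threshold `alpha`, and the `cluword_tf` weighting are assumptions for demonstration.

```python
import numpy as np

# Toy pre-trained embeddings (in practice these would come from
# word2vec, GloVe, or fastText vectors).
vocab = ["car", "auto", "vehicle", "banana", "fruit"]
E = np.array([
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.05],
    [0.80, 0.20, 0.10],
    [0.10, 0.90, 0.30],
    [0.15, 0.85, 0.35],
])

# Pairwise cosine similarities over the whole vocabulary.
norm = E / np.linalg.norm(E, axis=1, keepdims=True)
sim = norm @ norm.T

# A CluWord (meta-word) for term t keeps every word whose cosine
# similarity to t is at least alpha (alpha = 0.98 is illustrative).
alpha = 0.98
cluwords = {vocab[i]: [vocab[j] for j in range(len(vocab))
                       if sim[i, j] >= alpha]
            for i in range(len(vocab))}
print(cluwords["car"])  # → ['car', 'auto', 'vehicle']

# One plausible similarity-weighted term frequency: the weight of
# CluWord t in a document is the summed similarity of t to the
# document's words that fall inside t's cluster (a sketch, not the
# paper's exact TF-IDF formula).
def cluword_tf(t, doc):
    i = vocab.index(t)
    return sum(sim[i, vocab.index(w)] for w in doc
               if sim[i, vocab.index(w)] >= alpha)

doc = ["car", "auto", "banana"]
weight = cluword_tf("car", doc)  # counts "car" and "auto", not "banana"
```

The threshold `alpha` controls cluster tightness: higher values yield smaller, purer meta-words, while lower values broaden each cluster and increase term overlap across documents.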
CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling