ABSTRACT
In this paper, we advance the state of the art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space and (ii) the proposal of a new TF-IDF-based strategy, particularly developed to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we exceed the baselines (with a few ties) in almost all cases, with gains of more than 50% against the best baselines (achieving up to 80% against some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
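The core idea described above can be sketched in a few lines: group each vocabulary term with its nearest neighbors in a pre-trained embedding space (by cosine similarity above a threshold) to form a meta-word, then weight a meta-word in a document by its similarity to the document's member words. This is a minimal illustrative sketch, not the paper's exact formulation: the toy vectors, the threshold `alpha`, and the `cluword_tf` weighting are assumptions for demonstration.

```python
import numpy as np

# Toy pre-trained embeddings (in practice these would come from
# word2vec, GloVe, or fastText vectors).
vocab = ["car", "auto", "vehicle", "banana", "fruit"]
E = np.array([
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.05],
    [0.80, 0.20, 0.10],
    [0.10, 0.90, 0.30],
    [0.15, 0.85, 0.35],
])

# Pairwise cosine similarities over the whole vocabulary.
norm = E / np.linalg.norm(E, axis=1, keepdims=True)
sim = norm @ norm.T

# A CluWord (meta-word) for term t keeps every word whose cosine
# similarity to t is at least alpha (alpha = 0.98 is illustrative).
alpha = 0.98
cluwords = {vocab[i]: [vocab[j] for j in range(len(vocab))
                       if sim[i, j] >= alpha]
            for i in range(len(vocab))}
print(cluwords["car"])  # → ['car', 'auto', 'vehicle']

# One plausible similarity-weighted term frequency: the weight of
# CluWord t in a document is the summed similarity of t to the
# document's words that fall inside t's cluster (a sketch, not the
# paper's exact TF-IDF formula).
def cluword_tf(t, doc):
    i = vocab.index(t)
    return sum(sim[i, vocab.index(w)] for w in doc
               if sim[i, vocab.index(w)] >= alpha)

doc = ["car", "auto", "banana"]
weight = cluword_tf("car", doc)  # counts "car" and "auto", not "banana"
```

The threshold `alpha` controls cluster tightness: higher values yield smaller, purer meta-words, while lower values broaden each cluster and increase term overlap across documents.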
CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling