DOI: 10.1145/3289600.3291032 · Research article

CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling

Published: 30 January 2019

ABSTRACT

In this paper, we advance the state-of-the-art in topic modeling by means of a new document representation based on pre-trained word embeddings for non-probabilistic matrix factorization. Specifically, our strategy, called CluWords, exploits the nearest words of a given pre-trained word embedding to generate meta-words capable of enhancing the document representation in terms of both syntactic and semantic information. The novel contributions of our solution include: (i) the introduction of a novel data representation for topic modeling, based on syntactic and semantic relationships derived from distances calculated within a pre-trained word embedding space, and (ii) the proposal of a new TF-IDF-based strategy, developed specifically to weight the CluWords. In our extensive experimental evaluation, covering 12 datasets and 8 state-of-the-art baselines, we exceed the baselines (with a few ties) in almost all cases, with gains of more than 50% over the best baselines (and up to 80% over some runner-ups). Finally, we show that our method is able to improve document representation for the task of automatic text classification.
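The abstract's core mechanism can be sketched in code: each vocabulary word is expanded into a "meta-word" made of its nearest neighbors in a pre-trained embedding space, and documents are then weighted with a TF-like score over those clusters. Everything below is an illustrative assumption, not the authors' implementation: the toy embedding vectors, the 0.4 similarity threshold, and the function names `cluword` and `cluword_tf` are all made up for the sketch (real CluWords uses large pre-trained spaces such as word2vec or fastText and a full TF-IDF scheme).

```python
import math

# Toy pre-trained embeddings; these vectors are made up for illustration.
EMBEDDINGS = {
    "car":   [0.90, 0.10, 0.00],
    "auto":  [0.85, 0.20, 0.05],
    "truck": [0.70, 0.30, 0.10],
    "apple": [0.00, 0.90, 0.30],
    "fruit": [0.10, 0.85, 0.35],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cluword(word, threshold=0.4):
    """Meta-word for `word`: every vocabulary word whose cosine
    similarity to it meets the threshold, mapped to that similarity."""
    wv = EMBEDDINGS[word]
    return {w: cosine(wv, v) for w, v in EMBEDDINGS.items()
            if cosine(wv, v) >= threshold}

def cluword_tf(doc_words, word, threshold=0.4):
    """TF-like weight of the `word` meta-word in a document: the summed
    similarities of the document's words that fall inside the cluster."""
    cluster = cluword(word, threshold)
    return sum(cluster.get(w, 0.0) for w in doc_words)

doc = ["car", "truck", "apple"]
print(cluword("car"))          # car, auto, truck (apple/fruit excluded)
print(cluword_tf(doc, "car"))  # ≈ 1.95: car and truck contribute, apple does not
```

In the paper's full scheme this TF component is further combined with an IDF-style factor computed over the CluWords rather than over raw terms; the sketch stops at cluster construction and TF weighting.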


Published in

WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining
January 2019, 874 pages
ISBN: 9781450359405
DOI: 10.1145/3289600

Copyright © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

WSDM '19 paper acceptance rate: 84 of 511 submissions (16%). Overall acceptance rate: 498 of 2,863 submissions (17%).
