Research article
DOI: 10.1145/3097983.3098009

Collaboratively Improving Topic Discovery and Word Embeddings by Coordinating Global and Local Contexts

Published: 04 August 2017

Abstract

A text corpus typically contains two types of context information: global context and local context. Global context carries topical information, which topic models can exploit to discover topic structures in the corpus, while local context can train word embeddings that capture the semantic regularities reflected in the corpus. This motivates us to exploit the useful information in both types of context. In this paper, we propose a unified language model based on matrix factorization techniques which 1) takes the complementary global and local context information into consideration simultaneously, and 2) models topics and learns word embeddings collaboratively. We empirically show that by incorporating both global and local context, this collaborative model not only significantly improves topic discovery over the baseline topic models, but also learns better word embeddings than the baseline word embedding models. We also provide a qualitative analysis that explains how the cooperation of global and local context information can result in better topic structures and word embeddings.

Supplementary Material

MP4 File (xun_contexts.mp4)



Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. global context
  2. local context
  3. topic modeling
  4. unified language model
  5. word embeddings


Funding Sources

  • US National Science Foundation

Conference

KDD '17

Acceptance Rates

KDD '17 paper acceptance rate: 64 of 748 submissions (9%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)

Article Metrics

  • Downloads (last 12 months): 12
  • Downloads (last 6 weeks): 2
Reflects downloads up to 14 Feb 2025.

Cited By

  • (2024) Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics. PeerJ Computer Science 10:e1758. DOI: 10.7717/peerj-cs.1758. Published 3 Jan 2024.
  • (2024) A Hybrid Semantic Representation Method Based on Fusion Conceptual Knowledge and Weighted Word Embeddings for English Texts. Information 15(11):708. DOI: 10.3390/info15110708. Published 5 Nov 2024.
  • (2023) Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 429-437. DOI: 10.1145/3539597.3570475. Published 27 Feb 2023.
  • (2023) Modeling Spatial Trajectories with Attribute Representation Learning (Extended Abstract). 2023 IEEE 39th International Conference on Data Engineering (ICDE), 3813-3814. DOI: 10.1109/ICDE55515.2023.00333. Published Apr 2023.
  • (2023) Contextualized Word Embeddings via Generative Adversarial Learning of Syntagmatic and Paradigmatic Structure. 2023 6th International Conference on Software Engineering and Computer Science (CSECS), 1-8. DOI: 10.1109/CSECS60003.2023.10428465. Published 22 Dec 2023.
  • (2022) A Generative Model for Topic Discovery and Polysemy Embeddings on Directed Attributed Networks. Symmetry 14(4):703. DOI: 10.3390/sym14040703. Published 30 Mar 2022.
  • (2022) A Method of Short Text Representation Fusion with Weighted Word Embeddings and Extended Topic Information. Sensors 22(3):1066. DOI: 10.3390/s22031066. Published 29 Jan 2022.
  • (2022) Deciphering the Diversity of Mental Models in Neurodevelopmental Disorders: Knowledge Graph Representation of Public Data Using Natural Language Processing. Journal of Medical Internet Research 24(8):e39888. DOI: 10.2196/39888. Published 5 Aug 2022.
  • (2022) A Text Generation Model that Maintains the Order of Words, Topics, and Parts of Speech via Their Embedding Representations and Neural Language Models. IEEE/WIC/ACM International Conference on Web Intelligence, 262-269. DOI: 10.1145/3486622.3493968. Published 13 Apr 2022.
  • (2022) Modeling Spatial Trajectories With Attribute Representation Learning. IEEE Transactions on Knowledge and Data Engineering 34(4):1902-1914. DOI: 10.1109/TKDE.2020.3001025. Published 1 Apr 2022.
