Research article
DOI: 10.1145/3097983.3098009

Collaboratively Improving Topic Discovery and Word Embeddings by Coordinating Global and Local Contexts

Published: 04 August 2017

Abstract

A text corpus typically contains two types of context information: global context and local context. Global context carries topical information, which topic models can exploit to discover topic structures in the corpus, while local context can train word embeddings that capture the semantic regularities reflected in the corpus. This motivates us to exploit the useful information in both types of context. In this paper, we propose a unified language model based on matrix factorization techniques which 1) takes the complementary global and local context information into consideration simultaneously, and 2) models topics and learns word embeddings collaboratively. We empirically show that by incorporating both global and local context, this collaborative model not only significantly improves topic discovery over the baseline topic models, but also learns better word embeddings than the baseline word embedding models. We also provide a qualitative analysis that explains how the cooperation of global and local context information can result in better topic structures and word embeddings.

Supplementary Material

MP4 File (xun_contexts.mp4)



Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. global context
  2. local context
  3. topic modeling
  4. unified language model
  5. word embeddings


Funding Sources

  • US National Science Foundation

Conference

KDD '17

Acceptance Rates

KDD '17 paper acceptance rate: 64 of 748 submissions (9%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)

Article Metrics

  • Downloads (last 12 months): 12
  • Downloads (last 6 weeks): 2
Reflects downloads up to 14 Feb 2025.

Cited By

  • (2024) Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics. PeerJ Computer Science 10:e1758. DOI: 10.7717/peerj-cs.1758. Published 3 Jan 2024.
  • (2024) A Hybrid Semantic Representation Method Based on Fusion Conceptual Knowledge and Weighted Word Embeddings for English Texts. Information 15(11):708. DOI: 10.3390/info15110708. Published 5 Nov 2024.
  • (2023) Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 429-437. DOI: 10.1145/3539597.3570475. Published 27 Feb 2023.
  • (2023) Modeling Spatial Trajectories with Attribute Representation Learning (Extended Abstract). 2023 IEEE 39th International Conference on Data Engineering (ICDE), 3813-3814. DOI: 10.1109/ICDE55515.2023.00333. Published Apr 2023.
  • (2023) Contextualized Word Embeddings via Generative Adversarial Learning of Syntagmatic and Paradigmatic Structure. 2023 6th International Conference on Software Engineering and Computer Science (CSECS), 1-8. DOI: 10.1109/CSECS60003.2023.10428465. Published 22 Dec 2023.
  • (2022) A Generative Model for Topic Discovery and Polysemy Embeddings on Directed Attributed Networks. Symmetry 14(4):703. DOI: 10.3390/sym14040703. Published 30 Mar 2022.
  • (2022) A Method of Short Text Representation Fusion with Weighted Word Embeddings and Extended Topic Information. Sensors 22(3):1066. DOI: 10.3390/s22031066. Published 29 Jan 2022.
  • (2022) Deciphering the Diversity of Mental Models in Neurodevelopmental Disorders: Knowledge Graph Representation of Public Data Using Natural Language Processing. Journal of Medical Internet Research 24(8):e39888. DOI: 10.2196/39888. Published 5 Aug 2022.
  • (2022) A Text Generation Model that Maintains the Order of Words, Topics, and Parts of Speech via Their Embedding Representations and Neural Language Models. IEEE/WIC/ACM International Conference on Web Intelligence, 262-269. DOI: 10.1145/3486622.3493968. Published 13 Apr 2022.
  • (2022) Modeling Spatial Trajectories With Attribute Representation Learning. IEEE Transactions on Knowledge and Data Engineering 34(4):1902-1914. DOI: 10.1109/TKDE.2020.3001025. Published 1 Apr 2022.
