DOI: 10.1145/3397271.3401231
short-paper

Graph Regularization for Multi-lingual Topic Models

Published: 25 July 2020

ABSTRACT

Unsupervised multi-lingual language modeling has gained traction in the last few years, and poly-lingual topic models provide a mechanism to learn aligned document representations. However, training such models requires translation-aligned data across languages, which is not always available. Moreover, for short texts such as tweets and search queries, training topic models remains a challenge. In this work, we present a novel strategy of creating a pseudo-parallel dataset and then training topic models on it for sponsored search retrieval, which also mitigates the short-text challenge. Our data augmentation strategy leverages the easily available bipartite click-through graph, which allows us to draw similar documents in different languages. The proposed methodology is evaluated on a sponsored search system, whose performance is measured by how correctly it matches the user intent, expressed via the query, with ads provided by the advertiser. Our experiments substantiate the effectiveness of the method on the EuroParl dataset and on live search-engine traffic.
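
The abstract does not spell out how similar documents are drawn from the click-through graph, so the following is only an illustrative sketch and not the authors' procedure. It assumes the graph is available as hypothetical (query, language, ad_id) click records, and it treats two queries in different languages as a pseudo-parallel pair when they share a clicked ad. The function name, the record layout, and the sample data are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def pseudo_parallel_pairs(clicks):
    """Pair queries in different languages that clicked the same ad.

    `clicks` is an iterable of (query, language, ad_id) tuples, a
    hypothetical flattened form of a bipartite query-ad click graph.
    """
    # Collect the query-side neighbours of every ad node.
    queries_by_ad = defaultdict(set)
    for query, lang, ad_id in clicks:
        queries_by_ad[ad_id].add((query, lang))

    # Queries sharing an ad are assumed to be topically similar;
    # keep only the cross-lingual combinations.
    pairs = set()
    for neighbours in queries_by_ad.values():
        for (q1, l1), (q2, l2) in combinations(sorted(neighbours), 2):
            if l1 != l2:
                pairs.add(((q1, l1), (q2, l2)))
    return sorted(pairs)


# Toy click log (invented data, for illustration only).
clicks = [
    ("cheap flights", "en", "ad_1"),
    ("vols pas chers", "fr", "ad_1"),
    ("billige flüge", "de", "ad_1"),
    ("running shoes", "en", "ad_2"),
    ("chaussures de course", "fr", "ad_2"),
    ("trail shoes", "en", "ad_2"),
]
pairs = pseudo_parallel_pairs(clicks)
```

Each resulting pair could then stand in for a translation-aligned document pair when training a poly-lingual topic model. Joining the paired queries into one longer pseudo-document is one plausible way to ease the short-text problem, though that step is also an assumption here.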

Published in

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2020
2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271

Copyright © 2020 ACM

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%