short-paper

Graph Regularization for Multi-lingual Topic Models

Authors:
Arnav Kumar Jain (Microsoft IDC, Bangalore, India),
Gundeep Arora (Microsoft IDC, Bangalore, India),
Rahul Agrawal (Microsoft IDC, Bangalore, India)

SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2020, Pages 1741–1744. https://doi.org/10.1145/3397271.3401231

Published: 25 July 2020

ABSTRACT

Unsupervised multi-lingual language modeling has gained traction in the last few years, and poly-lingual topic models provide a mechanism to learn aligned document representations. However, training such models requires translation-aligned data across languages, which is not always available. Moreover, training topic models on short texts such as tweets and search queries remains a challenge. In this work, we present a novel strategy that creates a pseudo-parallel dataset and then trains topic models for sponsored search retrieval, which also mitigates the short-text challenge. Our data augmentation strategy leverages the readily available bipartite click-through graph, which allows us to draw similar documents in different languages. The proposed methodology is evaluated on a sponsored search system, whose performance is measured by how correctly it matches the user intent, expressed via the query, with ads provided by advertisers. Our experiments on the EuroParl dataset and live search-engine traffic substantiate the effectiveness of the method.
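
The idea of drawing similar documents in different languages from a bipartite click graph can be illustrated with a minimal sketch. This is not the authors' procedure, which the abstract does not detail. It assumes, hypothetically, that click logs arrive as (query, ad, click count) triples, that each query has a known language code, and that queries clicking on the same ad can be concatenated into one pseudo-document per language. The function name `build_pseudo_parallel_corpus` and the `min_clicks` threshold are illustrative only.

```python
from collections import defaultdict


def build_pseudo_parallel_corpus(clicks, query_lang, min_clicks=1):
    """Group queries by clicked ad and language (hypothetical sketch).

    clicks:     iterable of (query, ad, n_clicks) edges of the bipartite graph
    query_lang: dict mapping each query to a language code (assumed known)
    Returns {ad: {lang: pseudo_document}} for ads reached from >= 2 languages.
    """
    # ad -> language -> set of queries that led to a click on that ad
    by_ad = defaultdict(lambda: defaultdict(set))
    for query, ad, n_clicks in clicks:
        if n_clicks >= min_clicks:  # drop weak edges (assumed noise filter)
            by_ad[ad][query_lang[query]].add(query)

    corpus = {}
    for ad, by_lang in by_ad.items():
        if len(by_lang) >= 2:  # keep only ads linking at least two languages
            # Concatenating queries lengthens the short texts; sorting keeps
            # the output deterministic.
            corpus[ad] = {
                lang: " ".join(sorted(queries))
                for lang, queries in by_lang.items()
            }
    return corpus


# Toy click log (made-up data for illustration only).
clicks = [
    ("cheap flights", "ad_travel", 5),
    ("vols pas chers", "ad_travel", 3),
    ("flight deals", "ad_travel", 2),
    ("running shoes", "ad_shoes", 4),
]
query_lang = {
    "cheap flights": "en",
    "vols pas chers": "fr",
    "flight deals": "en",
    "running shoes": "en",
}
corpus = build_pseudo_parallel_corpus(clicks, query_lang)
print(corpus)
```

Each entry of the resulting dictionary is a tuple of loosely aligned documents, one per language, which is the input shape a poly-lingual topic model expects. Ads clicked from a single language are discarded because they provide no cross-lingual signal.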

Index Terms

1. Information systems
  1. Information retrieval

Published in
SIGIR '20: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2020, 2548 pages
ISBN: 9781450380164
DOI: 10.1145/3397271
General Chairs: Jimmy Huang (York University, Canada), Yi Chang (Jilin University, China), Xueqi Cheng (Chinese Academy of Sciences, China)
Program Chairs: Jaap Kamps (University of Amsterdam, Netherlands), Vanessa Murdock (Amazon, U.S.A.), Ji-Rong Wen (Renmin University of China, China), Yiqun Liu (Tsinghua University, China)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cross-lingual information retrieval
graph regularization
topic models
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%