ABSTRACT
Unsupervised multi-lingual language modeling has gained traction in the last few years, and poly-lingual topic models provide a mechanism for learning aligned document representations. However, training such models requires translation-aligned data across languages, which is not always available. Moreover, for short texts such as tweets and search queries, training topic models remains a challenge. In this work, we present a novel strategy that creates a pseudo-parallel dataset and then trains topic models for sponsored search retrieval, which also mitigates the short-text challenge. Our data augmentation strategy leverages the readily available bipartite click-through graph, which allows us to draw similar documents in different languages. The proposed methodology is evaluated on a sponsored search system, whose performance is measured by how correctly it matches the user intent, expressed via the query, with ads provided by the advertiser. Our experiments substantiate the effectiveness of the method on the EuroParl dataset and on live search-engine traffic.
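As a rough illustration of how a bipartite click-through graph could yield similar documents in different languages, the sketch below pairs queries that were clicked through to the same ad but are written in different languages. This is only an assumption about the general idea, not the procedure used in the paper: the function name `pseudo_parallel_pairs`, the edge format, the language tags, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pseudo_parallel_pairs(clicks, lang_of):
    """Pair queries in different languages that share a clicked ad.

    clicks:  iterable of (query, ad) edges of the bipartite click graph
             (hypothetical format, assumed for illustration).
    lang_of: dict mapping each query to a language code (assumed known).
    """
    # Group queries by the ad they were clicked through to.
    queries_by_ad = defaultdict(set)
    for query, ad in clicks:
        queries_by_ad[ad].add(query)

    pairs = set()
    for queries in queries_by_ad.values():
        # Sorting makes the order inside each pair deterministic.
        for q1, q2 in combinations(sorted(queries), 2):
            if lang_of[q1] != lang_of[q2]:
                pairs.add((q1, q2))
    return pairs


# Toy data, invented for this example.
clicks = [
    ("cheap flights", "ad1"),
    ("vols pas chers", "ad1"),
    ("vuelos baratos", "ad1"),
    ("running shoes", "ad2"),
    ("trainers", "ad2"),
]
lang_of = {
    "cheap flights": "en",
    "vols pas chers": "fr",
    "vuelos baratos": "es",
    "running shoes": "en",
    "trainers": "en",
}
pairs = pseudo_parallel_pairs(clicks, lang_of)
```

In this toy graph the three queries attached to `ad1` form cross-language pairs, while the two English queries attached to `ad2` are skipped. A real system would presumably also weight or filter edges, which this sketch leaves out.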
Index Terms
- Graph Regularization for Multi-lingual Topic Models