research-article

Cross-Lingual Topic Discovery From Multilingual Search Engine Query Log

Authors:

Yuanfeng Song

ACM Transactions on Information Systems (TOIS), Volume 35, Issue 2

Article No.: 9, Pages 1 - 28

https://doi.org/10.1145/2956235

Published: 21 September 2016

Abstract

Today, major commercial search engines operate multinationally, providing web search services to millions of users who compose queries in different languages. Hence, the search engine query log, which serves as the backbone of many search engine applications, records millions of users’ search histories in a wide spectrum of human languages and exhibits a strongly multilingual character. Despite its salience, however, the multilingual nature of query logs is usually ignored by existing work, which typically treats query log entries in different languages as orthogonal and independent. This oversimplified assumption heavily distorts the underlying structure of web search data. In this article, we recognize the multilingual nature of query logs and make the first attempt to cross the language barrier in them. We propose a novel model, the Cross-Lingual Query Log Topic Model (CL-QLTM), to analyze query logs from a cross-lingual perspective and derive the latent topics of web search data. The CL-QLTM comprehensively integrates web search data in different languages by jointly utilizing cross-lingual dictionaries and the co-occurrence relations in the query log. To relieve the efficiency bottleneck of applying the CL-QLTM to voluminous query logs, we propose an efficient parameter inference algorithm based on the MapReduce computing paradigm. Both qualitative and quantitative experimental results show that the CL-QLTM effectively derives cross-lingual topics from multilingual query logs and enables a wide spectrum of new search engine applications.

Index Terms

Cross-Lingual Topic Discovery From Multilingual Search Engine Query Log
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Novelty in information retrieval

Published In

ACM Transactions on Information Systems Volume 35, Issue 2

April 2017

232 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/3001595

Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 September 2016

Accepted: 01 June 2016

Revised: 01 June 2016

Received: 01 October 2015

Published in TOIS Volume 35, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Hong Kong RGC Project
Microsoft Research Asia Fellowship
National Natural Science Foundation of China
National Grand Fundamental Research 973 Program of China
