research-article

Cross-Lingual Topic Discovery From Multilingual Search Engine Query Log

Published: 21 September 2016

Abstract

Today, major commercial search engines operate multinationally, providing web search services to millions of users who compose queries in different languages. Hence, the search engine query log, which serves as the backbone of many search engine applications, records millions of users’ search histories in a wide spectrum of human languages and is strongly multilingual. However, despite its salience, the multilingual nature of the query log is usually ignored by existing work, which typically treats query log entries in different languages as orthogonal and independent. This oversimplified assumption heavily distorts the underlying structure of web search data. In this article, we recognize the multilingual nature of the query log and make the first attempt to cross the language barrier in query logs. We propose a novel model, the Cross-Lingual Query Log Topic Model (CL-QLTM), to analyze query logs from a cross-lingual perspective and derive the latent topics of web search data. The CL-QLTM integrates web search data in different languages by jointly utilizing cross-lingual dictionaries and the co-occurrence relations in the query log. To relieve the efficiency bottleneck of applying the CL-QLTM to voluminous query logs, we propose an efficient parameter inference algorithm based on the MapReduce computing paradigm. Both qualitative and quantitative experimental results show that the CL-QLTM effectively derives cross-lingual topics from multilingual query logs and can support a wide spectrum of new search engine applications.

    Published In

    ACM Transactions on Information Systems  Volume 35, Issue 2
    April 2017
    232 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/3001595

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 September 2016
    Accepted: 01 June 2016
    Revised: 01 June 2016
    Received: 01 October 2015
    Published in TOIS Volume 35, Issue 2

    Author Tags

    1. Search engine
    2. probabilistic topic model
    3. query log

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2023) An Efficient Long Short-Term Memory Model for Digital Cross-Language Summarization. Computers, Materials & Continua 74:3 (6389-6409). DOI: 10.32604/cmc.2023.034072. Online publication date: 2023
    • (2023) The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2848-2860). DOI: 10.1145/3539618.3591890. Online publication date: 19-Jul-2023
    • (2022) Experience: Analyzing Missing Web Page Visits and Unintentional Web Page Visits from the Client-side Web Logs. Journal of Data and Information Quality 14:2 (1-17). DOI: 10.1145/3490392. Online publication date: 23-Mar-2022
    • (2021) Topic Modeling Using Latent Dirichlet allocation. ACM Computing Surveys 54:7 (1-35). DOI: 10.1145/3462478. Online publication date: 17-Sep-2021
    • (2021) Improving Semantic Coherence of Gujarati Text Topic Model Using Inflectional Forms Reduction and Single-letter Words Removal. ACM Transactions on Asian and Low-Resource Language Information Processing 20:1 (1-18). DOI: 10.1145/3447760. Online publication date: 10-Mar-2021
    • (2021) Heterogeneous Latent Topic Discovery for Semantic Text Mining. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2021.3077025. Online publication date: 2021
    • (2021) Federated Topic Discovery: A Semantic Consistent Approach. IEEE Intelligent Systems 36:5 (96-103). DOI: 10.1109/MIS.2020.3033459. Online publication date: 1-Sep-2021
    • (2021) Building the Bridge: Topic Modeling for Comparative Research. Communication Methods and Measures 16:2 (96-114). DOI: 10.1080/19312458.2021.1965973. Online publication date: 7-Sep-2021
    • (2021) A word embedding-based approach to cross-lingual topic modeling. Knowledge and Information Systems 63:6 (1529-1555). DOI: 10.1007/s10115-021-01555-7. Online publication date: 1-Jun-2021
    • (2019) A novel temporal and topic-aware recommender model. World Wide Web 22:5 (2105-2127). DOI: 10.1007/s11280-018-0595-9. Online publication date: 1-Sep-2019
