DOI: 10.1145/3477495.3532013
ACM Conference Proceedings · Short paper

Learning to Enrich Query Representation with Pseudo-Relevance Feedback for Cross-lingual Retrieval

Published: 07 July 2022

Abstract

Cross-lingual information retrieval (CLIR) aims to provide access to information across languages. Recent pre-trained multilingual language models have brought large improvements to natural language tasks, including cross-lingual ad hoc retrieval. However, pseudo-relevance feedback (PRF), a family of techniques that improves ranking using the contents of the top initially retrieved items, has not been explored with neural CLIR models. Two of the challenges are incorporating feedback from long documents and transferring knowledge across languages. To address these challenges, we propose a novel neural CLIR architecture, NCLPRF, capable of incorporating PRF from multiple, potentially long, documents, which enables improvements to the query representation in the semantic space shared between the query and document languages. The additional information that the feedback documents provide in the target language enriches the query representation, bringing it closer to relevant documents in the embedding space. Across three CLIR test collections in Chinese, Russian, and Persian, the proposed model shows significant improvements over traditional and state-of-the-art neural CLIR baselines.
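NCLPRF learns this query enrichment neurally; as a rough intuition for how PRF can pull a dense query representation toward feedback documents in a shared embedding space, here is a minimal, classical Rocchio-style sketch. The function name, the interpolation weight `alpha`, and the reciprocal-rank weighting scheme are illustrative assumptions for this sketch, not the paper's actual method.

```python
import math

def enrich_query_embedding(query_emb, feedback_embs, alpha=0.7):
    """Rocchio-style pseudo-relevance feedback in a shared dense space.

    Interpolates the original query vector with a reciprocal-rank-weighted
    average of the top-k feedback document vectors, pulling the query
    representation closer to the (presumed) relevant documents.
    """
    k = len(feedback_embs)
    dim = len(query_emb)
    # Reciprocal-rank weights: 1, 1/2, 1/3, ..., normalized to sum to 1,
    # so higher-ranked feedback documents contribute more.
    raw = [1.0 / rank for rank in range(1, k + 1)]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Weighted centroid of the feedback document embeddings.
    centroid = [sum(weights[i] * feedback_embs[i][d] for i in range(k))
                for d in range(dim)]
    # Linear interpolation between query and feedback centroid,
    # then L2-normalize so cosine-similarity scoring is scale-invariant.
    enriched = [alpha * q + (1 - alpha) * c
                for q, c in zip(query_emb, centroid)]
    norm = math.sqrt(sum(x * x for x in enriched))
    return [x / norm for x in enriched]
```

With `alpha` close to 1 the enriched query stays near the original; lowering it moves the query toward the feedback documents, mirroring the embedding-space intuition described in the abstract.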


Cited By

  • (2024) Retrieval Augmented Zero-Shot Text Classification. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 195-203. https://doi.org/10.1145/3664190.3672514. Published: 2 August 2024
  • (2023) Augmenting Passage Representations with Query Generation for Enhanced Cross-Lingual Dense Retrieval. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1827-1832. https://doi.org/10.1145/3539618.3591952. Published: 19 July 2023


Published In
    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
ISBN: 9781450387323
DOI: 10.1145/3477495

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. contextualization
    2. cross-lingual information retrieval
    3. dense retrieval
    4. language modeling
    5. pseudo-relevance feedback
    6. reciprocal rank weighting

    Qualifiers

    • Short-paper

    Funding Sources

    • IARPA BETTER

    Conference

    SIGIR '22

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

