Abstract
The dual-encoder model is a dense retrieval architecture consisting of two encoder models; it has surpassed traditional sparse retrieval methods for open-domain retrieval [1]. However, room for improvement remains, particularly when dense retrievers are exposed to unseen passages or queries.
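The scoring mechanism shared by the dual-encoder models discussed in this work can be sketched as follows. This is a minimal illustration in Python: the encoder functions and the 768-dimensional embedding size are placeholders standing in for the learned (typically BERT-based) query and passage encoders, not the actual models trained here.

```python
# Minimal sketch of dual-encoder scoring. The two encoders are separate models;
# relevance is the dot product between query and passage embeddings, so passage
# embeddings can be pre-computed and indexed offline.
import numpy as np

rng = np.random.default_rng(0)

def encode_query(query: str) -> np.ndarray:
    # Placeholder for a learned query encoder (e.g., BERT [CLS] pooling).
    return rng.standard_normal(768)

def encode_passage(passage: str) -> np.ndarray:
    # Placeholder for a separately parameterized passage encoder.
    return rng.standard_normal(768)

def score(query: str, passages: list[str]) -> np.ndarray:
    q = encode_query(query)
    P = np.stack([encode_passage(p) for p in passages])
    return P @ q  # one relevance score per passage

print(score("who wrote hamlet?", ["Hamlet is a tragedy ...", "The Eiffel Tower ..."]))
```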
For out-of-domain queries, i.e., queries originating from domains other than the one the model was trained on, the loss in accuracy can be significant. A main factor is the mismatch in the information available to the context encoder and the query encoder during training: common retrieval training datasets pair the overwhelming majority of passages with only a single query. I hypothesize that this could lead the dual-encoder model, particularly the passage encoder, to overfit to a single potential query for a given passage, to the detriment of out-of-domain performance. Based on this, I seek to answer the following research question: (RQ1.1) Does training a dense passage retrieval (DPR) model on data containing multiple queries per passage improve the generalizability of the model? To answer RQ1.1, I build generated datasets that contain multiple queries for most passages and compare dense passage retriever models trained on these datasets against models trained on (mostly) single-query-per-passage datasets. I show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets [2].
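The shape of the multi-query training data can be illustrated as below. The passage text and queries are hypothetical examples, not items from the generated datasets; the point is only that each passage contributes several (query, positive passage) pairs to DPR-style training.

```python
# Hypothetical example of multi-query-per-passage training data: each passage is
# paired with several queries, flattened into (query, positive passage) pairs.
multi_query_data = [
    {
        "passage": "The Amazon rainforest covers much of the Amazon basin ...",
        "queries": [
            "where is the amazon rainforest located",
            "how large is the amazon rainforest",
            "which countries contain the amazon rainforest",
        ],
    },
]

train_pairs = [
    (query, example["passage"])
    for example in multi_query_data
    for query in example["queries"]
]
print(len(train_pairs))  # 3 training pairs from a single passage
```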
Language can be considered another domain in the context of dense retrieval. Training a dense retrieval model is especially challenging in languages other than English due to the scarcity of training data. I propose a novel training technique, clustered training, aimed at improving the retrieval quality of dense retrievers, especially in out-of-distribution and zero-shot settings. I address the following research questions: (RQ2.1) Does clustered training improve the effectiveness of multilingual DPR models on in-distribution data? (RQ2.2) Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that they are trained on? (RQ2.3) Does clustered training help multilingual DPR models generalize to new languages (zero-shot)? Using the Mr. TyDi dataset [3], I show that clustered training improves the out-of-distribution and zero-shot performance of a DPR model without a clear loss in in-distribution performance.
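One plausible reading of clustered training is sketched below: training examples are grouped into clusters and each mini-batch is drawn from a single cluster, so the in-batch negatives used by the contrastive DPR loss come from related passages. The clustering criterion (k-means over passage embeddings) and the batching policy here are illustrative assumptions, not the exact procedure proposed in this work.

```python
# Sketch of cluster-then-batch training data selection (assumed mechanism).
import numpy as np
from sklearn.cluster import KMeans

def clustered_batches(passage_embeddings: np.ndarray, batch_size: int, n_clusters: int):
    # Assign every training example to a cluster, then yield batches that stay
    # within one cluster so in-batch negatives are drawn from similar passages.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(passage_embeddings)
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        np.random.shuffle(members)
        for start in range(0, len(members), batch_size):
            yield members[start:start + batch_size]  # indices of one training batch

# Example with random stand-in embeddings for 1,000 training passages.
emb = np.random.default_rng(0).standard_normal((1000, 768))
for batch in clustered_batches(emb, batch_size=32, n_clusters=8):
    pass  # each batch would feed the contrastive DPR loss with in-batch negatives
```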
Finally, I propose a modified dual-encoder architecture that can perform both retrieval and reranking with the same model in a single forward pass. While dual-encoder models can surpass traditional sparse retrieval methods, they lag behind two-stage retrieval pipelines in retrieval quality. I propose a modification to the dual-encoder model in which a second representation is used to rerank the passages retrieved using the first representation. No second-stage model is required, and both representations are generated in a single forward pass of the dual encoder. I aim to answer the following research questions in this work: (RQ3.1) Can the same model be trained to effectively generate two representations intended for two different uses? (RQ3.2) Can the retrieval quality of the model be improved by simultaneously performing retrieval and reranking? (RQ3.3) What is the tradeoff in retrieval quality, latency, and compute resource efficiency between the proposed method and a two-stage retriever? I expect the proposed architecture to improve dual-encoder retrieval quality without sacrificing throughput or requiring additional computational resources.
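The intended inference flow can be sketched as follows, under the assumption that the modified encoder emits two vectors per input: one matched against a pre-built index for first-stage retrieval, the other used to re-score the retrieved candidates. The encoder and index contents here are hypothetical placeholders, not the trained system.

```python
# Sketch of single-pass retrieve-and-rerank with two representations per input.
import numpy as np

rng = np.random.default_rng(0)

def encode(text: str) -> tuple[np.ndarray, np.ndarray]:
    # Placeholder for the modified dual encoder: one forward pass returns a
    # retrieval representation and a reranking representation.
    return rng.standard_normal(768), rng.standard_normal(768)

def retrieve_and_rerank(query: str, index_retrieval: np.ndarray,
                        index_rerank: np.ndarray, k: int = 100) -> np.ndarray:
    q_ret, q_rr = encode(query)                             # single forward pass
    top_k = np.argsort(index_retrieval @ q_ret)[::-1][:k]   # stage 1: retrieval
    rerank_scores = index_rerank[top_k] @ q_rr              # stage 2: reranking
    return top_k[np.argsort(rerank_scores)[::-1]]           # final ordering

# Toy corpus of 10,000 passages with pre-computed representations of both kinds.
index_ret = rng.standard_normal((10_000, 768))
index_rr = rng.standard_normal((10_000, 768))
print(retrieve_and_rerank("who wrote hamlet?", index_ret, index_rr)[:5])
```

Because both passage representations are pre-computed offline and the query needs only one encoder pass, the reranking step adds no second model to the serving path.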