DOI: 10.1145/3539618.3591796

Dense Passage Retrieval: Architectures and Augmentation Methods

Published: 18 July 2023

Abstract

The dual-encoder model is a dense retrieval architecture, consisting of two encoder models, that has surpassed traditional sparse retrieval methods for open-domain retrieval [1]. However, room for improvement remains, particularly when dense retrievers are exposed to unseen passages or queries.
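To make the architecture concrete, here is a minimal sketch of DPR-style dual-encoder scoring and the standard in-batch-negative training objective from [1]. The embedding tensors are assumed to come from any pair of text encoders (query and passage), which are omitted for brevity.

```python
# Minimal sketch of DPR-style dual-encoder scoring [1]. Query and passage
# embeddings are assumed to come from two independently parameterized text
# encoders (not shown).
import torch
import torch.nn.functional as F

def dpr_scores(query_vecs: torch.Tensor, passage_vecs: torch.Tensor) -> torch.Tensor:
    """Inner-product relevance: (num_queries, dim) x (num_passages, dim)
    -> (num_queries, num_passages)."""
    return query_vecs @ passage_vecs.T

def in_batch_negative_loss(query_vecs: torch.Tensor, passage_vecs: torch.Tensor) -> torch.Tensor:
    # Each query's positive passage sits at the same batch index; every
    # other passage in the batch serves as a negative.
    scores = dpr_scores(query_vecs, passage_vecs)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```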
Considering out-of-domain queries, i.e., queries originating from domains other than the one the model was trained on, the loss in accuracy can be significant. A main factor is the mismatch in the information available to the context encoder and the query encoder during training: common retrieval training datasets consist overwhelmingly of passages paired with only a single query. I hypothesize that this can lead the dual-encoder model, particularly the passage encoder, to overfit to a single potential query for a given passage, to the detriment of out-of-domain performance. Based on this, I seek to answer the following research question: (RQ1.1) Does training a DPR model on data containing multiple queries per passage improve the generalizability of the model? To answer RQ1.1, I build generated datasets that contain multiple queries for most passages and compare dense passage retriever models trained on these datasets against models trained on (mostly) single-query-per-passage datasets. I show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets [2].
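One practical detail worth noting (my own assumption about how such data would need to be handled, not a procedure stated above): with several queries per passage, naive batching can place two queries for the same passage in one batch, turning that passage into a false in-batch negative for the sibling query. A hypothetical sampling sketch:

```python
# Hypothetical batch sampler for multi-query-per-passage training data.
# `examples` is a list of {"query": ..., "pid": ...} records, where `pid`
# identifies the positive passage. All names are illustrative assumptions.
import random
from collections import defaultdict

def sample_batch(examples, batch_size):
    by_pid = defaultdict(list)
    for ex in examples:
        by_pid[ex["pid"]].append(ex)
    # Take at most one query per passage per batch, so no passage appears
    # as a false in-batch negative for another query of the same passage.
    pids = random.sample(list(by_pid), k=min(batch_size, len(by_pid)))
    return [random.choice(by_pid[pid]) for pid in pids]
```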
Language can be considered another domain in the context of dense retrieval. Training a dense retrieval model is especially challenging in languages other than English due to the scarcity of training data. I propose a novel training technique, clustered training, aimed at improving the retrieval quality of dense retrievers, especially in out-of-distribution and zero-shot settings. I address the following research questions: (RQ2.1) Does clustered training improve the effectiveness of multilingual DPR models on in-distribution data? (RQ2.2) Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages they are trained on? (RQ2.3) Does clustered training help multilingual DPR models generalize to new languages (zero-shot)? Using the Mr. TyDi dataset [3], I show that clustered training improves the out-of-distribution and zero-shot performance of a DPR model without a clear loss in in-distribution performance.
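The abstract does not detail the mechanics of clustered training, so the sketch below is only one plausible, assumed reading rather than the paper's verified procedure: cluster the training passages (here with k-means over their embeddings) and draw each batch from a single cluster, so that in-batch negatives are more similar and therefore harder.

```python
# One *assumed* realization of clustered training (not verified against the
# paper): cluster passages with k-means, then draw every training batch from
# a single cluster so in-batch negatives are topically close and harder.
import numpy as np
from sklearn.cluster import KMeans

def cluster_indices(passage_embeddings: np.ndarray, num_clusters: int = 8):
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(passage_embeddings)
    return [np.where(labels == c)[0] for c in range(num_clusters)]

def clustered_batches(clusters, batch_size, seed=0):
    rng = np.random.default_rng(seed)
    while True:  # infinite stream of single-cluster batches
        cluster = clusters[rng.integers(len(clusters))]
        yield rng.choice(cluster, size=min(batch_size, len(cluster)), replace=False)
```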
Finally, I propose a modified dual-encoder architecture that can perform both retrieval and reranking with the same model in a single forward pass. While dual-encoder models can surpass traditional sparse retrieval methods, they lag behind two-stage retrieval pipelines in retrieval quality. I propose a modification to the dual-encoder model in which a second representation is used to rerank the passages retrieved using the first representation. No second-stage model is required, and both representations are generated in a single forward pass from the dual encoder. I aim to answer the following research questions in this work: (RQ3.1) Can the same model be trained to effectively generate two representations intended for two uses? (RQ3.2) Can the retrieval quality of the model be improved by simultaneously performing retrieval and reranking? (RQ3.3) What is the tradeoff between retrieval quality, latency, and compute resource efficiency for the proposed method compared with a two-stage retriever? I expect the proposed architecture to improve dual-encoder retrieval quality without sacrificing throughput or requiring more computational resources.
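A hedged sketch of how such a two-representation encoder might be wired, assuming a shared backbone with two projection heads; all module and function names here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: one forward pass yields two representations per input.
# The first representation indexes/retrieves; the second rescores the
# retrieved candidates, so no separate reranker model is needed.
import torch
import torch.nn as nn

class TwoHeadEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, rep_dim: int):
        super().__init__()
        self.backbone = backbone  # any text encoder mapping inputs -> (batch, hidden_dim)
        self.retrieval_head = nn.Linear(hidden_dim, rep_dim)
        self.rerank_head = nn.Linear(hidden_dim, rep_dim)

    def forward(self, inputs):
        h = self.backbone(inputs)
        # Both representations come from the same single forward pass.
        return self.retrieval_head(h), self.rerank_head(h)

def retrieve_then_rerank(q_ret, q_rr, p_ret, p_rr, k=100):
    # Stage 1: coarse retrieval with the first representation.
    top = torch.topk(q_ret @ p_ret.T, k=min(k, p_ret.size(0)), dim=-1).indices
    # Stage 2: rerank only the retrieved candidates with the second
    # representation, still using the outputs of the same forward pass.
    rerank_scores = (q_rr.unsqueeze(1) * p_rr[top]).sum(-1)
    order = rerank_scores.argsort(dim=-1, descending=True)
    return torch.gather(top, 1, order)  # (num_queries, k) reranked passage ids
```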

References

[1] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 6769-6781.
[2] Thilina Rajapakse and Maarten de Rijke. 2023. Improving the Generalizability of the Dense Passage Retriever Using Generated Datasets. In ECIR 2023: 45th European Conference on Information Retrieval. Springer, 94-109.
[3] Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval. arXiv preprint arXiv:2108.08787 (2021).

Cited By

  • Nonnegative matrix factorization with Wasserstein metric-based regularization for enhanced text embedding. PLOS ONE 19(12): e0314762 (2024). https://doi.org/10.1371/journal.pone.0314762
  • Dynamic Demonstration Retrieval and Cognitive Understanding for Emotional Support Conversation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 774-784 (2024). https://doi.org/10.1145/3626772.3657695


Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN:9781450394086
DOI:10.1145/3539618
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dense retrieval
  2. generalizability

Qualifiers

  • Abstract

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
