DOI: 10.1145/3539618.3591796

Dense Passage Retrieval: Architectures and Augmentation Methods

Published: 18 July 2023

ABSTRACT

The dual-encoder model is a dense retrieval architecture consisting of two encoder models; it has surpassed traditional sparse retrieval methods for open-domain retrieval [1]. However, room for improvement remains, particularly when dense retrievers are exposed to unseen passages or queries.
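As background, here is a minimal sketch of the dual-encoder setup described above: a query encoder and a passage encoder produce dense vectors, and relevance is scored by their dot product, as in DPR [1]. The model names and CLS pooling below are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal dual-encoder sketch (illustrative; model choice is an assumption).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
passage_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    # Use the [CLS] token embedding as the dense representation,
    # following the original DPR formulation [1].
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0, :]

queries = ["who wrote the origin of species"]
passages = [
    "On the Origin of Species was written by Charles Darwin.",
    "The Eiffel Tower is located in Paris.",
]

# Relevance is the dot product between query and passage vectors.
scores = encode(query_encoder, queries) @ encode(passage_encoder, passages).T
print(scores)  # higher score -> more relevant passage
```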

Considering out-of-domain queries, i.e., queries originating from domains other than the one the model was trained on, the loss in accuracy can be significant. A main factor is the mismatch in the information available to the context encoder and the query encoder during training: common retrieval training datasets pair the overwhelming majority of passages with a single query. I hypothesize that this could lead the dual-encoder model, particularly the passage encoder, to overfit to a single potential query for a given passage, to the detriment of out-of-domain performance. Based on this, I seek to answer the following research question: (RQ1.1) Does training a DPR model on data containing multiple queries per passage improve the generalizability of the model? To answer RQ1.1, I build generated datasets that contain multiple queries for most passages, and compare dense passage retriever models trained on these datasets against models trained on (mostly) single-query-per-passage datasets. I show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets [2].
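To make this concrete, here is a hedged sketch of building a multi-query-per-passage training set with a doc2query-style query generator. The specific model name and sampling settings are assumptions for illustration; the dissertation's generation setup may differ.

```python
# Hedged sketch: generate several queries per passage so each passage
# contributes multiple (query, passage) training pairs.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "doc2query/msmarco-t5-base-v1"  # assumed generator, not the paper's
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def generate_queries(passage, n=3):
    # Sampling (rather than greedy decoding) yields diverse queries.
    inputs = tokenizer(passage, truncation=True, return_tensors="pt")
    outputs = model.generate(
        **inputs, max_length=64, do_sample=True, top_k=10, num_return_sequences=n
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

passage = "On the Origin of Species was written by Charles Darwin in 1859."
# The passage encoder now sees several plausible queries for the same passage.
training_pairs = [(q, passage) for q in generate_queries(passage)]
```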

Language can be considered another domain in the context of dense retrieval. Training a dense retrieval model is especially challenging in languages other than English due to the scarcity of training data. I propose a novel training technique, clustered training, aimed at improving the retrieval quality of dense retrievers, especially in out-of-distribution and zero-shot settings. I address the following research questions: (RQ2.1) Does clustered training improve the effectiveness of multilingual DPR models on in-distribution data? (RQ2.2) Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that they are trained on? (RQ2.3) Does clustered training help multilingual DPR models generalize to new languages (zero-shot)? Using the Mr. TyDi [3] dataset, I show that clustered training improves the out-of-distribution and zero-shot performance of a DPR model without a clear loss in in-distribution performance.
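The abstract does not spell out the mechanics of clustered training, so the following is a hedged sketch of one plausible reading: training examples are grouped into clusters and each batch is drawn from a single cluster, so that in-batch negatives come from similar passages. The clustering criterion and batch construction here are assumptions, not the method's specification.

```python
# Hedged sketch of cluster-wise batch construction (assumed reading of
# "clustered training"; the actual method may differ).
import random
from collections import defaultdict

def clustered_batches(examples, cluster_of, batch_size, seed=0):
    """examples: list of (query, passage) pairs.
    cluster_of: function mapping an example to a cluster id
    (e.g., language or an embedding-based cluster)."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for ex in examples:
        clusters[cluster_of(ex)].append(ex)
    batches = []
    for members in clusters.values():
        rng.shuffle(members)
        # Tail batches may be smaller; a real implementation might drop them.
        for i in range(0, len(members), batch_size):
            batches.append(members[i:i + batch_size])
    rng.shuffle(batches)  # interleave clusters across training steps
    return batches
```

These batches could then feed a standard DPR training loop with in-batch negatives, the point being that negatives within a batch are drawn from the same cluster.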

Finally, I propose a modified dual-encoder architecture that can perform both retrieval and reranking with the same model in a single forward pass. While dual-encoder models can surpass traditional sparse retrieval methods, they lag behind two-stage retrieval pipelines in retrieval quality. I propose a modification to the dual-encoder model in which a second representation is used to rerank the passages retrieved using the first representation. No second-stage model is required, and both representations are generated in a single forward pass of the dual encoder. I aim to answer the following research questions in this work: (RQ3.1) Can the same model be trained to effectively generate two representations intended for two uses? (RQ3.2) Can the retrieval quality of the model be improved by simultaneously performing retrieval and reranking? (RQ3.3) What is the trade-off in retrieval quality, latency, and compute resource efficiency between the proposed method and a two-stage retriever? I expect the proposed architecture to improve dual-encoder retrieval quality without sacrificing throughput or requiring more computational resources.
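A hedged sketch of the proposed architecture as described: one forward pass yields two representations per text, the first used for first-stage retrieval and the second for reranking the retrieved candidates. The pooling, head design, and function names below are illustrative assumptions, not the dissertation's implementation.

```python
# Hedged sketch: two representations from a single forward pass.
import torch
import torch.nn as nn

class TwoHeadEncoder(nn.Module):
    def __init__(self, backbone, dim=768):
        super().__init__()
        self.backbone = backbone            # e.g., a BERT-style encoder
        self.retrieval_head = nn.Linear(dim, dim)
        self.rerank_head = nn.Linear(dim, dim)

    def forward(self, **inputs):
        # One pass over the backbone; two projections of the [CLS] state.
        hidden = self.backbone(**inputs).last_hidden_state[:, 0, :]
        return self.retrieval_head(hidden), self.rerank_head(hidden)

def search(query_vecs, passage_vecs, k=100):
    """Stage 1: retrieve top-k by the retrieval representation.
    Stage 2: re-order those k by the reranking representation."""
    q_ret, q_rr = query_vecs
    p_ret, p_rr = passage_vecs
    top_k = (q_ret @ p_ret.T).topk(k, dim=-1).indices           # retrieval
    rerank_scores = (q_rr.unsqueeze(1) * p_rr[top_k]).sum(-1)   # reranking
    return top_k.gather(-1, rerank_scores.argsort(-1, descending=True))

# Toy usage with random vectors as stand-ins for encoder outputs:
q = (torch.randn(2, 768), torch.randn(2, 768))
p = (torch.randn(500, 768), torch.randn(500, 768))
print(search(q, p, k=10).shape)  # (2, 10): reranked passage indices per query
```

Note that `search` scores all passages only once, with the cheap dot product, and applies the second representation only to the k retrieved candidates, which is how the single model can avoid a separate reranking stage.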

References

  1. Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 6769--6781.
  2. Thilina Rajapakse and Maarten de Rijke. 2023. Improving the Generalizability of the Dense Passage Retriever Using Generated Datasets. In ECIR 2023: 45th European Conference on Information Retrieval. Springer, 94--109.
  3. Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021. Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval. arXiv preprint arXiv:2108.08787 (2021).

Published in

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023, 3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618

            Copyright © 2023 Owner/Author


Publisher: Association for Computing Machinery, New York, NY, United States
