Contextualized query expansion via unsupervised chunk selection for text retrieval

https://doi.org/10.1016/j.ipm.2021.102672

Highlights

  • A BERT-based query expansion (QE) model to identify relevant information from text.

  • Novel QE components to better trade off efficiency against effectiveness.

  • Evaluation on two standard TREC test collections demonstrates superior performance.

  • Analysis provides insights on how to fine-tune a BERT ranker for long documents.

Abstract

When ranking a list of documents for a given query, vocabulary mismatches between the language used in the queries and that used in the documents can compromise retrieval performance. Although BERT-based re-rankers have significantly advanced the state of the art, such mismatches still exist. Moreover, recent work has demonstrated that it is non-trivial to boost the performance of BERT-based re-rankers with established query expansion methods. This paper therefore proposes a novel query expansion model based on unsupervised chunk selection, coined BERT-QE. In particular, BERT-QE consists of three phases. After performing a first-round re-ranking in phase one, BERT-QE leverages the strength of the BERT model to select relevant text chunks from feedback documents in phase two, and uses them for the final re-ranking in phase three. Furthermore, different variants of BERT-QE are thoroughly investigated for a better trade-off between effectiveness and efficiency, including the use of smaller BERT variants and of recently proposed late interaction methods. On the standard TREC Robust04 and GOV2 test collections, the proposed BERT-QE model significantly outperforms BERT-Large models. Notably, the best variant of BERT-QE can significantly outperform BERT-Large on shallow metrics with less than 1% extra computation.

Introduction

In an information retrieval (IR) system, documents are ranked in descending order of their relevance to a given query. Recent advances have shown substantial performance gains on ad-hoc text retrieval tasks by using large-scale pre-trained transformer-based language models to evaluate the relevance of individual query-document pairs, as in BERT-based (Devlin, Chang, Lee, & Toutanova, 2019) re-rankers, which improve upon classical IR models by a wide margin on different benchmarks (Dai and Callan, 2019, Li et al., 2020, Nogueira and Cho, 2019, Yilmaz et al., 2019). However, vocabulary mismatches between the query and the document, due to their differences in brevity, length, and even format, make the relevance evaluation sub-optimal. Different query expansion methods have been proposed to address this mismatch by exploiting pseudo relevance feedback (PRF) to expand the query (Amati, 2003, Lavrenko and Croft, 2001, Metzler and Croft, 2007, Rocchio, 1971) before evaluating the relevance of a query-document pair. As expansion units, existing works rely either on words, as in RM3 (Lavrenko & Croft, 2001) and KL (Amati, 2003), or on phrases (Metzler & Croft, 2007). In the context of neural approaches, recent neural PRF architectures (Li et al., 2018, Wang, Luo et al., 2020) use entire feedback documents for expansion. However, as shown in Padaki, Dai, and Callan (2020), existing expansion methods may not be directly applicable to BERT-based ranking models. In fact, Padaki et al. (2020) demonstrate that using RM3 (Lavrenko & Croft, 2001) on top of a fine-tuned BERT model significantly dampens the ranking quality, highlighting the difficulty of performing query expansion for BERT-based ranking models. Besides, the reliance on PRF information makes existing expansion methods prone to picking up non-relevant information from the feedback documents, which can pollute the query and lead to topic shift (Macdonald & Ounis, 2007). Moreover, the expansion units selected by existing methods are either too short, e.g., words (Amati, 2003, Lavrenko and Croft, 2001), or too long, e.g., entire documents (Li et al., 2018, Wang, Luo et al., 2020), to introduce relevant information without also bringing in non-relevant information.

To bridge this gap, in this work we propose a novel query expansion model, coined BERT-QE, designed particularly for BERT-based ranking models. The proposed BERT-QE model unsupervisedly selects relevant information from PRF documents in the form of text chunks, providing more flexibility in the granularity of expansion. Furthermore, a novel architecture is proposed to re-weight the relevance of individual documents using the selected expansion chunks on top of the BERT-based ranker, achieving superior performance by mitigating the topic shift problem. In particular, given a query and a list of feedback documents from an initial ranking (e.g., from BM25), we propose to re-rank the documents in three sequential phases, as illustrated in Fig. 1. In phase one, the documents are re-ranked with a fine-tuned BERT model and the top-ranked documents are used as PRF documents; in phase two, these PRF documents are decomposed into text chunks of fixed length (e.g., 10 terms), and the relevance of each individual chunk is evaluated; finally, to assess the relevance of a given document, the selected chunks and the original query are used to score the document jointly (a schematic sketch of this pipeline is given after the contribution list below). We release the source code and related resources for reproducibility.2 The contributions of this work are as follows:

  • 1.

    A novel query expansion model is proposed to exploit the strength of contextualized model BERT in identifying relevant information from feedback documents.

  • 2.

    Evaluation on two standard TREC test collections, namely Robust04 and GOV2, demonstrates that the proposed BERT-QE-LLL, which uses BERT-Large in all three phases, significantly improves upon BERT-Large on both shallow and deep metrics.

  • 3.

    Several novel components are proposed on top of BERT-QE to better trade off efficiency against effectiveness. In particular, we investigate the use of smaller BERT components in different phases and demonstrate that a smaller variant of BERT-QE, e.g., BERT-QE-LMT, can significantly outperform BERT-Large on shallow metrics with no more than 1% extra computational cost, while a larger variant, e.g., BERT-QE-LLS, can significantly outperform BERT-Large on both shallow and deep metrics with 2% more computation. Besides, we also propose two novel building blocks for phase two and phase three to further improve the efficiency.
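A schematic sketch of the three-phase pipeline follows. It assumes a generic scorer bert_score(text_a, text_b) standing in for a fine-tuned BERT relevance model; all function names, hyper-parameter defaults, and the sliding-window stride below are illustrative assumptions, not the authors' released implementation.

    import math

    def bert_qe_rerank(query, candidates, bert_score,
                       k_d=10, chunk_len=10, stride=5, k_c=10, alpha=0.4):
        """Re-rank a list of document strings for `query` in three phases."""
        # Phase one: first-round re-ranking with the fine-tuned BERT scorer;
        # the top k_d documents serve as pseudo relevance feedback (PRF).
        first_pass = sorted(candidates, key=lambda d: bert_score(query, d),
                            reverse=True)
        feedback_docs = first_pass[:k_d]

        # Phase two: decompose the PRF documents into fixed-length chunks
        # with a sliding window, then score every chunk against the query.
        chunks = []
        for doc in feedback_docs:
            terms = doc.split()
            for i in range(0, max(len(terms) - chunk_len, 0) + 1, stride):
                chunks.append(" ".join(terms[i:i + chunk_len]))
        chunk_scores = {c: bert_score(query, c) for c in set(chunks)}
        top_chunks = sorted(chunk_scores, key=chunk_scores.get,
                            reverse=True)[:k_c]

        # Softmax-normalize the relevance scores of the selected chunks.
        z = sum(math.exp(chunk_scores[c]) for c in top_chunks)
        weights = [math.exp(chunk_scores[c]) / z for c in top_chunks]

        # Phase three: score each document jointly by the original query and
        # the selected chunks, interpolating the two sources of evidence.
        def final_score(doc):
            chunk_evidence = sum(w * bert_score(c, doc)
                                 for w, c in zip(weights, top_chunks))
            return (1 - alpha) * bert_score(query, doc) + alpha * chunk_evidence

        return sorted(first_pass, key=final_score, reverse=True)

Note that phase three scores every candidate document against each of the k_c selected chunks, so the number of BERT inference calls grows roughly linearly with k_c; this is precisely the cost that the smaller-BERT and late interaction variants investigated in this paper aim to reduce.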

This paper is an extension of a previous paper that appeared in Findings of ACL: EMNLP 2020 (Zheng et al., 2020). We extend the previous version as follows:

  • 1.

    Two novel alternative designs are proposed for phase two and phase three, providing a better trade-off between effectiveness and efficiency.

  • 2.

    The efficiency is investigated more thoroughly, especially in terms of document ranking, which was not discussed in the preliminary version.

  • 3.

    More analyses are included to provide a better understanding of the proposed method and of the results.

  • 4.

    A detailed analysis is conducted on the configurations for fine-tuning the first-round BERT re-ranker, which is crucial for the document ranking task.

The remainder of this paper is organized as follows. In Section 2, we review the related work. In Section 3, we describe the proposed BERT-QE model and the methods to trade off efficiency against effectiveness. The experimental setup and the evaluation results are presented in Sections 4 and 5, respectively. We conduct further experimental analyses in Section 6, before concluding this work in Section 7.

Section snippets

BERT for IR

In the past few years, neural ranking models built on Word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) or GloVe (Pennington, Socher, & Manning, 2014) word embeddings, such as DRMM (Guo, Fan, Ai, & Croft, 2016), PACRR (Hui, Yates, Berberich, & de Melo, 2017), ADRM (Liu, Li et al., 2020), and KNRM (Xiong, Dai, Callan, Liu, & Power, 2017), have shown the ability to improve upon classical probabilistic retrieval approaches. More recently, inspired by the success of contextualized models

Method

In this section we describe BERT-QE, which takes as input a query and a ranked list of documents for this query (e.g., from an unsupervised ranking model) and outputs a re-ranked list based on the expanded queries. The proposed BERT-QE performs query expansion unsupervisedly and can thus be used on top of any BERT-based ranking model.
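For reference, the scoring scheme can be summarized as follows; this is a paraphrase of the formulation in the conference version (Zheng et al., 2020), with notation slightly simplified. Let $\mathrm{rel}(\cdot,\cdot)$ denote the relevance score produced by the fine-tuned BERT model and $C$ the set of chunks selected in phase two. The chunk-based evidence for a document $d$ is a softmax-weighted combination of chunk-document scores, which is then interpolated with the query-document score:

$$\mathrm{rel}(C, d) = \sum_{c_i \in C} \frac{\exp\big(\mathrm{rel}(q, c_i)\big)}{\sum_{c_j \in C} \exp\big(\mathrm{rel}(q, c_j)\big)} \, \mathrm{rel}(c_i, d), \qquad \mathrm{rel}_{\mathrm{final}}(q, d) = (1 - \alpha)\,\mathrm{rel}(q, d) + \alpha\,\mathrm{rel}(C, d),$$

where $\alpha \in [0, 1]$ balances the original query against the expansion chunks.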

Dataset and metrics

Akin to Guo et al. (2016) and Yilmaz et al. (2019), we use two representative ad-hoc retrieval datasets: Robust04 (Voorhees, 2004) and GOV2 (Clarke, Craswell, & Soboroff, 2004). Robust04 is a newswire collection with 249 queries and 528,155 news articles used for TREC Robust Track 2004. GOV2 is a Web collection used for TREC Terabyte Tracks 2004, 2005 and 2006, consisting of 150 queries and 25,205,179 documents crawled from government websites. For both datasets, we employ the title-only

Results

In this section, we report results for the proposed BERT-QE model and compare them to the baseline models. We consider the following research questions:

  • RQ1: Can BERT-QE outperform baseline models in terms of effectiveness, especially compared with those based on BERT-Large? (Section 5.1)

  • RQ2: Is BERT-QE still effective when using smaller BERT building blocks? (Section 5.2)

  • RQ3: Is BERT-QE still effective when using two more efficient alternatives, namely, the Late Interaction (LI) for phase two

Analysis

Conclusion

We propose a novel expansion model, coined BERT-QE, to better select relevant information for query expansion. The proposed BERT-QE consists of three phases. In phase one, we perform the first-round re-ranking using a fine-tuned BERT model. In phases two and three, we use the BERT model to unsupervisedly select expansion chunks and use them to re-evaluate the relevance scores of documents. Besides, in order to trade off efficiency against effectiveness, we explore two different methods in

CRediT authorship contribution statement

Zhi Zheng: Methodology, Investigation, Software, Validation, Writing - original draft. Kai Hui: Methodology, Writing - original draft, Writing - review & editing. Ben He: Conceptualization, Writing - review & editing. Xianpei Han: Conceptualization, Methodology. Le Sun: Supervision. Andrew Yates: Writing - review & editing.

Acknowledgment

This work is supported by the University of Chinese Academy of Sciences.

References (51)

  • Dai, Z., et al. Deeper text understanding for IR with contextual neural language modeling.

  • Devlin, J., et al. BERT: Pre-training of deep bidirectional transformers for language understanding.

  • Diaz, F., et al. Query expansion with locally-trained word embeddings.

  • Gao, L., et al. Modularized transformer-based ranking framework.

  • Hui, K., et al. A comparative study of pseudo relevance feedback for ad-hoc retrieval.

  • Hui, K., et al. PACRR: A position-aware neural IR model for relevance matching.

  • Hui, K., et al. Co-PACRR: A context-aware neural IR model for ad-hoc retrieval.

  • Karpukhin, V., et al. Dense passage retrieval for open-domain question answering.

  • Khattab, O., et al. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT.

  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd international conference on learning...

  • Lavrenko, V., et al. Relevance-based language models.

  • Li, C., et al. NPRF: A neural pseudo relevance feedback framework for ad-hoc information retrieval.

  • Li, C., et al. PARADE: Passage representation aggregation for document reranking (2020).

  • Liu, B., et al. An attention-based deep relevance model for few-shot document filtering. ACM Transactions on Information Systems (2020).

  • Liu, W., et al. FastBERT: A self-distilling BERT with adaptive inference time.

1 This work was performed before joining Amazon.