Abstract
In Information Retrieval, and more generally in Natural Language Processing, models are typically adapted to specific domains through fine-tuning. Despite the successes of this method and its versatility, its need for human-curated and labeled data makes it impractical to transfer to new tasks, domains, and/or languages for which no training data exists. Using the model without any training (zero-shot) is another option, but it comes at an effectiveness cost, especially for first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a new task or language; the literature on domain (or topic) adaptation is scarcer. In this paper, we address cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. We show that, despite their relatively good generalization ability, even sparse retrievers can benefit from our simple domain adaptation method.
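The abstract refers to pre-training on the target corpus to inject domain-specific knowledge before the retriever is used or fine-tuned. As a rough illustration only (the paper's exact corpus, setup, and hyperparameters are not reproduced here, and the file name and training settings below are assumptions), the following sketch shows continued masked-language-model pre-training of bert-base-uncased on a target-domain text file with HuggingFace Transformers:

```python
# Minimal sketch: continued masked-language-model (MLM) pre-training of
# bert-base-uncased on a target-domain corpus with HuggingFace Transformers.
# The corpus path, sequence length, and hyperparameters are illustrative,
# not the paper's exact recipe.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Target-domain documents, one per line (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "target_domain_docs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, as in standard BERT pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-adapted",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()

# The adapted encoder can then initialize a sparse retriever (e.g., SPLADE)
# that is trained on general-domain data or applied zero-shot to the target domain.
model.save_pretrained("bert-domain-adapted")
tokenizer.save_pretrained("bert-domain-adapted")
```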
Notes
- 1. The model is made available by Google on the HuggingFace Hub: bert-base-uncased.
- 2.
Acknowledgements
We wish to thank Basile Van Cooten, from Sinequa, for his support and supervision of Mathias Vast's PhD, and in particular on this paper. This work is supported by the ANR project ANR-23-IAS1-0003.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Vast, M., Zong, Y., Piwowarski, B., Soulier, L. (2024). Simple Domain Adaptation for Sparse Retrievers. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_32
DOI: https://doi.org/10.1007/978-3-031-56063-7_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56062-0
Online ISBN: 978-3-031-56063-7
eBook Packages: Computer Science (R0)