Simple Domain Adaptation for Sparse Retrievers

  • Conference paper
  • First Online:
Advances in Information Retrieval (ECIR 2024)

Abstract

In Information Retrieval, and more generally in Natural Language Processing, adapting models to specific domains is typically done through fine-tuning. Despite the successes achieved by this method and its versatility, the need for human-curated and labeled data makes it impractical to transfer to new tasks, domains, and/or languages when no training data exists. Using the model without training (zero-shot) is another option, but it comes at an effectiveness cost, especially for first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language. However, the literature is scarcer for domain (or topic) adaptation. In this paper, we address the issue of cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. We show that, despite their relatively good generalization ability, even sparse retrievers can benefit from our simple domain adaptation method.
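In practice, the adaptation step described in the abstract amounts to continuing the masked-language-model (MLM) pre-training of the retriever's backbone on the target-domain corpus, without any relevance labels, before the backbone is used in the sparse retriever. The sketch below illustrates this step with the HuggingFace transformers library and the bert-base-uncased backbone mentioned in the notes; the corpus path and hyper-parameters are illustrative placeholders, and the sketch is an approximation of the general recipe rather than the authors' exact pipeline.

```python
# Minimal sketch: continue MLM pre-training of a BERT backbone on the
# target-domain corpus (no relevance labels), then plug the adapted
# checkpoint into a sparse retriever. Placeholder paths/hyper-parameters.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # backbone referred to in note 1
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Target-domain documents, one passage per line; "target_corpus.txt" is a
# placeholder path standing in for the collection to adapt to.
corpus = load_dataset("text", data_files={"train": "target_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style random masking of 15% of the tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-adapted",
    per_device_train_batch_size=16,  # placeholder hyper-parameters
    num_train_epochs=1,
    learning_rate=5e-5,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()

# The adapted checkpoint in "bert-domain-adapted" can then initialise a sparse
# retriever (e.g. SPLADE) for zero-shot use or for further fine-tuning.
```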

Notes

  1. The model is made available by Google on the HuggingFace Hub: bert-base-uncased.

  2. https://git.isir.upmc.fr/mat_vast/cross_domain_adaptation.


Acknowledgements

We wish to thank Basile Van Cooten, from Sinequa, for his support and supervision of Mathias Vast’s PhD, and in particular of this paper. This work is supported by the ANR project ANR-23-IAS1-0003.

Author information

Corresponding author

Correspondence to Mathias Vast.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vast, M., Zong, Y., Piwowarski, B., Soulier, L. (2024). Simple Domain Adaptation for Sparse Retrievers. In: Goharian, N., et al. (eds.) Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_32

  • DOI: https://doi.org/10.1007/978-3-031-56063-7_32

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56062-0

  • Online ISBN: 978-3-031-56063-7

  • eBook Packages: Computer Science, Computer Science (R0)
