Simple Domain Adaptation for Sparse Retrievers

  • Conference paper
  • First Online:
Advances in Information Retrieval (ECIR 2024)

Abstract

In Information Retrieval, and more generally in Natural Language Processing, adapting models to specific domains is typically done through fine-tuning. Despite the successes achieved by this method and its versatility, the need for human-curated and labeled data makes it impractical to transfer to new tasks, domains, and/or languages when no training data exists. Using the model without training (zero-shot) is another option, but it comes at an effectiveness cost, especially for first-stage retrievers. Numerous research directions have emerged to tackle these issues, most of them in the context of adapting to a task or a language. However, the literature is scarcer for domain (or topic) adaptation. In this paper, we address the issue of cross-topic discrepancy for a sparse first-stage retriever by transposing a method initially designed for language adaptation. By leveraging pre-training on the target data to learn domain-specific knowledge, this technique alleviates the need for annotated data and expands the scope of domain adaptation. We show that, despite their relatively good generalization ability, even sparse retrievers can benefit from our simple domain adaptation method.
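In practice, the adaptation step described in the abstract amounts to continuing the masked-language-model (MLM) pre-training of the retriever's backbone on the target-domain corpus, without any relevance labels, before the backbone is used in the sparse retriever. The sketch below illustrates this step with the HuggingFace transformers library and the bert-base-uncased backbone mentioned in the notes; the corpus path and hyper-parameters are illustrative placeholders, and the sketch is an approximation of the general recipe rather than the authors' exact pipeline.

```python
# Minimal sketch: continue MLM pre-training of a BERT backbone on the
# target-domain corpus (no relevance labels), then plug the adapted
# checkpoint into a sparse retriever. Placeholder paths/hyper-parameters.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # backbone referred to in note 1
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Target-domain documents, one passage per line; "target_corpus.txt" is a
# placeholder path standing in for the collection to adapt to.
corpus = load_dataset("text", data_files={"train": "target_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style random masking of 15% of the tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-adapted",
    per_device_train_batch_size=16,  # placeholder hyper-parameters
    num_train_epochs=1,
    learning_rate=5e-5,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()

# The adapted checkpoint in "bert-domain-adapted" can then initialise a sparse
# retriever (e.g. SPLADE) for zero-shot use or for further fine-tuning.
```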

Notes

  1. The model is made available by Google on the HuggingFace Hub: bert-base-uncased.

  2. https://git.isir.upmc.fr/mat_vast/cross_domain_adaptation.


Acknowledgements

We wish to thank Basile Van Cooten, from Sinequa, for his support and supervision of Mathias Vast’s PhD, and in particular of this paper. This work is supported by the ANR project ANR-23-IAS1-0003.

Author information

Corresponding author

Correspondence to Mathias Vast.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vast, M., Zong, Y., Piwowarski, B., Soulier, L. (2024). Simple Domain Adaptation for Sparse Retrievers. In: Goharian, N., et al. (eds.) Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_32

  • DOI: https://doi.org/10.1007/978-3-031-56063-7_32

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56062-0

  • Online ISBN: 978-3-031-56063-7

  • eBook Packages: Computer Science, Computer Science (R0)
