Dataset Selection for Transfer Learning in Information Retrieval

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1342)

Abstract

Information Retrieval is the task of satisfying an information need by retrieving relevant information from large collections. Recently, deep neural networks have achieved several performance breakthroughs in the field, owing to the availability of large-scale training sets. When training data is limited, however, neural retrieval systems vastly underperform. To compensate for the lack of training data, researchers have turned to transfer learning by relying on labelled data from other search domains. Despite having access to several publicly available datasets, researchers are currently unguided in selecting the best training set for their particular applications. To address this knowledge gap, we propose a rigorous method to select an optimal training set for a specific search domain. We validate this method on the TREC-COVID challenge, which was organized by the Allen Institute for Artificial Intelligence and the National Institute of Standards and Technology. Our neural model ranked first among 143 competing systems. More importantly, it was able to achieve this result by training on a dataset that was selected using our proposed method. This work highlights the performance gains that may be achieved through careful dataset selection in transfer learning.
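The selection method itself is detailed in the paper; as a rough illustration of the general idea, the sketch below scores candidate training sets against a target search domain and keeps the closest match. The TF-IDF centroid similarity used here is a stand-in criterion, and the function name and data layout are assumptions for illustration, not the paper's actual procedure.

```python
# Illustrative sketch only: scores each candidate training corpus against a
# target search domain and returns the best match. TF-IDF centroid cosine
# similarity is a stand-in criterion, not the selection method from the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_training_set(candidates, target_docs):
    """candidates: dict mapping dataset name -> sample of its documents.
    target_docs: sample of documents from the target search domain."""
    # Fit a single vocabulary over all text so the vectors are comparable.
    all_docs = target_docs + [d for docs in candidates.values() for d in docs]
    vectorizer = TfidfVectorizer(stop_words="english").fit(all_docs)

    # Represent each corpus by the centroid of its TF-IDF vectors.
    target_centroid = np.asarray(vectorizer.transform(target_docs).mean(axis=0))

    scores = {}
    for name, docs in candidates.items():
        centroid = np.asarray(vectorizer.transform(docs).mean(axis=0))
        scores[name] = cosine_similarity(target_centroid, centroid)[0, 0]

    # Choose the candidate whose language most resembles the target domain.
    return max(scores, key=scores.get)

# Hypothetical usage: pick a training set for a COVID-19 literature search task.
# best = select_training_set({"ms_marco": marco_sample, "nq": nq_sample},
#                            cord19_sample)
```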


Notes

  1. This result was achieved by the authors in [20], who re-implemented the work in [5].

  2. This result was achieved by the authors in [20], who re-implemented the work in [10].

  3. The official evaluation software used by the organizers of TREC-COVID was trec-eval, which can be downloaded at https://trec.nist.gov/trec_eval/. A minimal sketch of one of its measures follows these notes.

  4. TREC-COVID round 1 leaderboard: https://ir.nist.gov/covidSubmit.
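Note 3 refers to trec-eval, the official scorer used by the TREC-COVID organizers. As a rough offline illustration, the sketch below computes nDCG@k for a single topic using the linear-gain formulation applied by trec-eval's ndcg_cut measure; the function and data layout are illustrative assumptions, not part of the official tooling.

```python
import math

def ndcg_at_k(ranked_docs, qrels, k=10):
    """nDCG@k for one topic, with trec-eval-style linear gain.

    ranked_docs: doc ids in system ranking order.
    qrels: dict mapping doc id -> graded relevance judgment."""
    # Discounted cumulative gain of the system ranking (ranks are 1-based).
    dcg = sum(qrels.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked_docs[:k], start=1))
    # Ideal DCG: the k highest judgments placed in the best possible order.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: a run that ranks the highly relevant doc first scores near 1.
print(ndcg_at_k(["d1", "d2", "d3"], {"d1": 2, "d3": 1}))  # ~0.95
```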

References

  1. Asadi, N., Metzler, D., Elsayed, T., Lin, J.: Pseudo test collections for learning web search ranking functions. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, pp. 1073–1082 (2011)

  2. Bajaj, P., et al.: MS MARCO: a human generated machine reading comprehension dataset. In: 30th Conference on Neural Information Processing Systems, Barcelona, Spain (2016)

  3. Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.: Overview of the TREC 2019 deep learning track. arXiv:2003.07820 (2020)

  4. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, pp. 985–988. Association for Computing Machinery (2019)

  5. Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.: Neural ranking models with weak supervision. In: Proceedings of SIGIR 2017, Shinjuku, Tokyo, Japan (2017)

  6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, Minneapolis, Minnesota, pp. 4171–4186. Association for Computational Linguistics (2019)

  7. Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguist. 7, 453–466 (2019)

  8. Lin, J.: The neural hype and comparisons against weak baselines. ACM SIGIR Forum 52(2), 40–51 (2019)

  9. MacAvaney, S., Cohan, A., Goharian, N.: SLEDGE: A simple yet effective baseline for coronavirus scientific knowledge search. arXiv:2005.02365 (2020)

  10. MacAvaney, S., Yates, A., Hui, K., Frieder, O.: Content-based weak supervision for ad hoc re-ranking. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, pp. 993–996. Association for Computing Machinery (2019)

  11. Marchesin, S., Purpura, A., Silvello, G.: Focal elements of neural information retrieval models. An outlook through a reproducibility study. Inf. Process. Manage. 57, 102109 (2020)

  12. Nayak, P.: Understanding searches better than ever before. https://www.blog.google/products/search/search-language-understanding-bert/. Accessed 18 May 2020

  13. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv:1901.04085 (2019)

  14. Rao, J., Yang, W., Zhang, Y., Ture, F., Lin, J.: Multi-perspective relevance matching with hierarchical ConvNets for social media search. In: The 33rd AAAI Conference on Artificial Intelligence, AAAI-19, vol. 33, pp. 232–240 (2019)

  15. Roberts, K., et al.: TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19. J. Am. Med. Inform. Assoc. 27, 1431–1436 (2020)

  16. Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence (2016)

  17. Kouw, W.M., Loog, M.: An introduction to domain adaptation and transfer learning. arXiv:1812.11806 (2019)

  18. Yang, W., Lu, K., Yang, P., Lin, J.: Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, pp. 1129–1132. Association for Computing Machinery (2019)

  19. Yilmaz, Z., Yang, W., Zhang, H., Lin, J.: Cross-domain modeling of sentence-level evidence for document retrieval. In: 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 3481–3487. Association for Computational Linguistics (2019)

  20. Zhang, K., Xiong, C., Liu, Z., Liu, Z.: Selective weak supervision for neural information retrieval. In: Proceedings of The Web Conference 2020 (WWW 2020), Taipei, Taiwan (2020)

Author information

Corresponding author

Correspondence to Yastil Rughbeer.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Rughbeer, Y., Pillay, A.W., Jembere, E. (2020). Dataset Selection for Transfer Learning in Information Retrieval. In: Gerber, A. (ed.) Artificial Intelligence Research. SACAIR 2021. Communications in Computer and Information Science, vol 1342. Springer, Cham. https://doi.org/10.1007/978-3-030-66151-9_4

  • DOI: https://doi.org/10.1007/978-3-030-66151-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66150-2

  • Online ISBN: 978-3-030-66151-9

  • eBook Packages: Computer Science, Computer Science (R0)
