Abstract
Information Retrieval is the task of satisfying an information need by retrieving relevant information from large collections. Recently, deep neural networks have achieved several performance breakthroughs in the field, owing to the availability of large-scale training sets. When training data is limited, however, neural retrieval systems vastly underperform. To compensate for the lack of training data, researchers have turned to transfer learning, relying on labelled data from other search domains. Despite having access to several publicly available datasets, researchers currently have little guidance in selecting the best training set for a particular application. To address this knowledge gap, we propose a rigorous method for selecting an optimal training set for a specific search domain. We validate this method on the TREC-COVID challenge, which was organized by the Allen Institute for Artificial Intelligence and the National Institute of Standards and Technology. Our neural model ranked first among 143 competing systems, and it achieved this result by training on a dataset selected using our proposed method. This work highlights the performance gains that careful dataset selection can deliver in transfer learning.
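The abstract does not spell out the selection criterion itself, so the Python sketch below is purely illustrative: it scores each candidate labelled dataset by how closely its vocabulary distribution matches the target search domain and picks the closest. The Jensen-Shannon similarity measure, the function names, and the toy corpora are all assumptions for illustration, not the authors' actual method.

import math
import re
from collections import Counter

def unigram_dist(docs):
    # Build a unigram probability distribution over the words of a corpus.
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    # Jensen-Shannon divergence between two word distributions (0 = identical).
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def select_training_set(candidates, target_docs):
    # Rank candidate labelled datasets by closeness to the target search
    # domain; return the closest dataset's name together with all scores.
    target = unigram_dist(target_docs)
    scores = {name: js_divergence(unigram_dist(docs), target)
              for name, docs in candidates.items()}
    return min(scores, key=scores.get), scores

# Toy usage: pick the candidate corpus closest to a COVID-19 target domain.
candidates = {
    "general_web_qa": ["what is the capital of france", "best pizza near me"],
    "biomedical_qa": ["coronavirus spike protein binding", "sars cov 2 transmission"],
}
target_docs = ["covid 19 incubation period", "sars cov 2 vaccine efficacy"]
best, scores = select_training_set(candidates, target_docs)
print(best, scores)

In a realistic setting the candidates would be full labelled collections (e.g. MS MARCO) and the target corpus would be the document collection to be searched, such as the CORD-19 articles used in TREC-COVID.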
Notes
- 1.
- 2.
- 3. The official evaluation software used by the organizers of TREC-COVID was trec_eval, which can be downloaded at https://trec.nist.gov/trec_eval/.
- 4. TREC-COVID round 1 leaderboard: https://ir.nist.gov/covidSubmit.