Dataset Selection for Transfer Learning in Information Retrieval

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1342)

Abstract

Information Retrieval is the task of satisfying an information need by retrieving relevant information from large collections. Recently, deep neural networks have achieved several performance breakthroughs in the field, owing to the availability of large-scale training sets. When training data is limited, however, neural retrieval systems vastly underperform. To compensate for the lack of training data, researchers have turned to transfer learning by relying on labelled data from other search domains. Despite having access to several publicly available datasets, researchers are currently unguided in selecting the best training set for their particular applications. To address this knowledge gap, we propose a rigorous method to select an optimal training set for a specific search domain. We validate this method on the TREC-COVID challenge, which was organized by the Allen Institute for Artificial Intelligence and the National Institute of Standards and Technology. Our neural model ranked first among 143 competing systems. More importantly, it was able to achieve this result by training on a dataset that was selected using our proposed method. This work highlights the performance gains that may be achieved through careful dataset selection in transfer learning.
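The selection method itself is detailed in the paper; as a rough illustration of the general idea, the sketch below scores candidate training sets against a target search domain and keeps the closest match. The TF-IDF centroid similarity used here is a stand-in criterion, and the function name and data layout are assumptions for illustration, not the paper's actual procedure.

```python
# Illustrative sketch only: scores each candidate training corpus against a
# target search domain and returns the best match. TF-IDF centroid cosine
# similarity is a stand-in criterion, not the selection method from the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_training_set(candidates, target_docs):
    """candidates: dict mapping dataset name -> sample of its documents.
    target_docs: sample of documents from the target search domain."""
    # Fit a single vocabulary over all text so the vectors are comparable.
    all_docs = target_docs + [d for docs in candidates.values() for d in docs]
    vectorizer = TfidfVectorizer(stop_words="english").fit(all_docs)

    # Represent each corpus by the centroid of its TF-IDF vectors.
    target_centroid = np.asarray(vectorizer.transform(target_docs).mean(axis=0))

    scores = {}
    for name, docs in candidates.items():
        centroid = np.asarray(vectorizer.transform(docs).mean(axis=0))
        scores[name] = cosine_similarity(target_centroid, centroid)[0, 0]

    # Choose the candidate whose language most resembles the target domain.
    return max(scores, key=scores.get)

# Hypothetical usage: pick a training set for a COVID-19 literature search task.
# best = select_training_set({"ms_marco": marco_sample, "nq": nq_sample},
#                            cord19_sample)
```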


Notes

  1. This result was achieved by the authors in [20], who re-implemented the work in [5].

  2. This result was achieved by the authors in [20], who re-implemented the work in [10].

  3. The official evaluation software used by the organizers of TREC-COVID was trec-eval, which can be downloaded at https://trec.nist.gov/trec_eval/. A minimal sketch of one of its measures follows these notes.

  4. TREC-COVID round 1 leaderboard: https://ir.nist.gov/covidSubmit.
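Note 3 refers to trec-eval, the official scorer used by the TREC-COVID organizers. As a rough offline illustration, the sketch below computes nDCG@k for a single topic using the linear-gain formulation applied by trec-eval's ndcg_cut measure; the function and data layout are illustrative assumptions, not part of the official tooling.

```python
import math

def ndcg_at_k(ranked_docs, qrels, k=10):
    """nDCG@k for one topic, with trec-eval-style linear gain.

    ranked_docs: doc ids in system ranking order.
    qrels: dict mapping doc id -> graded relevance judgment."""
    # Discounted cumulative gain of the system ranking (ranks are 1-based).
    dcg = sum(qrels.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked_docs[:k], start=1))
    # Ideal DCG: the k highest judgments placed in the best possible order.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: a run that ranks the highly relevant doc first scores near 1.
print(ndcg_at_k(["d1", "d2", "d3"], {"d1": 2, "d3": 1}))  # ~0.95
```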

References

  1. Asadi, N., Metzler, D., Elsayed, T., Lin, J.: Pseudo test collections for learning web search ranking functions. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, pp. 1073–1082 (2011)

  2. Bajaj, P., et al.: MS MARCO: a human generated machine reading comprehension dataset. In: 30th Conference on Neural Information Processing Systems, Barcelona, Spain (2016)

  3. Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.: Overview of the TREC 2019 deep learning track. arXiv:2003.07820 (2020)

  4. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, pp. 985–988. Association for Computing Machinery (2019)

  5. Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.: Neural ranking models with weak supervision. In: Proceedings of SIGIR 2017, Shinjuku, Tokyo, Japan (2017)

  6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, Minneapolis, Minnesota, pp. 4171–4186. Association for Computational Linguistics (2019)

  7. Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguist. 7, 453–466 (2019)

  8. Lin, J.: The neural hype and comparisons against weak baselines. ACM SIGIR Forum 52(2), 40–51 (2019)

  9. MacAvaney, S., Cohan, A., Goharian, N.: SLEDGE: A simple yet effective baseline for coronavirus scientific knowledge search. arXiv:2005.02365 (2020)

  10. MacAvaney, S., Yates, A., Hui, K., Frieder, O.: Content-based weak supervision for ad hoc re-ranking. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, pp. 993–996. Association for Computing Machinery (2019)

  11. Marchesin, S., Purpura, A., Silvello, G.: Focal elements of neural information retrieval models. An outlook through a reproducibility study. Inf. Process. Manage. 57, 102109 (2020)

  12. Nayak, P.: Understanding searches better than ever before. https://www.blog.google/products/search/search-language-understanding-bert/. Accessed 18 May 2020

  13. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv:1901.04085 (2019)

  14. Rao, J., Yang, W., Zhang, Y., Ture, F., Lin, J.: Multi-perspective relevance matching with hierarchical ConvNets for social media search. In: The 33rd AAAI Conference on Artificial Intelligence, AAAI-19, vol. 33, pp. 232–240 (2019)

  15. Roberts, K., et al.: TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19. J. Am. Med. Inform. Assoc. 27, 1431–1436 (2020)

  16. Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence (2016)

  17. Kouw, W.M., Loog, M.: An introduction to domain adaptation and transfer learning. arXiv:1812.11806 (2019)

  18. Yang, W., Lu, K., Yang, P., Lin, J.: Critically examining the “neural hype”: weak baselines and the additivity of effectiveness gains from neural ranking models. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, pp. 1129–1132. Association for Computing Machinery (2019)

  19. Yilmaz, Z., Yang, W., Zhang, H., Lin, J.: Cross-domain modeling of sentence-level evidence for document retrieval. In: 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 3481–3487. Association for Computational Linguistics (2019)

  20. Zhang, K., Xiong, C., Liu, Z., Liu, Z.: Selective weak supervision for neural information retrieval. In: Proceedings of The Web Conference 2020 (WWW 2020), Taipei, Taiwan (2020)

Author information

Corresponding author

Correspondence to Yastil Rughbeer.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Rughbeer, Y., Pillay, A.W., Jembere, E. (2020). Dataset Selection for Transfer Learning in Information Retrieval. In: Gerber, A. (ed.) Artificial Intelligence Research. SACAIR 2021. Communications in Computer and Information Science, vol 1342. Springer, Cham. https://doi.org/10.1007/978-3-030-66151-9_4

  • DOI: https://doi.org/10.1007/978-3-030-66151-9_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66150-2

  • Online ISBN: 978-3-030-66151-9

  • eBook Packages: Computer Science, Computer Science (R0)
