short-paper

Tevatron: An Efficient and Flexible Toolkit for Neural Retrieval

Authors:

Jamie Callan

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 3120 - 3124

https://doi.org/10.1145/3539618.3591805

Published: 18 July 2023

Abstract

Recent rapid advances in deep pre-trained language models and the introduction of large datasets have powered research in embedding-based neural retrieval. While many excellent research papers have emerged, most of them come with their own implementations, which are typically optimized for some particular research goals instead of efficiency or code organization. In this paper, we introduce Tevatron, a neural retrieval toolkit that is optimized for efficiency, flexibility, and code simplicity. Tevatron enables model training and evaluation for a variety of ranking components such as dense retrievers, sparse retrievers, and rerankers. It also provides a standardized pipeline that includes text processing, model training, corpus/query encoding, and search. In addition, Tevatron incorporates well-studied methods for improving retriever effectiveness such as hard negative mining and knowledge distillation. We provide an overview of Tevatron in this paper, demonstrating its effectiveness and efficiency on multiple IR and QA datasets. We highlight Tevatron's flexible design, which enables easy generalization across datasets, model architectures, and accelerator platforms (GPUs and TPUs). Overall, we believe that Tevatron can serve as a solid software foundation for research on neural retrieval systems, including their design, modeling, and optimization.
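
The encoding, search, and hard negative mining stages named in the abstract can be illustrated with a toy sketch. This is not Tevatron's API: every function name below is hypothetical, a normalized term-count vector stands in for a neural encoder, and a brute-force dot product stands in for a vector index. It only shows how the stages connect.

```python
import math


def build_vocab(texts):
    """Map each distinct lowercase token to a vector dimension (hypothetical helper)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Toy stand-in for a neural encoder: L2-normalized term counts."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def search(query_vec, passage_vecs, k):
    """Brute-force dot-product search; returns the top-k (passage_id, score) pairs."""
    scored = [
        (sum(q * p for q, p in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(passage_vecs)
    ]
    # Highest score first; ties are broken by the lower passage id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:k]]


def mine_hard_negatives(ranked, positive_ids, n):
    """Keep the highest-ranked passages that are not labeled relevant."""
    return [idx for idx, _ in ranked if idx not in positive_ids][:n]


corpus = [
    "dense retrieval encodes queries and passages as vectors",
    "sparse retrieval matches exact terms",
    "rerankers score query passage pairs",
    "the weather is sunny today",
]
vocab = build_vocab(corpus)
passage_vecs = [encode(passage, vocab) for passage in corpus]

query_vec = encode("dense retrieval vectors", vocab)
ranked = search(query_vec, passage_vecs, k=3)
hard_negatives = mine_hard_negatives(ranked, positive_ids={0}, n=1)
```

In this toy run, passage 0 is the labeled positive, so the mined hard negative is the next-ranked passage, the one that shares the term "retrieval" with the query. In a real system the mined negatives would be fed back into another round of training.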

Index Terms

Tevatron: An Efficient and Flexible Toolkit for Neural Retrieval
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2023

3567 pages

ISBN:9781450394086

DOI:10.1145/3539618

General Chairs: Hsin-Hsi Chen (National Taiwan University), Wei-Jou (Edward) Duh (National Taiwan University), Hen-Hsen Huang (Academia Sinica)

Program Chairs: Makoto P. Kato (Spotify), Josiane Mothe (Universite de Toulouse), Barbara Poblete (University of Chile and Amazon Visiting Academic)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

Natural Sciences and Engineering Research Council (NSERC) of Canada

Conference

SIGIR '23

Sponsor:

SIGIR

SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 23 - 27, 2023

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

Total Citations: 10
Total Downloads: 148
Downloads (Last 12 months): 67
Downloads (Last 6 weeks): 6

Reflects downloads up to 16 Feb 2025

Cited By

Zhuang S, Koopman B, Chu X, Zuccon G, Sakai T, Ishita E, Ohshima H, Hasibi F, Mao J, Jose J (2024). Understanding and Mitigating the Threat of Vec2Text to Dense Retrieval Systems. Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 259-268. Online publication date: 8-Dec-2024.
https://dl.acm.org/doi/10.1145/3673791.3698414

Zhang L, Li S, Peng H (2024). Lora for dense passage retrieval of ConTextual masked auto-encoding. Signal, Image and Video Processing 19:1. Online publication date: 2-Dec-2024.
https://doi.org/10.1007/s11760-024-03593-4

Sidiropoulos G, Kanoulas E (2024). Improving the Robustness of Dense Retrievers Against Typos via Multi-Positive Contrastive Learning. Advances in Information Retrieval, 297-305. Online publication date: 24-Mar-2024.
https://dl.acm.org/doi/10.1007/978-3-031-56063-7_21

Cohen N, Cohen-Indelman H, Fairstein Y, Kushilevitz G (2024). InDi: Informative and Diverse Sampling for Dense Retrieval. Advances in Information Retrieval, 243-258. Online publication date: 24-Mar-2024.
https://dl.acm.org/doi/10.1007/978-3-031-56063-7_16

Zhang X, Thakur N, Ogundepo O, Kamalloo E, Alfonso-Hermelo D, Li X, Liu Q, Rezagholizadeh M, Lin J (2023). MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics 11, 1114-1131. Online publication date: 1-Sep-2023.
https://doi.org/10.1162/tacl_a_00595

Ma X, Fun H, Yin X, Mallia A, Lin J (2023). Enhancing Sparse Retrieval via Unsupervised Learning. Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 150-157. Online publication date: 26-Nov-2023.
https://dl.acm.org/doi/10.1145/3624918.3625334

Zhang X, Ogueji K, Ma X, Lin J (2023). Toward Best Practices for Training Multilingual Dense Retrieval Models. ACM Transactions on Information Systems 42:2, 1-33. Online publication date: 27-Sep-2023.
https://dl.acm.org/doi/10.1145/3613447

Ma X, Teofili T, Lin J, Frommholz I, Hopfgartner F, Lee M, Oakes M, Lalmas M, Zhang M, Santos R (2023). Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 5366-5370. Online publication date: 21-Oct-2023.
https://dl.acm.org/doi/10.1145/3583780.3615112

Lawrie D, Yang E, Oard D, Mayfield J (2023). Neural Approaches to Multilingual Information Retrieval. Advances in Information Retrieval, 521-536. Online publication date: 2-Apr-2023.
https://dl.acm.org/doi/10.1007/978-3-031-28244-7_33

Tamber M, Pradeep R, Lin J (2023). Pre-processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering. Advances in Information Retrieval, 163-176. Online publication date: 2-Apr-2023.
https://dl.acm.org/doi/10.1007/978-3-031-28241-6_11
