DOI: 10.1145/3539618.3591818
Short paper

XpmIR: A Modular Library for Learning to Rank and Neural IR Experiments

Published: 18 July 2023

Abstract

In recent years, several frameworks for (neural) information retrieval have been proposed. While they make it possible to reproduce published results, it remains hard to reuse parts of their learning pipelines, such as the pre-training, the sampling strategy, or the loss of a newly developed model. It is equally difficult to combine new training techniques with older models, which makes it hard to assess how useful an idea is across different neural IR models. This slows the adoption of new techniques and, in turn, the development of the IR field. In this paper, we present XpmIR, a Python library defining a reusable set of experimental components. The library already contains state-of-the-art models and indexing techniques, and is integrated with the HuggingFace hub.
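The modularity the abstract argues for can be illustrated with a minimal, self-contained sketch in plain Python. This is not the actual XpmIR API: every name below (`train_step`, `in_batch_sampler`, `hinge_loss`, `overlap_scorer`) is hypothetical, chosen only to show how a sampling strategy, a loss, and a scoring model can be injected as independent components and recombined freely.

```python
# Conceptual sketch of component-based IR training (NOT the XpmIR API).
# Each piece - sampler, loss, scorer - is swappable in isolation.
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A "scorer" maps a (query, document) pair to a relevance score;
# in a real pipeline this would be a neural model.
Scorer = Callable[[str, str], float]

@dataclass
class PairwiseSample:
    query: str
    positive: str
    negative: str

def in_batch_sampler(records: List[Tuple[str, str]]) -> List[PairwiseSample]:
    """Toy negative-sampling strategy: use the next record's positive
    document as this query's negative (in-batch negatives)."""
    samples = []
    for i, (query, pos) in enumerate(records):
        neg = records[(i + 1) % len(records)][1]
        samples.append(PairwiseSample(query, pos, neg))
    return samples

def hinge_loss(scorer: Scorer, sample: PairwiseSample, margin: float = 1.0) -> float:
    """Pairwise hinge loss; replacing this function changes the training
    objective without touching the sampler or the scorer."""
    delta = scorer(sample.query, sample.positive) - scorer(sample.query, sample.negative)
    return max(0.0, margin - delta)

def train_step(scorer: Scorer, sampler, loss_fn, records) -> float:
    """One 'training' step: all components are injected, so any of them
    can be reused across experiments independently."""
    samples = sampler(records)
    return sum(loss_fn(scorer, s) for s in samples) / len(samples)

# Trivial lexical-overlap scorer standing in for a neural ranker.
def overlap_scorer(query: str, doc: str) -> float:
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(1, len(q))

records = [("neural ranking", "neural ranking models"),
           ("sparse retrieval", "sparse lexical retrieval")]
avg_loss = train_step(overlap_scorer, in_batch_sampler, hinge_loss, records)
```

Because the sampler, the loss, and the scorer are plain values passed to `train_step`, testing whether a new sampling strategy helps an old model (or vice versa) amounts to changing one argument, which is the kind of reuse the paper advocates.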


Cited By

  • (2024) A Self-Distilled Learning to Rank Model for Ad Hoc Retrieval. ACM Transactions on Information Systems 42(6), 1–28. https://doi.org/10.1145/3681784. Online publication date: 22-Oct-2024.
  • (2024) Which Neurons Matter in IR? Applying Integrated Gradients-based Methods to Understand Cross-Encoders. In Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 133–143. https://doi.org/10.1145/3664190.3672528. Online publication date: 2-Aug-2024.
  • (2024) Simple Domain Adaptation for Sparse Retrievers. In Advances in Information Retrieval, 403–412. https://doi.org/10.1007/978-3-031-56063-7_32. Online publication date: 24-Mar-2024.

Published In
SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. experimental framework
  2. learning to rank
  3. neural information retrieval

Qualifiers

  • Short-paper

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 9
  • Downloads (last 6 weeks): 2

Reflects downloads up to 05 Mar 2025.
