DOI: 10.1145/3488560.3502186
short-paper

Aligning the Research and Practice of Building Search Applications: Elasticsearch and Pyserini

Published: 15 February 2022

Abstract

We demonstrate, via competitive bag-of-words first-stage retrieval baselines for the MS MARCO document ranking task, seamless replicability and interoperability between Elasticsearch and the Pyserini IR toolkit, which are both built on the open-source Lucene search library. This integration highlights the benefits of recent efforts to promote the use of Lucene in information retrieval research to better align the research and practice of building search applications. Closer alignment between academia and industry is mutually beneficial: Academic researchers gain a smoother path to real-world impact because their contributions can be more easily deployed in production applications. Industry practitioners gain an easy way to benchmark their innovations in a rigorous and vendor-neutral manner by exploiting evaluation resources and infrastructure built by the academic community. This two-way exchange between academia and industry allows both parties to "have their cakes and eat them too".


    Published In

    WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
    February 2022
    1690 pages
    ISBN: 9781450391320
    DOI: 10.1145/3488560
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 February 2022

    Author Tags

    1. academia-industry collaborations
    2. open-source software

    Qualifiers

    • Short-paper

    Conference

    WSDM '22

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Article Metrics

    • Downloads (Last 12 months): 27
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 13 Feb 2025

    Cited By

    • (2024) Resources for Brewing BEIR: Reproducible Reference Models and Statistical Analyses. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. DOI: 10.1145/3626772.3657862 (1431-1440). Online publication date: 10-Jul-2024
    • (2024) Improving Hotel Search Autocomplete in Online Travel Agent (OTA) Mobile Apps Using Elasticsearch and Learning to Rank. 2024 International Conference on ICT for Smart Society (ICISS). DOI: 10.1109/ICISS62896.2024.10751515 (1-8). Online publication date: 4-Sep-2024
    • (2024) Enhancing empirical software performance engineering research with kernel-level events: A comprehensive system tracing approach. Journal of Systems and Software. DOI: 10.1016/j.jss.2024.112117. 216 (112117). Online publication date: Oct-2024
    • (2024) Redis-based full-text search extensions for relational databases. International Journal of Machine Learning and Cybernetics. DOI: 10.1007/s13042-024-02160-0. 15:10 (4475-4491). Online publication date: 12-Apr-2024
    • (2023) Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. DOI: 10.1145/3583780.3615112 (5366-5370). Online publication date: 21-Oct-2023
    • (2023) A Highly Accurate Data Synchronization and Full-text Search Algorithm for Canal and Elasticsearch. 2023 IEEE International Conference on Networking, Sensing and Control (ICNSC). DOI: 10.1109/ICNSC58704.2023.10318999 (1-6). Online publication date: 25-Oct-2023
    • (2023) Field features. Applied Soft Computing. DOI: 10.1016/j.asoc.2023.110183. 138:C. Online publication date: 1-May-2023
    • (2022) Integration of text and geospatial search for hydrographic datasets using the lucene search library. Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries. DOI: 10.1145/3529372.3533280 (1-5). Online publication date: 20-Jun-2022
    • (2022) A Common Framework for Exploring Document-at-a-Time and Score-at-a-Time Retrieval Methods. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. DOI: 10.1145/3477495.3531657 (3229-3234). Online publication date: 6-Jul-2022
