DOI: 10.1145/3539618.3592065
short-paper
Open access

SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval

Published: 18 July 2023

Abstract

In dense retrieval, prior work has largely improved retrieval effectiveness using multi-vector dense representations, exemplified by ColBERT. In sparse retrieval, more recent work, such as SPLADE, demonstrated that one can also learn sparse lexical representations to achieve comparable effectiveness while enjoying better interpretability. In this work, we combine the strengths of both sparse and dense representations for first-stage retrieval. Specifically, we propose SparseEmbed, a novel retrieval model that learns sparse lexical representations with contextual embeddings. Compared with SPLADE, our model leverages the contextual embeddings to improve model expressiveness. Compared with ColBERT, our sparse representations are trained end-to-end to optimize both efficiency and effectiveness.
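To make the combination concrete: a SparseEmbed-style representation attaches a contextual embedding to each activated vocabulary term, so scoring is a lexical match (as in SPLADE) whose per-term contribution is an embedding dot product (as in ColBERT's late interaction). The sketch below is an illustrative toy, not the paper's implementation; the representation format (term id mapped to a weight and an embedding) and all values are assumptions for demonstration.

```python
def dot(u, v):
    """Inner product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def sparse_embed_score(query_rep, doc_rep):
    """Toy SparseEmbed-style score for a query-document pair.

    Each representation maps a vocabulary term id to a pair
    (sparse_weight, contextual_embedding). Only terms activated on
    both sides contribute, so an inverted index over term ids can
    drive first-stage retrieval; each matching term contributes the
    dot product of its contextual embeddings, scaled by both sparse
    weights.
    """
    score = 0.0
    for term_id, (q_w, q_emb) in query_rep.items():
        if term_id in doc_rep:
            d_w, d_emb = doc_rep[term_id]
            score += q_w * d_w * dot(q_emb, d_emb)
    return score

# Toy 4-dimensional embeddings; term ids 7 and 9 are shared,
# term 3 is activated only on the query side and contributes nothing.
query = {7: (1.0, [1.0, 0.0, 0.0, 0.0]),
         9: (0.5, [0.0, 1.0, 0.0, 0.0]),
         3: (0.2, [0.0, 0.0, 1.0, 0.0])}
doc   = {7: (2.0, [1.0, 0.0, 0.0, 0.0]),
         9: (1.0, [0.0, 1.0, 0.0, 0.0])}

print(sparse_embed_score(query, doc))  # 1.0*2.0*1 + 0.5*1.0*1 = 2.5
```

Setting every embedding to a constant scalar recovers a SPLADE-like pure lexical score, while keeping only one "term" per sequence degenerates toward a single-vector dense model, which is the design space the paper navigates.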

References

[1] Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. Approximate nearest neighbor search in high dimensions. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 3287--3318. World Scientific, 2018.
[2] Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, and Qun Liu. SparTerm: Learning term-based sparse representation for fast text retrieval. arXiv preprint arXiv:2010.00768, 2020.
[3] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. of ICML, pages 160--167, 2008.
[4] Zhuyun Dai and Jamie Callan. Context-aware term weighting for first stage passage retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1533--1536, 2020.
[5] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086, 2021.
[6] Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. SPLADE: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2288--2292, 2021.
[7] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. From distillation to hard negative sampling: Making sparse neural IR models more effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2353--2359, 2022.
[8] Luyu Gao and Jamie Callan. Unsupervised corpus aware language model pretraining for dense passage retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2843--2853, 2022.
[9] Luyu Gao, Zhuyun Dai, and Jamie Callan. COIL: Revisit exact lexical match in information retrieval with contextualized inverted list. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3030--3042, 2021.
[10] Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv preprint arXiv:2010.02666, 2020.
[11] Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 113--122, 2021.
[12] Sebastian Hofstätter, Omar Khattab, Sophia Althammer, Mete Sertkan, and Allan Hanbury. Introducing neural bag of whole-words with ColBERTer: Contextualized late interactions using enhanced reduction. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management, pages 737--747, 2022.
[13] Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proc. of SIGIR, pages 39--48, 2020.
[14] Weize Kong, Swaraj Khadanga, Cheng Li, Shaleen Kumar Gupta, Mingyang Zhang, Wensong Xu, and Michael Bendersky. Multi-aspect dense retrieval. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3178--3186, 2022.
[15] Carlos Lassance, Maroua Maachou, Joohee Park, and Stéphane Clinchant. A study on token pruning for ColBERT. arXiv preprint arXiv:2112.06540, 2021.
[16] Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 163--173, 2021.
[17] Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. Sparse, dense, and attentional representations for text retrieval. arXiv preprint arXiv:2005.00181, 2020.
[18] Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. Learning passage impacts for inverted indexes. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1723--1727, 2021.
[19] Biswajit Paria, Chih-Kuan Yeh, Ian EH Yen, Ning Xu, Pradeep Ravikumar, and Barnabás Póczos. Minimizing flops to learn efficient sparse representations. In International Conference on Learning Representations, 2019.
[20] Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191, 2020.
[21] Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488, 2021.
[22] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[23] Nicola Tonellotto and Craig Macdonald. Query embedding pruning for dense retrieval. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 3453--3457, 2021.
[24] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808, 2020.
[25] Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. Pretrained transformers for text ranking: BERT and beyond. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 1154--1156, 2021.
[26] Hamed Zamani, Mostafa Dehghani, W Bruce Croft, Erik Learned-Miller, and Jaap Kamps. From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 497--506, 2018.

Cited By

  • (2024) STAR: Sparse Text Approach for Recommendation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 4086--4090. DOI: 10.1145/3627673.3679999. Online publication date: 21-Oct-2024.
  • (2024) SPLATE: Sparse Late Interaction Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2635--2640. DOI: 10.1145/3626772.3657968. Online publication date: 10-Jul-2024.
  • (2024) A Sparsifier Model for Efficient Information Retrieval. In 2024 IEEE 18th International Conference on Application of Information and Communication Technologies (AICT), pages 1--4. DOI: 10.1109/AICT61888.2024.10740301. Online publication date: 25-Sep-2024.
  • (2023) Learning Sparse Lexical Representations Over Specified Vocabularies for Retrieval. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3865--3869. DOI: 10.1145/3583780.3615207. Online publication date: 21-Oct-2023.


        Published In

        SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
        July 2023
        3567 pages
        ISBN:9781450394086
        DOI:10.1145/3539618
        This work is licensed under a Creative Commons Attribution International 4.0 License.


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Badges

        • Best Short Paper

        Author Tags

        1. contextual embeddings
        2. dense retrieval
        3. sparse retrieval

        Qualifiers

        • Short-paper

        Conference

        SIGIR '23

        Acceptance Rates

        Overall Acceptance Rate 792 of 3,983 submissions, 20%


        Article Metrics

        • Downloads (Last 12 months)1,441
        • Downloads (Last 6 weeks)169
        Reflects downloads up to 05 Mar 2025
