
Beneath the [MASK]: An Analysis of Structural Query Tokens in ColBERT

Conference paper
Advances in Information Retrieval (ECIR 2024)

Abstract

ColBERT is a highly effective and interpretable retrieval model based on token embeddings. To score a document, the model sums, over query tokens, the cosine similarity between each query token embedding and its most similar document token embedding. Previous work on interpreting how tokens affect scoring pays little attention to the non-text tokens ColBERT uses, such as [MASK]. Using MS MARCO and the TREC 2019-2020 deep passage retrieval task, we show that [MASK] embeddings may be replaced by other query and structural token embeddings to obtain similar effectiveness, and that [Q] and [MASK] are sensitive to token order, while [CLS] and [SEP] are not.
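
To make the scoring concrete, the following is a minimal sketch of ColBERT's late-interaction (MaxSim) scoring in PyTorch. It assumes unit-normalized token embeddings, so dot products equal cosine similarities; the function name and tensor shapes are illustrative rather than taken from the ColBERT codebase.

    import torch

    def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
        """Late-interaction scoring sketch: query_embs is (num_query_tokens, dim),
        doc_embs is (num_doc_tokens, dim), with rows unit-normalized."""
        # Cosine similarity of every query token with every document token.
        sim = query_embs @ doc_embs.T  # (num_query_tokens, num_doc_tokens)
        # For each query token (including structural tokens such as [Q], [CLS],
        # [SEP], and the [MASK] padding), keep its best-matching document token,
        # then sum those maxima into a single relevance score.
        return sim.max(dim=1).values.sum()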


Notes

  1. [MASK] was originally devised for BERT to represent a “hidden” input token in its masked token prediction training task (a sketch of how [MASK] padding enters ColBERT queries follows these notes).

  2. Interactive version: https://cs.rit.edu/~bsg8294/colbert/query_viz.html.

  3. http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip.

  4. Running the TREC test queries takes roughly 15 min using a multithreaded Rust program: https://github.com/Boxxfish/IR2023-Project.
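
As background for note 1, here is a hedged sketch of ColBERT-style query augmentation as described by Khattab and Zaharia: the tokenized query is wrapped in structural tokens and right-padded with [MASK] up to a fixed length (32 in the original paper). Token strings stand in for tokenizer ids, and the exact ordering reflects our reading of the ColBERT query encoder rather than code from its repository.

    def augment_query(tokens: list[str], max_len: int = 32) -> list[str]:
        """Wrap a tokenized query with structural tokens and [MASK] padding."""
        # [CLS] and the query marker [Q] lead the sequence; [SEP] closes the
        # raw query text (truncation of over-long queries is omitted here).
        seq = ["[CLS]", "[Q]"] + tokens + ["[SEP]"]
        # Right-pad with [MASK] so every query spans exactly max_len positions.
        seq += ["[MASK]"] * (max_len - len(seq))
        return seq

    # e.g. augment_query(["what", "is", "colbert"]) ->
    # ['[CLS]', '[Q]', 'what', 'is', 'colbert', '[SEP]', '[MASK]', ..., '[MASK]']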


Author information


Correspondence to Ben Giacalone.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Giacalone, B., Paiement, G., Tucker, Q., Zanibbi, R. (2024). Beneath the [MASK]: An Analysis of Structural Query Tokens in ColBERT. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_35


  • DOI: https://doi.org/10.1007/978-3-031-56063-7_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56062-0

  • Online ISBN: 978-3-031-56063-7

  • eBook Packages: Computer Science, Computer Science (R0)
