Abstract
ColBERT is a highly effective and interpretable retrieval model based on token embeddings. To score a document, the model sums, for each query token embedding, the maximum cosine similarity over the document's token embeddings. Previous work on interpreting how tokens affect scoring pays little attention to the non-text tokens used in ColBERT, such as [MASK]. Using MS MARCO and the TREC 2019-2020 deep passage retrieval task, we show that [MASK] embeddings may be replaced by other query and structural token embeddings to obtain similar effectiveness, and that [Q] and [MASK] are sensitive to token order, while [CLS] and [SEP] are not.
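To make the scoring mechanism concrete, the sketch below illustrates ColBERT-style query augmentation (prepending [CLS] and [Q], then padding with [MASK] up to a fixed query length) and the late-interaction MaxSim score. This is a minimal sketch assuming L2-normalized embeddings and a fixed query length of 32; the function names and the random vectors standing in for real BERT token embeddings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

QUERY_LEN = 32  # ColBERT pads or truncates queries to a fixed length

def augment_query_tokens(tokens):
    """Prepend structural tokens, append [SEP], pad with [MASK] to QUERY_LEN."""
    augmented = ["[CLS]", "[Q]"] + tokens + ["[SEP]"]
    return augmented + ["[MASK]"] * (QUERY_LEN - len(augmented))

def maxsim_score(Q, D):
    """Sum, over query token embeddings (rows of Q), the maximum cosine
    similarity with any document token embedding (rows of D). Rows are
    assumed L2-normalized, so dot products equal cosine similarities."""
    sims = Q @ D.T  # shape: (num query tokens, num document tokens)
    return float(sims.max(axis=1).sum())

# Toy usage: random unit vectors stand in for real BERT token embeddings.
rng = np.random.default_rng(0)
Q = rng.normal(size=(QUERY_LEN, 128))
D = rng.normal(size=(180, 128))
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D /= np.linalg.norm(D, axis=1, keepdims=True)

print(augment_query_tokens(["what", "is", "colbert"])[:8])
print(round(maxsim_score(Q, D), 3))
```

In these terms, the paper's analysis asks what happens when the embeddings occupying the [MASK] positions are replaced; in this sketch, that would correspond to swapping out the padded rows of Q before scoring.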
Notes
1. [MASK] was originally devised for BERT to represent a “hidden” input token in its masked token prediction training task.
2. Interactive version: https://cs.rit.edu/~bsg8294/colbert/query_viz.html.
3.
4. Running the TREC test queries takes roughly 15 min to complete using a multithreaded Rust program: https://github.com/Boxxfish/IR2023-Project.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Giacalone, B., Paiement, G., Tucker, Q., Zanibbi, R. (2024). Beneath the [MASK]: An Analysis of Structural Query Tokens in ColBERT. In: Goharian, N., et al. (eds.) Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol. 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_35
DOI: https://doi.org/10.1007/978-3-031-56063-7_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56062-0
Online ISBN: 978-3-031-56063-7