
Beneath the [MASK]: An Analysis of Structural Query Tokens in ColBERT

Conference paper
Advances in Information Retrieval (ECIR 2024)

Abstract

ColBERT is a highly effective and interpretable retrieval model based on token embeddings. To score a document, the model sums, over query tokens, the cosine similarity between each query token embedding and its most similar document token embedding. Previous work on interpreting how tokens affect scoring pays little attention to the non-text tokens ColBERT uses, such as [MASK]. Using MS MARCO and the TREC 2019-2020 deep passage retrieval task, we show that [MASK] embeddings may be replaced by other query and structural token embeddings to obtain similar effectiveness, and that [Q] and [MASK] are sensitive to token order, while [CLS] and [SEP] are not.
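
To make the scoring concrete, the following is a minimal sketch of ColBERT's late-interaction (MaxSim) scoring in PyTorch. It assumes unit-normalized token embeddings, so dot products equal cosine similarities; the function name and tensor shapes are illustrative rather than taken from the ColBERT codebase.

    import torch

    def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
        """Late-interaction scoring sketch: query_embs is (num_query_tokens, dim),
        doc_embs is (num_doc_tokens, dim), with rows unit-normalized."""
        # Cosine similarity of every query token with every document token.
        sim = query_embs @ doc_embs.T  # (num_query_tokens, num_doc_tokens)
        # For each query token (including structural tokens such as [Q], [CLS],
        # [SEP], and the [MASK] padding), keep its best-matching document token,
        # then sum those maxima into a single relevance score.
        return sim.max(dim=1).values.sum()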


Notes

  1. [MASK] was originally devised for BERT to represent a “hidden” input token in its masked token prediction training task (a sketch of how [MASK] padding enters ColBERT queries follows these notes).

  2. Interactive version: https://cs.rit.edu/~bsg8294/colbert/query_viz.html.

  3. http://www.dcs.gla.ac.uk/~craigm/ecir2021-tutorial/colbert_model_checkpoint.zip.

  4. Running the TREC test queries takes roughly 15 min using a multithreaded Rust program: https://github.com/Boxxfish/IR2023-Project.
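
As background for note 1, here is a hedged sketch of ColBERT-style query augmentation as described by Khattab and Zaharia: the tokenized query is wrapped in structural tokens and right-padded with [MASK] up to a fixed length (32 in the original paper). Token strings stand in for tokenizer ids, and the exact ordering reflects our reading of the ColBERT query encoder rather than code from its repository.

    def augment_query(tokens: list[str], max_len: int = 32) -> list[str]:
        """Wrap a tokenized query with structural tokens and [MASK] padding."""
        # [CLS] and the query marker [Q] lead the sequence; [SEP] closes the
        # raw query text (truncation of over-long queries is omitted here).
        seq = ["[CLS]", "[Q]"] + tokens + ["[SEP]"]
        # Right-pad with [MASK] so every query spans exactly max_len positions.
        seq += ["[MASK]"] * (max_len - len(seq))
        return seq

    # e.g. augment_query(["what", "is", "colbert"]) ->
    # ['[CLS]', '[Q]', 'what', 'is', 'colbert', '[SEP]', '[MASK]', ..., '[MASK]']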


Author information


Correspondence to Ben Giacalone.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Giacalone, B., Paiement, G., Tucker, Q., Zanibbi, R. (2024). Beneath the [MASK]: An Analysis of Structural Query Tokens in ColBERT. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_35


  • DOI: https://doi.org/10.1007/978-3-031-56063-7_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56062-0

  • Online ISBN: 978-3-031-56063-7

  • eBook Packages: Computer Science, Computer Science (R0)
