Few-Shot Adaptation of Medical Vision-Language Models

  • Conference paper
  • In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2024 (MICCAI 2024)

Abstract

Integrating image and text data through multi-modal learning has emerged as a new approach in medical imaging research, following its successful deployment in computer vision. While considerable effort has been devoted to building medical foundation models and evaluating their zero-shot transfer to downstream tasks, the popular few-shot setting remains relatively unexplored. Following the recent strong emergence of this setting in computer vision, we introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime and investigate various adaptation strategies commonly used in the context of natural images. Furthermore, we evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings via learnable class-wise multipliers. Surprisingly, such a text-informed linear probe yields performance competitive with more convoluted prompt-learning and adapter-based strategies, while running considerably faster and accommodating the black-box setting. Our extensive experiments span three different medical modalities and specialized foundation models, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. We make our benchmark and code publicly available to trigger further developments in this emergent subject: https://github.com/FereshteShakeri/few-shot-MedVLMs.
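As a rough illustration of the idea described above (a sketch, not the authors' released code), the text-informed linear probe can be viewed as a classifier whose class-c weight vector blends the class text embedding with the few-shot visual prototype through a learnable class-wise multiplier. The snippet below uses synthetic stand-ins for a frozen VLM's text embeddings and few-shot visual features, and trains only the multipliers with plain gradient descent on a cross-entropy loss (the choice of optimizer here is an assumption for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, S = 3, 8, 4  # classes, embedding dimension, shots per class

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical stand-ins for a frozen VLM's outputs: class text embeddings
# and few-shot visual features clustered around them, all L2-normalized.
text_emb = l2norm(rng.normal(size=(C, D)))
feats = l2norm(text_emb[:, None, :] + 0.3 * rng.normal(size=(C, S, D)))
labels = np.repeat(np.arange(C), S)
X = feats.reshape(C * S, D)

proto = l2norm(feats.mean(axis=1))  # visual prototypes (class means)
alpha = np.ones(C)                  # learnable class-wise multipliers

def logits(a):
    # Class-c weight vector: text embedding blended with visual prototype.
    W = text_emb + a[:, None] * proto
    return X @ W.T

def ce_and_grad(a):
    # Softmax cross-entropy over the few shots, with the analytic gradient
    # chained through W_c = t_c + alpha_c * v_c down to alpha.
    z = logits(a)
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(C)[labels]
    loss = -np.log(p[np.arange(len(labels)), labels]).mean()
    gW = (p - onehot).T @ X / len(labels)     # dL/dW, shape (C, D)
    return loss, (gW * proto).sum(axis=1)     # dL/dalpha, shape (C,)

loss0, _ = ce_and_grad(alpha)
loss = loss0
for _ in range(100):  # plain gradient descent on alpha only
    loss, g = ce_and_grad(alpha)
    alpha = alpha - 1.0 * g

acc = (logits(alpha).argmax(axis=1) == labels).mean()
```

Because only the C multipliers are optimized while the encoders stay frozen, this kind of probe needs only pre-computed embeddings, which is what makes it cheap and compatible with black-box models.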

F. Shakeri and Y. Huang—Equal contributions.



Acknowledgement

This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Montreal University Hospital Research Center (CRCHUM). We also thank Calcul Quebec and Compute Canada.

Author information

Corresponding author

Correspondence to Fereshteh Shakeri.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shakeri, F. et al. (2024). Few-Shot Adaptation of Medical Vision-Language Models. In: Linguraru, M.G., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, vol 15012. Springer, Cham. https://doi.org/10.1007/978-3-031-72390-2_52

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72390-2_52

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72389-6

  • Online ISBN: 978-3-031-72390-2

  • eBook Packages: Computer Science (R0)
