Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated a strong ability to learn distinctive visual representations that generalize across diverse vision tasks. However, its effectiveness in pathology image analysis, particularly with limited labeled data, remains an open question due to significant domain shifts and catastrophic forgetting. Efficient adaptation strategies are therefore needed to enable scalable analysis in this domain. In this study, we introduce Path-CLIP, a framework tailored for swift adaptation of CLIP to various pathology tasks. First, we propose Residual Feature Refinement (RFR) with a dynamically adjustable ratio to integrate and balance source and task-specific knowledge. Second, we introduce Hidden Representation Perturbation (HRP) and Dual-view Vision Contrastive (DVC) techniques to mitigate overfitting. Finally, we present the Doublet Multimodal Contrastive Loss (DMCL) for fine-tuning CLIP on pathology tasks. We demonstrate that Path-CLIP adeptly adapts pre-trained CLIP to downstream pathology tasks, yielding competitive results. Specifically, Path-CLIP achieves over +19% improvement in accuracy on PCam using a mere 0.1% of labeled data, with only 10 minutes of fine-tuning on a single GPU.
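To make the abstract's mechanisms concrete, below is a minimal PyTorch sketch of how RFR and HRP might be realized. This is an assumption-based illustration, not the paper's implementation: the class and parameter names (RFRAdapter, ratio_logit, std), the bottleneck MLP, the sigmoid-parameterized ratio, and the Gaussian form of the perturbation are all hypothetical choices consistent with the abstract's description of a residual blend with a dynamically adjustable ratio.

```python
# Hypothetical sketch of Residual Feature Refinement (RFR): a small adapter
# refines frozen CLIP features, and a learnable ratio balances the adapted
# (task-specific) features against the original (source) features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RFRAdapter(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        # Bottleneck MLP producing task-specific residual features.
        self.adapter = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )
        # Learnable logit for the mixing ratio; the sigmoid keeps it in
        # (0, 1) so the ratio can be adjusted dynamically during training.
        self.ratio_logit = nn.Parameter(torch.zeros(1))

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.ratio_logit)
        refined = self.adapter(clip_features)
        # Residual blend of source (CLIP) and task-specific knowledge.
        mixed = alpha * refined + (1.0 - alpha) * clip_features
        return F.normalize(mixed, dim=-1)


def hidden_representation_perturbation(h: torch.Tensor, std: float = 0.01):
    # Assumed form of HRP: inject small Gaussian noise into hidden states
    # during fine-tuning to regularize against overfitting; the paper's
    # actual noise type and schedule may differ.
    return h + torch.randn_like(h) * std


# Usage: embeddings from a frozen CLIP image encoder are refined in place.
features = torch.randn(8, 512)      # e.g., a batch of CLIP embeddings
adapter = RFRAdapter(dim=512)
refined = adapter(hidden_representation_perturbation(features))
```

In practice, such an adapter would sit on top of a frozen CLIP encoder and be trained jointly with the mixing ratio on the downstream pathology task; the sketch only shows the blending and perturbation mechanisms.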
Acknowledgments
This work was supported by the Noyce Initiative UC Partnerships in Computational Transformation Grant and the Child Family Endowed Professorship. Resources for this study were funded in part by grants from the National Institute on Aging of the National Institutes of Health under Award Numbers R01AG062517, P30AG072972, and R01AG056519.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lai, Z., Chauhan, J., Dugger, B.N., Chuah, C.N. (2025). Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol. 15122. Springer, Cham. https://doi.org/10.1007/978-3-031-73039-9_15
DOI: https://doi.org/10.1007/978-3-031-73039-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73038-2
Online ISBN: 978-3-031-73039-9