Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Contrastive Language-Image Pre-training (CLIP) has proven proficient at learning distinctive visual representations and generalizing well across diverse vision tasks. However, its effectiveness in pathology image analysis, particularly with limited labeled data, remains an open question because of significant domain shifts and catastrophic forgetting. Efficient adaptation strategies are therefore needed to enable scalable analysis in this domain. In this study, we introduce Path-CLIP, a framework tailored for swift adaptation of CLIP to various pathology tasks. First, we propose Residual Feature Refinement (RFR) with a dynamically adjustable ratio to effectively integrate and balance source and task-specific knowledge. Second, we introduce Hidden Representation Perturbation (HRP) and Dual-view Vision Contrastive (DVC) techniques to mitigate overfitting. Finally, we present the Doublet Multimodal Contrastive Loss (DMCL) for fine-tuning CLIP on pathology tasks. We demonstrate that Path-CLIP adeptly adapts pre-trained CLIP to downstream pathology tasks, yielding competitive results. Specifically, Path-CLIP improves accuracy by over 19% when using a mere 0.1% of the labeled data in PCam, with only 10 minutes of fine-tuning on a single GPU.
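
To make these components concrete, the sketch below gives a minimal PyTorch reading of two of them: a residual adapter whose learnable mixing ratio blends frozen CLIP features with task-specific features (one plausible form of RFR), and additive Gaussian noise injected into hidden representations during fine-tuning (one plausible form of HRP). Every class, function, and parameter name here is an illustrative assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ResidualFeatureRefinement(nn.Module):
    """Hypothetical sketch of RFR (names and structure are assumptions,
    not the authors' code): a lightweight bottleneck adapter produces
    task-specific features, which are blended with the frozen CLIP
    features through a learnable ratio kept in (0, 1) by a sigmoid."""

    def __init__(self, dim: int, bottleneck: int = 256):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim),
        )
        # Unconstrained scalar; sigmoid maps it into (0, 1) so the
        # mixing ratio can be adjusted dynamically during training.
        self.ratio_logit = nn.Parameter(torch.zeros(1))

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.ratio_logit)
        # Residual blend: retain source (CLIP) knowledge while
        # injecting task-specific knowledge from the adapter.
        return alpha * self.adapter(clip_feat) + (1.0 - alpha) * clip_feat


def perturb_hidden(h: torch.Tensor, sigma: float = 0.01,
                   training: bool = True) -> torch.Tensor:
    """Hypothetical sketch of HRP: add small Gaussian noise to hidden
    representations during fine-tuning to discourage overfitting."""
    return h + sigma * torch.randn_like(h) if training else h


if __name__ == "__main__":
    feats = torch.randn(8, 512)          # e.g., CLIP ViT-B/32 image features
    rfr = ResidualFeatureRefinement(512)
    refined = rfr(perturb_hidden(feats))
    print(refined.shape)                 # torch.Size([8, 512])
```

Under this reading, only the adapter and the mixing ratio are trained while the CLIP backbone stays frozen, which is consistent with the abstract's emphasis on fast, single-GPU fine-tuning.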

Acknowledgments

This work was supported by the Noyce Initiative UC Partnerships in Computational Transformation Grant and Child Family Endowed Professorship. Resources for this study were funded in part by grants from the National Institute on Aging of the National Institutes of Health under Award Numbers R01AG062517, P30AG072972, and R01AG056519.

Author information

Corresponding author

Correspondence to Zhengfeng Lai.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 741 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lai, Z., Chauhan, J., Dugger, B.N., Chuah, C.N. (2025). Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15122. Springer, Cham. https://doi.org/10.1007/978-3-031-73039-9_15

  • DOI: https://doi.org/10.1007/978-3-031-73039-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73038-2

  • Online ISBN: 978-3-031-73039-9

  • eBook Packages: Computer Science, Computer Science (R0)
