
MIL-VT: Multiple Instance Learning Enhanced Vision Transformer for Fundus Image Classification

  • Conference paper
  • Part of the book: Medical Image Computing and Computer Assisted Intervention – MICCAI 2021 (MICCAI 2021)

Abstract

With the advancement and prevailing success of Transformer models in natural language processing (NLP), an increasing number of research works have explored the applicability of Transformers to various vision tasks and reported performance superior to that of convolutional neural networks (CNNs). However, since properly training a Transformer generally requires an extremely large quantity of data, Transformers have rarely been explored for medical imaging tasks. In this paper, we adopt the Vision Transformer for retinal disease classification by pre-training the Transformer model on a large fundus image database and then fine-tuning it on downstream retinal disease classification tasks. In addition, to fully exploit the feature representations extracted from individual image patches, we propose a multiple instance learning (MIL) based ‘MIL head’, which can be conveniently attached to the Vision Transformer in a plug-and-play manner and effectively enhances model performance on downstream fundus image classification tasks. The proposed MIL-VT framework achieves superior performance over CNN models on two publicly available datasets when trained and tested under the same setup. The implementation code and pre-trained weights are released for public access (Code link: https://github.com/greentreeys/MIL-VT).
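
As a rough illustration of the idea described above (the authors' released repository contains the full framework), the sketch below shows how an attention-based MIL pooling head, in the spirit of Ilse et al.'s attention-based deep MIL, could be attached to the patch tokens of a ViT backbone. The class name, layer sizes, and gated-attention formulation are illustrative assumptions, not the authors' exact implementation.

# A minimal, illustrative sketch (not the authors' released implementation): an
# attention-based MIL pooling head that treats ViT patch embeddings as instances
# of a bag and produces image-level logits. Layer names and sizes are assumptions.
import torch
import torch.nn as nn


class MILHead(nn.Module):
    """Gated-attention MIL pooling over patch tokens (in the spirit of Ilse et al.)."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256, num_classes: int = 5):
        super().__init__()
        self.attn_V = nn.Linear(embed_dim, hidden_dim)   # tanh branch
        self.attn_U = nn.Linear(embed_dim, hidden_dim)   # sigmoid gate branch
        self.attn_w = nn.Linear(hidden_dim, 1)           # per-patch attention score
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) patch embeddings, class token excluded.
        gate = torch.tanh(self.attn_V(patch_tokens)) * torch.sigmoid(self.attn_U(patch_tokens))
        weights = torch.softmax(self.attn_w(gate), dim=1)   # (B, N, 1), sums to 1 over patches
        bag_feature = (weights * patch_tokens).sum(dim=1)   # (B, D) weighted average of patches
        return self.classifier(bag_feature)


if __name__ == "__main__":
    # Toy forward pass: 2 images, 196 patch tokens of dimension 768 each.
    tokens = torch.randn(2, 196, 768)
    print(MILHead()(tokens).shape)  # torch.Size([2, 5])

Because the head only consumes the patch tokens that the backbone already produces, its logits could simply be combined (for example, averaged or summed with a weighting) with those of the standard class-token head during fine-tuning, which is what makes such a head attachable in a plug-and-play manner.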

Acknowledgment

This work was funded by the Key-Area Research and Development Program of Guangdong Province, China (No. 2018B010111001), and the Scientific and Technical Innovation 2030 - 'New Generation Artificial Intelligence' Project (No. 2020AAA0104100).

Author information

Corresponding author

Correspondence to Kai Ma.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yu, S., et al. (2021). MIL-VT: Multiple Instance Learning Enhanced Vision Transformer for Fundus Image Classification. In: de Bruijne, M., et al. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science, vol. 12908. Springer, Cham. https://doi.org/10.1007/978-3-030-87237-3_5

  • DOI: https://doi.org/10.1007/978-3-030-87237-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87236-6

  • Online ISBN: 978-3-030-87237-3

  • eBook Packages: Computer Science, Computer Science (R0)
