Abstract
With the advancement and prevailing success of Transformer models in natural language processing (NLP), an increasing number of research works have explored the applicability of Transformers to various vision tasks and reported performance superior to that of convolutional neural networks (CNNs). However, as proper training of a Transformer generally requires an extremely large quantity of data, it has rarely been explored for medical imaging tasks. In this paper, we adopt the Vision Transformer for retinal disease classification by pre-training the model on a large fundus image database and then fine-tuning it on downstream retinal disease classification tasks. In addition, to fully exploit the feature representations extracted from individual image patches, we propose a multiple instance learning (MIL) based ‘MIL head’, which can be conveniently attached to the Vision Transformer in a plug-and-play manner and effectively enhances model performance on downstream fundus image classification tasks. The proposed MIL-VT framework achieves superior performance over CNN models on two publicly available datasets when trained and tested under the same setup. The implementation code and pre-trained weights are released for public access (code: https://github.com/greentreeys/MIL-VT).
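The abstract does not detail how the ‘MIL head’ aggregates the per-patch features, so the following is only a minimal sketch of one plausible realization: gated attention-based MIL pooling (in the spirit of Ilse et al.'s attention-based deep MIL) applied to the sequence of ViT patch tokens, producing a single bag-level feature for classification. All dimensions, weight names, and the choice of gated attention are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mil_attention_pool(patch_tokens, W_v, W_u, w):
    """Gated attention MIL pooling over ViT patch embeddings (sketch).

    patch_tokens : (N, D) array, one row per image patch (the "instances").
    W_v, W_u     : (D, H) learned projections; w : (H,) learned scoring vector.
    Returns the (D,) bag-level feature and the (N,) attention weights.
    """
    V = np.tanh(patch_tokens @ W_v)                 # (N, H)
    U = 1.0 / (1.0 + np.exp(-(patch_tokens @ W_u))) # sigmoid gate, (N, H)
    scores = (V * U) @ w                            # per-patch score, (N,)
    attn = softmax(scores)                          # attention over patches
    bag = attn @ patch_tokens                       # weighted sum -> (D,)
    return bag, attn

# Toy usage: 14x14 = 196 patch tokens of dimension 16 (illustrative sizes).
rng = np.random.default_rng(0)
D, H, N = 16, 8, 196
h = rng.normal(size=(N, D))
bag, attn = mil_attention_pool(
    h, rng.normal(size=(D, H)), rng.normal(size=(D, H)), rng.normal(size=H)
)
```

The bag-level feature `bag` would then be fed to a small classification layer, in parallel with the standard class-token head, which is what makes such a head attachable in a plug-and-play manner.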
Acknowledgment
This work was funded by the Key-Area Research and Development Program of Guangdong Province, China (No. 2018B010111001), and the Scientific and Technical Innovation 2030 - ‘New Generation Artificial Intelligence’ Project (No. 2020AAA0104100).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Yu, S. et al. (2021). MIL-VT: Multiple Instance Learning Enhanced Vision Transformer for Fundus Image Classification. In: de Bruijne, M., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. MICCAI 2021. Lecture Notes in Computer Science(), vol 12908. Springer, Cham. https://doi.org/10.1007/978-3-030-87237-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87236-6
Online ISBN: 978-3-030-87237-3
eBook Packages: Computer Science, Computer Science (R0)