Abstract
With the advancement and prevailing success of Transformer models in natural language processing (NLP), an increasing number of research works have explored the applicability of Transformers to various vision tasks and reported performance superior to that of convolutional neural networks (CNNs). However, as proper training of a Transformer generally requires an extremely large quantity of data, it has rarely been explored for medical imaging tasks. In this paper, we adopt the Vision Transformer for retinal disease classification by pre-training the model on a large fundus image database and then fine-tuning it on downstream retinal disease classification tasks. In addition, to fully exploit the feature representations extracted from individual image patches, we propose a multiple instance learning (MIL) based ‘MIL head’, which can be conveniently attached to the Vision Transformer in a plug-and-play manner and effectively enhances model performance on downstream fundus image classification tasks. The proposed MIL-VT framework achieves superior performance over CNN models on two publicly available datasets when trained and tested under the same setup. The implementation code and pre-trained weights are released for public access (code: https://github.com/greentreeys/MIL-VT).
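The abstract does not detail how the ‘MIL head’ aggregates the per-patch features, so the following is only a minimal sketch of one plausible realization: gated attention-based MIL pooling (in the spirit of Ilse et al.'s attention-based deep MIL) applied to the sequence of ViT patch tokens, producing a single bag-level feature for classification. All dimensions, weight names, and the choice of gated attention are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mil_attention_pool(patch_tokens, W_v, W_u, w):
    """Gated attention MIL pooling over ViT patch embeddings (sketch).

    patch_tokens : (N, D) array, one row per image patch (the "instances").
    W_v, W_u     : (D, H) learned projections; w : (H,) learned scoring vector.
    Returns the (D,) bag-level feature and the (N,) attention weights.
    """
    V = np.tanh(patch_tokens @ W_v)                 # (N, H)
    U = 1.0 / (1.0 + np.exp(-(patch_tokens @ W_u))) # sigmoid gate, (N, H)
    scores = (V * U) @ w                            # per-patch score, (N,)
    attn = softmax(scores)                          # attention over patches
    bag = attn @ patch_tokens                       # weighted sum -> (D,)
    return bag, attn

# Toy usage: 14x14 = 196 patch tokens of dimension 16 (illustrative sizes).
rng = np.random.default_rng(0)
D, H, N = 16, 8, 196
h = rng.normal(size=(N, D))
bag, attn = mil_attention_pool(
    h, rng.normal(size=(D, H)), rng.normal(size=(D, H)), rng.normal(size=H)
)
```

The bag-level feature `bag` would then be fed to a small classification layer, in parallel with the standard class-token head, which is what makes such a head attachable in a plug-and-play manner.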
Acknowledgment
This work was funded by the Key-Area Research and Development Program of Guangdong Province, China (No. 2018B010111001), and the Scientific and Technical Innovation 2030 - ‘New Generation Artificial Intelligence’ Project (No. 2020AAA0104100).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Yu, S. et al. (2021). MIL-VT: Multiple Instance Learning Enhanced Vision Transformer for Fundus Image Classification. In: de Bruijne, M., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. MICCAI 2021. Lecture Notes in Computer Science(), vol 12908. Springer, Cham. https://doi.org/10.1007/978-3-030-87237-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87236-6
Online ISBN: 978-3-030-87237-3
eBook Packages: Computer Science, Computer Science (R0)