Representation, Alignment, Fusion: A Generic Transformer-Based Framework for Multi-modal Glaucoma Recognition

Zhou, You; Yang, Gang; Zhou, Yang; Ding, Dayong; Zhao, Jianchun

doi:10.1007/978-3-031-43990-2_66

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14226)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

Early glaucoma can be diagnosed with various modalities based on morphological features. However, most existing automated solutions rely on a single modality, such as Color Fundus Photography (CFP), which lacks 3D structural information, or Optical Coherence Tomography (OCT), which suffers from insufficient specificity for glaucoma. To detect glaucoma effectively from both CFP and OCT, we propose MM-RAF, a generic multi-modal Transformer-based framework for glaucoma recognition. The framework is implemented with pure self-attention mechanisms and consists of three simple and effective modules: Bilateral Contrastive Alignment (BCA) aligns both modalities into the same semantic space to bridge the semantic gap; Multiple Instance Learning Representation (MILR) aggregates multiple OCT B-scans into a semantic structure and downsizes the scale of the OCT branch; and Hierarchical Attention Fusion (HAF) enhances the cross-modality interaction capability with spatial information. With these three modules, the framework can effectively handle interaction between modalities that differ greatly from each other. Experimental results demonstrate that the framework outperforms existing multi-modal methods on this task and remains robust even on a small clinical dataset. Moreover, visualization shows that OCT can reveal subtle abnormalities in CFP, indicating that the relationship between the modalities is captured. Our code is available at https://github.com/YouZhouRUC/MM-RAF.
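To make two of the ideas named in the abstract more concrete, the sketch below illustrates contrastive alignment of paired embeddings and attention-weighted pooling over multiple instances. It is not the authors' implementation: the abstract does not state the loss or the pooling rule, so the symmetric InfoNCE objective, the linear attention scoring, the temperature value, and every function and parameter name here are assumptions chosen for illustration only. The authors' repository linked above is the authoritative reference.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def bilateral_contrastive_loss(cfp_feats, oct_feats, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (an assumed stand-in for BCA).

    Row i of cfp_feats and row i of oct_feats are treated as the same eye;
    every other pairing in the batch is treated as a negative.
    """
    cfp = [l2_normalize(v) for v in cfp_feats]
    oct_ = [l2_normalize(v) for v in oct_feats]
    n = len(cfp)
    # Temperature-scaled cosine similarity between every CFP/OCT pair.
    sim = [[sum(a * b for a, b in zip(cfp[i], oct_[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    loss_c2o = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    loss_o2c = sum(cross_entropy_row(sim_t[j], j) for j in range(n)) / n
    return 0.5 * (loss_c2o + loss_o2c)

def mil_attention_pool(instance_feats, scoring_weights):
    """Attention-weighted pooling of instance features (an assumed stand-in for MILR).

    Each instance (for example, one B-scan embedding) gets a scalar score from
    a hypothetical linear scorer; a softmax over the scores weights the average.
    """
    scores = [sum(w * x for w, x in zip(scoring_weights, f))
              for f in instance_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(instance_feats[0])
    pooled = [sum(attn[k] * instance_feats[k][d]
                  for k in range(len(instance_feats)))
              for d in range(dim)]
    return pooled, attn

# Toy usage: correctly paired embeddings should score a lower loss than shuffled ones.
aligned = bilateral_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = bilateral_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
pooled, attn = mil_attention_pool([[1, 0], [0, 1], [1, 1]], [0.5, 0.5])
```

In this toy setting the loss is small when matching rows agree and large when they are shuffled, which is the behaviour an alignment objective is meant to reward. The pooling step reduces any number of instance vectors to a single vector of the same dimension, which is one plausible way to shrink an OCT branch with many B-scans.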

Computing resources were provided by the Public Computing Cloud Platform of Renmin University of China.

Author information

Authors and Affiliations

School of Information, Renmin University of China, Beijing, China
You Zhou & Gang Yang
MOE Key Lab of DEKE, Renmin University of China, Beijing, China
Gang Yang
Vistel AI Lab, Visionary Intelligence Ltd., Beijing, China
Yang Zhou, Dayong Ding & Jianchun Zhao

Corresponding author

Correspondence to Gang Yang.

Editor information

Editors and Affiliations

Icahn School of Medicine, Mount Sinai, NYC, NY, USA, Tel Aviv University, Tel Aviv, Israel
Hayit Greenspan
Emory University, Atlanta, GA, USA
Anant Madabhushi
Queen’s University, Kingston, ON, Canada
Parvin Mousavi
The University of British Columbia, Vancouver, BC, Canada
Septimiu Salcudean
Yale University, New Haven, CT, USA
James Duncan
IBM Research, San Jose, CA, USA
Tanveer Syeda-Mahmood
Johns Hopkins University, Baltimore, MD, USA
Russell Taylor

1 Electronic supplementary material

Supplementary material 1 (pdf 7948 KB)

About this paper

Cite this paper

Zhou, Y., Yang, G., Zhou, Y., Ding, D., Zhao, J. (2023). Representation, Alignment, Fusion: A Generic Transformer-Based Framework for Multi-modal Glaucoma Recognition. In: Greenspan, H., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol 14226. Springer, Cham. https://doi.org/10.1007/978-3-031-43990-2_66

DOI: https://doi.org/10.1007/978-3-031-43990-2_66
Published: 01 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43989-6
Online ISBN: 978-3-031-43990-2
eBook Packages: Computer Science, Computer Science (R0)
