Abstract
With the emergence of foundation models, the generation quality and generalisation ability of image generation methods have improved substantially. Medical image generation, however, remains a challenging yet promising task. Recently, diffusion-based models have become prominent in multi-modal image generation owing to their flexibility. To address the scarcity of high-quality medical images and the high cost of annotation, we propose a fine-grained multi-modal fundus image generation method built on foundation models and investigate it as an efficient means of data augmentation. First, we combine optic fundus images, fundus vessel images, and class textual information into a weakly supervised fine-tuning dataset. Then, building on Stable Diffusion and ControlNet, we fine-tune our method with LoRA to generate high-resolution fundus images of specific diseases in a targeted manner. Furthermore, we use these synthetic fundus images together with existing datasets for data augmentation or model fine-tuning, improving performance on the glaucoma classification task. Extensive experiments show that our method produces high-quality medical fundus images and transfers well to real-world medical imaging tasks. Moreover, the experimental results show that the generated fundus images are effective as augmentation, indicating that generation with foundation models is effective in certain domains.
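To make the generation stage of the abstract concrete, the following is a minimal sketch (not the authors' released code) of vessel-conditioned fundus image synthesis with the diffusers library: Stable Diffusion guided spatially by a ControlNet and specialised with LoRA weights. The ControlNet checkpoint, the LoRA path, and the prompt are placeholders/assumptions; the paper trains its own ControlNet and LoRA on (fundus image, vessel map, class text) triplets.

```python
# Sketch: ControlNet-conditioned Stable Diffusion with LoRA weights on top.
# All model IDs and file paths below are assumptions for illustration only.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Placeholder ControlNet; the paper fine-tunes one on fundus vessel maps.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical LoRA weights fine-tuned for disease-specific fundus generation.
pipe.load_lora_weights("path/to/fundus_glaucoma_lora")

# A fundus vessel image serves as the spatial condition; class text as the prompt.
vessel_map = load_image("vessel_map.png")
image = pipe(
    prompt="a color fundus photograph with glaucoma",
    image=vessel_map,
    num_inference_steps=30,
).images[0]
image.save("synthetic_glaucoma_fundus.png")
```

The synthetic images produced this way can then be mixed with a real dataset such as Drishti-GS to augment training of a glaucoma classifier, as the abstract describes.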
Acknowledgement
This work is supported by the fund for building world-class universities (disciplines) of Renmin University of China. Computing resources were provided by the Public Computing Cloud Platform of Renmin University of China.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, X. et al. (2024). Fine-Grained Multi-modal Fundus Image Generation Based on Diffusion Models for Glaucoma Classification. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_5
DOI: https://doi.org/10.1007/978-3-031-53302-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0