Abstract
Medical imaging, such as X-ray and CT scanning, plays a critical role in diagnostics, yet growing workloads lead to reporting delays and potential errors. Traditional deep-learning approaches often struggle to capture the complex semantic relationships in long medical reports, producing texts that lack coherence and consistency. To address these challenges, we propose a novel multi-stage generative framework based on diffusion models. A cross-attention mechanism attends simultaneously to textual and visual features, significantly improving the model’s ability to align image content with accurate textual descriptions. We further optimize multimodal information fusion by integrating skip connections, Long Short-Term Memory (LSTM) networks, and MIX-MLP networks, enhancing the flow of information between modalities. Together, these fusion mechanisms improve the accuracy and coherence of automatic report generation. Evaluated on the IU-XRAY and MIMIC-CXR datasets, the model achieves state-of-the-art performance across multiple metrics, including BLEU, METEOR, and ROUGE, significantly surpassing prior methods. These results validate the framework’s effectiveness in generating professional, coherent medical reports and offer a reliable way to alleviate the burden of manual reporting. The source code is available at https://github.com/watersunhznu/DifMIRG.
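The cross-attention described above, in which report-token queries attend over image-patch keys and values, can be illustrated with a minimal, dependency-free sketch. This is a generic scaled dot-product cross-attention, not the authors' released implementation; all function names and the toy dimensions are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(text_feats, image_feats, d):
    """Fuse modalities: each text token (query) attends over image patches.

    text_feats:  list of d-dim query vectors, one per report token
    image_feats: list of d-dim vectors serving as both keys and values,
                 one per image patch
    Returns one fused d-dim vector per text token.
    """
    out = []
    for q in text_feats:
        # Scaled dot-product similarity between the token and every patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in image_feats]
        weights = softmax(scores)
        # Weighted sum of patch features -> image-conditioned token feature.
        fused = [sum(w * v[j] for w, v in zip(weights, image_feats))
                 for j in range(d)]
        out.append(fused)
    return out

# Toy example: one text token aligned with the first of two image patches.
fused = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], d=2)
```

In a real model the queries, keys, and values would each pass through learned linear projections, and the fused features would feed the diffusion decoder; this sketch only shows the alignment step itself.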






Data availability
This study utilized publicly available datasets, namely IU X-Ray and MIMIC-CXR. More information about these datasets can be found in their respective original publications.
Funding
This research was funded in part by the Zhejiang Provincial Natural Science Foundation of China under Grants LQ23F010005 and LY24F030012, by the Scientific Research Fund of the Zhejiang Provincial Education Department under Grant Y202250022, by the Joint Funds of the Zhejiang Provincial Natural Science Foundation of China under Grant LHY21E090004, and by the 2022 Professional Development Projects of Teachers for Domestic Visiting Scholars of Colleges and Universities, China, under Grant FX2022075.
Author information
Contributions
Conceptualization, writing-original draft, S.S.; writing-review and editing, Z.S.; software, formal analysis, J.M.; methodology, Q.H.; data curation, resources, J.L.; visualization, investigation, K.H.; funding acquisition, supervision, Z.Y.; project administration, validation, Y.F. All authors have read and agreed to the published version of the manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, S., Su, Z., Meizhou, J. et al. Optimizing medical image report generation through a discrete diffusion framework. J Supercomput 81, 637 (2025). https://doi.org/10.1007/s11227-025-07111-2