Optimizing medical image report generation through a discrete diffusion framework

Published in The Journal of Supercomputing.

Abstract

Medical imaging, such as X-ray and CT scanning, plays a critical role in diagnostics, yet growing workloads lead to reporting delays and potential errors. Traditional deep learning-based approaches often struggle to capture the complex semantic relationships in long medical reports, producing text that lacks coherence and consistency. To address these challenges, we propose a novel multi-stage generative framework based on diffusion models. A cross-attention mechanism attends simultaneously to textual and visual features, significantly improving the model's ability to align image content with accurate textual descriptions. We further optimize multimodal information fusion by integrating skip connections, Long Short-Term Memory (LSTM) networks, and MIX-MLP networks, enhancing the flow of information between modalities. Evaluated on the IU X-Ray and MIMIC-CXR datasets, the model achieves state-of-the-art performance across multiple metrics, including BLEU, METEOR, and ROUGE, significantly surpassing prior methods. These results validate the framework's effectiveness in generating professional, coherent medical reports and offer a reliable way to alleviate the burden of manual reporting. The source code is available at https://github.com/watersunhznu/DifMIRG.
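The abstract's cross-attention mechanism — report tokens attending over image features — can be illustrated with a minimal NumPy sketch. This is a generic scaled dot-product cross-attention, not the authors' implementation; all dimensions, weight initializations, and names below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats, d_k=64, seed=0):
    """Text tokens act as queries; image patch features act as keys/values.

    text_feats:  (T, d_t) report-token embeddings
    image_feats: (P, d_v) image-patch embeddings
    Returns the attended features (T, d_k) and the attention map (T, P).
    """
    rng = np.random.default_rng(seed)
    d_t, d_v = text_feats.shape[-1], image_feats.shape[-1]
    # Random projection weights stand in for learned parameters.
    W_q = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
    W_k = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    W_v = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    Q = text_feats @ W_q            # (T, d_k)
    K = image_feats @ W_k           # (P, d_k)
    V = image_feats @ W_v           # (P, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T, P): each token's weights over patches
    return attn @ V, attn

# Toy example: 5 report tokens attending over 49 image patches (7x7 feature map).
text = np.random.default_rng(1).standard_normal((5, 32))
img = np.random.default_rng(2).standard_normal((49, 128))
out, attn = cross_attention(text, img)
```

Each row of `attn` is a distribution over image patches, which is what lets the decoder ground each generated word in specific image regions.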




Data availability

This study utilized publicly available datasets, namely IU X-Ray and MIMIC-CXR. More information about these datasets can be found in their respective original publications.


Funding

This research was funded in part by the Zhejiang Provincial Natural Science Foundation of China under Grants LQ23F010005 and LY24F030012, by the Scientific Research Fund of the Zhejiang Provincial Education Department under Grant Y202250022, by the Joint Funds of the Zhejiang Provincial Natural Science Foundation of China under Grant LHY21E090004, and by "The Professional Development Projects of Teachers" for Domestic Visiting Scholars of Colleges and Universities in 2022, China, under Grant FX2022075.

Author information

Contributions

Conceptualization, writing-original draft, S.S.; writing-review and editing, Z.S.; software, formal analysis, J.M.; methodology, Q.H.; data curation, resources, J.L.; visualization, investigation, K.H.; funding acquisition, supervision, Z.Y.; project administration, validation, Y.F. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Yang Feng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sun, S., Su, Z., Meizhou, J. et al. Optimizing medical image report generation through a discrete diffusion framework. J Supercomput 81, 637 (2025). https://doi.org/10.1007/s11227-025-07111-2
