Abstract
Medical imaging, such as X-ray and CT scanning, plays a critical role in diagnostics, yet growing workloads lead to reporting delays and potential errors. Traditional deep-learning approaches often struggle to capture the complex semantic relationships in long medical reports, producing texts that lack coherence and consistency. To address these challenges, we propose a novel multi-stage generative framework based on diffusion models. A cross-attention mechanism attends simultaneously to textual and visual features, significantly improving the model’s ability to align image content with accurate textual descriptions. We further optimize multimodal information fusion by integrating skip connections, Long Short-Term Memory (LSTM) networks, and MIX-MLP networks, enhancing the flow of information between modalities. Together, these fusion mechanisms improve the accuracy and coherence of automatic report generation. Evaluated on the IU-XRAY and MIMIC-CXR datasets, the model achieves state-of-the-art performance across multiple metrics, including BLEU, METEOR, and ROUGE, significantly surpassing prior methods. These results validate the framework’s effectiveness in generating professional, coherent medical reports and offer a reliable way to alleviate the burden of manual reporting. The source code is available at https://github.com/watersunhznu/DifMIRG.
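The cross-attention described above, in which report-token queries attend over image-patch keys and values, can be illustrated with a minimal, dependency-free sketch. This is a generic scaled dot-product cross-attention, not the authors' released implementation; all function names and the toy dimensions are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(text_feats, image_feats, d):
    """Fuse modalities: each text token (query) attends over image patches.

    text_feats:  list of d-dim query vectors, one per report token
    image_feats: list of d-dim vectors serving as both keys and values,
                 one per image patch
    Returns one fused d-dim vector per text token.
    """
    out = []
    for q in text_feats:
        # Scaled dot-product similarity between the token and every patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in image_feats]
        weights = softmax(scores)
        # Weighted sum of patch features -> image-conditioned token feature.
        fused = [sum(w * v[j] for w, v in zip(weights, image_feats))
                 for j in range(d)]
        out.append(fused)
    return out

# Toy example: one text token aligned with the first of two image patches.
fused = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], d=2)
```

In a real model the queries, keys, and values would each pass through learned linear projections, and the fused features would feed the diffusion decoder; this sketch only shows the alignment step itself.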






Data availability
This study utilized publicly available datasets, namely IU X-Ray and MIMIC-CXR. More information about these datasets can be found in their respective original publications.
Funding
This research was funded in part by the Zhejiang Provincial Natural Science Foundation of China under Grants LQ23F010005 and LY24F030012, by the Scientific Research Fund of the Zhejiang Provincial Education Department under Grant Y202250022, by the Joint Funds of the Zhejiang Provincial Natural Science Foundation of China under Grant LHY21E090004, and by the 2022 Professional Development Projects of Teachers for Domestic Visiting Scholars of Colleges and Universities, China, under Grant FX2022075.
Author information
Contributions
Conceptualization, writing-original draft, S.S.; writing-review and editing, Z.S.; software, formal analysis, J.M.; methodology, Q.H.; data curation, resources, J.L.; visualization, investigation, K.H.; funding acquisition, supervision, Z.Y.; project administration, validation, Y.F. All authors have read and agreed to the published version of the manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, S., Su, Z., Meizhou, J. et al. Optimizing medical image report generation through a discrete diffusion framework. J Supercomput 81, 637 (2025). https://doi.org/10.1007/s11227-025-07111-2