Advancing Text-Driven Chest X-Ray Generation with Policy-Based Reinforcement Learning

Han, Woojung; Kim, Chanyoung; Ju, Dayun; Shim, Yumin; Hwang, Seong Jae

doi:10.1007/978-3-031-72384-1_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15003)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

Recent advances in text-conditioned image generation diffusion models have begun paving the way for new opportunities in the modern medical domain, particularly in generating Chest X-rays (CXRs) from diagnostic reports. Nonetheless, to further drive the diffusion models to generate CXRs that faithfully reflect the complexity and diversity of real data, it has become evident that a nontrivial learning approach is needed. In light of this, we propose CXRL, a framework motivated by the potential of reinforcement learning (RL). Specifically, we integrate a policy gradient RL approach with multiple well-designed, CXR-domain-specific reward models. This approach guides the diffusion denoising trajectory, achieving precise CXR posture and pathological details. Here, considering the complex medical image environment, we present “RL with Comparative Feedback” (RLCF) for the reward mechanism, a human-like comparative evaluation that is known to be more effective and reliable in complex scenarios than direct evaluation. Our CXRL framework jointly optimizes learnable adaptive condition embeddings (ACE) and the image generator, enabling the model to produce more accurate CXRs of higher perceptual quality. Our extensive evaluation on the MIMIC-CXR-JPG dataset demonstrates the effectiveness of our RL-based tuning approach. Consequently, our CXRL generates pathologically realistic CXRs, establishing a new standard for generating CXRs with high fidelity to real-world clinical scenarios. Project page: https://micv-yonsei.github.io/cxrl2024/.

W. Han and C. Kim contributed equally.

‡ In accordance with the MIMIC-CXR data usage license [13], the text reports presented in Figs. 1, 2 and 3 have been rephrased while maintaining the original content.
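
To make the abstract's combination of a policy gradient with comparative feedback more concrete, the toy sketch below may help. It is not the authors' implementation: the 1-D Gaussian "denoising" step, the quadratic `score` function standing in for a reward model, and the choice to compare each sample against one drawn from a fixed reference policy are all assumptions made purely for illustration, and every function name is hypothetical.

```python
import random


def log_prob_grad(x_prev, x_next, theta, sigma):
    """d/d(theta) of log N(x_next; x_prev + theta, sigma^2) (toy transition)."""
    return (x_next - x_prev - theta) / sigma ** 2


def rollout(theta, sigma, steps, rng):
    """Sample a toy trajectory; return the final sample and the summed
    score-function gradient over all steps."""
    x = rng.gauss(0.0, 1.0)
    grad = 0.0
    for _ in range(steps):
        x_next = rng.gauss(x + theta, sigma)
        grad += log_prob_grad(x, x_next, theta, sigma)
        x = x_next
    return x, grad


def score(x, target):
    """Hypothetical stand-in for a reward model: closer to target is better."""
    return -(x - target) ** 2


def comparative_reward(candidate, comparison, target):
    """Assumed comparative feedback: how much better the candidate scores
    than a comparison sample, rather than its absolute score."""
    return score(candidate, target) - score(comparison, target)


def policy_gradient_estimate(theta, theta_ref, target,
                             sigma=0.5, steps=4, batch=64, seed=0):
    """REINFORCE-style estimate of the gradient of the expected comparative
    reward with respect to theta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(batch):
        x, grad = rollout(theta, sigma, steps, rng)
        x_ref, _ = rollout(theta_ref, sigma, steps, rng)  # comparison sample
        total += comparative_reward(x, x_ref, target) * grad
    return total / batch


if __name__ == "__main__":
    # One gradient-ascent step on the toy policy parameter.
    theta = 0.0
    theta += 0.01 * policy_gradient_estimate(theta, theta_ref=0.0, target=2.0)
    print(theta)
```

In this sketch the comparison sample acts like a baseline: the reward is positive only when the current policy's sample beats it, which is one plausible way to read "comparative" as opposed to "direct" evaluation.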

References

1. Alsentzer, E., Murphy, J., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. pp. 72–78. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019)
2. Black, K., Janner, M., Du, Y., Kostrikov, I., Levine, S.: Training diffusion models with reinforcement learning. In: The Twelfth International Conference on Learning Representations (2024)
3. Chambon, P., Bluethgen, C., Delbrouck, J.B., Van der Sluijs, R., Połacin, M., Chaves, J.M.Z., Abraham, T.M., Purohit, S., Langlotz, C.P., Chaudhari, A.: Roentgen: vision-language foundation model for chest x-ray generation. arXiv preprint arXiv:2211.12737 (2022)
4. Cohen, J.P., Viviano, J.D., Bertin, P., Morrison, P., Torabian, P., Guarrera, M., Lungren, M.P., Chaudhari, A., Brooks, R., Hashir, M., et al.: Torchxrayvision: A library of chest x-ray datasets and models. In: International Conference on Medical Imaging with Deep Learning. pp. 231–249. PMLR (2022)
5. Du, Y., Jiang, Y., Tan, S., Wu, X., Dou, Q., Li, Z., Li, G., Wan, X.: Arsdm: colonoscopy images synthesis with adaptive refinement semantic diffusion models. In: International conference on medical image computing and computer-assisted intervention. pp. 339–349. Springer (2023)
6. Fan, Y., Watkins, O., Du, Y., Liu, H., Ryu, M., Boutilier, C., Abbeel, P., Ghavamzadeh, M., Lee, K., Lee, K.: Reinforcement learning for fine-tuning text-to-image diffusion models. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)
7. Hao, Y., Chi, Z., Dong, L., Wei, F.: Optimizing prompts for text-to-image generation. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
9. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
10. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)
11. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
12. Jiang, L., Mao, Y., Wang, X., Chen, X., Li, C.: Cola-diff: Conditional latent diffusion model for multi-modal mri synthesis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 398–408. Springer (2023)
13. Johnson, A.E., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)
14. Kazerouni, A., Aghdam, E.K., Heidari, M., Azad, R., Fayyaz, M., Ilker: Diffusion models in medical imaging: A comprehensive survey. Medical Image Analysis 88, 102846 (2023)
15. Ke, J., Ye, K., Yu, J., Wu, Y., Milanfar, P., Yang, F.: Vila: Learning image aesthetics from user comments with vision-language pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10041–10051 (2023)
16. Khader, F., Mueller-Franzes, G., Arasteh, S.T., Han, T., Haarburger, C., Schulze-Hagen, M., Schad, P., Engelhardt, S., Baessler, B., Foersch, S., et al.: Medical diffusion–denoising diffusion probabilistic models for 3d medical image generation. arXiv preprint arXiv:2211.03364 (2022)
17. Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)
18. Lee, S.H., Li, Y., Ke, J., Yoo, I., Zhang, H., Yu, J., Wang, Q., Deng, F., Entis, G., He, J., et al.: Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation. In: European Conference on Computer Vision. Springer (2024)
19. Lee, S., Kim, W.J., Chang, J., Ye, J.C.: LLM-CXR: Instruction-finetuned LLM for CXR image understanding and generation. In: The Twelfth International Conference on Learning Representations (2024)
20. Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 3045–3059. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (Nov 2021)
21. Liu, J., Zhao, G., Fei, Y., Zhang, M., Wang, Y., Yu, Y.: Align, attend and locate: Chest x-ray diagnosis via contrast induced attention network with limited supervision. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
22. Margaret Cheng, H.L., Stikov, N., Ghugre, N.R., Wright, G.A.: Practical medical applications of quantitative mr relaxometry. Journal of Magnetic Resonance Imaging 36(4), 805–824 (2012)
23. Mussweiler, T., Posten, A.C.: Relatively certain! comparative thinking reduces uncertainty. Cognition 122(2), 236–240 (2012)
24. Peng, W., Adeli, E., Zhao, Q., Pohl, K.M.: Generating realistic 3d brain mris using a conditional diffusion probabilistic model. In: International conference on medical image computing and computer-assisted intervention. Springer (2023)
25. Pinaya, W.H., Tudosiu, P.D., Dafflon, J., Da Costa, P.F., Fernandez, V., Nachev, P., Ourselin, S., Cardoso, M.J.: Brain imaging generation with latent diffusion models. In: MICCAI Workshop on Deep Generative Models. pp. 117–126. Springer (2022)
26. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
27. Rosen, A.F., Roalf, D.R., Ruparel, K., Blake, J., Seelaus, K., Villa, L.P., Ciric, R., Cook, P.A., Davatzikos, C., Elliott, M.A., et al.: Quantitative assessment of structural image quality. Neuroimage 169, 407–418 (2018)
28. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (2021)
29. You, K., Gu, J., Ham, J., Park, B., Kim, J., Hong, E.K., Baek, W., Roh, B.: Cxr-clip: Toward large scale chest x-ray language-image pre-training. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 101–111. Springer (2023)

Acknowledgments

We thank S. Jung, J.E. Lee, S.H. Yun, and C. Lee for their valuable medical expertise and advice. This work was supported in part by IITP 2020-0-01361 (AI Graduate School Program at Yonsei University), NRF RS-2024-00345806, and NRF RS-2023-00219019, funded by the Korean Government (MSIT).

Author information

Authors and Affiliations

Department of Computer Science, Yonsei University, Seoul, Republic of Korea
Woojung Han & Dayun Ju
Department of Artificial Intelligence, Yonsei University, Seoul, Republic of Korea
Chanyoung Kim & Seong Jae Hwang
College of Medicine, Yonsei University, Seoul, Republic of Korea
Yumin Shim

Corresponding author

Correspondence to Seong Jae Hwang.

Editor information

Editors and Affiliations

Children’s National Hospital/George Washington University, Washington, DC, USA
Marius George Linguraru
The Chinese University of Hong Kong, Hong Kong, China
Qi Dou
Technical University of Denmark, Kgs Lyngby, Denmark
Aasa Feragen
Imperial College London, London, UK
Stamatia Giannarou
Imperial College London, London, UK
Ben Glocker
Universitat de Barcelona, Barcelona, Spain
Karim Lekadir
Helmholtz Munich, Technical University of Munich and King’s College London, Munich, Germany
Julia A. Schnabel

Ethics declarations

Disclosure of Interests

The authors have no competing interests.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 603 KB)

About this paper

Cite this paper

Han, W., Kim, C., Ju, D., Shim, Y., Hwang, S.J. (2024). Advancing Text-Driven Chest X-Ray Generation with Policy-Based Reinforcement Learning. In: Linguraru, M.G., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, vol 15003. Springer, Cham. https://doi.org/10.1007/978-3-031-72384-1_6

DOI: https://doi.org/10.1007/978-3-031-72384-1_6
Published: 03 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72383-4
Online ISBN: 978-3-031-72384-1
eBook Packages: Computer Science, Computer Science (R0)
