Automatic Medical Image Report Generation with Multi-view and Multi-modal Attention Mechanism

Yang, Shaokang; Niu, Jianwei; Wu, Jiyan; Liu, Xuefeng

doi:10.1007/978-3-030-60248-2_48

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12454)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Writing medical image reports is a time-consuming and knowledge-intensive task. Existing machine learning and deep learning models, however, often produce near-identical reports and inaccurate descriptions. To address these issues, we propose a multi-view and multi-modal (MvMM) approach that uses visual features from multiple perspectives together with medical semantic features to generate diverse and accurate medical reports. First, we design a multi-view encoder with attention to extract visual features from the frontal and lateral views. Second, we extract medical concepts from the radiology reports, adopt them as semantic features, and combine them with the visual features through a two-layer decoder with attention. Third, we fine-tune the model parameters using self-critical training with a coverage reward so that the generated reports cover medical concepts more accurately. Experimental results show that our method achieves noticeable improvements over the baseline approaches, increasing the CIDEr score by 0.157.
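The third step can be illustrated with a minimal sketch of a self-critical objective that adds a coverage term to the reward. This is not the authors' implementation: the abstract does not define the coverage reward, so the definition used here (the fraction of reference medical concepts mentioned in the generated report), the weight of 0.5, and all function names are assumptions made for illustration. The base score stands in for a sentence-level metric such as CIDEr and is passed in as a plain number.

```python
def coverage_reward(report_tokens, reference_concepts):
    """Fraction of reference concepts that appear in the generated report.

    Assumed definition for illustration; the abstract does not specify it.
    """
    if not reference_concepts:
        return 0.0
    mentioned = set(report_tokens)
    hits = sum(1 for concept in reference_concepts if concept in mentioned)
    return hits / len(reference_concepts)


def total_reward(base_score, report_tokens, reference_concepts, weight=0.5):
    """Base metric score (e.g. CIDEr) plus a weighted coverage term.

    The weight is a hypothetical hyperparameter.
    """
    return base_score + weight * coverage_reward(report_tokens, reference_concepts)


def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE-style loss that uses the greedy decode's reward as baseline.

    A sampled report that beats the greedy report gets a positive advantage,
    so minimizing the loss raises its log-probability.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


# Usage with made-up values: the sampled report mentions one of two concepts,
# the greedy report mentions none.
concepts = ["effusion", "pneumothorax"]
sampled = ["no", "pleural", "effusion"]
greedy = ["the", "lungs", "are", "clear"]

r_sample = total_reward(1.0, sampled, concepts)
r_greedy = total_reward(1.0, greedy, concepts)
loss = self_critical_loss([-0.5, -0.5], r_sample, r_greedy)
```

In a real training loop the log-probabilities would come from the decoder and the loss would be backpropagated; the sketch only shows how the reward and baseline combine.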

Supported by Hangzhou Innovation Institution, Beihang University.

Acknowledgment

This work has been supported by the National Natural Science Foundation of China (61772060, 61976012, 61602024), the Qianjiang Postdoctoral Foundation (2020-Y4-A-001), and the CERNET Innovation Project (NGII20170315).

Author information

Authors and Affiliations

State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Shaokang Yang, Jianwei Niu & Xuefeng Liu
Beijing Advanced Innovation Center for Big Data and Brain Computing (BDBC), Beihang University, Beijing, China
Shaokang Yang, Jianwei Niu & Xuefeng Liu
Research Center of Big Data and Computational Intelligence, Hangzhou Innovation Institute of Beihang University, Hangzhou, China
Jianwei Niu & Jiyan Wu

Corresponding author

Correspondence to Jianwei Niu.

Editor information

Editors and Affiliations

Columbia University, New York, NY, USA
Meikang Qiu

Cite this paper

Yang, S., Niu, J., Wu, J., Liu, X. (2020). Automatic Medical Image Report Generation with Multi-view and Multi-modal Attention Mechanism. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12454. Springer, Cham. https://doi.org/10.1007/978-3-030-60248-2_48

DOI: https://doi.org/10.1007/978-3-030-60248-2_48
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60247-5
Online ISBN: 978-3-030-60248-2
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)