DOI: 10.1145/3286606.3286863

Automatic Caption Generation for Medical Images

Published: 10 October 2018

Abstract

With the increasing availability of medical images from different modalities (X-ray, CT, PET, MRI, ultrasound, etc.), and with the huge advances in fast, accurate computing power offered by current graphics processing units, automatic caption generation from medical images has become a new way to improve healthcare and a key method for obtaining better results at lower cost. In this paper, we give a comprehensive overview of the task of image captioning in the medical domain, covering existing models, the benchmark medical image-caption datasets, and the evaluation metrics that have been used to measure the quality of the generated captions.
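The caption-quality metrics surveyed here are largely n-gram overlap measures such as BLEU. As a rough illustration only (not code from the paper), the following is a minimal pure-Python sketch of sentence-level BLEU against a single reference, using clipped n-gram precisions and a brevity penalty; the example captions are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, candidate, n):
    """Modified n-gram precision: candidate counts clipped by reference counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(c, ref[g]) for g, c in cand.items()) / total

def bleu(reference, candidate, max_n=2):
    """Sentence-level BLEU with a single reference: geometric mean of
    clipped precisions up to max_n, times a brevity penalty."""
    precisions = [clipped_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: no penalty when the candidate is at least as long
    # as the reference, exponential penalty otherwise.
    bp = (1.0 if len(candidate) >= len(reference)
          else math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

# Invented example captions, purely for illustration.
ref = "chest x ray shows no acute disease".split()
print(bleu(ref, ref))                              # 1.0 for an exact match
print(bleu(ref, "chest x ray is normal".split()))  # partial overlap, below 1.0
```

Real evaluations (e.g. the ImageCLEF caption tasks) typically use multi-reference, smoothed BLEU variants alongside ROUGE, METEOR, and CIDEr; this sketch only shows the core clipped-precision idea.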



Published In

SCA '18: Proceedings of the 3rd International Conference on Smart City Applications
October 2018, 580 pages
ISBN: 9781450365628
DOI: 10.1145/3286606

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Attention mechanism
  2. CNN
  3. Computer Vision
  4. Deep Neural Networks
  5. Encoder-Decoder framework
  6. Generative models
  7. LSTM
  8. Medical Image Captioning
  9. Natural Language Processing
  10. RNN
  11. Retrieval-based models

Qualifiers

  • Research-article
  • Research
  • Refereed limited


Cited By

  • (2024) From Data to Diagnosis: Enhancing Radiology Reporting With Clinical Features Encoding and Cross-Modal Coherence. IEEE Access, 12, 127341-127356. DOI: 10.1109/ACCESS.2024.3449929
  • (2024) XRaySwinGen: Automatic Medical Reporting for X-ray Exams with Multimodal Model. Heliyon, e27516. DOI: 10.1016/j.heliyon.2024.e27516
  • (2024) Self-Enhanced Attention for Image Captioning. Neural Processing Letters, 56(2). DOI: 10.1007/s11063-024-11527-x
  • (2024) CSAMDT: Conditional Self Attention Memory-Driven Transformers for Radiology Report Generation from Chest X-Ray. Journal of Imaging Informatics in Medicine, 37(6), 2825-2837. DOI: 10.1007/s10278-024-01126-6
  • (2024) Current Approaches and Challenges in Medical Image Analysis and Visually Explainable Artificial Intelligence as Future Opportunities. The Future of Artificial Intelligence and Robotics, 796-811. DOI: 10.1007/978-3-031-60935-0_69
  • (2023) Lenke Classification Report Generation Method for Scoliosis Based on Spatial and Context Dual Attention. Applied Sciences, 13(13), 7981. DOI: 10.3390/app13137981
  • (2023) CaptionGenX: Advancements in Deep Learning for Automated Image Captioning. 2023 3rd Asian Conference on Innovation in Technology (ASIANCON), 1-8. DOI: 10.1109/ASIANCON58793.2023.10270020
  • (2023) Vision Transformer and Language Model Based Radiology Report Generation. IEEE Access, 11, 1814-1824. DOI: 10.1109/ACCESS.2022.3232719
  • (2023) Automatic aid diagnosis report generation for lumbar disc MR image based on lightweight artificial neural networks. Biomedical Signal Processing and Control, 86, 105275. DOI: 10.1016/j.bspc.2023.105275
  • (2022) Reconsidering Tourism Destination Images by Exploring Similarities between Travelogue Texts and Photographs. ISPRS International Journal of Geo-Information, 11(11), 553. DOI: 10.3390/ijgi11110553
