Automatic medical image interpretation: State of the art and future directions

https://doi.org/10.1016/j.patcog.2021.107856

Highlights

  • Image interpretation is an emerging field of artificial intelligence.

  • A substantial body of research has been published under different titles, including caption generation, image interpretation, video captioning, and deep captioning.

  • For medical image analysis and interpretation, little work exists at present; the attention of researchers is needed to produce high-performance algorithms that can be applied in clinical practice.

  • This work reviews recent advances in describing medical images in natural and medical language.

  • The work compares and discusses the strengths and shortcomings of state-of-the-art work and proposes dimensions that can be explored in future work.

Abstract

Automatic natural language interpretation of medical images is an emerging field of artificial intelligence (AI). The task combines two fields of AI: computer vision and natural language processing. It is a challenging task that goes beyond object detection, segmentation, and classification because it also requires understanding the relationships between different objects in an image and the actions performed by those objects as visual representations. Image interpretation is helpful in many tasks, such as assisting visually impaired persons, information retrieval, early childhood learning, and producing human-like natural interaction with robots, among many other applications. Recently, this line of work has motivated researchers to apply the same approach to more complex biomedical images, from generating single-sentence captions to multi-sentence paragraph descriptions. Medical image captioning can assist and speed up the diagnostic process of medical professionals, and the generated report can be used for many further tasks. This is a comprehensive review of recent years' research on medical image captioning published in international conferences and journals. Common parameters are extracted to compare methods, performance, strengths, and limitations, and our recommendations are discussed. Publicly available datasets and evaluation measures used for deep-learning-based captioning of medical images are also discussed.

Introduction

Automatic image caption generation is the task of extracting the contents of an image through feature extraction techniques and describing those contents in natural language sentences using natural language processing (NLP). Image captioning combines two artificial intelligence fields: computer vision, to extract visual representations, and natural language processing, to express those representations in simple, English-like sentences. This is a challenging task that goes beyond object detection, segmentation, and classification because it also requires understanding the relationships between different objects in an image and the actions performed by those objects, and converting these visual representations into English-like sentences. With the availability of large datasets, machine-learning-based approaches to image captioning are gaining popularity day by day. Image captioning is helpful in many tasks, such as assisting visually impaired persons, information retrieval, early childhood learning, and producing human-like natural interaction with robots, but in medical imaging this topic has yet to gain traction because the field poses its own problems.

In the medical sector, the use of medical images is ubiquitous: medical professionals and radiologists use them for the diagnosis and treatment of diseases, pharmacists may use them for drug discovery, and surgeons may use imaging before, during, and after an operation to monitor the treatment process. After examining these images, competent medical professionals manually write textual reports containing their findings (normal, abnormal, or potentially abnormal) as full paragraphed descriptions, as shown in Fig. 1.

For inexperienced or less experienced examiners, writing a medical report in textual form may be error-prone because it requires a deep understanding of the disease, medical imaging, and thorough analysis of the images under consideration. For experienced medical professionals, the task is time-consuming and laborious: it takes at least half an hour to examine an image and write up the findings as a report, and they have to examine many medical images per day. So it is unpleasant and tiresome work for both experienced and inexperienced medical professionals. Due to the shortage of medical professionals in a populous country like Pakistan, the workload increases tremendously, and in areas short of medical treatment facilities the proportion of wrong diagnoses is higher [1]. In a low-income country such as Pakistan, this also imposes an extra expense on patients, who must pay an additional fee to visit the doctor again to have their findings explained in the form of a report.

To facilitate the medical image reporting process, many computer-aided report generation systems based on image captioning have been proposed that automatically extract findings from medical images and generate a textual report containing fine-grained information like an expert doctor. This saves doctors the time spent manually extracting features from images and then writing a textual report, so their workload can be reduced. Moreover, it reduces the need for extra professionals to write reports; the whole medical report generation process is automatic and efficient.

The generated reports can serve many potential users. Radiologists can use them for cross-checking, monitoring subtle changes, and final decision making. They can provide a second opinion for peer doctors, give technologists immediate, first-hand information about the image under consideration, and be used in emergencies where an expert doctor is not immediately available. In such cases, the automatically generated report can serve as context for further treatment without waiting for an expert doctor to write one.

However, this task poses a variety of challenges. In generating a medical report, instead of a single-sentence caption we have to generate a large paragraph, which is non-trivial. Moreover, a medical report contains heterogeneous information (Fig. 1): a text description (where radiologists narrate their observations), an impression (a one-sentence diagnosis), a comparison, and a list of tags (keywords from the findings carrying critical information). Using all this heterogeneous information to generate sentences involves multiple stages, which may include pre-processing, segmentation, feature selection, feature extraction, and classification; a common strategy for the paragraph-generation part is sketched below. Segmentation of the regions of interest is another challenging task because some medical imaging modalities, like ultrasound, contain a lot of noise [2], so identifying regions possessing an abnormality is difficult. A further challenge is the limited availability of quality datasets in the medical field. Researchers have developed datasets of medical images to advance research in this field, including IU Chest X-Ray [3], Chest X-Ray14 [4], PEIR Gross [5], BCIDR [6], CheXpert [7], MIMIC-CXR [8], PadChest [9] and ImageCLEFcaption [10], [11].
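One strategy used in the reviewed literature for moving from single sentences to full paragraphs is a hierarchical decoder: a sentence-level RNN emits one topic vector per sentence, and a word-level RNN unrolls each topic into a word sequence. The following is a minimal sketch assuming PyTorch; the class name HierarchicalDecoder and all dimensions are illustrative, not drawn from any specific reviewed paper.

    import torch
    import torch.nn as nn

    class HierarchicalDecoder(nn.Module):
        def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=5000,
                     max_sentences=6, max_words=20):
            super().__init__()
            self.max_sentences, self.max_words = max_sentences, max_words
            # Sentence-level RNN: one step per sentence, emits a topic vector.
            self.sent_rnn = nn.LSTMCell(feat_dim, hid_dim)
            self.topic = nn.Linear(hid_dim, hid_dim)
            # Word-level RNN: unrolls each topic vector into a word sequence.
            self.word_rnn = nn.LSTMCell(hid_dim, hid_dim)
            self.embed = nn.Embedding(vocab_size, hid_dim)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, img_feat):
            # img_feat: (batch, feat_dim) global image feature from a CNN encoder.
            b, hid = img_feat.size(0), self.sent_rnn.hidden_size
            h_s = img_feat.new_zeros(b, hid)
            c_s = img_feat.new_zeros(b, hid)
            sentences = []
            for _ in range(self.max_sentences):
                h_s, c_s = self.sent_rnn(img_feat, (h_s, c_s))
                topic = torch.tanh(self.topic(h_s))      # sentence topic vector
                h_w, c_w = topic, torch.zeros_like(topic)
                inp, words = topic, []
                for _ in range(self.max_words):
                    h_w, c_w = self.word_rnn(inp, (h_w, c_w))
                    logits = self.out(h_w)               # (batch, vocab)
                    words.append(logits)
                    # Greedy unrolling: feed the predicted word's embedding back.
                    inp = self.embed(logits.argmax(dim=-1))
                sentences.append(torch.stack(words, dim=1))
            return torch.stack(sentences, dim=1)  # (b, sentences, words, vocab)

    feat = torch.randn(2, 2048)               # dummy CNN feature
    print(HierarchicalDecoder()(feat).shape)  # torch.Size([2, 6, 20, 5000])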

Image captions can be generated using several approaches, broadly classified into three categories, as shown in Fig. 2.

In the template-based method, objects and attributes are first detected, and captions are then generated following specified grammar rules and constraints, or through sentence templates. The generated captions are short and grammatically correct, but the disadvantage of this approach is that they are hard-coded, with no variety or flexibility. The second approach is retrieval-based: images similar to the input image are retrieved from the dataset along with their captions, and the caption generated for the input image is either the caption of the most similar retrieved image or a combination of several candidate captions. The third approach relies on deep-learning (DL) based neural networks (NN) to generate captions automatically; here the network is trained end to end to map images to captions. Template-based and retrieval-based approaches represent early work; nowadays deep neural networks are the state of the art and are widely used in medical image description generation. The network architectures used for this purpose include encoder-decoder frameworks, fully connected networks, and convolutional networks. The encoder is a convolutional neural network (CNN) that extracts visual features from images in a hierarchical manner and can be trained directly on the application dataset at hand or used as a pre-trained model such as VGGNet [12], ResNet [13], or Inception-V3 [14]. The decoder is a language-generating module, a recurrent neural network (RNN) [15] or a variant such as the gated recurrent unit (GRU) [16] or long short-term memory (LSTM) [17], that generates the natural language captions. Recently, an attention mechanism has been introduced between the encoder and decoder to give importance to the salient parts of the image corresponding to which captions are generated.
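To make the encoder-decoder framework concrete, below is a minimal training-time sketch assuming PyTorch and torchvision (0.13 or later for the weights argument). The class names, dimensions, and the choice of ResNet-50 are illustrative; the decoder is trained with teacher forcing, with the image embedding prepended as the first input step.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        def __init__(self, embed_dim=256):
            super().__init__()
            # Pretrained ResNet-50 (ImageNet weights downloaded on first use).
            resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
            # Drop the classification head; keep the convolutional extractor.
            self.backbone = nn.Sequential(*list(resnet.children())[:-1])
            self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

        def forward(self, images):                    # (batch, 3, 224, 224)
            feats = self.backbone(images).flatten(1)  # (batch, 2048)
            return self.fc(feats)                     # (batch, embed_dim)

    class DecoderRNN(nn.Module):
        def __init__(self, embed_dim=256, hid_dim=512, vocab_size=5000):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, img_embed, captions):
            # Prepend the image embedding as the first "token" (teacher forcing).
            tokens = self.embed(captions)                        # (b, T, E)
            inputs = torch.cat([img_embed.unsqueeze(1), tokens], dim=1)
            hidden, _ = self.lstm(inputs)                        # (b, T+1, H)
            return self.out(hidden)                              # word logits

    # Shape check with dummy data.
    enc, dec = EncoderCNN(), DecoderRNN()
    imgs = torch.randn(2, 3, 224, 224)
    caps = torch.randint(0, 5000, (2, 15))
    print(dec(enc(imgs), caps).shape)  # torch.Size([2, 16, 5000])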

Classic machine learning has not been employed in medical image captioning because in ML only a limited number of features are extracted manually, which is a difficult task and not sufficient to produce good results. Medical images, on the other hand, are very complex, and DL-based techniques can handle the challenges and complexities that arise when generating medical image captions. Consequently, in the last three years, DL-based medical image captioning has gained a lot of attention and many papers have been published in this area. Still, using DL for image captioning has its own problems. For example, DL-based models (i.e., CNNs) require a large amount of training data to avoid overfitting and to improve the generalizability of the model. Due to the scarcity of large-scale publicly available datasets, however, it is challenging to train new deep models from scratch; transfer-learning-based approaches come to the rescue here. Secondly, language models such as LSTMs consume high computational power and training time because of their sequential nature, and also suffer from the vanishing gradient issue.
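As a concrete illustration of this transfer-learning workaround, the sketch below (assuming PyTorch/torchvision; DenseNet-121 and the 14-label head are illustrative choices, the latter echoing the 14 ChestX-ray14 labels) reuses ImageNet weights and trains only a small new head on the limited medical data.

    import torch.nn as nn
    import torchvision.models as models

    # Load ImageNet-pretrained weights (downloaded on first use).
    model = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False          # freeze the pretrained backbone
    # Replace the classifier with a task-specific head; only it is trained.
    model.classifier = nn.Linear(model.classifier.in_features, 14)

    trainable = [p for p in model.parameters() if p.requires_grad]
    print(sum(p.numel() for p in trainable), "trainable parameters")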

To the best of our knowledge, only one survey [18] has been published on this topic. Although it presents a good literature survey, it is not well structured or coherent. In our work, we aim to provide a comprehensive and structured review of automatic captioning for medical images generated using different imaging modalities. Our major focus is deep-learning-based approaches, with a minor focus on retrieval-based methods that use deep neural networks to generate medical image captions. The rest of the paper is organized as follows. Section 2 describes tasks in medical image analysis using deep learning. Section 3 summarizes our study methodology. Section 4 briefly introduces publicly available datasets used for medical image captioning. Section 5 details the evaluation measures used for deep-learning-based image captioning. Section 6 categorizes medical image captioning methods in different ways. Section 7 compares the reviewed methods on the datasets used by researchers. Our findings and some potential future directions are discussed in Section 8. Finally, conclusions and future work are presented in Section 9.

Section snippets

Deep learning in medical imaging

A number of deep learning (DL) methods are being used to perform various medical image analysis tasks [19]. Researchers are also experimenting with methods designed for other tasks for medical image description generation. In addition, these methods are being used for medical video captioning, enhancing the resolution of 2D and 3D medical images, medical image generation, data completion, discovering patterns, removing obstructing objects in a medical image, and normalizing a…

Study methodology

In order to collect the state of the art on our topic, we explored different research search engines, conferences, and high-quality journals. The search engines include IEEE Xplore, RefSeek, Virtual LRC, ACM Digital Library, Scinapse, Google Scholar, Elsevier ScienceDirect, and SpringerLink. Conference and journal sources include Pattern Recognition, the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Access, the Journal of the…

Datasets

Datasets for medical image captioning consist of medical images and corresponding descriptions. These descriptions may comprise a single sentence or multiple sentences in the form of a medical report. Only a limited number of datasets for medical image captioning are publicly available. These include IU Chest X-Ray [3], Chest X-Ray14 [4], PEIR Gross [5], BCIDR [6], CheXpert [7], MIMIC-CXR [8], PadChest [9] and ImageCLEFcaption [10], [11], which are described in detail in the following text.
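Regardless of the specific dataset, such pairings are typically consumed as (image, report) tuples. Below is a minimal loading sketch assuming PyTorch and a hypothetical CSV index with image_path and report columns; actual layouts differ per dataset, so this is illustrative only.

    import csv
    from PIL import Image
    from torch.utils.data import Dataset

    class ImageReportDataset(Dataset):
        """Pairs each medical image with its textual description."""
        def __init__(self, csv_file, transform=None):
            with open(csv_file, newline="") as f:
                self.rows = list(csv.DictReader(f))
            self.transform = transform

        def __len__(self):
            return len(self.rows)

        def __getitem__(self, idx):
            row = self.rows[idx]
            image = Image.open(row["image_path"]).convert("RGB")
            if self.transform:
                image = self.transform(image)
            return image, row["report"]  # one sentence or a full paragraph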

Evaluation measures

Captioning methods are difficult to evaluate because of the complexity of their output. Captions generated by different methods can be evaluated intuitively through extensive human judgment. At the same time, this requires a lot of human effort, making the evaluation process expensive and difficult to scale up. It also suffers from user variance because human judgment is mostly subjective. However, it is also necessary to gauge the quality of automatically produced…
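As a concrete example of the automatic measures discussed in this section, the snippet below computes BLEU-1 through BLEU-4 with NLTK; the reference and candidate sentences are invented purely for illustration.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["the heart size is within normal limits".split()]
    candidate = "heart size is normal".split()
    smooth = SmoothingFunction().method1  # avoids zero scores on short texts
    for n in range(1, 5):
        weights = tuple([1.0 / n] * n)    # uniform n-gram weights
        score = sentence_bleu(reference, candidate, weights=weights,
                              smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.3f}")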

Deep learning based medical image caption generation

Recently, natural-language image captioning using deep learning networks [61], [62] has achieved great success. This has motivated researchers to use deep learning methods for medical image captioning. Most of the existing literature uses an encoder-decoder architecture in which a CNN extracts features from images and encodes them into fixed-length vector representations. These representations are fed into a decoder RNN that generates a sequence of words from them. LSTM…
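Many of the reviewed methods insert an attention step between this encoder and decoder. The sketch below, assuming PyTorch, shows one common form (additive soft attention over spatial CNN features); the class name and dimensions are illustrative, not taken from any specific paper.

    import torch
    import torch.nn as nn

    class SoftAttention(nn.Module):
        def __init__(self, feat_dim=2048, hid_dim=512, attn_dim=256):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, attn_dim)
            self.hid_proj = nn.Linear(hid_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)

        def forward(self, feats, hidden):
            # feats: (batch, regions, feat_dim) spatial CNN features
            # hidden: (batch, hid_dim) current decoder state
            e = self.score(torch.tanh(
                self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
            alpha = torch.softmax(e.squeeze(-1), dim=1)    # (batch, regions)
            context = (alpha.unsqueeze(-1) * feats).sum(dim=1)
            return context, alpha   # weighted feature and attention weights

    feats = torch.randn(2, 49, 2048)   # e.g. a 7x7 feature map, flattened
    hidden = torch.randn(2, 512)
    ctx, alpha = SoftAttention()(feats, hidden)
    print(ctx.shape, alpha.shape)      # (2, 2048) (2, 49)

At each decoding step, the context vector is concatenated with (or fed in place of) the word embedding, so the decoder attends to different image regions as it emits different words.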

Comparison of state-of-the-art captioning methods

While we performed no experiments for formal evaluation, we offer an analysis of the results and performance as reported by the different methods reviewed in this study. Different DL-based captioning techniques are compared across datasets and evaluation measures in Table 7. Among encoder-decoder based approaches, Shin et al. [24] achieved a higher BLEU score on the IU Chest X-Ray dataset than other models applying the same method on ImageCLEFcaption. Results on IU Chest X-Ray were further outperformed in…

Findings, limitations and potential future directions

Writing a medical report in textual form may be error-prone, time-consuming, and laborious, and medical professionals have to examine many medical images per day. Automatic report generation using AI techniques can give better results. Researchers are presenting different deep-learning-based models for this purpose, among which attention-based caption generation is considered better. Different imaging modalities are used in medicine, such as X-rays, CT scans, magnetic resonance…

Conclusion and future work

In this work, we attempted to provide an organized reference for people interested in deep-learning-based report generation from medical images. We tried to thoroughly review all recently published research in this problem domain and present it in a detailed and structured manner. We observed that a lot of work uses a simple encoder-decoder framework, whereas attention-based captioning is being adopted widely. Moreover, different emerging models for medical image captioning…

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (71)

  • D. Demner-Fushman et al., Preparing a collection of radiology examinations for distribution and retrieval, J. Am. Med. Informatics Assoc. (2016)
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R.M. Summers, ChestX-ray8: hospital-scale chest X-ray database and...
  • B. Jing et al., On the automatic generation of medical imaging reports, ACL 2018 - 56th Annu. Meet. Assoc. Comput. Linguist., Proc. Conf. (Long Papers) (2018)
  • Z. Zhang et al., MDNet: a semantically and visually interpretable medical image diagnosis network
  • J. Irvin et al., CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison, Proc. AAAI Conf. Artif. Intell. (2019)
  • A.E.W. Johnson et al., MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports, Sci. Data (2019)
  • A. Bustos, A. Pertusa, J.-M. Salinas, M. de la Iglesia-Vayá, PadChest: a large chest x-ray image dataset with...
  • C. Eickhoff et al., Overview of ImageCLEFcaption 2017 - image caption prediction and concept detection for biomedical images, CEUR Workshop Proc. (2017)
  • A.G. Seco De Herrera et al., Overview of the ImageCLEF 2018 caption prediction tasks
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition
  • K. He et al., Deep residual learning for image recognition
  • C. Szegedy et al., Going deeper with convolutions
  • J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, ...
  • K. Cho et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation
  • S. Hochreiter et al., Long short-term memory, Neural Comput. (1997)
  • J. Pavlopoulos et al., A survey on biomedical image captioning (2019)
  • M. Tariq et al., Medical image based breast cancer diagnosis: state of the art and future directions, Expert Syst. Appl. (2020)
  • P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, M.P. ...
  • P. Rajpurkar et al., Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists, PLoS Med. (2018)
  • H.-C. Shin et al., Interleaved text/image deep mining on a large-scale radiology database for automated image interpretation, J. Mach. Learn. Res. (2016)
  • W. Gale, L. Oakden-Rayner, G. Carneiro, A.P. Bradley, L.J. Palmer, Producing radiologist-quality reports for...
  • Y. Cai et al., Medical image retrieval based on convolutional neural network and supervised hashing, IEEE Access (2019)
  • A. Ben Abacha et al., NLM at ImageCLEF 2017 caption task
  • Y. Zhang et al., ImageSem at ImageCLEF 2018 caption task: image retrieval and transfer learning
  • S.S. Azam, M. Raju, V. Pagidimarri, V. Kasivajjala, Q-Map: clinical concept mining from clinical documents, ...

    Hareem Ayesha completed her undergraduate degree at the Department of Computer Science, Bahauddin Zakariya University, Multan, in 2018. At present she is pursuing her Master of Science in Computer Science at the same department. Her research interests include artificial intelligence, computer vision, and medical image analysis.

    Sajid Iqbal completed his BSCS at the Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan, in 2002, and his Master of Science in Computer Science at the National University of Computer and Emerging Sciences, Lahore, Pakistan, in 2003. He has since been associated with different institutes of higher education and has taught courses at the undergraduate and graduate levels. He completed his Ph.D. at the University of Engineering and Technology, Lahore, Pakistan. Currently he is working as an Assistant Professor at the Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan. He has published more than 15 research papers in well-reputed international journals. His research areas include deep learning, medical image analysis, and natural language processing.

    Mehreen Tariq completed her undergraduate degree at the Department of Computer Science, Bahauddin Zakariya University, Multan, in 2018. Currently she is enrolled in the Master of Science in Computer Science program as a research student. She is interested in artificial intelligence as her research domain; subfields of her research interest include medical image processing and computer vision.

    Muhammad Abrar completed his BSIT at the Department of Computer Science, University of Education, Multan, Pakistan. Currently he is enrolled in the Master of Science in Computer Science program at the Department of Computer Science, Nawaz Sharif Agriculture University, Multan, Pakistan. His research interests include image processing, machine learning, and artificial intelligence in general.

    Muhammad Sanaullah is working as an Assistant Professor in the Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan. His current research focuses on the use of Semantic Web technologies in machine learning, data mining, and IoT.

    Ishaq Abbas completed his undergraduate degree at Bahauddin Zakariya University, Multan, in 2018. At present he is pursuing his research-based graduate program (Master of Science in Computer Science) at the Department of Computer Science, Bahauddin Zakariya University, Multan. His areas of research interest are deep learning and computer vision.

    Amjad Rehman received his Ph.D. degree in image processing and pattern recognition from Universiti Teknologi Malaysia in 2010. During his Ph.D., he proposed novel techniques for pattern recognition based on novel feature mining strategies. He is currently supervising three Ph.D. students and is the author of dozens of papers published in international journals and conferences of high repute. His research interests include information security, data mining, and document analysis and recognition.

    Muhammad Farooq Khan Niazi completed his MBBS in 1985 and has served in multiple health care units and institutes. He completed his FCPS fellowship in 1999 and has more than 24 years of practical experience in the field of diagnostic radiology. He has also served as a senior faculty member at the College of Physicians and Surgeons Pakistan. Presently, he is serving as professor of radiology and head of department at Bakhtawar Amin Memorial Trust Hospital, Multan, Pakistan.
