Automatic medical image interpretation: State of the art and future directions
Introduction
Automatic image caption generation is the task of extracting the contents of an image through feature extraction techniques and describing those contents in natural language sentences using natural language processing (NLP). Image captioning thus combines two fields of artificial intelligence: computer vision, to extract visual representations, and natural language processing, to express those representations in simple English sentences. It is a challenging task that goes beyond object detection, segmentation, and classification, because it also requires understanding the relationships between the different objects in an image and the actions they perform, and converting these visual representations into English sentences. With the availability of large datasets, machine learning based approaches to image captioning are gaining popularity day by day. Image captioning is helpful in many tasks, such as assisting visually impaired persons, information retrieval, early childhood learning, and enabling human-like natural interaction with robots; in the medical imaging field, however, the topic has yet to gain popularity because this field has its own problems.
In the medical sector, the use of medical images is ubiquitous: medical professionals and radiologists use them for diagnosing and treating diseases, pharmacists may use them for drug discovery, and surgeons may use imaging before, during, and after an operation to monitor the treatment process. Competent medical professionals examine these medical images and manually write textual reports containing their findings (normal, abnormal, and potentially abnormal) as full paragraph descriptions, as shown in Fig. 1.
For inexperienced or less experienced examiners, writing a medical report in textual form may be error-prone because it requires a deep understanding of the disease, of medical imaging, and a thorough analysis of the images under consideration. For experienced medical professionals, the task is time-consuming and laborious: it takes at least half an hour to examine an image and write the findings as a report, and many medical images must be examined per day. So, it is unpleasant and tiresome work for both experienced and inexperienced medical professionals. Due to the shortage of medical professionals in a country like Pakistan, which has a huge population, the workload increases tremendously, and in areas short of medical treatment facilities, the proportion of wrong diagnoses is higher [1]. Moreover, in a low-income country like Pakistan, patients bear the extra expense of visiting doctors again to have their problems and prescriptions explained in the form of a report.
To facilitate the medical image reporting process, many computer-aided report generation systems based on image captioning have been proposed that automatically extract findings from medical images and generate a textual report containing fine-grained information, like an expert doctor. This saves the time doctors spend manually extracting features from images and then writing a textual report, reducing their workload. Moreover, it reduces the need for extra professionals to write reports, since the whole report generation process is automatic and efficient.
The generated reports can serve many potential users. Radiologists can use them for cross-checking, for monitoring subtle changes, and in final decision making. They can provide a second opinion for peer doctors, and give technologists immediate, first-hand information about the image under consideration. They can also be used in emergencies when expert doctors are not available at the moment, serving as a context for further treatment without waiting for an expert doctor to generate the report.
However, this task also presents a variety of challenges. In generating medical reports, instead of a single-sentence caption we have to generate a large paragraph, which is non-trivial. Moreover, a medical report contains heterogeneous information (Fig. 1), for example, a text description (where radiologists narrate their observations), an impression (where a diagnosis is provided), a comparison, and a list of tags (keywords from the findings containing critical information). Using all this heterogeneous information to generate sentences involves multiple stages, which may include pre-processing, segmentation, feature selection, feature extraction, and classification. Segmenting the different regions of interest is another challenging task because some medical imaging modalities, like ultrasound, contain a lot of noise [2], making it difficult to identify regions possessing abnormalities. A further challenge is the limited availability of quality datasets in the medical field. Researchers have developed datasets of medical images to advance research in this field, including IU Chest X-Ray [3], Chest X-Ray14 [4], PEIR Gross [5], BCIDR [6], CheXpert [7], MIMIC-CXR [8], PadChest [9] and ImageCLEFcaption [10], [11].
Image captions can be generated using several approaches that can be broadly classified into three categories as in Fig. 2.
In the template-based method, objects and attributes are first detected, and captions are then generated following specified grammar rules and constraints, or through sentence templates. The generated captions are short and grammatically correct, but the disadvantage of this approach is that they are hard-coded, with no variety or flexibility. The second approach is retrieval-based: images similar to the input image are retrieved from the dataset along with their captions, and the new caption of the input image is either the caption of the most similar retrieved image or a combination of several candidate captions. The third approach relies on deep learning (DL) based neural networks (NN) to generate captions automatically; here the network is trained end-to-end to map images to captions. Template-based and retrieval-based approaches represent early work; nowadays, deep neural networks are the state of the art and are widely used for medical image description generation. The network architectures used for this purpose include the encoder-decoder framework, fully connected networks, and convolutional networks. The encoder is a convolutional neural network (CNN) that extracts visual features from images in a hierarchical manner; it can be trained directly on the application dataset at hand or used as a pre-trained model such as VGGNet [12], ResNet [13], or Inception-V3 [14]. The decoder is a language generation module, a recurrent neural network (RNN) [15] or one of its variants such as the gated recurrent unit (GRU) [16] or long short-term memory (LSTM) [17], that generates natural language captions. Recently, an attention mechanism has been introduced between the encoder and decoder, which emphasizes the salient parts of the image to which the generated caption words correspond.
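The retrieval-based approach described above can be sketched in a few lines of pure Python. The feature vectors and captions below are hypothetical stand-ins for CNN features and real report sentences; a real system would compare learned embeddings over a large dataset:

```python
import math

# Hypothetical dataset of (feature vector, caption) pairs; in practice the
# vectors would be CNN features extracted from the training images.
dataset = [
    ([0.9, 0.1, 0.0], "the lungs are clear with no acute findings"),
    ([0.1, 0.8, 0.3], "there is a small right pleural effusion"),
    ([0.0, 0.2, 0.9], "cardiomegaly is present without edema"),
]

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_caption(query_features):
    # Reuse the caption of the most similar dataset image as the new caption.
    _, best_caption = max(dataset, key=lambda item: cosine(query_features, item[0]))
    return best_caption

print(retrieve_caption([0.85, 0.15, 0.05]))
# -> the lungs are clear with no acute findings
```

A variant of this approach, also mentioned above, combines the captions of the top-k retrieved neighbours instead of copying a single one.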
Classic machine learning (ML) has not been employed for medical image captioning because ML relies on a limited number of manually extracted features, which is difficult and not expressive enough to produce good results. Medical images, on the other hand, are very complex, and DL-based techniques can handle the challenges and complexities that arise during the generation of medical image captions. Consequently, in the last three years, DL-based medical image captioning has gained a lot of attention and many papers have been published in this area. Still, using DL for image captioning has some problems. For example, DL-based models (i.e. CNNs) require a large amount of training data to avoid overfitting and to improve the generalizability of the model; due to the scarcity of large-scale publicly available datasets, it is challenging to train new deep models from scratch, and transfer learning-based approaches come to the rescue here. Secondly, language models such as LSTMs consume high computation power and training time because of their sequential nature, and also suffer from the vanishing gradient issue.
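The transfer-learning remedy mentioned above can be illustrated with a minimal, self-contained sketch: a pre-trained feature extractor is kept frozen and only a small classification head is trained on the scarce target data. Here the "frozen encoder" is simulated by a fixed random projection and the dataset is synthetic; a real system would reuse CNN layers pre-trained on a large corpus such as ImageNet:

```python
import math
import random

random.seed(0)

DIM_IN, DIM_FEAT = 8, 4
# Stand-in for a frozen, pre-trained encoder: a fixed random projection.
# It is never updated during training below.
frozen_encoder = [[random.uniform(-1, 1) for _ in range(DIM_IN)]
                  for _ in range(DIM_FEAT)]

def encode(x):
    # Frozen feature extraction: matrix-vector product, no learning here.
    return [sum(w * xi for w, xi in zip(row, x)) for row in frozen_encoder]

# Small synthetic dataset (e.g. "normal" vs "abnormal" image vectors).
data = [([random.uniform(-1, 1) for _ in range(DIM_IN)], random.randint(0, 1))
        for _ in range(40)]

# Trainable head: logistic regression on top of the frozen features.
head_w, head_b = [0.0] * DIM_FEAT, 0.0

def predict(x):
    z = sum(w * f for w, f in zip(head_w, encode(x))) + head_b
    z = max(-30.0, min(30.0, z))          # numerical safety for exp/log
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss():
    # Average cross-entropy over the dataset.
    return -sum(y * math.log(predict(x)) + (1 - y) * math.log(1 - predict(x))
                for x, y in data) / len(data)

loss_before = mean_loss()
for _ in range(200):                      # gradient descent on the head only
    for x, y in data:
        feats, err = encode(x), predict(x) - y
        for i in range(DIM_FEAT):
            head_w[i] -= 0.1 * err * feats[i]
        head_b -= 0.1 * err
loss_after = mean_loss()
print(loss_after < loss_before)  # the trained head fits better than the untrained one
```

Because only the small head is updated, far fewer labeled examples are needed than when training the whole network from scratch, which is exactly the appeal of transfer learning under data scarcity.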
To the best of our knowledge, only one survey paper [18] has been published on this topic. Although that paper presents a good literature survey, it is not structured and coherent. In our work, we aim to provide a comprehensive and structured review of automatic captioning for medical images produced using different imaging modalities. Our major focus is deep learning-based approaches, with a minor focus on retrieval-based methods that use deep neural networks to generate medical image captions. The rest of the paper is organized as follows: Section 2 describes some tasks in medical image analysis using deep learning. Section 3 provides a summary of our study methodology. Section 4 gives a brief introduction to the publicly available datasets used for medical image captioning. Section 5 provides details about the evaluation measures used for deep learning-based image captioning. Section 6 categorizes medical image captioning methods in different ways. Section 7 compares the reviewed methods on the different datasets used by researchers. Our findings from the reviewed studies and some potential future directions are discussed in Section 8. Finally, the conclusion and our future work are described in Section 9.
Deep learning in medical imaging
A number of deep learning (DL) methods are being used to perform various medical image analysis tasks [19]. Researchers are also experimenting with methods designed for other tasks for medical image description generation. In addition, these methods are being used for medical video captioning, enhancing the resolution of 2D and 3D medical images, medical image generation, data completion, discovering patterns, removing obstructing objects in a medical image, and normalizing a
Study methodology
To collect the state of the work on our topic, we explored different research search engines, conferences, and high-quality journals. The search engines include IEEE Xplore, RefSeek, Virtual LRC, ACM Digital Library, Scinapse, Google Scholar, Elsevier ScienceDirect, and Springer Link. Other sources, such as conferences and journals, include Pattern Recognition, the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Access, Journal of the
Datasets
Datasets for medical image captioning consist of medical images and corresponding descriptions, which may comprise a single sentence or multiple sentences in the form of a medical report. Only a limited number of such datasets are publicly available. These include IU Chest X-Ray [3], Chest X-Ray14 [4], PEIR Gross [5], BCIDR [6], CheXpert [7], MIMIC-CXR [8], PadChest [9] and ImageCLEFcaption [10], [11], which are described in detail in the following text.
Evaluation measures
The complex outputs of caption generation methods are difficult to evaluate. Captions generated by different captioning methods can be evaluated intuitively through extensive human judgment; at the same time, this requires a lot of human effort, making the evaluation process expensive and difficult to scale up. It also suffers from user variance because human judgment is mostly subjective. However, it is also necessary to gauge the quality of automatically produced
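As a concrete example of such automatic measures, a simplified BLEU-1 score (clipped unigram precision times a brevity penalty; full BLEU also averages 2- to 4-gram precisions and supports multiple references) can be computed as follows. The report sentences are hypothetical:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Simplified BLEU-1 for a single reference: clipped unigram precision
    multiplied by a brevity penalty for overly short candidates."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clipped matches: each candidate word counts at most as many times as
    # it appears in the reference, preventing inflation by repeated words.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c), penalizing short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("no acute cardiopulmonary abnormality",
            "no acute cardiopulmonary abnormality"))  # -> 1.0
```

A partially matching, shorter candidate such as "no acute disease" against the same reference scores about 0.48: two of its three words match (precision 2/3) and the brevity penalty exp(1 - 4/3) further discounts it.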
Deep learning based medical image caption generation
Recently, natural-language image captioning using deep learning networks [61], [62] has achieved great success. This has motivated researchers to use deep learning methods for medical image captioning. Most of the existing literature uses an encoder-decoder architecture in which a CNN extracts features from images and encodes them into fixed-length vector representations. These representations are fed into a decoder RNN that generates the sequence of words from them. LSTM
Comparison of state-of-the-art captioning methods
While we performed no experiments for formal evaluation, we offer an analysis of the results and performance as reported by the different methods reviewed in this study. Different DL-based captioning techniques are compared on the datasets, using the evaluation measures, in Table 7. Among encoder-decoder based approaches, Shin et al. [24] achieved a higher BLEU score on the IU Chest X-Ray dataset than other models applying the same method on ImageCLEFcaption. Results on IU Chest X-Ray were further outperformed in
Findings, limitations and potential future directions
Writing a medical report in textual form may be error-prone, time-consuming, and laborious, as medical professionals have to examine many medical images per day; automatic report generation using AI techniques can give better results. Researchers are presenting different deep learning-based models for this purpose, among which attention-based caption generation is considered the best. Different imaging modalities are used in medicine, such as X-rays, CT scans, magnetic resonance
Conclusion and future work
In this work, we attempted to provide an organized reference for people interested in medical report generation from medical images using deep learning. We tried to review every recently published study in this problem domain thoroughly and to present it in a detailed and structured manner. We observed that a lot of work uses a simple encoder-decoder framework, while attention-based captioning is being used more widely. Moreover, different emerging models for medical image captioning
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (71)
- et al., A survey on deep learning in medical image analysis, Med. Image Anal. (2017)
- et al., Multi-view multi-scale CNNs for lung nodule type classification from CT images, Pattern Recognit. (2018)
- et al., An improved deep learning approach for detection of thyroid papillary cancer in ultrasound images, Sci. Rep. (2018)
- et al., Detection and classification of cancer in whole slide breast histopathology images using deep convolutional networks, Pattern Recognit. (2018)
- et al., Abnormality detection in retinal image by individualized background learning, Pattern Recognit. (2020)
- et al., Automated pulmonary nodule detection in CT images using deep convolutional neural networks, Pattern Recognit. (2019)
- et al., Medical image retrieval using deep convolutional neural network, Neurocomputing (2017)
- et al., Computer-aided diagnosis of mammographic masses based on a supervised content-based image retrieval approach, Pattern Recognit. (2017)
- et al., Discrepancy and error in radiology: concepts, causes and consequences, Ulster Med. J. (2012)
- et al., Deep learning for ultrasound image caption generation based on object detection, Neurocomputing (2019)
- Preparing a collection of radiology examinations for distribution and retrieval, J. Am. Med. Informatics Assoc.
- On the automatic generation of medical imaging reports, Proc. 56th Annu. Meet. Assoc. Comput. Linguist. (ACL 2018, Long Papers)
- MDNet: a semantically and visually interpretable medical image diagnosis network
- CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison, Proc. AAAI Conf. Artif. Intell.
- MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports, Sci. Data
- Overview of ImageCLEFcaption 2017 - image caption prediction and concept detection for biomedical images, CEUR Workshop Proc.
- Overview of the ImageCLEF 2018 caption prediction tasks
- Very deep convolutional networks for large-scale image recognition
- Deep residual learning for image recognition
- Going deeper with convolutions
- Learning phrase representations using RNN encoder-decoder for statistical machine translation
- Long short-term memory, Neural Comput.
- A survey on biomedical image captioning
- Medical image based breast cancer diagnosis: state of the art and future directions, Expert Syst. Appl.
- Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists, PLoS Med.
- Interleaved text/image deep mining on a large-scale radiology database for automated image interpretation, J. Mach. Learn. Res.
- Medical image retrieval based on convolutional neural network and supervised hashing, IEEE Access
- NLM at ImageCLEF 2017 caption task
- ImageSem at ImageCLEF 2018 caption task: image retrieval and transfer learning
Hareem Ayesha completed her undergraduate degree at the Department of Computer Science, Bahauddin Zakariya University, Multan, in 2018. At present she is pursuing her Master of Science in Computer Science at the same department. Her research interests include artificial intelligence, computer vision, and medical image analysis.
Sajid Iqbal completed his BSCS at the Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan, in 2002, and his Master of Science in Computer Science at the National University of Computer and Emerging Sciences, Lahore, Pakistan, in 2003. He has since been associated with different institutes of higher education and has taught courses at the undergraduate and graduate levels. He completed his Ph.D. at the University of Engineering and Technology, Lahore, Pakistan. Currently he is working as an Assistant Professor at the Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan. He has published more than 15 research papers in well-reputed international journals. His research areas include deep learning, medical image analysis, and natural language processing.
Mehreen Tariq completed her undergraduate degree at the Department of Computer Science, Bahauddin Zakariya University, Multan, in 2018. Currently she is enrolled in the Master of Science in Computer Science program as a research student. Her research domain is artificial intelligence; subfields of her interest include medical image processing and computer vision.
Muhammad Abrar completed his BSIT at the Department of Computer Science, University of Education, Multan, Pakistan. Currently he is enrolled in the Master of Science in Computer Science program at the Department of Computer Science, Nawaz Sharif Agriculture University, Multan, Pakistan. His research interests include image processing, machine learning, and artificial intelligence in general.
Muhammad Sanaullah has been working as an Assistant Professor in the Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan. His current research focuses on the use of Semantic Web technologies in machine learning, data mining, and IoT.
Ishaq Abbas completed his undergraduate degree at Bahauddin Zakariya University, Multan, in 2018. At present he is pursuing a research-based graduate program (Master of Science in Computer Science) at the Department of Computer Science, Bahauddin Zakariya University, Multan. His research interests are deep learning and computer vision.
Amjad Rehman received his Ph.D. degree in image processing and pattern recognition from Universiti Teknologi Malaysia, Malaysia, in 2010. During his Ph.D., he proposed novel techniques for pattern recognition based on novel feature mining strategies. He is currently supervising three Ph.D. students. He is the author of dozens of papers published in international journals and conferences of high repute. His research interests include information security, data mining, and document analysis and recognition.
Muhammad Farooq Khan Niazi completed his MBBS in 1985 and has served in multiple health care units and institutes. He completed his FCPS fellowship in 1999. He has more than 24 years of practice experience in the field of diagnostic radiology. He has also served as a senior faculty member at the College of Physicians and Surgeons Pakistan. Presently, he is serving as a professor of radiology and head of department at Bakhtawar Amin Memorial Trust Hospital, Multan, Pakistan.