Stacked auto-encoder based tagging with deep features for content-based medical image retrieval

https://doi.org/10.1016/j.eswa.2020.113693

Highlights

  • The proposed method provides an effective and efficient solution for highly unbalanced medical benchmark datasets.

  • This is the first study in which data imbalance is addressed using the feature vector at the output of the fully connected layer (FCL).

  • The reduced search space is used more effectively.

  • Converting high-level features into a few digits using an unsupervised stacked auto-encoder (sAE) considerably improves performance.

Abstract

Content-based medical image retrieval (CBMIR) is one of the most challenging and ambiguous tasks used to minimize the semantic gap between images and human queries in datasets with rich information content. Similar to the human visual saliency mechanism, CBMIR systems also use the visual features in images for searching purposes. As a result of this search process, automatically accessing images is very convenient in large and balanced datasets, but such datasets are generally not available in the medical domain. In this study, an effective four-step hash code generation technique is presented to reduce the semantic gap between low-level features and high-level semantics for unbalanced medical image datasets. In the first stage, a convolutional neural network (CNN) architecture, the most effective feature representation method available today, is employed to extract discriminative features from images automatically. The features obtained in the last fully connected layer (FCL) at the output of the CNN architecture are used for hash code generation. In the second stage, the imbalance between the classes in the dataset is reduced using the Synthetic Minority Over-sampling Technique (SMOTE). Resolving the imbalance problem increases performance by almost 3%. In the third stage, the balanced features are converted into a code of 13 symbols using a deep stacked auto-encoder (sAE). Finally, this code is translated into the standard 13-character labeling and retrieval code used by the 'Image Retrieval in Medical Applications' (IRMA) dataset, since this is the database with which the experiments were conducted. In terms of IRMA error, classification performance, and retrieval performance, the proposed method is more successful than other state-of-the-art methods.

Introduction

Modern imaging technologies enable multi-dimensional and parametric visualization of objects, thanks to today's technological breakthroughs. Imaging devices, which can now even fit in our pockets, have been used in the medical domain since their early days because of the advantages they offer (Bartels, Bibbo, Wied, & Bahr, 2016). The increasing dependence on these devices causes a massive increase in the volume of digital images. In 2010, an average of 120 medical images were recorded per second for mammography alone (Krupinski, 2010). In 2016, 38 million magnetic resonance imaging (MRI) scans and 79 million computed tomography (CT) scans were recorded (Papanicolas, Woskie, & Jha, 2018). Analyzing this massive number of medical images is very difficult and time consuming for expert doctors. The severity of the situation becomes clear when considering that in some regions there is almost one doctor per 11,000 patients (Pandey, Singh, Singh, & Kumar, 2019). In the medical domain, where early diagnosis is vital for many diseases and directly related to human health, all images should be scrutinized. However, in most cases, the number of expert doctors does not allow this examination to be carried out quickly. In addition, errors caused by the human factor should be minimized (Stoean, Pelka, Nensa, & Friedrich, 2018). To overcome these drawbacks, computer-aided diagnosis (CAD) systems have long been a highly effective solution (Chen et al., 2020, Sobrinho et al., 2020). CAD systems are broad-spectrum systems that include artificial intelligence techniques, used to reduce the workload of specialist doctors and support their decisions. They are employed in many applications such as medical image segmentation, classification, enhancement, pre-processing, post-processing, and retrieval.
Recently, retrieval-based CAD systems have attracted considerable attention (Das and Neelima, 2020, Owais et al., 2019). CBMIR methods not only produce a classification result for images but also retrieve and visualize them. Such systems also assign specific, descriptive numbers or characters to each image, allowing images to be quickly saved in the database (Shi, et al., 2018). Then, using these numbers or characters, all disease information about an image can be retrieved (Zhang, Liu, Dundar, Badve, & Zhang, 2015).

Retrieval systems provide many positive contributions, such as more organized datasets, the addition of new images to the dataset by tagging, quick access to images in the dataset, and a classification result for each image. These systems are divided into two categories: text-based image retrieval and content-based image retrieval. In the early years, the text-based image retrieval method was used, which was based on representing each image with one or more text labels (Das & Neelima, 2020). Producing this manual text information is time consuming, repetitive, and not always reliable, and it cannot be applied to unannotated or unlabeled images. In addition, it requires the experience and knowledge of an expert doctor (Rajaei, Dallalzadeh, & Rangarajan, 2013). The content-based image retrieval (CBIR) method, an automatic retrieval system based on image features, has been proposed to eliminate these drawbacks (Al-Mohamade, Bchir, & Ben Ismail, 2020). These features, which relate to information such as the shape, color, texture, and edges of the objects in the images, are usually extracted using hand-crafted feature extraction methods. The incompatibility between these low-level features and high-level image concepts causes a 'semantic gap'. This gap negatively affects overall system performance by causing ambiguity between the query image and the generated features (Wang, et al., 2020).

The performance of retrieval systems is critically dependent on feature representation and similarity measurement. Today, the CNN architecture is accepted as the most effective solution to image processing problems (Sengupta et al., 2020). A CNN automatically extracts features from images at multiple levels. This capability for multiple transformations and representations plays a significant role in CNN architectures' ability to solve complex functions effectively. Besides, it helps to close the semantic gap by extracting discriminative features from images (Wei et al., 2019). Effective CNN architectures such as AlexNet (Krizhevsky, Sutskever, & Hinton, 2017), VGGNet (Simonyan & Zisserman, 2014), ResNet (He, Zhang, Ren, & Sun, 2016), and InceptionNet (Szegedy et al., 2015) are frequently used in the literature to represent image features robustly. Current CBIR studies often use the features produced by these CNN architectures (Abdel-Nabi et al., 2019, Saritha et al., 2018, Sezavar et al., 2019, Siradjuddin et al., 2019). These discriminative features are usually converted into binary codes by processing them with a classifier or a second algorithm. In another approach, the vectors obtained in the FCL are used to generate the hash code (Shi et al., 2018).
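The FCL projection described above can be illustrated with a minimal pure-Python sketch (not from the paper; toy dimensions are used here, whereas the proposed method's FCL output is a 2000-component vector produced by a full deep learning framework):

```python
def fcl_forward(features, weights, bias):
    """One fully connected layer: each output component is a weighted sum
    of all (flattened) input features plus a bias -- the projection that
    yields the fixed-length vector later used for hash-code generation."""
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, bias)]

# Toy example: 3 input features projected down to a 2-component vector.
vec = fcl_forward([1.0, 2.0, 3.0], [[1, 0, 0], [0, 1, 0]], [0.5, -0.5])
# vec == [1.5, 1.5]
```

In a real system the weights come from CNN training; here they are arbitrary values chosen only to make the arithmetic visible.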

The layers of the CNN structure enable it to produce high performance in specific tasks such as classification, segmentation, and detection. All layers except the FCL are used to learn features and represent them more efficiently. The FCL turns these raw features into classifiable vectors and performs the classification (Zhou, 2020). Feature vectors created automatically by FCLs are especially important in creating the hashing functions used for image retrieval. For this reason, the importance of FCLs in the CNN architecture cannot be ignored. On the other hand, the raw feature vectors produced by FCLs cannot directly generate discrete hash codes; quantization is needed to generate these codes (Cao et al., 2017, Tang et al., 2018). Image retrieval using binary codes is much faster than direct matching and less costly in terms of storage. Direct use of feature vectors taken from the FCL is inefficient due to their high dimensionality. Auto-encoders (AE) are well suited to creating such codes. However, raw images are not suitable for AE training; for this reason, the feature vectors in the FCL are used as the AE input (Camlica, Tizhoosh, & Khalvati, 2015a). An AE is a special kind of neural network that can encode the features at its input and express them with fewer parameters at minimum error. As a result of the hash code generation process, the feature vector can be represented with fewer parameters but the same meaning. Thanks to this property, AEs are beneficial in retrieval systems (Zhu, Wang, Bai, Yao, & Bai, 2016).
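As a loose illustration of the quantization step mentioned above: the paper itself converts features with a stacked auto-encoder, but the general idea of turning real-valued encoder outputs into discrete digits can be sketched with a simple uniform quantizer (a hypothetical stand-in, not the paper's method):

```python
def quantize(code, levels=10):
    """Map each real-valued component of an encoder output to one of
    `levels` discrete digits via min-max scaling (uniform quantizer)."""
    lo, hi = min(code), max(code)
    span = (hi - lo) or 1.0  # guard against a constant vector
    return [min(int((c - lo) / span * levels), levels - 1) for c in code]

digits = quantize([0.0, 0.5, 1.0])
# digits == [0, 5, 9]
```

The discrete digits can then be stored and compared far more cheaply than the original floating-point vector, which is the motivation the paragraph above gives for quantization.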

CBMIR systems can be used as clinical decision, education, research, and counseling systems. The importance of these systems has led many researchers to study this subject from past to present. Looking at the development of CBMIR systems from a broad perspective, hand-crafted features were used in the past, while automatic feature extraction methods are used today (Mohd Zin et al., 2018). In early retrieval systems, low-level, single-level features such as the Fourier transform (Bueno, Chino, Traina, Traina, & Azevedo-Marques, 2002), Gabor filters (Gang & Zong-Min, 2007), wavelet-based systems (Quellec, Lamard, Cazuguel, Cochener, & Roux, 2010), invariant moments (Afifi & Ashour, 2012), and co-occurrence matrices (Kwak et al., 2002) were used. These features are extracted from the color, texture, edges, and shape of the image and are low level. Although the resulting retrieval performance was not satisfactory, it was a big leap compared to text-based image retrieval systems. In the following years, the bag-of-visual-words (BoVW) framework was preferred to avoid the drawbacks of relying on a single feature. BoVW can be defined as a codebook of visual words, created by collecting samples taken from multiple salient keypoints (Iakovidis et al., 2009). In this period, studies focused on capturing saliency points in images and describing their regions of interest. For this purpose, scale-invariant feature transform (SIFT) (Zhi, Zhang, Zhao, Zhao, & Lin, 2009), speeded up robust features (SURF) (Lee & Kim, 2014), local binary patterns (LBP) (Camlica, Tizhoosh, & Khalvati, 2015b), histogram of oriented gradients (HOG) (Vijendran & Kumar, 2015), and GIST (Rupali & Bhakti, 2017) algorithms were used in CBIR systems. Such hand-crafted feature generation algorithms are still used because of the large amounts of data that CNNs require (Ahn, Kumar, Fulham, Feng, & Kim, 2019).
It is still challenging to find labeled and balanced medical datasets, yet a large number of images is required to train a CNN architecture. While some CBMIR researchers look for alternative solutions to this problem, such as the Radon transform (Babaie, Tizhoosh, Khatami, & Shiri, 2017), other researchers rely on CNN architectures. The siamese network architecture has proven advantageous in works based on unsupervised CNN architectures (Spitzer, Kiwitz, Amunts, Harmeling, & Dickscheid, 2018). In addition, supervised CNN architectures have also achieved promising results (Cai, Li, Qiu, Ma, & Gao, 2019). The fact that CNN algorithms produced stunning results in almost all areas of image processing did not escape the attention of CBIR researchers, who worked to transfer these architectures to the CBMIR area. For this purpose, structures such as transfer learning, shallow CNNs, and hybrid CNNs have been used.

The training and test procedures of the proposed study are carried out using the well-known X-ray image dataset called IRMA (Huang et al., 2003, Lehmann et al., 2004, Lehmann et al., 2005). Each image in the IRMA dataset is represented by an IRMA code consisting of 13 characters. In this study, each symbol of the IRMA code is referred to as a 'character', and each symbol of the codes produced in all steps of our algorithm except the normalization step is referred to as a 'digit'. This is because the IRMA code consists of numbers and letters, whereas the outputs produced by the proposed architecture consist of floating-point numbers (stored as variables of type double). The images in the IRMA dataset, which has a very unbalanced class distribution, are accessed with these codes (Khatami, Babaie, Tizhoosh, et al., 2018). In addition, a unique error value called the IRMA error is calculated to measure performance (Ahn et al., 2019, Khatami et al., 2018, Sriram et al., 2019).
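The exact IRMA error definition (per-position branching factors and wildcard handling) is not given in this excerpt; the sketch below only illustrates its commonly described ideas, under stated assumptions: deeper code positions cost less, positions with more alternatives cost less, and a mistake at one position makes all deeper positions count as wrong. The branching values used here are hypothetical.

```python
def hier_code_error(pred, truth, branching):
    """Depth-weighted hierarchical code error in the spirit of the IRMA
    error measure.  `branching[i]` is the (assumed) number of possible
    symbols at position i; an error at depth i propagates downward."""
    err, wrong = 0.0, False
    for i, (p, t) in enumerate(zip(pred, truth), start=1):
        if p != t:
            wrong = True           # everything below this node is now wrong
        if wrong:
            err += (1.0 / branching[i - 1]) * (1.0 / i)
    return err

# A mistake at the first (shallowest) position is penalised most.
e = hier_code_error("xb", "ab", [5, 5])
# e == (1/5)*(1/1) + (1/5)*(1/2) == 0.3
```

A correct prediction yields an error of 0, and errors at shallow positions dominate, which matches the intuition behind hierarchy-aware code evaluation.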

Details of CBMIR studies and specific methods used for the IRMA dataset are examined in this section. Tang, Liu, & Liu (2017) proposed a multi-scale single-layer stacked AE (sAE) structure for the classification of IRMA images. Feature matrices were obtained with the convolution operator, and these matrices were coded with Fisher vector encodings. Kundu, Chowdhury, & Das (2017) extracted global shape features with the pulse coupled neural network (PCNN) model. In the second part, they obtained local features with the contourlet transform. Khatami et al. (2018) reduced the search space using parallel CNN architectures. LBP, HOG, and Radon transformations, which are local feature extraction models, followed this structure. In this way, they represented the features by narrowing the search space further. Ahn et al. (2019) proposed a convolutional sparse kernel network (CSKN) to learn discriminative features from unlabeled medical images. Shamna, Govindan, & Abdul Nazeer (2018) presented a BoVW model based on spatial matching of the visual words with location-based correlation. They also suggested a skip similarity index for retrieval from the generated codes. Khatami, Nazari, Khosravi, Lim, & Nahavandi (2020) proposed a new generalization model based on noise perturbation for the CNN model, adding additive noise to the weights of the convolution layers in each iteration. Ahn et al. (2016) used an architecture called late fusion of domain-transferred CNN with spatial pyramid features; the performance of their method was quite high, with a 159.2 IRMA error score. Tizhoosh carried out various studies on the use of Radon barcodes and Gabor barcodes on the IRMA dataset (Nouredanesh et al., 2016, Tizhoosh, 2015). He combined Radon barcodes with a CNN architecture (Liu, Tizhoosh, & Kofman, 2016), and used them in the encoded local projections method (Tizhoosh & Babaie, 2018), in the last part of the LeNet architecture (Khatami et al., 2017), and in the Projectron architecture (Sriram, et al., 2019).
Tang, Yang, & Xia (2017) proposed an IRMA dataset retrieval method combining a texton dictionary with the locality-constrained linear coding technique.

Several retrieval and tagging studies are available in the literature for the IRMA dataset. These studies are generally based on hand-crafted feature extraction algorithms (Camlica et al., 2015b). Hand-crafted feature extraction methods generally require experience and are prone to error in multi-organ datasets. Besides, they may be inadequate for producing discriminative features. For this reason, researchers have tended to extract features from raw IRMA images using CNNs. However, accessing labeled medical datasets is very difficult, and the number of images in these datasets is insufficient to train a CNN architecture. Researchers have used pre-trained CNN architectures to avoid this problem, partially solving it by fine-tuning pre-trained CNN structures on their own datasets (Tizhoosh & Babaie, 2018). Several studies combine hand-crafted feature extraction methods such as Radon, LBP, and HOG with CNN architectures to improve the results obtained with traditional CNN architectures (Khatami, Babaie, Khosravi, et al., 2018). Such studies are inefficient in terms of computational complexity. To perform a more efficient analysis, researchers focus on methods based on feature vectors in the FCL. Thus, CNN-based CBMIR systems are increasing day by day (Shi et al., 2018). Apart from the CNN architecture, AE structures are also used to extract features from images and generate hash codes (Zhang, Dou, Ju, Xu, & Zhang, 2016). AE structures establish a specific relation between input and output, which allows the input vector to be represented in a different dimension at the output of the AE. AE architectures are generally used in the literature to reduce the vector length while preserving the important features of the input (Das & Walia, 2017). Thanks to these properties, they are advantageous for hash code generation in CBMIR studies. Using feature vectors in the FCL as the AE input can significantly improve performance.
In addition, the AE structure can generate robust hash codes automatically, without the need for ground-truth hash code information. Another contribution is that, by simply changing the output layer of the AE architecture, a hash code of any length can be generated automatically.
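Why short codes make retrieval fast can be sketched as follows (a hypothetical example, not the paper's retrieval procedure; the database entries and the Hamming-distance ranking are illustrative assumptions):

```python
def hamming(a, b):
    """Count of positions where two equal-length digit codes differ."""
    return sum(x != y for x, y in zip(a, b))

def retrieve(query_code, database, top_k=3):
    """Rank stored (tag, code) pairs by Hamming distance to the query
    code -- comparing a handful of digits instead of thousands of
    floating-point feature components."""
    return sorted(database, key=lambda item: hamming(query_code, item[1]))[:top_k]

# Hypothetical 3-digit codes standing in for 13-digit IRMA-style codes.
db = [("chest", [1, 2, 3]), ("hand", [1, 2, 4]), ("skull", [9, 9, 9])]
best = retrieve([1, 2, 3], db, top_k=2)
# best == [("chest", [1, 2, 3]), ("hand", [1, 2, 4])]
```

Because each comparison touches only the code's few digits, a linear scan over a large database stays cheap, and the codes themselves are compact to store.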

In this study, a medical image tagging method consisting of four steps is presented to solve the problems mentioned above. In the first stage, features are extracted from raw medical images using a CNN architecture. The 2000-digit feature vector from the FCL of the CNN is used to generate retrieval codes. These codes are obtained from the IRMA dataset, which contains an unbalanced number of samples per class. For this reason, the imbalance is reduced by creating new class-guided codes using the SMOTE algorithm (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). The new classes created by the SMOTE algorithm increase performance by reducing data imbalance. The newly generated codes have the same number of components as the feature vectors. These feature vectors must be encoded as vectors with the same dimensionality as the IRMA code (i.e., 13 components) without any loss of information. The sAE structure specially designed for this process solves the code length problem. The main contributions of the proposed framework are:
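The SMOTE balancing step applied to the FCL feature vectors can be sketched in a minimal pure-Python form (not the paper's implementation; practical work would use a library such as imbalanced-learn, and the 2-D vectors here stand in for the 2000-digit FCL features):

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples: pick a random minority
    vector, find its k nearest minority neighbours, and interpolate a
    random fraction of the way toward one of them (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # how far along the line segment to place the sample
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority, n_new=5)
```

Because each synthetic vector lies on a segment between two real minority vectors, the oversampled class stays inside the region the minority class already occupies, unlike naive duplication.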

  • (1)

    Imbalance in sample distribution between classes is a widespread problem for medical datasets. The proposed method copes with this problem by balancing feature vectors with the help of the SMOTE algorithm, which is one of the most effective data oversampling methods in the literature.

  • (2)

    To the best of the author's knowledge, this is the first study that eliminates the inter-class imbalance problem using feature vectors in the FCL rather than using raw data from the input.

  • (3)

    The proposed sAE for the conversion of discriminative features produced by CNN into hash codes has increased the overall system performance considerably.

The rest of this paper is organized as follows: The technical details and parameters of the proposed method are described in Section 2. Information about the IRMA dataset, implementation details, and experimental results are given in Section 3. Finally, our conclusion is presented in Section 4.

Section snippets

An overview of the proposed framework

This study suggests an effective tagging method using an unbalanced medical dataset, as shown in Fig. 1. The proposed framework consists of four parts, three of which are the main parts. The fourth part may not be counted among the main parts, as it only converts the 13-digit real-valued codes to the IRMA dataset standard. First, a CNN architecture is developed to extract features from an unbalanced dataset containing different numbers of images from different parts of the body. This part

Dataset

The IRMA dataset was created from randomly selected samples of X-ray images obtained during routine radiology at the Department of Diagnostic Radiology, Aachen University of Technology (RWTH), Aachen, Germany (Huang et al., 2003, Lehmann et al., 2004, Lehmann et al., 2005). The X-ray images cover many body regions from various ages, genders, and viewing positions. The dataset consists of 12,677 training images and 1733 test images in 57 different image categories. These images have been converted to

Discussion

When the studies in the literature are examined, hand-crafted feature extraction was used in early studies, while CNN-based methods have recently come to the fore. However, CNN methods do not achieve the desired performance due to the lack of labeled medical datasets. For this reason, features produced by CNNs and hand-crafted features have been combined. This approach, however, is highly dependent on human experience and time-consuming. To prevent this, this study focuses on automatically generating code

Conclusion

This study presents an effective CBIR method that can generate code for medical images and can also be used for retrieval. Instead of producing codes directly from images, this study follows the approach of generating code using deep features. Besides, an effective vector over-sampling approach is introduced for unbalanced medical image datasets. Accordingly, the vector-book is expanded using the well-known SMOTE method; unlike the image-similarity-based image augmentation approach, this method

CRediT authorship contribution statement

Şaban Öztürk: Conceptualization, Software, Methodology, Resources, Data curation, Validation, Formal analysis, Investigation, Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under grant number 120E018.

Human and animal rights

The paper does not contain any studies with human participants or animals performed by any of the authors.

References (69)

  • Z. Zhu et al.

    Deep learning representation using autoencoder for 3D Shape Retrieval

    Neurocomputing

    (2016)
  • H. Abdel-Nabi et al.

    Content Based Image Retrieval Approach using Deep Learning

    (2019)
  • A.J. Afifi et al.

    Content-Based Image Retrieval Using Invariant Color and Texture Features

    (2012)
  • E. Ahn et al.

    X-ray image classification using domain transferred convolutional neural networks and local sparse spatial pyramid

    (2016)
  • A. Al-Mohamade et al.

    Multiple query content-based image retrieval using relevance feature weight learning

    Journal of Imaging

    (2020)
  • M. Babaie et al.

    Local radon descriptors for image search

    (2017)
  • P.H. Bartels et al.

    Objective cell image analysis

    Journal of Histochemistry & Cytochemistry

    (2016)
  • J.M. Bueno et al.

    How to add content-based image retrieval capability in a PACS

    (2002)
  • Y. Cai et al.

    Medical image retrieval based on convolutional neural network and supervised hashing

    IEEE Access

    (2019)
  • Z. Camlica et al.

    Autoencoding the retrieval relevance of medical images

    (2015)
  • Z. Camlica et al.

    Medical Image Classification via SVM Using LBP Features from Saliency-Based Folded Data

    (2015)
  • Y. Cao et al.

    Deep Visual-Semantic Quantization for Efficient Image Retrieval

    (2017)
  • N.V. Chawla et al.

    SMOTE: Synthetic minority over-sampling technique

    Journal of Artificial Intelligence Research

    (2002)
  • T. Chen et al.

    Computer-aided diagnosis of gallbladder polyps based on high resolution ultrasonography

    Computer Methods and Programs in Biomedicine

    (2020)
  • P. Das et al.

    Content-Based Medical Visual Information Retrieval

    Hybrid Machine Intelligence for Medical Image Analysis

    (2020)
  • R. Das et al.

    Partition selection with sparse autoencoders for content based image classification

    Neural Computing and Applications

    (2017)
  • Z. Gang et al.

    Texture feature extraction and description using gabor wavelet in content-based medical image retrieval

    (2007)
  • K. He et al.

    Deep Residual Learning for Image Recognition

    (2016)
  • Huang, H. K., Lehmann, T. M., Ratib, O. M., Schubert, H., Keysers, D., Kohnen, M., & Wein, B. B. (2003). The IRMA code...
  • Huang, Y., Huang, K., Yu, Y., & Tan, T. (2011). Salient coding for image classification. In Cvpr 2011 (pp....
  • D.K. Iakovidis et al.

    A pattern similarity scheme for medical image retrieval

    IEEE Transactions on Information Technology in Biomedicine

    (2009)
  • Khatami, A., Babaie, M., Khosravi, A., Tizhoosh, H. R., Salaken, S. M., & Nahavandi, S. (2017). A deep-structural...
  • A. Krizhevsky et al.

    ImageNet classification with deep convolutional neural networks

    Communications of the ACM

    (2017)
  • E.A. Krupinski

    Current perspectives in medical image perception

    Attention, Perception & Psychophysics

    (2010)