Medical Image Analysis

Volume 81, October 2022, 102535
Adversarial multimodal fusion with attention mechanism for skin lesion classification using clinical and dermoscopic images

https://doi.org/10.1016/j.media.2022.102535

Highlights

  • A novel multimodal fusion method is proposed to perform automated skin lesion classification using clinical and dermoscopic images by learning the correlated and complementary information.

  • A modality discriminator is designed to guide the feature extractor to learn the correlated information.

  • A self-attention-based image reconstruction approach is designed to automatically force the feature extractor to concentrate on lesion areas.

Abstract

Accurate skin lesion diagnosis requires a great effort from experts to identify the characteristics from clinical and dermoscopic images. Deep multimodal learning-based methods can reduce intra- and inter-reader variability and improve diagnostic accuracy compared to single-modality methods. This study develops a novel method, named adversarial multimodal fusion with attention mechanism (AMFAM), to perform multimodal skin lesion classification. Specifically, we adopt a discriminator that uses adversarial learning to force the feature extractor to explicitly learn the correlated information. Moreover, we design an attention-based reconstruction strategy to encourage the feature extractor to concentrate on learning the features of the lesion area, thus enhancing the feature vector from each modality with more discriminative information. Unlike existing multimodal approaches, which only focus on learning complementary features from dermoscopic and clinical images, our method considers both the correlated and the complementary information of the two modalities for multimodal fusion. To verify the effectiveness of our method, we conduct comprehensive experiments on a publicly available multimodal and multi-task skin lesion classification dataset: the 7-point criteria evaluation database. The experimental results demonstrate that our proposed method outperforms the current state-of-the-art methods and improves the average AUC score by more than 2% on the test set.

Introduction

According to the Global Cancer Statistics 2020, which covers 36 cancers and all cancers combined, skin cancer ranked as the fourth leading cause of new cancer cases and deaths worldwide in 2020 (Sung et al., 2021). Skin cancer is among the most dangerous cancers; melanoma in particular has the highest mortality of all skin cancers (Rigel et al., 1996). In a recent study (Barata et al., 2017), researchers showed that early detection and timely adjuvant treatment could significantly reduce skin cancer mortality. Fortunately, with advanced developments in medical technology, there are many approaches to detect different kinds of skin cancers. Among these, dermoscopic imaging combined with clinical imaging is one of the most commonly used lesion diagnosis approaches in clinical practice (Massone et al., 2007). Dermoscopic images are obtained using optical magnification with liquid immersion and low angle-of-incidence or cross-polarized lighting to make the contact area translucent and subsurface structures visible. These images usually provide a way to pay more attention to the local features of the lesions. Clinical images, obtained using a digital camera, usually provide more information about the global features of the lesions, such as their geometry and color (Bi et al., 2020). Together, the images captured by dermoscopic and clinical digital cameras make a comprehensive multimodal assessment of skin lesions possible.

In clinical practice, multimodal assessment is conducted by human experts. However, there is a shortage of well-trained experts to perform large-scale skin cancer screening promptly. In addition, human experts’ diagnosis is quite subjective and prone to intra- and inter-reader variability, causing inaccurate and inconsistent results across experts. Many factors can affect the diagnosis results, such as empirical knowledge, visual fatigue, and the resolution of the images. Developing automated computer-aided diagnosis (CAD) systems to assist the diagnosis procedure may help to mitigate the impact of these factors (Chen et al., 2020; Xu et al., 2020; He et al., 2020; Zhang et al., 2019; Polat and Koc, 2020). The classification module is the core part of an automated skin lesion diagnosis system. When it comes to the development of CAD methods, convolutional neural networks (CNNs) (LeCun et al., 2015) have replaced traditional methods (Claridge et al., 2003; Mendoza et al., 2009; Zhou et al., 2009; Ma and Staunton, 2013) and become the most effective approach to learning the features of skin lesion images (Pereira et al., 2021; Thomas et al., 2021; Pérez et al., 2021).

In automated skin lesion classification, researchers have made excellent efforts to classify skin lesion images using CNNs. However, most of them only consider a single modality, i.e., clinical images or dermoscopic images (Yu et al., 2018; Zhang et al., 2019; Yu et al., 2016; Harangi, 2018). As mentioned above, the clinical and dermoscopic imaging modalities capture different characteristics of skin lesions: clinical images provide the global features of the lesions, while dermoscopic images provide the detailed features. A single modality may fail to capture the critical information about the lesion and lead to a wrong decision, as shown in Fig. 1(a). To overcome this problem, researchers have attempted to combine clinical and dermoscopic images to classify skin lesions (Ge et al., 2017; Yap et al., 2018; Kawahara et al., 2018). The key idea of these methods is to learn the complementary information from each modality to improve the classification performance, as shown in Fig. 1(b). Complementary information is knowledge that is not visible in the individual modalities on their own but is suitable for understanding the underlying semantics of the target event/topic (Baltrušaitis et al., 2018). A typical way of learning complementary information is to concatenate the feature vectors from the different modalities, where each modality’s feature vector provides information about a different aspect of an object, event, or activity of interest (Liu et al., 2018). However, these methods mainly focus on learning the complementary information while ignoring the correlated information between the two input modalities. Correlated information is the correlation over the representations of the different modalities. It can be leveraged to increase the confidence of the learned features for both modalities by encouraging the consistency of their feature vectors. The correlated information is represented in multiple aspects, such as color, geometry, and other potentially shared characteristics between the two modalities, all of which are critical for skin lesion classification.
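For concreteness, the following is a minimal PyTorch sketch, not taken from any of the cited works, of the conventional concatenation-based late fusion described above; the ResNet-50 encoders and the layer sizes are illustrative assumptions.

import torch
from torch import nn
from torchvision import models


class ConcatFusion(nn.Module):
    """Two modality-specific encoders; their pooled feature vectors are concatenated."""

    def __init__(self, num_classes: int):
        super().__init__()

        def make_encoder():
            # ResNet-50 without its final FC layer -> 2048-d pooled feature
            net = models.resnet50(weights="IMAGENET1K_V1")
            return nn.Sequential(*list(net.children())[:-1])

        self.clinical_enc = make_encoder()
        self.dermoscopic_enc = make_encoder()
        self.classifier = nn.Linear(2 * 2048, num_classes)

    def forward(self, clinical, dermoscopic):
        f_c = self.clinical_enc(clinical).flatten(1)        # global (clinical) view
        f_d = self.dermoscopic_enc(dermoscopic).flatten(1)  # local (dermoscopic) view
        # Complementary fusion: concatenate the two feature vectors
        return self.classifier(torch.cat([f_c, f_d], dim=1))

Such a baseline captures complementary information only; nothing in it encourages the two feature vectors to agree on the shared (correlated) characteristics of the lesion.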

In this paper, we propose a novel classification method, named adversarial multimodal fusion with attention mechanism (AMFAM), to learn discriminative feature representations from clinical and dermoscopic images. The flow chart of our method is shown in Fig. 1(c). On the one hand, our method aims to learn highly discriminative features from each modality by adopting attention-based reconstruction; on the other hand, it constrains the CNN backbone to explicitly learn the correlated features from both modalities so as to maintain their essential shared characteristics. We then concatenate the feature vectors from the two modalities to obtain highly discriminative representations. Specifically, adversarial learning is adopted to guide the feature extractor to learn the correlated information. Moreover, we employ a gradient reversal layer (GRL) that forces the feature extractor to produce multimodal-invariant representations over the multiple source images (Ganin et al., 2016). These multimodal-invariant representations are the correlated information we aim to learn, as well as the shared characteristics we aim to maintain. At the same time, we design an attention-based image reconstruction procedure to encourage the feature extractor to learn more discriminative features for each modality by concentrating on the lesion area of its input image. Lastly, we combine the high-level feature vectors of the two modalities to obtain more discriminative representations and feed them to a classifier for the final classification. A multimodal skin lesion database, the 7-point criteria evaluation database (Kawahara et al., 2018), is used to evaluate our proposed method. The experimental results show that our method outperforms the state-of-the-art methods and verify its effectiveness.
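The GRL itself follows Ganin et al. (2016) and admits a compact PyTorch sketch; the discriminator head below, its layer sizes, and the 2048-d feature dimension (ResNet-50 pooled output) are our own illustrative assumptions rather than the paper's exact configuration.

import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ModalityDiscriminator(nn.Module):
    """Hypothetical discriminator head: classifies a feature vector as clinical vs.
    dermoscopic; the GRL turns its loss into an adversarial signal that pushes the
    feature extractor toward modality-invariant (correlated) representations."""

    def __init__(self, feat_dim=2048, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 2),  # two modalities
        )

    def forward(self, feat):
        return self.head(GradReverse.apply(feat, self.lambd))

Minimizing the discriminator's cross-entropy loss through the GRL maximizes it with respect to the feature extractor, which is the adversarial mechanism referred to above.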

The main novelty and contributions of this work can be summarized as follows:

  • A novel multimodal fusion method is proposed to perform automated skin lesion classification using clinical and dermoscopic images. Its effectiveness is verified on a widely-used skin lesion classification dataset, i.e., 7-point criteria evaluation database.

  • By adopting the adversarial learning strategy, our method can learn the correlated information between the two modalities. More specifically, a modality discriminator is designed to guide the feature extractor to learn the correlated information explicitly.

  • To extract more discriminative features for each modality, we propose a self-attention-based image reconstruction approach that automatically forces the feature extractor to concentrate on lesion areas (a possible realization is sketched after this list).

  • Unlike most existing methods that only consider the complementary information, our method simultaneously considers both the correlated and complementary information of the two modalities.
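As a hedged illustration of the self-attention-based reconstruction contribution, the sketch below shows one possible PyTorch realization: a SAGAN-style self-attention block applied to the backbone feature map, with an L2 reconstruction loss computed through a small decoder. The block design, the decoder, and the loss form are assumptions for illustration only; the actual AMFAM architecture is described in Section 3.

import torch
from torch import nn
import torch.nn.functional as F


class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over a 2D feature map."""

    def __init__(self, in_ch):
        super().__init__()
        self.q = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.k = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.v = nn.Conv2d(in_ch, in_ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable attention weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)            # (b, hw, c/8)
        k = self.k(x).flatten(2)                             # (b, c/8, hw)
        attn = torch.softmax(q @ k, dim=-1)                  # (b, hw, hw)
        v = self.v(x).flatten(2)                             # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)    # attended feature map
        return self.gamma * out + x                          # residual connection


def reconstruction_loss(decoder, attended_feat, image):
    """L2 loss between the decoder output and the (resized) input image."""
    recon = decoder(attended_feat)
    target = F.interpolate(image, size=recon.shape[-2:], mode="bilinear",
                           align_corners=False)
    return F.mse_loss(recon, target)

The intuition is that reconstructing the input through attended features rewards the extractor for keeping lesion-relevant detail, which complements the adversarial correlation objective.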

The rest of this paper is organized as follows. First, a review of related work is provided in Section 2. In Section 3, we present the details of the material and our proposed method. In Section 4, we describe the experimental setups and performance metrics and report the experimental results. The discussion and future work are presented in Section 5. Finally, we conclude this work in Section 6.

Section snippets

Related work

This section reviews some related skin lesion classification approaches, including single-modality skin lesion classification, multimodal fusion methods, and multi-modality skin lesion classification. Also, we will highlight how the proposed method differs from the existing methods.

Material

We employ a publicly available multimodal skin lesion dataset, named 7-point criteria evaluation database (Kawahara et al., 2018), as our material. It contains three modalities (two image modalities and one text modality) for evaluating automated image-based prediction of the 7-point skin lesion malignancy checklist. There are 1011 cases, and each case contains one dermoscopic image, one clinical image, and metadata (such as patient gender and lesion location). Two image modalities are used for

Training and testing details

We implement our proposed method using the PyTorch library. We run all the training and testing processes on an NVIDIA QUADRO RTX 8000 GPU with 48 GB memory. For a fair comparison, we use ResNet-50 (He et al., 2016) as our CNN backbone, the same as in HcCNN. The backbone is initialized with the ImageNet pre-trained parameters. During the training process, we use the Adam (Kingma and Ba, 2014) optimizer with learning rate l=0.00001 and weight decay wd=0.0001 to
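A minimal sketch of this training configuration, assuming a torchvision ResNet-50 backbone; the AMFAM model class is a placeholder here and not shown.

import torch
from torchvision import models

# ImageNet-pretrained ResNet-50 backbone, as stated above
backbone = models.resnet50(weights="IMAGENET1K_V1")
# model = AMFAM(backbone, ...)  # full network (placeholder, not shown)

# Adam with lr = 1e-5 and weight decay = 1e-4, matching the stated hyperparameters
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-5, weight_decay=1e-4)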

Discussion and future work

Although ablation studies and comparisons have proven the advantages of our proposed method, some phenomena need to be noted. First, the attention-mechanism-based and adversarial-training models in the ablation study do not improve the performance on every sub-task. For instance, the Concat. model achieves better accuracy on the DIAG category than the Concat.+Recon.+Att. and Concat.+AD. models, as shown in Table 2. Considering the data distribution, as shown in Table 1, most of the

Conclusion

In this study, to leverage multiple modalities of medical data, we proposed a multimodal deep neural network, AMFAM, for multimodal and multi-task skin lesion classification. Our proposed method can learn both correlated and complementary information from different modalities. Specifically, to learn the correlated information, we adopted adversarial learning to train the model. Furthermore, to make the CNN backbone pay more attention to the lesion for better extracting complementary

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the Agency for Science, Technology and Research (A*STAR) through its AME Programmatic Funding Scheme under Project A20H4g2141.

References (41)

  • X. Xu et al., MSCS-DeepLN: evaluating lung nodule malignancy using multi-scale cost-sensitive neural networks, Med Image Anal (2020)

  • T. Baltrušaitis et al., Multimodal machine learning: a survey and taxonomy, IEEE Trans Pattern Anal Mach Intell (2018)

  • A. Bhardwaj et al., Skin lesion classification using deep learning, Advances in Signal and Data Processing (2021)

  • Y. Ganin et al., Domain-adversarial training of neural networks, The Journal of Machine Learning Research (2016)

  • Z. Ge et al., Skin disease recognition using deep saliency features and multimodal learning of dermoscopy and clinical images, International Conference on Medical Image Computing and Computer-Assisted Intervention (2017)

  • N. Gessert et al., Skin lesion classification using CNNs with patch-based attention and diagnosis-guided loss weighting, IEEE Trans. Biomed. Eng. (2020)

  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)

  • J. Kawahara et al., Seven-point checklist and skin lesion classification using multitask multimodal neural nets, IEEE J Biomed Health Inform (2018)

  • J. Kawahara et al., Multi-resolution-tract CNN with hybrid pretrained and skin-lesion trained layers, International Workshop on Machine Learning in Medical Imaging (2016)

  • N.S. Keskar et al., On large-batch training for deep learning: generalization gap and sharp minima, arXiv preprint arXiv:1609.04836 (2016)