Medical Image Analysis

Volume 81, October 2022, 102535
Adversarial multimodal fusion with attention mechanism for skin lesion classification using clinical and dermoscopic images

https://doi.org/10.1016/j.media.2022.102535

Highlights

  • A novel multimodal fusion method is proposed to perform automated skin lesion classification using clinical and dermoscopic images by learning the correlated and complementary information.

  • A modality discriminator is designed to guide the feature extractor to learn the correlated information.

  • A self-attention-based image reconstruction approach is designed to automatically force the feature extractor to concentrate on lesion areas.

Abstract

Accurate skin lesion diagnosis requires a great effort from experts to identify the characteristics from clinical and dermoscopic images. Deep multimodal learning-based methods can reduce intra- and inter-reader variability and improve diagnostic accuracy compared to single-modality methods. This study develops a novel method, named adversarial multimodal fusion with attention mechanism (AMFAM), to perform multimodal skin lesion classification. Specifically, we adopt a discriminator that uses adversarial learning to force the feature extractor to explicitly learn the correlated information. Moreover, we design an attention-based reconstruction strategy to encourage the feature extractor to concentrate on learning the features of the lesion area, thus enhancing the feature vector from each modality with more discriminative information. Unlike existing multimodal approaches, which only focus on learning complementary features from dermoscopic and clinical images, our method considers both the correlated and the complementary information of the two modalities for multimodal fusion. To verify the effectiveness of our method, we conduct comprehensive experiments on a publicly available multimodal and multi-task skin lesion classification dataset: the 7-point criteria evaluation database. The experimental results demonstrate that our proposed method outperforms the current state-of-the-art methods and improves the average AUC score by more than 2% on the test set.

Introduction

According to the Global Cancer Statistics 2020, which covers 36 cancers and all cancers combined, skin cancer ranked as the fourth leading cause of new cancer cases and deaths worldwide in 2020 (Sung et al., 2021). Skin cancer is among the most dangerous cancers; melanoma in particular has the highest mortality of all skin cancers (Rigel et al., 1996). In a recent study (Barata et al., 2017), researchers showed that early detection and timely adjuvant treatment could significantly reduce skin cancer mortality. Fortunately, with advanced developments in medical technology, there are many approaches to detect different kinds of skin cancers. Among these, dermoscopic imaging combined with clinical imaging is one of the most commonly used lesion diagnosis approaches in clinical practice (Massone et al., 2007). Dermoscopic images are obtained using optical magnification with liquid immersion and low angle-of-incidence or cross-polarized lighting to make the contact area translucent and subsurface structures visible. These images usually provide a way to pay more attention to the local features of the lesions. Clinical images, obtained using a digital camera, usually provide more information about the global features of the lesions, such as their geometry and color (Bi et al., 2020). Together, the images captured by dermoscopic and clinical digital cameras make a comprehensive multimodal assessment of skin lesions possible.

In clinical practice, multimodal assessment is conducted by human experts. However, there is a shortage of well-trained experts to perform large-scale skin cancer screening promptly. In addition, human experts’ diagnosis is quite subjective and prone to intra- and inter-reader variability, causing inaccurate and inconsistent results across experts. Many factors can affect the diagnosis results, such as empirical knowledge, visual fatigue, and the resolution of the images. Developing automated computer-aided diagnosis (CAD) systems to assist the diagnosis procedure may help to mitigate the impact of these factors (Chen et al., 2020; Xu et al., 2020; He et al., 2020; Zhang et al., 2019; Polat and Koc, 2020). The classification module is the core part of an automated skin lesion diagnosis system. When it comes to the development of CAD methods, convolutional neural networks (CNNs) (LeCun et al., 2015) have replaced traditional methods (Claridge et al., 2003; Mendoza et al., 2009; Zhou et al., 2009; Ma and Staunton, 2013) and become the most effective approach to learning the features of skin lesion images (Pereira et al., 2021; Thomas et al., 2021; Pérez et al., 2021).

In automated skin lesion classification, researchers have made excellent efforts to classify skin lesion images using CNNs. However, most of them only consider a single modality, i.e., clinical images or dermoscopic images (Yu et al., 2018; Zhang et al., 2019; Yu et al., 2016; Harangi, 2018). As mentioned above, the clinical and dermoscopic imaging modalities capture different characteristics of skin lesions: clinical images provide the global features of the lesions, while dermoscopic images provide the detailed features. A single modality may fail to capture the critical information about the lesion and lead to a wrong decision, as shown in Fig. 1(a). To overcome this problem, researchers have attempted to combine clinical and dermoscopic images to classify skin lesions (Ge et al., 2017; Yap et al., 2018; Kawahara et al., 2018). The key idea of these methods is to learn the complementary information from each modality to improve the classification performance, as shown in Fig. 1(b). Complementary information is knowledge that is not visible in the individual modalities on their own but is suitable for understanding the underlying semantics of the target event/topic (Baltrušaitis et al., 2018). A typical way of learning complementary information is to concatenate the feature vectors from the different modalities, where each modality’s feature vector provides information about a different aspect of an object, event, or activity of interest (Liu et al., 2018). However, these methods mainly focus on learning the complementary information while ignoring the correlated information between the two input modalities. Correlated information is the correlation over the representations of the different modalities. It can be leveraged to increase the confidence of the learned features for both modalities by encouraging the consistency of their feature vectors. The correlated information is represented in multiple aspects, such as color, geometry, and other potentially shared characteristics between the two modalities, all of which are critical for skin lesion classification.
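For concreteness, the following is a minimal PyTorch sketch, not taken from any of the cited works, of the conventional concatenation-based late fusion described above; the ResNet-50 encoders and the layer sizes are illustrative assumptions.

import torch
from torch import nn
from torchvision import models


class ConcatFusion(nn.Module):
    """Two modality-specific encoders; their pooled feature vectors are concatenated."""

    def __init__(self, num_classes: int):
        super().__init__()

        def make_encoder():
            # ResNet-50 without its final FC layer -> 2048-d pooled feature
            net = models.resnet50(weights="IMAGENET1K_V1")
            return nn.Sequential(*list(net.children())[:-1])

        self.clinical_enc = make_encoder()
        self.dermoscopic_enc = make_encoder()
        self.classifier = nn.Linear(2 * 2048, num_classes)

    def forward(self, clinical, dermoscopic):
        f_c = self.clinical_enc(clinical).flatten(1)        # global (clinical) view
        f_d = self.dermoscopic_enc(dermoscopic).flatten(1)  # local (dermoscopic) view
        # Complementary fusion: concatenate the two feature vectors
        return self.classifier(torch.cat([f_c, f_d], dim=1))

Such a baseline captures complementary information only; nothing in it encourages the two feature vectors to agree on the shared (correlated) characteristics of the lesion.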

In this paper, we propose a novel classification method, named adversarial multimodal fusion with attention mechanism (AMFAM), to learn discriminative feature representations from clinical and dermoscopic images. The flow chart of our method is shown in Fig. 1(c). On the one hand, our method aims to learn highly discriminative features from each modality by adopting attention-based reconstruction; on the other hand, it constrains the CNN backbone to explicitly learn the correlated features from both modalities so as to maintain their essential shared characteristics. We then concatenate the feature vectors from the two modalities to obtain highly discriminative representations. Specifically, adversarial learning is adopted to guide the feature extractor to learn the correlated information. Moreover, we employ a gradient reversal layer (GRL) that forces the feature extractor to produce multimodal-invariant representations over the multiple source images (Ganin et al., 2016). These multimodal-invariant representations are the correlated information we aim to learn, as well as the shared characteristics we aim to maintain. At the same time, we design an attention-based image reconstruction procedure to encourage the feature extractor to learn more discriminative features for each modality by concentrating on the lesion area of its input image. Lastly, we combine the high-level feature vectors of the two modalities to obtain more discriminative representations and feed them to a classifier for the final classification. A multimodal skin lesion database, the 7-point criteria evaluation database (Kawahara et al., 2018), is used to evaluate our proposed method. The experimental results show that our method outperforms the state-of-the-art methods and verify its effectiveness.
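The GRL itself follows Ganin et al. (2016) and admits a compact PyTorch sketch; the discriminator head below, its layer sizes, and the 2048-d feature dimension (ResNet-50 pooled output) are our own illustrative assumptions rather than the paper's exact configuration.

import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ModalityDiscriminator(nn.Module):
    """Hypothetical discriminator head: classifies a feature vector as clinical vs.
    dermoscopic; the GRL turns its loss into an adversarial signal that pushes the
    feature extractor toward modality-invariant (correlated) representations."""

    def __init__(self, feat_dim=2048, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 2),  # two modalities
        )

    def forward(self, feat):
        return self.head(GradReverse.apply(feat, self.lambd))

Minimizing the discriminator's cross-entropy loss through the GRL maximizes it with respect to the feature extractor, which is the adversarial mechanism referred to above.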

The main novelty and contributions of this work can be summarized as follows:

  • A novel multimodal fusion method is proposed to perform automated skin lesion classification using clinical and dermoscopic images. Its effectiveness is verified on a widely-used skin lesion classification dataset, i.e., 7-point criteria evaluation database.

  • By adopting the adversarial learning strategy, our method can learn the correlated information between the two modalities. More specifically, a modality discriminator is designed to guide the feature extractor to learn the correlated information explicitly.

  • To extract more discriminative features for each modality, we propose a self-attention-based image reconstruction approach that automatically forces the feature extractor to concentrate on lesion areas (a possible realization is sketched after this list).

  • Unlike most existing methods that only consider the complementary information, our method simultaneously considers both the correlated and complementary information of the two modalities.
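As a hedged illustration of the self-attention-based reconstruction contribution, the sketch below shows one possible PyTorch realization: a SAGAN-style self-attention block applied to the backbone feature map, with an L2 reconstruction loss computed through a small decoder. The block design, the decoder, and the loss form are assumptions for illustration only; the actual AMFAM architecture is described in Section 3.

import torch
from torch import nn
import torch.nn.functional as F


class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over a 2D feature map."""

    def __init__(self, in_ch):
        super().__init__()
        self.q = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.k = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.v = nn.Conv2d(in_ch, in_ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable attention weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)            # (b, hw, c/8)
        k = self.k(x).flatten(2)                             # (b, c/8, hw)
        attn = torch.softmax(q @ k, dim=-1)                  # (b, hw, hw)
        v = self.v(x).flatten(2)                             # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)    # attended feature map
        return self.gamma * out + x                          # residual connection


def reconstruction_loss(decoder, attended_feat, image):
    """L2 loss between the decoder output and the (resized) input image."""
    recon = decoder(attended_feat)
    target = F.interpolate(image, size=recon.shape[-2:], mode="bilinear",
                           align_corners=False)
    return F.mse_loss(recon, target)

The intuition is that reconstructing the input through attended features rewards the extractor for keeping lesion-relevant detail, which complements the adversarial correlation objective.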

The rest of this paper is organized as follows. First, a review of related work is provided in Section 2. In Section 3, we present the details of the material and our proposed method. In Section 4, we describe the experimental setups and performance metrics and report the experimental results. The discussion and future work are presented in Section 5. Finally, we conclude this work in Section 6.

Section snippets

Related work

This section reviews some related skin lesion classification approaches, including single-modality skin lesion classification, multimodal fusion methods, and multi-modality skin lesion classification. Also, we will highlight how the proposed method differs from the existing methods.

Material

We employ a publicly available multimodal skin lesion dataset, named 7-point criteria evaluation database (Kawahara et al., 2018), as our material. It contains three modalities (two image modalities and one text modality) for evaluating automated image-based prediction of the 7-point skin lesion malignancy checklist. There are 1011 cases, and each case contains one dermoscopic image, one clinical image, and metadata (such as patient gender and lesion location). Two image modalities are used for

Training and testing details

We implement our proposed method using the PyTorch library. We run all the training and testing processes on an NVIDIA QUADRO RTX 8000 GPU with 48 GB memory. For a fair comparison, we use ResNet-50 (He et al., 2016) as our CNN backbone, the same as in HcCNN. The backbone is initialized with the ImageNet pre-trained parameters. During the training process, we use the Adam (Kingma and Ba, 2014) optimizer with learning rate l=0.00001 and weight decay wd=0.0001 to
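A minimal sketch of this training configuration, assuming a torchvision ResNet-50 backbone; the AMFAM model class is a placeholder here and not shown.

import torch
from torchvision import models

# ImageNet-pretrained ResNet-50 backbone, as stated above
backbone = models.resnet50(weights="IMAGENET1K_V1")
# model = AMFAM(backbone, ...)  # full network (placeholder, not shown)

# Adam with lr = 1e-5 and weight decay = 1e-4, matching the stated hyperparameters
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-5, weight_decay=1e-4)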

Discussion and future work

Although ablation studies and comparisons have proven the advantages of our proposed method, some phenomena need to be noted. First, the attention-mechanism-based and adversarial-training models in the ablation study do not improve the performance on every sub-task. For instance, the Concat. model achieves better accuracy on the DIAG category than the Concat.+Recon.+Att. and Concat.+AD. models, as shown in Table 2. Considering the data distribution, as shown in Table 1, most of the

Conclusion

In this study, to leverage multiple modalities of medical data, we proposed a multimodal deep neural network, AMFAM, for multimodal and multi-task skin lesion classification. Our proposed method can learn both correlated and complementary information from different modalities. Specifically, to learn the correlated information, we adopted adversarial learning to train the model. Furthermore, to make the CNN backbone pay more attention to the lesion for better extracting complementary

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the Agency for Science, Technology and Research (A*STAR) through its AME Programmatic Funding Scheme under Project A20H4g2141.

References (41)

  • X. Xu et al., MSCS-DeepLN: evaluating lung nodule malignancy using multi-scale cost-sensitive neural networks, Med Image Anal (2020)

  • T. Baltrušaitis et al., Multimodal machine learning: a survey and taxonomy, IEEE Trans Pattern Anal Mach Intell (2018)

  • A. Bhardwaj et al., Skin lesion classification using deep learning, Advances in Signal and Data Processing (2021)

  • Y. Ganin et al., Domain-adversarial training of neural networks, The Journal of Machine Learning Research (2016)

  • Z. Ge et al., Skin disease recognition using deep saliency features and multimodal learning of dermoscopy and clinical images, International Conference on Medical Image Computing and Computer-Assisted Intervention (2017)

  • N. Gessert et al., Skin lesion classification using CNNs with patch-based attention and diagnosis-guided loss weighting, IEEE Trans. Biomed. Eng. (2020)

  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)

  • J. Kawahara et al., Seven-point checklist and skin lesion classification using multitask multimodal neural nets, IEEE J Biomed Health Inform (2018)

  • J. Kawahara et al., Multi-resolution-tract CNN with hybrid pretrained and skin-lesion trained layers, International Workshop on Machine Learning in Medical Imaging (2016)

  • N.S. Keskar et al., On large-batch training for deep learning: generalization gap and sharp minima, arXiv preprint arXiv:1609.04836 (2016)