FAT-Net: Feature adaptive transformers for automated skin lesion segmentation
Introduction
Automatic segmentation of skin lesions is considered a crucial step in computer-aided diagnosis (CAD) for melanoma, which accounts for the majority of skin cancer deaths. Moreover, skin cancer is ranked among the fastest growing cancers in the world (Siegel et al., 2017) due to the rapid increase in the number of people diagnosed with it. Dermoscopy, a non-invasive imaging technique, plays a key role in the diagnosis of melanoma. Studies have shown that if melanoma can be diagnosed at an early stage, the patient survival rate can be as high as 90% (Ge et al., 2017). Traditionally, malignant melanoma is identified by expert dermatologists through visual inspection of dermoscopic images (Haenssle et al., 2018), but such visual interpretation is time-consuming and often considered a tedious task. Therefore, automatic skin lesion segmentation in CAD is highly desirable, as it would greatly assist dermatologists and improve analysis accuracy.
However, automatic skin lesion segmentation, which separates the lesion from the surrounding healthy skin by assigning pixel-wise labels in dermoscopic images, is a complicated and challenging task. Patient-specific properties vary in skin color, texture, lesion size, shape, and location, and images contain numerous artifacts such as body hairs, reflections, air bubbles, shadows, non-uniform lighting, and markers (Mishra and Celebi, 2016); typical challenging cases are shown in Fig. 1. Early automatic skin lesion segmentation techniques often treated lesion borders as the key cue for distinguishing lesions from the surrounding skin background. Such borders are usually extracted with ensembles of several thresholding methods (Emre Celebi et al., 2008; Celebi et al., 2009). Garnavi et al. (2010) also employed color space analysis and clustering-based histogram thresholding to determine the optimal color channel for segmenting skin lesions. Although lesion borders can be used to segment skin lesions, the above methods usually require pre-defined, hand-crafted image features. To this end, deep learning methods have been proposed to improve segmentation performance by learning image features with convolutional neural networks (CNNs).
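As a concrete illustration of the histogram-thresholding idea these early methods build on, the sketch below implements plain Otsu thresholding in numpy on a toy bimodal intensity distribution. It is a simplified stand-in, not the ensemble or optimal-color-channel methods cited above, and the toy pixel values are illustrative assumptions.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the intensity threshold that maximizes the
    between-class variance of a uint8 image's histogram."""
    prob = np.bincount(gray.ravel(), minlength=256) / gray.size
    bins = np.arange(256)
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class weights
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (bins[:t] * prob[:t]).sum() / w0    # class means
        mu1 = (bins[t:] * prob[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2          # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Toy bimodal "image": dark lesion pixels near 40, brighter skin near 200.
rng = np.random.default_rng(0)
pixels = np.concatenate([rng.normal(40, 5, 500), rng.normal(200, 5, 500)])
pixels = np.clip(pixels, 0, 255).astype(np.uint8)
t = otsu_threshold(pixels)
lesion_mask = pixels < t   # lesion assumed darker than surrounding skin
```

On such a clean bimodal histogram the threshold lands between the two modes; the clinical difficulty described above comes precisely from real dermoscopic histograms not being this well separated.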
In recent years, several deep learning architectures have been proposed to improve segmentation performance, such as the well-known fully convolutional network (FCN) (Yuan et al., 2017) and U-Net (Ronneberger et al., 2015). In particular, by extracting contextual features with an encoder-decoder architecture, U-Net and its variants have performed excellently in a variety of medical image segmentation tasks across imaging modalities, including ResU-Net (Jha et al., 2020; Zhang et al., 2018), U-Net++ (Zhou et al., 2018), DenseNet (Huang et al., 2017; Hasan et al., 2020), 3D U-Net (Li et al., 2020), and V-Net (Milletari et al., 2016). To address the challenges of skin lesion segmentation, Bi et al. (2017) employed a parallel integration method in a multi-stage fully convolutional network (mFCN) to accurately segment skin lesions. Similarly, Tang et al. (2019b) developed a multi-stage U-Net (MS-UNet) with a deep-supervised learning strategy to further improve segmentation performance. However, these methods usually ignore global context information, which is crucial for determining the precise location of skin lesions. In other words, the long-range dependencies involved in pixel assignment are of significant importance in semantic segmentation of medical images, especially for delineating boundary pixels. Therefore, enriching the global context information of feature maps and learning long-range dependencies between pixels can help define the precise positions and boundaries of skin lesions and thus improve segmentation performance.
It is well known that applications of U-Net have boosted performance in various medical segmentation tasks thanks to the skip connections between encoders and decoders. Although the encoder-decoder architecture and skip connections help U-Net effectively capture both low-level and high-level features of the input, the loss of location and global context information during the consecutive down-sampling operations may still hinder segmentation accuracy. In addition, the consecutive up-sampling operations in the decoder are based on high-level feature maps and thus ignore the detailed spatial information of low-level feature maps. As a result, extracting more global context information and adaptively fusing the feature maps between encoder and decoder are the keys to improving segmentation performance. Wang et al. (2018) proposed a non-local mechanism that captures long-range dependencies by computing the response at a position as a weighted sum of the features at all positions in the input feature maps. The non-local attention mechanism can therefore be considered a simple form of self-attention that computes interactions between any two positions in the input feature maps. More recently, the transformer (Vaswani et al., 2017) employed self-attention mechanisms to capture long-range dependencies, whose advantages have been demonstrated in both natural language processing (NLP) and computer vision. Unlike the non-local attention mechanism, the vision transformer (ViT) (Dosovitskiy et al., 2021) takes advantage of multiple parallel self-attention heads to capture long-range dependencies. In addition, the data-efficient image transformer (Touvron et al., 2021) adds a feed-forward network (FFN) to improve its modeling capability.
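The weighted-sum-over-all-positions computation at the heart of both non-local attention and transformer self-attention can be sketched in a few lines of numpy. This is a single-head, unbatched illustration with randomly initialized projection matrices, not the exact formulation of any of the cited models:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over N feature
    vectors (the rows of x). Each output position is a weighted sum
    over ALL positions, so the dependencies it captures are global."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (N, N) pairwise affinities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # softmax over positions
    return w @ v                              # every row attends to every row

rng = np.random.default_rng(1)
n, d = 6, 4                                   # e.g. 6 flattened patch embeddings
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)           # shape (6, 4)
```

A multi-head variant, as in ViT, simply runs several such projections in parallel and concatenates the results; the key contrast with convolution is that the (N, N) score matrix couples every position with every other position in one step.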
In this paper, we propose a novel segmentation network with feature adaptive transformers, named FAT-Net, to deal with the challenging skin lesion segmentation task. Specifically, we implement a dual encoder comprising both CNN and transformer branches to simultaneously capture local features and global context information. Inspired by MultiResUNet (Ibtehaz and Rahman, 2020), we also propose a feature adaptation module (FAM) to align the feature distributions between encoder and decoder. In addition, we employ a simple squeeze-and-excitation (SE) module to re-weight the channels of the feature maps in the top layer; by activating effective channels and restraining useless ones, we achieve a better fusion of local and global contexts. After matching the feature distributions of encoder and decoder with the FAM, we can directly perform an element-wise summation to fuse the feature maps from the FAM and the corresponding decoder layer. Finally, we obtain the skin lesion segmentation predictions with our memory-efficient decoder. We evaluated the proposed FAT-Net on four public skin lesion datasets: ISIC 2016, ISIC 2017, ISIC 2018, and PH2. Ablation studies and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed method. Our contributions can be summarized as follows:
- (1)
We propose a novel segmentation network with feature adaptive transformers for automated skin lesion segmentation, named FAT-Net. We replace the traditional single-branch encoder with a dual encoder consisting of both CNN and transformer branches. By seamlessly integrating the two branches, our method not only extracts rich local features but also captures the important global context information for skin lesion segmentation.
- (2)
We employ three feature adaptation modules (FAMs) in the encoder-decoder structure to adaptively match the distributions of the corresponding feature maps in the encoder and decoder, which improves adaptive feature fusion and enables a memory-efficient decoder.
- (3)
We propose a novel memory-efficient decoder that fuses the features extracted by the parallel two-stream encoder, aggregating rich local features with global context information. Thanks to the feature adaptation module, we can directly perform an element-wise summation to fuse the feature maps from the FAM and the corresponding decoder layer. Compared with the concatenation operation used in each layer of a traditional decoder, our memory-efficient decoder halves the computation and memory consumption.
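The memory argument in contribution (3) can be made concrete with a toy numpy sketch: once the FAM has aligned the two feature maps, summation fusion keeps the channel count at C, whereas concatenation doubles it to 2C, so every subsequent convolution processes twice the input volume. The shapes below are illustrative, not the paper's actual configuration.

```python
import numpy as np

def fuse_sum(skip, up):
    """Element-wise summation fusion: the output keeps C channels."""
    assert skip.shape == up.shape   # distributions/shapes aligned by the FAM
    return skip + up

def fuse_concat(skip, up):
    """Channel concatenation fusion: the output has 2C channels, so the
    next convolution must process twice as many input channels."""
    return np.concatenate([skip, up], axis=0)

c, h, w = 64, 32, 32                    # illustrative feature-map shape
skip = np.ones((c, h, w))               # feature map from the FAM / skip path
up = np.ones((c, h, w))                 # up-sampled decoder feature map
print(fuse_sum(skip, up).shape)         # (64, 32, 32): C channels
print(fuse_concat(skip, up).shape)      # (128, 32, 32): 2C channels
```

Since the activation memory and the cost of the following convolution both scale linearly with the input channel count, the summed variant needs roughly half the resources of the concatenated one at each decoder layer.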
Skin lesion segmentation
Traditional skin lesion segmentation methods mainly relied on extracting and recognizing low-level image features. By investigating the color distributions among different tissues, histogram-based thresholding methods (Møllersen et al., 2010; Gomez et al., 2007; Yueksel and Borlu, 2009; Emre Celebi et al., 2013; Peruch et al., 2013) were developed for skin lesion segmentation. To mimic the
Methodology
The overview of our proposed FAT-Net for skin lesion segmentation is shown in Fig. 2. Our framework mainly consists of three components, including a dual encoder to enhance feature encoding, a feature adaptation module embedded in the skip connection to promote the adjacent features fusion, and a memory-efficient decoder to perform the feature decoding layer by layer more effectively. Note that, we still employ the symmetric encoder-decoder architecture with skip connection as the network
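As noted in the introduction, a squeeze-and-excitation (SE) module re-weights feature-map channels by global pooling followed by a small gating network. The following minimal numpy sketch shows that re-weighting; the channel count, bottleneck ratio, and random weights are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def squeeze_excite(feat, w1, w2):
    """Re-weight the channels of a (C, H, W) feature map.

    Squeeze: global average pooling gives one descriptor per channel.
    Excite: a two-layer bottleneck (ReLU then sigmoid) turns it into
    per-channel gates in (0, 1) that scale each channel up or down."""
    z = feat.mean(axis=(1, 2))                 # squeeze -> (C,)
    s = np.maximum(z @ w1, 0.0)                # bottleneck + ReLU -> (C/r,)
    gates = 1.0 / (1.0 + np.exp(-(s @ w2)))    # sigmoid gates -> (C,)
    return feat * gates[:, None, None]         # channel-wise re-weighting

rng = np.random.default_rng(2)
c, r = 8, 2                                    # channels and reduction ratio
feat = rng.normal(size=(c, 16, 16))
w1 = rng.normal(size=(c, c // r))
w2 = rng.normal(size=(c // r, c))
out = squeeze_excite(feat, w1, w2)             # shape (8, 16, 16)
```

Because every gate lies strictly between 0 and 1, the module can only attenuate channels relative to one another; useful channels keep gates near 1 while uninformative ones are suppressed, which is what makes the subsequent local/global feature fusion cleaner.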
Datasets
We conducted extensive experiments on four public skin lesion segmentation datasets to verify the effectiveness of our method. The first three datasets are provided by the International Skin Imaging Collaboration (ISIC): the ISIC 2016 dataset (Gutman et al., 2016), the ISIC 2017 dataset (Codella et al., 2018), and the ISIC 2018 dataset (Codella et al., 2019; Tschandl, Rosendahl,
Discussion and limitations
While encoder-decoder architectures have been widely used in medical image segmentation, most of them contain only a single encoder and cannot simultaneously capture both local features and global context information. Due to this limitation in jointly extracting everything from local features to global long-range dependencies, most existing methods still cannot accurately distinguish objective pixels from background pixels, especially in challenging cases where
Conclusion
In this paper, we presented a novel and efficient two-stream network with feature adaptive transformers, namely FAT-Net, to address the challenging task of skin lesion segmentation. Unlike a traditional CNN-based encoder, our transformer encoder uses a sequence-to-sequence prediction approach for image segmentation. Compared with the convolution operation, the transformer encoder greatly enlarges the receptive field through the self-attention mechanism, so
CRediT authorship contribution statement
Huisi Wu: Conceptualization, Funding acquisition, Project administration, Methodology, Writing – original draft, Writing – review & editing. Shihuai Chen: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Guilian Chen: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Wei Wang: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Baiying Lei: Conceptualization, Funding acquisition, Project
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported partly by National Natural Science Foundation of China (Nos. 61973221, 61871274, 61801305, 61872351, and 81571758), the Natural Science Foundation of Guangdong Province, China (Nos. 2018A030313381 and 2019A1515011165), the COVID-19 Prevention Project of Guangdong Province, China (No. 2020KZDZX1174), the Major Project of the New Generation of Artificial Intelligence (No. 2018AAA0102900), and Shenzhen Key Basic Research Project (Nos. JCYJ20180507184647636,
References (66)
- Lesion border detection in dermoscopy images. Computerized Medical Imaging and Graphics, 2009.
- Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology, 2018.
- DSNet: automatic dermoscopic skin lesion segmentation. Computers in Biology and Medicine, 2020.
- MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks, 2020.
- Cascade knowledge diffusion network for skin lesion diagnosis and segmentation. Applied Soft Computing, 2021.
- Skin lesion segmentation via generative adversarial networks with dual discriminators. Medical Image Analysis, 2020.
- Dense pooling layers in fully convolutional network for skin lesion segmentation. Computerized Medical Imaging and Graphics, 2019.
- Attention gated networks: learning to leverage salient regions in medical images. Medical Image Analysis, 2019.
- Efficient skin lesion segmentation using separable-UNet with stochastic weight averaging. Computer Methods and Programs in Biomedicine, 2019.
- Illumination-based transformations improve skin lesion segmentation in dermoscopic images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.
- A novel focal Tversky loss function with improved attention U-Net for lesion segmentation. 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019).
- Dermoscopic image segmentation via multistage fully convolutional networks. IEEE Transactions on Biomedical Engineering.
- End-to-end object detection with transformers. European Conference on Computer Vision.
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1902.03368.
- Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018).
- An image is worth 16x16 words: transformers for image recognition at scale. International Conference on Learning Representations.
- Border detection in dermoscopy images using statistical region merging. Skin Research and Technology.
- Lesion border detection in dermoscopy images using ensembles of thresholding methods. Skin Research and Technology.
- CPFNet: context pyramid fusion network for medical image segmentation. IEEE Transactions on Medical Imaging.
- Automatic segmentation of dermoscopy images using histogram thresholding on optimal color channels. International Journal of Medicine and Medical Sciences.
- Skin disease recognition using deep saliency features and multimodal learning of dermoscopy and clinical images. International Conference on Medical Image Computing and Computer-Assisted Intervention.
- Independent histogram pursuit for segmentation of skin lesions. IEEE Transactions on Biomedical Engineering.
- Skin lesion segmentation in dermoscopic images with ensemble deep learning methods. IEEE Access.
- Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1605.01397.
- Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision.
- Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
- Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- DoubleU-Net: a deep convolutional neural network for medical image segmentation. 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS).
- Kvasir-SEG: a segmented polyp dataset. International Conference on Multimedia Modeling.
- ResUNet++: an advanced architecture for medical image segmentation. 2019 IEEE International Symposium on Multimedia (ISM).