
Medical Image Analysis

Volume 76, February 2022, 102327

FAT-Net: Feature adaptive transformers for automated skin lesion segmentation

https://doi.org/10.1016/j.media.2021.102327

Highlights

  • A novel feature adaptive transformers network (FAT-Net) for automated skin lesion segmentation.

  • A dual encoder with CNNs and Transformers to effectively extract local features and global context information.

  • A feature adaptation module (FAM) to register the feature distributions between encoder and decoder.

  • A memory-efficient decoder to adaptively fuse multi-level features and restore spatial information.

Abstract

Skin lesion segmentation from dermoscopic images is essential for improving the quantitative analysis of melanoma. However, it remains a challenging task due to the large variations in scale and the irregular shapes of skin lesions. In addition, blurred boundaries between skin lesions and the surrounding tissue may further increase the probability of incorrect segmentation. Due to the inherent limitations of traditional convolutional neural networks (CNNs) in capturing global context information, traditional CNN-based methods usually cannot achieve satisfactory segmentation performance. In this paper, we propose a novel feature adaptive transformer network based on the classical encoder-decoder architecture, named FAT-Net, which integrates an extra transformer branch to effectively capture long-range dependencies and global context information. Furthermore, we employ a memory-efficient decoder and a feature adaptation module to enhance the fusion of adjacent-level features by activating the effective channels and restraining the irrelevant background noise. We performed extensive experiments to verify the effectiveness of our proposed method on four public skin lesion segmentation datasets: ISIC 2016, ISIC 2017, ISIC 2018, and PH2. Ablation studies demonstrate the effectiveness of our feature adaptive transformers and memory-efficient strategies. Comparisons with state-of-the-art methods also verify the superiority of our proposed FAT-Net in terms of both accuracy and inference speed. The code is available at https://github.com/SZUcsh/FAT-Net.

Introduction

Automatic segmentation of skin lesions is a crucial step in computer-aided diagnosis (CAD) of melanoma, which accounts for the majority of skin cancer deaths. Moreover, skin cancer is among the fastest growing cancers in the world, with a rapidly increasing number of diagnoses (Siegel et al., 2017). Dermoscopy, a non-invasive imaging technique, plays a key role in the diagnosis of melanoma. Studies have shown that when melanoma is diagnosed at an early stage, the patient survival rate can be as high as 90% (Ge et al., 2017). Traditionally, malignant melanoma is identified by expert dermatologists through visual inspection of dermoscopic images (Haenssle et al., 2018), but such interpretation is time-consuming and tedious. Therefore, automatic skin lesion segmentation in CAD is highly desired and would greatly assist dermatologists and improve analysis accuracy.

However, automatic skin lesion segmentation, i.e., separating the lesion from the surrounding healthy skin by assigning pixel-wise labels in dermoscopic images, is a complicated and challenging task. Patient-specific properties such as skin color, texture, lesion size, shape, and location vary widely, and numerous artifacts such as body hairs, reflections, air bubbles, shadows, non-uniform lighting, and markers are often present (Mishra and Celebi, 2016); typical challenging cases are shown in Fig. 1. Early automatic skin lesion segmentation techniques often treated lesion borders as the key cue for distinguishing lesions from the surrounding skin. These borders are usually extracted with ensembles of several thresholding methods (Emre Celebi et al., 2008, Celebi et al., 2009). Garnavi et al. (2010) also employed color space analysis and clustering-based histogram thresholding to determine the optimal color channel for segmenting skin lesions. Although lesion borders can be used to segment skin lesions, the above methods require extracting pre-defined image features. To this end, deep learning methods have been proposed to improve segmentation performance by learning image features with convolutional neural networks (CNNs).

In recent years, several deep learning architectures have been proposed to improve segmentation performance, such as the well-known fully convolutional network (FCN) (Yuan et al., 2017) and U-Net (Ronneberger et al., 2015). In particular, by extracting contextual features through an encoder-decoder architecture, U-Net has exhibited excellent performance in a variety of medical image segmentation tasks across different imaging modalities, inspiring variants such as ResU-Net (Jha et al., 2020, Zhang et al., 2018), U-Net++ (Zhou et al., 2018), DenseNet-based models (Huang et al., 2017, Hasan et al., 2020), 3D U-Net (Li et al., 2020), and V-Net (Milletari et al., 2016). To address the challenges of skin lesion segmentation, Bi et al. (2017) employed a parallel integration method in a multi-stage fully convolutional network (mFCN) to accurately segment skin lesions. Similarly, Tang et al. (2019b) developed a multi-stage U-Net (MS-UNet) based on a deep-supervised learning strategy to further improve segmentation performance. However, these methods usually ignore global context information, which is crucial for determining the precise location of skin lesions. In other words, the long-range dependencies involved in pixel assignment for semantic segmentation are of significant importance in medical images, especially for defining boundary pixels. Therefore, enriching the global context of feature maps and learning long-range dependencies between pixels can help define the precise positions and boundaries of skin lesions and thereby improve segmentation performance.

It is well known that U-Net has boosted the performance of various medical segmentation tasks thanks to the skip connections between its encoder and decoder. Although the encoder-decoder architecture and skip connections help U-Net effectively capture both low-level and high-level features of the input data, the loss of location and global context information during the consecutive down-sampling operations may still hinder segmentation accuracy. In addition, the consecutive up-sampling operations in the decoder are based on high-level feature maps and thus ignore the detailed spatial information of low-level feature maps. As a result, extracting more global context information and adaptively fusing the feature maps between encoder and decoder are key to improving segmentation performance. Wang et al. (2018) proposed a non-local mechanism to capture long-range dependencies by computing the response at a position as a weighted sum of the features at all positions in the input feature maps. The non-local mechanism can thus be viewed as a simple form of self-attention that computes interactions between any two positions in the input feature maps. More recently, the transformer (Vaswani et al., 2017) employed self-attention mechanisms to capture long-range dependencies, with advantages demonstrated in both natural language processing (NLP) and computer vision. Unlike the non-local attention mechanism, the vision transformer (ViT) (Dosovitskiy et al., 2021) takes advantage of multiple parallel self-attention heads to capture long-range dependencies, and the data-efficient image transformer (Touvron et al., 2021) adds a Feed-Forward Network (FFN) to improve its modeling capabilities.
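
To make the self-attention computation concrete, the following is a minimal single-head sketch in PyTorch (ViT and the transformer branch described below use the multi-head variant); the class name, shapes, and embedding size are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: each token's response is a weighted
    sum of the values at all positions, yielding long-range dependencies."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)  # joint Q, K, V projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), e.g. a sequence of flattened image patches
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (batch, N, N) pairwise scores
        attn = attn.softmax(dim=-1)                    # attention weights over all positions
        return attn @ v                                # weighted sum of values

tokens = torch.randn(1, 196, 64)   # e.g. 14x14 image patches with 64-dim embeddings
out = SelfAttention(64)(tokens)    # output shape matches the input: (1, 196, 64)
```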

In this paper, we propose a novel segmentation network with feature adaptive transformers, named FAT-Net, to tackle the challenging skin lesion segmentation task. Specifically, we implement a dual encoder with both CNN and transformer branches to simultaneously capture local features and global context information. Inspired by MultiResUNet (Ibtehaz and Rahman, 2020), we propose a feature adaptation module (FAM) to register the feature distributions between encoder and decoder. We also employ a simple squeeze-and-excitation (SE) module to re-weight the channels of the feature maps in the top layer; by activating the effective channels and restraining the useless ones, we can better fuse local and global contexts. After matching the feature distributions of encoder and decoder with our feature adaptation module (FAM), we can directly perform an element-wise summation to fuse the feature maps from the FAM and the corresponding layer in the decoder. Finally, we obtain the skin lesion segmentation predictions from our memory-efficient decoder. We evaluated our proposed FAT-Net on four public skin lesion datasets: ISIC 2016, ISIC 2017, ISIC 2018, and PH2. Ablation studies and comparisons with state-of-the-art methods are also conducted to demonstrate the effectiveness of the proposed method. Our contributions can be summarized as follows:
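
The channel re-weighting mentioned above follows the standard squeeze-and-excitation design (Hu et al., 2018, listed in the references); the sketch below is a minimal PyTorch version under that assumption, and the reduction ratio of 16 is only a common default, not necessarily the authors' setting.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: learn a per-channel weight in [0, 1] that
    activates informative channels and suppresses irrelevant ones."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); "squeeze" spatial dims into one descriptor per channel
        w = x.mean(dim=(2, 3))                       # global average pooling -> (B, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1)     # "excitation" -> channel weights
        return x * w                                 # re-weight the feature map

x = torch.randn(2, 64, 28, 28)
y = SEBlock(64)(x)   # same shape as x, channels re-scaled
```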

  • (1)

    We propose a novel segmentation network with feature adaptive transformers for automated skin lesion segmentation, named FAT-Net. We replace the traditional single-branch encoder with a dual encoder consisting of both CNN and transformer branches. By seamlessly integrating the two branches, our method can not only extract rich local features, but also capture the important global context information for skin lesion segmentation.

  • (2)

    We simultaneously employ three feature adaptation modules (FAMs) in the encoder-decoder structure to adaptively match the distributions of the corresponding feature maps in the encoder and decoder, which improves adaptive feature fusion and enables us to implement a memory-efficient decoder.

  • (3)

    We propose a novel memory-efficient decoder to fuse the features extracted by the parallel two-stream encoder, aggregating the rich local features and the global context information. Based on our feature adaptation module, we can directly perform an element-wise summation to fuse the feature maps from the FAM and the corresponding layer in the decoder. Compared with the concatenation operation used in each layer of a traditional decoder, our memory-efficient decoder saves roughly half of the computation and memory consumption, as the toy comparison after this list illustrates.
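
To see where the saving in (3) comes from, consider the comparison below: a 3×3 convolution placed after a concatenation of two C-channel maps must process 2C input channels, so it carries twice the weights and multiply-accumulates of the same convolution placed after an element-wise summation. The channel count and spatial size here are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

C = 64
skip = torch.randn(1, C, 56, 56)   # FAM-aligned features from the encoder
up   = torch.randn(1, C, 56, 56)   # upsampled features from the decoder

# Traditional decoder: concatenate, then convolve over 2C input channels.
conv_cat = nn.Conv2d(2 * C, C, kernel_size=3, padding=1)
y_cat = conv_cat(torch.cat([skip, up], dim=1))

# Memory-efficient decoder: element-wise summation keeps C input channels.
conv_sum = nn.Conv2d(C, C, kernel_size=3, padding=1)
y_sum = conv_sum(skip + up)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv_cat), count(conv_sum))   # 73792 vs 36928: roughly half the weights
```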

Section snippets

Skin lesion segmentation

Traditional skin lesion segmentation methods mainly relied on extracting and recognizing low-level image features. By investigating the color distributions among different tissues, histogram-based thresholding methods (Møllersen et al., 2010, Gomez et al., 2007, Yueksel and Borlu, 2009, Emre Celebi et al., 2013, Peruch et al., 2013) were developed to conduct skin lesion segmentation. To mimic the…

Methodology

The overview of our proposed FAT-Net for skin lesion segmentation is shown in Fig. 2. Our framework mainly consists of three components: a dual encoder to enhance feature encoding, a feature adaptation module embedded in the skip connections to promote the fusion of adjacent-level features, and a memory-efficient decoder to perform feature decoding layer by layer more effectively. Note that we still employ the symmetric encoder-decoder architecture with skip connections as the network…

Datasets

We conducted extensive experiments on four public skin lesion segmentation datasets to verify the effectiveness of our method. The first three datasets are provided by the International Skin Imaging Collaboration (ISIC): the ISIC 2016 dataset (Gutman et al., 2016), the ISIC 2017 dataset (Codella et al., 2018), and the ISIC 2018 dataset (Codella et al., 2019, Tschandl et al.,…

Discussion and limitations

While encoder-decoder architectures have been widely used in medical image segmentation, most of them contain only a single encoder and cannot simultaneously capture both local features and global context information. Due to these limitations in simultaneously extracting local features and global long-range dependencies, most existing methods still cannot accurately distinguish object pixels from background pixels, especially in challenging cases where…

Conclusion

In this paper, we presented a novel and efficient two-stream network with feature adaptive transformers, namely FAT-Net, aiming to address the challenging task of skin lesion segmentation. Different from the traditional CNN-based encoder, our transformer encoder utilizes a sequence-to-sequence prediction method to perform image segmentation. Compared with the convolution operation, our transformer encoder greatly increases the receptive field based on the self-attention mechanism, so…

CRediT authorship contribution statement

Huisi Wu: Conceptualization, Funding acquisition, Project administration, Methodology, Writing – original draft, Writing – review & editing. Shihuai Chen: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Guilian Chen: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Wei Wang: Conceptualization, Methodology, Writing – original draft, Writing – review & editing. Baiying Lei: Conceptualization, Funding acquisition, Project…

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported partly by the National Natural Science Foundation of China (Nos. 61973221, 61871274, 61801305, 61872351, and 81571758), the Natural Science Foundation of Guangdong Province, China (Nos. 2018A030313381 and 2019A1515011165), the COVID-19 Prevention Project of Guangdong Province, China (No. 2020KZDZX1174), the Major Project of the New Generation of Artificial Intelligence (No. 2018AAA0102900), and the Shenzhen Key Basic Research Project (Nos. JCYJ20180507184647636,…

References (66)

  • N. Abraham et al., A novel focal Tversky loss function with improved attention U-Net for lesion segmentation, 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 2019.
  • L. Bi et al., Dermoscopic image segmentation via multistage fully convolutional networks, IEEE Trans. Biomed. Eng., 2017.
  • N. Carion et al., End-to-end object detection with transformers, European Conference on Computer Vision, 2020.
  • L.-C. Chen et al., DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
  • N. Codella et al., Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC), arXiv preprint arXiv:1902.03368, 2019.
  • N.C. Codella et al., Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC), 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), 2018.
  • A. Dosovitskiy et al., An image is worth 16x16 words: transformers for image recognition at scale, International Conference on Learning Representations, 2021.
  • M. Emre Celebi et al., Border detection in dermoscopy images using statistical region merging, Skin Research and Technology, 2008.
  • M. Emre Celebi et al., Lesion border detection in dermoscopy images using ensembles of thresholding methods, Skin Research and Technology, 2013.
  • S. Feng et al., CPFNet: context pyramid fusion network for medical image segmentation, IEEE Trans. Med. Imaging, 2020.
  • R. Garnavi et al., Automatic segmentation of dermoscopy images using histogram thresholding on optimal color channels, International Journal of Medicine and Medical Sciences, 2010.
  • Z. Ge et al., Skin disease recognition using deep saliency features and multimodal learning of dermoscopy and clinical images, International Conference on Medical Image Computing and Computer-Assisted Intervention, 2017.
  • D.D. Gomez et al., Independent histogram pursuit for segmentation of skin lesions, IEEE Trans. Biomed. Eng., 2007.
  • M. Goyal et al., Skin lesion segmentation in dermoscopic images with ensemble deep learning methods, IEEE Access, 2019.
  • D. Gutman et al., Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC), arXiv preprint arXiv:1605.01397, 2016.
  • K. He et al., Mask R-CNN, Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • D. Hendrycks et al., Gaussian error linear units (GELUs), arXiv preprint arXiv:1606.08415, 2016.
  • J. Hu et al., Squeeze-and-excitation networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • G. Huang et al., Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • D. Jha et al., DoubleU-Net: a deep convolutional neural network for medical image segmentation, 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), 2020.
  • D. Jha et al., Kvasir-SEG: a segmented polyp dataset, International Conference on Multimedia Modeling, 2020.
  • D. Jha et al., ResUNet++: an advanced architecture for medical image segmentation, 2019 IEEE International Symposium on Multimedia (ISM), 2019.