1 Introduction

Melanoma is one of the most rapidly growing cancers worldwide [36]. Although melanoma accounts for only 1% of skin cancers, it causes 75% of skin cancer deaths [37]. The diagnosis of melanoma has therefore attracted much attention. Studies show that the five-year survival rate exceeds 95% when melanoma is diagnosed early, but drops below 20% when it is diagnosed late [31]. Timely and accurate diagnosis of melanoma is therefore a critical task.

Nowadays, dermoscopy images are widely used for clinical melanoma diagnosis. However, correctly interpreting a dermoscopy image requires extensive clinical experience and is laborious. In addition, interpretation suffers from intra- and inter-observer variation, not to mention errors caused by fatigue. Therefore, many computer-aided approaches have been proposed to help dermatologists diagnose melanoma accurately and efficiently.

The automatic segmentation of skin lesions in dermoscopy images is critical for melanoma diagnosis [29]. Accurate skin lesion segmentation is essential for extracting the discriminative features used to classify melanoma. Recently, deep learning based methods have achieved great improvements in image segmentation [4, 23, 27]. However, building a fully automatic skin lesion segmentation network based on Deep Convolutional Neural Networks (DCNNs) remains challenging, for three main reasons. First, the training data is limited, which makes DCNNs hard to train well. Second, dermoscopy images suffer from a label imbalance problem: many skin lesion areas are small, so the recall rate of the skin lesions must be kept high, otherwise many small lesions will be missed. Third, skin lesions vary greatly in appearance, including size, shape, location and color, and present blurred and irregular boundaries. Moreover, many artifacts, such as hair, rulers, ink marks, blood vessels, or air bubbles, make automatic skin lesion segmentation in dermoscopy images even more difficult, as shown in Fig. 1. Clinically, the shape of the skin lesion is a key factor for melanoma identification [29]. In the skin lesion segmentation task, the Jaccard index score (JAC) measures the shape precision of the segmented lesions and the Sensitivity score (SEN) measures their recall rate. An automatic skin lesion segmentation method with high JAC and SEN scores is therefore very important for melanoma recognition.

Fig. 1 Challenges in automatic segmentation of skin lesions in dermoscopy images. a small size and blood vessels; b air bubbles; c low contrast with the surrounding skin; d large variation in size and hair covering

In DCNN-based image segmentation, many studies [4, 18, 35] have shown that skip connections between layers at different levels can produce meaningful semantic segmentation. Because objects in medical images vary greatly in appearance, U-Net [32] is widely used; it employs skip connections to feed the decoding path with detailed boundary information from the encoding path. However, because dermoscopy images contain many artifacts, the skip connection can also pass irrelevant artifact features to the decoding path, which degrades segmentation performance. Recently, Dakhia et al. [14] employed the Pyramid Pooling Module (PPM) to extract both the local and global contexts of the intermediate convolutional layers (Inter-CLs) simultaneously for salient object detection. The local context contains detailed boundary information, while the global context locates the salient object and diminishes the impact of the cluttered background. Because the receptive fields of the Inter-CLs differ, they manually set the pooling scales in the corresponding PPMs and then concatenated the PPM outputs to form the final prediction. However, since skin lesions vary greatly in size and shape, manually setting the pooling scales is sub-optimal; moreover, the pooling operations lose detailed boundary information and small lesions. In this paper, we propose a novel Scale-Att-ASPP module. In the Scale-Att-ASPP module, the ASPP captures both the local and global contexts of the Inter-CL in the encoding path simultaneously, without reducing the feature map resolution; this preserves detailed boundary information and small lesions and is suitable for multi-scale skin lesion segmentation. The parallel outputs of the ASPP are re-weighted and aggregated by attention maps generated from the concatenation of those outputs, so that the optimal scale of the skin lesion feature, with irrelevant artifact features diminished, is selected automatically for each Inter-CL, instead of being fixed by manually set dilation rates in the ASPP. Finally, the output of the Scale-Att-ASPP module is added pixel-wise to the same-level layer in the decoding path, which improves the final predicted segmentation.

To address the label imbalance problem, instead of using a pixel-wise re-weighting loss, which incurs high computational complexity, we employ the Jaccard distance loss [43], which aims to maximize the overlap between the foreground of the ground-truth label map and that of the predicted segmentation map. Our network is adversarially trained with the multi-scale L1 loss [40], which matches the multi-scale feature maps extracted by the convolutional layers of the critic network between the ground-truth segmentation and the predicted segmentation. The adversarial multi-scale L1 loss therefore guides the Scale-Att-ASPP module to learn to select the optimal scale of the skin lesion feature.

Our contributions are listed as follows.

  1. A novel Scale-Att-ASPP module is proposed, which automatically selects the optimal scale of the skin lesion feature, with irrelevant artifact features diminished, for each Inter-CL in the encoding path. Introducing the output of the Scale-Att-ASPP module to the same-level layer in the decoding path yields meaningful semantic segmentation.

  2. In the Scale-Att-ASPP module, the parallel outputs of the ASPP are re-weighted by pixel-wise scale attention maps and aggregated by pixel-wise addition, instead of the traditional concatenation, which incurs high computing cost and memory usage. Adding the output of the Scale-Att-ASPP module to the same-level layer of the decoding path by pixel-wise addition through the skip connection further reduces the time cost. In our experiments, the proposed segmentation network therefore has the lowest time cost among recent studies.

  3. We employ the Jaccard distance loss combined with the adversarial multi-scale L1 loss to adversarially train our network, which addresses the label imbalance problem and, at the same time, guides the proposed Scale-Att-ASPP module to learn to automatically select the optimal scale of the skin lesion feature. In addition, it stabilizes training and avoids overfitting when training on small datasets.

  4. We extensively evaluated our network on three datasets. The results show that it significantly improves segmentation performance compared with other state-of-the-art methods, especially in the JAC and SEN scores. Moreover, it runs efficiently and is robust across datasets.

The rest of this paper is organized as follows. Section 2 reviews related work; Section 3 details the proposed method; Section 4 describes the experimental design, validates the key modules of the proposed network, and compares it with other state-of-the-art methods; finally, Section 5 discusses the results and Section 6 concludes the paper.

2 Related works

2.1 Skin lesion segmentation models

For the skin lesion segmentation task, traditional approaches such as region growing [9], thresholding [44], clustering [22] and active contours [26] have been proposed. However, these approaches rely on hand-crafted features, which are not discriminative enough because the appearance of skin lesions varies greatly. More recently, DCNN-based approaches have developed rapidly and significantly improved skin lesion segmentation performance. Yuan et al. [42] designed a deep network with small convolution kernels to enhance the feature representation; they augmented the ISBI 2017 training set using different color spaces and employed a dual-thresholds post-processing method to refine the DCNN output into the final predicted segmentation. They also used the Jaccard distance loss proposed in their previous work [43] to address the label imbalance problem. However, both the use of different color spaces and the post-processing are time-consuming; in addition, the threshold values at the post-processing stage are difficult to set manually and may reduce the robustness of the network. Bi et al. [7] adopted the ResNet [19] architecture in a Fully Convolutional Network (FCN) for skin lesion segmentation and collected a large amount of extra training data; however, such data are difficult to obtain, and annotating them requires expertise and is time-consuming. Recently, Mohammed et al. [3] designed a novel network that extracts a full-resolution feature for every pixel of the input image, without reducing the feature resolution, so that the features of multi-scale skin lesions keep flowing between layers; however, avoiding pooling altogether incurs high computing costs. SLSDeep [34], based on PSPNet [46], uses dilated residual blocks in the last few convolution layers of the encoding path to gain global-level context while keeping the feature resolution unchanged, and a PPM after the last convolutional layer of the encoding path to extract multi-scale global context. Although SLSDeep [34] improves segmentation performance, its SEN score is low.

2.2 Multi-level context fusion

The common FCN loses detailed boundary information through the successive strided convolutions and pooling operations in the encoding path; as a result, the network produces segmentation outputs with blurred boundaries, and the features of small objects are heavily lost. To solve these problems, multi-level context fusion is usually employed to obtain meaningful semantic segmentation. There are three main technical lines.

The first line uses multiple parallel pathways to produce feature maps at different scales [16, 21]; the predicted segmentation is then generated by aggregating these feature maps. The disadvantages of this line are its high memory usage and the limited scale range of the aggregated feature maps, so it cannot meet the need of extracting features from skin lesions, which vary greatly in size and shape.

The second line combines multi-level layers through skip connections within a network [12, 18, 32, 35]. DeepLabv3+ [12] and U-Net [32] employ skip connections between the encoding path and the decoding path to enrich the final predicted segmentation with detailed boundary information, yielding segmentation outputs with meaningful semantic information. In [22], a U-Net based histogram equalization approach outperforms a clustering-based approach for skin lesion segmentation; the method applies a two-dimensional Gaussian filter centered on the training images in the pre-processing stage to diminish the impact of artifacts. However, some lesions lie along the edges of the dermoscopy images, and a Gaussian filter with a fixed kernel size cannot suit all training images, so discriminative features cannot be well extracted; in addition, the complex pre-processing is time-consuming.

The third line uses the PPM [14, 34, 46] or ASPP [11, 12] to extract both the local and global contexts of a convolution layer at the same time; the multi-level contexts are then concatenated to improve the segmentation output. The PPM employs different pooling scales to extract multi-level contexts; however, the pooling operations lose detailed boundary information and small objects. The ASPP uses parallel dilated convolutions with different rates to capture multi-level contexts without reducing the resolution, so detailed boundary information and small objects are preserved, which suits multi-scale skin lesion segmentation. However, the concatenation used in the ASPP incurs high memory usage and computing cost; moreover, it cannot re-weight the different-level contexts to select the optimal scale of the skin lesion feature maps.

2.3 Attention mechanism

The attention mechanism has become an appealing approach in deep feature learning because it focuses the network on what matters. It helps the network extract discriminative features and accelerates convergence, which is suitable for training on small datasets. Hu et al. [20] proposed a channel-wise attention module that recalibrates feature map channels to emphasize the useful ones. Because higher layers carry more global semantic context as the network deepens, Wang et al. [38] designed stacked residual attention modules that focus on the features of the regions of interest (ROIs) in lower layers, gated by higher-layer feature maps. Similarly to [38], Zhang et al. [45] employed stacked attention-based residual blocks for skin lesion classification. In [10], several scaled versions of the original image are fed to a shared deep network to obtain multi-scale feature maps of the original image, which are then concatenated to generate pixel-wise scale attention maps for multi-scale object segmentation. However, resizing the original image and generating the corresponding multi-scale feature maps is time-consuming. Oktay et al. [28] employed the high-level context captured by the bottleneck layer as a "global guidance" to make the low-level layers focus on the ROIs.

2.4 GAN

In a GAN, the generator tries to generate a data distribution that fools the discriminator into considering it real, while the discriminator tries to separate the generated data from the real data. During adversarial training, the generator and the discriminator alternately minimize their own loss functions, so training is unstable. To stabilize it, Salimans et al. [33] proposed the adversarial feature matching loss, which matches the features extracted by the discriminator's convolutional layers between generated and real data, instead of only maximizing the discriminator's output. Both [8] and [39] employed the feature matching loss of [33] to stabilize adversarial training, aiming to synthesize high-resolution natural images and dermoscopy images, respectively, from corresponding semantic pixel-wise label maps. Similarly to [33], Xue et al. [40] proposed the adversarial multi-scale L1 loss computed by the critic network, which forces the segmentation network to capture discriminative multi-scale features while keeping adversarial training stable. Because skin lesions vary greatly in size and shape, and dermoscopy images contain many artifacts, we employ the multi-scale L1 loss to guide the proposed Scale-Att-ASPP module to learn to automatically select the optimal scale of the skin lesion feature with irrelevant artifact features diminished. Moreover, the adversarial multi-scale L1 loss introduces strong regularization, which helps our segmentation network avoid over-fitting.

3 Methods

Our network is a GAN consisting of a segmentation network and a critic network, as shown in Fig. 2. The implementation details of the proposed network are given below.

Fig. 2 The architecture of our proposed network, which includes the segmentation network and the critic network

3.1 Segmentation network

Our segmentation network consists of a ResNet34-based encoding path, a decoding path, Scale-Att-ASPP based skip connections, and a PPM on top of the last convolution layer of the encoding path; see Fig. 2 for details.

3.1.1 Encoding path

The encoding path is based on ResNet34 pre-trained on ImageNet [15], without the average-pooling and fully connected layers. It takes the dermoscopy image, resized to 192 × 256, as input and contains five layers, as shown in Fig. 2. Layer1 (dark green arrow) is a 7 × 7, stride-2 convolution; each of the other four layers (Layer2 ~ 5, light blue arrows) contains a max pooling operation followed by several Residual blocks (3, 4, 6 and 3 blocks for Layer2 ~ 5, respectively). Each Residual block comprises two 3 × 3 convolutions and a residual connection, as shown in Fig. 2; the residual connection helps the deep network avoid vanishing gradients during training. Inside each layer, the feature resolution stays the same; the max pooling operation increases the receptive field step by step along the encoding path. The details of the encoding path are given in Table 1.

Table 1 The architecture of the encoding path. Note: the bracket denotes the Residual block as shown in Fig. 2
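For concreteness, the following is a minimal PyTorch sketch of this layer layout: a stride-2 stem followed by stages of max pooling plus residual blocks at constant resolution. The BN/ReLU placement, the fixed channel width, and the omission of the per-stage channel transitions and pre-trained weight loading are simplifying assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut, as in Fig. 2."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(self.body(x) + x)

def encoder_layer(ch, n_blocks):
    # One of Layer2~5: max pooling to enlarge the receptive field,
    # then residual blocks at constant resolution.
    return nn.Sequential(nn.MaxPool2d(2),
                         *[ResidualBlock(ch) for _ in range(n_blocks)])

# Layer1: 7x7, stride-2 convolution; block counts per stage are (3, 4, 6, 3).
layer1 = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
                       nn.BatchNorm2d(64), nn.ReLU(inplace=True))
layers2_to_5 = nn.ModuleList([encoder_layer(64, n) for n in (3, 4, 6, 3)])
```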

3.1.2 Scale-Att-ASPP module

As shown in Fig. 2, a Scale-Att-ASPP module based skip connection is inserted between each of the three Inter-CLs (Layer2 ~ 4) in the encoding path and the same-level layer in the decoding path; the topmost skip connection remains a direct connection. The proposed Scale-Att-ASPP module contains the ASPP module, the pixel-wise scale attention module, and the aggregation module; see Fig. 3 for details.

Fig. 3 The architecture of our proposed Scale-Att-ASPP module

As the network goes deeper, the receptive field of the Inter-CL (denoted RF) grows, so our Scale-Att-ASPP module can capture different-level contexts of the input image. The ASPP takes the corresponding Inter-CL as input and uses four parallel dilated convolutions with rates r ∈ {1, 3, 5, 7} to extract both the local and global contexts of the Inter-CL; each dilated convolution is followed by BN, ReLU, 1 × 1 convolution, and BN layers. The output of a dilated convolution is denoted Fr, with receptive field RFr, which can be expressed as:

$$ {RF}^r= RF+r\times \left(k-1\right)\times s $$
(1)

where k and s denote the kernel size and stride of the dilated convolution, respectively. In our experiments, we set k = 3 and s = 1. Equation (1) shows that as r grows, RFr grows, so less detailed boundary information and more global context are captured. Therefore, both the local and global contexts of the Inter-CL are acquired. The local context contains more detailed boundary information, which helps the predicted segmentation gain sharp boundaries, while the global context locates the skin lesions and diminishes the impact of the artifacts.
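As an illustrative numerical example (the base receptive field RF = 7 is an assumed value, not one reported here), substituting k = 3 and s = 1 into Eq. (1) gives:

$$ {RF}^r=7+2r\kern1em \Rightarrow \kern1em {RF}^1=9,\ {RF}^3=13,\ {RF}^5=17,\ {RF}^7=21 $$

so the four parallel branches observe progressively larger contexts of the same Inter-CL.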

Because the sizes of skin lesions vary greatly and the Inter-CLs have different receptive fields, fixing the dilation rates in the ASPP is suboptimal. Instead of manually adjusting the dilation rates, we propose a novel pixel-wise scale attention module, which automatically selects the optimal scale for the skin lesion feature in the Inter-CL. As shown in Fig. 3, the pixel-wise scale attention module takes the concatenation (Concat) of the outputs of the four parallel dilated convolutions in the ASPP as input; 3 × 3 convolution (Conv3), ReLU, 1 × 1 convolution (Conv1), and channel-wise softmax operations follow to generate the soft pixel-wise scale attention weight maps W, in which Wr is the weight map corresponding to Fr for r ∈ {1, 3, 5, 7}. W is defined as (2):

$$ W=\mathrm{softmax}\left(\mathrm{Conv}1\left(\mathrm{Relu}\left(\mathrm{Conv}3\left(\mathrm{Concat}\left({F}^1,{F}^3,{F}^5,{F}^7\right)\right)\right)\right)\right) $$
(2)

Each Fr is then re-weighted by the corresponding Wr, and the results are aggregated by pixel-wise addition instead of the traditional concatenation, which reduces computing cost and memory usage. The output of the Scale-Att-ASPP module, denoted F, is thus formed as:

$$ F=\sum \limits_{r\in \left\{1,3,5,7\right\}}{W}^r\times {F}^r $$
(3)

In this way, the optimal scale of the skin lesion feature, with irrelevant artifact features diminished, is automatically obtained in F. F is then added pixel-wise to the same-level layer in the decoding path through the skip connection, which improves the predicted segmentation.
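The following PyTorch sketch summarizes Eqs. (2)-(3); it is a minimal illustration that assumes each branch keeps the input channel width, not the released implementation.

```python
import torch
import torch.nn as nn

class ScaleAttASPP(nn.Module):
    """Sketch of the Scale-Att-ASPP module (Fig. 3, Eqs. (2)-(3)):
    four dilated branches, a pixel-wise scale attention head, and
    attention-weighted pixel-wise aggregation."""
    def __init__(self, channels, rates=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 1, bias=False), nn.BatchNorm2d(channels))
            for r in rates])
        # Attention head: Conv3 -> ReLU -> Conv1 -> channel-wise softmax (Eq. (2)).
        self.att = nn.Sequential(
            nn.Conv2d(channels * len(rates), channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, len(rates), 1))

    def forward(self, x):
        feats = [b(x) for b in self.branches]                        # F^r, r in {1,3,5,7}
        w = torch.softmax(self.att(torch.cat(feats, dim=1)), dim=1)  # W (Eq. (2))
        # Re-weight each F^r by W^r and aggregate by pixel-wise addition (Eq. (3)).
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
```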

The soft attention mechanism used in the proposed Scale-Att-ASPP module is similar to [20]; both are self-attention mechanisms. However, unlike [20], which uses global pooling to obtain the global context of the feature maps and generates a weight vector to adaptively recalibrate channel-wise feature responses, our attention mechanism generates spatial pixel-wise scale attention weight maps to re-weight the parallel outputs of the ASPP. These attention weight maps are generated from the concatenation of the parallel outputs of the ASPP, which contains the different-level contexts of the Inter-CL, instead of only the global level. Our experiments show that applying only a fixed-kernel convolution to the Inter-CL to generate the attention weight maps is suboptimal for skin lesion segmentation, because the size and shape of skin lesions vary greatly and artifacts exist in the surrounding background; see Section 4.3.3 for details.

3.1.3 Decoding path

The decoding path includes several up-sampling layers, which recover the feature resolution step by step. Inside each up-sampling layer, a 3 × 3, stride-2 deconvolution up-samples the feature map. Before the deconvolution, a 1 × 1 convolution reduces the channel dimension by a factor of 4; after the deconvolution, another 1 × 1 convolution matches the channel number to that of the F coming from the corresponding skip connection. Finally, the decoding path outputs the predicted segmentation, which has the same resolution as the original input, as shown in Fig. 2.
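A sketch of one such up-sampling layer is given below; the padding and output-padding values are assumptions chosen so that the stride-2 deconvolution exactly doubles the resolution.

```python
import torch.nn as nn

class UpLayer(nn.Module):
    """One decoding layer: 1x1 conv reducing channels by 4, a 3x3
    stride-2 deconvolution, then a 1x1 conv matching the channels of
    the skip feature F, which is finally added pixel-wise."""
    def __init__(self, in_ch, skip_ch):
        super().__init__()
        mid = in_ch // 4
        self.reduce = nn.Conv2d(in_ch, mid, 1)
        self.up = nn.ConvTranspose2d(mid, mid, 3, stride=2,
                                     padding=1, output_padding=1)
        self.match = nn.Conv2d(mid, skip_ch, 1)

    def forward(self, x, skip):
        x = self.match(self.up(self.reduce(x)))
        return x + skip  # pixel-wise addition through the skip connection
```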

3.1.4 The PPM

To enhance the feature representation, PSPNet [46] proposed a PPM, which applies parallel pooling operations with different kernel sizes on top of the last convolution layer of the encoding path to capture multi-scale high-level contexts. Since skin lesions vary greatly in size, our network also employs a PPM, as shown in Fig. 2. Inside the PPM, four parallel Max Pooling operations with kernel sizes of 2, 3, 5 and 6 are applied on top of the output of Layer5 in the encoding path. The output of each Max Pooling operation is processed by a 1 × 1 convolution to reduce the channel dimension from 512 to 1, and is up-sampled by bilinear interpolation to the same resolution as the output of Layer5. The four up-sampled outputs are then concatenated, and finally the original output of Layer5 is concatenated with them through a residual connection to form the final output of the PPM. See Fig. 4 for the details of the architecture of the PPM.

Fig. 4 The architecture of the PPM
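A minimal sketch of this PPM follows; the pooling strides (defaulting to the kernel sizes) and the align_corners choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Sketch of the PPM (Fig. 4): four max-pooling branches (kernel
    sizes 2, 3, 5, 6), each reduced to one channel by a 1x1 conv and
    bilinearly up-sampled to the Layer5 resolution, then concatenated
    with the original Layer5 output via the residual connection."""
    def __init__(self, in_ch=512, sizes=(2, 3, 5, 6)):
        super().__init__()
        self.pools = nn.ModuleList([nn.MaxPool2d(s) for s in sizes])
        self.convs = nn.ModuleList([nn.Conv2d(in_ch, 1, 1) for _ in sizes])

    def forward(self, x):
        h, w = x.shape[2:]
        branches = [F.interpolate(conv(pool(x)), size=(h, w),
                                  mode='bilinear', align_corners=False)
                    for pool, conv in zip(self.pools, self.convs)]
        return torch.cat([x] + branches, dim=1)
```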

3.2 Critic network

The architecture of the critic network is shown in Fig. 5. The critic network contains six convolution layers. Instead of pooling, each convolution layer uses a stride of 2 to gradually increase the receptive field and extract multi-scale feature maps of the skin lesion. Batch Normalization and Leaky ReLU follow each convolution. The original input image I is masked by the predicted segmentation map (the output of our segmentation network, denoted S(I)) and by the ground-truth label map y separately to form the paired inputs of the critic network. The convolution layers, with their different receptive fields, extract multi-scale features of the paired inputs. At each layer, the feature distance between the paired inputs is computed as the L1 distance. The multi-scale L1 loss L(θS,θD) is then formed by averaging the feature distances over all layers of the critic network, as expressed in (4):

$$ {\displaystyle \begin{array}{c}L\left({\theta}_S,{\theta}_D\right)=\frac{1}{N}\sum \limits_{n=1}^N{L}_1\left({f}_C^n\left(I\circ S(I)\right),{f}_C^n\left(I\circ y\right)\right)\\ {}=\frac{1}{N}\sum \limits_{n=1}^N{\left\Vert {f}_C^n\left(I\circ S(I)\right)-{f}_C^n\left(I\circ y\right)\right\Vert}_1\end{array}} $$
(4)

where N denotes the number of layers in the critic network; the symbol ∘ denotes the mask operation; \( {f}_C^n \) is the feature map extracted by the n-th layer of the critic network; and θS and θD denote the parameters of the segmentation network and the critic network, respectively.
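A sketch of Eq. (4) follows; `critic.features` returning the list of per-layer feature maps is an assumed interface, and the mean reduction over feature elements is one common reading of the L1 norm.

```python
import torch

def multiscale_l1_loss(critic, image, pred_mask, gt_mask):
    """Multi-scale L1 loss, Eq. (4): average L1 distance between the
    critic's per-layer features of the image masked by the predicted
    segmentation and by the ground-truth label map."""
    feats_pred = critic.features(image * pred_mask)  # f_C^n(I ∘ S(I))
    feats_gt = critic.features(image * gt_mask)      # f_C^n(I ∘ y)
    dists = [torch.mean(torch.abs(fp - fg))
             for fp, fg in zip(feats_pred, feats_gt)]
    return sum(dists) / len(dists)
```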

Fig. 5 The architecture of the critic network. The parameters of each convolution layer are denoted k: kernel size; n: number of output channels; s: stride; p: padding

3.3 Loss function

During adversarial training, maximizing L(θS,θD) when training the critic network forces the segmentation network to extract discriminative multi-scale features of the skin lesion, and guides the proposed Scale-Att-ASPP module to learn to select the optimal scale of the skin lesion feature with irrelevant artifact features diminished.

To cope with label imbalance while avoiding pixel re-weighting, which is computationally expensive, we employ the Jaccard distance loss [43], expressed as (5):

$$ {L}_{jaccard}=1-\frac{\sum \limits_{i=1}^K{p}_i{g}_i}{\sum \limits_{i=1}^K{p}_i^2+\sum \limits_{i=1}^K{g}_i^2-\sum \limits_{i=1}^K{p}_i{g}_i} $$
(5)

where K denotes the total number of pixels in the image, and pi and gi are the values of pixel i in the predicted segmentation map and in the corresponding ground-truth label map, respectively. Our loss function \( {\mathcal{L}}_s \) combines Ljaccard and L(θS,θD) to adversarially train the segmentation network; \( {\mathcal{L}}_s \) is expressed as (6):

$$ {\mathcal{L}}_s=\frac{1}{M}\sum \limits_{m=1}^M\left({L}_{jaccard}+L\left({\theta}_S,{\theta}_D\right)\right) $$
(6)

where M is the number of images in the dataset. Our segmentation network is adversarially trained by minimizing \( {\mathcal{L}}_s \).
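A sketch of Eqs. (5)-(6), reusing `multiscale_l1_loss` from Section 3.2; the eps stabilizer and the batch-mean reduction are assumptions.

```python
import torch

def jaccard_distance_loss(pred, gt, eps=1e-7):
    """Jaccard distance loss, Eq. (5). `pred` holds per-pixel foreground
    probabilities; `gt` is the binary ground-truth map."""
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred ** 2).sum(dim=(1, 2, 3)) + (gt ** 2).sum(dim=(1, 2, 3)) - inter
    return (1.0 - inter / (union + eps)).mean()

def total_loss(pred, gt, critic, image):
    # L_s = L_jaccard + L(theta_S, theta_D), averaged over the batch (Eq. (6)).
    return jaccard_distance_loss(pred, gt) + multiscale_l1_loss(critic, image, pred, gt)
```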

4 Experiment

Our network is trained on the public dermoscopy image dataset ISBI 2017 [13] for skin lesion segmentation, which contains 2000, 150 and 600 images for training, validation and testing, respectively. The public ISBI 2016 [17] and PH2 [25] dermoscopy datasets are used to evaluate the robustness of the proposed network. ISBI 2016 contains 900 training images and 379 test images; the independent PH2 dataset contains 200 images.

4.1 Pre-processing and training

Our network is trained end-to-end with minimal pre-processing, as follows. First, all input images are resized to 192 × 256. Second, to augment the training data during training, color jitter is applied by randomly adjusting the brightness (0 to 0.6), contrast (0 to 1), saturation (0 to 0.3) and hue (0 to 0.1) of the image; the image is then flipped vertically or horizontally, and rotated by −90° or 90° with probability 0.5. Third, channel-wise normalization is applied.
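A torchvision sketch of this pipeline is shown below; mapping the stated jitter ranges onto the ColorJitter arguments and using ImageNet channel statistics (since the encoder is ImageNet pre-trained) are assumptions.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((192, 256)),
    transforms.ColorJitter(brightness=0.6, contrast=1.0,
                           saturation=0.3, hue=0.1),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    # Rotate by exactly -90 or +90 degrees with probability 0.5.
    transforms.RandomApply([transforms.RandomChoice(
        [transforms.RandomRotation((90, 90)),
         transforms.RandomRotation((-90, -90))])], p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```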

We implemented our network in PyTorch and trained it on a GeForce GTX Titan X GPU. Because the dermoscopy datasets are small, we initialized the encoding path of the segmentation network with the parameters of ResNet34 trained on ImageNet. The segmentation network and the critic network are alternately trained for 200 epochs with a batch size of 36, using the Adam optimizer with β1 = 0.5 and β2 = 0.999. The learning rates of both networks are initialized to 0.0002 and decayed by 0.5 every 25 epochs. These settings are used for all experiments in this paper.
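These hyper-parameters translate into the following sketch; the alternating update order and the negated critic loss are assumptions consistent with Sections 3.2-3.3, and `seg_net`, `critic`, and `loader` are placeholders.

```python
import torch

opt_s = torch.optim.Adam(seg_net.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_c = torch.optim.Adam(critic.parameters(), lr=2e-4, betas=(0.5, 0.999))
sch_s = torch.optim.lr_scheduler.StepLR(opt_s, step_size=25, gamma=0.5)
sch_c = torch.optim.lr_scheduler.StepLR(opt_c, step_size=25, gamma=0.5)

for epoch in range(200):
    for image, gt in loader:  # batch size 36
        # Critic step: maximize the multi-scale L1 distance.
        pred = seg_net(image).detach()
        loss_c = -multiscale_l1_loss(critic, image, pred, gt)
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        # Segmentation step: minimize L_s (Eq. (6)).
        pred = seg_net(image)
        loss_s = total_loss(pred, gt, critic, image)
        opt_s.zero_grad(); loss_s.backward(); opt_s.step()
    sch_s.step(); sch_c.step()
```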

4.2 Performance evaluation metrics

To evaluate the performance of our segmentation network, we used the metrics of the ISBI 2017 challenge, as follows. The Jaccard index (JAC) measures the overlap between the skin lesion regions segmented by our network and those of the ground-truth label map; the Dice coefficient (DIC) also measures the similarity between the predicted skin lesion regions and the ground truth; the Sensitivity (SEN) measures the recall rate of the skin lesion regions; the Accuracy (ACC) measures the overall classification accuracy over all pixels; and the Specificity (SPE) measures how correctly the non-lesion regions are segmented. See (7), (8), (9), (10), and (11) for details, respectively. Note that we apply a threshold of 0.5 to produce the final predicted segmentation label map.

$$ \mathrm{JAC}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FP}+\mathrm{FN}\right) $$
(7)
$$ \mathrm{DIC}=\left(2\times \mathrm{TP}\right)/\left(2\times \mathrm{TP}+\mathrm{FN}+\mathrm{FP}\right) $$
(8)
$$ \mathrm{SEN}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right) $$
(9)
$$ \mathrm{ACC}=\left(\mathrm{TP}+\mathrm{TN}\right)/\left(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}\right) $$
(10)
$$ \mathrm{SPE}=\mathrm{TN}/\left(\mathrm{TN}+\mathrm{FP}\right) $$
(11)

where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives, respectively.
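The five metrics follow directly from the confusion counts; a minimal NumPy sketch, with the 0.5 threshold described above:

```python
import numpy as np

def segmentation_metrics(pred, gt, threshold=0.5):
    """Eqs. (7)-(11) computed from the binary confusion counts."""
    p = pred >= threshold
    g = gt.astype(bool)
    tp = np.sum(p & g)
    fp = np.sum(p & ~g)
    fn = np.sum(~p & g)
    tn = np.sum(~p & ~g)
    return {'JAC': tp / (tp + fp + fn),
            'DIC': 2 * tp / (2 * tp + fn + fp),
            'SEN': tp / (tp + fn),
            'ACC': (tp + tn) / (tp + tn + fp + fn),
            'SPE': tn / (tn + fp)}
```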

4.3 Key modules validation

4.3.1 Scale-Att-ASPP

To verify the effectiveness of the proposed attention mechanism, we experimentally compared the Scale-Att-ASPP module based segmentation network (referred to as Scale-Att-ASPP, Fig. 2) with an ASPP module based segmentation network (referred to as ASPP-net), obtained by removing the pixel-wise scale attention module from the Scale-Att-ASPP module and replacing its aggregation module with direct concatenation of the parallel ASPP outputs. Both Scale-Att-ASPP and ASPP-net were trained with the Jaccard distance loss, without adversarial training. The JAC and ACC curves of both networks on the ISBI 2017 Validation Set during training are shown in Fig. 6a and b, respectively. Both the JAC and ACC curves of Scale-Att-ASPP (red lines) are higher, and rise more quickly and smoothly, than those of ASPP-net (blue lines). The Scale-Att-ASPP module therefore remarkably improves segmentation performance and accelerates the convergence of our network. The comparison on the ISBI 2017 Test Set is shown in Table 2: the ACC (0.9164), DIC (0.8732), JAC (0.7964) and SPE (0.9346) scores of Scale-Att-ASPP exceed those of ASPP-net (0.9045, 0.8671, 0.7895 and 0.8497, respectively). This confirms that by re-weighting and aggregating the parallel outputs of the ASPP, the Scale-Att-ASPP module automatically selects the optimal scale of the skin lesion feature with irrelevant artifact features diminished in the Inter-CL, whereas ASPP-net, which uses only fixed-size kernels, is not flexible enough for multi-scale skin lesion segmentation. The proposed Scale-Att-ASPP module therefore significantly improves segmentation performance.

Fig. 6 The JAC score (a) and ACC score (b) comparisons among ASPP-net, Scale-Att-ASPP, and Scale-Att-ASPP-adv on the Validation Set of ISBI 2017 during training

Table 2 Comparison results of ASPP-net, Scale-Att-ASPP, and Scale-Att-ASPP-adv on the Test Set of ISBI 2017

4.3.2 Adversarial training

To further improve segmentation performance, we trained Scale-Att-ASPP adversarially (referred to as Scale-Att-ASPP-adv) by introducing the critic network, and compared it with the non-adversarially trained Scale-Att-ASPP. The JAC and ACC curves of both networks on the ISBI 2017 Validation Set during training are shown in Fig. 6a and b, respectively. As shown in Fig. 6a, although the JAC curve of Scale-Att-ASPP-adv (green line) is slightly lower than that of Scale-Att-ASPP (red line) in the early training stage, it overtakes it as training continues. This is because the adversarial multi-scale L1 loss introduced by the critic network provides strong regularization during training, which prevents our segmentation network from over-fitting and allows it to be well optimized. As shown in Fig. 6b, the ACC curve of Scale-Att-ASPP-adv (green line) is significantly higher, and rises more quickly and smoothly, than that of Scale-Att-ASPP (red line).

The comparison between Scale-Att-ASPP-adv and Scale-Att-ASPP on the ISBI 2017 Test Set is also shown in Table 2. The ACC (0.9316), DIC (0.8781), JAC (0.8028) and SEN (0.8697) scores of Scale-Att-ASPP-adv exceed those of Scale-Att-ASPP (0.9164, 0.8732, 0.7964 and 0.8524, respectively). This confirms that the adversarial multi-scale L1 loss, by discriminating multi-scale features between the original images masked by the predicted segmentation map and by the corresponding ground-truth label map, guides the proposed attention module to select the optimal scale of the skin lesion feature with irrelevant artifact features diminished in the Inter-CL. Based on these experiments, we take Scale-Att-ASPP-adv as our proposed network.

4.3.3 Trigger sources of attention

As shown in Fig. 3, the pixel-wise scale attention module in our proposed Scale-Att-ASPP module generates the attention maps from the concatenation of the four parallel outputs of the ASPP module, which captures multi-level contexts of the Inter-CL. In Scale-Att-ASPP-adv, the trigger source of the attention is therefore the ASPP, and we refer to Scale-Att-ASPP-adv as Triggered-by-ASPP in this subsection.

We also conducted another experiment in which the input of the pixel-wise scale attention module is the Inter-CL itself: a 3 × 3 convolution on the Inter-CL, followed by a 1 × 1 convolution and channel-wise softmax, produces the attention maps. In this experiment the trigger source of the attention is the Inter-CL, so we refer to it as Triggered-by-Inter-CL; it is also adversarially trained. The JAC and ACC curves of Triggered-by-ASPP (green lines) and Triggered-by-Inter-CL (orange lines) on the ISBI 2017 Validation Set are shown in Fig. 7a and b, respectively; both curves are much higher and converge more quickly for Triggered-by-ASPP. The comparison on the ISBI 2017 Test Set between the two variants is shown in Table 3: all metric scores are much higher for Triggered-by-ASPP. This is because Triggered-by-Inter-CL applies only a single fixed-kernel convolution to the Inter-CL to capture its context, which lacks the capacity to represent the discriminative features of multi-scale skin lesions. The attention maps generated from the Inter-CL therefore cannot properly re-weight and aggregate the outputs of the ASPP module, and cannot select the optimal scale of the skin lesion feature. In Triggered-by-ASPP, in contrast, the attention maps are generated from the concatenation of the four parallel outputs of the ASPP module, which represents the multi-level context of the Inter-CL, so the ASPP outputs can be well re-weighted and aggregated, and the optimal scale of the skin lesion feature, with irrelevant artifact features diminished, is selected automatically.

Fig. 7 The JAC score (a) and ACC score (b) comparisons under different attention trigger sources (Triggered-by-Inter-CL vs Triggered-by-ASPP) on the Validation Set of ISBI 2017 during training

Table 3 Results under the different trigger sources of the attention on the Test Set of ISBI 2017

4.4 Evaluation and results

4.4.1 Performance comparison

After validating the key modules of the proposed Scale-Att-ASPP-adv, we compared it with the top three approaches of the ISBI 2017 challenge [5, 7, 42] and with the latest best studies on ISBI 2017 [3, 34]. The results are shown in Table 4. Our proposed Scale-Att-ASPP-adv significantly improves segmentation performance, raising the JAC (0.8028) by 2% and the SEN (0.8697) by 1.5%; it also slightly improves the DIC (0.8781), and its ACC (0.9316) is very close to the best value. We obtain the best JAC and SEN scores because the ASPP in the proposed Scale-Att-ASPP module extracts both the local and global contexts of the Inter-CL at the same time: the local context captures the detailed boundary information of the skin lesion, giving the predicted segmentation a sharp boundary, while the global context locates the skin lesion feature and diminishes irrelevant artifact features. Moreover, the proposed pixel-wise scale attention module generates re-weighting attention maps for the ASPP outputs; after re-weighting and aggregation, the optimal scale of the skin lesion feature in the Inter-CL is selected automatically. Our network therefore significantly improves both the JAC and SEN scores, which, as noted in Section 1, play key roles in melanoma diagnosis; the JAC score is also the ranking metric of the ISBI 2017 challenge. SLSDeep [34] performs well, but its SEN score is low, possibly because the PPM on top of its bottleneck layer uses parallel pooling operations that reduce the resolution and lose small skin lesion features. Mohammed et al. [3] obtained a high SEN; however, because their network uses no pooling at all, and given the noisy background of dermoscopy images, the high-level semantic context that determines the lesion location cannot be well extracted, so their JAC score is relatively low, as their experiments show. Although MResNet-Seg [7] obtained the best SPE, its DIC, JAC, and SEN are much lower.

Table 4 Comparison on the Test Set of ISBI 2017

Figure 8 shows examples of segmentations generated by our proposed network on the ISBI 2017 Test Set. Our network segments well skin lesions of various sizes and shapes under artifacts such as hairs, bubbles, ink marks, and rulers; even lesions with very low contrast to the surrounding skin are well segmented. This confirms that our network captures multi-scale contexts of the skin lesion feature via the ASPP module and, more importantly, automatically selects the optimal scale of the skin lesion feature while diminishing irrelevant artifact features via the attention maps produced by the proposed pixel-wise scale attention mechanism in the Scale-Att-ASPP module. The proposed attention mechanism behaves like the "zoom in or out" strategy humans use when inspecting objects carefully; see Fig. 8 (d ~ g) for details: W1 focuses on the detailed boundary information of the skin lesion, while, as the dilation rate grows, W3, W5, and W7 focus on increasingly global contexts, which locate the skin lesions and diminish the irrelevant features.

Fig. 8 Examples of segmentations generated by our proposed model on the Test Set of ISBI 2017. a original image; b ground-truth label map; c predicted segmentation result of our proposed model; (d) ~ (g) the pixel-wise scale attention maps generated by the proposed attention mechanism, from left to right W1, W3, W5, W7, respectively

4.4.2 Time cost test

We also tested the time cost of segmenting one image with the proposed network; the comparison with recent studies is summarized in Table 5. The time cost includes loading the network parameters, loading and segmenting one dermoscopy image, and saving the segmentation result. Although Mohammed et al. [3] achieve the best ACC score (Table 4), their reported time cost of 9.7 s per image is much higher than ours (1.1 s), because they learn a full-resolution feature for every pixel of the original image without any pooling. Our network, in contrast, employs pooling in the encoding path, which enlarges the receptive field to extract multi-level features of the input image while reducing computing cost. Moreover, the ASPP outputs in the Scale-Att-ASPP module are aggregated by pixel-wise addition instead of the traditional concatenation, which reduces memory usage and time; likewise, the output of the Scale-Att-ASPP module is added, rather than concatenated, to the same-level layer of the decoding path through the skip connection, further reducing the time cost.

Table 5 Comparison of the time cost per dermoscopy image segmentation

4.4.3 Robustness testing

We tested the robustness of the proposed network in two evaluation experiments, on the Test Set of ISBI 2016 and on the PH2 dataset.

On the Test Set of ISBI 2016, we compared our results with the top four approaches of the ISBI 2016 challenge: Yu et al. [41], Rahman et al. [30], Yuan et al. [43], and ExB. The results are summarized in Table 6. The proposed network outperforms the top four methods in the ACC, DIC, JAC, and SEN scores, improving them by 0.9%, 3.6%, 6.1%, and 3.8%, respectively.

Table 6 Comparison on the Test Set of ISBI 2016

To further test the robustness of the proposed network, we carried out another experiment on the PH2 dataset, an independent dataset for evaluating dermoscopy image segmentation. We evaluated the predicted segmentation output with the divergence value (DV), which represents the proportion of segmentation errors and is defined as (12):

$$ DV=\frac{FP+ FN}{TP+ FN}\times 100\% $$
(12)
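Using the same confusion counts as in Section 4.2, DV is computed directly; a one-line illustration of Eq. (12):

```python
def divergence_value(tp, fp, fn):
    """Divergence value, Eq. (12): segmentation errors as a percentage
    of the ground-truth lesion area (TP + FN)."""
    return (fp + fn) / (tp + fn) * 100.0
```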

The comparison with recent studies [1, 2, 6, 24] is shown in Table 7. Our network obtains the best DV score (11.26%), for two main reasons. First, we trained the proposed network end-to-end without complex pre-processing such as the time-consuming hair removal used in [1, 2, 6]; manually set filters in the pre-processing stage are suboptimal and cannot suit all dermoscopy images, so the subsequent network cannot extract discriminative features. Instead, our Scale-Att-ASPP module automatically selects the optimal scale of the skin lesion feature with irrelevant artifact features diminished. Second, the adversarial multi-scale L1 loss provides strong regularization, which prevents the proposed network from overfitting when trained on small datasets. These two experiments confirm that our proposed network is robust.

Table 7 Comparison on the PH2 Dataset

5 Discussion

Computer-assisted segmentation of skin lesions in dermoscopy images is a critical step in melanoma diagnosis. Despite many previous studies, automatically segmenting skin lesions remains challenging. Because of the highly variable appearance of skin lesions and the noisy background, most existing methods cannot simultaneously focus on the skin lesion regions and diminish the impact of irrelevant artifact features, so the optimal scale of the skin lesion features cannot be selected and is easily affected by the cluttered background. To address these challenges, we proposed a novel Scale-Att-ASPP module that automatically selects the optimal scale of the skin lesion features, with irrelevant artifact features diminished, for the Inter-CLs in the encoding path, instead of relying on complex pre-processing such as time-consuming hair removal. The proposed attention module also significantly accelerates the convergence of our network. The output of the Scale-Att-ASPP module is added pixel-wise to the same-level layer in the decoding path through the skip connection; in this way, both a sharp boundary and a high recall rate of the skin lesions are obtained while the computing cost is reduced. Our proposed network is adversarially trained end-to-end to avoid overfitting on small datasets. As a result, it significantly improves skin lesion segmentation performance, especially the JAC and SEN scores. As noted in Section 1, both JAC and SEN play key roles in melanoma diagnosis, so the SEN deserves more attention than the SPE.

Although the proposed network achieves state-of-the-art performance, some over-segmentation and under-segmentation problems remain, for two main reasons. The first is annotation inconsistency: some pixels with very similar appearance in the original image are annotated with different labels in the ground-truth label map, or pixels with different appearance are annotated with the same label; see Fig. 9a and b, respectively. Addressing this requires more reliable image annotation, or additional images of such cases to strengthen the network's predictions during training. The second is that in some dermoscopy images the skin lesion regions have very low contrast to the surrounding skin, as shown in Fig. 9c. Adjusting the threshold on the predicted segmentation map could help, but it is difficult to find a single value suitable for all such images. Because the proposed attention module emphasizes the SEN of the skin lesions, the SPE of the proposed network is slightly low; incorporating additional prior knowledge from clinical experience into training could improve the SPE, which is worthy of future work.

Fig. 9 Examples of wrong segmentations generated by our proposed model. a original image; b ground-truth label map; c predicted segmentation result

6 Conclusion

We proposed a novel Scale-Att-ASPP-adv network for automatically segmenting skin lesions in dermoscopy images. The proposed Scale-Att-ASPP module can be easily integrated into the skip connections of a "U-Net" architecture and automatically selects the optimal scale of the skin lesion feature with irrelevant artifact features diminished. Our network achieves state-of-the-art performance, especially in the JAC and SEN scores; it also has a low time cost and shows excellent robustness across datasets. Moreover, since the segmentation network is trained end-to-end without complex pre-processing or any post-processing, we believe it can easily be applied to other image segmentation tasks.