1 Introduction

Skin cancer is one of the most prevalent types of cancer [1, 2] and there is a growing need for accurate and scalable decision support systems for skin diseases. To assist doctors in making correct diagnoses, decision support systems can be trained on dermoscopic images, the same type of input data that dermatologists often use for an initial assessment.

The International Skin Imaging Collaboration (ISIC) [7, 10] provides public datasets of dermoscopic images and organizes challenges where state-of-the-art (SoA) methods in this field can compete. These datasets allow researchers to design data-driven systems for the detection of skin diseases. Although in recent years high accuracy has been achieved with different Deep Learning approaches [7, 9, 10], most methods do not provide a mechanism to make use of prior medical knowledge.

In this work, we present a novel approach that tackles this issue. We aim to leverage the predictive power of a deep convolutional neural network (CNN) while providing functionality to understand which factors influence the network’s prediction. We further quantify the attention that the trained classifier pays to each feature channel and image location as a means to support our conclusions.

Fig. 1. Annotation of dermoscopic structures overlaid on images [7].

Related Work. Esteva et al. recently presented a CNN-based approach that outperformed certified dermatologists at differentiating benign and malignant lesions [9]. They used transfer learning and a disease partitioning algorithm to generate optimal training classes. They further computed saliency maps that highlight the importance of every pixel for the final prediction. However, these saliency maps provide only limited interpretable information, e.g., that the network mainly focuses on pixels belonging to the lesion rather than on the background.

Codella et al. used a mixture of hand-coded features and features extracted by deep CNNs to achieve SoA results on the dataset of the ISBI 2016 “Skin Lesion Analysis Towards Melanoma Detection” challenge [6, 10]. Despite leveraging color features and shape descriptors for lesions, their approach does not offer an intuitive way of understanding the system’s predictions.

The extensive work of López-Labraca et al. [13] is closely related to our approach. They employed sophisticated, hand-crafted filters to detect relevant dermoscopic structures (see Fig. 1). For a given lesion image, malignancy scores of different dermoscopic structures were computed and then combined to form a single diagnosis (malignant or benign). The authors were able to generate comprehensive reports containing the final diagnosis and the detected structures along with their respective malignancy scores. Nonetheless, their proposed approach requires extensive feature engineering and, in contrast to deep learning methods, is limited to features that are already known to dermatologists.

Building on the same idea, González-Díaz presented a method that used dermoscopic structures in combination with ResNet50, a deep residual network [8, 11]. An input image was fed into a segmentation network that produced probability maps of eight different dermoscopic structures. These maps were then used to modulate the latent representation of the input image at a hidden layer in the ResNet50. Using this CNN-based method, González-Díaz achieved the best score for the detection of seborrheic keratosis in the ISBI 2017 “Skin Lesion Analysis Towards Melanoma Detection” challenge [7]. However, despite making use of known dermoscopic structures, this method does not provide interpretable information in the way the work of López-Labraca et al. does, where hand-crafted features were employed [13]. Furthermore, it is unclear to what extent the segmentations of the dermoscopic structures influence the final diagnosis.

2 Methods

Overview. Our method consists of two stages. Given an input image of a skin lesion, we first employ a segmentation network (SN) to detect dermoscopic structures that dermatologists consider important for disease classification (see Fig. 1). We focus on pigment networks, as they are known indicators for malignant melanoma and benign nevi. Furthermore, they are the dermoscopic structures that are segmented with the highest confidence by our SN. In a second stage, the output of the SN is stacked on top of the RGB channels of the original image, and the resulting four-channel input is used to train a classifier network (CN) for each considered type of disease.
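As an illustration, the following minimal Python/NumPy sketch shows how such a four-channel input could be assembled; the function and argument names are ours and do not stem from the original implementation.

import numpy as np

def make_four_channel_input(rgb_image, pn_probability_map):
    """Stack the SN's pigment-network probability map onto the RGB channels.

    rgb_image: float array of shape (H, W, 3), values in [0, 1]
    pn_probability_map: float array of shape (H, W), values in [0, 1]
    returns: array of shape (H, W, 4) that is fed to the classifier network (CN)
    """
    pn = pn_probability_map[..., np.newaxis]          # (H, W) -> (H, W, 1)
    return np.concatenate([rgb_image, pn], axis=-1)   # (H, W, 4)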

Additionally, we introduce two measures of attention given by the classifier network, namely the channel-wise and location-wise attention. These measures allow us to quantify how much attention the classifier is paying to the provided dermoscopic structure compared to the rest of the input data.

Material and Dataset. For the detection of pigment networks, we trained our SN with the dataset provided by the ISBI 2017 “Skin Lesion Analysis Towards Melanoma Detection” challenge [7]. It consists of 2000 training images and 600 testing images with superpixel-level annotations of different types of dermoscopic structures, one of which is the pigment network. Every image is labeled with one of three classes: melanoma, nevus, or seborrheic keratosis.

For the disease classification task, we trained our CN on the dataset released for the “ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection” challenge [7, 17]. It comprises 10,015 images belonging to one of seven classes: melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis, benign keratosis, dermatofibroma, or vascular lesion. For simplicity, we will from now on refer to these datasets as DS2017 and DS2018, respectively.

Methodology Overview. Since the images exhibit varying dimensions, they were resized to \(224\times 224\) pixels. Both datasets were augmented with random rotations of \(90^\circ \), \(180^\circ \), and \(270^\circ \), as well as vertical and horizontal flips. The algorithms for pigment network segmentation (SN) and disease classification (CN) were implemented in TensorFlow [3], both by extending the code provided by [5].
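The exact preprocessing code of [5] is not reproduced here; the following is a minimal TensorFlow sketch of the stated steps (resizing plus random rotations and flips), with function names chosen by us for illustration.

import tensorflow as tf

def preprocess_and_augment(image):
    """Resize to 224x224 and apply a random 90-degree rotation and random flips."""
    image = tf.image.resize(image, (224, 224))
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)  # 0, 90, 180 or 270 degrees
    image = tf.image.rot90(image, k=k)
    image = tf.image.random_flip_left_right(image)  # horizontal flip
    image = tf.image.random_flip_up_down(image)     # vertical flip
    return image

# Example usage within a tf.data input pipeline:
# dataset = dataset.map(preprocess_and_augment)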

Detection of Pigment Networks. The detection task was formulated as a pixel-wise binary segmentation problem with a foreground and a background class. Due to the large class imbalance caused by background pixels, we reduced the original training set to a subset containing only those images in which a pigment network was annotated. Note that this may result in a biased segmenter SN, because it has been trained to always detect a pigment network somewhere in the image. Nevertheless, the subsequent disease classifier CN still sees the actual image and may choose to ignore this segmented area if that does not facilitate the classification. Accordingly, our motivation was to have an (over-)sensitive SN and to let the subsequent CN decide how much importance to give to the allegedly detected pigment network. Although such a two-step approach can be argued to be potentially inferior to an end-to-end solution, it allows us to use dedicated datasets and train targeted models, giving us more control over each step.

For SN, we employed a shallow U-Net [14] that outputs probability maps for the occurrence of pigment networks. To further alleviate the problem of background dominance, the Sørensen-Dice coefficient was used as the loss function.
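A minimal sketch of such a soft-Dice loss in TensorFlow is given below; the smoothing constant is our own choice and not taken from the original implementation.

import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Soerensen-Dice loss for binary segmentation maps.

    y_true, y_pred: tensors of shape (batch, H, W, 1) with values in [0, 1].
    """
    axes = (1, 2, 3)
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    denominator = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    dice = (2.0 * intersection + eps) / (denominator + eps)  # per-image Dice coefficient
    return 1.0 - tf.reduce_mean(dice)                        # minimizing 1 - Dice maximizes overlap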

Disease Classification. We evaluated two types of classifiers: (i) ResNet50 (with 50 layers) pre-trained with images from the 2014 ImageNet Large Scale Visual Recognition Challenge [15], and (ii) the shallower ResNet18 (with 18 layers) proposed by He et al. [11], which we trained from scratch.
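A minimal sketch of such a classifier for the four-channel input is given below. It mirrors the fine-tuning scheme described in Sect. 3 (only the first convolution and the final fully-connected layer are trained); however, the trainable 1x1 adapter convolution in front of the frozen ImageNet-pre-trained backbone is an illustrative substitute for replacing the backbone's own first layer, and all names are ours.

import tensorflow as tf

NUM_CLASSES = 7  # the seven DS2018 disease classes

def build_classifier(input_shape=(224, 224, 4)):
    """Classifier network (CN) operating on RGB plus a pigment-network channel."""
    inputs = tf.keras.Input(shape=input_shape)

    # Trainable 1x1 convolution mapping the four input channels to the three
    # channels expected by the ImageNet-pre-trained backbone.
    x = tf.keras.layers.Conv2D(3, kernel_size=1, padding="same", name="channel_adapter")(inputs)

    backbone = tf.keras.applications.ResNet50(include_top=False, weights="imagenet", pooling="avg")
    backbone.trainable = False  # keep all pre-trained intermediate weights frozen

    features = backbone(x, training=False)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax", name="head")(features)
    return tf.keras.Model(inputs, outputs)

model = build_classifier()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])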

Fig. 2. Input images and their corresponding attribution maps (red = positive contribution, blue = negative contribution). (Color figure online)

Attention. Our attention measures are based on so-called attribution methods. Given a deep CNN with input \(x = [x_1, ... , x_N] \in \mathbb {R}^N\) and output \(f(x) = [f_1(x), ... , f_C(x)] \in \mathbb {R}^C\), attribution methods compute the contribution \(R_{i,c}\) of every input pixel \(x_i\) to a specific target neuron \(f_c\). Different types of attribution methods have been proposed in the past, such as perturbation- and gradient-based approaches [4]. We herein employed the simple input \(\times \) partial-derivative method [16], which is fast and produced useful attribution maps for our purposes (see Fig. 2). This method defines the attribution \(R_{i,c}\) of an input pixel \(x_i\) to a target neuron \(f_c\) as follows:

\(R_{i,c} = x_i \cdot \frac{\partial f_c(x)}{\partial x_i}\)   (1)
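Using automatic differentiation, (1) can be computed directly; the following TensorFlow sketch is our own illustration (not the implementation referenced further below) and the function name is hypothetical.

import tensorflow as tf

def input_times_gradient(model, x, target_class):
    """Compute R_{i,c} = x_i * d f_c(x) / d x_i as in Eq. (1).

    model: a Keras classifier; x: input batch of shape (batch, H, W, channels);
    target_class: index c of the output neuron to attribute.
    """
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        f_c = model(x)[:, target_class]  # target neuron f_c
    grads = tape.gradient(f_c, x)        # partial derivatives w.r.t. every input element
    return x * grads                     # element-wise product yields the attribution map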

Based on (1) we propose two quantities to measure the attention that a CNN pays to each input channel and image location.

Channel-wise attention (\(A_c\)) is defined as the ratio of the contribution from a particular structure channel c to the contributions of all K input channels. Writing \(R_{k,j}\) for the contribution of the input element at spatial location j in channel k (with the target-class index omitted for brevity), we have:

\(A_c = \frac{\sum_{j} R_{c,j}}{\sum_{k=1}^{K} \sum_{j} R_{k,j}}\)   (2)

Location-wise attention (\(A_L\)) captures the local attention in the image space, again as a ratio. The numerator contains the contributions of all channels except for the dermoscopic structure channel c, weighted by \(p_j\), the local probability of the dermoscopic structure at location j; the denominator contains the corresponding unweighted contributions:

\(A_L = \frac{\sum_{k \ne c} \sum_{j} p_j \, R_{k,j}}{\sum_{k \ne c} \sum_{j} R_{k,j}}\)   (3)
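Given an attribution map R of shape (H, W, K) for a fixed target class (e.g., the output of the sketch after Eq. (1)) and the pigment-network probability map p, the two measures can be computed as follows. This sketch mirrors (2) and (3) literally and uses raw (signed) contributions, which is an assumption on our part.

import tensorflow as tf

def channel_wise_attention(R, c):
    """A_c from Eq. (2): share of the total contribution assigned to structure channel c."""
    return tf.reduce_sum(R[..., c]) / tf.reduce_sum(R)

def location_wise_attention(R, p, c):
    """A_L from Eq. (3): attention on locations covered by the dermoscopic structure.

    p: probability map of the structure, shape (H, W); channel c itself is excluded.
    """
    keep = [k for k in range(R.shape[-1]) if k != c]
    R_rest = tf.gather(R, keep, axis=2)                    # contributions of all channels except c
    weighted = tf.reduce_sum(R_rest * p[..., tf.newaxis])  # weight by the local probability
    return weighted / tf.reduce_sum(R_rest)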

We used the publicly available implementation of Ancona et al. to compute the contribution values and to generate the attribution maps.

3 Results and Discussion

Segmentation. Example segmentations of pigment networks are depicted in Fig. 3. We trained and tested on DS2017 because, at the time of writing, SoA results for DS2018 were not yet publicly available for comparison. As seen in Table 1, our pigment network segmentation results are slightly less accurate than, but comparable to, the SoA results in dermoscopic structure segmentation by Kawahara and Hamarneh [12] from the ISBI 2017 “Skin Lesion Analysis Towards Melanoma Detection” challenge [7]. Note that our goal herein was not to perfect the SN stage, but rather to investigate if and how any information that SN provides can be further used in classification.

Fig. 3. Input images and segmented pigment networks (green = true positives, red = false positives, black = true negatives, blue = false negatives). (Color figure online)

Table 1. Evaluation scores for the segmentation of pigment networks

Classification. For the classification experiments, we used the DS2018 training set and applied an 80%-10%-10% split into training, validation, and test sets. Tables 2 and 3 show the classification scores of ResNet18 and ResNet50 with and without the additional pigment network channel.

For ResNet18, the ROC AUC values do not change significantly when the pigment network channel is added. Despite a slight overall decrease in ROC AUC values, Table 3 shows that the F1-score for the crucial melanoma class increases, owing to a clear improvement in recall from 0.213 to 0.400. This is even better than the recall obtained by the much more complex ResNet50. In terms of ROC AUC values, ResNet50 still performs best. Note, however, that adding the pigment network channel to the input of ResNet50 actually leads to lower ROC AUC values, recall, and F1-scores.

Table 2. ROC AUC values. ME = Melanoma, MN = Melanocytic Nevus, BCC = Basal Cell Carcinoma, AK = Actinic Keratosis, BK = Benign Keratosis, DF = Dermatofibroma, VL = Vascular Lesion.
Table 3. Recall, precision, and F1-score for Melanoma.

As seen in Table 4, the channel-wise attention \(A_c\) as well as the location-wise attention \(A_L\) for melanoma and melanocytic nevus are clearly higher in the case of ResNet18. This suggests that ResNet50 is not focusing its attention on the parts of the image that are medically relevant. The pre-trained ResNet50 may require more sophisticated fine-tuning if an additional channel is to be used. In our setup, only the weights of the first convolutional layer and the final fully-connected layer were learned, whereas all weights in between were pre-trained and frozen. Since the images from the 2014 ImageNet Large Scale Visual Recognition Challenge [15] are very different from dermoscopic images, it might be beneficial to use more general pre-trained feature representations from a higher layer and start learning from there. However, this is in turn computationally more expensive.

Table 4. Attention measures \(A_c\), \(A_L\) for additional pigment network channel. ME = Melanoma, MN = Melanocytic Nevus.

4 Conclusion

We showed that the recall and the F1-score for the detection of melanoma can be improved by providing a CNN with an additional input channel that contains relevant prior knowledge. Furthermore, we demonstrated that our proposed attention measures can help to identify where a CNN focuses its attention. As a next step, one might consider integrating information about more than just pigment networks into the input. Other dermoscopic structures, such as streaks and dots, could be used to further improve existing classifiers, e.g., for non-melanocytic lesions.