Hyper-fusion network for semi-automatic segmentation of skin lesions
Introduction
Melanoma (also known as malignant melanoma) has one of the most rapidly increasing incidences in the world and a considerable mortality rate if left untreated (Rigel et al., 1996). Early diagnosis is particularly important because melanoma can be cured with early excision (Celebi et al., 2007; Capdehourat et al., 2011). Skin lesion images such as dermoscopy are commonly acquired as a non-invasive imaging technique for the in-vivo evaluation of pigmented skin lesions and play an important role in early diagnosis (Celebi et al., 2007). The identification of melanoma from skin lesion images using human vision alone can be subjective, inaccurate and poorly reproducible, even amongst experienced dermatologists (Celebi et al., 2008; Abbas et al., 2013). This is attributed to the challenges in interpreting skin lesion images, which can have diverse visual characteristics such as variations in size and shape, 'fuzzy' boundaries, artifacts and hair (Fig. 1) (Barata et al., 2015). Therefore, automated image analysis is a valuable aid for clinical decision support (CDS) systems and for the image-based diagnosis of skin lesions (Serrano and Acha 2009; Esteva et al., 2017). Skin lesion segmentation is fundamental to these CDS systems and has motivated the development of numerous segmentation methods.
Traditional fully automatic segmentation methods mainly focus on extracting pixel-level or region-level features, such as Gaussian (Wighton et al., 2011) and texture (He and Xie 2012) features, and then use various classifiers, such as wavelet networks (Sadri et al., 2012) and Bayes classifiers (Wighton et al., 2011), to separate the skin lesions from the surrounding healthy skin. However, their performance depends heavily on correctly tuning a large number of parameters and on effective pre- and post-processing techniques such as hair removal and illumination correction. Without such pre- and post-processing, these methods have difficulty segmenting lesions when there are artifacts or hair, or when the lesion reaches the boundary of the image.
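As a concrete illustration of this traditional pixel-level pipeline, the sketch below extracts multi-scale Gaussian features per pixel and separates lesion from healthy skin with a simple per-pixel Gaussian (Bayes) classifier. This is not code from any of the cited works; the function names, feature choices and classifier details are our own illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pixel_features(img, sigmas=(1.0, 2.0, 4.0)):
    """Per-pixel feature vectors: raw intensity plus multi-scale Gaussian responses."""
    feats = [img] + [gaussian_filter(img, s) for s in sigmas]
    return np.stack(feats, axis=-1).reshape(-1, len(sigmas) + 1)

class PixelBayes:
    """Naive per-pixel Gaussian (Bayes) classifier: fits class-wise feature
    means and variances, then labels each pixel by the higher log-likelihood."""
    def fit(self, X, y):
        self.stats = {c: (X[y == c].mean(0), X[y == c].var(0) + 1e-6)
                      for c in (0, 1)}
        return self

    def predict(self, X):
        def loglik(c):
            mean, var = self.stats[c]
            return -0.5 * (((X - mean) ** 2) / var + np.log(var)).sum(1)
        return (loglik(1) > loglik(0)).astype(np.uint8)
```

Such a pipeline works pixel by pixel, which is precisely why artifacts, hair and uneven illumination (which change local pixel statistics) degrade it without dedicated pre-processing.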
Deep learning based fully automatic segmentation methods are regarded as the state-of-the-art in skin lesion segmentation, and most of these methods are based on fully convolutional networks (FCNs) (Shelhamer et al., 2016). The success of FCNs is primarily attributed to their use of encoders and decoders to derive an image feature representation that combines low-level appearance information with high-level semantic information. The encoders use convolutional layers and downsampling to extract high-level semantic features from images. The decoders then upsample the extracted image features to output the segmentation results. Therefore, FCNs can be trained in an end-to-end manner for efficient inference, i.e., images are taken as inputs and the segmentation results are output directly. Yuan et al. (Yuan et al., 2017) replaced the cross-entropy loss used in a traditional FCN with a Jaccard distance loss for skin lesion segmentation. Yu et al. (Lequan et al., 2017) increased the FCN network depth (number of layers) with a 50-layer deep residual network to segment based on deeper image features. Bi et al. (Bi et al., 2017) proposed class-specific learning to combine (ensemble) multiple trained FCNs (trained only with melanoma or only with non-melanoma images) for segmentation. More recently, Xie et al. (Xie et al., 2020) proposed learning skin lesion segmentation and classification (melanoma vs. non-melanoma) via a mutual bootstrapping network, where the skin lesion classification results were used to guide and improve the segmentation results. However, all these FCN-based methods rely on large annotated training datasets that include all the possible variations in skin lesions, including differences between patients in lesion size, shape and texture.
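The Jaccard distance loss used by Yuan et al. can be written as 1 minus the soft Jaccard index between the predicted probability map and the ground-truth mask. A minimal sketch (our own paraphrase, not the authors' code; the epsilon smoothing term is a common implementation detail we assume here):

```python
import numpy as np

def soft_jaccard_loss(pred, target, eps=1e-7):
    """1 - soft Jaccard index; `pred` holds probabilities in [0, 1],
    `target` is the binary ground-truth mask (same shape)."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter  # |A| + |B| - |A ∩ B|
    return 1.0 - (inter + eps) / (union + eps)
```

Unlike cross-entropy, this loss directly optimizes the overlap measure used for evaluation, which makes it less sensitive to the foreground/background class imbalance typical of skin lesion images.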
When there are, however, insufficient training data to cover all the variations in skin lesions, these methods fail to segment lesions whose image characteristics are less common in the training datasets. Further, skin lesions from different datasets may have major differences in appearance, e.g., in illumination and field-of-view (as shown in Fig. 1). The end result is that these methods tend to overfit to one dataset and generalize poorly to a different dataset.
FCN-based semi-automatic segmentation methods for medical images, which combine manual user-inputs (prior knowledge) with high-level semantic features derived from FCNs, offer an alternative approach to segmenting skin lesions. Currently, there are few such methods. Wang et al. (G. Wang et al. 2018) proposed a semi-automatic medical image segmentation method with two FCNs: the first FCN automatically segmented the input image, and the second FCN repeated the segmentation but with the fusion of the input image, the segmentation result (from the first FCN) and the user-inputs. The regions that the first FCN failed to segment were then refined by the second FCN. Lei et al. (Lei et al., 2019) replaced the FCNs in the approach of Wang et al. (G. Wang et al. 2018) with a lightweight network architecture to segment organs-at-risk structures from computed tomography (CT) images. Koohbanani et al. (Koohbanani et al., 2020) fused user-inputs with a multi-scale FCN for microscopy images. Sakinis et al. (Sakinis et al., 2019) fused user-inputs with a U-Net for organ segmentation in abdominal CT images. Wang et al. (G. Wang et al. 2018) proposed fine-tuning on individual test images with user-inputs (including scribbles and user-defined bounding boxes) that enclose the regions of interest. Zhang et al. (Zhang et al. 2021) proposed a patch-based segmentation method in which a user-defined centroid point was used to partition the medical image into small patches, and the small patches were then segmented with a convolutional recurrent neural network (ConvRNN). For non-medical images, Majumder et al. (Majumder and Yao 2019) fused superpixel-based user-inputs with the input images for natural image segmentation, where the superpixel-based user-inputs were derived by calculating the Euclidean distance from the centroid of each superpixel to the user-clicks.
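The early-fusion scheme common to these methods can be sketched as follows: user clicks are converted to Euclidean distance maps (one for foreground clicks, one for background clicks) and concatenated with the RGB image into a single multi-channel input for the FCN. The helper names and the distance truncation cap below are our own illustrative assumptions, not a specific published implementation:

```python
import numpy as np

def click_distance_map(shape, clicks, cap=255.0):
    """Per-pixel Euclidean distance to the nearest user click, truncated at `cap`."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    d = np.full(shape, np.inf)
    for cy, cx in clicks:
        d = np.minimum(d, np.hypot(yy - cy, xx - cx))
    return np.minimum(d, cap)

def early_fusion_input(image, fg_clicks, bg_clicks):
    """Stack a (3, H, W) RGB image with foreground/background guidance maps
    into the single (5, H, W) input used by early-fusion methods."""
    h, w = image.shape[1:]
    fg = click_distance_map((h, w), fg_clicks)
    bg = click_distance_map((h, w), bg_clicks)
    return np.concatenate([image, fg[None], bg[None]], axis=0)
```

After this concatenation the network sees the guidance only once, at its input layer, which is the limitation the next paragraph discusses.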
All these FCN-based methods focused on early-fusion, where the medical images are fused with the user-inputs (both foreground and background inputs) into a single input prior to the FCN. The reliance on a single fused input means that important user-input information can be lost after early-fusion, leaving limited prior knowledge for the FCN to use. In addition, the reliance on a user-defined centroid point is not always feasible: it is challenging to accurately place a centroid point for lesions with differing shapes, and the centroid point may not always lie within the lesion region. Further, fine-tuning on individual test images requires additional computational time and manual input, e.g., bounding boxes and scribbles, which is challenging to implement for a large cohort study. Hu et al. (Hu et al., 2019) proposed a two-stream late-fusion network for natural image segmentation, where the image and the user-inputs were separately processed by two FCNs and the resultant features were then fused. The late fusion of extracted image features, however, tends to dismiss the correlations between the image and the user-inputs; these correlations may only be accessible at the early stages of the network. In addition, when these methods are applied to skin lesion segmentation, they usually have difficulty in accurately delineating the boundary of the lesion and produce inconsistent outcomes for challenging skin lesions.
Our hyper-fusion network (HFN), shown in Fig. 2b, separately extracts features from user-inputs. Our fusion strategy provides the flexibility to learn complementary features between the lesion images and the user-inputs, and provides continuous guidance and constraints on the segmentation results. Our HFN adds the following contributions to the current knowledge: (i) separate extraction of features from skin lesion images and user-inputs; this allows continuous leveraging of user-inputs to optimize the learning of skin lesion characteristics and minimizes the loss of user-input information that occurs with early-fusion; (ii) training and predicting segmentation results in multiple fusion stages; compared with early-fusion based semi-automatic segmentation methods, multiple fusion stages have the advantage of using the user-inputs to iteratively refine the segmentation, which ensures better segmentation of skin lesion boundaries; (iii) the introduction of hyper-integration modules (HIMs) to fuse user-input features and skin lesion image features at individual fusion stages. HIMs help guide and constrain the learning of lesion characteristics and then propagate the intermediary segmentation results to the next stage of the decoder. The fusion at individual stages ensures that the appearance of the segmented skin lesions is spatially consistent.
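The multi-stage fusion idea can be illustrated schematically: a small user-input encoder runs alongside the image encoder, and at each decoder stage a HIM-style module fuses image features, user-input features and the previous stage's prediction. All layer sizes and module internals below are hypothetical; this is a sketch of the design pattern, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HIM(nn.Module):
    """Schematic hyper-integration module: fuses image features, user-input
    features and the previous stage's prediction, then emits a refined prediction."""
    def __init__(self, img_ch, guide_ch):
        super().__init__()
        self.fuse = nn.Conv2d(img_ch + guide_ch + 1, img_ch, 3, padding=1)
        self.pred = nn.Conv2d(img_ch, 1, 1)

    def forward(self, img_feat, guide_feat, prev_pred):
        size = img_feat.shape[2:]
        g = F.interpolate(guide_feat, size=size)  # match spatial resolution
        p = F.interpolate(prev_pred, size=size)   # propagate earlier prediction
        x = torch.relu(self.fuse(torch.cat([img_feat, g, p], dim=1)))
        return x, torch.sigmoid(self.pred(x))

class TinyHFN(nn.Module):
    """Two-stream sketch: separate encoders for the image and the user-input
    guidance maps, with HIMs fusing them at two successive stages."""
    def __init__(self):
        super().__init__()
        self.img_enc1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.img_enc2 = nn.Sequential(nn.MaxPool2d(2),
                                      nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        self.usr_enc = nn.Sequential(nn.Conv2d(2, 8, 3, padding=1), nn.ReLU())
        self.him_deep = HIM(16, 8)
        self.him_fine = HIM(8, 8)

    def forward(self, img, guidance):
        f1 = self.img_enc1(img)           # fine-scale image features
        f2 = self.img_enc2(f1)            # coarse-scale image features
        g = self.usr_enc(guidance)        # separately extracted user-input features
        p0 = torch.zeros_like(f2[:, :1])  # initial (empty) prediction
        _, p_coarse = self.him_deep(f2, g, p0)
        _, p_fine = self.him_fine(f1, g, p_coarse)  # iterative refinement
        return p_fine
```

Because the guidance features `g` re-enter at every stage, the user-inputs constrain each intermediate prediction rather than being consumed once at the input layer, which is the key difference from early-fusion.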
Materials
We used three well-established public benchmark datasets to train and test the effectiveness of our method.
- The 2017 and 2016 ISBI Skin Lesion Challenge (denoted as ISBI 2017 (Codella et al., 2017) and ISBI 2016 (Gutman et al., 2016)) datasets are subsets of the large International Skin Imaging Collaboration (ISIC) archive, which contains skin lesion images acquired on a variety of different devices at numerous leading international clinical centers. The ISBI 2017 challenge dataset provides
Experiment setup
We performed the following experiments on the three datasets: (a) comparison of the overall performance of our method with fully automated and semi-automated segmentation methods; (b) comparison of the results from (a) using different numbers of user-inputs; (c) analysis of the performance of each component in our proposed method; (d) analysis of the segmentation results on the challenging skin lesions; and (e) analysis of the segmentation results with noisy user-inputs. For experiments using
Segmentation results on ISBI 2017, ISBI 2016 and PH2 datasets
Table 1, Table 2 and Fig. 7 show that our HFN method achieved the best overall performance across all measurements on the ISBI 2017 dataset. When compared with the recently published fully automatic methods MB and BiDFL, our method improved by large margins of 3.3% and 2.23%, respectively, in the Jaccard measure (Table 1).
Table 3, Table 4 and Fig. 8 show that our HFN method outperformed all the current methods on the ISBI 2016 dataset. When compared to the current state-of-the-art method DAGAN, our method
Discussions
Our main findings are that: (i) our HFN with user-inputs consistently improved skin lesion segmentation, in particular, for the skin lesions, where the image characteristics are less common in the training datasets; (ii) compared to early-fusion methods, fusing separately extracted complementary features (user-inputs and image features) produced advantages in leveraging user-inputs that resulted in improved segmentation of challenging skin lesions; and (iii) HIMs ensured the appearance of the
Application to total-body 3-Dimensional (3D) photography
Total-body 3D photography, currently being implemented in the clinic, constructs a digital 3D avatar of the patient that can be used to view and monitor skin lesions across the body over time. When compared to current manual dermoscopy and limited-access, time-consuming 2D total-body photography systems, total-body 3D photography brings new spatial and temporal capabilities: skin lesions at different sites of the body and at different times can be detected simultaneously. The Australian
Conclusions
In this paper, we proposed a method to segment skin lesions in a semi-automated manner. Our method used a deep hyper-fusion FCN to iteratively fuse separately extracted user-input features with skin lesion image features, and to continuously leverage user-inputs to guide and constrain the learning of skin lesion characteristics. By learning and inferring user-inputs derived from a few user-clicks, we achieved accurate segmentation results for skin lesions that are known to be challenging, such
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported in part by Australia Research Council (ARC) grants (IC170100022 and DP200103748).
References (45)
- Abbas et al., Pattern classification of dermoscopy images: a perceptually uniform model, Pattern Recognit. (2013)
- Al-Masni et al., Skin lesion segmentation in dermoscopy images via deep full resolution convolutional networks, Comput. Methods Programs Biomed. (2018)
- Bi et al., Step-wise integration of deep class-specific learning for dermoscopic image segmentation, Pattern Recognit. (2019)
- Capdehourat et al., Toward a combined tool to assist dermatologists in melanoma detection from dermoscopic images of pigmented skin lesions, Pattern Recognit. Lett. (2011)
- Celebi et al., Automatic detection of blue-white veil and related structures in dermoscopy images, Comput. Med. Imaging Graph. (2008)
- Celebi et al., A methodological approach to the classification of dermoscopy images, Comput. Med. Imaging Graph. (2007)
- Hu et al., A fully convolutional two-stream fusion network for interactive image segmentation, Neural Netw. (2019)
- Lei et al., Skin lesion segmentation via generative adversarial networks with dual discriminators, Med. Image Anal. (2020)
- Pennisi et al., Skin lesion image segmentation using Delaunay Triangulation for melanoma detection, Comput. Med. Imaging Graph. (2016)
- Rigel et al., The incidence of malignant melanoma in the United States: issues as we approach the 21st century, J. Am. Acad. Dermatol. (1996)