1 Introduction

Sickle cell disease (SCD) is an inherited blood disorder in which abnormal hemoglobin can cause normal disc-shaped red blood cells (RBCs) to distort into heterogeneous shapes. The differences in cell morphology between healthy and pathological cells make it possible to perform image-based diagnosis using image processing techniques, which is important for faster and more accurate detection of potential SCD crises. Various methods have been developed for RBC segmentation and/or classification, such as thresholding, region growing [1], the watershed transform [2], deformable models [3], and clustering [4]. However, traditional image processing methods such as thresholding and region growing are susceptible to noisy image backgrounds and blurred cell boundaries, which are common in microscopy images. Moreover, deformable models such as active contours [3] need good initialization and rely on relatively clear cell morphology. In addition, due to the heterogeneous shapes and touching RBCs in SCD, recent open-source cell detection tools such as CellProfiler [5], CellTrack [6], and Fiji [7] cannot readily detect and classify SCD RBCs accurately. Hence, effective SCD cell segmentation and classification remains an open problem for the field.

Recently, deep learning methods based on convolutional neural networks (CNNs) have achieved remarkable success in both natural image [8] and medical image analysis [9]. Among these methods, the fully convolutional network (FCN) has shown state-of-the-art performance in various real-world applications [10]. Specifically, FCNs have been applied to cell segmentation problems [11, 12] with good results. U-Net was developed from the FCN and adds skip connections between encoder and decoder [13]; it has also been applied to medical images. On the other hand, a major challenge in capturing the most discriminative shape and texture features of RBCs is that cells can be imaged in various poses and sizes, so a spatially invariant scheme is needed to overcome these variations. For example, a dense transformer network based on the thin-plate spline has achieved superior performance on brain electron microscopy image segmentation [14]. In this work, we apply deformable convolution [15] to the U-Net architecture and develop a deformable U-Net framework for semantic cell segmentation. Deformable convolution accommodates geometric variations in images by learning adaptive, data-driven receptive fields [15], in contrast to standard CNNs, whose receptive fields are fixed. It can therefore be more robust to the spatial variations of RBCs.

The proposed framework is trained and tested on a large, multi-institutional RBC microscopy image database with manual annotations, covering both healthy and pathological populations. We perform simultaneous segmentation and classification of the RBCs using the trained network under various experimental settings. The high accuracy of both segmentation and classification indicates that the proposed framework is an effective solution for the automatic detection of SCD RBCs. To the best of our knowledge, this work is the first attempt to solve the SCD detection problem with an end-to-end semantic segmentation approach.

2 Materials and Methods

Since the traditional U-Net is inherently limited in dealing with object shape transformations due to its regular square receptive field, we propose the deformable U-Net, which replaces the standard convolution kernels with deformable convolutions throughout the U-Net. In a classic CNN architecture, a convolution kernel has fixed shape and size and samples the input feature map on a regular grid. For example, the grid \(\mathcal {R}\) for a \(3\times 3\) kernel is \(\mathcal {R}=\{(-1,-1),(-1,0),\cdots ,(0,1), (1,1)\}\). For each pixel \(\mathbf {p}_0\) on the output feature map \(\mathbf {y}\), the standard convolution can be expressed as:

$$\begin{aligned} \mathbf {y}(\mathbf {p}_0)=\sum _{\mathbf {p}_n\in \mathcal {R}}\mathbf {w}(\mathbf {p}_n)\cdot \mathbf {x}(\mathbf {p}_0+\mathbf {p}_n), \end{aligned}$$
(1)

where \(\mathbf {y}(\mathbf {p}_0)\) denotes the value of pixel \(\mathbf {p}_0\) on the output feature map, \(\mathbf {x}(\mathbf {p}_0+\mathbf {p}_n)\) denotes the value of pixel \(\mathbf {p}_0+\mathbf {p}_n\) on the input feature map, and \(\mathbf {w}(\mathbf {p}_n)\) is the weight parameter. In contrast, deformable convolution adds 2D offsets to the regular sampling grid \(\mathcal {R}\), thus Eq. (1) becomes:

$$\begin{aligned} \mathbf {y}(\mathbf {p}_0)=\sum _{\mathbf {p}_n\in \mathcal {R}}\mathbf {w}(\mathbf {p}_n)\cdot \mathbf {x}(\mathbf {p}_0+\mathbf {p}_n+\varDelta \mathbf {p}_n). \end{aligned}$$
(2)
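As a concrete illustration, the following NumPy sketch (a toy kernel and feature map, not the paper's code) evaluates the standard convolution of Eq. (1) at a single output pixel; the deformable convolution of Eq. (2) would simply add the learned offsets \(\varDelta \mathbf {p}_n\) to each sampling location before reading the input:

```python
import numpy as np

# Regular 3x3 sampling grid R, as defined above
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def standard_conv_at(x, w, p0):
    """Eq. (1): y(p0) = sum over p_n in R of w(p_n) * x(p0 + p_n)."""
    y = 0.0
    for (dy, dx), wn in zip(R, w.ravel()):
        y += wn * x[p0[0] + dy, p0[1] + dx]
    return y

x = np.arange(25, dtype=float).reshape(5, 5)  # toy input feature map
w = np.ones((3, 3)) / 9.0                     # toy averaging kernel
print(standard_conv_at(x, w, (2, 2)))         # the 3x3 mean around (2, 2) -> 12.0
```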

As the offset \(\varDelta \mathbf {p}_n\) is typically fractional, Eq. (2) is implemented via bilinear interpolation as:

$$\begin{aligned} \mathbf {x}(\mathbf {p})=\sum _{\mathbf {q}}\max (0,1-|q_x-p_x|)\cdot \max (0,1-|q_y-p_y|)\cdot \mathbf {x}(\mathbf {q}), \end{aligned}$$
(3)

where \(\mathbf {p}\) denotes an arbitrary fractional location on the input feature map, \(\mathbf {q}\) enumerates all integer locations on the input feature map, and \(p_x\) (\(p_y\)) denotes the x-coordinate (y-coordinate) of \(\mathbf {p}\). Equation (3) is cheap to compute, as it only involves the four nearest integer coordinates \(\mathbf {q}_i, i=1,\dots ,4\) of \(\mathbf {p}\). Equation (3) is also equivalent to:

$$\begin{aligned} \mathbf {x}(\mathbf {p})=\sum _{i=1}^{4}\mathbf {x}(\mathbf {q}_i)\cdot S_i, \end{aligned}$$
(4)

where \(S_i, i=1,\dots ,4\) is the interpolation weight assigned to \(\mathbf {q}_i\), given by the area of the rectangle formed by \(\mathbf {p}\) and the integer corner diagonally opposite \(\mathbf {q}_i\), as illustrated in Fig. 1.
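The bilinear interpolation of Eqs. (3)–(4) can be sketched as follows (a minimal NumPy illustration, not the paper's implementation); the weight of each neighbor \(\mathbf {q}_i\) is exactly the \(\max (0,1-|\cdot |)\) product in Eq. (3):

```python
import numpy as np

def bilinear_sample(x, p):
    """Eq. (3): interpolate x at a fractional location p = (py, px)."""
    py, px = p
    val = 0.0
    # Only the four integer neighbors of p contribute (Eq. (4)).
    for qy in (int(np.floor(py)), int(np.floor(py)) + 1):
        for qx in (int(np.floor(px)), int(np.floor(px)) + 1):
            S = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
            val += S * x[qy, qx]
    return val

x = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(bilinear_sample(x, (0.5, 0.5)))  # center of the four values -> 1.5
```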

The detailed procedure of deformable convolution is described in Fig. 1. First, we apply an additional classic convolution with a tanh activation to learn an offset field from the input feature map, normalized to \([-1,1]\). The offset field has the same height and width as the input feature map, while its number of channels is \(2N\) \((N=|\mathcal {R}|)\). The offset field is then multiplied by a parameter s (which adjusts the scope of the receptive field) and added to the regular grid \(\mathcal {R}\) to obtain the sampling locations (each coordinate of the offset field carries N pairs of values corresponding to the regular grid \(\mathcal {R}\)). Finally, the values at the irregular sampling coordinates are computed via bilinear interpolation, and the original convolution kernel then samples the deformed feature map to produce the new feature map. In this work, the deformable kernel operates in the same way across channels, rather than learning a separate set of offsets per channel, which improves learning efficiency. Deformable convolution samples the input feature map in a local and dense manner and adapts its localization to objects with different shapes [15], which is exactly what we need in SCD RBC semantic segmentation.
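The offset-learning step above can be sketched in terms of tensor shapes as follows (NumPy only; the tanh-activated offset-predicting convolution is replaced by a random tanh-normalized field for illustration, and s = 2 is a hypothetical scope value):

```python
import numpy as np

H, W, N, s = 8, 8, 9, 2.0  # feature map size, N = |R| = 9 for a 3x3 kernel, scope s

# Regular grid R for a 3x3 kernel, shape (N, 2)
R = np.array([(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)], dtype=float)

# Stand-in for the tanh-activated offset-predicting convolution:
# a 2N-channel field normalized to [-1, 1], shared across input channels.
rng = np.random.default_rng(0)
offset_field = np.tanh(rng.standard_normal((H, W, 2 * N)))

# Scale by s and add the regular grid to obtain the sampling locations:
# every spatial position gets N (y, x) pairs, one per grid point of R.
offsets = s * offset_field.reshape(H, W, N, 2)
base = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), axis=-1)
sampling_locations = base[:, :, None, :] + R[None, None, :, :] + offsets

print(sampling_locations.shape)  # (8, 8, 9, 2)
```

Each of the resulting fractional locations would then be read off the input via the bilinear interpolation of Eq. (3).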

Fig. 1.
figure 1

Illustration of the deformable convolution, showing how the fixed sampling locations are adaptively deformed.

Fig. 2.
figure 2

Architecture of the deformable U-Net in this work.

The main architecture of the deformable U-Net is shown in Fig. 2. It consists of two parts: an encoder path and a decoder path. In the encoder path, each layer has two \(3\times 3\) deformable convolutions, which double the number of channels, followed by a \(2\times 2\) max pooling operation with stride 2 that halves the resolution of the feature map for down-sampling. The encoder is followed by two \(3\times 3\) deformable convolutions called the bottom layers. Each step in the decoder path contains a \(3\times 3\) deconvolution with stride 2 followed by two \(3\times 3\) deformable convolutions. The skip connections between encoder and decoder help preserve contextual information for better localization [13]. The proposed deformable U-Net can be trained end-to-end (from the input image to the label map) through back-propagation in the same way as the U-Net architecture.
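The feature-map sizes implied by this description can be traced with a small helper (a sketch assuming a 512×512 input, a hypothetical base channel count of 64, and 4 encoder levels; the paper's figure, not the text, fixes the exact depths):

```python
def unet_shapes(size=512, base_ch=64, levels=4):
    """Trace (stage, resolution, channels) through the encoder and decoder."""
    shapes = []
    ch = base_ch
    # Encoder: two 3x3 deformable convs (channels double), then 2x2 max pool.
    for _ in range(levels):
        shapes.append(("enc", size, ch))
        size, ch = size // 2, ch * 2
    shapes.append(("bottom", size, ch))  # two 3x3 deformable convs
    # Decoder: 3x3 deconv with stride 2 (resolution doubles, channels halve),
    # then two 3x3 deformable convs fused with the skip connection.
    for _ in range(levels):
        size, ch = size * 2, ch // 2
        shapes.append(("dec", size, ch))
    return shapes

for stage, size, ch in unet_shapes():
    print(f"{stage}: {size}x{size}, {ch} channels")
```

Note how the decoder exactly mirrors the encoder, so each skip connection joins feature maps of matching resolution.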

3 Results

In this section, to evaluate the performance of the proposed deformable U-Net on RBC semantic segmentation for SCD, we perform experiments from two different aspects: (1) single-class RBC semantic segmentation, which aims to differentiate cells from the background, and (2) multi-class RBC semantic segmentation, which aims to differentiate the various sub-types of SCD RBCs. The experimental data and implementation details are presented below.

Data and Implementation Details. We use the latest public SCD RBC image dataset from MIT [16], taking 266 raw microscopy images of 4 different SCD patients as our experimental data. The original blood samples were collected at UPMC (University of Pittsburgh Medical Center) and MGH (Massachusetts General Hospital). In the dataset, raw microscopy images were acquired with a Zeiss inverted Axiovert 200 microscope under a 63\(\times \) oil objective lens using an industrial camera (Sony Exmor CMOS color sensor, 1080p resolution); the image resolution is \(1920\times 1080\). Additionally, RBC areas and RBC categories are manually annotated as ground truth by the data provider. Following the coarse RBC labeling strategy of previous work [16], three SCD RBC categories are employed in our experiments: (1) Dic+Ovl, (2) El+Sk, and (3) others. During implementation, we first pre-process the collected raw images by removing the two-side margins and resizing them to the same size of \(512\times 512\). The network is implemented in TensorFlow 1.2.1; we use ReLU activations and batch normalization for the convolution operations, and set the scope parameter s to 2. Furthermore, we train with the Adam optimizer using learning rate \(10^{-3}\), weight decay \(10^{-8}\), batch size 2, and 30000 epochs. Our code can be accessed on GitHub (footnote 1).
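The pre-processing step can be sketched as below (a hedged NumPy illustration: the exact margin widths are not specified in the text, so a hypothetical symmetric crop to a centered square is used, followed by nearest-neighbor resizing to 512×512):

```python
import numpy as np

def preprocess(img, out_size=512):
    """Crop two-side margins to a centered square, then resize (nearest-neighbor)."""
    h, w = img.shape[:2]
    margin = (w - h) // 2                  # assumes landscape frames, e.g. 1920x1080
    square = img[:, margin:margin + h]     # centered h x h crop
    idx = (np.arange(out_size) * h / out_size).astype(int)
    return square[np.ix_(idx, idx)]        # subsample rows and columns

raw = np.zeros((1080, 1920), dtype=np.uint8)  # toy frame at the dataset resolution
print(preprocess(raw).shape)  # (512, 512)
```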

Evaluation of Single-Class RBC Semantic Segmentation Performance. To demonstrate the performance of our method on single-class RBC semantic segmentation, we compare the proposed method with the prevalent U-Net and region growing methods. The preliminary SCD RBC dataset is divided into two parts: 166 random samples for training and the remaining 100 samples for testing. As can be seen from Fig. 3, the proposed method significantly improves over U-Net in SCD RBC semantic segmentation. First, it can effectively separate touching RBCs, as shown in A and D (yellow circles) of Fig. 3. Second, for SCD RBCs with heterogeneous shapes, the deformable U-Net obtains more accurate results than the other two methods; see C and D (blue circles) of Fig. 3. Third, for RBCs with blurred boundaries, our method achieves clearly higher accuracy; see the purple circles in C of Fig. 3. Moreover, Fig. 3 indicates that the proposed method generalizes better to shaded cells at the image edge. Furthermore, the deformable U-Net can effectively avoid the disturbance of various noise sources (e.g. dirt, halos, etc.) during the RBC semantic segmentation procedure. To quantify the overall performance of our method, three main indices are calculated; see Table 1. The proposed network outperforms the other two approaches in terms of accuracy, precision, and F1 score.
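The three indices in Table 1 can be computed pixel-wise as follows (a minimal NumPy sketch on toy binary masks, not the actual evaluation script):

```python
import numpy as np

def binary_scores(pred, gt):
    """Pixel-wise accuracy, precision, and F1 for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # cell pixels correctly predicted
    fp = np.sum(pred & ~gt)     # background predicted as cell
    fn = np.sum(~pred & gt)     # cell predicted as background
    tn = np.sum(~pred & ~gt)    # background correctly predicted
    accuracy = (tp + tn) / pred.size
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, f1

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
print(binary_scores(pred, gt))  # (0.75, 0.5, 0.666...)
```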

Fig. 3.
figure 3

Representative comparisons of different RBC semantic segmentation methods. (a) patch images, (b) region growing results, (c) U-Net results, (d) proposed method, (e) ground truth.

Table 1. Quantitative performance analysis of different methods in single-class SCD RBC segmentation.

Evaluation of Multi-class RBC Semantic Segmentation Performance. In addition to the single-class segmentation evaluation, we also conduct an experiment on multi-class RBC semantic segmentation for SCD based on the same dataset division scheme as above. The corresponding segmentation results are shown in Fig. 4, where different colors indicate different RBC types: red (Dic+Ovl), blue (El+Sk), and green (others). Specifically, the proposed deformable U-Net is better than the standard U-Net at predicting an intact RBC without any shape prior: in the U-Net prediction, certain cells are segmented out yet assigned two classes simultaneously (the yellow square region in Fig. 4). Additionally, the deformable U-Net is more robust to the background noise present in the microscopy images: the baseline U-Net predicts background objects as RBCs in the blue square of Fig. 4, while the deformable U-Net predicts the correct negative label. Furthermore, we perform a quantitative analysis of our trained model using three metrics: loss, accuracy, and mean IoU (Intersection over Union). The evaluation results in Table 2 indicate that the deformable U-Net achieves superior performance to the standard U-Net.
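Mean IoU over the segmentation classes can be computed as follows (a NumPy sketch with toy label maps; the use of class indices 0–3 for background and the three RBC types is an assumption for illustration):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Average per-class Intersection over Union between two label maps."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.sum(p | g)
        if union == 0:           # class absent from both maps: skip it
            continue
        ious.append(np.sum(p & g) / union)
    return float(np.mean(ious))

pred = np.array([[0, 1], [2, 2]])
gt   = np.array([[0, 1], [2, 3]])
print(mean_iou(pred, gt, num_classes=4))  # (1 + 1 + 0.5 + 0) / 4 = 0.625
```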

Fig. 4.
figure 4

Comparisons of multi-class RBC semantic segmentation results. (a) original images, (b) U-Net results, (c) proposed method results, (d) ground truth.

Table 2. Quantitative performance analysis of different methods in multi-class SCD RBC segmentation.

4 Conclusion

In this work, we present an improved U-Net framework (deformable U-Net) for automated SCD RBC semantic segmentation. Experimental results demonstrate that the proposed approach clearly outperforms the baseline U-Net, especially on the key problems of background noise discrimination, segmentation of RBCs with heterogeneous shapes, separation of touching RBCs, and segmentation of blurred RBCs. Moreover, it performs consistently in predicting cell boundaries.